Main Memory¶

Memory Map¶

The Mach-V processor's main memory consists of Instruction Memory (IROM) and Data Memory (DMEM).

IROM starts at address 0x00400000.
DMEM starts at address 0x10010000.

And their memory map is as follows:

Address Range	Name	Permissions	Description
`0x00400000 – 0x00407FFF`	IROM (Instruction Memory)	RO (Read-Only)	Capacity: 8,192 words (32 KB). Based on `IROM_DEPTH_BITS = 15`.
`0x10010000 – 0x10013FFF`	DMEM (Data Memory)	RW (Read-Write)	Capacity: 4,096 words (16 KB). Used for storing constants and variables. Based on `DMEM_DEPTH_BITS = 14`.

Addressing Constraints

Accesses must be aligned to 4-byte boundaries.

Memory Implementation¶

In the earlier design of the Mach-V processor (e.g., Mach-V V1 and V2), both IROM and DMEM were implemented using the distributed memory resources available in the FPGA. However, starting from Mach-V V3, I transitioned to using Block RAM (BRAM) for both IROM and DMEM to enhance performance and resource efficiency.

Block RAM¶

In Nexys 4 DDR FPGA, there are a certain number of Block RAM (BRAM) resources available. These Block RAMs are read synchronously. The easiest way to implement a Block RAM in Verilog is to use explicitly the synchronous read style.

One important characteristic of Block RAM (BRAM) is that its read operation is synchronous and therefore incurs a one-clock-cycle latency. In a pipelined processor, this has direct implications for the instruction fetch stage.

BRAM Explanation — Block RAM Read Operation Timing

During the Fetch stage, the program counter PCF is presented to the instruction memory (IROM). However, if the IROM is implemented using BRAM, the memory does not produce valid read data in the same cycle. Instead, the read is initiated at the next edge clock edge (the rising edge of the second clock cycle), and the instruction word becomes available only after the next clock edge. Consequently, the fetched instruction InstrF is not available until the end of the following clock cycle, which corresponds to the end of the Decode stage.

Block RAM vs. Normal Synchronous RAM

Unlike the Block RAM, in a synchronous RAM, the working principle is shown below:

Synchronous RAM Read Timing — Synchronous RAM Read Operation Timing

In a normal synchronous RAM implemented using flip-flops and LUTs, the data read is available at the next clock edge after both the RAM_En is high and the address is presented to the memory. So, the one cycle delay for reading operation basically means that the data is available at the next clock edge.

Single-Port Block RAM¶

An example of Single-Port Block RAM (Read First) implementation is shown below:

// Single-port synchronous RAM with read-first behavior
module rams_sp_rf (
    clk,    // Clock input
    en,     // Memory enable
    we,     // Write enable
    addr,   // Memory address
    di,     // Data input (write data)
    dout    // Data output (read data)
);

    input        clk;
    input        we;
    input        en;
    input  [9:0] addr;   // 10-bit address: 1024 words
    input  [15:0] di;    // 16-bit data input
    output [15:0] dout;  // 16-bit data output

    reg [15:0] RAM  [1023:0]; // 1024 × 16-bit memory array
    reg [15:0] dout;         // Registered read data

    // Synchronous read/write operation
    always @(posedge clk) begin
        if (en) begin
            if (we)
                RAM[addr] <= di; // Write data on write enable
            dout <= RAM[addr];   // Read data (read-first behavior)
        end
    end

endmodule

In this example (from the official AMD documentation), we implement a single-port Block RAM (Read First).

Single Port: There is only one port for both read and write operations.
Read First: When a read and write operation occur simultaneously at the same address, the data read is the old data before the write.

True Dual-Port Block RAM¶

This is useful for the superscalar structure implementation of Mach-V and it is also very simple to make the dual-port version Block BRAM.

always @(negedge CLK) begin
    if (is_imem) begin
        // Synchronous Read: Data available on NEXT clock edge
        rv_instr_reg_1 <= imem[rv_pc[IMEM_DEPTH-1:2]];
        rv_instr_reg_2 <= imem[rv_pc_2[IMEM_DEPTH-1:2]];
    end
    else begin
        // Return NOP if address is invalid
        rv_instr_reg_1 <= 32'h00000013;
        rv_instr_reg_2 <= 32'h00000013;
    end
end

Distributed RAM¶

In contrast, distributed memory resources in the FPGA uses asynchronous reads, allowing instruction fetches to complete within the same clock cycle. The implementation of the RAM using distributed memory would look like this:

// Single-port asynchronous RAM
module rams_sp_async (
    clk,    // Clock input
    we,     // Write enable
    addr,   // Memory address
    di,     // Data input (write data)
    dout    // Data output (read data)
);

    input        clk;
    input        we;
    input  [9:0] addr;   // 10-bit address: 1024 words
    input  [15:0] di;    // 16-bit data input
    output [15:0] dout;  // 16-bit data output

    reg [15:0] RAM  [1023:0]; // 1024 × 16-bit memory array

    // Asynchronous read operation
    assign dout = RAM[addr];

    // Synchronous write operation
    always @(posedge clk) begin
        if (we)
            RAM[addr] <= di; // Write data on write enable
    end
endmodule

One disadvantage of using distributed memory is that it consumes more of the FPGA's lookup tables (LUTs) compared to Block RAM, which is a dedicated memory resource.

Distributed vs Block RAM

Both types write data synchronously into the RAM. Distributed RAM and dedicated block RAM differ primarily in how they read data. See the following table.

Action	Distributed RAM	Dedicated Block RAM
Write	Synchronous	Synchronous
Read	Asynchronous	Synchronous