Heterogeneous Tile

The nu+ project provides a heterogeneous tile integrated into the NoC, meant to be extended by the user. Such a tile provides a first example of how to integrate a custom module into the network-on-chip through a dedicated tile.

Memory Interface

The Memory Interface provides a transparent way to interact with the coherence system. It implements a simple valid/available handshake per thread: different threads may issue different memory transactions, and these are handled concurrently by the coherence system.

When a thread has a memory request, it first checks the availability bit related to its ID; if this is high, the thread issues a memory transaction by setting the valid bit and loading all the needed information onto the Memory Interface (see the sketch after the interface below).

Supported memory operations are reported below along with their opcodes:

LOAD_8      = 'h0  - 'b000000    
LOAD_16     = 'h1  - 'b000001
LOAD_32     = 'h2  - 'b000010
LOAD_V_8    = 'h7  - 'b000111
LOAD_V_16   = 'h8  - 'b001000
LOAD_V_32   = 'h9  - 'b001001
STORE_8     = 'h20 - 'b100000
STORE_16    = 'h21 - 'b100001
STORE_32    = 'h22 - 'b100010
STORE_V_8   = 'h24 - 'b100100
STORE_V_16  = 'h25 - 'b100101
STORE_V_32  = 'h26 - 'b100110
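
For reference, the same encodings can be collected in a SystemVerilog enum when driving the req_out_op port; the block below is an illustrative rendering of the table above, not the literal NaplesPU type definition (mem_op_t is a hypothetical name):

typedef enum logic [7:0] {
    LOAD_8     = 8'h00,
    LOAD_16    = 8'h01,
    LOAD_32    = 8'h02,
    LOAD_V_8   = 8'h07,
    LOAD_V_16  = 8'h08,
    LOAD_V_32  = 8'h09,
    STORE_8    = 8'h20,
    STORE_16   = 8'h21,
    STORE_32   = 8'h22,
    STORE_V_8  = 8'h24,
    STORE_V_16 = 8'h25,
    STORE_V_32 = 8'h26
} mem_op_t;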

A custom core to be integrated in the nu+ system ought to implement the following interface in order to communicate with the memory system:

/* Memory Interface */
// To Heterogeneous LSU
output logic                                 req_out_valid,     // Valid signal for issued memory requests
output logic [31 : 0]                        req_out_id,        // ID of the issued request, mainly used for debugging
output logic [THREAD_IDX_W - 1 : 0]          req_out_thread_id, // Thread ID of the issued request. Requests running on different threads are dispatched to the CC concurrently
output logic [7 : 0]                         req_out_op,        // Operation performed
output logic [ADDRESS_WIDTH - 1 : 0]         req_out_address,   // Issued request address
output logic [DATA_WIDTH - 1    : 0]         req_out_data,      // Data output
// From Heterogeneous LSU
input  logic                                 resp_in_valid,      // Valid signal for the incoming responses
input  logic [31 : 0]                        resp_in_id,         // ID of the incoming response, mainly used for debugging
input  logic [THREAD_IDX_W - 1 : 0]          resp_in_thread_id,  // Thread ID of the incoming response
input  logic [7 : 0]                         resp_in_op,         // Operation code
input  logic [DATA_WIDTH - 1 : 0]            resp_in_cache_line, // Incoming data
input  logic [BYTES_PERLINE - 1 : 0]         resp_in_store_mask, // Bitmask of the positions of the requested bytes in the incoming data bus
input  logic [ADDRESS_WIDTH - 1 : 0]         resp_in_address,    // Incoming response address
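
Below is a minimal sketch of user logic driving this interface for a single thread, assuming the availability bit described above is exposed through the lsu_het_almost_full bitmap used by the provided dummy core (a high almost-full flag is taken to mean the LSU cannot accept requests for that thread); start_req, req_addr, thread_id, and loaded_data are hypothetical names:

// Sketch: issue a LOAD_8 for one thread and consume the matching response
always_ff @(posedge clk) begin
    req_out_valid <= 1'b0;
    if (start_req & ~lsu_het_almost_full[thread_id]) begin // LSU available for this thread
        req_out_valid     <= 1'b1;
        req_out_id        <= 32'd0;      // debug ID
        req_out_thread_id <= thread_id;
        req_out_op        <= 8'h00;      // LOAD_8
        req_out_address   <= req_addr;
        req_out_data      <= '0;         // unused on loads
    end
    if (resp_in_valid && (resp_in_thread_id == thread_id))
        loaded_data <= resp_in_cache_line; // response for this thread
end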

Synchronization Interface

The Synchronization Interface connects the user logic with the core-side synchronization module allocated within the tile (namely the barrier_core unit). Such an interface allows user logic to synchronize at thread granularity.

The synchronization mechanism supports inter- and intra-tile barrier synchronization. When a thread hits a synchronization point, it issues a request to the distributed synchronization master through the Synchronization Interface. The thread is then stalled (this is left to the user logic) until its release signal is raised again (see the sketch after the interface below).

A custom core may implement the following interface if synchronization is required:

/* Synchronization Interface */
// To Barrier Core
output logic                                 breq_valid,       // Hit barrier signal, sends a synchronization request
output logic [31 : 0]                        breq_op_id,       // Synchronization operation ID, mainly used for debugging 
output logic [THREAD_NUMB - 1 : 0]           breq_thread_id,   // ID of the thread performing the synchronization operation
output logic [31 : 0]                        breq_barrier_id,  // Barrier ID, has to be unique in case of concurrent barriers
output logic [31 : 0]                        breq_thread_numb, // Total number - 1 of threads synchronizing on the current barrier ID
// From Barrier Core
input  logic [THREAD_NUMB - 1 : 0]           bc_release_val // Bitmask of stalled threads waiting for release (a low i-th bit stalls the i-th thread)
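
As an illustration, a thread hitting a barrier might drive this interface as in the sketch below; barrier_hit, my_tid, and SYNC_THREADS are hypothetical names, and the barrier ID is arbitrary:

// Sketch: a thread hits a barrier and stalls until its release bit rises
always_ff @(posedge clk) begin
    breq_valid <= 1'b0;
    if (barrier_hit) begin
        breq_valid       <= 1'b1;
        breq_op_id       <= 32'd0;            // debug ID
        breq_thread_id   <= my_tid;
        breq_barrier_id  <= 32'd7;            // must be unique among concurrent barriers
        breq_thread_numb <= SYNC_THREADS - 1; // total number - 1 of synchronizing threads
    end
end

assign thread_stalled = ~bc_release_val[my_tid]; // resume only when the release bit is high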


Provided Heterogeneous Dummy Core

The tile comes with a dummy core, a simple FSM showing how to use both interfaces. The FSM first synchronizes with the other heterogeneous tiles (ht) in the NoC. Each dummy core in a ht tile requires synchronization for LOCAL_BARRIER_NUMB threads (default = 4).

// Issue synchronization requests
SEND_BARRIER : begin
  breq_valid      <= 1'b1;
  breq_barrier_id <= 42;
  barrier_served  <= 1'b1;
  if (rem_barriers == 1)
    next_state <= WAIT_SYNCH;
  else
    next_state <= IDLE;
end

The SEND_BARRIER state sends LOCAL_BARRIER_NUMB requests with barrier ID 42 through the Synchronization Interface. It sets the total number of threads synchronizing on barrier ID 42 equal to TOTAL_BARRIER_NUMB (= LOCAL_BARRIER_NUMB x `TILE_HT, where `TILE_HT is the number of heterogeneous tiles in the system). When the last barrier request is issued, SEND_BARRIER jumps to WAIT_SYNCH, waiting for the ACK from the synchronization master.
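Following the formula above, the population driven on the interface could be derived as in this sketch (in the style of the excerpts on this page):

localparam TOTAL_BARRIER_NUMB = LOCAL_BARRIER_NUMB * `TILE_HT; // threads over all heterogeneous tiles

// Inside SEND_BARRIER:
breq_thread_numb <= TOTAL_BARRIER_NUMB - 1; // "total number - 1", per the interface definition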

// Synchronizes all dummy cores
WAIT_SYNCH : begin
  if (&bc_release_val)
    next_state <= IDLE;
end

At this point, all threads in each heterogeneous tile are synchronized, and the FSM starts all pending memory transactions.

The START_MEM_READ_TRANS state performs LOCAL_WRITE_REQS read operations (default = 128), issuing a LOAD_8 operation (opcode = 0) each time. In the default configuration, 128 LOAD_8 operations on consecutive addresses are spread among all threads and issued to the LSU through the Memory Interface. When the read operations are over, the FSM starts write operations in a similar way.
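In the style of the excerpts above, the read state might look like the following sketch; rem_reads, base_addr, and thread_id are hypothetical names, not the literal dummy-core source:

// Issue one LOAD_8 per available thread on consecutive addresses
START_MEM_READ_TRANS : begin
  if (~lsu_het_almost_full[thread_id]) begin
    req_out_valid     <= 1'b1;
    req_out_thread_id <= thread_id;
    req_out_op        <= 8'h00; // LOAD_8
    req_out_address   <= base_addr + (LOCAL_WRITE_REQS - rem_reads);
    rem_reads         <= rem_reads - 1;
    if (rem_reads == 1)
      next_state <= START_MEM_WRITE_TRANS; // move on to write operations
  end
end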

The START_MEM_WRITE_TRANS state performs LOCAL_WRITE_REQS write operations (default = 128) on consecutive addresses through the Memory Interface. This time the operation performed is a STORE_8, and all heterogeneous tiles issue the same store operation on the same addresses, competing for ownership in a transparent way. Coherence is entirely handled by the LSU and the CC; on the core side, the lsu_het_almost_full bitmap states the availability of the LSU for each thread (for both writes and reads).

In both states, a thread first checks the availability bit stored in the position corresponding to its ID (lsu_het_almost_full[thread_id]), then performs the memory transaction.