Difference between revisions of "Heterogeneous Tile"
Line 109: | Line 109: | ||
else | else | ||
next_state <= DONE; | next_state <= DONE; | ||
− | + | if(lsu_het_almost_full[thread_id_write] == 1'b0 ) begin | |
− | + | write_served <= 1'b1; | |
− | + | req_out_valid <= 1'b1; | |
− | + | req_out_id <= rem_writes; | |
− | + | req_out_thread_id <= thread_id_write; | |
− | + | req_out_op <= 'b100000; // STORE_8 | |
− | + | tmp_data_out[0] <= 8'hee; | |
− | + | incr_address <= 1'b1; | |
+ | end | ||
end | end | ||
In both states, a thread first checks the availability stored in a position equal to its ID (lsu_het_almost_full[thread_id]), then performs a memory transaction. | In both states, a thread first checks the availability stored in a position equal to its ID (lsu_het_almost_full[thread_id]), then performs a memory transaction. |
Revision as of 18:08, 13 May 2019
The nu+ project provides a heterogeneous tile integrated into the NoC, meant to be extended by the user. Such a tile provides a first example of how to integrate a custom module in the network-on-chip with a dedicated tile.
Memory Interface
The Memory Interface provides a transparent way to interact with the coherence system. The memory interface implements a simple valid/available handshake per thread, a different thread might issue different memory transaction and those are concurrently handled by the coherence system.
When a thread has a memory request, it first checks the availability bit related to its ID, if this is high the thread issues a memory transaction setting the valid bit and loading all the needed information on the Memory Interface.
Supported memory operations are reported below along with their opcodes:
LOAD_8 = 'h0 - 'b000000 LOAD_16 = 'h1 - 'b000001 LOAD_32 = 'h2 - 'b000010 LOAD_V_8 = 'h7 - 'b000111 LOAD_V_16 = 'h8 - 'b001000 LOAD_V_32 = 'h9 - 'b001001 STORE_8 = 'h20 - 'b100000 STORE_16 = 'h21 - 'b100001 STORE_32 = 'h22 - 'b100010 STORE_V_8 = 'h24 - 'b100100 STORE_V_16 = 'h25 - 'b100101 STORE_V_32 = 'h26 - 'b100110
A custom core to be integrate d in the nu+ system ought to implement the following interface in order to communicate with the memory system:
/* Memory Interface */ // To Heterogeneous LSU output logic req_out_valid, // Valid signal for issued memory requests output logic [31 : 0] req_out_id, // ID of the issued request, mainly used for debugging output logic [THREAD_IDX_W - 1 : 0] req_out_thread_id, // Thread ID of issued request. Requests running on different threads are dispatched to the CC conccurrently output logic [7 : 0] req_out_op, // Operation performed output logic [ADDRESS_WIDTH - 1 : 0] req_out_address, // Issued request address output logic [DATA_WIDTH - 1 : 0] req_out_data, // Data output // From Heterogeneous LSU input logic resp_in_valid, // Valid signal for the incoming responses input logic [31 : 0] resp_in_id, // ID of the incoming response, mainly used for debugging input logic [THREAD_IDX_W - 1 : 0] resp_in_thread_id, // Thread ID of the incoming response input logic [7 : 0] resp_in_op, // Operation code input logic [DATA_WIDTH - 1 : 0] resp_in_cache_line, // Incoming data input logic [BYTES_PERLINE - 1 : 0] resp_in_store_mask, // Bitmask of the position of the requesting bytes in the incoming data bus input logic [ADDRESS_WIDTH - 1 : 0] resp_in_address, // Incoming response address
Synchronization Interface
The Synchronization Interface connects the user logic with the synchronization module core-side allocated within the tile (namely the barrier_core unit). Such an interface allows user logic to synchronize on a thread grain.
The synchronization mechanism supports inter- and intra- tile barrier synchronization. When a thread hits a synchronization point, it issues a request to the distributed synchronization master through the Synchronization Interface. Then, the thread is stalled (up to the user logic) till its release signal is high again.
A custom core may implement the following interface if synchronization is requeried:
/* Synchronization Interface */ // To Barrier Core output logic breq_valid, // Hit barrier signal, sends a synchronization request output logic [31 : 0] breq_op_id, // Synchronization operation ID, mainly used for debugging output logic [THREAD_NUMB - 1 : 0] breq_thread_id, // ID of the thread perfoming the synchronization operation output logic [31 : 0] breq_barrier_id, // Barrier ID, has to be unique in case of concurrent barriers output logic [31 : 0] breq_thread_numb, // Total number - 1 of synchronizing threads on the current barrier ID // From Barrier Core input logic [THREAD_NUMB - 1 : 0] bc_release_val // Stalled threads bitmask waiting for release (the i-th bit low stalls the i-th thread)
Heterogeneous Dummy provided
This FSM first synchronizes with other ht in the NoC. Each dummy core in a ht tile requires a synchronization for LOCAL_BARRIER_NUMB threads (default = 4).
// Issue synchronization requests SEND_BARRIER : begin breq_valid <= 1'b1; breq_barrier_id <= 42; barrier_served <= 1'b1; if(rem_barriers == 1) next_state <= WAIT_SYNCH; else next_state <= IDLE; end
The SEND_BARRIER state sends LOCAL_BARRIER_NUMB requests with barrier ID 42 through the Synchronization interface. It sets the total number of threads synchronizing on the barrier ID 42 equal to TOTAL_BARRIER_NUMB (=LOCAL_BARRIER_NUMB x `TILE_HT, number of heterogeneous tile in the system). When the last barrier is issued, SEND_BARRIER jumps to WAIT_SYNCH waiting for the ACK from the synchronization master.
// Synchronizes all dummy cores WAIT_SYNCH : begin if(&bc_release_val) next_state <= IDLE; end
At this point, all threads in each ht tile are synchronized, and the FSM starts all pending memory transactions.
The START_MEM_READ_TRANS performs LOCAL_WRITE_REQS read operations (default = 128), performing a LOAD_8 operation (op code = 0) each time. In the default configuration, 128 LOAD_8 operations on consecutive addresses are spread among all threads and issued to the LSU through the Memory interface. When read operations are over, the FSM starts writing operations in a similar way.
// Starting multiple read operations START_MEM_READ_TRANS : begin if ( rem_reads == 1 ) next_state <= DONE; else next_state <= IDLE; if(lsu_het_almost_full[thread_id_read] == 1'b0) begin read_served <= 1'b1; req_out_valid <= 1'b1; req_out_id <= rem_reads; req_out_op <= 0; // LOAD_8 incr_address <= 1'b1; req_out_thread_id <= thread_id_read; end end
The START_MEM_WRITE_TRANS performs LOCAL_WRITE_REQS (default = 128) write operations on consecutive addresses through the Memory interface. This time the operation performed is a STORE_8, and all ht tile are issuing the same store operation on the same addresses competing for the ownership in a transparent way. The coherence is totally handled by the LSU and CC, on the core side lsu_het_almost_full bitmap states the availability of the LSU for each thread (both writing and reading).
// Starting multiple write operations START_MEM_WRITE_TRANS : begin if ( pending_writes ) next_state <= IDLE; else next_state <= DONE; if(lsu_het_almost_full[thread_id_write] == 1'b0 ) begin write_served <= 1'b1; req_out_valid <= 1'b1; req_out_id <= rem_writes; req_out_thread_id <= thread_id_write; req_out_op <= 'b100000; // STORE_8 tmp_data_out[0] <= 8'hee; incr_address <= 1'b1; end end
In both states, a thread first checks the availability stored in a position equal to its ID (lsu_het_almost_full[thread_id]), then performs a memory transaction.