Difference between revisions of "Heterogeneous Tile"
Revision as of 18:17, 13 May 2019
The nu+ project provides a heterogeneous tile integrated into the NoC, meant to be extended by the user. Such a tile provides a first example of how to integrate a custom module in the network-on-chip with a dedicated tile.
Memory Interface
The Memory Interface provides a transparent way to interact with the coherence system. It implements a simple valid/available handshake per thread: different threads may issue different memory transactions, and the coherence system handles them concurrently.
When a thread has a memory request, it first checks the availability bit associated with its ID. If the bit signals availability, the thread issues a memory transaction by setting the valid bit and driving all the required information on the Memory Interface.
Supported memory operations are reported below along with their opcodes:
LOAD_8     = 'h0  - 'b000000
LOAD_16    = 'h1  - 'b000001
LOAD_32    = 'h2  - 'b000010
LOAD_V_8   = 'h7  - 'b000111
LOAD_V_16  = 'h8  - 'b001000
LOAD_V_32  = 'h9  - 'b001001
STORE_8    = 'h20 - 'b100000
STORE_16   = 'h21 - 'b100001
STORE_32   = 'h22 - 'b100010
STORE_V_8  = 'h24 - 'b100100
STORE_V_16 = 'h25 - 'b100101
STORE_V_32 = 'h26 - 'b100110
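The opcode list above could be captured in a small SystemVerilog package, as in the following sketch. The package name `het_mem_ops_pkg`, the type name `het_mem_op_t`, and the helper `is_store` are hypothetical and not part of the nu+ sources; the 8-bit width is an assumption chosen to match the width of `req_out_op`. The helper relies only on what the listed encodings show: every store opcode has bit 5 set and every load opcode has it clear.

```systemverilog
// Hypothetical package; the opcode values are copied from the list above.
package het_mem_ops_pkg;

    // 8-bit encoding, assumed here to match the width of req_out_op.
    typedef enum logic [7 : 0] {
        LOAD_8     = 8'h00,
        LOAD_16    = 8'h01,
        LOAD_32    = 8'h02,
        LOAD_V_8   = 8'h07,
        LOAD_V_16  = 8'h08,
        LOAD_V_32  = 8'h09,
        STORE_8    = 8'h20,
        STORE_16   = 8'h21,
        STORE_32   = 8'h22,
        STORE_V_8  = 8'h24,
        STORE_V_16 = 8'h25,
        STORE_V_32 = 8'h26
    } het_mem_op_t;

    // In the listed encodings, bit 5 distinguishes stores (1) from loads (0).
    function automatic logic is_store(input logic [7 : 0] op);
        return op[5];
    endfunction

endpackage
```

A custom core could then import the package and drive `req_out_op` with the symbolic names instead of raw literals.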
A custom core to be integrated in the nu+ system ought to implement the following interface in order to communicate with the memory system:
/* Memory Interface */

// To Heterogeneous LSU
output logic                          req_out_valid,      // Valid signal for issued memory requests
output logic [31 : 0]                 req_out_id,         // ID of the issued request, mainly used for debugging
output logic [THREAD_IDX_W - 1 : 0]   req_out_thread_id,  // Thread ID of issued request. Requests running on different threads are dispatched to the CC concurrently
output logic [7 : 0]                  req_out_op,         // Operation performed
output logic [ADDRESS_WIDTH - 1 : 0]  req_out_address,    // Issued request address
output logic [DATA_WIDTH - 1 : 0]     req_out_data,       // Data output

// From Heterogeneous LSU
input  logic                          resp_in_valid,      // Valid signal for the incoming responses
input  logic [31 : 0]                 resp_in_id,         // ID of the incoming response, mainly used for debugging
input  logic [THREAD_IDX_W - 1 : 0]   resp_in_thread_id,  // Thread ID of the incoming response
input  logic [7 : 0]                  resp_in_op,         // Operation code
input  logic [DATA_WIDTH - 1 : 0]     resp_in_cache_line, // Incoming data
input  logic [BYTES_PERLINE - 1 : 0]  resp_in_store_mask, // Bitmask of the position of the requesting bytes in the incoming data bus
input  logic [ADDRESS_WIDTH - 1 : 0]  resp_in_address,    // Incoming response address
The Heterogeneous Tile uses the same LSU and CC as a NUPLUS tile. Consequently, the LSU forwards its backpressure signals on the Memory Interface, as follows:
// From Heterogeneous accelerator - Backpressure signals
input  logic [THREAD_NUMB - 1 : 0]    lsu_het_almost_full,           // Thread bitmask, if i-th bit is high, i-th thread cannot issue requests.
input  logic [THREAD_NUMB - 1 : 0]    lsu_het_no_load_store_pending, // Thread bitmask, if i-th bit is low, i-th thread has no pending operations.
In particular, lsu_het_almost_full i-th bit has to be low before issuing a memory request for i-th thread.
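The backpressure rule can be sketched as a small combinational gate. This is only an illustration under stated assumptions: the module name `het_req_gate`, the input `want_req`, and the default `THREAD_NUMB = 8` are hypothetical and not taken from the nu+ sources; only `lsu_het_almost_full` and `req_out_valid` come from the interface above.

```systemverilog
// Hypothetical helper: gates the request of one thread with its backpressure bit.
module het_req_gate #(
    parameter int THREAD_NUMB  = 8,                   // assumed value
    parameter int THREAD_IDX_W = $clog2(THREAD_NUMB)
) (
    input  logic [THREAD_NUMB - 1 : 0]  lsu_het_almost_full, // backpressure bitmask from the LSU
    input  logic                        want_req,            // hypothetical: user logic has a request ready
    input  logic [THREAD_IDX_W - 1 : 0] thread_id,           // thread that wants to issue the request
    output logic                        req_out_valid        // request may be driven in this cycle
);

    // A request is issued only while the bit of the requesting thread is low.
    assign req_out_valid = want_req & ~lsu_het_almost_full[thread_id];

endmodule
```

Keeping the check combinational means a request is withheld in the same cycle in which the LSU raises the bit for that thread.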
The Memory Interface provides performance counters as part of its interface:
// From Heterogeneous LSU - Performance counters
input  logic                          resp_in_miss,  // LSU miss on resp_in_address
input  logic                          resp_in_evict, // LSU eviction (replacement) on resp_in_address
input  logic                          resp_in_flush, // LSU flush on resp_in_address
input  logic                          resp_in_dinv,  // LSU data cache invalidation on resp_in_address
These signals indicate when the L1 data cache incurs a miss, an eviction (or replacement), a flush, or a data cache invalidation.
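User logic could accumulate these events in local counters, as in the sketch below. It assumes that each signal is asserted for exactly one clock cycle per event, which the source does not state. The module name, the clock and reset ports, and the counter outputs are hypothetical.

```systemverilog
// Hypothetical event counters driven by the performance signals of the Memory Interface.
module het_perf_counters #(
    parameter int COUNTER_W = 32          // assumed counter width
) (
    input  logic                     clk,
    input  logic                     reset,        // hypothetical synchronous, active-high reset
    input  logic                     resp_in_miss,
    input  logic                     resp_in_evict,
    input  logic                     resp_in_flush,
    input  logic                     resp_in_dinv,
    output logic [COUNTER_W - 1 : 0] miss_count,
    output logic [COUNTER_W - 1 : 0] evict_count,
    output logic [COUNTER_W - 1 : 0] flush_count,
    output logic [COUNTER_W - 1 : 0] dinv_count
);

    always_ff @(posedge clk) begin
        if (reset) begin
            miss_count  <= '0;
            evict_count <= '0;
            flush_count <= '0;
            dinv_count  <= '0;
        end else begin
            // Each counter advances by one in every cycle in which its event is signalled.
            if (resp_in_miss)  miss_count  <= miss_count  + 1'b1;
            if (resp_in_evict) evict_count <= evict_count + 1'b1;
            if (resp_in_flush) flush_count <= flush_count + 1'b1;
            if (resp_in_dinv)  dinv_count  <= dinv_count  + 1'b1;
        end
    end

endmodule
```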
The LSU in the Heterogeneous Tile can be configured in two different modalities, namely Write-Through and Write-Back:
output logic lsu_het_ctrl_cache_wt, // Enable Write-Through cache configuration.
When lsu_het_ctrl_cache_wt is high, the LSU acts as a Write-Through cache; when it is low, the LSU implements a Write-Back mechanism.
Finally, the Memory Interface provides error signals that are raised when the address of an issued request is misaligned:
// Heterogeneous accelerator - Flush and Error signals
input  logic                          lsu_het_error_valid,     // Error coming from LSU
input  register_t                     lsu_het_error_id,        // Error ID - Misaligned = 380
input  logic [THREAD_IDX_W - 1 : 0]   lsu_het_error_thread_id, // Thread involved in the Error
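One possible way for user logic to react to these signals is to latch the first misaligned-access error together with the offending thread. The sketch below is illustrative only: the module name, the clock and reset ports, and the outputs are hypothetical, and `register_t` is assumed to be 32 bits wide since its definition is not shown here. The error ID 380 is the value reported in the interface comment above.

```systemverilog
// Hypothetical latch that records the first misaligned-access error reported by the LSU.
module het_error_latch #(
    parameter int THREAD_IDX_W = 3,       // assumed value
    parameter int ERROR_ID_W   = 32       // assumed width of register_t
) (
    input  logic                        clk,
    input  logic                        reset,                   // hypothetical synchronous, active-high reset
    input  logic                        lsu_het_error_valid,
    input  logic [ERROR_ID_W - 1 : 0]   lsu_het_error_id,
    input  logic [THREAD_IDX_W - 1 : 0] lsu_het_error_thread_id,
    output logic                        misaligned_seen,         // sticky flag, cleared only by reset
    output logic [THREAD_IDX_W - 1 : 0] misaligned_thread_id     // thread of the first recorded error
);

    // Error ID for a misaligned address, as stated in the interface comment.
    localparam logic [ERROR_ID_W - 1 : 0] ERR_MISALIGNED = 380;

    always_ff @(posedge clk) begin
        if (reset) begin
            misaligned_seen      <= 1'b0;
            misaligned_thread_id <= '0;
        end else if (lsu_het_error_valid && lsu_het_error_id == ERR_MISALIGNED && !misaligned_seen) begin
            // Only the first misaligned error is recorded; later ones are ignored.
            misaligned_seen      <= 1'b1;
            misaligned_thread_id <= lsu_het_error_thread_id;
        end
    end

endmodule
```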
Synchronization Interface
The Synchronization Interface connects the user logic with the synchronization module core-side allocated within the tile (namely the barrier_core unit). Such an interface allows user logic to synchronize on a thread grain.
The synchronization mechanism supports inter- and intra- tile barrier synchronization. When a thread hits a synchronization point, it issues a request to the distributed synchronization master through the Synchronization Interface. Then, the thread is stalled (up to the user logic) till its release signal is high again.
A custom core may implement the following interface if synchronization is required:
/* Synchronization Interface */

// To Barrier Core
output logic                          breq_valid,       // Hit barrier signal, sends a synchronization request
output logic [31 : 0]                 breq_op_id,       // Synchronization operation ID, mainly used for debugging
output logic [THREAD_NUMB - 1 : 0]    breq_thread_id,   // ID of the thread performing the synchronization operation
output logic [31 : 0]                 breq_barrier_id,  // Barrier ID, has to be unique in case of concurrent barriers
output logic [31 : 0]                 breq_thread_numb, // Total number - 1 of synchronizing threads on the current barrier ID

// From Barrier Core
input  logic [THREAD_NUMB - 1 : 0]    bc_release_val    // Stalled threads bitmask waiting for release (the i-th bit low stalls the i-th thread)
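Two details of this interface are easy to get wrong: `breq_thread_numb` carries the number of synchronizing threads minus one, and a low bit in `bc_release_val` means the corresponding thread is stalled. The following sketch illustrates both. The module name, the parameter `SYNC_THREAD_NUMB`, the output `thread_stalled`, and the default parameter values are hypothetical.

```systemverilog
// Hypothetical helper deriving the stall mask and the thread count field of a barrier request.
module het_barrier_helper #(
    parameter int THREAD_NUMB      = 8,   // assumed value
    parameter int SYNC_THREAD_NUMB = 16   // hypothetical: threads taking part in the barrier
) (
    input  logic [THREAD_NUMB - 1 : 0] bc_release_val,   // from the Barrier Core
    output logic [THREAD_NUMB - 1 : 0] thread_stalled,   // hypothetical: high while the i-th thread must wait
    output logic [31 : 0]              breq_thread_numb  // to the Barrier Core
);

    // The i-th thread is stalled while the i-th release bit is low.
    assign thread_stalled = ~bc_release_val;

    // The interface expects the total number of synchronizing threads minus one.
    assign breq_thread_numb = SYNC_THREAD_NUMB - 1;

endmodule
```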
Heterogeneous Dummy provided
The provided dummy core is an FSM that first synchronizes with the other heterogeneous tiles (ht) in the NoC. Each dummy core in an ht tile requests synchronization for LOCAL_BARRIER_NUMB threads (default = 4).
// Issue synchronization requests
SEND_BARRIER : begin
    breq_valid      <= 1'b1;
    breq_barrier_id <= 42;
    barrier_served  <= 1'b1;

    if (rem_barriers == 1)
        next_state <= WAIT_SYNCH;
    else
        next_state <= IDLE;
end
The SEND_BARRIER state sends LOCAL_BARRIER_NUMB requests with barrier ID 42 through the Synchronization Interface. It sets the total number of threads synchronizing on barrier ID 42 to TOTAL_BARRIER_NUMB (= LOCAL_BARRIER_NUMB x `TILE_HT, where `TILE_HT is the number of heterogeneous tiles in the system). When the last barrier request is issued, SEND_BARRIER jumps to WAIT_SYNCH and waits for the ACK from the synchronization master.
// Synchronizes all dummy cores
WAIT_SYNCH : begin
    if (&bc_release_val)
        next_state <= IDLE;
end
At this point, all threads in each ht tile are synchronized, and the FSM starts all pending memory transactions.
The START_MEM_READ_TRANS state performs LOCAL_WRITE_REQS read operations (default = 128), each one a LOAD_8 operation (opcode = 0). In the default configuration, 128 LOAD_8 operations on consecutive addresses are spread among all threads and issued to the LSU through the Memory Interface. When the read operations are over, the FSM starts the write operations in a similar way.
// Starting multiple read operations
START_MEM_READ_TRANS : begin
    if (rem_reads == 1)
        next_state <= DONE;
    else
        next_state <= IDLE;

    if (lsu_het_almost_full[thread_id_read] == 1'b0) begin
        read_served       <= 1'b1;
        req_out_valid     <= 1'b1;
        req_out_id        <= rem_reads;
        req_out_op        <= 0; // LOAD_8
        incr_address      <= 1'b1;
        req_out_thread_id <= thread_id_read;
    end
end
The START_MEM_WRITE_TRANS state performs LOCAL_WRITE_REQS (default = 128) write operations on consecutive addresses through the Memory Interface. This time the operation performed is a STORE_8, and all ht tiles issue the same store operation on the same addresses, transparently competing for ownership. Coherence is handled entirely by the LSU and the CC. On the core side, the lsu_het_almost_full bitmap states the availability of the LSU for each thread, for both reads and writes.
// Starting multiple write operations
START_MEM_WRITE_TRANS : begin
    if (pending_writes)
        next_state <= IDLE;
    else
        next_state <= DONE;

    if (lsu_het_almost_full[thread_id_write] == 1'b0) begin
        write_served      <= 1'b1;
        req_out_valid     <= 1'b1;
        req_out_id        <= rem_writes;
        req_out_thread_id <= thread_id_write;
        req_out_op        <= 'b100000; // STORE_8
        tmp_data_out[0]   <= 8'hee;
        incr_address      <= 1'b1;
    end
end
In both states, a thread first checks the availability stored in a position equal to its ID (lsu_het_almost_full[thread_id]), then performs a memory transaction.