Heterogeneous Tile
Revision as of 18:05, 13 May 2019
The nu+ project provides a heterogeneous tile integrated into the NoC, meant to be extended by the user. Such a tile provides a first example of how to integrate a custom module in the network-on-chip with a dedicated tile.
Memory Interface
The Memory Interface provides a transparent way to interact with the coherence system. It implements a simple valid/available handshake per thread: different threads may issue different memory transactions, and the coherence system handles them concurrently.
When a thread has a memory request, it first checks the availability bit related to its ID. If the bit is high, the thread issues a memory transaction by setting the valid bit and loading all the needed information on the Memory Interface.
Supported memory operations are reported below along with their opcodes:
 LOAD_8     = 'h0  - 'b000000
 LOAD_16    = 'h1  - 'b000001
 LOAD_32    = 'h2  - 'b000010
 LOAD_V_8   = 'h7  - 'b000111
 LOAD_V_16  = 'h8  - 'b001000
 LOAD_V_32  = 'h9  - 'b001001
 STORE_8    = 'h20 - 'b100000
 STORE_16   = 'h21 - 'b100001
 STORE_32   = 'h22 - 'b100010
 STORE_V_8  = 'h24 - 'b100100
 STORE_V_16 = 'h25 - 'b100101
 STORE_V_32 = 'h26 - 'b100110
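As a minimal sketch, the opcodes listed above could be collected in a SystemVerilog package so that user logic refers to them by name. The package name `het_mem_ops_pkg`, the type name `het_mem_op_t`, and the helper `is_store` are hypothetical and not part of the nu+ sources; only the opcode values come from the list. The type is 8 bits wide to match the `req_out_op` and `resp_in_op` ports of the Memory Interface.

```systemverilog
// Hypothetical package: the names are illustrative, the values are the ones listed above.
package het_mem_ops_pkg;

    // 8 bits wide to match the req_out_op / resp_in_op ports
    typedef enum logic [7 : 0] {
        LOAD_8     = 8'h00,
        LOAD_16    = 8'h01,
        LOAD_32    = 8'h02,
        LOAD_V_8   = 8'h07,
        LOAD_V_16  = 8'h08,
        LOAD_V_32  = 8'h09,
        STORE_8    = 8'h20,
        STORE_16   = 8'h21,
        STORE_32   = 8'h22,
        STORE_V_8  = 8'h24,
        STORE_V_16 = 8'h25,
        STORE_V_32 = 8'h26
    } het_mem_op_t;

    // In the listed encodings every store has bit 5 set and every load has it clear.
    // This helper (hypothetical) relies only on that observation.
    function automatic logic is_store(input logic [7 : 0] op);
        return op[5];
    endfunction

endpackage
```

A custom core would then `import het_mem_ops_pkg::*;` and drive `req_out_op <= STORE_8;` rather than a bare literal.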
A custom core to be integrated in the nu+ system ought to implement the following interface in order to communicate with the memory system:
 /* Memory Interface */
 // To Heterogeneous LSU
 output logic                         req_out_valid,      // Valid signal for issued memory requests
 output logic [31 : 0]                req_out_id,         // ID of the issued request, mainly used for debugging
 output logic [THREAD_IDX_W - 1 : 0]  req_out_thread_id,  // Thread ID of issued request. Requests running on different threads are dispatched to the CC concurrently
 output logic [7 : 0]                 req_out_op,         // Operation performed
 output logic [ADDRESS_WIDTH - 1 : 0] req_out_address,    // Issued request address
 output logic [DATA_WIDTH - 1 : 0]    req_out_data,       // Data output
 // From Heterogeneous LSU
 input  logic                         resp_in_valid,      // Valid signal for the incoming responses
 input  logic [31 : 0]                resp_in_id,         // ID of the incoming response, mainly used for debugging
 input  logic [THREAD_IDX_W - 1 : 0]  resp_in_thread_id,  // Thread ID of the incoming response
 input  logic [7 : 0]                 resp_in_op,         // Operation code
 input  logic [DATA_WIDTH - 1 : 0]    resp_in_cache_line, // Incoming data
 input  logic [BYTES_PERLINE - 1 : 0] resp_in_store_mask, // Bitmask of the position of the requesting bytes in the incoming data bus
 input  logic [ADDRESS_WIDTH - 1 : 0] resp_in_address,    // Incoming response address
Synchronization Interface
The Synchronization Interface connects the user logic with the core-side synchronization module allocated within the tile (namely the barrier_core unit). This interface allows user logic to synchronize at thread granularity.
The synchronization mechanism supports inter- and intra-tile barrier synchronization. When a thread hits a synchronization point, it issues a request to the distributed synchronization master through the Synchronization Interface. Then, the thread is stalled (up to the user logic) until its release signal is high again.
A custom core may implement the following interface if synchronization is required:
 /* Synchronization Interface */
 // To Barrier Core
 output logic                       breq_valid,       // Hit barrier signal, sends a synchronization request
 output logic [31 : 0]              breq_op_id,       // Synchronization operation ID, mainly used for debugging
 output logic [THREAD_NUMB - 1 : 0] breq_thread_id,   // ID of the thread performing the synchronization operation
 output logic [31 : 0]              breq_barrier_id,  // Barrier ID, has to be unique in case of concurrent barriers
 output logic [31 : 0]              breq_thread_numb, // Total number - 1 of synchronizing threads on the current barrier ID
 // From Barrier Core
 input  logic [THREAD_NUMB - 1 : 0] bc_release_val    // Stalled threads bitmask waiting for release (the i-th bit low stalls the i-th thread)
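The request-then-stall behaviour described above might look like the following sketch for a single thread. It is illustrative only: the module name, the `hit_barrier` and `stalled` signals, the `THREAD_ID` and `SYNC_THREADS` parameters, and the state names are hypothetical. It also assumes that the barrier core samples `breq_valid` on the clock edge where it is high and drives the thread's `bc_release_val` bit low from the following cycle, and that the thread ID is sent as a plain binary index. None of these timing or encoding details are stated on this page, so they should be checked against the barrier_core sources.

```systemverilog
// Sketch of one thread using the Synchronization Interface (assumptions in the text above).
module barrier_requester_sketch #(
    parameter THREAD_NUMB  = 4,   // assumed value
    parameter THREAD_ID    = 0,   // hypothetical: the thread this sketch drives
    parameter BARRIER_ID   = 42,  // must be unique among concurrent barriers
    parameter SYNC_THREADS = 4    // hypothetical: threads meeting at this barrier
)(
    input  logic                       clk,
    input  logic                       reset,
    input  logic                       hit_barrier, // hypothetical: user logic reached a sync point
    output logic                       stalled,     // hypothetical: high while the thread must not advance
    // Synchronization Interface
    output logic                       breq_valid,
    output logic [31 : 0]              breq_op_id,
    output logic [THREAD_NUMB - 1 : 0] breq_thread_id,
    output logic [31 : 0]              breq_barrier_id,
    output logic [31 : 0]              breq_thread_numb,
    input  logic [THREAD_NUMB - 1 : 0] bc_release_val
);

    typedef enum logic [1 : 0] {RUN, ISSUED, WAIT_SYNCH} state_t;
    state_t state;

    assign stalled = (state != RUN);

    always_ff @(posedge clk) begin
        if (reset) begin
            state            <= RUN;
            breq_valid       <= 1'b0;
            breq_op_id       <= '0;
            breq_thread_id   <= '0;
            breq_barrier_id  <= '0;
            breq_thread_numb <= '0;
        end else begin
            breq_valid <= 1'b0; // the request is a one-cycle pulse
            case (state)
                RUN : if (hit_barrier) begin
                    breq_valid       <= 1'b1;
                    breq_op_id       <= breq_op_id + 32'd1;      // debug counter
                    breq_thread_id   <= THREAD_NUMB'(THREAD_ID); // assumed binary encoding
                    breq_barrier_id  <= BARRIER_ID;
                    breq_thread_numb <= SYNC_THREADS - 1;        // "total number - 1"
                    state            <= ISSUED;
                end
                // breq_valid is high during this cycle; the barrier core is assumed to sample it now
                ISSUED     : state <= WAIT_SYNCH;
                // Stay stalled until this thread's release bit is high again
                WAIT_SYNCH : if (bc_release_val[THREAD_ID]) state <= RUN;
                default    : state <= RUN;
            endcase
        end
    end

endmodule
```

The extra `ISSUED` cycle exists only so that the release bit is not sampled before the barrier core has had a chance to lower it. If the real latency differs, that state would need to be adjusted.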
Heterogeneous Dummy provided
This FSM first synchronizes with the other ht tiles in the NoC. Each dummy core in an ht tile requires a synchronization for LOCAL_BARRIER_NUMB threads (default = 4).
 // Issue synchronization requests
 SEND_BARRIER : begin
     breq_valid      <= 1'b1;
     breq_barrier_id <= 42;
     barrier_served  <= 1'b1;
     if (rem_barriers == 1)
         next_state <= WAIT_SYNCH;
     else
         next_state <= IDLE;
 end
The SEND_BARRIER state sends LOCAL_BARRIER_NUMB requests with barrier ID 42 through the Synchronization Interface. It sets the total number of threads synchronizing on barrier ID 42 to TOTAL_BARRIER_NUMB (= LOCAL_BARRIER_NUMB x `TILE_HT, where `TILE_HT is the number of heterogeneous tiles in the system). When the last barrier request is issued, SEND_BARRIER jumps to WAIT_SYNCH, waiting for the ACK from the synchronization master.
 // Synchronizes all dummy cores
 WAIT_SYNCH : begin
     if (&bc_release_val)
         next_state <= IDLE;
 end
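The thread count for barrier ID 42 can be worked out as in the sketch below. The value given to `TILE_HT` is an assumption made only so the snippet stands alone (the real value comes from the system configuration), and the module name and `BREQ_THREAD_NUMB` are hypothetical. Whether the dummy core subtracts one before driving `breq_thread_numb` is not stated on this page; the sketch follows the port comment ("total number - 1").

```systemverilog
// Assumed here for illustration; the real define comes from the system configuration.
`define TILE_HT 2

module barrier_count_sketch;

    localparam LOCAL_BARRIER_NUMB = 4;                              // default given in the text
    localparam TOTAL_BARRIER_NUMB = LOCAL_BARRIER_NUMB * `TILE_HT;  // 4 * 2 = 8 with the values above

    // Value for breq_thread_numb if the "total number - 1" convention of the port comment applies
    localparam logic [31 : 0] BREQ_THREAD_NUMB = TOTAL_BARRIER_NUMB - 1;

    initial begin
        $display("TOTAL_BARRIER_NUMB = %0d", TOTAL_BARRIER_NUMB);
        $display("BREQ_THREAD_NUMB   = %0d", BREQ_THREAD_NUMB);
    end

endmodule
```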
At this point, all threads in each ht tile are synchronized, and the FSM starts all pending memory transactions.
The START_MEM_READ_TRANS state performs LOCAL_WRITE_REQS read operations (default = 128), each of them a LOAD_8 operation (op code = 0). In the default configuration, 128 LOAD_8 operations on consecutive addresses are spread among all threads and issued to the LSU through the Memory Interface. When the read operations are over, the FSM starts the write operations in a similar way.
The START_MEM_WRITE_TRANS state performs LOCAL_WRITE_REQS (default = 128) write operations on consecutive addresses through the Memory Interface. This time the operation performed is a STORE_8, and all ht tiles issue the same store operation on the same addresses, competing for ownership in a transparent way. Coherence is handled entirely by the LSU and the CC. On the core side, the lsu_het_almost_full bitmap states the availability of the LSU for each thread (for both writing and reading).
In both states, a thread first checks the availability stored in a position equal to its ID (lsu_het_almost_full[thread_id]), then performs a memory transaction.
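The check-then-issue pattern could be sketched as follows for the read phase. This is not the dummy core's code: the module name, `LOCAL_READ_REQS`, `BASE_ADDRESS`, `issued`, and `done` are hypothetical, and the parameter values are assumed. Two further assumptions are labeled in the comments: requests are spread round-robin (request i goes to thread i mod THREAD_NUMB), and a low `lsu_het_almost_full` bit means the LSU can accept a request from that thread. The signal name suggests that polarity, but the text does not state it.

```systemverilog
// Sketch of issuing consecutive LOAD_8 requests spread across threads.
module mem_read_issue_sketch #(
    parameter THREAD_NUMB     = 4,    // assumed value
    parameter THREAD_IDX_W    = 2,    // assumed: THREAD_NUMB == 2**THREAD_IDX_W
    parameter ADDRESS_WIDTH   = 32,   // assumed value
    parameter LOCAL_READ_REQS = 128,  // hypothetical name, default count from the text
    parameter BASE_ADDRESS    = 0     // hypothetical start address
)(
    input  logic                         clk,
    input  logic                         reset,
    input  logic [THREAD_NUMB - 1 : 0]   lsu_het_almost_full,
    output logic                         req_out_valid,
    output logic [31 : 0]                req_out_id,
    output logic [THREAD_IDX_W - 1 : 0]  req_out_thread_id,
    output logic [7 : 0]                 req_out_op,
    output logic [ADDRESS_WIDTH - 1 : 0] req_out_address,
    output logic                         done          // hypothetical: all requests issued
);

    localparam logic [7 : 0] LOAD_8 = 8'h00;

    logic [31 : 0]               issued;    // number of requests sent so far
    logic [THREAD_IDX_W - 1 : 0] thread_id;

    // Assumption: round-robin distribution, request i is issued on thread i mod THREAD_NUMB
    assign thread_id = issued[THREAD_IDX_W - 1 : 0];
    assign done      = (issued == LOCAL_READ_REQS);

    always_ff @(posedge clk) begin
        if (reset) begin
            issued            <= '0;
            req_out_valid     <= 1'b0;
            req_out_id        <= '0;
            req_out_thread_id <= '0;
            req_out_op        <= '0;
            req_out_address   <= '0;
        end else begin
            req_out_valid <= 1'b0;
            // Assumption: a low almost-full bit means the LSU is available for this thread
            if (!done && !lsu_het_almost_full[thread_id]) begin
                req_out_valid     <= 1'b1;
                req_out_id        <= issued;
                req_out_thread_id <= thread_id;
                req_out_op        <= LOAD_8;
                // Consecutive byte addresses, one per request
                req_out_address   <= ADDRESS_WIDTH'(BASE_ADDRESS) + ADDRESS_WIDTH'(issued);
                issued            <= issued + 32'd1;
            end
        end
    end

endmodule
```

When the selected thread's bit reports the LSU as unavailable, the sketch simply waits on that thread rather than skipping to another one. A write phase would follow the same shape with STORE_8 (`8'h20`) and `req_out_data` driven as well.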