Heterogeneous Tile
Revision as of 18:05, 13 May 2019

The nu+ project provides a heterogeneous tile integrated into the NoC, meant to be extended by the user. Such a tile serves as a first example of how to integrate a custom module into the network-on-chip through a dedicated tile.

Memory Interface

The Memory Interface provides a transparent way to interact with the coherence system. It implements a simple valid/available handshake per thread; different threads may issue different memory transactions, and these are handled concurrently by the coherence system.

When a thread has a memory request, it first checks the availability bit related to its ID; if this bit is high, the thread issues a memory transaction by setting the valid bit and loading all the needed information onto the Memory Interface.

Supported memory operations are reported below along with their opcodes:

LOAD_8      = 'h0  - 'b000000    
LOAD_16     = 'h1  - 'b000001
LOAD_32     = 'h2  - 'b000010
LOAD_V_8    = 'h7  - 'b000111
LOAD_V_16   = 'h8  - 'b001000
LOAD_V_32   = 'h9  - 'b001001
STORE_8     = 'h20 - 'b100000
STORE_16    = 'h21 - 'b100001
STORE_32    = 'h22 - 'b100010
STORE_V_8   = 'h24 - 'b100100
STORE_V_16  = 'h25 - 'b100101
STORE_V_32  = 'h26 - 'b100110
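
For convenience, the opcodes above can be gathered in a SystemVerilog enumeration. The following is a sketch based purely on the table above; the type name mem_op_t and the 8-bit width (matching req_out_op below) are illustrative, not taken from the nu+ sources:

// Memory operation opcodes, as listed in the table above
typedef enum logic [7 : 0] {
    LOAD_8     = 'h0,
    LOAD_16    = 'h1,
    LOAD_32    = 'h2,
    LOAD_V_8   = 'h7,
    LOAD_V_16  = 'h8,
    LOAD_V_32  = 'h9,
    STORE_8    = 'h20,
    STORE_16   = 'h21,
    STORE_32   = 'h22,
    STORE_V_8  = 'h24,
    STORE_V_16 = 'h25,
    STORE_V_32 = 'h26
} mem_op_t;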

A custom core to be integrated in the nu+ system ought to implement the following interface in order to communicate with the memory system:

/* Memory Interface */
// To Heterogeneous LSU
output logic                                 req_out_valid,     // Valid signal for issued memory requests
output logic [31 : 0]                        req_out_id,        // ID of the issued request, mainly used for debugging
output logic [THREAD_IDX_W - 1 : 0]          req_out_thread_id, // Thread ID of the issued request. Requests running on different threads are dispatched to the CC concurrently
output logic [7 : 0]                         req_out_op,        // Operation performed
output logic [ADDRESS_WIDTH - 1 : 0]         req_out_address,   // Issued request address
output logic [DATA_WIDTH - 1    : 0]         req_out_data,      // Data output
// From Heterogeneous LSU
input  logic                                 resp_in_valid,      // Valid signal for the incoming responses
input  logic [31 : 0]                        resp_in_id,         // ID of the incoming response, mainly used for debugging
input  logic [THREAD_IDX_W - 1 : 0]          resp_in_thread_id,  // Thread ID of the incoming response
input  logic [7 : 0]                         resp_in_op,         // Operation code
input  logic [DATA_WIDTH - 1 : 0]            resp_in_cache_line, // Incoming data
input  logic [BYTES_PERLINE - 1 : 0]         resp_in_store_mask, // Bitmask of the position of the requesting bytes in the incoming data bus
input  logic [ADDRESS_WIDTH - 1 : 0]         resp_in_address,    // Incoming response address
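
As a sketch, thread-side issue logic driving this interface might look as follows. The names clk, req_pending, base_address, word_offset, and thread_id are illustrative; lsu_het_almost_full is the per-thread availability bitmask described in the dummy-core section below, and a low bit is assumed to mean the LSU can accept a request from that thread:

// Hypothetical issue logic for one thread: check LSU availability,
// then drive a LOAD_8 request on the Memory Interface for one cycle
always_ff @(posedge clk) begin
    req_out_valid <= 1'b0;
    if (req_pending & ~lsu_het_almost_full[thread_id]) begin
        req_out_valid     <= 1'b1;
        req_out_id        <= req_out_id + 1;        // debug ID
        req_out_thread_id <= thread_id;
        req_out_op        <= 8'h0;                  // LOAD_8
        req_out_address   <= base_address + word_offset;
    end
end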

Synchronization Interface

The Synchronization Interface connects the user logic with the synchronization module allocated core-side within the tile (namely the barrier_core unit). Such an interface allows user logic to synchronize at thread granularity.

The synchronization mechanism supports inter- and intra-tile barrier synchronization. When a thread hits a synchronization point, it issues a request to the distributed synchronization master through the Synchronization Interface. Then, the thread is stalled (up to the user logic) until its release signal is high again.

A custom core may implement the following interface if synchronization is required:

/* Synchronization Interface */
// To Barrier Core
output logic                                 breq_valid,       // Hit barrier signal, sends a synchronization request
output logic [31 : 0]                        breq_op_id,       // Synchronization operation ID, mainly used for debugging 
output logic [THREAD_NUMB - 1 : 0]           breq_thread_id,   // ID of the thread performing the synchronization operation
output logic [31 : 0]                        breq_barrier_id,  // Barrier ID, has to be unique in case of concurrent barriers
output logic [31 : 0]                        breq_thread_numb, // Total number of synchronizing threads on the current barrier ID, minus one
// From Barrier Core
input  logic [THREAD_NUMB - 1 : 0]           bc_release_val // Stalled threads bitmask waiting for release (the i-th bit low stalls the i-th thread)


Heterogeneous Dummy provided

The dummy core provided implements an FSM that first synchronizes with the other heterogeneous tiles (ht) in the NoC. Each dummy core in a ht tile requires a synchronization for LOCAL_BARRIER_NUMB threads (default = 4).

// Issue synchronization requests
SEND_BARRIER : begin
  breq_valid       <= 1'b1;
  breq_barrier_id  <= 42;
  barrier_served   <= 1'b1;
  if (rem_barriers == 1)
    next_state <= WAIT_SYNCH;
  else
    next_state <= IDLE;
end

The SEND_BARRIER state sends LOCAL_BARRIER_NUMB requests with barrier ID 42 through the Synchronization interface. It sets the total number of threads synchronizing on barrier ID 42 equal to TOTAL_BARRIER_NUMB (= LOCAL_BARRIER_NUMB x `TILE_HT, the number of heterogeneous tiles in the system). When the last barrier is issued, SEND_BARRIER jumps to WAIT_SYNCH, waiting for the ACK from the synchronization master.
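
The parameterization described above can be summarized as follows. This is a sketch that mirrors the text; only `TILE_HT is taken from the nu+ sources, the localparam names follow the description:

// Each dummy core issues LOCAL_BARRIER_NUMB requests; the barrier
// releases once all threads across all `TILE_HT heterogeneous tiles hit it
localparam LOCAL_BARRIER_NUMB = 4;
localparam TOTAL_BARRIER_NUMB = LOCAL_BARRIER_NUMB * `TILE_HT;
// breq_thread_numb carries the thread count minus one:
// breq_thread_numb <= TOTAL_BARRIER_NUMB - 1;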

// Synchronizes all dummy cores
WAIT_SYNCH : begin
  if (&bc_release_val)
    next_state <= IDLE;
end

At this point, all threads in each ht tile are synchronized, and the FSM starts all pending memory transactions.

The START_MEM_READ_TRANS state performs LOCAL_WRITE_REQS read operations (default = 128), issuing a LOAD_8 operation (op code = 0) each time. In the default configuration, 128 LOAD_8 operations on consecutive addresses are spread among all threads and issued to the LSU through the Memory interface. When the read operations are over, the FSM starts write operations in a similar way.

The START_MEM_WRITE_TRANS state performs LOCAL_WRITE_REQS (default = 128) write operations on consecutive addresses through the Memory interface. This time the operation performed is a STORE_8, and all ht tiles issue the same store operation on the same addresses, competing for ownership in a transparent way. Coherence is handled entirely by the LSU and CC; on the core side, the lsu_het_almost_full bitmask reports the availability of the LSU for each thread (for both writes and reads).

In both states, a thread first checks the availability bit stored at the position equal to its ID (lsu_het_almost_full[thread_id]), then performs a memory transaction.