Difference between revisions of "Core"

From NaplesPU Documentation

Revision as of 17:18, 20 September 2017

The core is based on a RISC in-order pipeline with a deliberately lightweight control unit. Rather than using complex control logic, the architecture relies heavily on hardware multithreading to mask memory and operation latencies, so the core can devote most of its resources to accelerating computation in highly data-parallel kernels. In the nuplus architecture, each hardware thread has its own PC, register file, and control registers, and the number of threads is user-configurable. A nuplus hardware thread is equivalent to a wavefront in AMD terminology and to a CUDA warp in NVIDIA terminology. The processor uses a deep pipeline to improve clock speed.

Figure: nu+ microarchitecture

All threads share the same compute units. Execution pipelines are organized in hardware vector lanes: as in vector processors, each operator is replicated N times. Each thread can perform a SIMD operation on independent data, with the data held in a vector register file. The core supports a high-throughput, non-coherent scratchpad memory, or SPM (corresponding to shared memory in NVIDIA terminology). Coherence is left out of the SPM because coherence mechanisms incur high latency and are not strictly necessary for many applications. The SPM is divided into a parameterized number of banks according to a user-configurable mapping function. The memory controller resolves bank collisions at run time, ensuring correct execution of SPM accesses from concurrent threads.
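The banking scheme can be illustrated with a small behavioral sketch. It is not taken from the nu+ sources: the word-interleaved mapping in `bank_of`, the bank count, the word size, and the "one access per bank per cycle" serialization policy are all assumptions chosen for illustration, since the real mapping function is user-configurable.

```python
from collections import defaultdict


def bank_of(address, num_banks=4, word_bytes=4):
    """Hypothetical mapping function: consecutive words go to consecutive banks."""
    return (address // word_bytes) % num_banks


def schedule_accesses(addresses, num_banks=4, word_bytes=4):
    """Group accesses into cycles so that each bank serves one access per cycle.

    Assumed policy: accesses that collide on a bank are serialized in
    arrival order; accesses to distinct banks proceed in the same cycle.
    """
    per_bank = defaultdict(list)
    for addr in addresses:
        per_bank[bank_of(addr, num_banks, word_bytes)].append(addr)

    depth = max((len(queue) for queue in per_bank.values()), default=0)
    cycles = []
    for i in range(depth):
        # Take the i-th pending access of every bank that still has one.
        cycles.append([q[i] for _, q in sorted(per_bank.items()) if i < len(q)])
    return cycles


# One access per bank: everything completes in a single cycle.
conflict_free = schedule_accesses([0, 4, 8, 12])

# Addresses 0, 16 and 32 all map to bank 0, so they are serialized.
conflicting = schedule_accesses([0, 16, 4, 32])
```

In this model the conflict-free pattern needs one cycle, while the conflicting pattern needs three, which is the cost the bank mapping function is meant to minimize.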

Instruction fetch stage

The Instruction Fetch stage selects the next thread PC from the pool of eligible threads, which is managed by the Thread Controller. Eligible threads are scheduled in round-robin order. In addition, during the boot phase the Thread Controller can initialize each thread PC through a dedicated interface.
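A minimal sketch of round-robin selection over an eligibility mask follows. The function name, the list-of-booleans representation of the eligible pool, and the choice to start searching just after the last scheduled thread are assumptions for illustration, not the actual hardware arbiter.

```python
def round_robin_select(eligible, last):
    """Return the next eligible thread ID after `last`, or None if none is eligible.

    `eligible` is a list of booleans, one per hardware thread (a hypothetical
    stand-in for the eligibility information kept by the Thread Controller).
    """
    n = len(eligible)
    for offset in range(1, n + 1):
        tid = (last + offset) % n
        if eligible[tid]:
            return tid
    return None


# Four threads; thread 2 is not eligible (for example, blocked on a miss).
eligible = [True, True, False, True]
order = []
last = 3  # assume thread 3 was the last one scheduled
for _ in range(6):
    last = round_robin_select(eligible, last)
    order.append(last)
```

With this mask the blocked thread is skipped and the remaining threads take turns in a fixed cyclic order.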

Figure: Instruction Fetch stage

The instruction cache is set-associative and is accessed in two stages. Once an eligible thread is selected, Instruction Fetch reads its PC and determines whether the next instruction cache line is already in the instruction cache. In the first stage, each way has a memory bank holding the tag values and valid bits for the cache sets; this stage reads all way memories in parallel and passes the data to the second stage. Because the tag memory has one cycle of latency, the result is handled in the second stage, which compares the tags read in the first stage with the requested tag. If one of them matches, the access is a cache hit, and the second stage issues the data address to the instruction cache data memory. If a miss occurs, an instruction memory transaction is issued to the Network Interface and the thread is blocked until the instruction line has been retrieved from main memory.
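The two-stage lookup can be modeled in software as follows. This is a behavioral sketch only: the class name, the cache geometry (4 sets, 2 ways, 64-byte lines), the address split, and the `fill` helper that stands in for the line returning from main memory are all assumptions, and the one-cycle tag latency is represented simply by splitting the lookup into two method calls.

```python
class ICacheModel:
    """Behavioral model of a set-associative tag lookup split into two stages."""

    def __init__(self, num_sets=4, num_ways=2, line_bytes=64):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.line_bytes = line_bytes
        # One tag/valid memory per way, indexed by set.
        self.tags = [[0] * num_sets for _ in range(num_ways)]
        self.valid = [[False] * num_sets for _ in range(num_ways)]

    def split(self, pc):
        """Split a PC into (tag, set index); the layout is an assumption."""
        line = pc // self.line_bytes
        return line // self.num_sets, line % self.num_sets

    def stage1_read_tags(self, pc):
        """Stage 1: read the tag and valid bit of every way in parallel."""
        tag, index = self.split(pc)
        ways = [(self.valid[w][index], self.tags[w][index])
                for w in range(self.num_ways)]
        return tag, index, ways

    def stage2_compare(self, tag, index, ways):
        """Stage 2: compare the tags read in stage 1 and report hit or miss."""
        for way, (valid, stored_tag) in enumerate(ways):
            if valid and stored_tag == tag:
                return ("hit", way, index)   # data memory would be addressed here
        return ("miss", None, index)         # a memory transaction would be issued

    def fill(self, pc, way):
        """Install the line for `pc` in `way` (stands in for the memory response)."""
        tag, index = self.split(pc)
        self.tags[way][index] = tag
        self.valid[way][index] = True


cache = ICacheModel()
first = cache.stage2_compare(*cache.stage1_read_tags(0x1000))   # cold cache
cache.fill(0x1000, way=1)
second = cache.stage2_compare(*cache.stage1_read_tags(0x1000))  # after the fill
```

The first lookup misses because no valid bit is set; after the fill, the same PC hits in the way that received the line.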

Finally, this module restores the PC in case of a rollback. When a rollback occurs and the Rollback Handler stage asserts the rollback signals, the Instruction Fetch module overwrites the PC of the thread that issued the rollback.
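The per-thread PC handling can be sketched as a small table that is advanced on fetch and overwritten on rollback. The class and method names, the boot PC of 0, and the 4-byte instruction size are assumptions made for this example.

```python
class FetchPCs:
    """Per-thread program counters, as seen by the fetch stage (sketch)."""

    def __init__(self, num_threads, boot_pc=0, instr_bytes=4):
        self.pc = [boot_pc] * num_threads
        self.instr_bytes = instr_bytes  # assumed fixed instruction size

    def fetch(self, tid):
        """Return the PC to fetch for thread `tid` and advance it."""
        pc = self.pc[tid]
        self.pc[tid] = pc + self.instr_bytes
        return pc

    def rollback(self, tid, new_pc):
        """Overwrite the PC of the thread that issued the rollback."""
        self.pc[tid] = new_pc


pcs = FetchPCs(num_threads=2)
before = [pcs.fetch(0), pcs.fetch(0), pcs.fetch(1)]
pcs.rollback(0, 0x40)          # only thread 0 is redirected
after = [pcs.fetch(0), pcs.fetch(1)]
```

Only the thread that issued the rollback is redirected; the other thread continues from its own PC.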

Decode stage

The Decode stage decodes the instruction fetched by Instruction Fetch and produces the datapath control signals directly from the instruction bits. Its output, dec_instr, is propagated through every pipeline stage and helps the execution and control modules manage the issued instruction. Instruction types are presented in the ISA section.

Instruction scheduler stage

Operand fetch stage

Integer Arithmetic & Logic unit

Scratchpad unit

This unit is described in the dedicated scratchpad page.

Load/Store unit

This unit is described in the dedicated load/store subsection inside the coherence section.

Floating point unit

Barrier unit

This unit is described in the dedicated synchronization section.

Branch unit

Writeback stage

Rollback handler

Thread controller
