Difference between revisions of "Scratchpad unit"
Line 1: | Line 1: | ||
Like existing GPU-like core projects, nu+ provides limited hardware support for shared scratchpad memory. The nu+ core presents a configurable GPU-like oriented scratchpad memory (SPM) with built-in support for bank remapping. Typically, scratchpad memories are organized in multiple independently-accessible memory banks. Therefore if all memory accesses request data mapped to different banks, they can be handled in parallel (at best, with L memory banks, L different gather/scatter fulfilled in one clock cycle). Bank conflicts occur whenever multiple requests are made for data within the same bank. If N parallel memory accesses request the same bank, the hardware serializes the memory accesses, causing an N-way conflict and an N-times slowdown. The nu+ SPM dynamic bank remapping mechanism, based on specific kernel access pattern, minimizes bank conflicts. | Like existing GPU-like core projects, nu+ provides limited hardware support for shared scratchpad memory. The nu+ core presents a configurable GPU-like oriented scratchpad memory (SPM) with built-in support for bank remapping. Typically, scratchpad memories are organized in multiple independently-accessible memory banks. Therefore if all memory accesses request data mapped to different banks, they can be handled in parallel (at best, with L memory banks, L different gather/scatter fulfilled in one clock cycle). Bank conflicts occur whenever multiple requests are made for data within the same bank. If N parallel memory accesses request the same bank, the hardware serializes the memory accesses, causing an N-way conflict and an N-times slowdown. The nu+ SPM dynamic bank remapping mechanism, based on specific kernel access pattern, minimizes bank conflicts. | ||
− | ==Interface== | + | ==Interface and operation== |
+ | ===Interface=== | ||
[[File:Fig interface.png|center|thumb|SPM interface]] | [[File:Fig interface.png|center|thumb|SPM interface]] | ||
− | As shown in the figure above, the I/O interface of the SPM has several control and data signals. Due to the | + | As shown in the figure above, the I/O interface of the SPM has several control and data signals. Due to the scattered memory access of the processor, all the data signals, both in input and in output, are vectors and every index identifies a corresponding lane of the core unit. So, SPM has the following data inputs for each lane: |
*<code>A</code>: address of the memory location to be accessed; | *<code>A</code>: address of the memory location to be accessed; | ||
*<code>Din</code>: word containing data to be written (in the case of ''scatter'' operation); | *<code>Din</code>: word containing data to be written (in the case of ''scatter'' operation); | ||
Line 14: | Line 15: | ||
*<code>ready</code> is an output signal and is asserted by the SPM when it can process a new instruction; | *<code>ready</code> is an output signal and is asserted by the SPM when it can process a new instruction; | ||
*<code>valid</code> is an output signal and is asserted by the SPM when the execution of an instruction is finished and its outputs are the final outcome. | *<code>valid</code> is an output signal and is asserted by the SPM when the execution of an instruction is finished and its outputs are the final outcome. | ||
+ | ===FSM model=== | ||
+ | As said, the SPM takes as input L different addresses to provide support to the scattered memory access (and its multi-banking implementation). It can be regarded as a finite state machine with the following two states: | ||
+ | *Ready | ||
+ | :- the SPM is ready to accept a new memory request. | ||
+ | *Busy | ||
+ | :- the SPM cannot accept any request as it is still processing the previous one. | ||
+ | In the ''Busy'' state all input requests will be ignored. | ||
+ | ===Architecture=== | ||
+ | ===Customizable features=== | ||
---- | ---- |
Revision as of 16:58, 4 May 2019
Like other GPU-like core projects, nu+ provides limited hardware support for shared scratchpad memory. The nu+ core offers a configurable, GPU-oriented scratchpad memory (SPM) with built-in support for bank remapping. Scratchpad memories are typically organized as multiple independently accessible memory banks, so memory accesses that request data mapped to different banks can be handled in parallel (at best, with L memory banks, L different gather/scatter accesses are fulfilled in one clock cycle). A bank conflict occurs whenever multiple requests are made for data within the same bank. If N parallel memory accesses request the same bank, the hardware serializes them, causing an N-way conflict and an N-times slowdown. The nu+ SPM dynamic bank remapping mechanism minimizes bank conflicts by adapting the mapping to the access pattern of the specific kernel.
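The N-way conflict described above can be illustrated with a small Python sketch. It assumes a plain word-interleaved mapping (consecutive 4-byte words go to consecutive banks) and treats every request to the same bank as a conflict; the function names, the word size and the mapping are illustrative assumptions, not the actual nu+ remapping logic.

```python
from collections import Counter

def bank_of(address, num_banks, word_bytes=4):
    """Assumed word-interleaved mapping: consecutive words hit consecutive banks."""
    return (address // word_bytes) % num_banks

def conflict_degree(addresses, num_banks, word_bytes=4):
    """Return N, the largest number of requests that target the same bank."""
    if not addresses:
        return 0
    per_bank = Counter(bank_of(a, num_banks, word_bytes) for a in addresses)
    return max(per_bank.values())

# Unit-stride access: with 4 banks, each of the 4 lanes hits its own bank.
unit_stride = [0, 4, 8, 12]
# Stride of 4 words: every lane falls into bank 0, a 4-way conflict.
strided = [0, 16, 32, 48]

print(conflict_degree(unit_stride, num_banks=4))
print(conflict_degree(strided, num_banks=4))
```

Under these assumptions the returned degree is also the number of serialized accesses needed to serve the request, which is why a remapping that spreads the strided pattern across banks pays off.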
Interface and operation
Interface
As shown in the figure above, the I/O interface of the SPM comprises several control and data signals. Because the processor issues scattered memory accesses, all data signals, both inputs and outputs, are vectors, and each index identifies the corresponding lane of the core unit. The SPM has the following data inputs for each lane:
- A: address of the memory location to be accessed;
- Din: word containing data to be written (in the case of scatter operation);
- BM[0..W-1]: W-bit-long bitmask to enable/disable the writing of each byte of the Din word;
- M: bit asserted if the lane will participate in the next instruction execution.
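As a rough illustration of how the BM bitmask could act on a single lane during a scatter, the following Python sketch merges Din into the word already stored, byte by byte. The byte ordering (BM[0] controls the least significant byte) and the function name are assumptions made for this example only.

```python
def masked_write(old_word, din, bm):
    """Merge din into old_word: byte i is overwritten only when bm[i] is set.

    Assumption: bm[0] refers to the least significant byte of the word.
    """
    result = old_word
    for i, enable in enumerate(bm):
        if enable:
            byte_mask = 0xFF << (8 * i)
            result = (result & ~byte_mask) | (din & byte_mask)
    return result

# W = 4: only bytes 0 and 3 of the stored word are replaced by Din.
stored = 0x11223344
written = masked_write(stored, 0xAABBCCDD, [1, 0, 0, 1])
print(hex(written))
```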
On the output side, the SPM has a single data output for each lane:
- Dout: data stored at the address contained in A.
The store signal is an input control signal. If store is high, the requested instruction is a scatter operation; otherwise it is a gather operation. The value of store is the same for all lanes. Because the latency of an instruction is variable, some control signals are needed to implement a handshaking protocol between the control logic of the SIMD core (owner of the CUDA Thread Block) and the SPM control logic. These signals are:
- start is an input signal and is asserted by the core control unit to request the execution of an instruction;
- ready is an output signal and is asserted by the SPM when it can process a new instruction;
- valid is an output signal and is asserted by the SPM when the execution of an instruction is finished and its outputs are the final outcome.
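Leaving the handshake timing aside, the data-path effect of store, A, Din, M and Dout can be sketched functionally in Python. This is a behavioural model under stated assumptions: the memory is a word-addressed dictionary, unwritten locations read as 0, inactive lanes return None, and byte masking is ignored; none of this reflects the actual RTL.

```python
def spm_access(memory, store, addresses, din, lane_mask):
    """Model one instruction: scatter if store is true, gather otherwise.

    Lanes whose mask bit M is 0 do not take part in the instruction.
    """
    dout = [None] * len(addresses)
    for lane, active in enumerate(lane_mask):
        if not active:
            continue
        if store:
            memory[addresses[lane]] = din[lane]              # scatter: write Din at A
        else:
            dout[lane] = memory.get(addresses[lane], 0)      # gather: read Dout from A
    return dout

memory = {}
# Scatter with lane 2 disabled, so address 8 is never written.
spm_access(memory, True, [0, 4, 8, 12], [10, 20, 30, 40], [1, 1, 0, 1])
# Gather with lane 3 disabled.
gathered = spm_access(memory, False, [0, 4, 8, 12], [0, 0, 0, 0], [1, 1, 1, 0])
print(gathered)
```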
FSM model
As described above, the SPM takes L different addresses as input to support scattered memory access (and its multi-banking implementation). It can be regarded as a finite state machine with the following two states:
- Ready: the SPM is ready to accept a new memory request.
- Busy: the SPM cannot accept any request as it is still processing the previous one.
In the Busy state all input requests will be ignored.
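The two-state behaviour, together with the start/ready/valid handshake, can be modelled with a minimal Python sketch. The class name, the per-request latency parameter and the exact cycle in which valid is raised are assumptions made for illustration; the model does not attempt to capture any pipelining of conflict-free requests.

```python
class SpmFsm:
    """Two-state model: Ready accepts a request, Busy ignores new ones."""
    READY, BUSY = "Ready", "Busy"

    def __init__(self):
        self.state = self.READY
        self.remaining = 0      # cycles left for the request in flight
        self.valid = False      # high for one cycle when a request completes

    @property
    def ready(self):
        return self.state == self.READY

    def clock(self, start=False, latency=1):
        """Advance one clock cycle; return True if a request was accepted."""
        self.valid = False
        accepted = False
        if self.state == self.READY:
            if start:
                accepted = True
                self.remaining = latency   # assumed: latency known at issue time
                self.state = self.BUSY
        else:
            # Busy: start is ignored, the current request keeps executing.
            self.remaining -= 1
            if self.remaining == 0:
                self.valid = True
                self.state = self.READY
        return accepted

fsm = SpmFsm()
first = fsm.clock(start=True, latency=2)    # accepted, SPM becomes Busy
second = fsm.clock(start=True, latency=2)   # ignored while Busy
fsm.clock()                                 # request completes, valid is raised
print(first, second, fsm.valid, fsm.ready)
```

In this sketch the latency could be taken as the conflict degree of the request, so a conflict-free access keeps the SPM Busy for a single cycle while an N-way conflict keeps it Busy for N cycles.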
Architecture
Customizable features
2/5/2019
INTERFACE AND FUNCTIONALITY
Modular view with inputs and outputs (with description)
24/4/2019
INTERFACE AND FUNCTIONALITY
Modular view with inputs and outputs (with description)
FSM (difference in behaviour between fully pipelined operation and n-way conflicts)
Parameterizable inputs
MODULES
Stage 0 (pipe)
Stage 1 (stage1)
Stage 2 (stage2)
Stage 3 (stage3)
EXAMPLE OF OPERATION
The __scratchpad attribute must be added to guarantee that a variable is allocated in the SPM. Define some more specific examples (e.g. conv_layer_mvect_mt with one or two cores)