ISA
Contents
Register File
The NPU register file is composed by a scalar register file and a vector register file; each one containing 64 registers.
The scalar register file has 64 registers. The first 58 are general purpose registers, while the remaining 8 are special purpose registers. Each scalar register can store up to 32 bits of data.
The vector register file has 64 general purpose registers
Each vector register can store up to 512 bits of data, each vector can store 16 x 32 bits.
Finally, there is a Control Register that is composed of several sub-registers. Some information are shared among all threads, others are thread specific and those registers marked 'thread' have a separate instance per thread.
Register | Read/Write | Shared/Thread | Description | ID |
---|---|---|---|---|
TILE_ID | Read | Shared | Tile ID | 0 |
CORE_ID | Read | Shared | Core ID | 1 |
THREAD_ID | Read | Thread | ThreadID | 2 |
GLOBAL_ID | Read | Thread | Global ID, previous IDs merged as follow: TILE_ID, CORE_ID, THREAD_ID | 3 |
GCOUNTER_LOW | Read | Shared | Low part of the Global counter register which counts processor cycles since reset | 4 |
GCOUNTER_HIGH | Read | Shared | High part of the Global counter register which counts processor cycles since reset | 5 |
THREAD_EN | Read | Shared | Thread enabled mask, 1 bit per thread | 6 |
MISS_DATA | Read | Shared | Count of L1 Data cache misses | 7 |
MISS_INSTR | Read | Shared | Count of L1 Instruction cache misses | 8 |
PC | Read | Thread | Current PC | 9 |
TRAP_REASON | Read | Thread | Trap Cause (see below) | 10 |
THREAD_STATUS | Read/Write | Thread | Thread Status2 (see below) | 11 |
ARGC | Read/Write | Shared | The number of strings pointed to by argv | 12 |
ARGV | Read/Write | Shared | The address of command line arguments passed to main() | 13 |
THREAD_NUMB | Read | Shared | The number of total hardware threads | 14 |
THREAD_MISS | Read | Thread | The per-thread number of data cache miss | 15 |
CACHE_MODE | Read/Write | Shared | The write policy used by the cache controller. 0 for write-back, 1 for write-through | 16 |
Trap Cause: in the current state only traps due to misaligned memory accesses can raise:
- SPM_ADDR_MISALIGN: Misaligned memory access in the SPM unit.
- LDST_ADDR_MISALIGN: Misaligned memory access in the LDST unit.
Thread Status: each thread can be in one of the following states:
- THREAD_IDLE (Value = 0): each thread starts in this state after reset.
- RUNNING (Value = 1): the thread is running a kernel.
- END_MODE (Value = 2): the thread switches in this mode when the issued kernel is completed.
- TRAPPED (Value = 3): the thread is in trap mode. At the current state, when a trap occurs, the thread jumps into an infinite loop.
- WAITING_BARRIER (Value = 4): the thread is waiting for a synchronization event.
Data Types
The following table sums up the data types that are possible to use in NPU core. The Type column has the C/C++ type names, the LLVM type column presents the type names used in LLVM and the Register column shows the register type in which a value of a specific type is stored.
The highlighted types are those the architecture natively supports, given the register files width. The others are obtained through extension so that they can be seen as the supported ones. Their advantage resides in more efficient use of the system memory.
Type | LLVM Type | Register | Notes |
---|---|---|---|
bool | i1 | scalar (32 bits) | It is expanded to 32 bits |
char | i8 | scalar (32 bits) | It is expanded to 32 bits |
short | i16 | scalar (32 bits) | It is expanded to 32 bits |
int | i32 | scalar (32 bits) | |
float | f32 | scalar (32 bits) | |
vec16i8, vec16u8 | v16i8 | vector (16 x 32 bits) | It is expanded to 32 bits vector |
vec16i16, vec16u16 | v16i16 | vector (16 x 32 bits) | It is expanded to 32 bits vector |
vec16i32, vec16u32 | v16i32 | vector (16 x 32 bits) | |
vec16f32 | v16f32 | vector (16 x 32 bits) | |
vec8i8, vec8u8 | v8i8 | vector (8 x 32 bits) | It is expanded to 32 bits vector |
vec8i16, vec8u16 | v8i16 | vector (8 x 32 bits) | It is expanded to 32 bits vector |
vec8i32, vec8u32 | v8i32 | vector (8 x 32 bits) | It is expanded to 32 bits vector |
vec8f32 | v8f32 | vector (16 x 32 bits) | It is considered as a 16 elements vector |
Instructions Format
The NaplesPU instructions have a fixed length of 32 bits. They are grouped in seven types:
- The R type includes the logical and arithmetic operations and memory operations.
- The I type includes the logical and arithmetic operations between a register operand and an immediate operand.
- The MOVEI type includes the load operations of an immediate operand in a register.
- The C type used for control operations and for synchronization instructions.
- The J type includes jump instructions.
- The M type includes the instructions used to access memory.
R type instructions
- RR (Register to Register) has a destination register and two source registers.
- RI (Register Immediate) has a destination register and one source registers and an immediate encoded in the instruction word.
or | 1 | or | Rb |
---|---|---|---|
and | 2 | and | Rd = Ra & Rb |
xor | 3 | xor | Rd = Ra ^ Rb |
add | 4 | addition | Rd = Ra + Rb |
sub | 5 | subtraction | Rd = Ra – Rb |
mullo | 6 | low result of the multiplication | Rd = Ra * Rb |
mulhi | 7 | high result of the multiplication | Rd = Ra * Rb |
mulhu | 8 | unsigned high result of the multiplication | Rd = Ra * Rb |
ashr | 9 | arithmetic shift right | Rd = Ra '>> Rb |
shr | 10 | shift right | Rd = Ra >> Rb |
shl | 11 | shift left | Rd = Ra << Rb |
clz | 12 | count leading zeros | |
ctz | 13 | count trailing zeros | |
shuffle | 24 | vector shuffle | Rd[i] = Ra[Rb[i]] |
getlane | 25 | Get lane from vector | Rd = Ra[Rb] |
move | 32 | move register | Rd = Ra |
fadd | 33 | floating point add | Rd = Ra + Rb |
fsub | 34 | floating point sub | Rd = Ra – Rb |
fmul | 35 | floating point multiplication | Rd = Ra * Rb |
fdiv | 36 | floating point division | Rd = Ra / Rb |
sext8 | 43 | sign extend 8 bits | |
sext16 | 44 | sign extend 16 bits | |
sext32 | 45 | sign extend 32 bits | |
i32tof32 | 48 | cast integer to float | |
f32toi32 | 49 | cast float to integer | |
cmpeq | 14 | compare equal | Rd = Ra == Rb |
cmpne | 15 | compare not equal | Rd = Ra != Rb |
cmpgt | 16 | compare greater then | Rd = Ra > Rb |
cmpge | 17 | compare greater or equal | Rd = Ra >= Rb |
cmplt | 18 | compare less then | Rd = Ra < Rb |
cmple | 19 | compare less or equal | Rd = Ra <= Rb |
cmpugt | 20 | unsigned compare greater then | Rd = Ra > Rb |
cmpuge | 21 | unsigned compare greater or equal | Rd = Ra >= Rb |
cmpult | 22 | unsigned compare less then | Rd = Ra < Rb |
cmpule | 23 | unsigned compare less or equal | Rd = Ra <= Rb |
cmpfeq | 37 | floating point compare equal | Rd = Ra == Rb |
cmpfne | 38 | floating point compare not equal | Rd = Ra != Rb |
cmpfgt | 39 | floating point compare greater then | Rd = Ra > Rb |
cmpfge | 40 | floating point compare greater or equal | Rd = Ra >= Rb |
cmpflt | 41 | floating point compare less then | Rd = Ra < Rb |
cmpfle | 42 | floating point compare less or equal | Rd = Ra <= Rb |
I type instructions
Mnemonic | Opcode | Meaning | Operation |
---|---|---|---|
ori | 1 | or | Imm |
andi | 2 | and | Rd = Ra & Imm |
xori | 3 | xor | Rd = Ra ^ Imm |
addi | 4 | addition | Rd = Ra + Imm |
subi | 5 | subtraction | Rd = Ra – Imm |
mulli | 6 | multiplication | Rd = Ra * Imm |
mulhi | 7 | high multiply | Rd = Ra * Imm |
mulhui | 8 | high multiply unsigned | Rd = Ra * Imm |
ashri | 9 | arithmetic shift right | Rd = Ra ‘>> Imm |
shri | 10 | shift right | Rd = Ra >> Imm |
shli | 11 | shift left | Rd = Ra << Imm |
getlane | 25 | Get lane from vector | Rd = Ra[Imm] |
MOVEI type instructions
MVI (Move Immediate) has a destination register and a 16 bit instruction encoded immediate.
Mnemonic | Opcode | Meaning | Operation |
---|---|---|---|
moveil | 0 | move the 16 less significant bits | Rd = Ra & 0xFFFF |
moveih | 1 | move the 16 most significant bits | Rd = (Ra >> 16) & 0xFFFF |
movei | 2 | move the 16 less significant bits with zero extension | Rd = (Rd ^ Rd) & (Ra & 0xFFFF) |
C type instructions
Mnemonic | Opcode | Meaning |
---|---|---|
barrier_core | 0 | Memory Barrier - ensure that all explicit data memory transfers before the barrier are completed before any subsequent explicit data memory transactions starting after the barrier. Register rs0 contains the barrier identification number (BID). BID can be an arbitrary number greater than 0, i.e. BID>0. Different memory barriers require different BIDs. rs1 contains the number of threads that should synchronize. |
flush | 2 | Flush a cache line to the main memory. |
read_cr | 3 | Read a sub-register of the control register. |
write_cr | 4 | Write into a sub-register of the control register |
dcache_inv | 5 | Invalidates the input address line in the L1 cache. |
J type instructions
Mnemonic | Opcode | Meaning | Operation |
---|---|---|---|
jmp | 0 | jump - unconditionally jump to a specified location. | PC=rd or PC=PC+offset |
jmpsr | 1 | jump to subroutine - unconditionally jump to a specified location and store the return address in the RA register. | RA=PC+4 PC=rd or RA=PC+4 PC=PC+addr |
jret | 3 | Return from Subroutine - unconditionally return from a subroutine loading the return address from the RA register. | PC=RA |
beqz | 5 | Conditional Branch. Branch if Equal to Zero - branche to PC+offset if the contents of the condition register is equal to zero. | if(rcond==0) PC=PC+offset else PC=PC+4 |
bnez | 6 | Conditional Branch, Branch if Not Equal to Zero - branches to PC+offset if the contents of the condition register is not equal to zero. | if(rcond!=0) PC=PC+offset else PC=PC+4 |
M type instructions
MEM (Memory Instruction) has a destination/source field, in case of load the first register asses the destination register, otherwise, in case of a store, the first register contains the store value. Next, in both cases there is the base address and the immediate. The sum of the base address and immediate will give the effective memory address.
Mnemonic | Opcode | Meaning | Operation |
---|---|---|---|
loadXD_s8 | 0 | load 1 byte with sign extension | Rd = [Rbase + Offset] |
loadXD_s16 | 1 | load 2 bytes with sign extension | Rd = [Rbase + Offset] |
load32D | 2 | load 1 word | Rd = [Rbase + Offset] |
loadXD_u8 | 4 | load 1 byte with zero extension | Rd = [Rbase + Offset] |
loadXD_u16 | 5 | load 2 bytes with zero extension | Rd = [Rbase + Offset] |
loadD_vYi8 | 7 | load a vector of Y bytes with sign extension | Rd = [Rbase + Offset] |
loadD_vYi16 | 8 | load a vector of Y 2 bytes with sign extension | Rd = [Rbase + Offset] |
loadD_vYi32 | 9 | load a vector of Y words with sign extension | Rd = [Rbase + Offset] |
loadD_vYu8 | 11 | load a vector of Y bytes with zero extension | Rd = [Rbase + Offset] |
loadD_vYu16 | 12 | load a vector of Y 2 bytes with zero extension | Rd = [Rbase + Offset] |
loadD_vYu32 | 13 | load a vector of Y words with zero extension | Rd = [Rbase + Offset] |
loadD_g_32 | 16 | load 16 words from different memory addresses | Rd[i] = [Rbase[i]] |
storeXD_8 | 32 | store 1 byte | [Rbase + Offset] = Rs |
storeXD_16 | 33 | store 2 bytes | [Rbase + Offset] = Rs |
store32D | 34 | store 1 word | [Rbase + Offset] = Rs |
storeD_vYi8 | 32 | store Y bytes | [Rbase + Offset] = Rs |
storeD_vYi16 | 33 | store Y 2 bytes | [Rbase + Offset] = Rs |
storeD_vYi32 | 34 | store Y words | [Rbase + Offset] = Rs |
storeD_s_32 | 42 | store 16 words to different memory addresses | [Rbase[i]] = Rs[i] |