ISA
Register File
The NPU register file is composed of a scalar register file and a vector register file, each containing 64 registers.
The scalar register file has 64 registers. The first 58 are general purpose registers, while the remaining 6 are special purpose registers. Each scalar register can store up to 32 bits of data.
The vector register file has 64 general purpose registers.
Each vector register can store up to 512 bits of data, organized as 16 lanes of 32 bits.
Finally, there is a Control Register that is composed of several sub-registers. Some sub-registers are shared among all threads, while those marked 'Thread' have a separate instance per thread.
Register | Read/Write | Shared/Thread | Description | ID |
---|---|---|---|---|
TILE_ID | Read | Shared | Tile ID | 0 |
CORE_ID | Read | Shared | Core ID | 1 |
THREAD_ID | Read | Thread | Thread ID | 2 |
GLOBAL_ID | Read | Thread | Global ID, previous IDs merged as follows: TILE_ID, CORE_ID, THREAD_ID | 3 |
GCOUNTER_LOW | Read | Shared | Low part of the Global counter register which counts processor cycles since reset | 4 |
GCOUNTER_HIGH | Read | Shared | High part of the Global counter register which counts processor cycles since reset | 5 |
THREAD_EN | Read | Shared | Thread enabled mask, 1 bit per thread | 6 |
MISS_DATA | Read | Shared | Count of L1 Data cache misses | 7 |
MISS_INSTR | Read | Shared | Count of L1 Instruction cache misses | 8 |
PC | Read | Thread | Current PC | 9 |
TRAP_REASON | Read | Thread | Trap Cause (see below) | 10 |
THREAD_STATUS | Read/Write | Thread | Thread Status (see below) | 11 |
ARGC | Read/Write | Shared | The number of strings pointed to by argv | 12 |
ARGV | Read/Write | Shared | The address of command line arguments passed to main() | 13 |
THREAD_NUMB | Read | Shared | The number of total hardware threads | 14 |
THREAD_MISS_CC | Read | Thread | The per-thread clock cycles during which the thread is idle due to memory operations. | 15 |
KERNEL_WORK | Read | Thread | The per-thread kernel clock cycles. | 16 |
CPU_CTRL_REG | Read/Write | Shared | CPU mode register. At the moment only write policy used by the cache controller is implemented. 0 for write-back, 1 for write-through | 17 |
UNCOHERENCE_MAP | Read/Write | Shared | Address the non-coherent table in the control register. It stores information about the non-coherent memory regions. User can define non-coherent regions addressing this special purpose register. | 19 |
DEBUG_BASE_ADDR | Read/Write | Shared | Debug registers base address. The NPU is equipped with 16 debug registers. DEBUG_BASE_ADDR fetches the value of the first debug register, DEBUG_BASE_ADDR+1 the second and so on. | 20 |
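As an illustration of how software might consume two of the sub-registers above, the sketch below combines GCOUNTER_LOW and GCOUNTER_HIGH into a single cycle count and expands THREAD_EN into a list of enabled threads. It assumes that each sub-register is read as a 32-bit value and that bit 0 of THREAD_EN corresponds to thread 0; neither detail is stated in the table, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def global_counter(low, high):
    """Combine the two halves of the global counter into one value.

    Assumption: both halves are 32 bits wide.
    """
    return ((high & MASK32) << 32) | (low & MASK32)


def enabled_threads(thread_en, thread_numb):
    """List the thread indices whose bit is set in THREAD_EN.

    Assumption: bit 0 maps to thread 0, bit 1 to thread 1, and so on.
    """
    return [t for t in range(thread_numb) if (thread_en >> t) & 1]


# Sample values, not taken from real hardware.
cycles = global_counter(low=0x10, high=0x1)
active = enabled_threads(thread_en=0b0101, thread_numb=8)
print(cycles, active)
```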
Trap Cause: in the current state, only traps due to misaligned memory accesses can be raised:
- SPM_ADDR_MISALIGN: Misaligned memory access in the SPM unit.
- LDST_ADDR_MISALIGN: Misaligned memory access in the LDST unit.
Thread Status: each thread can be in one of the following states:
- THREAD_IDLE (Value = 0): each thread starts in this state after reset.
- RUNNING (Value = 1): the thread is running a kernel.
- END_MODE (Value = 2): the thread switches to this mode when the issued kernel is completed.
- TRAPPED (Value = 3): the thread is in trap mode. In the current state, when a trap occurs, the thread jumps into an infinite loop.
- WAITING_BARRIER (Value = 4): the thread is waiting for a synchronization event.
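The status values listed above can be mirrored on the host side, for example when interpreting a THREAD_STATUS value. The following is a minimal sketch; the `ThreadStatus` class and the `kernel_done` helper are hypothetical names, and only the numeric values come from the list above.

```python
from enum import IntEnum


class ThreadStatus(IntEnum):
    """Thread states and their numeric values as listed above."""
    THREAD_IDLE = 0
    RUNNING = 1
    END_MODE = 2
    TRAPPED = 3
    WAITING_BARRIER = 4


def kernel_done(raw_statuses):
    """Return True when every thread reports END_MODE (hypothetical check)."""
    return all(ThreadStatus(s) is ThreadStatus.END_MODE for s in raw_statuses)


print(ThreadStatus(3).name)
print(kernel_done([2, 2, 1]))
```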
Data Types
The following table sums up the data types that can be used in the NPU core. The Type column lists the C/C++ type names, the LLVM Type column lists the type names used in LLVM, and the Register column shows the kind of register in which a value of that type is stored.
The types with an empty Notes column are those the architecture natively supports, given the width of the register files. The others are obtained through extension, so that they can be handled like the supported ones. Their advantage is a more efficient use of system memory.
Type | LLVM Type | Register | Notes |
---|---|---|---|
bool | i1 | scalar (32 bits) | It is expanded to 32 bits |
char | i8 | scalar (32 bits) | It is expanded to 32 bits |
short | i16 | scalar (32 bits) | It is expanded to 32 bits |
int | i32 | scalar (32 bits) | |
float | f32 | scalar (32 bits) | |
vec16i8, vec16u8 | v16i8 | vector (16 x 32 bits) | It is expanded to 32 bits vector |
vec16i16, vec16u16 | v16i16 | vector (16 x 32 bits) | It is expanded to 32 bits vector |
vec16i32, vec16u32 | v16i32 | vector (16 x 32 bits) | |
vec16f32 | v16f32 | vector (16 x 32 bits) | |
vec8i8, vec8u8 | v8i8 | vector (8 x 32 bits) | It is expanded to 32 bits vector |
vec8i16, vec8u16 | v8i16 | vector (8 x 32 bits) | It is expanded to 32 bits vector |
vec8i32, vec8u32 | v8i32 | vector (8 x 32 bits) | It is expanded to 32 bits vector |
vec8f32 | v8f32 | vector (16 x 32 bits) | It is considered as a 16 elements vector |
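To make the "expanded to 32 bits" notes concrete, the sketch below widens narrow elements to 32-bit lanes. It assumes that signed element types are sign-extended and unsigned ones are zero-extended, which the table does not state explicitly; the function names are illustrative only.

```python
def expand_lane(value, bits, signed):
    """Widen one `bits`-wide element to a 32-bit lane.

    Assumption: signed types are sign-extended, unsigned types zero-extended.
    """
    value &= (1 << bits) - 1
    if signed and value & (1 << (bits - 1)):
        value -= 1 << bits          # interpret as a negative number
    return value & 0xFFFFFFFF       # keep the 32-bit two's complement pattern


def expand_vector(lanes, bits, signed):
    """Widen every element of a vector, e.g. vec16i8 to 16 x 32 bits."""
    return [expand_lane(v, bits, signed) for v in lanes]


vec16i8 = [0x80] * 16               # sixteen bytes holding -128
print([hex(x) for x in expand_vector(vec16i8, 8, signed=True)][:2])
```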
Instructions Format
The NaplesPU instructions have a fixed length of 32 bits. They are grouped in six types:
- The R type includes the logical and arithmetic operations and memory operations.
- The I type includes the logical and arithmetic operations between a register operand and an immediate operand.
- The MOVEI type includes the load operations of an immediate operand in a register.
- The C type is used for control operations and for synchronization instructions.
- The J type includes jump instructions.
- The M type includes the instructions used to access memory.
R type instructions
This is the format of the R-type instruction encoded in machine code.
- RR (Register to Register) has a destination register and two source registers.
- RI (Register Immediate) has a destination register, one source register and an immediate encoded in the instruction word.
The fields of the R-type instruction are:
- opcode (B29-24) is short for "operation code". The opcode is a binary encoding for the instruction. For R-type instructions, it is only 6 bits.
- rd (B23-18) is the destination register
- rs0 (B17-12) is the first source register.
- rs1 (B11-6) is the second source register.
- bit l (B4) is used in case of "long" operations, i.e. operations that require long integers or double precision numbers. If the operation requires 64-bit registers l=1, otherwise l=0.
- bits fmt (B3-1) are used to specify if a certain operand is a scalar or a vector (one bit for every register in the format). B3 refers to register rd, B2 refers to register rs0 and B1 refers to register rs1. For instance, if the destination register should contain a vector, B3=1, otherwise B3=0.
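The field layout above can be turned into a small encoder. The sketch below packs the listed fields at the stated bit positions and leaves every bit that the list does not describe (B31-30, B5 and B0) at zero, which is an assumption. `encode_r` is a hypothetical helper, not part of any NaplesPU toolchain.

```python
def encode_r(opcode, rd, rs0, rs1, long_op=0, fmt=0):
    """Pack R-type fields into a 32-bit word.

    Bits not described in the field list are left at 0 (assumption).
    """
    fields = (
        # name, value, width in bits, position of the lowest bit
        ("opcode", opcode, 6, 24),
        ("rd", rd, 6, 18),
        ("rs0", rs0, 6, 12),
        ("rs1", rs1, 6, 6),
        ("l", long_op, 1, 4),
        ("fmt", fmt, 3, 1),
    )
    word = 0
    for name, value, width, shift in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


# Opcode 4 with rd=3, rs0=1, rs1=2: first all scalar, then all vector operands.
scalar_word = encode_r(opcode=4, rd=3, rs0=1, rs1=2)
vector_word = encode_r(opcode=4, rd=3, rs0=1, rs1=2, fmt=0b111)
print(f"{scalar_word:#010x} {vector_word:#010x}")
```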
The R-type instructions are:
Mnemonic | Opcode | Meaning | Operation |
---|---|---|---|
or | 1 | or | Rd = Ra \| Rb |
and | 2 | and | Rd = Ra & Rb |
xor | 3 | xor | Rd = Ra ^ Rb |
add | 4 | addition | Rd = Ra + Rb |
sub | 5 | subtraction | Rd = Ra – Rb |
mullo | 6 | low result of the multiplication | Rd = (Ra * Rb)[31:0] |
mulhi | 7 | high result of the multiplication | Rd = (Ra * Rb)[63:32] |
mulhu | 8 | unsigned high result of the multiplication | Rd = (Ra * Rb)[63:32], operands unsigned |
ashr | 9 | arithmetic shift right | Rd = Ra >> Rb (sign-extending) |
shr | 10 | shift right | Rd = Ra >> Rb |
shl | 11 | shift left | Rd = Ra << Rb |
clz | 12 | count leading zeros | |
ctz | 13 | count trailing zeros | |
shuffle | 24 | vector shuffle | Rd[i] = Ra[Rb[i]] |
getlane | 25 | Get lane from vector | Rd = Ra[Rb] |
move | 32 | move register | Rd = Ra |
fadd | 33 | floating point add | Rd = Ra + Rb |
fsub | 34 | floating point sub | Rd = Ra – Rb |
fmul | 35 | floating point multiplication | Rd = Ra * Rb |
fdiv | 36 | floating point division | Rd = Ra / Rb |
sext8 | 43 | sign extend 8 bits | |
sext16 | 44 | sign extend 16 bits | |
sext32 | 45 | sign extend 32 bits | |
i32tof32 | 48 | cast integer to float | |
f32toi32 | 49 | cast float to integer | |
cmpeq | 14 | compare equal | Rd = Ra == Rb |
cmpne | 15 | compare not equal | Rd = Ra != Rb |
cmpgt | 16 | compare greater than | Rd = Ra > Rb |
cmpge | 17 | compare greater or equal | Rd = Ra >= Rb |
cmplt | 18 | compare less than | Rd = Ra < Rb |
cmple | 19 | compare less or equal | Rd = Ra <= Rb |
cmpugt | 20 | unsigned compare greater than | Rd = Ra > Rb |
cmpuge | 21 | unsigned compare greater or equal | Rd = Ra >= Rb |
cmpult | 22 | unsigned compare less than | Rd = Ra < Rb |
cmpule | 23 | unsigned compare less or equal | Rd = Ra <= Rb |
cmpfeq | 37 | floating point compare equal | Rd = Ra == Rb |
cmpfne | 38 | floating point compare not equal | Rd = Ra != Rb |
cmpfgt | 39 | floating point compare greater than | Rd = Ra > Rb |
cmpfge | 40 | floating point compare greater or equal | Rd = Ra >= Rb |
cmpflt | 41 | floating point compare less than | Rd = Ra < Rb |
cmpfle | 42 | floating point compare less or equal | Rd = Ra <= Rb |
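Several rows of the table share the same Operation text even though they produce different results. The sketch below models a few of them on 32-bit values to show the difference. It rests on assumptions: mulhi is taken to be the upper 32 bits of the signed 64-bit product, mulhu the upper 32 bits of the unsigned product, and shift amounts and lane indices are used as given, without any masking the hardware may apply.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mullo(a, b):
    return (a * b) & MASK32                              # low 32 bits


def mulhi(a, b):
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK32  # signed high half


def mulhu(a, b):
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32  # unsigned high half


def ashr(a, n):
    return (to_signed(a) >> n) & MASK32                  # sign bit replicated


def shr(a, n):
    return (a & MASK32) >> n                             # zeros shifted in


def shuffle(ra, rb):
    return [ra[i] for i in rb]                           # Rd[i] = Ra[Rb[i]]


def getlane(ra, index):
    return ra[index]                                     # Rd = Ra[Rb]


print(hex(mulhi(0xFFFFFFFF, 2)), hex(mulhu(0xFFFFFFFF, 2)))
print(hex(ashr(0x80000000, 4)), hex(shr(0x80000000, 4)))
```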
I type instructions
This is the format of the I-type instruction encoded in machine code.
The fields of the I-type instruction are:
- opcode (B28-24) is short for "operation code". The opcode is a binary encoding for the instruction. For I-type instructions, it is only 5 bits.
- rd (B23-18) is the destination register
- rs (B17-12) is the first source register.
- imm (B11-3) is the 9-bit immediate.
- fmt (B2-1) bits are used to specify if a certain operand is a scalar or a vector (one bit for every register in the format). B2 refers to register rd and B1 refers to register rs.
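Going the other way, the field positions above are enough to pull an I-type word apart. The sketch below is a hypothetical decoder. It returns the immediate as a raw 9-bit value because the text does not say whether the immediate is sign-extended, and the sample word is made up for the example.

```python
def decode_i(word):
    """Split a 32-bit I-type word into its fields.

    The immediate is returned as a raw 9-bit value; whether the hardware
    sign-extends it is not specified here.
    """
    return {
        "opcode": (word >> 24) & 0x1F,   # B28-24
        "rd": (word >> 18) & 0x3F,       # B23-18
        "rs": (word >> 12) & 0x3F,       # B17-12
        "imm": (word >> 3) & 0x1FF,      # B11-3
        "fmt": (word >> 1) & 0x3,        # B2-1
    }


# Sample word built by hand: opcode 4, rd=5, rs=6, imm=100, scalar operands.
print(decode_i(0x04146320))
```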
The I-type instructions are:
Mnemonic | Opcode | Meaning | Operation |
---|---|---|---|
ori | 1 | or | Rd = Ra \| Imm |
andi | 2 | and | Rd = Ra & Imm |
xori | 3 | xor | Rd = Ra ^ Imm |
addi | 4 | addition | Rd = Ra + Imm |
subi | 5 | subtraction | Rd = Ra – Imm |
mulli | 6 | multiplication | Rd = Ra * Imm |
mulhi | 7 | high multiply | Rd = Ra * Imm |
mulhui | 8 | high multiply unsigned | Rd = Ra * Imm |
ashri | 9 | arithmetic shift right | Rd = Ra >> Imm (sign-extending) |
shri | 10 | shift right | Rd = Ra >> Imm |
shli | 11 | shift left | Rd = Ra << Imm |
getlane | 25 | Get lane from vector | Rd = Ra[Imm] |
MOVEI type instructions
MVI (Move Immediate) has a destination register and a 16-bit immediate encoded in the instruction word. This is the format of the MOVEI-type instruction encoded in machine code.
The fields of the MOVEI-type instruction are:
- opcode (B26-24) is short for "operation code". The opcode is a binary encoding for the instruction. For MOVEI-type instructions, it is only 3 bits.
- rd (B23-18) is the destination register
- imm (B17-2) is the 16-bit immediate.
- fmt (B1) is used to specify if the destination register contains a vector or a scalar.
The MOVEI-type instructions are:
Mnemonic | Opcode | Meaning | Operation |
---|---|---|---|
moveil | 0 | move the 16 least significant bits | Rd = Ra & 0xFFFF |
moveih | 1 | move the 16 most significant bits | Rd = (Ra >> 16) & 0xFFFF |
movei | 2 | move the 16 least significant bits with zero extension | Rd = (Rd ^ Rd) \| (Ra & 0xFFFF) |
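Since a MOVEI-type instruction carries only 16 bits of immediate, a full 32-bit constant has to be handled in two halves. The sketch below applies the masks from the Operation column to split a constant into its low and high 16-bit parts and checks that the two parts recombine to the original value. Which instruction sequence a compiler actually emits is not described here, so the helper names are purely illustrative.

```python
def split_constant(value):
    """Return (low16, high16) of a 32-bit constant, using the table's masks."""
    value &= 0xFFFFFFFF
    low = value & 0xFFFF            # part selected by the moveil row
    high = (value >> 16) & 0xFFFF   # part selected by the moveih row
    return low, high


def join_halves(low, high):
    """Rebuild the 32-bit constant from its two 16-bit halves."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


low, high = split_constant(0xDEADBEEF)
print(hex(low), hex(high), hex(join_halves(low, high)))
```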
C type instructions
This is the format of the C-type instruction encoded in machine code.
The fields of the C-type instruction are:
- opcode (B26-24) is short for "operation code". The opcode is a binary encoding for the instruction. For C-type instructions, it is only 3 bits.
- rs0 (B23-18) is the first source register.
- rs1 (B17-12) is the second source register.
The C-type instructions are:
Mnemonic | Opcode | Meaning |
---|---|---|
barrier_core | 0 | Memory Barrier - ensures that all explicit data memory transfers issued before the barrier are completed before any explicit data memory transaction issued after the barrier starts. Register rs0 contains the barrier identification number (BID). BID can be an arbitrary number greater than 0, i.e. BID>0. Different memory barriers require different BIDs. rs1 contains the number of threads that should synchronize. |
flush | 2 | Flush a cache line to the main memory. |
read_cr | 3 | Read a sub-register of the control register. |
write_cr | 4 | Write into a sub-register of the control register. |
dcache_inv | 5 | Invalidates the input address line in the L1 cache. |
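The roles of the two barrier_core operands (a barrier ID and the number of threads to synchronize) can be pictured with a toy model. The sketch below only models the rendezvous: threads that arrive with the same BID wait until the expected number has arrived. It does not model the memory-ordering guarantee, and the `BarrierModel` class is a hypothetical illustration, not a description of the hardware implementation.

```python
class BarrierModel:
    """Toy model of threads meeting at a barrier identified by a BID."""

    def __init__(self):
        self.waiting = {}   # BID -> set of thread IDs that have arrived

    def arrive(self, bid, thread_id, n_threads):
        """Register an arrival; return the released threads, if any."""
        if bid <= 0:
            raise ValueError("BID must be greater than 0")
        group = self.waiting.setdefault(bid, set())
        group.add(thread_id)
        if len(group) == n_threads:
            del self.waiting[bid]       # barrier complete, release everyone
            return sorted(group)
        return []                       # still waiting for other threads


model = BarrierModel()
print(model.arrive(bid=1, thread_id=0, n_threads=2))
print(model.arrive(bid=1, thread_id=1, n_threads=2))
```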
J type instructions
This is the format of the J-type instruction encoded in machine code.
The fields of the J-type instruction are:
- opcode (B26-24) is short for "operation code". The opcode is a binary encoding for the instruction. For J-type instructions, it is only 3 bits.
- rcond/rd (B23-18) is the condition/destination register.
- offset (B17-0) is the offset address.
The J-type instructions are:
Mnemonic | Opcode | Meaning | Operation |
---|---|---|---|
jmp | 0 | jump - unconditionally jump to a specified location. | PC=rd or PC=PC+offset |
jmpsr | 1 | jump to subroutine - unconditionally jump to a specified location and store the return address in the RA register. | RA=PC+4; PC=rd or RA=PC+4; PC=PC+offset |
jret | 3 | Return from Subroutine - unconditionally return from a subroutine loading the return address from the RA register. | PC=RA |
beqz | 5 | Conditional Branch, Branch if Equal to Zero - branches to PC+offset if the contents of the condition register is equal to zero. | if(rcond==0) PC=PC+offset else PC=PC+4 |
bnez | 6 | Conditional Branch, Branch if Not Equal to Zero - branches to PC+offset if the contents of the condition register is not equal to zero. | if(rcond!=0) PC=PC+offset else PC=PC+4 |
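The Operation column above can be read as a next-PC function. The sketch below covers only the PC-relative forms of jmp and jmpsr, because how an instruction selects between the register and offset forms is not described here. It takes `offset` as an already decoded Python integer, which sidesteps the unspecified signedness of the 18-bit field. The function name and return convention are assumptions made for the example.

```python
def next_pc(mnemonic, pc, offset=0, rcond=0, ra=0):
    """Return (new_pc, new_ra) for one J-type instruction.

    Only the PC-relative forms of jmp and jmpsr are modelled.
    """
    if mnemonic == "jmp":
        return pc + offset, ra
    if mnemonic == "jmpsr":
        return pc + offset, pc + 4          # RA receives the return address
    if mnemonic == "jret":
        return ra, ra                       # PC = RA
    if mnemonic == "beqz":
        return (pc + offset if rcond == 0 else pc + 4), ra
    if mnemonic == "bnez":
        return (pc + offset if rcond != 0 else pc + 4), ra
    raise ValueError(f"unknown J-type mnemonic: {mnemonic}")


pc, ra = next_pc("jmpsr", pc=0x100, offset=0x40)   # call a subroutine
pc, ra = next_pc("jret", pc=pc, ra=ra)             # return from it
print(hex(pc))
```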
M type instructions
This is the format of the M-type instruction encoded in machine code.
The fields of the M-type instruction are:
- opcode (B29-24) is short for "operation code". The opcode is a binary encoding for the instruction. For M-type instructions, it is only 6 bits.
- rd/rs (B23-18) is the destination or source register
- rbase (B17-12) is the base address register.
- offset (B11-3) is the offset address.
- bit l (B2) not used. Reserved for 64-bit extension.
- bit s (B1) is used to specify if a certain load/store memory operation goes to the scratchpad memory or not. For instance, in case of a load/store from/to the scratchpad memory, B1=1, otherwise B1=0.
The typical M type instructions are load and store instructions. In both cases, the source/destination address is calculated as base register address + immediate offset, i.e. rbase + offset. In case of a load, rd = [rbase + offset]. Similarly, in case of a store, [rbase + offset] = rs. All M type instructions can be used for memory operations on both the main memory and the scratchpad memory. Instructions that operate on the scratchpad memory have the _scratchpad suffix. E.g. load32_s8 targets the main memory, while load32_s8_scratchpad refers to a load operation for the on-chip scratchpad.
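The address computation and the scratchpad bit can be sketched as follows. The helper packs the M-type fields at the positions listed above, sets bit s for the scratchpad variant, and computes rbase + offset. Treating the 9-bit offset as an unsigned raw field and wrapping addresses at 32 bits are both assumptions, and the function names are hypothetical.

```python
def encode_m(opcode, reg, rbase, offset, scratchpad=False):
    """Pack M-type fields into a 32-bit word; bit l (B2) is left at 0."""
    if not 0 <= offset < (1 << 9):
        raise ValueError("offset must fit in 9 bits (treated as unsigned here)")
    return (
        (opcode & 0x3F) << 24       # B29-24
        | (reg & 0x3F) << 18        # B23-18, rd for loads and rs for stores
        | (rbase & 0x3F) << 12      # B17-12
        | offset << 3               # B11-3
        | int(scratchpad) << 1      # B1, set for scratchpad accesses
    )


def effective_address(rbase_value, offset):
    """rbase + offset, wrapped to 32 bits (assumption)."""
    return (rbase_value + offset) & 0xFFFFFFFF


main_mem = encode_m(opcode=2, reg=1, rbase=2, offset=4)
spm = encode_m(opcode=2, reg=1, rbase=2, offset=4, scratchpad=True)
print(f"{main_mem:#010x} {spm:#010x} {effective_address(0x1000, 4):#x}")
```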
The M-type instructions can be classified into scalar and vector instructions. The M-type instructions are:
Mnemonic | Opcode | Meaning | Operation |
---|---|---|---|
load32_s8 | 0 | load memory byte [7:0] with sign extension into a 32 bit register | Rd = [Rbase + Offset] |
load32_s16 | 1 | load memory half word [15:0] with sign extension into a 32 bit register | Rd = [Rbase + Offset] |
load32 | 2 | load memory word into a 32 bit register | Rd = [Rbase + Offset] |
load32_u8 | 4 | load memory byte [7:0] with zero extension into a 32 bit register | Rd = [Rbase + Offset] |
load32_u16 | 5 | load memory half word [15:0] with zero extension into a 32 bit register | Rd = [Rbase + Offset] |
load_v16i8 | 7 | load 16 byte [127:0] with sign extension into a 512 bit register | Rd = [Rbase + Offset] |
load_v16i16 | 8 | load 16 half word [255:0] with sign extension | Rd = [Rbase + Offset] |
load_v16i32 | 9 | load 16 words | Rd = [Rbase + Offset] |
load_v16u8 | 11 | load 16 byte [127:0] with no sign extension | Rd = [Rbase + Offset] |
load_v16u16 | 12 | load 16 half word [255:0] with no sign extension | Rd = [Rbase + Offset] |
load_v8u32 | 13 | load 8 word [255:0] with no sign extension | Rd = [Rbase + Offset] |
loadg32 | 16 | load 16 words from different memory addresses (only for scratchpad) | Rd[i] = [Rbase[i]] |
store32_8 | 32 | store 1 byte into the effective address | [Rbase + Offset] = Rs |
store32_16 | 33 | store 2 bytes into the effective address | [Rbase + Offset] = Rs |
store32 | 34 | store 1 word into the effective address | [Rbase + Offset] = Rs |
store_v16i8 | 36 | store 16 bytes from a vectorial register (data fetching from register schema [487:480,...,39:32,7:0]) into effective address location | [Rbase + Offset] = Rs |
store_v16i16 | 37 | store 16 half words (data fetching from register schema [495:480,...,47:32,15:0]) into effective address location | [Rbase + Offset] = Rs |
store_v16i32 | 38 | store 16 words from a vectorial register into effective address location | [Rbase + Offset] = Rs |
stores32 | 42 | scatter store - store 16 words into 16 different addresses (only for scratchpad) | [Rbase[i]] = Rs[i] |
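The gather and scatter rows differ from the other entries in that every lane uses its own address. A minimal sketch of that behaviour follows, modelling the scratchpad as a Python dictionary from address to 32-bit word. The memory model and the function bodies are illustrative assumptions that only mirror the Operation column.

```python
def loadg32(spm, addresses):
    """Gather: Rd[i] = [Rbase[i]] for each lane."""
    return [spm[a] for a in addresses]


def stores32(spm, addresses, values):
    """Scatter: [Rbase[i]] = Rs[i] for each lane."""
    for address, value in zip(addresses, values):
        spm[address] = value & 0xFFFFFFFF


scratchpad = {}
lane_addresses = [4 * i for i in range(16)]            # one word per lane
stores32(scratchpad, lane_addresses, list(range(100, 116)))
print(loadg32(scratchpad, lane_addresses[::-1])[:4])   # read back in reverse
```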