Register File

The NPU register file consists of a scalar register file and a vector register file, each containing 64 registers.

The scalar register file has 64 registers. The first 58 are general purpose registers, while the remaining 6 are special purpose registers. Each scalar register stores 32 bits of data.

The vector register file has 64 general purpose registers. Each vector register stores 512 bits of data, organized as 16 elements of 32 bits.

Finally, there is a Control Register composed of several sub-registers. Some sub-registers are shared among all threads, while those marked 'Thread' have a separate instance per thread.

Register	Read/Write	Shared/Thread	Description	ID
TILE_ID	Read	Shared	Tile ID	0
CORE_ID	Read	Shared	Core ID	1
THREAD_ID	Read	Thread	Thread ID	2
GLOBAL_ID	Read	Thread	Global ID, obtained by merging the previous IDs as follows: TILE_ID, CORE_ID, THREAD_ID	3
GCOUNTER_LOW	Read	Shared	Low part of the Global counter register which counts processor cycles since reset	4
GCOUNTER_HIGH	Read	Shared	High part of the Global counter register which counts processor cycles since reset	5
THREAD_EN	Read	Shared	Thread enabled mask, 1 bit per thread	6
MISS_DATA	Read	Shared	Count of L1 Data cache misses	7
MISS_INSTR	Read	Shared	Count of L1 Instruction cache misses	8
PC	Read	Thread	Current PC	9
TRAP_REASON	Read	Thread	Trap Cause (see below)	10
THREAD_STATUS	Read/Write	Thread	Thread Status (see below)	11
ARGC	Read/Write	Shared	The number of strings pointed to by argv	12
ARGV	Read/Write	Shared	The address of command line arguments passed to main()	13
THREAD_NUMB	Read	Shared	The number of total hardware threads	14
THREAD_MISS_CC	Read	Thread	The per-thread count of clock cycles during which the thread is idle due to memory operations.	15
KERNEL_WORK	Read	Thread	The per-thread kernel clock cycles.	16
CPU_CTRL_REG	Read/Write	Shared	CPU mode register. At the moment only write policy used by the cache controller is implemented. 0 for write-back, 1 for write-through	17
UNCOHERENCE_MAP	Read/Write	Shared	Addresses the non-coherent table in the control register, which stores information about the non-coherent memory regions. The user can define non-coherent regions through this special purpose register.	19
DEBUG_BASE_ADDR	Read/Write	Shared	Debug registers base address. The NPU is equipped with 16 debug registers. DEBUG_BASE_ADDR fetches the value of the first debug register, DEBUG_BASE_ADDR+1 the second and so on.	20
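
As an illustration of how the IDs in the table above might be used by host-side tooling, the following Python sketch mirrors the table as an enumeration. The helper names `debug_reg_id` and `cycle_count` are hypothetical, and the sketch assumes that GCOUNTER_LOW and GCOUNTER_HIGH are each 32 bits wide (matching the scalar register width), which the table does not state explicitly.

```python
from enum import IntEnum


class ControlReg(IntEnum):
    """Sub-register IDs, copied from the table above (ID 18 is not listed)."""
    TILE_ID = 0
    CORE_ID = 1
    THREAD_ID = 2
    GLOBAL_ID = 3
    GCOUNTER_LOW = 4
    GCOUNTER_HIGH = 5
    THREAD_EN = 6
    MISS_DATA = 7
    MISS_INSTR = 8
    PC = 9
    TRAP_REASON = 10
    THREAD_STATUS = 11
    ARGC = 12
    ARGV = 13
    THREAD_NUMB = 14
    THREAD_MISS_CC = 15
    KERNEL_WORK = 16
    CPU_CTRL_REG = 17
    UNCOHERENCE_MAP = 19
    DEBUG_BASE_ADDR = 20


NUM_DEBUG_REGS = 16  # "The NPU is equipped with 16 debug registers"


def debug_reg_id(n):
    """ID of the n-th debug register: DEBUG_BASE_ADDR + n (hypothetical helper)."""
    if not 0 <= n < NUM_DEBUG_REGS:
        raise ValueError("debug register index out of range")
    return int(ControlReg.DEBUG_BASE_ADDR) + n


def cycle_count(low, high):
    """Combine the two counter halves; assumes each half is 32 bits wide."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)
```

Reading the two counter halves is not atomic, so a real tool would probably need to guard against the low half wrapping between the two reads; that is outside the scope of this sketch.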

Trap Cause: currently, only misaligned memory accesses can raise a trap:

SPM_ADDR_MISALIGN: Misaligned memory access in the SPM unit.
LDST_ADDR_MISALIGN: Misaligned memory access in the LDST unit.

Thread Status: each thread can be in one of the following states:

THREAD_IDLE (Value = 0): each thread starts in this state after reset.
RUNNING (Value = 1): the thread is running a kernel.
END_MODE (Value = 2): the thread switches to this state when the issued kernel is completed.
TRAPPED (Value = 3): the thread is in trap mode. Currently, when a trap occurs, the thread jumps into an infinite loop.
WAITING_BARRIER (Value = 4): the thread is waiting for a synchronization event.
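
The state values listed above can be decoded with a small enumeration. This is only a sketch for tooling that reads THREAD_STATUS; the function name `describe_status` is hypothetical, and the handling of unlisted values is an assumption.

```python
from enum import IntEnum


class ThreadStatus(IntEnum):
    """Thread states and values as listed above."""
    THREAD_IDLE = 0
    RUNNING = 1
    END_MODE = 2
    TRAPPED = 3
    WAITING_BARRIER = 4


def describe_status(raw):
    """Return the state name for a raw THREAD_STATUS value (hypothetical helper)."""
    try:
        return ThreadStatus(raw).name
    except ValueError:
        # Values outside the documented list are reported, not guessed.
        return f"UNKNOWN({raw})"
```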

Data Types

The following table summarizes the data types available in the NPU core. The Type column lists the C/C++ type names, the LLVM Type column lists the type names used in LLVM, and the Register column shows the kind of register in which a value of that type is stored.

The highlighted types are those the architecture natively supports, given the width of the register files. The other types are extended to a native type when stored in a register, so they can be handled like the supported ones. Their advantage is a more efficient use of system memory.

Type	LLVM Type	Register	Notes
bool	i1	scalar (32 bits)	It is expanded to 32 bits
char	i8	scalar (32 bits)	It is expanded to 32 bits
short	i16	scalar (32 bits)	It is expanded to 32 bits
int	i32	scalar (32 bits)
float	f32	scalar (32 bits)
vec16i8, vec16u8	v16i8	vector (16 x 32 bits)	It is expanded to 32 bits vector
vec16i16, vec16u16	v16i16	vector (16 x 32 bits)	It is expanded to 32 bits vector
vec16i32, vec16u32	v16i32	vector (16 x 32 bits)
vec16f32	v16f32	vector (16 x 32 bits)
vec8i8, vec8u8	v8i8	vector (8 x 32 bits)	It is expanded to 32 bits vector
vec8i16, vec8u16	v8i16	vector (8 x 32 bits)	It is expanded to 32 bits vector
vec8i32, vec8u32	v8i32	vector (8 x 32 bits)	It is expanded to 32 bits vector
vec8f32	v8f32	vector (16 x 32 bits)	It is considered as a 16 elements vector
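
The expansion mentioned in the Notes column can be pictured with the following Python sketch. It assumes that signed types are sign-extended and unsigned types are zero-extended to 32 bits per element, which matches the load instructions described later on this page; the function names are hypothetical.

```python
def sign_extend(value, bits):
    """Sign-extend a `bits`-wide value to a 32-bit pattern."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & 0xFFFFFFFF


def zero_extend(value, bits):
    """Zero-extend a `bits`-wide value to 32 bits."""
    return value & ((1 << bits) - 1)


def expand_vec16i8(raw):
    """Expand 16 packed signed bytes into 16 lanes of 32 bits each."""
    if len(raw) != 16:
        raise ValueError("a vec16i8 value holds exactly 16 bytes")
    return [sign_extend(b, 8) for b in raw]
```

In memory a vec16i8 value occupies 16 bytes, while in the register file it occupies a full 512-bit vector register, which is where the memory saving comes from.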

Instructions Format

The NaplesPU instructions have a fixed length of 32 bits. They are grouped in six types:

The R type includes the logical and arithmetic operations and memory operations.

The I type includes the logical and arithmetic operations between a register operand and an immediate operand.

The MOVEI type includes the load operations of an immediate operand in a register.

The C type is used for control operations and for synchronization instructions.

The J type includes jump instructions.

The M type includes the instructions used to access memory.

R type instructions

This is the format of the R-type instruction encoded in machine code.

RR (Register to Register) has a destination register and two source registers.
RI (Register Immediate) has a destination register, one source register and an immediate encoded in the instruction word.

The fields of the R-type instruction are:

opcode (B29-24) is short for "operation code". The opcode is a binary encoding for the instruction. For R-type instructions, it is only 6 bits.
rd (B23-18) is the destination register.
rs0 (B17-12) is the first source register.
rs1 (B11-6) is the second source register.
bit l (B4) is used in case of "long" operations, i.e. operations that require long integers or double precision numbers. If the operation requires 64-bit registers l=1, otherwise l=0.
bits fmt (B3-1) specify whether each operand is a scalar or a vector (one bit for every register in the format). B3 refers to rd, B2 refers to rs0 and B1 refers to rs1. For instance, if the destination register should contain a vector, B3=1, otherwise B3=0.
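
The field layout above can be exercised with a short Python sketch that packs and unpacks an R-type word. Only the documented fields are handled: bits 31-30, 5 and 0 are not described in this section and are left at zero here, which is an assumption. The function names `encode_r` and `decode_r` are hypothetical.

```python
def encode_r(opcode, rd, rs0, rs1, long_op=False, fmt=0):
    """Pack the documented R-type fields into a 32-bit word."""
    fields = (("opcode", opcode, 6), ("rd", rd, 6), ("rs0", rs0, 6),
              ("rs1", rs1, 6), ("fmt", fmt, 3))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    return ((opcode << 24) | (rd << 18) | (rs0 << 12) | (rs1 << 6)
            | (int(long_op) << 4) | (fmt << 1))


def decode_r(word):
    """Extract the documented R-type fields from a 32-bit word."""
    return {
        "opcode": (word >> 24) & 0x3F,  # B29-24
        "rd": (word >> 18) & 0x3F,      # B23-18
        "rs0": (word >> 12) & 0x3F,     # B17-12
        "rs1": (word >> 6) & 0x3F,      # B11-6
        "l": (word >> 4) & 0x1,         # B4
        "fmt": (word >> 1) & 0x7,       # B3-1
    }
```

For example, `encode_r(4, 1, 2, 3)` packs an `add` (opcode 4) with rd=1, rs0=2, rs1=3 and all operands scalar; passing `fmt=0b111` would mark all three operands as vectors.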

The R-type instructions are:

Mnemonic	Opcode	Meaning	Operation
or	1	or	Rd = Ra | Rb
and	2	and	Rd = Ra & Rb
xor	3	xor	Rd = Ra ^ Rb
add	4	addition	Rd = Ra + Rb
sub	5	subtraction	Rd = Ra - Rb
mullo	6	low result of the multiplication	Rd = low half of (Ra * Rb)
mulhi	7	high result of the multiplication	Rd = high half of (Ra * Rb)
mulhu	8	unsigned high result of the multiplication	Rd = high half of (Ra * Rb), unsigned
ashr	9	arithmetic shift right	Rd = Ra >> Rb (sign-extending)
shr	10	shift right	Rd = Ra >> Rb
shl	11	shift left	Rd = Ra << Rb
clz	12	count leading zeros
ctz	13	count trailing zeros
shuffle	24	vector shuffle	Rd[i] = Ra[Rb[i]]
getlane	25	Get lane from vector	Rd = Ra[Rb]
move	32	move register	Rd = Ra
fadd	33	floating point add	Rd = Ra + Rb
fsub	34	floating point sub	Rd = Ra - Rb
fmul	35	floating point multiplication	Rd = Ra * Rb
fdiv	36	floating point division	Rd = Ra / Rb
sext8	43	sign extend 8 bits
sext16	44	sign extend 16 bits
sext32	45	sign extend 32 bits
i32tof32	48	cast integer to float
f32toi32	49	cast float to integer
cmpeq	14	compare equal	Rd = Ra == Rb
cmpne	15	compare not equal	Rd = Ra != Rb
cmpgt	16	compare greater than	Rd = Ra > Rb
cmpge	17	compare greater or equal	Rd = Ra >= Rb
cmplt	18	compare less than	Rd = Ra < Rb
cmple	19	compare less or equal	Rd = Ra <= Rb
cmpugt	20	unsigned compare greater than	Rd = Ra > Rb
cmpuge	21	unsigned compare greater or equal	Rd = Ra >= Rb
cmpult	22	unsigned compare less than	Rd = Ra < Rb
cmpule	23	unsigned compare less or equal	Rd = Ra <= Rb
cmpfeq	37	floating point compare equal	Rd = Ra == Rb
cmpfne	38	floating point compare not equal	Rd = Ra != Rb
cmpfgt	39	floating point compare greater than	Rd = Ra > Rb
cmpfge	40	floating point compare greater or equal	Rd = Ra >= Rb
cmpflt	41	floating point compare less than	Rd = Ra < Rb
cmpfle	42	floating point compare less or equal	Rd = Ra <= Rb
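
Several rows of the table share the same Operation text even though their results differ, for instance the three multiplications and the two right shifts. The Python reference model below sketches one plausible reading for 32-bit scalar operands. The behaviour for shift amounts of 32 or more and the exact hardware semantics are not specified on this page, so this is an assumption-laden illustration and not the hardware definition.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mullo(a, b):
    """Low 32 bits of the product (same bits for signed and unsigned)."""
    return (a * b) & MASK32


def mulhi(a, b):
    """High 32 bits of the signed 64-bit product."""
    return ((to_signed(a) * to_signed(b)) >> 32) & MASK32


def mulhu(a, b):
    """High 32 bits of the unsigned 64-bit product."""
    return (((a & MASK32) * (b & MASK32)) >> 32) & MASK32


def ashr(a, n):
    """Arithmetic shift right: the sign bit is replicated."""
    return (to_signed(a) >> n) & MASK32


def shr(a, n):
    """Logical shift right: zeros are shifted in."""
    return (a & MASK32) >> n


def shuffle(ra, rb):
    """Vector shuffle: Rd[i] = Ra[Rb[i]]."""
    return [ra[i] for i in rb]


def getlane(ra, lane):
    """Scalar result taken from one vector lane: Rd = Ra[Rb]."""
    return ra[lane]
```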

I type instructions

This is the format of the I-type instruction encoded in machine code.

The fields of the I-type instruction are:

opcode (B28-24) is short for "operation code". The opcode is a binary encoding for the instruction. For I-type instructions, it is only 5 bits.
rd (B23-18) is the destination register.
rs (B17-12) is the source register.
imm (B11-3) is the 9-bit immediate.
fmt (B2-1) bits specify whether each operand is a scalar or a vector (one bit for every register in the format). B2 refers to rd and B1 refers to rs.
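
A matching Python sketch for the I-type layout is given below. Whether the 9-bit immediate is interpreted as signed or unsigned is not stated here, so the encoder simply accepts any value that fits in 9 bits in either interpretation and stores its low 9 bits; that choice, the zeroing of undocumented bits, and the function names are all assumptions.

```python
def encode_i(opcode, rd, rs, imm, fmt=0):
    """Pack the documented I-type fields into a 32-bit word."""
    for name, value, width in (("opcode", opcode, 5), ("rd", rd, 6),
                               ("rs", rs, 6), ("fmt", fmt, 2)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    if not -256 <= imm <= 511:
        raise ValueError("imm does not fit in 9 bits")
    return ((opcode << 24) | (rd << 18) | (rs << 12)
            | ((imm & 0x1FF) << 3) | (fmt << 1))


def decode_i(word):
    """Extract the documented I-type fields from a 32-bit word."""
    return {
        "opcode": (word >> 24) & 0x1F,  # B28-24
        "rd": (word >> 18) & 0x3F,      # B23-18
        "rs": (word >> 12) & 0x3F,      # B17-12
        "imm": (word >> 3) & 0x1FF,     # B11-3, raw 9-bit pattern
        "fmt": (word >> 1) & 0x3,       # B2-1
    }
```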

The I-type instructions are:

Mnemonic	Opcode	Meaning	Operation
ori	1	or	Rd = Ra | Imm
andi	2	and	Rd = Ra & Imm
xori	3	xor	Rd = Ra ^ Imm
addi	4	addition	Rd = Ra + Imm
subi	5	subtraction	Rd = Ra - Imm
mulli	6	multiplication	Rd = Ra * Imm
mulhi	7	high multiply	Rd = Ra * Imm
mulhui	8	high multiply unsigned	Rd = Ra * Imm
ashri	9	arithmetic shift right	Rd = Ra >> Imm (sign-extending)
shri	10	shift right	Rd = Ra >> Imm
shli	11	shift left	Rd = Ra << Imm
getlane	25	Get lane from vector	Rd = Ra[Imm]

MOVEI type instructions

MVI (Move Immediate) has a destination register and a 16-bit immediate encoded in the instruction word. This is the format of the MOVEI-type instruction encoded in machine code.

The fields of the MOVEI-type instruction are:

opcode (B26-24) is short for "operation code". The opcode is a binary encoding for the instruction. For MOVEI-type instructions, it is only 3 bits.
rd (B23-18) is the destination register.
imm (B17-2) is the 16-bit immediate.
fmt (B1) is used to specify if the destination register contains a vector or a scalar.

The MOVEI-type instructions are:

Mnemonic	Opcode	Meaning	Operation
moveil	0	move the 16 least significant bits	Rd = Ra & 0xFFFF
moveih	1	move the 16 most significant bits	Rd = (Ra >> 16) & 0xFFFF
movei	2	move the 16 least significant bits with zero extension	Rd = (Rd ^ Rd) | (Ra & 0xFFFF)
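
Because a MOVEI-type instruction carries only 16 bits of immediate, a full 32-bit constant has to be supplied in two halves. The Python sketch below packs the documented fields and splits a constant into the halves that a high/low pair of moves would carry. The function names are hypothetical, and how an assembler actually pairs `moveih` and `moveil` is an assumption, not something this page specifies.

```python
def encode_movei(opcode, rd, imm16, vector=False):
    """Pack the documented MOVEI-type fields into a 32-bit word."""
    if not 0 <= opcode < 8:
        raise ValueError("opcode does not fit in 3 bits")
    if not 0 <= rd < 64:
        raise ValueError("rd does not fit in 6 bits")
    if not 0 <= imm16 <= 0xFFFF:
        raise ValueError("imm does not fit in 16 bits")
    # opcode: B26-24, rd: B23-18, imm: B17-2, fmt: B1
    return (opcode << 24) | (rd << 18) | (imm16 << 2) | (int(vector) << 1)


def split_constant(value):
    """Split a 32-bit constant into its (high, low) 16-bit halves."""
    value &= 0xFFFFFFFF
    return value >> 16, value & 0xFFFF
```

For instance, `split_constant(0xDEADBEEF)` yields the halves `0xDEAD` and `0xBEEF`, each of which fits the imm field.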

C type instructions

Mnemonic	Opcode	Meaning
barrier_core	0	Memory Barrier - ensures that all explicit data memory transfers issued before the barrier complete before any explicit data memory transaction issued after the barrier starts. Register rs0 contains the barrier identification number (BID), which can be an arbitrary number greater than 0. Different memory barriers require different BIDs. rs1 contains the number of threads that should synchronize.
flush	2	Flush a cache line to the main memory.
read_cr	3	Read a sub-register of the control register.
write_cr	4	Write into a sub-register of the control register.
dcache_inv	5	Invalidates the L1 cache line that contains the input address.

J type instructions

Mnemonic	Opcode	Meaning	Operation
jmp	0	jump - unconditionally jump to a specified location.	PC=rd or PC=PC+offset
jmpsr	1	jump to subroutine - unconditionally jump to a specified location and store the return address in the RA register.	RA=PC+4, PC=rd or RA=PC+4, PC=PC+addr
jret	3	Return from Subroutine - unconditionally return from a subroutine loading the return address from the RA register.	PC=RA
beqz	5	Conditional Branch, Branch if Equal to Zero - branches to PC+offset if the content of the condition register is equal to zero.	if(rcond==0) PC=PC+offset else PC=PC+4
bnez	6	Conditional Branch, Branch if Not Equal to Zero - branches to PC+offset if the content of the condition register is not equal to zero.	if(rcond!=0) PC=PC+offset else PC=PC+4
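
The Operation column of the J-type table translates directly into next-PC functions. The Python sketch below models only the PC-relative forms, assuming a 32-bit PC that wraps around and an offset already expressed in bytes; both are assumptions, and the function names simply reuse the mnemonics.

```python
MASK32 = 0xFFFFFFFF
INSTR_BYTES = 4  # fixed 32-bit instruction length


def jmp(pc, offset):
    """PC = PC + offset."""
    return (pc + offset) & MASK32


def jmpsr(pc, offset):
    """RA = PC + 4, PC = PC + offset; returns (ra, new_pc)."""
    return (pc + INSTR_BYTES) & MASK32, (pc + offset) & MASK32


def beqz(pc, rcond, offset):
    """Branch to PC + offset if rcond == 0, else fall through to PC + 4."""
    return (pc + offset) & MASK32 if rcond == 0 else (pc + INSTR_BYTES) & MASK32


def bnez(pc, rcond, offset):
    """Branch to PC + offset if rcond != 0, else fall through to PC + 4."""
    return (pc + offset) & MASK32 if rcond != 0 else (pc + INSTR_BYTES) & MASK32
```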

M type instructions

MEM (Memory Instruction) has a destination/source field: for a load, the first register is the destination register; for a store, the first register contains the value to store. In both cases, the base address register and the immediate follow. The sum of the base address and the immediate gives the effective memory address.

Mnemonic	Opcode	Meaning	Operation
loadXD_s8	0	load 1 byte with sign extension	Rd = [Rbase + Offset]
loadXD_s16	1	load 2 bytes with sign extension	Rd = [Rbase + Offset]
load32D	2	load 1 word	Rd = [Rbase + Offset]
loadXD_u8	4	load 1 byte with zero extension	Rd = [Rbase + Offset]
loadXD_u16	5	load 2 bytes with zero extension	Rd = [Rbase + Offset]
loadD_vYi8	7	load a vector of Y bytes with sign extension	Rd = [Rbase + Offset]
loadD_vYi16	8	load a vector of Y 2-byte elements with sign extension	Rd = [Rbase + Offset]
loadD_vYi32	9	load a vector of Y words with sign extension	Rd = [Rbase + Offset]
loadD_vYu8	11	load a vector of Y bytes with zero extension	Rd = [Rbase + Offset]
loadD_vYu16	12	load a vector of Y 2-byte elements with zero extension	Rd = [Rbase + Offset]
loadD_vYu32	13	load a vector of Y words with zero extension	Rd = [Rbase + Offset]
loadD_g_32	16	load 16 words from different memory addresses	Rd[i] = [Rbase[i]]
storeXD_8	32	store 1 byte	[Rbase + Offset] = Rs
storeXD_16	33	store 2 bytes	[Rbase + Offset] = Rs
store32D	34	store 1 word	[Rbase + Offset] = Rs
storeD_vYi8	32	store Y bytes	[Rbase + Offset] = Rs
storeD_vYi16	33	store Y 2-byte elements	[Rbase + Offset] = Rs
storeD_vYi32	34	store Y words	[Rbase + Offset] = Rs
storeD_s_32	42	store 16 words to different memory addresses	[Rbase[i]] = Rs[i]
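
The addressing rule and the difference between sign-extending, zero-extending, gather and scatter accesses can be sketched in Python over a flat byte array. Little-endian byte order is an assumption made for this illustration, since this page does not state the endianness, and the function names are hypothetical.

```python
def load(mem, base, offset, size, signed):
    """Rd = [Rbase + Offset], extended to a 32-bit pattern."""
    addr = base + offset  # effective address
    if addr < 0 or addr + size > len(mem):
        raise IndexError("access outside the modelled memory")
    value = int.from_bytes(mem[addr:addr + size], "little", signed=signed)
    return value & 0xFFFFFFFF


def store(mem, base, offset, size, value):
    """[Rbase + Offset] = Rs, truncated to `size` bytes."""
    addr = base + offset
    if addr < 0 or addr + size > len(mem):
        raise IndexError("access outside the modelled memory")
    truncated = value & ((1 << (8 * size)) - 1)
    mem[addr:addr + size] = truncated.to_bytes(size, "little")


def gather32(mem, addrs):
    """Rd[i] = [Rbase[i]]: one word per lane, each from its own address."""
    return [load(mem, a, 0, 4, False) for a in addrs]


def scatter32(mem, addrs, values):
    """[Rbase[i]] = Rs[i]: one word per lane, each to its own address."""
    for a, v in zip(addrs, values):
        store(mem, a, 0, 4, v)
```

With `mem = bytearray(64)`, storing the byte `0x80` and loading it back gives `0xFFFFFF80` with sign extension and `0x80` with zero extension, which is the only difference between the `_s8` and `_u8` load variants.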
