Extending NaplesPU
NaplesPU SystemVerilog coding guidelines
These are simple guidelines for extending the NaplesPU architecture.
1. a module's output signal names always start with the module's mnemonic name (e.g. writeback's signals -> wb_xxx).
2. testbench file names start with tb_.
3. add a folder for each self-contained module, and put in the main folder the "common" modules that are used throughout the project.
4. use parentheses around arithmetic operations in new defines.
5. use structs or typedefs instead of defines when subparts of a signal are often accessed.
6. use a divide et impera (divide and conquer) philosophy to improve reusability and comprehensibility.
7. use existing signal typedefs; if you introduce new structures and typedefs, place them in a dedicated header file for that component in the include folder (e.g. writeback unit -> writeback_defines.sv).
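Guideline 4 matters because defines are expanded by plain text substitution. The sketch below illustrates the effect with the C++ preprocessor, whose #define behaves like a SystemVerilog `define in this respect. The macro names are hypothetical and are not taken from the NaplesPU sources.

```cpp
#include <cassert>

// Hypothetical macros, for illustration only (not NaplesPU definitions).
#define LANE_BITS_UNSAFE 16 + 16     // no parentheses: expands as raw text
#define LANE_BITS_SAFE   (16 + 16)   // parenthesized: expands as a single operand

// Total width of two lanes, computed with each macro.
constexpr int unsafe_total() { return 2 * LANE_BITS_UNSAFE; }  // expands to 2 * 16 + 16
constexpr int safe_total()   { return 2 * LANE_BITS_SAFE; }    // expands to 2 * (16 + 16)

static_assert(unsafe_total() == 48, "operator precedence alters the unparenthesized define");
static_assert(safe_total() == 64, "the parenthesized define yields the intended value");
```

The same reasoning applies to any define that is later used inside a larger expression, such as a bit-width computation.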
Adding a custom instruction to the NaplesPU core
This section describes how to add a new functional operation by extending the instruction set and adding a custom component to the NaplesPU pipeline.
Defining a new instruction
The first step is to add a new instruction to the NaplesPU ISA, starting with the instruction format. E.g., a new arithmetic operation ought to be part of the R type instructions, while a new memory access instruction has to be part of the M type instructions. In the following example, we introduce a new arithmetic operation, called crp.
Extending compiler support
Extending the compiler support for a custom instruction involves two major steps:
- Adding a new intrinsic;
- Adding a new instruction.
Adding a new intrinsic
Adding a new intrinsic involves three different files across the front-end and the back-end of the compiler.
On the front-end side, Clang has to recognize the new intrinsic. This is accomplished by adding the following line to the "compiler/tools/clang/include/clang/Basic/BuiltinsNaplesPU.def" file of the toolchain repo:
    //------ Cross Product ----------//
    BUILTIN(__builtin_npu_crossprodv16i32, "V16iV16iV16i", "n")
Such a macro defines the signature of the intrinsic:
- __builtin_npu_crossprodv16i32 - name
- V16iV16iV16i - output (return) type followed by the input types
- n - optional attributes
For further information, please refer to file "compiler/tools/clang/include/clang/Basic/Builtins.def" in the toolchain repo.
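To make the type string easier to read, here is a minimal sketch of a decoder for the small subset of the encoding used in this section: V followed by an element count introduces a vector, i stands for int, and the first decoded type is the return type. The function decode_builtin_types is a hypothetical helper written for this guide; it is not part of Clang and handles no other encoding letters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Decodes type strings made only of 'i' (int) and 'V<count>i' (vector of int).
// Returns one readable description per encoded type: entry 0 is the return
// type, the remaining entries are the argument types.
// Returns an empty vector if an unsupported character is found.
std::vector<std::string> decode_builtin_types(const std::string& sig) {
    std::vector<std::string> types;
    std::size_t pos = 0;
    while (pos < sig.size()) {
        std::string prefix;
        if (sig[pos] == 'V') {  // vector prefix, followed by the element count
            ++pos;
            std::string count;
            while (pos < sig.size() &&
                   std::isdigit(static_cast<unsigned char>(sig[pos]))) {
                count += sig[pos++];
            }
            prefix = "vector of " + count + " x ";
        }
        if (pos >= sig.size() || sig[pos] != 'i') {
            return {};  // only the 'i' base type is handled in this sketch
        }
        types.push_back(prefix + "int");
        ++pos;
    }
    return types;
}
```

For "V16iV16iV16i" the decoder yields three entries, each "vector of 16 x int": one return value and two arguments, which matches the two-operand crp instruction introduced in this guide.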
Then, in "compiler/tools/clang/lib/CodeGen/CGBuiltin.cpp" we extend the EmitNPUBuiltinExpr method by adding a new case to the switch construct, as follows:
    // Cross Product
    case NaplesPU::BI__builtin_npu_crossprodv16i32:
        F = CGM.getIntrinsic(Intrinsic::npu_crossprodv16i32);
        break;
The keyword BI__builtin_npu_crossprodv16i32 has to be consistent with the builtin added in the BuiltinsNaplesPU.def file: it is the builtin name preceded by BI.
Finally, we extend the compiler/include/llvm/IR/IntrinsicsNaplesPU.td file on the back-end side as follows:
    // Cross Product Intrinsic
    def int_npu_crossprodv16i32 : Intrinsic<[llvm_v16i32_ty],
                                            [llvm_v16i32_ty, llvm_v16i32_ty],
                                            [IntrNoMem],
                                            "llvm.npu.__builtin_npu_crossprodv16i32">;
This TableGen code adds the new intrinsic in Clang and generates the corresponding AST node.
It defines an instance (int_npu_crossprodv16i32) of the TableGen Intrinsic class. The class takes, in order, the output and input types (llvm_v16i32_ty), any attributes (IntrNoMem), and the IR name string ("llvm.npu.__builtin_npu_crossprodv16i32"), which must contain the same name as the builtin defined in BuiltinsNaplesPU.def, placed after "llvm.npu.".
Adding a new instruction
To add a new instruction to the ISA on the back-end side of the compiler, we extend the compiler/lib/Target/NaplesPU/NaplesPUInstrInfo.td file in the toolchain repo. Such an extension requires the TableGen classes defined in the compiler/lib/Target/NaplesPU/NaplesPUInstrFormats.td file. In particular, the class used for the crp instruction is FR_TwoOp_Unmasked_32, which defines the instruction as an R type with two input operands (FR_TwoOp), both vectorial, and with no mask (Unmasked_32):
    // Cross Product Instruction
    def CROSSPROD_32 : FR_TwoOp_Unmasked_32<
        (outs VR512W:$dst),                  // output
        (ins VR512W:$src0, VR512W:$src1),    // input
        "crp $dst, $src0, $src1",            // corresponding assembly code
        [(set v16i32:$dst, (int_npu_crossprodv16i32 v16i32:$src0, v16i32:$src1))], // matching pattern
        63,                                  // ISA opcode (unique for instruction)
        Fmt_V,                               // destination register format
        Fmt_V,                               // src0 register format
        Fmt_V>;                              // src1 register format
The VR512W attribute declares the operand registers as vectors of 16 elements of 32 bits each. Correspondingly, the Fmt_V attribute sets the FMT field of the instruction encoding to the vector format. For our custom module, we chose 63 as the opcode, referred to as MY_OPCODE in the rest of the text.
Extending NPU core pipeline
Extending NPU core pipeline with a new operator in the execution stage involves the following steps:
- defining new module and its interface;
- extending NPU Decode stage;
- extending NPU Writeback stage;
- adding the module in the NPU pipeline.
Modules that are not fully pipelined also require an extension of the Instruction_Buffer module.
Custom unit interface
An example interface follows:
    `include "npu_user_defines.sv"
    `include "npu_defines.sv"

    module my_pipe (
        input                        clk,
        input                        reset,
        input                        enable,   // driven by nfreeze in the core instantiation

        // To Instruction buffer
        output thread_mask_t         my_stop,

        // From Operand Fetch
        input                        opf_valid,
        input  instruction_decoded_t opf_inst_scheduled,
        input  vec_reg_size_t        opf_fecthed_op0,
        input  vec_reg_size_t        opf_fecthed_op1,
        input  hw_lane_mask_t        opf_hw_lane_mask,

        // To Writeback
        output logic                 my_valid,
        output instruction_decoded_t my_inst_scheduled,
        output vec_reg_size_t        my_result,
        output hw_lane_mask_t        my_hw_lane_mask
    );
If the new module cannot accept a request every clock cycle, it has to provide a stop condition and forward it to the Instruction_Buffer module, in order to prevent further instructions from being issued to the custom module. In the above example, this is done by the my_stop signal, which has to be high whenever the module cannot handle further requests. Then add my_stop to the Instruction_Buffer module as follows:
    assign ib_instructions_valid[thread_id] = ~fifo_empty
        & ~( l1d_full[thread_id] & ib_instructions[thread_id].pipe_sel == PIPE_MEM )
        & enable
        & ~( my_stop & ib_instructions[thread_id].op_code == MY_OPCODE );
The last step is not required if the custom module can handle a request per clock cycle.
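The gating condition can be checked in isolation with a small behavioral model. The following is a sketch for a single thread, with every signal reduced to a boolean; the Pipe enum and the PendingInstruction struct are hypothetical stand-ins for the SystemVerilog types, and their numeric values are illustrative only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the SystemVerilog types (illustrative values).
enum class Pipe { INT, FP, MEM, SPM, NEW };
constexpr std::uint8_t MY_OPCODE = 63;  // opcode chosen for crp in this guide

struct PendingInstruction {
    Pipe pipe_sel;
    std::uint8_t op_code;
};

// Mirrors the ib_instructions_valid expression for one thread: an instruction
// can be issued only if the FIFO holds one, the buffer is enabled, and neither
// the L1 data cache nor the custom unit is stalling that kind of instruction.
bool instruction_valid(bool fifo_empty, bool l1d_full, bool enable, bool my_stop,
                       const PendingInstruction& inst) {
    const bool mem_stall = l1d_full && inst.pipe_sel == Pipe::MEM;
    const bool custom_stall = my_stop && inst.op_code == MY_OPCODE;
    return !fifo_empty && !mem_stall && enable && !custom_stall;
}
```

Note that a raised my_stop only blocks instructions whose opcode is MY_OPCODE; instructions directed to the other pipes keep being issued.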
Input signals are generated by the Operand_Fetch module:
- opf_valid, the incoming request is valid.
- instruction_decoded_t opf_inst_scheduled, the currently decoded instruction. The module has to check the op_code: if it is equal to the new opcode (MY_OPCODE), the issued operation has to be processed. An instruction can be either scalar or vectorial. This information is stored in the instruction_decoded_t fields, where each register has a dedicated bit, namely is_source0_vectorial, is_source1_vectorial, and is_destination_vectorial.
- vec_reg_size_t opf_fecthed_op0, vector of registers in input.
- vec_reg_size_t opf_fecthed_op1, vector of registers in input.
- hw_lane_mask_t opf_hw_lane_mask, hardware lane bitmask; the i-th bit states that the i-th element in the vector has to be processed.
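The role of the lane bitmask can be sketched with a behavioral model of a two-operand vector unit. This is only an illustration under stated assumptions: the vector has 16 lanes of 32 bits (as in v16i32), inactive lanes are returned as zero (the guide does not say what the hardware does with them), and lane_add is a placeholder operation, since the semantics of crp are not specified here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;  // 16 lanes of 32 bits, as in v16i32
using Vec = std::array<std::int32_t, kLanes>;

// Placeholder per-lane operation (hypothetical; not the crp semantics).
inline std::int32_t lane_add(std::int32_t x, std::int32_t y) { return x + y; }

// Applies op only to the lanes whose bit is set in lane_mask.
// Assumption of this sketch: lanes with a cleared mask bit are left at zero.
template <typename Op>
Vec apply_masked(const Vec& op0, const Vec& op1, std::uint16_t lane_mask, Op op) {
    Vec result{};  // value-initialized: every lane starts at zero
    for (std::size_t i = 0; i < kLanes; ++i) {
        if ((lane_mask >> i) & 1u) {
            result[i] = op(op0[i], op1[i]);
        }
    }
    return result;
}
```

With a mask of 0x0005, for instance, only lanes 0 and 2 are computed, while all the other lanes are skipped.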
Output signals are forwarded to the Writeback module:
- my_valid, the output result is valid.
- instruction_decoded_t my_inst_scheduled, the module has to forward the issued instruction along with results.
- vec_reg_size_t my_result, output result organized in a vector lane.
- hw_lane_mask_t my_hw_lane_mask, the module has to forward the hardware bitmask used along with results.
Extending Decode stage
First, we extend the pipeline_disp_t type in include/npu_defines.sv with a new value that identifies our new module (e.g., PIPE_NEW). Then, in the same file, we add the new instruction to the appropriate instruction type. Here we are adding a new R type instruction, hence we add a new opcode to the alu_op_t type (beware: it has to be unique). For a new M type instruction we would have extended memory_op_t instead, and so on for the other types (they are all in the same file).
Now, in the Decode stage, select the right case in the switch construct based on the new instruction type. In this example we again refer to a new R type instruction, hence our code will be placed in the following case:
    casez ( if_inst_scheduled.opcode )
        // RR
        8'b00_?????? : begin
            ...
Make sure that when opcode = MY_OPCODE, the Decode stage issues a new request to the custom module by setting the pipe_sel value to PIPE_NEW, as follows:
    if ( if_inst_scheduled.opcode.alu_opcode <= MOVE
        || ( if_inst_scheduled.opcode.alu_opcode >= SEXT8 & if_inst_scheduled.opcode.alu_opcode <= SEXT32 )
        || if_inst_scheduled.opcode.alu_opcode == MY_OPCODE ) begin
        if ( if_inst_scheduled.opcode.alu_opcode == MY_OPCODE ) begin
            instruction_decoded_next.pipe_sel = PIPE_NEW;
            instruction_decoded_next.is_int   = 1'b0;
            instruction_decoded_next.is_fp    = 1'b0;
        end
Extending Writeback stage
In the Writeback stage, we first add a new dedicated interface for the custom unit, as follows:
    // From MY custom module
    input                       my_valid,
    input instruction_decoded_t my_inst_scheduled,
    input hw_lane_t             my_result,
    input hw_lane_mask_t        my_mask_reg,
Then, we add a new Writeback Request FIFO dedicated to the incoming results from our custom module. To do so, we update the `NUM_EX_PIPE parameter in the include/npu_defines.sv header (adding one to its previous value), and we add a localparam with a new ID for our custom operation:
    localparam PIPE_FP_ID  = 0; // FP pipe FIFO index
    localparam PIPE_INT_ID = 1; // INT pipe FIFO index
    localparam PIPE_SPM_ID = 2; // SPM memory FIFO index
    localparam PIPE_MEM_ID = 3; // LDST unit FIFO index
    localparam PIPE_NEW_ID = 4; // NEW op FIFO index
Next, we connect the dedicated FIFO to the interface inputs from the module:
    assign input_wb_request[PIPE_NEW_ID].pc                       = my_inst_scheduled.pc;
    assign input_wb_request[PIPE_NEW_ID].writeback_valid          = my_valid;
    assign input_wb_request[PIPE_NEW_ID].thread_id                = my_inst_scheduled.thread_id;
    assign input_wb_request[PIPE_NEW_ID].writeback_result         = my_result;
    assign input_wb_request[PIPE_NEW_ID].writeback_hw_lane_mask   = my_mask_reg;
    assign input_wb_request[PIPE_NEW_ID].destination              = my_inst_scheduled.destination;
    assign input_wb_request[PIPE_NEW_ID].is_destination_vectorial = my_inst_scheduled.is_destination_vectorial;
    assign input_wb_request[PIPE_NEW_ID].op_code                  = my_inst_scheduled.op_code;
    assign input_wb_request[PIPE_NEW_ID].pipe_sel                 = my_inst_scheduled.pipe_sel;
    assign input_wb_request[PIPE_NEW_ID].is_memory_access         = my_inst_scheduled.is_memory_access;
    assign input_wb_request[PIPE_NEW_ID].has_destination          = my_inst_scheduled.has_destination;
    assign input_wb_request[PIPE_NEW_ID].is_branch                = my_inst_scheduled.is_branch;
    assign input_wb_request[PIPE_NEW_ID].is_control               = my_inst_scheduled.is_control;
    assign input_wb_request[PIPE_NEW_ID].is_movei                 = my_inst_scheduled.is_movei;
    assign input_wb_request[PIPE_NEW_ID].result_address           = 0;
Finally, we extend the code which generates the writeback result forwarded to register files, adding a new case in the process that builds the output_wb_request.writeback signal:
    // Output data composer. The wb_result_data are directly forwarded to the register files
    always_comb begin : WB_OUTPUT_DATA_SELECTION
        case ( output_wb_request[selected_pipe].pipe_sel )
            PIPE_MEM : wb_next.wb_result_data = result_data_mem;
            PIPE_SPM : wb_next.wb_result_data = result_data_spm;
            PIPE_INT,
            PIPE_NEW,
            PIPE_FP  : wb_next.wb_result_data = output_wb_request[selected_pipe].writeback_result;
            default  : wb_next.wb_result_data = 0;
        endcase
    end
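The selection performed by the output data composer can be summarized with a short behavioral model. This is a sketch, not the Writeback implementation: the Pipe enum is a hypothetical stand-in for pipeline_disp_t with illustrative values, and a single 32-bit word stands for the whole result vector.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for pipeline_disp_t (illustrative values only).
enum class Pipe : std::uint8_t { MEM, SPM, INT, FP, NEW };

// Mirrors WB_OUTPUT_DATA_SELECTION: the memory pipes use their own data paths,
// while INT, FP and the new pipe forward the result stored in the FIFO entry.
std::uint32_t select_wb_result(Pipe pipe_sel,
                               std::uint32_t result_data_mem,
                               std::uint32_t result_data_spm,
                               std::uint32_t writeback_result) {
    switch (pipe_sel) {
        case Pipe::MEM: return result_data_mem;
        case Pipe::SPM: return result_data_spm;
        case Pipe::INT:
        case Pipe::NEW:
        case Pipe::FP:  return writeback_result;
        default:        return 0;  // unknown pipe: drive zero, as in the RTL default
    }
}
```

Listing PIPE_NEW next to PIPE_INT and PIPE_FP is enough here because the custom unit delivers a complete result through its FIFO entry, with no post-processing in the Writeback stage.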
Adding the module in the NPU pipeline
First, we declare signals needed by our module:
    // NEW Pipe Stage - Signals
    logic                 my_valid;
    instruction_decoded_t my_inst_scheduled;
    hw_lane_t             my_result;
    hw_lane_mask_t        my_hw_lane_mask;
Then, we add our module instantiation to the NPU pipeline, located in the core/npu_core.sv file, as follows:
    my_pipe u_my_pipe (
        .clk                ( clk                ),
        .reset              ( reset              ),
        .enable             ( nfreeze            ),
        // From Operand Fetch
        .opf_valid          ( opf_valid          ),
        .opf_inst_scheduled ( opf_inst_scheduled ),
        .opf_fecthed_op0    ( opf_fecthed_op0    ),
        .opf_fecthed_op1    ( opf_fecthed_op1    ),
        .opf_hw_lane_mask   ( opf_hw_lane_mask   ),
        // To Writeback
        .my_valid           ( my_valid           ),
        .my_inst_scheduled  ( my_inst_scheduled  ),
        .my_result          ( my_result          ),
        .my_hw_lane_mask    ( my_hw_lane_mask    )
    );
Finally, we connect our module to the Writeback stage:
    writeback #(
        .TILE_ID ( TILE_ID )
    )
    u_writeback (
        .clk               ( clk               ),
        .reset             ( reset             ),
        .enable            ( 1'b1              ),
        ...
        // From NEW Pipe
        .my_valid          ( my_valid          ),
        .my_inst_scheduled ( my_inst_scheduled ),
        .my_result         ( my_result         ),
        .my_mask_reg       ( my_hw_lane_mask   )
    );