Extending NaplesPU

NaplesPU SystemVerilog coding guidelines

This is a short set of guidelines for extending the NaplesPU architecture.

1. A module's output signal names always start with the module's mnemonic name (e.g. writeback's signals -> wb_xxx).

2. Testbench file names start with tb_.

3. Add a folder for each self-contained module, and place modules that are shared across the whole project in the main "common" folder.

4. Use brackets around arithmetic operations in new defines.

5. Use structs or typedefs instead of defines when subparts of a signal are accessed often.

6. Follow a divide-and-conquer (divide et impera) approach to improve reusability and comprehensibility.

7. Use existing signal typedefs; if you introduce new structures and typedefs, place them in a dedicated header file for that component in the include folder (e.g. writeback unit -> writeback_defines.sv).
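
To illustrate guideline 4: defines are expanded by plain textual substitution, so an unbracketed arithmetic expression can combine with the surrounding operators in unexpected ways. The C preprocessor behaves the same way in this respect, so the pitfall can be sketched in C++. The macro and constant names below are hypothetical and do not come from the NaplesPU headers.

```cpp
#include <cassert>

// Hypothetical defines, not taken from the NaplesPU headers.
#define LANES_UNSAFE 8 + 8      // no brackets around the arithmetic
#define LANES_SAFE   (8 + 8)    // brackets keep the expression atomic

// Total bits for 32-bit lanes: the macro text is pasted in as-is.
constexpr int bits_unsafe = LANES_UNSAFE * 32;  // expands to 8 + 8 * 32
constexpr int bits_safe   = LANES_SAFE * 32;    // expands to (8 + 8) * 32

// Operator precedence silently changes the unbracketed result.
static_assert(bits_unsafe == 264, "8 + (8 * 32)");
static_assert(bits_safe == 512, "(8 + 8) * 32");
```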

Adding a custom instruction to the NaplesPU core

This section describes how to add a new functional operation, extending the instruction set and adding a custom component to the NaplesPU pipeline.

Defining a new instruction

The first step is to add a new instruction to the NaplesPU ISA, starting with the instruction format. E.g., a new arithmetic operation ought to be part of the R type instructions, while a new memory access instruction has to be part of the M type instructions. In the following example, we introduce a new arithmetic operation, called crp.

Extending compiler support

Extending the compiler support for a custom instruction involves two major steps:

  • Adding a new intrinsic;
  • Adding a new instruction.

Adding a new intrinsic

Adding a new intrinsic involves three files, spanning both the front-end and the back-end of the compiler.

On the front-end side, Clang has to recognize the new intrinsic. This is accomplished by adding the following line to the "compiler/tools/clang/include/clang/Basic/BuiltinsNaplesPU.def" file of the toolchain repo:

//------ Cross Product ----------//
BUILTIN(__builtin_npu_crossprodv16i32, "V16iV16iV16i", "n")

Such a macro defines the signature of the intrinsic:

  • __builtin_npu_crossprodv16i32 - name
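  • V16iV16iV16i - return type followed by the argument types (here three vectors of 16 int: one result and two inputs)
  • n - optional attributes (n marks the builtin as nothrow)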

For further information, please refer to file "compiler/tools/clang/include/clang/Basic/Builtins.def" in the toolchain repo.
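
As a reading aid for the type string, the sketch below decodes "V16iV16iV16i" into its three types, the first being the return type. It is a minimal illustration that only understands an optional "V<N>" vector prefix followed by one base-type letter; the struct and function names are hypothetical, and the full grammar documented in Builtins.def has many more modifiers that are ignored here.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// One decoded type from a builtin signature string (illustrative helper).
struct BuiltinType {
    unsigned lanes;  // 1 for a scalar, N for a "V<N>" vector prefix
    char base;       // base type letter, e.g. 'i' for int
};

// Decodes strings such as "V16iV16iV16i". Entry 0 is the return type,
// the remaining entries are the argument types.
inline std::vector<BuiltinType> decode_signature(const std::string& sig) {
    std::vector<BuiltinType> types;
    std::size_t i = 0;
    while (i < sig.size()) {
        unsigned lanes = 1;
        if (sig[i] == 'V') {
            ++i;
            lanes = 0;
            // Accumulate the decimal element count after 'V'.
            while (i < sig.size() &&
                   std::isdigit(static_cast<unsigned char>(sig[i]))) {
                lanes = lanes * 10 + static_cast<unsigned>(sig[i] - '0');
                ++i;
            }
        }
        if (i >= sig.size()) break;  // malformed tail: stop decoding
        types.push_back({lanes, sig[i]});
        ++i;
    }
    return types;
}
```

Calling decode_signature("V16iV16iV16i") yields three entries, each with 16 lanes and base type 'i'.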

Then, in "compiler/tools/clang/lib/CodeGen/CGBuiltin.cpp", we extend the EmitNPUBuiltinExpr method by adding a new case in the switch construct, as follows:

// Cross Product
case NaplesPU::BI__builtin_npu_crossprodv16i32:
   F = CGM.getIntrinsic(Intrinsic::npu_crossprodv16i32);
   break;

The keyword BI__builtin_npu_crossprodv16i32 has to be coherent with the builtin added in the BuiltinsNaplesPU.def file: it is the builtin name preceded by BI.

Finally, we extend the compiler/include/llvm/IR/IntrinsicsNaplesPU.td file on the back-end side as follows:

// Cross Product Intrinsic
def int_npu_crossprodv16i32 : Intrinsic<[llvm_v16i32_ty], [llvm_v16i32_ty, llvm_v16i32_ty], [IntrNoMem], "llvm.npu.__builtin_npu_crossprodv16i32">;

This TableGen code defines an instance (int_npu_crossprodv16i32) of the TableGen class Intrinsic, which makes the new intrinsic available to Clang's code generation as Intrinsic::npu_crossprodv16i32. The definition specifies, in order, the output and input types (llvm_v16i32_ty), any attributes (IntrNoMem), and the IR name string ("llvm.npu.__builtin_npu_crossprodv16i32"), which must contain the same name as the builtin defined in BuiltinsNaplesPU.def, placed after "llvm.npu.".
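
Since the same name has to appear, with different prefixes, in all three files, a mismatch is an easy mistake to make. The sketch below derives the four spellings used in this example from the builtin name. The struct and helper are hypothetical and only encode the naming pattern visible in the snippets above; they are not part of the toolchain.

```cpp
#include <cassert>
#include <string>

// Spellings that must stay aligned across the three files (illustrative).
struct IntrinsicNames {
    std::string clang_enum;     // case label in CGBuiltin.cpp
    std::string td_def;         // def name in IntrinsicsNaplesPU.td
    std::string cpp_intrinsic;  // enumerator passed to getIntrinsic
    std::string ir_name;        // IR name string in IntrinsicsNaplesPU.td
};

inline IntrinsicNames derive_names(const std::string& builtin) {
    const std::string prefix = "__builtin_npu_";
    // Suffix shared by the TableGen def and the Intrinsic:: enumerator;
    // falls back to the whole name if the expected prefix is missing.
    const std::string suffix =
        builtin.compare(0, prefix.size(), prefix) == 0
            ? builtin.substr(prefix.size())
            : builtin;
    return {"BI" + builtin,
            "int_npu_" + suffix,
            "Intrinsic::npu_" + suffix,
            "llvm.npu." + builtin};
}
```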

Adding a new instruction

To add a new instruction to the ISA on the back-end side of the compiler, we extend the compiler/lib/Target/NaplesPU/NaplesPUInstrInfo.td file in the toolchain repo. Such an extension requires the TableGen classes defined in the compiler/lib/Target/NaplesPU/NaplesPUInstrFormats.td file. In particular, the class used for the crp instruction is FR_TwoOp_Unmasked_32, which defines the instruction as R type with two input operands (FR_TwoOp), both vectorial, and with no mask (Unmasked_32):

// Cross Product Instruction
def CROSSPROD_32 : FR_TwoOp_Unmasked_32<
  (outs VR512W:$dst), // output
  (ins VR512W:$src0, VR512W:$src1), // input
  "crp $dst, $src0, $src1", // corresponding assembly code
  [(set v16i32:$dst, (int_npu_crossprodv16i32 v16i32:$src0, v16i32:$src1))], // matching pattern
  63, // ISA opcode (unique for instruction)
  Fmt_V, // destination register format
  Fmt_V, // src0 register format
  Fmt_V>; // src1 register format

The VR512W attribute defines the operation's target registers as vectors of 16 elements, 32 bits each. Correspondingly, the Fmt_V attribute sets the FMT field of the instruction bytecode. For our custom module we chose 63 as the opcode, referred to as MY_OPCODE in the rest of the text.

Extending NPU core pipeline

Extending NPU core pipeline with a new operator in the execution stage involves the following steps:

  • defining new module and its interface;
  • extending NPU Decode stage;
  • extending NPU Writeback stage;
  • adding the module in the NPU pipeline.

Modules that are not fully pipelined have to extend the Instruction_Buffer module as well.

Custom unit interface

An example interface follows:

`include "npu_user_defines.sv"
`include "npu_defines.sv"
module my_pipe (
   input clk ,
   input reset ,
   input enable ,
   // To Instruction buffer
   output thread_mask_t my_stop ,
   // From Operand Fetch
   input opf_valid ,
   input instruction_decoded_t opf_inst_scheduled ,
   input vec_reg_size_t opf_fetched_op0 ,
   input vec_reg_size_t opf_fetched_op1 ,
   input hw_lane_mask_t opf_hw_lane_mask ,
   // To Writeback
   output logic my_valid ,
   output instruction_decoded_t my_inst_scheduled ,
   output vec_reg_size_t my_result ,
   output hw_lane_mask_t my_hw_lane_mask
);

If the new module cannot accept a request every clock cycle, it has to provide a stop condition and forward it to the Instruction_Buffer module, so that no further instructions are issued to the custom module. In the example above, this is the role of the my_stop signal, which has to be high whenever the module cannot handle further requests. Then add my_stop to the Instruction_Buffer module as follows:

assign ib_instructions_valid[thread_id] = ~fifo_empty & ~( l1d_full[thread_id] & ( ib_instructions[thread_id].pipe_sel == PIPE_MEM ) ) & enable & ~( my_stop[thread_id] & ( ib_instructions[thread_id].op_code == MY_OPCODE ) );

The last step is not required if the custom module can handle a request per clock cycle.
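
The gating condition in the assign above can be checked with a small software model for a single thread. This is a behavioural sketch only: the Pipe enumeration and the IbState struct are hypothetical stand-ins for the SystemVerilog types, and the real encodings live in npu_defines.sv.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the SystemVerilog types and encodings.
enum class Pipe { INT, FP, MEM, SPM, NEW };
constexpr std::uint8_t MY_OPCODE = 63;  // opcode chosen for crp in the text

// Per-thread view of the signals used by the assign.
struct IbState {
    bool fifo_empty;       // no instruction waiting in the buffer
    bool l1d_full;         // l1d_full[thread_id]
    bool enable;
    bool my_stop;          // stop bit raised by the custom unit
    Pipe pipe_sel;         // pipe of the instruction at the FIFO head
    std::uint8_t op_code;  // opcode of the instruction at the FIFO head
};

// Mirrors the assign: an instruction is valid unless the FIFO is empty,
// the buffer is disabled, or the pipe it targets is currently blocked.
inline bool instruction_valid(const IbState& s) {
    const bool mem_blocked = s.l1d_full && s.pipe_sel == Pipe::MEM;
    const bool custom_blocked = s.my_stop && s.op_code == MY_OPCODE;
    return !s.fifo_empty && !mem_blocked && s.enable && !custom_blocked;
}
```

Note that my_stop only blocks instructions whose opcode is MY_OPCODE; instructions for the other pipes keep issuing while the custom unit is busy.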

Input signals are generated by the Operand_Fetch module:

  • opf_valid, the incoming request is valid.
  • instruction_decoded_t opf_inst_scheduled, the decoded current instruction. The module has to check the op_code field: if it equals the new opcode (MY_OPCODE), the issued operation has to be processed. An instruction can be either scalar or vectorial. This information is stored in the instruction_decoded_t fields, with one dedicated bit per register: is_source0_vectorial, is_source1_vectorial, and is_destination_vectorial.
  • vec_reg_size_t opf_fetched_op0, first input operand, a vector of registers.
  • vec_reg_size_t opf_fetched_op1, second input operand, a vector of registers.
  • hw_lane_mask_t opf_hw_lane_mask, hardware lane bitmask: the i-th bit states that the i-th element of the vector has to be processed.
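
The role of the lane bitmask can be sketched with a software model of a 16-lane, 32-bit vector operation. The names below are hypothetical, the per-lane operation is a plain addition used as a placeholder (the actual crp semantics are defined by the custom unit and are not modelled), and leaving disabled lanes at zero is an assumption of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;                         // 16 elements per vector register
using VecReg = std::array<std::uint32_t, kLanes>;  // 32 bits per element
using LaneMask = std::uint16_t;                    // bit i enables lane i

// Placeholder lane operation: addition stands in for the custom operator.
inline std::uint32_t lane_op(std::uint32_t a, std::uint32_t b) { return a + b; }

// Applies lane_op only on the lanes whose mask bit is set.
inline VecReg masked_apply(const VecReg& op0, const VecReg& op1, LaneMask mask) {
    VecReg result{};  // disabled lanes stay zero in this sketch
    for (int i = 0; i < kLanes; ++i) {
        if ((mask >> i) & 1u) {
            result[i] = lane_op(op0[i], op1[i]);
        }
    }
    return result;
}
```

With mask 0x0005, only lanes 0 and 2 are computed.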

Output signals are forwarded to the Writeback module:

  • my_valid, the output result is valid.
  • instruction_decoded_t my_inst_scheduled, the module has to forward the issued instruction along with results.
  • vec_reg_size_t my_result, output result organized in a vector lane.
  • hw_lane_mask_t my_hw_lane_mask, the module has to forward the hardware bitmask used along with results.

Extending Decode stage

First, we extend the pipeline_disp_t type in include/npu_defines.sv with a new value that identifies our new module (e.g. PIPE_NEW). Then, in the same file, we add the new instruction to the matching instruction type. Here we added a new R type instruction, hence we add a new opcode to the alu_op_t type (beware, it has to be unique). For a new M type instruction we would have extended memory_op_t instead, and so on for the other types (they are all in the same file).

Now, in the Decode stage, select the right case in the casez construct based on the new instruction type. Again, in this example we refer to a new R type instruction, hence our code will be placed in the following case:

casez ( if_inst_scheduled.opcode )
  // RR
  8'b00_?????? : begin
  ...

Make sure that when opcode = MY_OPCODE, the Decode stage issues a new request to the custom module by setting the pipe_sel value to PIPE_NEW, as follows:

if ( if_inst_scheduled.opcode.alu_opcode <= MOVE || ( if_inst_scheduled.opcode.alu_opcode >= SEXT8 & if_inst_scheduled.opcode.alu_opcode <= SEXT32 )
   || if_inst_scheduled.opcode.alu_opcode == MY_OPCODE ) begin
  if (if_inst_scheduled.opcode.alu_opcode == MY_OPCODE) begin
     instruction_decoded_next.pipe_sel = PIPE_NEW ;
     instruction_decoded_next.is_int = 1'b0;
     instruction_decoded_next.is_fp = 1'b0;
  end 

Extending Writeback stage

In the Writeback stage, we first add a new dedicated interface for the custom unit, as follows:

// From MY custom module
input                                               my_valid,
input  instruction_decoded_t                        my_inst_scheduled,
input  hw_lane_t                                    my_result,
input  hw_lane_mask_t                               my_mask_reg,

Then, we add a new Writeback Request FIFO dedicated to collecting incoming results from our custom module. To do so, we update the `NUM_EX_PIPE parameter in the include/npu_defines.sv header (adding one to the previous value) and add a localparam with a new ID for our custom operation:

localparam  PIPE_FP_ID   = 0; // FP pipe FIFO index
localparam  PIPE_INT_ID  = 1; // INT pipe FIFO index
localparam  PIPE_SPM_ID  = 2; // SPM memory FIFO index
localparam  PIPE_MEM_ID  = 3; // LDST unit FIFO index
localparam  PIPE_NEW_ID  = 4; // NEW op FIFO index


Next, we connect the dedicated FIFO to the interface inputs from the module:

assign input_wb_request[PIPE_NEW_ID].pc                        = my_inst_scheduled.pc;
assign input_wb_request[PIPE_NEW_ID].writeback_valid           = my_valid;
assign input_wb_request[PIPE_NEW_ID].thread_id                 = my_inst_scheduled.thread_id;
assign input_wb_request[PIPE_NEW_ID].writeback_result          = my_result;
assign input_wb_request[PIPE_NEW_ID].writeback_hw_lane_mask    = my_mask_reg;
assign input_wb_request[PIPE_NEW_ID].destination               = my_inst_scheduled.destination;
assign input_wb_request[PIPE_NEW_ID].is_destination_vectorial  = my_inst_scheduled.is_destination_vectorial;
assign input_wb_request[PIPE_NEW_ID].op_code                   = my_inst_scheduled.op_code;
assign input_wb_request[PIPE_NEW_ID].pipe_sel                  = my_inst_scheduled.pipe_sel;
assign input_wb_request[PIPE_NEW_ID].is_memory_access          = my_inst_scheduled.is_memory_access;
assign input_wb_request[PIPE_NEW_ID].has_destination           = my_inst_scheduled.has_destination;
assign input_wb_request[PIPE_NEW_ID].is_branch                 = my_inst_scheduled.is_branch;
assign input_wb_request[PIPE_NEW_ID].is_control                = my_inst_scheduled.is_control;
assign input_wb_request[PIPE_NEW_ID].is_movei                  = my_inst_scheduled.is_movei;
assign input_wb_request[PIPE_NEW_ID].result_address            = 0;

Finally, we extend the code that generates the writeback result forwarded to the register files, adding a new case in the process that selects the result data from output_wb_request:

// Output data composer. The wb_result_data are directly forwarded to the register files
always_comb begin : WB_OUTPUT_DATA_SELECTION
 case ( output_wb_request[selected_pipe].pipe_sel )
  PIPE_MEM : wb_next.wb_result_data = result_data_mem;
  PIPE_SPM : wb_next.wb_result_data = result_data_spm;
  PIPE_INT,
  PIPE_NEW,
  PIPE_FP  : wb_next.wb_result_data = output_wb_request[selected_pipe].writeback_result;
  default : wb_next.wb_result_data  = 0;
  endcase
end

Adding the module in the NPU pipeline

First, we declare signals needed by our module:

// NEW Pipe Stage - Signals
logic                                        my_valid;
instruction_decoded_t                        my_inst_scheduled;
hw_lane_t                                    my_result;
hw_lane_mask_t                               my_hw_lane_mask;

Then, we add our module instantiation to the NPU pipeline, located in the core/npu_core.sv file, as follows:

my_pipe u_my_pipe (
 .clk                ( clk                ),
 .reset              ( reset              ),
 .enable             ( nfreeze            ),
 //From Operand Fetch
 .opf_valid          ( opf_valid          ),
 .opf_inst_scheduled ( opf_inst_scheduled ),
 .opf_fetched_op0    ( opf_fetched_op0    ),
 .opf_fetched_op1    ( opf_fetched_op1    ),
 .opf_hw_lane_mask   ( opf_hw_lane_mask   ),
 //To Writeback
 .my_valid           ( my_valid           ),
 .my_inst_scheduled  ( my_inst_scheduled  ),
 .my_result          ( my_result          ),
 .my_hw_lane_mask    ( my_hw_lane_mask    )
);

Finally, we connect our module to the Writeback stage:

writeback #(
 .TILE_ID( TILE_ID )
)
u_writeback (
  .clk                 ( clk                ),
  .reset               ( reset              ),
  .enable              ( 1'b1               ),	
 ...
 //From NEW Pipe
 .my_valid           ( my_valid           ),
 .my_inst_scheduled  ( my_inst_scheduled  ),
 .my_result          ( my_result          ),
 .my_mask_reg        ( my_hw_lane_mask    )
);