System deployment
Latest revision as of 14:00, 19 June 2019
The Single core version has been deployed on a Nexys4DDR FPGA board. The modules involved are located in the boards/nexys4ddr and src/deploy/ folders. The design connects the board DDR memory to the Memory interface and the UART to the Item interface. A block diagram of the top module is given in the figure NPU nexys4.png.
The npu_system sits in the middle of the design, while uart_router and memory_controller translate NPU transactions, letting the system communicate with both the host (through the UART) and the board memory (through the DDR interface).
Memory Controller
The memory controller deployed in the current release translates incoming requests on the Memory Interface into AXI transactions, which are forwarded to the MIG IP core instantiated in the design. The memory_controller provides the npu_system with a block interface compliant with the core Memory Interface:
    // Block interface
    input  logic [31 : 0]  blk_request_address,
    input  logic [63 : 0]  blk_request_dirty_mask,
    input  logic [511 : 0] blk_request_data,
    input  logic           blk_request_read,
    input  logic           blk_request_write,
    output logic           mc_available,
    output logic           mc_response_valid,
    output logic [31 : 0]  mc_response_address,
    output logic [511 : 0] mc_response_data,
    input  logic           blk_available,
On the other side, it turns NPU memory requests into AXI transactions:
    // AXI write address channel signals
    input               axi_awready, // Indicates slave is ready to accept
    output logic [3:0]  axi_awid,    // Write ID
    output logic [31:0] axi_awaddr,  // Write address
    output logic [7:0]  axi_awlen,   // Write Burst Length
    ...
Then, the MIG turns incoming AXI requests into DDR transactions forwarded to memory blocks located on the board.
The memory_controller also generates the memory availability bit. It instantiates a FIFO, input_fifo, in which incoming requests from the npu_system are stored. The input_fifo stores the request data and the related information, such as the address, whenever either blk_request_read or blk_request_write is asserted:
    sync_fifo #(
        .WIDTH                 ( 32 + 64 + 512 + 1 + 1 ),
        .SIZE                  ( 2 ),
        .ALMOST_FULL_THRESHOLD ( 1 )
    ) input_fifo (
        .clk          ( clk ),
        .reset        ( reset ),
        .flush_en     ( 1'b0 ),
        .full         ( ),
        .almost_full  ( input_fifo_almost_full ),
        .enqueue_en   ( blk_request_read | blk_request_write ),
        .value_i      ( {blk_request_address, blk_request_dirty_mask, blk_request_data, blk_request_read, blk_request_write} ),
        .empty        ( input_fifo_empty ),
        .almost_empty ( ),
        .dequeue_en   ( fifo_blk_dequeue ),
        .value_o      ( {fifo_blk_request_address, fifo_blk_request_dirty_mask, fifo_blk_request_data, fifo_blk_request_read, fifo_blk_request_write} )
    );
The availability signal is deasserted as soon as input_fifo has an element stored:
assign mc_available = ~input_fifo_almost_full;
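The resulting back-pressure can be illustrated with a small behavioural model in Python. This is only a sketch: the class InputFifoModel and its methods are hypothetical names, and the model assumes that sync_fifo asserts almost_full once the number of stored elements reaches ALMOST_FULL_THRESHOLD, which the RTL excerpt above does not spell out.

```python
class InputFifoModel:
    """Behavioural stand-in for input_fifo (hypothetical, not the RTL sync_fifo)."""

    def __init__(self, size=2, almost_full_threshold=1):
        self.size = size
        self.almost_full_threshold = almost_full_threshold
        self.items = []

    @property
    def almost_full(self):
        # Assumption: asserted when the occupancy reaches the threshold.
        return len(self.items) >= self.almost_full_threshold

    @property
    def mc_available(self):
        # Mirrors: assign mc_available = ~input_fifo_almost_full;
        return not self.almost_full

    def enqueue(self, request):
        if len(self.items) >= self.size:
            raise OverflowError("input_fifo is full")
        self.items.append(request)

    def dequeue(self):
        return self.items.pop(0)


fifo = InputFifoModel()
print(fifo.mc_available)   # True: the FIFO is empty
fifo.enqueue({"address": 0x1000, "read": True})
print(fifo.mc_available)   # False: one request is pending
fifo.dequeue()
print(fifo.mc_available)   # True: the request has been served
```

With SIZE set to 2 and the threshold set to 1, the second slot only acts as a safety margin for a request that is already in flight when mc_available goes low.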
During a read, the control logic gathers the incoming 32-bit AXI words back into a single 512-bit block until the AXI transaction is closed:
    READ_BLOCK: begin
        if (axi_rvalid) begin
            word_counter <= word_counter + 1;
            mc_response_data[word_counter * 32 +: 32] <= axi_rdata;
            ...
Finally, the control logic dequeues the pending request by asserting fifo_blk_dequeue and forwards the memory response to the npu_system.
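The word placement used while gathering a block can be mimicked in Python. The helpers gather_block and split_block below are hypothetical names; the sketch only assumes what the READ_BLOCK excerpt shows, namely that word number word_counter lands in bits [word_counter * 32 +: 32] of the 512-bit block.

```python
WORD_BITS = 32
WORDS_PER_BLOCK = 512 // WORD_BITS  # 16 words per block


def gather_block(words):
    """Pack 16 32-bit words into one 512-bit integer, word 0 in the lowest bits."""
    if len(words) != WORDS_PER_BLOCK:
        raise ValueError("expected %d words" % WORDS_PER_BLOCK)
    block = 0
    for word_counter, word in enumerate(words):
        # Same placement as mc_response_data[word_counter * 32 +: 32] <= axi_rdata
        block |= (word & 0xFFFFFFFF) << (word_counter * WORD_BITS)
    return block


def split_block(block):
    """Inverse operation: slice a 512-bit integer back into 16 words."""
    return [(block >> (i * WORD_BITS)) & 0xFFFFFFFF for i in range(WORDS_PER_BLOCK)]


words = list(range(1, 17))
block = gather_block(words)
assert split_block(block) == words
print(hex(block & 0xFFFFFFFF))  # 0x1: the first word received sits in bits [31:0]
```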
The memory_controller also bridges the uart_router (which receives commands from the host) and the board memory. It provides a Command interface directly connected to uart_router; this interface allows the host to interact with the memory through commands flowing over the UART:
    // Command interface
    input        [31:0] command_word_i,
    input               command_valid_i,
    output logic        command_ready_o,
    output logic [31:0] command_word_o,
    output logic        command_valid_o,
    input               command_ready_i,
The Command interface uses a valid/ready handshake, like the npu_system. When the host issues a memory request through the UART, uart_router asserts command_valid_i and forwards the address and the operation type on command_word_i. In particular, the FSM checks the most significant bit of command_word_i: if it is low a READ request is issued, otherwise the FSM performs a WRITE memory request:
    case (current_state)
        IDLE: begin
            if (command_valid_i) begin
                is_read <= command_word_i[31] == READ;
                ...
In the case of a READ, an AXI transaction is performed and no data is needed. In the case of a WRITE, the FSM gathers the next 15 incoming command_word_i values and keeps sending them through the AXI write channel:
    WRITE_BURST: begin
        axi_wdata <= command_word_i;
        axi_wstrb <= 4'b1111;
        if (command_valid_i & axi_wready) begin
            axi_wvalid      <= 1'b1;
            command_ready_o <= 1'b1;
            if (word_counter == burst_len - 1) begin
                axi_wlast <= 1'b1;
            end
        end
    end
The burst ends when 16 words have been sent over the AXI write channel and axi_wlast is asserted.
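As a rough illustration, the first command word can be built and decoded as follows. The flag in bit 31 and the "count minus one" in the low bits follow what the host-side mem_write and mem_read functions do; the helper names encode_command and decode_command are hypothetical, and the sketch assumes that bits [30:0] carry nothing but that count, which the RTL excerpt does not confirm.

```python
WRITE_FLAG = 0x80000000  # bit 31 set: WRITE, bit 31 clear: READ


def encode_command(is_write, num_words):
    """Build the first command word of a memory request."""
    if num_words < 1:
        raise ValueError("at least one word must be transferred")
    cmd = WRITE_FLAG if is_write else 0
    return cmd | (num_words - 1)


def decode_command(word):
    """Return (is_write, num_words); assumes bits [30:0] hold only the count minus one."""
    is_write = bool(word & WRITE_FLAG)
    num_words = (word & 0x7FFFFFFF) + 1
    return is_write, num_words


print(hex(encode_command(True, 16)))   # 0x8000000f
print(decode_command(0x8000000F))      # (True, 16)
print(decode_command(0x0000000F))      # (False, 16)
```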
Host interaction
The host interacts with the system through the UART. The uart module instantiated in nexys4ddr_top connects uart_router with the host, organizing incoming data into bytes. A control process in the top level rebuilds these bytes into words and forwards them to the uart_router:
    always_ff @(posedge mig_ui_clk) begin
        if (async_reset) begin
            rx_cnt <= 3'd0;
        end else begin
            if (rx_cnt < 3'd4 & uart_char_out_valid) begin
                rx_cnt <= rx_cnt + 3'd1;
            end else if (rx_cnt == 3'd4 & router_uart_command_consumed) begin
                rx_cnt <= 3'd0;
            end
            if (rx_cnt < 3'd4 & uart_char_out_valid)
                uart_router_command_word[rx_cnt * 8 +: 8] <= uart_char_rx;
        end
    end
The uart_router_command_word signal is forwarded to the command_word_i input of the uart_router. When a full word has been received from the host, uart_router_command_valid is asserted:
assign uart_router_command_valid = rx_cnt == 3'd4;
This signal is propagated to the command_valid_i bit of the uart_router Command interface, further explained below.
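The byte ordering implied by the control process above can be reproduced in Python: the first byte received fills bits [7:0], the fourth fills bits [31:24]. The helpers bytes_to_word and word_to_bytes are hypothetical names, and the sketch assumes that bytes reach the FPGA in the same order in which the host transmits them.

```python
def bytes_to_word(rx_bytes):
    """Assemble four received bytes into a 32-bit word, first byte in bits [7:0]."""
    if len(rx_bytes) != 4:
        raise ValueError("a command word is made of four bytes")
    word = 0
    for rx_cnt, byte in enumerate(rx_bytes):
        # Same placement as uart_router_command_word[rx_cnt * 8 +: 8] <= uart_char_rx
        word |= (byte & 0xFF) << (rx_cnt * 8)
    return word


def word_to_bytes(word):
    """Host-side counterpart: the least significant byte is sent first."""
    return (word & 0xFFFFFFFF).to_bytes(4, "little")


print(hex(bytes_to_word(bytes([0x78, 0x56, 0x34, 0x12]))))  # 0x12345678
assert bytes_to_word(word_to_bytes(0xCAFEF00D)) == 0xCAFEF00D
```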
Uart Router
The uart_router gathers commands from the host and dispatches them to the selected destination. The following code reports the Command Interface connected with the uart module:
    input        [31:0] command_word_i,
    input               command_valid_i,
    output logic        command_ready_o,
    output logic [31:0] command_word_o,
    output logic        command_valid_o,
    input               command_ready_i,
The control logic in uart_router is organized as an FSM. In the IDLE state it waits until a first word arrives on the interface. This first word carries the destination and the number of words the host is about to send. The information is split into two signals: output_port stores the output interface on which incoming data should be forwarded, while word_cnt tracks the number of incoming words still to be forwarded to that destination. Once this information has been gathered, the FSM moves to the RUNNING state:
    IDLE: begin
        if (command_valid_i) begin
            output_port <= command_word_i[15:0];
            word_cnt    <= command_word_i[31:16];
            dn_state    <= RUNNING;
            ...
The uart_router can dispatch incoming words on two different interfaces:
    output logic [31:0] port_0_word_o,
    output logic        port_0_valid_o,
    input               port_0_ready_i,
    ...
    output logic [31:0] port_1_word_o,
    output logic        port_1_valid_o,
    input               port_1_ready_i,
    ...
Port 0 connects the module to the memory_controller, while port 1 connects the uart_router to the npu_system unit. The active output port is selected on the basis of the output_port signal, while the data buses of both output interfaces are driven directly by the incoming word:
    assign port_0_word_o = command_word_i;
    assign port_1_word_o = command_word_i;
During the RUNNING state the module keeps forwarding to the same interface all incoming words:
    RUNNING: begin
        if (command_valid_i) begin
            if (output_port == 0) begin
                port_0_valid_o <= 1'd1;
                ...
            end else if (output_port == 1) begin
                port_1_valid_o <= 1'd1;
                ...
Each word sent decrements an internal counter initialized by the host (the word_cnt signal). When the counter reaches 1, the FSM sends the last word over the selected interface and goes back to the IDLE state:
    RUNNING: begin
        ...
        if ((output_port == 0 && port_0_ready_i) | (output_port == 1 && port_1_ready_i)) begin
            word_cnt <= word_cnt - 16'd1;

            if (word_cnt == 16'd1) begin
                dn_state <= IDLE;
            end
            ...
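The dispatching behaviour described above can be summarized with a small software model. It is a sketch rather than a cycle-accurate description: the function names make_header and route are hypothetical, the destination is assumed to be always ready, and a header with a word count of zero is not handled.

```python
def make_header(port, word_count):
    """Header layout read in the IDLE state: port in [15:0], count in [31:16]."""
    return ((word_count & 0xFFFF) << 16) | (port & 0xFFFF)


def route(words):
    """Dispatch a stream of 32-bit words; returns {port: [forwarded words]}."""
    outputs = {0: [], 1: []}
    state = "IDLE"
    output_port = 0
    word_cnt = 0
    for word in words:
        if state == "IDLE":
            output_port = word & 0xFFFF
            word_cnt = (word >> 16) & 0xFFFF
            if output_port not in outputs:
                raise ValueError("unknown output port %d" % output_port)
            state = "RUNNING"
        else:
            # RUNNING: forward the word, then count it (destination assumed ready)
            outputs[output_port].append(word)
            if word_cnt == 1:
                state = "IDLE"
            word_cnt = (word_cnt - 1) & 0xFFFF
    return outputs


stream = [make_header(0, 2), 0xA, 0xB, make_header(1, 1), 0xC]
assert route(stream) == {0: [0xA, 0xB], 1: [0xC]}
```

The header word itself is consumed by the router and is not forwarded, so a packet of N payload words takes N + 1 words on the UART side.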
Console commands
The Single core version comes with a host-side console written in Python, called uart_loader.py and located in the boards/ folder of the repository. uart_loader.py abstracts the communication between the host and the system on the FPGA. The console implements the host side of the communication protocol for the Item interface, and allows users to interact with the NPU core and with the DDR memory on the FPGA.
The tool arguments are:
- '-k', or '--kernel', kernel memory image path.
- '-d', or '--debug', enables debug output, optional.
- '-s', or '--serial', serial port to use.
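A command line with these three options could be declared with argparse roughly as follows. This is not the parser used by uart_loader.py, only a sketch of an equivalent one: the help strings, the absence of defaults and the sample file and port names are assumptions.

```python
import argparse


def build_parser():
    """Declare the three options listed above (sketch, not the actual tool)."""
    parser = argparse.ArgumentParser(description="Load and run a kernel over UART")
    parser.add_argument("-k", "--kernel", help="kernel memory image path")
    parser.add_argument("-d", "--debug", action="store_true", help="enable debug output")
    parser.add_argument("-s", "--serial", help="serial port to use")
    return parser


# Hypothetical invocation: kernel.hex and /dev/ttyUSB0 are placeholder values.
args = build_parser().parse_args(["-k", "kernel.hex", "-s", "/dev/ttyUSB0"])
print(args.kernel, args.serial, args.debug)  # kernel.hex /dev/ttyUSB0 False
```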
When run, the tool performs the following steps:
- run the NaplesPU startup self-check
- fetch the hardware threads actually allocated in the system
- load the kernel image into the memory
- set the PCs of the selected threads
- activate the selected threads (launch the kernel)
- wait until the threads terminate, polling the thread status register
- print the return value and the debug registers
- wait for a command (TODO: check with Vincenzo)
- read the memory output
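The load, set-PC and activate steps can be sketched on top of the helper functions provided by the tool. The example below covers only those three steps and makes several assumptions: RecordingComm is a hypothetical stand-in for the UART link that just records packets, launch and max_threads are illustrative names, and the image is written with a single mem_write call, which may not match how the real tool splits the image.

```python
class RecordingComm:
    """Hypothetical stand-in for the UART link: it only records sent packets."""

    def __init__(self):
        self.sent = []

    def send_packet(self, port, words):
        self.sent.append((port, list(words)))


def mem_write(comm, start_addr, content):
    cmd = 0x80000000 | (len(content) - 1)
    comm.send_packet(0, [cmd, start_addr] + content)


def npu_set_pc(comm, thread, pc):
    comm.send_packet(1, [0x0, thread, pc])


def npu_en_threads(comm, mask):
    comm.send_packet(1, [0x1, mask])


def launch(comm, image_words, load_addr, entry_pc, thread_mask, max_threads=8):
    """Load the image, set the PC of every selected thread, then enable them."""
    mem_write(comm, load_addr, image_words)
    for thread in range(max_threads):
        if thread_mask & (1 << thread):
            npu_set_pc(comm, thread, entry_pc)
    npu_en_threads(comm, thread_mask)


comm = RecordingComm()
launch(comm, [0x11, 0x22], load_addr=0x0, entry_pc=0x400, thread_mask=0b0101)
for port, words in comm.sent:
    print(port, [hex(w) for w in words])
```

Setting every PC before enabling any thread keeps a thread from starting at a stale address.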
The uart_loader.py implements high-level operations reusable in user applications.
The functions mem_write and mem_read implement, respectively, write and read operations on the FPGA memory.
    def mem_write(comm, start_addr, content):
        print("MEM: Writing " + words_to_hexstr(content) + " starting from " + hex(start_addr))
        cmd = 0x80000000
        cmd = cmd | (len(content) - 1)
        comm.send_packet(0, [cmd, start_addr] + content)

    def mem_read(comm, start_addr, num):
        print("MEM: Reading " + str(num) + " words starting from " + hex(start_addr))
        cmd = 0
        cmd = cmd | (num - 1)
        comm.send_packet(0, [cmd, start_addr])
        return comm.read_packet(0, num)
As said above, the uart_router module connects the memory controller on port 0 and the NPU core on port 1; this information is embedded in the most significant bit of the received word. The first parameter of the send_packet function selects the output port on the system, and the function formats the message sent over the UART accordingly.
The following functions interact with the NPU core in the single core system. npu_set_pc sends three words on port 1 of uart_router: first the HN_BOOT_COMMAND item command (equal to 0), then the ID of the thread involved, and finally the PC value to set for that thread.
    def npu_set_pc(comm, thread, pc):
        comm.send_packet(1, [0x0, thread, pc])
npu_en_threads sends the user thread mask (a bitmap) to the hi_thread_en control register in the TC module; this register activates the selected threads. The first word sent is the HN_ENABLE_CORE_COMMAND item command (equal to 1), followed by the thread mask value.
    def npu_en_threads(comm, mask):
        comm.send_packet(1, [0x1, mask])
The npu_read_cr function sends a read request for a given control register and returns the value stored in that register. The first word sent is the HN_READ_STATUS_COMMAND item command (equal to 2), followed by the thread ID and the register ID packed into one word. In the very next cycle, the system replies with the value obtained from the selected register, and the uart_router sends it back to the host through the UART interface.
    def npu_read_cr(comm, thread, regid):
        comm.send_packet(1, [0x2, (thread << 16) | regid])
        return comm.read_packet(1, 1)[0]
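The way npu_read_cr packs the thread ID and the register ID into one word can be made explicit with a pair of helpers. The names pack_cr_request and unpack_cr_request are hypothetical, and the 16-bit masks are an added safety assumption: the original function shifts and ORs the two values without masking them.

```python
def pack_cr_request(thread, regid):
    """Second word of the read-status command: thread in [31:16], register in [15:0]."""
    return ((thread & 0xFFFF) << 16) | (regid & 0xFFFF)


def unpack_cr_request(word):
    """Inverse of pack_cr_request: returns (thread, regid)."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


print(hex(pack_cr_request(3, 0x12)))   # 0x30012
print(unpack_cr_request(0x00030012))   # (3, 18)
```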
Project Setup
The following steps cover how to set up the Nexys4DDR NaplesPU Vivado project.
A Vivado 2018.2 installation with Nexys 4 drivers should be used to execute the following steps. All the paths reported are relative to the project root.
- Open a clean Vivado session
- Select "Create Project"
- Choose a project name (e.g. `vivado_proj`) and location (e.g. `boards/nexys4ddr`)
- Select "RTL Project" as project type
- Add the `src/` directory to the project sources
- Add the `boards/nexys4ddr/Nexys-4-DDR-Master.xdc` constraint file to the project
- Select the "Nexys4 DDR" board from the part list
- The project creation is now complete
- In the Tcl console, run the following command: `set_property file_type {Verilog Header} [get_files *_defines.sv]`
- In the Sources pane, select the `nexys4ddr_top` as Top module
- It is suggested to reduce the area occupied by the core by disabling features in the `npu_user_defines.sv` header file; for example, reduce the `THREAD_NUMB` define to 4 and comment out the `NPU_SPM` and `NPU_FPU` defines
- From the IP Catalog, run the "Memory Interface Generator"
- Use `mig_7series_0` as the component name
- Select the "Verify Pin Changes and Update Design" option
- Select the "AXI4 Interface" option
- In the "Load Prj File" field, select the `boards/nexys4ddr/mig_7series_0/mig_a.prj` file
- In the "Load UCF File" field, select the `boards/nexys4ddr/mig_7series_0/mig.ucf` file
- Complete the IP core configuration
- Skip the IP core output products generation
- Open the `boards/nexys4ddr/vivado_proj.srcs/sources_1/ip/mig_7series_0/mig_a.prj` file
- Find the XML element `InputClkFreq`, ensure that the element value is 200
- Save and close the file
- Generate the IP output products
- From the IP Catalog, run the "Clocking Wizard"
- Use `clk_wiz_0` as the component name
- In the "Clocking Options" tab, ensure that the primary clock input signal `clk_in1` is set at 100 MHz
- In the "Output Clocks" tab, enable the `clk_out1` output clock and set the frequency to 200 MHz
- Ensure that the "Reset Type" is set to "Active High"
- Complete the IP core configuration and generation
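The manual check of the `InputClkFreq` element in the MIG project file can also be scripted. The sketch below relies only on the fact, stated in the steps above, that the file is XML and contains an `InputClkFreq` element; the function name input_clk_freq and the sample document are hypothetical, and the layout of the real `mig_a.prj` file is not reproduced here.

```python
import xml.etree.ElementTree as ET


def input_clk_freq(prj_xml):
    """Return the value of the first InputClkFreq element found in the XML text."""
    root = ET.fromstring(prj_xml)
    element = next(root.iter("InputClkFreq"), None)
    if element is None or element.text is None:
        raise KeyError("InputClkFreq element not found")
    return float(element.text)


# Hypothetical, heavily simplified stand-in for the content of mig_a.prj.
sample = "<Project><Controller><InputClkFreq>200</InputClkFreq></Controller></Project>"
assert input_clk_freq(sample) == 200.0
```

For the real file, the same search can be applied to the root returned by `ET.parse(path).getroot()`.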