Network router

From NaplesPU Documentation
Latest revision as of 11:34, 2 July 2019

This section describes the Router implementation.

The router is part of a mesh network: it has an I/O port for each cardinal direction, plus a local injection/ejection port.

Each port exchanges flits with neighbouring routers. Flits are routed using the XY dimension-order routing (XY-DOR) algorithm with look-ahead: every router calculates the next hop as if it were the next router along the path. This optimization reduces the pipeline length of the router, improving latency.
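The look-ahead idea can be illustrated with a small behavioral sketch in Python (not the RTL; the port names and the north-is-positive-Y convention are assumptions for illustration). The router applies plain XY-DOR from the position of the neighbour it is forwarding to, so the flit leaves already carrying the next router's routing decision.

```python
# Behavioral sketch of look-ahead XY-DOR (illustrative, not the RTL).
# Port names and coordinate convention are assumptions.
EAST, WEST, NORTH, SOUTH, LOCAL = "E", "W", "N", "S", "L"

def xy_dor(cur, dest):
    """Plain XY-DOR: exhaust the X offset first, then move along Y."""
    cx, cy = cur
    dx, dy = dest
    if dx > cx: return EAST
    if dx < cx: return WEST
    if dy > cy: return NORTH
    if dy < cy: return SOUTH
    return LOCAL

def neighbor(cur, port):
    """Coordinates of the router reached through the given port."""
    x, y = cur
    return {EAST: (x + 1, y), WEST: (x - 1, y),
            NORTH: (x, y + 1), SOUTH: (x, y - 1), LOCAL: (x, y)}[port]

def lookahead_next_port(cur, next_hop_port, dest):
    """Compute the port the NEXT router will use, as if we were already there.
    next_hop_port is the decision the previous router stored in the flit."""
    return xy_dor(neighbor(cur, next_hop_port), dest)
```

For example, a router at (1,1) forwarding a flit east toward destination (3,2) computes the decision that router (2,1) will apply, which is to continue east; router (2,1) in turn precomputes that (3,1) will turn toward Y.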

For coherence-related reasons, the network has four virtual channels, each carrying a different packet class. Furthermore, an On/Off back-pressure signal for each virtual channel prevents FIFO overflow. The following rules must be ensured:

  • a flit cannot be routed on a virtual channel other than the one it was assigned;
  • different packets cannot be interleaved on the same virtual channel.

The router is implemented in a pipelined fashion. Although three stages are presented, the last one's output is not buffered. This effectively reduces the pipeline delay to two stages.

[Figure: router]

The first stage consists of input buffers, connected directly to the input ports. They are responsible for the storage of flits to be routed and for the generation of the back-pressure signals.

The second stage is composed of three main blocks: the allocator, the flit manager and the routing logic. The allocator manages the allocation of the third stage ports, generating a grant signal for each virtual channel allowed to proceed, considering also back-pressure signals coming from other routers. This grant is used by the flit manager to select the winning flits, which are fed to the next stage along with the routing information generated by the routing block.

The third stage is a crossbar, allowing each of the 5 input ports to access each of the 5 output ports, provided that no collisions occur.

First stage

Each input port manages separate buffering for each virtual channel. An incoming flit carries the ID of the virtual channel in which it has to be enqueued. The implementation for a single virtual channel is reported below.

[Figure: Router first stage]

For each virtual channel, two FIFO buffers are used: a general flit queue, which enqueues every incoming flit, and a head queue, responsible for storing only the output port of each incoming packet. The queues have the same capacity, to account for the worst case, that is, a stream made only of head-tail flits.

The general flit queue therefore enqueues every incoming flit and dequeues one as soon as the allocator grants it permission. The head queue enqueues an entry for every incoming head or head-tail flit and dequeues one as soon as a tail or head-tail flit is dequeued from the general flit queue.
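The interplay of the two queues can be summarised with a minimal behavioral model in Python (an illustration, not the RTL; flit-type names mirror the HDL enums). Note how the head queue holds exactly one entry per buffered packet, which is what lets it act as a request source for the allocator.

```python
from collections import deque

# Behavioral sketch of one virtual channel's buffering (illustrative).
HEADER, BODY, TAIL, HT = "header", "body", "tail", "head-tail"

class VCBuffer:
    def __init__(self):
        self.flit_q = deque()  # every flit, in arrival order
        self.head_q = deque()  # one output-port entry per packet

    def enqueue(self, flit_type, next_hop_port=None):
        self.flit_q.append(flit_type)
        if flit_type in (HEADER, HT):      # one head-queue entry per packet
            self.head_q.append(next_hop_port)

    def dequeue(self):
        """Called when the allocator grants this virtual channel."""
        flit_type = self.flit_q.popleft()
        if flit_type in (TAIL, HT):        # packet fully drained:
            self.head_q.popleft()          # retire its routing entry too
        return flit_type
```

After enqueuing a three-flit packet and a head-tail packet, the head queue holds two entries; draining the first packet leaves a single entry, the routing request of the remaining packet.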

The number of flits stored in the buffers is used to generate the back-pressure signals. In particular, the generated signals must guarantee that the system is packet-loss-free. As there is one pipeline stage of delay between these buffers and the previous router's allocator, the Off signal must be raised while the buffer still has one free slot.
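A toy simulation makes the threshold choice concrete (an illustration under the stated assumption of one register of delay in each direction, not the RTL). With the threshold at capacity - 1 the buffer absorbs the one in-flight flit and fills exactly to capacity; with the threshold at capacity it overflows by one.

```python
# Toy model: upstream sends one flit per cycle unless it saw Off; both the
# Off signal and the flit cross one register, so each is one cycle stale.
def max_occupancy(capacity, off_threshold, cycles=12):
    fifo = 0
    off_prev = False      # Off as sampled by the upstream router (1 cycle old)
    in_flight = False     # flit on the link register (1 cycle of flight)
    max_occ = 0
    for _ in range(cycles):
        if in_flight:
            fifo += 1                      # flit sent last cycle lands now
        off_now = fifo >= off_threshold    # almost_full, combinational
        in_flight = not off_prev           # upstream sends unless it saw Off
        off_prev = off_now                 # Off reaches upstream next cycle
        max_occ = max(max_occ, fifo)       # worst case: nothing ever drains
    return max_occ
```

This matches the `ALMOST_FULL_THRESHOLD ( `QUEUE_LEN_PER_VC - 1 )` parameter in the FIFO instantiation below.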

FIFO instantiation code is reported below.

genvar                                            i;
generate
	for( i=0; i < `VC_PER_PORT; i=i + 1 ) begin : vc_loop
		sync_fifo #(
			.WIDTH ( $bits( flit_t )   ),
			.SIZE  ( `QUEUE_LEN_PER_VC ),
			.ALMOST_FULL_THRESHOLD ( `QUEUE_LEN_PER_VC - 1 ) )
		flit_fifo (
			.clk         ( clk                           ),
			.reset       ( reset                         ),
			.flush_en    (                               ),
			.full        (                               ),
			.almost_full ( on_off_out[i]                 ),
			.enqueue_en  ( wr_en_in & flit_in.vc_id == i ),
			.value_i     ( flit_in                       ),
			.empty       ( ip_empty[i]                   ),
			.almost_empty(                               ),
			.dequeue_en  ( sa_grant[i]                   ),
			.value_o     ( ip_flit_in_mux[i]             )
		);

		sync_fifo #(
			.WIDTH ( $bits( port_t ) ),
			.SIZE ( `QUEUE_LEN_PER_VC ) )
		header_fifo (
			.clk         ( clk                                                                                       ),
			.reset       ( reset                                                                                     ),
			.flush_en    (                                                                                           ),
			.full        (                                                                                           ),
			.almost_full (                                                                                           ),
			.enqueue_en  ( wr_en_in & flit_in.vc_id == i & ( flit_in.flit_type == HEADER | flit_in.flit_type == HT ) ),
			.value_i     ( flit_in.next_hop_port                                                                     ),
			.empty       ( request_not_valid[i]                                                                      ),
			.almost_empty(                                                                                           ),
			.dequeue_en  ( ( ip_flit_in_mux[i].flit_type == TAIL | ip_flit_in_mux[i].flit_type == HT ) & sa_grant[i] ),
			.value_o     ( dest_port_app                                                                             )
		);
	end
endgenerate

As stated before, this will work as long as packets don't interleave on the same virtual channel.

Second stage

The second stage takes as input requests coming from each virtual channel, along with back-pressure signals sent by other routers. Its main goals are:

  • allocate resources, which in this case are crossbar input ports;
  • properly update the routing information for each flit passing through.

The overall implementation is reported below.

[Figure: Router second stage]

Grant signals are also fed back to the first stage, to allow flits to be dequeued.

Look-ahead routing logic is replicated for each input port. This way, as soon as we know which virtual channel has been granted access for each input port, we can calculate the routing information.

Allocator implementation

The allocator is required because every virtual channel storing valid flits issues a request for access to an output port. As the third stage has only five input ports, and only one of them is allowed to access a given output port, requests must be scheduled.

[Figure: Allocation]

The allocator can be decomposed in two parts:

  • virtual channel allocator, which ensures that for every output virtual channel only one input virtual channel is granted access to it;
  • switch allocator, which first ensures that for every input port there is only one virtual channel allowed to access the crossbar, and then ensures that for every output port there is only one input port allowed to access the crossbar.

Virtual channel allocator

Virtual channel allocation is implemented using a round-robin arbiter, with additional grant-and-hold circuitry (as implemented by the allocator core module). Once an input virtual channel has been granted access to an output virtual channel, the grant will be held until its request is fulfilled. This is to ensure that multiple packets do not interleave on the same virtual channel.
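The grant-and-hold behaviour can be sketched with a small Python model (an illustration of the arbitration policy, not the RTL): a winner keeps its grant for as long as it keeps requesting, so a multi-flit packet is never interleaved with another one on the same virtual channel.

```python
# Illustrative model of a round-robin arbiter with grant-hold.
class GrantHoldRRArbiter:
    def __init__(self, n):
        self.n = n
        self.last = 0       # round-robin pointer (index of last winner)
        self.holder = None  # requester currently holding the grant

    def arbitrate(self, requests):
        """requests: list of n bools. Returns the winner's index, or None."""
        if self.holder is not None and requests[self.holder]:
            return self.holder              # hold: same winner as last cycle
        self.holder = None
        for k in range(1, self.n + 1):      # scan round-robin from last+1
            i = (self.last + k) % self.n
            if requests[i]:
                self.last = i
                self.holder = i
                return i
        return None
```

While requester 1 holds the grant, a competing request from requester 0 is ignored; the pointer only moves on once requester 1 stops requesting (i.e., its packet's tail has gone through).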

Back-pressure signals from other routers are used in this stage.

The output grant signals are generated on a virtual-channel basis and fed to the next stage.

Switch allocator

The switch allocator has two roles:

  • it chooses, for each port's candidate virtual channels, which one is granted access;
  • it chooses, if there are any input ports requesting access to the same output port, which one is granted access.

The first part is implemented using a round-robin arbiter, allowing flits coming from different virtual channels to be interleaved on the same output port, as the first stage has already ensured that no two virtual channels have been granted access to the same output virtual channel.

genvar i;
generate
	for( i=0; i < `PORT_NUM; i=i + 1 ) begin : sw_allocation_loop

		assign va_grant_per_port[i] = {`PORT_NUM{| ( va_grant[i] & ~ip_empty[i] )}};

		rr_arbiter #(
			.NUM_REQUESTERS( `VC_PER_PORT ) )
		sa_arbiter (
			.clk       ( clk                               ),
			.reset     ( reset                             ),
			.request   ( va_grant[i] & ~ip_empty[i] ),
			.update_lru( 1'b1                              ),
			.grant_oh  ( grant_to_mux_ip[i]                )
		);
 
		mux_npu #(
			.N    ( `VC_PER_PORT ),
			.WIDTH( `PORT_NUM    )
		)
		sa_mux (
			.onehot( grant_to_mux_ip[i] ),
			.i_data( ip_dest_port[i]    ),
			.o_data( sa_port[i]         )
		);
	end
endgenerate

At this point, we can proceed with the second part, and we also have all the required information to perform the routing calculation. In fact, we know for each input port which virtual channel has been granted access.

The second part is implemented again with a round-robin arbiter and grant-and-hold circuitry, using the allocator core. It guarantees that there are no conflicts in switch allocation, and these grant signals can be used to set up the crossbar.

Allocator core

The allocator core is a generic round-robin allocator, composed of a parameterizable number of parallel arbiters, in which inputs and outputs are properly scrambled and the outputs are OR-ed to obtain the grant signals.

[Figure: Allocator core]
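The scramble-arbitrate-OR structure can be sketched in Python (a behavioral illustration of the idea, not the RTL; per-output conflict handling beyond one arbiter per output is omitted). `request[i][j]` means requester i wants output j; transposing the matrix lets arbiter j see all requesters competing for output j, and the per-output grants are then OR-ed back into one grant per requester.

```python
# Sketch of the allocator-core structure: one round-robin arbiter per output,
# requests "scrambled" (transposed) so each arbiter sees one output's column,
# per-output grants OR-ed into a port-granularity grant per requester.
def allocator_core(request, pointers):
    n_req = len(request)
    n_out = len(request[0])
    grant = [[False] * n_out for _ in range(n_req)]
    for j in range(n_out):                          # arbiter for output j
        column = [request[i][j] for i in range(n_req)]
        for k in range(1, n_req + 1):               # round-robin scan
            i = (pointers[j] + k) % n_req
            if column[i]:
                grant[i][j] = True
                pointers[j] = i                     # advance this pointer
                break
    return [any(row) for row in grant]              # OR across outputs
```

With requesters 0 and 1 both asking for output 0, the round-robin pointer picks requester 1 first; requester 0 still wins output 1, so both end up granted while requester 2 (no requests) does not.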

module allocator_core #(
		parameter N    = 5,  // input ports
		parameter M    = 4,  // virtual channels per port
		parameter SIZE = 5 ) // output ports
	(
		input                                             clk,
		input                                             reset,
		input  logic [N - 1 : 0][M - 1 : 0][SIZE - 1 : 0] request,
		input  logic [N - 1 : 0][M - 1 : 0]               on_off,
		output logic [N - 1 : 0][M - 1 : 0]               grant
	);

It also provides grant-and-hold circuitry.

Look-ahead routing

The look-ahead routing calculates the output port as if the router were the next router on the path. This allows routing to run in parallel with the switch allocator, as the switch allocator does not need the routing output. Otherwise, an additional pipeline stage for the routing calculation would be needed.

Flit manager

The flit manager selects, for each port, the flit which has access to the crossbar inputs. It also modifies the flit to include the routing results.

genvar i;
generate
	for( i=0; i < `PORT_NUM; i=i + 1 ) begin : flit_loop
		always_comb begin
			flit_in_granted_mod[i] = flit_in_granted[i];
			if ( flit_in_granted[i].flit_type == HEADER || flit_in_granted[i].flit_type == HT )
				flit_in_granted_mod[i].next_hop_port = port_t'( lk_next_port[i] );
		end
	end
endgenerate

Third stage

The third stage is a crossbar connecting the 5 input ports to the 5 output ports.

It is implemented as a mux for each output port, with selection signals derived from the second-stage outputs.
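The mux-per-output structure is simple enough to state in one line of Python (a behavioral illustration; signal names are assumptions). `sel[j]` is the input port granted to output j by the second stage, or `None` when no grant was issued for that output this cycle.

```python
# Sketch of the third stage: one mux per output port.
def crossbar(inputs, sel):
    """inputs: 5 flits (one per input port); sel[j]: granted input for output j."""
    return [inputs[s] if s is not None else None for s in sel]
```

Since the second stage guarantees that each input port is granted at most one output, no two entries of `sel` need to name the same input in a conflicting way.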

The output of this stage is not buffered. This means that, in practice, the clock cycles needed to traverse the router reduce to two.