Difference between revisions of "L1 Cache Controller"

From NaplesPU Documentation
(Request Issue Signals)
(Stall Protocol ROMs)
 
(200 intermediate revisions by 4 users not shown)
Line 1: Line 1:
The cache controller is composed of 4 stages, 3 of which are registered. Before describing the components in detail, we introduce the overall dynamics of the elements.
+
The cache controller manages the L1 cache. In particular, it handles only coherence information (such as block states), since the L1 data cache is managed by the load/store unit.
* Stage 1. Its main purpose is to schedule one of the requests coming from the core (e.g. a cache miss) or from the network (e.g. a forwarded request). The arbiter's choice depends mainly on whether another request is already in the controller, on whether the network is available to transmit, and on whether the protocol is ready to process the message. The latter requires examining the state of the controller's pending transactions.
 
* Stage 2. It mainly contains the register of pending transactions from the core (MSHR), which is queried through a lookup table by the arbiter in the previous stage and is modified by the next stage as the protocol evolves. It also contains a memory that holds the state of every cache line.
 
* Stage 3. This is the most complex stage, because the protocol is queried here: based on the scheduled request and on the current state of the request, it generates all the signals needed to determine the next state of the request (by modifying the MSHR), of the cached data (by modifying the state and the data in the datapath's L1 cache), and any messages to send on the network.
 
* Stage 4. This is the simplest stage, since its only purpose is to pack any messages the controller has to send on the network. This stage is not registered and communicates directly with the network interface.
 
As can be seen, the protocol is queried both in the first stage and in the third, so it is treated as a cross-cutting element and described separately.
 
  
 
[[File:l1_cache.jpg|800px|LDST_CC]]
 
  
Before describing the stages in detail, here are the basic assumptions needed to understand and introduce the architectural choices:
+
The component is composed of 4 stages:
* The cache controller can serve another request on the same address only once the previous one has left the controller.
+
 
%\item When a request is issued (both from NI or CI), it can modify an MSHR entry after two clock cycles, hence the next request may read a non up to date entry. In order to avoid this, the \texttt{Fetch Unit} does not schedule two consecutive requests on the same address.
+
* stage 1: schedules a pending request to issue (from local core or network);
* Transactions always involve whole memory blocks.
+
* stage 2: contains coherence cache and MSHR;
* Two requests operating on the same set cannot be handled at the same time.
+
* stage 3: processes a request properly with coherence protocol;
% why, if they generate more than one replacement, do we have more than one request per core?
+
* stage 4: prepares coherence request/response to be sent on the network.
* Only requests from the core can allocate an entry in the MSHR, specifically loads, stores and writebacks.
+
 
* States are stored in the cache only if they are stable; otherwise they are held in the MSHR.
+
All these stages are represented in the figure below. The component has been realized in a ''pipelined'' fashion in order for the controller to be able to serve multiple requests at the same time.
 +
 
 +
=== Assumptions ===
 +
The design has been driven by these assumptions:
 +
 
 +
* the cache controller schedules a request only when no other request on the same address is pending in the pipeline;
 +
* coherence transactions operate at the memory-block level;
 +
* the cache controller does not schedule a request when another request in the pipeline has the same set;
 +
* only requests from the local core (load, store, replacement) allocate MSHR entries;
 +
* information regarding cache blocks in a non-stable state is stored in the MSHR.
 +
 
 +
Two requests on the same block cannot be issued one after another; when the first request is issued, it may modify an MSHR entry after two clock cycles (stage 3), hence the second request may read a non-up-to-date entry.
  
 
== Stage 1 ==
 
Stage 1 is responsible for the issue of requests to controller. A request could be a load miss, store miss, flush and replacement request from the local core or a coherence forwarded request or response from the network interface.
+
Stage 1 is responsible for scheduling requests into the controller. A request can be a load miss, store miss, flush or replacement request from the local core, or a coherence forwarded request or response from the network interface.
  
 
=== MSHR Signals ===
 
In order to find out if a request for the '''same block''' is already issued and ''pending'', tag and sets for each type of request are provided to MSHR. MSHR data response are considered valid for that class of request if and only if its hit signal is asserted. Here is the code for the class of load miss signals:
+
The arbiter in the first stage checks whether a pending request can be issued: to be eligible for scheduling, no other request on the '''same block''' may be ''ongoing'' (i.e. under elaboration) in the cache controller. Ongoing requests are stored in the MSHR table. The tag and set of each type of pending request are provided to the MSHR, and the lookup results are forwarded to the arbiter in Stage 1, which uses this information on ongoing transactions to select a pending request. The MSHR provides a lookup port for each type of request, together with a <code>hit</code> signal; the looked-up entry is considered valid only if this signal is asserted:
  
 
  // Signals to MSHR
 
Line 33: Line 39:
  
 
=== Stall Protocol ROMs ===
 
In order to be compliant with the coherence protocol all incoming requests for blocks whose coherence state is non-stable state have to be stalled. This task is performed through a series of protocol ROM (one for each request type) whose output signal will stall the issue of relative request if asserted, that is for example when a block is in state SM_A and a ''Fwd_GetS'', ''Fwd_GetM'', ''recall'', ''flush'', ''store'' or ''replacement'' request for the same block is received. In order to assert this signal the protocol ROM needs the type of the request and the actual state of the block. Here is the stall logic:
+
In order to be compliant with the coherence protocol, incoming requests on blocks in a non-stable state may have to be stalled. This task is performed by a series of protocol ROMs (one for each request type), whose output signal states when the issue of the corresponding request must be stalled, e.g. when a block is in state SM_A and a ''Fwd_GetS'', ''Fwd_GetM'', ''recall'', ''flush'', ''store'' or ''replacement'' request for the same block is received. To drive this signal, the protocol ROM needs the type of the request and the current state of the block. The module <code>stall_protocol_rom</code> implements this logic:
  
 
  stall_protocol_rom load_stall_protocol_rom (
 
Line 65: Line 71:
 
  );
 
  
Note that response messages do not need to be stalled, so no stall logic is required.
+
Note that response messages are never stalled by the coherence protocol; they are held back only if a pending request with the same set index is already in the pipeline:
 +
 
 +
 assign can_issue_response = ni_response_valid &
     !(
         ( cc2_pending_valid && ( ni_response.memory_address.index == cc2_pending_address.index ) ) ||
         ( cc3_pending_valid && ( ni_response.memory_address.index == cc3_pending_address.index ) )
     );
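The response issue condition above can be exercised with a small behavioural model. This is only a Python sketch of the expression, not the RTL: the <code>PendingSlot</code> class and the function signature are hypothetical names introduced for illustration, standing in for the <code>cc2_pending_*</code> and <code>cc3_pending_*</code> signals.

```python
from dataclasses import dataclass


@dataclass
class PendingSlot:
    """Hypothetical stand-in for the cc2/cc3 pending-request signals."""
    valid: bool
    index: int  # set index of the address held in that stage


def can_issue_response(response_valid: bool, response_index: int,
                       cc2: PendingSlot, cc3: PendingSlot) -> bool:
    """A response is issued unless stage 2 or stage 3 holds the same set index."""
    conflict = ((cc2.valid and cc2.index == response_index) or
                (cc3.valid and cc3.index == response_index))
    return response_valid and not conflict
```

A slot that is not valid never blocks the response, even if its stale index happens to match.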
  
=== Request Issue Signals ===
+
=== Issuing a Request ===
In order to issue a generic request it is required that:
+
In order to issue a generic request, it is required that:
  
* MSHR has not been issued a request for the same block;
+
* MSHR has no pending requests for the same block;
 
* if the request is already in the MSHR, its entry must not be valid;
* if the request is already in MSHR and valid it must not have been stalled by Protocol ROM (see [[L1 Cache Controller#Stall Protocol ROMs | Stall signals]]).
+
* if the request is already in MSHR and valid it must not have been stalled by the protocol ROM (see [[L1 Cache Controller#Stall Protocol ROMs | stall signals]]).
* further stages are not serving a request for the same address (see ASSUMPTIONS);
+
* further stages are not serving a request on the same address (see [[L1 Cache Controller#Assumptions | assumptions]]);
 
* network interface is available;
 
  
Line 87: Line 99:
 
         ni_request_network_available;
 
  
Response messages do not need constraints on the MSHR because they never use it (they never wait for subsequent events) and are never stalled. The same goes for flush requests, even though they could be stalled by the stall protocol ROM.
+
Response messages do not need feedback from the MSHR, since they do not allocate a new entry and they are never stalled. The same goes for flush requests, even though they could be stalled by the corresponding stall protocol ROM.
  
 +
Finally a ''replacement'' request could be ''pre-allocated'' in MSHR (see [[L1 Cache Controller#MSHR Update Logic | MSHR update logic]]). In order for this request to be issued before every other request on the same block, an additional condition is added:
  
Note that request issue is not checked against the free slots in the MSHR. Indeed, assuming that only requests from the core can allocate a slot in the MSHR and that only one request at a time can be in flight per thread, it is enough to size the MSHR with as many entries as there are threads for this check to become superfluous. <br>
+
assign can_issue_replacement =
Note that a writeback request can proceed even if there is already a pending request in the MSHR, provided it is a slot pre-allocated to evict a block (see the eviction use case section), so that it is guaranteed to always execute '''before any other request on that address'''.
+
...
 +
( !replacement_mshr_hit ||
 +
    ( replacement_mshr_hit && !replacement_mshr.valid) ||
 +
    ( replacement_mshr_hit && replacement_mshr.valid && ( !stall_replacement || replacement_mshr.waiting_for_eviction ) ) )  
 +
...
  
// condition added to the writeback
+
Note that the control logic does not check whether the MSHR has free entries. The following assumption makes this check unnecessary: only one request per thread can be issued and only threads can allocate an MSHR entry, so sizing the MSHR to twice the number of threads guarantees that it is never full. In the worst case, the MSHR holds one pending request and one pending replacement per thread.
( writeback_mshr_hit && writeback_mshr.waiting_for_eviction ||
 
!stall_writeback && writeback_mshr_hit)
 
  
 
=== Requests Scheduler ===
 
Once the conditions for the issue have been verified, two or more requests could be ready at the same time so a scheduler must be used.  
+
Once the conditions for the issue have been verified, two or more requests could be ready at the same time so a scheduler must be used. Every request has a fixed priority whose order is set as below:
In particular this scheduler uses fixed priorities set as below:
 
  
 
# flush
# dinv
# replacement
# store miss
Line 107: Line 122:
 
# coherence forwarded request
# load miss
# recycled response
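A fixed-priority scheduler of this kind can be sketched as follows. This is a Python behavioural model under stated assumptions: the priority tuple only contains the entries visible in the list above (the diff elides the ones between ''store miss'' and ''coherence forwarded request''), and the function name <code>schedule</code> and the dictionary of issue flags are hypothetical.

```python
from typing import Dict, Optional, Sequence

# Partial order: only the request types visible in the list above.
# The entries elided by the diff are NOT represented here.
PRIORITY_ORDER = (
    "flush",
    "dinv",
    "replacement",
    "store miss",
    "coherence forwarded request",
    "load miss",
    "recycled response",
)


def schedule(can_issue: Dict[str, bool],
             order: Sequence[str] = PRIORITY_ORDER) -> Optional[str]:
    """Return the highest-priority request type whose issue condition holds."""
    for kind in order:
        if can_issue.get(kind, False):
            return kind
    return None  # nothing is ready this cycle
```

With fixed priorities a lower-priority request is served only when no higher-priority request is ready in the same cycle.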
  
Once a type of request is scheduled this block drives conveniently the output signals for the second stage.
+
Once a type of request has been scheduled, this block drives the output signals for the second stage accordingly.
  
 +
== Stage 2 ==
 +
Stage 2 is responsible for managing L1 cache, the MSHR and forwarding signals from Stage 1 to Stage 3. It simply contains the L1 coherence cache (L1 data cache is in load/store unit) and all related logic for managing cache hits and block replacement. The policy used to replace a block is LRU (Least Recently Used). <br>
 +
This module receives signals from stage 3 to update MSHR and coherence cache properly once a request is processed and from load/store unit to update LRU every time a block is accessed from the core.
  
 +
=== Hit/miss logic ===
  
 +
Lookup phase is split in two parts performed by:
  
 +
# load/store unit;
 +
# cache controller (stage 2).
  
[???] From the considerations above, it is clear that flush and writeback have higher priority than the other requests, so that requests from other threads receive data that is not stale.
+
The load/store unit performs the first lookup using only the request's set; it returns an array with the tag of every way in that set, together with the corresponding privilege bits. This first lookup is performed while the request is in cache controller stage 1. The second phase is performed by cache controller stage 2 using only the request's tag; the search runs on the array provided by the load/store unit. If a block with the same tag exists and is valid (validity is checked through the privilege bits), a hit occurs and the way index of that block is provided to stage 3, which uses it to update the coherence data of that block. <br>
 +
If there is no block with the same tag as the request's and no hit occurs, stage 3 takes the way index provided by LRU unit in order to replace that block (see [[L1 Cache Controller#Replacement Logic | replacement logic]]).
  
== Stage 2 ==
+
...
 +
// Second phase lookup
 +
// Result of this lookup is an array one-hot codified
 +
assign snoop_tag_way_oh[dcache_way] = ( ldst_snoop_tag[dcache_way] == cc1_request_address.tag ) & ( ldst_snoop_privileges[dcache_way].can_read | ldst_snoop_privileges[dcache_way].can_write );
 +
...
 +
 +
assign snoop_tag_hit      = |snoop_tag_way_oh;
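The second lookup phase can be modelled in a few lines. The following is a Python sketch that mirrors the <code>snoop_tag_way_oh</code> and <code>snoop_tag_hit</code> assignments; the <code>Privileges</code> class and the function name are assumptions made for illustration, and returning the way index directly (instead of going through a one-hot decoder) is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Privileges:
    can_read: bool
    can_write: bool


def second_phase_lookup(snoop_tags: List[int],
                        snoop_privileges: List[Privileges],
                        request_tag: int) -> Tuple[List[bool], bool, Optional[int]]:
    """Compare the request tag against every way returned by the first phase."""
    # One bit per way: tag match AND the block is readable or writable.
    way_oh = [tag == request_tag and (priv.can_read or priv.can_write)
              for tag, priv in zip(snoop_tags, snoop_privileges)]
    hit = any(way_oh)                       # OR-reduction of the one-hot vector
    way = way_oh.index(True) if hit else None
    return way_oh, hit, way
```

A way whose tag matches but whose privilege bits are both low does not produce a hit, which is how invalid blocks are filtered out.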
  
The second stage has no complex logic, but it contains fundamental elements of the controller: a hit/miss detection block, the MSHR, the memory holding the states of the blocks in the L1 cache, and a pseudo-LRU.
+
Note that when a request arrives in stage 2 its way index in the data cache is not known yet (the hit/miss logic is computing it at the same time), so the coherence cache is looked up using only the request's set. The result of this snoop operation is forwarded to stage 3, which processes it. Stage 3 knows which way index to use to fetch the correct data because the hit/miss logic will have provided it in the meantime.
The main execution flow starts from the registered outputs of the previous stage (in the code, the signals whose names start with "cc1_request"), which are fed in parallel to the hit/miss detection block, to the memory holding the states of the L1 cache blocks, and to the pseudo-LRU.
 
The hit/miss detection block receives at the same time the tag of the request scheduled by the controller and the tags coming from the L1 cache, together with the associated privileges, in order to repeat the hit/miss check on the cached block. The check is repeated because a request sent by the core may be scheduled after other requests in the controller that can modify the cache itself.
 
The memory holding the states of the L1 blocks is accessed by the outputs of the previous stage in read mode only, in order to fetch the states of a given set. The way to pick is not chosen yet; that choice is made in the next stage.
 
The pseudo-LRU receives, from the registered outputs of the first stage, a read request for the least recently used way. It also receives directly from the load/store unit the last way used in a set, so that it can be marked as the most recently used way.
 
  
There are also two secondary flows, no less important, that do not use the registered outputs.
+
The choice of splitting lookup into two separate phases has been made in order to reduce the latency of the entire process.
The first is the read access to the MSHR. It takes combinationally, from the previous stage, the inputs of all the incoming pending requests (signals cc1_mshr_lookup_tag/_index) in order to read, through a lookup table, the MSHR entry for each request. These entries are sent back combinationally to the previous stage and are used to choose the request to schedule.
 
The second is the write access that modifies the MSHR (signals cc3_update_mshr_xxx) and the states of the L1 cache (signals cc3_update_coherence_state_xxx). These signals come combinationally from the next stage and are determined by the evolution of the coherence protocol.
 
  
==== MSHR ====
+
=== MSHR ===
The Miss Status Handling Register holds information about all the miss requests raised by the core that have not yet been completely served.
+
The ''Miss Status Handling Register'' holds the data of cache lines whose coherence transactions are pending, that is, cache blocks in a non-stable state. Since only one request per thread can be issued, the MSHR has as many entries as there are hardware threads.
As stated earlier, the MSHR has as many entries as there are threads, since only one request at a time can be in flight per thread.
+
 
The MSHR must provide its information combinationally to whoever queries it, and must also offer a way to modify the contents of its entries. For this reason, the MSHR structure can be split into a read part and a write part.
+
An MSHR entry comprises the following data:
The read part of the MSHR must provide this information to the other stages combinationally. Therefore, although this is not typical of a standard MSHR, the information is read through a lookup table. An MSHR entry contains the following information:
 
  
 
{| class="wikitable"
|-
! Valid
! Address
! Thread ID
! Wakeup Thread
! State
! Waiting For Eviction
! Ack Count
! Data
|-
|}
  
* Valid: indicates whether the entry is in use
+
* Valid: entry has valid data
* State: holds the current state of the request with respect to the coherence protocol, used both to
+
* Address: entry memory address
* Address: memory address the request refers to
+
* Thread ID: requesting HW thread id
* Thread ID: id of the thread that raised the request
+
* Wakeup Thread: wake up the thread when the transaction is over
* Data: data accompanying the request
+
* State: current coherence state
* Ack count: number of acks the request still has to receive
+
* Waiting for eviction: asserted for replacement requests
* Waiting for eviction: asserted if an eviction has to be performed on the request's address; meaningful only if the valid bit is also asserted.
+
* Ack count: remaining acks to receive
* Wakeup Thread: determines whether the thread has to be woken up at the end of the transaction
+
* Data: data associated with the request
  
Below is the point where the entry is read. This read is replicated for every lookup port (six in our case).
+
Note that entry's ''Data'' are stored in a separate SRAM memory in order to ease the lookup process.
  
 +
==== Implementation details ====
 +
Since the MSHR has to serve lookups from stage 1 (see [[L1 Cache Controller#MSHR Signals | lookup signals]]) and apply entry updates coming from stage 3 (see [[L1 Cache Controller#MSHR Update Logic | update signals]]) at the '''same time''', dedicated read and write ports have been implemented. <br>
 +
 +
===== Write port =====
 +
A write policy defines the ordering between writes and reads. It can be set through a boolean parameter named WRITE_FIRST. <br>
In particular, this module is instantiated with WRITE_FIRST set to false, which means that the MSHR serves read operations ''before'' write operations; write operations are '''delayed''' by one clock cycle after they have been issued from stage 3 (because a register delays the update). Here is the code for the write port:
 +
 +
// This logic is generated for each MSHR entry
 
 generate
   genvar mshr_id;
   for ( mshr_id = 0; mshr_id < `MSHR_SIZE; mshr_id++ ) begin : mshr_entries
 
       ...
       // Write policy
       if (WRITE_FIRST == "TRUE")
           // If true writes are serviced immediately
           assign data_updated[mshr_id] = (enable && update_this_index) ? update_entry : data[mshr_id];
       else
           assign data_updated[mshr_id] = data[mshr_id];
 
       ...
 
       // Data entries (set of registers)
       always_ff @(posedge clk, posedge reset) begin
           if (reset)
               data[mshr_id] <= 0;
           else if (enable && update_this_index)
               data[mshr_id] <= update_entry;
       end
 
   end
 endgenerate

 generate
   for ( i = 0; i < `MSHR_SIZE; i++ ) begin : lookup_logic
     assign hit_map[i] = /*( mshr_entries[i].address.tag == tag ) &&*/
         ( mshr_entries[i].address.index == index ) && mshr_entries[i].valid;
   end
 endgenerate
 oh_to_idx #(
     .NUM_SIGNALS( `MSHR_SIZE )
 )
 u_oh_to_idx (
     .index  ( selected_index ),
     .one_hot( hit_map        )
 );
 assign mshr_entry = mshr_entries[selected_index];
 
  
Assuming the MSHR is read-first, the address of every entry is checked: if a valid request is already pending, the corresponding bit in the hit map is asserted, which sends the corresponding entry to the output.
+
===== Read port =====
 +
The read port implements a simple hit/miss logic for requests coming from stage 1 (see [[L1 Cache Controller#MSHR Signals | lookup signals]]). The write policy determines which data this logic reads: if WRITE_FIRST is set to true, the lookup is made on the data just updated by the write logic; otherwise the lookup is made before the update, which is the case when reads have priority over writes (WRITE_FIRST is false). Here is the code for the lookup logic:
 +
 
 +
// This logic is generated for each MSHR entry
 +
...
 +
generate
 +
  for ( i = 0; i < `MSHR_SIZE; i++ ) begin : lookup_logic
 +
 +
      // data_updated[] data are set according to write policy
 +
      assign hit_map[i] = ( data_updated[i].address.index == index ) && data_updated[i].valid;
 +
 +
    end
 +
  endgenerate
 +
...
 +
 +
assign hit        = |hit_map;
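The effect of the write policy on lookups can be seen with a cycle-level model. This is a Python sketch, not the RTL: <code>MshrModel</code>, <code>Entry</code> and <code>cycle</code> are hypothetical names, entries are reduced to a valid bit and a set index, and one call to <code>cycle</code> stands for one clock cycle in which a lookup and an optional update arrive together.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    valid: bool = False
    index: int = 0  # set index of the entry address


class MshrModel:
    """Behavioural model of the MSHR read and write ports."""

    def __init__(self, size: int, write_first: bool) -> None:
        self.write_first = write_first
        self.data: List[Entry] = [Entry() for _ in range(size)]

    def cycle(self, lookup_index: int,
              update: Optional[Tuple[int, Entry]] = None) -> bool:
        """Return the hit signal seen in this cycle, then commit the update."""
        # data_updated: what the read port sees during this cycle.
        visible = list(self.data)
        if self.write_first and update is not None:
            slot, entry = update
            visible[slot] = entry          # write bypassed to the read port
        hit = any(e.valid and e.index == lookup_index for e in visible)
        # Registered write: takes effect from the next cycle onwards.
        if update is not None:
            slot, entry = update
            self.data[slot] = entry
        return hit
```

With the policy set to read-first, a lookup issued in the same cycle as an allocation misses and only hits one cycle later, which matches the one-cycle delay described for the write port.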
 +
 
 +
== Stage 3 ==
 +
Stage 3 is responsible for the actual execution of requests. Once a request is processed, this stage issues signals to the units in the above stages in order to update data properly. <br>
 +
In particular, this stage drives datapath to perform one of these functions:
 +
 
 +
* block replacement evaluation;
 +
* MSHR update;
 +
* cache memory (both data and coherence info) update.
 +
* preparing outgoing coherence messages.
 +
 
 +
=== Current State Selector ===
 +
Before a request is processed by coherence protocol the correct source of cache block state has to be chosen. These data could be retrieved from:
 +
 
 +
* MSHR;
 +
* coherence data cache;
 +
 
 +
If the block is found in neither of them, it must be in state '''I''', because it has never been read or modified.
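The selection can be summarised by the following Python sketch. The function name and the string encoding of the states are hypothetical; giving the MSHR precedence when both sources hit is an assumption, motivated by the fact that non-stable states are kept in the MSHR.

```python
from typing import Optional


def select_current_state(mshr_hit: bool, mshr_state: Optional[str],
                         cache_hit: bool, cache_state: Optional[str]) -> str:
    """Pick the block state that is fed to the protocol ROM."""
    if mshr_hit and mshr_state is not None:
        return mshr_state      # pending transaction: non-stable state
    if cache_hit and cache_state is not None:
        return cache_state     # stable state held in the coherence cache
    return "I"                 # block never read or modified
```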
 +
 
 +
=== Protocol ROM ===
 +
This module implements the coherence protocol as represented in the figure below. The choice to implement the protocol as a separate ROM has been made to ease further optimizations or changes to the protocol. It takes in input the current state and the request type and computes the next actions.
 +
 
 +
[[File:MSI_Protocol_cc-rom_p1_new.png|1100px|MSI_CC]]
 +
 
 +
[[File:MSI_Protocol_cc-rom_p2_new.png|1100px|MSI_CC]]
 +
 
 +
The coherence protocol used is MSI plus some changes due to the directory's inclusivity. In particular, a new type of forwarded request has been added, ''recall'', that is sent by directory controller when a block has to be evicted from L2 cache. A ''writeback'' response to the memory controller follows in response to a ''recall'' only when the block is in state '''M'''. Note that a ''writeback'' response is sent to the directory controller as well in order to provide a sort of acknowledgement.
  
As for the write part, the id of the entry to update is specified together with the information to store. Assuming a read-first policy, the updated information becomes available on the clock cycle after the update.
+
Furthermore, another type of request, called ''flush'', has been added; it simply sends updated data from the requestor L1 to the main memory. It also generates a ''writeback'' response, although this is directed only to the memory controller and does not affect the coherence state of the block. Flushes are often used by applications to send the output back to the main memory after the computation.
Finally, note that in the MSHR the entry data have been separated from the other information, to avoid burdening the lookup process, since they are not needed for scheduling decisions. These data are placed in the "mshr_data_sram" memory bank.
 
  
=== Stage 3 ===
+
The above table refers to a baseline protocol which explains the main logic behind the Protocol ROM. Further optimizations, such as the uncoherent states, are described in detail in [[MSI Protocol]].
The third stage is the most complex one, because it contains the coherence protocol and all the actions that have to be performed as a consequence. The core of the stage is the ROM module, which computes the actions to take based on the current state and on the scheduled incoming request. Actions and outputs are described in detail in the dedicated subsection.
 
First of all, the inputs have to be fed correctly into the ROM. The request has already been scheduled in the first stage, but the current state still has to be chosen. The choice depends on two situations: whether there is a pending request in the MSHR (cc2_request_mshr_hit) or whether the data is already present in the cache. In the first case, the state to consider is the one stored in the MSHR entry and read earlier; in the second case, the state of the cached data has to be considered, as stored in the dedicated memory module of stage 2; if neither case applies, the block has never been read or modified, so it can be considered to be in the initial state (Invalid in our case).
 
Once the state is selected, the ROM outputs, contained in the "pr_output" structure, can be read. Based on the ROM outputs, 4 different macro-outputs can be identified: replacement evaluation, MSHR update, cache line update (together with its state), and definition of the outgoing message. These actions are described in the dedicated subsections.
 
Recall that changes to the data stored in the cache controller take place two cycles after a request is scheduled. To avoid incorrect updates, all requests towards addresses that are still in the controller's pipe are blocked.
 
  
==== Replacement evaluation ====
+
=== Replacement Logic ===
A replacement is the operation triggered when new data has to be allocated in the cache but there is no room left in that set. This implies storing the new data in the cache and sending the old cached data to the upper memory levels (only if it has been modified). The signals computed here are not driven directly to the outputs, but are used to determine the other macro-outputs correctly. These signals therefore deserve a closer look:
+
A cache block replacement may occur whenever a new block has to be stored in the L1 and all the ways of the target set are busy. If a way is available, the control logic selects it and no replacement takes place. Hence, an eviction occurs only when the selected block holds valid information. Block validity is given by the privilege bits associated with it. These privilege bits (one for each way) come from Stage 2, which in turn has received them from the load/store unit. The pseudo-LRU module in Stage 2 selects the block to replace by pointing to the least recently used way.
  
 replaced_way_valid = cc2_request_snoop_privileges[cc2_request_lru_way_idx].can_read |
                      cc2_request_snoop_privileges[cc2_request_lru_way_idx].can_write;
 do_replacement     = pr_output.write_data_on_cache &&
                      !cc2_request_snoop_hit && replaced_way_valid;
 
  
From the previous stage 2 we receive the outgoing way (cc2_request_lru_way_idx) and check whether the block is still valid, i.e. whether it can at least be read or written. If so, we have to check whether the protocol has requested a write to the cache. Clearly, we also have to check whether there was a cache hit: in that case no replacement is needed, just an update of the data.
+
The address of the evicted block has to be reconstructed. Its tag is provided by the tag cache of the load/store unit (through Stage 2), while the index is taken from the requesting address, which will take its place in the cache (the two addresses share the same set). If the block is dirty, its data, stored in the data cache in Stage 2, has to be fetched and sent back to the main memory. The address offset is tied to zero, since the eviction involves the entire block.
 +
 +
replaced_way_address.tag              = cc2_request_snoop_tag[cc2_request_lru_way_idx];
 +
replaced_way_address.index            = cc2_request_address.index;
 +
replaced_way_address.offset            = {`DCACHE_OFFSET_LENGTH{1'b0}};
 +
 +
replaced_way_state                    = cc2_request_coherence_states[cc2_request_lru_way_idx];
  
==== MSHR entry update ====
+
Recapping, a replacement request is issued if:
The update of an MSHR entry depends both on the signals taken directly from the ROM and on a possible replacement to perform. When a replacement is needed, it is worth recalling first what happens in the datapath: upon receiving the replacement request, the datapath sends an eviction request for that data.
 
  
'''to do''' % The problem can arise if another request from the same thread is suspended on
+
* the protocol ROM requested a cache update due to new incoming data;
 +
* the block requested is not present in the L1 cache (so the update request must be a block allocation);
 +
* replaced block is valid.
  
For this reason the entry is pre-allocated: all the information needed for the replacement is written into the entry, and both the valid bit and the waiting_for_eviction bit are set.
+
do_replacement                        = pr_output.write_data_on_cache && !cc2_request_snoop_hit && replaced_way_valid;
When no replacement is required, the MSHR entry is updated in the following ways and cases:
 
  
always_comb begin
+
=== MSHR Update Logic ===
  cc3_update_mshr_en            = 1'b0;
+
MSHR could be updated in three different ways:
  cc3_update_mshr_entry_info.valid  = ( pr_output.allocate_mshr_entry || pr_output.update_mshr_entry )
 
      && !pr_output.deallocate_mshr_entry;
 
  cc3_update_mshr_entry_info.address              = cc2_request_address;
 
  cc3_update_mshr_entry_info.state                = pr_output.next_state;
 
  cc3_update_mshr_entry_info.thread_id  = cc2_request_mshr_hit ? cc2_request_mshr_entry_info.thread_id
 
      : cc2_request_thread_id;
 
  cc3_update_mshr_entry_info.ack_count  = pr_output.decr_ack_count ? cc2_request_mshr_entry_info.ack_count - 1
 
      : ( pr_output.req_has_ack_count ? cc2_request_sharers_count : cc2_request_mshr_entry_info.ack_count );
 
  cc3_update_mshr_entry_info.waiting_for_eviction = 1'b0;
 
  cc3_update_mshr_entry_info.wakeup_thread  = cc2_request_mshr_hit ? cc2_request_mshr_entry_info.wakeup_thread
 
      : ( cc2_request == load || cc2_request == store );
 
  cc3_update_mshr_index  = cc2_request_mshr_hit ? cc2_request_mshr_index : cc2_request_mshr_empty_index;
 
  if ( cc2_request_valid )
 
      cc3_update_mshr_en  = ( pr_output.allocate_mshr_entry || pr_output.update_mshr_entry || pr_output.deallocate_mshr_entry );
 
end
 
  
First of all, note that an MSHR update is triggered only if the protocol has explicitly requested a change to it. Note also that an entry is certainly invalid if the protocol has requested its deallocation. The ack count is computed as follows: if a decrement has been requested, the previous value is decremented; otherwise, if the request carries a new count, that count is stored; otherwise the count is left unchanged.
+
* entry allocation;
Particular attention must be paid to the index to compute. If the entry is already present in the MSHR, the request is stopped by the scheduler, except for writeback requests, which advance if a preallocation has taken place. In this case the free MSHR index cannot be used; the index already preallocated and carried by the request itself (cc2_request_mshr_index) must be used instead.
+
* entry deallocation;
 +
* entry update.
  
==== Cache line and state update ====
+
MSHR is used to store information on pending transactions. Whenever a cache line is in the MSHR it has a non-stable state, and the state stored in the MSHR is considered the most up-to-date. So a new entry is allocated every time the cache line state turns into a non-stable state. On the other hand, an entry is deallocated when the state of a cache line that was pending in the MSHR turns into a stable state, which means that the ongoing transaction is over. Finally, an update is made when the information stored in the MSHR has to change while the cache line state is still non-stable. For example, if the pending transaction is waiting for acknowledgements from all the sharers, each ack message that arrives increases the number of acks received (hence this information is updated in the MSHR), but the transaction is still ongoing until all ack messages have arrived. Each condition is represented by a signal that is properly asserted by the protocol ROM.
Whenever a cache line is modified, whether for a replacement or not, both the L1 cache line and its coherence state in the controller must be updated. An update here also includes writing a new cache line. The "cc3_update_ldst_xxx" signals are used to update the L1 lines, while the "cc3_update_coherence_state_xxx" signals are used to update the state. They are analyzed together because they are logically connected: every cache line in the L1 has its own state variable in the controller.
 
  
  assign cc3_update_ldst_command  = do_replacement ? CC_REPLACEMENT :
+
  cc3_update_mshr_en            = ( pr_output.allocate_mshr_entry || pr_output.update_mshr_entry || pr_output.deallocate_mshr_entry );
    ( pr_output.write_data_on_cache ? CC_UPDATE_INFO_DATA : CC_UPDATE_INFO );
 
assign cc3_update_ldst_way      = cc2_request_snoop_hit ? cc2_request_snoop_way_idx
 
    : cc2_request_lru_way_idx;
 
assign cc3_update_ldst_address      = cc2_request_address;
 
assign cc3_update_ldst_store_value = ( pr_output.ack_count_eqz && pr_output.req_has_data ) ?
 
    cc2_request_data : cc2_request_mshr_entry_data;
 
assign cc3_update_ldst_privileges  = pr_output.next_privileges;
 
assign cc3_wakeup_thread_id        = cc2_request_mshr_hit ? cc2_request_mshr_entry_info.thread_id
 
    : cc2_request_thread_id;
 
  
 +
Whenever the control signal <code>do_replacement</code> is asserted, an MSHR entry is ''pre-allocated''. This is necessary because otherwise the data computed by [[L1 Cache Controller#Replacement Logic|Replacement Logic]] could be lost. Stage 1 checks whether an entry is pre-allocated during scheduling by reading the <code>waiting_for_eviction</code> bit, see [[L1 Cache Controller#Request Issue Signals | Request Issue Signals]].
 +
 +
Note that when a request issued from Stage 1 allocates a new entry, the index of an empty entry is provided directly by the MSHR (through Stage 2). Remember that, due to our previous assumptions, there is always an empty MSHR entry; otherwise the request would not have been issued (see [[L1 Cache Controller#Request Issue Signals | Request Issue Signals]]). If the operation is an update or a deallocation, the index is obtained by Stage 1 by querying the MSHR for the entry associated with the current request (see [[L1 Cache Controller#MSHR Signals|MSHR Signals]]).
 +
 +
cc3_update_mshr_index      = cc2_request_mshr_hit ? cc2_request_mshr_index : cc2_request_mshr_empty_index;
 +
 +
=== Cache Update Logic ===
 +
Both the data cache and the coherence cache may be updated after a coherence transaction has been processed. The data cache update depends on whether a replacement occurs: in that case, command <code>CC_REPLACEMENT</code> is issued to the load/store unit, which ensures that the load/store unit prepares the block for eviction. Otherwise, the cache block is updated: if the update involves only privileges, the <code>CC_UPDATE_INFO</code> command is issued; if both the new block and its privileges are written into the L1 cache, <code>CC_UPDATE_INFO_DATA</code> is issued.
 +
 +
// Data cache signals
 +
assign cc3_update_ldst_command          = do_replacement ? CC_REPLACEMENT : ( pr_output.write_data_on_cache ? CC_UPDATE_INFO_DATA : CC_UPDATE_INFO );
 +
assign cc3_update_ldst_way              = cc2_request_snoop_hit ? cc2_request_snoop_way_idx : cc2_request_lru_way_idx;
 +
...
 +
 
 +
// Coherence cache signals
 
  assign cc3_update_coherence_state_index = cc2_request_address.index;
 
  assign cc3_update_coherence_state_way  = cc2_request_snoop_hit ? cc2_request_snoop_way_idx  
+
  assign cc3_update_coherence_state_way  = cc2_request_snoop_hit ? cc2_request_snoop_way_idx : cc2_request_lru_way_idx;
    : cc2_request_lru_way_idx;
 
 
  assign cc3_update_coherence_state_entry = pr_output.next_state;
 
  
always_comb begin
+
Data cache is updated whenever updating privileges for a block in the L1 is necessary, or whenever a new block is received and has to be stored in the cache along with its privileges.
    cc3_update_ldst_valid = 0;
 
    cc3_update_coherence_state_en = 0;
 
    if ( cc2_request_valid ) begin
 
      cc3_update_ldst_valid  = ( pr_output.update_privileges && cc2_request_snoop_hit )
 
          || pr_output.write_data_on_cache;
 
      cc3_update_coherence_state_en = ( pr_output.next_state_is_stable && cc2_request_snoop_hit )
 
          || ( pr_output.next_state_is_stable && !cc2_request_snoop_hit && pr_output.write_data_on_cache );
 
    end
 
end
 
 
 
The L1 cache update is validated (cc3_update_ldst_valid) whenever a write ordered by the ROM must be performed, or when only the privileges of an already existing line must be updated. Note that the update condition is loose: it fires whenever data must be written to the cache, without checking whether there is a hit, unlike the replacement. The difference lies in the command sent to the datapath. If a replacement has been ordered (CC_REPLACEMENT), the command forces the datapath to write the data and to push out the replaced line through an evict. For a direct write a different command is sent (CC_UPDATE_INFO_DATA), which in turn differs from the command that updates only the privileges of the cache line (CC_UPDATE_INFO). The way to update must also be checked: if there was a hit in the L1 cache, the same way must be used; otherwise the way provided by the LRU can be taken. It is also worth noting the check on the data to write: if the protocol determines that the request carries data and the ack count is zero, the data of the request must be written, otherwise the data in the MSHR. This happens, for example, when a response to a getM is received and the last response is the data itself.
 
The cache line state update (cc3_update_coherence_state_en) happens in two cases: either there was a hit and the next state is stable, or there was no hit but the next state is stable and a cache write is required. The next state is always required to be stable because the state is saved only when it is stable; otherwise the transient state is found in the MSHR. The way is chosen with the same criterion used for the cache line, so that the two stay aligned.
 
 
 
==== Output message definition ====
 
The only registered outputs of this stage are produced when the incoming request requires a message to be sent to the next stage (and then to the network).
 
  
  always_ff @( posedge clk ) begin
+
  ...
    cc3_message_valid    <= cc2_request_valid && ( pr_output.send_response || pr_output.send_request );
+
cc3_update_ldst_valid         = ( pr_output.update_privileges && cc2_request_snoop_hit ) || pr_output.write_data_on_cache;
    cc3_message_is_response          <= pr_output.send_response;
+
...
    cc3_message_request_type         <= pr_output.request;
 
    cc3_message_response_type        <= pr_output.response;
 
    ...
 
    cc3_message_has_data            <= pr_output.send_data_from_cache ||
 
    pr_output.send_data_from_mshr || pr_output.send_data_from_request;
 
    cc3_message_data                <= pr_output.send_data_from_mshr ?
 
      cc2_request_mshr_entry_data : cc2_request_data;
 
end
 
  
==== Coherence protocol ROM module ====
+
The code above gives the condition for updating a line in the L1 cache. The cache is updated whenever the block state becomes stable (its transaction is over and it has been deallocated from the MSHR) and there is a cache hit (the block is already stored in the cache). Otherwise, the update happens whenever the coherence protocol requires it, which is signalled through the <code>pr_output.write_data_on_cache</code> bit, an output of the protocol ROM.
In order to keep a coherent state for all the data in the system, all the controllers associated with memory structures must perform a series of actions known as the coherence protocol. Each protocol can be seen as a state machine characterized by a set of stable and transient states whose transitions are triggered by a set of requests. The states, the requests and the actions performed by the coherence protocol are all fixed in advance at design time. For this reason it is convenient to store the protocol in a ROM, which simplifies the design and above all makes the protocol easy to swap: implementing another coherence protocol in the future only requires changing the ROM contents.

Three stable states are defined: Modified, Shared and Invalid. This protocol is known as MSI and is the most common and best known coherence protocol in use. The transient states are used to accept and properly handle requests for the same cache line while it is moving from one stable state to another.

The requests that can be made are of three types:

The module that implements the protocol consists of a large combinational selection construct, which takes as input the current state (suitably selected) and the scheduled request.

The macro-actions to perform are explained in the table below.
 
  
[[File:MSI_CC.jpg|1000px|MSI_CC]]
+
...
 +
cc3_update_coherence_state_en = ( pr_output.next_state_is_stable && cc2_request_snoop_hit ) || ( pr_output.next_state_is_stable && !cc2_request_snoop_hit && pr_output.write_data_on_cache );
 +
...
  
For each request, one or more acks are awaited until the count reaches zero and the transaction can be concluded (deallocating the MSHR entry and saving the stable state in the cache).
+
Furthermore, if the request is a forwarded coherence request then L1 cache data are forwarded to the message generator in stage 4 in order to be sent to the requestor.
Where nothing is written, the protocol can never reach that state/request combination.

Most of the behaviours in this protocol are well known in the literature, but a few points are worth noting:
 
  
* The recall message is sent by the directory controller and is meant as a back-invalidation message: the directory controller (which also manages the level 2 cache) invalidates a cache line of the level below whenever a level 2 cache line is evicted; this is done to preserve the inclusivity property of the cache hierarchy
+
assign cc3_snoop_data_valid            = cc2_request_valid && pr_output.send_data_from_cache;
* the writeback message is sent to the memory controller in response to a recall message, in order to write the cache line to main memory
+
assign cc3_snoop_data_set              = cc2_request_address.index;
* All this implies that a cache controller only sends writeback messages and only receives recall messages (never the other way round)
+
assign cc3_snoop_data_way              = cc2_request_snoop_way_idx;
* the responses to a flush differ from those to a replacement because a flush sends the data to memory with a WB without changing the state of the data in the cache, whereas a replacement sends the data to the directory following an evict of the data in the cache (if necessary), in order to become the new owner

* the responses to a flush differ from those to a recall because, although both send a writeback message to memory, the flush sends it only to the memory controller and without modifying the cache state, whereas the recall invalidates the data in the cache and also sends a WB to the directory controller. The WB to the directory controller is intended to provide an ack to the recall.

* WB messages in response to a recall are not sent if the line is in state S, because the cache is not the owner of the data (the directory controller itself will take care of it)
 
  
 
== Stage 4 ==
 
This stage sends a request message to the network interface whenever one is available. Messages are formatted properly with the coherence protocol.
+
This stage generates a correct request/response message for the network interface whenever a message is issued from the third stage.
  
 
== See Also ==
 
 
[[Coherence]]
 

Latest revision as of 14:21, 1 July 2019

Cache controller manages the L1 cache. In particular, it handles only coherence information (such as states) since L1 data cache is managed by load/store unit.

LDST_CC

The component is composed of 4 stages:

  • stage 1: schedules a pending request to issue (from local core or network);
  • stage 2: contains coherence cache and MSHR;
  • stage 3: processes a request properly with coherence protocol;
  • stage 4: prepares coherence request/response to be sent on the network.

All these stages are represented in the figure below. The component has been realized in a pipelined fashion in order for the controller to be able to serve multiple requests at the same time.

Assumptions

The design has been driven by these assumptions:

  • cache controller schedules a request only when no other requests on the same address are pending in the pipeline;
  • coherence transactions are at the memory block level;
  • cache controller does not schedule a request when another request in the pipeline has the same set;
  • only requests from the local core (load, store, replacement) allocate MSHR entries;
  • information regarding cache blocks in a non-stable state is stored in the MSHR.

Two requests on the same block cannot be issued one after another; when the first request is issued, it may modify an MSHR entry after two clock cycles (stage 3), hence the second request may read a non-up-to-date entry.

Stage 1

Stage 1 is responsible for the scheduling of requests into the controller. A request could be a load miss, store miss, flush and replacement request from the local core or a coherence forwarded request or response from the network interface.

MSHR Signals

The arbiter in the first stage checks if a pending request can be issued. To be eligible for scheduling, a request must target a block for which no other request is ongoing (or under elaboration) in the cache controller. Ongoing requests are stored in the MSHR table. For each type of pending request, the arbiter sends the tag and set to the MSHR, which returns the matching entry to Stage 1. The arbiter uses this information on ongoing transactions to select a pending request. The MSHR provides a look-up port for each type of request, together with a hit signal; the request is considered to be in the MSHR only if this signal is asserted:

// Signals to MSHR
assign cc1_mshr_lookup_tag[MSHR_LOOKUP_PORT_LOAD ]   = ci_load_request_address.tag;
assign cc1_mshr_lookup_set[MSHR_LOOKUP_PORT_LOAD ]   = ci_load_request_address.index;

// Signals from MSHR
assign load_mshr_hit                                 = cc2_mshr_lookup_hit[MSHR_LOOKUP_PORT_LOAD ];
assign load_mshr_index                               = cc2_mshr_lookup_index[MSHR_LOOKUP_PORT_LOAD ];
assign load_mshr                                     = cc2_mshr_lookup_entry_info[MSHR_LOOKUP_PORT_LOAD ];

Stall Protocol ROMs

In order to be compliant with the coherence protocol, incoming requests on blocks in a non-stable state might have to be stalled. This task is performed by a series of protocol ROMs (one for each request type), each of which asserts its output when the issue of the corresponding request ought to be stalled, e.g. when a block is in state SM_A and a Fwd_GetS, Fwd_GetM, recall, flush, store or replacement request for the same block is received. To drive this signal, the protocol ROM needs the type of the request and the current state of the block. The module stall_protocol_rom implements this logic:

stall_protocol_rom load_stall_protocol_rom (
  .current_request ( load            ),
  .current_state   ( load_mshr.state ),
  .pr_output_stall ( stall_load      )
);
stall_protocol_rom store_stall_protocol_rom (
   .current_request ( store            ),
   .current_state   ( store_mshr.state ),
   .pr_output_stall ( stall_store      )
);
stall_protocol_rom flush_stall_protocol_rom (
   .current_request ( flush            ),
   .current_state   ( flush_mshr.state ),
   .pr_output_stall ( stall_flush      )
);
stall_protocol_rom replacement_stall_protocol_rom (
    .current_request ( replacement            ),
    .current_state   ( replacement_mshr.state ),
    .pr_output_stall ( stall_replacement      )
);
stall_protocol_rom forwarded_stall_protocol_rom (
   .current_request ( fwd_2_creq( ni_forwarded_request.packet_type ) ),
   .current_state   ( forwarded_request_mshr.state                   ),
   .pr_output_stall ( stall_forwarded_request                        )
);
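
As a rough illustration of what such a ROM might contain, the following minimal sketch encodes only the SM_A example quoted above. The package, the type names and the reduced state and request sets are assumptions made for this example; the actual stall_protocol_rom covers more states and requests.

```systemverilog
// Hypothetical, reduced encodings: not the ones used by the real design.
package stall_rom_sketch_pkg;
  typedef enum logic [1:0] { ST_I, ST_S, ST_M, ST_SM_A } state_t;
  typedef enum logic [2:0] { REQ_LOAD, REQ_STORE, REQ_REPLACEMENT, REQ_FLUSH,
                             REQ_FWD_GETS, REQ_FWD_GETM, REQ_RECALL } request_t;
endpackage

module stall_protocol_rom_sketch
  import stall_rom_sketch_pkg::*;
(
  input  request_t current_request,
  input  state_t   current_state,
  output logic     pr_output_stall
);

  always_comb begin
    case ( current_state )
      // Requests listed in the text as stalled while the block is in SM_A
      ST_SM_A : pr_output_stall = current_request inside { REQ_FWD_GETS, REQ_FWD_GETM, REQ_RECALL,
                                                           REQ_FLUSH, REQ_STORE, REQ_REPLACEMENT };
      // Every other state is left unstalled in this sketch (assumption)
      default : pr_output_stall = 1'b0;
    endcase
  end

endmodule
```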

Note that response messages are never stalled by the coherence protocol; they are held back only if a pending request with the same set index is already in the pipeline:

assign can_issue_response                                      = ni_response_valid &
	!(
		( cc2_pending_valid && ( ni_response.memory_address.index == cc2_pending_address.index ) ) ||
		( cc3_pending_valid && ( ni_response.memory_address.index == cc3_pending_address.index ) )
	);

Issuing a Request

In order to issue a generic request, it is required that:

  • MSHR has no pending requests for the same block;
  • if the request is already in MSHR it has to be not valid;
  • if the request is already in MSHR and valid it must not have been stalled by the protocol ROM (see stall signals);
  • further stages are not serving a request on the same address (see assumptions);
  • network interface is available.
assign can_issue_load = ci_load_request_valid && 

       ( !load_mshr_hit || 
            ( load_mshr_hit && !load_mshr.valid) ||
             ( load_mshr_hit && load_mshr.valid  && load_mshr.address.tag == ci_load_request_address.tag && !stall_load ) ) &&

       ! (( cc2_pending_valid && ( ci_load_request_address.index == cc2_pending_address.index ) ) ||
            ( cc3_pending_valid && ( ci_load_request_address.index == cc3_pending_address.index ) ))  &&

       ni_request_network_available;

Response messages do not need feedback from the MSHR since they do not allocate a new entry and they are never stalled. The same goes for flush requests, even though they could be stalled by the corresponding stall protocol ROM.

Finally a replacement request could be pre-allocated in MSHR (see MSHR update logic). In order for this request to be issued before every other request on the same block, an additional condition is added:

assign can_issue_replacement = 
...
( !replacement_mshr_hit ||
    ( replacement_mshr_hit && !replacement_mshr.valid) ||
    ( replacement_mshr_hit && replacement_mshr.valid && ( !stall_replacement || replacement_mshr.waiting_for_eviction ) ) ) 
...

Note that the control logic does not check whether the MSHR has free entries. The following assumption makes this check unnecessary: only one request per thread can be issued and only threads can allocate an MSHR entry, so sizing the MSHR to twice the number of threads is sufficient for it never to be full. In the worst case, the MSHR holds a pending request and a pending replacement per thread.

Requests Scheduler

Once the conditions for the issue have been verified, two or more requests could be ready at the same time so a scheduler must be used. Every request has a fixed priority whose order is set as below:

  1. flush
  2. dinv
  3. replacement
  4. store miss
  5. coherence response
  6. coherence forwarded request
  7. load miss
  8. recycled response

Once a type of request has been scheduled, this block drives the output signals for the second stage accordingly.
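
The fixed-priority selection described above can be sketched as a priority encoder. This is only an illustration: the module name, the port names and the packing of the eight can-issue conditions into a single vector (bit 0 = flush, the highest priority, up to bit 7 = recycled response) are assumptions, not the signals of the actual scheduler.

```systemverilog
module request_scheduler_sketch (
  input  logic [7:0] can_issue,    // bit 0 = flush ... bit 7 = recycled response (assumed packing)
  output logic       grant_valid,  // at least one request can be issued
  output logic [2:0] grant_idx     // index of the scheduled request type
);

  always_comb begin
    grant_valid = 1'b0;
    grant_idx   = '0;
    // Scan from the lowest to the highest priority: the last assignment
    // wins, so the lowest asserted index is the one granted.
    for ( int i = 7; i >= 0; i-- ) begin
      if ( can_issue[i] ) begin
        grant_valid = 1'b1;
        grant_idx   = 3'(i);
      end
    end
  end

endmodule
```

With this ordering, a flush and a load miss that become ready in the same cycle result in the flush being granted, and the load miss waits for a later cycle.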

Stage 2

Stage 2 is responsible for managing L1 cache, the MSHR and forwarding signals from Stage 1 to Stage 3. It simply contains the L1 coherence cache (L1 data cache is in load/store unit) and all related logic for managing cache hits and block replacement. The policy used to replace a block is LRU (Least Recently Used).
This module receives signals from stage 3 to update MSHR and coherence cache properly once a request is processed and from load/store unit to update LRU every time a block is accessed from the core.

Hit/miss logic

Lookup phase is split in two parts performed by:

  1. load/store unit;
  2. cache controller (stage 2).

Load/store unit performs the first lookup using only the request's set; it returns an array of tags (one per way) belonging to the same set as the request, along with their privilege bits. This first lookup is performed while the request is in cache controller stage 1. The second phase of the lookup is performed by cache controller stage 2 using only the request's tag; this search is performed on the array provided by the load/store unit. If there is a block with the same tag and the block is valid (its validity is checked through its privilege bits) then a hit occurs and the way index of that block is provided to stage 3. The way index is used by stage 3 to update the coherence data of that block.
If there is no block with the same tag as the request's and no hit occurs, stage 3 takes the way index provided by LRU unit in order to replace that block (see replacement logic).

...
// Second phase lookup
// Result of this lookup is an array one-hot codified 
assign snoop_tag_way_oh[dcache_way] = ( ldst_snoop_tag[dcache_way] == cc1_request_address.tag ) & ( ldst_snoop_privileges[dcache_way].can_read | ldst_snoop_privileges[dcache_way].can_write );
...

assign snoop_tag_hit       = |snoop_tag_way_oh;

Note that whenever a request arrives in stage 2 its way index in the data cache is not known yet (since hit/miss logic is computing it at the same time), hence the coherence cache is looked up using only the request's set. The result of this lookup is forwarded to stage 3, which processes it. Stage 3 knows which way index to use for selecting the correct data because, in the meantime, the hit/miss logic has provided it.

The choice of splitting lookup into two separate phases has been made in order to reduce the latency of the entire process.
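
Since the second lookup phase yields a one-hot vector, a way index still has to be derived from it before stage 3 can use it. A minimal sketch of such a conversion is given below; the module name, the port names and the WAYS parameter are assumptions made for this example, and the actual design may use a different encoder.

```systemverilog
module oh_to_idx_sketch #(
  parameter int WAYS = 4                     // assumed associativity
) (
  input  logic [WAYS - 1 : 0]         way_oh,  // one-hot hit vector (e.g. snoop_tag_way_oh)
  output logic [$clog2(WAYS) - 1 : 0] way_idx  // binary index of the hitting way
);

  localparam int IDX_W = $clog2(WAYS);

  always_comb begin
    way_idx = '0;
    // With a one-hot input at most one term is OR-ed in; with no bit set
    // the output stays at zero, so it is meaningful only on a hit.
    for ( int w = 0; w < WAYS; w++ )
      if ( way_oh[w] )
        way_idx |= IDX_W'(w);
  end

endmodule
```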

MSHR

The Miss Status Handling Register (MSHR) is used to handle the data of cache lines whose coherence transactions are pending, that is, cache blocks in a non-stable state. Bear in mind that only one request per thread can be issued; the MSHR has as many entries as the number of hardware threads.

An MSHR entry comprises the following data:

Valid Address Thread ID Wakeup Thread State Waiting For Eviction Ack Count Data
  • Valid: entry has valid data
  • Address: entry memory address
  • Thread ID: requesting HW thread id
  • Wakeup Thread: wakeup thread when the transaction is over
  • State: actual coherence state
  • Waiting for eviction: asserted for replacement requests
  • Ack count: remaining acks to receive
  • Data: data associated to request

Note that entry's Data are stored in a separate SRAM memory in order to ease the lookup process.
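
The entry layout listed above could be captured by a packed struct along the following lines. The field names follow the signals used elsewhere on this page, while the package name, the type name and all the widths are placeholder assumptions. The Data field is left out on purpose, since it is kept in a separate memory.

```systemverilog
package mshr_entry_sketch_pkg;

  // All widths below are placeholder assumptions for this sketch.
  localparam int ADDRESS_WIDTH   = 32;
  localparam int THREAD_ID_WIDTH = 3;
  localparam int STATE_WIDTH     = 4;
  localparam int ACK_COUNT_WIDTH = 4;

  typedef struct packed {
    logic                           valid;                // entry holds valid data
    logic [ADDRESS_WIDTH - 1 : 0]   address;              // entry memory address
    logic [THREAD_ID_WIDTH - 1 : 0] thread_id;            // requesting HW thread
    logic                           wakeup_thread;        // wake the thread up at the end of the transaction
    logic [STATE_WIDTH - 1 : 0]     state;                // current (non-stable) coherence state
    logic                           waiting_for_eviction; // asserted for replacement requests
    logic [ACK_COUNT_WIDTH - 1 : 0] ack_count;            // remaining acks to receive
  } mshr_entry_sketch_t;

endpackage
```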

Implementation details

Since MSHR has to provide a lookup service to stage 1 (see lookup signals) and update entries coming from stage 3 (see update signals) at the same time, dedicated read and write ports have been implemented for this purpose.

Write port

A write policy defines the order between writes and reads. This policy can be set through a boolean parameter named WRITE_FIRST.
In particular, this module is instantiated with WRITE_FIRST set to false, which means the MSHR serves read operations before write operations; write operations are delayed by one clock cycle after they have been issued from stage 3 (because a register delays the update). Here is the code for the write port:

// This logic is generated for each MSHR entry
generate
  genvar mshr_id;
    for ( mshr_id = 0; mshr_id < `MSHR_SIZE; mshr_id++ ) begin : mshr_entries
 
      ...
      // Write policy
      if (WRITE_FIRST == "TRUE")
         // If true writes are serviced immediately
         assign data_updated[mshr_id] = (enable && update_this_index) ? update_entry : data[mshr_id];
      else
         assign data_updated[mshr_id] = data[mshr_id];

      ...
      
      // Data entries (set of registers)
      always_ff @(posedge clk, posedge reset) begin
         if (reset)
            data[mshr_id] <= 0;
         else if (enable && update_this_index)
            data[mshr_id] <= update_entry;
      end

    end
endgenerate
Read port

Read port implements a simple hit/miss logic for requests coming from stage 1 (see lookup signals). The write policy determines which data this logic reads: if WRITE_FIRST is set to true, the lookup is made on data just updated by the write logic; otherwise the lookup is made before the update, which is the case when reads have priority over writes (WRITE_FIRST is false). Here is the code for the lookup logic:

// This logic is generated for each MSHR entry
...
generate
  for ( i = 0; i < `MSHR_SIZE; i++ ) begin : lookup_logic

      // data_updated[] data are set according to write policy
      assign hit_map[i] = ( data_updated[i].address.index == index ) && data_updated[i].valid;

   end
 endgenerate
...

assign hit        = |hit_map;

Stage 3

Stage 3 is responsible for the actual execution of requests. Once a request is processed, this stage issues signals to the units in the above stages in order to update data properly.
In particular, this stage drives datapath to perform one of these functions:

  • block replacement evaluation;
  • MSHR update;
  • cache memory (both data and coherence info) update;
  • preparing outgoing coherence messages.

Current State Selector

Before a request is processed by coherence protocol the correct source of cache block state has to be chosen. These data could be retrieved from:

  • MSHR;
  • coherence data cache;

If none of the conditions above are met then cache block must be in state I because it has not been ever read or modified.
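
The selection among these sources can be sketched as a small priority multiplexer. The sketch assumes that the MSHR takes precedence over the coherence cache (consistent with the MSHR state being considered the most up-to-date) and that state I is encoded as zero; the module name, the port names and the parameters are hypothetical.

```systemverilog
module current_state_selector_sketch #(
  parameter int                         WAYS        = 4,   // assumed associativity
  parameter int                         STATE_WIDTH = 4,   // assumed state encoding width
  parameter logic [STATE_WIDTH - 1 : 0] STATE_I     = '0   // assumed encoding of state I
) (
  input  logic                                      mshr_hit,
  input  logic                                      mshr_entry_valid,
  input  logic [STATE_WIDTH - 1 : 0]                mshr_state,
  input  logic                                      snoop_hit,
  input  logic [$clog2(WAYS) - 1 : 0]               snoop_way_idx,
  input  logic [WAYS - 1 : 0][STATE_WIDTH - 1 : 0]  coherence_states,
  output logic [STATE_WIDTH - 1 : 0]                current_state
);

  always_comb begin
    if ( mshr_hit && mshr_entry_valid )
      current_state = mshr_state;                       // pending transaction: non-stable state
    else if ( snoop_hit )
      current_state = coherence_states[snoop_way_idx];  // stable state from the coherence cache
    else
      current_state = STATE_I;                          // block never read or modified
  end

endmodule
```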

Protocol ROM

This module implements the coherence protocol as represented in the figure below. The choice to implement the protocol as a separate ROM has been made to ease further optimizations or changes to the protocol. It takes in input the current state and the request type and computes the next actions.

MSI_CC


The coherence protocol used is MSI plus some changes due to the directory's inclusivity. In particular, a new type of forwarded request has been added, recall, that is sent by directory controller when a block has to be evicted from L2 cache. A writeback response to the memory controller follows in response to a recall only when the block is in state M. Note that a writeback response is sent to the directory controller as well in order to provide a sort of acknowledgement.

Furthermore, another type of request, called flush, has been added; it simply sends updated data from the requestor's L1 to the main memory. It also generates a writeback response, although this is directed only to the memory controller and does not affect the coherence state of the block. Flushes are often used in applications to send the output back to the main memory after the computation.

The above table refers to a baseline protocol which explains the main logic behind the Protocol ROM. Further optimizations, such as the uncoherent states, are described in detail in MSI Protocol.
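
To give an idea of how a ROM of this kind maps a state/request pair to a set of actions, here is a heavily reduced sketch covering three transitions of a textbook MSI protocol. The transient state names (IS_D, IM_AD), the encodings and the reduced output struct are assumptions borrowed from the standard MSI description; they do not reproduce the actual table, and only the output field names mirror the <code>pr_output</code> signals used on this page.

```systemverilog
// Hypothetical, reduced types: the real ROM has many more states, requests and outputs.
package protocol_rom_sketch_pkg;
  typedef enum logic [2:0] { ST_I, ST_S, ST_M, ST_IS_D, ST_IM_AD } state_t;
  typedef enum logic [1:0] { REQ_LOAD, REQ_STORE, REQ_DATA } request_t;

  typedef struct packed {
    state_t next_state;
    logic   next_state_is_stable;
    logic   allocate_mshr_entry;
    logic   deallocate_mshr_entry;
    logic   send_request;
    logic   write_data_on_cache;
  } rom_output_t;
endpackage

module protocol_rom_sketch
  import protocol_rom_sketch_pkg::*;
(
  input  state_t      current_state,
  input  request_t    current_request,
  output rom_output_t pr_output
);

  always_comb begin
    // Default: no action, state unchanged
    pr_output            = '0;
    pr_output.next_state = current_state;

    case ( {current_state, current_request} )
      // Load miss on an invalid block: request the block, track it in the MSHR
      {ST_I, REQ_LOAD} : begin
        pr_output.next_state          = ST_IS_D;
        pr_output.allocate_mshr_entry = 1'b1;
        pr_output.send_request        = 1'b1;
      end
      // Store miss on an invalid block: request ownership, track it in the MSHR
      {ST_I, REQ_STORE} : begin
        pr_output.next_state          = ST_IM_AD;
        pr_output.allocate_mshr_entry = 1'b1;
        pr_output.send_request        = 1'b1;
      end
      // Data arrives for a pending load: the transaction ends in stable state S
      {ST_IS_D, REQ_DATA} : begin
        pr_output.next_state            = ST_S;
        pr_output.next_state_is_stable  = 1'b1;
        pr_output.deallocate_mshr_entry = 1'b1;
        pr_output.write_data_on_cache   = 1'b1;
      end
      default : ;
    endcase
  end

endmodule
```

Keeping the table in a single combinational case statement is what makes the protocol easy to change: another protocol only needs different case items and, possibly, different output fields.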

Replacement Logic

A cache block replacement might occur whenever a new block has to be stored into the L1 and all the ways of the target set are busy. If a way is available, the control logic selects it, avoiding a replacement. Hence, an eviction occurs only when the selected block holds valid information. Block validity is indicated by the privilege bits associated with it. These privilege bits (one for each way) come from Stage 2, which in turn has received them from the load/store unit. The pseudo-LRU module, in Stage 2, selects the block to replace by pointing to the least recently used way.

replaced_way_valid                     = cc2_request_snoop_privileges[cc2_request_lru_way_idx].can_read | cc2_request_snoop_privileges[cc2_request_lru_way_idx].can_write;

The address of the block being evicted has to be reconstructed. In particular, its tag is provided by the tag cache of the load/store unit (through Stage 2), while the index is taken from the requesting address that will take its place in the cache (since the two addresses map to the same set). In case of a dirty block, its data, stored in the data cache in Stage 2, has to be fetched and sent back to the main memory. The address offset is set to zero, since the eviction involves the entire block.

replaced_way_address.tag               = cc2_request_snoop_tag[cc2_request_lru_way_idx];
replaced_way_address.index             = cc2_request_address.index;
replaced_way_address.offset            = {`DCACHE_OFFSET_LENGTH{1'b0}};

replaced_way_state                     = cc2_request_coherence_states[cc2_request_lru_way_idx];
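The address reconstruction can be pictured with a small Python model. This is a sketch, not the RTL: the field widths `OFFSET_BITS` and `INDEX_BITS` are assumed values chosen for the example (the real widths come from the cache configuration), and the function name is hypothetical.

```python
OFFSET_BITS = 6   # assumed for illustration: 64-byte blocks
INDEX_BITS = 6    # assumed for illustration: 64 sets

def replaced_way_address(snoop_tags, lru_way, request_address):
    """Rebuild the full address of the block about to be evicted.

    The tag comes from the way selected by the pseudo-LRU, the index is
    borrowed from the incoming request (same set), the offset is zero.
    """
    index = (request_address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = snoop_tags[lru_way]
    return (tag << (INDEX_BITS + OFFSET_BITS)) | (index << OFFSET_BITS)

# Example: request with tag 0x12, index 5, offset 0x2A; victim tag is 0x7.
request = (0x12 << 12) | (5 << 6) | 0x2A
victim = replaced_way_address([0x1, 0x7, 0x3, 0x9], lru_way=1, request_address=request)
```

In this example the victim address keeps index 5 from the request, takes tag 0x7 from the selected way, and has all offset bits cleared.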

Recapping, a replacement request is issued if:

  • the protocol ROM requests a cache update because new data is incoming;
  • the requested block is not present in the L1 cache (so the update request must be a block allocation);
  • the replaced block is valid.

do_replacement                        = pr_output.write_data_on_cache && !cc2_request_snoop_hit && replaced_way_valid;
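The three conditions can be checked with a short Python model of the same expression. It is a behavioural sketch only; the `Privileges` class and the function signature are hypothetical stand-ins for the per-way privilege bits and the Stage 2 signals.

```python
from dataclasses import dataclass

@dataclass
class Privileges:
    can_read: bool
    can_write: bool

def do_replacement(write_data_on_cache, snoop_hit, privileges, lru_way):
    """Return True when storing the incoming block forces an eviction."""
    victim = privileges[lru_way]
    # A way holds valid information if it can be read or written.
    replaced_way_valid = victim.can_read or victim.can_write
    return write_data_on_cache and not snoop_hit and replaced_way_valid

ways = [Privileges(True, False), Privileges(False, False)]
```

With `ways` as above, a miss that allocates into way 0 evicts a valid block, while allocating into way 1 (no privileges) does not require a replacement.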

MSHR Update Logic

The MSHR can be updated in three different ways:

  • entry allocation;
  • entry deallocation;
  • entry update.

The MSHR is used to store information on pending transactions. Whenever a cache line is in the MSHR it has a non-stable state, and the state stored in the MSHR is considered the most up-to-date. So a new entry is allocated every time the cache line state turns into a non-stable state. On the other hand, an entry is deallocated when the state of a cache line that was pending in the MSHR turns into a stable state; this means that the ongoing transaction is over. Finally, an update is made when the information stored in the MSHR has to change while the cache line state is still non-stable. For example, if the pending transaction is waiting for acknowledgements from all the sharers, each ack message that arrives increases the number of acks received (so this information is updated in the MSHR), but the transaction is still ongoing until all ack messages have arrived. Each condition is represented by a signal that is properly asserted by the protocol ROM.

cc3_update_mshr_en            = ( pr_output.allocate_mshr_entry || pr_output.update_mshr_entry || pr_output.deallocate_mshr_entry );
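The allocation, update and deallocation rules can be summarised in a small Python model. This is a simplification under stated assumptions: in the controller the three signals are produced directly by the protocol ROM, whereas the hypothetical `mshr_operation` function below derives them from the rule described in the text.

```python
def mshr_operation(in_mshr, next_state_is_stable, info_changed=False):
    """Classify the MSHR operation for one processed request.

    in_mshr:              the line already has a pending entry
    next_state_is_stable: the protocol moves the line to a stable state
    info_changed:         stored information changes (e.g. an ack arrived)
    """
    if not in_mshr:
        # A line leaving a stable state starts a transaction.
        return None if next_state_is_stable else "allocate"
    if next_state_is_stable:
        # The pending transaction is over.
        return "deallocate"
    return "update" if info_changed else None

def update_mshr_en(operation):
    """Mirror of cc3_update_mshr_en: any of the three operations enables it."""
    return operation is not None
```

For the ack-counting example, the first miss yields `"allocate"`, each intermediate ack yields `"update"`, and the last ack, which brings the line to a stable state, yields `"deallocate"`.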

Whenever the control signal do_replacement is asserted, an MSHR entry is pre-allocated. This is necessary because otherwise the data computed by the Replacement Logic could be lost. During scheduling, Stage 1 checks whether an entry is pre-allocated by reading the waiting_for_eviction bit (see Request Issue Signals).

Note that when a request issued by Stage 1 allocates a new entry, the index of an empty entry is provided directly by the MSHR (through Stage 2). Remember that, due to our previous assumptions, there is surely an empty MSHR entry; otherwise, the request would not have been issued (see Request Issue Signals). If the operation is an update or a deallocation, the index is obtained by Stage 1, which queries the MSHR for the index of the entry associated with the current request (see MSHR Signals).

cc3_update_mshr_index      = cc2_request_mshr_hit ? cc2_request_mshr_index : cc2_request_mshr_empty_index;

Cache Update Logic

Both the data cache and the coherence cache can be updated after a coherence transaction has been processed. The data cache update depends on whether a replacement occurs: in that case, the CC_REPLACEMENT command is issued to the load/store unit, which ensures that the load/store unit prepares the block for eviction. Otherwise, the cache block itself has to be updated: if the update involves only privileges, the CC_UPDATE_INFO command is issued, whereas CC_UPDATE_INFO_DATA is issued when both the new block and its privileges are written into the L1 cache.

// Data cache signals
assign cc3_update_ldst_command          = do_replacement ? CC_REPLACEMENT : ( pr_output.write_data_on_cache ? CC_UPDATE_INFO_DATA : CC_UPDATE_INFO );
assign cc3_update_ldst_way              = cc2_request_snoop_hit ? cc2_request_snoop_way_idx : cc2_request_lru_way_idx;
...
 
// Coherence cache signals
assign cc3_update_coherence_state_index = cc2_request_address.index;
assign cc3_update_coherence_state_way   = cc2_request_snoop_hit ? cc2_request_snoop_way_idx : cc2_request_lru_way_idx;
assign cc3_update_coherence_state_entry = pr_output.next_state;

The data cache is updated whenever the privileges of a block in the L1 have to be updated, or whenever a new block is received and has to be stored in the cache along with its privileges.

...
cc3_update_ldst_valid         = ( pr_output.update_privileges && cc2_request_snoop_hit ) || pr_output.write_data_on_cache;
...

The code above describes the condition for updating a line in the L1 cache. The cache is updated whenever the block state becomes stable (its transaction is over and it has been deallocated from the MSHR) and there is a cache hit (the block is already stored in the cache). Otherwise, whenever the coherence protocol requires a cache update, this is signalled through the pr_output.write_data_on_cache bit, an output of the protocol ROM.

...
cc3_update_coherence_state_en = ( pr_output.next_state_is_stable && cc2_request_snoop_hit ) || ( pr_output.next_state_is_stable && !cc2_request_snoop_hit && pr_output.write_data_on_cache );
...
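The two enable conditions can be cross-checked with a short Python model. It is a sketch of the boolean expressions only, with hypothetical function names; it also shows that the coherence state enable reduces to "the next state is stable and the block is either already in the cache or being written into it".

```python
from itertools import product

def update_ldst_valid(update_privileges, write_data_on_cache, snoop_hit):
    """Model of cc3_update_ldst_valid."""
    return (update_privileges and snoop_hit) or write_data_on_cache

def update_coherence_state_en(next_state_is_stable, write_data_on_cache, snoop_hit):
    """Model of cc3_update_coherence_state_en, written as in the RTL."""
    return (next_state_is_stable and snoop_hit) or (
        next_state_is_stable and not snoop_hit and write_data_on_cache
    )

def update_coherence_state_en_simplified(next_state_is_stable, write_data_on_cache, snoop_hit):
    """Equivalent form: stable state and (hit or incoming data)."""
    return next_state_is_stable and (snoop_hit or write_data_on_cache)

# Exhaustive check over all eight input combinations.
forms_agree = all(
    update_coherence_state_en(*bits) == update_coherence_state_en_simplified(*bits)
    for bits in product([False, True], repeat=3)
)
```

Enumerating the inputs this way is a quick sanity check when the conditions are modified, since each expression has only three boolean inputs.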

Furthermore, if the request is a forwarded coherence request, the L1 cache data is forwarded to the message generator in Stage 4 so that it can be sent to the requestor.

assign cc3_snoop_data_valid             = cc2_request_valid && pr_output.send_data_from_cache;
assign cc3_snoop_data_set               = cc2_request_address.index;
assign cc3_snoop_data_way               = cc2_request_snoop_way_idx;

Stage 4

This stage generates a correct request/response message for the network interface whenever a message is issued from the third stage.

See Also

Coherence