NPU cores can be arranged as a many-core architecture based upon a shared memory subsystem. With the shared-memory model, communication occurs implicitly through the loading and storing of data and the accessing of instructions. Logically, all processors access the same shared memory, allowing each to see the most up-to-date data. Practically speaking, memory hierarchies use caches to improve the performance of shared memory systems. These cache hierarchies reduce the latency to access data but complicate the logical, unified view of memory held in the shared memory paradigm. As a result, cache coherence protocols are designed to maintain a coherent view of memory for all processors in the presence of multiple cached copies of data. Therefore, it is the cache coherence protocol that governs what communication is necessary for a shared memory multiprocessor. <br>
 
Two key characteristics of a shared memory multiprocessor shape its demands on the interconnect: the cache coherence protocol, which makes sure nodes receive the correct up-to-date copy of a cache line, and the cache hierarchy.
 
  
 
== Cache Coherence Protocol ==
 
The NPU many-core architecture uses a directory protocol to enforce coherence; directory protocols do not rely on any implicit network ordering and can be mapped to an arbitrary topology. Directory protocols rely on point-to-point messages rather than broadcasts as in snooping protocols; this reduction in coherence messages allows this class of protocols to provide greater ''scalability''. Rather than broadcasting to all cores, the directory tracks which cores hold a cache block, so a read request is forwarded by the directory to a single core, resulting in lower bandwidth requirements.
  
 
Directories maintain information about the current sharers of a cache line in the system as well as coherence state information. By maintaining a sharing list, directory protocols eliminate the need to broadcast invalidation requests to the entire system. Addresses are interleaved across directory nodes; each address is assigned a ''home'' node, which is responsible for ordering and handling all coherence requests to that address; hence there isn't a single directory but instead a ''distributed'' one across all tiles of the NoC.
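
As an illustration, the C sketch below models a directory entry with a coherence state and a sharer bit-vector, and shows how invalidations are sent point-to-point to the current sharers only. Field widths, the core count, and all names are illustrative assumptions, not the actual NaplesPU encoding.

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stdio.h>

#define N_CORES 16   /* number of cores, illustrative */

/* A directory entry: coherence state plus a sharer bit-vector.
   Thanks to the sharing list, invalidations become point-to-point
   messages to the sharers instead of a system-wide broadcast. */
typedef struct {
    uint8_t  state;    /* stable state of the line         */
    uint16_t sharers;  /* bit i set => core i holds a copy */
} dir_entry_t;

/* Invalidate a line only in the cores that actually share it. */
static void invalidate_sharers(const dir_entry_t *entry, uint64_t addr)
{
    for (int core = 0; core < N_CORES; core++)
        if (entry->sharers & (1u << core))
            printf("INV %#llx -> core %d\n",
                   (unsigned long long)addr, core);
}
</syntaxhighlight>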
 
Furthermore, the directory is ''inclusive'': it holds entries for a superset of all blocks cached on the chip. This makes it possible to design more cost-effective directory caches, by exploiting the observation that only directory states for blocks currently cached on the chip need to be stored. A miss in the inclusive directory cache thus indicates that the block is in state N (Non-Cached). Because the directory mirrors the contents of the LLC, the entire directory cache is ''embedded'' in the LLC simply by adding extra bits to each block in the LLC.
Unfortunately, LLC inclusion has several drawbacks. First, for the shared caches in our system model, it is generally necessary to send special ''recall'' requests to invalidate blocks from the L1 caches when replacing a block in the LLC. More importantly, LLC inclusion requires maintaining redundant copies of cache blocks that are in upper-level caches.
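
The sketch below (again with assumed names and sizes) shows how the directory can be embedded in the LLC by extending each block with a few directory bits, and why evicting a valid LLC block triggers a ''recall'' of the L1 copies.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

/* An LLC block extended with directory bits: since the inclusive
   directory mirrors the LLC contents, tracking a line costs only a
   few extra bits per LLC block.  A lookup miss means the line is not
   cached anywhere on chip. */
typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  dir_state;    /* embedded directory state bits */
    uint16_t dir_sharers;  /* embedded sharer bit-vector    */
    uint8_t  data[64];
} llc_block_t;

/* Replacing a valid LLC block first recalls (invalidates) every L1
   copy: the price of inclusion described above. */
static void evict_llc_block(llc_block_t *blk)
{
    if (blk->valid && blk->dir_sharers != 0) {
        /* send recall requests to all cores in dir_sharers here */
    }
    blk->valid = false;
}
</syntaxhighlight>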

The coherence protocol currently implemented is a modified version of MSI, with three stable states at the L1 level (Modified, Shared, Invalid) and four stable states at the L2 level (Modified, Shared, Non-Cached, Invalid). The semantics of the states are the same as defined in the literature, except for the Non-Cached state, which indicates that a memory line is not stored in the L2; given the inclusive nature of the L2, ownership in this case belongs to main memory.
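
As a minimal illustration of these states, the following C fragment declares the two sets of stable states and a deliberately oversimplified L1-side transition for loads and stores; transient states and the actual message handling are described on the ''[[MSI Protocol]]'' page, so this sketch is not the real implementation.

<syntaxhighlight lang="c">
#include <stdbool.h>

/* Stable states: three at the L1, four at the L2.  Non-Cached exists
   only at the L2 and means the line is not on chip, so ownership lies
   with main memory. */
typedef enum { L1_MODIFIED, L1_SHARED, L1_INVALID } l1_state_t;
typedef enum { L2_MODIFIED, L2_SHARED, L2_NON_CACHED, L2_INVALID } l2_state_t;

/* Oversimplified L1 transition for an access that finds the line in a
   stable state: a store requires write privilege (Modified), a load
   only read privilege (Shared). */
static l1_state_t l1_next_state(l1_state_t s, bool is_store)
{
    if (is_store)
        return L1_MODIFIED;
    return (s == L1_INVALID) ? L1_SHARED : s;
}
</syntaxhighlight>
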
For further details about the memory coherence protocol, please refer to:
* ''[[MSI Protocol]]''
  
 
== Cache Hierarchy ==
 
The L2 cache is spread across the tiles of the NaplesPU system. With a shared L2 cache, a request due to an L1 miss is forwarded to the proper home node, determined by the address (not necessarily the local L2 slice).
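
A minimal sketch of such an address-to-home mapping is shown below, assuming 64-byte lines, a 4x4 mesh, and interleaving on the bits right above the line offset; the actual NaplesPU mapping may differ.

<syntaxhighlight lang="c">
#include <stdint.h>

#define LINE_OFFSET_BITS 6   /* 64-byte cache lines (assumption) */
#define MESH_WIDTH       4   /* 4x4 mesh => 16 tiles             */
#define N_TILES          (MESH_WIDTH * MESH_WIDTH)

/* Address-interleaved home selection: the bits just above the line
   offset choose the tile whose L2 slice (and directory) is the home
   node for the line. */
static unsigned home_tile(uint64_t addr)
{
    return (addr >> LINE_OFFSET_BITS) & (N_TILES - 1);
}

/* Tile id -> (x, y) mesh coordinates, as used to route the request. */
static void tile_coords(unsigned tile, unsigned *x, unsigned *y)
{
    *x = tile % MESH_WIDTH;
    *y = tile / MESH_WIDTH;
}
</syntaxhighlight>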
  
Shared caches make more effective use of storage, as there is no replication of cache lines. However, an L1 cache miss incurs additional latency to request data from a different tile. Shared caches also place more pressure on the interconnection network, since L1 misses go onto the network; on the other hand, the more effective use of storage may reduce pressure on the off-chip bandwidth to memory.
  
The NaplesPU many-core is organized as N tiles connected through a 2D mesh network; each tile contains an NPU core running K hardware threads, a private L1 cache (shared among the threads of the core), and a slice of the shared, inclusive L2 cache. Since every core has its own private L1 while the L2 is distributed among the tiles, a coherence system gives each core the illusion of having a single cache at its disposal and of being its only user. Every storage structure is therefore paired with a coherence controller that manipulates the data while exchanging messages with the other coherence controllers over the network, forming a distributed message-passing system. The directory itself is distributed, and a portion of it is buffered in the cache associated with each L2 slice: exploiting the inclusiveness of the L2, the L2 cache is tied to the directory cache.

Memory controllers are placed as individual nodes on the mesh network; with this design, memory controllers do not have to share injection/ejection bandwidth to/from the network with cache traffic. In this way traffic is isolated; the memory controller has access to the full network bandwidth.

Particular attention has been paid to decoupling coherence management from data handling within the core datapath. Cores are unaware of the coherence protocol: they only have a view of the ''privileges'' they hold on data, while both data and privileges are manipulated by the coherence infrastructure. As a result, the coherence protocol can essentially be replaced with a different one.

== Architectural Details ==
The coherence architecture is composed of three components:
  
* ''[[Load/Store unit|load/store unit]]'': contains the L1 data cache and schedules the memory requests coming from the core's threads;
* ''[[L1 Cache Controller|cache controller]]'': keeps the L1 data cache coherent, communicating directly with the load/store unit and exchanging coherence messages with the other controllers through the network interface;
* ''[[L2 and Directory cache controller|directory controller]]'': manages the L2 cache and the distributed directory cache, handling coherence transactions with the other controllers over the network.

The load/store unit and the cache controller are part of the NPU core, while the directory controller is instantiated outside the core, within the tile.
  
==== L1 Cache Assumptions ====

The L1 cache design has been driven by the following restrictive assumptions:

* if a thread raises a cache miss, that thread is suspended until the L1 coherence controller fulfills the request; meanwhile the core keeps executing the other active threads, hiding the coherence protocol latency (see the sketch after this list);
* merging of coherence requests coming from the same core is not supported;
* only N x N networks can be instantiated, with N a power of two; this implies that empty tiles have to be introduced, and that these tiles have a portion of the L2 cache and directory.
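
A minimal sketch of the first assumption, with illustrative names and sizes, is a per-thread ready bit that is cleared on a miss and set again when the cache controller answers, so the scheduler simply skips stalled threads:

<syntaxhighlight lang="c">
#include <stdbool.h>

#define N_THREADS 8   /* hardware threads per core, illustrative */

static bool thread_ready[N_THREADS];  /* one ready bit per thread */

/* An L1 miss suspends only the issuing thread; the controller sets
   the bit again once the miss has been served. */
static void on_l1_miss(unsigned tid)     { thread_ready[tid] = false; }
static void on_miss_served(unsigned tid) { thread_ready[tid] = true; }

/* Round-robin choice of the next ready thread; -1 if all are stalled. */
static int next_thread(unsigned last)
{
    for (unsigned i = 1; i <= N_THREADS; i++) {
        unsigned t = (last + i) % N_THREADS;
        if (thread_ready[t])
            return (int)t;
    }
    return -1;
}
</syntaxhighlight>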
