Coherence
Nu+ cores can be arranged as a many-core architecture based upon a shared-memory subsystem. With the shared-memory model, communication occurs implicitly through the loading and storing of data and the accessing of instructions. Logically, all processors access the same shared memory, allowing each to see the most up-to-date data. Practically speaking, memory hierarchies use caches to improve the performance of shared memory systems. These cache hierarchies reduce the latency to access data but complicate the logical, unified view of memory held in the shared memory paradigm. As a result, cache coherence protocols are designed to maintain a coherent view of memory for all processors in the presence of multiple cached copies of data. Therefore, it is the cache coherence protocol that governs what communication is necessary in a shared memory multiprocessor.
Two key characteristics of a shared memory multiprocessor shape its demands on the interconnect: the cache coherence protocol, which ensures that nodes receive the correct, up-to-date copy of a cache line, and the cache hierarchy.
Cache Coherence Protocol
The nu+ many-core architecture uses a directory protocol to enforce coherence. Directory protocols do not rely on any implicit network ordering and can be mapped to an arbitrary topology. They use point-to-point messages rather than the broadcasts of snooping protocols; this reduction in coherence traffic gives this class of protocols greater scalability. Rather than broadcasting to all cores, the directory tracks which cores hold each cache block, so a read request is forwarded to a single core, lowering bandwidth requirements.
Directories maintain information about the current sharers of a cache line in the system as well as its coherence state. By maintaining a sharing list, directory protocols eliminate the need to broadcast invalidation requests to the entire system. Addresses are interleaved across directory nodes: each address is assigned a home node, which is responsible for ordering and handling all coherence requests to that address. Hence there is not a single directory, but one distributed across all tiles of the NoC.
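As an illustration, the sketch below shows how a home node can be derived by interleaving addresses across tiles at cache-line granularity. The line size and tile count are assumptions chosen for the example, not actual nu+ parameters.

```cpp
#include <cstdint>

// Illustrative home-node interleaving (parameters are assumptions):
// addresses are interleaved across directory nodes at cache-line
// granularity, so the home tile is derived from the address bits
// just above the block offset.
constexpr unsigned kBlockOffsetBits = 6;  // assumed 64-byte cache lines
constexpr unsigned kNumTiles        = 16; // assumed tile count

// Every coherence request for `addr` is ordered by this tile's directory.
inline unsigned home_node(uint64_t addr) {
    return (addr >> kBlockOffsetBits) % kNumTiles;
}
```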
Furthermore, the directory is inclusive: it holds entries for a superset of all blocks cached on the chip. This makes the directory cache more cost-effective, exploiting the observation that directory state is needed only for blocks currently cached on the chip; a miss in the inclusive directory cache indicates that the block is in state N (not cached on chip). Because the directory mirrors the contents of the LLC, the entire directory cache is embedded in the LLC simply by adding extra bits to each LLC block. Unfortunately, LLC inclusion has several drawbacks. First, with the shared caches in this system model, special recall requests must generally be sent to invalidate blocks in the L1 caches when a block is replaced in the LLC. More importantly, LLC inclusion requires maintaining redundant copies of cache blocks that are also held in upper-level caches.
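A minimal sketch of how directory state can be embedded in an LLC block by adding extra bits is shown below; the field names, the MSI-style states, and the sizes are assumptions for illustration only.

```cpp
#include <bitset>
#include <cstdint>

constexpr unsigned kNumTiles = 16; // assumed tile count

// Assumed MSI-style directory states; N matches the "not cached" state
// implied by a miss in the inclusive directory cache.
enum class DirState : uint8_t { N, S /*shared*/, M /*modified*/ };

struct LLCBlock {
    uint64_t               tag;
    DirState               state = DirState::N;
    std::bitset<kNumTiles> sharers;   // which L1 caches hold a copy
    uint8_t                data[64];  // assumed 64-byte line
};
// Evicting an LLCBlock while sharers.any() is true requires recall
// requests to those L1 caches: the inclusion drawback noted above.
```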
Cache Hierarchy
Caches are employed to reduce the memory latency of requests. They also act as filters for the traffic that must be placed on the interconnect. Each tile in the nu+ many-core architecture contains a bank of the shared L2 cache. With a shared L2 cache, an L1 miss is sent to the L2 bank determined by the miss address (not necessarily the local bank), where it may hit, or miss and be sent off-chip to main memory.
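The sketch below illustrates this routing decision under the same assumed parameters as before: the miss address selects the bank, and the request stays on-tile only when the requesting tile happens to be that address's home node.

```cpp
#include <cstdint>

constexpr unsigned kBlockOffsetBits = 6;  // assumed 64-byte cache lines
constexpr unsigned kNumL2Banks      = 16; // assumed one bank per tile

struct MissTarget {
    unsigned bank;    // index of the selected L2 bank / home tile
    bool     remote;  // true => the request must traverse the NoC
};

// Steer an L1 miss to the L2 bank selected by the miss address.
inline MissTarget route_l1_miss(uint64_t miss_addr, unsigned my_tile) {
    unsigned bank = (miss_addr >> kBlockOffsetBits) % kNumL2Banks;
    return { bank, bank != my_tile };
}
```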
Shared caches make more effective use of storage, as cache lines are not replicated. However, L1 cache misses incur additional latency when data must be requested from a different tile. Shared caches also place more pressure on the interconnection network, since L1 misses go onto the network, but their more effective use of storage may reduce pressure on off-chip memory bandwidth. With shared caches, more requests travel to remote nodes for data, so the on-chip network must attach to both the L1 caches and the L2 banks.
Memory controllers are placed as individual nodes on the interconnection network; with this design, memory controllers do not have to share injection/ejection bandwidth to/from the network with cache traffic. In this way traffic is more isolated; the memory controller has access to the full amount of injection bandwidth.
Architectural Details
The coherence architecture is composed of three components:
- load/store unit: contains the L1 data cache;
- cache controller: handles L1 data cache coherence and manages coherence transactions;
- directory controller: handles the L2 cache and manages coherence transactions.
The load/store unit and the cache controller are part of the nu+ core, while the directory controller is part of the tile.
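To make these roles concrete, the sketch below lists a generic vocabulary of directory-protocol messages exchanged between the cache controllers and the home directory controller. These names follow a common MSI-style convention and are assumptions for illustration; the actual nu+ message types are not specified here.

```cpp
// Generic directory-protocol message classes (illustrative only; the
// actual nu+ message names may differ).
enum class CoherenceMsg {
    // cache controller -> directory controller (home node)
    GetS,     // read miss: request a shared copy
    GetM,     // write miss: request an exclusive copy
    PutM,     // write-back of a dirty line on eviction
    // directory controller -> cache controllers
    FwdGetS,  // forward a read request to the current owner
    Inv,      // invalidate a sharer's copy
    Recall,   // back-invalidation forced by an inclusive-LLC eviction
    Data,     // data response
    Ack       // invalidation acknowledgement
};
```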
L1 Cache Assumptions
The design of the L1 cache has been driven by the following assumptions:
- if a thread raises a cache miss, the thread is suspended until the request is fulfilled by the L1 cache controller (see the sketch after this list);
- merging of requests from the same core is forbidden;
- only N×N networks are possible (with N a power of two); this implies that empty tiles have to be introduced, and that these tiles host a portion of the L2 cache and directory.
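A minimal sketch of the first two assumptions, using hypothetical structure and names: each hardware thread has at most one outstanding miss and stays suspended until the L1 cache controller fills the line, so requests from the same core are never merged.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical miss-tracking structure: a thread that misses is suspended
// until the L1 cache controller fills its line, and each thread's miss is
// tracked in its own slot, so misses to the same line from different
// threads are never merged into one request.
constexpr unsigned kThreads = 8; // assumed hardware-thread count

struct ThreadMissState {
    std::optional<uint64_t> pending_line; // line address of the outstanding miss
    bool suspended = false;               // thread cannot issue while true
};

struct L1MissTable {
    std::array<ThreadMissState, kThreads> threads;

    // On a miss: record the pending line and suspend the thread.
    void on_miss(unsigned tid, uint64_t line_addr) {
        threads[tid].pending_line = line_addr;
        threads[tid].suspended = true;
    }

    // When the cache controller fulfils the request: wake the thread.
    void on_fill(unsigned tid) {
        threads[tid].pending_line.reset();
        threads[tid].suspended = false;
    }
};
```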