Cache Hierarchy

The PTLsim cache hierarchy model is highly flexible and can be used to model a wide variety of contemporary cache structures. The cache subsystem (defined in dcache.h and implemented by dcache.cpp) by default consists of four levels:

L1 data cache is directly probed by all loads and stores
L1 instruction cache services all instruction fetches
L2 cache is shared between data and instructions, with data paths to both L1 caches
L3 cache is also shared and is optionally present
Main memory is considered infinite in size but still has configurable characteristics

These cache levels are listed in order from highest level (closest to the core) to lowest level (farthest from the core). The cache hierarchy is assumed to be inclusive, i.e. any data in higher levels is assumed to always be present in lower levels. Additionally, the cache levels are generally write-through, meaning that every store updates all cache levels, rather than waiting for a dirty line to be evicted. PTLsim supports a 48-bit virtual address space and 40-bit physical addresses (full system PTLsim/X only) in accordance with the x86-64 standard.

General Configurable Parameters

All caches support configuration of:

Line size in bytes. Any power of two size is acceptable; however, the line size of a lower cache level must be the same as or larger than the line size of any higher level cache. For example, it is illegal to have 128 byte L1 lines with 64 byte L2 lines.
Set count may be any power of two. The total cache size in bytes is (line size) × (set count) × (way count).
Way count (associativity) may be any number from 1 (direct mapped) up to the set count (fully associative). Note that simulation performance (and clock speed in a real processor) will suffer if the associativity is too great, particularly for L1 caches.
Latency in cycles from a load request to the arrival of the data.
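
The relationship between these parameters can be sketched in a few lines of C++. This is a minimal illustration only: CacheGeometry, its member names and line_sizes_legal() are hypothetical and do not appear in dcache.h. It shows the size formula, the usual split of an address into line offset, set index and tag, and the line size rule between adjacent levels.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical descriptor for one cache level (names are illustrative,
// not taken from PTLsim's dcache.h).
struct CacheGeometry {
  uint64_t line_size;  // bytes per line, must be a power of two
  uint64_t set_count;  // number of sets, must be a power of two
  uint64_t way_count;  // associativity

  // Total capacity: (line size) x (set count) x (way count).
  uint64_t size_bytes() const { return line_size * set_count * way_count; }

  // Byte offset of the address within its cache line.
  uint64_t offset_of(uint64_t addr) const { return addr & (line_size - 1); }

  // Index of the set the address maps to.
  uint64_t set_of(uint64_t addr) const {
    return (addr / line_size) & (set_count - 1);
  }

  // Remaining high-order bits, used to identify the line within its set.
  uint64_t tag_of(uint64_t addr) const {
    return addr / (line_size * set_count);
  }
};

inline bool is_pow2(uint64_t x) { return x != 0 && (x & (x - 1)) == 0; }

// A lower level must use lines at least as large as the level above it.
inline bool line_sizes_legal(const CacheGeometry& upper,
                             const CacheGeometry& lower) {
  assert(is_pow2(upper.line_size) && is_pow2(lower.line_size));
  return lower.line_size >= upper.line_size;
}
```

For example, a geometry of 64 byte lines, 128 sets and 4 ways gives size_bytes() == 32768 (32 KB), and pairing 128 byte L1 lines with 64 byte L2 lines is rejected by line_sizes_legal().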

In dcache.h, the two base classes CacheLine and CacheLineWithValidMask are interchangeable, depending on the model being used. The CacheLine class is a standard cache line with no actual data (since the bytes in each line are simply held in memory for simulation purposes).

The CacheLineWithValidMask class adds a bitmask specifying which bytes within the cache line contain valid data and which are unknown. This is useful for implementing ``no stall on store'' semantics, in which stores simply allocate a new way in the appropriate set but only set the valid bits for those bytes actually modified by the store. The rest of the cache line not touched by the store can be brought in later without stalling the processor (unless a load tries to access them); this is PTLsim's default model. Additionally, this technique may be used to implement sectored cache lines, in which the line fill bus is smaller than the cache line size. This means that groups of bytes within the line may be filled over subsequent cycles rather than all at once.
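
The valid mask idea can be sketched as follows. This is not the CacheLineWithValidMask class itself: MaskedLine, its 64 byte line size and all method names are assumptions made for illustration. A store marks only the bytes it writes, a load hits only if every byte it needs is valid, and a sectored fill validates one group of bytes at a time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-byte line with one valid bit per byte
// (illustrative only; not PTLsim's CacheLineWithValidMask).
struct MaskedLine {
  static constexpr int LINE_SIZE = 64;
  uint64_t valid = 0;  // bit i set => byte i of the line holds valid data

  // Mask covering 'size' bytes starting at byte 'offset' within the line.
  static uint64_t byte_mask(int offset, int size) {
    assert(offset >= 0 && size > 0 && offset + size <= LINE_SIZE);
    uint64_t ones = (size == 64) ? ~0ULL : ((1ULL << size) - 1);
    return ones << offset;
  }

  // "No stall on store": mark only the stored bytes as valid.
  void store(int offset, int size) { valid |= byte_mask(offset, size); }

  // A load can be serviced only if all of its bytes are valid.
  bool load_hits(int offset, int size) const {
    uint64_t m = byte_mask(offset, size);
    return (valid & m) == m;
  }

  // Sectored fill: validate one sector of the line at a time.
  void fill_sector(int sector, int sector_bytes) {
    valid |= byte_mask(sector * sector_bytes, sector_bytes);
  }

  // Full line fill: every byte becomes valid.
  void fill() { valid = ~0ULL; }
};
```

After store(8, 4) on an empty line, a 4 byte load at offset 8 hits while an 8 byte load at offset 4 does not, since bytes 4 to 7 have not yet been brought in.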

The AssociativeArray template class in logic.h forms the basis of all caches in PTLsim. To construct a cache in which specific lines can be locked into place, the LockableAssociativeArray template class may be used instead. Finally, the CommitRollbackCache template class is useful for creating versions of PTLsim with cache level commit/rollback support for out of order commit, fault recovery and advanced speculation techniques.

The various caches are defined in dcache.h by specializations of these template classes. The classes are L1Cache, L1ICache, L2Cache and L3Cache.

Initiating a Cache Miss

As described in Section 21, in the out of order core model, the issueload() function determines if some combination of a prior store's forwarded bytes (if any) and data present in the L1 cache can fulfill a load. If not, this is a miss and lower cache levels must be accessed. In this case, a LoadStoreInfo structure (defined in dcache.h) is prepared with various metadata about the load, including which ROB entry and physical register to wake up when the load arrives, its size, alignment, sign extension properties, prefetch properties and so on. The issueload_slowpath() function (defined in dcache.cpp) is then called with this information, the physical address to load and any data inherited from a prior store still in the pipeline. The issueload_slowpath() function moves the load request out of the core pipeline and into the cache hierarchy.

The Load Fill Request Queue (LFRQ) is a structure used to hold information about any outstanding loads that have missed any cache level. The LFRQ allows a configurable number of loads to be outstanding at any time and provides a central control point between cache lines arriving from the L2 cache or lower levels and the movement of the requested load data into the processor core to dependent instructions. The LoadFillReq structure, prepared by issueload_slowpath(), contains all the data needed to return a filled load to the core: the physical address of the load, the data and bytemask already known so far (e.g. forwarded from a prior store) and the LoadStoreInfo metadata described above.

The Miss Buffer (MB) tracks all outstanding cache lines, rather than individual loads. Each MB slot uses a bitmap to track one or more LFRQ entries that need to be awakened when the missing cache line arrives. After adding the newly created LoadFillReq entry to the LFRQ, the MissBuffer::initiate_miss() method uses the missing line's physical address to allocate a new slot in the miss buffer array (or simply uses an existing slot if a miss was already in progress on that line). In either case, the MB's wakeup bitmap is updated to reflect the new LFRQ entry referring to that line. Each MB entry contains a cycles field, indicating the number of cycles remaining before that miss buffer can be moved to the next level up the cache hierarchy; this repeats until the line reaches the core. Each entry also contains two bits (icache and dcache) indicating the L1 caches to which the line should eventually be delivered; this is required because a single L2 line (and corresponding miss buffer) may be referenced by both the L1 data and instruction caches.

In initiate_miss(), the L2 and L3 caches are probed to see if they contain the required line. If the L2 has the line, the miss buffer is placed into the STATE_DELIVER_TO_L1 state, indicating that the line is now in progress to the L1 cache. Similarly, an L2 miss but L3 hit results in the STATE_DELIVER_TO_L2 state, and a miss all the way to main memory results in STATE_DELIVER_TO_L3.
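
The allocation and state selection described above might look roughly like the sketch below. Only the STATE_DELIVER_TO_* names, the cycles field, the lfrqmap bitmap and the icache and dcache bits come from the text; STATE_IDLE, the struct layout, the function signature, the latency values and the choice of initial cycle count are all assumptions made for illustration, and the real MissBuffer::initiate_miss() probes the caches itself rather than receiving hit flags.

```cpp
#include <cassert>
#include <cstdint>

enum MissState {
  STATE_IDLE,  // hypothetical name for the unused state
  STATE_DELIVER_TO_L3,
  STATE_DELIVER_TO_L2,
  STATE_DELIVER_TO_L1
};

struct MissEntry {
  MissState state = STATE_IDLE;
  uint64_t line_addr = 0;  // physical address of the missing line
  uint32_t lfrqmap = 0;    // bit i set => wake LFRQ slot i on arrival
  int cycles = 0;          // cycles left in the current state
  bool icache = false;     // deliver to the L1 instruction cache
  bool dcache = false;     // deliver to the L1 data cache
};

struct SketchMissBuffer {
  static constexpr int SIZE = 8;
  // Assumed latencies; the real values are configurable.
  static constexpr int L2_LATENCY = 6;
  static constexpr int L3_LATENCY = 16;
  static constexpr int MEM_LATENCY = 140;
  MissEntry slots[SIZE];

  int find(uint64_t line_addr) const {
    for (int i = 0; i < SIZE; i++)
      if (slots[i].state != STATE_IDLE && slots[i].line_addr == line_addr)
        return i;
    return -1;
  }

  int find_free() const {
    for (int i = 0; i < SIZE; i++)
      if (slots[i].state == STATE_IDLE) return i;
    return -1;
  }

  // Returns the slot index, or -1 if the buffer is full (the caller would
  // then replay the load). lfrq_slot < 0 models a prefetch or an
  // instruction fetch, neither of which has an LFRQ entry to wake.
  int initiate_miss(uint64_t line_addr, bool l2_hit, bool l3_hit,
                    int lfrq_slot, bool is_icache) {
    assert(lfrq_slot < 32);
    int idx = find(line_addr);
    if (idx < 0) {
      idx = find_free();
      if (idx < 0) return -1;
      slots[idx] = MissEntry();
      slots[idx].line_addr = line_addr;
      if (l2_hit) {
        slots[idx].state = STATE_DELIVER_TO_L1;
        slots[idx].cycles = L2_LATENCY;
      } else if (l3_hit) {
        slots[idx].state = STATE_DELIVER_TO_L2;
        slots[idx].cycles = L3_LATENCY;
      } else {
        slots[idx].state = STATE_DELIVER_TO_L3;
        slots[idx].cycles = MEM_LATENCY;
      }
    }
    // Whether the slot is new or shared, record who is waiting on the line.
    if (lfrq_slot >= 0) slots[idx].lfrqmap |= (1u << lfrq_slot);
    if (is_icache) slots[idx].icache = true;
    else slots[idx].dcache = true;
    return idx;
  }
};
```

Two loads that miss on the same line share one slot and simply set two bits in its lfrqmap, which is why the miss buffer can be much smaller than the number of outstanding loads.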

In the very unlikely event that either the LFRQ or the miss buffer is full, an exception is returned to the out of order core, which typically replays the affected load until space in these structures becomes available. For prefetch requests, only a miss buffer is allocated; no LFRQ slot is needed.

Filling a Cache Miss

The MissBuffer::clock() method implements all synchronous state transitions. For each active miss buffer, the cycles counter is decremented, and if it becomes zero, the MB's current state is examined. If a given miss buffer was in the STATE_DELIVER_TO_L3 state (i.e. in progress from main memory) and the cycle counter just became zero, a line in the L3 cache is validated with the incoming data (this may involve evicting another line in the same set to make room). The MB is then moved to the next state up the cache hierarchy (i.e. STATE_DELIVER_TO_L2 in this example) and its cycles field is updated with the latency of the cache level it is now leaving (e.g. L3_LATENCY in this example).

This process continues with successive levels until the MB is in the STATE_DELIVER_TO_L1 state and its cycles field has been decremented to zero. If the MB's dcache bit is set, the corresponding L1 data cache line is validated and the lfrq.wakeup() method is called to invoke a new state machine to wake up any loads waiting on the recently filled line (as known from the MB's lfrqmap bitmap). If the MB's icache bit is set, the line is validated in the L1 instruction cache, and the PerCoreCacheCallbacks::icache_wakeup() callback is used to notify the out of order core's fetch stage that it may probe the cache for the missing line again. In either case, the miss buffer is then returned to the unused state.
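
The per-cycle transitions can be summarized in a small state machine. The sketch below is a simplification under stated assumptions: clock_entry(), FillEvents, STATE_IDLE and the latency values are hypothetical, and the actual cache line validation and callbacks are reduced to flags so that the control flow stays visible.

```cpp
#include <cassert>
#include <cstdint>

enum MissState {
  STATE_IDLE,  // hypothetical name for the unused state
  STATE_DELIVER_TO_L3,
  STATE_DELIVER_TO_L2,
  STATE_DELIVER_TO_L1
};

// Assumed latencies; the real values are configurable.
constexpr int L2_LATENCY = 6;
constexpr int L3_LATENCY = 16;

struct MissEntry {
  MissState state;
  int cycles;        // cycles remaining in the current state
  uint32_t lfrqmap;  // LFRQ slots waiting on this line
  bool icache;
  bool dcache;
};

// Records what happened, standing in for the real validation and callbacks.
struct FillEvents {
  bool l3_filled = false;
  bool l2_filled = false;
  bool l1d_filled = false;
  bool l1i_filled = false;   // real code would call icache_wakeup()
  uint32_t lfrq_woken = 0;   // real code would call lfrq.wakeup()
};

// Advance one miss buffer entry by one cycle.
inline void clock_entry(MissEntry& mb, FillEvents& ev) {
  if (mb.state == STATE_IDLE) return;
  assert(mb.cycles > 0);
  if (--mb.cycles > 0) return;  // still in flight

  switch (mb.state) {
    case STATE_DELIVER_TO_L3:   // arrived from main memory
      ev.l3_filled = true;
      mb.state = STATE_DELIVER_TO_L2;
      mb.cycles = L3_LATENCY;   // latency of the level now being left
      break;
    case STATE_DELIVER_TO_L2:
      ev.l2_filled = true;
      mb.state = STATE_DELIVER_TO_L1;
      mb.cycles = L2_LATENCY;
      break;
    case STATE_DELIVER_TO_L1:
      if (mb.dcache) {
        ev.l1d_filled = true;
        ev.lfrq_woken |= mb.lfrqmap;
      }
      if (mb.icache) ev.l1i_filled = true;
      mb = MissEntry{STATE_IDLE, 0, 0, false, false};  // release the slot
      break;
    default:
      break;
  }
}
```

With these assumed latencies, an entry that starts in STATE_DELIVER_TO_L3 with 2 cycles remaining is released after 2 + 16 + 6 = 24 calls to clock_entry().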

Each LFRQ slot can be in one of three states: free, waiting and ready. LFRQ slots remain in the waiting state as long as they are referenced by a miss buffer; once the lfrq.wakeup() method is called, all slots affiliated with that miss buffer are moved to the ready state. The LoadFillRequestQueue::clock() method finds up to MAX_WAKEUPS_PER_CYCLE LFRQ slots in the ready state and wakes them up by calling the PerCoreCacheCallbacks::dcache_wakeup() callback with the saved LoadStoreInfo metadata. The out of order core handles this callback as described in Section 21.4.
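
A minimal sketch of the three LFRQ states and the per-cycle wakeup limit follows. SketchLfrq, the LFRQ_* state names, the queue size and the value chosen for MAX_WAKEUPS_PER_CYCLE are assumptions for illustration; the sketch also assumes a slot becomes free as soon as it has been woken, and it returns the woken slot indices where the real code would invoke the dcache_wakeup() callback.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum LfrqState { LFRQ_FREE, LFRQ_WAITING, LFRQ_READY };  // hypothetical names

struct SketchLfrq {
  static constexpr int SIZE = 16;
  static constexpr int MAX_WAKEUPS_PER_CYCLE = 2;  // assumed value
  LfrqState state[SIZE];

  SketchLfrq() {
    for (int i = 0; i < SIZE; i++) state[i] = LFRQ_FREE;
  }

  // Allocate a slot for a missed load; returns -1 if the queue is full.
  int add() {
    for (int i = 0; i < SIZE; i++) {
      if (state[i] == LFRQ_FREE) {
        state[i] = LFRQ_WAITING;
        return i;
      }
    }
    return -1;
  }

  // Called when a line arrives: every waiting slot named in the miss
  // buffer's bitmap becomes ready.
  void wakeup(uint32_t lfrqmap) {
    for (int i = 0; i < SIZE; i++)
      if (((lfrqmap >> i) & 1u) && state[i] == LFRQ_WAITING)
        state[i] = LFRQ_READY;
  }

  // One cycle: hand at most MAX_WAKEUPS_PER_CYCLE ready slots back to the
  // core and free them. Returns the indices that were woken.
  std::vector<int> clock() {
    std::vector<int> woken;
    for (int i = 0; i < SIZE; i++) {
      if (static_cast<int>(woken.size()) >= MAX_WAKEUPS_PER_CYCLE) break;
      if (state[i] == LFRQ_READY) {
        state[i] = LFRQ_FREE;
        woken.push_back(i);
      }
    }
    return woken;
  }
};
```

If three loads are waiting on one line and the limit is two, the first clock() after wakeup() returns two slots and the third load is delivered one cycle later.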

For simulation purposes only, the value to be loaded is immediately recorded as soon as the load issues, independent of the cache hit or miss status. In real hardware, the LFRQ entry data would be used to extract the correct bytes from the newly arrived line and perform sign extension and alignment. If the original load required bytes from a mixture of its source store buffer and the data cache, the SFR data and mask fields in the LFRQ entry would be used to perform this merging operation. The data would then be written into the physical register specified by the LoadStoreInfo metadata and that register would be marked as ready before sending a signal to the issue queues to wake up dependent operations.
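
The merging step that real hardware would perform can be illustrated as follows. This is a sketch rather than PTLsim code: merge_bytes() and sign_extend() are hypothetical helpers, and the 8 byte operand width with a one-bit-per-byte mask is an assumption about how the SFR data and mask fields would be laid out.

```cpp
#include <cassert>
#include <cstdint>

// Combine up to 8 bytes: where a bit in sfr_bytemask is set, take that byte
// from the forwarded store data; otherwise take it from the cache line.
inline uint64_t merge_bytes(uint64_t cache_data, uint64_t sfr_data,
                            uint8_t sfr_bytemask) {
  uint64_t bitmask = 0;
  for (int i = 0; i < 8; i++)
    if ((sfr_bytemask >> i) & 1) bitmask |= 0xffULL << (8 * i);
  return (sfr_data & bitmask) | (cache_data & ~bitmask);
}

// Sign extend the low 'size_bytes' bytes of the merged value to 64 bits.
inline int64_t sign_extend(uint64_t value, int size_bytes) {
  assert(size_bytes == 1 || size_bytes == 2 ||
         size_bytes == 4 || size_bytes == 8);
  switch (size_bytes) {
    case 1: return static_cast<int8_t>(value);
    case 2: return static_cast<int16_t>(value);
    case 4: return static_cast<int32_t>(value);
    default: return static_cast<int64_t>(value);
  }
}
```

For instance, with a byte mask of 0x03 the two low bytes come from the store and the remaining six from the cache line.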

In some cases, the out of order core may need to annul speculatively executed loads. The cache subsystem is notified of this through the annul_lfrq_slot() function called by the core. This function clears the specified LFRQ slot in each miss buffer's lfrqmap entry (since that slot should no longer be awakened now that it has been annulled), and frees the LFRQ entry itself.
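
Annulment amounts to clearing one bit position across every miss buffer's bitmap and releasing the LFRQ entry. The sketch below assumes a simplified layout: MiniMissBuffer, MiniLfrq and annul_lfrq_slot_sketch() are hypothetical stand-ins, and the real annul_lfrq_slot() function has a different signature.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-ins for the real structures (illustrative only).
struct MiniMissBuffer {
  uint32_t lfrqmap[4];  // one wakeup bitmap per miss buffer slot
};

struct MiniLfrq {
  bool in_use[32];      // true while an LFRQ slot holds a pending load
};

// Remove an annulled load: it must no longer be awakened by any miss
// buffer, and its LFRQ entry becomes available again.
inline void annul_lfrq_slot_sketch(MiniMissBuffer& mb, MiniLfrq& lfrq,
                                   int slot) {
  assert(slot >= 0 && slot < 32);
  for (uint32_t& map : mb.lfrqmap) map &= ~(1u << slot);
  lfrq.in_use[slot] = false;
}
```

The miss buffer itself is left in flight in this sketch, so the line is still filled even though the annulled load no longer waits on it.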

Translation Lookaside Buffers

The following section applies to full system PTLsim/X only. The userspace version of PTLsim does not model TLBs since doing so would be inaccurate: it is physically impossible to model TLB miss delays without actually walking real page tables and encountering the associated cache misses. For more information, please see Section 14.3.1 concerning page translation in PTLsim/X.

Matt T Yourst 2007-09-26