In the PTLsim out of order model, a given store may merge its data with that of a previous store in program order. This ensures that loads which may need to forward data from a store always reference exactly one store queue entry, rather than having to merge data from multiple smaller prior stores to cover the entire byte range being loaded. In this model, physical memory is divided up into 8 byte (64 bit) chunks. As each store issues, it scans the store queue backwards in program order to find the most recent prior store to the same 8 byte aligned physical address. If there is a match, the current store depends on the matching prior store, and cannot complete and forward its data to other consuming loads and stores until the prior store in question also completes. This ensures that the current store's data can be composited on top of the older store's data to form a single up to date 8-byte chunk. As described in Section 18.4, each store queue entry contains a byte mask to indicate which of the 8 bytes in each chunk are currently modified by stores in flight versus those bytes which must come from the data cache.
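The compositing step described above can be illustrated with a minimal sketch. The `StoreChunk` structure and the `expand_bytemask`, `make_store` and `merge_stores` helpers are hypothetical names introduced for illustration only and are not taken from the PTLsim sources; the sketch also assumes little-endian byte placement within each 8-byte chunk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified store queue entry covering one 8-byte aligned chunk.
struct StoreChunk {
  uint64_t data;      // byte i of the chunk lives in bits [8*i, 8*i+7]
  uint8_t  bytemask;  // bit i set => byte i is written by an in-flight store
};

// Expand an 8-bit byte mask into a 64-bit mask (0xff for each selected byte).
inline uint64_t expand_bytemask(uint8_t bytemask) {
  uint64_t bits = 0;
  for (int i = 0; i < 8; i++) {
    if (bytemask & (1u << i)) bits |= (0xffULL << (8 * i));
  }
  return bits;
}

// Build the chunk image of a store of `size` bytes at byte `offset` in the chunk.
inline StoreChunk make_store(uint64_t value, unsigned offset, unsigned size) {
  assert(size >= 1 && offset + size <= 8);
  uint8_t mask = static_cast<uint8_t>(((1u << size) - 1) << offset);
  uint64_t data = (value << (8 * offset)) & expand_bytemask(mask);
  return StoreChunk{data, mask};
}

// Composite the newer store on top of the older store to the same chunk:
// bytes written by the newer store win, all other bytes are inherited.
inline StoreChunk merge_stores(const StoreChunk& older, const StoreChunk& newer) {
  uint64_t newbits = expand_bytemask(newer.bytemask);
  StoreChunk merged;
  merged.data = (older.data & ~newbits) | (newer.data & newbits);
  merged.bytemask = static_cast<uint8_t>(older.bytemask | newer.bytemask);
  return merged;
}
```

For instance, a 2-byte store at offset 2 merged on top of a 4-byte store at offset 0 yields a single entry whose byte mask still covers bytes 0 through 3, so a later 4-byte load needs to consult only that one entry.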
Technically there are more efficient approaches, such as allowing stores to issue in any order so long as they do not overlap on the basis of individual bytes. However, no modern processor allows such arbitrary forwarding since the circuit complexity involved with scanning the store queue for partial address matches would be prohibitive and slow. Instead, most processors only support store to load forwarding when a single larger prior store covers the entire byte range accessed by a smaller or same sized load; all other combinations stall the load until the overlapping prior stores commit to the data cache.
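The ``stall on size mismatch'' rule can be expressed as a simple range check. The following is only a sketch of that rule as stated in the text; the `Access` structure, the `ForwardResult` enumeration and the `check_forward` function are hypothetical names, not part of PTLsim or of any real processor's documentation.

```cpp
#include <cassert>
#include <cstdint>

// Possible outcomes when a load is compared against one prior store.
enum class ForwardResult { NoOverlap, Forward, Stall };

// Hypothetical description of a memory access: start address and size in bytes.
struct Access {
  uint64_t addr;
  unsigned size;
};

// Forward only if the prior store covers every byte the load reads;
// any partial overlap stalls the load until the store commits.
inline ForwardResult check_forward(const Access& store, const Access& load) {
  assert(store.size > 0 && load.size > 0);
  uint64_t store_end = store.addr + store.size;  // one past the last byte
  uint64_t load_end = load.addr + load.size;
  bool overlap = (load.addr < store_end) && (store.addr < load_end);
  if (!overlap) return ForwardResult::NoOverlap;
  bool covers = (store.addr <= load.addr) && (load_end <= store_end);
  return covers ? ForwardResult::Forward : ForwardResult::Stall;
}
```

Under this rule an 8-byte store followed by a 4-byte load of its upper half forwards, while a 2-byte store followed by a 4-byte load of the same address stalls.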
The store inheritance scheme used by PTLsim (described first) is an improvement over the more common ``stall on size mismatch'' scheme above, but may incur more store dependency replays, since stores now depend on other stores when they target the same 8-byte chunk. As a case study, the Pentium 4 processor (Prescott core) implements a combination of these approaches.
The ReorderBufferEntry::issuestore() function is responsible for issuing all store uops. Stores are unusual in that they can issue even if their rc operand (the value to store) is not ready at the same time as the ra and rb operands forming the effective address. This property is useful since it allows a store to establish an entry in the store queue as soon as the effective address can be generated, even if the data to store is not ready. By establishing addresses in the store queue as soon as possible, we can avoid performance losses associated with the unnecessary replay of loads that may depend on a store whose address is unavailable at the time the load issues. In effect, this means that each store uop may actually issue twice.
In the first phase issue, which occurs as soon as the ra and rb operands become ready, the store uop computes its effective physical address, checks that address for all exceptions (such as alignment problems and page faults) and writes the address into the corresponding LoadStoreQueueEntry structure before setting its addrvalid bit as described in Section 18.4. If an exception is detected at this point, the invalid bit in the store queue entry is set and the destination physical register's FLAG_inv flag is set so any attempt to commit the store will fail.
The load queue is then searched to find any loads after the current store in program order which have already issued but have done so without forwarding data from the current store. These loads erroneously issued before the current store (now known to overlap the load's address) was able to forward the correct data to the offending load(s). This situation is known as aliasing, and is effectively a mis-speculation requiring us to reissue any uops depending on the store. The redispatch method (Section 20.2) is used to re-execute only those uops dependent (either directly or indirectly) on the store.
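The load queue scan can be sketched as follows. This is a simplified illustration under stated assumptions, not the PTLsim implementation: the `QueuedLoad` structure, its fields and the `find_aliased_loads` function are hypothetical, program order is modeled with a plain sequence number, and overlap is tested on the 8-byte chunk address plus byte mask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical load queue entry (field names are assumptions for illustration).
struct QueuedLoad {
  uint64_t chunk;          // physical address >> 3 (8-byte aligned chunk)
  uint8_t  bytemask;       // bytes of the chunk read by the load
  uint64_t seq;            // program order; larger means younger
  bool     issued;         // load has already issued
  int64_t  forwarded_from; // seq of the store it forwarded from, or -1 if none
};

// Return the indices of loads that mis-speculated with respect to a store
// whose address has just been resolved: younger, already issued, overlapping
// the store's bytes, yet holding data from an older source or from the cache.
inline std::vector<std::size_t> find_aliased_loads(
    const std::vector<QueuedLoad>& loads,
    uint64_t store_seq, uint64_t store_chunk, uint8_t store_bytemask) {
  assert(store_bytemask != 0);
  std::vector<std::size_t> aliased;
  for (std::size_t i = 0; i < loads.size(); i++) {
    const QueuedLoad& ld = loads[i];
    if (!ld.issued || ld.seq <= store_seq) continue;   // not yet issued, or older
    if (ld.chunk != store_chunk) continue;             // different 8-byte chunk
    if ((ld.bytemask & store_bytemask) == 0) continue; // no common bytes
    if (ld.forwarded_from >= static_cast<int64_t>(store_seq)) continue;
    aliased.push_back(i);
  }
  return aliased;
}
```

Each index returned would correspond to a load whose dependent uops must be redispatched.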
Since the redispatch process required to correct aliasing violations is expensive and may result in infinite loops, it is desirable to predict in advance which loads and stores are likely to alias each other such that loads predicted to alias are never issued when prior stores in the store queue still have unknown addresses. This works because in most out of order processors, statistically speaking, very few loads alias stores compared to normal loads from the cache. When an aliasing mis-speculation occurs, an entry is added to a small fully associative structure (typically entries) called the Load Store Alias Predictor (LSAP). This structure is indexed by a portion of the address of the load instruction that aliased. This allows the load unit to avoid issuing any load uop that matches any address in the LSAP if any prior store addresses are still unresolved; if this is the case, a dependency is created on the first unresolved store such that the load is replayed (and the load and store queues are again scanned) once that store resolves. Similar methods of aliasing prediction are used by the Pentium 4 (Prescott core only) and Alpha 21264.
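A predictor of this kind might be sketched as below. This is a minimal illustration, not PTLsim's actual LSAP code: the class layout, the use of the low 16 bits of the load's instruction address as the tag, and the round-robin replacement policy are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a small fully associative load/store alias predictor.
class LoadStoreAliasPredictor {
 public:
  explicit LoadStoreAliasPredictor(std::size_t entries)
      : tags_(entries, 0), valid_(entries, false), next_(0) {
    assert(entries > 0);
  }

  // Called when a load at instruction address `load_rip` was found to alias.
  void record_alias(uint64_t load_rip) {
    uint64_t tag = tag_of(load_rip);
    if (contains(tag)) return;
    tags_[next_] = tag;                  // assumed round-robin replacement
    valid_[next_] = true;
    next_ = (next_ + 1) % tags_.size();
  }

  bool predicts_alias(uint64_t load_rip) const {
    return contains(tag_of(load_rip));
  }

  // A load predicted to alias is held back while any prior store address
  // is still unresolved; all other loads may issue speculatively.
  bool may_issue(uint64_t load_rip, bool prior_store_unresolved) const {
    return !(prior_store_unresolved && predicts_alias(load_rip));
  }

 private:
  // Assumption: the tag is the low 16 bits of the load instruction address.
  static uint64_t tag_of(uint64_t rip) { return rip & 0xffff; }

  bool contains(uint64_t tag) const {
    for (std::size_t i = 0; i < tags_.size(); i++) {
      if (valid_[i] && tags_[i] == tag) return true;
    }
    return false;
  }

  std::vector<uint64_t> tags_;
  std::vector<bool> valid_;
  std::size_t next_;
};
```

The linear search stands in for the parallel tag comparison a hardware fully associative structure would perform.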
At this point the store queue is searched for prior stores to the same 8-byte block as described above in Section 22.1; if the store depends on a prior store, the scheduler structures are updated to add an additional dependency (in operands[RS]) on this prior store before the store is replayed in accordance with Section 19.3 to wait for the prior store to complete. If no prior store is found, or the prior store is ready, the current store is marked as a second phase store by setting the load_store_second_phase flag in its ROB entry. Finally, the store is replayed in accordance with Section 19.3.
In the second phase of store uop scheduling, the store uop is only re-issued when all four operands (ra + rb address, rc data and rs source store queue entry) are valid. The second phase repeats the scan of the load and store queues described above to catch any loads and stores that may have issued between the first and second phase issues; the store is replayed a third time if necessary. Otherwise, the rc operand data is merged with the data from the prior store's store queue entry (if any), and the combined data and bytemask are written into the current store's store queue entry. Finally, the entry's dataready bit is set to make the entry available for forwarding to other waiting loads and stores.
The first and second phases may be combined into a single issue without replay if the address and data operands of the store are all ready at the same time and the prior store (if any) that the current store inherits from has already successfully issued.
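The choice between a two phase issue and a combined single issue can be summarized as a small decision function. The `StoreState` structure, the `StoreIssueAction` enumeration and `next_action` are hypothetical names used only to restate the conditions given in the text; the real logic in ReorderBufferEntry::issuestore() also handles exceptions and alias checks, which are omitted here.

```cpp
#include <cassert>

// What a store uop can do the next time the scheduler considers it.
enum class StoreIssueAction {
  WaitForAddress,       // ra/rb not ready: cannot issue at all
  FirstPhaseThenReplay, // establish the address, then replay for data/prior store
  CompleteInOneIssue    // address, data and prior store all ready: merge and finish
};

// Hypothetical readiness summary for one store uop.
struct StoreState {
  bool addr_ready;       // ra and rb operands are ready
  bool data_ready;       // rc operand (value to store) is ready
  bool has_prior_store;  // an older store targets the same 8-byte chunk
  bool prior_store_done; // that older store has completed
};

inline StoreIssueAction next_action(const StoreState& s) {
  // prior_store_done is only meaningful when a prior store exists.
  assert(s.has_prior_store || !s.prior_store_done);
  if (!s.addr_ready) return StoreIssueAction::WaitForAddress;
  bool prior_ok = !s.has_prior_store || s.prior_store_done;
  if (s.data_ready && prior_ok) return StoreIssueAction::CompleteInOneIssue;
  return StoreIssueAction::FirstPhaseThenReplay;
}
```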