next up previous contents
Next: Stores Up: Out of Order Processor Previous: Speculation and Recovery   Contents

Load Issue


Address Generation

Loads and stores both have their virtual addresses computed by the ReorderBufferEntry::addrgen() method, which adds the ra and rb operands. If the load or store is one of the special unaligned fixup forms (ld.lo, ld.hi, st.lo, st.hi) described in Section 5.6, the address is then re-aligned according to the type of instruction.
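The address computation and re-alignment above can be sketched as follows. This is a simplified illustration, not PTLsim's actual code: the enum and function names are invented, and only 64-bit accesses are shown. The ld.lo/st.lo forms access the aligned 8-byte chunk at or below the address, while ld.hi/st.hi access the next aligned chunk above it.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative subset of uop kinds (names are assumptions).
enum UopKind { LD, LD_LO, LD_HI };

// Sketch of address generation: addr = ra + rb, then re-align
// for the unaligned fixup forms described in Section 5.6.
uint64_t addrgen(uint64_t ra, uint64_t rb, UopKind kind) {
    uint64_t addr = ra + rb;                  // base + offset
    switch (kind) {
    case LD_LO: return addr & ~7ULL;          // floor to 8-byte boundary
    case LD_HI: return (addr & ~7ULL) + 8;    // next aligned 8-byte chunk
    default:    return addr;                  // ordinary (aligned) access
    }
}
```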

At this point, the check_and_translate() method is used to translate the virtual address into a mapped physical address using the page tables and TLB. The function of this method varies significantly between userspace-only PTLsim and full system PTLsim/X. In userspace-only PTLsim, the shadow page access tables (Section 11.3) are used to do access checks; the same virtual address is then returned for use as the physical address. In full system PTLsim/X, the real x86 page tables are used to produce the physical address, significantly more involved checks are done, and finally a pointer into PTLsim's mapping of all physical pages is returned (see Section 14.3.1).

If the virtual address is invalid or not present for the specified access type, check_and_translate() will return a null pointer. At this point, handle_common_load_store_exceptions() is called to take action as follows.

If a load or store accesses an unaligned address but is not one of the special ld.lo/ld.hi/st.lo/st.hi uops described in Section 5.6, the processor responds by first setting the ``unaligned'' bit in the original TransOp in the basic block cache, then annulling all uops after and including the problem load, and finally restarting the fetch unit at the RIP address of the load or store itself. When the load or store uop is refetched, it is transformed into a pair of ld.lo/ld.hi or st.lo/st.hi uops in accordance with Section 5.6. This refetch approach is required rather than a simple replay, since a replay would require allocating two entries in the issue queue and potentially two ROB entries, which is not possible with the PTLsim design once uops have been renamed.
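The refetch path can be sketched as below. This is a hedged outline with invented helper names (only the TransOp unaligned bit and the fetch redirect are modeled; the pipeline annulment step is elided as a comment):

```cpp
#include <cassert>
#include <cstdint>

// Minimal stand-in for the TransOp in the basic block cache.
struct TransOp {
    bool unaligned = false;   // once set, the decoder splits this uop
    uint64_t rip = 0;         // address of the original x86 instruction
};

// Sketch of the unaligned-access response: mark the original TransOp,
// (annul younger uops, elided here), and return the RIP at which the
// fetch unit must restart so the uop is refetched in split form.
uint64_t handle_unaligned(TransOp& origin) {
    origin.unaligned = true;  // future decodes emit ld.lo/ld.hi pair
    // annul_after_and_including(load);  // pipeline flush, not shown
    return origin.rip;        // fetch restarts at the load/store itself
}
```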

If a load or store would cause a page fault for any reason, the check_and_translate() function will fill in the exception and pfec (Page Fault Error Code) variables. These two variables are then placed into the low and high 32 bits, respectively, of the 64-bit result in the destination physical register or store buffer, in place of the actual data. The load or store is then aborted and execution returns to the ReorderBufferEntry::issue() method, causing the result to be marked with an exception (EXCEPTION_PageFaultOnRead or EXCEPTION_PageFaultOnWrite).
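The packing of the exception code and PFEC into the 64-bit result can be sketched as follows; the helper names are assumptions for illustration, but the layout (exception in the low 32 bits, PFEC in the high 32 bits) follows the text above:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: on a page fault, the result register or store buffer holds
// the exception code and Page Fault Error Code instead of data.
uint64_t make_fault_result(uint32_t exception, uint32_t pfec) {
    return (uint64_t(pfec) << 32) | exception;  // pfec high, exception low
}

// Accessors for unpacking the two fields later.
uint32_t result_exception(uint64_t r) { return uint32_t(r); }
uint32_t result_pfec(uint64_t r)      { return uint32_t(r >> 32); }
```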

One x86-specific complication arises at this point. If a load (or store) uop is the high part (ld.hi or st.hi) of an unaligned load or store pair, but the user access did not actually touch any of the high 64-bit chunk accessed by the ld.hi or st.hi uop, the uop must be completely ignored, even if the high chunk falls on an invalid page. This is because it is perfectly legal to do an unaligned load or store at the very end of a page such that the next 64-bit chunk is not mapped to a valid page; the x86 architecture mandates that the load or store execute correctly as far as the user program is concerned.
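The decision of whether a ld.hi/st.hi uop can be ignored reduces to checking whether the user access actually extends past the low aligned 8-byte chunk. A hedged sketch (function name invented for illustration):

```cpp
#include <cassert>
#include <cstdint>

// Returns true if an access of sizebytes at addr spills into the aligned
// 8-byte chunk above the one containing addr. If false, the high-part
// uop of an unaligned pair touches no user bytes and must be ignored,
// even when the next chunk lies on an unmapped page.
bool touches_high_chunk(uint64_t addr, unsigned sizebytes) {
    uint64_t lo_chunk = addr & ~7ULL;   // aligned chunk containing addr
    uint64_t end      = addr + sizebytes;  // one past the last byte
    return end > lo_chunk + 8;          // spills into the next chunk?
}
```

For example, an 8-byte access at the last aligned slot of a page (address 0xff8) ends exactly at the page boundary and never touches the next chunk, so the high part is ignored.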

Store Queue Check and Store Dependencies

After these exception checks, the load/store queue (LSQ) is scanned backwards in time from the current load's entry to the LSQ's head. If a given LSQ entry corresponds to a store, its address has been resolved, and the memory range needed by the load overlaps the memory range touched by the store, then the load effectively has a dependency on that earlier store which must be resolved before the load can issue. The meaning of ``overlapping memory range'' is defined more precisely in Section 22.1.
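The backward scan can be sketched as below. The structure and function names are invented for illustration, and overlap is modeled as a simple byte-range intersection rather than the exact definition of Section 22.1:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified LSQ entry for illustration only.
struct LSQEntry {
    bool is_store;        // store (vs. load) entry
    bool addr_resolved;   // address known yet?
    uint64_t addr;        // starting byte address
    unsigned size;        // bytes accessed
};

// Two byte ranges [a, a+asz) and [b, b+bsz) intersect.
bool overlaps(uint64_t a, unsigned asz, uint64_t b, unsigned bsz) {
    return (a < b + bsz) && (b < a + asz);
}

// Scan backwards from the load toward the LSQ head; return the index of
// the youngest older store with a resolved, overlapping address, or -1.
int find_store_dependency(const std::vector<LSQEntry>& lsq, int load_idx) {
    const LSQEntry& ld = lsq[load_idx];
    for (int i = load_idx - 1; i >= 0; i--) {
        const LSQEntry& e = lsq[i];
        if (!e.is_store || !e.addr_resolved) continue;
        if (overlaps(ld.addr, ld.size, e.addr, e.size)) return i;
    }
    return -1;  // no overlapping earlier store found
}
```

Note that this sketch only covers resolved store addresses; the unresolved-address case is handled by the aliasing predictor described next.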

In some cases, the addresses of one or more prior stores that a load may depend on may not have been resolved by the time the load issues. Some processors will stall the load uop until all prior store addresses are known, but this can decrease performance by incorrectly preventing independent loads from starting as soon as their address is available. For this reason, the PTLsim processor model aggressively issues loads as soon as possible unless the load is predicted to frequently alias another store currently in the pipeline. This load/store aliasing prediction technique is described in Section 22.2.1.

In either of the cases above (an overlapping store is identified by address but its data is not yet available for forwarding to the load, or a prior store's address has not been resolved but is predicted to overlap the load), the load effectively has a data flow dependency on the earlier store. This dependency is represented by setting the load's fourth rs operand (operands[RS] in the ReorderBufferEntry) to the store the load is waiting on. After adding this dependency, the replay() method forces the load back to the dispatched state, where it waits until the prior store is resolved. When the load issues a second time, the store queue is scanned again to make sure no intervening stores arrived in the meantime. If a different match is found this time, the load is replayed again. In practice, loads are rarely replayed more than once.

Data Extraction

Once the prior store a load depends on (if any) is ready and all the exception checks above have passed, it is time to actually obtain the load's data. This process can be complicated since some bytes in the region accessed by the load could come from the data cache while other bytes may be forwarded from a prior store. If one or more bytes need to be obtained from the data cache, the L1 cache is probed (via the caches.probe_cache_and_sfr() function) to see if the required line is present. If so, and the combination of the forwarded store (if any) and the L1 line fills in all bytes required by the load, the final data can be extracted.

To extract the data, the load unit creates a 64-bit temporary buffer by overlaying the bytes touched by the prior store (if any) on top of the bytes obtained from the cache (i.e., the bytes at the mapped address returned by the addrgen() function). The correct word is then extracted and sign extended (if required) from this buffer to form the result of the load. Unaligned loads (described in Section 5.6) are somewhat more complex in that both the low and high 64 bit chunks from the ld.lo and ld.hi uops, respectively, are placed into a 128-bit buffer from which the final result is extracted.
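The byte-overlay step can be sketched as follows. This is a hedged illustration of the aligned 64-bit case only (the 128-bit unaligned path is analogous); the per-byte mask convention is an assumption, with bit i set meaning byte i is supplied by the forwarded store:

```cpp
#include <cassert>
#include <cstdint>

// Build the 64-bit temporary buffer for a load: bytes flagged in
// bytemask come from the forwarded store's data, all other bytes come
// from the data obtained from the cache.
uint64_t merge_store_over_cache(uint64_t cache_data, uint64_t store_data,
                                uint8_t bytemask) {
    uint64_t merged = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t byte = ((bytemask >> i) & 1)
            ? (store_data >> (i * 8)) & 0xff   // forwarded from store
            : (cache_data >> (i * 8)) & 0xff;  // read from cache line
        merged |= byte << (i * 8);
    }
    return merged;  // word extraction and sign extension happen after this
}
```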

For simulation purposes only, the loaded data is immediately accessed and recorded by issueload() regardless of whether or not there is a cache miss; this makes the loaded data significantly easier to track. In a real processor, data extraction obviously only happens after the missing line actually arrives, but this implementation shortcut does not affect the simulated performance results.


Cache Miss Handling

If no combination of the prior store's forwarded bytes and data present in the L1 cache can fulfill a load, this is a miss, and the lower cache levels must be accessed. This process is described in Sections 25.2 and 25.3. As far as the core is concerned, the load is completed at this point even if the data has not yet arrived. The issue queue entry for the load can be released, since the load is now officially in progress and cannot be replayed. Once the loaded data has arrived, the cache subsystem calls the OutOfOrderCoreCacheCallbacks::dcache_wakeup() function, which marks both the physical register and the LSQ entry of the load as ready and places the load's ROB entry into the completed state. This allows the processor to wake up dependents of the load on the next cycle.
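The miss-completion handshake can be outlined as below. The real interface is OutOfOrderCoreCacheCallbacks::dcache_wakeup(); the state structure and free function here are simplified stand-ins for illustration:

```cpp
#include <cassert>

// Simplified per-load state touched by the wakeup callback.
struct LoadState {
    bool physreg_ready  = false;  // destination physical register
    bool lsq_ready      = false;  // load/store queue entry
    bool rob_completed  = false;  // reorder buffer entry state
};

// Invoked by the cache subsystem when the missing line arrives; after
// this, dependents of the load can wake up on the next cycle.
void dcache_wakeup(LoadState& ld) {
    ld.physreg_ready = true;   // result register now holds valid data
    ld.lsq_ready     = true;   // LSQ entry marked ready
    ld.rob_completed = true;   // ROB moves to the completed state
}
```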


Matt T Yourst 2007-09-26