As described in Section 5.1, x86 instructions are decoded into uops prior to actual execution by the out of order core. Some processors do this translation as x86 instructions are fetched from an L1 instruction cache, while others use a trace cache to store pre-decoded uops. PTLsim takes a middle ground to allow maximum simulation flexibility. Specifically, the fetch stage accesses the L1 instruction cache and stalls on cache misses as if it were fetching several variable-length x86 instructions per cycle. However, actually decoding x86 instructions into uops over and over again during simulation would be extraordinarily slow.
Therefore, for simulation purposes only, the out of order model uses the PTLsim basic block cache. The basic block cache, described in Chapter 6, stores pre-decoded uops for each basic block, and is indexed using the RIPVirtPhys structure, consisting of the RIP virtual address, several context-dependent flags and the physical page(s) spanned by the basic block (in PTLsim/X only).
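The exact layout of RIPVirtPhys is defined in the PTLsim sources; purely as an illustration, a lookup key of this kind can be pictured as a small structure hashed on the virtual RIP together with the context flags and the spanned physical pages, as in the hypothetical C++ sketch below (the field names, the hash function, and the use of std::unordered_map are invented for clarity and do not reflect PTLsim's actual data structures):

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    // Hypothetical stand-in for PTLsim's RIPVirtPhys key: the virtual RIP,
    // some context-dependent mode flags, and the physical page(s) spanned by
    // the basic block (the physical pages only matter in PTLsim/X).
    struct BasicBlockKey {
        uint64_t rip;       // virtual address of the first x86 instruction
        uint32_t flags;     // context-dependent flags (e.g. operating mode)
        uint64_t mfn_lo;    // physical page of the block's first byte
        uint64_t mfn_hi;    // physical page of the block's last byte

        bool operator==(const BasicBlockKey& o) const {
            return rip == o.rip && flags == o.flags &&
                   mfn_lo == o.mfn_lo && mfn_hi == o.mfn_hi;
        }
    };

    struct BasicBlockKeyHash {
        std::size_t operator()(const BasicBlockKey& k) const {
            // Simple illustrative hash; PTLsim uses its own hash table classes.
            return k.rip ^ (k.flags * 0x9e3779b97f4a7c15ull) ^ k.mfn_lo ^ (k.mfn_hi << 1);
        }
    };

    struct BasicBlock;   // holds the pre-decoded transops (contents not shown)

    // The basic block cache maps a full key to its decoded block.
    using BasicBlockCache =
        std::unordered_map<BasicBlockKey, BasicBlock*, BasicBlockKeyHash>;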
During the fetch process (implemented in the OutOfOrderCore::fetch() function in ooopipe.cpp), PTLsim takes the current RIP to fetch from (fetchrip), constructs a full RIPVirtPhys key from it and the current context, and queries the basic block cache with this key. If the basic block has never been decoded before, bbcache.translate() is invoked to decode it on the spot. All of this is handled by the fetch_or_translate_basic_block() function.
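A condensed, hypothetical view of this lookup-or-translate control flow is sketched below; the real fetch_or_translate_basic_block() in ooopipe.cpp deals with details not shown here, and the types and signatures below are simplified guesses rather than PTLsim's actual interfaces:

    #include <cstdint>

    struct Context;           // architectural state of the current context (opaque here)
    struct BasicBlock;        // pre-decoded transops (opaque here)

    // Simplified stand-in for the RIPVirtPhys key described above.
    struct RIPVirtPhys {
        uint64_t rip;
        void update(const Context& ctx);   // hypothetical: fill in the flags and
                                           // physical page(s) from the context
    };

    // Hypothetical cache interface; only translate() is named in the text,
    // and both signatures are guesses. The real implementations live in PTLsim.
    struct BasicBlockCache {
        BasicBlock* get(const RIPVirtPhys& key);                      // nullptr on miss
        BasicBlock* translate(Context& ctx, const RIPVirtPhys& key);  // decode x86 -> uops now
    };

    // Sketch of the fetch_or_translate_basic_block() control flow: build the
    // full key from the current fetch RIP and context, consult the cache, and
    // decode the basic block on the spot if it has never been seen before.
    BasicBlock* fetch_or_translate_basic_block(BasicBlockCache& bbcache,
                                               Context& ctx,
                                               RIPVirtPhys fetchrip) {
        fetchrip.update(ctx);                        // complete the RIPVirtPhys key
        BasicBlock* bb = bbcache.get(fetchrip);      // fast path: already decoded
        if (!bb)
            bb = bbcache.translate(ctx, fetchrip);   // slow path: decode this block now
        return bb;
    }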
Technically speaking, the cached basic blocks contain transops, rather than uops: as explained in Section 5.1, each transop gets transformed into a true uop after it is renamed in the rename stage. In the following discussion, the term uop is used interchangeably with transop.
Each transop fetched into the pipeline is immediately assigned a monotonically increasing uuid (universally unique identifier) to uniquely track it for debugging and statistical purposes. The fetch unit attaches additional information to each transop (such as the uop's uuid and the RIPVirtPhys of the corresponding x86 instruction) to form a FetchBufferEntry structure. This entry is then placed into the fetch queue (fetchq), assuming the queue is not full (if it is, the fetch stage stalls). As the fetch unit encounters transops with their EOM (end of macro-op) bit set, the fetch RIP is advanced to the next x86 instruction according to the instruction length stored in the SOM (start of macro-op) transop.
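As a rough sketch of this bookkeeping (the types and fields below are simplified stand-ins for PTLsim's TransOp, FetchBufferEntry, and fetchq, not the actual declarations), the per-transop fetch step looks roughly like this:

    #include <cstdint>

    // Simplified stand-ins for PTLsim's TransOp and FetchBufferEntry.
    struct TransOp {
        bool som;           // start of macro-op: first uop of an x86 instruction
        bool eom;           // end of macro-op: last uop of an x86 instruction
        uint8_t bytes;      // x86 instruction length, carried by the SOM transop
    };

    struct FetchBufferEntry : TransOp {
        uint64_t uuid;      // monotonically increasing id assigned at fetch time
        uint64_t rip;       // RIP of the owning x86 instruction (a RIPVirtPhys in PTLsim)
    };

    // Minimal fixed-size queue standing in for the fetch queue (fetchq).
    template <typename T, int SIZE = 32>
    struct Queue {
        T slots[SIZE];
        int count = 0;
        bool full() const { return count == SIZE; }
        T& alloc() { return slots[count++]; }
    };

    // Sketch: wrap one transop into a FetchBufferEntry, enqueue it, and advance
    // the fetch RIP once the last uop of the x86 instruction has been fetched.
    // Returns false when the fetch queue is full and the fetch stage must stall.
    bool fetch_one(Queue<FetchBufferEntry>& fetchq, const TransOp& op,
                   uint64_t& fetchrip, uint64_t& next_uuid, uint8_t& cur_length) {
        if (fetchq.full())
            return false;                      // stall: no room in the fetch queue

        FetchBufferEntry& fe = fetchq.alloc();
        static_cast<TransOp&>(fe) = op;        // copy the pre-decoded transop
        fe.uuid = next_uuid++;                 // unique id for tracing and statistics
        fe.rip  = fetchrip;

        if (op.som) cur_length = op.bytes;     // remember this instruction's length
        if (op.eom) fetchrip += cur_length;    // advance to the next x86 instruction
        return true;
    }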
Branch uops trigger the branch prediction mechanism (Section 26) used to select the next fetch RIP. The branchpred.predict() function is called with various information encoded in the branch transop, along with the RIP of the next sequential x86 instruction after the branch, and its prediction is used to redirect fetching. If the branch is predicted not taken, the sense of the branch's condition code is inverted and the transop's riptaken and ripseq fields are swapped; in effect, every branch is internally treated as predicted taken, so a branch resolves as correctly predicted only if it is actually taken. Indirect branches (jumps) have their riptaken field overwritten by the predicted target address.
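The following hypothetical sketch captures the intent of this normalization; the opclass constants, the invert_cond() helper, and the predictor signature are illustrative stand-ins rather than PTLsim's real branchpred interface:

    #include <cstdint>

    // Illustrative stand-ins; PTLsim's real opclass encoding and branch
    // predictor interface differ in detail.
    enum OpClass { OPCLASS_COND_BRANCH, OPCLASS_INDIR_BRANCH, OPCLASS_OTHER };

    struct BranchTransOp {
        OpClass opclass;
        int cond;            // condition code the branch tests
        uint64_t riptaken;   // target RIP if the branch is taken
        uint64_t ripseq;     // fall-through RIP (next sequential x86 instruction)
    };

    // x86 condition codes pair up as even/odd complements (e.g. Z/NZ), so
    // flipping the low bit inverts the sense of the condition.
    inline int invert_cond(int cond) { return cond ^ 1; }

    struct BranchPredictor {
        // Trivial stand-in: always predict taken; PTLsim's real
        // branchpred.predict() is far more elaborate.
        uint64_t predict(const BranchTransOp& op, uint64_t /*ripafter*/) {
            return op.riptaken;
        }
    };

    // Sketch of the fetch-time branch handling described above: ask the
    // predictor for the next fetch RIP, then normalize the transop so the
    // predicted path is always its "taken" path.
    uint64_t handle_branch(BranchPredictor& branchpred, BranchTransOp& op,
                           uint64_t ripafter) {
        uint64_t predrip = branchpred.predict(op, ripafter);

        if (op.opclass == OPCLASS_COND_BRANCH && predrip == op.ripseq) {
            // Predicted not taken: invert the condition and swap the targets,
            // so the branch is only considered correctly predicted if taken.
            op.cond = invert_cond(op.cond);
            uint64_t tmp = op.riptaken;
            op.riptaken = op.ripseq;
            op.ripseq = tmp;
        } else if (op.opclass == OPCLASS_INDIR_BRANCH) {
            // Indirect branch: the assumed target becomes the predicted one.
            op.riptaken = predrip;
        }
        return predrip;    // next fetch RIP
    }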
PTLsim models the instruction cache by using the caches.probe_icache() function to probe the cache with the physical address of the current fetch window. Most modern x86 processors fetch aligned 16-byte or 32-byte blocks into the decoder and try to pick out 3 or 4 x86 instructions per cycle. Since PTLsim uses the basic block cache, it does not actually decode anything at this point, but it still limits itself to at most 4 uops per cycle (or whatever limit is specified in ooocore.h) taken from the aligned 16-byte window containing the fetch RIP; if the fetch RIP crosses into a new window, fetching from that window is deferred to the next cycle. The instruction cache is only probed when switching fetch windows.
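A minimal sketch of this window bookkeeping is shown below, assuming a 4-uop fetch width and aligned 16-byte windows; probe_icache() here is a trivial stand-in for PTLsim's caches.probe_icache(), and the surrounding structure is invented for illustration:

    #include <cstdint>

    // Trivial stand-in for PTLsim's cache subsystem; the real
    // caches.probe_icache() models the L1 instruction cache.
    struct CacheSubsystem {
        bool probe_icache(uint64_t /*physaddr*/) { return true; }  // always hit here
    };

    const int FETCH_WIDTH = 4;            // per-cycle uop limit (the actual constant lives in ooocore.h)
    const uint64_t FETCH_WINDOW = 16;     // bytes per aligned fetch window

    // Sketch of the per-cycle window bookkeeping: uops are taken only from one
    // aligned 16-byte window per cycle, a new window is started on the next
    // cycle, and the I-cache is probed only when the window changes.
    struct FetchWindowState {
        uint64_t current_window = ~0ull;  // aligned base of the window being fetched

        // Called at the start of each fetch cycle. Returns false if the I-cache
        // misses on a new window, in which case the fetch stage must stall.
        bool begin_cycle(CacheSubsystem& caches, uint64_t fetch_physaddr) {
            uint64_t window = fetch_physaddr & ~(FETCH_WINDOW - 1);
            if (window != current_window) {
                if (!caches.probe_icache(fetch_physaddr))
                    return false;         // miss: set waiting_for_icache_fill (see below)
                current_window = window;  // new window is now being fetched
            }
            return true;
        }

        // Called for each candidate uop: true while its instruction still lies
        // in the current window and the per-cycle uop limit is not exhausted.
        bool may_take(uint64_t physaddr, int uops_this_cycle) const {
            return uops_this_cycle < FETCH_WIDTH &&
                   (physaddr & ~(FETCH_WINDOW - 1)) == current_window;
        }
    };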
If the instruction cache indicates a miss, or the ITLB misses, the waiting_for_icache_fill variable is set, and the fetch unit remains stalled in subsequent cycles until the cache subsystem calls the OutOfOrderCoreCacheCallbacks::icache_wakeup() callback registered by the core. The core's interactions with the cache subsystem will be described in great detail later on.
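The stall itself can be pictured as a simple flag handshake between the core and the cache subsystem. In the sketch below only the waiting_for_icache_fill flag and the icache_wakeup() callback name come from the description above; everything else is a simplified stand-in for the real PTLsim classes:

    #include <cstdint>

    struct OutOfOrderCore;

    // Simplified stand-in for the callback interface the core registers with
    // the cache subsystem; only the icache_wakeup() name comes from the text.
    struct OutOfOrderCoreCacheCallbacks {
        OutOfOrderCore& core;
        explicit OutOfOrderCoreCacheCallbacks(OutOfOrderCore& c) : core(c) {}
        void icache_wakeup(uint64_t physaddr);    // invoked when the fill completes
    };

    struct OutOfOrderCore {
        bool waiting_for_icache_fill = false;

        // I-cache portion of the fetch stage only: on an I-cache or ITLB miss,
        // raise the flag and stay stalled until the wakeup callback clears it.
        void fetch_icache_stage(bool icache_hit, bool itlb_hit) {
            if (waiting_for_icache_fill)
                return;                           // still stalled from an earlier miss
            if (!icache_hit || !itlb_hit) {
                waiting_for_icache_fill = true;   // stall until the line arrives
                return;
            }
            // ... otherwise continue fetching uops from the basic block cache ...
        }
    };

    // Called by the cache subsystem when the outstanding miss completes:
    void OutOfOrderCoreCacheCallbacks::icache_wakeup(uint64_t /*physaddr*/) {
        core.waiting_for_icache_fill = false;     // fetch may resume next cycle
    }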