Next: PTLsim Support Subsystems Up: PTLsim User's Guide Previous: x86 Instructions and Micro-Ops Contents

Subsections

Decoder Architecture and Basic Block Cache

Basic Block Cache

As described in Section 5.1, x86 instructions are decoded into transops prior to actual execution by the core. To achieve high performance, PTLsim maintains a basic block cache (BB cache) containing the program ordered translated uop (transop) sequence for previously decoded basic blocks in the program. Each basic block (BasicBlock structure) consists of up to 64 uops and is terminated by either a control flow operation (conditional, unconditional, indirect branch) or a barrier operation, i.e. a microcode assist (including system calls and serializing instructions).

Identifying Basic Blocks

In a userspace only simulator, the RIP of a basic block's entry point (plus a few other attributes described below) serves to uniquely identify that basic block, and can be used as a key in accessing the basic block cache. In a full system simulator, the BB cache must be indexed by much more than just the virtual address, because of potential virtual page aliasing and the need to persistently cache translations across context switches. The following fields, in the RIPVirtPhys structure, are required to correctly access the BB cache in any full system simulator or binary translation system (128 bits total):

rip: Virtual address of first instruction in BB (48 bits), since embedded RIP-relative constants and branch encodings depend on this. Modern OS's map shared libraries and binaries at the same addresses every time, so translation caching remains effective across runs.
mfnlo: MFN (Machine Frame Number, i.e. physical page frame number) of first byte in BB (28 bits), since we need to handle self modifying code invalidations based on physical addresses (because of possible virtual page aliasing in multiple page tables)
mfnhi: MFN of last byte in BB (28 bits), since a single x86 basic block can span up to two pages. In pathological cases, it is possible to create two page tables that both map the same MFN X at virtual address V, but map different MFNs at virtual address V+4096. If an instruction crosses this page boundary, the meaning of the instruction bytes on the second page will be different; hence we must take into account both physical pages to look up the correct translation.
Context info (up to 24 bits), since the uops generated depend on the current CPU mode and CS descriptor settings
- use64: 32-bit or 64-bit mode? (encoding differences)
- kernel: Kernel or user mode?
- df: EFLAGS status (direction flag, etc)
- Other info (e.g. segmentation assumptions, etc.)

The basic block cache is always indexed using an RIPVirtPhys structure instead of a simple RIP. To do this, the RIPVirtPhys.rip field is set to the desired RIP, then RIPVirtPhys.update(ctx) is called to translate the virtual address onto the two physical page MFNs it could potentially span (assuming the basic block crosses two pages).

Notice that the other attribute bits (use64, kernel, df) mean that two distinct basic blocks may be decoded from the exact same RIP on the same physical page(s), yet the uops in each translated basic block will be different because the two basic blocks were translated in a different context (relative to these attribute bits). This is especially important for x86 string move/compare/store/load/scan instructions (MOVSB, CMPSB, STOSB, LODSB, SCASB), since the correct increment constants depend on the state of the direction flag in the context in which the BB was used. Similarly, if a user program tries to decode a supervisor-only opcode, code to call the general protection fault handler will be produced instead of the real uops produced only in kernel mode.

Invalid Translations

The BasicBlockCache.translate(ctx, rvp) function always returns a BasicBlock object, even if the specified RIP was on an invalid page or some of the instruction bytes were invalid. When decoding cannot continue for some reason, the decoder simply outputs a microcode branch to one of the following assists:

ASSIST_INVALID_OPCODE when the opcode or instruction operands are invalid relative to the current context.
ASSIST_EXEC_PAGE_FAULT when the specified RIP falls on an invalid page. This means a page is marked as not present in the current page table at the time of decoding, or the page is present but has its NX (no execute) bit set in the page table entry. The EXEC_PAGE_FAULT assist is also generated when the page containing the RIP itself is valid, but part of an instruction extends beyond that page onto an invalid page. The decoder tries to decode as many instruction bytes as possible, but will insert an EXEC_PAGE_FAULT assist whenever it determines, based on the bytes already decoded, that the remainder of the instruction would fall on the invalid page.
ASSIST_GP_FAULT when attempting to decode a restricted kernel-only opcode while running in user mode.

Before redirecting execution to the kernel's exception handler, the EXEC_PAGE_FAULT microcode verifies that the page in question is still invalid. This avoids a spurious page fault in the case where an instruction was originally decoded on an invalid page, but the page tables were updated after the translation was first made such that the page is now valid. When this is the case, all bogus basic blocks on the page (which were decoded into a call to EXEC_PAGE_FAULT) must be invalidated, allowing a correct translation to be made now that the page is valid. The page at the virtual address after the page in question may also need to be invalidated in the case where some instruction bytes cross the page boundary.

Self Modifying Code

In x86 processors, the translation process is considerably more complex, because of self modifying code (SMC) and its variants. Specifically, the instruction bytes of basic blocks that have already been translated and cached may be overwritten; these old translations must be discarded. The x86 architecture guarantees that all code modifications will be visible immediately after the instruction making the modification; unlike other architectures, no ``instruction cache flush'' operation is provided. Several kinds of SMC must be handled correctly:

Classical SMC: stores currently in the pipeline overwrite other instructions that have already been fetched into the pipeline and even speculatively executed out of order;
Indirect SMC: stores write to a page on which previously translated code used to reside, but that page is now being reused for unrelated data or new code. This case frequently arises in operating system kernels when pages are swapped in and out from disk.
Cross-modifying SMC: in a multiprocessor system, one processor overwrites instructions that are currently in the pipeline on some other core. The x86 standard is ambiguous here; technically no pipeline flush and invalidate is required; instead, the cache coherence mechanism and software mutexes are expected to prevent this case.
External SMC: an external device uses direct memory access (DMA) to overwrite the physical DRAM page containing previously translated code. In theory, this can happen while the affected instructions are in the pipeline, but in practice no operating system would ever allow this. However, we still must invalidate any translations on the target page to prevent them from being looked up far in the future.

To deal with all these forms of SMC, PTLsim associates a ``dirty'' bit with every physical page (this is unrelated to the ``dirty'' bit in user-visible page table entries). Whenever the first uop in an x86 instruction (i.e. the ``SOM'', start-of-macro-op uop) commits, the current context is used to translate its RIP into the physical page MFN on which it resides, as described in Section 6.2. If the instruction's length in bytes causes it to overlap onto a second page, that high MFN is also looked up (using the virtual address rip + 4096). If the dirty bits for either the low or high MFN are set, this means the instruction bytes may have been modified sometime after the time they were last translated and added to the basic block cache. In this case, the pipeline must be flushed, and all basic blocks on the target MFN (and possibly the overlapping high MFN) must be invalidated before clearing the dirty bit. Technically the RIP-to-physical translation would be done in the instruction fetch stage in most core models, then simply stored as an RIPVirtPhys structure inside the uop until commit time.

The dirty bit can be set by several events. Obviously any store uops will set the dirty bit (thus handling the classical, indirect and cross-modifying cases), but notice that this bit is not checked again until the first uop in the next x86 instruction. This behavior is required because it is perfectly legal for an x86 store to overwrite its own instruction bytes, but this does not become visible until the same instruction executes a second time (otherwise an infinite loop of invalidations would occur). Microcoded x86 instructions implemented by PTLsim itself set dirty bits when their constituent internal stores commit. Finally, DMA transfers and external writes also set the dirty bit of any pages touched by the DMA operation.

The dirty bit is only cleared when all translated basic blocks are invalidated on a given page, and it remains clear until the first write to that page. However, no action is taken when additional basic blocks are decoded from a page already marked as dirty. This may seem counterintuitive, but it is necessary to avoid deadlock: if the page were invalidated and retranslated at fetch time, future stages in a long pipeline could potentially still have references to unrelated basic blocks on the page being invalidated. Hence, all invalidations are checked and processed only at commit time.

Other binary translation based software and hardware [17,12,10,13,14] have special mechanisms for write protecting physical pages, such that when a page with translations is first written by stores or DMA, the system immediately invalidates all translations on that page. Unfortunately, this scheme has a number of disadvantages. First, patents cover its implementation [19,18,17], which we would like to avoid. In addition, our design eliminates forced invalidations when the kernel frees up a page containing code that's immediately overwritten with normal user data (a very common pattern according to our studies). If that page is never executed again, any translations from it will be discarded in the background by the LRU mechanism, rather than interrupting execution to invalidate translations that will never be used again anyway. Fortunately, true classical SMC is very rare in modern x86 code, in large part because major microprocessors have slapped a huge penalty on its use (particularly in the case of the Pentium 4 and Transmeta processors, both of which store translated uops in a cache similar to PTLsim's basic block cache).

Memory Management of the Basic Block Cache

The PTLsim memory manager (in mm.cpp, see Section 7.3 for details) implements a reclaim mechanism in which other subsystems register functions that get called when an allocation fails. The basic block cache registers a callback, bbcache_reclaim() and BasicBlockCache::reclaim(), to invalidate and free basic blocks when PTLsim runs out of memory.

The algorithm used to do this is a pseudo-LRU design. Every basic block has a lastused field that gets updated with the current cycle number whenever BasicBlock::use(sim_cycle) is called (for instance, in the fetch stage of a core model). The reclaim algorithm goes through all basic blocks and calculates the oldest, average and newest lastused cycles. The second pass then invalidates any basic blocks that fall below this average cycle; typically around half of all basic blocks fall in the least recently used category. This strategy has proven very effective in freeing up a large amount of space without discarding currently hot basic blocks.

Each basic block also has a reference counter, refcount, to record how many pointers or references to that basic block currently exist anywhere inside PTLsim (especially in the pipelines of core models). The BasicBlock::acquire() and release() methods adjust this counter. Core models should acquire a basic block once for every uop in the pipeline within that basic block; the basic block is released as uops commit or are annulled. Since basic blocks may be speculatively translated in the fetch stage of core models, this guarantees that live basic blocks currently in flight are never freed until they actually leave the pipeline.

Next: PTLsim Support Subsystems Up: PTLsim User's Guide Previous: x86 Instructions and Micro-Ops Contents

Matt T Yourst 2007-09-26