next up previous contents
Next: PTLsim Support Subsystems Up: PTLsim User's Guide Previous: x86 Instructions and Micro-Ops   Contents

Subsections


Decoder Architecture and Basic Block Cache

Basic Block Cache

As described in Section 5.1, x86 instructions are decoded into transops prior to actual execution by the core. To achieve high performance, PTLsim maintains a basic block cache (BB cache) containing the program ordered translated uop (transop) sequence for previously decoded basic blocks in the program. Each basic block (BasicBlock structure) consists of up to 64 uops and is terminated by either a control flow operation (conditional, unconditional, indirect branch) or a barrier operation, i.e. a microcode assist (including system calls and serializing instructions).


Identifying Basic Blocks

In a userspace only simulator, the RIP of a basic block's entry point (plus a few other attributes described below) serves to uniquely identify that basic block, and can be used as a key in accessing the basic block cache. In a full system simulator, the BB cache must be indexed by much more than just the virtual address, because of potential virtual page aliasing and the need to persistently cache translations across context switches. The following fields, in the RIPVirtPhys structure, are required to correctly access the BB cache in any full system simulator or binary translation system (128 bits total):

The basic block cache is always indexed using an RIPVirtPhys structure instead of a simple RIP. To do this, the RIPVirtPhys.rip field is set to the desired RIP, then RIPVirtPhys.update(ctx) is called to translate the virtual address onto the two physical page MFNs it could potentially span (assuming the basic block crosses two pages).

Notice that the other attribute bits (use64, kernel, df) mean that two distinct basic blocks may be decoded from the exact same RIP on the same physical page(s), yet the uops in each translated basic block will be different because the two basic blocks were translated in a different context (relative to these attribute bits). This is especially important for x86 string move/compare/store/load/scan instructions (MOVSB, CMPSB, STOSB, LODSB, SCASB), since the correct increment constants depend on the state of the direction flag in the context in which the BB was used. Similarly, if a user program tries to decode a supervisor-only opcode, code to call the general protection fault handler will be produced instead of the real uops produced only in kernel mode.


Invalid Translations

The BasicBlockCache.translate(ctx, rvp) function always returns a BasicBlock object, even if the specified RIP was on an invalid page or some of the instruction bytes were invalid. When decoding cannot continue for some reason, the decoder simply outputs a microcode branch to one of the following assists:

Before redirecting execution to the kernel's exception handler, the EXEC_PAGE_FAULT microcode verifies that the page in question is still invalid. This avoids a spurious page fault in the case where an instruction was originally decoded on an invalid page, but the page tables were updated after the translation was first made such that the page is now valid. When this is the case, all bogus basic blocks on the page (which were decoded into a call to EXEC_PAGE_FAULT) must be invalidated, allowing a correct translation to be made now that the page is valid. The page at the virtual address after the page in question may also need to be invalidated in the case where some instruction bytes cross the page boundary.


Self Modifying Code

In x86 processors, the translation process is considerably more complex, because of self modifying code (SMC) and its variants. Specifically, the instruction bytes of basic blocks that have already been translated and cached may be overwritten; these old translations must be discarded. The x86 architecture guarantees that all code modifications will be visible immediately after the instruction making the modification; unlike other architectures, no ``instruction cache flush'' operation is provided. Several kinds of SMC must be handled correctly:

To deal with all these forms of SMC, PTLsim associates a ``dirty'' bit with every physical page (this is unrelated to the ``dirty'' bit in user-visible page table entries). Whenever the first uop in an x86 instruction (i.e. the ``SOM'', start-of-macro-op uop) commits, the current context is used to translate its RIP into the physical page MFN on which it resides, as described in Section 6.2. If the instruction's length in bytes causes it to overlap onto a second page, that high MFN is also looked up (using the virtual address rip + 4096). If the dirty bits for either the low or high MFN are set, this means the instruction bytes may have been modified sometime after the time they were last translated and added to the basic block cache. In this case, the pipeline must be flushed, and all basic blocks on the target MFN (and possibly the overlapping high MFN) must be invalidated before clearing the dirty bit. Technically the RIP-to-physical translation would be done in the instruction fetch stage in most core models, then simply stored as an RIPVirtPhys structure inside the uop until commit time.

The dirty bit can be set by several events. Obviously any store uops will set the dirty bit (thus handling the classical, indirect and cross-modifying cases), but notice that this bit is not checked again until the first uop in the next x86 instruction. This behavior is required because it is perfectly legal for an x86 store to overwrite its own instruction bytes, but this does not become visible until the same instruction executes a second time (otherwise an infinite loop of invalidations would occur). Microcoded x86 instructions implemented by PTLsim itself set dirty bits when their constituent internal stores commit. Finally, DMA transfers and external writes also set the dirty bit of any pages touched by the DMA operation.

The dirty bit is only cleared when all translated basic blocks are invalidated on a given page, and it remains clear until the first write to that page. However, no action is taken when additional basic blocks are decoded from a page already marked as dirty. This may seem counterintuitive, but it is necessary to avoid deadlock: if the page were invalidated and retranslated at fetch time, future stages in a long pipeline could potentially still have references to unrelated basic blocks on the page being invalidated. Hence, all invalidations are checked and processed only at commit time.

Other binary translation based software and hardware [17,12,10,13,14] have special mechanisms for write protecting physical pages, such that when a page with translations is first written by stores or DMA, the system immediately invalidates all translations on that page. Unfortunately, this scheme has a number of disadvantages. First, patents cover its implementation [19,18,17], which we would like to avoid. In addition, our design eliminates forced invalidations when the kernel frees up a page containing code that's immediately overwritten with normal user data (a very common pattern according to our studies). If that page is never executed again, any translations from it will be discarded in the background by the LRU mechanism, rather than interrupting execution to invalidate translations that will never be used again anyway. Fortunately, true classical SMC is very rare in modern x86 code, in large part because major microprocessors have slapped a huge penalty on its use (particularly in the case of the Pentium 4 and Transmeta processors, both of which store translated uops in a cache similar to PTLsim's basic block cache).


Memory Management of the Basic Block Cache

The PTLsim memory manager (in mm.cpp, see Section 7.3 for details) implements a reclaim mechanism in which other subsystems register functions that get called when an allocation fails. The basic block cache registers a callback, bbcache_reclaim() and BasicBlockCache::reclaim(), to invalidate and free basic blocks when PTLsim runs out of memory.

The algorithm used to do this is a pseudo-LRU design. Every basic block has a lastused field that gets updated with the current cycle number whenever BasicBlock::use(sim_cycle) is called (for instance, in the fetch stage of a core model). The reclaim algorithm goes through all basic blocks and calculates the oldest, average and newest lastused cycles. The second pass then invalidates any basic blocks that fall below this average cycle; typically around half of all basic blocks fall in the least recently used category. This strategy has proven very effective in freeing up a large amount of space without discarding currently hot basic blocks.

Each basic block also has a reference counter, refcount, to record how many pointers or references to that basic block currently exist anywhere inside PTLsim (especially in the pipelines of core models). The BasicBlock::acquire() and release() methods adjust this counter. Core models should acquire a basic block once for every uop in the pipeline within that basic block; the basic block is released as uops commit or are annulled. Since basic blocks may be speculatively translated in the fetch stage of core models, this guarantees that live basic blocks currently in flight are never freed until they actually leave the pipeline.


next up previous contents
Next: PTLsim Support Subsystems Up: PTLsim User's Guide Previous: x86 Instructions and Micro-Ops   Contents
Matt T Yourst 2007-09-26