x86 Instructions and Micro-Ops (uops)

Micro-Ops (uops) and TransOps

PTLsim presents to the target code a full implementation of the x86 and x86-64 instruction set (both 32-bit and 64-bit modes), including most user and kernel level instructions supported by the Intel Pentium 4 and AMD K8 microprocessors (i.e. all standard instructions, SSE/SSE2, x86-64 and most of x87 FP). At the present stage of development, the vast majority of all userspace and 32-bit/64-bit privileged instructions are supported.

The x86 instruction set is based on the two-operand CISC concept of load-and-compute and load-compute-store. However, no modern x86 processor (PTLsim included) directly executes complex x86 instructions. Instead, these processors translate each x86 instruction into a series of micro-operations (uops) very similar to classical load-store RISC instructions. Unlike x86 instructions, uops can be executed very efficiently on an out of order core. In PTLsim, uops have three source registers and one destination register. They may generate a 64-bit result and various x86 status flags, or may be loads, stores or branches.

The x86 instruction decoding process initially generates translated uops (transops), which have a slightly different structure than the true uops used in the processor core. Specifically, sources and destinations are represented as un-renamed architectural registers (or special temporary register numbers), and each transop carries additional information that is needed only during renaming and retirement. TransOps (represented by the TransOp structure) consist of the following fields:

som: Start of Macro-Op. Since x86 instructions may consist of multiple transops, the first transop in the sequence has its som bit set to indicate this.
eom: End of Macro-Op. This bit is set for the last transop in a given x86 instruction (which may also be the first transop, for single-uop instructions).
bytes: Number of bytes in the corresponding x86 instruction (1-15). The same bytes field value is present in all uops comprising an x86 instruction.
opcode: the uop (not x86) opcode
size: the effective operation size (0-3, for 1/2/4/8 bytes)
cond: the x86 condition code for branches, selects, sets, etc. For loads and stores, this field is reused to specify unaligned access information as described later.
setflags: subset of the x86 flags set by this uop (see Section 5.4)
internal: set for certain microcode operations. For instance, loads and stores marked internal access on-chip registers or buffers invisible to x86 code (e.g. machine state registers, segmentation caches, floating point constant tables, etc).
rd, ra, rb, rc: the architectural source and destination registers (see Section 18.4.1)
extshift: shift amount (0-3 bits) used for shifted adds (x86 memory addressing and LEA). The rc operand is shifted left by this amount.
cachelevel: used for prefetching and non-temporal loads and stores
rbimm and rcimm: signed 64-bit immediates for the rb and rc operands. These are selected by specifying the special constant REG_imm in the rb and rc fields, respectively.
riptaken: for branches only, the 64-bit target RIP of the branch if it were taken.
ripseq: for branches only, the 64-bit sequential RIP of the branch if it were not taken.
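
The field list above can be pictured as a plain C++ structure. The following is only an illustrative sketch: the names TransOpSketch and REG_imm_sketch, the field widths and the field ordering are assumptions made for this example, and the authoritative definition remains the TransOp structure in the PTLsim source.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: field names follow the list above, but the widths,
// ordering and the REG_imm_sketch value are assumptions, not PTLsim's layout.
constexpr uint8_t REG_imm_sketch = 0xff;  // hypothetical stand-in for REG_imm

struct TransOpSketch {
  bool     som = false, eom = false;      // start / end of macro-op
  uint8_t  bytes = 0;                     // x86 instruction length (1-15)
  uint8_t  opcode = 0;                    // uop (not x86) opcode
  uint8_t  size = 0;                      // 0-3 for 1/2/4/8 bytes
  uint8_t  cond = 0;                      // condition code or unaligned info
  uint8_t  setflags = 0;                  // subset of flags set by this uop
  bool     internal = false;              // microcode-only operation
  uint8_t  rd = 0, ra = 0, rb = 0, rc = 0;
  uint8_t  extshift = 0;                  // rc is shifted left by 0-3 bits
  uint8_t  cachelevel = 0;
  int64_t  rbimm = 0, rcimm = 0;          // used when rb / rc name REG_imm
  uint64_t riptaken = 0, ripseq = 0;      // branches only
};

// Operation size in bytes encoded by the 2-bit size field.
inline unsigned size_in_bytes(const TransOpSketch& op) {
  assert(op.size <= 3);
  return 1u << op.size;
}

// The rb operand is an immediate when its register field names REG_imm.
inline bool rb_is_immediate(const TransOpSketch& op) {
  return op.rb == REG_imm_sketch;
}
```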

Appendix 27 describes the semantics and encoding of all uops supported by the PTLsim processor model. The following is an overview of the common features of these uops and how they are used to synthesize specific x86 instructions.

Load-Execute-Store Operations

Simple integer and floating point operations are fairly straightforward to decode into loads, stores and ALU operations; a typical load-op-store ALU operation will consist of a load to fetch one operand, the ALU operation itself, and a store to write the result. The instruction set also implements a number of important but complex instructions with bizarre semantics; typically the translator will synthesize and inject into the uop stream up to 8 uops for more complex instructions.
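
As a rough illustration of the load-op-store pattern, the sketch below executes an x86-style ``add [addr],reg'' as three RISC-like steps on a toy machine. ToyMachine, its temporary register and add_mem_reg are hypothetical names invented for this example; they do not correspond to PTLsim data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical toy machine: sparse memory, a few registers and one
// temporary register that is invisible to x86 code.
struct ToyMachine {
  std::map<uint64_t, uint64_t> mem;
  uint64_t regs[4] = {0, 0, 0, 0};
  uint64_t temp = 0;
};

// "add [addr], reg" expressed as a load, an ALU operation and a store.
inline void add_mem_reg(ToyMachine& m, uint64_t addr, unsigned reg) {
  assert(reg < 4);
  m.temp = m.mem[addr];            // load:  temp = [addr]
  m.temp = m.temp + m.regs[reg];   // ALU:   temp = temp + reg
  m.mem[addr] = m.temp;            // store: [addr] = temp
}
```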

Operation Sizes

Most x86-64 instructions can operate on 8, 16, 32 or 64 bits of a given register. For 8-bit and 16-bit operations, only the low 8 or 16 bits of the destination register are actually updated; 32-bit and 64-bit operations are zero extended as with RISC architectures. As a result, a dependency on the old destination register may be introduced so merging can be performed. Fortunately, since x86 features destructive overwrites of the destination register (i.e. the rd and ra operands are the same), the ra operand is generally already a dependency. Thus, the PTLsim uop encoding reserves 2 bits to specify the operation size; the low bits of the new result are automatically merged with the old destination value (in ra) as part of the ALU logic. This applies to the mov uop as well, allowing operations like ``mov al,bl'' in one uop. Loads do not support this mode, so loads into 8-bit and 16-bit registers must be followed by a separate mov uop to truncate and merge the loaded value into the old destination properly. Fortunately this is not necessary when the load-execute form is used with 8-bit and 16-bit operations.
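
The merging rule described above can be sketched as a small helper. The function name merge_result is an assumption made for illustration; only the behavior (merge for 8-bit and 16-bit sizes, zero extension for 32-bit, full overwrite for 64-bit) is taken from the text.

```cpp
#include <cassert>
#include <cstdint>

// Combine a new ALU result with the old destination value (the ra operand).
// size uses the 2-bit encoding from the text: 0-3 for 1/2/4/8 bytes.
inline uint64_t merge_result(uint64_t old_dest, uint64_t result, unsigned size) {
  assert(size <= 3);
  switch (size) {
    case 0:  return (old_dest & ~0xffULL)   | (result & 0xffULL);    // 8-bit merge
    case 1:  return (old_dest & ~0xffffULL) | (result & 0xffffULL);  // 16-bit merge
    case 2:  return result & 0xffffffffULL;                          // zero extend
    default: return result;                                          // full 64 bits
  }
}
```

For example, ``mov al,bl'' with rax = 0x1122334455667788 and bl = 0xAB leaves rax = 0x11223344556677AB, since only the low byte is replaced.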

The x86 ISA defines some bizarre byte operations as a carryover from the ancient 8086 architecture; for instance, it is possible to address the second byte of many integer registers as a separate register (i.e. as ah, bh, ch, dh). The mask uop is used for handling this rare but important set of operations.

Flags Management and Register Renaming

Many x86 arithmetic instructions modify some or all of the processor's numerous status and condition flag bits, but only 5 are relevant to normal execution: Zero, Parity, Sign, Overflow, Carry. In accordance with the well-known ``ZAPS rule'', any instruction that updates any of the Z/P/S flags updates all three of them, so in reality only three flag entities need to be tracked: ZAPS, C and O (``ZAPS'' also includes an Auxiliary flag not accessible by most modern user instructions; it is irrelevant to the discussion below).

The x86 flag update semantics can hamper out of order execution, so we use a simple and well known solution. The 5 flag bits are attached to each result and physical register (along with invalid and waiting bits used by some cores); these bits are then consumed along with the actual result value by any consumers that also need to access the flags. It should be noted that not all uops generate all the flags as well as a 64-bit result, and some uops only generate flags and no result data.

The register renaming mechanism is aware of these semantics, and tracks the latest x86 instruction in program order to update each set of flags (ZAPS, C, O); this allows branches and other flag consumers to directly access the result with the most recent program-ordered flag updates yet still allows full out of order scheduling. To do this, x86 processors maintain three separate rename table entries for the ZAPS, CF, OF flags in addition to the register rename table entry, any or all of which may be updated when uops are renamed. The TransOp structure for each uop has a 3-bit setflags field filled out during decoding in accordance with x86 semantics; the SETFLAG_ZF, SETFLAG_CF, SETFLAG_OF bits in this field determine which of the ZAPS, CF, OF flag subsets to rename.

As mentioned above, any consumer of the flags needs to consult at most three distinct sources: the last ZAPS producer, the Carry producer and the Overflow producer. This conveniently fits into PTLsim's three-operand uop semantics. Various special uops access the flags associated with an operand rather than the 64-bit operand data itself. Branches always take two flag sources, since in x86 this is enough to evaluate any possible condition code combination (the cond_code_to_flag_regs array provides this mapping).

Various ALU instructions consume only the flags part of a source physical register; these include addc (add with carry), rcl/rcr(rotate carry), sel.cc (select for conditional moves) and so on. Finally, the collcc uop takes three operands (the latest producer of the ZAPS, CF and OF flags) and merges the flag components of each operand into a single flag set as its result.
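
A minimal sketch of the flag renaming and flag collection described above follows. The FLAG_xx masks use the architectural x86 EFLAGS bit positions; the bit values assigned to SETFLAG_ZF, SETFLAG_CF and SETFLAG_OF, the FlagRenameTable structure and the free function collcc are assumptions made for this example rather than PTLsim's actual definitions.

```cpp
#include <cassert>
#include <cstdint>

// Bit positions within the 3-bit setflags field are assumed here.
constexpr unsigned SETFLAG_ZF = 1, SETFLAG_CF = 2, SETFLAG_OF = 4;

// Architectural x86 EFLAGS bit masks.
constexpr uint32_t FLAG_CF = 0x001, FLAG_PF = 0x004, FLAG_AF = 0x010,
                   FLAG_ZF = 0x040, FLAG_SF = 0x080, FLAG_OF = 0x800;
constexpr uint32_t FLAG_ZAPS = FLAG_ZF | FLAG_AF | FLAG_PF | FLAG_SF;

// Hypothetical rename state: the physical register that last produced
// each flag group (-1 means no producer yet).
struct FlagRenameTable { int zaps = -1, cf = -1, of = -1; };

// Called when a uop is renamed: only the groups it sets are redirected.
inline void rename_flags(FlagRenameTable& rt, unsigned setflags, int physreg) {
  if (setflags & SETFLAG_ZF) rt.zaps = physreg;
  if (setflags & SETFLAG_CF) rt.cf = physreg;
  if (setflags & SETFLAG_OF) rt.of = physreg;
}

// collcc-style merge: take each flag group from its own producer.
inline uint32_t collcc(uint32_t zaps_src, uint32_t cf_src, uint32_t of_src) {
  return (zaps_src & FLAG_ZAPS) | (cf_src & FLAG_CF) | (of_src & FLAG_OF);
}
```

For instance, an add followed by an inc leaves the carry entry pointing at the add's physical register, because inc updates the ZAPS and OF groups but not CF.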

PTLsim also provides compound compare-and-branch uops (br.sub.cc and br.and.cc); these are currently used mostly in microcode, but a core could dynamically merge CMP or TEST and Jcc instructions into these uops; this is exactly what the Intel Core 2 and a few research processors already do.

x86-64

The 64-bit x86-64 instruction set is a fairly straightforward extension of the 32-bit IA-32 (x86) instruction set. The x86-64 ISA was introduced by AMD in 2000 with its K8 microarchitecture; the same instructions were subsequently plagiarized by Intel under a different name (``EM64T'') several years later. In addition to extending all integer registers and ALU datapaths to 64 bits, x86-64 also provides a total of 16 integer general purpose registers and 16 SSE (vector floating and fixed point) registers. It also introduced several 64-bit address space simplifications, including RIP-relative addressing and corresponding new addressing modes, and eliminated a number of legacy features from 64-bit mode, including segmentation, BCD arithmetic, some byte register manipulation, etc. Limited forms of segmentation are still present to allow thread local storage and mark code segments as 64-bit. In general, the encodings of x86-64 and x86 are very similar, with 64-bit mode adding a one byte REX prefix to specify additional bits for source and destination register indexes and the 64-bit operand size. As a result, both variants can be decoded by similar decoding logic into a common set of uops.

Unaligned Loads and Stores

Compared to RISC architectures, the x86 architecture is infamous for its relatively widespread use of unaligned memory operations; any implementation must efficiently handle this scenario. Fortunately, analysis shows that unaligned accesses are rarely in the performance intensive parts of a modern program (with the exception of certain media processing algorithms). Once a given load or store is known to frequently have an unaligned address, it can be preemptively split into two aligned loads or stores at decode time. PTLsim does this by initially causing all unaligned loads and stores to raise an UnalignedAccess internal exception, forcing a pipeline flush. At this point, the special unaligned bit is set for the problem load or store uop in its translated basic block representation. The next time the offending uop is encountered, it will be split into two parts very early in the pipeline.

PTLsim includes special uops to handle loads and stores split into two in this manner. The ld.lo uop rounds its effective address $A$ down to the nearest 64-bit boundary $\left\lfloor A\right\rfloor$ and performs the load. The ld.hi uop rounds up to the following 64-bit boundary $\left\lfloor A\right\rfloor + 8$, performs another load, then takes as its third (rc) operand the result of the first (ld.lo) load. The two loads are concatenated into a 128-bit word and the final unaligned data is extracted. Stores are handled in a similar manner, with st.lo and st.hi rounding down and up to store parts of the unaligned value in adjacent 64-bit blocks. Depending on the core model, these unaligned load or store pairs access separate store buffers for each half as if they were independent.
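
The split load can be sketched as follows. This is an interpretation of the description above rather than PTLsim's code: the function names, the byte-array memory model and the little-endian extraction are assumptions, and the high half is taken to be the 64-bit block immediately following the one read by the low half.

```cpp
#include <cassert>
#include <cstdint>

// Little-endian 8-byte load from an aligned address in a flat byte array.
inline uint64_t load_aligned64(const uint8_t* mem, uint64_t addr) {
  assert((addr & 7) == 0);
  uint64_t v = 0;
  for (int i = 7; i >= 0; --i) v = (v << 8) | mem[addr + i];
  return v;
}

// Low half: round the address down to a 64-bit boundary and load.
inline uint64_t ld_lo(const uint8_t* mem, uint64_t addr) {
  return load_aligned64(mem, addr & ~7ULL);
}

// High half: load the next 64-bit block, concatenate it with the low
// half (passed in like the rc operand) and extract size_bytes of data.
inline uint64_t ld_hi(const uint8_t* mem, uint64_t addr,
                      unsigned size_bytes, uint64_t lo) {
  assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4 || size_bytes == 8);
  uint64_t hi = load_aligned64(mem, (addr & ~7ULL) + 8);
  unsigned shift = static_cast<unsigned>(addr & 7) * 8;
  // A shift of 64 would be undefined, so the aligned case is handled apart.
  uint64_t v = (shift == 0) ? lo : (lo >> shift) | (hi << (64 - shift));
  return (size_bytes == 8) ? v : v & ((1ULL << (size_bytes * 8)) - 1);
}
```

With memory holding the byte values 0, 1, 2, ... at addresses 0, 1, 2, ..., an 8-byte load at address 5 combines the blocks at addresses 0 and 8 and yields 0x0c0b0a0908070605.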

Repeated String Operations

The x86 architecture allows for repeated string operations, including block moves, stores, compares and scans. The iteration count of these repeated operations depends on a combination of the rcx register and the flags set by the repeated operation (e.g. compare). To translate these instructions, PTLsim treats the rep xxx instruction as a single basic block; any basic block in progress before the repeat instruction is terminated and the repeat is decoded as a separate basic block. To handle the unusual case where the repeat count is zero, a check uop (see below) is inserted at the top of the loop; PTLsim simply bypasses the offending block if the check fails.

Checks and SkipBlocks

PTLsim includes special uops (chk.and.cc, chk.sub.cc) that compare two values or condition codes and cause a special internal exception if the result is true. The SkipBlock internal exception generated by these uops tells the core to literally annul all uops in this instruction, dynamically turning it into a nop. As described above, this is useful for string operations where a zero count causes all of the instruction's side effects to be annulled. Similarly, the AssistCheck internal exception dynamically turns the instruction into an assist, for those cases where certain rare conditions may require microcode intervention more complex than can be inlined into the decoded instruction stream.
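
The zero-count case for repeated string operations can be sketched as a guard at the top of the block. The code below is a behavioral model only: StringState, Outcome and rep_movsb_block are hypothetical names, the direction flag is assumed clear, and a real core would raise the SkipBlock internal exception rather than return a value.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct StringState { uint64_t rcx, rsi, rdi; };
enum class Outcome { Executed, SkipBlock };

// Behavioral model of a "rep movsb" basic block (direction flag clear).
inline Outcome rep_movsb_block(std::vector<uint8_t>& mem, StringState& s) {
  // Check at the top of the loop: a zero count annuls the whole
  // instruction, leaving all architectural state untouched.
  if (s.rcx == 0) return Outcome::SkipBlock;
  do {
    assert(s.rsi < mem.size() && s.rdi < mem.size());
    mem[s.rdi++] = mem[s.rsi++];   // move one byte
    --s.rcx;                       // count down
  } while (s.rcx != 0);            // loop branch back to the block start
  return Outcome::Executed;
}
```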

Shifts and Rotates

The shift and rotate instructions have some of the most bizarre semantics in the entire x86 instruction set: they may or may not modify a subset of the flags depending on the rotation count operand, which we may not even know until the instruction issues. For fixed shifts and rotates, these semantics can be preserved by the uops generated; variable shifts and rotates, however, are more complex. The collcc uop is put to use here to collect all flags; the collected result is then fed into the shift or rotate uop as its rc operand; the uop then replicates the precise x86 behavior (including rotates using the carry flag) according to its input operands.
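
To make the count-dependent flag behavior concrete, here is a sketch of a 64-bit left shift that receives the previously collected flags as an extra input, in the spirit of the rc operand described above. The names shl64 and ShiftResult are assumptions; the flag rules follow the architectural definition of SHL, and for counts greater than one (where x86 leaves OF undefined) this sketch simply passes the incoming OF through.

```cpp
#include <cassert>
#include <cstdint>

// Architectural x86 EFLAGS bit masks.
constexpr uint32_t FLAG_CF = 0x001, FLAG_PF = 0x004, FLAG_ZF = 0x040,
                   FLAG_SF = 0x080, FLAG_OF = 0x800;

struct ShiftResult { uint64_t value; uint32_t flags; };

// True when the byte contains an even number of 1 bits (x86 PF rule).
inline bool even_parity(uint8_t b) {
  b ^= b >> 4; b ^= b >> 2; b ^= b >> 1;
  return (b & 1) == 0;
}

inline ShiftResult shl64(uint64_t a, unsigned count, uint32_t flags_in) {
  count &= 63;                                // 64-bit shifts mask the count to 6 bits
  if (count == 0) return {a, flags_in};       // zero count: flags are untouched
  uint64_t r = a << count;
  bool cf = (a >> (64 - count)) & 1;          // last bit shifted out
  uint32_t f = flags_in & ~(FLAG_CF | FLAG_PF | FLAG_ZF | FLAG_SF);
  if (cf) f |= FLAG_CF;
  if (r == 0) f |= FLAG_ZF;
  if (r >> 63) f |= FLAG_SF;
  if (even_parity(static_cast<uint8_t>(r))) f |= FLAG_PF;
  if (count == 1) {                           // OF is defined only for count 1
    f &= ~FLAG_OF;
    if (((r >> 63) & 1) != (cf ? 1u : 0u)) f |= FLAG_OF;
  }
  return {r, f};
}
```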

SSE Support

PTLsim provides full support for SSE and SSE2 vector floating point and fixed point, in both scalar and vector mode. As is done in the AMD K8 and Pentium 4, each SSE operation on a 128-bit vector is split into two 64-bit halves; each half (possibly consisting of a 64-bit load and one or more FPU operations) is scheduled independently. Because SSE instructions do not set flags like x86 integer instructions, architectural state management can be restricted to the 16 128-bit SSE registers (represented as 32 paired 64-bit registers). The mxcsr (media extensions control and status register) is represented as an internal register that is only read and written by serializing microcode; since the exception and status bits are ``sticky'' (i.e. only set, never cleared by hardware), this has no effect on out of order execution. The processor's floating point units can operate in either 64-bit IEEE double precision mode or on two parallel 32-bit single precision values.
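
The splitting of a 128-bit operation into two independent 64-bit halves can be sketched for a packed double precision add. The register numbering (xmmN mapped to entries 2N and 2N+1), the SseRegisterFile type and the use of double instead of raw 64-bit values are simplifying assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>

// 16 XMM registers modeled as 32 paired 64-bit halves.
// Assumed mapping: xmmN low half = entry 2N, high half = entry 2N+1.
struct SseRegisterFile {
  double half[32] = {};
  double& lo(unsigned xmm) { assert(xmm < 16); return half[2 * xmm]; }
  double& hi(unsigned xmm) { assert(xmm < 16); return half[2 * xmm + 1]; }
};

// One 64-bit floating point add on a single half register.
inline void add_half(SseRegisterFile& rf, unsigned dst, unsigned src) {
  assert(dst < 32 && src < 32);
  rf.half[dst] += rf.half[src];
}

// A packed add decomposed into two operations with no dependency
// between them, so each could be scheduled independently.
inline void addpd(SseRegisterFile& rf, unsigned xd, unsigned xs) {
  add_half(rf, 2 * xd,     2 * xs);      // low 64 bits
  add_half(rf, 2 * xd + 1, 2 * xs + 1);  // high 64 bits
}
```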

PTLsim also includes a variety of vector integer uops used to construct SSE2/MMX operations, including packed arithmetic and shuffles.

x87 Floating Point

The legacy x87 floating point architecture is the bane of all x86 processor vendors' existence, largely because its stack based nature makes out of order processing so difficult. While there are certainly ways of translating stack based instruction sets into flat addressing for scheduling purposes, we do not do this. Fortunately, following the Pentium III and AMD Athlon's introduction, x87 is rapidly headed for planned obsolescence; most major applications released within the last few years now use SSE instructions for their floating point needs either exclusively or in all performance critical parts. To this end, even Intel has relegated x87 support on the Pentium 4 and Core 2 to a separate low performance legacy unit, and AMD has restricted x87 use in 64-bit mode. For this reason, PTLsim translates legacy x87 instructions into a serialized, program ordered and emulated form; the hardware does not contain any x87-style 80-bit floating point registers (all floating point hardware is 32-bit and 64-bit IEEE compliant). We have noticed little to no performance problem from this approach when examining typical binaries, which rarely if ever still use x87 instructions in compute-intensive code.

Floating Point Unavailable Exceptions

The x86 architecture specifies a mode in which all floating point operations (SSE and x87) will trigger a Floating Point Unavailable exception (EXCEPTION_x86_fpu_not_avail, vector 0x7) if the TS (task switched) bit in control register CR0 is set. This allows the kernel to defer saving the floating point registers and state of the previously scheduled thread until that state is actually modified, thus speeding up context switches. PTLsim supports this feature by requiring any commits to the floating point state (SSE XMM registers, x87 registers or any floating point related control or status registers) to check the uop.is_sse and uop.is_x87 bits in the uop. If either of these is set while TS is set, the pipeline must be flushed and redirected into the kernel so it can save the FPU state.
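
The commit-time check can be sketched as a simple predicate. CR0.TS is bit 3 of CR0 and the exception uses vector 7, as the architecture defines; the UopSketch type and the function name are assumptions made for this example, with only the is_sse and is_x87 bits taken from the text.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t CR0_TS = 1ULL << 3;           // task switched bit in CR0
constexpr unsigned VECTOR_FPU_NOT_AVAIL = 0x7;   // Floating Point Unavailable

// Hypothetical stand-in for the two uop bits named in the text.
struct UopSketch { bool is_sse; bool is_x87; };

// True when committing this uop must instead flush the pipeline and
// deliver the Floating Point Unavailable exception to the kernel.
inline bool must_raise_fpu_not_avail(uint64_t cr0, const UopSketch& uop) {
  return (cr0 & CR0_TS) != 0 && (uop.is_sse || uop.is_x87);
}
```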

Assists

Some operations are too complex to inline directly into the uop stream. To perform these instructions, a special uop (brp: branch private) is executed to branch to an assist function implemented in microcode. In PTLsim, some assist functions are implemented as regular C/C++ or assembly language code when they interact with the rest of the virtual machine. Examples of instructions requiring assists include system calls, interrupts, some forms of integer division, handling of rare floating point conditions, CPUID, MSR reads/writes, various x87 operations, any serializing instructions, etc. These are listed in the ASSIST_xxx enum found in decode.h.

Prior to entering an assist, uops are generated to load the REG_selfrip and REG_nextrip internal registers with the RIP of the instruction itself and the RIP after its last byte, respectively. This lets the assist microcode correctly update RIP before returning, or signal a fault on the instruction if needed. Several other assist related registers, including REG_ar1, REG_ar2, REG_ar3, are used to store parameters passed to the assist. These registers are not architecturally visible, but must be renamed and separately maintained by the core as if they were part of the user-visible state.
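
As a small illustration of how the two RIP values relate, the sketch below derives them from an instruction's address and its bytes field, then shows the two ways an assist might finish. AssistRips, prepare_assist and finish_assist are hypothetical names; only the meaning of the selfrip and nextrip values is taken from the text.

```cpp
#include <cassert>
#include <cstdint>

// Values the decoder would place in REG_selfrip and REG_nextrip.
struct AssistRips { uint64_t selfrip, nextrip; };

inline AssistRips prepare_assist(uint64_t rip, unsigned bytes) {
  assert(bytes >= 1 && bytes <= 15);      // legal x86 instruction lengths
  return {rip, rip + bytes};
}

// On normal completion the assist resumes after the instruction; on a
// fault it reports the instruction's own address so it can be restarted.
inline uint64_t finish_assist(const AssistRips& r, bool fault) {
  return fault ? r.selfrip : r.nextrip;
}
```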

While the exact behavior depends on the core model (out of order, SMT, sequential, etc), generally when the processor fetches an assist (brp uop), the frontend pipeline is stalled and execution waits until the brp commits, at which point an assist function within PTLsim is called. This is necessary because assists are not subject to the out of order execution mechanism; they directly update the architectural registers on their own. In a real processor there are slightly more efficient ways of doing this without flushing the pipeline; in PTLsim, however, assists are sufficiently rare that the performance impact is negligible, and this approach significantly reduces complexity. For the out of order core, the exact mechanism used is described in Section 24.6.

Matt T Yourst 2007-09-26