

Commitment

Introduction

The commit stage examines uops from the head of the ROB, blocks until all uops comprising a given x86 instruction are ready to commit, commits the results of those uops to the architectural state and finally frees the resources associated with each uop.

Atomicity of x86 instructions

The x86 architecture specifies atomic execution for all distinct x86 instructions. Since each x86 instruction may comprise multiple uops, none of these uops may commit until all uops in the instruction are ready to commit. In PTLsim, this is accomplished by checking if the uop at the head of the ROB (next to commit) has its SOM (start of macro-op) bit set. If so, the ROB is scanned forwards from the SOM uop to the next uop in program order with its EOM (end of macro-op) bit set. If all uops in this range are ready to commit and exception-free, the SOM uop is allowed to commit, effectively unlocking the ROB head pointer until the next uop with a SOM bit set is encountered. However, any exception in any uop comprising the x86 instruction at the head of the ROB causes the pipeline to be flushed and an exception to be taken. Similarly, external interrupts are only acknowledged at the boundary between x86 instructions (i.e. after the EOM uop of each instruction).
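
The SOM-to-EOM scan described above can be sketched as follows. This is a minimal illustration, not PTLsim's actual code: the Uop structure, the CommitCheck enum and check_instruction() are hypothetical names, and the ROB is modeled as a flat vector in program order rather than a circular buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified view of a ROB entry (not PTLsim's real structure).
struct Uop {
    bool som;        // start of macro-op: first uop of an x86 instruction
    bool eom;        // end of macro-op: last uop of an x86 instruction
    bool ready;      // result is available, uop could commit
    bool exception;  // uop raised an exception
};

enum class CommitCheck { Ok, NotReady, Exception };

// Scan from the SOM uop at `head` up to and including the next EOM uop and
// decide whether the whole x86 instruction may commit atomically.
CommitCheck check_instruction(const std::vector<Uop>& rob, std::size_t head) {
    assert(head < rob.size() && rob[head].som);
    for (std::size_t i = head; i < rob.size(); ++i) {
        if (rob[i].exception) return CommitCheck::Exception;  // flush, take exception
        if (!rob[i].ready) return CommitCheck::NotReady;      // block the commit stage
        if (rob[i].eom) return CommitCheck::Ok;               // every uop is ready
    }
    // The EOM uop has not been allocated into the ROB yet.
    return CommitCheck::NotReady;
}
```

Only when the result is Ok would the commit stage start retiring uops from the head; in the other two cases the instruction's uops stay in place or the pipeline is flushed.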

Commitment

As each uop commits, it may update several components of the architectural state.

Integer ALU and floating point uops update their destination architectural register (rd). In PTLsim, this is done by simply updating the committed register rename table (commitrrt) rather than actually copying register values. Once the Commit RRT mapping for rd is overwritten with the committing uop's physical register index, the old physical register previously mapped to rd normally becomes inaccessible and can then be freed. Technically, physical registers allocated to intermediate uops (such as those used to hold temporary values) could be freed immediately without updating any Commit RRT entries, but for consistency we do not do this.

In PTLsim, a physical register is freed by moving it to the PHYSREG_FREE state. Unfortunately, for various reasons related to long pipelines and the renaming of x86 flags, register reclamation is not quite this simple; the complications are discussed in Section 24.5.

Some uops may also commit to a subset of the x86 flags, as specified in the uop encoding. For these uops, in theory no rename tables need updating, since the flags can be directly masked into the REG_flags architectural pseudo-register. Should the pipeline be flushed, the rename table entries for the ZAPS, CF, OF flag sets will all be reset to point to the REG_flags pseudo-register anyway. However, for the speculation recovery scheme described in Section 20.3.2, the REG_zf, REG_cf, and REG_of commit RRT entries are updated as well to match the updates done to the speculative RRT.
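
As a rough illustration of masking a subset of flags into the architectural flags register, the sketch below merges only the flag sets a uop declares. The bit positions are the standard x86 EFLAGS positions; the SetFlags encoding and the commit_flags() helper are assumptions made for this example and may not match the uop encoding PTLsim actually uses.

```cpp
#include <cassert>
#include <cstdint>

// Standard x86 EFLAGS bit positions.
constexpr std::uint64_t FLAG_CF = 1u << 0;
constexpr std::uint64_t FLAG_PF = 1u << 2;
constexpr std::uint64_t FLAG_AF = 1u << 4;
constexpr std::uint64_t FLAG_ZF = 1u << 6;
constexpr std::uint64_t FLAG_SF = 1u << 7;
constexpr std::uint64_t FLAG_OF = 1u << 11;
constexpr std::uint64_t ZAPS_MASK = FLAG_ZF | FLAG_AF | FLAG_PF | FLAG_SF;

// Hypothetical encoding of which flag sets a uop commits.
enum SetFlags : unsigned { SET_ZAPS = 1, SET_CF = 2, SET_OF = 4 };

// Merge only the flag sets named in `setflags` from `uop_flags` into the
// architectural flags value; all other bits keep their old contents.
std::uint64_t commit_flags(std::uint64_t arch_flags, std::uint64_t uop_flags,
                           unsigned setflags) {
    assert((setflags & ~7u) == 0);
    std::uint64_t mask = 0;
    if (setflags & SET_ZAPS) mask |= ZAPS_MASK;
    if (setflags & SET_CF) mask |= FLAG_CF;
    if (setflags & SET_OF) mask |= FLAG_OF;
    return (arch_flags & ~mask) | (uop_flags & mask);
}
```

A uop that sets no flags passes zero for setflags, which leaves the architectural flags untouched.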

Branches and jumps update the REG_rip pseudo architectural register, while all other uops simply increment REG_rip by the number of bytes in the x86 instruction being committed. The number of bytes (1-15) is stored in a 4-bit bytes field of each uop in each x86 instruction.

Stores commit to the architectural state by writing directly to the data cache, which in PTLsim is equivalent to writing into real physical memory. Remember that a series of stores into a given 64-bit chunk of memory are merged within the store queue to the store uop's corresponding STQ entry as the store uop issues, so the commit unit always writes 64 bits to the cache at a time. The byte mask associated with the STQ entry of the store uop is used to only update the modified bytes in each chunk of memory in program order.
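
The byte-masked 64-bit write can be sketched as below. The function names and the flat vector standing in for physical memory are hypothetical; the sketch only shows how an 8-bit byte mask selects which bytes of an aligned 64-bit chunk are replaced.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Replace only the bytes of `old_chunk` selected by `bytemask` (bit i selects
// byte i, least significant byte first) with the matching bytes of `data`.
std::uint64_t merge_store(std::uint64_t old_chunk, std::uint64_t data,
                          std::uint8_t bytemask) {
    std::uint64_t bitmask = 0;
    for (int i = 0; i < 8; ++i) {
        if (bytemask & (1u << i)) bitmask |= 0xFFull << (8 * i);
    }
    return (old_chunk & ~bitmask) | (data & bitmask);
}

// Commit one STQ entry to a toy memory made of 64-bit chunks.
void commit_store(std::vector<std::uint64_t>& memory, std::uint64_t addr,
                  std::uint64_t data, std::uint8_t bytemask) {
    assert(addr % 8 == 0 && addr / 8 < memory.size());
    memory[addr / 8] = merge_store(memory[addr / 8], data, bytemask);
}
```

Because stores commit in program order, applying each entry's mask in turn leaves the chunk holding the most recent value of every byte.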

Additional Commit Actions for Full System Use

In full system PTLsim/X, several additional actions must be taken at commit time:

Self modifying code checks must be done using smc_isdirty(mfn), as described in Section 6.4.
Stores must set the dirty bit on the target physical page, using the smc_setdirty(mfn) function (so as to properly notify subsequent instructions of self modifying code).
The x86 page table accessed and dirty bits must be updated whenever a load or store commits, using the Context.update_pte_acc_dirty() function.
If an interrupt is pending, and we have just committed the last uop in an atomic x86 instruction, we can now safely service it.
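
The interaction between the first two items can be pictured with a toy dirty map keyed by machine frame number (mfn). The function names smc_setdirty() and smc_isdirty() are taken from the text, but the DirtyMap class, the CommitAction enum and the two helper functions are hypothetical stand-ins; the real mechanism is the one described in Section 6.4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy per-page dirty tracking; one bit per machine frame number.
class DirtyMap {
    std::vector<bool> dirty_;
public:
    explicit DirtyMap(std::size_t pages) : dirty_(pages, false) {}
    void smc_setdirty(std::uint64_t mfn) {
        assert(mfn < dirty_.size());
        dirty_[mfn] = true;
    }
    bool smc_isdirty(std::uint64_t mfn) const {
        assert(mfn < dirty_.size());
        return dirty_[mfn];
    }
};

enum class CommitAction { Proceed, FlushForSmc };

// A committing store marks its target page as modified.
void commit_store_page(DirtyMap& map, std::uint64_t data_mfn) {
    map.smc_setdirty(data_mfn);
}

// Before committing an instruction, check whether the page holding its code
// has been written; if so the stale translation must not be committed.
CommitAction check_code_page(const DirtyMap& map, std::uint64_t code_mfn) {
    return map.smc_isdirty(code_mfn) ? CommitAction::FlushForSmc
                                     : CommitAction::Proceed;
}
```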

Physical Register Recycling Complications

Problem Scenarios

In some processor designs, it is not always possible to immediately free the physical register mapped to a given architectural register when that old architectural register mapping is overwritten during commit as described above. Out of order x86 processors must maintain three separate rename table entries for the ZAPS, CF and OF flags in addition to the register rename table entry, any or all of which may be updated when uops rename and retire, depending on the uop's flag renaming semantics (see Section 5.4). For this reason, even though a given physical register value may become inaccessible and hence dead at commit time, the flags associated with that physical register are frequently still referenced within the pipeline, so the physical register itself must remain allocated.

Consider the following specific example, with uops listed in program order:

sub rax = rax,rbx
    Assign RRT[rax] = phys reg r0
    Assign RRT[flags] = r0 (since SUB updates all flags)
mov rax = rcx
    Assign RRT[rax] = phys reg r1
    No flags renamed: MOV never updates flags, so RRT[flags] is still r0.
br.e target
    Depends on the flags attached to r0, even though the actual architectural register (rax) for r0 has already been overwritten in the commit RRT by the MOV's commit. We cannot free r0 since the BR uop might not have issued yet.

This situation only happens with instruction sets like x86 (and SPARC or even PowerPC to some extent) which support writing flags (particularly multiple independent flags) and data in a single instruction.

Reference Counting

For these reasons, we need to prevent a physical register (r0 in the example above) from being freed while it is still referenced by anything in the pipeline; the normal reorder buffer mechanism cannot always handle this situation in a very long pipeline.

One solution (the one used by PTLsim) is to give each physical register a reference counter. Physical registers can be referenced from three structures: as operands of ROB entries, from the speculative RRT, and from the committed RRT. As each uop operand is renamed, the counter for the corresponding physical register is incremented by calling the PhysicalRegister::addref() method. As each uop commits, the counter for each of its operands is decremented via the PhysicalRegister::unref() method. Similarly, unref() and addref() are used whenever an entry in the speculative RRT or commit RRT is updated. During mis-speculation recovery (see Section 20.3.2), unref() is also used to unlock the operands of uops slated for annulment. Finally, unref() and addref() are used when loads and stores need to add a new dependency on a waiting store queue entry (see Sections 21 and 22.2).

As we update the committed RRT during the commit stage, the old register R mapped to the destination architectural register A of the uop being committed is examined. R is moved to the free state only if its reference counter is zero; otherwise, it is moved to the pendingfree state. Every cycle, the hardware examines the counters of pendingfree physical registers and moves each one to the free state once its counter reaches zero.
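
A minimal sketch of this free/pendingfree policy follows. The addref() and unref() names come from the text; everything else (the PhysReg and RegFile types, commit_dest(), reclaim(), and the initial one-to-one mapping) is hypothetical and much simpler than PTLsim's PhysicalRegister class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class RegState { Free, InUse, PendingFree };

struct PhysReg {
    RegState state = RegState::Free;
    int refcount = 0;
    void addref() { ++refcount; }
    void unref() { assert(refcount > 0); --refcount; }
};

struct RegFile {
    std::vector<PhysReg> regs;
    std::vector<std::size_t> commit_rrt;  // architectural reg -> physical reg

    // Assumption: architectural register a starts out mapped to physical a.
    RegFile(std::size_t nphys, std::size_t narch) : regs(nphys), commit_rrt(narch) {
        assert(narch <= nphys);
        for (std::size_t a = 0; a < narch; ++a) {
            commit_rrt[a] = a;
            regs[a].state = RegState::InUse;
            regs[a].addref();  // reference held by the commit RRT
        }
    }

    // Commit a uop whose destination architectural register `a` now lives in
    // physical register `p`; release or defer the previously mapped register.
    void commit_dest(std::size_t a, std::size_t p) {
        std::size_t old = commit_rrt[a];
        regs[p].addref();
        commit_rrt[a] = p;
        regs[old].unref();
        regs[old].state = (regs[old].refcount == 0) ? RegState::Free
                                                    : RegState::PendingFree;
    }

    // Per-cycle sweep: free pendingfree registers whose counter reached zero.
    void reclaim() {
        for (PhysReg& r : regs) {
            if (r.state == RegState::PendingFree && r.refcount == 0)
                r.state = RegState::Free;
        }
    }
};
```

In the SUB/MOV/BR example, committing the MOV calls commit_dest() for rax while the branch still holds a reference to r0, so r0 goes to pendingfree and is only reclaimed after the branch drops its reference.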

Hardware Implementation

The hardware implementation of this scheme is straightforward and low complexity. The counters can have a very small number of bits since it is very unlikely a given physical register would be referenced by all 100+ uops in the ROB; 3 bits should be enough to handle the typical maximum of < 8 uops sharing a given operand. Counter overflows can simply stall renaming or flush the pipeline since they are so rare.
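
A narrow counter with overflow detection might look like the sketch below. The 3-bit width follows the estimate in the text; the SmallRefCounter type and try_addref() are hypothetical, and the caller is assumed to stall renaming (or flush) whenever try_addref() reports that the counter is full.

```cpp
#include <cassert>

// Hypothetical 3-bit reference counter for one physical register.
struct SmallRefCounter {
    static constexpr unsigned kBits = 3;
    static constexpr unsigned kMax = (1u << kBits) - 1;  // largest value: 7
    unsigned count = 0;

    // Returns false instead of wrapping around when the counter is full;
    // the rename stage would stall until a reference is released.
    bool try_addref() {
        if (count == kMax) return false;
        ++count;
        return true;
    }

    void unref() {
        assert(count > 0);
        --count;
    }
};
```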

The counter table can be updated in bulk each cycle by adding/subtracting the appropriate sum or just adding zero if the corresponding register wasn't used. Since there are several stages between renaming and commit, the same counter is never both incremented and decremented in the same cycle, so race conditions are not an issue.

Among real processors, the Pentium 4 uses a scheme similar to this one but with bit vectors instead of counters. For smaller physical register files, this may be a better solution. Each physical register has a bit vector with one bit per ROB entry. If a given physical register P is still used by ROB entry E in the pipeline, bit E of P's bit vector is set. Register P cannot be freed until all bits in its vector are zero.
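
The bit vector alternative can be sketched with std::bitset. The ROB size of 128 and the PhysRegUsers type are assumptions for illustration only; nothing here is taken from an actual Pentium 4 description beyond what the text states.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

constexpr std::size_t kRobSize = 128;  // assumed ROB size for this sketch

// One bit per ROB entry, recording which entries still use this register.
struct PhysRegUsers {
    std::bitset<kRobSize> users;

    void mark_used_by(std::size_t rob_entry) {
        assert(rob_entry < kRobSize);
        users.set(rob_entry);
    }

    void release(std::size_t rob_entry) {
        assert(rob_entry < kRobSize);
        users.reset(rob_entry);
    }

    // The register may be freed only when no ROB entry references it.
    bool can_free() const { return users.none(); }
};
```

Compared with a counter, the vector cannot overflow, at the cost of one bit per ROB entry for every physical register.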

Pipeline Flushes and Barriers

In some cases, the entire pipeline must be empty after a given uop commits. For instance, a barrier uop, represented by any br.p (branch private) uop, will stall the frontend when first renamed, and when committed (at which point it is the only uop in the pipeline), it will call flush_pipeline() to restart fetching at the appropriate RIP. Exceptions have a similar effect when they reach the commit stage. After doing this, the current architectural registers must be copied into the externally visible ctx.commitarf[] array, since normally the architectural registers are scattered throughout the physical register file. Fortunately, the commit stage also updates ctx.commitarf[] in parallel with the commit RRT, even though the commitarf array is never actually read by the out of order core. Interrupts are a special case of barriers, the difference being that they can be serviced after any x86 instruction commits its last uop.

At this point, the handle_barrier(), handle_exception() or handle_interrupt() function is called to actually communicate with the world outside the out of order core. In the case of handle_barrier(), generally this involves executing native code inside PTLsim to redirect execution into or out of the kernel, or to service a very complex x86 instruction (e.g. cpuid, floating point save or restore, etc). For handle_exception(), on userspace-only PTLsim, the simulation is stopped and the user is notified that a genuine user visible (non-speculative) exception reached the commit stage. In contrast, on full system PTLsim/X, exceptions are little more than jumps into kernel space; this is described in detail in Chapter 14.

If execution can continue after handling the barrier or exception, the external_to_core_state() function is called to completely reset the out of order core using the state stored in ctx.commitarf[]. This involves allocating a fixed physical register for each of the 64 architectural registers in ctx.commitarf[], setting the speculative and committed rename tables to their proper cold start values, and resetting all reference counts on physical registers as appropriate. If the processor is configured with multiple physical register files (Section 18.3), the initial physical register for each architectural register is allocated in the first physical register file only (this is configurable by modifying external_to_core_state()). At this point, the main simulation loop can resume as if the processor had just restarted from scratch.
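
The cold start reset can be pictured roughly as follows. This is not the real external_to_core_state(): the CoreState type, the reset_from_commitarf() name, the choice of physical register a for architectural register a, and the initial reference count of two (one for each rename table) are all assumptions made for this sketch, and it models a single physical register file only.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kArchRegs = 64;  // architectural registers in commitarf[]

// Hypothetical, heavily simplified core state.
struct CoreState {
    std::vector<std::uint64_t> physreg;  // physical register values
    std::vector<int> refcount;           // one counter per physical register
    std::array<std::size_t, kArchRegs> spec_rrt{};
    std::array<std::size_t, kArchRegs> commit_rrt{};

    explicit CoreState(std::size_t nphys) : physreg(nphys, 0), refcount(nphys, 0) {
        assert(nphys >= kArchRegs);
    }
};

// Rebuild the core from the committed architectural state: give every
// architectural register a fixed physical register, point both rename
// tables at it, and reset all reference counts.
void reset_from_commitarf(CoreState& core,
                          const std::array<std::uint64_t, kArchRegs>& commitarf) {
    std::fill(core.refcount.begin(), core.refcount.end(), 0);
    for (std::size_t a = 0; a < kArchRegs; ++a) {
        core.physreg[a] = commitarf[a];
        core.spec_rrt[a] = a;
        core.commit_rrt[a] = a;
        core.refcount[a] = 2;  // referenced by speculative and commit RRT
    }
}
```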


Matt T Yourst 2007-09-26