Next: Load Issue Up: Out of Order Processor Previous: Scheduling, Dispatch and Issue Contents

Subsections

Speculation and Recovery

Misspeculation Cases

PTLsim supports three speculative execution recovery mechanisms to handle various types of speculation failures:

Replay is for scheduling and dependency mis-predictions only. Replayed uops remain in the issue queue so replay is very fast but limited in scope. Replay is described extensively in Section 19.
Redispatch finds the slice of uops in the ROB dependent on a mis-speculated uop and sends only those dependent uops back to the ready-to-dispatch state. It is used for load-store aliasing recovery, value mispredictions and other cases where the fetched uops themselves are still valid, but their outputs are invalid.
Annulment removes any uops in program order after (or optionally including) a given uop. It is used for branch mispredictions and misalignment recovery.

Redispatch

Redispatch Process

Many types of mis-speculations do not require refetching a different set of uops; instead, any uops dependent on a mis-speculated uop can simply be recirculated through the pipeline so they can re-execute and produce correct values. This process is known as redispatch; in the baseline out of order core, it is used to recover from load-store aliasing (Section 22.2.1).

When a mis-speculated ROB is detected, ROB.redispatch_dependents() is called. This function identifies the slice of uops that consumed values (directly or indirectly) from the mis-speculated uop, using dependency bitmaps similar to those used in real processors. ROB.redispatch_dependents(bool inclusive) has an inclusive parameter: if false, only the dependent uops are redispatched, not including the mis-speculated uop. This is most useful for value prediction, where the correct value can be directly reinjected into the mis-speculated uop's physical register without re-executing it.

In ROB.redispatch(), each affected uop is placed back into the rob_ready_to_dispatchstate, lways in program order. This helps to avoid deadlocks, since the redispatched slice is given priority for insertion back into the issue queue. The resources associated with each uop (physical register, LDQ/STQ slot, IQ slot, etc.) are also restored to the state they were in immediately after renaming, so they can be properly recirculated through the pipeline as if the uop never issued. Various other issues must also be handled, such as making sure known store-to-load aliasing constraints are preserved across the redispatch so as to avoid infinite replay loops, and branch directions must be corrected if a mispredict caused a fetch unit redirection but that mispredict was in fact based on mis-speculated data.

Deadlock Recovery

Redispatch can create deadlocks in cases where other unrelated uops occupy all the issue queue slots needed by the redispatched uops to make forward progress, and there is a circular dependency loop (e.g. on loads and stores not known at the time of the redispatch) that creates a chicken-and-egg problem, thus blocking forward progress.

To recover from this situation, we detect the case where no uops have been dispatched for 64 cycles, yet the ready_to_dispatch queue still has valid uops. This situation very rarely happens in practice unless there is a true deadlock. To break up the deadlock, ideally we should only need to redispatch all uops occupying issue queue slots or those already waiting for dispatch - all others have produced a result and cannot block the issue queues again. However, this does not always work in pathological cases, and can sometime lead to repeated deadlocks. Since deadlocks are very infrequent, they can be resolved by just flushing the entire pipeline. This has a negligible impact on performance.

Statistical Counters

Several statistical counters are maintained in the PTLsim statistics tree to measure redispatch overhead, in the ooocore.dispatch.redispatch node:

deadlock-flushes measures how many times the pipeline must be flushed to resolve a deadlock.
trigger-uops measures how many uops triggered redispatching because of a misspeculation. This number does not count towards the statistics below.
dependent-uops is a histogram of how many uops depended on each trigger uop, not including the trigger uop itself.

Annulment

Branch Mispredictions

Branch mispredictions form the bulk of all mis-speculated operations. Whenever the actual RIP returned by a branch uop differs from the riptaken field of the uop, the branch has been mispredicted. This means all uops after (but not including) the branch must be annulled and removed from all processor structures. The fetch queue (Section 17.1) is then reset and fetching is redirected to the correct branch target. However, all uops in program order before the branch are still correct and may continue executing.

Note that we do not just reissue the branch: this would be pointless, as we already know the correct RIP since the branch uop itself has already executed once. Instead, we let it writeback and commit as if it were predicted correctly.

Annulment Process

In PTLsim, the ReorderBufferEntry::annul() method removes any and all ROBs that entered the pipeline after and optionally including the misspeculated uop (depending on the keep_misspec_uop argument). Because this method moves all affected ROBs to the free state, they are instantly taken out of consideration for future pipeline stages and will be dropped on the next cycle.

We must be extremely careful to annul all uops in an x86 macro-op; otherwise half the x86 instruction could be executed twice once refetched. Therefore, if the first uop to annul is not also the first uop in the x86 macro-op, we may have to scan backwards in the ROB until we find the first uop of the macro-op. In this way, we ensure that we can annul the entire macro-op. All uops comprising the macro-op are guaranteed to still be in the ROB since none of the uops can commit until the entire macro-op can commit. Note that this does not apply if the final uop in the macro-op is a branch and that branch uop itself is being retained as occurs with mispredicted branches.

The first uop to annul is determined in the annul() method by scanning backwards in time from the excepting uop until a uop with its SOM (start of macro-op) bit is set, as described in Section 5.1. This SOM uop represents the boundary between x86 instructions, and is where we start annulment. The end of the range of uops to annul is at the tail of the reorder buffer.

We have to reconstruct the speculative RRT as it existed just before the first uop to be annulled was renamed. This is done by calling the pseudocommit() method of each annulled uop to implement the ``fast flush with pseudo-commit'' algorithm as follows. First, we overwrite the speculative RRT with the committed RRT. We then simulate the commitment of all non-speculative ROBs up to the first uop to be annulled by updating the speculative RRT as if it were the commit RRT. This brings the speculative RRT to the same state as if all in flight nonspeculative operations before the first uop to be annulled had actually committed. Fetching is then resumed at the correct RIP, where new uops are renamed using the recovered speculative RRT.

Other methods of RRT reconstruction (like backwards walk with saved checkpoint values) are difficult to impossible because of the requirement that flag rename tables be restored even if some of the required physical registers with attached flags have since been freed. Technically RRT checkpointing could be used but due to the load/store replay mechanism in use, this would require a checkpoint at every load and store as well as branches. Hence, the forward walk method seems to offer the best performance in practice and is quite simple. The Pentium 4 is believed to use a similar method of recovering from some types of mis-speculations.

After reconstructing the RRT, for each ROB to annul, we broadcast the ROB index to the appropriate cluster's issue queue, allowing the issue queue to purge the slot of the ROB being annulled. Finally, for each annulled uop, we free any resources allocated to it (i.e., the ROB itself, the destination physical register, the load/store queue entry (if any) and so on. Any updates to the branch predictor and return address stack made during the speculative execution of branches are also rolled back.

Finally, the fetch unit is restarted at the correct RIP and uops enter the pipeline and are renamed according to the recovered rename tables and allocated resource maps.

Next: Load Issue Up: Out of Order Processor Previous: Scheduling, Dispatch and Issue Contents

Matt T Yourst 2007-09-26