The PTLsim out of order model can simulate an arbitrarily complex set of functional units grouped into clusters. Clusters are specified by the Cluster structure and are defined by the clusters[] array in ooocore.h. Each Cluster element defines the name of the cluster, which functional units belong to the cluster (fu_mask field), and the maximum number of uops that can be issued in that cluster each cycle (issue_width field).
The intercluster_latency_map matrix defines the forwarding latency, in cycles, between a given cluster and every other cluster. If intercluster_latency_map[A][B] is L cycles, this means that functional units in cluster B must wait L cycles after a uop U in cluster A completes before cluster B's functional units can issue a uop dependent on U's result. If the latency is zero between clusters A and B, producer and consumer uops in A and B can always be issued back to back in subsequent cycles. Hence, the diagonal of the forwarding latency matrix is always all zeros.
This clustering mechanism can be used to implement several features of modern microprocessors. First, traditional clustering is possible, in which it takes multiple additional cycles to forward results between different clusters (for instance, one or more integer clusters and a floating point unit). Second, several issue queues and corresponding issue width limits can be defined within a given virtual cluster, for instance to sort loads, stores and ALU operations into separate issue queues with different policies. This is done by specifying an inter-cluster latency of zero cycles between the relevant pseudo-clusters with separate issue queues. Both of these uses are required to accurately model most modern processors.
There is also an equivalent intercluster_bandwidth_map matrix to specify the maximum number of values that can be routed between any two clusters each cycle.
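As an illustration, the following is a minimal sketch of a two-cluster configuration in the style described above; the cluster names, functional unit mask constants and latency/bandwidth values are purely hypothetical and do not reproduce the actual tables in ooocore.h:

#include <cstdint>

// Hypothetical functional unit mask bits (not PTLsim's real constants).
enum : uint32_t { FU_ALU0 = 1 << 0, FU_ALU1 = 1 << 1, FU_FPU0 = 1 << 2 };

struct Cluster {
  const char* name;   // cluster name
  uint32_t fu_mask;   // functional units belonging to this cluster
  int issue_width;    // maximum uops issued from this cluster per cycle
};

const Cluster clusters[] = {
  {"int", FU_ALU0 | FU_ALU1, 2},   // two integer ALUs, 2-wide issue
  {"fp",  FU_FPU0,           1},   // one FP unit, 1-wide issue
};

// Forwarding latency in cycles from producer cluster (row) to consumer
// cluster (column); the diagonal is always zero.
const int intercluster_latency_map[2][2] = {
  // int  fp
  {   0,   2 },  // from int
  {   2,   0 },  // from fp
};

// Maximum number of results routed between each pair of clusters per cycle.
const int intercluster_bandwidth_map[2][2] = {
  {   4,   1 },
  {   1,   4 },
};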
The IssueQueue template class is used to declare issue queues; each cluster has its own issue queue. The syntax IssueQueue<size> issueq_name; is used to declare an issue queue with a specific size; in the current implementation, the size can be from 1 to 64 slots. If the cluster and issue queue configuration is changed, the foreach_issueq(), sched_get_all_issueq_free_slots() and issueq_operation_on_cluster_with_result() macros must be modified to reflect all available clusters; the required modifications should be obvious from the example code. These macros, which use switch statements, are required instead of a simple array because the issue queues can be of different template types and sizes.
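The reason a switch is needed can be seen in the following sketch, which assumes a simplified IssueQueue template and hypothetical per-cluster queue names and sizes:

// Simplified stand-in for the IssueQueue template described in this section.
template <int size>
struct IssueQueue {
  static const int capacity = size;
  // insert(), broadcast(), clock(), issue(), release(), ...
};

IssueQueue<16> issueq_int;   // integer cluster issue queue
IssueQueue<8>  issueq_fp;    // floating point cluster issue queue

// Apply an operation to the issue queue of a given cluster. A switch is
// needed because issueq_int and issueq_fp have different types; this is the
// role played by macros like issueq_operation_on_cluster_with_result().
template <typename Op>
void for_issueq_of_cluster(int cluster, Op op) {
  switch (cluster) {
  case 0: op(issueq_int); break;
  case 1: op(issueq_fp); break;
  }
}

// Example: read the capacity of cluster 1's issue queue.
//   int cap = 0;
//   for_issueq_of_cluster(1, [&](auto& q) { cap = q.capacity; });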
The ReorderBufferEntry::select_cluster() function is responsible for routing a given uop into a specific cluster at the time it is dispatched; uops do not switch between clusters after this.
Various heuristics are employed to select which cluster a given uop should be routed to. In the reference implementation provided in ooopipe.cpp, a weighted score is generated for each possible cluster by scanning through the uop's operands to determine which cluster each will be forwarded from. If the producer uop S of a given operand is currently either dispatched to cluster C but still waiting to execute, or still on the bypass network of cluster C, then cluster C's score is incremented.
The final cluster is selected as the cluster with the highest score out of the set of clusters which the uop can actually issue on (e.g. a floating point uop cannot issue on a cluster with only integer units). The ReorderBufferEntry::executable_on_cluster_mask bitmap can be used to further restrict which clusters a uop can be dispatched to, for instance because certain clusters can only write to certain physical register files. This mechanism is designed to route each uop to the cluster in which the majority of its operands will become available at the earliest time; in practice it works quite well and variants of this technique are often used in real processors.
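A minimal sketch of this scoring heuristic is shown below; the Operand structure and the cluster and operand limits are simplified stand-ins for PTLsim's actual data structures:

#include <cstdint>

const int MAX_CLUSTERS = 2;
const int MAX_OPERANDS = 4;

struct Operand {
  bool waiting_or_in_bypass;  // producer has not yet written back
  int producer_cluster;       // cluster the producer was dispatched to
};

// Returns the cluster with the highest operand-locality score among the
// clusters the uop is allowed to execute on (executable_cluster_mask),
// or -1 if no cluster is allowed.
int select_cluster(const Operand (&operands)[MAX_OPERANDS],
                   uint32_t executable_cluster_mask) {
  int score[MAX_CLUSTERS] = {0};

  // Give a point to each cluster that will forward one of our operands.
  for (int i = 0; i < MAX_OPERANDS; i++)
    if (operands[i].waiting_or_in_bypass)
      score[operands[i].producer_cluster]++;

  // Pick the allowed cluster with the best score.
  int best = -1;
  for (int c = 0; c < MAX_CLUSTERS; c++) {
    if (!(executable_cluster_mask & (1 << c))) continue;
    if (best < 0 || score[c] > score[best]) best = c;
  }
  return best;
}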
PTLsim implements issue queues in the IssueQueue template class using the collapsing priority queue design found in many modern processors.
As each uop is dispatched, it is placed at the end of the issue queue for its cluster and several associative arrays are updated to reflect which operands the uop is still waiting for. In the IssueQueue class, the insert() method takes the ROB index of the uop (its tag in issue queue terminology), the tags (ROB indices) of its operands, and a map of which of the operands are ready versus waiting. The ROB index is inserted into an associative array, and the ROB index tags of any waiting operands are inserted into corresponding slots in parallel arrays, one array per operand (in the current implementation, up to 4 operands are tracked). If an operand was ready at dispatch time, the slot for that operand in the corresponding array is marked as invalid since there is no need to wake it up later. Notice that the new slot is always at the end of the issue queue array; this is made possible by the collapsing mechanism described below.
The issue queue maintains two bitmaps to track the state of each slot in the queue. The valid bitmap indicates which slots are occupied by uops, while the issued bitmap indicates which of those uops have been issued. Together, these two bitmaps form the state machine described in Table 19.1.
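Together the two bits yield a simple per-slot state encoding; the following sketch is an interpretation based on the text above, not a copy of Table 19.1:

enum SlotState { SLOT_FREE, SLOT_DISPATCHED, SLOT_ISSUED };

// Decode the per-slot state from the valid and issued bitmaps.
SlotState slot_state(bool valid, bool issued) {
  if (!valid) return SLOT_FREE;                    // slot holds no uop
  return issued ? SLOT_ISSUED : SLOT_DISPATCHED;   // awaiting release vs. awaiting issue
}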
After insert() is called, the slot is placed in the dispatched state. As each uop completes, its tag (ROB index) is broadcast using the broadcast() method to one or more issue queues accessible in that cycle. Because of clustering, some issue queues will receive the broadcast later than others; this is discussed below. Each slot in each of the four operand arrays is compared against the broadcast value. If the operand tag in that slot is valid and matches the broadcast tag, the slot (in one of the operand arrays only, not the entire issue queue) is invalidated to indicate it is ready and no longer waiting for further broadcasts.
Every cycle, the clock() method uses the valid and issued bitmaps together with the valid bitmaps of each of the operand arrays to compute which issue queue slots in the dispatched state are no longer waiting on any of their operands. This bitmap of ready slots is then latched into the allready bitmap.
The issue() method simply finds the index of the first set bit in the allready bitmap (this is the slot of the oldest ready uop in program order), marks the corresponding slot as issued, and returns the slot. The processor then selects a functional unit for the uop in that slot and executes it via the ReorderBufferEntry::issue() method. After the uop has completed execution (i.e. it cannot possibly be replayed), the release() method is called to remove the slot from the issue queue, freeing it up for incoming uops in the dispatch stage. Because of the collapsing design of the issue queue, the slot is not simply marked as invalid; instead, all slots after it are physically shifted left by one, leaving a free slot at the end of the array. This design is relatively simple to implement in hardware and makes determining the oldest ready-to-issue uop trivial.
Because of the collapsing mechanism, it is critical to note that the slot index returned by issue() will become invalid after the next call to the remove() method; hence, it should never be stored anywhere if a slot could be removed from the issue queue in the meantime.
If a uop issues but determines that it cannot actually complete at that time, it must be replayed. The replay() method clears the issued bit for the uop's issue queue slot, returning it to the dispatched state. The replay mechanism can optionally add additional dependencies such that the uop is only re-issued after those dependencies are resolved. This is important for loads and stores, which may need to add a dependency on a prior store queue entry after finding a matching address in the load or store queues. In rare cases, a replay may also be required when a uop is issued but no applicable functional units are left for it to execute on. The ReorderBufferEntry::replay() method is a wrapper around IssueQueue::replay() used to collect the operands the uop is still waiting for.
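The following sketch ties the above operations together in a simplified, scalar form. It models insert(), broadcast(), clock(), issue(), replay() and release() with plain loops and a fixed four-operand format; PTLsim's real IssueQueue instead uses the vectorized associative arrays and bitmaps described in the following paragraphs, so this code is illustrative only (it also assumes ROB tags never equal 0xFF, which is reserved here as an invalid marker):

#include <cstdint>

const int SIZE = 16;          // number of issue queue slots
const int MAX_OPERANDS = 4;   // operands tracked per uop
const uint8_t INVALID_TAG = 0xFF;

struct SimpleIssueQueue {
  int count = 0;                          // occupied slots, always packed at the front
  uint8_t uopof[SIZE];                    // ROB index (tag) of the uop in each slot
  uint8_t operand[MAX_OPERANDS][SIZE];    // tag each slot still waits on, per operand
  bool issued[SIZE];                      // slot has issued but not yet been released
  bool allready[SIZE];                    // latched once per cycle by clock()

  // Dispatch: append the uop at the end of the queue and record which
  // operand tags it is still waiting for.
  bool insert(uint8_t uoptag, const uint8_t optags[MAX_OPERANDS],
              const bool opready[MAX_OPERANDS]) {
    if (count == SIZE) return false;      // queue full
    int slot = count++;
    uopof[slot] = uoptag;
    issued[slot] = false;
    allready[slot] = false;
    for (int i = 0; i < MAX_OPERANDS; i++)
      operand[i][slot] = opready[i] ? INVALID_TAG : optags[i];
    return true;
  }

  // Wakeup: a completing uop broadcasts its tag; matching operand slots are
  // marked ready by invalidating them.
  void broadcast(uint8_t uoptag) {
    for (int i = 0; i < MAX_OPERANDS; i++)
      for (int slot = 0; slot < count; slot++)
        if (operand[i][slot] == uoptag) operand[i][slot] = INVALID_TAG;
  }

  // Once per cycle: latch which dispatched (not yet issued) slots have no
  // outstanding operands.
  void clock() {
    for (int slot = 0; slot < count; slot++) {
      bool ready = !issued[slot];
      for (int i = 0; i < MAX_OPERANDS; i++)
        if (operand[i][slot] != INVALID_TAG) ready = false;
      allready[slot] = ready;
    }
  }

  // Select: the lowest-numbered ready slot is the oldest uop in program order.
  int issue() {
    for (int slot = 0; slot < count; slot++)
      if (allready[slot]) {
        allready[slot] = false;   // so a second issue() this cycle picks a different slot
        issued[slot] = true;
        return slot;
      }
    return -1;                    // nothing ready this cycle
  }

  // Replay: return an issued slot to the dispatched state, optionally adding
  // new operand tags it must wait for (e.g. a conflicting store queue entry).
  void replay(int slot, const uint8_t newtags[MAX_OPERANDS]) {
    issued[slot] = false;
    for (int i = 0; i < MAX_OPERANDS; i++)
      if (newtags[i] != INVALID_TAG) operand[i][slot] = newtags[i];
  }

  // Release: collapse the queue by shifting every later slot left by one,
  // leaving a free slot at the end of the array.
  void release(int slot) {
    for (int s = slot; s < count - 1; s++) {
      uopof[s] = uopof[s + 1];
      issued[s] = issued[s + 1];
      allready[s] = allready[s + 1];
      for (int i = 0; i < MAX_OPERANDS; i++)
        operand[i][s] = operand[i][s + 1];
    }
    count--;
  }
};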
PTLsim uses a novel method of modeling the issue queue and other associative structures with small tags. Specifically, the FullyAssociativeArrayTags8bit template class, declared in logic.h and used to build the issue queue, makes use of the host processor's 128-bit vector (SSE) instructions to do massively parallel associative matching, masking and bit scanning on up to 16 tags every clock cycle. This makes it substantially faster than simulators that use the naive approach of scanning the issue queue entries linearly. Similar classes in logic.h support O(1) associative searches on both 8-bit and 16-bit tags; for longer tags, the generic FullyAssociativeArrayTags class, which uses standard integer comparisons, is generally more efficient.
As a result of this high-performance design, each issue queue is limited to 64 entries and the tags to be matched must be between 0 and 255 to fit in 8 bits. The FullyAssociativeArrayTags16bit class can be used instead if longer tags are required, at the cost of reduced simulation performance. To enable this, BIG_ROB must be defined in ooocore.h.
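The core of the vectorized matching idea can be sketched as follows, assuming SSE2 intrinsics; it compares sixteen one-byte tags against a broadcast tag in a single step, and is a simplified illustration rather than the actual code in logic.h:

#include <emmintrin.h>
#include <cstdint>

// Returns a 16-bit mask with bit i set if tags[i] == search.
uint32_t match16(const uint8_t* tags, uint8_t search) {
  __m128i v   = _mm_loadu_si128(reinterpret_cast<const __m128i*>(tags));
  __m128i key = _mm_set1_epi8(static_cast<char>(search));  // replicate the tag into all 16 lanes
  __m128i eq  = _mm_cmpeq_epi8(v, key);                     // 0xFF in each matching byte lane
  return static_cast<uint32_t>(_mm_movemask_epi8(eq));      // collapse the lane results into a bitmask
}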
It's important to remember that the issue queue design described above is one possible implementation out of the many designs currently used in industry and research processors. For instance, in lieu of the collapsing design (used by the Pentium 4 and Power4/5/970), the AMD K8 uses the ROB entry's sequence number tag and comparator logic to select the earliest ready instruction. Similarly, the Pentium 4 uses a set of bit vectors (a dependency matrix) instead of tag broadcasts to wake up instructions. These other approaches may be implemented by modifying the IssueQueue class as appropriate.
The top-level issue() function issues one or more uops from each cluster's issue queue every cycle. This function consults the clusters[clusterid].issue_width field defined in ooocore.h to determine the maximum number of uops to issue from each cluster. The issueq_operation_on_cluster_with_result(cluster, iqslot, issue()) macro (Section 19.1) is used to invoke the issue() method of the appropriate cluster to select the earliest ready issue queue slot, as described in Section 19.3.
The ReorderBufferEntry::issue() method of the corresponding ROB entry is then called to actually execute the uop. This method first makes sure a functional unit capable of executing the uop is available within the cluster; if not, the uop is replayed and re-issued on the next cycle. At this point, the uop's three operands (ra, rb, rc) are read from the physical register file. If any of the operands are invalid, the entire uop is marked as invalid with an EXCEPTION_Propagate result and is not executed further. Otherwise, the uop is executed by calling the synthesized execute function for the uop (see Section 17.1).
Loads and stores are handled specially by calling the issueload() or issuestore() method. Since loads and stores can encounter a mis-speculation (e.g. when a load is erroneously issued before an earlier store to the same address), the issueload() and issuestore() functions can return ISSUE_MISSPECULATED to force all uops in program order after the mis-speculated uop to be annulled and sent through the pipeline again. Similarly, if issueload() or issuestore() returns ISSUE_NEEDS_REPLAY, issuing from that cluster is aborted since the uop has been replayed as described in Section 19.3. It is important to note that loads which miss the cache are considered to complete successfully and do not require a replay; their physical register is simply marked as waiting until the load data arrives. In both the mis-speculation and replay cases, no further uops are issued from that cluster's issue queue until the next cycle.
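The control flow of this per-cluster issue loop can be sketched as follows; the helper functions are placeholder stubs and the result enum values are illustrative, so only the overall structure mirrors PTLsim's actual code:

// Illustrative result codes for issuing a uop (names modeled on the text above).
enum IssueResult { ISSUE_OK, ISSUE_NEEDS_REPLAY, ISSUE_MISSPECULATED };

const int MAX_CLUSTERS = 2;
const int issue_width[MAX_CLUSTERS] = {2, 1};   // per-cluster issue width (illustrative)

// Placeholder stubs standing in for the issue queue selection macro,
// ReorderBufferEntry::issue(), and the annulment logic.
int cluster_issueq_select(int /*cluster*/) { return -1; }             // oldest ready slot, or -1
IssueResult rob_issue(int /*cluster*/, int /*slot*/) { return ISSUE_OK; }
void annul_after_misspeculation(int /*cluster*/, int /*slot*/) {}

// Issue up to issue_width uops from each cluster's issue queue this cycle.
void issue_all_clusters() {
  for (int cluster = 0; cluster < MAX_CLUSTERS; cluster++) {
    for (int n = 0; n < issue_width[cluster]; n++) {
      int slot = cluster_issueq_select(cluster);
      if (slot < 0) break;                          // nothing ready in this cluster
      IssueResult r = rob_issue(cluster, slot);
      if (r == ISSUE_MISSPECULATED) {
        annul_after_misspeculation(cluster, slot);  // younger uops re-enter the pipeline
        break;                                      // stop issuing from this cluster this cycle
      }
      if (r == ISSUE_NEEDS_REPLAY) break;           // uop returned to the dispatched state
    }
  }
}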
Branches are handled similarly to integer and floating point operations, except that they may cause a mis-speculation in the event of a branch misprediction; this is discussed below.
If the uop caused an exception, it is forced directly to the commit stage rather than passing through writeback; this keeps dependent uops waiting until they can be properly annulled by the speculation recovery logic. The commit stage will detect the exception and take appropriate action. If the exceptional uop was speculatively executed beyond a mispredicted branch, it will never reach commit anyway, since the bogus branch would have to commit before the exception would even become visible.
NOTE: In PTLsim, all issued uops put their result in the uop's assigned physical register at the time of issue, even though the data technically does not appear there until writeback (i.e. the physical register enters the written state). This is done to simplify the simulator implementation; it is assumed that any data ``read'' from physical registers before writeback is in fact being read from the bypass network instead.