
Next: x86 Instructions and Micro-Ops Up: PTLsim User's Guide Previous: PTLsim Architecture Contents
PTLsim Code Base
PTLsim is written in C++ with extensive use of x86 and x86-64 inline assembly code. It must be compiled with gcc on a Linux 2.6 based x86 or x86-64 machine. The C++ variant used by PTLsim is known as Embedded C++. Essentially, we only use the features found in C, but add templates, classes and operator overloading. Other C++ features such as hidden side effects in constructors, exception handling, RTTI, multiple inheritance, virtual methods (in most cases), thread local storage and so on are forbidden since they cannot be adequately controlled in the embedded ``bare hardware'' environment in which PTLsim runs, and can result in poor performance. We have our own standard template library, SuperSTL, that must be used in place of the C++ STL.
Even though the PTLsim code base is very large, it is well organized and structured for extensibility. The following section is an overview of the source files and subsystems in PTLsim:
PTLsim Core Subsystems:
- ptlsim.cpp and ptlsim.h are responsible for general top-level PTLsim tasks and starting the appropriate simulation core code.
- uopimpl.cpp contains implementations of all uops and their variations. PTLsim implements most ALU and floating point uops in assembly language so as to leverage the exact semantics and flags generated by real x86 instructions, since most PTLsim uops are so similar to the equivalent x86 instructions. When compiled on a 32-bit system, some of the 64-bit uops must be emulated using slower C++ code.
- ptlhwdef.cpp and ptlhwdef.h define the basic uop encodings, flags and registers. The uop tables are an interesting illustration of how a modern x86 processor is designed at the microcode level. The basic format is discussed in Section 5.1; all uops are documented in Section 27.
- seqcore.cpp implements the sequential in-order core. This is a strictly functional core, without data caches, branch prediction and so forth. Its purpose is to provide fast execution of the raw uop stream and debugging of issues with the decoder, microcode or virtual hardware rather than a specific core model.
Decoder, Microcode and Basic Block Cache:
- decode-core.cpp coordinates the translation from x86 and x86-64 into uops, maintains the basic block cache and handles self modifying code, invalidation and other x86 specific complexities.
- decode-fast.cpp decodes the common subset of the x86 instruction set: instructions that translate into four or fewer uops and together account for about 95% of all instructions executed. It corresponds to the ``fast path'' decoder in a hardware microprocessor.
- decode-complex.cpp decodes complex instructions into microcode, and provides most of the assists (microcode subroutines) required by x86 machines.
- decode-sse.cpp decodes all SSE, SSE2, SSE3 and MMX instructions.
- decode-x87.cpp decodes x87 floating point instructions and provides the associated microcode.
- decode.h contains definitions of the above functions and classes.
Out Of Order Core:
- ooocore.cpp is the out of order simulator control logic. The microarchitectural model implemented by this simulator is the subject of Part IV.
- ooopipe.cpp implements the discrete pipeline stages (frontend and backend) of the out of order model.
- oooexec.cpp implements all functional units, load/store units, and the issue queue and replay logic.
- ooocore.h defines most of the configurable parameters for the out of order core not intrinsic to the PTLsim uop instruction set itself.
- dcache.cpp and dcache.h contain the data cache model. At present the full L1/L2/L3/mem hierarchy is modeled, along with miss buffers, load fill request queues, ITLB/DTLB and bus interfaces. The cache hierarchy is highly configurable; it is described further in Section 25.
- branchpred.cpp and branchpred.h implement the branch predictor. By default, this is set up as a hybrid bimodal and history based predictor with various customizable parameters.
Linux Hosted Kernel Interface:
- kernel.cpp and kernel.h are where all the virtual machine "black magic" takes place to let PTLsim transparently switch between simulation and native mode and between 32-bit and 64-bit mode (or only 32-bit mode on a 32-bit x86 machine). In general you should not need to touch these files, since they are very Linux kernel specific and work at a level below the standard C/C++ libraries.
- lowlevel-64bit.S contains 64-bit startup and context switching code. PTLsim execution starts here if run on an x86-64 system.
- lowlevel-32bit.S contains 32-bit startup and context switching code. PTLsim execution starts here if run on a 32-bit x86 system.
- injectcode.cpp is compiled into the 32-bit and 64-bit code injected into the target process to map the ptlsim binary and pass control to it.
- loader.h is used to pass information to the injected boot code.
PTLsim/X Bare Hardware and Xen Interface:
- ptlxen.cpp brings up PTLsim on the bare hardware, dispatches traps and interrupts, virtualizes Xen hypercalls, communicates via DMA with the PTLsim monitor process running in the host domain 0 and otherwise serves as the kernel of PTLsim's own mini operating system.
- ptlxen-memory.cpp is responsible for all page based memory operations within PTLsim. It manages PTLsim's own internal page tables and its physical memory map, and services page table walks, parts of the x86 microcode and memory-related Xen hypercalls.
- ptlxen-events.cpp provides all interrupt (VIRQ) and event handling, manages PTLsim's time dilation technology, and provides all time and event related hypercalls.
- ptlxen-common.cpp provides common functions used by both PTLsim itself and PTLmon.
- ptlxen.h provides inline functions and defines related to full system PTLsim/X.
- ptlmon.cpp provides the PTLsim monitor process, which runs in domain 0 and interfaces with the PTLsim hypervisor code inside the target domain to allow it to communicate with the outside world. It uses a client/server architecture to forward control commands to PTLsim using DMA and Xen hypercalls.
- xen-types.h contains Xen-specific type definitions.
- ptlsim-xen-hypervisor.diff and ptlsim-xen-tools.diff are patches that must be applied to the Xen hypervisor source tree and the Xen userspace tools, respectively, to allow PTLsim to be injected into domains.
- ptlxen.lds and ptlmon.lds are linker scripts used to lay out the memory image of PTLsim and PTLmon.
- lowlevel-64bit-xen.S contains the PTLsim/X boot code, interrupt handling and exception handling.
- ptlctl.cpp is a utility used within a domain under simulation to control PTLsim.
- ptlcalls.h provides a library of functions used by code within the target domain to control PTLsim.
Support Subsystems:
- superstl.h, superstl.cpp and globals.h implement various standard library functions and classes as an alternative to C++ STL. These libraries also contain a number of features very useful for bit manipulation.
- logic.h is a library of C++ templates for implementing synchronous logic structures like associative arrays, queues, register files, etc. It has some very clever features like FullyAssociativeTags8bit, which uses x86 SSE vector instructions to associatively match and process 16 byte-sized tags every cycle. These classes are fully parameterized and useful for all kinds of simulations.
- mm.cpp is the PTLsim custom memory manager. It provides extremely fast memory allocation functions based on multi-threaded slab caching (the same technique used inside Linux itself) and extent allocation, along with a traditional physical page allocator. The memory manager also provides PTLsim's garbage collection system, used to discard unused or least recently used objects when allocations fail.
- mathlib.cpp and mathlib.h provide standard floating point functions suitable for embedded systems use. These are used heavily as part of the x87 microcode.
- klibc.cpp and klibc.h provide standard libc-like library functions suitable for use on the bare hardware.
- syscalls.cpp and syscalls.h declare all Linux system call stubs. This is also used by PTLsim/X, which emulates some Linux system calls to make porting easier.
- config.cpp and config.h manage the parsing of configuration options for each user program. This is a general purpose library used by both PTLsim itself and the userspace tools (PTLstats, etc.).
- datastore.cpp and datastore.h manage the PTLsim statistics data store file structure.
Userspace Tools:
- ptlstats.cpp is a utility for printing and analyzing the statistics data store files in various human readable ways.
- dstbuild is a Perl script used to parse stats.h and generate the datastore template (Section 8).
- makeusage.cpp is used to capture the usage text (help screen) for linking into PTLsim.
- cpuid.cpp is a utility program to show various data returned by the x86 cpuid instruction. Run it under PTLsim for a surprise.
- glibc.cpp contains miscellaneous userspace functions.
- ptlcalls.c and ptlcalls.h are optionally compiled into user programs to let them switch into and out of simulation mode on their own. The ptlcalls.o file is typically linked with Fortran programs that can't use regular C header files.
PTLsim includes a number of powerful C++ templates, macros and functions not found anywhere else. This section attempts to provide an overview of these structures so that users of PTLsim will use them instead of trying to duplicate work we've already done.
The file globals.h contains a wide range of very useful definitions, functions and macros we have accumulated over the years, including:
- Basic data types used throughout PTLsim (e.g. W64 for 64-bit words, Waddr for words the same size as pointers, and so on)
- Type safe C++ template based functions, including min, max, abs, mux, etc.
- Iterator macros (foreach)
- Template based metaprogramming functions including lengthof (finds the length of any static array), offsetof (offset of member in structure), baseof (member to base of structure), and log2 (takes the base-2 log of any constant at compile time)
- Floor, ceiling and masking functions for integers and powers of two (floor, trunc, ceil, mask, floorptr, ceilptr, maskptr, signext, etc)
- Bit manipulation macros (bit, bitmask, bits, lowbits, setbit, clearbit, assignbit). Note that the bitvec template (see below) should be used in place of these macros wherever it is more convenient.
- Comparison functions (aligned, strequal, inrange, clipto)
- Modulo arithmetic (add_index_modulo, modulo_span, et al)
- Definitions of basic x86 SSE vector functions (e.g. x86_cpu_pcmpeqb et al)
- Definitions of basic x86 assembly language functions (e.g. x86_bsf64 et al)
- A full suite of bit scanning functions (lsbindex, msbindex, popcount et al)
- Miscellaneous functions (arraycopy, setzero, etc)
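To give a flavor of the helpers listed above, the following is a minimal sketch written for illustration, not the actual contents of globals.h. The names W64, lengthof, bitmask, bits, lowbits, setbit, clearbit, bit and add_index_modulo follow the list above, but their signatures and exact semantics here are assumptions. The names log2_const, floor_pow2 and ceil_pow2 are hypothetical, chosen to avoid clashing with the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

typedef std::uint64_t W64;  // 64-bit word, mirroring the W64 name in the text

// lengthof: number of elements in a static array, resolved at compile time.
template <typename T, std::size_t N>
constexpr std::size_t lengthof(T (&)[N]) { return N; }

// Compile-time floor(log2(N)) for a constant N >= 1 (hypothetical name).
template <W64 N> struct log2_const {
  static constexpr int value = 1 + log2_const<N / 2>::value;
};
template <> struct log2_const<1> {
  static constexpr int value = 0;
};

// Bit helpers (assumed semantics). bitmask(n) has the low n bits set, n < 64.
inline W64 bitmask(unsigned n) { return (W64(1) << n) - 1; }
inline W64 bits(W64 x, unsigned start, unsigned count) {
  return (x >> start) & bitmask(count);
}
inline W64 lowbits(W64 x, unsigned n) { return x & bitmask(n); }
inline W64 setbit(W64 x, unsigned i) { return x | (W64(1) << i); }
inline W64 clearbit(W64 x, unsigned i) { return x & ~(W64(1) << i); }
inline bool bit(W64 x, unsigned i) { return ((x >> i) & 1) != 0; }

// Round down or up to a multiple of a, where a must be a power of two.
inline W64 floor_pow2(W64 x, W64 a) { return x & ~(a - 1); }
inline W64 ceil_pow2(W64 x, W64 a) { return (x + a - 1) & ~(a - 1); }

// Advance a circular-buffer index by a (possibly negative) increment.
inline int add_index_modulo(int index, int increment, int size) {
  int r = (index + increment) % size;
  return (r < 0) ? (r + size) : r;
}
```

With these definitions, bits(0xABCD, 4, 8) extracts the middle byte 0xBC, and add_index_modulo(0, -1, 8) wraps around to index 7, which is the kind of arithmetic needed for circular queues in a pipeline model.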
The Super Standard Template Library (SuperSTL) is a C++ library we use internally in lieu of the normal C++ STL for various technical and preferential reasons. While the full documentation is in the comments of superstl.h and superstl.cpp, the following is a brief list of its features:
- I/O stream classes familiar from Standard C++, including istream and ostream. Unique to SuperSTL is how the comma operator (``,'') can be used to separate a list of objects to send to or from a stream, in addition to the usual C++ insertion operator (``<<'').
- To read and write binary data, the idstream and odstream classes should be used instead.
- String buffer (stringbuf) class for composing strings in memory the same way they would be written to or read from an ostream or istream.
- String formatting classes (intstring, hexstring, padstring, bitstring, bytemaskstring, floatstring) provide a wrapper around objects to exercise greater control of how they are printed.
- Array (array) template class represents a fixed size array of objects. It is essentially a simple but very fast wrapper for a C-style array.
- Bit vector (bitvec) is a heavily optimized and rewritten version of the Standard C++ bitset class. It supports many additional operations well suited to logic design purposes and emphasizes extremely fast branch free code.
- Dynamic Array (dynarray) template class provides for dynamically sized arrays, stacks and other such structures, similar to the Standard C++ valarray class.
- Linked list node (listlink) template class forms the basis of double linked list structures in which a single pointer refers to the head of the list.
- Queue list node (queuelink) template class supports more operations than listlink and can serve as both a node in a list and a list head/tail header.
- Index reference (indexref) is a smart pointer which compresses a full pointer into an index into a specific structure (made unique by the template parameters). This class behaves exactly like a pointer when referenced, but takes up much less space and may be faster. The indexrefnull class adds support for storing null pointers, which indexref lacks.
- Hashtable class is a general purpose chaining based hash table with user configurable key hashing and management via add-on template classes.
- SelfHashtable class is an optimized hashtable for cases where objects contain their own keys. Its use is highly recommended instead of Hashtable.
- ChunkList class maintains a linked list of small data items, but packs many of these items into a chunk, then chains the chunks together. This is the most cache-friendly way of maintaining variable length lists.
- CRC32 calculation class is useful for hashing.
- CycleTimer is useful for timing intervals with sub-nanosecond precision using the CPU cycle counter (discussed in Section 11.5).
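The comma-separated stream syntax and the formatting wrappers described above can be illustrated with a small stand-alone sketch. This is not SuperSTL code: MiniStream and HexWrap are hypothetical stand-ins for ostream and hexstring, and the sketch uses std::string purely to stay self-contained, which PTLsim itself would not do.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for an output stream; it collects text in a string.
struct MiniStream {
  std::string buf;
};

inline MiniStream& operator<<(MiniStream& os, const char* s) {
  os.buf += s;
  return os;
}

inline MiniStream& operator<<(MiniStream& os, int v) {
  os.buf += std::to_string(v);
  return os;
}

// Hypothetical formatting wrapper: prints a value as zero-padded hex,
// using as many hex digits as needed to cover the given bit width.
struct HexWrap {
  unsigned long long value;
  int bits;
};

inline MiniStream& operator<<(MiniStream& os, const HexWrap& h) {
  char tmp[32];
  std::snprintf(tmp, sizeof(tmp), "%0*llx", (h.bits + 3) / 4, h.value);
  os.buf += tmp;
  return os;
}

// The comma operator simply forwards to operator<<. Because << binds more
// tightly than the comma, "os << a, b, c" inserts a, then b, then c.
template <typename T>
inline MiniStream& operator,(MiniStream& os, const T& v) {
  return os << v;
}
```

A statement such as `os << "rip=", HexWrap{0x4005d0, 32}, " uops=", 3;` then leaves "rip=004005d0 uops=3" in os.buf. One caveat of this design is that inside a function-call argument list the comma is an argument separator, so such an expression must be parenthesized there.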
The Logic Standard Template Library (LogicSTL) is an internally developed add-on to SuperSTL which supports a variety of structures useful for modeling sequential logic. Some of its primitives may look familiar to Verilog or VHDL programmers. While the full documentation is in the comments of logic.h, the following is a brief list of its features:
- latch template class works like any other assignable variable, but the new value only becomes visible after the clock() method is called (potentially from a global clock chain).
- Queue template class implements a general purpose fixed size queue. The queue supports various operations from both the head and the tail, and is ideal for modeling queues in microprocessors.
- Iterators for Queue objects such as foreach_forward, foreach_forward_from, foreach_forward_after, foreach_backward, foreach_backward_from, foreach_backward_before.
- HistoryBuffer maintains a shift register of values, which when combined with a hash function is useful for implementing predictor histories and the like.
- FullyAssociativeTags template class is a general purpose array of associative tags in which each tag must be unique. This class uses highly efficient matching logic and supports pseudo-LRU eviction, associative invalidation and direct indexing. It forms the basis for most associative structures in PTLsim.
- FullyAssociativeArray pairs a FullyAssociativeTags object with actual data values to form the basis of a cache.
- AssociativeArray divides a FullyAssociativeArray into sets. In effect, this class can provide a complete cache implementation for a processor.
- LockableFullyAssociativeTags, LockableFullyAssociativeArray and LockableAssociativeArray provide the same services as the classes above, but support locking lines into the cache.
- CommitRollbackCache leverages the LockableFullyAssociativeArray class to provide a cache structure with the ability to roll back all changes made to memory (not just within this object, but everywhere) after a checkpoint is made.
- FullyAssociativeTags8bit and FullyAssociativeTags16bit work just like FullyAssociativeTags, except that these classes are dramatically faster when using small 8-bit and 16-bit tags. This is possible through the clever use of x86 SSE vector instructions to associatively match and process 16 8-bit tags or 8 16-bit tags every cycle. In addition, these classes support features like removing an entry from the middle of the array while compacting entries around it in constant time. These classes should be used in place of FullyAssociativeTags whenever the tags are small enough (i.e. almost all tags except for memory addresses).
- FullyAssociativeTagsNbitOneHot is similar to FullyAssociativeTagsNbit, but the user must guarantee that all tags are unique. This property is used to perform extremely fast matching even with long tags (32+ bits). The tag data is striped across multiple SSE vectors and matched in parallel, then a clever adaptation of the sum-of-absolute-differences SSE instruction is used to extract the single matching element (if any) in O(1) time.
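Two of the ideas above, the clocked latch and the SSE-based matching of small tags, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code in logic.h: Latch and match_tags8 are hypothetical names, the interfaces are guesses, and only the basic parallel compare is shown (no LRU, invalidation or compaction). A scalar fallback is included for builds without SSE2.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Latch sketch: an assignment writes a shadow value, which only becomes
// visible to readers after clock() is called.
template <typename T>
class Latch {
  T current_;
  T next_;
public:
  explicit Latch(const T& init = T()) : current_(init), next_(init) {}
  Latch& operator=(const T& v) { next_ = v; return *this; }
  operator const T&() const { return current_; }
  void clock() { current_ = next_; }
};

// Compare one 8-bit target against 16 stored tags at once.
// Bit i of the result is set when tags[i] == target.
inline unsigned match_tags8(const std::uint8_t tags[16], std::uint8_t target) {
#if defined(__SSE2__)
  __m128i stored = _mm_loadu_si128(reinterpret_cast<const __m128i*>(tags));
  __m128i wanted = _mm_set1_epi8(static_cast<char>(target));
  // PCMPEQB yields 0xff in each matching byte; PMOVMSKB packs the top
  // bit of every byte into a 16-bit mask.
  __m128i equal = _mm_cmpeq_epi8(stored, wanted);
  return static_cast<unsigned>(_mm_movemask_epi8(equal));
#else
  unsigned mask = 0;
  for (int i = 0; i < 16; i++) {
    if (tags[i] == target) mask |= 1u << i;
  }
  return mask;
#endif
}
```

For example, a Latch<int> initialized to 1 and then assigned 5 still reads as 1 until clock() is called. For the tag match, if slots 3 and 9 hold the target tag and no other slot does, match_tags8 returns 0x208; a bit scan over that mask then gives the matching slot indices.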
The out of order simulator header, ooocore.h, contains several reusable classes, including:
- IssueQueue template class can be used to implement all kinds of broadcast based issue queues.
- StateList and ListOfStateLists are useful for collecting the various lists that objects can be on into one structure.

Matt T Yourst 2007-09-26