PTLsim

Frequently Asked Questions

NOTE: This FAQ was last updated January 2, 2005. The PTLsim User's Manual and Reference may also have more up to date or detailed information in some cases.

Overview

What is PTLsim?

PTLsim is a state of the art cycle accurate microprocessor simulator and virtual machine for the x86 and x86-64 instruction sets. PTLsim models a full out of order processor core, featuring extensive memory and branch speculation with replay, a highly configurable clustered microarchitecture with various issue queue designs, a full cache hierarchy and memory subsystem and supporting hardware.

Is PTLsim a full system simulator?

Yes! PTLsim comes in two flavors: PTLsim Classic simulates single-threaded userspace applications only, while PTLsim/X provides full system simulation. PTLsim/X integrates with the Xen hypervisor to provide this capability.

What is co-simulation?

Co-simulation is a processor research and development technique in which the simulator runs directly on a reference machine supporting the instruction set being simulated. By using co-simulation, PTLsim can context switch between native mode (directly running on the real hardware) and simulated mode in a way that is completely transparent to all user code. This makes correctness verification trivial, since the output can be compared to a real machine at any time.

Why did you develop PTLsim?

PTLsim was developed to fulfill three main objectives:

We needed a simulator to study x86 and x86-64 programs, yet no pre-existing tools supported this very important instruction set.
Existing simulators, despite their widespread use, were not accurate enough for our needs and tended to have slow simulation rates. We felt we could do much better with PTLsim.
By releasing PTLsim as open source, we feel that we can steer the research and development community in a better direction than the dead end provided by existing simulation tools (see "Why x86?" below).

Why x86?

PTLsim is currently the only cycle accurate simulator available to the public that supports the x86 instruction set (let alone 64-bit x86-64). Unlike the ISAs targeted by other simulators, x86 is a contemporary and very widely used instruction set with readily available hardware implementations. It provides a new option for researchers stuck with simulation tools supporting only the Alpha or MIPS based instruction sets, both of which have since been discontinued on real commercially available hardware (making co-simulation impossible) with an uncertain future in up to date compiler toolchains. Additionally, x86 presents many interesting features (some would say "quirks") which provide opportunities for additional optimizations not possible or relevant with RISC based instruction sets. In particular, the smaller x86 register set and long load-use dependency chains with many dead values provide fertile ground for testing out new microarchitectural techniques.

Who are the developers?

PTLsim was developed from scratch by its author in response to his own research needs and those of colleagues at the State University of New York at Binghamton. Parts of the PTLsim code base date back to work done starting in 2001, with continuous development since 2003.

We would like to thank our colleagues for their input in helping PTLsim better meet the needs of researchers.

We would also like to thank our numerous users outside our group for their input and help in improving PTLsim.

How does PTLsim compare to SimpleScalar?

PTLsim is not related to SimpleScalar, an older simulation infrastructure still widely used in the academic research community despite its numerous shortcomings. One of the main motivations for developing PTLsim (beyond support of the x86 instruction set) was to avoid the inaccurate microarchitectural model and performance problems inherent in SimpleScalar's design.

PTLsim is typically significantly faster than SimpleScalar 4.x (often 5 to 10 times faster, in terms of simulated cycles per second) on the same hardware, compiler and workload due to its unique design.

How is PTLsim licensed?

PTLsim is licensed under the GNU General Public License, version 2, the same license preferred by the majority of today's open source software. This means you are free to study, modify and use PTLsim in any context, so long as you release the source code to any programs derived from any part of PTLsim. This only applies if you actually distribute programs containing PTLsim code: you're welcome to use PTLsim within your organization (including commercial organizations) without releasing your changes (although we encourage you to do so, to help improve PTLsim for everyone!)

What language is PTLsim written in?

PTLsim is written mostly in C++. We have our own standard template libraries, SuperSTL and LogicSTL, that we use in place of the Standard C++ STL.

PTLsim also makes liberal use of inline x86 and x86-64 assembly language to significantly increase its speed, yet the code remains clear and easy to understand by isolating the assembly language code within easy to use template classes representing various synchronous logic structures. See the question "Why is PTLsim so fast?" below for more on this.

What standard library features cannot be used inside PTLsim?

When simulating a user process, PTLsim runs directly inside that process address space. This design imposes certain restrictions on the ability to link normal userspace libraries (like glibc and Standard C++ STL) to PTLsim, since the stock glibc and libstdc++ will cause interference, corruption and thread deadlocks between PTLsim itself and the user program under analysis.

Therefore, we provide our own version of the standard C library and common C++ classes (in SuperSTL, see above) that are specifically tuned to work in the PTLsim environment. Users who wish to extend PTLsim should keep in mind that they are working in an embedded system environment more similar to an OS kernel than a normal userspace program. If you need to use additional libraries that cannot be linked with PTLsim, we recommend logging the required data to a file, then using a normal program (for instance, PTLstats) to actually do the analysis.

What does "PTLsim" stand for?

PTLsim comes from the name of the simulator used internally in our research on a presently undisclosed new microarchitecture. For public reference, "Processor Technology Laboratory Simulator" is the unofficial meaning.

Performance and Accuracy

How fast is PTLsim?

Very fast. PTLsim provides extremely high performance while delivering higher fidelity cycle accurate data than competing simulators. Depending on the compiler and options enabled, it is not unusual for PTLsim to complete several million simulated uops (micro-operations) per second on modern hardware (such as our 2.4 GHz Athlon 64 reference systems).

Other simulators, such as SESC, also have high performance, but do not deliver the same level of accuracy as PTLsim, as will be discussed below. However, direct comparison to other simulators is impossible since at the time of writing, PTLsim is the only x86 or x86-64 cycle accurate simulator available to the public.

Compared to SimpleScalar, PTLsim is generally around 5 to 10 times as fast (in terms of simulated micro-ops per second) on the same hardware, even in full cycle accurate out of order mode.

Why is PTLsim so fast?

PTLsim is fast for a variety of reasons:

Because of its unique co-simulation approach (see above), PTLsim can run native x86 instructions in hardware to reproduce the exact semantics of each PTLsim micro-op, rather than using slow emulations written in generic C.
PTLsim uses vectorized SSE operations and x86 specific instructions to do O(1) parallel matching on most of the associative structures it models, rather than the naive linear scan approach used by competing simulators.
Branch free and cache aware algorithms are used pervasively in PTLsim, yet through the use of C++ templates and macros, the source code remains very clear.
Cache profiling is used to reduce the size of key structures such that the entire working set of the out of order core is under ~1 MB and hence fits in the L2 and/or L1 cache of most processors.
PTLsim is self profiling, allowing us to identify hot spots at any time.
Users can balance performance and accuracy by turning off certain features (e.g. internal error checking, logging support, statistics gathering) to provide only the data they're interested in.

How accurate is PTLsim?

Very accurate. In every case where cycle accurate results could be affected, PTLsim implements key structures down to the register transfer level (RTL) in its model, yet still maintains high performance through its use of the novel algorithms and hardware accelerated techniques described above.

PTLsim is highly configurable down to the very last bypass path or bit width. For an exhaustive list of user configurable structures, see the PTLsim User's Manual and Reference.

Benchmarking

Is IPC (Instructions per Cycle) a good measure of performance for an x86 processor?

NO. Because one x86 instruction may be broken up into numerous uops, it is never appropriate to compare IPC figures for committed x86 instructions per clock with IPC values from a RISC machine. Furthermore, different x86 implementations use varying numbers of uops per x86 instruction as a matter of encoding, so even comparing the uop based IPC between x86 implementations or RISC-like machines is inaccurate.

Users are strongly advised to use relative performance measures instead. Comparing the total simulated cycle count required to complete a given benchmark between different simulator configurations is much more appropriate. An example would be "the baseline took 100M cycles, while our improved system took 50M cycles, for a 2x improvement."

How do I skip X million cycles to "warm up" the simulator and skip over initialization code in a benchmark?

In PTLsim, this is neither necessary nor desirable. Because PTLsim directly executes your program on the host CPU until it switches to cycle accurate simulation mode, there is no way to count instructions in this manner.

Many researchers have gotten in the habit of blindly skipping a large number of instructions in benchmarks to avoid profiling initialization code. However, this is not a very intelligent policy: different benchmarks have different startup times until the top of the main loop is reached, and it is generally evident from the benchmark source code where that point should be.

How do I use trigger points to simulate the above?

Instead of having a fast skip mode, PTLsim supports trigger points: by inserting a special function call (ptlcall_switch_to_sim) within the benchmark source code and recompiling, the -trigger PTLsim option can be used to run the code on the host CPU until the trigger point is reached. If the source code is unavailable, the -startrip 0xADDRESS option will start full simulation only at a specified address (e.g. function entry point).

With trigger mode, how do I ensure caches and predictors are warmed up?

If you want to warm up the cache and branch predictors prior to starting statistics collection, combine the -trigger option with the -snapshot N option, to start full simulation at the top of the benchmark's main loop (where the trigger call is) but only start gathering statistics N cycles later, after the processor is warmed up.

Remember, since the trigger point is placed after all initialization code in the benchmark, in general it is only necessary to wait 10-20 million cycles of warmup before taking the first statistics snapshot. In this time, the caches and branch predictor will almost always be completely overwritten many times.

This approach significantly speeds up the simulation without any loss of accuracy compared to the "fast simulation" mode provided by other simulators.

How do I make sure my statistics don't include the warmup period?

In PTLstats, use the -deltastart and -deltaend options. To subtract snapshot 0 (the first snapshot after the warmup period) from the final snapshot, do this:

ptlstats -deltastart 0 ptlsim.stats > ptlsim.stats.txt

See section 4.5 in the User's Manual for more information.

Compatibility

Can I use PTLsim on both x86-64 and regular 32-bit x86 machines?

Yes. PTLsim can be compiled and run on both 32-bit x86 and 64-bit x86-64 machines. Any processor with SSE2 support is supported - e.g., Intel Pentium 4, Pentium M, Core, Core 2, all AMD Athlon 64, Opteron or Turion systems, Transmeta Efficeon, Via C3, etc.

Of course, the 32-bit version of PTLsim will lack x86-64 support and will run slower since the simulated microarchitecture is internally 64 bit in all PTLsim versions.

When building on an x86-64 machine, the resulting PTLsim binary will run both 64-bit x86-64 and 32-bit x86 user code, so there is no reason to have separate PTLsim versions if you're using a 64-bit host system.

Compiler Support

PTLsim can be built with gcc 3.3 or 3.4, or with gcc 4.1. Users have reported that early gcc 4.0 versions do not work; gcc 4.1 fixes the compiler bugs affecting PTLsim, so please upgrade if needed.

We have successfully compiled and run PTLsim on SuSE Linux 9.3 (kernel 2.6.11 - 2.6.16, both 32-bit and x86-64 on Athlon 64), SuSE Linux 10.1 and Debian stable (kernel 2.6.8, Pentium 4 Xeon) with gcc 3.3, 3.4 and 4.1.

Other Compilers

While PTLsim itself must be compiled with gcc, we have tested user programs compiled with Intel C++ and Intel Fortran versions 7.0 through 9.0 on recent versions of PTLsim with no problems. For best performance, your code should be compiled for a Pentium 4 machine (i.e. -xw option or -march=pentium4).

NOTE: Intel C++ versions prior to 9.0 will intentionally break or run sub-optimal code on any CPU that does not report "GenuineIntel" as its CPUID. If you must use this older version, the CPUID microcode in decode-complex.cpp will need to be patched to return "GenuineIntel" instead of "PTLsimCPUx64".

NOTE: Intel also offers a Math Kernel Library and several other libraries written in hand-coded assembly language. These libraries currently do not work, since they use a variety of exotic SSE3/MMX instructions PTLsim does not yet support. We are presently working on fixing this.

Does x87 floating point work?

Yes. PTLsim implements the vast majority of all x87 floating point instructions. The only unimplemented instructions are binary coded decimal (BCD) instructions and a few oddball instructions Intel and AMD have not recommended for many years. None of these are actually generated by modern compilers or even hand coded assembly language in the standard libraries.

Be advised that PTLsim models 64-bit IEEE compliant FPUs (i.e. with full SSE2 semantics), not 80-bit like x87. This means very old legacy code actually using this mode will have slightly less precision (which may or may not be important) and results may differ very slightly from a real machine. Fortunately C, C++ and Fortran are incapable of using 80-bit mode anyway.

IMPORTANT: Legacy x87 code will be much slower than SSE code due to our microarchitectural model (all x87 stack operations are microcoded and sequential). We're not alone in this decision - Intel has also decided to focus solely on SSE/SSE2 performance at the expense of x87. More information is contained in the x87 Floating Point section of the User's Manual.

Enabling SSE2 floating point: In gcc, compiling with -march=pentium4 or -march=k8 automatically forces SSE2 to be used for all floating point code; the -mfpmath=sse option forces SSE to be used for older processors as well. GCC still may produce some x87 instructions due to code generator quirks and outdated libraries, but these will not be on the critical path. It is also recommended that you upgrade the glibc and libm packages on your Linux distribution to Pentium 4 or Athlon 64 specific versions to remove the last vestiges of x87 code for peak performance.

If your program uses a substantial amount of x87 floating point (i.e. beyond a few x87 instructions executed at startup), PTLsim will print a warning suggesting that you recompile your program or upgrade your software.

Are there any other incompatibilities with 64-bit x86-64 programs?

To the best of our knowledge, PTLsim correctly implements all x86 and x86-64 instructions actually generated by modern compilers (we tested with gcc 3.4.x). We've tested extensively with a wide variety of x86 and x86-64 software. Therefore, we feel confident that PTLsim is bug free, both in x86 decoding and in the actual out of order processor model.

What operating system features aren't allowed?

NOTE: This answer applies only to PTLsim Classic. Our full system simulator, PTLsim/X, supports all operating system features since it runs at the level of the bare hardware and is OS-agnostic.

Presently, multi-threaded programs cannot be used with PTLsim Classic, since the simulator can follow at most one thread at a time. If you need this capability, use PTLsim/X.

Programs which use very low level kernel features or change their behavior depending on the precise wall clock time may break. (In case you're wondering, it's not possible to recursively run PTLsim under itself due to its low level kernel specific nature).

The good news is that we've been able to run many "exotic" programs well beyond the range of traditional simulators. For instance, we've run graphical X11 programs, compiled the Linux kernel with gcc and accessed the Internet, all from within the PTLsim virtual machine.

Which Linux kernels can PTLsim run on?

PTLsim should work on any modern Linux 2.6.x kernel. We have tested with stock 2.6.11 through 2.6.18 kernels on x86 and x86-64, as well as 2.6.8 (Debian stable branch) on 32-bit x86.