
Benchmarking Techniques

Trigger Mode and other PTLsim Calls From User Code

PTLsim optionally allows user code to control the simulator mode through the ptlcall_xxx() family of functions found in ptlcalls.h when trigger mode is enabled (-trigger configuration option). This file should be included by any PTLsim-aware user programs; these programs must be recompiled to take advantage of these features. Amongst the functions provided by ptlcalls.h are:

ptlcall_switch_to_sim() is only available while the program is executing in native mode. It forces PTLsim to regain control and begin simulating instructions as soon as this call returns.
ptlcall_switch_to_native() stops simulation and returns to native execution, effectively removing PTLsim from the loop.
ptlcall_marker() simply places a user-specified marker number in the PTLsim log file.
ptlcall_capture_stats() adds a new statistics data store snapshot at the time it is called. You can pass a string to this function to name your snapshot, but all names must be unique.
ptlcall_nop() does nothing but test the call mechanism.

In userspace PTLsim, these calls work by forcing execution to code on a "gateway page" at a specific fixed address (0x1000 currently); PTLsim writes the appropriate call gate code to this page depending on whether the process is in native or simulated mode. In native mode, the call gate page typically contains a 64-to-64-bit or 32-to-64-bit far jump into PTLsim, while in simulated mode it contains a reserved x86 opcode interpreted by the x86 decoder as a special kind of system call. If PTLsim is built on a 32-bit only system, no mode switch is required.

In full system PTLsim/X, the x86 opcodes used to implement these calls are directly handled by the PTLsim/X hypervisor as if they were actually part of the native x86 instruction set.

Generally these calls are used to perform "intelligent benchmarking": the ptlcall_switch_to_sim() call is made at the top of the main loop of a benchmark after initialization, while the ptlcall_switch_to_native() call is inserted after some number of iterations to stop simulation after a representative subset of the code has completed. This intelligent approach is far better than the blind "sample for N million cycles after S million startup cycles" approach used by most researchers.
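The placement described above can be sketched in C as follows. This is only an illustration: the two ptlcall_ functions defined here are local stand-ins so the sketch compiles without PTLsim, and their signatures are assumptions (a real benchmark would include ptlcalls.h instead and use the declarations found there). The kernel() function, the mode flag and the iteration counter are hypothetical and exist only to make the simulated region visible.

```c
/* Stand-ins for the calls provided by ptlcalls.h (assumed signatures).
 * In a real PTLsim-aware benchmark, include ptlcalls.h and delete these. */
static int simulating = 0;            /* hypothetical: tracks the current mode */
static int simulated_iterations = 0;  /* hypothetical: iterations run while "simulating" */

static void ptlcall_switch_to_sim(void)    { simulating = 1; }
static void ptlcall_switch_to_native(void) { simulating = 0; }

/* Hypothetical benchmark kernel: one iteration of the main loop. */
static unsigned long kernel(unsigned long acc, int i) {
    if (simulating)
        simulated_iterations++;
    return acc * 31UL + (unsigned long)i;
}

/* Runs total_iters iterations, simulating only the first sim_iters of them. */
unsigned long run_benchmark(int total_iters, int sim_iters) {
    unsigned long acc = 0;
    int i = 0;

    /* ... benchmark initialization runs natively here ... */

    ptlcall_switch_to_sim();              /* top of the main loop */
    for (; i < sim_iters && i < total_iters; i++)
        acc = kernel(acc, i);

    ptlcall_switch_to_native();           /* representative subset is done */
    for (; i < total_iters; i++)
        acc = kernel(acc, i);

    return acc;
}
```

Calling run_benchmark(100, 10) in this sketch would run the first 10 iterations between the two trigger calls and the remaining 90 natively. The computed result does not depend on where the switch happens.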

Fortran programs will have to actually link in the ptlcalls.o object file, since they cannot include C header files. The function names that should be used in the Fortran code remain the same as those from the ptlcalls.h header file.

Notes on Benchmarking Methodology and "IPC"

The x86 instruction set requires somewhat different benchmarking techniques than classical RISC ISAs. In particular, uIPC (Micro-Instructions per Cycle) is NOT a good measure of performance for an x86 processor. Because one x86 instruction may be broken up into numerous uops, it is never appropriate to compare IPC figures for committed x86 instructions per clock with IPC values from a RISC machine. Furthermore, different x86 implementations use varying numbers of uops per x86 instruction as a matter of encoding, so even comparing uop-based IPC across x86 implementations, or against RISC-like machines, is inaccurate.

Users are strongly advised to use relative performance measures instead. Comparing the total simulated cycle count required to complete a given benchmark between different simulator configurations is much more appropriate than IPC with the x86 instruction set. An example would be "the baseline took 100M cycles, while our improved system took 50M cycles, for a 2x improvement".
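As a small worked illustration of the point above, the following C sketch computes the relative speedup from two cycle counts, and shows how the same run can report different uop-based IPC values purely because of how instructions are split into uops. The function names and the uop counts in the usage note are hypothetical and are not taken from PTLsim or PTLstats.

```c
#include <stdint.h>

/* Relative speedup of an improved configuration over a baseline,
 * based only on total simulated cycles for the same benchmark. */
double relative_speedup(uint64_t baseline_cycles, uint64_t improved_cycles) {
    return (double)baseline_cycles / (double)improved_cycles;
}

/* Events per cycle (e.g. committed uops per cycle). Shown only to
 * illustrate why this figure depends on the uop encoding. */
double per_cycle(uint64_t events, uint64_t cycles) {
    return (double)events / (double)cycles;
}
```

With the figures from the text, relative_speedup(100000000, 50000000) gives 2.0. By contrast, if the same program finishes in 100 cycles on two hypothetical decoders that emit 150 and 300 uops respectively, per_cycle() reports 1.5 and 3.0 even though both runs took exactly the same time.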

Simulation Warmup Periods

In some simulators, it is possible to quickly skip through a specific number of instructions before starting to gather statistics, to avoid including initialization code in the statistics. In PTLsim, this is neither necessary nor desirable. Because PTLsim directly executes your program on the host CPU until it switches to cycle accurate simulation mode, there is no way to count instructions in this manner.

Many researchers have gotten in the habit of blindly skipping a large number of instructions in benchmarks to avoid profiling initialization code. However, this is not a very intelligent policy: different benchmarks have different startup times until the top of the main loop is reached, and it is generally evident from the benchmark source code where that point should be. Therefore, PTLsim supports trigger points: by inserting a special function call (ptlcall_switch_to_sim()) within the benchmark source code and recompiling, the -trigger PTLsim option can be used to run the code on the host CPU until the trigger point is reached. If the source code is unavailable, the -startrip 0xADDRESS option will start full simulation only at a specified address (e.g. a function entry point).

If you want to warm up the cache and branch predictors prior to starting statistics collection, combine the -trigger option with the -snapshot-cycles N option. This starts full simulation at the top of the benchmark's main loop (where the trigger call is), but only starts gathering statistics N cycles later, after the processor is warmed up. Remember, since the trigger point is placed after all initialization code in the benchmark, in general it is only necessary to use 10-20 million cycles of warmup time before taking the first statistics snapshot. In this time, the caches and branch predictor will almost always be completely overwritten many times. This approach significantly speeds up the simulation without any loss of accuracy compared to the "fast simulation" mode provided by other simulators.

In PTLstats, use the -subtract option to make sure the final statistics don't include the warmup period before the first snapshot. To subtract snapshot 0 (the first snapshot after the warmup period) from the final snapshot, use a command similar to the following:

    ptlstats -subtract 0 ptlsim.stats
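The arithmetic behind a snapshot delta can be sketched in C as below. This is not how PTLstats is implemented; it only illustrates that each reported counter is the final value minus the value recorded at the end of the warmup period. The structure and its field names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical subset of the counters held in one statistics snapshot. */
typedef struct {
    uint64_t cycles;
    uint64_t committed_insns;
    uint64_t l1d_misses;
} snapshot_t;

/* Returns the counters accumulated between `base` (taken after warmup)
 * and `last` (taken at the end of the run). */
snapshot_t snapshot_delta(snapshot_t last, snapshot_t base) {
    snapshot_t d;
    d.cycles          = last.cycles          - base.cycles;
    d.committed_insns = last.committed_insns - base.committed_insns;
    d.l1d_misses      = last.l1d_misses      - base.l1d_misses;
    return d;
}
```

For example, if the warmup snapshot was taken at 20 million cycles and the run ended at 120 million cycles, the delta covers exactly the 100 million cycles after warmup.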

Sequential Mode

PTLsim supports sequential mode, in which instructions are run on a simple, in-order processor model (in seqcore.cpp) without accounting for cache misses, branch mispredicts and so forth. This is much faster than the out-of-order model, but still slower than native execution. The purpose of sequential mode is mainly to aid in testing the x86 to uop decoder, microcode functions and RTL-level uop implementation code. It may also be useful for gathering certain statistics on the instruction mix and count without running a full simulation.

NOTE: Sequential mode is not intended as a "warmup mode" for branch predictors and caches. If you want this behavior, use statistical snapshot deltas as described in Section 9.3.

Sequential mode is enabled by specifying the "-core seq" option. It has no other core-specific options.

Matt T Yourst 2007-09-26