Using SPEC 2006

Building and running the SPEC CPU2006 benchmarks

The benchmarks are in /project/spec2006 on our cluster. Before getting started, run this:


cd /project/spec2006
. shrc

(note the leading "." - it sources the script into your current shell)

I've pre-configured the benchmarks (using the runspec --fake option) and I've built them. To rebuild a specific benchmark (e.g. benchname), run this:


cd /project/spec2006/benchspec/CPU2006/*.benchname/run/build_base_cpu2006.linux64.em64t.qxp.0000
make

To run a specific benchmark:


cd /project/spec2006/benchspec/CPU2006/*.benchname/run/run_base_cpu2006.linux64.em64t.qxp.0000
. run.cmd

(note the leading "." - it sources the script into your current shell)

I wrote the prep-run-scripts script in benchspec/CPU2006 to auto-generate the run.cmd command lists from the SPEC-provided files. You do not need to use this script - just run run.cmd to execute the benchmarks (or look at the contents of run.cmd for details).

IMPORTANT: Please do all this benchmarking work on typhoon.cs.binghamton.edu, since this is a quad-core 2.4 GHz Intel Core 2 machine and we're using the Intel C/C++ compiler with tuning specifically for this chip (Intel's compiler emits runtime-dispatched code paths, so the same binary may execute different instructions on an AMD machine). The native results will be based on this reference machine.

You can work directly on the files in /project/spec2006. I think I gave you the correct permissions for everything, but you should have root access to change them if something isn't writable. I made a backup in /project/spec2006-backup in case you overwrite or delete something by accident.

You may need to manually edit run.cmd since some SPEC benchmarks run the same executable on multiple input data sets. Ideally, only one representative dataset should be used (for instance, for gcc, I'm using expr.i as the input, even though the original benchmark listed many more files).
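
For reference, a trimmed run.cmd boils down to a single command line. A hypothetical example for gcc (the real executable name and flags come from the SPEC-generated files, so treat this as illustrative only):


./gcc_base.linux64.em64t.qxp expr.i -o expr.s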

Instrumenting the benchmarks for PTLsim

At the top of the source file(s) to which you'll be adding instrumentation, add this:


#define PTLSIM_HYPERVISOR
#include "/project/ptlsim/ptlcalls.h"

Then find the function where you'll be adding the instrumentation. A good choice is any function where status information is printed out as each part of the dataset is processed. For instance, in the gcc benchmark, the announce_function() function in diagnostic.c runs whenever a new function is compiled, which makes it a good choice since all parts of the benchmark code are exercised between any two passes through that instrumentation point.

Here's an example of the code I added:


{
  const char* funcname = (*decl_printable_name)(decl, 2);
  int match = (!strcmp(funcname, "get_pointer_alignment"));

  printf("[%s (match? %d, id? %d)]\n", funcname, match, (int)ptlsim_marker_id);

  if (match) ptlcall_checkpoint(0);
  ptlcall_marker(ptlsim_marker_id++);
}

This code checks whether the function currently being compiled is called get_pointer_alignment (part of the expr.i benchmark dataset), since I determined that this point is reached only after all the key benchmark subsystems have been warmed up, making it a good starting point.

Once this point is reached, ptlcall_checkpoint(0) is called to save a checkpoint of the virtual machine state. This lets you resume from that checkpoint the same way every time, either under PTLsim or in native mode.

Note that a checkpoint will only be made if the benchmark is run inside a virtual machine under Xen - it's ignored under domain 0 while you're debugging the benchmarks (since dom0 cannot checkpoint itself).

The ptlcall_marker(ptlsim_marker_id++) call adds an entry to the hypervisor's log file (described below) to help you select spans of the desired number of x86 instructions for benchmarking purposes. The ptlsim_marker_id variable is built in (declared in ptlcalls.h) and starts at 0 when the benchmark is first run. The marker value is recorded in the hypervisor log file along with various timestamps and statistics, and the marker can later be used with PTLsim's -stop-at-marker-hits option to vary the total runtime, always starting from the same checkpoint (this is described later).

Viewing the hypervisor log

Whenever a program calls ptlcall_marker(markerid), it executes a special PTLCALL x86 instruction that the Xen hypervisor catches and interprets as if it were a real x86 instruction. Each PTLCALL makes an entry in the hypervisor log. The log can be viewed using:


sudo xm dmesg > xen-dmesg.log

or a similar command (redirect the output to a file: the log is long and old entries are eventually overwritten, so make sure you save it).

For instance, after running the instrumented gcc benchmark, a number of log entries will appear:


(XEN) Performance counter dump for domain 0 vcpu 1 (phys cpu 1):
(XEN)   rip:               0x0000000000688427
(XEN)   marker:                          42
(XEN)   seqid:                            6
(XEN)   tsc:                   572204319888         56789916     572204319888
(XEN)   pmc0:                             0                0                0
(XEN)   pmc1:                             0                0                0
(XEN)   retired_insns:          99131245418         26376441      99131245418
(XEN)   unhalted_cycles:        85188239353         28084988      85188239353
(XEN)   unhalted_refs:         114816648740         56172384     114816648740

These entries have the following fields:

  • rip: the RIP from which the PTLCALL instruction was executed (inside the benchmark)
  • marker: the marker value passed to ptlcall_marker() (in this case ptlsim_marker_id had reached 42, i.e. the 43rd function compiled by the gcc benchmark, since ids start at 0)
  • seqid: event sequence number (not related to the marker)
  • tsc: x86 timestamp counter
  • retired_insns: total x86 instructions committed
  • unhalted_cycles: total cycle count while executing the benchmark (depends on current clock frequency)
  • unhalted_refs: total "reference" cycle count while executing the benchmark (as counted at the 2.4 GHz reference frequency)

The three columns are:

  1. Absolute value for cycles/instructions/etc. since booting the machine
  2. Difference between value at current marker and previous marker
  3. Difference between value at current marker and the start of the benchmark (i.e. marker id 0)

In this example, between markers 41 and 42, 26376441 (about 26.4 million) x86 instructions were committed, which took 56172384 reference cycles at 2.4 GHz (both values from column 2). Since the beginning of the benchmark (i.e. the first call to ptlcall_marker(0)), about 99 billion x86 instructions have retired, taking 114816648740 reference cycles (column 3).

Cycle counting on modern Intel chips is difficult because the clock frequency is constantly changing. This is why unhalted_refs is provided: it counts at a constant rate (2.4 GHz on this chip) regardless of the current core frequency. That's also why unhalted_cycles differs so much from unhalted_refs - at startup the core may be running at 1 GHz rather than 2.4 GHz.
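
For example, converting the unhalted_refs values from the log above into wall-clock time at the 2.4 GHz reference frequency:


56172384 ref cycles / 2.4e9 Hz ≈ 23.4 ms  (between markers 41 and 42)
114816648740 ref cycles / 2.4e9 Hz ≈ 47.8 s  (since the start of the benchmark)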

Actually measuring the true cycle count at 2.4 GHz requires turning off dynamic frequency scaling (currently enabled on our servers to save power) so the processor runs at the same core frequency throughout the benchmark, and the benchmark must run on its own core without interruptions or cache thrashing. This can be done by binding the Xen domain to a specific core and preventing dom0 from running on that core. It's complicated to do this accurately, so at this point we're not measuring cycles - only x86 instructions matter.
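
If you do eventually want to try this, the basic isolation step uses the standard xm vcpu-pin command; a sketch (core numbers are illustrative for the quad-core machine, and pinning alone does not disable frequency scaling):


xm vcpu-pin spec2006-gcc 0 3    # pin the benchmark domain's VCPU 0 to physical core 3
xm vcpu-pin 0 all 0-2           # keep all of dom0's VCPUs on cores 0-2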

Benchmark Selection

It isn't necessary to instrument every benchmark at this point unless you really want to. I'd suggest doing them in the following order of priority:

  1. gcc (compiler): already done by me - see diagnostic.c
  2. povray (FP: raytracer): already done by me - see messageoutput.cpp
  3. bzip2 (compression)
  4. mcf (graph optimization)
  5. xalancbmk (XML processing)
  6. perlbench (Perl interpreter)
  7. h264ref (video compression)
  8. hmmer (gene sequence search)
  9. omnetpp (simulation)
  10. milc (FP: quantum chromodynamics)
  11. sphinx3 (FP: speech recognition)
  12. lbm (FP: fluid dynamics)
  13. soplex (FP: linear optimization)
  14. wrf (FP Fortran: weather prediction)
  15. namd (FP: molecular dynamics)

This is a good mix (8 integer, 7 FP) for the work I'm currently doing - feel free to add more benchmarks relevant to your own research if you want.

You can get the benchmark descriptions from the SPEC web site, or look in /project/spec2006/Docs/index.html for a more detailed guide.

Project Goals

Phase 1: Instrumentation

The goals of the first phase of this benchmarking project are:

  • Identify the point within each of the 29 SPEC benchmarks where an initial checkpoint should be made via ptlcall_checkpoint(0). This should ideally be inside some status printing function, or at the top of the main loop, once all initialization is complete. Make sure you only make one checkpoint (set a flag variable after doing it the first time - a minimal guard is sketched after this list). It also helps to print some status output when the checkpoint is about to be made, since the benchmark doesn't normally report this.
  • Insert a call to ptlcall_marker(ptlsim_marker_id++) immediately after the call to ptlcall_checkpoint(0). Unlike making the initial checkpoint, the marker gets inserted on every pass through that block of code.
  • Consult the timing information printed by xm dmesg to help you choose where to put the initial checkpoint (e.g. after a certain number of markers have elapsed, to give the benchmark time to warm up). Ideally, markers should be spaced far enough apart that the difference between any two consecutive markers is at least 50 million committed x86 instructions - this lets us use the -stop-at-marker-hits PTLsim option to vary the total runtime from e.g. 50 million up to billions of x86 instructions by choosing a different number of markers to execute through.
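
As a minimal sketch of the one-shot checkpoint guard from the first item (the flag name and the warm-up threshold of 40 markers are illustrative; ptlcall_checkpoint(), ptlcall_marker() and ptlsim_marker_id come from ptlcalls.h as shown earlier):


static int checkpoint_made = 0;  // one-shot flag: the checkpoint is only made once

if (!checkpoint_made && (ptlsim_marker_id >= 40)) {  // 40 = illustrative warm-up point
  printf("[bench] making checkpoint at marker %d\n", (int)ptlsim_marker_id);
  ptlcall_checkpoint(0);
  checkpoint_made = 1;
}
ptlcall_marker(ptlsim_marker_id++);  // the marker fires on every pass, per the second item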

Phase 2: Checkpoint Preparation

In the second phase of the project, for each benchmark, you'll need to:

  • Create a disk image for each benchmark by copying the files in its run_base_ref_... directory, along with the benchmark binary, into the image. I already have a ramdisk image that you can copy the files into once phase 1 is done. You may need to change the domain's memory size to suit the benchmark's requirements.
  • Boot a new domain with this disk image and wait for it to halt itself for checkpointing, in accordance with your placement of ptlcall_checkpoint(0). Then you can run xm save to save the checkpoint to disk.
  • Restore from the checkpoint via xm restore, and check the Xen hypervisor log to verify that the correct spans of instructions are being executed. (A typical session is sketched below.)
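
Putting these steps together, a typical session might look like this (domain and checkpoint file names are illustrative; run xm save in a second window during the sleep=10 pause described under Making Checkpoints below):


xm create spec2006-gcc          # boot the benchmark domain
xm save spec2006-gcc gcc.chk    # waits until the domain checkpoints itself
xm restore gcc.chk              # later: resume from the saved checkpoint
sudo xm dmesg > xen-dmesg.log   # verify the marker spans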

I'll prepare the necessary files for phase 2 as you finish the instrumentation parts of phase 1.

Phase 3: PTLsim

In the final phase, you actually run the virtual machine checkpoints under PTLsim to gather your data. We'll do this last.

Disk Images and Checkpoints

Making Disk Images

All benchmark disk images and checkpoints are in /project/spec2006-checkpoints. Inside this directory, you'll find a directory called spec2006-vm.ramfs that contains all the files that go inside the disk image. To create a new spec2006-vm.img disk image, run this from /project/spec2006-checkpoints:


make ramfs

This will make a ramfs image that Linux loads at boot (it is not actually a disk image: it's unpacked directly into RAM and is fully writable). The first program it starts is /sbin/init, which is a shell script used to set up the benchmark environment.
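
The real /sbin/init lives inside spec2006-vm.ramfs; here is a hypothetical sketch of what such a script does (the actual script may differ - for instance, sim-after-resume handling is omitted):


#!/bin/sh
# Hypothetical init: mount pseudo-filesystems, then honor the sleep=
# and runscript= kernel options described below.
mount -t proc proc /proc
for arg in $(cat /proc/cmdline); do
  case "$arg" in
    sleep=*)     sleep "${arg#sleep=}" ;;
    runscript=*) exec "${arg#runscript=}" ;;
  esac
done
exec /bin/sh    # no runscript given: drop to a shell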

For each benchmark, we have a separate Xen domain config file, for example, spec2006-gcc. The config files are all identical except for the benchmark name, the amount of memory and the command line options, for instance:


extra = "xencons=ttyS console=ttyS0 init=/sbin/init sim-after-resume sleep=10 runscript=/spec2006/gcc/run"

This command line tells the domain to run the shell script or program /spec2006/gcc/run, which you must create. If the sim-at-start option is given on the kernel command line, it will immediately start simulation before running this script (but this is normally not used, since the benchmark is usually allowed to run for a while before a checkpoint is made).
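
The run script itself is just a short shell script that launches the benchmark; a hypothetical /spec2006/gcc/run (the binary name and flags are illustrative - use the command from your trimmed run.cmd):


#!/bin/sh
# Hypothetical run script: cd to the benchmark files copied into the
# image, then run the instrumented binary on the representative input.
cd /spec2006/gcc
./gcc_base.linux64.em64t.qxp expr.i -o expr.s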

Making Checkpoints

The sleep=10 kernel option tells the domain to pause for 10 seconds at startup. IMPORTANT: If you want to make a checkpoint later on, run e.g. xm save spec2006-gcc gcc.chk in another window during this 10-second pause. The xm save command will then wait until a checkpoint is made from within the domain, either by running:


echo checkpoint > /proc/xen/checkpoint

from within some shell script, or by calling:


#define PTLSIM_HYPERVISOR
#include "/project/ptlsim/ptlcalls.h"
...
ptlcall_checkpoint("any-checkpoint-name");

within a benchmark.

The reason we run xm save during the 10-second pause is to avoid a race condition in the suspend/resume code later on.

Multi-Processor Checkpoints

In domains with multiple VCPUs, you may want to run multiple benchmarks in parallel, one per VCPU. However, making a checkpoint is difficult, since if ptlcall_checkpoint() is used in one benchmark, other benchmarks may not even be running by that point, or may be running in some undesirable part of their code (like in the kernel itself). The solution is as follows.

This method works by creating a shared memory semaphore in a file called /dev/shm/semaphore (an in-memory file):

  1. The first 32-bit word of the file is initially set to the number of benchmarks you plan to run in parallel.
  2. Each benchmark runs until it reaches its desired simulation start point, then atomically decrements the semaphore word by 1.
  3. The benchmark that decrements the word to zero knows all other benchmarks must already be at the correct place, so it calls ptlcall_checkpoint() to create the checkpoint.
  4. After ptlcall_checkpoint() returns, that benchmark sets the semaphore to -1. All other benchmarks spin waiting for the value to become -1; once it does, they all wake up immediately.

In each benchmark, add some code like this immediately before where the checkpoint goes:


#define PTLSIM_HYPERVISOR
#include "/project/ptlsim/ptlcalls.h"
#include <stdio.h>
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
  int fd = open("/dev/shm/semaphore", O_RDWR);
  assert(fd >= 0);
  volatile int* data = (volatile int*)mmap(NULL, sizeof(int), PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
  assert(data != MAP_FAILED);  // mmap returns MAP_FAILED, not NULL, on failure

  printf("Waiting for trigger...\n");
  fflush(stdout);

  int n = __sync_sub_and_fetch(data, 1);
  if (n == 0) {
    //
    // We are the final benchmark to reach this trigger point: make a checkpoint,
    // then wake up all other benchmarks simultaneously:
    //
    printf("[benchname] Making checkpoint...\n");
    fflush(stdout);

    ptlcall_checkpoint("anyname");  // replace "anyname" with your checkpoint name; ignored outside a Xen guest
    *data = -1;
    printf("[benchname] Wake up other benchmarks...\n");
    fflush(stdout);
  } else {
    printf("[benchname] Waiting on %d other benchmarks\n", n);
    fflush(stdout);
  }

  // Spin until the checkpointing benchmark sets the semaphore to -1
  while ((*data) >= 0) {  __sync_synchronize(); }

  printf("[benchname] Starting up after checkpoint\n");
  fflush(stdout);
  return 0;
}

Compile all benchmarks with this code, add them to the disk image, and run make ramfs.

You'll also need to add a short program called writeint to the disk image:


// Compile with: gcc writeint.c -o writeint
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[]) {
  int k = atoi(argv[1]);             // integer value from the command line
  fwrite(&k, sizeof(k), 1, stdout);  // write it as raw binary to stdout
  return 0;
}

This just writes an integer in binary format to stdout (redirect it to a file, as shown below).

Edit the domain's config file to remove the runscript=xxx option. When you start the domain with run-domain domainname, it will boot and then drop into a shell.

First set up the semaphore like this (assuming you have 4 benchmarks):


writeint 4 > /dev/shm/semaphore

Then start each benchmark you want to run in parallel, optionally using taskset -c N benchmark-command >& benchmarkname.log & to pin it to VCPU N and run it in the background. Then just wait: once the last benchmark reaches the synchronization point, the kernel will print a message indicating it's making a checkpoint, and the xm save command will wake up and save the checkpoint image.
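
For example, launching four benchmarks in parallel (the run script paths are illustrative):


writeint 4 > /dev/shm/semaphore
taskset -c 0 /spec2006/gcc/run >& gcc.log &
taskset -c 1 /spec2006/bzip2/run >& bzip2.log &
taskset -c 2 /spec2006/mcf/run >& mcf.log &
taskset -c 3 /spec2006/povray/run >& povray.log &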

It's important that each benchmark runs for at least a few seconds before reaching its synchronization point - otherwise you'll still be typing the commands and some CPU time will be spent running the shell instead of the benchmarks. You may need to add "sleep 10" or something to each benchmark script.

Restoring the checkpoint

Use the restore-domain script (in /project/spec2006-checkpoints) to restart from the checkpoint. Run PTLsim on the domain, and each benchmark should start within a few thousand cycles of each other.
