PTLsim/X Architecture Details

The following sections provide insight into the internal architecture of full system PTLsim/X, and how a simulator is built to run on the bare hardware. It is not necessary to understand this information to work with or customize machine models in PTLsim, but it may still be fascinating to those working with the low level infrastructure components.

Basic PTLsim/X Components

PTLsim/X works in a conceptually similar manner to the normal userspace PTLsim: the simulator is ``injected'' into the target user process address space and effectively becomes the CPU executing the process. PTLsim/X extends this concept, but instead of a process, the core PTLsim code runs on the bare hardware and accesses the same physical memory pages owned by the guest domain. Similarly, each VCPU is ``collapsed'' into a context structure within PTLsim when simulation begins; each context is then copied back onto the corresponding physical CPU context(s) when native mode is entered.

PTLsim/X consists of three primary components: the modified Xen hypervisor, the PTLsim monitor process, and the PTLsim core.

Xen Modifications

The Xen hypervisor requires some modifications to work with PTLsim. Specifically, several new hypercalls and hypercall operations were added, including the contextswap hypercall and the VCPUOP_set_breakout_insn_action operation described in the sections below.

These changes are provided by ptlsim-xen-hypervisor.diff as described in the installation instructions.

PTLsim Monitor (PTLmon)

The PTLsim monitor (ptlmon.cpp) is a normal Linux program that runs in domain 0 with root privileges. After connecting to the specified domain, it increases the domain's memory reservation so as to reserve a range of physical pages for PTLsim (by default, 32 MB of physical memory). PTLmon maps all these reserved pages into its own address space and loads the real PTLsim core code into them. The PTLsim core is linked separately as ptlxen.bin, but is then linked as a binary object into the final self-contained ptlsim executable. PTLmon then builds page tables to map PTLsim space into the target domain and fills in various other fields in the boot info page, including a pointer to the Context structures (a modified version of Xen's vcpu_guest_context_t) holding the interrupted guest's state for each of its VCPUs. It prepares the initial registers and page tables that map PTLsim's code, then unmaps all PTLsim reserved pages except for the first few pages (as shown in Table 14.1). This is required because the monitor process cannot hold writable references to any of PTLsim's pages, or else PTLsim would not be able to pin those pages as page table pages. At this point, PTLmon atomically restarts the domain inside PTLsim using the new contextswap hypercall. The old context of the domain is thus available for PTLsim to use and update via simulation.

PTLmon also sets up two event channels: the hostcall channel and the upcall channel. PTLsim notifies the monitor process in domain 0 via the hostcall event channel whenever it needs to access the outside world. Specifically, PTLsim will fill in the bootpage.hostreq structure with the parameters to a standard Linux system call, and will place any larger buffers in the transfer page (see Table 14.1) visible to both PTLmon and PTLsim itself. PTLsim will then notify the hostcall channel's port. The domain 0 kernel will then forward this notification to PTLmon, which will do the system call on PTLsim's behalf (while PTLsim remains blocked in the synchronous_host_call() function). PTLmon will then notify the hostcall port in the opposite direction (waking up PTLsim) when the system call is complete. This is very similar to a remote procedure call, but over shared memory. It allows PTLsim to use standard system calls (e.g. for reading and writing log files) without modification, yet remains suitable for a bare-metal embedded environment.
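
To make the round trip concrete, the following sketch shows the general shape of a synchronous host call as described above. The structure layout and the helper names (notify_port, block_until_notified) are illustrative assumptions, not the exact declarations used by PTLsim.

  // Hedged sketch of the hostcall round trip (illustrative names, not the
  // exact declarations in the PTLsim sources).
  #include <cstdint>

  struct HostReq {              // lives in the shared boot info page
    uint64_t syscall_nr;        // Linux system call number to run in domain 0
    uint64_t arg[6];            // its arguments (pointers refer to the transfer page)
    int64_t  rc;                // return value filled in by PTLmon
    volatile int ready;         // set by PTLmon when the call has completed
  };

  // Assumed low-level primitives: signal and wait on a Xen event channel port.
  void notify_port(int port);
  void block_until_notified(int port);

  extern HostReq* hostreq;      // mapped into both PTLmon and PTLsim
  extern int hostcall_port;

  // PTLsim side: issue a system call by proxy and block until PTLmon answers.
  int64_t synchronous_host_call_sketch(uint64_t nr,
                                       uint64_t a0, uint64_t a1, uint64_t a2) {
    hostreq->syscall_nr = nr;
    hostreq->arg[0] = a0; hostreq->arg[1] = a1; hostreq->arg[2] = a2;
    hostreq->ready = 0;
    notify_port(hostcall_port);            // wake PTLmon in domain 0
    while (!hostreq->ready)
      block_until_notified(hostcall_port); // PTLmon notifies back when done
    return hostreq->rc;                    // result of the proxied system call
  }

Because the request structure and transfer page are plain shared memory, nothing on the PTLsim side needs a real C library or kernel: the entire call reduces to filling a structure and exchanging event channel notifications.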

PTLmon can also use the upcall channel to interrupt PTLsim, for instance to switch between native and simulation mode, trigger a snapshot, or request that PTLsim update its internal parameters. The PTLmon process sets up a socket in /tmp/ptlsim-domain-XXX and waits for requests on this socket. The user can then run the ptlsim command again, which will connect to this socket and tell the main monitor process for the domain to enqueue a text string (usually the command line parameters to ptlsim) and send an interrupt to PTLsim on the upcall channel. In response, PTLsim uses the ACCEPT_UPCALL hostcall to read the enqueued command line, then parses it and acts on any listed actions or parameter updates.

It should be noted that this design allows live configuration updates, as described in Section 13.6.
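
The domain 0 side of an upcall is ordinary Unix socket plumbing. The sketch below shows roughly what a second ptlsim invocation does to hand a new command line to PTLmon; the socket path format and the framing of the string are assumptions made for illustration.

  // Hedged sketch of the client side of an upcall request.
  #include <cstdio>
  #include <cstring>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  int send_command_to_ptlmon_sketch(int domain, const char* cmdline) {
    char path[108];
    snprintf(path, sizeof(path), "/tmp/ptlsim-domain-%d", domain);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    if (connect(fd, (sockaddr*)&addr, sizeof(addr)) < 0) { close(fd); return -1; }
    // PTLmon reads the string, enqueues it, and notifies PTLsim's upcall port.
    write(fd, cmdline, strlen(cmdline) + 1);
    close(fd);
    return 0;
  }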

PTLsim Core

PTLsim runs directly on the ``bare metal'' and has no access to traditional OS services except through the DMA and interrupt based host call requests described above. Execution begins in ptlsim_init() in ptlxen.cpp. PTLsim first sets up its internal memory management (page pool, slab allocator, extent allocator in mm.cpp as described in Section 7.3) using the initial page tables created by PTLmon in conjunction with the modified Xen hypervisor. PTLsim owns the virtual address space range starting at 0xffffff0000000000 (i.e. x86-64 PML4 slot 510, of $2^{39}$ bytes). This memory is mapped to the physical pages reserved for PTLsim. The layout is shown in Table 14.1 (assuming 32 MB is allocated for PTLsim):


Table 14.1: Memory Layout for PTLsim Space
(Columns: Page, Size, Description. A subset of these pages, including the boot info page and the transfer page, is shared by PTLmon in domain 0 and PTLsim in the target domain; the rest are visible only to PTLsim.)

Starting at virtual address 0xfffffe0000000000 (i.e. x86-64 PML4 slot 508, of $2^{40}$ bytes), space is reserved to map all physical memory pages (MFNs) belonging to the guest domain. This mapping is sparse, since only a subset of the physical pages are accessible by the guest. When PTLsim is first injected into a domain, this space starts out empty. As various parts of PTLsim attempt to access physical addresses, PTLsim's internal page fault handler will map physical pages into this space. Normally all pages are mapped as writable, however Xen may not allow writable mappings to some types of pinned pages (L1/L2/L3/L4 page table pages, GDT pages, etc.). Therefore, if the writable mapping fails, PTLsim tries to map the page as read only. PTLsim monitors memory management related hypercalls as they are simulated and remaps physical pages as read-only or writable if and when they are pinned or unpinned, respectively. When PTLsim switches back to native mode, it quickly unmaps all guest pages, since we cannot hold writable references to any pages the guest kernel may later attempt to pin as page table pages. This unmapping is done very quickly by simply clearing all present bits in the physical map's L2 page table page; the PTLsim page fault handler will re-establish the L2 entries as needed.
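
The following sketch illustrates the writable-then-read-only fallback described above. The constant PHYS_VIRT_BASE matches the base address given in the text, but the PTE encoding and the update_va_mapping wrapper are simplified stand-ins for the real code in mm.cpp and ptlxen.cpp.

  // Hedged sketch of mapping a guest MFN into PTLsim's physical-map region,
  // falling back to a read-only mapping when Xen refuses a writable one
  // (e.g. for pinned page table or GDT pages).
  #include <cstdint>

  static const uint64_t PHYS_VIRT_BASE = 0xfffffe0000000000ULL; // PML4 slot 508
  static const uint64_t PAGE_SHIFT = 12;

  // Assumed wrapper around Xen's update_va_mapping hypercall: install a PTE
  // for one virtual page; returns 0 on success.
  int update_va_mapping(uint64_t va, uint64_t pte, uint64_t flags);

  // Build a PTE referring to the given machine frame number.
  static inline uint64_t make_pte(uint64_t mfn, bool writable) {
    const uint64_t PRESENT = 1, RW = 2, ACCESSED = 0x20;
    return (mfn << PAGE_SHIFT) | PRESENT | ACCESSED | (writable ? RW : 0);
  }

  // Called from PTLsim's page fault handler when code touches an unmapped
  // physical address inside the physical-map region.
  void* map_guest_mfn(uint64_t mfn) {
    uint64_t va = PHYS_VIRT_BASE + (mfn << PAGE_SHIFT);
    if (update_va_mapping(va, make_pte(mfn, true), 0) != 0) {
      // Xen rejected the writable mapping (page is pinned as a page table,
      // GDT, etc.): retry read-only so at least loads can proceed.
      update_va_mapping(va, make_pte(mfn, false), 0);
    }
    return (void*)va;
  }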

Implementation Details


Page Translation

The Xen-x86 architecture always has paging enabled, so PTLsim uses a simulated TLB for all virtual-to-physical translations. Each TLB entry has x86 accessed and dirty bits; whenever these bits transition from 0 to 1, PTLsim must walk the page table tree and actually update the corresponding PTE's accessed and/or dirty bit. Since page table pages are mapped read-only, our modified update_mmu hypercall is used to do this. TLB misses are serviced in the normal x86 way: the page tables are walked starting from the MFN in CR3 until the page is resolved. This is done by the Context.virt_to_pte() method, which returns the L1 page table entry (PTE) providing the physical address and accumulated permissions (x86 has specific rules for deriving the effective writable/executable/supervisor permissions for each page). Internally, the page_table_walk() function actually follows the page table tree, but PTLsim maintains a small 16-entry direct mapped cache (like a TLB) to accelerate repeated translations (this is not related to any true TLB maintained by specific cores). The pte_to_ptl_virt() function then translates the PTE and original virtual address into a pointer PTLsim can actually access (inside PTLsim's mapping of the domain's physical memory pages). The software TLB is also flushed under the normal x86 conditions (MOV CR3, WBINVD, INVLPG, and Xen hypercalls like MMUEXT_NEW_BASE_PTR). Presently TLB support is in dcache.cpp; the features above are incorporated into this TLB. In addition, Context.copy_from_user() and Context.copy_to_user() functions are provided to walk the page tables and copy user data to or from a buffer inside PTLsim.
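
A skeleton of such a translation, with permission accumulation but without large page handling or the 16-entry translation cache, might look as follows; the types and the read_pte() helper are illustrative rather than the actual PTLsim declarations.

  // Hedged sketch of a 4-level x86-64 page table walk that accumulates
  // effective permissions, in the spirit of Context.virt_to_pte() and
  // page_table_walk().
  #include <cstdint>

  struct PTE {
    uint64_t raw;
    bool present()  const { return raw & 1; }
    bool writable() const { return raw & 2; }
    bool user()     const { return raw & 4; }
    bool nx()       const { return raw >> 63; }
    uint64_t mfn()  const { return (raw >> 12) & 0xfffffffffULL; }
  };

  // Assumed helper: read one 8-byte PTE from a (possibly read-only mapped)
  // page table page, given its MFN and the index within the page.
  PTE read_pte(uint64_t table_mfn, unsigned index);

  // Walk CR3 -> L4 -> L3 -> L2 -> L1, ANDing the writable/user bits and
  // ORing the no-execute bits, as the hardware does.
  bool virt_to_pte_sketch(uint64_t cr3_mfn, uint64_t virtaddr, PTE& out,
                          bool& writable, bool& user, bool& noexec) {
    uint64_t mfn = cr3_mfn;
    writable = true; user = true; noexec = false;
    for (int level = 3; level >= 0; level--) {
      unsigned index = (virtaddr >> (12 + 9*level)) & 0x1ff;
      PTE pte = read_pte(mfn, index);
      if (!pte.present()) return false;      // not mapped: page fault
      writable &= pte.writable();
      user     &= pte.user();
      noexec   |= pte.nx();
      mfn = pte.mfn();
      out = pte;                             // L1 entry on the final iteration
    }
    return true;
  }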

In 32-bit versions of Xen, the x86 protection ring mechanism is used to allow the guest kernel to run at ring 1 while guest userspace runs in ring 3; this allows the ``supervisor'' bit in PTEs to retain its traditional meaning. However, in its effort to clean up legacy ISA features, x86-64 has no concept of privilege rings (other than user/supervisor) or segmentation. This means the supervisor bit in PTEs is never set (only Xen internal pages not accessible to guest domains have this bit set). Instead, Xen puts the kernel in a separate address space from user mode; the top-level L4 page table page for kernel mode points to both kernel-only and user pages. Fortunately, Xen uses TLB global bits and other x86-64 features to avoid much of the context switch overhead from this approach. PTLsim does not have to worry about this detail during virtual-to-PTE translations: it just follows the currently active page table based on physical addresses only.

Exceptions

Under Xen, the set_trap_table() hypercall is used to specify an array of pointers to exception handlers; this is equivalent to the x86 LIDT (load interrupt descriptor table) instruction. Whenever we switch from native mode to simulation mode, PTLmon copies this array back into the Context.idt[] array. Whenever PTLsim detects an exception during simulation, it accesses Context.idt[vector_id] to determine where the pipeline should be restarted (CS:RIP). In the case of page faults, the simulated CR2 is loaded with the faulting virtual address. It then constructs a stack frame equivalent to Xen's structure (i.e. iret_context) at the stack segment and pointer stored in Context.kernel_sp (previously set by the stack_switch() hypercall, which replaces the legacy x86 TSS structure update). Finally, PTLsim propagates the page fault to the selected guest handler by redirecting the pipeline. This is essentially the same work performed within Xen by the create_bounce_frame() function, do_page_fault() (or its equivalent) and propagate_page_fault() (or its equivalent); all the same boundary conditions must be handled.
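
A simplified sketch of this bounce-frame construction is shown below. The Context fields and the frame layout are approximations of the real structures, intended only to show the control flow.

  // Hedged sketch of propagating a simulated exception to the guest kernel,
  // in the spirit of Context.propagate_x86_exception().
  #include <cstdint>
  #include <cstddef>

  struct TrapEntry { uint16_t cs; uint64_t rip; };   // from set_trap_table()

  struct GuestFrame {          // roughly the shape of Xen's iret_context
    uint64_t error_code;
    uint64_t rip, cs, rflags, rsp, ss;
  };

  struct Context {
    TrapEntry idt[256];        // copied back from Xen on mode switch
    uint64_t kernel_sp;        // set by the stack_switch() hypercall
    uint64_t cr2;              // simulated faulting address register
    uint64_t rip, rsp, cs, ss, rflags;
    // Assumed helper: copy bytes into guest virtual memory via the page walk.
    void copy_to_user(uint64_t guest_va, const void* src, size_t bytes);
  };

  void propagate_exception_sketch(Context& ctx, int vector,
                                  uint64_t error_code, uint64_t faultaddr) {
    if (vector == 14) ctx.cr2 = faultaddr;   // page fault: load simulated CR2

    GuestFrame frame = { error_code, ctx.rip, ctx.cs, ctx.rflags, ctx.rsp, ctx.ss };
    uint64_t newsp = ctx.kernel_sp - sizeof(frame);
    ctx.copy_to_user(newsp, &frame, sizeof(frame));  // build the bounce frame

    // Redirect the simulated pipeline to the registered handler.
    ctx.rsp = newsp;
    ctx.cs  = ctx.idt[vector].cs;
    ctx.rip = ctx.idt[vector].rip;
  }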

System Calls and Hypercalls

On x86-64, the syscall instruction has a different meaning depending on the context in which it is executed. If executed from userspace, syscall arranges for execution to proceed directly to the guest kernel's system call handler (whose entry point is held in Context.syscall_rip). This is done by the assist_syscall() microcode handler. A similar process occurs when a 32-bit application uses ``int 0x80'' to make system calls, but in this case, Context.propagate_x86_exception() is used to redirect execution to the trap handler registered for that virtual software interrupt.
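
A minimal sketch of the userspace case, using illustrative field names, follows; it emulates the architectural side effects of syscall (return RIP in RCX, RFLAGS in R11) and redirects the simulated pipeline. Whether the real assist performs exactly these steps is not shown here; this is only the general shape.

  // Hedged sketch in the spirit of assist_syscall().
  #include <cstdint>

  struct Context {
    uint64_t rip, rcx, r11, rflags;
    uint64_t syscall_rip;        // guest kernel entry point registered with Xen
    bool kernel_mode;
  };

  void assist_syscall_sketch(Context& ctx, uint64_t rip_after_syscall) {
    // x86-64 SYSCALL semantics: RCX <- return RIP, R11 <- RFLAGS,
    // then control transfers to the kernel entry point.
    ctx.rcx = rip_after_syscall;
    ctx.r11 = ctx.rflags;
    ctx.rip = ctx.syscall_rip;
    ctx.kernel_mode = true;      // the guest's notion of kernel mode
  }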

If executed from kernel space, the syscall instruction is interpreted as a hypercall into Xen itself. PTLsim emulates all Xen hypercalls. In many simple cases, PTLsim handles the hypercall all by itself, for instance by simply updating its internal tables. In other cases, the hypercall can safely be passed down to Xen without corrupting PTLsim's internal state. We must be very careful as to which hypercalls are passed through: for instance, before updating the page table base, we must ensure the new page table still maps PTLsim and the physical address space before we allow Xen to update the hardware page table base. These cases are all documented in the comments of handle_xen_hypercall().
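
The emulate-versus-pass-through decision can be pictured as a dispatch over the hypercall number, sketched below. The enum is an illustrative stand-in for Xen's __HYPERVISOR_* constants, and the grouping shown is a simplification of what handle_xen_hypercall() actually does.

  // Hedged sketch of hypercall classification.
  #include <cstdint>

  enum XenCallClass { SET_TRAP_TABLE, STACK_SWITCH, MMU_UPDATE, SCHED_OP, OTHER };

  struct Context;  // per-VCPU state maintained by PTLsim

  // Assumed helpers: either update PTLsim's internal tables, or make the
  // real hypercall to Xen on the guest's behalf.
  int64_t emulate_locally(Context& ctx, XenCallClass call, uint64_t* args);
  int64_t pass_through_to_xen(XenCallClass call, uint64_t* args);

  int64_t handle_hypercall_sketch(Context& ctx, XenCallClass call, uint64_t* args) {
    switch (call) {
    case SET_TRAP_TABLE:
    case STACK_SWITCH:
      // Purely architectural state: record it in the Context (e.g. idt[]
      // or kernel_sp) and defer any real hypercall until native mode.
      return emulate_locally(ctx, call, args);
    case MMU_UPDATE:
      // Inspected first: a new page table base is forwarded only after
      // verifying it still maps PTLsim and the physical-map region.
      emulate_locally(ctx, call, args);        // update PTLsim's own view
      return pass_through_to_xen(call, args);  // then let Xen apply it
    default:
      // Calls that cannot corrupt PTLsim's state go straight to Xen.
      return pass_through_to_xen(call, args);
    }
  }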

Note that the definition of ``user mode'' and ``kernel mode'' is maintained by Xen itself: from the CPU's viewpoint, both modes are technically userspace and run in ring 3.

An interesting issue arises when PTLsim passes hypercalls through to Xen: some buffers provided by the guest kernel may reside in virtual memory not mapped by PTLsim. Normally PTLsim avoids this problem by copying any guest buffers into its own address space using Context.copy_from_user(), then copying the results back after the hypercall. However, to avoid future complexity, PTLsim currently switches its own page tables every time the guest requests a page table switch, such that Xen can see all guest kernel virtual memory as well as PTLsim itself. Obviously this means PTLsim injects its two top-level page table slots into every guest top level page table. For multi-processor simulation, PTLsim needs to swap in the target VCPU's page table base whenever it forwards a hypercall that depends on virtual addresses.

Event Channels

Xen delivers outside events, virtual interrupts, IPIs, etc. to the domain just like normal, except they are redirected to a special PTLsim upcall handler stub (in lowlevel-64bit-xen.S). The handler checks which events are pending, and if any events (other than the PTLsim hostcall and upcall events) are pending, it sets a flag so the guest's event handler is invoked the next time through the main loop. This process is equivalent to exception handling in terms of the stack frame setup and call/return sequence: the simulated pipeline is simply redirected to the handler address. It should be noted that the PTLsim handler does not set or clear any mask bits in the shared info page, since it's the (emulated) guest OS code that should actually be doing this, not PTLsim. The only exception is when the event in question is on the hostcall port or the upcall port; then PTLsim handles the event itself and never notifies the guest.
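
The port classification performed by the upcall stub amounts to a scan of the pending bitmap, as sketched below; the structure layout, array size, and helper names are illustrative rather than the exact code in lowlevel-64bit-xen.S.

  // Hedged sketch of classifying pending event channel ports: PTLsim's own
  // ports are consumed internally, everything else is left for the
  // (simulated) guest's event callback.
  #include <cstdint>

  struct shared_info_like {
    uint64_t evtchn_pending[8];   // one bit per port (512 ports shown here)
  };

  extern int hostcall_port, upcall_port;
  extern shared_info_like* shinfo;
  extern bool guest_event_flag;   // checked in the main simulation loop

  void scan_pending_events_sketch() {
    for (int word = 0; word < 8; word++) {
      uint64_t pending = shinfo->evtchn_pending[word];
      while (pending) {
        int port = word * 64 + __builtin_ctzll(pending);
        pending &= pending - 1;
        if (port == hostcall_port || port == upcall_port)
          continue;               // PTLsim's own traffic: never shown to the guest
        // Note: mask bits are NOT touched here; the simulated guest OS code
        // is responsible for masking and unmasking its own ports.
        guest_event_flag = true;  // deliver a simulated upcall later
      }
    }
  }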

Privileged Instruction Emulation

Xen lets the guest kernel execute various privileged instructions, which it then traps and emulates with internal hypercalls. These are the same as in Xen's arch/x86/traps.c: CLTS (FPU task switches), MOV from CR0-CR4 (easy to emulate), MOV to and from DR0-DR7 (get or set debug registers), RDMSR and WRMSR (mainly to set segment bases). PTLsim decodes and executes these instructions on its own, just like any other x86 instruction.


PTLcalls

PTLsim defines the special x86 opcode 0x0f37 as a breakout opcode. It is undefined in the normal x86 instruction set, but when executed by any code running under PTLsim, it can be used to enqueue commands for PTLsim to execute.

The ptlctl program uses this facility to switch from native mode to simulation mode as follows. Whenever PTLsim is about to switch back to native mode, it uses the VCPUOP_set_breakout_insn_action operation to tell the hypervisor which opcode bytes to intercept. When the hypervisor sees an invalid instruction matching 0x0f37, it freezes the domain and sends an event channel notification to domain 0. This notification is received by PTLmon, which then uses the contextswap hypercall to switch back into PTLsim inside the domain. PTLsim then processes whatever command caused the switch back into simulation mode.

While executing in simulation mode, this mechanism is unnecessary: since PTLsim is in complete control of the execution of each x86 instruction, it simply defines microcode to handle 0x0f37 instead of triggering an invalid opcode exception. This microcode branches into PTLsim, which uses the PTLSIM_HOST_INJECT_UPCALL hostcall to add the command(s) to the command queue. The queue is maintained inside PTLmon so as to ensure synchronization between commands coming from the host and commands arriving from within the domain via PTLcalls. The queue is typically flushed before new commands are added in this manner; otherwise, it would be impossible to get immediate results using ptlctl.

All PTL calls are defined in ptlcalls.h, which simply collects each call's arguments and executes opcode 0x0f37, passing its arguments in registers much like a normal x86 syscall instruction.
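
A wrapper of this kind has roughly the following shape; the register assignments and the command constant in the trailing comment are assumptions made for illustration, not the actual definitions in ptlcalls.h.

  // Hedged sketch of a PTLcall wrapper: command number in rax, first
  // argument in rdi, result back in rax.
  #include <cstdint>

  static inline uint64_t ptlcall_sketch(uint64_t command, uint64_t arg1) {
    uint64_t rc;
    asm volatile(
      ".byte 0x0f, 0x37"           // the PTLsim breakout opcode
      : "=a" (rc)                   // result returned in rax
      : "a" (command), "D" (arg1)   // command in rax, first argument in rdi
      : "memory");
    return rc;
  }

  // Example usage with a hypothetical command constant:
  //   ptlcall_sketch(PTLCALL_ENQUEUE_COMMAND, (uint64_t)"-run");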

The ptlcall opcode 0x0f37 can be executed from both user mode and kernel mode, since it may be desirable to switch simulation options from a userspace program. This would be impossible if wrmsr (the traditional method) were used to control PTLsim operations, since wrmsr is a privileged instruction and can only be executed from kernel mode.

Event Trace Mode

In Section 13.9, we discussed the philosophical question of how to accurately model the timing of external events when cycle accurate simulation runs thousands of times slower than the outside world expects. To solve this problem, PTLsim/X offers event trace mode.

First, the user saves a checkpoint of the target domain, then instructs PTLsim to enter event record mode. The domain is then used interactively in native mode at full speed, for instance by starting benchmarks and waiting for their completion. In the background, PTLsim taps Xen's trace buffers to write any events delivered to the domain into an event trace file. ``Events'' here means any time-dependent outside stimulus delivered to the domain, such as interrupts (i.e. Xen event channel notifications) and DMA traffic (i.e. the full contents of any grant pages from network or disk I/O transferred into the domain). Each event is timestamped with the relative cycle number (timestamp counter) at which it was delivered, rather than the wall clock time. When the benchmarks are done, the user ends record mode.

The user then restores the domain from the checkpoint and re-injects PTLsim, but this time PTLsim reads the event trace file, rather than responding to any outside events Xen may deliver to the domain while in simulation mode. Whenever the timestamp of the event at the head of the trace file matches the current simulation cycle, that event is injected into the domain. PTLsim does this by setting the appropriate pending bits in the shared info page, and then simulates an upcall to the domain's shared info handler (i.e. by restarting the simulated pipeline at that RIP). Since the event channels used by PTLsim and those of the target domain may interfere, PTLsim maintains a shadow shared info page that's updated instead; whenever the simulated load/store pipeline accesses the real shared info page's physical address, the shadow page is used in its place. In addition, the wall clock time fields in the shadow shared info page are regularly updated by dividing the simulated cycle number by the native CPU clock frequency active during the record mode (since the guest OS will have recorded this internally in many places).
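
The replay loop can be sketched as follows; the trace record format, the shadow page layout, and the helper functions are illustrative assumptions.

  // Hedged sketch of event trace replay: when the cycle stamp at the head
  // of the trace reaches the current simulated cycle, the event's pending
  // bit is set in the shadow shared info page and an upcall is simulated.
  #include <cstdint>

  struct TraceEvent {
    uint64_t cycle;       // relative cycle (timestamp counter) at record time
    int      port;        // event channel port to mark pending
    // ... recorded DMA payloads for grant/console pages would follow ...
  };

  struct shadow_shared_info_like {
    uint64_t evtchn_pending[8];
    uint64_t wc_sec, wc_nsec;     // wall clock fields, derived from cycles
  };

  extern shadow_shared_info_like shadow_shinfo;
  extern double native_hz;        // CPU frequency active during record mode

  bool peek_next_event(TraceEvent& ev);   // assumed trace-file reader
  void pop_event();
  void simulate_guest_upcall();           // redirect pipeline to the handler

  void replay_events_sketch(uint64_t current_cycle) {
    TraceEvent ev;
    while (peek_next_event(ev) && ev.cycle <= current_cycle) {
      shadow_shinfo.evtchn_pending[ev.port / 64] |= 1ULL << (ev.port % 64);
      pop_event();
      simulate_guest_upcall();
    }
    // Keep the guest's notion of wall clock time consistent with the
    // simulated cycle count rather than with real time.
    double seconds = current_cycle / native_hz;
    shadow_shinfo.wc_sec  = (uint64_t)seconds;
    shadow_shinfo.wc_nsec = (uint64_t)((seconds - shadow_shinfo.wc_sec) * 1e9);
  }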

This scheme does require some extra software support, since we need to be able to identify which pages the outside source has overwritten with incoming data (i.e. as in a virtual DMA). The console I/O page is actually a guest page that domain 0 maps in xenconsoled; this is easy to identify and capture. The network and block device pages are typically grant pages; the domain 0 Linux device drivers must be modified to let PTLsim know what pages will be overwritten by outside sources.

Multiprocessor Support

PTLsim/X is designed from the ground up to support multiple VCPUs per domain. The contextof(vcpuid) function returns the Context structure allocated for each VCPU; this structure is passed to all functions and assists that deal with the domain. It is the responsibility of each core (e.g. sequential core, out of order core, user-designed cores, etc.) to update the appropriate context structure according to its own design.

VCPUs may choose to block by executing an appropriate hypercall (sched_block, sched_yield, etc.), typically suspending execution until an event arrives. PTLsim cores can simulate this by checking the Context.running field; if it is zero, the corresponding VCPU is blocked and no instructions should be processed until the running flag becomes set, for instance when an event arrives. In realtime mode (where Xen relays real events like timer interrupts back to the simulated CPU), events and upcalls may be delivered to VCPUs other than the first VCPU (which runs PTLsim); in this case, PTLsim must check the pending bitmap in the shared info page and simulate upcalls within the appropriate VCPU context (i.e. whichever VCPU context has its upcall_pending bit set).
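
A core's main loop can honor this convention with a check like the one sketched below; the field and helper names are illustrative.

  // Hedged sketch of per-VCPU blocking in a multi-VCPU simulation loop: a
  // blocked VCPU (running == false) fetches no instructions until an event
  // marks it runnable again.
  #include <cstdint>
  #include <vector>

  struct Context {
    bool running;            // cleared by sched_block, set when an event arrives
    bool upcall_pending;     // mirrors the shared info page's per-VCPU bit
    // ... architectural state ...
  };

  extern std::vector<Context> vcpu_contexts;

  void simulate_one_cycle(Context& ctx);        // core-model specific
  void deliver_simulated_upcall(Context& ctx);  // redirect that VCPU's pipeline

  void multi_vcpu_cycle_sketch() {
    for (Context& ctx : vcpu_contexts) {
      if (ctx.upcall_pending) {
        ctx.running = true;                 // a pending event unblocks the VCPU
        deliver_simulated_upcall(ctx);
      }
      if (!ctx.running)
        continue;                           // blocked: process no instructions
      simulate_one_cycle(ctx);
    }
  }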

Some Xen hypercalls must only be executed on the VCPU to which the hypercall applies. In cases where PTLsim cannot emulate the hypercall on its own internal state (and defer the actual hypercall until switching back to native mode), the Xen hypervisor has been modified to support an explicit vcpu parameter, allowing the first VCPU (which always runs PTLsim itself) to execute the required action on behalf of other VCPUs.

For simultaneous multithreading support, PTLsim is designed to run the simulation entirely on the first VCPU, while putting the other VCPUs in an idle loop. This is required because there's no easy way to parallelize an SMT core model across multiple simulation threads. In theory, a multi-core simulator could in fact be parallelized in this way, but it would be very difficult to reproduce cycle accurate behavior and debug deadlocks with asynchronous simulations running in different threads. For these reasons, currently PTLsim itself is single threaded in simulation mode, even though it simulates multiple virtual cores or threads.

Cache coherence is the responsibility of each core model. By default, PTLsim uses the ``instant visibility'' model, in which all VCPUs can have read/write copies of cache lines and all stores appear on all other VCPUs the instant they commit. More complex MOESI-compliant policies can be implemented on top of this basic framework, by stalling simulated VCPUs until cache lines travel across an interconnect network.

