
Full system simulation and virtualization have been around since the dawn of computing. Typically, virtual machine software is used to run guest operating systems on a physical host system, such that the guest believes it is running directly on the bare hardware. Modern full system simulators in the x86 world can be roughly divided into two groups (systems for other instruction sets are not considered here).
Hypervisors execute most unprivileged instructions on the native CPU at full speed, but trap privileged instructions used by the operating system kernel; these are emulated in hypervisor software so as to maintain isolation between virtual machines and to make the virtual CPU nearly indistinguishable from the real one. In some cases (particularly on x86), additional software techniques are needed to fully hide the hypervisor from the guest OS.
- Xen [6,7,5,8,9,2] represents the current state of the art in this field; it will be described in great detail later on.
- VMware [12] is a well known commercial product that allows unmodified x86 operating systems to run inside a virtual machine. Because the x86 instruction set is not fully virtualizable, VMware must employ x86-to-x86 binary translation techniques on kernel code (but not user mode code) to make the virtual CPU indistinguishable from the real CPU. These translations are typically cached in a hidden portion of the guest address space to improve performance compared to simply interpreting sensitive x86 instructions. While this approach is sophisticated and effective, it exacts a heavy performance penalty on I/O intensive workloads [9]. Interestingly, the latest microprocessors from Intel and AMD include hardware features (Intel VT [15], AMD SVM [16]) to eliminate the binary translation and patching overhead. Xen fully supports these technologies to allow running Windows and other OSes at full speed, while VMware has yet to include full support.
VMware comes in two flavors. ESX is a true hypervisor that boots on the bare hardware underneath the first guest OS. GSX and Workstation use a userspace frontend process containing all virtual device drivers and the binary translator, while the vmmon kernel module (open source in the Linux version) handles memory virtualization and context switching tasks similar to Xen.
- Several other products, including Virtual PC and Parallels, provide features similar to VMware using similar technology.
- KVM (Kernel Virtual Machine) is a new hypervisor infrastructure built into all Linux kernels after 2.6.19. It depends on the hardware virtualization extensions (Intel VT and AMD SVM) built into modern x86 chips, whereas Xen and VMware also support running on older processors without special hardware support. KVM is an attractive foundation for future virtual machine development since it's built into Linux (so it requires far less setup work than Xen or VMware) and provides excellent performance.
Unlike hypervisors, simulators execute guest x86 instructions entirely in software (by interpretation or binary translation), without running any guest instructions on the native CPU.
- Bochs [11] is the most well known open source x86 simulator; it is considered to be nearly a register transfer level (RTL) description of every x86 behavior from legacy 16-bit features up through modern x86-64 instructions. Bochs is very useful for the functional validation of real x86 microprocessors, but it is very slow (around 5-10 MHz equivalent) and is not useful for implementing cycle accurate models of modern uop-based out-of-order x86 processors (for instance, it does not model caches, memory latency, functional units and so on).
- QEMU [10] is similar in purpose to VMware, but unlike VMware, it supports multiple host and guest CPU architectures (PowerPC, SPARC, ARM, etc.). QEMU uses binary translation technology similar to VMware's to hide the emulator's presence from the guest kernel. However, due to its cross-platform design, both kernel and user code are passed through x86-to-x86 binary translation (even on x86 hosts) and stored in a translation cache. Interestingly, Xen uses a substantial amount of QEMU code to model common hardware devices when running unmodified operating systems like Windows, but Xen still uses its own hardware-assisted technology to actually achieve virtualization. QEMU also offers a proprietary accelerator module that, as in VMware and Xen, runs user mode code natively on the CPU to reduce the performance penalty; with this module it falls into the hypervisor category as well.
- Simics [13] is a commercial simulation suite for modeling both the functional aspects of various x86 processors (including vendor specific extensions) and user-designed plug-in models of real hardware devices. It is used extensively in industry for modeling new hardware and drivers, as well as for firmware level debugging. Like QEMU, Simics uses x86-to-x86 binary translation to instrument code at a very low level while achieving good performance (though noticeably slower than a hypervisor). Unlike QEMU, Simics is fully extensible and supports a huge range of real hardware models, but it is not possible to add cycle accurate simulation features below the x86 instruction level, making it less useful to microarchitects (both for technical reasons and because it is a closed source product).
- SimNow [14] is an AMD simulation tool used during the design and validation of AMD's x86-64 hardware. Like Simics, it is a functional simulator only, but it models a variety of AMD-built hardware devices. SimNow uses x86-to-x86 binary translation technology similar to Simics and QEMU to achieve good performance. Because SimNow does not provide cycle accurate timing data, AMD uses its own TSIM trace-based simulator, derived from the K8 RTL, to do actual validation and timing studies. SimNow is available for free to the public, albeit as closed source.
All of these tools share one common disadvantage: they are unable to model execution below the granularity of individual x86 instructions, making them unsuitable for microarchitects. PTLsim/X seeks to fill this void by allowing extremely detailed uop-level cycle accurate simulation of x86 and x86-64 microprocessor cores, while simultaneously delivering the performance benefits of native-mode hypervisors like Xen and binary translation based hypervisors like VMware and QEMU, along with the detailed hardware modeling capabilities of Bochs and Simics.
Xen [7,6,5,8,9,2] is an open source x86 virtual machine monitor, also known as a hypervisor. Each virtual machine is called a ``domain''; domain 0 is privileged and accesses all hardware devices using the standard drivers, and it can also create and directly manipulate other domains. Guest domains typically do not have direct hardware access; instead, they relay requests back to domain 0 using Xen-specific virtual device drivers. Each guest can have up to 32 VCPUs (virtual CPUs). Xen itself is loaded into a reserved region of physical memory before loading a Linux kernel as domain 0; other operating systems can run in guest domains. Xen is famous for having essentially zero overhead due to its unique and well planned design; it's possible to run a normal workstation or server under Xen with full native performance.
Under Xen's ``paravirtualized'' mode, the guest OS runs on an architecture nearly identical to x86 or x86-64, but a few small changes (critical to preserving native performance levels) must be made to low-level kernel code, similar in scope to adding support for a new type of system chipset or CPU manufacturer (e.g. instead of an AMD x86-64 on an nVidia chipset, the kernel would need to support a Xen-extended x86-64 CPU on a Xen virtual ``chipset''). These changes mostly concern page tables and the interrupt controller:
- Paging is always enabled, and any physical pages (identified by ``machine frame numbers'', or MFNs) used to form a page table must be marked read-only (``pinned'') everywhere. Since the processor can only access a physical page if it's referenced by some page table, Xen can guarantee memory isolation between domains by forcing the guest kernel to replace any writes to page table pages with special mmu_update() hypercalls (i.e. system calls into Xen itself). Xen makes sure each update points to a page owned by the domain before updating the page table. This approach incurs essentially no performance loss, since the guest kernel can read its own page tables without any further indirection (i.e. the page tables point to the actual physical addresses), and updates can be batched into a single hypercall (e.g. validating a new page table after a fork() requires only one hypercall); the first sketch after this list shows such a batched update.
- Xen also supports pseudo-physical pages, which are consecutively numbered from 0 to some maximum (e.g. 65536 pages for a 256 MB domain). This is required because most kernels (including Linux and Windows) do not handle ``sparse'' (discontiguous) physical memory ranges very well (remember that every domain can still address every physical page, including those of other domains - it just can't access all of them). Xen provides pseudo-to-machine (P2M) and machine-to-pseudo (M2P) tables to do this mapping; the second sketch after this list illustrates these lookups. However, the real page tables still reference machine addresses and are fully visible to the guest kernel; the pseudo-physical layer is just a convenience feature.
- Xen can save an entire domain to disk, then restore it later starting at that checkpoint. Since Xen tracks every read-only page that's part of some page table, it can restore domains even if the original physical pages are now used by something else: it automatically remaps all MFNs in every page table page it knows about (but the guest kernel must never store machine page numbers outside of page table pages - it's the same concept as in garbage collection, where pointers must only reside in the obvious places).
- Xen can migrate running domains between machines by tracking which physical pages become dirty as the domain executes. Xen uses shadow page tables for this: it makes copy-on-write duplicates of the domain's page tables, and presents these internal tables to the CPU, while the guest kernel still thinks it's using the original page tables. Once the migration is complete, the shadow page tables are merged back into the real page tables (as with a save and restore) and the domain continues as usual.
- The memory allocation of each domain is elastic: the domain can give any free pages back to Xen via the ``balloon'' mechanism; these pages can then be re-assigned to other domains that need more memory (up to a per-domain limit).
- Domains can share some of their pages with other domains using the grant mechanism. This is used for zero-copy network and disk I/O between domain 0 and guest domains.
- Interrupts are delivered using an event channel mechanism, which is functionally equivalent to the IO-APIC hardware on the bare CPU (essentially it's a ``Xen APIC'' instead of the Intel and AMD models already supported by the guest kernel). Xen sets up a shared info page containing bit vectors for masked and pending interrupts (just like an APIC's memory mapped registers), and lets the guest kernel register an event handler function. Xen then does an upcall to this function whenever a virtual interrupt arrives; the guest kernel manipulates the mask and pending bits to ensure race-free notifications (the last sketch after this list shows a simplified upcall handler). Xen automatically maps physical IRQs on the APIC to event channels in domain 0, plus it adds its own virtual interrupts (for the usual timer and a Xen-specific notification port; use cat /proc/interrupts on a Linux system under Xen to see this). When the guest domain has multiple VCPUs, interprocessor interrupts (IPIs) are sent through the Xen event channel mechanism in a manner identical to hardware IPIs.
- Xen is unique in that PCI devices can be assigned to any domain, so for instance each guest domain could have its own dedicated PCI network card and disk controller - there's no need to relay requests back to domain 0 in this configuration, although it only works with hardware that supports IOMMU virtualization (otherwise it's a security risk, since DMA can be used to bypass Xen's page table protections).
- Xen provides the guest with additional timers, so it can be aware of both ``wall clock'' time as well as execution time (since there may be gaps in the latter as other domains use the CPU); this lets it provide a smooth interactive experience in a way systems like VMware cannot. The timers are delivered as virtual interrupt events.
- All other features of the paravirtualized architecture perfectly match x86. The guest kernel can still use most x86 privileged instructions, such as rdmsr, wrmsr, and control register updates (which Xen transparently intercepts and validates), and in domain 0, it can access I/O ports, memory mapped I/O, the normal x86 segmentation (GDT and LDT) and interrupt mechanisms (IDT), etc. This makes it possible to run a normal Linux distribution, with totally unmodified drivers and software, at full native speed (we do just this on all our development workstations and servers). Benchmarks [9] have shown Xen to incur only a ~2-3% performance decrease relative to a traditional Linux kernel, whereas VMware and similar solutions yield a 20-70% decrease under heavy I/O.
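To make the batched page table update mechanism concrete, the following sketch shows roughly how a paravirtualized guest might fold several PTE writes into one mmu_update() hypercall. The struct mmu_update layout and the MMU_NORMAL_PT_UPDATE and DOMID_SELF constants follow Xen's public interface headers, but the helper update_ptes() and the fixed batch size are illustrative assumptions rather than actual Xen or Linux code.

/* Minimal sketch of a batched mmu_update hypercall, as a paravirtualized
 * guest kernel might issue when rewriting several PTEs at once (e.g. while
 * validating a new page table after fork()).  Struct and constant names
 * follow Xen's public headers; the hypercall stub itself is normally
 * provided by the guest kernel's architecture-specific Xen glue code. */

#include <stdint.h>

typedef uint16_t domid_t;
#define DOMID_SELF           ((domid_t)0x7FF0U)  /* "the calling domain" */
#define MMU_NORMAL_PT_UPDATE 0                   /* checked write of a PTE */

struct mmu_update {
    uint64_t ptr;  /* machine address of the PTE to write; low 2 bits = command */
    uint64_t val;  /* new PTE contents (target MFN plus permission flags) */
};

/* Normally an inline assembly stub in the guest kernel; declared here only
 * so the sketch is self-contained. */
extern int HYPERVISOR_mmu_update(struct mmu_update *req, int count,
                                 int *success_count, domid_t domid);

/* Illustrative helper (not a real Xen or Linux function): apply 'count'
 * PTE writes with a single hypercall instead of trapping on each store.
 * Xen validates that every target page belongs to this domain before
 * applying any update, which is what preserves inter-domain isolation. */
static int update_ptes(const uint64_t *pte_machine_addr,
                       const uint64_t *new_pte, int count)
{
    struct mmu_update req[16];  /* assume count <= 16 for this sketch */
    int done = 0;

    for (int i = 0; i < count; i++) {
        req[i].ptr = pte_machine_addr[i] | MMU_NORMAL_PT_UPDATE;
        req[i].val = new_pte[i];
    }
    return HYPERVISOR_mmu_update(req, count, &done, DOMID_SELF);
}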
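The pseudo-physical layer amounts to two table lookups in the guest. The second sketch below is modeled on the pfn_to_mfn()/mfn_to_pfn() helpers found in the Linux Xen port, with table setup and bounds checks omitted; the extern declarations stand in for mappings that Xen and the guest establish at boot.

/* Simplified sketch of pseudo-physical <-> machine frame translation in a
 * paravirtualized guest.  Array names are modeled on the Linux Xen port,
 * but initialization and bounds handling are omitted for brevity. */

#include <stdint.h>

/* M2P: read-only table supplied by Xen, indexed by machine frame number
 * (MFN), giving the pseudo-physical frame number (PFN).  In a real guest
 * Xen maps it at a fixed virtual address. */
extern const unsigned long *machine_to_phys_mapping;

/* P2M: maintained by the guest itself, indexed by PFN, giving the MFN that
 * Xen actually assigned to this domain for that pseudo-physical page. */
extern unsigned long *phys_to_machine_mapping;

static inline unsigned long pfn_to_mfn(unsigned long pfn)
{
    /* Used whenever the guest builds a PTE: page tables must contain
     * machine frames, not pseudo-physical ones. */
    return phys_to_machine_mapping[pfn];
}

static inline unsigned long mfn_to_pfn(unsigned long mfn)
{
    /* Used when reading a PTE back, so the kernel sees the contiguous
     * 0..max pseudo-physical address space it expects. */
    return machine_to_phys_mapping[mfn];
}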
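Finally, a simplified view of the event channel upcall path: the guest's registered handler scans the pending bit vector in the shared info page, skips masked ports, and clears each pending bit before invoking a per-port handler. The structure layout here is heavily simplified (the real shared_info has one vcpu_info per VCPU, a two-level pending selector, and requires atomic bit operations), so treat it as an illustration of the mask/pending protocol rather than working guest kernel code.

/* Simplified sketch of the guest's event channel upcall handler.  The
 * structures are flattened relative to Xen's real shared_info layout in
 * order to show only the basic mask/pending protocol described above. */

#include <stdint.h>

#define NR_EVENT_CHANNELS 1024
#define BITS_PER_LONG     (8 * sizeof(unsigned long))

struct simplified_shared_info {
    uint8_t       evtchn_upcall_pending;  /* "at least one event is waiting" */
    unsigned long evtchn_pending[NR_EVENT_CHANNELS / BITS_PER_LONG];
    unsigned long evtchn_mask[NR_EVENT_CHANNELS / BITS_PER_LONG];
};

extern struct simplified_shared_info *shared_info;  /* mapped by Xen at boot */
extern void handle_event(unsigned int port);        /* guest's per-port handler */

/* Registered with Xen as the upcall entry point; Xen calls it (like a
 * hardware interrupt) whenever an unmasked event becomes pending. */
void evtchn_do_upcall(void)
{
    shared_info->evtchn_upcall_pending = 0;

    for (unsigned int w = 0; w < NR_EVENT_CHANNELS / BITS_PER_LONG; w++) {
        /* Only consider events that are pending and not masked. */
        unsigned long ready = shared_info->evtchn_pending[w] &
                              ~shared_info->evtchn_mask[w];
        while (ready) {
            unsigned int bit  = (unsigned int)__builtin_ctzl(ready);
            unsigned int port = w * BITS_PER_LONG + bit;

            /* Clear the pending bit before handling, so a notification
             * arriving during the handler is not lost. */
            shared_info->evtchn_pending[w] &= ~(1UL << bit);
            handle_event(port);
            ready &= ~(1UL << bit);
        }
    }
}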
Xen also supports ``HVM'' (hardware virtual machine) mode, which is equivalent to what VMware [12], QEMU [10], Bochs [11] and similar systems provide: nearly perfect emulation of the x86 architecture and some standard peripherals. The advantage is that an uncooperative guest OS never knows it's running in a virtual machine: Windows XP and Mac OS X have been successfully run inside Xen in this mode. Unfortunately, this mode has a well known performance cost, even when Xen leverages the specialized hardware support for full virtualization in newer Intel [15] and AMD [16] chips. The overhead comes from the requirement that the hypervisor still trap and emulate all sensitive instructions, whereas paravirtualized guests can intelligently batch together requests in one hypercall and can avoid virtual device driver overhead.
