From f53494c28e362fb7752bbc83417b9ba47cff0bf5 Mon Sep 17 00:00:00 2001 From: rsc Date: Wed, 3 Sep 2008 04:50:04 +0000 Subject: DO NOT MAIL: xv6 web pages --- web/l-vm.html | 462 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 462 insertions(+) create mode 100644 web/l-vm.html (limited to 'web/l-vm.html') diff --git a/web/l-vm.html b/web/l-vm.html new file mode 100644 index 0000000..ffce13e --- /dev/null +++ b/web/l-vm.html @@ -0,0 +1,462 @@ + + +Virtual Machines + + + + +

Virtual Machines

+ +

Required reading: Disco

+ +

Overview

+ +

What is a virtual machine? IBM definition: a fully protected and +isolated copy of the underlying machine's hardware.

+ +

Another view is that a virtual machine provides another example of a kernel API. In contrast to other kernel APIs (Unix, microkernel, and exokernel), the virtual machine operating system exports the processor API (e.g., the x86 interface) as its kernel API. Thus, each program running in user space sees the services offered by a processor, and each program sees its own processor. Of course, we don't want to make a system call for each instruction; in fact, one of the main challenges in virtual machine operating systems is to design the system so that the physical processor executes the virtual processor API directly, at processor speed.

+Virtual machines can be useful for a number of reasons: +

Run multiple operating systems on a single piece of hardware. For example, in one process you run Linux, and in another you run Windows/XP. If the kernel API is identical to the x86 (and faithfully emulates x86 instructions, state, protection levels, and page tables), then the virtual machine operating system can run Linux and Windows/XP as guest operating systems without modifications.
- Run "older" programs on the same hardware (e.g., run one x86 +virtual machine in real mode to execute old DOS apps). + +
- Or run applications that require a different operating system.
+ +
Fault isolation: like processes on UNIX but more complete, because the guest operating system runs on the virtual machine in user space. Thus, faults in the guest OS cannot affect any other software.
Customizing the apparent hardware: a virtual machine may have a different view of the hardware than is physically present.
Simplify deployment/development of software for scalable +processors (e.g., Disco). + +

+ +

If your operating system isn't a virtual machine operating system, what are the alternatives? Processor simulation (e.g., bochs) or binary emulation (WINE). Simulation runs instructions purely in software and is slow (e.g., a 100x slowdown for bochs); virtualization gets out of the way whenever possible and can be efficient.

Simulation gives portability, whereas virtualization focuses on performance. Portability has a cost, however: a simulator must model the hardware very carefully in software. Binary emulation focuses on just providing the system call interface of a particular operating system. It can be hard because it targets one particular operating system, and even that interface can change between revisions.

+ +

To provide each process with its own virtual processor that exports +the same API as the physical processor, what features must +the virtual machine operating system virtualize? +

CPU: instructions -- trap all privileged instructions
Memory: address spaces -- map "physical" pages managed by the guest OS to machine pages, handle translation, etc.
Devices: any I/O communication needs to be trapped and passed + through/handled appropriately.

+The software that implements the virtualization is typically called +the monitor, instead of the virtual machine operating system. + +

Virtual machine monitors (VMM) can be implemented in two ways: +

Run VMM directly on hardware: like Disco.
Run VMM as an application (though still running as root, with + integration into OS) on top of a host OS: like VMware. Provides + additional hardware support at low development cost in + VMM. Intercept CPU-level I/O requests and translate them into + system calls (e.g. read()).

+ +

The three primary functions of a virtual machine monitor are: +

virtualize processor (CPU, memory, and devices) +
dispatch events (e.g., forward page fault trap to guest OS). +
allocate resources (e.g., divide real memory in some way between +the physical memory of each guest OS). +

+ +

Virtualization in detail

+ +

Memory virtualization

+ +

Understanding memory virtualization. Let's consider the MIPS example from the paper. Ideally, we'd be able to intercept and rewrite all memory address references (e.g., by intercepting virtual memory calls). Why can't we do this on the MIPS? (There are addresses that don't go through address translation, but we don't want the virtual machine to directly access memory!) What does Disco do to get around this problem? (Relink the kernel so that it runs outside this unmapped address range.)

+ +

+Having gotten around that problem, how do we handle things in general? +

+// Disco's tlb miss handler.
+// Called when a memory reference for virtual address
+// 'VA' is made, but there is no VA->MA (virtual -> machine)
+// mapping in the cpu's TLB.
+void tlb_miss_handler (VA)
+{
+  // see if we have a mapping in our "shadow" tlb (which includes
+  // "main" tlb)
+  tlb_entry *t = tlb_lookup (thiscpu->l2tlb, VA);
+  if (t && defined (thiscpu->pmap[t->pa]))   // is there a MA for this PA?
+    tlbwrite (VA, thiscpu->pmap[t->pa], t->otherdata);
+  else if (t)
+    // get a machine page, copy physical page into it, and tlbwrite
+  else
+    // trap to the virtual CPU/OS's handler
+}
+
+// Disco's procedure which emulates the MIPS
+// instruction which writes to the tlb.
+//
+// VA -- virtual address
+// PA -- physical address (NOT MA machine address!)
+// otherdata -- perms and stuff
+void emulate_tlbwrite_instruction (VA, PA, otherdata)
+{
+  tlb_insert (thiscpu->l2tlb, VA, PA, otherdata); // cache
+  if (!defined (thiscpu->pmap[PA])) { // fill in pmap dynamically
+    MA = allocate_machine_page ();
+    thiscpu->pmap[PA] = MA; // See 4.2.2
+    thiscpu->pmapbackmap[MA] = PA;
+    thiscpu->memmap[MA] = VA; // See 4.2.3 (for TLB shootdowns)
+  }
+  tlbwrite (VA, thiscpu->pmap[PA], otherdata);
+}
+
+// Disco's procedure which emulates the MIPS
+// instruction which reads the tlb.
+tlb_entry *emulate_tlbread_instruction (VA)
+{
+  // Must return a TLB entry that has a "Physical" address;
+  // This is recorded in our secondary TLB cache.
+  // (We don't have to read from the hardware TLB since
+  // all writes to the hardware TLB are mediated by Disco.
+  // Thus we can always keep the l2tlb up to date.)
+  return tlb_lookup (thiscpu->l2tlb, VA);
+}
+
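The physical-to-machine map (pmap) that the handlers above rely on can be sketched as a small table that is filled in lazily. This is only an illustration under stated assumptions: `NPHYS`, `NO_MA`, `vm_init`, `pmap_lookup`, and the bump allocator are hypothetical names, not Disco's code, and the sketch ignores page copying and the backmap.

```c
#include <stddef.h>
#include <stdint.h>

#define NPHYS 8            /* hypothetical: guest "physical" pages per VM */
#define NO_MA UINT32_MAX   /* marks a physical page with no machine page yet */

struct vm {
    uint32_t pmap[NPHYS];  /* guest physical page -> machine page */
};

static uint32_t next_free_ma = 100; /* toy bump allocator for machine pages */

/* Start with every physical page unmapped. */
void vm_init(struct vm *vm) {
    for (size_t i = 0; i < NPHYS; i++)
        vm->pmap[i] = NO_MA;
}

/* Translate a guest physical page to a machine page, allocating a machine
   page on first use. Returns NO_MA if pa is out of range for this VM. */
uint32_t pmap_lookup(struct vm *vm, uint32_t pa) {
    if (pa >= NPHYS)
        return NO_MA;
    if (vm->pmap[pa] == NO_MA)
        vm->pmap[pa] = next_free_ma++;
    return vm->pmap[pa];
}
```

Two VMs that both use "physical" page 0 end up with different machine pages, which is the point of the extra level of translation.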

+ +

CPU virtualization

+ +

Requirements: +

Results of executing non-privileged instructions in privileged and user mode must be equivalent. (Why? Because the virtual "privileged" system will not be running in true "privileged" mode.)
There must be a way to protect the real machine (and other VMs) from the VM. (Some sort of memory protection/address translation. For fault isolation.)
There must be a way to detect and transfer control to the VMM when the VM tries to execute a sensitive instruction (e.g. a privileged instruction, or one that could expose the "virtualness" of the VM.) It must be possible to emulate these instructions in software. Processors can be classified as completely virtualizable (i.e. there are protection mechanisms that cause traps for all sensitive instructions), partly virtualizable (insufficient or incomplete trap mechanisms), or not virtualizable at all (e.g. no MMU).

+ +

The MIPS didn't quite meet the second criterion, as discussed above. But it does have a supervisor mode that sits between user mode and kernel mode, in which any privileged instruction will trap.

+ +

What might the VMM trap handler look like?

+void privilege_trap_handler (addr) {
+  instruction, args = decode_instruction (addr)
+  switch (instruction) {
+  case foo:
+    emulate_foo (thiscpu, args, ...);
+    break;
+  case bar:
+    emulate_bar (thiscpu, args, ...);
+    break;
+  case ...:
+    ...
+  }
+}
+

The emulate_foo routines will have to examine the state of the virtual CPU and compute the appropriate "fake" answer.
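To make the decode-and-dispatch idea concrete, here is a minimal sketch over a made-up two-byte instruction encoding. The opcodes (`OP_CLI`, `OP_STI`, `OP_SETMODE`), the `struct vcpu` fields, and the return convention are all assumptions for illustration; a real monitor would decode the actual processor's instruction format.

```c
#include <stdint.h>

/* Hypothetical toy encoding: one opcode byte followed by one operand byte. */
enum { OP_CLI = 0x01, OP_STI = 0x02, OP_SETMODE = 0x03 };

struct vcpu {
    uint32_t pc;           /* virtual program counter */
    int      intr_enabled; /* virtual interrupt-enable flag */
    uint8_t  mode;         /* virtual privileged mode register */
};

/* Emulate the privileged instruction at code[vcpu->pc] against the virtual
   CPU state only, then step past it. Returns 0 on success and -1 if the
   opcode is unknown (a real monitor would reflect a fault into the guest). */
int privilege_trap_handler(struct vcpu *vcpu, const uint8_t *code) {
    uint8_t op  = code[vcpu->pc];
    uint8_t arg = code[vcpu->pc + 1];
    switch (op) {
    case OP_CLI:     vcpu->intr_enabled = 0; break;
    case OP_STI:     vcpu->intr_enabled = 1; break;
    case OP_SETMODE: vcpu->mode = arg;       break;
    default:         return -1;
    }
    vcpu->pc += 2;
    return 0;
}
```

Note that the handler never touches the real interrupt flag or mode; it only updates the per-VM copy, which is what lets several guests each believe they own the processor.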

+ +

What sort of state is needed in order to appropriately emulate all +of these things? +

+- all user registers
+- CPU specific regs (e.g. on x86, %crN, debugging, FP...)
+- page tables (or tlb)
+- interrupt tables
+

+This is needed for each virtual processor. +

+ +

Device I/O virtualization

+ +

We intercept all communication with the I/O devices: reads and writes to reserved memory addresses cause page faults that enter special handlers, which emulate or pass through the I/O as appropriate.

+ +

+In a system like Disco, the sequence would look something like: +

VM executes instruction to access I/O
Trap generated by CPU (based on memory or privilege protection) + transfers control to VMM.
VMM emulates I/O instruction, saving information about where this came from (for demultiplexing async reply from hardware later).
VMM reschedules a VM.
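The bookkeeping in the third step, remembering which VM issued a request so the asynchronous reply can be demultiplexed later, might be sketched as follows. The table size, the `struct io_req` layout, and the function names are hypothetical; the sketch does not talk to any real device.

```c
#include <stdint.h>

#define MAX_PENDING 4   /* hypothetical limit on outstanding requests */

struct io_req { int in_use; int vm_id; uint32_t sector; };

static struct io_req pending[MAX_PENDING];

/* Called from the trap path: record which VM issued the request and return
   a tag to hand to the real device, or -1 if the table is full. */
int vmm_io_submit(int vm_id, uint32_t sector) {
    for (int tag = 0; tag < MAX_PENDING; tag++) {
        if (!pending[tag].in_use) {
            pending[tag] = (struct io_req){ 1, vm_id, sector };
            return tag;
        }
    }
    return -1;
}

/* Called when the device completes request `tag`: frees the slot and returns
   the VM to notify, or -1 for a tag that is not outstanding. */
int vmm_io_complete(int tag) {
    if (tag < 0 || tag >= MAX_PENDING || !pending[tag].in_use)
        return -1;
    pending[tag].in_use = 0;
    return pending[tag].vm_id;
}
```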

+ +

+Interrupts will require some additional work: +

Interrupt occurs on real machine, transferring control to VMM handler.
VMM determines the VM that ought to receive this interrupt.
VMM causes a simulated interrupt to occur in the VM, and reschedules a + VM.
VM runs its interrupt handler, which may involve other I/O + instructions that need to be trapped.
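The interrupt steps above could be sketched as a pending-interrupt mask per VM that the monitor consults before resuming a guest. Everything here (`NVM`, `irq_owner`, the lowest-IRQ-first policy) is an assumption chosen for illustration; the notes do not say how Disco or VMware order or route interrupts.

```c
#include <stdint.h>

#define NVM   2    /* hypothetical number of virtual machines */
#define NIRQ 32    /* hypothetical number of interrupt lines */

struct vm_intr {
    int      intr_enabled;  /* guest's virtual interrupt-enable flag */
    uint32_t pending;       /* bitmask of virtual IRQs awaiting delivery */
};

static struct vm_intr vms[NVM];
static int irq_owner[NIRQ]; /* which VM each real IRQ is routed to */

/* Real interrupt arrived: mark it pending for the owning VM. */
void vmm_interrupt(int irq) {
    if (irq < 0 || irq >= NIRQ)
        return;
    vms[irq_owner[irq]].pending |= 1u << irq;
}

/* Before resuming a VM: pick the lowest pending IRQ to inject, or -1 if
   there is none or the guest currently has interrupts disabled. */
int vmm_next_injection(int vm) {
    struct vm_intr *v = &vms[vm];
    if (!v->intr_enabled || v->pending == 0)
        return -1;
    for (int irq = 0; irq < NIRQ; irq++) {
        if (v->pending & (1u << irq)) {
            v->pending &= ~(1u << irq);
            return irq;
        }
    }
    return -1;
}
```

Holding the interrupt as pending while the guest's virtual interrupt flag is clear is what keeps the guest's view consistent with real hardware.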

+ +

The above can be slow! So sometimes you want the guest operating system to be aware that it is a guest and allow it to avoid the slow path. This can be done with special device drivers, or by changing instructions that would cause traps into ordinary memory read/write instructions.

+ +

Intel x86/vmware

+ +

VMware, unlike Disco, runs as an application on a host OS and cannot modify the guest OS. Furthermore, it must virtualize the x86 instead of the MIPS processor. Both of these differences make for good design challenges.

The first challenge is that the monitor runs in user space, yet it must dispatch traps and execute privileged instructions, both of which require kernel privileges. To address this challenge, the monitor loads a piece of code, a kernel module, into the host OS. Most modern operating systems are constructed as a core kernel extended with loadable kernel modules. Privileged users can insert kernel modules at run-time.

The monitor loads a kernel module that reads the IDT, copies it, and overwrites the hard-wired entries with the addresses of stubs in the newly loaded module. When a trap happens, the kernel module inspects the PC and forwards the trap either to the monitor running in user space or to the host OS. If the trap was caused by a guest OS executing a privileged instruction, the monitor can emulate that instruction by asking the kernel module to perform it (perhaps after modifying the arguments to the instruction).

The second challenge is virtualizing the x86 instructions. Unfortunately, the x86 doesn't meet the requirements for CPU virtualization listed above. If you run the guest OS in ring 3, most x86 instructions will be fine, because most privileged instructions will result in a trap, which can then be forwarded to VMware for emulation. For example, consider a guest OS loading the root of a page table into CR3. This results in a trap (the guest OS runs in user space), which is forwarded to the monitor, which can emulate the load to CR3 as follows:

+// addr is a physical address
+void emulate_lcr3 (thiscpu, addr)
+{
+  thiscpu->cr3 = addr;
+  Pte *fakepdir = lookup (oldcr3cache, addr);
+  if (!fakepdir) {
+    fakepdir = ppage_alloc ();
+    store (oldcr3cache, addr, fakepdir);
+    // May wish to scan through supplied page directory to see if
+    // we have to fix up anything in particular.
+    // Exact settings will depend on how we want to handle
+    // problem cases below and our own MM.
+  }
+  asm ("movl fakepdir,%cr3");
+  // Must make sure our page fault handler is in sync with what we do here.
+}
+
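The cache of previously seen page directories in emulate_lcr3 can be sketched in plain C as below. This is a toy under stated assumptions: `CACHE_SLOTS`, the round-robin eviction, the `real_cr3` field standing in for the actual `movl ...,%cr3`, and the bump allocator are all hypothetical, and the sketch never builds real page tables.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 4   /* hypothetical size of the old-cr3 cache */

struct shadow { uint32_t guest_cr3; uint32_t shadow_cr3; int valid; };

struct vcpu_mm {
    uint32_t cr3;        /* what the guest believes %cr3 holds */
    uint32_t real_cr3;   /* what the monitor would load into hardware */
    struct shadow cache[CACHE_SLOTS];
    size_t next_victim;  /* round-robin eviction index */
};

static uint32_t next_shadow_page = 0x100000; /* toy page allocator */

/* Record the guest's value, then reuse or create the matching shadow
   page directory and select it as the "hardware" page table root. */
void emulate_lcr3(struct vcpu_mm *c, uint32_t guest_pdir) {
    c->cr3 = guest_pdir;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (c->cache[i].valid && c->cache[i].guest_cr3 == guest_pdir) {
            c->real_cr3 = c->cache[i].shadow_cr3;
            return;
        }
    }
    struct shadow *s = &c->cache[c->next_victim];
    c->next_victim = (c->next_victim + 1) % CACHE_SLOTS;
    s->guest_cr3 = guest_pdir;
    s->shadow_cr3 = next_shadow_page;
    next_shadow_page += 0x1000;
    s->valid = 1;
    c->real_cr3 = s->shadow_cr3;
}
```

Switching back to a page directory the guest used earlier reuses the same shadow, so the monitor does not have to rebuild it on every context switch in the guest.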

+ +

To virtualize the x86, the monitor must intercept any modifications to the page table and substitute appropriate responses. It must also update things like the accessed/dirty bits. The monitor can arrange for this to happen by making all page table pages inaccessible so that it can emulate loads and stores to page table pages. This setup allows the monitor to virtualize the memory interface of the x86.

+ +

Unfortunately, not all instructions that must be virtualized result +in traps: +

pushf/popf: FL_IF is handled differently, for example. In user mode, setting FL_IF is just ignored.
Anything (push, pop, mov) that reads or writes %cs, which contains the privilege level.
Setting the interrupt enable bit in EFLAGS has different semantics in user space and kernel space. In user space, it is ignored; in kernel space, the bit is set.
And some others (17 instructions in total).

These instructions are unprivileged instructions (i.e., they don't cause a trap when executed by a guest OS) but expose physical processor state. These could reveal details of virtualization that should not be revealed. For example, if a guest OS sets the interrupt enable bit for its virtual x86, the virtualized EFLAGS should reflect that the bit is set, even though the guest OS is running in user space.
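As a rough illustration of the EFLAGS example, a monitor could keep a virtual interrupt-enable bit per guest and splice it into whatever the guest reads or writes. `FL_IF` (bit 9, value 0x200) is the real x86 flag; `struct vflags` and the two function names are hypothetical, and the sketch assumes the monitor has already gained control at the pushf/popf.

```c
#include <stdint.h>

#define FL_IF 0x200u  /* interrupt-enable bit in x86 EFLAGS */

struct vflags {
    uint32_t hw_eflags;  /* flags as the real CPU holds them in ring 3 */
    int      virt_if;    /* the guest's virtual interrupt-enable state */
};

/* Value the guest should observe from pushf: real flags, but with the
   virtual IF substituted for the real one. */
uint32_t emulate_pushf(const struct vflags *f) {
    uint32_t v = f->hw_eflags & ~FL_IF;
    if (f->virt_if)
        v |= FL_IF;
    return v;
}

/* Effect of the guest executing popf with value v: IF goes to the virtual
   copy; the real IF stays under monitor control; other bits pass through. */
void emulate_popf(struct vflags *f, uint32_t v) {
    f->virt_if = (v & FL_IF) != 0;
    f->hw_eflags = (f->hw_eflags & FL_IF) | (v & ~FL_IF);
}
```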

How can we virtualize these instructions? One approach is to decode the instruction stream that the guest provides and look for bad instructions. When we find one, we replace it with a breakpoint instruction (INT 3) that traps to the VMM so that it can handle the instruction correctly. This might look something like:

+ +

+void initcode () {
+  scan_for_nonvirtualizable (thiscpu, 0x7c00);
+}
+
+void scan_for_nonvirtualizable (thiscpu, startaddr) {
+  addr  = startaddr;
+  instr = disassemble (addr);
+  while (instr is not branch or bad) {
+    addr += len (instr);
+    instr = disassemble (addr);
+  }
+  // remember the instruction we wanted to execute, then patch it.
+  record (thiscpu->rewrites, addr, instr);
+  replace (addr, "int 3");
+}
+
+void breakpoint_handler (tf) {
+  oldinstr = lookup (thiscpu->rewrites, tf->eip);
+  if (oldinstr is branch) {
+    newcs:neweip = evaluate branch
+    scan_for_nonvirtualizable (thiscpu, newcs:neweip)
+    return;
+  } else { // something non virtualizable
+    // dispatch to appropriate emulation
+  }
+}
+
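The scan-and-patch loop above can be tried out on a made-up one-byte instruction set. The opcodes `T_NOP`, `T_BRANCH`, and `T_POPF` are hypothetical stand-ins (only 0xCC is the real x86 `int 3` encoding), as are the `struct rewrites` table and its size; real x86 needs a full variable-length disassembler.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-byte toy instruction set; only 0xCC matches x86. */
enum { T_NOP = 0x00, T_BRANCH = 0x10, T_POPF = 0x20, T_INT3 = 0xCC };

#define MAX_REWRITES 16

struct rewrite  { size_t addr; uint8_t orig; };
struct rewrites { struct rewrite r[MAX_REWRITES]; size_t n; };

/* Scan forward from start, patch the first branch or non-virtualizable
   instruction with T_INT3, and remember what was there. Returns the address
   where scanning stopped (an existing patch counts as a stop), or len if the
   end of the code was reached or the rewrite table is full. */
size_t scan_for_nonvirtualizable(uint8_t *code, size_t len, size_t start,
                                 struct rewrites *rw) {
    size_t addr = start;
    while (addr < len && code[addr] != T_BRANCH &&
           code[addr] != T_POPF && code[addr] != T_INT3)
        addr++;
    if (addr >= len)
        return len;
    if (code[addr] == T_INT3)      /* already patched by an earlier scan */
        return addr;
    if (rw->n == MAX_REWRITES)
        return len;
    rw->r[rw->n].addr = addr;
    rw->r[rw->n].orig = code[addr];
    rw->n++;
    code[addr] = T_INT3;
    return addr;
}

/* Breakpoint handler's lookup: the original opcode at addr, or -1. */
int lookup_rewrite(const struct rewrites *rw, size_t addr) {
    for (size_t i = 0; i < rw->n; i++)
        if (rw->r[i].addr == addr)
            return rw->r[i].orig;
    return -1;
}
```

Stopping at branches as well as bad instructions is what keeps the scan lazy: code is only examined once control is actually about to reach it.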

All pages must be scanned in this way. Fortunately, most pages +probably are okay and don't really need any special handling so after +scanning them once, we can just remember that the page is okay and let +it run natively. +

+ +

What if a guest OS generates instructions, writes them to memory, and then wants to execute them? We must detect self-modifying code (e.g., we must simulate buffer overflow attacks correctly). When a write hits a physical page that holds code, we must trap the write and then rescan the affected portions of the page.
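One way to picture this is a per-page state that flips back to "unscanned" whenever a write hits a page that was already scanned. The sketch below assumes 4 KB pages and a tiny address space; `NPAGES`, the state names, and the function names are hypothetical, and a real monitor would enforce the write protection through the page tables rather than a flag array.

```c
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KB pages */
#define NPAGES     16   /* hypothetical size of the guest's memory in pages */

enum page_state { PG_UNSCANNED = 0, PG_SCANNED };

static enum page_state state[NPAGES];
static int write_protected[NPAGES];

/* Page index for an address, or -1 if it lies outside the toy memory. */
static int page_of(uint32_t addr) {
    uint32_t pg = addr >> PAGE_SHIFT;
    return pg < NPAGES ? (int)pg : -1;
}

/* After scanning a code page, write-protect it so later stores trap. */
void mark_scanned(uint32_t addr) {
    int pg = page_of(addr);
    if (pg < 0)
        return;
    state[pg] = PG_SCANNED;
    write_protected[pg] = 1;
}

/* Write-fault path: returns 1 if the page must be rescanned before it
   executes again, 0 if the write needed no special handling. */
int on_guest_write(uint32_t addr) {
    int pg = page_of(addr);
    if (pg < 0 || !write_protected[pg])
        return 0;
    state[pg] = PG_UNSCANNED;
    write_protected[pg] = 0;
    return 1;
}

/* Must this address's page be scanned before the guest may execute it? */
int needs_scan(uint32_t addr) {
    int pg = page_of(addr);
    return pg < 0 || state[pg] == PG_UNSCANNED;
}
```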

+ +

What about self-examining code? We need to protect it somehow, possibly by playing tricks with the instruction/data TLB caches, or by introducing a private segment for code (%cs) that is different from the segment used for reads/writes (%ds).

+ +

Some Disco paper notes

+ +

+Disco has some I/O specific optimizations. +

Disk reads only need to happen once and can be shared between + virtual machines via copy-on-write virtual memory tricks.
Network cards do not need to be fully virtualized: inter-VM communication doesn't need a real network card backing it.
Special handling for NFS so that all VMs "share" a buffer cache.

+ +

+Disco developers clearly had access to IRIX source code. +

Need to deal with the KSEG0 segment of MIPS memory by relinking the kernel at a different address.
Ensuring page-alignment of network writes (for the purposes of + doing memory map tricks.)

+ +

Performance?

Evaluated in simulation.
Where are the overheads? Where do they come from?
Does it run better than NUMA IRIX?

+ +

Premise. Are virtual machines the preferred approach to extending operating systems? Have scalable multiprocessors materialized?

+ +

Related papers

+ +

John Scott Robin, Cynthia E. Irvine. Analysis of the +Intel Pentium's Ability to Support a Secure Virtual Machine +Monitor.

+ +

Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. Virtualizing +I/O Devices on VMware Workstation's Hosted Virtual Machine +Monitor. In Proceedings of the 2001 Usenix Technical Conference.

+ +

Kevin Lawton, Drew Northup. Plex86 Virtual +Machine.

+ +

Xen +and the Art of Virtualization, Paul Barham, Boris +Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf +Neugebauer, Ian Pratt, Andrew Warfield, SOSP 2003

+ +

A comparison of software and hardware techniques for x86 virtualization, Keith Adams and Ole Agesen, ASPLOS 2006

+ + + + + -- cgit v1.2.3