From f53494c28e362fb7752bbc83417b9ba47cff0bf5 Mon Sep 17 00:00:00 2001
From: rsc
+Virtual Machines
+
+Required reading: Disco
+
+Overview
+
+What is a virtual machine? IBM definition: a fully protected and
+isolated copy of the underlying machine's hardware.
+
+Another view is that it provides another example of a kernel API.
+In contrast to other kernel APIs (unix, microkernel, and exokernel),
+the virtual machine operating system exports as the kernel API the
+processor API (e.g., the x86 interface). Thus, each program running
+in user space sees the services offered by a processor, and each
+program sees its own processor. Of course, we don't want to make a
+system call for each instruction, and in fact one of the main
+challenges in virtual machine operating systems is to design the
+system in such a way that the physical processor executes the virtual
+processor API directly, at processor speed.
+
+Virtual machines can be useful for a number of reasons:
+
+
+
+
+
+
+
+
+If your operating system isn't a virtual machine operating system,
+what are the alternatives? Processor simulation (e.g., bochs) or
+binary emulation (WINE). Simulation runs instructions purely in
+software and is slow (e.g., 100x slowdown for bochs); virtualization
+gets out of the way whenever possible and can be efficient.
+
+Simulation gives portability whereas virtualization focuses on
+performance. However, simulation means that you need to model your
+hardware very carefully in software. Binary emulation focuses on just
+getting the system calls of a particular operating system's interface
+right. Binary emulation can be hard because it is targeted at a
+particular operating system (and even that interface can change
+between revisions).
+
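To see where the simulation slowdown comes from, here is a minimal sketch of a software fetch-decode-execute loop. The three-instruction machine (`OP_ADDI`, `OP_JNZ`, `OP_HALT`) and the names `toy_cpu` and `toy_run` are invented for illustration and are not how bochs is structured. The point is only that every guest instruction costs a fetch, a dispatch, and several state updates on the host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy ISA, invented for this sketch. */
enum { OP_ADDI, OP_JNZ, OP_HALT };

struct toy_insn { uint8_t op; uint8_t reg; int32_t imm; };

struct toy_cpu { uint32_t pc; int32_t regs[4]; uint64_t steps; };

/* Interpret until OP_HALT (or until pc runs off the program).
 * Returns the number of guest instructions executed. */
uint64_t toy_run(struct toy_cpu *cpu, const struct toy_insn *prog, size_t n)
{
    while (cpu->pc < n) {
        const struct toy_insn *i = &prog[cpu->pc];   /* fetch */
        cpu->steps++;
        switch (i->op) {                             /* decode */
        case OP_ADDI:                                /* reg += imm */
            cpu->regs[i->reg & 3] += i->imm;
            cpu->pc++;
            break;
        case OP_JNZ:                                 /* jump to index imm if reg != 0 */
            cpu->pc = cpu->regs[i->reg & 3] != 0 ? (uint32_t)i->imm
                                                 : cpu->pc + 1;
            break;
        case OP_HALT:
        default:
            return cpu->steps;
        }
    }
    return cpu->steps;
}
```

A virtual machine monitor avoids this loop for almost all instructions by letting the hardware execute them directly.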
+
+To provide each process with its own virtual processor that exports
+the same API as the physical processor, what features must
+the virtual machine operating system virtualize?
+
+Virtual machine monitors (VMMs) can be implemented in two ways:
read()).
+
+The three primary functions of a virtual machine monitor are:
+Understanding memory virtualization. Let's consider the MIPS example
+from the paper. Ideally, we'd be able to intercept and rewrite all
+memory address references (e.g., by intercepting virtual memory
+calls). Why can't we do this on the MIPS? (There are addresses that
+don't go through address translation, but we don't want the virtual
+machine to directly access memory!) What does Disco do to get around
+this problem? (Relink the kernel outside this address space.)
+
+Having gotten around that problem, how do we handle things in general?
+
+// Disco's TLB miss handler.
+// Called when a memory reference for virtual address
+// 'va' is made, but there is no VA->MA (virtual -> machine)
+// mapping in the CPU's TLB.
+void tlb_miss_handler (va)
+{
+  // see if we have a mapping in our "shadow" tlb (which includes
+  // "main" tlb)
+  tlb_entry *t = tlb_lookup (thiscpu->l2tlb, va);
+  if (t && defined (thiscpu->pmap[t->pa]))   // is there a MA for this PA?
+    tlbwrite (va, thiscpu->pmap[t->pa], t->otherdata);
+  else if (t)
+    // get a machine page, copy physical page into it, and tlbwrite
+  else
+    // trap to the virtual CPU/OS's handler
+}
+
+// Disco's procedure which emulates the MIPS
+// instruction which writes to the tlb.
+//
+// va -- virtual address
+// pa -- physical address (NOT MA machine address!)
+// otherdata -- perms and stuff
+void emulate_tlbwrite_instruction (va, pa, otherdata)
+{
+  tlb_insert (thiscpu->l2tlb, va, pa, otherdata); // cache
+  if (!defined (thiscpu->pmap[pa])) { // fill in pmap dynamically
+    ma = allocate_machine_page ();
+    thiscpu->pmap[pa] = ma; // See 4.2.2
+    thiscpu->pmapbackmap[ma] = pa;
+    thiscpu->memmap[ma] = va; // See 4.2.3 (for TLB shootdowns)
+  }
+  tlbwrite (va, thiscpu->pmap[pa], otherdata);
+}
+
+// Disco's procedure which emulates the MIPS
+// instruction which reads the tlb.
+tlb_entry *emulate_tlbread_instruction (va)
+{
+  // Must return a TLB entry that has a "Physical" address;
+  // This is recorded in our secondary TLB cache.
+  // (We don't have to read from the hardware TLB since
+  // all writes to the hardware TLB are mediated by Disco.
+  // Thus we can always keep the l2tlb up to date.)
+  return tlb_lookup (thiscpu->l2tlb, va);
+}
+
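The physical-to-machine bookkeeping in the pseudocode above can be sketched as a small, self-contained table with lazy allocation. This is an illustration only: the sizes, the `vm_mem` structure, the `pa_to_ma` name, and the bump allocator are assumptions made here, not Disco's actual data structures.

```c
#include <stdint.h>

#define NPHYS   8u           /* guest "physical" pages (assumed size) */
#define NMACH   4u           /* real machine pages available (assumed size) */
#define NO_PAGE UINT32_MAX   /* marks "no mapping yet" */

struct vm_mem {
    uint32_t pmap[NPHYS];     /* physical page -> machine page */
    uint32_t backmap[NMACH];  /* machine page -> physical page */
    uint32_t next_free;       /* trivial bump allocator (assumption) */
};

void vm_mem_init(struct vm_mem *m)
{
    for (uint32_t i = 0; i < NPHYS; i++) m->pmap[i] = NO_PAGE;
    for (uint32_t i = 0; i < NMACH; i++) m->backmap[i] = NO_PAGE;
    m->next_free = 0;
}

/* Translate a guest physical page to a machine page, allocating a
 * machine page on first use. Returns NO_PAGE if pa is out of range
 * or no machine pages are left. */
uint32_t pa_to_ma(struct vm_mem *m, uint32_t pa)
{
    if (pa >= NPHYS)
        return NO_PAGE;
    if (m->pmap[pa] == NO_PAGE) {
        if (m->next_free >= NMACH)
            return NO_PAGE;
        uint32_t ma = m->next_free++;
        m->pmap[pa] = ma;
        m->backmap[ma] = pa;   /* lets the monitor find the owner of a machine page */
    }
    return m->pmap[pa];
}
```

The guest only ever names physical pages; the monitor substitutes the machine page whenever it writes the hardware TLB.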
+Requirements:
+
+The MIPS didn't quite meet the second criterion, as discussed
+above. But it does have a supervisor mode, between user mode and
+kernel mode, in which any privileged instruction will trap.
+
+What might the VMM trap handler look like?
+
+void privilege_trap_handler (addr) {
+  instruction, args = decode_instruction (addr);
+  switch (instruction) {
+  case foo:
+    emulate_foo (thiscpu, args, ...);
+    break;
+  case bar:
+    emulate_bar (thiscpu, args, ...);
+    break;
+  case ...:
+    ...
+  }
+}
+
+The emulate_foo bits will have to evaluate the
+state of the virtual CPU and compute the appropriate "fake" answer.
+
+What sort of state is needed in order to appropriately emulate all
+of these things?
+
+- all user registers
+- CPU specific regs (e.g. on x86, %crN, debugging, FP...)
+- page tables (or tlb)
+- interrupt tables
+
+This is needed for each virtual processor.
+
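As a rough sketch of the per-virtual-processor state listed above, and of an emulation routine that computes a "fake" answer against it, consider the following. The `vcpu` layout and the `emulate_cli`/`emulate_sti` names are assumptions for illustration; debug and floating-point registers are omitted, and the one-instruction delay of the real `sti` is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

#define FL_IF 0x00000200u   /* x86 EFLAGS interrupt-enable bit */

/* Hypothetical per-virtual-CPU state (x86 flavored). */
struct vcpu {
    uint32_t regs[8];       /* general-purpose registers */
    uint32_t eip;
    uint32_t eflags;        /* the guest's view of the flags, including FL_IF */
    uint32_t cr[5];         /* virtual %cr0..%cr4 */
    uint32_t idt_base;      /* guest's interrupt table */
    uint16_t idt_limit;
    bool     irq_pending;   /* a virtual interrupt is waiting for FL_IF */
};

/* Guest executed cli: only the virtual flag changes; the real
 * processor's interrupts stay under the monitor's control. */
void emulate_cli(struct vcpu *v)
{
    v->eflags &= ~FL_IF;
}

/* Guest executed sti: set the virtual flag and report whether a
 * pending virtual interrupt should now be delivered to the guest. */
bool emulate_sti(struct vcpu *v)
{
    v->eflags |= FL_IF;
    return v->irq_pending;
}
```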
+We intercept all communication to the I/O devices: reads and writes to
+reserved memory addresses cause page faults into special handlers
+which will emulate or pass through I/O as appropriate.
+
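A minimal sketch of that interception, assuming the monitor's fault handler has already decoded the faulting store into an address and a byte value. The `toy_dev` structure, the `mmio_store` name, and any address ranges used with it are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emulated device: remembers the last byte written
 * anywhere in its reserved address range. */
struct toy_dev {
    uint32_t base, len;   /* reserved range [base, base + len) */
    uint8_t  last;        /* last value the guest stored */
    unsigned writes;      /* number of emulated stores */
};

/* Called from the monitor's page fault handler for a guest store.
 * Returns 1 if the store hit a device range and was emulated,
 * 0 if it is an ordinary memory fault to be handled elsewhere. */
int mmio_store(struct toy_dev *devs, size_t ndevs, uint32_t addr, uint8_t val)
{
    for (size_t i = 0; i < ndevs; i++) {
        if (addr >= devs[i].base && addr - devs[i].base < devs[i].len) {
            devs[i].last = val;
            devs[i].writes++;
            return 1;
        }
    }
    return 0;
}
```

A pass-through device would forward the store to the real hardware at this point instead of updating emulated state.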
+
+In a system like Disco, the sequence would look something like:
+
+Interrupts will require some additional work:
+
+The above can be slow! So sometimes you want the guest operating
+system to be aware that it is a guest and allow it to avoid the slow
+path, for example with special device drivers, or by changing
+instructions that would cause traps into memory read/write
+instructions.
+
+
+VMware, unlike Disco, runs as an application on a host OS and
+cannot modify the guest OS. Furthermore, it must virtualize the x86
+instead of the MIPS processor. Both of these differences make good
+design challenges.
+
+The first challenge is that the monitor runs in user space, yet it
+must dispatch traps and it must execute privileged instructions, which
+both require kernel privileges. To address this challenge, the
+monitor downloads a piece of code, a kernel module, into the host
+OS. Most modern operating systems are constructed as a core kernel,
+extended with downloadable kernel modules.
+Privileged users can insert kernel modules at run-time.
+
+The monitor downloads a kernel module that reads the IDT, copies
+it, and overwrites the hard-wired entries with addresses for stubs in
+the just downloaded kernel module. When a trap happens, the kernel
+module inspects the PC, and forwards the trap either to the monitor
+running in user space or to the host OS. If the trap is caused
+because the guest OS executed a privileged instruction, the monitor can
+emulate that privileged instruction by asking the kernel module to
+perform that instruction (perhaps after modifying the arguments to
+the instruction).
+
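The routing decision made by the kernel module's trap stubs might be sketched as follows. This assumes, purely for illustration, that virtual machine code runs in one known address range; `vm_range`, `route_trap`, and the two targets are hypothetical names, and the real module's test on the PC is not described in these notes.

```c
#include <stdint.h>

enum trap_target {
    TO_MONITOR,      /* trap came from virtual machine code */
    TO_ORIGINAL_OS   /* trap belongs to the OS whose IDT was copied */
};

/* Hypothetical: the address range [start, end) in which VM code runs. */
struct vm_range { uint32_t start, end; };

/* Decide where a trap taken at program counter pc should go. */
enum trap_target route_trap(const struct vm_range *r, uint32_t pc)
{
    return (pc >= r->start && pc < r->end) ? TO_MONITOR : TO_ORIGINAL_OS;
}
```

Traps routed to `TO_ORIGINAL_OS` would be handed to the handler address saved from the copied IDT, so the operating system underneath keeps working.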
+The second challenge is virtualizing the x86
+instructions. Unfortunately, x86 doesn't meet the first two of the
+three requirements for CPU virtualization above. If you run
+the CPU in ring 3, most x86 instructions will be fine,
+because most privileged instructions will result in a trap, which
+can then be forwarded to VMware for emulation. For example,
+consider a guest OS loading the root of a page table in CR3. This
+results in a trap (the guest OS runs in user space), which is
+forwarded to the monitor, which can emulate the load to CR3 as
+follows:
+
+// addr is a physical address
+void emulate_lcr3 (thiscpu, addr)
+{
+  thiscpu->cr3 = addr;
+  Pte *fakepdir = lookup (addr, oldcr3cache);
+  if (!fakepdir) {
+    fakepdir = ppage_alloc ();
+    store (oldcr3cache, addr, fakepdir);
+    // May wish to scan through supplied page directory to see if
+    // we have to fix up anything in particular.
+    // Exact settings will depend on how we want to handle
+    // problem cases below and our own MM.
+  }
+  asm ("movl fakepdir,%cr3");
+  // Must make sure our page fault handler is in sync with what we do here.
+}
+
+To virtualize the x86, the monitor must intercept any modifications
+to the page table and substitute appropriate responses, and update
+things like the accessed/dirty bits. The monitor can arrange for this
+to happen by making all page table pages inaccessible so that it can
+emulate loads and stores to page table pages. This setup allows the
+monitor to virtualize the memory interface of the x86.
+
+Unfortunately, not all instructions that must be virtualized result
+in traps:
+- pushf/popf: FL_IF is handled differently,
+  for example. In user mode, setting FL_IF is just ignored.
+- Anything (push, pop, mov)
+  that reads or writes %cs, which contains the
+  privilege level.
+
+How can we virtualize these instructions? One approach is to decode
+the instruction stream that is provided by the user and look for bad
+instructions. When we find them, replace them with an interrupt
+(INT 3) that will allow the VMM to handle them
+correctly. This might look something like:
+
+void initcode () {
+  scan_for_nonvirtualizable (thiscpu, 0x7c00);
+}
+
+void scan_for_nonvirtualizable (thiscpu, startaddr) {
+  addr  = startaddr;
+  instr = disassemble (addr);
+  while (instr is not branch or bad) {
+    addr += len (instr);
+    instr = disassemble (addr);
+  }
+  // remember that we wanted to execute this instruction.
+  replace (addr, "int 3");
+  record (thiscpu->rewrites, addr, instr);
+}
+
+void breakpoint_handler (tf) {
+  // int 3 is one byte; the saved eip points just past it.
+  oldinstr = lookup (thiscpu->rewrites, tf->eip - 1);
+  if (oldinstr is branch) {
+    newcs:neweip = evaluate branch
+    scan_for_nonvirtualizable (thiscpu, newcs:neweip)
+    return;
+  } else { // something non virtualizable
+    // dispatch to appropriate emulation
+  }
+}
+
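The scan-and-patch step can be made concrete under one large simplifying assumption: every instruction is a single byte, so no real x86 length decoding is needed. The opcode values are the real x86 encodings of `nop`, `pushf`, `popf`, `ret`, and `int 3`; the `rewrite` record and the `scan_and_patch` name are invented for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define OP_NOP   0x90
#define OP_PUSHF 0x9C   /* does not trap in user mode, so it must be caught */
#define OP_POPF  0x9D   /* same problem as pushf */
#define OP_RET   0xC3   /* stands in for "branch": ends the scanned run */
#define OP_INT3  0xCC   /* one-byte breakpoint */

/* Record of one patched location (hypothetical layout). */
struct rewrite { size_t addr; uint8_t orig; };

/* Scan from start to the first branch or non-virtualizable
 * instruction, replace it with int 3, and remember the original
 * byte. Returns the patched address, or n if nothing was found.
 * Simplification: treats every instruction as one byte long. */
size_t scan_and_patch(uint8_t *code, size_t n, size_t start,
                      struct rewrite *rw)
{
    for (size_t a = start; a < n; a++) {
        uint8_t op = code[a];
        if (op == OP_PUSHF || op == OP_POPF || op == OP_RET) {
            rw->addr = a;
            rw->orig = op;
            code[a] = OP_INT3;
            return a;
        }
    }
    return n;
}
```

The breakpoint handler would then look up `rw->orig` to decide whether to emulate the instruction or to evaluate the branch and continue scanning at its target.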
+All pages must be scanned in this way. Fortunately, most pages
+probably are okay and don't really need any special handling, so after
+scanning them once, we can just remember that the page is okay and let
+it run natively.
+
+What if a guest OS generates instructions, writes them to memory,
+and then wants to execute them? We must detect self-modifying code
+(e.g., we must simulate buffer overflow attacks correctly). When a
+write happens to a physical page that is in a code segment, we must
+trap the write and then rescan the affected portions of the page.
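The "scan once, remember, rescan after a write" policy from the last two paragraphs can be sketched with a per-page flag. The `scan_cache` structure and function names are assumptions, `scan_page` is only a stand-in for the real scanner, and this version lazily rescans the whole page rather than just the affected portion.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPAGES  16u   /* number of tracked pages (assumed size) */
#define PGSHIFT 12    /* 4 KB pages */

struct scan_cache {
    bool     scanned[NPAGES];  /* true: page was scanned and is safe to run */
    unsigned scans;            /* how many scans were performed */
};

/* Stand-in for the real instruction scanner. */
static void scan_page(struct scan_cache *c, uint32_t pg)
{
    c->scans++;
    c->scanned[pg] = true;
}

/* Before letting the guest execute at addr: scan its page only if
 * we have not already done so. */
void before_execute(struct scan_cache *c, uint32_t addr)
{
    uint32_t pg = (addr >> PGSHIFT) % NPAGES;
    if (!c->scanned[pg])
        scan_page(c, pg);
}

/* On a trapped write to a code page: forget the earlier result so
 * the next execution from that page triggers a rescan. */
void on_code_write(struct scan_cache *c, uint32_t addr)
{
    c->scanned[(addr >> PGSHIFT) % NPAGES] = false;
}
```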
+
+What about self-examining code? We need to protect it
+somehow, possibly by playing tricks with instruction/data TLB caches, or
+by introducing a private segment for code (%cs) that is different from
+the segment used for reads/writes (%ds).
+
+
+Disco has some I/O specific optimizations.
+
+Disco developers clearly had access to IRIX source code.
+
+Performance?
+
+Premise. Are virtual machines the preferred approach to extending
+operating systems? Have scalable multiprocessors materialized?
+
+John Scott Robin, Cynthia E. Irvine. Analysis of the
+Intel Pentium's Ability to Support a Secure Virtual Machine
+Monitor.
+
+Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. Virtualizing
+I/O Devices on VMware Workstation's Hosted Virtual Machine
+Monitor. In Proceedings of the 2001 Usenix Technical Conference.
+
+Kevin Lawton, Drew Northup. Plex86 Virtual
+Machine.
+
+Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris,
+Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield. Xen
+and the Art of Virtualization. SOSP 2003.
+
+Keith Adams, Ole Agesen. A comparison of
+software and hardware techniques for x86 virtualization. ASPLOS 2006.
+ + + + + -- cgit v1.2.3