From 01a6c054d548d9fff8bbdfac4d3f3de4ae8677a1 Mon Sep 17 00:00:00 2001
From: Austin Clements Required reading: Disco What is a virtual machine? IBM definition: a fully protected and
-isolated copy of the underlying machine's hardware. Another view is that it provides another example of a kernel API.
-In contrast to other kernel APIs (unix, microkernel, and exokernel),
-the virtual machine operating system exports as the kernel API the
-processor API (e.g., the x86 interface). Thus, each program running
-in user space sees the services offered by a processor, and each
-program sees its own processor. Of course, we don't want to make a
-system call for each instruction, and in fact one of the main
-challenges in virtual machine operation systems is to design the
-system in such a way that the physical processor executes the virtual
-processor API directly, at processor speed.
-
-
-Virtual machines can be useful for a number of reasons:
-Virtual Machines
-
-Overview
-
-
-
-
-
-
-
-
If your operating system isn't a virtual machine operating system, -what are the alternatives? Processor simulation (e.g., bochs) or -binary emulation (WINE). Simulation runs instructions purely in -software and is slow (e.g., 100x slow down for bochs); virtualization -gets out of the way whenever possible and can be efficient. - -
Simulation gives portability whereas virtualization focuses on -performance. However, this means that you need to model your hardware -very carefully in software. Binary emulation focuses on just getting -system call for a particular operating system's interface. Binary -emulation can be hard because it is targetted towards a particular -operating system (and even that can change between revisions). -
- -To provide each process with its own virtual processor that exports -the same API as the physical processor, what features must -the virtual machine operating system virtualize? -
Virtual machine monitors (VMM) can be implemented in two ways: -
read()
).The three primary functions of a virtual machine monitor are: -
-Understanding memory virtualization. Let's consider the MIPS example -from the paper. Ideally, we'd be able to intercept and rewrite all -memory address references. (e.g., by intercepting virtual memory -calls). Why can't we do this on the MIPS? (There are addresses that -don't go through address translation --- but we don't want the virtual -machine to directly access memory!) What does Disco do to get around -this problem? (Relink the kernel outside this address space.) -
- --Having gotten around that problem, how do we handle things in general? -
--// Disco's tlb miss handler. -// Called when a memory reference for virtual adddress -// 'VA' is made, but there is not VA->MA (virtual -> machine) -// mapping in the cpu's TLB. -void tlb_miss_handler (VA) -{ - // see if we have a mapping in our "shadow" tlb (which includes - // "main" tlb) - tlb_entry *t = tlb_lookup (thiscpu->l2tlb, va); - if (t && defined (thiscpu->pmap[t->pa])) // is there a MA for this PA? - tlbwrite (va, thiscpu->pmap[t->pa], t->otherdata); - else if (t) - // get a machine page, copy physical page into, and tlbwrite - else - // trap to the virtual CPU/OS's handler -} - -// Disco's procedure which emulates the MIPS -// instruction which writes to the tlb. -// -// VA -- virtual addresss -// PA -- physical address (NOT MA machine address!) -// otherdata -- perms and stuff -void emulate_tlbwrite_instruction (VA, PA, otherdata) -{ - tlb_insert (thiscpu->l2tlb, VA, PA, otherdata); // cache - if (!defined (thiscpu->pmap[PA])) { // fill in pmap dynamically - MA = allocate_machine_page (); - thiscpu->pmap[PA] = MA; // See 4.2.2 - thiscpu->pmapbackmap[MA] = PA; - thiscpu->memmap[MA] = VA; // See 4.2.3 (for TLB shootdowns) - } - tlbwrite (va, thiscpu->pmap[PA], otherdata); -} - -// Disco's procedure which emulates the MIPS -// instruction which read the tlb. -tlb_entry *emulate_tlbread_instruction (VA) -{ - // Must return a TLB entry that has a "Physical" address; - // This is recorded in our secondary TLB cache. - // (We don't have to read from the hardware TLB since - // all writes to the hardware TLB are mediated by Disco. - // Thus we can always keep the l2tlb up to date.) - return tlb_lookup (thiscpu->l2tlb, va); -} -- -
Requirements: -
The MIPS didn't quite meet the second criteria, as discussed -above. But, it does have a supervisor mode that is between user mode and -kernel mode where any privileged instruction will trap.
- -What might a the VMM trap handler look like?
--void privilege_trap_handler (addr) { - instruction, args = decode_instruction (addr) - switch (instruction) { - case foo: - emulate_foo (thiscpu, args, ...); - break; - case bar: - emulate_bar (thiscpu, args, ...); - break; - case ...: - ... - } -} --
The emulator_foo
bits will have to evaluate the
-state of the virtual CPU and compute the appropriate "fake" answer.
-
What sort of state is needed in order to appropriately emulate all -of these things? -
-- all user registers -- CPU specific regs (e.g. on x86, %crN, debugging, FP...) -- page tables (or tlb) -- interrupt tables --This is needed for each virtual processor. - - -
We intercept all communication to the I/O devices: read/writes to -reserved memory addresses cause page faults into special handlers -which will emulate or pass through I/O as appropriate. -
- --In a system like Disco, the sequence would look something like: -
-Interrupts will require some additional work: -
-The above can be slow! So sometimes you want the guest operating -system to be aware that it is a guest and allow it to avoid the slow -path. Special device drivers or changing instructions that would cause -traps into memory read/write instructions. -
- -VMware, unlike Disco, runs as an application on a guest OS and -cannot modify the guest OS. Furthermore, it must virtualize the x86 -instead of MIPS processor. Both of these differences make good design -challenges. - -
The first challenge is that the monitor runs in user space, yet it -must dispatch traps and it must execute privilege instructions, which -both require kernel privileges. To address this challenge, the -monitor downloads a piece of code, a kernel module, into the guest -OS. Most modern operating systems are constructed as a core kernel, -extended with downloadable kernel modules. -Privileged users can insert kernel modules at run-time. - -
The monitor downloads a kernel module that reads the IDT, copies -it, and overwrites the hard-wired entries with addresses for stubs in -the just downloaded kernel module. When a trap happens, the kernel -module inspects the PC, and either forwards the trap to the monitor -running in user space or to the guest OS. If the trap is caused -because a guest OS execute a privileged instructions, the monitor can -emulate that privilege instruction by asking the kernel module to -perform that instructions (perhaps after modifying the arguments to -the instruction). - -
The second challenge is virtualizing the x86 - instructions. Unfortunately, x86 doesn't meet the 3 requirements for - CPU virtualization. the first two requirements above. If you run - the CPU in ring 3, most x86 instructions will be fine, - because most privileged instructions will result in a trap, which - can then be forwarded to vmware for emulation. For example, - consider a guest OS loading the root of a page table in CR3. This - results in trap (the guest OS runs in user space), which is - forwarded to the monitor, which can emulate the load to CR3 as - follows: - -
-// addr is a physical address -void emulate_lcr3 (thiscpu, addr) -{ - thiscpu->cr3 = addr; - Pte *fakepdir = lookup (addr, oldcr3cache); - if (!fakepdir) { - fakedir = ppage_alloc (); - store (oldcr3cache, addr, fakedir); - // May wish to scan through supplied page directory to see if - // we have to fix up anything in particular. - // Exact settings will depend on how we want to handle - // problem cases below and our own MM. - } - asm ("movl fakepdir,%cr3"); - // Must make sure our page fault handler is in sync with what we do here. -} -- -
To virtualize the x86, the monitor must intercept any modifications -to the page table and substitute appropriate responses. And update -things like the accessed/dirty bits. The monitor can arrange for this -to happen by making all page table pages inaccessible so that it can -emulate loads and stores to page table pages. This setup allow the -monitor to virtualize the memory interface of the x86.
- -Unfortunately, not all instructions that must be virtualized result -in traps: -
pushf/popf
: FL_IF
is handled different,
- for example. In user-mode setting FL_IF is just ignored.push
, pop
, mov
)
- that reads or writes from %cs
, which contains the
- privilege level.
-How can we virtualize these instructions? An approach is to decode
-the instruction stream that is provided by the user and look for bad
-instructions. When we find them, replace them with an interrupt
-(INT 3
) that will allow the VMM to handle it
-correctly. This might look something like:
-
-void initcode () { - scan_for_nonvirtual (0x7c00); -} - -void scan_for_nonvirtualizable (thiscpu, startaddr) { - addr = startaddr; - instr = disassemble (addr); - while (instr is not branch or bad) { - addr += len (instr); - instr = disassemble (addr); - } - // remember that we wanted to execute this instruction. - replace (addr, "int 3"); - record (thiscpu->rewrites, addr, instr); -} - -void breakpoint_handler (tf) { - oldinstr = lookup (thiscpu->rewrites, tf->eip); - if (oldinstr is branch) { - newcs:neweip = evaluate branch - scan_for_nonvirtualizable (thiscpu, newcs:neweip) - return; - } else { // something non virtualizable - // dispatch to appropriate emulation - } -} --
All pages must be scanned in this way. Fortunately, most pages -probably are okay and don't really need any special handling so after -scanning them once, we can just remember that the page is okay and let -it run natively. -
- -What if a guest OS generates instructions, writes them to memory, -and then wants to execute them? We must detect self-modifying code -(e.g. must simulate buffer overflow attacks correctly.) When a write -to a physical page that happens to be in code segment happens, must -trap the write and then rescan the affected portions of the page.
- -What about self-examining code? Need to protect it some -how---possibly by playing tricks with instruction/data TLB caches, or -introducing a private segment for code (%cs) that is different than -the segment used for reads/writes (%ds). -
- --Disco has some I/O specific optimizations. -
--Disco developers clearly had access to IRIX source code. -
-Performance?
-Premise. Are virtual machine the preferred approach to extending -operating systems? Have scalable multiprocessors materialized?
- -John Scott Robin, Cynthia E. Irvine. Analysis of the -Intel Pentium's Ability to Support a Secure Virtual Machine -Monitor.
- -Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. Virtualizing -I/O Devices on VMware Workstation's Hosted Virtual Machine -Monitor. In Proceedings of the 2001 Usenix Technical Conference.
- -Kevin Lawton, Drew Northup. Plex86 Virtual -Machine.
- -Xen -and the Art of Virtualization, Paul Barham, Boris -Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf -Neugebauer, Ian Pratt, Andrew Warfield, SOSP 2003
- -A comparison of -software and hardware techniques for x86 virtualizatonKeith Adams -and Ole Agesen, ASPLOS 2006
- - - - - -- cgit v1.2.3