<html>
<head>
<title>Microkernel lecture</title>
</head>
<body>

<h1>Microkernels</h1>

<p>Required reading: Improving IPC by kernel design (Liedtke, SOSP 1993).

<h2>Overview</h2>

<p>This lecture looks at the microkernel organization.  In a
microkernel, services that a monolithic kernel implements in the
kernel run as user-level programs.  For example, the file system, UNIX
process management, the pager, and the network protocols each run in a
separate user-level address space.  The microkernel itself supports
only the services that are necessary to allow system services to run
well in user space; a typical microkernel has at least support for
creating address spaces, threads, and inter-process communication.

<p>The potential advantages of a microkernel are simplicity of the
kernel (it is small), isolation of operating system components (each
runs in its own user-level address space), and flexibility (we can
have both a file server and a database server, for example).  One
potential disadvantage is performance loss, because what requires a
single system call in a monolithic kernel may require multiple system
calls and context switches in a microkernel.

<p>One way in which microkernels differ from each other is the exact
kernel API they implement.  For example, Mach (a system developed at
CMU, which influenced a number of commercial operating systems) has
the following system calls: processes (create, terminate, suspend,
resume, priority, assign, info, threads), threads (fork, exit, join,
detach, yield, self), ports and messages (a port is a unidirectional
communication channel with a message queue and supporting primitives
to send, destroy, etc.), and regions/memory objects (allocate,
deallocate, map, copy, inherit, read, write).

<p>Some microkernels are more "microkernel" than others.  For example,
some implement the pager in user space but the basic virtual memory
abstractions in the kernel (e.g., Mach); others are more extreme and
implement most of the virtual memory in user space (L4).  Yet others
are less extreme: many servers run in their own address spaces, but in
kernel mode (Chorus).

<p>All microkernels support multiple threads per address space.  xv6
doesn't, and Unix didn't until recently; why?  Because in Unix, system
services are typically implemented in the kernel, and those are the
primary programs that need multiple threads to handle events
concurrently (e.g., waiting for the disk while processing new I/O
requests).  In a microkernel, these services are implemented in
user-level address spaces, so they need a mechanism for handling
operations concurrently.  (Of course, one can argue that if fork is
efficient enough, there is no need for threads.)
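<p>To make that cost comparison concrete, here is a minimal sketch in
C of how a read() that is a single system call in a monolithic kernel
becomes message traffic to a user-level file server.  The ipc_call()
primitive, fs_server, and the message formats are all made up for
illustration; this is not real Mach or L4 code.

<pre>
/* Sketch only: ipc_call() is a hypothetical kernel primitive that
 * sends a request to a server thread and blocks until the reply
 * arrives. */

typedef unsigned long tid_t;

struct fs_request { int op; int fd; unsigned long off, len; };
struct fs_reply   { long result; char data[4096]; };
enum { FS_READ = 1 };

extern long ipc_call(tid_t server, void *req, unsigned long reqlen,
                     void *rep, unsigned long replen);
extern tid_t fs_server;    /* thread ID of the user-level file server */

long my_read(int fd, char *buf, unsigned long n)
{
    struct fs_request req = { FS_READ, fd, 0, n };
    struct fs_reply rep;
    long i;

    /* One trap in a monolithic kernel; here at least two kernel
     * entries (send plus receive), and the file server may do further
     * IPCs of its own, e.g. to a disk-driver task, before replying. */
    if (ipc_call(fs_server, &req, sizeof req, &rep, sizeof rep) != 0)
        return -1;
    for (i = 0; i != rep.result; i++)   /* copy reply into place */
        buf[i] = rep.data[i];
    return rep.result;
}
</pre>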
+ +<p>L4 is a "second-generation" microkernel, with 7 calls: IPC (of +which there are several types), id_nearest (find a thread with an ID +close the given ID), fpage_unmap (unmap pages, mapping is done as a +side-effect of IPC), thread_switch (hand processor to specified +thread), lthread_ex_regs (manipulate thread registers), +thread_schedule (set scheduling policies), task_new (create a new +address space with some default number of threads). These calls +provide address spaces, tasks, threads, interprocess communication, +and unique identifiers. An address space is a set of mappings. +Multiple threads may share mappings, a thread may grants mappings to +another thread (through IPC). Task is the set of threads sharing an +address space. + +<p>A thread is the execution abstraction; it belongs to an address +space, a UID, a register set, a page fault handler, and an exception +handler. A UID of a thread is its task number plus the number of the +thread within that task. + +<p>IPC passes data by value or by reference to another address space. +It also provide for sequence coordination. It is used for +communication between client and servers, to pass interrupts to a +user-level exception handler, to pass page faults to an external +pager. In L4, device drivers are implemented has a user-level +processes with the device mapped into their address space. +Linux runs as a user-level process. + +<p>L4 provides quite a scala of messages types: inline-by-value, +strings, and virtual memory mappings. The send and receive descriptor +specify how many, if any. + +<p>In addition, there is a system call for timeouts and controling +thread scheduling. + +<h2>L3/L4 paper discussion</h2> + +<ul> + +<li>This paper is about performance. What is a microsecond? Is 100 +usec bad? Is 5 usec so much better we care? How many instructions +does 50-Mhz x86 execute in 100 usec? What can we compute with that +number of instructions? How many disk operations in that time? How +many interrupts can we take? (The livelock paper, which we cover in a +few lectures, mentions 5,000 network pkts per second, and each packet +generates two interrrupts.) + +<li>In performance calculations, what is the appropriate/better metric? +Microseconds or cycles? + +<li>Goal: improve IPC performance by a factor 10 by careful kernel +design that is fully aware of the hardware it is running on. +Principle: performance rules! Optimize for the common case. Because +in L3 interrupts are propagated to user-level using IPC, the system +may have to be able to support many IPCs per second (as many as the +device can generate interrupts). + +<li>IPC consists of transfering control and transfering data. The +minimal cost for transfering control is 127 cycles, plus 45 cycles for +TLB misses (see table 3). What are the x86 instructions to enter and +leave the kernel? (int, iret) Why do they consume so much time? +(Flush pipeline) Do modern processors perform these operations more +efficient? Worse now. Faster processors optimized for straight-line +code; Traps/Exceptions flush deeper pipeline, cache misses cost more +cycles. 
<li>What are the 5 TLB misses?  1) B's thread control block; loading
%cr3 flushes the TLB, so 2) the kernel text causes a miss; the iret
then accesses both 3) the stack and 4+5) the user text: two pages,
because B's user code also looks at the message.

<li>Interface:
<ul>
<li>call (threadID, send-message, receive-message, timeout);
<li>reply_and_receive (reply-message, receive-message, timeout);
</ul>

<li>Optimizations:
<ul>

<li>New system call: reply_and_receive.  Effect: 2 system calls per
RPC.

<li>Complex messages: direct strings, indirect strings, and memory
objects.

<li>Direct transfer by temporary mapping through a communication
window.  The communication window is mapped in B's address space and
in A's kernel address space; why is this better than just mapping a
page shared between A's and B's address spaces?  1) Multi-level
security: a shared page makes it hard to reason about information
flow.  2) The receiver can't check message legality (the contents
might change after the check).  3) When a server has many clients, it
could run out of virtual address space, and the shared memory regions
would have to be established ahead of time.  4) It is not application
friendly, since the data may already be at another address, i.e.,
applications would have to copy anyway, possibly resulting in more
copies.

<li>Why not use the following approach: map the region copy-on-write
(or read-only) in A's address space after the send, and read-only in
B's address space?  Because then B may have to copy the data, or
cannot receive data at its final destination.

<li>On the x86 this is implemented by copying B's PDEs into A's
address space.  Why two PDEs?  (The maximum message size is 4 Mbytes,
so the transfer is guaranteed to work if the message starts in the
bottom 4 Mbytes of an 8-Mbyte mapped region.)  Why not just copy PTEs?
That would be much more expensive.

<li>What does it mean for the TLB to be "window clean"?  Why do we
care?  It means the TLB contains no mappings within the communication
window.  We care because mapping is cheap (copy a PDE), but
invalidation is not: the x86 only lets you invalidate one page at a
time, or the whole TLB.  Does TLB invalidation of the communication
window turn out to be a problem?  Not usually, because the kernel has
to load %cr3 during IPC anyway.

<li>The thread control block (TCB) holds registers, links to various
doubly-linked lists, the pgdir, the UID, etc.  The lower part of a
thread's UID contains its TCB number.  One can also deduce the TCB
address from the stack by ANDing the SP with a bitmask (the SP comes
out of the TSS when just switching to kernel mode).

<li>The kernel stack is on the same page as the TCB.  Why?  1) It
minimizes TLB misses (accessing the kernel stack brings in the TCB).
2) It allows very efficient access to the TCB: just mask off the lower
12 bits of %esp (see the sketch below, after the wakeup-queue item).
3) With VM, the lower 32 bits of the thread ID can indicate which TCB
to use; using one page per TCB means there is no need to check whether
a thread is swapped out (simply don't map that TCB if it shouldn't be
accessed).

<li>Invariant on queues: queues always hold in-memory TCBs.

<li>Wakeup queue: a set of 8 unordered wakeup lists (wakeup time mod
8), and a smart representation of time so that 32-bit integers can be
used in the common case (base + offset in msec; bump the base and
recompute all offsets every ~4 hours; the maximum timeout is ~24 days,
2^31 msec).
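<p>One plausible way to code up that wakeup-queue representation (all
names and details here, including wake() and the exact hashing, are
made up for illustration, not taken from the L3/L4 sources):

<pre>
#define NLISTS 8

struct tcb {
    struct tcb *wakeup_next;     /* link in one unordered wakeup list */
    unsigned int wakeup_offset;  /* wakeup time minus time_base, msec */
};

static struct tcb *wakeup_list[NLISTS];
static unsigned long long time_base;   /* bumped every ~4 hours; all
                                          offsets recomputed then */

static void set_timeout(struct tcb *t, unsigned long long wakeup_msec)
{
    int i = wakeup_msec % NLISTS;      /* wakeup time mod 8 */
    t->wakeup_offset = (unsigned int)(wakeup_msec - time_base);
    t->wakeup_next = wakeup_list[i];
    wakeup_list[i] = t;
}

extern void wake(struct tcb *t);       /* hypothetical */

/* Clock interrupt: scan just one short, unordered list; removal of
   woken threads from the list is elided here. */
static void clock_tick(unsigned long long now_msec)
{
    struct tcb *t;
    for (t = wakeup_list[now_msec % NLISTS]; t; t = t->wakeup_next)
        if (time_base + t->wakeup_offset <= now_msec)
            wake(t);
}
</pre>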
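<p>And the TCB-lookup trick promised above in the kernel-stack item,
as a sketch; the layout (one 4-Kbyte page, TCB at the bottom, kernel
stack growing down from the top) is assumed for illustration:

<pre>
#define PAGE_SIZE 4096

struct tcb {
    unsigned long uid;
    /* ... registers, queue links, pgdir, ... */
};

/* Mask off the low 12 bits of the kernel stack pointer to find the
   current thread's TCB; gcc-style x86-32 inline assembly assumed. */
static struct tcb *current_tcb(void)
{
    unsigned long sp;
    __asm__("movl %%esp, %0" : "=r"(sp));
    return (struct tcb *)(sp & ~((unsigned long)PAGE_SIZE - 1));
}
</pre>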
<li>What is the problem addressed by lazy scheduling?  The
conventional approach to scheduling:
<pre>
  A sends message to B:
    Move A from ready queue to waiting queue
    Move B from waiting queue to ready queue
  This requires 58 cycles, including 4 TLB misses.  What are the TLB misses?
    One each for the head of the ready and waiting queues
    One each for the previous queue element during the remove
</pre>

<li>Lazy scheduling:
<pre>
  Ready queue must contain all ready threads except the current one
  Might contain other threads that aren't actually ready, though
  Each wakeup queue contains all threads waiting in that queue
  Again, might contain other threads, too
  Scheduler removes inappropriate queue entries when scanning the queue
</pre>

<li>Why does this help performance?  There are only three situations
in which a thread gives up the CPU but stays ready: the send syscall
(as opposed to call), preemption, and hardware interrupts.  So very
often the kernel can IPC into a thread without putting it on the ready
list.

<li>Direct process switch.  This section just says you should use
kernel threads instead of continuations.

<li>Short messages via registers.

<li>Avoiding unnecessary copies.  Basically, one can send and receive
messages with the same vector.  This makes forwarding efficient, which
is important for the Clans/Chiefs model.

<li>Segment register optimization.  Loading segment registers is slow:
the processor has to access the GDT, etc.  But the common case is that
user code doesn't change its segment registers.  Observation: it is
faster to check a segment register than to load it.  So just check
that the segment registers are okay, and only load them if user code
changed them.

<li>Registers for parameter passing wherever possible: system calls
and IPC.

<li>Minimizing TLB misses.  Try to cram as many things as possible
onto the same page: the IPC kernel code, GDT, IDT, and TSS all on one
page.  (Actually, the whole tables may not fit, but the important
parts of the tables can be put on the same page; perhaps only the
beginning of the TSS, IDT, or GDT?)

<li>Coding tricks: short offsets, avoid jumps, avoid checks, pack
often-used data on the same cache lines, lazily save/restore CPU state
like debug and FPU registers.  Much of the kernel is written in
assembly!

</ul>

<li>What are the results?  Figures 7 and 8 look good.

<li>Is fast IPC enough to get good overall system performance?  This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.

<li>Is the principle of optimizing for performance right?  In general,
it is wrong to optimize for performance; other things matter more.  Is
IPC the one exception?  Maybe, maybe not.  Was Liedtke fighting a
losing battle against CPU makers?  Should fast IPC be a hardware
issue, or just an OS issue?

</ul>

</body>
</html>