author    | Austin Clements <[email protected]> | 2011-09-07 11:49:14 -0400
committer | Austin Clements <[email protected]> | 2011-09-07 11:49:14 -0400
commit    | 01a6c054d548d9fff8bbdfac4d3f3de4ae8677a1 (patch)
tree      | 4320eb3d09f31f4a628b80d45482a72ee7c3956b /web/l-mkernel.html
parent    | 64a03bd7aa5c03a626a2da4730a45fcceea75322 (diff)
download  | xv6-labs-01a6c054d548d9fff8bbdfac4d3f3de4ae8677a1.tar.gz xv6-labs-01a6c054d548d9fff8bbdfac4d3f3de4ae8677a1.tar.bz2 xv6-labs-01a6c054d548d9fff8bbdfac4d3f3de4ae8677a1.zip
Remove web directory; all cruft or moved to 6.828 repo
Diffstat (limited to 'web/l-mkernel.html')
-rw-r--r-- | web/l-mkernel.html | 262
1 file changed, 0 insertions, 262 deletions
diff --git a/web/l-mkernel.html b/web/l-mkernel.html
deleted file mode 100644
index 2984796..0000000
--- a/web/l-mkernel.html
+++ /dev/null
@@ -1,262 +0,0 @@
<html>
<head>
<title>Microkernel lecture</title>
</head>
<body>

<h1>Microkernels</h1>

<p>Required reading: Improving IPC by kernel design

<h2>Overview</h2>

<p>This lecture looks at the microkernel organization.  In a
microkernel, services that a monolithic kernel implements inside the
kernel run as user-level programs.  For example, the file system, UNIX
process management, the pager, and the network protocols each run in a
separate user-level address space.  The microkernel itself supports
only the services that are necessary to let system services run well
in user space; a typical microkernel has at least support for creating
address spaces, threads, and interprocess communication.

<p>The potential advantages of a microkernel are simplicity of the
kernel (it is small), isolation of operating system components (each
runs in its own user-level address space), and flexibility (we can
have both a file server and a database server, for example).  One
potential disadvantage is performance loss: what requires a single
system call in a monolithic kernel may require several system calls
and context switches in a microkernel.

<p>One way in which microkernels differ from each other is the exact
kernel API they implement.  For example, Mach (a system developed at
CMU, which influenced a number of commercial operating systems) has
the following system calls: processes (create, terminate, suspend,
resume, priority, assign, info, threads), threads (fork, exit, join,
detach, yield, self), ports and messages (a port is a unidirectional
communication channel with a message queue and supporting primitives
to send, destroy, etc.), and regions/memory objects (allocate,
deallocate, map, copy, inherit, read, write).

<p>Some microkernels are more "microkernel" than others.  For example,
some implement the pager in user space but keep the basic virtual
memory abstractions in the kernel (e.g., Mach); others are more
extreme and implement most of the virtual memory in user space (L4).
Yet others are less extreme: many servers run in their own address
spaces, but in kernel mode (Chorus).

<p>All microkernels support multiple threads per address space.  xv6
and Unix until recently didn't; why?  Because in Unix, system services
are typically implemented in the kernel, and those are the primary
programs that need multiple threads to handle events concurrently
(waiting for the disk while processing new I/O requests).  In
microkernels these services are implemented in user-level address
spaces, so they need a mechanism to handle operations concurrently.
(Of course, one can argue that if fork is efficient enough, there is
no need for threads.)
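<p>As a concrete illustration of the performance concern above (one
system call in a monolithic kernel becomes an IPC round trip in a
microkernel), here is a minimal C sketch.  The names (ipc_call,
fs_server, struct msg) are invented for illustration and are not a
real microkernel API; in a monolithic kernel the same operation is a
single trap, while here it costs at least two kernel entries plus
context switches to the server and back.

<pre>
/* Hedged sketch: ipc_call(), struct msg, and fs_server are invented
 * for illustration; a real microkernel's IPC interface differs. */
#include <stddef.h>
#include <string.h>

enum { MSG_FS_READ = 1, MSG_MAX_DATA = 128 };

struct msg {
    int    op;                 /* request/reply opcode */
    int    fd;
    size_t len;
    char   data[MSG_MAX_DATA];
};

typedef int thread_id;
static const thread_id fs_server = 42;   /* hypothetical file-server thread */

/* Stub standing in for the kernel's send-and-block-for-reply primitive. */
static void ipc_call(thread_id dst, const struct msg *req, struct msg *rep)
{
    (void)dst;
    rep->op  = req->op;
    rep->len = req->len < MSG_MAX_DATA ? req->len : MSG_MAX_DATA;
    memset(rep->data, 0, rep->len);      /* pretend the server filled this in */
}

/* What read(fd, buf, n) becomes when the file system is a user-level server:
 * marshal a request, IPC to the server, copy the reply back to the caller. */
static size_t micro_read(int fd, void *buf, size_t n)
{
    struct msg req = { .op = MSG_FS_READ, .fd = fd, .len = n };
    struct msg rep;
    ipc_call(fs_server, &req, &rep);
    memcpy(buf, rep.data, rep.len);
    return rep.len;
}
</pre>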
<h2>L3/L4</h2>

<p>L3 is a predecessor to L4.  L3 provides data persistence, DOS
emulation, and an ELAN runtime system.  L4 is a reimplementation of
L3, but without the data persistence.  L4KA is a project at
sourceforge.net, and you can download the code for the latest
incarnation of L4 from there.

<p>L4 is a "second-generation" microkernel, with 7 calls: IPC (of
which there are several types), id_nearest (find a thread with an ID
close to the given ID), fpage_unmap (unmap pages; mapping is done as a
side effect of IPC), thread_switch (hand the processor to the
specified thread), lthread_ex_regs (manipulate thread registers),
thread_schedule (set scheduling policies), and task_new (create a new
address space with some default number of threads).  These calls
provide address spaces, tasks, threads, interprocess communication,
and unique identifiers.  An address space is a set of mappings.
Multiple threads may share mappings, and a thread may grant mappings
to another thread (through IPC).  A task is the set of threads sharing
an address space.

<p>A thread is the execution abstraction; it belongs to an address
space and has a UID, a register set, a page-fault handler, and an
exception handler.  The UID of a thread is its task number plus the
number of the thread within that task.
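<p>As a small illustration of the thread-UID scheme just described,
here is a hedged C sketch; the field widths are invented for this
example and do not match L4's actual identifier layout.

<pre>
/* Hypothetical layout: a thread UID packs the task number together with
 * the thread's index within that task. */
#include <stdint.h>

#define THREAD_BITS 6                      /* assume up to 64 threads per task */
#define THREAD_MASK ((1u << THREAD_BITS) - 1)

typedef uint32_t l4_uid;

static inline l4_uid uid_make(uint32_t task, uint32_t thread)
{
    return (task << THREAD_BITS) | (thread & THREAD_MASK);
}

static inline uint32_t uid_task(l4_uid uid)   { return uid >> THREAD_BITS; }
static inline uint32_t uid_thread(l4_uid uid) { return uid & THREAD_MASK; }
</pre>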
<p>IPC passes data by value or by reference to another address space.
It also provides for sequence coordination.  It is used for
communication between clients and servers, to pass interrupts to a
user-level exception handler, and to pass page faults to an external
pager.  In L4, device drivers are implemented as user-level processes
with the device mapped into their address space.  Linux runs as a
user-level process.

<p>L4 provides quite a range of message types: inline-by-value,
strings, and virtual memory mappings.  The send and receive
descriptors specify how many of each, if any.

<p>In addition, there is a system call for timeouts and controlling
thread scheduling.

<h2>L3/L4 paper discussion</h2>

<ul>

<li>This paper is about performance.  What is a microsecond?  Is 100
usec bad?  Is 5 usec so much better that we care?  How many
instructions does a 50-MHz x86 execute in 100 usec?  (At 50 MHz,
100 usec is 5,000 cycles, so on the order of a few thousand
instructions.)  What can we compute with that number of instructions?
How many disk operations in that time?  How many interrupts can we
take?  (The livelock paper, which we cover in a few lectures, mentions
5,000 network packets per second, and each packet generates two
interrupts.)

<li>In performance calculations, what is the appropriate/better
metric?  Microseconds or cycles?

<li>Goal: improve IPC performance by a factor of 10 through careful
kernel design that is fully aware of the hardware it is running on.
Principle: performance rules!  Optimize for the common case.  Because
in L3 interrupts are propagated to user level using IPC, the system
may have to support many IPCs per second (as many as the device can
generate interrupts).

<li>IPC consists of transferring control and transferring data.  The
minimal cost of transferring control is 127 cycles, plus 45 cycles for
TLB misses (see table 3).  What are the x86 instructions to enter and
leave the kernel?  (int, iret)  Why do they consume so much time?
(They flush the pipeline.)  Do modern processors perform these
operations more efficiently?  No, worse: faster processors are
optimized for straight-line code, traps and exceptions flush a deeper
pipeline, and cache misses cost more cycles.

<li>What are the 5 TLB misses?  1) B's thread control block; loading
%cr3 flushes the TLB, so 2) the kernel text causes a miss; iret then
accesses both 3) B's stack and 4+5) B's user text -- two pages,
because B's user code looks at the message.

<li>Interface:
<ul>
<li>call (threadID, send-message, receive-message, timeout);
<li>reply_and_receive (reply-message, receive-message, timeout);
</ul>

<li>Optimizations:
<ul>

<li>New system call: reply_and_receive.  Effect: 2 system calls per
RPC.

<li>Complex messages: direct strings, indirect strings, and memory
objects.

<li>Direct transfer by temporary mapping through a communication
window.  The communication window is mapped into B's address space and
into A's kernel address space; why is this better than just mapping a
page shared between A's and B's address spaces?  1) Under multi-level
security, shared memory makes it hard to reason about information
flow; 2) the receiver can't check message legality (the contents might
change after the check); 3) when a server has many clients, it could
run out of virtual address space, and a shared memory region has to be
established ahead of time; 4) it is not application friendly, since
the data may already be at another address, i.e. applications would
have to copy anyway--possibly more copies.

<li>Why not use the following approach: map the region copy-on-write
(or read-only) into A's address space after the send, and read-only
into B's address space?  Then B may have to copy the data, or cannot
receive data at its final destination.

<li>On the x86 this is implemented by copying B's PDE into A's address
space.  Why two PDEs?  (The maximum message size is 4 Mbyte, so the
transfer is guaranteed to work if the message starts in the bottom
4 Mbyte of an 8 Mbyte mapped region.)  Why not just copy PTEs?  That
would be much more expensive.

<li>What does it mean for the TLB to be "window clean"?  Why do we
care?  It means the TLB contains no mappings within the communication
window.  We care because mapping is cheap (copy a PDE), but
invalidation is not; the x86 only lets you invalidate one page at a
time, or the whole TLB.  Does TLB invalidation of the communication
window turn out to be a problem?  Not usually, because we have to load
%cr3 during IPC anyway.

<li>Thread control block: registers, links to various doubly-linked
lists, pgdir, uid, etc.  The lower part of a thread's UID contains its
TCB number.  The kernel can also deduce the TCB address from the stack
by ANDing the SP with a bitmask (the SP comes out of the TSS when just
switching to kernel mode).

<li>The kernel stack is on the same page as the TCB.  Why?  1) It
minimizes TLB misses (accessing the kernel stack brings in the TCB);
2) it allows very efficient access to the TCB -- just mask off the
lower 12 bits of %esp (see the sketch below); 3) with VM, the lower
32 bits of the thread id can indicate which TCB; using one page per
TCB means there is no need to check whether a thread is swapped out
(simply don't map that TCB if it shouldn't be accessed).
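<p>A hedged C sketch of that trick (the structure layout and names are
invented for illustration, not taken from the L3/L4 sources):

<pre>
/* The TCB sits at the start of a 4 KB page and the thread's kernel stack
 * occupies the top of the same page, growing down toward the TCB.  Masking
 * off the low 12 bits of the kernel stack pointer therefore yields the
 * current thread's TCB without any extra memory reference. */
#include <stdint.h>

#define PAGE_SIZE 4096u

struct tcb {
    uint32_t uid;            /* task number + thread number */
    uint32_t esp, eip;       /* saved registers (abridged) */
    struct tcb *ready_next;  /* link for the ready queue */
    /* ... pgdir, wakeup-queue link, etc.; the kernel stack fills the
     * rest of this page ... */
};

/* On a real kernel this would read %esp directly (inline assembly); here the
 * caller passes any address known to lie on the current kernel stack. */
static inline struct tcb *current_tcb(uintptr_t kernel_sp)
{
    return (struct tcb *)(kernel_sp & ~(uintptr_t)(PAGE_SIZE - 1));
}
</pre>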
<li>Invariant on queues: queues always hold in-memory TCBs.

<li>Wakeup queue: a set of 8 unordered wakeup lists (wakeup time mod
8), and a smart representation of time so that 32-bit integers can be
used in the common case (base + offset in msec; bump the base and
recompute all offsets every ~4 hours; the maximum timeout is ~24 days,
2^31 msec).

<li>What is the problem addressed by lazy scheduling?
The conventional approach to scheduling:
<pre>
  A sends message to B:
    Move A from ready queue to waiting queue
    Move B from waiting queue to ready queue
  This requires 58 cycles, including 4 TLB misses.  What are the TLB misses?
    One each for the head of the ready and waiting queues
    One each for the previous queue element during the remove
</pre>
<li>Lazy scheduling:
<pre>
  Ready queue must contain all ready threads except the current one
  Might contain other threads that aren't actually ready, though
  Each wakeup queue contains all threads waiting in that queue
  Again, might contain other threads, too
  Scheduler removes inappropriate queue entries when scanning the queue
</pre>

<li>Why does this help performance?  There are only three situations
in which a thread gives up the CPU but stays ready: the send syscall
(as opposed to call), preemption, and hardware interrupts.  So very
often the kernel can IPC into a thread without putting it on the ready
list.

<li>Direct process switch.  This section just says you should use
kernel threads instead of continuations.

<li>Short messages via registers.

<li>Avoiding unnecessary copies.  Basically, messages can be sent and
received with the same vector.  This makes forwarding efficient, which
is important for the Clans/Chiefs model.

<li>Segment register optimization.  Loading segment registers is slow:
you have to access the GDT, etc.  But the common case is that user
code doesn't change its segment registers.  Observation: it is faster
to check a segment descriptor than to load it.  So just check that the
segment registers are okay, and only reload them if user code changed
them.

<li>Registers for parameter passing wherever possible: system calls
and IPC.

<li>Minimizing TLB misses.  Try to cram as many things as possible
onto the same page: IPC kernel code, GDT, IDT, TSS, all on the same
page.  The whole tables may not fit, but put the important parts of
the tables on the same page (maybe the beginning of the TSS, IDT, or
GDT only?).

<li>Coding tricks: short offsets, avoid jumps, avoid checks, pack
often-used data on the same cache lines, lazily save/restore CPU state
such as the debug and FPU registers.  Much of the kernel is written in
assembly!

<li>What are the results?  Figures 7 and 8 look good.

<li>Is fast IPC enough to get good overall system performance?  This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.

<li>Is the principle of optimizing for performance right?  In general,
it is wrong to optimize for performance; other things matter more.  Is
IPC the one exception?  Maybe, perhaps not.  Was Liedtke fighting a
losing battle against CPU makers?  Should fast IPC be a hardware
issue, or just an OS issue?

</ul>

</body>