author	Austin Clements <[email protected]>	2011-09-07 11:49:14 -0400
committer	Austin Clements <[email protected]>	2011-09-07 11:49:14 -0400
commit	01a6c054d548d9fff8bbdfac4d3f3de4ae8677a1 (patch)
tree	4320eb3d09f31f4a628b80d45482a72ee7c3956b /web/l-mkernel.html
parent	64a03bd7aa5c03a626a2da4730a45fcceea75322 (diff)
Remove web directory; all cruft or moved to 6.828 repo
Diffstat (limited to 'web/l-mkernel.html')
-rw-r--r--	web/l-mkernel.html	| 262
1 file changed, 0 insertions(+), 262 deletions(-)
diff --git a/web/l-mkernel.html b/web/l-mkernel.html
deleted file mode 100644
index 2984796..0000000
--- a/web/l-mkernel.html
+++ /dev/null
@@ -1,262 +0,0 @@
-<html>
-<head>
-<title>Microkernel lecture</title>
-</head>
-<body>
-
-<h1>Microkernels</h1>
-
-<p>Required reading: Improving IPC by kernel design
-
-<h2>Overview</h2>
-
-<p>This lecture looks at the microkernel organization. In a
-microkernel, services that a monolithic kernel implements in the
-kernel run as user-level programs. For example, the file
-system, UNIX process management, pager, and network protocols each run
-in a separate user-level address space. The microkernel itself
-supports only the services that are necessary to allow system services
-to run well in user space; a typical microkernel has at least support
-for creating address spaces, threads, and interprocess communication.
-
-<p>The potential advantages of a microkernel are simplicity of the
-kernel (small), isolation of operating system components (each runs in
-its own user-level address space), and flexibility (we can have a file
-server and a database server). One potential disadvantage is
-performance loss: what requires a single system call in a monolithic
-kernel may require multiple system calls and context switches in a
-microkernel.
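-
-<p>To make the cost concrete, here is a minimal sketch in C of a read
-in both organizations (the IPC call and message layout are
-hypothetical, not any particular kernel's API):
-
-<pre>
-/* Monolithic kernel: one trap into the kernel does the whole job. */
-n = read(fd, buf, len);            /* single system call */
-
-/* Microkernel (hypothetical API): the file system is a user-level
-   server, so a read becomes an IPC round trip. */
-msg.op = FS_READ;
-msg.fd = fd;
-msg.len = len;
-ipc_call(fs_server, &msg, &reply); /* kernel entered for send and
-                                      receive, plus context switches
-                                      into and out of the server */
-memcpy(buf, reply.data, reply.len);/* and possibly an extra copy */
-</pre>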
-
-<p>One way in which microkernels differ from each other is the exact
-kernel API they implement. For example, Mach (a system developed at
-CMU, which influenced a number of commercial operating systems) has
-the following system calls: processes (create, terminate, suspend,
-resume, priority, assign, info, threads), threads (fork, exit, join,
-detach, yield, self), ports and messages (a port is a unidirectional
-communication channel with a message queue and supporting primitives
-to send, destroy, etc), and regions/memory objects (allocate,
-deallocate, map, copy, inherit, read, write).
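-
-<p>As a rough illustration of the port/message style (the names and
-signatures below are invented for this sketch, not Mach's actual
-system call interface):
-
-<pre>
-/* Hypothetical port-based RPC: the client allocates a reply port,
-   sends a request to the server's port, and blocks for the answer. */
-port_t reply_port = port_allocate();
-req.header.dest  = server_port;     /* queue the request here */
-req.header.reply = reply_port;      /* answer comes back here */
-msg_send(&req);                     /* enqueue on server's port */
-msg_receive(reply_port, &answer);   /* block until server replies */
-</pre>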
-
-<p>Some microkernels are more "microkernel" than others. For example,
-some microkernels implement the pager in user space but the basic
-virtual memory abstractions in the kernel (e.g., Mach); others are
-more extreme and implement most of the virtual memory in user space
-(L4). Yet others are less extreme: many servers run in their own
-address space, but in kernel mode (Chorus).
-
-<p>All microkernels support multiple threads per address space. xv6
-and Unix until recently didn't; why? Because in Unix, system services
-are typically implemented in the kernel, and those are the primary
-programs that need multiple threads to handle events concurrently
-(waiting for disk and processing new I/O requests). In microkernels,
-these services are implemented in user-level address spaces and so
-they need a mechanism to deal with handling operations concurrently.
-(Of course, one can argue that if fork is efficient enough, there is
-no need for threads.)
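-
-<p>A minimal sketch of why a user-level server wants threads (the
-server helpers are hypothetical; POSIX threads are used only for
-illustration):
-
-<pre>
-/* Hypothetical file server: without threads, the whole server would
-   stall whenever one request waits for the disk. */
-void *worker(void *arg) {
-    struct request *req = arg;
-    do_disk_io(req);          /* this request sleeps on the disk... */
-    send_reply(req);
-    return 0;
-}
-
-void server_loop(void) {
-    for (;;) {
-        pthread_t tid;
-        struct request *req = receive_request();
-        pthread_create(&tid, 0, worker, req);  /* ...others proceed */
-    }
-}
-</pre>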
-
-<h2>L3/L4</h2>
-
-<p>L3 is a predecessor to L4. L3 provides data persistence, DOS
-emulation, and an ELAN runtime system. L4 is a reimplementation of L3,
-but without the data persistence. L4KA is a project at
-sourceforge.net, and you can download the code for the latest
-incarnation of L4 from there.
-
-<p>L4 is a "second-generation" microkernel, with 7 calls: IPC (of
-which there are several types), id_nearest (find a thread with an ID
-close to the given ID), fpage_unmap (unmap pages; mapping is done as a
-side-effect of IPC), thread_switch (hand processor to specified
-thread), lthread_ex_regs (manipulate thread registers),
-thread_schedule (set scheduling policies), task_new (create a new
-address space with some default number of threads). These calls
-provide address spaces, tasks, threads, interprocess communication,
-and unique identifiers. An address space is a set of mappings.
-Multiple threads may share mappings, and a thread may grant mappings
-to another thread (through IPC). A task is the set of threads sharing
-an address space.
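-
-<p>The seven calls, written out as C prototypes (only the names come
-from L4; the parameter lists here are illustrative guesses):
-
-<pre>
-int        ipc(threadid_t dest, msg_t *send_msg, msg_t *recv_msg,
-               timeout_t t);                    /* several variants */
-threadid_t id_nearest(threadid_t id);           /* thread w. nearby ID */
-int        fpage_unmap(fpage_t page, int mask); /* undo IPC mappings */
-void       thread_switch(threadid_t to);        /* donate processor */
-int        lthread_ex_regs(threadid_t t, regs_t *regs); /* rd/wr regs */
-int        thread_schedule(threadid_t t, sched_t policy);
-taskid_t   task_new(taskid_t id, int nthreads); /* new address space */
-</pre>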
-
-<p>A thread is the execution abstraction; it belongs to an address
-space and has a UID, a register set, a page fault handler, and an
-exception handler. The UID of a thread is its task number combined
-with the number of the thread within that task.
-
-<p>IPC passes data by value or by reference to another address space.
-It also provides for sequence coordination. It is used for
-communication between clients and servers, to pass interrupts to a
-user-level exception handler, and to pass page faults to an external
-pager. In L4, device drivers are implemented as user-level
-processes with the device mapped into their address space.
-Linux runs as a user-level process.
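-
-<p>A schematic of the external-pager path (all names invented for the
-sketch): the kernel converts a page fault into an IPC sent on the
-faulting thread's behalf.
-
-<pre>
-/* In the kernel's page fault handler: */
-void on_page_fault(struct tcb *t, uint32_t addr) {
-    msg_t fault = { .type = MSG_PAGE_FAULT, .addr = addr };
-    msg_t reply;
-    ipc(t->pager, &fault, &reply, TIMEOUT_NEVER); /* pager maps a page */
-    /* the reply carries a mapping; resuming t retries the access */
-}
-</pre>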
-
-<p>L4 provides quite a range of message types: inline-by-value,
-strings, and virtual memory mappings. The send and receive
-descriptors specify how many of each kind, if any.
-
-<p>In addition, there is a system call for timeouts and controlling
-thread scheduling.
-
-<h2>L3/L4 paper discussion</h2>
-
-<ul>
-
-<li>This paper is about performance. What is a microsecond? Is 100
-usec bad? Is 5 usec so much better we care? How many instructions
-does a 50-MHz x86 execute in 100 usec? What can we compute with that
-number of instructions? How many disk operations in that time? How
-many interrupts can we take? (The livelock paper, which we cover in a
-few lectures, mentions 5,000 network packets per second, and each
-packet generates two interrupts.)
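-
-<p>(Back of the envelope: at 50 MHz, 100 usec is 50 cycles/usec x 100
-usec = 5,000 cycles, i.e. on the order of a few thousand instructions
-per IPC; 5,000 packets/sec with two interrupts each is 10,000
-interrupts/sec, which leaves at most 100 usec per interrupt.)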
-
-<li>In performance calculations, what is the appropriate/better metric?
-Microseconds or cycles?
-
-<li>Goal: improve IPC performance by a factor of 10 by careful kernel
-design that is fully aware of the hardware it is running on.
-Principle: performance rules! Optimize for the common case. Because
-in L3 interrupts are propagated to user-level using IPC, the system
-may have to be able to support many IPCs per second (as many as the
-device can generate interrupts).
-
-<li>IPC consists of transferring control and transferring data. The
-minimal cost for transferring control is 127 cycles, plus 45 cycles for
-TLB misses (see table 3). What are the x86 instructions to enter and
-leave the kernel? (int, iret) Why do they consume so much time?
-(They flush the pipeline.) Do modern processors perform these
-operations more efficiently? No, worse: faster processors are
-optimized for straight-line code; traps/exceptions flush a deeper
-pipeline, and cache misses cost more cycles.
-
-<li>What are the 5 TLB misses? 1) B's thread control block; loading
-%cr3 flushes the TLB, so 2) the kernel text causes a miss; iret
-accesses both 3) the stack and 4+5) the user text (two pages, since
-B's user code looks at the message).
-
-<li>Interface:
-<ul>
-<li>call (threadID, send-message, receive-message, timeout);
-<li>reply_and_receive (reply-message, receive-message, timeout);
-</ul>
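-
-<p>A minimal sketch of an RPC round trip on this interface (the
-initial receive helper and message types are invented for the sketch):
-
-<pre>
-/* Client: call() sends the request and atomically waits for the
-   reply -- one trap for the whole round trip on this side. */
-call(server, &req, &reply, TIMEOUT_NEVER);
-
-/* Server: reply_and_receive() answers the previous client and
-   atomically waits for the next request -- one trap per request. */
-receive(&req);                       /* get the first request */
-for (;;) {
-    handle(&req, &reply);
-    reply_and_receive(&reply, &req, TIMEOUT_NEVER);
-}
-</pre>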
-
-<li>Optimizations:
-<ul>
-
-<li>New system call: reply_and_receive. Effect: 2 system calls per
-RPC instead of 4.
-
-<li>Complex messages: direct strings, indirect strings, and memory
-objects.
-
-<li>Direct transfer by temporary mapping through a communication
-window. The communication window is mapped in B's address space and
-in A's kernel address space; why is this better than just mapping a
-page shared between A's and B's address spaces? 1) With multi-level
-security, persistent sharing makes it hard to reason about
-information flow; 2) the receiver can't check message legality (the
-message might change after the check); 3) when a server has many
-clients, it could run out of virtual address space, and the shared
-memory region has to be established ahead of time; 4) it is not
-application friendly, since the data may already be at another
-address, i.e., applications would have to copy anyway--possibly more
-copies.
-
-<li>Why not use the following approach: map the region copy-on-write
-(or read-only) in A's address space after send and read-only in B's
-address space? Now B may have to copy the data, or cannot receive
-the data at its final destination.
-
-<li>On the x86 this is implemented by copying B's PDE (page directory
-entry) into A's address space. Why two PDEs? (The maximum message
-size is 4 Mbyte, so an 8 Mbyte window guarantees the transfer works
-even when the message does not start at a 4 Mbyte boundary.) Why not
-just copy PTEs? That would be much more expensive.
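-
-<p>A sketch of the mapping step (structure and macro names are made
-up; on a 32-bit x86 one PDE covers 4 Mbyte):
-
-<pre>
-/* Map B's receive region into A's kernel communication window by
-   copying the two page directory entries that cover it: two word
-   stores instead of up to 2048 PTE copies. */
-pde_t *a_pgdir = sender->pgdir, *b_pgdir = receiver->pgdir;
-int i = PDX(recv_addr);                /* B's first 4-Mbyte region */
-a_pgdir[PDX(COMM_WINDOW)]     = b_pgdir[i];
-a_pgdir[PDX(COMM_WINDOW) + 1] = b_pgdir[i + 1];
-/* A now copies the message directly to COMM_WINDOW + page offset */
-</pre>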
-
-<li> What does it mean for the TLB to be "window clean"? Why do we
-care? It means the TLB contains no mappings within the communication
-window. We care because mapping is cheap (copy a PDE), but
-invalidation is not: the x86 only lets you invalidate one page at a
-time, or the whole TLB. Does TLB invalidation of the communication
-window turn out to be a problem? Not usually, because %cr3 has to be
-loaded during IPC anyway, which flushes the whole TLB.
-
-<li>Thread control block: registers, links to various doubly-linked
- lists, pgdir, UID, etc. The lower part of a thread's UID contains
- the TCB number. One can also deduce the TCB address from the stack
- pointer by ANDing the SP with a bitmask (the SP comes out of the TSS
- when just switching to kernel mode).
-
-<li> The kernel stack is on the same page as the TCB. Why? 1) It
-minimizes TLB misses (accessing the kernel stack brings in the TCB);
-2) it allows very efficient access to the TCB -- just mask off the
-lower 12 bits of %esp; 3) with VM, the lower 32 bits of the thread id
-can indicate which TCB; using one page per TCB means there is no need
-to check whether a thread is swapped out (simply don't map that TCB
-if it shouldn't be accessed).
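-
-<p>The mask trick as one line of C (assuming 4-Kbyte pages; the type
-name is hypothetical):
-
-<pre>
-/* The TCB sits at the base of the page that holds the kernel stack,
-   so clearing the low 12 bits of the stack pointer finds it. */
-struct tcb *cur = (struct tcb *)(esp & ~0xfffu);
-</pre>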
-
-<li>Invariant on queues: queues always hold in-memory TCBs.
-
-<li>Wakeup queue: a set of 8 unordered wakeup lists (indexed by
-wakeup time mod 8), and a smart representation of time so that 32-bit
-integers can be used in the common case (base + offset in msec; bump
-the base and recompute all offsets every ~4 hours; the maximum
-timeout is ~24 days, 2^31 msec).
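-
-<p>A sketch of the bucketing (field and variable names are guesses
-for illustration):
-
-<pre>
-/* 8 unordered lists; a thread waking at time t (msec) goes into
-   bucket t % 8, and t is stored as a 32-bit offset from a global
-   base so the common case fits in one word. */
-struct tcb *wakeup_list[8];
-
-void enqueue_wakeup(struct tcb *t, uint64_t wake_msec) {
-    t->wake_offset = (uint32_t)(wake_msec - time_base);
-    t->wakeup_next = wakeup_list[wake_msec % 8];
-    wakeup_list[wake_msec % 8] = t;
-}
-</pre>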
-
-<li>What is the problem addressed by lazy scheduling?
-Conventional approach to scheduling:
-<pre>
- A sends message to B:
- Move A from ready queue to waiting queue
- Move B from waiting queue to ready queue
- This requires 58 cycles, including 4 TLB misses. What are the TLB misses?
- One each for head of ready and waiting queues
- One each for previous queue element during the remove
-</pre>
-<li> Lazy scheduling:
-<pre>
- Ready queue must contain all ready threads except current one
- Might contain other threads that aren't actually ready, though
- Each wakeup queue contains all threads waiting in that queue
- Again, might contain other threads, too
- Scheduler removes inappropriate queue entries when scanning
- queue
-</pre>
-
-<li>Why does this help performance? There are only three situations
-in which a thread gives up the CPU but stays ready: the send syscall
-(as opposed to call), preemption, and hardware interrupts. So the
-kernel can very often IPC into a thread without ever putting it on
-the ready list.
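-
-<p>A sketch of the lazy fix-up during a scheduler scan (names
-hypothetical):
-
-<pre>
-/* Stale entries are removed only when the scheduler walks the ready
-   queue, amortizing the queue manipulation that eager scheduling
-   pays on every IPC. */
-struct tcb *pick_ready(void) {
-    struct tcb *t, *next;
-    for (t = ready_head; t; t = next) {
-        next = t->ready_next;
-        if (t->state != READY)
-            ready_dequeue(t);    /* lazily drop a stale entry */
-        else
-            return t;
-    }
-    return idle_thread;
-}
-</pre>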
-
-<li>Direct process switch. This section just says you should use
-kernel threads instead of continuations.
-
-<li>Short messages via registers.
-
-<li>Avoiding unnecessary copies. Basically, a thread can send and
- receive messages with the same vector. This makes forwarding
- efficient, which is important for the Clans/Chiefs model.
-
-<li>Segment register optimization. Loading segment registers is
- slow: the CPU has to access the GDT, etc. But the common case is
- that users don't change their segment registers. Observation: it is
- faster to check a segment register's value than to load it. So just
- check that the segment registers are okay, and only reload them if
- user code changed them.
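-
-<p>A sketch of the check-before-load fast path (USER_DS is a made-up
-selector name; GCC-style inline assembly):
-
-<pre>
-/* On the kernel exit path: reading and comparing a segment register
-   is a cheap ALU operation; loading one walks the GDT.  Reload only
-   if user code changed it. */
-uint16_t ds;
-asm volatile("mov %%ds, %0" : "=r"(ds));
-if (ds != USER_DS)
-    asm volatile("mov %0, %%ds" : : "r"((uint16_t)USER_DS));
-</pre>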
-
-<li>Use registers for parameter passing wherever possible: system
-calls and IPC.
-
-<li>Minimizing TLB misses. Try to cram as many things as possible
-onto the same page: the IPC kernel code, GDT, IDT, and TSS all on one
-page. Actually, the whole tables may not fit, but the important
-parts of the tables can be put on the same page (maybe only the
-beginning of the TSS, IDT, or GDT?).
-
-<li>Coding tricks: short offsets, avoid jumps, avoid checks, pack
- often-used data on the same cache lines, lazily save/restore CPU
- state like debug and FPU registers. Much of the kernel is written
- in assembly!
-
-</ul>
-
-<li>What are the results? Figures 7 and 8 look good.
-
-<li>Is fast IPC enough to get good overall system performance? This
-paper doesn't make a statement either way; we have to read their 1997
-paper to find the answer to that question.
-
-<li>Is the principle of optimizing for performance right? In
-general, it is wrong to optimize for performance; other things matter
-more. Is IPC the one exception? Maybe; perhaps not. Was Liedtke
-fighting a losing battle against CPU makers? Should fast IPC be a
-hardware issue, or just an OS issue?
-
-</ul>
-
-</body>
-</html>