Diffstat (limited to 'web')
37 files changed, 9034 insertions, 0 deletions
| diff --git a/web/Makefile b/web/Makefile new file mode 100644 index 0000000..7b49773 --- /dev/null +++ b/web/Makefile @@ -0,0 +1,3 @@ +index.html: index.txt mkhtml +	mkhtml index.txt >_$@ && mv _$@ $@ + diff --git a/web/index.html b/web/index.html new file mode 100644 index 0000000..d5f940c --- /dev/null +++ b/web/index.html @@ -0,0 +1,353 @@ +<!-- AUTOMATICALLY GENERATED: EDIT the .txt version, not the .html version --> +<html> +<head> +<title>Xv6, a simple Unix-like teaching operating system</title> +<style type="text/css"><!-- +body { +	background-color: white; +	color: black; +	font-size: medium; +	line-height: 1.2em; +	margin-left: 0.5in; +	margin-right: 0.5in; +	margin-top: 0; +	margin-bottom: 0; +} + +h1 { +	text-indent: 0in; +	text-align: left; +	margin-top: 2em; +	font-weight: bold; +	font-size: 1.4em; +} + +h2 { +	text-indent: 0in; +	text-align: left; +	margin-top: 2em; +	font-weight: bold; +	font-size: 1.2em; +} +--></style> +</head> +<body bgcolor=#ffffff> +<h1>Xv6, a simple Unix-like teaching operating system</h1> +<br><br> +Xv6 is a teaching operating system developed +in the summer of 2006 for MIT's operating systems course, +“6.828: Operating Systems Engineering.” +We used it for 6.828 in Fall 2006 and Fall 2007 +and are using it this semester (Fall 2008). +We hope that xv6 will be useful in other courses too. +This page collects resources to aid the use of xv6 +in other courses. + +<h2>History and Background</h2> +For many years, MIT had no operating systems course. +In the fall of 2002, Frans Kaashoek, Josh Cates, and Emil Sit +created a new, experimental course (6.097) +to teach operating systems engineering. +In the course lectures, the class worked through Sixth Edition Unix (aka V6) +using John Lions's famous commentary. +In the lab assignments, students wrote most of an exokernel operating +system, eventually named Jos, for the Intel x86. +Exposing students to multiple systems–V6 and Jos–helped +develop a sense of the spectrum of operating system designs. +In the fall of 2003, the experimental 6.097 became the +official course 6.828; the course has been offered each fall since then. +<br><br> +V6 presented pedagogic challenges from the start. +Students doubted the relevance of an obsolete 30-year-old operating system +written in an obsolete programming language (pre-K&R C) +running on obsolete hardware (the PDP-11). +Students also struggled to learn the low-level details of two different +architectures (the PDP-11 and the Intel x86) at the same time. +By the summer of 2006, we had decided to replace V6 +with a new operating system, xv6, modeled on V6 +but written in ANSI C and running on multiprocessor +Intel x86 machines. +Xv6's use of the x86 makes it more relevant to +students' experience than V6 was +and unifies the course around a single architecture. +Adding multiprocessor support also helps relevance +and makes it easier to discuss threads and concurrency. +(In a single processor operating system, concurrency–which only +happens because of interrupts–is too easy to view as a special case. +A multiprocessor operating system must attack the problem head on.) +Finally, writing a new system allowed us to write cleaner versions +of the rougher parts of V6, like the scheduler and file system. +<br><br> +6.828 substituted xv6 for V6 in the fall of 2006. +Based on that experience, we cleaned up rough patches +of xv6 for the course in the fall of 2007. +Since then, xv6 has stabilized, so we are making it +available in the hopes that others will find it useful too. 
+<br><br> +6.828 uses both xv6 and Jos. +Courses taught at UCLA, NYU, and Stanford have used +Jos without xv6; we believe other courses could use +xv6 without Jos, though we are not aware of any that have. + +<h2>Xv6 sources</h2> +The latest xv6 is <a href="xv6-rev2.tar.gz">xv6-rev2.tar.gz</a>. +We distribute the sources in electronic form but also as +a printed booklet with line numbers that keep everyone +together during lectures.  The booklet is available as +<a href="xv6-rev2.pdf">xv6-rev2.pdf</a>. +<br><br> +xv6 compiles using the GNU C compiler, +targeted at the x86 using ELF binaries. +On BSD and Linux systems, you can use the native compilers; +On OS X, which doesn't use ELF binaries, +you must use a cross-compiler. +Xv6 does boot on real hardware, but typically +we run it using the Bochs emulator. +Both the GCC cross compiler and Bochs +can be found on the <a href="../../2007/tools.html">6.828 tools page</a>. + +<h2>Lectures</h2> +In 6.828, the lectures in the first half of the course +introduce the PC hardware, the Intel x86, and then xv6. +The lectures in the second half consider advanced topics +using research papers; for some, xv6 serves as a useful +base for making discussions concrete. +This section describe a typical 6.828 lecture schedule, +linking to lecture notes and homework. +A course using only xv6 (not Jos) will need to adapt +a few of the lectures, but we hope these are a useful +starting point. + +<br><br><b><i>Lecture 1. Operating systems</i></b> +<br><br> +The first lecture introduces both the general topic of +operating systems and the specific approach of 6.828. +After defining “operating system,” the lecture +examines the implementation of a Unix shell +to look at the details the traditional Unix system call interface. +This is relevant to both xv6 and Jos: in the final +Jos labs, students implement a Unix-like interface +and culminating in a Unix shell. +<br><br> +<a href="l1.html">lecture notes</a> + +<br><br><b><i>Lecture 2.  PC hardware and x86 programming</i></b> +<br><br> +This lecture introduces the PC architecture, the 16- and 32-bit x86, +the stack, and the GCC x86 calling conventions. +It also introduces the pieces of a typical C tool chain–compiler, +assembler, linker, loader–and the Bochs emulator. +<br><br> +Reading: PC Assembly Language +<br><br> +Homework: familiarize with Bochs +<br><br> +<a href="l2.html">lecture notes</a> +<a href="x86-intro.html">homework</a> + +<br><br><b><i>Lecture 3.  Operating system organization</i></b> +<br><br> +This lecture continues Lecture 1's discussion of what +an operating system does. +An operating system provides a “virtual computer” +interface to user space programs. +At a high level, the main job of the operating system +is to implement that interface +using the physical computer it runs on. +<br><br> +The lecture discusses four approaches to that job: +monolithic operating systems, microkernels, +virtual machines, and exokernels. +Exokernels might not be worth mentioning +except that the Jos labs are built around one. +<br><br> +Reading: Engler et al., Exokernel: An Operating System Architecture +for Application-Level Resource Management +<br><br> +<a href="l3.html">lecture notes</a> + +<br><br><b><i>Lecture 4.  Address spaces using segmentation</i></b> +<br><br> +This is the first lecture that uses xv6. +It introduces the idea of address spaces and the +details of the x86 segmentation hardware. 
+It makes the discussion concrete by reading the xv6 +source code and watching xv6 execute using the Bochs simulator. +<br><br> +Reading: x86 MMU handout, +xv6: bootasm.S, bootother.S, <a href="src/bootmain.c.html">bootmain.c</a>, <a href="src/main.c.html">main.c</a>, <a href="src/init.c.html">init.c</a>, and setupsegs in <a href="src/proc.c.html">proc.c</a>. +<br><br> +Homework: Bochs stack introduction +<br><br> +<a href="l4.html">lecture notes</a> +<a href="xv6-intro.html">homework</a> + +<br><br><b><i>Lecture 5.  Address spaces using page tables</i></b> +<br><br> +This lecture continues the discussion of address spaces, +examining the other x86 virtual memory mechanism: page tables. +Xv6 does not use page tables, so there is no xv6 here. +Instead, the lecture uses Jos as a concrete example. +An xv6-only course might skip or shorten this discussion. +<br><br> +Reading: x86 manual excerpts +<br><br> +Homework: stuff about gdt +XXX not appropriate; should be in Lecture 4 +<br><br> +<a href="l5.html">lecture notes</a> + +<br><br><b><i>Lecture 6.  Interrupts and exceptions</i></b> +<br><br> +How does a user program invoke the operating system kernel? +How does the kernel return to the user program? +What happens when a hardware device needs attention? +This lecture explains the answer to these questions: +interrupt and exception handling. +<br><br> +It explains the x86 trap setup mechanisms and then +examines their use in xv6's SETGATE (<a href="src/mmu.h.html">mmu.h</a>), +tvinit (<a href="src/trap.c.html">trap.c</a>), idtinit (<a href="src/trap.c.html">trap.c</a>), <a href="src/vectors.pl.html">vectors.pl</a>, and vectors.S. +<br><br> +It then traces through a call to the system call open: +<a href="src/init.c.html">init.c</a>, usys.S, vector48 and alltraps (vectors.S), trap (<a href="src/trap.c.html">trap.c</a>), +syscall (<a href="src/syscall.c.html">syscall.c</a>), +sys_open (<a href="src/sysfile.c.html">sysfile.c</a>), fetcharg, fetchint, argint, argptr, argstr (<a href="src/syscall.c.html">syscall.c</a>), +<br><br> +The interrupt controller, briefly: +pic_init and pic_enable (<a href="src/picirq.c.html">picirq.c</a>). +The timer and keyboard, briefly: +timer_init (<a href="src/timer.c.html">timer.c</a>), console_init (<a href="src/console.c.html">console.c</a>). +Enabling and disabling of interrupts. +<br><br> +Reading: x86 manual excerpts, +xv6: trapasm.S, <a href="src/trap.c.html">trap.c</a>, <a href="src/syscall.c.html">syscall.c</a>, and usys.S. +Skim <a href="src/lapic.c.html">lapic.c</a>, <a href="src/ioapic.c.html">ioapic.c</a>, <a href="src/picirq.c.html">picirq.c</a>. +<br><br> +Homework: Explain the 35 words on the top of the +stack at first invocation of <code>syscall</code>. +<br><br> +<a href="l-interrupt.html">lecture notes</a> +<a href="x86-intr.html">homework</a> + +<br><br><b><i>Lecture 7.  Multiprocessors and locking</i></b> +<br><br> +This lecture introduces the problems of +coordination and synchronization on a +multiprocessor +and then the solution of mutual exclusion locks. +Atomic instructions, test-and-set locks, +lock granularity, (the mistake of) recursive locks. +<br><br> +Although xv6 user programs cannot share memory, +the xv6 kernel itself is a program with multiple threads +executing concurrently and sharing memory. +Illustration: the xv6 scheduler's proc_table_lock (<a href="src/proc.c.html">proc.c</a>) +and the spin lock implementation (<a href="src/spinlock.c.html">spinlock.c</a>). 
+<br><br> +Reading: xv6: <a href="src/spinlock.c.html">spinlock.c</a>.  Skim <a href="src/mp.c.html">mp.c</a>. +<br><br> +Homework: Interaction between locking and interrupts. +Try not disabling interrupts in the disk driver and watch xv6 break. +<br><br> +<a href="l-lock.html">lecture notes</a> +<a href="xv6-lock.html">homework</a> + +<br><br><b><i>Lecture 8.  Threads, processes and context switching</i></b> +<br><br> +The last lecture introduced some of the issues +in writing threaded programs, using xv6's processes +as an example. +This lecture introduces the issues in implementing +threads, continuing to use xv6 as the example. +<br><br> +The lecture defines a thread of computation as a register +set and a stack.  A process is an address space plus one +or more threads of computation sharing that address space. +Thus the xv6 kernel can be viewed as a single process +with many threads (each user process) executing concurrently. +<br><br> +Illustrations: thread switching (swtch.S), scheduler (<a href="src/proc.c.html">proc.c</a>), sys_fork (<a href="src/sysproc.c.html">sysproc.c</a>) +<br><br> +Reading: <a href="src/proc.c.html">proc.c</a>, swtch.S, sys_fork (<a href="src/sysproc.c.html">sysproc.c</a>) +<br><br> +Homework: trace through stack switching. +<br><br> +<a href="l-threads.html">lecture notes (need to be updated to use swtch)</a> +<a href="xv6-sched.html">homework</a> + +<br><br><b><i>Lecture 9.  Processes and coordination</i></b> +<br><br> +This lecture introduces the idea of sequence coordination +and then examines the particular solution illustrated by +sleep and wakeup (<a href="src/proc.c.html">proc.c</a>). +It introduces and refines a simple +producer/consumer queue to illustrate the +need for sleep and wakeup +and then the sleep and wakeup +implementations themselves. +<br><br> +Reading: <a href="src/proc.c.html">proc.c</a>, sys_exec, sys_sbrk, sys_wait, sys_exec, sys_kill (<a href="src/sysproc.c.html">sysproc.c</a>). +<br><br> +Homework: Explain how sleep and wakeup would break +without proc_table_lock.  Explain how devices would break +without second lock argument to sleep. +<br><br> +<a href="l-coordination.html">lecture notes</a> +<a href="xv6-sleep.html">homework</a> + +<br><br><b><i>Lecture 10.  Files and disk I/O</i></b> +<br><br> +This is the first of three file system lectures. +This lecture introduces the basic file system interface +and then considers the on-disk layout of individual files +and the free block bitmap. +<br><br> +Reading: iread, iwrite, fileread, filewrite, wdir, mknod1, and +	code related to these calls in <a href="src/fs.c.html">fs.c</a>, <a href="src/bio.c.html">bio.c</a>, <a href="src/ide.c.html">ide.c</a>, and <a href="src/file.c.html">file.c</a>. +<br><br> +Homework: Add print to bwrite to trace every disk write. +Explain the disk writes caused by some simple shell commands. +<br><br> +<a href="l-fs.html">lecture notes</a> +<a href="xv6-disk.html">homework</a> + +<br><br><b><i>Lecture 11.  Naming</i></b> +<br><br> +The last lecture discussed on-disk file system representation. +This lecture covers the implementation of +file system paths (namei in <a href="src/fs.c.html">fs.c</a>) +and also discusses the security problems of a shared /tmp +and symbolic links. +<br><br> +Understanding exec (<a href="src/exec.c.html">exec.c</a>) is left as an exercise. +<br><br> +Reading: namei in <a href="src/fs.c.html">fs.c</a>, <a href="src/sysfile.c.html">sysfile.c</a>, <a href="src/file.c.html">file.c</a>. 
+<br><br> +Homework: Explain how to implement symbolic links in xv6. +<br><br> +<a href="l-name.html">lecture notes</a> +<a href="xv6-names.html">homework</a> + +<br><br><b><i>Lecture 12.  High-performance file systems</i></b> +<br><br> +This lecture is the first of the research paper-based lectures. +It discusses the “soft updates” paper, +using xv6 as a concrete example. + +<h2>Feedback</h2> +If you are interested in using xv6 or have used xv6 in a course, +we would love to hear from you. +If there's anything that we can do to make xv6 easier +to adopt, we'd like to hear about it. +We'd also be interested to hear what worked well and what didn't. +<br><br> +Russ Cox (rsc@swtch.com)<br> +Frans Kaashoek (kaashoek@mit.edu)<br> +Robert Morris (rtm@mit.edu) +<br><br> +You can reach all of us at 6.828-staff@pdos.csail.mit.edu. +<br><br> +<br><br> +</body> +</html> diff --git a/web/index.txt b/web/index.txt new file mode 100644 index 0000000..41d42a4 --- /dev/null +++ b/web/index.txt @@ -0,0 +1,335 @@ +** Xv6, a simple Unix-like teaching operating system +Xv6 is a teaching operating system developed +in the summer of 2006 for MIT's operating systems course,  +``6.828: Operating Systems Engineering.'' +We used it for 6.828 in Fall 2006 and Fall 2007 +and are using it this semester (Fall 2008). +We hope that xv6 will be useful in other courses too. +This page collects resources to aid the use of xv6 +in other courses. + +* History and Background + +For many years, MIT had no operating systems course. +In the fall of 2002, Frans Kaashoek, Josh Cates, and Emil Sit +created a new, experimental course (6.097) +to teach operating systems engineering. +In the course lectures, the class worked through Sixth Edition Unix (aka V6) +using John Lions's famous commentary. +In the lab assignments, students wrote most of an exokernel operating +system, eventually named Jos, for the Intel x86. +Exposing students to multiple systems--V6 and Jos--helped +develop a sense of the spectrum of operating system designs. +In the fall of 2003, the experimental 6.097 became the +official course 6.828; the course has been offered each fall since then. + +V6 presented pedagogic challenges from the start. +Students doubted the relevance of an obsolete 30-year-old operating system +written in an obsolete programming language (pre-K&R C) +running on obsolete hardware (the PDP-11). +Students also struggled to learn the low-level details of two different +architectures (the PDP-11 and the Intel x86) at the same time. +By the summer of 2006, we had decided to replace V6 +with a new operating system, xv6, modeled on V6 +but written in ANSI C and running on multiprocessor +Intel x86 machines. +Xv6's use of the x86 makes it more relevant to +students' experience than V6 was +and unifies the course around a single architecture. +Adding multiprocessor support also helps relevance +and makes it easier to discuss threads and concurrency. +(In a single processor operating system, concurrency--which only +happens because of interrupts--is too easy to view as a special case. +A multiprocessor operating system must attack the problem head on.) +Finally, writing a new system allowed us to write cleaner versions +of the rougher parts of V6, like the scheduler and file system. + +6.828 substituted xv6 for V6 in the fall of 2006. +Based on that experience, we cleaned up rough patches +of xv6 for the course in the fall of 2007. +Since then, xv6 has stabilized, so we are making it +available in the hopes that others will find it useful too. 
+ +6.828 uses both xv6 and Jos.   +Courses taught at UCLA, NYU, and Stanford have used +Jos without xv6; we believe other courses could use +xv6 without Jos, though we are not aware of any that have. + + +* Xv6 sources + +The latest xv6 is [xv6-rev2.tar.gz]. +We distribute the sources in electronic form but also as  +a printed booklet with line numbers that keep everyone +together during lectures.  The booklet is available as +[xv6-rev2.pdf]. + +xv6 compiles using the GNU C compiler, +targeted at the x86 using ELF binaries. +On BSD and Linux systems, you can use the native compilers; +On OS X, which doesn't use ELF binaries, +you must use a cross-compiler. +Xv6 does boot on real hardware, but typically +we run it using the Bochs emulator. +Both the GCC cross compiler and Bochs +can be found on the [../../2007/tools.html | 6.828 tools page]. + + +* Lectures + +In 6.828, the lectures in the first half of the course +introduce the PC hardware, the Intel x86, and then xv6. +The lectures in the second half consider advanced topics +using research papers; for some, xv6 serves as a useful +base for making discussions concrete. +This section describe a typical 6.828 lecture schedule, +linking to lecture notes and homework. +A course using only xv6 (not Jos) will need to adapt +a few of the lectures, but we hope these are a useful +starting point. + + +Lecture 1. Operating systems + +The first lecture introduces both the general topic of +operating systems and the specific approach of 6.828. +After defining ``operating system,'' the lecture +examines the implementation of a Unix shell +to look at the details the traditional Unix system call interface. +This is relevant to both xv6 and Jos: in the final +Jos labs, students implement a Unix-like interface +and culminating in a Unix shell. + +[l1.html | lecture notes] + + +Lecture 2.  PC hardware and x86 programming + +This lecture introduces the PC architecture, the 16- and 32-bit x86, +the stack, and the GCC x86 calling conventions. +It also introduces the pieces of a typical C tool chain--compiler, +assembler, linker, loader--and the Bochs emulator. + +Reading: PC Assembly Language + +Homework: familiarize with Bochs + +[l2.html | lecture notes] +[x86-intro.html | homework] + + +Lecture 3.  Operating system organization + +This lecture continues Lecture 1's discussion of what +an operating system does. +An operating system provides a ``virtual computer'' +interface to user space programs. +At a high level, the main job of the operating system  +is to implement that interface +using the physical computer it runs on. + +The lecture discusses four approaches to that job: +monolithic operating systems, microkernels, +virtual machines, and exokernels. +Exokernels might not be worth mentioning +except that the Jos labs are built around one. + +Reading: Engler et al., Exokernel: An Operating System Architecture +for Application-Level Resource Management + +[l3.html | lecture notes] + + +Lecture 4.  Address spaces using segmentation + +This is the first lecture that uses xv6. +It introduces the idea of address spaces and the +details of the x86 segmentation hardware. +It makes the discussion concrete by reading the xv6 +source code and watching xv6 execute using the Bochs simulator. + +Reading: x86 MMU handout, +xv6: bootasm.S, bootother.S, bootmain.c, main.c, init.c, and setupsegs in proc.c. + +Homework: Bochs stack introduction + +[l4.html | lecture notes] +[xv6-intro.html | homework] +  + +Lecture 5.  
Address spaces using page tables + +This lecture continues the discussion of address spaces, +examining the other x86 virtual memory mechanism: page tables. +Xv6 does not use page tables, so there is no xv6 here. +Instead, the lecture uses Jos as a concrete example. +An xv6-only course might skip or shorten this discussion. + +Reading: x86 manual excerpts + +Homework: stuff about gdt +XXX not appropriate; should be in Lecture 4 + +[l5.html | lecture notes] + + +Lecture 6.  Interrupts and exceptions + +How does a user program invoke the operating system kernel? +How does the kernel return to the user program? +What happens when a hardware device needs attention? +This lecture explains the answer to these questions: +interrupt and exception handling. + +It explains the x86 trap setup mechanisms and then +examines their use in xv6's SETGATE (mmu.h), +tvinit (trap.c), idtinit (trap.c), vectors.pl, and vectors.S. + +It then traces through a call to the system call open: +init.c, usys.S, vector48 and alltraps (vectors.S), trap (trap.c), +syscall (syscall.c), +sys_open (sysfile.c), fetcharg, fetchint, argint, argptr, argstr (syscall.c), + +The interrupt controller, briefly: +pic_init and pic_enable (picirq.c). +The timer and keyboard, briefly: +timer_init (timer.c), console_init (console.c). +Enabling and disabling of interrupts. + +Reading: x86 manual excerpts, +xv6: trapasm.S, trap.c, syscall.c, and usys.S. +Skim lapic.c, ioapic.c, picirq.c. + +Homework: Explain the 35 words on the top of the +stack at first invocation of <code>syscall</code>. + +[l-interrupt.html | lecture notes] +[x86-intr.html | homework] + + +Lecture 7.  Multiprocessors and locking + +This lecture introduces the problems of  +coordination and synchronization on a  +multiprocessor +and then the solution of mutual exclusion locks. +Atomic instructions, test-and-set locks, +lock granularity, (the mistake of) recursive locks. + +Although xv6 user programs cannot share memory, +the xv6 kernel itself is a program with multiple threads +executing concurrently and sharing memory. +Illustration: the xv6 scheduler's proc_table_lock (proc.c) +and the spin lock implementation (spinlock.c). + +Reading: xv6: spinlock.c.  Skim mp.c. + +Homework: Interaction between locking and interrupts. +Try not disabling interrupts in the disk driver and watch xv6 break. + +[l-lock.html | lecture notes] +[xv6-lock.html | homework] + + +Lecture 8.  Threads, processes and context switching + +The last lecture introduced some of the issues  +in writing threaded programs, using xv6's processes +as an example. +This lecture introduces the issues in implementing +threads, continuing to use xv6 as the example. + +The lecture defines a thread of computation as a register +set and a stack.  A process is an address space plus one  +or more threads of computation sharing that address space. +Thus the xv6 kernel can be viewed as a single process +with many threads (each user process) executing concurrently. + +Illustrations: thread switching (swtch.S), scheduler (proc.c), sys_fork (sysproc.c) + +Reading: proc.c, swtch.S, sys_fork (sysproc.c) + +Homework: trace through stack switching. + +[l-threads.html | lecture notes (need to be updated to use swtch)] +[xv6-sched.html | homework] + + +Lecture 9.  Processes and coordination + +This lecture introduces the idea of sequence coordination +and then examines the particular solution illustrated by +sleep and wakeup (proc.c). 
+It introduces and refines a simple +producer/consumer queue to illustrate the  +need for sleep and wakeup +and then the sleep and wakeup +implementations themselves. + +Reading: proc.c, sys_exec, sys_sbrk, sys_wait, sys_exec, sys_kill (sysproc.c). + +Homework: Explain how sleep and wakeup would break +without proc_table_lock.  Explain how devices would break +without second lock argument to sleep. + +[l-coordination.html | lecture notes] +[xv6-sleep.html | homework] + + +Lecture 10.  Files and disk I/O + +This is the first of three file system lectures. +This lecture introduces the basic file system interface +and then considers the on-disk layout of individual files +and the free block bitmap. + +Reading: iread, iwrite, fileread, filewrite, wdir, mknod1, and  +	code related to these calls in fs.c, bio.c, ide.c, and file.c. + +Homework: Add print to bwrite to trace every disk write. +Explain the disk writes caused by some simple shell commands. + +[l-fs.html | lecture notes] +[xv6-disk.html | homework] + + +Lecture 11.  Naming + +The last lecture discussed on-disk file system representation. +This lecture covers the implementation of +file system paths (namei in fs.c) +and also discusses the security problems of a shared /tmp +and symbolic links. + +Understanding exec (exec.c) is left as an exercise. + +Reading: namei in fs.c, sysfile.c, file.c. + +Homework: Explain how to implement symbolic links in xv6. + +[l-name.html | lecture notes] +[xv6-names.html | homework] + + +Lecture 12.  High-performance file systems + +This lecture is the first of the research paper-based lectures. +It discusses the ``soft updates'' paper, +using xv6 as a concrete example. + + +* Feedback + +If you are interested in using xv6 or have used xv6 in a course, +we would love to hear from you. +If there's anything that we can do to make xv6 easier +to adopt, we'd like to hear about it. +We'd also be interested to hear what worked well and what didn't. + +Russ Cox (rsc@swtch.com)<br> +Frans Kaashoek (kaashoek@mit.edu)<br> +Robert Morris (rtm@mit.edu) + +You can reach all of us at 6.828-staff@pdos.csail.mit.edu. + + diff --git a/web/l-bugs.html b/web/l-bugs.html new file mode 100644 index 0000000..493372d --- /dev/null +++ b/web/l-bugs.html @@ -0,0 +1,187 @@ +<title>OS Bugs</title> +<html> +<head> +</head> +<body> + +<h1>OS Bugs</h1> + +<p>Required reading: Bugs as deviant behavior + +<h2>Overview</h2> + +<p>Operating systems must obey many rules for correctness and +performance.  Examples rules: +<ul> +<li>Do not call blocking functions with interrupts disabled or spin +lock held +<li>check for NULL results +<li>Do not allocate large stack variables +<li>Do no re-use already-allocated memory +<li>Check user pointers before using them in kernel mode +<li>Release acquired locks +</ul> + +<p>In addition, there are standard software engineering rules, like +use function results in consistent ways. + +<p>These rules are typically not checked by a compiler, even though +they could be checked by a compiler, in principle.  The goal of the +meta-level compilation project is to allow system implementors to +write system-specific compiler extensions that check the source code +for rule violations. + +<p>The results are good: many new bugs found (500-1000) in Linux +alone.  The paper for today studies these bugs and attempts to draw +lessons from these bugs. + +<p>Are kernel error worse than user-level errors?  That is, if we get +the kernel correct, then we won't have system crashes? 
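+
+<p>To make these rules concrete before turning to JOS, here is a small
+<i>hypothetical</i> kernel fragment (not taken from Linux or JOS) that
+violates two of the rules listed above at once: it returns on an error path
+without releasing an acquired lock, and it uses an allocation result without
+checking it for NULL.  A checker that pairs acquire with release, or an
+allocation with a NULL test, would flag both paths.
+
+<pre>
+struct devstate {
+  struct spinlock lock;    /* protects count and buf */
+  int count;
+  char *buf;
+};
+
+/* hypothetical code, for illustration only */
+int
+dev_append(struct devstate *d, int n)
+{
+  acquire(&d->lock);
+  if(n < 0)
+    return -1;             /* BUG: returns with d->lock still held */
+  d->buf = bufalloc();     /* bufalloc: a made-up allocator that may return NULL */
+  d->buf[0] = n;           /* BUG: the result is never checked, so this may crash */
+  d->count += n;
+  release(&d->lock);
+  return 0;
+}
+</pre>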
+ +<h2>Errors in JOS kernel</h2> + +<p>What are unstated invariants in the JOS? +<ul> +<li>Interrupts are disabled in kernel mode +<li>Only env 1 has access to disk +<li>All registers are saved & restored on context switch +<li>Application code is never executed with CPL 0 +<li>Don't allocate an already-allocated physical page +<li>Propagate error messages to user applications (e.g., out of +resources) +<li>Map pipe before fd +<li>Unmap fd before pipe +<li>A spawned program should have open only file descriptors 0, 1, and 2. +<li>Pass sometimes size in bytes and sometimes in block number to a +given file system function. +<li>User pointers should be run through TRUP before used by the kernel +</ul> + +<p>Could these errors have been caught by metacompilation?  Would +metacompilation have caught the pipe race condition? (Probably not, +it happens in only one place.) + +<p>How confident are you that your code is correct?  For example, +are you sure interrupts are always disabled in kernel mode?  How would +you test? + +<h2>Metacompilation</h2> + +<p>A system programmer writes the rule checkers in a high-level, +state-machine language (metal).  These checkers are dynamically linked +into an extensible version of g++, xg++.  Xg++ applies the rule +checkers to every possible execution path of a function that is being +compiled. + +<p>An example rule from +the <a +href="http://www.stanford.edu/~engler/exe-ccs-06.pdf">OSDI +paper</a>: +<pre> +sm check_interrupts { +   decl { unsigned} flags; +   pat enable = { sti(); } | {restore_flags(flags);} ; +   pat disable = { cli(); }; +    +   is_enabled: disable ==> is_disabled | enable ==> { err("double +      enable")}; +   ... +</pre> +A more complete version found 82 errors in the Linux 2.3.99 kernel. + +<p>Common mistake: +<pre> +get_free_buffer ( ... ) { +   .... +   save_flags (flags); +   cli (); +   if ((bh = sh->buffer_pool) == NULL) +      return NULL; +   .... +} +</pre> +<p>(Figure 2 also lists a simple metarule.) + +<p>Some checkers produce false positives, because of limitations of +both static analysis and the checkers, which mostly use local +analysis. + +<p>How does the <b>block</b> checker work?  The first pass is a rule +that marks functions as potentially blocking.  After processing a +function, the checker emits the function's flow graph to a file +(including, annotations and functions called). The second pass takes +the merged flow graph of all function calls, and produces a file with +all functions that have a path in the control-flow-graph to a blocking +function call.  For the Linux kernel this results in 3,000 functions +that potentially could call sleep.  Yet another checker like +check_interrupts checks if a function calls any of the 3,000 functions +with interrupts disabled. Etc. + +<h2>This paper</h2> + +<p>Writing rules is painful. First, you have to write them.  Second, +how do you decide what to check?  Was it easy to enumerate all +conventions for JOS? + +<p>Insight: infer programmer "beliefs" from code and cross-check +for contradictions.  If <i>cli</i> is always followed by <i>sti</i>, +except in one case, perhaps something is wrong.  This simplifies +life because we can write generic checkers instead of checkers +that specifically check for <i>sti</i>, and perhaps we get lucky +and find other temporal ordering conventions. + +<p>Do we know which case is wrong?  The 999 times or the 1 time that +<i>sti</i> is absent?  
(No, this method cannot figure what the correct +sequence is but it can flag that something is weird, which in practice +useful.)  The method just detects inconsistencies. + +<p>Is every inconsistency an error?  No, some inconsistency don't +indicate an error.  If a call to function <i>f</i> is often followed +by call to function <i>g</i>, does that imply that f should always be +followed by g?  (No!) + +<p>Solution: MUST beliefs and MAYBE beliefs.  MUST beliefs are +invariants that must hold; any inconsistency indicates an error.  If a +pointer is dereferences, then the programmer MUST believe that the +pointer is pointing to something that can be dereferenced (i.e., the +pointer is definitely not zero).  MUST beliefs can be checked using +"internal inconsistencies". + +<p>An aside, can zero pointers pointers be detected during runtime? +(Sure, unmap the page at address zero.)  Why is metacompilation still +valuable?  (At runtime you will find only the null pointers that your +test code dereferenced; not all possible dereferences of null +pointers.)  An even more convincing example for Metacompilation is +tracking user pointers that the kernel dereferences.  (Is this a MUST +belief?) + +<p>MAYBE beliefs are invariants that are suggested by the code, but +they maybe coincidences.  MAYBE beliefs are ranked by statistical +analysis, and perhaps augmented with input about functions names +(e.g., alloc and free are important).  Is it computationally feasible +to check every MAYBE belief?  Could there be much noise? + +<p>What errors won't this approach catch? + +<h2>Paper discussion</h2> + +<p>This paper is best discussed by studying every code fragment.  Most +code fragments are pieces of code from Linux distributions; these +mistakes are real! + +<p>Section 3.1.  what is the error?  how does metacompilation catch +it? + +<p>Figure 1.  what is the error? is there one? + +<p>Code fragments from 6.1.  what is the error?  how does metacompilation catch +it? + +<p>Figure 3.  what is the error?  how does metacompilation catch +it? + +<p>Section 8.3.  what is the error?  how does metacompilation catch +it? + +</body> + diff --git a/web/l-coordination.html b/web/l-coordination.html new file mode 100644 index 0000000..b2f9f0d --- /dev/null +++ b/web/l-coordination.html @@ -0,0 +1,354 @@ +<title>L9</title> +<html> +<head> +</head> +<body> + +<h1>Coordination and more processes</h1> + +<p>Required reading: remainder of proc.c, sys_exec, sys_sbrk, +  sys_wait, sys_exit, and sys_kill. + +<h2>Overview</h2> + +<p>Big picture: more programs than processors.  How to share the +  limited number of processors among the programs?  Last lecture +  covered basic mechanism: threads and the distinction between process +  and thread.  Today expand: how to coordinate the interactions +  between threads explicitly, and some operations on processes. + +<p>Sequence coordination.  This is a diferrent type of coordination +  than mutual-exclusion coordination (which has its goal to make +  atomic actions so that threads don't interfere).  The goal of +  sequence coordination is for threads to coordinate the sequences in +  which they run.   + +<p>For example, a thread may want to wait until another thread +  terminates. One way to do so is to have the thread run periodically, +  let it check if the other thread terminated, and if not give up the +  processor again.  This is wasteful, especially if there are many +  threads.  + +<p>With primitives for sequence coordination one can do better.  
The +  thread could tell the thread manager that it is waiting for an event +  (e.g., another thread terminating).  When the other thread +  terminates, it explicitly wakes up the waiting thread.  This is more +  work for the programmer, but more efficient. + +<p>Sequence coordination often interacts with mutual-exclusion +  coordination, as we will see below. + +<p>The operating system literature has a rich set of primivites for +  sequence coordination.  We study a very simple version of condition +  variables in xv6: sleep and wakeup, with a single lock. + +<h2>xv6 code examples</h2> + +<h3>Sleep and wakeup - usage</h3> + +Let's consider implementing a producer/consumer queue +(like a pipe) that can be used to hold a single non-null char pointer: + +<pre> +struct pcq { +    void *ptr; +}; + +void* +pcqread(struct pcq *q) +{ +    void *p; + +    while((p = q->ptr) == 0) +        ; +    q->ptr = 0; +    return p; +} + +void +pcqwrite(struct pcq *q, void *p) +{ +    while(q->ptr != 0) +        ; +    q->ptr = p; +} +</pre> + +<p>Easy and correct, at least assuming there is at most one +reader and at most one writer at a time. + +<p>Unfortunately, the while loops are inefficient. +Instead of polling, it would be great if there were +primitives saying ``wait for some event to happen'' +and ``this event happened''. +That's what sleep and wakeup do. + +<p>Second try: + +<pre> +void* +pcqread(struct pcq *q) +{ +    void *p; + +    if(q->ptr == 0) +        sleep(q); +    p = q->ptr; +    q->ptr = 0; +    wakeup(q);  /* wake pcqwrite */ +    return p; +} + +void +pcqwrite(struct pcq *q, void *p) +{ +    if(q->ptr != 0) +        sleep(q); +    q->ptr = p; +    wakeup(q);  /* wake pcqread */ +    return p; +} +</pre> + +That's better, but there is still a problem. +What if the wakeup happens between the check in the if +and the call to sleep? + +<p>Add locks: + +<pre> +struct pcq { +    void *ptr; +    struct spinlock lock; +}; + +void* +pcqread(struct pcq *q) +{ +    void *p; + +    acquire(&q->lock); +    if(q->ptr == 0) +        sleep(q, &q->lock); +    p = q->ptr; +    q->ptr = 0; +    wakeup(q);  /* wake pcqwrite */ +    release(&q->lock); +    return p; +} + +void +pcqwrite(struct pcq *q, void *p) +{ +    acquire(&q->lock); +    if(q->ptr != 0) +        sleep(q, &q->lock); +    q->ptr = p; +    wakeup(q);  /* wake pcqread */ +    release(&q->lock); +    return p; +} +</pre> + +This is okay, and now safer for multiple readers and writers, +except that wakeup wakes up everyone who is asleep on chan, +not just one guy. +So some of the guys who wake up from sleep might not +be cleared to read or write from the queue.  Have to go back to looping: + +<pre> +struct pcq { +    void *ptr; +    struct spinlock lock; +}; + +void* +pcqread(struct pcq *q) +{ +    void *p; + +    acquire(&q->lock); +    while(q->ptr == 0) +        sleep(q, &q->lock); +    p = q->ptr; +    q->ptr = 0; +    wakeup(q);  /* wake pcqwrite */ +    release(&q->lock); +    return p; +} + +void +pcqwrite(struct pcq *q, void *p) +{ +    acquire(&q->lock); +    while(q->ptr != 0) +        sleep(q, &q->lock); +    q->ptr = p; +    wakeup(q);  /* wake pcqread */ +    release(&q->lock); +    return p; +} +</pre> + +The difference between this an our original is that +the body of the while loop is a much more efficient way to pause. + +<p>Now we've figured out how to use it, but we +still need to figure out how to implement it. 
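+
+<p>Before looking at the implementation, a short usage sketch (hypothetical
+code, not part of xv6): two kernel threads could use this one-slot queue as a
+tiny pipe.  Here produce_item and consume_item are made-up helpers standing in
+for real work, and we assume q.ptr starts out zero and q.lock has been
+initialized:
+
+<pre>
+struct pcq q;
+
+void
+producer(void)
+{
+    for(;;)
+        pcqwrite(&q, produce_item());  /* sleeps while the slot is full */
+}
+
+void
+consumer(void)
+{
+    for(;;)
+        consume_item(pcqread(&q));     /* sleeps while the slot is empty */
+}
+</pre>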
+ +<h3>Sleep and wakeup - implementation</h3> +<p> +Simple implementation: + +<pre> +void +sleep(void *chan, struct spinlock *lk) +{ +    struct proc *p = curproc[cpu()]; +     +    release(lk); +    p->chan = chan; +    p->state = SLEEPING; +    sched(); +} + +void +wakeup(void *chan) +{ +    for(each proc p) { +        if(p->state == SLEEPING && p->chan == chan) +            p->state = RUNNABLE; +    }	 +} +</pre> + +<p>What's wrong?  What if the wakeup runs right after +the release(lk) in sleep? +It still misses the sleep. + +<p>Move the lock down: +<pre> +void +sleep(void *chan, struct spinlock *lk) +{ +    struct proc *p = curproc[cpu()]; +     +    p->chan = chan; +    p->state = SLEEPING; +    release(lk); +    sched(); +} + +void +wakeup(void *chan) +{ +    for(each proc p) { +        if(p->state == SLEEPING && p->chan == chan) +            p->state = RUNNABLE; +    }	 +} +</pre> + +<p>This almost works.  Recall from last lecture that we also need +to acquire the proc_table_lock before calling sched, to +protect p->jmpbuf. + +<pre> +void +sleep(void *chan, struct spinlock *lk) +{ +    struct proc *p = curproc[cpu()]; +     +    p->chan = chan; +    p->state = SLEEPING; +    acquire(&proc_table_lock); +    release(lk); +    sched(); +} +</pre> + +<p>The problem is that now we're using lk to protect +access to the p->chan and p->state variables +but other routines besides sleep and wakeup  +(in particular, proc_kill) will need to use them and won't +know which lock protects them. +So instead of protecting them with lk, let's use proc_table_lock: + +<pre> +void +sleep(void *chan, struct spinlock *lk) +{ +    struct proc *p = curproc[cpu()]; +     +    acquire(&proc_table_lock); +    release(lk); +    p->chan = chan; +    p->state = SLEEPING; +    sched(); +} +void +wakeup(void *chan) +{ +    acquire(&proc_table_lock); +    for(each proc p) { +        if(p->state == SLEEPING && p->chan == chan) +            p->state = RUNNABLE; +    } +    release(&proc_table_lock); +} +</pre> + +<p>One could probably make things work with lk as above, +but the relationship between data and locks would be  +more complicated with no real benefit.  Xv6 takes the easy way out  +and says that elements in the proc structure are always protected +by proc_table_lock. + +<h3>Use example: exit and wait</h3> + +<p>If proc_wait decides there are children to be waited for, +it calls sleep at line 2462. +When a process exits, we proc_exit scans the process table +to find the parent and wakes it at 2408. + +<p>Which lock protects sleep and wakeup from missing each other? +Proc_table_lock.  Have to tweak sleep again to avoid double-acquire: + +<pre> +if(lk != &proc_table_lock) { +    acquire(&proc_table_lock); +    release(lk); +} +</pre> + +<h3>New feature: kill</h3> + +<p>Proc_kill marks a process as killed (line 2371). +When the process finally exits the kernel to user space, +or if a clock interrupt happens while it is in user space, +it will be destroyed (line 2886, 2890, 2912). + +<p>Why wait until the process ends up in user space? + +<p>What if the process is stuck in sleep?  It might take a long +time to get back to user space. +Don't want to have to wait for it, so make sleep wake up early +(line 2373). + +<p>This means all callers of sleep should check +whether they have been killed, but none do. +Bug in xv6. + +<h3>System call handlers</h3> + +<p>Sheet 32 + +<p>Fork: discussed copyproc in earlier lectures. +Sys_fork (line 3218) just calls copyproc +and marks the new proc runnable. 
+Does fork create a new process or a new thread? +Is there any shared context? + +<p>Exec: we'll talk about exec later, when we talk about file systems. + +<p>Sbrk: Saw growproc earlier.  Why setupsegs before returning? diff --git a/web/l-fs.html b/web/l-fs.html new file mode 100644 index 0000000..ed911fc --- /dev/null +++ b/web/l-fs.html @@ -0,0 +1,222 @@ +<title>L10</title> +<html> +<head> +</head> +<body> + +<h1>File systems</h1> + +<p>Required reading: iread, iwrite, and wdir, and code related to +  these calls in fs.c, bio.c, ide.c, file.c, and sysfile.c + +<h2>Overview</h2> + +<p>The next 3 lectures are about file systems: +<ul> +<li>Basic file system implementation +<li>Naming +<li>Performance +</ul> + +<p>Users desire to store their data durable so that data survives when +the user turns of his computer.  The primary media for doing so are: +magnetic disks, flash memory, and tapes.  We focus on magnetic disks +(e.g., through the IDE interface in xv6). + +<p>To allow users to remember where they stored a file, they can +assign a symbolic name to a file, which appears in a directory. + +<p>The data in a file can be organized in a structured way or not. +The structured variant is often called a database.  UNIX uses the +unstructured variant: files are streams of bytes.  Any particular +structure is likely to be useful to only a small class of +applications, and other applications will have to work hard to fit +their data into one of the pre-defined structures. Besides, if you +want structure, you can easily write a user-mode library program that +imposes that format on any file.  The end-to-end argument in action. +(Databases have special requirements and support an important class of +applications, and thus have a specialized plan.) + +<p>The API for a minimal file system consists of: open, read, write, +seek, close, and stat.  Dup duplicates a file descriptor. For example: +<pre> +  fd = open("x", O_RDWR); +  read (fd, buf, 100); +  write (fd, buf, 512); +  close (fd) +</pre> + +<p>Maintaining the file offset behind the read/write interface is an +  interesting design decision . The alternative is that the state of a +  read operation should be maintained by the process doing the reading +  (i.e., that the pointer should be passed as an argument to read). +  This argument is compelling in view of the UNIX fork() semantics, +  which clones a process which shares the file descriptors of its +  parent. A read by the parent of a shared file descriptor (e.g., +  stdin, changes the read pointer seen by the child).  On the other +  hand the alternative would make it difficult to get "(data; ls) > x" +  right. + +<p>Unix API doesn't specify that the effects of write are immediately +  on the disk before a write returns. It is up to the implementation +  of the file system within certain bounds. Choices include (that +  aren't non-exclusive): +<ul> +<li>At some point in the future, if the system stays up (e.g., after +  30 seconds); +<li>Before the write returns; +<li>Before close returns; +<li>User specified (e.g., before fsync returns). +</ul> + +<p>A design issue is the semantics of a file system operation that +  requires multiple disk writes.  In particular, what happens if the +  logical update requires writing multiple disks blocks and the power +  fails during the update?  
For example, to create a new file, +  requires allocating an inode (which requires updating the list of +  free inodes on disk), writing a directory entry to record the +  allocated i-node under the name of the new file (which may require +  allocating a new block and updating the directory inode).  If the +  power fails during the operation, the list of free inodes and blocks +  may be inconsistent with the blocks and inodes in use. Again this is +  up to implementation of the file system to keep on disk data +  structures consistent: +<ul> +<li>Don't worry about it much, but use a recovery program to bring +  file system back into a consistent state. +<li>Journaling file system.  Never let the file system get into an +  inconsistent state. +</ul> + +<p>Another design issue is the semantics are of concurrent writes to +the same data item.  What is the order of two updates that happen at +the same time? For example, two processes open the same file and write +to it.  Modern Unix operating systems allow the application to lock a +file to get exclusive access.  If file locking is not used and if the +file descriptor is shared, then the bytes of the two writes will get +into the file in some order (this happens often for log files).  If +the file descriptor is not shared, the end result is not defined. For +example, one write may overwrite the other one (e.g., if they are +writing to the same part of the file.) + +<p>An implementation issue is performance, because writing to magnetic +disk is relatively expensive compared to computing. Three primary ways +to improve performance are: careful file system layout that induces +few seeks, an in-memory cache of frequently-accessed blocks, and +overlap I/O with computation so that file operations don't have to +wait until their completion and so that that the disk driver has more +data to write, which allows disk scheduling. (We will talk about +performance in detail later.) + +<h2>xv6 code examples</h2> + +<p>xv6 implements a minimal Unix file system interface. xv6 doesn't +pay attention to file system layout. It overlaps computation and I/O, +but doesn't do any disk scheduling.  Its cache is write-through, which +simplifies keep on disk datastructures consistent, but is bad for +performance. + +<p>On disk files are represented by an inode (struct dinode in fs.h), +and blocks.  Small files have up to 12 block addresses in their inode; +large files use files the last address in the inode as a disk address +for a block with 128 disk addresses (512/4).  The size of a file is +thus limited to 12 * 512 + 128*512 bytes.  What would you change to +support larger files? (Ans: e.g., double indirect blocks.) + +<p>Directories are files with a bit of structure to them. The file +contains of records of the type struct dirent.  The entry contains the +name for a file (or directory) and its corresponding inode number. +How many files can appear in a directory? + +<p>In memory files are represented by struct inode in fsvar.h. What is +the role of the additional fields in struct inode? + +<p>What is xv6's disk layout?  How does xv6 keep track of free blocks +  and inodes? See balloc()/bfree() and ialloc()/ifree().  Is this +  layout a good one for performance?  What are other options? + +<p>Let's assume that an application created an empty file x with +  contains 512 bytes, and that the application now calls read(fd, buf, +  100), that is, it is requesting to read 100 bytes into buf. +  Furthermore, let's assume that the inode for x is is i. 
Let's pick +  up what happens by investigating readi(), line 4483. +<ul> +<li>4488-4492: can iread be called on other objects than files?  (Yes. +  For example, read from the keyboard.)  Everything is a file in Unix. +<li>4495: what does bmap do? +<ul> +<li>4384: what block is being read? +</ul> +<li>4483: what does bread do?  does bread always cause a read to disk? +<ul> +<li>4006: what does bget do?  it implements a simple cache of +  recently-read disk blocks. +<ul> +<li>How big is the cache?  (see param.h) +<li>3972: look if the requested block is in the cache by walking down +  a circular list. +<li>3977: we had a match. +<li>3979: some other process has "locked" the block, wait until it +  releases.  the other processes releases the block using brelse(). +Why lock a block? +<ul> +<li>Atomic read and update.  For example, allocating an inode: read +  block containing inode, mark it allocated, and write it back.  This +  operation must be atomic. +</ul> +<li>3982: it is ours now. +<li>3987: it is not in the cache; we need to find a cache entry to +  hold the block. +<li>3987: what is the cache replacement strategy? (see also brelse()) +<li>3988: found an entry that we are going to use. +<li>3989: mark it ours but don't mark it valid (there is no valid data +  in the entry yet). +</ul> +<li>4007: if the block was in the cache and the entry has the block's +  data, return. +<li>4010: if the block wasn't in the cache, read it from disk. are +  read's synchronous or asynchronous? +<ul> +<li>3836: a bounded buffer of outstanding disk requests. +<li>3809: tell the disk to move arm and generate an interrupt. +<li>3851: go to sleep and run some other process to run. time sharing +  in action. +<li>3792: interrupt: arm is in the right position; wakeup requester. +<li>3856: read block from disk. +<li>3860: remove request from bounded buffer.  wakeup processes that +  are waiting for a slot. +<li>3864: start next disk request, if any. xv6 can overlap I/O with +computation. +</ul> +<li>4011: mark the cache entry has holding the data. +</ul> +<li>4498: To where is the block copied?  is dst a valid user address? +</ul> + +<p>Now let's suppose that the process is writing 512 bytes at the end +  of the file a. How many disk writes will happen? +<ul> +<li>4567: allocate a new block +<ul> +<li>4518: allocate a block: scan block map, and write entry +<li>4523: How many disk operations if the process would have been appending +  to a large file? (Answer: read indirect block, scan block map, write +  block map.) +</ul> +<li>4572: read the block that the process will be writing, in case the +  process writes only part of the block. +<li>4574: write it. is it synchronous or asynchronous? (Ans: +  synchronous but with timesharing.) +</ul> + +<p>Lots of code to implement reading and writing of files. How about +  directories? +<ul> +<li>4722: look for the directory, reading directory block and see if a +  directory entry is unused (inum == 0). +<li>4729: use it and update it. +<li>4735: write the modified block. +</ul> +<p>Reading and writing of directories is trivial. + +</body> diff --git a/web/l-interrupt.html b/web/l-interrupt.html new file mode 100644 index 0000000..363af5e --- /dev/null +++ b/web/l-interrupt.html @@ -0,0 +1,174 @@ +<html> +<head><title>Lecture 6: Interrupts & Exceptions</title></head> +<body> + +<h1>Interrupts & Exceptions</h1> + +<p> +Required reading: xv6 <code>trapasm.S</code>, <code>trap.c</code>, <code>syscall.c</code>, <code>usys.S</code>. 
+<br> +You will need to consult +<a href="../readings/ia32/IA32-3.pdf">IA32 System +Programming Guide</a> chapter 5 (skip 5.7.1, 5.8.2, 5.12.2). + +<h2>Overview</h2> + +<p> +Big picture: kernel is trusted third-party that runs the machine. +Only the kernel can execute privileged instructions (e.g., +changing MMU state). +The processor enforces this protection through the ring bits +in the code segment. +If a user application needs to carry out a privileged operation +or other kernel-only service, +it must ask the kernel nicely. +How can a user program change to the kernel address space? +How can the kernel transfer to a user address space? +What happens when a device attached to the computer  +needs attention? +These are the topics for today's lecture. + +<p> +There are three kinds of events that must be handled +by the kernel, not user programs: +(1) a system call invoked by a user program, +(2) an illegal instruction or other kind of bad processor state (memory fault, etc.). +and  +(3) an interrupt from a hardware device. + +<p> +Although these three events are different, they all use the same +mechanism to transfer control to the kernel. +This mechanism consists of three steps that execute as one atomic unit. +(a) change the processor to kernel mode; +(b) save the old processor somewhere (usually the kernel stack); +and (c) change the processor state to the values set up as +the “official kernel entry values.” +The exact implementation of this mechanism differs +from processor to processor, but the idea is the same. + +<p> +We'll work through examples of these today in lecture. +You'll see all three in great detail in the labs as well. + +<p> +A note on terminology: sometimes we'll +use interrupt (or trap) to mean both interrupts and exceptions. + +<h2> +Setting up traps on the x86 +</h2> + +<p> +See handout Table 5-1, Figure 5-1, Figure 5-2. + +<p> +xv6 Sheet 07: <code>struct gatedesc</code> and <code>SETGATE</code>. + +<p> +xv6 Sheet 28: <code>tvinit</code> and <code>idtinit</code>. +Note setting of gate for <code>T_SYSCALL</code> + +<p> +xv6 Sheet 29: <code>vectors.pl</code> (also see generated <code>vectors.S</code>). + +<h2> +System calls +</h2> + +<p> +xv6 Sheet 16: <code>init.c</code> calls <code>open("console")</code>. +How is that implemented? + +<p> +xv6 <code>usys.S</code> (not in book). +(No saving of registers.  Why?) + +<p> +Breakpoint <code>0x1b:"open"</code>, +step past <code>int</code> instruction into kernel. + +<p> +See handout Figure 9-4 [sic]. + +<p> +xv6 Sheet 28: in <code>vectors.S</code> briefly, then in <code>alltraps</code>. +Step through to <code>call trap</code>, examine registers and stack. +How will the kernel find the argument to <code>open</code>? + +<p> +xv6 Sheet 29: <code>trap</code>, on to <code>syscall</code>. + +<p> +xv6 Sheet 31: <code>syscall</code> looks at <code>eax</code>, +calls <code>sys_open</code>. + +<p> +(Briefly) +xv6 Sheet 52: <code>sys_open</code> uses <code>argstr</code> and <code>argint</code> +to get its arguments.  How do they work? + +<p> +xv6 Sheet 30: <code>fetchint</code>, <code>fetcharg</code>, <code>argint</code>, +<code>argptr</code>, <code>argstr</code>. + +<p> +What happens if a user program divides by zero +or accesses unmapped memory? +Exception.  Same path as system call until <code>trap</code>. + +<p> +What happens if kernel divides by zero or accesses unmapped memory? 
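+
+<p>
+Returning to the system call path: here is a simplified sketch, in the spirit
+of <code>fetchint</code> and <code>argint</code>, of how a kernel can fetch an
+integer argument from the user stack.  This is <i>not</i> the actual xv6 code
+(that is on Sheet 30); it assumes the process's memory is one contiguous range
+of <code>size</code> bytes starting at kernel address <code>base</code>, and
+that the user pushed its arguments just above the return address before
+executing <code>int</code>.
+
+<pre>
+/* sketch only: reject any address outside the process before dereferencing */
+int
+fetchint_sketch(char *base, unsigned size, unsigned addr, int *ip)
+{
+  if(size < 4 || addr > size - 4)
+    return -1;
+  *ip = *(int*)(base + addr);
+  return 0;
+}
+
+int
+argint_sketch(char *base, unsigned size, unsigned uesp, int n, int *ip)
+{
+  /* argument n sits above the saved return address on the user stack */
+  return fetchint_sketch(base, size, uesp + 4 + 4*n, ip);
+}
+</pre>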
+ +<h2> +Interrupts +</h2> + +<p> +Like system calls, except: +devices generate them at any time, +there are no arguments in CPU registers, +nothing to return to, +usually can't ignore them. + +<p> +How do they get generated? +Device essentially phones up the  +interrupt controller and asks to talk to the CPU. +Interrupt controller then buzzes the CPU and +tells it, “keyboard on line 1.” +Interrupt controller is essentially the CPU's +<strike>secretary</strike> administrative assistant, +managing the phone lines on the CPU's behalf. + +<p> +Have to set up interrupt controller. + +<p> +(Briefly) xv6 Sheet 63: <code>pic_init</code> sets up the interrupt controller, +<code>irq_enable</code> tells the interrupt controller to let the given +interrupt through.  + +<p> +(Briefly) xv6 Sheet 68: <code>pit8253_init</code> sets up the clock chip, +telling it to interrupt on <code>IRQ_TIMER</code> 100 times/second. +<code>console_init</code> sets up the keyboard, enabling <code>IRQ_KBD</code>. + +<p> +In Bochs, set breakpoint at 0x8:"vector0" +and continue, loading kernel. +Step through clock interrupt, look at  +stack, registers. + +<p> +Was the processor executing in kernel or user mode +at the time of the clock interrupt? +Why?  (Have any user-space instructions executed at all?) + +<p> +Can the kernel get an interrupt at any time? +Why or why not?  <code>cli</code> and <code>sti</code>, +<code>irq_enable</code>. + +</body> +</html> diff --git a/web/l-lock.html b/web/l-lock.html new file mode 100644 index 0000000..eea8217 --- /dev/null +++ b/web/l-lock.html @@ -0,0 +1,322 @@ +<title>L7</title> +<html> +<head> +</head> +<body> + +<h1>Locking</h1> + +<p>Required reading: spinlock.c + +<h2>Why coordinate?</h2> + +<p>Mutual-exclusion coordination is an important topic in operating +systems, because many operating systems run on +multiprocessors. Coordination techniques protect variables that are +shared among multiple threads and updated concurrently.  These +techniques allow programmers to implement atomic sections so that one +thread can safely update the shared variables without having to worry +that another thread intervening.  For example, processes in xv6 may +run concurrently on different processors and in kernel-mode share +kernel data structures.  We must ensure that these updates happen +correctly. + +<p>List and insert example: +<pre> + +struct List { +  int data; +  struct List *next; +}; + +List *list = 0; + +insert(int data) { +  List *l = new List; +  l->data = data; +  l->next = list;  // A +  list = l;        // B +} +</pre> + +<p>What needs to be atomic? The two statements labeled A and B should +always be executed together, as an indivisible fragment of code.  If +two processors execute A and B interleaved, then we end up with an +incorrect list.  To see that this is the case, draw out the list after +the sequence A1 (statement executed A by processor 1), A2 (statement A +executed by processor 2), B2, and B1. + +<p>How could this erroneous sequence happen? The varilable <i>list</i> +lives in physical memory shared among multiple processors, connected +by a bus.  The accesses to the shared memory will be ordered in some +total order by the bus/memory system.  If the programmer doesn't +coordinate the execution of the statements A and B, any order can +happen, including the erroneous one. + +<p>The erroneous case is called a race condition.  The problem with +races is that they are difficult to reproduce.  
For example, if you +put print statements in to debug the incorrect behavior, you might +change the time and the race might not happen anymore. + +<h2>Atomic instructions</h2> + +<p>The programmer must be able express that A and B should be executed +as single atomic instruction.  We generally use a concept like locks +to mark an atomic region, acquiring the lock at the beginning of the +section and releasing it at the end:  + +<pre>  +void acquire(int *lock) { +   while (TSL(lock) != 0) ;  +} + +void release (int *lock) { +  *lock = 0; +} +</pre> + +<p>Acquire and release, of course, need to be atomic too, which can, +for example, be done with a hardware atomic TSL (try-set-lock) +instruction: + +<p>The semantics of TSL are: +<pre> +   R <- [mem]   // load content of mem into register R +   [mem] <- 1   // store 1 in mem. +</pre> + +<p>In a harware implementation, the bus arbiter guarantees that both +the load and store are executed without any other load/stores coming +in between. + +<p>We can use locks to implement an atomic insert, or we can use +TSL directly: +<pre> +int insert_lock = 0; + +insert(int data) { + +  /* acquire the lock: */ +  while(TSL(&insert_lock) != 0) +    ; + +  /* critical section: */ +  List *l = new List; +  l->data = data; +  l->next = list; +  list = l; + +  /* release the lock: */ +  insert_lock = 0; +} +</pre> + +<p>It is the programmer's job to make sure that locks are respected.  If +a programmer writes another function that manipulates the list, the +programmer must must make sure that the new functions acquires and +releases the appropriate locks.  If the programmer doesn't, race +conditions occur. + +<p>This code assumes that stores commit to memory in program order and +that all stores by other processors started before insert got the lock +are observable by this processor.  That is, after the other processor +released a lock, all the previous stores are committed to memory.  If +a processor executes instructions out of order, this assumption won't +hold and we must, for example, a barrier instruction that makes the +assumption true. + + +<h2>Example: Locking on x86</h2> + +<p>Here is one way we can implement acquire and release using the x86 +xchgl instruction: + +<pre> +struct Lock { +  unsigned int locked; +}; + +acquire(Lock *lck) { +  while(TSL(&(lck->locked)) != 0) +    ; +} + +release(Lock *lck) { +  lck->locked = 0; +} + +int +TSL(int *addr) +{ +  register int content = 1; +  // xchgl content, *addr +  // xchgl exchanges the values of its two operands, while +  // locking the memory bus to exclude other operations. +  asm volatile ("xchgl %0,%1" : +                "=r" (content), +                "=m" (*addr) : +                "0" (content), +                "m" (*addr)); +  return(content); +} +</pre> + +<p>the instruction "XCHG %eax, (content)" works as follows: +<ol> +<li> freeze other CPUs' memory activity +<li> temp := content +<li> content := %eax +<li> %eax := temp +<li> un-freeze other CPUs +</ol> + +<p>steps 1 and 5 make XCHG special: it is "locked" special signal +  lines on the inter-CPU bus, bus arbitration + +<p>This implementation doesn't scale to a large number of processors; +  in a later lecture we will see how we could do better. + +<h2>Lock granularity</h2> + +<p>Release/acquire is ideal for short atomic sections: increment a +counter, search in i-node cache, allocate a free buffer. + +<p>What are spin locks not so great for?  Long atomic sections may +  waste waiters' CPU time and it is to sleep while holding locks. 
In
+  xv6 we try to avoid long atomic sections by careful coding (can
+  you find an example?).  xv6 doesn't release the processor when
+  holding a lock, but it has an additional set of coordination primitives
+  (sleep and wakeup), which we will study later.
+
+<p>A single list_lock protecting all lists means that inserts to
+  different lists needlessly block each other.  A lock per list would
+  waste less time spinning, so you might want "fine-grained" locks, one
+  for every object; BUT acquire/release are expensive (500 cycles on my
+  3 GHz machine) because they need to talk off-chip.
+
+<p>Also, "correctness" is not that simple with fine-grained locks if
+  you need to maintain global invariants; e.g., "every buffer must be on
+  exactly one of the free list and the device list".  Per-list locks are
+  irrelevant for this invariant.  So you might want "large-grained"
+  locks, which reduce locking overhead but also reduce concurrency.
+
+<p>This tension is hard to get right. One often starts out with
+  large-grained locks and measures the performance of the system on
+  some workloads. When more concurrency is desired (to get better
+  performance), an implementor may switch to a more fine-grained
+  scheme. Operating system designers fiddle with this all the time.
+
+<h2>Recursive locks and modularity</h2>
+
+<p>When designing a system we desire clean abstractions and good
+  modularity. We would like a caller not to have to know how a callee
+  implements a particular function.  Locks make achieving modularity
+  more complicated.  For example, what should happen when the caller
+  holds a lock and then calls a function that also needs the lock to
+  perform its job?
+
+<p>There are no transparent solutions that allow the caller and callee
+  to be unaware of which locks they use. One transparent, but
+  unsatisfactory, option is recursive locks: if a callee asks for a
+  lock that its caller already holds, then we allow the callee to proceed.
+  Unfortunately, this solution is not ideal either.
+
+<p>Consider the following. If lock x protects the internals of some
+  struct foo, then when a caller acquires lock x, it knows that the
+  internals of foo are in a sane state and it can fiddle with them.
+  The caller must then restore them to a sane state before releasing
+  lock x, but until then anything goes.
+
+<p>This assumption doesn't hold with recursive locking.  After
+  acquiring lock x, the acquirer knows only that either it is the first
+  to get this lock, in which case the internals are in a sane state, or
+  that some caller up the stack already holds the lock, has perhaps
+  messed up the internals, and didn't realize when calling the callee
+  that the callee would try to look at them too.  So the fact that a
+  function holds lock x doesn't guarantee anything at all. In short,
+  locks protect against callers and callees just as much as they
+  protect against other threads.
+
+<p>Since transparent solutions aren't ideal, it is better to consider
+  locks part of the function's specification. The programmer must
+  arrange that a caller doesn't invoke another function while holding
+  a lock that the callee also needs.
+
+<h2>Locking in xv6</h2>
+
+<p>xv6 runs on a multiprocessor and is programmed to allow multiple
+threads of computation to run concurrently.  In xv6 an interrupt might
+run on one processor while a process in kernel mode runs on another
+processor, sharing a kernel data structure with the interrupt routine.
+xv6 uses locks, implemented using an atomic instruction, to coordinate
+these concurrent activities.
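+
+<p>Concretely, kernel code that touches a shared structure brackets the
+access with acquire/release, and makes its check-and-update atomic by
+doing both while the lock is held.  The sketch below is in the style of
+xv6's proc-table code (the state names UNUSED/EMBRYO and NPROC follow
+xv6), but the function itself is a simplified illustration, not the
+verbatim code:
+
+<pre>
+struct spinlock proc_table_lock;
+
+// Find and claim a free slot in the shared proc table.
+// The test of p->state and the update to EMBRYO happen under the lock,
+// so two CPUs can never claim the same slot.
+struct proc*
+alloc_slot(void)
+{
+  struct proc *p;
+
+  acquire(&proc_table_lock);
+  for(p = proc; p < &proc[NPROC]; p++){
+    if(p->state == UNUSED){
+      p->state = EMBRYO;            // claim the slot while still holding the lock
+      release(&proc_table_lock);
+      return p;
+    }
+  }
+  release(&proc_table_lock);
+  return 0;                          // table is full
+}
+</pre>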
+ +<p>Let's check out why xv6 needs locks by following what happens when +we start a second processor: +<ul> +<li>1516: mp_init (called from main0) +<li>1606: mp_startthem (called from main0) +<li>1302: mpmain +<li>2208: scheduler. +  <br>Now we have several processors invoking the scheduler +  function. xv6 better ensure that multiple processors don't run the +  same process!  does it? +  <br>Yes, if multiple schedulers run concurrently, only one will +  acquire proc_table_lock, and proceed looking for a runnable +  process. if it finds a process, it will mark it running, longjmps to +  it, and the process will release proc_table_lock.  the next instance +  of scheduler will skip this entry, because it is marked running, and +  look for another runnable process. +</ul> + +<p>Why hold proc_table_lock during a context switch?  It protects +p->state; the process has to hold some lock to avoid a race with +wakeup() and yield(), as we will see in the next lectures. + +<p>Why not a lock per proc entry?  It might be expensive in in whole +table scans (in wait, wakeup, scheduler). proc_table_lock also +protects some larger invariants, for example it might be hard to get +proc_wait() right with just per entry locks.  Right now the check to +see if there are any exited children and the sleep are atomic -- but +that would be hard with per entry locks.  One could have both, but +that would probably be neither clean nor fast. + +<p>Of course, there is only processor searching the proc table if +acquire is implemented correctly. Let's check out acquire in +spinlock.c: +<ul> +<li>1807: no recursive locks! +<li>1811: why disable interrupts on the current processor? (if +interrupt code itself tries to take a held lock, xv6 will deadlock; +the panic will fire on 1808.) +<ul> +<li>can a process on a processor hold multiple locks? +</ul> +<li>1814: the (hopefully) atomic instruction. +<ul> +<li>see sheet 4, line 0468. +</ul> +<li>1819: make sure that stores issued on other processors before we +got the lock are observed by this processor.  these may be stores to +the shared data structure that is protected by the lock. +</ul> + +<p> + +<h2>Locking in JOS</h2> + +<p>JOS is meant to run on single-CPU machines, and the plan can be +simple.  The simple plan is disabling/enabling interrupts in the +kernel (IF flags in the EFLAGS register).  Thus, in the kernel, +threads release the processors only when they want to and can ensure +that they don't release the processor during a critical section. + +<p>In user mode, JOS runs with interrupts enabled, but Unix user +applications don't share data structures.  The data structures that +must be protected, however, are the ones shared in the library +operating system (e.g., pipes). In JOS we will use special-case +solutions, as you will find out in lab 6.  For example, to implement +pipe we will assume there is one reader and one writer.  The reader +and writer never update each other's variables; they only read each +other's variables.  Carefully programming using this rule we can avoid +races. diff --git a/web/l-mkernel.html b/web/l-mkernel.html new file mode 100644 index 0000000..2984796 --- /dev/null +++ b/web/l-mkernel.html @@ -0,0 +1,262 @@ +<title>Microkernel lecture</title> +<html> +<head> +</head> +<body> + +<h1>Microkernels</h1> + +<p>Required reading: Improving IPC by kernel design + +<h2>Overview</h2> + +<p>This lecture looks at the microkernel organization.  
In a +microkernel, services that a monolithic kernel implements in the +kernel are running as user-level programs.  For example, the file +system, UNIX process management, pager, and network protocols each run +in a separate user-level address space.  The microkernel itself +supports only the services that are necessary to allow system services +to run well in user space; a typical microkernel has at least support +for creating address spaces, threads, and inter process communication. + +<p>The potential advantages of a microkernel are simplicity of the +kernel (small), isolation of operating system components (each runs in +its own user-level address space), and flexibility (we can have a file +server and a database server).  One potential disadvantage is +performance loss, because what in a monolithich kernel requires a +single system call may require in a microkernel multiple system calls +and context switches. + +<p>One way in how microkernels differ from each other is the exact +kernel API they implement.  For example, Mach (a system developed at +CMU, which influenced a number of commercial operating systems) has +the following system calls: processes (create, terminate, suspend, +resume, priority, assign, info, threads), threads (fork, exit, join, +detach, yield, self), ports and messages (a port is a unidirectionally +communication channel with a message queue and supporting primitives +to send, destroy, etc), and regions/memory objects (allocate, +deallocate, map, copy, inherit, read, write). + +<p>Some microkernels are more "microkernel" than others.  For example, +some microkernels implement the pager in user space but the basic +virtual memory abstractions in the kernel (e.g, Mach); others, are +more extreme, and implement most of the virtual memory in user space +(L4).  Yet others are less extreme: many servers run in their own +address space, but in kernel mode (Chorus). + +<p>All microkernels support multiple threads per address space. xv6 +and Unix until recently didn't; why? Because, in Unix system services +are typically implemented in the kernel, and those are the primary +programs that need multiple threads to handle events concurrently +(waiting for disk and processing new I/O requests).  In microkernels, +these services are implemented in user-level address spaces and so +they need a mechanism to deal with handling operations concurrently. +(Of course, one can argue if fork efficient enough, there is no need +to have threads.) + +<h2>L3/L4</h2> + +<p>L3 is a predecessor to L4.  L3 provides data persistence, DOS +emulation, and ELAN runtime system.  L4 is a reimplementation of L3, +but without the data persistence.  L4KA is a project at +sourceforge.net, and you can download the code for the latest +incarnation of L4 from there. + +<p>L4 is a "second-generation" microkernel, with 7 calls: IPC (of +which there are several types), id_nearest (find a thread with an ID +close the given ID), fpage_unmap (unmap pages, mapping is done as a +side-effect of IPC), thread_switch (hand processor to specified +thread), lthread_ex_regs (manipulate thread registers), +thread_schedule (set scheduling policies), task_new (create a new +address space with some default number of threads).  These calls +provide address spaces, tasks, threads, interprocess communication, +and unique identifiers.  An address space is a set of mappings. +Multiple threads may share mappings, a thread may grants mappings to +another thread (through IPC).  Task is the set of threads sharing an +address space. 
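+
+<p>To make the structure concrete: a system service in such a design is
+an ordinary user-level task looping on IPC.  The sketch below is
+hypothetical, not the real L4 ABI; <code>ipc_recv</code>,
+<code>ipc_reply</code>, <code>struct msg</code>, and the
+<code>do_*</code> helpers are illustrative stand-ins for the primitives
+described above.
+
+<pre>
+struct msg { int op; int ret; char path[64]; };
+enum { MSG_OPEN, MSG_UNLINK };
+
+// A user-level file server: receive a request, handle it, reply.
+void
+fileserver(void)
+{
+  struct msg m;
+  int client;
+
+  for(;;){
+    client = ipc_recv(&m);            // block until a client thread sends a request
+    switch(m.op){
+    case MSG_OPEN:   m.ret = do_open(m.path);   break;
+    case MSG_UNLINK: m.ret = do_unlink(m.path); break;
+    default:         m.ret = -1;      break;
+    }
+    ipc_reply(client, &m);            // a second IPC returns the result to the client
+  }
+}
+</pre>
+
+<p>What a monolithic kernel finishes in a single system call thus takes
+at least two IPCs (request and reply) here, which is why the rest of
+the lecture worries so much about IPC cost.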
+ +<p>A thread is the execution abstraction; it belongs to an address +space, a UID, a register set, a page fault handler, and an exception +handler. A UID of a thread is its task number plus the number of the +thread within that task. + +<p>IPC passes data by value or by reference to another address space. +It also provide for sequence coordination.  It is used for +communication between client and servers, to pass interrupts to a +user-level exception handler, to pass page faults to an external +pager.  In L4, device drivers are implemented has a user-level +processes with the device mapped into their address space. +Linux runs as a user-level process. + +<p>L4 provides quite a scala of messages types: inline-by-value, +strings, and virtual memory mappings.  The send and receive descriptor +specify how many, if any. + +<p>In addition, there is a system call for timeouts and controling +thread scheduling. + +<h2>L3/L4 paper discussion</h2> + +<ul> + +<li>This paper is about performance.  What is a microsecond?  Is 100 +usec bad?  Is 5 usec so much better we care? How many instructions +does 50-Mhz x86 execute in 100 usec?  What can we compute with that +number of instructions?  How many disk operations in that time?  How +many interrupts can we take? (The livelock paper, which we cover in a +few lectures, mentions 5,000 network pkts per second, and each packet +generates two interrrupts.) + +<li>In performance calculations, what is the appropriate/better metric? +Microseconds or cycles? + +<li>Goal: improve IPC performance by a factor 10 by careful kernel +design that is fully aware of the hardware it is running on. +Principle: performance rules!  Optimize for the common case.  Because +in L3 interrupts are propagated to user-level using IPC, the system +may have to be able to support many IPCs per second (as many as the +device can generate interrupts). + +<li>IPC consists of transfering control and transfering data.  The +minimal cost for transfering control is 127 cycles, plus 45 cycles for +TLB misses (see table 3).  What are the x86 instructions to enter and +leave the kernel? (int, iret) Why do they consume so much time? +(Flush pipeline) Do modern processors perform these operations more +efficient?  Worse now.  Faster processors optimized for straight-line +code; Traps/Exceptions flush deeper pipeline, cache misses cost more +cycles. + +<li>What are the 5 TLB misses: 1) B's thread control block; loading %cr3 +flushes TLB, so 2) kernel text causes miss; iret, accesses both 3) stack and +4+5) user text - two pages B's user code looks at message + +<li>Interface: +<ul> +<li>call (threadID, send-message, receive-message, timeout); +<li>reply_and_receive (reply-message, receive-message, timeout); +</ul> + +<li>Optimizations: +<ul> + +<li>New system call: reply_and_receive.  Effect: 2 system calls per +RPC. + +<li>Complex messages: direct string, indirect strings, and memory +objects. + +<li>Direct transfer by temporary mapping through a communication +window.  The communication window is mapped in B address space and in +A's kernel address space; why is this better than just mapping a page +shared between A and B's address space?  
1) Multi-level security, it +makes it hard to reason about information flow; 2) Receiver can't +check message legality (might change after check); 3) When server has +many clients, could run out of virtual address space Requires shared +memory region to be established ahead of time; 4) Not application +friendly, since data may already be at another address, i.e. +applications would have to copy anyway--possibly more copies. + +<li>Why not use the following approach: map the region copy-on-write +(or read-only) in A's address space after send and read-only in B's +address space?  Now B may have to copy data or cannot receive data in +its final destination. + +<li>On the x86 implemented by coping B's PDE into A's address space. +Why two PDEs?  (Maximum message size is 4 Meg, so guaranteed to work +if the message starts in the bottom for 4 Mbyte of an 8 Mbyte mapped +region.) Why not just copy PTEs?  Would be much more expensive + +<li> What does it mean for the TLB to be "window clean"?  Why do we +care?  Means TLB contains no mappings within communication window. We +care because mapping is cheap (copy PDE), but invalidation not; x86 +only lets you invalidate one page at a time, or whole TLB Does TLB +invalidation of communication window turn out to be a problem?  Not +usually, because have to load %cr3 during IPC anyway + +<li>Thread control block registers, links to various double-linked +  lists, pgdir, uid, etc.. Lower part of thread UID contains TCB +  number.  Can also dededuce TCB address from stack by taking SP AND +  bitmask (the SP comes out of the TSS when just switching to kernel). + +<li> Kernel stack is on same page as tcb.  why?  1) Minimizes TLB +misses (since accessing kernel stack will bring in tcb); 2) Allows +very efficient access to tcb -- just mask off lower 12 bits of %esp; +3) With VM, can use lower 32-bits of thread id to indicate which tcb; +using one page per tcb means no need to check if thread is swapped out +(Can simply not map that tcb if shouldn't access it). + +<li>Invariant on queues: queues always hold in-memory TCBs. + +<li>Wakeup queue: set of 8 unordered wakeup lists (wakup time mod 8), +and smart representation of time so that 32-bit integers can be used +in the common case (base + offset in msec; bump base and recompute all +offsets ~4 hours.  maximum timeout is ~24 days, 2^31 msec). + +<li>What is the problem addressed by lazy scheduling? +Conventional approach to scheduling: +<pre> +    A sends message to B: +      Move A from ready queue to waiting queue +      Move B from waiting queue to ready queue +    This requires 58 cycles, including 4 TLB misses.  What are TLB misses? +      One each for head of ready and waiting queues +      One each for previous queue element during the remove +</pre> +<li> Lazy scheduling: +<pre> +    Ready queue must contain all ready threads except current one +      Might contain other threads that aren't actually ready, though +    Each wakeup queue contains all threads waiting in that queue +      Again, might contain other threads, too +      Scheduler removes inappropriate queue entries when scanning +      queue +</pre> +    +<li>Why does this help performance?  Only three situations in which +thread gives up CPU but stays ready: send syscall (as opposed to +call), preemption, and hardware interrupts. So very often can IPC into +thread while not putting it on ready list. + +<li>Direct process switch.  This section just says you should use +kernel threads instead of continuations. 
+ +<li>Short messages via registers. + +<li>Avoiding unnecessary copies.  Basically can send and receive +  messages w. same vector.  Makes forwarding efficient, which is +  important for Clans/Chiefs model. + +<li>Segment register optimization.  Loading segments registers is +  slow, have to access GDT, etc.  But common case is that users don't +  change their segment registers. Observation: it is faster to check +  that segment descriptor than load it.  So just check that segment +  registers are okay.  Only need to load if user code changed them. + +<li>Registers for paramater passing where ever possible: systems calls +and IPC. + +<li>Minimizing TLB misses. Try to cram as many things as possible onto +same page: IPC kernel code, GDT, IDT, TSS, all on same page. Actually +maybe can't fit whole tables but put the important parts of tables on +the same page (maybe beginning of TSS, IDT, or GDT only?) + +<li>Coding tricks: short offsets, avoid jumps, avoid checks, pack +  often-used data on same cache lines, lazily save/restore CPU state +  like debug and FPU registers.  Much of the kernel is written in +  assembly! + +<li>What are the results?  figure 7 and 8 look good. + +<li>Is fast IPC enough to get good overall system performance?  This +paper doesn't make a statement either way; we have to read their 1997 +paper to find find the answer to that question. + +<li>Is the principle of optimizing for performance right?  In general, +it is wrong to optimize for performance; other things matter more.  Is +IPC the one exception?  Maybe, perhaps not.  Was Liedtke fighting a +losing battle against CPU makers?  Should fast IPC time be a hardware, +or just an OS issue? + +</ul> + +</body> diff --git a/web/l-name.html b/web/l-name.html new file mode 100644 index 0000000..9c211f3 --- /dev/null +++ b/web/l-name.html @@ -0,0 +1,181 @@ +<title>L11</title> +<html> +<head> +</head> +<body> + +<h1>Naming in file systems</h1> + +<p>Required reading: nami(), and all other file system code. + +<h2>Overview</h2> + +<p>To help users to remember where they stored their data, most +systems allow users to assign their own names to their data. +Typically the data is organized in files and users assign names to +files.  To deal with many files, users can organize their files in +directories, in a hierarchical manner.  Each name is a pathname, with +the components separated by "/". + +<p>To avoid that users have to type long abolute names (i.e., names +starting with "/" in Unix), users can change their working directory +and use relative names (i.e., naming that don't start with "/"). + +<p>User file namespace operations include create, mkdir, mv, ln +(link), unlink, and chdir. (How is "mv a b" implemented in xv6? +Answer: "link a b"; "unlink a".)  To be able to name the current +directory and the parent directory every directory includes two +entries "." and "..".  Files and directories can reclaimed if users +cannot name it anymore (i.e., after the last unlink). + +<p>Recall from last lecture, all directories entries contain a name, +followed by an inode number. The inode number names an inode of the +file system.  How can we merge file systems from different disks into +a single name space? + +<p>A user grafts new file systems on a name space using mount.  Umount +removes a file system from the name space.  (In DOS, a file system is +named by its device letter.)  
Mount takes the root inode of the
+to-be-mounted file system and grafts it onto the inode of the name space
+entry where the file system is mounted (e.g., /mnt/disk1). The
+in-memory inode of /mnt/disk1 records the major and minor device numbers
+of the file system mounted on it.  When namei sees an inode on which a
+file system is mounted, it looks up the root inode of the mounted file
+system and proceeds with that inode.
+
+<p>Mount is not a durable operation; it doesn't survive power failures.
+After a power failure, the system administrator must remount the file
+system (often in a startup script run from init).
+
+<p>Links are convenient because with them users can create synonyms for
+  file names.  But they also create the potential for cycles in
+  the naming tree.  For example, consider link("a/b/c", "a").  This
+  makes c a synonym for a. This cycle can complicate matters; for
+  example:
+<ul>
+<li>If a user subsequently calls unlink("a"), then the user cannot
+  name the directory "b" or the link "c" anymore, but how can the
+  file system decide that?
+</ul>
+
+<p>One solution is to detect cycles when links are created.  Another is
+  to compute which files are reachable from "/" and reclaim all the
+  ones that aren't reachable.  Unix takes a simpler approach: avoid
+  cycles by disallowing links to directories.  If there are no cycles,
+  then reference counts can be used to see if a file is still
+  referenced. The inode maintains a field counting references (nlink in
+  xv6's dinode). Link increases the reference count, and unlink
+  decreases it; if the count reaches zero, the inode and disk blocks
+  can be reclaimed.
+
+<p>How should links across file systems (i.e., from one mounted file
+  system to another) be handled?  Since inodes are not unique across
+  file systems, we cannot create such a link; a directory entry only
+  contains an inode number, not the inode number plus the name of the
+  disk on which the inode is located.  To handle this case, Unix
+  provides a second type of link, called a soft (symbolic) link.
+
+<p>Soft links are a special file type (e.g., T_SYMLINK).  If namei
+  encounters an inode of type T_SYMLINK, it resolves the name stored in
+  the symlink file to an inode and continues from there.  With
+  symlinks one can create cycles, and a symlink can point to a
+  non-existing file.
+
+<p>The design of the name system can have security implications. For
+  example, if you test whether a name exists and then use the name, an
+  adversary can change the binding from name to object between the test
+  and the use.  Such problems are called TOCTTOU
+  (time-of-check-to-time-of-use) races.
+
+<p>An example of TOCTTOU is as follows.  Say root runs a script
+  every night to remove files in /tmp.  This gets rid of files that
+  editors might have left behind but that will never be used again. An
+  adversary can exploit this script as follows:
+<pre>
+    Root                         Attacker
+                                 mkdir ("/tmp/etc")
+				 creat ("/tmp/etc/passwd")
+    readdir ("/tmp");
+    lstat ("/tmp/etc");
+    readdir ("/tmp/etc");
+                                 rename ("/tmp/etc", "/tmp/x");
+				 symlink ("/etc", "/tmp/etc");
+    unlink ("/tmp/etc/passwd");
+</pre>
+Lstat checks that /tmp/etc is not a symbolic link, but by the time
+unlink runs, the attacker has had time to put a symbolic link in the
+place of /tmp/etc, so root removes the real password file, clearing the
+way for a password file of the adversary's choice.
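+
+<p>In C, the vulnerable pattern is the separate check and use of the
+same path name.  A minimal sketch of the cleanup script's core, with
+the race window marked (hypothetical code, error handling omitted):
+
+<pre>
+#include <sys/stat.h>
+#include <unistd.h>
+
+// Check-then-use on a path name: between the lstat() and the unlink()
+// the attacker can replace /tmp/etc with a symlink to /etc, so the
+// unlink() follows the attacker's link.
+void
+cleanup(void)
+{
+  struct stat st;
+
+  if(lstat("/tmp/etc", &st) == 0 && !S_ISLNK(st.st_mode)){
+    /* race window: attacker runs rename("/tmp/etc", "/tmp/x") and
+       symlink("/etc", "/tmp/etc") right here */
+    unlink("/tmp/etc/passwd");     // now resolves to /etc/passwd
+  }
+}
+</pre>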
+ +<p>This problem could have been avoided if every user or process group +  had its own private /tmp, or if access to the shared one was +  mediated. + +<h2>V6 code examples</h2> + +<p> namei (sheet 46) is the core of the Unix naming system. namei can +  be called in several ways: NAMEI_LOOKUP (resolve a name to an inode +  and lock inode), NAMEI_CREATE (resolve a name, but lock parent +  inode), and NAMEI_DELETE (resolve a name, lock parent inode, and +  return offset in the directory).  The reason is that namei is +  complicated is that we want to atomically test if a name exist and +  remove/create it, if it does; otherwise, two concurrent processes +  could interfere with each other and directory could end up in an +  inconsistent state. + +<p>Let's trace open("a", O_RDWR), focussing on namei: +<ul> +<li>5263: we will look at creating a file in a bit. +<li>5277: call namei with NAMEI_LOOKUP +<li>4629: if path name start with "/", lookup root inode (1). +<li>4632: otherwise, use inode for current working directory. +<li>4638: consume row of "/", for example in "/////a////b" +<li>4641: if we are done with NAMEI_LOOKUP, return inode (e.g., +  namei("/")). +<li>4652: if the inode we are searching for a name isn't of type +  directory, give up. +<li>4657-4661: determine length of the current component of the +  pathname we are resolving. +<li>4663-4681: scan the directory for the component. +<li>4682-4696: the entry wasn't found. if we are the end of the +  pathname and NAMEI_CREATE is set, lock parent directory and return a +  pointer to the start of the component.  In all other case, unlock +  inode of directory, and return 0. +<li>4701: if NAMEI_DELETE is set, return locked parent inode and the +  offset of the to-be-deleted component in the directory. +<li>4707: lookup inode of the component, and go to the top of the loop. +</ul> + +<p>Now let's look at creating a file in a directory: +<ul> +<li>5264: if the last component doesn't exist, but first part of the +  pathname resolved to a directory, then dp will be 0, last will point +  to the beginning of the last component, and ip will be the locked +  parent directory. +<li>5266: create an entry for last in the directory. +<li>4772: mknod1 allocates a new named inode and adds it to an +  existing directory. +<li>4776: ialloc. skan inode block, find unused entry, and write +  it. (if lucky 1 read and 1 write.) +<li>4784: fill out the inode entry, and write it. (another write) +<li>4786: write the entry into the directory (if lucky, 1 write) +</ul> + +</ul> +Why must the parent directory be locked?  If two processes try to +create the same name in the same directory, only one should succeed +and the other one, should receive an error (file exist). + +<p>Link, unlink, chdir, mount, umount could have taken file +descriptors instead of their path argument. In fact, this would get +rid of some possible race conditions (some of which have security +implications, TOCTTOU). However, this would require that the current +working directory be remembered by the process, and UNIX didn't have +good ways of maintaining static state shared among all processes +belonging to a given user. The easiest way is to create shared state +is to place it in the kernel. +    +<p>We have one piece of code in xv6 that we haven't studied: exec. +  With all the ground work we have done this code can be easily +  understood (see sheet 54). 
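+
+<p>As a summary of the lookup path traced above, the core of namei for
+the plain NAMEI_LOOKUP case boils down to a loop like the following
+sketch.  The helper names (skipelem, dirlookup) and the cp->cwd field
+are illustrative stand-ins; the sheet-46 code does this work inline and
+adds the NAMEI_CREATE/NAMEI_DELETE cases and the locking:
+
+<pre>
+// Simplified pathname resolution: walk the path one component at a time.
+// Locking and iput() are omitted.
+struct inode*
+namei_lookup(char *path)
+{
+  struct inode *ip, *next;
+  char name[DIRSIZ];
+
+  if(*path == '/')
+    ip = iget(ROOTDEV, ROOTINO);    // absolute path: start at the root inode
+  else
+    ip = cp->cwd;                   // relative path: start at the current directory
+
+  while((path = skipelem(path, name)) != 0){
+    if(ip->type != T_DIR)           // can only descend through a directory
+      return 0;
+    if((next = dirlookup(ip, name)) == 0)
+      return 0;                     // component not found
+    ip = next;                      // descend one level and continue
+  }
+  return ip;
+}
+</pre>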
+ +</body> diff --git a/web/l-okws.txt b/web/l-okws.txt new file mode 100644 index 0000000..fa940d0 --- /dev/null +++ b/web/l-okws.txt @@ -0,0 +1,249 @@ + +Security +------------------- +I. 2 Intro Examples +II. Security Overview +III. Server Security: Offense + Defense +IV. Unix Security + POLP +V. Example: OKWS +VI. How to Build a Website + +I. Intro Examples +-------------------- +1. Apache + OpenSSL 0.9.6a (CAN 2002-0656) +  - SSL = More security! + +  unsigned int j; +  p=(unsigned char *)s->init_buf->data; +  j= *(p++); +  s->session->session_id_length=j; +  memcpy(s->session->session_id,p,j); + + - the result: an Apache worm + +2. SparkNotes.com 2000: + - New profile feature that displays "public" information about users +   but bug that made e-mail addresses "public" by default. + - New program for getting that data: +  +     http://www.sparknotes.com/getprofile.cgi?id=1343 + +II. Security Overview +---------------------- + +What Is Security? + - Protecting your system from attack. + + What's an attack? + - Stealing data + - Corrupting data + - Controlling resources + - DOS + + Why attack? + - Money + - Blackmail / extortion + - Vendetta + - intellectual curiosity + - fame + +Security is a Big topic + + - Server security -- today's focus.  There's some machine sitting on the +   Internet somewhere, with a certain interface exposed, and attackers +   want to circumvent it. +    - Why should you trust your software? + + - Client security +    - Clients are usually servers, so they have many of the same issues. +    - Slight simplification: people across the network cannot typically +      initiate connections. +    - Has a "fallible operator": +        - Spyware +        - Drive-by-Downloads + + - Client security turns out to be much harder -- GUI considerations, +   look inside the browser and the applications. + - Systems community can more easily handle server security. + - We think mainly of servers. + +III. Server Security: Offense and Defense +----------------------------------------- + - Show picture of a Web site. + + Attacks			|	Defense +---------------------------------------------------------------------------- + 1. Break into DB from net	| 1. FW it off + 2. Break into WS on telnet	| 2. FW it off + 3. Buffer overrun in Apache	| 3. Patch apache / use better lang? + 4. Buffer overrun in our code  | 4. Use better lang / isolate it + 5. SQL injection		| 5. Better escaping / don't interpret code. + 6. Data scraping.              | 6. Use a sparse UID space. + 7. PW sniffing			| 7.  ??? + 8. Fetch /etc/passwd and crack | 8. Don't expose /etc/passwd +     PW				| + 9. Root escalation from apache | 9. No setuid programs available to Apache +10. XSS                         |10. Filter JS and input HTML code. +11. Keystroke recorded on sys-  |11. Client security +    admin's desktop (planetlab) | +12. DDOS                        |12.  ??? + +Summary: + - That we want private data to be available to right people makes +   this problem hard in the first place. Internet servers are there +   for a reason. + - Security != "just encrypt your data;" this in fact can sometimes +   make the problem worse. + - Best to prevent break-ins from happening in the first place. + - If they do happen, want to limit their damage (POLP). + - Security policies are difficult to express / package up neatly. + +IV. Design According to POLP (in Unix) +--------------------------------------- + - Assume any piece of a system can be compromised, by either bad +   programming or malicious attack. 
+ - Try to limit the damage done by such a compromise (along the lines +   of the 4 attack goals). + + <Draw a picture of a server process on Unix, w/ other processes> + +What's the goal on Unix? + - Keep processes from communicating that don't have to: +    - limit FS, IPC, signals, ptrace + - Strip away unneeded privilege  +    - with respect to network, FS. + - Strip away FS access. + +How on Unix? + - setuid/setgid + - system call interposition + - chroot (away from setuid executables, /etc/passwd, /etc/ssh/..) + + <show Code snippet> + +How do you write chroot'ed programs? + - What about shared libraries? + - /etc/resolv.conf? + - Can chroot'ed programs access the FS at all? What if they need +   to write to the FS or read from the FS? + - Fd's are *capabilities*; can pass them to chroot'ed services, +   thereby opening new files on its behalf. + - Unforgeable - can only get them from the kernel via open/socket, etc. + +Unix Shortcomings (round 1) + - It's bad to run as root! + - Yet, need root for: +    - chroot +    - setuid/setgid to a lower-privileged user +    - create a new user ID + - Still no guarantee that we've cut off all channels +    - 200 syscalls! +    - Default is to give most/all privileges. + - Can "break out" of chroot jails? + - Can still exploit race conditions in the kernel to escalate privileges. + +Sidebar + - setuid / setuid misunderstanding + - root / root misunderstanding + - effective vs. real vs. saved set-user-ID + +V. OKWS +------- +- Taking these principles as far as possible. +- C.f. Figure 1 From the paper.. +- Discussion of which privileges are in which processes + +<Table of how to hack, what you get, etc...> + +- Technical details: how to launch a new service +- Within the launcher (running as root): + +<on board:> + +    // receive FDs from logger, pubd, demux +    fork (); +    chroot ("/var/okws/run"); +    chdir ("/coredumps/51001"); +    setgid (51001); +    setuid (51001); +    exec ("login", fds ... ); + +- Note no chroot -- why not? +- Once launched, how does a service get new connections? +- Note the goal - minimum tampering with each other in the  +  case of a compromise. + +Shortcoming of Unix (2) +- A lot of plumbing involved with this system.  FDs flying everywhere. +- Isolation still not fine enough.  If a service gets taken over, +  can compromise all users of that service. + +VI. Reflections on Building Websites +--------------------------------- +- OKWS interesting "experiment" +- Need for speed; also, good gzip support. +- If you need compiled code, it's a good way to go. +- RPC-like system a must for backend communication +- Connection-pooling for free + +Biggest difficulties: +- Finding good C++ programmers. +- Compile times. +- The DB is still always the problem. + +Hard to Find good Alternatives +- Python / Perl - you might spend a lot of time writing C code /  +  integrating with lower level languages. +- Have to worry about DB pooling. +- Java -- must viable, and is getting better.  Scary you can't peer +  inside. +- .Net / C#-based system might be the way to go. + + +======================================================================= + +Extra Material: + +Capabilities (From the Eros Paper in SOSP 1999) + + - "Unforgeable pair made up of an object ID and a set of authorized  +   operations (an interface) on that object." +   - c.f. Dennis and van Horn. "Programming semantics for multiprogrammed +     computations," Communications of the ACM 9(3):143-154, Mar 1966. 
+ - Thus: +      <object ID, set of authorized OPs on that object> + - Examples: +      "Process X can write to file at inode Y" +      "Process P can read from file at inode Z" + - Familiar example: Unix file descriptors +  + - Why are they secure? +    - Capabilities are "unforgeable" +    - Processes can get them only through authorized interfaces +    - Capabilities are only given to processes authorized to hold them + + - How do you get them? +    - From the kernel (e.g., open) +    - From other applications (e.g., FD passing) + + - How do you use them? +    - read (fd), write(fd). + + - How do you revoke them once granted? +   - In Unix, you do not. +   - In some systems, a central authority ("reference monitor") can revoke. +  + - How do you store them persistently? +    - Can have circular dependencies (unlike an FS). +    - What happens when the system starts up?  +    - Revert to checkpointed state. +    - Often capability systems chose a single-level store. +  + - Capability systems, a historical prospective: +    - KeyKOS, Eros, Cyotos (UP research) +        - Never saw any applications +    - IBM Systems (System 38, later AS/400, later 'i Series') +        - Commercially viable + - Problems: +    - All bets are off when a capability is sent to the wrong place. +    - Firewall analogy? diff --git a/web/l-plan9.html b/web/l-plan9.html new file mode 100644 index 0000000..a3af3d5 --- /dev/null +++ b/web/l-plan9.html @@ -0,0 +1,249 @@ +<html> +<head> +<title>Plan 9</title> +</head> +<body> + +<h1>Plan 9</h1> + +<p>Required reading: Plan 9 from Bell Labs</p> + +<h2>Background</h2> + +<p>Had moved away from the ``one computing system'' model of +Multics and Unix.</p> + +<p>Many computers (`workstations'), self-maintained, not a coherent whole.</p> + +<p>Pike and Thompson had been batting around ideas about a system glued together  +by a single protocol as early as 1984. +Various small experiments involving individual pieces (file server, OS, computer) +tried throughout 1980s.</p> + +<p>Ordered the hardware for the ``real thing'' in beginning of 1989, +built up WORM file server, kernel, throughout that year.</p> + +<p>Some time in early fall 1989, Pike and Thompson were +trying to figure out a way to fit the window system in. +On way home from dinner, both independently realized that +needed to be able to mount a user-space file descriptor, +not just a network address.</p> + +<p>Around Thanksgiving 1989, spent a few days rethinking the whole +thing, added bind, new mount, flush, and spent a weekend  +making everything work again.  The protocol at that point was +essentially identical to the 9P in the paper.</p> + +<p>In May 1990, tried to use system as self-hosting. +File server kept breaking, had to keep rewriting window system. +Dozen or so users by then, mostly using terminal windows to +connect to Unix.</p> + +<p>Paper written and submitted to UKUUG in July 1990.</p> + +<p>Because it was an entirely new system, could take the +time to fix problems as they arose, <i>in the right place</i>.</p> + + +<h2>Design Principles</h2> + +<p>Three design principles:</p> + +<p> +1. Everything is a file.<br> +2. There is a standard protocol for accessing files.<br> +3. Private, malleable name spaces (bind, mount). 
+</p> + +<h3>Everything is a file.</h3> + +<p>Everything is a file (more everything than Unix: networks, graphics).</p> + +<pre> +% ls -l /net +% lp /dev/screen +% cat /mnt/wsys/1/text +</pre> + +<h3>Standard protocol for accessing files</h3> + +<p>9P is the only protocol the kernel knows: other protocols +(NFS, disk file systems, etc.) are provided by user-level translators.</p> + +<p>Only one protocol, so easy to write filters and other +converters.  <i>Iostats</i> puts itself between the kernel +and a command.</p> + +<pre> +% iostats -xvdfdf /bin/ls +</pre> + +<h3>Private, malleable name spaces</h3> + +<p>Each process has its own private name space that it +can customize at will.   +(Full disclosure: can arrange groups of +processes to run in a shared name space.  Otherwise how do +you implement <i>mount</i> and <i>bind</i>?)</p> + +<p><i>Iostats</i> remounts the root of the name space +with its own filter service.</p> + +<p>The window system mounts a file system that it serves +on <tt>/mnt/wsys</tt>.</p> + +<p>The network is actually a kernel device (no 9P involved) +but it still serves a file interface that other programs +use to access the network. +Easy to move out to user space (or replace) if necessary: +<i>import</i> network from another machine.</p> + +<h3>Implications</h3> + +<p>Everything is a file + can share files => can share everything.</p> + +<p>Per-process name spaces help move toward ``each process has its own +private machine.''</p> + +<p>One protocol: easy to build custom filters to add functionality +(e.g., reestablishing broken network connections). + +<h3>File representation for networks, graphics, etc.</h3> + +<p>Unix sockets are file descriptors, but you can't use the +usual file operations on them.  Also far too much detail that +the user doesn't care about.</p> + +<p>In Plan 9:  +<pre>dial("tcp!plan9.bell-labs.com!http"); +</pre> +(Protocol-independent!)</p> + +<p>Dial more or less does:<br> +write to /net/cs: tcp!plan9.bell-labs.com!http +read back: /net/tcp/clone 204.178.31.2!80 +write to /net/tcp/clone: connect 204.178.31.2!80 +read connection number: 4 +open /net/tcp/4/data +</p> + +<p>Details don't really matter.  Two important points: +protocol-independent, and ordinary file operations +(open, read, write).</p> + +<p>Networks can be shared just like any other files.</p> + +<p>Similar story for graphics, other resources.</p> + +<h2>Conventions</h2> + +<p>Per-process name spaces mean that even full path names are ambiguous +(<tt>/bin/cat</tt> means different things on different machines, +or even for different users).</p> + +<p><i>Convention</i> binds everything together.   +On a 386, <tt>bind /386/bin /bin</tt>. + +<p>In Plan 9, always know where the resource <i>should</i> be +(e.g., <tt>/net</tt>, <tt>/dev</tt>, <tt>/proc</tt>, etc.), +but not which one is there.</p> + +<p>Can break conventions: on a 386, <tt>bind /alpha/bin /bin</tt>, just won't +have usable binaries in <tt>/bin</tt> anymore.</p> + +<p>Object-oriented in the sense of having objects (files) that all +present the same interface and can be substituted for one another +to arrange the system in different ways.</p> + +<p>Very little ``type-checking'': <tt>bind /net /proc; ps</tt>. +Great benefit (generality) but must be careful (no safety nets).</p> + + +<h2>Other Contributions</h2> + +<h3>Portability</h3> + +<p>Plan 9 still is the most portable operating system. 
+Not much machine-dependent code, no fancy features +tied to one machine's MMU, multiprocessor from the start (1989).</p> + +<p>Many other systems are still struggling with converting to SMPs.</p> + +<p>Has run on MIPS, Motorola 68000, Nextstation, Sparc, x86, PowerPC, Alpha, others.</p> + +<p>All the world is not an x86.</p> + +<h3>Alef</h3> + +<p>New programming language: convenient, but difficult to maintain. +Retired when author (Winterbottom) stopped working on Plan 9.</p> + +<p>Good ideas transferred to C library plus conventions.</p> + +<p>All the world is not C.</p> + +<h3>UTF-8</h3> + +<p>Thompson invented UTF-8.  Pike and Thompson +converted Plan 9 to use it over the first weekend of September 1992, +in time for X/Open to choose it as the Unicode standard byte format +at a meeting the next week.</p> + +<p>UTF-8 is now the standard character encoding for Unicode on +all systems and interoperating between systems.</p> + +<h3>Simple, easy to modify base for experiments</h3> + +<p>Whole system source code is available, simple, easy to  +understand and change. +There's a reason it only took a couple days to convert to UTF-8.</p> + +<pre> +  49343  file server kernel + + 181611  main kernel +  78521    ipaq port (small kernel) +  20027      TCP/IP stack +  15365      ipaq-specific code +  43129      portable code + +1326778  total lines of source code +</pre> + +<h3>Dump file system</h3> + +<p>Snapshot idea might well have been ``in the air'' at the time. +(<tt>OldFiles</tt> in AFS appears to be independently derived, +use of WORM media was common research topic.)</p> + +<h3>Generalized Fork</h3> + +<p>Picked up by other systems: FreeBSD, Linux.</p> + +<h3>Authentication</h3> + +<p>No global super-user. +Newer, more Plan 9-like authentication described in later paper.</p> + +<h3>New Compilers</h3> + +<p>Much faster than gcc, simpler.</p> + +<p>8s to build acme for Linux using gcc; 1s to build acme for Plan 9 using 8c (but running on Linux)</p> + +<h3>IL Protocol</h3> + +<p>Now retired.   +For better or worse, TCP has all the installed base. +IL didn't work very well on asymmetric or high-latency links +(e.g., cable modems).</p> + +<h2>Idea propagation</h2> + +<p>Many ideas have propagated out to varying degrees.</p> + +<p>Linux even has bind and user-level file servers now (FUSE), +but still not per-process name spaces.</p> + + +</body> diff --git a/web/l-scalablecoord.html b/web/l-scalablecoord.html new file mode 100644 index 0000000..da72c37 --- /dev/null +++ b/web/l-scalablecoord.html @@ -0,0 +1,202 @@ +<title>Scalable coordination</title> +<html> +<head> +</head> +<body> + +<h1>Scalable coordination</h1> + +<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable +  Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991. + +<h2>Overview</h2> + +<p>Shared memory machines are bunch of CPUs, sharing physical memory. +Typically each processor also mantains a cache (for performance), +which introduces the problem of keep caches coherent.  If processor 1 +writes a memory location whose value processor 2 has cached, then +processor 2's cache must be updated in some way.  How? +<ul> + +<li>Bus-based schemes.  Any CPU can access "dance with" any memory +equally ("dance hall arch"). Use "Snoopy" protocols: Each CPU's cache +listens to the memory bus. With write-through architecture, invalidate +copy when see a write. Or can have "ownership" scheme with write-back +cache (E.g., Pentium cache have MESI bits---modified, exclusive, +shared, invalid). 
If E bit set, CPU caches exclusively and can do +write back. But bus places limits on scalability. + +<li>More scalability w. NUMA schemes (non-uniform memory access). Each +CPU comes with fast "close" memory. Slower to access memory that is +stored with another processor. Use a directory to keep track of who is +caching what.  For example, processor 0 is responsible for all memory +starting with address "000", processor 1 is responsible for all memory +starting with "001", etc. + +<li>COMA - cache-only memory architecture.  Each CPU has local RAM, +treated as cache. Cache lines migrate around to different nodes based +on access pattern. Data only lives in cache, no permanent memory +location. (These machines aren't too popular any more.) + +</ul> + + +<h2>Scalable locks</h2> + +<p>This paper is about cost and scalability of locking; what if you +have 10 CPUs waiting for the same lock?  For example, what would +happen if xv6 runs on an SMP with many processors? + +<p>What's the cost of a simple spinning acquire/release?  Algorithm 1 +*without* the delays, which is like xv6's implementation of acquire +and release (xv6 uses XCHG instead of test_and_set): +<pre> +  each of the 10 CPUs gets the lock in turn +  meanwhile, remaining CPUs in XCHG on lock +  lock must be X in cache to run XCHG +    otherwise all might read, then all might write +  so bus is busy all the time with XCHGs! +  can we avoid constant XCHGs while lock is held? +</pre> + +<p>test-and-test-and-set +<pre> +  only run expensive TSL if not locked +  spin on ordinary load instruction, so cache line is S +  acquire(l) +    while(1){ +      while(l->locked != 0) { } +      if(TSL(&l->locked) == 0) +        return; +    } +</pre> + +<p>suppose 10 CPUs are waiting, let's count cost in total bus +  transactions +<pre> +  CPU1 gets lock in one cycle +    sets lock's cache line to I in other CPUs +  9 CPUs each use bus once in XCHG +    then everyone has the line S, so they spin locally +  CPU1 release the lock +  CPU2 gets the lock in one cycle +  8 CPUs each use bus once... +  So 10 + 9 + 8 + ... = 50 transactions, O(n^2) in # of CPUs! +  Look at "test-and-test-and-set" in Figure 6 +</pre> +<p>  Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time? + +<p>What is the point of the exponential backoff in Algorithm 1? +<pre> +  Does it buy us O(n) time for n acquires? +  Is there anything wrong with it? +  may not be fair +  exponential backoff may increase delay after release +</pre> + +<p>What's the point of the ticket locks, Algorithm 2? +<pre> +  one interlocked instruction to get my ticket number +  then I spin on now_serving with ordinary load +  release() just increments now_serving +</pre> + +<p>why is that good? +<pre> +  + fair +  + no exponential backoff overshoot +  + no spinning on  +</pre> + +<p>but what's the cost, in bus transactions? +<pre> +  while lock is held, now_serving is S in all caches +  release makes it I in all caches +  then each waiters uses a bus transaction to get new value +  so still O(n^2) +</pre> + +<p>What's the point of the array-based queuing locks, Algorithm 3? +<pre> +    a lock has an array of "slots" +    waiter allocates a slot, spins on that slot +    release wakes up just next slot +  so O(n) bus transactions to get through n waiters: good! +  anderson lines in Figure 4 and 6 are flat-ish +    they only go up because lock data structures protected by simpler lock +  but O(n) space *per lock*! 
+</pre> + +<p>Algorithm 5 (MCS), the new algorithm of the paper, uses +compare_and_swap: +<pre> +int compare_and_swap(addr, v1, v2) { +  int ret = 0; +  // stop all memory activity and ignore interrupts +  if (*addr == v1) { +    *addr = v2; +    ret = 1; +  } +  // resume other memory activity and take interrupts +  return ret; +} +</pre> + +<p>What's the point of the MCS lock, Algorithm 5? +<pre> +  constant space per lock, rather than O(n) +  one "qnode" per thread, used for whatever lock it's waiting for +  lock holder's qnode points to start of list +  lock variable points to end of list +  acquire adds your qnode to end of list +    then you spin on your own qnode +  release wakes up next qnode +</pre> + +<h2>Wait-free or non-blocking data structures</h2> + +<p>The previous implementations all block threads when there is +  contention for a lock.  Other atomic hardware operations allows one +  to build implementation wait-free data structures.  For example, one +  can make an insert of an element in a shared list that don't block a +  thread.  Such versions are called wait free.  + +<p>A linked list with locks is as follows: +<pre> +Lock list_lock; + +insert(int x) { +  element *n = new Element; +  n->x = x; + +  acquire(&list_lock); +  n->next = list; +  list = n; +  release(&list_lock); +} +</pre> + +<p>A wait-free implementation is as follows: +<pre> +insert (int x) { +  element *n = new Element; +  n->x = x; +  do { +     n->next = list; +  } while (compare_and_swap (&list, n->next, n) == 0); +} +</pre> +<p>How many bus transactions with 10 CPUs inserting one element in the +list? Could you do better? + +<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This + paper by Fraser and Harris</a> compares lock-based implementations + versus corresponding non-blocking implementations of a number of data + structures. + +<p>It is not possible to make every operation wait-free, and there are +  times we will need an implementation of acquire and release. +  research on non-blocking data structures is active; the last word +  isn't said on this topic yet. + +</body> diff --git a/web/l-schedule.html b/web/l-schedule.html new file mode 100644 index 0000000..d87d7da --- /dev/null +++ b/web/l-schedule.html @@ -0,0 +1,340 @@ +<title>Scheduling</title> +<html> +<head> +</head> +<body> + +<h1>Scheduling</h1> + +<p>Required reading: Eliminating receive livelock + +<p>Notes based on prof. Morris's lecture on scheduling (6.824, fall'02). + +<h2>Overview</h2> + +<ul> + +<li>What is scheduling?  The OS policies and mechanisms to allocates +resources to entities.  A good scheduling policy ensures that the most +important entitity gets the resources it needs.  This topic was +popular in the days of time sharing, when there was a shortage of +resources.  It seemed irrelevant in era of PCs and workstations, when +resources were plenty. Now the topic is back from the dead to handle +massive Internet servers with paying customers.  The Internet exposes +web sites to international abuse and overload, which can lead to +resource shortages.  Furthermore, some customers are more important +than others (e.g., the ones that buy a lot). + +<li>Key problems: +<ul> +<li>Gap between desired policy and available mechanism.  The desired +policies often include elements that not implementable with the +mechanisms available to the operation system.  Furthermore, often +there are many conflicting goals (low latency, high throughput, and +fairness), and the scheduler must make a trade-off between the goals. 
+ +<li>Interaction between different schedulers.  One have to take a +systems view.  Just optimizing the CPU scheduler may do little to for +the overall desired policy. +</ul> + +<li>Resources you might want to schedule: CPU time, physical memory, +disk and network I/O, and I/O bus bandwidth. + +<li>Entities that you might want to give resources to: users, +processes, threads, web requests, or MIT accounts. + +<li>Many polices for resource to entity allocation are possible: +strict priority, divide equally, shortest job first, minimum guarantee +combined with admission control. + +<li>General plan for scheduling mechanisms +<ol> +<li> Understand where scheduling is occuring. +<li> Expose scheduling decisions, allow control. +<li> Account for resource consumption, to allow intelligent control. +</ol> + +<li>Simple example from 6.828 kernel. The policy for scheduling +environments is to give each one equal CPU time. The mechanism used to +implement this policy is a clock interrupt every 10 msec and then +selecting the next environment in a round-robin fashion.   + +<p>But this only works if processes are compute-bound.  What if a +process gives up some of its 10 ms to wait for input?  Do we have to +keep track of that and give it back?  + +<p>How long should the quantum be?  is 10 msec the right answer? +Shorter quantum will lead to better interactive performance, but +lowers overall system throughput because we will reschedule more, +which has overhead. + +<p>What if the environment computes for 1 msec and sends an IPC to +the file server environment?  Shouldn't the file server get more CPU +time because it operates on behalf of all other functions? + +<p>Potential improvements for the 6.828 kernel: track "recent" CPU use +(e.g., over the last second) and always run environment with least +recent CPU use.  (Still, if you sleep long enough you lose.) Other +solution: directed yield; specify on the yield to which environment +you are donating the remainder of the quantuam (e.g., to the file +server so that it can compute on the environment's behalf). + +<li>Pitfall: Priority Inversion +<pre> +  Assume policy is strict priority. +  Thread T1: low priority. +  Thread T2: medium priority. +  Thread T3: high priority. +  T1: acquire(l) +  context switch to T3 +  T3: acquire(l)... must wait for T1 to release(l)... +  context switch to T2 +  T2 computes for a while +  T3 is indefinitely delayed despite high priority. +  Can solve if T3 lends its priority to holder of lock it is waiting for. +    So T1 runs, not T2. +  [this is really a multiple scheduler problem.] +  [since locks schedule access to locked resource.] +</pre> + +<li>Pitfall: Efficiency.  Efficiency often conflicts with fairness (or +any other policy).  Long time quantum for efficiency in CPU scheduling +versus low delay.  Shortest seek versus FIFO disk scheduling. +Contiguous read-ahead vs data needed now.  For example, scheduler +swaps out my idle emacs to let gcc run faster with more phys mem. +What happens when I type a key?  These don't fit well into a "who gets +to go next" scheduler framework.  Inefficient scheduling may make +<i>everybody</i> slower, including high priority users. + +<li>Pitfall: Multiple Interacting Schedulers. Suppose you want your +emacs to have priority over everything else.  Give it high CPU +priority.  Does that mean nothing else will run if emacs wants to run? +Disk scheduler might not know to favor emacs's disk I/Os.  Typical +UNIX disk scheduler favors disk efficiency, not process prio.  
Suppose +emacs needs more memory.  Other processes have dirty pages; emacs must +wait.  Does disk scheduler know these other processes' writes are high +prio? + +<li>Pitfall: Server Processes.  Suppose emacs uses X windows to +display.  The X server must serve requests from many clients.  Does it +know that emacs' requests should be given priority?  Does the OS know +to raise X's priority when it is serving emacs?  Similarly for DNS, +and NFS.  Does the network know to give emacs' NFS requests priority? + +</ul> + +<p>In short, scheduling is a system problem. There are many +schedulers; they interact.  The CPU scheduler is usually the easy +part.  The hardest part is system structure.  For example, the +<i>existence</i> of interrupts is bad for scheduling.  Conflicting +goals may limit effectiveness. + +<h2>Case study: modern UNIX</h2> + +<p>Goals:  +<ul> +<li>Simplicity (e.g. avoid complex locking regimes).   +<li>Quick response to device interrupts.   +<li> Favor interactive response.   +</ul>  + +<p>UNIX has a number of execution environments.  We care about +scheduling transitions among them.  Some transitions aren't possible, +some can't be be controlled. The execution environments are: + +<ul> +<li>Process, user half +<li>Process, kernel half +<li>Soft interrupts: timer, network +<li>Device interrupts +</ul> + +<p>The rules are: +<ul> +<li>User is pre-emptible. +<li>Kernel half and software interrupts are not pre-emptible. +<li>Device handlers may not make blocking calls (e.g., sleep) +<li>Effective priorities: intr > soft intr > kernel half > user +</ul>   + +</ul> + +<p>Rules are implemented as follows: + +<ul> + +<li>UNIX: Process User Half.  Runs in process address space, on +per-process stack.  Interruptible.  Pre-emptible: interrupt may cause +context switch.  We don't trust user processes to yield CPU. +Voluntarily enters kernel half via system calls and faults. + +<li>UNIX: Process Kernel Half. Runs in kernel address space, on +per-process kernel stack.  Executes system calls and faults for its +process.  Interruptible (but can defer interrupts in critical +sections).  Not pre-emptible.  Only yields voluntarily, when waiting +for an event. E.g. disk I/O done.  This simplifies concurrency +control; locks often not required.  No user process runs if any kernel +half wants to run.  Many process' kernel halfs may be sleeping in the +kernel. + +<li>UNIX: Device Interrupts. Hardware asks CPU for an interrupt to ask +for attention.  Disk read/write completed, or network packet received. +Runs in kernel space, on special interrupt stack.  Interrupt routine +cannot block; must return.  Interrupts are interruptible.  They nest +on the one interrupt stack.  Interrupts are not pre-emptible, and +cannot really yield.  The real-time clock is a device and interrupts +every 10ms (or whatever).  Process scheduling decisions can be made +when interrupt returns (e.g. wake up the process waiting for this +event).  You want interrupt processing to be fast, since it has +priority.  Don't do any more work than you have to.  You're blocking +processes and other interrupts.  Typically, an interrupt does the +minimal work necessary to keep the device happy, and then call wakeup +on a thread. + +<li>UNIX: Soft Interrupts.  (Didn't exist in xv6) Used when device +handling is expensive.  But no obvious process context in which to +run.  Examples include IP forwarding, TCP input processing.  Runs in +kernel space, on interrupt stack.  Interruptable.  Not pre-emptable, +can't really yield.  
Triggered by hardware interrupt.  Called when +outermost hardware interrupt returns.  Periodic scheduling decisions +are made in timer s/w interrupt.  Scheduled by hardware timer +interrupt (i.e., if current process has run long enough, switch). +</ul> + +<p>Is this good software structure?  Let's talk about receive +livelock. + +<h2>Paper discussion</h2> + +<ul> + +<li>What is application that the paper is addressing: IP forwarding. +What functionality does a network interface offer to driver? +<ul> +<li> Read packets +<li> Poke hardware to send packets +<li> Interrupts when packet received/transmit complete +<li> Buffer many input packets +</ul> + +<li>What devices in the 6.828 kernel are interrupt driven?  Which one +are polling?  Is this ideal? + +<li>Explain Figure 6-1.  Why does it go up?  What determines how high +the peak is?  Why does it go down?  What determines how fast it goes +does? Answer: +<pre> +(fraction of packets discarded)(work invested in discarded packets) +           ------------------------------------------- +              (total work CPU is capable of) +</pre> + +<li>Suppose I wanted to test an NFS server for livelock. +<pre> +  Run client with this loop: +    while(1){ +      send NFS READ RPC; +      wait for response; +    } +</pre> +What would I see?  Is the NFS server probably subject to livelock? +(No--offered load subject to feedback). + +<li>What other problems are we trying to address? +<ul>  +<li>Increased latency for packet delivery and forwarding (e.g., start +disk head moving  when first NFS read request comes) +<li>Transmit starvation  +<li>User-level CPU starvation +</ul> + +<li>Why not tell the O/S scheduler to give interrupts lower priority? +Non-preemptible. +Could you fix this by making interrupts faster? (Maybe, if coupled +with some limit on input rate.)   + +<li>Why not completely process each packet in the interrupt handler? +(I.e. forward it?) Other parts of kernel don't expect to run at high +interrupt-level (e.g., some packet processing code might invoke a function +that sleeps). Still might want an output queue + +<li>What about using polling instead of interrupts?  Solves overload +problem, but killer for latency. + +<li>What's the paper's solution? +<ul> +<li>No IP input queue. +<li>Input processing and device input polling in kernel thread. +<li>Device receive interrupt just wakes up thread. And leaves +interrupts *disabled* for that device. +<li>Thread does all input processing, then re-enables interrupts. +</ul> +<p>Why does this work?  What happens when packets arrive too fast? +What happens when packets arrive slowly? + +<li>Explain Figure 6-3. +<ul> +<li>Why does "Polling (no quota)" work badly? (Input still starves +xmit complete processing.) +<li>Why does it immediately fall to zero, rather than gradually decreasing? +(xmit complete processing must be very cheap compared to input.) +</ul> + +<li>Explain Figure 6-4. +<ul> + +<li>Why does "Polling, no feedback" behave badly? There's a queue in +front of screend.  We can still give 100% to input thread, 0% to +screend. + +<li>Why does "Polling w/ feedback" behave well? Input thread yields +when queue to screend fills. + +<li>What if screend hangs, what about other consumers of packets? +(e.g., can you ssh to machine to fix screend?)  Fortunately screend +typically is only application. Also, re-enable input after timeout. + +</ul> + +<li>Why are the two solutions different? +<ol> +<li> Polling thread <i>with quotas</i>. +<li> Feedback from full queue. 
+</ol> +(I believe they should have used #2 for both.) + +<li>If we apply the proposed fixes, does the phenomemon totally go + away? (e.g. for web server, waits for disk, &c.)   +<ul> +<li>Can the net device throw away packets without slowing down host? +<li>Problem: We want to drop packets for applications with big queues. +But requires work to determine which application a packet belongs to +Solution: NI-LRP (have network interface sort packets) +</ul> + +<li>What about latency question?  (Look at figure 14 p. 243.) +<ul> +<li>1st packet looks like an improvement over non-polling. But 2nd +packet transmitted later with poling.  Why? (No new packets added to +xmit buffer until xmit interrupt)  +<li>Why?  In traditional BSD, to +amortize cost of poking device. Maybe better to poke a second time +anyway. +</ul> + +<li>What if processing has more complex structure? +<ul> +<li>Chain of processing stages with queues? Does feedback work? +    What happens when a late stage is slow? +<li>Split at some point, multiple parallel paths? No so great; one +    slow path blocks all paths. +</ul> + +<li>Can we formulate any general principles from paper? +<ul> +<li>Don't spend time on new work before completing existing work. +<li>Or give new work lower priority than partially-completed work. +</ul> + +</ul> diff --git a/web/l-threads.html b/web/l-threads.html new file mode 100644 index 0000000..8587abb --- /dev/null +++ b/web/l-threads.html @@ -0,0 +1,316 @@ +<title>L8</title> +<html> +<head> +</head> +<body> + +<h1>Threads, processes, and context switching</h1> + +<p>Required reading: proc.c (focus on scheduler() and sched()), +setjmp.S, and sys_fork (in sysproc.c) + +<h2>Overview</h2> + + +<p>Big picture: more programs than processors.  How to share the +limited number of processors among the programs? + +<p>Observation: most programs don't need the processor continuously, +because they frequently have to wait for input (from user, disk, +network, etc.) + +<p>Idea: when one program must wait, it releases the processor, and +gives it to another program. + +<p>Mechanism: thread of computation, an active active computation.  A +thread is an abstraction that contains the minimal state that is +necessary to stop an active and an resume it at some point later. +What that state is depends on the processor.  On x86, it is the +processor registers (see setjmp.S). + +<p>Address spaces and threads: address spaces and threads are in +principle independent concepts.  One can switch from one thread to +another thread in the same address space, or one can switch from one +thread to another thread in another address space.  Example: in xv6, +one switches address spaces by switching segmentation registers (see +setupsegs).  Does xv6 ever switch from one thread to another in the +same address space? (Answer: yes, v6 switches, for example, from the +scheduler, proc[0], to the kernel part of init, proc[1].)  In the JOS +kernel we switch from the kernel thread to a user thread, but we don't +switch kernel space necessarily. + +<p>Process: one address space plus one or more threads of computation. +In xv6 all <i>user</i> programs contain one thread of computation and +one address space, and the concepts of address space and threads of +computation are not separated but bundled together in the concept of a +process.  
When switching from the kernel program (which has multiple
+threads) to a user program, xv6 switches threads (switching from a
+kernel stack to a user stack) and address spaces (the hardware uses
+the kernel segment registers and the user segment registers).
+
+<p>xv6 supports the following operations on processes:
+<ul>
+<li>fork: create a new process, which is a copy of the parent.
+<li>exec: execute a program
+<li>exit: terminate a process
+<li>wait: wait for a process to terminate
+<li>kill: kill a process
+<li>sbrk: grow the address space of a process.
+</ul>
+This interface doesn't separate threads and address spaces. For
+example, with this interface one cannot create additional threads in
+the same address space.  Modern Unixes provide additional primitives
+(called pthreads, POSIX threads) to create additional threads in a
+process and coordinate their activities.
+
+<p>Scheduling.  The thread manager needs a method for deciding which
+thread to run if multiple threads are runnable.  The xv6 policy is to
+run the processes round robin. Why round robin?  What other methods
+can you imagine?
+
+<p>Preemptive scheduling.  To force a thread to release the processor
+periodically (in case the thread never calls sleep), a thread manager
+can use preemptive scheduling.  The thread manager uses the clock chip
+to generate a hardware interrupt periodically, which causes
+control to transfer to the thread manager, which then can decide to
+run another thread (e.g., see trap.c).
+
+<h2>xv6 code examples</h2>
+
+<p>Thread switching is implemented in xv6 using setjmp and longjmp,
+which take a jmpbuf as an argument.  setjmp saves its context in a
+jmpbuf for later use by longjmp.  longjmp restores the context saved
+by the last setjmp.  It then causes execution to continue as if the
+call of setjmp has just returned 1.
+<ul>
+<li>setjmp saves: ebx, ecx, edx, esi, edi, esp, ebp, and eip.
+<li>longjmp restores them, and puts 1 in eax!
+</ul>
+
+<p> Example of thread switching: proc[0] switches to scheduler:
+<ul>
+<li>1359: proc[0] calls iget, which calls sleep, which calls sched.
+<li>2261: The stack before the call to setjmp in sched is: +<pre> +CPU 0: +eax: 0x10a144   1089860 +ecx: 0x6c65746e 1818588270 +edx: 0x0        0 +ebx: 0x10a0e0   1089760 +esp: 0x210ea8   2166440 +ebp: 0x210ebc   2166460 +esi: 0x107f20   1081120 +edi: 0x107740   1079104 +eip: 0x1023c9   +eflags 0x12       +cs:  0x8        +ss:  0x10       +ds:  0x10       +es:  0x10       +fs:  0x10       +gs:  0x10       +   00210ea8 [00210ea8]  10111e +   00210eac [00210eac]  210ebc +   00210eb0 [00210eb0]  10239e +   00210eb4 [00210eb4]  0001 +   00210eb8 [00210eb8]  10a0e0 +   00210ebc [00210ebc]  210edc +   00210ec0 [00210ec0]  1024ce +   00210ec4 [00210ec4]  1010101 +   00210ec8 [00210ec8]  1010101 +   00210ecc [00210ecc]  1010101 +   00210ed0 [00210ed0]  107740 +   00210ed4 [00210ed4]  0001 +   00210ed8 [00210ed8]  10cd74 +   00210edc [00210edc]  210f1c +   00210ee0 [00210ee0]  100bbc +   00210ee4 [00210ee4]  107740 +</pre> +<li>2517: stack at beginning of setjmp: +<pre> +CPU 0: +eax: 0x10a144   1089860 +ecx: 0x6c65746e 1818588270 +edx: 0x0        0 +ebx: 0x10a0e0   1089760 +esp: 0x210ea0   2166432 +ebp: 0x210ebc   2166460 +esi: 0x107f20   1081120 +edi: 0x107740   1079104 +eip: 0x102848   +eflags 0x12       +cs:  0x8        +ss:  0x10       +ds:  0x10       +es:  0x10       +fs:  0x10       +gs:  0x10       +   00210ea0 [00210ea0]  1023cf   <--- return address (sched) +   00210ea4 [00210ea4]  10a144 +   00210ea8 [00210ea8]  10111e +   00210eac [00210eac]  210ebc +   00210eb0 [00210eb0]  10239e +   00210eb4 [00210eb4]  0001 +   00210eb8 [00210eb8]  10a0e0 +   00210ebc [00210ebc]  210edc +   00210ec0 [00210ec0]  1024ce +   00210ec4 [00210ec4]  1010101 +   00210ec8 [00210ec8]  1010101 +   00210ecc [00210ecc]  1010101 +   00210ed0 [00210ed0]  107740 +   00210ed4 [00210ed4]  0001 +   00210ed8 [00210ed8]  10cd74 +   00210edc [00210edc]  210f1c +</pre> +<li>2519: What is saved in jmpbuf of proc[0]? +<li>2529: return 0! +<li>2534: What is in jmpbuf of cpu 0?  The stack is as follows: +<pre> +CPU 0: +eax: 0x0        0 +ecx: 0x6c65746e 1818588270 +edx: 0x108aa4   1084068 +ebx: 0x10a0e0   1089760 +esp: 0x210ea0   2166432 +ebp: 0x210ebc   2166460 +esi: 0x107f20   1081120 +edi: 0x107740   1079104 +eip: 0x10286e   +eflags 0x46       +cs:  0x8        +ss:  0x10       +ds:  0x10       +es:  0x10       +fs:  0x10       +gs:  0x10       +   00210ea0 [00210ea0]  1023fe +   00210ea4 [00210ea4]  108aa4 +   00210ea8 [00210ea8]  10111e +   00210eac [00210eac]  210ebc +   00210eb0 [00210eb0]  10239e +   00210eb4 [00210eb4]  0001 +   00210eb8 [00210eb8]  10a0e0 +   00210ebc [00210ebc]  210edc +   00210ec0 [00210ec0]  1024ce +   00210ec4 [00210ec4]  1010101 +   00210ec8 [00210ec8]  1010101 +   00210ecc [00210ecc]  1010101 +   00210ed0 [00210ed0]  107740 +   00210ed4 [00210ed4]  0001 +   00210ed8 [00210ed8]  10cd74 +   00210edc [00210edc]  210f1c +</pre> +<li>2547: return 1! 
stack looks as follows: +<pre> +CPU 0: +eax: 0x1        1 +ecx: 0x108aa0   1084064 +edx: 0x108aa4   1084068 +ebx: 0x10074    65652 +esp: 0x108d40   1084736 +ebp: 0x108d5c   1084764 +esi: 0x10074    65652 +edi: 0xffde     65502 +eip: 0x102892   +eflags 0x6        +cs:  0x8        +ss:  0x10       +ds:  0x10       +es:  0x10       +fs:  0x10       +gs:  0x10       +   00108d40 [00108d40]  10231c +   00108d44 [00108d44]  10a144 +   00108d48 [00108d48]  0010 +   00108d4c [00108d4c]  0021 +   00108d50 [00108d50]  0000 +   00108d54 [00108d54]  0000 +   00108d58 [00108d58]  10a0e0 +   00108d5c [00108d5c]  0000 +   00108d60 [00108d60]  0001 +   00108d64 [00108d64]  0000 +   00108d68 [00108d68]  0000 +   00108d6c [00108d6c]  0000 +   00108d70 [00108d70]  0000 +   00108d74 [00108d74]  0000 +   00108d78 [00108d78]  0000 +   00108d7c [00108d7c]  0000 +</pre> +<li>2548: where will longjmp return? (answer: 10231c, in scheduler) +<li>2233:Scheduler on each processor selects in a round-robin fashion the +  first runnable process.  Which process will that be? (If we are +  running with one processor.)  (Ans: proc[0].) +<li>2229: what will be saved in cpu's jmpbuf? +<li>What is in proc[0]'s jmpbuf?  +<li>2548: return 1. Stack looks as follows: +<pre> +CPU 0: +eax: 0x1        1 +ecx: 0x6c65746e 1818588270 +edx: 0x0        0 +ebx: 0x10a0e0   1089760 +esp: 0x210ea0   2166432 +ebp: 0x210ebc   2166460 +esi: 0x107f20   1081120 +edi: 0x107740   1079104 +eip: 0x102892   +eflags 0x2        +cs:  0x8        +ss:  0x10       +ds:  0x10       +es:  0x10       +fs:  0x10       +gs:  0x10       +   00210ea0 [00210ea0]  1023cf   <--- return to sleep +   00210ea4 [00210ea4]  108aa4 +   00210ea8 [00210ea8]  10111e +   00210eac [00210eac]  210ebc +   00210eb0 [00210eb0]  10239e +   00210eb4 [00210eb4]  0001 +   00210eb8 [00210eb8]  10a0e0 +   00210ebc [00210ebc]  210edc +   00210ec0 [00210ec0]  1024ce +   00210ec4 [00210ec4]  1010101 +   00210ec8 [00210ec8]  1010101 +   00210ecc [00210ecc]  1010101 +   00210ed0 [00210ed0]  107740 +   00210ed4 [00210ed4]  0001 +   00210ed8 [00210ed8]  10cd74 +   00210edc [00210edc]  210f1c +</pre> +</ul> + +<p>Why switch from proc[0] to the processor stack, and then to +  proc[0]'s stack?  Why not instead run the scheduler on the kernel +  stack of the last process that run on that cpu?  + +<ul> + +<li>If the scheduler wanted to use the process stack, then it couldn't +  have any stack variables live across process scheduling, since +  they'd be different depending on which process just stopped running. + +<li>Suppose process p goes to sleep on CPU1, so CPU1 is idling in +  scheduler() on p's stack. Someone wakes up p. CPU2 decides to run +  p. Now p is running on its stack, and CPU1 is also running on the +  same stack. They will likely scribble on each others' local +  variables, return pointers, etc. + +<li>The same thing happens if CPU1 tries to reuse the process's page +tables to avoid a TLB flush.  If the process gets killed and cleaned +up by the other CPU, now the page tables are wrong.  I think some OSes +actually do this (with appropriate ref counting). + +</ul> + +<p>How is preemptive scheduling implemented in xv6?  Answer see trap.c +  line 2905 through 2917, and the implementation of yield() on sheet +  22. + +<p>How long is a timeslice for a user process?  (possibly very short; +  very important lock is held across context switch!) 
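+
+<p>Aside: the jmpbuf-based switch above has a close user-level analogue
+in the POSIX ucontext routines, which likewise just save and restore a
+computation's register state (including its stack pointer).  The sketch
+below is not xv6 code; it is only meant to illustrate how little state a
+cooperative thread switch has to move, and it assumes a Unix-like system
+that still provides the (now obsolescent) &lt;ucontext.h&gt; interface:
+<pre>
+#include &lt;stdio.h&gt;
+#include &lt;ucontext.h&gt;
+
+static ucontext_t main_ctx, thread_ctx;
+
+// Runs on its own stack; each swapcontext is a voluntary yield,
+// much like sched() giving up the CPU.
+static void thread_func(void)
+{
+  int i;
+  for (i = 0; i < 3; i++) {
+    printf("thread: step %d\n", i);
+    swapcontext(&thread_ctx, &main_ctx);  // save my registers, resume main
+  }
+}
+
+int main(void)
+{
+  static char stack[16384];
+  int i;
+
+  getcontext(&thread_ctx);                // start from the current state
+  thread_ctx.uc_stack.ss_sp = stack;      // give the "thread" its own stack
+  thread_ctx.uc_stack.ss_size = sizeof stack;
+  thread_ctx.uc_link = &main_ctx;         // where to go when thread_func returns
+  makecontext(&thread_ctx, thread_func, 0);
+
+  for (i = 0; i < 3; i++) {
+    printf("main: step %d\n", i);
+    swapcontext(&main_ctx, &thread_ctx);  // like longjmp to the other jmpbuf
+  }
+  return 0;
+}
+</pre>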
+ +</body> + + + diff --git a/web/l-vm.html b/web/l-vm.html new file mode 100644 index 0000000..ffce13e --- /dev/null +++ b/web/l-vm.html @@ -0,0 +1,462 @@ +<html> +<head> +<title>Virtual Machines</title> +</head> + +<body> + +<h1>Virtual Machines</h1> + +<p>Required reading: Disco</p> + +<h2>Overview</h2> + +<p>What is a virtual machine? IBM definition: a fully protected and +isolated copy of the underlying machine's hardware.</p> + +<p>Another view is that it provides another example of a kernel API. +In contrast to other kernel APIs (unix, microkernel, and exokernel), +the virtual machine operating system exports as the kernel API the +processor API (e.g., the x86 interface).  Thus, each program running +in user space sees the services offered by a processor, and each +program sees its own processor.  Of course, we don't want to make a +system call for each instruction, and in fact one of the main +challenges in virtual machine operation systems is to design the +system in such a way that the physical processor executes the virtual +processor API directly, at processor speed. + +<p> +Virtual machines can be useful for a number of reasons: +<ol> + +<li>Run multiple operating systems on single piece of hardware.  For +example, in one process, you run Linux, and in another you run +Windows/XP.  If the kernel API is identical to the x86 (and faithly +emulates x86 instructions, state, protection levels, page tables), +then Linux and Windows/XP, the virual machine operationg system can +run these <i>guest</i> operating systems without modifications. + +<ul> +<li>Run "older" programs on the same hardware (e.g., run one x86 +virtual machine in real mode to execute old DOS apps). + +<li>Or run applications that require different operating system. +</ul> + +<li>Fault isolation: like processes on UNIX but more complete, because +the guest operating systems runs on the virtual machine in user space. +Thus, faults in the guest OS cannot effect any other software. + +<li>Customizing the apparent hardware: virtual machine may have +different view of hardware than is physically present. + +<li>Simplify deployment/development of software for scalable +processors (e.g., Disco). + +</ol> +</p> + +<p>If your operating system isn't a virtual machine operating system, +what are the alternatives? Processor simulation (e.g., bochs) or +binary emulation (WINE).  Simulation runs instructions purely in +software and is slow (e.g., 100x slow down for bochs); virtualization +gets out of the way whenever possible and can be efficient.  + +<p>Simulation gives portability whereas virtualization focuses on +performance. However, this means that you need to model your hardware +very carefully in software. Binary emulation focuses on just getting +system call for a particular operating system's interface. Binary +emulation can be hard because it is targetted towards a particular +operating system (and even that can change between revisions). +</p> + +<p>To provide each process with its own virtual processor that exports +the same API as the physical processor, what features must +the virtual machine operating system virtualize? 
+<ol> +<li>CPU: instructions -- trap all privileged instructions</li> +<li>Memory: address spaces -- map "physical" pages managed +by the guest OS to <i>machine</i>pages, handle translation, etc.</li> +<li>Devices: any I/O communication needs to be trapped and passed +    through/handled appropriately.</li> +</ol> +</p> +The software that implements the virtualization is typically called +the monitor, instead of the virtual machine operating system. + +<p>Virtual machine monitors (VMM) can be implemented in two ways: +<ol> +<li>Run VMM directly on hardware: like Disco.</li> +<li>Run VMM as an application (though still running as root, with +    integration into OS) on top of a <i>host</i> OS: like VMware. Provides +    additional hardware support at low development cost in +    VMM. Intercept CPU-level I/O requests and translate them into +    system calls (e.g. <code>read()</code>).</li> +</ol> +</p> + +<p>The three primary functions of a virtual machine monitor are: +<ul> +<li>virtualize processor (CPU, memory, and devices) +<li>dispatch events (e.g., forward page fault trap to guest OS). +<li>allocate resources (e.g., divide real memory in some way between +the physical memory of each guest OS). +</ul> + +<h2>Virtualization in detail</h2> + +<h3>Memory virtualization</h3> + +<p> +Understanding memory virtualization. Let's consider the MIPS example +from the paper. Ideally, we'd be able to intercept and rewrite all +memory address references. (e.g., by intercepting virtual memory +calls). Why can't we do this on the MIPS? (There are addresses that +don't go through address translation --- but we don't want the virtual +machine to directly access memory!)  What does Disco do to get around +this problem?  (Relink the kernel outside this address space.) +</p> + +<p> +Having gotten around that problem, how do we handle things in general? +</p> +<pre> +// Disco's tlb miss handler. +// Called when a memory reference for virtual adddress +// 'VA' is made, but there is not VA->MA (virtual -> machine) +// mapping in the cpu's TLB. +void tlb_miss_handler (VA) +{ +  // see if we have a mapping in our "shadow" tlb (which includes +  // "main" tlb) +  tlb_entry *t = tlb_lookup (thiscpu->l2tlb, va); +  if (t && defined (thiscpu->pmap[t->pa]))   // is there a MA for this PA? +    tlbwrite (va, thiscpu->pmap[t->pa], t->otherdata); +  else if (t) +    // get a machine page, copy physical page into, and tlbwrite +  else +    // trap to the virtual CPU/OS's handler +} + +// Disco's procedure which emulates the MIPS +// instruction which writes to the tlb. +// +// VA -- virtual addresss +// PA -- physical address (NOT MA machine address!) +// otherdata -- perms and stuff +void emulate_tlbwrite_instruction (VA, PA, otherdata) +{ +  tlb_insert (thiscpu->l2tlb, VA, PA, otherdata); // cache +  if (!defined (thiscpu->pmap[PA])) { // fill in pmap dynamically +    MA = allocate_machine_page (); +    thiscpu->pmap[PA] = MA; // See 4.2.2 +    thiscpu->pmapbackmap[MA] = PA; +    thiscpu->memmap[MA] = VA; // See 4.2.3 (for TLB shootdowns) +  } +  tlbwrite (va, thiscpu->pmap[PA], otherdata); +} + +// Disco's procedure which emulates the MIPS +// instruction which read the tlb. +tlb_entry *emulate_tlbread_instruction (VA) +{ +  // Must return a TLB entry that has a "Physical" address; +  // This is recorded in our secondary TLB cache. +  // (We don't have to read from the hardware TLB since +  // all writes to the hardware TLB are mediated by Disco. +  // Thus we can always keep the l2tlb up to date.) 
+  return tlb_lookup (thiscpu->l2tlb, va); +} +</pre> + +<h3>CPU virtualization</h3> + +<p>Requirements: +<ol> +<li>Results of executing non-privileged instructions in privileged and +    user mode must be equivalent. (Why? B/c the virtual "privileged" +    system will not be running in true "privileged" mode.) +<li>There must be a way to protect the VM from the real machine. (Some +    sort of memory protection/address translation. For fault isolation.)</li> +<li>There must be a way to detect and transfer control to the VMM when +    the VM tries to execute a sensitive instruction (e.g. a privileged +    instruction, or one that could expose the "virtualness" of the +    VM.) It must be possible to emulate these instructions in +    software. Can be classified into completely virtualizable +    (i.e. there are protection mechanisms that cause traps for all +    instructions), partly (insufficient or incomplete trap +    mechanisms), or not at all (e.g. no MMU). +</ol> +</p> + +<p>The MIPS didn't quite meet the second criteria, as discussed +above. But, it does have a supervisor mode that is between user mode and +kernel mode where any privileged instruction will trap.</p> + +<p>What might a the VMM trap handler look like?</p> +<pre> +void privilege_trap_handler (addr) { +  instruction, args = decode_instruction (addr) +  switch (instruction) { +  case foo: +    emulate_foo (thiscpu, args, ...); +    break; +  case bar: +    emulate_bar (thiscpu, args, ...); +    break; +  case ...: +    ... +  } +} +</pre> +<p>The <code>emulator_foo</code> bits will have to evaluate the +state of the virtual CPU and compute the appropriate "fake" answer. +</p> + +<p>What sort of state is needed in order to appropriately emulate all +of these things? +<pre> +- all user registers +- CPU specific regs (e.g. on x86, %crN, debugging, FP...) +- page tables (or tlb) +- interrupt tables +</pre> +This is needed for each virtual processor. +</p> + +<h3>Device I/O virtualization</h3> + +<p>We intercept all communication to the I/O devices: read/writes to +reserved memory addresses cause page faults into special handlers +which will emulate or pass through I/O as appropriate. +</p> + +<p> +In a system like Disco, the sequence would look something like: +<ol> +<li>VM executes instruction to access I/O</li> +<li>Trap generated by CPU (based on memory or privilege protection) +    transfers control to VMM.</li> +<li>VMM emulates I/O instruction, saving information about where this +    came from (for demultiplexing async reply from hardware later) .</li> +<li>VMM reschedules a VM.</li> +</ol> +</p> + +<p> +Interrupts will require some additional work: +<ol> +<li>Interrupt occurs on real machine, transfering control to VMM +    handler.</li> +<li>VMM determines the VM that ought to receive this interrupt.</li> +<li>VMM causes a simulated interrupt to occur in the VM, and reschedules a +    VM.</li> +<li>VM runs its interrupt handler, which may involve other I/O +    instructions that need to be trapped.</li> +</ol> +</p> + +<p> +The above can be slow! So sometimes you want the guest operating +system to be aware that it is a guest and allow it to avoid the slow +path. Special device drivers or changing instructions that would cause +traps into memory read/write instructions. +</p> + +<h2>Intel x86/vmware</h2> + +<p>VMware, unlike Disco, runs as an application on a guest OS and +cannot modify the guest OS.  Furthermore, it must virtualize the x86 +instead of MIPS processor.  
Both of these differences make good design +challenges. + +<p>The first challenge is that the monitor runs in user space, yet it +must dispatch traps and it must execute privilege instructions, which +both require kernel privileges.  To address this challenge, the +monitor downloads a piece of code, a kernel module, into the guest +OS. Most modern operating systems are constructed as a core kernel, +extended with downloadable kernel modules. +Privileged users can insert kernel modules at run-time. + +<p>The monitor downloads a kernel module that reads the IDT, copies +it, and overwrites the hard-wired entries with addresses for stubs in +the just downloaded kernel module.  When a trap happens, the kernel +module inspects the PC, and either forwards the trap to the monitor +running in user space or to the guest OS.  If the trap is caused +because a guest OS execute a privileged instructions, the monitor can +emulate that privilege instruction by asking the kernel module to +perform that instructions (perhaps after modifying the arguments to +the instruction). + +<p>The second challenge is virtualizing the x86 +  instructions. Unfortunately, x86 doesn't meet the 3 requirements for +  CPU virtualization. the first two requirements above.  If you run +  the CPU in ring 3, <i>most</i> x86 instructions will be fine, +  because most privileged instructions will result in a trap, which +  can then be forwarded to vmware for emulation.  For example, +  consider a guest OS loading the root of a page table in CR3. This +  results in trap (the guest OS runs in user space), which is +  forwarded to the monitor, which can emulate the load to CR3 as +  follows: + +<pre> +// addr is a physical address +void emulate_lcr3 (thiscpu, addr) +{ +  thiscpu->cr3 = addr; +  Pte *fakepdir = lookup (addr, oldcr3cache); +  if (!fakepdir) { +    fakedir = ppage_alloc (); +    store (oldcr3cache, addr, fakedir); +    // May wish to scan through supplied page directory to see if +    // we have to fix up anything in particular. +    // Exact settings will depend on how we want to handle +    // problem cases below and our own MM. +  } +  asm ("movl fakepdir,%cr3"); +  // Must make sure our page fault handler is in sync with what we do here. +} +</pre> + +<p>To virtualize the x86, the monitor must intercept any modifications +to the page table and substitute appropriate responses. And update +things like the accessed/dirty bits.  The monitor can arrange for this +to happen by making all page table pages inaccessible so that it can +emulate loads and stores to page table pages.  This setup allow the +monitor to virtualize the memory interface of the x86.</p> + +<p>Unfortunately, not all instructions that must be virtualized result +in traps: +<ul> +<li><code>pushf/popf</code>: <code>FL_IF</code> is handled different, +    for example.  In user-mode setting FL_IF is just ignored.</li> +<li>Anything (<code>push</code>, <code>pop</code>, <code>mov</code>) +    that reads or writes from <code>%cs</code>, which contains the +    privilege level. +<li>Setting the interrupt enable bit in EFLAGS has different +semantics in user space and kernel space.  In user space, it +is ignored; in kernel space, the bit is set. +<li>And some others... (total, 17 instructions). +</ul> +These instructions are unpriviliged instructions (i.e., don't cause a +trap when executed by a guest OS) but expose physical processor state. +These could reveal details of virtualization that should not be +revealed.  
For example, if guest OS sets the interrupt enable bit for +its virtual x86, the virtualized EFLAGS should reflect that the bit is +set, even though the guest OS is running in user space. + +<p>How can we virtualize these instructions? An approach is to decode +the instruction stream that is provided by the user and look for bad +instructions. When we find them, replace them with an interrupt +(<code>INT 3</code>) that will allow the VMM to handle it +correctly. This might look something like: +</p> + +<pre> +void initcode () { +  scan_for_nonvirtual (0x7c00); +} + +void scan_for_nonvirtualizable (thiscpu, startaddr) { +  addr  = startaddr; +  instr = disassemble (addr); +  while (instr is not branch or bad) { +    addr += len (instr); +    instr = disassemble (addr); +  } +  // remember that we wanted to execute this instruction. +  replace (addr, "int 3"); +  record (thiscpu->rewrites, addr, instr); +} + +void breakpoint_handler (tf) { +  oldinstr = lookup (thiscpu->rewrites, tf->eip); +  if (oldinstr is branch) { +    newcs:neweip = evaluate branch +    scan_for_nonvirtualizable (thiscpu, newcs:neweip) +    return; +  } else { // something non virtualizable +    // dispatch to appropriate emulation +  } +} +</pre> +<p>All pages must be scanned in this way. Fortunately, most pages +probably are okay and don't really need any special handling so after +scanning them once, we can just remember that the page is okay and let +it run natively. +</p> + +<p>What if a guest OS generates instructions, writes them to memory, +and then wants to execute them? We must detect self-modifying code +(e.g. must simulate buffer overflow attacks correctly.) When a write +to a physical page that happens to be in code segment happens, must +trap the write and then rescan the affected portions of the page.</p> + +<p>What about self-examining code? Need to protect it some +how---possibly by playing tricks with instruction/data TLB caches, or +introducing a private segment for code (%cs) that is different than +the segment used for reads/writes (%ds). +</p> + +<h2>Some Disco paper notes</h2> + +<p> +Disco has some I/O specific optimizations. +</p> +<ul> +<li>Disk reads only need to happen once and can be shared between +    virtual machines via copy-on-write virtual memory tricks.</li> +<li>Network cards do not need to be fully virtualized --- intra +    VM communication doesn't need a real network card backing it.</li> +<li>Special handling for NFS so that all VMs "share" a buffer cache.</li> +</ul> + +<p> +Disco developers clearly had access to IRIX source code. +</p> +<ul> +<li>Need to deal with KSEG0 segment of MIPS memory by relinking kernel +    at different address space.</li> +<li>Ensuring page-alignment of network writes (for the purposes of +    doing memory map tricks.)</li> +</ul> + +<p>Performance?</p> +<ul> +<li>Evaluated in simulation.</li> +<li>Where are the overheads? Where do they come from?</li> +<li>Does it run better than NUMA IRIX?</li> +</ul> + +<p>Premise.  Are virtual machine the preferred approach to extending +operating systems?  Have scalable multiprocessors materialized?</p> + +<h2>Related papers</h2> + +<p>John Scott Robin, Cynthia E. Irvine. <a +href="http://www.cs.nps.navy.mil/people/faculty/irvine/publications/2000/VMM-usenix00-0611.pdf">Analysis of the +Intel Pentium's Ability to Support a Secure Virtual Machine +Monitor</a>.</p> + +<p>Jeremy Sugerman,  Ganesh Venkitachalam, Beng-Hong Lim. 
<a +href="http://www.vmware.com/resources/techresources/530">Virtualizing +I/O Devices on VMware Workstation's Hosted Virtual Machine +Monitor</a>. In Proceedings of the 2001 Usenix Technical Conference.</p> + +<p>Kevin Lawton, Drew Northup. <a +href="http://savannah.nongnu.org/projects/plex86">Plex86 Virtual +Machine</a>.</p> + +<p><a href="http://www.cl.cam.ac.uk/netos/papers/2003-xensosp.pdf">Xen +and the Art of Virtualization</a>, Paul Barham, Boris +Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf +Neugebauer, Ian Pratt, Andrew Warfield, SOSP 2003</p> + +<p><a href="http://www.vmware.com/pdf/asplos235_adams.pdf">A comparison of +software and hardware techniques for x86 virtualizaton</a>Keith Adams +and Ole Agesen, ASPLOS 2006</p> + +</body> + +</html> + diff --git a/web/l-xfi.html b/web/l-xfi.html new file mode 100644 index 0000000..41ba434 --- /dev/null +++ b/web/l-xfi.html @@ -0,0 +1,246 @@ +<html> +<head> +<title>XFI</title> +</head> +<body> + +<h1>XFI</h1> + +<p>Required reading: XFI: software guards for system address spaces. + +<h2>Introduction</h2> + +<p>Problem: how to use untrusted code (an "extension") in a trusted +program? +<ul> +<li>Use untrusted jpeg codec in Web browser +<li>Use an untrusted driver in the kernel +</ul> + +<p>What are the dangers? +<ul> +<li>No fault isolations: extension modifies trusted code unintentionally +<li>No protection: extension causes a security hole +<ul> +<li>Extension has a buffer overrun problem +<li>Extension calls trusted program's functions +<li>Extensions calls a trusted program's functions that is allowed to +  call, but supplies "bad" arguments +<li>Extensions calls privileged hardware instructions (when extending +  kernel) +<li>Extensions reads data out of trusted program it shouldn't. +</ul> +</ul> + +<p>Possible solutions approaches: +<ul> + +<li>Run extension in its own address space with minimal +  privileges. Rely on hardware and operating system protection +  mechanism. + +<li>Restrict the language in which the extension is written: +<ul> + +<li>Packet filter language.  Language is limited in its capabilities, +  and it easy to guarantee "safe" execution. + +<li>Type-safe language. Language runtime and compiler guarantee "safe" +execution. +</ul> + +<li>Software-based sandboxing. + +</ul> + +<h2>Software-based sandboxing</h2> + +<p>Sandboxer. A compiler or binary-rewriter sandboxes all unsafe +  instructions in an extension by inserting additional instructions. +  For example, every indirect store is preceded by a few instructions +  that compute and check the target of the store at runtime. + +<p>Verifier. When the extension is loaded in the trusted program, the +  verifier checks if the extension is appropriately sandboxed (e.g., +  are all indirect stores sandboxed? does it call any privileged +  instructions?).  If not, the extension is rejected. If yes, the +  extension is loaded, and can run.  If the extension runs, the +  instruction that sandbox unsafe instructions check if the unsafe +  instruction is used in a safe way. + +<p>The verifier must be trusted, but the sandboxer doesn't.  We can do +  without the verifier, if the trusted program can establish that the +  extension has been sandboxed by a trusted sandboxer. + +<p>The paper refers to this setup as instance of proof-carrying code. + +<h2>Software fault isolation</h2> + +<p><a href="http://citeseer.ist.psu.edu/wahbe93efficient.html">SFI</a> +by Wahbe et al. 
explored how to use sandboxing to fault-isolate
+extensions; that is, how to use sandboxing to ensure that stores and jumps
+stay within a specified memory range (i.e., that they don't overwrite or
+jump into addresses in the trusted program unchecked).  They
+implemented SFI for a RISC processor, which simplifies things since
+memory can be written only by store instructions (other instructions
+modify registers).  In addition, they assumed that there were plenty
+of registers, so that they could dedicate a few to the sandboxing code.
+
+<p>The extension is loaded into a specific range (called a segment)
+  within the trusted application's address space.  The segment is
+  identified by the upper bits of the addresses in the
+  segment. Separate code and data segments are necessary to prevent an
+  extension from overwriting its code.
+
+<p>An unsafe instruction on the MIPS is an instruction that jumps or
+  stores to an address that cannot be statically verified to be within
+  the correct segment.  Most control transfer operations, such as
+  program-counter-relative branches, can be statically verified.  Stores to
+  static variables often use an immediate addressing mode and can be
+  statically verified.  Indirect jumps and indirect stores are unsafe.
+
+<p>To sandbox those instructions the sandboxer could generate the
+  following code for each unsafe instruction:
+<pre>
+  DR0 <- target address
+  R0 <- DR0 >> shift-register;  // load in R0 segment id of target
+  CMP R0, segment-register;     // compare target's segment id to this segment's id
+  BNE fault-isolation-error     // if not equal, branch to trusted error code
+  STORE using DR0
+</pre>
+In this code, DR0, shift-register, and segment-register
+are <i>dedicated</i>: they cannot be used by the extension code.  The
+verifier must check that the extension doesn't use these registers.  R0
+is a scratch register, but doesn't have to be dedicated.  The
+dedicated registers are necessary, because otherwise the extension could
+load DR0 and jump to the STORE instruction directly, skipping the
+check.
+
+<p>This implementation costs 4 registers and 4 additional instructions
+  for each unsafe instruction. One could do better, however:
+<pre>
+  DR0 <- target address & and-mask-register // mask segment ID from target
+  DR0 <- DR0 | segment register // insert this segment's ID
+  STORE using DR0
+</pre>
+This code just sets the right segment-ID bits.  It doesn't catch
+illegal addresses; it just forces illegal addresses to stay within
+the segment, harming the extension but no other code.  Even if the
+extension jumps to the second instruction of this sandbox sequence,
+nothing bad will happen (because DR0 will already contain the correct
+segment ID).
+
+<p>Optimizations include:
+<ul>
+<li>use guard zones for <i>store value, offset(reg)</i>
+<li>treat SP as dedicated register (sandbox code that initializes it)
+<li>etc.
+</ul>
+
+<h2>XFI</h2>
+
+<p>XFI extends SFI in several ways:
+<ul>
+<li>Handles fault isolation and protection
+<li>Uses control-flow integrity (CFI) to get good performance
+<li>Doesn't use dedicated registers
+<li>Uses two stacks (a scoped stack and an allocation stack); only the
+  allocation stack can be corrupted by buffer-overrun attacks.  The
+  scoped stack cannot be reached via computed memory references.
+<li>Uses a binary rewriter.
+<li>Works for the x86
+</ul>
+
+<p>The x86 is challenging, because it has few registers and
+  variable-length instructions.  The SFI technique alone won't work with the
+  x86 instruction set.
For example if the binary contains: +<pre> +  25 CD 80 00 00   # AND eax, 0x80CD +</pre> +and an adversary can arrange to jump to the second byte, then the +adversary calls system call on Linux, which has binary the binary +representation CD 80.  Thus, XFI must control execution flow. + +<p>XFI policy goals: +<ul> +<li>Memory-access constraints (like SFI) +<li>Interface restrictions  (extension has fixed entry and exit points) +<li>Scoped-stack integrity (calling stack is well formed) +<li>Simplified instructions semantics (remove dangerous instructions) +<li>System-environment integrity (ensure certain machine model +  invariants, such as x86 flags register cannot be modified) +<li>Control-flow integrity: execution must follow a static, expected +  control-flow graph. (enter at beginning of basic blocks) +<li>Program-data integrity (certain global variables in extension +  cannot be accessed via computed memory addresses) +</ul> + +<p>The binary rewriter inserts guards to ensure these properties.  The +  verifier check if the appropriate guards in place.  The primary +  mechanisms used are: +<ul> +<li>CFI guards on computed control-flow transfers (see figure 2) +<li>Two stacks +<li>Guards on computer memory accesses (see figure 3) +<li>Module header has a section that contain access permissions for +  region +<li>Binary rewriter, which performs intra-procedure analysis, and +  generates guards, code for stack use, and verification hints +<li>Verifier checks specific conditions per basic block. hints specify +  the verification state for the entry to each basic block, and at +  exit of basic block the verifier checks that the final state implies +  the verification state at entry to all possible successor basic +  blocks. (see figure 4) +</ul> + +<p>Can XFI protect against the attack discussed in last lecture? +<pre> +  unsigned int j; +  p=(unsigned char *)s->init_buf->data; +  j= *(p++); +  s->session->session_id_length=j; +  memcpy(s->session->session_id,p,j); +</pre> +Where will <i>j</i> be located? + +<p>How about the following one from the paper <a href="http://research.microsoft.com/users/jpincus/beyond-stack-smashing.pdf"><i>Beyond stack smashing: +  recent advances in exploiting buffer overruns</i></a>? +<pre> +void f2b(void * arg, size_t len) { +  char buf[100]; +  long val = ..; +  long *ptr = ..; +  extern void (*f)(); +   +  memcopy(buff, arg, len); +  *ptr = val; +  f(); +  ... +  return; +} +</pre> +What code can <i>(*f)()</i> call?  Code that the attacker inserted? +Code in libc? + +<p>How about an attack that use <i>ptr</i> in the above code to +  overwrite a method's address in a class's dispatch table with an +  address of support function? + +<p>How about <a href="http://research.microsoft.com/~shuochen/papers/usenix05data_attack.pdf">data-only attacks</a>?  For example, attacker +  overwrites <i>pw_uid</i> in the heap with 0 before the following +  code executes (when downloading /etc/passwd and then uploading it with a +  modified entry). +<pre> +FILE *getdatasock( ... ) { +  seteuid(0); +  setsockeope ( ...); +  ... +  seteuid(pw->pw_uid); +  ... +} +</pre> + +<p>How much does XFI slow down applications? How many more +  instructions are executed?  
(see Tables 1-4) + +</body> diff --git a/web/l1.html b/web/l1.html new file mode 100644 index 0000000..9865601 --- /dev/null +++ b/web/l1.html @@ -0,0 +1,288 @@ +<title>L1</title> +<html> +<head> +</head> +<body> + +<h1>OS overview</h1> + +<h2>Overview</h2> + +<ul> +<li>Goal of course: + +<ul> +<li>Understand operating systems in detail by designing and +implementing miminal OS +<li>Hands-on experience with building systems  ("Applying 6.033") +</ul> + +<li>What is an operating system? +<ul> +<li>a piece of software that turns the hardware into something useful +<li>layered picture: hardware, OS, applications +<li>Three main functions: fault isolate applications, abstract hardware,  +manage hardware +</ul> + +<li>Examples: +<ul> +<li>OS-X, Windows, Linux, *BSD, ... (desktop, server) +<li>PalmOS Windows/CE (PDA) +<li>Symbian, JavaOS (Cell phones) +<li>VxWorks, pSOS (real-time) +<li> ... +</ul> + +<li>OS Abstractions +<ul> +<li>processes: fork, wait, exec, exit, kill, getpid, brk, nice, sleep, +trace +<li>files:  open, close, read, write, lseek, stat, sync +<li>directories: mkdir, rmdir, link, unlink, mount, umount +<li>users + security: chown, chmod, getuid, setuid +<li>interprocess communication: signals, pipe +<li>networking: socket, accept, snd, recv, connect +<li>time: gettimeofday +<li>terminal: +</ul> + +<li>Sample Unix System calls (mostly POSIX) +<ul> +	<li> int read(int fd, void*, int) +	<li> int write(int fd, void*, int) +	<li> off_t lseek(int fd, off_t, int [012]) +	<li> int close(int fd) +	<li> int fsync(int fd) +	<li> int open(const char*, int flags [, int mode]) +	  <ul> +		<li> O_RDONLY, O_WRONLY, O_RDWR, O_CREAT +	  </ul> +	<li> mode_t umask(mode_t cmask) +	<li> int mkdir(char *path, mode_t mode); +	<li> DIR *opendir(char *dirname) +	<li> struct dirent *readdir(DIR *dirp) +	<li> int closedir(DIR *dirp) +	<li> int chdir(char *path) +	<li> int link(char *existing, char *new) +	<li> int unlink(char *path) +	<li> int rename(const char*, const char*) +	<li> int rmdir(char *path) +	<li> int stat(char *path, struct stat *buf) +	<li> int mknod(char *path, mode_t mode, dev_t dev) +	<li> int fork() +          <ul> +		<li> returns childPID in parent, 0 in child; only +		difference +           </ul> +	<li>int getpid() +	<li> int waitpid(int pid, int* stat, int opt) +           <ul> +		<li> pid==-1: any; opt==0||WNOHANG +		<li> returns pid or error +	   </ul> +	<li> void _exit(int status) +	<li> int kill(int pid, int signal) +	<li> int sigaction(int sig, struct sigaction *, struct sigaction *) +	<li> int sleep (int sec) +	<li> int execve(char* prog, char** argv, char** envp) +	<li> void *sbrk(int incr) +	<li> int dup2(int oldfd, int newfd) +	<li> int fcntl(int fd, F_SETFD, int val) +	<li> int pipe(int fds[2]) +	  <ul> +		<li> writes on fds[1] will be read on fds[0] +		<li> when last fds[1] closed, read fds[0] retursn EOF +		<li> when last fds[0] closed, write fds[1] kills SIGPIPE/fails +		EPIPE +           </ul> +	<li> int fchown(int fd, uind_t owner, gid_t group) +	<li> int fchmod(int fd, mode_t mode) +	<li> int socket(int domain, int type, int protocol) +	<li> int accept(int socket_fd, struct sockaddr*, int* namelen) +	  <ul> +		<li> returns new fd +          </ul> +	<li> int listen(int fd, int backlog) +	<li> int connect(int fd, const struct sockaddr*, int namelen) +	<li> void* mmap(void* addr, size_t len, int prot, int flags, int fd, +	off_t offset) +	<li> int munmap(void* addr, size_t len) +	<li> int gettimeofday(struct timeval*) +</ul> +</ul> + +<p>See the <a 
href="../reference.html">reference page</a> for links to +the early Unix papers. + +<h2>Class structure</h2> + +<ul> +<li>Lab: minimal OS for x86 in an exokernel style (50%) +<ul> +<li>kernel interface: hardware + protection +<li>libOS implements fork, exec, pipe, ... +<li>applications: file system, shell, .. +<li>development environment: gcc, bochs +<li>lab 1 is out +</ul> + +<li>Lecture structure (20%) +<ul> +<li>homework +<li>45min lecture +<li>45min case study +</ul> + +<li>Two quizzes (30%) +<ul> +<li>mid-term +<li>final's exam week +</ul> + +</ul> + +<h2>Case study: the shell (simplified)</h2> + +<ul> +<li>interactive command execution and a programming language +<li>Nice example that uses various OS abstractions. See  <a +href="../readings/ritchie74unix.pdf">Unix +paper</a> if you are unfamiliar with the shell. +<li>Final lab is a simple shell. +<li>Basic structure: +<pre> +       +       while (1) { +	    printf ("$"); +	    readcommand (command, args);   // parse user input +	    if ((pid = fork ()) == 0) {  // child? +	       exec (command, args, 0); +	    } else if (pid > 0) {   // parent? +	       wait (0);   // wait for child to terminate +	    } else { +	       perror ("Failed to fork\n"); +            } +        } +</pre> +<p>The split of creating a process with a new program in fork and exec +is mostly a historical accident.  See the  <a +href="../readings/ritchie79evolution.html">assigned paper</a> for today. +<li>Example: +<pre> +        $ ls +</pre> +<li>why call "wait"?  to wait for the child to terminate and collect +its exit status.  (if child finishes, child becomes a zombie until +parent calls wait.) +<li>I/O: file descriptors.  Child inherits open file descriptors +from parent. By convention: +<ul> +<li>file descriptor 0 for input (e.g., keyboard). read_command:  +<pre> +     read (1, buf, bufsize) +</pre> +<li>file descriptor 1 for output (e.g., terminal) +<pre> +     write (1, "hello\n", strlen("hello\n")+1) +</pre> +<li>file descriptor 2 for error (e.g., terminal) +</ul> +<li>How does the shell implement: +<pre> +     $ls > tmp1 +</pre> +just before exec insert: +<pre> +    	   close (1); +	   fd = open ("tmp1", O_CREAT|O_WRONLY);   // fd will be 1! +</pre> +<p>The kernel will return the first free file descriptor, 1 in this case. +<li>How does the shell implement sharing an output file: +<pre> +     $ls 2> tmp1 > tmp1 +</pre> +replace last code with: +<pre> + +	   close (1); +	   close (2); +	   fd1 = open ("tmp1", O_CREAT|O_WRONLY);   // fd will be 1! +	   fd2 = dup (fd1); +</pre> +both file descriptors share offset +<li>how do programs communicate? +<pre> +        $ sort file.txt | uniq | wc +</pre> +or +<pre> +	$ sort file.txt > tmp1 +	$ uniq tmp1 > tmp2 +	$ wc tmp2 +	$ rm tmp1 tmp2 +</pre> +or  +<pre> +        $ kill -9 +</pre> +<li>A pipe is an one-way communication channel.  Here is an example +where the parent is the writer and the child is the reader: +<pre> + +	int fdarray[2]; +	 +	if (pipe(fdarray) < 0) panic ("error"); +	if ((pid = fork()) < 0) panic ("error"); +	else if (pid > 0) { +	  close(fdarray[0]); +	  write(fdarray[1], "hello world\n", 12); +        } else { +	  close(fdarray[1]); +	  n = read (fdarray[0], buf, MAXBUF); +	  write (1, buf, n); +        } +</pre> +<li>How does the shell implement pipelines (i.e., cmd 1 | cmd 2 |..)? +We want to arrange that the output of cmd 1 is the input of cmd 2. +The way to achieve this goal is to manipulate stdout and stdin. 
+<li>The shell creates processes for each command in +the pipeline, hooks up their stdin and stdout correctly.  To do it +correct, and waits for the last process of the +pipeline to exit.  A sketch of the core modifications to our shell for +setting up a pipe is: +<pre>	     +	    int fdarray[2]; + +  	    if (pipe(fdarray) < 0) panic ("error"); +	    if ((pid = fork ()) == 0) {  child (left end of pipe) +	       close (1); +	       tmp = dup (fdarray[1]);   // fdarray[1] is the write end, tmp will be 1 +	       close (fdarray[0]);       // close read end +	       close (fdarray[1]);       // close fdarray[1] +	       exec (command1, args1, 0); +	    } else if (pid > 0) {        // parent (right end of pipe) +	       close (0); +	       tmp = dup (fdarray[0]);   // fdarray[0] is the read end, tmp will be 0 +	       close (fdarray[0]); +	       close (fdarray[1]);       // close write end +	       exec (command2, args2, 0); +	    } else { +	       printf ("Unable to fork\n"); +            } +</pre> +<li>Why close read-end and write-end? multiple reasons: maintain that +every process starts with 3 file descriptors and reading from an empty +pipe blocks reader, while reading from a closed pipe returns end of +file. +<li>How do you  background jobs? +<pre> +        $ compute & +</pre> +<li>How does the shell implement "&", backgrounding?  (Don't call wait +immediately). +<li>More details in the shell lecture later in the term. + +</body> + + diff --git a/web/l13.html b/web/l13.html new file mode 100644 index 0000000..af0f405 --- /dev/null +++ b/web/l13.html @@ -0,0 +1,245 @@ +<title>High-performance File Systems</title> +<html> +<head> +</head> +<body> + +<h1>High-performance File Systems</h1> + +<p>Required reading: soft updates. + +<h2>Overview</h2> + +<p>A key problem in designing file systems is how to obtain +performance on file system operations while providing consistency. +With consistency, we mean, that file system invariants are maintained +is on disk.  These invariants include that if a file is created, it +appears in its directory, etc.  If the file system data structures are +consistent, then it is possible to rebuild the file system to a +correct state after a failure. + +<p>To ensure consistency of on-disk file system data structures, +  modifications to the file system must respect certain rules: +<ul> + +<li>Never point to a structure before it is initialized. An inode must +be initialized before a directory entry references it.  An block must +be initialized before an inode references it. + +<li>Never reuse a structure before nullifying all pointers to it.  An +inode pointer to a disk block must be reset before the file system can +reallocate the disk block. + +<li>Never reset the last point to a live structure before a new +pointer is set.  When renaming a file, the file system should not +remove the old name for an inode until after the new name has been +written. +</ul> +The paper calls these dependencies update dependencies.   + +<p>xv6 ensures these rules by writing every block synchronously, and +  by ordering the writes appropriately.  With synchronous, we mean +  that a process waits until the current disk write has been +  completed before continuing with execution. + +<ul> + +<li>What happens if power fails after 4776 in mknod1?  Did we lose the +  inode for ever? No, we have a separate program (called fsck), which +  can rebuild the disk structures correctly and can mark the inode on +  the free list. + +<li>Does the order of writes in mknod1 matter?  
Say, what if we wrote +  directory entry first and then wrote the allocated inode to disk? +  This violates the update rules and it is not a good plan. If a +  failure happens after the directory write, then on recovery we have +  an directory pointing to an unallocated inode, which now may be +  allocated by another process for another file! + +<li>Can we turn the writes (i.e., the ones invoked by iupdate and +  wdir) into delayed writes without creating problems?  No, because +  the cause might write them back to the disk in an incorrect order. +  It has no information to decide in what order to write them. + +</ul> + +<p>xv6 is a nice example of the tension between consistency and +  performance.  To get consistency, xv6 uses synchronous writes, +  but these writes are slow, because they perform at the rate of a +  seek instead of the rate of the maximum data transfer rate. The +  bandwidth to a disk is reasonable high for large transfer (around +  50Mbyte/s), but latency is low, because of the cost of moving the +  disk arm(s) (the seek latency is about 10msec). + +<p>This tension is an implementation-dependent one.  The Unix API +  doesn't require that writes are synchronous.  Updates don't have to +  appear on disk until a sync, fsync, or open with O_SYNC.  Thus, in +  principle, the UNIX API allows delayed writes, which are good for +  performance: +<ul> +<li>Batch many writes together in a big one, written at the disk data +  rate. +<li>Absorp writes to the same block. +<li>Schedule writes to avoid seeks. +</ul> + +<p>Thus the question: how to delay writes and achieve consistency? +  The paper provides an answer. + +<h2>This paper</h2> + +<p>The paper surveys some of the existing techniques and introduces a +new to achieve the goal of performance and consistency. + +<p> + +<p>Techniques possible: +<ul> + +<li>Equip system with NVRAM, and put buffer cache in NVRAM. + +<li>Logging.  Often used in UNIX file systems for metadata updates. +LFS is an extreme version of this strategy. + +<li>Flusher-enforced ordering.  All writes are delayed. This flusher +is aware of dependencies between blocks, but doesn't work because +circular dependencies need to be broken by writing blocks out. + +</ul> + +<p>Soft updates is the solution explored in this paper.  It doesn't +require NVRAM, and performs as well as the naive strategy of keep all +dirty block in main memory.  Compared to logging, it is unclear if +soft updates is better.  The default BSD file systems uses soft +  updates, but most Linux file systems use logging. + +<p>Soft updates is a sophisticated variant of flusher-enforced +ordering.  Instead of maintaining dependencies on the block-level, it +maintains dependencies on file structure level (per inode, per +directory, etc.), reducing circular dependencies. Furthermore, it +breaks any remaining circular dependencies by undo changes before +writing the block and then redoing them to the block after writing. 
+
+<p>Pseudocode for create:
+<pre>
+create (f) {
+   allocate inode in block i  (assuming an inode is available)
+   add i to directory data block d  (assuming d has space)
+   mark d as dependent on i, and create undo/redo record
+   update directory inode in block di
+   mark di as dependent on d
+}
+</pre>
+
+<p>Pseudocode for the flusher:
+<pre>
+flushblock (b)
+{
+  lock b;
+  for all dependencies that b is relying on
+    "remove" that dependency by undoing the change to b
+    mark the dependency as "unrolled"
+  write b
+}
+
+write_completed (b) {
+  remove dependencies that depend on b
+  reapply "unrolled" dependencies that b depended on
+  unlock b
+}
+</pre>
+
+<p>Apply the flush algorithm to the example:
+<ul>
+<li>A list of two dependencies: directory->inode, inode->directory.
+<li>Let's say the syncer picks the directory block first
+<li>Undo directory->inode changes (i.e., unroll <A,#4>)
+<li>Write directory block
+<li>Remove met dependencies (i.e., remove inode->directory dependency)
+<li>Perform redo operation (i.e., redo <A,#4>)
+<li>Select inode block and write it
+<li>Remove met dependencies (i.e., remove directory->inode dependency)
+<li>Select directory block (it is dirty again!)
+<li>Write it.
+</ul>
+
+<p>A file operation that is important for file-system consistency
+is rename.  Rename conceptually works as follows:
+<pre>
+rename (from, to)
+   unlink (to);
+   link (from, to);
+   unlink (from);
+</pre>
+
+<p>Rename is often used by programs to make a new version of a file
+the current version.  Committing to a new version must happen
+atomically.  Unfortunately, without transaction-like support
+atomicity is impossible to guarantee, so a typical file system
+provides weaker semantics for rename: if to already exists, an
+instance of to will always exist, even if the system should crash in
+the middle of the operation.  Does the above implementation of rename
+guarantee this semantics? (Answer: no.)
+
+<p>If rename is implemented as unlink, link, unlink, then it is
+difficult to guarantee even the weak semantics. Modern UNIXes provide
+rename as a file system call:
+<pre>
+   update dir block for to point to from's inode // write block
+   update dir block for from to free entry // write block
+</pre>
+<p>fsck may need to correct refcounts in the inode if the file
+system fails during rename.  For example, a crash after the first
+write followed by fsck should set the refcount to 2, since both from
+and to are pointing at the inode.
+
+<p>This semantics is sufficient, however, for an application to ensure
+atomicity. Before the call, there is a from and perhaps a to.  If the
+call is successful, following the call there is only a to.  If there
+is a crash, there may be both a from and a to, in which case the
+caller knows the previous attempt failed, and must retry.  The
+subtlety is that if you now follow the two links, the "to" name may
+link to either the old file or the new file.  If it links to the new
+file, that means that there was a crash and you just detected that the
+rename operation was composite.  On the other hand, the retry
+procedure can be the same for either case (do the rename again), so it
+isn't necessary to discover how it failed.  The function follows the
+golden rule of recoverability, and it is idempotent, so it lays all
+the needed groundwork for use as part of a true atomic action.
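+
+<p>For example, an application that wants to commit a new version of a
+file atomically typically writes the new contents under a temporary
+name and then renames it over the old name.  A minimal sketch using
+standard POSIX calls (the file names and the helper function are
+illustrative, not part of xv6 or the paper):
+<pre>
+#include <fcntl.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+
+/* Replace "config" atomically: after a crash, the name "config"
+   refers to either the complete old contents or the complete new
+   contents, never a mix.  (Short writes are ignored for brevity.) */
+int
+update_config(const char *data)
+{
+	int fd = open("config.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666);
+	if (fd < 0)
+		return -1;
+	if (write(fd, data, strlen(data)) < 0 || fsync(fd) < 0) {
+		close(fd);
+		unlink("config.tmp");
+		return -1;
+	}
+	close(fd);
+	return rename("config.tmp", "config");  /* the commit point */
+}
+</pre>
+<p>If a crash leaves both config.tmp and config behind, simply running
+update_config again is safe; this is exactly the idempotent retry
+discussed above.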
+
+<p>With soft updates, rename becomes:
+<pre>
+rename (from, to) {
+   i = namei(from);
+   add a reference to inode i to "to"'s directory data block td
+   mark td as dependent on block i
+   update "to"'s directory inode in block tdi
+   mark tdi as dependent on td
+   remove the reference to inode i from "from"'s directory data block fd
+   mark fd as dependent on tdi
+   update "from"'s directory inode in block fdi
+   mark fdi as dependent on fd
+}
+</pre>
+<p>No synchronous writes!
+
+<p>What needs to be done on recovery?  (Inspect every statement in
+rename and see what inconsistencies could exist on the disk; e.g.,
+the inode refcnt could be too high.)  None of these inconsistencies require
+fixing before the file system can operate; they can be fixed by a
+background file system repairer.
+
+<h2>Paper discussion</h2>
+
+<p>Do soft updates perform any useless writes? (A useless write is a
+write that will be immediately overwritten.)  (Answer: yes.) Fix the
+syncer to be careful about which block it starts with.  Fix cache
+replacement to select the LRU block with no pending dependencies.
+
+<p>Can a log-structured file system implement rename better? (Answer:
+yes, since it can get the refcnts right.)
+
+<p>Discuss all graphs.
+
+</body>
+
 diff --git a/web/l14.txt b/web/l14.txt new file mode 100644 index 0000000..d121dff --- /dev/null +++ b/web/l14.txt @@ -0,0 +1,247 @@
+Why am I lecturing about Multics?
+  Origin of many ideas in today's OSes
+  Motivated UNIX design (often in opposition)
+  Motivated x86 VM design
+  This lecture is really "how Intel intended x86 segments to be used"
+
+Multics background
+  design started in 1965
+    very few interactive time-shared systems then: CTSS
+  design first, then implementation
+  system stable by 1969
+    so pre-dates UNIX, which started in 1969
+  ambitious, many years, many programmers, MIT+GE+BTL
+
+Multics high-level goals
+  many users on same machine: "time sharing"
+  perhaps commercial services sharing the machine too
+  remote terminal access (but no recognizable data networks: wired or phone)
+  persistent reliable file system
+  encourage interaction between users
+    support joint projects that share data &c
+  control access to data that should not be shared
+
+Most interesting aspect of design: memory system
+  idea: eliminate memory / file distinction
+  file i/o uses LD / ST instructions
+  no difference between memory and disk files
+  just jump to start of file to run program
+  enhances sharing: no more copying files to private memory
+  this seems like a really neat simplification!
+
+GE 645 physical memory system
+  24-bit phys addresses
+  36-bit words
+  so up to 75 megabytes of physical memory!!!
+    but no-one could afford more than about a megabyte
+
+[per-process state]
+  DBR
+  DS, SDW (== address space)
+  KST
+  stack segment
+  per-segment linkage segments
+
+[global state]
+  segment content pages
+  per-segment page tables
+  per-segment branch in directory segment
+  AST
+
+645 segments (simplified for now, no paging or rings)
+  descriptor base register (DBR) holds phy addr of descriptor segment (DS)
+  DS is an array of segment descriptor words (SDW)
+  SDW: phys addr, length, r/w/x, present
+  CPU has pairs of registers: 18 bit offset, 18 bit segment #
+    five pairs (PC, arguments, base, linkage, stack)
+  early Multics limited each segment to 2^16 words
+    thus there are lots of them, intended to correspond to program modules
+  note: cannot directly address phys mem (18 vs 24)
+  645 segments are a lot like the x86!
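+
+  a C-style sketch of the translation the 645 does for a (segment #,
+  offset) pair, still ignoring paging and rings
+  (type, variable, and function names here are invented for illustration):
+
+    typedef unsigned int uint;
+    struct sdw { uint phys; uint len; int r, w, x; int present; };
+    struct sdw *DBR;        // descriptor base register: the descriptor segment
+    void fault(void);       // deliver a fault to the process
+
+    uint
+    translate(uint segno, uint offset, int writing)
+    {
+      struct sdw *s = &DBR[segno];              // this segment's SDW
+      if (!s->present)
+        fault();                                // missing-segment fault
+      if (offset >= s->len || (writing && !s->w))
+        fault();                                // length / permission check
+      return s->phys + offset;                  // 24-bit physical address
+    }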
+ +645 paging +  DBR and SDW actually contain phy addr of 64-entry page table +  each page is 1024 words +  PTE holds phys addr and present flag +  no permission bits, so you really need to use the segments, not like JOS +  no per-process page table, only per-segment +    so all processes using a segment share its page table and phys storage +    makes sense assuming segments tend to be shared +    paging environment doesn't change on process switch + +Multics processes +  each process has its own DS +  Multics switches DBR on context switch +  different processes typically have different number for same segment + +how to use segments to unify memory and file system? +  don't want to have to use 18-bit seg numbers as file names +  we want to write programs using symbolic names +  names should be hierarchical (for users) +    so users can have directories and sub-directories +    and path names + +Multics file system +  tree structure, directories and files +  each file and directory is a segment +  dir seg holds array of "branches" +    name, length, ACL, array of block #s, "active" +  unique ROOT directory +  path names: ROOT > A > B +  note there are no inodes, thus no i-numbers +  so "real name" for a file is the complete path name +    o/s tables have path name where unix would have i-number +    presumably makes renaming and removing active files awkward +    no hard links + +how does a program refer to a different segment? +  inter-segment variables contain symbolic segment name +  A$E refers to segment A, variable/function E +  what happens when segment B calls function A$E(1, 2, 3)? + +when compiling B: +    compiler actually generates *two* segments +    one holds B's instructions +    one holds B's linkage information +    initial linkage entry: +      name of segment e.g. "A" +      name of symbol e.g. "E" +      valid flag +    CALL instruction is indirect through entry i of linkage segment +    compiler marks entry i invalid +    [storage for strings "A" and "E" really in segment B, not linkage seg] + +when a process is executing B: +    two segments in DS: B and a *copy* of B's linkage segment +    CPU linkage register always points to current segment's linkage segment +    call A$E is really call indirect via linkage[i] +    faults because linkage[i] is invalid +    o/s fault handler +      looks up segment name for i ("A") +      search path in file system for segment "A" (cwd, library dirs) +      if not already in use by some process (branch active flag and AST knows): +        allocate page table and pages +        read segment A into memory +      if not already in use by *this* process (KST knows): +        find free SDW j in process DS, make it refer to A's page table +        set up r/w/x based on process's user and file ACL +        also set up copy of A's linkage segment +      search A's symbol table for "E" +      linkage[i] := j / address(E) +      restart B +    now the CALL works via linkage[i] +      and subsequent calls are fast + +how does A get the correct linkage register? 
+  the right value cannot be embedded in A, since shared among processes +  so CALL actually goes to instructions in A's linkage segment +    load current seg# into linkage register, jump into A +    one set of these per procedure in A + +all memory / file references work this way +  as if pointers were really symbolic names +  segment # is really a transparent optimization +  linking is "dynamic" +    programs contain symbolic references +    resolved only as needed -- if/when executed +  code is shared among processes +  was program data shared? +    probably most variables not shared (on stack, in private segments) +    maybe a DB would share a data segment, w/ synchronization +  file data: +    probably one at a time (locks) for read/write +    read-only is easy to share + +filesystem / segment implications +  programs start slowly due to dynamic linking +  creat(), unlink(), &c are outside of this model +  store beyond end extends a segment (== appends to a file) +  no need for buffer cache! no need to copy into user space! +    but no buffer cache => ad-hoc caches e.g. active segment table +  when are dirty segments written back to disk? +    only in page eviction algorithm, when free pages are low +  database careful ordered writes? e.g. log before data blocks? +    I don't know, probably separate flush system calls + +how does shell work? +  you type a program name +  the shell just CALLs that program, as a segment! +  dynamic linking finds program segment and any library segments it needs +  the program eventually returns, e.g. with RET +  all this happened inside the shell process's address space +  no fork, no exec +  buggy program can crash the shell! e.g. scribble on stack +  process creation was too slow to give each program its own process + +how valuable is the sharing provided by segment machinery? +  is it critical to users sharing information? +  or is it just there to save memory and copying? + +how does the kernel fit into all this? +  kernel is a bunch of code modules in segments (in file system) +  a process dynamically loads in the kernel segments that it uses +  so kernel segments have different numbers in different processes +    a little different from separate kernel "program" in JOS or xv6 +  kernel shares process's segment# address space   +    thus easy to interpret seg #s in system call arguments +  kernel segment ACLs in file system restrict write +    so mapped non-writeable into processes + +how to call the kernel? +  very similar to the Intel x86 +  8 rings. users at 4. core kernel at 0. +  CPU knows current execution level +  SDW has max read/write/execute levels +  call gate: lowers ring level, but only at designated entry +  stack per ring, incoming call switches stacks +  inner ring can always read arguments, write results +  problem: checking validity of arguments to system calls +    don't want user to trick kernel into reading/writing the wrong segment +    you have this problem in JOS too +    later Multics CPUs had hardware to check argument references + +are Multics rings a general-purpose protected subsystem facility? +  example: protected game implementation +    protected so that users cannot cheat +    put game's code and data in ring 3 +    BUT what if I don't trust the author? +      or if i've already put some other subsystem in ring 3? 
+    a ring has full power over itself and outer rings: you must trust +  today: user/kernel, server processes and IPC +    pro: protection among mutually suspicious subsystems +    con: no convenient sharing of address spaces + +UNIX vs Multics +  UNIX was less ambitious (e.g. no unified mem/FS) +  UNIX hardware was small +  just a few programmers, all in the same room +  evolved rather than pre-planned +  quickly self-hosted, so they got experience earlier + +What did UNIX inherit from MULTICS? +  a shell at user level (not built into kernel) +  a single hierarchical file system, with subdirectories +  controlled sharing of files +  written in high level language, self-hosted development + +What did UNIX reject from MULTICS? +  files look like memory +    instead, unifying idea is file descriptor and read()/write() +    memory is a totally separate resource +  dynamic linking +    instead, static linking at compile time, every binary had copy of libraries +  segments and sharing +    instead, single linear address space per process, like xv6 +  (but shared libraries brought these back, just for efficiency, in 1980s) +  Hierarchical rings of protection +    simpler user/kernel +    for subsystems, setuid, then client/server and IPC + +The most useful sources I found for late-1960s Multics VM: +  1. Bensoussan, Clingen, Daley, "The Multics Virtual Memory: Concepts +     and Design," CACM 1972 (segments, paging, naming segments, dynamic +     linking). +  2. Daley and Dennis, "Virtual Memory, Processes, and Sharing in Multics," +     SOSP 1967 (more details about dynamic linking and CPU).  +  3. Graham, "Protection in an Information Processing Utility," +     CACM 1968 (brief account of rings and gates). diff --git a/web/l19.txt b/web/l19.txt new file mode 100644 index 0000000..af9d0bb --- /dev/null +++ b/web/l19.txt @@ -0,0 +1,1412 @@ +-- front +6.828 Shells Lecture + +Hello. + +-- intro +Bourne shell + +Simplest shell: run cmd arg arg ...  +	fork +		exec in child +		wait in parent + +More functionality: +	file redirection: cmd >file +		open file as fd 1 in child before exec + +Still more functionality: +	pipes: cmd | cmd | cmd ... +		create pipe, +		run first cmd with pipe on fd 1, +		run second cmd with other end of pipe on fd 0 + +More Bourne arcana: +	$* - command args +	"$@" - unexpanded command args +	environment variables +	macro substitution +	if, while, for  +	||  +	&&  +	"foo $x" +	'foo $x' +	`cat foo` + +-- rc +Rc Shell + + +No reparsing of input (except explicit eval). + +Variables as explicit lists. + +Explicit concatenation. + +Multiple input pipes <{cmd} - pass /dev/fd/4 as file name. + +Syntax more like C, less like Algol. + +diff <{echo hi} <{echo bye} + +-- es +Es shell + + +rc++ + +Goal is to override functionality cleanly. + +Rewrite input like cmd | cmd2 as %pipe {cmd} {cmd2}. + +Users can redefine %pipe, etc. + +Need lexical scoping and let to allow new %pipe refer to old %pipe. + +Need garbage collection to collect unreachable code. + +Design principle:  +	minimal functionality + good defaults +	allow users to customize implementations +	 +	emacs, exokernel + +-- apps +Applications + +Shell scripts are only as good as the programs they use. +	(What good are pipes without cat, grep, sort, wc, etc.?) + +The more the scripts can access, the more powerful they become. + +-- acme +Acme, Plan 9 text editor + +Make window system control files available to +everything, including shell. + +Can write shell scripts to script interactions. 
+ +/home/rsc/bin/Slide +/home/rsc/bin/Slide- +/home/rsc/bin/Slide+ + +/usr/local/plan9/bin/adict + +win + +-- javascript +JavaScript + +Very powerful +	- not because it's a great language +	- because it has a great data set +	- Google Maps +	- Gmail +	- Ymail +	- etc. + +-- greasemonkey +GreaseMonkey + +// ==UserScript== +// @name            Google Ring +// @namespace       http://swtch.com/greasemonkey/ +// @description     Changes Google Logo +// @include         http://*.google.*/* +// ==/UserScript== + +(function() { +	for(var i=0; i<document.images.length; i++){ +		if(document.images[i].src == "http://www.google.com/intl/en/images/logo.gif") +			document.images[i].src = "http://swtch.com/googlering.png"; +	} +})(); + +-- webscript0 +Webscript + +Why can't I script my web interactions? + +/home/rsc/plan9/bin/rc/fedex + +webscript /home/rsc/src/webscript/a3 +	/home/rsc/src/webscript/a2 + +-- acid +Acid, a programmable (scriptable) debugger + +defn stopped(pid) +{ +	pfixstop(pid); +	pstop(pid); +} + +defn pfixstop(pid) +{ +	if *fmt(*PC-1, 'b') == 0xCC then { +		// Linux stops us after the breakpoint, not at it +		*PC = *PC-1; +	} +} + +/usr/local/plan9/acid/port:/^defn.bpset +/usr/local/plan9/acid/port:/^defn.step + +defn checkpdb(pdb)  +{ +	loop 1,768 do {  +		if *pdb != 0 then { print(pdb\X, " ", *pdb\X, "\n"); } +		pdb = pdb +4; +	} +} + +-- guis +GUIs + +Can we script guis?  Not as clear. + +Acme examples show one way:  +	turn events into file (pipe) to read. + +Tcl/tk is close too. + +Eventually everyone turns to C. + +-- others +Honorable Mentions + +Scheme + +Lisp + +AutoCAD + +Viaweb RTML + +-- c +"Real" programming languages vs. Scripts + +Why does everyone eventually rewrite scripts in C? +	(aka C++, C#, any compiled language) + +What do you need C for now? + +How could you make it accessible to a script language? + +-- /home/rsc/bin/Slide +#!/usr/local/plan9/bin/rc + +echo name `{pwd}^/$1 | 9p write acme/$winid/ctl +echo clean | 9p write acme/$winid/ctl +echo get | 9p write acme/$winid/ctl + +-- /home/rsc/bin/Slide- +#!/usr/local/plan9/bin/rc + +name=$% +current=`{basename $name} +currentx=`{9 grep -n '^'$current'([ 	]|$)' index | sed 's/:.*//'} + +pagex=`{echo $currentx - 1 | hoc} +if(~ $pagex 0){ +	echo no such page +	exit 0 +} +page=`{sed -n $pagex^p index | awk '{print $1}'} +if(~ $#page 0){ +	echo no such page +	exit 0 +} + +Slide $page +-- /home/rsc/bin/Slide+ +#!/usr/local/plan9/bin/rc + +name=$% +current=`{basename $name} +currentx=`{9 grep -n '^'$current'([ 	]|$)' index | sed 's/:.*//'} + +pagex=`{echo $currentx + 1 | hoc} +page=`{sed -n $pagex^p index | awk '{print $1}'} +if(~ $#page 0){ +	echo no such page +	exit 0 +} + +Slide $page +-- /usr/local/plan9/bin/adict +#!/usr/local/plan9/bin/rc + +. 9.rc +. 
$PLAN9/lib/acme.rc + +fn event { +	# $1 - c1 origin of event +	# $2 - c2 type of action +	# $3 - q0 beginning of selection +	# $4 - q1 end of selection +	# $5 - eq0 beginning of expanded selection +	# $6 - eq1 end of expanded selection +	# $7 - flag +	# $8 - nr number of runes in $7 +	# $9 - text +	# $10 - chorded argument +	# $11 - origin of chorded argument + +	switch($1$2){ +	case E*	# write to body or tag +	case F*	# generated by ourselves; ignore +	case K*	# type away we do not care +	case Mi	# mouse: text inserted in tag +	case MI	# mouse: text inserted in body +	case Md	# mouse: text deleted from tag +	case MD	# mouse: text deleted from body + +	case Mx MX	# button 2 in tag or body +		winwriteevent $* + +	case Ml ML	# button 3 in tag or body +		{ +			if(~ $dict NONE) +				dictwin /adict/$9/ $9 +			if not +				dictwin /adict/$dict/$9 $dict $9 +		} & +	} +} + +fn dictwin { +	newwindow +	winname $1 +	switch($#*){ +	case 1 +		dict -d '?' >[2=1] | sed 1d | winwrite body +	case 2 +		dict=$2 +	case 3 +		dict=$2 +		dict -d $dict $3 >[2=1] | winwrite body +	} +	winctl clean +	wineventloop +} + +dict=NONE +if(~ $1 -d){ +	shift +	dict=$2 +	shift +} +if(~ $1 -d*){ +	dict=`{echo $1 | sed 's/-d//'} +	shift +} +if(~ $1 -*){ +	echo 'usage: adict [-d dict] [word...]' >[1=2] +	exit usage +} + +switch($#*){ +case 0 +	if(~ $dict NONE) +		dictwin /adict/ +	if not +		dictwin /adict/$dict/ $dict +case * +	if(~ $dict NONE){ +		dict=`{dict -d'?' | 9 sed -n 's/^   ([^\[ 	]+).*/\1/p' | sed 1q} +		if(~ $#dict 0){ +			echo 'no dictionaries present on this system' >[1=2] +			exit nodict +		} +	} +	for(i) +		dictwin /adict/$dict/$i $dict $i +} + +-- /usr/local/plan9/lib/acme.rc +fn newwindow { +	winctl=`{9p read acme/new/ctl} +	winid=$winctl(1) +	winctl noscroll +} + +fn winctl {	 +	echo $* | 9p write acme/acme/$winid/ctl +} + +fn winread { +	9p read acme/acme/$winid/$1 +} + +fn winwrite { +	9p write acme/acme/$winid/$1 +} + +fn windump { +	if(! ~ $1 - '') +		winctl dumpdir $1 +	if(! ~ $2 - '') +		winctl dump $2 +} + +fn winname { +	winctl name $1 +} + +fn winwriteevent { +	echo $1$2$3 $4 | winwrite event +} + +fn windel { +	if(~ $1 sure) +		winctl delete +	if not +		winctl del +} + +fn wineventloop { +	. <{winread event >[2]/dev/null | acmeevent} +} +-- /home/rsc/plan9/rc/bin/fedex +#!/bin/rc + +if(! ~ $#* 1) { +	echo usage: fedex 123456789012 >[1=2] +	exit usage +} + +rfork e + +fn bgrep{ +pattern=`{echo $1 | sed 's;/;\\&;'} +shift + +@{ echo 'X { +$ +a + +. +} +X ,x/(.+\n)+\n/ g/'$pattern'/p' | +sam -d $* >[2]/dev/null +} +} + +fn awk2 { +	awk 'NR%2==1 { a=$0; }  +		NR%2==0 { b=$0; printf("%-30s %s\n", a, b); } +	' $* +} + +fn awk3 { +	awk '{line[NR] = $0} +	END{ +		i = 4; +		while(i < NR){ +			what=line[i++]; +			when=line[i]; +			comment=""; +			if(!(when ~ /..\/..\/.... ..:../)){ +				# out of sync +				printf("%s\n", what); +				continue; +			} +			i++; +			if(!(line[i+1] ~ /..\/..\/.... ..:../) && +				(i+2 > NR || line[i+2] ~ /..\/..\/.... 
..:../)){ +				what = what ", " line[i++]; +			} +			printf("%s  %s\n", when, what); +		} +	}' $* +} + +# hget 'http://www.fedex.com/cgi-bin/track_it?airbill_list='$1'&kurrent_airbill='$1'&language=english&cntry_code=us&state=0' | +hget 'http://www.fedex.com/cgi-bin/tracking?action=track&language=english&cntry_code=us&initial=x&mps=y&tracknumbers='$1 | +	htmlfmt >/tmp/fedex.$pid +sed -n '/Tracking number/,/^$/p' /tmp/fedex.$pid | awk2 +echo +sed -n '/Reference number/,/^$/p' /tmp/fedex.$pid | awk2 +echo +sed -n '/Date.time/,/^$/p' /tmp/fedex.$pid | sed 1,4d | fmt -l 4000 | sed 's/ [A-Z][A-Z] /&\n/g' +rm /tmp/fedex.$pid +-- /home/rsc/src/webscript/a3 +#!./o.webscript + +load "http://www.ups.com/WebTracking/track?loc=en_US" +find textbox "InquiryNumber1" +input "1z30557w0340175623" +find next checkbox +input "yes" +find prev form +submit +if(find "Delivery Information"){ +	find outer table +	print +}else if(find "One or more"){ +	print +}else{ +	print "Unexpected results." +	find page +	print +} +-- /home/rsc/src/webscript/a2 +#load "http://apc-reset/outlets.htm" +load "apc.html" +print +print "\n=============\n" +find "yoshimi" +find outer row +find next select +input "Immediate Reboot" +submit +print +-- /usr/local/plan9/acid/port +// portable acid for all architectures + +defn pfl(addr) +{ +	print(pcfile(addr), ":", pcline(addr), "\n"); +} + +defn +notestk(addr) +{ +	local pc, sp; +	complex Ureg addr; + +	pc = addr.pc\X; +	sp = addr.sp\X; + +	print("Note pc:", pc, " sp:", sp, " ", fmt(pc, 'a'), " "); +	pfl(pc); +	_stk({"PC", pc, "SP", sp, linkreg(addr)}, 1); +} + +defn +notelstk(addr) +{ +	local pc, sp; +	complex Ureg addr; + +	pc = addr.pc\X; +	sp = addr.sp\X; + +	print("Note pc:", pc, " sp:", sp, " ", fmt(pc, 'a'), " "); +	pfl(pc); +	_stk({"PC", pc, "SP", sp, linkreg(addr)}, 1); +} + +defn params(param) +{ +	while param do { +		sym = head param; +		print(sym[0], "=", itoa(sym[1], "%#ux")); +		param = tail param; +		if param then +			print (","); +	}	 +} + +stkprefix = ""; +stkignore = {}; +stkend = 0; + +defn locals(l) +{ +	local sym; + +	while l do { +		sym = head l; +		print(stkprefix, "\t", sym[0], "=", itoa(sym[1], "%#ux"), "\n"); +		l = tail l; +	}	 +} + +defn _stkign(frame) +{ +	local file; + +	file = pcfile(frame[0]); +	s = stkignore; +	while s do { +		if regexp(head s, file) then +			return 1; +		s = tail s; +	} +	return 0; +} + +// print a stack trace +// +// in a run of leading frames in files matched by regexps in stkignore, +// only print the last one. 
+defn _stk(regs, dolocals) +{ +	local stk, frame, pc, fn, done, callerpc, paramlist, locallist; + +	stk = strace(regs); +	if stkignore then { +		while stk && tail stk && _stkign(head tail stk) do +			stk = tail stk; +	} + +	callerpc = 0; +	done = 0; +	while stk && !done do { +		frame = head stk; +		stk = tail stk; +		fn = frame[0]; +		pc = frame[1]; +		callerpc = frame[2]; +		paramlist = frame[3]; +		locallist = frame[4]; + +		print(stkprefix, fmt(fn, 'a'), "("); +		params(paramlist); +		print(")"); +		if pc != fn then +			print("+", itoa(pc-fn, "%#ux")); +		print(" "); +		pfl(pc); +		if dolocals then +			locals(locallist); +		if fn == var("threadmain") || fn == var("p9main") then +			done=1; +		if fn == var("threadstart") || fn == var("scheduler") then +			done=1; +		if callerpc == 0 then +			done=1; +	} +	if callerpc && !done then { +		print(stkprefix, fmt(callerpc, 'a'), " "); +		pfl(callerpc); +	} +} + +defn findsrc(file) +{ +	local lst, src; + +	if file[0] == '/' then { +		src = file(file); +		if src != {} then { +			srcfiles = append srcfiles, file; +			srctext = append srctext, src; +			return src; +		} +		return {}; +	} + +	lst = srcpath; +	while head lst do { +		src = file(head lst+file); +		if src != {} then { +			srcfiles = append srcfiles, file; +			srctext = append srctext, src; +			return src; +		} +		lst = tail lst; +	} +} + +defn line(addr) +{ +	local src, file; + +	file = pcfile(addr); +	src = match(file, srcfiles); + +	if src >= 0 then +		src = srctext[src]; +	else +		src = findsrc(file); + +	if src == {} then { +		print("no source for ", file, "\n"); +		return {}; +	} +	line = pcline(addr)-1; +	print(file, ":", src[line], "\n"); +} + +defn addsrcdir(dir) +{ +	dir = dir+"/"; + +	if match(dir, srcpath) >= 0 then { +		print("already in srcpath\n"); +		return {}; +	} + +	srcpath = {dir}+srcpath; +} + +defn source() +{ +	local l; + +	l = srcpath; +	while l do { +		print(head l, "\n"); +		l = tail l; +	} +	l = srcfiles; + +	while l do { +		print("\t", head l, "\n"); +		l = tail l; +	} +} + +defn Bsrc(addr) +{ +	local lst; + +	lst = srcpath; +	file = pcfile(addr); +	if file[0] == '/' && access(file) then { +		rc("B "+file+":"+itoa(pcline(addr))); +		return {}; +	} +	while head lst do { +		name = head lst+file; +		if access(name) then { +			rc("B "+name+":"+itoa(pcline(addr))); +			return {}; +		} +		lst = tail lst; +	} +	print("no source for ", file, "\n"); +} + +defn srcline(addr) +{ +	local text, cline, line, file, src; +	file = pcfile(addr); +	src = match(file,srcfiles); +	if (src>=0) then +		src = srctext[src]; +	else +		src = findsrc(file); +	if (src=={}) then +	{ +		return "(no source)"; +	} +	return src[pcline(addr)-1]; +} + +defn src(addr) +{ +	local src, file, line, cline, text; + +	file = pcfile(addr); +	src = match(file, srcfiles); + +	if src >= 0 then +		src = srctext[src]; +	else +		src = findsrc(file); + +	if src == {} then { +		print("no source for ", file, "\n"); +		return {}; +	} + +	cline = pcline(addr)-1; +	print(file, ":", cline+1, "\n"); +	line = cline-5; +	loop 0,10 do { +		if line >= 0 then { +			if line == cline then +				print(">"); +			else +				print(" "); +			text = src[line]; +			if text == {} then +				return {}; +			print(line+1, "\t", text, "\n"); +		} +		line = line+1; +	}	 +} + +defn step()					// single step the process +{ +	local lst, lpl, addr, bput; + +	bput = 0; +	if match(*PC, bplist) >= 0 then {	// Sitting on a breakpoint +		bput = fmt(*PC, bpfmt); +		*bput = @bput; +	} + +	lst = follow(*PC); + +	lpl = lst; +	while lpl do {				// place 
break points +		*(head lpl) = bpinst; +		lpl = tail lpl; +	} + +	startstop(pid);				// do the step + +	while lst do {				// remove the breakpoints +		addr = fmt(head lst, bpfmt); +		*addr = @addr; +		lst = tail lst; +	} +	if bput != 0 then +		*bput = bpinst; +} + +defn bpset(addr)				// set a breakpoint +{ +	if status(pid) != "Stopped" then { +		print("Waiting...\n"); +		stop(pid); +	} +	if match(addr, bplist) >= 0 then +		print("breakpoint already set at ", fmt(addr, 'a'), "\n"); +	else { +		*fmt(addr, bpfmt) = bpinst; +		bplist = append bplist, addr; +	} +} + +defn bptab()					// print a table of breakpoints +{ +	local lst, addr; + +	lst = bplist; +	while lst do { +		addr = head lst; +		print("\t", fmt(addr, 'X'), " ", fmt(addr, 'a'), "  ", fmt(addr, 'i'), "\n"); +		lst = tail lst; +	} +} + +defn bpdel(addr)				// delete a breakpoint +{ +	local n, pc, nbplist; + +	if addr == 0 then { +		while bplist do { +			pc = head bplist; +			pc = fmt(pc, bpfmt); +			*pc = @pc; +			bplist = tail bplist; +		} +		return {}; +	} + +	n = match(addr, bplist); +	if n < 0  then { +		print("no breakpoint at ", fmt(addr, 'a'), "\n"); +		return {}; +	} + +	addr = fmt(addr, bpfmt); +	*addr = @addr; + +	nbplist = {};				// delete from list +	while bplist do { +		pc = head bplist; +		if pc != addr then +			nbplist = append nbplist, pc; +		bplist = tail bplist; +	} +	bplist = nbplist;			// delete from memory +} + +defn cont()					// continue execution +{ +	local addr; + +	addr = fmt(*PC, bpfmt); +	if match(addr, bplist) >= 0 then {	// Sitting on a breakpoint +		*addr = @addr; +		step();				// Step over +		*addr = bpinst; +	} +	startstop(pid);				// Run +} + +defn stopped(pid)		// called from acid when a process changes state +{ +	pfixstop(pid); +	pstop(pid);		// stub so this is easy to replace +} + +defn procs()			// print status of processes +{ +	local c, lst, cpid; + +	cpid = pid; +	lst = proclist; +	while lst do { +		np = head lst; +		setproc(np); +		if np == cpid then +			c = '>'; +		else +			c = ' '; +		print(fmt(c, 'c'), np, ": ", status(np), " at ", fmt(*PC, 'a'), " setproc(", np, ")\n"); +		lst = tail lst; +	} +	pid = cpid; +	if pid != 0 then +		setproc(pid); +} + +_asmlines = 30; + +defn asm(addr) +{ +	local bound; + +	bound = fnbound(addr); + +	addr = fmt(addr, 'i'); +	loop 1,_asmlines do { +		print(fmt(addr, 'a'), " ", fmt(addr, 'X')); +		print("\t", @addr++, "\n"); +		if bound != {} && addr > bound[1] then { +			lasmaddr = addr; +			return {}; +		} +	} +	lasmaddr = addr; +} + +defn casm() +{ +	asm(lasmaddr); +} + +defn xasm(addr) +{ +	local bound; + +	bound = fnbound(addr); + +	addr = fmt(addr, 'i'); +	loop 1,_asmlines do { +		print(fmt(addr, 'a'), " ", fmt(addr, 'X')); +		print("\t", *addr++, "\n"); +		if bound != {} && addr > bound[1] then { +			lasmaddr = addr; +			return {}; +		} +	} +	lasmaddr = addr; +} + +defn xcasm() +{ +	xasm(lasmaddr); +} + +defn win() +{ +	local npid, estr; + +	bplist = {}; +	notes = {}; + +	estr = "/sys/lib/acid/window '0 0 600 400' "+textfile; +	if progargs != "" then +		estr = estr+" "+progargs; + +	npid = rc(estr); +	npid = atoi(npid); +	if npid == 0 then +		error("win failed to create process"); + +	setproc(npid); +	stopped(npid); +} + +defn win2() +{ +	local npid, estr; + +	bplist = {}; +	notes = {}; + +	estr = "/sys/lib/acid/transcript '0 0 600 400' '100 100 700 500' "+textfile; +	if progargs != "" then +		estr = estr+" "+progargs; + +	npid = rc(estr); +	npid = atoi(npid); +	if npid == 0 then +		error("win failed to create process"); + +	setproc(npid); +	
stopped(npid); +} + +printstopped = 1; +defn new() +{ +	local a; +	 +	bplist = {}; +	newproc(progargs); +	a = var("p9main"); +	if a == {} then +		a = var("main"); +	if a == {} then +		return {}; +	bpset(a); +	while *PC != a do +		cont(); +	bpdel(a); +} + +defn stmnt()			// step one statement +{ +	local line; + +	line = pcline(*PC); +	while 1 do { +		step(); +		if line != pcline(*PC) then { +			src(*PC); +			return {}; +		} +	} +} + +defn func()			// step until we leave the current function +{ +	local bound, end, start, pc; + +	bound = fnbound(*PC); +	if bound == {} then { +		print("cannot locate text symbol\n"); +		return {}; +	} + +	pc = *PC; +	start = bound[0]; +	end = bound[1]; +	while pc >= start && pc < end do { +		step(); +		pc = *PC; +	} +} + +defn next() +{ +	local sp, bound, pc; + +	sp = *SP; +	bound = fnbound(*PC); +	if bound == {} then { +		print("cannot locate text symbol\n"); +		return {}; +	} +	stmnt(); +	pc = *PC; +	if pc >= bound[0] && pc < bound[1] then +		return {}; + +	while (pc < bound[0] || pc > bound[1]) && sp >= *SP do { +		step(); +		pc = *PC; +	} +	src(*PC); +} + +defn maps() +{ +	local m, mm; + +	m = map(); +	while m != {} do { +		mm = head m; +		m = tail m; +		print(mm[2]\X, " ", mm[3]\X, " ", mm[4]\X, " ", mm[0], " ", mm[1], "\n"); +	} +} + +defn dump(addr, n, fmt) +{ +	loop 0, n do { +		print(fmt(addr, 'X'), ": "); +		addr = mem(addr, fmt); +	} +} + +defn mem(addr, fmt) +{ + +	local i, c, n; + +	i = 0; +	while fmt[i] != 0 do { +		c = fmt[i]; +		n = 0; +		while '0' <= fmt[i] && fmt[i] <= '9' do { +			n = 10*n + fmt[i]-'0'; +			i = i+1; +		} +		if n <= 0 then n = 1; +		addr = fmt(addr, fmt[i]); +		while n > 0 do { +			print(*addr++, " "); +			n = n-1; +		} +		i = i+1; +	} +	print("\n"); +	return addr; +} + +defn symbols(pattern) +{ +	local l, s; + +	l = symbols; +	while l do { +		s = head l; +		if regexp(pattern, s[0]) then +			print(s[0], "\t", s[1], "\t", s[2], "\t", s[3], "\n"); +		l = tail l; +	} +} + +defn havesymbol(name) +{ +	local l, s; + +	l = symbols; +	while l do { +		s = head l; +		l = tail l; +		if s[0] == name then +			return 1; +	} +	return 0; +} + +defn spsrch(len) +{ +	local addr, a, s, e; + +	addr = *SP; +	s = origin & 0x7fffffff; +	e = etext & 0x7fffffff; +	loop 1, len do { +		a = *addr++; +		c = a & 0x7fffffff; +		if c > s && c < e then { +			print("src(", a, ")\n"); +			pfl(a); +		}			 +	} +} + +defn acidtypes() +{ +	local syms; +	local l; + +	l = textfile(); +	if l != {} then { +		syms = "acidtypes"; +		while l != {} do { +			syms = syms + " " + ((head l)[0]); +			l = tail l; +		} +		includepipe(syms); +	} +} + +defn getregs() +{ +	local regs, l; + +	regs = {}; +	l = registers; +	while l != {} do { +		regs = append regs, var(l[0]); +		l = tail l; +	} +	return regs; +} + +defn setregs(regs) +{ +	local l; + +	l = registers; +	while l != {} do { +		var(l[0]) = regs[0]; +		l = tail l; +		regs = tail regs; +	} +	return regs; +} + +defn resetregs() +{ +	local l; + +	l = registers; +	while l != {} do { +		var(l[0]) = register(l[0]); +		l = tail l; +	} +} + +defn clearregs() +{ +	local l; + +	l = registers; +	while l != {} do { +		var(l[0]) = refconst(~0); +		l = tail l; +	} +} + +progargs=""; +print(acidfile); + +-- /usr/local/plan9/acid/386 +// 386 support + +defn acidinit()			// Called after all the init modules are loaded +{ +	bplist = {}; +	bpfmt = 'b'; + +	srcpath = { +		"./", +		"/sys/src/libc/port/", +		"/sys/src/libc/9sys/", +		"/sys/src/libc/386/" +	}; + +	srcfiles = {};			// list of loaded files +	srctext = {};			// the text of the files 
+} + +defn linkreg(addr) +{ +	return {}; +} + +defn stk()				// trace +{ +	_stk({"PC", *PC, "SP", *SP}, 0); +} + +defn lstk()				// trace with locals +{ +	_stk({"PC", *PC, "SP", *SP}, 1); +} + +defn gpr()		// print general(hah hah!) purpose registers +{ +	print("AX\t", *AX, " BX\t", *BX, " CX\t", *CX, " DX\t", *DX, "\n"); +	print("DI\t", *DI, " SI\t", *SI, " BP\t", *BP, "\n"); +} + +defn spr()				// print special processor registers +{ +	local pc; +	local cause; + +	pc = *PC; +	print("PC\t", pc, " ", fmt(pc, 'a'), "  "); +	pfl(pc); +	print("SP\t", *SP, " ECODE ", *ECODE, " EFLAG ", *EFLAGS, "\n"); +	print("CS\t", *CS, " DS\t ", *DS, " SS\t", *SS, "\n"); +	print("GS\t", *GS, " FS\t ", *FS, " ES\t", *ES, "\n"); +	 +	cause = *TRAP; +	print("TRAP\t", cause, " ", reason(cause), "\n"); +} + +defn regs()				// print all registers +{ +	spr(); +	gpr(); +} + +defn mmregs() +{ +	print("MM0\t", *MM0, " MM1\t", *MM1, "\n"); +	print("MM2\t", *MM2, " MM3\t", *MM3, "\n"); +	print("MM4\t", *MM4, " MM5\t", *MM5, "\n"); +	print("MM6\t", *MM6, " MM7\t", *MM7, "\n"); +} + +defn pfixstop(pid) +{ +	if *fmt(*PC-1, 'b') == 0xCC then { +		// Linux stops us after the breakpoint, not at it +		*PC = *PC-1; +	} +} + + +defn pstop(pid) +{ +	local l; +	local pc; +	local why; + +	pc = *PC; + +	// FIgure out why we stopped. +	if *fmt(pc, 'b') == 0xCC then { +		why = "breakpoint"; +		 +		// fix up instruction for print; will put back later +		*pc = @pc; +	} else if *(pc-2\x) == 0x80CD then { +		pc = pc-2; +		why = "system call"; +	} else +		why = "stopped"; + +	if printstopped then { +		print(pid,": ", why, "\t"); +		print(fmt(pc, 'a'), "\t", *fmt(pc, 'i'), "\n"); +	} +	 +	if why == "breakpoint" then +		*fmt(pc, bpfmt) = bpinst; +	 +	if printstopped && notes then { +		if notes[0] != "sys: breakpoint" then { +			print("Notes pending:\n"); +			l = notes; +			while l do { +				print("\t", head l, "\n"); +				l = tail l; +			} +		} +	} +} + +aggr Ureg +{ +	'U' 0 di; +	'U' 4 si; +	'U' 8 bp; +	'U' 12 nsp; +	'U' 16 bx; +	'U' 20 dx; +	'U' 24 cx; +	'U' 28 ax; +	'U' 32 gs; +	'U' 36 fs; +	'U' 40 es; +	'U' 44 ds; +	'U' 48 trap; +	'U' 52 ecode; +	'U' 56 pc; +	'U' 60 cs; +	'U' 64 flags; +	{ +	'U' 68 usp; +	'U' 68 sp; +	}; +	'U' 72 ss; +}; + +defn +Ureg(addr) { +	complex Ureg addr; +	print("	di	", addr.di, "\n"); +	print("	si	", addr.si, "\n"); +	print("	bp	", addr.bp, "\n"); +	print("	nsp	", addr.nsp, "\n"); +	print("	bx	", addr.bx, "\n"); +	print("	dx	", addr.dx, "\n"); +	print("	cx	", addr.cx, "\n"); +	print("	ax	", addr.ax, "\n"); +	print("	gs	", addr.gs, "\n"); +	print("	fs	", addr.fs, "\n"); +	print("	es	", addr.es, "\n"); +	print("	ds	", addr.ds, "\n"); +	print("	trap	", addr.trap, "\n"); +	print("	ecode	", addr.ecode, "\n"); +	print("	pc	", addr.pc, "\n"); +	print("	cs	", addr.cs, "\n"); +	print("	flags	", addr.flags, "\n"); +	print("	sp	", addr.sp, "\n"); +	print("	ss	", addr.ss, "\n"); +}; +sizeofUreg = 76; + +aggr Linkdebug +{ +	'X' 0 version; +	'X' 4 map; +}; + +aggr Linkmap +{ +	'X' 0 addr; +	'X' 4 name; +	'X' 8 dynsect; +	'X' 12 next; +	'X' 16 prev; +}; + +defn +linkdebug() +{ +	local a; + +	if !havesymbol("_DYNAMIC") then +		return 0; +	 +	a = _DYNAMIC; +	while *a != 0 do { +		if *a == 21 then // 21 == DT_DEBUG +			return *(a+4); +		a = a+8; +	} +	return 0; +} + +defn +dynamicmap() +{ +	if systype == "linux"  || systype == "freebsd" then { +		local r, m, n; +	 +		r = linkdebug(); +		if r then { +			complex Linkdebug r; +			m = r.map; +			n = 0; +			while m != 0 && n < 100 do { +				complex Linkmap m; +				if m.name && 
*(m.name\b) && access(*(m.name\s)) then +					print("textfile({\"", *(m.name\s), "\", ", m.addr\X, "});\n"); +				m = m.next; +				n = n+1; +			} +		} +	} +} + +defn +acidmap() +{ +//	dynamicmap(); +	acidtypes(); +} + +print(acidfile); diff --git a/web/l2.html b/web/l2.html new file mode 100644 index 0000000..e183d5a --- /dev/null +++ b/web/l2.html @@ -0,0 +1,494 @@ +<html> +<head> +<title>L2</title> +</head> +<body> + +<h1>6.828 Lecture Notes: x86 and PC architecture</h1> + +<h2>Outline</h2> +<ul> +<li>PC architecture +<li>x86 instruction set +<li>gcc calling conventions +<li>PC emulation +</ul> + +<h2>PC architecture</h2> + +<ul> +<li>A full PC has: +  <ul> +  <li>an x86 CPU with registers, execution unit, and memory management +  <li>CPU chip pins include address and data signals +  <li>memory +  <li>disk +  <li>keyboard +  <li>display +  <li>other resources: BIOS ROM, clock, ... +  </ul> + +<li>We will start with the original 16-bit 8086 CPU (1978) +<li>CPU runs instructions: +<pre> +for(;;){ +	run next instruction +} +</pre> + +<li>Needs work space: registers +	<ul> +	<li>four 16-bit data registers: AX, CX, DX, BX +        <li>each in two 8-bit halves, e.g. AH and AL +	<li>very fast, very few +	</ul> +<li>More work space: memory +	<ul> +	<li>CPU sends out address on address lines (wires, one bit per wire) +	<li>Data comes back on data lines +	<li><i>or</i> data is written to data lines +	</ul> + +<li>Add address registers: pointers into memory +	<ul> +	<li>SP - stack pointer +	<li>BP - frame base pointer +	<li>SI - source index +	<li>DI - destination index +	</ul> + +<li>Instructions are in memory too! +	<ul> +	<li>IP - instruction pointer (PC on PDP-11, everything else) +	<li>increment after running each instruction +	<li>can be modified by CALL, RET, JMP, conditional jumps +	</ul> + +<li>Want conditional jumps +	<ul> +	<li>FLAGS - various condition codes +		<ul> +		<li>whether last arithmetic operation overflowed +		<li> ... was positive/negative +		<li> ... was [not] zero +		<li> ... carry/borrow on add/subtract +		<li> ... overflow +		<li> ... etc. +		<li>whether interrupts are enabled +		<li>direction of data copy instructions +		</ul> +	<li>JP, JN, J[N]Z, J[N]C, J[N]O ... +	</ul> + +<li>Still not interesting - need I/O to interact with outside world +	<ul> +	<li>Original PC architecture: use dedicated <i>I/O space</i> +		<ul> +		<li>Works same as memory accesses but set I/O signal +		<li>Only 1024 I/O addresses +		<li>Example: write a byte to line printer: +<pre> +#define DATA_PORT    0x378 +#define STATUS_PORT  0x379 +#define   BUSY 0x80 +#define CONTROL_PORT 0x37A +#define   STROBE 0x01 +void +lpt_putc(int c) +{ +  /* wait for printer to consume previous byte */ +  while((inb(STATUS_PORT) & BUSY) == 0) +    ; + +  /* put the byte on the parallel lines */ +  outb(DATA_PORT, c); + +  /* tell the printer to look at the data */ +  outb(CONTROL_PORT, STROBE); +  outb(CONTROL_PORT, 0); +} +<pre> +		</ul> + +<li>Memory-Mapped I/O +	<ul> +	<li>Use normal physical memory addresses +		<ul> +		<li>Gets around limited size of I/O address space +		<li>No need for special instructions +		<li>System controller routes to appropriate device +		</ul> +	<li>Works like ``magic'' memory: +		<ul> +		<li>	<i>Addressed</i> and <i>accessed</i> like memory, +			but ... +		<li>	... does not <i>behave</i> like memory! 
+		<li>	Reads and writes can have ``side effects'' +		<li>	Read results can change due to external events +		</ul> +	</ul> +</ul> + + +<li>What if we want to use more than 2^16 bytes of memory? +	<ul> +        <li>8086 has 20-bit physical addresses, can have 1 Meg RAM +        <li>each segment is a 2^16 byte window into physical memory +        <li>virtual to physical translation: pa = va + seg*16 +        <li>the segment is usually implicit, from a segment register +	<li>CS - code segment (for fetches via IP) +	<li>SS - stack segment (for load/store via SP and BP) +	<li>DS - data segment (for load/store via other registers) +	<li>ES - another data segment (destination for string operations) +        <li>tricky: can't use the 16-bit address of a stack variable as a pointer +        <li>but a <i>far pointer</i> includes full segment:offset (16 + 16 bits) +	</ul> + +<li>But 8086's 16-bit addresses and data were still painfully small +  <ul> +  <li>80386 added support for 32-bit data and addresses (1985) +  <li>boots in 16-bit mode, boot.S switches to 32-bit mode +  <li>registers are 32 bits wide, called EAX rather than AX +  <li>operands and addresses are also 32 bits, e.g. ADD does 32-bit arithmetic +  <li>prefix 0x66 gets you 16-bit mode: MOVW is really 0x66 MOVW +  <li>the .code32 in boot.S tells assembler to generate 0x66 for e.g. MOVW +  <li>80386 also changed segments and added paged memory... +  </ul> + +</ul> + +<h2>x86 Physical Memory Map</h2> + +<ul> +<li>The physical address space mostly looks like ordinary RAM +<li>Except some low-memory addresses actually refer to other things +<li>Writes to VGA memory appear on the screen +<li>Reset or power-on jumps to ROM at 0x000ffff0 +</ul> + +<pre> ++------------------+  <- 0xFFFFFFFF (4GB) +|      32-bit      | +|  memory mapped   | +|     devices      | +|                  | +/\/\/\/\/\/\/\/\/\/\ + +/\/\/\/\/\/\/\/\/\/\ +|                  | +|      Unused      | +|                  | ++------------------+  <- depends on amount of RAM +|                  | +|                  | +| Extended Memory  | +|                  | +|                  | ++------------------+  <- 0x00100000 (1MB) +|     BIOS ROM     | ++------------------+  <- 0x000F0000 (960KB) +|  16-bit devices, | +|  expansion ROMs  | ++------------------+  <- 0x000C0000 (768KB) +|   VGA Display    | ++------------------+  <- 0x000A0000 (640KB) +|                  | +|    Low Memory    | +|                  | ++------------------+  <- 0x00000000 +</pre> + +<h2>x86 Instruction Set</h2> + +<ul> +<li>Two-operand instruction set +	<ul> +	<li>Intel syntax: <tt>op dst, src</tt> +	<li>AT&T (gcc/gas) syntax: <tt>op src, dst</tt> +		<ul> +		<li>uses b, w, l suffix on instructions to specify size of operands +		</ul> +	<li>Operands are registers, constant, memory via register, memory via constant +	<li>	Examples: +		<table cellspacing=5> +		<tr><td><u>AT&T syntax</u> <td><u>"C"-ish equivalent</u> +		<tr><td>movl %eax, %edx <td>edx = eax; <td><i>register mode</i> +		<tr><td>movl $0x123, %edx <td>edx = 0x123; <td><i>immediate</i> +		<tr><td>movl 0x123, %edx <td>edx = *(int32_t*)0x123; <td><i>direct</i> +		<tr><td>movl (%ebx), %edx <td>edx = *(int32_t*)ebx; <td><i>indirect</i> +		<tr><td>movl 4(%ebx), %edx <td>edx = *(int32_t*)(ebx+4); <td><i>displaced</i> +		</table> +	</ul> + +<li>Instruction classes +	<ul> +	<li>data movement: MOV, PUSH, POP, ... +	<li>arithmetic: TEST, SHL, ADD, AND, ... +	<li>i/o: IN, OUT, ... +	<li>control: JMP, JZ, JNZ, CALL, RET +	<li>string: REP MOVSB, ... 
+	<li>system: IRET, INT +	</ul> + +<li>Intel architecture manual Volume 2 is <i>the</i> reference + +</ul> + +<h2>gcc x86 calling conventions</h2> + +<ul> +<li>x86 dictates that stack grows down: +	<table cellspacing=5> +	<tr><td><u>Example instruction</u> <td><u>What it does</u> +	<tr><td>pushl %eax +		<td> +		subl $4, %esp <br> +		movl %eax, (%esp) <br> +	<tr><td>popl %eax +		<td> +		movl (%esp), %eax <br> +		addl $4, %esp <br> +	<tr><td>call $0x12345 +		<td> +		pushl %eip <sup>(*)</sup> <br> +		movl $0x12345, %eip <sup>(*)</sup> <br> +	<tr><td>ret +		<td> +		popl %eip <sup>(*)</sup> +	</table> +	(*) <i>Not real instructions</i> + +<li>GCC dictates how the stack is used. +	Contract between caller and callee on x86: +	<ul> +	<li>after call instruction: +		<ul> +		<li>%eip points at first instruction of function +		<li>%esp+4 points at first argument +		<li>%esp points at return address +		</ul> +	<li>after ret instruction: +		<ul> +		<li>%eip contains return address +		<li>%esp points at arguments pushed by caller +		<li>called function may have trashed arguments +		<li>%eax contains return value +			(or trash if function is <tt>void</tt>) +		<li>%ecx, %edx may be trashed +		<li>%ebp, %ebx, %esi, %edi must contain contents from time of <tt>call</tt> +		</ul> +	<li>Terminology: +		<ul> +		<li>%eax, %ecx, %edx are "caller save" registers +		<li>%ebp, %ebx, %esi, %edi are "callee save" registers +		</ul> +	</ul> + +<li>Functions can do anything that doesn't violate contract. +	By convention, GCC does more: +	<ul> +	<li>each function has a stack frame marked by %ebp, %esp +		<pre> +		       +------------+   | +		       | arg 2      |   \ +		       +------------+    >- previous function's stack frame +		       | arg 1      |   / +		       +------------+   | +		       | ret %eip   |   / +		       +============+    +		       | saved %ebp |   \ +		%ebp-> +------------+   | +		       |            |   | +		       |   local    |   \ +		       | variables, |    >- current function's stack frame +		       |    etc.    
|   / +		       |            |   | +		       |            |   | +		%esp-> +------------+   / +		</pre> +	<li>%esp can move to make stack frame bigger, smaller +	<li>%ebp points at saved %ebp from previous function, +		chain to walk stack +	<li>function prologue: +		<pre> +			pushl %ebp +			movl %esp, %ebp +		</pre> +	<li>function epilogue: +		<pre> +			movl %ebp, %esp +			popl %ebp +		</pre> +		or +		<pre> +			leave +		</pre> +	</ul> + +<li>Big example: +	<ul> +	<li>C code +		<pre> +		int main(void) { return f(8)+1; } +		int f(int x) { return g(x); } +		int g(int x) { return x+3; } +		</pre> +	<li>assembler +		<pre> +		_main: +					<i>prologue</i> +			pushl %ebp +			movl %esp, %ebp +					<i>body</i> +			pushl $8 +			call _f +			addl $1, %eax +					<i>epilogue</i> +			movl %ebp, %esp +			popl %ebp +			ret +		_f: +					<i>prologue</i> +			pushl %ebp +			movl %esp, %ebp +					<i>body</i> +			pushl 8(%esp) +			call _g +					<i>epilogue</i> +			movl %ebp, %esp +			popl %ebp +			ret + +		_g: +					<i>prologue</i> +			pushl %ebp +			movl %esp, %ebp +					<i>save %ebx</i> +			pushl %ebx +					<i>body</i> +			movl 8(%ebp), %ebx +			addl $3, %ebx +			movl %ebx, %eax +					<i>restore %ebx</i> +			popl %ebx +					<i>epilogue</i> +			movl %ebp, %esp +			popl %ebp +			ret +		</pre> +	</ul> + +<li>Super-small <tt>_g</tt>: +	<pre> +		_g: +			movl 4(%esp), %eax +			addl $3, %eax +			ret +	</pre> + +<li>Compiling, linking, loading: +	<ul> +	<li>	<i>Compiler</i> takes C source code (ASCII text), +		produces assembly language (also ASCII text) +	<li>	<i>Assembler</i> takes assembly language (ASCII text), +		produces <tt>.o</tt> file (binary, machine-readable!) +	<li>	<i>Linker</i> takse multiple '<tt>.o</tt>'s, +		produces a single <i>program image</i> (binary) +	<li>	<i>Loader</i> loads the program image into memory +		at run-time and starts it executing +	</ul> +</ul> + + +<h2>PC emulation</h2> + +<ul> +<li>	Emulator like Bochs works by +	<ul> +	<li>	doing exactly what a real PC would do, +	<li>	only implemented in software rather than hardware! +	</ul> +<li>	Runs as a normal process in a "host" operating system (e.g., Linux) +<li>	Uses normal process storage to hold emulated hardware state: +	e.g., +	<ul> +	<li>	Hold emulated CPU registers in global variables +		<pre> +		int32_t regs[8]; +		#define REG_EAX 1; +		#define REG_EBX 2; +		#define REG_ECX 3; +		... +		int32_t eip; +		int16_t segregs[4]; +		... +		</pre> +	<li>	<tt>malloc</tt> a big chunk of (virtual) process memory +		to hold emulated PC's (physical) memory +	</ul> +<li>	Execute instructions by simulating them in a loop: +	<pre> +	for (;;) { +		read_instruction(); +		switch (decode_instruction_opcode()) { +		case OPCODE_ADD: +			int src = decode_src_reg(); +			int dst = decode_dst_reg(); +			regs[dst] = regs[dst] + regs[src]; +			break; +		case OPCODE_SUB: +			int src = decode_src_reg(); +			int dst = decode_dst_reg(); +			regs[dst] = regs[dst] - regs[src]; +			break; +		... 
+		}
+		eip += instruction_length;
+	}
+	</pre>
+
+<li>	Simulate PC's physical memory map
+	by decoding emulated "physical" addresses just like a PC would:
+	<pre>
+	#define KB		1024
+	#define MB		1024*1024
+
+	#define LOW_MEMORY	640*KB
+	#define EXT_MEMORY	10*MB
+
+	uint8_t low_mem[LOW_MEMORY];
+	uint8_t ext_mem[EXT_MEMORY];
+	uint8_t bios_rom[64*KB];
+
+	uint8_t read_byte(uint32_t phys_addr) {
+		if (phys_addr < LOW_MEMORY)
+			return low_mem[phys_addr];
+		else if (phys_addr >= 960*KB && phys_addr < 1*MB)
+			return bios_rom[phys_addr - 960*KB];
+		else if (phys_addr >= 1*MB && phys_addr < 1*MB+EXT_MEMORY)
+			return ext_mem[phys_addr-1*MB];
+		else ...
+	}
+
+	void write_byte(uint32_t phys_addr, uint8_t val) {
+		if (phys_addr < LOW_MEMORY)
+			low_mem[phys_addr] = val;
+		else if (phys_addr >= 960*KB && phys_addr < 1*MB)
+			; /* ignore attempted write to ROM! */
+		else if (phys_addr >= 1*MB && phys_addr < 1*MB+EXT_MEMORY)
+			ext_mem[phys_addr-1*MB] = val;
+		else ...
+	}
+	</pre>
+<li>	Simulate I/O devices, etc., by detecting accesses to
+	"special" memory and I/O space and emulating the correct behavior:
+	e.g.,
+	<ul>
+	<li>	Reads/writes to emulated hard disk
+		transformed into reads/writes of a file on the host system
+	<li>	Writes to emulated VGA display hardware
+		transformed into drawing into an X window
+	<li>	Reads from emulated PC keyboard
+		transformed into reads from X input event queue
+	</ul>
+</ul>
 diff --git a/web/l3.html b/web/l3.html new file mode 100644 index 0000000..7d6ca0d --- /dev/null +++ b/web/l3.html @@ -0,0 +1,334 @@
+<title>L3</title>
+<html>
+<head>
+</head>
+<body>
+
+<h1>Operating system organization</h1>
+
+<p>Required reading: Exokernel paper.
+
+<h2>Intro: virtualizing</h2>
+
+<p>One way to think about an operating system interface is that it
+extends the hardware instructions with a set of "instructions" that
+are implemented in software. These instructions are invoked using a
+system call instruction (int on the x86).  In this view, a task of the
+operating system is to provide each application with a <i>virtual</i>
+version of the interface; that is, it provides each application with a
+virtual computer.
+
+<p>One of the challenges in an operating system is multiplexing the
+physical resources between the potentially many virtual computers.
+What makes the multiplexing typically complicated is an additional
+constraint: isolate the virtual computers well from each other. That
+is,
+<ul>
+<li> stores shouldn't be able to overwrite other applications' data
+<li> jmp shouldn't be able to enter another application
+<li> one virtual computer cannot hog the processor
+</ul>
+
+<p>In this lecture, we will explore at a high level how to build
+virtual computers that meet these goals.  In the rest of the term we
+work out the details.
+
+<h2>Virtual processors</h2>
+
+<p>To give each application its own set of virtual processors, we need
+to virtualize the physical processors.  One way to do this is to multiplex
+the physical processor over time: the operating system runs one
+application for a while, then runs another application for a while, etc.
+We can implement this solution as follows: when an application has run
+for its share of the processor, unload the state of the physical
+processor, save that state to be able to resume the application later,
+load in the state for the next application, and resume it.
+
+<p>What needs to be saved and restored?
That depends on the
+processor, but for the x86:
+<ul>
+<li>IP
+<li>SP
+<li>The other processor registers (eax, etc.)
+</ul>
+
+<p>To enforce that a virtual processor doesn't keep the physical processor
+forever, the operating system can arrange for a periodic interrupt, and
+switch the processor in the interrupt routine.
+
+<p>To separate the memories of the applications, we may also need to save
+and restore the registers that define the (virtual) memory of the
+application (e.g., segment and MMU registers on the x86), which is
+explained next.
+
+<h2>Separating memories</h2>
+
+<p>Approaches to separating memories:
+<ul>
+<li>Force programs to be written in a high-level, type-safe language
+<li>Enforce separation using hardware support
+</ul>
+The approaches can be combined.
+
+<p>Let's assume unlimited physical memory for a little while. We can
+then enforce separation as follows:
+<ul>
+
+<li>Put a device (a memory management unit) between the processor and memory,
+  which checks each memory access against a set of domain registers.
+  (The domain registers are like segment registers on the x86, except
+  there is no computation to compute an address.)
+<li>A domain register specifies a range of addresses that the
+  processor is allowed to access.
+<li>When switching applications, switch the domain registers.
+</ul>
+Why does this work? Loads/stores/jmps cannot touch/enter other
+applications' domains.
+
+<p>To allow for controlled sharing and separation within an application,
+extend the domain registers with protection bits: read (R), write (W),
+execute-only (X).
+
+<p>How to protect the domain registers? Extend the protection bits
+with a kernel-only one.  When in kernel mode, the processor can change
+the domain registers. As we will see in lecture 4, the x86 stores the U/K
+information in the CPL (current privilege level) in the CS segment
+register.
+
+<p>To change from user to kernel, extend the hardware with special
+instructions for entering a "supervisor" or "system" call, and
+returning from it.  On the x86, these are int and iret. The int instruction
+takes as argument the system call number. We can then think of the kernel
+interface as the set of "instructions" that augment the instructions
+implemented in hardware.
+
+<h2>Memory management</h2>
+
+<p>We assumed unlimited physical memory and big addresses.  In
+practice, the operating system must support creating, shrinking, and
+growing of domains, while still allowing the addresses of an
+application to be contiguous (for programming convenience).  What if
+we want to grow the domain of application 1 but the memory right below
+and above it is in use by application 2?
+
+<p>How? Virtual addresses and spaces.  Virtualize addresses and let
+the kernel control the mapping from virtual to physical.
+
+<p> Address spaces provide each application with the illusion that it has
+a complete memory to itself.  All the addresses it issues are its
+addresses (e.g., each application has an address 0).
+
+<li> How do you give each application its own address space?
+<ul>
+      <li> The MMU translates <i>virtual</i> addresses to <i>physical</i>
+        addresses using a translation table
+      <li> Implementation approaches for the translation table:
+<ol>
+
+<li> for each virtual address store a physical address; too costly.
+
+<li> translate a set of contiguous virtual addresses at a time using
+segments (segment #, base address, length)
+
+<li> translate a fixed-size set of addresses (a page) at a time using a
+page map (page # -> block #) (draw hardware page table picture).
+Datastructures for page map: array, n-level tree, superpages, etc. + +</ol> +<br>Some processor have both 2+3: x86!  (see lecture 4) +</ul> + +<li> What if two applications want to share real memory? Map the pages +into multiple address spaces and have protection bits per page. + +<li> How do you give an application access to a memory-mapped-IO +device? Map the physical address for the device into the applications +address space. + +<li> How do you get off the ground? +<ul> +     <li> when computer starts, MMU is disabled. +     <li> computer starts in kernel mode, with no +       translation (i.e., virtual address 0 is physical address 0, and +       so on) +     <li> kernel program sets up MMU to translate kernel address to physical +     address. often kernel virtual address translates to physical adress 0. +     <li> enable MMU +<br><p>Lab 2 explores this topic in detail. +</ul> + +<h2>Operating system organizations</h2> + +<p>A central theme in operating system design is how to organize the +operating system. It is helpful to define a couple of terms: +<ul> + +<li>Kernel: the program that runs in kernel mode, in a kernel +address space. + +<li>Library: code against which application link (e.g., libc). + +<li>Application: code that runs in a user-level address space. + +<li>Operating system: kernel plus all user-level system code (e.g., +servers, libraries, etc.) + +</ul> + +<p>Example: trace a call to printf made by an application. + +<p>There are roughly 4 operating system designs: +<ul> + +<li>Monolithic design. The OS interface is the kernel interface (i.e., +the complete operating systems runs in kernel mode).  This has limited +flexibility (other than downloadable kernel modules) and doesn't fault +isolate individual OS modules (e.g., the file system and process +module are both in the kernel address space).  xv6 has this +organization. + +<li>Microkernl design.  The kernel interface provides a minimal set of +abstractions (e.g., virtual memory, IPC, and threads), and the rest of +the operating system is implemented by user applications (often called +servers). + +<li>Virtual machine design.  The kernel implements a virtual machine +monitor.  The monitor multiplexes multiple virtual machines, which +each provide as the kernel programming interface the machine platform +(the instruction set, devices, etc.).  Each virtual machine runs its +own, perhaps simple, operating system. + +<li>Exokernel design.  Only used in this class and discussed below. + +</ul> + +<p>Although monolithic operating systems are the dominant operating +system architecture for desktop and server machines, it is worthwhile +to consider alternative architectures, even it is just to understand +operating systems better.  This lecture looks at exokernels, because +that is what you will building in the lab.  xv6 is organized as a +monolithic system, and we will study in the next lectures.  Later in +the term we will read papers about microkernel and virtual machine +operating systems. + +<h2>Exokernels</h2> + +<p>The exokernel architecture takes an end-to-end approach to +operating system design.  In this design, the kernel just securely +multiplexes physical resources; any programmer can decide what the +operating system interface and its implementation are for his +application.  One would expect a couple of popular APIs (e.g., UNIX) +that most applications will link against, but a programmer is always +free to replace that API, partially or completely.  (Draw picture of +JOS.) 
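+<p>To make "securely multiplexes physical resources" concrete, here is a
+small hypothetical sketch (not actual JOS or exokernel code) of the kind
+of check an exokernel might perform before installing a page mapping on
+behalf of an application.  The ownership table and the
+<tt>install_mapping()</tt> helper are assumptions for illustration:
+<pre>
+	#define NPAGES	1024		/* assumed number of physical pages */
+
+	struct page {
+		int owner;		/* environment id, or -1 if free */
+	} pages[NPAGES];
+
+	/* hypothetical call: environment env asks to map physical page ppn at va */
+	int page_map(int env, int ppn, unsigned va) {
+		if (ppn < 0 || ppn >= NPAGES)
+			return -1;			/* no such page */
+		if (pages[ppn].owner != env)
+			return -1;			/* env doesn't own it: deny */
+		return install_mapping(env, ppn, va);	/* assumed MMU helper */
+	}
+</pre>
+<p>Everything else (what to do on failure, how pages are named, and which
+abstractions are layered on top) is left to the library operating system
+that the application links against.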
+ +<p>Compare UNIX interface (<a href="v6.c">v6</a> or <a +href="os10.h">OSX</a>) with the JOS exokernel-like interface: +<pre> +enum +{ +	SYS_cputs = 0, +	SYS_cgetc, +	SYS_getenvid, +	SYS_env_destroy, +	SYS_page_alloc, +	SYS_page_map, +	SYS_page_unmap, +	SYS_exofork, +	SYS_env_set_status, +	SYS_env_set_trapframe, +	SYS_env_set_pgfault_upcall, +	SYS_yield, +	SYS_ipc_try_send, +	SYS_ipc_recv, +}; +</pre> + +<p>To illustrate the differences between these interfaces in more +detail consider implementing the following: +<ul> + +<li>User-level thread package that deals well with blocking system +calls, page faults, etc. + +<li>High-performance web server performing optimizations across module +boundaries (e.g., file system and network stack). + +</ul>  + +<p>How well can each kernel interface implement the above examples? +(Start with UNIX interface and see where you run into problems.)  (The +JOS kernel interface is not flexible enough: for example, +<i>ipc_receive</i> is blocking.) + +<h2>Exokernel paper discussion</h2> + + +<p>The central challenge in an exokernel design it to provide +extensibility, but provide fault isolation.  This challenge breaks +down into three problems: + +<ul> + +<li>tracking owner ship of resources; + +<li>ensuring fault isolation between applications; + +<li>revoking access to resources. + +</ul> + +<ul> + +<li>How is physical memory multiplexed?  Kernel tracks for each +physical page who has it. + +<li>How is the processor multiplexed?  Time slices. + +<li>How is the network multiplexed?  Packet filters. + +<li>What is the plan for revoking resources? +<ul> + +<li>Expose information so that application can do the right thing. + +<li>Ask applications politely to release resources of a given type. + +<li>Ask applications with force to release resources + +</ul> + +<li>What is an environment?  The processor environment: it stores +sufficient information to deliver events to applications: exception +context, interrupt context, protected entry context, and addressing +context.  This structure is processor specific. + +<li>How does on implement a minimal protected control transfer on the +x86?  Lab 4's approach to IPC has some short comings: what are they? +(It is essentially a polling-based solution, and the one you implement +is unfair.) What is a better way?  Set up a specific handler to be +called when an environment wants to call this environment.  How does +this impact scheduling of environments? (i.e., give up time slice or +not?) + +<li>How does one dispatch exceptions (e.g., page fault) to user space +on the x86?  Give each environment a separate exception stack in user +space, and propagate exceptions on that stack. See page-fault handling +in lab 4. + +<li>How does on implement processes in user space? The thread part of +a process is easy.  The difficult part it to perform the copy of the +address space efficiently; one would like to share memory between +parent and child.  This property can be achieved using copy-on-write. +The child should, however, have its own exception stack.  Again, +see lab 4.  <i>sfork</i> is a trivial extension of user-level <i>fork</i>. + +<li>What are the examples of extensibility in this paper? (RPC system +in which server saves and restores registers, different page table, +and stride scheduler.) 
+ +</ul> + +</body> diff --git a/web/l4.html b/web/l4.html new file mode 100644 index 0000000..342af32 --- /dev/null +++ b/web/l4.html @@ -0,0 +1,518 @@ +<title>L4</title> +<html> +<head> +</head> +<body> + +<h1>Address translation and sharing using segments</h1> + +<p>This lecture is about virtual memory, focusing on address +spaces. It is the first lecture out of series of lectures that uses +xv6 as a case study. + +<h2>Address spaces</h2> + +<ul> + +<li>OS: kernel program and user-level programs. For fault isolation +each program runs in a separate address space. The kernel address +spaces is like user address spaces, expect it runs in kernel mode. +The program in kernel mode can execute priviledge instructions (e.g., +writing the kernel's code segment registers). + +<li>One job of kernel is to manage address spaces (creating, growing, +deleting, and switching between them) + +<ul> + +<li>Each address space (including kernel) consists of the binary +    representation for the text of the program, the data part +    part of the program, and the stack area. + +<li>The kernel address space runs the kernel program. In a monolithic +    organization the kernel manages all hardware and provides an API +    to user programs. + +<li>Each user address space contains a program.  A user progam may ask +  to shrink or grow its address space. + +</ul> + +<li>The main operations: +<ul> +<li>Creation.  Allocate physical memory to storage program. Load +program into physical memory. Fill address spaces with references to +physical memory.  +<li>Growing. Allocate physical memory and add it to address space. +<li>Shrinking. Free some of the memory in an address space. +<li>Deletion. Free all memory in an address space. +<li>Switching. Switch the processor to use another address space. +<li>Sharing.  Share a part of an address space with another program. +</ul> +</ul> + +<p>Two main approaches to implementing address spaces: using segments +  and using page tables. Often when one uses segments, one also uses +  page tables.  But not the other way around; i.e., paging without +  segmentation is common. + +<h2>Example support for address spaces: x86</h2> + +<p>For an operating system to provide address spaces and address +translation typically requires support from hardware.  The translation +and checking of permissions typically must happen on each address used +by a program, and it would be too slow to check that in software (if +even possible).  The division of labor is operating system manages +address spaces, and hardware translates addresses and checks +permissions. + +<p>PC block diagram without virtual memory support: +<ul> +<li>physical address +<li>base, IO hole, extended memory +<li>Physical address == what is on CPU's address pins +</ul> + +<p>The x86 starts out in real mode and translation is as follows: +	<ul> +	<li>segment*16+offset ==> physical address +        <li>no protection: program can load anything into seg reg +	</ul> + +<p>The operating system can switch the x86 to protected mode, which +allows the operating system to create address spaces. Translation in +protected mode is as follows: +	<ul> +	<li>selector:offset (logical addr) <br> +	     ==SEGMENTATION==>  +	<li>linear address <br> +	     ==PAGING ==> +	<li>physical address +	</ul> + +<p>Next lecture covers paging; now we focus on segmentation.  + +<p>Protected-mode segmentation works as follows: +<ul> +<li>protected-mode segments add 32-bit addresses and protection +<ul> +<li>wait: what's the point? 
the point of segments in real mode was +  bigger addresses, but 32-bit mode fixes that! +</ul> +<li>segment register holds segment selector +<li>selector indexes into global descriptor table (GDT) +<li>segment descriptor holds 32-bit base, limit, type, protection +<li>la = va + base ; assert(va < limit); +<li>seg register usually implicit in instruction +	<ul> +	<li>DS:REG +		<ul> +		<li><tt>movl $0x1, _flag</tt> +		</ul> +	<li>SS:ESP, SS:EBP +		<ul> +		<li><tt>pushl %ecx, pushl $_i</tt> +		<li><tt>popl %ecx</tt> +		<li><tt>movl 4(%ebp),%eax</tt> +		</ul> +	<li>CS:EIP +		<ul> +		<li>instruction fetch +		</ul> +	<li>String instructions: read from DS:ESI, write to ES:EDI +		<ul> +		<li><tt>rep movsb</tt> +		</ul> +	<li>Exception: far addresses +		<ul> +		<li><tt>ljmp $selector, $offset</tt> +		</ul> +	</ul> +<li>LGDT instruction loads CPU's GDT register +<li>you turn on protected mode by setting PE bit in CR0 register +<li>what happens with the next instruction? CS now has different +  meaning... + +<li>How to transfer from segment to another, perhaps with different +priveleges. +<ul> +<li>Current privilege level (CPL) is in the low 2 bits of CS +<li>CPL=0 is privileged O/S, CPL=3 is user +<li>Within in the same privelege level: ljmp. +<li>Transfer to a segment with more privilege: call gates. +<ul> +<li>a way for app to jump into a segment and acquire privs +<li>CPL must be <= descriptor's DPL in order to read or write segment +<li>call gates can change privelege <b>and</b> switch CS and SS +  segment +<li>call gates are implemented using a special type segment descriptor +  in the GDT. +<li>interrupts are conceptually the same as call gates, but their +  descriptor is stored in the IDT.  We will use interrupts to transfer +  control between user and kernel mode, both in JOS and xv6.  We will +  return to this in the lecture about interrupts and exceptions. +</ul> +</ul> + +<li>What about protection? +<ul> +  <li>can o/s limit what memory an application can read or write? +  <li>app can load any selector into a seg reg... +  <li>but can only mention indices into GDT +  <li>app can't change GDT register (requires privilege) +  <li>why can't app write the descriptors in the GDT? +  <li>what about system calls? how to they transfer to kernel? +  <li>app cannot <b>just</b> lower the CPL +</ul> +</ul> + +<h2>Case study (xv6)</h2> + +<p>xv6 is a reimplementation of <a href="../v6.html">Unix 6th edition</a>. +<ul> +<li>v6 is a version of the orginal Unix operating system for <a href="http://www.pdp11.org/">DEC PDP11</a> +<ul> +       <li>PDP-11 (1972):  +	 <li>16-bit processor, 18-bit physical (40) +	 <li>UNIBUS +	 <li>memory-mapped I/O +	 <li>performance: less than 1MIPS +	 <li>register-to-register transfer: 0.9 usec +	 <li>56k-228k (40) +	 <li>no paging, but some segmentation support +	 <li>interrupts, traps +	 <li>about $10K +         <li>rk disk with 2MByte of storage +	 <li>with cabinet 11/40 is 400lbs +</ul> +       <li>Unix v6 +<ul> +         <li><a href="../reference.html">Unix papers</a>. +         <li>1976; first widely available Unix outside Bell labs +         <li>Thompson and Ritchie +         <li>Influenced by Multics but simpler. +	 <li>complete (used for real work) +	 <li>Multi-user, time-sharing +	 <li>small (43 system calls) +	 <li>modular (composition through pipes; one had to split programs!!) 
+	 <li>compactly written (2 programmers, 9,000 lines of code) +	 <li>advanced UI (shell) +	 <li>introduced C (derived from B) +	 <li>distributed with source +	 <li>V7 was sold by Microsoft for a couple years under the name Xenix +</ul> +       <li>Lion's commentary +<ul> +         <li>surpressed because of copyright issue +	 <li>resurfaced in 1996 +</ul> + +<li>xv6 written for 6.828: +<ul> +         <li>v6 reimplementation for x86 +	 <li>does't include all features of v6 (e.g., xv6 has 20 of 43 +	 system calls). +	 <li>runs on symmetric multiprocessing PCs (SMPs). +</ul> +</ul> + +<p>Newer Unixs have inherited many of the conceptual ideas even though +they added paging, networking, graphics, improve performance, etc. + +<p>You will need to read most of the source code multiple times. Your +goal is to explain every line to yourself. + +<h3>Overview of address spaces in xv6</h3> + +<p>In today's lecture we see how xv6 creates the kernel address + spaces, first user address spaces, and switches to it. To understand + how this happens, we need to understand in detail the state on the + stack too---this may be surprising, but a thread of control and + address space are tightly bundled in xv6, in a concept + called <i>process</i>.  The kernel address space is the only address + space with multiple threads of control. We will study context + switching and process management in detail next weeks; creation of + the first user process (init) will get you a first flavor. + +<p>xv6 uses only the segmentation hardware on xv6, but in a limited +  way. (In JOS you will use page-table hardware too, which we cover in +  next lecture.)  The adddress space layouts are as follows: +<ul> +<li>In kernel address space is set up as follows: +  <pre> +  the code segment runs from 0 to 2^32 and is mapped X and R +  the data segment runs from 0 to 2^32 but is mapped W (read and write). +  </pre> +<li>For each process, the layout is as follows:  +<pre> +  text +  original data and bss +  fixed-size stack +  expandable heap +</pre> +The text of a process is stored in its own segment and the rest in a +data segment.   +</ul> + +<p>xv6 makes minimal use of the segmentation hardware available on the +x86. What other plans could you envision? + +<p>In xv6, each each program has a user and a kernel stack; when the +user program switches to the kernel, it switches to its kernel stack. +Its kernel stack is stored in process's proc structure. (This is +arranged through the descriptors in the IDT, which is covered later.) + +<p>xv6 assumes that there is a lot of physical memory. It assumes that +  segments can be stored contiguously in physical memory and has +  therefore no need for page tables. + +<h3>xv6 kernel address space</h3> + +<p>Let's see how xv6 creates the kernel address space by tracing xv6 +  from when it boots, focussing on address space management: +<ul> +<li>Where does xv6 start after the PC is power on: start (which is +  loaded at physical address 0x7c00; see lab 1). +<li>1025-1033: are we in real mode? +<ul> +<li>how big are logical addresses? +<li>how big are physical addresses? +<li>how are addresses physical calculated? +<li>what segment is being used in subsequent code? +<li>what values are in that segment? +</ul> +<li>1068: what values are loaded in the GDT? +<ul> +<li>1097: gdtr points to gdt +<li>1094: entry 0 unused +<li>1095: entry 1 (X + R, base = 0, limit = 0xffffffff, DPL = 0) +<li>1096: entry 2 (W, base = 0, limit = 0xffffffff, DPL = 0) +<li>are we using segments in a sophisticated way? 
(i.e., controled sharing) +<li>are P and S set? +<li>are addresses translated as in protected mode when lgdt completes? +</ul> +<li>1071: no, and not even here. +<li>1075: far jump, load 8 in CS. from now on we use segment-based translation. +<li>1081-1086: set up other segment registers +<li>1087: where is the stack which is used for procedure calls? +<li>1087: cmain in the bootloader (see lab 1), which calls main0 +<li>1222: main0.   +<ul> +<li>job of main0 is to set everthing up so that all xv6 convtions works +<li>where is the stack?  (sp = 0x7bec) +<li>what is on it? +<pre> +   00007bec [00007bec]  7cda  // return address in cmain +   00007bf0 [00007bf0]  0080  // callee-saved ebx +   00007bf4 [00007bf4]  7369  // callee-saved esi +   00007bf8 [00007bf8]  0000  // callee-saved ebp +   00007bfc [00007bfc]  7c49  // return address for cmain: spin +   00007c00 [00007c00]  c031fcfa  // the instructions from 7c00 (start) +</pre> +</ul> +<li>1239-1240: switch to cpu stack (important for scheduler) +<ul> +<li>why -32? +<li>what values are in ebp and esp? +<pre> +esp: 0x108d30   1084720 +ebp: 0x108d5c   1084764 +</pre> +<li>what is on the stack? +<pre> +   00108d30 [00108d30]  0000 +   00108d34 [00108d34]  0000 +   00108d38 [00108d38]  0000 +   00108d3c [00108d3c]  0000 +   00108d40 [00108d40]  0000 +   00108d44 [00108d44]  0000 +   00108d48 [00108d48]  0000 +   00108d4c [00108d4c]  0000 +   00108d50 [00108d50]  0000 +   00108d54 [00108d54]  0000 +   00108d58 [00108d58]  0000 +   00108d5c [00108d5c]  0000 +   00108d60 [00108d60]  0001 +   00108d64 [00108d64]  0001 +   00108d68 [00108d68]  0000 +   00108d6c [00108d6c]  0000 +</pre> + +<li>what is 1 in 0x108d60?  is it on the stack? + +</ul> + +<li>1242: is it save to reference bcpu?  where is it allocated? + +<li>1260-1270: set up proc[0] + +<ul> +<li>each process has its own stack (see struct proc). + +<li>where is its stack?  (see the section below on physical memory +  management below). + +<li>what is the jmpbuf?  (will discuss in detail later) + +<li>1267: why -4? + +</ul> + +<li>1270: necessar to be able to take interrupts (will discuss in +  detail later) + +<li>1292: what process do you think scheduler() will run?  we will +  study later how that happens, but let's assume it runs process0 on +  process0's stack. +</ul> + +<h3>xv6 user address spaces</h3> + +<ul> +<li>1327: process0   +<ul> +<li>process 0 sets up everything to make process conventions work out + +<li>which stack is process0 running?  see 1260. + +<li>1334: is the convention to release the proc_table_lock after being +  scheduled? (we will discuss locks later; assume there are no other +  processors for now.) + +<li>1336: cwd is current working directory. + +<li>1348: first step in initializing a template tram frame: set +  everything to zero. we are setting up process 0 as if it just +  entered the kernel from user space and wants to go back to user +  space.  (see x86.h to see what field have the value 0.) + +<li>1349: why "|3"?  instead of 0? + +<li>1351: why set interrupt flag in template trapframe? + +<li>1352: where will the user stack be in proc[0]'s address space? + +<li>1353: makes a copy of proc0.  fork() calls copyproc() to implement +  forking a process.  This statement in essense is calling fork inside +  proc0, making a proc[1] a duplicate of proc[0].  proc[0], however, +  has not much in its address space of one page (see 1341). +<ul> +<li>2221: grab a lock on the proc table so that we are the only one +  updating it. +<li>2116: allocate next pid. 
+<li>2228: we got our entry; release the  lock. from now we are only +  modifying our entry. +<li>2120-2127: copy proc[0]'s memory.  proc[1]'s memory will be identical +  to proc[0]'s. +<li>2130-2136: allocate a kernel stack. this stack is different from +  the stack that proc[1] uses when running in user mode. +<li>2139-2140: copy the template trapframe that xv6 had set up in +  proc[0]. +<li>2147: where will proc[1] start running when the scheduler selects +  it? +<li>2151-2155: Unix semantics: child inherits open file descriptors +  from parent. +<li>2158: same for cwd. +</ul> + +<li>1356: load a program in proc[1]'s address space.  the program +  loaded is the binary version of init.c (sheet 16). + +<li>1374: where will proc[1] start? + +<li>1377-1388: copy the binary into proc[1]'s address space.  (you +  will learn about the ELF format in the labs.)   +<ul> +<li>can the binary for init be any size for proc[1] to work correctly? + +<li>what is the layout of proc[1]'s address space? is it consistent +  with the layout described on line 1950-1954? + +</ul> + +<li>1357: make proc[1] runnable so that the scheduler will select it +  to run.  everything is set up now for proc[1] to run, "return" to +  user space, and execute init. + +<li>1359: proc[0] gives up the processor, which calls sleep, which +  calls sched, which setjmps back to scheduler. let's peak a bit in +  scheduler to see what happens next.  (we will return to the +  scheduler in more detail later.) +</ul> +<li>2219: this test will fail for proc[1] +<li>2226: setupsegs(p) sets up the segments for proc[1].  this call is +  more interesting than the previous, so let's see what happens: +<ul> +<li>2032-37: this is for traps and interrupts, which we will cover later. +<li>2039-49: set up new gdt. +<li>2040: why 0x100000 + 64*1024? +<li>2045: why 3?  why is base p->mem? is p->mem physical or logical? +<li>2045-2046: how much the program for proc[1] be compiled if proc[1] +  will run successfully in user space? +<li>2052: we are still running in the kernel, but we are loading gdt. +  is this ok? +<li>why have so few user-level segments?  why not separate out code, +  data, stack, bss, etc.? +</ul> +<li>2227: record that proc[1] is running on the cpu +<li>2228: record it is running instead of just runnable +<li>2229: setjmp to fork_ret. +<li>2282: which stack is proc[1] running on? +<li>2284: when scheduled, first release the proc_table_lock. +<li>2287: back into assembly. +<li>2782: where is the stack pointer pointing to? +<pre> +   0020dfbc [0020dfbc]  0000 +   0020dfc0 [0020dfc0]  0000 +   0020dfc4 [0020dfc4]  0000 +   0020dfc8 [0020dfc8]  0000 +   0020dfcc [0020dfcc]  0000 +   0020dfd0 [0020dfd0]  0000 +   0020dfd4 [0020dfd4]  0000 +   0020dfd8 [0020dfd8]  0000 +   0020dfdc [0020dfdc]  0023 +   0020dfe0 [0020dfe0]  0023 +   0020dfe4 [0020dfe4]  0000 +   0020dfe8 [0020dfe8]  0000 +   0020dfec [0020dfec]  0000 +   0020dff0 [0020dff0]  001b +   0020dff4 [0020dff4]  0200 +   0020dff8 [0020dff8]  1000 +</pre> +<li>2783: why jmp instead of call? +<li>what will iret put in eip? +<li>what is 0x1b?  what will iret put in cs? +<li>after iret, what will the processor being executing? +</ul> + +<h3>Managing physical memory</h3> + +<p>To create an address space we must allocate physical memory, which +  will be freed when an address space is deleted (e.g., when a user +  program terminates).  xv6 implements a first-fit memory allocater +  (see kalloc.c).   + +<p>It maintains a list of ranges of free memory.  
The allocator finds
+  the first range that is larger than the amount of requested memory.
+  It splits that range in two: one range of the size requested and one
+  of the remainder.  It returns the first range.  When memory is
+  freed, kfree will merge ranges that are adjacent in memory.
+
+<p>Under what scenarios is a first-fit memory allocator undesirable?
+
+<h3>Growing an address space</h3>
+
+<p>How can a user process grow its address space?  growproc.
+<ul>
+<li>2064: allocate a new segment of old size plus n
+<li>2067: copy the old segment into the new (ouch!)
+<li>2068: and zero the rest.
+<li>2071: free the old physical memory
+</ul>
+<p>We could do a lot better if segments didn't have to be contiguous in
+  physical memory. How could we arrange that? Using page tables, which
+  is our next topic.  This is one place where page tables would be
+  useful, but there are others too (e.g., in fork).
+</body>
+
+
diff --git a/web/l5.html b/web/l5.html
new file mode 100644
index 0000000..61b55e4
--- /dev/null
+++ b/web/l5.html
@@ -0,0 +1,210 @@
+<title>Lecture 5</title>
+<html>
+<head>
+</head>
+<body>
+
+<h2>Address translation and sharing using page tables</h2>
+
+<p> Reading: <a href="../readings/i386/toc.htm">80386</a> chapters 5 and 6<br>
+
+<p> Handout: <b> x86 address translation diagram</b> -
+<a href="x86_translation.ps">PS</a> -
+<a href="x86_translation.eps">EPS</a> -
+<a href="x86_translation.fig">xfig</a>
+<br>
+
+<p>Why do we care about x86 address translation?
+<ul>
+<li>It can simplify s/w structure by placing data at fixed known addresses.
+<li>It can implement tricks like demand paging and copy-on-write.
+<li>It can isolate programs to contain bugs.
+<li>It can isolate programs to increase security.
+<li>JOS uses paging a lot, and segments more than you might think.
+</ul>
+
+<p>Why aren't protected-mode segments enough?
+<ul>
+<li>Why did the 386 add translation using page tables as well?
+<li>Isn't it enough to give each process its own segments?
+</ul>
+
+<p>Translation using page tables on x86:
+<ul>
+<li>paging hardware maps linear address (la) to physical address (pa)
+<li>(we will often interchange "linear" and "virtual")
+<li>page size is 4096 bytes, so there are 1,048,576 pages in 2^32
+<li>why not just have a big array with each page #'s translation?
+<ul>
+<li>table[20-bit linear page #] => 20-bit phys page #
+</ul>
+<li>386 uses 2-level mapping structure
+<li>one page directory page, with 1024 page directory entries (PDEs)
+<li>up to 1024 page table pages, each with 1024 page table entries (PTEs)
+<li>so la has 10 bits of directory index, 10 bits table index, 12 bits offset
+<li>What's in a PDE or PTE?
+<ul>
+<li>20-bit phys page number, present, read/write, user/supervisor
+</ul>
+<li>cr3 register holds physical address of current page directory
+<li>puzzle: what do PDE read/write and user/supervisor flags mean?
+<li>puzzle: can supervisor read/write user pages?
+
+<li>Here's how the MMU translates an la to a pa:
+
+   <pre>
+   uint
+   translate (uint la, bool user, bool write)
+   {
+     uint pde, pte;
+     pde = read_mem (%CR3 + 4*(la >> 22));
+     access (pde, user, write);
+     pte = read_mem ( (pde & 0xfffff000) + 4*((la >> 12) & 0x3ff));
+     access (pte, user, write);
+     return (pte & 0xfffff000) + (la & 0xfff);
+   }
+
+   // check protection. pxe is a pte or pde.
+   // user is true if CPL==3
+   void
+   access (uint pxe, bool user, bool write)
+   {
+     if (!(pxe & PG_P))
+        => page fault -- page not present
+     if (!(pxe & PG_U) && user)
+        => page fault -- no access for user
+
+     if (write && !(pxe & PG_W))
+       if (user)
+          => page fault -- not writable
+       else if (!(pxe & PG_U))
+         => page fault -- not writable
+       else if (%CR0 & CR0_WP)
+         => page fault -- not writable
+   }
+   </pre>
+
+<li>CPU's TLB caches vpn => ppn mappings
+<li>if you change a PDE or PTE, you must flush the TLB!
+<ul>
+  <li>by re-loading cr3
+</ul>
+<li>turn on paging by setting the PG bit (CR0_PG) of %cr0
+</ul>
+
+Can we use paging to limit what memory an app can read/write?
+<ul>
+<li>user can't modify cr3 (requires privilege)
+<li>is that enough?
+<li>could user modify page tables? after all, they are in memory.
+</ul>
+
+<p>How we will use paging (and segments) in JOS:
+<ul>
+<li>use segments only to switch privilege level into/out of kernel
+<li>use paging to structure process address space
+<li>use paging to limit process memory access to its own address space
+<li>below is the JOS virtual memory map
+<li>why map both kernel and current process? why not 4GB for each?
+<li>why is the kernel at the top?
+<li>why map all of phys mem at the top? i.e. why multiple mappings?
+<li>why map page table a second time at VPT?
+<li>why map page table a third time at UVPT?
+<li>how do we switch mappings for a different process?
+</ul>
+
+<pre>
+    4 Gig -------->  +------------------------------+
+                     |                              | RW/--
+                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+                     :              .               :
+                     :              .               :
+                     :              .               :
+                     |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| RW/--
+                     |                              | RW/--
+                     |   Remapped Physical Memory   | RW/--
+                     |                              | RW/--
+    KERNBASE ----->  +------------------------------+ 0xf0000000
+                     |  Cur. Page Table (Kern. RW)  | RW/--  PTSIZE
+    VPT,KSTACKTOP--> +------------------------------+ 0xefc00000      --+
+                     |         Kernel Stack         | RW/--  KSTKSIZE   |
+                     | - - - - - - - - - - - - - - -|                 PTSIZE
+                     |      Invalid Memory          | --/--             |
+    ULIM     ------> +------------------------------+ 0xef800000      --+
+                     |  Cur. 
Page Table (User R-)   | R-/R-  PTSIZE +    UVPT      ---->  +------------------------------+ 0xef400000 +                     |          RO PAGES            | R-/R-  PTSIZE +    UPAGES    ---->  +------------------------------+ 0xef000000 +                     |           RO ENVS            | R-/R-  PTSIZE + UTOP,UENVS ------>  +------------------------------+ 0xeec00000 + UXSTACKTOP -/       |     User Exception Stack     | RW/RW  PGSIZE +                     +------------------------------+ 0xeebff000 +                     |       Empty Memory           | --/--  PGSIZE +    USTACKTOP  --->  +------------------------------+ 0xeebfe000 +                     |      Normal User Stack       | RW/RW  PGSIZE +                     +------------------------------+ 0xeebfd000 +                     |                              | +                     |                              | +                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +                     .                              . +                     .                              . +                     .                              . +                     |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| +                     |     Program Data & Heap      | +    UTEXT -------->  +------------------------------+ 0x00800000 +    PFTEMP ------->  |       Empty Memory           |        PTSIZE +                     |                              | +    UTEMP -------->  +------------------------------+ 0x00400000 +                     |       Empty Memory           |        PTSIZE +    0 ------------>  +------------------------------+ +</pre> + +<h3>The VPT </h3> + +<p>Remember how the X86 translates virtual addresses into physical ones: + +<p><img src=pagetables.png> + +<p>CR3 points at the page directory.  The PDX part of the address +indexes into the page directory to give you a page table.  The +PTX part indexes into the page table to give you a page, and then +you add the low bits in. + +<p>But the processor has no concept of page directories, page tables, +and pages being anything other than plain memory.  So there's nothing +that says a particular page in memory can't serve as two or three of +these at once.  The processor just follows pointers: + +pd = lcr3(); +pt = *(pd+4*PDX); +page = *(pt+4*PTX); + +<p>Diagramatically, it starts at CR3, follows three arrows, and then stops. + +<p>If we put a pointer into the page directory that points back to itself at +index Z, as in + +<p><img src=vpt.png> + +<p>then when we try to translate a virtual address with PDX and PTX +equal to V, following three arrows leaves us at the page directory. +So that virtual page translates to the page holding the page directory. +In Jos, V is 0x3BD, so the virtual address of the VPD is +(0x3BD<<22)|(0x3BD<<12). + + +<p>Now, if we try to translate a virtual address with PDX = V but an +arbitrary PTX != V, then following three arrows from CR3 ends +one level up from usual (instead of two as in the last case), +which is to say in the page tables.  So the set of virtual pages +with PDX=V form a 4MB region whose page contents, as far +as the processor is concerned, are the page tables themselves. +In Jos, V is 0x3BD so the virtual address of the VPT is (0x3BD<<22). + +<p>So because of the "no-op" arrow we've cleverly inserted into +the page directory, we've mapped the pages being used as +the page directory and page table (which are normally virtually +invisible) into the virtual address space. 
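+<p>In code, this self-mapping trick means software can read its own page
+tables with ordinary loads.  The following is a hypothetical sketch (not
+code from the JOS source), assuming the value V = 0x3BD used above,
+32-bit addresses, and 4-byte PDEs and PTEs:
+<pre>
+	#define V	0x3BD		/* the self-mapping slot, as in the text */
+
+	/* look up the PTE that maps va, using only virtual addresses */
+	unsigned pte_for(unsigned va) {
+		unsigned *vpt = (unsigned *) (V << 22);			/* 4MB window onto the page tables */
+		unsigned *vpd = (unsigned *) ((V << 22) | (V << 12));	/* the page directory itself */
+		unsigned pde = vpd[va >> 22];	/* page directory entry for va */
+		if (!(pde & 0x1))		/* present bit clear: no page table here */
+			return 0;
+		return vpt[va >> 12];		/* the PTE, read through the window */
+	}
+</pre>
+<p>JOS maps this structure twice in the layout above: read-only for user
+code at UVPT and read/write for the kernel at VPT.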
+ +  +</body> diff --git a/web/mkhtml b/web/mkhtml new file mode 100755 index 0000000..74987e6 --- /dev/null +++ b/web/mkhtml @@ -0,0 +1,70 @@ +#!/usr/bin/perl + +my @lines = <>; +my $text = join('', @lines); +my $title; +if($text =~ /^\*\* (.*?)\n/m){ +	$title = $1; +	$text = $` . $'; +}else{ +	$title = "Untitled"; +} + +$text =~ s/[ \t]+$//mg; +$text =~ s/^$/<br><br>/mg; +$text =~ s!\b([a-z0-9]+\.(c|s|pl|h))\b!<a href="src/$1.html">$1</a>!g; +$text =~ s!^(Lecture [0-9]+\. .*?)$!<b><i>$1</i></b>!mg; +$text =~ s!^\* (.*?)$!<h2>$1</h2>!mg; +$text =~ s!((<br>)+\n)+<h2>!\n<h2>!g; +$text =~ s!</h2>\n?((<br>)+\n)+!</h2>\n!g; +$text =~ s!((<br>)+\n)+<b>!\n<br><br><b>!g; +$text =~ s!\b\s*--\s*\b!\–!g; +$text =~ s!\[([^\[\]|]+) \| ([^\[\]]+)\]!<a href="$1">$2</a>!g; +$text =~ s!\[([^ \t]+)\]!<a href="$1">$1</a>!g; + +$text =~ s!``!\“!g; +$text =~ s!''!\”!g; + +print <<EOF; +<!-- AUTOMATICALLY GENERATED: EDIT the .txt version, not the .html version --> +<html> +<head> +<title>$title</title> +<style type="text/css"><!-- +body { +	background-color: white; +	color: black; +	font-size: medium; +	line-height: 1.2em; +	margin-left: 0.5in; +	margin-right: 0.5in; +	margin-top: 0; +	margin-bottom: 0; +} + +h1 { +	text-indent: 0in; +	text-align: left; +	margin-top: 2em; +	font-weight: bold; +	font-size: 1.4em; +} + +h2 { +	text-indent: 0in; +	text-align: left; +	margin-top: 2em; +	font-weight: bold; +	font-size: 1.2em; +} +--></style> +</head> +<body bgcolor=#ffffff> +<h1>$title</h1> +<br><br> +EOF +print $text; +print <<EOF; +</body> +</html> +EOF diff --git a/web/x86-intr.html b/web/x86-intr.html new file mode 100644 index 0000000..0369e25 --- /dev/null +++ b/web/x86-intr.html @@ -0,0 +1,53 @@ +<title>Homework: xv6 and Interrupts and Exceptions</title> +<html> +<head> +</head> +<body> + +<h1>Homework: xv6 and Interrupts and Exceptions</h1> + +<p> +<b>Read</b>: xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S. Skim +lapic.c, ioapic.c, and picirq.c + +<p> +<b>Hand-In Procedure</b> +<p> +You are to turn in this homework during lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member at the beginning of the lecture. +<p> + +<b>Introduction</b> + +<p>Try to understand  +xv6's trapasm.S, trap.c, syscall.c, vectors.S, and usys.S. Skim +  You will need to consult: + +<p>Chapter 5 of <a href="../readings/ia32/IA32-3.pdf">IA-32 Intel +Architecture Software Developer's Manual, Volume 3: System programming +guide</a>; you can skip sections 5.7.1, 5.8.2, and 5.12.2. Be aware +that terms such as exceptions, traps, interrupts, faults and aborts +have no standard meaning. + +<p>Chapter 9 of the 1987 <a href="../readings/i386/toc.htm">i386 +Programmer's Reference Manual</a> also covers exception and interrupt +handling in IA32 processors. + +<p><b>Assignment</b>:  + +In xv6, set a breakpoint at the beginning of <code>syscall()</code> to +catch the very first system call.  What values are on the stack at +this point?  Turn in the output of <code>print-stack 35</code> at that +breakpoint with each value labeled as to what it is (e.g., +saved <code>%ebp</code> for <code>trap</code>, +<code>trapframe.eip</code>, etc.). 
+<p> +<b>This completes the homework.</b> + +</body> + + + + + diff --git a/web/x86-intro.html b/web/x86-intro.html new file mode 100644 index 0000000..323d92e --- /dev/null +++ b/web/x86-intro.html @@ -0,0 +1,18 @@ +<title>Homework: Intro to x86 and PC</title> +<html> +<head> +</head> +<body> + +<h1>Homework: Intro to x86 and PC</h1> + +<p>Today's lecture is an introduction to the x86 and the PC, the +platform for which you will write an operating system.  The assigned +book is a reference for x86 assembly programming of which you will do +some. + +<p><b>Assignment</b> Make sure to do exercise 1 of lab 1 before +coming to lecture. + +</body> + diff --git a/web/x86-mmu.html b/web/x86-mmu.html new file mode 100644 index 0000000..a83ff26 --- /dev/null +++ b/web/x86-mmu.html @@ -0,0 +1,33 @@ +<title>Homework: x86 MMU</title> +<html> +<head> +</head> +<body> + +<h1>Homework: x86 MMU</h1> + +<p>Read chapters 5 and 6 of +<a href="../readings/i386/toc.htm">Intel 80386 Reference Manual</a>. +These chapters explain +the x86 Memory Management Unit (MMU), +which we will cover in lecture today and which you need +to understand in order to do lab 2. + +<p> +<b>Read</b>: bootasm.S and setupsegs() in proc.c + +<p> +<b>Hand-In Procedure</b> +<p> +You are to turn in this homework during lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member by the beginning of lecture. +<p> + +<p><b>Assignment</b>: Try to understand setupsegs() in proc.c. +  What values are written into <code>gdt[SEG_UCODE]</code> +  and <code>gdt[SEG_UDATA]</code> for init, the first user-space +  process? +  (You can use Bochs to answer this question.) + +</body> diff --git a/web/x86-mmu1.pdf b/web/x86-mmu1.pdfBinary files differ new file mode 100644 index 0000000..e7103e7 --- /dev/null +++ b/web/x86-mmu1.pdf diff --git a/web/x86-mmu2.pdf b/web/x86-mmu2.pdf new file mode 100644 index 0000000..e548148 --- /dev/null +++ b/web/x86-mmu2.pdf @@ -0,0 +1,55 @@ +%PDF-1.4
ֈQl6/GHt6I{kB7|`|}:tIG[U j߸ l`~69+:{pޯQc	CJ*?]k魯i#2OB}V~AQ<A<D$ ܠ?&Fi +"""""4"" Ј4&" diff --git a/web/xv6-disk.html b/web/xv6-disk.html new file mode 100644 index 0000000..65bcf8f --- /dev/null +++ b/web/xv6-disk.html @@ -0,0 +1,63 @@ +<html> +<head> +<title>Homework: Files and Disk I/O</title> +</head> +<body> + +<h1>Homework: Files and Disk I/O</h1> + +<p> +<b>Read</b>: bio.c, fd.c, fs.c, and ide.c + +<p> +This homework should be turned in at the beginning of lecture. + +<p> +<b>File and Disk I/O</b> + +<p>Insert a print statement in bwrite so that you get a +print every time a block is written to disk: + +<pre> +  cprintf("bwrite sector %d\n", sector); +</pre> + +<p>Build and boot a new kernel and run these three commands at the shell: +<pre> +  echo >a +  echo >a +  rm a +  mkdir d +</pre> + +(You can try <tt>rm d</tt> if you are curious, but it should look +almost identical to <tt>rm a</tt>.) + +<p>You should see a sequence of bwrite prints after running each command. +Record the list and annotate it with the calling function and +what block is being written. +For example, this is the <i>second</i> <tt>echo >a</tt>: + +<pre> +$ echo >a +bwrite sector 121  # writei  (data block) +bwrite sector 3    # iupdate (inode block) +$  +</pre> + +<p>Hint: the easiest way to get the name of the +calling function is to add a string argument to bwrite, +edit all the calls to bwrite to pass the name of the +calling function, and just print it. +You should be able to reason about what kind of +block is being written just from the calling function. + +<p>You need not write the following up, but try to +understand why each write is happening.  This will +help your understanding of the file system layout +and the code. + +<p> +<b>This completes the homework.</b> + +</body> diff --git a/web/xv6-intro.html b/web/xv6-intro.html new file mode 100644 index 0000000..3669866 --- /dev/null +++ b/web/xv6-intro.html @@ -0,0 +1,163 @@ +<title>Homework: intro to xv6</title> +<html> +<head> +</head> +<body> + +<h1>Homework: intro to xv6</h1> + +<p>This lecture is the introduction to xv6, our re-implementation of +  Unix v6.  Read the source code in the assigned files. You won't have +  to understand the details yet; we will focus on how the first +  user-level process comes into existence after the computer is turned +  on. +<p> + +<b>Hand-In Procedure</b> +<p> +You are to turn in this homework during lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member at the beginning of lecture. +<p> + +<p><b>Assignment</b>:  +<br> +Fetch and un-tar the xv6 source: + +<pre> +sh-3.00$ wget http://pdos.csail.mit.edu/6.828/2007/src/xv6-rev1.tar.gz  +sh-3.00$ tar xzvf xv6-rev1.tar.gz +xv6/ +xv6/asm.h +xv6/bio.c +xv6/bootasm.S +xv6/bootmain.c +... +$ +</pre> + +Build xv6: +<pre> +$ cd xv6 +$ make +gcc -O -nostdinc -I. -c bootmain.c +gcc -nostdinc -I. -c bootasm.S +ld -N -e start -Ttext 0x7C00 -o bootblock.o bootasm.o bootmain.o +objdump -S bootblock.o > bootblock.asm +objcopy -S -O binary bootblock.o bootblock +... +$  +</pre> + +Find the address of the <code>main</code> function by +looking in <code>kernel.asm</code>: +<pre> +% grep main kernel.asm +... +00102454 <mpmain>: +mpmain(void) +001024d0 <main>: +  10250d:       79 f1                   jns    102500 <main+0x30> +  1025f3:       76 6f                   jbe    102664 <main+0x194> +  102611:       74 2f                   je     102642 <main+0x172> +</pre> +In this case, the address is <code>001024d0</code>. 
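+<p>
+With the address found above, the Bochs breakpoint command used in the
+next step would look something like this (your address will likely
+differ, so substitute the one from your own kernel.asm):
+<pre>
+<bochs> vb 0x8:0x001024d0
+</pre>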
+<p> + +Run the kernel inside Bochs, setting a breakpoint +at the beginning of <code>main</code> (i.e., the address +you just found). +<pre> +$ make bochs +if [ ! -e .bochsrc ]; then ln -s dot-bochsrc .bochsrc; fi +bochs -q +======================================================================== +                       Bochs x86 Emulator 2.2.6 +                    (6.828 distribution release 1) +======================================================================== +00000000000i[     ] reading configuration from .bochsrc +00000000000i[     ] installing x module as the Bochs GUI +00000000000i[     ] Warning: no rc file specified. +00000000000i[     ] using log file bochsout.txt +Next at t=0 +(0) [0xfffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b         ; ea5be000f0 +(1) [0xfffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b         ; ea5be000f0 +<bochs>  +</pre> + +Look at the registers and the stack contents: + +<pre> +<bochs> info reg +... +<bochs> print-stack +... +<bochs> +</pre> + +Which part of the stack printout is actually the stack? +(Hint: not all of it.)  Identify all the non-zero values +on the stack.<p> + +<b>Turn in:</b> the output of print-stack with  +the valid part of the stack marked.  Write a short (3-5 word) +comment next to each non-zero value explaining what it is. +<p> + +Now look at kernel.asm for the instructions in main that read: +<pre> +  10251e:       8b 15 00 78 10 00       mov    0x107800,%edx +  102524:       8d 04 92                lea    (%edx,%edx,4),%eax +  102527:       8d 04 42                lea    (%edx,%eax,2),%eax +  10252a:       c1 e0 04                shl    $0x4,%eax +  10252d:       01 d0                   add    %edx,%eax +  10252f:       8d 04 85 1c ad 10 00    lea    0x10ad1c(,%eax,4),%eax +  102536:       89 c4                   mov    %eax,%esp +</pre> +(The addresses and constants might be different on your system, +and the compiler might use <code>imul</code> instead of the <code>lea,lea,shl,add,lea</code> sequence. +Look for the move into <code>%esp</code>). +<p> + +Which lines in <code>main.c</code> do these instructions correspond to? +<p> + +Set a breakpoint at the first of those instructions +and let the program run until the breakpoint: +<pre> +<bochs> vb 0x8:0x10251e +<bochs> s +... +<bochs> c +(0) Breakpoint 2, 0x0010251e (0x0008:0x0010251e) +Next at t=1157430 +(0) [0x0010251e] 0008:0x0010251e (unk. ctxt): mov edx, dword ptr ds:0x107800 ; 8b1500781000 +(1) [0xfffffff0] f000:fff0 (unk. ctxt): jmp far f000:e05b         ; ea5be000f0 +<bochs>  +</pre> +(The first <code>s</code> command is necessary +to single-step past the breakpoint at main, otherwise <code>c</code> +will not make any progress.) +<p> + +Inspect the registers and stack again +(<code>info reg</code> and <code>print-stack</code>). +Then step past those seven instructions +(<code>s 7</code>) +and inspect them again. +Convince yourself that the stack has changed correctly. +<p> + +<b>Turn in:</b> answers to the following questions. +Look at the assembly for the call to  +<code>lapic_init</code> that occurs after the +the stack switch.  Where does the  +<code>bcpu</code> argument come from? +What would have happened if <code>main</code> +stored <code>bcpu</code> +on the stack before those four assembly instructions? +Would the code still work?  Why or why not? 
+<p> + +</body> +</html> diff --git a/web/xv6-lock.html b/web/xv6-lock.html new file mode 100644 index 0000000..887022a --- /dev/null +++ b/web/xv6-lock.html @@ -0,0 +1,100 @@ +<title>Homework: Locking</title> +<html> +<head> +</head> +<body> + +<h1>Homework: Locking</h1> + + +<p> +<b>Read</b>: spinlock.c + +<p> +<b>Hand-In Procedure</b> +<p> +You are to turn in this homework at the beginning of lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member at the beginning of lecture. +<p> + +<b>Assignment</b>: +In this assignment we will explore some of the interaction +between interrupts and locking. +<p> + +Make sure you understand what would happen if the kernel executed +the following code snippet: +<pre> +  struct spinlock lk; +  initlock(&lk, "test lock"); +  acquire(&lk); +  acquire(&lk); +</pre> +(Feel free to use Bochs to find out.  <code>acquire</code> is in <code>spinlock.c</code>.) +<p> + +An <code>acquire</code> ensures interrupts are off +on the local processor using <code>cli</code>, +and interrupts remain off until the <code>release</code> +of the last lock held by that processor +(at which point they are enabled using <code>sti</code>). +<p> + +Let's see what happens if we turn on interrupts while +holding the <code>ide</code> lock. +In <code>ide_rw</code> in <code>ide.c</code>, add a call +to <code>sti()</code> after the <code>acquire()</code>. +Rebuild the kernel and boot it in Bochs. +Chances are the kernel will panic soon after boot; try booting Bochs a few times +if it doesn't. +<p> + +<b>Turn in</b>: explain in a few sentences why the kernel panicked. +You may find it useful to look up the stack trace +(the sequence of <code>%eip</code> values printed by <code>panic</code>) +in the <code>kernel.asm</code> listing. +<p> + +Remove the <code>sti()</code> you added, +rebuild the kernel, and make sure it works again. +<p> + +Now let's see what happens if we turn on interrupts +while holding the <code>kalloc_lock</code>. +In <code>kalloc()</code> in <code>kalloc.c</code>, add +a call to <code>sti()</code> after the call to <code>acquire()</code>. +You will also need to add  +<code>#include "x86.h"</code> at the top of the file after +the other <code>#include</code> lines. +Rebuild the kernel and boot it in Bochs. +It will not panic. +<p> + +<b>Turn in</b>: explain in a few sentences why the kernel didn't panic. +What is different about <code>kalloc_lock</code> +as compared to <code>ide_lock</code>? +<p> +You do not need to understand anything about the details of the IDE hardware +to answer this question, but you may find it helpful to look  +at which functions acquire each lock, and then at when those +functions get called. +<p> + +(There is a very small but non-zero chance that the kernel will panic +with the extra <code>sti()</code> in <code>kalloc</code>. +If the kernel <i>does</i> panic, make doubly sure that +you removed the <code>sti()</code> call from  +<code>ide_rw</code>.  If it continues to panic and the +only extra <code>sti()</code> is in <code>bio.c</code>, +then mail <i>6.828-staff@pdos.csail.mit.edu</i> +and think about buying a lottery ticket.) +<p> + +<b>Turn in</b>: Why does <code>release()</code> clear +<code>lock->pcs[0]</code> and <code>lock->cpu</code> +<i>before</i> clearing <code>lock->locked</code>? +Why not wait until after?  
+ +</body> +</html> diff --git a/web/xv6-names.html b/web/xv6-names.html new file mode 100644 index 0000000..926be3a --- /dev/null +++ b/web/xv6-names.html @@ -0,0 +1,78 @@ +<html> +<head> +<title>Homework: Naming</title> +</head> +<body> + +<h1>Homework: Naming</h1> + +<p> +<b>Read</b>: namei in fs.c, fd.c, sysfile.c + +<p> +This homework should be turned in at the beginning of lecture. + +<p> +<b>Symbolic Links</b> + +<p> +As you read namei and explore its varied uses throughout xv6, +think about what steps would be required to add symbolic links +to xv6. +A symbolic link is simply a file with a special type (e.g., T_SYMLINK +instead of T_FILE or T_DIR) whose contents contain the path being +linked to. + +<p> +Turn in a short writeup of how you would change xv6 to support  +symlinks.  List the functions that would have to be added or changed, +with short descriptions of the new functionality or changes. + +<p> +<b>This completes the homework.</b> + +<p> +The following is <i>not required</i>.  If you want to try implementing +symbolic links in xv6, here are the files that the course staff +had to change to implement them: + +<pre> +fs.c: 20 lines added, 4 modified +syscall.c: 2 lines added +syscall.h: 1 line added +sysfile.c: 15 lines added +user.h: 1 line added +usys.S: 1 line added +</pre> + +Also, here is an <i>ln</i> program: + +<pre> +#include "types.h" +#include "user.h" + +int +main(int argc, char *argv[]) +{ +  int (*ln)(char*, char*); +   +  ln = link; +  if(argc > 1 && strcmp(argv[1], "-s") == 0){ +    ln = symlink; +    argc--; +    argv++; +  } +   +  if(argc != 3){ +    printf(2, "usage: ln [-s] old new (%d)\n", argc); +    exit(); +  } +  if(ln(argv[1], argv[2]) < 0){ +    printf(2, "%s failed\n", ln == symlink ? "symlink" : "link"); +    exit(); +  } +  exit(); +} +</pre> + +</body> diff --git a/web/xv6-sched.html b/web/xv6-sched.html new file mode 100644 index 0000000..f8b8b31 --- /dev/null +++ b/web/xv6-sched.html @@ -0,0 +1,96 @@ +<title>Homework: Threads and Context Switching</title> +<html> +<head> +</head> +<body> + +<h1>Homework: Threads and Context Switching</h1> + +<p> +<b>Read</b>: swtch.S and proc.c (focus on the code that switches +between processes, specifically <code>scheduler</code> and <code>sched</code>). + +<p> +<b>Hand-In Procedure</b> +<p> +You are to turn in this homework during lecture. Please +write up your answers to the exercises below and hand them in to a +6.828 staff member at the beginning of lecture. +<p> +<b>Introduction</b> + +<p> +In this homework you will investigate how the kernel switches between +two processes.  + +<p> +<b>Assignment</b>: +<p> + +Suppose a process that is running in the kernel +calls <code>sched()</code>, which ends up jumping +into <code>scheduler()</code>. + +<p> +<b>Turn in</b>:  +Where is the stack that <code>sched()</code> executes on? + +<p> +<b>Turn in</b>:  +Where is the stack that <code>scheduler()</code> executes on? + +<p> +<b>Turn in:</b> +When <code>sched()</code> calls <code>swtch()</code>, +does that call to <code>swtch()</code> ever return? If so, when? + +<p> +<b>Turn in:</b> +Why does <code>swtch()</code> copy %eip from the stack into the +context structure, only to copy it from the context +structure to the same place on the stack +when the process is re-activated? +What would go wrong if <code>swtch()</code> just left the +%eip on the stack and didn't store it in the context structure? 
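+<p>
+
+(For orientation, and <i>not required</i>: <code>swtch()</code> saves into and
+restores from a <code>struct context</code> of roughly the shape sketched
+below; the snippets that follow pass one for the CPU's scheduler and one for
+the process.  The field list and ordering shown here are assumptions made for
+illustration; the authoritative definition is in <code>proc.h</code>, and the
+actual save/restore code is in <code>swtch.S</code>.)
+<pre>
+// Assumed sketch of a saved context (illustrative field names/order only).
+struct context {
+  int eip;   // saved instruction pointer: where execution resumes
+  int esp;   // saved stack pointer: which kernel stack to resume on
+  int ebx;   // callee-saved registers that must survive the switch
+  int esi;
+  int edi;
+  int ebp;
+};
+
+// Conceptually, swtch(old, new):
+//   1. saves the callee-saved registers, %esp, and the return %eip into *old,
+//   2. loads the same fields from *new,
+//   3. continues at new->eip on new->esp.
+void swtch(struct context *old, struct context *new);  // implemented in swtch.S
+</pre>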
+
+<p>
+Surround the call to <code>swtch()</code> in <code>scheduler()</code> with calls
+to <code>cons_putc()</code> like this:
+<pre>
+      cons_putc('a');
+      swtch(&cpus[cpu()].context, &p->context);
+      cons_putc('b');
+</pre>
+<p>
+Similarly,
+surround the call to <code>swtch()</code> in <code>sched()</code> with calls
+to <code>cons_putc()</code> like this:
+
+<pre>
+  cons_putc('c');
+  swtch(&cp->context, &cpus[cpu()].context);
+  cons_putc('d');
+</pre>
+<p>
+Rebuild your kernel and boot it in Bochs.
+With a few exceptions
+you should see a regular four-character pattern repeated over and over.
+<p>
+<b>Turn in</b>: What is the four-character pattern?
+<p>
+<b>Turn in</b>: The very first characters are <code>ac</code>. Why does
+this happen?
+<p>
+<b>Turn in</b>: Near the start of the last line you should see
+<code>bc</code>. How could this happen?
+
+<p>
+<b>This completes the homework.</b>
+
+</body>
+
+
+
+
+
+
diff --git a/web/xv6-sleep.html b/web/xv6-sleep.html
new file mode 100644
index 0000000..e712a40
--- /dev/null
+++ b/web/xv6-sleep.html
@@ -0,0 +1,100 @@
+<title>Homework: sleep and wakeup</title>
+<html>
+<head>
+</head>
+<body>
+
+<h1>Homework: sleep and wakeup</h1>
+
+<p>
+<b>Read</b>: pipe.c
+
+<p>
+<b>Hand-In Procedure</b>
+<p>
+You are to turn in this homework at the beginning of lecture. Please
+write up your answers to the questions below and hand them in to a
+6.828 staff member at the beginning of lecture.
+<p>
+<b>Introduction</b>
+<p>
+
+Recall that in lecture 7 we discussed locking a linked list implementation.
+The insert code was:
+
+<pre>
+        struct list *l;
+        l = list_alloc();
+        l->next = list_head;
+        list_head = l;
+</pre>
+
+and if we run the insert on multiple processors simultaneously with no locking,
+this ordering of instructions can cause one of the inserts to be lost:
+
+<pre>
+        CPU1                           CPU2
+        
+        struct list *l;
+        l = list_alloc();
+        l->next = list_head;
+                                       struct list *l;
+                                       l = list_alloc();
+                                       l->next = list_head;
+                                       list_head = l;
+        list_head = l;
+</pre>
+
+(Even though the instructions can happen simultaneously, we
+write out orderings where only one CPU is "executing" at a time,
+to avoid complicating things more than necessary.)
+<p>
+
+In this case, the list element allocated by CPU2 is lost from
+the list by CPU1's update of list_head.
+Adding a lock that protects the final two instructions makes
+the read and write of list_head atomic, so that this
+ordering is impossible.
+<p>
+
+The reading for this lecture is the implementation of sleep and wakeup,
+which are used for coordination between different processes executing
+in the kernel, perhaps simultaneously.
+<p>
+
+If there were no locking at all in sleep and wakeup, it would be
+possible for a sleep and its corresponding wakeup, executing
+simultaneously on different processors, to miss each other,
+so that the wakeup finds no process to wake up and yet the
+process calling sleep goes to sleep anyway, never to wake.  Obviously this is
+something we'd like to avoid.
+<p>
+
+Read the code with this in mind.
+
+<p>
+<br><br>
+<b>Questions</b>
+<p>
+(Answer and hand in.)
+<p>
+
+1.  How does the proc_table_lock help avoid this problem?  Give an
+ordering of instructions (like the above example for linked list
+insertion)
+that could result in a wakeup being missed if the proc_table_lock were not used.
+You need only include the relevant lines of code.
+<p>
+
+2.  sleep is also protected by a second lock, its second argument,
+which need not be the proc_table_lock.  Look at the example in ide.c,
+which uses the ide_lock.  Give an ordering of instructions that could
+result in a wakeup being missed if the ide_lock were not being used.
+(Hint: this should not be the same as your answer to question 1.  The
+two locks serve different purposes.)<p>
+
+<br><br>
+<b>This completes the homework.</b>
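+
+<p>
+(For reference, and <i>not required</i>: question 2 refers to the usage
+pattern sketched below, in which the caller holds a lock protecting the
+condition it is waiting for and passes that lock as the second argument
+to <code>sleep</code>.  The names <code>r</code>, <code>request_done()</code>,
+and <code>mark_done()</code> are hypothetical placeholders, not xv6 functions;
+the real example to study is in <code>ide.c</code>.)
+<pre>
+  // Waiting side: hold the lock, re-check the condition, sleep atomically.
+  acquire(&ide_lock);
+  while(!request_done(r))     // hypothetical test on some request r
+    sleep(r, &ide_lock);      // gives up ide_lock while asleep, reacquires on wakeup
+  release(&ide_lock);
+
+  // Waking side (e.g., the interrupt handler): hold the same lock, then wakeup.
+  acquire(&ide_lock);
+  mark_done(r);               // hypothetical: record that the request finished
+  wakeup(r);
+  release(&ide_lock);
+</pre>
+
+</body>
+ | 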
