diff options
Diffstat (limited to 'web/l-fs.html')
| -rw-r--r-- | web/l-fs.html | 222 | 
1 files changed, 222 insertions, 0 deletions
| diff --git a/web/l-fs.html b/web/l-fs.html new file mode 100644 index 0000000..ed911fc --- /dev/null +++ b/web/l-fs.html @@ -0,0 +1,222 @@ +<title>L10</title> +<html> +<head> +</head> +<body> + +<h1>File systems</h1> + +<p>Required reading: iread, iwrite, and wdir, and code related to +  these calls in fs.c, bio.c, ide.c, file.c, and sysfile.c + +<h2>Overview</h2> + +<p>The next 3 lectures are about file systems: +<ul> +<li>Basic file system implementation +<li>Naming +<li>Performance +</ul> + +<p>Users desire to store their data durable so that data survives when +the user turns of his computer.  The primary media for doing so are: +magnetic disks, flash memory, and tapes.  We focus on magnetic disks +(e.g., through the IDE interface in xv6). + +<p>To allow users to remember where they stored a file, they can +assign a symbolic name to a file, which appears in a directory. + +<p>The data in a file can be organized in a structured way or not. +The structured variant is often called a database.  UNIX uses the +unstructured variant: files are streams of bytes.  Any particular +structure is likely to be useful to only a small class of +applications, and other applications will have to work hard to fit +their data into one of the pre-defined structures. Besides, if you +want structure, you can easily write a user-mode library program that +imposes that format on any file.  The end-to-end argument in action. +(Databases have special requirements and support an important class of +applications, and thus have a specialized plan.) + +<p>The API for a minimal file system consists of: open, read, write, +seek, close, and stat.  Dup duplicates a file descriptor. For example: +<pre> +  fd = open("x", O_RDWR); +  read (fd, buf, 100); +  write (fd, buf, 512); +  close (fd) +</pre> + +<p>Maintaining the file offset behind the read/write interface is an +  interesting design decision . The alternative is that the state of a +  read operation should be maintained by the process doing the reading +  (i.e., that the pointer should be passed as an argument to read). +  This argument is compelling in view of the UNIX fork() semantics, +  which clones a process which shares the file descriptors of its +  parent. A read by the parent of a shared file descriptor (e.g., +  stdin, changes the read pointer seen by the child).  On the other +  hand the alternative would make it difficult to get "(data; ls) > x" +  right. + +<p>Unix API doesn't specify that the effects of write are immediately +  on the disk before a write returns. It is up to the implementation +  of the file system within certain bounds. Choices include (that +  aren't non-exclusive): +<ul> +<li>At some point in the future, if the system stays up (e.g., after +  30 seconds); +<li>Before the write returns; +<li>Before close returns; +<li>User specified (e.g., before fsync returns). +</ul> + +<p>A design issue is the semantics of a file system operation that +  requires multiple disk writes.  In particular, what happens if the +  logical update requires writing multiple disks blocks and the power +  fails during the update?  For example, to create a new file, +  requires allocating an inode (which requires updating the list of +  free inodes on disk), writing a directory entry to record the +  allocated i-node under the name of the new file (which may require +  allocating a new block and updating the directory inode).  If the +  power fails during the operation, the list of free inodes and blocks +  may be inconsistent with the blocks and inodes in use. Again this is +  up to implementation of the file system to keep on disk data +  structures consistent: +<ul> +<li>Don't worry about it much, but use a recovery program to bring +  file system back into a consistent state. +<li>Journaling file system.  Never let the file system get into an +  inconsistent state. +</ul> + +<p>Another design issue is the semantics are of concurrent writes to +the same data item.  What is the order of two updates that happen at +the same time? For example, two processes open the same file and write +to it.  Modern Unix operating systems allow the application to lock a +file to get exclusive access.  If file locking is not used and if the +file descriptor is shared, then the bytes of the two writes will get +into the file in some order (this happens often for log files).  If +the file descriptor is not shared, the end result is not defined. For +example, one write may overwrite the other one (e.g., if they are +writing to the same part of the file.) + +<p>An implementation issue is performance, because writing to magnetic +disk is relatively expensive compared to computing. Three primary ways +to improve performance are: careful file system layout that induces +few seeks, an in-memory cache of frequently-accessed blocks, and +overlap I/O with computation so that file operations don't have to +wait until their completion and so that that the disk driver has more +data to write, which allows disk scheduling. (We will talk about +performance in detail later.) + +<h2>xv6 code examples</h2> + +<p>xv6 implements a minimal Unix file system interface. xv6 doesn't +pay attention to file system layout. It overlaps computation and I/O, +but doesn't do any disk scheduling.  Its cache is write-through, which +simplifies keep on disk datastructures consistent, but is bad for +performance. + +<p>On disk files are represented by an inode (struct dinode in fs.h), +and blocks.  Small files have up to 12 block addresses in their inode; +large files use files the last address in the inode as a disk address +for a block with 128 disk addresses (512/4).  The size of a file is +thus limited to 12 * 512 + 128*512 bytes.  What would you change to +support larger files? (Ans: e.g., double indirect blocks.) + +<p>Directories are files with a bit of structure to them. The file +contains of records of the type struct dirent.  The entry contains the +name for a file (or directory) and its corresponding inode number. +How many files can appear in a directory? + +<p>In memory files are represented by struct inode in fsvar.h. What is +the role of the additional fields in struct inode? + +<p>What is xv6's disk layout?  How does xv6 keep track of free blocks +  and inodes? See balloc()/bfree() and ialloc()/ifree().  Is this +  layout a good one for performance?  What are other options? + +<p>Let's assume that an application created an empty file x with +  contains 512 bytes, and that the application now calls read(fd, buf, +  100), that is, it is requesting to read 100 bytes into buf. +  Furthermore, let's assume that the inode for x is is i. Let's pick +  up what happens by investigating readi(), line 4483. +<ul> +<li>4488-4492: can iread be called on other objects than files?  (Yes. +  For example, read from the keyboard.)  Everything is a file in Unix. +<li>4495: what does bmap do? +<ul> +<li>4384: what block is being read? +</ul> +<li>4483: what does bread do?  does bread always cause a read to disk? +<ul> +<li>4006: what does bget do?  it implements a simple cache of +  recently-read disk blocks. +<ul> +<li>How big is the cache?  (see param.h) +<li>3972: look if the requested block is in the cache by walking down +  a circular list. +<li>3977: we had a match. +<li>3979: some other process has "locked" the block, wait until it +  releases.  the other processes releases the block using brelse(). +Why lock a block? +<ul> +<li>Atomic read and update.  For example, allocating an inode: read +  block containing inode, mark it allocated, and write it back.  This +  operation must be atomic. +</ul> +<li>3982: it is ours now. +<li>3987: it is not in the cache; we need to find a cache entry to +  hold the block. +<li>3987: what is the cache replacement strategy? (see also brelse()) +<li>3988: found an entry that we are going to use. +<li>3989: mark it ours but don't mark it valid (there is no valid data +  in the entry yet). +</ul> +<li>4007: if the block was in the cache and the entry has the block's +  data, return. +<li>4010: if the block wasn't in the cache, read it from disk. are +  read's synchronous or asynchronous? +<ul> +<li>3836: a bounded buffer of outstanding disk requests. +<li>3809: tell the disk to move arm and generate an interrupt. +<li>3851: go to sleep and run some other process to run. time sharing +  in action. +<li>3792: interrupt: arm is in the right position; wakeup requester. +<li>3856: read block from disk. +<li>3860: remove request from bounded buffer.  wakeup processes that +  are waiting for a slot. +<li>3864: start next disk request, if any. xv6 can overlap I/O with +computation. +</ul> +<li>4011: mark the cache entry has holding the data. +</ul> +<li>4498: To where is the block copied?  is dst a valid user address? +</ul> + +<p>Now let's suppose that the process is writing 512 bytes at the end +  of the file a. How many disk writes will happen? +<ul> +<li>4567: allocate a new block +<ul> +<li>4518: allocate a block: scan block map, and write entry +<li>4523: How many disk operations if the process would have been appending +  to a large file? (Answer: read indirect block, scan block map, write +  block map.) +</ul> +<li>4572: read the block that the process will be writing, in case the +  process writes only part of the block. +<li>4574: write it. is it synchronous or asynchronous? (Ans: +  synchronous but with timesharing.) +</ul> + +<p>Lots of code to implement reading and writing of files. How about +  directories? +<ul> +<li>4722: look for the directory, reading directory block and see if a +  directory entry is unused (inum == 0). +<li>4729: use it and update it. +<li>4735: write the modified block. +</ul> +<p>Reading and writing of directories is trivial. + +</body> | 
