diff options
Diffstat (limited to 'web/l13.html')
| -rw-r--r-- | web/l13.html | 245 | 
1 files changed, 245 insertions, 0 deletions
| diff --git a/web/l13.html b/web/l13.html new file mode 100644 index 0000000..af0f405 --- /dev/null +++ b/web/l13.html @@ -0,0 +1,245 @@ +<title>High-performance File Systems</title> +<html> +<head> +</head> +<body> + +<h1>High-performance File Systems</h1> + +<p>Required reading: soft updates. + +<h2>Overview</h2> + +<p>A key problem in designing file systems is how to obtain +performance on file system operations while providing consistency. +With consistency, we mean, that file system invariants are maintained +is on disk.  These invariants include that if a file is created, it +appears in its directory, etc.  If the file system data structures are +consistent, then it is possible to rebuild the file system to a +correct state after a failure. + +<p>To ensure consistency of on-disk file system data structures, +  modifications to the file system must respect certain rules: +<ul> + +<li>Never point to a structure before it is initialized. An inode must +be initialized before a directory entry references it.  An block must +be initialized before an inode references it. + +<li>Never reuse a structure before nullifying all pointers to it.  An +inode pointer to a disk block must be reset before the file system can +reallocate the disk block. + +<li>Never reset the last point to a live structure before a new +pointer is set.  When renaming a file, the file system should not +remove the old name for an inode until after the new name has been +written. +</ul> +The paper calls these dependencies update dependencies.   + +<p>xv6 ensures these rules by writing every block synchronously, and +  by ordering the writes appropriately.  With synchronous, we mean +  that a process waits until the current disk write has been +  completed before continuing with execution. + +<ul> + +<li>What happens if power fails after 4776 in mknod1?  Did we lose the +  inode for ever? No, we have a separate program (called fsck), which +  can rebuild the disk structures correctly and can mark the inode on +  the free list. + +<li>Does the order of writes in mknod1 matter?  Say, what if we wrote +  directory entry first and then wrote the allocated inode to disk? +  This violates the update rules and it is not a good plan. If a +  failure happens after the directory write, then on recovery we have +  an directory pointing to an unallocated inode, which now may be +  allocated by another process for another file! + +<li>Can we turn the writes (i.e., the ones invoked by iupdate and +  wdir) into delayed writes without creating problems?  No, because +  the cause might write them back to the disk in an incorrect order. +  It has no information to decide in what order to write them. + +</ul> + +<p>xv6 is a nice example of the tension between consistency and +  performance.  To get consistency, xv6 uses synchronous writes, +  but these writes are slow, because they perform at the rate of a +  seek instead of the rate of the maximum data transfer rate. The +  bandwidth to a disk is reasonable high for large transfer (around +  50Mbyte/s), but latency is low, because of the cost of moving the +  disk arm(s) (the seek latency is about 10msec). + +<p>This tension is an implementation-dependent one.  The Unix API +  doesn't require that writes are synchronous.  Updates don't have to +  appear on disk until a sync, fsync, or open with O_SYNC.  Thus, in +  principle, the UNIX API allows delayed writes, which are good for +  performance: +<ul> +<li>Batch many writes together in a big one, written at the disk data +  rate. +<li>Absorp writes to the same block. +<li>Schedule writes to avoid seeks. +</ul> + +<p>Thus the question: how to delay writes and achieve consistency? +  The paper provides an answer. + +<h2>This paper</h2> + +<p>The paper surveys some of the existing techniques and introduces a +new to achieve the goal of performance and consistency. + +<p> + +<p>Techniques possible: +<ul> + +<li>Equip system with NVRAM, and put buffer cache in NVRAM. + +<li>Logging.  Often used in UNIX file systems for metadata updates. +LFS is an extreme version of this strategy. + +<li>Flusher-enforced ordering.  All writes are delayed. This flusher +is aware of dependencies between blocks, but doesn't work because +circular dependencies need to be broken by writing blocks out. + +</ul> + +<p>Soft updates is the solution explored in this paper.  It doesn't +require NVRAM, and performs as well as the naive strategy of keep all +dirty block in main memory.  Compared to logging, it is unclear if +soft updates is better.  The default BSD file systems uses soft +  updates, but most Linux file systems use logging. + +<p>Soft updates is a sophisticated variant of flusher-enforced +ordering.  Instead of maintaining dependencies on the block-level, it +maintains dependencies on file structure level (per inode, per +directory, etc.), reducing circular dependencies. Furthermore, it +breaks any remaining circular dependencies by undo changes before +writing the block and then redoing them to the block after writing. + +<p>Pseudocode for create: +<pre> +create (f) { +   allocate inode in block i  (assuming inode is available) +   add i to directory data block d  (assuming d has space) +   mark d has dependent on i, and create undo/redo record +   update directory inode in block di +   mark di has dependent on d +} +</pre> + +<p>Pseudocode for the flusher: +<pre> +flushblock (b) +{ +  lock b; +  for all dependencies that b is relying on +    "remove" that dependency by undoing the change to b +    mark the dependency as "unrolled" +  write b  +} + +write_completed (b) { +  remove dependencies that depend on b +  reapply "unrolled" dependencies that b depended on +  unlock b +} +</pre> + +<p>Apply flush algorithm to example: +<ul> +<li>A list of two dependencies: directory->inode, inode->directory. +<li>Lets say syncer picks directory first +<li>Undo directory->inode changes (i.e., unroll <A,#4>) +<li>Write directory block +<li>Remove met dependencies (i.e., remove inode->directory dependency) +<li>Perform redo operation (i.e., redo <A,#4>) +<li>Select inode block and write it +<li>Remove met dependencies (i.e., remove directory->inode dependency) +<li>Select directory block (it is dirty again!) +<li>Write it. +</ul> + +<p>An file operation that is important for file-system consistency  +is rename.  Rename conceptually works as follows: +<pre> +rename (from, to) +   unlink (to); +   link (from, to); +   unlink (from); +</pre> + +<p>Rename it often used by programs to make a new version of a file +the current version.  Committing to a new version must happen +atomically.  Unfortunately, without a transaction-like support +atomicity is impossible to guarantee, so a typical file systems +provides weaker semantics for rename: if to already exists, an +instance of to will always exist, even if the system should crash in +the middle of the operation.  Does the above implementation of rename +guarantee this semantics? (Answer: no). + +<p>If rename is implemented as unlink, link, unlink, then it is +difficult to guarantee even the weak semantics. Modern UNIXes provide +rename as a file system call: +<pre> +   update dir block for to point to from's inode // write block +   update dir block for from to free entry // write block +</pre> +<p>fsck may need to correct refcounts in the inode if the file +system fails during rename.  for example, a crash after the first +write followed by fsck should set refcount to 2, since both from +and to are pointing at the inode. + +<p>This semantics is sufficient, however, for an application to ensure +atomicity. Before the call, there is a from and perhaps a to.  If the +call is successful, following the call there is only a to.  If there +is a crash, there may be both a from and a to, in which case the +caller knows the previous attempt failed, and must retry.  The +subtlety is that if you now follow the two links, the "to" name may +link to either the old file or the new file.  If it links to the new +file, that means that there was a crash and you just detected that the +rename operation was composite.  On the other hand, the retry +procedure can be the same for either case (do the rename again), so it +isn't necessary to discover how it failed.  The function follows the +golden rule of recoverability, and it is idempotent, so it lays all +the needed groundwork for use as part of a true atomic action. + +<p>With soft updates renames becomes: +<pre> +rename (from, to) { +   i = namei(from); +   add "to" directory data block td a reference to inode i +   mark td dependent on block i +   update directory inode "to" tdi +   mark tdi as dependent on td +   remove "from" directory data block fd a reference to inode i +   mark fd as dependent on tdi +   update directory inode in block fdi +   mark fdi as dependent on fd +} +</pre> +<p>No synchronous writes! + +<p>What needs to be done on recovery?  (Inspect every statement in +rename and see what inconsistencies could exist on the disk; e.g., +refcnt inode could be too high.)  None of these inconsitencies require +fixing before the file system can operate; they can be fixed by a +background file system repairer.  + +<h2>Paper discussion</h2> + +<p>Do soft updates perform any useless writes? (A useless write is a +write that will be immediately overwritten.)  (Answer: yes.) Fix +syncer to becareful with what block to start.  Fix cache replacement +to selecting LRU block with no pendending dependencies. + +<p>Can a log-structured file system implement rename better? (Answer: +yes, since it can get the refcnts right). + +<p>Discuss all graphs. + +</body> + | 
