diff options
Diffstat (limited to 'web/l13.html')
| -rw-r--r-- | web/l13.html | 245 | 
1 files changed, 0 insertions, 245 deletions
| diff --git a/web/l13.html b/web/l13.html deleted file mode 100644 index af0f405..0000000 --- a/web/l13.html +++ /dev/null @@ -1,245 +0,0 @@ -<title>High-performance File Systems</title> -<html> -<head> -</head> -<body> - -<h1>High-performance File Systems</h1> - -<p>Required reading: soft updates. - -<h2>Overview</h2> - -<p>A key problem in designing file systems is how to obtain -performance on file system operations while providing consistency. -With consistency, we mean, that file system invariants are maintained -is on disk.  These invariants include that if a file is created, it -appears in its directory, etc.  If the file system data structures are -consistent, then it is possible to rebuild the file system to a -correct state after a failure. - -<p>To ensure consistency of on-disk file system data structures, -  modifications to the file system must respect certain rules: -<ul> - -<li>Never point to a structure before it is initialized. An inode must -be initialized before a directory entry references it.  An block must -be initialized before an inode references it. - -<li>Never reuse a structure before nullifying all pointers to it.  An -inode pointer to a disk block must be reset before the file system can -reallocate the disk block. - -<li>Never reset the last point to a live structure before a new -pointer is set.  When renaming a file, the file system should not -remove the old name for an inode until after the new name has been -written. -</ul> -The paper calls these dependencies update dependencies.   - -<p>xv6 ensures these rules by writing every block synchronously, and -  by ordering the writes appropriately.  With synchronous, we mean -  that a process waits until the current disk write has been -  completed before continuing with execution. - -<ul> - -<li>What happens if power fails after 4776 in mknod1?  Did we lose the -  inode for ever? No, we have a separate program (called fsck), which -  can rebuild the disk structures correctly and can mark the inode on -  the free list. - -<li>Does the order of writes in mknod1 matter?  Say, what if we wrote -  directory entry first and then wrote the allocated inode to disk? -  This violates the update rules and it is not a good plan. If a -  failure happens after the directory write, then on recovery we have -  an directory pointing to an unallocated inode, which now may be -  allocated by another process for another file! - -<li>Can we turn the writes (i.e., the ones invoked by iupdate and -  wdir) into delayed writes without creating problems?  No, because -  the cause might write them back to the disk in an incorrect order. -  It has no information to decide in what order to write them. - -</ul> - -<p>xv6 is a nice example of the tension between consistency and -  performance.  To get consistency, xv6 uses synchronous writes, -  but these writes are slow, because they perform at the rate of a -  seek instead of the rate of the maximum data transfer rate. The -  bandwidth to a disk is reasonable high for large transfer (around -  50Mbyte/s), but latency is low, because of the cost of moving the -  disk arm(s) (the seek latency is about 10msec). - -<p>This tension is an implementation-dependent one.  The Unix API -  doesn't require that writes are synchronous.  Updates don't have to -  appear on disk until a sync, fsync, or open with O_SYNC.  Thus, in -  principle, the UNIX API allows delayed writes, which are good for -  performance: -<ul> -<li>Batch many writes together in a big one, written at the disk data -  rate. -<li>Absorp writes to the same block. -<li>Schedule writes to avoid seeks. -</ul> - -<p>Thus the question: how to delay writes and achieve consistency? -  The paper provides an answer. - -<h2>This paper</h2> - -<p>The paper surveys some of the existing techniques and introduces a -new to achieve the goal of performance and consistency. - -<p> - -<p>Techniques possible: -<ul> - -<li>Equip system with NVRAM, and put buffer cache in NVRAM. - -<li>Logging.  Often used in UNIX file systems for metadata updates. -LFS is an extreme version of this strategy. - -<li>Flusher-enforced ordering.  All writes are delayed. This flusher -is aware of dependencies between blocks, but doesn't work because -circular dependencies need to be broken by writing blocks out. - -</ul> - -<p>Soft updates is the solution explored in this paper.  It doesn't -require NVRAM, and performs as well as the naive strategy of keep all -dirty block in main memory.  Compared to logging, it is unclear if -soft updates is better.  The default BSD file systems uses soft -  updates, but most Linux file systems use logging. - -<p>Soft updates is a sophisticated variant of flusher-enforced -ordering.  Instead of maintaining dependencies on the block-level, it -maintains dependencies on file structure level (per inode, per -directory, etc.), reducing circular dependencies. Furthermore, it -breaks any remaining circular dependencies by undo changes before -writing the block and then redoing them to the block after writing. - -<p>Pseudocode for create: -<pre> -create (f) { -   allocate inode in block i  (assuming inode is available) -   add i to directory data block d  (assuming d has space) -   mark d has dependent on i, and create undo/redo record -   update directory inode in block di -   mark di has dependent on d -} -</pre> - -<p>Pseudocode for the flusher: -<pre> -flushblock (b) -{ -  lock b; -  for all dependencies that b is relying on -    "remove" that dependency by undoing the change to b -    mark the dependency as "unrolled" -  write b  -} - -write_completed (b) { -  remove dependencies that depend on b -  reapply "unrolled" dependencies that b depended on -  unlock b -} -</pre> - -<p>Apply flush algorithm to example: -<ul> -<li>A list of two dependencies: directory->inode, inode->directory. -<li>Lets say syncer picks directory first -<li>Undo directory->inode changes (i.e., unroll <A,#4>) -<li>Write directory block -<li>Remove met dependencies (i.e., remove inode->directory dependency) -<li>Perform redo operation (i.e., redo <A,#4>) -<li>Select inode block and write it -<li>Remove met dependencies (i.e., remove directory->inode dependency) -<li>Select directory block (it is dirty again!) -<li>Write it. -</ul> - -<p>An file operation that is important for file-system consistency  -is rename.  Rename conceptually works as follows: -<pre> -rename (from, to) -   unlink (to); -   link (from, to); -   unlink (from); -</pre> - -<p>Rename it often used by programs to make a new version of a file -the current version.  Committing to a new version must happen -atomically.  Unfortunately, without a transaction-like support -atomicity is impossible to guarantee, so a typical file systems -provides weaker semantics for rename: if to already exists, an -instance of to will always exist, even if the system should crash in -the middle of the operation.  Does the above implementation of rename -guarantee this semantics? (Answer: no). - -<p>If rename is implemented as unlink, link, unlink, then it is -difficult to guarantee even the weak semantics. Modern UNIXes provide -rename as a file system call: -<pre> -   update dir block for to point to from's inode // write block -   update dir block for from to free entry // write block -</pre> -<p>fsck may need to correct refcounts in the inode if the file -system fails during rename.  for example, a crash after the first -write followed by fsck should set refcount to 2, since both from -and to are pointing at the inode. - -<p>This semantics is sufficient, however, for an application to ensure -atomicity. Before the call, there is a from and perhaps a to.  If the -call is successful, following the call there is only a to.  If there -is a crash, there may be both a from and a to, in which case the -caller knows the previous attempt failed, and must retry.  The -subtlety is that if you now follow the two links, the "to" name may -link to either the old file or the new file.  If it links to the new -file, that means that there was a crash and you just detected that the -rename operation was composite.  On the other hand, the retry -procedure can be the same for either case (do the rename again), so it -isn't necessary to discover how it failed.  The function follows the -golden rule of recoverability, and it is idempotent, so it lays all -the needed groundwork for use as part of a true atomic action. - -<p>With soft updates renames becomes: -<pre> -rename (from, to) { -   i = namei(from); -   add "to" directory data block td a reference to inode i -   mark td dependent on block i -   update directory inode "to" tdi -   mark tdi as dependent on td -   remove "from" directory data block fd a reference to inode i -   mark fd as dependent on tdi -   update directory inode in block fdi -   mark fdi as dependent on fd -} -</pre> -<p>No synchronous writes! - -<p>What needs to be done on recovery?  (Inspect every statement in -rename and see what inconsistencies could exist on the disk; e.g., -refcnt inode could be too high.)  None of these inconsitencies require -fixing before the file system can operate; they can be fixed by a -background file system repairer.  - -<h2>Paper discussion</h2> - -<p>Do soft updates perform any useless writes? (A useless write is a -write that will be immediately overwritten.)  (Answer: yes.) Fix -syncer to becareful with what block to start.  Fix cache replacement -to selecting LRU block with no pendending dependencies. - -<p>Can a log-structured file system implement rename better? (Answer: -yes, since it can get the refcnts right). - -<p>Discuss all graphs. - -</body> - | 
