diff options
Diffstat (limited to 'web/l13.html')
-rw-r--r-- | web/l13.html | 245 |
1 files changed, 245 insertions, 0 deletions
diff --git a/web/l13.html b/web/l13.html new file mode 100644 index 0000000..af0f405 --- /dev/null +++ b/web/l13.html @@ -0,0 +1,245 @@ +<title>High-performance File Systems</title> +<html> +<head> +</head> +<body> + +<h1>High-performance File Systems</h1> + +<p>Required reading: soft updates. + +<h2>Overview</h2> + +<p>A key problem in designing file systems is how to obtain +performance on file system operations while providing consistency. +With consistency, we mean, that file system invariants are maintained +is on disk. These invariants include that if a file is created, it +appears in its directory, etc. If the file system data structures are +consistent, then it is possible to rebuild the file system to a +correct state after a failure. + +<p>To ensure consistency of on-disk file system data structures, + modifications to the file system must respect certain rules: +<ul> + +<li>Never point to a structure before it is initialized. An inode must +be initialized before a directory entry references it. An block must +be initialized before an inode references it. + +<li>Never reuse a structure before nullifying all pointers to it. An +inode pointer to a disk block must be reset before the file system can +reallocate the disk block. + +<li>Never reset the last point to a live structure before a new +pointer is set. When renaming a file, the file system should not +remove the old name for an inode until after the new name has been +written. +</ul> +The paper calls these dependencies update dependencies. + +<p>xv6 ensures these rules by writing every block synchronously, and + by ordering the writes appropriately. With synchronous, we mean + that a process waits until the current disk write has been + completed before continuing with execution. + +<ul> + +<li>What happens if power fails after 4776 in mknod1? Did we lose the + inode for ever? No, we have a separate program (called fsck), which + can rebuild the disk structures correctly and can mark the inode on + the free list. + +<li>Does the order of writes in mknod1 matter? Say, what if we wrote + directory entry first and then wrote the allocated inode to disk? + This violates the update rules and it is not a good plan. If a + failure happens after the directory write, then on recovery we have + an directory pointing to an unallocated inode, which now may be + allocated by another process for another file! + +<li>Can we turn the writes (i.e., the ones invoked by iupdate and + wdir) into delayed writes without creating problems? No, because + the cause might write them back to the disk in an incorrect order. + It has no information to decide in what order to write them. + +</ul> + +<p>xv6 is a nice example of the tension between consistency and + performance. To get consistency, xv6 uses synchronous writes, + but these writes are slow, because they perform at the rate of a + seek instead of the rate of the maximum data transfer rate. The + bandwidth to a disk is reasonable high for large transfer (around + 50Mbyte/s), but latency is low, because of the cost of moving the + disk arm(s) (the seek latency is about 10msec). + +<p>This tension is an implementation-dependent one. The Unix API + doesn't require that writes are synchronous. Updates don't have to + appear on disk until a sync, fsync, or open with O_SYNC. Thus, in + principle, the UNIX API allows delayed writes, which are good for + performance: +<ul> +<li>Batch many writes together in a big one, written at the disk data + rate. +<li>Absorp writes to the same block. +<li>Schedule writes to avoid seeks. +</ul> + +<p>Thus the question: how to delay writes and achieve consistency? + The paper provides an answer. + +<h2>This paper</h2> + +<p>The paper surveys some of the existing techniques and introduces a +new to achieve the goal of performance and consistency. + +<p> + +<p>Techniques possible: +<ul> + +<li>Equip system with NVRAM, and put buffer cache in NVRAM. + +<li>Logging. Often used in UNIX file systems for metadata updates. +LFS is an extreme version of this strategy. + +<li>Flusher-enforced ordering. All writes are delayed. This flusher +is aware of dependencies between blocks, but doesn't work because +circular dependencies need to be broken by writing blocks out. + +</ul> + +<p>Soft updates is the solution explored in this paper. It doesn't +require NVRAM, and performs as well as the naive strategy of keep all +dirty block in main memory. Compared to logging, it is unclear if +soft updates is better. The default BSD file systems uses soft + updates, but most Linux file systems use logging. + +<p>Soft updates is a sophisticated variant of flusher-enforced +ordering. Instead of maintaining dependencies on the block-level, it +maintains dependencies on file structure level (per inode, per +directory, etc.), reducing circular dependencies. Furthermore, it +breaks any remaining circular dependencies by undo changes before +writing the block and then redoing them to the block after writing. + +<p>Pseudocode for create: +<pre> +create (f) { + allocate inode in block i (assuming inode is available) + add i to directory data block d (assuming d has space) + mark d has dependent on i, and create undo/redo record + update directory inode in block di + mark di has dependent on d +} +</pre> + +<p>Pseudocode for the flusher: +<pre> +flushblock (b) +{ + lock b; + for all dependencies that b is relying on + "remove" that dependency by undoing the change to b + mark the dependency as "unrolled" + write b +} + +write_completed (b) { + remove dependencies that depend on b + reapply "unrolled" dependencies that b depended on + unlock b +} +</pre> + +<p>Apply flush algorithm to example: +<ul> +<li>A list of two dependencies: directory->inode, inode->directory. +<li>Lets say syncer picks directory first +<li>Undo directory->inode changes (i.e., unroll <A,#4>) +<li>Write directory block +<li>Remove met dependencies (i.e., remove inode->directory dependency) +<li>Perform redo operation (i.e., redo <A,#4>) +<li>Select inode block and write it +<li>Remove met dependencies (i.e., remove directory->inode dependency) +<li>Select directory block (it is dirty again!) +<li>Write it. +</ul> + +<p>An file operation that is important for file-system consistency +is rename. Rename conceptually works as follows: +<pre> +rename (from, to) + unlink (to); + link (from, to); + unlink (from); +</pre> + +<p>Rename it often used by programs to make a new version of a file +the current version. Committing to a new version must happen +atomically. Unfortunately, without a transaction-like support +atomicity is impossible to guarantee, so a typical file systems +provides weaker semantics for rename: if to already exists, an +instance of to will always exist, even if the system should crash in +the middle of the operation. Does the above implementation of rename +guarantee this semantics? (Answer: no). + +<p>If rename is implemented as unlink, link, unlink, then it is +difficult to guarantee even the weak semantics. Modern UNIXes provide +rename as a file system call: +<pre> + update dir block for to point to from's inode // write block + update dir block for from to free entry // write block +</pre> +<p>fsck may need to correct refcounts in the inode if the file +system fails during rename. for example, a crash after the first +write followed by fsck should set refcount to 2, since both from +and to are pointing at the inode. + +<p>This semantics is sufficient, however, for an application to ensure +atomicity. Before the call, there is a from and perhaps a to. If the +call is successful, following the call there is only a to. If there +is a crash, there may be both a from and a to, in which case the +caller knows the previous attempt failed, and must retry. The +subtlety is that if you now follow the two links, the "to" name may +link to either the old file or the new file. If it links to the new +file, that means that there was a crash and you just detected that the +rename operation was composite. On the other hand, the retry +procedure can be the same for either case (do the rename again), so it +isn't necessary to discover how it failed. The function follows the +golden rule of recoverability, and it is idempotent, so it lays all +the needed groundwork for use as part of a true atomic action. + +<p>With soft updates renames becomes: +<pre> +rename (from, to) { + i = namei(from); + add "to" directory data block td a reference to inode i + mark td dependent on block i + update directory inode "to" tdi + mark tdi as dependent on td + remove "from" directory data block fd a reference to inode i + mark fd as dependent on tdi + update directory inode in block fdi + mark fdi as dependent on fd +} +</pre> +<p>No synchronous writes! + +<p>What needs to be done on recovery? (Inspect every statement in +rename and see what inconsistencies could exist on the disk; e.g., +refcnt inode could be too high.) None of these inconsitencies require +fixing before the file system can operate; they can be fixed by a +background file system repairer. + +<h2>Paper discussion</h2> + +<p>Do soft updates perform any useless writes? (A useless write is a +write that will be immediately overwritten.) (Answer: yes.) Fix +syncer to becareful with what block to start. Fix cache replacement +to selecting LRU block with no pendending dependencies. + +<p>Can a log-structured file system implement rename better? (Answer: +yes, since it can get the refcnts right). + +<p>Discuss all graphs. + +</body> + |