author    | Austin Clements <[email protected]> | 2011-09-07 11:49:14 -0400
committer | Austin Clements <[email protected]> | 2011-09-07 11:49:14 -0400
commit    | 01a6c054d548d9fff8bbdfac4d3f3de4ae8677a1 (patch)
tree      | 4320eb3d09f31f4a628b80d45482a72ee7c3956b /web/l-scalablecoord.html
parent    | 64a03bd7aa5c03a626a2da4730a45fcceea75322 (diff)
Remove web directory; all cruft or moved to 6.828 repo
Diffstat (limited to 'web/l-scalablecoord.html')
-rw-r--r-- | web/l-scalablecoord.html | 202 |
1 files changed, 0 insertions, 202 deletions
diff --git a/web/l-scalablecoord.html b/web/l-scalablecoord.html
deleted file mode 100644
index da72c37..0000000
--- a/web/l-scalablecoord.html
+++ /dev/null
@@ -1,202 +0,0 @@
<title>Scalable coordination</title>
<html>
<head>
</head>
<body>

<h1>Scalable coordination</h1>

<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable
Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.

<h2>Overview</h2>

<p>Shared-memory machines are a bunch of CPUs sharing physical memory.
Typically each processor also maintains a cache (for performance),
which introduces the problem of keeping the caches coherent. If
processor 1 writes a memory location whose value processor 2 has
cached, then processor 2's cache must be updated in some way. How?
<ul>

<li>Bus-based schemes. Any CPU can access ("dance with") any memory
equally (a "dance hall" architecture). Use "snoopy" protocols: each
CPU's cache listens to the memory bus. With a write-through
architecture, a cache invalidates its copy when it sees a write. Or
one can have an "ownership" scheme with a write-back cache (e.g.,
Pentium caches have MESI bits---modified, exclusive, shared, invalid).
If the E bit is set, the CPU caches the line exclusively and can write
it back. But the bus places limits on scalability.

<li>More scalability with NUMA schemes (non-uniform memory access).
Each CPU comes with fast "close" memory; it is slower to access memory
that is stored with another processor. Use a directory to keep track
of who is caching what. For example, processor 0 is responsible for
all memory starting with address "000", processor 1 is responsible for
all memory starting with "001", etc.

<li>COMA - cache-only memory architecture. Each CPU has local RAM,
treated as a cache. Cache lines migrate around to different nodes
based on the access pattern. Data lives only in caches; there is no
permanent memory location. (These machines aren't too popular any
more.)

</ul>

<h2>Scalable locks</h2>

<p>This paper is about the cost and scalability of locking; what if
you have 10 CPUs waiting for the same lock? For example, what would
happen if xv6 ran on an SMP with many processors?

<p>What's the cost of a simple spinning acquire/release? Algorithm 1
*without* the delays, which is like xv6's implementation of acquire
and release (xv6 uses XCHG instead of test_and_set):
<pre>
  each of the 10 CPUs gets the lock in turn
  meanwhile, remaining CPUs spin in XCHG on the lock
  lock must be X in cache to run XCHG
    otherwise all might read, then all might write
  so bus is busy all the time with XCHGs!
  can we avoid constant XCHGs while lock is held?
</pre>

<p>test-and-test-and-set
<pre>
  only run expensive TSL if not locked
  spin on ordinary load instruction, so cache line is S
  acquire(l)
    while(1){
      while(l->locked != 0) { }
      if(TSL(&l->locked) == 0)
        return;
    }
</pre>

<p>suppose 10 CPUs are waiting; let's count the cost in total bus
transactions
<pre>
  CPU1 gets lock in one cycle
    sets lock's cache line to I in other CPUs
  9 CPUs each use bus once in XCHG
    then everyone has the line S, so they spin locally
  CPU1 releases the lock
  CPU2 gets the lock in one cycle
  8 CPUs each use bus once...
  So 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in # of CPUs!
  Look at "test-and-test-and-set" in Figure 6
</pre>
<p>Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time?

<p>What is the point of the exponential backoff in Algorithm 1?
<pre>
  Does it buy us O(n) time for n acquires?
  Is there anything wrong with it?
    may not be fair
    exponential backoff may increase delay after release
</pre>
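<p>As a concrete illustration (this is not code from the paper or from
xv6), here is a minimal sketch of a test-and-test-and-set lock with
capped exponential backoff, assuming GCC/Clang __atomic builtins; the
atomic exchange plays the role of xv6's XCHG:
<pre>
/* Sketch only: test-and-test-and-set with exponential backoff.
   Assumes GCC/Clang __atomic builtins; the delay loop is a crude
   stand-in for a real pause between retries. */

typedef struct {
  int locked;   /* 0 = free, 1 = held */
} spinlock;

void acquire(spinlock *l) {
  int backoff = 1;
  for (;;) {
    /* spin with ordinary loads, so the cache line can stay S */
    while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED))
      ;
    /* only now pay for the atomic exchange (xv6's XCHG) */
    if (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE) == 0)
      return;
    /* lost the race: back off before hitting the bus again */
    for (volatile int i = 0; i < backoff; i++)
      ;
    if (backoff < 1024)
      backoff *= 2;
  }
}

void release(spinlock *l) {
  __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}
</pre>
<p>Capping the doubling limits how badly a waiter can overshoot after a
release, which is exactly the concern raised above.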
<p>What's the point of the ticket locks, Algorithm 2?
<pre>
  one interlocked instruction to get my ticket number
  then I spin on now_serving with ordinary load
  release() just increments now_serving
</pre>

<p>why is that good?
<pre>
  + fair
  + no exponential backoff overshoot
  + no spinning on an atomic instruction while waiting
</pre>

<p>but what's the cost, in bus transactions?
<pre>
  while lock is held, now_serving is S in all caches
  release makes it I in all caches
  then each waiter uses a bus transaction to get the new value
  so still O(n^2)
</pre>

<p>What's the point of the array-based queuing locks, Algorithm 3?
<pre>
  a lock has an array of "slots"
  waiter allocates a slot, spins on that slot
  release wakes up just the next slot
  so O(n) bus transactions to get through n waiters: good!
  the anderson lines in Figures 4 and 6 are flat-ish
    they only go up because the lock data structures are protected by a simpler lock
  but O(n) space *per lock*!
</pre>

<p>Algorithm 5 (MCS), the new algorithm of the paper, uses
compare_and_swap:
<pre>
int compare_and_swap(addr, v1, v2) {
  int ret = 0;
  // stop all memory activity and ignore interrupts
  if (*addr == v1) {
    *addr = v2;
    ret = 1;
  }
  // resume other memory activity and take interrupts
  return ret;
}
</pre>

<p>What's the point of the MCS lock, Algorithm 5?
<pre>
  constant space per lock, rather than O(n)
  one "qnode" per thread, used for whatever lock it's waiting for
  the lock holder's qnode is at the start of the list
  the lock variable points to the end of the list
  acquire adds your qnode to the end of the list
    then you spin on your own qnode
  release wakes up just the next qnode
</pre>

<h2>Wait-free or non-blocking data structures</h2>

<p>The previous implementations all block threads when there is
contention for a lock. Other atomic hardware operations allow one to
build non-blocking implementations of data structures. For example,
one can implement an insert of an element into a shared list that
doesn't block a thread. Such versions are called wait-free (strictly
speaking, the insert below is lock-free rather than wait-free: some
thread always makes progress, but an individual thread may retry
indefinitely).

<p>A linked list with locks is as follows:
<pre>
Lock list_lock;

insert(int x) {
  element *n = new Element;
  n->x = x;

  acquire(&list_lock);
  n->next = list;
  list = n;
  release(&list_lock);
}
</pre>

<p>A non-blocking implementation is as follows:
<pre>
insert(int x) {
  element *n = new Element;
  n->x = x;
  do {
    n->next = list;
  } while (compare_and_swap(&list, n->next, n) == 0);
}
</pre>
<p>How many bus transactions are there with 10 CPUs each inserting one
element into the list? Could you do better?

<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This
paper by Fraser and Harris</a> compares lock-based implementations
with corresponding non-blocking implementations of a number of data
structures.

<p>It is not possible to make every operation wait-free, and there are
times when we will need an implementation of acquire and release.
Research on non-blocking data structures is active; the last word on
this topic hasn't been said yet.
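<p>Since the notes above describe the MCS lock (Algorithm 5) only in
prose, here is a rough sketch of what its acquire and release might
look like. This is illustrative only, not the paper's pseudocode: it
assumes GCC/Clang __atomic builtins, with an atomic exchange standing
in for the paper's fetch_and_store and an atomic compare-exchange
standing in for compare_and_swap.
<pre>
typedef struct qnode {
  struct qnode *next;   /* next waiter in line, if any */
  int locked;           /* 1 while this waiter must keep spinning */
} qnode;

typedef struct {
  qnode *tail;          /* last waiter in line; null when the lock is free */
} mcs_lock;

void mcs_acquire(mcs_lock *l, qnode *me) {
  me->next = 0;
  me->locked = 1;
  /* fetch_and_store: append our qnode and learn who, if anyone, is ahead */
  qnode *prev = __atomic_exchange_n(&l->tail, me, __ATOMIC_ACQ_REL);
  if (prev) {
    __atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
    /* spin only on our own qnode, not on a shared location */
    while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
      ;
  }
}

void mcs_release(mcs_lock *l, qnode *me) {
  qnode *succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE);
  if (succ == 0) {
    /* no visible successor: try to swing tail back to null */
    qnode *expected = me;
    if (__atomic_compare_exchange_n(&l->tail, &expected, (qnode *) 0, 0,
                                    __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
      return;   /* nobody was waiting; the lock is now free */
    /* a waiter swapped itself onto the tail but hasn't linked in yet */
    while ((succ = __atomic_load_n(&me->next, __ATOMIC_ACQUIRE)) == 0)
      ;
  }
  /* hand the lock directly to the successor */
  __atomic_store_n(&succ->locked, 0, __ATOMIC_RELEASE);
}
</pre>
<p>Each waiter spins on its own qnode's locked flag, so the bus traffic
per lock handoff stays constant instead of growing with the number of
waiters.

</body>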