author    Austin Clements <[email protected]>    2011-09-07 11:49:14 -0400
committer Austin Clements <[email protected]>    2011-09-07 11:49:14 -0400
commit    01a6c054d548d9fff8bbdfac4d3f3de4ae8677a1 (patch)
tree      4320eb3d09f31f4a628b80d45482a72ee7c3956b /web/l-scalablecoord.html
parent    64a03bd7aa5c03a626a2da4730a45fcceea75322 (diff)
Remove web directory; all cruft or moved to 6.828 repo
Diffstat (limited to 'web/l-scalablecoord.html')
-rw-r--r--    web/l-scalablecoord.html    202
1 files changed, 0 insertions, 202 deletions
diff --git a/web/l-scalablecoord.html b/web/l-scalablecoord.html
deleted file mode 100644
index da72c37..0000000
--- a/web/l-scalablecoord.html
+++ /dev/null
@@ -1,202 +0,0 @@
-<title>Scalable coordination</title>
-<html>
-<head>
-</head>
-<body>
-
-<h1>Scalable coordination</h1>
-
-<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable
- Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.
-
-<h2>Overview</h2>
-
-<p>Shared memory machines are a bunch of CPUs sharing physical memory.
-Typically each processor also maintains a cache (for performance),
-which introduces the problem of keeping the caches coherent. If
-processor 1 writes a memory location whose value processor 2 has
-cached, then processor 2's cache must be updated in some way. How?
-<ul>
-
-<li>Bus-based schemes. Any CPU can access any memory equally (a
-"dance hall" architecture). Use "snoopy" protocols: each CPU's cache
-listens to the memory bus. With a write-through architecture, a cache
-invalidates its copy when it sees a write on the bus. Or use an
-"ownership" scheme with a write-back cache (e.g., Pentium cache lines
-have MESI bits---modified, exclusive, shared, invalid). If the E bit
-is set, the CPU caches the line exclusively and can write it back.
-But the bus places limits on scalability.
-
-<li>More scalability with NUMA schemes (non-uniform memory access).
-Each CPU comes with fast "close" memory; memory that lives with
-another processor is slower to access. Use a directory to keep track
-of who is caching what. For example, processor 0 is responsible for
-all memory starting with address "000", processor 1 is responsible
-for all memory starting with "001", etc.
-
-<li>COMA - cache-only memory architecture. Each CPU has local RAM,
-treated as cache. Cache lines migrate around to different nodes based
-on access pattern. Data only lives in cache, no permanent memory
-location. (These machines aren't too popular any more.)
-
-</ul>
-
-
-<h2>Scalable locks</h2>
-
-<p>This paper is about cost and scalability of locking; what if you
-have 10 CPUs waiting for the same lock? For example, what would
-happen if xv6 runs on an SMP with many processors?
-
-<p>What's the cost of a simple spinning acquire/release? Algorithm 1
-*without* the delays, which is like xv6's implementation of acquire
-and release (xv6 uses XCHG instead of test_and_set):
-<pre>
- each of the 10 CPUs gets the lock in turn
-  meanwhile, the remaining CPUs spin in XCHG on the lock
- lock must be X in cache to run XCHG
- otherwise all might read, then all might write
- so bus is busy all the time with XCHGs!
- can we avoid constant XCHGs while lock is held?
-</pre>
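-
-<p>A minimal sketch of such a spinning acquire/release in C, using
-GCC's __sync_lock_test_and_set builtin as a stand-in for XCHG (an
-assumption for illustration; xv6's real acquire is inline assembly
-and also disables interrupts):
-<pre>
-struct spinlock { volatile int locked; };
-
-void acquire(struct spinlock *l) {
-  // atomically swap 1 into l->locked; retry until the old value was 0
-  while (__sync_lock_test_and_set(&l->locked, 1) != 0)
-    ;
-}
-
-void release(struct spinlock *l) {
-  __sync_lock_release(&l->locked);   // store 0, with release semantics
-}
-</pre>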
-
-<p>test-and-test-and-set
-<pre>
- only run expensive TSL if not locked
- spin on ordinary load instruction, so cache line is S
-  acquire(l) {
-    while(1){
-      while(l->locked != 0) { }
-      if(TSL(&l->locked) == 0)
-        return;
-    }
-  }
-</pre>
-
-<p>Suppose 10 CPUs are waiting; let's count the cost in total bus
- transactions:
-<pre>
- CPU1 gets lock in one cycle
- sets lock's cache line to I in other CPUs
- 9 CPUs each use bus once in XCHG
- then everyone has the line S, so they spin locally
-  CPU1 releases the lock
- CPU2 gets the lock in one cycle
- 8 CPUs each use bus once...
-  So 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in # of CPUs!
- Look at "test-and-test-and-set" in Figure 6
-</pre>
-<p> Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time?
-
-<p>What is the point of the exponential backoff in Algorithm 1?
-<pre>
- Does it buy us O(n) time for n acquires?
- Is there anything wrong with it?
- may not be fair
- exponential backoff may increase delay after release
-</pre>
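-
-<p>A rough sketch of Algorithm 1 with exponential backoff; the cap of
-1024 iterations and the empty delay loop are arbitrary choices for
-illustration, not the paper's tuning:
-<pre>
-void acquire_backoff(volatile int *lock) {
-  int delay = 1;
-  while (__sync_lock_test_and_set(lock, 1) != 0) {
-    // back off: spin privately (off the bus) before retrying the
-    // interlocked instruction
-    for (volatile int i = 0; i < delay; i++)
-      ;
-    if (delay < 1024)
-      delay *= 2;                // double the wait after each failure
-  }
-}
-</pre>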
-
-<p>What's the point of the ticket locks, Algorithm 2?
-<pre>
- one interlocked instruction to get my ticket number
- then I spin on now_serving with ordinary load
- release() just increments now_serving
-</pre>
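-
-<p>A sketch of a ticket lock along these lines, assuming GCC's
-__sync_fetch_and_add as the single interlocked instruction:
-<pre>
-struct ticketlock {
-  volatile unsigned next_ticket;   // next ticket to hand out
-  volatile unsigned now_serving;   // ticket that currently holds the lock
-};
-
-void ticket_acquire(struct ticketlock *l) {
-  // one interlocked instruction to take a ticket number
-  unsigned me = __sync_fetch_and_add(&l->next_ticket, 1);
-  // then spin on now_serving with ordinary loads
-  while (l->now_serving != me)
-    ;
-}
-
-void ticket_release(struct ticketlock *l) {
-  l->now_serving++;                // ordinary store wakes the next ticket
-}
-</pre>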
-
-<p>why is that good?
-<pre>
- + fair
- + no exponential backoff overshoot
-  + no spinning with interlocked instructions, just ordinary loads of now_serving
-</pre>
-
-<p>but what's the cost, in bus transactions?
-<pre>
- while lock is held, now_serving is S in all caches
- release makes it I in all caches
-  then each waiter uses a bus transaction to get the new value
- so still O(n^2)
-</pre>
-
-<p>What's the point of the array-based queuing locks, Algorithm 3?
-<pre>
- a lock has an array of "slots"
- waiter allocates a slot, spins on that slot
- release wakes up just next slot
- so O(n) bus transactions to get through n waiters: good!
-  the Anderson lines in Figures 4 and 6 are flat-ish
-    they only go up because the lock data structures are protected by a simpler lock
- but O(n) space *per lock*!
-</pre>
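-
-<p>A sketch of an array-based queuing lock for a fixed set of CPUs;
-NCPU and the return-the-slot convention are illustrative, and real
-code would also pad each slot to its own cache line and handle
-counter wraparound as Algorithm 3 does:
-<pre>
-#define NCPU 10
-
-struct arraylock {
-  volatile int slot[NCPU];        // slot[i] == 1 means that slot may enter
-  volatile unsigned next_slot;    // next slot to hand out
-};
-// initially slot[0] = 1, all other slots 0
-
-unsigned array_acquire(struct arraylock *l) {
-  unsigned my = __sync_fetch_and_add(&l->next_slot, 1) % NCPU;
-  while (l->slot[my] == 0)        // spin only on my own slot
-    ;
-  l->slot[my] = 0;                // reset it for the next pass around the array
-  return my;                      // caller remembers its slot for release
-}
-
-void array_release(struct arraylock *l, unsigned my) {
-  l->slot[(my + 1) % NCPU] = 1;   // wake just the next waiter
-}
-</pre>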
-
-<p>Algorithm 5 (MCS), the new algorithm of the paper, uses
-compare_and_swap:
-<pre>
-int compare_and_swap(addr, v1, v2) {
- int ret = 0;
- // stop all memory activity and ignore interrupts
- if (*addr == v1) {
- *addr = v2;
- ret = 1;
- }
- // resume other memory activity and take interrupts
- return ret;
-}
-</pre>
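-
-<p>On real hardware this is one atomic instruction (e.g. x86's
-CMPXCHG); with GCC one way to express it is the
-__sync_bool_compare_and_swap builtin, which also works on
-pointer-sized values, as the MCS sketch below assumes:
-<pre>
-// returns nonzero iff *addr was v1 and has been replaced by v2
-int compare_and_swap(int *addr, int v1, int v2) {
-  return __sync_bool_compare_and_swap(addr, v1, v2);
-}
-</pre>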
-
-<p>What's the point of the MCS lock, Algorithm 5?
-<pre>
- constant space per lock, rather than O(n)
- one "qnode" per thread, used for whatever lock it's waiting for
- lock holder's qnode points to start of list
- lock variable points to end of list
- acquire adds your qnode to end of list
- then you spin on your own qnode
- release wakes up next qnode
-</pre>
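-
-<p>A sketch of MCS acquire and release along the lines of the paper's
-pseudocode; the qnode type and the GCC __atomic/__sync builtins are my
-rendering, not the paper's code:
-<pre>
-typedef struct qnode {
-  struct qnode *volatile next;   // successor in the queue, if any
-  volatile int locked;           // 1 while this thread must keep waiting
-} qnode;
-
-typedef qnode *mcs_lock;         // points at the tail of the queue; NULL = free
-
-void mcs_acquire(mcs_lock *L, qnode *I) {
-  I->next = NULL;
-  // fetch_and_store: atomically make my qnode the tail, get the old tail
-  qnode *pred = __atomic_exchange_n(L, I, __ATOMIC_ACQ_REL);
-  if (pred != NULL) {            // lock was held: queue up behind pred
-    I->locked = 1;
-    pred->next = I;
-    while (I->locked)            // spin only on my own qnode
-      ;
-  }
-}
-
-void mcs_release(mcs_lock *L, qnode *I) {
-  if (I->next == NULL) {
-    // no known successor; if I'm still the tail, swing the lock back to NULL
-    if (__sync_bool_compare_and_swap(L, I, NULL))
-      return;
-    while (I->next == NULL)      // a successor is mid-acquire; wait for its link
-      ;
-  }
-  I->next->locked = 0;           // hand the lock to the next waiter
-}
-</pre>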
-
-<h2>Wait-free or non-blocking data structures</h2>
-
-<p>The previous implementations all block threads when there is
- contention for a lock. Other atomic hardware operations allow one
- to build wait-free data structures. For example, one can implement
- an insert of an element into a shared list that doesn't block a
- thread. Such versions are called wait-free.
-
-<p>A linked list with locks is as follows:
-<pre>
-struct element { int x; struct element *next; };
-typedef struct element element;
-
-element *list;        // head of the shared list
-Lock list_lock;
-
-insert(int x) {
- element *n = new Element;
- n->x = x;
-
- acquire(&list_lock);
- n->next = list;
- list = n;
- release(&list_lock);
-}
-</pre>
-
-<p>A wait-free implementation is as follows:
-<pre>
-insert (int x) {
- element *n = new Element;
- n->x = x;
- do {
- n->next = list;
- } while (compare_and_swap (&list, n->next, n) == 0);
-}
-</pre>
-<p>How many bus transactions with 10 CPUs inserting one element in the
-list? Could you do better?
-
-<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This
- paper by Fraser and Harris</a> compares lock-based implementations
- versus corresponding non-blocking implementations of a number of data
- structures.
-
-<p>It is not possible to make every operation wait-free, and there are
- times when we will need an implementation of acquire and release.
- Research on non-blocking data structures is active; the last word on
- this topic hasn't been said yet.
-
-</body>