Diffstat (limited to 'web/l-scalablecoord.html')
-rw-r--r--  web/l-scalablecoord.html  202
1 files changed, 202 insertions, 0 deletions
diff --git a/web/l-scalablecoord.html b/web/l-scalablecoord.html
new file mode 100644
index 0000000..da72c37
--- /dev/null
+++ b/web/l-scalablecoord.html
@@ -0,0 +1,202 @@
+<html>
+<head>
+<title>Scalable coordination</title>
+</head>
+<body>
+
+<h1>Scalable coordination</h1>
+
+<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable
+ Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.
+
+<h2>Overview</h2>
+
+<p>Shared memory machines are a bunch of CPUs sharing physical memory.
+Typically each processor also maintains a cache (for performance),
+which introduces the problem of keeping the caches coherent. If processor 1
+writes a memory location whose value processor 2 has cached, then
+processor 2's cache must be updated in some way. How?
+<ul>
+
+<li>Bus-based schemes. Any CPU can access any memory module
+equally ("dance hall architecture"). Use "snoopy" protocols: each CPU's cache
+listens to the memory bus. With a write-through architecture, a cache invalidates
+its copy when it sees a write. Or one can have an "ownership" scheme with a
+write-back cache (e.g., Pentium caches have MESI bits---modified, exclusive,
+shared, invalid). If the E bit is set, the CPU caches the line exclusively and
+can write it back. But the bus places limits on scalability.
+
+<li>More scalability with NUMA schemes (non-uniform memory access). Each
+CPU comes with fast "close" memory. It is slower to access memory that is
+stored with another processor. Use a directory to keep track of who is
+caching what. For example, processor 0 is responsible for all memory
+whose addresses start with "000", processor 1 is responsible for all memory
+whose addresses start with "001", etc.
+
+<li>COMA - cache-only memory architecture. Each CPU has local RAM,
+treated as cache. Cache lines migrate around to different nodes based
+on access pattern. Data only lives in cache, no permanent memory
+location. (These machines aren't too popular any more.)
+
+</ul>
+
+
+<h2>Scalable locks</h2>
+
+<p>This paper is about cost and scalability of locking; what if you
+have 10 CPUs waiting for the same lock? For example, what would
+happen if xv6 runs on an SMP with many processors?
+
+<p>What's the cost of a simple spinning acquire/release? Algorithm 1
+*without* the delays, which is like xv6's implementation of acquire
+and release (xv6 uses XCHG instead of test_and_set):
+<pre>
+  each of the 10 CPUs gets the lock in turn
+  meanwhile, remaining CPUs spin in XCHG on the lock
+  lock's cache line must be X (exclusive) in cache to run XCHG
+  otherwise all might read, then all might write
+  so the bus is busy all the time with XCHGs!
+  can we avoid constant XCHGs while the lock is held?
+</pre>
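+<p>Here is a minimal C sketch of such a spinning acquire/release, using GCC's
+__sync atomic builtins (an illustration of the idea, not xv6's actual code):
+<pre>
+struct spinlock { volatile int locked; };
+
+void acquire(struct spinlock *l) {
+  // __sync_lock_test_and_set atomically stores 1 and returns the old
+  // value; on x86 it compiles to an XCHG, so every spin iteration is
+  // an interlocked bus transaction.
+  while (__sync_lock_test_and_set(&l->locked, 1) != 0)
+    ;  // spin
+}
+
+void release(struct spinlock *l) {
+  __sync_lock_release(&l->locked);  // atomically store 0
+}
+</pre>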
+
+<p>test-and-test-and-set
+<pre>
+ only run expensive TSL if not locked
+ spin on ordinary load instruction, so cache line is S
+ acquire(l)
+ while(1){
+ while(l->locked != 0) { }
+ if(TSL(&l->locked) == 0)
+ return;
+ }
+</pre>
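+<p>In C, the same refinement might look like this (again a sketch, reusing the
+spinlock struct and GCC builtin from above; TSL in the notes stands for an
+atomic test-and-set):
+<pre>
+void acquire(struct spinlock *l) {
+  while (1) {
+    while (l->locked != 0)
+      ;  // ordinary loads: the cache line stays S, no bus traffic
+    // the lock looked free; now try one expensive interlocked operation
+    if (__sync_lock_test_and_set(&l->locked, 1) == 0)
+      return;
+  }
+}
+</pre>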
+
+<p>suppose 10 CPUs are waiting; let's count the cost in total bus
+ transactions
+<pre>
+  CPU1 gets the lock in one cycle
+  sets lock's cache line to I in the other CPUs
+  9 CPUs each use the bus once in XCHG
+  then everyone has the line S, so they spin locally
+  CPU1 releases the lock
+  CPU2 gets the lock in one cycle
+  8 CPUs each use the bus once...
+  So 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in # of CPUs!
+ Look at "test-and-test-and-set" in Figure 6
+</pre>
+<p> Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time?
+
+<p>What is the point of the exponential backoff in Algorithm 1?
+<pre>
+ Does it buy us O(n) time for n acquires?
+ Is there anything wrong with it?
+ may not be fair
+ exponential backoff may increase delay after release
+</pre>
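+<p>A sketch of Algorithm 1's delay in C, reusing the spinlock above (the delay
+constants are made up for illustration, not the paper's tuned values):
+<pre>
+enum { MAX_DELAY = 1024 };               // arbitrary cap for illustration
+
+void acquire_backoff(struct spinlock *l) {
+  int delay = 1;
+  while (__sync_lock_test_and_set(&l->locked, 1) != 0) {
+    for (volatile int i = 0; i != delay; i++)
+      ;                                  // pause without touching the lock
+    delay = delay * 2;                   // back off exponentially...
+    if (delay > MAX_DELAY)
+      delay = MAX_DELAY;                 // ...up to a cap
+  }
+}
+</pre>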
+
+<p>What's the point of the ticket locks, Algorithm 2?
+<pre>
+ one interlocked instruction to get my ticket number
+ then I spin on now_serving with ordinary load
+ release() just increments now_serving
+</pre>
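+<p>A rough C sketch of a ticket lock, assuming an atomic fetch-and-add for the
+interlocked instruction:
+<pre>
+struct ticketlock {
+  volatile unsigned next_ticket;   // next ticket to hand out
+  volatile unsigned now_serving;   // ticket of the current lock holder
+};
+
+void ticket_acquire(struct ticketlock *l) {
+  // one interlocked instruction to take a ticket
+  unsigned my = __sync_fetch_and_add(&l->next_ticket, 1);
+  while (l->now_serving != my)
+    ;                              // spin with ordinary loads
+}
+
+void ticket_release(struct ticketlock *l) {
+  l->now_serving = l->now_serving + 1;   // only the holder writes this
+}
+</pre>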
+
+<p>why is that good?
+<pre>
+ + fair
+ + no exponential backoff overshoot
+  + no spinning with interlocked instructions; waiters use ordinary loads
+</pre>
+
+<p>but what's the cost, in bus transactions?
+<pre>
+ while lock is held, now_serving is S in all caches
+ release makes it I in all caches
+  then each waiter uses a bus transaction to get the new value
+ so still O(n^2)
+</pre>
+
+<p>What's the point of the array-based queuing locks, Algorithm 3?
+<pre>
+ a lock has an array of "slots"
+ waiter allocates a slot, spins on that slot
+ release wakes up just next slot
+ so O(n) bus transactions to get through n waiters: good!
+  Anderson lines in Figures 4 and 6 are flat-ish
+  they only go up because the lock's data structures are protected by a simpler lock
+ but O(n) space *per lock*!
+</pre>
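+<p>A simplified sketch of an array-based queueing lock in the spirit of
+Algorithm 3 (the NCPU bound is an assumption for illustration, and a real
+version pads each slot to its own cache line):
+<pre>
+enum { NCPU = 16 };                  // assumed maximum number of contenders
+
+struct arraylock {
+  volatile int has_lock[NCPU];       // one slot per waiter
+  volatile unsigned next_slot;       // next slot to hand out
+};
+
+void array_init(struct arraylock *l) {
+  l->has_lock[0] = 1;                // the first acquirer proceeds at once
+  l->next_slot = 0;
+}
+
+unsigned array_acquire(struct arraylock *l) {
+  unsigned me = __sync_fetch_and_add(&l->next_slot, 1) % NCPU;
+  while (!l->has_lock[me])
+    ;                                // spin only on my own slot
+  l->has_lock[me] = 0;               // reset the slot for later reuse
+  return me;                         // caller hands this to release
+}
+
+void array_release(struct arraylock *l, unsigned me) {
+  l->has_lock[(me + 1) % NCPU] = 1;  // wake up just the next slot
+}
+</pre>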
+
+<p>Algorithm 5 (MCS), the new algorithm of the paper, uses
+compare_and_swap:
+<pre>
+int compare_and_swap(int *addr, int v1, int v2) {
+ int ret = 0;
+ // stop all memory activity and ignore interrupts
+ if (*addr == v1) {
+ *addr = v2;
+ ret = 1;
+ }
+ // resume other memory activity and take interrupts
+ return ret;
+}
+</pre>
+
+<p>What's the point of the MCS lock, Algorithm 5?
+<pre>
+ constant space per lock, rather than O(n)
+ one "qnode" per thread, used for whatever lock it's waiting for
+  lock holder's qnode is at the head of the list
+ lock variable points to end of list
+ acquire adds your qnode to end of list
+ then you spin on your own qnode
+ release wakes up next qnode
+</pre>
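+<p>A C sketch of the MCS idea (the paper's version also uses an atomic
+fetch_and_store, i.e. swap, in acquire; this sketch uses GCC's __atomic
+builtins and glosses over memory-ordering details):
+<pre>
+struct qnode {
+  struct qnode *volatile next;
+  volatile int locked;
+};
+
+struct mcslock { struct qnode *volatile tail; };  // points to end of list
+
+void mcs_acquire(struct mcslock *l, struct qnode *me) {
+  me->next = 0;
+  // atomically make myself the tail of the queue, getting the old tail back
+  struct qnode *prev = __atomic_exchange_n(&l->tail, me, __ATOMIC_ACQ_REL);
+  if (prev != 0) {                // the lock is held or has waiters
+    me->locked = 1;
+    prev->next = me;              // link myself behind my predecessor
+    while (me->locked)
+      ;                           // spin on my own qnode only
+  }
+}
+
+void mcs_release(struct mcslock *l, struct qnode *me) {
+  if (me->next == 0) {
+    // no visible successor: try to swing the tail back to empty
+    struct qnode *expected = me;
+    if (__atomic_compare_exchange_n(&l->tail, &expected, 0, 0,
+                                    __ATOMIC_RELEASE, __ATOMIC_RELAXED))
+      return;
+    while (me->next == 0)
+      ;                           // a successor is arriving; wait for its link
+  }
+  me->next->locked = 0;           // pass the lock to the next qnode
+}
+</pre>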
+
+<h2>Wait-free or non-blocking data structures</h2>
+
+<p>The previous implementations all block threads when there is
+ contention for a lock. Other atomic hardware operations allow one
+ to build wait-free implementations of data structures. For example, one
+ can implement insertion of an element into a shared list in a way that
+ doesn't block any thread. Such versions are called wait-free.
+
+<p>A linked list with locks is as follows:
+<pre>
+element *list;        // head of the shared list
+Lock list_lock;
+
+void insert(int x) {
+  element *n = new element;
+  n->x = x;
+
+  acquire(&list_lock);
+  n->next = list;
+  list = n;
+  release(&list_lock);
+}
+</pre>
+
+<p>A wait-free implementation is as follows:
+<pre>
+void insert(int x) {
+  element *n = new element;
+  n->x = x;
+  do {
+    n->next = list;
+  } while (compare_and_swap(&list, n->next, n) == 0);
+}
+</pre>
+<p>How many bus transactions are needed for 10 CPUs each to insert one element
+into the list? Could you do better?
+
+<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This
+ paper by Fraser and Harris</a> compares lock-based implementations
+ versus corresponding non-blocking implementations of a number of data
+ structures.
+
+<p>It is not possible to make every operation wait-free, and there are
+ times when we will need an implementation of acquire and release.
+ Research on non-blocking data structures is active; the last word
+ hasn't been said on this topic yet.
+
+</body>
+</html>