Diffstat (limited to 'web/l-scalablecoord.html')
-rw-r--r-- | web/l-scalablecoord.html | 202 |
1 files changed, 202 insertions, 0 deletions
diff --git a/web/l-scalablecoord.html b/web/l-scalablecoord.html
new file mode 100644
index 0000000..da72c37
--- /dev/null
+++ b/web/l-scalablecoord.html
@@ -0,0 +1,202 @@
<html>
<head>
<title>Scalable coordination</title>
</head>
<body>

<h1>Scalable coordination</h1>

<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable
Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.

<h2>Overview</h2>

<p>Shared memory machines are a bunch of CPUs sharing physical memory.
Typically each processor also maintains a cache (for performance),
which introduces the problem of keeping the caches coherent. If processor 1
writes a memory location whose value processor 2 has cached, then
processor 2's cache must be updated in some way. How?
<ul>

<li>Bus-based schemes. Any CPU can access any memory equally
("dance hall architecture"). Use "snoopy" protocols: each CPU's cache
listens to the memory bus. With a write-through architecture, a cache
invalidates its copy when it sees a write. Or can have an "ownership"
scheme with write-back caches (e.g., Pentium caches keep MESI bits per
line: modified, exclusive, shared, invalid). If the E bit is set, the
CPU holds the line exclusively and can write it back later. But the bus
places limits on scalability.

<li>More scalability with NUMA schemes (non-uniform memory access). Each
CPU comes with fast "close" memory; it is slower to access memory that
lives with another processor. Use a directory to keep track of who is
caching what. For example, processor 0 is responsible for all memory
whose addresses start with "000", processor 1 for all memory starting
with "001", etc.

<li>COMA (cache-only memory architecture). Each CPU has local RAM,
treated as a cache. Cache lines migrate around to different nodes based
on the access pattern. Data only lives in caches; there is no permanent
memory location. (These machines aren't too popular any more.)

</ul>


<h2>Scalable locks</h2>

<p>This paper is about the cost and scalability of locking; what if you
have 10 CPUs waiting for the same lock? For example, what would
happen if xv6 ran on an SMP with many processors?

<p>What's the cost of a simple spinning acquire/release? Algorithm 1
*without* the delays, which is like xv6's implementation of acquire
and release (xv6 uses XCHG instead of test_and_set):
<pre>
  each of the 10 CPUs gets the lock in turn
  meanwhile, the remaining CPUs spin in XCHG on the lock
  the lock's cache line must be X (exclusive) to run XCHG
    otherwise all might read, then all might write
  so the bus is busy all the time with XCHGs!
  can we avoid constant XCHGs while the lock is held?
</pre>

<p>Test-and-test-and-set:
<pre>
  only run the expensive TSL if the lock looks free
  spin on an ordinary load instruction, so the cache line stays S
  acquire(l)
    while(1){
      while(l->locked != 0) { }
      if(TSL(&l->locked) == 0)
        return;
    }
</pre>

<p>Suppose 10 CPUs are waiting; let's count the cost in total bus
transactions:
<pre>
  CPU1 gets the lock in one transaction
    which sets the lock's cache line to I in the other CPUs
  the other 9 CPUs each use the bus once in XCHG
  then everyone has the line S, so they spin locally
  CPU1 releases the lock
  CPU2 gets the lock in one transaction
  8 CPUs each use the bus once...
  so 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in the number of CPUs!
  look at "test-and-test-and-set" in Figure 6
</pre>
<p>Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time?

<p>What is the point of the exponential backoff in Algorithm 1?
<pre>
  Does it buy us O(n) time for n acquires?
  Is there anything wrong with it?
    may not be fair
    exponential backoff may increase delay after release
</pre>

<p>What's the point of the ticket locks, Algorithm 2? (A concrete
sketch appears below, after the MCS discussion.)
<pre>
  one interlocked instruction to get my ticket number
  then I spin on now_serving with ordinary loads
  release() just increments now_serving
</pre>

<p>Why is that good?
<pre>
  + fair
  + no exponential backoff overshoot
  + no spinning on the lock with interlocked instructions, just ordinary loads
</pre>

<p>But what's the cost, in bus transactions?
<pre>
  while the lock is held, now_serving is S in all caches
  release makes it I in all caches
  then each waiter uses a bus transaction to fetch the new value
  so still O(n^2)
</pre>

<p>What's the point of the array-based queuing locks, Algorithm 3?
<pre>
  a lock has an array of "slots"
  a waiter allocates a slot and spins on that slot
  release wakes up just the next slot
  so O(n) bus transactions to get through n waiters: good!
  the Anderson lines in Figures 4 and 6 are flat-ish
    they only go up because the lock data structures are protected by a simpler lock
  but O(n) space *per lock*!
</pre>

<p>Algorithm 5 (MCS), the new algorithm of the paper, uses
compare_and_swap:
<pre>
int compare_and_swap(addr, v1, v2) {
  int ret = 0;
  // stop all other memory activity and ignore interrupts
  if (*addr == v1) {
    *addr = v2;
    ret = 1;
  }
  // resume other memory activity and take interrupts
  return ret;
}
</pre>

<p>What's the point of the MCS lock, Algorithm 5?
<pre>
  constant space per lock, rather than O(n)
  one "qnode" per thread, used for whatever lock it's waiting for
  the lock holder's qnode is at the head of the list
  the lock variable points to the tail of the list
  acquire adds your qnode to the tail of the list
    then you spin on your own qnode
  release wakes up just the next qnode
</pre>
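<p>To make Algorithm 2 concrete, here is a minimal ticket-lock sketch in C.
This is not the paper's pseudocode: the struct layout, the names, and the
use of a GCC-style __sync_fetch_and_add builtin for the interlocked
increment are illustrative assumptions.
<pre>
// Ticket lock sketch (illustrative, not the paper's pseudocode).
// Assumes a GCC-style __sync_fetch_and_add atomic builtin.
typedef struct {
  volatile unsigned int next_ticket;   // next ticket to hand out
  volatile unsigned int now_serving;   // ticket currently allowed in
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
  // one interlocked instruction to get my ticket number
  unsigned int my_ticket = __sync_fetch_and_add(&l->next_ticket, 1);
  // then spin on now_serving with ordinary loads
  while (l->now_serving != my_ticket)
    ;   // the paper adds proportional backoff here
}

void ticket_release(ticket_lock *l) {
  // only the holder writes now_serving, so a plain increment is enough
  l->now_serving = l->now_serving + 1;
}
</pre>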
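<p>Here is a similar minimal C sketch of the MCS list-based lock of
Algorithm 5. Again this is illustrative rather than the paper's pseudocode;
it assumes GCC-style __sync builtins for the atomic exchange and
compare-and-swap.
<pre>
// MCS queue lock sketch (illustrative; see the paper's Algorithm 5).
typedef struct mcs_qnode {
  struct mcs_qnode *volatile next;
  volatile int locked;                 // 1 while this CPU must keep waiting
} mcs_qnode;

typedef struct {
  mcs_qnode *volatile tail;            // the lock variable points to the tail of the queue
} mcs_lock;

void mcs_acquire(mcs_lock *l, mcs_qnode *me) {
  me->next = 0;
  // atomically append my qnode to the tail of the queue (atomic exchange)
  mcs_qnode *prev = __sync_lock_test_and_set(&l->tail, me);
  if (prev != 0) {                     // lock is held; queue up behind prev
    me->locked = 1;
    prev->next = me;
    while (me->locked)                 // spin only on my own qnode
      ;
  }
}

void mcs_release(mcs_lock *l, mcs_qnode *me) {
  if (me->next == 0) {
    // no known successor: try to swing the tail back to empty
    if (__sync_bool_compare_and_swap(&l->tail, me, 0))
      return;
    while (me->next == 0)              // a successor is mid-acquire; wait for it to link in
      ;
  }
  me->next->locked = 0;                // wake up just the next qnode
}
</pre>
<p>Note how release touches only the successor's qnode; that is why the
handoff cost stays constant and n waiters get through in O(n) bus
transactions.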
<h2>Wait-free or non-blocking data structures</h2>

<p>The previous implementations all block threads when there is
contention for a lock. Other atomic hardware operations allow one
to build wait-free implementations of data structures. For example, one
can implement an insert of an element into a shared list that never
blocks a thread. Such versions are called wait-free.

<p>A linked list with locks is as follows:
<pre>
element *list;
Lock list_lock;

void insert(int x) {
  element *n = new element;
  n->x = x;

  acquire(&list_lock);
  n->next = list;
  list = n;
  release(&list_lock);
}
</pre>

<p>A wait-free implementation is as follows:
<pre>
void insert(int x) {
  element *n = new element;
  n->x = x;
  do {
    n->next = list;
  } while (compare_and_swap(&list, n->next, n) == 0);
}
</pre>
<p>How many bus transactions are there with 10 CPUs each inserting one
element into the list? Could you do better?

<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This
paper by Fraser and Harris</a> compares lock-based implementations
with corresponding non-blocking implementations of a number of data
structures.

<p>It is not possible to make every operation wait-free, and there are
times we will need an implementation of acquire and release.
Research on non-blocking data structures is active; the last word
hasn't been said on this topic yet.

</body>
</html>