diff options
Diffstat (limited to 'web/l-scalablecoord.html')
| -rw-r--r-- | web/l-scalablecoord.html | 202 | 
1 files changed, 0 insertions, 202 deletions
| diff --git a/web/l-scalablecoord.html b/web/l-scalablecoord.html deleted file mode 100644 index da72c37..0000000 --- a/web/l-scalablecoord.html +++ /dev/null @@ -1,202 +0,0 @@ -<title>Scalable coordination</title> -<html> -<head> -</head> -<body> - -<h1>Scalable coordination</h1> - -<p>Required reading: Mellor-Crummey and Scott, Algorithms for Scalable -  Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991. - -<h2>Overview</h2> - -<p>Shared memory machines are bunch of CPUs, sharing physical memory. -Typically each processor also mantains a cache (for performance), -which introduces the problem of keep caches coherent.  If processor 1 -writes a memory location whose value processor 2 has cached, then -processor 2's cache must be updated in some way.  How? -<ul> - -<li>Bus-based schemes.  Any CPU can access "dance with" any memory -equally ("dance hall arch"). Use "Snoopy" protocols: Each CPU's cache -listens to the memory bus. With write-through architecture, invalidate -copy when see a write. Or can have "ownership" scheme with write-back -cache (E.g., Pentium cache have MESI bits---modified, exclusive, -shared, invalid). If E bit set, CPU caches exclusively and can do -write back. But bus places limits on scalability. - -<li>More scalability w. NUMA schemes (non-uniform memory access). Each -CPU comes with fast "close" memory. Slower to access memory that is -stored with another processor. Use a directory to keep track of who is -caching what.  For example, processor 0 is responsible for all memory -starting with address "000", processor 1 is responsible for all memory -starting with "001", etc. - -<li>COMA - cache-only memory architecture.  Each CPU has local RAM, -treated as cache. Cache lines migrate around to different nodes based -on access pattern. Data only lives in cache, no permanent memory -location. (These machines aren't too popular any more.) - -</ul> - - -<h2>Scalable locks</h2> - -<p>This paper is about cost and scalability of locking; what if you -have 10 CPUs waiting for the same lock?  For example, what would -happen if xv6 runs on an SMP with many processors? - -<p>What's the cost of a simple spinning acquire/release?  Algorithm 1 -*without* the delays, which is like xv6's implementation of acquire -and release (xv6 uses XCHG instead of test_and_set): -<pre> -  each of the 10 CPUs gets the lock in turn -  meanwhile, remaining CPUs in XCHG on lock -  lock must be X in cache to run XCHG -    otherwise all might read, then all might write -  so bus is busy all the time with XCHGs! -  can we avoid constant XCHGs while lock is held? -</pre> - -<p>test-and-test-and-set -<pre> -  only run expensive TSL if not locked -  spin on ordinary load instruction, so cache line is S -  acquire(l) -    while(1){ -      while(l->locked != 0) { } -      if(TSL(&l->locked) == 0) -        return; -    } -</pre> - -<p>suppose 10 CPUs are waiting, let's count cost in total bus -  transactions -<pre> -  CPU1 gets lock in one cycle -    sets lock's cache line to I in other CPUs -  9 CPUs each use bus once in XCHG -    then everyone has the line S, so they spin locally -  CPU1 release the lock -  CPU2 gets the lock in one cycle -  8 CPUs each use bus once... -  So 10 + 9 + 8 + ... = 50 transactions, O(n^2) in # of CPUs! -  Look at "test-and-test-and-set" in Figure 6 -</pre> -<p>  Can we have <i>n</i> CPUs acquire a lock in O(<i>n</i>) time? - -<p>What is the point of the exponential backoff in Algorithm 1? -<pre> -  Does it buy us O(n) time for n acquires? -  Is there anything wrong with it? -  may not be fair -  exponential backoff may increase delay after release -</pre> - -<p>What's the point of the ticket locks, Algorithm 2? -<pre> -  one interlocked instruction to get my ticket number -  then I spin on now_serving with ordinary load -  release() just increments now_serving -</pre> - -<p>why is that good? -<pre> -  + fair -  + no exponential backoff overshoot -  + no spinning on  -</pre> - -<p>but what's the cost, in bus transactions? -<pre> -  while lock is held, now_serving is S in all caches -  release makes it I in all caches -  then each waiters uses a bus transaction to get new value -  so still O(n^2) -</pre> - -<p>What's the point of the array-based queuing locks, Algorithm 3? -<pre> -    a lock has an array of "slots" -    waiter allocates a slot, spins on that slot -    release wakes up just next slot -  so O(n) bus transactions to get through n waiters: good! -  anderson lines in Figure 4 and 6 are flat-ish -    they only go up because lock data structures protected by simpler lock -  but O(n) space *per lock*! -</pre> - -<p>Algorithm 5 (MCS), the new algorithm of the paper, uses -compare_and_swap: -<pre> -int compare_and_swap(addr, v1, v2) { -  int ret = 0; -  // stop all memory activity and ignore interrupts -  if (*addr == v1) { -    *addr = v2; -    ret = 1; -  } -  // resume other memory activity and take interrupts -  return ret; -} -</pre> - -<p>What's the point of the MCS lock, Algorithm 5? -<pre> -  constant space per lock, rather than O(n) -  one "qnode" per thread, used for whatever lock it's waiting for -  lock holder's qnode points to start of list -  lock variable points to end of list -  acquire adds your qnode to end of list -    then you spin on your own qnode -  release wakes up next qnode -</pre> - -<h2>Wait-free or non-blocking data structures</h2> - -<p>The previous implementations all block threads when there is -  contention for a lock.  Other atomic hardware operations allows one -  to build implementation wait-free data structures.  For example, one -  can make an insert of an element in a shared list that don't block a -  thread.  Such versions are called wait free.  - -<p>A linked list with locks is as follows: -<pre> -Lock list_lock; - -insert(int x) { -  element *n = new Element; -  n->x = x; - -  acquire(&list_lock); -  n->next = list; -  list = n; -  release(&list_lock); -} -</pre> - -<p>A wait-free implementation is as follows: -<pre> -insert (int x) { -  element *n = new Element; -  n->x = x; -  do { -     n->next = list; -  } while (compare_and_swap (&list, n->next, n) == 0); -} -</pre> -<p>How many bus transactions with 10 CPUs inserting one element in the -list? Could you do better? - -<p><a href="http://www.cl.cam.ac.uk/netos/papers/2007-cpwl.pdf">This - paper by Fraser and Harris</a> compares lock-based implementations - versus corresponding non-blocking implementations of a number of data - structures. - -<p>It is not possible to make every operation wait-free, and there are -  times we will need an implementation of acquire and release. -  research on non-blocking data structures is active; the last word -  isn't said on this topic yet. - -</body> | 
