linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC/PATCH] FUSYN 1/10: documentation files
@ 2003-12-03 18:01 Alex Riesen
  0 siblings, 0 replies; 3+ messages in thread
From: Alex Riesen @ 2003-12-03 18:01 UTC (permalink / raw)
  To: inaky.perez-gonzalez; +Cc: linux-kernel

> +I am calling these things FUSYNs to distinguish them from the original
> +futexes they base on, as they behave kind of different (the naming
> +sucks, you are welcome to suggest new names).

FUSYNC [fu'sink]


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [RFC/PATCH] FUSYN 1/10: documentation files
  2004-01-14 22:49 [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2.1 inaky.perez-gonzalez
@ 2004-01-14 22:49 ` inaky.perez-gonzalez
  0 siblings, 0 replies; 3+ messages in thread
From: inaky.perez-gonzalez @ 2004-01-14 22:49 UTC (permalink / raw)
  To: linux-kernel; +Cc: inaky.perez-gonzalez, robustmutexes

 fusyn.txt |  559 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 559 insertions(+)

--- /dev/null	Wed Jan 14 14:39:30 2004
+++ linux/Documentation/fusyn.txt	Mon Jan  5 10:22:44 2004
@@ -0,0 +1,559 @@
+
+FUSYN - Fast User SYNChronization primitives
+
+http://developer.osdl.org/dev/robustmutexes/
+
+I am calling these things FUSYNs to distinguish them from the original
+futexes they base on, as they behave kind of different (the naming
+sucks, you are welcome to suggest new names).
+
+This is my second big time attempt to implement real-time locking in
+the Linux kernel to solve the short comings of a locking system based
+on futexes, being mainly:
+
+   - no robustness support without kernel cooperation
+
+   - no priority inheritance/protection
+
+   - no real-time wake up (priority based, no priority inversion
+     holes, etc)
+
+The main objects to implement are:
+
+  - fuqueues: Priority-sorted wait queues -- other than that, mostly
+    equal to futexes. Usable from kernel and user space.
+
+  - fulock: 
+
+    This is a full blown "mutex" implementation that can be used from
+    kernel and user space (with user-space fast [un]locking on
+    non-contended situations), robustness (if owner dies, ownership is
+    passed on), priority inheritance and (FUTURE) priority protection.
+
+    They are just a fuqueue that supports the ownership concept, to
+    allow for robustness and priority inheritace/protection.
+
+    It also supports serialized and non-serialized unlocks [see FAQ].
+  
+All the non-blocking calls (wake() and unlock() can be used from
+interrupt context; the data structures are protected by IRQ safe spin
+locks). This is heavier weight, but is needed to properly support
+priority inheritance and to avoid cache line bouncing of the spin
+locks that protect the structures.
+
+Released files:
+
+http://developer.osdl.org/dev/robustmutexes/
+
+fusyn-2.6.0XXX-2.X.patch   Patch file against Linus' tree
+fusyn-test-2.X.tar.gz      Sample test library and code
+
+
+Contents:
+---------
+
+- vlocators
+- fuqueues
+- fulocks
+- issues/future work
+- FAQ [some definitions here, try wild search]
+
+
+
+
+VLOCATORS
+---------
+
+This is an structure (struct vlocator) and the associated code to map
+a user space address to a kernel object that is kept in a hash
+table. As well, it provides and uses a reference count for the object,
+as well as a hash table cleanup function.
+
+It uses the 'struct futex_key' as done by Jamie Lokier; the code is in
+include/linux/vlocator.h and kernel/vlocator.c. 
+
+Two very simple operations: find an object in 'current's space by
+address and find-or-allocate (vl_find() and vl_locate()).
+
+The cleanup function (or garbage collector) runs periodically and
+releases items with a reference count of zero. This allows the get/put
+operations to be lockless.
+
+
+FUQUEUES
+--------
+
+Fuqueues are just wait-queues, like futexes; the differences are in
+the wake up process, as it is done not in a FIFO order but by
+priority. As well, if the task changes its priority while waiting, its
+position in the list is updated. The code is in include/linux/fuqueue.h
+and kernel/fuqueue.c.
+
+They consist of a 'struct fuqueue' which has a priority-sorted wait
+list and a lock to protect access to it. They can be used in
+kernel-space as a wait queue.
+
+Entry points:
+
+fuqueue_wait() -- wait on a fuqueue
+fuqueue_wake() -- wake N waiter from a fuqueue with code C.
+
+The code is split in various functions to allow fulocks to use the
+fuqueue stuff for integration. The wlist*_t thing is a reminder of
+something that will go away; it stands for 'wait list' and is setup
+with a #define based redirection to support different types of sorted
+wait list implementations (for example, one that is O(1) using a
+priority array -- that is huge). That is going to be deprecated in
+favor of a O(1) priority sorted list that is not as big (see FUTURE
+WORK).
+
+'struct ufuqueue' is a fuqueue plus the stuff to link it to a possibly
+shared user space address (a vfuqueue) (the vlocator), so that is the
+real futex equivalent. The code is in kernel/ufuqueue.c and just
+consists on the syscall wrappers to associate the proper ufuqueue to
+the vfuqueue and then call the fuqueue layer.
+
+
+FULOCKS
+-------
+
+The mother of the whole thing. Fulocks are a full mutex
+implementation; it is basically the concept of an owner and a list of
+tasks waiting to own the mutex (implemented with a 'struct fuqueue'). 
+
+The 'struct fulock' holds everthing. To support the different modes
+of operation (priority inheritance, priority protection, deadlock
+check and sun-mode robustness [see FUTURE WORK]) there is a 'flags'
+member. 
+
+As well, there is an ownership list node, where all the fulocks that a
+task currently owns are linked to the task (task->fulock_olist). 
+
+As well, a fulock has the concept of 'state': healthy, dead-owner or
+not-recoverable (see the FAQ for the definitions). 
+
+The entry points are [kernel/fulock.c, include/linux/fulock.h]:
+
+fulock_lock()
+fulock_unlock()
+fulock_consistency() [for manipulating the state]
+
+A user level fulock (struct ufulock) is a fulock that can be used from
+the user space--it is represented by a (potentially shared) memory
+address (a vfulock). A vlocator is used to track it. Implemented in
+kernel/ufulock.c.
+
+The vfulock may have different values that server to define the state
+of a lock:
+
+0                       Unlocked [may be fast-locked]
+PID (< VFULOCK_KCO)     Fast-locked by PID, no waiters in the
+                        kernel. [May be fast-unlocked].
+VFULOCK_KCO             Locked by someone, kernel knows who, waiters 
+                        in the kernel.
+VFULOCK_DEAD            Previous owner died (maybe unlocked or
+                        locked), the kernel keeps the status.
+VFULOCK_NR              Not recoverable.
+
+Now, when user space goes to lock a ufulock with a fast operation, it
+issues an atomic compare and swap of its PID against 0; if it
+succeeds, its the owner, done; if not, it goes to the kernel
+(sys_ufulock_lock()), who will put it to wait [see
+test/src/include/kernel-lock.h:vfulock_lock() in the fusyn-test
+package].
+
+Unlock is fairly similar: if the value is VFULOCK_{KCO,DEAD}, go to
+the kernel, sys_ufulock_unlock(); if VFULOCK_NR, return error; if not,
+it is a PID and need to do an atomic compare and exchange of zero
+(unlock) against the PID [again, check vfulock_unlock()].
+
+Now, how it works is fairly simple: the kernel will always maintain
+the vfulock and the corresponding fulock in the 'struct ufulock' in
+sync [vfulock_sync() in kernel/ufulock.c], and will do that everytime
+we enter it through one of the fulock system calls
+(sys_ufulock_{[un]lock,consistency}().
+
+The kernel will use the PID set by the fast-locker to match who is the
+owner when he doesn't know about it [afterwards it will be registered
+in the kernel)--check __fulock_id_owner() for ideas on how to avoid
+collision due to PID reuse].
+
+Once that is done, what is left is a 'fulock' that can be handled by
+the fulock layer.
+
+Now [uv]fulocks support:
+
+    - Real time: the unlock procedure is realtime in the sense that it
+      is O(1) and the next owner is the highest priority one; as well,
+      the fulock (actually, the vfulock) is never unlocked in the
+      meantime, the ownership is transferred instead of unlocking the
+      lock, waking up the first waiter and waiting for it to acquire
+      it. This avoids priority inversions by lower priority threads
+      sneaking in from other processors at the worst time.
+
+    - Deadlock checking: complex dead lock scenarios where a
+      ownership/wait chain [see definition in FAQ] is involved are
+      catched if FULOCK_FL_ERROR_CHK is set.
+
+    - Priority change support: when the priority of the waiting task
+      is changed, it's position in the list is updated. See below for
+      effects on priority inheritance.
+
+    - Robustness: when a task who is a fulock owner dies and the
+      kernel knew about it (ie: it was registered in the
+      task->fulock_list), then the fulock is made dead-owner, unlocked
+      and the next waiter gets ownership, with a -EDEADOWNER return
+      code. 
+
+      This is always enabled; user space can emulate the
+      hangs+timeouts that would happen if this were not detected.
+
+      If the kernel knew nothing about it (ie: it was fast-locked),
+      then __fulock_id_owner() will fail to map the PID in the vfulock
+      to an existing task; then the current claimer would be
+      considered the owner after marking the fulock dead-owner.
+
+      Note the comments in __fulock_id_owner() for ideas on how to
+      avoid collisions due to PID reuse.
+
+    - Priority protection: not implemented yet
+
+    - Priority inheritance: when a waiter queues for a fulock that has
+      the FULOCK_FL_PI bit set and its priority is higher than that of
+      the owner, it will boost the owner's priority to its own; this
+      will propagate in an ownership/wait chain (if the owner was
+      waiting on for a fulock, etc). As well, priority changes will
+      also be propagated.
+
+      This is done with __fulock_pi_boost(), fuqueue_chprio() and
+      __fulock_chprio() [through the fulock's 'struct
+      fuqueue_ops']. The unboosting is done in __fulock_unlock() or
+      __fulock_wait_cancel() [again, through the fulock's 'struct
+      fuqueue_ops']. 
+
+
+FUTURE WORK
+-----------
+  
+  - fucond: conditional variables; although they can be implemented
+    in user space + fuqueues, doing it in the kernel helps a lot in
+    atomicity issues (and the performance should be much better).
+
+    We tried doing that (see releases up to 1.12 in the website) and
+    generally  it sucked because of the code bloat in the kernel, so
+    we decided to extirpate it.
+  
+  - rw lock: only the first locker can do the fast operation; the
+    others go to the kernel to sign up. This way ownership is
+    supported. If a reader dies, nothing happens (after all, it is
+    supposed to be read-only access), but we need to keep track of
+    readers dying so they don't hold writers off. If a writer dies,
+    next locker (reader or writer) gets dead-owner.
+
+    These guys could also get, like this, PI and PP, as they would be
+    very similar to fulocks, but with two waiting lists. One for
+    writers, one for readers, and they allow many ownerships at the
+    same time (when there are readers).
+
+    Maybe different operation modes to primer writers over readers?
+    FIXME, need to explore.
+
+  - Spinlocks: they could be implemented as a trylock() on a fulock
+    for N loops, and after it'd degenerate into a mutex wait. This
+    wait they'd automagically support robustness, PI and PP.
+
+  - Barriers: futexes offer enough functionality for implementing
+    them, however wake up should be real-time (priority based). Not a
+    real issue though, as in barriers everybody is woken up. It can be
+    done also with fuqueues.
+
+  - Getting rid of the vlocator hash table and doing direct mapping
+    [so that we avoid the O(N) lookup] by storing in user space some
+    short of pointer to a in-kernel data struct. The pointer has to be
+    "validated", so that user space cannot have the kernel point to
+    some random or pontentially dangerous space. 
+
+    A way would be to store two values, the pointer itself plus a
+    kernel-crypted copy that the can be used to verify.
+
+    Need more research into this.
+
+  - O(1) priority list: current plist is not O(1) in addition, because
+    it has to locate the proper position in the list where to add. I
+    plan to modify the plist code to be O(N) where N is the number of
+    priority levels, and as it is fixed at compilation time, it is
+    effectively O(1). 
+
+    The idea is to have something similar to a priority array, but
+    instead of having N list heads, we have only the first node of
+    each priority being the list head, and the rest of the guys in
+    that prio hanging from him.
+
+  - Sun-mode robustness. Solaris implements robustness in a slightly
+    more restrictive way. We want to add an small compatibility layer
+    so both models can be used.
+
+  - Support for architectures that don't have atomic compare and
+    swap. Once priority protection is finished (that will involve that
+    a pure KCO method is developed), it can be solved using this;
+    however, they will loose the fast-lock path--maybe not an issue?
+    how many arches do not implement atomic compare and exchange?.
+
+  - That page pinning when waiting for a fulock...
+
+
+FAQ
+---
+
+This set of Q&A is what I use myself to track my ideas and concepts
+(and not to forget why did I decide anything).
+
+
+Q: What is PI?
+
+Priority Inheritance: when task A holds resource R and task B claims
+it, and prio (B) > prio (A), then B can force A to take its priority
+so it finishes sooner and B can take the resource ownership. The
+priority boost ends when A releases R.
+
+
+Q: What is PP?
+
+Priority Protection: resources have an associated priority ceiling;
+any task that acquires a resource will have its prio raised to that
+prioirty ceiling while holding it.
+
+
+Q: What is RM?
+
+Robust Mutex, or robustness, for short: when the owner of a resource
+dies, we want the next owner to know that somebody died while holding
+the resource, so s/he is able to determine if a cleanup is needed.
+
+
+Q: What is a healthy fulock?
+
+This is a fulock in its normal state, that is: initialized and not in
+dead-owner or not-recoverable states.
+
+
+Q: What is a dead-owner fulock?
+
+A fulock is in dead-owner state when it was locked (some task owned
+it) and the task died without unlocking it.
+
+
+Q: What is a not-recoverable fulock?
+
+A fulock is in not-recoverable state when it went into dead-owner
+state and some task that acquired it in dead-owner state decided that
+it had to be made not-recoverable.
+
+The rationale behind this is that normally you have some lock
+protecting access to some data. When the lock goes dead-owner, the
+task that owned it and died could have died in the middle of updating
+the data, and thus it can be inconsistent. Subsequent owners of the
+lock get it with the dead-owner state, so that they are aware of the
+situation. If any of them can fix it, it can move the lock back to
+healthy state and continue operating, but if there is no way to fix
+the data, it is moved to not-recoverable state.
+
+When moved, all the pending waiters are given an error code (EBADR)
+indicating the new state, so that they can bail out and report up to
+their managers for what to do. As well, new contenders that try to
+acquire the lock will get also the EBADR error code.
+
+The only way to make the fulock healthy again is to reinitialized it.
+
+
+Q: What is a dead-owner dead-lock?
+
+When some task that has to unlock a locked fulock dies and others are
+waiting for it to release the fulock.
+
+
+Q: What is a dead-owner recovery?
+
+When a lock owner dies, the next waiter or next guy who locks and gets
+ownership gets it with an special code that indicates that some
+previous owner died and that the state of the lock is "dead-owner",
+that recovery on the data structures protected by the lock must be
+done in order to ensure consistency.
+
+Once a fulock is in dead-owner state, it can be moved back to
+normal/healthy or made inconsistent (so only an initialization returns
+it to normal).
+
+
+Q: Why does the kernel have to set the value of the fulock?
+   Why cannot the value of the fulock after unlock be set by user
+   space?
+
+There is a risk of overwritten values and missed waiters.
+
+For example, task B claims fulock F (locked by task A) so it goes to
+the kernel to wait; now the fulock value is VFULOCK_KCO (kernel
+controlled ownership). Before it reaches the kernel, task C releases
+the fulock for task A; as there are no waiters, it returns UNLOCKED
+and task C has to set it to UNLOCKED, thus overwriting VFULOCK_KCO; as
+KCO is overwritten, task B is going to be dead-locked in the kernel,
+waiting.
+
+Furthermore, it helps guaranteeing robustness. If the just-woken up
+task that has to set the value the kernel passes dies before being
+able to do it, you hit a dead-owner dead-lock because nobody wakes up
+the waiters until somebody tries to lock and realizes the fulock is
+dead.
+
+
+Q: What are the two kinds of unlock?
+
+The two kinds of unlock are serialized and non-serialized.
+
+You need both because the serialized one can be slower, as it might
+force a context switch.
+
+I thought initially that this would show only in synthetic benchmarks
+(very tight loop acquiring and releasing the lock against some other
+threads doing the same thing), but I was wrong. Max Hailperin pointed
+to me that what I was seeing was the "Convoy Phenomenon", documented
+in by Mike Blasgen, Jim Gray, Mike Mitoma and Tom Price, in 1977
+[http://portal.acm.org/citation.cfm?id=850659&jmp=cit&dl=GUIDE&dl=ACM].
+
+After thinking about it, I concluded it would mostly apply in the
+following conditions:
+
+- user space only [in kernel space the lock itself is the spinlock
+  that protects the 'struct fulock'; as well, there is no preemption
+  and we spin]
+
+- non real-time environments/processes (where preemption is likely)
+
+- real-time SMP environments where two taks might compete for a lock
+  at the _same_ time (so to avoid it in this case, it might be
+  interesting to spin a wee bit before blocking).
+
+Now, in order to gain robustness you need serialization, so a
+userspace user is recommended to use serialized wake ups only when:
+
+- need *full* and complete robustness guarantee
+
+- needs real-time priority-inversion protection guarantee (in SMP, not
+  needed for UP).
+
+
+Q: What is a non-serialized unlock?
+
+A non-serialized unlock works by setting the fulock to unlocked and
+waking up as many waiters as desired. The waiters then re-contend for
+the fulock, the winner owns it and the others go back to wait on it.
+
+Main problem: can't find a way to guarantee robustness
+  
+- Say we have a fulock with M guys waiting and we wake up N (1 < N <
+  M), a non-serialized wakeup. Thus, there are M - N guys still
+  waiting in the kernel.
+  
+  In order for the unlock to be non-serialized, the waker first sets
+  the futex to UNLOCKED.
+  
+  Now, how do the woken up processes know that there are still
+  processes in the kernel?
+  
+  A solution is to set the lock not to UNLOCKED, but to
+  UNLOCKED+WP (ie: maintain the WP); this way, whowever tries to
+  acquire will see UNLOCKED+WP and will go down to the kernel to do
+  the lock operation. Also, the lock operation can return an special
+  code that says "contend, but set the lock to pid|WP" to indicate
+  waiters present.
+  
+  However, it still does not solve the fact that when setting to
+  UNLOCKED+WP and waking N, if those N die before locking, the
+  waiters go into dead-owner dead-lock.
+  
+  When somebody tries to lock that, the kernel should be able to
+  notice that there are waiters and it is unlocked and thus give
+  way to the first locker with dead-owner recover --it might be too late.
+  
+  Another option might be to tag the woken-up processes before they
+  exit the kernel, so that if they die, do_exit can trap it (but there
+  are many side issues to this, like how do I make sure that the N who
+  I woke up have gone through it, one has locked, the other N-1 have
+  gone to sleep, how do I clean it up and stuff like that--makes it
+  pretty ugly, not to talk about how many resources it'd need to tag it).
+  
+  Gosh, it is a mess -- I would say that robust mutexes have to
+  require serialized unlocks. Period.
+
+  Not even talk about adding a short timer to verify that the thing
+  was locked...shrivers
+
+  RESEARCH: tentative ownership: when we wake up some guys who are
+  supposed to go and try to lock again, tie the fulock they should
+  lock to their task_struct and on exit(), check they don't have it
+  there ... [many details need to be worked out].
+
+- It could also be solved by waking up _all_ the waiters; this way no
+  dead-owner dead-lock could ever happen; however, it is a sure recipe
+  for an scheduling storm some day.
+
+
+Q: What is a serialized unlock?
+
+A serialized unlock transfers the ownership of the fulock to the first
+waiter in the kernel. 
+
+- Only one waiter can be woken up at the same time with this method.
+
+- It prevents priority inversion (as the fulock stays locked during
+  the whole operation no low priority thread can acquire it in the
+  meantime).
+
+- dead-owner dead-lock is not possible, because the owner is always
+  known during the operation. As well, if the new owner dies on it's
+  way up to user space, its ownership is also known.
+
+Slower (still not proved seriously--postulated and proven in some
+very synthetic benchmarks) because it forces a context switch.
+
+
+Q: What is an vfulock?
+
+It is the address in user space associated to a fulock in kernel
+space. 
+
+
+Q: What is owner identification?
+
+Owner identification is a property that the fulocks have: basically,
+they can identify who is the owner based on the vfulock (in user
+space) or the internal kernel data structures that refer to it (if the
+vfulock is VFULOCK_KCO, that means that the kernel tracks the
+ownership); if vfulock < VFULOCK_KCO, it means that the ownership is
+tracked only in user space, and the vfulock is the PID of the owner.
+
+
+Q: What is a kernel-controlled-ownership fulock? (kco)
+
+A vfulock whose value is VFULOCK_KCO or VFULOCK_DEAD, indicating that
+the ownership for the fulock is tracked by the kernel [FIXME: also KCO
+fulock are those called with the FULOCK_FL_KCO flag--not implemented].
+
+This happens when:
+
+     - The fulock is locked and there are waiters on the kernel
+     - The fulock is dead (and the ownership keeps track for it)
+     - The fulock is a priority protected fulock
+
+Basically it is a way to indicate that the fastpath for
+locking/unlocking cannot be taken.
+
+
+Q: What is an ownership/wait chain?
+
+The structure that is formed when task A owns lock F and is waiting
+for lock G, owned by task B that is waiting for lock H, that is owned
+be task C that is waiting for lock I ... etc. 
+
+When this chain is circular (eg: lock I is owned by A) then there is a
+deadlock. 
\ No newline at end of file

^ permalink raw reply	[flat|nested] 3+ messages in thread

* [RFC/PATCH] FUSYN 1/10: documentation files
  2003-12-03  8:51 [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2 inaky.perez-gonzalez
@ 2003-12-03  8:51 ` inaky.perez-gonzalez
  0 siblings, 0 replies; 3+ messages in thread
From: inaky.perez-gonzalez @ 2003-12-03  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: inaky.perez-gonzalez, robustmutexes

 fusyn.txt |  531 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 531 insertions(+)

--- /dev/null	Tue Dec  2 20:07:55 2003
+++ linux/Documentation/fusyn.txt	Tue Dec  2 20:05:33 2003
@@ -0,0 +1,531 @@
+
+FUSYN - Fast User SYNChronization primitives
+
+http://developer.osdl.org/dev/robustmutexes/
+
+I am calling these things FUSYNs to distinguish them from the original
+futexes they base on, as they behave kind of different (the naming
+sucks, you are welcome to suggest new names).
+
+This is my second big time attempt to implement real-time locking in
+the Linux kernel to solve the short comings of a locking system based
+on futexes, being mainly:
+
+   - no robustness support without kernel cooperation
+
+   - no priority inheritance/protection
+
+   - no real-time wake up (priority based, no priority inversion
+     holes, etc)
+
+The main objects to implement are:
+
+  - fuqueues: Priority-sorted wait queues -- other than that, mostly
+    equal to futexes. Usable from kernel and user space.
+
+  - fulock: 
+
+    This is a full blown "mutex" implementation that can be used from
+    kernel and user space (with user-space fast [un]locking on
+    non-contended situations), robustness (if owner dies, ownership is
+    passed on), priority inheritance and (FUTURE) priority protection.
+
+    They are just a fuqueue that supports the ownership concept, to
+    allow for robustness and priority inheritace/protection.
+
+    It also supports serialized and non-serialized unlocks [see FAQ].
+  
+All the non-blocking calls (wake() and unlock() can be used from
+interrupt context; the data structures are protected by IRQ safe spin
+locks). This is heavier weight, but is needed to properly support
+priority inheritance and to avoid cache line bouncing of the spin
+locks that protect the structures.
+
+Released files:
+
+http://developer.osdl.org/dev/robustmutexes/
+
+fusyn-2.6.0XXX-2.X.patch   Patch file against Linus' tree
+fusyn-test-2.X.tar.gz      Sample test library and code
+
+
+Contents:
+---------
+
+- vlocators
+- fuqueues
+- fulocks
+- issues/future work
+- FAQ [some definitions here, try wild search]
+
+
+
+
+VLOCATORS
+---------
+
+This is an structure (struct vlocator) and the associated code to map
+a user space address to a kernel object that is kept in a hash
+table. As well, it provides and uses a reference count for the object,
+as well as a hash table cleanup function.
+
+It uses the 'struct futex_key' as done by Jamie Locker; the code is in
+include/linux/vlocator.h and kernel/vlocator.c. 
+
+Two very simple operations: find an object in 'current's space by
+address and find-or-allocate (vl_find() and vl_locate()).
+
+The cleanup function (or garbage collector) runs periodically and
+releases items with a reference count of zero. This allows the get/put
+operations to be lockless.
+
+
+FUQUEUES
+--------
+
+Fuqueues are just wait-queues, like futexes; the differences are in
+the wake up process, as it is done not in a FIFO order but by
+priority. As well, if the task changes its priority while waiting, its
+position in the list is updated. The code is in include/linux/fuqueue.h
+and kernel/fuqueue.c.
+
+They consist of a 'struct fuqueue' which has a priority-sorted wait
+list and a lock to protect access to it. They can be used in
+kernel-space as a wait queue.
+
+Entry points:
+
+fuqueue_wait() -- wait on a fuqueue
+fuqueue_wake() -- wake N waiter from a fuqueue with code C.
+
+The code is split in various functions to allow fulocks to use the
+fuqueue stuff for integration. The wlist*_t thing is a reminder of
+something that will go away; it stands for 'wait list' and is setup
+with a #define based redirection to support different types of sorted
+wait list implementations (for example, one that is O(1) using a
+priority array -- that is huge). That is going to be deprecated in
+favor of a O(1) priority sorted list that is not as big (see FUTURE
+WORK).
+
+'struct ufuqueue' is a fuqueue plus the stuff to link it to a possibly
+shared user space address (a vfuqueue) (the vlocator), so that is the
+real futex equivalent. The code is in kernel/ufuqueue.c and just
+consists on the syscall wrappers to associate the proper ufuqueue to
+the vfuqueue and then call the fuqueue layer.
+
+
+FULOCKS
+-------
+
+The mother of the whole thing. Fulocks are a full mutex
+implementation; it is basically the concept of an owner and a list of
+tasks waiting to own the mutex (implemented with a 'struct fuqueue'). 
+
+The 'struct fulock' holds everthing. To support the different modes
+of operation (priority inheritance, priority protection, deadlock
+check and sun-mode robustness [see FUTURE WORK]) there is a 'flags'
+member. 
+
+As well, there is an ownership list node, where all the fulocks that a
+task currently owns are linked to the task (task->fulock_olist). 
+
+As well, a fulock has the concept of 'state': healthy, dead-owner or
+not-recoverable (see the FAQ for the definitions). 
+
+The entry points are [kernel/fulock.c, include/linux/fulock.h]:
+
+fulock_lock()
+fulock_unlock()
+fulock_consistency() [for manipulating the state]
+
+A user level fulock (struct ufulock) is a fulock that can be used from
+the user space--it is represented by a (potentially shared) memory
+address (a vfulock). A vlocator is used to track it. Implemented in
+kernel/ufulock.c.
+
+The vfulock may have different values that server to define the state
+of a lock:
+
+0                       Unlocked [may be fast-locked]
+PID (< VFULOCK_KCO)     Fast-locked by PID, no waiters in the
+                        kernel. [May be fast-unlocked].
+VFULOCK_KCO             Locked by someone, kernel knows who, waiters 
+                        in the kernel.
+VFULOCK_DEAD            Previous owner died (maybe unlocked or
+                        locked), the kernel keeps the status.
+VFULOCK_NR              Not recoverable.
+
+Now, when user space goes to lock a ufulock with a fast operation, it
+issues an atomic compare and swap of its PID against 0; if it
+succeeds, its the owner, done; if not, it goes to the kernel
+(sys_ufulock_lock()), who will put it to wait [see
+test/src/include/kernel-lock.h:vfulock_lock() in the fusyn-test
+package].
+
+Unlock is fairly similar: if the value is VFULOCK_{KCO,DEAD}, go to
+the kernel, sys_ufulock_unlock(); if VFULOCK_NR, return error; if not,
+it is a PID and need to do an atomic compare and exchange of zero
+(unlock) against the PID [again, check vfulock_unlock()].
+
+Now, how it works is fairly simple: the kernel will always maintain
+the vfulock and the corresponding fulock in the 'struct ufulock' in
+sync [vfulock_sync() in kernel/ufulock.c], and will do that everytime
+we enter it through one of the fulock system calls
+(sys_ufulock_{[un]lock,consistency}().
+
+The kernel will use the PID set by the fast-locker to match who is the
+owner when he doesn't know about it [afterwards it will be registered
+in the kernel)--check __fulock_id_owner() for ideas on how to avoid
+collision due to PID reuse].
+
+Once that is done, what is left is a 'fulock' that can be handled by
+the fulock layer.
+
+Now [uv]fulocks support:
+
+    - Real time: the unlock procedure is realtime in the sense that it
+      is O(1) and the next owner is the highest priority one; as well,
+      the fulock (actually, the vfulock) is never unlocked in the
+      meantime, the ownership is transferred instead of unlocking the
+      lock, waking up the first waiter and waiting for it to acquire
+      it. This avoids priority inversions by lower priority threads
+      sneaking in from other processors at the worst time.
+
+    - Deadlock checking: complex dead lock scenarios where a
+      ownership/wait chain [see definition in FAQ] is involved are
+      catched if FULOCK_FL_ERROR_CHK is set.
+
+    - Priority change support: when the priority of the waiting task
+      is changed, it's position in the list is updated. See below for
+      effects on priority inheritance.
+
+    - Robustness: when a task who is a fulock owner dies and the
+      kernel knew about it (ie: it was registered in the
+      task->fulock_list), then the fulock is made dead-owner, unlocked
+      and the next waiter gets ownership, with a -EDEADOWNER return
+      code. 
+
+      This is always enabled; user space can emulate the
+      hangs+timeouts that would happen if this were not detected.
+
+      If the kernel knew nothing about it (ie: it was fast-locked),
+      then __fulock_id_owner() will fail to map the PID in the vfulock
+      to an existing task; then the current claimer would be
+      considered the owner after marking the fulock dead-owner.
+
+      Note the comments in __fulock_id_owner() for ideas on how to
+      avoid collisions due to PID reuse.
+
+    - Priority protection: not implemented yet
+
+    - Priority inheritance: when a waiter queues for a fulock that has
+      the FULOCK_FL_PI bit set and its priority is higher than that of
+      the owner, it will boost the owner's priority to its own; this
+      will propagate in an ownership/wait chain (if the owner was
+      waiting on for a fulock, etc). As well, priority changes will
+      also be propagated.
+
+      This is done with __fulock_pi_boost(), fuqueue_chprio() and
+      __fulock_chprio() [through the fulock's 'struct
+      fuqueue_ops']. The unboosting is done in __fulock_unlock() or
+      __fulock_wait_cancel() [again, through the fulock's 'struct
+      fuqueue_ops']. 
+
+
+FUTURE WORK
+-----------
+  
+  - fucond: conditional variables; although they can be implemented
+    in user space + fuqueues, doing it in the kernel helps a lot in
+    atomicity issues (and the performance should be much better).
+
+    We tried doing that (see releases up to 1.12 in the website) and
+    generally  it sucked because of the code bloat in the kernel, so
+    we decided to extirpate it.
+  
+  - rw lock: only the first locker can do the fast operation; the
+    others go to the kernel to sign up. This way ownership is
+    supported. If a reader dies, nothing happens (after all, it is
+    supposed to be read-only access), but we need to keep track of
+    readers dying so they don't hold writers off. If a writer dies,
+    next locker (reader or writer) gets dead-owner.
+
+    These guys could also get, like this, PI and PP, as they would be
+    very similar to fulocks, but with two waiting lists. One for
+    writers, one for readers, and they allow many ownerships at the
+    same time (when there are readers).
+
+    Maybe different operation modes to primer writers over readers?
+    FIXME, need to explore.
+
+  - Spinlocks: they could be implemented as a trylock() on a fulock
+    for N loops, and after it'd degenerate into a mutex wait. This
+    wait they'd automagically support robustness, PI and PP.
+
+  - Barriers: futexes offer enough functionality for implementing
+    them, however wake up should be real-time (priority based). Not a
+    real issue though, as in barriers everybody is woken up. It can be
+    done also with fuqueues.
+
+  - Getting rid of the vlocator hash table and doing direct mapping
+    [so that we avoid the O(N) lookup] by storing in user space some
+    short of pointer to a in-kernel data struct. The pointer has to be
+    "validated", so that user space cannot have the kernel point to
+    some random or pontentially dangerous space. 
+
+    A way would be to store two values, the pointer itself plus a
+    kernel-crypted copy that the can be used to verify.
+
+    Need more research into this.
+
+  - O(1) priority list: current plist is not O(1) in addition, because
+    it has to locate the proper position in the list where to add. I
+    plan to modify the plist code to be O(N) where N is the number of
+    priority levels, and as it is fixed at compilation time, it is
+    effectively O(1). 
+
+    The idea is to have something similar to a priority array, but
+    instead of having N list heads, we have only the first node of
+    each priority being the list head, and the rest of the guys in
+    that prio hanging from him.
+
+  - Sun-mode robustness. Solaris implements robustness in a slightly
+    more restrictive way. We want to add an small compatibility layer
+    so both models can be used.
+
+  - Support for architectures that don't have atomic compare and
+    swap. Once priority protection is finished (that will involve that
+    a pure KCO method is developed), it can be solved using this;
+    however, they will loose the fast-lock path--maybe not an issue?
+    how many arches do not implement atomic compare and exchange?.
+
+  - That page pinning when waiting for a fulock...
+
+
+FAQ
+---
+
+This set of Q&A is what I use myself to track my ideas and concepts
+(and not to forget why did I decide anything).
+
+
+Q: What is PI?
+
+Priority Inheritance: when task A holds resource R and task B claims
+it, and prio (B) > prio (A), then B can force A to take its priority
+so it finishes sooner and B can take the resource ownership. The
+priority boost ends when A releases R.
+
+
+Q: What is PP?
+
+Priority Protection: resources have an associated priority ceiling;
+any task that acquires a resource will have its prio raised to that
+prioirty ceiling while holding it.
+
+
+Q: What is RM?
+
+Robust Mutex, or robustness, for short: when the owner of a resource
+dies, we want the next owner to know that somebody died while holding
+the resource, so s/he is able to determine if a cleanup is needed.
+
+
+Q: What is a healthy fulock?
+
+This is a fulock in its normal state, that is: initialized and not in
+dead-owner or not-recoverable states.
+
+
+Q: What is a dead-owner fulock?
+
+A fulock is in dead-owner state when it was locked (some task owned
+it) and the task died without unlocking it.
+
+
+Q: What is a not-recoverable fulock?
+
+A fulock is in not-recoverable state when it went into dead-owner
+state and some task that acquired it in dead-owner state decided that
+it had to be made not-recoverable.
+
+The rationale behind this is that normally you have some lock
+protecting access to some data. When the lock goes dead-owner, the
+task that owned it and died could have died in the middle of updating
+the data, and thus it can be inconsistent. Subsequent owners of the
+lock get it with the dead-owner state, so that they are aware of the
+situation. If any of them can fix it, it can move the lock back to
+healthy state and continue operating, but if there is no way to fix
+the data, it is moved to not-recoverable state.
+
+When moved, all the pending waiters are given an error code (EBADR)
+indicating the new state, so that they can bail out and report up to
+their managers for what to do. As well, new contenders that try to
+acquire the lock will get also the EBADR error code.
+
+The only way to make the fulock healthy again is to reinitialized it.
+
+
+Q: What is a dead-owner dead-lock?
+
+When some task that has to unlock a locked fulock dies and others are
+waiting for it to release the fulock.
+
+
+Q: What is a dead-owner recovery?
+
+When a lock owner dies, the next waiter or next guy who locks and gets
+ownership gets it with an special code that indicates that some
+previous owner died and that the state of the lock is "dead-owner",
+that recovery on the data structures protected by the lock must be
+done in order to ensure consistency.
+
+Once a fulock is in dead-owner state, it can be moved back to
+normal/healthy or made inconsistent (so only an initialization returns
+it to normal).
+
+
+Q: Why does the kernel have to set the value of the fulock?
+   Why cannot the value of the fulock after unlock be set by user
+   space?
+
+There is a risk of overwritten values and missed waiters.
+
+For example, task B claims fulock F (locked by task A) so it goes to
+the kernel to wait; now the fulock value is VFULOCK_KCO (kernel
+controlled ownership). Before it reaches the kernel, task C releases
+the fulock for task A; as there are no waiters, it returns UNLOCKED
+and task C has to set it to UNLOCKED, thus overwriting VFULOCK_KCO; as
+KCO is overwritten, task B is going to be dead-locked in the kernel,
+waiting.
+
+Furthermore, it helps guaranteeing robustness. If the just-woken up
+task that has to set the value the kernel passes dies before being
+able to do it, you hit a dead-owner dead-lock because nobody wakes up
+the waiters until somebody tries to lock and realizes the fulock is
+dead.
+
+
+Q: What are the two kinds of unlock?
+
+The two kinds of unlock are serialized and non-serialized.
+
+
+Q: What is a non-serialized unlock?
+
+A non-serialized unlock works by setting the fulock to unlocked and
+waking up as many waiters as desired. The waiters then re-contend for
+the fulock, the winner owns it and the others go back to wait on it.
+
+Faster in synthetic benchmarks (very tight loop acquiring and
+releasing the lock against some other threads doing the same thing).
+
+Main problem: can't find a way to guarantee robustness
+  
+- Say we have a fulock with M guys waiting and we wake up N (1 < N <
+  M), a non-serialized wakeup. Thus, there are M - N guys still
+  waiting in the kernel.
+  
+  In order for the unlock to be non-serialized, the waker first sets
+  the futex to UNLOCKED.
+  
+  Now, how do the woken up processes know that there are still
+  processes in the kernel?
+  
+  A solution is to set the lock not to UNLOCKED, but to
+  UNLOCKED+WP (ie: maintain the WP); this way, whowever tries to
+  acquire will see UNLOCKED+WP and will go down to the kernel to do
+  the lock operation. Also, the lock operation can return an special
+  code that says "contend, but set the lock to pid|WP" to indicate
+  waiters present.
+  
+  However, it still does not solve the fact that when setting to
+  UNLOCKED+WP and waking N, if those N die before locking, the
+  waiters go into dead-owner dead-lock.
+  
+  When somebody tries to lock that, the kernel should be able to
+  notice that there are waiters and it is unlocked and thus give
+  way to the first locker with dead-owner recover --it might be too late.
+  
+  Another option might be to tag the woken-up processes before they
+  exit the kernel, so that if they die, do_exit can trap it (but there
+  are many side issues to this, like how do I make sure that the N who
+  I woke up have gone through it, one has locked, the other N-1 have
+  gone to sleep, how do I clean it up and stuff like that--makes it
+  pretty ugly, not to talk about how many resources it'd need to tag it).
+  
+  Gosh, it is a mess -- I would say that robust mutexes have to
+  require serialized unlocks. Period.
+
+  Not even talk about adding a short timer to verify that the thing
+  was locked...shrivers
+
+  RESEARCH: tentative ownership: when we wake up some guys who are
+  supposed to go and try to lock again, tie the fulock they should
+  lock to their task_struct and on exit(), check they don't have it
+  there ... [many details need to be worked out].
+
+- It could also be solved by waking up _all_ the waiters; this way no
+  dead-owner dead-lock could ever happen; however, it is a sure recipe
+  for an scheduling storm some day.
+
+
+Q: What is a serialized unlock?
+
+A serialized unlock transfers the ownership of the fulock to the first
+waiter in the kernel. 
+
+- Only one waiter can be woken up at the same time with this method.
+
+- It prevents priority inversion (as the fulock stays locked during
+  the whole operation no low priority thread can acquire it in the
+  meantime).
+
+- dead-owner dead-lock is not possible, because the owner is always
+  known during the operation. As well, if the new owner dies on it's
+  way up to user space, its ownership is also known.
+
+Slower (still not proved seriously--postulated and proven in some
+very synthetic benchmarks) because it forces a context switch.
+
+
+Q: What is an vfulock?
+
+It is the address in user space associated to a fulock in kernel
+space. 
+
+
+Q: What is owner identification?
+
+Owner identification is a property that the fulocks have: basically,
+they can identify who is the owner based on the vfulock (in user
+space) or the internal kernel data structures that refer to it (if the
+vfulock is VFULOCK_KCO, that means that the kernel tracks the
+ownership); if vfulock < VFULOCK_KCO, it means that the ownership is
+tracked only in user space, and the vfulock is the PID of the owner.
+
+
+Q: What is a kernel-controlled-ownership fulock? (kco)
+
+A vfulock whose value is VFULOCK_KCO or VFULOCK_DEAD, indicating that
+the ownership for the fulock is tracked by the kernel [FIXME: also KCO
+fulock are those called with the FULOCK_FL_KCO flag--not implemented].
+
+This happens when:
+
+     - The fulock is locked and there are waiters on the kernel
+     - The fulock is dead (and the ownership keeps track for it)
+     - The fulock is a priority protected fulock
+
+Basically it is a way to indicate that the fastpath for
+locking/unlocking cannot be taken.
+
+
+Q: What is an ownership/wait chain?
+
+The structure that is formed when task A owns lock F and is waiting
+for lock G, owned by task B that is waiting for lock H, that is owned
+be task C that is waiting for lock I ... etc. 
+
+When this chain is circular (eg: lock I is owned by A) then there is a
+deadlock. 
\ No newline at end of file

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2004-01-14 22:54 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-12-03 18:01 [RFC/PATCH] FUSYN 1/10: documentation files Alex Riesen
  -- strict thread matches above, loose matches on Subject: below --
2004-01-14 22:49 [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2.1 inaky.perez-gonzalez
2004-01-14 22:49 ` [RFC/PATCH] FUSYN 1/10: documentation files inaky.perez-gonzalez
2003-12-03  8:51 [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2 inaky.perez-gonzalez
2003-12-03  8:51 ` [RFC/PATCH] FUSYN 1/10: documentation files inaky.perez-gonzalez

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).