linux-kernel.vger.kernel.org archive mirror
* [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
@ 2003-12-03  8:51 inaky.perez-gonzalez
  2003-12-03  8:51 ` [RFC/PATCH] FUSYN 1/10: documentation files inaky.perez-gonzalez
  2003-12-04  4:12 ` [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2 Jamie Lokier
  0 siblings, 2 replies; 9+ messages in thread
From: inaky.perez-gonzalez @ 2003-12-03  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: inaky.perez-gonzalez, robustmutexes

Hi All

This code proposes an implementation of kernel-based mutexes,
taking ideas from the existing implementations built on futexes
(namely NPTL), intending to solve their limitations (see doc)
and adding some features, such as real-time behavior,
priority inheritance and protection, deadlock detection and
robustness.

Please look at the first patch, which contains the documentation,
for information on how it is implemented.

The patch is split in the following parts:

1/10: documentation files
2/10: modifications to the core
3/10: Support for i386
4/10: Support for ia64
5/10: kernel fuqueues
6/10: user space/kernel space tracker
7/10: user space fuqueues
8/10: kernel fulocks
9/10: stub for priority protection
10/10: user space fulocks

We have a site at http://developer.osdl.org/dev/robustmutexes
with references to all the releases, test code and NPTL
modifications (rtnptl) to use this code. As well, the patch
is there as a single file, in case you don't want to apply
the parts manually.



* [RFC/PATCH] FUSYN 1/10: documentation files
  2003-12-03  8:51 [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2 inaky.perez-gonzalez
@ 2003-12-03  8:51 ` inaky.perez-gonzalez
  2003-12-03  8:51   ` [RFC/PATCH] FUSYN 2/10: modifications to the core inaky.perez-gonzalez
  2003-12-04  4:12 ` [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2 Jamie Lokier
  1 sibling, 1 reply; 9+ messages in thread
From: inaky.perez-gonzalez @ 2003-12-03  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: inaky.perez-gonzalez, robustmutexes

 fusyn.txt |  531 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 531 insertions(+)

--- /dev/null	Tue Dec  2 20:07:55 2003
+++ linux/Documentation/fusyn.txt	Tue Dec  2 20:05:33 2003
@@ -0,0 +1,531 @@
+
+FUSYN - Fast User SYNChronization primitives
+
+http://developer.osdl.org/dev/robustmutexes/
+
+I am calling these things FUSYNs to distinguish them from the original
+futexes they are based on, as they behave somewhat differently (the
+naming sucks; you are welcome to suggest new names).
+
+This is my second big-time attempt to implement real-time locking in
+the Linux kernel to solve the shortcomings of a locking system based
+on futexes, mainly:
+
+   - no robustness support without kernel cooperation
+
+   - no priority inheritance/protection
+
+   - no real-time wake up (priority based, no priority inversion
+     holes, etc)
+
+The main objects to implement are:
+
+  - fuqueues: Priority-sorted wait queues -- other than that, mostly
+    equal to futexes. Usable from kernel and user space.
+
+  - fulock: 
+
+    This is a full-blown "mutex" implementation that can be used
+    from kernel and user space, with user-space fast [un]locking in
+    non-contended situations, robustness (if the owner dies,
+    ownership is passed on), priority inheritance and (FUTURE)
+    priority protection.
+
+    They are just a fuqueue that supports the ownership concept, to
+    allow for robustness and priority inheritance/protection.
+
+    It also supports serialized and non-serialized unlocks [see FAQ].
+  
+All the non-blocking calls (wake() and unlock()) can be used from
+interrupt context; the data structures are protected by IRQ-safe
+spin locks. This is heavier weight, but it is needed to properly
+support priority inheritance and to avoid cache-line bouncing of the
+spin locks that protect the structures.
+
+Released files:
+
+http://developer.osdl.org/dev/robustmutexes/
+
+fusyn-2.6.0XXX-2.X.patch   Patch file against Linus' tree
+fusyn-test-2.X.tar.gz      Sample test library and code
+
+
+Contents:
+---------
+
+- vlocators
+- fuqueues
+- fulocks
+- issues/future work
+- FAQ [some definitions here, try wild search]
+
+
+
+
+VLOCATORS
+---------
+
+This is a structure (struct vlocator) and the associated code to map
+a user space address to a kernel object that is kept in a hash
+table. It also provides a reference count for the object and a hash
+table cleanup function.
+
+It uses the 'struct futex_key' as done by Jamie Lokier; the code is
+in include/linux/vlocator.h and kernel/vlocator.c.
+
+Two very simple operations: find an object in 'current's space by
+address and find-or-allocate (vl_find() and vl_locate()).
+
+The cleanup function (or garbage collector) runs periodically and
+releases items with a reference count of zero. This allows the get/put
+operations to be lockless.
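The find and find-or-allocate operations can be sketched in a minimal, single-threaded user-space model; the names vl_find()/vl_locate() follow the text, but the structure layout, the plain-address key, and the GC loop here are illustrative assumptions, not the real kernel/vlocator.c code:

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified sketch of the vlocator idea: map a key (here a plain
 * address standing in for the futex_key) to a refcounted object in
 * a hash table; a periodic GC pass frees zero-ref entries. */

#define VL_BUCKETS 64

struct vlocator {
    struct vlocator *next;
    unsigned long key;          /* stands in for the futex_key */
    int refcount;
};

static struct vlocator *vl_hash[VL_BUCKETS];

static unsigned vl_bucket(unsigned long key) { return key % VL_BUCKETS; }

/* vl_find(): look up an existing object, taking a reference. */
struct vlocator *vl_find(unsigned long key)
{
    struct vlocator *vl;
    for (vl = vl_hash[vl_bucket(key)]; vl; vl = vl->next)
        if (vl->key == key) { vl->refcount++; return vl; }
    return NULL;
}

/* vl_locate(): find-or-allocate. */
struct vlocator *vl_locate(unsigned long key)
{
    struct vlocator *vl = vl_find(key);
    if (vl)
        return vl;
    vl = calloc(1, sizeof(*vl));
    vl->key = key;
    vl->refcount = 1;
    vl->next = vl_hash[vl_bucket(key)];
    vl_hash[vl_bucket(key)] = vl;
    return vl;
}

void vl_put(struct vlocator *vl) { vl->refcount--; }

/* Garbage collector: frees entries whose refcount dropped to zero,
 * so the get/put paths themselves never need to free anything. */
void vl_gc(void)
{
    unsigned i;
    for (i = 0; i < VL_BUCKETS; i++) {
        struct vlocator **pp = &vl_hash[i];
        while (*pp) {
            if ((*pp)->refcount == 0) {
                struct vlocator *dead = *pp;
                *pp = dead->next;
                free(dead);
            } else
                pp = &(*pp)->next;
        }
    }
}
```
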
+
+
+FUQUEUES
+--------
+
+Fuqueues are just wait queues, like futexes; the difference is in the
+wake-up process, which is done not in FIFO order but by priority. As
+well, if a task changes its priority while waiting, its position in
+the list is updated. The code is in include/linux/fuqueue.h
+and kernel/fuqueue.c.
+
+They consist of a 'struct fuqueue' which has a priority-sorted wait
+list and a lock to protect access to it. They can be used in
+kernel-space as a wait queue.
+
+Entry points:
+
+fuqueue_wait() -- wait on a fuqueue
+fuqueue_wake() -- wake N waiters from a fuqueue with code C.
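The priority-ordered wait/wake behavior can be modeled with a toy singly-linked list; the types and the single-waiter fuqueue_wake() here are simplifications for illustration (the real code uses the wlist abstraction and wakes N waiters):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a fuqueue: the wait list is kept sorted by priority
 * (lower number = higher priority, RT-style), so wakeups happen in
 * priority order, not FIFO order. */

struct waiter {
    struct waiter *next;
    int prio;
    int task_id;
};

struct fuqueue { struct waiter *wlist; };

/* fuqueue_wait(): enqueue in priority order (FIFO within a prio). */
void fuqueue_wait(struct fuqueue *q, struct waiter *w)
{
    struct waiter **pp = &q->wlist;
    while (*pp && (*pp)->prio <= w->prio)
        pp = &(*pp)->next;
    w->next = *pp;
    *pp = w;
}

/* fuqueue_wake(): pop the highest-priority waiter; returns its id,
 * or -1 if the queue is empty. */
int fuqueue_wake(struct fuqueue *q)
{
    struct waiter *w = q->wlist;
    if (!w)
        return -1;
    q->wlist = w->next;
    return w->task_id;
}

/* Priority change while waiting: unlink and re-insert in order, as
 * the text describes for tasks whose priority changes mid-wait. */
void fuqueue_chprio(struct fuqueue *q, struct waiter *w, int new_prio)
{
    struct waiter **pp = &q->wlist;
    while (*pp && *pp != w)
        pp = &(*pp)->next;
    if (*pp)
        *pp = w->next;          /* unlink */
    w->prio = new_prio;
    fuqueue_wait(q, w);         /* re-insert in the right spot */
}
```
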
+
+The code is split in various functions to allow fulocks to use the
+fuqueue machinery for integration. The wlist*_t thing is a reminder
+of something that will go away; it stands for 'wait list' and is set
+up with a #define-based redirection to support different types of
+sorted wait-list implementations (for example, one that is O(1)
+using a priority array -- that one is huge). It is going to be
+deprecated in favor of an O(1) priority-sorted list that is not as
+big (see FUTURE WORK).
+
+'struct ufuqueue' is a fuqueue plus the stuff to link it to a
+possibly shared user space address (a vfuqueue) via the vlocator, so
+it is the real futex equivalent. The code is in kernel/ufuqueue.c
+and just consists of the syscall wrappers that associate the proper
+ufuqueue with the vfuqueue and then call the fuqueue layer.
+
+
+FULOCKS
+-------
+
+The mother of the whole thing. Fulocks are a full mutex
+implementation; it is basically the concept of an owner and a list of
+tasks waiting to own the mutex (implemented with a 'struct fuqueue'). 
+
+The 'struct fulock' holds everything. To support the different modes
+of operation (priority inheritance, priority protection, deadlock
+checking and Sun-mode robustness [see FUTURE WORK]) there is a
+'flags' member.
+
+As well, there is an ownership list node, where all the fulocks that a
+task currently owns are linked to the task (task->fulock_olist). 
+
+As well, a fulock has the concept of 'state': healthy, dead-owner or
+not-recoverable (see the FAQ for the definitions). 
+
+The entry points are [kernel/fulock.c, include/linux/fulock.h]:
+
+fulock_lock()
+fulock_unlock()
+fulock_consistency() [for manipulating the state]
+
+A user level fulock (struct ufulock) is a fulock that can be used from
+the user space--it is represented by a (potentially shared) memory
+address (a vfulock). A vlocator is used to track it. Implemented in
+kernel/ufulock.c.
+
+The vfulock may take different values that serve to define the state
+of a lock:
+
+0                       Unlocked [may be fast-locked]
+PID (< VFULOCK_KCO)     Fast-locked by PID, no waiters in the
+                        kernel. [May be fast-unlocked].
+VFULOCK_KCO             Locked by someone, kernel knows who, waiters 
+                        in the kernel.
+VFULOCK_DEAD            Previous owner died (maybe unlocked or
+                        locked), the kernel keeps the status.
+VFULOCK_NR              Not recoverable.
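The table reads naturally as a decode function. A sketch follows; note the concrete numeric values of the VFULOCK_* constants here are assumptions for illustration (the real values live in the fulock headers), and only the "anything below VFULOCK_KCO is a PID" rule is taken from the text:

```c
#include <assert.h>

/* Assumed values -- the real constants come from the fulock headers. */
#define VFULOCK_UNLOCKED 0u
#define VFULOCK_KCO  0xfffffffdu
#define VFULOCK_DEAD 0xfffffffeu
#define VFULOCK_NR   0xffffffffu

enum vfulock_state {
    V_UNLOCKED, V_FAST_LOCKED, V_KCO, V_DEAD, V_NOT_RECOVERABLE
};

/* Decode a vfulock word per the table above. */
enum vfulock_state vfulock_decode(unsigned v)
{
    if (v == VFULOCK_UNLOCKED) return V_UNLOCKED;
    if (v == VFULOCK_KCO)      return V_KCO;
    if (v == VFULOCK_DEAD)     return V_DEAD;
    if (v == VFULOCK_NR)       return V_NOT_RECOVERABLE;
    return V_FAST_LOCKED;      /* anything else is the owner's PID */
}
```
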
+
+Now, when user space goes to lock a ufulock with a fast operation,
+it issues an atomic compare-and-swap of its PID against 0; if that
+succeeds, it is the owner, done; if not, it goes to the kernel
+(sys_ufulock_lock()), which will put it to wait [see
+test/src/include/kernel-lock.h:vfulock_lock() in the fusyn-test
+package].
+
+Unlock is fairly similar: if the value is VFULOCK_{KCO,DEAD}, go to
+the kernel, sys_ufulock_unlock(); if VFULOCK_NR, return an error;
+otherwise it is a PID, and the caller needs to do an atomic
+compare-and-exchange of zero (unlocked) against the PID [again,
+check vfulock_unlock()].
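Those two fast paths can be sketched in user-space C. This is a hedged simplification, not the fusyn-test code: the GCC `__sync_bool_compare_and_swap` builtin stands in for the per-arch atomic op, the kernel fallbacks are stubbed, and the VFULOCK_* values are assumed:

```c
#include <assert.h>

#define VFULOCK_KCO  0xfffffffdu   /* assumed value */
#define VFULOCK_NR   0xffffffffu   /* assumed value */

static int kernel_calls;           /* counts stubbed "syscalls" */
static int sys_ufulock_lock(unsigned *v)   { (void)v; kernel_calls++; return 0; }
static int sys_ufulock_unlock(unsigned *v) { (void)v; kernel_calls++; return 0; }

int vfulock_lock(unsigned *vfulock, unsigned pid)
{
    /* Fast path: CAS 0 -> pid; success means we own it, no syscall. */
    if (__sync_bool_compare_and_swap(vfulock, 0, pid))
        return 0;
    return sys_ufulock_lock(vfulock);    /* contended: wait in kernel */
}

int vfulock_unlock(unsigned *vfulock, unsigned pid)
{
    unsigned v = *vfulock;
    if (v == VFULOCK_NR)
        return -1;                       /* not recoverable: error */
    if (v >= VFULOCK_KCO)                /* KCO or DEAD: kernel owns state */
        return sys_ufulock_unlock(vfulock);
    /* Fast path: CAS pid -> 0. */
    if (__sync_bool_compare_and_swap(vfulock, pid, 0))
        return 0;
    return sys_ufulock_unlock(vfulock);  /* raced with a new waiter */
}
```
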
+
+Now, how it works is fairly simple: the kernel will always keep the
+vfulock and the corresponding fulock in the 'struct ufulock' in sync
+[vfulock_sync() in kernel/ufulock.c], and will do that every time we
+enter it through one of the fulock system calls
+(sys_ufulock_{[un]lock,consistency}()).
+
+The kernel will use the PID set by the fast-locker to identify the
+owner when it doesn't know about it (afterwards the owner will be
+registered in the kernel) -- check __fulock_id_owner() for ideas on
+how to avoid collisions due to PID reuse.
+
+Once that is done, what is left is a 'fulock' that can be handled by
+the fulock layer.
+
+Now [uv]fulocks support:
+
+    - Real time: the unlock procedure is real-time in the sense that
+      it is O(1) and the next owner is the highest-priority waiter.
+      As well, the fulock (actually, the vfulock) is never unlocked
+      in the meantime: instead of unlocking the lock, waking up the
+      first waiter and waiting for it to acquire it, the ownership
+      is transferred directly. This avoids priority inversion by
+      lower-priority threads sneaking in from other processors at
+      the worst time.
+
+    - Deadlock checking: complex deadlock scenarios where an
+      ownership/wait chain [see definition in FAQ] is involved are
+      caught if FULOCK_FL_ERROR_CHK is set.
+
+    - Priority change support: when the priority of a waiting task
+      is changed, its position in the list is updated. See below for
+      the effects on priority inheritance.
+
+    - Robustness: when a task that owns a fulock dies and the
+      kernel knew about it (ie: it was registered in the
+      task->fulock_olist), the fulock is marked dead-owner, unlocked,
+      and the next waiter gets ownership, with a -EDEADOWNER return
+      code.
+
+      This is always enabled; user space can emulate the
+      hangs+timeouts that would happen if this were not detected.
+
+      If the kernel knew nothing about it (ie: it was fast-locked),
+      then __fulock_id_owner() will fail to map the PID in the
+      vfulock to an existing task; the current claimer is then
+      considered the owner after the fulock is marked dead-owner.
+
+      Note the comments in __fulock_id_owner() for ideas on how to
+      avoid collisions due to PID reuse.
+
+    - Priority protection: not implemented yet
+
+    - Priority inheritance: when a waiter queues for a fulock that
+      has the FULOCK_FL_PI bit set and its priority is higher than
+      that of the owner, it will boost the owner's priority to its
+      own; this will propagate along an ownership/wait chain (if the
+      owner was itself waiting for a fulock, etc). As well, priority
+      changes will also be propagated.
+
+      This is done with __fulock_pi_boost(), fuqueue_chprio() and
+      __fulock_chprio() [through the fulock's 'struct
+      fuqueue_ops']. The unboosting is done in __fulock_unlock() or
+      __fulock_wait_cancel() [again, through the fulock's 'struct
+      fuqueue_ops']. 
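The boost-propagation walk can be sketched with two small structs; this is a single-threaded illustration of the chain walk described for __fulock_pi_boost()/fuqueue_chprio(), with all locking and the real task/fulock layouts omitted (lower number = higher priority):

```c
#include <assert.h>
#include <stddef.h>

struct task;
struct fulock { struct task *owner; };
struct task {
    int prio;                   /* effective (possibly boosted) prio */
    struct fulock *waiting_on;  /* fulock this task is blocked on */
};

/* When 'waiter' blocks on 'f' and outranks the owner, boost the
 * owner; if that owner is itself waiting on another fulock, keep
 * walking up the ownership/wait chain. */
void fulock_pi_boost(struct fulock *f, struct task *waiter)
{
    int prio = waiter->prio;
    while (f && f->owner && f->owner->prio > prio) {
        f->owner->prio = prio;      /* boost this owner */
        f = f->owner->waiting_on;   /* propagate up the chain */
    }
}
```
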
+
+
+FUTURE WORK
+-----------
+  
+  - fucond: condition variables; although they can be implemented
+    in user space + fuqueues, doing it in the kernel helps a lot
+    with atomicity issues (and the performance should be much
+    better).
+
+    We tried doing that (see releases up to 1.12 on the website) and
+    generally it sucked because of the code bloat in the kernel, so
+    we decided to extirpate it.
+  
+  - rw lock: only the first locker can do the fast operation; the
+    others go to the kernel to sign up. This way ownership is
+    supported. If a reader dies, nothing happens (after all, it is
+    supposed to be read-only access), but we need to keep track of
+    readers dying so they don't hold writers off. If a writer dies,
+    next locker (reader or writer) gets dead-owner.
+
+    These guys could also get, like this, PI and PP, as they would be
+    very similar to fulocks, but with two waiting lists. One for
+    writers, one for readers, and they allow many ownerships at the
+    same time (when there are readers).
+
+    Maybe different operation modes to favor writers over readers?
+    FIXME, need to explore.
+
+  - Spinlocks: they could be implemented as a trylock() on a fulock
+    for N loops, after which it would degenerate into a mutex wait.
+    This way they'd automagically support robustness, PI and PP.
+
+  - Barriers: futexes offer enough functionality for implementing
+    them, however wake up should be real-time (priority based). Not a
+    real issue though, as in barriers everybody is woken up. It can be
+    done also with fuqueues.
+
+  - Getting rid of the vlocator hash table and doing direct mapping
+    [so that we avoid the O(N) lookup] by storing in user space some
+    sort of pointer to an in-kernel data structure. The pointer has
+    to be "validated", so that user space cannot make the kernel
+    point to some random or potentially dangerous place.
+
+    A way would be to store two values: the pointer itself plus a
+    kernel-encrypted copy that can be used to verify it.
+
+    Need more research into this.
+
+  - O(1) priority list: the current plist is not O(1) on addition,
+    because it has to locate the proper position in the list at
+    which to add. I plan to modify the plist code to be O(N), where
+    N is the number of priority levels; as that is fixed at
+    compilation time, it is effectively O(1).
+
+    The idea is to have something similar to a priority array, but
+    instead of having N list heads, only the first node of each
+    priority acts as the list head, with the rest of the nodes at
+    that priority hanging from it.
+
+  - Sun-mode robustness. Solaris implements robustness in a
+    slightly more restrictive way. We want to add a small
+    compatibility layer so both models can be used.
+
+  - Support for architectures that don't have an atomic
+    compare-and-swap. Once priority protection is finished (which
+    will involve developing a pure KCO method), it can be solved
+    using that; however, those architectures will lose the
+    fast-lock path -- maybe not an issue? How many arches do not
+    implement an atomic compare-and-exchange?
+
+  - That page pinning when waiting for a fulock...
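The head-per-priority list proposed in the O(1) priority list item above can be sketched as follows; the two-pointer node layout is an illustrative guess at the design (a real version would also keep a tail per level so same-priority appends are O(1) rather than a walk):

```c
#include <assert.h>
#include <stddef.h>

/* At most one node per priority sits on the main list; further
 * nodes of the same priority hang off that node's 'same' list, so
 * finding the insertion point walks at most one node per priority
 * level -- a compile-time constant, hence "effectively O(1)". */
struct pnode {
    int prio;
    struct pnode *next;   /* next priority level */
    struct pnode *same;   /* next node at this priority */
};

void plist_add(struct pnode **head, struct pnode *n)
{
    struct pnode **pp = head;
    n->next = NULL;
    n->same = NULL;
    while (*pp && (*pp)->prio < n->prio)
        pp = &(*pp)->next;
    if (*pp && (*pp)->prio == n->prio) {   /* hang off the level head */
        struct pnode *h = *pp;
        while (h->same)
            h = h->same;
        h->same = n;
    } else {                               /* new priority level */
        n->next = *pp;
        *pp = n;
    }
}

/* Pop the highest-priority node (lowest number), FIFO within a
 * level; a same-prio sibling, if any, becomes the new level head. */
struct pnode *plist_pop(struct pnode **head)
{
    struct pnode *h = *head;
    if (!h)
        return NULL;
    if (h->same) {
        h->same->next = h->next;
        *head = h->same;
    } else
        *head = h->next;
    return h;
}
```
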
+
+
+FAQ
+---
+
+This set of Q&A is what I use myself to track my ideas and concepts
+(and not to forget why I decided anything).
+
+
+Q: What is PI?
+
+Priority Inheritance: when task A holds resource R and task B claims
+it, and prio (B) > prio (A), then B can force A to take its priority
+so it finishes sooner and B can take the resource ownership. The
+priority boost ends when A releases R.
+
+
+Q: What is PP?
+
+Priority Protection: resources have an associated priority ceiling;
+any task that acquires a resource will have its prio raised to that
+priority ceiling while holding it.
+
+
+Q: What is RM?
+
+Robust Mutex, or robustness, for short: when the owner of a resource
+dies, we want the next owner to know that somebody died while holding
+the resource, so s/he is able to determine if a cleanup is needed.
+
+
+Q: What is a healthy fulock?
+
+This is a fulock in its normal state, that is: initialized and not in
+dead-owner or not-recoverable states.
+
+
+Q: What is a dead-owner fulock?
+
+A fulock is in dead-owner state when it was locked (some task owned
+it) and the task died without unlocking it.
+
+
+Q: What is a not-recoverable fulock?
+
+A fulock is in not-recoverable state when it went into dead-owner
+state and some task that acquired it in dead-owner state decided that
+it had to be made not-recoverable.
+
+The rationale behind this is that normally you have some lock
+protecting access to some data. When the lock goes dead-owner, the
+task that owned it and died could have died in the middle of updating
+the data, and thus it can be inconsistent. Subsequent owners of the
+lock get it with the dead-owner state, so that they are aware of the
+situation. If any of them can fix it, it can move the lock back to
+healthy state and continue operating, but if there is no way to fix
+the data, it is moved to not-recoverable state.
+
+When moved, all the pending waiters are given an error code (EBADR)
+indicating the new state, so that they can bail out and report up to
+their managers for what to do. As well, new contenders that try to
+acquire the lock will also get the EBADR error code.
+
+The only way to make the fulock healthy again is to reinitialize it.
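The healthy / dead-owner / not-recoverable life cycle described in the last few answers can be written down as a small state machine, roughly what fulock_consistency() manipulates. The function names, return conventions, and the -53 stand-in for -EBADR are assumptions for this sketch:

```c
#include <assert.h>

enum fstate { HEALTHY, DEAD_OWNER, NOT_RECOVERABLE };
#define EBADR_ERR (-53)   /* assumed numeric stand-in for -EBADR */

struct fulock_st { enum fstate state; };

/* Owner died while holding the lock: healthy -> dead-owner. */
void fulock_owner_died(struct fulock_st *f) { f->state = DEAD_OWNER; }

/* fulock_consistency()-style transitions; 0 on success. */
int fulock_make_healthy(struct fulock_st *f)
{
    if (f->state != DEAD_OWNER)
        return -1;                 /* only recover from dead-owner */
    f->state = HEALTHY;
    return 0;
}

int fulock_make_not_recoverable(struct fulock_st *f)
{
    if (f->state != DEAD_OWNER)
        return -1;                 /* only reachable via dead-owner */
    f->state = NOT_RECOVERABLE;
    return 0;
}

/* New contenders on a not-recoverable lock get EBADR. */
int fulock_try_lock(struct fulock_st *f)
{
    if (f->state == NOT_RECOVERABLE)
        return EBADR_ERR;
    return 0;   /* healthy or dead-owner: lock proceeds (flagged) */
}

/* Reinitialization is the only way out of not-recoverable. */
void fulock_reinit(struct fulock_st *f) { f->state = HEALTHY; }
```
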
+
+
+Q: What is a dead-owner dead-lock?
+
+When some task that has to unlock a locked fulock dies and others are
+waiting for it to release the fulock.
+
+
+Q: What is a dead-owner recovery?
+
+When a lock owner dies, the next waiter (or the next task that locks
+and gets ownership) gets it with a special code indicating that some
+previous owner died and that the state of the lock is "dead-owner",
+so recovery of the data structures protected by the lock must be
+done in order to ensure consistency.
+
+Once a fulock is in dead-owner state, it can be moved back to
+normal/healthy or made not-recoverable (so that only an
+initialization returns it to normal).
+
+
+Q: Why does the kernel have to set the value of the fulock?
+   Why cannot the value of the fulock after unlock be set by user
+   space?
+
+There is a risk of overwritten values and missed waiters.
+
+For example, task B claims fulock F (locked by task A), so it goes
+to the kernel to wait; now the fulock value is VFULOCK_KCO (kernel
+controlled ownership). Before B reaches the kernel, task C releases
+the fulock on behalf of task A; as there are no waiters yet, the
+kernel returns UNLOCKED and task C has to set the value to UNLOCKED,
+thus overwriting VFULOCK_KCO; as KCO is overwritten, task B is going
+to be dead-locked in the kernel, waiting forever.
+
+Furthermore, it helps guarantee robustness. If the just-woken-up
+task that has to set the value the kernel passes dies before being
+able to do it, you hit a dead-owner dead-lock, because nobody wakes
+up the waiters until somebody tries to lock and realizes the fulock
+is dead.
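The lost-update race above can be replayed deterministically by writing out the interleaving step by step; the values here (PID 42, the VFULOCK_KCO constant) are illustrative assumptions:

```c
#include <assert.h>

#define UNLOCKED     0u
#define VFULOCK_KCO  0xfffffffdu   /* assumed value */

/* Returns 1 if the vfulock still carries B's KCO mark at the end,
 * 0 if the mark was lost -- i.e. the race happened. */
int race_demo(void)
{
    unsigned vfulock = 42;      /* fast-locked by task A (PID 42) */

    /* Task B: contends, marks the lock kernel-controlled, and
     * heads for the kernel to wait. */
    vfulock = VFULOCK_KCO;

    /* Unlock path: the kernel saw no waiters (B hasn't arrived
     * yet) and told user space to store UNLOCKED itself... */
    unsigned kernel_said = UNLOCKED;
    vfulock = kernel_said;      /* ...which wipes out B's KCO mark */

    /* B now sleeps in the kernel, but nothing in the vfulock says
     * so: the next fast-locker never enters the kernel to wake B. */
    return vfulock == VFULOCK_KCO;
}
```
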
+
+
+Q: What are the two kinds of unlock?
+
+The two kinds of unlock are serialized and non-serialized.
+
+
+Q: What is a non-serialized unlock?
+
+A non-serialized unlock works by setting the fulock to unlocked and
+waking up as many waiters as desired. The waiters then re-contend for
+the fulock, the winner owns it and the others go back to wait on it.
+
+Faster in synthetic benchmarks (very tight loop acquiring and
+releasing the lock against some other threads doing the same thing).
+
+Main problem: can't find a way to guarantee robustness
+  
+- Say we have a fulock with M guys waiting and we wake up N (1 < N <
+  M), a non-serialized wakeup. Thus, there are M - N guys still
+  waiting in the kernel.
+  
+  In order for the unlock to be non-serialized, the waker first sets
+  the futex to UNLOCKED.
+  
+  Now, how do the woken up processes know that there are still
+  processes in the kernel?
+  
+  A solution is to set the lock not to UNLOCKED, but to
+  UNLOCKED+WP (ie: maintain the waiters-present bit); this way,
+  whoever tries to acquire it will see UNLOCKED+WP and will go down
+  to the kernel to do the lock operation. Also, the lock operation
+  can return a special code that says "contend, but set the lock to
+  pid|WP" to indicate waiters present.
+  
+  However, it still does not solve the fact that when setting to
+  UNLOCKED+WP and waking N, if those N die before locking, the
+  waiters go into dead-owner dead-lock.
+  
+  When somebody tries to lock it, the kernel should be able to
+  notice that there are waiters and it is unlocked, and thus give
+  way to the first locker with dead-owner recovery -- but it might
+  be too late.
+  
+  Another option might be to tag the woken-up processes before they
+  exit the kernel, so that if they die, do_exit() can trap it (but
+  there are many side issues to this, like how do I make sure that
+  the N tasks I woke up have gone through it -- one has locked, the
+  other N-1 have gone back to sleep -- how do I clean it up, and so
+  on; it makes things pretty ugly, not to mention how many
+  resources it would take to tag them).
+  
+  Gosh, it is a mess -- I would say that robust mutexes have to
+  require serialized unlocks. Period.
+
+  Not even talk about adding a short timer to verify that the thing
+  was locked...shrivers
+
+  RESEARCH: tentative ownership: when we wake up some guys who are
+  supposed to go and try to lock again, tie the fulock they should
+  lock to their task_struct and on exit(), check they don't have it
+  there ... [many details need to be worked out].
+
+- It could also be solved by waking up _all_ the waiters; this way
+  no dead-owner dead-lock could ever happen; however, it is a sure
+  recipe for a scheduling storm some day.
+
+
+Q: What is a serialized unlock?
+
+A serialized unlock transfers the ownership of the fulock to the first
+waiter in the kernel. 
+
+- Only one waiter can be woken up at the same time with this method.
+
+- It prevents priority inversion (as the fulock stays locked during
+  the whole operation no low priority thread can acquire it in the
+  meantime).
+
+- A dead-owner dead-lock is not possible, because the owner is
+  always known during the operation. As well, if the new owner dies
+  on its way up to user space, its ownership is also known.
+
+Slower (still not seriously measured -- postulated and observed in
+some very synthetic benchmarks) because it forces a context switch.
+
+
+Q: What is a vfulock?
+
+It is the address in user space associated with a fulock in kernel
+space.
+
+
+Q: What is owner identification?
+
+Owner identification is a property that the fulocks have: basically,
+they can identify who is the owner based on the vfulock (in user
+space) or the internal kernel data structures that refer to it (if the
+vfulock is VFULOCK_KCO, that means that the kernel tracks the
+ownership); if vfulock < VFULOCK_KCO, it means that the ownership is
+tracked only in user space, and the vfulock is the PID of the owner.
+
+
+Q: What is a kernel-controlled-ownership fulock? (kco)
+
+A vfulock whose value is VFULOCK_KCO or VFULOCK_DEAD, indicating
+that the ownership of the fulock is tracked by the kernel [FIXME:
+KCO fulocks are also those locked with the FULOCK_FL_KCO flag --
+not implemented yet].
+
+This happens when:
+
+     - The fulock is locked and there are waiters in the kernel
+     - The fulock is dead (and the kernel keeps track of the ownership)
+     - The fulock is a priority-protected fulock
+
+Basically it is a way to indicate that the fastpath for
+locking/unlocking cannot be taken.
+
+
+Q: What is an ownership/wait chain?
+
+The structure that is formed when task A owns lock F and is waiting
+for lock G, which is owned by task B, which is waiting for lock H,
+which is owned by task C, which is waiting for lock I ... etc.
+
+When this chain is circular (eg: lock I is owned by A) then there
+is a deadlock.
\ No newline at end of file


* [RFC/PATCH] FUSYN 2/10: modifications to the core
  2003-12-03  8:51 ` [RFC/PATCH] FUSYN 1/10: documentation files inaky.perez-gonzalez
@ 2003-12-03  8:51   ` inaky.perez-gonzalez
  2003-12-03  8:51     ` [RFC/PATCH] FUSYN 3/10: Support for i386 inaky.perez-gonzalez
  0 siblings, 1 reply; 9+ messages in thread
From: inaky.perez-gonzalez @ 2003-12-03  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: inaky.perez-gonzalez, robustmutexes

These are modifications needed to support fuqueues and
fulocks. They are, in a nutshell:

- Move some stuff from kernel/futex.c to futex.h to make it
usable by more people.

- Add necessary fields to the task_struct.

- Hook into the initialization and cleanup points for process
creation and destruction.

- Hook into the signalling and timeout code for wait
cancellation, as well as for changing task priorities.

- Add error codes for dead-owner and not-recoverable.


 include/asm-generic/errno-base.h |    4 +++
 include/linux/futex.h            |   52 +++++++++++++++++++++++++++++++++++++++
 include/linux/init_task.h        |    5 +++
 include/linux/sched.h            |    9 ++++++
 kernel/Makefile                  |    3 +-
 kernel/exit.c                    |    3 ++
 kernel/fork.c                    |    2 +
 kernel/futex.c                   |   39 +++--------------------------
 kernel/sched.c                   |   52 ++++++++++++++++++++++++++++++++++++++-
 kernel/signal.c                  |   13 +++++++++
 kernel/timer.c                   |   11 +++++++-
 11 files changed, 155 insertions(+), 38 deletions(-)

--- linux/include/linux/futex.h:1.1.1.1	Thu Jul 10 12:27:31 2003
+++ linux/include/linux/futex.h	Sun Nov 16 07:21:32 2003
@@ -9,7 +9,11 @@
 #define FUTEX_FD (2)
 #define FUTEX_REQUEUE (3)
 
+#ifdef __KERNEL__
 
+#include <linux/jhash.h>
+
+struct timespec;
 asmlinkage long sys_futex(u32 __user *uaddr, int op, int val,
 			  struct timespec __user *utime, u32 __user *uaddr2);
 
@@ -17,4 +21,52 @@
 long do_futex(unsigned long uaddr, int op, int val,
 		unsigned long timeout, unsigned long uaddr2, int val2);
 
+/*
+ * Futexes are matched on equal values of this key.
+ * The key type depends on whether it's a shared or private mapping.
+ * Don't rearrange members without looking at futex_hash_key().
+ *
+ * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
+ * We set bit 0 to indicate if it's an inode-based key.
+ */
+union futex_key {
+	struct {
+		unsigned long pgoff;
+		struct inode *inode;
+		int offset;
+	} shared;
+	struct {
+		unsigned long uaddr;
+		struct mm_struct *mm;
+		int offset;
+	} private;
+	struct {
+		unsigned long word;
+		void *ptr;
+		int offset;
+	} both;
+};
+
+static inline
+u32 futex_hash_key (const union futex_key *key)
+{
+  u32 hash = jhash2((u32*)&key->both.word,
+                    (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
+                    key->both.offset);
+  return hash;
+}
+
+static inline
+int match_futex_key (const union futex_key *key1, const union futex_key *key2)
+{
+	return (key1->both.word == key2->both.word
+		&& key1->both.ptr == key2->both.ptr
+		&& key1->both.offset == key2->both.offset);
+}
+
+extern int get_futex_key (unsigned long uaddr, union futex_key *key);
+extern void get_key_refs(union futex_key *key);
+extern void drop_key_refs(union futex_key *key);
+
+#endif /* #ifdef __KERNEL__ */
 #endif
--- linux/include/linux/init_task.h:1.1.1.2	Thu Aug 28 12:38:00 2003
+++ linux/include/linux/init_task.h	Sun Nov 16 07:21:32 2003
@@ -108,6 +108,11 @@
 	.proc_lock	= SPIN_LOCK_UNLOCKED,				\
 	.switch_lock	= SPIN_LOCK_UNLOCKED,				\
 	.journal_info	= NULL,						\
+	.fuqueue_wait_lock	= SPIN_LOCK_UNLOCKED,			\
+	.fuqueue_wait	= NULL,						\
+	.fuqueue_waiter	= NULL,						\
+	.fulock_olist	= olist_INIT(&tsk.fulock_olist),		\
+	.fulock_olist_lock = SPIN_LOCK_UNLOCKED,			\
 }
 
 
--- linux/include/linux/sched.h:1.1.1.9	Mon Oct 20 13:56:49 2003
+++ linux/include/linux/sched.h	Sun Nov 16 07:21:32 2003
@@ -29,6 +29,7 @@
 #include <linux/completion.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
+#include <linux/fulock-olist.h>    /* olist_t */
 
 struct exec_domain;
 
@@ -326,7 +327,8 @@
 	struct sigqueue *sigq;		/* signal queue entry. */
 };
 
-
+struct fuqueue;
+struct fuqueue_waiter;
 struct io_context;			/* See blkdev.h */
 void exit_io_context(void);
 
@@ -464,6 +466,11 @@
 
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
+	struct fuqueue *fuqueue_wait; /* waiting for this qeueue */
+	struct fuqueue_waiter *fuqueue_waiter; /* waiting for this qeueue */
+	spinlock_t fuqueue_wait_lock; /* FIXME: locking too heavy -- better sollution? */
+	olist_t fulock_olist;	      /* Fulock ownership list */
+	spinlock_t fulock_olist_lock;
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
--- linux/kernel/Makefile:1.1.1.6	Mon Oct 20 13:55:34 2003
+++ linux/kernel/Makefile	Fri Nov 21 13:26:01 2003
@@ -6,7 +6,8 @@
 	    exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o capability.o ptrace.o timer.o user.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
-	    rcupdate.o intermodule.o extable.o params.o posix-timers.o
+	    rcupdate.o intermodule.o extable.o params.o posix-timers.o \
+	    fuqueue.o fulock.o fulock-pp.o vlocator.o ufuqueue.o ufulock.o
 
 obj-$(CONFIG_FUTEX) += futex.o
 obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
--- linux/kernel/exit.c:1.1.1.7	Mon Oct 20 13:56:49 2003
+++ linux/kernel/exit.c	Sun Nov 16 07:21:32 2003
@@ -704,6 +704,8 @@
 	local_irq_enable();
 }
 
+extern void exit_fulocks(struct task_struct *);
+
 NORET_TYPE void do_exit(long code)
 {
 	struct task_struct *tsk = current;
@@ -732,6 +734,7 @@
 	}
 
 	acct_process(code);
+	exit_fulocks(tsk);
 	__exit_mm(tsk);
 
 	exit_sem(tsk);
--- linux/kernel/fork.c:1.1.1.9	Mon Oct 20 13:56:49 2003
+++ linux/kernel/fork.c	Sun Nov 16 07:21:32 2003
@@ -40,6 +40,7 @@
 
 extern int copy_semundo(unsigned long clone_flags, struct task_struct *tsk);
 extern void exit_sem(struct task_struct *tsk);
+extern void init_fulock (struct task_struct *task);
 
 /* The idle threads do not count..
  * Protected by write_lock_irq(&tasklist_lock)
@@ -922,6 +923,7 @@
 		goto bad_fork_cleanup_signal;
 	if ((retval = copy_namespace(clone_flags, p)))
 		goto bad_fork_cleanup_mm;
+	init_fulock(p);
 	retval = copy_thread(0, clone_flags, stack_start, stack_size, p, regs);
 	if (retval)
 		goto bad_fork_cleanup_namespace;
--- linux/kernel/futex.c:1.1.1.4	Mon Oct 20 13:56:49 2003
+++ linux/kernel/futex.c	Sun Nov 16 07:21:32 2003
@@ -41,31 +41,6 @@
 
 #define FUTEX_HASHBITS 8
 
-/*
- * Futexes are matched on equal values of this key.
- * The key type depends on whether it's a shared or private mapping.
- * Don't rearrange members without looking at hash_futex().
- *
- * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
- */
-union futex_key {
-	struct {
-		unsigned long pgoff;
-		struct inode *inode;
-		int offset;
-	} shared;
-	struct {
-		unsigned long uaddr;
-		struct mm_struct *mm;
-		int offset;
-	} private;
-	struct {
-		unsigned long word;
-		void *ptr;
-		int offset;
-	} both;
-};
 
 /*
  * We use this hashed waitqueue instead of a normal wait_queue_t, so
@@ -109,9 +84,7 @@
  */
 static struct futex_hash_bucket *hash_futex(union futex_key *key)
 {
-	u32 hash = jhash2((u32*)&key->both.word,
-			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
-			  key->both.offset);
+	u32 hash = futex_hash_key (key);
 	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
 }
 
@@ -120,9 +93,7 @@
  */
 static inline int match_futex(union futex_key *key1, union futex_key *key2)
 {
-	return (key1->both.word == key2->both.word
-		&& key1->both.ptr == key2->both.ptr
-		&& key1->both.offset == key2->both.offset);
+	return match_futex_key (key1, key2);
 }
 
 /*
@@ -137,7 +108,7 @@
  *
  * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
  */
-static int get_futex_key(unsigned long uaddr, union futex_key *key)
+int get_futex_key(unsigned long uaddr, union futex_key *key)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
@@ -232,7 +203,7 @@
  * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
  * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
  */
-static inline void get_key_refs(union futex_key *key)
+inline void get_key_refs(union futex_key *key)
 {
 	if (key->both.ptr != 0) {
 		if (key->both.offset & 1)
@@ -246,7 +217,7 @@
  * Drop a reference to the resource addressed by a key.
  * The hash bucket spinlock must not be held.
  */
-static inline void drop_key_refs(union futex_key *key)
+inline void drop_key_refs(union futex_key *key)
 {
 	if (key->both.ptr != 0) {
 		if (key->both.offset & 1)
--- linux/kernel/sched.c:1.1.1.11	Sat Nov 15 01:38:17 2003
+++ linux/kernel/sched.c	Tue Nov 18 10:57:31 2003
@@ -591,7 +591,8 @@
 	int success = 0;
 	long old_state;
 	runqueue_t *rq;
-
+	
+		
 repeat_lock_task:
 	rq = task_rq_lock(p, &flags);
 	old_state = p->state;
@@ -1629,6 +1630,8 @@
 
 EXPORT_SYMBOL(default_wake_function);
 
+extern void fuqueue_wait_cancel(struct task_struct *, int);
+
 /*
  * The core wakeup function.  Non-exclusive wakeups (nr_exclusive == 0) just
  * wake everything up.  If it's an exclusive wakeup (nr_exclusive == small +ve
@@ -1637,6 +1640,9 @@
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING.  try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
+ *
+ * fuqueue_wait_cancel needs to hook up here to properly reschedule
+ * priority inheritance/protected tasks. Check its doc to learn why.
  */
 static void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
 {
@@ -1647,6 +1653,8 @@
 		unsigned flags;
 		curr = list_entry(tmp, wait_queue_t, task_list);
 		flags = curr->flags;
+		if (unlikely(curr->task->fuqueue_wait != NULL))
+			fuqueue_wait_cancel(curr->task, -EINTR);
 		if (curr->func(curr, mode, sync) &&
 		    (flags & WQ_FLAG_EXCLUSIVE) &&
 		    !--nr_exclusive)
@@ -1980,6 +1988,46 @@
 	return pid ? find_task_by_pid(pid) : current;
 }
 
+/**
+ * Unconditionally set the effective priority of a task.
+ *
+ * Not too useful on SCHED_OTHER tasks, btw ...
+ * 
+ * @p Pointer to the task in question
+ * @prio New priority to set 
+ */
+#warning FIXME: need to play by POSIX rules on prio change and list repositioning because of prio inheritance
+
+void __set_prio (struct task_struct *p, int prio)
+{
+	runqueue_t *rq;
+	prio_array_t *array;
+	long flags;
+	int oldprio;
+	
+	rq = task_rq_lock(p, &flags);
+	oldprio = p->prio;
+	array = p->array;
+	if (!array) {
+		p->prio = prio;
+		goto out_unlock;
+	}
+	deactivate_task(p, task_rq(p));
+	p->prio = prio;
+	__activate_task(p, task_rq(p));
+	if (rq->curr == p) {
+		if (p->prio > oldprio)
+			resched_task(rq->curr);
+	}
+	else if (TASK_PREEMPTS_CURR(p, rq))
+		resched_task(rq->curr);
+out_unlock:
+	task_rq_unlock (rq, &flags);
+}
+
+
+extern void fuqueue_chprio (struct task_struct *);
+
 /*
  * setscheduler - change the scheduling policy and/or RT priority of a thread.
  */
@@ -2059,6 +2107,8 @@
 		p->prio = MAX_USER_RT_PRIO-1 - p->rt_priority;
 	else
 		p->prio = p->static_prio;
+	if (p->fuqueue_wait != NULL)
+		fuqueue_chprio(p);
 	if (array) {
 		__activate_task(p, task_rq(p));
 		/*
--- linux/kernel/signal.c:1.1.1.8	Mon Oct 20 13:56:49 2003
+++ linux/kernel/signal.c	Tue Nov 18 10:56:36 2003
@@ -524,6 +524,8 @@
 	return signr;
 }
 
+extern void fuqueue_wait_cancel (struct task_struct *, int);
+
 /*
  * Tell a process that it has a new active signal..
  *
@@ -547,11 +549,19 @@
 	 * executing another processor and just now entering stopped state.
 	 * By calling wake_up_process any time resume is set, we ensure
 	 * the process will wake up and handle its stop or death signal.
+	 * 
+	 * fuqueue_wait_cancel needs to hook up here to properly reschedule
+	 * priority inheritance/protected tasks. The reason is that
+	 * when we resched a process that has boosted another one, we
+	 * need to kick its butt off the CPU (and lower its priority) ASAP
+	 * so that 't' can run.
 	 */
 	mask = TASK_INTERRUPTIBLE;
 	if (resume)
 		mask |= TASK_STOPPED;
 	if (t->state & mask) {
+		if (unlikely (t->fuqueue_wait != NULL))
+			fuqueue_wait_cancel(t, -EINTR);
 		wake_up_process_kick(t);
 		return;
 	}
@@ -677,6 +687,9 @@
 				set_tsk_thread_flag(t, TIF_SIGPENDING);
 				state |= TASK_INTERRUPTIBLE;
 			}
+			/* FIXME: I am not that sure we need to cancel here */
+			if (unlikely(t->fuqueue_wait != NULL))
+				fuqueue_wait_cancel(t, -EINTR);
 			wake_up_state(t, state);
 
 			t = next_thread(t);
--- linux/kernel/timer.c:1.1.1.8	Sat Nov 15 01:25:05 2003
+++ linux/kernel/timer.c	Sun Nov 16 07:21:32 2003
@@ -969,9 +969,18 @@
 
 #endif
 
+/*
 * fuqueue_wait_cancel needs to hook up here to properly reschedule
+ * priority inheritance/protected tasks. Check its doc to learn why.
+ */ 
+extern void fuqueue_wait_cancel(struct task_struct *, int);
+
 static void process_timeout(unsigned long __data)
 {
-	wake_up_process((task_t *)__data);
+	struct task_struct *task = (task_t *) __data;
+	if (unlikely(task->fuqueue_wait != NULL))
+		fuqueue_wait_cancel(task, -ETIMEDOUT);
+	wake_up_process(task);
 }
 
 /**
--- linux/include/asm-generic/errno-base.h:1.1.1.1	Thu Jul 10 12:27:27 2003
+++ linux/include/asm-generic/errno-base.h	Sun Nov 16 07:21:32 2003
@@ -36,4 +36,8 @@
 #define	EDOM		33	/* Math argument out of domain of func */
 #define	ERANGE		34	/* Math result not representable */
 
+  /* FIXME: ugly hack to avoid conflicts -- need to get better numbers */
+#define EOWNERDEAD      525     /* Mutex owner died */
+#define ENOTRECOVERABLE 526     /* Mutex state is not recoverable */
+
 #endif

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC/PATCH] FUSYN 3/10: Support for i386
  2003-12-03  8:51   ` [RFC/PATCH] FUSYN 2/10: modifications to the core inaky.perez-gonzalez
@ 2003-12-03  8:51     ` inaky.perez-gonzalez
  0 siblings, 0 replies; 9+ messages in thread
From: inaky.perez-gonzalez @ 2003-12-03  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: inaky.perez-gonzalez, robustmutexes

 arch/i386/kernel/entry.S  |    5 +++
 include/asm-i386/fulock.h |   69 ++++++++++++++++++++++++++++++++++++++++++++++
 include/asm-i386/unistd.h |    7 ++++
 3 files changed, 80 insertions(+), 1 deletion(-)

--- linux/arch/i386/kernel/entry.S:1.1.1.6	Sat Nov 15 01:38:16 2003
+++ linux/arch/i386/kernel/entry.S	Fri Nov 21 13:25:56 2003
@@ -906,6 +906,11 @@
 	.long sys_utimes
  	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
+	.long sys_ufulock_lock
+	.long sys_ufulock_unlock	/* 275 */
+	.long sys_ufulock_consistency
+	.long sys_ufuqueue_wait        
+	.long sys_ufuqueue_wake	     /* 278 */
 
 nr_syscalls=(.-sys_call_table)/4
 
--- /dev/null	Tue Dec  2 20:07:55 2003
+++ linux/include/asm-i386/fulock.h	Tue Nov 18 11:51:06 2003
@@ -0,0 +1,69 @@
+
+/*
+ * Fast User real-time/pi/pp/robust/deadlock SYNchronization
+ * (C) 2002-2003 Intel Corp
+ * Inaky Perez-Gonzalez <inaky.perez-gonzalez@intel.com>.
+ *
+ * Licensed under the FSF's GNU Public License v2 or later.
+ *
+ * Based on normal futexes (futex.c), (C) Rusty Russell.
+ * Please refer to Documentation/fusyn.txt for more info.
+ */
+
+#ifndef __asm_i386_fulock_h__
+#define __asm_i386_fulock_h__
+
+    /* fulock value / state; anything that is not this is a PID that
+     * currently owns the fulock. */
+
+enum vfulock {
+	VFULOCK_UNLOCKED  = 0x00000000,	/* Unlocked */
+	VFULOCK_KCO	  = 0xfffffffd,	/* KCO: kernel controls ownership */
+	VFULOCK_DEAD	  = 0xfffffffe,	/* KCO: dead, kernel controls ownership */
+	VFULOCK_NR	  = 0xffffffff	/* KCO: fulock is not-recoverable */
+};
+
+
+#ifdef __KERNEL__
+
+/**
+ * [User usable] Atomic compare and swap.
+ *
+ * Used for locking a vfulock.
+ *
+ * @value     Pointer to the value to compare and swap.
+ * @old_value Value that *value has to have for the swap to occur.
+ * @new_value New value to set if *value == old_value.
+ * @return    !0 if the swap succeeded. 0 if failed.
+ */
+static inline
+unsigned vfulock_acas (volatile unsigned *value,
+		       unsigned old_value, unsigned new_value)
+{
+	unsigned result;
+	asm __volatile__ (
+		"lock cmpxchg %3, %1"
+		: "=a" (result), "+m" ((*value))
+		: "a" (old_value), "r" (new_value)
+		: "memory");
+	return result == old_value;
+}
+
+
+/**
+ * Set an ufulock's associated value.
+ *
+ * @vfulock:  Pointer to the vfulock whose value is to be set.
+ * @value:    New value to assign.
+ *
+ * Wrapper for arch-specific idiosyncrasies when setting a value that
+ * is shared across different address mappings.
+ */
+static inline
+void vfulock_set (volatile unsigned *vfulock, unsigned value)
+{
+	*vfulock = value;
+}
+
+#endif /* #ifdef __KERNEL__ */
+#endif /* #ifndef __asm_i386_fulock_h__ */
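To make the fast path concrete: the sketch below is a hypothetical user-space analogue of the vfulock_acas() wrapper above, written with the GCC __atomic builtins instead of the inline asm. fulock_trylock() and fulock_tryunlock() are illustration names only, not functions from the patch; on failure a real caller would fall through to the sys_ufulock_*() system calls and let the kernel arbitrate.

```c
#include <assert.h>

/* Values from asm-i386/fulock.h above; anything else stored in the
 * word is the PID of the current owner. */
#define VFULOCK_UNLOCKED 0x00000000u	/* Unlocked */
#define VFULOCK_KCO      0xfffffffdu	/* kernel controls ownership */
#define VFULOCK_DEAD     0xfffffffeu	/* dead, kernel controls ownership */
#define VFULOCK_NR       0xffffffffu	/* fulock is not-recoverable */

/* Portable stand-in for the patch's cmpxchg wrapper. */
static unsigned vfulock_acas(volatile unsigned *value,
			     unsigned old_value, unsigned new_value)
{
	return __atomic_compare_exchange_n(value, &old_value, new_value, 0,
					   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}

/* Fast-path lock: UNLOCKED -> our pid.  On failure the caller would
 * invoke sys_ufulock_lock() and let the kernel arbitrate. */
static int fulock_trylock(volatile unsigned *vfulock, unsigned pid)
{
	return vfulock_acas(vfulock, VFULOCK_UNLOCKED, pid) ? 0 : -1;
}

/* Fast-path unlock: our pid -> UNLOCKED.  Anything else (KCO, DEAD,
 * NR or another pid) means sys_ufulock_unlock() is needed. */
static int fulock_tryunlock(volatile unsigned *vfulock, unsigned pid)
{
	return vfulock_acas(vfulock, pid, VFULOCK_UNLOCKED) ? 0 : -1;
}
```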
--- linux/include/asm-i386/unistd.h:1.1.1.5	Mon Oct 20 13:55:33 2003
+++ linux/include/asm-i386/unistd.h	Fri Nov 21 13:26:01 2003
@@ -279,8 +279,13 @@
 #define __NR_utimes		271
 #define __NR_fadvise64_64	272
 #define __NR_vserver		273
+#define __NR_ufulock_lock	274
+#define __NR_ufulock_unlock	275
+#define __NR_ufulock_consistency	276
+#define __NR_ufuqueue_wait	277
+#define __NR_ufuqueue_wake	278
 
-#define NR_syscalls 274
+#define NR_syscalls 279
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 


* Re: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
  2003-12-03  8:51 [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2 inaky.perez-gonzalez
  2003-12-03  8:51 ` [RFC/PATCH] FUSYN 1/10: documentation files inaky.perez-gonzalez
@ 2003-12-04  4:12 ` Jamie Lokier
  1 sibling, 0 replies; 9+ messages in thread
From: Jamie Lokier @ 2003-12-04  4:12 UTC (permalink / raw)
  To: inaky.perez-gonzalez; +Cc: linux-kernel, robustmutexes

Here's my first thoughts, on reading Documentation/fusyn.txt.

  - Everything here can be implemented on top of regular futex, but
    that doesn't mean the implementation will be efficient or robust.
    (This is because, in general, any kind of futex/fuqueue/whatever
    implementation can be moved from kernel space to userspace, using
    the real kernel futexes to simulate the waitqueues and spinlocks
    that are called in futex.c).

There are some good ideas in here, for people who need the features:

  - Priority inheritance is ok _when_ you want it.

    Sometimes if task A with high priority wants a resource which is
    locked by task B with lower priority, that should be an error
    condition: it can be dangerous to promote the priority of task B,
    if task B is not safe to run at a high priority.

  - The data structures and priority callbacks which are used to
    implement priority inheritance, protection and highest-priority
    wakeup are fine.

    But highest-priority wakeup (at least) shouldn't be just for
    fuqueues: it should be implemented at a lower level, directly in
    the kernel's waitqueues.  Meaning: wake_up() should wake the
    highest priority task, not the first one queued, if that is
    appropriate for the queue or waker.

  - I like the method of detecting dead mutex holders, and the ability
    to handle recovery of inconsistent data.

  - Is there a method for passing the locked state to another task?
    Compare-and-swap old-pid -> new-pid works when there isn't any
    contention, but a kernel call is needed in any of the
    kernel-controlled states.

  - It's very unpleasant that rwlocks enter the kernel when there is
    more than one reader.  Hashed rwlocks can be implemented in
    userspace to reduce this (readers take one rwlock from a hashed
    set; writers take them all in order), but it isn't wonderful.

  - For architectures which can't do compare-and-swap, a system call
    which does the equivalent (i.e. disables preemption, does
    compare-and-swap, enables preemption again) would be quite useful.
    Not for maximum performance, but to allow architecture-independent
    locking strategies to be written portably.

  - Similarly, it would be good for the VFULOCK_ flags to depend on
    only 31 bits of the word, i.e. ignoring one of them.  Then
    architectures which don't have compare-and-swap but which do have
    test-and-set-bit can use that.

  - Does the owner field have to be the pid, or can a fulock use any
    kind of key value that is convenient to the application, as long
    as it isn't one of the reserved values?

  - It's nice that you use separate syscalls for each operation.
    Futex should do this too; multiplexing syscalls like the one in
    futex.c should be banned :)

  - It's a huge patch.  A nice thing about futex.c is that it's
    relatively small (your patch is 9 times larger).  The original
    futex design was more complicated, and written specifically for
    mutexes.  Then it was made simpler and I think smaller at the same
    time.  Perhaps putting some of the RT priority capabilities
    directly into kernel waitqueues would help with this.

-- Jamie


* RE: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
@ 2003-12-07  1:15 Perez-Gonzalez, Inaky
  0 siblings, 0 replies; 9+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-12-07  1:15 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, robustmutexes


> From: Jamie Lokier [mailto:jamie@shareable.org]

I had one question for you I forgot to ask wrt fulocks.

In the vlocator code I ripped your stuff for get_futex_key()
to generate the unique ID. Now, in ufulocks I need to access
the user space page where the vfulock is to modify it when 
we fiddle with the ownership state of the fulock, to keep it in sync;
in every case we need to pull it up from swap if it is there
because we'll read and potentially write.

The way I do it now is the lazy way: I just pin the page and keep
it in fulock->page. This is kind of ugly because it reverts
to the old futex behavior of 'page pinned while waiting'.

Now, I'd love to be able to have a key that can be used to pull
the 'struct page' from, so I don't need to pin the page
while waiting.

This would be easy to do if we only had the lock and unlock 
cases, as both have direct and easy access to the location of
the vfulock in current's address space. 

However, the problem is in __ufulock_exit(); in this case, we
don't necessarily have access to the address space of the 
exiting thread. I am thinking of just adding the user space
address as an ownership property so __ufulock_exit() can use
it, but I am kind of concerned about what would happen if a shared
area were unmapped while owning a lock that is supposed to be
on it.

Do you have any ideas on what would be a good way to do this?

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)


* RE: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
@ 2003-12-04  9:27 Perez-Gonzalez, Inaky
  0 siblings, 0 replies; 9+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-12-04  9:27 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, robustmutexes


> From: Jamie Lokier [mailto:jamie@shareable.org]

> Iñaky Pérez-González wrote:


> If you are only doing priority inheritance for RT priorities, then
> it's not such a big deal to have task B boosted.

Heh ... doing it for SCHED_OTHER becomes a nightmare, and I still
am not that sure it is feasible without major surgery on the way
the priority information is stored in the task_struct now; as well,
POSIX says nothing about it, or at least is kind of gray in a sense
where it seems only SCHED_{FIFO,RR} should be involved, so I am 
really tempted not to do it (well, that's what is nicing it towards
20 :)

> > That was the first thing I thought of; however, it is not an easy
> > task--for example, now you have to allocate a central node that has
> > to live during the life of the waitqueue (unlike in futexes), and
> > that didn't play too well -- with the current code, unlike my previous
> > attempt with rtfutexes, it is not that much of a deal and could be
> > done, but I don't know how much of the interface I could emulate.
> 
> What do you need the central node for?

In futexes we have that each hash chain has all the waiters, in FIFO
arrival order, and we just wake as many as needed. In the fuqueues, 
this wake up has to be by priority. We could walk that chain pedaling
back and forth to wake them up in the correct order, but that would be
anything but O(1), and being "real-time" like, or predictable, is one
of the requirements.

So the wait list has a head, the fuqueue->wlist, and the waiters are
appended there in their correct position (so addition is O(N) now, will
change to O(1) IANF, and removal of the highest priority guy is O(1)--
take the guy in the head).
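That priority-sorted wait list (O(N) insertion, O(1) removal of the highest-priority waiter, FIFO among equal priorities) can be modeled in a few lines. This is a stand-alone sketch of the idea, not the actual fuqueue code:

```c
#include <assert.h>
#include <stddef.h>

struct waiter {
	int prio;		/* lower value = higher priority */
	struct waiter *next;
};

struct wlist {
	struct waiter *head;
};

/* O(N) insertion keeping the list sorted by priority; FIFO among
 * equal-priority waiters (a new waiter goes after the last one of
 * the same priority). */
static void wlist_add(struct wlist *wl, struct waiter *w)
{
	struct waiter **pp = &wl->head;
	while (*pp && (*pp)->prio <= w->prio)
		pp = &(*pp)->next;
	w->next = *pp;
	*pp = w;
}

/* O(1): the highest-priority, longest-waiting guy is always the head. */
static struct waiter *wlist_pop(struct wlist *wl)
{
	struct waiter *w = wl->head;
	if (w)
		wl->head = w->next;
	return w;
}
```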

Now, on ufuqueues (the ones that are associated with user space addresses,
the true futex equivalent) that means you can't do the trick of the 
futex chain lists, so you have on each chain a head per ufuqueue/user
space address. That ufuqueue cannot be declared on the stack of one
of the waiters, as it would disappear when it is woken up and might leave
others dangling.

So we have to allocate it, add the waiters to it and deallocate it when 
the wait list is empty. This is what complicates the whole thing and 
adds the blob of code that is vl_locate() [the allocation and addition 
to the list, checking for collisions when locks are dropped]. As the 
whole thing is kind of expensive, we better cache it for a few seconds, 
as chances are we will have some temporal locality (in fact, it happens, 
it improves the performance a lot), so that leads to more code for the 
"garbage collector" that cleans the hash chains of unused queue heads
every now and then. All this is what the vlocator code does.

> > >   - Is there a method for passing the locked state to another task?
> > >     Compare-and-swap old-pid -> new-pid works when there isn't any
> > >     contention, but a kernel call is needed in any of the
> > >     kernel-controlled states.
> >
> > That can be done, because if you are in non-KCO mode (ie: pid), the
> > kernel by definition knows nil about the mutex, so just do the compare
> > and swap in user space and you are ready. No need to add any special
> > code.
> 
> The question asks what to do in the KCO state.  I.e. when you want to
> transfer a locked state and there are waiters.

Ah, ah, ah ... makes sense. Ok, so this is like an unlock operation 
but "unlock to this guy". Well, same thing, but extended. You need to
try it first in user space, if that fails because it is KCO (locked
and there are waiters), then go to the kernel and ask it to transfer
ownership in there.

Piece of cake, more or less, and can be done O(1) by checking 
dest_task->fuqueue_wait. Why the interest in this? I am curious
to see what it could be used for.

I will give it a shot as soon as I have a minute (unless your question
was purely academic and plan no uses for it).
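The user-space half of such an "unlock to this guy" operation would be just the pid-to-pid compare-and-swap Jamie mentioned; a hypothetical sketch (the kernel fallback for the KCO case is only indicated, not implemented):

```c
#include <assert.h>

#define VFULOCK_KCO 0xfffffffdu	/* from asm-i386/fulock.h: kernel controls ownership */

/* Hand the fulock straight from old_pid to new_pid.  This can only
 * work while the kernel knows nothing about the lock (a plain pid in
 * the word); in any KCO state the caller must make a system call and
 * have the kernel transfer ownership, requeueing waiters as needed. */
static int fulock_transfer(volatile unsigned *vfulock,
			   unsigned old_pid, unsigned new_pid)
{
	unsigned expected = old_pid;
	if (__atomic_compare_exchange_n(vfulock, &expected, new_pid, 0,
					__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
		return 0;	/* transferred entirely in user space */
	return -1;		/* contended/KCO: the kernel's help is required */
}
```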

> If it is as simple as just keeping the mutex in KCO mode all the time
> on archs which don't have compare-and-swap, or those that do if an
> application doesn't have explicit asm code for that arch, that would
> be very convenient.
> 
> I haven't thought through whether keeping a mutex in KCO mode has this
> capability.  Perhaps you have and can state the answer?

First I should make a distinction that is causing way too much confusion:
one thing is KCO for the vfulock (telling the user that it has to go to
the kernel) and the other is using it always in KCO mode by passing a
yet-to-be-implemented flag FULOCK_FL_KCO to the sys_ufulock_*() system
calls (so it skips all the ugly synchronization code).

This FULOCK_FL_KCO is needed for priority protection anyway, so it will
be there no matter what; thus, in arches without atomic compare-and-swap,
it becomes a matter of ops-I-don't-have-the-fast-path so I just call the
syscall with that bit set.

> > Another thing I discarded; then there is no way for the kernel to
> > identify the owner and most of fusyn's benefits disappear. However,
> > I want to give it a second try, in such a way that it would disable
> > owner identification (as well as priority inheritance).
> 
> You don't lose the owner id, because pid is always positive so there's
> a spare bit.

But still you need to set the full PID--or whatever you use for 
identifying yourself as the owner--atomically. If you are thinking of
using bit 32 as a spinlock and then once you have it, set the rest of
the PID in a non-atomic fashion, I'd forget it. Consider the following
scenario:

Task A: FIFO, prio 5, 

time  A
0     sets bit 32, succeeds
1     sets pid(A)
..    does his thing as lock owner
 
Now the same, but Task B, FIFO prio 6 is trying to do the same

time  A                        B
0     sets bit 32, succeeds
1                              spins trying to set bit
puf!

Now you could say: okay, let's yield B, but it wouldn't work because
B is prio six and A is five; as you know, you are toast, as B will
run again and A won't.

Another option is as B sees that bit 32 is 1, it goes down to the 
kernel; but A still might not have had a chance to set the PID, so
B (in the kernel) has no means to identify who is A; it can create
the ufulock and leave it in some 'owner unknown state' and queue 
for it--however, how do we find who A is? The synch code in the 
kernel still needs to do an atomic operation, protecting against
user space, to put the v/ufulock in KCO mode, and for robustness to
work correctly, A might have to check that something like that happened
to call the kernel and let it know that it is the owner...

Maybe I am missing a simpler solution, but I just thought, after 
spinning it many times, that it wasn't worth the effort, as most
modern arches have an atomic compare and swap, even embedded targets;
and that the easiest thing to do is either to run all the time in
KCO mode or massage the code a wee bit so that owner identification
(and associated goodies) are disabled.

> It would be great if the KCO state could be used to provide complete
> portability, and then adding arch-specific asm code to the
> applications becomes an optimisation, not a requirement.

Yep, as I said, my next requirement is to add that code tidbit, so
that will not be a problem. Since the optimization is a simple
atomic compare-and-swap, I don't foresee too much ugliness in it.

> More complex locks can be implemented with one or two bits in an
> object and auxiliary user-space data structures hashing the object
> address in the contended case (which is usually a very small subset of
> all the objects).
> 
> In these ways, futex is very versatile.

Agreed - but as I explained above, that leads to many ugly conditions in
real-time scheduling environments (and they can lead to deadline misses
or who knows). I'd love to be proven wrong, or to find that I missed one or more
ways to do it...I just got drained :]

Now, fuqueues still can do all that, because except for the wake up
order and because you can get -ENOMEM on wait(), they behave exactly
the same.

> Thinking this over, it occurs to me that you can also implement lock
> stealing.  You know who the owner is, so you can send a signal to the
> owner saying "please release the lock", but actually stealing a lock
> requires kernel support doesn't it?  Just a thought, not sure how
> useful that would be.

Yep, it is similar to, if not the same as, the "unlock to this guy" we were
discussing above [and not necessarily the owner is the one that has to
unlock, anybody with access to the lock can unlock it].

However, I am kind of reluctant to add code to do that unless it is really
needed. I am already going to have a hard sell with this without
bloating it a wee bit more :) [I can already hear Ulrich sharpening his Jagdmesser
over the proposed modifications to NPTL to support this].

> > Aside from the comments, it adds the most complex/bloating part,
> > the priority-sorted thingie and chprio support vs not having the FUTEX_FD
> > or requeue support...it comes to be, more or less equivalent, considering
> > all the crap that has to be changed for the prioritization to work
> > respecting the incredibly stupid POSIX semantics for mutex lifetimes.
> 
> Are there specified POSIX semantics for prioritisation and mutex interaction?

Yep, they state different things on all that, and how you don't need to 
necessarily destroy non-shared mutexes (vs shared) and a few more things
that made my life a little bit more exciting...I think I still need to tweak
a few bits more on the priority inheritance stuff to get it completely up
to the spec, but it should be pretty good already.

Oh, did I mention anywhere you can also use the fulocks in the kernel to
replace the DECLARE_MUTEXes and gain robustness, pi and pp? Wonder if anybody
will find this interesting.

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)


* Re: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
  2003-12-04  4:57 Perez-Gonzalez, Inaky
@ 2003-12-04  5:51 ` Jamie Lokier
  0 siblings, 0 replies; 9+ messages in thread
From: Jamie Lokier @ 2003-12-04  5:51 UTC (permalink / raw)
  To: Perez-Gonzalez, Inaky; +Cc: linux-kernel, robustmutexes

Iñaky, thanks for your response!

Iñaky Pérez-González wrote:
> >     Sometimes if task A with high priority wants a resource which is
> >     locked by task B with lower priority, that should be an error
> >     condition: it can be dangerous to promote the priority of task B,
> >     if task B is not safe to run at a high priority.
> 
> That's a responsibility of the system designer; if he allows tasks to
> share locks, it's up to him/her to do it safely.
> 
> I have requests from some vendors to extend this behavior to even
> SCHED_OTHER tasks, so that a FIFO task could promote it to FIFO. I
> personally shiver when thinking about this, but it makes 
> sense in some environments (some real time tasks doing important
> things and a normal task doing low priority cleanups, for example)

Oh, I assumed you _already_ boosted SCHED_OTHER tasks :)

If you are only doing priority inheritance for RT priorities, then
it's not such a big deal to have task B boosted.

> >     But highest-priority wakeup (at least) shouldn't be just for
> >     fuqueues: it should be implemented at a lower level, directly in
> >     the kernel's waitqueues.  Meaning: wake_up() should wake the
> >     highest priority task, not the first one queued, if that is
> >     appropriate for the queue or waker.
> 
> That was the first thing I thought of; however, it is not an easy
> task--for example, now you have to allocate a central node that has
> to live during the life of the waitqueue (unlike in futexes), and
> that didn't play too well -- with the current code, unlike my previous
> attempt with rtfutexes, it is not that much of a deal and could be 
> done, but I don't know how much of the interface I could emulate.

What do you need the central node for?

> >   - Is there a method for passing the locked state to another task?
> >     Compare-and-swap old-pid -> new-pid works when there isn't any
> >     contention, but a kernel call is needed in any of the
> >     kernel-controlled states.
> 
> That can be done, because if you are in non-KCO mode (ie: pid), the
> kernel by definition knows nil about the mutex, so just do the compare 
> and swap in user space and you are ready. No need to add any special
> code.

The question asks what to do in the KCO state.  I.e. when you want to
transfer a locked state and there are waiters.

> >   - For architectures which can't do compare-and-swap, a system call
> >     which does the equivalent (i.e. disables preemption, does
> >     compare-and-swap, enables preemption again) would be quite useful.
> >     Not for maximum performance, but to allow architecture-independent
> >     locking strategies to be written portably.
> 
> But the minute you are doing a system call you are better off calling
> directly the kernel for it to arbitrate the mutex in pure KCO mode.
> I think the overhead saving is worth an #ifdef in the source code for
> the lock operation...

If it is as simple as just keeping the mutex in KCO mode all the time
on archs which don't have compare-and-swap, or those that do if an
application doesn't have explicit asm code for that arch, that would
be very convenient.

I haven't thought through whether keeping a mutex in KCO mode has this
capability.  Perhaps you have and can state the answer?

> >   - Similarly, it would be good for the VFULOCK_ flags to depend on
> >     only 31 bits of the word, i.e. ignoring one of them.  Then
> >     architectures which don't have compare-and-swap but which do have
> >     test-and-set-bit can use that.
> 
> Another thing I discarded; then there is no way for the kernel to 
> identify the owner and most of fusyn's benefits disappear. However,
> I want to give it a second try, in such a way that it would disable
> owner identification (as well as priority inheritance). 

You don't lose the owner id, because pid is always positive so there's
a spare bit.

> Still I don't know how worthwhile it is. Which are the arches that don't
> support atomic compare-and-swap? [original i386, arm?, sparc32 ... ??]
> Is it worth the bloat added vs. removing for them the fast-lock path?

It means application developers who are using these primitives can
write arch-portable code without writing arch-specific locking logic
for every architecture.  Such applications could be: Perl interpreter,
Python interpreter, Java VM etc. - anything with locks in objects
where the locks need to be as streamlined as possible.

It would be great if the KCO state could be used to provide complete
portability, and then adding arch-specific asm code to the
applications becomes an optimisation, not a requirement.

Speaking of streamlined locks: with futex, it is possible to use fewer
than 32 bits for a lock because futex tests 32 bits against a supplied
value; it doesn't expect anything of the representation.  A spinlock
can be implemented with just one bit in an object, which is useful for
tiny, abundant objects such as Lisp-like cons cells.
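A one-bit lock of the kind described can be sketched with an atomic test-and-set of a single bit, leaving the other 31 bits to the application. This is a hedged illustration using the GCC builtins, not actual futex code (a real futex-based lock would sleep in the kernel on contention instead of yielding):

```c
#include <assert.h>
#include <sched.h>

#define LOCK_BIT 0x1u

/* Spin until bit 0 is clear, then set it atomically; the other 31
 * bits of *word are never touched and remain available to the
 * application (e.g. the tag bits of a cons cell). */
static void bit_spin_lock(volatile unsigned *word)
{
	while (__atomic_fetch_or(word, LOCK_BIT, __ATOMIC_ACQUIRE) & LOCK_BIT)
		sched_yield();	/* a futex-based lock would sleep here */
}

static void bit_spin_unlock(volatile unsigned *word)
{
	__atomic_fetch_and(word, ~LOCK_BIT, __ATOMIC_RELEASE);
}
```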

More complex locks can be implemented with one or two bits in an
object and auxiliary user-space data structures hashing the object
address in the contended case (which is usually a very small subset of
all the objects).

In these ways, futex is very versatile.

Your next answer explains why fulock cannot be used in a similar way.
It would be very good to keep many of the priority inheritance and RT
scheduling properties of fuqueue while still allowing for the
versatile data uses of futex (or something similar to futex).  I guess
your code does achieve this, because your futex is implemented on top
of fuqueue?

> >   - Does the owner field have to be the pid or can a fulock use any
> >     kind of key value, as long as it isn't one of the reserved values,
> >     that's convenient to the application.
> 
> Anything that we can later use to point it to a 'struct task_struct'
> in a safe way will do. I used the PID because I didn't want to add yet 
> another pid-like field and usage stuff of kernel/pid.c to the task_struct. 
> In fact, it does the trick pretty well (other than the collisions on 
> reusage, but I have some plans for that).

Thinking this over, it occurs to me that you can also implement lock
stealing.  You know who the owner is, so you can send a signal to the
owner saying "please release the lock", but actually stealing a lock
requires kernel support doesn't it?  Just a thought, not sure how
useful that would be.

> Aside from the comments, it adds the most complex/bloating part,
> the priority-sorted thingie and chprio support vs not having the FUTEX_FD
> or requeue support...it comes to be, more or less equivalent, considering
> all the crap that has to be changed for the prioritization to work 
> respecting the incredibly stupid POSIX semantics for mutex lifetimes.

Are there specified POSIX semantics for prioritisation and mutex interaction?

-- Jamie


* RE: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
@ 2003-12-04  4:57 Perez-Gonzalez, Inaky
  2003-12-04  5:51 ` Jamie Lokier
  0 siblings, 1 reply; 9+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-12-04  4:57 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, robustmutexes

> From: Jamie Lokier [mailto:jamie@shareable.org]

> Here's my first thoughts, on reading Documentation/fusyn.txt.
> 
>   - Everything here can be implemented on top of regular futex, but
>     that doesn't mean the implementation will be efficient or robust.
>     (This is because, in general, any kind of futex/fuqueue/whatever
>     implementation can be moved from kernel space to userspace, using
>     the real kernel futexes to simulate the waitqueues and spinlocks
>     that are called in futex.c).

I thought that initially, and my first tries (last year?) went in
that direction, but there are many holes (unless I am wrong). For
example:

 - [minor] avoidance of priority inversions: you cannot leave the
   lock unlocked while transferring ownership, and you run into race
   conditions trying to implement this in user space.

 - priority inheritance/protection: hell on wheels; you would have so
   many system calls into the kernel and race conditions that it is
   probably not worth it. Big surgery would be needed to
   implement the wait_cancel of a boosting thread. You need
   to be able to find who owns what and follow a possibly long
   ownership/wait chain (more kernel calls) to boost (reprioritize)
   each owner until hitting the end. This implies having the privilege
   and the means to look it up.

 - robustness: you need the kernel's help, at least to identify the
   dead owner, and for most applications that want to use this, it
   has to be snappy. Maybe it could be made fast, but not as fast
   as now (you'd have to query the vfulock, verify that nobody else is
   trying to initiate recovery -- this requires another lock, which
   requires robustness too, chicken and egg -- and then perform the
   recovery); a PITA, to summarize.

> There are some good ideas in here, for people who need the features:
> 
>   - Priority inheritance is ok _when_ you want it.

That's why it is enabled only on request; it bothers me that having
it forces some things, like having to do wait_cancel from interrupt
contexts. Fortunately, chprio also requires that, so it serves as a
justification for having it.

I still need to quantify the overall effects of that, btw.

>     Sometimes if task A with high priority wants a resource which is
>     locked by task B with lower priority, that should be an error
>     condition: it can be dangerous to promote the priority of task B,
>     if task B is not safe to run at a high priority.

That's a responsibility of the system designer; if he allows tasks to
share locks, it's up to him/her to do it safely.

I have requests from some vendors to extend this behavior even to
SCHED_OTHER tasks, so that a FIFO task could promote a SCHED_OTHER
owner to FIFO. I personally shiver when thinking about this, but it
makes sense in some environments (some real-time tasks doing important
things and a normal task doing low-priority cleanups, for example).

>     But highest-priority wakeup (at least) shouldn't be just for
>     fuqueues: it should be implemented at a lower level, directly in
>     the kernel's waitqueues.  Meaning: wake_up() should wake the
>     highest priority task, not the first one queued, if that is
>     appropriate for the queue or waker.

That was the first thing I thought of; however, it is not an easy
task -- for example, you now have to allocate a central node that has
to live for the life of the waitqueue (unlike with futexes), and
that didn't play too well. With the current code, unlike my previous
attempt with rtfutexes, it is not that much of a deal and could be
done, but I don't know how much of the interface I could emulate.

As well, supporting the priority change while waiting requires some
more work...

It is in my todo list to add some more bits to the fuqueue layer so
it can do everything waitqueues do with the priority based interface.

It'd be interesting to experiment in some subsystem by replacing its
waitqueues with fuqueues and seeing what happens.
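The core of a priority-based wake-up is just keeping the wait list
sorted so the head is always the next task to wake. A minimal sketch
of the idea (illustrative names, not the actual fuqueue structures):

```c
#include <stddef.h>

/* A waiter on a priority-sorted wait list; lower prio value means
 * higher priority, and equal priorities queue FIFO. */
struct waiter {
	int prio;
	struct waiter *next;
};

/* Insert keeping the list sorted by priority; the head is always
 * the waiter to wake first. */
static void waiter_enqueue(struct waiter **head, struct waiter *w)
{
	while (*head && (*head)->prio <= w->prio)
		head = &(*head)->next;
	w->next = *head;
	*head = w;
}
```

The cost relative to a plain FIFO waitqueue is the O(n) sorted insert
(and the need to re-sort a waiter when its priority changes while it
sleeps), which is exactly the chprio complication discussed below.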
 
>   - Is there a method for passing the locked state to another task?
>     Compare-and-swap old-pid -> new-pid works when there isn't any
>     contention, but a kernel call is needed in any of the
>     kernel-controlled states.

That can be done, because if you are in non-KCO mode (i.e., pid), the
kernel by definition knows nothing about the mutex, so just do the
compare-and-swap in user space and you are ready. No need to add any
special code.
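That user-space fast path would be just this (a sketch with made-up
names, assuming GCC's __sync builtins; on CAS failure the lock is
contended or in a kernel-controlled state and a syscall is needed):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hand a fulock to another thread: if the vfulock still holds our
 * pid (no kernel-controlled state), a single compare-and-swap
 * transfers ownership; if it fails, the caller must fall back to
 * the kernel so it can arbitrate. */
static bool fulock_transfer(uint32_t *vfulock, uint32_t my_pid,
			    uint32_t new_pid)
{
	return __sync_bool_compare_and_swap(vfulock, my_pid, new_pid);
}
```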

>   - It's very unpleasant that rwlocks enter the kernel when there is
>     more than one reader.  Hashed rwlocks can be implemented in
>     ...

Sure it is--will keep this in mind; I still haven't thought too much 
about it.

>   - For architectures which can't do compare-and-swap, a system call
>     which does the equivalent (i.e. disables preemption, does
>     compare-and-swap, enables preemption again) would be quite useful.
>     Not for maximum performance, but to allow architecture-independent
>     locking strategies to be written portably.

But the minute you are doing a system call you are better off calling
the kernel directly and letting it arbitrate the mutex in pure KCO mode.
I think the overhead saving is worth an #ifdef in the source code for
the lock operation...

But yes, doing single-binary releases complicates this; how does NPTL
solve it now? I don't think Ulrich supports i386 -- no, he does not.
Even a trap for an undefined instruction could solve it for that
case...
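For reference, the portable wrapper being proposed might look like this
(sys_cmpxchg is a hypothetical syscall name, used only to illustrate the
fallback; the GCC macro is a real predefined symbol on targets that have
a 4-byte hardware CAS):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical fallback for CPUs without compare-and-swap: a syscall
 * that disables preemption, compares-and-swaps, re-enables preemption. */
long sys_cmpxchg(uint32_t *addr, uint32_t oldv, uint32_t newv);

/* Architecture-independent compare-and-swap: hardware primitive when
 * available, kernel arbitration otherwise. */
static bool ucas(uint32_t *addr, uint32_t oldv, uint32_t newv)
{
#ifdef __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4
	return __sync_bool_compare_and_swap(addr, oldv, newv);
#else
	return sys_cmpxchg(addr, oldv, newv) == 0;
#endif
}
```

A single binary would then run everywhere, paying the syscall cost only
on the arches that cannot do better.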

>   - Similarly, it would be good for the VFULOCK_ flags to depend on
>     only 31 bits of the word, i.e. ignoring one of them.  Then
>     architectures which don't have compare-and-swap but which do have
>     test-and-set-bit can use that.

Another thing I discarded; then there is no way for the kernel to
identify the owner, and most of fusyn's benefits disappear. However,
I want to give it a second try, in such a way that it would disable
owner identification (as well as priority inheritance).

Still, I don't know how worthwhile it is. Which arches don't
support atomic compare-and-swap? [original i386, arm?, sparc32 ... ??]
Is it worth the added bloat vs. removing the fast-lock path for them?

>   - Does the owner field have to be the pid or can a fulock use any
>     kind of key value, as long as it isn't one of the reserved values,
>     that's convenient to the application.

Anything that we can later use to point it to a 'struct task_struct'
in a safe way will do. I used the PID because I didn't want to add yet 
another pid-like field and usage stuff of kernel/pid.c to the task_struct. 
In fact, it does the trick pretty well (other than the collisions on 
reusage, but I have some plans for that).
 
>   - It's a huge patch.  A nice thing about futex.c is that it's
>     relatively small (your patch is 9 times larger).  The original
>     futex design was more complicated, and written specifically for
>     mutexes.  Then it was made simpler and I think smaller at the same
>     time.  Perhaps putting some of the RT priority capabilities
>     directly into kernel waitqueues would help with this.

I agree with that, but think about the pieces. The only part that is
strictly equivalent to futexes is the ufuqueues, so that's ufuqueue.c, 
fuqueue.c and vlocator.c. The splitting is necessary to allow parts
and pieces to be shared by the fulocks.

Aside from the comments, it adds the most complex/bloating part,
the priority-sorted thingie and chprio support, vs. not having the
FUTEX_FD or requeue support... it comes out more or less equivalent,
considering all the crap that has to be changed for the prioritization
to work while respecting the incredibly stupid POSIX semantics for
mutex lifetimes.

Well, thanks for the comments -- I will keep them on the back burner,
especially the test-and-set-bit thingie. I'll try to tackle it after I
do the KCO mode on call to properly support priority protection, as
well as improving the owner identification method.

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)


end of thread, other threads:[~2003-12-07  1:15 UTC | newest]

Thread overview: 9+ messages
2003-12-03  8:51 [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2 inaky.perez-gonzalez
2003-12-03  8:51 ` [RFC/PATCH] FUSYN 1/10: documentation files inaky.perez-gonzalez
2003-12-03  8:51   ` [RFC/PATCH] FUSYN 2/10: modifications to the core inaky.perez-gonzalez
2003-12-03  8:51     ` [RFC/PATCH] FUSYN 3/10: Support for i386 inaky.perez-gonzalez
2003-12-04  4:12 ` [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2 Jamie Lokier
2003-12-04  4:57 Perez-Gonzalez, Inaky
2003-12-04  5:51 ` Jamie Lokier
2003-12-04  9:27 Perez-Gonzalez, Inaky
2003-12-07  1:15 Perez-Gonzalez, Inaky
