linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
@ 2003-12-04  9:27 Perez-Gonzalez, Inaky
  2003-12-06  0:41 ` [robustmutexes] " Scott Wood
  0 siblings, 1 reply; 7+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-12-04  9:27 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, robustmutexes


> From: Jamie Lokier [mailto:jamie@shareable.org]

> Iñaky Pérez-González wrote:


> If you are only doing priority inheritance for RT priorities, then
> it's not such a big deal to have task B boosted.

Heh ... doing it for SCHED_OTHER becomes a nightmare, and I still
am not that sure is feasible without major surgery on the way
the priority information is stored in the task_struct now; as well,
POSIX says nothing about it, or at least is kind of gray in a sense
where it seems only SCHED_{FIFO,RR} should be involved, so I am 
really tempted not to do it (well, that's what is nicing it towards
20 :)

> > That was the first thing I thought of; however, it is not an easy
> > task--for example, now you have to allocate a central node that has
> > to live during the life of the waitqueue (unlike in futexes), and
> > that didn't play too well -- with the current code, unlike my previous
> > attempt with rtfutexes, it is not that much of a deal and could be
> > done, but I don't know how much of the interface I could emulate.
> 
> What do you need the central node for?

In futexes we have that each hash chain has all the waiters, in FIFO
arrival order, and we just wake as many as needed. In the fuqueues, 
this wake up has to be by priority. We could walk that chain pedaling
back and forth to wake them up in the correct order, but that would be
anything but O(1), and being "real-time" like, or predictable, is one
of the requirements.

So the wait list has a head, the fuqueue->wlist, and the waiters are
appended there in their correct position (so addition is O(N) now, will
change to O(1) IANF, and removal of the highest priority guy is O(1)--
take the guy in the head).

Now, on ufuqueues (the ones that are associated to user space addresses,
the true futex equivalent) that means you can't do the trick of the 
futex chain lists, so you have on each chain a head per ufuqueue/user
space address. That ufuqueue cannot be declared on the stack of one
of the waiters, as it would disappear when it is woken up and might leave
others dangling.

So we have to allocate it, add the waiters to it and deallocate it when 
the wait list is empty. This is what complicates the whole thing and 
adds the blob of code that is vl_locate() [the allocation and addition 
to the list, checking for collisions when locks are dropped]. As the 
whole thing is kind of expensive, we better cache it for a few seconds, 
as chances are we will have some temporal locality (in fact, it happens, 
it improves the performance a lot), so that leads to more code for the 
"garbage collector" that cleans the hash chains of unused queue heads
every now and then. All this is what the vlocator code does.

> > >   - Is there a method for passing the locked state to another task?
> > >     Compare-and-swap old-pid -> new-pid works when there isn't any
> > >     contention, but a kernel call is needed in any of the
> > >     kernel-controlled states.
> >
> > That can be done, because if you are in non-KCO mode (ie: pid), the
> > kernel by definition knows nil about the mutex, so just do the compare
> > and swap in user space and you are ready. No need to add any special
> > code.
> 
> The question asks what to do in the KCO state.  I.e. when you want to
> transfer a locked state and there are waiters.

Ah, ah, ah ... makes sense. Ok, so this is like an unlock operation 
but "unlock to this guy". Well, same thing, but extended. You need to
try it first in user space, if that fails because it is KCO (locked
and there are waiters), then go to the kernel and ask it to transfer
ownership in there.

Piece of cake, more or less, and can be done O(1) by checking 
dest_task->fuqueue_wait. Why the interest for this? I am curious
to see what could it be used for.

I will give it a shot as soon as I have a minute (unless your question
was purely academic and plan no uses for it).

> If it is as simple as just keeping the mutex in KCO mode all the time
> on archs which don't have compare-and-swap, or those that do if an
> application doesn't have explicit asm code for that arch, that would
> be very convenient.
> 
> I haven't thought through whether keeping a mutex in KCO mode has this
> capability.  Perhaps you have and can state the answer?

First I should make a distinction that is causing way too much confusion,
one thing is KCO for the vfulock (telling the user that it has to go to
the kernel) and the other is using it always in KCO mode by passing a
yet-to-be-implemented-flag FULOCK_FL_KCO to the sys_ufulock_*() system
calls (so it skips all the ugly synchronization code).

This FULOCK_FL_KCO is needed for priority protection anyway, so it will
be there no matter what; thus, in arches without atomic compare-and-swap,
it becomes a matter of ops-I-don't-have-the-fast-path so I just call the
syscall with that bit set.

> > Another thing I discarded; then there is no way for the kernel to
> > identify the owner and most of fusyn's benefits disappear. However,
> > I want to give it a second try, in such a way that it would disable
> > owner identification (as well as priority inheritance).
> 
> You don't lose the owner id, because pid is always positive so there's
> a spare bit.

But still you need to set the full PID--or whatever you use for 
identifying yourself as the owner--atomically. If you are thinking of
using bit 32 as a spinlock and then once you have it, set the rest of
the PID in a non-atomic fashion, I'd forget it. Consider the following
scenario:

Task A: FIFO, prio 5, 

time  A
0     sets bit 32, succeeds
1     sets pid(A)
..    does his thing as lock owner
 
Now the same, but Task B, FIFO prio 6 is trying to do the same

time  A                        B
0     sets bit 32, succeeds
1                              spins trying to set bit
puf!

Now you could say: okay, let's yield B, but it'd not work because
B is prio six and A is five; as you know, you are toast, as B will
run again and A won't.

Another option is as B sees that bit 32 is 1, it goes down to the 
kernel; but A still might not have had a chance to set the PID, so
B (in the kernel) has no means to identify who is A; it can create
the ufulock and leave it in some 'owner unknown state' and queue 
for it--however, how do we find who A is? The synch code in the 
kernel still needs to do an atomic operation, protecting against
user space, to put the v/ufulock in KCO mode, and for robustness to
work correctly, A might have to check that something like that happened
to call the kernel and let it know that it is the owner...

Maybe I am missing a simpler solution, but I just thought, after 
spinning it many times that it wasn't worth the effort, as most
modern arches have an atomic compare and swap, even embedded targets;
and that the easiest thing to do is either to run all the time in
KCO mode or massage the code a wee bit so that owner identification
(and associated goodies) are disabled.

> It would be great if the KCO state could be used to provide complete
> portability, and then adding arch-specific asm code to the
> applications becomes an optimisation, not a requirement.

Yep, as I said, my next requirement is to add that code tidbit, so
that will not be a problem. Being the optimization a simple 
atomic-compare-and-swap, I don't foresee too much uglyness on it.

> More complex locks can be implemented with one or two bits in an
> object and auxiliary user-space data structures hashing the object
> address in the contended case (which is usually a very small subset of
> all the objects).
> 
> In these ways, futex is very versatile.

Agreed - but as I explained above, that leads to many ugly conditions in
real-time scheduling environments (and they can lead to deadline misses
or who knows). I'd love to be proven wrong, or that I missed one or more
ways to do it...I just got drained :]

Now, fuqueues still can do all that, because except for the wake up
order and because you can get -ENOMEM on wait(), they behave exactly
the same.

> Thinking this over, it occurs to me that you can also implement lock
> stealing.  You know who the owner is, so you can send a signal to the
> owner saying "please release the lock", but actually stealing a lock
> requires kernel support doesn't it?  Just a thought, not sure how
> useful that would be.

Yep, it is similar, if not the same, as the "unlock to this guy" we were 
discussing above [and not necessarily the owner is the one that has to
unlock, anybody with access to the lock can unlock it].

However, I am kind of reluctant to add code to do that unless it is really
needed. I am going to have already a hard sell with this as to bloat it
a wee bit more :) [I can already hear Ulrich sharpening his Jagdmesser
over the proposed modifications to NPTL to support this].

> > Asides from the comments, it adds the most complex/bloating part,
> > the priority-sorted thingie and chprio support vs not having the FUTEX_FD
> > or requeue support...it comes to be, more or less equivalent, considering
> > all the crap that has to be changed for the prioritization to work
> > respecting the incredibly stupid POSIX semantincs for mutex lifetimes.
> 
> Are there specified POSIX semantics for prioritisation and mutex interaction?

Yep, they state different things on all that, and how you don't need to 
necessarily destroy non-shared mutexes (vs shared) and a few more things
that made my life a little bit more exciting...I think I still need to tweak
a few bits more on the priority inheritance stuff to get it completely up
to the spec, but it should be pretty good already.

Oh, did I mention anywhere you can also use the fulocks in the kernel to
replace the DECLARE_MUTEXes and gain robustness, pi and pp? Wonder if anybody
will find this interesting.

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)

^ permalink raw reply	[flat|nested] 7+ messages in thread
* RE: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
@ 2003-12-07  1:15 Perez-Gonzalez, Inaky
  0 siblings, 0 replies; 7+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-12-07  1:15 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, robustmutexes


> From: Jamie Lokier [mailto:jamie@shareable.org]

I had one question for you I forgot to ask wrt to fulocks. 

In the vlocator code I ripped your stuff for get_futex_key()
to generate the unique ID. Now, in ufulocks I need to access
the user space page where the vfulock is to modify it when 
we fiddle with the ownership of fulock state to keep it in sync;
in every case we need to pull it up from swap if it is there
because we'll read and potentially write.

As I do it now is the lazy way. I just pin the page and keep
it in fulock->page. This is kind of ugly because it reverts
back to the old futex behavior of 'page pinned while waiting'.

Now, I'd love to be able to have a key that can be used to pull
the 'struct page' from, so I don't need to pin the page
while waiting.

This would be easy to do if we only had the lock and unlock 
cases, as both have direct and easy access to the location of
the vfulock in current's address space. 

However, the problem is in __ufulock_exit(); in this case, we
don't necessarily have access to the address space of the 
exiting thread. I am thinking of just adding the user space
address an an ownership property so __ufulock_exit() can use
it, but I am kind of concerned on what would happen if a shared
area were unmapped while owning a lock that is supposed to be
on it.

Do you have any ideas on what would be a good way to do this?

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)

^ permalink raw reply	[flat|nested] 7+ messages in thread
* RE: [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
@ 2003-12-04  4:57 Perez-Gonzalez, Inaky
  2003-12-04  5:51 ` Jamie Lokier
  0 siblings, 1 reply; 7+ messages in thread
From: Perez-Gonzalez, Inaky @ 2003-12-04  4:57 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: linux-kernel, robustmutexes

> From: Jamie Lokier [mailto:jamie@shareable.org]

> Here's my first thoughts, on reading Documentation/fusyn.txt.
> 
>   - Everything here can be implemented on top of regular futex, but
>     that doesn't mean the implementation will be efficient or robust.
>     (This is because, in general, any kind of futex/fuqueue/whatever
>     implementation can be moved from kernel space to userspace, using
>     the real kernel futexes to simulate the waitqueues and spinlocks
>     that are called in futex.c).

I thought that initially, and my first tries (last year?) went into 
that direction, but there are many holes (unless I am wrong). For 
example:

 - [minor] avoidance of priority inversions: you cannot leave the 
   lock unlocked while transferring ownership and race conditions
   trying to implement this.

 - priority inheritance/protection: hell on heels, you will have so
   many system calls into the kernel and race conditions that it is
   probably not worth it. Big surgery would be needed to 
   implement the wait_cancel of a boosting thread. You need
   to be able to find who is owning what and follow a possibly long
   ownership/wait chain (more kernel) to boost (reprioritize) each
   owner until hitting the end. This implies having the privilege and
   the means to look it up.

 - robustness: you need the kernel help, at least to identify the dead
   guy, and for most applications that they want to use this for, it
   has to be snap quick. Maybe it could be made fast, but not as much
   as now (you'd have to query the vfulock, verify that nobody else is
   trying to initiate recovery -- this requires another lock, which 
   requires robustness too, chicken and egg -- and then perform the
   recovery); a PITA, to summarize

> There are some good ideas in here, for people who need the features:
> 
>   - Priority inheritence is ok _when_ you want it.

That's why it is enabled only on request; it bothers me that having
it forces some things, like having to do wait_cancel from interrupt
contexts and stuff like that. Fortunately, chprio also requires that,
so serves as a justification for having it. 

I still need to quantify the overall effects of that, btw.

>     Sometimes if task A with high priority wants a resource which is
>     locked by task B with lower priority, that should be an error
>     condition: it can be dangerous to promote the priority of task B,
>     if task B is not safe to run at a high priority.

That's a responsibility of the system designer; if he allows tasks to
share locks, its up to him/her to do it safely.

I have requests from some vendors to extend this behavior to even
SCHED_OTHER tasks, so that a FIFO task could promote it to FIFO. I
personally shiver when thinking about this, but it makes 
sense in some environments (some real time tasks doing important
things and a normal task doing low priority cleanups, for example)

>     But highest-priority wakeup (at least) shouldn't be just for
>     fuqueues: it should be implemented at a lower level, directly in
>     the kernel's waitqueues.  Meaning: wake_up() should wake the
>     highest priority task, not the first one queued, if that is
>     appropriate for the queue or waker.

That was the first thing I thought of; however, it is not an easy
task--for example, now you have to allocate a central node that has
to live during the life of the waitqueue (unlike in futexes), and
that didn't play too well -- with the current code, unlike my previous
attempt with rtfutexes, it is not that much of a deal and could be 
done, but I don't know how much of the interface I could emulate.

As well, supporting the priority change while waiting requires some
more work...

It is in my todo list to add some more bits to the fuqueue layer so
it can do everything waitqueues do with the priority based interface.

It'd be interesting to experiment in some subsystem by changing the
usage of waitqueues for fuqueues, see what happens. 
 
>   - Is there a method for passing the locked state to another task?
>     Compare-and-swap old-pid -> new-pid works when there isn't any
>     contention, but a kernel call is needed in any of the
>     kernel-controlled states.

That can be done, because if you are in non-KCO mode (ie: pid), the
kernel by definition knows nil about the mutex, so just do the compare 
and swap in user space and you are ready. No need to add any special
code.

>   - It's very unpleasant that rwlocks enter the kernel when there is
>     more than one reader.  Hashed rwlocks can be implemented in
>     ...

Sure it is--will keep this in mind; I still haven't thought too much 
about it.

>   - For architectures which can't do compare-and-swap, a system call
>     which does the equivalent (i.e. disables preemption, does
>     compare-and-swap, enables preemption again) would be quite useful.
>     Not for maximum performance, but to allow architecture-independent
>     locking strategies to be written portably.

But the minute you are doing a system call you are better off calling
directly the kernel for it to arbitrate the mutex in pure KCO mode.
I think the overhead saving is worth an #ifdef in the source code for
the lock operation...

But yes, doing single binary releases complicates this; how does NPTL
solve it now? I don't think Ulrich supports i386, does he? no he does
not. Even a trap for an undefined instruction could solve it for that
case...

>   - Similarly, it would be good for the VFULOCK_ flags to depend on
>     only 31 bits of the word, i.e. ignoring one of them.  Then
>     architectures which don't have compare-and-swap but which do have
>     test-and-set-bit can use that.

Another thing I discarded; then there is no way for the kernel to 
identify the owner and most of fusyn's benefits disappear. However,
I want to give it a second try, in such a way that it would disable
owner identification (as well as priority inheritance). 

Still I don't know how worth is it. Which are the arches that don't 
support atomic compare-and-swap? [original i386, arm?, sparc32 ... ??]
Is it worth the bloat added vs. removing for them the fast-lock path?

>   - Does the owner field have to be the pid or can a fulock use any
>     kind of key value, as long as it isn't one of the reserved values,
>     that's convenient to the application.

Anything that we can later use to point it to a 'struct task_struct'
in a safe way will do. I used the PID because I didn't want to add yet 
another pid-like field and usage stuff of kernel/pid.c to the task_struct. 
In fact, it does the trick pretty well (other than the collisions on 
reusage, but I have some plans for that).
 
>   - It's a huge patch.  A nice thing about futex.c is that it's
>     relatively small (your patch is 9 times larger).  The original
>     futex design was more complicated, and written specifically for
>     mutexes.  Then it was made simpler and I think smaller at the same
>     time.  Perhaps putting some of the RT priority capabilities
>     directly into kernel waitqueues would help with this.

I agree with that, but think about the pieces. The only part that is
strictly equivalent to futexes is the ufuqueues, so that's ufuqueue.c, 
fuqueue.c and vlocator.c. The splitting is necessary to allow parts
and pieces to be shared by the fulocks.

Asides from the comments, it adds the most complex/bloating part,
the priority-sorted thingie and chprio support vs not having the FUTEX_FD
or requeue support...it comes to be, more or less equivalent, considering
all the crap that has to be changed for the prioritization to work 
respecting the incredibly stupid POSIX semantincs for mutex lifetimes.

Well, thanks for the comments--I will keep them in the backburner,
specially the test-and-set-bit thingie. I'll try to tackle it after I do the
KCO mode on call to properly support priority protection, as well as 
improving the owner identification method. 

Iñaky Pérez-González -- Not speaking for Intel -- all opinions are my own (and my fault)

^ permalink raw reply	[flat|nested] 7+ messages in thread
* [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2
@ 2003-12-03  8:51 inaky.perez-gonzalez
  2003-12-04  4:12 ` Jamie Lokier
  0 siblings, 1 reply; 7+ messages in thread
From: inaky.perez-gonzalez @ 2003-12-03  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: inaky.perez-gonzalez, robustmutexes

Hi All

This code proposes an implementation of kernel based mutexes,
taking ideas from the actual implementations using futexes
(namely NPTL), intending to solve its limitations (see doc)
and adding some features, such as real-time behavior,
priority inheritance and protection, deadlock detection and
robustness.

Please look at the first patch, containing the documentation
for information on how is it implemented.

The patch is split in the following parts:

1/10: documentation files
2/10: modifications to the core
3/10: Support for i386
4/10: Support for ia64
5/10: kernel fuqueues
6/10: user space/kernel space tracker
7/10: user space fuqueues
8/10: kernel fulocks
9/10: stub for priority protection
10/10: user space fulocks

We have a site at http://developer.osdl.org/dev/robustmutexes
with references to all the releases, test code and NPTL
modifications (rtnptl) to use this code. As well, the patch
is there in a single file, in case you don't want to paste
them manually.


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2003-12-07  1:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-12-04  9:27 [RFC/PATCH] FUSYN Realtime & Robust mutexes for Linux try 2 Perez-Gonzalez, Inaky
2003-12-06  0:41 ` [robustmutexes] " Scott Wood
  -- strict thread matches above, loose matches on Subject: below --
2003-12-07  1:15 Perez-Gonzalez, Inaky
2003-12-04  4:57 Perez-Gonzalez, Inaky
2003-12-04  5:51 ` Jamie Lokier
2003-12-03  8:51 inaky.perez-gonzalez
2003-12-04  4:12 ` Jamie Lokier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).