Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
@ 2002-03-13  9:12 Martin Wirth
  2002-03-13 19:41 ` Bill Davidsen
  2002-03-15  7:31 ` Rusty Russell
  0 siblings, 2 replies; 84+ messages in thread
From: Martin Wirth @ 2002-03-13  9:12 UTC (permalink / raw)
  To: Rusty Russell, linux-kernel

 > > > On Tue, 12 Mar 2002, Rusty Russell wrote:
 > > > > > > > You've convinced me.
 > > > > > Damn.  Because now I've been playing with a different approach.
 > > > I don't think your current patch is very useful.

 > I agree.  But your "Applied" EMail rushed me into posting it.

The normal way to use multithreading under UNIX is the pthread
library. Here the condition variables are the equivalent to kernel
wait queues. So I think to really implement a fast pthread lib based
on futexes one needs some means to implement condition variables
(with synchronous futex release to implement  pthread_cond_wait(..)!).

This could either be done with the exported waitqueue approach or a bit 
easier (but less general) by associating a second hashed waitqueue with 
each futex (maybe marked by the odd offset+1?). Then we would have two 
additional variants of sys_futex (with parameters FUTEX_WAIT, 
FUTEX_SIGNAL).

The principle implementation is:

FUTEX_WAIT:
    add_to_cond_queue
    current->state=INTERRUPTIBLE
    futex_up
    schedule
    remove_from_cond_queue
    futex_down

FUTEX_SIGNAL
    wake_up_all on cond_queue

Later we may also want FUTEX_SIGNAL_ONE and FUTEX_WAIT_TIMEOUT.

The user space code for pthread_cond_wait then of course needs a 
chaining of the protecting pthread_mutex and the futex used as condition 
variable.

Martin

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-13  9:12 [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores) Martin Wirth
@ 2002-03-13 19:41 ` Bill Davidsen
  2002-03-13 19:52   ` Dave McCracken
  2002-03-13 20:06   ` Alan Cox
  2002-03-15  7:31 ` Rusty Russell
  1 sibling, 2 replies; 84+ messages in thread
From: Bill Davidsen @ 2002-03-13 19:41 UTC (permalink / raw)
  To: Martin Wirth; +Cc: linux-kernel

On Wed, 13 Mar 2002, Martin Wirth wrote:

> The normal way to use multithreading under UNIX is the pthread
> library. Here the condition variables are the equivalent to kernel
> wait queues. So I think to really implement a fast pthread lib based
> on futexes one needs some means to implement condition variables
> (with synchronous futex release to implement  pthread_cond_wait(..)!).

Let me mention this again... The IBM release of NGPT states that Linus has
approved the inclusion of the NGPT patches in the mainline kernel. Will
this be in 2.4.19 release? I've been running 2.4.17 for NGPT, haven't
tried 2.4.19 other than to see the patch didn't apply).

(NGPT = Next Generation Pthreads, a cleaner and faster POSIX threads)

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-13 19:41 ` Bill Davidsen
@ 2002-03-13 19:52   ` Dave McCracken
  2002-03-13 22:17     ` Bill Davidsen
  2002-03-13 20:06   ` Alan Cox
  1 sibling, 1 reply; 84+ messages in thread
From: Dave McCracken @ 2002-03-13 19:52 UTC (permalink / raw)
  To: Bill Davidsen, Martin Wirth; +Cc: linux-kernel


--On Wednesday, March 13, 2002 02:41:52 PM -0500 Bill Davidsen
<davidsen@tmr.com> wrote:

> Let me mention this again... The IBM release of NGPT states that Linus has
> approved the inclusion of the NGPT patches in the mainline kernel. Will
> this be in 2.4.19 release? I've been running 2.4.17 for NGPT, haven't
> tried 2.4.19 other than to see the patch didn't apply).

The 2.4 patch needed for NGPT was accepted by Marcelo and is in 2.4.19-pre3.

Dave McCracken

======================================================================
Dave McCracken          IBM Linux Base Kernel Team      1-512-838-3059
dmccr@us.ibm.com                                        T/L   678-3059


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-13 19:41 ` Bill Davidsen
  2002-03-13 19:52   ` Dave McCracken
@ 2002-03-13 20:06   ` Alan Cox
  1 sibling, 0 replies; 84+ messages in thread
From: Alan Cox @ 2002-03-13 20:06 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Martin Wirth, linux-kernel

> Let me mention this again... The IBM release of NGPT states that Linus has
> approved the inclusion of the NGPT patches in the mainline kernel. Will
> this be in 2.4.19 release? I've been running 2.4.17 for NGPT, haven't
> tried 2.4.19 other than to see the patch didn't apply).

2.5 but hopefully it will get backported as it proves solid and the API
is fixed

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-13 19:52   ` Dave McCracken
@ 2002-03-13 22:17     ` Bill Davidsen
  0 siblings, 0 replies; 84+ messages in thread
From: Bill Davidsen @ 2002-03-13 22:17 UTC (permalink / raw)
  To: Dave McCracken; +Cc: linux-kernel

On Wed, 13 Mar 2002, Dave McCracken wrote:

> 
> --On Wednesday, March 13, 2002 02:41:52 PM -0500 Bill Davidsen
> <davidsen@tmr.com> wrote:
> 
> > Let me mention this again... The IBM release of NGPT states that Linus has
> > approved the inclusion of the NGPT patches in the mainline kernel. Will
> > this be in 2.4.19 release? I've been running 2.4.17 for NGPT, haven't
> > tried 2.4.19 other than to see the patch didn't apply).
> 
> The 2.4 patch needed for NGPT was accepted by Marcelo and is in 2.4.19-pre3.

Good info, thanks! I hand edited the 2.4.17 patch to 2.4.18, but 19-pre2
didn't apply and I ran out of time about 1am this morning;-) When pre3
settles down a bit I'll use that as a base.

-- 
bill davidsen <davidsen@tmr.com>
  CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-13  9:12 [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores) Martin Wirth
  2002-03-13 19:41 ` Bill Davidsen
@ 2002-03-15  7:31 ` Rusty Russell
  2002-03-15  8:41   ` Martin Wirth
  1 sibling, 1 reply; 84+ messages in thread
From: Rusty Russell @ 2002-03-15  7:31 UTC (permalink / raw)
  To: Martin.Wirth; +Cc: torvalds, Hubertus Franke, linux-kernel

In message <3C8F1801.6070107@dlr.de> you write:
>  > > > On Tue, 12 Mar 2002, Rusty Russell wrote:
>  > > > > > > > You've convinced me.
>  > > > > > Damn.  Because now I've been playing with a different approach.
>  > > > I don't think your current patch is very useful.
> 
>  > I agree.  But your "Applied" EMail rushed me into posting it.
> 
> 
> The normal way to use multithreading under UNIX is the pthread
> library. Here the condition variables are the equivalent to kernel
> wait queues. So I think to really implement a fast pthread lib based
> on futexes one needs some means to implement condition variables
> (with synchronous futex release to implement  pthread_cond_wait(..)!).

Discussions with Ulrich have reaffirmed my opinion that pthreads are
crap.  Hence I'm not all that tempted to warp the (nice, clean,
usable) futex code too far to meet pthreads' wierd needs.

However, it's not too hard to implement condition variables using an
unavailable mutex, if we go for "full" semaphores: ie. not just
mutexes.  It requires a bit more of a stretch for kernel atomic ops...

Yet another untested patch,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.7-pre1/kernel/futex.c working-2.5.7-pre1-sem/kernel/futex.c
--- linux-2.5.7-pre1/kernel/futex.c	Wed Mar 13 13:30:39 2002
+++ working-2.5.7-pre1-sem/kernel/futex.c	Fri Mar 15 18:18:37 2002
@@ -129,17 +129,18 @@
 	return page;
 }
 
-/* Try to decrement the user count to zero. */
-static int decrement_to_zero(struct page *page, unsigned int offset)
+/* Try to decrement the user count to >= zero. */
+static int decrement_futex(struct page *page, unsigned int offset)
 {
 	atomic_t *count;
 	int ret = 0;
 
 	count = kmap(page) + offset;
-	/* If we take the semaphore from 1 to 0, it's ours.  If it's
-           zero, decrement anyway, to indicate we are waiting.  If
-           it's negative, don't decrement so we don't wrap... */
-	if (atomic_read(count) >= 0 && atomic_dec_and_test(count))
+	/* If we take one from the sem, and it's still >= 0, it's
+           ours.  If it's zero, we decrement anyway to indicate we are
+           waiting.  If it's negative, don't decrement so we don't
+           wrap... */
+	if (atomic_read(count) >= 0 && !atomic_add_negative(-1, count))
 		ret = 1;
 	kunmap(page);
 	return ret;
@@ -173,12 +174,36 @@
 	return retval;
 }
 
+/* Atomically: Add 1 if already positive, otherwise set to 1. */
+static void futex_make_positive(atomic_t *count)
+{
+	int old_count, new_count;
+
+#ifndef CONFIG_SMP
+	preempt_disable();
+	old_count = atomic_read(count);
+	new_count = old_count > 0 ? old_count+1 : 1;
+	atomic_set(count, new_count);
+	preempt_enable();
+#elif defined(__HAVE_ARCH_CMPXCHG)
+	do {
+		old_count = atomic_read(count);
+		new_count = old_count > 0 ? old_count+1 : 1;
+	} while (cmpxchg(count, old_count, new_count) != old_count);
+#else
+	/* Do this one at a time.  You will need to implement
+	   atomic_add_positive(). */
+	while (!atomic_add_positive(count, 1));
+#endif
+}
+
 static int futex_up(struct list_head *head, struct page *page, int offset)
 {
 	atomic_t *count;
 
 	count = kmap(page) + offset;
-	atomic_set(count, 1);
+	/* set to MAX(count, 0) + 1 */
+	futex_make_positive(count);
 	smp_wmb();
 	kunmap(page);
 	wake_one_waiter(head, page, offset);

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-15  7:31 ` Rusty Russell
@ 2002-03-15  8:41   ` Martin Wirth
  2002-03-15 15:29     ` Hubertus Franke
  2002-03-15 16:23     ` Peter Wächtler
  0 siblings, 2 replies; 84+ messages in thread
From: Martin Wirth @ 2002-03-15  8:41 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel


Rusty Russell wrote:

>
>Discussions with Ulrich have reaffirmed my opinion that pthreads are
>crap.  Hence I'm not all that tempted to warp the (nice, clean,
>usable) futex code too far to meet pthreads' wierd needs.
>
Crap or not, there are tons of software based on pthreads and at least 
the NGPT team says that Linus
agreed to implement for necessary kernel-infrastructure for a full, fast 
pthread implementation.

Now, if you want to implement mutexes and condition variables with the attribute
PTHREAD_PROCESS_SHARED then you need some functionality like the futexes.
Or NGPT will add his own syscalls to handle these things, which is simply
unnecessary double functionality.

>
>However, it's not too hard to implement condition variables using an
>unavailable mutex, if we go for "full" semaphores: ie. not just
>mutexes.  It requires a bit more of a stretch for kernel atomic ops...
>
A full semaphore is nice, but not a full replacement for a waitqueue (or 
a pthread condition variable brr..).
For the semaphore you always have to assure that the ups and downs are 
balanced, what is not the case
for the condition variable.

Martin



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-15  8:41   ` Martin Wirth
@ 2002-03-15 15:29     ` Hubertus Franke
  2002-03-15 16:23     ` Peter Wächtler
  1 sibling, 0 replies; 84+ messages in thread
From: Hubertus Franke @ 2002-03-15 15:29 UTC (permalink / raw)
  To: Martin Wirth, Rusty Russell; +Cc: linux-kernel, lse-tech, babt

On Friday 15 March 2002 03:41 am, Martin Wirth wrote:
> Rusty Russell wrote:
> >Discussions with Ulrich have reaffirmed my opinion that pthreads are
> >crap.  Hence I'm not all that tempted to warp the (nice, clean,
> >usable) futex code too far to meet pthreads' wierd needs.
>
> Crap or not, there are tons of software based on pthreads and at least
> the NGPT team says that Linus
> agreed to implement for necessary kernel-infrastructure for a full, fast
> pthread implementation.
>
> Now, if you want to implement mutexes and condition variables with the
> attribute PTHREAD_PROCESS_SHARED then you need some functionality like the
> futexes. Or NGPT will add his own syscalls to handle these things, which is
> simply unnecessary double functionality.
>
> >However, it's not too hard to implement condition variables using an
> >unavailable mutex, if we go for "full" semaphores: ie. not just
> >mutexes.  It requires a bit more of a stretch for kernel atomic ops...
>
> A full semaphore is nice, but not a full replacement for a waitqueue (or
> a pthread condition variable brr..).
> For the semaphore you always have to assure that the ups and downs are
> balanced, what is not the case
> for the condition variable.
>
> Martin
>

Folks, its not that simple as "use futex for PTHREAD_PROCESS_SHARED".
First, you must realize that conceptually, the N kernel threads utilized in a 
M:N thread model like NGPT are virtual processors. Hence you can't simply
wait in the kernel as you would block your v-proc. Hence the current
futex interface of up and down kernel calls in not sufficient.

What is required is an asynchronous mechanism that lets a v-proc 
leave a notification object <nobj> in the kernel that get's enqueued just 
like every other waiting task. <nobj> ::= <v-proc, struct *futex>
When the futex is signaled and <nobj> is woken up, a scheduling event is send 
to the <v-proc> or its task or its process (this has to be thought through).
This can be done through a signal or through an shared event queue or a 
combination of such. 

Under no circumstances can you block the <v-proc> == task on a futex !!!

I talked with Bill Abt of the NGPT team about it and he conceptually agrees 
with this approach, but since the regular interface is still not hammered out 
and stable, no point going after more sophisticated stuff yet.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-15  8:41   ` Martin Wirth
  2002-03-15 15:29     ` Hubertus Franke
@ 2002-03-15 16:23     ` Peter Wächtler
  2002-03-16  0:12       ` Rusty Russell
  1 sibling, 1 reply; 84+ messages in thread
From: Peter Wächtler @ 2002-03-15 16:23 UTC (permalink / raw)
  To: Martin Wirth; +Cc: Rusty Russell, linux-kernel

Martin Wirth wrote:

> 
> Rusty Russell wrote:
> 
>>
>> Discussions with Ulrich have reaffirmed my opinion that pthreads are
>> crap.  Hence I'm not all that tempted to warp the (nice, clean,
>> usable) futex code too far to meet pthreads' wierd needs.
>>
> Crap or not, there are tons of software based on pthreads and at least 
> the NGPT team says that Linus
> agreed to implement for necessary kernel-infrastructure for a full, fast 
> pthread implementation.
> 
> Now, if you want to implement mutexes and condition variables with the 
> attribute
> PTHREAD_PROCESS_SHARED then you need some functionality like the futexes.
> Or NGPT will add his own syscalls to handle these things, which is simply
> unnecessary double functionality.
> 


I think the "crap" refers to current missing meatures of linuxthreads
(most notable: PTHREAD_PROCESS_SHARED on cond and mutex, don't know about sema)

BTW, NGPT introduces two new syscalls: gettid and tkill

>>
>> However, it's not too hard to implement condition variables using an
>> unavailable mutex, if we go for "full" semaphores: ie. not just
>> mutexes.  It requires a bit more of a stretch for kernel atomic ops...
>>
> A full semaphore is nice, but not a full replacement for a waitqueue (or 
> a pthread condition variable brr..).
> For the semaphore you always have to assure that the ups and downs are 
> balanced, what is not the case
> for the condition variable.
> 

also remember pthread_cond_broadcast - waking up _all_ waiting threads.
If the woken up threads check their condition and go to sleep again, is
up to them ( read: the standard mandates that _all_ get woken up)

pthread_cond_signal notifies _one_ thread - which one depends on implementation
( I would like to see a priority based decision )


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-15 16:23     ` Peter Wächtler
@ 2002-03-16  0:12       ` Rusty Russell
  2002-03-16 11:23         ` Martin Wirth
                           ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-16  0:12 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: Martin Wirth, linux-kernel, Ulrich Drepper

In message <3C922005.50608@loewe-komp.de> you write:
> Martin Wirth wrote:
> > Rusty Russell wrote:
> > 
> >>
> >> Discussions with Ulrich have reaffirmed my opinion that pthreads are
> >> crap.  Hence I'm not all that tempted to warp the (nice, clean,
> >> usable) futex code too far to meet pthreads' wierd needs.
> >>
> > Crap or not, there are tons of software based on pthreads and at least 
> > the NGPT team says that Linus
> > agreed to implement for necessary kernel-infrastructure for a full, fast 
> > pthread implementation.

Let me clarify my "pthreads are crap" statement.

I firmly believe that there is a place for clone & futex threaded
programs, which is not met by pthreads for cleanliness and performance
reasons, and that such programs will become increasingly important.

Therefore I refuse to penalise such progressive programs so we can
have standards compliance.  Hence my insistance on a clean, minimal,
useful interface.

> >> However, it's not too hard to implement condition variables using an
> >> unavailable mutex, if we go for "full" semaphores: ie. not just
> >> mutexes.  It requires a bit more of a stretch for kernel atomic ops...
> >>
> > A full semaphore is nice, but not a full replacement for a waitqueue (or 
> > a pthread condition variable brr..).
> > For the semaphore you always have to assure that the ups and downs are 
> > balanced, what is not the case
> > for the condition variable.
> > 
> 
> also remember pthread_cond_broadcast - waking up _all_ waiting threads.
> If the woken up threads check their condition and go to sleep again, is
> up to them ( read: the standard mandates that _all_ get woken up)
> 
> pthread_cond_signal notifies _one_ thread - which one depends on implementati
on
> ( I would like to see a priority based decision )

The solution I was referring to before, using full semaphores, would
look like so:

struct pthread_cond_t
{
	int num_waiting;
	struct futex wait, ack;
};

#define PTHREAD_COND_INITIALIZER { 0, { 0 }, { 0 } }

int pthread_cond_signal(pthread_cond_t *cond)
{
	if (cond->num_waiters)
		return futex_up(&cond->futex, 1);
	return 0;
}

int pthread_cond_broadcast(pthread_cond_t *cond)
{
	unsigned int waiters = cond->num_waiting;

	if (waiters) {
		futex_up(&cond->futex, waiters);
		/* Wait for ack before returning. */
		futex_down(&cond->ack);
	}
	return 0;
}

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
	int ret;

	/* Increment first so broadcaster knows we are waiting. */
	atomic_inc(cond->num_waiting);
	futex_up(&mutex, 1);
	ret = futex_down(&cond);
	if (atomic_dec_and_test(cond->num_waiting))
		futex_up(&cond->ack);
	futex_down(&mutex->futex);
	return ret;
}

Hope that clarifies,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-16  0:12       ` Rusty Russell
@ 2002-03-16 11:23         ` Martin Wirth
  2002-03-18  0:52           ` Ulrich Drepper
  2002-03-16 19:48         ` Peter Wächtler
  2002-03-17  6:50         ` Rusty Russell
  2 siblings, 1 reply; 84+ messages in thread
From: Martin Wirth @ 2002-03-16 11:23 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Peter Wächtler, linux-kernel, Ulrich Drepper



Rusty Russell wrote:

>
>The solution I was referring to before, using full semaphores, would
>look like so:
>
>struct pthread_cond_t
>{
>	int num_waiting;
>	struct futex wait, ack;
>};
>
>#define PTHREAD_COND_INITIALIZER { 0, { 0 }, { 0 } }
>
>int pthread_cond_signal(pthread_cond_t *cond)
>{
>	if (cond->num_waiters)
>		return futex_up(&cond->futex, 1);
>	return 0;
>}
>
>int pthread_cond_broadcast(pthread_cond_t *cond)
>{
>	unsigned int waiters = cond->num_waiting;
>
>	if (waiters) {
>		futex_up(&cond->futex, waiters);
>		/* Wait for ack before returning. */
>		futex_down(&cond->ack);
>	}
>	return 0;
>}
>
>int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
>{
>	int ret;
>
>	/* Increment first so broadcaster knows we are waiting. */
>	atomic_inc(cond->num_waiting);
>	futex_up(&mutex, 1);
>	ret = futex_down(&cond);
>	if (atomic_dec_and_test(cond->num_waiting))
>		futex_up(&cond->ack);
>	futex_down(&mutex->futex);
>	return ret;
>}
>
In principle that works. But one of  things that's less nice with 
pthread_cond_wait is
that you sometimes have a (most of the time) unnecessary schedule 
ping-pong, and with the
approach above you always have this (due to ack). And secondly if 
futex_up(&f, N) for N > 1
relies on the chained wakeup in the kernels futex_up routine the 
broadcast may take a while to
complete (the lowest priority waiter penalizes all others queued behind 
him). A semaphore simply is no full replacement for a waitqueue with 
wake_all.

Martin

P.S.  With respect to pthreads I was not thinking of a bloated N:M 
library, but of some simple
fast pthread semantics compatible wrapper for _clone etc.   







^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-16  0:12       ` Rusty Russell
  2002-03-16 11:23         ` Martin Wirth
@ 2002-03-16 19:48         ` Peter Wächtler
  2002-03-17  6:50         ` Rusty Russell
  2 siblings, 0 replies; 84+ messages in thread
From: Peter Wächtler @ 2002-03-16 19:48 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Martin Wirth, linux-kernel, Ulrich Drepper

Rusty Russell wrote:
> 
> In message <3C922005.50608@loewe-komp.de> you write:
> > Martin Wirth wrote:
> > > Rusty Russell wrote:
> > >
> > >>
> > >> Discussions with Ulrich have reaffirmed my opinion that pthreads are
> > >> crap.  Hence I'm not all that tempted to warp the (nice, clean,
> > >> usable) futex code too far to meet pthreads' wierd needs.
> > >>
> > > Crap or not, there are tons of software based on pthreads and at least
> > > the NGPT team says that Linus
> > > agreed to implement for necessary kernel-infrastructure for a full, fast
> > > pthread implementation.
> 
> Let me clarify my "pthreads are crap" statement.
> 
> I firmly believe that there is a place for clone & futex threaded
> programs, which is not met by pthreads for cleanliness and performance
> reasons, and that such programs will become increasingly important.
> 
> Therefore I refuse to penalise such progressive programs so we can
> have standards compliance.  Hence my insistance on a clean, minimal,
> useful interface.
> 
> > >> However, it's not too hard to implement condition variables using an
> > >> unavailable mutex, if we go for "full" semaphores: ie. not just
> > >> mutexes.  It requires a bit more of a stretch for kernel atomic ops...
> > >>
> > > A full semaphore is nice, but not a full replacement for a waitqueue (or
> > > a pthread condition variable brr..).
> > > For the semaphore you always have to assure that the ups and downs are
> > > balanced, what is not the case
> > > for the condition variable.
> > >
> >
> > also remember pthread_cond_broadcast - waking up _all_ waiting threads.
> > If the woken up threads check their condition and go to sleep again, is
> > up to them ( read: the standard mandates that _all_ get woken up)
> >
> > pthread_cond_signal notifies _one_ thread - which one depends on implementati
> on
> > ( I would like to see a priority based decision )
> 

I have to correct myself. This is the description of SUSV2 about
pthread_cond_signal/broadcast:

---snip---
These two functions are used to unblock threads blocked on a 
condition variable. 

The pthread_cond_signal() call unblocks at least one of the threads 
that are blocked on the specified condition variable cond (if any 
threads are blocked on cond). 

The pthread_cond_broadcast() call unblocks all threads currently 
blocked on the specified condition variable cond. 

If more than one thread is blocked on a condition variable, the 
scheduling policy determines the order in which threads are unblocked. 
When each thread unblocked as a result of a pthread_cond_signal() or 
pthread_cond_broadcast() returns from its call to pthread_cond_wait() 
or pthread_cond_timedwait(), the thread owns the mutex with which it 
called pthread_cond_wait() or pthread_cond_timedwait(). The thread(s) 
that are unblocked contend for the mutex according to the scheduling 
policy (if applicable), and as if each had called pthread_mutex_lock(). 

The pthread_cond_signal() or pthread_cond_broadcast() functions may 
be called by a thread whether or not it currently owns the mutex that 
threads calling pthread_cond_wait() or pthread_cond_timedwait() have 
associated with the condition variable during their waits; however, if 
predictable scheduling behaviour is required, then that mutex is locked 
by the thread calling pthread_cond_signal() or pthread_cond_broadcast(). 

The pthread_cond_signal() and pthread_cond_broadcast() functions have 
no effect if there are no threads currently blocked on cond.

RETURN VALUE

  If successful, the pthread_cond_signal() and pthread_cond_broadcast() 
  functions return zero. Otherwise, an error number is returned to
  indicate the error. 

ERRORS

  The pthread_cond_signal() and pthread_cond_broadcast() function 
  may fail if: 

 [EINVAL]
   The value cond does not refer to an initialised condition variable. 

   These functions will not return an error code of [EINTR]. 
--snip---

So the semantic of a condvar implies that when returning from a 
"succesful" wait [ i.e. not ETIMEDOUT] , the thread owns the mutex. 
Therefore the scheduler _should_ only wake up the highest priority 
waiting thread - it does not matter if we signal or broadcast!
It's the same operation in effect. Perhaps the broadcast is there 
for implementations that wouldn't queue up (or wake up) the waiters 
in priority order?

For this, I think, kernel support is best, since the waiters get 
woken up in priority order giving wake_one semantics.

For pthread_cond_timedwait() a kernel timer is necessary.
So making the signal/broadcast a syscall that does NOT lead to
a context switch would be benefical. At the next scheduling point
the kernel decides whom to wake up, also checking for timed waiters
to return with ETIMEDOUT.

Then there is the issue with a crashing process holding locks.
I think on IRIX the waiters get a trap causing them to die - what
else one could do?

[after writing and deleting some pseudo code]

I think the condvars are best implemented in shmem + kernel semaphores.
The only issue is a pthread_cond_timedwait - but a semaphore op IS
interruptible. Besides alarm(2)/setitimer(2) - what are other timeout 
mechanisms in Linux? 

In QNX Neutrino you have TimerTimeout() to arm a kernel timeout
for any blocking state (to avoid the window in alarm/timer_settime
and the blocking function call)

Without this you can hardly implement a reliable timeout (with
sub second resolution) for pthread_cond_timedwait or sigtimedwait.

So for PTHREAD_PROCESS_PRIVATE one could use futexes - on
PTHREAD_PROCESS_SHARED it has to reside in the kernel anyway and you
naturally have to live with context switches and a performance hit.

Now the problem with N:M threading model: here it's necessary to
prevent the kernel from blocking the whole process? 
Well, what is pthread_setconcurrency(int new_level) for?

And, what is so bad about condvars? How would you implement a 
typical consumer/producer problem?

To cite Stevens (Vol II, page 165): 
... "mutexes are for locking and cannot be used for waiting."
page 167: "A mutex is for locking and a condition variable is for 
waiting. ... and both are needed"

-- 
-----------------------------------------------------------------------
Peter Waechtler

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-16  0:12       ` Rusty Russell
  2002-03-16 11:23         ` Martin Wirth
  2002-03-16 19:48         ` Peter Wächtler
@ 2002-03-17  6:50         ` Rusty Russell
  2 siblings, 0 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-17  6:50 UTC (permalink / raw)
  To: Martin Wirth; +Cc: pwaechtler, linux-kernel, drepper

On Sat, 16 Mar 2002 12:23:26 +0100
Martin Wirth <martin.wirth@dlr.de> wrote:
> Rusty Russell wrote:
> >The solution I was referring to before, using full semaphores, would
> >look like so:

[snip]

> In principle that works. But one of  things that's less nice with 
> pthread_cond_wait is
> that you sometimes have a (most of the time) unnecessary schedule 
> ping-pong, and with the
> approach above you always have this (due to ack).

Only vs. pthread_cond_broadcast.  And if you're using that you probably
have some other performance issues anyway?

> And secondly if 
> futex_up(&f, N) for N > 1
> relies on the chained wakeup in the kernels futex_up routine the 
> broadcast may take a while to
> complete (the lowest priority waiter penalizes all others queued behind 
> him). A semaphore simply is no full replacement for a waitqueue with 
> wake_all.

Yes, we could have a "wake N" variant, which would be more efficient here.

Hope that clarifies,
Rusty.
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-16 11:23         ` Martin Wirth
@ 2002-03-18  0:52           ` Ulrich Drepper
  2002-03-19  3:28             ` Rusty Russell
  0 siblings, 1 reply; 84+ messages in thread
From: Ulrich Drepper @ 2002-03-18  0:52 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Martin Wirth, pwaechtler, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 578 bytes --]

On Sat, 2002-03-16 at 22:50, Rusty Russell wrote:

> Only vs. pthread_cond_broadcast.

No.  pthread_barrier_wait has the same problem.  It has to wake up lots
of thread.

> And if you're using that you probably
> have some other performance issues anyway?

Why?  Conditional variables are of use in situations with loosely
coupled threads.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-18  0:52           ` Ulrich Drepper
@ 2002-03-19  3:28             ` Rusty Russell
  2002-03-19  4:05               ` Ulrich Drepper
  2002-03-19  8:34               ` Martin Wirth
  0 siblings, 2 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-19  3:28 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: martin.wirth, pwaechtler, linux-kernel

On 17 Mar 2002 16:52:00 -0800
Ulrich Drepper <drepper@redhat.com> wrote:

> On Sat, 2002-03-16 at 22:50, Rusty Russell wrote:
> 
> > Only vs. pthread_cond_broadcast.
> 
> No.  pthread_barrier_wait has the same problem.  It has to wake up lots
> of thread.

Hmmm....

What do you WANT in a kernel primitive then?  Given that we now have mutexes,
what else do we need to make pthreads relatively painless?

> > And if you're using that you probably
> > have some other performance issues anyway?
> 
> Why?  Conditional variables are of use in situations with loosely
> coupled threads.

I meant vs. pthread_cond_signal.

Look, here is an example implementation.  Please suggest:
1) Where this is flawed,
2) Where this is suboptimal,
3) What kernel primitive would help to resolve these?

Thanks,
Rusty.
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

/* Assume we have the following semaphore operations:

   futex_down(futex);
   futex_down_time(futex, relative timeout);
   futex_up(futex, count);
*/
typedef struct
{
	struct futex futex;
} pthread_mutex_t;

typedef struct 
{
	int num_waiting;
	struct futex wait, ack;
} pthread_cond_t;

typedef struct
{
	unsigned int num_left;
	struct futex wait;
	unsigned int initial_count;
} pthread_barrier_t;

#define PTHREAD_MUTEX_INITIALIZER { { 1 } }
#define PTHREAD_COND_INITIALIZER { 0, { 0 }, { 0 } }

int pthread_barrier_init(struct pthread_barrier_t *barrier,
			 void *addr,
			 unsigned int count)
{
	barrier->num_left = barrier->initial_count = count;
	barrier->wait.count = 0;
}

int pthread_barrier_wait(struct pthread_barrier_t *barrier)
{
	if (atomic_dec_and_test(&barrier->num_left)) {
		/* Restore barrier. */
		barrier->num_left = barrier->initial_count;
		/* Wake the other threads */
		futex_up(&barrier->wait, barrier->initial_count-1);
		return 0; /* PTHREAD_BARRIER_SERIAL_THREAD */
	}
	while (futex_down(&barrier->wait) == 0 || errno == EINTR);
	return 1;
}

int pthread_cond_signal(pthread_cond_t *cond)
{
	if (cond->num_waiters)
		return futex_up(&cond->futex, 1);
	return 0;
}

int pthread_cond_broadcast(pthread_cond_t *cond)
{
	unsigned int waiters = cond->num_waiting;

	if (waiters) {
		/* Re-initialize ACK.  Could have been upped by
                   pthread_cond_signal and pthread_cond_wait. */
		cond->ack.count = 0;
		futex_up(&cond->futex, waiters);
		/* Wait for ack before returning. */
		futex_down(&cond->ack);
	}
	return 0;
}

static int __pthread_cond_wait(pthread_cond_t *cond,
			       pthread_mutex_t *mutex,
			       const struct timespec *reltime)
{
	int ret;

	/* Increment first so broadcaster knows we are waiting. */
	atomic_inc(cond->num_waiting);
	futex_up(&mutex, 1);
	do {
		ret = futex_down_time(&cond, reltime);
	} while (ret < 0 && errno == EINTR);
	if (atomic_dec_and_test(cond->num_waiting))
		futex_up(&cond->ack);
	futex_down(&mutex->futex);
	return ret;
}

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
	return __pthread_cond_wait(cond, mutex, NULL);
}

int pthread_cond_timedwait(pthread_cond_t *cond,
			   pthread_mutex_t *mutex,
			   const struct timespec *abstime)
{
	struct timeval _now;
	struct timespec now, rel;

	/* Absolute to relative */
	gettimeofday(&_now, NULL);
	TIMEVAL_TO_TIMESPEC(&_now, &now);
	if (now.tv_sec > abstime->tv_sec
	    || (now.tv_sec == abstime->tv_sec
		&& now.tv_nsec > abstime->tv_nsec))
		return ETIMEDOUT;

	rel.tv_sec = now.tv_sec - abstime->tv_sec;
	rel.tv_nsec = now.tv_usec - abstime->tv_usec;
	if (rel.tv_nsec < 0) {
		--rel.tv_sec;
		rel.tv_nsec += 1000000000;
	}
	return __pthread_cond_wait(cond, mutex, &rel);
}

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-19  3:28             ` Rusty Russell
@ 2002-03-19  4:05               ` Ulrich Drepper
  2002-03-20  6:20                 ` Rusty Russell
  2002-03-19  8:34               ` Martin Wirth
  1 sibling, 1 reply; 84+ messages in thread
From: Ulrich Drepper @ 2002-03-19  4:05 UTC (permalink / raw)
  To: Rusty Russell; +Cc: martin.wirth, pwaechtler, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 759 bytes --]

On Mon, 2002-03-18 at 19:28, Rusty Russell wrote:

> What do you WANT in a kernel primitive then?  Given that we now have mutexes,
> what else do we need to make pthreads relatively painless?

I think wrt to the mutexes only wake-all is missing.  I don't think that
semaphore semantic is needed in the kernel.


> Look, here is an example implementation.  Please suggest:
> 1) Where this is flawed,
> 2) Where this is suboptimal,
> 3) What kernel primitive would help to resolve these?

I'll look at this a bit later.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-19  3:28             ` Rusty Russell
  2002-03-19  4:05               ` Ulrich Drepper
@ 2002-03-19  8:34               ` Martin Wirth
  2002-03-20  6:45                 ` Rusty Russell
  1 sibling, 1 reply; 84+ messages in thread
From: Martin Wirth @ 2002-03-19  8:34 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Ulrich Drepper, pwaechtler, linux-kernel

Rusty Russell wrote:

> 1) Where this is flawed,

I. There is a race in __pthread_cond_wait between timeout and a 
cond_signal or broadcast. If the signal comes in

} while (ret < 0 && errno == EINTR);
 >>>>> we leave with errno==ETIMEDOUT and get signal or broadcast called 
here
	if (atomic_dec_and_test(cond->num_waiting))

then you up cond->wait one time to often, leaving it in an invalid state.

II. Your implementation relies on the fact that the signal or broadcast
caller owns the mutex used in cond_wait. According to the POSIX spec 
this need not be the case. The only thing that may happen is that you
miss a wakeup. But it is not allowed to screw up the internal state of
of the condition variable, which might well happen in your 
implementation. (Note: Calling cond_signal without holding the mutex is 
not necessarily flawed software. Think of a periodically occurring 
new_data or data_changed flag where it is not really important to sleep 
race free)

III. Minor nit: You should also clear cond->ack.count
in cond_signal otherwise it may wrap around soon (at least for a
24-bit atomic variable) if you mostly use cond_signal.

> 2) Where this is suboptimal,

As said in a previous e-mail, you need an futex_up(..,n) that
really wakes_up n thread at once.

> 3) What kernel primitive would help to resolve these?

Your exported waitqueues or my suggestion for a second waitqueue 
associated with a futex.

Martin

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-19  4:05               ` Ulrich Drepper
@ 2002-03-20  6:20                 ` Rusty Russell
  2002-03-20 10:42                   ` Peter Wächtler
  0 siblings, 1 reply; 84+ messages in thread
From: Rusty Russell @ 2002-03-20  6:20 UTC (permalink / raw)
  To: Ulrich Drepper; +Cc: martin.wirth, pwaechtler, linux-kernel

On 18 Mar 2002 20:05:22 -0800
Ulrich Drepper <drepper@redhat.com> wrote:

> On Mon, 2002-03-18 at 19:28, Rusty Russell wrote:
> 
> > What do you WANT in a kernel primitive then?  Given that we now have mutexes,
> > what else do we need to make pthreads relatively painless?
> 
> I think wrt to the mutexes only wake-all is missing.  I don't think that
> semaphore semantic is needed in the kernel.

And have all the waiters exit with errno = EAGAIN?  ("You didn't get it but
someone wanted you all woken").

It can be done, I'd have to see if is sufficient to implement pthreads.

Rusty.
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-19  8:34               ` Martin Wirth
@ 2002-03-20  6:45                 ` Rusty Russell
  2002-03-21  6:48                   ` Martin Wirth
  0 siblings, 1 reply; 84+ messages in thread
From: Rusty Russell @ 2002-03-20  6:45 UTC (permalink / raw)
  To: Martin.Wirth; +Cc: Ulrich Drepper, pwaechtler, linux-kernel

In message <3C96F81F.1020608@dlr.de> you write:
> Rusty Russell wrote:
> 
> 
> > 1) Where this is flawed,
> 
> 
> I. There is a race in __pthread_cond_wait between timeout and a 
> cond_signal or broadcast. If the signal comes in
> 
> } while (ret < 0 && errno == EINTR);
>  >>>>> we leave with errno==ETIMEDOUT and get signal or broadcast called 
> here
> 	if (atomic_dec_and_test(cond->num_waiting))
> 
> then you up cond->wait one time to often, leaving it in an invalid state.

Hmmm... this is true.

> II. Your implementation relies on the fact that the signal or broadcast
> caller owns the mutex used in cond_wait. According to the POSIX spec 
> this need not be the case. The only thing that may happen is that you
> miss a wakeup. But it is not allowed to screw up the internal state of
> of the condition variable, which might well happen in your 
> implementation. (Note: Calling cond_signal without holding the mutex is 
> not necessarily flawed software. Think of a periodically occurring 
> new_data or data_changed flag where it is not really important to sleep 
> race free)

I hadn't appreciated this.  That makes it harder.  I think I have to
abandon the atomics and use a mutex inside the condition variable.

> III. Minor nit: You should also clear cond->ack.count
> in cond_signal otherwise it may wrap around soon (at least for a
> 24-bit atomic variable) if you mostly use cond_signal.

Yep.

> > 2) Where this is suboptimal,
> 
> 
> As said in a previous e-mail, you need an futex_up(..,n) that
> really wakes_up n thread at once.

OK, we could read the value in the kernel's up() and wake that many.

> > 3) What kernel primitive would help to resolve these?
> 
> Your exported waitqueues or my suggestion for a second waitqueue 
> associated with a futex.

Any chance of a rough patch (to the code below, at least)?

Thanks!
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

--- non-pthreads.c.19-March-2002	Wed Mar 20 17:37:17 2002
+++ non-pthreads.c	Wed Mar 20 17:43:42 2002
@@ -11,6 +11,7 @@
 
 typedef struct 
 {
+	struct futex lock;
 	int num_waiting;
 	struct futex wait, ack;
 } pthread_cond_t;
@@ -48,23 +49,29 @@
 
 int pthread_cond_signal(pthread_cond_t *cond)
 {
+	futex_down(&cond->lock);
+	/* Reset this so it doesn't overflow */
+	cond->ack.count = 0;
 	if (cond->num_waiters)
 		return futex_up(&cond->futex, 1);
+	futex_up(&cond->lock, 1);
 	return 0;
 }
 
 int pthread_cond_broadcast(pthread_cond_t *cond)
 {
-	unsigned int waiters = cond->num_waiting;
-
-	if (waiters) {
-		/* Re-initialize ACK.  Could have been upped by
-                   pthread_cond_signal and pthread_cond_wait. */
+	futex_down(&cond->lock);
+	if (cond->num_waiting) {
 		cond->ack.count = 0;
+		/* Release the waiters. */
 		futex_up(&cond->futex, waiters);
 		/* Wait for ack before returning. */
 		futex_down(&cond->ack);
+		/* Reset wait, in case someone who was waiting timed
+                   out and didn't decrement. */
+		cond->wait.count = 0;
 	}
+	futex_up(&cond->lock);
 	return 0;
 }
 
@@ -75,8 +82,10 @@
 	int ret;
 
 	/* Increment first so broadcaster knows we are waiting. */
+	futex_down(&cond->lock);
 	atomic_inc(cond->num_waiting);
 	futex_up(&mutex, 1);
+	futex_up(&cond->lock, 1);
 	do {
 		ret = futex_down_time(&cond, reltime);
 	} while (ret < 0 && errno == EINTR);

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-20  6:20                 ` Rusty Russell
@ 2002-03-20 10:42                   ` Peter Wächtler
  2002-03-20 17:20                     ` Ulrich Drepper
  0 siblings, 1 reply; 84+ messages in thread
From: Peter Wächtler @ 2002-03-20 10:42 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Ulrich Drepper, martin.wirth, linux-kernel

Rusty Russell wrote:

> On 18 Mar 2002 20:05:22 -0800
> Ulrich Drepper <drepper@redhat.com> wrote:
> 
> 
>>On Mon, 2002-03-18 at 19:28, Rusty Russell wrote:
>>
>>
>>>What do you WANT in a kernel primitive then?  Given that we now have mutexes,
>>>what else do we need to make pthreads relatively painless?
>>>
>>I think wrt to the mutexes only wake-all is missing.  I don't think that
>>semaphore semantic is needed in the kernel.
>>
> 
> And have all the waiters exit with errno = EAGAIN?  ("You didn't get it but
> someone wanted you all woken").
> 

That's a joke, isn't it? ;-)

We can return -1 with errno=ELOCKBREAK if the process holding the lock

dies.

Why don't we need a semaphore in the kernel?
There is open_sem() for POSIX named semaphores. The SysV interface
is there - and it's far more ugly.

in linuxthreads we have:
sem_t *sem_open(const char *name, int oflag, ...)
{
   __set_errno (ENOSYS);
   return SEM_FAILED;
}

Well, mutexes are not semaphores. Why not have fusema or fuma?
They are good for thread pools where you want several workers.

> It can be done, I'd have to see if is sufficient to implement pthreads.
> 

I don't see the necessity for wake_all except for the barrier case.

A condvar implies, that when successfully returning, the thread
owns the mutex. So even in the case of cond_broadcast, only one thread
will get the mutex - the others will get blocked on that.

The only reason for cond_broadcast I can think of, are implementations
that wouldn't wake_up the highest priority waiting thread. So with
a broadcast all will be woken up and try to acquire the lock.

The waiters will NOT return without the mutex locked except on ETIMEDOUT
(when the thread gets cancelled, the cancelation handler will be called
and the thread will die).

It's also written in the spec, that when you want "predictable scheduling"
the caller of signal/broadcast _has_to_ hold the mutex. On release of
the mutex all other threads will race for it.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-20 10:42                   ` Peter Wächtler
@ 2002-03-20 17:20                     ` Ulrich Drepper
  0 siblings, 0 replies; 84+ messages in thread
From: Ulrich Drepper @ 2002-03-20 17:20 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: Rusty Russell, martin.wirth, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 682 bytes --]

On Wed, 2002-03-20 at 02:42, Peter Wächtler wrote:

> Well, mutexes are not semaphores. Why not have fusema or fuma?
> They are good for thread pools where you want several workers.

I have absolutely no objections whatsoever to get semaphore support in
the kernel.  It would be much more efficient in some situations and the
basic infrastructures is already there.  If you can convince the powers
there are to accept such a patch I'm all for it.

-- 
---------------.                          ,-.   1325 Chesapeake Terrace
Ulrich Drepper  \    ,-------------------'   \  Sunnyvale, CA 94089 USA
Red Hat          `--' drepper at redhat.com   `------------------------

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 232 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-20  6:45                 ` Rusty Russell
@ 2002-03-21  6:48                   ` Martin Wirth
  2002-03-24 18:25                     ` Peter Wächtler
  0 siblings, 1 reply; 84+ messages in thread
From: Martin Wirth @ 2002-03-21  6:48 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Ulrich Drepper, pwaechtler, linux-kernel

Rusty Russell wrote:

>2) Where this is suboptimal,

Up to know I was too focused on the wait functions, but there is
also a problem with cond_broadcast (and the mutex-protected version of 
cond_signal): since they may block (on ack or lock) this opens up 
chances for priority inversion like problems. I think to be really 
usefull cond_broacast and cond_signal should have a non-blocking 
behaviour with predictible runtime.

Just to convince you that this is a real world problem here is a 
description of one of my data-aquisition programs:

A 'producer' thread waits for the trigger of a transient recorder at 500 
Hz IRQ-rate, reads out 64k on each event into a large circular buffer,
calls cond_broadcast (every 5th IRQ) without holding a mutex and goes to 
sleep to wait for the next IRQ. (This thread is SCHED_FIFO)

Then there are three (SCHED_OTHER) 'consumer' threads which work on the 
same data doing different things of different importance (group them 
according to some hardware parameter and store them into different 
files, calculate averaged powerspectra, select pieces for an online 
scope-like display etc.)

If in this scenario the producer would have to wait in cond_broadcast 
until the least prio consumer has acknowledged (which may take a timer 
tick or longer) he would lose several IRQs each time.

So for my applications a cond_broadcast blocking for the waiters is 
simply not acceptable.

Martin

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-21  6:48                   ` Martin Wirth
@ 2002-03-24 18:25                     ` Peter Wächtler
  2002-03-25  2:28                       ` Rusty Russell
  0 siblings, 1 reply; 84+ messages in thread
From: Peter Wächtler @ 2002-03-24 18:25 UTC (permalink / raw)
  To: Martin.Wirth; +Cc: Rusty Russell, Ulrich Drepper, linux-kernel

Martin Wirth wrote:
> 
> Rusty Russell wrote:
> 
> >2) Where this is suboptimal,
> 
> Up to know I was too focused on the wait functions, but there is
> also a problem with cond_broadcast (and the mutex-protected version of
> cond_signal): since they may block (on ack or lock) this opens up
> chances for priority inversion like problems. I think to be really
> usefull cond_broacast and cond_signal should have a non-blocking
> behaviour with predictible runtime.
> 
[real world example deleted]
> So for my applications a cond_broadcast blocking for the waiters is
> simply not acceptable.
> 
Exactly.
I can't see a reason why the ack-futex is needed. I think we can simply
delete it.
When deleted, the broadcast wouldn't block on ack (also preventing 
schedule ping-pong). With the cond->lock it's save to have several
broadcasters. That's fine.
But:
static int __pthread_cond_wait(pthread_cond_t *cond,
                               pthread_mutex_t *mutex,
                               const struct timespec *reltime)
{
        int ret;

        /* Increment first so broadcaster knows we are waiting. */
	futex_down(&cond->lock);
        atomic_inc(cond->num_waiting);
(*)     futex_up(&mutex, 1);
a)	futex_up(&cond->lock, 1);  [move into syscall]
        do {
b)                ret = futex_down_time(&cond, ABSTIME); [cond_timed_wait]
        } while (ret < 0 && errno == EINTR);
	[futex_up(&cond->lock, 1); /* release condvar */]

        futex_down(&mutex->futex);
        return ret;
}

With the original code, we have a "signal/broadcast lost window (a->b)" 
that shouldn't be there:

SUSV2 on pthread_cond_[timed]wait:
These functions atomically release mutex(*) and cause the calling 
thread to block on the condition variable cond; atomically here means
"atomically with respect to access by another thread to the mutex and 
then the condition variable". That is, if another thread is able to
acquire the mutex after the about-to-block thread has released it, then 
a subsequent call to pthread_cond_signal() or pthread_cond_broadcast() 
in that thread behaves as if it were issued after the about-to-block 
thread has blocked. 

So we would need to enhance the futex_down_timed() call, to
atomically release the cond->lock on entry, re-aquiring on exit (because
of the loop).
This boils down to a cond_var syscall to me (wouldn't sys_ulock(,,OP)
a better name ? with OPs like MUTEX_UP,MUTEX_DOWN, SEMA_UPn, SEMA_DOWNn, 
COND_WAIT, COND_TIMED_WAIT, COND_SIGNAL, COND_BROADCAST, RWLOCK_WRLOCK,
RWLOCK_RDLOCK,RWLOCK_UNLOCK)

Also note that we have to recalculate the relative time to sleep when
signalled - or just using an absolute time stamp.
If the syscall is interruptible, we open the "signal/broadcast lost 
window (a->b)" again... hmh (Here queuing up RT signals are much better
for handling the wakeup, because you can block them, and they don't get 
lost)

Alternatively when using the uwaitq: they could use a lock to serialize
an add/wait and a possibly parallel wake operation (but with the above
locks you can achieve exactly this)

-- 
-----------------------------------------------------------------------
Peter Waechtler

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-24 18:25                     ` Peter Wächtler
@ 2002-03-25  2:28                       ` Rusty Russell
  2002-03-25  4:46                         ` Rusty Russell
  2002-03-25  9:47                         ` Peter Wächtler
  0 siblings, 2 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-25  2:28 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: Rusty Russell, Ulrich Drepper, linux-kernel

In message <3C9E1A10.F7AA6D6E@loewe-komp.de> you write:
> I can't see a reason why the ack-futex is needed. I think we can simply
> delete it.
> When deleted, the broadcast wouldn't block on ack (also preventing 
> schedule ping-pong). With the cond->lock it's save to have several
> broadcasters. That's fine.

No, you might end up waking someone who did the pthread_cond_wait()
after you did the pthread_cond_broadcast in place of one of the
existing pthread_cond_wait() threads.

I don't believe this is allowed.  

> But:
> static int __pthread_cond_wait(pthread_cond_t *cond,
>                                pthread_mutex_t *mutex,
>                                const struct timespec *reltime)
> {
>         int ret;
> 
>         /* Increment first so broadcaster knows we are waiting. */
> 	futex_down(&cond->lock);
>         atomic_inc(cond->num_waiting);
> (*)     futex_up(&mutex, 1);
> a)	futex_up(&cond->lock, 1);  [move into syscall]
>         do {
> b)                ret = futex_down_time(&cond, ABSTIME); [cond_timed_wait]
>         } while (ret < 0 && errno == EINTR);
> 	[futex_up(&cond->lock, 1); /* release condvar */]
> 
>         futex_down(&mutex->futex);
>         return ret;
> }
> 
> With the original code, we have a "signal/broadcast lost window (a->b)" 
> that shouldn't be there:

Where?  Having done the inc, the futex_up at (a) will fall through,
giving the "thread behaves as if it [signal or broadcast] were issued
after the about-to-block thread has blocked."

> So we would need to enhance the futex_down_timed() call, to
> atomically release the cond->lock on entry, re-aquiring on exit (because
> of the loop).
> This boils down to a cond_var syscall to me (wouldn't sys_ulock(,,OP)
> a better name ? with OPs like MUTEX_UP,MUTEX_DOWN, SEMA_UPn, SEMA_DOWNn, 
> COND_WAIT, COND_TIMED_WAIT, COND_SIGNAL, COND_BROADCAST, RWLOCK_WRLOCK,
> RWLOCK_RDLOCK,RWLOCK_UNLOCK)

You're talking about a completely different beast.

So the summary is: futexes not sufficient to implement pthreads
locking.  That's fine; I can drop all the "extension for pthreads"
futex patches and leave the code as-is ('cept the UP_FAIR patch, which
is independent of this debate).

Thanks!
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-25  2:28                       ` Rusty Russell
@ 2002-03-25  4:46                         ` Rusty Russell
  2002-03-25 11:56                           ` Peter Wächtler
  2002-03-25  9:47                         ` Peter Wächtler
  1 sibling, 1 reply; 84+ messages in thread
From: Rusty Russell @ 2002-03-25  4:46 UTC (permalink / raw)
  To: Rusty Russell; +Cc: pwaechtler, drepper, linux-kernel

On Mon, 25 Mar 2002 13:28:44 +1100
Rusty Russell <rusty@rustcorp.com.au> wrote:
> So the summary is: futexes not sufficient to implement pthreads
> locking.

So let's go back to the more generic "exporting waitqueues to userspace" idea,
with a twist: we use a userspace address as the identifier on the waitq, which
gives us a unique identifier, but the kernel never actually derefs the
pointer.  (And in my prior kernel code I optimized it so that waking did an
implicit remove; not sure it's a win, so assumed that was removed here).

This gives code as below (Peter, Martin, please check):

/* Assume we have the following operations:

   uwaitq_add(void *uaddr);
   uwaitq_remove(void *uaddr);
   uwaitq_wake(void *uaddr, int wake_all_flag);
   uwaitq_wait(relative timeout);
*/
typedef struct
{
	int counter;
} pthread_mutex_t;

typedef struct 
{
	int condition;
} pthread_cond_t;

typedef struct
{
	unsigned int num_left;
	unsigned int initial_count;
} pthread_barrier_t;

#define PTHREAD_MUTEX_INITIALIZER { { 1 } }
#define PTHREAD_COND_INITIALIZER { 0, { 0 }, { 0 } }

int pthread_barrier_init(struct pthread_barrier_t *barrier,
			 void *addr,
			 unsigned int count)
{
	barrier->num_left = barrier->initial_count = count;
}

int pthread_barrier_wait(struct pthread_barrier_t *barrier)
{
	/* Use barrier address as uwaitq id. */
	uwaitq_add(barrier);
	if (atomic_dec_and_test(&barrier->num_left)) {
		/* Restore barrier. */
		barrier->num_left = barrier->initial_count;
		/* Wake the other threads */
		uwaitq_wake(barrier, 1 /* WAKE_ALL */);
		uwaitq_remove(barrier);
		return 0; /* PTHREAD_BARRIER_SERIAL_THREAD */
	}
	while (uwaitq_wait(NULL) == 0 || errno == EINTR);
	uwaitq_remove(barrier);
	return 1;
}

int pthread_cond_signal(pthread_cond_t *cond)
{
	return uwaitq_wake(cond, 0 /* WAKE_ONE */);
}

int pthread_cond_broadcast(pthread_cond_t *cond)
{
	return uwaitq_wake(cond, 1 /* WAKE_ALL */);
}

static int __pthread_cond_wait(pthread_cond_t *cond,
			       pthread_mutex_t *mutex,
			       const struct timespec *reltime)
{
	int ret;

	uwaitq_add(cond);
	futex_up(&mutex, 1);
	while ((ret = uwaitq_wait(reltime)) == 0 || errno == EINTR);
	uwaitq_remove(cond);
	futex_down(&mutex, NULL);
	return ret;
}

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
	return __pthread_cond_wait(cond, mutex, NULL);
}

int pthread_cond_timedwait(pthread_cond_t *cond,
			   pthread_mutex_t *mutex,
			   const struct timespec *abstime)
{
	struct timeval _now;
	struct timespec now, rel;

	/* Absolute to relative */
	gettimeofday(&_now, NULL);
	TIMEVAL_TO_TIMESPEC(&_now, &now);
	if (now.tv_sec > abstime->tv_sec
	    || (now.tv_sec == abstime->tv_sec
		&& now.tv_nsec > abstime->tv_nsec))
		return ETIMEDOUT;

	rel.tv_sec = now.tv_sec - abstime->tv_sec;
	rel.tv_nsec = now.tv_usec - abstime->tv_usec;
	if (rel.tv_nsec < 0) {
		--rel.tv_sec;
		rel.tv_nsec += 1000000000;
	}
	return __pthread_cond_wait(cond, mutex, &rel);
}

-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-25  2:28                       ` Rusty Russell
  2002-03-25  4:46                         ` Rusty Russell
@ 2002-03-25  9:47                         ` Peter Wächtler
  1 sibling, 0 replies; 84+ messages in thread
From: Peter Wächtler @ 2002-03-25  9:47 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel

Rusty Russell wrote:

> In message <3C9E1A10.F7AA6D6E@loewe-komp.de> you write:
> 
>>I can't see a reason why the ack-futex is needed. I think we can simply
>>delete it.
>>When deleted, the broadcast wouldn't block on ack (also preventing 
>>schedule ping-pong). With the cond->lock it's save to have several
>>broadcasters. That's fine.
>>
> 
> No, you might end up waking someone who did the pthread_cond_wait()
> after you did the pthread_cond_broadcast in place of one of the
> existing pthread_cond_wait() threads.
> 
> I don't believe this is allowed.  
> 


Indeed, I suspect that this isn't wanted.
With the cond->lock you almost prevent this: an ongoing broadcast
can't be intermixed with newly incoming waiters (they will block
on futex_down(&cond->lock))
But there is the window between a->b...


> 
>>But:
>>static int __pthread_cond_wait(pthread_cond_t *cond,
>>                               pthread_mutex_t *mutex,
>>                               const struct timespec *reltime)
>>{
>>        int ret;
>>
>>        /* Increment first so broadcaster knows we are waiting. */
>>	futex_down(&cond->lock);
>>        atomic_inc(cond->num_waiting);
>>(*)     futex_up(&mutex, 1);
>>a)	futex_up(&cond->lock, 1);  [move into syscall]
>>        do {
>>b)                ret = futex_down_time(&cond, ABSTIME); [cond_timed_wait]
>>        } while (ret < 0 && errno == EINTR);
>>	[futex_up(&cond->lock, 1); /* release condvar */]
>>
>>        futex_down(&mutex->futex);
>>        return ret;
>>}
>>
>>With the original code, we have a "signal/broadcast lost window (a->b)" 
>>that shouldn't be there:
>>
> 
> Where?  Having done the inc, the futex_up at (a) will fall through,
> giving the "thread behaves as if it [signal or broadcast] were issued
> after the about-to-block thread has blocked."
> 
Right after (a) another thread gets scheduled, issueing a signal/broadcast.

Ah, and then the futex_down_timed() wouldn't block, OK ;-)
But this way you have to use the ack->lock

 

I strongly believe, that the implementation of a condvar needs a lock
to prevent intermixed calls. You will see my comment on your implementation
with uwaitq. ;-)





^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-25  4:46                         ` Rusty Russell
@ 2002-03-25 11:56                           ` Peter Wächtler
  2002-03-26  1:02                             ` Rusty Russell
  0 siblings, 1 reply; 84+ messages in thread
From: Peter Wächtler @ 2002-03-25 11:56 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Martin.Wirth, linux-kernel

Rusty Russell wrote:

> On Mon, 25 Mar 2002 13:28:44 +1100
> Rusty Russell <rusty@rustcorp.com.au> wrote:
> 
>>So the summary is: futexes not sufficient to implement pthreads
>>locking.
>>
> 
> So let's go back to the more generic "exporting waitqueues to userspace" idea,
> with a twist: we use a userspace address as the identifier on the waitq, which
> gives us a unique identifier, but the kernel never actually derefs the
> pointer.  (And in my prior kernel code I optimized it so that waking did an
> implicit remove; not sure it's a win, so assumed that was removed here).
> 
> This gives code as below (Peter, Martin, please check):
> 
> /* Assume we have the following operations:
> 
>    uwaitq_add(void *uaddr);
>    uwaitq_remove(void *uaddr);
>    uwaitq_wake(void *uaddr, int wake_all_flag);
>    uwaitq_wait(relative timeout);
> */
> typedef struct
> {
> 	int counter;
> } pthread_mutex_t;
> 
> typedef struct 
> {
> 	int condition;
> } pthread_cond_t;
> 
 
> int pthread_cond_signal(pthread_cond_t *cond)
> {
> 	return uwaitq_wake(cond, 0 /* WAKE_ONE */);
> }
> 
> int pthread_cond_broadcast(pthread_cond_t *cond)
> {
> 	return uwaitq_wake(cond, 1 /* WAKE_ALL */);
> }
> 
> static int __pthread_cond_wait(pthread_cond_t *cond,
> 			       pthread_mutex_t *mutex,
> 			       const struct timespec *reltime)
> {
> 	int ret;
> 
> 	uwaitq_add(cond);
> 	futex_up(&mutex, 1);


Here another thread gets scheduled, calling signal/broadcast.
Since the former thread is already on the queue -> well done ;-)


> 	while ((ret = uwaitq_wait(reltime)) == 0 || errno == EINTR);
> 	uwaitq_remove(cond);
> 	futex_down(&mutex, NULL);
> 	return ret;
> }


I assume that uwaitq_wait() will modify the reltime (which is legal)
if signalled. Otherwise we would wait 2*retime, and so on

Then we have to be careful about errno and the return values:

static int __pthread_cond_wait(pthread_cond_t *cond,
			       pthread_mutex_t *mutex,
			       const struct timespec *reltime)
{
	int ret;

! 
if (uwaitq_add(cond) == -1)
+ 
	return -1;
! 
if (futex_up(&mutex, 1) == -1){
+ 
	uwaitq_remove(cond);
+ 
	return -1;
+ 
}
! 
while ((ret = uwaitq_wait(cond,reltime)) == -1 && errno == EINTR);
+ 
saverrno=errno;
	uwaitq_remove(cond);
	futex_down(&mutex, NULL);
+ 
if (ret == -1){
+ 
  	if (saverrno == ENOENT)
+ 
		return 0; /* there was a sig/broadc before we went to sleep*/
+ 
	errno=saverrno;
+ 
	return -1;
+ 
}
	return 0;
}

I assume that uwaitq_wait() will return -1 and errno==ENOENT or similar
if we are not on the queue (anymore), -1 and ETIMEDOUT on timeout,
-1 and EINVAL on illegal cond or reltime ,zero on wakeup?



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-25 11:56                           ` Peter Wächtler
@ 2002-03-26  1:02                             ` Rusty Russell
  2002-03-26  8:17                               ` Martin Wirth
  0 siblings, 1 reply; 84+ messages in thread
From: Rusty Russell @ 2002-03-26  1:02 UTC (permalink / raw)
  To: Peter Wächtler; +Cc: Martin.Wirth, linux-kernel

In message <3C9F1074.6030902@loewe-komp.de> you write:
> > 	while ((ret = uwaitq_wait(reltime)) == 0 || errno == EINTR);
> > 	uwaitq_remove(cond);
> > 	futex_down(&mutex, NULL);
> > 	return ret;
> > }
> 
> 
> I assume that uwaitq_wait() will modify the reltime (which is legal)
> if signalled. Otherwise we would wait 2*retime, and so on

Oh yeah, I'll have to split that out and handle it.  Not worth having
the kernel do it as it's hardly going to be a fast path...

> Then we have to be careful about errno and the return values:

Sigh.  Yep.  I was pretty lax...

> I assume that uwaitq_wait() will return -1 and errno==ENOENT or similar
> if we are not on the queue (anymore), -1 and ETIMEDOUT on timeout,
> -1 and EINVAL on illegal cond or reltime ,zero on wakeup?

uwaitq_wait() which does not follow a uwaitq_add() would be some kind
of error, yes.  ETIMEDOUT on timeout, yes.  EFAULT on illegal cond
(that's all it can tell, that the address isn't in the user's range),
and 0 on success.

OK, new tidier version...
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

/* Assume we have the following operations:

   uwaitq_add(void *uaddr);
   uwaitq_remove(void *uaddr);
   uwaitq_wake(void *uaddr, int wake_all_flag);
   uwaitq_wait(relative timeout);

   And on top of them:
   futex_down(struct futex *);
   futex_up(struct futex *);
*/
typedef struct futex pthread_mutex_t;

typedef struct 
{
	int dummy; /* This could be an empty struct for gcc */
} pthread_cond_t;

typedef struct
{
	unsigned int num_left;
	unsigned int initial_count;
} pthread_barrier_t;

#define PTHREAD_MUTEX_INITIALIZER FUTEX_INITIALIZER
#define PTHREAD_COND_INITIALIZER { 0 }

int pthread_barrier_init(struct pthread_barrier_t *barrier,
			 void *addr,
			 unsigned int count)
{
	barrier->num_left = barrier->initial_count = count;
}

int pthread_barrier_wait(struct pthread_barrier_t *barrier)
{
	int ret;

	/* Use barrier address as uwaitq id. */
	ret = uwaitq_add(barrier);
	if (ret < 0)
		return ret;

	if (atomic_dec_and_test(&barrier->num_left)) {
		/* Restore barrier. */
		barrier->num_left = barrier->initial_count;
		/* Wake the other threads */
		uwaitq_wake(barrier, 1 /* WAKE_ALL */);
		uwaitq_remove(barrier);
		return 0; /* PTHREAD_BARRIER_SERIAL_THREAD */
	}
	while (uwaitq_wait(NULL) == 0 || errno == EINTR);
	uwaitq_remove(barrier);
	return 1;
}

int pthread_cond_signal(pthread_cond_t *cond)
{
	return uwaitq_wake(cond, 0 /* WAKE_ONE */);
}

int pthread_cond_broadcast(pthread_cond_t *cond)
{
	return uwaitq_wake(cond, 1 /* WAKE_ALL */);
}

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
{
	int ret, saved_errno;

	uwaitq_add(cond);
	futex_up(&mutex);
	while ((ret = uwaitq_wait(NULL)) == 0 || errno == EINTR);
	saved_errno = errno;
	uwaitq_remove(cond);
	futex_down(&mutex);
	errno = saved_errno;
	return ret;
}

int pthread_cond_timedwait(pthread_cond_t *cond,
			   pthread_mutex_t *mutex,
			   const struct timespec *abstime)
{
	struct timeval _now;
	struct timespec now, rel;
	int saved_errno, ret;

	/* Absolute to relative */
 again:
	gettimeofday(&_now, NULL);
	TIMEVAL_TO_TIMESPEC(&_now, &now);
	if (now.tv_sec > abstime->tv_sec
	    || (now.tv_sec == abstime->tv_sec
		&& now.tv_nsec > abstime->tv_nsec))
		return ETIMEDOUT;

	rel.tv_sec = now.tv_sec - abstime->tv_sec;
	rel.tv_nsec = now.tv_usec - abstime->tv_usec;
	if (rel.tv_nsec < 0) {
		--rel.tv_sec;
		rel.tv_nsec += 1000000000;
	}

	uwaitq_add(cond);
	futex_up(&mutex);
	ret = uwaitq_wait(&rel);
	if (ret < 0 && errno == EINTR)
		goto again;

	saved_errno = errno;
	uwaitq_remove(cond);
	futex_down(&mutex);
	errno = saved_errno;

	return ret;
}

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-26  1:02                             ` Rusty Russell
@ 2002-03-26  8:17                               ` Martin Wirth
  2002-03-26 23:10                                 ` Rusty Russell
  0 siblings, 1 reply; 84+ messages in thread
From: Martin Wirth @ 2002-03-26  8:17 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Peter Wächtler, linux-kernel


>
>   And on top of them:
>   futex_down(struct futex *);
>   futex_up(struct futex *);
>
Why not keep the simple one-sys-call interface for the fuxtexes. The 
code is so small that it is
 not worth to delete it.

>
>
>int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
>{
>	int ret, saved_errno;
>
>	uwaitq_add(cond);
>	futex_up(&mutex);
>	while ((ret = uwaitq_wait(NULL)) == 0 || errno == EINTR);
>	saved_errno = errno;
>	uwaitq_remove(cond);
>	futex_down(&mutex);
>
You should loop here in order catch signals:

	while ( futex_down(&mutex) < 0 && errno == EINTR)

>
>	if (ret < 0 && errno == EINTR)
>		goto again;
>
This assumes that you are allowed to do a double uwaitq_add.

>
>	saved_errno = errno;
>	uwaitq_remove(cond);
>	futex_down(&mutex);
>
Also loop here

>
>	errno = saved_errno;
>
>	return ret;
>}
>
Now whats interesting is the kernel part. I must admit that I haven't 
fully understood all
effects of the double use of the cookie in your first implementation. 
But if you use a memory
location as identifier you have to keep a separate flag within 
uwaitq_head that is zeroed
before you add to the waitqueue and set by the signal functions. Then 
uwaitq_wait has to check for it.
This is necessary in order not to loose a wakeup while you are on the 
queue but not sleeping.

uwaitq_remove also takes an argument, are you heading for waiting on 
multiple events?

Since you need to pin down the page between uwaitq_add and uwaitq_remove 
you will have to limit
the number of simultaneous add calls. Should this be configurable?

Cheers,
Martin
 


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-26  8:17                               ` Martin Wirth
@ 2002-03-26 23:10                                 ` Rusty Russell
  2002-03-27 21:05                                   ` Hubertus Franke
  0 siblings, 1 reply; 84+ messages in thread
From: Rusty Russell @ 2002-03-26 23:10 UTC (permalink / raw)
  To: Martin Wirth; +Cc: Peter Wächtler, linux-kernel

In message <3CA02E80.1000600@dlr.de> you write:
> 
> >
> >   And on top of them:
> >   futex_down(struct futex *);
> >   futex_up(struct futex *);
> >
> Why not keep the simple one-sys-call interface for the fuxtexes. The 
> code is so small that it is
>  not worth to delete it.

Because it's a premature optimization.  We can add it later if someone
proves it's a major hotspot.  But it may turn out some other primitive
is more important: I'm not smart enough to know.

> >int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex)
> >{
...
> >	futex_down(&mutex);
> >
> You should loop here in order catch signals:
> 
> 	while ( futex_down(&mutex) < 0 && errno == EINTR)

Yep.

> >	if (ret < 0 && errno == EINTR)
> >		goto again;
> >
> This assumes that you are allowed to do a double uwaitq_add.

Oops.... Fixed by moving the uwaitq_add above the again: label.

> >
> >	saved_errno = errno;
> >	uwaitq_remove(cond);
> >	futex_down(&mutex);
> >
> Also loop here

Yep.

> Now whats interesting is the kernel part. I must admit that I haven't 
> fully understood all
> effects of the double use of the cookie in your first implementation. 
> But if you use a memory
> location as identifier you have to keep a separate flag within 
> uwaitq_head that is zeroed
> before you add to the waitqueue and set by the signal functions. Then 
> uwaitq_wait has to check for it.

Yes.  My prior implementation actually deleted the sleeper from the
queue on wakeup, making it easy to detect and also saving a syscall.
I'm leaning towards the "flag" idea now though.

> uwaitq_remove also takes an argument, are you heading for waiting on 
> multiple events?

Yes.  The problem is: how many?  Each one takes memory and (as you
point out) pinned pages (although this could be changed with some loss
of simplicity).  An arbitrary limit smacks of SysV IPC: a sure sign of
bad taste.  So for the moment, the limit is 1.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-26 23:10                                 ` Rusty Russell
@ 2002-03-27 21:05                                   ` Hubertus Franke
  2002-03-27 23:53                                     ` Rusty Russell
  0 siblings, 1 reply; 84+ messages in thread
From: Hubertus Franke @ 2002-03-27 21:05 UTC (permalink / raw)
  To: Rusty Russell, Martin Wirth; +Cc: Peter Wächtler, linux-kernel

On Tuesday 26 March 2002 06:10 pm, Rusty Russell wrote:
> In message <3CA02E80.1000600@dlr.de> you write:
> > >   And on top of them:
> > >   futex_down(struct futex *);
> > >   futex_up(struct futex *);
> >
> > Why not keep the simple one-sys-call interface for the fuxtexes. The
> > code is so small that it is
> >  not worth to delete it.
>
>

Rusty, you lost me in all these discussions now.
Is the current position to export wait queues and drop the futex interface ?
I would recommend against that. If we need 2 syscalls to implement
the futex behavior that certainly will create quite some overhead.

>From my own implementation, I exported the wait queues and I didn't need the
add/wait sequence. This as you know is/was due to the fact that I used 
semaphores in the kernel. While that created some allocation problems and 
won't allow for usage of the wait queues, it seems more compact.
Any chance to move the semaphore behavior into the futexes.



-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-27 21:05                                   ` Hubertus Franke
@ 2002-03-27 23:53                                     ` Rusty Russell
  0 siblings, 0 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-27 23:53 UTC (permalink / raw)
  To: frankeh; +Cc: Peter Wächtler, linux-kernel

In message <20020327210454.BBB763FE06@smtp.linux.ibm.com> you write:
> On Tuesday 26 March 2002 06:10 pm, Rusty Russell wrote:
> > In message <3CA02E80.1000600@dlr.de> you write:
> > > >   And on top of them:
> > > >   futex_down(struct futex *);
> > > >   futex_up(struct futex *);
> > >
> > > Why not keep the simple one-sys-call interface for the fuxtexes. The
> > > code is so small that it is
> > >  not worth to delete it.
> >
> >
> 
> Rusty, you lost me in all these discussions now.

I know the feeling 8)

> Is the current position to export wait queues and drop the futex interface ?
> I would recommend against that. If we need 2 syscalls to implement
> the futex behavior that certainly will create quite some overhead.

I'm still playing with the options.  Two system calls in the slow path
is definitely slower, but it's not insanely slow, and it's becoming
fairly clear that the wider range of primitives is worthwhile.

Both approaches can coexist, but I would consider the sys_futex call a
premature optimization if uwaitqs go in: this comes down to the
numbers.  I can supply a uwaitq implementation for benchmarking if you
want?

> >From my own implementation, I exported the wait queues and I didn't need the
> add/wait sequence. This as you know is/was due to the fact that I used 
> semaphores in the kernel. While that created some allocation problems and 
> won't allow for usage of the wait queues, it seems more compact.
> Any chance to move the semaphore behavior into the futexes.

The allocation element is the one I don't like: there's no really good
way of limiting it.

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-13  7:40                   ` Rusty Russell
@ 2002-03-13 16:37                     ` Alan Cox
  0 siblings, 0 replies; 84+ messages in thread
From: Alan Cox @ 2002-03-13 16:37 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Linus Torvalds, linux-kernel

> > And no, it's not worth discontinuing i386 support.  It just isn't
> > painful enough to maintain. 
> 
> How about just dropping 386 + SMP support?

We dont support SMP 386 boxes anyway. Nothing to drop. We've never supported
anything earlier than 486 SMP like the early MP1.1 compliant IBM boards, 
(and briefly the compaq non MP 1.1 compliant stuff  (Thomas Radke - the
forgotten man in the creation of SMP Linux - did 2/4 way compaq stuff)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-09  1:15                 ` Alan Cox
  2002-03-10 19:41                   ` Linus Torvalds
  2002-03-10 19:58                   ` Martin J. Bligh
@ 2002-03-13  7:40                   ` Rusty Russell
  2002-03-13 16:37                     ` Alan Cox
  2 siblings, 1 reply; 84+ messages in thread
From: Rusty Russell @ 2002-03-13  7:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Sun, 10 Mar 2002 19:41:36 +0000 (UTC)
torvalds@transmeta.com (Linus Torvalds) wrote:

> And no, it's not worth discontinuing i386 support.  It just isn't
> painful enough to maintain. 

How about just dropping 386 + SMP support?

Then it would be a nobrainer to have cmpxchg a generic operation, AND
export it to userspace in the proposed "kernel routine page".

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/arch/i386/config.in working-2.5.6-futex-6/arch/i386/config.in
--- linux-2.5.6/arch/i386/config.in	Wed Feb 20 17:56:59 2002
+++ working-2.5.6-futex-6/arch/i386/config.in	Sat Mar  9 15:17:18 2002
@@ -177,7 +177,9 @@
 
 bool 'Math emulation' CONFIG_MATH_EMULATION
 bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR
-bool 'Symmetric multi-processing support' CONFIG_SMP
+if [ "$CONFIG_M386" != "y" ]; then
+	bool 'Symmetric multi-processing support' CONFIG_SMP
+fi
 bool 'Preemptible Kernel' CONFIG_PREEMPT
 if [ "$CONFIG_SMP" != "y" ]; then
    bool 'Local APIC support on uniprocessors' CONFIG_X86_UP_APIC

Thanks!
Rusty.
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-12 14:56           ` Hubertus Franke
@ 2002-03-13  4:02             ` Rusty Russell
  0 siblings, 0 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-13  4:02 UTC (permalink / raw)
  To: frankeh, Linus Torvalds; +Cc: linux-kernel

In message <20020312145542.2C8613FE06@smtp.linux.ibm.com> you write:
> > If we basically export "add_to_waitqueue", "del_from_waitqueue",
> > "wait_for_waitqueue" and "wakeup_waitqueue" syscalls, we have a more
> > powerful interface: the kernel need not touch userspace addresses at
> > all (no kmap/kunmap, no worried about spinlocks vs. rwlocks).
> 
> Rusty, aren't you now going back to the design that I implemented after Ben's
> comments. 

> From the get-go, I never touched the user address in the kernel, as
> I thought it would require detailed knowledge of the user level
> locking strategy.

Yes, as with my initial patch (like you, I had a semaphore in the
kernel).  However, with two separate syscalls, you don't need to
allocate anything in the kernel, and still know nothing about the
userspace locking.

> Could you explain, why you need add_to_waitqueue and wait_for_waitqueue as 
> separate calls ? Is it for resolving a race conditions ?

Yep.  Exactly analogous to the kernel idiom (add to queue, check, sleep).
(Except the "waker dequeues us" microoptimization).

> One comment with respect to multiple wait queues and rwsems:
> Again it will allow you to do reader-pref and/or writer-pref, but not 
> something like FIFO, i.e. wake up a writer if first waiter or wake up all 
> readers if first ..... and so on.
> I don't know whether the latter is terrible important ...

(Aside: I'm still reluctant to implement strict FIFO locks until
someone shows a starvation case in the current locks).

I thought about this a little.  If we add a flags arg to the
sys_uwaitq_wake() and make sys_uwaitq_wait() return the flags used by
the waker, and sys_uwaitq_wake return 1 if it woke someone...

	#define UWAITQ_PASSING 1
    up:
	ret = sys_uwaitq_wake(cookie, UWAITQ_PASSING);
	if (ret < 0) return -1;
	/* No waiters actually waiting? */
	if (ret == 0) {
		lock->counter = 1;
		sys_uwaitq_wake(cookie, 0);
	}

    down:
	sys_uwaitq_queue(cookie);
	if (atomic dec counter to 0) {
		sys_uwaitq_queue(0); /* unqueue */
		return 0;
	}
	ret = sys_uwaitq_wait(cookie);
	if (ret < 0)
		return -1;
	if (ret == UWAITQ_PASSING)
		return 0;
	goto down; /* spin again */

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-12 17:17           ` Linus Torvalds
@ 2002-03-13  2:57             ` Rusty Russell
  0 siblings, 0 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-13  2:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: frankeh, linux-kernel

In message <Pine.LNX.4.33.0203120905280.19167-100000@penguin.transmeta.com> you
 write:
> 
> On Tue, 12 Mar 2002, Rusty Russell wrote:
> > > 
> > > You've convinced me.
> > 
> > Damn.  Because now I've been playing with a different approach.
> 
> I don't think your current patch is very useful.

I agree.  But your "Applied" EMail rushed me into posting it.

> It's obviously slower, and while it is an interesting approach for not 
> just lock generation but also for synchronization points, it doesn't seem 
> to actually _buy_ you anything. And since the cookie isn't guaranteed to 
> be unique, you can't actually use it as a synchronization point on its 
> own, but must always still have some shared memory location as a 
> confirmation for whatever the synchronization was.

My original cookie was 128 bits.  ie. unique.

> So: interesting approach, but in its current form pointless as far as I 
> can see.

Yeah, I'm not sure what it's useful for either.  But the code is out
there if someone gets inspired...

Thanks,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-12  7:20         ` Rusty Russell
  2002-03-12 14:56           ` Hubertus Franke
@ 2002-03-12 17:17           ` Linus Torvalds
  2002-03-13  2:57             ` Rusty Russell
  1 sibling, 1 reply; 84+ messages in thread
From: Linus Torvalds @ 2002-03-12 17:17 UTC (permalink / raw)
  To: Rusty Russell; +Cc: frankeh, linux-kernel

On Tue, 12 Mar 2002, Rusty Russell wrote:
> > 
> > You've convinced me.
> 
> Damn.  Because now I've been playing with a different approach.

I don't think your current patch is very useful.

It's obviously slower, and while it is an interesting approach for not 
just lock generation but also for synchronization points, it doesn't seem 
to actually _buy_ you anything. And since the cookie isn't guaranteed to 
be unique, you can't actually use it as a synchronization point on its 
own, but must always still have some shared memory location as a 
confirmation for whatever the synchronization was.

Finally, waitqueue's (to me) always were about two big points:

 - natural race condition avoidance through ordering:

	current->state = sleeping;
	add_wait_queue();
	if (test)
		schedule();

   which you basically emulate with the "zero cookie" thing.

 - ability to wait on multiple events concurrently

   which you don't take any advantage of at all.

So you kind of missed the second big point of waitqueues, so the end
result really isn't any more fundamentally powerful than the (faster)  
specialized semaphore system call as far as I can tell.

In short, I would argue that this approach, while interesting, doesn't 
actually _buy_ you anything. 

Now, if you want to wake on any of N events, then a "add_wait_queue*N +
wait" approach actually makes sense. But quite frankly, once you are there 
you should really instead do full events, and go away and work together 
with Ben on the aio stuff instead of this.

So: interesting approach, but in its current form pointless as far as I 
can see.

		Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-12  7:20         ` Rusty Russell
@ 2002-03-12 14:56           ` Hubertus Franke
  2002-03-13  4:02             ` Rusty Russell
  2002-03-12 17:17           ` Linus Torvalds
  1 sibling, 1 reply; 84+ messages in thread
From: Hubertus Franke @ 2002-03-12 14:56 UTC (permalink / raw)
  To: Rusty Russell, Linus Torvalds; +Cc: linux-kernel

On Tuesday 12 March 2002 02:20 am, Rusty Russell wrote:
> In message <Pine.LNX.4.33.0203111441120.17864-100000@penguin.transmeta.com>
> you
>
>  write:
> > On Sat, 9 Mar 2002, Rusty Russell wrote:
> > > > So I would suggest making the size (and thus alignment check) of
> > > > locks at least 8 bytes (and preferably 16). That makes it slightly
> > > > harder to put locks on the stack, but gcc does support stack
> > > > alignment, even if the cod
>
> e
>
> > > > sucks right now.
> > >
> > > Actually, I disagree.
> > >
> > > 1) We've left wiggle room in the second arg to sys_futex() to add
> > > rwsems later if required.
> > > 2) Someone needs to implement them and prove they are superior to the
> > >    pure userspace solution.
> >
> > You've convinced me.
>
> Damn.  Because now I've been playing with a different approach.
>
> If we basically export "add_to_waitqueue", "del_from_waitqueue",
> "wait_for_waitqueue" and "wakeup_waitqueue" syscalls, we have a more
> powerful interface: the kernel need not touch userspace addresses at
> all (no kmap/kunmap, no worried about spinlocks vs. rwlocks).
>
> The problem is that this fundamentally requires at least two syscalls
> in the slow path (add_to_waitqueue, try for lock, wait_for_waitqueue).
> My tests here show it's about 6% slower than the solution you accepted
> for tdbtorture (which means the slow path is significantly slower).  I
> can't imagine shaving that much more off it.
>
> There are variations on this: cookie could be replaces the page struct
> and the offset, ala futexes.
>
> Thoughts?
> Rusty.
>
> PS.  Kudos: it was Ben LaHaise's idea to export waitqueues, but I
>      didn't see how to do it until Paul M made a bad joke about two
>      syscalls....

Rusty, aren't you now going back to the design that I implemented after Ben's 
comments. 
>From the get-go, I never touched the user address in the kernel, as I thought 
it would require detailed knowledge of the user level locking strategy.

Could you explain, why you need add_to_waitqueue and wait_for_waitqueue as 
separate calls ? Is it for resolving a race conditions ?

One comment with respect to multiple wait queues and rwsems:
Again it will allow you to do reader-pref and/or writer-pref, but not 
something like FIFO, i.e. wake up a writer if first waiter or wake up all 
readers if first ..... and so on.
I don't know whether the latter is terrible important ...

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-09  0:03               ` H. Peter Anvin
  2002-03-09  1:15                 ` Alan Cox
@ 2002-03-12  9:35                 ` Helge Hafting
  1 sibling, 0 replies; 84+ messages in thread
From: Helge Hafting @ 2002-03-12  9:35 UTC (permalink / raw)
  To: H. Peter Anvin, linux-kernel

"H. Peter Anvin" wrote:

> 
> Okay, I'll say it and be impopular...
> 
> Perhaps it's time to drop i386 support?
> 
Wouldn't it be better to just separate it out?  I.e. make i386
an arch of its own, while most pc people use a "486 and up" arch?

The few who actually want 386 code won't loose it, and other developers
won't have to bother with 386 issues.  Then drop the 386 arch when it
dies from lack of maintenance...

Helge Hafting

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-11 22:45       ` Linus Torvalds
  2002-03-11 23:12         ` Hubertus Franke
@ 2002-03-12  7:20         ` Rusty Russell
  2002-03-12 14:56           ` Hubertus Franke
  2002-03-12 17:17           ` Linus Torvalds
  1 sibling, 2 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-12  7:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: frankeh, linux-kernel

In message <Pine.LNX.4.33.0203111441120.17864-100000@penguin.transmeta.com> you
 write:
> 
> On Sat, 9 Mar 2002, Rusty Russell wrote:
> > > 
> > > So I would suggest making the size (and thus alignment check) of locks at
> > > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > > locks on the stack, but gcc does support stack alignment, even if the cod
e
> > > sucks right now.
> > 
> > Actually, I disagree.
> > 
> > 1) We've left wiggle room in the second arg to sys_futex() to add rwsems
> >    later if required.
> > 2) Someone needs to implement them and prove they are superior to the
> >    pure userspace solution.
> 
> You've convinced me.

Damn.  Because now I've been playing with a different approach.

If we basically export "add_to_waitqueue", "del_from_waitqueue",
"wait_for_waitqueue" and "wakeup_waitqueue" syscalls, we have a more
powerful interface: the kernel need not touch userspace addresses at
all (no kmap/kunmap, no worried about spinlocks vs. rwlocks).

The problem is that this fundamentally requires at least two syscalls
in the slow path (add_to_waitqueue, try for lock, wait_for_waitqueue).
My tests here show it's about 6% slower than the solution you accepted
for tdbtorture (which means the slow path is significantly slower).  I
can't imagine shaving that much more off it.

There are variations on this: cookie could be replaces the page struct
and the offset, ala futexes.

Thoughts?
Rusty.

PS.  Kudos: it was Ben LaHaise's idea to export waitqueues, but I
     didn't see how to do it until Paul M made a bad joke about two
     syscalls....
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/arch/i386/kernel/entry.S working-2.5.6-hwq/arch/i386/kernel/entry.S
--- linux-2.5.6/arch/i386/kernel/entry.S	Fri Mar  8 14:49:11 2002
+++ working-2.5.6-hwq/arch/i386/kernel/entry.S	Mon Mar 11 15:50:20 2002
@@ -717,6 +717,9 @@
 	.long SYMBOL_NAME(sys_fremovexattr)
 	.long SYMBOL_NAME(sys_tkill)
 	.long SYMBOL_NAME(sys_sendfile64)
+	.long SYMBOL_NAME(sys_uwaitq_add)	/* 240 */
+	.long SYMBOL_NAME(sys_uwaitq_wait)
+	.long SYMBOL_NAME(sys_uwaitq_wake)
 
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long SYMBOL_NAME(sys_ni_syscall)
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-i386/mman.h working-2.5.6-hwq/include/asm-i386/mman.h
--- linux-2.5.6/include/asm-i386/mman.h	Wed Mar 15 12:45:20 2000
+++ working-2.5.6-hwq/include/asm-i386/mman.h	Mon Mar 11 15:58:59 2002
@@ -5,6 +5,7 @@
 #define PROT_WRITE	0x2		/* page can be written */
 #define PROT_EXEC	0x4		/* page can be executed */
 #define PROT_NONE	0x0		/* page can not be accessed */
+#define PROT_SEM	0x8		/* page can contain semaphores */
 
 #define MAP_SHARED	0x01		/* Share changes */
 #define MAP_PRIVATE	0x02		/* Changes are private */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-i386/unistd.h working-2.5.6-hwq/include/asm-i386/unistd.h
--- linux-2.5.6/include/asm-i386/unistd.h	Fri Mar  8 14:49:28 2002
+++ working-2.5.6-hwq/include/asm-i386/unistd.h	Mon Mar 11 14:52:07 2002
@@ -244,6 +244,9 @@
 #define __NR_fremovexattr	237
 #define __NR_tkill		238
 #define __NR_sendfile64		239
+#define __NR_uwaitq_add		240
+#define __NR_uwaitq_wait	241
+#define __NR_uwaitq_wake	242
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/linux/sched.h working-2.5.6-hwq/include/linux/sched.h
--- linux-2.5.6/include/linux/sched.h	Sat Mar  9 14:52:15 2002
+++ working-2.5.6-hwq/include/linux/sched.h	Tue Mar 12 13:54:15 2002
@@ -230,6 +230,11 @@
 
 typedef struct prio_array prio_array_t;
 
+struct uwaitq {
+	struct list_head list;
+	unsigned long /*long*/ cookie;
+};
+
 struct task_struct {
 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
 	struct thread_info *thread_info;
@@ -344,6 +349,8 @@
 
 /* journalling filesystem info */
 	void *journal_info;
+/* user wait queue info */
+	struct uwaitq uwaitq;
 };
 
 extern void __put_task_struct(struct task_struct *tsk);
@@ -508,6 +515,13 @@
 extern int kill_proc(pid_t, int, int);
 extern int do_sigaction(int, const struct k_sigaction *, struct k_sigaction *);
 extern int do_sigaltstack(const stack_t *, stack_t *, unsigned long);
+
+extern void __uwaitq_unqueue(struct uwaitq *q);
+static inline void uwaitq_unqueue(struct uwaitq *q)
+{
+	if (q->cookie)
+		__uwaitq_unqueue(q);
+}
 
 /*
  * Re-calculate pending state from the set of locally pending
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/Makefile working-2.5.6-hwq/kernel/Makefile
--- linux-2.5.6/kernel/Makefile	Wed Feb 20 17:56:17 2002
+++ working-2.5.6-hwq/kernel/Makefile	Mon Mar 11 14:50:49 2002
@@ -15,7 +15,7 @@
 obj-y     = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
 	    module.o exit.o itimer.o info.o time.o softirq.o resource.o \
 	    sysctl.o acct.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o context.o 
+	    signal.o sys.o kmod.o context.o uwaitq.o
 
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/exit.c working-2.5.6-hwq/kernel/exit.c
--- linux-2.5.6/kernel/exit.c	Wed Feb 20 17:57:21 2002
+++ working-2.5.6-hwq/kernel/exit.c	Tue Mar 12 12:57:09 2002
@@ -502,7 +502,7 @@
 	acct_process(code);
 #endif
 	__exit_mm(tsk);
-
+	uwaitq_unqueue(&tsk->uwaitq);
 	lock_kernel();
 	sem_exit();
 	__exit_files(tsk);
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/fork.c working-2.5.6-hwq/kernel/fork.c
--- linux-2.5.6/kernel/fork.c	Fri Mar  8 14:49:30 2002
+++ working-2.5.6-hwq/kernel/fork.c	Tue Mar 12 12:52:27 2002
@@ -720,6 +720,8 @@
 	if (retval)
 		goto bad_fork_cleanup_namespace;
 	p->semundo = NULL;
+	INIT_LIST_HEAD(&p->uwaitq.list);
+	p->uwaitq.cookie = 0;
 	
 	/* Our parent execution domain becomes current domain
 	   These must match for thread signalling to apply */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/uwaitq.c working-2.5.6-hwq/kernel/uwaitq.c
--- linux-2.5.6/kernel/uwaitq.c	Thu Jan  1 10:00:00 1970
+++ working-2.5.6-hwq/kernel/uwaitq.c	Tue Mar 12 14:06:58 2002
@@ -0,0 +1,153 @@
+/*
+ *  User-exported wait queues.
+ *  (C) Rusty Russell, IBM 2002
+ *
+ *  Thanks to Ben LaHaise for yelling "hashed waitqueues", and Paul
+ *  Mackerras for suggesting breaking it into multiple syscalls.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/init.h>
+#include <asm/uaccess.h>
+
+/* Theory is simple:
+   1) If we are on a hash list, we are waiting to be woken up.
+   2) Otherwise, if cookie is not zero, we have just been woken up.
+   3) Otherwise, we are not involved in any userspace wait queues.
+*/
+/* FIXME: This may be way too small. --RR */
+#define UWAITQ_HASHBITS 6
+
+/* We use this instead of a normal wait_queue_t, so we can wake only
+   the relevent ones (hashed queues may be shared) */
+struct uwaitq_head
+{
+	struct list_head list;
+	spinlock_t lock;
+} ____cacheline_aligned;
+static struct uwaitq_head uwait_queues[1<<UWAITQ_HASHBITS] __cacheline_aligned;
+
+static inline struct uwaitq_head *hash_uwaitq(unsigned long /*long*/ cookie)
+{
+	return &uwait_queues[cookie & ((1<<UWAITQ_HASHBITS)-1)];
+}
+
+/* struct uwaitq is always embedded in a task struct. */
+static inline struct task_struct *uwaitq_to_task(struct uwaitq *q)
+{
+	return (void *)q - offsetof(struct task_struct, uwaitq);
+}
+
+/* Add at end to avoid starvation */
+static inline void queue_me(struct uwaitq_head *head)
+{
+	spin_lock(&head->lock);
+	list_add_tail(&current->uwaitq.list, &head->list);
+	spin_unlock(&head->lock);
+}
+
+/* Must be holding spinlock */
+static void wake_by_cookie(struct uwaitq_head *head, unsigned long /*long*/ cookie)
+{
+	struct list_head *i;
+
+	list_for_each(i, &head->list) {
+		struct uwaitq *this = list_entry(i, struct uwaitq, list);
+
+		if (cookie == this->cookie) {
+			list_del_init(&this->list);
+			wmb();
+			wake_up_process(uwaitq_to_task(this));
+			return;
+		}
+	}
+}
+
+void __uwaitq_unqueue(struct uwaitq *q)
+{
+	struct uwaitq_head *head;
+
+	head = hash_uwaitq(q->cookie);
+	spin_lock(&head->lock);
+	/* If we have been woken, we must wake someone else since we
+	   are no longer interested. */
+	if (list_empty(&q->list))
+		wake_by_cookie(head, q->cookie);
+	else
+		list_del_init(&q->list);
+	spin_unlock(&head->lock);
+}
+
+asmlinkage int sys_uwaitq_wake(unsigned long /*long*/ cookie)
+{
+	struct uwaitq_head *head;
+	
+	head = hash_uwaitq(cookie);
+	spin_lock(&head->lock);
+	wake_by_cookie(head, cookie);
+	spin_unlock(&head->lock);
+	return 0;
+}
+
+/* Add to the wait queue: 0 cookie means delete. */
+asmlinkage int sys_uwaitq_add(unsigned long /*long*/ cookie)
+{
+	/* Unqueue if is/was queued. */
+	uwaitq_unqueue(&current->uwaitq);
+
+	/* Set cookie. */
+	current->uwaitq.cookie = cookie;
+
+	/* If non-zero, requeue */
+	if (cookie)
+		queue_me(hash_uwaitq(cookie));
+	return 0;
+}
+
+/* Wait to be chucked off the queue. */
+asmlinkage int sys_uwaitq_wait(void)
+{
+	int retval = 0;
+
+	set_current_state(TASK_INTERRUPTIBLE);
+	while (!list_empty(&current->uwaitq.list)) {
+		if (signal_pending(current)) {
+			retval = -EINTR;
+			__uwaitq_unqueue(&current->uwaitq);
+			goto out;
+		}
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+	/* Mark the fact that we were woken up. */
+	current->uwaitq.cookie = 0;
+ out:
+	set_current_state(TASK_RUNNING);
+	return retval;
+}
+
+static int __init init(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(uwait_queues); i++) {
+		INIT_LIST_HEAD(&uwait_queues[i].list);
+		spin_lock_init(&uwait_queues[i].lock);
+	}
+	return 0;
+}
+__initcall(init);

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-11 22:45       ` Linus Torvalds
@ 2002-03-11 23:12         ` Hubertus Franke
  2002-03-12  7:20         ` Rusty Russell
  1 sibling, 0 replies; 84+ messages in thread
From: Hubertus Franke @ 2002-03-11 23:12 UTC (permalink / raw)
  To: Linus Torvalds, Rusty Russell; +Cc: linux-kernel

On Monday 11 March 2002 05:45 pm, Linus Torvalds wrote:
> On Sat, 9 Mar 2002, Rusty Russell wrote:
> > > So I would suggest making the size (and thus alignment check) of locks
> > > at least 8 bytes (and preferably 16). That makes it slightly harder to
> > > put locks on the stack, but gcc does support stack alignment, even if
> > > the code sucks right now.
> >
> > Actually, I disagree.
> >
> > 1) We've left wiggle room in the second arg to sys_futex() to add rwsems
> >    later if required.
> > 2) Someone needs to implement them and prove they are superior to the
> >    pure userspace solution.
>
> You've convinced me.
>
> Considering how long people argued about dubious cycle measurements on the
> rwsem implementation, and where the crrent one actually uses a spinlock
> for exclusion on the fast path, the kernel lock really probably doesn't
> need to be expanded, and as there is provable overhead to the expansion,
> I'll just agree with you.
>
> Applied.
>
> 		Linus

Great, now, that this is settled, let's go back to the compare and swap 
issues.

As far as I can tell, we need compare and swap on a single queue 
implementation for rwsems. Again most architectures provide something like
that today. As for those which don't, why not provide either of the following 
approaches

(a) spinlock around the update code in the kernel. We could provide multiple
spinlocks to avoid potential collision.
(b) only provide futex in the kernel and user library approach using 2 queues 
as shown by Rusty. The lack of cmpxchg support would be exported by the  
futex_region call.

- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-09  4:49     ` Rusty Russell
@ 2002-03-11 22:45       ` Linus Torvalds
  2002-03-11 23:12         ` Hubertus Franke
  2002-03-12  7:20         ` Rusty Russell
  0 siblings, 2 replies; 84+ messages in thread
From: Linus Torvalds @ 2002-03-11 22:45 UTC (permalink / raw)
  To: Rusty Russell; +Cc: frankeh, linux-kernel


On Sat, 9 Mar 2002, Rusty Russell wrote:
> > 
> > So I would suggest making the size (and thus alignment check) of locks at
> > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > locks on the stack, but gcc does support stack alignment, even if the code
> > sucks right now.
> 
> Actually, I disagree.
> 
> 1) We've left wiggle room in the second arg to sys_futex() to add rwsems
>    later if required.
> 2) Someone needs to implement them and prove they are superior to the
>    pure userspace solution.

You've convinced me.

Considering how long people argued about dubious cycle measurements on the 
rwsem implementation, and where the crrent one actually uses a spinlock 
for exclusion on the fast path, the kernel lock really probably doesn't 
need to be expanded, and as there is provable overhead to the expansion, 
I'll just agree with you.

Applied.

		Linus


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-10 19:41                   ` Linus Torvalds
@ 2002-03-11 20:49                     ` Pavel Machek
  0 siblings, 0 replies; 84+ messages in thread
From: Pavel Machek @ 2002-03-11 20:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Hi!

> >So if anything its just not worth the effort of breaking the 386 setup
> >either 8). 386 SMP is a different issue but I don't see any lunatics doing
> >a 386 based sequent port thankfully.
> 
> Since the only person that comes to mind that would be crazy enough to
> even _try_ to use Linux on 386-SMP is you, Alan, I'm really relieved to
> hear you say that ;)
> 
> And no, it's not worth discontinuing i386 support.  It just isn't
> painful enough to maintain. 
> 
> Note that the i386 has _long_ been a "stepchild", though: because of the
> lack of WP, the kernel simply doesn't do threaded MM correctly on a 386. 
> Never has, and never will. 
> 
> However, the known "incorrect" case is so obscure that it's not even an
> issue - although I suspect that it means that you should not let
> untrusted users run on a i386 server machine that contains any sensitive
> data.  I could cerrtainly come up with exploits that would work at least
> in theory (whether they are workable in practice I don't know). 

That should mean at least warning during bootup. Checking for WP
bit... message does not suggest that WP not present means security
hole.

--- clean.2.5//arch/i386/mm/init.c	Sun Mar 10 20:06:31 2002
+++ linux/arch/i386/mm/init.c	Mon Mar 11 21:49:14 2002
@@ -383,7 +383,7 @@
 	local_flush_tlb();
 
 	if (!boot_cpu_data.wp_works_ok) {
-		printk("No.\n");
+		printk("No (that's security hole).\n");
 #ifdef CONFIG_X86_WP_WORKS_OK
 		panic("This kernel doesn't support CPU's with broken WP. Recompile it for a 386!");
 #endif

									Pavel
-- 
(about SSSCA) "I don't say this lightly.  However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-09  2:12                 ` Linus Torvalds
@ 2002-03-11 14:14                   ` Hubertus Franke
  0 siblings, 0 replies; 84+ messages in thread
From: Hubertus Franke @ 2002-03-11 14:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel

On Friday 08 March 2002 09:12 pm, Linus Torvalds wrote:
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > > The point being that the difference between a "decl" and a "lock ; 
> > > decl" is about 1:12 or so in performance.
> >
> > I am no expert in architecture, but if its done through the cache
> > coherency mechanism, the overhead shouldn't be 12:1. You simply mark the
> > cache line as part of you instruction to avoid a cache line transfer. How
> > can that be 12 times slower.  .. Ready to be educated....
>
> A lock in a SMP system also needs to synchronize the instruction stream,
> and not let stores move "out" from the locked region.
>
> On a UP system, this all happens automatically (well, getting it to happen
> right is obviously one of the big issues in an out-of-order CPU core, but
> it's a very fundamental part of the core, so it's "free" in the sense that
> if it isn't done, the CPU simply doesn't work).
>
> On SMP, it's a memory barrier. This is why a "lock ; decl" is more
> expensive than a "decl" - it's the implied memory ordering constraints (on
> other architectures they are explicit). On an intel CPU, this basically
> means that the pipeline is drained, so a locked instruction takes roughly
> 12 cycles on a PPro core (AMD's K7 core seems to be rather more graceful
> about this one). I haven't timed a P4 lately, I think it's worse.
>
> Other architectures do the memory ordering explicitly, and some are
> better, some are worse. But it always costs you _something_.
>
> 		Linus


Sure, not contending that. Right now I think our focus should be to get the
right functionality out and address people's concerns.
Improvements, as you suggested, are orthogonal and can always be put
in later.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-10 20:28                       ` Martin J. Bligh
@ 2002-03-10 21:05                         ` Alan Cox
  0 siblings, 0 replies; 84+ messages in thread
From: Alan Cox @ 2002-03-10 21:05 UTC (permalink / raw)
  To: Martin.Bligh; +Cc: Alan Cox, H. Peter Anvin, linux-kernel

> IIRC the half height cabinets run of single phase, 240V. I'm
> sure I can arrange to have a 386 half height system shipped to
> you ;-) ;-) 
> 
> PS. Have fun with that VME bus ...

Actually we have VME code for Linux, including a PCI/VME bridge. Its not
in the base distro as the author never needed to tidy it up to make it
work for arbitary PCI/VME bridges.

I think you should ship the machine to hpa though, then he can run
kernel.org on it 8)


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-10 19:58                   ` Martin J. Bligh
@ 2002-03-10 20:40                     ` Alan Cox
  2002-03-10 20:28                       ` Martin J. Bligh
  0 siblings, 1 reply; 84+ messages in thread
From: Alan Cox @ 2002-03-10 20:40 UTC (permalink / raw)
  To: Martin.Bligh; +Cc: Alan Cox, H. Peter Anvin, linux-kernel

> two ago, asking what the internal structure of a Sequent Symmetry 
> was, so that they could get Linux running on it. OK, so they gave 
> up when I gave them an outline of what was in there, but ... ;-)

I feel safe. A long time ago I looked at a symettry but 3 phase power was
harder to arrange 8)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-10 20:40                     ` Alan Cox
@ 2002-03-10 20:28                       ` Martin J. Bligh
  2002-03-10 21:05                         ` Alan Cox
  0 siblings, 1 reply; 84+ messages in thread
From: Martin J. Bligh @ 2002-03-10 20:28 UTC (permalink / raw)
  To: Alan Cox; +Cc: H. Peter Anvin, linux-kernel

>> two ago, asking what the internal structure of a Sequent Symmetry 
>> was, so that they could get Linux running on it. OK, so they gave 
>> up when I gave them an outline of what was in there, but ... ;-)
> 
> I feel safe. A long time ago I looked at a symettry but 3 phase 
> power was harder to arrange 8)

IIRC the half height cabinets run of single phase, 240V. I'm
sure I can arrange to have a 386 half height system shipped to
you ;-) ;-) 

M.

PS. Have fun with that VME bus ...


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-09  1:15                 ` Alan Cox
  2002-03-10 19:41                   ` Linus Torvalds
@ 2002-03-10 19:58                   ` Martin J. Bligh
  2002-03-10 20:40                     ` Alan Cox
  2002-03-13  7:40                   ` Rusty Russell
  2 siblings, 1 reply; 84+ messages in thread
From: Martin J. Bligh @ 2002-03-10 19:58 UTC (permalink / raw)
  To: Alan Cox, H. Peter Anvin; +Cc: linux-kernel

> So if anything its just not worth the effort of breaking the 
> 386 setup either 8). 386 SMP is a different issue but I don't 
> see any lunatics doing a 386 based sequent port thankfully.

Hey, don't count it out ... someone was emailing me a week or
two ago, asking what the internal structure of a Sequent Symmetry 
was, so that they could get Linux running on it. OK, so they gave 
up when I gave them an outline of what was in there, but ... ;-)

M.

PS. No I'm not suggesting we should support 386 SMP ;-)


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-09  1:15                 ` Alan Cox
@ 2002-03-10 19:41                   ` Linus Torvalds
  2002-03-11 20:49                     ` Pavel Machek
  2002-03-10 19:58                   ` Martin J. Bligh
  2002-03-13  7:40                   ` Rusty Russell
  2 siblings, 1 reply; 84+ messages in thread
From: Linus Torvalds @ 2002-03-10 19:41 UTC (permalink / raw)
  To: linux-kernel

In article <E16jVSZ-0008FH-00@the-village.bc.nu>,
Alan Cox  <alan@lxorguk.ukuu.org.uk> wrote:
>
>So if anything its just not worth the effort of breaking the 386 setup
>either 8). 386 SMP is a different issue but I don't see any lunatics doing
>a 386 based sequent port thankfully.

Since the only person that comes to mind that would be crazy enough to
even _try_ to use Linux on 386-SMP is you, Alan, I'm really relieved to
hear you say that ;)

And no, it's not worth discontinuing i386 support.  It just isn't
painful enough to maintain. 

Note that the i386 has _long_ been a "stepchild", though: because of the
lack of WP, the kernel simply doesn't do threaded MM correctly on a 386. 
Never has, and never will. 

However, the known "incorrect" case is so obscure that it's not even an
issue - although I suspect that it means that you should not let
untrusted users run on a i386 server machine that contains any sensitive
data.  I could cerrtainly come up with exploits that would work at least
in theory (whether they are workable in practice I don't know). 

Using i386's for network servers is fine, of course.  Just don't use
them for cpu farms (not that I think anybody is - it takes quite a big
farm of i386 machines to equal even _one_ PII ;)

		Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-05  7:01 Rusty Russell
  2002-03-05 22:39 ` Davide Libenzi
  2002-03-08 18:07 ` Linus Torvalds
@ 2002-03-09  4:51 ` Rusty Russell
  2 siblings, 0 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-09  4:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, 8 Mar 2002 10:07:51 -0800 (PST)
Linus Torvalds <torvalds@transmeta.com> wrote:
> This doesn't work on highmem machines - doing the conversion from "<struct
> page, offset>" to "page_address(page)+offset" is simply not legal (not
> even for pure hashing purposes - page_address() changes as you kmap it).

Thanks for the catch.  Am not blessed (cursed?) with highmem here.

> You need to keep the <struct page,offset> tuple in that format, and no
> other. And when you actually touch the page, you need to do the
> kmap()/kunmap() (and you must not keep it mapped while you sleep, because
> that might trivially make the kernel run out of virtual mappings).

Ick: not allowing for one virtual mapping per process is pretty horrible.
Still, pretty easy to fix.

Also updated syscall numbers for 2.5.6, and changed FUTEX_UP/DOWN definitions
to be more logical for future expansions (eg. r/w).

Rusty.
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/arch/i386/kernel/entry.S working-2.5.6-futex-6/arch/i386/kernel/entry.S
--- linux-2.5.6/arch/i386/kernel/entry.S	Fri Mar  8 14:49:11 2002
+++ working-2.5.6-futex-6/arch/i386/kernel/entry.S	Sat Mar  9 14:09:10 2002
@@ -717,6 +717,7 @@
 	.long SYMBOL_NAME(sys_fremovexattr)
 	.long SYMBOL_NAME(sys_tkill)
 	.long SYMBOL_NAME(sys_sendfile64)
+	.long SYMBOL_NAME(sys_futex)		/* 240 */
 
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long SYMBOL_NAME(sys_ni_syscall)
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/arch/ppc/kernel/misc.S working-2.5.6-futex-6/arch/ppc/kernel/misc.S
--- linux-2.5.6/arch/ppc/kernel/misc.S	Fri Mar  8 14:49:11 2002
+++ working-2.5.6-futex-6/arch/ppc/kernel/misc.S	Sat Mar  9 14:08:24 2002
@@ -1289,6 +1289,7 @@
 	.long sys_removexattr
 	.long sys_lremovexattr
 	.long sys_fremovexattr	/* 220 */
+	.long sys_futex
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long sys_ni_syscall
 	.endr
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-i386/mman.h working-2.5.6-futex-6/include/asm-i386/mman.h
--- linux-2.5.6/include/asm-i386/mman.h	Wed Mar 15 12:45:20 2000
+++ working-2.5.6-futex-6/include/asm-i386/mman.h	Sat Mar  9 14:32:43 2002
@@ -4,6 +4,7 @@
 #define PROT_READ	0x1		/* page can be read */
 #define PROT_WRITE	0x2		/* page can be written */
 #define PROT_EXEC	0x4		/* page can be executed */
+#define PROT_SEM	0x8		/* page may be used for atomic ops */
 #define PROT_NONE	0x0		/* page can not be accessed */
 
 #define MAP_SHARED	0x01		/* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-i386/unistd.h working-2.5.6-futex-6/include/asm-i386/unistd.h
--- linux-2.5.6/include/asm-i386/unistd.h	Fri Mar  8 14:49:28 2002
+++ working-2.5.6-futex-6/include/asm-i386/unistd.h	Sat Mar  9 14:09:31 2002
@@ -244,6 +244,7 @@
 #define __NR_fremovexattr	237
 #define __NR_tkill		238
 #define __NR_sendfile64		239
+#define __NR_futex		240
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-ppc/mman.h working-2.5.6-futex-6/include/asm-ppc/mman.h
--- linux-2.5.6/include/asm-ppc/mman.h	Tue May 22 08:02:06 2001
+++ working-2.5.6-futex-6/include/asm-ppc/mman.h	Sat Mar  9 14:32:39 2002
@@ -7,6 +7,7 @@
 #define PROT_READ	0x1		/* page can be read */
 #define PROT_WRITE	0x2		/* page can be written */
 #define PROT_EXEC	0x4		/* page can be executed */
+#define PROT_SEM	0x8		/* page may be used for atomic ops */
 #define PROT_NONE	0x0		/* page can not be accessed */
 
 #define MAP_SHARED	0x01		/* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/asm-ppc/unistd.h working-2.5.6-futex-6/include/asm-ppc/unistd.h
--- linux-2.5.6/include/asm-ppc/unistd.h	Wed Feb 20 17:57:18 2002
+++ working-2.5.6-futex-6/include/asm-ppc/unistd.h	Sat Mar  9 14:08:27 2002
@@ -228,6 +228,7 @@
 #define __NR_removexattr	218
 #define __NR_lremovexattr	219
 #define __NR_fremovexattr	220
+#define __NR_futex		221
 
 #define __NR(n)	#n
 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/linux/futex.h working-2.5.6-futex-6/include/linux/futex.h
--- linux-2.5.6/include/linux/futex.h	Thu Jan  1 10:00:00 1970
+++ working-2.5.6-futex-6/include/linux/futex.h	Sat Mar  9 14:08:24 2002
@@ -0,0 +1,8 @@
+#ifndef _LINUX_FUTEX_H
+#define _LINUX_FUTEX_H
+
+/* Second argument to futex syscall */
+#define FUTEX_UP (0)
+#define FUTEX_DOWN (1)
+
+#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/linux/hash.h working-2.5.6-futex-6/include/linux/hash.h
--- linux-2.5.6/include/linux/hash.h	Thu Jan  1 10:00:00 1970
+++ working-2.5.6-futex-6/include/linux/hash.h	Sat Mar  9 14:08:27 2002
@@ -0,0 +1,58 @@
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for a long.
+   (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+#if BITS_PER_LONG == 32
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
+#elif BITS_PER_LONG == 64
+/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#else
+#error Define GOLDEN_RATIO_PRIME for your wordsize.
+#endif
+
+static inline unsigned long hash_long(unsigned long val, unsigned int bits)
+{
+	unsigned long hash = val;
+
+#if BITS_PER_LONG == 64
+	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
+	unsigned long n = hash;
+	n <<= 18;
+	hash -= n;
+	n <<= 33;
+	hash -= n;
+	n <<= 3;
+	hash += n;
+	n <<= 3;
+	hash -= n;
+	n <<= 4;
+	hash += n;
+	n <<= 2;
+	hash += n;
+#else
+	/* On some cpus multiply is faster, on others gcc will do shifts */
+	hash *= GOLDEN_RATIO_PRIME;
+#endif
+
+	/* High bits are more random, so use them. */
+	return hash >> (BITS_PER_LONG - bits);
+}
+	
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+	return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/include/linux/mmzone.h working-2.5.6-futex-6/include/linux/mmzone.h
--- linux-2.5.6/include/linux/mmzone.h	Fri Mar  1 22:58:34 2002
+++ working-2.5.6-futex-6/include/linux/mmzone.h	Sat Mar  9 14:52:15 2002
@@ -51,8 +51,7 @@
 	/*
 	 * wait_table		-- the array holding the hash table
 	 * wait_table_size	-- the size of the hash table array
-	 * wait_table_shift	-- wait_table_size
-	 * 				== BITS_PER_LONG (1 << wait_table_bits)
+	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
 	 *
 	 * The purpose of all these is to keep track of the people
 	 * waiting for a page to become available and make them
@@ -75,7 +74,7 @@
 	 */
 	wait_queue_head_t	* wait_table;
 	unsigned long		wait_table_size;
-	unsigned long		wait_table_shift;
+	unsigned long		wait_table_bits;
 
 	/*
 	 * Discontig memory support fields.
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/Makefile working-2.5.6-futex-6/kernel/Makefile
--- linux-2.5.6/kernel/Makefile	Wed Feb 20 17:56:17 2002
+++ working-2.5.6-futex-6/kernel/Makefile	Sat Mar  9 14:08:27 2002
@@ -15,7 +15,7 @@
 obj-y     = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
 	    module.o exit.o itimer.o info.o time.o softirq.o resource.o \
 	    sysctl.o acct.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o context.o 
+	    signal.o sys.o kmod.o context.o futex.o
 
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/kernel/futex.c working-2.5.6-futex-6/kernel/futex.c
--- linux-2.5.6/kernel/futex.c	Thu Jan  1 10:00:00 1970
+++ working-2.5.6-futex-6/kernel/futex.c	Sat Mar  9 15:03:13 2002
@@ -0,0 +1,232 @@
+/*
+ *  Fast Userspace Mutexes (which I call "Futexes!").
+ *  (C) Rusty Russell, IBM 2002
+ *
+ *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
+ *  enough at me, Linus for the original (flawed) idea, Matthew
+ *  Kirkwood for proof-of-concept implementation.
+ *
+ *  "The futexes are also cursed."
+ *  "But they come in a choice of three flavours!"
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/futex.h>
+#include <linux/highmem.h>
+#include <asm/atomic.h>
+
+/* These mutexes are a very simple counter: the winner is the one who
+   decrements from 1 to 0.  The counter starts at 1 when the lock is
+   free.  A value other than 0 or 1 means someone may be sleeping.
+   This is simple enough to work on all architectures, but has the
+   problem that if we never "up" the semaphore it could eventually
+   wrap around. */
+
+/* FIXME: This may be way too small. --RR */
+#define FUTEX_HASHBITS 6
+
+/* We use this instead of a normal wait_queue_t, so we can wake only
+   the relevent ones (hashed queues may be shared) */
+struct futex_q {
+	struct list_head list;
+	struct task_struct *task;
+	/* Page struct and offset within it. */
+	struct page *page;
+	unsigned int offset;
+};
+
+/* The key for the hash is the address + index + offset within page */
+static struct list_head futex_queues[1<<FUTEX_HASHBITS];
+static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+
+static inline struct list_head *hash_futex(struct page *page,
+					   unsigned long offset)
+{
+	unsigned long h;
+
+	/* struct page is shared, so we can hash on its address */
+	h = (unsigned long)page + offset;
+	return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
+}
+
+static inline void wake_one_waiter(struct list_head *head,
+				   struct page *page,
+				   unsigned int offset)
+{
+	struct list_head *i;
+
+	spin_lock(&futex_lock);
+	list_for_each(i, head) {
+		struct futex_q *this = list_entry(i, struct futex_q, list);
+
+		if (this->page == page && this->offset == offset) {
+			wake_up_process(this->task);
+			break;
+		}
+	}
+	spin_unlock(&futex_lock);
+}
+
+/* Add at end to avoid starvation */
+static inline void queue_me(struct list_head *head,
+			    struct futex_q *q,
+			    struct page *page,
+			    unsigned int offset)
+{
+	q->task = current;
+	q->page = page;
+	q->offset = offset;
+
+	spin_lock(&futex_lock);
+	list_add_tail(&q->list, head);
+	spin_unlock(&futex_lock);
+}
+
+static inline void unqueue_me(struct futex_q *q)
+{
+	spin_lock(&futex_lock);
+	list_del(&q->list);
+	spin_unlock(&futex_lock);
+}
+
+/* Get kernel address of the user page and pin it. */
+static struct page *pin_page(unsigned long page_start)
+{
+	struct mm_struct *mm = current->mm;
+	struct page *page;
+	int err;
+
+	down_read(&mm->mmap_sem);
+	err = get_user_pages(current, current->mm, page_start,
+			     1 /* one page */,
+			     1 /* writable */,
+			     0 /* don't force */,
+			     &page,
+			     NULL /* don't return vmas */);
+	up_read(&mm->mmap_sem);
+
+	if (err < 0)
+		return ERR_PTR(err);
+	return page;
+}
+
+/* Try to decrement the user count to zero. */
+static int decrement_to_zero(struct page *page, unsigned int offset)
+{
+	atomic_t *count;
+	int ret = 0;
+
+	count = kmap(page) + offset;
+	/* If we take the semaphore from 1 to 0, it's ours.  If it's
+           zero, decrement anyway, to indicate we are waiting.  If
+           it's negative, don't decrement so we don't wrap... */
+	if (atomic_read(count) >= 0 && atomic_dec_and_test(count))
+		ret = 1;
+	kunmap(page);
+	return ret;
+}
+
+/* Simplified from arch/ppc/kernel/semaphore.c: Paul M. is a genius. */
+static int futex_down(struct list_head *head, struct page *page, int offset)
+{
+	int retval = 0;
+	struct futex_q q;
+
+	current->state = TASK_INTERRUPTIBLE;
+	queue_me(head, &q, page, offset);
+
+	while (!decrement_to_zero(page, offset)) {
+		if (signal_pending(current)) {
+			retval = -EINTR;
+			break;
+		}
+		schedule();
+		current->state = TASK_INTERRUPTIBLE;
+	}
+	current->state = TASK_RUNNING;
+	unqueue_me(&q);
+	/* If we were signalled, we might have just been woken: we
+	   must wake another one.  Otherwise we need to wake someone
+	   else (if they are waiting) so they drop the count below 0,
+	   and when we "up" in userspace, we know there is a
+	   waiter. */
+	wake_one_waiter(head, page, offset);
+	return retval;
+}
+
+static int futex_up(struct list_head *head, struct page *page, int offset)
+{
+	atomic_t *count;
+
+	count = kmap(page) + offset;
+	atomic_set(count, 1);
+	smp_wmb();
+	kunmap(page);
+	wake_one_waiter(head, page, offset);
+	return 0;
+}
+
+asmlinkage int sys_futex(void *uaddr, int op)
+{
+	int ret;
+	unsigned long pos_in_page;
+	struct list_head *head;
+	struct page *page;
+
+	pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
+
+	/* Must be "naturally" aligned, and not on page boundary. */
+	if ((pos_in_page % __alignof__(atomic_t)) != 0
+	    || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
+		return -EINVAL;
+
+	/* Simpler if it doesn't vanish underneath us. */
+	page = pin_page((unsigned long)uaddr - pos_in_page);
+	if (IS_ERR(page))
+		return PTR_ERR(page);
+
+	head = hash_futex(page, pos_in_page);
+	switch (op) {
+	case FUTEX_UP:
+		ret = futex_up(head, page, pos_in_page);
+		break;
+	case FUTEX_DOWN:
+		ret = futex_down(head, page, pos_in_page);
+		break;
+	/* Add other lock types here... */
+	default:
+		ret = -EINVAL;
+	}
+	put_page(page);
+
+	return ret;
+}
+
+static int __init init(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
+		INIT_LIST_HEAD(&futex_queues[i]);
+	return 0;
+}
+__initcall(init);
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/mm/filemap.c working-2.5.6-futex-6/mm/filemap.c
--- linux-2.5.6/mm/filemap.c	Fri Mar  8 14:49:30 2002
+++ working-2.5.6-futex-6/mm/filemap.c	Sat Mar  9 14:08:27 2002
@@ -25,6 +25,7 @@
 #include <linux/iobuf.h>
 #include <linux/compiler.h>
 #include <linux/fs.h>
+#include <linux/hash.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -773,32 +774,8 @@
 static inline wait_queue_head_t *page_waitqueue(struct page *page)
 {
 	const zone_t *zone = page_zone(page);
-	wait_queue_head_t *wait = zone->wait_table;
-	unsigned long hash = (unsigned long)page;
 
-#if BITS_PER_LONG == 64
-	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
-	unsigned long n = hash;
-	n <<= 18;
-	hash -= n;
-	n <<= 33;
-	hash -= n;
-	n <<= 3;
-	hash += n;
-	n <<= 3;
-	hash -= n;
-	n <<= 4;
-	hash += n;
-	n <<= 2;
-	hash += n;
-#else
-	/* On some cpus multiply is faster, on others gcc will do shifts */
-	hash *= GOLDEN_RATIO_PRIME;
-#endif
-
-	hash >>= zone->wait_table_shift;
-
-	return &wait[hash];
+	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
 /* 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/mm/mprotect.c working-2.5.6-futex-6/mm/mprotect.c
--- linux-2.5.6/mm/mprotect.c	Wed Feb 20 17:57:21 2002
+++ working-2.5.6-futex-6/mm/mprotect.c	Sat Mar  9 14:32:32 2002
@@ -280,7 +280,7 @@
 	end = start + len;
 	if (end < start)
 		return -EINVAL;
-	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC))
+	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
 		return -EINVAL;
 	if (end == start)
 		return 0;
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6/mm/page_alloc.c working-2.5.6-futex-6/mm/page_alloc.c
--- linux-2.5.6/mm/page_alloc.c	Fri Mar  8 14:49:30 2002
+++ working-2.5.6-futex-6/mm/page_alloc.c	Sat Mar  9 14:08:27 2002
@@ -776,8 +776,8 @@
 		 * per zone.
 		 */
 		zone->wait_table_size = wait_table_size(size);
-		zone->wait_table_shift =
-			BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
+		zone->wait_table_bits =
+			wait_table_bits(zone->wait_table_size);
 		zone->wait_table = (wait_queue_head_t *)
 			alloc_bootmem_node(pgdat, zone->wait_table_size
 						* sizeof(wait_queue_head_t));

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 19:03   ` Hubertus Franke
  2002-03-08 19:22     ` Linus Torvalds
@ 2002-03-09  4:49     ` Rusty Russell
  2002-03-11 22:45       ` Linus Torvalds
  1 sibling, 1 reply; 84+ messages in thread
From: Rusty Russell @ 2002-03-09  4:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: frankeh, linux-kernel

On Fri, 8 Mar 2002 11:22:20 -0800 (PST)
Linus Torvalds <torvalds@transmeta.com> wrote:
> First off, I have to say that I really like the current patch by Rusty. 
> The hashing approach is very clean, and it all seems quite good. As to 
> specific points:

Thanks.  [Aside: VIRO PLEASE TAKE NOTE.]

> > (I) the fairness issues that have been raised.
> >     do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR 
> >     or you don't care about fairness and starvation
> 
> I don't think fairness and starvation is that big of a deal for
> semaphores, usually being unfair in these things tends to just improve
> performance through better cache locality with no real downside. That
> said, I think the option should be open (which it does seem to be).

1) Unfairness definitely helps performance (~30% faster on tdbtorture and
   IIRC on Hubertus' benchmark too).
2) Absolute fairness depends on hardware anyway.
3) I was not able to produce any evidence of STARVATION (which I think
   we all agree *is* an issue).

So I'd say stick with the minimalistic, fast, unfair solution.

> For rwlocks, my personal preference is the fifo-fair-preference (unlike
> semaphore fairness, I have actually seen loads where read- vs
> write-preference really is unacceptable). This might be a point where we 
> give users the choice.

Yes.  See post on "furwocks": fair-preference rw locks implemented in
userspace on top of the futexes.

> I do think we should make the lock bigger - I worry that atomic_t simply
> won't be enough for things like fair rwlocks, which might want a
> "cmpxchg8b" on x86. 
> 
> So I would suggest making the size (and thus alignment check) of locks at
> least 8 bytes (and preferably 16). That makes it slightly harder to put
> locks on the stack, but gcc does support stack alignment, even if the code
> sucks right now.

Actually, I disagree.

1) We've left wiggle room in the second arg to sys_futex() to add rwsems
   later if required.
2) Someone needs to implement them and prove they are superior to the
   pure userspace solution.

The most gain will be from a very briefly held lock that is 99.99% read.
But if it's 10%, it's not worth it: we need numbers.

Rusty.
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 23:56               ` Hubertus Franke
@ 2002-03-09  2:12                 ` Linus Torvalds
  2002-03-11 14:14                   ` Hubertus Franke
  0 siblings, 1 reply; 84+ messages in thread
From: Linus Torvalds @ 2002-03-09  2:12 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Rusty Russell, linux-kernel

On Fri, 8 Mar 2002, Hubertus Franke wrote:
> >
> > The point being that the difference between a "decl" and a "lock ;  decl"
> > is about 1:12 or so in performance.
>
> I am no expert in architecture, but if its done through the cache coherency 
> mechanism, the overhead shouldn't be 12:1. You simply mark the cache line as 
> part of you instruction to avoid a cache line transfer. How can that be 12 
> times slower.  .. Ready to be educated....

A lock in a SMP system also needs to synchronize the instruction stream,
and not let stores move "out" from the locked region.

On a UP system, this all happens automatically (well, getting it to happen
right is obviously one of the big issues in an out-of-order CPU core, but
it's a very fundamental part of the core, so it's "free" in the sense that
if it isn't done, the CPU simply doesn't work).

On SMP, it's a memory barrier. This is why a "lock ; decl" is more
expensive than a "decl" - it's the implied memory ordering constraints (on
other architectures they are explicit). On an intel CPU, this basically
means that the pipeline is drained, so a locked instruction takes roughly
12 cycles on a PPro core (AMD's K7 core seems to be rather more graceful
about this one). I haven't timed a P4 lately, I think it's worse.

Other architectures do the memory ordering explicitly, and some are
better, some are worse. But it always costs you _something_.

		Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 23:47           ` george anzinger
  2002-03-09  1:11             ` Alan Cox
@ 2002-03-09  1:20             ` Linus Torvalds
  1 sibling, 0 replies; 84+ messages in thread
From: Linus Torvalds @ 2002-03-09  1:20 UTC (permalink / raw)
  To: george anzinger; +Cc: frankeh, Rusty Russell, linux-kernel

On Fri, 8 Mar 2002, george anzinger wrote:
> 
> Uh, just the pid would do.  Maybe reserve a bit to indicate
> contention, but surly one word would be enough.

Not really.

The pid would mean that anybody who gets a lock would have to have its pid
available (remember the fast-path is what we really care about), but
there's also a fundamental race between getting the lock and writing the
pid to the second word of the lock that you just won't avoid.

And that's assuming you only use the semaphores for pure mutual exclusion. 
That is the normal behaviour, but some people use semaphores for other 
things (ie "N people can be active inside this region" where N != 1).

And then you have to realize that doing the same for readers in a rwlock 
is even worse.

In short, it just cannot be done quickly and simply, and for many cases it 
cannot be done at all.

		Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-09  0:03               ` H. Peter Anvin
@ 2002-03-09  1:15                 ` Alan Cox
  2002-03-10 19:41                   ` Linus Torvalds
                                     ` (2 more replies)
  2002-03-12  9:35                 ` Helge Hafting
  1 sibling, 3 replies; 84+ messages in thread
From: Alan Cox @ 2002-03-09  1:15 UTC (permalink / raw)
  To: H. Peter Anvin; +Cc: linux-kernel

> It seems to me that the i386 support has been around mostly on a
> "until we have a reason to do otherwise" basis, but perhaps this is
> the reason?
> 
> There certainly are enough little, nagging reasons... CMPXCHG, BSWAP,
> and especially WP...

They don't really arise in most normal situations. Bswap is trivial to
the extreme. Cmpxchg only comes up on SMP boxes.  Right now the one big
hit is cmpxchg8 if you use direct rendering. And quite frankly if you
use the direct rendering infrastructure on a 386 its going to be a teeny
bit slow anyway 8)

So if anything its just not worth the effort of breaking the 386 setup
either 8). 386 SMP is a different issue but I don't see any lunatics doing
a 386 based sequent port thankfully.

Alan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 23:47           ` george anzinger
@ 2002-03-09  1:11             ` Alan Cox
  2002-03-09  1:20             ` Linus Torvalds
  1 sibling, 0 replies; 84+ messages in thread
From: Alan Cox @ 2002-03-09  1:11 UTC (permalink / raw)
  To: george anzinger; +Cc: frankeh, Linus Torvalds, Rusty Russell, linux-kernel

> > cmpxchg-doubleword, still its not fool proof.
> 
> Uh, just the pid would do.  Maybe reserve a bit to indicate contention,
> but surly one word would be enough.

Make it a dword, the 16bit pid is beginning to look strained on big
boxes

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 23:41             ` Linus Torvalds
  2002-03-08 23:56               ` Hubertus Franke
@ 2002-03-09  0:03               ` H. Peter Anvin
  2002-03-09  1:15                 ` Alan Cox
  2002-03-12  9:35                 ` Helge Hafting
  1 sibling, 2 replies; 84+ messages in thread
From: H. Peter Anvin @ 2002-03-09  0:03 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.33.0203081532550.4421-100000@penguin.transmeta.com>
By author:    Linus Torvalds <torvalds@transmeta.com>
In newsgroup: linux.dev.kernel
> 
> You don't understand. This has nothing to do with lock holders, or 
> anything else.
> 
> I'm saying that we map in a page at a magic offset (just above the stack), 
> and that page contains the locking code.
> 
> For 386 CPU's (where only UP matters), we can trivially come up with a
> lock that doesn't use cmpxchg8b and that isn't SMP-safe. It might even go
> into the kernel every time if it has to - ie it _works_, it just isn't 
> optimal.
> 

Okay, I'll say it and be impopular...

Perhaps it's time to drop i386 support?

It seems to me that the i386 support has been around mostly on a
"until we have a reason to do otherwise" basis, but perhaps this is
the reason?

There certainly are enough little, nagging reasons... CMPXCHG, BSWAP,
and especially WP...

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 23:41             ` Linus Torvalds
@ 2002-03-08 23:56               ` Hubertus Franke
  2002-03-09  2:12                 ` Linus Torvalds
  2002-03-09  0:03               ` H. Peter Anvin
  1 sibling, 1 reply; 84+ messages in thread
From: Hubertus Franke @ 2002-03-08 23:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel

On Friday 08 March 2002 06:41 pm, Linus Torvalds wrote:
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > > I think the next step should be to map in one page of kernel code in a
> > > user-readable location, and just do it there.
> >
> > Your kidding .....
> > Seriously, how can we guarantee that we correctly determine the
> > lock holder, due to memory corruption problems. If we can't do
> > it correctly all the times, why do it at all ?
>
> You don't understand. This has nothing to do with lock holders, or
> anything else.
>
Sorry misunderstood your answer with respect to another question.

> I'm saying that we map in a page at a magic offset (just above the stack),
> and that page contains the locking code.
>
> For 386 CPU's (where only UP matters), we can trivially come up with a
> lock that doesn't use cmpxchg8b and that isn't SMP-safe. It might even go
> into the kernel every time if it has to - ie it _works_, it just isn't
> optimal.
>

Ahhhh, in this context, now "I see the light" (actually its dark at the east 
coast).
So you envision this to go through some named "section" or do you
want to go through the futex_region() library call which identifies whether
the code page has been mapped. If not, the kernel then will provide the 
locking code in that page dependent on the architecture (UP or SMP).
Fair enough.

> > Fail to see why that matters. User level locking is mostly beneficial on
> > SMPs.
>
> That's not the issue AT ALL.
>
> Semaphores are absolutely required on UP too, with threads. There is
> _zero_ difference between UP and SMP from a locking perspective in user
> space due to the fact that we can be preempted at any time - except from
> the cache coherency issue.
>

Agreed, my point was wrt providing the functionality only. Only difference 
between UP and SMP would be that a spinning version would default to the 
standard version (no spinning) under UP. 

> > So, you lock the bus for the atomic update. This is UP, nothing's going
> > on on the bus anyway.
>
> That's not the point. Nobody has locked the bus in the last ten years: the
> cache coherency is done on a cacheline basis, not on the bus.
>
> The point being that the difference between a "decl" and a "lock ;  decl"
> is about 1:12 or so in performance.
>

I am no expert in architecture, but if its done through the cache coherency 
mechanism, the overhead shouldn't be 12:1. You simply mark the cache line as 
part of you instruction to avoid a cache line transfer. How can that be 12 
times slower.  .. Ready to be educated....

> > Even if its a few more cycles, still beats the heck out of using other
> > heavyweight kernel APIs
>
> Sure it does. But if the speed of locking matters enough for user-level
> locks to matter, don't you think the 1:12 difference matters as well?
>
> 		Linus

Yipp, I buy that argument.

Overall, it just seems to me the user locking subsystem is becoming quickly 
again a complicated beast.


Anyway, time to go home and play with the kids :-)

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 23:02         ` Hubertus Franke
@ 2002-03-08 23:47           ` george anzinger
  2002-03-09  1:11             ` Alan Cox
  2002-03-09  1:20             ` Linus Torvalds
  0 siblings, 2 replies; 84+ messages in thread
From: george anzinger @ 2002-03-08 23:47 UTC (permalink / raw)
  To: frankeh; +Cc: Linus Torvalds, Rusty Russell, linux-kernel

Hubertus Franke wrote:
> 
> On Friday 08 March 2002 03:47 pm, george anzinger wrote:
> > Linus Torvalds wrote:
> > > On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > > > Could you also comment on the functionality that has been discussed.
> > >
> > > First off, I have to say that I really like the current patch by Rusty.
> > > The hashing approach is very clean, and it all seems quite good. As to
> > >
> > > specific points:
> > > > (I) the fairness issues that have been raised.
> > > >     do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> > > >     or you don't care about fairness and starvation
> > >
> > > I don't think fairness and starvation is that big of a deal for
> > > semaphores, usually being unfair in these things tends to just improve
> > > performance through better cache locality with no real downside. That
> > > said, I think the option should be open (which it does seem to be).
> > >
> > > For rwlocks, my personal preference is the fifo-fair-preference (unlike
> > > semaphore fairness, I have actually seen loads where read- vs
> > > write-preference really is unacceptable). This might be a point where we
> > > give users the choice.
> > >
> > > I do think we should make the lock bigger - I worry that atomic_t simply
> > > won't be enough for things like fair rwlocks, which might want a
> > > "cmpxchg8b" on x86.
> > >
> > > So I would suggest making the size (and thus alignment check) of locks at
> > > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > > locks on the stack, but gcc does support stack alignment, even if the
> > > code sucks right now.
> >
> > I think this is needed if we want to address the "task dies while
> > holding a lock" issue.  In this case we need to know who holds the lock.
> >
> > -g
> >
> 
> George, while desirable its very tricky if possible at all.
> 
> You need to stick your pid or so into the lock and do it
> atomically. So let's assume we only stick with architectures that can do
> cmpxchg-doubleword, still its not fool proof.

Uh, just the pid would do.  Maybe reserve a bit to indicate contention,
but surly one word would be enough.

> First, the app could still corrupt that count or pid field of the lock
> In that case the whole logic get'ss crewed up.
> There is no guarantee that you ever know who holds the locks !!!

At that point you are most likely down for the count.  The most use for
this would be development where programs are dying like flies.  If the
sem area is "registered" with the kernel, it could do the right thing®
in the exit code.  Except for using the cmpxchg I don't think it adds to
the overhead.
> 
> Secondly, what guarantee do you have that your data is kosher ?

Well, none, but you don't in either case.  This is a non-argument.

> I tend to agree with the masses, that hand waving might be the best
> first approximation

That is your privilege :-)
> 
> --
> -- Hubertus Franke  (frankeh@watson.ibm.com)

-- 
George           george@mvista.com
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 22:55         ` Hubertus Franke
  2002-03-08 23:38           ` Alan Cox
@ 2002-03-08 23:44           ` H. Peter Anvin
  1 sibling, 0 replies; 84+ messages in thread
From: H. Peter Anvin @ 2002-03-08 23:44 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <20020308225425.772D13FE06@smtp.linux.ibm.com>
By author:    Hubertus Franke <frankeh@watson.ibm.com>
In newsgroup: linux.dev.kernel
> >
> > Can we go to cache line alignment - for an array of locks thats clearly
> > advantageous
> 
> NO and let me explain.
> 
> I would to be able to integrate the lock with the data.
> This is much more cache friendly then putting the lock on a different 
> cacheline.
> 

Not just cache, but programmer-friendly as well.  Data structures
containing locks (sometimes multiple and related) are really the
common case.

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 20:57         ` Linus Torvalds
@ 2002-03-08 23:43           ` H. Peter Anvin
  0 siblings, 0 replies; 84+ messages in thread
From: H. Peter Anvin @ 2002-03-08 23:43 UTC (permalink / raw)
  To: linux-kernel

Followup to:  <Pine.LNX.4.33.0203081252450.1412-100000@penguin.transmeta.com>
By author:    Linus Torvalds <torvalds@transmeta.com>
In newsgroup: linux.dev.kernel
> 
> On Fri, 8 Mar 2002, Alan Cox wrote:
> > 
> > Can we go to cache line alignment - for an array of locks thats clearly
> > advantageous
> 
> I disagree about the "clearly". Firstly, the cacheline alignment is CPU 
> dependent, so on some CPU's it's 32 bytes (or even 16), on others it is 
> 128 bytes. 
> 
> Secondly, a lot of locking is actually done inside a single thread, and
> false sharing doesn't happen much - so keeping the locks dense can be
> quite advantageous.
> 
> The cases where false sharing _does_ happen and are a problem should be 
> for the application writer to worry about, not for the kernel to force.
> 
> So I think 8 bytes is plenty fine enough - with 16 bytes a remote 
> possibility (I don't think it is needed, but it gives you som epadding for 
> future expansion). And people who have arrays and find false sharing to be 
> a problem can fix it themselves.
> 
> I personally don't find arrays of locks very common. It's much more common
> to have arrays of data structures that _contain_ locks (eg things like
> having hash tables etc with a per-hashchain lock) and then those container 
> structures may want to be cacheline aligned, but the locks themselves 
> should not need to be.
> 

Also, on UP this is all a waste.

	-hpa
-- 
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt	<amsp@zytor.com>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 23:15           ` Hubertus Franke
  2002-03-08 23:36             ` Alan Cox
@ 2002-03-08 23:41             ` Linus Torvalds
  2002-03-08 23:56               ` Hubertus Franke
  2002-03-09  0:03               ` H. Peter Anvin
  1 sibling, 2 replies; 84+ messages in thread
From: Linus Torvalds @ 2002-03-08 23:41 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Rusty Russell, linux-kernel

On Fri, 8 Mar 2002, Hubertus Franke wrote:
> >
> > I think the next step should be to map in one page of kernel code in a
> > user-readable location, and just do it there.
> 
> Your kidding .....
> Seriously, how can we guarantee that we correctly determine the 
> lock holder, due to memory corruption problems. If we can't do 
> it correctly all the times, why do it at all ?

You don't understand. This has nothing to do with lock holders, or 
anything else.

I'm saying that we map in a page at a magic offset (just above the stack), 
and that page contains the locking code.

For 386 CPU's (where only UP matters), we can trivially come up with a
lock that doesn't use cmpxchg8b and that isn't SMP-safe. It might even go
into the kernel every time if it has to - ie it _works_, it just isn't 
optimal.

> Fail to see why that matters. User level locking is mostly beneficial on SMPs.

That's not the issue AT ALL.

Semaphores are absolutely required on UP too, with threads. There is
_zero_ difference between UP and SMP from a locking perspective in user
space due to the fact that we can be preempted at any time - except from
the cache coherency issue.

> So, you lock the bus for the atomic update. This is UP, nothing's going on 
> on the bus anyway.

That's not the point. Nobody has locked the bus in the last ten years: the 
cache coherency is done on a cacheline basis, not on the bus.

The point being that the difference between a "decl" and a "lock ;  decl"
is about 1:12 or so in performance.

> Even if its a few more cycles, still beats the heck out of using other 
> heavyweight kernel APIs

Sure it does. But if the speed of locking matters enough for user-level 
locks to matter, don't you think the 1:12 difference matters as well?

		Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 22:55         ` Hubertus Franke
@ 2002-03-08 23:38           ` Alan Cox
  2002-03-08 23:44           ` H. Peter Anvin
  1 sibling, 0 replies; 84+ messages in thread
From: Alan Cox @ 2002-03-08 23:38 UTC (permalink / raw)
  To: frankeh; +Cc: Alan Cox, Linus Torvalds, Rusty Russell, linux-kernel

> NO and let me explain.
> 
> I would to be able to integrate the lock with the data.
> This is much more cache friendly then putting the lock on a different 
> cacheline.

Yep - I agree

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 23:15           ` Hubertus Franke
@ 2002-03-08 23:36             ` Alan Cox
  2002-03-08 23:41             ` Linus Torvalds
  1 sibling, 0 replies; 84+ messages in thread
From: Alan Cox @ 2002-03-08 23:36 UTC (permalink / raw)
  To: frankeh; +Cc: Linus Torvalds, Rusty Russell, linux-kernel

> > It's not just 386 vs later due to cmpxchg. It's also the simple issue of
> > UP vs SMP - a UP system still wants to do locking, but it doesn't need the
> > lock prefix. And that lock prefix makes a _huge_ difference
> > performance-wise.
> 
> Fail to see why that matters. User level locking is mostly beneficial on SMPs.
> So, you lock the bus for the atomic update. This is UP, nothing's going on 
> on the bus anyway.

Lots of older x86 is too stupid to optimise exclusive cache line locked
operations. After all the bus is still shared - PCI bus masters for one

Alan

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 21:02         ` Linus Torvalds
@ 2002-03-08 23:15           ` Hubertus Franke
  2002-03-08 23:36             ` Alan Cox
  2002-03-08 23:41             ` Linus Torvalds
  0 siblings, 2 replies; 84+ messages in thread
From: Hubertus Franke @ 2002-03-08 23:15 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel

On Friday 08 March 2002 04:02 pm, Linus Torvalds wrote:
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > But what about compatibility with i368, no cmpxchg or cmpxchg8b
> > Can't we have to types and infer from the op in the kernel what
> > the correct size in user space is.
>
> I think the next step should be to map in one page of kernel code in a
> user-readable location, and just do it there.
>

Your kidding .....
Seriously, how can we guarantee that we correctly determine the 
lock holder, due to memory corruption problems. If we can't do 
it correctly all the times, why do it at all ?

> It's not just 386 vs later due to cmpxchg. It's also the simple issue of
> UP vs SMP - a UP system still wants to do locking, but it doesn't need the
> lock prefix. And that lock prefix makes a _huge_ difference
> performance-wise.

Fail to see why that matters. User level locking is mostly beneficial on SMPs.
So, you lock the bus for the atomic update. This is UP, nothing's going on 
on the bus anyway.
In your experience, what is the overhead in cycles for "incl" vs "lock;incl".
Even if its a few more cycles, still beats the heck out of using other 
heavyweight kernel APIs

> So my suggestion is: ignore i386 for now (no _relevant_ SMP boxes exist
> anyway), and plan on solving the problem with a separate library page
> before 2.6.x gets released.
>

Any rough design.. 
> 			Linus

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 20:47       ` george anzinger
@ 2002-03-08 23:02         ` Hubertus Franke
  2002-03-08 23:47           ` george anzinger
  0 siblings, 1 reply; 84+ messages in thread
From: Hubertus Franke @ 2002-03-08 23:02 UTC (permalink / raw)
  To: george anzinger, Linus Torvalds; +Cc: Rusty Russell, linux-kernel

On Friday 08 March 2002 03:47 pm, george anzinger wrote:
> Linus Torvalds wrote:
> > On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > > Could you also comment on the functionality that has been discussed.
> >
> > First off, I have to say that I really like the current patch by Rusty.
> > The hashing approach is very clean, and it all seems quite good. As to
> >
> > specific points:
> > > (I) the fairness issues that have been raised.
> > >     do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> > >     or you don't care about fairness and starvation
> >
> > I don't think fairness and starvation is that big of a deal for
> > semaphores, usually being unfair in these things tends to just improve
> > performance through better cache locality with no real downside. That
> > said, I think the option should be open (which it does seem to be).
> >
> > For rwlocks, my personal preference is the fifo-fair-preference (unlike
> > semaphore fairness, I have actually seen loads where read- vs
> > write-preference really is unacceptable). This might be a point where we
> > give users the choice.
> >
> > I do think we should make the lock bigger - I worry that atomic_t simply
> > won't be enough for things like fair rwlocks, which might want a
> > "cmpxchg8b" on x86.
> >
> > So I would suggest making the size (and thus alignment check) of locks at
> > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > locks on the stack, but gcc does support stack alignment, even if the
> > code sucks right now.
>
> I think this is needed if we want to address the "task dies while
> holding a lock" issue.  In this case we need to know who holds the lock.
>
> -g
>

Georg, while desirable its very tricky if possible at all.

You need to stick your pid or so into the lock and do it 
atomically. So let's assume we only stick with architectures that can do 
cmpxchg-doubleword, still its not fool proof. 
First, the app could still corrupt that count or pid field of the lock 
In that case the whole logic get'ss crewed up.
There is no guarantee that you ever know who holds the locks !!!

Secondly, what guarantee do you have that your data is kosher ?
I tend to agree with the masses, that hand waving might be the best
first approximation 

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 20:40       ` Alan Cox
  2002-03-08 20:57         ` Linus Torvalds
@ 2002-03-08 22:55         ` Hubertus Franke
  2002-03-08 23:38           ` Alan Cox
  2002-03-08 23:44           ` H. Peter Anvin
  1 sibling, 2 replies; 84+ messages in thread
From: Hubertus Franke @ 2002-03-08 22:55 UTC (permalink / raw)
  To: Alan Cox, Linus Torvalds; +Cc: Rusty Russell, linux-kernel

On Friday 08 March 2002 03:40 pm, Alan Cox wrote:
> > So I would suggest making the size (and thus alignment check) of locks at
> > least 8 bytes (and preferably 16). That makes it slightly harder to put
> > locks on the stack, but gcc does support stack alignment, even if the
> > code sucks right now.
>
> Can we go to cache line alignment - for an array of locks thats clearly
> advantageous

NO and let me explain.

I would to be able to integrate the lock with the data.
This is much more cache friendly then putting the lock on a different 
cacheline.

If you want an array you need to pad each element. 
That's easy enough to do....
Can't shrink a datastructure on the other hand :-)

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 20:29       ` Hubertus Franke
  2002-03-08 20:48         ` Matthew Kirkwood
@ 2002-03-08 21:02         ` Linus Torvalds
  2002-03-08 23:15           ` Hubertus Franke
  1 sibling, 1 reply; 84+ messages in thread
From: Linus Torvalds @ 2002-03-08 21:02 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Rusty Russell, linux-kernel

On Fri, 8 Mar 2002, Hubertus Franke wrote:
> 
> But what about compatibility with i368, no cmpxchg or cmpxchg8b
> Can't we have to types and infer from the op in the kernel what 
> the correct size in user space is.

I think the next step should be to map in one page of kernel code in a 
user-readable location, and just do it there.

It's not just 386 vs later due to cmpxchg. It's also the simple issue of
UP vs SMP - a UP system still wants to do locking, but it doesn't need the
lock prefix. And that lock prefix makes a _huge_ difference
performance-wise.

So my suggestion is: ignore i386 for now (no _relevant_ SMP boxes exist
anyway), and plan on solving the problem with a separate library page 
before 2.6.x gets released.

			Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 20:40       ` Alan Cox
@ 2002-03-08 20:57         ` Linus Torvalds
  2002-03-08 23:43           ` H. Peter Anvin
  2002-03-08 22:55         ` Hubertus Franke
  1 sibling, 1 reply; 84+ messages in thread
From: Linus Torvalds @ 2002-03-08 20:57 UTC (permalink / raw)
  To: Alan Cox; +Cc: Hubertus Franke, Rusty Russell, linux-kernel

On Fri, 8 Mar 2002, Alan Cox wrote:
> 
> Can we go to cache line alignment - for an array of locks thats clearly
> advantageous

I disagree about the "clearly". Firstly, the cacheline alignment is CPU 
dependent, so on some CPU's it's 32 bytes (or even 16), on others it is 
128 bytes. 

Secondly, a lot of locking is actually done inside a single thread, and
false sharing doesn't happen much - so keeping the locks dense can be
quite advantageous.

The cases where false sharing _does_ happen and are a problem should be 
for the application writer to worry about, not for the kernel to force.

So I think 8 bytes is plenty fine enough - with 16 bytes a remote 
possibility (I don't think it is needed, but it gives you som epadding for 
future expansion). And people who have arrays and find false sharing to be 
a problem can fix it themselves.

I personally don't find arrays of locks very common. It's much more common
to have arrays of data structures that _contain_ locks (eg things like
having hash tables etc with a per-hashchain lock) and then those container 
structures may want to be cacheline aligned, but the locks themselves 
should not need to be.

		Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 20:29       ` Hubertus Franke
@ 2002-03-08 20:48         ` Matthew Kirkwood
  2002-03-08 21:02         ` Linus Torvalds
  1 sibling, 0 replies; 84+ messages in thread
From: Matthew Kirkwood @ 2002-03-08 20:48 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Linus Torvalds, Rusty Russell, linux-kernel

On Fri, 8 Mar 2002, Hubertus Franke wrote:

> > So I would suggest making the size (and thus alignment check) of locks
> > at least 8 bytes (and preferably 16). That makes it slightly harder to
> > put locks on the stack, but gcc does support stack alignment, even if
> > the code sucks right now.

> Keeping it small, allows for the common case of mutex integration into
> data objects, though its not a big deal.

I guess we need to know what the requirements are of the fabled
"architectures which need special handling of PROT_SEM" before
we can tell whether this is a good idea or not.

Matthew.


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 19:22     ` Linus Torvalds
  2002-03-08 20:29       ` Hubertus Franke
  2002-03-08 20:40       ` Alan Cox
@ 2002-03-08 20:47       ` george anzinger
  2002-03-08 23:02         ` Hubertus Franke
  2 siblings, 1 reply; 84+ messages in thread
From: george anzinger @ 2002-03-08 20:47 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hubertus Franke, Rusty Russell, linux-kernel

Linus Torvalds wrote:
> 
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> >
> > Could you also comment on the functionality that has been discussed.
> 
> First off, I have to say that I really like the current patch by Rusty.
> The hashing approach is very clean, and it all seems quite good. As to
> specific points:
> 
> > (I) the fairness issues that have been raised.
> >     do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> >     or you don't care about fairness and starvation
> 
> I don't think fairness and starvation is that big of a deal for
> semaphores, usually being unfair in these things tends to just improve
> performance through better cache locality with no real downside. That
> said, I think the option should be open (which it does seem to be).
> 
> For rwlocks, my personal preference is the fifo-fair-preference (unlike
> semaphore fairness, I have actually seen loads where read- vs
> write-preference really is unacceptable). This might be a point where we
> give users the choice.
> 
> I do think we should make the lock bigger - I worry that atomic_t simply
> won't be enough for things like fair rwlocks, which might want a
> "cmpxchg8b" on x86.
> 
> So I would suggest making the size (and thus alignment check) of locks at
> least 8 bytes (and preferably 16). That makes it slightly harder to put
> locks on the stack, but gcc does support stack alignment, even if the code
> sucks right now.
> 
I think this is needed if we want to address the "task dies while
holding a lock" issue.  In this case we need to know who holds the lock.

-g

>                 Linus
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
George           george@mvista.com
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 19:22     ` Linus Torvalds
  2002-03-08 20:29       ` Hubertus Franke
@ 2002-03-08 20:40       ` Alan Cox
  2002-03-08 20:57         ` Linus Torvalds
  2002-03-08 22:55         ` Hubertus Franke
  2002-03-08 20:47       ` george anzinger
  2 siblings, 2 replies; 84+ messages in thread
From: Alan Cox @ 2002-03-08 20:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Hubertus Franke, Rusty Russell, linux-kernel

> So I would suggest making the size (and thus alignment check) of locks at
> least 8 bytes (and preferably 16). That makes it slightly harder to put
> locks on the stack, but gcc does support stack alignment, even if the code
> sucks right now.

Can we go to cache line alignment - for an array of locks thats clearly
advantageous


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 19:22     ` Linus Torvalds
@ 2002-03-08 20:29       ` Hubertus Franke
  2002-03-08 20:48         ` Matthew Kirkwood
  2002-03-08 21:02         ` Linus Torvalds
  2002-03-08 20:40       ` Alan Cox
  2002-03-08 20:47       ` george anzinger
  2 siblings, 2 replies; 84+ messages in thread
From: Hubertus Franke @ 2002-03-08 20:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Rusty Russell, linux-kernel

On Friday 08 March 2002 02:22 pm, Linus Torvalds wrote:
> On Fri, 8 Mar 2002, Hubertus Franke wrote:
> > Could you also comment on the functionality that has been discussed.
>
> First off, I have to say that I really like the current patch by Rusty.
> The hashing approach is very clean, and it all seems quite good. As to
>
Agreed, Rusty did a marvelous job pulling all the right thinks together
and adding the hashing.

> specific points:
> > (I) the fairness issues that have been raised.
> >     do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR
> >     or you don't care about fairness and starvation
>
> I don't think fairness and starvation is that big of a deal for
> semaphores, usually being unfair in these things tends to just improve
> performance through better cache locality with no real downside. That
> said, I think the option should be open (which it does seem to be).
>
Yip, I'd prefer to leave it up to the user on what one exactly wants
fair or non-fair wakeup.

> For rwlocks, my personal preference is the fifo-fair-preference (unlike
> semaphore fairness, I have actually seen loads where read- vs
> write-preference really is unacceptable). This might be a point where we
> give users the choice.
>
> I do think we should make the lock bigger - I worry that atomic_t simply
> won't be enough for things like fair rwlocks, which might want a
> "cmpxchg8b" on x86.
>

But what about compatibility with i368, no cmpxchg or cmpxchg8b
Can't we have to types and infer from the op in the kernel what 
the correct size in user space is.

> So I would suggest making the size (and thus alignment check) of locks at
> least 8 bytes (and preferably 16). That makes it slightly harder to put
> locks on the stack, but gcc does support stack alignment, even if the code
> sucks right now.
>
> 		Linus

Keeping it small, allows for the common case of mutex integration into data
objects, though its not a big deal.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 19:03   ` Hubertus Franke
@ 2002-03-08 19:22     ` Linus Torvalds
  2002-03-08 20:29       ` Hubertus Franke
                         ` (2 more replies)
  2002-03-09  4:49     ` Rusty Russell
  1 sibling, 3 replies; 84+ messages in thread
From: Linus Torvalds @ 2002-03-08 19:22 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Rusty Russell, linux-kernel

On Fri, 8 Mar 2002, Hubertus Franke wrote:
> 
> Could you also comment on the functionality that has been discussed. 

First off, I have to say that I really like the current patch by Rusty. 
The hashing approach is very clean, and it all seems quite good. As to 
specific points:

> (I) the fairness issues that have been raised.
>     do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR 
>     or you don't care about fairness and starvation

I don't think fairness and starvation is that big of a deal for
semaphores, usually being unfair in these things tends to just improve
performance through better cache locality with no real downside. That
said, I think the option should be open (which it does seem to be).

For rwlocks, my personal preference is the fifo-fair-preference (unlike
semaphore fairness, I have actually seen loads where read- vs
write-preference really is unacceptable). This might be a point where we 
give users the choice.

I do think we should make the lock bigger - I worry that atomic_t simply
won't be enough for things like fair rwlocks, which might want a
"cmpxchg8b" on x86. 

So I would suggest making the size (and thus alignment check) of locks at
least 8 bytes (and preferably 16). That makes it slightly harder to put
locks on the stack, but gcc does support stack alignment, even if the code
sucks right now.

		Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-08 18:07 ` Linus Torvalds
@ 2002-03-08 19:03   ` Hubertus Franke
  2002-03-08 19:22     ` Linus Torvalds
  2002-03-09  4:49     ` Rusty Russell
  0 siblings, 2 replies; 84+ messages in thread
From: Hubertus Franke @ 2002-03-08 19:03 UTC (permalink / raw)
  To: Linus Torvalds, Rusty Russell; +Cc: linux-kernel

On Friday 08 March 2002 01:07 pm, Linus Torvalds wrote:
> On Tue, 5 Mar 2002, Rusty Russell wrote:
> > 1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
> > 2) Fix for the "decrement wraparound" problem (Paul Mackerras)
> > 3) x86 fixes: tested on dual x86 box.
>
> This doesn't work on highmem machines - doing the conversion from "<struct
> page, offset>" to "page_address(page)+offset" is simply not legal (not
> even for pure hashing purposes - page_address() changes as you kmap it).
>
> You need to keep the <struct page,offset> tuple in that format, and no
> other. And when you actually touch the page, you need to do the
> kmap()/kunmap() (and you must not keep it mapped while you sleep, because
> that might trivially make the kernel run out of virtual mappings).
>
> 		Linus
>

Good point, my package doesn't have that problem, but your suggestion
should nicely fit into Rusty's patch, but it seems like increasing in 
overhead now. 

Could you also comment on the functionality that has been discussed. 


(I) the fairness issues that have been raised.
    do you support two wakeup mechanism: FUTEX_UP and FUTEX_UP_FAIR 
    or you don't care about fairness and starvation
(II) the rwlocks issues,
     do you support rich set of functions/ops such as below

(a) writer preference
        if any writer is waiting then wake that one up.
(b) reader preference
        if any reader is waiting wait up all the readers in the queue
(c) fifo preference
        if the first waiter is a writer wait it up, otherwise wake up all 
readers
(d) fifo-fair  preference
        like (c), but only wake up readers until the next writer is 
encountered

(a) - (c) can be implemented with Rusty's 2 user-ueue approach as long
as the wakeup type is always the same. The last one can't (?).

In the kernel this is easy to implement, but the trouble is the status
word in user space, still thinking about it.
It also requires compare and swap.

Thanks for your time and pointing this out. 

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-05  7:01 Rusty Russell
  2002-03-05 22:39 ` Davide Libenzi
@ 2002-03-08 18:07 ` Linus Torvalds
  2002-03-08 19:03   ` Hubertus Franke
  2002-03-09  4:51 ` Rusty Russell
  2 siblings, 1 reply; 84+ messages in thread
From: Linus Torvalds @ 2002-03-08 18:07 UTC (permalink / raw)
  To: Rusty Russell; +Cc: linux-kernel

On Tue, 5 Mar 2002, Rusty Russell wrote:
>
> 1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
> 2) Fix for the "decrement wraparound" problem (Paul Mackerras)
> 3) x86 fixes: tested on dual x86 box.

This doesn't work on highmem machines - doing the conversion from "<struct
page, offset>" to "page_address(page)+offset" is simply not legal (not
even for pure hashing purposes - page_address() changes as you kmap it).

You need to keep the <struct page,offset> tuple in that format, and no
other. And when you actually touch the page, you need to do the
kmap()/kunmap() (and you must not keep it mapped while you sleep, because
that might trivially make the kernel run out of virtual mappings).

		Linus

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-05 23:26     ` Davide Libenzi
  2002-03-05 23:37       ` Peter Svensson
@ 2002-03-08  0:07       ` Richard Henderson
  1 sibling, 0 replies; 84+ messages in thread
From: Richard Henderson @ 2002-03-08  0:07 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Hubertus Franke, Rusty Russell, Linux Kernel Mailing List

On Tue, Mar 05, 2002 at 03:26:31PM -0800, Davide Libenzi wrote:
> Yes but this is always true   alignof >= sizeof

No.  m68k sets alignof to 2 for all types with sizeof >= 2.


r~

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-06  1:46   ` Rusty Russell
@ 2002-03-06  2:03     ` Davide Libenzi
  0 siblings, 0 replies; 84+ messages in thread
From: Davide Libenzi @ 2002-03-06  2:03 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Wed, 6 Mar 2002, Rusty Russell wrote:

> In message <Pine.LNX.4.44.0203051433400.1475-100000@blue1.dev.mcafeelabs.com> y
> ou write:
> > On Tue, 5 Mar 2002, Rusty Russell wrote:
> >
> > > +	pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> > > +
> > > +	/* Must be "naturally" aligned, and not on page boundary. */
> > > +	if ((pos_in_page % __alignof__(atomic_t)) != 0
> > > +	    || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> > > +		return -EINVAL;
> >
> > How can this :
> >
> > 	(pos_in_page % __alignof__(atomic_t)) != 0
> >
> > to be false, and together this :
> >
> > 	pos_in_page + sizeof(atomic_t) > PAGE_SIZE
> >
> > to be true ?
>
> You're assuming that __alignof__(atomic_t) = N * sizeof(atomic_t),
> where N is an integer.
>
> If alignof == 1, and sizeof == 4, you lose.  I prefer to be
> future-proof.
>
> This means I should clarify the comment...

No i should do less things at a time :-(



- Davide



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-05 22:39 ` Davide Libenzi
  2002-03-05 23:16   ` Hubertus Franke
@ 2002-03-06  1:46   ` Rusty Russell
  2002-03-06  2:03     ` Davide Libenzi
  1 sibling, 1 reply; 84+ messages in thread
From: Rusty Russell @ 2002-03-06  1:46 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: torvalds, linux-kernel

In message <Pine.LNX.4.44.0203051433400.1475-100000@blue1.dev.mcafeelabs.com> y
ou write:
> On Tue, 5 Mar 2002, Rusty Russell wrote:
> 
> > +	pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> > +
> > +	/* Must be "naturally" aligned, and not on page boundary. */
> > +	if ((pos_in_page % __alignof__(atomic_t)) != 0
> > +	    || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> > +		return -EINVAL;
> 
> How can this :
> 
> 	(pos_in_page % __alignof__(atomic_t)) != 0
> 
> to be false, and together this :
> 
> 	pos_in_page + sizeof(atomic_t) > PAGE_SIZE
> 
> to be true ?

You're assuming that __alignof__(atomic_t) = N * sizeof(atomic_t),
where N is an integer.

If alignof == 1, and sizeof == 4, you lose.  I prefer to be
future-proof.

This means I should clarify the comment...
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-05 23:37       ` Peter Svensson
@ 2002-03-05 23:50         ` Davide Libenzi
  0 siblings, 0 replies; 84+ messages in thread
From: Davide Libenzi @ 2002-03-05 23:50 UTC (permalink / raw)
  To: Peter Svensson; +Cc: Hubertus Franke, Rusty Russell, Linux Kernel Mailing List

On Wed, 6 Mar 2002, Peter Svensson wrote:

> On Tue, 5 Mar 2002, Davide Libenzi wrote:
>
> > > I believe not all machine have  alignof  == sizeof
> >
> > Yes but this is always true   alignof >= sizeof
>
> No, this is not true. As the gcc info pages says:
>    For example, if the target machine requires a `double' value to be
>    aligned on an 8-byte boundary, then `__alignof__ (double)' is 8.  This
>    is true on many RISC machines.  On more traditional machine designs,
>    `__alignof__ (double)' is 4 or even 2.
> A later example shows situations where alignof>sizeof.

Yes, it's true.



- Davide




^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-05 23:26     ` Davide Libenzi
@ 2002-03-05 23:37       ` Peter Svensson
  2002-03-05 23:50         ` Davide Libenzi
  2002-03-08  0:07       ` Richard Henderson
  1 sibling, 1 reply; 84+ messages in thread
From: Peter Svensson @ 2002-03-05 23:37 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Hubertus Franke, Rusty Russell, Linux Kernel Mailing List

On Tue, 5 Mar 2002, Davide Libenzi wrote:

> > I believe not all machine have  alignof  == sizeof
> 
> Yes but this is always true   alignof >= sizeof

No, this is not true. As the gcc info pages says: 
   For example, if the target machine requires a `double' value to be
   aligned on an 8-byte boundary, then `__alignof__ (double)' is 8.  This
   is true on many RISC machines.  On more traditional machine designs,
   `__alignof__ (double)' is 4 or even 2.
A later example shows situations where alignof>sizeof.

Peter
--
Peter Svensson      ! Pgp key available by finger, fingerprint:
<petersv@psv.nu>    ! 8A E9 20 98 C1 FF 43 E3  07 FD B9 0A 80 72 70 AF
------------------------------------------------------------------------
Remember, Luke, your source will be with you... always...

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-05 23:16   ` Hubertus Franke
@ 2002-03-05 23:26     ` Davide Libenzi
  2002-03-05 23:37       ` Peter Svensson
  2002-03-08  0:07       ` Richard Henderson
  0 siblings, 2 replies; 84+ messages in thread
From: Davide Libenzi @ 2002-03-05 23:26 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Rusty Russell, Linux Kernel Mailing List

On Tue, 5 Mar 2002, Hubertus Franke wrote:

> On Tuesday 05 March 2002 05:39 pm, Davide Libenzi wrote:
> > On Tue, 5 Mar 2002, Rusty Russell wrote:
> > > +	pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> > > +
> > > +	/* Must be "naturally" aligned, and not on page boundary. */
> > > +	if ((pos_in_page % __alignof__(atomic_t)) != 0
> > > +	    || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> > > +		return -EINVAL;
> >
> > How can this :
> >
> > 	(pos_in_page % __alignof__(atomic_t)) != 0
> >
> > to be false, and together this :
> >
> > 	pos_in_page + sizeof(atomic_t) > PAGE_SIZE
> >
> > to be true ?
> > This is enough :
> >
> > 	if ((pos_in_page % __alignof__(atomic_t)) != 0)
> >
> >
>
> I believe not all machine have  alignof  == sizeof

Yes but this is always true   alignof >= sizeof




- Davide



^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-05 22:39 ` Davide Libenzi
@ 2002-03-05 23:16   ` Hubertus Franke
  2002-03-05 23:26     ` Davide Libenzi
  2002-03-06  1:46   ` Rusty Russell
  1 sibling, 1 reply; 84+ messages in thread
From: Hubertus Franke @ 2002-03-05 23:16 UTC (permalink / raw)
  To: Davide Libenzi, Rusty Russell; +Cc: linux-kernel

On Tuesday 05 March 2002 05:39 pm, Davide Libenzi wrote:
> On Tue, 5 Mar 2002, Rusty Russell wrote:
> > +	pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> > +
> > +	/* Must be "naturally" aligned, and not on page boundary. */
> > +	if ((pos_in_page % __alignof__(atomic_t)) != 0
> > +	    || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> > +		return -EINVAL;
>
> How can this :
>
> 	(pos_in_page % __alignof__(atomic_t)) != 0
>
> to be false, and together this :
>
> 	pos_in_page + sizeof(atomic_t) > PAGE_SIZE
>
> to be true ?
> This is enough :
>
> 	if ((pos_in_page % __alignof__(atomic_t)) != 0)
>
>

I believe not all machine have  alignof  == sizeof
>
>
> - Davide
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
  2002-03-05  7:01 Rusty Russell
@ 2002-03-05 22:39 ` Davide Libenzi
  2002-03-05 23:16   ` Hubertus Franke
  2002-03-06  1:46   ` Rusty Russell
  2002-03-08 18:07 ` Linus Torvalds
  2002-03-09  4:51 ` Rusty Russell
  2 siblings, 2 replies; 84+ messages in thread
From: Davide Libenzi @ 2002-03-05 22:39 UTC (permalink / raw)
  To: Rusty Russell; +Cc: torvalds, linux-kernel

On Tue, 5 Mar 2002, Rusty Russell wrote:

> +	pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
> +
> +	/* Must be "naturally" aligned, and not on page boundary. */
> +	if ((pos_in_page % __alignof__(atomic_t)) != 0
> +	    || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
> +		return -EINVAL;

How can this :

	(pos_in_page % __alignof__(atomic_t)) != 0

to be false, and together this :

	pos_in_page + sizeof(atomic_t) > PAGE_SIZE

to be true ?
This is enough :

	if ((pos_in_page % __alignof__(atomic_t)) != 0)




- Davide



^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores)
@ 2002-03-05  7:01 Rusty Russell
  2002-03-05 22:39 ` Davide Libenzi
                   ` (2 more replies)
  0 siblings, 3 replies; 84+ messages in thread
From: Rusty Russell @ 2002-03-05  7:01 UTC (permalink / raw)
  To: torvalds; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 16256 bytes --]

1) FUTEX_UP and FUTEX_DOWN defines. (Robert Love)
2) Fix for the "decrement wraparound" problem (Paul Mackerras)
3) x86 fixes: tested on dual x86 box.

Example userspace lib attached,
Rusty.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/linux/futex.h working-2.5.6-pre2-futex/include/linux/futex.h
--- linux-2.5.6-pre2/include/linux/futex.h	Thu Jan  1 10:00:00 1970
+++ working-2.5.6-pre2-futex/include/linux/futex.h	Tue Mar  5 13:53:33 2002
@@ -0,0 +1,8 @@
+#ifndef _LINUX_FUTEX_H
+#define _LINUX_FUTEX_H
+
+/* Second argument to futex syscall */
+#define FUTEX_UP (1)
+#define FUTEX_DOWN (-1)
+
+#endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/kernel/futex.c working-2.5.6-pre2-futex/kernel/futex.c
--- linux-2.5.6-pre2/kernel/futex.c	Thu Jan  1 10:00:00 1970
+++ working-2.5.6-pre2-futex/kernel/futex.c	Tue Mar  5 13:53:33 2002
@@ -0,0 +1,208 @@
+/*
+ *  Fast Userspace Mutexes (which I call "Futexes!").
+ *  (C) Rusty Russell, IBM 2002
+ *
+ *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
+ *  enough at me, Linus for the original (flawed) idea, Matthew
+ *  Kirkwood for proof-of-concept implementation.
+ *
+ *  "The futexes are also cursed."
+ *  "But they come in a choice of three flavours!"
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/futex.h>
+#include <asm/atomic.h>
+
+/* These mutexes are a very simple counter: the winner is the one who
+   decrements from 1 to 0.  The counter starts at 1 when the lock is
+   free.  A value other than 0 or 1 means someone may be sleeping.
+   This is simple enough to work on all architectures, but has the
+   problem that if we never "up" the semaphore it could eventually
+   wrap around. */
+
+/* FIXME: This may be way too small. --RR */
+#define FUTEX_HASHBITS 6
+
+/* We use this instead of a normal wait_queue_t, so we can wake only
+   the relevent ones (hashed queues may be shared) */
+struct futex_q {
+	struct list_head list;
+	struct task_struct *task;
+	atomic_t *count;
+};
+
+/* The key for the hash is the address + index + offset within page */
+static struct list_head futex_queues[1<<FUTEX_HASHBITS];
+static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+
+static inline struct list_head *hash_futex(struct page *page,
+					   unsigned long offset)
+{
+	unsigned long h;
+
+	/* If someone is sleeping, page is pinned.  ie. page_address
+           is a constant when we care about it. */
+	h = (unsigned long)page_address(page) + offset;
+	return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
+}
+
+static inline void wake_one_waiter(struct list_head *head, atomic_t *count)
+{
+	struct list_head *i;
+
+	spin_lock(&futex_lock);
+	list_for_each(i, head) {
+		struct futex_q *this = list_entry(i, struct futex_q, list);
+
+		if (this->count == count) {
+			wake_up_process(this->task);
+			break;
+		}
+	}
+	spin_unlock(&futex_lock);
+}
+
+/* Add at end to avoid starvation */
+static inline void queue_me(struct list_head *head,
+			    struct futex_q *q,
+			    atomic_t *count)
+{
+	q->task = current;
+	q->count = count;
+
+	spin_lock(&futex_lock);
+	list_add_tail(&q->list, head);
+	spin_unlock(&futex_lock);
+}
+
+static inline void unqueue_me(struct futex_q *q)
+{
+	spin_lock(&futex_lock);
+	list_del(&q->list);
+	spin_unlock(&futex_lock);
+}
+
+/* Get kernel address of the user page and pin it. */
+static struct page *pin_page(unsigned long page_start)
+{
+	struct mm_struct *mm = current->mm;
+	struct page *page;
+	int err;
+
+	down_read(&mm->mmap_sem);
+	err = get_user_pages(current, current->mm, page_start,
+			     1 /* one page */,
+			     1 /* writable */,
+			     0 /* don't force */,
+			     &page,
+			     NULL /* don't return vmas */);
+	up_read(&mm->mmap_sem);
+
+	if (err < 0)
+		return ERR_PTR(err);
+	return page;
+}
+
+/* Simplified from arch/ppc/kernel/semaphore.c: Paul M. is a genius. */
+static int futex_down(struct list_head *head, atomic_t *count)
+{
+	int retval = 0;
+	struct futex_q q;
+
+	current->state = TASK_INTERRUPTIBLE;
+	queue_me(head, &q, count);
+
+	/* If we take the semaphore from 1 to 0, it's ours.  But don't
+           bother decrementing if it's already negative. */
+	while (atomic_read(count) < 0 || !atomic_dec_and_test(count)) {
+		if (signal_pending(current)) {
+			retval = -EINTR;
+			break;
+		}
+		schedule();
+		current->state = TASK_INTERRUPTIBLE;
+	}
+	current->state = TASK_RUNNING;
+	unqueue_me(&q);
+	/* If we were signalled, we might have just been woken: we
+	   must wake another one.  Otherwise we need to wake someone
+	   else (if they are waiting) so they drop the count below 0,
+	   and when we "up" in userspace, we know there is a
+	   waiter. */
+	wake_one_waiter(head, count);
+	return retval;
+}
+
+static int futex_up(struct list_head *head, atomic_t *count)
+{
+	atomic_set(count, 1);
+	smp_wmb();
+	wake_one_waiter(head, count);
+	return 0;
+}
+
+asmlinkage int sys_futex(void *uaddr, int op)
+{
+	int ret;
+	unsigned long pos_in_page;
+	struct list_head *head;
+	struct page *page;
+
+	pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
+
+	/* Must be "naturally" aligned, and not on page boundary. */
+	if ((pos_in_page % __alignof__(atomic_t)) != 0
+	    || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
+		return -EINVAL;
+
+	/* Simpler if it doesn't vanish underneath us. */
+	page = pin_page((unsigned long)uaddr - pos_in_page);
+	if (IS_ERR(page))
+		return PTR_ERR(page);
+
+	head = hash_futex(page, pos_in_page);
+	switch (op) {
+	case FUTEX_UP:
+		ret = futex_up(head, page_address(page) + pos_in_page);
+		break;
+	case FUTEX_DOWN:
+		ret = futex_down(head, page_address(page) + pos_in_page);
+		break;
+	/* Add other lock types here... */
+	default:
+		ret = -EINVAL;
+	}
+	put_page(page);
+
+	return ret;
+}
+
+static int __init init(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
+		INIT_LIST_HEAD(&futex_queues[i]);
+	return 0;
+}
+__initcall(init);
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/arch/i386/kernel/entry.S working-2.5.6-pre2-futex/arch/i386/kernel/entry.S
--- linux-2.5.6-pre2/arch/i386/kernel/entry.S	Wed Feb 20 17:56:59 2002
+++ working-2.5.6-pre2-futex/arch/i386/kernel/entry.S	Tue Mar  5 13:53:33 2002
@@ -716,6 +716,7 @@
 	.long SYMBOL_NAME(sys_lremovexattr)
 	.long SYMBOL_NAME(sys_fremovexattr)
 	.long SYMBOL_NAME(sys_tkill)
+	.long SYMBOL_NAME(sys_futex)
 
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long SYMBOL_NAME(sys_ni_syscall)
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/arch/ppc/kernel/misc.S working-2.5.6-pre2-futex/arch/ppc/kernel/misc.S
--- linux-2.5.6-pre2/arch/ppc/kernel/misc.S	Wed Feb 20 17:57:04 2002
+++ working-2.5.6-pre2-futex/arch/ppc/kernel/misc.S	Tue Mar  5 13:53:33 2002
@@ -1246,6 +1246,7 @@
 	.long sys_removexattr
 	.long sys_lremovexattr
 	.long sys_fremovexattr	/* 220 */
+	.long sys_futex
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long sys_ni_syscall
 	.endr
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-i386/mman.h working-2.5.6-pre2-futex/include/asm-i386/mman.h
--- linux-2.5.6-pre2/include/asm-i386/mman.h	Wed Mar 15 12:45:20 2000
+++ working-2.5.6-pre2-futex/include/asm-i386/mman.h	Tue Mar  5 13:53:33 2002
@@ -4,6 +4,7 @@
 #define PROT_READ	0x1		/* page can be read */
 #define PROT_WRITE	0x2		/* page can be written */
 #define PROT_EXEC	0x4		/* page can be executed */
+#define PROT_SEM	0x8		/* page may be used for atomic ops */
 #define PROT_NONE	0x0		/* page can not be accessed */
 
 #define MAP_SHARED	0x01		/* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-i386/unistd.h working-2.5.6-pre2-futex/include/asm-i386/unistd.h
--- linux-2.5.6-pre2/include/asm-i386/unistd.h	Wed Feb 20 17:56:40 2002
+++ working-2.5.6-pre2-futex/include/asm-i386/unistd.h	Tue Mar  5 13:53:33 2002
@@ -243,6 +243,7 @@
 #define __NR_lremovexattr	236
 #define __NR_fremovexattr	237
 #define __NR_tkill		238
+#define __NR_futex		239
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-ppc/mman.h working-2.5.6-pre2-futex/include/asm-ppc/mman.h
--- linux-2.5.6-pre2/include/asm-ppc/mman.h	Tue May 22 08:02:06 2001
+++ working-2.5.6-pre2-futex/include/asm-ppc/mman.h	Tue Mar  5 13:53:33 2002
@@ -7,6 +7,7 @@
 #define PROT_READ	0x1		/* page can be read */
 #define PROT_WRITE	0x2		/* page can be written */
 #define PROT_EXEC	0x4		/* page can be executed */
+#define PROT_SEM	0x8		/* page may be used for atomic ops */
 #define PROT_NONE	0x0		/* page can not be accessed */
 
 #define MAP_SHARED	0x01		/* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/asm-ppc/unistd.h working-2.5.6-pre2-futex/include/asm-ppc/unistd.h
--- linux-2.5.6-pre2/include/asm-ppc/unistd.h	Wed Feb 20 17:57:18 2002
+++ working-2.5.6-pre2-futex/include/asm-ppc/unistd.h	Tue Mar  5 13:53:33 2002
@@ -228,6 +228,7 @@
 #define __NR_removexattr	218
 #define __NR_lremovexattr	219
 #define __NR_fremovexattr	220
+#define __NR_futex		221
 
 #define __NR(n)	#n
 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/kernel/Makefile working-2.5.6-pre2-futex/kernel/Makefile
--- linux-2.5.6-pre2/kernel/Makefile	Wed Feb 20 17:56:17 2002
+++ working-2.5.6-pre2-futex/kernel/Makefile	Tue Mar  5 13:53:33 2002
@@ -15,7 +15,7 @@
 obj-y     = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
 	    module.o exit.o itimer.o info.o time.o softirq.o resource.o \
 	    sysctl.o acct.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o context.o 
+	    signal.o sys.o kmod.o context.o futex.o
 
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/linux/hash.h working-2.5.6-pre2-futex/include/linux/hash.h
--- linux-2.5.6-pre2/include/linux/hash.h	Thu Jan  1 10:00:00 1970
+++ working-2.5.6-pre2-futex/include/linux/hash.h	Tue Mar  5 13:53:33 2002
@@ -0,0 +1,58 @@
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for a long.
+   (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+#if BITS_PER_LONG == 32
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
+#elif BITS_PER_LONG == 64
+/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#else
+#error Define GOLDEN_RATIO_PRIME for your wordsize.
+#endif
+
+static inline unsigned long hash_long(unsigned long val, unsigned int bits)
+{
+	unsigned long hash = val;
+
+#if BITS_PER_LONG == 64
+	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
+	unsigned long n = hash;
+	n <<= 18;
+	hash -= n;
+	n <<= 33;
+	hash -= n;
+	n <<= 3;
+	hash += n;
+	n <<= 3;
+	hash -= n;
+	n <<= 4;
+	hash += n;
+	n <<= 2;
+	hash += n;
+#else
+	/* On some cpus multiply is faster, on others gcc will do shifts */
+	hash *= GOLDEN_RATIO_PRIME;
+#endif
+
+	/* High bits are more random, so use them. */
+	return hash >> (BITS_PER_LONG - bits);
+}
+	
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+	return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/include/linux/mmzone.h working-2.5.6-pre2-futex/include/linux/mmzone.h
--- linux-2.5.6-pre2/include/linux/mmzone.h	Fri Mar  1 22:58:34 2002
+++ working-2.5.6-pre2-futex/include/linux/mmzone.h	Tue Mar  5 14:03:15 2002
@@ -51,8 +51,7 @@
 	/*
 	 * wait_table		-- the array holding the hash table
 	 * wait_table_size	-- the size of the hash table array
-	 * wait_table_shift	-- wait_table_size
-	 * 				== BITS_PER_LONG (1 << wait_table_bits)
+	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
 	 *
 	 * The purpose of all these is to keep track of the people
 	 * waiting for a page to become available and make them
@@ -75,7 +74,7 @@
 	 */
 	wait_queue_head_t	* wait_table;
 	unsigned long		wait_table_size;
-	unsigned long		wait_table_shift;
+	unsigned long		wait_table_bits;
 
 	/*
 	 * Discontig memory support fields.
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/mm/filemap.c working-2.5.6-pre2-futex/mm/filemap.c
--- linux-2.5.6-pre2/mm/filemap.c	Fri Mar  1 23:27:15 2002
+++ working-2.5.6-pre2-futex/mm/filemap.c	Tue Mar  5 13:53:33 2002
@@ -25,6 +25,7 @@
 #include <linux/iobuf.h>
 #include <linux/compiler.h>
 #include <linux/fs.h>
+#include <linux/hash.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -773,32 +774,8 @@
 static inline wait_queue_head_t *page_waitqueue(struct page *page)
 {
 	const zone_t *zone = page_zone(page);
-	wait_queue_head_t *wait = zone->wait_table;
-	unsigned long hash = (unsigned long)page;
 
-#if BITS_PER_LONG == 64
-	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
-	unsigned long n = hash;
-	n <<= 18;
-	hash -= n;
-	n <<= 33;
-	hash -= n;
-	n <<= 3;
-	hash += n;
-	n <<= 3;
-	hash -= n;
-	n <<= 4;
-	hash += n;
-	n <<= 2;
-	hash += n;
-#else
-	/* On some cpus multiply is faster, on others gcc will do shifts */
-	hash *= GOLDEN_RATIO_PRIME;
-#endif
-
-	hash >>= zone->wait_table_shift;
-
-	return &wait[hash];
+	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
 /* 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/mm/mprotect.c working-2.5.6-pre2-futex/mm/mprotect.c
--- linux-2.5.6-pre2/mm/mprotect.c	Wed Feb 20 17:57:21 2002
+++ working-2.5.6-pre2-futex/mm/mprotect.c	Tue Mar  5 13:53:33 2002
@@ -280,7 +280,7 @@
 	end = start + len;
 	if (end < start)
 		return -EINVAL;
-	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC))
+	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
 		return -EINVAL;
 	if (end == start)
 		return 0;
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/current-dontdiff --minimal linux-2.5.6-pre2/mm/page_alloc.c working-2.5.6-pre2-futex/mm/page_alloc.c
--- linux-2.5.6-pre2/mm/page_alloc.c	Wed Feb 20 17:57:21 2002
+++ working-2.5.6-pre2-futex/mm/page_alloc.c	Tue Mar  5 13:53:33 2002
@@ -776,8 +776,8 @@
 		 * per zone.
 		 */
 		zone->wait_table_size = wait_table_size(size);
-		zone->wait_table_shift =
-			BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
+		zone->wait_table_bits =
+			wait_table_bits(zone->wait_table_size);
 		zone->wait_table = (wait_queue_head_t *)
 			alloc_bootmem_node(pgdat, zone->wait_table_size
 						* sizeof(wait_queue_head_t));

--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.


[-- Attachment #2: Futex example library --]
[-- Type: application/octet-stream, Size: 2855 bytes --]

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2002-03-27 23:51 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-03-13  9:12 [PATCH] Futexes IV (Fast Lightweight Userspace Semaphores) Martin Wirth
2002-03-13 19:41 ` Bill Davidsen
2002-03-13 19:52   ` Dave McCracken
2002-03-13 22:17     ` Bill Davidsen
2002-03-13 20:06   ` Alan Cox
2002-03-15  7:31 ` Rusty Russell
2002-03-15  8:41   ` Martin Wirth
2002-03-15 15:29     ` Hubertus Franke
2002-03-15 16:23     ` Peter Wächtler
2002-03-16  0:12       ` Rusty Russell
2002-03-16 11:23         ` Martin Wirth
2002-03-18  0:52           ` Ulrich Drepper
2002-03-19  3:28             ` Rusty Russell
2002-03-19  4:05               ` Ulrich Drepper
2002-03-20  6:20                 ` Rusty Russell
2002-03-20 10:42                   ` Peter Wächtler
2002-03-20 17:20                     ` Ulrich Drepper
2002-03-19  8:34               ` Martin Wirth
2002-03-20  6:45                 ` Rusty Russell
2002-03-21  6:48                   ` Martin Wirth
2002-03-24 18:25                     ` Peter Wächtler
2002-03-25  2:28                       ` Rusty Russell
2002-03-25  4:46                         ` Rusty Russell
2002-03-25 11:56                           ` Peter Wächtler
2002-03-26  1:02                             ` Rusty Russell
2002-03-26  8:17                               ` Martin Wirth
2002-03-26 23:10                                 ` Rusty Russell
2002-03-27 21:05                                   ` Hubertus Franke
2002-03-27 23:53                                     ` Rusty Russell
2002-03-25  9:47                         ` Peter Wächtler
2002-03-16 19:48         ` Peter Wächtler
2002-03-17  6:50         ` Rusty Russell
  -- strict thread matches above, loose matches on Subject: below --
2002-03-05  7:01 Rusty Russell
2002-03-05 22:39 ` Davide Libenzi
2002-03-05 23:16   ` Hubertus Franke
2002-03-05 23:26     ` Davide Libenzi
2002-03-05 23:37       ` Peter Svensson
2002-03-05 23:50         ` Davide Libenzi
2002-03-08  0:07       ` Richard Henderson
2002-03-06  1:46   ` Rusty Russell
2002-03-06  2:03     ` Davide Libenzi
2002-03-08 18:07 ` Linus Torvalds
2002-03-08 19:03   ` Hubertus Franke
2002-03-08 19:22     ` Linus Torvalds
2002-03-08 20:29       ` Hubertus Franke
2002-03-08 20:48         ` Matthew Kirkwood
2002-03-08 21:02         ` Linus Torvalds
2002-03-08 23:15           ` Hubertus Franke
2002-03-08 23:36             ` Alan Cox
2002-03-08 23:41             ` Linus Torvalds
2002-03-08 23:56               ` Hubertus Franke
2002-03-09  2:12                 ` Linus Torvalds
2002-03-11 14:14                   ` Hubertus Franke
2002-03-09  0:03               ` H. Peter Anvin
2002-03-09  1:15                 ` Alan Cox
2002-03-10 19:41                   ` Linus Torvalds
2002-03-11 20:49                     ` Pavel Machek
2002-03-10 19:58                   ` Martin J. Bligh
2002-03-10 20:40                     ` Alan Cox
2002-03-10 20:28                       ` Martin J. Bligh
2002-03-10 21:05                         ` Alan Cox
2002-03-13  7:40                   ` Rusty Russell
2002-03-13 16:37                     ` Alan Cox
2002-03-12  9:35                 ` Helge Hafting
2002-03-08 20:40       ` Alan Cox
2002-03-08 20:57         ` Linus Torvalds
2002-03-08 23:43           ` H. Peter Anvin
2002-03-08 22:55         ` Hubertus Franke
2002-03-08 23:38           ` Alan Cox
2002-03-08 23:44           ` H. Peter Anvin
2002-03-08 20:47       ` george anzinger
2002-03-08 23:02         ` Hubertus Franke
2002-03-08 23:47           ` george anzinger
2002-03-09  1:11             ` Alan Cox
2002-03-09  1:20             ` Linus Torvalds
2002-03-09  4:49     ` Rusty Russell
2002-03-11 22:45       ` Linus Torvalds
2002-03-11 23:12         ` Hubertus Franke
2002-03-12  7:20         ` Rusty Russell
2002-03-12 14:56           ` Hubertus Franke
2002-03-13  4:02             ` Rusty Russell
2002-03-12 17:17           ` Linus Torvalds
2002-03-13  2:57             ` Rusty Russell
2002-03-09  4:51 ` Rusty Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).