linux-kernel.vger.kernel.org archive mirror
* [patch 00/21] mutex subsystem, -V14
@ 2006-01-04 14:41 Ingo Molnar
  2006-01-04 23:45 ` Joel Schopp
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2006-01-04 14:41 UTC (permalink / raw)
  To: lkml
  Cc: Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre,
	Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox,
	Christoph Hellwig, Andi Kleen, Russell King


this is version 14 of the generic mutex subsystem, against v2.6.15.

The patch-queue consists of 21 patches, which can also be downloaded 
from:

  http://redhat.com/~mingo/generic-mutex-subsystem/

Changes since -V13:

 13 files changed, 113 insertions(+), 85 deletions(-)

 - VFS: converted sb->s_lock to a mutex too. This improves the
   create+unlink benchmark on ext3.

 - further simplification of __mutex_lock_common(): no more gotos, and
   only one atomic_xchg() is done. Code size is now extremely small on 
   both UP and SMP:

   text    data     bss     dec     hex filename
    398       0       0     398     18e mutex.o.UP

   text    data     bss     dec     hex filename
    463       0       0     463     1cf mutex.o.SMP

 - synchro-test updates: maximum of 64 threads, fairness stats,
   better defaults if =y, cleanups. (David Howells, me)

 - FASTCALL -> fastcall in mutex.h (suggested by David Howells)

 - documentation updates

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-04 14:41 [patch 00/21] mutex subsystem, -V14 Ingo Molnar
@ 2006-01-04 23:45 ` Joel Schopp
  2006-01-05  2:38   ` Nicolas Pitre
  2006-01-05 14:35   ` Ingo Molnar
  0 siblings, 2 replies; 34+ messages in thread
From: Joel Schopp @ 2006-01-04 23:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard

> this is version 14 of the generic mutex subsystem, against v2.6.15.
> 
> The patch-queue consists of 21 patches, which can also be downloaded 
> from:
> 
>   http://redhat.com/~mingo/generic-mutex-subsystem/
> 

Took a glance at this on ppc64.  Would it be useful if I contributed an arch 
specific version like arm has?  We'll either need an arch specific version or 
have the generic changed.

Anyway, here is some disassembly of some of the code generated with my comments:

c00000000049bf9c <.mutex_lock>:
c00000000049bf9c:       7c 00 06 ac     eieio
c00000000049bfa0:       7d 20 18 28     lwarx   r9,r0,r3
c00000000049bfa4:       31 29 ff ff     addic   r9,r9,-1
c00000000049bfa8:       7d 20 19 2d     stwcx.  r9,r0,r3
c00000000049bfac:       40 c2 ff f4     bne+    c00000000049bfa0 <.mutex_lock+0x4>
c00000000049bfb0:       4c 00 01 2c     isync
c00000000049bfb4:       7d 20 4b 78     mr      r0,r9
c00000000049bfb8:       78 09 0f e3     rldicl. r9,r0,33,63
c00000000049bfbc:       4d 82 00 20     beqlr
c00000000049bfc0:       4b ff ff 58     b       c00000000049bf18 
<.__mutex_lock_noinline>

The eieio is completely unnecessary; it got picked up from atomic_dec_return 
(Anton, why is there an eieio at the start of atomic_dec_return in the first 
place?).

Ignore the + on the bne; the disassembler is wrong, it is really a -.

c00000000049bfc4 <.mutex_unlock>:
c00000000049bfc4:       7c 00 06 ac     eieio
c00000000049bfc8:       7d 20 18 28     lwarx   r9,r0,r3
c00000000049bfcc:       31 29 00 01     addic   r9,r9,1
c00000000049bfd0:       7d 20 19 2d     stwcx.  r9,r0,r3
c00000000049bfd4:       40 c2 ff f4     bne+    c00000000049bfc8 <.mutex_unlock+0x4>
c00000000049bfd8:       4c 00 01 2c     isync
c00000000049bfdc:       7d 20 07 b4     extsw   r0,r9
c00000000049bfe0:       2c 00 00 00     cmpwi   r0,0
c00000000049bfe4:       4d 81 00 20     bgtlr
c00000000049bfe8:       4b ff ff 40     b       c00000000049bf28 
<.__mutex_unlock_noinline>

That eieio should be an lwsync to avoid data corruption.  And I think the isync 
is superfluous.

Ditto the disassembler being wrong about the + vs -.
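
To put the unlock problem in C terms, here is a hypothetical sketch (not 
kernel code; the names are made up) of what can go wrong without a store 
barrier before the counter update:

	/* store inside the critical section */
	shared_data = new_value;
	/* an lwsync is needed here, so the data store is ordered
	   before ... */
	atomic_inc(&lock->count);	/* ... the store that frees the
					   lock; otherwise another CPU can
					   acquire the mutex and still see
					   the stale shared_data */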

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-04 23:45 ` Joel Schopp
@ 2006-01-05  2:38   ` Nicolas Pitre
  2006-01-05  2:51     ` Linus Torvalds
  2006-01-05 14:35   ` Ingo Molnar
  1 sibling, 1 reply; 34+ messages in thread
From: Nicolas Pitre @ 2006-01-05  2:38 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Ingo Molnar, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard

On Wed, 4 Jan 2006, Joel Schopp wrote:

> > this is version 14 of the generic mutex subsystem, against v2.6.15.
> > 
> > The patch-queue consists of 21 patches, which can also be downloaded from:
> > 
> >   http://redhat.com/~mingo/generic-mutex-subsystem/
> > 
> 
> Took a glance at this on ppc64.  Would it be useful if I contributed an arch
> specific version like arm has?  We'll either need an arch specific version or
> have the generic changed.

Don't change the generic version.  You should provide a ppc specific 
version if the generic ones don't look so good.


Nicolas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05  2:38   ` Nicolas Pitre
@ 2006-01-05  2:51     ` Linus Torvalds
  2006-01-05  3:21       ` Nick Piggin
  2006-01-05 14:40       ` Ingo Molnar
  0 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2006-01-05  2:51 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Joel Schopp, Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven,
	Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox,
	Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard



On Wed, 4 Jan 2006, Nicolas Pitre wrote:

> On Wed, 4 Jan 2006, Joel Schopp wrote:
> 
> > > this is version 14 of the generic mutex subsystem, against v2.6.15.
> > > 
> > > The patch-queue consists of 21 patches, which can also be downloaded from:
> > > 
> > >   http://redhat.com/~mingo/generic-mutex-subsystem/
> > > 
> > 
> > Took a glance at this on ppc64.  Would it be useful if I contributed an arch
> > specific version like arm has?  We'll either need an arch specific version or
> > have the generic changed.
> 
> Don't change the generic version.  You should provide a ppc specific 
> version if the generic ones don't look so good.

Well, if the generic one generates _buggy_ code on ppc64, that means that 
either the generic version is buggy, or one of the atomics that it uses is 
buggily implemented on ppc64.

And I think it's the generic mutex stuff that is buggy. It seems to assume 
memory barriers that aren't valid to assume.

A mutex is more than just updating the mutex count properly. You also have 
to have the proper memory barriers there to make sure that the things that 
the mutex _protects_ actually stay inside the mutex.

So while a ppc64-optimized mutex is probably a good idea per se, I think 
the generic mutex code had better be fixed first and regardless of any 
optimized version.

On x86/x86-64, the locked instructions automatically imply the proper 
memory barriers, but that was just lucky, I think.

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05  2:51     ` Linus Torvalds
@ 2006-01-05  3:21       ` Nick Piggin
  2006-01-05  3:39         ` Anton Blanchard
  2006-01-05 14:40       ` Ingo Molnar
  1 sibling, 1 reply; 34+ messages in thread
From: Nick Piggin @ 2006-01-05  3:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Joel Schopp, Ingo Molnar, lkml, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard

Linus Torvalds wrote:
> 
> On Wed, 4 Jan 2006, Nicolas Pitre wrote:
> 
> 
>>On Wed, 4 Jan 2006, Joel Schopp wrote:
>>
>>
>>>>this is version 14 of the generic mutex subsystem, against v2.6.15.
>>>>
>>>>The patch-queue consists of 21 patches, which can also be downloaded from:
>>>>
>>>>  http://redhat.com/~mingo/generic-mutex-subsystem/
>>>>
>>>
>>>Took a glance at this on ppc64.  Would it be useful if I contributed an arch
>>>specific version like arm has?  We'll either need an arch specific version or
>>>have the generic changed.
>>
>>Don't change the generic version.  You should provide a ppc specific 
>>version if the generic ones don't look so good.
> 
> 
> Well, if the generic one generates _buggy_ code on ppc64, that means that 
> either the generic version is buggy, or one of the atomics that it uses is 
> buggily implemented on ppc64.
> 
> And I think it's the generic mutex stuff that is buggy. It seems to assume 
> memory barriers that aren't valid to assume.
> 
> A mutex is more than just updating the mutex count properly. You also have 
> to have the proper memory barriers there to make sure that the things that 
> the mutex _protects_ actually stay inside the mutex.
> 
> So while a ppc64-optimized mutex is probably a good idea per se, I think 
> the generic mutex code had better be fixed first and regardless of any 
> optimized version.
> 
> On x86/x86-64, the locked instructions automatically imply the proper 
> memory barriers, but that was just lucky, I think.
> 

I think the generic code is correct according to Documentation/atomic_ops.txt
which basically defines any atomic_xxx operation which both modifies its
operand and returns something to have a full memory barrier before and after
its load/store operations.
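
i.e., paraphrasing that rule as a sketch:

	/* a value-returning atomic op behaves as if bracketed
	   by full barriers: */
	smp_mb();			/* implied */
	old = atomic_dec_return(&v);	/* the operation itself */
	smp_mb();			/* implied */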

Side note, why can't powerpc use lwsync for smp_wmb? The only problem seems to
be that it allows loads to be reordered before stores, but that's OK with
smp_wmb, right?

And why is smp_wmb() (ie. the non I/O barrier) doing eieio, while wmb() does
not? And rmb() does lwsync, which apparently does not order IO at all...

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05  3:21       ` Nick Piggin
@ 2006-01-05  3:39         ` Anton Blanchard
  2006-01-05 18:04           ` Jesse Barnes
  0 siblings, 1 reply; 34+ messages in thread
From: Anton Blanchard @ 2006-01-05  3:39 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Nicolas Pitre, Joel Schopp, Ingo Molnar, lkml,
	Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro,
	Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
	Andi Kleen, Russell King


Hi,

> Side note, why can't powerpc use lwsync for smp_wmb? The only problem seems 
> to be that it allows loads to be reordered before stores, but that's
> OK with smp_wmb, right?

lwsync implies more ordering than eieio and so may take longer. lwsync
orders everything except store - load, and eieio just orders store - store.

On power3 an lwsync is a full sync which takes forever, although in
newer chips both lwsync and eieio tend to take the same number of
cycles.

> And why is smp_wmb() (ie. the non I/O barrier) doing eieio, while wmb() does
> not? And rmb() does lwsync, which apparently does not order IO at all...

Because people love to abuse the barrier macros :) grep for wmb in
drivers/net and look for the number of places wmb() is being used to
order memory and IO. Architecturally eieio is a store - store ordering
for IO and memory but not between the two. sync is slow but does
guarantee this. 
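
A typical pattern looks like this (sketched from memory, not from any 
specific driver; the names are made up):

	/* descriptor lives in cacheable memory, doorbell in IO space */
	desc->status = DESC_READY;
	wmb();				/* driver expects this to order the
					   memory store before ...          */
	writel(RING_DOORBELL, ioaddr);	/* ... this IO-space store - which
					   eieio alone does not guarantee   */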

SGI's mmiowb() might be useful for some of these cases, but every time it's
brought up everyone ends up confused as to its real use.

Really we should have io_*mb() and smp_*mb(). At that stage we may even
be able to kill the base *mb() macros.

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-04 23:45 ` Joel Schopp
  2006-01-05  2:38   ` Nicolas Pitre
@ 2006-01-05 14:35   ` Ingo Molnar
  2006-01-05 16:42     ` Joel Schopp
  1 sibling, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2006-01-05 14:35 UTC (permalink / raw)
  To: Joel Schopp
  Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard


* Joel Schopp <jschopp@austin.ibm.com> wrote:

> Took a glance at this on ppc64.  Would it be useful if I contributed 
> an arch specific version like arm has?  We'll either need an arch 
> specific version or have the generic changed.

feel free to implement an assembly mutex fastpath, and it would 
certainly be welcome and useful - but i think you are wrong about SMP 
synchronization:

> Anyway, here is some disassembly of some of the code generated with my 
> comments:
> 
> c00000000049bf9c <.mutex_lock>:
> c00000000049bf9c:       7c 00 06 ac     eieio
> c00000000049bfa0:       7d 20 18 28     lwarx   r9,r0,r3
> c00000000049bfa4:       31 29 ff ff     addic   r9,r9,-1

> The eieio is completely unnecessary; it got picked up from 
> atomic_dec_return (Anton, why is there an eieio at the start of 
> atomic_dec_return in the first place?).

a mutex is like a spinlock: it must prevent loads and stores within the 
critical section from 'leaking outside the critical section' [they must 
not be reordered to before the mutex_lock(), nor to after the 
mutex_unlock()] - hence the barriers added by atomic_dec_return() are 
very much needed.
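
a minimal sketch of what the barriers prevent (hypothetical code, not from 
the patch queue):

	mutex_lock(&m);
	obj->state = NEW_STATE;	/* must not become visible before the
				   lock is taken, nor after it is
				   released - otherwise another CPU can
				   observe the update outside the
				   critical section */
	mutex_unlock(&m);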

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05  2:51     ` Linus Torvalds
  2006-01-05  3:21       ` Nick Piggin
@ 2006-01-05 14:40       ` Ingo Molnar
  2006-01-05 16:21         ` Linus Torvalds
  1 sibling, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2006-01-05 14:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard


* Linus Torvalds <torvalds@osdl.org> wrote:

> Well, if the generic one generates _buggy_ code on ppc64, that means 
> that either the generic version is buggy, or one of the atomics that 
> it uses is buggily implemented on ppc64.
> 
> And I think it's the generic mutex stuff that is buggy. It seems to 
> assume memory barriers that aren't valid to assume.

in those headers i'm only using atomic_dec_return(), atomic_cmpxchg() 
and atomic_xchg() - all of which imply a barrier. It is 
atomic_inc/add/sub/dec() that doesn't default to an implied SMP barrier.

but it's certainly not documented too well. If atomic_dec_return() is 
not supposed to imply a barrier, then all the affected architectures 
(sparc64, ppc64, mips, alpha, etc.) are overdoing synchronization 
currently: they all have barriers for these primitives. [They also have 
an implicit barrier for atomic_dec_and_test() - which is being relied on for 
correctness - no kernel code adds an smp_mb__before_atomic_dec() barrier 
around atomic_dec_and_test().]

the patch below adds the barriers to the asm-generic mutex routines, so 
it's not like i'm lazy ;), but i really think this is unnecessary.  
Adding this patch would add a second, unnecessary barrier for all the 
arches that have barrier-less atomic ops.

it also makes sense: the moment you are interested in the 'previous 
value' of the atomic counter in an atomic fashion, you very likely want 
to use it for a critical section. (e.g. all the put-the-resource ops 
that use atomic_dec_and_test() rely on this implicit barrier.)
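
for example, the usual put-the-resource pattern (a sketch; obj and 
free_obj() are made up):

	if (atomic_dec_and_test(&obj->refcount)) {
		/* the implied barrier guarantees that all prior stores
		   to *obj are visible before the object is freed and
		   possibly reused */
		free_obj(obj);
	}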

	Ingo

Index: linux/include/asm-generic/mutex-dec.h
===================================================================
--- linux.orig/include/asm-generic/mutex-dec.h
+++ linux/include/asm-generic/mutex-dec.h
@@ -19,6 +19,7 @@
  */
 #define __mutex_fastpath_lock(count, fail_fn)				\
 do {									\
+	smp_mb__before_atomic_dec();					\
 	if (unlikely(atomic_dec_return(count) < 0))			\
 		fail_fn(count);						\
 } while (0)
@@ -36,6 +37,7 @@ do {									\
 static inline int
 __mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
 {
+	smp_mb__before_atomic_dec();
 	if (unlikely(atomic_dec_return(count) < 0))
 		return fail_fn(count);
 	else
@@ -59,6 +61,8 @@ __mutex_fastpath_lock_retval(atomic_t *c
 do {									\
 	if (unlikely(atomic_inc_return(count) <= 0))			\
 		fail_fn(count);						\
+	else								\
+		smp_mb__after_atomic_inc();				\
 } while (0)
 
 #define __mutex_slowpath_needs_to_unlock()		1
@@ -92,6 +96,7 @@ __mutex_fastpath_trylock(atomic_t *count
 	 * the mutex state would be.
 	 */
 #ifdef __HAVE_ARCH_CMPXCHG
+	smp_mb__before_atomic_dec();
 	if (likely(atomic_cmpxchg(count, 1, 0)) == 1)
 		return 1;
 	return 0;
Index: linux/include/asm-generic/mutex-xchg.h
===================================================================
--- linux.orig/include/asm-generic/mutex-xchg.h
+++ linux/include/asm-generic/mutex-xchg.h
@@ -24,6 +24,7 @@
  */
 #define __mutex_fastpath_lock(count, fail_fn)				\
 do {									\
+	smp_mb__before_atomic_dec();					\
 	if (unlikely(atomic_xchg(count, 0) != 1))			\
 		fail_fn(count);						\
 } while (0)
@@ -42,6 +43,7 @@ do {									\
 static inline int
 __mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
 {
+	smp_mb__before_atomic_dec();
 	if (unlikely(atomic_xchg(count, 0) != 1))
 		return fail_fn(count);
 	else
@@ -64,6 +66,8 @@ __mutex_fastpath_lock_retval(atomic_t *c
 do {									\
 	if (unlikely(atomic_xchg(count, 1) != 0))			\
 		fail_fn(count);						\
+	else								\
+		smp_mb__after_atomic_inc();				\
 } while (0)
 
 #define __mutex_slowpath_needs_to_unlock()		0
@@ -86,7 +90,10 @@ do {									\
 static inline int
 __mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
 {
-	int prev = atomic_xchg(count, 0);
+	int prev;
+
+	smp_mb__before_atomic_dec();
+	prev = atomic_xchg(count, 0);
 
 	if (unlikely(prev < 0)) {
 		/*

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 14:40       ` Ingo Molnar
@ 2006-01-05 16:21         ` Linus Torvalds
  2006-01-05 22:03           ` Ingo Molnar
  2006-01-06  7:34           ` Denis Vlasenko
  0 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2006-01-05 16:21 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard



On Thu, 5 Jan 2006, Ingo Molnar wrote:
> 
> the patch below adds the barriers to the asm-generic mutex routines, so 
> it's not like i'm lazy ;), but i really think this is unnecessary.  
> Adding this patch would add a second, unnecessary barrier for all the 
> arches that have barrier-less atomic ops.
> 
> it also makes sense: the moment you are interested in the 'previous 
> value' of the atomic counter in an atomic fashion, you very likely want 
> to use it for a critical section. (e.g. all the put-the-resource ops 
> that use atomic_dec_and_test() rely on this implicit barrier.)

Ok, fair enough. However, that still leaves the question of which way the 
barrier works. Traditionally, we have only cared about one thing: that all 
preceding writes have finished, because the "atomic_dec_return" thing is 
used as a _reference_counter_, and we're going to release the thing.

However, that's not the case in a mutex. A mutex locking operation works 
exactly the other way around: it doesn't really care about the previous 
writes at all, since those operations were unlocked. It cares about the 
_subsequent_ writes, since those have to be seen by others as being in the 
critical region, and never be seen as happening before the lock.

So I _think_ your argument is bogus, and your patch is bogus. The use of 
"atomic_dec_return()" in a mutex is _not_ the same barrier as using it for 
reference counting. Not at all. Memory barriers aren't just one thing: 
they are semi-permeable things in two different directions and with two 
different operations: there are several different kinds of them.

>  #define __mutex_fastpath_lock(count, fail_fn)				\
>  do {									\
> +	smp_mb__before_atomic_dec();					\
>  	if (unlikely(atomic_dec_return(count) < 0))			\
>  		fail_fn(count);						\
>  } while (0)

So I think the barrier has to come _after_ the atomic decrement (or 
exchange). 

Because as it is written now, any writes in the locked region could 
percolate up to just before the atomic dec - ie _outside_ the region. 
Which is against the whole point of a lock - it would allow another CPU to 
see the write even before it sees that the lock was successful, as far as
I can tell.
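
In other words, the ordering would have to be something like this (sketch):

	atomic_dec_return(&m->count);	/* take the lock first          */
	smp_mb();			/* then fence: accesses in the
					   critical section stay below  */
	... critical section ...
	smp_mb();			/* fence first: accesses stay
					   above                        */
	atomic_inc_return(&m->count);	/* then release the lock        */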

But memory ordering is subtle, so maybe I'm missing something..

		Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 14:35   ` Ingo Molnar
@ 2006-01-05 16:42     ` Joel Schopp
  2006-01-05 22:21       ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Schopp @ 2006-01-05 16:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard


>>Anyway, here is some disassembly of some of the code generated with my 
>>comments:
>>
>>c00000000049bf9c <.mutex_lock>:
>>c00000000049bf9c:       7c 00 06 ac     eieio
>>c00000000049bfa0:       7d 20 18 28     lwarx   r9,r0,r3
>>c00000000049bfa4:       31 29 ff ff     addic   r9,r9,-1
> 
> 
>>The eieio is completely unnecessary; it got picked up from 
>>atomic_dec_return (Anton, why is there an eieio at the start of 
>>atomic_dec_return in the first place?).
> 
> 
> a mutex is like a spinlock, it must prevent loads and stores within the 
> critical section from 'leaking outside the critical section' [they must 
> not be reordered to before the mutex_lock(), nor to after the 
> mutex_unlock()] - hence the barriers added by atomic_dec_return() are 
> very much needed.

The bne- and isync together form a sufficient import barrier.  See PowerPC Book2 
Appendix B.2.1.1

And if the eieio were necessary it should come after, not before, twiddling 
the lock bits.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05  3:39         ` Anton Blanchard
@ 2006-01-05 18:04           ` Jesse Barnes
  0 siblings, 0 replies; 34+ messages in thread
From: Jesse Barnes @ 2006-01-05 18:04 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Nick Piggin, Linus Torvalds, Nicolas Pitre, Joel Schopp,
	Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen,
	Al Viro, Oleg Nesterov, David Howells, Alan Cox,
	Christoph Hellwig, Andi Kleen, Russell King

On Wednesday, January 4, 2006 7:39 pm, Anton Blanchard wrote:
> SGIs mmiowb() might be useful for some of these cases but every time
> its brought up everyone ends up confused as to its real use.

It's documented in Documentation/DocBook/deviceiobook.tmpl.  If the 
documentation isn't clear, we should fix it, rather than avoid using the 
primitive altogether.  If drivers/net really means mmiowb() in some 
places, we should change it, and like you said maybe get rid of some of 
these primitives so that their usage is clearer.

Jesse

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 16:21         ` Linus Torvalds
@ 2006-01-05 22:03           ` Ingo Molnar
  2006-01-05 22:17             ` Linus Torvalds
  2006-01-06  7:34           ` Denis Vlasenko
  1 sibling, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2006-01-05 22:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard


* Linus Torvalds <torvalds@osdl.org> wrote:

> On Thu, 5 Jan 2006, Ingo Molnar wrote:
> > 
> > the patch below adds the barriers to the asm-generic mutex routines, so 
> > it's not like i'm lazy ;), but i really think this is unnecessary.  
> > Adding this patch would add a second, unnecessary barrier for all the 
> > arches that have barrier-less atomic ops.
> > 
> > it also makes sense: the moment you are interested in the 'previous 
> > value' of the atomic counter in an atomic fashion, you very likely want 
> > to use it for a critical section. (e.g. all the put-the-resource ops 
> > that use atomic_dec_and_test() rely on this implicit barrier.)
> 
> Ok, fair enough. However, that still leaves the question of which way 
> the barrier works. Traditionally, we have only cared about one thing: 
> that all preceding writes have finished, because the 
> "atomic_dec_return" thing is used as a _reference_counter_, and we're 
> going to release the thing.

yeah, i think you are right. Here's a detailed analysis of why you are 
right about atomic_dec_return():

there are 8 types of instruction reordering that can happen at the 
beginning and at the end of a critical section. Firstly, here's the 
programmed order of instructions:

     pre-read
     pre-write
     [critical-START]
     critical-read
     critical-write
     [critical-END]
     post-read
     post-write

the following reorderings are possible:

 a pre-read   crossing forwards into (and over) the critical section: OK
 a pre-write  crossing forwards into (and over) the critical section: OK
 a critical-read  crossing backwards across critical-START: BAD
 a critical-write crossing backwards across critical-START: BAD
 a critical-read  crossing forwards  across critical-END:   BAD
 a critical-write crossing forwards  across critical-END:   BAD
 a post-read  crossing backwards into (and over) the critical section: OK
 a post-write crossing backwards into (and over) the critical section: OK

so critical-START needs to be a read and a write barrier for reads and 
writes happening after it. I.e. it's a memory barrier that only lets 
instruction reordering in a forward direction, not in a backwards 
direction.

critical-END needs to be a read and write barrier that only lets 
instructions into the critical section in a backwards direction - and 
it's a full memory barrier otherwise.

AFAICS, currently we don't have such a 'half-conductor / diode' memory 
barrier primitive: smp_mb() is a full barrier for both directions. But 
lets assume they existed, and lets call them smp_mb_forwards() and 
smp_mb_backwards().

furthermore, the locking and unlocking instruction must not cross into 
the critical section, so the lock sequence must be at least:

	lock
	smp_rmb_for_lock_forwards()
	smp_mb_backwards()
	... critical section ...
	smp_mb_forwards()
	smp_wmb_for_lock_backwards()
	unlock

and yet another requirement is that two subsequent critical sections for 
the same lock must not reorder the 'lock' and 'unlock' instructions:

	smp_mb_forwards()
	unlock
	... unlocked code ...
	lock
	smp_mb_backwards()

i.e. 'lock' and 'unlock' must not switch places. So the most relaxed 
list of requirements would be:

	smp_wmb_for_lock_backwards()
	lock
	smp_wmb_for_lock_forwards()
	smp_mb_backwards()
	... critical section ...
	smp_mb_forwards()
	smp_wmb_for_lock_backwards()
	unlock

i also think this is the 'absolute minimum memory ordering requirement' 
for critical sections: relaxing this any further is not possible without 
breaking critical sections.

i doubt many (in fact, any) CPUs are capable of expressing it in such a 
fine-grained way. With our existing primitives, probably the closest one 
would be:

	lock
	smp_mb();
	...
	smp_mb();
	unlock

as long as the CPU always executes the lock and unlock stores (which go 
to the same address) in program order. (is there any CPU that doesn't do 
that?)

in that sense, both atomic_dec_return() and atomic_inc_return() are 
indeed incorrect (for the use of mutexes) e.g. on ppc64. They are both 
done via:

	eieio
	... atomic-dec ...
	isync

eieio is stronger than smp_wmb() - it is a barrier for system memory 
and IO space memory writes. isync is a read barrier - it throws away all 
speculative register contents. So it is roughly equivalent to:

	smp_wmb();
	... atomic-dec ...
	smp_rmb();

this fulfills the requirement of the critical section not leaking out of 
the lock sequence itself, but (if i got the ppc64 semantics right) it 
doesn't prevent a write within the critical section from crossing out over 
the smp_rmb() and getting reordered with the atomic-dec - violating the 
critical section rules.

some other architectures are safe by accident, e.g. Alpha's 
atomic_dec_return() does:

	smp_mb();
	... atomic-dec ...
	smp_mb();

which is overkill, full read and write barrier on both sides.

Sparc64's atomic_dec_return() does yet another thing:

	membar StoreLoad | LoadLoad
	... atomic-load ...
	... atomic-conditional-store ...
	membar StoreLoad | StoreStore

AFAICS this violates the requirements: a load from within the critical 
section may go before the atomic-conditional-store, and may thus be 
executed before the critical section acquires the lock.

on MIPS, atomic_dec_return() does what is equivalent to:

	... atomic-dec ...
	smp_mb();

which is fine for a lock sequence, but atomic_inc_return() is not 
adequate for an unlock sequence:

	... atomic-inc ...
	smp_mb();

because this allows reads and writes within the critical section to 
reorder with the atomic-inc instructions.

to sum it up: atomic_dec/inc_return() alone is not enough to implement 
critical sections, on a number of architectures. atomic_xchg() seems to 
have similar problems too.

the patch below adds the smp_mb() barriers to the generic headers, which 
should now fulfill all the ordering requirements, on every architecture.  
It only relies on one property of the atomic primitives: that they won't 
get reordered with respect to themselves, so an atomic_inc_return() and an 
atomic_dec_return() cannot switch places.

Can you see any hole in this reasoning?

	Ingo

Index: linux/include/asm-generic/mutex-dec.h
===================================================================
--- linux.orig/include/asm-generic/mutex-dec.h
+++ linux/include/asm-generic/mutex-dec.h
@@ -21,6 +21,8 @@
 do {									\
 	if (unlikely(atomic_dec_return(count) < 0))			\
 		fail_fn(count);						\
+	else								\
+		smp_mb();						\
 } while (0)
 
 /**
@@ -38,8 +40,10 @@ __mutex_fastpath_lock_retval(atomic_t *c
 {
 	if (unlikely(atomic_dec_return(count) < 0))
 		return fail_fn(count);
-	else
+	else {
+		smp_mb();
 		return 0;
+	}
 }
 
 /**
@@ -57,6 +61,7 @@ __mutex_fastpath_lock_retval(atomic_t *c
  */
 #define __mutex_fastpath_unlock(count, fail_fn)				\
 do {									\
+	smp_mb();							\
 	if (unlikely(atomic_inc_return(count) <= 0))			\
 		fail_fn(count);						\
 } while (0)
@@ -92,8 +97,10 @@ __mutex_fastpath_trylock(atomic_t *count
 	 * the mutex state would be.
 	 */
 #ifdef __HAVE_ARCH_CMPXCHG
-	if (likely(atomic_cmpxchg(count, 1, 0)) == 1)
+	if (likely(atomic_cmpxchg(count, 1, 0)) == 1) {
+		smp_mb();
 		return 1;
+	}
 	return 0;
 #else
 	return fail_fn(count);
Index: linux/include/asm-generic/mutex-xchg.h
===================================================================
--- linux.orig/include/asm-generic/mutex-xchg.h
+++ linux/include/asm-generic/mutex-xchg.h
@@ -26,6 +26,8 @@
 do {									\
 	if (unlikely(atomic_xchg(count, 0) != 1))			\
 		fail_fn(count);						\
+	else								\
+		smp_mb();						\
 } while (0)
 
 
@@ -44,8 +46,10 @@ __mutex_fastpath_lock_retval(atomic_t *c
 {
 	if (unlikely(atomic_xchg(count, 0) != 1))
 		return fail_fn(count);
-	else
+	else {
+		smp_mb();
 		return 0;
+	}
 }
 
 /**
@@ -62,6 +66,7 @@ __mutex_fastpath_lock_retval(atomic_t *c
  */
 #define __mutex_fastpath_unlock(count, fail_fn)				\
 do {									\
+	smp_mb();							\
 	if (unlikely(atomic_xchg(count, 1) != 0))			\
 		fail_fn(count);						\
 } while (0)
@@ -104,6 +109,7 @@ __mutex_fastpath_trylock(atomic_t *count
 		if (prev < 0)
 			prev = 0;
 	}
+	smp_mb();
 
 	return prev;
 }

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 22:03           ` Ingo Molnar
@ 2006-01-05 22:17             ` Linus Torvalds
  2006-01-05 22:43               ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2006-01-05 22:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard



On Thu, 5 Jan 2006, Ingo Molnar wrote:
>
> [ long details removed ]
> 
> to sum it up: atomic_dec/inc_return() alone is not enough to implement 
> critical sections, on a number of architectures. atomic_xchg() seems to 
> have similar problems too.

Yes.

> the patch below adds the smp_mb() barriers to the generic headers, which 
> should now fulfill all the ordering requirements, on every architecture.  
> It only relies on one property of the atomic primitives: that they won't 
> get reordered with respect to themselves, so an atomic_inc_return() and an 
> atomic_dec_return() cannot switch places.
> 
> Can you see any hole in this reasoning?

No. The alternative is to just make the ordering requirements 
for "atomic_dec_return()" and "atomic_xchg()" be absolute. Say that they 
have to be full memory barriers, and push the problem into the low-level
architecture.

I _think_ your patch is the right approach, because most architectures are 
likely to do their own fast-paths for mutexes, and as such the generic 
ones are more of a template for how to do it, and hopefully aren't that 
performance critical.

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 16:42     ` Joel Schopp
@ 2006-01-05 22:21       ` Ingo Molnar
  2006-01-05 23:06         ` Joel Schopp
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2006-01-05 22:21 UTC (permalink / raw)
  To: Joel Schopp
  Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard


* Joel Schopp <jschopp@austin.ibm.com> wrote:

> The bne- and isync together form a sufficient import barrier.  See 
> PowerPC Book2 Appendix B.2.1.1

ok. Please correct me if i'm wrong: the question is, could we on ppc64 
use atomic_dec_return() for mutex_lock(), and atomic_inc_return() for 
mutex_unlock()?

atomic_dec_return() does:

        EIEIO_ON_SMP
"1:     lwarx   %0,0,%1         # atomic_dec_return\n\
        addic   %0,%0,-1\n"
        PPC405_ERR77(0,%1)
"       stwcx.  %0,0,%1\n\
        bne-    1b"
        ISYNC_ON_SMP

the EIEIO_ON_SMP is in essence smp_wmb(), correct? (it's a bit stronger 
because it also orders IO-space writes, but it doesn't impose any 
restrictions on reads)

ISYNC_ON_SMP flushes all speculative reads currently in the queue - and 
is hence a smp_rmb_backwards() primitive [per my previous mail] - but 
does not affect writes - correct?

if that's the case, what prevents a store from within the critical 
section going up to right after the EIEIO_ON_SMP, but before the 
atomic-dec instructions? Does any of those instructions imply some 
barrier perhaps? Are writes always ordered perhaps (like on x86 CPUs), 
and hence the store before the bne is an effective write-barrier?

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 22:17             ` Linus Torvalds
@ 2006-01-05 22:43               ` Ingo Molnar
  2006-01-06  3:49                 ` Keith Owens
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2006-01-05 22:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard


* Linus Torvalds <torvalds@osdl.org> wrote:

> I _think_ your patch is the right approach, because most architectures 
> are likely to do their own fast-paths for mutexes, and as such the 
> generic ones are more of a template for how to do it, and hopefully 
> aren't that performance critical.

yeah, i think so too. We've got 3 architectures done in assembly so far, 
and it seems people like optimizing such code. Also, since the generic 
code does all the boring slowpath stuff, the architecture can 
concentrate on the fun part alone: to make the fastpath really fast.  
The generic code is still in full control of all the mutex semantics, 
and can ignore asm/mutex.h when it wants/needs to. So i'm quite happy 
with the current design and i'm not against more per-arch assembly fun, 
at all.

there's one exception i think: atomic-xchg.h was pretty optimal on ARM, 
and i'd expect it to be pretty optimal on the other atomic-swap 
platforms too. So maybe we should specify atomic_xchg() to _not_ imply a 
full barrier - it's a new API anyway. We cannot embed the barrier 
within atomic_xchg(), because the barrier is needed at different ends 
for lock and for unlock, and adding two barriers would be unnecessary.

asm-generic/mutex-dec.h is less optimal (and thus less critical), and i 
can see no easy way to modify it, because i think it would be quite 
confusing to enforce 'lock' ordering for atomic_dec_return(), and 
'unlock' ordering for atomic_inc_return(). We cannot remove the existing 
barriers either (and add them explicitly), because there are existing 
users of these primitives. (although we could add explicit barriers to 
those too - but probably not worth the trouble)
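
for reference, a sketch of what the xchg-based fastpath would look like 
with the barriers made explicit, assuming a barrier-less atomic_xchg():

	#define __mutex_fastpath_lock(count, fail_fn)			\
	do {								\
		if (unlikely(atomic_xchg(count, 0) != 1))		\
			fail_fn(count);					\
		else							\
			smp_mb();	/* lock: barrier after the op */\
	} while (0)

	#define __mutex_fastpath_unlock(count, fail_fn)			\
	do {								\
		smp_mb();	/* unlock: barrier before the op */	\
		if (unlikely(atomic_xchg(count, 1) != 0))		\
			fail_fn(count);					\
	} while (0)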

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 22:21       ` Ingo Molnar
@ 2006-01-05 23:06         ` Joel Schopp
  2006-01-05 23:26           ` Linus Torvalds
  2006-01-06  0:29           ` Olof Johansson
  0 siblings, 2 replies; 34+ messages in thread
From: Joel Schopp @ 2006-01-05 23:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard, PPC64-dev

[-- Attachment #1: Type: text/plain, Size: 1211 bytes --]

> ISYNC_ON_SMP flushes all speculative reads currently in the queue - and 
> is hence a smp_rmb_backwards() primitive [per my previous mail] - but 
> does not affect writes - correct?
> 
> if that's the case, what prevents a store from within the critical 
> section going up to right after the EIEIO_ON_SMP, but before the 
> atomic-dec instructions? Does any of those instructions imply some 
> barrier perhaps? Are writes always ordered perhaps (like on x86 CPUs), 
> and hence the store before the bne is an effective write-barrier?

It really makes more sense after reading PowerPC Book II, which you can find 
at this link; it was written by people who explain this for a living: 
http://www-128.ibm.com/developerworks/eserver/articles/archguide.html

While isync technically doesn't order stores, it does order instructions.  The 
previous bne- must complete, and that bne- is dependent on the previous stwcx. 
being complete.  So no stores can slip upwards past it.  For a better 
explanation you will have to read the document yourself.

Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC. 
It is completely untested, but compiles.

Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>




[-- Attachment #2: powerpcmutex.patch --]
[-- Type: text/plain, Size: 2812 bytes --]

Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h
===================================================================
--- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h	2006-01-04 14:46:31.%N -0600
+++ 2.6.15-mutex14/include/asm-powerpc/mutex.h	2006-01-05 16:25:41.%N -0600
@@ -1,9 +1,83 @@
 /*
- * Pull in the generic implementation for the mutex fastpath.
+ * include/asm-powerpc/mutex.h
  *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
+ * PowerPC optimized mutex locking primitives
+ *
+ * Please look into asm-generic/mutex-xchg.h for a formal definition.
+ * Copyright (C) 2006 Joel Schopp <jschopp@austin.ibm.com>, IBM
  */
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H
+#define __mutex_fastpath_lock(count, fail_fn)\
+do{                                     \
+	long tmp;                       \
+	__asm__ __volatile__(		\
+"1:	lwarx		%0,0,%1\n"      \
+"	addic	        %0,%0,-1\n"     \
+"	stwcx.          %0,0,%1\n"      \
+"	bne-            1b\n"           \
+"	isync           \n"             \
+	: "=&r" (tmp)                   \
+	: "r" (&(count)->counter)       \
+	: "cr0", "memory");             \
+	if (unlikely(tmp < 0))          \
+		fail_fn(count);         \
+} while (0)                              
+
+#define __mutex_fastpath_unlock(count, fail_fn)\
+do{                                         \
+	long tmp;                           \
+	__asm__ __volatile__(SYNC_ON_SMP    \
+"1:	lwarx		%0,0,%1\n"          \
+"	addic	        %0,%0,1\n"          \
+"	stwcx.          %0,0,%1\n"          \
+"       bne-            1b\n"               \
+	: "=&r" (tmp)                       \
+	: "r" (&(count)->counter)           \
+	: "cr0", "memory");                 \
+	if (unlikely(tmp <= 0))             \
+		fail_fn(count);             \
+} while (0)
+
+
+static inline int 
+__mutex_fastpath_trylock(atomic_t* count, int (*fail_fn)(atomic_t*))
+{
+	long tmp;
+	__asm__ __volatile__(
+"1:     lwarx		%0,0,%1\n"
+"       cmpwi           0,%0,1\n"
+"       bne-            2f\n"
+"       stwcx.          %0,0,%1\n"
+"	bne-		1b\n"
+"	isync\n"
+"2:"
+	: "=&r" (tmp)
+	: "r" (&(count)->counter)
+	: "cr0", "memory");
+
+	return (int)tmp;
+
+}
+
+#define __mutex_slowpath_needs_to_unlock()		1
 
-#include <asm-generic/mutex-dec.h>
+static inline int 
+__mutex_fastpath_lock_retval(atomic_t* count, int (*fail_fn)(atomic_t *))
+{
+	long tmp;
+	__asm__ __volatile__(
+"1:	lwarx		%0,0,%1\n"
+"	addic	        %0,%0,-1\n"
+"	stwcx.          %0,0,%1\n"
+"	bne-            1b\n"
+"	isync           \n"
+	: "=&r" (tmp)
+	: "r" (&(count)->counter)
+	: "cr0", "memory");
+	if (unlikely(tmp < 0))
+		return fail_fn(count);
+	else
+		return 0;
+}
+#endif

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 23:06         ` Joel Schopp
@ 2006-01-05 23:26           ` Linus Torvalds
  2006-01-05 23:36             ` Joel Schopp
  2006-01-06  0:29           ` Olof Johansson
  1 sibling, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2006-01-05 23:26 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven,
	Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard, PPC64-dev



On Thu, 5 Jan 2006, Joel Schopp wrote:
> 
> Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC.
> It is completely untested, but compiles.

Shouldn't you make that "isync" dependent on SMP too? UP doesn't need it, 
since DMA will never matter, and interrupts are precise.

		Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 23:26           ` Linus Torvalds
@ 2006-01-05 23:36             ` Joel Schopp
  2006-01-05 23:42               ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Schopp @ 2006-01-05 23:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven,
	Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard, PPC64-dev

>>Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC.
>>It is completely untested, but compiles.
> 
> 
> Shouldn't you make that "isync" dependent on SMP too? UP doesn't need it, 
> since DMA will never matter, and interrupts are precise.
> 
> 		Linus
> 

I think the isync is necessary to keep heavily out-of-order processors from 
getting ahead of themselves even on UP.  Scanning back through the powerpc 
spinlock code, they seem to take the same view there as well.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 23:36             ` Joel Schopp
@ 2006-01-05 23:42               ` Ingo Molnar
  0 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2006-01-05 23:42 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Linus Torvalds, lkml, Andrew Morton, Arjan van de Ven,
	Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard, PPC64-dev


* Joel Schopp <jschopp@austin.ibm.com> wrote:

> > Shouldn't you make that "isync" dependent on SMP too? UP doesn't 
> > need it, since DMA will never matter, and interrupts are precise.
> 
> I think the isync is necessary to keep heavily out of order processors 
> from getting ahead of themselves even on UP.  Scanning back through 
> the powerpc spinlock code they seem to take the same view there as 
> well.

the asm/spinlock.h ops are only built on SMP kernels. mutex.h is for 
both UP and SMP. On UP you should need no synchronization, because the 
only way another context could interfere with your critical section is 
by getting interrupted, and interrupts are fully synchronizing, right?  
On UP the only synchronization needed is when a device reads/writes 
memory in parallel to the CPU.
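
that is what the *_ON_SMP asm macros are for - roughly (from memory; the 
exact definitions in the powerpc headers may differ):

	#ifdef CONFIG_SMP
	#define ISYNC_ON_SMP	"\n\tisync"
	#define SYNC_ON_SMP	"\n\tsync\n"
	#else
	#define ISYNC_ON_SMP
	#define SYNC_ON_SMP
	#endif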

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 23:06         ` Joel Schopp
  2006-01-05 23:26           ` Linus Torvalds
@ 2006-01-06  0:29           ` Olof Johansson
  2006-01-07 17:49             ` PowerPC fastpaths for mutex subsystem Joel Schopp
  1 sibling, 1 reply; 34+ messages in thread
From: Olof Johansson @ 2006-01-06  0:29 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Ingo Molnar, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
	Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
	Andi Kleen, Russell King, Anton Blanchard, PPC64-dev

On Thu, Jan 05, 2006 at 05:06:26PM -0600, Joel Schopp wrote:

> Here is a first pass at a powerpc file for the fast paths just as an 
> FYI/RFC. It is completely untested, but compiles.

You really should test it; it saves reviewers time. It's not that hard
to at least try booting it.

Besides the isync comments earlier, there are a bunch of whitespace issues
going on. Did you copy and paste the code from somewhere? If so, you
should move the original copyright over too.

All your macros use spaces instead of tabs up to the \, should be
changed.

All tmp variables should be ints, since the atomic_t counter is a 32-bit
variable. If you use longs with lwarx (which loads 32 bits without sign
extension), the comparison with < 0 will never be true.
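
Concretely, a sketch of the failure mode:

	/* counter is already -1 (contended); lwarx zero-extends it
	   into the 64-bit register, so after the addic: */
	long tmp = 0xfffffffeUL;	/* positive as a 64-bit long  */
	if (tmp < 0)			/* never true                 */
		fail_fn(count);		/* contention goes undetected */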

> Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h
> ===================================================================
> --- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h	2006-01-04 14:46:31.%N -0600
> +++ 2.6.15-mutex14/include/asm-powerpc/mutex.h	2006-01-05 16:25:41.%N -0600
> @@ -1,9 +1,83 @@
>  /*
> - * Pull in the generic implementation for the mutex fastpath.
> + * include/asm-powerpc/mutex.h

No need to keep filenames in files.

>   *
> - * TODO: implement optimized primitives instead, or leave the generic
> - * implementation in place, or pick the atomic_xchg() based generic
> - * implementation. (see asm-generic/mutex-xchg.h for details)
> + * PowerPC optimized mutex locking primitives
> + *
> + * Please look into asm-generic/mutex-xchg.h for a formal definition.
> + * Copyright (C) 2006 Joel Schopp <jschopp@austin.ibm.com>, IBM
>   */
> +#ifndef _ASM_MUTEX_H
> +#define _ASM_MUTEX_H
> +#define __mutex_fastpath_lock(count, fail_fn)\
> +do{                                     \
> +	long tmp;                       \
> +	__asm__ __volatile__(		\
> +"1:	lwarx		%0,0,%1\n"      \
> +"	addic	        %0,%0,-1\n"     \
> +"	stwcx.          %0,0,%1\n"      \
> +"	bne-            1b\n"           \
> +"	isync           \n"             \
> +	: "=&r" (tmp)                   \
> +	: "r" (&(count)->counter)       \
> +	: "cr0", "memory");             \
> +	if (unlikely(tmp < 0))          \
> +		fail_fn(count);         \
> +} while (0)                              

trailing whitespace

> +
> +#define __mutex_fastpath_unlock(count, fail_fn)\
> +do{                                         \
> +	long tmp;                           \
> +	__asm__ __volatile__(SYNC_ON_SMP    \
> +"1:	lwarx		%0,0,%1\n"          \
> +"	addic	        %0,%0,1\n"          \
> +"	stwcx.          %0,0,%1\n"          \
> +"       bne-            1b\n"               \

space vs tab

> +	: "=&r" (tmp)                       \
> +	: "r" (&(count)->counter)           \
> +	: "cr0", "memory");                 \
> +	if (unlikely(tmp <= 0))             \
> +		fail_fn(count);             \
> +} while (0)
> +
> +
> +static inline int 

trailing whitespace

> +__mutex_fastpath_trylock(atomic_t* count, int (*fail_fn)(atomic_t*))

atomic_t *count

> +{
> +	long tmp;
> +	__asm__ __volatile__(
> +"1:     lwarx		%0,0,%1\n"
> +"       cmpwi           0,%0,1\n"
> +"       bne-            2f\n"
> +"       stwcx.          %0,0,%1\n"

space vs tab on the above 4 lines

Shouldn't you decrement the counter before the store?

> +"	bne-		1b\n"
> +"	isync\n"
> +"2:"
> +	: "=&r" (tmp)
> +	: "r" (&(count)->counter)
> +	: "cr0", "memory");
> +
> +	return (int)tmp;
> +
> +}
> +
> +#define __mutex_slowpath_needs_to_unlock()		1
>  
> -#include <asm-generic/mutex-dec.h>
> +static inline int 

trailing whitespace

> +__mutex_fastpath_lock_retval(atomic_t* count, int (*fail_fn)(atomic_t *))

atomic_t *count

> +{
> +	long tmp;

counter is a 32-bit variable, and so should tmp be; otherwise the < 0
comparison can never be true.

> +	__asm__ __volatile__(
> +"1:	lwarx		%0,0,%1\n"
> +"	addic	        %0,%0,-1\n"
> +"	stwcx.          %0,0,%1\n"
> +"	bne-            1b\n"
> +"	isync           \n"
> +	: "=&r" (tmp)
> +	: "r" (&(count)->counter)
> +	: "cr0", "memory");
> +	if (unlikely(tmp < 0))
> +		return fail_fn(count);
> +	else
> +		return 0;
> +}
> +#endif


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 22:43               ` Ingo Molnar
@ 2006-01-06  3:49                 ` Keith Owens
  0 siblings, 0 replies; 34+ messages in thread
From: Keith Owens @ 2006-01-06  3:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Nicolas Pitre, Joel Schopp, lkml, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard

Ingo Molnar (on Thu, 5 Jan 2006 23:43:57 +0100) wrote:
>there's one exception i think: atomic-xchg.h was pretty optimal on ARM, 
>and i'd expect it to be pretty optimal on the other atomic-swap 
>platforms too. So maybe we should specify atomic_xchg() to _not_ imply a 
>full barrier - it's a new API anyway. We cannot embed the barrier 
>within atomic_xchg(), because the barrier is needed at different ends 
>for lock and for unlock, and adding two barriers would be unnecessary.

IA64 defines two qualifiers for cmpxchg, specifically for
distinguishing between acquiring and releasing the lock.

  cmpxchg<sz>.<sem>

<sz> is the data size, 1, 2, 4 or 8.  <sem> is one of 'acq' or 'rel'.

  sem       Ordering    Semaphore Operation
Completer   Semantics
  acq        Acquire    The memory read/write is made visible prior to
                        all subsequent data memory accesses.
  rel        Release    The memory read/write is made visible after all
                        previous data memory accesses.

cmpxchg4.acq prevents following data accesses from migrating before
taking the lock (critical R/W cannot precede critical-START).
cmpxchg4.rel prevents preceding data accesses from migrating after
releasing the lock (critical R/W cannot follow critical-END).  I
suggest adding acq and rel hints to atomic_xchg, and let architectures
that implement suitable operations use those hints.
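
For example (hypothetical names, purely to sketch the suggestion):

	/* hypothetical acquire/release-qualified variants */
	int atomic_xchg_acq(atomic_t *v, int new);  /* cmpxchg4.acq underneath */
	int atomic_xchg_rel(atomic_t *v, int new);  /* cmpxchg4.rel underneath */

	#define __mutex_fastpath_lock(count, fail_fn)		\
	do {							\
		if (unlikely(atomic_xchg_acq(count, 0) != 1))	\
			fail_fn(count);				\
	} while (0)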


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 00/21] mutex subsystem, -V14
  2006-01-05 16:21         ` Linus Torvalds
  2006-01-05 22:03           ` Ingo Molnar
@ 2006-01-06  7:34           ` Denis Vlasenko
  1 sibling, 0 replies; 34+ messages in thread
From: Denis Vlasenko @ 2006-01-06  7:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Nicolas Pitre, Joel Schopp, lkml, Andrew Morton,
	Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov,
	David Howells, Alan Cox, Christoph Hellwig, Andi Kleen,
	Russell King, Anton Blanchard

On Thursday 05 January 2006 18:21, Linus Torvalds wrote:
> On Thu, 5 Jan 2006, Ingo Molnar wrote:
> > 
> > the patch below adds the barriers to the asm-generic mutex routines, so 
> > it's not like i'm lazy ;), but i really think this is unnecessary.  
> > Adding this patch would add a second, unnecessary barrier for all the 
> > arches that have barrier-less atomic ops.
> > 
> > it also makes sense: the moment you are interested in the 'previous 
> > value' of the atomic counter in an atomic fashion, you very likely want 
> > to use it for a critical section. (e.g. all the put-the-resource ops 
> > that use atomic_dec_and_test() rely on this implicit barrier.)
>
> So I _think_ your argument is bogus, and your patch is bogus. The use of 
> "atomic_dec_return()" in a mutex is _not_ the same barrier as using it for 
> reference counting. Not at all. Memory barriers aren't just one thing: 
> they are semi-permeable things in two different directions and with two 
> different operations: there are several different kinds of them.
> 
> >  #define __mutex_fastpath_lock(count, fail_fn)				\
> >  do {									\
> > +	smp_mb__before_atomic_dec();					\
> >  	if (unlikely(atomic_dec_return(count) < 0))			\
> >  		fail_fn(count);						\
> >  } while (0)
> 
> So I think the barrier has to come _after_ the atomic decrement (or 
> exchange). 
> 
> Because as it is written now, any writes in the locked region could 
> percolate up to just before the atomic dec - ie _outside_ the region. 
> Which is against the whole point of a lock - it would allow another CPU to 
> see the write even before it sees that the lock was successful, as far as
> I can tell.
> 
> But memory ordering is subtle, so maybe I'm missing something..

We mere humans^W device driver people get more confused with barriers
every day, as CPUs get more subtle in their out-of-order-ness.

I think adding longer-named-but-self-explanatory aliases for memory and io
barrier functions can help.

mmiowb => barrier_memw_iow
.... => barrier_memw_memw (a store-store barrier to mem)
....

General template for the name may be something like 

	[smp]barrier_{mem,io,memio}{r,w,rw}_{mem,io,memio}{r,w,rw}

Are there even more subtle cases?
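
For example (sketch):

	/* self-explanatory aliases layered over the existing macros */
	#define barrier_memw_iow()	mmiowb()	/* mem store vs IO store  */
	#define barrier_memw_memw()	wmb()		/* mem store vs mem store */
	#define smp_barrier_memw_memw()	smp_wmb()	/* SMP-only variant       */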
--
vda

^ permalink raw reply	[flat|nested] 34+ messages in thread

* PowerPC fastpaths for mutex subsystem
  2006-01-06  0:29           ` Olof Johansson
@ 2006-01-07 17:49             ` Joel Schopp
  2006-01-07 22:37               ` Andrew Morton
                                 ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Joel Schopp @ 2006-01-07 17:49 UTC (permalink / raw)
  To: Olof Johansson
  Cc: Ingo Molnar, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
	Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
	Andi Kleen, Russell King, Anton Blanchard, PPC64-dev

[-- Attachment #1: Type: text/plain, Size: 683 bytes --]

This is the second pass at optimizing the fastpath for the new mutex subsystem 
on PowerPC.  I think it is ready to be included in the series with the other 
mutex patches now.  Tested on a 4 core (2 SMT threads/core) Power5 machine with 
gcc 3.3.2.

Test results from synchro-test.ko:

All tests run for default 5 seconds
Threads semaphores  mutexes     mutexes+attached
1       63,465,364  58,404,630  62,109,571
4       58,424,282  35,541,297  37,820,794
8       40,731,668  35,541,297  40,281,768
16      38,372,769  37,256,298  41,751,764
32      38,406,895  36,933,675  38,731,571
64      37,232,017  36,222,480  40,766,379

Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>

[-- Attachment #2: powerpcmutex.patch --]
[-- Type: text/plain, Size: 2347 bytes --]

Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h
===================================================================
--- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h	2006-01-04 14:46:31.%N -0600
+++ 2.6.15-mutex14/include/asm-powerpc/mutex.h	2006-01-06 17:36:09.%N -0600
@@ -1,9 +1,84 @@
 /*
- * Pull in the generic implementation for the mutex fastpath.
+ * include/asm-powerpc/mutex.h
  *
- * TODO: implement optimized primitives instead, or leave the generic
- * implementation in place, or pick the atomic_xchg() based generic
- * implementation. (see asm-generic/mutex-xchg.h for details)
+ * PowerPC optimized mutex locking primitives
+ *
+ * Please look into asm-generic/mutex-xchg.h for a formal definition.
+ * Copyright (C) 2006 Joel Schopp <jschopp@austin.ibm.com>, IBM
  */
+#ifndef _ASM_MUTEX_H
+#define _ASM_MUTEX_H
+#define __mutex_fastpath_lock(count, fail_fn)\
+do {					\
+	int tmp;			\
+	__asm__ __volatile__(		\
+"1:	lwarx		%0,0,%1\n"	\
+"	addic		%0,%0,-1\n"	\
+"	stwcx.		%0,0,%1\n"	\
+"	bne-		1b\n"		\
+	ISYNC_ON_SMP			\
+	: "=&r" (tmp)			\
+	: "r" (&(count)->counter)	\
+	: "cr0", "memory");		\
+	if (unlikely(tmp < 0))		\
+		fail_fn(count);		\
+} while (0)
+
+#define __mutex_fastpath_unlock(count, fail_fn)\
+do {					\
+	int tmp;			\
+	__asm__ __volatile__(SYNC_ON_SMP\
+"1:	lwarx		%0,0,%1\n"	\
+"	addic		%0,%0,1\n"	\
+"	stwcx.		%0,0,%1\n"	\
+"	bne-		1b\n"		\
+	: "=&r" (tmp)			\
+	: "r" (&(count)->counter)	\
+	: "cr0", "memory");		\
+	if (unlikely(tmp <= 0))		\
+		fail_fn(count);		\
+} while (0)
+
+
+static inline int
+__mutex_fastpath_trylock(atomic_t* count, int (*fail_fn)(atomic_t*))
+{
+	int tmp;
+	__asm__ __volatile__(
+"1:	lwarx		%0,0,%1\n"
+"	cmpwi		0,%0,1\n"
+"	bne-		2f\n"
+"	addic		%0,%0,-1\n"
+"	stwcx.		%0,0,%1\n"
+"	bne-		1b\n"
+"	isync\n"
+"2:"
+	: "=&r" (tmp)
+	: "r" (&(count)->counter)
+	: "cr0", "memory");
+
+	return (int)tmp;
+
+}
+
+#define __mutex_slowpath_needs_to_unlock()		1
 
-#include <asm-generic/mutex-dec.h>
+static inline int
+__mutex_fastpath_lock_retval(atomic_t* count, int (*fail_fn)(atomic_t *))
+{
+	int tmp;
+	__asm__ __volatile__(
+"1:	lwarx		%0,0,%1\n"
+"	addic		%0,%0,-1\n"
+"	stwcx.		%0,0,%1\n"
+"	bne-		1b\n"
+"	isync		\n"
+	: "=&r" (tmp)
+	: "r" (&(count)->counter)
+	: "cr0", "memory");
+	if (unlikely(tmp < 0))
+		return fail_fn(count);
+	else
+		return 0;
+}
+#endif

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-07 17:49             ` PowerPC fastpaths for mutex subsystem Joel Schopp
@ 2006-01-07 22:37               ` Andrew Morton
  2006-01-08  7:43                 ` Anton Blanchard
  2006-01-08  9:48               ` Ingo Molnar
  2006-01-08 10:43               ` Ingo Molnar
  2 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2006-01-07 22:37 UTC (permalink / raw)
  To: Joel Schopp
  Cc: olof, mingo, linux-kernel, torvalds, arjan, nico, jes, viro,
	oleg, dhowells, alan, hch, ak, rmk+lkml, anton, linuxppc64-dev

Joel Schopp <jschopp@austin.ibm.com> wrote:
>
> This is the second pass at optimizing the fastpath for the new mutex subsystem 
>  on PowerPC.  I think it is ready to be included in the series with the other 
>  mutex patches now.  Tested on a 4 core (2 SMT threads/core) Power5 machine with 
>  gcc 3.3.2.
> 
>  Test results from synchro-test.ko:
> 
>  All tests run for default 5 seconds
>  Threads semaphores  mutexes     mutexes+attached
>  1       63,465,364  58,404,630  62,109,571
>  4       58,424,282  35,541,297  37,820,794
>  8       40,731,668  35,541,297  40,281,768
>  16      38,372,769  37,256,298  41,751,764
>  32      38,406,895  36,933,675  38,731,571
>  64      37,232,017  36,222,480  40,766,379
> 

Doesn't this mean that the sped-up mutexes are still slower than semaphores?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-07 22:37               ` Andrew Morton
@ 2006-01-08  7:43                 ` Anton Blanchard
  2006-01-08  8:00                   ` Andrew Morton
  2006-01-09 11:13                   ` David Howells
  0 siblings, 2 replies; 34+ messages in thread
From: Anton Blanchard @ 2006-01-08  7:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joel Schopp, jes, rmk+lkml, hch, linux-kernel, ak, torvalds,
	viro, linuxppc64-dev, mingo, nico, oleg, alan, arjan

 
> Doesn't this mean that the sped-up mutexes are still slower than semaphores?

Wasn't most of the x86 mutex gain a result of going from fair to unfair
operation? The current ppc64 semaphores are unfair.

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-08  7:43                 ` Anton Blanchard
@ 2006-01-08  8:00                   ` Andrew Morton
  2006-01-08  8:23                     ` Anton Blanchard
  2006-01-09 11:13                   ` David Howells
  1 sibling, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2006-01-08  8:00 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: jschopp, jes, rmk+lkml, hch, linux-kernel, ak, torvalds, viro,
	linuxppc64-dev, mingo, nico, oleg, alan, arjan

Anton Blanchard <anton@samba.org> wrote:
>
>  
> > Doesn't this mean that the sped-up mutexes are still slower than semaphores?
> 
> Wasn't most of the x86 mutex gain a result of going from fair to unfair
> operation? The current ppc64 semaphores are unfair.
> 

What's "unfair"?  Mutexes are FIFO, as are x86 semaphores.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-08  8:00                   ` Andrew Morton
@ 2006-01-08  8:23                     ` Anton Blanchard
  0 siblings, 0 replies; 34+ messages in thread
From: Anton Blanchard @ 2006-01-08  8:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: jes, rmk+lkml, ak, linux-kernel, hch, torvalds, viro,
	linuxppc64-dev, mingo, nico, oleg, alan, arjan

 
> What's "unfair"?  Mutexes are FIFO, as are x86 semaphores.

The ppc64 semaphores don't force everyone into the slow path under
contention. So you could drop and pick up the semaphore even with
someone waiting. I thought that's how the new mutex code worked.

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-07 17:49             ` PowerPC fastpaths for mutex subsystem Joel Schopp
  2006-01-07 22:37               ` Andrew Morton
@ 2006-01-08  9:48               ` Ingo Molnar
  2006-01-10 22:31                 ` Joel Schopp
  2006-01-08 10:43               ` Ingo Molnar
  2 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2006-01-08  9:48 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
	Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
	Andi Kleen, Russell King, Anton Blanchard, PPC64-dev


* Joel Schopp <jschopp@austin.ibm.com> wrote:

> Tested on a 4 core (2 SMT threads/core) Power5 machine with gcc 3.3.2.

> Test results from synchro-test.ko:
> 
> All tests run for default 5 seconds
> Threads semaphores  mutexes     mutexes+attached
> 1       63,465,364  58,404,630  62,109,571
> 4       58,424,282  35,541,297  37,820,794
> 8       40,731,668  35,541,297  40,281,768
> 16      38,372,769  37,256,298  41,751,764
> 32      38,406,895  36,933,675  38,731,571
> 64      37,232,017  36,222,480  40,766,379

interesting. Could you try a few things? Firstly, could you add some 
minimal delays to the lock/unlock path, of at least 1 usec? E.g.  
"synchro-test.ko load=1 interval=1". [but you could try longer delays 
too, 10 usecs is still realistic.]

secondly, could you try the VFS creat+unlink test via the test-mutex.c 
code below, with something like:

	./test-mutex V 16 10

(this tests with 16 tasks, for 10 seconds.) You'll get a useful ops/sec 
number out of this test, but the other stats will only be calculated if 
you implement the rdtsc() macro to read cycles - right now it defaults 
to 'always 0' on ppc; i386 and ia64 have it implemented. Also, beware 
that the default atomic_inc()/dec() is unsafe (only i386 and ia64 have 
the real thing implemented), so you might want to add a safe PPC 
implementation.
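
( an untested sketch of what the ppc64 bits might look like - mftb
  reads the 64-bit timebase, and ldarx/stdcx. gives a safe atomic
  increment/decrement; the names match the test-code's own macros: )

#elif defined(__powerpc64__)
# define rdtscll(val)					\
do {							\
	__asm__ __volatile__("mftb %0" : "=r" (val));	\
} while (0)

static inline void atomic_inc(unsigned long *flag)
{
	unsigned long tmp;
	__asm__ __volatile__(
"1:	ldarx	%0,0,%1\n"
"	addi	%0,%0,1\n"
"	stdcx.	%0,0,%1\n"
"	bne-	1b"
	: "=&r" (tmp) : "r" (flag) : "cr0", "memory");
}

static inline void atomic_dec(unsigned long *flag)
{
	unsigned long tmp;
	__asm__ __volatile__(
"1:	ldarx	%0,0,%1\n"
"	addi	%0,%0,-1\n"
"	stdcx.	%0,0,%1\n"
"	bne-	1b"
	: "=&r" (tmp) : "r" (flag) : "cr0", "memory");
}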

thirdly, could you run 'vmstat 1' during the tests, and post those lines 
too? Here i'm curious about two things: the average runqueue length 
(whether we have overscheduling), and CPU utilization and idle time left 
(how efficiently cycles are preserved in contention). [btw., does ppc 
have an idle=poll equivalent mode of idling?]

also, there seems to be some fluctuation in the numbers - could you try 
to run a few more to see how stable the numbers are?

	Ingo

------------
/*
 * Copyright (C) 2005, Ingo Molnar <mingo@redhat.com>
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <linux/unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <pwd.h>
#include <grp.h>
#include <sys/time.h>
#include <regex.h>
#include <time.h>
#include <sys/mman.h>
#include <dlfcn.h>
#include <popt.h>
#include <sys/socket.h>
#include <ctype.h>
#include <assert.h>
#include <sched.h>

#ifdef __ia64__
#include <sys/ioctl.h>
#include "mmtimer.h"
int mmtimer_fd;

unsigned long __mm_timer_clock_res;
unsigned long *__mm_clock_dev;
unsigned long __mm_clock_offset;
#endif

unsigned long *shared;
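
/*
 * The lock ops below are funneled through gettimeofday() with magic
 * "tz" arguments - presumably hooked by a patched test kernel; without
 * that hook only the VFS test is usable (hence "usage2" below).
 */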

#define mutex_lock()    gettimeofday((void *)0, (void *)10)
#define mutex_unlock()  gettimeofday((void *)0, (void *)20)
#define down()          gettimeofday((void *)0, (void *)100)
#define up()            gettimeofday((void *)0, (void *)200)
#define down_write()    gettimeofday((void *)0, (void *)1000)
#define up_write()      gettimeofday((void *)0, (void *)2000)
#define down_read()     gettimeofday((void *)0, (void *)10000)
#define up_read()       gettimeofday((void *)0, (void *)20000)

/*
 * Shared locks and variables between the test tasks:
 */
#define CACHELINE_SIZE (128/sizeof(long))

enum {
	SHARED_DELTA_SUM		= 0*CACHELINE_SIZE,
	SHARED_DELTA_MAX		= 1*CACHELINE_SIZE,
	SHARED_DELTA2_SUM		= 2*CACHELINE_SIZE,
	SHARED_DELTA2_MAX		= 3*CACHELINE_SIZE,
	SHARED_DELTA3_SUM		= 4*CACHELINE_SIZE,
	SHARED_DELTA3_MAX		= 5*CACHELINE_SIZE,
	SHARED_DELTA_DELTA_SUM		= 6*CACHELINE_SIZE,
	SHARED_COUNT			= 7*CACHELINE_SIZE,
	SHARED_SUM			= 8*CACHELINE_SIZE,
	SHARED_LOCK			= 9*CACHELINE_SIZE,
	SHARED_END			= 10*CACHELINE_SIZE,
};

#define SHARED(x)	(*(shared + SHARED_##x))
#define SHARED_LL(x)	(*(unsigned long long *)(shared + SHARED_##x))

#define BUG_ON(c)	assert(!(c))

static unsigned long *setup_shared_var(void)
{
	char zerobuff [4096] = { 0, };
	int ret, fd;
	unsigned long *buf;
	char tmpfile[100];

	sprintf(tmpfile, ".tmp_mmap-%d", getpid());

	fd = creat(tmpfile, 0700);

	BUG_ON(fd == -1);
	close(fd);

	fd = open(tmpfile, O_RDWR|O_CREAT|O_TRUNC, 0600);
	unlink(tmpfile);
	BUG_ON(fd == -1);
	ret = write(fd, zerobuff, 4096);
	BUG_ON(ret != 4096);

	buf = (void *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	BUG_ON(buf == (void *)-1);

	close(fd);

	return buf;
}

#define LOOPS 10000

#ifdef __ia64__
static void setup_mmtimer(void)
{
	long regoff;
	int fd, _t;
	size_t pagesize;

	if ((fd = open ("/dev/mmtimer", O_RDONLY)) == -1)
		perror("missing /dev/mmtimer");
	else {
		pagesize = getpagesize();
		__mm_clock_dev = mmap(0, pagesize, PROT_READ,
				      MAP_SHARED, fd, 0);
		if (__mm_clock_dev != MAP_FAILED) {
			regoff = ioctl(fd, MMTIMER_GETOFFSET, 0);
			if (regoff >= 0) {
				__mm_clock_dev += regoff;
				__mm_clock_offset = *__mm_clock_dev;
			} else
				perror("reg offset ioctl failed");
			_t = ioctl(fd, MMTIMER_GETFREQ, &__mm_timer_clock_res);
			if (_t)
				perror("get freq ioctl fail");
		}
	}
}


#define ia64_fetchadd8_rel(p, inc)					\
({									\
	__u64 ia64_intri_res;						\
	asm volatile ("fetchadd8.rel %0=[%1],%2"			\
				: "=r"(ia64_intri_res) : "r"(p), "i" (inc) \
				: "memory");				\
									\
	ia64_intri_res;							\
})

static inline void atomic_inc(unsigned long *flag)
{
	ia64_fetchadd8_rel(flag, 1);
}

static inline void atomic_dec(unsigned long *flag)
{
	ia64_fetchadd8_rel(flag, -1);
}
#elif defined(__i386__)
static inline void atomic_inc(unsigned long *flag)
{
	__asm__ __volatile__(
		"lock; incl %0\n"
			     : "+m" (*flag) : : "memory");
}

static inline void atomic_dec(unsigned long *flag)
{
	__asm__ __volatile__(
		"lock; decl %0\n"
			     : "+m" (*flag) : : "memory");
}
#else
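/* fallback: NOT SMP-safe - placeholder only (see the mail above) */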
static inline void atomic_inc(unsigned long *flag)
{
	++*flag;
}

static inline void atomic_dec(unsigned long *flag)
{
	--*flag;
}
#endif
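
/*
 * LOCK()/UNLOCK(): crude test-harness lock - each contender bumps the
 * shared counter; whoever sees it at exactly 1 holds the lock, everyone
 * else backs out and retries after sleeping 1 usec.
 */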

static void LOCK(unsigned long *shared)
{
	for (;;) {
		atomic_inc(&SHARED(LOCK));
		if (SHARED(LOCK) == 1)
			break;
		atomic_dec(&SHARED(LOCK));
		usleep(1);
	}
}

static void UNLOCK(unsigned long *shared)
{
	atomic_dec(&SHARED(LOCK));
}


static void sigint(int sig)
{
	atomic_inc(&SHARED(END));
}

static void print_status(unsigned long *shared)
{
	unsigned long count;

	count = SHARED(COUNT);
	SHARED(COUNT) = 0;
	SHARED_LL(SUM) += count;

	printf("\r| loops/sec: %ld    \r", count);
	fflush(stdout);
}

enum {
	TYPE_MUTEX,
	TYPE_SEM,
	TYPE_RSEM,
	TYPE_WSEM,
	TYPE_VFS,
	NR_TYPES
};

const char * type_names[NR_TYPES] =
	{	"Mutex",
		"Semaphore",
		"RW-semaphore Read",
		"RW-semaphore Write",
		"VFS"
	};


typedef unsigned long long cycles_t;
typedef unsigned long long usecs_t;

#ifdef __ia64__
# define rdtscll(val)					\
do {							\
	val = *__mm_clock_dev;				\
} while (0)
#elif defined(__i386__)
# define rdtscll(val)					\
do {							\
	__asm__ __volatile__("rdtsc" : "=A" (val));	\
} while (0)
#else
# define rdtscll(val) \
	do { (val) = 0LL; } while (0)
#endif

#define rdtod(val)					\
do {							\
	struct timeval tv;				\
							\
	gettimeofday(&tv, NULL);			\
	(val) = tv.tv_sec * 1000000ULL + tv.tv_usec;	\
} while (0)

#define max(x,y) ({ \
	typeof(x) _x = (x);	\
	typeof(y) _y = (y);	\
	(void) (&_x == &_y);		\
	_x > _y ? _x : _y; })

#define unlikely(x) __builtin_expect(!!(x), 0)

int main(int argc, char **argv)
{
	int i, parent, me, first = 1;
	unsigned long cpus, tasks, seconds = 0;
	cycles_t t0, t01, t1, delta, delta2, delta3, delta_sum = 0,
		delta2_sum = 0, delta3_sum = 0, delta_delta,
		delta_delta_sum = 0, prev_delta,
		delta_max = 0, delta2_max = 0, delta3_max = 0;
	char str[100];
	double freq;
	int type;

	if (argc <= 1 || argc > 4) {
usage:
		fprintf(stderr,
			"usage: test-mutex [Mutex|Sem|Rsem|Wsem|Vfs creat+unlink] <threads> <seconds>\n");
		exit(-1);
usage2:
		fprintf(stderr, "the Mutex/Sem/Rsem/Wsem tests are not available.\n");
		goto usage;
	}

	switch (argv[1][0]) {
		case 'M': type = TYPE_MUTEX; goto usage2; break;
		case 'S': type = TYPE_SEM; goto usage2;  break;
		case 'R': type = TYPE_RSEM; goto usage2; break;
		case 'W': type = TYPE_WSEM; goto usage2; break;
		case 'V': type = TYPE_VFS; break;
		 default: goto usage;
	}

	system("rm -f /tmp/* 2>/dev/null >/dev/null");
	cpus = system("exit `grep processor /proc/cpuinfo  | wc -l`");
	cpus = WEXITSTATUS(cpus);

	tasks = cpus;
	if (argc >= 3) {
		tasks = atol(argv[2]);
		if (!tasks)
			goto usage;
	}

	if (argc >= 4)
		seconds = atol(argv[3]);
	else
		seconds = -1;

#ifdef __ia64__
	setup_mmtimer();
#endif

	printf("%ld CPUs, running %ld parallel test-tasks.\n", cpus, tasks);
	printf("checking %s performance.\n", type_names[type]);

	shared = setup_shared_var();

	signal(SIGINT, sigint);
	signal(SIGHUP, sigint);

	parent = getpid();

	for (i = 0; i < tasks; i++)
		if (!fork())
			break;
	sleep(1);
	me = getpid();
	sprintf(str, "/tmp/tmp-%d", me);

	if (me == parent) {
		unsigned long long total_count;
		int i = 0, j;

		for (;;) {
			sleep(1);
			if (i == seconds || SHARED(END))
				break;
			i++;
			print_status(shared);
		}
		atomic_inc(&SHARED(END));
		total_count = SHARED(SUM);
		for (j = 0; j < tasks; j++)
			wait(NULL);

		if (i)
			printf("\navg ops/sec:               %Ld\n", total_count / i);
		LOCK(shared);
//		printf("delta_sum: %Ld\n", SHARED_LL(DELTA_SUM));
//		printf("delta_delta_sum: %Ld\n", SHARED_LL(DELTA_DELTA_SUM));
#ifdef __ia64__
		freq = 25.0;
#else
		freq = 700.0;
#endif

		printf("average cost per op:       %.2f usecs\n",
			(double)SHARED_LL(DELTA_SUM)/total_count/freq);
		printf("average cost per lock:     %.2f usecs\n",
			(double)SHARED_LL(DELTA2_SUM)/total_count/freq);
		printf("average cost per unlock:   %.2f usecs\n",
			(double)SHARED_LL(DELTA3_SUM)/total_count/freq);
		printf("max cost per op:           %.2f usecs\n",
			(double)SHARED_LL(DELTA_MAX)/freq);
		printf("max cost per lock:         %.2f usecs\n",
			(double)SHARED_LL(DELTA2_MAX)/freq);
		printf("max cost per unlock:       %.2f usecs\n",
			(double)SHARED_LL(DELTA3_MAX)/freq);
		printf("average deviance per op:   %.2f usecs\n",
			(double)SHARED_LL(DELTA_DELTA_SUM)/total_count/freq/2.0);
		UNLOCK(shared);
		exit(0);
	}

	for (;;) {
		rdtscll(t0);

		switch (type) {
			case TYPE_MUTEX:
				mutex_lock();
				rdtscll(t01);
				mutex_unlock();
				break;
			case TYPE_SEM:
				down();
				rdtscll(t01);
				up();
				break;
			case TYPE_RSEM:
				down_read();
				rdtscll(t01);
				up_read();
				break;
			case TYPE_WSEM:
				down_write();
				rdtscll(t01);
				up_write();
				break;
			case TYPE_VFS:
			{
				int fd;

				fd = creat(str, S_IRWXU);
				rdtscll(t01);
				close(fd);

				break;
			}
		}
		rdtscll(t1);

		delta = t1-t0;
		if (unlikely(delta > delta_max))
			delta_max = delta;
		delta_sum += delta;

		delta2 = t01-t0;
		if (unlikely(delta2 > delta2_max))
			delta2_max = delta2;
		delta2_sum += delta2;

		delta3 = t1-t01;
		if (unlikely(delta3 > delta3_max))
			delta3_max = delta3;
		delta3_sum += delta3;

		if (!first) {
			if (prev_delta < delta)
				delta_delta = delta - prev_delta;
			else
				delta_delta = prev_delta - delta;
			delta_delta_sum += delta_delta;
#if 0
			printf("%Ld-%Ld {%Ld} prev: {%Ld} / [%Ld]\n",
				t0, t1, delta, prev_delta, delta_delta);
			printf("  {%Ld} - {%Ld}\n",
				delta_sum, delta_delta_sum);
#endif
		} else
			first = 0;

		prev_delta = delta;

		atomic_inc(&SHARED(COUNT));

		if (unlikely(SHARED(END))) {
			LOCK(shared);
			SHARED_LL(DELTA_SUM) += delta_sum;
			SHARED_LL(DELTA_MAX) = max(SHARED_LL(DELTA_MAX),
							delta_max);
			SHARED_LL(DELTA2_SUM) += delta2_sum;
			SHARED_LL(DELTA2_MAX) = max(SHARED_LL(DELTA2_MAX),
							delta2_max);
			SHARED_LL(DELTA3_SUM) += delta3_sum;
			SHARED_LL(DELTA3_MAX) = max(SHARED_LL(DELTA3_MAX),
							delta3_max);
			SHARED_LL(DELTA_DELTA_SUM) += delta_delta_sum;
#if 0
			printf("delta_sum: %Ld\n", delta_sum);
			printf("delta_delta_sum: %Ld\n", delta_delta_sum);
			printf("DELTA_SUM: %Ld\n", SHARED_LL(DELTA_SUM));
			printf("DELTA_DELTA_SUM: %Ld\n", SHARED_LL(DELTA_DELTA_SUM));
#endif
			UNLOCK(shared);

			exit(0);
		}
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-07 17:49             ` PowerPC fastpaths for mutex subsystem Joel Schopp
  2006-01-07 22:37               ` Andrew Morton
  2006-01-08  9:48               ` Ingo Molnar
@ 2006-01-08 10:43               ` Ingo Molnar
  2 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2006-01-08 10:43 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
	Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
	Andi Kleen, Russell King, Anton Blanchard, PPC64-dev


looks good to me. Minor nit:

> +"	isync\n"

> +"	isync		\n"

shouldn't these two be ISYNC_ON_SMP?

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-08  7:43                 ` Anton Blanchard
  2006-01-08  8:00                   ` Andrew Morton
@ 2006-01-09 11:13                   ` David Howells
  1 sibling, 0 replies; 34+ messages in thread
From: David Howells @ 2006-01-09 11:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Blanchard, jes, rmk+lkml, ak, linux-kernel, hch, torvalds,
	viro, linuxppc64-dev, mingo, nico, oleg, alan, arjan

Andrew Morton <akpm@osdl.org> wrote:

> > Wasn't most of the x86 mutex gain a result of going from fair to unfair
> > operation? The current ppc64 semaphores are unfair.
> > 
> 
> What's "unfair"?  Mutexes are FIFO, as are x86 semaphores.

No, strictly Ingo's mutexes are neither completely fair nor completely FIFO.
It's possible for a process to jump the queue because unlock() always sets the
counter back to 1 before waking up the process at the front of the queue. This
means that the lock() fastpath in another process may steal the mutex out of
sequence before the wakee has a chance to grab it.
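
Roughly, as an illustration (timeline, not real code):

	CPU0 (owner)		CPU1 (head waiter)	CPU2 (newcomer)
	unlock(): count = 1
	wake_up_process(CPU1)
				still waking up...	lock() fastpath:
							dec_return(count)
							== 0 -> owns mutex
				runs, finds the lock
				taken, sleeps again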

I'm not 100% convinced that x86 counting semaphores are completely fair or
completely FIFO. It's possible that they are because up() never arbitrarily
sets the count back to >0.

R/W semaphores are completely fair, but only as completely FIFO as the unfair
spinlocks permit. This is because it's much easier to guarantee their behaviour
(writer-starvation is a real problem with unfair rwsems). I have a simple
implementation of totally fair spinlocks for x86 which would also work on
anything that can emulate XADD, but I don't think it's worth the trouble.

However, for Ingo's mutexes, I suspect this queue-jumping feature is
sufficiently low probability that we can ignore it. It is theoretically
possible to induce livelock, but in reality I think it extremely unlikely to
happen for any significant length of time.

David

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-08  9:48               ` Ingo Molnar
@ 2006-01-10 22:31                 ` Joel Schopp
  2006-01-10 23:09                   ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Schopp @ 2006-01-10 22:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
	Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
	Andi Kleen, Russell King, Anton Blanchard, PPC64-dev

[-- Attachment #1: Type: text/plain, Size: 1379 bytes --]

> interesting. Could you try a few things? Firstly, could you add some 
> minimal delays to the lock/unlock path, of at least 1 usec? E.g.  
> "synchro-test.ko load=1 interval=1". [but you could try longer delays 
> too, 10 usecs is still realistic.]

Graphs attached.  The summary for those who don't like to look at attachments is 
that the mutex fastpath (threads 1) that I sent the optimized patch for is 
comparable within the margin of error to semaphores.  The mutex common path 
(threads > 1) gets embarrassed by semaphores. So the mutex common paths are not 
yet ready as far as ppc64 is concerned.

> 
> secondly, could you try the VFS creat+unlink test via the test-mutex.c 
> code below, with something like:
> 
> 	./test-mutex V 16 10

Queued into my todo list.

> 
> thirdly, could you run 'vmstat 1' during the tests, and post those lines 
> too? Here i'm curious about two things: the average runqueue length 
> (whether we have overscheduling), and CPU utilization and idle time left 
> (how efficiently cycles are preserved in contention). [btw., does ppc 
> have an idle=poll equivalent mode of idling?]

Also queued in my todo list.

> 
> also, there seems to be some fluctuation in the numbers - could you try 
> to run a few more to see how stable the numbers are?

For the graphs the line is the average of 5 runs, and the 5 runs are scatter 
plotted as well.

[-- Attachment #2: semvsmux2.png --]
[-- Type: image/png, Size: 4536 bytes --]

[-- Attachment #3: semvsmux3.png --]
[-- Type: image/png, Size: 4471 bytes --]

[-- Attachment #4: semvsmux.png --]
[-- Type: image/png, Size: 4805 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-10 22:31                 ` Joel Schopp
@ 2006-01-10 23:09                   ` Ingo Molnar
  2006-01-11 10:52                     ` Ingo Molnar
  2006-01-11 17:44                     ` Joel Schopp
  0 siblings, 2 replies; 34+ messages in thread
From: Ingo Molnar @ 2006-01-10 23:09 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
	Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
	Andi Kleen, Russell King, Anton Blanchard, PPC64-dev


* Joel Schopp <jschopp@austin.ibm.com> wrote:

> >interesting. Could you try a few things? Firstly, could you add some 
> >minimal delays to the lock/unlock path, of at least 1 usec? E.g.  
> >"synchro-test.ko load=1 interval=1". [but you could try longer delays 
> >too, 10 usecs is still realistic.]
> 
> Graphs attached.  The summary for those who don't like to look at 
> attachments is that the mutex fastpath (threads 1) that I sent the 
> optimized patch for is comparable within the margin of error to 
> semaphores.  The mutex common path (threads > 1) gets embarrassed by 
> semaphores. So the mutex common paths are not yet ready as far as ppc64 
> is concerned.

ok. I'll really need to look at "vmstat" output from these. We could 
easily make the mutex slowpath behave like ppc64 semaphores, via the 
attached (untested) patch, but i really think it's the wrong thing to 
do, because it overloads the system with runnable tasks in an 
essentially unlimited fashion [== overscheduling] - they'll all contend 
for the same single mutex.

in synthetic workloads on idle systems such overscheduling can help, 
because the 'luck factor' of the 'thundering herd' of tasks can generate 
a higher total throughput - at the expense of system efficiency. At 8 
CPUs i already measured a net performance loss at 3 tasks! So i think 
the current 'at most 2 tasks runnable' approach of mutexes is the right 
one on a broad range of hardware.

still, i'll try a different patch tomorrow, to keep the number of 'in 
flight' tasks within a certain limit (say at 2) - i suspect that would 
close the performance gap too, on this test.

but i really think the current 'at most one task in flight' logic is the 
correct approach. I'm also curious about the VFS-test numbers (already 
on your todo).

> >thirdly, could you run 'vmstat 1' during the tests, and post those lines 
> >too? Here i'm curious about two things: the average runqueue length 
> >(whether we have overscheduling), and CPU utilization and idle time left 
> >(how efficiently cycles are preserved in contention). [btw., does ppc 
> >have an idle=poll equivalent mode of idling?]
> 
> Also queued in my todo list.

thanks!

> >also, there seems to be some fluctuation in the numbers - could you try 
> >to run a few more to see how stable the numbers are?
> 
> For the graphs the line is the average of 5 runs, and the 5 runs are 
> scatter plotted as well.

ok, that should be more than enough.

	Ingo

--- kernel/mutex.c.orig
+++ kernel/mutex.c
@@ -226,6 +226,9 @@ __mutex_unlock_slowpath(atomic_t *lock_c
 
 		debug_mutex_wake_waiter(lock, waiter);
 
+		/* be (much) more aggressive about wakeups: */
+		list_move_tail(&waiter.list, &lock->wait_list);
+
 		wake_up_process(waiter->task);
 	}
 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-10 23:09                   ` Ingo Molnar
@ 2006-01-11 10:52                     ` Ingo Molnar
  2006-01-11 17:44                     ` Joel Schopp
  1 sibling, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2006-01-11 10:52 UTC (permalink / raw)
  To: Joel Schopp
  Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
	Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
	Andi Kleen, Russell King, Anton Blanchard, PPC64-dev


* Ingo Molnar <mingo@elte.hu> wrote:

> ok. I'll really need to look at "vmstat" output from these. We could 
> easily make the mutex slowpath behave like ppc64 semaphores, via the 
> attached (untested) patch, but i really think it's the wrong thing to 
> do, because it overloads the system with runnable tasks in an 
> essentially unlimited fashion [== overscheduling] - they'll all 
> contend for the same single mutex.

find the working patch below. (the previous one had a syntax error)

	Ingo

Index: linux/kernel/mutex.c
===================================================================
--- linux.orig/kernel/mutex.c
+++ linux/kernel/mutex.c
@@ -227,6 +227,9 @@ __mutex_unlock_slowpath(atomic_t *lock_c
 		debug_mutex_wake_waiter(lock, waiter);
 
 		wake_up_process(waiter->task);
+
+		/* be (much) more aggressive about wakeups: */
+		list_move_tail(&waiter->list, &lock->wait_list);
 	}
 
 	debug_mutex_clear_owner(lock);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: PowerPC fastpaths for mutex subsystem
  2006-01-10 23:09                   ` Ingo Molnar
  2006-01-11 10:52                     ` Ingo Molnar
@ 2006-01-11 17:44                     ` Joel Schopp
  1 sibling, 0 replies; 34+ messages in thread
From: Joel Schopp @ 2006-01-11 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
	Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
	Andi Kleen, Russell King, Anton Blanchard, PPC64-dev

> ok. I'll really need to look at "vmstat" output from these. We could 
> easily make the mutex slowpath behave like ppc64 semaphores, via the 
> attached (untested) patch, but i really think it's the wrong thing to 
> do, because it overloads the system with runnable tasks in an 
> essentially unlimited fashion [== overscheduling] - they'll all contend 
> for the same single mutex.
> 
> in synthetic workloads on idle systems such overscheduling can help, 
> because the 'luck factor' of the 'thundering herd' of tasks can generate 
> a higher total throughput - at the expense of system efficiency. At 8 
> CPUs i already measured a net performance loss at 3 tasks! So i think 
> the current 'at most 2 tasks runnable' approach of mutexes is the right 
> one on a broad range of hardware.
> 
> still, i'll try a different patch tomorrow, to keep the number of 'in 
> flight' tasks within a certain limit (say at 2) - i suspect that would 
> close the performance gap too, on this test.

The fundamental problem is that there is a relatively major latency to wake 
somebody up, and for them to actually run so they can acquire a lock.  In an 
ideal world there would always be a single waiter running trying to acquire the 
lock at the moment it was unlocked and not running until then.

There are better solutions than just madly throwing more waiters in flight on an 
unlock.  Here are four possibilities off the top of my head:

1) It is possible to have a hybrid lock that spins a single waiting thread and 
sleeps waiters 2..n, so there is always a waiter running trying to acquire the 
lock.  It solves the latency problem if the lock is held a length of time at 
least as long as it takes to wake up the next waiter.  But the spinning waiter 
burns some cpu to buy the decreased latency.

2) You could also do the classic spin-for-a-while-then-sleep method (a rough 
sketch follows at the end of this mail).  This essentially turns low latency 
locks into spinlocks but still sleeps on locks which are held longer and/or are 
much more contested.

3) There is the option to look at cpu idleness of the current cpu and spin or 
sleep based on that.

4) Accept that we have a cpu efficient high latency lock and use it appropriately.

I'm not saying any of these 4 is what we should do.  I'm just trying to say 
there are options out there that don't involve thundering herds and luck to 
address the problem.
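
For concreteness, a rough and untested sketch of option 2 in terms of the new 
mutex API (MAX_SPIN is a made-up tuning knob):

static void spin_then_sleep_lock(struct mutex *lock)
{
	int i;

	/* spin briefly in the hope that the owner releases soon... */
	for (i = 0; i < MAX_SPIN; i++) {
		if (mutex_trylock(lock))
			return;
		cpu_relax();
	}
	/* ...then fall back to the normal sleeping slowpath */
	mutex_lock(lock);
}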

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2006-01-11 17:45 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-04 14:41 [patch 00/21] mutex subsystem, -V14 Ingo Molnar
2006-01-04 23:45 ` Joel Schopp
2006-01-05  2:38   ` Nicolas Pitre
2006-01-05  2:51     ` Linus Torvalds
2006-01-05  3:21       ` Nick Piggin
2006-01-05  3:39         ` Anton Blanchard
2006-01-05 18:04           ` Jesse Barnes
2006-01-05 14:40       ` Ingo Molnar
2006-01-05 16:21         ` Linus Torvalds
2006-01-05 22:03           ` Ingo Molnar
2006-01-05 22:17             ` Linus Torvalds
2006-01-05 22:43               ` Ingo Molnar
2006-01-06  3:49                 ` Keith Owens
2006-01-06  7:34           ` Denis Vlasenko
2006-01-05 14:35   ` Ingo Molnar
2006-01-05 16:42     ` Joel Schopp
2006-01-05 22:21       ` Ingo Molnar
2006-01-05 23:06         ` Joel Schopp
2006-01-05 23:26           ` Linus Torvalds
2006-01-05 23:36             ` Joel Schopp
2006-01-05 23:42               ` Ingo Molnar
2006-01-06  0:29           ` Olof Johansson
2006-01-07 17:49             ` PowerPC fastpaths for mutex subsystem Joel Schopp
2006-01-07 22:37               ` Andrew Morton
2006-01-08  7:43                 ` Anton Blanchard
2006-01-08  8:00                   ` Andrew Morton
2006-01-08  8:23                     ` Anton Blanchard
2006-01-09 11:13                   ` David Howells
2006-01-08  9:48               ` Ingo Molnar
2006-01-10 22:31                 ` Joel Schopp
2006-01-10 23:09                   ` Ingo Molnar
2006-01-11 10:52                     ` Ingo Molnar
2006-01-11 17:44                     ` Joel Schopp
2006-01-08 10:43               ` Ingo Molnar
