* [patch 00/21] mutex subsystem, -V14 @ 2006-01-04 14:41 Ingo Molnar 2006-01-04 23:45 ` Joel Schopp 0 siblings, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2006-01-04 14:41 UTC (permalink / raw) To: lkml Cc: Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King

this is version 14 of the generic mutex subsystem, against v2.6.15.

The patch-queue consists of 21 patches, which can also be downloaded
from:

	http://redhat.com/~mingo/generic-mutex-subsystem/

Changes since -V13:

 13 files changed, 113 insertions(+), 85 deletions(-)

- VFS: converted sb->s_lock to a mutex too. This improves the
  create+unlink benchmark on ext3.

- further simplification of __mutex_lock_common(): no more gotos, and
  only one atomic_xchg() is done. Code size is now extremely small on
  both UP and SMP:

     text    data     bss     dec     hex filename
      398       0       0     398     18e mutex.o.UP

     text    data     bss     dec     hex filename
      463       0       0     463     1cf mutex.o.SMP

- synchro-test updates: max # of threads of 64, fairness stats, better
  defaults if =y, cleanups. (David Howells, me)

- FASTCALL -> fastcall in mutex.h (suggested by David Howells)

- documentation updates

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-04 14:41 [patch 00/21] mutex subsystem, -V14 Ingo Molnar @ 2006-01-04 23:45 ` Joel Schopp 2006-01-05 2:38 ` Nicolas Pitre 2006-01-05 14:35 ` Ingo Molnar 0 siblings, 2 replies; 34+ messages in thread From: Joel Schopp @ 2006-01-04 23:45 UTC (permalink / raw) To: Ingo Molnar Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard

> this is version 14 of the generic mutex subsystem, against v2.6.15.
>
> The patch-queue consists of 21 patches, which can also be downloaded
> from:
>
>    http://redhat.com/~mingo/generic-mutex-subsystem/
>

Took a glance at this on ppc64. Would it be useful if I contributed an
arch-specific version like arm has? We'll either need an arch-specific
version or have the generic one changed.

Anyway, here is some disassembly of some of the generated code, with my
comments:

c00000000049bf9c <.mutex_lock>:
c00000000049bf9c:	7c 00 06 ac	eieio
c00000000049bfa0:	7d 20 18 28	lwarx	r9,r0,r3
c00000000049bfa4:	31 29 ff ff	addic	r9,r9,-1
c00000000049bfa8:	7d 20 19 2d	stwcx.	r9,r0,r3
c00000000049bfac:	40 c2 ff f4	bne+	c00000000049bfa0 <.mutex_lock+0x4>
c00000000049bfb0:	4c 00 01 2c	isync
c00000000049bfb4:	7d 20 4b 78	mr	r0,r9
c00000000049bfb8:	78 09 0f e3	rldicl.	r9,r0,33,63
c00000000049bfbc:	4d 82 00 20	beqlr
c00000000049bfc0:	4b ff ff 58	b	c00000000049bf18 <.__mutex_lock_noinline>

The eieio is completely unnecessary; it got picked up from
atomic_dec_return(). (Anton, why is there an eieio at the start of
atomic_dec_return in the first place?) Ignore the + on the bne: the
disassembler is wrong, it is really a -.

c00000000049bfc4 <.mutex_unlock>:
c00000000049bfc4:	7c 00 06 ac	eieio
c00000000049bfc8:	7d 20 18 28	lwarx	r9,r0,r3
c00000000049bfcc:	31 29 00 01	addic	r9,r9,1
c00000000049bfd0:	7d 20 19 2d	stwcx.	r9,r0,r3
c00000000049bfd4:	40 c2 ff f4	bne+	c00000000049bfc8 <.mutex_unlock+0x4>
c00000000049bfd8:	4c 00 01 2c	isync
c00000000049bfdc:	7d 20 07 b4	extsw	r0,r9
c00000000049bfe0:	2c 00 00 00	cmpwi	r0,0
c00000000049bfe4:	4d 81 00 20	bgtlr
c00000000049bfe8:	4b ff ff 40	b	c00000000049bf28 <.__mutex_unlock_noinline>

That eieio should be an lwsync to avoid data corruption, and I think the
isync is superfluous. Ditto the disassembler being wrong about the + vs -.

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-04 23:45 ` Joel Schopp @ 2006-01-05 2:38 ` Nicolas Pitre 2006-01-05 2:51 ` Linus Torvalds 2006-01-05 14:35 ` Ingo Molnar 1 sibling, 1 reply; 34+ messages in thread From: Nicolas Pitre @ 2006-01-05 2:38 UTC (permalink / raw) To: Joel Schopp Cc: Ingo Molnar, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard On Wed, 4 Jan 2006, Joel Schopp wrote: > > this is version 14 of the generic mutex subsystem, against v2.6.15. > > > > The patch-queue consists of 21 patches, which can also be downloaded from: > > > > http://redhat.com/~mingo/generic-mutex-subsystem/ > > > > Took a glance at this on ppc64. Would it be useful if I contributed an arch > specific version like arm has? We'll either need an arch specific version or > have the generic changed. Don't change the generic version. You should provide a ppc specific version if the generic ones don't look so good. Nicolas ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 2:38 ` Nicolas Pitre @ 2006-01-05 2:51 ` Linus Torvalds 2006-01-05 3:21 ` Nick Piggin 2006-01-05 14:40 ` Ingo Molnar 0 siblings, 2 replies; 34+ messages in thread From: Linus Torvalds @ 2006-01-05 2:51 UTC (permalink / raw) To: Nicolas Pitre Cc: Joel Schopp, Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard On Wed, 4 Jan 2006, Nicolas Pitre wrote: > On Wed, 4 Jan 2006, Joel Schopp wrote: > > > > this is version 14 of the generic mutex subsystem, against v2.6.15. > > > > > > The patch-queue consists of 21 patches, which can also be downloaded from: > > > > > > http://redhat.com/~mingo/generic-mutex-subsystem/ > > > > > > > Took a glance at this on ppc64. Would it be useful if I contributed an arch > > specific version like arm has? We'll either need an arch specific version or > > have the generic changed. > > Don't change the generic version. You should provide a ppc specific > version if the generic ones don't look so good. Well, if the generic one generates _buggy_ code on ppc64, that means that either the generic version is buggy, or one of the atomics that it uses is buggily implemented on ppc64. And I think it's the generic mutex stuff that is buggy. It seems to assume memory barriers that aren't valid to assume. A mutex is more than just updating the mutex count properly. You also have to have the proper memory barriers there to make sure that the things that the mutex _protects_ actually stay inside the mutex. So while a ppc64-optimized mutex is probably a good idea per se, I think the generic mutex code had better be fixed first and regardless of any optimized version. On x86/x86-64, the locked instructions automatically imply the proper memory barriers, but that was just lucky, I think. Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 2:51 ` Linus Torvalds @ 2006-01-05 3:21 ` Nick Piggin 2006-01-05 3:39 ` Anton Blanchard 2006-01-05 14:40 ` Ingo Molnar 1 sibling, 1 reply; 34+ messages in thread From: Nick Piggin @ 2006-01-05 3:21 UTC (permalink / raw) To: Linus Torvalds Cc: Nicolas Pitre, Joel Schopp, Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard Linus Torvalds wrote: > > On Wed, 4 Jan 2006, Nicolas Pitre wrote: > > >>On Wed, 4 Jan 2006, Joel Schopp wrote: >> >> >>>>this is version 14 of the generic mutex subsystem, against v2.6.15. >>>> >>>>The patch-queue consists of 21 patches, which can also be downloaded from: >>>> >>>> http://redhat.com/~mingo/generic-mutex-subsystem/ >>>> >>> >>>Took a glance at this on ppc64. Would it be useful if I contributed an arch >>>specific version like arm has? We'll either need an arch specific version or >>>have the generic changed. >> >>Don't change the generic version. You should provide a ppc specific >>version if the generic ones don't look so good. > > > Well, if the generic one generates _buggy_ code on ppc64, that means that > either the generic version is buggy, or one of the atomics that it uses is > buggily implemented on ppc64. > > And I think it's the generic mutex stuff that is buggy. It seems to assume > memory barriers that aren't valid to assume. > > A mutex is more than just updating the mutex count properly. You also have > to have the proper memory barriers there to make sure that the things that > the mutex _protects_ actually stay inside the mutex. > > So while a ppc64-optimized mutex is probably a good idea per se, I think > the generic mutex code had better be fixed first and regardless of any > optimized version. > > On x86/x86-64, the locked instructions automatically imply the proper > memory barriers, but that was just lucky, I think. 
> I think the generic code is correct according to Documentation/atomic_ops.txt which basically defines any atomic_xxx operation which both modifies its operand and returns something to have a full memory barrier before and after its load/store operations. Side note, why can't powerpc use lwsync for smp_wmb? The only problem seems to be that it allows loads to be reordered before stores, but that's OK with smp_wmb, right? And why is smp_wmb() (ie. the non I/O barrier) doing eieio, while wmb() does not? And rmb() does lwsync, which apparently does not order IO at all... -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 3:21 ` Nick Piggin @ 2006-01-05 3:39 ` Anton Blanchard 2006-01-05 18:04 ` Jesse Barnes 0 siblings, 1 reply; 34+ messages in thread From: Anton Blanchard @ 2006-01-05 3:39 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Nicolas Pitre, Joel Schopp, Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King

Hi,

> Side note, why can't powerpc use lwsync for smp_wmb? The only problem seems
> to be that it allows loads to be reordered before stores, but that's
> OK with smp_wmb, right?

lwsync implies more ordering than eieio and so may take longer. lwsync
orders everything except store - load, and eieio just orders store -
store. On power3 an lwsync is a full sync, which takes forever, although
on newer chips both lwsync and eieio tend to take the same number of
cycles.

> And why is smp_wmb() (ie. the non I/O barrier) doing eieio, while wmb() does
> not? And rmb() does lwsync, which apparently does not order IO at all...

Because people love to abuse the barrier macros :) Grep for wmb in
drivers/net and look at the number of places wmb() is being used to
order memory and IO. Architecturally eieio is a store - store ordering
for IO and for memory, but not between the two; sync is slow but does
guarantee this.

SGI's mmiowb() might be useful for some of these cases, but every time
it's brought up everyone ends up confused as to its real use.

Really we should have io_*mb() and smp_*mb(). At that stage we may even
be able to kill the base *mb() macros.

Anton

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 3:39 ` Anton Blanchard @ 2006-01-05 18:04 ` Jesse Barnes 0 siblings, 0 replies; 34+ messages in thread From: Jesse Barnes @ 2006-01-05 18:04 UTC (permalink / raw) To: Anton Blanchard Cc: Nick Piggin, Linus Torvalds, Nicolas Pitre, Joel Schopp, Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King On Wednesday, January 4, 2006 7:39 pm, Anton Blanchard wrote: > SGIs mmiowb() might be useful for some of these cases but every time > its brought up everyone ends up confused as to its real use. It's documented in Documentation/DocBook/deviceiobook.tmpl. If the documentation isn't clear, we should fix it, rather than avoid using the primitive altogether. If drivers/net really means mmiowb() in some places, we should change it, and like you said maybe get rid of some of these primitives so that their usage is clearer. Jesse ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 2:51 ` Linus Torvalds 2006-01-05 3:21 ` Nick Piggin @ 2006-01-05 14:40 ` Ingo Molnar 2006-01-05 16:21 ` Linus Torvalds 1 sibling, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2006-01-05 14:40 UTC (permalink / raw) To: Linus Torvalds Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard

* Linus Torvalds <torvalds@osdl.org> wrote:

> Well, if the generic one generates _buggy_ code on ppc64, that means
> that either the generic version is buggy, or one of the atomics that
> it uses is buggily implemented on ppc64.
>
> And I think it's the generic mutex stuff that is buggy. It seems to
> assume memory barriers that aren't valid to assume.

in those headers i'm only using atomic_dec_return(), atomic_cmpxchg()
and atomic_xchg() - all of which imply a barrier. It is
atomic_inc/add/sub/dec() that doesn't default to an implied SMP barrier.
But it's certainly not documented too well.

If atomic_dec_return() is not supposed to imply a barrier, then all the
affected architectures (sparc64, ppc64, mips, alpha, etc.) are overdoing
synchronization currently: they all have barriers for these primitives.
[They also have an implicit barrier for atomic_dec_test() - which is
being relied on for correctness - no kernel code adds an
smp_mb__before_atomic_dec() barrier around atomic_dec_test().]

the patch below adds the barriers to the asm-generic mutex routines, so
it's not like i'm lazy ;), but i really think this is unnecessary.
Adding this patch would add a second, unnecessary barrier for all the
arches that have barrier-less atomic ops.

it also makes sense: the moment you are interested in the 'previous
value' of the atomic counter in an atomic fashion, you very likely want
to use it for a critical section. (e.g. all the put-the-resource ops
that use atomic_dec_test() rely on this implicit barrier.)

	Ingo

Index: linux/include/asm-generic/mutex-dec.h
===================================================================
--- linux.orig/include/asm-generic/mutex-dec.h
+++ linux/include/asm-generic/mutex-dec.h
@@ -19,6 +19,7 @@
  */
 #define __mutex_fastpath_lock(count, fail_fn)		\
 do {							\
+	smp_mb__before_atomic_dec();			\
 	if (unlikely(atomic_dec_return(count) < 0))	\
 		fail_fn(count);				\
 } while (0)
@@ -36,6 +37,7 @@ do { \
 static inline int
 __mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
 {
+	smp_mb__before_atomic_dec();
 	if (unlikely(atomic_dec_return(count) < 0))
 		return fail_fn(count);
 	else
@@ -59,6 +61,8 @@ __mutex_fastpath_lock_retval(atomic_t *c
 do {							\
 	if (unlikely(atomic_inc_return(count) <= 0))	\
 		fail_fn(count);				\
+	else						\
+		smp_mb__after_atomic_inc();		\
 } while (0)
 
 #define __mutex_slowpath_needs_to_unlock()	1
@@ -92,6 +96,7 @@ __mutex_fastpath_trylock(atomic_t *count
 	 * the mutex state would be.
 	 */
 #ifdef __HAVE_ARCH_CMPXCHG
+	smp_mb__before_atomic_dec();
 	if (likely(atomic_cmpxchg(count, 1, 0)) == 1)
 		return 1;
 	return 0;
Index: linux/include/asm-generic/mutex-xchg.h
===================================================================
--- linux.orig/include/asm-generic/mutex-xchg.h
+++ linux/include/asm-generic/mutex-xchg.h
@@ -24,6 +24,7 @@
  */
 #define __mutex_fastpath_lock(count, fail_fn)		\
 do {							\
+	smp_mb__before_atomic_dec();			\
 	if (unlikely(atomic_xchg(count, 0) != 1))	\
 		fail_fn(count);				\
 } while (0)
@@ -42,6 +43,7 @@ do { \
 static inline int
 __mutex_fastpath_lock_retval(atomic_t *count, int (*fail_fn)(atomic_t *))
 {
+	smp_mb__before_atomic_dec();
 	if (unlikely(atomic_xchg(count, 0) != 1))
 		return fail_fn(count);
 	else
@@ -64,6 +66,8 @@ __mutex_fastpath_lock_retval(atomic_t *c
 do {							\
 	if (unlikely(atomic_xchg(count, 1) != 0))	\
 		fail_fn(count);				\
+	else						\
+		smp_mb__after_atomic_inc();		\
 } while (0)
 
 #define __mutex_slowpath_needs_to_unlock()	0
@@ -86,7 +90,10 @@ do { \
 static inline int
 __mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
 {
-	int prev = atomic_xchg(count, 0);
+	int prev;
+
+	smp_mb__before_atomic_dec();
+	prev = atomic_xchg(count, 0);
 
 	if (unlikely(prev < 0)) {
 		/*

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 14:40 ` Ingo Molnar @ 2006-01-05 16:21 ` Linus Torvalds 2006-01-05 22:03 ` Ingo Molnar 2006-01-06 7:34 ` Denis Vlasenko 0 siblings, 2 replies; 34+ messages in thread From: Linus Torvalds @ 2006-01-05 16:21 UTC (permalink / raw) To: Ingo Molnar Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard On Thu, 5 Jan 2006, Ingo Molnar wrote: > > the patch below adds the barriers to the asm-generic mutex routines, so > it's not like i'm lazy ;), but i really think this is unnecessary. > Adding this patch would add a second, unnecessary barrier for all the > arches that have barrier-less atomic ops. > > it also makes sense: the moment you are interested in the 'previous > value' of the atomic counter in an atomic fashion, you very likely want > to use it for a critical section. (e.g. all the put-the-resource ops > that use atomic_dec_test() rely on this implicit barrier.) Ok, fair enough. However, that still leaves the question of which way the barrier works. Traditionally, we have only cared about one thing: that all preceding writes have finished, because the "atomic_dec_return" thing is used as a _reference_counter_, and we're going to release the thing. However, that's not the case in a mutex. A mutex locking operation works exactly the other way around: it doesn't really care about the previous writes at all, since those operations were unlocked. It cares about the _subsequent_ writes, since those have to be seen by others as being in the critical region, and never be seen as happening before the lock. So I _think_ your argument is bogus, and your patch is bogus. The use of "atomic_dec_return()" in a mutex is _not_ the same barrier as using it for reference counting. Not at all. 
Memory barriers aren't just one thing: they are semi-permeable things in
two different directions and with two different operations: there are
several different kinds of them.

> #define __mutex_fastpath_lock(count, fail_fn)			\
> do {								\
> +	smp_mb__before_atomic_dec();				\
> 	if (unlikely(atomic_dec_return(count) < 0))		\
> 		fail_fn(count);					\
> } while (0)

So I think the barrier has to come _after_ the atomic decrement (or
exchange).

Because as it is written now, any writes in the locked region could
percolate up to just before the atomic dec - ie _outside_ the region.
Which is against the whole point of a lock - it would allow another CPU
to see the write even before it sees that the lock was successful, as
far as I can tell.

But memory ordering is subtle, so maybe I'm missing something..

		Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 16:21 ` Linus Torvalds @ 2006-01-05 22:03 ` Ingo Molnar 2006-01-05 22:17 ` Linus Torvalds 2006-01-06 7:34 ` Denis Vlasenko 1 sibling, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2006-01-05 22:03 UTC (permalink / raw) To: Linus Torvalds Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard * Linus Torvalds <torvalds@osdl.org> wrote: > On Thu, 5 Jan 2006, Ingo Molnar wrote: > > > > the patch below adds the barriers to the asm-generic mutex routines, so > > it's not like i'm lazy ;), but i really think this is unnecessary. > > Adding this patch would add a second, unnecessary barrier for all the > > arches that have barrier-less atomic ops. > > > > it also makes sense: the moment you are interested in the 'previous > > value' of the atomic counter in an atomic fashion, you very likely want > > to use it for a critical section. (e.g. all the put-the-resource ops > > that use atomic_dec_test() rely on this implicit barrier.) > > Ok, fair enough. However, that still leaves the question of which way > the barrier works. Traditionally, we have only cared about one thing: > that all preceding writes have finished, because the > "atomic_dec_return" thing is used as a _reference_counter_, and we're > going to release the thing. yeah, i think you are right. Here's a detailed analysis of why you are right about atomic_dec_return(): there are 8 types of instruction reordering that can happen at the beginning and at the end of a critical section. 
Firstly, here's the programmed order of instructions:

	pre-read
	pre-write
	[critical-START]
	critical-read
	critical-write
	[critical-END]
	post-read
	post-write

the following reorderings are possible:

	a pre-read crossing forwards into (and over) the critical section:    OK
	a pre-write crossing forwards into (and over) the critical section:   OK
	a critical-read crossing backwards across critical-START:             BAD
	a critical-write crossing backwards across critical-START:            BAD
	a critical-read crossing forwards across critical-END:                BAD
	a critical-write crossing forwards across critical-END:               BAD
	a post-read crossing backwards into (and over) the critical section:  OK
	a post-write crossing backwards into (and over) the critical section: OK

so critical-START needs to be a read and a write barrier for reads and
writes happening after it. I.e. it's a memory barrier that only lets
instruction reordering happen in a forward direction, not in a backwards
direction. critical-END needs to be a read and write barrier that only
lets instructions into the critical section in a backwards direction -
and it's a full memory barrier otherwise.

AFAICS, currently we don't have such a 'half-conductor / diode' memory
barrier primitive: smp_mb() is a full barrier for both directions. But
let's assume they existed, and let's call them smp_mb_forwards() and
smp_mb_backwards().

furthermore, the locking and unlocking instruction must not cross into
the critical section, so the lock sequence must be at least:

	lock
	smp_rmb_for_lock_forwards()
	smp_mb_backwards()
	... critical section ...
	smp_mb_forwards()
	smp_wmb_for_lock_backwards()
	unlock

and yet another requirement is that two subsequent critical sections for
the same lock must not reorder the 'lock' and 'unlock' instructions:

	smp_mb_forwards()
	unlock
	... unlocked code ...
	lock
	smp_mb_backwards()

i.e. 'lock' and 'unlock' must not switch places.
So the most relaxed list of requirements would be:

	smp_wmb_for_lock_backwards()
	lock
	smp_wmb_for_lock_forwards()
	smp_mb_backwards()
	... critical section ...
	smp_mb_forwards()
	smp_wmb_for_lock_backwards()
	unlock

i also think this is the 'absolute minimum memory ordering requirement'
for critical sections: relaxing this any further is not possible without
breaking critical sections. i doubt many (in fact, any) CPUs are capable
of expressing it in such a fine-grained way. With our existing
primitives, probably the closest one would be:

	lock
	smp_mb();
	...
	smp_mb();
	unlock

as long as the CPU always executes the lock and unlock stores (which go
to the same address) in program order. (is there any CPU that doesn't do
that?)

in that sense, both atomic_dec_return() and atomic_inc_return() are
indeed incorrect (for the use of mutexes) e.g. on ppc64. They are both
done via:

	eieio
	... atomic-dec ...
	isync

eieio is stronger than smp_wmb() - it is a barrier for system memory and
IO space memory writes. isync is a read barrier - it throws away all
speculative register contents. So it is roughly equivalent to:

	smp_wmb();
	... atomic-dec ...
	smp_rmb();

this fulfills the requirement of the critical section not leaking out of
the lock sequence itself, but (if i got the ppc64 semantics right) it
doesn't prevent a write within the critical section from crossing out
over the smp_rmb() and getting reordered with the atomic-dec - violating
the critical section rules.

some other architectures are safe by accident, e.g. Alpha's
atomic_dec_return() does:

	smp_mb();
	... atomic-dec ...
	smp_mb();

which is overkill: a full read and write barrier on both sides.
Sparc64's atomic_dec_return() does yet another thing:

	membar StoreLoad | LoadLoad
	... atomic-load ...
	... atomic-conditional-store ...
	membar StoreLoad | StoreStore

AFAICS this violates the requirements: a load from within the critical
section may go before the atomic-conditional-store, and may thus be
executed before the critical section acquires the lock.

on MIPS, atomic_dec_return() does what is equivalent to:

	... atomic-dec ...
	smp_mb();

which is fine for a lock sequence, but atomic_inc_return() is not
adequate for an unlock sequence:

	... atomic-inc ...
	smp_mb();

because this allows reads and writes within the critical section to be
reordered with the atomic-inc instruction.

to sum it up: atomic_dec/inc_return() alone is not enough to implement
critical sections, on a number of architectures. atomic_xchg() seems to
have similar problems too.

the patch below adds the smp_mb() barriers to the generic headers, which
should now fulfill all the ordering requirements, on every architecture.
It only relies on one property of the atomic primitives: that they won't
get reordered with respect to themselves, so an atomic_inc_ret() and an
atomic_dec_ret() cannot switch places.

Can you see any hole in this reasoning?

	Ingo

Index: linux/include/asm-generic/mutex-dec.h
===================================================================
--- linux.orig/include/asm-generic/mutex-dec.h
+++ linux/include/asm-generic/mutex-dec.h
@@ -21,6 +21,8 @@
 do {							\
 	if (unlikely(atomic_dec_return(count) < 0))	\
 		fail_fn(count);				\
+	else						\
+		smp_mb();				\
 } while (0)
 
 /**
@@ -38,8 +40,10 @@ __mutex_fastpath_lock_retval(atomic_t *c
 {
 	if (unlikely(atomic_dec_return(count) < 0))
 		return fail_fn(count);
-	else
+	else {
+		smp_mb();
 		return 0;
+	}
 }
 
 /**
@@ -57,6 +61,7 @@ __mutex_fastpath_lock_retval(atomic_t *c
  */
 #define __mutex_fastpath_unlock(count, fail_fn)		\
 do {							\
+	smp_mb();					\
 	if (unlikely(atomic_inc_return(count) <= 0))	\
 		fail_fn(count);				\
 } while (0)
@@ -92,8 +97,10 @@ __mutex_fastpath_trylock(atomic_t *count
 	 * the mutex state would be.
 	 */
 #ifdef __HAVE_ARCH_CMPXCHG
-	if (likely(atomic_cmpxchg(count, 1, 0)) == 1)
+	if (likely(atomic_cmpxchg(count, 1, 0)) == 1) {
+		smp_mb();
 		return 1;
+	}
 	return 0;
 #else
 	return fail_fn(count);
Index: linux/include/asm-generic/mutex-xchg.h
===================================================================
--- linux.orig/include/asm-generic/mutex-xchg.h
+++ linux/include/asm-generic/mutex-xchg.h
@@ -26,6 +26,8 @@
 do {							\
 	if (unlikely(atomic_xchg(count, 0) != 1))	\
 		fail_fn(count);				\
+	else						\
+		smp_mb();				\
 } while (0)
@@ -44,8 +46,10 @@ __mutex_fastpath_lock_retval(atomic_t *c
 {
 	if (unlikely(atomic_xchg(count, 0) != 1))
 		return fail_fn(count);
-	else
+	else {
+		smp_mb();
 		return 0;
+	}
 }
 
 /**
@@ -62,6 +66,7 @@ __mutex_fastpath_lock_retval(atomic_t *c
  */
 #define __mutex_fastpath_unlock(count, fail_fn)		\
 do {							\
+	smp_mb();					\
 	if (unlikely(atomic_xchg(count, 1) != 0))	\
 		fail_fn(count);				\
 } while (0)
@@ -104,6 +109,7 @@ __mutex_fastpath_trylock(atomic_t *count
 		if (prev < 0)
 			prev = 0;
 	}
+	smp_mb();
 	return prev;
 }

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 22:03 ` Ingo Molnar @ 2006-01-05 22:17 ` Linus Torvalds 2006-01-05 22:43 ` Ingo Molnar 0 siblings, 1 reply; 34+ messages in thread From: Linus Torvalds @ 2006-01-05 22:17 UTC (permalink / raw) To: Ingo Molnar Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard

On Thu, 5 Jan 2006, Ingo Molnar wrote:
>
> [ long details removed ]
>
> to sum it up: atomic_dec/inc_return() alone is not enough to implement
> critical sections, on a number of architectures. atomic_xchg() seems to
> have similar problems too.

Yes.

> the patch below adds the smp_mb() barriers to the generic headers, which
> should now fulfill all the ordering requirements, on every architecture.
> It only relies on one property of the atomic primitives: that they won't
> get reordered with respect to themselves, so an atomic_inc_ret() and an
> atomic_dec_ret() cannot switch places.
>
> Can you see any hole in this reasoning?

No.

The alternative is to just make the ordering requirements for
"atomic_dec_return()" and "atomic_xchg()" be absolute. Say that they
have to be full memory barriers, and push the problem into the low-level
architecture.

I _think_ your patch is the right approach, because most architectures
are likely to do their own fast-paths for mutexes, and as such the
generic ones are more of a template for how to do it, and hopefully
aren't that performance critical.

		Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 22:17 ` Linus Torvalds @ 2006-01-05 22:43 ` Ingo Molnar 2006-01-06 3:49 ` Keith Owens 0 siblings, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2006-01-05 22:43 UTC (permalink / raw) To: Linus Torvalds Cc: Nicolas Pitre, Joel Schopp, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard

* Linus Torvalds <torvalds@osdl.org> wrote:

> I _think_ your patch is the right approach, because most architectures
> are likely to do their own fast-paths for mutexes, and as such the
> generic ones are more of a template for how to do it, and hopefully
> aren't that performance critical.

yeah, i think so too. We've got 3 architectures done in assembly so far,
and it seems people like optimizing such code. Also, since the generic
code does all the boring slowpath stuff, the architecture can
concentrate on the fun part alone: to make the fastpath really fast. The
generic code is still in full control of all the mutex semantics, and
can ignore asm/mutex.h when it wants/needs to. So i'm quite happy with
the current design, and i'm not against more per-arch assembly fun at
all.

there's one exception i think: atomic-xchg.h was pretty optimal on ARM,
and i'd expect it to be pretty optimal on the other atomic-swap
platforms too. So maybe we should specify atomic_xchg() to _not_ imply a
full barrier - it's a new API anyway. We cannot embed the barrier within
atomic_xchg(), because the barrier is needed at different ends for lock
and for unlock, and adding two barriers would be unnecessary.

asm-generic/mutex-dec.h is less optimal (and thus less critical), and i
can see no easy way to modify it, because i think it would be quite
confusing to enforce 'lock' ordering for atomic_dec_return() and
'unlock' ordering for atomic_inc_return().
We cannot remove the existing barriers either (and add them explicitly), because there are existing users of these primitives. (although we could add explicit barriers to those too - but probably not worth the trouble) Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 22:43 ` Ingo Molnar @ 2006-01-06 3:49 ` Keith Owens 0 siblings, 0 replies; 34+ messages in thread From: Keith Owens @ 2006-01-06 3:49 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Nicolas Pitre, Joel Schopp, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard

Ingo Molnar (on Thu, 5 Jan 2006 23:43:57 +0100) wrote:
>there's one exception i think: atomic-xchg.h was pretty optimal on ARM,
>and i'd expect it to be pretty optimal on the other atomic-swap
>platforms too. So maybe we should specify atomic_xchg() to _not_ imply a
>full barrier - it's a new API anyway. We cannot embed the barrier
>within atomic_xchg(), because the barrier is needed at different ends
>for lock and for unlock, and adding two barriers would be unnecessary.

IA64 defines two qualifiers for cmpxchg, specifically for distinguishing
between acquiring and releasing the lock.

	cmpxchg<sz>.<sem>

<sz> is the data size, 1, 2, 4 or 8. <sem> is one of 'acq' or 'rel':

	sem        Ordering    Semaphore Operation
	Completer  Semantics
	---------  ---------   --------------------------------------------
	acq        Acquire     The memory read/write is made visible prior
	                       to all subsequent data memory accesses.
	rel        Release     The memory read/write is made visible after
	                       all previous data memory accesses.

cmpxchg4.acq prevents following data accesses from migrating before
taking the lock (a critical R/W cannot precede critical-START).
cmpxchg4.rel prevents preceding data accesses from migrating after
releasing the lock (a critical R/W cannot follow critical-END).

I suggest adding acq and rel hints to atomic_xchg, and letting
architectures that implement suitable operations use those hints.

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 16:21 ` Linus Torvalds 2006-01-05 22:03 ` Ingo Molnar @ 2006-01-06 7:34 ` Denis Vlasenko 1 sibling, 0 replies; 34+ messages in thread From: Denis Vlasenko @ 2006-01-06 7:34 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Nicolas Pitre, Joel Schopp, lkml, Andrew Morton, Arjan van de Ven, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard On Thursday 05 January 2006 18:21, Linus Torvalds wrote: > On Thu, 5 Jan 2006, Ingo Molnar wrote: > > > > the patch below adds the barriers to the asm-generic mutex routines, so > > it's not like i'm lazy ;), but i really think this is unnecessary. > > Adding this patch would add a second, unnecessary barrier for all the > > arches that have barrier-less atomic ops. > > > > it also makes sense: the moment you are interested in the 'previous > > value' of the atomic counter in an atomic fashion, you very likely want > > to use it for a critical section. (e.g. all the put-the-resource ops > > that use atomic_dec_test() rely on this implicit barrier.) > > So I _think_ your argument is bogus, and your patch is bogus. The use of > "atomic_dec_return()" in a mutex is _not_ the same barrier as using it for > reference counting. Not at all. Memory barriers aren't just one thing: > they are semi-permeable things in two different directions and with two > different operations: there are several different kinds of them. > > > #define __mutex_fastpath_lock(count, fail_fn) \ > > do { \ > > + smp_mb__before_atomic_dec(); \ > > if (unlikely(atomic_dec_return(count) < 0)) \ > > fail_fn(count); \ > > } while (0) > > So I think the barrier has to come _after_ the atomic decrement (or > exchange). > > Because as it is written now, any writes in the locked region could > percolate up to just before the atomic dec - ie _outside_ the region. 
> Which is against the whole point of a lock - it would allow another CPU to > see the write even before it sees that the lock was successful, as far as > I can tell. > > But memory ordering is subtle, so maybe I'm missing something.. We mere humans^W device driver people get more confused with barriers every day, as CPUs get more subtle in their out-of-order-ness. I think adding longer-named-but-self-explanatory aliases for memory and io barrier functions can help. mmiowb => barrier_memw_iow .... => barrier_memw_memw (a store-store barrier to mem) .... General template for the name may be something like [smp]barrier_{mem,io,memio}{r,w,rw}_{mem,io,memio}{r,w,rw} Are there even more subtle cases? -- vda ^ permalink raw reply [flat|nested] 34+ messages in thread
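Linus's placement argument can be restated with C11 fences. A sketch of the lock fastpath (illustrative only; the real asm-generic code is written differently): the ordering must sit after the atomic decrement, so that stores inside the critical section cannot percolate up past the point where the lock is known to be held.

```c
#include <stdatomic.h>
#include <assert.h>

/* Returns nonzero if the uncontended fastpath took the lock
 * (count goes 1 -> 0).  The acquire fence comes AFTER the atomic op,
 * per Linus's argument; a barrier before it would leave critical-
 * section accesses free to leak out above the decrement. */
static int locked_fastpath(atomic_int *count)
{
	int old = atomic_fetch_sub_explicit(count, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	return old > 0;
}
```

With the fence on the far side of the decrement, the proposed smp_mb__before_atomic_dec() in the quoted patch would indeed be on the wrong side for lock() (though correct as a release-style barrier before the increment in unlock()).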
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-04 23:45 ` Joel Schopp 2006-01-05 2:38 ` Nicolas Pitre @ 2006-01-05 14:35 ` Ingo Molnar 2006-01-05 16:42 ` Joel Schopp 1 sibling, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2006-01-05 14:35 UTC (permalink / raw) To: Joel Schopp Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard * Joel Schopp <jschopp@austin.ibm.com> wrote: > Took a glance at this on ppc64. Would it be useful if I contributed > an arch specific version like arm has? We'll either need an arch > specific version or have the generic changed. feel free to implement an assembly mutex fastpath, and it would certainly be welcome and useful - but i think you are wrong about SMP synchronization: > Anyway, here is some disassembly of some of the code generated with my > comments: > > c00000000049bf9c <.mutex_lock>: > c00000000049bf9c: 7c 00 06 ac eieio > c00000000049bfa0: 7d 20 18 28 lwarx r9,r0,r3 > c00000000049bfa4: 31 29 ff ff addic r9,r9,-1 > The eieio is completly unnecessary, it got picked up from > atomic_dec_return (Anton, why is there an eieio at the start of > atomic_dec_return in the first place?). a mutex is like a spinlock, it must prevent loads and stores within the critical section from 'leaking outside the critical section' [they must not be reordered to before the mutex_lock(), nor to after the mutex_unlock()] - hence the barriers added by atomic_dec_return() are very much needed. Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 14:35 ` Ingo Molnar @ 2006-01-05 16:42 ` Joel Schopp 2006-01-05 22:21 ` Ingo Molnar 0 siblings, 1 reply; 34+ messages in thread From: Joel Schopp @ 2006-01-05 16:42 UTC (permalink / raw) To: Ingo Molnar Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard >>Anyway, here is some disassembly of some of the code generated with my >>comments: >> >>c00000000049bf9c <.mutex_lock>: >>c00000000049bf9c: 7c 00 06 ac eieio >>c00000000049bfa0: 7d 20 18 28 lwarx r9,r0,r3 >>c00000000049bfa4: 31 29 ff ff addic r9,r9,-1 > > >>The eieio is completely unnecessary, it got picked up from >>atomic_dec_return (Anton, why is there an eieio at the start of >>atomic_dec_return in the first place?). > > > a mutex is like a spinlock, it must prevent loads and stores within the > critical section from 'leaking outside the critical section' [they must > not be reordered to before the mutex_lock(), nor to after the > mutex_unlock()] - hence the barriers added by atomic_dec_return() are > very much needed. The bne- and isync together form a sufficient import barrier. See PowerPC Book2 Appendix B.2.1.1 And if the eieio was necessary it should come after, not before, twiddling the lock bits. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 16:42 ` Joel Schopp @ 2006-01-05 22:21 ` Ingo Molnar 2006-01-05 23:06 ` Joel Schopp 0 siblings, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2006-01-05 22:21 UTC (permalink / raw) To: Joel Schopp Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard * Joel Schopp <jschopp@austin.ibm.com> wrote: > The bne- and isync together form a sufficient import barrier. See > PowerPC Book2 Appendix B.2.1.1 ok. Please correct me if i'm wrong: the question is, could we on ppc64 use atomic_dec_return() for mutex_lock(), and atomic_inc_return() for mutex_unlock(). atomic_dec_return() does: EIEIO_ON_SMP "1: lwarx %0,0,%1 # atomic_dec_return\n\ addic %0,%0,-1\n" PPC405_ERR77(0,%1) " stwcx. %0,0,%1\n\ bne- 1b" ISYNC_ON_SMP the EIEIO_ON_SMP is in essence smp_wmb(), correct? (it's a bit stronger because it also orders IO-space writes, but it doesnt impose any restrictions on reads) ISYNC_ON_SMP flushes all speculative reads currently in the queue - and is hence a smp_rmb_backwards() primitive [per my previous mail] - but does not affect writes - correct? if that's the case, what prevents a store from within the critical section going up to right after the EIEIO_ON_SMP, but before the atomic-dec instructions? Does any of those instructions imply some barrier perhaps? Are writes always ordered perhaps (like on x86 CPUs), and hence the store before the bne is an effective write-barrier? Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 22:21 ` Ingo Molnar @ 2006-01-05 23:06 ` Joel Schopp 2006-01-05 23:26 ` Linus Torvalds 2006-01-06 0:29 ` Olof Johansson 0 siblings, 2 replies; 34+ messages in thread From: Joel Schopp @ 2006-01-05 23:06 UTC (permalink / raw) To: Ingo Molnar Cc: lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard, PPC64-dev [-- Attachment #1: Type: text/plain, Size: 1211 bytes --] > ISYNC_ON_SMP flushes all speculative reads currently in the queue - and > is hence a smp_rmb_backwards() primitive [per my previous mail] - but > does not affect writes - correct? > > if that's the case, what prevents a store from within the critical > section going up to right after the EIEIO_ON_SMP, but before the > atomic-dec instructions? Does any of those instructions imply some > barrier perhaps? Are writes always ordered perhaps (like on x86 CPUs), > and hence the store before the bne is an effective write-barrier? It really makes more sense after reading PowerPC Book II, which you can find at this link, it was written by people who explain this for a living: http://www-128.ibm.com/developerworks/eserver/articles/archguide.html While isync technically doesn't order stores it does order instructions. The previous bne- must complete, that bne- is dependent on the previous stwcx being complete. So no stores are slipping up. To get a better explanation you will have to read the document yourself. Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC. It is completely untested, but compiles. 
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> [-- Attachment #2: powerpcmutex.patch --] [-- Type: text/plain, Size: 2812 bytes --] Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h =================================================================== --- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h 2006-01-04 14:46:31.%N -0600 +++ 2.6.15-mutex14/include/asm-powerpc/mutex.h 2006-01-05 16:25:41.%N -0600 @@ -1,9 +1,83 @@ /* - * Pull in the generic implementation for the mutex fastpath. + * include/asm-powerpc/mutex.h * - * TODO: implement optimized primitives instead, or leave the generic - * implementation in place, or pick the atomic_xchg() based generic - * implementation. (see asm-generic/mutex-xchg.h for details) + * PowerPC optimized mutex locking primitives + * + * Please look into asm-generic/mutex-xchg.h for a formal definition. + * Copyright (C) 2006 Joel Schopp <jschopp@austin.ibm.com>, IBM */ +#ifndef _ASM_MUTEX_H +#define _ASM_MUTEX_H +#define __mutex_fastpath_lock(count, fail_fn)\ +do{ \ + long tmp; \ + __asm__ __volatile__( \ +"1: lwarx %0,0,%1\n" \ +" addic %0,%0,-1\n" \ +" stwcx. %0,0,%1\n" \ +" bne- 1b\n" \ +" isync \n" \ + : "=&r" (tmp) \ + : "r" (&(count)->counter) \ + : "cr0", "memory"); \ + if (unlikely(tmp < 0)) \ + fail_fn(count); \ +} while (0) + +#define __mutex_fastpath_unlock(count, fail_fn)\ +do{ \ + long tmp; \ + __asm__ __volatile__(SYNC_ON_SMP \ +"1: lwarx %0,0,%1\n" \ +" addic %0,%0,1\n" \ +" stwcx. %0,0,%1\n" \ +" bne- 1b\n" \ + : "=&r" (tmp) \ + : "r" (&(count)->counter) \ + : "cr0", "memory"); \ + if (unlikely(tmp <= 0)) \ + fail_fn(count); \ +} while (0) + + +static inline int +__mutex_fastpath_trylock(atomic_t* count, int (*fail_fn)(atomic_t*)) +{ + long tmp; + __asm__ __volatile__( +"1: lwarx %0,0,%1\n" +" cmpwi 0,%0,1\n" +" bne- 2f\n" +" stwcx. 
%0,0,%1\n" +" bne- 1b\n" +" isync\n" +"2:" + : "=&r" (tmp) + : "r" (&(count)->counter) + : "cr0", "memory"); + + return (int)tmp; + +} + +#define __mutex_slowpath_needs_to_unlock() 1 -#include <asm-generic/mutex-dec.h> +static inline int +__mutex_fastpath_lock_retval(atomic_t* count, int (*fail_fn)(atomic_t *)) +{ + long tmp; + __asm__ __volatile__( +"1: lwarx %0,0,%1\n" +" addic %0,%0,-1\n" +" stwcx. %0,0,%1\n" +" bne- 1b\n" +" isync \n" + : "=&r" (tmp) + : "r" (&(count)->counter) + : "cr0", "memory"); + if (unlikely(tmp < 0)) + return fail_fn(count); + else + return 0; +} +#endif ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 23:06 ` Joel Schopp @ 2006-01-05 23:26 ` Linus Torvalds 2006-01-05 23:36 ` Joel Schopp 2006-01-06 0:29 ` Olof Johansson 1 sibling, 1 reply; 34+ messages in thread From: Linus Torvalds @ 2006-01-05 23:26 UTC (permalink / raw) To: Joel Schopp Cc: Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard, PPC64-dev On Thu, 5 Jan 2006, Joel Schopp wrote: > > Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC. > It is completely untested, but compiles. Shouldn't you make that "isync" dependent on SMP too? UP doesn't need it, since DMA will never matter, and interrupts are precise. Linus ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 23:26 ` Linus Torvalds @ 2006-01-05 23:36 ` Joel Schopp 2006-01-05 23:42 ` Ingo Molnar 0 siblings, 1 reply; 34+ messages in thread From: Joel Schopp @ 2006-01-05 23:36 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, lkml, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard, PPC64-dev >>Here is a first pass at a powerpc file for the fast paths just as an FYI/RFC. >>It is completely untested, but compiles. > > > Shouldn't you make that "isync" dependent on SMP too? UP doesn't need it, > since DMA will never matter, and interrupts are precise. > > Linus > I think the isync is necessary to keep heavily out of order processors from getting ahead of themselves even on UP. Scanning back through the powerpc spinlock code they seem to take the same view there as well. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 23:36 ` Joel Schopp @ 2006-01-05 23:42 ` Ingo Molnar 0 siblings, 0 replies; 34+ messages in thread From: Ingo Molnar @ 2006-01-05 23:42 UTC (permalink / raw) To: Joel Schopp Cc: Linus Torvalds, lkml, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard, PPC64-dev * Joel Schopp <jschopp@austin.ibm.com> wrote: > > Shouldn't you make that "isync" dependent on SMP too? UP doesn't > > need it, since DMA will never matter, and interrupts are precise. > > I think the isync is necessary to keep heavily out of order processors > from getting ahead of themselves even on UP. Scanning back through > the powerpc spinlock code they seem to take the same view there as > well. the asm/spinlock.h ops are only built on SMP kernels. mutex.h is for both UP and SMP. On UP you should need no synchronization, because the only way another context could interfere with your critical section is by getting interrupted, and interrupts are fully synchronizing, right? On UP the only synchronization needed is when a device reads/writes memory in parallel to the CPU. Ingo ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [patch 00/21] mutex subsystem, -V14 2006-01-05 23:06 ` Joel Schopp 2006-01-05 23:26 ` Linus Torvalds @ 2006-01-06 0:29 ` Olof Johansson 2006-01-07 17:49 ` PowerPC fastpaths for mutex subsystem Joel Schopp 1 sibling, 1 reply; 34+ messages in thread From: Olof Johansson @ 2006-01-06 0:29 UTC (permalink / raw) To: Joel Schopp Cc: Ingo Molnar, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard, PPC64-dev On Thu, Jan 05, 2006 at 05:06:26PM -0600, Joel Schopp wrote: > Here is a first pass at a powerpc file for the fast paths just as an > FYI/RFC. It is completely untested, but compiles. You really should test it, it saves reviewers time. It's not that hard to at least try booting it. Besides the isync comments earlier, there's a bunch of whitespace issues going on. Did you copy and paste the code from somewhere? If so, you should move the original copyright over too. All your macros use spaces instead of tabs up to the \, should be changed. All tmp variables should be ints, since the atomic_t counter is a 32-bit variable. If you use longs, and lwarx (loads 32-bit without sign extend), the comparison with < 0 will never be true. > Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h > =================================================================== > --- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h 2006-01-04 14:46:31.%N -0600 > +++ 2.6.15-mutex14/include/asm-powerpc/mutex.h 2006-01-05 16:25:41.%N -0600 > @@ -1,9 +1,83 @@ > /* > - * Pull in the generic implementation for the mutex fastpath. > + * include/asm-powerpc/mutex.h No need to keep filenames in files. > * > - * TODO: implement optimized primitives instead, or leave the generic > - * implementation in place, or pick the atomic_xchg() based generic > - * implementation. 
(see asm-generic/mutex-xchg.h for details) > + * PowerPC optimized mutex locking primitives > + * > + * Please look into asm-generic/mutex-xchg.h for a formal definition. > + * Copyright (C) 2006 Joel Schopp <jschopp@austin.ibm.com>, IBM > */ > +#ifndef _ASM_MUTEX_H > +#define _ASM_MUTEX_H > +#define __mutex_fastpath_lock(count, fail_fn)\ > +do{ \ > + long tmp; \ > + __asm__ __volatile__( \ > +"1: lwarx %0,0,%1\n" \ > +" addic %0,%0,-1\n" \ > +" stwcx. %0,0,%1\n" \ > +" bne- 1b\n" \ > +" isync \n" \ > + : "=&r" (tmp) \ > + : "r" (&(count)->counter) \ > + : "cr0", "memory"); \ > + if (unlikely(tmp < 0)) \ > + fail_fn(count); \ > +} while (0) trailing whitespace > + > +#define __mutex_fastpath_unlock(count, fail_fn)\ > +do{ \ > + long tmp; \ > + __asm__ __volatile__(SYNC_ON_SMP \ > +"1: lwarx %0,0,%1\n" \ > +" addic %0,%0,1\n" \ > +" stwcx. %0,0,%1\n" \ > +" bne- 1b\n" \ space vs tab > + : "=&r" (tmp) \ > + : "r" (&(count)->counter) \ > + : "cr0", "memory"); \ > + if (unlikely(tmp <= 0)) \ > + fail_fn(count); \ > +} while (0) > + > + > +static inline int trailing whitespace > +__mutex_fastpath_trylock(atomic_t* count, int (*fail_fn)(atomic_t*)) atomic_t *count > +{ > + long tmp; > + __asm__ __volatile__( > +"1: lwarx %0,0,%1\n" > +" cmpwi 0,%0,1\n" > +" bne- 2f\n" > +" stwcx. %0,0,%1\n" space vs tab on the above 4 lines Shouldn't you decrement the counter before the store? > +" bne- 1b\n" > +" isync\n" > +"2:" > + : "=&r" (tmp) > + : "r" (&(count)->counter) > + : "cr0", "memory"); > + > + return (int)tmp; > + > +} > + > +#define __mutex_slowpath_needs_to_unlock() 1 > > -#include <asm-generic/mutex-dec.h> > +static inline int trailing whitespace > +__mutex_fastpath_lock_retval(atomic_t* count, int (*fail_fn)(atomic_t *)) atomic_t *count > +{ > + long tmp; counter is a 32-bit variable, so should tmp be otherwise the < 0 comparison can never be true. > + __asm__ __volatile__( > +"1: lwarx %0,0,%1\n" > +" addic %0,%0,-1\n" > +" stwcx. 
%0,0,%1\n" > +" bne- 1b\n" > +" isync \n" > + : "=&r" (tmp) > + : "r" (&(count)->counter) > + : "cr0", "memory"); > + if (unlikely(tmp < 0)) > + return fail_fn(count); > + else > + return 0; > +} > +#endif ^ permalink raw reply [flat|nested] 34+ messages in thread
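Olof's int-vs-long point deserves a concrete illustration: lwarx performs a 32-bit load with no sign extension, so on a 64-bit build a counter value of -1 lands in the register as 0xffffffff, which a 64-bit signed comparison treats as positive. A small C model of this (assuming an LP64 target, where long is 64 bits; the function name is invented):

```c
#include <assert.h>
#include <stdint.h>

/* Models what lwarx delivers into a 64-bit GPR: the 32-bit counter,
 * zero-extended.  With tmp declared 'long', (tmp < 0) can never be
 * true, which is the bug Olof is pointing out; with tmp declared
 * 'int', the compiler compares on the low 32 bits and the check
 * behaves as intended. */
static long as_lwarx_would_load(int32_t counter)
{
	return (long)(uint32_t)counter;	/* zero-extended, like lwarx */
}
```

So a counter of -1 comes back as 4294967295, and the fastpath would never call fail_fn(); declaring tmp as int restores the correct signed comparison.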
* PowerPC fastpaths for mutex subsystem 2006-01-06 0:29 ` Olof Johansson @ 2006-01-07 17:49 ` Joel Schopp 2006-01-07 22:37 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 34+ messages in thread From: Joel Schopp @ 2006-01-07 17:49 UTC (permalink / raw) To: Olof Johansson Cc: Ingo Molnar, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard, PPC64-dev [-- Attachment #1: Type: text/plain, Size: 683 bytes --] This is the second pass at optimizing the fastpath for the new mutex subsystem on PowerPC. I think it is ready to be included in the series with the other mutex patches now. Tested on a 4 core (2 SMT threads/core) Power5 machine with gcc 3.3.2. Test results from synchro-test.ko: All tests run for default 5 seconds Threads semaphores mutexes mutexes+attached 1 63,465,364 58,404,630 62,109,571 4 58,424,282 35,541,297 37,820,794 8 40,731,668 35,541,297 40,281,768 16 38,372,769 37,256,298 41,751,764 32 38,406,895 36,933,675 38,731,571 64 37,232,017 36,222,480 40,766,379 Signed-off-by: Joel Schopp <jschopp@austin.ibm.com> [-- Attachment #2: powerpcmutex.patch --] [-- Type: text/plain, Size: 2347 bytes --] Index: 2.6.15-mutex14/include/asm-powerpc/mutex.h =================================================================== --- 2.6.15-mutex14.orig/include/asm-powerpc/mutex.h 2006-01-04 14:46:31.%N -0600 +++ 2.6.15-mutex14/include/asm-powerpc/mutex.h 2006-01-06 17:36:09.%N -0600 @@ -1,9 +1,84 @@ /* - * Pull in the generic implementation for the mutex fastpath. + * include/asm-powerpc/mutex.h * - * TODO: implement optimized primitives instead, or leave the generic - * implementation in place, or pick the atomic_xchg() based generic - * implementation. (see asm-generic/mutex-xchg.h for details) + * PowerPC optimized mutex locking primitives + * + * Please look into asm-generic/mutex-xchg.h for a formal definition. 
+ * Copyright (C) 2006 Joel Schopp <jschopp@austin.ibm.com>, IBM */ +#ifndef _ASM_MUTEX_H +#define _ASM_MUTEX_H +#define __mutex_fastpath_lock(count, fail_fn)\ +do{ \ + int tmp; \ + __asm__ __volatile__( \ +"1: lwarx %0,0,%1\n" \ +" addic %0,%0,-1\n" \ +" stwcx. %0,0,%1\n" \ +" bne- 1b\n" \ + ISYNC_ON_SMP \ + : "=&r" (tmp) \ + : "r" (&(count)->counter) \ + : "cr0", "memory"); \ + if (unlikely(tmp < 0)) \ + fail_fn(count); \ +} while (0) + +#define __mutex_fastpath_unlock(count, fail_fn)\ +do{ \ + int tmp; \ + __asm__ __volatile__(SYNC_ON_SMP\ +"1: lwarx %0,0,%1\n" \ +" addic %0,%0,1\n" \ +" stwcx. %0,0,%1\n" \ +" bne- 1b\n" \ + : "=&r" (tmp) \ + : "r" (&(count)->counter) \ + : "cr0", "memory"); \ + if (unlikely(tmp <= 0)) \ + fail_fn(count); \ +} while (0) + + +static inline int +__mutex_fastpath_trylock(atomic_t* count, int (*fail_fn)(atomic_t*)) +{ + int tmp; + __asm__ __volatile__( +"1: lwarx %0,0,%1\n" +" cmpwi 0,%0,1\n" +" bne- 2f\n" +" addic %0,%0,-1\n" +" stwcx. %0,0,%1\n" +" bne- 1b\n" +" isync\n" +"2:" + : "=&r" (tmp) + : "r" (&(count)->counter) + : "cr0", "memory"); + + return (int)tmp; + +} + +#define __mutex_slowpath_needs_to_unlock() 1 -#include <asm-generic/mutex-dec.h> +static inline int +__mutex_fastpath_lock_retval(atomic_t* count, int (*fail_fn)(atomic_t *)) +{ + int tmp; + __asm__ __volatile__( +"1: lwarx %0,0,%1\n" +" addic %0,%0,-1\n" +" stwcx. %0,0,%1\n" +" bne- 1b\n" +" isync \n" + : "=&r" (tmp) + : "r" (&(count)->counter) + : "cr0", "memory"); + if (unlikely(tmp < 0)) + return fail_fn(count); + else + return 0; +} +#endif ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem 2006-01-07 17:49 ` PowerPC fastpaths for mutex subsystem Joel Schopp @ 2006-01-07 22:37 ` Andrew Morton 2006-01-08 7:43 ` Anton Blanchard 2006-01-08 9:48 ` Ingo Molnar 2006-01-08 10:43 ` Ingo Molnar 2 siblings, 1 reply; 34+ messages in thread From: Andrew Morton @ 2006-01-07 22:37 UTC (permalink / raw) To: Joel Schopp Cc: olof, mingo, linux-kernel, torvalds, arjan, nico, jes, viro, oleg, dhowells, alan, hch, ak, rmk+lkml, anton, linuxppc64-dev Joel Schopp <jschopp@austin.ibm.com> wrote: > > This is the second pass at optimizing the fastpath for the new mutex subsystem > on PowerPC. I think it is ready to be included in the series with the other > mutex patches now. Tested on a 4 core (2 SMT threads/core) Power5 machine with > gcc 3.3.2. > > Test results from synchro-test.ko: > > All tests run for default 5 seconds > Threads semaphores mutexes mutexes+attached > 1 63,465,364 58,404,630 62,109,571 > 4 58,424,282 35,541,297 37,820,794 > 8 40,731,668 35,541,297 40,281,768 > 16 38,372,769 37,256,298 41,751,764 > 32 38,406,895 36,933,675 38,731,571 > 64 37,232,017 36,222,480 40,766,379 > Doesn't this mean that the sped-up mutexes are still slower than semaphores? ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem 2006-01-07 22:37 ` Andrew Morton @ 2006-01-08 7:43 ` Anton Blanchard 2006-01-08 8:00 ` Andrew Morton 2006-01-09 11:13 ` David Howells 0 siblings, 2 replies; 34+ messages in thread From: Anton Blanchard @ 2006-01-08 7:43 UTC (permalink / raw) To: Andrew Morton Cc: Joel Schopp, jes, rmk+lkml, hch, linux-kernel, ak, torvalds, viro, linuxppc64-dev, mingo, nico, oleg, alan, arjan > Doesn't this mean that the sped-up mutexes are still slower than semaphores? Wasn't most of the x86 mutex gain a result of going from fair to unfair operation? The current ppc64 semaphores are unfair. Anton ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem 2006-01-08 7:43 ` Anton Blanchard @ 2006-01-08 8:00 ` Andrew Morton 2006-01-08 8:23 ` Anton Blanchard 2006-01-09 11:13 ` David Howells 1 sibling, 1 reply; 34+ messages in thread From: Andrew Morton @ 2006-01-08 8:00 UTC (permalink / raw) To: Anton Blanchard Cc: jschopp, jes, rmk+lkml, hch, linux-kernel, ak, torvalds, viro, linuxppc64-dev, mingo, nico, oleg, alan, arjan Anton Blanchard <anton@samba.org> wrote: > > > > Doesn't this mean that the sped-up mutexes are still slower than semaphores? > > Wasn't most of the x86 mutex gain a result of going from fair to unfair > operation? The current ppc64 semaphores are unfair. > What's "unfair"? Mutexes are FIFO, as are x86 semaphores. ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem 2006-01-08 8:00 ` Andrew Morton @ 2006-01-08 8:23 ` Anton Blanchard 0 siblings, 0 replies; 34+ messages in thread From: Anton Blanchard @ 2006-01-08 8:23 UTC (permalink / raw) To: Andrew Morton Cc: jes, rmk+lkml, ak, linux-kernel, hch, torvalds, viro, linuxppc64-dev, mingo, nico, oleg, alan, arjan > What's "unfair"? Mutexes are FIFO, as are x86 semaphores. The ppc64 semaphores don't force everyone into the slow path under contention. So you could drop and pick up the semaphore even with someone waiting. I thought that's how the new mutex code worked. Anton ^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem 2006-01-08 7:43 ` Anton Blanchard 2006-01-08 8:00 ` Andrew Morton @ 2006-01-09 11:13 ` David Howells 1 sibling, 0 replies; 34+ messages in thread From: David Howells @ 2006-01-09 11:13 UTC (permalink / raw) To: Andrew Morton Cc: Anton Blanchard, jes, rmk+lkml, ak, linux-kernel, hch, torvalds, viro, linuxppc64-dev, mingo, nico, oleg, alan, arjan Andrew Morton <akpm@osdl.org> wrote: > > Wasnt most of the x86 mutex gain a result of going from fair to unfair > > operation? The current ppc64 semaphores are unfair. > > > > What's "unfair"? Mutexes are FIFO, as are x86 semaphores. No, strictly Ingo's mutexes are neither completely fair nor completely FIFO. It's possible for a process to jump the queue because unlock() always sets the counter back to 1 before waking up the process at the front of the queue. This means that the lock() fastpath in another process may steal the mutex out of sequence before the wakee has a chance to grab it. I'm not 100% convinced that x86 counting semaphores are completely fair or completely FIFO. It's possible that they are because up() never arbitrarily sets the count back to >0. R/W semaphores are completely fair, but only as completely FIFO as the unfair spinlocks permit. This is because it's much easier to guarantee their behaviour (writer-starvation is a real problem with unfair rwsems). I have a simple implementation of totally fair spinlocks for x86 which would also work on anything that can emulate XADD, but I don't think it's worth the trouble. However, for Ingo's mutexes, I suspect this queue-jumping feature is sufficiently low probability that we can ignore it. It is theoretically possible to induce livelock, but in reality I think it extremely unlikely to happen for any significant length of time. David ^ permalink raw reply [flat|nested] 34+ messages in thread
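The queue-jump David describes can be made concrete with a deterministic, single-threaded model of the counter protocol (names invented; this compresses the real lock/unlock/wake sequence into sequential steps, so it is a model of the interleaving, not the implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* 1 = unlocked, 0 = locked, negative = locked with waiters, as in the
 * mutex fastpath counter. */
static int count = 1;

/* Fastpath lock attempt: decrement and check the old-value sign. */
static bool fastpath_lock(void)
{
	return --count >= 0;
}

/* unlock() sets the counter back to 1 *before* the woken waiter runs;
 * this window is what lets a third task's fastpath steal the mutex
 * out of FIFO order. */
static void unlock_and_wake(void)
{
	count = 1;	/* ...then wake the head of the wait queue */
}
```

Stepping through it: A locks (count 1 -> 0), B fails and queues (0 -> -1), A unlocks (count back to 1, B woken), and a late-arriving C succeeds on the fastpath before B has run, exactly the out-of-sequence grab David argues is low-probability but theoretically possible.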
* Re: PowerPC fastpaths for mutex subsystem 2006-01-07 17:49 ` PowerPC fastpaths for mutex subsystem Joel Schopp 2006-01-07 22:37 ` Andrew Morton @ 2006-01-08 9:48 ` Ingo Molnar 2006-01-10 22:31 ` Joel Schopp 2006-01-08 10:43 ` Ingo Molnar 2 siblings, 1 reply; 34+ messages in thread From: Ingo Molnar @ 2006-01-08 9:48 UTC (permalink / raw) To: Joel Schopp Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton, Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro, Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig, Andi Kleen, Russell King, Anton Blanchard, PPC64-dev * Joel Schopp <jschopp@austin.ibm.com> wrote: > Tested on a 4 core (2 SMT threads/core) Power5 machine with gcc 3.3.2. > Test results from synchro-test.ko: > > All tests run for default 5 seconds > Threads semaphores mutexes mutexes+attached > 1 63,465,364 58,404,630 62,109,571 > 4 58,424,282 35,541,297 37,820,794 > 8 40,731,668 35,541,297 40,281,768 > 16 38,372,769 37,256,298 41,751,764 > 32 38,406,895 36,933,675 38,731,571 > 64 37,232,017 36,222,480 40,766,379 interesting. Could you try two things? Firstly, could you add some minimal delays to the lock/unlock path, of at least 1 usec? E.g. "synchro-test.ko load=1 interval=1". [but you could try longer delays too, 10 usecs is still realistic.] secondly, could you try the VFS creat+unlink test via the test-mutex.c code below, with something like: ./test-mutex V 16 10 (this tests with 16 tasks, for 10 seconds.) You'll get a useful ops/sec number out of this test, but the other stats will only be calculated if you implement the rdtsc() macro to read cycles - right now it defaults to 'always 0' on ppc, i386 and ia64 has it implemented. Also, beware that the default atomic_inc()/dec() is unsafe (only i386 and ia64 has the real thing implemented), you might want to add a safe PPC implementation. thirdly, could you run 'vmstat 1' during the tests, and post those lines too? 
Here i'm curious about two things: the average runqueue length (whether we have overscheduling), and CPU utilization and idle time left (how efficiently cycles are preserved in contention). [btw., does ppc have an idle=poll equivalent mode of idling?] also, there seems to be some fluctuation in the numbers - could you try to run a few more to see how stable the numbers are? Ingo ------------ /* * Copyright (C) 2005, Ingo Molnar <mingo@redhat.com> */ #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <signal.h> #include <sys/wait.h> #include <linux/unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <stdio.h> #include <stdarg.h> #include <stdlib.h> #include <signal.h> #include <sys/wait.h> #include <linux/unistd.h> #include <unistd.h> #include <string.h> #include <pwd.h> #include <grp.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/time.h> #include <regex.h> #include <fcntl.h> #include <time.h> #include <sys/mman.h> #include <dlfcn.h> #include <popt.h> #include <sys/socket.h> #include <ctype.h> #include <assert.h> #include <sched.h> #ifdef __ia64__ #include <sys/ioctl.h> #include "mmtimer.h" int mmtimer_fd; unsigned long __mm_timer_clock_res; unsigned long *__mm_clock_dev; unsigned long __mm_clock_offset; #endif unsigned long *shared; #define mutex_lock() gettimeofday((void *)0, (void *)10) #define mutex_unlock() gettimeofday((void *)0, (void *)20) #define down() gettimeofday((void *)0, (void *)100) #define up() gettimeofday((void *)0, (void *)200) #define down_write() gettimeofday((void *)0, (void *)1000) #define up_write() gettimeofday((void *)0, (void *)2000) #define down_read() gettimeofday((void *)0, (void *)10000) #define up_read() gettimeofday((void *)0, (void *)20000) /* * Shared locks and variables between the test tasks: */ #define CACHELINE_SIZE (128/sizeof(long)) enum { SHARED_DELTA_SUM = 0*CACHELINE_SIZE, SHARED_DELTA_MAX = 1*CACHELINE_SIZE, SHARED_DELTA2_SUM = 
2*CACHELINE_SIZE, SHARED_DELTA2_MAX = 3*CACHELINE_SIZE, SHARED_DELTA3_SUM = 4*CACHELINE_SIZE, SHARED_DELTA3_MAX = 5*CACHELINE_SIZE, SHARED_DELTA_DELTA_SUM = 6*CACHELINE_SIZE, SHARED_COUNT = 7*CACHELINE_SIZE, SHARED_SUM = 8*CACHELINE_SIZE, SHARED_LOCK = 9*CACHELINE_SIZE, SHARED_END = 10*CACHELINE_SIZE, }; #define SHARED(x) (*(shared + SHARED_##x)) #define SHARED_LL(x) (*(unsigned long long *)(shared + SHARED_##x)) #define BUG_ON(c) assert(!(c)) static unsigned long *setup_shared_var(void) { char zerobuff [4096] = { 0, }; int ret, fd; unsigned long *buf; char tmpfile[100]; sprintf(tmpfile, ".tmp_mmap-%d", getpid()); fd = creat(tmpfile, 0700); BUG_ON(fd == -1); close(fd); fd = open(tmpfile, O_RDWR|O_CREAT|O_TRUNC); unlink(tmpfile); BUG_ON(fd == -1); ret = write(fd, zerobuff, 4096); BUG_ON(ret != 4096); buf = (void *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); BUG_ON(buf == (void *)-1); close(fd); return buf; } #define LOOPS 10000 #ifdef __ia64__ static int setup_mmtimer(void) { unsigned long regoff; int fd, _t; size_t pagesize; if ((fd = open ("/dev/mmtimer", O_RDONLY)) == -1) perror("missing /dev/mmtimer"); else { pagesize = getpagesize(); __mm_clock_dev = mmap(0, pagesize, PROT_READ, MAP_SHARED, fd, 0); if (__mm_clock_dev != MAP_FAILED) { regoff = ioctl(fd, MMTIMER_GETOFFSET, 0); if (regoff >= 0) { __mm_clock_dev += regoff; __mm_clock_offset = *__mm_clock_dev; } else perror("reg offset ioctl failed"); _t = ioctl(fd, MMTIMER_GETFREQ, &__mm_timer_clock_res); if (_t) perror("get freq ioctl fail"); } } } #define ia64_fetchadd8_rel(p, inc) \ ({ \ __u64 ia64_intri_res; \ asm volatile ("fetchadd8.rel %0=[%1],%2" \ : "=r"(ia64_intri_res) : "r"(p), "i" (inc) \ : "memory"); \ \ ia64_intri_res; \ }) static inline void atomic_inc(unsigned long *flag) { ia64_fetchadd8_rel(flag, 1); } static inline void atomic_dec(unsigned long *flag) { ia64_fetchadd8_rel(flag, -1); } #elif defined(__i386__) static inline void atomic_inc(unsigned long *flag) { __asm__ __volatile__( 
"lock; incl %0\n" : "=g"(*flag) : : "memory"); } static inline void atomic_dec(unsigned long *flag) { __asm__ __volatile__( "lock; decl %0\n" : "=g"(*flag) : : "memory"); } #else static inline void atomic_inc(unsigned long *flag) { ++*flag; } static inline void atomic_dec(unsigned long *flag) { --*flag; } #endif static void LOCK(unsigned long *shared) { for (;;) { atomic_inc(&SHARED(LOCK)); if (SHARED(LOCK) == 1) break; atomic_dec(&SHARED(LOCK)); usleep(1); } } static void UNLOCK(unsigned long *shared) { atomic_dec(&SHARED(LOCK)); } static void sigint(int sig) { atomic_inc(&SHARED(END)); } static void print_status(unsigned long *shared) { unsigned long count; count = SHARED(COUNT); SHARED(COUNT) = 0; SHARED_LL(SUM) += count; printf("\r| loops/sec: %ld \r", count); fflush(stdout); } enum { TYPE_MUTEX, TYPE_SEM, TYPE_RSEM, TYPE_WSEM, TYPE_VFS, NR_TYPES }; const char * type_names[NR_TYPES] = { "Mutex", "Semaphore", "RW-semaphore Read", "RW-semaphore Write", "VFS" }; typedef unsigned long long cycles_t; typedef unsigned long long usecs_t; #ifdef __ia64__ # define rdtscll(val) \ do { \ val = *__mm_clock_dev; \ } while (0) #elif defined(__i386__) # define rdtscll(val) \ do { \ __asm__ __volatile__("rdtsc" : "=A" (val)); \ } while (0) #else # define rdtscll(val) \ do { (val) = 0LL; } while (0) #endif #define rdtod(val) \ do { \ struct timeval tv; \ \ gettimeofday(&tv, NULL); \ (val) = tv.tv_sec * 1000000ULL + tv.tv_usec; \ } while (0) #define max(x,y) ({ \ typeof(x) _x = (x); \ typeof(y) _y = (y); \ (void) (&_x == &_y); \ _x > _y ? 
 _x : _y; })

#define unlikely(x) __builtin_expect(!!(x), 0)

int main(int argc, char **argv)
{
	int i, parent, me, first = 1;
	unsigned long cpus, tasks, seconds = 0;
	cycles_t t0, t01, t1, delta, delta2, delta3,
		delta_sum = 0, delta2_sum = 0, delta3_sum = 0,
		delta_delta, delta_delta_sum = 0, prev_delta,
		delta_max = 0, delta2_max = 0, delta3_max = 0;
	char str[100];
	double freq;
	int type;

	if (argc <= 1 || argc > 4) {
usage:
		fprintf(stderr,
	"usage: test-mutex [Mutex|Sem|Rsem|Wsem|Vfs creat+unlink] <threads> <seconds>\n");
		exit(-1);
usage2:
		fprintf(stderr,
	"the Mutex/Sem/Rsem/Wsem tests are not available.\n");
		goto usage;
	}

	switch (argv[1][0]) {
	case 'M':
		type = TYPE_MUTEX;
		goto usage2;
		break;
	case 'S':
		type = TYPE_SEM;
		goto usage2;
		break;
	case 'R':
		type = TYPE_RSEM;
		goto usage2;
		break;
	case 'W':
		type = TYPE_WSEM;
		goto usage2;
		break;
	case 'V':
		type = TYPE_VFS;
		break;
	default:
		goto usage;
	}

	system("rm -f /tmp/* 2>/dev/null >/dev/null");

	cpus = system("exit `grep processor /proc/cpuinfo | wc -l`");
	cpus = WEXITSTATUS(cpus);

	tasks = cpus;
	if (argc >= 3) {
		tasks = atol(argv[2]);
		if (!tasks)
			goto usage;
	}

	if (argc >= 4)
		seconds = atol(argv[3]);
	else
		seconds = -1;

#ifdef __ia64__
	setup_mmtimer();
#endif
	printf("%ld CPUs, running %ld parallel test-tasks.\n", cpus, tasks);
	printf("checking %s performance.\n", type_names[type]);

	shared = setup_shared_var();

	signal(SIGINT, sigint);
	signal(SIGHUP, sigint);

	parent = getpid();

	for (i = 0; i < tasks; i++)
		if (!fork())
			break;
	sleep(1);
	me = getpid();

	sprintf(str, "/tmp/tmp-%d", me);

	if (me == parent) {
		unsigned long long total_count;
		int i = 0, j;

		for (;;) {
			sleep(1);
			if (i == seconds || SHARED(END))
				break;
			i++;
			print_status(shared);
		}
		atomic_inc(&SHARED(END));
		total_count = SHARED(SUM);

		for (j = 0; j < tasks; j++)
			wait(NULL);

		if (i)
			printf("\navg ops/sec: %Ld\n", total_count / i);

		LOCK(shared);
//		printf("delta_sum: %Ld\n", SHARED_LL(DELTA_SUM));
//		printf("delta_delta_sum: %Ld\n", SHARED_LL(DELTA_DELTA_SUM));
#ifdef __ia64__
		freq =
 25.0;
#else
		freq = 700.0;
#endif
		printf("average cost per op: %.2f usecs\n",
			(double)SHARED_LL(DELTA_SUM)/total_count/freq);
		printf("average cost per lock: %.2f usecs\n",
			(double)SHARED_LL(DELTA2_SUM)/total_count/freq);
		printf("average cost per unlock: %.2f usecs\n",
			(double)SHARED_LL(DELTA3_SUM)/total_count/freq);
		printf("max cost per op: %.2f usecs\n",
			(double)SHARED_LL(DELTA_MAX)/freq);
		printf("max cost per lock: %.2f usecs\n",
			(double)SHARED_LL(DELTA2_MAX)/freq);
		printf("max cost per unlock: %.2f usecs\n",
			(double)SHARED_LL(DELTA3_MAX)/freq);
		printf("average deviance per op: %.2f usecs\n",
			(double)SHARED_LL(DELTA_DELTA_SUM)/total_count/freq/2.0);
		UNLOCK(shared);
		exit(0);
	}

	for (;;) {
		rdtscll(t0);
		switch (type) {
		case TYPE_MUTEX:
			mutex_lock();
			rdtscll(t01);
			mutex_unlock();
			break;
		case TYPE_SEM:
			down();
			rdtscll(t01);
			up();
			break;
		case TYPE_RSEM:
			down_read();
			rdtscll(t01);
			up_read();
			break;
		case TYPE_WSEM:
			down_write();
			rdtscll(t01);
			up_write();
			break;
		case TYPE_VFS:
		{
			int fd;

			fd = creat(str, S_IRWXU);
			rdtscll(t01);
			close(fd);
			break;
		}
		}
		rdtscll(t1);

		delta = t1-t0;
		if (unlikely(delta > delta_max))
			delta_max = delta;
		delta_sum += delta;

		delta2 = t01-t0;
		if (unlikely(delta2 > delta2_max))
			delta2_max = delta2;
		delta2_sum += delta2;

		delta3 = t1-t01;
		if (unlikely(delta3 > delta3_max))
			delta3_max = delta3;
		delta3_sum += delta3;

		if (!first) {
			if (prev_delta < delta)
				delta_delta = delta - prev_delta;
			else
				delta_delta = prev_delta - delta;
			delta_delta_sum += delta_delta;
#if 0
			printf("%Ld-%Ld {%Ld} prev: {%Ld} / [%Ld]\n",
				t0, t1, delta, prev_delta, delta_delta);
			printf(" {%Ld} - {%Ld}\n",
				delta_sum, delta_delta_sum);
#endif
		} else
			first = 0;
		prev_delta = delta;

		atomic_inc(&SHARED(COUNT));
		if (unlikely(SHARED(END))) {
			LOCK(shared);
			SHARED_LL(DELTA_SUM) += delta_sum;
			SHARED_LL(DELTA_MAX) = max(SHARED_LL(DELTA_MAX),
						   delta_max);
			SHARED_LL(DELTA2_SUM) += delta2_sum;
			SHARED_LL(DELTA2_MAX) = max(SHARED_LL(DELTA2_MAX),
						    delta2_max);
			SHARED_LL(DELTA3_SUM) += delta3_sum;
			SHARED_LL(DELTA3_MAX) = max(SHARED_LL(DELTA3_MAX),
						    delta3_max);
			SHARED_LL(DELTA_DELTA_SUM) += delta_delta_sum;
#if 0
			printf("delta_sum: %Ld\n", delta_sum);
			printf("delta_delta_sum: %Ld\n", delta_delta_sum);
			printf("DELTA_SUM: %Ld\n", SHARED_LL(DELTA_SUM));
			printf("DELTA_DELTA_SUM: %Ld\n",
				SHARED_LL(DELTA_DELTA_SUM));
#endif
			UNLOCK(shared);
			exit(0);
		}
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem
  2006-01-08  9:48 ` Ingo Molnar
@ 2006-01-10 22:31 ` Joel Schopp
  2006-01-10 23:09 ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Joel Schopp @ 2006-01-10 22:31 UTC (permalink / raw)
To: Ingo Molnar
Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
    Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
    Andi Kleen, Russell King, Anton Blanchard, PPC64-dev

[-- Attachment #1: Type: text/plain, Size: 1379 bytes --]

> interesting. Could you try two things? Firstly, could you add some
> minimal delays to the lock/unlock path, of at least 1 usec? E.g.
> "synchro-test.ko load=1 interval=1". [but you could try longer delays
> too, 10 usecs is still realistic.]

Graphs attached. The summary for those who don't like to look at
attachments is that the mutex fastpath (threads 1) that I sent the
optimized patch for is comparable within the margin of error to
semaphores. The mutex common path (threads > 1) gets embarrassed by
semaphores. So mutex common paths are not yet ready as far as ppc64 is
concerned.

> secondly, could you try the VFS creat+unlink test via the test-mutex.c
> code below, with something like:
>
>	./test-mutex V 16 10

Queued into my todo list.

> thirdly, could you run 'vmstat 1' during the tests, and post those lines
> too? Here i'm curious about two things: the average runqueue length
> (whether we have overscheduling), and CPU utilization and idle time left
> (how efficiently cycles are preserved in contention). [btw., does ppc
> have an idle=poll equivalent mode of idling?]

Also queued in my todo list.

> also, there seems to be some fluctuation in the numbers - could you try
> to run a few more to see how stable the numbers are?

For the graphs the line is the average of 5 runs, and the 5 runs are
scatter plotted as well.
[-- Attachment #2: semvsmux2.png --]
[-- Type: image/png, Size: 4536 bytes --]

[-- Attachment #3: semvsmux3.png --]
[-- Type: image/png, Size: 4471 bytes --]

[-- Attachment #4: semvsmux.png --]
[-- Type: image/png, Size: 4805 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem
  2006-01-10 22:31 ` Joel Schopp
@ 2006-01-10 23:09 ` Ingo Molnar
  2006-01-11 10:52 ` Ingo Molnar
  2006-01-11 17:44 ` Joel Schopp
  0 siblings, 2 replies; 34+ messages in thread
From: Ingo Molnar @ 2006-01-10 23:09 UTC (permalink / raw)
To: Joel Schopp
Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
    Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
    Andi Kleen, Russell King, Anton Blanchard, PPC64-dev

* Joel Schopp <jschopp@austin.ibm.com> wrote:

> >interesting. Could you try two things? Firstly, could you add some
> >minimal delays to the lock/unlock path, of at least 1 usec? E.g.
> >"synchro-test.ko load=1 interval=1". [but you could try longer delays
> >too, 10 usecs is still realistic.]
>
> Graphs attached. The summary for those who don't like to look at
> attachments is that the mutex fastpath (threads 1) that I sent the
> optimized patch for is comparable within the margin of error to
> semaphores. The mutex common path (threads > 1) gets embarrassed by
> semaphores. So mutex common paths are not yet ready as far as ppc64
> is concerned.

ok. I'll really need to look at "vmstat" output from these. We could
easily make the mutex slowpath behave like ppc64 semaphores, via the
attached (untested) patch, but i really think it's the wrong thing to
do, because it overloads the system with runnable tasks in an
essentially unlimited fashion [== overscheduling] - they'll all contend
for the same single mutex.

in synthetic workloads on idle systems such overscheduling can help,
because the 'luck factor' of the 'thundering herd' of tasks can generate
a higher total throughput - at the expense of system efficiency. At 8
CPUs i already measured a net performance loss at 3 tasks! So i think
the current 'at most 2 tasks runnable' approach of mutexes is the right
one on a broad range of hardware.
still, i'll try a different patch tomorrow, to keep the number of 'in
flight' tasks within a certain limit (say at 2) - i suspect that would
close the performance gap too, on this test. but i really think the
current 'at most one task in flight' logic is the correct approach.
I'm also curious about the VFS-test numbers (already on your todo).

> >thirdly, could you run 'vmstat 1' during the tests, and post those lines
> >too? Here i'm curious about two things: the average runqueue length
> >(whether we have overscheduling), and CPU utilization and idle time left
> >(how efficiently cycles are preserved in contention). [btw., does ppc
> >have an idle=poll equivalent mode of idling?]
>
> Also queued in my todo list.

thanks!

> >also, there seems to be some fluctuation in the numbers - could you try
> >to run a few more to see how stable the numbers are?
>
> For the graphs the line is the average of 5 runs, and the 5 runs are
> scatter plotted as well.

ok, that should be more than enough.

	Ingo

--- kernel/mutex.c.orig
+++ kernel/mutex.c
@@ -226,6 +226,9 @@ __mutex_unlock_slowpath(atomic_t *lock_c
 
 		debug_mutex_wake_waiter(lock, waiter);
 
+		/* be (much) more agressive about wakeups: */
+		list_move_tail(&waiter.list, &lock->wait_list);
+
 		wake_up_process(waiter->task);
 	}

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem
  2006-01-10 23:09 ` Ingo Molnar
@ 2006-01-11 10:52 ` Ingo Molnar
  2006-01-11 17:44 ` Joel Schopp
  1 sibling, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2006-01-11 10:52 UTC (permalink / raw)
To: Joel Schopp
Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
    Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
    Andi Kleen, Russell King, Anton Blanchard, PPC64-dev

* Ingo Molnar <mingo@elte.hu> wrote:

> ok. I'll really need to look at "vmstat" output from these. We could
> easily make the mutex slowpath behave like ppc64 semaphores, via the
> attached (untested) patch, but i really think it's the wrong thing to
> do, because it overloads the system with runnable tasks in an
> essentially unlimited fashion [== overscheduling] - they'll all
> contend for the same single mutex.

find the working patch below. (the previous one had a syntax error)

	Ingo

Index: linux/kernel/mutex.c
===================================================================
--- linux.orig/kernel/mutex.c
+++ linux/kernel/mutex.c
@@ -227,6 +227,9 @@ __mutex_unlock_slowpath(atomic_t *lock_c
 
 		debug_mutex_wake_waiter(lock, waiter);
 
 		wake_up_process(waiter->task);
+
+		/* be (much) more agressive about wakeups: */
+		list_move_tail(&waiter->list, &lock->wait_list);
 	}
 
 	debug_mutex_clear_owner(lock);

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem
  2006-01-10 23:09 ` Ingo Molnar
  2006-01-11 10:52 ` Ingo Molnar
@ 2006-01-11 17:44 ` Joel Schopp
  1 sibling, 0 replies; 34+ messages in thread
From: Joel Schopp @ 2006-01-11 17:44 UTC (permalink / raw)
To: Ingo Molnar
Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
    Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
    Andi Kleen, Russell King, Anton Blanchard, PPC64-dev

> ok. I'll really need to look at "vmstat" output from these. We could
> easily make the mutex slowpath behave like ppc64 semaphores, via the
> attached (untested) patch, but i really think it's the wrong thing to
> do, because it overloads the system with runnable tasks in an
> essentially unlimited fashion [== overscheduling] - they'll all contend
> for the same single mutex.
>
> in synthetic workloads on idle systems such overscheduling can help,
> because the 'luck factor' of the 'thundering herd' of tasks can generate
> a higher total throughput - at the expense of system efficiency. At 8
> CPUs i already measured a net performance loss at 3 tasks! So i think
> the current 'at most 2 tasks runnable' approach of mutexes is the right
> one on a broad range of hardware.
>
> still, i'll try a different patch tomorrow, to keep the number of 'in
> flight' tasks within a certain limit (say at 2) - i suspect that would
> close the performance gap too, on this test.

The fundamental problem is that there is a relatively major latency to
wake somebody up, and for them to actually run so they can acquire a
lock. In an ideal world there would always be a single waiter running,
trying to acquire the lock at the moment it was unlocked, and not
running until then. There are better solutions than just madly throwing
more waiters in flight on an unlock.
Here are four possibilities off the top of my head:

1) It is possible to have a hybrid lock that spins a single waiting
thread and sleeps waiters 2..n, so there is always a waiter running
trying to acquire the lock. It solves the latency problem if the lock
is held a length of time at least as long as it takes to wake up the
next waiter. But the spinning waiter burns some CPU to buy the
decreased latency.

2) You could also do the classic spin-for-a-while-then-sleep method.
This essentially turns low-latency locks into spinlocks, but still
sleeps waiters on locks which are held longer and/or are much more
contested.

3) There is the option to look at CPU idleness of the current CPU and
spin or sleep based on that.

4) Accept that we have a CPU-efficient, high-latency lock and use it
appropriately.

I'm not saying any of these 4 is what we should do. I'm just trying to
say there are options out there that don't involve thundering herds and
luck to address the problem.

^ permalink raw reply	[flat|nested] 34+ messages in thread
* Re: PowerPC fastpaths for mutex subsystem
  2006-01-07 17:49 ` PowerPC fastpaths for mutex subsystem Joel Schopp
  2006-01-07 22:37 ` Andrew Morton
  2006-01-08  9:48 ` Ingo Molnar
@ 2006-01-08 10:43 ` Ingo Molnar
  2 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2006-01-08 10:43 UTC (permalink / raw)
To: Joel Schopp
Cc: Olof Johansson, lkml, Linus Torvalds, Andrew Morton,
    Arjan van de Ven, Nicolas Pitre, Jes Sorensen, Al Viro,
    Oleg Nesterov, David Howells, Alan Cox, Christoph Hellwig,
    Andi Kleen, Russell King, Anton Blanchard, PPC64-dev

looks good to me. Minor nit:

> +"	isync\n"
> +"	isync \n"

shouldn't these two be ISYNC_ON_SMP?

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread
end of thread, other threads:[~2006-01-11 17:45 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-04 14:41 [patch 00/21] mutex subsystem, -V14 Ingo Molnar
2006-01-04 23:45 ` Joel Schopp
2006-01-05  2:38 ` Nicolas Pitre
2006-01-05  2:51 ` Linus Torvalds
2006-01-05  3:21 ` Nick Piggin
2006-01-05  3:39 ` Anton Blanchard
2006-01-05 18:04 ` Jesse Barnes
2006-01-05 14:40 ` Ingo Molnar
2006-01-05 16:21 ` Linus Torvalds
2006-01-05 22:03 ` Ingo Molnar
2006-01-05 22:17 ` Linus Torvalds
2006-01-05 22:43 ` Ingo Molnar
2006-01-06  3:49 ` Keith Owens
2006-01-06  7:34 ` Denis Vlasenko
2006-01-05 14:35 ` Ingo Molnar
2006-01-05 16:42 ` Joel Schopp
2006-01-05 22:21 ` Ingo Molnar
2006-01-05 23:06 ` Joel Schopp
2006-01-05 23:26 ` Linus Torvalds
2006-01-05 23:36 ` Joel Schopp
2006-01-05 23:42 ` Ingo Molnar
2006-01-06  0:29 ` Olof Johansson
2006-01-07 17:49 ` PowerPC fastpaths for mutex subsystem Joel Schopp
2006-01-07 22:37 ` Andrew Morton
2006-01-08  7:43 ` Anton Blanchard
2006-01-08  8:00 ` Andrew Morton
2006-01-08  8:23 ` Anton Blanchard
2006-01-09 11:13 ` David Howells
2006-01-08  9:48 ` Ingo Molnar
2006-01-10 22:31 ` Joel Schopp
2006-01-10 23:09 ` Ingo Molnar
2006-01-11 10:52 ` Ingo Molnar
2006-01-11 17:44 ` Joel Schopp
2006-01-08 10:43 ` Ingo Molnar