* [patch 00/15] Generic Mutex Subsystem
@ 2005-12-19  1:34 Ingo Molnar
  2005-12-19  4:22 ` Andi Kleen
  0 siblings, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-19  1:34 UTC (permalink / raw)
  To: linux-kernel, Linus Torvalds, Andrew Morton
  Cc: Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	Andi Kleen, David Howells, Alexander Viro, Oleg Nesterov


The following patchset is a split-up and streamlined version of the 
generic mutex subsystem that we have in the -rt kernel, ported to the 
upstream kernel. (To reduce the confusion: this code is unrelated to the 
'simple mutex' code recently posted by David Howells.)

the patchset can also be found at:

  http://redhat.com/~mingo/generic-mutex-subsystem/

This patchset is intended to address most (and hopefully all) of the 
objections (and suggestions) raised here in last week's mutex 
discussion. Since there were many issues raised in that thread (and i 
really have read all of them), the answers are unfortunately quite 
extensive too. I think i have the right answer to each of them, embedded 
somewhere below, just be patient and bear with this long email before 
forming an opinion about the feasibility of this approach ;-)

QA status: this is prototype code that supports i386 and x86_64 (SMP and 
UP) at the moment - but it is what i believe could be ported to every 
architecture, and could be merged into the upstream kernel in due 
course. All suggestions to further improve it are welcome.

But firstly, i'd like to answer the most important question:

  "Why the hell do we need a new mutex subsystem, and what's wrong 
   with semaphores??"

This question is clearly nagging most of the doubters, so i'm listing my 
answers first, before fully explaining the patchset. (For more 
implementation details, see the subsequent sections.)

here are the top 10 reasons why i think the generic mutex code should 
be considered for upstream integration:

 - 'struct mutex' is smaller: on x86, 'struct semaphore' is 20 bytes, 
   'struct mutex' is 16 bytes. A smaller structure size means less RAM 
   footprint, and better CPU-cache utilization.

 - tighter code. On x86 i get the following .text sizes when 
   switching all mutex-alike semaphores in the kernel to the mutex 
   subsystem:

        text    data     bss     dec     hex filename
     3280380  868188  396860 4545428  455b94 vmlinux-semaphore
     3255329  865296  396732 4517357  44eded vmlinux-mutex

   that's 25051 bytes of code saved, or a 0.76% win - off the hottest 
   codepaths of the kernel. (The .data savings are 2892 bytes, or 0.33%) 
   Smaller code means better icache footprint, which is one of the 
   major optimization goals in the Linux kernel currently.

 - the mutex subsystem is faster and has superior scalability for 
   contended workloads. On an 8-way x86 system, running a mutex-based 
   kernel and testing creat+unlink+close (of separate, per-task files)
   in /tmp with 16 parallel tasks, the average number of ops/sec is:

    Semaphores:                        Mutexes:

    $ ./test-mutex V 16 10             $ ./test-mutex V 16 10
    8 CPUs, running 16 tasks.          8 CPUs, running 16 tasks.
    checking VFS performance.          checking VFS performance.
    avg loops/sec:      34713          avg loops/sec:      84153
    CPU utilization:    63%            CPU utilization:    22%

   i.e. in this workload, the mutex based kernel was 2.4 times faster 
   than the semaphore based kernel, _and_ it also had 2.8 times less CPU 
   utilization. (In terms of 'ops per CPU cycle', the semaphore kernel 
   performed 551 ops/sec per 1% of CPU time used, while the mutex kernel 
   performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times 
   more efficient.)

   the scalability difference is visible even on a 2-way P4 HT box:

    Semaphores:                        Mutexes:

    $ ./test-mutex V 16 10             $ ./test-mutex V 16 10
    4 CPUs, running 16 tasks.          4 CPUs, running 16 tasks.
    checking VFS performance.          checking VFS performance.
    avg loops/sec:      127659         avg loops/sec:      181082
    CPU utilization:    100%           CPU utilization:    34%

   (the straight performance advantage of mutexes is 41%, the per-cycle 
    efficiency of mutexes is 4.1 times better.)

 - there are no fastpath tradeoffs, the mutex fastpath is just as tight 
   as the semaphore fastpath. On x86, the locking fastpath is 2 
   instructions (a C-level sketch of this fastpath follows right after 
   this list):

    c0377ccb <mutex_lock>:
    c0377ccb:       f0 ff 08                lock decl (%eax)
    c0377cce:       78 0e                   js     c0377cde <.text.lock.mutex>
    c0377cd0:       c3                      ret

   the unlocking fastpath is equally tight:

    c0377cd1 <mutex_unlock>:
    c0377cd1:       f0 ff 00                lock incl (%eax)
    c0377cd4:       7e 0f                   jle    c0377ce5 <.text.lock.mutex+0x7>
    c0377cd6:       c3                      ret

 - the per-call-site inlining cost of the slowpath is cheaper and 
   smaller than that of semaphores, by one instruction, because the 
   mutex trampoline code does not do a "lea %0,%%eax" that the semaphore 
   code does before calling __down_failed. The mutex subsystem uses out 
   of line code currently so this makes only a small difference in .text
   size, but in case we want to inline mutexes, they will be cheaper 
   than semaphores.

 - No wholesale or dangerous migration path. The migration to mutexes is 
   fundamentally opt-in, safe and easy: multiple type-based and .config 
   based migration helpers are provided to make the migration to mutexes 
   easy. Migration is as finegrained as it gets, so robustness of the 
   kernel or out-of-tree code should not be jeopardized at any stage.  
   The migration helpers can be eliminated once migration is completed, 
   once the kernel has been separated into 'mutex users' and 'semaphore 
   users'. Out-of-tree code automatically defaults to semaphore 
   semantics, mutexes are not forced upon anyone, at any stage of the 
   migration.

 - 'struct mutex' semantics are well-defined and are enforced if
   CONFIG_DEBUG_MUTEXES is turned on. Semaphores on the other hand have 
   virtually no debugging code or instrumentation. The mutex subsystem 
   checks and enforces the following rules:

   * - only one task can hold the mutex at a time
   * - only the owner can unlock the mutex
   * - multiple unlocks are not permitted
   * - recursive locking is not permitted
   * - a mutex object must be initialized via the API
   * - a mutex object must not be initialized via memset or copying
   * - task may not exit with mutex held
   * - memory areas where held locks reside must not be freed

   furthermore, there are also convenience features in the debugging 
   code:

   * - uses symbolic names of mutexes, whenever they are printed in debug output
   * - point-of-acquire tracking, symbolic lookup of function names
   * - list of all locks held in the system, printout of them
   * - owner tracking
   * - detects self-recursing locks and prints out all relevant info
   * - detects multi-task circular deadlocks and prints out all affected
   *   locks and tasks (and only those tasks)

   we have extensive experience with the mutex debugging code in the -rt
   kernel, and it eases the debugging of mutex related bugs 
   considerably. A handful of upstream bugs were found as well this
   way, and were contributed back to the vanilla kernel. We do believe 
   that improved debugging code is an important tool in improving the 
   fast-paced upstream kernel's quality.

   a side-effect of the strict semantics is that mutexes are much easier 
   to analyze on a static level. E.g. Sparse could check the correctness 
   of mutex users, further improving the kernel's quality. Also, the
   highest-level security and reliability validation techniques (and 
   certification processes) involve static code analysis.

 - kernel/mutex.c is generic, and has minimal per-arch needs. No new 
   primitives have to be implemented to support spinlock-based generic 
   mutexes. Only 2 new atomic primitives have to be implemented for an 
   architecture to support optimized, lockless generic mutexes. In 
   contrast, to implement semaphores on a new architecture, hundreds of 
   lines of nontrivial (often assembly) code has to be written and 
   debugged.

 - kernel/mutex.c is highly hackable. New locking features can be 
   implemented in C, and they carry over to every architecture.  
   Extensive self-consistency debugging checks of the mutex
   implementation are done if CONFIG_DEBUG_MUTEXES is turned on. I do 
   think that hackability is the most important property of 
   kernel code.

 - the generic mutex subsystem is also one more step towards enabling 
   the fully preemptable -rt kernel. Ok, this shouldnt bother the 
   upstream kernel too much at the moment, but it's a personal driving
   force for me nevertheless ;-)

   (NOTE: i consciously did not list 'Priority Inheritance' amongst the 
    reasons, because priority inheritance for blocking kernel locks 
    would be a misguided reason at best, and flat out wrong at worst.)
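
as referenced in the fastpath item above, here is a minimal C-level 
sketch of roughly what such a 2-instruction locking fastpath compiles 
from - the slowpath helper name is an illustrative assumption, not 
necessarily the one used in the patches:

    /* sketch: counter-based lock fastpath, slowpath taken on contention */
    static inline void mutex_lock_sketch(struct mutex *lock)
    {
        /* compiles to 'lock decl (%eax)' + 'js <slowpath>' on i386 */
        if (unlikely(atomic_dec_return(&lock->count) < 0))
            __mutex_lock_slowpath(lock);  /* hypothetical out-of-line slowpath */
    }

the unlock fastpath follows the same pattern with an atomic increment, 
taking the slowpath when the result is <= 0, i.e. when there are 
waiters.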


Implementation of mutexes
-------------------------

there are two central data types:

  - 'struct mutex'
  - 'struct arch_semaphore'

'struct mutex' is the new mutex type, defined in include/linux/mutex.h 
and implemented in kernel/mutex.c. It is a counter-based mutex with a 
spinlock and a wait-list.
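
purely as an illustration of that description, a plausible layout (the 
field names are assumptions, the actual patch may differ) looks like 
this - on 32-bit x86 the three fields add up to roughly the 16 bytes 
quoted above:

    struct mutex {
        atomic_t         count;      /* 1: unlocked, 0: locked, negative: waiters */
        spinlock_t       wait_lock;  /* protects the wait-list below */
        struct list_head wait_list;  /* tasks blocked on this mutex */
    };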

'struct arch_semaphore' is the 'struct semaphore' type and 
implementation, defined and implemented in include/asm-*/semaphore.h and 
in various other arch-level files.

NOTE: the patchset does _NO_ wholesale renaming of 'struct semaphore' to 
'struct arch_semaphore'. The limited renaming of 'struct semaphore' to 
'struct arch_semaphore' is only technical, and affects only a limited 
amount of architecture-level code. It does _not_ spread out into the 
generic kernel itself. The reason for this renaming is to make migration 
to mutexes safer and easier.

the APIs of 'struct mutex' have been streamlined:

 DEFINE_MUTEX(name);

 mutex_init(mutex);
 
 void mutex_lock(struct mutex *lock);
 int  mutex_lock_interruptible(struct mutex *lock);
 int  mutex_trylock(struct mutex *lock);
 void mutex_unlock(struct mutex *lock);
 int  mutex_is_locked(struct mutex *lock);
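
a typical critical section with this API would look like the following 
(the mutex and function names here are made up, for illustration only):

    static DEFINE_MUTEX(dev_mutex);    /* statically initialized, unlocked */

    static int dev_do_something(void)
    {
        if (mutex_lock_interruptible(&dev_mutex))
            return -EINTR;             /* got a signal while waiting */
        /* ... critical section, only one task can be here at a time ... */
        mutex_unlock(&dev_mutex);
        return 0;
    }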

the APIs of 'struct arch_semaphore' are the well-known semaphore APIs:

 DECLARE_ARCH_MUTEX(name)
 DECLARE_ARCH_MUTEX_LOCKED(name)

 void arch_sema_init(struct arch_semaphore *sem, int val);
 void arch_init_MUTEX(struct arch_semaphore *sem);
 void arch_init_MUTEX_LOCKED(struct arch_semaphore *sem);
 void arch_down(struct arch_semaphore * sem);
 int arch_down_interruptible(struct arch_semaphore * sem);
 int arch_down_trylock(struct arch_semaphore * sem);
 void arch_up(struct arch_semaphore * sem);
 int arch_sem_is_locked(struct arch_semaphore *sem);
 arch_sema_count(sem)

NOTE: no non-mutex subsystem will ever make use of these arch_-prefixed 
API calls - they are all hidden within type-switching macros. So no 
wholesale migration of the semaphore APIs is done.


The migration path to mutexes
-----------------------------

there are two other sub-types implemented as well, to ease migration and 
to enable debugging. They are:

  - 'struct semaphore'
  - 'struct mutex_debug'

'Break The World' approaches are unacceptable. By default, 'struct 
semaphore' is mapped to the plain semaphore implementation of the 
architecture - bit for bit equivalent and compatible. The APIs are the 
well-known down()/up()/etc. APIs, and they are bit for bit equivalent to 
the vanilla kernel's semaphore implementation. This property is crucial 
to be able to introduce the mutex subsystem in the stable kernel series.  
Doing anything else would be a non-starter.

'struct mutex_debug' is a debugging variant of mutexes: it can be used 
with both sets of APIs, both with the semaphore APIs and with the mutex 
APIs. E.g.:

	struct mutex_debug sem;

	down(&sem);
	...
	up(&sem);
	...
	down(&sem);
	...
	up(&sem);

will be mapped by the kernel to mutex_lock() and mutex_unlock(). The 
code can be changed back to a semaphore by changing the 'struct 
mutex_debug sem' line to 'struct semaphore sem'. The actual down()/up() 
calls in the code do not have to be modified for this type of migration.

thus 'struct mutex_debug' is the natural starting point for migrating a 
kernel subsystem or driver over to mutexes: just change the 'struct 
semaphore' data definition to 'struct mutex_debug', and rebuild the 
kernel - the whole subsystem will use mutexes from that point on. If 
there are any problems with the migration, switch the data type back to 
'struct semaphore' and forget about mutexes. If the migration proves to 
be successful, change the data type to 'struct mutex' and fix up all the 
API uses to use the mutex variants.
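
as an illustration of that last step, a (hypothetical) driver that 
started out with:

        struct semaphore io_sem;    /* initialized via init_MUTEX(&io_sem) */

        down(&io_sem);
        ...
        up(&io_sem);

would, once fully converted, end up as:

        struct mutex io_mutex;      /* initialized via mutex_init(&io_mutex) */

        mutex_lock(&io_mutex);
        ...
        mutex_unlock(&io_mutex);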

there are .config debugging options that change the meaning of 'struct 
semaphore': e.g. CONFIG_DEBUG_MUTEX_FULL switches all semaphores over to 
mutexes - all except the ones that have been marked arch_semaphore (see 
the arch-semaphores.patch sub-patch). The DEBUG_MUTEX_FULL debugging 
mode is fully functional on all of my test systems, but since it's 
equivalent to a wholesale conversion to 
mutexes, it cannot be guaranteed to be safe. But it is a nice debugging 
option nevertheless, and can be used to verify the mutex subsystem.

so to summarize, these types enable the finegrained and robust 
separation of the kernel's semaphores into 3 categories:

- mutexes (first 'struct mutex_debug', then 'struct mutex')

- counting semaphores (first 'struct semaphore', then
                       'struct arch_semaphore')

- unknown, unreviewed ('struct semaphore')

the 'mutex conversion' process would act on the 'unknown' pool, and 
would move semaphores into one of the two groups, one by one.

out-of-tree semaphore users are safe by default in this scheme: they get 
into the 'unknown' pool and are hence conservatively assumed to be full 
semaphores.

once all semaphores have been safely categorized and converted to one 
group or another, and all out-of-tree codebases are fixed and a 
deprecation period has passed, we can rename arch_semaphores back to 
'struct semaphore'.


Status of the patch-queue
-------------------------

the patch-series currently builds and boots on i386 and x86_64. It 
consists of 15 finegrained patches:

  add-atomic-xchg-i386.patch
  add-atomic-xchg-x86_64.patch
  add-atomic-call-func-i386.patch
  add-atomic-call-func-x86_64.patch

  mutex-core.patch

  mutex-debug.patch
  mutex-debug-more.patch

  mutex-migration-helper-i386.patch
  mutex-migration-helper-x86_64.patch
  mutex-migration-helper-core.patch

  sx8-sem2completions.patch
  cpu5wdt-sem2completions.patch
  ide-gendev-sem-to-completion.patch
  loop-to-completions.patch

  arch-semaphores.patch

these patches, on top of Linus' current -git tree, build and boot at all 
of these stages, and the mutex-kernel is fully functional with all 
patches applied, in both debug and non-debug mode.

reports, suggestions, fixes are welcome.

	Ingo


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19  1:34 [patch 00/15] Generic Mutex Subsystem Ingo Molnar
@ 2005-12-19  4:22 ` Andi Kleen
  2005-12-19  4:28   ` Steven Rostedt
                     ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Andi Kleen @ 2005-12-19  4:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, Andi Kleen,
	David Howells, Alexander Viro, Oleg Nesterov

>     $ ./test-mutex V 16 10             $ ./test-mutex V 16 10
>     8 CPUs, running 16 tasks.          8 CPUs, running 16 tasks.
>     checking VFS performance.          checking VFS performance.
>     avg loops/sec:      34713          avg loops/sec:      84153
>     CPU utilization:    63%            CPU utilization:    22%
> 
>    i.e. in this workload, the mutex based kernel was 2.4 times faster 
>    than the semaphore based kernel, _and_ it also had 2.8 times less CPU 
>    utilization. (In terms of 'ops per CPU cycle', the semaphore kernel 
>    performed 551 ops/sec per 1% of CPU time used, while the mutex kernel 
>    performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times 
>    more efficient.)

Do you have an idea where this big difference comes from? It doesn't look
it's from the fast path which is essentially the same.  Do the mutexes have
that much better scheduling behaviour than semaphores? It is a bit hard to 
believe.

I would perhaps understand your numbers if you used adaptive mutexes 
or similar that spin for a bit, but just for pure sleeping locks it seems 
weird. After all the scheduler should work in the same way for both.

-Andi



* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19  4:22 ` Andi Kleen
@ 2005-12-19  4:28   ` Steven Rostedt
  2005-12-19  4:31     ` Andi Kleen
  2005-12-19  6:24   ` Linus Torvalds
  2005-12-19 16:22   ` Ingo Molnar
  2 siblings, 1 reply; 29+ messages in thread
From: Steven Rostedt @ 2005-12-19  4:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, linux-kernel, Linus Torvalds, Andrew Morton,
	Arjan van de Ven, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov


On Mon, 19 Dec 2005, Andi Kleen wrote:

> >     $ ./test-mutex V 16 10             $ ./test-mutex V 16 10
> >     8 CPUs, running 16 tasks.          8 CPUs, running 16 tasks.
> >     checking VFS performance.          checking VFS performance.
> >     avg loops/sec:      34713          avg loops/sec:      84153
> >     CPU utilization:    63%            CPU utilization:    22%
> >
> >    i.e. in this workload, the mutex based kernel was 2.4 times faster
> >    than the semaphore based kernel, _and_ it also had 2.8 times less CPU
> >    utilization. (In terms of 'ops per CPU cycle', the semaphore kernel
> >    performed 551 ops/sec per 1% of CPU time used, while the mutex kernel
> >    performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times
> >    more efficient.)
>
> Do you have an idea where this big difference comes from? It doesn't look
> it's from the fast path which is essentially the same.  Do the mutexes have
> that much better scheduling behaviour than semaphores? It is a bit hard to
> believe.
>
> I would perhaps understand your numbers if you used adaptive mutexes
> or similar that spin for a bit, but just for pure sleeping locks it seems
> weird. After all the scheduler should work in the same way for both.
>

Perhaps it's the smaller structures, as Ingo said, which would allow for
better cache handling.

-- Steve



* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19  4:28   ` Steven Rostedt
@ 2005-12-19  4:31     ` Andi Kleen
  0 siblings, 0 replies; 29+ messages in thread
From: Andi Kleen @ 2005-12-19  4:31 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, Linus Torvalds,
	Andrew Morton, Arjan van de Ven, Alan Cox, Christoph Hellwig,
	David Howells, Alexander Viro, Oleg Nesterov

> Perhaps it's the smaller structures, as Ingo said, which would allow for
> better cache handling.

That still doesn't seem credible on a large cached x86 CPU.
However if it's that then oprofile could tell I guess.

-Andi


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19  4:22 ` Andi Kleen
  2005-12-19  4:28   ` Steven Rostedt
@ 2005-12-19  6:24   ` Linus Torvalds
  2005-12-19 12:56     ` Steven Rostedt
  2005-12-19 15:50     ` Ingo Molnar
  2005-12-19 16:22   ` Ingo Molnar
  2 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2005-12-19  6:24 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, linux-kernel, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov



On Mon, 19 Dec 2005, Andi Kleen wrote:
> 
> Do you have an idea where this big difference comes from? It doesn't look
> it's from the fast path which is essentially the same.  Do the mutexes have
> that much better scheduling behaviour than semaphores? It is a bit hard to 
> believe.

Are Ingo's mutex'es perhaps not trying to be fair?

The normal mutexes try to make sure that if a process comes in and gets 
stuck on a mutex, and then another CPU releases the mutex and immediately 
tries to grab it again, the other CPU will _not_ get it.

That's a huge performance disadvantage, but it's done on purpose, because 
otherwise you end up in a situation where the semaphore release code did 
wake up the waiter, but before the waiter actually had time to grab it (it 
has to go through the IPI and scheduling logic), the same CPU just grabbed 
it again.

The original semaphores were unfair, and it works really well most of the 
time. But then it really sucks in some nasty cases.

The numbers make me suspect that Ingo's mutexes are unfair too, but I've 
not looked at the code yet.

NOTE! I'm not a huge fan of fairness per se. I think unfair is often ok, 
and the performance advantages are certainly real. It may well be that the 
cases that caused problems before are now done differently (eg we switched 
the VM semaphore over to an rwsem), and that we can have an unfair and 
fast mutex for those cases where we don't care.

I just suspect the comparison isn't fair.

		Linus


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19  6:24   ` Linus Torvalds
@ 2005-12-19 12:56     ` Steven Rostedt
  2005-12-19 16:55       ` Ingo Molnar
  2005-12-19 15:50     ` Ingo Molnar
  1 sibling, 1 reply; 29+ messages in thread
From: Steven Rostedt @ 2005-12-19 12:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Ingo Molnar, linux-kernel, Andrew Morton,
	Arjan van de Ven, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov


On Sun, 18 Dec 2005, Linus Torvalds wrote:

>
>
> On Mon, 19 Dec 2005, Andi Kleen wrote:
> >
> > Do you have an idea where this big difference comes from? It doesn't look
> > it's from the fast path which is essentially the same.  Do the mutexes have
> > that much better scheduling behaviour than semaphores? It is a bit hard to
> > believe.
>
> Are ingo's mutex'es perhaps not trying to be fair?
>
> The normal mutexes try to make sure that if a process comes in and gets
> stuck on a mutex, and then another CPU releases the mutex and immediately
> tries to grab it again, the other CPU will _not_ get it.
>
> That's a huge performance disadvantage, but it's done on purpose, because
> otherwise you end up in a situation where the semaphore release code did
> wake up the waiter, but before the waiter actually had time to grab it (it
> has to go through the IPI and scheduling logic), the same CPU just grabbed
> it again.
>
> The original semaphores were unfair, and it works really well most of the
> time. But then it really sucks in some nasty cases.
>
> The numbers make me suspect that Ingo's mutexes are unfair too, but I've
> not looked at the code yet.

Yes, Ingo's code does exhibit this unfairness.  Interesting also is that
Ingo's original code for his rt_mutexes was fair, and it killed
performance for high priority processes.  I introduced a "lock stealing"
algorithm that would check if the process trying to grab the lock again
was a higher priority than the one about to get it, and if it was, it
would "steal" the lock from it unfairly as you said.

Now, you are forgetting about PREEMPT.  Yes, on multiple CPUs, and that is
what Ingo is testing, to wait for the other CPU to schedule in and run is
probably not as bad as with PREEMPTION. (Ingo, did you have preemption on
in these tests?).  The reason is that if you have a high priority process
release a lock (giving it to a lower priority process that hasn't woken up
yet), then try to grab it again, but a lower priority process was waiting
on it, the high priority process would need to schedule out and wait on
the lower priority process. Here's a case of priority inversion that can
be solved without priority inheritance.

The way this situation happens is if you have three processes, A B and C
where A is the highest and C is the lowest.  C grabs the lock and is
preempted by A, A tries to grab the lock and goes to sleep, then B
runs and preempts C (remember, we don't have PI here), and then tries to
grab the lock.  C releases the lock and gives it to A, then A releases the
lock (gives it to B) and then tries to grab it again.

Now you must wait for two schedules and B to release the lock, before high
priority process A gets to run again.

So when we have PREEMPT, your fairness is not being very fair.

-- Steve


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19  6:24   ` Linus Torvalds
  2005-12-19 12:56     ` Steven Rostedt
@ 2005-12-19 15:50     ` Ingo Molnar
  2005-12-19 19:11       ` Linus Torvalds
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-19 15:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, linux-kernel, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov


* Linus Torvalds <torvalds@osdl.org> wrote:

> On Mon, 19 Dec 2005, Andi Kleen wrote:
> > 
> > Do you have an idea where this big difference comes from? It doesn't look
> > it's from the fast path which is essentially the same.  Do the mutexes have
> > that much better scheduling behaviour than semaphores? It is a bit hard to 
> > believe.
> 
> Are Ingo's mutex'es perhaps not trying to be fair?
> 
> The normal mutexes try to make sure that if a process comes in and 
> gets stuck on a mutex, and then another CPU releases the mutex and 
> immediately tries to grab it again, the other CPU will _not_ get it.

the VFS creat+unlink performance advantage i measured on SMP systems is 
a pure 'struct mutex' vs. stock 'struct semaphore' measurement. I 
measured the same kernel with a single .config option difference 
(DEBUG_MUTEX_FULL to get the 'mutex' variant, vs. DEBUG_MUTEX_PARTIAL to 
get the 'semaphore' variant), the systems were completely idle, and the 
results are statistically stable.

fact is, we dont implement fairness in the upstream struct semaphore 
implementation(s) either - it would be extremely bad for performance (as 
you are pointing it out too).

both stock semaphores and generic mutexes are unfair in the following 
sense: if there is a waiter 'in flight' (but has not grabbed the lock 
yet), a 'lucky bastard may jump the queue' and may grab the lock, 
without having waited anything. So comparing semaphores to generic 
mutexes is an apples to apples comparison.

in fact, generic mutexes are _more_ fair than struct semaphore in their 
wait logic. In the stock semaphore implementation, when a waiter is 
woken up, it will retry the lock, and if it fails, it goes back to the 
_tail_ of the queue again - waiting one full cycle again.

in the generic mutex implementation, the first waiter is woken up, and 
it will retry the lock - but no other waiters are woken up until this 
waiter has done its round. (a mutex implementation can do this, because 
it knows that there can only be one owner, while a counting semaphore 
has to be prepared for the possibility of multiple concurrent up()s, and 
for the count going above 1.)

and this is also the crucial difference that gives mutexes the 
performance edge in contended workloads (so this should also answer 
Andi's question): stock semaphores easily end up creating a 'thundering 
herd' scenario, if a 'fast lucky bastard' releases the lock - and wakes 
up _yet another_ waiter. The end result is, that given a high enough 
'semaphore use frequency', our wake-one logic in semaphores totally 
falls apart and we end up having all waiters racing for the runqueues, 
and racing for the lock itself. This causes much higher CPU utilization, 
wasted resources, and slower total performance.

there is one more wakeup related optimization in generic mutexes: the 
waiter->woken flag. It is set when the waiter is 'set into flight', and 
subsequent wakeups first check this flag. Since new waiters are added to 
the end of the wait-list, this waiter's data cachelines stay clean, and 
the ->woken flag is nicely shared amongst CPUs. Doing a blind 
wake_up_process() would at a minimum dirty the remote runqueue's lock 
cacheline.
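
to make the above more concrete, here is a rough sketch of the wake-one 
plus ->woken idea - the waiter structure and helper below are 
illustrative, not the exact code in the patch-queue:

    struct mutex_waiter {
        struct list_head    list;
        struct task_struct  *task;
        int                 woken;    /* set once this waiter is 'in flight' */
    };

    /* called with lock->wait_lock held: wake only the first waiter;
       subsequent wakeups see ->woken set and touch nothing else */
    static void mutex_wake_one(struct mutex *lock)
    {
        struct mutex_waiter *waiter;

        if (list_empty(&lock->wait_list))
            return;

        waiter = list_entry(lock->wait_list.next, struct mutex_waiter, list);
        if (!waiter->woken) {
            waiter->woken = 1;
            wake_up_process(waiter->task);  /* only now dirty the remote runqueue */
        }
    }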

> That's a huge performance disadvantage, but it's done on purpose, 
> because otherwise you end up in a situation where the semaphore 
> release code did wake up the waiter, but before the waiter actually 
> had time to grab it (it has to go through the IPI and scheduling 
> logic), the same CPU just grabbed it again.
> 
> The original semaphores were unfair, and it works really well most of 
> the time. But then it really sucks in some nasty cases.
> 
> The numbers make me suspect that Ingo's mutexes are unfair too, but 
> I've not looked at the code yet.

yes, they are unfair - just like stock semaphores.

[ Note that the generic rt-mutexes in the -rt tree are of course fair to 
  higher-prio tasks, and this fairness is implemented via 
  priority-queueing and priority inheritance: the highest prio (RT) task
  gets the lock, and if a lower-prio task is holding the lock, it is
  boosted up until the end of the critical section, at which point it
  hands over the lock to the highprio task.

  implementing any other method to achieve fairness would result in 
  really bad real-world performance. The -rt tree's performance is quite 
  close to the vanilla kernel (considering that it's a fully preemptible
  kernel), and we couldnt do that without the mutex implementation. ]

generic mutexes, as present in this patchqueue, do not include priority 
queueing and priority inheritance, so they are 'plain unfair', just like 
stock semaphores are.

> NOTE! I'm not a huge fan of fairness per se. I think unfair is often 
> ok, and the performance advantages are certainly real. It may well be 
> that the cases that caused problems before are now done differently 
> (eg we switched the VM semaphore over to an rwsem), and that we can 
> have an unfair and fast mutex for those cases where we don't care.
> 
> I just suspect the comparison isn't fair.

the comparison is 100% totally fair, because both generic mutexes, and 
struct semaphores are 100% totally unfair :-) There's no difference in 
the level of unfairness either: 'lucky bastards' are allowed to steal 
the lock in both implementations.

we have only one (limited) notion of fairness in Linux synchronization 
primitives: rwsems prefer waiting writers, and then block subsequent 
readers. Note that rwsems are still reader-reader and writer-writer 
unfair, hence the -rt tree replaces the rwsem implementation too, with 
generic mutexes and PI.

	Ingo


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19  4:22 ` Andi Kleen
  2005-12-19  4:28   ` Steven Rostedt
  2005-12-19  6:24   ` Linus Torvalds
@ 2005-12-19 16:22   ` Ingo Molnar
  2 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2005-12-19 16:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Linus Torvalds, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov


* Andi Kleen <ak@suse.de> wrote:

> >     $ ./test-mutex V 16 10             $ ./test-mutex V 16 10
> >     8 CPUs, running 16 tasks.          8 CPUs, running 16 tasks.
> >     checking VFS performance.          checking VFS performance.
> >     avg loops/sec:      34713          avg loops/sec:      84153
> >     CPU utilization:    63%            CPU utilization:    22%
> > 
> >    i.e. in this workload, the mutex based kernel was 2.4 times faster 
> >    than the semaphore based kernel, _and_ it also had 2.8 times less CPU 
> >    utilization. (In terms of 'ops per CPU cycle', the semaphore kernel 
> >    performed 551 ops/sec per 1% of CPU time used, while the mutex kernel 
> >    performed 3825 ops/sec per 1% of CPU time used - it was 6.9 times 
> >    more efficient.)
> 
> Do you have an idea where this big difference comes from? It doesn't 
> look it's from the fast path which is essentially the same.  Do the 
> mutexes have that much better scheduling behaviour than semaphores? It 
> is a bit hard to believe.

yes, generic mutexes have that much better scheduling in this workload.  
[ And no, there's no secret speedup magic hidden in the scheduler ;) ] 
See my other reply to Linus about why there's such a difference.

> I would perhaps understand your numbers if you used adaptive mutexes 
> or similar that spin for a bit, but just for pure sleeping locks it 
> seems weird. After all the scheduler should work in the same way for 
> both.

hm, i'm not so sure about adaptive mutexes - i'm a bit uneasy about 
wasting cycles on spinning, it will inevitably cause wasted resources in 
some workloads. I think the right approach is to make scheduling fast 
and cheap, and to improve the queueing/wakeup logic of kernel code.

but by all means feel free to experiment with adaptive mutexes, all the 
mutex logic is located in kernel/mutex.c, and it is well-suited for 
rapid prototyping of new locking logic.

	Ingo


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 12:56     ` Steven Rostedt
@ 2005-12-19 16:55       ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2005-12-19 16:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andi Kleen, linux-kernel, Andrew Morton,
	Arjan van de Ven, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov


* Steven Rostedt <rostedt@goodmis.org> wrote:

> > The numbers make me suspect that Ingo's mutexes are unfair too, but I've
> > not looked at the code yet.
> 
> Yes, Ingo's code does exhibit this unfairness.  Interesting also is 
> that Ingo's original code for his rt_mutexes was fair, and it killed 
> performance for high priority processes.  I introduced a "lock 
> stealing" algorithm that would check if the process trying to grab the 
> lock again was a higher priority than the one about to get it, and if 
> it was, it would "steal" the lock from it unfairly as you said.

yes, it's unfair - but stock semaphores are unfair too, so what i've 
measured is still a fair comparison of the two implementations.

lock stealing i've eliminated from this patch-queue, and i've moved the 
point of acquire to after the schedule(). (lock-stealing is only 
relevant for PI, where we always need to associate an owner with the 
lock, hence we pass ownership at the point of release.)

> Now, you are forgetting about PREEMPT.  Yes, on multiple CPUs, and 
> that is what Ingo is testing, to wait for the other CPU to schedule in 
> and run is probably not as bad as with PREEMPTION. (Ingo, did you have 
> preemption on in these tests?). [...]

no, CONFIG_PREEMPT was disabled in every test result i posted. (but i 
get similar results even with it enabled.)

	Ingo


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 15:50     ` Ingo Molnar
@ 2005-12-19 19:11       ` Linus Torvalds
  2005-12-19 19:25         ` Benjamin LaHaise
  2005-12-19 19:55         ` Ingo Molnar
  0 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2005-12-19 19:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Linux Kernel Mailing List, Andrew Morton,
	Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	David Howells, Alexander Viro, Oleg Nesterov, Benjamin LaHaise



On Mon, 19 Dec 2005, Ingo Molnar wrote:
> 
> in fact, generic mutexes are _more_ fair than struct semaphore in their 
> wait logic. In the stock semaphore implementation, when a waiter is 
> woken up, it will retry the lock, and if it fails, it goes back to the 
> _tail_ of the queue again - waiting one full cycle again.

Ingo, I don't think that is true.

It shouldn't be true, at least. The whole point with the "sleeper" count 
was to not have that happen. Of course, bugs happen, so I won't guarantee 
that's actually true, but ..

If you are woken up as a waiter on a semaphore, you shouldn't fail to get 
it. You will be woken first, and nobody else will get at it, because the 
count has been kept negative or zero even by the waiters, so that a 
fast-path user shouldn't be able to get the lock without going through the 
slow path and adding itself (last) to the list.

But hey, somebody should test it with <n> kernel threads that just do 
down()/up() and some make-believe work in between to make sure there 
really is contention, and count how many times they actually get the 
semaphore. That code has been changed so many times that it may not work 
the way it is advertized ;)
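
a rough sketch of the kind of test being suggested - N kernel threads 
hammering a single semaphore and counting their acquisitions (module 
boilerplate omitted, all names made up):

    #include <linux/kthread.h>
    #include <linux/sched.h>
    #include <linux/delay.h>
    #include <asm/semaphore.h>

    #define NR_TEST_THREADS 8

    static DECLARE_MUTEX(test_sem);              /* semaphore, count == 1 */
    static unsigned long acquired[NR_TEST_THREADS];

    /* start one instance per thread:
     *   kthread_run(sem_hammer, (void *)(long)i, "semtest/%d", i);
     * then compare the acquired[] counts to judge fairness.
     */
    static int sem_hammer(void *data)
    {
        int id = (long)data;

        while (!kthread_should_stop()) {
            down(&test_sem);
            udelay(10);                          /* make-believe work under the lock */
            up(&test_sem);
            acquired[id]++;
            cond_resched();                      /* give others a chance to contend */
        }
        return 0;
    }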

[ Oh.  I'm looking at the semaphore code, and I realize that we have a 
  "wake_up(&sem->wait)" in the __down() path because we had some race long 
  ago that we fixed by band-aiding over it. Which means that we wake up 
  sleepers that shouldn't be woken up. THAT may well be part of the 
  performance problem.. The semaphores are really meant to wake up just 
  one at a time, but because of that race hack they'll wake up _two_ at a 
  time - once by up(), once by down().

  That also destroys the fairness. Does anybody remember why it's that 
  way? ]

Ho humm.. That's interesting. 

		Linus


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 19:11       ` Linus Torvalds
@ 2005-12-19 19:25         ` Benjamin LaHaise
  2005-12-19 19:55           ` Linus Torvalds
  2005-12-19 20:11           ` Ingo Molnar
  2005-12-19 19:55         ` Ingo Molnar
  1 sibling, 2 replies; 29+ messages in thread
From: Benjamin LaHaise @ 2005-12-19 19:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Linux Kernel Mailing List,
	Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, David Howells, Alexander Viro, Oleg Nesterov

On Mon, Dec 19, 2005 at 11:11:03AM -0800, Linus Torvalds wrote:
> On Mon, 19 Dec 2005, Ingo Molnar wrote:
> > 
> > in fact, generic mutexes are _more_ fair than struct semaphore in their 
> > wait logic. In the stock semaphore implementation, when a waiter is 
> > woken up, it will retry the lock, and if it fails, it goes back to the 
> > _tail_ of the queue again - waiting one full cycle again.
> 
> Ingo, I don't think that is true.
> 
> It shouldn't be true, at least. The whole point with the "sleeper" count 
> was to not have that happen. Of course, bugs happen, so I won't guarantee 
> that's actually true, but ..

The only thing I can see as an improvement that a mutex can offer over 
the current semaphore implementation is if we can perform the same 
optimization that spinlocks perform in the unlock operation: don't use 
a locked, serialising instruction in the up() codepath.  That might be 
a bit tricky to implement, but it's definitely a win on the P4 where the 
cost of serialisation can be quite high.

> [ Oh.  I'm looking at the semaphore code, and I realize that we have a 
>   "wake_up(&sem->wait)" in the __down() path because we had some race long 
>   ago that we fixed by band-aiding over it. Which means that we wake up 
>   sleepers that shouldn't be woken up. THAT may well be part of the 
>   performance problem.. The semaphores are really meant to wake up just 
>   one at a time, but because of that race hack they'll wake up _two_ at a 
>   time - once by up(), once by down().
> 
>   That also destroys the fairness. Does anybody remember why it's that 
>   way? ]

History?  I think that code is very close to what was done in the pre-SMP 
version of semaphores.  It is certainly possible to get rid of the separate 
sleepers -- parisc seems to have such an implementation.  It updates 
sem->count in the wakeup path of __down().

		-ben
-- 
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <dont@kvack.org>.


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 19:11       ` Linus Torvalds
  2005-12-19 19:25         ` Benjamin LaHaise
@ 2005-12-19 19:55         ` Ingo Molnar
  2005-12-19 20:12           ` Linus Torvalds
  1 sibling, 1 reply; 29+ messages in thread
From: Ingo Molnar @ 2005-12-19 19:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Linux Kernel Mailing List, Andrew Morton,
	Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	David Howells, Alexander Viro, Oleg Nesterov, Benjamin LaHaise


* Linus Torvalds <torvalds@osdl.org> wrote:

> On Mon, 19 Dec 2005, Ingo Molnar wrote:
> > 
> > in fact, generic mutexes are _more_ fair than struct semaphore in their 
> > wait logic. In the stock semaphore implementation, when a waiter is 
> > woken up, it will retry the lock, and if it fails, it goes back to the 
> > _tail_ of the queue again - waiting one full cycle again.
> 
> Ingo, I don't think that is true.
> 
> It shouldn't be true, at least. The whole point with the "sleeper" 
> count was to not have that happen. Of course, bugs happen, so I won't 
> guarantee that's actually true, but ..

you are right, i based my comments on observed behavior, not on the 
code's intentions.

I havent actually traced the behavior of semaphores (being mostly 
interested in mutexes ;-), but fairness algorithms always show up as 
heavy context-switchers on SMP (because other CPUs queue themselves as 
waiters, and wakeups go across CPUs all the time), and i'm quite sure 
that contended scenarios with the current semaphore implementation never 
overscheduled. Hence current semaphores are very likely not fair, and 
sleepers roundrobin back to the queue quite often.

but i've got some measurements of how fairness plays out in practice.  
The mutex based ops go:

 mars:~> ./test-mutex V 16 10
 8 CPUs, running 16 parallel test-tasks.
 checking VFS performance.
 avg ops/sec:               85130
 average cost per op:       206.59 usecs
 average deviance per op:   319.08 usecs

note the 'average latency of an op' (in absolute time), and the standard 
deviation (which has been measured by summing up the deltas between 
subsequent latencies, and dividing by the total number of ops).

With semaphores the results go like this:

 mars:~> ./test-mutex V 16 10
 8 CPUs, running 16 parallel test-tasks.
 checking VFS performance.
 avg ops/sec:               34381
 average cost per op:       512.13 usecs
 average deviance per op:   573.10 usecs

note that even though the ratio between the per op cost and the standard 
deviance looks to be a bit better for semaphores, the pure fact that it 
was much slower lengthened its standard deviance to well above that of 
the mutex's.
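
(in relative terms that is 319.08/206.59 ~= 1.54 for mutexes versus 
 573.10/512.13 ~= 1.12 for semaphores - but in absolute terms the mutex 
 deviance of ~319 usecs is still well below the semaphore's ~573 usecs.)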

So even if they are fairer, if the system ends up being slower, fairness 
(==observed fluctuations, and resulting injustices) suffers more as a 
result than due to the queueing logic. I'd choose this 200 +/- 150 usecs 
behavior over 500 +/- 280 usecs behavior - even though the latter has 
smaller relative fluctuations.

(although i'm still unsure which one is fairer, because i couldnt create 
a scenario that is comparable in terms of fairness comparisons: the 
mutex based workloads are always more efficient, and as a result they 
schedule into the idle thread more often, which creates additional noise 
and may be a reason why its standard deviation is higher. The semaphore 
workloads are more saturated, which may flatten its standard deviation.)

> If you are woken up as a waiter on a semaphore, you shouldn't fail to 
> get it. You will be woken first, and nobody else will get at it, 
> because the count has been kept negative or zero even by the waiters, 
> so that a fast-path user shouldn't be able to get the lock without 
> going through the slow path and adding itself (last) to the list.
> 
> But hey, somebody should test it with <n> kernel threads that just do 
> down()/up() and some make-believe work in between to make sure there 
> really is contention, and count how many times they actually get the 
> semaphore. That code has been changed so many times that it may not 
> work the way it is advertized ;)
> 
> [ Oh.  I'm looking at the semaphore code, and I realize that we have a 
>   "wake_up(&sem->wait)" in the __down() path because we had some race long 
>   ago that we fixed by band-aiding over it. Which means that we wake up 
>   sleepers that shouldn't be woken up. THAT may well be part of the 
>   performance problem.. The semaphores are really meant to wake up just 
>   one at a time, but because of that race hack they'll wake up _two_ at a 
>   time - once by up(), once by down().
> 
>   That also destroys the fairness. Does anybody remember why it's that 
>   way? ]
> 
> Ho humm.. That's interesting.

hm, removing that wakeup quickly causes hung test-tasks. (i booted an 
all-mutexes kernel, and did the testing on arch_semaphores, using the 
modified semaphore-sleepers.c code. The test ran for a few seconds, so 
semaphores werent _totally_ broken, but there's some clear race in terms 
of not waking up a sleeper.)

and even considering that the current semaphore implementation may have 
a fairness bug, i cannot imagine that making it more fair would also 
speed it up. So it could end up having an even larger performance gap to 
the mutex implementation. But in any case, it should be an interesting 
comparison! I personally find the semaphore implementation clever but 
too complex, maybe that's a reason why such bugs might be hiding there.  
(possibly for many years already ...)

In any case, i concur with you that the fairness design of the two is 
not comparable, and that semaphores _should_ be fairer.

	Ingo


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 19:25         ` Benjamin LaHaise
@ 2005-12-19 19:55           ` Linus Torvalds
  2005-12-21 16:42             ` Oleg Nesterov
  2005-12-19 20:11           ` Ingo Molnar
  1 sibling, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2005-12-19 19:55 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Ingo Molnar, Andi Kleen, Linux Kernel Mailing List,
	Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, David Howells, Alexander Viro, Oleg Nesterov



On Mon, 19 Dec 2005, Benjamin LaHaise wrote:
> 
> The only thing I can see as an improvement that a mutex can offer over 
> the current semaphore implementation is if we can perform the same 
> optimization that spinlocks perform in the unlock operation: don't use 
> a locked, serialising instruction in the up() codepath.  That might be 
> a bit tricky to implement, but it's definately a win on the P4 where the 
> cost of serialisation can be quite high.

Good point. However, it really _is_ hard, because we also need to know if 
the mutex was under contention. A spinlock doesn't care, so we can just 
overwrite the lock value. A mutex would always care, in order to know 
whether it needs to do the slow wakeup path. 

So I suspect you can't avoid serializing the unlock path for a mutex. The 
issue of "was there contention while I held it" fundamentally _is_ a 
serializing question.
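
to illustrate the difference (types and helper names here are 
illustrative, not actual kernel code):

    /* a spin_unlock can be a plain store: nobody needs the old value */
    static inline void spin_unlock_sketch(volatile int *lock)
    {
        barrier();          /* compiler barrier only, no LOCK prefix needed */
        *lock = 1;
    }

    /* a mutex_unlock must learn, atomically, whether anyone was waiting -
       the old count value is what tells us to take the slow wakeup path */
    static inline void mutex_unlock_sketch(struct mutex *lock)
    {
        if (atomic_inc_return(&lock->count) <= 0)
            __mutex_unlock_slowpath(lock);    /* hypothetical wakeup slowpath */
    }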

> > [ Oh.  I'm looking at the semaphore code, and I realize that we have a 
> >   "wake_up(&sem->wait)" in the __down() path because we had some race long 
> >   ago that we fixed by band-aiding over it. Which means that we wake up 
> >   sleepers that shouldn't be woken up. THAT may well be part of the 
> >   performance problem.. The semaphores are really meant to wake up just 
> >   one at a time, but because of that race hack they'll wake up _two_ at a 
> >   time - once by up(), once by down().
> > 
> >   That also destroys the fairness. Does anybody remember why it's that 
> >   way? ]
> 
> History?

Oh, absolutely, I already checked the old BK history too, and that extra 
wake_up() has been there at least since before we even started using BK. 
So it's very much historical, I'm just wondering if somebody remembers far 
enough back that we'd know.

I don't see why it's needed (since we re-try the "atomic_add_negative()" 
inside the semaphore wait lock, and any up() that saw contention should 
have always been guaranteed to do a wakeup that should fill the race in 
between that atomic_add_negative() and the thing going to sleep). 

It may be that it is _purely_ historical, and simply isn't needed. That 
would be funny/sad, in the sense that we've had it there for years and 
years ;)

		Linus


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 19:25         ` Benjamin LaHaise
  2005-12-19 19:55           ` Linus Torvalds
@ 2005-12-19 20:11           ` Ingo Molnar
  2005-12-19 20:19             ` Benjamin LaHaise
  2005-12-19 20:32             ` Russell King
  1 sibling, 2 replies; 29+ messages in thread
From: Ingo Molnar @ 2005-12-19 20:11 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Linus Torvalds, Andi Kleen, Linux Kernel Mailing List,
	Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, David Howells, Alexander Viro, Oleg Nesterov


* Benjamin LaHaise <bcrl@kvack.org> wrote:

> > [ Oh.  I'm looking at the semaphore code, and I realize that we have a 
> >   "wake_up(&sem->wait)" in the __down() path because we had some race long 
> >   ago that we fixed by band-aiding over it. Which means that we wake up 
> >   sleepers that shouldn't be woken up. THAT may well be part of the 
> >   performance problem.. The semaphores are really meant to wake up just 
> >   one at a time, but because of that race hack they'll wake up _two_ at a 
> >   time - once by up(), once by down().
> > 
> >   That also destroys the fairness. Does anybody remember why it's that 
> >   way? ]
> 
> History?  I think that code is very close to what was done in the 
> pre-SMP version of semaphores.  It is certainly possible to get rid of 
> the separate sleepers -- parisc seems to have such an implementation.  
> It updates sem->count in the wakeup path of __down().

i think we also need to look at the larger picture. If this really is a 
bug that hid for years, it shows that the semaphore code is too complex 
to be properly reviewed and improved. Hence even assuming that the mutex 
code does not bring direct code advantages (which i'm disputing :-), the 
mutex code is far simpler and thus easier to improve. We humans have a 
given number of neurons, which form a hard limit :) In fact it's the 
mutex code that made it apparent that there's something wrong with 
semaphores.

we saw that with the genirq code, with the spinlock code, with the 
preempt code. Consolidation did not add anything drastically new, but 
code consolidation _did_ make things more hackable, and improved the end 
result far more than a splintered set of implementations would have 
looked like.

Just look at the semaphore implementations of various architectures, 
it's a quite colorful and inconsistent mix. Can you imagine adding 
deadlock debugging to each of them?

	Ingo


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 19:55         ` Ingo Molnar
@ 2005-12-19 20:12           ` Linus Torvalds
  2005-12-19 23:37             ` Christoph Hellwig
  2005-12-20  8:03             ` Nick Piggin
  0 siblings, 2 replies; 29+ messages in thread
From: Linus Torvalds @ 2005-12-19 20:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Linux Kernel Mailing List, Andrew Morton,
	Arjan van de Ven, Steven Rostedt, Alan Cox, Christoph Hellwig,
	David Howells, Alexander Viro, Oleg Nesterov, Benjamin LaHaise



On Mon, 19 Dec 2005, Ingo Molnar wrote:
> 
>  average cost per op:       206.59 usecs
>  average cost per op:       512.13 usecs

(mutex vs semaphore).

That looks suspiciously like exactly double the cost, so I do believe that 
the double wake_up() might be exactly what is going on.

However:

> hm, removing that wakeup quickly causes hung test-tasks.

So clearly it really is still hiding some bug.

> and even considering that the current semaphore implementation may have 
> a fairness bug, i cannot imagine that making it more fair would also 
> speed it up.

That's not the point. The extra wakeup() in the semaphore code wakes up 
two processes for every single up(), so the semaphores end up not just 
being unfair, they also end up doing twice the work (because it will 
result in the other processes effectively just doing the down() twice).

> I personally find the semaphore implementation clever but too complex, 
> maybe that's a reason why such bugs might be hiding there.  (possibly 
> for many years already ...)

Oh, absolutely. It is too complex. 

And don't get me wrong: if it's easier to just ignore the performance bug, 
and introduce a new "struct mutex" that just doesn't have it, I'm all for 
it. However, if so, I do NOT want to do the unnecessary renaming. "struct 
semaphore" should stay as "struct semaphore", and we should not affect old 
code in the _least_.

Then code can switch to "struct mutex" if people want to. And if one 
reason for it ends up being that the code avoids a performance bug in the 
process, all the better ;)

IOW, I really think this should be a series of small patches that don't 
touch old users of "struct semaphore" at all. None of this "semaphore" to 
"arch_semaphore" stuff, and the new "struct mutex" would not re-use _any_ 
of the names that the old "struct semaphore" uses.

		Linus


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 20:11           ` Ingo Molnar
@ 2005-12-19 20:19             ` Benjamin LaHaise
  2005-12-19 20:32             ` Russell King
  1 sibling, 0 replies; 29+ messages in thread
From: Benjamin LaHaise @ 2005-12-19 20:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andi Kleen, Linux Kernel Mailing List,
	Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, David Howells, Alexander Viro, Oleg Nesterov

On Mon, Dec 19, 2005 at 09:11:18PM +0100, Ingo Molnar wrote:
> i think we also need to look at the larger picture. If this really is a 
> bug that hid for years, it shows that the semaphore code is too complex 
> to be properly reviewed and improved. Hence even assuming that the mutex 
> code does not bring direct code advantages (which i'm disputing :-), the 
> mutex code is far simpler and thus easier to improve. We humans have a 
> given number of neurons, which form a hard limit :) In fact it's the 
> mutex code that made it apparent that there's something wrong with 
> semaphores.

True, but then the original semaphores weren't designed with fairness in 
mind so much as being able to operate quickly.  The commodity SMP hardware 
that we have today has significantly different characteristics than the 
first dual Pentium I used.  I think there is significant room for improving 
the implementation while still making it as tight and lean as possible.  To 
that end, adding state diagrams that make it easier to visualise what is 
going on would be a big help.  With that in place, it will be easier to 
provide optimized fast paths with understandable logic. 

> Just look at the semaphore implementations of various architectures, 
> it's a quite colorful and inconsistent mix. Can you imagine adding 
> deadlock debugging to each of them?

Agreed.

		-ben
-- 
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <dont@kvack.org>.


* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 20:11           ` Ingo Molnar
  2005-12-19 20:19             ` Benjamin LaHaise
@ 2005-12-19 20:32             ` Russell King
  2005-12-19 20:57               ` Ingo Molnar
  1 sibling, 1 reply; 29+ messages in thread
From: Russell King @ 2005-12-19 20:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Benjamin LaHaise, Linus Torvalds, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov

On Mon, Dec 19, 2005 at 09:11:18PM +0100, Ingo Molnar wrote:
> We humans have a given number of neurons, which form a hard limit :)

That's also a very valid argument for keeping the number of different
locking mechanisms down to a small number.

> we saw that with the genirq code, with the spinlock code, with the 
> preempt code. Consolidation did not add anything drastically new, but 
> code consolidation _did_ make things more hackable, and improved the end 
> result far beyond what a splintered set of implementations would have 
> achieved.
> 
> Just look at the semaphore implementations of various architectures, 
> it's a quite colorful and inconsistent mix. Can you imagine adding 
> deadlock debugging to each of them?

However, the argument _against_ making things generic is that they
become less optimised for specific architectures.  I'm still not
convinced that the genirq stuff is as optimal for ARM as the existing
code is, so I've little motivation to move to the genirq stuff.
(Though I will try to make things easier for those who would like to.)

The same argument applies to attempts to make atomic ops generic using
cmpxchg as a basis for that (as demonstrated on "that other list"), and
the fast path of semaphores.

On "that other list", we're debating saving between 9 and 13 CPU
cycles as an argument in favour of mutexes.  Such a gain which could
very probably be wiped out by any attempt to make them "generic".

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 Serial core
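
[ A minimal sketch of the kind of cmpxchg-based "generic" atomic op referred
  to above. The helper below is hypothetical, not code from the patchset or
  from any architecture; it only illustrates where the extra cycles come from
  compared to a native locked instruction: ]

	/*
	 * Hypothetical generic atomic_dec_and_test() built on cmpxchg.
	 * A native x86 version is a single "lock decl" plus a flags test;
	 * this one has to re-read and retry whenever the cmpxchg loses a
	 * race, which is where the extra cycles go.
	 */
	static inline int generic_atomic_dec_and_test(atomic_t *v)
	{
		int old, new;

		do {
			old = atomic_read(v);
			new = old - 1;
		} while (atomic_cmpxchg(v, old, new) != old);

		return new == 0;
	}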

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 20:32             ` Russell King
@ 2005-12-19 20:57               ` Ingo Molnar
  0 siblings, 0 replies; 29+ messages in thread
From: Ingo Molnar @ 2005-12-19 20:57 UTC (permalink / raw)
  To: Benjamin LaHaise, Linus Torvalds, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov


* Russell King <rmk+lkml@arm.linux.org.uk> wrote:

> However, the argument _against_ making things generic is that they 
> become less optimised for specific architectures.  I'm still not 
> convinced that the genirq stuff is as optimal for ARM as the existing 
> code is, so I've little motivation to move to the genirq stuff. 
> (Though I will try to make things easier for those who would like to.)

i'm quite convinced that the final phase of the genirq conversion will 
work out fine, because it mostly meant the conceptual adoption of your 
ARM IRQ layer (the irqchips approach), with compatibility mechanisms for 
all the other arches and some minor SMP improvements on top of it. So 
i'd be surprised if you found _that_ one inadequate :-) If there's any 
detail that ARM doesn't need, i'm sure we can find a non-runtime solution 
for it. But i think i digress.

	Ingo

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 20:12           ` Linus Torvalds
@ 2005-12-19 23:37             ` Christoph Hellwig
  2005-12-20  8:03             ` Nick Piggin
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2005-12-19 23:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Linux Kernel Mailing List,
	Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, David Howells, Alexander Viro, Oleg Nesterov,
	Benjamin LaHaise

On Mon, Dec 19, 2005 at 12:12:26PM -0800, Linus Torvalds wrote:
> And don't get me wrong: if it's easier to just ignore the performance bug, 
> and introduce a new "struct mutex" that just doesn't have it, I'm all for 
> it. However, if so, I do NOT want to do the unnecessary renaming. "struct 
> semaphore" should stay as "struct semaphore", and we should not affect old 
> code in the _least_.
> 
> Then code can switch to "struct mutex" if people want to. And if one 
> reason for it ends up being that the code avoids a performance bug in the 
> process, all the better ;)
> 
> IOW, I really think this should be a series of small patches that don't 
> touch old users of "struct semaphore" at all. None of this "semaphore" to 
> "arch_semaphore" stuff, and the new "struct mutex" would not re-use _any_ 
> of the names that the old "struct semaphore" uses.

That's exactly what Ingo's series does if you ignore the two odd patches ;-)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 20:12           ` Linus Torvalds
  2005-12-19 23:37             ` Christoph Hellwig
@ 2005-12-20  8:03             ` Nick Piggin
  2005-12-20  8:06               ` Arjan van de Ven
  1 sibling, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2005-12-20  8:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Linux Kernel Mailing List,
	Andrew Morton, Arjan van de Ven, Steven Rostedt, Alan Cox,
	Christoph Hellwig, David Howells, Alexander Viro, Oleg Nesterov,
	Benjamin LaHaise

Linus Torvalds wrote:

> And don't get me wrong: if it's easier to just ignore the performance bug, 
> and introduce a new "struct mutex" that just doesn't have it, I'm all for 
> it. However, if so, I do NOT want to do the unnecessary renaming. "struct 
> semaphore" should stay as "struct semaphore", and we should not affect old 
> code in the _least_.
> 

It would still be good to look at a fair mutex implementation first,
IMO, before making a choice to use unfair mutexes.

They'll often be held for longer than spinlocks, so fairness may be
more important.

> Then code can switch to "struct mutex" if people want to. And if one 
> reason for it ends up being that the code avoids a performance bug in the 
> process, all the better ;)
> 

Is this a good idea? Then we will have, for a long time, different
bits of code with exactly the same synchronisation requirements
using two constructs that are only slightly different. Not to
mention that code specifically requiring semaphores would get confusing.

If we agree mutex is a good idea at all (and I think it is), then
wouldn't it be better to aim for a wholesale conversion rather than
"if people want to"?

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-20  8:03             ` Nick Piggin
@ 2005-12-20  8:06               ` Arjan van de Ven
  2005-12-20  8:21                 ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2005-12-20  8:06 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Ingo Molnar, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov, Benjamin LaHaise


> > Then code can switch to "struct mutex" if people want to. And if one 
> > reason for it ends up being that the code avoids a performance bug in the 
> > process, all the better ;)
> > 
> 
> Is this a good idea? Then we will have, for a long time, different
> bits of code with exactly the same synchronisation requirements
> using two constructs that are only slightly different. Not to
> mention that code specifically requiring semaphores would get confusing.
> 
> If we agree mutex is a good idea at all (and I think it is), then
> wouldn't it be better to aim for a wholesale conversion rather than
> "if people want to"?

well most of this will "only" take a few kernel releases ;-)



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-20  8:06               ` Arjan van de Ven
@ 2005-12-20  8:21                 ` Nick Piggin
  2005-12-20  8:36                   ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2005-12-20  8:21 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Ingo Molnar, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro, Oleg Nesterov, Benjamin LaHaise

Arjan van de Ven wrote:

>>If we agree mutex is a good idea at all (and I think it is), then
>>wouldn't it be better to aim for a wholesale conversion rather than
>>"if people want to"?
> 
> 
> well most of this will "only" take a few kernel releases ;-)
> 

OK, so in my mind that is aiming for a wholesale conversion. And
yes I would expect the semaphore to stay around for a while...

But I thought Linus phrased it as though the mutex should be made
available, but without any large-scale conversions.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-20  8:21                 ` Nick Piggin
@ 2005-12-20  8:36                   ` Arjan van de Ven
  2005-12-20  8:48                     ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2005-12-20  8:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Ingo Molnar, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Steven Rostedt,
	Alan Cox, Christoph Hellwig, David Howells, Alexander Viro,
	Oleg Nesterov, Benjamin LaHaise

On Tue, 2005-12-20 at 19:21 +1100, Nick Piggin wrote:
> Arjan van de Ven wrote:
> 
> >>If we agree mutex is a good idea at all (and I think it is), then
> >>wouldn't it be better to aim for a wholesale conversion rather than
> >>"if people want to"?
> > 
> > 
> > well most of this will "only" take a few kernel releases ;-)
> > 
> 
> OK, so in my mind that is aiming for a wholesale conversion. And
> yes I would expect the semaphore to stay around for a while...
> 
> But I thought Linus phrased it as though the mutex should be made
> available, but without any large-scale conversions.

it's one at a time. Manual inspection etc etc
and I don't expect 99% coverage, but the important 50% after 6 months or
so...



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-20  8:36                   ` Arjan van de Ven
@ 2005-12-20  8:48                     ` Nick Piggin
  0 siblings, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2005-12-20  8:48 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Linus Torvalds, Ingo Molnar, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Steven Rostedt,
	Alan Cox, Christoph Hellwig, David Howells, Alexander Viro,
	Oleg Nesterov, Benjamin LaHaise

Arjan van de Ven wrote:
> On Tue, 2005-12-20 at 19:21 +1100, Nick Piggin wrote:

>>But I thought Linus phrased it as though the mutex should be made
>>available, but without any large-scale conversions.
> 
> 
> it's one at a time. Manual inspection etc etc
> and I don't expect 99% coverage, but the important 50% after 6 months or
> so...
> 

So long as there is a plan and movement, that is fine in my opinion.

-- 
SUSE Labs, Novell Inc.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-19 19:55           ` Linus Torvalds
@ 2005-12-21 16:42             ` Oleg Nesterov
  2006-01-10 10:28               ` Balbir Singh
  0 siblings, 1 reply; 29+ messages in thread
From: Oleg Nesterov @ 2005-12-21 16:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin LaHaise, Ingo Molnar, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro

Linus Torvalds wrote:
> 
> > > [ Oh.  I'm looking at the semaphore code, and I realize that we have a
> > >   "wake_up(&sem->wait)" in the __down() path because we had some race long
> > >   ago that we fixed by band-aiding over it. Which means that we wake up
> > >   sleepers that shouldn't be woken up. THAT may well be part of the
> > >   performance problem.. The semaphores are really meant to wake up just
> > >   one at a time, but because of that race hack they'll wake up _two_ at a
> > >   time - once by up(), once by down().
> > >
> > >   That also destroys the fairness. Does anybody remember why it's that
> > >   way? ]
> >
> > History?
> 
> Oh, absolutely, I already checked the old BK history too, and that extra
> wake_up() has been there at least since before we even started using BK.
> So it's very much historical, I'm just wondering if somebody remembers far
> enough back that we'd know.
> 
> I don't see why it's needed (since we re-try the "atomic_add_negative()"
> inside the semaphore wait lock, and any up() that saw contention should
> have always been guaranteed to do a wakeup that should fill the race in
> between that atomic_add_negative() and the thing going to sleep).
> 
> It may be that it is _purely_ historical, and simply isn't needed. That
> would be funny/sad, in the sense that we've had it there for years and
> years ;)

This does not look like it was added to fix a race, or merely historical,
to me. I think that without that "wake_up" a waiter can miss a wakeup
if it is not the only sleeper.

sem->count == 0, ->sleepers == 0. down() decrements ->count,

__down:
	// ->count == -1

	++sleepers; // == 1

	for (;;) {
		->count += (->sleepers - 1); // does nothing
		if (->count >= 0) // NO
			break;

		->sleepers = 1;
		schedule();
	...

Another process calls down(), ->count == -2

__down:
	++sleepers; // == 2;

	for (;;) {
		->count += (->sleepers - 1) // ->count == -1;

		->sleepers = 1;
		schedule();
	...

up() makes ++->count == 0, and wakes one of them. It will see
->sleepers == 1, so atomic_add_negative(sleepers - 1) again
has no effect, sets ->sleepers == 0 and takes the semaphore.

Note that a subsequent up() will not call wake_up(): ->count == 0,
it just increments it. That is why we are waking the next waiter
in advance. When it gets the CPU, it will decrement ->count by 1,
because ->sleepers == 0. If up() (++->count) was already called,
it takes the semaphore. If not, it goes to sleep again.

Or is my understanding completely broken?

Oleg.
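
[ For reference, the __down() slow path being analyzed above looks roughly
  like the following. This is a sketch reconstructed from the walkthroughs in
  this thread and the 2.6-era i386 semaphore code, so treat the exact helper
  names and locking details as approximate: ]

	void __down(struct semaphore *sem)
	{
		struct task_struct *tsk = current;
		DECLARE_WAITQUEUE(wait, tsk);
		unsigned long flags;

		tsk->state = TASK_UNINTERRUPTIBLE;
		spin_lock_irqsave(&sem->wait.lock, flags);
		add_wait_queue_exclusive_locked(&sem->wait, &wait);

		sem->sleepers++;
		for (;;) {
			int sleepers = sem->sleepers;

			/*
			 * Fold the other sleepers back into ->count; the -1
			 * is because this task still hopes to get the semaphore.
			 */
			if (!atomic_add_negative(sleepers - 1, &sem->count)) {
				sem->sleepers = 0;
				break;
			}
			sem->sleepers = 1;	/* just us - see the -1 above */
			spin_unlock_irqrestore(&sem->wait.lock, flags);

			schedule();

			spin_lock_irqsave(&sem->wait.lock, flags);
			tsk->state = TASK_UNINTERRUPTIBLE;
		}
		remove_wait_queue_locked(&sem->wait, &wait);
		wake_up_locked(&sem->wait);	/* the extra wakeup under debate */
		spin_unlock_irqrestore(&sem->wait.lock, flags);
		tsk->state = TASK_RUNNING;
	}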

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2005-12-21 16:42             ` Oleg Nesterov
@ 2006-01-10 10:28               ` Balbir Singh
  2006-01-10 18:03                 ` Oleg Nesterov
  0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2006-01-10 10:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Benjamin LaHaise, Ingo Molnar, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro

On Wed, Dec 21, 2005 at 07:42:14PM +0300, Oleg Nesterov wrote:
> Linus Torvalds wrote:
> > 
> > > > [ Oh.  I'm looking at the semaphore code, and I realize that we have a
> > > >   "wake_up(&sem->wait)" in the __down() path because we had some race long
> > > >   ago that we fixed by band-aiding over it. Which means that we wake up
> > > >   sleepers that shouldn't be woken up. THAT may well be part of the
> > > >   performance problem.. The semaphores are really meant to wake up just
> > > >   one at a time, but because of that race hack they'll wake up _two_ at a
> > > >   time - once by up(), once by down().
> > > >
> > > >   That also destroys the fairness. Does anybody remember why it's that
> > > >   way? ]
> > >
> > > History?
> > 
> > Oh, absolutely, I already checked the old BK history too, and that extra
> > wake_up() has been there at least since before we even started using BK.
> > So it's very much historical, I'm just wondering if somebody remembers far
> > enough back that we'd know.
> > 
> > I don't see why it's needed (since we re-try the "atomic_add_negative()"
> > inside the semaphore wait lock, and any up() that saw contention should
> > have always been guaranteed to do a wakeup that should fill the race in
> > between that atomic_add_negative() and the thing going to sleep).
> > 
> > It may be that it is _purely_ historical, and simply isn't needed. That
> > would be funny/sad, in the sense that we've had it there for years and
> > years ;)
> 
> This does not look like it was added to fix a race, or merely historical,
> to me. I think that without that "wake_up" a waiter can miss a wakeup
> if it is not the only sleeper.
> 
> sem->count == 0, ->sleepers == 0. down() decrements ->count,
> 
> __down:
> 	// ->count == -1
> 
> 	++sleepers; // == 1
> 
> 	for (;;) {
> 		->count += (->sleepers - 1); // does nothing
> 		if (->count >= 0) // NO
> 			break;
> 
> 		->sleepers = 1;
> 		schedule();
> 	...
> 
> Another process calls down(), ->count == -2
> 
> __down:
> 	++sleepers; // == 2;
> 
> 	for (;;) {
> 		->count += (->sleepers - 1) // ->count == -1;
> 
> 		->sleepers = 1;
> 		schedule();
> 	...
> 
> up() makes ++->count == 0, and wakes one of them. It will see
> ->sleepers == 1, so atomic_add_negative(sleepers - 1) again
> has no effect, sets ->sleepers == 0 and takes the semaphore.
> 
> Note that a subsequent up() will not call wake_up(): ->count == 0,
> it just increments it. That is why we are waking the next waiter
> in advance. When it gets the CPU, it will decrement ->count by 1,
> because ->sleepers == 0. If up() (++->count) was already called,
> it takes the semaphore. If not, it goes to sleep again.
> 
> Or is my understanding completely broken?

Sorry for the delayed response, but I just started looking at the double
wakeup issue.

The analysis looks a bit confusing. We start with an initial count=0, and
after 2 down()'s and 2 up()'s the expected final value of count is 0, but
it seems like it ends up at +1.

My analysis is based on your analysis and I have used your cool convention!

Let's assume that there are two tasks, P1 and P2.

For a semaphore with ->count = 0 and ->sleepers = 0

If P1 does a down(), then

->count = -1 // decl

->sleepers++ //sleepers = 1

for (;;)
	sleepers = 1;
	->count += (sleepers - 1); // sleepers - 1 = 0, count = -1
	if (->count >= 0)  // count is -1
		break;
	->sleepers = 1;
	schedule();


Now, if there is another down() by P2

->count = -2 // decl
->sleepers++ // sleepers = 2

for (;;)
	sleepers = 2;
	->count += (sleepers - 1);	// (sleepers - 1) = 1, count = -1
	if (->count >= 0) // count is -1
		break;
	->sleepers = 1;
	schedule()


Consider some task doing an up()

->count++ //incl, count = 0, count was previously -1

This wakes up one of the two sleepers, say P1. P1 will do the following:

	....
	spin_lock_irqsave(...)
	tsk->state = TASK_UNINTERRUPTIBLE

	and go back to the for loop

	sleepers = 1;
	->count += (sleepers - 1) // count += 0, count = 0
	if (count >= 0) {	// YES
		->sleepers = 0;
		break;
	}

	// outside the for loop
	remove_wait_queue_locked(->wait, &wait);

	wake_up_locked(->wait); // double wakeup
	tsk->state = TASK_RUNNING

The *double wakeup* will cause P2 to be woken up; it will do the following:

	...
	....
	spin_lock_irqsave(...)
	tsk->state = TASK_UNINTERRUPTIBLE

	and go back to the for loop
	sleepers = 0; // the last task out of __down, set sleepers to 0
	->count += -1; // (sleepers - 1) = -1, count = -1

	if (count >= 0) // NO, count = -1
		break;

	sleepers = 1;
	spin_unlock_irqrestore(...)
	schedule()

Without this *double wakeup*, the count would stay at 0 even though P2 is
still asleep, which is incorrect.

If some other task does an up() 
->count++ // incl, count = 0, was previously -1


This would wake up P2, which would do the following:

	....
	spin_lock_irqsave(...)
	tsk->state = TASK_UNINTERRUPTIBLE

	and go back to the for loop

	sleepers = 1;
	->count += 0;	// count = 0
	if (count >= 0) { // YES
		->sleepers = 0;
		break;
	}


The question now remains as to why we have the atomic_add_negative()? Why do
we change the count in __down(), when down() has already decremented it
for us?

Comments?

Balbir

PS: This analysis involved no caffeine, so it might be incorrect or bogus.
I did try and do my best to check its validity.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2006-01-10 10:28               ` Balbir Singh
@ 2006-01-10 18:03                 ` Oleg Nesterov
  2006-01-11  6:33                   ` Balbir Singh
  0 siblings, 1 reply; 29+ messages in thread
From: Oleg Nesterov @ 2006-01-10 18:03 UTC (permalink / raw)
  To: balbir
  Cc: Linus Torvalds, Benjamin LaHaise, Ingo Molnar, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro

Balbir Singh wrote:
>
> On Wed, Dec 21, 2005 at 07:42:14PM +0300, Oleg Nesterov wrote:
> >
> > Note that a subsequent up() will not call wake_up(): ->count == 0,
> > it just increments it. That is why we are waking the next waiter
> > in advance. When it gets the CPU, it will decrement ->count by 1,
> > because ->sleepers == 0. If up() (++->count) was already called,
> > it takes the semaphore. If not, it goes to sleep again.
> >
> > Or is my understanding completely broken?
>
> [ ... long snip ... ]
>
> The question now remains as to why we have the atomic_add_negative()? Why do
> we change the count in __down(), when down() has already decremented it
> for us?

... and why __down() always sets ->sleepers = 0 on return.

I don't have an answer, only a wild guess.

Note that if P1 releases this semaphore before the pre-woken P2 actually
gets the CPU, what happens is:

	P1->up() just increments ->count, no wake_up() (fastpath)

	P2 takes the semaphore without schedule.

So *maybe* it was designed this way as some form of optimization:
in this scenario P2 has some chance of running with the sem held earlier.

Oleg.
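
[ For reference, the up() fast path / slow path split mentioned above is
  roughly the following. On i386 the fast path is really a "lock incl" in
  inline assembly with a conditional jump to the wakeup routine, so the
  atomic_add_return() below is only an illustrative stand-in: ]

	static void __up(struct semaphore *sem)
	{
		wake_up(&sem->wait);
	}

	void up(struct semaphore *sem)
	{
		/*
		 * Fast path: atomically increment ->count. Only if the result
		 * is still <= 0 (i.e. the old count was negative, so somebody
		 * is recorded as waiting) do we take the slow path and issue
		 * the wake_up().
		 */
		if (atomic_add_return(1, &sem->count) <= 0)
			__up(sem);
	}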

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2006-01-10 18:03                 ` Oleg Nesterov
@ 2006-01-11  6:33                   ` Balbir Singh
  2006-01-11  9:22                     ` Oleg Nesterov
  0 siblings, 1 reply; 29+ messages in thread
From: Balbir Singh @ 2006-01-11  6:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Benjamin LaHaise, Ingo Molnar, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro

On Tue, Jan 10, 2006 at 09:03:07PM +0300, Oleg Nesterov wrote:
> Balbir Singh wrote:
> >
> > On Wed, Dec 21, 2005 at 07:42:14PM +0300, Oleg Nesterov wrote:
> > >
> > > Note that a subsequent up() will not call wake_up(): ->count == 0,
> > > it just increments it. That is why we are waking the next waiter
> > > in advance. When it gets the CPU, it will decrement ->count by 1,
> > > because ->sleepers == 0. If up() (++->count) was already called,
> > > it takes the semaphore. If not, it goes to sleep again.
> > >
> > > Or is my understanding completely broken?
> >
> > [ ... long snip ... ]
> >
> > The question now remains as to why we have the atomic_add_negative()? Why do
> > we change the count in __down(), when down() has already decremented it
> > for us?
> 
> ... and why __down() always sets ->sleepers = 0 on return.
>

I think sem->sleepers initially started as a counter, but was later converted
to a boolean value (0 or 1). The only possible values of sem->sleepers are 0, 1
or 2, and we always use sem->sleepers - 1 and then set the value back to either
0 or 1.

sem->sleepers is set to 0 so that, when the double wakeup is issued on the
wait queue, the task that wakes up (P2) corrects the count by
(sem->sleepers - 1) = -1. This ensures that other tasks do not acquire
the semaphore with down() (count is -1), and P2 sets sem->sleepers back to 1
and sleeps.

 
> I don't have an answer, only a wild guess.
> 
> Note that if P1 releases this semaphore before the pre-woken P2 actually
> gets the CPU, what happens is:
> 
> 	P1->up() just increments ->count, no wake_up() (fastpath)
> 
> 	P2 takes the semaphore without schedule.
> 
> So *maybe* it was designed this way as some form of optimization:
> in this scenario P2 has some chance of running with the sem held earlier.
>

P1->up() will do a wake_up() only if count < 0. For no wake_up() to
happen, the count must be >= 0 before the increment, which means that
there is no one sleeping on the semaphore.
 
Feel free to correct me,
Balbir

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [patch 00/15] Generic Mutex Subsystem
  2006-01-11  6:33                   ` Balbir Singh
@ 2006-01-11  9:22                     ` Oleg Nesterov
  0 siblings, 0 replies; 29+ messages in thread
From: Oleg Nesterov @ 2006-01-11  9:22 UTC (permalink / raw)
  To: balbir
  Cc: Linus Torvalds, Benjamin LaHaise, Ingo Molnar, Andi Kleen,
	Linux Kernel Mailing List, Andrew Morton, Arjan van de Ven,
	Steven Rostedt, Alan Cox, Christoph Hellwig, David Howells,
	Alexander Viro

Balbir Singh wrote:
> 
> On Tue, Jan 10, 2006 at 09:03:07PM +0300, Oleg Nesterov wrote:
> 
> > I don't have an answer, only a wild guess.
> >
> > Note that if P1 releases this semaphore before the pre-woken P2 actually
> > gets the CPU, what happens is:
> >
> >       P1->up() just increments ->count, no wake_up() (fastpath)
> >
> >       P2 takes the semaphore without schedule.
> >
> > So *maybe* it was designed this way as some form of optimization:
> > in this scenario P2 has some chance of running with the sem held earlier.
> >
> 
> P1->up() will do a wake_up() only if count < 0. For no wake_up() to
> happen, the count must be >= 0 before the increment, which means that
> there is no one sleeping on the semaphore.

And this is exactly what happens: P1 returns from __down() with ->count == 0
and P2 pre-woken.

Oleg.

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2006-01-11  8:05 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-12-19  1:34 [patch 00/15] Generic Mutex Subsystem Ingo Molnar
2005-12-19  4:22 ` Andi Kleen
2005-12-19  4:28   ` Steven Rostedt
2005-12-19  4:31     ` Andi Kleen
2005-12-19  6:24   ` Linus Torvalds
2005-12-19 12:56     ` Steven Rostedt
2005-12-19 16:55       ` Ingo Molnar
2005-12-19 15:50     ` Ingo Molnar
2005-12-19 19:11       ` Linus Torvalds
2005-12-19 19:25         ` Benjamin LaHaise
2005-12-19 19:55           ` Linus Torvalds
2005-12-21 16:42             ` Oleg Nesterov
2006-01-10 10:28               ` Balbir Singh
2006-01-10 18:03                 ` Oleg Nesterov
2006-01-11  6:33                   ` Balbir Singh
2006-01-11  9:22                     ` Oleg Nesterov
2005-12-19 20:11           ` Ingo Molnar
2005-12-19 20:19             ` Benjamin LaHaise
2005-12-19 20:32             ` Russell King
2005-12-19 20:57               ` Ingo Molnar
2005-12-19 19:55         ` Ingo Molnar
2005-12-19 20:12           ` Linus Torvalds
2005-12-19 23:37             ` Christoph Hellwig
2005-12-20  8:03             ` Nick Piggin
2005-12-20  8:06               ` Arjan van de Ven
2005-12-20  8:21                 ` Nick Piggin
2005-12-20  8:36                   ` Arjan van de Ven
2005-12-20  8:48                     ` Nick Piggin
2005-12-19 16:22   ` Ingo Molnar
