* [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
@ 2010-04-28  4:38 Benjamin Herrenschmidt
  2010-04-28  4:39 ` Benjamin Herrenschmidt
  2010-04-28 12:06 ` Arnd Bergmann
  0 siblings, 2 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-28  4:38 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Tony Breeds, Linus Torvalds

Currently, we can hit a nasty case with optimistic spinning on mutexes:

    CPU A tries to take a mutex, while holding the BKL

    CPU B tries to take the BKL while holding the mutex

This looks like an AB-BA scenario but in practice, it is allowed and happens
due to the auto-release-on-schedule nature of the BKL.

In that case, the optimistic spinning code can get us into a situation
where instead of going to sleep, A will spin waiting for B who is
spinning waiting for A, and the only way out of that loop is the
need_resched() test in mutex_spin_on_owner().
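
In other words, the interleaving looks roughly like this (a sketch of
the two call paths only; it's the release_kernel_lock() call inside
schedule() that makes the apparent AB-BA legal):

	/* CPU A */
	lock_kernel();		/* current->lock_depth is now >= 0 */
	mutex_lock(&m);		/* owner is B and B is on-CPU, so A
				 * spins; A never calls schedule(), so
				 * the BKL is never auto-released */

	/* CPU B */
	mutex_lock(&m);		/* B now owns m */
	lock_kernel();		/* waits for A to drop the BKL, which
				 * only happens when A schedules */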

Now, that's bad enough since we may end up having those two processors
deadlocked for a while, thus introducing latencies, but I've had cases
where it completely stopped making forward progress. I suspect CPU A
had nothing else waiting to run, and so need_resched() was never set.

This patch fixes both in a rather crude way. I completely disable
spinning if we own the BKL, and I add a safety timeout using jiffies
to fall back to sleeping if we end up spinning for more than 1 or 2 jiffies.

Now, we -could- make it a bit smarter about the BKL by introducing a
contention counter and only go out if we own the BKL and it is contended,
but I didn't feel like this was worth the effort, time is better spent
removing the BKL from sensitive code paths instead.

Regarding the choice of 1 or 2 jiffies, it's completely arbitrary. I
prefer that to an arbitrary number of milliseconds mostly because it's
expected that a 1000HZ kernel is run on a workload that expects smaller
latencies, and as such better reflects the idea that if we're going
to spin for more than a scheduler tick, we may as well schedule (and
save power by doing so if we hit the idle thread).
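
(Concretely, a "jiffies + 2" budget expires somewhere between one and
two ticks after we start, depending on where in the current tick we
enter the loop: 1-2ms at 1000HZ, 10-20ms at 100HZ.)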

This timeout is also a safeguard in case we find another weird deadlock
scenario with optimistic spinning (that's the second one I found so far,
the other one was with CPU hotplug). At least we have some kind of
forward progress guarantee now.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..bc6bd9a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -361,7 +361,8 @@ extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner,
+			       unsigned long timeout);
 
 struct nsproxy;
 struct user_namespace;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 632f04c..59adbae 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -145,6 +145,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
+	unsigned long timeout;
 
 	preempt_disable();
 	mutex_acquire(&lock->dep_map, subclass, 0, ip);
@@ -168,15 +169,22 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	 * to serialize everything.
 	 */
 
-	for (;;) {
+	for (timeout = jiffies + 2; jiffies < timeout;) {
 		struct thread_info *owner;
 
 		/*
+		 * If we own the BKL, then don't spin. The owner of the mutex
+		 * might be waiting on us to release the BKL.
+		 */
+		if (current->lock_depth >= 0)
+			break;
+
+		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
 		owner = ACCESS_ONCE(lock->owner);
-		if (owner && !mutex_spin_on_owner(lock, owner))
+		if (owner && !mutex_spin_on_owner(lock, owner, timeout))
 			break;
 
 		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
diff --git a/kernel/sched.c b/kernel/sched.c
index a3dff1f..b582f2e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3765,7 +3765,8 @@ EXPORT_SYMBOL(schedule);
  * Look out! "owner" is an entirely speculative pointer
  * access and not reliable.
  */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
+int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner,
+			unsigned long timeout)
 {
 	unsigned int cpu;
 	struct rq *rq;
@@ -3801,7 +3802,7 @@ int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
 
 	rq = cpu_rq(cpu);
 
-	for (;;) {
+	while (jiffies < timeout) {
 		/*
 		 * Owner changed, break to re-assess state.
 		 */




* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-04-28  4:38 [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL Benjamin Herrenschmidt
@ 2010-04-28  4:39 ` Benjamin Herrenschmidt
  2010-04-28 12:06 ` Arnd Bergmann
  1 sibling, 0 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-28  4:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Tony Breeds, Linus Torvalds

On Wed, 2010-04-28 at 14:38 +1000, Benjamin Herrenschmidt wrote:
> Currently, we can hit a nasty case with optimistic spinning on mutexes:
> 
>     CPU A tries to take a mutex, while holding the BKL
> 
>     CPU B tries to take the BKL while holding the mutex
> 
> This looks like an AB-BA scenario but in practice, it is allowed and happens
> due to the auto-release-on-schedule nature of the BKL.

 .../...

BTW. The patch is only compile-tested so far :-) It's going to be
hammered with the test case that triggered the original bug hopefully
tonight.

Cheers,
Ben.




* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-04-28  4:38 [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL Benjamin Herrenschmidt
  2010-04-28  4:39 ` Benjamin Herrenschmidt
@ 2010-04-28 12:06 ` Arnd Bergmann
  2010-04-28 22:35   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2010-04-28 12:06 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Tony Breeds, Linus Torvalds

On Wednesday 28 April 2010, Benjamin Herrenschmidt wrote:
> Now, we -could- make it a bit smarter about the BKL by introducing a
> contention counter and only go out if we own the BKL and it is contended,
> but I didn't feel like this was worth the effort, time is better spent
> removing the BKL from sensitive code paths instead.

Agreed.

> -       for (;;) {
> +       for (timeout = jiffies + 2; jiffies < timeout;) {
> [...]
> -       for (;;) {
> +       while (jiffies < timeout) {

This needs to use time_before() to avoid problems on jiffies wraparound.
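
For reference, time_before() boils down to a signed comparison of the
difference, which stays correct across the wrap; this is essentially
the include/linux/jiffies.h definition, minus the typecheck() wrappers:

	#define time_after(a,b)		((long)((b) - (a)) < 0)
	#define time_before(a,b)	time_after(b,a)

With a plain "jiffies < timeout", the comparison is false from the
moment timeout wraps past ULONG_MAX until jiffies wraps too, so the
spin would silently be skipped around every wraparound.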

	Arnd


* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-04-28 12:06 ` Arnd Bergmann
@ 2010-04-28 22:35   ` Benjamin Herrenschmidt
  2010-05-07  4:20     ` Tony Breeds
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2010-04-28 22:35 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Tony Breeds, Linus Torvalds

On Wed, 2010-04-28 at 14:06 +0200, Arnd Bergmann wrote:
> 
> This needs to use time_before() to avoid problems on jiffies
> wraparound.

Ah right, forgot about that, it's been a while since I used jiffies
for anything :-)

I'll respin later today.

Cheers,
Ben.



* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-04-28 22:35   ` Benjamin Herrenschmidt
@ 2010-05-07  4:20     ` Tony Breeds
  2010-05-07  5:30       ` Frederic Weisbecker
  2010-05-11 15:43       ` [tip:core/locking] " tip-bot for Tony Breeds
  0 siblings, 2 replies; 20+ messages in thread
From: Tony Breeds @ 2010-05-07  4:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Tony Breeds, Linus Torvalds

On Thu, Apr 29, 2010 at 08:35:21AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2010-04-28 at 14:06 +0200, Arnd Bergmann wrote:
> > 
> > This needs to use time_before() to avoid problems on jiffies
> > wraparound.
> 
> Ah right, forgot about that, it's been a while since I used jiffies
> for anything :-)
> 
> I'll respin later today.

Like this perhaps?  If this looks good it would be great to get this in .34

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 70abfd3..991d86f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -351,7 +351,8 @@ extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void __schedule(void);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner,
+                               unsigned long timeout);
 
 struct nsproxy;
 struct user_namespace;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 947b3ad..fcf2573 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -145,6 +145,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
+	unsigned long timeout;
 
 	preempt_disable();
 	mutex_acquire(&lock->dep_map, subclass, 0, ip);
@@ -168,15 +169,22 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	 * to serialize everything.
 	 */
 
-	for (;;) {
+	for (timeout = jiffies + 2; time_before(jiffies, timeout);) {
 		struct thread_info *owner;
 
 		/*
+		 * If we own the BKL, then don't spin. The owner of the mutex
+		 * might be waiting on us to release the BKL.
+		 */
+		if (current->lock_depth >= 0)
+			break;
+
+		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
 		owner = ACCESS_ONCE(lock->owner);
-		if (owner && !mutex_spin_on_owner(lock, owner))
+		if (owner && !mutex_spin_on_owner(lock, owner, timeout))
 			break;
 
 		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
diff --git a/kernel/sched.c b/kernel/sched.c
index 663a1d0..ef58dc0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5523,7 +5523,8 @@ EXPORT_SYMBOL(schedule);
  * Look out! "owner" is an entirely speculative pointer
  * access and not reliable.
  */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
+int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner,
+			unsigned long timeout)
 {
 	unsigned int cpu;
 	struct rq *rq;
@@ -5559,7 +5560,7 @@ int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
 
 	rq = cpu_rq(cpu);
 
-	for (;;) {
+	while (time_before(jiffies, timeout)) {
 		/*
 		 * Owner changed, break to re-assess state.
 		 */

Yours Tony


* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-05-07  4:20     ` Tony Breeds
@ 2010-05-07  5:30       ` Frederic Weisbecker
  2010-05-07  6:01         ` Benjamin Herrenschmidt
  2010-05-07  6:16         ` Mike Galbraith
  2010-05-11 15:43       ` [tip:core/locking] " tip-bot for Tony Breeds
  1 sibling, 2 replies; 20+ messages in thread
From: Frederic Weisbecker @ 2010-05-07  5:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Arnd Bergmann, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Tony Breeds, Linus Torvalds

On Fri, May 07, 2010 at 02:20:10PM +1000, Tony Breeds wrote:
> On Thu, Apr 29, 2010 at 08:35:21AM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2010-04-28 at 14:06 +0200, Arnd Bergmann wrote:
> > > 
> > > This needs to use time_before() to avoid problems on jiffies
> > > wraparound.
> > 
> > Ah right, forgot about that, it's been a while since I used jiffies
> > for anything :-)
> > 
> > I'll respin later today.
> 
> Like this perhaps?  If this looks good it would be great to get this in .34
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 70abfd3..991d86f 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -351,7 +351,8 @@ extern signed long schedule_timeout_killable(signed long timeout);
>  extern signed long schedule_timeout_uninterruptible(signed long timeout);
>  asmlinkage void __schedule(void);
>  asmlinkage void schedule(void);
> -extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
> +extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner,
> +                               unsigned long timeout);
>  
>  struct nsproxy;
>  struct user_namespace;
> diff --git a/kernel/mutex.c b/kernel/mutex.c
> index 947b3ad..fcf2573 100644
> --- a/kernel/mutex.c
> +++ b/kernel/mutex.c
> @@ -145,6 +145,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  	struct task_struct *task = current;
>  	struct mutex_waiter waiter;
>  	unsigned long flags;
> +	unsigned long timeout;
>  
>  	preempt_disable();
>  	mutex_acquire(&lock->dep_map, subclass, 0, ip);
> @@ -168,15 +169,22 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
>  	 * to serialize everything.
>  	 */
>  
> -	for (;;) {
> +	for (timeout = jiffies + 2; time_before(jiffies, timeout);) {
>  		struct thread_info *owner;
>  
>  		/*
> +		 * If we own the BKL, then don't spin. The owner of the mutex
> +		 * might be waiting on us to release the BKL.
> +		 */
> +		if (current->lock_depth >= 0)
> +			break;
> +
> +		/*
>  		 * If there's an owner, wait for it to either
>  		 * release the lock or go to sleep.
>  		 */
>  		owner = ACCESS_ONCE(lock->owner);
> -		if (owner && !mutex_spin_on_owner(lock, owner))
> +		if (owner && !mutex_spin_on_owner(lock, owner, timeout))
>  			break;



I like the safeguard against the BKL, it indeed looks like something
we should have in .34

But I really don't like the timeout.

This is going to make things even worse if we have another cause of
deadlock, by hiding the worst part of the consequences without actually
solving the problem.
And since the induced latency or deadlock won't be easily visible anymore,
we'll miss that there is a problem. So we are going to spin for two jiffies
and only someone doing specific latency measurements will notice, if he's
lucky enough to meet the bug.

Moreover that adds some unnecessary (small) overhead in this path.

Maybe we could have it as a debugging option, something that would
be part of lockdep, which would require CONFIG_DEBUG_MUTEX to
support mutex adaptive spinning.

A debugging option that could just dump the held locks and the
current one if we spin for an excessive timeslice.

Thanks.



* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-05-07  5:30       ` Frederic Weisbecker
@ 2010-05-07  6:01         ` Benjamin Herrenschmidt
  2010-05-07 21:29           ` Frederic Weisbecker
  2010-05-07  6:16         ` Mike Galbraith
  1 sibling, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2010-05-07  6:01 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Arnd Bergmann, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Tony Breeds, Linus Torvalds

On Fri, 2010-05-07 at 07:30 +0200, Frederic Weisbecker wrote:
> 
> 
> I like the safeguard against the BKL, it indeed looks like something
> we should have in .34
> 
> But I really don't like the timeout.

And I hate not having it :-)

> This is going to make things even worse if we have another cause
> of deadlock, by hiding the worst part of the consequences without
> actually solving the problem.

Yes and no. There are two reasons why the timeout is good. One is the
safeguard part and it's arguable whether it helps hide bugs or not,
but there are very real cases where I believe we should get out and go
to sleep as well that aren't bugs.

IE. If the mutex owner is running for a long time and nobody is
contending on your CPU, you're simply not going to hit the
need_resched() test. That means you will spin, which means you will suck
a lot more power as well, not counting the potentially bad effect on
rebalance, load average etc...

There's also the fact that 2 CPUs or more trying to obtain it at once
may all go into spinning, which can lead to interesting results in terms
of power consumption (and cpu_relax doesn't help that much).

I really don't think it's a good idea to turn mutexes into potential
multi-jiffy spinning things like that.

> And since the induced latency or deadlock won't be easily visible
> anymore, we'll miss that there is a problem. So we are going to spin for
> two jiffies and only someone doing specific latency measurements will
> notice, if he's lucky enough to meet the bug.

Well, the thing is that it may not be a bug.

The thing is that the actual livelock should really only happen with
the BKL since that's the only thing we have that allows for
AB->BA semantics. Anything else should hopefully be caught by lockdep.

So I don't think there's that much to fear about hidden bugs.

But I -also- don't see the point of spinning potentially for a very long
time instead of going to sleep and saving power. The adaptive spinning
goal is to have an opportunistic optimization based on the idea that the
mutex is likely to be held for a very short period of time by its owner
and nobody's waiting for it yet. Ending up doing multi-jiffy spins
just doesn't fit in that picture. In fact, I was tempted to make the
timeout a lot shorter but decided against calling into clock sources
etc... and instead kept it simple with jiffies.

> Moreover that adds some unnecessary (small) overhead in this path.

Uh ? So ? This is the contended path where we are .. spinning :-) The
overhead of reading jiffies and comparing here is simply never going to
show on any measurement I bet you :-)

> Maybe we could have it as a debugging option, something that would
> be part of lockdep, which would require CONFIG_DEBUG_MUTEX to
> support mutex adaptive spinning.

No, what I would potentially add as part of lockdep however is a way to
instrument how often we get out of the spin via the timeout. That might
be useful information to figure out some of those runaway code paths,
but they may well happen for very legit reasons and aren't a bug per-se.

> A debugging option that could just dump the held locks and the
> current one if we spin for an excessive timeslice.

Ben.




* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-05-07  5:30       ` Frederic Weisbecker
  2010-05-07  6:01         ` Benjamin Herrenschmidt
@ 2010-05-07  6:16         ` Mike Galbraith
  1 sibling, 0 replies; 20+ messages in thread
From: Mike Galbraith @ 2010-05-07  6:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Benjamin Herrenschmidt, Arnd Bergmann, Peter Zijlstra,
	Ingo Molnar, linux-kernel, Tony Breeds, Linus Torvalds

On Fri, 2010-05-07 at 07:30 +0200, Frederic Weisbecker wrote:

> I like the safeguard against the BKL, it indeed looks like something
> we should have in .34
> 
> But I really don't like the timeout.
> 
> This is going to make things even worse if we have another cause of
> deadlock, by hiding the worst part of the consequences without actually
> solving the problem.
> And since the induced latency or deadlock won't be easily visible anymore,
> we'll miss that there is a problem. So we are going to spin for two jiffies
> and only someone doing specific latency measurements will notice, if he's
> lucky enough to meet the bug.
> 
> Moreover that adds some unnecessary (small) overhead in this path.
> 
> Maybe we could have it as a debugging option, something that would
> be part of lockdep, which would require CONFIG_DEBUG_MUTEX to
> support mutex adaptive spinning.
> 
> A debugging option that could just dump the held locks and the
> current one if we spin for an excessive timeslice.

FYI, there's a case where OWNER_SPIN eats astounding amounts of CPU.

Run virgin AIM7 V1.1 with many many tasks and a nil workfile.  When the
thing kicks off preparing for test, it forks off these many tasks, who
then all try to phone home via pipe.  The result is horrible to behold.

There's a fix for AIM7's scalability problem, but the virgin code looks
like a decent OWNER_SPIN corner case test.  It turns a 32-core SMT box
into an expensive space heater at sufficiently high task count :)

	-Mike



* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-05-07  6:01         ` Benjamin Herrenschmidt
@ 2010-05-07 21:29           ` Frederic Weisbecker
  2010-05-07 22:27             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 20+ messages in thread
From: Frederic Weisbecker @ 2010-05-07 21:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Tony Breeds, Linus Torvalds

On Fri, May 07, 2010 at 04:01:56PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-05-07 at 07:30 +0200, Frederic Weisbecker wrote:
> > 
> > 
> > I like the safeguard against the BKL, it indeed looks like something
> > we should have in .34
> > 
> > But I really don't like the timeout.
> 
> And I hate not having it :-)
> 
> > This is going to make things even worse if we have another cause
> > of deadlock, by hiding the worst part of the consequences without
> > actually solving the problem.
> 
> Yes and no. There are two reasons why the timeout is good. One is the
> safeguard part and it's arguable whether it helps hide bugs or not,
> but there are very real cases where I believe we should get out and go
> to sleep as well that aren't bugs.
> 
> IE. If the mutex owner is running for a long time and nobody is
> contending on your CPU, you're simply not going to hit the
> need_resched() test. That means you will spin, which means you will suck
> a lot more power as well, not counting the potentially bad effect on
> rebalance, load average etc...
> 
> There's also the fact that 2 CPUs or more trying to obtain it at once
> may all go into spinning, which can lead to interesting results in terms
> of power consumption (and cpu_relax doesn't help that much).
> 
> I really don't think it's a good idea to turn mutexes into potential
> multi-jiffy spinning things like that.



Yeah, I was considering only the deadlock issue.

Ok, it's a problem for power consumption. That said, I wonder if we have
that many places where a task can hold a mutex for two jiffies without
sleeping in between, since when the owner goes to sleep, spinning is
stopped.

If that's an issue for power saving, it means we have places to fix.
Also I wonder if we should disable adaptive spinning for people who want
aggressive power saving. At least it would be worth a test to check the
result.

Also, to solve this problem when several cpus may spin on the owner,
I wonder if adaptive spinning is doing the right thing here. We should
have only one spinner I think, and the rest should go to sleep, as the
time spent spinning on several subsequent owners would be much better
used doing something else (scheduling something else or power saving).
In fact, that too could deserve some tests.

 
> > And since the induced latency or deadlock won't be easily visible
> > anymore, we'll miss that there is a problem. So we are going to spin for
> > two jiffies and only someone doing specific latency measurements will
> > notice, if he's lucky enough to meet the bug.
> 
> Well, the thing is that it may not be a bug.
> 
> The thing is that the actual livelock should really only happen with
> the BKL since that's the only thing we have that allows for
> AB->BA semantics. Anything else should hopefully be caught by lockdep.
> 
> So I don't think there's that much to fear about hidden bugs.
> 
> But I -also- don't see the point of spinning potentially for a very long
> time instead of going to sleep and saving power. The adaptive spinning
> goal is to have an opportunistic optimization based on the idea that the
> mutex is likely to be held for a very short period of time by its owner
> and nobody's waiting for it yet. Ending up doing multi-jiffy spins
> just doesn't fit in that picture. In fact, I was tempted to make the
> timeout a lot shorter but decided against calling into clock sources
> etc... and instead kept it simple with jiffies.


Ok. But even under this power saving angle, I think this heuristic based on
a timeout is still only hiding the problem. Or it mitigates it more than it
solves it, which is still better than keeping things as they are, of course.
But if the real source is in places that hold mutexes without sleeping for
too long, I think we should rather identify these places and solve
the ugly locking there.

Plus maybe allowing only one spinner at a time.

But yeah, until we reach a good solution for that, maybe this timeout
would be worthwhile.



> > Moreover that adds some unnecessary (small) overhead in this path.
> 
> Uh ? So ? This is the contended path where we are .. spinning :-) The
> overhead of reading jiffies and comparing here is simply never going to
> show on any measurement I bet you :-)



Yeah, probably in fact :)



> > Maybe we could have it as a debugging option, something that would
> > be part of lockdep, which would require CONFIG_DEBUG_MUTEX to
> > support mutex adaptive spinning.
> 
> No, what I would potentially add as part of lockdep however is a way to
> instrument how often we get out of the spin via the timeout. That might
> be useful information to figure out some of those runaway code paths,
> but they may well happen for very legit reasons and aren't a bug per-se.



We could make a trace event for that.

Thanks.



* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-05-07 21:29           ` Frederic Weisbecker
@ 2010-05-07 22:27             ` Benjamin Herrenschmidt
  2010-05-10  7:55               ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2010-05-07 22:27 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Arnd Bergmann, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Tony Breeds, Linus Torvalds


> Yeah, I was considering only the deadlock issue.
> 
> Ok, it's a problem for power consumption. That said, I wonder if we have
> that many places where a task can hold a mutex for two jiffies without
> sleeping in between, since when the owner goes to sleep, spinning is
> stopped.
> 
> If that's an issue for power saving, it means we have places to fix.
> Also I wonder if we should disable adaptive spinning for people who want
> aggressive power saving. At least it would be worth a test to check the
> result.

I don't think it's only about aggressive power saving. I'm not sure this
is directly related but somebody recently reported a case where pretty
much every CPU in the machine gets to spin on the one holding the mutex, the
power consumption skyrockets and everything heats up at once. Not nice
even on servers on workloads where otherwise everybody would have gone
to sleep.

My feeling is that spinning should last only a very small amount of
time, ideally using a better clock than jiffies, maybe a few dozen ms at
max. But I decided to compromise for code simplicity and less overhead
by using jiffies instead. Maybe a 1 jiffy delta would have been enough,
though not two (which means a random timeout between 0 and 1 jiffy).

> Also, to solve this problem when several cpus may spin on the owner,
> I wonder if adaptive spinning is doing the right thing here. We should
> have only one spinner I think, and the rest should go to sleep, as the
> time spent spinning on several subsequent owners would be much better
> used doing something else (scheduling something else or power saving).
> In fact, that too could deserve some tests.

Right, the problem is due to the fact that we skip spinning if there's
already a waiter but we don't know that there is already a spinner so we
can end up with multiple spinners.

I don't see a non-invasive way to fix that... we could add a spinner
counter to the mutex but that sucks a bit. Might still be worthwhile,
not sure. Peter, what do you reckon?
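
Something like this maybe (completely made up, the "spinners" field
doesn't exist in struct mutex, it's just to illustrate the idea):

	/* hypothetical: atomic_t spinners added to struct mutex */
	if (atomic_cmpxchg(&lock->spinners, 0, 1) != 0)
		goto go_to_sleep;	/* somebody already spins */

	/* ... existing optimistic spin loop ... */

	atomic_set(&lock->spinners, 0);	/* done spinning either way */

But that's exactly the kind of extra atomic op on that path that sucks.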

> Ok. But even under this power saving angle, I think this heuristic based on
> a timeout is still only hiding the problem. Or it mitigates it more than it
> solves it, which is still better than keeping things as they are, of course.
> But if the real source is in places that hold mutexes without sleeping for
> too long, I think we should rather identify these places and solve
> the ugly locking there.

That's a job for lockdep or something similar :-) Doing so is
orthogonal, I don't see it conflicting with limiting the spinning
time. Again, I tend to have a different "view" of it: spinning is an
opportunistic optimization based on the idea that the owner will release
the mutex soon, I don't see the point of spinning for a long time :-)

Note that power management is one side of the story. Another one is
relinquishing the CPU to the hypervisor so some other partition gets
some time. Etc... There's more than one consequence to ending up with
CPUs spinning instead of sleeping and they are not all trivial, so by
letting mutexes spin for potentially a "long time", we quite
fundamentally change the semantics of those mutexes with impacts in other
areas that we don't necessarily see. This goes beyond the scope of a
simple optimization and becomes problematic.

Yes, we can try to find all of these areas and address them one by
one, or we can use my simpler approach which is to keep the overall
semantic of mutexes which is to sleep ... and bound the optimisation to
what it should be: a short term spin to speed up short lived contention
cases.

But yeah, the argument is very valid that maybe we should spin for even
less time and that maybe we should use a finer-grained clock to limit the
spin to a few microseconds at most. The exact number remains to be
determined :-)

> Plus maybe allowing only one spinner at a time.
> 
> But yeah, until we reach a good solution for that, maybe this timeout
> would be worthwhile.

Cheers,
Ben.

> 
> 
> > > Moreover that adds some unnecessary (small) overhead in this path.
> > 
> > Uh ? So ? This is the contended path where we are .. spinning :-) The
> > overhead of reading jiffies and comparing here is simply never going to
> > show on any measurement I bet you :-)
> 
> 
> 
> Yeah, probably in fact :)
> 
> 
> 
> > > Maybe we could have it as a debugging option, something that would
> > > be part of lockdep, which would require CONFIG_DEBUG_MUTEX to
> > > support mutex adaptive spinning.
> > 
> > No, what I would potentially add as part of lockdep however is a way to
> > instrument how often we get out of the spin via the timeout. That might
> > be useful information to figure out some of those runaway code paths,
> > but they may well happen for very legit reasons and aren't a bug per-se.
> 
> 
> 
> We could make a trace event for that.
> 
> Thanks.
> 




* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-05-07 22:27             ` Benjamin Herrenschmidt
@ 2010-05-10  7:55               ` Peter Zijlstra
  2010-05-11 18:06                 ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2010-05-10  7:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Frederic Weisbecker, Arnd Bergmann, Ingo Molnar, linux-kernel,
	Tony Breeds, Linus Torvalds

On Sat, 2010-05-08 at 08:27 +1000, Benjamin Herrenschmidt wrote:
> > Also, to solve this problem when several cpus may spin on the owner,
> > I wonder if adaptive spinning is doing the right thing here. We should
> > have only one spinner I think, and the rest should go to sleep, as the
> > time spent spinning on several subsequent owners would be much better
> > used doing something else (scheduling something else or power saving).
> > In fact, that too could deserve some tests.
> 
> Right, the problem is due to the fact that we skip spinning if there's
> already a waiter but we don't know that there is already a spinner so we
> can end up with multiple spinners.
> 
> I don't see a non-invasive way to fix that... we could add a spinner
> counter to the mutex but that sucks a bit. Might still be worthwhile,
> not sure. Peter, what do you reckon?

If it's a large problem the lock is overly contended and _that_ needs
fixing. I don't at all feel like adding atomic ops to the spin loop to
try and detect this.

As to the 2 jiffy spin timeout, I guess we should add a lockdep warning
for that, because anybody holding a mutex for longer than 2 jiffies and
not sleeping does need fixing anyway.
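
Something along these lines perhaps (rough sketch only, the
"acquired_at" debug field is made up, and it counts sleep time too, so
it only approximates "held without scheduling"):

	/* in __mutex_lock_common(), once the lock is acquired */
	lock->acquired_at = jiffies;	/* hypothetical debug field */

	/* in mutex_unlock(), under CONFIG_DEBUG_MUTEXES */
	WARN_ONCE(time_after(jiffies, lock->acquired_at + 2),
		  "mutex held for more than 2 jiffies\n");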



* [tip:core/locking] mutex: Fix optimistic spinning vs. BKL
  2010-05-07  4:20     ` Tony Breeds
  2010-05-07  5:30       ` Frederic Weisbecker
@ 2010-05-11 15:43       ` tip-bot for Tony Breeds
  2010-05-11 23:05         ` Tony Breeds
  2010-05-18 16:08         ` Ingo Molnar
  1 sibling, 2 replies; 20+ messages in thread
From: tip-bot for Tony Breeds @ 2010-05-11 15:43 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, benh, stable, tglx, mingo, tony

Commit-ID:  227945799cc10d77c6ef812f3eb8a61a78689454
Gitweb:     http://git.kernel.org/tip/227945799cc10d77c6ef812f3eb8a61a78689454
Author:     Tony Breeds <tony@bakeyournoodle.com>
AuthorDate: Fri, 7 May 2010 14:20:10 +1000
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 11 May 2010 17:07:24 +0200

mutex: Fix optimistic spinning vs. BKL

Currently, we can hit a nasty case with optimistic spinning on
mutexes:

    CPU A tries to take a mutex, while holding the BKL

    CPU B tries to take the BKL while holding the mutex

This looks like an AB-BA scenario but in practice, it is allowed and
happens due to the auto-release-on-schedule nature of the BKL.

In that case, the optimistic spinning code can get us into a situation
where instead of going to sleep, A will spin waiting for B who is
spinning waiting for A, and the only way out of that loop is the
need_resched() test in mutex_spin_on_owner().

Now, that's bad enough since we may end up having those two processors
deadlocked for a while, thus introducing latencies, but I've had cases
where it completely stopped making forward progress. I suspect CPU A
had nothing else waiting to run, and so need_resched() was never set.

This patch fixes both in a rather crude way. I completely disable
spinning if we own the BKL, and I add a safety timeout using jiffies
to fall back to sleeping if we end up spinning for more than 1 or 2
jiffies.

Now, we -could- make it a bit smarter about the BKL by introducing a
contention counter and only go out if we own the BKL and it is
contended, but I didn't feel like this was worth the effort, time is
better spent removing the BKL from sensitive code paths instead.

Regarding the choice of 1 or 2 jiffies, it's completely arbitrary. I
prefer that to an arbitrary number of milliseconds mostly because it's
expected that a 1000HZ kernel is run on a workload that expects
smaller latencies, and as such better reflects the idea that if we're
going to spin for more than a scheduler tick, we may as well schedule
(and save power by doing so if we hit the idle thread).

This timeout is also a safeguard in case we find another weird
deadlock scenario with optimistic spinning (that's the second one I
found so far, the other one was with CPU hotplug). At least we have
some kind of forward progress guarantee now.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
LKML-Reference: <20100507042010.GR12389@ozlabs.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    3 ++-
 kernel/mutex.c        |   12 ++++++++++--
 kernel/sched.c        |    5 +++--
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..bc6bd9a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -361,7 +361,8 @@ extern signed long schedule_timeout_interruptible(signed long timeout);
 extern signed long schedule_timeout_killable(signed long timeout);
 extern signed long schedule_timeout_uninterruptible(signed long timeout);
 asmlinkage void schedule(void);
-extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
+extern int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner,
+			       unsigned long timeout);
 
 struct nsproxy;
 struct user_namespace;
diff --git a/kernel/mutex.c b/kernel/mutex.c
index 632f04c..7d4626b 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -145,6 +145,7 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	struct task_struct *task = current;
 	struct mutex_waiter waiter;
 	unsigned long flags;
+	unsigned long timeout;
 
 	preempt_disable();
 	mutex_acquire(&lock->dep_map, subclass, 0, ip);
@@ -168,15 +169,22 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 	 * to serialize everything.
 	 */
 
-	for (;;) {
+	for (timeout = jiffies + 2; time_before(jiffies, timeout);) {
 		struct thread_info *owner;
 
 		/*
+		 * If we own the BKL, then don't spin. The owner of the mutex
+		 * might be waiting on us to release the BKL.
+		 */
+		if (current->lock_depth >= 0)
+			break;
+
+		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
 		owner = ACCESS_ONCE(lock->owner);
-		if (owner && !mutex_spin_on_owner(lock, owner))
+		if (owner && !mutex_spin_on_owner(lock, owner, timeout))
 			break;
 
 		if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
diff --git a/kernel/sched.c b/kernel/sched.c
index 3c2a54f..e613160 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3775,7 +3775,8 @@ EXPORT_SYMBOL(schedule);
  * Look out! "owner" is an entirely speculative pointer
  * access and not reliable.
  */
-int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
+int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner,
+			unsigned long timeout)
 {
 	unsigned int cpu;
 	struct rq *rq;
@@ -3811,7 +3812,7 @@ int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
 
 	rq = cpu_rq(cpu);
 
-	for (;;) {
+	while (time_before(jiffies, timeout)) {
 		/*
 		 * Owner changed, break to re-assess state.
 		 */


* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-05-10  7:55               ` Peter Zijlstra
@ 2010-05-11 18:06                 ` Linus Torvalds
  2010-05-11 18:19                   ` Peter Zijlstra
  2010-05-11 21:13                   ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 20+ messages in thread
From: Linus Torvalds @ 2010-05-11 18:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Herrenschmidt, Frederic Weisbecker, Arnd Bergmann,
	Ingo Molnar, linux-kernel, Tony Breeds



On Mon, 10 May 2010, Peter Zijlstra wrote:
> 
> As to the 2 jiffy spin timeout, I guess we should add a lockdep warning
> for that, because anybody holding a mutex for longer than 2 jiffies and
> not sleeping does need fixing anyway.

I really hate the jiffies thing, but looking at the optimistic spinning, I 
do wonder about two things..

First - we check "need_resched()" only if owner is NULL. That sounds 
wrong. If we need to reschedule, we need to stop spinning _regardless_ of 
whether the owner may have been preempted before setting the owner field.

Second: we allow "owner" to change, and we'll continue spinning. This is 
how you can end up spinning for a long time - not because anybody holds 
the mutex for longer than 2 jiffies, but because a lot of other threads 
_together_ hold the mutex for longer than 2 jiffies.

Now, I think we do want some limited "continue spinning even if somebody 
else ended up getting it instead", but I think we should at least limit 
it. Otherwise we end up being potentially rather unfair, since we don't 
have any fair queueing logic for the optimistic spinning phase.

Now, we could just count the number of times "owner" has changed, and I 
suspect that would be sufficient. Now, that trivial counting scheme would
fail if "owner" stays the same (ie the same process re-takes the lock over 
and over again, possibly due to hot cacheline things being very unfair 
to the person who already owns it), but quite frankly, I don't think we 
can get into that kind of situation. 

Why? Mutexes may end up being very heavily contended, but they can't be 
contended by just _one_ thread. So if we're really in a starvation issue, 
the thread that is waiting _will_ see multiple different owners.

So once you have seen X number of other owners, you just say "screw it, 
this spinning thing isn't working for me, I'll go to the sleeping case".
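
Roughly something like this in the outer loop (untested sketch, and the
limit is pulled out of thin air):

	struct thread_info *prev_owner = NULL;
	int owner_changes = 0;

	for (;;) {
		struct thread_info *owner;

		owner = ACCESS_ONCE(lock->owner);
		if (owner && owner != prev_owner) {
			prev_owner = owner;
			if (owner_changes++ > 8)
				break;	/* too many owners - go sleep */
		}
		if (owner && !mutex_spin_on_owner(lock, owner))
			break;
		/* ... cmpxchg acquisition attempt as before ... */
	}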

Of course, that is _really_ unfair. It will make it all the easier for
the other spinners to just get the lock by spinning. We used to have some 
logic to say "we won't spin if there are proper waiters", but that was 
horrible for performance. See commit ac6e60ee405.

Of course, it's quite possible that as long as "need_resched()" isn't set, 
spinning really _is_ the right thing to do. Maybe it causes horrible CPU 
load on some odd "everybody synchronize" loads, but maybe that really is 
the best we can do.

			Linus


* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-05-11 18:06                 ` Linus Torvalds
@ 2010-05-11 18:19                   ` Peter Zijlstra
  2010-05-11 21:13                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2010-05-11 18:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Frederic Weisbecker, Arnd Bergmann,
	Ingo Molnar, linux-kernel, Tony Breeds

On Tue, 2010-05-11 at 11:06 -0700, Linus Torvalds wrote:
> 
> On Mon, 10 May 2010, Peter Zijlstra wrote:
> > 
> > As to the 2 jiffy spin timeout, I guess we should add a lockdep warning
> > for that, because anybody holding a mutex for longer than 2 jiffies and
> > not sleeping does need fixing anyway.
> 
> I really hate the jiffies thing, but looking at the optimistic spinning, I 
> do wonder about two things..
> 
> First - we check "need_resched()" only if owner is NULL. That sounds 
> wrong. If we need to reschedule, we need to stop spinning _regardless_ of 
> whether the owner may have been preempted before setting the owner field.

There is a second need_resched() in the inner spin loop in
kernel/sched.c:mutex_spin_on_owner().

> Second: we allow "owner" to change, and we'll continue spinning. This is 
> how you can end up spinning for a long time - not because anybody holds 
> the mutex for longer than 2 jiffies, but because a lot of other threads 
> _together_ hold the mutex for longer than 2 jiffies.

Granted.

> Now, I think we do want some limited "continue spinning even if somebody 
> else ended up getting it instead", but I think we should at least limit 
> it. Otherwise we end up being potentially rather unfair, since we don't 
> have any fair queueing logic for the optimistic spinning phase.
> 
> Now, we could just count the number of times "owner" has changed, and I 
> suspect that would be sufficient. Now, that trivial counting scheme would
> fail if "owner" stays the same (ie the same process re-takes the lock over 
> and over again, possibly due to hot cacheline things being very unfair 
> to the person who already owns it), but quite frankly, I don't think we 
> can get into that kind of situation. 
> 
> Why? Mutexes may end up being very heavily contended, but they can't be 
> contended by just _one_ thread. So if we're really in a starvation issue, 
> the thread that is waiting _will_ see multiple different owners.
> 
> So once you have seen X number of other owners, you just say "screw it, 
> this spinning thing isn't working for me, I'll go to the sleeping case".

Right, so basically count the number of mutex_spin_on_owner() calls and
bail when >N.

> Of course, it's quite possible that as long as "need_resched()" isn't set, 
> spinning really _is_ the right thing to do. Maybe it causes horrible CPU 
> load on some odd "everybody synchronize" loads, but maybe that really is 
> the best we can do.

Ben's argument was that spinning for a long time wrecks power usage.

That said, I'd still like a counter/event/warning to see if someone
actually manages to hold onto a mutex for long (2 jiffies) without
scheduling at all. If we ever run into something like that, that needs
to get fixed regardless.





* Re: [PATCH/RFC] mutex: Fix optimistic spinning vs. BKL
  2010-05-11 18:06                 ` Linus Torvalds
  2010-05-11 18:19                   ` Peter Zijlstra
@ 2010-05-11 21:13                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2010-05-11 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Frederic Weisbecker, Arnd Bergmann, Ingo Molnar,
	linux-kernel, Tony Breeds

On Tue, 2010-05-11 at 11:06 -0700, Linus Torvalds wrote:
> 
> On Mon, 10 May 2010, Peter Zijlstra wrote:
> > 
> > As to the 2 jiffy spin timeout, I guess we should add a lockdep warning
> > for that, because anybody holding a mutex for longer than 2 jiffies and
> > not sleeping does need fixing anyway.
> 
> I really hate the jiffies thing, but looking at the optimistic spinning, I 
> do wonder about two things..
> 
> First - we check "need_resched()" only if owner is NULL. That sounds 
> wrong. If we need to reschedule, we need to stop spinning _regardless_ of 
> whether the owner may have been preempted before setting the owner field.

The inner call, mutex_spin_on_owner(), checks it regardless (yeah, I
tripped on that too initially).

 .../...

> Now, we could just count the number of times "owner" has changed, and I 
> suspect that would be sufficient. Now, that trivial counting scheme would
> fail if "owner" stays the same (ie the same process re-takes the lock over 
> and over again, possibly due to hot cacheline things being very unfair 
> to the person who already owns it), but quite frankly, I don't think we 
> can get into that kind of situation. 

I've observed cases where owner didn't appear to change but we also
didn't trip on need_resched(), which leads me to wonder: if nobody is
contending to run on that processor, we may never set need_resched() and
end up spinning as long as the target is running.

This caused the deadlock with the BKL (which I fixed) but it also means
we may spend a long time without going to sleep, which means the CPU is not
dozed or returned to the hypervisor etc... for longer than needed.

That's the reasoning behind the jiffy timeout. It's not great but it
will get us out in some amount of time, which I wanted to guarantee.
  
 .../...

> Of course, it's quite possible that as long as "need_resched()" isn't set, 
> spinning really _is_ the right thing to do. Maybe it causes horrible CPU 
> load on some odd "everybody synchronize" loads, but maybe that really is 
> the best we can do.

Well, that happens at least occasionally but if you think that can be
safely ignored, then drop the jiffy timeout.

Cheers,
Ben.

> 			Linus




* Re: [tip:core/locking] mutex: Fix optimistic spinning vs. BKL
  2010-05-11 15:43       ` [tip:core/locking] " tip-bot for Tony Breeds
@ 2010-05-11 23:05         ` Tony Breeds
  2010-05-18 16:08         ` Ingo Molnar
  1 sibling, 0 replies; 20+ messages in thread
From: Tony Breeds @ 2010-05-11 23:05 UTC (permalink / raw)
  To: Linux Kernel ML; +Cc: hpa, mingo, a.p.zijlstra, benh, stable, tglx, mingo

On Tue, May 11, 2010 at 03:43:01PM +0000, tip-bot for Tony Breeds wrote:
> Commit-ID:  227945799cc10d77c6ef812f3eb8a61a78689454
> Gitweb:     http://git.kernel.org/tip/227945799cc10d77c6ef812f3eb8a61a78689454
> Author:     Tony Breeds <tony@bakeyournoodle.com>

For the record, this is Ben's patch, not mine; I just used time_before() to
address feedback.

Thanks for taking it though.

Yours Tony


* Re: [tip:core/locking] mutex: Fix optimistic spinning vs. BKL
  2010-05-11 15:43       ` [tip:core/locking] " tip-bot for Tony Breeds
  2010-05-11 23:05         ` Tony Breeds
@ 2010-05-18 16:08         ` Ingo Molnar
  2010-05-18 16:26           ` Linus Torvalds
  2010-05-19  5:46           ` Tony Breeds
  1 sibling, 2 replies; 20+ messages in thread
From: Ingo Molnar @ 2010-05-18 16:08 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, a.p.zijlstra, benh, stable, tglx, tony,
	Linus Torvalds
  Cc: linux-tip-commits


* tip-bot for Tony Breeds <tony@bakeyournoodle.com> wrote:

> Commit-ID:  227945799cc10d77c6ef812f3eb8a61a78689454
> Gitweb:     http://git.kernel.org/tip/227945799cc10d77c6ef812f3eb8a61a78689454
> Author:     Tony Breeds <tony@bakeyournoodle.com>
> AuthorDate: Fri, 7 May 2010 14:20:10 +1000
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Tue, 11 May 2010 17:07:24 +0200
> 
> mutex: Fix optimistic spinning vs. BKL

Tony, mind sending a version of this patch that does not
include a jiffies-based spinning loop?

Thanks,

	Ingo


* Re: [tip:core/locking] mutex: Fix optimistic spinning vs. BKL
  2010-05-18 16:08         ` Ingo Molnar
@ 2010-05-18 16:26           ` Linus Torvalds
  2010-05-19  5:46           ` Tony Breeds
  1 sibling, 0 replies; 20+ messages in thread
From: Linus Torvalds @ 2010-05-18 16:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: mingo, hpa, linux-kernel, a.p.zijlstra, benh, stable, tglx, tony,
	linux-tip-commits



On Tue, 18 May 2010, Ingo Molnar wrote:
>
> Tony, mind sending a version of this patch that does not 
> include a jiffies based spinning loop?

I think it should just be the two-liner

	if (current->lock_depth >= 0)
		break;

in that loop. Plus comment.

		Linus


* Re: [tip:core/locking] mutex: Fix optimistic spinning vs. BKL
  2010-05-18 16:08         ` Ingo Molnar
  2010-05-18 16:26           ` Linus Torvalds
@ 2010-05-19  5:46           ` Tony Breeds
  2010-05-19  7:56             ` [tip:core/urgent] " tip-bot for Tony Breeds
  1 sibling, 1 reply; 20+ messages in thread
From: Tony Breeds @ 2010-05-19  5:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: mingo, hpa, linux-kernel, a.p.zijlstra, benh, stable, tglx,
	Linus Torvalds, linux-tip-commits

On Tue, May 18, 2010 at 06:08:38PM +0200, Ingo Molnar wrote:
> 
> * tip-bot for Tony Breeds <tony@bakeyournoodle.com> wrote:
> 
> > Commit-ID:  227945799cc10d77c6ef812f3eb8a61a78689454
> > Gitweb:     http://git.kernel.org/tip/227945799cc10d77c6ef812f3eb8a61a78689454
> > Author:     Tony Breeds <tony@bakeyournoodle.com>
> > AuthorDate: Fri, 7 May 2010 14:20:10 +1000
> > Committer:  Ingo Molnar <mingo@elte.hu>
> > CommitDate: Tue, 11 May 2010 17:07:24 +0200
> > 
> > mutex: Fix optimistic spinning vs. BKL
> 
> Tony, mind sending a version of this patch that does not 
> include a jiffies based spinning loop?

Subject: [PATCH] mutex: Fix optimistic spinning vs. BKL

Currently, we can hit a nasty case with optimistic spinning on
mutexes:

    CPU A tries to take a mutex, while holding the BKL

    CPU B tries to take the BKL while holding the mutex

This looks like an AB-BA scenario but in practice, it is allowed and
happens due to the auto-release-on-schedule nature of the BKL.

In that case, the optimistic spinning code can get us into a situation
where instead of going to sleep, A will spin waiting for B who is
spinning waiting for A, and the only way out of that loop is the
need_resched() test in mutex_spin_on_owner().

This patch fixes both in a rather crude way. I completely disable
spinning if we own the BKL, and I add a safety timeout using jiffies
to fall back to sleeping if we end up spinning for more than 1 or 2
jiffies.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
---
 kernel/mutex.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/mutex.c b/kernel/mutex.c
index 632f04c..c38d302 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -172,6 +172,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		struct thread_info *owner;
 
 		/*
+		 * If we own the BKL, then don't spin. The owner of the mutex
+		 * might be waiting on us to release the BKL.
+		 */
+		if (current->lock_depth >= 0)
+			break;
+
+		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */
-- 
1.6.6.1


Yours Tony


* [tip:core/urgent] mutex: Fix optimistic spinning vs. BKL
  2010-05-19  5:46           ` Tony Breeds
@ 2010-05-19  7:56             ` tip-bot for Tony Breeds
  0 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Tony Breeds @ 2010-05-19  7:56 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, torvalds, benh, stable,
	tglx, mingo, tony

Commit-ID:  fd6be105b883244127a734ac9f14ae94a022dcc0
Gitweb:     http://git.kernel.org/tip/fd6be105b883244127a734ac9f14ae94a022dcc0
Author:     Tony Breeds <tony@bakeyournoodle.com>
AuthorDate: Wed, 19 May 2010 15:46:36 +1000
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 19 May 2010 08:18:44 +0200

mutex: Fix optimistic spinning vs. BKL

Currently, we can hit a nasty case with optimistic
spinning on mutexes:

    CPU A tries to take a mutex, while holding the BKL

    CPU B tries to take the BKL while holding the mutex

This looks like an AB-BA scenario but in practice, it is
allowed and happens due to the auto-release on
schedule() nature of the BKL.

In that case, the optimistic spinning code can get us
into a situation where instead of going to sleep, A
will spin waiting for B who is spinning waiting for
A, and the only way out of that loop is the
need_resched() test in mutex_spin_on_owner().

This patch fixes it by completely disabling spinning
if we own the BKL. This adds one more detail to the
extensive list of reasons why it's a bad idea for
kernel code to be holding the BKL.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <stable@kernel.org>
LKML-Reference: <20100519054636.GC12389@ozlabs.org>
[ added an unlikely() attribute to the branch ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/mutex.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/kernel/mutex.c b/kernel/mutex.c
index 632f04c..4c0b7b3 100644
--- a/kernel/mutex.c
+++ b/kernel/mutex.c
@@ -172,6 +172,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
 		struct thread_info *owner;
 
 		/*
+		 * If we own the BKL, then don't spin. The owner of
+		 * the mutex might be waiting on us to release the BKL.
+		 */
+		if (unlikely(current->lock_depth >= 0))
+			break;
+
+		/*
 		 * If there's an owner, wait for it to either
 		 * release the lock or go to sleep.
 		 */

