linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* mutex_lock issues during poweroff
@ 2017-09-07 12:16 Maxime Ripard
  2017-09-07 12:40 ` Peter Zijlstra
  0 siblings, 1 reply; 4+ messages in thread
From: Maxime Ripard @ 2017-09-07 12:16 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Thomas Petazzoni, Quentin Schulz, Mylene Josserand

[-- Attachment #1: Type: text/plain, Size: 2590 bytes --]

Hi,

We've been investigating a bug on our kernel for the last couple
monthes.

The scenario is this: we have an ARM board that embed an Allwinner A33
SoC. That board is using a PMIC connected to the SoC through a
proprietary bus, whose driver is in drivers/bus/sunxi-rsb.c. The
poweroff is implemented by sending a shutdown command to that PMIC.

http://elixir.free-electrons.com/linux/v4.9.47/source/drivers/mfd/axp20x.c#L743

That PMIC also serves other purposes, such as controlling the
regulators, but we also use it to get the various power supplies
state, and report them through our power supplies driver.

http://elixir.free-electrons.com/linux/v4.9.47/source/drivers/power/supply/axp20x_usb_power.c
http://elixir.free-electrons.com/linux/v4.12.11/source/drivers/power/supply/axp20x_ac_power.c
http://elixir.free-electrons.com/linux/v4.12.11/source/drivers/power/supply/axp20x_battery.c

The bug arises when we have those drivers enabled on a kernel 4.9.47
(or any 4.9 kernel. 4.8 also happens to show this). In some cases (1
out of 200-300 poweroff), the board will not poweroff. After digging
through this, it turns out that in such scenario, the mutex_lock we
have in the bus driver never returns.

Here: http://elixir.free-electrons.com/linux/v4.9.47/source/drivers/bus/sunxi-rsb.c#L379

Which means that we will never actually send the command, which also
explains why it powered on.

This gets weirder, since if we dump the return code of mutex_is_locked
right before a failing case, the mutex isn't already locked, so we
should not block or sleep at all.

If we disable the power supplies driver that poll the PMIC status on a
regular basis, it works, however we've never actually seen a
concurrent usage of that bus. In our practical cases, the mutex is
always unlocked.

If we remove the mutex_lock / _unlock entirely, we don't stall anymore
either, which seems to confirm something weird going on here.

One thing worth noting is that we couldn't reproduce the issue with a
4.13. We can't bisect really easily due to the amount of patches that
we still have on 4.9 and have all been merged since, but it seems like
the bug was fixed (either on purpose or as a side effect), and was
never sent to stable. Looking at the history of kernel/locking/mutex.c
during that window didn't really show anything obvious though.

If you have any ideas or spot something very wrong, I'd be happy to
hear about. Thanks!

Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: mutex_lock issues during poweroff
  2017-09-07 12:16 mutex_lock issues during poweroff Maxime Ripard
@ 2017-09-07 12:40 ` Peter Zijlstra
  2017-09-08 14:43   ` Maxime Ripard
  0 siblings, 1 reply; 4+ messages in thread
From: Peter Zijlstra @ 2017-09-07 12:40 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: Ingo Molnar, linux-kernel, Thomas Petazzoni, Quentin Schulz,
	Mylene Josserand

On Thu, Sep 07, 2017 at 02:16:19PM +0200, Maxime Ripard wrote:
> One thing worth noting is that we couldn't reproduce the issue with a
> 4.13. We can't bisect really easily due to the amount of patches that
> we still have on 4.9 and have all been merged since, but it seems like
> the bug was fixed (either on purpose or as a side effect), and was
> never sent to stable. Looking at the history of kernel/locking/mutex.c
> during that window didn't really show anything obvious though.
> 
> If you have any ideas or spot something very wrong, I'd be happy to
> hear about. Thanks!

Well, we did a _complete_ rewrite of the mutex primitive in v4.10-rc1.

Part of the reason for that rewrite was fixing a starvation case, but
for that you'd need to actually have contending usage, which you claim
not to have.

Aside from that I really can't remember any specific issues with the old
code (4.9 is such a long time ago). You could try to disable the
optimistic spinning code, see if that helps.

You did also say you were running on an ARM64, there were a few memory
ordering fixes like for example commit:

  50972fe78f24 ("locking/osq_lock: Fix osq_lock queue corruption")

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: mutex_lock issues during poweroff
  2017-09-07 12:40 ` Peter Zijlstra
@ 2017-09-08 14:43   ` Maxime Ripard
  2017-09-08 16:11     ` Peter Zijlstra
  0 siblings, 1 reply; 4+ messages in thread
From: Maxime Ripard @ 2017-09-08 14:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Thomas Petazzoni, Quentin Schulz,
	Mylene Josserand

[-- Attachment #1: Type: text/plain, Size: 1906 bytes --]

Hi Peter,

Thanks for your answer.

On Thu, Sep 07, 2017 at 02:40:33PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 07, 2017 at 02:16:19PM +0200, Maxime Ripard wrote:
> > One thing worth noting is that we couldn't reproduce the issue with a
> > 4.13. We can't bisect really easily due to the amount of patches that
> > we still have on 4.9 and have all been merged since, but it seems like
> > the bug was fixed (either on purpose or as a side effect), and was
> > never sent to stable. Looking at the history of kernel/locking/mutex.c
> > during that window didn't really show anything obvious though.
> > 
> > If you have any ideas or spot something very wrong, I'd be happy to
> > hear about. Thanks!
> 
> Well, we did a _complete_ rewrite of the mutex primitive in v4.10-rc1.

Ok.

What commit happened to be the rewrite? 9d659ae14b54 ("locking/mutex:
Add lock handoff to avoid starvation") ? We backported this one and
3ca0ff571b09 ("locking/mutex: Rework mutex::owner"), and still can
reproduce the issue. Is there any other?

> Part of the reason for that rewrite was fixing a starvation case, but
> for that you'd need to actually have contending usage, which you claim
> not to have.

Yeah, we're close to the opposite case :)

> Aside from that I really can't remember any specific issues with the old
> code (4.9 is such a long time ago). You could try to disable the
> optimistic spinning code, see if that helps.
> 
> You did also say you were running on an ARM64, there were a few memory
> ordering fixes like for example commit:
> 
>   50972fe78f24 ("locking/osq_lock: Fix osq_lock queue corruption")

We're running on ARM, not ARM64 (they still are separate architectures
under arch/, unlike x86), but I'll look into them too.

Thanks!
Maxime

-- 
Maxime Ripard, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: mutex_lock issues during poweroff
  2017-09-08 14:43   ` Maxime Ripard
@ 2017-09-08 16:11     ` Peter Zijlstra
  0 siblings, 0 replies; 4+ messages in thread
From: Peter Zijlstra @ 2017-09-08 16:11 UTC (permalink / raw)
  To: Maxime Ripard
  Cc: Ingo Molnar, linux-kernel, Thomas Petazzoni, Quentin Schulz,
	Mylene Josserand

On Fri, Sep 08, 2017 at 04:43:07PM +0200, Maxime Ripard wrote:

> What commit happened to be the rewrite? 9d659ae14b54 ("locking/mutex:
> Add lock handoff to avoid starvation") ? We backported this one and
> 3ca0ff571b09 ("locking/mutex: Rework mutex::owner"), and still can
> reproduce the issue. Is there any other?

3ca0ff571b09 was the rewrite, 9d659ae14b54 fixed that starvation case.

With those you should be close to what mainline runs I think.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2017-09-08 16:11 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-07 12:16 mutex_lock issues during poweroff Maxime Ripard
2017-09-07 12:40 ` Peter Zijlstra
2017-09-08 14:43   ` Maxime Ripard
2017-09-08 16:11     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).