linux-kernel.vger.kernel.org archive mirror
* Workqueue regression
@ 2024-02-01 20:57 Konrad Dybcio
  2024-02-02  1:52 ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Konrad Dybcio @ 2024-02-01 20:57 UTC (permalink / raw)
  To: Tejun Heo, linux-kernel, Naohiro.Aota, kernel-team; +Cc: Bjorn Andersson

Hi,

So, commit "Implement system-wide nr_active enforcement for unbound workqueues"
broke *something* and now performing a suspend-wakeup cycle on a Qualcomm
SC8280XP-based (arm64) platform hangs when performing the resume tasks,
presumably somewhere near PCIe reinitialization (but that may be a red herring).

Reverting the commit (and the ones on top of it due to conflicts) fixes
the issue on next-20240130 and later (plus some out-of-tree patches that
are largely unrelated).

Not sure where to start looking.

Konrad

* Re: Workqueue regression
  2024-02-01 20:57 Workqueue regression Konrad Dybcio
@ 2024-02-02  1:52 ` Tejun Heo
  2024-02-02 12:31   ` Konrad Dybcio
  0 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2024-02-02  1:52 UTC (permalink / raw)
  To: Konrad Dybcio; +Cc: linux-kernel, Naohiro.Aota, kernel-team, Bjorn Andersson

Hello,

On Thu, Feb 01, 2024 at 09:57:59PM +0100, Konrad Dybcio wrote:
> So, commit "Implement system-wide nr_active enforcement for unbound workqueues"
> broke *something* and now performing a suspend-wakeup cycle on a Qualcomm
> SC8280XP-based (arm64) platform hangs when performing the resume tasks,
> presumably somewhere near PCIe reinitialization (but that may be a red herring).
> 
> Reverting the commit (and the ones on top of it due to conflicts) fixes
> the issue on next-20240130 and later (plus some out-of-tree patches that
> are largely unrelated).
> 
> Not sure where to start looking.

Hmm... sorry about that. Can you please boot with `no_console_suspend` and
retry? Once the system gets stuck, you can wait several minutes until the
workqueue watchdog triggers and dumps the state or, if you can, trigger
`sysrq-t`, which includes a workqueue state dump at the end.
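
As a rough sketch of the manual trigger, assuming a root shell is still
responsive (e.g. over the serial console): writing `t` to
/proc/sysrq-trigger requests the same dump as the keyboard sysrq-t. The
function name and the PROC_DIR override are hypothetical and exist only
so the sequence can be dry-run against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: request the sysrq-t task dump from a shell.
# PROC_DIR is an assumption for dry runs; on a real system it is /proc.
PROC_DIR="${PROC_DIR:-/proc}"

dump_task_state() {
    # 't' asks the kernel to dump task states to the kernel log; per the
    # message above, the workqueue state follows at the end of that dump.
    echo t > "$PROC_DIR/sysrq-trigger"
}
```

Calling `dump_task_state` as root and then reading the kernel log (on the
console or via `dmesg`) should show the dump, provided the system is still
alive enough to run the command at all.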

If the system doesn't become live enough after suspend/resume cycle to get
more info, the following might help:

$ echo test_resume > /sys/power/disk
$ echo disk > /sys/power/state

That should walk most of the hibernation/wakeup path, which is pretty similar
to the suspend/resume path, without touching the system power state.

Thanks.

-- 
tejun

* Re: Workqueue regression
  2024-02-02  1:52 ` Tejun Heo
@ 2024-02-02 12:31   ` Konrad Dybcio
  2024-02-02 18:40     ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Konrad Dybcio @ 2024-02-02 12:31 UTC (permalink / raw)
  To: Tejun Heo, Konrad Dybcio
  Cc: linux-kernel, Naohiro.Aota, kernel-team, Bjorn Andersson

On 2.02.2024 02:52, Tejun Heo wrote:
> Hello,
> 
> On Thu, Feb 01, 2024 at 09:57:59PM +0100, Konrad Dybcio wrote:
>> So, commit "Implement system-wide nr_active enforcement for unbound workqueues"
>> broke *something* and now performing a suspend-wakeup cycle on a Qualcomm
>> SC8280XP-based (arm64) platform hangs when performing the resume tasks,
>> presumably somewhere near PCIe reinitialization (but that may be a red herring).
>>
>> Reverting the commit (and the ones on top of it due to conflicts) fixes
>> the issue on next-20240130 and later (plus some out-of-tree patches that
>> are largely unrelated).
>>
>> Not sure where to start looking.
> 
> Hmm... sorry about that. Can you please boot with `no_console_suspend` and
> retry? Once the system gets stuck, you can wait for several minutes till the
> workqueue watchdog triggers and dumps the state or, if you can, trigger
> `sysrq-t` which has workqueue state dump at the end.
> 
> If the system doesn't become live enough after suspend/resume cycle to get
> more info, the following might help:

Looks like it's too far gone indeed..

> 
> $ echo test_resume > /sys/power/disk
> $ echo disk > /sys/power/state

Sadly, hibernation is not a thing on this platform.. Without going into much
detail of how messy the power management stuff is, you can either have
"on", "off" or "power collapsed" (bound to s2idle).. Trying to trigger this
sequence makes the thing lock up and die due to unclocked accesses with or
without the WQ regression.

Konrad

* Re: Workqueue regression
  2024-02-02 12:31   ` Konrad Dybcio
@ 2024-02-02 18:40     ` Tejun Heo
  2024-02-04 21:19       ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2024-02-02 18:40 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Konrad Dybcio, linux-kernel, Naohiro.Aota, kernel-team, Bjorn Andersson

Hello,

On Fri, Feb 02, 2024 at 01:31:01PM +0100, Konrad Dybcio wrote:
> > If the system doesn't become live enough after suspend/resume cycle to get
> > more info, the following might help:
> 
> Looks like it's too far gone indeed..
> 
> > 
> > $ echo test_resume > /sys/power/disk
> > $ echo disk > /sys/power/state
> 
> Sadly, hibernation is not a thing on this platform.. Without going into much
> detail of how messy the power management stuff is, you can either have
> "on", "off" or "power collapsed" (bound to s2idle).. Trying to trigger this
> sequence makes the thing lock up and die due to unclocked accesses with or
> without the WQ regression.

I see. If you enable CONFIG_PM_DEBUG, CONFIG_PM_ADVANCED_DEBUG and
CONFIG_PM_SLEEP_DEBUG, there will be a /sys/power/pm_test file which lets you
select the stage at which suspend aborts. Can you please play with it and see
whether you can reproduce the issue while keeping the console output?
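
A minimal sketch of such a test cycle, assuming the platform suspends via
s2idle (the `freeze` state in /sys/power/state) and that the pm_test file
is present. The function name and the POWER_DIR override are hypothetical
and exist only so the sequence can be dry-run against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: run one suspend attempt that aborts at a chosen stage.
# POWER_DIR is an assumption for dry runs; on a real system it is /sys/power.
POWER_DIR="${POWER_DIR:-/sys/power}"

pm_test_cycle() {
    level="$1"   # a pm_test level, e.g. freezer, devices or platform
    # Select the stage at which the suspend attempt should stop.
    echo "$level" > "$POWER_DIR/pm_test" || return 1
    # Start suspend-to-idle; in test mode the kernel is expected to stop at
    # the selected stage and then resume on its own.
    echo freeze > "$POWER_DIR/state"
    status=$?
    # Switch test mode off again so later suspends behave normally.
    echo none > "$POWER_DIR/pm_test"
    return "$status"
}
```

Running `pm_test_cycle freezer` first, then `devices`, then `platform` (as
root) goes progressively deeper, which may help narrow down the stage at
which the hang first appears.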

Can you also make sure that the system is actually dead, not just the
console? e.g. by pinging from network?

Thanks.

-- 
tejun

* Re: Workqueue regression
  2024-02-02 18:40     ` Tejun Heo
@ 2024-02-04 21:19       ` Tejun Heo
  2024-02-05 11:43         ` Konrad Dybcio
  0 siblings, 1 reply; 6+ messages in thread
From: Tejun Heo @ 2024-02-04 21:19 UTC (permalink / raw)
  To: Konrad Dybcio
  Cc: Konrad Dybcio, linux-kernel, Naohiro.Aota, kernel-team, Bjorn Andersson

Hello,

A bug that could easily stall flush_workqueue() was just fixed
(http://lkml.kernel.org/r/Zb_-LQLY7eRuakfe@slm.duckdns.org). Can you
please check whether the patch fixes the suspend problem?

Thanks.

-- 
tejun

* Re: Workqueue regression
  2024-02-04 21:19       ` Tejun Heo
@ 2024-02-05 11:43         ` Konrad Dybcio
  0 siblings, 0 replies; 6+ messages in thread
From: Konrad Dybcio @ 2024-02-05 11:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Konrad Dybcio, linux-kernel, Naohiro.Aota, kernel-team, Bjorn Andersson

On 4.02.2024 22:19, Tejun Heo wrote:
> Hello,
> 
> There was a bug which could easily stall flush_workqueue() which just got
> fixed (http://lkml.kernel.org/r/Zb_-LQLY7eRuakfe@slm.duckdns.org). Can you
> please see whether the patch fixes the suspend problem? 

Thanks for the pointer!

Unfortunately, it doesn't seem to fix my issue :/
I'll try to look into it more in the coming days, though my calendar is
somewhat wavy..

Konrad
