From: Michael Ellerman <mpe@ellerman.id.au>
To: paulmck@kernel.org, Zhouyi Zhou <zhouzhouyi@gmail.com>
Cc: rcu <rcu@vger.kernel.org>,
	Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Nicholas Piggin <npiggin@gmail.com>
Subject: Re: rcu_sched self-detected stall on CPU
Date: Sat, 09 Apr 2022 00:42:39 +1000
Message-ID: <87k0bz7i1s.fsf@mpe.ellerman.id.au>
In-Reply-To: <87pmls6nt7.fsf@mpe.ellerman.id.au>

Michael Ellerman <mpe@ellerman.id.au> writes:
> "Paul E. McKenney" <paulmck@kernel.org> writes:
>> On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
>>> Hi
>>> 
>>> I can reproduce it in a ppc virtual cloud server provided by Oregon
>>> State University.  Following is what I do:
>>> 1) curl -l https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
>>> -o linux-5.18-rc1.tar.gz
>>> 2) tar zxf linux-5.18-rc1.tar.gz
>>> 3) cp config linux-5.18-rc1/.config
>>> 4) cd linux-5.18-rc1
>>> 5) make vmlinux -j 8
>>> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
>>> -smp 2 (QEMU 4.2.1)
>>> 7) after 12 rounds, the bug got reproduced:
>>> (http://154.223.142.244/logs/20220406/qemu.log.txt)
>>
>> Just to make sure, are you both seeing the same thing?  Last I knew,
>> Zhouyi was chasing an RCU-tasks issue that appears only in kernels
>> built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
>> I miss something?
>>
>> Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
>> kthread slept for three milliseconds, but did not wake up for more than
>> 20 seconds.  This kthread would normally have awakened on CPU 1, but
>> CPU 1 looks to me to be very unhealthy, as can be seen in your console
>> output below (but maybe my idea of what is healthy for powerpc systems
>> is outdated).  Please see also the inline annotations.
>>
>> Thoughts from the PPC guys?
>
> I haven't seen it in my testing. But using Miguel's config I can
> reproduce it seemingly on every boot.
>
> For me it bisects to:
>
>   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
>
> Which seems plausible.
>
> Reverting that on mainline makes the bug go away.
>
> I don't see an obvious bug in the diff, but I could be wrong, or the old
> code was papering over an existing bug?
>
> I'll try and work out what it is about Miguel's config that exposes
> this vs our defconfig, that might give us a clue.

It's CONFIG_HIGH_RES_TIMERS=n which triggers the stall.

I can reproduce just with:

  $ make ppc64le_guest_defconfig
  $ ./scripts/config -d HIGH_RES_TIMERS
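
To double-check that the option really ended up off in the resulting
.config, something like the following could be used. This is only a
minimal sketch: the helper name hrt_disabled is hypothetical (it is not
part of the kernel tree), and it assumes the usual Kconfig convention of
recording a disabled bool option as a "# CONFIG_... is not set" comment.
Running "make olddefconfig" after scripts/config is probably also a good
idea, so that dependent options get resolved before building.

```shell
#!/bin/sh
# Hypothetical helper (not part of the kernel tree): exits 0 if the given
# config file has HIGH_RES_TIMERS disabled, non-zero otherwise.
hrt_disabled() {
    # Default to ./.config when no path is given.
    config_file="${1:-.config}"
    # Kconfig writes a disabled bool option as a "... is not set" comment.
    grep -q '^# CONFIG_HIGH_RES_TIMERS is not set$' "$config_file"
}
```

For example, from the top of the kernel tree:

  $ hrt_disabled .config && echo "HIGH_RES_TIMERS is off"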

We have no defconfigs that disable HIGH_RES_TIMERS; I didn't even
realise you could disable it, TBH :)

The Rust CI has it disabled because I copied that from the x86 defconfig
they were using back when I added the Rust support. I think that was
meant to be a stripped-down, fast config for CI, but the result is that
it's using a badly tested combination, which is not helpful.

So I'll send a patch to turn HIGH_RES_TIMERS on for the Rust CI, and we
can debug this further without blocking them.

cheers
