linux-mips.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Tom Li <tomli@tomli.me>
To: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: paulmck@linux.ibm.com, ak@linux.intel.com, bp@alien8.de,
	hpa@zytor.com, linux-kernel@vger.kernel.org, mingo@redhat.com,
	rjw@rjwysocki.net, ville.syrjala@linux.intel.com,
	viresh.kumar@linaro.org, tomli@tomli.me,
	linux-mips@vger.kernel.org
Subject: Re: [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds")
Date: Tue, 5 Feb 2019 13:07:00 +0800	[thread overview]
Message-ID: <20190205050700.GA31571@localhost.localdomain> (raw)
In-Reply-To: <20181126220122.GA6345@linux.ibm.com>

On Tue, Nov 13, 2018 at 03:54:53PM +0200, Ville Syrjälä wrote:
> Hi Paul,
> After 4.20-rc1 some of my 32bit UP machines no longer reboot/shutdown.
> I bisected this down to commit 45975c7d21a1 ("rcu: Define RCU-sched
> API in terms of RCU for Tree RCU PREEMPT builds").
> 
> I traced the hang into
> -> cpufreq_suspend()
>  -> cpufreq_stop_governor()
>   -> cpufreq_dbs_governor_stop()
>    -> gov_clear_update_util()
>     -> synchronize_sched()
>      -> synchronize_rcu()
>
> Only PREEMPT=y is affected for obvious reasons, but that couldn't
> explain why the same UP kernel booted on an SMP machine worked fine.
> Eventually I realized that the difference between working and
> non-working machine was IOAPIC vs. PIC. With initcall_debug I saw
> that we mask everything in the PIC before cpufreq is shut down,

Hello Paul & Ville.

I'm not a kernel hacker, just a n00b playing with various non-x86
systems, but I've been forced getting into kernel hacking due to
the same issue.

Since February, I'm working on porting some trivial out-of-tree
drivers to the upstream, and noticed my Yeeloong 8089D, a machine running
the Loongson 2F 64-bit MIPS-III processor, will completely hang during
reboot or shutdown. I tried bisecting it for 24 hours without sleep, but
the attempt had failed.

I managed to narrow it in-between 4.19 and 4.20, most unusual thing I've
observed was its probabilistic nature. The chance of triggering it seems
getting progressively higher since 4.20, making pinpointing the specific
commit difficult, finally with a 100% chance since Linux 4.19.

I initially suspected a bug in the platform driver, later I also tried
to disable various kernel hardening options, all without success. At this
point I've realized, because it has shown to be probabilistic, it must be
a race condition, and since it's an uniprocessor system, it must be the
CPU scheduler getting preempted somehow. Disabling CONFIG_PREEMPT makes
the issue go away. I tried to add various preempt_disable() in the platform
driver but didn't work.

Finally I've hooked up a netconsole and started adding printk()s.

[   23.652000] loongson2_cpufreq loongson2_cpufreq: shutdown
[   23.656000] loongson-gpio loongson-gpio: shutdown
[   23.660000] migrate_to_reboot_cpu()
[   23.664000] syscore_shutdown()
[   23.668000] PM: Calling i8259A_shutdown+0x0/0xa8
[   23.672000] PM: Calling irq_gc_shutdown+0x0/0x88
[   23.672000] PM: Calling cpufreq_suspend+0x0/0x1a0
[   23.672000] cpufreq_suspend()
[   23.672000] cpufreq_suspend: Suspending Governors
[   23.672000] cpufreq_stop_governor()
[   23.672000] cpufreq_stop_governor: for CPU 0

Looks like something in the core cpufreq code? So I tried searching
"cpufreq_stop_governor()" at LKML... Oops!

So it must be the same issue. To summary, the issue exists on all Linux
kernels since 4.20-rc1, and the chance of triggering it, is now 100%.

What is the current progress of purposed solutions? If a complete solution
is still work-in-progress, could we simply submit a hotfix into the linux-stable
trees, so at least the issue can be temporarily solved? What can I do to help
testing?

Thanks!
Tom Li

       reply	other threads:[~2019-02-05  5:13 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20181113135453.GW9144@intel.com>
     [not found] ` <20181113151037.GG4170@linux.ibm.com>
     [not found]   ` <20181114202013.GA27603@linux.ibm.com>
     [not found]     ` <20181126220122.GA6345@linux.ibm.com>
2019-02-05  5:07       ` Tom Li [this message]
2019-02-05  9:58         ` [REGRESSION 4.20-rc1] 45975c7d21a1 ("rcu: Define RCU-sched API in terms of RCU for Tree RCU PREEMPT builds") Aaro Koskinen
2019-02-05 13:05           ` Tom Li
2019-02-05 18:59             ` Aaro Koskinen
2019-02-06  1:22               ` tomli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190205050700.GA31571@localhost.localdomain \
    --to=tomli@tomli.me \
    --cc=ak@linux.intel.com \
    --cc=bp@alien8.de \
    --cc=hpa@zytor.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=paulmck@linux.ibm.com \
    --cc=rjw@rjwysocki.net \
    --cc=ville.syrjala@linux.intel.com \
    --cc=viresh.kumar@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).