Re: Unstable tsc caused soft lockup in kdump kernel

From: Baoquan He <bhe@redhat.com>
To: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
Cc: jstultz@google.com, tglx@linutronix.de, sboyd@kernel.org,
	linux-kernel@vger.kernel.org, x86@kernel.org,
	kexec@lists.infradead.org
Subject: Re: Unstable tsc caused soft lockup in kdump kernel
Date: Fri, 15 Jul 2022 09:33:13 +0800	[thread overview]
Message-ID: <YtDD2WHLiFbceXuE@MiWiFi-R3L-srv> (raw)
In-Reply-To: <bf57256f-127d-6f26-404a-b9cff6df70b3@igalia.com>

On 07/14/22 at 04:34pm, Guilherme G. Piccoli wrote:
> On 29/06/2022 07:25, Baoquan He wrote:
> > Hi,
> > 
> > On a HP machine, after crash triggered via sysrq-c, kdump kernel will
> > boot and get soft lockup as below. And this can be always reproduced.
> > 
> > From log, it seems that unreliable tsc was marked as unstable in
> > clocksource_watchdog, then worker sched_clock_work was scheduled. And
> > this tsc unstable marking always happened after sysrq-c is triggered.
> > And the cpu where worker smp_call_function_many_cond is running won't
> > be stopped and hang there and keep locks, even though the cpu should be
> > stopped. While kdump kernel is running in a different cpu and boot, then
> > soft lockup happened because other workers or the relevant threads are
> > waiting for locks taken by the hang sched_clock_work worker.
> > 
> > Any idea or suggestion?
> > 
> > The normal kernel boot log and kdump kernel log, kernel config, are all
> > attached, please check.
> > 
> 
> Hi Baoquan, interesting issue! Do you happen to have a full kdump boot
> log with the issue? Maybe collected through serial console, etc.
> It seems the one attached is from a succeeding kdump by passing
> "tsc=unstable" to the kdump kernel right?

The attached kdump boot log is the one in which kdump kernel is hang.
The 'tsc=unstable' need be added to 1st kernel to work around it. Only
adding 'tsc=unstable' into kdump kernel doesn't change anything since
the clocksouce_watchdog work has been in a loop because of the unstable
tsc in 1st kernel.

> 
> Also, did you try to "forbid" tsc to get marked as unstable in the first
> kernel, before kdump? I mean like a hack code change, just prevent
> kernel doing that and seeing if it works. If that still fails, then it
> seems the cause of the issue is the same as the cause of TSC getting
> unstable - in other words, something would be causing both the kdump
> kernel lockup AND the TSC unstable marking in the first kernel...

As I added later that adding 'tsc=unstable' into 1st kernel's cmdline can work
around the issue. kdump works well with that.