linux-kernel.vger.kernel.org archive mirror
* Fwd: Kernel 6.5 hangs on shutdown
@ 2023-10-12  9:37 Bagas Sanjaya
  2023-10-13 12:05 ` [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown) Linux regression tracking (Thorsten Leemhuis)
  2023-10-16  8:46 ` Fwd: Kernel 6.5 hangs on shutdown Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 2 replies; 6+ messages in thread
From: Bagas Sanjaya @ 2023-10-12  9:37 UTC
  To: Linux Kernel Mailing List, Linux Regressions
  Cc: Linus Torvalds, Thomas Gleixner, Yanjun Yang

Hi,

I noticed a regression report on Bugzilla [1]. Quoting from it:

> I use a Dell OptiPlex 7050, and the kernel hangs when shutting down the computer.
> Similar symptoms have been reported on some forums, and all of them are using
> Dell computers:
> https://bbs.archlinux.org/viewtopic.php?pid=2124429
> https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
> https://forum.artixlinux.org/index.php/topic,5997.0.html
> 
> I tested various kernels, and this bug seems to be caused by commit 88afbb21d4b36fee6acaa167641f9f0fc122f01b.

See Bugzilla for the full thread.

Anyway, I'm adding this regression to be tracked by regzbot:

#regzbot introduced: 88afbb21d4b36f https://bugzilla.kernel.org/show_bug.cgi?id=217995
#regzbot title: x86 core fix pull causes shutdown hang on Dell OptiPlex 7050
#regzbot link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
#regzbot link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
#regzbot link: https://forum.artixlinux.org/index.php/topic,5997.0.html
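
Each `#regzbot` line above is a single directive: a command name, optionally followed by a colon and an argument. As a rough illustration of that shape only (this is not regzbot's actual parser; the helper name `parse_regzbot` and the regular expression are assumptions made for this sketch), such lines could be extracted from a mail body like this:

```python
import re

# Hypothetical helper for illustration; not part of regzbot itself.
# Matches "#regzbot <command>", optionally followed by ": <argument>".
_REGZBOT_LINE = re.compile(r"^#regzbot\s+([\w^-]+)(?::\s*(.*))?$")


def parse_regzbot(body: str) -> list[tuple[str, str]]:
    """Return (command, argument) pairs for every #regzbot line in body."""
    commands = []
    for line in body.splitlines():
        match = _REGZBOT_LINE.match(line.strip())
        if match:
            commands.append((match.group(1), (match.group(2) or "").strip()))
    return commands


mail = """\
#regzbot introduced: 88afbb21d4b36f https://bugzilla.kernel.org/show_bug.cgi?id=217995
#regzbot title: x86 core fix pull causes shutdown hang on Dell OptiPlex 7050
#regzbot link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
"""
for command, argument in parse_regzbot(mail):
    print(f"{command}: {argument}")
```

Lines that do not start with `#regzbot` are ignored, and a directive without an argument yields an empty string.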

Thanks.

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=217995

-- 
An old man doll... just what I always wanted! - Clara


* [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown)
  2023-10-12  9:37 Fwd: Kernel 6.5 hangs on shutdown Bagas Sanjaya
@ 2023-10-13 12:05 ` Linux regression tracking (Thorsten Leemhuis)
  2023-10-13 17:48   ` Linus Torvalds
  2023-10-16  8:46 ` Fwd: Kernel 6.5 hangs on shutdown Linux regression tracking #update (Thorsten Leemhuis)
  1 sibling, 1 reply; 6+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-10-13 12:05 UTC
  To: Thomas Gleixner
  Cc: Linus Torvalds, Yanjun Yang, Linux Kernel Mailing List,
	Linux Regressions, Bagas Sanjaya, Borislav Petkov (AMD),
	Ashok Raj, Ingo Molnar, Dave Hansen, the arch/x86 maintainers

[CCing x86 maintainers]

Hi Thomas!

On 12.10.23 11:37, Bagas Sanjaya wrote:
> 
> I noticed a regression report on Bugzilla [1]. Quoting from it:
>
>> I use a Dell OptiPlex 7050, and the kernel hangs when shutting down the
>> computer.
>> Similar symptoms have been reported on some forums, and all of them are using
>> Dell computers:
>> https://bbs.archlinux.org/viewtopic.php?pid=2124429
>> https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
>> https://forum.artixlinux.org/index.php/topic,5997.0.html

Another report: https://bugzilla.redhat.com/show_bug.cgi?id=2241279

From all those links it seems quite a lot of users with Dell machines
are affected by this problem.

>> I tested various kernels, and this bug seems to be caused by commit 88afbb21d4b36fee6acaa167641f9f0fc122f01b.

Thomas, turns out that bisection result was slightly wrong: a recheck
confirmed that the regression is actually caused by 45e34c8af58f23
("x86/smp: Put CPUs into INIT on shutdown if possible") [v6.5-rc1] of
yours. See https://bugzilla.kernel.org/show_bug.cgi?id=217995 for details.

Ciao, Thorsten

> Anyway, I'm adding this regression to be tracked by regzbot:
> [...]

#regzbot introduced: 45e34c8af58f
#regzbot link: https://bugzilla.redhat.com/show_bug.cgi?id=2241279


* Re: [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown)
  2023-10-13 12:05 ` [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown) Linux regression tracking (Thorsten Leemhuis)
@ 2023-10-13 17:48   ` Linus Torvalds
  2023-10-13 18:28     ` Ashok Raj
  2023-10-13 19:40     ` Thomas Gleixner
  0 siblings, 2 replies; 6+ messages in thread
From: Linus Torvalds @ 2023-10-13 17:48 UTC
  To: Linux regressions mailing list
  Cc: Thomas Gleixner, Yanjun Yang, Linux Kernel Mailing List,
	Bagas Sanjaya, Borislav Petkov (AMD),
	Ashok Raj, Ingo Molnar, Dave Hansen, the arch/x86 maintainers

On Fri, 13 Oct 2023 at 05:05, Linux regression tracking (Thorsten
Leemhuis) <regressions@leemhuis.info> wrote:
>
> Thomas, turns out that bisection result was slightly wrong: a recheck
> confirmed that the regression is actually caused by 45e34c8af58f23
> ("x86/smp: Put CPUs into INIT on shutdown if possible") [v6.5-rc1] of
> yours. See https://bugzilla.kernel.org/show_bug.cgi?id=217995 for details.

That commit does look pretty dangerous.

If *anything* is done through SMI after the code does that
smp_park_other_cpus_in_init() sequence, I wouldn't be surprised in the
least if the machine is hung.

That's made worse since it looks like the shutdown sequence isn't
necessarily run on the boot CPU, so the boot CPU itself may be in
INIT, and any SMI quite possibly ends up treating that CPU specially.

Who knows what SMI does, but the fact that the affected machines seem
to be mainly from one particular manufacturer does tend to imply it's
something like that.

And the code does do a fair amount *after* shutting down cpu's. Not
just things like calling x86_platform.iommu_shutdown(), but also
things like possibly the tboot shutdown sequence (which almost
*certainly* is some SMI thing).

I dunno. Thomas - I think the argument for that commit was fairly
theoretical, and reverting it seems the obvious thing, unless you have
some idea of what might be wrong.

               Linus


* Re: [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown)
  2023-10-13 17:48   ` Linus Torvalds
@ 2023-10-13 18:28     ` Ashok Raj
  2023-10-13 19:40     ` Thomas Gleixner
  1 sibling, 0 replies; 6+ messages in thread
From: Ashok Raj @ 2023-10-13 18:28 UTC
  To: Linus Torvalds
  Cc: Linux regressions mailing list, Thomas Gleixner, Yanjun Yang,
	Linux Kernel Mailing List, Bagas Sanjaya, Borislav Petkov (AMD),
	Ingo Molnar, Dave Hansen, the arch/x86 maintainers, Ashok Raj

Hi

On Fri, Oct 13, 2023 at 10:48:19AM -0700, Linus Torvalds wrote:
> On Fri, 13 Oct 2023 at 05:05, Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
> >
> > Thomas, turns out that bisection result was slightly wrong: a recheck
> > confirmed that the regression is actually caused by 45e34c8af58f23
> > ("x86/smp: Put CPUs into INIT on shutdown if possible") [v6.5-rc1] of
> > yours. See https://bugzilla.kernel.org/show_bug.cgi?id=217995 for details.
> 
> That commit does look pretty dangerous.
> 
> If *anything* is done through SMI after the code does that
> smp_park_other_cpus_in_init() sequence, I wouldn't be surprised in the
> least if the machine is hung.
> 
> That's made worse since it looks like the shutdown sequence isn't
> necessarily run on the boot CPU, so the boot CPU itself may be in
> INIT, and any SMI quite possibly ends up treating that CPU specially.

Sending INIT to the processor marked as BSP will tank the system.

> 
> Who knows what SMI does, but the fact that the affected machines seem
> to be mainly from one particular manufacturer does tend to imply it's
> something like that.

There was a report (probably this same one), and it turns out it was a
bug in the BIOS SMI handler.

The client BIOSes were waiting for the CPU with the lowest APIC ID to be
the SMI rendezvous master. If this is MeteorLake, the BSP wasn't the one
with the lowest APIC ID, and it tripped here.
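
To make the failure mode concrete, here is a toy model, not firmware or kernel code. It assumes that a CPU parked in INIT cannot take part in the SMI rendezvous, and that the buggy handler always waits for the CPU with the lowest APIC ID. The `Cpu` class, the function names, and the APIC ID values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    apic_id: int
    is_bsp: bool = False
    parked_in_init: bool = False  # assumption: a parked CPU cannot answer an SMI


def park_non_bsp_cpus(cpus: list[Cpu]) -> None:
    """Model the shutdown step: every CPU except the BSP is put into INIT."""
    for cpu in cpus:
        cpu.parked_in_init = not cpu.is_bsp


def smi_rendezvous_completes(cpus: list[Cpu]) -> bool:
    """Buggy policy: the lowest APIC ID is always the rendezvous master."""
    master = min(cpus, key=lambda cpu: cpu.apic_id)
    # The rendezvous can only finish if the chosen master is able to respond.
    return not master.parked_in_init


# Hypothetical topologies; the APIC IDs are made up for illustration.
bsp_is_lowest = [Cpu(0, is_bsp=True), Cpu(2), Cpu(4)]
bsp_not_lowest = [Cpu(0), Cpu(8), Cpu(16, is_bsp=True)]

for name, cpus in (("BSP lowest", bsp_is_lowest), ("BSP not lowest", bsp_not_lowest)):
    park_non_bsp_cpus(cpus)
    print(name, "->", "completes" if smi_rendezvous_completes(cpus) else "hangs")
```

In this model the rendezvous only completes when the BSP also happens to have the lowest APIC ID, which matches the description above of why the handler tripped on parts where that is not the case.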

The BIOS change is also being pushed to others for assimilation :)

Server BIOSes have had this right for a while now.

> 
> And the code does do a fair amount *after* shutting down cpu's. Not
> just things like calling x86_platform.iommu_shutdown(), but also
> things like possibly the tboot shutdown sequence (which almost
> *certainly* is some SMI thing).
> 
> I dunno. Thomas - I think the argument for that commit was fairly
> theoretical, and reverting it seems the obvious thing, unless you have
> some idea of what might be wrong.
> 
>                Linus

-- 
Cheers,
Ashok


* Re: [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown)
  2023-10-13 17:48   ` Linus Torvalds
  2023-10-13 18:28     ` Ashok Raj
@ 2023-10-13 19:40     ` Thomas Gleixner
  1 sibling, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2023-10-13 19:40 UTC
  To: Linus Torvalds, Linux regressions mailing list
  Cc: Yanjun Yang, Linux Kernel Mailing List, Bagas Sanjaya,
	Borislav Petkov (AMD),
	Ashok Raj, Ingo Molnar, Dave Hansen, the arch/x86 maintainers

On Fri, Oct 13 2023 at 10:48, Linus Torvalds wrote:
> On Fri, 13 Oct 2023 at 05:05, Linux regression tracking (Thorsten
> Leemhuis) <regressions@leemhuis.info> wrote:
>>
>> Thomas, turns out that bisection result was slightly wrong: a recheck
>> confirmed that the regression is actually caused by 45e34c8af58f23
>> ("x86/smp: Put CPUs into INIT on shutdown if possible") [v6.5-rc1] of
>> yours. See https://bugzilla.kernel.org/show_bug.cgi?id=217995 for details.
>
> That commit does look pretty dangerous.
>
> If *anything* is done through SMI after the code does that
> smp_park_other_cpus_in_init() sequence, I wouldn't be surprised in the
> least if the machine is hung.
>
> That's made worse since it looks like the shutdown sequence isn't
> necessarily run on the boot CPU, so the boot CPU itself may be in
> INIT, and any SMI quite possibly ends up treating that CPU specially.

smp_park_other_cpus_in_init() bails out early when it's not invoked on
the boot CPU because sending INIT to the BSP results in a full machine
reset. So that's definitely not the problem.
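
The guard described above can be pictured with a small model. This is a sketch of the described behaviour only, not the kernel function: the Python name `park_other_cpus_in_init`, its parameters, and the CPU numbers are hypothetical.

```python
def park_other_cpus_in_init(this_cpu: int, boot_cpu: int, online: set[int]) -> set[int]:
    """Return the set of CPUs that would be sent INIT in this toy model.

    An empty set means "bail out and use another stop method", because
    INIT must never be sent to the boot CPU.
    """
    if this_cpu != boot_cpu:
        # Not running on the boot CPU: parking "all others" would include
        # the boot CPU and reset the machine, so do nothing.
        return set()
    return online - {boot_cpu}


online_cpus = {0, 1, 2, 3}
print(sorted(park_other_cpus_in_init(this_cpu=0, boot_cpu=0, online=online_cpus)))
print(sorted(park_other_cpus_in_init(this_cpu=2, boot_cpu=0, online=online_cpus)))
```

In both cases the boot CPU is never in the returned set, which is the property the early bail-out is there to guarantee.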

> Who knows what SMI does, but the fact that the affected machines seem
> to be mainly from one particular manufacturer does tend to imply it's
> something like that.

It's mostly Dell machines. The rest seem to be Lenovo and Sony machines
with Alder Lake/Raptor Lake CPUs - at least that's what I could figure out
from the various bug reports. I don't know which CPUs the Dell machines
have, so I can't say it's a pattern.

Bagas, can you please provide the output of /proc/cpuinfo ?
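
For reporters who would rather post a summary than the full file, a minimal sketch along these lines condenses x86 `/proc/cpuinfo` output into one line per CPU type. The function name `summarize_cpuinfo` is hypothetical, and the sample text is a made-up placeholder, not data from any affected machine.

```python
import re
from collections import Counter


def summarize_cpuinfo(text: str) -> Counter:
    """Count logical CPUs per (vendor_id, cpu family, model, model name)."""
    wanted = ("vendor_id", "cpu family", "model", "model name")
    summary = Counter()
    # Each logical CPU is described by a block of "key : value" lines,
    # and blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            summary[tuple(fields.get(name, "?") for name in wanted)] += 1
    return summary


# Placeholder input in the x86 /proc/cpuinfo layout; the values are invented.
sample = (
    "processor\t: 0\nvendor_id\t: GenuineIntel\ncpu family\t: 6\n"
    "model\t\t: 158\nmodel name\t: Example CPU\n\n"
    "processor\t: 1\nvendor_id\t: GenuineIntel\ncpu family\t: 6\n"
    "model\t\t: 158\nmodel name\t: Example CPU\n"
)
for cpu, count in summarize_cpuinfo(sample).items():
    print(count, "x", cpu)
```

On a real system the same function can be fed the contents of `/proc/cpuinfo` instead of the sample string.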

> And the code does do a fair amount *after* shutting down cpu's. Not
> just things like calling x86_platform.iommu_shutdown(), but also
> things like possibly the tboot shutdown sequence (which almost
> *certainly* is some SMI thing).

That should not matter, but who the heck knows.

> I dunno. Thomas - I think the argument for that commit was fairly
> theoretical, and reverting it seems the obvious thing, unless you have
> some idea of what might be wrong.

I agree with the revert for now.

The problem is not entirely theoretical in the kexec() case, but yes, for
shutdown/reboot it's irrelevant.

The reason why I ended up with this is the initial problem of soft
offlined CPUs sitting in MWAIT. The kexec() kernel can end up writing to
the monitor cache line reliably after it overwrote the original kernel
mappings, which results in completely undebuggable chaos or triple
faults.

The MWAIT issue is mitigated by writing to the monitor cache lines and
forcing the CPUs into HLT.

Extensive testing revealed that HLT is not entirely safe either, so we
ended up with the INIT trick, which turned out to be very reliable in
testing. Though it's obviously making some BIOSes very unhappy. Sigh...

Did I mention before that I hate computers with a passion?

Thanks,

        tglx


* Re: Fwd: Kernel 6.5 hangs on shutdown
  2023-10-12  9:37 Fwd: Kernel 6.5 hangs on shutdown Bagas Sanjaya
  2023-10-13 12:05 ` [regression] some Dell systems hang at shutdown due to "x86/smp: Put CPUs into INIT on shutdown if possible" (was Fwd: Kernel 6.5 hangs on shutdown) Linux regression tracking (Thorsten Leemhuis)
@ 2023-10-16  8:46 ` Linux regression tracking #update (Thorsten Leemhuis)
  1 sibling, 0 replies; 6+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-10-16  8:46 UTC
  To: Linux Kernel Mailing List, Linux Regressions

[TLDR: This mail is primarily relevant for Linux kernel regression
tracking. See link in footer if these mails annoy you.]

On 12.10.23 11:37, Bagas Sanjaya wrote:
> 
> I noticed a regression report on Bugzilla [1]. Quoting from it:
> 
>> I use a Dell OptiPlex 7050, and the kernel hangs when shutting down the computer.
>> Similar symptoms have been reported on some forums, and all of them are using
>> Dell computers:
>> https://bbs.archlinux.org/viewtopic.php?pid=2124429
>> https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
>> https://forum.artixlinux.org/index.php/topic,5997.0.html
>>
>> I tested various kernels, and this bug seems to be caused by commit 88afbb21d4b36fee6acaa167641f9f0fc122f01b.
> 
> See Bugzilla for the full thread.
> 
> Anyway, I'm adding this regression to be tracked by regzbot:
> 
> #regzbot introduced: 88afbb21d4b36f https://bugzilla.kernel.org/show_bug.cgi?id=217995
> #regzbot title: x86 core fix pull causes shutdown hang on Dell OptiPlex 7050
> #regzbot link: https://bbs.archlinux.org/viewtopic.php?pid=2124429
> #regzbot link: https://www.reddit.com/r/openSUSE/comments/16qq99b/tumbleweed_shutdown_did_not_finish_completely/
> #regzbot link: https://forum.artixlinux.org/index.php/topic,5997.0.html

#regzbot fix: fbe1bf1e5ff1e3b298420d7a8434983ef8d72bd1
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

