* linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single
@ 2021-05-17  8:13 James Feeney
  2021-05-17  8:32 ` Borislav Petkov
  0 siblings, 1 reply; 33+ messages in thread
From: James Feeney @ 2021-05-17  8:13 UTC
  To: linux-smp; +Cc: Borislav Petkov, Jens Axboe

I re-ran my git bisect, this time with a full power-down and cold boot, and more thorough testing, running a web browser.  My second bisect went from good to bad.

So now, instead, git bisect ended here:

4f432e8bb15b352da72525144da025a46695968f is the first bad commit
commit 4f432e8bb15b352da72525144da025a46695968f
Author: Borislav Petkov <bp@suse.de>
Date:   Thu Jan 7 13:23:34 2021 +0100

    x86/mce: Get rid of mcheck_intel_therm_init()

    Move the APIC_LVTTHMR read which needs to happen on the BSP, to
    intel_init_thermal(). One less boot dependency.

    No functional changes.

    Signed-off-by: Borislav Petkov <bp@suse.de>
    Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
    Link: https://lkml.kernel.org/r/20210201142704.12495-2-bp@alien8.de

 arch/x86/include/asm/mce.h            |  6 ------
 arch/x86/kernel/cpu/mce/core.c        |  1 -
 arch/x86/kernel/cpu/mce/therm_throt.c | 15 ++++-----------
 3 files changed, 4 insertions(+), 18 deletions(-)


Please let me know if that makes more sense.

Again:

Arch Linux
linux 5.12.arch1-1

Intel Core2 T7200
Mobile Intel 945PM Express Chipset
ICH7-M
Mobility Radeon X1600

Generally, on failure, the system will not boot past "Loading initial ramdisk...", or, when it does, the boot process will hang, and the console will eventually show:

watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd: 241]
...
RIP: 0010:smp_call_function_single+0xf7/0x140

The top of the call trace variously shows either "__flush_tlb_all" or "tlbflush_read_file", with the "soft lockup" repeating indefinitely.

If this is some race/timing issue on boot, I have to go back and re-test every "good" bisect, re-booting many times to see if there is *ever* a failure - and that is supposing that there is no interaction between whatever is causing the problem and all the other patches being added.  Any insight would be appreciated.
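
When re-testing each bisect step over many boots, it may help to classify a saved console or journal log mechanically rather than by eye. The following is only a minimal sketch, assuming the log has been captured as text; the function name looks_bad and both patterns are hypothetical helpers built from the two signature lines quoted above, not part of any kernel tooling. It would not catch the case where the system freezes without logging anything.

```python
import re

# Patterns derived from the two signature lines quoted in this report.
# Offsets after the symbol vary between builds, so they are matched loosely.
LOCKUP_RE = re.compile(r"watchdog: BUG: soft lockup - CPU#\d+ stuck for \d+s!")
RIP_RE = re.compile(r"RIP: \d+:smp_call_function_single\+0x[0-9a-f]+/0x[0-9a-f]+")


def looks_bad(log_text: str) -> bool:
    """Return True if the log contains both the soft-lockup warning and
    a RIP line pointing into smp_call_function_single."""
    return bool(LOCKUP_RE.search(log_text)) and bool(RIP_RE.search(log_text))


sample = """\
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd: 241]
RIP: 0010:smp_call_function_single+0xf7/0x140
"""

if __name__ == "__main__":
    print("bad" if looks_bad(sample) else "no lockup signature found")
```

A boot whose log matches can be marked "bad" immediately; a boot whose log does not match still needs the manual "exercise" described above before it can be trusted as "good".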


James

* Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single
@ 2021-05-14 18:39 James Feeney
  2021-05-17 12:27 ` Christoph Hellwig
  0 siblings, 1 reply; 33+ messages in thread
From: James Feeney @ 2021-05-14 18:39 UTC
  To: linux-smp; +Cc: Christoph Hellwig

With the patch to kernel/smp.c in linux 5.12.4, "smp: Fix smp_call_function_single_async prototype", by Arnd Bergmann, I thought maybe there was a fix.  But no.  The error is the same, except the top of the Call Trace is different:

...
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! ...
...
RIP: 0010:smp_call_function_single+0xeb/0x130
...
Call Trace:
? text_poke_loc_init+0x160/0x160
? text_poke_loc_init+0x160/0x160
on_each_cpu+0x39/0x90
...

and repeats indefinitely.

Again, smp_call_function_single is defined in kernel/smp.c

It seems that my git bisect is probably off, since apparently the system may sometimes boot to a temporarily working state, and some "exercise" is needed to identify the failure.  However, see another git bisect for possibly the same issue at

 https://bugs.archlinux.org/task/70663#comment199765

with "bisect-result.txt"

 https://bugs.archlinux.org/task/70663?getfile=20255

Markus says, in part:

====
Trying to bisect, I arrived at a different set of commits though.
7a800a20ae6329e803c5c646b20811a6ae9ca136 showed the issue described, where a seemingly working kernel will lock up rather quickly.
f007a3d66c5480c8dae3fa20a89a06861ef1f5db worked flawlessly, without any hiccups doing random internet browsing while I was compiling the next bisect step.
However, there are six commits between those, that did not boot and left me stuck with a black screen right after the bootloader (so no systemd startup message or similar). The system did not react to any inputs (Alt+SysRq) or to a short press of the PC's power button, and thus a hard shutdown was necessary.
====

These 8 commits - total - are from Christoph Hellwig, 2021 Feb 02.  Perhaps something closer to the real issue is in there.  As with Markus, I've also noticed that a "warm" reboot can result in a frozen system immediately after the boot loader has run.  A full power-off reboot is needed to get past the early screen initialization.

I'll have to re-do my git bisect, with more extensive system "exercise", to see if something more useful results.

James

* Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single
@ 2021-05-03  9:44 James Feeney
  0 siblings, 0 replies; 33+ messages in thread
From: James Feeney @ 2021-05-03  9:44 UTC
  To: linux-smp

$ git bisect bad
7c70f3a7488d2fa62d32849d138bf2b8420fe788 is the first bad commit
commit 7c70f3a7488d2fa62d32849d138bf2b8420fe788
Merge: 20bf195e9391 4d12b7275386
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Feb 22 13:29:55 2021 -0800

    Merge tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

    Pull more nfsd updates from Chuck Lever:
     "Here are a few additional NFSD commits for the merge window:

     Optimization:
       - Cork the socket while there are queued replies

      Fixes:
       - DRC shutdown ordering
       - svc_rdma_accept() lockdep splat"

    * tag 'nfsd-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
      SUNRPC: Further clean up svc_tcp_sendmsg()
      SUNRPC: Remove redundant socket flags from svc_tcp_sendmsg()
      SUNRPC: Use TCP_CORK to optimise send performance on the server
      svcrdma: Hold private mutex while invoking rdma_accept()
      nfsd: register pernet ops last, unregister first

 fs/nfsd/nfsctl.c                         | 14 ++++++-------
 include/linux/sunrpc/svcsock.h           |  2 ++
 net/sunrpc/svcsock.c                     | 35 ++++++++++++++++----------------
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  6 +++---
 4 files changed, 29 insertions(+), 28 deletions(-)

--------------

There is a small chance that this bisect is not precise, because sometimes the system can boot to a temporarily working state, then lock-up after a short time.  I did not test every successful initial boot extensively.
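
To put a rough number on how many clean boots a "good" verdict needs when the failure is intermittent, here is a small sketch. It assumes each boot fails independently with the same probability p_fail, which is a simplification and may not hold for this bug; the function name boots_needed and the example values are hypothetical, since the real failure rate is not known.

```python
import math


def boots_needed(p_fail: float, confidence: float = 0.95) -> int:
    """Smallest n such that a bad kernel would fail at least once in n
    boots with the given confidence, i.e. 1 - (1 - p_fail)**n >= confidence.

    Assumes independent boots with a fixed per-boot failure probability.
    """
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    # Hypothetical failure rates, for illustration only.
    for p in (0.5, 0.1):
        print(p, boots_needed(p))
```

The lower the per-boot failure rate, the more clean boots are needed before a commit can reasonably be marked "good", which is why a single successful boot per bisect step can mislead.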

This particular commit does not produce the same "watchdog: BUG: soft lockup" log message.  Instead, after sometimes - mostly not - booting to an Xorg display, the system just completely freezes, with not so much as the system log still working.

* linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single
@ 2021-04-30 17:04 James Feeney
  0 siblings, 0 replies; 33+ messages in thread
From: James Feeney @ 2021-04-30 17:04 UTC
  To: linux-smp

Arch Linux
linux 5.12.arch1-1

Intel Core2 T7200
Mobile Intel 945PM Express Chipset
ICH7-M
Mobility Radeon X1600


System log throws:

...
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd: 241]
...
RIP: 0010:smp_call_function_single+0xf7/0x140
...
Call Trace:
? __flush_tlb_all+0x30/0x30
? __flush_tlb_all+0x30/0x30
on_each_cpu+0x39/0x90
...

and repeats indefinitely.

smp_call_function_single is defined in kernel/smp.c


end of thread, other threads:[~2021-05-31 21:46 UTC | newest]

Thread overview: 33+ messages
2021-05-17  8:13 linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single James Feeney
2021-05-17  8:32 ` Borislav Petkov
2021-05-19  3:58   ` James Feeney
2021-05-19 11:12     ` Borislav Petkov
2021-05-19 20:03       ` James Feeney
2021-05-19 21:18         ` Borislav Petkov
2021-05-20  3:12           ` James Feeney
2021-05-20  9:21             ` Borislav Petkov
2021-05-21 22:11               ` James Feeney
2021-05-22  9:06                 ` Borislav Petkov
2021-05-22 23:28                   ` James Feeney
2021-05-22 23:28                     ` James Feeney
2021-05-23 17:05                     ` Borislav Petkov
2021-05-23 23:02                       ` James Feeney
2021-05-24  7:51                         ` Borislav Petkov
2021-05-25  4:02                           ` James Feeney
2021-05-27 10:31                             ` [PATCH] x86/thermal: Fix LVT thermal setup for SMI delivery mode Borislav Petkov
2021-05-27 11:49                               ` Thomas Gleixner
2021-05-27 11:56                                 ` Borislav Petkov
2021-05-27 18:54                                 ` Borislav Petkov
2021-05-28  8:23                                   ` Thomas Gleixner
2021-05-28 11:19                                     ` Borislav Petkov
2021-05-31 18:26                                       ` James Feeney
2021-05-27 18:09                               ` Srinivas Pandruvada
2021-05-27 19:01                                 ` Borislav Petkov
2021-05-27 20:28                                   ` Srinivas Pandruvada
2021-05-28  7:05                               ` James Feeney
2021-05-31 21:46   ` [tip: x86/urgent] " tip-bot2 for Borislav Petkov
  -- strict thread matches above, loose matches on Subject: below --
2021-05-14 18:39 linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single James Feeney
2021-05-17 12:27 ` Christoph Hellwig
2021-05-19 15:50   ` James Feeney
2021-05-03  9:44 James Feeney
2021-04-30 17:04 James Feeney
