On 5/17/21 2:32 AM, Borislav Petkov wrote:
> + lkml.
>
> On Mon, May 17, 2021 at 02:13:45AM -0600, James Feeney wrote:
>> I re-ran my git bisect, this time with a full power-down and cold boot, and more thorough testing, running a web browser. My second bisect went from good to bad.
>>
>> So now, instead, git bisect ended here:
>>
>> 4f432e8bb15b352da72525144da025a46695968f is the first bad commit
>> commit 4f432e8bb15b352da72525144da025a46695968f
>> Author: Borislav Petkov
>> Date: Thu Jan 7 13:23:34 2021 +0100
>>
>>     x86/mce: Get rid of mcheck_intel_therm_init()
>>
>>     Move the APIC_LVTTHMR read which needs to happen on the BSP, to
>>     intel_init_thermal(). One less boot dependency.
>>
>>     No functional changes.
>>
>>     Signed-off-by: Borislav Petkov
>>     Tested-by: Srinivas Pandruvada
>>     Link: https://lkml.kernel.org/r/20210201142704.12495-2-bp@alien8.de
>>
>>  arch/x86/include/asm/mce.h            |  6 ------
>>  arch/x86/kernel/cpu/mce/core.c        |  1 -
>>  arch/x86/kernel/cpu/mce/therm_throt.c | 15 ++++-----------
>>  3 files changed, 4 insertions(+), 18 deletions(-)
>>
>>
>> Please let me know if that makes more sense.
>
> Not really - this is the first time I'm seeing this and I highly doubt
> your bisection is correct. But we'll see.

I did go back and repeat the git bisect for a third time. This time, I rebooted each of the "good" kernels 10 times, in case a "good" kernel had "just got lucky" and failed to produce an error on a single boot. There were *no* boot failures on the "good" kernels, and there was *no change* in the resulting final "bad" commit.

>>
>> Again:
>>
>> Arch Linux
>> linux 5.12.arch1-1
>
> Can you reproduce with the upstream 5.12 kernel to rule out influence by
> any distro-specific patches?
>

Hmm - I am naively supposing that "the bisect is the bisect". No matter what commit initiates a problem, it's still a problem. It would be useful to investigate the calling functions in the Call Trace. No?

>> Intel Core2 T7200
>> Mobile Intel 945PM Express Chipset
>> ICH7-M
>> Mobility Radeon X1600
>
> Can you send full dmesg from a working kernel and the .config you're
> using with 5.12?
>

Attached:
dmesglog.7bb39313cd62
bisectconfig

7bb39313cd62 x86/mce: Make mce_timed_out() identify holdout CPUs
4f432e8bb15b x86/mce: Get rid of mcheck_intel_therm_init()

7bb39313cd62 is the immediately previous "good" bisect kernel. The config files for the two kernels are exactly the same.

>> Generally, on failure, the system will not boot past "Loading initial ramdisk...", or, when it does, the boot process will hang, and the console will eventually show:
>>
>> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd: 241]
>> ...
>> RIP: 0010:smp_call_function_single+0xf7/0x140
>>
>> The top of the call trace variously shows either "__flush_tlb_all" or "tlbflush_read_file", with the "soft lockup" repeating indefinitely.
>>
>
> I'm presuming there's no way to connect your box over serial cable to
> another one so that you can catch the full bad dmesg when it hangs? It
> would be good if you could...
>

Attached:
bootlog.7bb39313cd62
bootlog.4f432e8bb15b

The latter shows the "soft lockup" repeating four times. The kernel command line has loglevel=5 and console=ttyS0,115200.

> Thx.
>

Thanks for looking into this. Would some additional printk's be useful?

James
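
P.S. In case it helps anyone comparing the two boot logs, here is a rough sketch of how the "soft lockup" reports could be tallied from a captured serial log. It assumes the log is plain text with one kernel message per line, in the form of the watchdog line quoted above. The function name summarize_soft_lockups is only illustrative, not part of any existing tool.

```python
import re

# Matches the watchdog report quoted above. Whitespace after the colon in
# the trailing [comm:pid] field is tolerated, since logs may differ on it.
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+?):\s*(\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Return a list of (cpu, seconds, comm, pid) tuples, in log order."""
    reports = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            cpu, seconds, comm, pid = match.groups()
            reports.append((int(cpu), int(seconds), comm, int(pid)))
    return reports
```

Reading a log file into a string and passing it to summarize_soft_lockups() gives one tuple per report, so len() of the result is the repeat count, and the CPU and task names show whether every report points at the same place.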