On 5/17/21 2:32 AM, Borislav Petkov wrote:
> + lkml.
> 
> On Mon, May 17, 2021 at 02:13:45AM -0600, James Feeney wrote:
>> I re-ran my git bisect, this time with a full power-down and cold boot, and more thorough testing, running a web browser.  My second bisect went from good to bad.
>>
>> So now, instead, git bisect ended here:
>>
>> 4f432e8bb15b352da72525144da025a46695968f is the first bad commit
>> commit 4f432e8bb15b352da72525144da025a46695968f
>> Author: Borislav Petkov <bp@suse.de>
>> Date:   Thu Jan 7 13:23:34 2021 +0100
>>
>>     x86/mce: Get rid of mcheck_intel_therm_init()
>>
>>     Move the APIC_LVTTHMR read which needs to happen on the BSP, to
>>     intel_init_thermal(). One less boot dependency.
>>
>>     No functional changes.
>>
>>     Signed-off-by: Borislav Petkov <bp@suse.de>
>>     Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
>>     Link: https://lkml.kernel.org/r/20210201142704.12495-2-bp@alien8.de
>>
>>  arch/x86/include/asm/mce.h            |  6 ------
>>  arch/x86/kernel/cpu/mce/core.c        |  1 -
>>  arch/x86/kernel/cpu/mce/therm_throt.c | 15 ++++-----------
>>  3 files changed, 4 insertions(+), 18 deletions(-)
>>
>>
>> Please let me know if that makes more sense.
> 
> Not really - this is the first time I'm seeing this and I highly doubt
> your bisection is correct. But we'll see.
> 

I did go back and repeat the git bisect for a third time.  This time, I re-booted all of the "good" kernels 10 times, in case there was some random probability that a "good" kernel "just got lucky", and failed to produce an error on that boot.  There were *no* boot failures on the "good" kernels, and there was *no change* in the resulting final "bad" commit.
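
A rough sketch of the reasoning behind booting each "good" kernel 10 times: if a truly bad kernel failed only intermittently, the chance of it passing every boot shrinks geometrically with the number of boots. The per-boot failure probabilities below are assumptions picked for illustration, not measured values, and boots are assumed to be independent.

```python
def chance_all_boots_clean(p_fail_per_boot: float, boots: int) -> float:
    """Probability that a kernel which fails each boot independently
    with probability p_fail_per_boot survives `boots` boots in a row."""
    if not 0.0 <= p_fail_per_boot <= 1.0:
        raise ValueError("p_fail_per_boot must be between 0 and 1")
    if boots < 0:
        raise ValueError("boots must be non-negative")
    return (1.0 - p_fail_per_boot) ** boots


# Hypothetical failure rates; the real rate on this machine is unknown.
for p in (0.5, 0.25, 0.1):
    clean = chance_all_boots_clean(p, 10)
    print(f"p_fail={p}: chance of 10 clean boots = {clean:.4f}")
```

Under these assumptions, 10 clean boots make a "lucky" bad kernel unlikely unless the per-boot failure rate is quite low.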

>>
>> Again:
>>
>> Arch Linux
>> linux 5.12.arch1-1
> 
> Can you reproduce with the upstream 5.12 kernel to rule out influence by
> any distro-specific patches?
> 

Hmm - I am naively supposing that "the bisect is the bisect".  No matter what commit initiates a problem, it's still a problem.  It would be useful to investigate, and introspect the calling functions in the Call Trace.  No?

>> Intel Core2 T7200
>> Mobile Intel 945PM Express Chipset
>> ICH7-M
>> Mobility Radeon X1600
> 
> Can you send full dmesg from a working kernel and the .config you're
> using with 5.12?
> 

Attached:
dmesglog.7bb39313cd62
bisectconfig

7bb39313cd62 x86/mce: Make mce_timed_out() identify holdout CPUs
4f432e8bb15b x86/mce: Get rid of mcheck_intel_therm_init()

7bb39313cd62 is the immediately previous "good" bisect kernel.  The config files for the two kernels are exactly the same.

>> Generally, on failure, the system will not boot past "Loading initial ramdisk...", or, when it does, the boot process will hang, and the console will eventually show:
>>
>> watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [systemd-udevd: 241]
>> ...
>> RIP: 0010:smp_call_function_single+0xf7/0x140
>>
>> The top of the call trace variously shows either "__flush_tlb_all" or "tlbflush_read_file", with the "soft lockup" repeating indefinitely.
>>
> 
> I'm presuming there's no way to connect your box over serial cable to
> another one so that you can catch the full bad dmesg when it hangs? It
> would be good if you could...
> 

Attached:
bootlog.7bb39313cd62
bootlog.4f432e8bb15b

The latter shows the "soft lockup" repeating four times.  The kernel command line has loglevel=5 and console=ttyS0,115200.

> Thx.
> 

Thanks for looking into this.  Would some additional printk's be useful?

James