On 5/19/21 5:12 AM, Borislav Petkov wrote:
> On Tue, May 18, 2021 at 09:58:46PM -0600, James Feeney wrote:
>> Hmm - I am naively supposing that "the bisect is the bisect". No
>> matter what commit initiates a problem, it's still a problem. It would
>> be useful to investigate, and introspect the calling functions in the
>> Call Trace. No?
>
> I'd like to know that the source you're looking at is the same source
> I'm looking at.
>
> And yes, AFAIK, Arch kernels are simply the upstream kernels but
> still...
>

I had to ask, and got this answer:

====
The sources contain commits on top of upstream releases. This is why
the tags contain -arch1 etc. For example, see
https://git.archlinux.org/linux.git/log/?h=v5.11.16-arch1 , which adds
6 commits on top of the upstream "Linux 5.11.16" release, while
https://git.archlinux.org/linux.git/log/?h=v5.12-arch1 only contains
the long-standing "unprivileged_userns_clone" patch and the version
number change, making it essentially vanilla.
====

There are no additional kernel patches in the build.

>> Attached:
>> bootlog.7bb39313cd62
>> bootlog.4f432e8bb15b
>>
>> The latter with the "soft lockup" repeating four times. The kernel
>> command line has loglevel=5 and console=ttyS0,115200.
>
> Those are not the full boot messages - they should look like
> dmesglog.7bb39313cd62 but probably you cannot log into the box after the
> softlockup happens to dump them. That's why I meant to try the serial
> connection...
>
> Anyway, let's start somewhere.
>
> 1. Take a pristine 5.12 upstream kernel from git, build it using your
> bisectconfig and try booting it with
>
> debug ignore_loglevel log_buf_len=16M no_console_suspend systemd.log_target=null console=ttyS0,115200 console=tty0
>
> on the kernel command line. Then save a full dmesg, if you can. If you
> can catch it over serial, then that would be awesomer.
>
> 2. Use the exact same kernel but this time disable
>
> CONFIG_X86_THERMAL_VECTOR
>
> in its .config and do the same thing.
>
> Send me both dmesg files then.
>
> Thx.
>

$ git bisect reset v5.12-arch1
Updating files: 100% (12812/12812), done.
Previous HEAD position was 7bb39313cd62 x86/mce: Make mce_timed_out() identify holdout CPUs
HEAD is now at bee4e691ceea Arch Linux kernel v5.12-arch1

$ grep CONFIG_X86_THERMAL_VECTOR .config
CONFIG_X86_THERMAL_VECTOR=y

Attached:
dmesglog.5.12.therm.1.nostart      hangs after unpack rootfs
dmesglog.5.12.therm.2.softlockup   soft lockup, but stops and does not repeat
dmesglog.5.12.therm.3.fullboot     boots all the way to Xorg and does run a
                                   browser and play video

The fourth boot attempt hung again at unpack rootfs. If the machine is
left to sit in this state, the fan will begin to run at full speed, off
and on, suggesting that maybe the processor is still running, and
running at full power.

These boots are consecutive and are all from the same stock 5.12.0
kernel.

> Use the exact same kernel but this time disable CONFIG_X86_THERMAL_VECTOR

$ make menuconfig
...

This config option is not listed and is not changeable:

====
drivers/thermal/intel/Kconfig

config X86_THERMAL_VECTOR
	def_bool y
	depends on X86 && CPU_SUP_INTEL && X86_LOCAL_APIC
====

The Makefile there has:

obj-$(CONFIG_X86_THERMAL_VECTOR) += therm_throt.o

The files, thermal_interrupt.h and therm_throt.c, by Dmitriy Zavin, are
new since 5.11. But, it seems that this therm_throt.c file is one of
yours, anyway:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/thermal/intel/therm_throt.c?h=linux-5.12.y&id=9223d0dccb8f8523754122f68316dd1a4f39f7f8

I'm not sure that I can just delete these files, being thermal
management and all.

I see some talk in the associated thread about IRQ handler
registration. Could there be some connection between this and the soft
lockup?

https://lore.kernel.org/linux-pm/20210201142704.12495-1-bp@alien8.de/

What should we do next?

James