On Wed, 2021-12-29 at 14:18 +0100, Paul Menzel wrote:
> > Or the one in
> > https://lore.kernel.org/lkml/d4cde50b4aab24612823714dfcbe69bc4bb63b60.camel@infradead.org
> >
> > which makes it do nothing except prepare all the CPUs before bringing
> > them up one at a time?
>
> I applied it on top the other one, and it made no difference either.

It's possible I missed something else in the prepare stage that doesn't
cope with all CPUs being prepared first.

My next attempt might be to change the loop in bringup_nonboot_cpus()
to bring all the CPUs not to the CPUHP_BP_PARALLEL_DYN state(s) but
instead just bring them to somewhere like CPUHP_RCUTREE_PREP, which is
somewhere in the middle between CPUHP_OFFLINE and CPUHP_BRINGUP_CPU.

Then a binary chop search: if that one boots, try maybe
CPUHP_TOPOLOGY_PREPARE. And if not, try CPUHP_PROFILE_PREPARE. Etc.

> > My current theory (not that I've spent that much time thinking about it
> > in the last week) is that there's something about the existing CPU
> > bringup, possibly a CPU bug or something special about the AMD CPUs,
> > which is triggered by just making it a little bit *faster*, which is
> > why bringing them up from kexec (especially in qemu) can cause it too?
>
> Would having the serial console enabled make a difference?
>

Yes. I couldn't make this fail in my EC2 m6a instance (for clean boots;
I have never managed to kexec it) until I turned off the serial console
to make things go faster.

> > Tom seemed to find that it was in load_TR_desc(), so if you could try
> > this hack on a machine that doesn't magically wink out of existence on
> > a triplefault before even flushing its serial output, that would be
> > much appreciated...
>
> Unfortunately, no more messages were printed on the serial console.

I suppose we need to litter those outputs somewhere earlier in the
trampoline then, perhaps it *isn't* getting to load_TR_desc() in your
case?

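For the binary chop over hotplug states mentioned earlier, the order of
test boots could be worked out with something like the sketch below. It
is only a rough helper, not kernel code: the list holds just the three
intermediate states named in this mail, their relative order is inferred
from the "if it boots / if not" steps above (the real enum cpuhp_state
has many more entries in between), and boots() is a hypothetical
callback standing in for "build with that target state, boot, report
whether it came up". It also assumes that if preparing all CPUs up to
one state breaks the boot, every later target breaks it too.

```python
from typing import Callable, Optional, Sequence

# Assumed order, inferred from this thread only; the real enum has
# many more states between CPUHP_OFFLINE and CPUHP_BRINGUP_CPU.
CANDIDATE_STATES = [
    "CPUHP_PROFILE_PREPARE",
    "CPUHP_RCUTREE_PREP",
    "CPUHP_TOPOLOGY_PREPARE",
]


def first_failing_state(
    states: Sequence[str],
    boots: Callable[[str], bool],
) -> Optional[str]:
    """Return the earliest target state for which the boot fails.

    boots(state) is a hypothetical test: True if the machine boots when
    all CPUs are first brought to `state` before the one-at-a-time
    bringup. Returns None if every candidate boots.
    """
    lo, hi = 0, len(states)  # states[:lo] are known to boot, states[hi:] to fail
    while lo < hi:
        mid = (lo + hi) // 2
        if boots(states[mid]):
            lo = mid + 1  # failure must be triggered by a later state
        else:
            hi = mid  # this state or an earlier one triggers it
    return states[lo] if lo < len(states) else None
```

With these three candidates the first test boot targets
CPUHP_RCUTREE_PREP, then CPUHP_TOPOLOGY_PREPARE if that booted or
CPUHP_PROFILE_PREPARE if it did not, which matches the order described
above.
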
Will be back online properly next week and can actually provide some of the above suggestions in patch form if you're willing to keep testing. Thanks!