>> The lost cpu is *really* lost.  Warm reset doesn't fix the machine, I usually
>> have to do a full power cycle.

> How is it even possible that I did that with a few lines of asm?

Probably not directly your fault - some cascade of errors may have occurred.

> Could this be a hardware bug?  Is there some condition that causes #MC
> delivery to wedge hard enough that even INIT/RESET stops working?  Or
> possibly some CPU got stuck in SMM -- I have no idea what warm reset
> does these days.

I'm not even sure what kind of reset the remote management i/f I used
actually applied.

> Here's the patch to improve the timeout messages, but given the degree
> of wedgedness, I can guess what it'll say:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/paranoid&id=e5cbd9d141bde651ecb20f0b65ad13bcef2468d0

Heh - I'd already put in some hacky printk()s to do something similar.  Mine
aren't upstream quality, but they do print the value of
mce_callin/mce_executing as appropriate.

But I got some confusing results - the reporter complained that only 142 of
144 had shown up.  So two threads are missing, which may mean one core went
into h/w shutdown.  Need to dig further to see if the missing duo really are
from the same core.

-Tony