>> The lost cpu is *really* lost.  Warm reset doesn't fix the machine, I usually
>> have to do a full power cycle.

> How is it even possible that I did that with a few lines of asm?

Probably not directly your fault - some cascade of errors may have occurred.

> Could this be a hardware bug?  Is there some condition that causes #MC
> delivery to wedge hard enough that even INIT/RESET stops working?  Or
> possibly some CPU got stuck in SMM -- I have no idea what warm reset
> does these days.

I'm not even sure what kind of reset the remote management i/f I used
actually applied.

> Here's the patch to improve the timeout messages, but given the degree
> of wedgedness, I can guess what it'll say:
>
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/paranoid&id=e5cbd9d141bde651ecb20f0b65ad13bcef2468d0

Heh - I'd already put in some hacky printk()s to do something similar.
Mine aren't upstream quality, but they do print the value of
mce_callin/mce_executing as appropriate.  But I got some confusing
results - the reporter complained that only 142 of 144 CPUs had shown
up.  So two threads are missing, which may mean one core went into h/w
shutdown.  Need to dig further to see if the missing duo really are
from the same core.
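
One way to check whether two missing logical CPUs are hyperthread
siblings is to compare them against the sysfs topology files. The sketch
below is only an illustration, not something from my debugging patch: the
helper names are made up, it assumes the standard
/sys/devices/system/cpu/cpuN/topology/thread_siblings_list file is
readable, and it assumes plain "a,b" or "a-b" lists. It would need to run
on a healthy boot of the same machine, since a lost CPU may not expose
its topology directory.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel cpulist string such as '3,75' or '0-1' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. "0-1" means CPUs 0 and 1.
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def thread_siblings(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Read the sibling threads of one logical CPU from sysfs (hypothetical helper)."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())


def same_core(cpu_a, cpu_b, siblings_of=thread_siblings):
    """Return True if cpu_b appears in cpu_a's sibling list."""
    return cpu_b in siblings_of(cpu_a)
```

Calling same_core(a, b) with the two missing CPU numbers would then say
whether they share a core; siblings_of can be swapped for a lookup built
from a saved copy of the topology files.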

-Tony