>> No information besides that it is a machine check. This happens in two cases:
>> 1) The CPU logs the error with the MCi_STATUS.EN bit set to zero, and Linux
>>    ignores EN=0 entries (as it should).
> Well, I guess we shouldn't anymore. Apparently hw forgets to set the
> bit when raising an MCE so then we should ignore it too in mce-severity
> and delete that piece or grade it as higher severity based on, I dunno,
> b0rked hardware family/model/stepping or whatever bit we set...
>
>         MCESEV(
>                 NO, "Not enabled",
>                 BITCLR(MCI_STATUS_EN)
>         ),

The SDM has this to say about EN=0 (in section 15.10.4.1 of volume 3B):

   When the EN flag is zero but the VAL and UC flags are one in the
   IA32_MCi_STATUS register, the reported uncorrected error in this
   bank is not enabled. As uncorrected errors with the EN flag = 0 are
   not the source of machine check exceptions, the MCE handler should
   log and clear non-enabled errors when the S bit is set and should
   continue searching for enabled errors from the other IA32_MCi_STATUS
   registers. Note that when IA32_MCG_CAP [24] is 0, any uncorrected
   error condition (VAL =1 and UC=1) including the one with the EN flag
   cleared are fatal and the handler must signal the operating system
   to reset the system. For the errors that do not generate machine
   check exceptions, the EN flag has no meaning.

Note the "should log and clear". We just clear ... just need to shuffle
some code in mce.c to add the logging.

But we still need something like Rui's patch - calling mcelog() doesn't
ensure that we see something on the console about possible cause of
the problem.

-Tony
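
For reference, the SDM paragraph quoted above can be read as a small
per-bank decision. The following is only a rough sketch of that reading,
not the code in mce.c or mce-severity: the enum, the function name and
the macro names are hypothetical, and the bit positions (VAL=63, UC=61,
EN=60, S=56 in IA32_MCi_STATUS, bit 24 in IA32_MCG_CAP) are assumed from
the SDM register layout. Cases the quoted text does not cover are
returned as BANK_OTHER and left to whatever grading the handler already
does.

```c
#include <stdint.h>

/* Assumed bit positions, per the SDM register layout. */
#define STATUS_VAL     (1ULL << 63)  /* IA32_MCi_STATUS.VAL */
#define STATUS_UC      (1ULL << 61)  /* IA32_MCi_STATUS.UC  */
#define STATUS_EN      (1ULL << 60)  /* IA32_MCi_STATUS.EN  */
#define STATUS_S       (1ULL << 56)  /* IA32_MCi_STATUS.S   */
#define MCG_CAP_BIT24  (1ULL << 24)  /* IA32_MCG_CAP[24]    */

/* Hypothetical outcome of looking at one bank inside the MCE handler. */
enum bank_action {
	BANK_SKIP,           /* VAL=0: nothing logged in this bank */
	BANK_OTHER,          /* not the EN=0 case described in the quote */
	BANK_LOG_AND_CLEAR,  /* not enabled: log, clear, keep searching */
	BANK_FATAL,          /* must signal the OS to reset the system */
};

/* Hypothetical helper: apply the quoted EN=0 rules to one bank. */
static enum bank_action classify_not_enabled(uint64_t mcg_cap, uint64_t status)
{
	if (!(status & STATUS_VAL))
		return BANK_SKIP;

	/* The quoted paragraph only covers VAL=1, UC=1, EN=0. */
	if (!(status & STATUS_UC) || (status & STATUS_EN))
		return BANK_OTHER;

	/* IA32_MCG_CAP[24] == 0: any uncorrected error is fatal, EN or not. */
	if (!(mcg_cap & MCG_CAP_BIT24))
		return BANK_FATAL;

	/* S set: this bank did not cause the MCE; log and clear it. */
	if (status & STATUS_S)
		return BANK_LOG_AND_CLEAR;

	/* S clear: no exception is generated, so EN has no meaning here. */
	return BANK_OTHER;
}
```

The BANK_LOG_AND_CLEAR case is the one where we currently only clear;
a caller would log the bank before clearing it and then move on to the
next bank.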