Dell XPS13: MCE (Hardware Error) reported

From: Paul Menzel @ 2017-01-04 15:42 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Thorsten Leemhuis, Len Brown

Dear Linux folks,

The kernel log on my Dell XPS13 contains the following machine check exception (MCE) messages.

From Linux 4.10-rc2+ (0f64df301240 Merge branch 'parisc-4.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux):

> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ce40 MISC 7880018086
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0

I am able to reproduce this also with Linux 4.8.11 from Debian Sid/unstable.

After I installed *mcelog* 144+dfsg-1, it created the file below.

```
$ more /var/log/mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 47880018086 ADDR fef1ff40
TIME 1483543069 Wed Jan  4 16:17:49 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880018086 ADDR fef1ce40
TIME 1483543069 Wed Jan  4 16:17:49 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 47880018086 ADDR fef1ff40
TIME 1483543581 Wed Jan  4 16:26:21 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880018086 ADDR fef1ce40
TIME 1483543581 Wed Jan  4 16:26:21 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
```
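
For reference, the STATUS value can also be decoded by hand. Below is a minimal Python sketch, assuming the IA32_MCi_STATUS bit layout as I understand it from the Intel SDM (flag bits 63 to 57, model-specific code in bits 31:16, MCA error code in bits 15:0, with bit 12 as the filtering bit). The function and table names are my own and are not part of mcelog or the kernel.

```python
# Sketch only: bit positions are assumed from the Intel SDM description of
# IA32_MCi_STATUS; all names here are illustrative, not an mcelog API.
STATUS_FLAGS = {
    63: "VAL",    # register contents valid
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}
CACHE_LEVELS = ["L0", "L1", "L2", "generic"]
TRANSACTION_TYPES = ["instruction", "data", "generic", "reserved"]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    result = {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        # Bit 12 of the MCA error code: corrected-error filtering.
        "corrected_filtering": bool(mca_code & 0x1000),
    }
    # Cache hierarchy errors have the form 000F 0001 RRRR TTLL.
    if (mca_code & 0xEF00) == 0x0100:
        result["cache_level"] = CACHE_LEVELS[mca_code & 0x3]
        result["transaction_type"] = TRANSACTION_TYPES[(mca_code >> 2) & 0x3]
        result["request"] = (mca_code >> 4) & 0xF  # 0 means generic error
    return result


print(decode_mci_status(0xEE0000000040110A))
```

For the value above this gives the flags VAL, OVER, UC, MISCV, ADDRV and PCC, with EN clear, and a generic Level-2 cache error with the filtering bit set, which agrees with the mcelog text.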

This seems to be a common problem on this machine. Quoting an answer from [1]:

> First, I fear that I cannot really give good answers to your questions. I also own a Dell XPS 13 (9360) and see the same MCE messages. I'm in contact with Dell Support because of these. They replaced the mainboard but it did not help. Same messages in the logs. At some point they concluded that it is probably a false positive. They had no idea what is causing it, though (mcelog/kernel/Intel problem?). The correspondence with Support is still ongoing.
>
> <rant> Btw, talking to Dell Support is a very unpleasant experience. They seem to only suggest the "standard" solutions like resetting the Firmware, run self-health tests and so on. I didn't had the impression to talk to someone with some technical insight. </rant>
>
> To add more details, I see the same issue on Fedora 24 so it seems not to be related to Ubuntu.
>
> Regarding your questions:
>
>     What do these errors mean and should I worry about them?
>
> I don't know. Dell Support thinks those are false positives.
>
>     Could these hardware errors be the cause of the freezes of the entire system?
>
> Besides the messages my system works fine. I'd guess the freeze is a different issue.
>
>     Should I have the laptop (or parts) replaced by the manufacturer?
>
> Replacing the mainboard did not fix the MCE issue. It might solve the freezing issue, although it seems that this was fixed by a kernel update.
>
>     Are there any other actions I should take?
>
> If you are not already in contact with Support, contact them. Maybe they will come up with a real solution once they see that it affects more customers.

Could you please tell me whether and where I should open an issue in the
Linux bug tracker [2]?

Any ideas are welcome.

Kind regards,

Paul

[1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283
[2] https://bugzilla.kernel.org/
