* Re: Dell XPS13: MCE (Hardware Error) reported
@ 2017-01-05  5:00 Daniel J Blueman
  2017-01-05 14:05 ` Daniel J Blueman
  0 siblings, 1 reply; 49+ messages in thread
From: Daniel J Blueman @ 2017-01-05  5:00 UTC
  To: ashok.raj, Borislav Petkov, pmenzel, tony.luck, linux, len.brown,
	Linux Kernel

On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
> Hi Boris
>
> thanks for forwarding.
>
> > > CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
> > > Hardware event. This is not a software error.
> > > MCE 1
> > > CPU 0 BANK 7
> > > MISC 7880018086 ADDR fef1ce40
> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
> > > MCG status:
> > > MCi status:
> > > Error overflow
> > > Uncorrected error
> > > MCi_MISC register valid
> > > MCi_ADDR register valid
> > > Processor context corrupt
> > > MCA: corrected filtering (some unreported errors in same region)
> > > Generic CACHE Level-2 Generic Error
> > > STATUS ee0000000040110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> VAL=1, OVER=1, UC=1, but EN=0 indicates this isn't an MCE, hence it should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that?
>
> Although it's reported as an L2 error, some memory errors can also manifest
> themselves as cache errors in certain cases.  In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
> > > MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Supports CMCI (bit 10, CMCI_P: Corrected Machine Check Interrupt) and
> threshold-based error reporting (bit 11, TES_P).
>
>
> Do you have another machine which doesn't report these errors? If so, try
> swapping memory between them to see if the error disappears.
>
> I don't have the model-specific error code handy; I will check that in the
> meantime to get some decoding as well.
>
> If you haven't already, running some memory tests would also help.
>
> If you replaced the motherboard, did that involve both CPU and memory,
> or just the motherboard swap?
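
The bit decoding quoted above can be reproduced by hand. What follows is a
minimal Python sketch, not code from this thread: it assumes the
IA32_MCi_STATUS and IA32_MCG_CAP bit layouts documented in the Intel SDM,
and the names decode_mci_status and decode_mcg_cap are made up for
illustration.

```python
# Bit positions assumed from the Intel SDM layouts of IA32_MCi_STATUS and
# IA32_MCG_CAP. The helper names are hypothetical, not from any library.
STATUS_FLAGS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # error overflow
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split an IA32_MCi_STATUS value into flag bits and error codes."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific error code
    fields["MCACOD"] = status & 0xFFFF         # architectural error code
    return fields


def decode_mcg_cap(cap):
    """Pick out the IA32_MCG_CAP fields discussed in the quoted mail."""
    return {
        "banks": cap & 0xFF,        # number of error-reporting banks
        "CMCI_P": (cap >> 10) & 1,  # corrected machine check interrupt present
        "TES_P": (cap >> 11) & 1,   # threshold-based error status present
    }


if __name__ == "__main__":
    # STATUS and MCGCAP values taken from the mcelog report in this thread.
    for name, value in decode_mci_status(0xEE0000000040110A).items():
        print(f"{name:6} = {value:#x}")
    print(decode_mcg_cap(0xC08))
```

For STATUS ee0000000040110a this gives VAL=1, OVER=1, UC=1, EN=0 and PCC=1
with MSCOD 0x40 and MCACOD 0x110a, and for MCG_CAP c08 it gives CMCI_P=1
and TES_P=1 with 8 banks, which agrees with the decoding quoted above.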

I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
physical address is in the non-coherent low MMIO window:
MISC 7880018086 ADDR fef1ce40

Which is declared as device memory:
[    0.000000] PM: Registered nosave memory: [mem 0xfee01000-0xfeffffff]

For core-generated cycles, it is between the local APIC space at
FEE00000:FEEFFFFF and SPI BIOS at FFE00000:FFFFFFFF, so it will be
subtractively decoded to the PCH, maybe being aborted due to a device
not being enabled (hello TPM3 or new image processor).
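
The range argument above can be checked mechanically. This is only a
sketch under stated assumptions: the nosave region comes from the boot log
line quoted above, the local APIC window is assumed to be the usual 1 MiB
ending at FEEFFFFF, and the table and function names are invented for
illustration.

```python
# Inclusive (name, first, last) ranges. The nosave range is copied from the
# boot log in this mail; the local APIC end address is an assumption (the
# usual 1 MiB window). Names here are hypothetical.
RANGES = [
    ("nosave region", 0xFEE01000, 0xFEFFFFFF),
    ("local APIC", 0xFEE00000, 0xFEEFFFFF),
    ("SPI BIOS", 0xFFE00000, 0xFFFFFFFF),
]


def containing_ranges(addr):
    """Return the names of all ranges that contain the physical address."""
    return [name for name, first, last in RANGES if first <= addr <= last]


if __name__ == "__main__":
    # ADDR values reported for banks 6 and 7 in this thread.
    for addr in (0xFEF1FF40, 0xFEF1CE40):
        print(f"{addr:#x}: {containing_ranges(addr)}")
```

Both reported addresses land only in the nosave region, above the local
APIC window and below the SPI BIOS window, which matches the reasoning
above that the access was not to DRAM.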

As it is logged as soon as the MCE driver initialises, it was probably
logged during BIOS init, so there's not much we can do about it
anyway.

Dan
-- 
Daniel J Blueman

* [PATCH 0/9] x86/RAS: Queue for 4.11
@ 2017-01-23 18:35 Borislav Petkov
  2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
                   ` (8 more replies)
  0 siblings, 9 replies; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Hi,

here's the stuff which got ready in time. The more exciting things are
going to wait for the next merge window. :-)

Please apply,
thanks.

Borislav Petkov (8):
  x86/mce-inject: Make it depend on X86_LOCAL_APIC
  x86/MCE/therm_throt: Do not log a fake MCE for a thermal event
  x86/MCE: Flip the TSC-adding logic
  x86/ras/mce_amd_inj: Change dependency
  EDAC/mce_amd: Unexport amd_decode_mce()
  EDAC/mce_amd: Dump TSC value
  x86/MCE: Get rid of mce_process_work()
  x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority

Yazen Ghannam (1):
  x86/MCE/AMD: Make sysfs names of banks more user-friendly

 arch/x86/Kconfig                          |  2 +-
 arch/x86/include/asm/mce.h                | 20 ++++++-----
 arch/x86/kernel/cpu/mcheck/mce-apei.c     |  5 ++-
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c   |  5 +--
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c          | 57 ++++---------------------------
 arch/x86/kernel/cpu/mcheck/mce_amd.c      |  9 +++--
 arch/x86/kernel/cpu/mcheck/therm_throt.c  | 30 ++++++----------
 arch/x86/ras/Kconfig                      |  2 +-
 drivers/acpi/acpi_extlog.c                |  1 +
 drivers/acpi/nfit/mce.c                   |  1 +
 drivers/edac/i7core_edac.c                |  1 +
 drivers/edac/mce_amd.c                    |  8 +++--
 drivers/edac/mce_amd.h                    |  1 -
 drivers/edac/sb_edac.c                    |  3 +-
 drivers/edac/skx_edac.c                   |  3 +-
 17 files changed, 59 insertions(+), 93 deletions(-)

-- 
2.11.0

* Dell XPS13: MCE (Hardware Error) reported
@ 2017-01-04 15:42 Paul Menzel
  2017-01-04 22:55 ` Borislav Petkov
  0 siblings, 1 reply; 49+ messages in thread
From: Paul Menzel @ 2017-01-04 15:42 UTC
  To: Linux Kernel Mailing List; +Cc: Thorsten Leemhuis, Len Brown

Dear Linux folks,


The logs contain the following messages.

From Linux 4.10-rc2+ (0f64df301240 Merge branch 'parisc-4.10-2' of
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux):

> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ce40 MISC 7880018086
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0

I am able to reproduce this also with Linux 4.8.11 from Debian Sid/unstable.

Installing *mcelog* 144+dfsg-1, the file below is created.

```
$ more /var/log/mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 47880018086 ADDR fef1ff40
TIME 1483543069 Wed Jan  4 16:17:49 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880018086 ADDR fef1ce40
TIME 1483543069 Wed Jan  4 16:17:49 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 47880018086 ADDR fef1ff40
TIME 1483543581 Wed Jan  4 16:26:21 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880018086 ADDR fef1ce40
TIME 1483543581 Wed Jan  4 16:26:21 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
```
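
The "Generic CACHE Level-2 Generic Error" line and the "corrected
filtering" note in the output above both derive from the low 16 bits of
STATUS (MCACOD 110a). The sketch below shows how that decoding could be
done by hand. It assumes the cache-hierarchy compound error code layout
from the Intel SDM (000F 0001 RRRR TTLL); the label strings are my own
wording and the function name is hypothetical, so this is not how mcelog
itself is implemented.

```python
# Lookup tables for the cache-hierarchy compound error code, assumed from
# the Intel SDM encoding 000F 0001 RRRR TTLL. Labels are illustrative.
TT = {0: "Instruction", 1: "Data", 2: "Generic"}
LL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}
RRRR = {
    0: "Generic Error", 1: "Generic Read", 2: "Generic Write",
    3: "Data Read", 4: "Data Write", 5: "Instruction Fetch",
    6: "Prefetch", 7: "Eviction", 8: "Snoop",
}


def decode_cache_mcacod(mcacod):
    """Decode a cache-hierarchy MCACOD; return None for any other class."""
    # Bits 15:13 must be zero and bits 11:8 must be 0001; bit 12 is F.
    if (mcacod & 0xEF00) != 0x0100:
        return None
    return {
        "filtered": bool(mcacod & (1 << 12)),  # F: corrected-error filtering
        "type": TT.get((mcacod >> 2) & 0x3, "Reserved"),
        "level": LL[mcacod & 0x3],
        "request": RRRR.get((mcacod >> 4) & 0xF, "Reserved"),
    }


if __name__ == "__main__":
    # Low 16 bits of STATUS ee0000000040110a from the report above.
    print(decode_cache_mcacod(0xEE0000000040110A & 0xFFFF))
```

For 110a the F bit is set, the transaction type is generic, the level is
L2 and the request type is a generic error, which lines up with the text
mcelog printed.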

It looks like it’s a common problem on this machine [1].

> First, I fear that I cannot really give good answers to your questions. I also own a Dell XPS 13 (9360) and see the same MCE messages. I'm in contact with Dell Support because of these. They replaced the mainboard but it did not help. Same messages in the logs. At some point they concluded that it is probably a false positive. They had no idea what is causing it, though (mcelog/kernel/Intel problem?). The correspondence with Support is still ongoing.
>
> <rant> Btw, talking to Dell Support is a very unpleasant experience. They seem to only suggest the "standard" solutions like resetting the firmware, running self-health tests and so on. I didn't have the impression that I was talking to someone with any technical insight. </rant>
>
> To add more details, I see the same issue on Fedora 24 so it seems not to be related to Ubuntu.
>
> Regarding your questions:
>
>     What do these errors mean and should I worry about them?
>
> I don't know. Dell Support thinks those are false positives.
>
>     Could these hardware errors be the cause of the freezes of the entire system?
>
> Besides the messages my system works fine. I'd guess the freeze is a different issue.
>
>     Should I have the laptop (or parts) replaced by the manufacturer?
>
> Replacing the mainboard did not fix the MCE issue. It might solve the freezing issue, although it seems that this was fixed by a kernel update.
>
>     Are there any other actions I should take?
>
> If you are not already in contact with Support, contact them. Maybe they will come up with a real solution once they see that it affects more customers.

Could you please tell me if and where I should open an issue in the
Linux bug tracker [2]?

Any ideas are welcome.


Kind regards,

Paul


[1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283
[2] https://bugzilla.kernel.org/


end of thread (latest message 2017-02-01 20:52 UTC)

Thread overview: 49+ messages
2017-01-05  5:00 Dell XPS13: MCE (Hardware Error) reported Daniel J Blueman
2017-01-05 14:05 ` Daniel J Blueman
2017-01-05 20:10   ` Alexander Alemayhu
2017-01-05 20:31     ` Borislav Petkov
2017-01-05 20:43       ` Raj, Ashok
2017-01-05 21:03         ` Pandruvada, Srinivas
2017-01-05 23:23           ` Alexander Alemayhu
2017-01-05 21:38       ` Alexander Alemayhu
2017-01-05 23:28       ` Raj, Ashok
2017-01-05 23:56         ` Borislav Petkov
2017-01-06  1:26           ` Raj, Ashok
2017-01-06 11:16             ` Borislav Petkov
2017-01-06 15:58               ` Raj, Ashok
2017-01-06 16:54                 ` Borislav Petkov
2017-01-06 17:04                   ` Raj, Ashok
2017-01-09 10:55                   ` Paul Menzel
2017-01-09 11:05                     ` Borislav Petkov
2017-01-09 11:11                       ` Paul Menzel
  -- strict thread matches above, loose matches on Subject: below --
2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
2017-01-24  8:46   ` [tip:ras/core] x86/ras/inject: Make it depend on X86_LOCAL_APIC=y tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event Borislav Petkov
2017-01-24  8:47   ` [tip:ras/core] x86/ras/therm_throt: Do not log a fake MCE for thermal events tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly Borislav Petkov
2017-01-24  8:47   ` [tip:ras/core] x86/ras/amd: " tip-bot for Yazen Ghannam
2017-01-23 18:35 ` [PATCH 4/9] x86/MCE: Flip the TSC-adding logic Borislav Petkov
2017-01-24  8:48   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency Borislav Petkov
2017-01-24  8:48   ` [tip:ras/core] x86/ras/amd/inj: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce() Borislav Petkov
2017-01-24  8:49   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 7/9] EDAC/mce_amd: Dump TSC value Borislav Petkov
2017-01-24  8:50   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 8/9] x86/MCE: Get rid of mce_process_work() Borislav Petkov
2017-01-24  8:50   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority Borislav Petkov
2017-01-24  8:51   ` [tip:ras/core] x86/ras, " tip-bot for Borislav Petkov
2017-01-04 15:42 Dell XPS13: MCE (Hardware Error) reported Paul Menzel
2017-01-04 22:55 ` Borislav Petkov
2017-01-05  1:12   ` Raj, Ashok
2017-01-09 11:53     ` Paul Menzel
2017-01-09 19:23       ` Raj, Ashok
2017-01-27 13:35         ` Paul Menzel
2017-01-27 17:10           ` Borislav Petkov
2017-01-27 17:16             ` Mario.Limonciello
2017-01-31 15:29               ` Paul Menzel
2017-01-31 17:20                 ` Borislav Petkov
2017-01-31 18:50                 ` Austin S. Hemmelgarn
2017-02-01 20:52                 ` Mario.Limonciello
