From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S934588AbdAEFAp (ORCPT ); Thu, 5 Jan 2017 00:00:45 -0500
Received: from mail-vk0-f52.google.com ([209.85.213.52]:34857 "EHLO
	mail-vk0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932111AbdAEFAn (ORCPT ); Thu, 5 Jan 2017 00:00:43 -0500
MIME-Version: 1.0
X-Originating-IP: [118.189.165.166]
From: Daniel J Blueman
Date: Thu, 5 Jan 2017 13:00:42 +0800
Subject: Re: Dell XPS13: MCE (Hardware Error) reported
To: ashok.raj@intel.com, Borislav Petkov, pmenzel@molgen.mpg.de,
	tony.luck@intel.com, linux@leemhuis.info, len.brown@intel.com,
	Linux Kernel
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
> Hi Boris
>
> thanks for forwarding.
>
> > > CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
> > > Hardware event. This is not a software error.
> > > MCE 1
> > > CPU 0 BANK 7
> > > MISC 7880018086 ADDR fef1ce40
> > > TIME 1483543069 Wed Jan 4 16:17:49 2017
> > > MCG status:
> > > MCi status:
> > > Error overflow
> > > Uncorrected error
> > > MCi_MISC register valid
> > > MCi_ADDR register valid
> > > Processor context corrupt
> > > MCA: corrected filtering (some unreported errors in same region)
> > > Generic CACHE Level-2 Generic Error
> > > STATUS ee0000000040110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.
>
> Although its reported as a L2 error, some memory errors can also manifest
> itself as a cache error in certain cases.
> In this case it looks like some speculative fetch from bad memory might
> be the cause.
>
> > > MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
>
>
> Do you have another machine which doesn't report these errors? if so try
> swapping memory between them to see if the error disappears.
>
> I don't have the model specific error handy.. will check that in the meantime
> to get some decoding as well.
>
> If you haven't already running some memory tests would also help.
>
> If you replaced the motherboard, did that involve both cpu and memory?
> or just the motheboard swap?

I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
physical address is in the non-coherent low MMIO window:

MISC 7880018086 ADDR fef1ce40

Which is declared as device memory:

[    0.000000] PM: Registered nosave memory: [mem 0xfee01000-0xfeffffff]

For core-generated cycles, the address lies between the local APIC space
at FEE00000:FEEFFFFF and the SPI BIOS at FFE00000:FFFFFFFF, so it will be
subtractively decoded to the PCH, and was maybe aborted due to a device
not being enabled (hello TPM3 or new image processor).

As it is logged as soon as the MCE driver initialises, it was probably
logged during BIOS init, so there's not much we can do about it anyways.

Dan
-- 
Daniel J Blueman
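
For anyone wanting to re-check the numbers in this thread, the register values can be decoded with a few lines of Python. This is only a minimal sketch: the bit positions are assumed from the Intel SDM description of IA32_MCi_STATUS and IA32_MCG_CAP, the address ranges are the ones quoted above, and the function names (decode_mci_status, decode_mcg_cap, in_range) are hypothetical helpers, not part of mcelog or the kernel.

```python
# Flag bit positions, assumed from the Intel SDM layout of IA32_MCi_STATUS.
STATUS_FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}


def decode_mci_status(status):
    """Split an MCi_STATUS value into its flag bits and error codes."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["MCACOD"] = status & 0xFFFF          # bits 15:0
    fields["MSCOD"] = (status >> 16) & 0xFFFF   # bits 31:16
    return fields


def decode_mcg_cap(cap):
    """Extract the bank count and the two capability bits discussed above."""
    return {
        "banks": cap & 0xFF,        # bits 7:0
        "CMCI_P": (cap >> 10) & 1,  # corrected machine check interrupt
        "TES_P": (cap >> 11) & 1,   # threshold-based error status
    }


def in_range(addr, start, end):
    """True if addr lies in the inclusive range [start, end]."""
    return start <= addr <= end


if __name__ == "__main__":
    status = decode_mci_status(0xEE0000000040110A)
    print({k: hex(v) if k in ("MCACOD", "MSCOD") else v
           for k, v in status.items()})
    print(decode_mcg_cap(0xC08))

    addr = 0xFEF1CE40
    # Inside the nosave (device memory) region from the boot log.
    print(in_range(addr, 0xFEE01000, 0xFEFFFFFF))
    # Above the local APIC window and below the SPI BIOS window.
    print(0xFEEFFFFF < addr < 0xFFE00000)
```

With the values from the log, EN comes out as 0 while VAL, OVER, UC and PCC are all 1, which matches the decoding quoted above, and the logged address falls outside both the local APIC and SPI BIOS windows.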