From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1753138AbdAEBMm (ORCPT <rfc822;w@1wt.eu>);
        Wed, 4 Jan 2017 20:12:42 -0500
Received: from mga14.intel.com ([192.55.52.115]:16422 "EHLO mga14.intel.com"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1752864AbdAEBMi (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 4 Jan 2017 20:12:38 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.33,318,1477983600"; 
   d="scan'208";a="1079303459"
Date: Wed, 4 Jan 2017 17:12:36 -0800
From: "Raj, Ashok" <ashok.raj@intel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Thorsten Leemhuis <linux@leemhuis.info>,
        Len Brown <len.brown@intel.com>, Tony Luck <tony.luck@intel.com>,
        ashok.raj@intel.com
Subject: Re: Dell XPS13: MCE (Hardware Error) reported
Message-ID: <20170105011236.GA80100@otc-brkl-03>
References: <f6d1d38d-ed57-8953-501b-c76a80a2f452@molgen.mpg.de>
 <20170104225546.wy36fu5t2jbow2dq@pd.tnic>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170104225546.wy36fu5t2jbow2dq@pd.tnic>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Boris,

Thanks for forwarding.

> > CPUID Vendor Intel Family 6 Model 142
This is Kaby Lake Mobile.

> > Hardware event. This is not a software error.
> > MCE 1
> > CPU 0 BANK 7
> > MISC 7880018086 ADDR fef1ce40
> > TIME 1483543069 Wed Jan  4 16:17:49 2017
> > MCG status:
> > MCi status:
> > Error overflow
> > Uncorrected error
> > MCi_MISC register valid
> > MCi_ADDR register valid
> > Processor context corrupt
> > MCA: corrected filtering (some unreported errors in same region)
> > Generic CACHE Level-2 Generic Error
> > STATUS ee0000000040110a MCGSTATUS 0

Decoding the bits further from MCi_STATUS above:
VAL=1, OVER=1, UC=1, but EN=0 indicates this isn't an MCE, so it should have
been signaled by a CMCI.

PCC=1, but that should be ignored when EN=0.
MCACOD: 110a MSCOD: 0040
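
For reference, the decode above can be reproduced with a few shifts and
masks. This is only a minimal sketch: the bit positions are the
architectural IA32_MCi_STATUS ones, and the function name and dictionary
layout are made up for illustration.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit numbers from the Intel SDM).
STATUS_FLAGS = {
    "VAL": 63,    # register contents valid
    "OVER": 62,   # error overflow
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flag bits and error codes."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific error code
    fields["MCACOD"] = status & 0xFFFF         # architectural MCA error code
    return fields


if __name__ == "__main__":
    # STATUS value taken from the mcelog report quoted above.
    for name, value in decode_mci_status(0xEE0000000040110A).items():
        print(f"{name:6} {value:#x}")
```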

If the system is stable enough after the report, could you send the output of
/proc/interrupts to confirm that?
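
If it is easier than sending the whole file, something like the sketch
below pulls out only the machine-check related rows. It assumes the usual
x86 row labels (MCE, MCP, THR), and that CMCIs are counted in the THR
(threshold APIC interrupts) row; the helper name is hypothetical.

```python
# Row labels of interest in /proc/interrupts on x86 (assumed, see above).
WANTED = ("MCE", "MCP", "THR")


def machine_check_rows(text):
    """Return {label: [per-CPU counts]} for the machine-check related rows."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or label not in WANTED:
            continue  # header line or an unrelated interrupt row
        counts = []
        for token in rest.split():
            if not token.isdigit():
                break  # reached the trailing description text
            counts.append(int(token))
        rows[label] = counts
    return rows
```

On a live system this would be called as
machine_check_rows(open("/proc/interrupts").read()).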

Although it's reported as an L2 error, some memory errors can also manifest
themselves as cache errors in certain cases.  In this case it looks like
a speculative fetch from bad memory might be the cause.
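
The "L2" part comes from the MCACOD value itself. A rough sketch of how
the compound cache-hierarchy encoding (000F 0001 RRRR TTLL in the SDM)
breaks down is below; the table and function names are only illustrative,
and non-cache error codes are simply not handled.

```python
# Field encodings for cache-hierarchy compound error codes (Intel SDM).
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "INSTRUCTION", 1: "DATA", 2: "GENERIC"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "GENERIC"}


def decode_cache_mcacod(mcacod):
    """Decode a cache-hierarchy MCACOD; return None for other error classes."""
    filtered = bool(mcacod & 0x1000)     # F bit: corrected-error filtering
    code = mcacod & 0xFFFF & ~0x1000     # drop the F bit before matching
    if (code & 0xFF00) != 0x0100:
        return None                      # not the 0000 0001 RRRR TTLL form
    return {
        "filtered": filtered,
        "request": REQUEST.get((code >> 4) & 0xF, "RESERVED"),
        "type": TRANSACTION.get((code >> 2) & 0x3, "RESERVED"),
        "level": LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    # MCACOD from the report above.
    print(decode_cache_mcacod(0x110A))
```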

> > MCGCAP c08 APICID 0 SOCKETID 0

MCG_CAP: c08
This indicates support for CMCI (bit 10, CMCI_P: Corrected Machine Check
Interrupt) and threshold-based error reporting (bit 11, TES_P).
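
The same kind of decode works for MCG_CAP. A small sketch, covering only
the low capability bits discussed here (the function name is hypothetical):

```python
def decode_mcg_cap(cap):
    """Decode the low capability bits of IA32_MCG_CAP."""
    return {
        "bank_count": cap & 0xFF,         # bits 7:0, number of reporting banks
        "MCG_CTL_P": (cap >> 8) & 1,      # IA32_MCG_CTL present
        "MCG_EXT_P": (cap >> 9) & 1,      # extended state registers present
        "MCG_CMCI_P": (cap >> 10) & 1,    # corrected machine check interrupt
        "MCG_TES_P": (cap >> 11) & 1,     # threshold-based error status
    }


if __name__ == "__main__":
    # MCGCAP value from the report above.
    print(decode_mcg_cap(0xC08))
```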


Do you have another machine which doesn't report these errors? If so, try
swapping memory between them to see if the error disappears.

I don't have the model-specific error codes handy. I will check that in the
meantime to get some more decoding as well.

If you haven't already, running some memory tests would also help.

If you replaced the motherboard, did that involve both the CPU and memory,
or just the motherboard swap?

Cheers,
Ashok