From: Daniel J Blueman <daniel@quora.org>
To: "Raj, Ashok" <ashok.raj@intel.com>, Borislav Petkov <bp@suse.de>,
	Paul Menzel <pmenzel@molgen.mpg.de>,
	tony.luck@intel.com, linux@leemhuis.info, len.brown@intel.com,
	Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: Dell XPS13: MCE (Hardware Error) reported
Date: Thu, 5 Jan 2017 22:05:39 +0800
Message-ID: <CAMVG2suPLS315HecNrAu9GB9wSYEr276V0utBbOO0vi9yapBQQ@mail.gmail.com>
In-Reply-To: <CAMVG2stPPdkFUt=6z1tXiQwsY8JysdqPU+XYPNYF72Gs46neqw@mail.gmail.com>

On 5 January 2017 at 13:00, Daniel J Blueman <daniel@quora.org> wrote:
> On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
>> Hi Boris
>>
>> thanks for forwarding.
>>
>> > > CPUID Vendor Intel Family 6 Model 142
>> This is Kabylake Mobile
>>
>> > > Hardware event. This is not a software error.
>> > > MCE 1
>> > > CPU 0 BANK 7
>> > > MISC 7880018086 ADDR fef1ce40
>> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
>> > > MCG status:
>> > > MCi status:
>> > > Error overflow
>> > > Uncorrected error
>> > > MCi_MISC register valid
>> > > MCi_ADDR register valid
>> > > Processor context corrupt
>> > > MCA: corrected filtering (some unreported errors in same region)
>> > > Generic CACHE Level-2 Generic Error
>> > > STATUS ee0000000040110a MCGSTATUS 0
>>
>> Decoding the bits further from MCi_STATUS above:
>> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
>> been signaled by a CMCI.
>>
>> PCC=1, but should be ignored when EN=0.
>> MCACOD: 110a MSCOD: 0040
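
The bit decoding above can be reproduced from the raw STATUS value. This is only a minimal sketch: the bit positions are the IA32_MCi_STATUS layout as I understand it from the Intel SDM, and the names STATUS_FLAGS and decode_mci_status are made up for illustration, not taken from mcelog or the kernel.

```python
# Flag bit positions in IA32_MCi_STATUS (assumed from the Intel SDM layout).
STATUS_FLAGS = {
    "VAL": 63,    # register contents valid
    "OVER": 62,   # error overflow
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its flags and error codes."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF  # model-specific error code
    fields["MCACOD"] = status & 0xFFFF         # architectural error code
    return fields


# STATUS value taken from the report quoted above.
decoded = decode_mci_status(0xEE0000000040110A)
for name, value in decoded.items():
    print(f"{name:6} = {value:#x}")
```

For ee0000000040110a this gives EN=0 with VAL, OVER, UC, MISCV, ADDRV and PCC all set, and MCACOD 110a, MSCOD 0040, matching the values listed above.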
>>
>> If the system is stable enough after the report, can you send the output of
>> /proc/interrupts to confirm that.
>>
>> Although it's reported as an L2 error, some memory errors can also
>> manifest themselves as cache errors in certain cases.  In this case it
>> looks like some speculative fetch from bad memory might be the cause.
>>
>> > > MCGCAP c08 APICID 0 SOCKETID 0
>>
>> MCG_CAP: c08
>> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
>> Threshold based error reporting (bit 11) (TES_P).
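
The same kind of check works for MCGCAP. Again a sketch only: the field positions (bank count in bits 7:0, CMCI_P at bit 10, TES_P at bit 11) are my reading of IA32_MCG_CAP in the Intel SDM, and decode_mcg_cap is a hypothetical helper name.

```python
def decode_mcg_cap(cap):
    """Extract the fields of IA32_MCG_CAP discussed in this thread."""
    return {
        "bank_count": cap & 0xFF,   # number of reporting banks
        "CMCI_P": (cap >> 10) & 1,  # corrected MC error interrupt supported
        "TES_P": (cap >> 11) & 1,   # threshold-based error status supported
    }


# MCGCAP value taken from the report quoted above.
mcg_cap = decode_mcg_cap(0xC08)
for name, value in mcg_cap.items():
    print(f"{name:10} = {value}")
```

With c08 both CMCI_P and TES_P come out set, and the bank count is 8, so the reported BANK 7 is the highest-numbered bank.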
>>
>>
>> Do you have another machine which doesn't report these errors? If so, try
>> swapping memory between them to see if the error disappears.
>>
>> I don't have the model specific error handy.. will check that in the meantime
>> to get some decoding as well.
>>
>> If you haven't already, running some memory tests would also help.
>>
>> If you replaced the motherboard, did that involve both CPU and memory,
>> or just the motherboard swap?
>
> I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
> physical address is in the non-coherent low MMIO window:
> MISC 7880018086 ADDR fef1ce40
>
> Which is declared as device memory:
> [    0.000000] PM: Registered nosave memory: [mem 0xfee01000-0xfeffffff]
>
> For core-generated cycles, it is between the local APIC space at
> FEE00000:FEEFFFFF and SPI BIOS at FFE00000:FFFFFFFF, so will be
> subtractively decoded to the PCH, maybe being aborted due to a device
> not being enabled (hello TPM3 or new image processor).
>
> As it is logged as soon as the MCE driver initialises, it was probably
> logged during BIOS init, so there's not much we can do about it
> anyways.

That said, I have seen this recur after boot. There were no other
kernel messages around 300s uptime, and it hasn't occurred again in the
hours since:

$ dmesg | grep Machine
[    0.039072] mce: [Hardware Error]: Machine check events logged
[  300.069176] mce: [Hardware Error]: Machine check events logged

As I don't see a driver controlling this area of address space, the
access is likely initiated from the UEFI BIOS System Management Mode
handler, and we see the same pair of registers FEF1FF40, FEF1CE40
accessed each time.
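
As a sanity check on the address ranges discussed in this thread, here is a rough sketch that tests both addresses against the windows mentioned. The window bounds come from the messages above (the local APIC end is taken as FEEFFFFF); the names LOCAL_APIC, NOSAVE_REGION, SPI_BIOS, contains and classify are illustrative only.

```python
# Windows named in this thread, as inclusive (start, end) pairs.
LOCAL_APIC = (0xFEE00000, 0xFEEFFFFF)
NOSAVE_REGION = (0xFEE01000, 0xFEFFFFFF)  # "PM: Registered nosave memory"
SPI_BIOS = (0xFFE00000, 0xFFFFFFFF)


def contains(window, addr):
    """Return True if addr lies inside the inclusive window."""
    start, end = window
    return start <= addr <= end


def classify(addr):
    """Report which of the windows above a physical address falls in."""
    return {
        "in_nosave_region": contains(NOSAVE_REGION, addr),
        "in_local_apic": contains(LOCAL_APIC, addr),
        "in_spi_bios": contains(SPI_BIOS, addr),
        "between_apic_and_bios": LOCAL_APIC[1] < addr < SPI_BIOS[0],
    }


# The pair of addresses seen in the logged events.
results = {hex(a): classify(a) for a in (0xFEF1FF40, 0xFEF1CE40)}
for addr, flags in results.items():
    print(addr, flags)
```

Both addresses land inside the nosave region and above the local APIC window, but below the SPI BIOS range, which is consistent with the access not hitting DRAM.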

Dan
-- 
Daniel J Blueman
