* Dell XPS13: MCE (Hardware Error) reported
@ 2017-01-04 15:42 Paul Menzel
  2017-01-04 22:55 ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Paul Menzel @ 2017-01-04 15:42 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Thorsten Leemhuis, Len Brown

Dear Linux folks,


The logs contain the following messages.

From Linux 4.10-rc2+ (0f64df301240 Merge branch 'parisc-4.10-2' of
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux):

> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ce40 MISC 7880018086
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0

I am able to reproduce this also with Linux 4.8.11 from Debian Sid/unstable.

Installing *mcelog* 144+dfsg-1, the file below is created.

```
$ more /var/log/mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 47880018086 ADDR fef1ff40
TIME 1483543069 Wed Jan  4 16:17:49 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880018086 ADDR fef1ce40
TIME 1483543069 Wed Jan  4 16:17:49 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 47880018086 ADDR fef1ff40
TIME 1483543581 Wed Jan  4 16:26:21 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880018086 ADDR fef1ce40
TIME 1483543581 Wed Jan  4 16:26:21 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
```

It looks like it’s a common problem on this machine [1].

> First, I fear that I cannot really give good answers to your questions. I also own a Dell XPS 13 (9360) and see the same MCE messages. I'm in contact with Dell Support because of these. They replaced the mainboard but it did not help. Same messages in the logs. At some point they concluded that it is probably a false positive. They had no idea what is causing it, though (mcelog/kernel/Intel problem?). The correspondence with Support is still ongoing.
>
> <rant> Btw, talking to Dell Support is a very unpleasant experience. They seem to only suggest the "standard" solutions like resetting the Firmware, running self-health tests and so on. I didn't have the impression that I was talking to someone with technical insight. </rant>
>
> To add more details, I see the same issue on Fedora 24 so it seems not to be related to Ubuntu.
>
> Regarding your questions:
>
>     What do these errors mean and should I worry about them?
>
> I don't know. Dell Support thinks those are false positives.
>
>     Could these hardware errors be the cause of the freezes of the entire system?
>
> Besides the messages my system works fine. I'd guess the freeze is a different issue.
>
>     Should I have the laptop (or parts) replaced by the manufacturer?
>
> Replacing the mainboard did not fix the MCE issue. It might solve the freezing issue, although it seems that this was fixed by a kernel update.
>
>     Are there any other actions I should take?
>
> If you are not already in contact with Support, contact them. Maybe they will come up with a real solution once they see that it affects more customers.

Could you please tell me if and where I should open an issue in the
Linux bug tracker [2]?

Any ideas are welcome.


Kind regards,

Paul


[1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283
[2] https://bugzilla.kernel.org/


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-04 15:42 Dell XPS13: MCE (Hardware Error) reported Paul Menzel
@ 2017-01-04 22:55 ` Borislav Petkov
  2017-01-05  1:12   ` Raj, Ashok
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2017-01-04 22:55 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Linux Kernel Mailing List, Thorsten Leemhuis, Len Brown,
	Tony Luck, Raj, Ashok

Lemme add some more folks to CC.

On Wed, Jan 04, 2017 at 04:42:18PM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> The logs contain the following messages.
> 
> From Linux 4.10-rc2+ (0f64df301240 Merge branch 'parisc-4.10-2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux):
> 
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ce40 MISC 7880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> 
> I am able to reproduce this also with Linux 4.8.11 from Debian Sid/unstable.
> 
> Installing *mcelog* 144+dfsg-1, the file below is created.
> 
> ```
> $ more /var/log/mcelog
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543069 Wed Jan  4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543069 Wed Jan  4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543581 Wed Jan  4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543581 Wed Jan  4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> ```
> 
> It looks like it’s a common problem on this machine [1].
> 
> > First, I fear that I cannot really give good answers to your questions. I also own a Dell XPS 13 (9360) and see the same MCE messages. I'm in contact with Dell Support because of these. They replaced the mainboard but it did not help. Same messages in the logs. At some point they concluded that it is probably a false positive. They had no idea what is causing it, though (mcelog/kernel/Intel problem?). The correspondence with Support is still ongoing.
> > 
> > <rant> Btw, talking to Dell Support is a very unpleasant experience. They seem to only suggest the "standard" solutions like resetting the Firmware, running self-health tests and so on. I didn't have the impression that I was talking to someone with technical insight. </rant>
> > 
> > To add more details, I see the same issue on Fedora 24 so it seems not to be related to Ubuntu.
> > 
> > Regarding your questions:
> > 
> >     What do these errors mean and should I worry about them?
> > 
> > I don't know. Dell Support thinks those are false positives.
> > 
> >     Could these hardware errors be the cause of the freezes of the entire system?
> > 
> > Besides the messages my system works fine. I'd guess the freeze is a different issue.
> > 
> >     Should I have the laptop (or parts) replaced by the manufacturer?
> > 
> > Replacing the mainboard did not fix the MCE issue. It might solve the freezing issue, although it seems that this was fixed by a kernel update.
> > 
> >     Are there any other actions I should take?
> > 
> > If you are not already in contact with Support, contact them. Maybe they will come up with a real solution once they see that it affects more customers.
> 
> Could you please tell me if and where I should open an issue in the Linux
> bug tracker [2]?
> 
> Any ideas are welcome.
> 
> 
> Kind regards,
> 
> Paul
> 
> 
> [1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283
> [2] https://bugzilla.kernel.org/
> 

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-04 22:55 ` Borislav Petkov
@ 2017-01-05  1:12   ` Raj, Ashok
  2017-01-09 11:53     ` Paul Menzel
  0 siblings, 1 reply; 30+ messages in thread
From: Raj, Ashok @ 2017-01-05  1:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Paul Menzel, Linux Kernel Mailing List, Thorsten Leemhuis,
	Len Brown, Tony Luck, ashok.raj

Hi Boris

Thanks for forwarding.

> > CPUID Vendor Intel Family 6 Model 142
This is Kabylake Mobile

> > Hardware event. This is not a software error.
> > MCE 1
> > CPU 0 BANK 7
> > MISC 7880018086 ADDR fef1ce40
> > TIME 1483543069 Wed Jan  4 16:17:49 2017
> > MCG status:
> > MCi status:
> > Error overflow
> > Uncorrected error
> > MCi_MISC register valid
> > MCi_ADDR register valid
> > Processor context corrupt
> > MCA: corrected filtering (some unreported errors in same region)
> > Generic CACHE Level-2 Generic Error
> > STATUS ee0000000040110a MCGSTATUS 0

Decoding the bits further from MCi_STATUS above:
Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
been signaled by a CMCI.

PCC=1, but should be ignored when EN=0. 
MCACOD: 110a MSCOD: 0040
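
The decoding above can be reproduced with a short Python sketch. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL in bit 63 down to PCC in bit 57, MSCOD in bits 31:16, MCACOD in bits 15:0) and the cache hierarchy error encoding `000F 0001 RRRR TTLL`. The function and table names are made up for illustration and are not taken from mcelog or the kernel.

```python
# Architectural flag bits of IA32_MCi_STATUS (Intel SDM layout).
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}

# Sub-fields of a cache hierarchy error code (000F 0001 RRRR TTLL).
TT = {0: "Instruction", 1: "Data", 2: "Generic"}
LL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags, MSCOD and MCACOD."""
    fields = {name: (status >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF   # model-specific error code
    fields["MCACOD"] = status & 0xFFFF          # architectural error code
    return fields


def decode_cache_mcacod(mcacod):
    """Decode a cache hierarchy MCACOD; return None for other error classes."""
    if (mcacod & 0xEF00) != 0x0100:
        return None
    return {
        "filtering": (mcacod >> 12) & 1,   # F bit: corrected filtering
        "request": (mcacod >> 4) & 0xF,    # RRRR, left numeric here
        "type": TT.get((mcacod >> 2) & 0x3, "Reserved"),
        "level": LL[mcacod & 0x3],
    }


# STATUS value taken from the mcelog output quoted in this thread.
status = decode_mci_status(0xEE0000000040110A)
cache = decode_cache_mcacod(status["MCACOD"])
print(status)
print(cache)
```

For the STATUS value from the log, this yields EN=0 with VAL, OVER, UC and PCC set, and a Level-2 generic cache error with the filtering bit set, which agrees with the mcelog text.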

If the system is stable enough after the report, can you send the output of 
/proc/interrupts to confirm that. 

Although it's reported as an L2 error, some memory errors can also manifest
themselves as cache errors in certain cases.  In this case it looks like
some speculative fetch from bad memory might be the cause.

> > MCGCAP c08 APICID 0 SOCKETID 0

MCG_CAP: c08
Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
Threshold based error reporting (bit 11) (TES_P). 
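
As a rough sketch, the MCG_CAP value can be split the same way. The bit positions below follow the Intel SDM layout of IA32_MCG_CAP (bank count in bits 7:0, CMCI_P in bit 10, TES_P in bit 11); the function name is only illustrative.

```python
def decode_mcg_cap(cap):
    """Extract the bank count and the CMCI_P/TES_P capability bits."""
    return {
        "banks": cap & 0xFF,          # number of reporting banks
        "CMCI_P": (cap >> 10) & 1,    # corrected machine check interrupt
        "TES_P": (cap >> 11) & 1,     # threshold-based error status
    }


# MCGCAP value taken from the mcelog output quoted in this thread.
mcg_cap = decode_mcg_cap(0xC08)
print(mcg_cap)
```

A bank count of 8 means banks 0 to 7 exist, which fits the reports coming from banks 6 and 7.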


Do you have another machine which doesn't report these errors? If so, try
swapping memory between them to see if the error disappears.

I don't have the model-specific error details handy. I will check that in the
meantime to get some decoding as well.

If you haven't already, running some memory tests would also help.

If you replaced the motherboard, did that involve both CPU and memory,
or just the motherboard swap?

Cheers,
Ashok


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05  1:12   ` Raj, Ashok
@ 2017-01-09 11:53     ` Paul Menzel
  2017-01-09 19:23       ` Raj, Ashok
  0 siblings, 1 reply; 30+ messages in thread
From: Paul Menzel @ 2017-01-09 11:53 UTC (permalink / raw)
  To: Ashok Raj, Borislav Petkov
  Cc: Linux Kernel Mailing List, Thorsten Leemhuis, Len Brown, Tony Luck

Dear Ashok, dear Borislav,


On 01/05/17 02:12, Raj, Ashok wrote:

>>> CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
>>> Hardware event. This is not a software error.
>>> MCE 1
>>> CPU 0 BANK 7
>>> MISC 7880018086 ADDR fef1ce40
>>> TIME 1483543069 Wed Jan  4 16:17:49 2017
>>> MCG status:
>>> MCi status:
>>> Error overflow
>>> Uncorrected error
>>> MCi_MISC register valid
>>> MCi_ADDR register valid
>>> Processor context corrupt
>>> MCA: corrected filtering (some unreported errors in same region)
>>> Generic CACHE Level-2 Generic Error
>>> STATUS ee0000000040110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.

To be clear, other than the message, the system is stable for me.

Here is `/proc/interrupts`.

```
$ more /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
   0:         27          0          0          0  IR-IO-APIC    2-edge      timer
   1:          3          2        125          5  IR-IO-APIC    1-edge      i8042
   8:          0          1          0          0  IR-IO-APIC    8-edge      rtc0
   9:        108         31        397          5  IR-IO-APIC    9-fasteoi   acpi
  12:         66         18         92         35  IR-IO-APIC   12-edge      i8042
  14:          0          0          0          0  IR-IO-APIC   14-fasteoi   INT344B:00
  16:          0          0          0          0  IR-IO-APIC   16-fasteoi   idma64.0, i801_smbus, i2c_designware.0
  17:        419         42        280        415  IR-IO-APIC   17-fasteoi   idma64.1, i2c_designware.1
  51:          2          0          0          1  IR-IO-APIC   51-fasteoi   DLL075B:01
 120:          0          0          0          0  DMAR-MSI    0-edge      dmar0
 121:          0          0          0          0  DMAR-MSI    1-edge      dmar1
 274:         17          2          0          4  IR-PCI-MSI 30932992-edge      rtsx_pci
 275:         89         26         57         45  IR-PCI-MSI 327680-edge      xhci_hcd
 276:       1886          0       2361          0  IR-PCI-MSI 31457280-edge      nvme0q0, nvme0q1
 277:          0       3010       2570          0  IR-PCI-MSI 31457281-edge      nvme0q2
 278:          0          0       2023       3480  IR-PCI-MSI 31457282-edge      nvme0q3
 279:          0       3319          0       5863  IR-PCI-MSI 31457283-edge      nvme0q4
 280:         45          0          0          0  IR-PCI-MSI 360448-edge      mei_me
 281:        201         52       3008         85  IR-PCI-MSI 32768-edge      i915
 282:        151         29        997      24821  IR-PCI-MSI 30408704-edge      ath10k_pci
 283:        331        938        677        188  IR-PCI-MSI 514048-edge      snd_hda_intel:card0
 NMI:          1          0          0          0   Non-maskable interrupts
 LOC:      15198      21708      16850      31954   Local timer interrupts
 SPU:          0          0          0          0   Spurious interrupts
 PMI:          1          0          0          0   Performance monitoring interrupts
 IWI:          3          0          0          0   IRQ work interrupts
 RTR:          0          0          0          0   APIC ICR read retries
 RES:       1329       1974       1532       1959   Rescheduling interrupts
 CAL:       2254       3827       1969       3963   Function call interrupts
 TLB:        396       2349        342       2193   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:          9          9          9          9   Machine check polls
 ERR:         17
 MIS:          0
 PIN:          0          0          0          0   Posted-interrupt notification event
 PIW:          0          0          0          0   Posted-interrupt wakeup event
```
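
For anyone wanting to check only the machine-check related counters, a minimal Python sketch that pulls selected rows out of `/proc/interrupts` text could look like this. The function name and the embedded sample (three rows copied from the output above) are illustrative assumptions; on a live system the text would be read from `/proc/interrupts` instead.

```python
def per_cpu_counts(interrupts_text, labels=("THR", "MCE", "MCP")):
    """Return {label: [count per CPU]} for the selected /proc/interrupts rows."""
    lines = interrupts_text.strip().splitlines()
    ncpus = len(lines[0].split())          # header row lists CPU0..CPUn
    counts = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if label in labels:
            # The first ncpus fields are counters; the rest is the description.
            counts[label] = [int(v) for v in rest.split()[:ncpus]]
    return counts


# Rows copied from the /proc/interrupts output in this message.
SAMPLE = """\
            CPU0       CPU1       CPU2       CPU3
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:          9          9          9          9   Machine check polls
"""

mc_counts = per_cpu_counts(SAMPLE)
print(mc_counts)
```

With the values above, the MCE and THR rows are all zero and only the MCP (machine check poll) row is non-zero.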

> Although it's reported as an L2 error, some memory errors can also manifest
> themselves as cache errors in certain cases.  In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
>>> MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
>
>
> Do you have another machine which doesn't report these errors? If so, try
> swapping memory between them to see if the error disappears.

No, I don’t. And everybody I talked to with a Dell XPS13 (9360) seems to 
have these errors.

> I don't have the model-specific error details handy. I will check that in the
> meantime to get some decoding as well.
>
> If you haven't already, running some memory tests would also help.

I need some time for that.

> If you replaced the motherboard, did that involve both CPU and memory,
> or just the motherboard swap?

Sorry, I don’t know, as I am not the person from StackExchange [1].


Kind regards,

Paul


[1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 11:53     ` Paul Menzel
@ 2017-01-09 19:23       ` Raj, Ashok
  2017-01-27 13:35         ` Paul Menzel
  0 siblings, 1 reply; 30+ messages in thread
From: Raj, Ashok @ 2017-01-09 19:23 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Borislav Petkov, Linux Kernel Mailing List, Thorsten Leemhuis,
	Len Brown, Tony Luck, ashok.raj

Hi Paul

On Mon, Jan 09, 2017 at 12:53:33PM +0100, Paul Menzel wrote:
> 
> 
> On 01/05/17 02:12, Raj, Ashok wrote:
> 
> >>>CPUID Vendor Intel Family 6 Model 142
> >This is Kabylake Mobile
> >
> >>>Hardware event. This is not a software error.
> >>>MCE 1
> >>>CPU 0 BANK 7
> >>>MISC 7880018086 ADDR fef1ce40
> >>>TIME 1483543069 Wed Jan  4 16:17:49 2017

> >>>STATUS ee0000000040110a MCGSTATUS 0
> >
> >Decoding the bits further from MCi_STATUS above:
> >Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> >been signaled by a CMCI.
> >
> >PCC=1, but should be ignored when EN=0.
> >MCACOD: 110a MSCOD: 0040

This MSCOD indicates that it's a write-back access to MMIO space. It's possible
that the BIOS is scanning certain memory regions during boot, during which time
the BIOS disables generation of MCEs, which is why EN=0 in the above log.

It's a BIOS bug; one would expect the BIOS to clear these before handoff to the
OS. During OS boot we also scan all MC banks and log/clear them.

If you aren't observing them during normal operation you can safely ignore
these preboot logs, or pass them along to your OEM.

Cheers,
Ashok


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 19:23       ` Raj, Ashok
@ 2017-01-27 13:35         ` Paul Menzel
  2017-01-27 17:10           ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Paul Menzel @ 2017-01-27 13:35 UTC (permalink / raw)
  To: Ashok Raj
  Cc: Borislav Petkov, Linux Kernel Mailing List, Thorsten Leemhuis,
	Len Brown, Tony Luck, Mario Limonciello, Thorsten Leemhuis

Dear Ashok,


On 01/09/17 20:23, Raj, Ashok wrote:

> On Mon, Jan 09, 2017 at 12:53:33PM +0100, Paul Menzel wrote:
>
>> On 01/05/17 02:12, Raj, Ashok wrote:
>>
>>>>> CPUID Vendor Intel Family 6 Model 142
>>> This is Kabylake Mobile
>>>
>>>>> Hardware event. This is not a software error.
>>>>> MCE 1
>>>>> CPU 0 BANK 7
>>>>> MISC 7880018086 ADDR fef1ce40
>>>>> TIME 1483543069 Wed Jan  4 16:17:49 2017
>
>>>>> STATUS ee0000000040110a MCGSTATUS 0
>>>
>>> Decoding the bits further from MCi_STATUS above:
>>> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
>>> been signaled by a CMCI.
>>>
>>> PCC=1, but should be ignored when EN=0.
>>> MCACOD: 110a MSCOD: 0040
>
> This MSCOD indicates that it's a write-back access to MMIO space. It's possible
> that the BIOS is scanning certain memory regions during boot, during which time
> the BIOS disables generation of MCEs, which is why EN=0 in the above log.
>
> It's a BIOS bug; one would expect the BIOS to clear these before handoff to the
> OS. During OS boot we also scan all MC banks and log/clear them.
>
> If you aren't observing them during normal operation you can safely ignore
> these preboot logs, or pass them along to your OEM.

Thank you very much for your help. After wasting my time with the Dell 
support over Twitter [1], where they basically also make you jump 
through hoops, and then claim it’s an mcelog issue – as they apparently 
only execute `sudo mcelog` –, I updated to the latest firmware 1.3.2 
released yesterday [2].

With that new firmware version, it looks like the firmware has been
fixed and Linux does not report any MCEs.
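
A quick way to check a kernel log for these reports is to scan it for the `mce: [Hardware Error]` lines. The Python sketch below assumes the line format shown in the first message of this thread; the function name is hypothetical and the sample lines are copied from that log.

```python
import re

# Matches e.g. "... mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee00..."
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU \d+: Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)"
)


def find_mce_reports(log_text):
    """Return a list of (bank, status) tuples found in kernel log text."""
    return [(int(m.group(1)), int(m.group(2), 16))
            for m in MCE_LINE.finditer(log_text)]


# Lines copied from the log in the first message of this thread.
SAMPLE_LOG = """\
Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
"""

reports = find_mce_reports(SAMPLE_LOG)
print(reports)
```

Fed with the output of `dmesg` or `journalctl -k` from a boot on the new firmware, an empty list would suggest that no such reports were logged.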

It’d be great if other Dell XPS13 9360 users could verify that.


Kind regards,

Paul


[1] https://twitter.com/pmenzel_molgen/status/818808708692115456
[2] XPS_9360_1.3.2.exe


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-27 13:35         ` Paul Menzel
@ 2017-01-27 17:10           ` Borislav Petkov
  2017-01-27 17:16             ` Mario.Limonciello
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2017-01-27 17:10 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Ashok Raj, Linux Kernel Mailing List, Thorsten Leemhuis,
	Len Brown, Tony Luck, Mario Limonciello

On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote:
> Thank you very much for your help. After wasting my time with the Dell
> support over Twitter [1], where they basically also make you jump through
> hoops,

Fun read that twitter page. Especially the bit about not supporting
Linux but you did ship this laptop with ubuntu. LOOL.

FWIW, this is not the first time I've heard vendors trying to get away
from dealing with bugs by playing the "we-dont-support-Linux" card.

Good to know when I go looking for a new laptop in the future.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* RE: Dell XPS13: MCE (Hardware Error) reported
  2017-01-27 17:10           ` Borislav Petkov
@ 2017-01-27 17:16             ` Mario.Limonciello
  2017-01-31 15:29               ` Paul Menzel
  0 siblings, 1 reply; 30+ messages in thread
From: Mario.Limonciello @ 2017-01-27 17:16 UTC (permalink / raw)
  To: bp, pmenzel; +Cc: ashok.raj, linux-kernel, linux, len.brown, tony.luck

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Friday, January 27, 2017 11:11 AM
> To: Paul Menzel <pmenzel@molgen.mpg.de>
> Cc: Ashok Raj <ashok.raj@intel.com>; Linux Kernel Mailing List <linux-
> kernel@vger.kernel.org>; Thorsten Leemhuis <linux@leemhuis.info>; Len
> Brown <len.brown@intel.com>; Tony Luck <tony.luck@intel.com>;
> Limonciello, Mario <Mario_Limonciello@Dell.com>
> Subject: Re: Dell XPS13: MCE (Hardware Error) reported
> 
> On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote:
> > Thank you very much for your help. After wasting my time with the Dell
> > support over Twitter [1], where they basically also make you jump
> > through hoops,
> 
> Fun read that twitter page. Especially the bit about not supporting Linux but
> you did ship this laptop with ubuntu. LOOL.
> 
> FWIW, this is not the first time I've heard vendors trying to get away from
> dealing with bugs by playing the "we-dont-support-Linux" card.
> 
> Good to know when I go looking for a new laptop in the future.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.

Sorry you had that experience.  I'll pass that on to get that messaging
corrected.  The team that does Linux support doesn't use twitter, they
do phone and email support only.

I'm glad you got things sorted moving to the next FW version. 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-27 17:16             ` Mario.Limonciello
@ 2017-01-31 15:29               ` Paul Menzel
  2017-01-31 17:20                 ` Borislav Petkov
                                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Paul Menzel @ 2017-01-31 15:29 UTC (permalink / raw)
  To: Mario Limonciello, Borislav Petkov
  Cc: Ashok Raj, linux-kernel, Thorsten Leemhuis, Len Brown, Tony Luck

Dear Borislav, dear Mario,


On 01/27/17 18:16, Mario.Limonciello@dell.com wrote:
>> -----Original Message-----
>> From: Borislav Petkov [mailto:bp@alien8.de]
>> Sent: Friday, January 27, 2017 11:11 AM
>> To: Paul Menzel <pmenzel@molgen.mpg.de>
>> Cc: Ashok Raj <ashok.raj@intel.com>; Linux Kernel Mailing List <linux-
>> kernel@vger.kernel.org>; Thorsten Leemhuis <linux@leemhuis.info>; Len
>> Brown <len.brown@intel.com>; Tony Luck <tony.luck@intel.com>;
>> Limonciello, Mario <Mario_Limonciello@Dell.com>
>> Subject: Re: Dell XPS13: MCE (Hardware Error) reported
>>
>> On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote:
>>> Thank you very much for your help. After wasting my time with the Dell
>>> support over Twitter [1], where they basically also make you jump
>>> through hoops,
>>
>> Fun read that twitter page. Especially the bit about not supporting Linux but
>> you did ship this laptop with ubuntu. LOOL.
>>
>> FWIW, this is not the first time I've heard vendors trying to get away from
>> dealing with bugs by playing the "we-dont-support-Linux" card.

I think in this case it was an honest mistake from the support person,
which they corrected in a follow-up post.

But the next conversation was a little disappointing, as the problem had
already been diagnosed on the LKML.

As Mario replied, the Twitter people do not seem to know the free
software world well enough to realize that “normal” people actually are
in contact with the developers.

What surprises me, though, is that the changelog for the firmware seems
to be incomplete even for the Dell staff.

>> Good to know when I go looking for a new laptop in the future.

Unfortunately, there are not a lot of options. If you know a good 
vendor, please share.

Dell is one of the few “big” vendors selling laptops with a GNU/Linux
operating system installed.

There are Google Chromebooks, which even have coreboot. Google’s
Chromium team does a great job and has great quality assurance. On
par with, or even better than, Apple, I’d say.

Some people need more powerful devices though.

In Germany, I know of TUXEDO [1]. Also, Purism devices have GNU/Linux 
installed [2].

TUXEDO does not build the devices themselves, and gets template
models(?) from Chinese manufacturers. They only seem to be starting to
hire the expertise to do Linux kernel and OS work themselves [3].
Currently, it’s mostly the stock Ubuntu for desktop with their artwork
and some small optimizations.

> Sorry you had that experience.  I'll pass that on to get that messaging
> corrected.  The team that does Linux support doesn't use twitter, they
> do phone and email support only.

Thank you, that’s good to know.

> I'm glad you got things sorted moving to the next FW version.

Me too.

Mario, there are at least two more firmware bugs [4][5][6]. Having fast 
suspend and resume says something about the quality of a device. If the 
Dell XPS13 9360 is supposed to compete with Apple devices, and Google 
Chromebooks, then this should be improved in my opinion. Do you have a 
suggestion on how to best get this solved? Do you want me to contact the 
support, or are there Dell employees with access to the Linux kernel 
Bugzilla bug tracker?


Kind regards,

Paul


[1] https://www.tuxedocomputers.com/
[2] https://puri.sm/
[3] https://www.tuxedocomputers.com/Jobs-Karriere-Stellenausschreibungen.geek?x6a1ec=o8323ih26224vook9uu997r1a5
[4] https://github.molgen.mpg.de/mariux64/dell_xps13/
[5] https://bugzilla.kernel.org/show_bug.cgi?id=185611
[6] https://bugzilla.kernel.org/show_bug.cgi?id=185621


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-31 15:29               ` Paul Menzel
@ 2017-01-31 17:20                 ` Borislav Petkov
  2017-01-31 18:50                 ` Austin S. Hemmelgarn
  2017-02-01 20:52                 ` Mario.Limonciello
  2 siblings, 0 replies; 30+ messages in thread
From: Borislav Petkov @ 2017-01-31 17:20 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Mario Limonciello, Ashok Raj, linux-kernel, Thorsten Leemhuis,
	Len Brown, Tony Luck

On Tue, Jan 31, 2017 at 04:29:51PM +0100, Paul Menzel wrote:
> Unfortunately, there are not a lot of options. If you know a good
> vendor, please share.

What I'd do is go to the store with a live CD and boot it on the
machine. A fairly recent allmodconfig kernel should be able to
tell you what works and what doesn't.

If there's no store, you can still return it in the first 14 days
according to EU law.

> Dell is one of the few “big” vendors selling laptops with a GNU/Linux
> operating system installed.
> 
> There are Google Chromebooks, which even have coreboot.

Yes, I dream of the day when we get rid of that vendor firmware and we
all use coreboot. That'll be a good day.

> team does a great job and has great quality assurance. On par with, or even
> better than, Apple, I’d say.
> 
> Some people need more powerful devices though.
> 
> In Germany, I know of TUXEDO [1]. Also, Purism devices have GNU/Linux
> installed [2].
> 
> TUXEDO does not build the devices themselves, and gets template models(?)
> from Chinese manufacturers.

Looks cool. Especially if I don't want to pay for that windoze license
being shoved down my throat.

I'd probably give it a try next time I need new hw. Thanks for the tip!

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-31 15:29               ` Paul Menzel
  2017-01-31 17:20                 ` Borislav Petkov
@ 2017-01-31 18:50                 ` Austin S. Hemmelgarn
  2017-02-01 20:52                 ` Mario.Limonciello
  2 siblings, 0 replies; 30+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-31 18:50 UTC (permalink / raw)
  To: Paul Menzel, Mario Limonciello, Borislav Petkov
  Cc: Ashok Raj, linux-kernel, Thorsten Leemhuis, Len Brown, Tony Luck

On 2017-01-31 10:29, Paul Menzel wrote:
> Dear Borislav, dear Mario,
>
>
> On 01/27/17 18:16, Mario.Limonciello@dell.com wrote:
>>> -----Original Message-----
>>> From: Borislav Petkov [mailto:bp@alien8.de]
>>> Sent: Friday, January 27, 2017 11:11 AM
>>> To: Paul Menzel <pmenzel@molgen.mpg.de>
>>> Cc: Ashok Raj <ashok.raj@intel.com>; Linux Kernel Mailing List <linux-
>>> kernel@vger.kernel.org>; Thorsten Leemhuis <linux@leemhuis.info>; Len
>>> Brown <len.brown@intel.com>; Tony Luck <tony.luck@intel.com>;
>>> Limonciello, Mario <Mario_Limonciello@Dell.com>
>>> Subject: Re: Dell XPS13: MCE (Hardware Error) reported
>>>
>>> On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote:
>>>> Thank you very much for your help. After wasting my time with the Dell
>>>> support over Twitter [1], where they basically also make you jump
>>>> through hoops,
>>>
>>> Fun read that twitter page. Especially the bit about not supporting
>>> Linux
>>> but you did ship this laptop with Ubuntu. LOOL.
>>>
>>> FWIW, this is not the first time I've heard vendors trying to get
>>> away from
>>> dealing with bugs by playing the "we-dont-support-Linux" card.
>
> I think, in this case it was an honest mistake from the support person,
> which they corrected in a follow-up post.
>
> But the next conversation was a little disappointing as the problem was
> already diagnosed on the LKML.
>
> As Mario replied, the Twitter people do not seem to know the free
> software world well enough to realize that “normal” people actually are
> in contact with the developers.
>
> What surprises me though is, that the change-log for the firmware seems
> to be incomplete even for the Dell staff.
>
>>> Good to know when I go looking for a new laptop in the future.
>
> Unfortunately, there are not a lot of options. If you know a good
> vendor, please share.
>
> Dell is one of the few “big” vendors selling laptops with a GNU/Linux
> operating system installed.
>
> There are Google Chromebooks, which even have coreboot. Google’s
> Chromium team does a great job and have a great quality assurance. On
> par, or even better than Apple, I’d say.
>
> Some people need more powerful devices though.
>
> In Germany, I know of TUXEDO [1]. Also, Purism devices have GNU/Linux
> installed [2].
>
> TUXEDO does not build the devices themselves, and gets template
> models(?) from Chinese manufacturers. They only now seem to be hiring the
> expertise to do Linux kernel and OS work themselves [3]. Currently, it’s
> mostly the stock Ubuntu for desktop with their artwork and some small
> optimizations.
When it comes to laptops, I've recently become a fan of System76 [1]. 
Linux runs pretty much flawlessly on all the systems I've ever 
encountered from them without needing any special configuration, and 
they offer some pretty high-end systems.  Their Oryx Pro laptop is
particularly impressive: it's got hardware on par with a desktop bearing
a similar price tag, which is rare even for systems that ship with
Windows pre-installed.  They're based in the US, but I'm pretty certain
they ship worldwide, and like most of the better manufacturers these 
days, the systems are made-to-order with quite a lot in the way of 
options (you can even get an NVMe boot drive in their higher-end systems 
if you're willing to shell out more than a third of the base cost for 
the system on it).

They also do desktops and servers, but it's generally cheaper in my 
experience for anything but a laptop to just build it yourself.
>
>> Sorry you had that experience.  I'll pass that on to get that messaging
>> corrected.  The team that does Linux support doesn't use twitter, they
>> do phone and email support only.
>
> Thank you, that’s good to know.
>
>> I'm glad you got things sorted moving to the next FW version.
>
> Me too.
>
> Mario, there are at least two more firmware bugs [4][5][6]. Having fast
> suspend and resume says something about the quality of a device. If the
> Dell XPS13 9360 is supposed to compete with Apple devices, and Google
> Chromebooks, then this should be improved in my opinion. Do you have a
> suggestion on how to best get this solved? Do you want me to contact the
> support, or are there Dell employees with access to the Linux kernel
> Bugzilla bug tracker?
>
>
> Kind regards,
>
> Paul
>
>
> [1] https://www.tuxedocomputers.com/
> [2] https://puri.sm/
> [3]
> https://www.tuxedocomputers.com/Jobs-Karriere-Stellenausschreibungen.geek?x6a1ec=o8323ih26224vook9uu997r1a5
>
> [4] https://github.molgen.mpg.de/mariux64/dell_xps13/
> [5] https://bugzilla.kernel.org/show_bug.cgi?id=185611
> [6] https://bugzilla.kernel.org/show_bug.cgi?id=185621
[1] https://system76.com/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: Dell XPS13: MCE (Hardware Error) reported
  2017-01-31 15:29               ` Paul Menzel
  2017-01-31 17:20                 ` Borislav Petkov
  2017-01-31 18:50                 ` Austin S. Hemmelgarn
@ 2017-02-01 20:52                 ` Mario.Limonciello
  2 siblings, 0 replies; 30+ messages in thread
From: Mario.Limonciello @ 2017-02-01 20:52 UTC (permalink / raw)
  To: pmenzel, bp; +Cc: ashok.raj, linux-kernel, linux, len.brown, tony.luck

Hi Paul,

> Mario, there are at least two more firmware bugs [4][5][6]. Having fast
> suspend and resume says something about the quality of a device. If the Dell
> XPS13 9360 is supposed to compete with Apple devices, and Google
> Chromebooks, then this should be improved in my opinion. Do you have a
> suggestion on how to best get this solved? Do you want me to contact the
> support, or are there Dell employees with access to the Linux kernel Bugzilla
> bug tracker?
> 
> 

Thanks for bringing these to my attention.  I wasn't aware, and neither was the rest
of the team handling these machines.  I've started some internal discussion around
the slow ASL resume one.

The WoWLAN one, I'm not sure there will be much for us to do on the BIOS FW end.
As Kalle was mentioning, this comes down to the device FW needing to spin up.  The
easiest way to keep it quick is to leave the radio active, which is what WoWLAN will
accomplish.

The other option is going to be to move over to suspend to idle.  This will leave most
devices operational (in RTD3), freeze kernel threads and let the CPU go into a low
enough power state.  Some of the plumbing that will allow supporting this has recently
gone in, but there are some problems to work through, so machines that support it
cannot make it the default until at least the next kernel version.
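
As a rough way to see which suspend variant a machine currently uses, the
sketch below parses the contents of /sys/power/mem_sleep, where the active
mode is shown in square brackets (for example "s2idle [deep]"). It assumes a
kernel recent enough to expose that file (4.10 or later, as far as I know);
the helper names active_sleep_mode() and read_active_sleep_mode() are made up
for illustration and are not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: copy the bracketed (active) entry of a
 * /sys/power/mem_sleep line such as "s2idle [deep]" into out.
 * Returns 0 on success, -1 if there is no bracketed entry or it
 * does not fit into out.
 */
int active_sleep_mode(const char *line, char *out, size_t outlen)
{
	const char *open, *close;
	size_t len;

	assert(line != NULL && out != NULL);

	open = strchr(line, '[');
	close = open ? strchr(open, ']') : NULL;
	if (!open || !close)
		return -1;

	len = (size_t)(close - open - 1);
	if (len + 1 > outlen)
		return -1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/* Read the real sysfs file; returns -1 if it is missing or unreadable. */
int read_active_sleep_mode(char *out, size_t outlen)
{
	char line[128];
	FILE *f = fopen("/sys/power/mem_sleep", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = active_sleep_mode(line, out, outlen);
	fclose(f);
	return ret;
}
```

If read_active_sleep_mode() reports "deep", the machine suspends to RAM (S3);
"s2idle" would mean suspend to idle is the selected variant.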

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 11:05                     ` Borislav Petkov
@ 2017-01-09 11:11                       ` Paul Menzel
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Menzel @ 2017-01-09 11:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ashok Raj, Alexander Alemayhu, Daniel J Blueman, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas,
	Daniel Blueman

Dear Boris,


On 01/09/17 12:05, Borislav Petkov wrote:
> On Mon, Jan 09, 2017 at 11:55:41AM +0100, Paul Menzel wrote:

[…]

>> Also, is that just for MacBookPro11,3? The MCE for the Dell XPS13 looks
>> different from what I see, doesn’t it?
>
> Yes, yours is different. I'm still waiting for you to reply to Ashok's
> questions here:
>
> https://lkml.kernel.org/r/20170105011236.GA80100@otc-brkl-03

I see. I thought Daniel had answered them. I’ll reply now.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 10:55                   ` Paul Menzel
@ 2017-01-09 11:05                     ` Borislav Petkov
  2017-01-09 11:11                       ` Paul Menzel
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2017-01-09 11:05 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Ashok Raj, Alexander Alemayhu, Daniel J Blueman, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas

On Mon, Jan 09, 2017 at 11:55:41AM +0100, Paul Menzel wrote:
> Do you mean *shouldn’t have been done*?

Yes.

> Should the discussion be referenced?

Yap, it will be.

> Also, is that just for MacBookPro11,3? The MCE for the Dell XPS13 looks
> different from what I see, doesn’t it?

Yes, yours is different. I'm still waiting for you to reply to Ashok's
questions here:

https://lkml.kernel.org/r/20170105011236.GA80100@otc-brkl-03

Thanks.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 16:54                 ` Borislav Petkov
  2017-01-06 17:04                   ` Raj, Ashok
@ 2017-01-09 10:55                   ` Paul Menzel
  2017-01-09 11:05                     ` Borislav Petkov
  1 sibling, 1 reply; 30+ messages in thread
From: Paul Menzel @ 2017-01-09 10:55 UTC (permalink / raw)
  To: Borislav Petkov, Ashok Raj
  Cc: Alexander Alemayhu, Daniel J Blueman, tony.luck, linux,
	len.brown, Linux Kernel, Pandruvada, Srinivas

On 01/06/17 17:54, Borislav Petkov wrote:
> On Fri, Jan 06, 2017 at 07:58:31AM -0800, Raj, Ashok wrote:
>> Looks like we don't need a return value from therm_throt_process(),
>> we can fix that as void as well.
>
> Right you are, here's v2:
>
> ---
> From a8151fa6f18c2605eb7972061234f05e79b372c4 Mon Sep 17 00:00:00 2001
> From: Borislav Petkov <bp@suse.de>
> Date: Fri, 6 Jan 2017 12:07:08 +0100
> Subject: [PATCH] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event
>
> We log a fake bank 128 MCE to note that we're handling a CPU thermal
> event. However, this confuses people into thinking that their hardware
> generates MCEs. Hijacking MCA for logging thermal events is a gross
> misuse anyway and it should've been done in the first place. And besides

Do you mean *shouldn’t have been done*?

> we have other means for dealing with thermal events which are much more
> suitable.
>
> So let's kill the MCE logging part.
>
> Signed-off-by: Borislav Petkov <bp@suse.de>

Should the discussion be referenced?

Also, is that just for MacBookPro11,3? The MCE for the Dell XPS13 looks 
different from what I see, doesn’t it?

[…]


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 16:54                 ` Borislav Petkov
@ 2017-01-06 17:04                   ` Raj, Ashok
  2017-01-09 10:55                   ` Paul Menzel
  1 sibling, 0 replies; 30+ messages in thread
From: Raj, Ashok @ 2017-01-06 17:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas, ashok.raj

Hi Boris

This looks good to me!

On Fri, Jan 06, 2017 at 05:54:23PM +0100, Borislav Petkov wrote:
> On Fri, Jan 06, 2017 at 07:58:31AM -0800, Raj, Ashok wrote:
> > Looks like we don't need a return value from therm_throt_process(),
> > we can fix that as void as well.
> 
> Right you are, here's v2:
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
> 
> v2: Ashok: make therm_throt_process() void.

Acked-by: Ashok Raj <ashok.raj@intel.com>
> 
>  arch/x86/include/asm/mce.h               |  6 ------
>  arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
>  arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
>  3 files changed, 11 insertions(+), 50 deletions(-)

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 15:58               ` Raj, Ashok
@ 2017-01-06 16:54                 ` Borislav Petkov
  2017-01-06 17:04                   ` Raj, Ashok
  2017-01-09 10:55                   ` Paul Menzel
  0 siblings, 2 replies; 30+ messages in thread
From: Borislav Petkov @ 2017-01-06 16:54 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas

On Fri, Jan 06, 2017 at 07:58:31AM -0800, Raj, Ashok wrote:
> Looks like we don't need a return value from therm_throt_process(),
> we can fix that as void as well.

Right you are, here's v2:

---
From a8151fa6f18c2605eb7972061234f05e79b372c4 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp@suse.de>
Date: Fri, 6 Jan 2017 12:07:08 +0100
Subject: [PATCH] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it should've been done in the first place. And besides
we have other means for dealing with thermal events which are much more
suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
---

v2: Ashok: make therm_throt_process() void.

 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
 3 files changed, 11 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a6c0a2..a09ed05725c2 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef43233e03..6eef6fde0f02 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8be009..85469f84c921 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -141,13 +141,8 @@ static struct attribute_group thermal_attr_group = {
  * IRQ has been acknowledged.
  *
  * It will take care of rate limiting and printing messages to the syslog.
- *
- * Returns: 0 : Event should NOT be further logged, i.e. still in
- *              "timeout" from previous log message.
- *          1 : Event should be logged further, and a message has been
- *              printed to the syslog.
  */
-static int therm_throt_process(bool new_event, int event, int level)
+static void therm_throt_process(bool new_event, int event, int level)
 {
 	struct _thermal_state *state;
 	unsigned int this_cpu = smp_processor_id();
@@ -162,16 +157,16 @@ static int therm_throt_process(bool new_event, int event, int level)
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->core_power_limit;
 		else
-			 return 0;
+			return;
 	} else if (level == PACKAGE_LEVEL) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			state = &pstate->package_throttle;
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->package_power_limit;
 		else
-			return 0;
+			return;
 	} else
-		return 0;
+		return;
 
 	old_event = state->new_event;
 	state->new_event = new_event;
@@ -181,7 +176,7 @@ static int therm_throt_process(bool new_event, int event, int level)
 
 	if (time_before64(now, state->next_check) &&
 			state->count != state->last_count)
-		return 0;
+		return;
 
 	state->next_check = now + CHECK_INTERVAL;
 	state->last_count = state->count;
@@ -193,16 +188,14 @@ static int therm_throt_process(bool new_event, int event, int level)
 				this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package",
 				state->count);
-		return 1;
+		return;
 	}
 	if (old_event) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package");
-		return 1;
+		return;
 	}
-
-	return 0;
 }
 
 static int thresh_event_valid(int level, int event)
@@ -365,10 +358,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
-- 
2.11.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 11:16             ` Borislav Petkov
@ 2017-01-06 15:58               ` Raj, Ashok
  2017-01-06 16:54                 ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Raj, Ashok @ 2017-01-06 15:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas, ashok.raj

Hi Boris

On Fri, Jan 06, 2017 at 12:16:17PM +0100, Borislav Petkov wrote:
> On Thu, Jan 05, 2017 at 05:26:17PM -0800, Raj, Ashok wrote:
> > Agree, since we have both a log and another agent to deal with it, there is
> > no good reason to continue... Will pass this along, and have someone look at
> > cleaning this up.
> 
> Like this?

That was quick :-).

> -	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
> -				THERMAL_THROTTLING_EVENT,
> -				CORE_LEVEL) != 0)

Looks like we don't need a return value from therm_throt_process(),
we can fix that as void as well.

Otherwise it looks good. 

> +	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
> +			    THERMAL_THROTTLING_EVENT,
> +			    CORE_LEVEL);
>  

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06  1:26           ` Raj, Ashok
@ 2017-01-06 11:16             ` Borislav Petkov
  2017-01-06 15:58               ` Raj, Ashok
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2017-01-06 11:16 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas

On Thu, Jan 05, 2017 at 05:26:17PM -0800, Raj, Ashok wrote:
> Agree, since we have both a log and another agent to deal with it, there is
> no good reason to continue... Will pass this along, and have someone look at
> cleaning this up.

Like this?

---
From: Borislav Petkov <bp@suse.de>
Date: Fri, 6 Jan 2017 12:07:08 +0100
Subject: [PATCH] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it should've been done in the first place. And besides
we have other means for dealing with thermal events which are much more
suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  9 ++++-----
 3 files changed, 4 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a6c0a2..a09ed05725c2 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef43233e03..6eef6fde0f02 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8be009..109fbb25c851 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -365,10 +365,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
-- 
2.11.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 23:56         ` Borislav Petkov
@ 2017-01-06  1:26           ` Raj, Ashok
  2017-01-06 11:16             ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Raj, Ashok @ 2017-01-06  1:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, ashok.raj

On Fri, Jan 06, 2017 at 12:56:11AM +0100, Borislav Petkov wrote:
> Oh, and it's not like the user can do anything - there's a thermald
> which is supposed to deal with all that. Which is not really
> trouble-free either, TBH. What happens if that thing dies? Fried CPU?
> 
> So I say we should rip out that mce_log_therm_throt_event() and never
> ever handle thermal events with MCEs. It is a bad idea.

Agree, since we have both a log and another agent to deal with it, there is
no good reason to continue... Will pass this along, and have someone look at
cleaning this up.


Cheers,
Ashok

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 23:28       ` Raj, Ashok
@ 2017-01-05 23:56         ` Borislav Petkov
  2017-01-06  1:26           ` Raj, Ashok
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2017-01-05 23:56 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel

On Thu, Jan 05, 2017 at 03:28:00PM -0800, Raj, Ashok wrote:
> After looking at the code, it seems these events are logged as MCEs but
> really come from real LVT thermal event interrupts, via a fake bank 128
> (MCE_THERMAL_BANK). These are not really HW MCEs, but fake ones created
> and logged as mcelog entries (arch/x86/kernel/cpu/mcheck/therm_throt.c).

Right, we've done that since forever but I do think that it confuses
people. This thread case-in-point. I mean, we already scream:

	pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",

to dmesg, why do we have to log a fake MCE too?!

Hell, we even log an MCE when things go back to normal:

        if (old_event) {
                if (event == THERMAL_THROTTLING_EVENT)
                        pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
                                level == CORE_LEVEL ? "Core" : "Package");
                return 1;

And Alexander's log shows exactly that:

[ 6338.170924] CPU1: Core temperature above threshold, cpu clock throttled (total events = 21068)
[ 6338.170925] CPU5: Core temperature above threshold, cpu clock throttled (total events = 21068)
[ 6338.170928] CPU7: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170931] CPU4: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170932] CPU0: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170933] CPU6: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170935] CPU2: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170936] CPU3: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170937] CPU5: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170945] CPU1: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170947] mce_notify_irq: 1 callbacks suppressed
[ 6338.170948] mce: [Hardware Error]: Machine check events logged				<--- new event
[ 6338.171917] CPU1: Core temperature/speed normal
[ 6338.171918] CPU5: Core temperature/speed normal
[ 6338.171920] CPU4: Package temperature/speed normal
[ 6338.171920] CPU0: Package temperature/speed normal
[ 6338.171922] CPU2: Package temperature/speed normal
[ 6338.171923] CPU6: Package temperature/speed normal
[ 6338.171924] CPU3: Package temperature/speed normal
[ 6338.171925] CPU7: Package temperature/speed normal
[ 6338.171927] CPU5: Package temperature/speed normal
[ 6338.171929] CPU1: Package temperature/speed normal
[ 6338.171930] mce: [Hardware Error]: Machine check events logged				<--- old event

Oh, and it's not like the user can do anything - there's a thermald
which is supposed to deal with all that. Which is not really
trouble-free either, TBH. What happens if that thing dies? Fried CPU?

So I say we should rip out that mce_log_therm_throt_event() and never
ever handle thermal events with MCEs. It is a bad idea.
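
To make the behaviour above concrete, here is a small userspace model of the
decision logic in therm_throt_process() as it appears in the diff context of
this thread. It is only a sketch: time is counted in plain seconds, the
CHECK_INTERVAL value of 300 is an assumption (the thread does not show its
definition), and the struct and function names are simplified stand-ins for
the kernel ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CHECK_INTERVAL 300	/* assumed value, in seconds for this model */

struct thermal_state_model {
	bool new_event;			/* was the last sample a throttle event? */
	uint64_t next_check;		/* earliest time the next message may be logged */
	unsigned long count;		/* total throttle events seen */
	unsigned long last_count;	/* count at the time of the last message */
};

/*
 * Returns 1 when a message is logged (the cases that used to produce a
 * fake MCE as well), 0 when the event is rate limited or uninteresting.
 */
int therm_throt_model(struct thermal_state_model *state, bool new_event,
		      uint64_t now)
{
	bool old_event;

	assert(state != NULL);

	old_event = state->new_event;
	state->new_event = new_event;

	if (new_event)
		state->count++;

	/* Still inside the interval and new events have piled up: stay quiet. */
	if (now < state->next_check && state->count != state->last_count)
		return 0;

	state->next_check = now + CHECK_INTERVAL;
	state->last_count = state->count;

	if (new_event) {
		printf("temperature above threshold (total events = %lu)\n",
		       state->count);
		return 1;
	}
	if (old_event) {
		printf("temperature/speed normal\n");
		return 1;
	}
	return 0;
}
```

Feeding it a throttle event followed by a "back to normal" sample yields two
logged messages, which matches the pair of "Machine check events logged" lines
in the log above; further events inside the interval are suppressed.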

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:31     ` Borislav Petkov
  2017-01-05 20:43       ` Raj, Ashok
  2017-01-05 21:38       ` Alexander Alemayhu
@ 2017-01-05 23:28       ` Raj, Ashok
  2017-01-05 23:56         ` Borislav Petkov
  2 siblings, 1 reply; 30+ messages in thread
From: Raj, Ashok @ 2017-01-05 23:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, ashok.raj

Hi Boris

On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> > Not sure if it is related, but I am also seeing those messages on my
> > MacBookPro11,3:
> 
> Yours look to me like thermal throttling MCEs. And TBH we should
> not issue those as actual MCEs because they are not - they *signal*
> an overheating condition only and should be handled differently.

After looking at the code, it seems these events are logged as MCEs but
really come from real LVT thermal event interrupts, via a fake bank 128
(MCE_THERMAL_BANK). These are not really HW MCEs, but fake ones created
and logged as mcelog entries (arch/x86/kernel/cpu/mcheck/therm_throt.c).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 21:03         ` Pandruvada, Srinivas
@ 2017-01-05 23:23           ` Alexander Alemayhu
  0 siblings, 0 replies; 30+ messages in thread
From: Alexander Alemayhu @ 2017-01-05 23:23 UTC (permalink / raw)
  To: Pandruvada, Srinivas
  Cc: Raj, Ashok, bp, Brown, Len, linux-kernel, Luck, Tony, linux,
	pmenzel, daniel

On Thu, Jan 05, 2017 at 09:03:10PM +0000, Pandruvada, Srinivas wrote:
> I suggest trying with the following kernel command line, if you're
> getting notifications to throttle from SMM before:
> 
> 	intel_pstate=support_acpi_ppc
> 
> openSUSE doesn't start thermald by default.
> 

Used the suggested kernel command line change and made sure thermald[0] is
running.  I see no more mce* related errors and my fans are making less noise
during workloads.  Thank you :)

[0]: https://github.com/01org/thermal_daemon
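
A quick way to confirm that the running kernel was really booted with the
suggested option is to look for it in /proc/cmdline. The following is a
minimal sketch; cmdline_has_option() and running_kernel_has_option() are
hypothetical helper names, and the matching is a plain whole-word comparison
on space-separated tokens, not the kernel's own parameter parser.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if opt appears in cmdline as a complete space-separated token. */
bool cmdline_has_option(const char *cmdline, const char *opt)
{
	size_t optlen = strlen(opt);
	const char *p = cmdline;

	assert(optlen > 0);

	while ((p = strstr(p, opt)) != NULL) {
		bool starts_token = (p == cmdline) || p[-1] == ' ';
		char end = p[optlen];
		bool ends_token = end == '\0' || end == ' ' || end == '\n';

		if (starts_token && ends_token)
			return true;
		p += optlen;
	}
	return false;
}

/* Check the command line of the running kernel; false if unreadable. */
bool running_kernel_has_option(const char *opt)
{
	char buf[4096];
	FILE *f = fopen("/proc/cmdline", "r");
	bool found = false;

	if (!f)
		return false;
	if (fgets(buf, sizeof(buf), f))
		found = cmdline_has_option(buf, opt);
	fclose(f);
	return found;
}
```

Calling running_kernel_has_option("intel_pstate=support_acpi_ppc") after a
reboot should then return true if the boot loader change took effect.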

-- 
Mit freundlichen Grüßen

Alexander Alemayhu

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:31     ` Borislav Petkov
  2017-01-05 20:43       ` Raj, Ashok
@ 2017-01-05 21:38       ` Alexander Alemayhu
  2017-01-05 23:28       ` Raj, Ashok
  2 siblings, 0 replies; 30+ messages in thread
From: Alexander Alemayhu @ 2017-01-05 21:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Daniel J Blueman, Raj, Ashok, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 352 bytes --]

On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> 
> What does your /var/log/mcelog* log file contain?
>

There are no files there, but in the attached backtrace mcelog is mentioned
several times. Sorry I am not familiar with mcelog so I don't know where it
would be logging on Fedora.

-- 
Mit freundlichen Grüßen

Alexander Alemayhu

[-- Attachment #2: backtrace --]
[-- Type: text/plain, Size: 1500 bytes --]

The kernel log indicates that hardware errors were detected.
System log may have more information.
The last 20 mcelog lines of system log are:
==========================================
Jan  5 10:32:09 hafza mcelog: mcelog: warning: 16 bytes ignored in each record
Jan  5 10:32:09 hafza mcelog: mcelog: consider an update
Jan  5 10:32:09 hafza mcelog: Hardware event. This is not a software error.
Jan  5 10:32:09 hafza mcelog: MCE 0
Jan  5 10:32:09 hafza mcelog: CPU 1 THERMAL EVENT TSC 7722fe2d8c7
Jan  5 10:32:09 hafza mcelog: TIME 1483608727 Thu Jan  5 10:32:07 2017
Jan  5 10:32:09 hafza mcelog: Processor 1 below trip temperature. Throttling disabled
Jan  5 10:32:09 hafza mcelog: STATUS 88020a8a MCGSTATUS 0
Jan  5 10:32:09 hafza mcelog: MCGCAP c0a APICID 2 SOCKETID 0
Jan  5 10:32:09 hafza mcelog: CPUID Vendor Intel Family 6 Model 70
Jan  5 10:32:09 hafza mcelog: Hardware event. This is not a software error.
Jan  5 10:32:09 hafza mcelog: MCE 1
Jan  5 10:32:09 hafza mcelog: CPU 5 THERMAL EVENT TSC 7722fe316ed
Jan  5 10:32:09 hafza mcelog: TIME 1483608727 Thu Jan  5 10:32:07 2017
Jan  5 10:32:09 hafza mcelog: Processor 5 below trip temperature. Throttling disabled
Jan  5 10:32:09 hafza mcelog: STATUS 88020a8a MCGSTATUS 0
Jan  5 10:32:09 hafza mcelog: MCGCAP c0a APICID 3 SOCKETID 0
Jan  5 10:32:09 hafza mcelog: CPUID Vendor Intel Family 6 Model 70
Jan  5 10:32:09 hafza mcelog: mcelog: warning: 16 bytes ignored in each record
Jan  5 10:32:09 hafza mcelog: mcelog: consider an update
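
The STATUS value in these thermal records can be unpacked with a short
Python sketch. This assumes that the STATUS field mcelog prints for a
THERMAL EVENT is the raw IA32_THERM_STATUS MSR value and that the
architectural bit layout applies (bit 0 = currently above trip, bits 22:16 =
digital readout below TjMax, bit 31 = reading valid). The function name is
hypothetical and is not taken from mcelog or the kernel.

```python
def decode_therm_status(status):
    """Unpack a few IA32_THERM_STATUS fields (assumed layout, see above)."""
    return {
        # Bit 0: the core is currently above the trip temperature.
        "above_trip": bool(status & 1),
        # Bit 1: sticky log bit, set if the trip point was crossed earlier.
        "above_trip_logged": bool((status >> 1) & 1),
        # Bits 22:16: degrees Celsius below TjMax.
        "readout_below_tjmax": (status >> 16) & 0x7F,
        # Bits 30:27: sensor resolution in degrees Celsius.
        "resolution_c": (status >> 27) & 0xF,
        # Bit 31: the digital readout is valid.
        "reading_valid": bool((status >> 31) & 1),
    }


if __name__ == "__main__":
    # STATUS value from the mcelog lines above.
    for name, value in decode_therm_status(0x88020A8A).items():
        print(name, value)
```

Under those assumptions, bit 0 is clear and the log bit is set, which matches
the "below trip temperature. Throttling disabled" text in the records.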


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:43       ` Raj, Ashok
@ 2017-01-05 21:03         ` Pandruvada, Srinivas
  2017-01-05 23:23           ` Alexander Alemayhu
  0 siblings, 1 reply; 30+ messages in thread
From: Pandruvada, Srinivas @ 2017-01-05 21:03 UTC (permalink / raw)
  To: Raj, Ashok, bp
  Cc: Brown, Len, linux-kernel, Luck, Tony, linux, pmenzel, alexander, daniel

On Thu, 2017-01-05 at 12:43 -0800, Raj, Ashok wrote:
> Hi Boris
> 
> 
> On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> > 
> > On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> > > 
> > > Not sure if it is related, but I am also seeing those messages on my
> > > MacBookPro11,3:
> >
> > Yours look to me like thermal throttling MCEs. And TBH we should
> > not issue those as actual MCEs because they are not - they *signal*
> > overheating condition only and should be handled differently.
> 
> That's right. The thermal interrupts are being reported, and that should
> have started running the CPUs at lower frequencies via some thermald.
> If that's not handled and the PCU steps in to keep the temperatures
> under control, the system starts logging MCEs.
>
> Ccing Srinivas, who might be able to give a better pointer to check why
> that's not happening.
I suggest trying the following kernel command line option, if you were
getting notifications to throttle from SMM before:

	intel_pstate=support_acpi_ppc

opensuse doesn't start thermald by default.
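
To confirm that the option actually reached the running kernel, one can
look at the boot command line. A minimal Python sketch follows; it assumes
a Linux system exposing /proc/cmdline, and the helper name is made up for
illustration.

```python
from pathlib import Path


def intel_pstate_args(cmdline):
    """Return the values of every intel_pstate= token on a command line."""
    return [
        token.split("=", 1)[1]
        for token in cmdline.split()
        if token.startswith("intel_pstate=")
    ]


if __name__ == "__main__":
    path = Path("/proc/cmdline")  # present on Linux; skipped elsewhere
    if path.exists():
        args = intel_pstate_args(path.read_text())
        print("intel_pstate options:", args or "none")
        print("support_acpi_ppc set:", "support_acpi_ppc" in args)
```

Whether thermald is running has to be checked separately with the
distribution's service manager.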

Thanks,
Srinivas


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:31     ` Borislav Petkov
@ 2017-01-05 20:43       ` Raj, Ashok
  2017-01-05 21:03         ` Pandruvada, Srinivas
  2017-01-05 21:38       ` Alexander Alemayhu
  2017-01-05 23:28       ` Raj, Ashok
  2 siblings, 1 reply; 30+ messages in thread
From: Raj, Ashok @ 2017-01-05 20:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, ashok.raj, srinivas.pandruvada

Hi Boris


On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> > Not sure if it is related, but I am also seeing those messages on my
> > MacBookPro11,3:
> 
> Yours look to me like thermal throttling MCEs. And TBH we should
> not issue those as actual MCEs because they are not - they *signal*
> overheating condition only and should be handled differently.

That's right. The thermal interrupts are being reported, and that should
have started running the CPUs at lower frequencies via some thermald.
If that's not handled and the PCU steps in to keep the temperatures
under control, the system starts logging MCEs.

Ccing Srinivas who might be able to give a better pointer to check why 
that's not happening.

The log didn't include the exact MCEs reported. If you have mcelog output,
please attach it to the report.

> 
> What does your /var/log/mcelog* log file contain?
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
> -- 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:10   ` Alexander Alemayhu
@ 2017-01-05 20:31     ` Borislav Petkov
  2017-01-05 20:43       ` Raj, Ashok
                         ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Borislav Petkov @ 2017-01-05 20:31 UTC (permalink / raw)
  To: Alexander Alemayhu
  Cc: Daniel J Blueman, Raj, Ashok, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> Not sure if it is related, but I am also seeing those messages on my
> MacBookPro11,3:

Yours look to me like thermal throttling MCEs. And TBH we should
not issue those as actual MCEs because they are not - they *signal*
overheating condition only and should be handled differently.

What does your /var/log/mcelog* log file contain?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 14:05 ` Daniel J Blueman
@ 2017-01-05 20:10   ` Alexander Alemayhu
  2017-01-05 20:31     ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Alexander Alemayhu @ 2017-01-05 20:10 UTC (permalink / raw)
  To: Daniel J Blueman
  Cc: Raj, Ashok, Borislav Petkov, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

On Thu, Jan 05, 2017 at 10:05:39PM +0800, Daniel J Blueman wrote:
> 
> That said, I have seen this reoccur after boot; there were no other
> kernel messages around 300s uptime, and it hasn't occurred in the last
> hours since:
>

Not sure if it is related, but I am also seeing those messages on my
MacBookPro11,3:

grep -e "Linux version" -e "Machine" oops-2017-01-05-10:32:10-1076-0/dmesg

[    0.000000] Linux version 4.9.0 (scanf@hafza) (gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC) ) #1 SMP Sun Dec 25 22:25:17 CET 2016
[ 4231.274376] mce: [Hardware Error]: Machine check events logged
[ 4231.274893] mce: [Hardware Error]: Machine check events logged
[ 4531.292608] mce: [Hardware Error]: Machine check events logged
[ 4531.292610] mce: [Hardware Error]: Machine check events logged
[ 4833.369927] mce: [Hardware Error]: Machine check events logged
[ 4833.370906] mce: [Hardware Error]: Machine check events logged
[ 5135.449222] mce: [Hardware Error]: Machine check events logged
[ 5135.449228] mce: [Hardware Error]: Machine check events logged
[ 5435.564152] mce: [Hardware Error]: Machine check events logged
[ 5435.564153] mce: [Hardware Error]: Machine check events logged
[ 5735.592854] mce: [Hardware Error]: Machine check events logged
[ 5735.592862] mce: [Hardware Error]: Machine check events logged
[ 6038.070068] mce: [Hardware Error]: Machine check events logged
[ 6038.070073] mce: [Hardware Error]: Machine check events logged
[ 6338.170948] mce: [Hardware Error]: Machine check events logged
[ 6338.171930] mce: [Hardware Error]: Machine check events logged
[ 6920.788280] mce: [Hardware Error]: Machine check events logged
[ 6920.788284] mce: [Hardware Error]: Machine check events logged
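
The spacing between these messages is easier to see when computed. The
Python sketch below parses an excerpt of the dmesg lines above, collapses
messages that arrive within one second of each other into a single burst,
and prints the gaps between bursts. The one-second window and the function
names are arbitrary choices for illustration.

```python
import re

# Excerpt of the dmesg output above (first four pairs only).
DMESG = """\
[ 4231.274376] mce: [Hardware Error]: Machine check events logged
[ 4231.274893] mce: [Hardware Error]: Machine check events logged
[ 4531.292608] mce: [Hardware Error]: Machine check events logged
[ 4531.292610] mce: [Hardware Error]: Machine check events logged
[ 4833.369927] mce: [Hardware Error]: Machine check events logged
[ 4833.370906] mce: [Hardware Error]: Machine check events logged
[ 5135.449222] mce: [Hardware Error]: Machine check events logged
[ 5135.449228] mce: [Hardware Error]: Machine check events logged
"""

PATTERN = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+mce: \[Hardware Error\]: Machine check events logged"
)


def event_times(text):
    """Return the uptime (seconds) of every 'events logged' line."""
    return [float(m.group(1)) for m in PATTERN.finditer(text)]


def burst_starts(times, window=1.0):
    """Keep only the first timestamp of each burst of close-together lines."""
    starts = []
    for t in sorted(times):
        if not starts or t - starts[-1] > window:
            starts.append(t)
    return starts


def gaps(starts):
    """Seconds between consecutive bursts."""
    return [b - a for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    for gap in gaps(burst_starts(event_times(DMESG))):
        print(f"{gap:.1f} s")
```

For the excerpt, the gaps come out at roughly 300 to 302 seconds, so the
messages recur about every five minutes.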

full output: https://gist.githubusercontent.com/scanf/5b9dc1940c4913f393fbfbbe40ef6788/raw/2fe98a912979b8b4346c4c55582e092fc42bce8a/7ba692efe.txt
-- 
Mit freundlichen Grüßen

Alexander Alemayhu


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05  5:00 Daniel J Blueman
@ 2017-01-05 14:05 ` Daniel J Blueman
  2017-01-05 20:10   ` Alexander Alemayhu
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel J Blueman @ 2017-01-05 14:05 UTC (permalink / raw)
  To: Raj, Ashok, Borislav Petkov, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

On 5 January 2017 at 13:00, Daniel J Blueman <daniel@quora.org> wrote:
> On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
>> Hi Boris
>>
>> thanks for forwarding.
>>
>> > > CPUID Vendor Intel Family 6 Model 142
>> This is Kabylake Mobile
>>
>> > > Hardware event. This is not a software error.
>> > > MCE 1
>> > > CPU 0 BANK 7
>> > > MISC 7880018086 ADDR fef1ce40
>> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
>> > > MCG status:
>> > > MCi status:
>> > > Error overflow
>> > > Uncorrected error
>> > > MCi_MISC register valid
>> > > MCi_ADDR register valid
>> > > Processor context corrupt
>> > > MCA: corrected filtering (some unreported errors in same region)
>> > > Generic CACHE Level-2 Generic Error
>> > > STATUS ee0000000040110a MCGSTATUS 0
>>
>> Decoding the bits further from MCi_STATUS above:
>> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
>> been signaled by a CMCI.
>>
>> PCC=1, but should be ignored when EN=0.
>> MCACOD: 110a MSCOD: 0040
>>
>> If the system is stable enough after the report, can you send the output of
>> /proc/interrupts to confirm that.
>>
>> Although it's reported as an L2 error, some memory errors can also manifest
>> themselves as cache errors in certain cases.  In this case it looks like
>> some speculative fetch from bad memory might be the cause.
>>
>> > > MCGCAP c08 APICID 0 SOCKETID 0
>>
>> MCG_CAP: c08
>> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
>> Threshold based error reporting (bit 11) (TES_P).
>>
>>
>> Do you have another machine which doesn't report these errors? if so try
>> swapping memory between them to see if the error disappears.
>>
>> I don't have the model specific error handy.. will check that in the meantime
>> to get some decoding as well.
>>
>> If you haven't already, running some memory tests would also help.
>>
>> If you replaced the motherboard, did that involve both CPU and memory,
>> or just the motherboard swap?
>
> I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
> physical address is in the non-coherent low MMIO window:
> MISC 7880018086 ADDR fef1ce40
>
> Which is declared as device memory:
> [    0.000000] PM: Registered nosave memory: [mem 0xfee01000-0xfeffffff]
>
> For core-generated cycles, it is between the local APIC space at
> FEE00000:FEEFFFFF and SPI BIOS at FFE00000:FFFFFFFF, so will be
> subtractively decoded to the PCH, maybe being aborted due to a device
> not being enabled (hello TPM3 or new image processor).
>
> As it is logged as soon as the MCE driver initialises, it was probably
> logged during BIOS init, so there's not much we can do about it
> anyways.

That said, I have seen this reoccur after boot; there were no other
kernel messages around 300s uptime, and it hasn't occurred in the last
hours since:

$ dmesg | grep Machine
[    0.039072] mce: [Hardware Error]: Machine check events logged
[  300.069176] mce: [Hardware Error]: Machine check events logged

As I don't see a driver controlling this area of address space, the
access is likely initiated from the UEFI BIOS System Management Mode
handler, and we see the same pair of registers FEF1FF40, FEF1CE40
accessed each time.

Dan
-- 
Daniel J Blueman


* Re: Dell XPS13: MCE (Hardware Error) reported
@ 2017-01-05  5:00 Daniel J Blueman
  2017-01-05 14:05 ` Daniel J Blueman
  0 siblings, 1 reply; 30+ messages in thread
From: Daniel J Blueman @ 2017-01-05  5:00 UTC (permalink / raw)
  To: ashok.raj, Borislav Petkov, pmenzel, tony.luck, linux, len.brown,
	Linux Kernel

On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
> Hi Boris
>
> thanks for forwarding.
>
> > > CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
> > > Hardware event. This is not a software error.
> > > MCE 1
> > > CPU 0 BANK 7
> > > MISC 7880018086 ADDR fef1ce40
> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
> > > MCG status:
> > > MCi status:
> > > Error overflow
> > > Uncorrected error
> > > MCi_MISC register valid
> > > MCi_ADDR register valid
> > > Processor context corrupt
> > > MCA: corrected filtering (some unreported errors in same region)
> > > Generic CACHE Level-2 Generic Error
> > > STATUS ee0000000040110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.
>
> Although it's reported as an L2 error, some memory errors can also manifest
> themselves as cache errors in certain cases.  In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
> > > MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
>
>
> Do you have another machine which doesn't report these errors? if so try
> swapping memory between them to see if the error disappears.
>
> I don't have the model specific error handy.. will check that in the meantime
> to get some decoding as well.
>
> If you haven't already, running some memory tests would also help.
>
> If you replaced the motherboard, did that involve both CPU and memory,
> or just the motherboard swap?
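
For anyone who wants to reproduce the register decoding quoted above, here
is a minimal Python sketch. It assumes the architectural IA32_MCi_STATUS
layout (VAL at bit 63 down to PCC at bit 57, MSCOD in bits 31:16, MCACOD in
bits 15:0) and the IA32_MCG_CAP layout (bank count in bits 7:0, CMCI_P at
bit 10, TES_P at bit 11). The function names are invented for illustration
and are not part of mcelog or the kernel.

```python
def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into the fields discussed above."""
    return {
        "VAL": (status >> 63) & 1,    # register contents are valid
        "OVER": (status >> 62) & 1,   # an earlier error was overwritten
        "UC": (status >> 61) & 1,     # uncorrected error
        "EN": (status >> 60) & 1,     # error reporting was enabled
        "MISCV": (status >> 59) & 1,  # MCi_MISC register valid
        "ADDRV": (status >> 58) & 1,  # MCi_ADDR register valid
        "PCC": (status >> 57) & 1,    # processor context corrupt
        "MSCOD": (status >> 16) & 0xFFFF,  # model-specific error code
        "MCACOD": status & 0xFFFF,         # architectural error code
    }


def decode_mcg_cap(cap):
    """Extract the MCG_CAP fields mentioned in the quoted analysis."""
    return {
        "banks": cap & 0xFF,        # number of reporting banks
        "CMCI_P": (cap >> 10) & 1,  # corrected MC interrupt supported
        "TES_P": (cap >> 11) & 1,   # threshold-based error status supported
    }


if __name__ == "__main__":
    # STATUS and MCGCAP values from the mcelog output in this thread.
    for name, value in decode_mci_status(0xEE0000000040110A).items():
        print(f"{name:6s} {value:#x}")
    print(decode_mcg_cap(0xC08))
```

With the values from the log this reproduces the quoted reading: VAL, OVER,
UC and PCC are set while EN is clear, MCACOD is 0x110a and MSCOD is 0x0040,
and MCG_CAP reports 8 banks with CMCI_P and TES_P set.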

I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
physical address is in the non-coherent low MMIO window:
MISC 7880018086 ADDR fef1ce40

Which is declared as device memory:
[    0.000000] PM: Registered nosave memory: [mem 0xfee01000-0xfeffffff]

For core-generated cycles, it is between the local APIC space at
FEE00000:FEEFFFFF and SPI BIOS at FFE00000:FFFFFFFF, so will be
subtractively decoded to the PCH, maybe being aborted due to a device
not being enabled (hello TPM3 or new image processor).
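
The address argument can be checked mechanically. The Python sketch below
tests both logged addresses against the windows named above. It assumes the
local APIC window ends at FEEFFFFF; the window labels and the function name
are only for illustration.

```python
# Physical address windows mentioned above (inclusive bounds).
WINDOWS = {
    "nosave (device memory)": (0xFEE01000, 0xFEFFFFFF),
    "local APIC": (0xFEE00000, 0xFEEFFFFF),  # assumed upper bound
    "SPI BIOS": (0xFFE00000, 0xFFFFFFFF),
}


def windows_containing(addr):
    """Return the names of all windows that contain the given address."""
    return [
        name
        for name, (low, high) in WINDOWS.items()
        if low <= addr <= high
    ]


if __name__ == "__main__":
    # ADDR values from the two MCE records in this thread.
    for addr in (0xFEF1FF40, 0xFEF1CE40):
        print(f"{addr:#010x}: {windows_containing(addr)}")
```

Both addresses fall inside the nosave region but above the local APIC
window and below the SPI BIOS window, which is the gap described above.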

As it is logged as soon as the MCE driver initialises, it was probably
logged during BIOS init, so there's not much we can do about it
anyways.

Dan
-- 
Daniel J Blueman


end of thread, other threads:[~2017-02-01 20:52 UTC | newest]

Thread overview: 30+ messages
2017-01-04 15:42 Dell XPS13: MCE (Hardware Error) reported Paul Menzel
2017-01-04 22:55 ` Borislav Petkov
2017-01-05  1:12   ` Raj, Ashok
2017-01-09 11:53     ` Paul Menzel
2017-01-09 19:23       ` Raj, Ashok
2017-01-27 13:35         ` Paul Menzel
2017-01-27 17:10           ` Borislav Petkov
2017-01-27 17:16             ` Mario.Limonciello
2017-01-31 15:29               ` Paul Menzel
2017-01-31 17:20                 ` Borislav Petkov
2017-01-31 18:50                 ` Austin S. Hemmelgarn
2017-02-01 20:52                 ` Mario.Limonciello
2017-01-05  5:00 Daniel J Blueman
2017-01-05 14:05 ` Daniel J Blueman
2017-01-05 20:10   ` Alexander Alemayhu
2017-01-05 20:31     ` Borislav Petkov
2017-01-05 20:43       ` Raj, Ashok
2017-01-05 21:03         ` Pandruvada, Srinivas
2017-01-05 23:23           ` Alexander Alemayhu
2017-01-05 21:38       ` Alexander Alemayhu
2017-01-05 23:28       ` Raj, Ashok
2017-01-05 23:56         ` Borislav Petkov
2017-01-06  1:26           ` Raj, Ashok
2017-01-06 11:16             ` Borislav Petkov
2017-01-06 15:58               ` Raj, Ashok
2017-01-06 16:54                 ` Borislav Petkov
2017-01-06 17:04                   ` Raj, Ashok
2017-01-09 10:55                   ` Paul Menzel
2017-01-09 11:05                     ` Borislav Petkov
2017-01-09 11:11                       ` Paul Menzel
