* Re: Dell XPS13: MCE (Hardware Error) reported
@ 2017-01-05  5:00 Daniel J Blueman
  2017-01-05 14:05 ` Daniel J Blueman
  0 siblings, 1 reply; 49+ messages in thread
From: Daniel J Blueman @ 2017-01-05  5:00 UTC (permalink / raw)
  To: ashok.raj, Borislav Petkov, pmenzel, tony.luck, linux, len.brown,
	Linux Kernel

On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
> Hi Boris
>
> thanks for forwarding.
>
> > > CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
> > > Hardware event. This is not a software error.
> > > MCE 1
> > > CPU 0 BANK 7
> > > MISC 7880018086 ADDR fef1ce40
> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
> > > MCG status:
> > > MCi status:
> > > Error overflow
> > > Uncorrected error
> > > MCi_MISC register valid
> > > MCi_ADDR register valid
> > > Processor context corrupt
> > > MCA: corrected filtering (some unreported errors in same region)
> > > Generic CACHE Level-2 Generic Error
> > > STATUS ee0000000040110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.
>
> Although it's reported as an L2 error, some memory errors can also manifest
> themselves as cache errors in certain cases.  In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
> > > MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
>
>
> Do you have another machine which doesn't report these errors? if so try
> swapping memory between them to see if the error disappears.
>
> I don't have the model specific error handy.. will check that in the meantime
> to get some decoding as well.
>
> If you haven't already, running some memory tests would also help.
>
> If you replaced the motherboard, did that involve both CPU and memory,
> or just the motherboard swap?
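
For reference, the bit decode quoted above can be reproduced with a few lines of Python. This is only a minimal sketch: the bit positions are the architectural IA32_MCi_STATUS and IA32_MCG_CAP layouts, while the helper names `decode_mci_status` and `decode_mcg_cap` are made up for illustration and are not part of mcelog or the kernel.

```python
# Sketch only: helper names are hypothetical; bit positions follow the
# architectural IA32_MCi_STATUS / IA32_MCG_CAP register layouts.
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register contents valid
    "OVER": 62,   # error overflow
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flag bits and error codes."""
    fields = {name: (status >> bit) & 1 for name, bit in MCI_STATUS_FLAGS.items()}
    fields["MCACOD"] = status & 0xFFFF          # architectural error code
    fields["MSCOD"] = (status >> 16) & 0xFFFF   # model-specific error code
    return fields


def decode_mcg_cap(cap):
    """Extract the bank count and the two capability bits discussed above."""
    return {
        "banks": cap & 0xFF,         # number of reporting banks
        "CMCI_P": (cap >> 10) & 1,   # corrected machine check interrupt
        "TES_P": (cap >> 11) & 1,    # threshold-based error status
    }


if __name__ == "__main__":
    for name, value in decode_mci_status(0xEE0000000040110A).items():
        print(f"{name:6s} {value:#x}")
    print(decode_mcg_cap(0xC08))
```

Feeding it the STATUS and MCGCAP values from the report should give the same EN=0, PCC=1, MCACOD and MSCOD readings as listed above.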

I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
physical address is in the non-coherent low MMIO window:
MISC 7880018086 ADDR fef1ce40

Which is declared as device memory:
[    0.000000] PM: Registered nosave memory: [mem 0xfee01000-0xfeffffff]

For core-generated cycles, it is between the local APIC space at
FEE00000:FEEFFFFF and SPI BIOS at FFE00000:FFFFFFFF, so will be
subtractively decoded to the PCH, maybe being aborted due to a device
not being enabled (hello TPM3 or new image processor).
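
The address reasoning above can be checked mechanically. The following is a rough sketch, assuming the usual 1 MiB local APIC window and the ranges quoted in this mail; the names `contains` and `classify` are hypothetical helpers, not anything from the kernel.

```python
# Sketch only: ranges are taken from the mail above (the local APIC window is
# assumed to be the usual 1 MiB at FEE00000); helper names are hypothetical.
LOCAL_APIC = (0xFEE00000, 0xFEEFFFFF)
SPI_BIOS = (0xFFE00000, 0xFFFFFFFF)
NOSAVE = (0xFEE01000, 0xFEFFFFFF)  # "PM: Registered nosave memory" line


def contains(region, addr):
    """True if addr lies inside the inclusive (low, high) region."""
    low, high = region
    return low <= addr <= high


def classify(addr):
    """Place a physical address relative to the two fixed windows."""
    if contains(LOCAL_APIC, addr):
        return "local APIC"
    if contains(SPI_BIOS, addr):
        return "SPI BIOS"
    if LOCAL_APIC[1] < addr < SPI_BIOS[0]:
        return "gap between local APIC and SPI BIOS"
    return "elsewhere"


if __name__ == "__main__":
    addr = 0xFEF1CE40  # ADDR from the MCE record
    print(hex(addr), "->", classify(addr), "| nosave:", contains(NOSAVE, addr))
```

An address that lands in the gap and inside the nosave region is consistent with device memory rather than DRAM, which is the point made above.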

As it is logged as soon as the MCE driver initialises, it was probably
logged during BIOS init, so there's not much we can do about it
anyways.

Dan
-- 
Daniel J Blueman

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05  5:00 Dell XPS13: MCE (Hardware Error) reported Daniel J Blueman
@ 2017-01-05 14:05 ` Daniel J Blueman
  2017-01-05 20:10   ` Alexander Alemayhu
  0 siblings, 1 reply; 49+ messages in thread
From: Daniel J Blueman @ 2017-01-05 14:05 UTC (permalink / raw)
  To: Raj, Ashok, Borislav Petkov, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

On 5 January 2017 at 13:00, Daniel J Blueman <daniel@quora.org> wrote:
> On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
>> Hi Boris
>>
>> thanks for forwarding.
>>
>> > > CPUID Vendor Intel Family 6 Model 142
>> This is Kabylake Mobile
>>
>> > > Hardware event. This is not a software error.
>> > > MCE 1
>> > > CPU 0 BANK 7
>> > > MISC 7880018086 ADDR fef1ce40
>> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
>> > > MCG status:
>> > > MCi status:
>> > > Error overflow
>> > > Uncorrected error
>> > > MCi_MISC register valid
>> > > MCi_ADDR register valid
>> > > Processor context corrupt
>> > > MCA: corrected filtering (some unreported errors in same region)
>> > > Generic CACHE Level-2 Generic Error
>> > > STATUS ee0000000040110a MCGSTATUS 0
>>
>> Decoding the bits further from MCi_STATUS above:
>> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
>> been signaled by a CMCI.
>>
>> PCC=1, but should be ignored when EN=0.
>> MCACOD: 110a MSCOD: 0040
>>
>> If the system is stable enough after the report, can you send the output of
>> /proc/interrupts to confirm that.
>>
>> Although it's reported as an L2 error, some memory errors can also manifest
>> themselves as cache errors in certain cases.  In this case it looks like
>> some speculative fetch from bad memory might be the cause.
>>
>> > > MCGCAP c08 APICID 0 SOCKETID 0
>>
>> MCG_CAP: c08
>> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
>> Threshold based error reporting (bit 11) (TES_P).
>>
>>
>> Do you have another machine which doesn't report these errors? if so try
>> swapping memory between them to see if the error disappears.
>>
>> I don't have the model specific error handy.. will check that in the meantime
>> to get some decoding as well.
>>
>> If you haven't already, running some memory tests would also help.
>>
>> If you replaced the motherboard, did that involve both CPU and memory,
>> or just the motherboard swap?
>
> I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
> physical address is in the non-coherent low MMIO window:
> MISC 7880018086 ADDR fef1ce40
>
> Which is declared as device memory:
> [    0.000000] PM: Registered nosave memory: [mem 0xfee01000-0xfeffffff]
>
> For core-generated cycles, it is between the local APIC space at
> FEE00000:FEEFFFFF and SPI BIOS at FFE00000:FFFFFFFF, so will be
> subtractively decoded to the PCH, maybe being aborted due to a device
> not being enabled (hello TPM3 or new image processor).
>
> As it is logged as soon as the MCE driver initialises, it was probably
> logged during BIOS init, so there's not much we can do about it
> anyways.

That said, I have seen this reoccur after boot; there were no other
kernel messages around 300s uptime, and it hasn't occurred in the last
hours since:

$ dmesg | grep Machine
[    0.039072] mce: [Hardware Error]: Machine check events logged
[  300.069176] mce: [Hardware Error]: Machine check events logged

As I don't see a driver controlling this area of address space, the
access is likely initiated from the UEFI BIOS System Management Mode
handler, and we see the same pair of registers FEF1FF40, FEF1CE40
accessed each time.

Dan
-- 
Daniel J Blueman

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 14:05 ` Daniel J Blueman
@ 2017-01-05 20:10   ` Alexander Alemayhu
  2017-01-05 20:31     ` Borislav Petkov
  0 siblings, 1 reply; 49+ messages in thread
From: Alexander Alemayhu @ 2017-01-05 20:10 UTC (permalink / raw)
  To: Daniel J Blueman
  Cc: Raj, Ashok, Borislav Petkov, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

On Thu, Jan 05, 2017 at 10:05:39PM +0800, Daniel J Blueman wrote:
> 
> That said, I have seen this reoccur after boot; there were no other
> kernel messages around 300s uptime, and it hasn't occurred in the last
> hours since:
>

Not sure if it is related, but I am also seeing those messages on my
MacBookPro11,3:

grep -e "Linux version" -e "Machine" oops-2017-01-05-10:32:10-1076-0/dmesg

[    0.000000] Linux version 4.9.0 (scanf@hafza) (gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC) ) #1 SMP Sun Dec 25 22:25:17 CET 2016
[ 4231.274376] mce: [Hardware Error]: Machine check events logged
[ 4231.274893] mce: [Hardware Error]: Machine check events logged
[ 4531.292608] mce: [Hardware Error]: Machine check events logged
[ 4531.292610] mce: [Hardware Error]: Machine check events logged
[ 4833.369927] mce: [Hardware Error]: Machine check events logged
[ 4833.370906] mce: [Hardware Error]: Machine check events logged
[ 5135.449222] mce: [Hardware Error]: Machine check events logged
[ 5135.449228] mce: [Hardware Error]: Machine check events logged
[ 5435.564152] mce: [Hardware Error]: Machine check events logged
[ 5435.564153] mce: [Hardware Error]: Machine check events logged
[ 5735.592854] mce: [Hardware Error]: Machine check events logged
[ 5735.592862] mce: [Hardware Error]: Machine check events logged
[ 6038.070068] mce: [Hardware Error]: Machine check events logged
[ 6038.070073] mce: [Hardware Error]: Machine check events logged
[ 6338.170948] mce: [Hardware Error]: Machine check events logged
[ 6338.171930] mce: [Hardware Error]: Machine check events logged
[ 6920.788280] mce: [Hardware Error]: Machine check events logged
[ 6920.788284] mce: [Hardware Error]: Machine check events logged
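
The spacing of these messages is easier to see as numbers. Below is a small sketch that groups the lines above into bursts and prints the gap between bursts; the 1-second grouping window and the name `burst_gaps` are assumptions made for this illustration.

```python
import re

# The "Machine check events logged" lines from the dmesg excerpt above.
DMESG = """\
[ 4231.274376] mce: [Hardware Error]: Machine check events logged
[ 4231.274893] mce: [Hardware Error]: Machine check events logged
[ 4531.292608] mce: [Hardware Error]: Machine check events logged
[ 4531.292610] mce: [Hardware Error]: Machine check events logged
[ 4833.369927] mce: [Hardware Error]: Machine check events logged
[ 4833.370906] mce: [Hardware Error]: Machine check events logged
[ 5135.449222] mce: [Hardware Error]: Machine check events logged
[ 5135.449228] mce: [Hardware Error]: Machine check events logged
[ 5435.564152] mce: [Hardware Error]: Machine check events logged
[ 5435.564153] mce: [Hardware Error]: Machine check events logged
[ 5735.592854] mce: [Hardware Error]: Machine check events logged
[ 5735.592862] mce: [Hardware Error]: Machine check events logged
[ 6038.070068] mce: [Hardware Error]: Machine check events logged
[ 6038.070073] mce: [Hardware Error]: Machine check events logged
[ 6338.170948] mce: [Hardware Error]: Machine check events logged
[ 6338.171930] mce: [Hardware Error]: Machine check events logged
[ 6920.788280] mce: [Hardware Error]: Machine check events logged
[ 6920.788284] mce: [Hardware Error]: Machine check events logged
"""


def burst_gaps(log, window=1.0):
    """Group timestamps closer than `window` seconds; return gaps between groups."""
    stamps = [float(m.group(1))
              for m in re.finditer(r"^\[\s*(\d+\.\d+)\]", log, re.MULTILINE)]
    starts, prev = [], None
    for t in stamps:
        if prev is None or t - prev > window:
            starts.append(t)  # first message of a new burst
        prev = t
    return [round(b - a, 1) for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    print(burst_gaps(DMESG))
```

Gaps that never drop below about five minutes would hint at a rate limiter or periodic timer behind the messages rather than independent hardware errors, though that is only an inference from the timestamps.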

full output: https://gist.githubusercontent.com/scanf/5b9dc1940c4913f393fbfbbe40ef6788/raw/2fe98a912979b8b4346c4c55582e092fc42bce8a/7ba692efe.txt
-- 
Mit freundlichen Grüßen

Alexander Alemayhu

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:10   ` Alexander Alemayhu
@ 2017-01-05 20:31     ` Borislav Petkov
  2017-01-05 20:43       ` Raj, Ashok
                         ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Borislav Petkov @ 2017-01-05 20:31 UTC (permalink / raw)
  To: Alexander Alemayhu
  Cc: Daniel J Blueman, Raj, Ashok, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> Not sure if it is related, but I am also seeing those messages on my
> MacBookPro11,3:

Yours look to me like thermal throttling MCEs. And TBH we should
not issue those as actual MCEs because they are not - they only *signal*
an overheating condition and should be handled differently.

What does your /var/log/mcelog* log file contain?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:31     ` Borislav Petkov
@ 2017-01-05 20:43       ` Raj, Ashok
  2017-01-05 21:03         ` Pandruvada, Srinivas
  2017-01-05 21:38       ` Alexander Alemayhu
  2017-01-05 23:28       ` Raj, Ashok
  2 siblings, 1 reply; 49+ messages in thread
From: Raj, Ashok @ 2017-01-05 20:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, ashok.raj, srinivas.pandruvada

Hi Boris


On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> > Not sure if it is related, but I am also seeing those messages on my
> > MacBookPro11,3:
> 
> Yours look to me like thermal throttling MCEs. And TBH we should
> not issue those as actual MCEs because they are not - they *signal*
> overheating condition only and should be handled differently.

That's right: thermal interrupts are being reported, and they should have
caused the CPUs to run at lower frequencies via thermald or similar.
If that's not handled and the PCU starts enforcing limits to keep the
temperatures under control, the system starts logging MCEs.

Ccing Srinivas who might be able to give a better pointer to check why 
that's not happening.

The log didn't include the exact MCEs reported. If you have mcelog output,
please attach it to the report.

> 
> What does your /var/log/mcelog* log file contain?
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
> -- 

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:43       ` Raj, Ashok
@ 2017-01-05 21:03         ` Pandruvada, Srinivas
  2017-01-05 23:23           ` Alexander Alemayhu
  0 siblings, 1 reply; 49+ messages in thread
From: Pandruvada, Srinivas @ 2017-01-05 21:03 UTC (permalink / raw)
  To: Raj, Ashok, bp
  Cc: Brown, Len, linux-kernel, Luck, Tony, linux, pmenzel, alexander, daniel

On Thu, 2017-01-05 at 12:43 -0800, Raj, Ashok wrote:
> Hi Boris
> 
> 
> On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> > 
> > On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> > > 
> > > Not sure if it is related, but I am also seeing those messages on
> > > my
> > > MacBookPro11,3:
> > 
> > Yours look to me like thermal throttling MCEs. And TBH we should
> > not issue those as actual MCEs because they are not - they *signal*
> > overheating condition only and should be handled differently.
> 
> That's right.. the thermal interrupts are being reported, that should
> have
> started running cpu's at lower frequencies via some thermald. 
> If that's not handled and the PCU starts enforcing trying to keep the
> temps 
> under control system starts logging MCE's. 
> 
> Ccing Srinivas who might be able to give a better pointer to check
> why 
> that's not happening.
I suggest trying the following kernel command line option, if you were
getting notifications to throttle from SMM before:

	intel_pstate=support_acpi_ppc

openSUSE doesn't start thermald by default.

Thanks,
Srinivas

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:31     ` Borislav Petkov
  2017-01-05 20:43       ` Raj, Ashok
@ 2017-01-05 21:38       ` Alexander Alemayhu
  2017-01-05 23:28       ` Raj, Ashok
  2 siblings, 0 replies; 49+ messages in thread
From: Alexander Alemayhu @ 2017-01-05 21:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Daniel J Blueman, Raj, Ashok, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 352 bytes --]

On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> 
> What does your /var/log/mcelog* log file contain?
>

There are no files there, but in the attached backtrace mcelog is mentioned
several times. Sorry I am not familiar with mcelog so I don't know where it
would be logging on Fedora.

-- 
Mit freundlichen Grüßen

Alexander Alemayhu

[-- Attachment #2: backtrace --]
[-- Type: text/plain, Size: 1500 bytes --]

The kernel log indicates that hardware errors were detected.
System log may have more information.
The last 20 mcelog lines of system log are:
==========================================
Jan  5 10:32:09 hafza mcelog: mcelog: warning: 16 bytes ignored in each record
Jan  5 10:32:09 hafza mcelog: mcelog: consider an update
Jan  5 10:32:09 hafza mcelog: Hardware event. This is not a software error.
Jan  5 10:32:09 hafza mcelog: MCE 0
Jan  5 10:32:09 hafza mcelog: CPU 1 THERMAL EVENT TSC 7722fe2d8c7
Jan  5 10:32:09 hafza mcelog: TIME 1483608727 Thu Jan  5 10:32:07 2017
Jan  5 10:32:09 hafza mcelog: Processor 1 below trip temperature. Throttling disabled
Jan  5 10:32:09 hafza mcelog: STATUS 88020a8a MCGSTATUS 0
Jan  5 10:32:09 hafza mcelog: MCGCAP c0a APICID 2 SOCKETID 0
Jan  5 10:32:09 hafza mcelog: CPUID Vendor Intel Family 6 Model 70
Jan  5 10:32:09 hafza mcelog: Hardware event. This is not a software error.
Jan  5 10:32:09 hafza mcelog: MCE 1
Jan  5 10:32:09 hafza mcelog: CPU 5 THERMAL EVENT TSC 7722fe316ed
Jan  5 10:32:09 hafza mcelog: TIME 1483608727 Thu Jan  5 10:32:07 2017
Jan  5 10:32:09 hafza mcelog: Processor 5 below trip temperature. Throttling disabled
Jan  5 10:32:09 hafza mcelog: STATUS 88020a8a MCGSTATUS 0
Jan  5 10:32:09 hafza mcelog: MCGCAP c0a APICID 3 SOCKETID 0
Jan  5 10:32:09 hafza mcelog: CPUID Vendor Intel Family 6 Model 70
Jan  5 10:32:09 hafza mcelog: mcelog: warning: 16 bytes ignored in each record
Jan  5 10:32:09 hafza mcelog: mcelog: consider an update
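
The STATUS value in these thermal records is the raw thermal status MSR rather than an MCi_STATUS word, so it decodes differently. The sketch below assumes the IA32_THERM_STATUS layout from the Intel SDM for the handful of bits it touches; `decode_therm_status` is a hypothetical helper name.

```python
# Sketch only: bit positions assume the IA32_THERM_STATUS layout
# (bit 0 matches THERM_STATUS_PROCHOT and bit 10 THERM_STATUS_POWER_LIMIT
# in the kernel headers); the function name is hypothetical.
def decode_therm_status(val):
    """Decode a few fields of a 32-bit IA32_THERM_STATUS value."""
    return {
        "throttling_now": bool(val & (1 << 0)),    # thermal status
        "throttling_log": bool(val & (1 << 1)),    # sticky: throttled earlier
        "power_limit_now": bool(val & (1 << 10)),  # power limitation status
        "power_limit_log": bool(val & (1 << 11)),  # sticky power limit log
        "degrees_below_tjmax": (val >> 16) & 0x7F, # digital readout
        "reading_valid": bool(val & (1 << 31)),
    }


if __name__ == "__main__":
    for name, value in decode_therm_status(0x88020A8A).items():
        print(f"{name:20s} {value}")
```

For the value logged here, the current-throttling bit is clear while the sticky log bit is set, which fits the "below trip temperature. Throttling disabled" text printed by mcelog.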

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 21:03         ` Pandruvada, Srinivas
@ 2017-01-05 23:23           ` Alexander Alemayhu
  0 siblings, 0 replies; 49+ messages in thread
From: Alexander Alemayhu @ 2017-01-05 23:23 UTC (permalink / raw)
  To: Pandruvada, Srinivas
  Cc: Raj, Ashok, bp, Brown, Len, linux-kernel, Luck, Tony, linux,
	pmenzel, daniel

On Thu, Jan 05, 2017 at 09:03:10PM +0000, Pandruvada, Srinivas wrote:
> I suggest trying the following kernel command line option, if you were
> getting notifications to throttle from SMM before:
> 
> 	intel_pstate=support_acpi_ppc
> 
> opensuse doesn't start thermald by default.
> 

Used the suggested kernel command line change and made sure thermald[0] is
running.  I see no more mce* related errors and my fans are making less noise
during workloads.  Thank you :)

[0]: https://github.com/01org/thermal_daemon
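
To confirm that the option actually reached the running kernel, the boot command line can be inspected. This is a minimal sketch; `has_boot_option` is a hypothetical helper, and it simply looks for the exact token in `/proc/cmdline`.

```python
# Sketch only: has_boot_option() is a hypothetical helper name.
def has_boot_option(cmdline, option):
    """True if `option` appears as a whole whitespace-separated token."""
    return option in cmdline.split()


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            current = f.read()
    except OSError:
        current = ""  # not on Linux, or /proc is unavailable
    print(has_boot_option(current, "intel_pstate=support_acpi_ppc"))
```

Whether thermald itself is running has to be checked separately through the init system.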

-- 
Mit freundlichen Grüßen

Alexander Alemayhu

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:31     ` Borislav Petkov
  2017-01-05 20:43       ` Raj, Ashok
  2017-01-05 21:38       ` Alexander Alemayhu
@ 2017-01-05 23:28       ` Raj, Ashok
  2017-01-05 23:56         ` Borislav Petkov
  2 siblings, 1 reply; 49+ messages in thread
From: Raj, Ashok @ 2017-01-05 23:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, ashok.raj

Hi Boris

On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> > Not sure if it is related, but I am also seeing those messages on my
> > MacBookPro11,3:
> 
> Yours look to me like thermal throttling MCEs. And TBH we should
> not issue those as actual MCEs because they are not - they *signal*
> overheating condition only and should be handled differently.

After looking at the code, it seems these events are logged as MCEs but
actually originate from real LVT thermal event interrupts, via a fake
bank 128 (MCE_THERMAL_BANK). These are not real HW MCEs, but fake ones
created and logged as mcelog entries (arch/x86/kernel/cpu/mcheck/therm_throt.c).
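
A consumer of MCE records can tell the two kinds apart by bank number alone. The sketch below reuses the two constants as defined in arch/x86/include/asm/mce.h at the time; `describe_bank` is a hypothetical helper, not a kernel or mcelog function.

```python
# Constants mirror arch/x86/include/asm/mce.h of that era;
# describe_bank() is a hypothetical helper for illustration.
MCE_EXTENDED_BANK = 128                     # first software-defined bank
MCE_THERMAL_BANK = MCE_EXTENDED_BANK + 0    # fake bank used for thermal events


def describe_bank(bank):
    """Classify an MCE record by its bank number."""
    if bank == MCE_THERMAL_BANK:
        return "software-defined thermal event"
    if bank >= MCE_EXTENDED_BANK:
        return "software-defined bank"
    return "hardware MCA bank"


if __name__ == "__main__":
    for bank in (7, 128):
        print(bank, "->", describe_bank(bank))
```

Bank 7 from the XPS report is a real MCA bank, while the MacBook records carry the fake thermal bank.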

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 23:28       ` Raj, Ashok
@ 2017-01-05 23:56         ` Borislav Petkov
  2017-01-06  1:26           ` Raj, Ashok
  0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-05 23:56 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel

On Thu, Jan 05, 2017 at 03:28:00PM -0800, Raj, Ashok wrote:
> After looking at the code, seems like these events are logged as MCE's
> but are really picked from real lvt thermal event interrupts.  via a fake
> bank 128 for MCE_THERMAL. These are not really HW MCE's, but fake ones created 
> and logged as mcelog entries. (arch/x86/kernel/cpu/mcheck/therm_throt.c)

Right, we've done that since forever but I do think that it confuses
people. This thread is a case in point. I mean, we already scream:

	pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",

to dmesg, why do we have to log a fake MCE too?!

Hell, we even log an MCE when things go back to normal:

        if (old_event) {
                if (event == THERMAL_THROTTLING_EVENT)
                        pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
                                level == CORE_LEVEL ? "Core" : "Package");
                return 1;

And Alexander's log shows exactly that:

[ 6338.170924] CPU1: Core temperature above threshold, cpu clock throttled (total events = 21068)
[ 6338.170925] CPU5: Core temperature above threshold, cpu clock throttled (total events = 21068)
[ 6338.170928] CPU7: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170931] CPU4: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170932] CPU0: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170933] CPU6: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170935] CPU2: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170936] CPU3: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170937] CPU5: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170945] CPU1: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170947] mce_notify_irq: 1 callbacks suppressed
[ 6338.170948] mce: [Hardware Error]: Machine check events logged				<--- new event
[ 6338.171917] CPU1: Core temperature/speed normal
[ 6338.171918] CPU5: Core temperature/speed normal
[ 6338.171920] CPU4: Package temperature/speed normal
[ 6338.171920] CPU0: Package temperature/speed normal
[ 6338.171922] CPU2: Package temperature/speed normal
[ 6338.171923] CPU6: Package temperature/speed normal
[ 6338.171924] CPU3: Package temperature/speed normal
[ 6338.171925] CPU7: Package temperature/speed normal
[ 6338.171927] CPU5: Package temperature/speed normal
[ 6338.171929] CPU1: Package temperature/speed normal
[ 6338.171930] mce: [Hardware Error]: Machine check events logged				<--- old event
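
Why exactly two MCE records appear per interval becomes clearer with a toy model of the rate limiting. The following is a simplified approximation based on a reading of therm_throt_process() in kernels of that era, with seconds standing in for jiffies; the `ThermalState` class and the Python port are illustrative assumptions, not the kernel code.

```python
# Simplified model, not the kernel implementation. CHECK_INTERVAL mirrors
# the kernel's 300 * HZ; time is plain seconds here.
CHECK_INTERVAL = 300.0


class ThermalState:
    """Per-CPU, per-event bookkeeping (hypothetical stand-in for the C struct)."""

    def __init__(self):
        self.new_event = False
        self.count = 0
        self.last_count = 0
        self.next_check = 0.0


def therm_throt_process(state, new_event, now):
    """Return True when the event is logged (and, before the patch, sent to mcelog)."""
    old_event = state.new_event
    state.new_event = new_event
    if new_event:
        state.count += 1

    # Inside the rate-limit window with new events since the last log: stay quiet.
    if now < state.next_check and state.count != state.last_count:
        return False

    state.next_check = now + CHECK_INTERVAL
    state.last_count = state.count

    if new_event:
        return True   # "temperature above threshold, cpu clock throttled"
    if old_event:
        return True   # "temperature/speed normal"
    return False


if __name__ == "__main__":
    s = ThermalState()
    for t, hot in [(0.0, True), (0.001, False), (1.0, True), (1.001, False),
                   (301.0, True), (301.001, False)]:
        print(t, hot, therm_throt_process(s, hot, t))
```

In this model the first "hot" event and the immediately following "normal" event are both logged, everything else inside the window is suppressed, and the pattern repeats once the window expires, which would match the pairs of messages in the log.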

Oh, and it's not like the user can do anything - there's a thermald
which is supposed to deal with all that. Which is not really
trouble-free either, TBH. What happens if that thing dies? Fried CPU?

So I say we should rip out that mce_log_therm_throt_event() and never
ever handle thermal events with MCEs. It is a bad idea.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 23:56         ` Borislav Petkov
@ 2017-01-06  1:26           ` Raj, Ashok
  2017-01-06 11:16             ` Borislav Petkov
  0 siblings, 1 reply; 49+ messages in thread
From: Raj, Ashok @ 2017-01-06  1:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, ashok.raj

On Fri, Jan 06, 2017 at 12:56:11AM +0100, Borislav Petkov wrote:
> Oh, and it's not like the user can do anything - there's a thermald
> which is supposed to deal with all that. Which is not really
> trouble-free either, TBH. What happens if that thing dies? Fried CPU?
> 
> So I say we should rip out that mce_log_therm_throt_event() and never
> ever handle thermal events with MCEs. It is a bad idea.

Agreed. Since we have both a log and another agent to deal with it, there's
no good reason to continue... Will pass this along, and have someone look at
cleaning this up.


Cheers,
Ashok

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06  1:26           ` Raj, Ashok
@ 2017-01-06 11:16             ` Borislav Petkov
  2017-01-06 15:58               ` Raj, Ashok
  0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-06 11:16 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas

On Thu, Jan 05, 2017 at 05:26:17PM -0800, Raj, Ashok wrote:
> Agree, since we have both a log and another agent to deal with it, it makes
> no good reason to continue... Will pass this along, and have someone look at
> cleaning this up.

Like this?

---
From: Borislav Petkov <bp@suse.de>
Date: Fri, 6 Jan 2017 12:07:08 +0100
Subject: [PATCH] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And besides
we have other means for dealing with thermal events which are much more
suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  9 ++++-----
 3 files changed, 4 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a6c0a2..a09ed05725c2 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef43233e03..6eef6fde0f02 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8be009..109fbb25c851 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -365,10 +365,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
-- 
2.11.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 11:16             ` Borislav Petkov
@ 2017-01-06 15:58               ` Raj, Ashok
  2017-01-06 16:54                 ` Borislav Petkov
  0 siblings, 1 reply; 49+ messages in thread
From: Raj, Ashok @ 2017-01-06 15:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas, ashok.raj

Hi Boris

On Fri, Jan 06, 2017 at 12:16:17PM +0100, Borislav Petkov wrote:
> On Thu, Jan 05, 2017 at 05:26:17PM -0800, Raj, Ashok wrote:
> > Agree, since we have both a log and another agent to deal with it, it makes
> > no good reason to continue... Will pass this along, and have someone look at
> > cleaning this up.
> 
> Like this?

That was quick :-).

> -	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
> -				THERMAL_THROTTLING_EVENT,
> -				CORE_LEVEL) != 0)

Looks like we don't need a return value from therm_throt_process(),
we can fix that as void as well.

Otherwise it looks good. 

> +	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
> +			    THERMAL_THROTTLING_EVENT,
> +			    CORE_LEVEL);
>  

Cheers,
Ashok

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 15:58               ` Raj, Ashok
@ 2017-01-06 16:54                 ` Borislav Petkov
  2017-01-06 17:04                   ` Raj, Ashok
  2017-01-09 10:55                   ` Paul Menzel
  0 siblings, 2 replies; 49+ messages in thread
From: Borislav Petkov @ 2017-01-06 16:54 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas

On Fri, Jan 06, 2017 at 07:58:31AM -0800, Raj, Ashok wrote:
> Looks like we don't need a return value from therm_throt_process(),
> we can fix that as void as well.

Right you are, here's v2:

---
From a8151fa6f18c2605eb7972061234f05e79b372c4 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp@suse.de>
Date: Fri, 6 Jan 2017 12:07:08 +0100
Subject: [PATCH] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And besides
we have other means for dealing with thermal events which are much more
suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
---

v2: Ashok: make therm_throt_process() void.

 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
 3 files changed, 11 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a6c0a2..a09ed05725c2 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef43233e03..6eef6fde0f02 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8be009..85469f84c921 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -141,13 +141,8 @@ static struct attribute_group thermal_attr_group = {
  * IRQ has been acknowledged.
  *
  * It will take care of rate limiting and printing messages to the syslog.
- *
- * Returns: 0 : Event should NOT be further logged, i.e. still in
- *              "timeout" from previous log message.
- *          1 : Event should be logged further, and a message has been
- *              printed to the syslog.
  */
-static int therm_throt_process(bool new_event, int event, int level)
+static void therm_throt_process(bool new_event, int event, int level)
 {
 	struct _thermal_state *state;
 	unsigned int this_cpu = smp_processor_id();
@@ -162,16 +157,16 @@ static int therm_throt_process(bool new_event, int event, int level)
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->core_power_limit;
 		else
-			 return 0;
+			return;
 	} else if (level == PACKAGE_LEVEL) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			state = &pstate->package_throttle;
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->package_power_limit;
 		else
-			return 0;
+			return;
 	} else
-		return 0;
+		return;
 
 	old_event = state->new_event;
 	state->new_event = new_event;
@@ -181,7 +176,7 @@ static int therm_throt_process(bool new_event, int event, int level)
 
 	if (time_before64(now, state->next_check) &&
 			state->count != state->last_count)
-		return 0;
+		return;
 
 	state->next_check = now + CHECK_INTERVAL;
 	state->last_count = state->count;
@@ -193,16 +188,14 @@ static int therm_throt_process(bool new_event, int event, int level)
 				this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package",
 				state->count);
-		return 1;
+		return;
 	}
 	if (old_event) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package");
-		return 1;
+		return;
 	}
-
-	return 0;
 }
 
 static int thresh_event_valid(int level, int event)
@@ -365,10 +358,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
-- 
2.11.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 16:54                 ` Borislav Petkov
@ 2017-01-06 17:04                   ` Raj, Ashok
  2017-01-09 10:55                   ` Paul Menzel
  1 sibling, 0 replies; 49+ messages in thread
From: Raj, Ashok @ 2017-01-06 17:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas, ashok.raj

Hi Boris

This looks good to me!

On Fri, Jan 06, 2017 at 05:54:23PM +0100, Borislav Petkov wrote:
> On Fri, Jan 06, 2017 at 07:58:31AM -0800, Raj, Ashok wrote:
> > Looks like we don't need a return value from therm_throt_process(),
> > we can fix that as void as well.
> 
> Right you are, here's v2:
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
> 
> v2: Ashok: make therm_throt_process() void.

Acked-by: Ashok Raj <ashok.raj@intel.com>
> 
>  arch/x86/include/asm/mce.h               |  6 ------
>  arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
>  arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
>  3 files changed, 11 insertions(+), 50 deletions(-)

Cheers,
Ashok


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 16:54                 ` Borislav Petkov
  2017-01-06 17:04                   ` Raj, Ashok
@ 2017-01-09 10:55                   ` Paul Menzel
  2017-01-09 11:05                     ` Borislav Petkov
  1 sibling, 1 reply; 49+ messages in thread
From: Paul Menzel @ 2017-01-09 10:55 UTC (permalink / raw)
  To: Borislav Petkov, Ashok Raj
  Cc: Alexander Alemayhu, Daniel J Blueman, tony.luck, linux,
	len.brown, Linux Kernel, Pandruvada, Srinivas

On 01/06/17 17:54, Borislav Petkov wrote:
> On Fri, Jan 06, 2017 at 07:58:31AM -0800, Raj, Ashok wrote:
>> Looks like we don't need a return value from therm_throt_process(),
>> we can fix that as void as well.
>
> Right you are, here's v2:
>
> ---
> From a8151fa6f18c2605eb7972061234f05e79b372c4 Mon Sep 17 00:00:00 2001
> From: Borislav Petkov <bp@suse.de>
> Date: Fri, 6 Jan 2017 12:07:08 +0100
> Subject: [PATCH] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event
>
> We log a fake bank 128 MCE to note that we're handling a CPU thermal
> event. However, this confuses people into thinking that their hardware
> generates MCEs. Hijacking MCA for logging thermal events is a gross
> misuse anyway and it should've been done in the first place. And besides

Do you mean *shouldn’t have been done*?

> we have other means for dealing with thermal events which are much more
> suitable.
>
> So let's kill the MCE logging part.
>
> Signed-off-by: Borislav Petkov <bp@suse.de>

Should the discussion be referenced?

Also, is that just for MacBookPro11,3? The MCE for the Dell XPS13 looks 
different from what I see, doesn’t it?

[…]


Kind regards,

Paul


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 10:55                   ` Paul Menzel
@ 2017-01-09 11:05                     ` Borislav Petkov
  2017-01-09 11:11                       ` Paul Menzel
  0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-09 11:05 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Ashok Raj, Alexander Alemayhu, Daniel J Blueman, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas

On Mon, Jan 09, 2017 at 11:55:41AM +0100, Paul Menzel wrote:
> Do you mean *shouldn’t have been done*?

Yes.

> Should the discussion be referenced?

Yap, it will be.

> Also, is that just for MacBookPro11,3? The MCE for the Dell XPS13 looks
> different from what I see, doesn’t it?

Yes, yours is different. I'm still waiting for you to reply to Ashok's
questions here:

https://lkml.kernel.org/r/20170105011236.GA80100@otc-brkl-03

Thanks.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 11:05                     ` Borislav Petkov
@ 2017-01-09 11:11                       ` Paul Menzel
  0 siblings, 0 replies; 49+ messages in thread
From: Paul Menzel @ 2017-01-09 11:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ashok Raj, Alexander Alemayhu, Daniel J Blueman, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas,
	Daniel Blueman

Dear Boris,


On 01/09/17 12:05, Borislav Petkov wrote:
> On Mon, Jan 09, 2017 at 11:55:41AM +0100, Paul Menzel wrote:

[…]

>> Also, is that just for MacBookPro11,3? The MCE for the Dell XPS13 looks
>> different from what I see, doesn’t it?
>
> Yes, yours is different. I'm still waiting for you to reply to Ashok's
> questions here:
>
> https://lkml.kernel.org/r/20170105011236.GA80100@otc-brkl-03

I see. I thought Daniel had answered them. I’ll reply now.


Kind regards,

Paul


* [PATCH 0/9] x86/RAS: Queue for 4.11
@ 2017-01-23 18:35 Borislav Petkov
  2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
                   ` (8 more replies)
  0 siblings, 9 replies; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Hi,

here's the stuff which got ready in time. The more exciting things are
going to wait for the next merge window. :-)

Please apply,
thanks.

Borislav Petkov (8):
  x86/mce-inject: Make it depend on X86_LOCAL_APIC
  x86/MCE/therm_throt: Do not log a fake MCE for a thermal event
  x86/MCE: Flip the TSC-adding logic
  x86/ras/mce_amd_inj: Change dependency
  EDAC/mce_amd: Unexport amd_decode_mce()
  EDAC/mce_amd: Dump TSC value
  x86/MCE: Get rid of mce_process_work()
  x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority

Yazen Ghannam (1):
  x86/MCE/AMD: Make sysfs names of banks more user-friendly

 arch/x86/Kconfig                          |  2 +-
 arch/x86/include/asm/mce.h                | 20 ++++++-----
 arch/x86/kernel/cpu/mcheck/mce-apei.c     |  5 ++-
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c   |  5 +--
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c          | 57 ++++---------------------------
 arch/x86/kernel/cpu/mcheck/mce_amd.c      |  9 +++--
 arch/x86/kernel/cpu/mcheck/therm_throt.c  | 30 ++++++----------
 arch/x86/ras/Kconfig                      |  2 +-
 drivers/acpi/acpi_extlog.c                |  1 +
 drivers/acpi/nfit/mce.c                   |  1 +
 drivers/edac/i7core_edac.c                |  1 +
 drivers/edac/mce_amd.c                    |  8 +++--
 drivers/edac/mce_amd.h                    |  1 -
 drivers/edac/sb_edac.c                    |  3 +-
 drivers/edac/skx_edac.c                   |  3 +-
 17 files changed, 59 insertions(+), 93 deletions(-)

-- 
2.11.0


* [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:46   ` [tip:ras/core] x86/ras/inject: Make it depend on X86_LOCAL_APIC=y tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event Borislav Petkov
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

... and get rid of the annoying:

  arch/x86/kernel/cpu/mcheck/mce-inject.c:97:13: warning: ‘mce_irq_ipi’
  defined but not used [-Wunused-function]
   static void mce_irq_ipi(void *info

when doing randconfig builds.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/Kconfig                        | 2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c | 5 +----
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e487493bbd47..7b6fd68b4715 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1070,7 +1070,7 @@ config X86_MCE_THRESHOLD
 	def_bool y
 
 config X86_MCE_INJECT
-	depends on X86_MCE
+	depends on X86_MCE && X86_LOCAL_APIC
 	tristate "Machine check injector support"
 	---help---
 	  Provide support for injecting machine checks for testing purposes.
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
index 517619ea6498..99165b206df3 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
@@ -152,7 +152,6 @@ static void raise_mce(struct mce *m)
 	if (context == MCJ_CTX_RANDOM)
 		return;
 
-#ifdef CONFIG_X86_LOCAL_APIC
 	if (m->inject_flags & (MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST)) {
 		unsigned long start;
 		int cpu;
@@ -192,9 +191,7 @@ static void raise_mce(struct mce *m)
 		raise_local();
 		put_cpu();
 		put_online_cpus();
-	} else
-#endif
-	{
+	} else {
 		preempt_disable();
 		raise_local();
 		preempt_enable();
-- 
2.11.0


* [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
  2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:47   ` [tip:ras/core] x86/ras/therm_throt: Do not log a fake MCE for thermal events tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly Borislav Petkov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And
besides we have other means for dealing with thermal events which are
much more suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ashok Raj <ashok.raj@intel.com>
Link: http://lkml.kernel.org/r/20170105213846.GA12024@gmail.com
---
 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
 3 files changed, 11 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a6c0a2..a09ed05725c2 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef43233e03..6eef6fde0f02 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8be009..85469f84c921 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -141,13 +141,8 @@ static struct attribute_group thermal_attr_group = {
  * IRQ has been acknowledged.
  *
  * It will take care of rate limiting and printing messages to the syslog.
- *
- * Returns: 0 : Event should NOT be further logged, i.e. still in
- *              "timeout" from previous log message.
- *          1 : Event should be logged further, and a message has been
- *              printed to the syslog.
  */
-static int therm_throt_process(bool new_event, int event, int level)
+static void therm_throt_process(bool new_event, int event, int level)
 {
 	struct _thermal_state *state;
 	unsigned int this_cpu = smp_processor_id();
@@ -162,16 +157,16 @@ static int therm_throt_process(bool new_event, int event, int level)
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->core_power_limit;
 		else
-			 return 0;
+			return;
 	} else if (level == PACKAGE_LEVEL) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			state = &pstate->package_throttle;
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->package_power_limit;
 		else
-			return 0;
+			return;
 	} else
-		return 0;
+		return;
 
 	old_event = state->new_event;
 	state->new_event = new_event;
@@ -181,7 +176,7 @@ static int therm_throt_process(bool new_event, int event, int level)
 
 	if (time_before64(now, state->next_check) &&
 			state->count != state->last_count)
-		return 0;
+		return;
 
 	state->next_check = now + CHECK_INTERVAL;
 	state->last_count = state->count;
@@ -193,16 +188,14 @@ static int therm_throt_process(bool new_event, int event, int level)
 				this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package",
 				state->count);
-		return 1;
+		return;
 	}
 	if (old_event) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package");
-		return 1;
+		return;
 	}
-
-	return 0;
 }
 
 static int thresh_event_valid(int level, int event)
@@ -365,10 +358,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
-- 
2.11.0
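
Note for readers of the thread: the rate limiting that
therm_throt_process() keeps after this patch can be sketched in
userspace C roughly as below. This is a minimal sketch under stated
assumptions, not the kernel code: struct throt_sketch,
CHECK_INTERVAL_SKETCH and the two message counters are hypothetical
stand-ins, the place where the event count is incremented is assumed
(that part lies between the hunks shown above), and a plain "<"
replaces time_before64(), so counter wraparound is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHECK_INTERVAL_SKETCH 300	/* hypothetical window, arbitrary time units */

/* Hypothetical stand-in for the kernel's per-CPU struct _thermal_state. */
struct throt_sketch {
	bool active;			/* was the last sample an event? */
	unsigned long count;		/* total events seen so far */
	unsigned long last_count;	/* count at the time of the last message */
	uint64_t next_check;		/* earliest time for the next message */
	unsigned int throttle_msgs;	/* stands in for the "throttled" message */
	unsigned int normal_msgs;	/* stands in for the "normal" message */
};

/* Returns nothing, like the patched function: callers need no verdict. */
static void throt_process(struct throt_sketch *s, bool new_event, uint64_t now)
{
	bool old_event;

	assert(s != NULL);

	old_event = s->active;
	s->active = new_event;
	if (new_event)
		s->count++;	/* assumed: every event is counted, logged or not */

	/* Still inside the window and new events arrived: count, stay quiet. */
	if (now < s->next_check && s->count != s->last_count)
		return;

	s->next_check = now + CHECK_INTERVAL_SKETCH;
	s->last_count = s->count;

	if (new_event)
		s->throttle_msgs++;
	else if (old_event)
		s->normal_msgs++;
}
```

Feeding it events at times 0 and 10 bumps the count twice but records
only one message. The second event falls inside the window and is only
counted, which is why the event counter stays useful while the log is
rate limited.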


* [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
  2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
  2017-01-23 18:35 ` [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:47   ` [tip:ras/core] x86/ras/amd: " tip-bot for Yazen Ghannam
  2017-01-23 18:35 ` [PATCH 4/9] x86/MCE: Flip the TSC-adding logic Borislav Petkov
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Yazen Ghannam <Yazen.Ghannam@amd.com>

Currently, we append the MCA_IPID[InstanceId] to the bank name to create
the sysfs filename. The InstanceId field uniquely identifies a bank
instance but it doesn't look very nice for most banks.

Replace the InstanceId with a simpler, ascending (0, 1, ..) value.
Only use this in the sysfs name when there is more than 1 instance.
Otherwise, just use the bank's name as the sysfs name.

Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1484322741-41884-3-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h           | 5 +++--
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 6 +++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index a09ed05725c2..528f6ec897cb 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -356,12 +356,13 @@ struct smca_hwid {
 	unsigned int bank_type;	/* Use with smca_bank_types for easy indexing. */
 	u32 hwid_mcatype;	/* (hwid,mcatype) tuple */
 	u32 xec_bitmap;		/* Bitmap of valid ExtErrorCodes; current max is 21. */
+	u8 count;		/* Number of instances. */
 };
 
 struct smca_bank {
 	struct smca_hwid *hwid;
-	/* Instance ID */
-	u32 id;
+	u32 id;			/* Value of MCA_IPID[InstanceId]. */
+	u8 sysfs_id;		/* Value used for sysfs name. */
 };
 
 extern struct smca_bank smca_banks[MAX_NR_BANKS];
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index a5fd137417a2..776379e4a39c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -192,6 +192,7 @@ static void get_smca_bank_info(unsigned int bank)
 
 			smca_banks[bank].hwid = s_hwid;
 			smca_banks[bank].id = instance_id;
+			smca_banks[bank].sysfs_id = s_hwid->count++;
 			break;
 		}
 	}
@@ -1064,9 +1065,12 @@ static const char *get_name(unsigned int bank, struct threshold_block *b)
 		return NULL;
 	}
 
+	if (smca_banks[bank].hwid->count == 1)
+		return smca_get_name(bank_type);
+
 	snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN,
 		 "%s_%x", smca_get_name(bank_type),
-			  smca_banks[bank].id);
+			  smca_banks[bank].sysfs_id);
 	return buf_mcatype;
 }
 
-- 
2.11.0
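
Note for readers of the thread: the naming scheme from the commit
message above can be sketched in userspace C roughly as below. This is
a minimal sketch, not the kernel code: struct hwid_sketch, struct
bank_sketch, bank_assign() and bank_sysfs_name() are hypothetical
stand-ins for struct smca_hwid, struct smca_bank and the two touched
kernel functions, and the type names a caller passes in are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct smca_hwid. */
struct hwid_sketch {
	const char *name;	/* bank type name */
	unsigned int count;	/* number of instances enumerated so far */
};

/* Hypothetical stand-in for struct smca_bank. */
struct bank_sketch {
	struct hwid_sketch *hwid;
	unsigned int id;	/* raw MCA_IPID[InstanceId] value */
	unsigned int sysfs_id;	/* ascending per-type index: 0, 1, .. */
};

/* Enumeration: hand out the next per-type index, as the patch does. */
static void bank_assign(struct bank_sketch *b, struct hwid_sketch *h,
			unsigned int instance_id)
{
	assert(b != NULL && h != NULL);

	b->hwid = h;
	b->id = instance_id;
	b->sysfs_id = h->count++;
}

/* Naming: bare type name for a lone instance, "<name>_<index>" otherwise. */
static const char *bank_sysfs_name(const struct bank_sketch *b,
				   char *buf, size_t len)
{
	assert(b != NULL && b->hwid != NULL);

	if (b->hwid->count == 1)
		return b->hwid->name;

	snprintf(buf, len, "%s_%x", b->hwid->name, b->sysfs_id);
	return buf;
}
```

Names only come out right once all banks have been enumerated, because
the "more than 1 instance" check looks at the final per-type count.
With two banks of one type and a single bank of another, the sketch
yields "<type>_0", "<type>_1" and the bare name of the second type.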


* [PATCH 4/9] x86/MCE: Flip the TSC-adding logic
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (2 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:48   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency Borislav Petkov
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Add the TSC value to the MCE record only when the MCE being logged is
precise, i.e., it is logged as an exception or an MCE-related interrupt.

Propagating a flag down to mce_setup() would mean touching/changing a
bunch of places. Instead, it is easier/less code to set ->tsc at the
call sites, depending on whether we are polling or preparing an MCE
record in an exception handler/thresholding interrupt.

For example, the mce-apei.c case sets ->tsc only for errors of panic
severity. The idea there is that panic errors will have raised an #MC
instead of having been found by polling.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/mce-apei.c |  5 ++++-
 arch/x86/kernel/cpu/mcheck/mce.c      | 12 +++---------
 arch/x86/kernel/cpu/mcheck/mce_amd.c  |  3 ++-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index 83f1a98d37db..2eee85379689 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -52,8 +52,11 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 
 	if (severity >= GHES_SEV_RECOVERABLE)
 		m.status |= MCI_STATUS_UC;
-	if (severity >= GHES_SEV_PANIC)
+
+	if (severity >= GHES_SEV_PANIC) {
 		m.status |= MCI_STATUS_PCC;
+		m.tsc = rdtsc();
+	}
 
 	m.addr = mem_err->physical_addr;
 	mce_log(&m);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6eef6fde0f02..ca15a7e1f97d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -128,7 +128,6 @@ void mce_setup(struct mce *m)
 {
 	memset(m, 0, sizeof(struct mce));
 	m->cpu = m->extcpu = smp_processor_id();
-	m->tsc = rdtsc();
 	/* We hope get_seconds stays lockless */
 	m->time = get_seconds();
 	m->cpuvendor = boot_cpu_data.x86_vendor;
@@ -710,14 +709,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 
 	mce_gather_info(&m, NULL);
 
-	/*
-	 * m.tsc was set in mce_setup(). Clear it if not requested.
-	 *
-	 * FIXME: Propagate @flags to mce_gather_info/mce_setup() to avoid
-	 *	  that dance.
-	 */
-	if (!(flags & MCP_TIMESTAMP))
-		m.tsc = 0;
+	if (flags & MCP_TIMESTAMP)
+		m.tsc = rdtsc();
 
 	for (i = 0; i < mca_cfg.banks; i++) {
 		if (!mce_banks[i].ctl || !test_bit(i, *b))
@@ -1156,6 +1149,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		goto out;
 
 	mce_gather_info(&m, regs);
+	m.tsc = rdtsc();
 
 	final = this_cpu_ptr(&mces_seen);
 	*final = m;
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 776379e4a39c..9e5427df3243 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -778,7 +778,8 @@ __log_error(unsigned int bank, bool deferred_err, bool threshold_err, u64 misc)
 	mce_setup(&m);
 
 	m.status = status;
-	m.bank = bank;
+	m.bank   = bank;
+	m.tsc	 = rdtsc();
 
 	if (threshold_err)
 		m.misc = misc;
-- 
2.11.0
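
Note for readers of the thread: the flipped logic can be sketched in
userspace C roughly as below. The setup helper no longer takes a
timestamp; each call site decides. This is a minimal sketch, not the
kernel code: struct rec_sketch, fake_rdtsc(), POLL_TIMESTAMP and the
rec_*() helpers are hypothetical stand-ins for struct mce, rdtsc(),
MCP_TIMESTAMP and the kernel call sites.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POLL_TIMESTAMP 0x1	/* hypothetical stand-in for MCP_TIMESTAMP */

/* Hypothetical stand-in for struct mce. */
struct rec_sketch {
	uint64_t tsc;
	uint64_t status;
};

/* Hypothetical stand-in for rdtsc(): a counter that is never 0. */
static uint64_t fake_rdtsc(void)
{
	static uint64_t ticks;

	return ++ticks;
}

/* Like mce_setup() after the patch: zero the record, take no timestamp. */
static void rec_setup(struct rec_sketch *r)
{
	assert(r != NULL);
	memset(r, 0, sizeof(*r));
}

/* Polling path: stamp only when the caller asked for it. */
static void rec_poll(struct rec_sketch *r, unsigned int flags)
{
	rec_setup(r);
	if (flags & POLL_TIMESTAMP)
		r->tsc = fake_rdtsc();
}

/* Exception path: the event is precise, so always stamp. */
static void rec_exception(struct rec_sketch *r)
{
	rec_setup(r);
	r->tsc = fake_rdtsc();
}
```

A record from rec_poll() without the flag keeps tsc == 0, which a
consumer can treat as "no precise timestamp available".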


* [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (3 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 4/9] x86/MCE: Flip the TSC-adding logic Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:48   ` [tip:ras/core] x86/ras/amd/inj: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce() Borislav Petkov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Change dependency to mce.c as we're using mce_inject_log() now to stick
an MCE into the MCA subsystem.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/ras/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig
index d957d5f21a86..0bc60a308730 100644
--- a/arch/x86/ras/Kconfig
+++ b/arch/x86/ras/Kconfig
@@ -1,6 +1,6 @@
 config MCE_AMD_INJ
 	tristate "Simple MCE injection interface for AMD processors"
-	depends on RAS && EDAC_DECODE_MCE && DEBUG_FS && AMD_NB
+	depends on RAS && X86_MCE && DEBUG_FS && AMD_NB
 	default n
 	help
 	  This is a simple debugfs interface to inject MCEs and test different
-- 
2.11.0


* [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce()
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (4 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:49   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 7/9] EDAC/mce_amd: Dump TSC value Borislav Petkov
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

It is not used outside of the driver anymore.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 drivers/edac/mce_amd.c | 4 ++--
 drivers/edac/mce_amd.h | 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 34208f38c5b1..5cd3c39bc695 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -942,7 +942,8 @@ static const char *decode_error_status(struct mce *m)
 	return "Corrected error, no action required.";
 }
 
-int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
+static int
+amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 {
 	struct mce *m = (struct mce *)data;
 	struct cpuinfo_x86 *c = &cpu_data(m->extcpu);
@@ -1047,7 +1048,6 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 	return NOTIFY_STOP;
 }
-EXPORT_SYMBOL_GPL(amd_decode_mce);
 
 static struct notifier_block amd_mce_dec_nb = {
 	.notifier_call	= amd_decode_mce,
diff --git a/drivers/edac/mce_amd.h b/drivers/edac/mce_amd.h
index c2359a1ea6b3..0b6a68673e0e 100644
--- a/drivers/edac/mce_amd.h
+++ b/drivers/edac/mce_amd.h
@@ -79,6 +79,5 @@ struct amd_decoder_ops {
 void amd_report_gart_errors(bool);
 void amd_register_ecc_decoder(void (*f)(int, struct mce *));
 void amd_unregister_ecc_decoder(void (*f)(int, struct mce *));
-int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data);
 
 #endif /* _EDAC_MCE_AMD_H */
-- 
2.11.0


* [PATCH 7/9] EDAC/mce_amd: Dump TSC value
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (5 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce() Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:50   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 8/9] x86/MCE: Get rid of mce_process_work() Borislav Petkov
  2017-01-23 18:35 ` [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority Borislav Petkov
  8 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Dump the TSC value of the time when the MCE got logged.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 drivers/edac/mce_amd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 5cd3c39bc695..ecad750fd090 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1007,6 +1007,9 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 	} else
 		pr_cont("\n");
 
+	if (m->tsc)
+		pr_emerg(HW_ERR "TSC: %llu\n", m->tsc);
+
 	if (!fam_ops)
 		goto err_code;
 
-- 
2.11.0


* [PATCH 8/9] x86/MCE: Get rid of mce_process_work()
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (6 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 7/9] EDAC/mce_amd: Dump TSC value Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:50   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority Borislav Petkov
  8 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Make mce_gen_pool_process() the workqueue function directly and save us
an indirection.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
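A rough userspace model of the indirection being removed, in case it helps review. The struct and helpers below are simplified stand-ins (assumptions, not the kernel's workqueue API): once the processing function takes the work pointer itself and ignores it, it can be registered as the callback directly and the wrapper goes away.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct work_struct. */
struct work_struct;
typedef void (*work_func_t)(struct work_struct *work);

struct work_struct {
	work_func_t func;
};

/* Stand-in for INIT_WORK(): only records the callback. */
static void init_work(struct work_struct *w, work_func_t fn)
{
	w->func = fn;
}

/* Stand-in for the workqueue core running a queued item. */
static void run_work(struct work_struct *w)
{
	assert(w && w->func);
	w->func(w);
}

static int records_processed;	/* illustration only */

/*
 * The processing function has the callback signature and ignores its
 * argument, so no one-line wrapper is needed to adapt it.
 */
static void gen_pool_process(struct work_struct *unused)
{
	(void)unused;
	records_processed++;
}
```

The cost is an unused parameter on the processing function; the gain is one function and one call level less.
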
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c          | 12 +-----------
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
index 93d824ec3120..1e5a50c11d3c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-genpool.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -72,7 +72,7 @@ struct llist_node *mce_gen_pool_prepare_records(void)
 	return new_head.first;
 }
 
-void mce_gen_pool_process(void)
+void mce_gen_pool_process(struct work_struct *__unused)
 {
 	struct llist_node *head;
 	struct mce_evt_llist *node, *tmp;
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index cd74a3f00aea..903043e6a62b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -31,7 +31,7 @@ struct mce_evt_llist {
 	struct mce mce;
 };
 
-void mce_gen_pool_process(void);
+void mce_gen_pool_process(struct work_struct *__unused);
 bool mce_gen_pool_empty(void);
 int mce_gen_pool_add(struct mce *mce);
 int mce_gen_pool_init(void);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ca15a7e1f97d..0fef5406f0eb 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1316,16 +1316,6 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 #endif
 
 /*
- * Action optional processing happens here (picking up
- * from the list of faulting pages that do_machine_check()
- * placed into the genpool).
- */
-static void mce_process_work(struct work_struct *dummy)
-{
-	mce_gen_pool_process();
-}
-
-/*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
  * errors, poll 2x slower (up to check_interval seconds).
@@ -2165,7 +2155,7 @@ int __init mcheck_init(void)
 	mce_register_decode_chain(&mce_default_nb);
 	mcheck_vendor_init_severity();
 
-	INIT_WORK(&mce_work, mce_process_work);
+	INIT_WORK(&mce_work, mce_gen_pool_process);
 	init_irq_work(&mce_irq_work, mce_irq_work_cb);
 
 	return 0;
-- 
2.11.0

* [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (7 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 8/9] x86/MCE: Get rid of mce_process_work() Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:51   ` [tip:ras/core] x86/ras, " tip-bot for Borislav Petkov
  8 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Assign all notifiers on the MCE decode chain a priority so that they get
called in the correct order.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
---
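To make the ordering concrete, here is a minimal userspace sketch of a priority-sorted chain. It assumes the chain inserts a new block ahead of the first entry with a strictly lower priority, so equal priorities keep registration order. struct nb, chain_register() and prio_is_unassigned() are hypothetical names; only the enum values are taken from the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Values copied from the patch; the enum name is illustrative. */
enum prio {
	PRIO_SRAO	= INT_MAX,
	PRIO_EXTLOG	= INT_MAX - 1,
	PRIO_NFIT	= INT_MAX - 2,
	PRIO_EDAC	= INT_MAX - 3,
	PRIO_LOWEST	= 0,
};

/* Hypothetical stand-in for struct notifier_block. */
struct nb {
	const char *name;
	int priority;
	struct nb *next;
};

/* Insert before the first entry with strictly lower priority. */
static void chain_register(struct nb **head, struct nb *n)
{
	assert(head && n);

	while (*head && n->priority <= (*head)->priority)
		head = &(*head)->next;

	n->next = *head;
	*head = n;
}

/* Same condition as the WARN_ON() added to mce_register_decode_chain(). */
static int prio_is_unassigned(int prio)
{
	return prio > PRIO_LOWEST && prio < PRIO_EDAC;
}
```

Whatever the registration order, walking the chain from the head then visits SRAO, EXTLOG, NFIT, EDAC and finally the default notifier. The warning condition only catches priorities strictly between the lowest and the EDAC slot.
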
 arch/x86/include/asm/mce.h       | 9 +++++++++
 arch/x86/kernel/cpu/mcheck/mce.c | 8 +++-----
 drivers/acpi/acpi_extlog.c       | 1 +
 drivers/acpi/nfit/mce.c          | 1 +
 drivers/edac/i7core_edac.c       | 1 +
 drivers/edac/mce_amd.c           | 1 +
 drivers/edac/sb_edac.c           | 3 ++-
 drivers/edac/skx_edac.c          | 3 ++-
 8 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 528f6ec897cb..e63873683d4a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -189,6 +189,15 @@ extern struct mce_vendor_flags mce_flags;
 
 extern struct mca_config mca_cfg;
 extern struct mca_msr_regs msr_ops;
+
+enum mce_notifier_prios {
+	MCE_PRIO_SRAO		= INT_MAX,
+	MCE_PRIO_EXTLOG		= INT_MAX - 1,
+	MCE_PRIO_NFIT		= INT_MAX - 2,
+	MCE_PRIO_EDAC		= INT_MAX - 3,
+	MCE_PRIO_LOWEST		= 0,
+};
+
 extern void mce_register_decode_chain(struct notifier_block *nb);
 extern void mce_unregister_decode_chain(struct notifier_block *nb);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 0fef5406f0eb..e39bbc0e7c8b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -216,9 +216,7 @@ void mce_register_decode_chain(struct notifier_block *nb)
 {
 	atomic_inc(&num_notifiers);
 
-	/* Ensure SRAO notifier has the highest priority in the decode chain. */
-	if (nb != &mce_srao_nb && nb->priority == INT_MAX)
-		nb->priority -= 1;
+	WARN_ON(nb->priority > MCE_PRIO_LOWEST && nb->priority < MCE_PRIO_EDAC);
 
 	atomic_notifier_chain_register(&x86_mce_decoder_chain, nb);
 }
@@ -582,7 +580,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
 }
 static struct notifier_block mce_srao_nb = {
 	.notifier_call	= srao_decode_notifier,
-	.priority = INT_MAX,
+	.priority	= MCE_PRIO_SRAO,
 };
 
 static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
@@ -608,7 +606,7 @@ static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
 static struct notifier_block mce_default_nb = {
 	.notifier_call	= mce_default_notifier,
 	/* lowest prio, we want it to run last. */
-	.priority	= 0,
+	.priority	= MCE_PRIO_LOWEST,
 };
 
 /*
diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index b3842ffc19ba..a15270a806fc 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -212,6 +212,7 @@ static bool __init extlog_get_l1addr(void)
 }
 static struct notifier_block extlog_mce_dec = {
 	.notifier_call	= extlog_print,
+	.priority	= MCE_PRIO_EXTLOG,
 };
 
 static int __init extlog_init(void)
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index e5ce81c38eed..3ba1c3472cf9 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -90,6 +90,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 
 static struct notifier_block nfit_mce_dec = {
 	.notifier_call	= nfit_handle_mce,
+	.priority	= MCE_PRIO_NFIT,
 };
 
 void nfit_mce_register(void)
diff --git a/drivers/edac/i7core_edac.c b/drivers/edac/i7core_edac.c
index 69b5adead0ad..75ad847593b7 100644
--- a/drivers/edac/i7core_edac.c
+++ b/drivers/edac/i7core_edac.c
@@ -1835,6 +1835,7 @@ static int i7core_mce_check_error(struct notifier_block *nb, unsigned long val,
 
 static struct notifier_block i7_mce_dec = {
 	.notifier_call	= i7core_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 struct memdev_dmi_entry {
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index ecad750fd090..0d9bc25543d8 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1054,6 +1054,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 static struct notifier_block amd_mce_dec_nb = {
 	.notifier_call	= amd_decode_mce,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 static int __init mce_amd_init(void)
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 54ae6dc45ab2..c585a014dd3d 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -3136,7 +3136,8 @@ static int sbridge_mce_check_error(struct notifier_block *nb, unsigned long val,
 }
 
 static struct notifier_block sbridge_mce_dec = {
-	.notifier_call      = sbridge_mce_check_error,
+	.notifier_call	= sbridge_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 /****************************************************************************
diff --git a/drivers/edac/skx_edac.c b/drivers/edac/skx_edac.c
index 79ef675e4d6f..1159dba4671f 100644
--- a/drivers/edac/skx_edac.c
+++ b/drivers/edac/skx_edac.c
@@ -1007,7 +1007,8 @@ static int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
 }
 
 static struct notifier_block skx_mce_dec = {
-	.notifier_call = skx_mce_check_error,
+	.notifier_call	= skx_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 static void skx_remove(void)
-- 
2.11.0

* [tip:ras/core] x86/ras/inject: Make it depend on X86_LOCAL_APIC=y
  2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
@ 2017-01-24  8:46   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, hpa, linux-kernel, peterz, torvalds, tglx, mingo,
	Yazen.Ghannam, linux-edac, tony.luck

Commit-ID:  d4b2ac63b0eae461fc10c9791084be24724ef57a
Gitweb:     http://git.kernel.org/tip/d4b2ac63b0eae461fc10c9791084be24724ef57a
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:06 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:52 +0100

x86/ras/inject: Make it depend on X86_LOCAL_APIC=y

... and get rid of the annoying:

  arch/x86/kernel/cpu/mcheck/mce-inject.c:97:13: warning: ‘mce_irq_ipi’ defined but not used [-Wunused-function]

when doing randconfig builds.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-2-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig                        | 2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c | 5 +----
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e487493..7b6fd68 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1070,7 +1070,7 @@ config X86_MCE_THRESHOLD
 	def_bool y
 
 config X86_MCE_INJECT
-	depends on X86_MCE
+	depends on X86_MCE && X86_LOCAL_APIC
 	tristate "Machine check injector support"
 	---help---
 	  Provide support for injecting machine checks for testing purposes.
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
index 517619e..99165b2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
@@ -152,7 +152,6 @@ static void raise_mce(struct mce *m)
 	if (context == MCJ_CTX_RANDOM)
 		return;
 
-#ifdef CONFIG_X86_LOCAL_APIC
 	if (m->inject_flags & (MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST)) {
 		unsigned long start;
 		int cpu;
@@ -192,9 +191,7 @@ static void raise_mce(struct mce *m)
 		raise_local();
 		put_cpu();
 		put_online_cpus();
-	} else
-#endif
-	{
+	} else {
 		preempt_disable();
 		raise_local();
 		preempt_enable();

* [tip:ras/core] x86/ras/therm_throt: Do not log a fake MCE for thermal events
  2017-01-23 18:35 ` [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event Borislav Petkov
@ 2017-01-24  8:47   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Yazen.Ghannam, torvalds, mingo, tglx, bp, linux-edac, ashok.raj,
	hpa, tony.luck, peterz, linux-kernel

Commit-ID:  9b052ea4ced0fa1ad30a2eafe86984a16297e6f1
Gitweb:     http://git.kernel.org/tip/9b052ea4ced0fa1ad30a2eafe86984a16297e6f1
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:07 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:53 +0100

x86/ras/therm_throt: Do not log a fake MCE for thermal events

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And
besides we have other means for dealing with thermal events which are
much more suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ashok Raj <ashok.raj@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170105213846.GA12024@gmail.com
Link: http://lkml.kernel.org/r/20170123183514.13356-3-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
 3 files changed, 11 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a..a09ed05 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef432..6eef6fd 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8..85469f8 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -141,13 +141,8 @@ static struct attribute_group thermal_attr_group = {
  * IRQ has been acknowledged.
  *
  * It will take care of rate limiting and printing messages to the syslog.
- *
- * Returns: 0 : Event should NOT be further logged, i.e. still in
- *              "timeout" from previous log message.
- *          1 : Event should be logged further, and a message has been
- *              printed to the syslog.
  */
-static int therm_throt_process(bool new_event, int event, int level)
+static void therm_throt_process(bool new_event, int event, int level)
 {
 	struct _thermal_state *state;
 	unsigned int this_cpu = smp_processor_id();
@@ -162,16 +157,16 @@ static int therm_throt_process(bool new_event, int event, int level)
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->core_power_limit;
 		else
-			 return 0;
+			return;
 	} else if (level == PACKAGE_LEVEL) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			state = &pstate->package_throttle;
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->package_power_limit;
 		else
-			return 0;
+			return;
 	} else
-		return 0;
+		return;
 
 	old_event = state->new_event;
 	state->new_event = new_event;
@@ -181,7 +176,7 @@ static int therm_throt_process(bool new_event, int event, int level)
 
 	if (time_before64(now, state->next_check) &&
 			state->count != state->last_count)
-		return 0;
+		return;
 
 	state->next_check = now + CHECK_INTERVAL;
 	state->last_count = state->count;
@@ -193,16 +188,14 @@ static int therm_throt_process(bool new_event, int event, int level)
 				this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package",
 				state->count);
-		return 1;
+		return;
 	}
 	if (old_event) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package");
-		return 1;
+		return;
 	}
-
-	return 0;
 }
 
 static int thresh_event_valid(int level, int event)
@@ -365,10 +358,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,

* [tip:ras/core] x86/ras/amd: Make sysfs names of banks more user-friendly
  2017-01-23 18:35 ` [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly Borislav Petkov
@ 2017-01-24  8:47   ` tip-bot for Yazen Ghannam
  0 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Yazen Ghannam @ 2017-01-24  8:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tony.luck, Yazen.Ghannam, torvalds, peterz, mingo, linux-edac,
	bp, linux-kernel, tglx, hpa

Commit-ID:  0b737a9c2af85cc8295f9308d9250f9111bbf94d
Gitweb:     http://git.kernel.org/tip/0b737a9c2af85cc8295f9308d9250f9111bbf94d
Author:     Yazen Ghannam <Yazen.Ghannam@amd.com>
AuthorDate: Mon, 23 Jan 2017 19:35:08 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:53 +0100

x86/ras/amd: Make sysfs names of banks more user-friendly

Currently, we append the MCA_IPID[InstanceId] to the bank name to create
the sysfs filename. The InstanceId field uniquely identifies a bank
instance but it doesn't look very nice for most banks.

Replace the InstanceId with a simpler, ascending (0, 1, ...) value.
Only use this in the sysfs name when there is more than 1 instance.
Otherwise, just use the bank's name as the sysfs name.

Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1484322741-41884-3-git-send-email-Yazen.Ghannam@amd.com
Link: http://lkml.kernel.org/r/20170123183514.13356-4-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
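A small userspace sketch of the naming scheme described above, under simplifying assumptions: struct hwid and struct bank are cut-down stand-ins for smca_hwid and smca_bank, and the helper names are hypothetical. Names must be generated only after all banks have been assigned, because the per-type count has to be final.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Cut-down stand-ins for struct smca_hwid / struct smca_bank. */
struct hwid {
	const char *name;
	unsigned char count;		/* number of instances seen */
};

struct bank {
	struct hwid *hwid;
	unsigned int id;		/* raw instance ID from hardware */
	unsigned char sysfs_id;		/* ascending index used for the name */
};

/* Record the raw ID and hand out the next ascending index for this type. */
static void bank_assign(struct bank *b, struct hwid *h, unsigned int instance_id)
{
	b->hwid = h;
	b->id = instance_id;
	b->sysfs_id = h->count++;
}

/* Plain type name for a single instance, "<name>_<index>" otherwise. */
static const char *bank_sysfs_name(const struct bank *b, char *buf, size_t len)
{
	assert(b && b->hwid && buf && len > 0);

	if (b->hwid->count == 1)
		return b->hwid->name;

	snprintf(buf, len, "%s_%x", b->hwid->name, (unsigned int)b->sysfs_id);
	return buf;
}
```

With one instance of a type and two of another, the names come out as the bare type name for the first, and type_0 and type_1 for the other two, regardless of what the raw instance IDs look like.
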
 arch/x86/include/asm/mce.h           | 5 +++--
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 6 +++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index a09ed05..528f6ec 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -356,12 +356,13 @@ struct smca_hwid {
 	unsigned int bank_type;	/* Use with smca_bank_types for easy indexing. */
 	u32 hwid_mcatype;	/* (hwid,mcatype) tuple */
 	u32 xec_bitmap;		/* Bitmap of valid ExtErrorCodes; current max is 21. */
+	u8 count;		/* Number of instances. */
 };
 
 struct smca_bank {
 	struct smca_hwid *hwid;
-	/* Instance ID */
-	u32 id;
+	u32 id;			/* Value of MCA_IPID[InstanceId]. */
+	u8 sysfs_id;		/* Value used for sysfs name. */
 };
 
 extern struct smca_bank smca_banks[MAX_NR_BANKS];
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index a5fd137..776379e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -192,6 +192,7 @@ static void get_smca_bank_info(unsigned int bank)
 
 			smca_banks[bank].hwid = s_hwid;
 			smca_banks[bank].id = instance_id;
+			smca_banks[bank].sysfs_id = s_hwid->count++;
 			break;
 		}
 	}
@@ -1064,9 +1065,12 @@ static const char *get_name(unsigned int bank, struct threshold_block *b)
 		return NULL;
 	}
 
+	if (smca_banks[bank].hwid->count == 1)
+		return smca_get_name(bank_type);
+
 	snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN,
 		 "%s_%x", smca_get_name(bank_type),
-			  smca_banks[bank].id);
+			  smca_banks[bank].sysfs_id);
 	return buf_mcatype;
 }
 

* [tip:ras/core] x86/ras: Flip the TSC-adding logic
  2017-01-23 18:35 ` [PATCH 4/9] x86/MCE: Flip the TSC-adding logic Borislav Petkov
@ 2017-01-24  8:48   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tony.luck, peterz, Yazen.Ghannam, linux-edac, bp, linux-kernel,
	hpa, tglx, torvalds, mingo

Commit-ID:  669c00f09935fc7a22297eadee04536af141595b
Gitweb:     http://git.kernel.org/tip/669c00f09935fc7a22297eadee04536af141595b
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:09 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:54 +0100

x86/ras: Flip the TSC-adding logic

Add the TSC value to the MCE record only when the MCE being logged is
precise, i.e., it is logged as an exception or an MCE-related interrupt.

So it doesn't look particularly easy to do without touching/changing a
bunch of places. That's why I'm trying tricks first.

For example, the mce-apei.c case I'm addressing by setting ->tsc only
for errors of panic severity. The idea there is that panic errors will
have raised an #MC rather than having been found by polling.

And then instead of propagating a flag to mce_setup(), it seems
easier/less code to set ->tsc depending on the call sites, i.e.,
are we polling or are we preparing an MCE record in an exception
handler/thresholding interrupt.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-5-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
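A userspace sketch of the flipped logic, assuming a fake counter in place of rdtsc(). struct rec, read_counter() and the log_from_*() helpers are hypothetical names. The common setup routine leaves the timestamp at zero, and each call site decides whether its record is precise enough to carry one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Cut-down stand-in for struct mce. */
struct rec {
	uint64_t tsc;
	uint64_t status;
};

/* Fake monotonic counter standing in for rdtsc(). */
static uint64_t fake_tsc = 1000;

static uint64_t read_counter(void)
{
	return fake_tsc++;
}

#define POLL_TIMESTAMP	0x1	/* models MCP_TIMESTAMP */

/* Common setup: zero everything, including the timestamp. */
static void rec_setup(struct rec *r)
{
	memset(r, 0, sizeof(*r));
}

/* Polling path: timestamp only when the caller asks for one. */
static void log_from_poll(struct rec *r, unsigned int flags)
{
	rec_setup(r);
	if (flags & POLL_TIMESTAMP)
		r->tsc = read_counter();
}

/* Exception path: the event is precise, so always timestamp it. */
static void log_from_exception(struct rec *r)
{
	rec_setup(r);
	r->tsc = read_counter();
}
```

A zero timestamp then means "not precise", which a consumer can test for before printing the value.
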
 arch/x86/kernel/cpu/mcheck/mce-apei.c |  5 ++++-
 arch/x86/kernel/cpu/mcheck/mce.c      | 12 +++---------
 arch/x86/kernel/cpu/mcheck/mce_amd.c  |  3 ++-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index 83f1a98..2eee853 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -52,8 +52,11 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 
 	if (severity >= GHES_SEV_RECOVERABLE)
 		m.status |= MCI_STATUS_UC;
-	if (severity >= GHES_SEV_PANIC)
+
+	if (severity >= GHES_SEV_PANIC) {
 		m.status |= MCI_STATUS_PCC;
+		m.tsc = rdtsc();
+	}
 
 	m.addr = mem_err->physical_addr;
 	mce_log(&m);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6eef6fd..ca15a7e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -128,7 +128,6 @@ void mce_setup(struct mce *m)
 {
 	memset(m, 0, sizeof(struct mce));
 	m->cpu = m->extcpu = smp_processor_id();
-	m->tsc = rdtsc();
 	/* We hope get_seconds stays lockless */
 	m->time = get_seconds();
 	m->cpuvendor = boot_cpu_data.x86_vendor;
@@ -710,14 +709,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 
 	mce_gather_info(&m, NULL);
 
-	/*
-	 * m.tsc was set in mce_setup(). Clear it if not requested.
-	 *
-	 * FIXME: Propagate @flags to mce_gather_info/mce_setup() to avoid
-	 *	  that dance.
-	 */
-	if (!(flags & MCP_TIMESTAMP))
-		m.tsc = 0;
+	if (flags & MCP_TIMESTAMP)
+		m.tsc = rdtsc();
 
 	for (i = 0; i < mca_cfg.banks; i++) {
 		if (!mce_banks[i].ctl || !test_bit(i, *b))
@@ -1156,6 +1149,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		goto out;
 
 	mce_gather_info(&m, regs);
+	m.tsc = rdtsc();
 
 	final = this_cpu_ptr(&mces_seen);
 	*final = m;
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 776379e..9e5427d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -778,7 +778,8 @@ __log_error(unsigned int bank, bool deferred_err, bool threshold_err, u64 misc)
 	mce_setup(&m);
 
 	m.status = status;
-	m.bank = bank;
+	m.bank   = bank;
+	m.tsc	 = rdtsc();
 
 	if (threshold_err)
 		m.misc = misc;

* [tip:ras/core] x86/ras/amd/inj: Change dependency
  2017-01-23 18:35 ` [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency Borislav Petkov
@ 2017-01-24  8:48   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, linux-edac, hpa, tglx, mingo, Yazen.Ghannam,
	peterz, bp, tony.luck, torvalds

Commit-ID:  bd43f60a260c83cbc9befd7d710a3f2bfd3b2dd2
Gitweb:     http://git.kernel.org/tip/bd43f60a260c83cbc9befd7d710a3f2bfd3b2dd2
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:10 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:55 +0100

x86/ras/amd/inj: Change dependency

Change dependency to mce.c as we're using mce_inject_log() now to stick
an MCE into the MCA subsystem.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-6-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/ras/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig
index d957d5f..0bc60a3 100644
--- a/arch/x86/ras/Kconfig
+++ b/arch/x86/ras/Kconfig
@@ -1,6 +1,6 @@
 config MCE_AMD_INJ
 	tristate "Simple MCE injection interface for AMD processors"
-	depends on RAS && EDAC_DECODE_MCE && DEBUG_FS && AMD_NB
+	depends on RAS && X86_MCE && DEBUG_FS && AMD_NB
 	default n
 	help
 	  This is a simple debugfs interface to inject MCEs and test different

* [tip:ras/core] EDAC/mce/amd: Unexport amd_decode_mce()
  2017-01-23 18:35 ` [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce() Borislav Petkov
@ 2017-01-24  8:49   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-edac, bp, mingo, torvalds, linux-kernel, tony.luck,
	peterz, hpa, Yazen.Ghannam

Commit-ID:  1fbcd909035251b5eac267f1c5d6d67b32d16b62
Gitweb:     http://git.kernel.org/tip/1fbcd909035251b5eac267f1c5d6d67b32d16b62
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:11 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:55 +0100

EDAC/mce/amd: Unexport amd_decode_mce()

It is not used outside of the driver anymore.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-7-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 drivers/edac/mce_amd.c | 4 ++--
 drivers/edac/mce_amd.h | 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 34208f3..5cd3c39 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -942,7 +942,8 @@ static const char *decode_error_status(struct mce *m)
 	return "Corrected error, no action required.";
 }
 
-int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
+static int
+amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 {
 	struct mce *m = (struct mce *)data;
 	struct cpuinfo_x86 *c = &cpu_data(m->extcpu);
@@ -1047,7 +1048,6 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 	return NOTIFY_STOP;
 }
-EXPORT_SYMBOL_GPL(amd_decode_mce);
 
 static struct notifier_block amd_mce_dec_nb = {
 	.notifier_call	= amd_decode_mce,
diff --git a/drivers/edac/mce_amd.h b/drivers/edac/mce_amd.h
index c2359a1..0b6a686 100644
--- a/drivers/edac/mce_amd.h
+++ b/drivers/edac/mce_amd.h
@@ -79,6 +79,5 @@ struct amd_decoder_ops {
 void amd_report_gart_errors(bool);
 void amd_register_ecc_decoder(void (*f)(int, struct mce *));
 void amd_unregister_ecc_decoder(void (*f)(int, struct mce *));
-int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data);
 
 #endif /* _EDAC_MCE_AMD_H */

* [tip:ras/core] EDAC/mce/amd: Dump TSC value
  2017-01-23 18:35 ` [PATCH 7/9] EDAC/mce_amd: Dump TSC value Borislav Petkov
@ 2017-01-24  8:50   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, linux-edac, tony.luck, bp, torvalds, peterz, mingo,
	Yazen.Ghannam, tglx, linux-kernel

Commit-ID:  0bceab677dcef409f6281d922461057721d547b3
Gitweb:     http://git.kernel.org/tip/0bceab677dcef409f6281d922461057721d547b3
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:12 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:56 +0100

EDAC/mce/amd: Dump TSC value

Dump the TSC value of the time when the MCE got logged.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-8-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 drivers/edac/mce_amd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 5cd3c39..ecad750 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1007,6 +1007,9 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 	} else
 		pr_cont("\n");
 
+	if (m->tsc)
+		pr_emerg(HW_ERR "TSC: %llu\n", m->tsc);
+
 	if (!fam_ops)
 		goto err_code;
 


* [tip:ras/core] x86/ras: Get rid of mce_process_work()
  2017-01-23 18:35 ` [PATCH 8/9] x86/MCE: Get rid of mce_process_work() Borislav Petkov
@ 2017-01-24  8:50   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, tony.luck, linux-edac, linux-kernel, bp, peterz, torvalds,
	hpa, Yazen.Ghannam, mingo

Commit-ID:  cff4c0391a692cf9b89932c62a7f879fb3637148
Gitweb:     http://git.kernel.org/tip/cff4c0391a692cf9b89932c62a7f879fb3637148
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:13 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:56 +0100

x86/ras: Get rid of mce_process_work()

Make mce_gen_pool_process() the workqueue function directly and save us
an indirection.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-9-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c          | 12 +-----------
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
index 93d824e..1e5a50c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-genpool.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -72,7 +72,7 @@ struct llist_node *mce_gen_pool_prepare_records(void)
 	return new_head.first;
 }
 
-void mce_gen_pool_process(void)
+void mce_gen_pool_process(struct work_struct *__unused)
 {
 	struct llist_node *head;
 	struct mce_evt_llist *node, *tmp;
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index cd74a3f..903043e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -31,7 +31,7 @@ struct mce_evt_llist {
 	struct mce mce;
 };
 
-void mce_gen_pool_process(void);
+void mce_gen_pool_process(struct work_struct *__unused);
 bool mce_gen_pool_empty(void);
 int mce_gen_pool_add(struct mce *mce);
 int mce_gen_pool_init(void);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ca15a7e..0fef540 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1316,16 +1316,6 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 #endif
 
 /*
- * Action optional processing happens here (picking up
- * from the list of faulting pages that do_machine_check()
- * placed into the genpool).
- */
-static void mce_process_work(struct work_struct *dummy)
-{
-	mce_gen_pool_process();
-}
-
-/*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
  * errors, poll 2x slower (up to check_interval seconds).
@@ -2165,7 +2155,7 @@ int __init mcheck_init(void)
 	mce_register_decode_chain(&mce_default_nb);
 	mcheck_vendor_init_severity();
 
-	INIT_WORK(&mce_work, mce_process_work);
+	INIT_WORK(&mce_work, mce_gen_pool_process);
 	init_irq_work(&mce_irq_work, mce_irq_work_cb);
 
 	return 0;


* [tip:ras/core] x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
  2017-01-23 18:35 ` [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority Borislav Petkov
@ 2017-01-24  8:51   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 49+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:51 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, tony.luck, peterz, tglx, mingo, linux-kernel, hpa, bp,
	Yazen.Ghannam, linux-edac

Commit-ID:  9026cc82b632ed1a859935c82ed8ad65f27f2781
Gitweb:     http://git.kernel.org/tip/9026cc82b632ed1a859935c82ed8ad65f27f2781
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:14 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:57 +0100

x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority

Assign all notifiers on the MCE decode chain a priority so that they get
called in the correct order.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mce.h       | 9 +++++++++
 arch/x86/kernel/cpu/mcheck/mce.c | 8 +++-----
 drivers/acpi/acpi_extlog.c       | 1 +
 drivers/acpi/nfit/mce.c          | 1 +
 drivers/edac/i7core_edac.c       | 1 +
 drivers/edac/mce_amd.c           | 1 +
 drivers/edac/sb_edac.c           | 3 ++-
 drivers/edac/skx_edac.c          | 3 ++-
 8 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 528f6ec..e638736 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -189,6 +189,15 @@ extern struct mce_vendor_flags mce_flags;
 
 extern struct mca_config mca_cfg;
 extern struct mca_msr_regs msr_ops;
+
+enum mce_notifier_prios {
+	MCE_PRIO_SRAO		= INT_MAX,
+	MCE_PRIO_EXTLOG		= INT_MAX - 1,
+	MCE_PRIO_NFIT		= INT_MAX - 2,
+	MCE_PRIO_EDAC		= INT_MAX - 3,
+	MCE_PRIO_LOWEST		= 0,
+};
+
 extern void mce_register_decode_chain(struct notifier_block *nb);
 extern void mce_unregister_decode_chain(struct notifier_block *nb);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 0fef540..e39bbc0 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -216,9 +216,7 @@ void mce_register_decode_chain(struct notifier_block *nb)
 {
 	atomic_inc(&num_notifiers);
 
-	/* Ensure SRAO notifier has the highest priority in the decode chain. */
-	if (nb != &mce_srao_nb && nb->priority == INT_MAX)
-		nb->priority -= 1;
+	WARN_ON(nb->priority > MCE_PRIO_LOWEST && nb->priority < MCE_PRIO_EDAC);
 
 	atomic_notifier_chain_register(&x86_mce_decoder_chain, nb);
 }
@@ -582,7 +580,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
 }
 static struct notifier_block mce_srao_nb = {
 	.notifier_call	= srao_decode_notifier,
-	.priority = INT_MAX,
+	.priority	= MCE_PRIO_SRAO,
 };
 
 static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
@@ -608,7 +606,7 @@ static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
 static struct notifier_block mce_default_nb = {
 	.notifier_call	= mce_default_notifier,
 	/* lowest prio, we want it to run last. */
-	.priority	= 0,
+	.priority	= MCE_PRIO_LOWEST,
 };
 
 /*
diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index b3842ff..a15270a 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -212,6 +212,7 @@ static bool __init extlog_get_l1addr(void)
 }
 static struct notifier_block extlog_mce_dec = {
 	.notifier_call	= extlog_print,
+	.priority	= MCE_PRIO_EXTLOG,
 };
 
 static int __init extlog_init(void)
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index e5ce81c..3ba1c34 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -90,6 +90,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 
 static struct notifier_block nfit_mce_dec = {
 	.notifier_call	= nfit_handle_mce,
+	.priority	= MCE_PRIO_NFIT,
 };
 
 void nfit_mce_register(void)
diff --git a/drivers/edac/i7core_edac.c b/drivers/edac/i7core_edac.c
index 69b5ade..75ad847 100644
--- a/drivers/edac/i7core_edac.c
+++ b/drivers/edac/i7core_edac.c
@@ -1835,6 +1835,7 @@ static int i7core_mce_check_error(struct notifier_block *nb, unsigned long val,
 
 static struct notifier_block i7_mce_dec = {
 	.notifier_call	= i7core_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 struct memdev_dmi_entry {
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index ecad750..0d9bc25 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1054,6 +1054,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 static struct notifier_block amd_mce_dec_nb = {
 	.notifier_call	= amd_decode_mce,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 static int __init mce_amd_init(void)
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 54ae6dc..c585a01 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -3136,7 +3136,8 @@ static int sbridge_mce_check_error(struct notifier_block *nb, unsigned long val,
 }
 
 static struct notifier_block sbridge_mce_dec = {
-	.notifier_call      = sbridge_mce_check_error,
+	.notifier_call	= sbridge_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 /****************************************************************************
diff --git a/drivers/edac/skx_edac.c b/drivers/edac/skx_edac.c
index 79ef675..1159dba 100644
--- a/drivers/edac/skx_edac.c
+++ b/drivers/edac/skx_edac.c
@@ -1007,7 +1007,8 @@ static int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
 }
 
 static struct notifier_block skx_mce_dec = {
-	.notifier_call = skx_mce_check_error,
+	.notifier_call	= skx_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 static void skx_remove(void)
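
For readers following the priority scheme, here is a minimal userspace sketch in C of how a priority-ordered chain behaves. It is not the kernel's notifier implementation: `struct nb`, `chain_register()`, `chain_order()` and `prio_is_unexpected()` are hypothetical names used only for illustration, and only the `MCE_PRIO_*` values and the range check are taken from the patch above. The sketch assumes the chain is kept sorted by descending priority at registration time, so handlers run from highest to lowest priority regardless of registration order.

```c
#include <limits.h>
#include <stddef.h>

/* Priority values copied from the patch above. */
enum mce_notifier_prios {
	MCE_PRIO_SRAO	= INT_MAX,
	MCE_PRIO_EXTLOG	= INT_MAX - 1,
	MCE_PRIO_NFIT	= INT_MAX - 2,
	MCE_PRIO_EDAC	= INT_MAX - 3,
	MCE_PRIO_LOWEST	= 0,
};

/* Hypothetical stand-in for struct notifier_block. */
struct nb {
	const char *name;
	int priority;
	struct nb *next;
};

/* Insert n in front of the first entry with a strictly lower priority. */
static void chain_register(struct nb **head, struct nb *n)
{
	while (*head && n->priority <= (*head)->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}

/* Same condition as the WARN_ON() in the patch: nonzero for an unexpected priority. */
static int prio_is_unexpected(int prio)
{
	return prio > MCE_PRIO_LOWEST && prio < MCE_PRIO_EDAC;
}

/* Copy the names in call order into out[]; returns the number of entries. */
static size_t chain_order(const struct nb *head, const char **out, size_t max)
{
	size_t n = 0;

	for (; head && n < max; head = head->next)
		out[n++] = head->name;
	return n;
}
```

Registering the EDAC, default, SRAO, NFIT and extlog blocks in any order and then walking the list yields SRAO, extlog, NFIT, EDAC, default, which is the ordering the enum is meant to encode.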


* RE: Dell XPS13: MCE (Hardware Error) reported
  2017-01-31 15:29               ` Paul Menzel
  2017-01-31 17:20                 ` Borislav Petkov
  2017-01-31 18:50                 ` Austin S. Hemmelgarn
@ 2017-02-01 20:52                 ` Mario.Limonciello
  2 siblings, 0 replies; 49+ messages in thread
From: Mario.Limonciello @ 2017-02-01 20:52 UTC (permalink / raw)
  To: pmenzel, bp; +Cc: ashok.raj, linux-kernel, linux, len.brown, tony.luck

Hi Paul,

> Mario, there are at least two more firmware bugs [4][5][6]. Having fast
> suspend and resume says something about the quality of a device. If the Dell
> XPS13 9360 is supposed to compete with Apple devices, and Google
> Chromebooks, then this should be improved in my opinion. Do you have a
> suggestion on how to best get this solved? Do you want me to contact the
> support, or are there Dell employees with access to the Linux kernel Bugzilla
> bug tracker?
> 
> 

Thanks for bringing these to my attention.  I wasn't aware, and neither was the rest
of the team handling these machines.  I've started some internal discussion around
the slow ASL resume one.

The WoWLAN one, I'm not sure there will be much for us to do on the BIOS FW end.
As Kalle was mentioning, this comes down to the device FW needing to spin up.  The
easiest way to keep it quick is to leave the radio active, which is what WoWLAN will
accomplish.

The other option is going to be to move over to suspend to idle.  This will leave most
devices operational (in RTD3), freeze kernel threads and let the CPU go into a low
enough state.  Some of the plumbing that will allow supporting this has recently gone
in, but there are some problems to work through that will prevent letting machines
that support it make it the default until at least the next kernel version.
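
As an illustration of checking which suspend variant a machine currently uses, here is a small C sketch. It assumes a kernel that exposes `/sys/power/mem_sleep` (an assumption, not something stated in this thread) and that the file lists the available variants on one line with the active one in square brackets, for example `s2idle [deep]`. The helper name `selected_sleep_state()` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) token of a line such as
 * "s2idle [deep]" into out. Returns 0 on success, -1 if no complete
 * bracketed token is found or it does not fit into out.
 */
static int selected_sleep_state(const char *line, char *out, size_t outlen)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!open || !close)
		return -1;

	len = (size_t)(close - open - 1);
	if (len + 1 > outlen)
		return -1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}
```

The input line would typically come from reading that sysfs file with `fopen()` and `fgets()`; the parser itself does not touch the filesystem, so it can be tried on any string.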


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-31 15:29               ` Paul Menzel
  2017-01-31 17:20                 ` Borislav Petkov
@ 2017-01-31 18:50                 ` Austin S. Hemmelgarn
  2017-02-01 20:52                 ` Mario.Limonciello
  2 siblings, 0 replies; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-31 18:50 UTC (permalink / raw)
  To: Paul Menzel, Mario Limonciello, Borislav Petkov
  Cc: Ashok Raj, linux-kernel, Thorsten Leemhuis, Len Brown, Tony Luck

On 2017-01-31 10:29, Paul Menzel wrote:
> Dear Borislav, dear Mario,
>
>
> On 01/27/17 18:16, Mario.Limonciello@dell.com wrote:
>>> -----Original Message-----
>>> From: Borislav Petkov [mailto:bp@alien8.de]
>>> Sent: Friday, January 27, 2017 11:11 AM
>>> To: Paul Menzel <pmenzel@molgen.mpg.de>
>>> Cc: Ashok Raj <ashok.raj@intel.com>; Linux Kernel Mailing List <linux-
>>> kernel@vger.kernel.org>; Thorsten Leemhuis <linux@leemhuis.info>; Len
>>> Brown <len.brown@intel.com>; Tony Luck <tony.luck@intel.com>;
>>> Limonciello, Mario <Mario_Limonciello@Dell.com>
>>> Subject: Re: Dell XPS13: MCE (Hardware Error) reported
>>>
>>> On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote:
>>>> Thank you very much for your help. After wasting my time with the Dell
>>>> support over Twitter [1], where they basically also make you jump
>>>> through hoops,
>>>
>>> Fun read that twitter page. Especially the bit about not supporting
>>> Linux but
>>> but you did ship this laptop with ubuntu. LOOL.
>>>
>>> FWIW, this is not the first time I've heard vendors trying to get
>>> away from
>>> dealing with bugs by playing the "we-dont-support-Linux" card.
>
> I think, in this case it was an honest mistake from the support person,
> which they corrected in a follow-up post.
>
> But the next conversation was a little disappointing as the problem was
> already diagnosed out on the LKML.
>
> As Mario replied, the Twitter people do not seem to know the free
> software world that well, that “normal” people actually are in contact
> with the developers.
>
> What surprises me though is, that the change-log for the firmware seems
> to be incomplete even for the Dell staff.
>
>>> Good to know when I go looking for a new laptop in the future.
>
> Unfortunately, there are not a lot of options. If you know a good
> vendor, please share.
>
> Dell is one of the few “big” vendors selling laptops with a GNU/Linux
> operating system installed.
>
> There are Google Chromebooks, which even have coreboot. Google’s
> Chromium team does a great job and have a great quality assurance. On
> par, or even better than Apple, I’d say.
>
> Some people need more powerful devices though.
>
> In Germany, I know of TUXEDO [1]. Also, Purism devices have GNU/Linux
> installed [2].
>
> TUXEDO does not build the devices themselves, and gets template
> models(?) from Chinese manufacturers. They only seem to start hiring the
> expertise to do Linux kernel and OS work themselves [3]. Currently, it’s
> mostly the stock Ubuntu for desktop with their artwork and some small
> optimizations.
When it comes to laptops, I've recently become a fan of System76 [1]. 
Linux runs pretty much flawlessly on all the systems I've ever 
encountered from them without needing any special configuration, and 
they offer some pretty high-end systems.  Their Oryx Pro laptop is 
particularly impressive, it's got hardware on par with a desktop bearing 
a similar price-tag, which is rare even for systems that ship with 
Windows pre-installed.  They're based in the US, but I'm pretty certain 
they ship worldwide, and like most of the better manufacturers these 
days, the systems are made-to-order with quite a lot in the way of 
options (you can even get an NVMe boot drive in their higher-end systems 
if you're willing to shell out more than a third of the base cost for 
the system on it).

They also do desktops and servers, but it's generally cheaper in my 
experience for anything but a laptop to just build it yourself.
>
>> Sorry you had that experience.  I'll pass that on to get that messaging
>> corrected.  The team that does Linux support doesn't use twitter, they
>> do phone and email support only.
>
> Thank you that’s good to know.
>
>> I'm glad you got things sorted moving to the next FW version.
>
> Me too.
>
> Mario, there are at least two more firmware bugs [4][5][6]. Having fast
> suspend and resume says something about the quality of a device. If the
> Dell XPS13 9360 is supposed to compete with Apple devices, and Google
> Chromebooks, then this should be improved in my opinion. Do you have a
> suggestion on how to best get this solved? Do you want me to contact the
> support, or are there Dell employees with access to the Linux kernel
> Bugzilla bug tracker?
>
>
> Kind regards,
>
> Paul
>
>
> [1] https://www.tuxedocomputers.com/
> [2] https://puri.sm/
> [3]
> https://www.tuxedocomputers.com/Jobs-Karriere-Stellenausschreibungen.geek?x6a1ec=o8323ih26224vook9uu997r1a5
>
> [4] https://github.molgen.mpg.de/mariux64/dell_xps13/
> [5] https://bugzilla.kernel.org/show_bug.cgi?id=185611
> [6] https://bugzilla.kernel.org/show_bug.cgi?id=185621
[1] https://system76.com/


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-31 15:29               ` Paul Menzel
@ 2017-01-31 17:20                 ` Borislav Petkov
  2017-01-31 18:50                 ` Austin S. Hemmelgarn
  2017-02-01 20:52                 ` Mario.Limonciello
  2 siblings, 0 replies; 49+ messages in thread
From: Borislav Petkov @ 2017-01-31 17:20 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Mario Limonciello, Ashok Raj, linux-kernel, Thorsten Leemhuis,
	Len Brown, Tony Luck

On Tue, Jan 31, 2017 at 04:29:51PM +0100, Paul Menzel wrote:
> Unfortunately, there are not a lot of options. If you know a good
> vendor, please share.

What I'd do is go to the store with a live CD and boot it on the
machine. A fairly recent allmodconfig kernel should be able to
tell you what works and what doesn't.

If there's no store, you can still return it in the first 14 days
according to EU law.

> Dell is one of the few “big” vendors selling laptops with a GNU/Linux
> operating system installed.
> 
> There are Google Chromebooks, which even have coreboot.

Yes, I dream of the day when we get rid of that vendor firmware and we
all use coreboot. That'll be a good day.

> team does a great job and have a great quality assurance. On par, or even
> better than Apple, I’d say.
> 
> Some people need more powerful devices though.
> 
> In Germany, I know of TUXEDO [1]. Also, Purism devices have GNU/Linux
> installed [2].
> 
> TUXEDO does not build the devices themselves, and gets template models(?)
> from Chinese manufacturers.

Looks cool. Especially if I don't want to pay for that windoze license
being shoved down my throat.

I'd probably give it a try next time I need new hw. Thanks for the tip!

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-27 17:16             ` Mario.Limonciello
@ 2017-01-31 15:29               ` Paul Menzel
  2017-01-31 17:20                 ` Borislav Petkov
                                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Paul Menzel @ 2017-01-31 15:29 UTC (permalink / raw)
  To: Mario Limonciello, Borislav Petkov
  Cc: Ashok Raj, linux-kernel, Thorsten Leemhuis, Len Brown, Tony Luck

Dear Borislav, dear Mario,


On 01/27/17 18:16, Mario.Limonciello@dell.com wrote:
>> -----Original Message-----
>> From: Borislav Petkov [mailto:bp@alien8.de]
>> Sent: Friday, January 27, 2017 11:11 AM
>> To: Paul Menzel <pmenzel@molgen.mpg.de>
>> Cc: Ashok Raj <ashok.raj@intel.com>; Linux Kernel Mailing List <linux-
>> kernel@vger.kernel.org>; Thorsten Leemhuis <linux@leemhuis.info>; Len
>> Brown <len.brown@intel.com>; Tony Luck <tony.luck@intel.com>;
>> Limonciello, Mario <Mario_Limonciello@Dell.com>
>> Subject: Re: Dell XPS13: MCE (Hardware Error) reported
>>
>> On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote:
>>> Thank you very much for your help. After wasting my time with the Dell
>>> support over Twitter [1], where they basically also make you jump
>>> through hoops,
>>
>> Fun read that twitter page. Especially the bit about not supporting Linux but
>> but you did ship this laptop with ubuntu. LOOL.
>>
>> FWIW, this is not the first time I've heard vendors trying to get away from
>> dealing with bugs by playing the "we-dont-support-Linux" card.

I think, in this case it was an honest mistake from the support person, 
which they corrected in a follow-up post.

But the next conversation was a little disappointing as the problem was 
already diagnosed out on the LKML.

As Mario replied, the Twitter people do not seem to know the free 
software world that well, that “normal” people actually are in contact 
with the developers.

What surprises me though is, that the change-log for the firmware seems 
to be incomplete even for the Dell staff.

>> Good to know when I go looking for a new laptop in the future.

Unfortunately, there are not a lot of options. If you know a good 
vendor, please share.

Dell is one of the few “big” vendors selling laptops with a GNU/Linux 
operating system installed.

There are Google Chromebooks, which even have coreboot. Google’s 
Chromium team does a great job and have a great quality assurance. On 
par, or even better than Apple, I’d say.

Some people need more powerful devices though.

In Germany, I know of TUXEDO [1]. Also, Purism devices have GNU/Linux 
installed [2].

TUXEDO does not build the devices themselves, and gets template 
models(?) from Chinese manufacturers. They only seem to start hiring the 
expertise to do Linux kernel and OS work themselves [3]. Currently, it’s 
mostly the stock Ubuntu for desktop with their artwork and some small 
optimizations.

> Sorry you had that experience.  I'll pass that on to get that messaging
> corrected.  The team that does Linux support doesn't use twitter, they
> do phone and email support only.

Thank you, that’s good to know.

> I'm glad you got things sorted moving to the next FW version.

Me too.

Mario, there are at least two more firmware bugs [4][5][6]. Having fast 
suspend and resume says something about the quality of a device. If the 
Dell XPS13 9360 is supposed to compete with Apple devices, and Google 
Chromebooks, then this should be improved in my opinion. Do you have a 
suggestion on how to best get this solved? Do you want me to contact the 
support, or are there Dell employees with access to the Linux kernel 
Bugzilla bug tracker?


Kind regards,

Paul


[1] https://www.tuxedocomputers.com/
[2] https://puri.sm/
[3] 
https://www.tuxedocomputers.com/Jobs-Karriere-Stellenausschreibungen.geek?x6a1ec=o8323ih26224vook9uu997r1a5
[4] https://github.molgen.mpg.de/mariux64/dell_xps13/
[5] https://bugzilla.kernel.org/show_bug.cgi?id=185611
[6] https://bugzilla.kernel.org/show_bug.cgi?id=185621


* RE: Dell XPS13: MCE (Hardware Error) reported
  2017-01-27 17:10           ` Borislav Petkov
@ 2017-01-27 17:16             ` Mario.Limonciello
  2017-01-31 15:29               ` Paul Menzel
  0 siblings, 1 reply; 49+ messages in thread
From: Mario.Limonciello @ 2017-01-27 17:16 UTC (permalink / raw)
  To: bp, pmenzel; +Cc: ashok.raj, linux-kernel, linux, len.brown, tony.luck

> -----Original Message-----
> From: Borislav Petkov [mailto:bp@alien8.de]
> Sent: Friday, January 27, 2017 11:11 AM
> To: Paul Menzel <pmenzel@molgen.mpg.de>
> Cc: Ashok Raj <ashok.raj@intel.com>; Linux Kernel Mailing List <linux-
> kernel@vger.kernel.org>; Thorsten Leemhuis <linux@leemhuis.info>; Len
> Brown <len.brown@intel.com>; Tony Luck <tony.luck@intel.com>;
> Limonciello, Mario <Mario_Limonciello@Dell.com>
> Subject: Re: Dell XPS13: MCE (Hardware Error) reported
> 
> On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote:
> > Thank you very much for your help. After wasting my time with the Dell
> > support over Twitter [1], where they basically also make you jump
> > through hoops,
> 
> Fun read that twitter page. Especially the bit about not supporting Linux but
> but you did ship this laptop with ubuntu. LOOL.
> 
> FWIW, this is not the first time I've heard vendors trying to get away from
> dealing with bugs by playing the "we-dont-support-Linux" card.
> 
> Good to know when I go looking for a new laptop in the future.
> 
> --
> Regards/Gruss,
>     Boris.
> 
> Good mailing practices for 400: avoid top-posting and trim the reply.

Sorry you had that experience.  I'll pass that on to get that messaging
corrected.  The team that does Linux support doesn't use twitter, they
do phone and email support only.

I'm glad you got things sorted moving to the next FW version. 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-27 13:35         ` Paul Menzel
@ 2017-01-27 17:10           ` Borislav Petkov
  2017-01-27 17:16             ` Mario.Limonciello
  0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-27 17:10 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Ashok Raj, Linux Kernel Mailing List, Thorsten Leemhuis,
	Len Brown, Tony Luck, Mario Limonciello

On Fri, Jan 27, 2017 at 02:35:16PM +0100, Paul Menzel wrote:
> Thank you very much for your help. After wasting my time with the Dell
> support over Twitter [1], where they basically also make you jump through
> hoops,

Fun read that twitter page. Especially the bit about not supporting
Linux but but you did ship this laptop with ubuntu. LOOL.

FWIW, this is not the first time I've heard vendors trying to get away
from dealing with bugs by playing the "we-dont-support-Linux" card.

Good to know when I go looking for a new laptop in the future.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 19:23       ` Raj, Ashok
@ 2017-01-27 13:35         ` Paul Menzel
  2017-01-27 17:10           ` Borislav Petkov
  0 siblings, 1 reply; 49+ messages in thread
From: Paul Menzel @ 2017-01-27 13:35 UTC (permalink / raw)
  To: Ashok Raj
  Cc: Borislav Petkov, Linux Kernel Mailing List, Thorsten Leemhuis,
	Len Brown, Tony Luck, Mario Limonciello, Thorsten Leemhuis

Dear Ashok,


On 01/09/17 20:23, Raj, Ashok wrote:

> On Mon, Jan 09, 2017 at 12:53:33PM +0100, Paul Menzel wrote:
>
>> On 01/05/17 02:12, Raj, Ashok wrote:
>>
>>>>> CPUID Vendor Intel Family 6 Model 142
>>> This is Kabylake Mobile
>>>
>>>>> Hardware event. This is not a software error.
>>>>> MCE 1
>>>>> CPU 0 BANK 7
>>>>> MISC 7880018086 ADDR fef1ce40
>>>>> TIME 1483543069 Wed Jan  4 16:17:49 2017
>
>>>>> STATUS ee0000000040110a MCGSTATUS 0
>>>
>>> Decoding the bits further from MCi_STATUS above:
>>> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
>>> been signaled by a CMCI.
>>>
>>> PCC=1, but should be ignored when EN=0.
>>> MCACOD: 110a MSCOD: 0040
>
> This MSCOD indicates that it's a write-back access to MMIO space. It's possible
> that the BIOS is scanning a certain memory region during boot, during which time
> the BIOS disables generation of MCEs, which is why EN=0 in the above log.
>
> It's a BIOS bug; one would expect the BIOS to clear these before handoff to the
> OS. During OS boot we also scan all MC banks and log/clear them.
>
> If you aren't observing them during normal operation you can safely ignore
> these preboot logs, or pass them along to your OEM.

Thank you very much for your help. After wasting my time with the Dell 
support over Twitter [1], where they basically also make you jump 
through hoops, and then claim it’s an mcelog issue – as they apparently 
only execute `sudo mcelog` –, I updated to the latest firmware 1.3.2 
released yesterday [2].

With that new firmware version, it looks like the firmware has been 
fixed and Linux does not report any MCEs.

It’d be great if other Dell XPS13 9360 users could verify that.


Kind regards,

Paul


[1] https://twitter.com/pmenzel_molgen/status/818808708692115456
[2] XPS_9360_1.3.2.exe


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 11:53     ` Paul Menzel
@ 2017-01-09 19:23       ` Raj, Ashok
  2017-01-27 13:35         ` Paul Menzel
  0 siblings, 1 reply; 49+ messages in thread
From: Raj, Ashok @ 2017-01-09 19:23 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Borislav Petkov, Linux Kernel Mailing List, Thorsten Leemhuis,
	Len Brown, Tony Luck, ashok.raj

Hi Paul

On Mon, Jan 09, 2017 at 12:53:33PM +0100, Paul Menzel wrote:
> 
> 
> On 01/05/17 02:12, Raj, Ashok wrote:
> 
> >>>CPUID Vendor Intel Family 6 Model 142
> >This is Kabylake Mobile
> >
> >>>Hardware event. This is not a software error.
> >>>MCE 1
> >>>CPU 0 BANK 7
> >>>MISC 7880018086 ADDR fef1ce40
> >>>TIME 1483543069 Wed Jan  4 16:17:49 2017

> >>>STATUS ee0000000040110a MCGSTATUS 0
> >
> >Decoding the bits further from MCi_STATUS above:
> >Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> >been signaled by a CMCI.
> >
> >PCC=1, but should be ignored when EN=0.
> >MCACOD: 110a MSCOD: 0040

This MSCOD indicates that it's a write-back access to MMIO space. It's possible
that the BIOS is scanning a certain memory region during boot, during which time
the BIOS disables generation of MCEs, which is why EN=0 in the above log.

It's a BIOS bug; one would expect the BIOS to clear these before handoff to the
OS. During OS boot we also scan all MC banks and log/clear them.

If you aren't observing them during normal operation you can safely ignore
these preboot logs, or pass them along to your OEM.

Cheers,
Ashok
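
The bit decoding discussed above can be reproduced with a short C sketch. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (VAL in bit 63 down to PCC in bit 57, MSCOD in bits 31:16, MCACOD in bits 15:0). The macro names are defined locally for this example, and `struct mci_status_fields` and `mci_status_decode()` are hypothetical names, not an existing kernel or mcelog interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_MCi_STATUS flag bits, defined locally for this sketch. */
#define MCI_STATUS_VAL		(1ULL << 63)	/* register contents valid */
#define MCI_STATUS_OVER		(1ULL << 62)	/* an earlier error was overwritten */
#define MCI_STATUS_UC		(1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_EN		(1ULL << 60)	/* error reporting was enabled */
#define MCI_STATUS_MISCV	(1ULL << 59)	/* MCi_MISC register valid */
#define MCI_STATUS_ADDRV	(1ULL << 58)	/* MCi_ADDR register valid */
#define MCI_STATUS_PCC		(1ULL << 57)	/* processor context corrupt */

struct mci_status_fields {
	bool val, over, uc, en, miscv, addrv, pcc;
	uint16_t mscod;		/* model-specific error code, bits 31:16 */
	uint16_t mcacod;	/* architectural MCA error code, bits 15:0 */
};

/* Split a raw MCi_STATUS value into the fields discussed in the thread. */
static struct mci_status_fields mci_status_decode(uint64_t status)
{
	struct mci_status_fields f = {
		.val	= !!(status & MCI_STATUS_VAL),
		.over	= !!(status & MCI_STATUS_OVER),
		.uc	= !!(status & MCI_STATUS_UC),
		.en	= !!(status & MCI_STATUS_EN),
		.miscv	= !!(status & MCI_STATUS_MISCV),
		.addrv	= !!(status & MCI_STATUS_ADDRV),
		.pcc	= !!(status & MCI_STATUS_PCC),
		.mscod	= (uint16_t)((status >> 16) & 0xffff),
		.mcacod	= (uint16_t)(status & 0xffff),
	};

	return f;
}
```

Feeding it the logged value `0xee0000000040110a` gives VAL=1, OVER=1, UC=1, EN=0, MISCV=1, ADDRV=1, PCC=1, MSCOD=0x0040 and MCACOD=0x110a, matching the decode quoted above.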


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05  1:12   ` Raj, Ashok
@ 2017-01-09 11:53     ` Paul Menzel
  2017-01-09 19:23       ` Raj, Ashok
  0 siblings, 1 reply; 49+ messages in thread
From: Paul Menzel @ 2017-01-09 11:53 UTC (permalink / raw)
  To: Ashok Raj, Borislav Petkov
  Cc: Linux Kernel Mailing List, Thorsten Leemhuis, Len Brown, Tony Luck

Dear Ashok, dear Borislav,


On 01/05/17 02:12, Raj, Ashok wrote:

>>> CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
>>> Hardware event. This is not a software error.
>>> MCE 1
>>> CPU 0 BANK 7
>>> MISC 7880018086 ADDR fef1ce40
>>> TIME 1483543069 Wed Jan  4 16:17:49 2017
>>> MCG status:
>>> MCi status:
>>> Error overflow
>>> Uncorrected error
>>> MCi_MISC register valid
>>> MCi_ADDR register valid
>>> Processor context corrupt
>>> MCA: corrected filtering (some unreported errors in same region)
>>> Generic CACHE Level-2 Generic Error
>>> STATUS ee0000000040110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.

To be clear, other than the message, the system is stable for me.

Here is `/proc/interrupts`.

```
$ more /proc/interrupts
             CPU0       CPU1       CPU2       CPU3
    0:         27          0          0          0  IR-IO-APIC    2-edge      timer
    1:          3          2        125          5  IR-IO-APIC    1-edge      i8042
    8:          0          1          0          0  IR-IO-APIC    8-edge      rtc0
    9:        108         31        397          5  IR-IO-APIC    9-fasteoi   acpi
   12:         66         18         92         35  IR-IO-APIC   12-edge      i8042
   14:          0          0          0          0  IR-IO-APIC   14-fasteoi   INT344B:00
   16:          0          0          0          0  IR-IO-APIC   16-fasteoi   idma64.0, i801_smbus, i2c_designware.0
   17:        419         42        280        415  IR-IO-APIC   17-fasteoi   idma64.1, i2c_designware.1
   51:          2          0          0          1  IR-IO-APIC   51-fasteoi   DLL075B:01
  120:          0          0          0          0  DMAR-MSI    0-edge      dmar0
  121:          0          0          0          0  DMAR-MSI    1-edge      dmar1
  274:         17          2          0          4  IR-PCI-MSI 30932992-edge      rtsx_pci
  275:         89         26         57         45  IR-PCI-MSI 327680-edge      xhci_hcd
  276:       1886          0       2361          0  IR-PCI-MSI 31457280-edge      nvme0q0, nvme0q1
  277:          0       3010       2570          0  IR-PCI-MSI 31457281-edge      nvme0q2
  278:          0          0       2023       3480  IR-PCI-MSI 31457282-edge      nvme0q3
  279:          0       3319          0       5863  IR-PCI-MSI 31457283-edge      nvme0q4
  280:         45          0          0          0  IR-PCI-MSI 360448-edge      mei_me
  281:        201         52       3008         85  IR-PCI-MSI 32768-edge      i915
  282:        151         29        997      24821  IR-PCI-MSI 30408704-edge      ath10k_pci
  283:        331        938        677        188  IR-PCI-MSI 514048-edge      snd_hda_intel:card0
  NMI:          1          0          0          0   Non-maskable interrupts
  LOC:      15198      21708      16850      31954   Local timer interrupts
  SPU:          0          0          0          0   Spurious interrupts
  PMI:          1          0          0          0   Performance monitoring interrupts
  IWI:          3          0          0          0   IRQ work interrupts
  RTR:          0          0          0          0   APIC ICR read retries
  RES:       1329       1974       1532       1959   Rescheduling interrupts
  CAL:       2254       3827       1969       3963   Function call 
interrupts
  TLB:        396       2349        342       2193   TLB shootdowns
  TRM:          0          0          0          0   Thermal event 
interrupts
  THR:          0          0          0          0   Threshold APIC 
interrupts
  DFR:          0          0          0          0   Deferred Error APIC 
interrupts
  MCE:          0          0          0          0   Machine check 
exceptions
  MCP:          9          9          9          9   Machine check polls
  ERR:         17
  MIS:          0
  PIN:          0          0          0          0   Posted-interrupt 
notification event
  PIW:          0          0          0          0   Posted-interrupt 
wakeup event
```
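
For anyone who wants to pull just the machine-check counters out of such a dump, here is a minimal sketch. It works on a pasted excerpt rather than the live file, and the helper name `mce_counters` is made up for illustration. It also assumes that on Intel the kernel accounts CMCI under the `THR` row; please treat that mapping as an assumption and check it against your kernel version.

```python
# Sketch: sum the per-CPU machine-check counters from /proc/interrupts text.
# SAMPLE holds the three relevant rows from the listing above.
SAMPLE = """\
            CPU0       CPU1       CPU2       CPU3
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:          9          9          9          9   Machine check polls
"""

def mce_counters(text, labels=("THR", "MCE", "MCP")):
    """Return {label: total across CPUs} for the requested rows."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())          # header row lists one name per CPU
    totals = {}
    for line in lines[1:]:
        name, _, rest = line.strip().partition(":")
        if name in labels:
            totals[name] = sum(int(x) for x in rest.split()[:ncpus])
    return totals

print(mce_counters(SAMPLE))
```

With these numbers only the poll counter (MCP) is non-zero, while MCE and THR stay at 0. On a live system the same function can be fed `open("/proc/interrupts").read()`.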

> Although it's reported as an L2 error, some memory errors can also manifest
> themselves as cache errors in certain cases.  In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
>>> MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
>
>
> Do you have another machine which doesn't report these errors? If so, try
> swapping memory between them to see if the error disappears.

No, I don’t. And everybody I talked to with a Dell XPS13 (9360) seems to 
have these errors.

> I don't have the model specific error handy.. will check that in the meantime
> to get some decoding as well.
>
> If you haven't already, running some memory tests would also help.

I need some time for that.

> If you replaced the motherboard, did that involve both CPU and memory,
> or just the motherboard swap?

Sorry, I don’t know, as I am not the person from StackExchange [1].


Kind regards,

Paul


[1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-04 22:55 ` Borislav Petkov
@ 2017-01-05  1:12   ` Raj, Ashok
  2017-01-09 11:53     ` Paul Menzel
  0 siblings, 1 reply; 49+ messages in thread
From: Raj, Ashok @ 2017-01-05  1:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Paul Menzel, Linux Kernel Mailing List, Thorsten Leemhuis,
	Len Brown, Tony Luck, ashok.raj

Hi Boris

thanks for forwarding.

> > CPUID Vendor Intel Family 6 Model 142
This is Kabylake Mobile

> > Hardware event. This is not a software error.
> > MCE 1
> > CPU 0 BANK 7
> > MISC 7880018086 ADDR fef1ce40
> > TIME 1483543069 Wed Jan  4 16:17:49 2017
> > MCG status:
> > MCi status:
> > Error overflow
> > Uncorrected error
> > MCi_MISC register valid
> > MCi_ADDR register valid
> > Processor context corrupt
> > MCA: corrected filtering (some unreported errors in same region)
> > Generic CACHE Level-2 Generic Error
> > STATUS ee0000000040110a MCGSTATUS 0

Decoding the bits further from MCi_STATUS above:
Val=1, OVER=1, UC=1, but EN=0 indicates this isn't an MCE, hence it should
have been signaled by a CMCI.

PCC=1, but should be ignored when EN=0. 
MCACOD: 110a MSCOD: 0040
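
As an illustration of the bit decoding above, here is a minimal sketch that extracts the same fields from the STATUS value. The bit positions are the architectural IA32_MCi_STATUS layout as I understand it from the Intel SDM; the helper name `decode_mci_status` is made up for this example.

```python
# Sketch: pull the architectural flag bits and error codes out of MCi_STATUS.
STATUS_FLAGS = {
    "VAL": 63,    # register contents valid
    "OVER": 62,   # error overflow
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MCi_MISC register valid
    "ADDRV": 58,  # MCi_ADDR register valid
    "PCC": 57,    # processor context corrupt
}

def decode_mci_status(status):
    """Return a dict with the flag bits plus MSCOD and MCACOD."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["MSCOD"] = (status >> 16) & 0xFFFF   # model-specific error code
    fields["MCACOD"] = status & 0xFFFF          # architectural error code
    return fields

decoded = decode_mci_status(0xEE0000000040110A)
print(decoded)
```

For the value logged here, every flag comes out as 1 except EN, and the two codes are 0x0040 and 0x110a, matching the manual decode.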

If the system is stable enough after the report, can you send the output of 
/proc/interrupts to confirm that. 

Although it's reported as an L2 error, some memory errors can also manifest
themselves as cache errors in certain cases.  In this case it looks like
some speculative fetch from bad memory might be the cause.

> > MCGCAP c08 APICID 0 SOCKETID 0

MCG_CAP: c08
Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
Threshold based error reporting (bit 11) (TES_P). 
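
The same kind of check can be done for MCG_CAP. This is only a sketch, assuming the architectural IA32_MCG_CAP layout (bank count in bits 7:0, CMCI_P in bit 10, TES_P in bit 11); `decode_mcg_cap` is a hypothetical helper name.

```python
# Sketch: decode the fields of IA32_MCG_CAP that matter for this report.
def decode_mcg_cap(cap):
    return {
        "bank_count": cap & 0xFF,     # number of reporting banks
        "CMCI_P": (cap >> 10) & 1,    # corrected MC error interrupt supported
        "TES_P": (cap >> 11) & 1,     # threshold-based error status supported
    }

cap = decode_mcg_cap(0xC08)
print(cap)
```

For c08 this gives eight banks (0 through 7, which covers the reported banks 6 and 7) with both CMCI_P and TES_P set.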


Do you have another machine which doesn't report these errors? If so, try
swapping memory between them to see if the error disappears.

I don't have the model specific error handy.. will check that in the meantime
to get some decoding as well.

If you haven't already, running some memory tests would also help.

If you replaced the motherboard, did that involve both CPU and memory,
or just the motherboard swap?

Cheers,
Ashok

* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-04 15:42 Dell XPS13: MCE (Hardware Error) reported Paul Menzel
@ 2017-01-04 22:55 ` Borislav Petkov
  2017-01-05  1:12   ` Raj, Ashok
  0 siblings, 1 reply; 49+ messages in thread
From: Borislav Petkov @ 2017-01-04 22:55 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Linux Kernel Mailing List, Thorsten Leemhuis, Len Brown,
	Tony Luck, Raj, Ashok

Lemme add some more folks to CC.

On Wed, Jan 04, 2017 at 04:42:18PM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> 
> The logs contain the following messages.
> 
> From Linux 4.10-rc2+ (0f64df301240 Merge branch 'parisc-4.10-2' of
> git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux):
> 
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ce40 MISC 7880018086
> > Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> 
> I am able to reproduce this also with Linux 4.8.11 from Debian Sid/unstable.
> 
> Installing *mcelog* 144+dfsg-1, the file below is created.
> 
> ```
> $ more /var/log/mcelog
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543069 Wed Jan  4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543069 Wed Jan  4 16:17:49 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 6
> MISC 47880018086 ADDR fef1ff40
> TIME 1483543581 Wed Jan  4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 7
> MISC 7880018086 ADDR fef1ce40
> TIME 1483543581 Wed Jan  4 16:26:21 2017
> MCG status:
> MCi status:
> Error overflow
> Uncorrected error
> MCi_MISC register valid
> MCi_ADDR register valid
> Processor context corrupt
> MCA: corrected filtering (some unreported errors in same region)
> Generic CACHE Level-2 Generic Error
> STATUS ee0000000040110a MCGSTATUS 0
> MCGCAP c08 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 142
> ```
> 
> It looks like it’s a common problem on this machine [1].
> 
> > First, I fear that I cannot really give good answers to your questions. I also own a Dell XPS 13 (9360) and see the same MCE messages. I'm in contact with Dell Support because of these. They replaced the mainboard but it did not help. Same messages in the logs. At some point they concluded that it is probably a false positive. They had no idea what is causing it, though (mcelog/kernel/Intel problem?). The correspondence with Support is still ongoing.
> > 
> > <rant> Btw, talking to Dell Support is a very unpleasant experience. They seem to only suggest the "standard" solutions like resetting the Firmware, run self-health tests and so on. I didn't had the impression to talk to someone with some technical insight. </rant>
> > 
> > To add more details, I see the same issue on Fedora 24 so it seems not to be related to Ubuntu.
> > 
> > Regarding your questions:
> > 
> >     What do these errors mean and should I worry about them?
> > 
> > I don't know. Dell Support thinks those are false positives.
> > 
> >     Could these hardware errors be the cause of the freezes of the entire system?
> > 
> > Besides the messages my system works fine. I'd guess the freeze is a different issue.
> > 
> >     Should I have the laptop (or parts) replaced by the manufacturer?
> > 
> > Replacing the mainboard did not fix the MCE issue. It might solve the freezing issue, although it seems that this was fixed by a kernel update.
> > 
> >     Are there any other actions I should take?
> > 
> > If you are not already in contact with Support, contact them. Maybe they will come up with a real solution once they see that it affects more customers.
> 
> Could you please tell me, if and where I should open an issue in the Linux
> bug tracker [2]?
> 
> Any ideas are welcome.
> 
> 
> Kind regards,
> 
> Paul
> 
> 
> [1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283
> [2] https://bugzilla.kernel.org/
> 

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

* Dell XPS13: MCE (Hardware Error) reported
@ 2017-01-04 15:42 Paul Menzel
  2017-01-04 22:55 ` Borislav Petkov
  0 siblings, 1 reply; 49+ messages in thread
From: Paul Menzel @ 2017-01-04 15:42 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Thorsten Leemhuis, Len Brown

Dear Linux folks,


The logs contain the following messages.

From Linux 4.10-rc2+ (0f64df301240 Merge branch 'parisc-4.10-2' of
git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux):

> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: ee0000000040110a
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ff40 MISC 47880018086
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 7: ee0000000040110a
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: TSC 0 ADDR fef1ce40 MISC 7880018086
> Jan 04 16:17:51 xps13 kernel: mce: [Hardware Error]: PROCESSOR 0:806e9 TIME 1483543069 SOCKET 0 APIC 0 microcode 0
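
The `PROCESSOR 0:806e9` field in these lines is the vendor (0 for Intel) followed by the CPUID signature. A minimal sketch of how that signature maps to a family and model number, assuming the standard CPUID leaf 1 EAX layout; the function name is made up for illustration.

```python
# Sketch: split a CPUID(1).EAX signature into family, model and stepping.
def decode_cpuid_signature(eax):
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping

print(decode_cpuid_signature(0x806E9))
```

For 806e9 this yields family 6, model 142 (0x8e), stepping 9.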

I am able to reproduce this also with Linux 4.8.11 from Debian Sid/unstable.

Installing *mcelog* 144+dfsg-1, the file below is created.

```
$ more /var/log/mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 47880018086 ADDR fef1ff40
TIME 1483543069 Wed Jan  4 16:17:49 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880018086 ADDR fef1ce40
TIME 1483543069 Wed Jan  4 16:17:49 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 6
MISC 47880018086 ADDR fef1ff40
TIME 1483543581 Wed Jan  4 16:26:21 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
Hardware event. This is not a software error.
MCE 1
CPU 0 BANK 7
MISC 7880018086 ADDR fef1ce40
TIME 1483543581 Wed Jan  4 16:26:21 2017
MCG status:
MCi status:
Error overflow
Uncorrected error
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS ee0000000040110a MCGSTATUS 0
MCGCAP c08 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 142
```
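
The "Generic CACHE Level-2 Generic Error" line can be reproduced from the low 16 bits of STATUS (MCACOD 110a). Below is a sketch, assuming the compound cache-hierarchy encoding `000F 0001 RRRR TTLL` from the Intel SDM; the lookup tables and the `decode_cache_mcacod` name are mine and only mimic the wording of the mcelog output.

```python
# Sketch: decode a cache-hierarchy MCACOD of the form 000F 0001 RRRR TTLL.
TT = {0: "Instruction", 1: "Data", 2: "Generic"}
LL = {0: "Level-0", 1: "Level-1", 2: "Level-2", 3: "Generic"}
RRRR = {
    0: "Generic Error", 1: "Generic Read", 2: "Generic Write",
    3: "Data Read", 4: "Data Write", 5: "Instruction Fetch",
    6: "Prefetch", 7: "Eviction", 8: "Snoop",
}

def decode_cache_mcacod(mcacod):
    """Return (description, filtered) or None if not a cache-hierarchy code."""
    filtered = bool(mcacod & (1 << 12))      # F bit: corrected-error filtering
    code = mcacod & 0xFFFF & ~(1 << 12)
    if code >> 8 != 0x01:
        return None
    tt = TT.get((code >> 2) & 0x3, "Reserved")
    ll = LL[code & 0x3]
    rrrr = RRRR.get((code >> 4) & 0xF, "Reserved")
    return f"{tt} CACHE {ll} {rrrr}", filtered

print(decode_cache_mcacod(0x110A))
```

For 110a the F bit is set, which corresponds to the "corrected filtering" line, and the remaining bits decode to a generic transaction at level 2 with a generic request type.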

It looks like it’s a common problem on this machine [1].

> First, I fear that I cannot really give good answers to your questions. I also own a Dell XPS 13 (9360) and see the same MCE messages. I'm in contact with Dell Support because of these. They replaced the mainboard but it did not help. Same messages in the logs. At some point they concluded that it is probably a false positive. They had no idea what is causing it, though (mcelog/kernel/Intel problem?). The correspondence with Support is still ongoing.
>
> <rant> Btw, talking to Dell Support is a very unpleasant experience. They seem to only suggest the "standard" solutions like resetting the Firmware, run self-health tests and so on. I didn't had the impression to talk to someone with some technical insight. </rant>
>
> To add more details, I see the same issue on Fedora 24 so it seems not to be related to Ubuntu.
>
> Regarding your questions:
>
>     What do these errors mean and should I worry about them?
>
> I don't know. Dell Support thinks those are false positives.
>
>     Could these hardware errors be the cause of the freezes of the entire system?
>
> Besides the messages my system works fine. I'd guess the freeze is a different issue.
>
>     Should I have the laptop (or parts) replaced by the manufacturer?
>
> Replacing the mainboard did not fix the MCE issue. It might solve the freezing issue, although it seems that this was fixed by a kernel update.
>
>     Are there any other actions I should take?
>
> If you are not already in contact with Support, contact them. Maybe they will come up with a real solution once they see that it affects more customers.

Could you please tell me, if and where I should open an issue in the 
Linux bug tracker [2]?

Any ideas are welcome.


Kind regards,

Paul


[1] https://unix.stackexchange.com/questions/324237/understanding-machine-check-exceptions-mce/330283
[2] https://bugzilla.kernel.org/

end of thread, other threads:[~2017-02-01 20:52 UTC | newest]

Thread overview: 49+ messages
2017-01-05  5:00 Dell XPS13: MCE (Hardware Error) reported Daniel J Blueman
2017-01-05 14:05 ` Daniel J Blueman
2017-01-05 20:10   ` Alexander Alemayhu
2017-01-05 20:31     ` Borislav Petkov
2017-01-05 20:43       ` Raj, Ashok
2017-01-05 21:03         ` Pandruvada, Srinivas
2017-01-05 23:23           ` Alexander Alemayhu
2017-01-05 21:38       ` Alexander Alemayhu
2017-01-05 23:28       ` Raj, Ashok
2017-01-05 23:56         ` Borislav Petkov
2017-01-06  1:26           ` Raj, Ashok
2017-01-06 11:16             ` Borislav Petkov
2017-01-06 15:58               ` Raj, Ashok
2017-01-06 16:54                 ` Borislav Petkov
2017-01-06 17:04                   ` Raj, Ashok
2017-01-09 10:55                   ` Paul Menzel
2017-01-09 11:05                     ` Borislav Petkov
2017-01-09 11:11                       ` Paul Menzel
  -- strict thread matches above, loose matches on Subject: below --
2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
2017-01-24  8:46   ` [tip:ras/core] x86/ras/inject: Make it depend on X86_LOCAL_APIC=y tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event Borislav Petkov
2017-01-24  8:47   ` [tip:ras/core] x86/ras/therm_throt: Do not log a fake MCE for thermal events tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly Borislav Petkov
2017-01-24  8:47   ` [tip:ras/core] x86/ras/amd: " tip-bot for Yazen Ghannam
2017-01-23 18:35 ` [PATCH 4/9] x86/MCE: Flip the TSC-adding logic Borislav Petkov
2017-01-24  8:48   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency Borislav Petkov
2017-01-24  8:48   ` [tip:ras/core] x86/ras/amd/inj: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce() Borislav Petkov
2017-01-24  8:49   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 7/9] EDAC/mce_amd: Dump TSC value Borislav Petkov
2017-01-24  8:50   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 8/9] x86/MCE: Get rid of mce_process_work() Borislav Petkov
2017-01-24  8:50   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority Borislav Petkov
2017-01-24  8:51   ` [tip:ras/core] x86/ras, " tip-bot for Borislav Petkov
2017-01-04 15:42 Dell XPS13: MCE (Hardware Error) reported Paul Menzel
2017-01-04 22:55 ` Borislav Petkov
2017-01-05  1:12   ` Raj, Ashok
2017-01-09 11:53     ` Paul Menzel
2017-01-09 19:23       ` Raj, Ashok
2017-01-27 13:35         ` Paul Menzel
2017-01-27 17:10           ` Borislav Petkov
2017-01-27 17:16             ` Mario.Limonciello
2017-01-31 15:29               ` Paul Menzel
2017-01-31 17:20                 ` Borislav Petkov
2017-01-31 18:50                 ` Austin S. Hemmelgarn
2017-02-01 20:52                 ` Mario.Limonciello
