* Re: Dell XPS13: MCE (Hardware Error) reported
@ 2017-01-05  5:00 Daniel J Blueman
  2017-01-05 14:05 ` Daniel J Blueman
  0 siblings, 1 reply; 37+ messages in thread
From: Daniel J Blueman @ 2017-01-05  5:00 UTC (permalink / raw)
  To: ashok.raj, Borislav Petkov, pmenzel, tony.luck, linux, len.brown,
	Linux Kernel

On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
> Hi Boris
>
> thanks for forwarding.
>
> > > CPUID Vendor Intel Family 6 Model 142
> This is Kabylake Mobile
>
> > > Hardware event. This is not a software error.
> > > MCE 1
> > > CPU 0 BANK 7
> > > MISC 7880018086 ADDR fef1ce40
> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
> > > MCG status:
> > > MCi status:
> > > Error overflow
> > > Uncorrected error
> > > MCi_MISC register valid
> > > MCi_ADDR register valid
> > > Processor context corrupt
> > > MCA: corrected filtering (some unreported errors in same region)
> > > Generic CACHE Level-2 Generic Error
> > > STATUS ee0000000040110a MCGSTATUS 0
>
> Decoding the bits further from MCi_STATUS above:
> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
> been signaled by a CMCI.
>
> PCC=1, but should be ignored when EN=0.
> MCACOD: 110a MSCOD: 0040
>
> If the system is stable enough after the report, can you send the output of
> /proc/interrupts to confirm that.
>
> Although it's reported as an L2 error, some memory errors can also manifest
> themselves as cache errors in certain cases.  In this case it looks like
> some speculative fetch from bad memory might be the cause.
>
> > > MCGCAP c08 APICID 0 SOCKETID 0
>
> MCG_CAP: c08
> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
> Threshold based error reporting (bit 11) (TES_P).
>
>
> Do you have another machine which doesn't report these errors? if so try
> swapping memory between them to see if the error disappears.
>
> I don't have the model specific error handy.. will check that in the meantime
> to get some decoding as well.
>
> If you haven't already, running some memory tests would also help.
>
> If you replaced the motherboard, did that involve both cpu and memory?
> or just the motherboard swap?
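
The bit decoding quoted above can be reproduced with a small helper. This is
only a sketch: it assumes the architectural IA32_MCi_STATUS and IA32_MCG_CAP
layouts from the Intel SDM, and the struct and function names are made up for
illustration (they are not kernel or mcelog APIs).

```c
#include <assert.h>
#include <stdint.h>

/* Selected IA32_MCi_STATUS fields (hypothetical struct, for illustration). */
struct mci_status_fields {
	unsigned int val;	/* bit 63: register contents valid */
	unsigned int over;	/* bit 62: error overflow */
	unsigned int uc;	/* bit 61: uncorrected error */
	unsigned int en;	/* bit 60: error reporting enabled */
	unsigned int miscv;	/* bit 59: MCi_MISC register valid */
	unsigned int addrv;	/* bit 58: MCi_ADDR register valid */
	unsigned int pcc;	/* bit 57: processor context corrupt */
	unsigned int mscod;	/* bits 31:16: model-specific error code */
	unsigned int mcacod;	/* bits 15:0: architectural error code */
};

struct mci_status_fields decode_mci_status(uint64_t status)
{
	struct mci_status_fields f;

	f.val    = (unsigned int)((status >> 63) & 1);
	f.over   = (unsigned int)((status >> 62) & 1);
	f.uc     = (unsigned int)((status >> 61) & 1);
	f.en     = (unsigned int)((status >> 60) & 1);
	f.miscv  = (unsigned int)((status >> 59) & 1);
	f.addrv  = (unsigned int)((status >> 58) & 1);
	f.pcc    = (unsigned int)((status >> 57) & 1);
	f.mscod  = (unsigned int)((status >> 16) & 0xffff);
	f.mcacod = (unsigned int)(status & 0xffff);
	return f;
}

/* IA32_MCG_CAP: bank count in bits 7:0, CMCI_P in bit 10, TES_P in bit 11. */
unsigned int mcg_cap_bank_count(uint64_t cap) { return (unsigned int)(cap & 0xff); }
unsigned int mcg_cap_cmci_p(uint64_t cap) { return (unsigned int)((cap >> 10) & 1); }
unsigned int mcg_cap_tes_p(uint64_t cap) { return (unsigned int)((cap >> 11) & 1); }

/* Self-check against the STATUS and MCGCAP values quoted in this thread. */
void check_quoted_values(void)
{
	struct mci_status_fields f = decode_mci_status(0xee0000000040110aULL);

	assert(f.val && f.over && f.uc && !f.en);
	assert(f.miscv && f.addrv && f.pcc);
	assert(f.mcacod == 0x110a && f.mscod == 0x0040);

	assert(mcg_cap_bank_count(0xc08) == 8);
	assert(mcg_cap_cmci_p(0xc08) == 1 && mcg_cap_tes_p(0xc08) == 1);
}
```

Calling check_quoted_values() from a test program confirms that STATUS
ee0000000040110a has EN=0 and that MCGCAP c08 advertises both CMCI and
threshold-based reporting.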

I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
physical address is in the non-coherent low MMIO window:
MISC 7880018086 ADDR fef1ce40

Which is declared as device memory:
[    0.000000] PM: Registered nosave memory: [mem 0xfee01000-0xfeffffff]

For core-generated cycles, it is between the local APIC space at
FEE00000:FEEFFFFF and SPI BIOS at FFE00000:FFFFFFFF, so will be
subtractively decoded to the PCH, maybe being aborted due to a device
not being enabled (hello TPM3 or new image processor).

As it is logged as soon as the MCE driver initialises, it was probably
logged during BIOS init, so there's not much we can do about it
anyways.
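
The address argument above can be checked mechanically. A minimal sketch,
using only the ranges given in this mail (the macro and function names are
illustrative, and the local APIC end is taken as FEEFFFFF):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Ranges as stated in this mail; names are hypothetical. */
#define NOSAVE_START	0xfee01000ULL	/* "Registered nosave memory" start */
#define NOSAVE_END	0xfeffffffULL	/* "Registered nosave memory" end */
#define LAPIC_START	0xfee00000ULL	/* local APIC space */
#define LAPIC_END	0xfeefffffULL
#define SPI_BIOS_START	0xffe00000ULL	/* SPI BIOS window */
#define SPI_BIOS_END	0xffffffffULL

/* Inclusive range test. */
bool in_range(uint64_t addr, uint64_t start, uint64_t end)
{
	return addr >= start && addr <= end;
}

/* The logged ADDR sits in the nosave region, between LAPIC and SPI BIOS. */
void check_reported_addr(void)
{
	const uint64_t addr = 0xfef1ce40ULL;

	assert(in_range(addr, NOSAVE_START, NOSAVE_END));
	assert(!in_range(addr, LAPIC_START, LAPIC_END));
	assert(!in_range(addr, SPI_BIOS_START, SPI_BIOS_END));
	assert(addr > LAPIC_END && addr < SPI_BIOS_START);
}
```

This only confirms where the address falls; it says nothing about which
device, if any, is expected to claim it.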

Dan
-- 
Daniel J Blueman


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05  5:00 Dell XPS13: MCE (Hardware Error) reported Daniel J Blueman
@ 2017-01-05 14:05 ` Daniel J Blueman
  2017-01-05 20:10   ` Alexander Alemayhu
  0 siblings, 1 reply; 37+ messages in thread
From: Daniel J Blueman @ 2017-01-05 14:05 UTC (permalink / raw)
  To: Raj, Ashok, Borislav Petkov, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

On 5 January 2017 at 13:00, Daniel J Blueman <daniel@quora.org> wrote:
> On Thursday, January 5, 2017 at 9:20:04 AM UTC+8, Raj, Ashok wrote:
>> Hi Boris
>>
>> thanks for forwarding.
>>
>> > > CPUID Vendor Intel Family 6 Model 142
>> This is Kabylake Mobile
>>
>> > > Hardware event. This is not a software error.
>> > > MCE 1
>> > > CPU 0 BANK 7
>> > > MISC 7880018086 ADDR fef1ce40
>> > > TIME 1483543069 Wed Jan  4 16:17:49 2017
>> > > MCG status:
>> > > MCi status:
>> > > Error overflow
>> > > Uncorrected error
>> > > MCi_MISC register valid
>> > > MCi_ADDR register valid
>> > > Processor context corrupt
>> > > MCA: corrected filtering (some unreported errors in same region)
>> > > Generic CACHE Level-2 Generic Error
>> > > STATUS ee0000000040110a MCGSTATUS 0
>>
>> Decoding the bits further from MCi_STATUS above:
>> Val=1, OVER=1, UC=1, but EN=0 indicates this isn't a MCE, hence should have
>> been signaled by a CMCI.
>>
>> PCC=1, but should be ignored when EN=0.
>> MCACOD: 110a MSCOD: 0040
>>
>> If the system is stable enough after the report, can you send the output of
>> /proc/interrupts to confirm that.
>>
>> Although it's reported as an L2 error, some memory errors can also manifest
>> themselves as cache errors in certain cases.  In this case it looks like
>> some speculative fetch from bad memory might be the cause.
>>
>> > > MCGCAP c08 APICID 0 SOCKETID 0
>>
>> MCG_CAP: c08
>> Support CMCI(bit 10) - Corrected Machine Check Interrupt (CMCI_P) and
>> Threshold based error reporting (bit 11) (TES_P).
>>
>>
>> Do you have another machine which doesn't report these errors? if so try
>> swapping memory between them to see if the error disappears.
>>
>> I don't have the model specific error handy.. will check that in the meantime
>> to get some decoding as well.
>>
>> If you haven't already, running some memory tests would also help.
>>
>> If you replaced the motherboard, did that involve both cpu and memory?
>> or just the motherboard swap?
>
> I see the MCE on my XPS 9360 also. It's not related to DRAM, as the
> physical address is in the non-coherent low MMIO window:
> MISC 7880018086 ADDR fef1ce40
>
> Which is declared as device memory:
> [    0.000000] PM: Registered nosave memory: [mem 0xfee01000-0xfeffffff]
>
> For core-generated cycles, it is between the local APIC space at
> FEE00000:FEEFFFFF and SPI BIOS at FFE00000:FFFFFFFF, so will be
> subtractively decoded to the PCH, maybe being aborted due to a device
> not being enabled (hello TPM3 or new image processor).
>
> As it is logged as soon as the MCE driver initialises, it was probably
> logged during BIOS init, so there's not much we can do about it
> anyways.

That said, I have seen this reoccur after boot; there were no other
kernel messages around 300s uptime, and it hasn't occurred in the last
hours since:

$ dmesg | grep Machine
[    0.039072] mce: [Hardware Error]: Machine check events logged
[  300.069176] mce: [Hardware Error]: Machine check events logged

As I don't see a driver controlling this area of address space, the
access is likely initiated from the UEFI BIOS System Management Mode
handler, and we see the same pair of registers FEF1FF40, FEF1CE40
accessed each time.

Dan
-- 
Daniel J Blueman


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 14:05 ` Daniel J Blueman
@ 2017-01-05 20:10   ` Alexander Alemayhu
  2017-01-05 20:31     ` Borislav Petkov
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Alemayhu @ 2017-01-05 20:10 UTC (permalink / raw)
  To: Daniel J Blueman
  Cc: Raj, Ashok, Borislav Petkov, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

On Thu, Jan 05, 2017 at 10:05:39PM +0800, Daniel J Blueman wrote:
> 
> That said, I have seen this reoccur after boot; there were no other
> kernel messages around 300s uptime, and it hasn't occurred in the last
> hours since:
>

Not sure if it is related, but I am also seeing those messages on my
MacBookPro11,3:

grep -e "Linux version" -e "Machine" oops-2017-01-05-10:32:10-1076-0/dmesg

[    0.000000] Linux version 4.9.0 (scanf@hafza) (gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC) ) #1 SMP Sun Dec 25 22:25:17 CET 2016
[ 4231.274376] mce: [Hardware Error]: Machine check events logged
[ 4231.274893] mce: [Hardware Error]: Machine check events logged
[ 4531.292608] mce: [Hardware Error]: Machine check events logged
[ 4531.292610] mce: [Hardware Error]: Machine check events logged
[ 4833.369927] mce: [Hardware Error]: Machine check events logged
[ 4833.370906] mce: [Hardware Error]: Machine check events logged
[ 5135.449222] mce: [Hardware Error]: Machine check events logged
[ 5135.449228] mce: [Hardware Error]: Machine check events logged
[ 5435.564152] mce: [Hardware Error]: Machine check events logged
[ 5435.564153] mce: [Hardware Error]: Machine check events logged
[ 5735.592854] mce: [Hardware Error]: Machine check events logged
[ 5735.592862] mce: [Hardware Error]: Machine check events logged
[ 6038.070068] mce: [Hardware Error]: Machine check events logged
[ 6038.070073] mce: [Hardware Error]: Machine check events logged
[ 6338.170948] mce: [Hardware Error]: Machine check events logged
[ 6338.171930] mce: [Hardware Error]: Machine check events logged
[ 6920.788280] mce: [Hardware Error]: Machine check events logged
[ 6920.788284] mce: [Hardware Error]: Machine check events logged

full output: https://gist.githubusercontent.com/scanf/5b9dc1940c4913f393fbfbbe40ef6788/raw/2fe98a912979b8b4346c4c55582e092fc42bce8a/7ba692efe.txt
-- 
Mit freundlichen Grüßen

Alexander Alemayhu


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:10   ` Alexander Alemayhu
@ 2017-01-05 20:31     ` Borislav Petkov
  2017-01-05 20:43       ` Raj, Ashok
                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Borislav Petkov @ 2017-01-05 20:31 UTC (permalink / raw)
  To: Alexander Alemayhu
  Cc: Daniel J Blueman, Raj, Ashok, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> Not sure if it is related, but I am also seeing those messages on my
> MacBookPro11,3:

Yours look to me like thermal throttling MCEs. And TBH we should
not issue those as actual MCEs because they are not - they *signal*
overheating condition only and should be handled differently.

What does your /var/log/mcelog* log file contain?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:31     ` Borislav Petkov
@ 2017-01-05 20:43       ` Raj, Ashok
  2017-01-05 21:03         ` Pandruvada, Srinivas
  2017-01-05 21:38       ` Alexander Alemayhu
  2017-01-05 23:28       ` Raj, Ashok
  2 siblings, 1 reply; 37+ messages in thread
From: Raj, Ashok @ 2017-01-05 20:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, ashok.raj, srinivas.pandruvada

Hi Boris


On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> > Not sure if it is related, but I am also seeing those messages on my
> > MacBookPro11,3:
> 
> Yours look to me like thermal throttling MCEs. And TBH we should
> not issue those as actual MCEs because they are not - they *signal*
> overheating condition only and should be handled differently.

That's right.. the thermal interrupts are being reported, and that should have
started running the CPUs at lower frequencies via some thermald.
If that's not handled and the PCU steps in to keep the temps under control,
the system starts logging MCEs.

Ccing Srinivas who might be able to give a better pointer to check why 
that's not happening.

The log didn't have the exact MCE's reported.. if you have mcelog, please
attach that in the report.

> 
> What does your /var/log/mcelog* log file contain?
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
> -- 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:43       ` Raj, Ashok
@ 2017-01-05 21:03         ` Pandruvada, Srinivas
  2017-01-05 23:23           ` Alexander Alemayhu
  0 siblings, 1 reply; 37+ messages in thread
From: Pandruvada, Srinivas @ 2017-01-05 21:03 UTC (permalink / raw)
  To: Raj, Ashok, bp
  Cc: Brown, Len, linux-kernel, Luck, Tony, linux, pmenzel, alexander, daniel

On Thu, 2017-01-05 at 12:43 -0800, Raj, Ashok wrote:
> Hi Boris
> 
> 
> On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> > 
> > On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> > > 
> > > Not sure if it is related, but I am also seeing those messages on
> > > my
> > > MacBookPro11,3:
> > 
> > Yours look to me like thermal throttling MCEs. And TBH we should
> > not issue those as actual MCEs because they are not - they *signal*
> > overheating condition only and should be handled differently.
> 
> That's right.. the thermal interrupts are being reported, and that should
> have started running the CPUs at lower frequencies via some thermald.
> If that's not handled and the PCU steps in to keep the temps under
> control, the system starts logging MCEs.
> 
> Ccing Srinivas who might be able to give a better pointer to check
> why 
> that's not happening.
I suggest trying the following kernel command line, if you were
getting notifications to throttle from SMM before:

	intel_pstate=support_acpi_ppc

opensuse doesn't start thermald by default.

Thanks,
Srinivas


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:31     ` Borislav Petkov
  2017-01-05 20:43       ` Raj, Ashok
@ 2017-01-05 21:38       ` Alexander Alemayhu
  2017-01-05 23:28       ` Raj, Ashok
  2 siblings, 0 replies; 37+ messages in thread
From: Alexander Alemayhu @ 2017-01-05 21:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Daniel J Blueman, Raj, Ashok, Paul Menzel, tony.luck, linux,
	len.brown, Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 352 bytes --]

On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> 
> What does your /var/log/mcelog* log file contain?
>

There are no files there, but in the attached backtrace mcelog is mentioned
several times. Sorry I am not familiar with mcelog so I don't know where it
would be logging on Fedora.

-- 
Mit freundlichen Grüßen

Alexander Alemayhu

[-- Attachment #2: backtrace --]
[-- Type: text/plain, Size: 1500 bytes --]

The kernel log indicates that hardware errors were detected.
System log may have more information.
The last 20 mcelog lines of system log are:
==========================================
Jan  5 10:32:09 hafza mcelog: mcelog: warning: 16 bytes ignored in each record
Jan  5 10:32:09 hafza mcelog: mcelog: consider an update
Jan  5 10:32:09 hafza mcelog: Hardware event. This is not a software error.
Jan  5 10:32:09 hafza mcelog: MCE 0
Jan  5 10:32:09 hafza mcelog: CPU 1 THERMAL EVENT TSC 7722fe2d8c7
Jan  5 10:32:09 hafza mcelog: TIME 1483608727 Thu Jan  5 10:32:07 2017
Jan  5 10:32:09 hafza mcelog: Processor 1 below trip temperature. Throttling disabled
Jan  5 10:32:09 hafza mcelog: STATUS 88020a8a MCGSTATUS 0
Jan  5 10:32:09 hafza mcelog: MCGCAP c0a APICID 2 SOCKETID 0
Jan  5 10:32:09 hafza mcelog: CPUID Vendor Intel Family 6 Model 70
Jan  5 10:32:09 hafza mcelog: Hardware event. This is not a software error.
Jan  5 10:32:09 hafza mcelog: MCE 1
Jan  5 10:32:09 hafza mcelog: CPU 5 THERMAL EVENT TSC 7722fe316ed
Jan  5 10:32:09 hafza mcelog: TIME 1483608727 Thu Jan  5 10:32:07 2017
Jan  5 10:32:09 hafza mcelog: Processor 5 below trip temperature. Throttling disabled
Jan  5 10:32:09 hafza mcelog: STATUS 88020a8a MCGSTATUS 0
Jan  5 10:32:09 hafza mcelog: MCGCAP c0a APICID 3 SOCKETID 0
Jan  5 10:32:09 hafza mcelog: CPUID Vendor Intel Family 6 Model 70
Jan  5 10:32:09 hafza mcelog: mcelog: warning: 16 bytes ignored in each record
Jan  5 10:32:09 hafza mcelog: mcelog: consider an update
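
The STATUS value in these thermal records can be decoded by hand. The sketch
below assumes STATUS carries the raw IA32_THERM_STATUS value (as the kernel
comment on mce_log_therm_throt_event() suggests) and uses the bit layout from
the Intel SDM; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Selected IA32_THERM_STATUS fields (hypothetical struct, for illustration). */
struct therm_status_fields {
	unsigned int throttling;	/* bit 0: thermal status, currently active */
	unsigned int throttled_log;	/* bit 1: sticky log, was active since cleared */
	unsigned int readout;		/* bits 22:16: degrees C below TjMax */
	unsigned int valid;		/* bit 31: digital readout valid */
};

struct therm_status_fields decode_therm_status(uint32_t msr_val)
{
	struct therm_status_fields f;

	f.throttling    = msr_val & 1;
	f.throttled_log = (msr_val >> 1) & 1;
	f.readout       = (msr_val >> 16) & 0x7f;
	f.valid         = (msr_val >> 31) & 1;
	return f;
}

/* Self-check against "STATUS 88020a8a" from the log above. */
void check_logged_therm_status(void)
{
	struct therm_status_fields f = decode_therm_status(0x88020a8aU);

	assert(f.throttling == 0);	/* "below trip temperature" */
	assert(f.throttled_log == 1);	/* but it had tripped earlier */
	assert(f.valid == 1 && f.readout == 2);
}
```

Under those assumptions, 88020a8a describes a core that is no longer
throttling, has throttled since the log bit was last cleared, and sits about
2 degrees below TjMax.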


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 21:03         ` Pandruvada, Srinivas
@ 2017-01-05 23:23           ` Alexander Alemayhu
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Alemayhu @ 2017-01-05 23:23 UTC (permalink / raw)
  To: Pandruvada, Srinivas
  Cc: Raj, Ashok, bp, Brown, Len, linux-kernel, Luck, Tony, linux,
	pmenzel, daniel

On Thu, Jan 05, 2017 at 09:03:10PM +0000, Pandruvada, Srinivas wrote:
> I suggest trying the following kernel command line, if you were
> getting notifications to throttle from SMM before:
> 
> 	intel_pstate=support_acpi_ppc
> 
> opensuse doesn't start thermald by default.
> 

Used the suggested kernel command line change and made sure thermald[0] is
running.  I see no more mce* related errors and my fans are making less noise
during workloads.  Thank you :)

[0]: https://github.com/01org/thermal_daemon

-- 
Mit freundlichen Grüßen

Alexander Alemayhu


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 20:31     ` Borislav Petkov
  2017-01-05 20:43       ` Raj, Ashok
  2017-01-05 21:38       ` Alexander Alemayhu
@ 2017-01-05 23:28       ` Raj, Ashok
  2017-01-05 23:56         ` Borislav Petkov
  2 siblings, 1 reply; 37+ messages in thread
From: Raj, Ashok @ 2017-01-05 23:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, ashok.raj

Hi Boris

On Thu, Jan 05, 2017 at 09:31:47PM +0100, Borislav Petkov wrote:
> On Thu, Jan 05, 2017 at 09:10:34PM +0100, Alexander Alemayhu wrote:
> > Not sure if it is related, but I am also seeing those messages on my
> > MacBookPro11,3:
> 
> Yours look to me like thermal throttling MCEs. And TBH we should
> not issue those as actual MCEs because they are not - they *signal*
> overheating condition only and should be handled differently.

After looking at the code, it seems these events are logged as MCEs
but really come from LVT thermal event interrupts, via a fake
bank 128 for MCE_THERMAL. These are not really HW MCEs, but fake ones created
and logged as mcelog entries (arch/x86/kernel/cpu/mcheck/therm_throt.c).
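
One hypothetical way for a log consumer to tell such software-defined records
from real hardware banks is to compare the bank number against the bank count
in MCG_CAP bits 7:0. This is a sketch only; is_software_bank() is an invented
helper, not something mcelog or the kernel provides. The value 128 is the
MCE_THERMAL_BANK definition from asm/mce.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MCE_THERMAL_BANK 128	/* software-defined bank used for thermal events */

/* Hardware banks are numbered 0..count-1, with count in MCG_CAP bits 7:0. */
bool is_software_bank(unsigned int bank, uint64_t mcg_cap)
{
	unsigned int hw_banks = (unsigned int)(mcg_cap & 0xff);

	return bank >= hw_banks;
}

/* Self-check against the two machines discussed in this thread. */
void check_thread_banks(void)
{
	/* XPS 13 record: BANK 7 with MCGCAP c08 (8 banks) is a hardware bank. */
	assert(!is_software_bank(7, 0xc08));
	/* MacBookPro record: thermal bank 128 with MCGCAP c0a (10 banks). */
	assert(is_software_bank(MCE_THERMAL_BANK, 0xc0a));
}
```

By this test the XPS 13 report comes from a real bank, while the MacBookPro
thermal records do not.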


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 23:28       ` Raj, Ashok
@ 2017-01-05 23:56         ` Borislav Petkov
  2017-01-06  1:26           ` Raj, Ashok
  0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-05 23:56 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel

On Thu, Jan 05, 2017 at 03:28:00PM -0800, Raj, Ashok wrote:
> After looking at the code, it seems these events are logged as MCEs
> but really come from LVT thermal event interrupts, via a fake
> bank 128 for MCE_THERMAL. These are not really HW MCEs, but fake ones created
> and logged as mcelog entries (arch/x86/kernel/cpu/mcheck/therm_throt.c).

Right, we've done that since forever but I do think that it confuses
people. This thread is a case in point. I mean, we already scream:

	pr_crit("CPU%d: %s temperature above threshold, cpu clock throttled (total events = %lu)\n",

to dmesg, why do we have to log a fake MCE too?!

Hell, we even log an MCE when things go back to normal:

        if (old_event) {
                if (event == THERMAL_THROTTLING_EVENT)
                        pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
                                level == CORE_LEVEL ? "Core" : "Package");
                return 1;

And Alexander's log shows exactly that:

[ 6338.170924] CPU1: Core temperature above threshold, cpu clock throttled (total events = 21068)
[ 6338.170925] CPU5: Core temperature above threshold, cpu clock throttled (total events = 21068)
[ 6338.170928] CPU7: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170931] CPU4: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170932] CPU0: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170933] CPU6: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170935] CPU2: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170936] CPU3: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170937] CPU5: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170945] CPU1: Package temperature above threshold, cpu clock throttled (total events = 22842)
[ 6338.170947] mce_notify_irq: 1 callbacks suppressed
[ 6338.170948] mce: [Hardware Error]: Machine check events logged				<--- new event
[ 6338.171917] CPU1: Core temperature/speed normal
[ 6338.171918] CPU5: Core temperature/speed normal
[ 6338.171920] CPU4: Package temperature/speed normal
[ 6338.171920] CPU0: Package temperature/speed normal
[ 6338.171922] CPU2: Package temperature/speed normal
[ 6338.171923] CPU6: Package temperature/speed normal
[ 6338.171924] CPU3: Package temperature/speed normal
[ 6338.171925] CPU7: Package temperature/speed normal
[ 6338.171927] CPU5: Package temperature/speed normal
[ 6338.171929] CPU1: Package temperature/speed normal
[ 6338.171930] mce: [Hardware Error]: Machine check events logged				<--- old event

Oh, and it's not like the user can do anything - there's a thermald
which is supposed to deal with all that. Which is not really
trouble-free either, TBH. What happens if that thing dies? Fried CPU?

So I say we should rip out that mce_log_therm_throt_event() and never
ever handle thermal events with MCEs. It is a bad idea.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-05 23:56         ` Borislav Petkov
@ 2017-01-06  1:26           ` Raj, Ashok
  2017-01-06 11:16             ` Borislav Petkov
  0 siblings, 1 reply; 37+ messages in thread
From: Raj, Ashok @ 2017-01-06  1:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, ashok.raj

On Fri, Jan 06, 2017 at 12:56:11AM +0100, Borislav Petkov wrote:
> Oh, and it's not like the user can do anything - there's a thermald
> which is supposed to deal with all that. Which is not really
> trouble-free either, TBH. What happens if that thing dies? Fried CPU?
> 
> So I say we should rip out that mce_log_therm_throt_event() and never
> ever handle thermal events with MCEs. It is a bad idea.

Agree, since we have both a log and another agent to deal with it, there is
no good reason to continue... Will pass this along, and have someone look at
cleaning this up.


Cheers,
Ashok


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06  1:26           ` Raj, Ashok
@ 2017-01-06 11:16             ` Borislav Petkov
  2017-01-06 15:58               ` Raj, Ashok
  0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-06 11:16 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas

On Thu, Jan 05, 2017 at 05:26:17PM -0800, Raj, Ashok wrote:
> Agree, since we have both a log and another agent to deal with it, there is
> no good reason to continue... Will pass this along, and have someone look at
> cleaning this up.

Like this?

---
From: Borislav Petkov <bp@suse.de>
Date: Fri, 6 Jan 2017 12:07:08 +0100
Subject: [PATCH] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And besides
we have other means for dealing with thermal events which are much more
suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c |  9 ++++-----
 3 files changed, 4 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a6c0a2..a09ed05725c2 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef43233e03..6eef6fde0f02 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8be009..109fbb25c851 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -365,10 +365,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
-- 
2.11.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 11:16             ` Borislav Petkov
@ 2017-01-06 15:58               ` Raj, Ashok
  2017-01-06 16:54                 ` Borislav Petkov
  0 siblings, 1 reply; 37+ messages in thread
From: Raj, Ashok @ 2017-01-06 15:58 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas, ashok.raj

Hi Boris

On Fri, Jan 06, 2017 at 12:16:17PM +0100, Borislav Petkov wrote:
> On Thu, Jan 05, 2017 at 05:26:17PM -0800, Raj, Ashok wrote:
> > Agree, since we have both a log and another agent to deal with it, there is
> > no good reason to continue... Will pass this along, and have someone look at
> > cleaning this up.
> 
> Like this?

That was quick :-).

> -	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
> -				THERMAL_THROTTLING_EVENT,
> -				CORE_LEVEL) != 0)

Looks like we don't need a return value from therm_throt_process(),
we can fix that as void as well.

Otherwise it looks good. 

> +	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
> +			    THERMAL_THROTTLING_EVENT,
> +			    CORE_LEVEL);
>  

Cheers,
Ashok


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 15:58               ` Raj, Ashok
@ 2017-01-06 16:54                 ` Borislav Petkov
  2017-01-06 17:04                   ` Raj, Ashok
  2017-01-09 10:55                   ` Paul Menzel
  0 siblings, 2 replies; 37+ messages in thread
From: Borislav Petkov @ 2017-01-06 16:54 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas

On Fri, Jan 06, 2017 at 07:58:31AM -0800, Raj, Ashok wrote:
> Looks like we don't need a return value from therm_throt_process(),
> we can fix that as void as well.

Right you are, here's v2:

---
From a8151fa6f18c2605eb7972061234f05e79b372c4 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <bp@suse.de>
Date: Fri, 6 Jan 2017 12:07:08 +0100
Subject: [PATCH] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And besides
we have other means for dealing with thermal events which are much more
suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
---

v2: Ashok: make therm_throt_process() void.

 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
 3 files changed, 11 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a6c0a2..a09ed05725c2 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef43233e03..6eef6fde0f02 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8be009..85469f84c921 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -141,13 +141,8 @@ static struct attribute_group thermal_attr_group = {
  * IRQ has been acknowledged.
  *
  * It will take care of rate limiting and printing messages to the syslog.
- *
- * Returns: 0 : Event should NOT be further logged, i.e. still in
- *              "timeout" from previous log message.
- *          1 : Event should be logged further, and a message has been
- *              printed to the syslog.
  */
-static int therm_throt_process(bool new_event, int event, int level)
+static void therm_throt_process(bool new_event, int event, int level)
 {
 	struct _thermal_state *state;
 	unsigned int this_cpu = smp_processor_id();
@@ -162,16 +157,16 @@ static int therm_throt_process(bool new_event, int event, int level)
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->core_power_limit;
 		else
-			 return 0;
+			return;
 	} else if (level == PACKAGE_LEVEL) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			state = &pstate->package_throttle;
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->package_power_limit;
 		else
-			return 0;
+			return;
 	} else
-		return 0;
+		return;
 
 	old_event = state->new_event;
 	state->new_event = new_event;
@@ -181,7 +176,7 @@ static int therm_throt_process(bool new_event, int event, int level)
 
 	if (time_before64(now, state->next_check) &&
 			state->count != state->last_count)
-		return 0;
+		return;
 
 	state->next_check = now + CHECK_INTERVAL;
 	state->last_count = state->count;
@@ -193,16 +188,14 @@ static int therm_throt_process(bool new_event, int event, int level)
 				this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package",
 				state->count);
-		return 1;
+		return;
 	}
 	if (old_event) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package");
-		return 1;
+		return;
 	}
-
-	return 0;
 }
 
 static int thresh_event_valid(int level, int event)
@@ -365,10 +358,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
-- 
2.11.0

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 16:54                 ` Borislav Petkov
@ 2017-01-06 17:04                   ` Raj, Ashok
  2017-01-09 10:55                   ` Paul Menzel
  1 sibling, 0 replies; 37+ messages in thread
From: Raj, Ashok @ 2017-01-06 17:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Alexander Alemayhu, Daniel J Blueman, Paul Menzel, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas, ashok.raj

Hi Boris

This looks good to me!

On Fri, Jan 06, 2017 at 05:54:23PM +0100, Borislav Petkov wrote:
> On Fri, Jan 06, 2017 at 07:58:31AM -0800, Raj, Ashok wrote:
> > Looks like we don't need a return value from therm_throt_process(),
> > we can fix that as void as well.
> 
> Right you are, here's v2:
> 
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
> 
> v2: Ashok: make therm_throt_process() void.

Acked-by: Ashok Raj <ashok.raj@intel.com>
> 
>  arch/x86/include/asm/mce.h               |  6 ------
>  arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
>  arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
>  3 files changed, 11 insertions(+), 50 deletions(-)

Cheers,
Ashok


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-06 16:54                 ` Borislav Petkov
  2017-01-06 17:04                   ` Raj, Ashok
@ 2017-01-09 10:55                   ` Paul Menzel
  2017-01-09 11:05                     ` Borislav Petkov
  1 sibling, 1 reply; 37+ messages in thread
From: Paul Menzel @ 2017-01-09 10:55 UTC (permalink / raw)
  To: Borislav Petkov, Ashok Raj
  Cc: Alexander Alemayhu, Daniel J Blueman, tony.luck, linux,
	len.brown, Linux Kernel, Pandruvada, Srinivas

On 01/06/17 17:54, Borislav Petkov wrote:
> On Fri, Jan 06, 2017 at 07:58:31AM -0800, Raj, Ashok wrote:
>> Looks like we don't need a return value from therm_throt_process(),
>> we can fix that as void as well.
>
> Right you are, here's v2:
>
> ---
> From a8151fa6f18c2605eb7972061234f05e79b372c4 Mon Sep 17 00:00:00 2001
> From: Borislav Petkov <bp@suse.de>
> Date: Fri, 6 Jan 2017 12:07:08 +0100
> Subject: [PATCH] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event
>
> We log a fake bank 128 MCE to note that we're handling a CPU thermal
> event. However, this confuses people into thinking that their hardware
> generates MCEs. Hijacking MCA for logging thermal events is a gross
> misuse anyway and it should've been done in the first place. And besides

Do you mean *shouldn’t have been done*?

> we have other means for dealing with thermal events which are much more
> suitable.
>
> So let's kill the MCE logging part.
>
> Signed-off-by: Borislav Petkov <bp@suse.de>

Should the discussion be referenced?

Also, is that just for MacBookPro11,3? The MCE for the Dell XPS13 looks 
different from what I see, doesn’t it?

[…]


Kind regards,

Paul


* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 10:55                   ` Paul Menzel
@ 2017-01-09 11:05                     ` Borislav Petkov
  2017-01-09 11:11                       ` Paul Menzel
  0 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-09 11:05 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Ashok Raj, Alexander Alemayhu, Daniel J Blueman, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas

On Mon, Jan 09, 2017 at 11:55:41AM +0100, Paul Menzel wrote:
> Do you mean *shouldn’t have been done*?

Yes.

> Should the discussion be referenced?

Yap, it will be.

> Also, is that just for MacBookPro11,3? The MCE for the Dell XPS13 looks
> different from what I see, doesn’t it?

Yes, yours is different. I'm still waiting for you to reply to Ashok's
questions here:

https://lkml.kernel.org/r/20170105011236.GA80100@otc-brkl-03

Thanks.

-- 
Regards/Gruss,
    Boris.



* Re: Dell XPS13: MCE (Hardware Error) reported
  2017-01-09 11:05                     ` Borislav Petkov
@ 2017-01-09 11:11                       ` Paul Menzel
  0 siblings, 0 replies; 37+ messages in thread
From: Paul Menzel @ 2017-01-09 11:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ashok Raj, Alexander Alemayhu, Daniel J Blueman, tony.luck,
	linux, len.brown, Linux Kernel, Pandruvada, Srinivas,
	Daniel Blueman

Dear Boris,


On 01/09/17 12:05, Borislav Petkov wrote:
> On Mon, Jan 09, 2017 at 11:55:41AM +0100, Paul Menzel wrote:

[…]

>> Also, is that just for MacBookPro11,3? The MCE for the Dell XPS13 looks
>> different from what I see, doesn’t it?
>
> Yes, yours is different. I'm still waiting for you to reply to Ashok's
> questions here:
>
> https://lkml.kernel.org/r/20170105011236.GA80100@otc-brkl-03

I see. I thought Daniel had answered them. I’ll reply now.


Kind regards,

Paul


* [PATCH 0/9] x86/RAS: Queue for 4.11
@ 2017-01-23 18:35 Borislav Petkov
  2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
                   ` (8 more replies)
  0 siblings, 9 replies; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Hi,

here's the stuff which got ready in time. The more exciting things are
going to wait for the next merge window. :-)

Please apply,
thanks.

Borislav Petkov (8):
  x86/mce-inject: Make it depend on X86_LOCAL_APIC
  x86/MCE/therm_throt: Do not log a fake MCE for a thermal event
  x86/MCE: Flip the TSC-adding logic
  x86/ras/mce_amd_inj: Change dependency
  EDAC/mce_amd: Unexport amd_decode_mce()
  EDAC/mce_amd: Dump TSC value
  x86/MCE: Get rid of mce_process_work()
  x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority

Yazen Ghannam (1):
  x86/MCE/AMD: Make sysfs names of banks more user-friendly

 arch/x86/Kconfig                          |  2 +-
 arch/x86/include/asm/mce.h                | 20 ++++++-----
 arch/x86/kernel/cpu/mcheck/mce-apei.c     |  5 ++-
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c   |  5 +--
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c          | 57 ++++---------------------------
 arch/x86/kernel/cpu/mcheck/mce_amd.c      |  9 +++--
 arch/x86/kernel/cpu/mcheck/therm_throt.c  | 30 ++++++----------
 arch/x86/ras/Kconfig                      |  2 +-
 drivers/acpi/acpi_extlog.c                |  1 +
 drivers/acpi/nfit/mce.c                   |  1 +
 drivers/edac/i7core_edac.c                |  1 +
 drivers/edac/mce_amd.c                    |  8 +++--
 drivers/edac/mce_amd.h                    |  1 -
 drivers/edac/sb_edac.c                    |  3 +-
 drivers/edac/skx_edac.c                   |  3 +-
 17 files changed, 59 insertions(+), 93 deletions(-)

-- 
2.11.0


* [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:46   ` [tip:ras/core] x86/ras/inject: Make it depend on X86_LOCAL_APIC=y tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event Borislav Petkov
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

... and get rid of the annoying:

  arch/x86/kernel/cpu/mcheck/mce-inject.c:97:13: warning: ‘mce_irq_ipi’
  defined but not used [-Wunused-function]
   static void mce_irq_ipi(void *info

when doing randconfig builds.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/Kconfig                        | 2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c | 5 +----
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e487493bbd47..7b6fd68b4715 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1070,7 +1070,7 @@ config X86_MCE_THRESHOLD
 	def_bool y
 
 config X86_MCE_INJECT
-	depends on X86_MCE
+	depends on X86_MCE && X86_LOCAL_APIC
 	tristate "Machine check injector support"
 	---help---
 	  Provide support for injecting machine checks for testing purposes.
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
index 517619ea6498..99165b206df3 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
@@ -152,7 +152,6 @@ static void raise_mce(struct mce *m)
 	if (context == MCJ_CTX_RANDOM)
 		return;
 
-#ifdef CONFIG_X86_LOCAL_APIC
 	if (m->inject_flags & (MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST)) {
 		unsigned long start;
 		int cpu;
@@ -192,9 +191,7 @@ static void raise_mce(struct mce *m)
 		raise_local();
 		put_cpu();
 		put_online_cpus();
-	} else
-#endif
-	{
+	} else {
 		preempt_disable();
 		raise_local();
 		preempt_enable();
-- 
2.11.0


* [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
  2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:47   ` [tip:ras/core] x86/ras/therm_throt: Do not log a fake MCE for thermal events tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly Borislav Petkov
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And
besides we have other means for dealing with thermal events which are
much more suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ashok Raj <ashok.raj@intel.com>
Link: http://lkml.kernel.org/r/20170105213846.GA12024@gmail.com
---
 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
 3 files changed, 11 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a6c0a2..a09ed05725c2 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef43233e03..6eef6fde0f02 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8be009..85469f84c921 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -141,13 +141,8 @@ static struct attribute_group thermal_attr_group = {
  * IRQ has been acknowledged.
  *
  * It will take care of rate limiting and printing messages to the syslog.
- *
- * Returns: 0 : Event should NOT be further logged, i.e. still in
- *              "timeout" from previous log message.
- *          1 : Event should be logged further, and a message has been
- *              printed to the syslog.
  */
-static int therm_throt_process(bool new_event, int event, int level)
+static void therm_throt_process(bool new_event, int event, int level)
 {
 	struct _thermal_state *state;
 	unsigned int this_cpu = smp_processor_id();
@@ -162,16 +157,16 @@ static int therm_throt_process(bool new_event, int event, int level)
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->core_power_limit;
 		else
-			 return 0;
+			return;
 	} else if (level == PACKAGE_LEVEL) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			state = &pstate->package_throttle;
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->package_power_limit;
 		else
-			return 0;
+			return;
 	} else
-		return 0;
+		return;
 
 	old_event = state->new_event;
 	state->new_event = new_event;
@@ -181,7 +176,7 @@ static int therm_throt_process(bool new_event, int event, int level)
 
 	if (time_before64(now, state->next_check) &&
 			state->count != state->last_count)
-		return 0;
+		return;
 
 	state->next_check = now + CHECK_INTERVAL;
 	state->last_count = state->count;
@@ -193,16 +188,14 @@ static int therm_throt_process(bool new_event, int event, int level)
 				this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package",
 				state->count);
-		return 1;
+		return;
 	}
 	if (old_event) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package");
-		return 1;
+		return;
 	}
-
-	return 0;
 }
 
 static int thresh_event_valid(int level, int event)
@@ -365,10 +358,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,
-- 
2.11.0
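
For readers following the rate limiting that stays in therm_throt_process()
after this patch, here is a minimal user-space sketch of that logic. It is
not the kernel code: the struct layout, the CHECK_INTERVAL value, the
`messages` counter and the plain `<` comparison (instead of
time_before64()) are assumptions made for illustration, and the
"count++ on a new event" step is assumed from context because the hunks
above do not show it.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CHECK_INTERVAL 300	/* hypothetical: ticks between log messages */

struct thermal_state {
	bool          new_event;	/* was the last call an active event? */
	uint64_t      next_check;	/* earliest time the next message may appear */
	unsigned long count;		/* total events seen */
	unsigned long last_count;	/* count at the time of the last message */
	unsigned long messages;		/* illustration only: messages printed */
};

/* Like the patched kernel function, this returns nothing: it only logs. */
static void throttle_process(struct thermal_state *st, bool new_event,
			     uint64_t now)
{
	bool old_event = st->new_event;

	st->new_event = new_event;
	if (new_event)
		st->count++;

	/* Still inside the quiet period and new events arrived: stay silent. */
	if (now < st->next_check && st->count != st->last_count)
		return;

	st->next_check = now + CHECK_INTERVAL;
	st->last_count = st->count;

	if (new_event) {
		printf("temperature above threshold (total events = %lu)\n",
		       st->count);
		st->messages++;
		return;
	}
	if (old_event) {
		printf("temperature/speed normal\n");
		st->messages++;
	}
}
```

Note that in this sketch a "normal" transition inside the quiet period is
also suppressed when unlogged events are pending, which is why the event
counter exists as a separate source of truth.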


* [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
  2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
  2017-01-23 18:35 ` [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:47   ` [tip:ras/core] x86/ras/amd: " tip-bot for Yazen Ghannam
  2017-01-23 18:35 ` [PATCH 4/9] x86/MCE: Flip the TSC-adding logic Borislav Petkov
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Yazen Ghannam <Yazen.Ghannam@amd.com>

Currently, we append the MCA_IPID[InstanceId] to the bank name to create
the sysfs filename. The InstanceId field uniquely identifies a bank
instance but it doesn't look very nice for most banks.

Replace the InstanceId with a simpler, ascending (0, 1, ..) value.
Only use this in the sysfs name when there is more than 1 instance.
Otherwise, just use the bank's name as the sysfs name.

Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: x86-ml <x86@kernel.org>
Link: http://lkml.kernel.org/r/1484322741-41884-3-git-send-email-Yazen.Ghannam@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h           | 5 +++--
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 6 +++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index a09ed05725c2..528f6ec897cb 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -356,12 +356,13 @@ struct smca_hwid {
 	unsigned int bank_type;	/* Use with smca_bank_types for easy indexing. */
 	u32 hwid_mcatype;	/* (hwid,mcatype) tuple */
 	u32 xec_bitmap;		/* Bitmap of valid ExtErrorCodes; current max is 21. */
+	u8 count;		/* Number of instances. */
 };
 
 struct smca_bank {
 	struct smca_hwid *hwid;
-	/* Instance ID */
-	u32 id;
+	u32 id;			/* Value of MCA_IPID[InstanceId]. */
+	u8 sysfs_id;		/* Value used for sysfs name. */
 };
 
 extern struct smca_bank smca_banks[MAX_NR_BANKS];
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index a5fd137417a2..776379e4a39c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -192,6 +192,7 @@ static void get_smca_bank_info(unsigned int bank)
 
 			smca_banks[bank].hwid = s_hwid;
 			smca_banks[bank].id = instance_id;
+			smca_banks[bank].sysfs_id = s_hwid->count++;
 			break;
 		}
 	}
@@ -1064,9 +1065,12 @@ static const char *get_name(unsigned int bank, struct threshold_block *b)
 		return NULL;
 	}
 
+	if (smca_banks[bank].hwid->count == 1)
+		return smca_get_name(bank_type);
+
 	snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN,
 		 "%s_%x", smca_get_name(bank_type),
-			  smca_banks[bank].id);
+			  smca_banks[bank].sysfs_id);
 	return buf_mcatype;
 }
 
-- 
2.11.0
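
The naming rule in this patch can be tried out in user space with the
sketch below. It is only an illustration under stated assumptions: the
struct and function names (`hwid`, `bank`, `bank_register`,
`bank_sysfs_name`) and the buffer handling are hypothetical and simplified
from the hunks above; only the "ascending index, appended only when there
is more than one instance" behaviour is taken from the patch.

```c
#include <stdio.h>

struct hwid {
	const char   *name;	/* bank type name */
	unsigned char count;	/* number of instances seen so far */
};

struct bank {
	struct hwid  *hwid;
	unsigned int  id;	/* hardware instance ID, kept but not shown */
	unsigned char sysfs_id;	/* simple ascending index used in the name */
};

/* Record a bank instance and hand it the next ascending index. */
static void bank_register(struct bank *b, struct hwid *h,
			  unsigned int instance_id)
{
	b->hwid = h;
	b->id = instance_id;
	b->sysfs_id = h->count++;
}

/* Plain type name for a single instance, "<name>_<index>" otherwise. */
static const char *bank_sysfs_name(const struct bank *b, char *buf,
				   size_t len)
{
	if (b->hwid->count == 1)
		return b->hwid->name;

	snprintf(buf, len, "%s_%x", b->hwid->name,
		 (unsigned int)b->sysfs_id);
	return buf;
}
```

Because the decision depends on the final instance count, names should be
generated only after all banks have been registered, as the sketch assumes.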


* [PATCH 4/9] x86/MCE: Flip the TSC-adding logic
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (2 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:48   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency Borislav Petkov
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Add the TSC value to the MCE record only when the MCE being logged is
precise, i.e., it is logged as an exception or an MCE-related interrupt.

So it doesn't look particularly easy to do without touching/changing a
bunch of places. That's why I'm trying tricks first.

For example, the mce-apei.c case I'm addressing by setting ->tsc only
for errors of panic severity. The idea there is that panic errors will
have raised an #MC rather than having been found by polling.

And then instead of propagating a flag to mce_setup(), it seems
easier/less code to set ->tsc depending on the call sites, i.e.,
are we polling or are we preparing an MCE record in an exception
handler/thresholding interrupt.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/mce-apei.c |  5 ++++-
 arch/x86/kernel/cpu/mcheck/mce.c      | 12 +++---------
 arch/x86/kernel/cpu/mcheck/mce_amd.c  |  3 ++-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index 83f1a98d37db..2eee85379689 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -52,8 +52,11 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 
 	if (severity >= GHES_SEV_RECOVERABLE)
 		m.status |= MCI_STATUS_UC;
-	if (severity >= GHES_SEV_PANIC)
+
+	if (severity >= GHES_SEV_PANIC) {
 		m.status |= MCI_STATUS_PCC;
+		m.tsc = rdtsc();
+	}
 
 	m.addr = mem_err->physical_addr;
 	mce_log(&m);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6eef6fde0f02..ca15a7e1f97d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -128,7 +128,6 @@ void mce_setup(struct mce *m)
 {
 	memset(m, 0, sizeof(struct mce));
 	m->cpu = m->extcpu = smp_processor_id();
-	m->tsc = rdtsc();
 	/* We hope get_seconds stays lockless */
 	m->time = get_seconds();
 	m->cpuvendor = boot_cpu_data.x86_vendor;
@@ -710,14 +709,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 
 	mce_gather_info(&m, NULL);
 
-	/*
-	 * m.tsc was set in mce_setup(). Clear it if not requested.
-	 *
-	 * FIXME: Propagate @flags to mce_gather_info/mce_setup() to avoid
-	 *	  that dance.
-	 */
-	if (!(flags & MCP_TIMESTAMP))
-		m.tsc = 0;
+	if (flags & MCP_TIMESTAMP)
+		m.tsc = rdtsc();
 
 	for (i = 0; i < mca_cfg.banks; i++) {
 		if (!mce_banks[i].ctl || !test_bit(i, *b))
@@ -1156,6 +1149,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		goto out;
 
 	mce_gather_info(&m, regs);
+	m.tsc = rdtsc();
 
 	final = this_cpu_ptr(&mces_seen);
 	*final = m;
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 776379e4a39c..9e5427df3243 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -778,7 +778,8 @@ __log_error(unsigned int bank, bool deferred_err, bool threshold_err, u64 misc)
 	mce_setup(&m);
 
 	m.status = status;
-	m.bank = bank;
+	m.bank   = bank;
+	m.tsc	 = rdtsc();
 
 	if (threshold_err)
 		m.misc = misc;
-- 
2.11.0
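
The "flip" described in this patch, where the common setup helper no longer
stamps the record and each precise call site does so itself, can be
sketched in user space as follows. All names here (`record`,
`read_counter`, `poll_path`, `exception_path`, `POLL_TIMESTAMP`) are
hypothetical stand-ins, and the counter is a fake replacement for rdtsc()
so that the example runs anywhere.

```c
#include <stdint.h>
#include <string.h>

#define POLL_TIMESTAMP 0x1	/* stand-in for the MCP_TIMESTAMP flag */

struct record {
	uint64_t tsc;		/* 0 means "no precise timestamp" */
	uint64_t status;
};

static uint64_t fake_counter = 1000;

/* Stand-in for rdtsc(): a monotonically increasing, non-zero value. */
static uint64_t read_counter(void)
{
	return fake_counter++;
}

/* Common setup: zero the record and leave the timestamp unset. */
static void record_setup(struct record *r)
{
	memset(r, 0, sizeof(*r));
}

/* Polling is imprecise, so stamp only when the caller asks for it. */
static void poll_path(struct record *r, unsigned int flags)
{
	record_setup(r);
	if (flags & POLL_TIMESTAMP)
		r->tsc = read_counter();
}

/* An exception or interrupt handler runs at event time: always stamp. */
static void exception_path(struct record *r)
{
	record_setup(r);
	r->tsc = read_counter();
}
```

A consumer can then treat a zero timestamp as "not precise" and skip
printing it, which is the check a later patch in this series adds to the
decoder.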


* [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (3 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 4/9] x86/MCE: Flip the TSC-adding logic Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:48   ` [tip:ras/core] x86/ras/amd/inj: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce() Borislav Petkov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Change dependency to mce.c as we're using mce_inject_log() now to stick
an MCE into the MCA subsystem.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/ras/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig
index d957d5f21a86..0bc60a308730 100644
--- a/arch/x86/ras/Kconfig
+++ b/arch/x86/ras/Kconfig
@@ -1,6 +1,6 @@
 config MCE_AMD_INJ
 	tristate "Simple MCE injection interface for AMD processors"
-	depends on RAS && EDAC_DECODE_MCE && DEBUG_FS && AMD_NB
+	depends on RAS && X86_MCE && DEBUG_FS && AMD_NB
 	default n
 	help
 	  This is a simple debugfs interface to inject MCEs and test different
-- 
2.11.0


* [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce()
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (4 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:49   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 7/9] EDAC/mce_amd: Dump TSC value Borislav Petkov
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

It is not used outside of the driver anymore.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 drivers/edac/mce_amd.c | 4 ++--
 drivers/edac/mce_amd.h | 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 34208f38c5b1..5cd3c39bc695 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -942,7 +942,8 @@ static const char *decode_error_status(struct mce *m)
 	return "Corrected error, no action required.";
 }
 
-int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
+static int
+amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 {
 	struct mce *m = (struct mce *)data;
 	struct cpuinfo_x86 *c = &cpu_data(m->extcpu);
@@ -1047,7 +1048,6 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 	return NOTIFY_STOP;
 }
-EXPORT_SYMBOL_GPL(amd_decode_mce);
 
 static struct notifier_block amd_mce_dec_nb = {
 	.notifier_call	= amd_decode_mce,
diff --git a/drivers/edac/mce_amd.h b/drivers/edac/mce_amd.h
index c2359a1ea6b3..0b6a68673e0e 100644
--- a/drivers/edac/mce_amd.h
+++ b/drivers/edac/mce_amd.h
@@ -79,6 +79,5 @@ struct amd_decoder_ops {
 void amd_report_gart_errors(bool);
 void amd_register_ecc_decoder(void (*f)(int, struct mce *));
 void amd_unregister_ecc_decoder(void (*f)(int, struct mce *));
-int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data);
 
 #endif /* _EDAC_MCE_AMD_H */
-- 
2.11.0


* [PATCH 7/9] EDAC/mce_amd: Dump TSC value
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (5 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce() Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:50   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 8/9] x86/MCE: Get rid of mce_process_work() Borislav Petkov
  2017-01-23 18:35 ` [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority Borislav Petkov
  8 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Dump the TSC value of the time when the MCE got logged.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
 drivers/edac/mce_amd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 5cd3c39bc695..ecad750fd090 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1007,6 +1007,9 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 	} else
 		pr_cont("\n");
 
+	if (m->tsc)
+		pr_emerg(HW_ERR "TSC: %llu\n", m->tsc);
+
 	if (!fam_ops)
 		goto err_code;
 
-- 
2.11.0


* [PATCH 8/9] x86/MCE: Get rid of mce_process_work()
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (6 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 7/9] EDAC/mce_amd: Dump TSC value Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:50   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
  2017-01-23 18:35 ` [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority Borislav Petkov
  8 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Make mce_gen_pool_process() the workqueue function directly and save us
an indirection.

Signed-off-by: Borislav Petkov <bp@suse.de>
---
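Not part of the patch: a minimal userspace sketch of the change described
above, assuming a toy work item in place of the kernel's struct work_struct
and INIT_WORK(). All demo_* names are hypothetical. The point it illustrates
is that once the processing function has the callback's signature (taking
and ignoring the work pointer), it can be registered directly and the
one-line wrapper is no longer needed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct work_struct: just a callback pointer. */
struct demo_work;
typedef void (*demo_work_fn)(struct demo_work *);

struct demo_work {
	demo_work_fn fn;
};

/* Counts how often the processing function ran (for illustration only). */
static int demo_processed;

/*
 * The processing function takes the work pointer it does not need, so its
 * type matches demo_work_fn and no wrapper function is required.
 */
static void demo_gen_pool_process(struct demo_work *unused)
{
	(void)unused;
	demo_processed++;
}

/* Toy equivalent of INIT_WORK(): remember which function to run. */
static void demo_init_work(struct demo_work *w, demo_work_fn fn)
{
	w->fn = fn;
}

/* Toy equivalent of the workqueue running the item. */
static void demo_run_work(struct demo_work *w)
{
	assert(w && w->fn);
	w->fn(w);
}
```

Registering it is then a single call, demo_init_work(&w, demo_gen_pool_process),
which mirrors the INIT_WORK() hunk in the diff below.
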
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c          | 12 +-----------
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
index 93d824ec3120..1e5a50c11d3c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-genpool.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -72,7 +72,7 @@ struct llist_node *mce_gen_pool_prepare_records(void)
 	return new_head.first;
 }
 
-void mce_gen_pool_process(void)
+void mce_gen_pool_process(struct work_struct *__unused)
 {
 	struct llist_node *head;
 	struct mce_evt_llist *node, *tmp;
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index cd74a3f00aea..903043e6a62b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -31,7 +31,7 @@ struct mce_evt_llist {
 	struct mce mce;
 };
 
-void mce_gen_pool_process(void);
+void mce_gen_pool_process(struct work_struct *__unused);
 bool mce_gen_pool_empty(void);
 int mce_gen_pool_add(struct mce *mce);
 int mce_gen_pool_init(void);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ca15a7e1f97d..0fef5406f0eb 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1316,16 +1316,6 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 #endif
 
 /*
- * Action optional processing happens here (picking up
- * from the list of faulting pages that do_machine_check()
- * placed into the genpool).
- */
-static void mce_process_work(struct work_struct *dummy)
-{
-	mce_gen_pool_process();
-}
-
-/*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
  * errors, poll 2x slower (up to check_interval seconds).
@@ -2165,7 +2155,7 @@ int __init mcheck_init(void)
 	mce_register_decode_chain(&mce_default_nb);
 	mcheck_vendor_init_severity();
 
-	INIT_WORK(&mce_work, mce_process_work);
+	INIT_WORK(&mce_work, mce_gen_pool_process);
 	init_irq_work(&mce_irq_work, mce_irq_work_cb);
 
 	return 0;
-- 
2.11.0

* [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority
  2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
                   ` (7 preceding siblings ...)
  2017-01-23 18:35 ` [PATCH 8/9] x86/MCE: Get rid of mce_process_work() Borislav Petkov
@ 2017-01-23 18:35 ` Borislav Petkov
  2017-01-24  8:51   ` [tip:ras/core] x86/ras, " tip-bot for Borislav Petkov
  8 siblings, 1 reply; 37+ messages in thread
From: Borislav Petkov @ 2017-01-23 18:35 UTC (permalink / raw)
  To: X86 ML; +Cc: Tony Luck, Yazen Ghannam, linux-edac, LKML

From: Borislav Petkov <bp@suse.de>

Assign all notifiers on the MCE decode chain a priority so that they get
called in the correct order.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
---
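Not part of the patch: a minimal userspace sketch of how a priority-ordered
notifier chain yields a fixed call order, assuming a simple singly linked
list. The enum values are the ones this patch adds; struct demo_nb, struct
demo_log and the demo_* functions are hypothetical stand-ins and not the
kernel's notifier API. Unlike the real chain, this sketch does not stop
early when a handler asks for it.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Same values as the enum added to <asm/mce.h> by this patch. */
enum mce_notifier_prios {
	MCE_PRIO_SRAO	= INT_MAX,
	MCE_PRIO_EXTLOG	= INT_MAX - 1,
	MCE_PRIO_NFIT	= INT_MAX - 2,
	MCE_PRIO_EDAC	= INT_MAX - 3,
	MCE_PRIO_LOWEST	= 0,
};

/* Toy stand-in for struct notifier_block. */
struct demo_nb {
	int (*call)(struct demo_nb *nb, void *data);
	int priority;
	struct demo_nb *next;
};

/* Records the order in which handlers were invoked. */
struct demo_log {
	int prio[8];
	int n;
};

/* Example handler: append this notifier's priority to the log. */
static int demo_record(struct demo_nb *nb, void *data)
{
	struct demo_log *log = data;

	if (log->n < 8)
		log->prio[log->n++] = nb->priority;
	return 0;
}

/* Insert so the list stays sorted by descending priority. */
static void demo_chain_register(struct demo_nb **head, struct demo_nb *nb)
{
	assert(nb && nb->call);
	while (*head && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}

/* Call every handler, highest priority first; return how many ran. */
static int demo_chain_call(struct demo_nb *head, void *data)
{
	int called = 0;

	for (; head; head = head->next) {
		head->call(head, data);
		called++;
	}
	return called;
}
```

Registering handlers in any order (say, default first and SRAO last) still
runs them SRAO, EDAC, default, because the order is decided at insertion
time by the priority alone.
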
 arch/x86/include/asm/mce.h       | 9 +++++++++
 arch/x86/kernel/cpu/mcheck/mce.c | 8 +++-----
 drivers/acpi/acpi_extlog.c       | 1 +
 drivers/acpi/nfit/mce.c          | 1 +
 drivers/edac/i7core_edac.c       | 1 +
 drivers/edac/mce_amd.c           | 1 +
 drivers/edac/sb_edac.c           | 3 ++-
 drivers/edac/skx_edac.c          | 3 ++-
 8 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 528f6ec897cb..e63873683d4a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -189,6 +189,15 @@ extern struct mce_vendor_flags mce_flags;
 
 extern struct mca_config mca_cfg;
 extern struct mca_msr_regs msr_ops;
+
+enum mce_notifier_prios {
+	MCE_PRIO_SRAO		= INT_MAX,
+	MCE_PRIO_EXTLOG		= INT_MAX - 1,
+	MCE_PRIO_NFIT		= INT_MAX - 2,
+	MCE_PRIO_EDAC		= INT_MAX - 3,
+	MCE_PRIO_LOWEST		= 0,
+};
+
 extern void mce_register_decode_chain(struct notifier_block *nb);
 extern void mce_unregister_decode_chain(struct notifier_block *nb);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 0fef5406f0eb..e39bbc0e7c8b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -216,9 +216,7 @@ void mce_register_decode_chain(struct notifier_block *nb)
 {
 	atomic_inc(&num_notifiers);
 
-	/* Ensure SRAO notifier has the highest priority in the decode chain. */
-	if (nb != &mce_srao_nb && nb->priority == INT_MAX)
-		nb->priority -= 1;
+	WARN_ON(nb->priority > MCE_PRIO_LOWEST && nb->priority < MCE_PRIO_EDAC);
 
 	atomic_notifier_chain_register(&x86_mce_decoder_chain, nb);
 }
@@ -582,7 +580,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
 }
 static struct notifier_block mce_srao_nb = {
 	.notifier_call	= srao_decode_notifier,
-	.priority = INT_MAX,
+	.priority	= MCE_PRIO_SRAO,
 };
 
 static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
@@ -608,7 +606,7 @@ static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
 static struct notifier_block mce_default_nb = {
 	.notifier_call	= mce_default_notifier,
 	/* lowest prio, we want it to run last. */
-	.priority	= 0,
+	.priority	= MCE_PRIO_LOWEST,
 };
 
 /*
diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index b3842ffc19ba..a15270a806fc 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -212,6 +212,7 @@ static bool __init extlog_get_l1addr(void)
 }
 static struct notifier_block extlog_mce_dec = {
 	.notifier_call	= extlog_print,
+	.priority	= MCE_PRIO_EXTLOG,
 };
 
 static int __init extlog_init(void)
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index e5ce81c38eed..3ba1c3472cf9 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -90,6 +90,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 
 static struct notifier_block nfit_mce_dec = {
 	.notifier_call	= nfit_handle_mce,
+	.priority	= MCE_PRIO_NFIT,
 };
 
 void nfit_mce_register(void)
diff --git a/drivers/edac/i7core_edac.c b/drivers/edac/i7core_edac.c
index 69b5adead0ad..75ad847593b7 100644
--- a/drivers/edac/i7core_edac.c
+++ b/drivers/edac/i7core_edac.c
@@ -1835,6 +1835,7 @@ static int i7core_mce_check_error(struct notifier_block *nb, unsigned long val,
 
 static struct notifier_block i7_mce_dec = {
 	.notifier_call	= i7core_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 struct memdev_dmi_entry {
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index ecad750fd090..0d9bc25543d8 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1054,6 +1054,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 static struct notifier_block amd_mce_dec_nb = {
 	.notifier_call	= amd_decode_mce,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 static int __init mce_amd_init(void)
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 54ae6dc45ab2..c585a014dd3d 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -3136,7 +3136,8 @@ static int sbridge_mce_check_error(struct notifier_block *nb, unsigned long val,
 }
 
 static struct notifier_block sbridge_mce_dec = {
-	.notifier_call      = sbridge_mce_check_error,
+	.notifier_call	= sbridge_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 /****************************************************************************
diff --git a/drivers/edac/skx_edac.c b/drivers/edac/skx_edac.c
index 79ef675e4d6f..1159dba4671f 100644
--- a/drivers/edac/skx_edac.c
+++ b/drivers/edac/skx_edac.c
@@ -1007,7 +1007,8 @@ static int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
 }
 
 static struct notifier_block skx_mce_dec = {
-	.notifier_call = skx_mce_check_error,
+	.notifier_call	= skx_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 static void skx_remove(void)
-- 
2.11.0

* [tip:ras/core] x86/ras/inject: Make it depend on X86_LOCAL_APIC=y
  2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
@ 2017-01-24  8:46   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: bp, hpa, linux-kernel, peterz, torvalds, tglx, mingo,
	Yazen.Ghannam, linux-edac, tony.luck

Commit-ID:  d4b2ac63b0eae461fc10c9791084be24724ef57a
Gitweb:     http://git.kernel.org/tip/d4b2ac63b0eae461fc10c9791084be24724ef57a
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:06 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:52 +0100

x86/ras/inject: Make it depend on X86_LOCAL_APIC=y

... and get rid of the annoying:

  arch/x86/kernel/cpu/mcheck/mce-inject.c:97:13: warning: ‘mce_irq_ipi’ defined but not used [-Wunused-function]

when doing randconfig builds.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-2-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig                        | 2 +-
 arch/x86/kernel/cpu/mcheck/mce-inject.c | 5 +----
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e487493..7b6fd68 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1070,7 +1070,7 @@ config X86_MCE_THRESHOLD
 	def_bool y
 
 config X86_MCE_INJECT
-	depends on X86_MCE
+	depends on X86_MCE && X86_LOCAL_APIC
 	tristate "Machine check injector support"
 	---help---
 	  Provide support for injecting machine checks for testing purposes.
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
index 517619e..99165b2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
@@ -152,7 +152,6 @@ static void raise_mce(struct mce *m)
 	if (context == MCJ_CTX_RANDOM)
 		return;
 
-#ifdef CONFIG_X86_LOCAL_APIC
 	if (m->inject_flags & (MCJ_IRQ_BROADCAST | MCJ_NMI_BROADCAST)) {
 		unsigned long start;
 		int cpu;
@@ -192,9 +191,7 @@ static void raise_mce(struct mce *m)
 		raise_local();
 		put_cpu();
 		put_online_cpus();
-	} else
-#endif
-	{
+	} else {
 		preempt_disable();
 		raise_local();
 		preempt_enable();

* [tip:ras/core] x86/ras/therm_throt: Do not log a fake MCE for thermal events
  2017-01-23 18:35 ` [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event Borislav Petkov
@ 2017-01-24  8:47   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Yazen.Ghannam, torvalds, mingo, tglx, bp, linux-edac, ashok.raj,
	hpa, tony.luck, peterz, linux-kernel

Commit-ID:  9b052ea4ced0fa1ad30a2eafe86984a16297e6f1
Gitweb:     http://git.kernel.org/tip/9b052ea4ced0fa1ad30a2eafe86984a16297e6f1
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:07 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:53 +0100

x86/ras/therm_throt: Do not log a fake MCE for thermal events

We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And
besides we have other means for dealing with thermal events which are
much more suitable.

So let's kill the MCE logging part.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Ashok Raj <ashok.raj@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170105213846.GA12024@gmail.com
Link: http://lkml.kernel.org/r/20170123183514.13356-3-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mce.h               |  6 ------
 arch/x86/kernel/cpu/mcheck/mce.c         | 25 -------------------------
 arch/x86/kernel/cpu/mcheck/therm_throt.c | 30 +++++++++++-------------------
 3 files changed, 11 insertions(+), 50 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 5132f2a..a09ed05 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -97,10 +97,6 @@
 
 #define MCE_OVERFLOW 0		/* bit 0 in flags means overflow */
 
-/* Software defined banks */
-#define MCE_EXTENDED_BANK	128
-#define MCE_THERMAL_BANK	(MCE_EXTENDED_BANK + 0)
-
 #define MCE_LOG_LEN 32
 #define MCE_LOG_SIGNATURE	"MACHINECHECK"
 
@@ -306,8 +302,6 @@ extern void (*deferred_error_int_vector)(void);
 
 void intel_init_thermal(struct cpuinfo_x86 *c);
 
-void mce_log_therm_throt_event(__u64 status);
-
 /* Interrupt Handler for core thermal thresholds */
 extern int (*platform_thermal_notify)(__u64 msr_val);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 00ef432..6eef6fd 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,31 +1331,6 @@ static void mce_process_work(struct work_struct *dummy)
 	mce_gen_pool_process();
 }
 
-#ifdef CONFIG_X86_MCE_INTEL
-/***
- * mce_log_therm_throt_event - Logs the thermal throttling event to mcelog
- * @cpu: The CPU on which the event occurred.
- * @status: Event status information
- *
- * This function should be called by the thermal interrupt after the
- * event has been processed and the decision was made to log the event
- * further.
- *
- * The status parameter will be saved to the 'status' field of 'struct mce'
- * and historically has been the register value of the
- * MSR_IA32_THERMAL_STATUS (Intel) msr.
- */
-void mce_log_therm_throt_event(__u64 status)
-{
-	struct mce m;
-
-	mce_setup(&m);
-	m.bank = MCE_THERMAL_BANK;
-	m.status = status;
-	mce_log(&m);
-}
-#endif /* CONFIG_X86_MCE_INTEL */
-
 /*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
diff --git a/arch/x86/kernel/cpu/mcheck/therm_throt.c b/arch/x86/kernel/cpu/mcheck/therm_throt.c
index 465aca8..85469f8 100644
--- a/arch/x86/kernel/cpu/mcheck/therm_throt.c
+++ b/arch/x86/kernel/cpu/mcheck/therm_throt.c
@@ -6,7 +6,7 @@
  *
  * Maintains a counter in /sys that keeps track of the number of thermal
  * events, such that the user knows how bad the thermal problem might be
- * (since the logging to syslog and mcelog is rate limited).
+ * (since the logging to syslog is rate limited).
  *
  * Author: Dmitriy Zavin (dmitriyz@google.com)
  *
@@ -141,13 +141,8 @@ static struct attribute_group thermal_attr_group = {
  * IRQ has been acknowledged.
  *
  * It will take care of rate limiting and printing messages to the syslog.
- *
- * Returns: 0 : Event should NOT be further logged, i.e. still in
- *              "timeout" from previous log message.
- *          1 : Event should be logged further, and a message has been
- *              printed to the syslog.
  */
-static int therm_throt_process(bool new_event, int event, int level)
+static void therm_throt_process(bool new_event, int event, int level)
 {
 	struct _thermal_state *state;
 	unsigned int this_cpu = smp_processor_id();
@@ -162,16 +157,16 @@ static int therm_throt_process(bool new_event, int event, int level)
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->core_power_limit;
 		else
-			 return 0;
+			return;
 	} else if (level == PACKAGE_LEVEL) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			state = &pstate->package_throttle;
 		else if (event == POWER_LIMIT_EVENT)
 			state = &pstate->package_power_limit;
 		else
-			return 0;
+			return;
 	} else
-		return 0;
+		return;
 
 	old_event = state->new_event;
 	state->new_event = new_event;
@@ -181,7 +176,7 @@ static int therm_throt_process(bool new_event, int event, int level)
 
 	if (time_before64(now, state->next_check) &&
 			state->count != state->last_count)
-		return 0;
+		return;
 
 	state->next_check = now + CHECK_INTERVAL;
 	state->last_count = state->count;
@@ -193,16 +188,14 @@ static int therm_throt_process(bool new_event, int event, int level)
 				this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package",
 				state->count);
-		return 1;
+		return;
 	}
 	if (old_event) {
 		if (event == THERMAL_THROTTLING_EVENT)
 			pr_info("CPU%d: %s temperature/speed normal\n", this_cpu,
 				level == CORE_LEVEL ? "Core" : "Package");
-		return 1;
+		return;
 	}
-
-	return 0;
 }
 
 static int thresh_event_valid(int level, int event)
@@ -365,10 +358,9 @@ static void intel_thermal_interrupt(void)
 	/* Check for violation of core thermal thresholds*/
 	notify_thresholds(msr_val);
 
-	if (therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
-				THERMAL_THROTTLING_EVENT,
-				CORE_LEVEL) != 0)
-		mce_log_therm_throt_event(msr_val);
+	therm_throt_process(msr_val & THERM_STATUS_PROCHOT,
+			    THERMAL_THROTTLING_EVENT,
+			    CORE_LEVEL);
 
 	if (this_cpu_has(X86_FEATURE_PLN) && int_pln_enable)
 		therm_throt_process(msr_val & THERM_STATUS_POWER_LIMIT,

* [tip:ras/core] x86/ras/amd: Make sysfs names of banks more user-friendly
  2017-01-23 18:35 ` [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly Borislav Petkov
@ 2017-01-24  8:47   ` tip-bot for Yazen Ghannam
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Yazen Ghannam @ 2017-01-24  8:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tony.luck, Yazen.Ghannam, torvalds, peterz, mingo, linux-edac,
	bp, linux-kernel, tglx, hpa

Commit-ID:  0b737a9c2af85cc8295f9308d9250f9111bbf94d
Gitweb:     http://git.kernel.org/tip/0b737a9c2af85cc8295f9308d9250f9111bbf94d
Author:     Yazen Ghannam <Yazen.Ghannam@amd.com>
AuthorDate: Mon, 23 Jan 2017 19:35:08 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:53 +0100

x86/ras/amd: Make sysfs names of banks more user-friendly

Currently, we append the MCA_IPID[InstanceId] to the bank name to create
the sysfs filename. The InstanceId field uniquely identifies a bank
instance but it doesn't look very nice for most banks.

Replace the InstanceId with a simpler, ascending (0, 1, ..) value.
Only use this in the sysfs name when there is more than 1 instance.
Otherwise, just use the bank's name as the sysfs name.

Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1484322741-41884-3-git-send-email-Yazen.Ghannam@amd.com
Link: http://lkml.kernel.org/r/20170123183514.13356-4-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
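Not part of the commit: a minimal userspace sketch of the naming rule
described above, assuming toy versions of struct smca_hwid and struct
smca_bank. All demo_* names, and any bank names or instance IDs a caller
passes in, are hypothetical. It shows the two cases: a bank type with a
single instance uses the bare name, and a type with several instances gets
an ascending suffix instead of the raw InstanceId.

```c
#include <assert.h>
#include <stdio.h>

/* Toy stand-in for struct smca_hwid: a name plus an instance counter. */
struct demo_hwid {
	const char *name;
	unsigned char count;	/* Number of instances seen so far. */
};

/* Toy stand-in for struct smca_bank. */
struct demo_bank {
	struct demo_hwid *hwid;
	unsigned int id;	/* Raw instance ID, no longer used for naming. */
	unsigned char sysfs_id;	/* Ascending 0, 1, .. per bank type. */
};

/* Bind a bank to its type and hand out the next ascending sysfs ID. */
static void demo_bank_init(struct demo_bank *b, struct demo_hwid *h,
			   unsigned int instance_id)
{
	assert(b && h);
	b->hwid = h;
	b->id = instance_id;
	b->sysfs_id = h->count++;
}

/*
 * Return the sysfs name: the bare type name if there is only one
 * instance, otherwise "<name>_<sysfs_id>" written into buf.
 */
static const char *demo_get_name(const struct demo_bank *b,
				 char *buf, size_t len)
{
	if (b->hwid->count == 1)
		return b->hwid->name;

	snprintf(buf, len, "%s_%x", b->hwid->name,
		 (unsigned int)b->sysfs_id);
	return buf;
}
```

As in the patch, the name can only be decided after all banks have been
enumerated, since the count of a type is not final until then.
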
 arch/x86/include/asm/mce.h           | 5 +++--
 arch/x86/kernel/cpu/mcheck/mce_amd.c | 6 +++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index a09ed05..528f6ec 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -356,12 +356,13 @@ struct smca_hwid {
 	unsigned int bank_type;	/* Use with smca_bank_types for easy indexing. */
 	u32 hwid_mcatype;	/* (hwid,mcatype) tuple */
 	u32 xec_bitmap;		/* Bitmap of valid ExtErrorCodes; current max is 21. */
+	u8 count;		/* Number of instances. */
 };
 
 struct smca_bank {
 	struct smca_hwid *hwid;
-	/* Instance ID */
-	u32 id;
+	u32 id;			/* Value of MCA_IPID[InstanceId]. */
+	u8 sysfs_id;		/* Value used for sysfs name. */
 };
 
 extern struct smca_bank smca_banks[MAX_NR_BANKS];
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index a5fd137..776379e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -192,6 +192,7 @@ static void get_smca_bank_info(unsigned int bank)
 
 			smca_banks[bank].hwid = s_hwid;
 			smca_banks[bank].id = instance_id;
+			smca_banks[bank].sysfs_id = s_hwid->count++;
 			break;
 		}
 	}
@@ -1064,9 +1065,12 @@ static const char *get_name(unsigned int bank, struct threshold_block *b)
 		return NULL;
 	}
 
+	if (smca_banks[bank].hwid->count == 1)
+		return smca_get_name(bank_type);
+
 	snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN,
 		 "%s_%x", smca_get_name(bank_type),
-			  smca_banks[bank].id);
+			  smca_banks[bank].sysfs_id);
 	return buf_mcatype;
 }
 

* [tip:ras/core] x86/ras: Flip the TSC-adding logic
  2017-01-23 18:35 ` [PATCH 4/9] x86/MCE: Flip the TSC-adding logic Borislav Petkov
@ 2017-01-24  8:48   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tony.luck, peterz, Yazen.Ghannam, linux-edac, bp, linux-kernel,
	hpa, tglx, torvalds, mingo

Commit-ID:  669c00f09935fc7a22297eadee04536af141595b
Gitweb:     http://git.kernel.org/tip/669c00f09935fc7a22297eadee04536af141595b
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:09 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:54 +0100

x86/ras: Flip the TSC-adding logic

Add the TSC value to the MCE record only when the MCE being logged is
precise, i.e., it is logged as an exception or an MCE-related interrupt.

Doing this cleanly does not look particularly easy without touching a
bunch of places, so use a few tricks instead.

For example, the mce-apei.c case is addressed by setting ->tsc only for
errors of panic severity. The idea there is that panic errors will have
raised an #MC rather than having been found by polling.

And then instead of propagating a flag to mce_setup(), it seems
easier/less code to set ->tsc depending on the call sites, i.e.,
are we polling or are we preparing an MCE record in an exception
handler/thresholding interrupt.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-5-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
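Not part of the commit: a minimal userspace sketch of the "set ->tsc at the
call site" approach described above, assuming a toy record and a fake
counter in place of rdtsc(). All demo_* and DEMO_* names are hypothetical.
The setup helper only zeroes the record; the polling path stamps it only
when asked to, and the exception path always stamps it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for struct mce, reduced to the fields used here. */
struct demo_mce {
	uint64_t tsc;
	uint64_t status;
	int bank;
};

#define DEMO_MCP_TIMESTAMP	0x1u

/* Fake time source standing in for rdtsc(): increasing and never 0. */
static uint64_t demo_clock;

static uint64_t demo_rdtsc(void)
{
	return ++demo_clock;
}

/* Like mce_setup() after this change: zero the record, no timestamp. */
static void demo_setup(struct demo_mce *m)
{
	assert(m);
	memset(m, 0, sizeof(*m));
}

/* Polling path: the error may be old, so stamp only on request. */
static void demo_poll(struct demo_mce *m, unsigned int flags)
{
	demo_setup(m);
	if (flags & DEMO_MCP_TIMESTAMP)
		m->tsc = demo_rdtsc();
}

/* Exception path: the error is being handled right now, always stamp. */
static void demo_exception(struct demo_mce *m)
{
	demo_setup(m);
	m->tsc = demo_rdtsc();
}
```

A zero ->tsc then simply means "no precise timestamp available", which is
what a consumer can test before printing the value.
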
 arch/x86/kernel/cpu/mcheck/mce-apei.c |  5 ++++-
 arch/x86/kernel/cpu/mcheck/mce.c      | 12 +++---------
 arch/x86/kernel/cpu/mcheck/mce_amd.c  |  3 ++-
 3 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index 83f1a98..2eee853 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -52,8 +52,11 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
 
 	if (severity >= GHES_SEV_RECOVERABLE)
 		m.status |= MCI_STATUS_UC;
-	if (severity >= GHES_SEV_PANIC)
+
+	if (severity >= GHES_SEV_PANIC) {
 		m.status |= MCI_STATUS_PCC;
+		m.tsc = rdtsc();
+	}
 
 	m.addr = mem_err->physical_addr;
 	mce_log(&m);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 6eef6fd..ca15a7e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -128,7 +128,6 @@ void mce_setup(struct mce *m)
 {
 	memset(m, 0, sizeof(struct mce));
 	m->cpu = m->extcpu = smp_processor_id();
-	m->tsc = rdtsc();
 	/* We hope get_seconds stays lockless */
 	m->time = get_seconds();
 	m->cpuvendor = boot_cpu_data.x86_vendor;
@@ -710,14 +709,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 
 	mce_gather_info(&m, NULL);
 
-	/*
-	 * m.tsc was set in mce_setup(). Clear it if not requested.
-	 *
-	 * FIXME: Propagate @flags to mce_gather_info/mce_setup() to avoid
-	 *	  that dance.
-	 */
-	if (!(flags & MCP_TIMESTAMP))
-		m.tsc = 0;
+	if (flags & MCP_TIMESTAMP)
+		m.tsc = rdtsc();
 
 	for (i = 0; i < mca_cfg.banks; i++) {
 		if (!mce_banks[i].ctl || !test_bit(i, *b))
@@ -1156,6 +1149,7 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		goto out;
 
 	mce_gather_info(&m, regs);
+	m.tsc = rdtsc();
 
 	final = this_cpu_ptr(&mces_seen);
 	*final = m;
diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c b/arch/x86/kernel/cpu/mcheck/mce_amd.c
index 776379e..9e5427d 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
@@ -778,7 +778,8 @@ __log_error(unsigned int bank, bool deferred_err, bool threshold_err, u64 misc)
 	mce_setup(&m);
 
 	m.status = status;
-	m.bank = bank;
+	m.bank   = bank;
+	m.tsc	 = rdtsc();
 
 	if (threshold_err)
 		m.misc = misc;

* [tip:ras/core] x86/ras/amd/inj: Change dependency
  2017-01-23 18:35 ` [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency Borislav Petkov
@ 2017-01-24  8:48   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, linux-edac, hpa, tglx, mingo, Yazen.Ghannam,
	peterz, bp, tony.luck, torvalds

Commit-ID:  bd43f60a260c83cbc9befd7d710a3f2bfd3b2dd2
Gitweb:     http://git.kernel.org/tip/bd43f60a260c83cbc9befd7d710a3f2bfd3b2dd2
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:10 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:55 +0100

x86/ras/amd/inj: Change dependency

Change dependency to mce.c as we're using mce_inject_log() now to stick
an MCE into the MCA subsystem.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-6-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/ras/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig
index d957d5f..0bc60a3 100644
--- a/arch/x86/ras/Kconfig
+++ b/arch/x86/ras/Kconfig
@@ -1,6 +1,6 @@
 config MCE_AMD_INJ
 	tristate "Simple MCE injection interface for AMD processors"
-	depends on RAS && EDAC_DECODE_MCE && DEBUG_FS && AMD_NB
+	depends on RAS && X86_MCE && DEBUG_FS && AMD_NB
 	default n
 	help
 	  This is a simple debugfs interface to inject MCEs and test different

* [tip:ras/core] EDAC/mce/amd: Unexport amd_decode_mce()
  2017-01-23 18:35 ` [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce() Borislav Petkov
@ 2017-01-24  8:49   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, linux-edac, bp, mingo, torvalds, linux-kernel, tony.luck,
	peterz, hpa, Yazen.Ghannam

Commit-ID:  1fbcd909035251b5eac267f1c5d6d67b32d16b62
Gitweb:     http://git.kernel.org/tip/1fbcd909035251b5eac267f1c5d6d67b32d16b62
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:11 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:55 +0100

EDAC/mce/amd: Unexport amd_decode_mce()

It is not used outside of the driver anymore.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-7-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 drivers/edac/mce_amd.c | 4 ++--
 drivers/edac/mce_amd.h | 1 -
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 34208f3..5cd3c39 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -942,7 +942,8 @@ static const char *decode_error_status(struct mce *m)
 	return "Corrected error, no action required.";
 }
 
-int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
+static int
+amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 {
 	struct mce *m = (struct mce *)data;
 	struct cpuinfo_x86 *c = &cpu_data(m->extcpu);
@@ -1047,7 +1048,6 @@ int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 	return NOTIFY_STOP;
 }
-EXPORT_SYMBOL_GPL(amd_decode_mce);
 
 static struct notifier_block amd_mce_dec_nb = {
 	.notifier_call	= amd_decode_mce,
diff --git a/drivers/edac/mce_amd.h b/drivers/edac/mce_amd.h
index c2359a1..0b6a686 100644
--- a/drivers/edac/mce_amd.h
+++ b/drivers/edac/mce_amd.h
@@ -79,6 +79,5 @@ struct amd_decoder_ops {
 void amd_report_gart_errors(bool);
 void amd_register_ecc_decoder(void (*f)(int, struct mce *));
 void amd_unregister_ecc_decoder(void (*f)(int, struct mce *));
-int amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data);
 
 #endif /* _EDAC_MCE_AMD_H */

* [tip:ras/core] EDAC/mce/amd: Dump TSC value
  2017-01-23 18:35 ` [PATCH 7/9] EDAC/mce_amd: Dump TSC value Borislav Petkov
@ 2017-01-24  8:50   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, linux-edac, tony.luck, bp, torvalds, peterz, mingo,
	Yazen.Ghannam, tglx, linux-kernel

Commit-ID:  0bceab677dcef409f6281d922461057721d547b3
Gitweb:     http://git.kernel.org/tip/0bceab677dcef409f6281d922461057721d547b3
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:12 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:56 +0100

EDAC/mce/amd: Dump TSC value

Dump the TSC value of the time when the MCE got logged.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-8-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 drivers/edac/mce_amd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 5cd3c39..ecad750 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1007,6 +1007,9 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 	} else
 		pr_cont("\n");
 
+	if (m->tsc)
+		pr_emerg(HW_ERR "TSC: %llu\n", m->tsc);
+
 	if (!fam_ops)
 		goto err_code;
 

* [tip:ras/core] x86/ras: Get rid of mce_process_work()
  2017-01-23 18:35 ` [PATCH 8/9] x86/MCE: Get rid of mce_process_work() Borislav Petkov
@ 2017-01-24  8:50   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:50 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, tony.luck, linux-edac, linux-kernel, bp, peterz, torvalds,
	hpa, Yazen.Ghannam, mingo

Commit-ID:  cff4c0391a692cf9b89932c62a7f879fb3637148
Gitweb:     http://git.kernel.org/tip/cff4c0391a692cf9b89932c62a7f879fb3637148
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:13 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:56 +0100

x86/ras: Get rid of mce_process_work()

Make mce_gen_pool_process() the workqueue function directly and save us
an indirection.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-9-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
 arch/x86/kernel/cpu/mcheck/mce.c          | 12 +-----------
 3 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-genpool.c b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
index 93d824e..1e5a50c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-genpool.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-genpool.c
@@ -72,7 +72,7 @@ struct llist_node *mce_gen_pool_prepare_records(void)
 	return new_head.first;
 }
 
-void mce_gen_pool_process(void)
+void mce_gen_pool_process(struct work_struct *__unused)
 {
 	struct llist_node *head;
 	struct mce_evt_llist *node, *tmp;
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index cd74a3f..903043e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -31,7 +31,7 @@ struct mce_evt_llist {
 	struct mce mce;
 };
 
-void mce_gen_pool_process(void);
+void mce_gen_pool_process(struct work_struct *__unused);
 bool mce_gen_pool_empty(void);
 int mce_gen_pool_add(struct mce *mce);
 int mce_gen_pool_init(void);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index ca15a7e..0fef540 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1316,16 +1316,6 @@ int memory_failure(unsigned long pfn, int vector, int flags)
 #endif
 
 /*
- * Action optional processing happens here (picking up
- * from the list of faulting pages that do_machine_check()
- * placed into the genpool).
- */
-static void mce_process_work(struct work_struct *dummy)
-{
-	mce_gen_pool_process();
-}
-
-/*
  * Periodic polling timer for "silent" machine check errors.  If the
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
  * errors, poll 2x slower (up to check_interval seconds).
@@ -2165,7 +2155,7 @@ int __init mcheck_init(void)
 	mce_register_decode_chain(&mce_default_nb);
 	mcheck_vendor_init_severity();
 
-	INIT_WORK(&mce_work, mce_process_work);
+	INIT_WORK(&mce_work, mce_gen_pool_process);
 	init_irq_work(&mce_irq_work, mce_irq_work_cb);
 
 	return 0;
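
A rough userspace sketch of the idea behind this patch, not kernel code: a
work item stores a callback that takes a pointer to the work item, so a
function that already has that signature can be registered directly and the
one-line wrapper becomes unnecessary. All demo_* names below are hypothetical
and only stand in for struct work_struct and INIT_WORK().

```c
#include <assert.h>
#include <stddef.h>

struct demo_work;
typedef void (*demo_work_fn)(struct demo_work *);

/* Hypothetical stand-in for struct work_struct: it only holds the callback. */
struct demo_work {
	demo_work_fn func;
};

/* Counts how often the processing function has run. */
int demo_processed;

/*
 * Stand-in for mce_gen_pool_process() after the patch: it accepts the
 * work pointer it does not need, so it can be the callback itself.
 */
void demo_pool_process(struct demo_work *unused)
{
	(void)unused;
	demo_processed++;
}

/* Stand-in for INIT_WORK(): remember which function to call later. */
void demo_init_work(struct demo_work *w, demo_work_fn fn)
{
	assert(fn != NULL);
	w->func = fn;
}

/* Stand-in for the workqueue running the item. */
void demo_run_work(struct demo_work *w)
{
	w->func(w);
}
```

Calling demo_init_work(&w, demo_pool_process) followed by demo_run_work(&w)
reaches the processing function in a single call, with no wrapper in between.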

* [tip:ras/core] x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority
  2017-01-23 18:35 ` [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority Borislav Petkov
@ 2017-01-24  8:51   ` tip-bot for Borislav Petkov
  0 siblings, 0 replies; 37+ messages in thread
From: tip-bot for Borislav Petkov @ 2017-01-24  8:51 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, tony.luck, peterz, tglx, mingo, linux-kernel, hpa, bp,
	Yazen.Ghannam, linux-edac

Commit-ID:  9026cc82b632ed1a859935c82ed8ad65f27f2781
Gitweb:     http://git.kernel.org/tip/9026cc82b632ed1a859935c82ed8ad65f27f2781
Author:     Borislav Petkov <bp@suse.de>
AuthorDate: Mon, 23 Jan 2017 19:35:14 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 24 Jan 2017 09:14:57 +0100

x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority

Assign all notifiers on the MCE decode chain a priority so that they get
called in the correct order.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mce.h       | 9 +++++++++
 arch/x86/kernel/cpu/mcheck/mce.c | 8 +++-----
 drivers/acpi/acpi_extlog.c       | 1 +
 drivers/acpi/nfit/mce.c          | 1 +
 drivers/edac/i7core_edac.c       | 1 +
 drivers/edac/mce_amd.c           | 1 +
 drivers/edac/sb_edac.c           | 3 ++-
 drivers/edac/skx_edac.c          | 3 ++-
 8 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 528f6ec..e638736 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -189,6 +189,15 @@ extern struct mce_vendor_flags mce_flags;
 
 extern struct mca_config mca_cfg;
 extern struct mca_msr_regs msr_ops;
+
+enum mce_notifier_prios {
+	MCE_PRIO_SRAO		= INT_MAX,
+	MCE_PRIO_EXTLOG		= INT_MAX - 1,
+	MCE_PRIO_NFIT		= INT_MAX - 2,
+	MCE_PRIO_EDAC		= INT_MAX - 3,
+	MCE_PRIO_LOWEST		= 0,
+};
+
 extern void mce_register_decode_chain(struct notifier_block *nb);
 extern void mce_unregister_decode_chain(struct notifier_block *nb);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 0fef540..e39bbc0 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -216,9 +216,7 @@ void mce_register_decode_chain(struct notifier_block *nb)
 {
 	atomic_inc(&num_notifiers);
 
-	/* Ensure SRAO notifier has the highest priority in the decode chain. */
-	if (nb != &mce_srao_nb && nb->priority == INT_MAX)
-		nb->priority -= 1;
+	WARN_ON(nb->priority > MCE_PRIO_LOWEST && nb->priority < MCE_PRIO_EDAC);
 
 	atomic_notifier_chain_register(&x86_mce_decoder_chain, nb);
 }
@@ -582,7 +580,7 @@ static int srao_decode_notifier(struct notifier_block *nb, unsigned long val,
 }
 static struct notifier_block mce_srao_nb = {
 	.notifier_call	= srao_decode_notifier,
-	.priority = INT_MAX,
+	.priority	= MCE_PRIO_SRAO,
 };
 
 static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
@@ -608,7 +606,7 @@ static int mce_default_notifier(struct notifier_block *nb, unsigned long val,
 static struct notifier_block mce_default_nb = {
 	.notifier_call	= mce_default_notifier,
 	/* lowest prio, we want it to run last. */
-	.priority	= 0,
+	.priority	= MCE_PRIO_LOWEST,
 };
 
 /*
diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
index b3842ff..a15270a 100644
--- a/drivers/acpi/acpi_extlog.c
+++ b/drivers/acpi/acpi_extlog.c
@@ -212,6 +212,7 @@ static bool __init extlog_get_l1addr(void)
 }
 static struct notifier_block extlog_mce_dec = {
 	.notifier_call	= extlog_print,
+	.priority	= MCE_PRIO_EXTLOG,
 };
 
 static int __init extlog_init(void)
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index e5ce81c..3ba1c34 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -90,6 +90,7 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
 
 static struct notifier_block nfit_mce_dec = {
 	.notifier_call	= nfit_handle_mce,
+	.priority	= MCE_PRIO_NFIT,
 };
 
 void nfit_mce_register(void)
diff --git a/drivers/edac/i7core_edac.c b/drivers/edac/i7core_edac.c
index 69b5ade..75ad847 100644
--- a/drivers/edac/i7core_edac.c
+++ b/drivers/edac/i7core_edac.c
@@ -1835,6 +1835,7 @@ static int i7core_mce_check_error(struct notifier_block *nb, unsigned long val,
 
 static struct notifier_block i7_mce_dec = {
 	.notifier_call	= i7core_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 struct memdev_dmi_entry {
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index ecad750..0d9bc25 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -1054,6 +1054,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long val, void *data)
 
 static struct notifier_block amd_mce_dec_nb = {
 	.notifier_call	= amd_decode_mce,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 static int __init mce_amd_init(void)
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 54ae6dc..c585a01 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -3136,7 +3136,8 @@ static int sbridge_mce_check_error(struct notifier_block *nb, unsigned long val,
 }
 
 static struct notifier_block sbridge_mce_dec = {
-	.notifier_call      = sbridge_mce_check_error,
+	.notifier_call	= sbridge_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 /****************************************************************************
diff --git a/drivers/edac/skx_edac.c b/drivers/edac/skx_edac.c
index 79ef675..1159dba 100644
--- a/drivers/edac/skx_edac.c
+++ b/drivers/edac/skx_edac.c
@@ -1007,7 +1007,8 @@ static int skx_mce_check_error(struct notifier_block *nb, unsigned long val,
 }
 
 static struct notifier_block skx_mce_dec = {
-	.notifier_call = skx_mce_check_error,
+	.notifier_call	= skx_mce_check_error,
+	.priority	= MCE_PRIO_EDAC,
 };
 
 static void skx_remove(void)
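
A minimal userspace sketch of how priorities order a notifier chain, assuming
the usual behaviour that higher-priority entries are called first. The enum
values are copied from the hunk above. struct demo_nb, demo_chain_register()
and demo_prio_is_suspicious() are hypothetical names used for illustration
only; they are not the kernel's notifier API.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Priorities as added to asm/mce.h by this patch. */
enum mce_notifier_prios {
	MCE_PRIO_SRAO	= INT_MAX,
	MCE_PRIO_EXTLOG	= INT_MAX - 1,
	MCE_PRIO_NFIT	= INT_MAX - 2,
	MCE_PRIO_EDAC	= INT_MAX - 3,
	MCE_PRIO_LOWEST	= 0,
};

/* Hypothetical stand-in for struct notifier_block. */
struct demo_nb {
	const char *name;
	int priority;
	struct demo_nb *next;
};

/*
 * Insert nb so the list stays sorted by descending priority. An entry
 * with the same priority as an existing one is placed after it.
 */
void demo_chain_register(struct demo_nb **head, struct demo_nb *nb)
{
	assert(nb != NULL);
	while (*head && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}

/*
 * Same range test as the new WARN_ON(): true for a priority strictly
 * between MCE_PRIO_LOWEST and MCE_PRIO_EDAC, that is, a non-zero value
 * that is not one of the named high slots.
 */
int demo_prio_is_suspicious(int prio)
{
	return prio > MCE_PRIO_LOWEST && prio < MCE_PRIO_EDAC;
}
```

Registering entries with MCE_PRIO_LOWEST, MCE_PRIO_EDAC and MCE_PRIO_SRAO in
any order leaves the SRAO entry at the head of the list and the default entry
at the tail. Note that a priority of 0, which is what a notifier_block gets
when .priority is left unset, does not trip the range check.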

end of thread (newest message: ~2017-01-24  9:43 UTC)

Thread overview: 37+ messages
2017-01-23 18:35 [PATCH 0/9] x86/RAS: Queue for 4.11 Borislav Petkov
2017-01-23 18:35 ` [PATCH 1/9] x86/mce-inject: Make it depend on X86_LOCAL_APIC Borislav Petkov
2017-01-24  8:46   ` [tip:ras/core] x86/ras/inject: Make it depend on X86_LOCAL_APIC=y tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 2/9] x86/MCE/therm_throt: Do not log a fake MCE for a thermal event Borislav Petkov
2017-01-24  8:47   ` [tip:ras/core] x86/ras/therm_throt: Do not log a fake MCE for thermal events tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 3/9] x86/MCE/AMD: Make sysfs names of banks more user-friendly Borislav Petkov
2017-01-24  8:47   ` [tip:ras/core] x86/ras/amd: " tip-bot for Yazen Ghannam
2017-01-23 18:35 ` [PATCH 4/9] x86/MCE: Flip the TSC-adding logic Borislav Petkov
2017-01-24  8:48   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 5/9] x86/ras/mce_amd_inj: Change dependency Borislav Petkov
2017-01-24  8:48   ` [tip:ras/core] x86/ras/amd/inj: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 6/9] EDAC/mce_amd: Unexport amd_decode_mce() Borislav Petkov
2017-01-24  8:49   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 7/9] EDAC/mce_amd: Dump TSC value Borislav Petkov
2017-01-24  8:50   ` [tip:ras/core] EDAC/mce/amd: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 8/9] x86/MCE: Get rid of mce_process_work() Borislav Petkov
2017-01-24  8:50   ` [tip:ras/core] x86/ras: " tip-bot for Borislav Petkov
2017-01-23 18:35 ` [PATCH 9/9] x86/MCE, EDAC, acpi: Assign MCE notifier handlers a priority Borislav Petkov
2017-01-24  8:51   ` [tip:ras/core] x86/ras, " tip-bot for Borislav Petkov
  -- strict thread matches above, loose matches on Subject: below --
2017-01-05  5:00 Dell XPS13: MCE (Hardware Error) reported Daniel J Blueman
2017-01-05 14:05 ` Daniel J Blueman
2017-01-05 20:10   ` Alexander Alemayhu
2017-01-05 20:31     ` Borislav Petkov
2017-01-05 20:43       ` Raj, Ashok
2017-01-05 21:03         ` Pandruvada, Srinivas
2017-01-05 23:23           ` Alexander Alemayhu
2017-01-05 21:38       ` Alexander Alemayhu
2017-01-05 23:28       ` Raj, Ashok
2017-01-05 23:56         ` Borislav Petkov
2017-01-06  1:26           ` Raj, Ashok
2017-01-06 11:16             ` Borislav Petkov
2017-01-06 15:58               ` Raj, Ashok
2017-01-06 16:54                 ` Borislav Petkov
2017-01-06 17:04                   ` Raj, Ashok
2017-01-09 10:55                   ` Paul Menzel
2017-01-09 11:05                     ` Borislav Petkov
2017-01-09 11:11                       ` Paul Menzel