linux-kernel.vger.kernel.org archive mirror
* unknown NMI on AMD Rome
@ 2021-03-16 15:45 Jiri Olsa
  2021-03-16 16:02 ` Adam Borowski
  2021-03-16 19:53 ` Peter Zijlstra
  0 siblings, 2 replies; 9+ messages in thread
From: Jiri Olsa @ 2021-03-16 15:45 UTC (permalink / raw)
  To: Borislav Petkov, Tom Lendacky, Peter Zijlstra
  Cc: x86, lkml, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Stanislav Kozina, Michael Petlan, Pierre Amadio, onatalen,
	darcari

hi,
when running 'perf top' on AMD Rome (/proc/cpuinfo below)
with fedora 33 kernel 5.10.22-200.fc33.x86_64

we got unknown NMI messages:

[  226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
[  226.700162] Do you have a strange power saving mode enabled?
[  226.700163] Dazed and confused, but trying to continue
[  226.769565] Uhhuh. NMI received for unknown reason 3d on CPU 84.
[  226.769566] Do you have a strange power saving mode enabled?
[  226.769567] Dazed and confused, but trying to continue
[  226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
[  226.769773] Do you have a strange power saving mode enabled?
[  226.769774] Dazed and confused, but trying to continue
[  226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
[  226.812846] Do you have a strange power saving mode enabled?
[  226.812847] Dazed and confused, but trying to continue
[  226.893783] Uhhuh. NMI received for unknown reason 2d on CPU 27.
[  226.893785] Do you have a strange power saving mode enabled?
[  226.893786] Dazed and confused, but trying to continue
[  226.900139] Uhhuh. NMI received for unknown reason 2d on CPU 40.
[  226.900141] Do you have a strange power saving mode enabled?
[  226.900143] Dazed and confused, but trying to continue
[  226.908763] Uhhuh. NMI received for unknown reason 3d on CPU 120.
[  226.908765] Do you have a strange power saving mode enabled?
[  226.908766] Dazed and confused, but trying to continue
[  227.751296] Uhhuh. NMI received for unknown reason 2d on CPU 83.
[  227.751298] Do you have a strange power saving mode enabled?
[  227.751299] Dazed and confused, but trying to continue
[  227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
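
To see how the reports spread across reason bytes and CPUs, a log like the one above can be tallied with a small script. This is a minimal sketch: the function names are made up for illustration, and the SERR/IOCHK bit values are recalled from the kernel's x86 NMI headers, so treat them as an assumption and check them against your tree. Neither 2d nor 3d has one of those two bits set, which would be consistent with the kernel calling the reason unknown.

```python
import re
from collections import Counter

# Matches the first line of each three-line report, e.g.
# "[  226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90."
NMI_RE = re.compile(r"NMI received for unknown reason ([0-9a-f]{2}) on CPU (\d+)")

# Bits of the legacy NMI status port that name a known source.
# Assumption: values recalled from the kernel's x86 headers.
NMI_REASON_SERR = 0x80
NMI_REASON_IOCHK = 0x40


def tally_unknown_nmis(log_text):
    """Return (count per reason byte, count per CPU) for unknown-NMI lines."""
    by_reason, by_cpu = Counter(), Counter()
    for match in NMI_RE.finditer(log_text):
        by_reason[int(match.group(1), 16)] += 1
        by_cpu[int(match.group(2))] += 1
    return by_reason, by_cpu


def names_known_source(reason):
    """True if the reason byte has the SERR or IOCHK bit set."""
    return bool(reason & (NMI_REASON_SERR | NMI_REASON_IOCHK))


sample = """\
[  226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
[  226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
[  226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
[  227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
"""
by_reason, by_cpu = tally_unknown_nmis(sample)
for reason, count in sorted(by_reason.items()):
    known = names_known_source(reason)
    print(f"reason {reason:#04x}: {count} NMIs, known source bit set: {known}")
print("busiest CPU:", by_cpu.most_common(1)[0])
```

Feeding it the output of `dmesg` instead of the inline sample gives the same breakdown for a live system.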

also when discussing this with Borislav, he managed to reproduce easily
on his AMD Rome machine

any idea?

thanks,
jirka


---
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7742 64-Core Processor
stepping        : 0
microcode       : 0x8301034
cpu MHz         : 1497.024
cache size      : 512 KB
physical id     : 0
siblings        : 64
core id         : 0
cpu cores       : 64
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4491.76
TLB size        : 3072 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]



* Re: unknown NMI on AMD Rome
  2021-03-16 15:45 unknown NMI on AMD Rome Jiri Olsa
@ 2021-03-16 16:02 ` Adam Borowski
  2021-03-16 16:48   ` Alexander Monakov
  2021-03-16 19:53 ` Peter Zijlstra
  1 sibling, 1 reply; 9+ messages in thread
From: Adam Borowski @ 2021-03-16 16:02 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Borislav Petkov, Tom Lendacky, Peter Zijlstra, x86, lkml,
	Alexander Shishkin, Arnaldo Carvalho de Melo, Stanislav Kozina,
	Michael Petlan, Pierre Amadio, onatalen, darcari

On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
> hi,
> when running 'perf top' on AMD Rome (/proc/cpuinfo below)
> with fedora 33 kernel 5.10.22-200.fc33.x86_64
> 
> we got unknown NMI messages:
> 
> [  226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
> [  226.700162] Do you have a strange power saving mode enabled?
> [  226.700163] Dazed and confused, but trying to continue
> 
> also when discussing this with Borislav, he managed to reproduce easily
> on his AMD Rome machine

Likewise, 3c on Pinnacle Ridge.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ in the beginning was the boot and root floppies and they were good.
⢿⡄⠘⠷⠚⠋⠀                                       -- <willmore> on #linux-sunxi
⠈⠳⣄⠀⠀⠀⠀


* Re: unknown NMI on AMD Rome
  2021-03-16 16:02 ` Adam Borowski
@ 2021-03-16 16:48   ` Alexander Monakov
  0 siblings, 0 replies; 9+ messages in thread
From: Alexander Monakov @ 2021-03-16 16:48 UTC (permalink / raw)
  To: Adam Borowski
  Cc: Jiri Olsa, Borislav Petkov, Tom Lendacky, Peter Zijlstra, x86,
	lkml, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Stanislav Kozina, Michael Petlan, Pierre Amadio, onatalen,
	darcari

On Tue, 16 Mar 2021, Adam Borowski wrote:

> On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
> > hi,
> > when running 'perf top' on AMD Rome (/proc/cpuinfo below)
> > with fedora 33 kernel 5.10.22-200.fc33.x86_64
> > 
> > we got unknown NMI messages:
> > 
> > [  226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
> > [  226.700162] Do you have a strange power saving mode enabled?
> > [  226.700163] Dazed and confused, but trying to continue
> > 
> > also when discussing this with Borislav, he managed to reproduce easily
> > on his AMD Rome machine
> 
> Likewise, 3c on Pinnacle Ridge.

I've also seen it on Renoir, and it appears related to PMU interrupt racing
against C-state entry/exit. Disabling C2 and C3 via 'cpupower' is enough to
avoid those NMIs in my case.
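
For reference, the same effect can be had without 'cpupower' by writing to the cpuidle sysfs files directly. The following is a minimal sketch, assuming the standard `/sys/devices/system/cpu/cpuN/cpuidle/stateM/{name,disable}` layout; the helper names are hypothetical, the state names ("C2", "C3") depend on the idle driver in use, and writing the files needs root.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def idle_states(cpu_root=CPU_ROOT):
    """Yield (cpu_dir_name, state_dir, state_name) for every cpuidle state."""
    for cpu_dir in sorted(cpu_root.glob("cpu[0-9]*")):
        for state_dir in sorted((cpu_dir / "cpuidle").glob("state[0-9]*")):
            name = (state_dir / "name").read_text().strip()
            yield cpu_dir.name, state_dir, name


def set_states_disabled(names, disabled, cpu_root=CPU_ROOT):
    """Write 1 or 0 to the 'disable' file of each state whose name is in names.

    Returns the number of files written. Needs root on a real system.
    """
    written = 0
    for _cpu, state_dir, name in idle_states(cpu_root):
        if name in names:
            (state_dir / "disable").write_text("1" if disabled else "0")
            written += 1
    return written
```

Calling `set_states_disabled({"C2", "C3"}, True)` before profiling and `set_states_disabled({"C2", "C3"}, False)` afterwards mirrors the workaround described above. The setting is not persistent across reboots.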

IIRC there were a few patches related to this area from AMD in the past.

Alexander


* Re: unknown NMI on AMD Rome
  2021-03-16 15:45 unknown NMI on AMD Rome Jiri Olsa
  2021-03-16 16:02 ` Adam Borowski
@ 2021-03-16 19:53 ` Peter Zijlstra
  2021-03-16 20:02   ` Kim Phillips
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2021-03-16 19:53 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Borislav Petkov, Tom Lendacky, x86, lkml, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Stanislav Kozina, Michael Petlan,
	Pierre Amadio, onatalen, darcari, kim.phillips

On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
> hi,
> when running 'perf top' on AMD Rome (/proc/cpuinfo below)
> with fedora 33 kernel 5.10.22-200.fc33.x86_64
> 
> we got unknown NMI messages:
> 
> [  226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
> [  226.700162] Do you have a strange power saving mode enabled?
> [  226.700163] Dazed and confused, but trying to continue
> [  226.769565] Uhhuh. NMI received for unknown reason 3d on CPU 84.
> [  226.769566] Do you have a strange power saving mode enabled?
> [  226.769567] Dazed and confused, but trying to continue
> [  226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
> [  226.769773] Do you have a strange power saving mode enabled?
> [  226.769774] Dazed and confused, but trying to continue
> [  226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
> [  226.812846] Do you have a strange power saving mode enabled?
> [  226.812847] Dazed and confused, but trying to continue
> [  226.893783] Uhhuh. NMI received for unknown reason 2d on CPU 27.
> [  226.893785] Do you have a strange power saving mode enabled?
> [  226.893786] Dazed and confused, but trying to continue
> [  226.900139] Uhhuh. NMI received for unknown reason 2d on CPU 40.
> [  226.900141] Do you have a strange power saving mode enabled?
> [  226.900143] Dazed and confused, but trying to continue
> [  226.908763] Uhhuh. NMI received for unknown reason 3d on CPU 120.
> [  226.908765] Do you have a strange power saving mode enabled?
> [  226.908766] Dazed and confused, but trying to continue
> [  227.751296] Uhhuh. NMI received for unknown reason 2d on CPU 83.
> [  227.751298] Do you have a strange power saving mode enabled?
> [  227.751299] Dazed and confused, but trying to continue
> [  227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
> 
> also when discussing this with Borislav, he managed to reproduce easily
> on his AMD Rome machine
> 
> any idea?

Kim is the AMD point person for this I think..

> 
> thanks,
> jirka
> 
> 
> ---
> processor       : 0
> vendor_id       : AuthenticAMD
> cpu family      : 23
> model           : 49
> model name      : AMD EPYC 7742 64-Core Processor
> stepping        : 0
> microcode       : 0x8301034
> cpu MHz         : 1497.024
> cache size      : 512 KB
> physical id     : 0
> siblings        : 64
> core id         : 0
> cpu cores       : 64
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 16
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
> bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
> bogomips        : 4491.76
> TLB size        : 3072 4K pages
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 43 bits physical, 48 bits virtual
> power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
> 


* Re: unknown NMI on AMD Rome
  2021-03-16 19:53 ` Peter Zijlstra
@ 2021-03-16 20:02   ` Kim Phillips
  2021-03-17  8:48     ` Ingo Molnar
  0 siblings, 1 reply; 9+ messages in thread
From: Kim Phillips @ 2021-03-16 20:02 UTC (permalink / raw)
  To: Peter Zijlstra, Jiri Olsa
  Cc: Borislav Petkov, Tom Lendacky, x86, lkml, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Stanislav Kozina, Michael Petlan,
	Pierre Amadio, onatalen, darcari

On 3/16/21 2:53 PM, Peter Zijlstra wrote:
> On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
>> hi,
>> when running 'perf top' on AMD Rome (/proc/cpuinfo below)
>> with fedora 33 kernel 5.10.22-200.fc33.x86_64
>>
>> we got unknown NMI messages:
>>
>> [  226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
>> [  226.700162] Do you have a strange power saving mode enabled?
>> [  226.700163] Dazed and confused, but trying to continue
>> [  226.769565] Uhhuh. NMI received for unknown reason 3d on CPU 84.
>> [  226.769566] Do you have a strange power saving mode enabled?
>> [  226.769567] Dazed and confused, but trying to continue
>> [  226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
>> [  226.769773] Do you have a strange power saving mode enabled?
>> [  226.769774] Dazed and confused, but trying to continue
>> [  226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
>> [  226.812846] Do you have a strange power saving mode enabled?
>> [  226.812847] Dazed and confused, but trying to continue
>> [  226.893783] Uhhuh. NMI received for unknown reason 2d on CPU 27.
>> [  226.893785] Do you have a strange power saving mode enabled?
>> [  226.893786] Dazed and confused, but trying to continue
>> [  226.900139] Uhhuh. NMI received for unknown reason 2d on CPU 40.
>> [  226.900141] Do you have a strange power saving mode enabled?
>> [  226.900143] Dazed and confused, but trying to continue
>> [  226.908763] Uhhuh. NMI received for unknown reason 3d on CPU 120.
>> [  226.908765] Do you have a strange power saving mode enabled?
>> [  226.908766] Dazed and confused, but trying to continue
>> [  227.751296] Uhhuh. NMI received for unknown reason 2d on CPU 83.
>> [  227.751298] Do you have a strange power saving mode enabled?
>> [  227.751299] Dazed and confused, but trying to continue
>> [  227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
>>
>> also when discussing this with Borislav, he managed to reproduce easily
>> on his AMD Rome machine
>>
>> any idea?
> 
> Kim is the AMD point person for this I think..

Since perf top requests a precise event, which on AMD means IBS,
this looks like it's hitting erratum #1215:

https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

Kim


* Re: unknown NMI on AMD Rome
  2021-03-16 20:02   ` Kim Phillips
@ 2021-03-17  8:48     ` Ingo Molnar
  2021-03-17 10:13       ` Peter Zijlstra
  0 siblings, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2021-03-17  8:48 UTC (permalink / raw)
  To: Kim Phillips
  Cc: Peter Zijlstra, Jiri Olsa, Borislav Petkov, Tom Lendacky, x86,
	lkml, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Stanislav Kozina, Michael Petlan, Pierre Amadio, onatalen,
	darcari


* Kim Phillips <kim.phillips@amd.com> wrote:

> On 3/16/21 2:53 PM, Peter Zijlstra wrote:
> > On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
> >> hi,
> >> when running 'perf top' on AMD Rome (/proc/cpuinfo below)
> >> with fedora 33 kernel 5.10.22-200.fc33.x86_64
> >>
> >> we got unknown NMI messages:
> >>
> >> [  226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
> >> [  226.700162] Do you have a strange power saving mode enabled?
> >> [  226.700163] Dazed and confused, but trying to continue
> >> [  226.769565] Uhhuh. NMI received for unknown reason 3d on CPU 84.
> >> [  226.769566] Do you have a strange power saving mode enabled?
> >> [  226.769567] Dazed and confused, but trying to continue
> >> [  226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
> >> [  226.769773] Do you have a strange power saving mode enabled?
> >> [  226.769774] Dazed and confused, but trying to continue
> >> [  226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
> >> [  226.812846] Do you have a strange power saving mode enabled?
> >> [  226.812847] Dazed and confused, but trying to continue
> >> [  226.893783] Uhhuh. NMI received for unknown reason 2d on CPU 27.
> >> [  226.893785] Do you have a strange power saving mode enabled?
> >> [  226.893786] Dazed and confused, but trying to continue
> >> [  226.900139] Uhhuh. NMI received for unknown reason 2d on CPU 40.
> >> [  226.900141] Do you have a strange power saving mode enabled?
> >> [  226.900143] Dazed and confused, but trying to continue
> >> [  226.908763] Uhhuh. NMI received for unknown reason 3d on CPU 120.
> >> [  226.908765] Do you have a strange power saving mode enabled?
> >> [  226.908766] Dazed and confused, but trying to continue
> >> [  227.751296] Uhhuh. NMI received for unknown reason 2d on CPU 83.
> >> [  227.751298] Do you have a strange power saving mode enabled?
> >> [  227.751299] Dazed and confused, but trying to continue
> >> [  227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
> >>
> >> also when discussing this with Borislav, he managed to reproduce easily
> >> on his AMD Rome machine
> >>
> >> any idea?
> > 
> > Kim is the AMD point person for this I think..
> 
> Since perf top invokes precision and therefore IBS,
> this looks like it's hitting erratum #1215:
> 
> https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

So:


  1215 IBS (Instruction Based Sampling) Counter Valid Value
  May be Incorrect After Exit From Core C6 (CC6) State

  Description

  If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
  Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
  issued, but an invalid value of the valid bit may be restored when the core exits CC6.

  Potential Effect on System

  The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe a
  valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
  Linux systems.

  Suggested Workaround: None
  Fix Planned: No fix planned

lovely.

Thanks,

	Ingo


* Re: unknown NMI on AMD Rome
  2021-03-17  8:48     ` Ingo Molnar
@ 2021-03-17 10:13       ` Peter Zijlstra
  2021-03-17 13:32         ` Alexander Monakov
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2021-03-17 10:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Kim Phillips, Jiri Olsa, Borislav Petkov, Tom Lendacky, x86,
	lkml, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Stanislav Kozina, Michael Petlan, Pierre Amadio, onatalen,
	darcari, Rafael J. Wysocki

On Wed, Mar 17, 2021 at 09:48:29AM +0100, Ingo Molnar wrote:
> > https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> 
> So:
> 
> 
>   1215 IBS (Instruction Based Sampling) Counter Valid Value
>   May be Incorrect After Exit From Core C6 (CC6) State
> 
>   Description
> 
>   If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
>   Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
>   issued, but an invalid value of the valid bit may be restored when the core exits CC6.
> 
>   Potential Effect on System
> 
>   The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe a
>   valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
>   Linux systems.
> 
>   Suggested Workaround: None
>   Fix Planned: No fix planned

Should be simple enough to disable CC6 while IBS is in use. Kim, can you
please make that happen?
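
Until something like that exists in the kernel, a userspace stand-in is to hold a PM QoS latency request open while the profiler runs. The following is a minimal sketch and not the in-kernel change requested above: it assumes the documented `/dev/cpu_dma_latency` interface, `latency_cap` is a hypothetical helper name, and the right latency value depends on the exit latencies your platform reports for its idle states.

```python
import os
import struct
from contextlib import contextmanager

PM_QOS_DEVICE = "/dev/cpu_dma_latency"


@contextmanager
def latency_cap(max_latency_us, device=PM_QOS_DEVICE):
    """Hold a PM QoS CPU latency request for the duration of the block.

    The request lasts only while the file descriptor stays open, so it
    is dropped automatically if the process exits or is killed.
    """
    fd = os.open(device, os.O_WRONLY)
    try:
        # The device accepts a native 32-bit signed integer, in microseconds.
        os.write(fd, struct.pack("i", max_latency_us))
        yield
    finally:
        os.close(fd)
```

Running the profiler inside `with latency_cap(0):` (as root) asks cpuidle to avoid any state with a non-zero exit latency; a larger value would keep the shallow states available while still excluding the deepest ones.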


* Re: unknown NMI on AMD Rome
  2021-03-17 10:13       ` Peter Zijlstra
@ 2021-03-17 13:32         ` Alexander Monakov
  2021-03-17 13:37           ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 9+ messages in thread
From: Alexander Monakov @ 2021-03-17 13:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Kim Phillips, Jiri Olsa, Borislav Petkov,
	Tom Lendacky, x86, lkml, Alexander Shishkin,
	Arnaldo Carvalho de Melo, Stanislav Kozina, Michael Petlan,
	Pierre Amadio, onatalen, darcari, Rafael J. Wysocki

On Wed, 17 Mar 2021, Peter Zijlstra wrote:

> On Wed, Mar 17, 2021 at 09:48:29AM +0100, Ingo Molnar wrote:
> > > https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> > 
> > So:
> > 
> > 
> >   1215 IBS (Instruction Based Sampling) Counter Valid Value
> >   May be Incorrect After Exit From Core C6 (CC6) State
> > 
> >   Description
> > 
> >   If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
> >   Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
> >   issued, but an invalid value of the valid bit may be restored when the core exits CC6.
> > 
> >   Potential Effect on System
> > 
> >   The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe a
> >   valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
> >   Linux systems.
> > 
> >   Suggested Workaround: None
> >   Fix Planned: No fix planned
> 
> Should be simple enough to disable CC6 while IBS is in use. Kim, can you
> please make that happen?

Wouldn't that "magically" significantly speed up workloads running under
'perf top', in case they don't saturate the CPUs? Scheduling gets
much snappier if the target CPU doesn't need to wake up from deep sleep :)

Alternatively, would you consider adding the erratum reference to the
printk message when IBS is in use, and rate-limit it so it doesn't
flood dmesg? Then the user will know what's going on, and may
choose to temporarily disable C-states using the 'cpupower' tool.
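
To make the idea concrete, here is a userspace model of a rate-limited report that appends an erratum hint when IBS is active. It is only a sketch of the behaviour, not kernel code: the `RateLimit` class, the interval and burst values, and the hint wording are all made up for illustration.

```python
import time


class RateLimit:
    """Allow at most `burst` events per `interval` seconds, counting the rest."""

    def __init__(self, interval, burst, clock=time.monotonic):
        self.interval, self.burst, self.clock = interval, burst, clock
        self.window_start = None
        self.printed = 0
        self.missed = 0

    def allow(self):
        now = self.clock()
        # Start a fresh window on first use or once the interval has passed.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start, self.printed = now, 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.missed += 1
        return False


def report_unknown_nmi(limit, reason, cpu, ibs_active, out=print):
    """Emit the report unless rate-limited; add a hint if IBS is in use."""
    if not limit.allow():
        return
    out(f"NMI received for unknown reason {reason:02x} on CPU {cpu}.")
    if ibs_active:
        # Hypothetical wording, not an existing kernel message.
        out("IBS is in use: possibly AMD erratum #1215, consider disabling deep C-states.")


limit = RateLimit(interval=5.0, burst=2)
for cpu in (90, 84, 24, 23, 27):
    report_unknown_nmi(limit, 0x3D, cpu, ibs_active=True)
print(f"{limit.missed} reports suppressed")
```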

Alexander


* Re: unknown NMI on AMD Rome
  2021-03-17 13:32         ` Alexander Monakov
@ 2021-03-17 13:37           ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 9+ messages in thread
From: Arnaldo Carvalho de Melo @ 2021-03-17 13:37 UTC (permalink / raw)
  To: Alexander Monakov
  Cc: Peter Zijlstra, Ingo Molnar, Kim Phillips, Jiri Olsa,
	Borislav Petkov, Tom Lendacky, x86, lkml, Alexander Shishkin,
	Stanislav Kozina, Michael Petlan, Pierre Amadio, onatalen,
	darcari, Rafael J. Wysocki

Em Wed, Mar 17, 2021 at 04:32:17PM +0300, Alexander Monakov escreveu:
> On Wed, 17 Mar 2021, Peter Zijlstra wrote:
> > On Wed, Mar 17, 2021 at 09:48:29AM +0100, Ingo Molnar wrote:
> > > > https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

> > >   1215 IBS (Instruction Based Sampling) Counter Valid Value
> > >   May be Incorrect After Exit From Core C6 (CC6) State

> > >   Description

> > >   If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
> > >   Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
> > >   issued, but an invalid value of the valid bit may be restored when the core exits CC6.

> > >   Potential Effect on System

> > >   The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe a
> > >   valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
> > >   Linux systems.

> > >   Suggested Workaround: None
> > >   Fix Planned: No fix planned

> > Should be simple enough to disable CC6 while IBS is in use. Kim, can you
> > please make that happen?

> Wouldn't that "magically" significantly speed up workloads running under
> 'perf top', in case they don't saturate the CPUs? Scheduling gets
> much snappier if the target CPU doesn't need to wake up from deep sleep :)

> Alternatively, would you consider adding the errata reference to the
> printk message when IBS is in use, and rate-limit it so it doesn't
> flood dmesg? Then the user will know what's going on, and may
> choose to temporarily disable C-states using the 'cpupower' tool.

It would also be interesting to make 'perf top' detect this somehow
(by looking at some cpu id, etc.) and either not use IBS when C-states are
in use or warn the user about the situation, i.e. that cycles:P can't be
used on this machine if C-states are enabled?
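
As a rough illustration of such a check, the following sketch looks at /proc/cpuinfo fields like the ones posted at the start of the thread. It is a heuristic only: the function names are hypothetical, and the match on AMD family 23 (17h) with the 'ibs' flag is an assumption based on the parts reported in this thread (Rome, Pinnacle Ridge, Renoir), not on the erratum's list of affected models.

```python
def parse_first_cpu(cpuinfo_text):
    """Return the key/value fields of the first processor block."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def may_hit_ibs_nmi_erratum(fields):
    """Heuristic: AMD family 23 (17h) CPU that advertises IBS."""
    return (
        fields.get("vendor_id") == "AuthenticAMD"
        and fields.get("cpu family") == "23"
        and "ibs" in fields.get("flags", "").split()
    )


sample = """\
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7742 64-Core Processor
flags           : fpu vme de pse tsc msr osvw ibs skinit wdt

processor       : 1
vendor_id       : AuthenticAMD
"""
fields = parse_first_cpu(sample)
if may_hit_ibs_nmi_erratum(fields):
    print(f"{fields['model name']}: precise sampling may trigger unknown NMIs "
          "while deep C-states are enabled")
```

On a real system the input would be the contents of /proc/cpuinfo, and a tool could combine this with a look at the enabled idle states before deciding whether to warn.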

- Arnaldo
