linux-kernel.vger.kernel.org archive mirror
* [PATCH] perf/x86/intel: Fix spurious NMI on fixed counter
@ 2019-06-25 14:21 kan.liang
  2019-06-25 14:58 ` Jiri Olsa
  2019-07-13 11:13 ` [tip:perf/urgent] " tip-bot for Kan Liang
  0 siblings, 2 replies; 6+ messages in thread
From: kan.liang @ 2019-06-25 14:21 UTC (permalink / raw)
  To: mingo, jolsa, peterz, linux-kernel; +Cc: ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

If a user first samples a PEBS event on a fixed counter and then samples a
non-PEBS event on the same fixed counter on Icelake, it will trigger a
spurious NMI. For example:

  perf record -e 'cycles:p' -a
  perf record -e 'cycles' -a

The error message for the spurious NMI:

  [June 21 15:38] Uhhuh. NMI received for unknown reason 30 on CPU 2.
  [  +0.000000] Do you have a strange power saving mode enabled?
  [  +0.000000] Dazed and confused, but trying to continue

The issue was introduced by the following commit:

  commit 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")

The commit moves the intel_pmu_pebs_disable() call to after
intel_pmu_disable_fixed(), but the fixed-counter path in
intel_pmu_disable_event() returns immediately at that point, so the
related bit in the PEBS_ENABLE MSR is never cleared for the fixed
counter. When a non-PEBS event later runs on the same fixed counter,
the stale bit in PEBS_ENABLE is still set, which triggers spurious NMIs.

Check and disable PEBS for fixed counters after intel_pmu_disable_fixed().

Reported-by: Yi, Ammy <ammy.yi@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Fixes: 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")
---
 arch/x86/events/intel/core.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 4377bf6a6f82..464316218b77 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2160,12 +2160,10 @@ static void intel_pmu_disable_event(struct perf_event *event)
 	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
 	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
 
-	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
+	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
 		intel_pmu_disable_fixed(hwc);
-		return;
-	}
-
-	x86_pmu_disable_event(event);
+	else
+		x86_pmu_disable_event(event);
 
 	/*
 	 * Needs to be called after x86_pmu_disable_event,
-- 
2.14.5
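
For context, the hunk above is cut off right at the comment explaining the
ordering requirement. Below is a simplified sketch of intel_pmu_disable_event()
as it looks after this change, reconstructed from the diff and its surrounding
context rather than copied verbatim from the tree (the precise_ip check in
particular is an assumption about the elided tail of the hunk). The point is
that intel_pmu_pebs_disable() now runs after both the fixed and the
general-purpose disable paths, so the event's PEBS_ENABLE bit is cleared in
either case:

  static void intel_pmu_disable_event(struct perf_event *event)
  {
          struct hw_perf_event *hwc = &event->hw;

          /* ... BTS handling and intel_ctrl/cp_status bookkeeping elided ... */

          if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
                  intel_pmu_disable_fixed(hwc);           /* fixed counter */
          else
                  x86_pmu_disable_event(event);           /* general-purpose counter */

          /*
           * Runs on both paths now. Before this fix, the early return in
           * the fixed-counter branch skipped this, leaving the event's bit
           * in PEBS_ENABLE set for the next user of the counter.
           */
          if (unlikely(event->attr.precise_ip))
                  intel_pmu_pebs_disable(event);
  }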


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] perf/x86/intel: Fix spurious NMI on fixed counter
  2019-06-25 14:21 [PATCH] perf/x86/intel: Fix spurious NMI on fixed counter kan.liang
@ 2019-06-25 14:58 ` Jiri Olsa
  2019-07-05  0:23   ` Jin, Yao
  2019-07-13 11:13 ` [tip:perf/urgent] " tip-bot for Kan Liang
  1 sibling, 1 reply; 6+ messages in thread
From: Jiri Olsa @ 2019-06-25 14:58 UTC (permalink / raw)
  To: kan.liang; +Cc: mingo, jolsa, peterz, linux-kernel, ak

On Tue, Jun 25, 2019 at 07:21:35AM -0700, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> If a user first samples a PEBS event on a fixed counter and then samples a
> non-PEBS event on the same fixed counter on Icelake, it will trigger a
> spurious NMI. For example:
> 
>   perf record -e 'cycles:p' -a
>   perf record -e 'cycles' -a
> 
> The error message for the spurious NMI:
> 
>   [June 21 15:38] Uhhuh. NMI received for unknown reason 30 on CPU 2.
>   [  +0.000000] Do you have a strange power saving mode enabled?
>   [  +0.000000] Dazed and confused, but trying to continue
> 
> The issue was introduced by the following commit:
> 
>   commit 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")
> 
> The commit moves the intel_pmu_pebs_disable() call to after
> intel_pmu_disable_fixed(), but the fixed-counter path in
> intel_pmu_disable_event() returns immediately at that point, so the
> related bit in the PEBS_ENABLE MSR is never cleared for the fixed
> counter. When a non-PEBS event later runs on the same fixed counter,
> the stale bit in PEBS_ENABLE is still set, which triggers spurious NMIs.
> 
> Check and disable PEBS for fixed counters after intel_pmu_disable_fixed().
> 
> Reported-by: Yi, Ammy <ammy.yi@intel.com>
> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> Fixes: 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")
> ---
>  arch/x86/events/intel/core.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 4377bf6a6f82..464316218b77 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2160,12 +2160,10 @@ static void intel_pmu_disable_event(struct perf_event *event)
>  	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
>  	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
>  
> -	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
> +	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
>  		intel_pmu_disable_fixed(hwc);
> -		return;
> -	}
> -
> -	x86_pmu_disable_event(event);
> +	else
> +		x86_pmu_disable_event(event);
>  

oops, I overlooked this, looks good

Acked-by: Jiri Olsa <jolsa@kernel.org>

thanks,
jirka

>  	/*
>  	 * Needs to be called after x86_pmu_disable_event,
> -- 
> 2.14.5
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] perf/x86/intel: Fix spurious NMI on fixed counter
  2019-06-25 14:58 ` Jiri Olsa
@ 2019-07-05  0:23   ` Jin, Yao
  2019-07-05 11:44     ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Jin, Yao @ 2019-07-05  0:23 UTC (permalink / raw)
  To: Jiri Olsa, kan.liang; +Cc: mingo, jolsa, peterz, linux-kernel, ak



On 6/25/2019 10:58 PM, Jiri Olsa wrote:
> On Tue, Jun 25, 2019 at 07:21:35AM -0700, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> If a user first samples a PEBS event on a fixed counter and then samples a
>> non-PEBS event on the same fixed counter on Icelake, it will trigger a
>> spurious NMI. For example:
>>
>>    perf record -e 'cycles:p' -a
>>    perf record -e 'cycles' -a
>>
>> The error message for the spurious NMI:
>>
>>    [June 21 15:38] Uhhuh. NMI received for unknown reason 30 on CPU 2.
>>    [  +0.000000] Do you have a strange power saving mode enabled?
>>    [  +0.000000] Dazed and confused, but trying to continue
>>
>> The issue was introduced by the following commit:
>>
>>    commit 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")
>>
>> The commit moves the intel_pmu_pebs_disable() call to after
>> intel_pmu_disable_fixed(), but the fixed-counter path in
>> intel_pmu_disable_event() returns immediately at that point, so the
>> related bit in the PEBS_ENABLE MSR is never cleared for the fixed
>> counter. When a non-PEBS event later runs on the same fixed counter,
>> the stale bit in PEBS_ENABLE is still set, which triggers spurious NMIs.
>>
>> Check and disable PEBS for fixed counters after intel_pmu_disable_fixed().
>>
>> Reported-by: Yi, Ammy <ammy.yi@intel.com>
>> Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
>> Fixes: 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")
>> ---
>>   arch/x86/events/intel/core.c | 8 +++-----
>>   1 file changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 4377bf6a6f82..464316218b77 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -2160,12 +2160,10 @@ static void intel_pmu_disable_event(struct perf_event *event)
>>   	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
>>   	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
>>   
>> -	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
>> +	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
>>   		intel_pmu_disable_fixed(hwc);
>> -		return;
>> -	}
>> -
>> -	x86_pmu_disable_event(event);
>> +	else
>> +		x86_pmu_disable_event(event);
>>   
> 
> oops, I overlooked this, looks good
> 
> Acked-by: Jiri Olsa <jolsa@kernel.org>
> 
> thanks,
> jirka
> 

Hi,

Could this fix be accepted?

Thanks
Jin Yao

>>   	/*
>>   	 * Needs to be called after x86_pmu_disable_event,
>> -- 
>> 2.14.5
>>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] perf/x86/intel: Fix spurious NMI on fixed counter
  2019-07-05  0:23   ` Jin, Yao
@ 2019-07-05 11:44     ` Peter Zijlstra
  2019-07-10 19:37       ` Andi Kleen
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2019-07-05 11:44 UTC (permalink / raw)
  To: Jin, Yao; +Cc: Jiri Olsa, kan.liang, mingo, jolsa, linux-kernel, ak

On Fri, Jul 05, 2019 at 08:23:37AM +0800, Jin, Yao wrote:
> 
> 
> On 6/25/2019 10:58 PM, Jiri Olsa wrote:
> > On Tue, Jun 25, 2019 at 07:21:35AM -0700, kan.liang@linux.intel.com wrote:
> > > From: Kan Liang <kan.liang@linux.intel.com>
> > > 
> > > If a user first samples a PEBS event on a fixed counter and then samples a
> > > non-PEBS event on the same fixed counter on Icelake, it will trigger a
> > > spurious NMI. For example:
> > > 
> > >    perf record -e 'cycles:p' -a
> > >    perf record -e 'cycles' -a
> > > 
> > > The error message for the spurious NMI:
> > > 
> > >    [June 21 15:38] Uhhuh. NMI received for unknown reason 30 on CPU 2.
> > >    [  +0.000000] Do you have a strange power saving mode enabled?
> > >    [  +0.000000] Dazed and confused, but trying to continue
> > > 
> > > The issue was introduced by the following commit:
> > > 
> > >    commit 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")
> > > 
> > > The commit moves the intel_pmu_pebs_disable() call to after
> > > intel_pmu_disable_fixed(), but the fixed-counter path in
> > > intel_pmu_disable_event() returns immediately at that point, so the
> > > related bit in the PEBS_ENABLE MSR is never cleared for the fixed
> > > counter. When a non-PEBS event later runs on the same fixed counter,
> > > the stale bit in PEBS_ENABLE is still set, which triggers spurious NMIs.
> > > 
> > > Check and disable PEBS for fixed counters after intel_pmu_disable_fixed().
> > > 
> > > Reported-by: Yi, Ammy <ammy.yi@intel.com>
> > > Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
> > > Fixes: 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")

> > oops, I overlooked this, looks good
> > 
> > Acked-by: Jiri Olsa <jolsa@kernel.org>

Have it now, thanks!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] perf/x86/intel: Fix spurious NMI on fixed counter
  2019-07-05 11:44     ` Peter Zijlstra
@ 2019-07-10 19:37       ` Andi Kleen
  0 siblings, 0 replies; 6+ messages in thread
From: Andi Kleen @ 2019-07-10 19:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jin, Yao, Jiri Olsa, kan.liang, mingo, jolsa, linux-kernel

> > > oops, I overlooked this, looks good
> > > 
> > > Acked-by: Jiri Olsa <jolsa@kernel.org>
> 
> Have it now, thanks!

Can Kan's patch please be merged asap and also put into stable for 5.2?

The regression causes crashes on Icelake when fixed counters
are used in PEBS groups, and presumably also on Goldmont Plus.
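
A purely hypothetical illustration of such a group, not taken from the
original report, would be something along the lines of:

  perf record -e '{cycles,instructions}:p' -a

where the grouped events can land on fixed counters while PEBS sampling
is requested for them.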

I just had to debug this again until I realized it was the 
same problem.

-Andi


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tip:perf/urgent] perf/x86/intel: Fix spurious NMI on fixed counter
  2019-06-25 14:21 [PATCH] perf/x86/intel: Fix spurious NMI on fixed counter kan.liang
  2019-06-25 14:58 ` Jiri Olsa
@ 2019-07-13 11:13 ` tip-bot for Kan Liang
  1 sibling, 0 replies; 6+ messages in thread
From: tip-bot for Kan Liang @ 2019-07-13 11:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, torvalds, peterz, tglx, jolsa,
	vincent.weaver, jolsa, mingo, alexander.shishkin, ammy.yi, acme,
	kan.liang, stable, hpa

Commit-ID:  e4557c1a46b0d32746bd309e1941914b5a6912b4
Gitweb:     https://git.kernel.org/tip/e4557c1a46b0d32746bd309e1941914b5a6912b4
Author:     Kan Liang <kan.liang@linux.intel.com>
AuthorDate: Tue, 25 Jun 2019 07:21:35 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 13 Jul 2019 11:21:29 +0200

perf/x86/intel: Fix spurious NMI on fixed counter

If a user first samples a PEBS event on a fixed counter and then samples a
non-PEBS event on the same fixed counter on Icelake, it will trigger a
spurious NMI. For example:

  perf record -e 'cycles:p' -a
  perf record -e 'cycles' -a

The error message for the spurious NMI:

  [June 21 15:38] Uhhuh. NMI received for unknown reason 30 on CPU 2.
  [    +0.000000] Do you have a strange power saving mode enabled?
  [    +0.000000] Dazed and confused, but trying to continue

The bug was introduced by the following commit:

  commit 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")

The commit moves the intel_pmu_pebs_disable() call to after intel_pmu_disable_fixed(),
but the fixed-counter path in intel_pmu_disable_event() returns immediately at that
point, so the related bit in the PEBS_ENABLE MSR is never cleared for the fixed counter.
When a non-PEBS event later runs on the same fixed counter, the stale bit in PEBS_ENABLE
is still set, which triggers spurious NMIs.

Check and disable PEBS for fixed counters after intel_pmu_disable_fixed().

Reported-by: Yi, Ammy <ammy.yi@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 6f55967ad9d9 ("perf/x86/intel: Fix race in intel_pmu_disable_event()")
Link: https://lkml.kernel.org/r/20190625142135.22112-1-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/core.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index bda450ff51ee..9e911a96972b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2161,12 +2161,10 @@ static void intel_pmu_disable_event(struct perf_event *event)
 	cpuc->intel_ctrl_host_mask &= ~(1ull << hwc->idx);
 	cpuc->intel_cp_status &= ~(1ull << hwc->idx);
 
-	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL)) {
+	if (unlikely(hwc->config_base == MSR_ARCH_PERFMON_FIXED_CTR_CTRL))
 		intel_pmu_disable_fixed(hwc);
-		return;
-	}
-
-	x86_pmu_disable_event(event);
+	else
+		x86_pmu_disable_event(event);
 
 	/*
 	 * Needs to be called after x86_pmu_disable_event,

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-07-13 11:13 UTC | newest]

Thread overview: 6+ messages
2019-06-25 14:21 [PATCH] perf/x86/intel: Fix spurious NMI on fixed counter kan.liang
2019-06-25 14:58 ` Jiri Olsa
2019-07-05  0:23   ` Jin, Yao
2019-07-05 11:44     ` Peter Zijlstra
2019-07-10 19:37       ` Andi Kleen
2019-07-13 11:13 ` [tip:perf/urgent] " tip-bot for Kan Liang
