linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS
@ 2020-01-13 14:09 kan.liang
  2020-01-20 10:50 ` Peter Zijlstra
  0 siblings, 1 reply; 3+ messages in thread
From: kan.liang @ 2020-01-13 14:09 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: ak, like.xu, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

A warning as below may be triggered when sampling large PEBS.

[  410.411250] perf: interrupt took too long (72145 > 71975), lowering
kernel.perf_event_max_sample_rate to 2000
[  410.724923] ------------[ cut here ]------------
[  410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422
x86_pmu_stop+0x95/0xa0
[  410.933811]  x86_pmu_del+0x50/0x150
[  410.937304]  event_sched_out.isra.0+0xbc/0x210
[  410.941751]  group_sched_out.part.0+0x53/0xd0
[  410.946111]  ctx_sched_out+0x193/0x270
[  410.949862]  __perf_event_task_sched_out+0x32c/0x890
[  410.954827]  ? set_next_entity+0x98/0x2d0
[  410.958841]  __schedule+0x592/0x9c0
[  410.962332]  schedule+0x5f/0xd0
[  410.965477]  exit_to_usermode_loop+0x73/0x120
[  410.969837]  prepare_exit_to_usermode+0xcd/0xf0
[  410.974369]  ret_from_intr+0x2a/0x3a
[  410.977946] RIP: 0033:0x40123c
[  411.079661] ---[ end trace bc83adaea7bb664a ]---

For large PEBS, the PEBS buffer can be drained from either NMI handler
or !NMI e.g. context switch. Current implementation doesn't handle them
differently. For !nmi, perf also call the generic overflow handler for
the last PEBS record. That may trigger the interrupt throttle, and stop
the event. That's wrong.

Here is an example for !NMI scenario, context switch.
Let's say the max_samples_per_tick is adjusted to 2 for some reason.
A context switch happens right after a NMI.
When an old task is scheduled out, it will drain the PEBS buffer, and
then delete the event.
When draining the PEBS buffer, perf_event_overflow() will be called for
the last PEBS record. Since the max_samples_per_tick is only 2, the
interrupt throttle must be triggered. The event will be stopped.
After the draining, the scheduler will delete the event, which stops the
event again. The warning is triggered.

Perf should handle the NMI and !NMI differently for large PEBS.
For NMI, the generic overflow handler is required for the last PEBS
record.
But, for !NMI, there is no overflow. The generic overflow handler should
not be invoked. Perf should treat the last record exactly the same as
the rest of PEBS records.

Fixes: 21509084f999 ("perf/x86/intel: Handle multiple records in the PEBS buffer")
Reported-by: Like Xu <like.xu@linux.intel.com>
Tested-by: Like Xu <like.xu@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/ds.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 7c896d7e8b6c..51baff083938 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1780,15 +1780,22 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 
 	setup_sample(event, iregs, at, &data, regs);
 
-	/*
-	 * All but the last records are processed.
-	 * The last one is left to be able to call the overflow handler.
-	 */
-	if (perf_event_overflow(event, &data, regs)) {
-		x86_pmu_stop(event, 0);
-		return;
+	if (in_nmi()) {
+		/*
+		 * All but the last records are processed.
+		 * The last one is left to be able to call the overflow handler.
+		 */
+		if (perf_event_overflow(event, &data, regs))
+			x86_pmu_stop(event, 0);
+	} else {
+		/*
+		 * For !NMI, e.g context switch, there is no overflow.
+		 * The generic overflow handler should not be invoked.
+		 * Perf should treat the last record exactly the same as the
+		 * rest of PEBS records.
+		 */
+		perf_event_output(event, &data, regs);
 	}
-
 }
 
 static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS
  2020-01-13 14:09 [PATCH] perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS kan.liang
@ 2020-01-20 10:50 ` Peter Zijlstra
  2020-01-20 16:50   ` Liang, Kan
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Zijlstra @ 2020-01-20 10:50 UTC (permalink / raw)
  To: kan.liang; +Cc: mingo, linux-kernel, ak, like.xu

On Mon, Jan 13, 2020 at 06:09:35AM -0800, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> A warning as below may be triggered when sampling large PEBS.

> [  410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422
> x86_pmu_stop+0x95/0xa0

> For large PEBS, the PEBS buffer can be drained from either NMI handler
> or !NMI e.g. context switch. Current implementation doesn't handle them
> differently. For !nmi, perf also call the generic overflow handler for
> the last PEBS record. That may trigger the interrupt throttle, and stop
> the event. That's wrong.
> 
> Here is an example for !NMI scenario, context switch.
> Let's say the max_samples_per_tick is adjusted to 2 for some reason.
> A context switch happens right after a NMI.
> When an old task is scheduled out, it will drain the PEBS buffer, and
> then delete the event.
> When draining the PEBS buffer, perf_event_overflow() will be called for
> the last PEBS record. Since the max_samples_per_tick is only 2, the
> interrupt throttle must be triggered. The event will be stopped.
> After the draining, the scheduler will delete the event, which stops the
> event again. The warning is triggered.
> 
> Perf should handle the NMI and !NMI differently for large PEBS.
> For NMI, the generic overflow handler is required for the last PEBS
> record.
> But, for !NMI, there is no overflow. The generic overflow handler should
> not be invoked. Perf should treat the last record exactly the same as
> the rest of PEBS records.

Hurmph. there's something there, but the above is hard to read.

drain_pebs() is called from:

 - handle_pmi_common()		-- sample context
 - intel_pmu_pebs_sched_task()  -- non sample context
 - intel_pmu_pebs_disable()     -- non sample context
 - intel_pmu_auto_reload_read() -- possible sample context

So the question is what to do for PERF_SAMPLE_READ + PERF_FORMAT_GROUP.

I don't think throttling there is right either, but that does mean the
simple in_nmi() test you use is wrong.

Perhaps we can do something with how intel_pmu_drain_pebs_buffer()
passes in dummy regs pointer to distinguish between the sample and non
sample context.

> ---
>  arch/x86/events/intel/ds.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 7c896d7e8b6c..51baff083938 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1780,15 +1780,22 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>  
>  	setup_sample(event, iregs, at, &data, regs);
>  
> -	/*
> -	 * All but the last records are processed.
> -	 * The last one is left to be able to call the overflow handler.
> -	 */
> -	if (perf_event_overflow(event, &data, regs)) {
> -		x86_pmu_stop(event, 0);
> -		return;
> +	if (in_nmi()) {
> +		/*
> +		 * All but the last records are processed.
> +		 * The last one is left to be able to call the overflow handler.
> +		 */
> +		if (perf_event_overflow(event, &data, regs))
> +			x86_pmu_stop(event, 0);
> +	} else {
> +		/*
> +		 * For !NMI, e.g context switch, there is no overflow.
> +		 * The generic overflow handler should not be invoked.
> +		 * Perf should treat the last record exactly the same as the
> +		 * rest of PEBS records.
> +		 */
> +		perf_event_output(event, &data, regs);
>  	}

Maybe write it like so?

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 4b94ae4ae369..b66be085c7a4 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1747,25 +1747,22 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
 	} else if (!intel_pmu_save_and_restart(event))
 		return;
 
-	while (count > 1) {
+	while (count > /* cond */) {
 		setup_sample(event, iregs, at, &data, regs);
 		perf_event_output(event, &data, regs);
 		at += cpuc->pebs_record_size;
 		at = get_next_pebs_record_by_bit(at, top, bit);
-		count--;
+		if (!--count)
+			return;
 	}
 
-	setup_sample(event, iregs, at, &data, regs);
-
 	/*
 	 * All but the last records are processed.
 	 * The last one is left to be able to call the overflow handler.
 	 */
-	if (perf_event_overflow(event, &data, regs)) {
+	setup_sample(event, iregs, at, &data, regs);
+	if (perf_event_overflow(event, &data, regs))
 		x86_pmu_stop(event, 0);
-		return;
-	}
-
 }
 
 static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS
  2020-01-20 10:50 ` Peter Zijlstra
@ 2020-01-20 16:50   ` Liang, Kan
  0 siblings, 0 replies; 3+ messages in thread
From: Liang, Kan @ 2020-01-20 16:50 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, linux-kernel, ak, like.xu



On 1/20/2020 5:50 AM, Peter Zijlstra wrote:
> On Mon, Jan 13, 2020 at 06:09:35AM -0800, kan.liang@linux.intel.com wrote:
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> A warning as below may be triggered when sampling large PEBS.
> 
>> [  410.729822] WARNING: CPU: 0 PID: 16397 at arch/x86/events/core.c:1422
>> x86_pmu_stop+0x95/0xa0
> 
>> For large PEBS, the PEBS buffer can be drained from either NMI handler
>> or !NMI e.g. context switch. Current implementation doesn't handle them
>> differently. For !nmi, perf also call the generic overflow handler for
>> the last PEBS record. That may trigger the interrupt throttle, and stop
>> the event. That's wrong.
>>
>> Here is an example for !NMI scenario, context switch.
>> Let's say the max_samples_per_tick is adjusted to 2 for some reason.
>> A context switch happens right after a NMI.
>> When an old task is scheduled out, it will drain the PEBS buffer, and
>> then delete the event.
>> When draining the PEBS buffer, perf_event_overflow() will be called for
>> the last PEBS record. Since the max_samples_per_tick is only 2, the
>> interrupt throttle must be triggered. The event will be stopped.
>> After the draining, the scheduler will delete the event, which stops the
>> event again. The warning is triggered.
>>
>> Perf should handle the NMI and !NMI differently for large PEBS.
>> For NMI, the generic overflow handler is required for the last PEBS
>> record.
>> But, for !NMI, there is no overflow. The generic overflow handler should
>> not be invoked. Perf should treat the last record exactly the same as
>> the rest of PEBS records.
> 
> Hurmph. there's something there, but the above is hard to read.
>

It describes an example as below
//max_samples_per_tick is adjusted to 2
//NMI happens
intel_pmu_handle_irq()
   handle_pmi_common()
     drain_pebs()
       __intel_pmu_pebs_event()
         perf_event_overflow()
           __perf_event_account_interrupt()
             hwc->interrupts = 1
             return 0
//A context switch right after the NMI.
//In the same tick, perf_throttled_seq is not changed.
perf_event_task_sched_out()
   perf_pmu_sched_task()
     intel_pmu_drain_pebs_buffer()
       __intel_pmu_pebs_event()
         perf_event_overflow()
           __perf_event_account_interrupt()
             ++hwc->interrupts >= max_samples_per_tick
             return 1
       x86_pmu_stop();  # First stop
   perf_event_context_sched_out()
     task_ctx_sched_out()
       ctx_sched_out()
         event_sched_out()
           x86_pmu_del()
             x86_pmu_stop();  # Second stop and trigger the warning

> drain_pebs() is called from:
> 
>   - handle_pmi_common()		-- sample context
>   - intel_pmu_pebs_sched_task()  -- non sample context
>   - intel_pmu_pebs_disable()     -- non sample context
>   - intel_pmu_auto_reload_read() -- possible sample context
> > So the question is what to do for PERF_SAMPLE_READ + PERF_FORMAT_GROUP.
> 
> I don't think throttling there is right either, but that does mean the
> simple in_nmi() test you use is wrong.


I'm not sure if it's accurate to call it sample/non sample context.
Because for large PEBS, it should always output samples here for any cases.

As my understanding, the sample context you mean is actually the same as 
interrupt context.
   - handle_pmi_common(): -- NMI
   - intel_pmu_pebs_sched_task(): -- Can only be invoked in !NMI
   - intel_pmu_pebs_disable(): -- Can only be invoked in !NMI
     (I think there is only one case which the function will be called
      in NMI, throttling triggered x86_pmu_stop(). For this case, the
      PEBS buffer has been drained. __intel_pmu_pebs_event() will not
      be touched.)
   - intel_pmu_auto_reload_read(): Can only be invoked in NMI via
     PERF_SAMPLE_READ. For other cases, it will be only invoked in !NMI.

The throttling here is to control the number of interrupts. In NMI 
context, we should check the throttling. Otherwise, it's unnecessary.

I think the in_nmi() test should be enough here.

> 
> Perhaps we can do something with how intel_pmu_drain_pebs_buffer()
> passes in dummy regs pointer to distinguish between the sample and non
> sample context.
> 
>> ---
>>   arch/x86/events/intel/ds.c | 23 +++++++++++++++--------
>>   1 file changed, 15 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 7c896d7e8b6c..51baff083938 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1780,15 +1780,22 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>>   
>>   	setup_sample(event, iregs, at, &data, regs);
>>   
>> -	/*
>> -	 * All but the last records are processed.
>> -	 * The last one is left to be able to call the overflow handler.
>> -	 */
>> -	if (perf_event_overflow(event, &data, regs)) {
>> -		x86_pmu_stop(event, 0);
>> -		return;
>> +	if (in_nmi()) {
>> +		/*
>> +		 * All but the last records are processed.
>> +		 * The last one is left to be able to call the overflow handler.
>> +		 */
>> +		if (perf_event_overflow(event, &data, regs))
>> +			x86_pmu_stop(event, 0);
>> +	} else {
>> +		/*
>> +		 * For !NMI, e.g context switch, there is no overflow.
>> +		 * The generic overflow handler should not be invoked.
>> +		 * Perf should treat the last record exactly the same as the
>> +		 * rest of PEBS records.
>> +		 */
>> +		perf_event_output(event, &data, regs);
>>   	}
> 
> Maybe write it like so?
> 
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 4b94ae4ae369..b66be085c7a4 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1747,25 +1747,22 @@ static void __intel_pmu_pebs_event(struct perf_event *event,
>   	} else if (!intel_pmu_save_and_restart(event))
>   		return;
>   
> -	while (count > 1) {
> +	while (count > /* cond */) {

What's cond?
Could you please elaborate?

Thanks,
Kan

>   		setup_sample(event, iregs, at, &data, regs);
>   		perf_event_output(event, &data, regs);
>   		at += cpuc->pebs_record_size;
>   		at = get_next_pebs_record_by_bit(at, top, bit);
> -		count--;
> +		if (!--count)
> +			return; >   	}
>   
> -	setup_sample(event, iregs, at, &data, regs);
> -
>   	/*
>   	 * All but the last records are processed.
>   	 * The last one is left to be able to call the overflow handler.
>   	 */
> -	if (perf_event_overflow(event, &data, regs)) {
> +	setup_sample(event, iregs, at, &data, regs);
> +	if (perf_event_overflow(event, &data, regs))
>   		x86_pmu_stop(event, 0);
> -		return;
> -	}
> -
>   }
>   
>   static void intel_pmu_drain_pebs_core(struct pt_regs *iregs)
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-01-20 16:50 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-13 14:09 [PATCH] perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS kan.liang
2020-01-20 10:50 ` Peter Zijlstra
2020-01-20 16:50   ` Liang, Kan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).