linux-kernel.vger.kernel.org archive mirror
* [RFC 0/2] perf: Fixing throttling related WARN
@ 2013-08-13 16:39 Jiri Olsa
  2013-08-13 16:39 ` [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events Jiri Olsa
  2013-08-13 16:39 ` [PATCH 2/2] perf: Move throttling flag to perf_event_context Jiri Olsa
  0 siblings, 2 replies; 11+ messages in thread
From: Jiri Olsa @ 2013-08-13 16:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jiri Olsa, Corey Ashford, Frederic Weisbecker, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Andi Kleen, Stephane Eranian

hi,
we managed to trigger a perf WARN (described in patch 2),
but only on an AMD system with 48 CPUs.

I believe the reason is a race in the throttling code and
I tried to fix it. So far my testing looks OK, but I
suspect I broke something else.. ;-)

thanks for comments,
jirka


Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c |  4 +++-
 include/linux/perf_event.h             |  2 +-
 kernel/events/core.c                   | 67 +++++++++++++++++++++++++++++++++++++++----------------------------
 3 files changed, 43 insertions(+), 30 deletions(-)



* [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events
  2013-08-13 16:39 [RFC 0/2] perf: Fixing throttling related WARN Jiri Olsa
@ 2013-08-13 16:39 ` Jiri Olsa
  2013-08-15 11:40   ` Peter Zijlstra
  2013-08-13 16:39 ` [PATCH 2/2] perf: Move throttling flag to perf_event_context Jiri Olsa
  1 sibling, 1 reply; 11+ messages in thread
From: Jiri Olsa @ 2013-08-13 16:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jiri Olsa, Corey Ashford, Frederic Weisbecker, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Andi Kleen, Stephane Eranian

Currently intel_pmu_enable_all enables all possible
events, which is not always desired. One case (there
will probably be more) is:

  - event hits the throttling threshold
  - NMI stops the event
  - intel_pmu_enable_all starts it again on NMI exit
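
To illustrate the intended effect of the active_mask filtering, a small
stand-alone sketch with made-up register values (illustration only, not
the kernel code and not values read from real hardware):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	/* made-up example: 4 counters enabled, counter 1 inactive (throttled) */
	uint64_t intel_ctrl  = 0x0f;
	uint64_t guest_mask  = 0x00;
	uint64_t active_mask = 0x0d;

	/*
	 * Value the patch would write to MSR_CORE_PERF_GLOBAL_CTRL:
	 * counter 1 stays disabled instead of being re-enabled.
	 */
	printf("GLOBAL_CTRL = 0x%" PRIx64 "\n",
	       intel_ctrl & ~guest_mask & active_mask);
	return 0;
}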

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Stephane Eranian <eranian@google.com>
---
 arch/x86/kernel/cpu/perf_event_intel.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
index a45d8d4..360e7a0 100644
--- a/arch/x86/kernel/cpu/perf_event_intel.c
+++ b/arch/x86/kernel/cpu/perf_event_intel.c
@@ -912,11 +912,13 @@ static void intel_pmu_disable_all(void)
 static void intel_pmu_enable_all(int added)
 {
 	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+	u64 active_mask = *((u64*) cpuc->active_mask);
 
 	intel_pmu_pebs_enable_all();
 	intel_pmu_lbr_enable_all();
 	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL,
-			x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask);
+			x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask
+			& active_mask);
 
 	if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask)) {
 		struct perf_event *event =
-- 
1.7.11.7



* [PATCH 2/2] perf: Move throttling flag to perf_event_context
  2013-08-13 16:39 [RFC 0/2] perf: Fixing throttling related WARN Jiri Olsa
  2013-08-13 16:39 ` [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events Jiri Olsa
@ 2013-08-13 16:39 ` Jiri Olsa
  2013-08-15 12:12   ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: Jiri Olsa @ 2013-08-13 16:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Jiri Olsa, Corey Ashford, Frederic Weisbecker, Paul Mackerras,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Andi Kleen, Stephane Eranian

The current throttling code triggers the WARN below via the following
workload (only hit on an AMD machine with 48 CPUs):

  # while [ 1 ]; do perf record perf bench sched messaging; done

  WARNING: at arch/x86/kernel/cpu/perf_event.c:1054 x86_pmu_start+0xc6/0x100()
  SNIP
  Call Trace:
   <IRQ>  [<ffffffff815f62d6>] dump_stack+0x19/0x1b
   [<ffffffff8105f531>] warn_slowpath_common+0x61/0x80
   [<ffffffff8105f60a>] warn_slowpath_null+0x1a/0x20
   [<ffffffff810213a6>] x86_pmu_start+0xc6/0x100
   [<ffffffff81129dd2>] perf_adjust_freq_unthr_context.part.75+0x182/0x1a0
   [<ffffffff8112a058>] perf_event_task_tick+0xc8/0xf0
   [<ffffffff81093221>] scheduler_tick+0xd1/0x140
   [<ffffffff81070176>] update_process_times+0x66/0x80
   [<ffffffff810b9565>] tick_sched_handle.isra.15+0x25/0x60
   [<ffffffff810b95e1>] tick_sched_timer+0x41/0x60
   [<ffffffff81087c24>] __run_hrtimer+0x74/0x1d0
   [<ffffffff810b95a0>] ? tick_sched_handle.isra.15+0x60/0x60
   [<ffffffff81088407>] hrtimer_interrupt+0xf7/0x240
   [<ffffffff81606829>] smp_apic_timer_interrupt+0x69/0x9c
   [<ffffffff8160569d>] apic_timer_interrupt+0x6d/0x80
   <EOI>  [<ffffffff81129f74>] ? __perf_event_task_sched_in+0x184/0x1a0
   [<ffffffff814dd937>] ? kfree_skbmem+0x37/0x90
   [<ffffffff815f2c47>] ? __slab_free+0x1ac/0x30f
   [<ffffffff8118143d>] ? kfree+0xfd/0x130
   [<ffffffff81181622>] kmem_cache_free+0x1b2/0x1d0
   [<ffffffff814dd937>] kfree_skbmem+0x37/0x90
   [<ffffffff814e03c4>] consume_skb+0x34/0x80
   [<ffffffff8158b057>] unix_stream_recvmsg+0x4e7/0x820
   [<ffffffff814d5546>] sock_aio_read.part.7+0x116/0x130
   [<ffffffff8112c10c>] ? __perf_sw_event+0x19c/0x1e0
   [<ffffffff814d5581>] sock_aio_read+0x21/0x30
   [<ffffffff8119a5d0>] do_sync_read+0x80/0xb0
   [<ffffffff8119ac85>] vfs_read+0x145/0x170
   [<ffffffff8119b699>] SyS_read+0x49/0xa0
   [<ffffffff810df516>] ? __audit_syscall_exit+0x1f6/0x2a0
   [<ffffffff81604a19>] system_call_fastpath+0x16/0x1b
  ---[ end trace 622b7e226c4a766a ]---

The reason is a race in the perf_event_task_tick throttling code.
The race flow (simplified code):

  - perf_throttled_count is a per-CPU variable acting as the
    CPU throttling flag, here starting at 0
  - perf_throttled_seq is the sequence/domain for the allowed
    count of interrupts within the tick; it gets increased
    each tick

    on a single CPU (CPU-bound event):

      ... workload

    perf_event_task_tick:
    |
    | T0    inc(perf_throttled_seq)
    | T1    throttled = xchg(perf_throttled_count, 0) == 0
     tick gets interrupted:

            ... event gets throttled under new seq ...

      T2    last NMI comes, event is throttled - inc(perf_throttled_count)

     back to tick:
    | perf_adjust_freq_unthr_context:
    |
    | T3    unthrottling is skipped for event (throttled == 0)
    | T4    event is stopped and started via freq adjustment
    |
    tick ends

      ... workload
      ... no sample is hit for event ...

    perf_event_task_tick:
    |
    | T5    throttled = xchg(perf_throttled_count, 0) != 0 (from T2)
    | T6    unthrottling is done on event (interrupts == MAX_INTERRUPTS)
    |       event is already started (from T4) -> WARN
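
The same interleaving can be modelled with a small single-threaded toy
program (an illustrative sketch in plain C, not the kernel code; the
helper names are made up, the T markers match the timeline above):

#include <stdio.h>

#define MAX_INTERRUPTS (~0U)

static unsigned int perf_throttled_count;	/* per-CPU in the kernel */
static unsigned int hwc_interrupts;		/* event->hw.interrupts  */
static int event_started = 1;

/* T2: the late NMI throttles the event after the tick cleared the flag */
static void nmi_throttles_event(void)
{
	perf_throttled_count++;
	hwc_interrupts = MAX_INTERRUPTS;
	event_started = 0;
}

static void tick(int nmi_races_with_this_tick)
{
	/* T1/T5: sample-and-clear of the per-CPU flag */
	unsigned int throttled = perf_throttled_count;

	perf_throttled_count = 0;

	if (nmi_races_with_this_tick)
		nmi_throttles_event();

	if (throttled && hwc_interrupts == MAX_INTERRUPTS) {
		/* T6: unthrottle path ends up in x86_pmu_start() */
		if (event_started)
			printf("WARN: starting an already started event\n");
		event_started = 1;
		hwc_interrupts = 0;
	} else {
		/* T3/T4: freq adjustment stops and restarts the event */
		event_started = 1;
	}
}

int main(void)
{
	tick(1);	/* NMI throttles right after the flag was cleared */
	tick(0);	/* stale count unthrottles an already running event */
	return 0;
}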

Fixing this by:
  - moving the per-CPU 'throttled' flag into the event's ctx so the flag
    is context specific and thus not disturbed (changed) by others
  - checking the 'throttled' flag again under pmu-disabled code
  - removing perf_throttled_seq as it's no longer needed
  - keeping perf_throttled_count as the perf_event_can_stop_tick flag

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Stephane Eranian <eranian@google.com>
---
 include/linux/perf_event.h |  2 +-
 kernel/events/core.c       | 67 +++++++++++++++++++++++++++-------------------
 2 files changed, 40 insertions(+), 29 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c43f6ea..d0c04fb 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -167,7 +167,6 @@ struct hw_perf_event {
 	u64				sample_period;
 	u64				last_period;
 	local64_t			period_left;
-	u64                             interrupts_seq;
 	u64				interrupts;
 
 	u64				freq_time_stamp;
@@ -494,6 +493,7 @@ struct perf_event_context {
 	int				nr_cgroups;	 /* cgroup evts */
 	int				nr_branch_stack; /* branch_stack evt */
 	struct rcu_head			rcu_head;
+	int * __percpu			throttled;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e82e700..40d7127 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -886,6 +886,7 @@ static void put_ctx(struct perf_event_context *ctx)
 			put_ctx(ctx->parent_ctx);
 		if (ctx->task)
 			put_task_struct(ctx->task);
+		free_percpu(ctx->throttled);
 		kfree_rcu(ctx, rcu_head);
 	}
 }
@@ -2433,6 +2434,13 @@ ctx_sched_in(struct perf_event_context *ctx,
 	/* Then walk through the lower prio flexible groups */
 	if (!(is_active & EVENT_FLEXIBLE) && (event_type & EVENT_FLEXIBLE))
 		ctx_flexible_sched_in(ctx, cpuctx);
+
+	/*
+	 * If we missed the tick before last unscheduling we could
+	 * have some events in throttled state. They get unthrottled
+	 * in event_sched_in and we can clear the flag for context.
+	 */
+	*this_cpu_ptr(ctx->throttled) = 0;
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
@@ -2648,7 +2656,6 @@ do {					\
 }
 
 static DEFINE_PER_CPU(int, perf_throttled_count);
-static DEFINE_PER_CPU(u64, perf_throttled_seq);
 
 static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bool disable)
 {
@@ -2684,9 +2691,9 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
  * events. At the same time, make sure, having freq events does not change
  * the rate of unthrottling as that would introduce bias.
  */
-static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
-					   int needs_unthr)
+static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx)
 {
+	int *throttled = this_cpu_ptr(ctx->throttled);
 	struct perf_event *event;
 	struct hw_perf_event *hwc;
 	u64 now, period = TICK_NSEC;
@@ -2697,7 +2704,7 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
 	 * - context have events in frequency mode (needs freq adjust)
 	 * - there are events to unthrottle on this cpu
 	 */
-	if (!(ctx->nr_freq || needs_unthr))
+	if (!(ctx->nr_freq || *throttled))
 		return;
 
 	raw_spin_lock(&ctx->lock);
@@ -2712,7 +2719,7 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
 
 		hwc = &event->hw;
 
-		if (needs_unthr && hwc->interrupts == MAX_INTERRUPTS) {
+		if (*throttled && hwc->interrupts == MAX_INTERRUPTS) {
 			hwc->interrupts = 0;
 			perf_log_throttle(event, 1);
 			event->pmu->start(event, 0);
@@ -2743,6 +2750,8 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
 		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
 	}
 
+	*throttled = 0;
+
 	perf_pmu_enable(ctx->pmu);
 	raw_spin_unlock(&ctx->lock);
 }
@@ -2824,20 +2833,19 @@ void perf_event_task_tick(void)
 	struct list_head *head = &__get_cpu_var(rotation_list);
 	struct perf_cpu_context *cpuctx, *tmp;
 	struct perf_event_context *ctx;
-	int throttled;
 
 	WARN_ON(!irqs_disabled());
 
-	__this_cpu_inc(perf_throttled_seq);
-	throttled = __this_cpu_xchg(perf_throttled_count, 0);
+	/* perf_event_can_stop_tick global throttling flag */
+	__this_cpu_write(perf_throttled_count, 0);
 
 	list_for_each_entry_safe(cpuctx, tmp, head, rotation_list) {
 		ctx = &cpuctx->ctx;
-		perf_adjust_freq_unthr_context(ctx, throttled);
+		perf_adjust_freq_unthr_context(ctx);
 
 		ctx = cpuctx->task_ctx;
 		if (ctx)
-			perf_adjust_freq_unthr_context(ctx, throttled);
+			perf_adjust_freq_unthr_context(ctx);
 	}
 }
 
@@ -2973,7 +2981,7 @@ static u64 perf_event_read(struct perf_event *event)
 /*
  * Initialize the perf_event context in a task_struct:
  */
-static void __perf_event_init_context(struct perf_event_context *ctx)
+static int __perf_event_init_context(struct perf_event_context *ctx)
 {
 	raw_spin_lock_init(&ctx->lock);
 	mutex_init(&ctx->mutex);
@@ -2981,6 +2989,8 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 	INIT_LIST_HEAD(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
 	atomic_set(&ctx->refcount, 1);
+	ctx->throttled = alloc_percpu(int);
+	return ctx->throttled ? 0 : -1;
 }
 
 static struct perf_event_context *
@@ -2992,7 +3002,11 @@ alloc_perf_context(struct pmu *pmu, struct task_struct *task)
 	if (!ctx)
 		return NULL;
 
-	__perf_event_init_context(ctx);
+	if (__perf_event_init_context(ctx)) {
+		kfree(ctx);
+		return NULL;
+	}
+
 	if (task) {
 		ctx->task = task;
 		get_task_struct(task);
@@ -5191,7 +5205,6 @@ static int __perf_event_overflow(struct perf_event *event,
 {
 	int events = atomic_read(&event->event_limit);
 	struct hw_perf_event *hwc = &event->hw;
-	u64 seq;
 	int ret = 0;
 
 	/*
@@ -5201,20 +5214,15 @@ static int __perf_event_overflow(struct perf_event *event,
 	if (unlikely(!is_sampling_event(event)))
 		return 0;
 
-	seq = __this_cpu_read(perf_throttled_seq);
-	if (seq != hwc->interrupts_seq) {
-		hwc->interrupts_seq = seq;
-		hwc->interrupts = 1;
-	} else {
-		hwc->interrupts++;
-		if (unlikely(throttle
-			     && hwc->interrupts >= max_samples_per_tick)) {
-			__this_cpu_inc(perf_throttled_count);
-			hwc->interrupts = MAX_INTERRUPTS;
-			perf_log_throttle(event, 0);
-			tick_nohz_full_kick();
-			ret = 1;
-		}
+	hwc->interrupts++;
+	if (unlikely(throttle
+		     && hwc->interrupts >= max_samples_per_tick)) {
+		(*this_cpu_ptr(event->ctx->throttled))++;
+		__this_cpu_inc(perf_throttled_count);
+		hwc->interrupts = MAX_INTERRUPTS;
+		perf_log_throttle(event, 0);
+		tick_nohz_full_kick();
+		ret = 1;
 	}
 
 	if (event->attr.freq) {
@@ -6362,7 +6370,8 @@ skip_type:
 		struct perf_cpu_context *cpuctx;
 
 		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-		__perf_event_init_context(&cpuctx->ctx);
+		if (__perf_event_init_context(&cpuctx->ctx))
+			goto free_context;
 		lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
 		lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
 		cpuctx->ctx.type = cpu_context;
@@ -6407,6 +6416,8 @@ unlock:
 
 	return ret;
 
+free_context:
+	free_percpu(pmu->pmu_cpu_context);
 free_dev:
 	device_del(pmu->dev);
 	put_device(pmu->dev);
-- 
1.7.11.7



* Re: [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events
  2013-08-13 16:39 ` [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events Jiri Olsa
@ 2013-08-15 11:40   ` Peter Zijlstra
  2013-08-15 11:49     ` Jiri Olsa
  2013-08-15 13:53     ` Andi Kleen
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-15 11:40 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: linux-kernel, Corey Ashford, Frederic Weisbecker, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen,
	Stephane Eranian

On Tue, Aug 13, 2013 at 06:39:11PM +0200, Jiri Olsa wrote:
> Currently intel_pmu_enable_all enables all possible
> events, which is not always desired. One case (there
> will probably be more) is:
>
>   - event hits the throttling threshold
>   - NMI stops the event
>   - intel_pmu_enable_all starts it again on NMI exit
> 


> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -912,11 +912,13 @@ static void intel_pmu_disable_all(void)
>  static void intel_pmu_enable_all(int added)
>  {
>  	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> +	u64 active_mask = *((u64*) cpuc->active_mask);
>  
>  	intel_pmu_pebs_enable_all();
>  	intel_pmu_lbr_enable_all();
>  	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL,
> -			x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask);
> +			x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask
> +			& active_mask);

Garh.. you made my head hurt :-)

I think it's a NOP; this is the global ctrl register but
intel_pmu_disable_event() writes PERFEVTSELx.EN = 0, so even if you
enable it in the global mask, the event should still be disabled.




* Re: [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events
  2013-08-15 11:40   ` Peter Zijlstra
@ 2013-08-15 11:49     ` Jiri Olsa
  2013-08-15 13:53     ` Andi Kleen
  1 sibling, 0 replies; 11+ messages in thread
From: Jiri Olsa @ 2013-08-15 11:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Corey Ashford, Frederic Weisbecker, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen,
	Stephane Eranian

On Thu, Aug 15, 2013 at 01:40:40PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 13, 2013 at 06:39:11PM +0200, Jiri Olsa wrote:
> > Currently intel_pmu_enable_all enables all possible
> > events, which is not always desired. One case (there
> > will probably be more) is:
> >
> >   - event hits the throttling threshold
> >   - NMI stops the event
> >   - intel_pmu_enable_all starts it again on NMI exit
> > 
> 
> 
> > +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> > @@ -912,11 +912,13 @@ static void intel_pmu_disable_all(void)
> >  static void intel_pmu_enable_all(int added)
> >  {
> >  	struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> > +	u64 active_mask = *((u64*) cpuc->active_mask);
> >  
> >  	intel_pmu_pebs_enable_all();
> >  	intel_pmu_lbr_enable_all();
> >  	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL,
> > -			x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask);
> > +			x86_pmu.intel_ctrl & ~cpuc->intel_ctrl_guest_mask
> > +			& active_mask);
> 
> Garh.. you made my head hurt :-)
> 
> I think it's a NOP; this is the global ctrl register but
> intel_pmu_disable_event() writes PERFEVTSELx.EN = 0, so even if you
> enable it in the global mask, the event should still be disabled.

hum, I might have misread the doc.. strange that the test
showed this result

jirka


* Re: [PATCH 2/2] perf: Move throttling flag to perf_event_context
  2013-08-13 16:39 ` [PATCH 2/2] perf: Move throttling flag to perf_event_context Jiri Olsa
@ 2013-08-15 12:12   ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-15 12:12 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: linux-kernel, Corey Ashford, Frederic Weisbecker, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, Andi Kleen,
	Stephane Eranian

On Tue, Aug 13, 2013 at 06:39:12PM +0200, Jiri Olsa wrote:
> The current throttling code triggers the WARN below via the following
> workload (only hit on an AMD machine with 48 CPUs):
> 
>   # while [ 1 ]; do perf record perf bench sched messaging; done


> The reason is a race in the perf_event_task_tick throttling code.
> The race flow (simplified code):
> 
>   - perf_throttled_count is a per-CPU variable acting as the
>     CPU throttling flag, here starting at 0
>   - perf_throttled_seq is the sequence/domain for the allowed
>     count of interrupts within the tick; it gets increased
>     each tick
> 
>     on a single CPU (CPU-bound event):
> 
>       ... workload
> 
>     perf_event_task_tick:
>     |
>     | T0    inc(perf_throttled_seq)
>     | T1    throttled = xchg(perf_throttled_count, 0) == 0
>      tick gets interrupted:
> 
>             ... event gets throttled under new seq ...
> 
>       T2    last NMI comes, event is throttled - inc(perf_throttled_count)
> 
>      back to tick:
>     | perf_adjust_freq_unthr_context:
>     |
>     | T3    unthrottling is skipped for event (throttled == 0)
>     | T4    event is stopped and started via freq adjustment
>     |
>     tick ends
> 
>       ... workload
>       ... no sample is hit for event ...
> 
>     perf_event_task_tick:
>     |
>     | T5    throttled = xchg(perf_throttled_count, 0) != 0 (from T2)
>     | T6    unthrottling is done on event (interrupts == MAX_INTERRUPTS)
>     |       event is already started (from T4) -> WARN
> 
> Fixing this by:
>   - moving the per-CPU 'throttled' flag into the event's ctx so the flag
>     is context specific and thus not disturbed (changed) by others

This is an 'unrelated' change purely meant to optimize the tick handler
so it doesn't iterate over as many events, right? It isn't actually part
of the solution for the above issue afaict. Hence I think it should be
split out into a separate patch, esp. since it seems to be the bulk of
the patch below due to the extra allocations/error paths etc.

That said, thinking about it, we might want to use an llist and queue all
throttled events on it from __perf_event_overflow(). That way we can
limit the iteration to just the throttled events; a much better
proposition still.
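
Something like the sketch below, perhaps (editorial illustration only,
untested; the perf_event::throttle_llist member and the throttled_events
per-cpu list are made up, and the ctx locking / pmu_disable around the
restart is omitted):

#include <linux/llist.h>	/* already available in kernel/events/core.c */
#include <linux/percpu.h>

static DEFINE_PER_CPU(struct llist_head, throttled_events);

/*
 * NMI context, called from __perf_event_overflow(); llist_add() is
 * NMI-safe. The MAX_INTERRUPTS check in the caller would ensure an
 * event is queued at most once until it gets unthrottled.
 */
static void queue_throttled(struct perf_event *event)
{
	llist_add(&event->throttle_llist, this_cpu_ptr(&throttled_events));
}

/* tick context, called from perf_event_task_tick() */
static void unthrottle_pending(void)
{
	struct llist_node *list = llist_del_all(this_cpu_ptr(&throttled_events));
	struct perf_event *event;

	llist_for_each_entry(event, list, throttle_llist) {
		event->hw.interrupts = 0;
		perf_log_throttle(event, 1);
		event->pmu->start(event, 0);
	}
}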

>   - checking the 'throttled' flag again under pmu-disabled code

Might as well remove that and only keep hwc->interrupts ==
MAX_INTERRUPTS; that should be sufficient I think.

In either case, we might not have gotten here on the first tick, and a
second tick is needed to unthrottle us.

>   - removing perf_throttled_seq as it's no longer needed

I'm not entirely sure how we got here and your changelog is short of
clues; that said, I think I agree.

>   - keeping perf_throttled_count as the perf_event_can_stop_tick flag

... uhu.

Anyway, would something like the below not be a similar fix? It avoids
the entire situation afaic.

---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index e82e700..8fffd18 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2712,7 +2712,7 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
 
 		hwc = &event->hw;
 
-		if (needs_unthr && hwc->interrupts == MAX_INTERRUPTS) {
+		if (hwc->interrupts == MAX_INTERRUPTS) {
 			hwc->interrupts = 0;
 			perf_log_throttle(event, 1);
 			event->pmu->start(event, 0);


* Re: [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events
  2013-08-15 11:40   ` Peter Zijlstra
  2013-08-15 11:49     ` Jiri Olsa
@ 2013-08-15 13:53     ` Andi Kleen
  2013-08-19  9:16       ` Stephane Eranian
  1 sibling, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2013-08-15 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jiri Olsa, linux-kernel, Corey Ashford, Frederic Weisbecker,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	Andi Kleen, Stephane Eranian

> 
> I think it's a NOP; this is the global ctrl register but
> intel_pmu_disable_event() writes PERFEVTSELx.EN = 0, so even if you
> enable it in the global mask, the event should still be disabled.

Yes the hardware ANDs the various enable bits in the different
registers.

-andi


* Re: [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events
  2013-08-15 13:53     ` Andi Kleen
@ 2013-08-19  9:16       ` Stephane Eranian
  2013-08-19 11:16         ` Jiri Olsa
  0 siblings, 1 reply; 11+ messages in thread
From: Stephane Eranian @ 2013-08-19  9:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Peter Zijlstra, Jiri Olsa, LKML, Corey Ashford,
	Frederic Weisbecker, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo

On Thu, Aug 15, 2013 at 3:53 PM, Andi Kleen <andi@firstfloor.org> wrote:
>>
>> I think it's a NOP; this is the global ctrl register but
>> intel_pmu_disable_event() writes PERFEVTSELx.EN = 0, so even if you
>> enable it in the global mask, the event should still be disabled.
>
> Yes the hardware ANDs the various enable bits in the different
> registers.
>
Andi is correct. There is a logical AND between GLOBAL_CTRL and
the individual PERFEVTSELx.EN bits. If the EN bit is zero, then the
counter does not count. That also applies to the fixed counters.
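
In other words (conceptual model only, not actual hardware or kernel
code):

#include <stdint.h>

#define PERFEVTSEL_EN (1ULL << 22)	/* EN bit in IA32_PERFEVTSELx */

/*
 * A general-purpose counter counts only when both its bit in
 * IA32_PERF_GLOBAL_CTRL and the EN bit of its IA32_PERFEVTSELx are set,
 * so setting the GLOBAL_CTRL bit alone cannot restart an event whose EN
 * bit was cleared by intel_pmu_disable_event().
 */
static inline int counter_counts(uint64_t global_ctrl, uint64_t evtsel, int idx)
{
	return ((global_ctrl >> idx) & 1) && (evtsel & PERFEVTSEL_EN);
}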


* Re: [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events
  2013-08-19  9:16       ` Stephane Eranian
@ 2013-08-19 11:16         ` Jiri Olsa
  2013-08-19 14:31           ` Stephane Eranian
  0 siblings, 1 reply; 11+ messages in thread
From: Jiri Olsa @ 2013-08-19 11:16 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Andi Kleen, Peter Zijlstra, LKML, Corey Ashford,
	Frederic Weisbecker, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo

On Mon, Aug 19, 2013 at 11:16:54AM +0200, Stephane Eranian wrote:
> On Thu, Aug 15, 2013 at 3:53 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >>
> >> I think it's a NOP; this is the global ctrl register but
> >> intel_pmu_disable_event() writes PERFEVTSELx.EN = 0, so even if you
> >> enable it in the global mask, the event should still be disabled.
> >
> > Yes the hardware ANDs the various enable bits in the different
> > registers.
> >
> Andi is correct. There is a logical AND between GLOBAL_CTRL and
> the individual PERFEVTSELx.EN bits. If the EN bit is zero, then the
> counter does not count. That also applies to the fixed counters.

right, peter mentioned I could have seen an already queued
NMI coming in for that event..

jirka


* Re: [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events
  2013-08-19 11:16         ` Jiri Olsa
@ 2013-08-19 14:31           ` Stephane Eranian
  2013-08-19 14:45             ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Stephane Eranian @ 2013-08-19 14:31 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Andi Kleen, Peter Zijlstra, LKML, Corey Ashford,
	Frederic Weisbecker, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo

Hi,

On Mon, Aug 19, 2013 at 1:16 PM, Jiri Olsa <jolsa@redhat.com> wrote:
> On Mon, Aug 19, 2013 at 11:16:54AM +0200, Stephane Eranian wrote:
>> On Thu, Aug 15, 2013 at 3:53 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> >>
>> >> I think it's a NOP; this is the global ctrl register but
>> >> intel_pmu_disable_event() writes PERFEVTSELx.EN = 0, so even if you
>> >> enable it in the global mask, the event should still be disabled.
>> >
>> > Yes the hardware ANDs the various enable bits in the different
>> > registers.
>> >
>> Andi is correct. There is a logical AND between GLOBAL_CTRL and
>> the individual PERFEVTSELx.EN bits. If the EN bit is zero, then the
>> counter does not count. That also applies to the fixed counters.
>
> right, peter mentioned I could have seen an already queued
> NMI coming in for that event..
>
Yeah, I think you can have one NMI pending because it came in
while you were already servicing an NMI interrupt. Though this
is very unlikely.


* Re: [PATCH 1/2] perf x86: Make intel_pmu_enable_all to enable only active events
  2013-08-19 14:31           ` Stephane Eranian
@ 2013-08-19 14:45             ` Peter Zijlstra
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Zijlstra @ 2013-08-19 14:45 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Jiri Olsa, Andi Kleen, LKML, Corey Ashford, Frederic Weisbecker,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo

On Mon, Aug 19, 2013 at 04:31:31PM +0200, Stephane Eranian wrote:
> Yeah, I think you can have one NMI pending because it came in
> while you were already servicing an NMI interrupt. Though this
> is very unlikely.

Yeah, exceedingly rare. I'm also fairly certain I've had NMIs right after
clearing the EN bits, although I cannot remember if that was on Intel or
AMD hardware, nor on what specific model :/


