* [PATCH] perf_events: fix broken intr throttling (v3)
@ 2012-01-26 16:03 Stephane Eranian
  2012-01-26 16:16 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Stephane Eranian @ 2012-01-26 16:03 UTC (permalink / raw)
  To: linux-kernel; +Cc: peterz, mingo, gleb, wcohen, vince, asharma, andi


This patch fixes the sampling interrupt throttling mechanism. It was
broken in 3.2.0: events were not being unthrottled, because the
unthrottling mechanism required that events be checked at each timer
tick, and that check was no longer guaranteed to happen.

This patch solves this problem and also separates:
  - unthrottling
  - multiplexing
  - frequency-mode period adjustments

Not all of them need to be executed at each timer tick.

This third version of the patch is based on my original patch +
PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).

At each timer tick, for each context:
  - if the current CPU has throttled events, we unthrottle events

  - if context has frequency-based events, we adjust sampling periods

  - if we have reached the jiffies interval, we multiplex (rotate)

We decoupled rotation (multiplexing) from frequency-mode sampling
period adjustments.  They should not necessarily happen at the same
rate. Multiplexing is subject to jiffies_interval (currently at 1
but could be higher once the tunable is exposed via sysfs).

We have grouped frequency-mode adjustment and unthrottling into the
same routine to minimize code duplication. When throttled while in
frequency mode, we scan the events only once.
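
To make the split concrete, here is a minimal userspace sketch of the new
tick path (toy code, not the kernel implementation; the stand-in function
bodies and the jiffies_interval value of 4 are made up for illustration,
the kernel currently uses 1). The combined unthrottle/frequency pass runs
on every tick, while rotation only runs on the jiffies_interval boundary;
the real code is in the perf_event_task_tick() and
perf_adjust_freq_unthr_context() hunks below.

#include <stdio.h>
#include <stdbool.h>

static int jiffies_interval = 4;	/* hypothetical tunable: ticks per rotation */
static int perf_throttled_count;	/* bumped by the overflow path on throttle */

static void adjust_freq_unthr(bool needs_unthr)
{
	/* stand-in for perf_adjust_freq_unthr_context() */
	printf("  adjust freq%s\n", needs_unthr ? " + unthrottle" : "");
}

static void rotate(void)
{
	/* stand-in for perf_rotate_context() */
	printf("  rotate (multiplex)\n");
}

static void task_tick(unsigned long jiffies)
{
	/* the real code does an xchg of a per-CPU counter */
	int throttled = perf_throttled_count;

	perf_throttled_count = 0;

	printf("tick %lu:\n", jiffies);

	/* unthrottling + frequency adjustment: every tick */
	adjust_freq_unthr(throttled != 0);

	/* multiplexing: only every jiffies_interval ticks */
	if (jiffies_interval == 1 || !(jiffies % jiffies_interval))
		rotate();
}

int main(void)
{
	for (unsigned long j = 1; j <= 8; j++) {
		if (j == 3)
			perf_throttled_count = 1;	/* pretend an event was throttled */
		task_tick(j);
	}
	return 0;
}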

We have fixed the threshold enforcement code in __perf_event_overflow().
There was a bug whereby it would allow more samples than the authorized
rate because hwc->interrupts was not incremented at the right place.
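
The corrected accounting can be sketched with a small userspace toy (not
the kernel code; max_samples_per_tick is shrunk to 4 for the demo): a
per-CPU sequence number is bumped on every tick, and the overflow path
restarts its per-event count when it sees a new sequence, so the per-tick
limit is enforced without having to clear every event's count on each
tick.

#include <stdio.h>

static unsigned long perf_throttled_seq;	/* per-CPU in the kernel */
static int perf_throttled_count;
static const unsigned long max_samples_per_tick = 4;	/* tiny value for the demo */

struct hw_event {
	unsigned long interrupts_seq;
	unsigned long interrupts;
};

/* returns 1 if the event must be throttled until the next tick */
static int event_overflow(struct hw_event *hwc)
{
	if (hwc->interrupts_seq != perf_throttled_seq) {
		/* first interrupt since the last tick: restart the count */
		hwc->interrupts_seq = perf_throttled_seq;
		hwc->interrupts = 1;
	} else {
		hwc->interrupts++;
		if (hwc->interrupts >= max_samples_per_tick) {
			perf_throttled_count++;
			return 1;
		}
	}
	return 0;
}

int main(void)
{
	struct hw_event ev = { 0, 0 };

	for (int tick = 1; tick <= 2; tick++) {
		perf_throttled_seq++;	/* done in perf_event_task_tick() */
		for (int i = 1; i <= 6; i++) {
			if (event_overflow(&ev)) {
				/* the kernel stops the event until it is unthrottled */
				printf("tick %d: throttled after %d interrupts\n", tick, i);
				break;
			}
		}
	}
	return 0;
}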

The patch was tested with a low sampling limit (2000 samples/sec), with
fixed periods as well as frequency mode, and with the PMU overcommitted.

On a 2.1GHz AMD CPU:
$ cat /proc/sys/kernel/perf_event_max_sample_rate
2000

We set a rate of 3000 samples/sec per event by using a fixed period of
2.1GHz/3000 = 700000 cycles:

$ perf record -e cycles,cycles -c 700000  noploop 10
$ perf report -D | tail -21
Aggregated stats:
           TOTAL events:      80086
            MMAP events:         88
            COMM events:          2
            EXIT events:          4
        THROTTLE events:      19996
      UNTHROTTLE events:      19996
          SAMPLE events:      40000
cycles stats:
           TOTAL events:      40006
            MMAP events:          5
            COMM events:          1
            EXIT events:          4
        THROTTLE events:       9998
      UNTHROTTLE events:       9998
          SAMPLE events:      20000
cycles stats:
           TOTAL events:      39996
        THROTTLE events:       9998
      UNTHROTTLE events:       9998
          SAMPLE events:      20000

Over 10s, the cap is 2 events x 2000 samples/sec x 10s = 40000 samples.
We get exactly that: 20000 samples per event.

Signed-off-by: Stephane Eranian <eranian@google.com>
---

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0b91db2..412b790 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -589,6 +589,7 @@ struct hw_perf_event {
 	u64				sample_period;
 	u64				last_period;
 	local64_t			period_left;
+	u64                             interrupts_seq;
 	u64				interrupts;
 
 	u64				freq_time_stamp;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index de859fb..7c3b9de 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2300,6 +2300,9 @@ do {					\
 	return div64_u64(dividend, divisor);
 }
 
+static DEFINE_PER_CPU(int, perf_throttled_count);
+static DEFINE_PER_CPU(u64, perf_throttled_seq);
+
 static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
 {
 	struct hw_perf_event *hwc = &event->hw;
@@ -2325,16 +2328,29 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
 	}
 }
 
-static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
+/*
+ * combine freq adjustment with unthrottling to avoid two passes over the
+ * events. At the same time, make sure, having freq events does not change
+ * the rate of unthrottling as that would introduce bias.
+ */
+static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
+					   int needs_unthr)
 {
 	struct perf_event *event;
 	struct hw_perf_event *hwc;
-	u64 interrupts, now;
+	u64 now, period = TICK_NSEC;
 	s64 delta;
 
-	if (!ctx->nr_freq)
+	/*
+	 * only need to iterate over all events iff:
+	 * - context have events in frequency mode (needs freq adjust)
+	 * - there are events to unthrottle on this cpu
+	 */
+	if (!(ctx->nr_freq || needs_unthr))
 		return;
 
+	raw_spin_lock(&ctx->lock);
+
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
 		if (event->state != PERF_EVENT_STATE_ACTIVE)
 			continue;
@@ -2344,13 +2360,8 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
 
 		hwc = &event->hw;
 
-		interrupts = hwc->interrupts;
-		hwc->interrupts = 0;
-
-		/*
-		 * unthrottle events on the tick
-		 */
-		if (interrupts == MAX_INTERRUPTS) {
+		if (needs_unthr && hwc->interrupts == MAX_INTERRUPTS) {
+			hwc->interrupts = 0;
 			perf_log_throttle(event, 1);
 			event->pmu->start(event, 0);
 		}
@@ -2358,14 +2369,26 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
 		if (!event->attr.freq || !event->attr.sample_freq)
 			continue;
 
-		event->pmu->read(event);
+		/*
+		 * stop the event and update event->count
+		 */
+		event->pmu->stop(event, PERF_EF_UPDATE);
+
 		now = local64_read(&event->count);
 		delta = now - hwc->freq_count_stamp;
 		hwc->freq_count_stamp = now;
 
+		/*
+		 * restart the event
+		 * reload only if value has changed
+		 */
 		if (delta > 0)
 			perf_adjust_period(event, period, delta);
+
+		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
 	}
+
+	raw_spin_unlock(&ctx->lock);
 }
 
 /*
@@ -2388,16 +2411,13 @@ static void rotate_ctx(struct perf_event_context *ctx)
  */
 static void perf_rotate_context(struct perf_cpu_context *cpuctx)
 {
-	u64 interval = (u64)cpuctx->jiffies_interval * TICK_NSEC;
 	struct perf_event_context *ctx = NULL;
-	int rotate = 0, remove = 1, freq = 0;
+	int rotate = 0, remove = 1;
 
 	if (cpuctx->ctx.nr_events) {
 		remove = 0;
 		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
 			rotate = 1;
-		if (cpuctx->ctx.nr_freq)
-			freq = 1;
 	}
 
 	ctx = cpuctx->task_ctx;
@@ -2405,37 +2425,26 @@ static void perf_rotate_context(struct perf_cpu_context *cpuctx)
 		remove = 0;
 		if (ctx->nr_events != ctx->nr_active)
 			rotate = 1;
-		if (ctx->nr_freq)
-			freq = 1;
 	}
 
-	if (!rotate && !freq)
+	if (!rotate)
 		goto done;
 
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(cpuctx->ctx.pmu);
 
-	if (freq) {
-		perf_ctx_adjust_freq(&cpuctx->ctx, interval);
-		if (ctx)
-			perf_ctx_adjust_freq(ctx, interval);
-	}
-
-	if (rotate) {
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-		if (ctx)
-			ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	if (ctx)
+		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
 
-		rotate_ctx(&cpuctx->ctx);
-		if (ctx)
-			rotate_ctx(ctx);
+	rotate_ctx(&cpuctx->ctx);
+	if (ctx)
+		rotate_ctx(ctx);
 
-		perf_event_sched_in(cpuctx, ctx, current);
-	}
+	perf_event_sched_in(cpuctx, ctx, current);
 
 	perf_pmu_enable(cpuctx->ctx.pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
-
 done:
 	if (remove)
 		list_del_init(&cpuctx->rotation_list);
@@ -2445,10 +2454,22 @@ void perf_event_task_tick(void)
 {
 	struct list_head *head = &__get_cpu_var(rotation_list);
 	struct perf_cpu_context *cpuctx, *tmp;
+	struct perf_event_context *ctx;
+	int throttled;
 
 	WARN_ON(!irqs_disabled());
 
+	__this_cpu_inc(perf_throttled_seq);
+	throttled = __this_cpu_xchg(perf_throttled_count, 0);
+
 	list_for_each_entry_safe(cpuctx, tmp, head, rotation_list) {
+		ctx = &cpuctx->ctx;
+		perf_adjust_freq_unthr_context(ctx, throttled);
+
+		ctx = cpuctx->task_ctx;
+		if (ctx)
+			perf_adjust_freq_unthr_context(ctx, throttled);
+
 		if (cpuctx->jiffies_interval == 1 ||
 				!(jiffies % cpuctx->jiffies_interval))
 			perf_rotate_context(cpuctx);
@@ -4514,6 +4535,7 @@ static int __perf_event_overflow(struct perf_event *event,
 {
 	int events = atomic_read(&event->event_limit);
 	struct hw_perf_event *hwc = &event->hw;
+	u64 seq;
 	int ret = 0;
 
 	/*
@@ -4523,14 +4545,20 @@ static int __perf_event_overflow(struct perf_event *event,
 	if (unlikely(!is_sampling_event(event)))
 		return 0;
 
-	if (unlikely(hwc->interrupts >= max_samples_per_tick)) {
-		if (throttle) {
+	seq = __this_cpu_read(perf_throttled_seq);
+	if (seq != hwc->interrupts_seq) {
+		hwc->interrupts_seq = seq;
+		hwc->interrupts = 1;
+	} else {
+		hwc->interrupts++;
+		if (unlikely(throttle
+			     && hwc->interrupts >= max_samples_per_tick)) {
+			__this_cpu_inc(perf_throttled_count);
 			hwc->interrupts = MAX_INTERRUPTS;
 			perf_log_throttle(event, 0);
 			ret = 1;
 		}
-	} else
-		hwc->interrupts++;
+	}
 
 	if (event->attr.freq) {
 		u64 now = perf_clock();

* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-01-26 16:03 [PATCH] perf_events: fix broken intr throttling (v3) Stephane Eranian
@ 2012-01-26 16:16 ` Peter Zijlstra
  2012-01-26 18:22   ` Arun Sharma
  2012-01-28 12:04 ` [tip:perf/urgent] perf: Fix broken interrupt rate throttling tip-bot for Stephane Eranian
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2012-01-26 16:16 UTC (permalink / raw)
  To: Stephane Eranian; +Cc: linux-kernel, mingo, gleb, wcohen, vince, asharma, andi

On Thu, 2012-01-26 at 17:03 +0100, Stephane Eranian wrote:
> This patch fixes the throttling mechanism. It was broken
> in 3.2.0. Events were not being unthrottled. The unthrottling
> mechanism required that events be checked at each timer tick.
> 
> This patch solves this problem and also separates:
>   - unthrottling
>   - multiplexing
>   - frequency-mode period adjustments 

Thanks!

* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-01-26 16:16 ` Peter Zijlstra
@ 2012-01-26 18:22   ` Arun Sharma
  0 siblings, 0 replies; 13+ messages in thread
From: Arun Sharma @ 2012-01-26 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, linux-kernel, mingo, gleb, wcohen, vince, andi

On 1/26/12 8:16 AM, Peter Zijlstra wrote:
> On Thu, 2012-01-26 at 17:03 +0100, Stephane Eranian wrote:
>> This patch fixes the throttling mechanism. It was broken
>> in 3.2.0. Events were not being unthrottled. The unthrottling
>> mechanism required that events be checked at each timer tick.
>>
>> This patch solves this problem and also separates:
>>    - unthrottling
>>    - multiplexing
>>    - frequency-mode period adjustments
>
> Thanks!

Should this be Cc'ed to stable@?

  -Arun

* [tip:perf/urgent] perf: Fix broken interrupt rate throttling
  2012-01-26 16:03 [PATCH] perf_events: fix broken intr throttling (v3) Stephane Eranian
  2012-01-26 16:16 ` Peter Zijlstra
@ 2012-01-28 12:04 ` tip-bot for Stephane Eranian
  2012-02-08 11:32 ` [PATCH] perf_events: fix broken intr throttling (v3) Anton Blanchard
  2012-02-09  5:23 ` Anshuman Khandual
  3 siblings, 0 replies; 13+ messages in thread
From: tip-bot for Stephane Eranian @ 2012-01-28 12:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, eranian, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  e050e3f0a71bf7dc2c148b35caff0234decc8198
Gitweb:     http://git.kernel.org/tip/e050e3f0a71bf7dc2c148b35caff0234decc8198
Author:     Stephane Eranian <eranian@google.com>
AuthorDate: Thu, 26 Jan 2012 17:03:19 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 27 Jan 2012 12:06:39 +0100

perf: Fix broken interrupt rate throttling

This patch fixes the sampling interrupt throttling mechanism.

It was broken in v3.2: events were not being unthrottled, because the
unthrottling mechanism required that events be checked at each timer
tick, and that check was no longer guaranteed to happen.

This patch solves this problem and also separates:

  - unthrottling
  - multiplexing
  - frequency-mode period adjustments

Not all of them need to be executed at each timer tick.

This third version of the patch is based on my original patch +
PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).

At each timer tick, for each context:

  - if the current CPU has throttled events, we unthrottle events

  - if context has frequency-based events, we adjust sampling periods

  - if we have reached the jiffies interval, we multiplex (rotate)

We decoupled rotation (multiplexing) from frequency-mode sampling
period adjustments.  They should not necessarily happen at the same
rate. Multiplexing is subject to jiffies_interval (currently at 1
but could be higher once the tunable is exposed via sysfs).

We have grouped frequency-mode adjustment and unthrottling into the
same routine to minimize code duplication. When throttled while in
frequency mode, we scan the events only once.

We have fixed the threshold enforcement code in __perf_event_overflow().
There was a bug whereby it would allow more samples than the authorized
rate because hwc->interrupts was not incremented at the right place.

The patch was tested with a low sampling limit (2000 samples/sec), with
fixed periods as well as frequency mode, and with the PMU overcommitted.

On a 2.1GHz AMD CPU:

 $ cat /proc/sys/kernel/perf_event_max_sample_rate
 2000

We set a rate of 3000 samples/sec per event by using a fixed period of
2.1GHz/3000 = 700000 cycles:

 $ perf record -e cycles,cycles -c 700000  noploop 10
 $ perf report -D | tail -21

 Aggregated stats:
           TOTAL events:      80086
            MMAP events:         88
            COMM events:          2
            EXIT events:          4
        THROTTLE events:      19996
      UNTHROTTLE events:      19996
          SAMPLE events:      40000

 cycles stats:
           TOTAL events:      40006
            MMAP events:          5
            COMM events:          1
            EXIT events:          4
        THROTTLE events:       9998
      UNTHROTTLE events:       9998
          SAMPLE events:      20000

 cycles stats:
           TOTAL events:      39996
        THROTTLE events:       9998
      UNTHROTTLE events:       9998
          SAMPLE events:      20000

Over 10s, the cap is 2 events x 2000 samples/sec x 10s = 40000 samples.
We get exactly that: 20000 samples per event.

Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: <stable@kernel.org> # v3.2+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120126160319.GA5655@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/perf_event.h |    1 +
 kernel/events/core.c       |  104 ++++++++++++++++++++++++++++----------------
 2 files changed, 67 insertions(+), 38 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0885561..abb2776 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -587,6 +587,7 @@ struct hw_perf_event {
 	u64				sample_period;
 	u64				last_period;
 	local64_t			period_left;
+	u64                             interrupts_seq;
 	u64				interrupts;
 
 	u64				freq_time_stamp;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 32b48c8..ba36013 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2300,6 +2300,9 @@ do {					\
 	return div64_u64(dividend, divisor);
 }
 
+static DEFINE_PER_CPU(int, perf_throttled_count);
+static DEFINE_PER_CPU(u64, perf_throttled_seq);
+
 static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
 {
 	struct hw_perf_event *hwc = &event->hw;
@@ -2325,16 +2328,29 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
 	}
 }
 
-static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
+/*
+ * combine freq adjustment with unthrottling to avoid two passes over the
+ * events. At the same time, make sure, having freq events does not change
+ * the rate of unthrottling as that would introduce bias.
+ */
+static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
+					   int needs_unthr)
 {
 	struct perf_event *event;
 	struct hw_perf_event *hwc;
-	u64 interrupts, now;
+	u64 now, period = TICK_NSEC;
 	s64 delta;
 
-	if (!ctx->nr_freq)
+	/*
+	 * only need to iterate over all events iff:
+	 * - context have events in frequency mode (needs freq adjust)
+	 * - there are events to unthrottle on this cpu
+	 */
+	if (!(ctx->nr_freq || needs_unthr))
 		return;
 
+	raw_spin_lock(&ctx->lock);
+
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
 		if (event->state != PERF_EVENT_STATE_ACTIVE)
 			continue;
@@ -2344,13 +2360,8 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
 
 		hwc = &event->hw;
 
-		interrupts = hwc->interrupts;
-		hwc->interrupts = 0;
-
-		/*
-		 * unthrottle events on the tick
-		 */
-		if (interrupts == MAX_INTERRUPTS) {
+		if (needs_unthr && hwc->interrupts == MAX_INTERRUPTS) {
+			hwc->interrupts = 0;
 			perf_log_throttle(event, 1);
 			event->pmu->start(event, 0);
 		}
@@ -2358,14 +2369,26 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
 		if (!event->attr.freq || !event->attr.sample_freq)
 			continue;
 
-		event->pmu->read(event);
+		/*
+		 * stop the event and update event->count
+		 */
+		event->pmu->stop(event, PERF_EF_UPDATE);
+
 		now = local64_read(&event->count);
 		delta = now - hwc->freq_count_stamp;
 		hwc->freq_count_stamp = now;
 
+		/*
+		 * restart the event
+		 * reload only if value has changed
+		 */
 		if (delta > 0)
 			perf_adjust_period(event, period, delta);
+
+		event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
 	}
+
+	raw_spin_unlock(&ctx->lock);
 }
 
 /*
@@ -2388,16 +2411,13 @@ static void rotate_ctx(struct perf_event_context *ctx)
  */
 static void perf_rotate_context(struct perf_cpu_context *cpuctx)
 {
-	u64 interval = (u64)cpuctx->jiffies_interval * TICK_NSEC;
 	struct perf_event_context *ctx = NULL;
-	int rotate = 0, remove = 1, freq = 0;
+	int rotate = 0, remove = 1;
 
 	if (cpuctx->ctx.nr_events) {
 		remove = 0;
 		if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
 			rotate = 1;
-		if (cpuctx->ctx.nr_freq)
-			freq = 1;
 	}
 
 	ctx = cpuctx->task_ctx;
@@ -2405,37 +2425,26 @@ static void perf_rotate_context(struct perf_cpu_context *cpuctx)
 		remove = 0;
 		if (ctx->nr_events != ctx->nr_active)
 			rotate = 1;
-		if (ctx->nr_freq)
-			freq = 1;
 	}
 
-	if (!rotate && !freq)
+	if (!rotate)
 		goto done;
 
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(cpuctx->ctx.pmu);
 
-	if (freq) {
-		perf_ctx_adjust_freq(&cpuctx->ctx, interval);
-		if (ctx)
-			perf_ctx_adjust_freq(ctx, interval);
-	}
-
-	if (rotate) {
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-		if (ctx)
-			ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
+	cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	if (ctx)
+		ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
 
-		rotate_ctx(&cpuctx->ctx);
-		if (ctx)
-			rotate_ctx(ctx);
+	rotate_ctx(&cpuctx->ctx);
+	if (ctx)
+		rotate_ctx(ctx);
 
-		perf_event_sched_in(cpuctx, ctx, current);
-	}
+	perf_event_sched_in(cpuctx, ctx, current);
 
 	perf_pmu_enable(cpuctx->ctx.pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
-
 done:
 	if (remove)
 		list_del_init(&cpuctx->rotation_list);
@@ -2445,10 +2454,22 @@ void perf_event_task_tick(void)
 {
 	struct list_head *head = &__get_cpu_var(rotation_list);
 	struct perf_cpu_context *cpuctx, *tmp;
+	struct perf_event_context *ctx;
+	int throttled;
 
 	WARN_ON(!irqs_disabled());
 
+	__this_cpu_inc(perf_throttled_seq);
+	throttled = __this_cpu_xchg(perf_throttled_count, 0);
+
 	list_for_each_entry_safe(cpuctx, tmp, head, rotation_list) {
+		ctx = &cpuctx->ctx;
+		perf_adjust_freq_unthr_context(ctx, throttled);
+
+		ctx = cpuctx->task_ctx;
+		if (ctx)
+			perf_adjust_freq_unthr_context(ctx, throttled);
+
 		if (cpuctx->jiffies_interval == 1 ||
 				!(jiffies % cpuctx->jiffies_interval))
 			perf_rotate_context(cpuctx);
@@ -4509,6 +4530,7 @@ static int __perf_event_overflow(struct perf_event *event,
 {
 	int events = atomic_read(&event->event_limit);
 	struct hw_perf_event *hwc = &event->hw;
+	u64 seq;
 	int ret = 0;
 
 	/*
@@ -4518,14 +4540,20 @@ static int __perf_event_overflow(struct perf_event *event,
 	if (unlikely(!is_sampling_event(event)))
 		return 0;
 
-	if (unlikely(hwc->interrupts >= max_samples_per_tick)) {
-		if (throttle) {
+	seq = __this_cpu_read(perf_throttled_seq);
+	if (seq != hwc->interrupts_seq) {
+		hwc->interrupts_seq = seq;
+		hwc->interrupts = 1;
+	} else {
+		hwc->interrupts++;
+		if (unlikely(throttle
+			     && hwc->interrupts >= max_samples_per_tick)) {
+			__this_cpu_inc(perf_throttled_count);
 			hwc->interrupts = MAX_INTERRUPTS;
 			perf_log_throttle(event, 0);
 			ret = 1;
 		}
-	} else
-		hwc->interrupts++;
+	}
 
 	if (event->attr.freq) {
 		u64 now = perf_clock();

* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-01-26 16:03 [PATCH] perf_events: fix broken intr throttling (v3) Stephane Eranian
  2012-01-26 16:16 ` Peter Zijlstra
  2012-01-28 12:04 ` [tip:perf/urgent] perf: Fix broken interrupt rate throttling tip-bot for Stephane Eranian
@ 2012-02-08 11:32 ` Anton Blanchard
  2012-02-08 11:43   ` Stephane Eranian
  2012-02-09  5:23 ` Anshuman Khandual
  3 siblings, 1 reply; 13+ messages in thread
From: Anton Blanchard @ 2012-02-08 11:32 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, gleb, wcohen, vince, asharma, andi,
	paulus, emunson, imunsie, benh, sukadev


Hi,

On Thu, 26 Jan 2012 17:03:19 +0100
Stephane Eranian <eranian@google.com> wrote:

> This patch fixes the throttling mechanism. It was broken
> in 3.2.0. Events were not being unthrottled. The unthrottling
> mechanism required that events be checked at each timer tick.

This patch breaks perf on POWER. I haven't had a chance to work out why
yet, but will investigate tomorrow.

Anton


* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-02-08 11:32 ` [PATCH] perf_events: fix broken intr throttling (v3) Anton Blanchard
@ 2012-02-08 11:43   ` Stephane Eranian
  2012-02-10  3:19     ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 13+ messages in thread
From: Stephane Eranian @ 2012-02-08 11:43 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: linux-kernel, peterz, mingo, gleb, wcohen, vince, asharma, andi,
	paulus, emunson, imunsie, benh, sukadev

On Wed, Feb 8, 2012 at 12:32 PM, Anton Blanchard <anton@samba.org> wrote:
>
> Hi,
>
> On Thu, 26 Jan 2012 17:03:19 +0100
> Stephane Eranian <eranian@google.com> wrote:
>
>> This patch fixes the throttling mechanism. It was broken
>> in 3.2.0. Events were not being unthrottled. The unthrottling
>> mechanism required that events be checked at each timer tick.
>
> This patch breaks perf on POWER. I haven't had a chance to work out why
> yet, but will investigate tomorrow.
>
Did you apply the subsequent patch I posted yesterday:

https://lkml.org/lkml/2012/2/7/185


* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-01-26 16:03 [PATCH] perf_events: fix broken intr throttling (v3) Stephane Eranian
                   ` (2 preceding siblings ...)
  2012-02-08 11:32 ` [PATCH] perf_events: fix broken intr throttling (v3) Anton Blanchard
@ 2012-02-09  5:23 ` Anshuman Khandual
  2012-02-09 22:45   ` Stephane Eranian
  3 siblings, 1 reply; 13+ messages in thread
From: Anshuman Khandual @ 2012-02-09  5:23 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: linux-kernel, peterz, mingo, gleb, wcohen, vince, asharma, andi

On Thursday 26 January 2012 09:33 PM, Stephane Eranian wrote:
> 
> This patch fixes the throttling mechanism. It was broken
> in 3.2.0. Events were not being unthrottled. The unthrottling
> mechanism required that events be checked at each timer tick.
> 
> This patch solves this problem and also separates:
>   - unthrottling
>   - multiplexing
>   - frequency-mode period adjustments
> 
> Not all of them need to be executed at each timer tick.
> 
> This third version of the patch is based on my original patch +
> PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
> 
> At each timer tick, for each context:
>   - if the current CPU has throttled events, we unthrottle events
> 
>   - if context has frequency-based events, we adjust sampling periods
> 
>   - if we have reached the jiffies interval, we multiplex (rotate)
> 
> We decoupled rotation (multiplexing) from frequency-mode sampling
> period adjustments.  They should not necessarily happen at the same
> rate. Multiplexing is subject to jiffies_interval (currently at 1
> but could be higher once the tunable is exposed via sysfs).

I am wondering how high a value jiffies_interval can be assigned. Once
it's exposed to the user through sysfs it can have any value. Are we
planning to impose some sort of MAX limit on this?


* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-02-09  5:23 ` Anshuman Khandual
@ 2012-02-09 22:45   ` Stephane Eranian
  0 siblings, 0 replies; 13+ messages in thread
From: Stephane Eranian @ 2012-02-09 22:45 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: linux-kernel, peterz, mingo, gleb, wcohen, vince, asharma, andi

On Thu, Feb 9, 2012 at 6:23 AM, Anshuman Khandual
<khandual@linux.vnet.ibm.com> wrote:
> On Thursday 26 January 2012 09:33 PM, Stephane Eranian wrote:
>>
>> This patch fixes the throttling mechanism. It was broken
>> in 3.2.0. Events were not being unthrottled. The unthrottling
>> mechanism required that events be checked at each timer tick.
>>
>> This patch solves this problem and also separates:
>>   - unthrottling
>>   - multiplexing
>>   - frequency-mode period adjustments
>>
>> Not all of them need to be executed at each timer tick.
>>
>> This third version of the patch is based on my original patch +
>> PeterZ proposal (https://lkml.org/lkml/2012/1/7/87).
>>
>> At each timer tick, for each context:
>>   - if the current CPU has throttled events, we unthrottle events
>>
>>   - if context has frequency-based events, we adjust sampling periods
>>
>>   - if we have reached the jiffies interval, we multiplex (rotate)
>>
>> We decoupled rotation (multiplexing) from frequency-mode sampling
>> period adjustments.  They should not necessarily happen at the same
>> rate. Multiplexing is subject to jiffies_interval (currently at 1
>> but could be higher once the tunable is exposed via sysfs).
>
> I am wondering how much higher value would jiffies_interval can be assigned
> to ? Once its exposed to the user through sysfs it can have any value. Are
> we planning to impose some sort of MAX limit on this ?
>
It's a multiplier of the timer tick. For now it's 1, so on every tick we
go through the entire rescheduling if the PMU is overcommitted. In most
situations, such fine granularity may not be necessary. I am also
wondering about the overall overhead of the operation; I have not had
time to measure it. If the tick is 1ms, it does not sound unreasonable to
set the multiplier to 10, i.e., multiplex events every 10ms.
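
As a rough illustration (userspace toy; the 1ms tick and the MAX clamp
are hypothetical, the patch itself does not add a sysfs limit):

#include <stdio.h>

#define TICK_MS			1	/* assuming HZ=1000, i.e. a 1ms tick */
#define MUX_INTERVAL_MAX	1000	/* hypothetical upper bound: ~1s between rotations */

static int set_jiffies_interval(unsigned int *interval, unsigned int val)
{
	if (val < 1 || val > MUX_INTERVAL_MAX)
		return -1;	/* a sysfs store handler would return -EINVAL */
	*interval = val;
	return 0;
}

int main(void)
{
	unsigned int jiffies_interval = 1;

	if (!set_jiffies_interval(&jiffies_interval, 10))
		printf("multiplexing every %u ms\n", jiffies_interval * TICK_MS);

	if (set_jiffies_interval(&jiffies_interval, 5000))
		printf("5000 rejected: above the hypothetical MAX\n");

	return 0;
}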


>>
>> We have grouped frequency-mode adjustment and unthrottling into the
>> same routine to minimize code duplication. When throttled while in
>> frequency mode, we scan the events only once.
>>
>> We have fixed the threshold enforcement code in __perf_event_overflow().
>> There was a bug whereby it would allow more than the authorized rate
>> because an increment of hwc->interrupts was not executed at the right
>> place.
>>
>> The patch was tested with low sampling limit (2000) and fixed periods,
>> frequency mode, overcommitted PMU.
>>
>> On a 2.1GHz AMD CPU:
>> $ cat /proc/sys/kernel/perf_event_max_sample_rate
>> 2000
>>
>> We set a rate of 3000 samples/sec (2.1GHz/3000 = 700000):
>>
>> $ perf record -e cycles,cycles -c 700000  noploop 10
>> $ perf report -D | tail -21
>> Aggregated stats:
>>            TOTAL events:      80086
>>             MMAP events:         88
>>             COMM events:          2
>>             EXIT events:          4
>>         THROTTLE events:      19996
>>       UNTHROTTLE events:      19996
>>           SAMPLE events:      40000
>> cycles stats:
>>            TOTAL events:      40006
>>             MMAP events:          5
>>             COMM events:          1
>>             EXIT events:          4
>>         THROTTLE events:       9998
>>       UNTHROTTLE events:       9998
>>           SAMPLE events:      20000
>> cycles stats:
>>            TOTAL events:      39996
>>         THROTTLE events:       9998
>>       UNTHROTTLE events:       9998
>>           SAMPLE events:      20000
>>
>> For 10s, the cap is 2x2000x10 = 40000 samples.
>> We get exactly that: 20000 samples/event.
>>
>> Signed-off-by: Stephane Eranian <eranian@google.com>
>> ---
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 0b91db2..412b790 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -589,6 +589,7 @@ struct hw_perf_event {
>>       u64                             sample_period;
>>       u64                             last_period;
>>       local64_t                       period_left;
>> +     u64                             interrupts_seq;
>>       u64                             interrupts;
>>
>>       u64                             freq_time_stamp;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index de859fb..7c3b9de 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -2300,6 +2300,9 @@ do {                                    \
>>       return div64_u64(dividend, divisor);
>>  }
>>
>> +static DEFINE_PER_CPU(int, perf_throttled_count);
>> +static DEFINE_PER_CPU(u64, perf_throttled_seq);
>> +
>>  static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
>>  {
>>       struct hw_perf_event *hwc = &event->hw;
>> @@ -2325,16 +2328,29 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count)
>>       }
>>  }
>>
>> -static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
>> +/*
>> + * combine freq adjustment with unthrottling to avoid two passes over the
>> + * events. At the same time, make sure, having freq events does not change
>> + * the rate of unthrottling as that would introduce bias.
>> + */
>> +static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
>> +                                        int needs_unthr)
>>  {
>>       struct perf_event *event;
>>       struct hw_perf_event *hwc;
>> -     u64 interrupts, now;
>> +     u64 now, period = TICK_NSEC;
>>       s64 delta;
>>
>> -     if (!ctx->nr_freq)
>> +     /*
>> +      * only need to iterate over all events iff:
>> +      * - context have events in frequency mode (needs freq adjust)
>> +      * - there are events to unthrottle on this cpu
>> +      */
>> +     if (!(ctx->nr_freq || needs_unthr))
>>               return;
>>
>> +     raw_spin_lock(&ctx->lock);
>> +
>>       list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
>>               if (event->state != PERF_EVENT_STATE_ACTIVE)
>>                       continue;
>> @@ -2344,13 +2360,8 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
>>
>>               hwc = &event->hw;
>>
>> -             interrupts = hwc->interrupts;
>> -             hwc->interrupts = 0;
>> -
>> -             /*
>> -              * unthrottle events on the tick
>> -              */
>> -             if (interrupts == MAX_INTERRUPTS) {
>> +             if (needs_unthr && hwc->interrupts == MAX_INTERRUPTS) {
>> +                     hwc->interrupts = 0;
>>                       perf_log_throttle(event, 1);
>>                       event->pmu->start(event, 0);
>>               }
>> @@ -2358,14 +2369,26 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx, u64 period)
>>               if (!event->attr.freq || !event->attr.sample_freq)
>>                       continue;
>>
>> -             event->pmu->read(event);
>> +             /*
>> +              * stop the event and update event->count
>> +              */
>> +             event->pmu->stop(event, PERF_EF_UPDATE);
>> +
>>               now = local64_read(&event->count);
>>               delta = now - hwc->freq_count_stamp;
>>               hwc->freq_count_stamp = now;
>>
>> +             /*
>> +              * restart the event
>> +              * reload only if value has changed
>> +              */
>>               if (delta > 0)
>>                       perf_adjust_period(event, period, delta);
>> +
>> +             event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
>>       }
>> +
>> +     raw_spin_unlock(&ctx->lock);
>>  }
>>
>>  /*
>> @@ -2388,16 +2411,13 @@ static void rotate_ctx(struct perf_event_context *ctx)
>>   */
>>  static void perf_rotate_context(struct perf_cpu_context *cpuctx)
>>  {
>> -     u64 interval = (u64)cpuctx->jiffies_interval * TICK_NSEC;
>>       struct perf_event_context *ctx = NULL;
>> -     int rotate = 0, remove = 1, freq = 0;
>> +     int rotate = 0, remove = 1;
>>
>>       if (cpuctx->ctx.nr_events) {
>>               remove = 0;
>>               if (cpuctx->ctx.nr_events != cpuctx->ctx.nr_active)
>>                       rotate = 1;
>> -             if (cpuctx->ctx.nr_freq)
>> -                     freq = 1;
>>       }
>>
>>       ctx = cpuctx->task_ctx;
>> @@ -2405,37 +2425,26 @@ static void perf_rotate_context(struct perf_cpu_context *cpuctx)
>>               remove = 0;
>>               if (ctx->nr_events != ctx->nr_active)
>>                       rotate = 1;
>> -             if (ctx->nr_freq)
>> -                     freq = 1;
>>       }
>>
>> -     if (!rotate && !freq)
>> +     if (!rotate)
>>               goto done;
>>
>>       perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>       perf_pmu_disable(cpuctx->ctx.pmu);
>>
>> -     if (freq) {
>> -             perf_ctx_adjust_freq(&cpuctx->ctx, interval);
>> -             if (ctx)
>> -                     perf_ctx_adjust_freq(ctx, interval);
>> -     }
>> -
>> -     if (rotate) {
>> -             cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>> -             if (ctx)
>> -                     ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
>> +     cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
>> +     if (ctx)
>> +             ctx_sched_out(ctx, cpuctx, EVENT_FLEXIBLE);
>>
>> -             rotate_ctx(&cpuctx->ctx);
>> -             if (ctx)
>> -                     rotate_ctx(ctx);
>> +     rotate_ctx(&cpuctx->ctx);
>> +     if (ctx)
>> +             rotate_ctx(ctx);
>>
>> -             perf_event_sched_in(cpuctx, ctx, current);
>> -     }
>> +     perf_event_sched_in(cpuctx, ctx, current);
>>
>>       perf_pmu_enable(cpuctx->ctx.pmu);
>>       perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> -
>>  done:
>>       if (remove)
>>               list_del_init(&cpuctx->rotation_list);
>> @@ -2445,10 +2454,22 @@ void perf_event_task_tick(void)
>>  {
>>       struct list_head *head = &__get_cpu_var(rotation_list);
>>       struct perf_cpu_context *cpuctx, *tmp;
>> +     struct perf_event_context *ctx;
>> +     int throttled;
>>
>>       WARN_ON(!irqs_disabled());
>>
>> +     __this_cpu_inc(perf_throttled_seq);
>> +     throttled = __this_cpu_xchg(perf_throttled_count, 0);
>> +
>>       list_for_each_entry_safe(cpuctx, tmp, head, rotation_list) {
>> +             ctx = &cpuctx->ctx;
>> +             perf_adjust_freq_unthr_context(ctx, throttled);
>> +
>> +             ctx = cpuctx->task_ctx;
>> +             if (ctx)
>> +                     perf_adjust_freq_unthr_context(ctx, throttled);
>> +
>>               if (cpuctx->jiffies_interval == 1 ||
>>                               !(jiffies % cpuctx->jiffies_interval))
>>                       perf_rotate_context(cpuctx);
>> @@ -4514,6 +4535,7 @@ static int __perf_event_overflow(struct perf_event *event,
>>  {
>>       int events = atomic_read(&event->event_limit);
>>       struct hw_perf_event *hwc = &event->hw;
>> +     u64 seq;
>>       int ret = 0;
>>
>>       /*
>> @@ -4523,14 +4545,20 @@ static int __perf_event_overflow(struct perf_event *event,
>>       if (unlikely(!is_sampling_event(event)))
>>               return 0;
>>
>> -     if (unlikely(hwc->interrupts >= max_samples_per_tick)) {
>> -             if (throttle) {
>> +     seq = __this_cpu_read(perf_throttled_seq);
>> +     if (seq != hwc->interrupts_seq) {
>> +             hwc->interrupts_seq = seq;
>> +             hwc->interrupts = 1;
>> +     } else {
>> +             hwc->interrupts++;
>> +             if (unlikely(throttle
>> +                          && hwc->interrupts >= max_samples_per_tick)) {
>> +                     __this_cpu_inc(perf_throttled_count);
>>                       hwc->interrupts = MAX_INTERRUPTS;
>>                       perf_log_throttle(event, 0);
>>                       ret = 1;
>>               }
>> -     } else
>> -             hwc->interrupts++;
>> +     }
>>
>>       if (event->attr.freq) {
>>               u64 now = perf_clock();
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>>
>


* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-02-08 11:43   ` Stephane Eranian
@ 2012-02-10  3:19     ` Sukadev Bhattiprolu
  2012-02-10  6:15       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 13+ messages in thread
From: Sukadev Bhattiprolu @ 2012-02-10  3:19 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Anton Blanchard, linux-kernel, peterz, mingo, gleb, wcohen,
	vince, asharma, andi, paulus, emunson, imunsie, benh

Stephane Eranian [eranian@google.com] wrote:
| On Wed, Feb 8, 2012 at 12:32 PM, Anton Blanchard <anton@samba.org> wrote:
| >
| > Hi,
| >
| > On Thu, 26 Jan 2012 17:03:19 +0100
| > Stephane Eranian <eranian@google.com> wrote:
| >
| >> This patch fixes the throttling mechanism. It was broken
| >> in 3.2.0. Events were not being unthrottled. The unthrottling
| >> mechanism required that events be checked at each timer tick.
| >
| > This patch breaks perf on POWER. I haven't had a chance to work out why
| > yet, but will investigate tomorrow.
| >
| Did you apply the subsequent patch I posted yesterday:
| 
| https://lkml.org/lkml/2012/2/7/185

I applied both patches to the powerpc.git tree - the following hunk already
exists in the powerpc tree (and even in the Jan 27 mainline), so I skipped
it.

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -988,6 +988,9 @@ static void x86_pmu_start(struct perf_event *event, int flags)
        struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
        int idx = event->hw.idx;

+       if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+               return;
+
        if (WARN_ON_ONCE(idx == -1))
                return;

I tried this on a Power7 system with a simple 'nooploop' program that loops
in while(1) for 10 seconds.
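
The exact nooploop source is not shown in this thread, so the following is
only an assumed reconstruction (busy-loop for the number of seconds given
as the first argument, default 10):

/* assumed nooploop: spin so the cycles event has work to sample */
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	int secs = (argc > 1) ? atoi(argv[1]) : 10;
	time_t end = time(NULL) + secs;

	while (time(NULL) < end)
		;	/* burn cycles */

	return 0;
}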

Results before the patches were applied:

	# /tmp/perf record -e cycles,cycles /tmp/nooploop 10
	[ perf record: Woken up 4 times to write data ]
	[ perf record: Captured and wrote 0.886 MB perf.data (~38714 samples) ]

	# /tmp/perf report -D | tail -15

	Aggregated stats:
		   TOTAL events:      19307
		    MMAP events:         42
		    COMM events:          2
		    EXIT events:          2
		  SAMPLE events:      19261
	cycles stats:
		   TOTAL events:       9993
		    MMAP events:          4
		    COMM events:          1
		    EXIT events:          2
		  SAMPLE events:       9986
	cycles stats:
		   TOTAL events:       9275
		  SAMPLE events:       9275

After applying both patches:

	# /tmp/perf record -e cycles,cycles /tmp/nooploop 10
	[ perf record: Woken up 1 times to write data ]
	[ perf record: Captured and wrote 0.006 MB perf.data (~260 samples) ]

	# /tmp/perf report -D | tail -15

	Aggregated stats:
		   TOTAL events:         80
		    MMAP events:         42
		    COMM events:          2
		    EXIT events:          2
		  SAMPLE events:         34
	cycles stats:
		   TOTAL events:         24
		    MMAP events:          4
		    COMM events:          1
		    EXIT events:          2
		  SAMPLE events:         17

	cycles stats:
		   TOTAL events:         17
		  SAMPLE events:         17

Sukadev



* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-02-10  3:19     ` Sukadev Bhattiprolu
@ 2012-02-10  6:15       ` Benjamin Herrenschmidt
  2012-02-15  2:37         ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 13+ messages in thread
From: Benjamin Herrenschmidt @ 2012-02-10  6:15 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Stephane Eranian, Anton Blanchard, linux-kernel, peterz, mingo,
	gleb, wcohen, vince, asharma, andi, paulus, emunson, imunsie

On Thu, 2012-02-09 at 19:19 -0800, Sukadev Bhattiprolu wrote:
> Stephane Eranian [eranian@google.com] wrote:
> | On Wed, Feb 8, 2012 at 12:32 PM, Anton Blanchard <anton@samba.org> wrote:
> | >
> | > Hi,
> | >
> | > On Thu, 26 Jan 2012 17:03:19 +0100
> | > Stephane Eranian <eranian@google.com> wrote:
> | >
> | >> This patch fixes the throttling mechanism. It was broken
> | >> in 3.2.0. Events were not being unthrottled. The unthrottling
> | >> mechanism required that events be checked at each timer tick.
> | >
> | > This patch breaks perf on POWER. I haven't had a chance to work out why
> | > yet, but will investigate tomorrow.
> | >
> | Did you apply the subsequent patch I posted yesterday:
> | 
> | https://lkml.org/lkml/2012/2/7/185
> 
> I applied both patches to the powerpc.git tree - following hunk already
> exists in the powerpc tree (and even Jan 27 mainline) so I skipped this
> hunk.

The powerpc.git tree is not useful at this point; I haven't opened
powerpc -next yet, so it's just old, stale stuff. Please always work with
upstream unless you have a very good reason to use my tree. It's really
only meant as a conduit between me and Linus or for -next.

Cheers,
Ben.




* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-02-10  6:15       ` Benjamin Herrenschmidt
@ 2012-02-15  2:37         ` Sukadev Bhattiprolu
  2012-02-19 11:26           ` Stephane Eranian
  0 siblings, 1 reply; 13+ messages in thread
From: Sukadev Bhattiprolu @ 2012-02-15  2:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Stephane Eranian, Anton Blanchard, linux-kernel, peterz, mingo,
	gleb, wcohen, vince, asharma, andi, paulus, emunson, imunsie,
	Matt Helsley

Benjamin Herrenschmidt [benh@kernel.crashing.org] wrote:
| On Thu, 2012-02-09 at 19:19 -0800, Sukadev Bhattiprolu wrote:
| > Stephane Eranian [eranian@google.com] wrote:
| > | On Wed, Feb 8, 2012 at 12:32 PM, Anton Blanchard <anton@samba.org> wrote:
| > | >
| > | > Hi,
| > | >
| > | > On Thu, 26 Jan 2012 17:03:19 +0100
| The powerpc .git tree is not useful at this point, I haven't opened
| powerpc -next yet so it's just all old stale stuff. Please work with
| upstream always unless you have a very good reason to use my tree. It's
| really only meant as a conduit between me and Linus or for -next.

Ok.

I tried this on current mainline (commit 13d261932), which now includes both
of Stephane's patches, and repeated the nooploop test.

The problem still exists: the number of events reported on Power is far
fewer than without the patches.

Aggregated stats:
           TOTAL events:         78
            MMAP events:         47
            COMM events:          2
            EXIT events:          2
          SAMPLE events:         27
cycles stats:
           TOTAL events:         21
            MMAP events:          4
            COMM events:          1
            EXIT events:          2
          SAMPLE events:         14
cycles stats:
           TOTAL events:         13
          SAMPLE events:         13

Sukadev



* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-02-15  2:37         ` Sukadev Bhattiprolu
@ 2012-02-19 11:26           ` Stephane Eranian
  2012-02-20 19:01             ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 13+ messages in thread
From: Stephane Eranian @ 2012-02-19 11:26 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Benjamin Herrenschmidt, Anton Blanchard, linux-kernel, peterz,
	mingo, gleb, wcohen, vince, asharma, andi, paulus, emunson,
	imunsie, Matt Helsley

Did Anton's recent fix solve that problem?


On Wed, Feb 15, 2012 at 3:37 AM, Sukadev Bhattiprolu
<sukadev@linux.vnet.ibm.com> wrote:
> Benjamin Herrenschmidt [benh@kernel.crashing.org] wrote:
> | On Thu, 2012-02-09 at 19:19 -0800, Sukadev Bhattiprolu wrote:
> | > Stephane Eranian [eranian@google.com] wrote:
> | > | On Wed, Feb 8, 2012 at 12:32 PM, Anton Blanchard <anton@samba.org> wrote:
> | > | >
> | > | > Hi,
> | > | >
> | > | > On Thu, 26 Jan 2012 17:03:19 +0100
> | The powerpc .git tree is not useful at this point, I haven't opened
> | powerpc -next yet so it's just all old stale stuff. Please work with
> | upstream always unless you have a very good reason to use my tree. It's
> | really only meant as a conduit between me and Linus or for -next.
>
> Ok.
>
> I tried this on current mainline (commit 13d261932) which now includes both
> of Stephane's patches and repeated the nooploop test.
>
> The problem still exists - the number of events reported on Power is far
> fewer than without the patches.
>
> Aggregated stats:
>           TOTAL events:         78
>            MMAP events:         47
>            COMM events:          2
>            EXIT events:          2
>          SAMPLE events:         27
> cycles stats:
>           TOTAL events:         21
>            MMAP events:          4
>            COMM events:          1
>            EXIT events:          2
>          SAMPLE events:         14
> cycles stats:
>           TOTAL events:         13
>          SAMPLE events:         13
>
> Sukadev
>


* Re: [PATCH] perf_events: fix broken intr throttling (v3)
  2012-02-19 11:26           ` Stephane Eranian
@ 2012-02-20 19:01             ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 13+ messages in thread
From: Sukadev Bhattiprolu @ 2012-02-20 19:01 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Benjamin Herrenschmidt, Anton Blanchard, linux-kernel, peterz,
	mingo, gleb, wcohen, vince, asharma, andi, paulus, emunson,
	imunsie, Matt Helsley

Stephane Eranian [eranian@google.com] wrote:
| Did Anton recent fix solve that problem?

Yes - I could not reproduce the problem on -rc4.



end of thread, other threads:[~2012-02-20 19:02 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-26 16:03 [PATCH] perf_events: fix broken intr throttling (v3) Stephane Eranian
2012-01-26 16:16 ` Peter Zijlstra
2012-01-26 18:22   ` Arun Sharma
2012-01-28 12:04 ` [tip:perf/urgent] perf: Fix broken interrupt rate throttling tip-bot for Stephane Eranian
2012-02-08 11:32 ` [PATCH] perf_events: fix broken intr throttling (v3) Anton Blanchard
2012-02-08 11:43   ` Stephane Eranian
2012-02-10  3:19     ` Sukadev Bhattiprolu
2012-02-10  6:15       ` Benjamin Herrenschmidt
2012-02-15  2:37         ` Sukadev Bhattiprolu
2012-02-19 11:26           ` Stephane Eranian
2012-02-20 19:01             ` Sukadev Bhattiprolu
2012-02-09  5:23 ` Anshuman Khandual
2012-02-09 22:45   ` Stephane Eranian
