* perf: bisected sampling bug in Linux 4.11-rc1
@ 2017-07-14 18:05 Vince Weaver
  2017-07-14 18:25 ` Vince Weaver
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Vince Weaver @ 2017-07-14 18:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Shishkin, Ingo Molnar, Peter Zijlstra, Stephane Eranian


I was tracking down some regressions in my perf_event_test testsuite.
Some of the tests broke in the 4.11-rc1 timeframe.

I've bisected one of them, this report is about
	tests/overflow/simul_oneshot_group_overflow
This test creates an event group containing two sampling events, set
to overflow to a signal handler (which disables and then refreshes the 
event).

On a good kernel you get the following:
	Event perf::instructions with period 1000000
	Event perf::instructions with period 2000000
		fd 3 overflows: 946 (perf::instructions/1000000)
		fd 4 overflows: 473 (perf::instructions/2000000)
	Ending counts:
		Count 0: 946379875
		Count 1: 946365218

With the broken kernels you get:
	Event perf::instructions with period 1000000
	Event perf::instructions with period 2000000
		fd 3 overflows: 938 (perf::instructions/1000000)
		fd 4 overflows: 318 (perf::instructions/2000000)
	Ending counts:
		Count 0: 946373080
		Count 1: 653373058
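
For reference, a minimal sketch of the kind of setup being described: two
perf::instructions sampling events in one group (periods 1000000 and
2000000), overflowing to a signal handler that disables and then refreshes
the event.  This is not the actual test source (that is linked later in
the thread); the signal number, sample_type, exclude_kernel, the pinned
leader (a detail that comes up later in the thread) and the busy loop are
illustrative assumptions:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int leader_fd, sibling_fd;
static long long overflows[2];

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* "one-shot" overflow handler: disable the event, then re-arm it for
 * one more overflow via PERF_EVENT_IOC_REFRESH */
static void overflow_handler(int sig, siginfo_t *info, void *uc)
{
	int fd = info->si_fd;

	overflows[fd == sibling_fd]++;
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	ioctl(fd, PERF_EVENT_IOC_REFRESH, 1);
}

static void route_overflow_signal(int fd)
{
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_ASYNC);
	fcntl(fd, F_SETSIG, SIGRTMIN + 2);
	fcntl(fd, F_SETOWN, getpid());
}

int main(void)
{
	struct perf_event_attr pe;
	struct sigaction sa;
	long long counts[2];
	volatile long long i;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = overflow_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGRTMIN + 2, &sa, NULL);

	memset(&pe, 0, sizeof(pe));
	pe.size = sizeof(pe);
	pe.type = PERF_TYPE_HARDWARE;
	pe.config = PERF_COUNT_HW_INSTRUCTIONS;	/* perf::instructions */
	pe.sample_period = 1000000;
	pe.sample_type = PERF_SAMPLE_IP;
	pe.exclude_kernel = 1;	/* user space only */
	pe.pinned = 1;		/* leader pinned: this turns out to matter (see below) */
	pe.disabled = 1;
	leader_fd = perf_event_open(&pe, 0, -1, -1, 0);
	if (leader_fd < 0) { perror("leader"); return 1; }

	pe.pinned = 0;		/* only a group leader may be pinned */
	pe.disabled = 0;
	pe.sample_period = 2000000;
	sibling_fd = perf_event_open(&pe, 0, -1, leader_fd, 0);
	if (sibling_fd < 0) { perror("sibling"); return 1; }

	route_overflow_signal(leader_fd);
	route_overflow_signal(sibling_fd);
	ioctl(leader_fd, PERF_EVENT_IOC_ENABLE, 0);

	for (i = 0; i < 500000000LL; i++)	/* burn instructions */
		;

	ioctl(leader_fd, PERF_EVENT_IOC_DISABLE, 0);
	read(leader_fd, &counts[0], sizeof(counts[0]));
	read(sibling_fd, &counts[1], sizeof(counts[1]));
	printf("overflows: %lld %lld  counts: %lld %lld\n",
	       overflows[0], overflows[1], counts[0], counts[1]);
	return 0;
}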


487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit

commit 487f05e18aa4efacee6357480f293a5afe6593b5
Author: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Date:   Thu Jan 19 18:43:30 2017 +0200

    perf/core: Optimize event rescheduling on active contexts
    
    When new events are added to an active context, we go and reschedule
    all cpu groups and all task groups in order to preserve the priority
    (cpu pinned, task pinned, cpu flexible, task flexible), but in
    reality we only need to reschedule groups of the same priority as
    that of the events being added, and below.
    
    This patch changes the behavior so that only groups that need to be
    rescheduled are rescheduled.
    
    Reported-by: Adrian Hunter <adrian.hunter@intel.com>
    Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20170119164330.22887-3-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>


* Re: perf: bisected sampling bug in Linux 4.11-rc1
  2017-07-14 18:05 perf: bisected sampling bug in Linux 4.11-rc1 Vince Weaver
@ 2017-07-14 18:25 ` Vince Weaver
  2017-07-14 20:07 ` Alexander Shishkin
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Vince Weaver @ 2017-07-14 18:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Shishkin, Ingo Molnar, Peter Zijlstra, Stephane Eranian

On Fri, 14 Jul 2017, Vince Weaver wrote:

> On a good kernel you get the following:
> 	Event perf::instructions with period 1000000
> 	Event perf::instructions with period 2000000
> 		fd 3 overflows: 946 (perf::instructions/1000000)
> 		fd 4 overflows: 473 (perf::instructions/2000000)
> 	Ending counts:
> 		Count 0: 946379875
> 		Count 1: 946365218
> 
> With the broken kernels you get:
> 	Event perf::instructions with period 1000000
> 	Event perf::instructions with period 2000000
> 		fd 3 overflows: 938 (perf::instructions/1000000)
> 		fd 4 overflows: 318 (perf::instructions/2000000)
> 	Ending counts:
> 		Count 0: 946373080
> 		Count 1: 653373058

additional relevant detail:
	in the failing case, the group leader of the event set has 
		.pinned=1
	If I change that to .pinned=0 then the test passes.

Vince


* Re: perf: bisected sampling bug in Linux 4.11-rc1
  2017-07-14 18:05 perf: bisected sampling bug in Linux 4.11-rc1 Vince Weaver
  2017-07-14 18:25 ` Vince Weaver
@ 2017-07-14 20:07 ` Alexander Shishkin
  2017-07-14 20:14   ` Vince Weaver
  2017-07-15 11:00 ` Ingo Molnar
  2017-07-18 11:08 ` perf: bisected sampling bug in Linux 4.11-rc1 Alexander Shishkin
  3 siblings, 1 reply; 12+ messages in thread
From: Alexander Shishkin @ 2017-07-14 20:07 UTC (permalink / raw)
  To: Vince Weaver, linux-kernel; +Cc: Ingo Molnar, Peter Zijlstra, Stephane Eranian

Vince Weaver <vincent.weaver@maine.edu> writes:

> I was tracking down some regressions in my perf_event_test testsuite.
> Some of the tests broke in the 4.11-rc1 timeframe.
>
> I've bisected one of them, this report is about
> 	tests/overflow/simul_oneshot_group_overflow
> This test creates an event group containing two sampling events, set
> to overflow to a signal handler (which disables and then refreshes the 
> event).
>
> On a good kernel you get the following:
> 	Event perf::instructions with period 1000000
> 	Event perf::instructions with period 2000000
> 		fd 3 overflows: 946 (perf::instructions/1000000)
> 		fd 4 overflows: 473 (perf::instructions/2000000)
> 	Ending counts:
> 		Count 0: 946379875
> 		Count 1: 946365218
>
> With the broken kernels you get:
> 	Event perf::instructions with period 1000000
> 	Event perf::instructions with period 2000000
> 		fd 3 overflows: 938 (perf::instructions/1000000)
> 		fd 4 overflows: 318 (perf::instructions/2000000)
> 	Ending counts:
> 		Count 0: 946373080
> 		Count 1: 653373058

I'm not sure I'm seeing it (granted, it's a friday evening): is it the
difference in overflow counts?

Also, are they cpu or task bound?

Regards,
--
Alex


* Re: perf: bisected sampling bug in Linux 4.11-rc1
  2017-07-14 20:07 ` Alexander Shishkin
@ 2017-07-14 20:14   ` Vince Weaver
  0 siblings, 0 replies; 12+ messages in thread
From: Vince Weaver @ 2017-07-14 20:14 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: linux-kernel, Ingo Molnar, Peter Zijlstra, Stephane Eranian

On Fri, 14 Jul 2017, Alexander Shishkin wrote:

> Vince Weaver <vincent.weaver@maine.edu> writes:
> 
> > I was tracking down some regressions in my perf_event_test testsuite.
> > Some of the tests broke in the 4.11-rc1 timeframe.
> >
> > I've bisected one of them, this report is about
> > 	tests/overflow/simul_oneshot_group_overflow
> > This test creates an event group containing two sampling events, set
> > to overflow to a signal handler (which disables and then refreshes the 
> > event).
> >
> > On a good kernel you get the following:
> > 	Event perf::instructions with period 1000000
> > 	Event perf::instructions with period 2000000
> > 		fd 3 overflows: 946 (perf::instructions/1000000)
> > 		fd 4 overflows: 473 (perf::instructions/2000000)
> > 	Ending counts:
> > 		Count 0: 946379875
> > 		Count 1: 946365218
> >
> > With the broken kernels you get:
> > 	Event perf::instructions with period 1000000
> > 	Event perf::instructions with period 2000000
> > 		fd 3 overflows: 938 (perf::instructions/1000000)
> > 		fd 4 overflows: 318 (perf::instructions/2000000)
> > 	Ending counts:
> > 		Count 0: 946373080
> > 		Count 1: 653373058
> 
> I'm not sure I'm seeing it (granted, it's a friday evening): is it the
> difference in overflow counts?

It's two things.
	The test creates a grouped event, with both events being
	perf::instructions.

	1.  The total count at the end should be the same for both
		(on the failing kernels it is not)
	2.  The overflow count for both events should be roughly
		total_events/sample_freq.
		(on the failing kernels it is not)
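
(Sanity check with the numbers above: on the good kernel
946379875/1000000 ~= 946 and 946365218/2000000 ~= 473, matching the
reported overflow counts; on the broken kernel Count 1 stops at
653373058, about 31% below Count 0, and fd 4 sees only 318 overflows.)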

> Also, are they cpu or task bound?

The open looks like this:
	perf_event_open(&pe,0,-1,-1,0);
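
Reading the arguments per the perf_event_open(2) man page:

	/*
	 * &pe      : the attribute struct (set up in the test source, below)
	 * pid = 0  : measure the calling task, i.e. the events are
	 *            task-bound rather than bound to a particular CPU
	 * cpu = -1 : on whichever CPU the task happens to run
	 * group_fd : -1 for the leader; the second event passes the
	 *            leader's fd here to join the group
	 * flags = 0
	 */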

On the failing case, the group leader is pinned.

The source code for the test is here:
	https://github.com/deater/perf_event_tests/blob/master/tests/overflow/simul_oneshot_group_overflow.c

Vince


* Re: perf: bisected sampling bug in Linux 4.11-rc1
  2017-07-14 18:05 perf: bisected sampling bug in Linux 4.11-rc1 Vince Weaver
  2017-07-14 18:25 ` Vince Weaver
  2017-07-14 20:07 ` Alexander Shishkin
@ 2017-07-15 11:00 ` Ingo Molnar
  2017-07-15 14:37   ` Vince Weaver
  2017-07-18 10:39   ` [tip:perf/urgent] Revert "perf/core: Optimize event rescheduling on active contexts" tip-bot for Ingo Molnar
  2017-07-18 11:08 ` perf: bisected sampling bug in Linux 4.11-rc1 Alexander Shishkin
  3 siblings, 2 replies; 12+ messages in thread
From: Ingo Molnar @ 2017-07-15 11:00 UTC (permalink / raw)
  To: Vince Weaver
  Cc: linux-kernel, Alexander Shishkin, Peter Zijlstra,
	Stephane Eranian, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner


* Vince Weaver <vincent.weaver@maine.edu> wrote:

> 487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit
> 
> commit 487f05e18aa4efacee6357480f293a5afe6593b5
> Author: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Date:   Thu Jan 19 18:43:30 2017 +0200
> 
>     perf/core: Optimize event rescheduling on active contexts

BTW., just to prepare for the eventuality: below is a (completely untested...) 
revert of this commit, against recent kernels, with conflicts fixed up.

Does this fix your testcase?

Thanks,

	Ingo

From fe0deecf2a8e9f5097013bcf89a9ef4b80715be1 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Sat, 15 Jul 2017 12:57:51 +0200
Subject: [PATCH] Revert "perf/core: Optimize event rescheduling on active contexts"

This reverts commit 487f05e18aa4efacee6357480f293a5afe6593b5.
---
 kernel/events/core.c | 80 ++++++++--------------------------------------------
 1 file changed, 11 insertions(+), 69 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9747e422ab20..778aa2548142 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -359,8 +359,6 @@ enum event_type_t {
 	EVENT_FLEXIBLE = 0x1,
 	EVENT_PINNED = 0x2,
 	EVENT_TIME = 0x4,
-	/* see ctx_resched() for details */
-	EVENT_CPU = 0x8,
 	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
 };
 
@@ -1445,20 +1443,6 @@ static void update_group_times(struct perf_event *leader)
 		update_event_times(event);
 }
 
-static enum event_type_t get_event_type(struct perf_event *event)
-{
-	struct perf_event_context *ctx = event->ctx;
-	enum event_type_t event_type;
-
-	lockdep_assert_held(&ctx->lock);
-
-	event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
-	if (!ctx->task)
-		event_type |= EVENT_CPU;
-
-	return event_type;
-}
-
 static struct list_head *
 ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
 {
@@ -2232,8 +2216,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 	     struct task_struct *task);
 
 static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			       struct perf_event_context *ctx,
-			       enum event_type_t event_type)
+			       struct perf_event_context *ctx)
 {
 	if (!cpuctx->task_ctx)
 		return;
@@ -2241,7 +2224,7 @@ static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
 	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
 		return;
 
-	ctx_sched_out(ctx, cpuctx, event_type);
+	ctx_sched_out(ctx, cpuctx, EVENT_ALL);
 }
 
 static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
@@ -2256,51 +2239,13 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
 		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
 }
 
-/*
- * We want to maintain the following priority of scheduling:
- *  - CPU pinned (EVENT_CPU | EVENT_PINNED)
- *  - task pinned (EVENT_PINNED)
- *  - CPU flexible (EVENT_CPU | EVENT_FLEXIBLE)
- *  - task flexible (EVENT_FLEXIBLE).
- *
- * In order to avoid unscheduling and scheduling back in everything every
- * time an event is added, only do it for the groups of equal priority and
- * below.
- *
- * This can be called after a batch operation on task events, in which case
- * event_type is a bit mask of the types of events involved. For CPU events,
- * event_type is only either EVENT_PINNED or EVENT_FLEXIBLE.
- */
 static void ctx_resched(struct perf_cpu_context *cpuctx,
-			struct perf_event_context *task_ctx,
-			enum event_type_t event_type)
+			struct perf_event_context *task_ctx)
 {
-	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
-	bool cpu_event = !!(event_type & EVENT_CPU);
-
-	/*
-	 * If pinned groups are involved, flexible groups also need to be
-	 * scheduled out.
-	 */
-	if (event_type & EVENT_PINNED)
-		event_type |= EVENT_FLEXIBLE;
-
 	perf_pmu_disable(cpuctx->ctx.pmu);
 	if (task_ctx)
-		task_ctx_sched_out(cpuctx, task_ctx, event_type);
-
-	/*
-	 * Decide which cpu ctx groups to schedule out based on the types
-	 * of events that caused rescheduling:
-	 *  - EVENT_CPU: schedule out corresponding groups;
-	 *  - EVENT_PINNED task events: schedule out EVENT_FLEXIBLE groups;
-	 *  - otherwise, do nothing more.
-	 */
-	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, ctx_event_type);
-	else if (ctx_event_type & EVENT_PINNED)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-
+		task_ctx_sched_out(cpuctx, task_ctx);
+	cpu_ctx_sched_out(cpuctx, EVENT_ALL);
 	perf_event_sched_in(cpuctx, task_ctx, current);
 	perf_pmu_enable(cpuctx->ctx.pmu);
 }
@@ -2347,7 +2292,7 @@ static int  __perf_install_in_context(void *info)
 	if (reprogram) {
 		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 		add_event_to_ctx(event, ctx);
-		ctx_resched(cpuctx, task_ctx, get_event_type(event));
+		ctx_resched(cpuctx, task_ctx);
 	} else {
 		add_event_to_ctx(event, ctx);
 	}
@@ -2514,7 +2459,7 @@ static void __perf_event_enable(struct perf_event *event,
 	if (ctx->task)
 		WARN_ON_ONCE(task_ctx != ctx);
 
-	ctx_resched(cpuctx, task_ctx, get_event_type(event));
+	ctx_resched(cpuctx, task_ctx);
 }
 
 /*
@@ -2941,7 +2886,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 
 	if (do_switch) {
 		raw_spin_lock(&ctx->lock);
-		task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
+		task_ctx_sched_out(cpuctx, ctx);
 		raw_spin_unlock(&ctx->lock);
 	}
 }
@@ -3498,7 +3443,6 @@ static int event_enable_on_exec(struct perf_event *event,
 static void perf_event_enable_on_exec(int ctxn)
 {
 	struct perf_event_context *ctx, *clone_ctx = NULL;
-	enum event_type_t event_type = 0;
 	struct perf_cpu_context *cpuctx;
 	struct perf_event *event;
 	unsigned long flags;
@@ -3512,17 +3456,15 @@ static void perf_event_enable_on_exec(int ctxn)
 	cpuctx = __get_cpu_context(ctx);
 	perf_ctx_lock(cpuctx, ctx);
 	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
-	list_for_each_entry(event, &ctx->event_list, event_entry) {
+	list_for_each_entry(event, &ctx->event_list, event_entry)
 		enabled |= event_enable_on_exec(event, ctx);
-		event_type |= get_event_type(event);
-	}
 
 	/*
 	 * Unclone and reschedule this context if we enabled any event.
 	 */
 	if (enabled) {
 		clone_ctx = unclone_ctx(ctx);
-		ctx_resched(cpuctx, ctx, event_type);
+		ctx_resched(cpuctx, ctx);
 	} else {
 		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
 	}
@@ -10466,7 +10408,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
 	 * in.
 	 */
 	raw_spin_lock_irq(&child_ctx->lock);
-	task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, EVENT_ALL);
+	task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx);
 
 	/*
 	 * Now that the context is inactive, destroy the task <-> ctx relation


* Re: perf: bisected sampling bug in Linux 4.11-rc1
  2017-07-15 11:00 ` Ingo Molnar
@ 2017-07-15 14:37   ` Vince Weaver
  2017-07-16  3:03     ` Vince Weaver
  2017-07-18 10:39   ` [tip:perf/urgent] Revert "perf/core: Optimize event rescheduling on active contexts" tip-bot for Ingo Molnar
  1 sibling, 1 reply; 12+ messages in thread
From: Vince Weaver @ 2017-07-15 14:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vince Weaver, linux-kernel, Alexander Shishkin, Peter Zijlstra,
	Stephane Eranian, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner

On Sat, 15 Jul 2017, Ingo Molnar wrote:

> 
> * Vince Weaver <vincent.weaver@maine.edu> wrote:
> 
> > 487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit
> > 
> > commit 487f05e18aa4efacee6357480f293a5afe6593b5
> > Author: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> > Date:   Thu Jan 19 18:43:30 2017 +0200
> > 
> >     perf/core: Optimize event rescheduling on active contexts
> 
> BTW., just to prepare for the eventuality: below is a (completely untested...) 
> revert of this commit, against recent kernels, with conflicts fixed up.
> 
> Does this fix your testcase?

Yes, applying this to current git fixes the testcase and doesn't seem to 
break anything else.


Although there is a separate issue also introduced in 4.11-rc1 that still 
fails a different testcase.  I'm in the middle of bisecting that one and 
probably won't have the result of the bisect until Monday.

Vince


* Re: perf: bisected sampling bug in Linux 4.11-rc1
  2017-07-15 14:37   ` Vince Weaver
@ 2017-07-16  3:03     ` Vince Weaver
  0 siblings, 0 replies; 12+ messages in thread
From: Vince Weaver @ 2017-07-16  3:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Alexander Shishkin, Peter Zijlstra,
	Stephane Eranian, Jiri Olsa, Arnaldo Carvalho de Melo,
	Thomas Gleixner

On Sat, 15 Jul 2017, Vince Weaver wrote:

> Although there is a separate issue also introduced in 4.11-rc1 that still 
> fails a different testcase.  I'm in the middle of bisecting that one and 
> probably won't have the result of the bisect until Monday.

I went and bisected the other issue anyway.  It wasn't in 4.11-rc1, but in
4.12-rc4.  Yes, I have a test that triggered the
	c1582c231ea041 ("perf/core: Drop kernel samples even though :u is specified")
issue.

The test was actually losing 7% of its samples, so not just a single
sample here or there.  The test had two events sampling, one at 100k and
one at 200k, so I guess every other sample would have two samples
immediately back-to-back, which must make it more likely to stray into
the kernel.

Vince


* [tip:perf/urgent] Revert "perf/core: Optimize event rescheduling on active contexts"
  2017-07-15 11:00 ` Ingo Molnar
  2017-07-15 14:37   ` Vince Weaver
@ 2017-07-18 10:39   ` tip-bot for Ingo Molnar
  1 sibling, 0 replies; 12+ messages in thread
From: tip-bot for Ingo Molnar @ 2017-07-18 10:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: acme, alexander.shishkin, peterz, eranian, mingo, linux-kernel,
	jolsa, torvalds, acme, hpa, tglx, vincent.weaver

Commit-ID:  770f8eb8a990a8904bfd8a6849be147b40b6e1aa
Gitweb:     http://git.kernel.org/tip/770f8eb8a990a8904bfd8a6849be147b40b6e1aa
Author:     Ingo Molnar <mingo@kernel.org>
AuthorDate: Sat, 15 Jul 2017 13:00:49 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 18 Jul 2017 10:44:47 +0200

Revert "perf/core: Optimize event rescheduling on active contexts"

This reverts commit 487f05e18aa4efacee6357480f293a5afe6593b5.

Vince Weaver reported that it breaks a testcase for pinned events:

| I've bisected one of them, this report is about:
|
|         tests/overflow/simul_oneshot_group_overflow
|
| This test creates an event group containing two sampling events, set
| to overflow to a signal handler (which disables and then refreshes the
| event).
|
| On a good kernel you get the following:
|         Event perf::instructions with period 1000000
|         Event perf::instructions with period 2000000
|                 fd 3 overflows: 946 (perf::instructions/1000000)
|                 fd 4 overflows: 473 (perf::instructions/2000000)
|         Ending counts:
|                 Count 0: 946379875
|                 Count 1: 946365218
|
| With the broken kernels you get:
|         Event perf::instructions with period 1000000
|         Event perf::instructions with period 2000000
|                 fd 3 overflows: 938 (perf::instructions/1000000)
|                 fd 4 overflows: 318 (perf::instructions/2000000)
|         Ending counts:
|                 Count 0: 946373080
|                 Count 1: 653373058
...
| additional relevant detail:
|         in the failing case, the group leader of the event set has
|                 .pinned=1
|         If I change that to .pinned=0 then the test passes.

As it's an optimization we can revert it for now until the root cause is found.

Cc: Adrian Hunter <adrian.hunter@intel.com>
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170715110049.36jvxnidy2flh6ll@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 80 ++++++++--------------------------------------------
 1 file changed, 11 insertions(+), 69 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9747e42..778aa25 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -359,8 +359,6 @@ enum event_type_t {
 	EVENT_FLEXIBLE = 0x1,
 	EVENT_PINNED = 0x2,
 	EVENT_TIME = 0x4,
-	/* see ctx_resched() for details */
-	EVENT_CPU = 0x8,
 	EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED,
 };
 
@@ -1445,20 +1443,6 @@ static void update_group_times(struct perf_event *leader)
 		update_event_times(event);
 }
 
-static enum event_type_t get_event_type(struct perf_event *event)
-{
-	struct perf_event_context *ctx = event->ctx;
-	enum event_type_t event_type;
-
-	lockdep_assert_held(&ctx->lock);
-
-	event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
-	if (!ctx->task)
-		event_type |= EVENT_CPU;
-
-	return event_type;
-}
-
 static struct list_head *
 ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
 {
@@ -2232,8 +2216,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 	     struct task_struct *task);
 
 static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			       struct perf_event_context *ctx,
-			       enum event_type_t event_type)
+			       struct perf_event_context *ctx)
 {
 	if (!cpuctx->task_ctx)
 		return;
@@ -2241,7 +2224,7 @@ static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
 	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
 		return;
 
-	ctx_sched_out(ctx, cpuctx, event_type);
+	ctx_sched_out(ctx, cpuctx, EVENT_ALL);
 }
 
 static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
@@ -2256,51 +2239,13 @@ static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
 		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
 }
 
-/*
- * We want to maintain the following priority of scheduling:
- *  - CPU pinned (EVENT_CPU | EVENT_PINNED)
- *  - task pinned (EVENT_PINNED)
- *  - CPU flexible (EVENT_CPU | EVENT_FLEXIBLE)
- *  - task flexible (EVENT_FLEXIBLE).
- *
- * In order to avoid unscheduling and scheduling back in everything every
- * time an event is added, only do it for the groups of equal priority and
- * below.
- *
- * This can be called after a batch operation on task events, in which case
- * event_type is a bit mask of the types of events involved. For CPU events,
- * event_type is only either EVENT_PINNED or EVENT_FLEXIBLE.
- */
 static void ctx_resched(struct perf_cpu_context *cpuctx,
-			struct perf_event_context *task_ctx,
-			enum event_type_t event_type)
+			struct perf_event_context *task_ctx)
 {
-	enum event_type_t ctx_event_type = event_type & EVENT_ALL;
-	bool cpu_event = !!(event_type & EVENT_CPU);
-
-	/*
-	 * If pinned groups are involved, flexible groups also need to be
-	 * scheduled out.
-	 */
-	if (event_type & EVENT_PINNED)
-		event_type |= EVENT_FLEXIBLE;
-
 	perf_pmu_disable(cpuctx->ctx.pmu);
 	if (task_ctx)
-		task_ctx_sched_out(cpuctx, task_ctx, event_type);
-
-	/*
-	 * Decide which cpu ctx groups to schedule out based on the types
-	 * of events that caused rescheduling:
-	 *  - EVENT_CPU: schedule out corresponding groups;
-	 *  - EVENT_PINNED task events: schedule out EVENT_FLEXIBLE groups;
-	 *  - otherwise, do nothing more.
-	 */
-	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, ctx_event_type);
-	else if (ctx_event_type & EVENT_PINNED)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-
+		task_ctx_sched_out(cpuctx, task_ctx);
+	cpu_ctx_sched_out(cpuctx, EVENT_ALL);
 	perf_event_sched_in(cpuctx, task_ctx, current);
 	perf_pmu_enable(cpuctx->ctx.pmu);
 }
@@ -2347,7 +2292,7 @@ static int  __perf_install_in_context(void *info)
 	if (reprogram) {
 		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 		add_event_to_ctx(event, ctx);
-		ctx_resched(cpuctx, task_ctx, get_event_type(event));
+		ctx_resched(cpuctx, task_ctx);
 	} else {
 		add_event_to_ctx(event, ctx);
 	}
@@ -2514,7 +2459,7 @@ static void __perf_event_enable(struct perf_event *event,
 	if (ctx->task)
 		WARN_ON_ONCE(task_ctx != ctx);
 
-	ctx_resched(cpuctx, task_ctx, get_event_type(event));
+	ctx_resched(cpuctx, task_ctx);
 }
 
 /*
@@ -2941,7 +2886,7 @@ unlock:
 
 	if (do_switch) {
 		raw_spin_lock(&ctx->lock);
-		task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
+		task_ctx_sched_out(cpuctx, ctx);
 		raw_spin_unlock(&ctx->lock);
 	}
 }
@@ -3498,7 +3443,6 @@ static int event_enable_on_exec(struct perf_event *event,
 static void perf_event_enable_on_exec(int ctxn)
 {
 	struct perf_event_context *ctx, *clone_ctx = NULL;
-	enum event_type_t event_type = 0;
 	struct perf_cpu_context *cpuctx;
 	struct perf_event *event;
 	unsigned long flags;
@@ -3512,17 +3456,15 @@ static void perf_event_enable_on_exec(int ctxn)
 	cpuctx = __get_cpu_context(ctx);
 	perf_ctx_lock(cpuctx, ctx);
 	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
-	list_for_each_entry(event, &ctx->event_list, event_entry) {
+	list_for_each_entry(event, &ctx->event_list, event_entry)
 		enabled |= event_enable_on_exec(event, ctx);
-		event_type |= get_event_type(event);
-	}
 
 	/*
 	 * Unclone and reschedule this context if we enabled any event.
 	 */
 	if (enabled) {
 		clone_ctx = unclone_ctx(ctx);
-		ctx_resched(cpuctx, ctx, event_type);
+		ctx_resched(cpuctx, ctx);
 	} else {
 		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
 	}
@@ -10466,7 +10408,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
 	 * in.
 	 */
 	raw_spin_lock_irq(&child_ctx->lock);
-	task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, EVENT_ALL);
+	task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx);
 
 	/*
 	 * Now that the context is inactive, destroy the task <-> ctx relation


* Re: perf: bisected sampling bug in Linux 4.11-rc1
  2017-07-14 18:05 perf: bisected sampling bug in Linux 4.11-rc1 Vince Weaver
                   ` (2 preceding siblings ...)
  2017-07-15 11:00 ` Ingo Molnar
@ 2017-07-18 11:08 ` Alexander Shishkin
  2017-07-18 12:13   ` Ingo Molnar
  2017-07-20  8:34   ` [tip:perf/urgent] perf/core: Fix scheduling regression of pinned groups tip-bot for Alexander Shishkin
  3 siblings, 2 replies; 12+ messages in thread
From: Alexander Shishkin @ 2017-07-18 11:08 UTC (permalink / raw)
  To: Vince Weaver, linux-kernel; +Cc: Ingo Molnar, Peter Zijlstra, Stephane Eranian

Vince Weaver <vincent.weaver@maine.edu> writes:

> I was tracking down some regressions in my perf_event_test testsuite.
> Some of the tests broke in the 4.11-rc1 timeframe.
>
> I've bisected one of them, this report is about
> 	tests/overflow/simul_oneshot_group_overflow
> This test creates an event group containing two sampling events, set
> to overflow to a signal handler (which disables and then refreshes the 
> event).
>
> On a good kernel you get the following:
> 	Event perf::instructions with period 1000000
> 	Event perf::instructions with period 2000000
> 		fd 3 overflows: 946 (perf::instructions/1000000)
> 		fd 4 overflows: 473 (perf::instructions/2000000)
> 	Ending counts:
> 		Count 0: 946379875
> 		Count 1: 946365218
>
> With the broken kernels you get:
> 	Event perf::instructions with period 1000000
> 	Event perf::instructions with period 2000000
> 		fd 3 overflows: 938 (perf::instructions/1000000)
> 		fd 4 overflows: 318 (perf::instructions/2000000)
> 	Ending counts:
> 		Count 0: 946373080
> 		Count 1: 653373058
>
>
> 487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit
>
> commit 487f05e18aa4efacee6357480f293a5afe6593b5
> Author: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Date:   Thu Jan 19 18:43:30 2017 +0200

Ok, there was a bug there indeed. This patch should take care of it and
should also be backportable in case it's stable-worthy.

From 187d67c9908cb126656c34546772089c17a8e6c5 Mon Sep 17 00:00:00 2001
From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Date: Tue, 18 Jul 2017 13:53:01 +0300
Subject: [PATCH] perf: Fix scheduling regression of pinned groups

Commit 487f05e18a ("perf/core: Optimize event rescheduling on active
contexts") erronously assumed that event's 'pinned' setting determines
whether the event belongs to a pinned group or not, but in fact, it's
the group leader's pinned state that matters. This was discovered by
Vince in a test case where two instruction counters are grouped, the
group leader is pinned, but the other event is not; in the regressed
case the counters were off by 33% (the difference between events'
periods), but should be the same within the error margin.

This fixes the problem by looking at the group leader's pinning.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Fixes: 487f05e18a ("perf/core: Optimize event rescheduling on active contexts")
Cc: stable@vger.kernel.org
---
 kernel/events/core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index bc63f8db1b..1edbaf94dd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1451,6 +1451,13 @@ static enum event_type_t get_event_type(struct perf_event *event)
 
 	lockdep_assert_held(&ctx->lock);
 
+	/*
+	 * It's 'group type', really, because if our group leader is
+	 * pinned, so are we.
+	 */
+	if (event->group_leader != event)
+		event = event->group_leader;
+
 	event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
 	if (!ctx->task)
 		event_type |= EVENT_CPU;
-- 
2.11.0


* Re: perf: bisected sampling bug in Linux 4.11-rc1
  2017-07-18 11:08 ` perf: bisected sampling bug in Linux 4.11-rc1 Alexander Shishkin
@ 2017-07-18 12:13   ` Ingo Molnar
  2017-07-20  3:30     ` Vince Weaver
  2017-07-20  8:34   ` [tip:perf/urgent] perf/core: Fix scheduling regression of pinned groups tip-bot for Alexander Shishkin
  1 sibling, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2017-07-18 12:13 UTC (permalink / raw)
  To: Alexander Shishkin
  Cc: Vince Weaver, linux-kernel, Peter Zijlstra, Stephane Eranian


* Alexander Shishkin <alexander.shishkin@linux.intel.com> wrote:

> Vince Weaver <vincent.weaver@maine.edu> writes:
> 
> > I was tracking down some regressions in my perf_event_test testsuite.
> > Some of the tests broke in the 4.11-rc1 timeframe.
> >
> > I've bisected one of them, this report is about
> > 	tests/overflow/simul_oneshot_group_overflow
> > This test creates an event group containing two sampling events, set
> > to overflow to a signal handler (which disables and then refreshes the 
> > event).
> >
> > On a good kernel you get the following:
> > 	Event perf::instructions with period 1000000
> > 	Event perf::instructions with period 2000000
> > 		fd 3 overflows: 946 (perf::instructions/1000000)
> > 		fd 4 overflows: 473 (perf::instructions/2000000)
> > 	Ending counts:
> > 		Count 0: 946379875
> > 		Count 1: 946365218
> >
> > With the broken kernels you get:
> > 	Event perf::instructions with period 1000000
> > 	Event perf::instructions with period 2000000
> > 		fd 3 overflows: 938 (perf::instructions/1000000)
> > 		fd 4 overflows: 318 (perf::instructions/2000000)
> > 	Ending counts:
> > 		Count 0: 946373080
> > 		Count 1: 653373058
> >
> >
> > 487f05e18aa4efacee6357480f293a5afe6593b5 is the first bad commit
> >
> > commit 487f05e18aa4efacee6357480f293a5afe6593b5
> > Author: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> > Date:   Thu Jan 19 18:43:30 2017 +0200
> 
> Ok, there was a bug there indeed. This patch should take care of it and
> should also be backportable in case it's stable-worthy.
> 
> From 187d67c9908cb126656c34546772089c17a8e6c5 Mon Sep 17 00:00:00 2001
> From: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Date: Tue, 18 Jul 2017 13:53:01 +0300
> Subject: [PATCH] perf: Fix scheduling regression of pinned groups

Ok, great - if this works then I'll pick up this fix instead of the revert that 
I've queued up earlier today.

Thanks,

	Ingo


* Re: perf: bisected sampling bug in Linux 4.11-rc1
  2017-07-18 12:13   ` Ingo Molnar
@ 2017-07-20  3:30     ` Vince Weaver
  0 siblings, 0 replies; 12+ messages in thread
From: Vince Weaver @ 2017-07-20  3:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alexander Shishkin, Vince Weaver, linux-kernel, Peter Zijlstra,
	Stephane Eranian

On Tue, 18 Jul 2017, Ingo Molnar wrote:
> 
> Ok, great - if this works then I'll pick up this fix instead of the revert that 
> I've queued up earlier today.

sorry for the delay, I was out of town for a few days.

I've tried the patch and it looks like it fixes the test in question and 
doesn't seem to add any other regressions.

Vince


* [tip:perf/urgent] perf/core: Fix scheduling regression of pinned groups
  2017-07-18 11:08 ` perf: bisected sampling bug in Linux 4.11-rc1 Alexander Shishkin
  2017-07-18 12:13   ` Ingo Molnar
@ 2017-07-20  8:34   ` tip-bot for Alexander Shishkin
  1 sibling, 0 replies; 12+ messages in thread
From: tip-bot for Alexander Shishkin @ 2017-07-20  8:34 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, torvalds, alexander.shishkin, linux-kernel, jolsa, tglx,
	acme, vincent.weaver, mingo, hpa, eranian

Commit-ID:  3bda69c1c3993a2bddbae01397d12bfef6054011
Gitweb:     http://git.kernel.org/tip/3bda69c1c3993a2bddbae01397d12bfef6054011
Author:     Alexander Shishkin <alexander.shishkin@linux.intel.com>
AuthorDate: Tue, 18 Jul 2017 14:08:34 +0300
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 20 Jul 2017 09:43:02 +0200

perf/core: Fix scheduling regression of pinned groups

Vince Weaver reported:

> I was tracking down some regressions in my perf_event_test testsuite.
> Some of the tests broke in the 4.11-rc1 timeframe.
>
> I've bisected one of them, this report is about
>	tests/overflow/simul_oneshot_group_overflow
> This test creates an event group containing two sampling events, set
> to overflow to a signal handler (which disables and then refreshes the
> event).
>
> On a good kernel you get the following:
> 	Event perf::instructions with period 1000000
> 	Event perf::instructions with period 2000000
> 		fd 3 overflows: 946 (perf::instructions/1000000)
> 		fd 4 overflows: 473 (perf::instructions/2000000)
> 	Ending counts:
> 		Count 0: 946379875
> 		Count 1: 946365218
>
> With the broken kernels you get:
> 	Event perf::instructions with period 1000000
> 	Event perf::instructions with period 2000000
> 		fd 3 overflows: 938 (perf::instructions/1000000)
> 		fd 4 overflows: 318 (perf::instructions/2000000)
> 	Ending counts:
> 		Count 0: 946373080
> 		Count 1: 653373058

The root cause of the bug is that the following commit:

  487f05e18a ("perf/core: Optimize event rescheduling on active contexts")

erroneously assumed that an event's 'pinned' setting determines whether the
event belongs to a pinned group or not, but in fact, it's the group
leader's pinned state that matters.

This was discovered by Vince in the test case described above, where two instruction
counters are grouped, the group leader is pinned, but the other event is not;
in the regressed case the counters were off by 33% (the difference between events'
periods), but should be the same within the error margin.

Fix the problem by looking at the group leader's pinning.

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 487f05e18a ("perf/core: Optimize event rescheduling on active contexts")
Link: http://lkml.kernel.org/r/87lgnmvw7h.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9747e42..c9cdbd3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1452,6 +1452,13 @@ static enum event_type_t get_event_type(struct perf_event *event)
 
 	lockdep_assert_held(&ctx->lock);
 
+	/*
+	 * It's 'group type', really, because if our group leader is
+	 * pinned, so are we.
+	 */
+	if (event->group_leader != event)
+		event = event->group_leader;
+
 	event_type = event->attr.pinned ? EVENT_PINNED : EVENT_FLEXIBLE;
 	if (!ctx->task)
 		event_type |= EVENT_CPU;

