linux-kernel.vger.kernel.org archive mirror
* [PATCH] perf: fix pmu::filter_match for SW-led groups
@ 2016-06-14 15:10 Mark Rutland
  2016-07-02 16:40 ` Peter Zijlstra
  2016-07-07  8:31 ` [tip:perf/core] perf/core: Fix " tip-bot for Mark Rutland
  0 siblings, 2 replies; 8+ messages in thread
From: Mark Rutland @ 2016-06-14 15:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: Mark Rutland, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ingo Molnar, Peter Zijlstra, Will Deacon

Commit 66eb579e66ecfea5 ("perf: allow for PMU-specific event filtering")
added pmu::filter_match. This was intended to avoid HW constraints on
events from resulting in extremely pessimistic scheduling.

However, pmu::filter_match is only called for the leader of each event
group. When the leader is a SW event, we do not filter the groups, and
may fail at pmu::add time, and when this happens we'll give up on
scheduling any event groups later in the list until they are rotated
ahead of the failing group.

This can result in extremely sub-optimal scheduling behaviour, e.g. if
running the following on a big.LITTLE platform:

$ taskset -c 0 ./perf stat \
 -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
 -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
 ls

     <not counted>      context-switches                                              (0.00%)
     <not counted>      armv8_cortex_a57/config=0x11/                                 (0.00%)
                24      context-switches                                              (37.36%)
          57589154      armv8_cortex_a53/config=0x11/                                 (37.36%)

Here the 'a53' event group was always eligible to be scheduled, but the
'a57' group was never eligible to be scheduled, as the task was always
affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57' group was
eligible, but the HW event failed at pmu::add time, resulting in
ctx_flexible_sched_in giving up on scheduling further groups with HW
events.

One way of avoiding this is to check pmu::filter_match on siblings as
well as the group leader. If any of these fail their pmu::filter_match,
we must skip the entire group before attempting to add any events.

Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
---
 kernel/events/core.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

I've tried to find a better way of handling this (without needing to walk the
siblings list), but so far I'm at a loss. At least it's "only" O(n) in the size
of the sibling list we were going to walk anyway.

I suspect that at a more fundamental level, I need to stop sharing a
perf_hw_context between HW PMUs (i.e. replace task_struct::perf_event_ctxp with
something that can handle multiple HW PMUs). From previous attempts I'm not
sure if that's going to be possible.

Any ideas appreciated!

Mark.

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9c51ec3..c0b6db0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1678,12 +1678,32 @@ static bool is_orphaned_event(struct perf_event *event)
 	return event->state == PERF_EVENT_STATE_DEAD;
 }
 
-static inline int pmu_filter_match(struct perf_event *event)
+static inline int __pmu_filter_match(struct perf_event *event)
 {
 	struct pmu *pmu = event->pmu;
 	return pmu->filter_match ? pmu->filter_match(event) : 1;
 }
 
+/*
+ * Check whether we should attempt to schedule an event group based on
+ * PMU-specific filtering. An event group can consist of HW and SW events,
+ * potentially with a SW leader, so we must check all the filters to determine
+ * whether a group is schedulable.
+ */
+static inline int pmu_filter_match(struct perf_event *event)
+{
+	struct perf_event *child;
+
+	if (!__pmu_filter_match(event))
+		return 0;
+
+	list_for_each_entry(child, &event->sibling_list, group_entry)
+		if (!__pmu_filter_match(child))
+			return 0;
+
+	return 1;
+}
+
 static inline int
 event_filter_match(struct perf_event *event)
 {
-- 
1.9.1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf: fix pmu::filter_match for SW-led groups
  2016-06-14 15:10 [PATCH] perf: fix pmu::filter_match for SW-led groups Mark Rutland
@ 2016-07-02 16:40 ` Peter Zijlstra
  2016-07-04 18:05   ` Mark Rutland
  2016-07-07  8:31 ` [tip:perf/core] perf/core: Fix " tip-bot for Mark Rutland
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2016-07-02 16:40 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ingo Molnar, Will Deacon

On Tue, Jun 14, 2016 at 04:10:41PM +0100, Mark Rutland wrote:
> However, pmu::filter_match is only called for the leader of each event
> group. When the leader is a SW event, we do not filter the groups, and
> may fail at pmu::add time, and when this happens we'll give up on
> scheduling any event groups later in the list until they are rotated
> ahead of the failing group.

Ha! indeed.

> I've tried to find a better way of handling this (without needing to walk the
> siblings list), but so far I'm at a loss. At least it's "only" O(n) in the size
> of the sibling list we were going to walk anyway.
> 
> I suspect that at a more fundamental level, I need to stop sharing a
> perf_hw_context between HW PMUs (i.e. replace task_struct::perf_event_ctxp with
> something that can handle multiple HW PMUs). From previous attempts I'm not
> sure if that's going to be possible.
> 
> Any ideas appreciated!

So I think I have half-cooked ideas.

One of the problems I've been wanting to solve for a long time is that
the per-cpu flexible list has priority over the per-task flexible list.

I would like them to rotate together.

One of the ways I was looking at getting that done is a virtual runtime
scheduler (just like cfs). The tricky point is merging two virtual
runtime trees. But I think that should be doable if we sort the trees on
lag.

In any case, the relevance to your question is that once we have a tree,
we can play games with order; that is, if we first order on PMU-id and
only second on lag, we get whole subtree clusters specific for a PMU.


Lots of details missing in that picture, but I think something along
those lines might get us what we want.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf: fix pmu::filter_match for SW-led groups
  2016-07-02 16:40 ` Peter Zijlstra
@ 2016-07-04 18:05   ` Mark Rutland
  2016-07-05  8:35     ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2016-07-04 18:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ingo Molnar, Will Deacon

On Sat, Jul 02, 2016 at 06:40:25PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 14, 2016 at 04:10:41PM +0100, Mark Rutland wrote:
> > However, pmu::filter_match is only called for the leader of each event
> > group. When the leader is a SW event, we do not filter the groups, and
> > may fail at pmu::add time, and when this happens we'll give up on
> > scheduling any event groups later in the list until they are rotated
> > ahead of the failing group.
> 
> Ha! indeed.
> 
> > I've tried to find a better way of handling this (without needing to walk the
> > siblings list), but so far I'm at a loss. At least it's "only" O(n) in the size
> > of the sibling list we were going to walk anyway.
> > 
> > I suspect that at a more fundamental level, I need to stop sharing a
> > perf_hw_context between HW PMUs (i.e. replace task_struct::perf_event_ctxp with
> > something that can handle multiple HW PMUs). From previous attempts I'm not
> > sure if that's going to be possible.
> > 
> > Any ideas appreciated!
> 
> So I think I have half-cooked ideas.
> 
> One of the problems I've been wanting to solve for a long time is that
> the per-cpu flexible list has priority over the per-task flexible list.
> 
> I would like them to rotate together.

Makes sense.

> One of the ways I was looking at getting that done is a virtual runtime
> scheduler (just like cfs). The tricky point is merging two virtual
> runtime trees. But I think that should be doable if we sort the trees on
> lag.
> 
> In any case, the relevance to your question is that once we have a tree,
> we can play games with order; that is, if we first order on PMU-id and
> only second on lag, we get whole subtree clusters specific for a PMU.

Hmm... I'm not sure how that helps in this case. Wouldn't we still need
to walk the sibling list to get the HW PMU-id in the case of a SW group
leader?

For the heterogeneous case we'd need a different sort order per-cpu
(well, per microarchitecture), which sounds like we're going to have to
fully sort the events every time they move between CPUs. :/

> Lots of details missing in that picture, but I think something along
> those lines might get us what we want.

Perhaps! Hopefully I'm just missing those detail above. :)

I also had another thought about solving the SW-led group case: if the
leader had a reference to the group's HW PMU (of which there should only
be one), we can filter on that alone, and can also use that in
group_sched_in rather than the ctx->pmu, avoiding the issue that
ctx->pmu is not the same as the group's HW PMU.
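
For illustration only, here's a userspace sketch of that idea; the types
and names (struct group_event, group_hw_pmu(), the flattened sibling
pointer) are made up for the sketch, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: not the kernel's perf_event/pmu types. */
struct pmu {
	bool is_sw;
};

struct group_event {
	struct pmu *pmu;
	struct group_event *next_sibling;	/* flattened stand-in for sibling_list */
};

/*
 * Resolve the single HW PMU of a group (if any), walking the leader and
 * its siblings once. With this cached on the leader, a SW-led group
 * could be filtered (and scheduled) against that PMU rather than
 * ctx->pmu.
 */
static struct pmu *group_hw_pmu(struct group_event *leader)
{
	struct group_event *e;

	for (e = leader; e; e = e->next_sibling) {
		if (!e->pmu->is_sw)
			return e->pmu;
	}
	return NULL;	/* all-SW group: nothing to filter on */
}
```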

I'll have a play with that approach in the mean time.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf: fix pmu::filter_match for SW-led groups
  2016-07-04 18:05   ` Mark Rutland
@ 2016-07-05  8:35     ` Peter Zijlstra
  2016-07-05  9:44       ` Mark Rutland
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2016-07-05  8:35 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ingo Molnar, Will Deacon

On Mon, Jul 04, 2016 at 07:05:35PM +0100, Mark Rutland wrote:
> On Sat, Jul 02, 2016 at 06:40:25PM +0200, Peter Zijlstra wrote:
> > One of the ways I was looking at getting that done is a virtual runtime
> > scheduler (just like cfs). The tricky point is merging two virtual
> > runtime trees. But I think that should be doable if we sort the trees on
> > lag.
> > 
> > In any case, the relevance to your question is that once we have a tree,
> > we can play games with order; that is, if we first order on PMU-id and
> > only second on lag, we get whole subtree clusters specific for a PMU.
> 
> Hmm... I'm not sure how that helps in this case. Wouldn't we still need
> to walk the sibling list to get the HW PMU-id in the case of a SW group
> leader?

Since there is a hardware event in the group, it must be part of the
hardware pmu list/tree and would thus end up classified (and sorted) by
that (hardware) PMU-id.

> For the heterogeneous case we'd need a different sort order per-cpu
> (well, per microarchitecture), which sounds like we're going to have to
> fully sort the events every time they move between CPUs. :/

Confused, I thought that for the HG case you had multiple events, one
for each PMU. If we classify these events differently we'd simply use a
different subtree depending on which CPU the task lands.

Currently we've munged the two PMUs together, because, well, that's the
only way.

> I also had another thought about solving the SW-led group case: if the
> leader had a reference to the group's HW PMU (of which there should only
> be one), we can filter on that alone, and can also use that in
> group_sched_in rather than the ctx->pmu, avoiding the issue that
> ctx->pmu is not the same as the group's HW PMU.
> 
> I'll have a play with that approach in the mean time.

Right, adds one more pointer to the struct event, but that thing is
massive already.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf: fix pmu::filter_match for SW-led groups
  2016-07-05  8:35     ` Peter Zijlstra
@ 2016-07-05  9:44       ` Mark Rutland
  2016-07-05 12:04         ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2016-07-05  9:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ingo Molnar, Will Deacon

On Tue, Jul 05, 2016 at 10:35:26AM +0200, Peter Zijlstra wrote:
> On Mon, Jul 04, 2016 at 07:05:35PM +0100, Mark Rutland wrote:
> > On Sat, Jul 02, 2016 at 06:40:25PM +0200, Peter Zijlstra wrote:
> > > One of the ways I was looking at getting that done is a virtual runtime
> > > scheduler (just like cfs). The tricky point is merging two virtual
> > > runtime trees. But I think that should be doable if we sort the trees on
> > > lag.
> > > 
> > > In any case, the relevance to your question is that once we have a tree,
> > > we can play games with order; that is, if we first order on PMU-id and
> > > only second on lag, we get whole subtree clusters specific for a PMU.
> > 
> > Hmm... I'm not sure how that helps in this case. Wouldn't we still need
> > to walk the sibling list to get the HW PMU-id in the case of a SW group
> > leader?
> 
> Since there is a hardware event in the group, it must be part of the
> hardware pmu list/tree and would thus end up classified (and sorted) by
> that (hardware) PMU-id.
> 
> > For the heterogeneous case we'd need a different sort order per-cpu
> > (well, per microarchitecture), which sounds like we're going to have to
> > fully sort the events every time they move between CPUs. :/
> 
> Confused, I thought that for the HG case you had multiple events, one
> for each PMU. If we classify these events differently we'd simply use a
> different subtree depending on which CPU the task lands.

My bad; I assumed that for both PMUs we'd start at the root, and thus
would need to re-sort in order to get the current CPU's PMU ordered
first, much like currently with rotation.

I guess I'm having difficulty figuring out the structure of that tree.
If we can easily/cheaply find the relevant sub-tree then the above isn't
an issue.

> Currently we've munged the two PMUs together, because, well, that's the
> only way.

Yeah. Splitting them by any means would be great. In the past I'd looked
at changing task_struct::perf_event_ctxp into something that could
handle an arbitrary number of contexts, such that we could avoid
sharing, but ran away after considering the locking/rcu implications.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf: fix pmu::filter_match for SW-led groups
  2016-07-05  9:44       ` Mark Rutland
@ 2016-07-05 12:04         ` Peter Zijlstra
  2016-07-05 12:52           ` Mark Rutland
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2016-07-05 12:04 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ingo Molnar, Will Deacon

On Tue, Jul 05, 2016 at 10:44:48AM +0100, Mark Rutland wrote:
> My bad; I assumed that for both PMUs we'd start at the root, and thus
> would need to re-sort in order to get the current CPU's PMU ordered
> first, much like currently with rotation.
> 
> I guess I'm having difficulty figuring out the structure of that tree.
> If we can easily/cheaply find the relevant sub-tree then the above isn't
> an issue.

struct event {
	struct rb_node node;
	int pmu_id;
	s64 lag;
	...
};

bool event_less(struct rb_node *a, struct rb_node *b)
{
	struct event *left = rb_entry(a, struct event, node);
	struct event *right = rb_entry(b, struct event, node);

	if (left->pmu_id < right->pmu_id)
		return true;

	if (left->pmu_id > right->pmu_id)
		return false;

	/* left->pmu_id == right->pmu_id */
	if (left->lag < right->lag)
		return true;

	return false;
}

Will give you a tree with primary order @pmu_id and secondary order
@lag.

Which you'd iterate like:

	for (event = event_find(pmu_id); event->pmu_id == pmu_id; event = event_next(event)) {
	}

And get only the events matching @pmu_id in @lag order.
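
To make that ordering concrete outside the kernel, here's a
self-contained approximation with a sorted array standing in for the
rbtree; event_cmp mirrors the (pmu_id, lag) ordering of the comparator
above, and event_find is a linear-scan stand-in for the leftmost-match
tree lookup (all names here are illustrative, not kernel API):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model: a flat array standing in for the rbtree. */
struct event {
	int pmu_id;
	long long lag;
};

/* Primary order: pmu_id; secondary order: lag. */
static int event_cmp(const void *pa, const void *pb)
{
	const struct event *a = pa, *b = pb;

	if (a->pmu_id != b->pmu_id)
		return a->pmu_id < b->pmu_id ? -1 : 1;
	if (a->lag != b->lag)
		return a->lag < b->lag ? -1 : 1;
	return 0;
}

/* Leftmost event with the given pmu_id, or NULL if none exists. */
static struct event *event_find(struct event *evs, int n, int pmu_id)
{
	for (int i = 0; i < n; i++) {
		if (evs[i].pmu_id == pmu_id)
			return &evs[i];
	}
	return NULL;
}
```

Sorting with event_cmp clusters all events of one PMU together, each
cluster internally in @lag order, so iterating from event_find() until
the pmu_id changes visits exactly one PMU's events.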

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] perf: fix pmu::filter_match for SW-led groups
  2016-07-05 12:04         ` Peter Zijlstra
@ 2016-07-05 12:52           ` Mark Rutland
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Rutland @ 2016-07-05 12:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Alexander Shishkin, Arnaldo Carvalho de Melo,
	Ingo Molnar, Will Deacon

On Tue, Jul 05, 2016 at 02:04:26PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 05, 2016 at 10:44:48AM +0100, Mark Rutland wrote:
> > My bad; I assumed that for both PMUs we'd start at the root, and thus
> > would need to re-sort in order to get the current CPU's PMU ordered
> > first, much like currently with rotation.
> > 
> > I guess I'm having difficulty figuring out the structure of that tree.
> > If we can easily/cheaply find the relevant sub-tree then the above isn't
> > an issue.
> 
> struct event {
> 	struct rb_node node;
> 	int pmu_id;
> 	s64 lag;
> 	...
> };
> 
> bool event_less(struct rb_node *a, struct rb_node *b)
> {
> 	struct event *left = rb_entry(a, struct event, node);
> 	struct event *right = rb_entry(b, struct event, node);
> 
> 	if (left->pmu_id < right->pmu_id)
> 		return true;
> 
> 	if (left->pmu_id > right->pmu_id)
> 		return false;
> 
> 	/* left->pmu_id == right->pmu_id */
> 	if (left->lag < right->lag)
> 		return true;
> 
> 	return false;
> }
> 
> Will give you a tree with primary order @pmu_id and secondary order
> @lag.
> 
> Which you'd iterate like:
> 
> 	for (event = event_find(pmu_id); event->pmu_id == pmu_id; event = event_next(event)) {
> 	}
> 
> And get only the events matching @pmu_id in @lag order.

Cheers! Sorry for being thick; I think I understand now.

I'll have a tinker with the idea.

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [tip:perf/core] perf/core: Fix pmu::filter_match for SW-led groups
  2016-06-14 15:10 [PATCH] perf: fix pmu::filter_match for SW-led groups Mark Rutland
  2016-07-02 16:40 ` Peter Zijlstra
@ 2016-07-07  8:31 ` tip-bot for Mark Rutland
  1 sibling, 0 replies; 8+ messages in thread
From: tip-bot for Mark Rutland @ 2016-07-07  8:31 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mark.rutland, tglx, hpa, jolsa, mingo, peterz,
	acme, will.deacon, torvalds, alexander.shishkin, acme

Commit-ID:  2c81a6477081966fe80b8c6daa68459bca896774
Gitweb:     http://git.kernel.org/tip/2c81a6477081966fe80b8c6daa68459bca896774
Author:     Mark Rutland <mark.rutland@arm.com>
AuthorDate: Tue, 14 Jun 2016 16:10:41 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 7 Jul 2016 08:57:57 +0200

perf/core: Fix pmu::filter_match for SW-led groups

The following commit:

  66eb579e66ec ("perf: allow for PMU-specific event filtering")

added the pmu::filter_match() callback. This was intended to
avoid HW constraints on events from resulting in extremely
pessimistic scheduling.

However, pmu::filter_match() is only called for the leader of each event
group. When the leader is a SW event, we do not filter the groups, and
may fail at pmu::add() time, and when this happens we'll give up on
scheduling any event groups later in the list until they are rotated
ahead of the failing group.

This can result in extremely sub-optimal event scheduling behaviour,
e.g. if running the following on a big.LITTLE platform:

$ taskset -c 0 ./perf stat \
 -e 'a57{context-switches,armv8_cortex_a57/config=0x11/}' \
 -e 'a53{context-switches,armv8_cortex_a53/config=0x11/}' \
 ls

     <not counted>      context-switches                                              (0.00%)
     <not counted>      armv8_cortex_a57/config=0x11/                                 (0.00%)
                24      context-switches                                              (37.36%)
          57589154      armv8_cortex_a53/config=0x11/                                 (37.36%)

Here the 'a53' event group was always eligible to be scheduled, but
the 'a57' group never eligible to be scheduled, as the task was always
affine to a Cortex-A53 CPU. The SW (group leader) event in the 'a57'
group was eligible, but the HW event failed at pmu::add() time,
resulting in ctx_flexible_sched_in giving up on scheduling further
groups with HW events.

One way of avoiding this is to check pmu::filter_match() on siblings
as well as the group leader. If any of these fail their
pmu::filter_match() call, we must skip the entire group before
attempting to add any events.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Fixes: 66eb579e66ec ("perf: allow for PMU-specific event filtering")
Link: http://lkml.kernel.org/r/1465917041-15339-1-git-send-email-mark.rutland@arm.com
[ Small readability edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 85cd418..43d43a2d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1678,12 +1678,33 @@ static bool is_orphaned_event(struct perf_event *event)
 	return event->state == PERF_EVENT_STATE_DEAD;
 }
 
-static inline int pmu_filter_match(struct perf_event *event)
+static inline int __pmu_filter_match(struct perf_event *event)
 {
 	struct pmu *pmu = event->pmu;
 	return pmu->filter_match ? pmu->filter_match(event) : 1;
 }
 
+/*
+ * Check whether we should attempt to schedule an event group based on
+ * PMU-specific filtering. An event group can consist of HW and SW events,
+ * potentially with a SW leader, so we must check all the filters, to
+ * determine whether a group is schedulable:
+ */
+static inline int pmu_filter_match(struct perf_event *event)
+{
+	struct perf_event *child;
+
+	if (!__pmu_filter_match(event))
+		return 0;
+
+	list_for_each_entry(child, &event->sibling_list, group_entry) {
+		if (!__pmu_filter_match(child))
+			return 0;
+	}
+
+	return 1;
+}
+
 static inline int
 event_filter_match(struct perf_event *event)
 {

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-07-07  9:29 UTC | newest]

Thread overview: 8+ messages
2016-06-14 15:10 [PATCH] perf: fix pmu::filter_match for SW-led groups Mark Rutland
2016-07-02 16:40 ` Peter Zijlstra
2016-07-04 18:05   ` Mark Rutland
2016-07-05  8:35     ` Peter Zijlstra
2016-07-05  9:44       ` Mark Rutland
2016-07-05 12:04         ` Peter Zijlstra
2016-07-05 12:52           ` Mark Rutland
2016-07-07  8:31 ` [tip:perf/core] perf/core: Fix " tip-bot for Mark Rutland
