* [PATCH 0/2] Optimize cgroup ctx switch and remove cpuctx->unique_pmu
@ 2017-01-17 17:38 David Carrillo-Cisneros
  2017-01-17 17:38 ` [PATCH 1/2] perf/core: Make cgroup switch visit only cpuctxs with cgroup events David Carrillo-Cisneros
  2017-01-17 17:38 ` [PATCH 2/2] perf/core: Remove perf_cpu_context::unique_pmu David Carrillo-Cisneros
  0 siblings, 2 replies; 6+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-17 17:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Borislav Petkov, Srinivas Pandruvada,
	Dave Hansen, Vikas Shivappa, Mark Rutland,
	Arnaldo Carvalho de Melo, Vince Weaver, Paul Turner,
	Stephane Eranian, David Carrillo-Cisneros

The first patch is a resend of ("perf/core: Make cgroup switch visit only
cpuctxs with cgroup events") with an updated changelog.

The second patch cleans up the now-unused cpuctx->unique_pmu.

David Carrillo-Cisneros (2):
  perf/core: Make cgroup switch visit only cpuctxs with cgroup events
  perf/core: Remove perf_cpu_context::unique_pmu

 include/linux/perf_event.h |   2 +-
 kernel/events/core.c       | 129 ++++++++++++++++-----------------------------
 2 files changed, 47 insertions(+), 84 deletions(-)

-- 
2.11.0.483.g087da7b7c-goog


* [PATCH 1/2] perf/core: Make cgroup switch visit only cpuctxs with cgroup events
  2017-01-17 17:38 [PATCH 0/2] Optimize cgroup ctx switch and remove cpuctx->unique_pmu David Carrillo-Cisneros
@ 2017-01-17 17:38 ` David Carrillo-Cisneros
  2017-01-18 12:11   ` Mark Rutland
  2017-01-17 17:38 ` [PATCH 2/2] perf/core: Remove perf_cpu_context::unique_pmu David Carrillo-Cisneros
  1 sibling, 1 reply; 6+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-17 17:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Borislav Petkov, Srinivas Pandruvada,
	Dave Hansen, Vikas Shivappa, Mark Rutland,
	Arnaldo Carvalho de Melo, Vince Weaver, Paul Turner,
	Stephane Eranian, David Carrillo-Cisneros

v3: This is a resend of ("perf/core: only check cpuctx with cgroup
    events during cgroup switch") with an updated changelog.
v2: Fix build when CONFIG_CGROUP_PERF is not set.

This patch follows from a conversation in CQM/CMT's last series about
speeding up the context switch for cgroup events:

https://patchwork.kernel.org/patch/9478617/

This is a low-hanging fruit optimization. It replaces the iteration over
the "pmus" list in the cgroup switch path with an iteration over a new
per-CPU list that contains only the cpuctxs with at least one cgroup event.

This is necessary because the number of PMUs has increased over the years;
e.g., modern x86 server systems have well above 50 PMUs.
Iterating over the full PMU list is unnecessary and can be costly in
heavy cache contention scenarios.
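
In essence, the cgroup switch path goes from walking every registered pmu
(a simplified sketch of the code changed in the diff below, with the actual
event scheduling elided):

	list_for_each_entry_rcu(pmu, &pmus, entry) {
		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
		if (cpuctx->unique_pmu != pmu)
			continue;	/* visit each cpuctx only once */
		if (cpuctx->ctx.nr_cgroups > 0)
			; /* sched out/in the cgroup events */
	}

to walking only the cpuctxs that currently have cgroup events:

	list = this_cpu_ptr(&cgrp_cpuctx_list);
	list_for_each_entry(cpuctx, list, cgrp_cpuctx_entry)
		; /* sched out/in the cgroup events */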

Below are instrumentation measurements with the 10th, 50th and 90th
percentiles of the total context switch cost before and after this
optimization, for a simple array read/write microbenchmark.

Contention
   Level    Nr events      Before (us)            After (us)       Median
 L2    L3     types      (10%, 50%, 90%)       (10%, 50%, 90%)    Speedup
--------------------------------------------------------------------------
Low   Low       1       (1.72, 2.42, 5.85)    (1.35, 1.64, 5.46)     29%
High  Low       1       (2.08, 4.56, 19.8)    (1.72, 2.20, 13.7)     51%
High  High      1       (2.86, 10.4, 12.7)    (2.54, 4.32, 12.1)     58%

Low   Low       2       (1.98, 3.20, 6.89)    (1.68, 2.41, 8.89)     24%
High  Low       2       (2.48, 5.28, 22.4)    (2.15, 3.69, 14.6)     30%
High  High      2       (3.32, 8.09, 13.9)    (2.80, 5.15, 13.7)     36%

where:
 1 event type  = cycles
 2 event types = cycles,intel_cqm/llc_occupancy/

Contention L2 Low:  workset  <  L2 cache size.
              High: workset >>  L2 cache size.
Contention L3 Low:  workset of tasks on all sockets  <  L3 cache size.
              High: workset of tasks on all sockets >>  L3 cache size.

Median Speedup is (50%ile Before - 50%ile After) /  50%ile Before
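For example, for the High-L2 / High-L3 row with one event type:
(10.4 - 4.32) / 10.4 ~= 0.58, i.e. the 58% shown above.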

Unsurprisingly, the benefit of this optimization decreases with the number
of cpuctxs that have cgroup events, yet it is never detrimental.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h |  1 +
 kernel/events/core.c       | 98 +++++++++++++++++++++-------------------------
 2 files changed, 46 insertions(+), 53 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 4741ecdb9817..10d223572855 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -788,6 +788,7 @@ struct perf_cpu_context {
 	struct pmu			*unique_pmu;
 #ifdef CONFIG_CGROUP_PERF
 	struct perf_cgroup		*cgrp;
+	struct list_head		cgrp_cpuctx_entry;
 #endif
 
 	struct list_head		sched_cb_entry;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ab15509fab8c..ef61a9e7328d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -678,6 +678,8 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 	info->timestamp = ctx->timestamp;
 }
 
+static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
+
 #define PERF_CGROUP_SWOUT	0x1 /* cgroup switch out every event */
 #define PERF_CGROUP_SWIN	0x2 /* cgroup switch in events based on task */
 
@@ -690,61 +692,46 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 static void perf_cgroup_switch(struct task_struct *task, int mode)
 {
 	struct perf_cpu_context *cpuctx;
-	struct pmu *pmu;
+	struct list_head *list;
 	unsigned long flags;
 
 	/*
-	 * disable interrupts to avoid geting nr_cgroup
-	 * changes via __perf_event_disable(). Also
-	 * avoids preemption.
+	 * Disable interrupts and preemption to avoid this CPU's
+	 * cgrp_cpuctx_entry to change under us.
 	 */
 	local_irq_save(flags);
 
-	/*
-	 * we reschedule only in the presence of cgroup
-	 * constrained events.
-	 */
+	list = this_cpu_ptr(&cgrp_cpuctx_list);
+	list_for_each_entry(cpuctx, list, cgrp_cpuctx_entry) {
+		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
 
-	list_for_each_entry_rcu(pmu, &pmus, entry) {
-		cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
-		if (cpuctx->unique_pmu != pmu)
-			continue; /* ensure we process each cpuctx once */
+		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+		perf_pmu_disable(cpuctx->ctx.pmu);
 
-		/*
-		 * perf_cgroup_events says at least one
-		 * context on this CPU has cgroup events.
-		 *
-		 * ctx->nr_cgroups reports the number of cgroup
-		 * events for a context.
-		 */
-		if (cpuctx->ctx.nr_cgroups > 0) {
-			perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-			perf_pmu_disable(cpuctx->ctx.pmu);
-
-			if (mode & PERF_CGROUP_SWOUT) {
-				cpu_ctx_sched_out(cpuctx, EVENT_ALL);
-				/*
-				 * must not be done before ctxswout due
-				 * to event_filter_match() in event_sched_out()
-				 */
-				cpuctx->cgrp = NULL;
-			}
+		if (mode & PERF_CGROUP_SWOUT) {
+			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			/*
+			 * must not be done before ctxswout due
+			 * to event_filter_match() in event_sched_out()
+			 */
+			cpuctx->cgrp = NULL;
+		}
 
-			if (mode & PERF_CGROUP_SWIN) {
-				WARN_ON_ONCE(cpuctx->cgrp);
-				/*
-				 * set cgrp before ctxsw in to allow
-				 * event_filter_match() to not have to pass
-				 * task around
-				 * we pass the cpuctx->ctx to perf_cgroup_from_task()
-				 * because cgorup events are only per-cpu
-				 */
-				cpuctx->cgrp = perf_cgroup_from_task(task, &cpuctx->ctx);
-				cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
-			}
-			perf_pmu_enable(cpuctx->ctx.pmu);
-			perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
+		if (mode & PERF_CGROUP_SWIN) {
+			WARN_ON_ONCE(cpuctx->cgrp);
+			/*
+			 * set cgrp before ctxsw in to allow
+			 * event_filter_match() to not have to pass
+			 * task around
+			 * we pass the cpuctx->ctx to perf_cgroup_from_task()
+			 * because cgorup events are only per-cpu
+			 */
+			cpuctx->cgrp = perf_cgroup_from_task(task,
+							     &cpuctx->ctx);
+			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
 		}
+		perf_pmu_enable(cpuctx->ctx.pmu);
+		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 	}
 
 	local_irq_restore(flags);
@@ -889,6 +876,7 @@ list_update_cgroup_event(struct perf_event *event,
 			 struct perf_event_context *ctx, bool add)
 {
 	struct perf_cpu_context *cpuctx;
+	struct list_head *lentry;
 
 	if (!is_cgroup_event(event))
 		return;
@@ -902,15 +890,16 @@ list_update_cgroup_event(struct perf_event *event,
 	 * this will always be called from the right CPU.
 	 */
 	cpuctx = __get_cpu_context(ctx);
-
-	/*
-	 * cpuctx->cgrp is NULL until a cgroup event is sched in or
-	 * ctx->nr_cgroup == 0 .
-	 */
-	if (add && perf_cgroup_from_task(current, ctx) == event->cgrp)
-		cpuctx->cgrp = event->cgrp;
-	else if (!add)
+	lentry = &cpuctx->cgrp_cpuctx_entry;
+	/* cpuctx->cgrp is NULL unless a cgroup event is active in this CPU .*/
+	if (add) {
+		list_add(lentry, this_cpu_ptr(&cgrp_cpuctx_list));
+		if (perf_cgroup_from_task(current, ctx) == event->cgrp)
+			cpuctx->cgrp = event->cgrp;
+	} else {
+		list_del(lentry);
 		cpuctx->cgrp = NULL;
+	}
 }
 
 #else /* !CONFIG_CGROUP_PERF */
@@ -10595,6 +10584,9 @@ static void __init perf_event_init_all_cpus(void)
 		INIT_LIST_HEAD(&per_cpu(pmu_sb_events.list, cpu));
 		raw_spin_lock_init(&per_cpu(pmu_sb_events.lock, cpu));
 
+#ifdef CONFIG_CGROUP_PERF
+		INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
+#endif
 		INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
 	}
 }
-- 
2.11.0.483.g087da7b7c-goog


* [PATCH 2/2] perf/core: Remove perf_cpu_context::unique_pmu
  2017-01-17 17:38 [PATCH 0/2] Optimize cgroup ctx switch and remove cpuctx->unique_pmu David Carrillo-Cisneros
  2017-01-17 17:38 ` [PATCH 1/2] perf/core: Make cgroup switch visit only cpuctxs with cgroup events David Carrillo-Cisneros
@ 2017-01-17 17:38 ` David Carrillo-Cisneros
  2017-01-18 12:25   ` Mark Rutland
  1 sibling, 1 reply; 6+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-17 17:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: x86, Ingo Molnar, Thomas Gleixner, Andi Kleen, Kan Liang,
	Peter Zijlstra, Borislav Petkov, Srinivas Pandruvada,
	Dave Hansen, Vikas Shivappa, Mark Rutland,
	Arnaldo Carvalho de Melo, Vince Weaver, Paul Turner,
	Stephane Eranian, David Carrillo-Cisneros

cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs
with shared pmus in order to avoid visiting the same cpuctx more than once
in a for_each_pmu loop.

cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts, since they
have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in
hw contexts, this patch replaces cpuctx->unique_pmu with cpuctx->pmu there.

The change above, together with the previous patch in this series, removes
the remaining uses of cpuctx->unique_pmu, so the field can be dropped
altogether.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
---
 include/linux/perf_event.h |  1 -
 kernel/events/core.c       | 31 +------------------------------
 2 files changed, 1 insertion(+), 31 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 10d223572855..4f8a1f681aac 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -785,7 +785,6 @@ struct perf_cpu_context {
 	ktime_t				hrtimer_interval;
 	unsigned int			hrtimer_active;
 
-	struct pmu			*unique_pmu;
 #ifdef CONFIG_CGROUP_PERF
 	struct perf_cgroup		*cgrp;
 	struct list_head		cgrp_cpuctx_entry;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ef61a9e7328d..57324d67e128 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2883,7 +2883,7 @@ static void perf_pmu_sched_task(struct task_struct *prev,
 		return;
 
 	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
-		pmu = cpuctx->unique_pmu; /* software PMUs will not have sched_task */
+		pmu = cpuctx->ctx.pmu; /* software PMUs will not have sched_task */
 
 		if (WARN_ON_ONCE(!pmu->sched_task))
 			continue;
@@ -8572,37 +8572,10 @@ static struct perf_cpu_context __percpu *find_pmu_context(int ctxn)
 	return NULL;
 }
 
-static void update_pmu_context(struct pmu *pmu, struct pmu *old_pmu)
-{
-	int cpu;
-
-	for_each_possible_cpu(cpu) {
-		struct perf_cpu_context *cpuctx;
-
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-
-		if (cpuctx->unique_pmu == old_pmu)
-			cpuctx->unique_pmu = pmu;
-	}
-}
-
 static void free_pmu_context(struct pmu *pmu)
 {
-	struct pmu *i;
-
 	mutex_lock(&pmus_lock);
-	/*
-	 * Like a real lame refcount.
-	 */
-	list_for_each_entry(i, &pmus, entry) {
-		if (i->pmu_cpu_context == pmu->pmu_cpu_context) {
-			update_pmu_context(i, pmu);
-			goto out;
-		}
-	}
-
 	free_percpu(pmu->pmu_cpu_context);
-out:
 	mutex_unlock(&pmus_lock);
 }
 
@@ -8806,8 +8779,6 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 		cpuctx->ctx.pmu = pmu;
 
 		__perf_mux_hrtimer_init(cpuctx, cpu);
-
-		cpuctx->unique_pmu = pmu;
 	}
 
 got_cpu_context:
-- 
2.11.0.483.g087da7b7c-goog


* Re: [PATCH 1/2] perf/core: Make cgroup switch visit only cpuctxs with cgroup events
  2017-01-17 17:38 ` [PATCH 1/2] perf/core: Make cgroup switch visit only cpuctxs with cgroup events David Carrillo-Cisneros
@ 2017-01-18 12:11   ` Mark Rutland
  2017-01-18 19:26     ` David Carrillo-Cisneros
  0 siblings, 1 reply; 6+ messages in thread
From: Mark Rutland @ 2017-01-18 12:11 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Thomas Gleixner, Andi Kleen,
	Kan Liang, Peter Zijlstra, Borislav Petkov, Srinivas Pandruvada,
	Dave Hansen, Vikas Shivappa, Arnaldo Carvalho de Melo,
	Vince Weaver, Paul Turner, Stephane Eranian

On Tue, Jan 17, 2017 at 09:38:39AM -0800, David Carrillo-Cisneros wrote:
> This is a low-hanging fruit optimization. It replaces the iteration over
> the "pmus" list in the cgroup switch path with an iteration over a new
> per-CPU list that contains only the cpuctxs with at least one cgroup event.
> 
> This is necessary because the number of PMUs has increased over the years;
> e.g., modern x86 server systems have well above 50 PMUs.
> Iterating over the full PMU list is unnecessary and can be costly in
> heavy cache contention scenarios.

While I haven't done any measurement of the overhead, this looks like a
nice rework/cleanup.

Since this is only changing the management of cpu contexts, this
shouldn't adversely affect systems with heterogeneous CPUs. I've also
given this a spin on such a system, to no ill effect.

I have one (very minor) comment below, but either way:

Acked-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>

> @@ -889,6 +876,7 @@ list_update_cgroup_event(struct perf_event *event,
>  			 struct perf_event_context *ctx, bool add)
>  {
>  	struct perf_cpu_context *cpuctx;
> +	struct list_head *lentry;

It might be worth calling this cpuctx_entry, so that it's clear which
list element it refers to. I can imagine we'll add more list
manipulation in this path in future.

Thanks,
Mark.


* Re: [PATCH 2/2] perf/core: Remove perf_cpu_context::unique_pmu
  2017-01-17 17:38 ` [PATCH 2/2] perf/core: Remove perf_cpu_context::unique_pmu David Carrillo-Cisneros
@ 2017-01-18 12:25   ` Mark Rutland
  0 siblings, 0 replies; 6+ messages in thread
From: Mark Rutland @ 2017-01-18 12:25 UTC (permalink / raw)
  To: David Carrillo-Cisneros
  Cc: linux-kernel, x86, Ingo Molnar, Thomas Gleixner, Andi Kleen,
	Kan Liang, Peter Zijlstra, Borislav Petkov, Srinivas Pandruvada,
	Dave Hansen, Vikas Shivappa, Arnaldo Carvalho de Melo,
	Vince Weaver, Paul Turner, Stephane Eranian

On Tue, Jan 17, 2017 at 09:38:40AM -0800, David Carrillo-Cisneros wrote:
> cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs
> with shared pmus in order to avoid visiting the same cpuctx more than once
> in a for_each_pmu loop.
> 
> cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts, since they
> have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in
> hw contexts, this patch replaces cpuctx->unique_pmu with cpuctx->pmu there.
> 
> The change above, together with the previous patch in this series, removes
> the remaining uses of cpuctx->unique_pmu, so the field can be dropped altogether.
> 
> Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>

I don't have any HW with a PMU that needs a sched_task callback, so I'm
unable to give this a full test, but it looks sane to me. FWIW:

Acked-by: Mark Rutland <mark.rutland@arm.com>

Thanks,
Mark.


* Re: [PATCH 1/2] perf/core: Make cgroup switch visit only cpuctxs with cgroup events
  2017-01-18 12:11   ` Mark Rutland
@ 2017-01-18 19:26     ` David Carrillo-Cisneros
  0 siblings, 0 replies; 6+ messages in thread
From: David Carrillo-Cisneros @ 2017-01-18 19:26 UTC (permalink / raw)
  To: Mark Rutland
  Cc: linux-kernel, x86, Ingo Molnar, Thomas Gleixner, Andi Kleen,
	Kan Liang, Peter Zijlstra, Borislav Petkov, Srinivas Pandruvada,
	Dave Hansen, Vikas Shivappa, Arnaldo Carvalho de Melo,
	Vince Weaver, Paul Turner, Stephane Eranian

On Wed, Jan 18, 2017 at 4:11 AM, Mark Rutland <mark.rutland@arm.com> wrote:
> On Tue, Jan 17, 2017 at 09:38:39AM -0800, David Carrillo-Cisneros wrote:
>> This is a low-hanging fruit optimization. It replaces the iteration over
>> the "pmus" list in the cgroup switch path with an iteration over a new
>> per-CPU list that contains only the cpuctxs with at least one cgroup event.
>>
>> This is necessary because the number of PMUs has increased over the years;
>> e.g., modern x86 server systems have well above 50 PMUs.
>> Iterating over the full PMU list is unnecessary and can be costly in
>> heavy cache contention scenarios.
>
> While I haven't done any measurement of the overhead, this looks like a
> nice rework/cleanup.
>
> Since this is only changing the management of cpu contexts, this
> shouldn't adversely affect systems with heterogeneous CPUs. I've also
> given this a spin on such a system, to no ill effect.
>
> I have one (very minor) comment below, but either way:
>
> Acked-by: Mark Rutland <mark.rutland@arm.com>
> Tested-by: Mark Rutland <mark.rutland@arm.com>

Thanks for reviewing it. I made the lil' change you proposed and added your Ack.

>
>> @@ -889,6 +876,7 @@ list_update_cgroup_event(struct perf_event *event,
>>                        struct perf_event_context *ctx, bool add)
>>  {
>>       struct perf_cpu_context *cpuctx;
>> +     struct list_head *lentry;
>
> It might be worth calling this cpuctx_entry, so that it's clear which
> list element it refers to. I can imagine we'll add more list
> manipulation in this path in future.
>
> Thanks,
> Mark.


