* [PATCH v2 0/6] perf/core: Fixes and cleanup for cgroup events
@ 2022-03-22 12:08 Chengming Zhou
  2022-03-22 12:08 ` [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in Chengming Zhou
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 12:08 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, eranian
  Cc: linux-perf-users, linux-kernel, duanxiongchun, songmuchun,
	Chengming Zhou

Hi,

This patchset starts from the v1 RFC patch[1] that tries to fix a race
between perf_cgroup_switch() and perf_cgroup_attach(), then changes
the users of perf_cgroup_from_task() to use the stable cpuctx->cgrp
when starting and stopping perf_cgroup active time, for a similar
reason. The remaining patches remove some obsolete comments and clean
up the code.

[1] https://lore.kernel.org/lkml/20220308135948.55336-1-zhouchengming@bytedance.com/

v2:
 - split the race fix into two patches for easier review
 - use cpuctx->cgrp when starting ctx time and delete the unused task argument
 - use cpuctx->cgrp when updating ctx time from a cgroup perf_event
 - always set cpuctx->cgrp when the first cgroup event is enabled
 - remove obsolete comments

Chengming Zhou (6):
  perf/core: Fix inconsistency between cgroup sched_out and sched_in
  perf/core: Introduce percpu perf_cgroup
  perf/core: Don't pass task around when ctx sched in
  perf/core: Use stable cpuctx->cgrp when update perf cgroup time
  perf/core: Always set cpuctx cgrp when enable cgroup event
  perf/core: Don't need event_filter_match when merge_sched_in()

 kernel/events/core.c | 218 ++++++++++++-------------------------------
 1 file changed, 60 insertions(+), 158 deletions(-)

-- 
2.20.1



* [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in
  2022-03-22 12:08 [PATCH v2 0/6] perf/core: Fixes and cleanup for cgroup events Chengming Zhou
@ 2022-03-22 12:08 ` Chengming Zhou
  2022-03-22 12:59   ` Peter Zijlstra
  2022-03-22 12:08 ` [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup Chengming Zhou
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 12:08 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, eranian
  Cc: linux-perf-users, linux-kernel, duanxiongchun, songmuchun,
	Chengming Zhou

There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
in perf_cgroup_switch().

CPU1					CPU2
(in context_switch)			(attach running task)
perf_cgroup_sched_out(prev, next)
	cgrp1 == cgrp2 is True
					next->cgroups = cgrp3
					perf_cgroup_attach()
perf_cgroup_sched_in(prev, next)
	cgrp1 == cgrp3 is False

Commit a8d757ef076f ("perf events: Fix slow and broken cgroup
context switch code") skips the cpuctx switch out/in when the
perf_cgroup of "prev" and "next" is the same.

But the perf_cgroup of a task can change concurrently with
context_switch(). If cgrp1 == cgrp2 in sched_out(), cpuctx won't be
switched out. If the task's perf_cgroup then changes so that
cgrp1 != cgrp2 in sched_in(), cpuctx will be switched in and trigger
the WARN_ON_ONCE(cpuctx->cgrp).

Even though perf_cgroup_switch() itself is serialized because the
context switch runs with interrupts disabled, it can still see
task->cgroups change in the middle.

So this patch combines perf_cgroup_sched_in() into
perf_cgroup_sched_out() and renames the result to
perf_cgroup_sched_switch(), fixing the inconsistency between cgroup
sched_out and sched_in.
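
To make the window concrete, here is a minimal userspace model of the
race (a sketch under assumed simplifications: one mutable pointer
stands in for task->cgroups, the two comparisons stand in for the
sched_out/sched_in checks, and all names are hypothetical):

#include <stdatomic.h>
#include <stdio.h>

struct cgrp { int id; };

/* models task->cgroups: written by the attach path with no
 * serialization against the context switch */
static _Atomic(struct cgrp *) task_cgroup;

int main(void)
{
	struct cgrp cgrp2 = { 2 }, cgrp3 = { 3 };
	struct cgrp *prev_cgrp = &cgrp2;	/* models cgrp1 */

	atomic_store(&task_cgroup, &cgrp2);

	/* sched_out side: cgrp1 == cgrp2, so nothing is switched out */
	int skipped_out = (prev_cgrp == atomic_load(&task_cgroup));

	/* the attach path runs here on another CPU */
	atomic_store(&task_cgroup, &cgrp3);

	/* sched_in side: cgrp1 != cgrp3, so a switch-in is attempted on
	 * a cpuctx that was never switched out -> WARN_ON_ONCE fires */
	int switched_in = (prev_cgrp != atomic_load(&task_cgroup));

	printf("skipped out: %d, switched in: %d\n", skipped_out, switched_in);
	return 0;
}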

Fixes: a8d757ef076f ("perf events: Fix slow and broken cgroup context switch code")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/events/core.c | 101 ++++++++++---------------------------------
 1 file changed, 24 insertions(+), 77 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6859229497b1..8b5cf2aedfe6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -828,16 +828,10 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 
 static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
 
-#define PERF_CGROUP_SWOUT	0x1 /* cgroup switch out every event */
-#define PERF_CGROUP_SWIN	0x2 /* cgroup switch in events based on task */
-
 /*
  * reschedule events based on the cgroup constraint of task.
- *
- * mode SWOUT : schedule out everything
- * mode SWIN : schedule in based on cgroup for next
  */
-static void perf_cgroup_switch(struct task_struct *task, int mode)
+static void perf_cgroup_switch(struct task_struct *task)
 {
 	struct perf_cpu_context *cpuctx, *tmp;
 	struct list_head *list;
@@ -856,28 +850,22 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 		perf_pmu_disable(cpuctx->ctx.pmu);
 
-		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
-			/*
-			 * must not be done before ctxswout due
-			 * to event_filter_match() in event_sched_out()
-			 */
-			cpuctx->cgrp = NULL;
-		}
+		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+		/*
+		 * must not be done before ctxswout due
+		 * to event_filter_match() in event_sched_out()
+		 */
+		cpuctx->cgrp = perf_cgroup_from_task(task,
+						     &cpuctx->ctx);
+		/*
+		 * set cgrp before ctxsw in to allow
+		 * event_filter_match() to not have to pass
+		 * task around
+		 * we pass the cpuctx->ctx to perf_cgroup_from_task()
+		 * because cgroup events are only per-cpu
+		 */
+		cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
 
-		if (mode & PERF_CGROUP_SWIN) {
-			WARN_ON_ONCE(cpuctx->cgrp);
-			/*
-			 * set cgrp before ctxsw in to allow
-			 * event_filter_match() to not have to pass
-			 * task around
-			 * we pass the cpuctx->ctx to perf_cgroup_from_task()
-			 * because cgorup events are only per-cpu
-			 */
-			cpuctx->cgrp = perf_cgroup_from_task(task,
-							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
-		}
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 	}
@@ -885,8 +873,8 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 	local_irq_restore(flags);
 }
 
-static inline void perf_cgroup_sched_out(struct task_struct *task,
-					 struct task_struct *next)
+static inline void perf_cgroup_sched_switch(struct task_struct *task,
+					    struct task_struct *next)
 {
 	struct perf_cgroup *cgrp1;
 	struct perf_cgroup *cgrp2 = NULL;
@@ -906,33 +894,7 @@ static inline void perf_cgroup_sched_out(struct task_struct *task,
 	 * do no touch the cgroup events.
 	 */
 	if (cgrp1 != cgrp2)
-		perf_cgroup_switch(task, PERF_CGROUP_SWOUT);
-
-	rcu_read_unlock();
-}
-
-static inline void perf_cgroup_sched_in(struct task_struct *prev,
-					struct task_struct *task)
-{
-	struct perf_cgroup *cgrp1;
-	struct perf_cgroup *cgrp2 = NULL;
-
-	rcu_read_lock();
-	/*
-	 * we come here when we know perf_cgroup_events > 0
-	 * we do not need to pass the ctx here because we know
-	 * we are holding the rcu lock
-	 */
-	cgrp1 = perf_cgroup_from_task(task, NULL);
-	cgrp2 = perf_cgroup_from_task(prev, NULL);
-
-	/*
-	 * only need to schedule in cgroup events if we are changing
-	 * cgroup during ctxsw. Cgroup events were not scheduled
-	 * out of ctxsw out if that was not the case.
-	 */
-	if (cgrp1 != cgrp2)
-		perf_cgroup_switch(task, PERF_CGROUP_SWIN);
+		perf_cgroup_switch(task);
 
 	rcu_read_unlock();
 }
@@ -1100,13 +1062,8 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
 {
 }
 
-static inline void perf_cgroup_sched_out(struct task_struct *task,
-					 struct task_struct *next)
-{
-}
-
-static inline void perf_cgroup_sched_in(struct task_struct *prev,
-					struct task_struct *task)
+static inline void perf_cgroup_sched_switch(struct task_struct *task,
+					    struct task_struct *next)
 {
 }
 
@@ -1124,7 +1081,7 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 }
 
 static inline void
-perf_cgroup_switch(struct task_struct *task, struct task_struct *next)
+perf_cgroup_sched_switch(struct task_struct *task, struct task_struct *next)
 {
 }
 
@@ -3668,7 +3625,7 @@ void __perf_event_task_sched_out(struct task_struct *task,
 	 * cgroup event are system-wide mode only
 	 */
 	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
-		perf_cgroup_sched_out(task, next);
+		perf_cgroup_sched_switch(task, next);
 }
 
 /*
@@ -3984,16 +3941,6 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	struct perf_event_context *ctx;
 	int ctxn;
 
-	/*
-	 * If cgroup events exist on this CPU, then we need to check if we have
-	 * to switch in PMU state; cgroup event are system-wide mode only.
-	 *
-	 * Since cgroup events are CPU events, we must schedule these in before
-	 * we schedule in the task events.
-	 */
-	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
-		perf_cgroup_sched_in(prev, task);
-
 	for_each_task_context_nr(ctxn) {
 		ctx = task->perf_event_ctxp[ctxn];
 		if (likely(!ctx))
@@ -13564,7 +13511,7 @@ static int __perf_cgroup_move(void *info)
 {
 	struct task_struct *task = info;
 	rcu_read_lock();
-	perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN);
+	perf_cgroup_switch(task);
 	rcu_read_unlock();
 	return 0;
 }
-- 
2.20.1



* [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-22 12:08 [PATCH v2 0/6] perf/core: Fixes and cleanup for cgroup events Chengming Zhou
  2022-03-22 12:08 ` [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in Chengming Zhou
@ 2022-03-22 12:08 ` Chengming Zhou
  2022-03-22 13:01   ` Peter Zijlstra
                     ` (2 more replies)
  2022-03-22 12:08 ` [PATCH v2 3/6] perf/core: Don't pass task around when ctx sched in Chengming Zhou
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 12:08 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, eranian
  Cc: linux-perf-users, linux-kernel, duanxiongchun, songmuchun,
	Chengming Zhou

Although we no longer have the inconsistency problem, we can
still hit another problem:

CPU1					CPU2
(in context_switch)			(attach running task)
					prev->cgroups = cgrp2
perf_cgroup_sched_switch(prev, next)
	cgrp2 == cgrp2 is True

If the perf_cgroup of the prev task changes from cgrp1 to cgrp2,
perf_cgroup_sched_switch() will skip perf_cgroup_switch(), so the
CPU will still schedule the cgrp1 events when it should schedule
the cgrp2 events.

The root cause is that we shouldn't use the changeable prev->cgroups
to decide whether to skip perf_cgroup_switch().

This patch introduces a percpu perf_cgroup to cache the perf_cgroup
that was scheduled in on the cpuctxes, which is later compared with
the perf_cgroup of the next task to decide whether to skip
perf_cgroup_switch().

Since perf_cgroup_switch() can be called after the context switch,
the cgroup events might already be scheduled. So we put the
comparison of perf_cgroups in perf_cgroup_switch(), and delete the
now-unused function perf_cgroup_sched_switch().

We must clear the percpu perf_cgroup cache when the last cgroup
event is disabled.
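
The core idea can be modeled in a few lines (a minimal userspace
sketch under assumed simplifications: one variable stands in for this
CPU's cpu_perf_cgroup, the cpuctx list walk and locking are elided,
and all names are hypothetical):

#include <stdbool.h>

struct cgrp { int id; };

/* models this CPU's cpu_perf_cgroup cache; it is only written below,
 * so a concurrent attach cannot change what we compare against */
static struct cgrp *cpu_cache;

/* returns true when the switch work can be skipped */
static bool cgroup_switch(struct cgrp *next_cgrp)
{
	/* compare against what this CPU actually has scheduled in,
	 * not against the mutable prev->cgroups */
	if (next_cgrp == cpu_cache)
		return true;

	cpu_cache = next_cgrp;
	/* ... sched out the old cgroup events, sched in next_cgrp ... */
	return false;
}

int main(void)
{
	struct cgrp a = { 1 }, b = { 2 };

	cgroup_switch(&a);	/* first switch: does the work */
	cgroup_switch(&a);	/* same cgroup again: skipped */
	cgroup_switch(&b);	/* different cgroup: does the work */
	return 0;
}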

Fixes: a8d757ef076f ("perf events: Fix slow and broken cgroup context switch code")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/events/core.c | 63 ++++++++++++++++----------------------------
 1 file changed, 22 insertions(+), 41 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8b5cf2aedfe6..848a3bfa9513 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -826,6 +826,7 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 	}
 }
 
+static DEFINE_PER_CPU(struct perf_cgroup *, cpu_perf_cgroup);
 static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
 
 /*
@@ -833,6 +834,7 @@ static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
  */
 static void perf_cgroup_switch(struct task_struct *task)
 {
+	struct perf_cgroup *cgrp;
 	struct perf_cpu_context *cpuctx, *tmp;
 	struct list_head *list;
 	unsigned long flags;
@@ -843,11 +845,21 @@ static void perf_cgroup_switch(struct task_struct *task)
 	 */
 	local_irq_save(flags);
 
+	cgrp = perf_cgroup_from_task(task, NULL);
+	if (cgrp == __this_cpu_read(cpu_perf_cgroup))
+		goto out;
+
+	__this_cpu_write(cpu_perf_cgroup, cgrp);
+
 	list = this_cpu_ptr(&cgrp_cpuctx_list);
 	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
 		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
 
 		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+		if (cpuctx->cgrp == cgrp)
+			continue;
+
 		perf_pmu_disable(cpuctx->ctx.pmu);
 
 		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
@@ -855,14 +867,11 @@ static void perf_cgroup_switch(struct task_struct *task)
 		 * must not be done before ctxswout due
 		 * to event_filter_match() in event_sched_out()
 		 */
-		cpuctx->cgrp = perf_cgroup_from_task(task,
-						     &cpuctx->ctx);
+		cpuctx->cgrp = cgrp;
 		/*
 		 * set cgrp before ctxsw in to allow
 		 * event_filter_match() to not have to pass
 		 * task around
-		 * we pass the cpuctx->ctx to perf_cgroup_from_task()
-		 * because cgroup events are only per-cpu
 		 */
 		cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
 
@@ -870,35 +879,10 @@ static void perf_cgroup_switch(struct task_struct *task)
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 	}
 
+out:
 	local_irq_restore(flags);
 }
 
-static inline void perf_cgroup_sched_switch(struct task_struct *task,
-					    struct task_struct *next)
-{
-	struct perf_cgroup *cgrp1;
-	struct perf_cgroup *cgrp2 = NULL;
-
-	rcu_read_lock();
-	/*
-	 * we come here when we know perf_cgroup_events > 0
-	 * we do not need to pass the ctx here because we know
-	 * we are holding the rcu lock
-	 */
-	cgrp1 = perf_cgroup_from_task(task, NULL);
-	cgrp2 = perf_cgroup_from_task(next, NULL);
-
-	/*
-	 * only schedule out current cgroup events if we know
-	 * that we are switching to a different cgroup. Otherwise,
-	 * do no touch the cgroup events.
-	 */
-	if (cgrp1 != cgrp2)
-		perf_cgroup_switch(task);
-
-	rcu_read_unlock();
-}
-
 static int perf_cgroup_ensure_storage(struct perf_event *event,
 				struct cgroup_subsys_state *css)
 {
@@ -1035,6 +1019,9 @@ perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *c
 		cpuctx->cgrp = NULL;
 
 	list_del(&cpuctx->cgrp_cpuctx_entry);
+
+	if (list_empty(per_cpu_ptr(&cgrp_cpuctx_list, event->cpu)))
+		__this_cpu_write(cpu_perf_cgroup, NULL);
 }
 
 #else /* !CONFIG_CGROUP_PERF */
@@ -1062,11 +1049,6 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
 {
 }
 
-static inline void perf_cgroup_sched_switch(struct task_struct *task,
-					    struct task_struct *next)
-{
-}
-
 static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
 				      struct perf_event_attr *attr,
 				      struct perf_event *group_leader)
@@ -1080,11 +1062,6 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 {
 }
 
-static inline void
-perf_cgroup_sched_switch(struct task_struct *task, struct task_struct *next)
-{
-}
-
 static inline u64 perf_cgroup_event_time(struct perf_event *event)
 {
 	return 0;
@@ -1104,6 +1081,10 @@ static inline void
 perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *ctx)
 {
 }
+
+static void perf_cgroup_switch(struct task_struct *task)
+{
+}
 #endif
 
 /*
@@ -3625,7 +3606,7 @@ void __perf_event_task_sched_out(struct task_struct *task,
 	 * cgroup event are system-wide mode only
 	 */
 	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
-		perf_cgroup_sched_switch(task, next);
+		perf_cgroup_switch(next);
 }
 
 /*
-- 
2.20.1



* [PATCH v2 3/6] perf/core: Don't pass task around when ctx sched in
  2022-03-22 12:08 [PATCH v2 0/6] perf/core: Fixes and cleanup for cgroup events Chengming Zhou
  2022-03-22 12:08 ` [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in Chengming Zhou
  2022-03-22 12:08 ` [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup Chengming Zhou
@ 2022-03-22 12:08 ` Chengming Zhou
  2022-03-22 13:01   ` Peter Zijlstra
  2022-03-22 12:08 ` [PATCH v2 4/6] perf/core: Use stable cpuctx->cgrp when update perf cgroup time Chengming Zhou
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 12:08 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, eranian
  Cc: linux-perf-users, linux-kernel, duanxiongchun, songmuchun,
	Chengming Zhou

The current code passes the task around for ctx_sched_in(), only to
get the perf_cgroup of the task, then update the timestamps of it
and its ancestors and set them to active.

But we can use cpuctx->cgrp to get the active perf_cgroup and its
ancestors, since cpuctx->cgrp has been set before ctx_sched_in().

This patch removes the task argument from ctx_sched_in() and cleans
up the related code.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/events/core.c | 54 ++++++++++++++++++++------------------------
 1 file changed, 24 insertions(+), 30 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 848a3bfa9513..e27c06628bad 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -574,8 +574,7 @@ static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
 			      enum event_type_t event_type);
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-			     enum event_type_t event_type,
-			     struct task_struct *task);
+			     enum event_type_t event_type);
 
 static void update_context_time(struct perf_event_context *ctx);
 static u64 perf_event_time(struct perf_event *event);
@@ -801,10 +800,10 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
 }
 
 static inline void
-perf_cgroup_set_timestamp(struct task_struct *task,
-			  struct perf_event_context *ctx)
+perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
 {
-	struct perf_cgroup *cgrp;
+	struct perf_event_context *ctx = &cpuctx->ctx;
+	struct perf_cgroup *cgrp = cpuctx->cgrp;
 	struct perf_cgroup_info *info;
 	struct cgroup_subsys_state *css;
 
@@ -813,10 +812,10 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 	 * ensure we do not access cgroup data
 	 * unless we have the cgroup pinned (css_get)
 	 */
-	if (!task || !ctx->nr_cgroups)
+	if (!cgrp)
 		return;
 
-	cgrp = perf_cgroup_from_task(task, ctx);
+	WARN_ON_ONCE(!ctx->nr_cgroups);
 
 	for (css = &cgrp->css; css; css = css->parent) {
 		cgrp = container_of(css, struct perf_cgroup, css);
@@ -873,7 +872,7 @@ static void perf_cgroup_switch(struct task_struct *task)
 		 * event_filter_match() to not have to pass
 		 * task around
 		 */
-		cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+		cpu_ctx_sched_in(cpuctx, EVENT_ALL);
 
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -1057,8 +1056,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
 }
 
 static inline void
-perf_cgroup_set_timestamp(struct task_struct *task,
-			  struct perf_event_context *ctx)
+perf_cgroup_set_timestamp(struct perf_cpu_context *cpuctx)
 {
 }
 
@@ -2651,8 +2649,7 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
-	     enum event_type_t event_type,
-	     struct task_struct *task);
+	     enum event_type_t event_type);
 
 static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
 			       struct perf_event_context *ctx,
@@ -2668,15 +2665,14 @@ static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
 }
 
 static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
-				struct perf_event_context *ctx,
-				struct task_struct *task)
+				struct perf_event_context *ctx)
 {
-	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
+	cpu_ctx_sched_in(cpuctx, EVENT_PINNED);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_PINNED);
+	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
+		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE);
 }
 
 /*
@@ -2726,7 +2722,7 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 	else if (ctx_event_type & EVENT_PINNED)
 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
 
-	perf_event_sched_in(cpuctx, task_ctx, current);
+	perf_event_sched_in(cpuctx, task_ctx);
 	perf_pmu_enable(cpuctx->ctx.pmu);
 }
 
@@ -2949,7 +2945,7 @@ static void __perf_event_enable(struct perf_event *event,
 		return;
 
 	if (!event_filter_match(event)) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME);
 		return;
 	}
 
@@ -2958,7 +2954,7 @@ static void __perf_event_enable(struct perf_event *event,
 	 * then don't put it on unless the group is on.
 	 */
 	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME);
 		return;
 	}
 
@@ -3803,8 +3799,7 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
 static void
 ctx_sched_in(struct perf_event_context *ctx,
 	     struct perf_cpu_context *cpuctx,
-	     enum event_type_t event_type,
-	     struct task_struct *task)
+	     enum event_type_t event_type)
 {
 	int is_active = ctx->is_active;
 
@@ -3816,7 +3811,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 	if (is_active ^ EVENT_TIME) {
 		/* start ctx time */
 		__update_context_time(ctx, false);
-		perf_cgroup_set_timestamp(task, ctx);
+		perf_cgroup_set_timestamp(cpuctx);
 		/*
 		 * CPU-release for the below ->is_active store,
 		 * see __load_acquire() in perf_event_time_now()
@@ -3847,12 +3842,11 @@ ctx_sched_in(struct perf_event_context *ctx,
 }
 
 static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-			     enum event_type_t event_type,
-			     struct task_struct *task)
+			     enum event_type_t event_type)
 {
 	struct perf_event_context *ctx = &cpuctx->ctx;
 
-	ctx_sched_in(ctx, cpuctx, event_type, task);
+	ctx_sched_in(ctx, cpuctx, event_type);
 }
 
 static void perf_event_context_sched_in(struct perf_event_context *ctx,
@@ -3894,7 +3888,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	 */
 	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
 		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
-	perf_event_sched_in(cpuctx, ctx, task);
+	perf_event_sched_in(cpuctx, ctx);
 
 	if (cpuctx->sched_cb_usage && pmu->sched_task)
 		pmu->sched_task(cpuctx->task_ctx, true);
@@ -4195,7 +4189,7 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
 	if (cpu_event)
 		rotate_ctx(&cpuctx->ctx, cpu_event);
 
-	perf_event_sched_in(cpuctx, task_ctx, current);
+	perf_event_sched_in(cpuctx, task_ctx);
 
 	perf_pmu_enable(cpuctx->ctx.pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -4267,7 +4261,7 @@ static void perf_event_enable_on_exec(int ctxn)
 		clone_ctx = unclone_ctx(ctx);
 		ctx_resched(cpuctx, ctx, event_type);
 	} else {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, cpuctx, EVENT_TIME);
 	}
 	perf_ctx_unlock(cpuctx, ctx);
 
-- 
2.20.1



* [PATCH v2 4/6] perf/core: Use stable cpuctx->cgrp when update perf cgroup time
  2022-03-22 12:08 [PATCH v2 0/6] perf/core: Fixes and cleanup for cgroup events Chengming Zhou
                   ` (2 preceding siblings ...)
  2022-03-22 12:08 ` [PATCH v2 3/6] perf/core: Don't pass task around when ctx sched in Chengming Zhou
@ 2022-03-22 12:08 ` Chengming Zhou
  2022-03-22 13:03   ` Peter Zijlstra
  2022-03-22 12:08 ` [PATCH v2 5/6] perf/core: Always set cpuctx cgrp when enable cgroup event Chengming Zhou
  2022-03-22 12:08 ` [PATCH v2 6/6] perf/core: Don't need event_filter_match when merge_sched_in() Chengming Zhou
  5 siblings, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 12:08 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, eranian
  Cc: linux-perf-users, linux-kernel, duanxiongchun, songmuchun,
	Chengming Zhou

The current code uses the changeable task->cgroups when updating the
perf cgroup time, which may not be the active perf_cgroup that is
scheduled in on the CPU.

This patch changes it to use the stable cpuctx->cgrp, and only
updates the time when the event matches cpuctx->cgrp.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/events/core.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index e27c06628bad..849a81299906 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -780,7 +780,6 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx,
 static inline void update_cgrp_time_from_event(struct perf_event *event)
 {
 	struct perf_cgroup_info *info;
-	struct perf_cgroup *cgrp;
 
 	/*
 	 * ensure we access cgroup data only when needed and
@@ -789,11 +788,10 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
 	if (!is_cgroup_event(event))
 		return;
 
-	cgrp = perf_cgroup_from_task(current, event->ctx);
 	/*
-	 * Do not update time when cgroup is not active
+	 * Only update time when event is matched with cpuctx cgrp
 	 */
-	if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup)) {
+	if (perf_cgroup_match(event)) {
 		info = this_cpu_ptr(event->cgrp->info);
 		__update_cgrp_time(info, perf_clock(), true);
 	}
-- 
2.20.1



* [PATCH v2 5/6] perf/core: Always set cpuctx cgrp when enable cgroup event
  2022-03-22 12:08 [PATCH v2 0/6] perf/core: Fixes and cleanup for cgroup events Chengming Zhou
                   ` (3 preceding siblings ...)
  2022-03-22 12:08 ` [PATCH v2 4/6] perf/core: Use stable cpuctx->cgrp when update perf cgroup time Chengming Zhou
@ 2022-03-22 12:08 ` Chengming Zhou
  2022-03-22 12:08 ` [PATCH v2 6/6] perf/core: Don't need event_filter_match when merge_sched_in() Chengming Zhou
  5 siblings, 0 replies; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 12:08 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, eranian
  Cc: linux-perf-users, linux-kernel, duanxiongchun, songmuchun,
	Chengming Zhou

When enabling a cgroup event, setting cpuctx->cgrp is conditional on
the current task's cgrp matching the event's cgroup, so we have to do
it for every new event. That brings complexity but no advantage.

To keep it simple, this patch always sets cpuctx->cgrp when the first
cgroup event is enabled, and resets it to NULL when the last cgroup
event is disabled.

In this way, perf_cgroup_match() won't see cpuctx->cgrp == NULL for
a cgroup event, so add a WARN_ON_ONCE(!cpuctx->cgrp) there.

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/events/core.c | 20 +++-----------------
 1 file changed, 3 insertions(+), 17 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 849a81299906..4c8657b08301 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -703,7 +703,7 @@ perf_cgroup_match(struct perf_event *event)
 		return true;
 
 	/* wants specific cgroup scope but @cpuctx isn't associated with any */
-	if (!cpuctx->cgrp)
+	if (WARN_ON_ONCE(!cpuctx->cgrp))
 		return false;
 
 	/*
@@ -975,22 +975,10 @@ perf_cgroup_event_enable(struct perf_event *event, struct perf_event_context *ct
 	 */
 	cpuctx = container_of(ctx, struct perf_cpu_context, ctx);
 
-	/*
-	 * Since setting cpuctx->cgrp is conditional on the current @cgrp
-	 * matching the event's cgroup, we must do this for every new event,
-	 * because if the first would mismatch, the second would not try again
-	 * and we would leave cpuctx->cgrp unset.
-	 */
-	if (ctx->is_active && !cpuctx->cgrp) {
-		struct perf_cgroup *cgrp = perf_cgroup_from_task(current, ctx);
-
-		if (cgroup_is_descendant(cgrp->css.cgroup, event->cgrp->css.cgroup))
-			cpuctx->cgrp = cgrp;
-	}
-
 	if (ctx->nr_cgroups++)
 		return;
 
+	cpuctx->cgrp = perf_cgroup_from_task(current, ctx);
 	list_add(&cpuctx->cgrp_cpuctx_entry,
 			per_cpu_ptr(&cgrp_cpuctx_list, event->cpu));
 }
@@ -1012,9 +1000,7 @@ perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *c
 	if (--ctx->nr_cgroups)
 		return;
 
-	if (ctx->is_active && cpuctx->cgrp)
-		cpuctx->cgrp = NULL;
-
+	cpuctx->cgrp = NULL;
 	list_del(&cpuctx->cgrp_cpuctx_entry);
 
 	if (list_empty(per_cpu_ptr(&cgrp_cpuctx_list, event->cpu)))
-- 
2.20.1



* [PATCH v2 6/6] perf/core: Don't need event_filter_match when merge_sched_in()
  2022-03-22 12:08 [PATCH v2 0/6] perf/core: Fixes and cleanup for cgroup events Chengming Zhou
                   ` (4 preceding siblings ...)
  2022-03-22 12:08 ` [PATCH v2 5/6] perf/core: Always set cpuctx cgrp when enable cgroup event Chengming Zhou
@ 2022-03-22 12:08 ` Chengming Zhou
  5 siblings, 0 replies; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 12:08 UTC (permalink / raw)
  To: peterz, mingo, acme, mark.rutland, alexander.shishkin, jolsa,
	namhyung, eranian
  Cc: linux-perf-users, linux-kernel, duanxiongchun, songmuchun,
	Chengming Zhou

There are two obsolete comments in perf_cgroup_switch(), since we
don't use event_filter_match() in event_sched_out() any more. And we
found that event_filter_match() is not needed on the sched_in path
either.

Because we now use the perf_event groups RB-tree to get exactly the
matching perf_events, there is no need to go through
event_filter_match() to check for a match.

We could actually remove it in merge_sched_in(), but this patch
changes it to a WARN_ON_ONCE for debugging purposes, and we found no
warning in our stress test.
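
The RB-tree argument can be illustrated with a small model (a
userspace sketch under assumed simplifications: a sorted array stands
in for the groups RB-tree and the key is just the cpu, while the real
tree is keyed by more than that):

#include <stdio.h>

struct event { int cpu; const char *name; };

/* models the perf_event groups tree: kept sorted by key, so the walk
 * can visit exactly the range that matches */
static struct event events[] = {
	{ 0, "cpu0-a" }, { 0, "cpu0-b" }, { 1, "cpu1-a" }, { 2, "cpu2-a" },
};

static void visit_cpu(int cpu)
{
	for (size_t i = 0; i < sizeof(events) / sizeof(events[0]); i++) {
		if (events[i].cpu < cpu)
			continue;	/* before the matching range */
		if (events[i].cpu > cpu)
			break;		/* sorted: past the range, stop */
		/* every event visited here already matches the key, so a
		 * per-event event_filter_match()-style check cannot fail */
		printf("sched in %s\n", events[i].name);
	}
}

int main(void)
{
	visit_cpu(1);		/* visits only "cpu1-a" */
	return 0;
}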

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 kernel/events/core.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4c8657b08301..744078fe2819 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -860,15 +860,11 @@ static void perf_cgroup_switch(struct task_struct *task)
 		perf_pmu_disable(cpuctx->ctx.pmu);
 
 		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
-		/*
-		 * must not be done before ctxswout due
-		 * to event_filter_match() in event_sched_out()
-		 */
 		cpuctx->cgrp = cgrp;
 		/*
 		 * set cgrp before ctxsw in to allow
-		 * event_filter_match() to not have to pass
-		 * task around
+		 * visit_groups_merge() to find matched
+		 * cgroup events
 		 */
 		cpu_ctx_sched_in(cpuctx, EVENT_ALL);
 
@@ -3729,7 +3725,7 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	if (event->state <= PERF_EVENT_STATE_OFF)
 		return 0;
 
-	if (!event_filter_match(event))
+	if (WARN_ON_ONCE(!event_filter_match(event)))
 		return 0;
 
 	if (group_can_go_on(event, cpuctx, *can_add_hw)) {
-- 
2.20.1



* Re: [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in
  2022-03-22 12:08 ` [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in Chengming Zhou
@ 2022-03-22 12:59   ` Peter Zijlstra
  2022-03-22 13:38     ` [External] " Chengming Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-22 12:59 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Tue, Mar 22, 2022 at 08:08:29PM +0800, Chengming Zhou wrote:
> There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
> in perf_cgroup_switch().
> 
> CPU1					CPU2
> (in context_switch)			(attach running task)
> perf_cgroup_sched_out(prev, next)
> 	cgrp1 == cgrp2 is True
> 					next->cgroups = cgrp3
> 					perf_cgroup_attach()
> perf_cgroup_sched_in(prev, next)
> 	cgrp1 == cgrp3 is False
> 
> Commit a8d757ef076f ("perf events: Fix slow and broken cgroup
> context switch code") skips the cpuctx switch out/in when the
> perf_cgroup of "prev" and "next" is the same.
> 
> But the perf_cgroup of a task can change concurrently with
> context_switch().

Can you clarify? IIRC then a task changes cgroup it goes throught the
whole ->attach() dance, and that serializes against the context switch
code.



* Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-22 12:08 ` [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup Chengming Zhou
@ 2022-03-22 13:01   ` Peter Zijlstra
  2022-03-22 16:33     ` [External] " Chengming Zhou
  2022-03-22 22:21     ` Namhyung Kim
  2022-03-22 22:18   ` Namhyung Kim
  2022-03-23 12:51   ` Peter Zijlstra
  2 siblings, 2 replies; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-22 13:01 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:
> Although we no longer have the inconsistency problem, we can
> still hit another problem:
> 
> CPU1					CPU2
> (in context_switch)			(attach running task)
> 					prev->cgroups = cgrp2
> perf_cgroup_sched_switch(prev, next)
> 	cgrp2 == cgrp2 is True
> 

Again, I'm not following, how can you attach to a running task from
another CPU ?


* Re: [PATCH v2 3/6] perf/core: Don't pass task around when ctx sched in
  2022-03-22 12:08 ` [PATCH v2 3/6] perf/core: Don't pass task around when ctx sched in Chengming Zhou
@ 2022-03-22 13:01   ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-22 13:01 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Tue, Mar 22, 2022 at 08:08:31PM +0800, Chengming Zhou wrote:
> The current code passes the task around for ctx_sched_in(), only to
> get the perf_cgroup of the task, then update the timestamps of it
> and its ancestors and set them to active.
> 
> But we can use cpuctx->cgrp to get the active perf_cgroup and its
> ancestors, since cpuctx->cgrp has been set before ctx_sched_in().
> 
> This patch removes the task argument from ctx_sched_in() and cleans
> up the related code.
> 
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>

Yeah, this looks reasonable I suppose...


* Re: [PATCH v2 4/6] perf/core: Use stable cpuctx->cgrp when update perf cgroup time
  2022-03-22 12:08 ` [PATCH v2 4/6] perf/core: Use stable cpuctx->cgrp when update perf cgroup time Chengming Zhou
@ 2022-03-22 13:03   ` Peter Zijlstra
  0 siblings, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-22 13:03 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Tue, Mar 22, 2022 at 08:08:32PM +0800, Chengming Zhou wrote:
> The current code uses the changeable task->cgroups when updating the
> perf cgroup time, which may not be the active perf_cgroup that is
> scheduled in on the CPU.

How?!?



* Re: [External] Re: [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in
  2022-03-22 12:59   ` Peter Zijlstra
@ 2022-03-22 13:38     ` Chengming Zhou
  2022-03-22 14:54       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 13:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On 2022/3/22 8:59 PM, Peter Zijlstra wrote:
> On Tue, Mar 22, 2022 at 08:08:29PM +0800, Chengming Zhou wrote:
>> There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
>> in perf_cgroup_switch().
>>
>> CPU1					CPU2
>> (in context_switch)			(attach running task)
>> perf_cgroup_sched_out(prev, next)
>> 	cgrp1 == cgrp2 is True
>> 					next->cgroups = cgrp3
>> 					perf_cgroup_attach()
>> perf_cgroup_sched_in(prev, next)
>> 	cgrp1 == cgrp3 is False
>>
>> Commit a8d757ef076f ("perf events: Fix slow and broken cgroup
>> context switch code") skips the cpuctx switch out/in when the
>> perf_cgroup of "prev" and "next" is the same.
>>
>> But the perf_cgroup of a task can change concurrently with
>> context_switch().
> 
> Can you clarify? IIRC then a task changes cgroup it goes throught the
> whole ->attach() dance, and that serializes against the context switch
> code.
> 

task->cgroups is changed before perf_cgroup_attach(), and this is not
serialized against the context switch, since task->cgroups can be
changed without the rq lock held (cgroup v1, or cgroup v2 with PSI
disabled).

So perf_cgroup_from_task() in perf_cgroup_switch() may see either the
old or the new perf_cgroup during a context switch.
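
In other words, the attach path looks roughly like this (an ordering
sketch; the exact cgroup-core call sites are an assumption on my
part, only the relative order of the two steps matters):

cgroup_migrate_execute()
  css_set_move_task()
    rcu_assign_pointer(task->cgroups, to)        <--- (1) no rq lock held
  ...
  perf_cgroup_attach()
    task_function_call(task, __perf_cgroup_move) <--- (2) serialized IPI

A context switch can run between (1) and (2) and observe the new
task->cgroups before __perf_cgroup_move() has run.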

Thanks.


* Re: [External] Re: [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in
  2022-03-22 13:38     ` [External] " Chengming Zhou
@ 2022-03-22 14:54       ` Peter Zijlstra
  2022-03-22 15:16         ` Chengming Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-22 14:54 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Tue, Mar 22, 2022 at 09:38:21PM +0800, Chengming Zhou wrote:
> On 2022/3/22 8:59 PM, Peter Zijlstra wrote:
> > On Tue, Mar 22, 2022 at 08:08:29PM +0800, Chengming Zhou wrote:
> >> There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
> >> in perf_cgroup_switch().
> >>
> >> CPU1					CPU2
> >> (in context_switch)			(attach running task)
> >> perf_cgroup_sched_out(prev, next)
> >> 	cgrp1 == cgrp2 is True
> >> 					next->cgroups = cgrp3
> >> 					perf_cgroup_attach()
> >> perf_cgroup_sched_in(prev, next)
> >> 	cgrp1 == cgrp3 is False
> >>
> >> Commit a8d757ef076f ("perf events: Fix slow and broken cgroup
> >> context switch code") skips the cpuctx switch out/in when the
> >> perf_cgroup of "prev" and "next" is the same.
> >>
> >> But the perf_cgroup of a task can change concurrently with
> >> context_switch().
> > 
> > Can you clarify? IIRC then a task changes cgroup it goes throught the
> > whole ->attach() dance, and that serializes against the context switch
> > code.
> > 
> 
> task->cgroups is changed before perf_cgroup_attach(), and this is not
> serialized against the context switch, since task->cgroups can be
> changed without the rq lock held (cgroup v1, or cgroup v2 with PSI
> disabled).
> 
> So perf_cgroup_from_task() in perf_cgroup_switch() may see either the
> old or the new perf_cgroup during a context switch.

__schedule()
  local_irq_disable();				<--- IRQ disable
  rq_lock();

  ...

  context_switch()
    prepare_task_switch()
      perf_event_task_sched_out()
        __perf_event_task_sched_out()
	  perf_cgroup_sched_out();

  switch_to()
  finish_task_switch()
    perf_event_task_sched_in()
      __perf_event_task_sched_in()
        perf_cgroup_sched_in();
    finish_lock_switch()
      raw_spin_irq_unlock_irq();		<--- IRQ enable


vs

perf_event_cgrp_subsys.attach = perf_cgroup_attach()
  cgroup_taskset_for_each()
    task_function_call(task, __perf_cgroup_move) <--- sends IPI


Please explain how this can interleave.


* Re: [External] Re: [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in
  2022-03-22 14:54       ` Peter Zijlstra
@ 2022-03-22 15:16         ` Chengming Zhou
  2022-03-22 15:28           ` Chengming Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 15:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

Hi peter,

On 2022/3/22 10:54 PM, Peter Zijlstra wrote:
> On Tue, Mar 22, 2022 at 09:38:21PM +0800, Chengming Zhou wrote:
>> On 2022/3/22 8:59 PM, Peter Zijlstra wrote:
>>> On Tue, Mar 22, 2022 at 08:08:29PM +0800, Chengming Zhou wrote:
>>>> There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
>>>> in perf_cgroup_switch().
>>>>
>>>> CPU1					CPU2
>>>> (in context_switch)			(attach running task)
>>>> perf_cgroup_sched_out(prev, next)
>>>> 	cgrp1 == cgrp2 is True
>>>> 					next->cgroups = cgrp3
>>>> 					perf_cgroup_attach()
>>>> perf_cgroup_sched_in(prev, next)
>>>> 	cgrp1 == cgrp3 is False
>>>>
>>>>> Commit a8d757ef076f ("perf events: Fix slow and broken cgroup
>>>>> context switch code") skips the cpuctx switch out/in when the
>>>>> perf_cgroup of "prev" and "next" is the same.
>>>>>
>>>>> But the perf_cgroup of a task can change concurrently with
>>>>> context_switch().
>>>
>>> Can you clarify? IIRC then a task changes cgroup it goes throught the
>>> whole ->attach() dance, and that serializes against the context switch
>>> code.
>>>
>>
>>> task->cgroups is changed before perf_cgroup_attach(), and this is
>>> not serialized against the context switch, since task->cgroups can
>>> be changed without the rq lock held (cgroup v1, or cgroup v2 with
>>> PSI disabled).
>>>
>>> So perf_cgroup_sched_out() in perf_cgroup_switch() may see either
>>> the old or the new perf_cgroup during a context switch.
> 
> __schedule()
>   local_irq_disable();				<--- IRQ disable
>   rq_lock();
> 
>   ...
> 
>   context_switch()
>     prepare_task_switch()
>       perf_event_task_sched_out()
>         __perf_event_task_sched_out()
> 	  perf_cgroup_sched_out();

here we compare perf_cgroup_from_task(prev) and perf_cgroup_from_task(next)

> 
>   switch_to()
>   finish_task_switch()
>     perf_event_task_sched_in()
>       __perf_event_task_sched_in()
>         perf_cgroup_sched_in();

here we compare perf_cgroup_from_task(prev) and perf_cgroup_from_task(next)

>     finish_lock_switch()
>       raw_spin_irq_unlock_irq();		<--- IRQ enable
> 
> 
> vs
> 

rcu_assign_pointer(p->cgroups, to)		<--- task perf_cgroup changed

task->cgroups has already changed before the IPI is sent

> perf_event_cgrp_subsys.attach = perf_cgroup_attach()
>   cgroup_taskset_for_each()
>     task_function_call(task, __perf_cgroup_move) <--- sends IPI
> 
> 
> Please explain how this can interleave.

__perf_cgroup_move() in the IPI is of course serialized against the
context switch, but task->cgroups has already changed before that,
without the rq lock held. So perf_cgroup_from_task() may see either
the old or the new perf_cgroup.

Thanks.



* Re: [External] Re: [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in
  2022-03-22 15:16         ` Chengming Zhou
@ 2022-03-22 15:28           ` Chengming Zhou
  2022-03-22 22:06             ` Namhyung Kim
  2022-03-23  8:11             ` Peter Zijlstra
  0 siblings, 2 replies; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 15:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On 2022/3/22 11:16 PM, Chengming Zhou wrote:
> Hi peter,
> 
> On 2022/3/22 10:54 PM, Peter Zijlstra wrote:
>> On Tue, Mar 22, 2022 at 09:38:21PM +0800, Chengming Zhou wrote:
>>> On 2022/3/22 8:59 PM, Peter Zijlstra wrote:
>>>> On Tue, Mar 22, 2022 at 08:08:29PM +0800, Chengming Zhou wrote:
>>>>> There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
>>>>> in perf_cgroup_switch().
>>>>>
>>>>> CPU1					CPU2
>>>>> (in context_switch)			(attach running task)
>>>>> perf_cgroup_sched_out(prev, next)
>>>>> 	cgrp1 == cgrp2 is True
>>>>> 					next->cgroups = cgrp3
>>>>> 					perf_cgroup_attach()
>>>>> perf_cgroup_sched_in(prev, next)
>>>>> 	cgrp1 == cgrp3 is False

I see, you must have been misled by my wrong drawing above ;-)
I'm sorry, perf_cgroup_attach() on the right should be put at the bottom.

CPU1						CPU2
(in context_switch)				(attach running task)
perf_cgroup_sched_out(prev, next)
	cgrp1 == cgrp2 is True
						next->cgroups = cgrp3
perf_cgroup_sched_in(prev, next)
	cgrp1 == cgrp3 is False
						__perf_cgroup_move()

Thanks.

>>>>>
>>>>> Commit a8d757ef076f ("perf events: Fix slow and broken cgroup
>>>>> context switch code") skips the cpuctx switch out/in when the
>>>>> perf_cgroup of "prev" and "next" is the same.
>>>>>
>>>>> But the perf_cgroup of a task can change concurrently with
>>>>> context_switch().
>>>>
>>>> Can you clarify? IIRC then a task changes cgroup it goes throught the
>>>> whole ->attach() dance, and that serializes against the context switch
>>>> code.
>>>>
>>>
>>> task->cgroups is changed before perf_cgroup_attach(), and this is
>>> not serialized against the context switch, since task->cgroups can
>>> be changed without the rq lock held (cgroup v1, or cgroup v2 with
>>> PSI disabled).
>>>
>>> So perf_cgroup_sched_out() in perf_cgroup_switch() may see either
>>> the old or the new perf_cgroup during a context switch.
>>
>> __schedule()
>>   local_irq_disable();				<--- IRQ disable
>>   rq_lock();
>>
>>   ...
>>
>>   context_switch()
>>     prepare_task_switch()
>>       perf_event_task_sched_out()
>>         __perf_event_task_sched_out()
>> 	  perf_cgroup_sched_out();
> 
> here we compare perf_cgroup_from_task(prev) and perf_cgroup_from_task(next)
> 
>>
>>   switch_to()
>>   finish_task_switch()
>>     perf_event_task_sched_in()
>>       __perf_event_task_sched_in()
>>         perf_cgroup_sched_in();
> 
> here we compare perf_cgroup_from_task(prev) and perf_cgroup_from_task(next)
> 
>>     finish_lock_switch()
>>       raw_spin_irq_unlock_irq();		<--- IRQ enable
>>
>>
>> vs
>>
> 
> rcu_assign_pointer(p->cgroups, to)		<--- task perf_cgroup changed
> 
> task->cgroups has already changed before the IPI is sent
> 
>> perf_event_cgrp_subsys.attach = perf_cgroup_attach()
>>   cgroup_taskset_for_each()
>>     task_function_call(task, __perf_cgroup_move) <--- sends IPI
>>
>>
>> Please explain how this can interleave.
> 
> __perf_cgroup_move() in the IPI is of course serialized against the
> context switch, but task->cgroups has already changed before that,
> without the rq lock held. So perf_cgroup_from_task() may see either
> the old or the new perf_cgroup.
> 
> Thanks.
> 


* Re: [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-22 13:01   ` Peter Zijlstra
@ 2022-03-22 16:33     ` Chengming Zhou
  2022-03-23  8:13       ` Peter Zijlstra
  2022-03-22 22:21     ` Namhyung Kim
  1 sibling, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2022-03-22 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On 2022/3/22 9:01 PM, Peter Zijlstra wrote:
> On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:
>> Although we no longer have the inconsistency problem, we can
>> still hit another problem:
>>
>> CPU1					CPU2
>> (in context_switch)			(attach running task)
>> 					prev->cgroups = cgrp2
>> perf_cgroup_sched_switch(prev, next)
>> 	cgrp2 == cgrp2 is True
>>
> 
> Again, I'm not following, how can you attach to a running task from
> another CPU ?

Hi Peter, I wrote a little test case which can reproduce the race
problem on a system with PSI disabled. When PSI is enabled,
cgroup_move_task() holds the rq lock while assigning task->cgroups.

```
#!/bin/bash

cd /sys/fs/cgroup/perf_event

mkdir cg1
mkdir cg2

perf stat -e cycles --cgroup /cg1 &

cg_run()
{
        cg=$1
        shift
        echo $BASHPID > $cg/cgroup.procs
        $@
}

cg_run cg1 schbench -r 100 &
cg_run cg2 schbench -r 100 &

while true; do
        for i in $(cat cg1/cgroup.procs); do
                echo $i > cg2/cgroup.procs
        done
        for i in $(cat cg2/cgroup.procs); do
                echo $i > cg1/cgroup.procs
        done
done
```

A few seconds later, dmesg shows the WARNING message:

[   51.777830] WARNING: CPU: 2 PID: 1849 at kernel/events/core.c:869 perf_cgroup_switch+0x246/0x290
[   51.779167] Modules linked in:
[   51.779696] CPU: 2 PID: 1849 Comm: schbench Not tainted 5.17.0-rc8 #28
[   51.780691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[   51.782353] RIP: 0010:perf_cgroup_switch+0x246/0x290
[   51.783145] Code: 0f 0b e9 0b ff ff ff 48 83 7c 24 08 00 74 0c e8 00 7e f7 ff fb 66 0f 1f 44 00 00 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 <0f> 0b e9 4f fe ff ff e8 be 7e f7 ff e8 a9 1f 93 00 89 c0 49 c7 c5
[   51.785804] RSP: 0018:ffffba4440fcbd80 EFLAGS: 00010086
[   51.786617] RAX: 0000000000000002 RBX: ffff8d78eb8b7200 RCX: 0000000000000000
[   51.787696] RDX: 0000000000000000 RSI: ffffffffae1c83db RDI: ffffffffae185a69
[   51.788777] RBP: ffff8d78eb8aad40 R08: 0000000000000001 R09: 0000000000000001
[   51.789854] R10: 0000000000000000 R11: ffff8d78eb8b7220 R12: ffff8d78eb8b7208
[   51.790929] R13: ffff8d78eb8aafa0 R14: ffff8d74cd6bb600 R15: 0000000000000000
[   51.792006] FS:  00007fedaaffd700(0000) GS:ffff8d78eb880000(0000) knlGS:0000000000000000
[   51.793223] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.794122] CR2: 000055e4bf2b696c CR3: 00000001128a2003 CR4: 0000000000370ee0
[   51.795209] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   51.796292] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   51.797375] Call Trace:
[   51.797828]  <TASK>
[   51.798229]  __perf_event_task_sched_in+0x151/0x350
[   51.799009]  ? lock_release+0x1ed/0x2e0
[   51.799640]  finish_task_switch+0x1d3/0x2e0
[   51.800328]  ? __switch_to+0x136/0x4b0
[   51.800953]  __schedule+0x33e/0xae0
[   51.801535]  schedule+0x4e/0xc0
[   51.802080]  exit_to_user_mode_prepare+0x172/0x2a0
[   51.802850]  ? asm_sysvec_apic_timer_interrupt+0xa/0x20
[   51.803675]  irqentry_exit_to_user_mode+0x5/0x40
[   51.804413]  sysvec_apic_timer_interrupt+0x5c/0xd0
[   51.805183]  asm_sysvec_apic_timer_interrupt+0x12/0x20

Thanks.



* Re: [External] Re: [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in
  2022-03-22 15:28           ` Chengming Zhou
@ 2022-03-22 22:06             ` Namhyung Kim
  2022-03-23  8:11             ` Peter Zijlstra
  1 sibling, 0 replies; 29+ messages in thread
From: Namhyung Kim @ 2022-03-22 22:06 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Stephane Eranian,
	linux-perf-users, linux-kernel, duanxiongchun, songmuchun

Hello,

On Tue, Mar 22, 2022 at 8:28 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> On 2022/3/22 11:16 下午, Chengming Zhou wrote:
> > Hi peter,
> >
> > On 2022/3/22 10:54 下午, Peter Zijlstra wrote:
> >> On Tue, Mar 22, 2022 at 09:38:21PM +0800, Chengming Zhou wrote:
> >>> On 2022/3/22 8:59 下午, Peter Zijlstra wrote:
> >>>> On Tue, Mar 22, 2022 at 08:08:29PM +0800, Chengming Zhou wrote:
> >>>>> There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
> >>>>> in perf_cgroup_switch().
> >>>>>
> >>>>> CPU1                                      CPU2
> >>>>> (in context_switch)                       (attach running task)
> >>>>> perf_cgroup_sched_out(prev, next)
> >>>>>   cgrp1 == cgrp2 is True
> >>>>>                                   next->cgroups = cgrp3
> >>>>>                                   perf_cgroup_attach()
> >>>>> perf_cgroup_sched_in(prev, next)
> >>>>>   cgrp1 == cgrp3 is False
>
> I see, you must have been misled by my wrong drawing above ;-)
> I'm sorry, perf_cgroup_attach() on the right should be put at the bottom.
>
> CPU1                                            CPU2
> (in context_switch)                             (attach running task)
> perf_cgroup_sched_out(prev, next)
>         cgrp1 == cgrp2 is True
>                                                 next->cgroups = cgrp3
> perf_cgroup_sched_in(prev, next)
>         cgrp1 == cgrp3 is False
>                                                 __perf_cgroup_move()
>

Yep, this is a real race and I have seen the warnings reported
sometimes. perf_cgroup_attach() is called after the task's cgroup is
changed.

Thanks,
Namhyung


* Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-22 12:08 ` [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup Chengming Zhou
  2022-03-22 13:01   ` Peter Zijlstra
@ 2022-03-22 22:18   ` Namhyung Kim
  2022-03-23  1:27     ` [Phishing Risk] [External] " Chengming Zhou
  2022-03-23 12:51   ` Peter Zijlstra
  2 siblings, 1 reply; 29+ messages in thread
From: Namhyung Kim @ 2022-03-22 22:18 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Stephane Eranian,
	linux-perf-users, linux-kernel, duanxiongchun, songmuchun

On Tue, Mar 22, 2022 at 5:10 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Although we no longer have the inconsistency problem, we can
> still hit another problem:
>
> CPU1                                    CPU2
> (in context_switch)                     (attach running task)
>                                         prev->cgroups = cgrp2
> perf_cgroup_sched_switch(prev, next)
>         cgrp2 == cgrp2 is True
>
> If the perf_cgroup of the prev task changes from cgrp1 to cgrp2,
> perf_cgroup_sched_switch() will skip perf_cgroup_switch(), so the
> CPU will still schedule the cgrp1 events when it should schedule
> the cgrp2 events.

Ah ok, now I see the problem in changing prev->cgroup too.

>
> The root cause is that we shouldn't use the changeable prev->cgroups
> to decide whether to skip perf_cgroup_switch().
>
> This patch introduces a percpu perf_cgroup to cache the perf_cgroup
> that was scheduled in on the cpuctxes, which is later compared with
> the perf_cgroup of the next task to decide whether to skip
> perf_cgroup_switch().
>
> Since perf_cgroup_switch() can be called after the context switch,
> the cgroup events might already be scheduled. So we put the
> comparison of perf_cgroups in perf_cgroup_switch(), and delete the
> now-unused function perf_cgroup_sched_switch().
>
> We must clear the percpu perf_cgroup cache when the last cgroup
> event is disabled.
>
> Fixes: a8d757ef076f ("perf events: Fix slow and broken cgroup context switch code")
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  kernel/events/core.c | 63 ++++++++++++++++----------------------------
>  1 file changed, 22 insertions(+), 41 deletions(-)
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8b5cf2aedfe6..848a3bfa9513 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -826,6 +826,7 @@ perf_cgroup_set_timestamp(struct task_struct *task,
>         }
>  }
>
> +static DEFINE_PER_CPU(struct perf_cgroup *, cpu_perf_cgroup);
>  static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
>
>  /*
> @@ -833,6 +834,7 @@ static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
>   */
>  static void perf_cgroup_switch(struct task_struct *task)
>  {
> +       struct perf_cgroup *cgrp;
>         struct perf_cpu_context *cpuctx, *tmp;
>         struct list_head *list;
>         unsigned long flags;
> @@ -843,11 +845,21 @@ static void perf_cgroup_switch(struct task_struct *task)
>          */
>         local_irq_save(flags);
>
> +       cgrp = perf_cgroup_from_task(task, NULL);
> +       if (cgrp == __this_cpu_read(cpu_perf_cgroup))
> +               goto out;
> +
> +       __this_cpu_write(cpu_perf_cgroup, cgrp);
> +
>         list = this_cpu_ptr(&cgrp_cpuctx_list);
>         list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>                 WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>
>                 perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +               if (cpuctx->cgrp == cgrp)

Missing perf_ctx_unlock().

Thanks,
Namhyung
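
For clarity, the fixed check would presumably look like this (a
sketch of the suggested fix, not a posted patch):

	if (cpuctx->cgrp == cgrp) {
		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
		continue;
	}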

> +                       continue;
> +
>                 perf_pmu_disable(cpuctx->ctx.pmu);
>
>                 cpu_ctx_sched_out(cpuctx, EVENT_ALL);
> @@ -855,14 +867,11 @@ static void perf_cgroup_switch(struct task_struct *task)
>                  * must not be done before ctxswout due
>                  * to event_filter_match() in event_sched_out()
>                  */
> -               cpuctx->cgrp = perf_cgroup_from_task(task,
> -                                                    &cpuctx->ctx);
> +               cpuctx->cgrp = cgrp;
>                 /*
>                  * set cgrp before ctxsw in to allow
>                  * event_filter_match() to not have to pass
>                  * task around
> -                * we pass the cpuctx->ctx to perf_cgroup_from_task()
> -                * because cgroup events are only per-cpu
>                  */
>                 cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
>
> @@ -870,35 +879,10 @@ static void perf_cgroup_switch(struct task_struct *task)
>                 perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>         }
>
> +out:
>         local_irq_restore(flags);
>  }


* Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-22 13:01   ` Peter Zijlstra
  2022-03-22 16:33     ` [External] " Chengming Zhou
@ 2022-03-22 22:21     ` Namhyung Kim
  1 sibling, 0 replies; 29+ messages in thread
From: Namhyung Kim @ 2022-03-22 22:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chengming Zhou, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Stephane Eranian,
	linux-perf-users, linux-kernel, duanxiongchun, songmuchun

On Tue, Mar 22, 2022 at 6:01 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:
> > Although we no longer have the inconsistency problem, we can
> > still have another problem, like:
> >
> > CPU1                                  CPU2
> > (in context_switch)                   (attach running task)
> >                                       prev->cgroups = cgrp2
> > perf_cgroup_sched_switch(prev, next)
> >       cgrp2 == cgrp2 is True
> >
>
> Again, I'm not following; how can you attach to a running task from
> another CPU?

I think it has been supported from the beginning, by writing a PID to a
file in cgroupfs.
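
For example, something as simple as this from userspace (a minimal
sketch; the cgroup path below is illustrative and depends on how the
hierarchy is mounted):

#include <stdio.h>
#include <sys/types.h>

/* Move an already-running task into a cgroup by writing its PID to
 * cgroup.procs; this is what races with the context switch above. */
static int attach_task(pid_t pid)
{
	FILE *f = fopen("/sys/fs/cgroup/perf_event/cg2/cgroup.procs", "w");

	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)pid);
	return fclose(f) ? -1 : 0;
}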

Thanks,
Namhyung


* Re: [Phishing Risk] [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-22 22:18   ` Namhyung Kim
@ 2022-03-23  1:27     ` Chengming Zhou
  0 siblings, 0 replies; 29+ messages in thread
From: Chengming Zhou @ 2022-03-23  1:27 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Alexander Shishkin, Jiri Olsa, Stephane Eranian,
	linux-perf-users, linux-kernel, duanxiongchun, songmuchun

Hi Namhyung,

On 2022/3/23 6:18 AM, Namhyung Kim wrote:
> On Tue, Mar 22, 2022 at 5:10 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> Although we no longer have the inconsistency problem, we can
>> still have another problem, like:
>>
>> CPU1                                    CPU2
>> (in context_switch)                     (attach running task)
>>                                         prev->cgroups = cgrp2
>> perf_cgroup_sched_switch(prev, next)
>>         cgrp2 == cgrp2 is True
>>
>> If the perf_cgroup of the prev task changes from cgrp1 to cgrp2,
>> perf_cgroup_sched_switch() will skip perf_cgroup_switch(),
>> so the CPU would still schedule the cgrp1 events when it should
>> be scheduling the cgrp2 events.
> 
> Ah ok, now I see the problem in changing prev->cgroup too.
> 
>>
>> The root cause of this problem is that we shouldn't use the changeable
>> prev->cgroups to decide whether to skip perf_cgroup_switch().
>>
>> This patch introduces a percpu perf_cgroup to cache the perf_cgroup
>> that was scheduled in on the cpuctxes, which is later compared with
>> the perf_cgroup of the next task to decide whether to skip
>> perf_cgroup_switch().
>>
>> Since perf_cgroup_switch() can be called after the context switch,
>> the cgroup events might be scheduled already. So we put the comparison
>> of perf_cgroups in perf_cgroup_switch(), and delete the now-unused
>> function perf_cgroup_sched_switch().
>>
>> We must clear the percpu perf_cgroup cache when the last cgroup event
>> is disabled.
>>
>> Fixes: a8d757ef076f ("perf events: Fix slow and broken cgroup context switch code")
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  kernel/events/core.c | 63 ++++++++++++++++----------------------------
>>  1 file changed, 22 insertions(+), 41 deletions(-)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 8b5cf2aedfe6..848a3bfa9513 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -826,6 +826,7 @@ perf_cgroup_set_timestamp(struct task_struct *task,
>>         }
>>  }
>>
>> +static DEFINE_PER_CPU(struct perf_cgroup *, cpu_perf_cgroup);
>>  static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
>>
>>  /*
>> @@ -833,6 +834,7 @@ static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
>>   */
>>  static void perf_cgroup_switch(struct task_struct *task)
>>  {
>> +       struct perf_cgroup *cgrp;
>>         struct perf_cpu_context *cpuctx, *tmp;
>>         struct list_head *list;
>>         unsigned long flags;
>> @@ -843,11 +845,21 @@ static void perf_cgroup_switch(struct task_struct *task)
>>          */
>>         local_irq_save(flags);
>>
>> +       cgrp = perf_cgroup_from_task(task, NULL);
>> +       if (cgrp == __this_cpu_read(cpu_perf_cgroup))
>> +               goto out;
>> +
>> +       __this_cpu_write(cpu_perf_cgroup, cgrp);
>> +
>>         list = this_cpu_ptr(&cgrp_cpuctx_list);
>>         list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>>                 WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>>
>>                 perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +
>> +               if (cpuctx->cgrp == cgrp)
> 
> Missing perf_ctx_unlock().

Thank you, will fix in the next version.

> 
> Thanks,
> Namhyung
> 
>> +                       continue;
>> +
>>                 perf_pmu_disable(cpuctx->ctx.pmu);
>>
>>                 cpu_ctx_sched_out(cpuctx, EVENT_ALL);
>> @@ -855,14 +867,11 @@ static void perf_cgroup_switch(struct task_struct *task)
>>                  * must not be done before ctxswout due
>>                  * to event_filter_match() in event_sched_out()
>>                  */
>> -               cpuctx->cgrp = perf_cgroup_from_task(task,
>> -                                                    &cpuctx->ctx);
>> +               cpuctx->cgrp = cgrp;
>>                 /*
>>                  * set cgrp before ctxsw in to allow
>>                  * event_filter_match() to not have to pass
>>                  * task around
>> -                * we pass the cpuctx->ctx to perf_cgroup_from_task()
>> -                * because cgroup events are only per-cpu
>>                  */
>>                 cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
>>
>> @@ -870,35 +879,10 @@ static void perf_cgroup_switch(struct task_struct *task)
>>                 perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>>         }
>>
>> +out:
>>         local_irq_restore(flags);
>>  }


* Re: [External] Re: [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in
  2022-03-22 15:28           ` Chengming Zhou
  2022-03-22 22:06             ` Namhyung Kim
@ 2022-03-23  8:11             ` Peter Zijlstra
  1 sibling, 0 replies; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-23  8:11 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Tue, Mar 22, 2022 at 11:28:41PM +0800, Chengming Zhou wrote:
> On 2022/3/22 11:16 PM, Chengming Zhou wrote:
> > Hi Peter,
> > 
> > On 2022/3/22 10:54 PM, Peter Zijlstra wrote:
> >> On Tue, Mar 22, 2022 at 09:38:21PM +0800, Chengming Zhou wrote:
> >>> On 2022/3/22 8:59 PM, Peter Zijlstra wrote:
> >>>> On Tue, Mar 22, 2022 at 08:08:29PM +0800, Chengming Zhou wrote:
> >>>>> There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
> >>>>> in perf_cgroup_switch().
> >>>>>
> >>>>> CPU1					CPU2
> >>>>> (in context_switch)			(attach running task)
> >>>>> perf_cgroup_sched_out(prev, next)
> >>>>> 	cgrp1 == cgrp2 is True
> >>>>> 					next->cgroups = cgrp3
> >>>>> 					perf_cgroup_attach()
> >>>>> perf_cgroup_sched_in(prev, next)
> >>>>> 	cgrp1 == cgrp3 is False
> 
> I see, you must have been misled by my wrong drawing above ;-)
> I'm sorry, perf_cgroup_attach() on the right should be at the bottom.
> 
> CPU1						CPU2
> (in context_switch)				(attach running task)
> perf_cgroup_sched_out(prev, next)
> 	cgrp1 == cgrp2 is True
> 						next->cgroups = cgrp3
> perf_cgroup_sched_in(prev, next)
> 	cgrp1 == cgrp3 is False
> 						__perf_cgroup_move()
> 

Ohhhh, you're talking about CPU2 running cgroup_migrate_execute()...
clear as mud, this :/

I think I remember this race; in the scheduler we fixed it by not using
task_css to track the active cgroup and using the various cgroup_subsys
hooks to keep an internally consistent set of state.
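
Something of this shape, roughly -- a minimal sketch of that pattern
with illustrative names; "my_active_css" is a hypothetical field, not
the actual sched code:

#include <linux/sched.h>
#include <linux/cgroup.h>

static void my_ss_attach(struct cgroup_taskset *tset)
{
	struct task_struct *task;
	struct cgroup_subsys_state *css;

	/* Update our own per-task view from the subsys hook, so the
	 * switch-time code never samples the changeable task->cgroups. */
	cgroup_taskset_for_each(task, css, tset)
		WRITE_ONCE(task->my_active_css, css);	/* hypothetical field */
}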

But let me go look at what you did in this new light.


* Re: [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-22 16:33     ` [External] " Chengming Zhou
@ 2022-03-23  8:13       ` Peter Zijlstra
  2022-03-23 12:58         ` Chengming Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-23  8:13 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Wed, Mar 23, 2022 at 12:33:51AM +0800, Chengming Zhou wrote:
> On 2022/3/22 9:01 PM, Peter Zijlstra wrote:
> > On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:
>> Although we no longer have the inconsistency problem, we can
>> still have another problem, like:
> >>
> >> CPU1					CPU2
> >> (in context_switch)			(attach running task)
> >> 					prev->cgroups = cgrp2
> >> perf_cgroup_sched_switch(prev, next)
> >> 	cgrp2 == cgrp2 is True
> >>
> > 
> Again, I'm not following; how can you attach to a running task from
> another CPU?
> 
> Hi Peter, I made a little testcase which can reproduce the race
> problem on a system with PSI disabled, because when PSI is enabled,
> cgroup_move_task() will hold the rq lock to assign task->cgroups.

No, the problem is that you're talking about cgroup attach while I'm
thinking of attaching an event to a task. And your picture does nothing
to clarify that.

Those pictures of yours could really do with a few more function names
in them; otherwise it's absolutely unclear what code is running where.


* Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-22 12:08 ` [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup Chengming Zhou
  2022-03-22 13:01   ` Peter Zijlstra
  2022-03-22 22:18   ` Namhyung Kim
@ 2022-03-23 12:51   ` Peter Zijlstra
  2022-03-23 13:07     ` [External] " Chengming Zhou
  2 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-23 12:51 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:

> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8b5cf2aedfe6..848a3bfa9513 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c

> @@ -843,11 +845,21 @@ static void perf_cgroup_switch(struct task_struct *task)
>  	 */
>  	local_irq_save(flags);
>  
> +	cgrp = perf_cgroup_from_task(task, NULL);
> +	if (cgrp == __this_cpu_read(cpu_perf_cgroup))
> +		goto out;
> +
> +	__this_cpu_write(cpu_perf_cgroup, cgrp);
> +
>  	list = this_cpu_ptr(&cgrp_cpuctx_list);
>  	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>  		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>  
>  		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +		if (cpuctx->cgrp == cgrp)
> +			continue;
> +
>  		perf_pmu_disable(cpuctx->ctx.pmu);
>  
>  		cpu_ctx_sched_out(cpuctx, EVENT_ALL);

This is just straight up broken... you can't continue after taking a
lock; that'll miss the unlock.

Also, I really don't see the point of cpu_perf_cgroup, cpuctx->cgrp
should be able to do this.

Also, perhaps merge this into the previous patch; I'm not sure it makes
sense to split this.

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -833,6 +833,7 @@ static DEFINE_PER_CPU(struct list_head,
  */
 static void perf_cgroup_switch(struct task_struct *task)
 {
+	struct perf_cgroup *cgrp;
 	struct perf_cpu_context *cpuctx, *tmp;
 	struct list_head *list;
 	unsigned long flags;
@@ -843,11 +844,20 @@ static void perf_cgroup_switch(struct ta
 	 */
 	local_irq_save(flags);
 
+	cgrp = perf_cgroup_from_task(task, NULL);
+
 	list = this_cpu_ptr(&cgrp_cpuctx_list);
 	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
 		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
 
+		if (READ_ONCE(cpuctx->cgrp) == cgrp)
+			continue;
+
 		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+
+		if (cpuctx->cgrp == cgrp)
+			goto next;
+
 		perf_pmu_disable(cpuctx->ctx.pmu);
 
 		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
@@ -855,50 +865,22 @@ static void perf_cgroup_switch(struct ta
 		 * must not be done before ctxswout due
 		 * to event_filter_match() in event_sched_out()
 		 */
-		cpuctx->cgrp = perf_cgroup_from_task(task,
-						     &cpuctx->ctx);
+		WRITE_ONCE(cpuctx->cgrp, cgrp);
 		/*
 		 * set cgrp before ctxsw in to allow
 		 * event_filter_match() to not have to pass
 		 * task around
-		 * we pass the cpuctx->ctx to perf_cgroup_from_task()
-		 * because cgroup events are only per-cpu
 		 */
 		cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
 
 		perf_pmu_enable(cpuctx->ctx.pmu);
+next:
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 	}
 
 	local_irq_restore(flags);
 }
 
-static inline void perf_cgroup_sched_switch(struct task_struct *task,
-					    struct task_struct *next)
-{
-	struct perf_cgroup *cgrp1;
-	struct perf_cgroup *cgrp2 = NULL;
-
-	rcu_read_lock();
-	/*
-	 * we come here when we know perf_cgroup_events > 0
-	 * we do not need to pass the ctx here because we know
-	 * we are holding the rcu lock
-	 */
-	cgrp1 = perf_cgroup_from_task(task, NULL);
-	cgrp2 = perf_cgroup_from_task(next, NULL);
-
-	/*
-	 * only schedule out current cgroup events if we know
-	 * that we are switching to a different cgroup. Otherwise,
-	 * do no touch the cgroup events.
-	 */
-	if (cgrp1 != cgrp2)
-		perf_cgroup_switch(task);
-
-	rcu_read_unlock();
-}
-
 static int perf_cgroup_ensure_storage(struct perf_event *event,
 				struct cgroup_subsys_state *css)
 {
@@ -1062,11 +1044,6 @@ static inline void update_cgrp_time_from
 {
 }
 
-static inline void perf_cgroup_sched_switch(struct task_struct *task,
-					    struct task_struct *next)
-{
-}
-
 static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
 				      struct perf_event_attr *attr,
 				      struct perf_event *group_leader)
@@ -1080,11 +1057,6 @@ perf_cgroup_set_timestamp(struct task_st
 {
 }
 
-static inline void
-perf_cgroup_sched_switch(struct task_struct *task, struct task_struct *next)
-{
-}
-
 static inline u64 perf_cgroup_event_time(struct perf_event *event)
 {
 	return 0;
@@ -1104,6 +1076,10 @@ static inline void
 perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *ctx)
 {
 }
+
+static void perf_cgroup_switch(struct task_struct *task)
+{
+}
 #endif
 
 /*
@@ -3625,7 +3601,7 @@ void __perf_event_task_sched_out(struct
 	 * cgroup event are system-wide mode only
 	 */
 	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
-		perf_cgroup_sched_switch(task, next);
+		perf_cgroup_switch(next);
 }
 
 /*


* Re: [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-23  8:13       ` Peter Zijlstra
@ 2022-03-23 12:58         ` Chengming Zhou
  0 siblings, 0 replies; 29+ messages in thread
From: Chengming Zhou @ 2022-03-23 12:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On 2022/3/23 4:13 PM, Peter Zijlstra wrote:
> On Wed, Mar 23, 2022 at 12:33:51AM +0800, Chengming Zhou wrote:
>> On 2022/3/22 9:01 PM, Peter Zijlstra wrote:
>>> On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:
>>>> Although we no longer have the inconsistency problem, we can
>>>> still have another problem, like:
>>>>
>>>> CPU1					CPU2
>>>> (in context_switch)			(attach running task)
>>>> 					prev->cgroups = cgrp2
>>>> perf_cgroup_sched_switch(prev, next)
>>>> 	cgrp2 == cgrp2 is True
>>>>
>>>
>>> Again, I'm not following; how can you attach to a running task from
>>> another CPU?
>>
>> Hi Peter, I made a little testcase which can reproduce the race
>> problem on a system with PSI disabled, because when PSI is enabled,
>> cgroup_move_task() will hold the rq lock to assign task->cgroups.
> 
> No, the problem is that you're talking about cgroup attach while I'm
> thinking of attaching an event to a task. And your picture does nothing
> to clarify that.
> 
> Those pictures of yours could really do with a few more function names
> in them; otherwise it's absolutely unclear what code is running where.

Sorry for the confusion ;-)
I will draw a better picture including more function names in the next version.

Thanks.


* Re: [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-23 12:51   ` Peter Zijlstra
@ 2022-03-23 13:07     ` Chengming Zhou
  2022-03-23 13:17       ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2022-03-23 13:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On 2022/3/23 8:51 PM, Peter Zijlstra wrote:
> On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:
> 
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 8b5cf2aedfe6..848a3bfa9513 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
> 
>> @@ -843,11 +845,21 @@ static void perf_cgroup_switch(struct task_struct *task)
>>  	 */
>>  	local_irq_save(flags);
>>  
>> +	cgrp = perf_cgroup_from_task(task, NULL);
>> +	if (cgrp == __this_cpu_read(cpu_perf_cgroup))
>> +		goto out;
>> +
>> +	__this_cpu_write(cpu_perf_cgroup, cgrp);
>> +
>>  	list = this_cpu_ptr(&cgrp_cpuctx_list);
>>  	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>>  		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>>  
>>  		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> +
>> +		if (cpuctx->cgrp == cgrp)
>> +			continue;
>> +
>>  		perf_pmu_disable(cpuctx->ctx.pmu);
>>  
>>  		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
> 
> This is just straight up broken... you can't continue after taking a
> lock; that'll miss the unlock.

Yes, Namhyung has pointed it out; I will fix it in the next version.

> 
> Also, I really don't see the point of cpu_perf_cgroup, cpuctx->cgrp
> should be able to do this.

But the problem is that we have two cpuctxs on the percpu list. Do you
mean we should use the perf_cgroup of the first cpuctx to compare with
the next task's perf_cgroup?

Or should we delete the cgrp in cpuctx and use this new percpu cpu_perf_cgroup?

> 
> Also, perhaps merge this into the previous patch; I'm not sure it makes
> sense to split this.

OK, will do. I put it in one patch in the v1 RFC, then split it for easier
review. I will put it back together in the next version.

Thanks.

> 
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -833,6 +833,7 @@ static DEFINE_PER_CPU(struct list_head,
>   */
>  static void perf_cgroup_switch(struct task_struct *task)
>  {
> +	struct perf_cgroup *cgrp;
>  	struct perf_cpu_context *cpuctx, *tmp;
>  	struct list_head *list;
>  	unsigned long flags;
> @@ -843,11 +844,20 @@ static void perf_cgroup_switch(struct ta
>  	 */
>  	local_irq_save(flags);
>  
> +	cgrp = perf_cgroup_from_task(task, NULL);
> +
>  	list = this_cpu_ptr(&cgrp_cpuctx_list);
>  	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>  		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>  
> > +		if (READ_ONCE(cpuctx->cgrp) == cgrp)
> > +			continue;
> +
>  		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> +
> +		if (cpuctx->cgrp == cgrp)
> +			goto next;
> +
>  		perf_pmu_disable(cpuctx->ctx.pmu);
>  
>  		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
> @@ -855,50 +865,22 @@ static void perf_cgroup_switch(struct ta
>  		 * must not be done before ctxswout due
>  		 * to event_filter_match() in event_sched_out()
>  		 */
> -		cpuctx->cgrp = perf_cgroup_from_task(task,
> -						     &cpuctx->ctx);
> +		WRITE_ONCE(cpuctx->cgrp, cgrp);
>  		/*
>  		 * set cgrp before ctxsw in to allow
>  		 * event_filter_match() to not have to pass
>  		 * task around
> -		 * we pass the cpuctx->ctx to perf_cgroup_from_task()
> -		 * because cgroup events are only per-cpu
>  		 */
>  		cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
>  
>  		perf_pmu_enable(cpuctx->ctx.pmu);
> +next:
>  		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>  	}
>  
>  	local_irq_restore(flags);
>  }
>  
> -static inline void perf_cgroup_sched_switch(struct task_struct *task,
> -					    struct task_struct *next)
> -{
> -	struct perf_cgroup *cgrp1;
> -	struct perf_cgroup *cgrp2 = NULL;
> -
> -	rcu_read_lock();
> -	/*
> -	 * we come here when we know perf_cgroup_events > 0
> -	 * we do not need to pass the ctx here because we know
> -	 * we are holding the rcu lock
> -	 */
> -	cgrp1 = perf_cgroup_from_task(task, NULL);
> -	cgrp2 = perf_cgroup_from_task(next, NULL);
> -
> -	/*
> -	 * only schedule out current cgroup events if we know
> -	 * that we are switching to a different cgroup. Otherwise,
> -	 * do no touch the cgroup events.
> -	 */
> -	if (cgrp1 != cgrp2)
> -		perf_cgroup_switch(task);
> -
> -	rcu_read_unlock();
> -}
> -
>  static int perf_cgroup_ensure_storage(struct perf_event *event,
>  				struct cgroup_subsys_state *css)
>  {
> @@ -1062,11 +1044,6 @@ static inline void update_cgrp_time_from
>  {
>  }
>  
> -static inline void perf_cgroup_sched_switch(struct task_struct *task,
> -					    struct task_struct *next)
> -{
> -}
> -
>  static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
>  				      struct perf_event_attr *attr,
>  				      struct perf_event *group_leader)
> @@ -1080,11 +1057,6 @@ perf_cgroup_set_timestamp(struct task_st
>  {
>  }
>  
> -static inline void
> -perf_cgroup_sched_switch(struct task_struct *task, struct task_struct *next)
> -{
> -}
> -
>  static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  {
>  	return 0;
> @@ -1104,6 +1076,10 @@ static inline void
>  perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *ctx)
>  {
>  }
> +
> +static void perf_cgroup_switch(struct task_struct *task)
> +{
> +}
>  #endif
>  
>  /*
> @@ -3625,7 +3601,7 @@ void __perf_event_task_sched_out(struct
>  	 * cgroup event are system-wide mode only
>  	 */
>  	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
> -		perf_cgroup_sched_switch(task, next);
> +		perf_cgroup_switch(next);
>  }
>  
>  /*


* Re: [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-23 13:07     ` [External] " Chengming Zhou
@ 2022-03-23 13:17       ` Peter Zijlstra
  2022-03-23 13:37         ` Chengming Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-23 13:17 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Wed, Mar 23, 2022 at 09:07:01PM +0800, Chengming Zhou wrote:
> On 2022/3/23 8:51 PM, Peter Zijlstra wrote:
> > On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:
> > 
> >> diff --git a/kernel/events/core.c b/kernel/events/core.c
> >> index 8b5cf2aedfe6..848a3bfa9513 100644
> >> --- a/kernel/events/core.c
> >> +++ b/kernel/events/core.c
> > 
> >> @@ -843,11 +845,21 @@ static void perf_cgroup_switch(struct task_struct *task)
> >>  	 */
> >>  	local_irq_save(flags);
> >>  
> >> +	cgrp = perf_cgroup_from_task(task, NULL);
> >> +	if (cgrp == __this_cpu_read(cpu_perf_cgroup))
> >> +		goto out;

So this compares the cpu thing against the task thing; if they match, we
bail.

> >> +
> >> +	__this_cpu_write(cpu_perf_cgroup, cgrp);

Then we set the cpu thing.

> >> +
> >>  	list = this_cpu_ptr(&cgrp_cpuctx_list);
> >>  	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
> >>  		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
> >>  
> >>  		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> >> +
> >> +		if (cpuctx->cgrp == cgrp)
> >> +			continue;
> >> +
> >>  		perf_pmu_disable(cpuctx->ctx.pmu);
> >>  
> >>  		cpu_ctx_sched_out(cpuctx, EVENT_ALL);

> >> +		cpuctx->cgrp = cgrp;

But here we already have exactly the same pattern: we match the cpuctx
thing against the task thing and skip/set, etc.

> > Also, I really don't see the point of cpu_perf_cgroup, cpuctx->cgrp
> > should be able to do this.
> 
> But the problem is that we have two cpuctxs on the percpu list. Do you
> mean we should use the perf_cgroup of the first cpuctx to compare with
> the next task's perf_cgroup?
> 
> Or should we delete the cgrp in cpuctx and use this new percpu cpu_perf_cgroup?

I'm a bit confused; per the above, you already do exactly what the new
cpu_perf_cgroup does on the cpuctx->cgrp variable. AFAICT the only thing
the new per-cpu variable does is avoid a lock, however:


> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -833,6 +833,7 @@ static DEFINE_PER_CPU(struct list_head,
> >   */
> >  static void perf_cgroup_switch(struct task_struct *task)
> >  {
> > +	struct perf_cgroup *cgrp;
> >  	struct perf_cpu_context *cpuctx, *tmp;
> >  	struct list_head *list;
> >  	unsigned long flags;
> > @@ -843,11 +844,20 @@ static void perf_cgroup_switch(struct ta
> >  	 */
> >  	local_irq_save(flags);
> >  
> > +	cgrp = perf_cgroup_from_task(task, NULL);
> > +
> >  	list = this_cpu_ptr(&cgrp_cpuctx_list);
> >  	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
> >  		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
> >  
> > +		if (READ_ONCE(cpuctx->cgrp) == cgrp)
> > +			continue;

I think we can avoid that by doing an early check, hmm?

> > +
> >  		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
> > +
> > +		if (cpuctx->cgrp == cgrp)
> > +			goto next;
> > +
> >  		perf_pmu_disable(cpuctx->ctx.pmu);
> >  
> >  		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
> > @@ -855,50 +865,22 @@ static void perf_cgroup_switch(struct ta
> >  		 * must not be done before ctxswout due
> >  		 * to event_filter_match() in event_sched_out()
> >  		 */
> > -		cpuctx->cgrp = perf_cgroup_from_task(task,
> > -						     &cpuctx->ctx);
> > +		WRITE_ONCE(cpuctx->cgrp, cgrp);


* Re: [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-23 13:17       ` Peter Zijlstra
@ 2022-03-23 13:37         ` Chengming Zhou
  2022-03-23 14:05           ` Peter Zijlstra
  0 siblings, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2022-03-23 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On 2022/3/23 9:17 PM, Peter Zijlstra wrote:
> On Wed, Mar 23, 2022 at 09:07:01PM +0800, Chengming Zhou wrote:
>> On 2022/3/23 8:51 PM, Peter Zijlstra wrote:
>>> On Tue, Mar 22, 2022 at 08:08:30PM +0800, Chengming Zhou wrote:
>>>
>>>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>>>> index 8b5cf2aedfe6..848a3bfa9513 100644
>>>> --- a/kernel/events/core.c
>>>> +++ b/kernel/events/core.c
>>>
>>>> @@ -843,11 +845,21 @@ static void perf_cgroup_switch(struct task_struct *task)
>>>>  	 */
>>>>  	local_irq_save(flags);
>>>>  
>>>> +	cgrp = perf_cgroup_from_task(task, NULL);
>>>> +	if (cgrp == __this_cpu_read(cpu_perf_cgroup))
>>>> +		goto out;
> 
> So this compares the cpu thing against the task thing; if they match, we
> bail.
> 
>>>> +
>>>> +	__this_cpu_write(cpu_perf_cgroup, cgrp);
> 
> Then we set the cpu thing.
> 
>>>> +
>>>>  	list = this_cpu_ptr(&cgrp_cpuctx_list);
>>>>  	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>>>>  		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>>>>  
>>>>  		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>>> +
>>>> +		if (cpuctx->cgrp == cgrp)
>>>> +			continue;
>>>> +
>>>>  		perf_pmu_disable(cpuctx->ctx.pmu);
>>>>  
>>>>  		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
> 
>>>> +		cpuctx->cgrp = cgrp;
> 
> But here we already have exactly the same pattern: we match the cpuctx
> thing against the task thing and skip/set, etc.

Yes, cpu_perf_cgroup is just a "cache" of the cgrp assigned to cpuctx->cgrp.

> 
>>> Also, I really don't see the point of cpu_perf_cgroup, cpuctx->cgrp
>>> should be able to do this.
>>
>> But the problem is that we have two cpuctxs on the percpu list. Do you
>> mean we should use the perf_cgroup of the first cpuctx to compare with
>> the next task's perf_cgroup?
>>
>> Or should we delete the cgrp in cpuctx and use this new percpu cpu_perf_cgroup?
> 
> I'm a bit confused; per the above, you already do exactly what the new
> cpu_perf_cgroup does on the cpuctx->cgrp variable. AFAICT the only thing
> the new per-cpu variable does is avoid a lock, however:

Having cgrp in cpuctx made me confused at first too.

perf_cgroup_event_enable()
	if (ctx->is_active && !cpuctx->cgrp)
		if ...
			cpuctx->cgrp = cgrp;   -->  set cpuctx->cgrp when enabling the event

	list_add(&cpuctx->cgrp_cpuctx_entry,
			per_cpu_ptr(&cgrp_cpuctx_list, event->cpu))  --> add cpuctx to the percpu list

But we have two (hw and sw) cpuctxs, so these two cpuctx->cgrp may be different...

This is the reason why I want to create a percpu cpu_perf_cgroup: it just
"caches" the cgrp last scheduled in, to compare with the next task's to
decide whether perf_cgroup_switch() can be skipped.

But you are right, having both cpuctx->cgrp and cpu_perf_cgroup makes things
confusing... maybe we can delete cpuctx->cgrp and change to the percpu
cpu_perf_cgroup?

> 
> 
>>> --- a/kernel/events/core.c
>>> +++ b/kernel/events/core.c
>>> @@ -833,6 +833,7 @@ static DEFINE_PER_CPU(struct list_head,
>>>   */
>>>  static void perf_cgroup_switch(struct task_struct *task)
>>>  {
>>> +	struct perf_cgroup *cgrp;
>>>  	struct perf_cpu_context *cpuctx, *tmp;
>>>  	struct list_head *list;
>>>  	unsigned long flags;
>>> @@ -843,11 +844,20 @@ static void perf_cgroup_switch(struct ta
>>>  	 */
>>>  	local_irq_save(flags);
>>>  
>>> +	cgrp = perf_cgroup_from_task(task, NULL);
>>> +
>>>  	list = this_cpu_ptr(&cgrp_cpuctx_list);
>>>  	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>>>  		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>>>  
>>> +		if (READ_ONCE(cpuctx->cgrp) == cgrp)
>>> +			continue;
> 
> I think we can avoid that by doing an early check, hmm?

perf_cgroup_switch() can be called from context_switch(), or from
__perf_cgroup_move() via IPI. Say context_switch() has already set
cpuctx->cgrp to the new task->cgroups; then when context_switch() finishes
and the IPI is handled in __perf_cgroup_move(), we don't need to do the
sched_out/in again.

Thanks.

> 
>>> +
>>>  		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>>> +
>>> +		if (cpuctx->cgrp == cgrp)
>>> +			goto next;
>>> +
>>>  		perf_pmu_disable(cpuctx->ctx.pmu);
>>>  
>>>  		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
>>> @@ -855,50 +865,22 @@ static void perf_cgroup_switch(struct ta
>>>  		 * must not be done before ctxswout due
>>>  		 * to event_filter_match() in event_sched_out()
>>>  		 */
>>> -		cpuctx->cgrp = perf_cgroup_from_task(task,
>>> -						     &cpuctx->ctx);
>>> +		WRITE_ONCE(cpuctx->cgrp, cgrp);


* Re: [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-23 13:37         ` Chengming Zhou
@ 2022-03-23 14:05           ` Peter Zijlstra
  2022-03-23 15:44             ` Chengming Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Peter Zijlstra @ 2022-03-23 14:05 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On Wed, Mar 23, 2022 at 09:37:16PM +0800, Chengming Zhou wrote:

> But you are right, having both cpuctx->cgrp and cpu_perf_cgroup makes things
> confusing... maybe we can delete cpuctx->cgrp and change to the percpu
> cpu_perf_cgroup?

I'm hoping to soon (*finally*!) have some time for this again:

  https://lkml.kernel.org/r/20220113134743.1292-1-ravi.bangoria@amd.com

That should get us a single cpuctx.


* Re: [External] Re: [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup
  2022-03-23 14:05           ` Peter Zijlstra
@ 2022-03-23 15:44             ` Chengming Zhou
  0 siblings, 0 replies; 29+ messages in thread
From: Chengming Zhou @ 2022-03-23 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, mark.rutland, alexander.shishkin, jolsa, namhyung,
	eranian, linux-perf-users, linux-kernel, duanxiongchun,
	songmuchun

On 2022/3/23 10:05 PM, Peter Zijlstra wrote:
> On Wed, Mar 23, 2022 at 09:37:16PM +0800, Chengming Zhou wrote:
> 
>> But you are right, having both cpuctx->cgrp and cpu_perf_cgroup makes things
>> confusing... maybe we can delete cpuctx->cgrp and change to the percpu
>> cpu_perf_cgroup?
> 
> I'm hoping to soon (*finally*!) have some time for this again:
> 
>   https://lkml.kernel.org/r/20220113134743.1292-1-ravi.bangoria@amd.com
> 
> That should get us a single cpuctx.

I have just seen the patch description; the new design is clearer than
the current code. But it may take a long time to upstream and test,
since it changes the whole design and so much code.

Anyway, I will reorganize this patchset, fix the reviewed bugs, improve
the commit messages, and send the next version for review.

Thanks.


end of thread

Thread overview: 29+ messages
2022-03-22 12:08 [PATCH v2 0/6] perf/core: Fixes and cleanup for cgroup events Chengming Zhou
2022-03-22 12:08 ` [PATCH v2 1/6] perf/core: Fix inconsistency between cgroup sched_out and sched_in Chengming Zhou
2022-03-22 12:59   ` Peter Zijlstra
2022-03-22 13:38     ` [External] " Chengming Zhou
2022-03-22 14:54       ` Peter Zijlstra
2022-03-22 15:16         ` Chengming Zhou
2022-03-22 15:28           ` Chengming Zhou
2022-03-22 22:06             ` Namhyung Kim
2022-03-23  8:11             ` Peter Zijlstra
2022-03-22 12:08 ` [PATCH v2 2/6] perf/core: Introduce percpu perf_cgroup Chengming Zhou
2022-03-22 13:01   ` Peter Zijlstra
2022-03-22 16:33     ` [External] " Chengming Zhou
2022-03-23  8:13       ` Peter Zijlstra
2022-03-23 12:58         ` Chengming Zhou
2022-03-22 22:21     ` Namhyung Kim
2022-03-22 22:18   ` Namhyung Kim
2022-03-23  1:27     ` [Phishing Risk] [External] " Chengming Zhou
2022-03-23 12:51   ` Peter Zijlstra
2022-03-23 13:07     ` [External] " Chengming Zhou
2022-03-23 13:17       ` Peter Zijlstra
2022-03-23 13:37         ` Chengming Zhou
2022-03-23 14:05           ` Peter Zijlstra
2022-03-23 15:44             ` Chengming Zhou
2022-03-22 12:08 ` [PATCH v2 3/6] perf/core: Don't pass task around when ctx sched in Chengming Zhou
2022-03-22 13:01   ` Peter Zijlstra
2022-03-22 12:08 ` [PATCH v2 4/6] perf/core: Use stable cpuctx->cgrp when update perf cgroup time Chengming Zhou
2022-03-22 13:03   ` Peter Zijlstra
2022-03-22 12:08 ` [PATCH v2 5/6] perf/core: Always set cpuctx cgrp when enable cgroup event Chengming Zhou
2022-03-22 12:08 ` [PATCH v2 6/6] perf/core: Don't need event_filter_match when merge_sched_in() Chengming Zhou
