linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/3] Accounting forced idle time per cpu and per cgroup
@ 2022-01-11  9:55 Cruz Zhao
  2022-01-11  9:55 ` [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task Cruz Zhao
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Cruz Zhao @ 2022-01-11  9:55 UTC (permalink / raw)
  To: tj, lizefan.x, hannes, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, joshdon
  Cc: cgroups, linux-kernel

There are two types of forced idle time: forced idle time from cookie'd
tasks and forced idle time from uncookie'd tasks. The forced idle time from
uncookie'd tasks is indirectly caused by the cookie'd tasks in the runqueue,
so it's more accurate to measure the capacity loss as the sum of both.

This patch set accounts forced idle time for each cpu, measuring how long
the cpu is forced idle, which is displayed via /proc/schedstat. It also
accounts force idle time for each cgroup, measuring how long the cgroup
forced its SMT siblings into idle, which is displayed via
/sys/fs/cgroup/cpuacct/cpuacct.forceidle and
/sys/fs/cgroup/cpuacct/cpuacct.forceidle_percpu. It is worth noting that
the forced idle time (of a cpu) and the force idle time (of a cgroup) have
different meanings.

We can get the total system forced idle time by looking at the root cgroup,
and we can see how long each cgroup forced its SMT siblings into idle. If
the force idle time of a cgroup is high, that can be rectified by making
some changes (e.g. affinity, cpu budget, etc.) to the cgroup.

Cruz Zhao (3):
  sched/core: Accounting forceidle time for all tasks except idle task
  sched/core: Forced idle accounting per-cpu
  sched/core: Force idle accounting per cgroup

 include/linux/cgroup.h    |  7 +++++
 kernel/sched/core.c       | 10 ++++--
 kernel/sched/core_sched.c | 10 ++++--
 kernel/sched/cpuacct.c    | 79 +++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h      |  4 +++
 kernel/sched/stats.c      | 17 ++++++++--
 6 files changed, 119 insertions(+), 8 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task
  2022-01-11  9:55 [PATCH v2 0/3] Accounting forced idle time per cpu and per cgroup Cruz Zhao
@ 2022-01-11  9:55 ` Cruz Zhao
  2022-01-11 23:52   ` Josh Don
  2022-01-18 11:18   ` [tip: sched/urgent] " tip-bot2 for Cruz Zhao
  2022-01-11  9:56 ` [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu Cruz Zhao
  2022-01-11  9:56 ` [PATCH v2 3/3] sched/core: Force idle accounting per cgroup Cruz Zhao
  2 siblings, 2 replies; 14+ messages in thread
From: Cruz Zhao @ 2022-01-11  9:55 UTC (permalink / raw)
  To: tj, lizefan.x, hannes, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, joshdon
  Cc: cgroups, linux-kernel

There are two types of forced idle time: forced idle time from cookie'd
tasks and forced idle time from uncookie'd tasks. The forced idle time from
uncookie'd tasks is indirectly caused by the cookie'd tasks in the runqueue,
so it's more accurate to measure the capacity loss as the sum of both.

Assuming cpu x and cpu y are a pair of SMT siblings, consider the
following scenarios:
  1. There's a cookie'd task running on cpu x, and there are 4 uncookie'd
     tasks running on cpu y. For cpu x, there will be 80% forced idle time
     (from uncookie'd tasks); for cpu y, there will be 20% forced idle time
     (from the cookie'd task).
  2. There's an uncookie'd task running on cpu x, and there are 4 cookie'd
     tasks running on cpu y. For cpu x, there will be 80% forced idle time
     (from cookie'd tasks); for cpu y, there will be 20% forced idle time
     (from the uncookie'd task).

Scenario 1 can be reproduced with stress-ng (scenario 2 similarly):
    (cookie'd)   taskset -c x stress-ng -c 1 -l 100
    (uncookie'd) taskset -c y stress-ng -c 4 -l 100

In the above two scenarios, the total capacity loss is 1 cpu, but in
scenario 1 the cookie'd forced idle time reports a 20% cpu capacity loss,
while in scenario 2 it reports an 80% cpu capacity loss, neither of which
is accurate. It's more accurate to measure with the sum of cookie'd forced
idle time and uncookie'd forced idle time.
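The 80%/20% split in the scenarios above is plain fair sharing across the
five runnable tasks: the side with fewer tasks runs 1/5 of the core time
(forcing its sibling idle then), and is itself forced idle the remaining
4/5. A quick sketch of that arithmetic (illustrative only, not part of the
patch):

```shell
# Fair-share sketch for scenario 1: nx equally weighted tasks on cpu x,
# ny on cpu y; cookie'd and uncookie'd tasks can never run concurrently
# on the two SMT siblings, so one side is always forced idle.
nx=1 ny=4
awk -v nx="$nx" -v ny="$ny" 'BEGIN {
    total = nx + ny
    # cpu x runs its task nx/total of the time -> cpu y is forced idle then
    printf "forced idle on cpu y: %d%%\n", 100 * nx / total
    # the remaining ny/total of the time cpu x has nothing compatible to run
    printf "forced idle on cpu x: %d%%\n", 100 * ny / total
}'
```

For nx=1, ny=4 this prints 20% for cpu y and 80% for cpu x, matching the
scenario; swapping which side is cookie'd only changes which kind of
forced idle time each percentage is attributed to.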

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
---
 kernel/sched/core.c       | 3 +--
 kernel/sched/core_sched.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e4ae00..e8187e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5822,8 +5822,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
 	}
 
 	if (schedstat_enabled() && rq->core->core_forceidle_count) {
-		if (cookie)
-			rq->core->core_forceidle_start = rq_clock(rq->core);
+		rq->core->core_forceidle_start = rq_clock(rq->core);
 		rq->core->core_forceidle_occupation = occ;
 	}
 
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index 1fb4567..c8746a9 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -277,7 +277,7 @@ void __sched_core_account_forceidle(struct rq *rq)
 		rq_i = cpu_rq(i);
 		p = rq_i->core_pick ?: rq_i->curr;
 
-		if (!p->core_cookie)
+		if (p == rq_i->idle)
 			continue;
 
 		__schedstat_add(p->stats.core_forceidle_sum, delta);
-- 
1.8.3.1



* [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu
  2022-01-11  9:55 [PATCH v2 0/3] Accounting forced idle time per cpu and per cgroup Cruz Zhao
  2022-01-11  9:55 ` [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task Cruz Zhao
@ 2022-01-11  9:56 ` Cruz Zhao
  2022-01-12  1:59   ` Josh Don
  2022-01-12 12:27   ` Peter Zijlstra
  2022-01-11  9:56 ` [PATCH v2 3/3] sched/core: Force idle accounting per cgroup Cruz Zhao
  2 siblings, 2 replies; 14+ messages in thread
From: Cruz Zhao @ 2022-01-11  9:56 UTC (permalink / raw)
  To: tj, lizefan.x, hannes, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, joshdon
  Cc: cgroups, linux-kernel

Accounting for "forced idle" time per cpu, which is the time a cpu is
forced to idle by its SMT sibling.

As it's not accurate to measure the capacity loss only by cookie'd forced
idle time, and it's hard to trace the forced idle time caused by all the
uncookie'd tasks, we account the forced idle time from the perspective of
the cpu.

Forced idle time per cpu is displayed via /proc/schedstat; the forced idle
time of cpu x appears in the 10th column of the row for cpu x, in ns. It
also requires that schedstats is enabled.
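As a sanity check on the layout, the new field can be pulled out with awk.
The sample row below is made up (the statistic values are illustrative);
only the position of the trailing field matters:

```shell
# A cpu row of /proc/schedstat with the new trailing field: "cpuN" plus
# ten statistics, the last of which is the forced idle time in ns.
line="cpu3 120 0 2210 1087 908 750 586083 416842 602 123456789"

# The forced idle time is the last whitespace-separated field.
forceidle=$(printf '%s\n' "$line" | awk '{print $NF}')
printf 'forced idle on %s: %s ns\n' "${line%% *}" "$forceidle"
```

On a live system with schedstats enabled this would be
`awk '/^cpu3 /{print $NF}' /proc/schedstat`.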

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
---
 kernel/sched/core.c       |  7 ++++++-
 kernel/sched/core_sched.c |  7 +++++--
 kernel/sched/sched.h      |  4 ++++
 kernel/sched/stats.c      | 17 +++++++++++++++--
 4 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8187e7..a224b916 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -285,8 +285,10 @@ static void __sched_core_flip(bool enabled)
 
 		sched_core_lock(cpu, &flags);
 
-		for_each_cpu(t, smt_mask)
+		for_each_cpu(t, smt_mask) {
 			cpu_rq(t)->core_enabled = enabled;
+			cpu_rq(t)->in_forcedidle = false;
+		}
 
 		cpu_rq(cpu)->core->core_forceidle_start = 0;
 
@@ -5690,6 +5692,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
 		 * another cpu during offline.
 		 */
 		rq->core_pick = NULL;
+		rq->in_forcedidle = false;
 		return __pick_next_task(rq, prev, rf);
 	}
 
@@ -5810,9 +5813,11 @@ static inline struct task_struct *pick_task(struct rq *rq)
 
 		rq_i->core_pick = p;
 
+		rq->in_forcedidle = false;
 		if (p == rq_i->idle) {
 			if (rq_i->nr_running) {
 				rq->core->core_forceidle_count++;
+				rq_i->in_forcedidle = true;
 				if (!fi_before)
 					rq->core->core_forceidle_seq++;
 			}
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index c8746a9..fe04805 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -242,7 +242,7 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
 void __sched_core_account_forceidle(struct rq *rq)
 {
 	const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
-	u64 delta, now = rq_clock(rq->core);
+	u64 delta_per_idlecpu, delta, now = rq_clock(rq->core);
 	struct rq *rq_i;
 	struct task_struct *p;
 	int i;
@@ -254,7 +254,7 @@ void __sched_core_account_forceidle(struct rq *rq)
 	if (rq->core->core_forceidle_start == 0)
 		return;
 
-	delta = now - rq->core->core_forceidle_start;
+	delta_per_idlecpu = delta = now - rq->core->core_forceidle_start;
 	if (unlikely((s64)delta <= 0))
 		return;
 
@@ -277,6 +277,9 @@ void __sched_core_account_forceidle(struct rq *rq)
 		rq_i = cpu_rq(i);
 		p = rq_i->core_pick ?: rq_i->curr;
 
+		if (rq_i->in_forcedidle)
+			rq->rq_forceidle_time += delta_per_idlecpu;
+
 		if (p == rq_i->idle)
 			continue;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de53be9..9377d91 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1086,6 +1086,9 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+#ifdef CONFIG_SCHED_CORE
+	u64			rq_forceidle_time;
+#endif
 #endif
 
 #ifdef CONFIG_CPU_IDLE
@@ -1115,6 +1118,7 @@ struct rq {
 	unsigned int		core_forceidle_seq;
 	unsigned int		core_forceidle_occupation;
 	u64			core_forceidle_start;
+	bool			in_forcedidle;
 #endif
 };
 
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde29..ea22a8c 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -108,6 +108,16 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
 	}
 }
 
+#ifdef CONFIG_SCHED_CORE
+static inline u64 get_rq_forceidle_time(struct rq *rq) {
+	return rq->rq_forceidle_time;
+}
+#else
+static inline u64 get_rq_forceidle_time(struct rq *rq) {
+	return 0;
+}
+#endif
+
 /*
  * Current schedstat API version.
  *
@@ -125,21 +135,24 @@ static int show_schedstat(struct seq_file *seq, void *v)
 		seq_printf(seq, "timestamp %lu\n", jiffies);
 	} else {
 		struct rq *rq;
+		u64 rq_forceidle_time;
 #ifdef CONFIG_SMP
 		struct sched_domain *sd;
 		int dcount = 0;
 #endif
 		cpu = (unsigned long)(v - 2);
 		rq = cpu_rq(cpu);
+		rq_forceidle_time = get_rq_forceidle_time(rq);
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %llu",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq_forceidle_time);
 
 		seq_printf(seq, "\n");
 
-- 
1.8.3.1



* [PATCH v2 3/3] sched/core: Force idle accounting per cgroup
  2022-01-11  9:55 [PATCH v2 0/3] Accounting forced idle time per cpu and per cgroup Cruz Zhao
  2022-01-11  9:55 ` [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task Cruz Zhao
  2022-01-11  9:56 ` [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu Cruz Zhao
@ 2022-01-11  9:56 ` Cruz Zhao
  2022-01-12 20:42   ` Tejun Heo
  2 siblings, 1 reply; 14+ messages in thread
From: Cruz Zhao @ 2022-01-11  9:56 UTC (permalink / raw)
  To: tj, lizefan.x, hannes, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, joshdon
  Cc: cgroups, linux-kernel

Accounting for "force idle" time per cgroup, which is the time for which
the tasks of the cgroup forced their SMT siblings into idle.

Force idle time per cgroup is displayed via
  /sys/fs/cgroup/cpuacct/$cg/cpuacct.forceidle.
Force idle time per cgroup per cpu is displayed via
  /sys/fs/cgroup/cpuacct/$cg/cpuacct.forceidle_percpu.
The unit is ns.
It also requires that schedstats is enabled.

We can get the total system forced idle time by looking at the root cgroup,
and we can see how long each cgroup forced its SMT siblings into idle. If
the force idle time of a cgroup is high, that can be rectified by making
some changes (e.g. affinity, cpu budget, etc.) to the cgroup.
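The two files are consistent by construction: the scalar cpuacct.forceidle
value is the sum over the per-cpu file. A sketch of checking that (the
sample values are invented, not from a real system):

```shell
# Hypothetical contents of cpuacct.forceidle_percpu: one ns value per
# possible cpu, space separated, as printed by the percpu seq_show.
percpu="1200 0 3400 0"

# cpuacct.forceidle should equal the sum over all possible cpus.
total=$(printf '%s\n' "$percpu" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; print s }')
printf 'total force idle: %s ns\n' "$total"
```

On a live system the two reads would be
`cat /sys/fs/cgroup/cpuacct/$cg/cpuacct.forceidle{,_percpu}` with
schedstats enabled.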

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
---
 include/linux/cgroup.h    |  7 +++++
 kernel/sched/core_sched.c |  1 +
 kernel/sched/cpuacct.c    | 79 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 87 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75c1514..0c1b616 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -774,10 +774,17 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
 #ifdef CONFIG_CGROUP_CPUACCT
 void cpuacct_charge(struct task_struct *tsk, u64 cputime);
 void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+#ifdef CONFIG_SCHED_CORE
+void cpuacct_account_forceidle(int cpu, struct task_struct *task, u64 cputime);
+#endif
 #else
 static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 static inline void cpuacct_account_field(struct task_struct *tsk, int index,
 					 u64 val) {}
+#ifdef CONFIG_SCHED_CORE
+static inline void cpuacct_account_forceidle(int cpu, struct task_struct *task,
+					     u64 cputime) {}
+#endif
 #endif
 
 void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index fe04805..add8672 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -284,6 +284,7 @@ void __sched_core_account_forceidle(struct rq *rq)
 			continue;
 
 		__schedstat_add(p->stats.core_forceidle_sum, delta);
+		cpuacct_account_forceidle(i, p, delta);
 	}
 }
 
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3d06c5e..b5c5d99 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -27,6 +27,9 @@ struct cpuacct {
 	/* cpuusage holds pointer to a u64-type object on every CPU */
 	u64 __percpu	*cpuusage;
 	struct kernel_cpustat __percpu	*cpustat;
+#ifdef CONFIG_SCHED_CORE
+	u64 __percpu	*forceidle;
+#endif
 };
 
 static inline struct cpuacct *css_ca(struct cgroup_subsys_state *css)
@@ -46,9 +49,15 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 }
 
 static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+#ifdef CONFIG_SCHED_CORE
+static DEFINE_PER_CPU(u64, root_cpuacct_forceidle);
+#endif
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
 	.cpuusage	= &root_cpuacct_cpuusage,
+#ifdef CONFIG_SCHED_CORE
+	.forceidle	= &root_cpuacct_forceidle,
+#endif
 };
 
 /* Create a new CPU accounting group */
@@ -72,8 +81,18 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 	if (!ca->cpustat)
 		goto out_free_cpuusage;
 
+#ifdef CONFIG_SCHED_CORE
+	ca->forceidle = alloc_percpu(u64);
+	if (!ca->forceidle)
+		goto out_free_cpustat;
+#endif
+
 	return &ca->css;
 
+#ifdef CONFIG_SCHED_CORE
+out_free_cpustat:
+	free_percpu(ca->cpustat);
+#endif
 out_free_cpuusage:
 	free_percpu(ca->cpuusage);
 out_free_ca:
@@ -290,6 +309,37 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_CORE
+static u64 __forceidle_read(struct cpuacct *ca, int cpu)
+{
+	return *per_cpu_ptr(ca->forceidle, cpu);
+}
+static int cpuacct_percpu_forceidle_seq_show(struct seq_file *m, void *V)
+{
+	struct cpuacct *ca = css_ca(seq_css(m));
+	u64 percpu;
+	int i;
+
+	for_each_possible_cpu(i) {
+		percpu = __forceidle_read(ca, i);
+		seq_printf(m, "%llu ", (unsigned long long) percpu);
+	}
+	seq_printf(m, "\n");
+	return 0;
+}
+static u64 cpuacct_forceidle_read(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	struct cpuacct *ca = css_ca(css);
+	u64 totalforceidle = 0;
+	int i;
+
+	for_each_possible_cpu(i)
+		totalforceidle += __forceidle_read(ca, i);
+	return totalforceidle;
+}
+#endif
+
 static struct cftype files[] = {
 	{
 		.name = "usage",
@@ -324,6 +374,16 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 		.name = "stat",
 		.seq_show = cpuacct_stats_show,
 	},
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "forceidle",
+		.read_u64 = cpuacct_forceidle_read,
+	},
+	{
+		.name = "forceidle_percpu",
+		.seq_show = cpuacct_percpu_forceidle_seq_show,
+	},
+#endif
 	{ }	/* terminate */
 };
 
@@ -359,6 +419,25 @@ void cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
 	rcu_read_unlock();
 }
 
+#ifdef CONFIG_SCHED_CORE
+void cpuacct_account_forceidle(int cpu, struct task_struct *tsk, u64 cputime)
+{
+	struct cpuacct *ca;
+	u64 *fi;
+
+	rcu_read_lock();
+	/*
+	 * We have hold rq->core->__lock here, which protects ca->forceidle
+	 * percpu.
+	 */
+	for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
+		fi = per_cpu_ptr(ca->forceidle, cpu);
+		*fi += cputime;
+	}
+	rcu_read_unlock();
+}
+#endif
+
 struct cgroup_subsys cpuacct_cgrp_subsys = {
 	.css_alloc	= cpuacct_css_alloc,
 	.css_free	= cpuacct_css_free,
-- 
1.8.3.1



* Re: [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task
  2022-01-11  9:55 ` [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task Cruz Zhao
@ 2022-01-11 23:52   ` Josh Don
  2022-01-18 11:18   ` [tip: sched/urgent] " tip-bot2 for Cruz Zhao
  1 sibling, 0 replies; 14+ messages in thread
From: Josh Don @ 2022-01-11 23:52 UTC (permalink / raw)
  To: Cruz Zhao
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman,
	Daniel Bristot de Oliveira, cgroups, linux-kernel

On Tue, Jan 11, 2022 at 1:56 AM Cruz Zhao <CruzZhao@linux.alibaba.com> wrote:
>
> There are two types of forced idle time: forced idle time from cookie'd
> tasks and forced idle time from uncookie'd tasks. The forced idle time from
> uncookie'd tasks is indirectly caused by the cookie'd tasks in the runqueue,
> so it's more accurate to measure the capacity loss as the sum of both.
>
> Assuming cpu x and cpu y are a pair of SMT siblings, consider the
> following scenarios:
>   1. There's a cookie'd task running on cpu x, and there are 4 uncookie'd
>      tasks running on cpu y. For cpu x, there will be 80% forced idle time
>      (from uncookie'd tasks); for cpu y, there will be 20% forced idle time
>      (from the cookie'd task).
>   2. There's an uncookie'd task running on cpu x, and there are 4 cookie'd
>      tasks running on cpu y. For cpu x, there will be 80% forced idle time
>      (from cookie'd tasks); for cpu y, there will be 20% forced idle time
>      (from the uncookie'd task).
>
> Scenario 1 can be reproduced with stress-ng (scenario 2 similarly):
>     (cookie'd)   taskset -c x stress-ng -c 1 -l 100
>     (uncookie'd) taskset -c y stress-ng -c 4 -l 100
>
> In the above two scenarios, the total capacity loss is 1 cpu, but in
> scenario 1 the cookie'd forced idle time reports a 20% cpu capacity loss,
> while in scenario 2 it reports an 80% cpu capacity loss, neither of which
> is accurate. It's more accurate to measure with the sum of cookie'd forced
> idle time and uncookie'd forced idle time.
>
> Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
> ---

Thanks,

Reviewed-by: Josh Don <joshdon@google.com>


* Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu
  2022-01-11  9:56 ` [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu Cruz Zhao
@ 2022-01-12  1:59   ` Josh Don
  2022-01-14 15:04     ` cruzzhao
  2022-01-12 12:27   ` Peter Zijlstra
  1 sibling, 1 reply; 14+ messages in thread
From: Josh Don @ 2022-01-12  1:59 UTC (permalink / raw)
  To: Cruz Zhao
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman,
	Daniel Bristot de Oliveira, cgroups, linux-kernel

On Tue, Jan 11, 2022 at 1:56 AM Cruz Zhao <CruzZhao@linux.alibaba.com> wrote:
>
> Accounting for "forced idle" time per cpu, which is the time a cpu is
> forced to idle by its SMT sibling.
>
> As it's not accurate to measure the capacity loss only by cookie'd forced
> idle time, and it's hard to trace the forced idle time caused by all the
> uncookie'd tasks, we account the forced idle time from the perspective of
> the cpu.
>
> Forced idle time per cpu is displayed via /proc/schedstat; the forced idle
> time of cpu x appears in the 10th column of the row for cpu x, in ns. It
> also requires that schedstats is enabled.
>
> Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
> ---

Two quick followup questions:

1) From your v1, I still wasn't quite sure if the per-cpu time was
useful or not for you versus a single overall sum (ie. I think other
metrics could be more useful for analyzing steal_cookie if that's what
you're specifically interested in). Do you definitely want the per-cpu
totals?

2) If your cgroup accounting patch is merged, do you still want this
patch? You can grab the global values from the root cgroup (assuming
you have cgroups enabled). The only potential gap is if you need
per-cpu totals, though I'm working to add percpu stats to cgroup-v2:
https://lkml.kernel.org/r/%3C20220107234138.1765668-1-joshdon@google.com%3E


* Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu
  2022-01-11  9:56 ` [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu Cruz Zhao
  2022-01-12  1:59   ` Josh Don
@ 2022-01-12 12:27   ` Peter Zijlstra
  2022-01-14 11:06     ` cruzzhao
  1 sibling, 1 reply; 14+ messages in thread
From: Peter Zijlstra @ 2022-01-12 12:27 UTC (permalink / raw)
  To: Cruz Zhao
  Cc: tj, lizefan.x, hannes, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, joshdon,
	cgroups, linux-kernel

On Tue, Jan 11, 2022 at 05:56:00PM +0800, Cruz Zhao wrote:

> @@ -1115,6 +1118,7 @@ struct rq {
>  	unsigned int		core_forceidle_seq;
>  	unsigned int		core_forceidle_occupation;
>  	u64			core_forceidle_start;
> +	bool			in_forcedidle;

naming is wrong

>  #endif
>  };
>  
> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
> index 07dde29..ea22a8c 100644
> --- a/kernel/sched/stats.c
> +++ b/kernel/sched/stats.c
> @@ -108,6 +108,16 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
>  	}
>  }
>  
> +#ifdef CONFIG_SCHED_CORE
> +static inline u64 get_rq_forceidle_time(struct rq *rq) {
> +	return rq->rq_forceidle_time;
> +}
> +#else
> +static inline u64 get_rq_forceidle_time(struct rq *rq) {
> +	return 0;
> +}
> +#endif

indent is wrong, and if you put the #ifdef inside the function it'll be
smaller.


* Re: [PATCH v2 3/3] sched/core: Force idle accounting per cgroup
  2022-01-11  9:56 ` [PATCH v2 3/3] sched/core: Force idle accounting per cgroup Cruz Zhao
@ 2022-01-12 20:42   ` Tejun Heo
  2022-01-14 11:13     ` cruzzhao
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2022-01-12 20:42 UTC (permalink / raw)
  To: Cruz Zhao
  Cc: lizefan.x, hannes, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, joshdon,
	cgroups, linux-kernel

Hello,

On Tue, Jan 11, 2022 at 05:56:01PM +0800, Cruz Zhao wrote:
> +#ifdef CONFIG_SCHED_CORE
> +void cpuacct_account_forceidle(int cpu, struct task_struct *tsk, u64 cputime)
> +{
> +	struct cpuacct *ca;
> +	u64 *fi;
> +
> +	rcu_read_lock();
> +	/*
> +	 * We have hold rq->core->__lock here, which protects ca->forceidle
> +	 * percpu.
> +	 */
> +	for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
> +		fi = per_cpu_ptr(ca->forceidle, cpu);
> +		*fi += cputime;
> +	}

Please don't do this. Use rstat and integrate it with other stats.

Thanks.

-- 
tejun


* Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu
  2022-01-12 12:27   ` Peter Zijlstra
@ 2022-01-14 11:06     ` cruzzhao
  0 siblings, 0 replies; 14+ messages in thread
From: cruzzhao @ 2022-01-14 11:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tj, lizefan.x, hannes, mingo, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, joshdon,
	cgroups, linux-kernel



在 2022/1/12 下午8:27, Peter Zijlstra 写道:
> On Tue, Jan 11, 2022 at 05:56:00PM +0800, Cruz Zhao wrote:
> 
>> @@ -1115,6 +1118,7 @@ struct rq {
>>  	unsigned int		core_forceidle_seq;
>>  	unsigned int		core_forceidle_occupation;
>>  	u64			core_forceidle_start;
>> +	bool			in_forcedidle;
> 
> naming is wrong
> 
>>  #endif
>>  };
>>  
>> diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
>> index 07dde29..ea22a8c 100644
>> --- a/kernel/sched/stats.c
>> +++ b/kernel/sched/stats.c
>> @@ -108,6 +108,16 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
>>  	}
>>  }
>>  
>> +#ifdef CONFIG_SCHED_CORE
>> +static inline u64 get_rq_forceidle_time(struct rq *rq) {
>> +	return rq->rq_forceidle_time;
>> +}
>> +#else
>> +static inline u64 get_rq_forceidle_time(struct rq *rq) {
>> +	return 0;
>> +}
>> +#endif
> 
> indent is wrong, and if you put the #ifdef inside the function it'll be
> smaller.

Thanks for reviewing and suggestions, I'll fix these problems in the
next version.

Best,
Cruz Zhao


* Re: [PATCH v2 3/3] sched/core: Force idle accounting per cgroup
  2022-01-12 20:42   ` Tejun Heo
@ 2022-01-14 11:13     ` cruzzhao
  2022-01-14 16:39       ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: cruzzhao @ 2022-01-14 11:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: lizefan.x, hannes, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, joshdon,
	cgroups, linux-kernel



在 2022/1/13 上午4:42, Tejun Heo 写道:
> Hello,
> 
> On Tue, Jan 11, 2022 at 05:56:01PM +0800, Cruz Zhao wrote:
>> +#ifdef CONFIG_SCHED_CORE
>> +void cpuacct_account_forceidle(int cpu, struct task_struct *tsk, u64 cputime)
>> +{
>> +	struct cpuacct *ca;
>> +	u64 *fi;
>> +
>> +	rcu_read_lock();
>> +	/*
>> +	 * We have hold rq->core->__lock here, which protects ca->forceidle
>> +	 * percpu.
>> +	 */
>> +	for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
>> +		fi = per_cpu_ptr(ca->forceidle, cpu);
>> +		*fi += cputime;
>> +	}
> 
> Please don't do this. Use rstat and integrate it with other stats.
> 
> Thanks.
> 

Thanks for suggestions, I'll try to do this using rstat. BTW, is it ok
to integrate it with cgroup_base_stat?

Best,
Cruz Zhao


* Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu
  2022-01-12  1:59   ` Josh Don
@ 2022-01-14 15:04     ` cruzzhao
  2022-01-14 23:40       ` Josh Don
  0 siblings, 1 reply; 14+ messages in thread
From: cruzzhao @ 2022-01-14 15:04 UTC (permalink / raw)
  To: Josh Don
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman,
	Daniel Bristot de Oliveira, cgroups, linux-kernel



在 2022/1/12 上午9:59, Josh Don 写道:
> On Tue, Jan 11, 2022 at 1:56 AM Cruz Zhao <CruzZhao@linux.alibaba.com> wrote:
>>
>> Accounting for "forced idle" time per cpu, which is the time a cpu is
>> forced to idle by its SMT sibling.
>>
>> As it's not accurate to measure the capacity loss only by cookie'd forced
>> idle time, and it's hard to trace the forced idle time caused by all the
>> uncookie'd tasks, we account the forced idle time from the perspective of
>> the cpu.
>>
>> Forced idle time per cpu is displayed via /proc/schedstat; the forced idle
>> time of cpu x appears in the 10th column of the row for cpu x, in ns. It
>> also requires that schedstats is enabled.
>>
>> Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
>> ---
> 
> Two quick followup questions:
> 
> 1) From your v1, I still wasn't quite sure if the per-cpu time was
> useful or not for you versus a single overall sum (ie. I think other
> metrics could be more useful for analyzing steal_cookie if that's what
> you're specifically interested in). Do you definitely want the per-cpu
> totals?
> 
IMO, the per-cpu forced idle time can help us see whether the forced
idle time is uniform within a core, or across all the cpus; it's a
measure of balance.

> 2) If your cgroup accounting patch is merged, do you still want this
> patch? You can grab the global values from the root cgroup (assuming
> you have cgroups enabled). The only potential gap is if you need
> per-cpu totals, though I'm working to add percpu stats to cgroup-v2:
> https://lkml.kernel.org/r/%3C20220107234138.1765668-1-joshdon@google.com%3E

If the cgroup accounting patch is merged, this patch is still needed.

Consider the following scenario:
Task p of cgroup A is running on cpu x, and it forces cpu y into idle
for t ns. The force idle time of cgroup A on cpu x increases by t ns, and
the forced idle time of cpu y increases by t ns.

That is, the force idle time of a cgroup is defined as the forced idle
time it caused, and the forced idle time of a cpu is defined as the time
the cpu was forced into idle, which have different meanings from each other.

And for SMT > 2, we cannot calculate the forced idle time of cpu x from
the cgroup interface.

Best,
Cruz Zhao


* Re: [PATCH v2 3/3] sched/core: Force idle accounting per cgroup
  2022-01-14 11:13     ` cruzzhao
@ 2022-01-14 16:39       ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2022-01-14 16:39 UTC (permalink / raw)
  To: cruzzhao
  Cc: lizefan.x, hannes, mingo, peterz, juri.lelli, vincent.guittot,
	dietmar.eggemann, rostedt, bsegall, mgorman, bristot, joshdon,
	cgroups, linux-kernel

On Fri, Jan 14, 2022 at 07:13:12PM +0800, cruzzhao wrote:
> Thanks for suggestions, I'll try to do this using rstat. BTW, is it ok
> to integrate it with cgroup_base_stat?

Yeah, I don't see a problem at least from cgroup / rstat POV.

Thanks.

-- 
tejun


* Re: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu
  2022-01-14 15:04     ` cruzzhao
@ 2022-01-14 23:40       ` Josh Don
  0 siblings, 0 replies; 14+ messages in thread
From: Josh Don @ 2022-01-14 23:40 UTC (permalink / raw)
  To: cruzzhao
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Benjamin Segall, Mel Gorman,
	Daniel Bristot de Oliveira, cgroups, linux-kernel

> > 1) From your v1, I still wasn't quite sure if the per-cpu time was
> > useful or not for you versus a single overall sum (ie. I think other
> > metrics could be more useful for analyzing steal_cookie if that's what
> > you're specifically interested in). Do you definitely want the per-cpu
> > totals?
> >
> IMO, the per-cpu forced idle time can help us see whether the forced
> idle time is uniform within a core, or across all the cpus; it's a
> measure of balance.

Sure, I'm not opposed to it.

> > 2) If your cgroup accounting patch is merged, do you still want this
> > patch? You can grab the global values from the root cgroup (assuming
> > you have cgroups enabled). The only potential gap is if you need
> > per-cpu totals, though I'm working to add percpu stats to cgroup-v2:
> > https://lkml.kernel.org/r/%3C20220107234138.1765668-1-joshdon@google.com%3E
>
> If the cgroup accounting patch is merged, this patch is still needed.
>
> Consider the following scenario:
> Task p of cgroup A is running on cpu x, and it forces cpu y into idle
> for t ns. The force idle time of cgroup A on cpu x increases by t ns, and
> the forced idle time of cpu y increases by t ns.
>
> That is, the force idle time of a cgroup is defined as the forced idle
> time it caused, and the forced idle time of a cpu is defined as the time
> the cpu was forced into idle, which have different meanings from each other.
>
> And for SMT > 2, we cannot calculate the forced idle time of cpu x from
> the cgroup interface.

Ack. Note that the patch I linked above for per-cpu stats for
cgroup-v2 has been nack'd, so it is unlikely we'll have kernel exports
for cgroup-v2 per-cpu stats (but perhaps some export via BPF). In v2,
we could at least export the cgroup sum force idle time in cpu.stat,
if you feel it would be useful and want to add the accounting to
rstat. If you really just want the per-cpu root stats then I'm fine
with skipping cgroup for now.


* [tip: sched/urgent] sched/core: Accounting forceidle time for all tasks except idle task
  2022-01-11  9:55 ` [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task Cruz Zhao
  2022-01-11 23:52   ` Josh Don
@ 2022-01-18 11:18   ` tip-bot2 for Cruz Zhao
  1 sibling, 0 replies; 14+ messages in thread
From: tip-bot2 for Cruz Zhao @ 2022-01-18 11:18 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Cruz Zhao, Peter Zijlstra (Intel), Josh Don, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     b171501f258063f5c56dd2c5fdf310802d8d7dc1
Gitweb:        https://git.kernel.org/tip/b171501f258063f5c56dd2c5fdf310802d8d7dc1
Author:        Cruz Zhao <CruzZhao@linux.alibaba.com>
AuthorDate:    Tue, 11 Jan 2022 17:55:59 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 18 Jan 2022 12:09:59 +01:00

sched/core: Accounting forceidle time for all tasks except idle task

There are two types of forced idle time: forced idle time from cookie'd
tasks and forced idle time from uncookie'd tasks. The forced idle time from
uncookie'd tasks is actually caused indirectly by the cookie'd tasks in the
runqueue, so it is more accurate to measure the capacity loss with the
sum of both.

Assuming cpu x and cpu y are a pair of SMT siblings, consider the
following scenarios:
  1.There's a cookie'd task running on cpu x, and there're 4 uncookie'd
    tasks running on cpu y. For cpu x, there will be 80% forced idle time
    (from uncookie'd task); for cpu y, there will be 20% forced idle time
    (from cookie'd task).
  2.There's an uncookie'd task running on cpu x, and there are 4 cookie'd
    tasks running on cpu y. For cpu x, there will be 80% forced idle time
    (from cookie'd tasks); for cpu y, there will be 20% forced idle time
    (from the uncookie'd task).

Scenario 1 can be reproduced with stress-ng (scenario 2 can be reproduced
similarly):
    (cookie'd)taskset -c x stress-ng -c 1 -l 100
    (uncookie'd)taskset -c y stress-ng -c 4 -l 100

In the above two scenarios, the total capacity loss is 1 cpu, but in
scenario 1 the cookie'd forced idle time reports only 20% cpu capacity
loss, while in scenario 2 it reports 80%, neither of which is accurate.
It is more accurate to measure with the sum of cookie'd and uncookie'd
forced idle time.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Josh Don <joshdon@google.com>
Link: https://lore.kernel.org/r/1641894961-9241-2-git-send-email-CruzZhao@linux.alibaba.com
---
 kernel/sched/core.c       | 3 +--
 kernel/sched/core_sched.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 83872f9..0d2ab2a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5822,8 +5822,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	}
 
 	if (schedstat_enabled() && rq->core->core_forceidle_count) {
-		if (cookie)
-			rq->core->core_forceidle_start = rq_clock(rq->core);
+		rq->core->core_forceidle_start = rq_clock(rq->core);
 		rq->core->core_forceidle_occupation = occ;
 	}
 
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index 1fb4567..c8746a9 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -277,7 +277,7 @@ void __sched_core_account_forceidle(struct rq *rq)
 		rq_i = cpu_rq(i);
 		p = rq_i->core_pick ?: rq_i->curr;
 
-		if (!p->core_cookie)
+		if (p == rq_i->idle)
 			continue;
 
 		__schedstat_add(p->stats.core_forceidle_sum, delta);


end of thread, other threads:[~2022-01-18 11:18 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-11  9:55 [PATCH v2 0/3] Accounting forced idle time per cpu and per cgroup Cruz Zhao
2022-01-11  9:55 ` [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task Cruz Zhao
2022-01-11 23:52   ` Josh Don
2022-01-18 11:18   ` [tip: sched/urgent] " tip-bot2 for Cruz Zhao
2022-01-11  9:56 ` [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu Cruz Zhao
2022-01-12  1:59   ` Josh Don
2022-01-14 15:04     ` cruzzhao
2022-01-14 23:40       ` Josh Don
2022-01-12 12:27   ` Peter Zijlstra
2022-01-14 11:06     ` cruzzhao
2022-01-11  9:56 ` [PATCH v2 3/3] sched/core: Force idle accounting per cgroup Cruz Zhao
2022-01-12 20:42   ` Tejun Heo
2022-01-14 11:13     ` cruzzhao
2022-01-14 16:39       ` Tejun Heo
