linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat
@ 2021-02-17 12:00 Andrey Ryabinin
  2021-02-17 12:00 ` [PATCH 2/4] cgroup: Fix 'usage_usec' time in root's cpu.stat Andrey Ryabinin
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2021-02-17 12:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Tejun Heo, Zefan Li, Johannes Weiner, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel,
	Andrey Ryabinin, stable

cpuacct.stat in non-root cgroups shows user time without guest time
included in it. This doesn't match the user time shown in the root
cpuacct.stat and /proc/<pid>/stat.

Make account_guest_time() add user time to the cgroup's cpustat to
fix this.
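For readers unfamiliar with the buckets involved, the intended behavior can be sketched as a toy model (plain Python illustration, not kernel code; the real task_group_account_field() additionally walks the whole cgroup hierarchy):

```python
# Toy model of the accounting after this patch: guest cputime lands in the
# user (or nice) bucket of the global kcpustat AND of the cgroup's cpustat,
# so root and non-root user times agree. Illustration only, not kernel code.
CPUTIME_USER, CPUTIME_NICE, CPUTIME_GUEST, CPUTIME_GUEST_NICE = range(4)

def account_guest_time(stats, nice, cputime):
    # 'stats' stands for the global kcpustat plus each ancestor cgroup's cpustat
    for stat in stats:
        if nice > 0:
            stat[CPUTIME_NICE] += cputime
            stat[CPUTIME_GUEST_NICE] += cputime
        else:
            stat[CPUTIME_USER] += cputime
            stat[CPUTIME_GUEST] += cputime

glob, cg = [0] * 4, [0] * 4
account_guest_time([glob, cg], nice=0, cputime=100)
print(glob[CPUTIME_USER], cg[CPUTIME_USER])  # 100 100
```

Before the patch only the global stat got the guest time, so the cgroup's user time lagged behind.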

Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Cc: <stable@vger.kernel.org>
---
 kernel/sched/cputime.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5f611658eeab..95a9c5603d29 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -139,8 +139,6 @@ void account_user_time(struct task_struct *p, u64 cputime)
  */
 void account_guest_time(struct task_struct *p, u64 cputime)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
-
 	/* Add guest time to process. */
 	p->utime += cputime;
 	account_group_user_time(p, cputime);
@@ -148,11 +146,11 @@ void account_guest_time(struct task_struct *p, u64 cputime)
 
 	/* Add guest time to cpustat. */
 	if (task_nice(p) > 0) {
-		cpustat[CPUTIME_NICE] += cputime;
-		cpustat[CPUTIME_GUEST_NICE] += cputime;
+		task_group_account_field(p, CPUTIME_NICE, cputime);
+		task_group_account_field(p, CPUTIME_GUEST_NICE, cputime);
 	} else {
-		cpustat[CPUTIME_USER] += cputime;
-		cpustat[CPUTIME_GUEST] += cputime;
+		task_group_account_field(p, CPUTIME_USER, cputime);
+		task_group_account_field(p, CPUTIME_GUEST, cputime);
 	}
 }
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/4] cgroup: Fix 'usage_usec' time in root's cpu.stat
  2021-02-17 12:00 [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Andrey Ryabinin
@ 2021-02-17 12:00 ` Andrey Ryabinin
  2021-03-17 22:13   ` Daniel Jordan
  2021-02-17 12:00 ` [PATCH 3/4] sched/cpuacct: fix user/system in shown cpuacct.usage* Andrey Ryabinin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Andrey Ryabinin @ 2021-02-17 12:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Tejun Heo, Zefan Li, Johannes Weiner, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel,
	Andrey Ryabinin, stable

Global CPUTIME_USER counter already includes CPUTIME_GUEST,
and CPUTIME_NICE already includes CPUTIME_GUEST_NICE.

Remove the additions of CPUTIME_GUEST[_NICE] to the total
->sum_exec_runtime to avoid accounting them twice.
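The double count is easy to see with toy numbers (illustration only; the real summation lives in root_cgroup_cputime()):

```python
# Suppose a CPU spent 70 units in guest mode; those units are already folded
# into the 100 units of CPUTIME_USER. Hypothetical numbers for illustration.
user, nice, system, steal = 100, 0, 10, 5
guest, guest_nice = 70, 0  # subsets of `user`/`nice`, not extra time

# Before the patch: guest time is added on top of user time that contains it.
before = user + nice + system + steal + guest + guest_nice
# After the patch: each unit of CPU time is counted exactly once.
after = user + nice + system + steal

print(before, after)  # 185 115
```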

Fixes: 936f2a70f207 ("cgroup: add cpu.stat file to root cgroup")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Cc: <stable@vger.kernel.org>
---
 kernel/cgroup/rstat.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d51175cedfca..89ca9b61aa0d 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -421,8 +421,6 @@ static void root_cgroup_cputime(struct task_cputime *cputime)
 		cputime->sum_exec_runtime += user;
 		cputime->sum_exec_runtime += sys;
 		cputime->sum_exec_runtime += cpustat[CPUTIME_STEAL];
-		cputime->sum_exec_runtime += cpustat[CPUTIME_GUEST];
-		cputime->sum_exec_runtime += cpustat[CPUTIME_GUEST_NICE];
 	}
 }
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 3/4] sched/cpuacct: fix user/system in shown cpuacct.usage*
  2021-02-17 12:00 [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Andrey Ryabinin
  2021-02-17 12:00 ` [PATCH 2/4] cgroup: Fix 'usage_usec' time in root's cpu.stat Andrey Ryabinin
@ 2021-02-17 12:00 ` Andrey Ryabinin
  2021-03-17 22:22   ` Daniel Jordan
  2021-02-17 12:00 ` [PATCH 4/4] sched/cpuacct: Make user/system times in cpuacct.stat more precise Andrey Ryabinin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Andrey Ryabinin @ 2021-02-17 12:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Tejun Heo, Zefan Li, Johannes Weiner, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel,
	Andrey Ryabinin, stable

cpuacct has two different ways of accounting and showing user
and system times.

The first one uses cpuacct_account_field() to account times
and the cpuacct.stat file to expose them, and this one seems to work fine.

The second one uses the cpuacct_charge() function for accounting and
the set of cpuacct.usage* files to show times. Despite some past attempts
to fix it, it still doesn't work. E.g. while running a KVM
guest, cpuacct_charge() accounts most of the guest time as
system time. This doesn't match the user/system times shown in
cpuacct.stat or /proc/<pid>/stat.

Use the cpustats accounted in cpuacct_account_field() as the source
of user/sys times for the cpuacct.usage* files. Make cpuacct_charge()
account only the total execution time.
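The mapping introduced here can be sketched in a few lines (a hypothetical simplified model of cpuacct_cpuusage_read() after the patch; not kernel code):

```python
# After this patch, cpuacct.usage_user/usage_sys are derived from the same
# cpustat buckets that back cpuacct.stat, so the two interfaces agree.
# Bucket values below are made-up numbers for illustration.
def usage_user(cpustat):
    return cpustat["user"] + cpustat["nice"]

def usage_sys(cpustat):
    return cpustat["system"] + cpustat["irq"] + cpustat["softirq"]

stat = {"user": 700, "nice": 50, "system": 100, "irq": 20, "softirq": 30}
print(usage_user(stat), usage_sys(stat))  # 750 150
```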

Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and sys_usage")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Cc: <stable@vger.kernel.org>
---
 kernel/sched/cpuacct.c | 77 +++++++++++++++++++-----------------------
 1 file changed, 34 insertions(+), 43 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 941c28cf9738..7eff79faab0d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -29,7 +29,7 @@ struct cpuacct_usage {
 struct cpuacct {
 	struct cgroup_subsys_state	css;
 	/* cpuusage holds pointer to a u64-type object on every CPU */
-	struct cpuacct_usage __percpu	*cpuusage;
+	u64 __percpu	*cpuusage;
 	struct kernel_cpustat __percpu	*cpustat;
 };
 
@@ -49,7 +49,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 	return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
 	.cpuusage	= &root_cpuacct_cpuusage,
@@ -68,7 +68,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (!ca)
 		goto out;
 
-	ca->cpuusage = alloc_percpu(struct cpuacct_usage);
+	ca->cpuusage = alloc_percpu(u64);
 	if (!ca->cpuusage)
 		goto out_free_ca;
 
@@ -99,7 +99,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
 static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 				 enum cpuacct_stat_index index)
 {
-	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 	u64 data;
 
 	/*
@@ -115,14 +116,17 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
 #endif
 
-	if (index == CPUACCT_STAT_NSTATS) {
-		int i = 0;
-
-		data = 0;
-		for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-			data += cpuusage->usages[i];
-	} else {
-		data = cpuusage->usages[index];
+	switch (index) {
+	case CPUACCT_STAT_USER:
+		data = cpustat[CPUTIME_USER] + cpustat[CPUTIME_NICE];
+		break;
+	case CPUACCT_STAT_SYSTEM:
+		data = cpustat[CPUTIME_SYSTEM] + cpustat[CPUTIME_IRQ] +
+			cpustat[CPUTIME_SOFTIRQ];
+		break;
+	case CPUACCT_STAT_NSTATS:
+		data = *cpuusage;
+		break;
 	}
 
 #ifndef CONFIG_64BIT
@@ -132,10 +136,14 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu)
 {
-	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-	int i;
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
+
+	/* Don't allow to reset global kernel_cpustat */
+	if (ca == &root_cpuacct)
+		return;
 
 #ifndef CONFIG_64BIT
 	/*
@@ -143,9 +151,10 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	 */
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
 #endif
-
-	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-		cpuusage->usages[i] = val;
+	*cpuusage = 0;
+	cpustat[CPUTIME_USER] = cpustat[CPUTIME_NICE] = 0;
+	cpustat[CPUTIME_SYSTEM] = cpustat[CPUTIME_IRQ] = 0;
+	cpustat[CPUTIME_SOFTIRQ] = 0;
 
 #ifndef CONFIG_64BIT
 	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
@@ -196,7 +205,7 @@ static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		return -EINVAL;
 
 	for_each_possible_cpu(cpu)
-		cpuacct_cpuusage_write(ca, cpu, 0);
+		cpuacct_cpuusage_write(ca, cpu);
 
 	return 0;
 }
@@ -243,25 +252,12 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 	seq_puts(m, "\n");
 
 	for_each_possible_cpu(cpu) {
-		struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
 		seq_printf(m, "%d", cpu);
 
-		for (index = 0; index < CPUACCT_STAT_NSTATS; index++) {
-#ifndef CONFIG_64BIT
-			/*
-			 * Take rq->lock to make 64-bit read safe on 32-bit
-			 * platforms.
-			 */
-			raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-#endif
-
-			seq_printf(m, " %llu", cpuusage->usages[index]);
+		for (index = 0; index < CPUACCT_STAT_NSTATS; index++)
+			seq_printf(m, " %llu",
+				cpuacct_cpuusage_read(ca, cpu, index));
 
-#ifndef CONFIG_64BIT
-			raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#endif
-		}
 		seq_puts(m, "\n");
 	}
 	return 0;
@@ -278,8 +274,8 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 	for_each_possible_cpu(cpu) {
 		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
+		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
+		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
 		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
 		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
 		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
@@ -339,16 +335,11 @@ static struct cftype files[] = {
 void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 {
 	struct cpuacct *ca;
-	int index = CPUACCT_STAT_SYSTEM;
-	struct pt_regs *regs = get_irq_regs() ? : task_pt_regs(tsk);
-
-	if (regs && user_mode(regs))
-		index = CPUACCT_STAT_USER;
 
 	rcu_read_lock();
 
 	for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
-		__this_cpu_add(ca->cpuusage->usages[index], cputime);
+		__this_cpu_add(*ca->cpuusage, cputime);
 
 	rcu_read_unlock();
 }
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 4/4] sched/cpuacct: Make user/system times in cpuacct.stat more precise
  2021-02-17 12:00 [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Andrey Ryabinin
  2021-02-17 12:00 ` [PATCH 2/4] cgroup: Fix 'usage_usec' time in root's cpu.stat Andrey Ryabinin
  2021-02-17 12:00 ` [PATCH 3/4] sched/cpuacct: fix user/system in shown cpuacct.usage* Andrey Ryabinin
@ 2021-02-17 12:00 ` Andrey Ryabinin
  2021-03-17 22:25   ` Daniel Jordan
  2021-03-17 22:09 ` [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Daniel Jordan
  2021-08-20  9:40 ` [PATCH v2 1/5] cputime, cpuacct: " Andrey Ryabinin
  4 siblings, 1 reply; 17+ messages in thread
From: Andrey Ryabinin @ 2021-02-17 12:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Tejun Heo, Zefan Li, Johannes Weiner, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel, Andrey Ryabinin

cpuacct.stat shows user/system time based on raw tick-based counters
of unpredictable precision. Use cputime_adjust() to scale these values
against the total runtime accounted by the scheduler, as we already do
for the user/system times in /proc/<pid>/stat.
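The scaling idea can be sketched as follows (a simplified model of what cputime_adjust() does; the real kernel version also keeps reported values monotonic via prev_cputime):

```python
# Simplified sketch of the cputime_adjust() idea: keep the user/system
# *ratio* from the tick counters, but make the sum match the precise
# scheduler runtime. Illustration only, not the kernel implementation.
def adjust(utime, stime, rtime):
    if stime == 0:
        return rtime, 0
    if utime == 0:
        return 0, rtime
    total = utime + stime
    scaled_stime = stime * rtime // total
    return rtime - scaled_stime, scaled_stime

# Ticks say 60/40 user/sys, but the scheduler measured 200 units total.
print(adjust(60, 40, 200))  # (120, 80)
```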

Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
 kernel/sched/cpuacct.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 7eff79faab0d..36347f2810c0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -266,25 +266,30 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 static int cpuacct_stats_show(struct seq_file *sf, void *v)
 {
 	struct cpuacct *ca = css_ca(seq_css(sf));
-	s64 val[CPUACCT_STAT_NSTATS];
+	struct task_cputime cputime;
+	u64 val[CPUACCT_STAT_NSTATS];
 	int cpu;
 	int stat;
 
-	memset(val, 0, sizeof(val));
+	memset(&cputime, 0, sizeof(cputime));
 	for_each_possible_cpu(cpu) {
 		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
-		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
-		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
+		cputime.utime += cpustat[CPUTIME_USER];
+		cputime.utime += cpustat[CPUTIME_NICE];
+		cputime.stime += cpustat[CPUTIME_SYSTEM];
+		cputime.stime += cpustat[CPUTIME_IRQ];
+		cputime.stime += cpustat[CPUTIME_SOFTIRQ];
+
+		cputime.sum_exec_runtime += this_cpu_read(*ca->cpuusage);
 	}
 
+	cputime_adjust(&cputime, &seq_css(sf)->cgroup->prev_cputime,
+		&val[CPUACCT_STAT_USER], &val[CPUACCT_STAT_SYSTEM]);
+
 	for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) {
-		seq_printf(sf, "%s %lld\n",
-			   cpuacct_stat_desc[stat],
-			   (long long)nsec_to_clock_t(val[stat]));
+		seq_printf(sf, "%s %llu\n", cpuacct_stat_desc[stat],
+			nsec_to_clock_t(val[stat]));
 	}
 
 	return 0;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat
  2021-02-17 12:00 [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Andrey Ryabinin
                   ` (2 preceding siblings ...)
  2021-02-17 12:00 ` [PATCH 4/4] sched/cpuacct: Make user/system times in cpuacct.stat more precise Andrey Ryabinin
@ 2021-03-17 22:09 ` Daniel Jordan
  2021-08-20  9:37   ` Andrey Ryabinin
  2021-08-20  9:40 ` [PATCH v2 1/5] cputime, cpuacct: " Andrey Ryabinin
  4 siblings, 1 reply; 17+ messages in thread
From: Daniel Jordan @ 2021-03-17 22:09 UTC (permalink / raw)
  To: Andrey Ryabinin, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Tejun Heo, Zefan Li, Johannes Weiner,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel,
	Andrey Ryabinin, stable

Andrey Ryabinin <arbn@yandex-team.com> writes:

> cpuacct.stat in non-root cgroups shows user time without guest time
> included in it. This doesn't match the user time shown in the root
> cpuacct.stat and /proc/<pid>/stat.

Yeah, that's inconsistent.

> Make account_guest_time() add user time to the cgroup's cpustat to
> fix this.

Yep.

cgroup2's cpu.stat is broken the same way for child cgroups, and this
happily fixes it.  Probably deserves a mention in the changelog.

The problem with cgroup2 was, if the workload was mostly guest time,
cpu.stat's user and system together reflected it, but it was split
unevenly across the two.  I think guest time wasn't actually included in
either bucket; it was just that what little user and system time there
was got scaled up in cgroup_base_stat_cputime_show -> cputime_adjust to
match sum_exec_runtime, which did have it.

The stats look ok now for both cgroup1 and 2.  Just slightly unsure
whether we want to change the way both interfaces expose the accounting
in case something out there depends on it.  Seems like we should, but
it'd be good to hear more opinions.

> @@ -148,11 +146,11 @@ void account_guest_time(struct task_struct *p, u64 cputime)
>  
>  	/* Add guest time to cpustat. */
>  	if (task_nice(p) > 0) {
> -		cpustat[CPUTIME_NICE] += cputime;
> -		cpustat[CPUTIME_GUEST_NICE] += cputime;
> +		task_group_account_field(p, CPUTIME_NICE, cputime);
> +		task_group_account_field(p, CPUTIME_GUEST_NICE, cputime);
>  	} else {
> -		cpustat[CPUTIME_USER] += cputime;
> -		cpustat[CPUTIME_GUEST] += cputime;
> +		task_group_account_field(p, CPUTIME_USER, cputime);
> +		task_group_account_field(p, CPUTIME_GUEST, cputime);
>  	}

Makes sense for _USER and _NICE, but it doesn't seem cgroup1 or 2
actually use _GUEST and _GUEST_NICE.

Could go either way.  Consistency is nice, but I probably wouldn't
change the GUEST ones so people aren't confused about why they're
accounted.  It's also extra cycles for nothing, even though most of the
data is probably in the cache.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/4] cgroup: Fix 'usage_usec' time in root's cpu.stat
  2021-02-17 12:00 ` [PATCH 2/4] cgroup: Fix 'usage_usec' time in root's cpu.stat Andrey Ryabinin
@ 2021-03-17 22:13   ` Daniel Jordan
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2021-03-17 22:13 UTC (permalink / raw)
  To: Andrey Ryabinin, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Tejun Heo, Zefan Li, Johannes Weiner,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel,
	Andrey Ryabinin, stable

Andrey Ryabinin <arbn@yandex-team.com> writes:

> Global CPUTIME_USER counter already includes CPUTIME_GUEST,
> and CPUTIME_NICE already includes CPUTIME_GUEST_NICE.
>
> Remove the additions of CPUTIME_GUEST[_NICE] to the total
> ->sum_exec_runtime to avoid accounting them twice.

Yes, that's just wrong.  usage_usec looks ok now.

Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] sched/cpuacct: fix user/system in shown cpuacct.usage*
  2021-02-17 12:00 ` [PATCH 3/4] sched/cpuacct: fix user/system in shown cpuacct.usage* Andrey Ryabinin
@ 2021-03-17 22:22   ` Daniel Jordan
  2021-08-20  9:37     ` Andrey Ryabinin
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Jordan @ 2021-03-17 22:22 UTC (permalink / raw)
  To: Andrey Ryabinin, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Tejun Heo, Zefan Li, Johannes Weiner,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel,
	Andrey Ryabinin, stable

Andrey Ryabinin <arbn@yandex-team.com> writes:

> cpuacct has two different ways of accounting and showing user
> and system times.
>
> The first one uses cpuacct_account_field() to account times
> and the cpuacct.stat file to expose them, and this one seems to work fine.
>
> The second one uses the cpuacct_charge() function for accounting and
> the set of cpuacct.usage* files to show times. Despite some past attempts
> to fix it, it still doesn't work. E.g. while running a KVM
> guest, cpuacct_charge() accounts most of the guest time as
> system time. This doesn't match the user/system times shown in
> cpuacct.stat or /proc/<pid>/stat.

I couldn't reproduce this running a cpu bound load in a kvm guest on a
nohz_full cpu on 5.11.  The time is almost entirely in cpuacct.usage and
_user, while _sys stays low.

Could you say more about how you're seeing this?  Don't really doubt
there's a problem, just wondering what you're doing.

> diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
> index 941c28cf9738..7eff79faab0d 100644
> --- a/kernel/sched/cpuacct.c
> +++ b/kernel/sched/cpuacct.c
> @@ -29,7 +29,7 @@ struct cpuacct_usage {
>  struct cpuacct {
>  	struct cgroup_subsys_state	css;
>  	/* cpuusage holds pointer to a u64-type object on every CPU */
> -	struct cpuacct_usage __percpu	*cpuusage;

Definition of struct cpuacct_usage can go away now.

> @@ -99,7 +99,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
>  static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
>  				 enum cpuacct_stat_index index)
>  {
> -	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> +	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
> +	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>  	u64 data;

There's a BUG_ON below this that could probably be a WARN_ON_ONCE while
you're here.

> @@ -278,8 +274,8 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
>  	for_each_possible_cpu(cpu) {
>  		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>  
> -		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
> -		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
> +		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
> +		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];

unnecessary whitespace change?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/4] sched/cpuacct: Make user/system times in cpuacct.stat more precise
  2021-02-17 12:00 ` [PATCH 4/4] sched/cpuacct: Make user/system times in cpuacct.stat more precise Andrey Ryabinin
@ 2021-03-17 22:25   ` Daniel Jordan
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2021-03-17 22:25 UTC (permalink / raw)
  To: Andrey Ryabinin, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Tejun Heo, Zefan Li, Johannes Weiner,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel, Andrey Ryabinin

Andrey Ryabinin <arbn@yandex-team.com> writes:
>  static int cpuacct_stats_show(struct seq_file *sf, void *v)
>  {
        ...
>  	for_each_possible_cpu(cpu) {
>  		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>  
> -		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
> -		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
> -		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
> -		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
> -		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
> +		cputime.utime += cpustat[CPUTIME_USER];
> +		cputime.utime += cpustat[CPUTIME_NICE];
> +		cputime.stime += cpustat[CPUTIME_SYSTEM];
> +		cputime.stime += cpustat[CPUTIME_IRQ];
> +		cputime.stime += cpustat[CPUTIME_SOFTIRQ];
> +
> +		cputime.sum_exec_runtime += this_cpu_read(*ca->cpuusage);
>  	}

                cputime.sum_exec_runtime += *per_cpu_ptr(ca->cpuusage, cpu);

Or the stats can all be 0...

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat
  2021-03-17 22:09 ` [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Daniel Jordan
@ 2021-08-20  9:37   ` Andrey Ryabinin
  0 siblings, 0 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2021-08-20  9:37 UTC (permalink / raw)
  To: Daniel Jordan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Tejun Heo, Zefan Li, Johannes Weiner,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel, stable


Sorry for abandoning this, got distracted by lots of other stuff.


On 3/18/21 1:09 AM, Daniel Jordan wrote:
> Andrey Ryabinin <arbn@yandex-team.com> writes:
> 
>> cpuacct.stat in non-root cgroups shows user time without guest time
>> included in it. This doesn't match the user time shown in the root
>> cpuacct.stat and /proc/<pid>/stat.
> 
> Yeah, that's inconsistent.
> 
>> Make account_guest_time() add user time to the cgroup's cpustat to
>> fix this.
> 
> Yep.
> 
> cgroup2's cpu.stat is broken the same way for child cgroups, and this
> happily fixes it.  Probably deserves a mention in the changelog.
> 

Sure.

> The problem with cgroup2 was, if the workload was mostly guest time,
> cpu.stat's user and system together reflected it, but it was split
> unevenly across the two.  I think guest time wasn't actually included in
> either bucket, it was just that the little user and system time there
> was got scaled up in cgroup_base_stat_cputime_show -> cputime_adjust to
> match sum_exec_runtime, which did have it.
> 
> The stats look ok now for both cgroup1 and 2.  Just slightly unsure
> whether we want to change the way both interfaces expose the accounting
> in case something out there depends on it.  Seems like we should, but
> it'd be good to hear more opinions.
> 
>> @@ -148,11 +146,11 @@ void account_guest_time(struct task_struct *p, u64 cputime)
>>  
>>  	/* Add guest time to cpustat. */
>>  	if (task_nice(p) > 0) {
>> -		cpustat[CPUTIME_NICE] += cputime;
>> -		cpustat[CPUTIME_GUEST_NICE] += cputime;
>> +		task_group_account_field(p, CPUTIME_NICE, cputime);
>> +		task_group_account_field(p, CPUTIME_GUEST_NICE, cputime);
>>  	} else {
>> -		cpustat[CPUTIME_USER] += cputime;
>> -		cpustat[CPUTIME_GUEST] += cputime;
>> +		task_group_account_field(p, CPUTIME_USER, cputime);
>> +		task_group_account_field(p, CPUTIME_GUEST, cputime);
>>  	}
> 
> Makes sense for _USER and _NICE, but it doesn't seem cgroup1 or 2
> actually use _GUEST and _GUEST_NICE.
> 
> Could go either way.  Consistency is nice, but I probably wouldn't
> change the GUEST ones so people aren't confused about why they're
> accounted.  It's also extra cycles for nothing, even though most of the
> data is probably in the cache.
> 

Agreed, will leave the _GUEST* as is.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/4] sched/cpuacct: fix user/system in shown cpuacct.usage*
  2021-03-17 22:22   ` Daniel Jordan
@ 2021-08-20  9:37     ` Andrey Ryabinin
  0 siblings, 0 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2021-08-20  9:37 UTC (permalink / raw)
  To: Daniel Jordan, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Tejun Heo, Zefan Li, Johannes Weiner,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira
  Cc: Boris Burkov, Bharata B Rao, cgroups, linux-kernel, stable



On 3/18/21 1:22 AM, Daniel Jordan wrote:
> Andrey Ryabinin <arbn@yandex-team.com> writes:
> 
>> cpuacct has two different ways of accounting and showing user
>> and system times.
>>
>> The first one uses cpuacct_account_field() to account times
>> and the cpuacct.stat file to expose them, and this one seems to work fine.
>>
>> The second one uses the cpuacct_charge() function for accounting and
>> the set of cpuacct.usage* files to show times. Despite some past attempts
>> to fix it, it still doesn't work. E.g. while running a KVM
>> guest, cpuacct_charge() accounts most of the guest time as
>> system time. This doesn't match the user/system times shown in
>> cpuacct.stat or /proc/<pid>/stat.
> 
> I couldn't reproduce this running a cpu bound load in a kvm guest on a
> nohz_full cpu on 5.11.  The time is almost entirely in cpuacct.usage and
> _user, while _sys stays low.
> 
> Could you say more about how you're seeing this?  Don't really doubt
> there's a problem, just wondering what you're doing.
> 


Yeah, it's almost unnoticeable if you run some load in a guest like qemu.

But a simpler case with a busy loop in KVM_RUN triggers this:

# git clone https://github.com/aryabinin/kvmsample
# make
# mkdir /sys/fs/cgroup/cpuacct/test
# echo $$ > /sys/fs/cgroup/cpuacct/test/tasks
# ./kvmsample &
# for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done
1976535645
2979839428
3979832704
4983603153
5983604157

>> diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
>> index 941c28cf9738..7eff79faab0d 100644
>> --- a/kernel/sched/cpuacct.c
>> +++ b/kernel/sched/cpuacct.c
>> @@ -29,7 +29,7 @@ struct cpuacct_usage {
>>  struct cpuacct {
>>  	struct cgroup_subsys_state	css;
>>  	/* cpuusage holds pointer to a u64-type object on every CPU */
>> -	struct cpuacct_usage __percpu	*cpuusage;
> 
> Definition of struct cpuacct_usage can go away now.
> 

Done.

>> @@ -99,7 +99,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
>>  static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
>>  				 enum cpuacct_stat_index index)
>>  {
>> -	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
>> +	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
>> +	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>>  	u64 data;
> 
> There's a BUG_ON below this that could probably be WARN_ON_ONCE while
> you're here
> 

Sure.

>> @@ -278,8 +274,8 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
>>  	for_each_possible_cpu(cpu) {
>>  		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
>>  
>> -		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
>> -		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
>> +		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
>> +		val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
> 
> unnecessary whitespace change?
> 

yup

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/5] cputime, cpuacct: Include guest time in user time in cpuacct.stat
  2021-02-17 12:00 [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Andrey Ryabinin
                   ` (3 preceding siblings ...)
  2021-03-17 22:09 ` [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Daniel Jordan
@ 2021-08-20  9:40 ` Andrey Ryabinin
  2021-08-20  9:40   ` [PATCH v2 2/5] cpuacct: convert BUG_ON() to WARN_ON_ONCE() Andrey Ryabinin
                     ` (4 more replies)
  4 siblings, 5 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2021-08-20  9:40 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, cgroups, linux-kernel,
	bharata, boris, Andrey Ryabinin, stable

cpuacct.stat in non-root cgroups shows user time without guest time
included in it. This doesn't match the user time shown in the root
cpuacct.stat and /proc/<pid>/stat. This also affects cgroup2's cpu.stat
in the same way.

Make account_guest_time() add user time to the cgroup's cpustat to
fix this.

Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Cc: <stable@vger.kernel.org>
---
Changes since v1:
   - Don't touch CPUTIME_GUEST* since they aren't used by cgroups
---
 kernel/sched/cputime.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481d5098..042a6dbce8f3 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -148,10 +148,10 @@ void account_guest_time(struct task_struct *p, u64 cputime)
 
 	/* Add guest time to cpustat. */
 	if (task_nice(p) > 0) {
-		cpustat[CPUTIME_NICE] += cputime;
+		task_group_account_field(p, CPUTIME_NICE, cputime);
 		cpustat[CPUTIME_GUEST_NICE] += cputime;
 	} else {
-		cpustat[CPUTIME_USER] += cputime;
+		task_group_account_field(p, CPUTIME_USER, cputime);
 		cpustat[CPUTIME_GUEST] += cputime;
 	}
 }
-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 2/5] cpuacct: convert BUG_ON() to WARN_ON_ONCE()
  2021-08-20  9:40 ` [PATCH v2 1/5] cputime, cpuacct: " Andrey Ryabinin
@ 2021-08-20  9:40   ` Andrey Ryabinin
  2021-08-20  9:40   ` [PATCH v2 3/5] cgroup: Fix 'usage_usec' time in root's cpu.stat Andrey Ryabinin
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2021-08-20  9:40 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, cgroups, linux-kernel,
	bharata, boris, Andrey Ryabinin

Replace the fatal BUG_ON() with the safer WARN_ON_ONCE() in cpuacct_cpuusage_read().
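As a rough userspace sketch of the pattern (read_stat() and the fprintf()
warning are hypothetical stand-ins for cpuacct_cpuusage_read() and
WARN_ON_ONCE(); the real macro also warns only once, which this sketch
does not bother to model): an out-of-range index is reported and a
harmless value returned, instead of halting the machine.

```c
#include <stdio.h>

#define NSTATS 2	/* index == NSTATS means "sum of all stats" */

static unsigned long long read_stat(const unsigned long long *stats, int index)
{
	/* Defensive check: warn and bail out instead of BUG_ON(). */
	if (index > NSTATS) {
		fprintf(stderr, "bogus index %d\n", index);
		return 0;
	}
	if (index == NSTATS)
		return stats[0] + stats[1];
	return stats[index];
}
```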

Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
---
 kernel/sched/cpuacct.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 893eece65bfd..f347cf9e4634 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -106,7 +106,8 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	 * We allow index == CPUACCT_STAT_NSTATS here to read
 	 * the sum of usages.
 	 */
-	BUG_ON(index > CPUACCT_STAT_NSTATS);
+	if (WARN_ON_ONCE(index > CPUACCT_STAT_NSTATS))
+		return 0;
 
 #ifndef CONFIG_64BIT
 	/*
-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 3/5] cgroup: Fix 'usage_usec' time in root's cpu.stat
  2021-08-20  9:40 ` [PATCH v2 1/5] cputime, cpuacct: " Andrey Ryabinin
  2021-08-20  9:40   ` [PATCH v2 2/5] cpuacct: convert BUG_ON() to WARN_ON_ONCE() Andrey Ryabinin
@ 2021-08-20  9:40   ` Andrey Ryabinin
  2021-08-20  9:40   ` [PATCH v2 4/5] sched/cpuacct: fix user/system in shown cpuacct.usage* Andrey Ryabinin
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2021-08-20  9:40 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, cgroups, linux-kernel,
	bharata, boris, Andrey Ryabinin, stable

The global CPUTIME_USER counter already includes CPUTIME_GUEST,
and CPUTIME_NICE already includes CPUTIME_GUEST_NICE.

Remove the additions of CPUTIME_GUEST[_NICE] to the total
->sum_exec_runtime so that they aren't accounted twice.
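A tiny arithmetic sketch of the bug (helper names and values here are made
up for illustration): since the user bucket already contains guest time,
summing the guest buckets on top overcounts usage_usec by exactly the
guest time.

```c
/* Before the fix: guest is added on top of user, which already holds it. */
static unsigned long long usage_buggy(unsigned long long user,
				      unsigned long long sys,
				      unsigned long long steal,
				      unsigned long long guest)
{
	return user + sys + steal + guest;	/* guest counted twice */
}

/* After the fix: guest is already inside user, so don't add it again. */
static unsigned long long usage_fixed(unsigned long long user,
				      unsigned long long sys,
				      unsigned long long steal)
{
	return user + sys + steal;
}
```

E.g. with 100 units of user time of which 30 were spent in a guest, the
buggy sum is inflated by those 30 units.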

Fixes: 936f2a70f207 ("cgroup: add cpu.stat file to root cgroup")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 Changes since v1:
  - Add review/tested tags in changelog
---
 kernel/cgroup/rstat.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index b264ab5652ba..1486768f2318 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -433,8 +433,6 @@ static void root_cgroup_cputime(struct task_cputime *cputime)
 		cputime->sum_exec_runtime += user;
 		cputime->sum_exec_runtime += sys;
 		cputime->sum_exec_runtime += cpustat[CPUTIME_STEAL];
-		cputime->sum_exec_runtime += cpustat[CPUTIME_GUEST];
-		cputime->sum_exec_runtime += cpustat[CPUTIME_GUEST_NICE];
 	}
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 4/5] sched/cpuacct: fix user/system in shown cpuacct.usage*
  2021-08-20  9:40 ` [PATCH v2 1/5] cputime, cpuacct: " Andrey Ryabinin
  2021-08-20  9:40   ` [PATCH v2 2/5] cpuacct: convert BUG_ON() to WARN_ON_ONCE() Andrey Ryabinin
  2021-08-20  9:40   ` [PATCH v2 3/5] cgroup: Fix 'usage_usec' time in root's cpu.stat Andrey Ryabinin
@ 2021-08-20  9:40   ` Andrey Ryabinin
  2021-09-10 19:03     ` Daniel Jordan
  2021-08-20  9:40   ` [PATCH v2 5/5] sched/cpuacct: Make user/system times in cpuacct.stat more precise Andrey Ryabinin
  2021-09-07 17:19   ` [PATCH v2 1/5] cputime, cpuacct: Include guest time in user time in cpuacct.stat Tejun Heo
  4 siblings, 1 reply; 17+ messages in thread
From: Andrey Ryabinin @ 2021-08-20  9:40 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, cgroups, linux-kernel,
	bharata, boris, Andrey Ryabinin, stable

cpuacct has 2 different ways of accounting and showing user
and system times.

The first one uses cpuacct_account_field() to account times
and the cpuacct.stat file to expose them. This one seems to work OK.

The second one uses the cpuacct_charge() function for accounting and
the set of cpuacct.usage* files to show times. Despite some attempts
to fix it in the past, it still doesn't work. Sometimes, while running
a KVM guest, cpuacct_charge() accounts most of the guest time as
system time. This doesn't match the user/system times shown in
cpuacct.stat or /proc/<pid>/stat.

Demonstration:
 # git clone https://github.com/aryabinin/kvmsample
 # make
 # mkdir /sys/fs/cgroup/cpuacct/test
 # echo $$ > /sys/fs/cgroup/cpuacct/test/tasks
 # ./kvmsample &
 # for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done
 1976535645
 2979839428
 3979832704
 4983603153
 5983604157

Use the cpustats accounted in cpuacct_account_field() as the source
of user/sys times for the cpuacct.usage* files. Make cpuacct_charge()
account only the total execution time.
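The new read path can be sketched in userspace C (names here are
hypothetical, loosely mirroring the switch this patch adds to
cpuacct_cpuusage_read()): user and system usage are derived from the
cpustat buckets filled by cpuacct_account_field(), so they match
cpuacct.stat, instead of relying on cpuacct_charge()'s user_mode(regs)
guess.

```c
/* Illustrative bucket indices, not the kernel's CPUTIME_* values. */
enum { C_USER, C_NICE, C_SYSTEM, C_IRQ, C_SOFTIRQ, C_NCPUSTAT };

/* usage_user: everything accounted as user-mode execution. */
static unsigned long long usage_user(const unsigned long long *cpustat)
{
	return cpustat[C_USER] + cpustat[C_NICE];
}

/* usage_sys: kernel-mode execution, including hard/soft interrupt time. */
static unsigned long long usage_sys(const unsigned long long *cpustat)
{
	return cpustat[C_SYSTEM] + cpustat[C_IRQ] + cpustat[C_SOFTIRQ];
}
```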

Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and sys_usage")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Cc: <stable@vger.kernel.org>
---
Changes since v1:
 - remove struct cpuacct_usage;
---
 kernel/sched/cpuacct.c | 79 +++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 47 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index f347cf9e4634..9de7dd51beb0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -21,15 +21,11 @@ static const char * const cpuacct_stat_desc[] = {
 	[CPUACCT_STAT_SYSTEM] = "system",
 };
 
-struct cpuacct_usage {
-	u64	usages[CPUACCT_STAT_NSTATS];
-};
-
 /* track CPU usage of a group of tasks and its child groups */
 struct cpuacct {
 	struct cgroup_subsys_state	css;
 	/* cpuusage holds pointer to a u64-type object on every CPU */
-	struct cpuacct_usage __percpu	*cpuusage;
+	u64 __percpu	*cpuusage;
 	struct kernel_cpustat __percpu	*cpustat;
 };
 
@@ -49,7 +45,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 	return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
 	.cpuusage	= &root_cpuacct_cpuusage,
@@ -68,7 +64,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
 	if (!ca)
 		goto out;
 
-	ca->cpuusage = alloc_percpu(struct cpuacct_usage);
+	ca->cpuusage = alloc_percpu(u64);
 	if (!ca->cpuusage)
 		goto out_free_ca;
 
@@ -99,7 +95,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
 static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 				 enum cpuacct_stat_index index)
 {
-	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 	u64 data;
 
 	/*
@@ -116,14 +113,17 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	raw_spin_rq_lock_irq(cpu_rq(cpu));
 #endif
 
-	if (index == CPUACCT_STAT_NSTATS) {
-		int i = 0;
-
-		data = 0;
-		for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-			data += cpuusage->usages[i];
-	} else {
-		data = cpuusage->usages[index];
+	switch (index) {
+	case CPUACCT_STAT_USER:
+		data = cpustat[CPUTIME_USER] + cpustat[CPUTIME_NICE];
+		break;
+	case CPUACCT_STAT_SYSTEM:
+		data = cpustat[CPUTIME_SYSTEM] + cpustat[CPUTIME_IRQ] +
+			cpustat[CPUTIME_SOFTIRQ];
+		break;
+	case CPUACCT_STAT_NSTATS:
+		data = *cpuusage;
+		break;
 	}
 
 #ifndef CONFIG_64BIT
@@ -133,10 +133,14 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 	return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu)
 {
-	struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-	int i;
+	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+	u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
+
+	/* Don't allow to reset global kernel_cpustat */
+	if (ca == &root_cpuacct)
+		return;
 
 #ifndef CONFIG_64BIT
 	/*
@@ -144,9 +148,10 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 	 */
 	raw_spin_rq_lock_irq(cpu_rq(cpu));
 #endif
-
-	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-		cpuusage->usages[i] = val;
+	*cpuusage = 0;
+	cpustat[CPUTIME_USER] = cpustat[CPUTIME_NICE] = 0;
+	cpustat[CPUTIME_SYSTEM] = cpustat[CPUTIME_IRQ] = 0;
+	cpustat[CPUTIME_SOFTIRQ] = 0;
 
 #ifndef CONFIG_64BIT
 	raw_spin_rq_unlock_irq(cpu_rq(cpu));
@@ -197,7 +202,7 @@ static int cpuusage_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		return -EINVAL;
 
 	for_each_possible_cpu(cpu)
-		cpuacct_cpuusage_write(ca, cpu, 0);
+		cpuacct_cpuusage_write(ca, cpu);
 
 	return 0;
 }
@@ -244,25 +249,10 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 	seq_puts(m, "\n");
 
 	for_each_possible_cpu(cpu) {
-		struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
 		seq_printf(m, "%d", cpu);
-
-		for (index = 0; index < CPUACCT_STAT_NSTATS; index++) {
-#ifndef CONFIG_64BIT
-			/*
-			 * Take rq->lock to make 64-bit read safe on 32-bit
-			 * platforms.
-			 */
-			raw_spin_rq_lock_irq(cpu_rq(cpu));
-#endif
-
-			seq_printf(m, " %llu", cpuusage->usages[index]);
-
-#ifndef CONFIG_64BIT
-			raw_spin_rq_unlock_irq(cpu_rq(cpu));
-#endif
-		}
+		for (index = 0; index < CPUACCT_STAT_NSTATS; index++)
+			seq_printf(m, " %llu",
+				   cpuacct_cpuusage_read(ca, cpu, index));
 		seq_puts(m, "\n");
 	}
 	return 0;
@@ -340,16 +330,11 @@ static struct cftype files[] = {
 void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 {
 	struct cpuacct *ca;
-	int index = CPUACCT_STAT_SYSTEM;
-	struct pt_regs *regs = get_irq_regs() ? : task_pt_regs(tsk);
-
-	if (regs && user_mode(regs))
-		index = CPUACCT_STAT_USER;
 
 	rcu_read_lock();
 
 	for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
-		__this_cpu_add(ca->cpuusage->usages[index], cputime);
+		__this_cpu_add(*ca->cpuusage, cputime);
 
 	rcu_read_unlock();
 }
-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 5/5] sched/cpuacct: Make user/system times in cpuacct.stat more precise
  2021-08-20  9:40 ` [PATCH v2 1/5] cputime, cpuacct: " Andrey Ryabinin
                     ` (2 preceding siblings ...)
  2021-08-20  9:40   ` [PATCH v2 4/5] sched/cpuacct: fix user/system in shown cpuacct.usage* Andrey Ryabinin
@ 2021-08-20  9:40   ` Andrey Ryabinin
  2021-09-07 17:19   ` [PATCH v2 1/5] cputime, cpuacct: Include guest time in user time in cpuacct.stat Tejun Heo
  4 siblings, 0 replies; 17+ messages in thread
From: Andrey Ryabinin @ 2021-08-20  9:40 UTC (permalink / raw)
  To: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Daniel Jordan, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, cgroups, linux-kernel,
	bharata, boris, Andrey Ryabinin

From: Andrey Ryabinin <arb@yandex-team.com>

cpuacct.stat shows user and system times based on raw, tick-based
counters of random precision. Use cputime_adjust() to scale these values
against the total runtime accounted by the scheduler, as we already do
for user/system times in /proc/<pid>/stat.
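A simplified userspace model of the cputime_adjust() scaling: keep the
tick-sampled user/system ratio but rescale both so they sum to the
precisely measured runtime. The real helper also clamps against the
previously reported values to keep the times monotonic, which this sketch
omits; __uint128_t is a GCC/Clang extension used here only to avoid
multiplication overflow.

```c
static void adjust(unsigned long long *utime, unsigned long long *stime,
		   unsigned long long rtime)
{
	unsigned long long total = *utime + *stime;

	if (total == 0) {
		/* No tick samples at all: attribute everything to user. */
		*utime = rtime;
		*stime = 0;
	} else {
		/* Split rtime in the sampled stime:utime proportion. */
		*stime = (unsigned long long)
			((__uint128_t)*stime * rtime / total);
		*utime = rtime - *stime;
	}
}
```

E.g. two user ticks and two system ticks against 8 units of measured
runtime come out as 4 units each, instead of whatever the raw tick
counters happened to capture.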

Signed-off-by: Andrey Ryabinin <arb@yandex-team.com>
---
 Changes since v1:
    - fix cputime.sum_exec_runtime calculation
---
 kernel/sched/cpuacct.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 9de7dd51beb0..3d06c5e4220d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -261,25 +261,30 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V)
 static int cpuacct_stats_show(struct seq_file *sf, void *v)
 {
 	struct cpuacct *ca = css_ca(seq_css(sf));
-	s64 val[CPUACCT_STAT_NSTATS];
+	struct task_cputime cputime;
+	u64 val[CPUACCT_STAT_NSTATS];
 	int cpu;
 	int stat;
 
-	memset(val, 0, sizeof(val));
+	memset(&cputime, 0, sizeof(cputime));
 	for_each_possible_cpu(cpu) {
 		u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_USER];
-		val[CPUACCT_STAT_USER]   += cpustat[CPUTIME_NICE];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
-		val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
+		cputime.utime += cpustat[CPUTIME_USER];
+		cputime.utime += cpustat[CPUTIME_NICE];
+		cputime.stime += cpustat[CPUTIME_SYSTEM];
+		cputime.stime += cpustat[CPUTIME_IRQ];
+		cputime.stime += cpustat[CPUTIME_SOFTIRQ];
+
+		cputime.sum_exec_runtime += *per_cpu_ptr(ca->cpuusage, cpu);
 	}
 
+	cputime_adjust(&cputime, &seq_css(sf)->cgroup->prev_cputime,
+		&val[CPUACCT_STAT_USER], &val[CPUACCT_STAT_SYSTEM]);
+
 	for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) {
-		seq_printf(sf, "%s %lld\n",
-			   cpuacct_stat_desc[stat],
-			   (long long)nsec_to_clock_t(val[stat]));
+		seq_printf(sf, "%s %llu\n", cpuacct_stat_desc[stat],
+			nsec_to_clock_t(val[stat]));
 	}
 
 	return 0;
-- 
2.31.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/5] cputime, cpuacct: Include guest time in user time in cpuacct.stat
  2021-08-20  9:40 ` [PATCH v2 1/5] cputime, cpuacct: " Andrey Ryabinin
                     ` (3 preceding siblings ...)
  2021-08-20  9:40   ` [PATCH v2 5/5] sched/cpuacct: Make user/system times in cpuacct.stat more precise Andrey Ryabinin
@ 2021-09-07 17:19   ` Tejun Heo
  4 siblings, 0 replies; 17+ messages in thread
From: Tejun Heo @ 2021-09-07 17:19 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Zefan Li, Johannes Weiner, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Daniel Jordan, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, cgroups, linux-kernel, bharata,
	boris, stable

On Fri, Aug 20, 2021 at 12:40:01PM +0300, Andrey Ryabinin wrote:
> cpuacct.stat in non-root cgroups shows user time without guest time
> included in it. This doesn't match the user time shown in the root
> cpuacct.stat and in /proc/<pid>/stat. This also affects cgroup2's cpu.stat
> in the same way.
> 
> Make account_guest_time() add user time to the cgroup's cpustat to
> fix this.
> 
> Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics")
> Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
> Cc: <stable@vger.kernel.org>

The fact that this has been broken for so long, prolly from the beginning,
gives me some pause but the patches looks fine to me.

For the series,

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/5] sched/cpuacct: fix user/system in shown cpuacct.usage*
  2021-08-20  9:40   ` [PATCH v2 4/5] sched/cpuacct: fix user/system in shown cpuacct.usage* Andrey Ryabinin
@ 2021-09-10 19:03     ` Daniel Jordan
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Jordan @ 2021-09-10 19:03 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Tejun Heo, Zefan Li, Johannes Weiner, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, cgroups, linux-kernel, bharata,
	boris, stable

On Fri, Aug 20, 2021 at 12:40:04PM +0300, Andrey Ryabinin wrote:
> cpuacct has 2 different ways of accounting and showing user
> and system times.
> 
> The first one uses cpuacct_account_field() to account times
> and the cpuacct.stat file to expose them. This one seems to work OK.
> 
> The second one uses the cpuacct_charge() function for accounting and
> the set of cpuacct.usage* files to show times. Despite some attempts
> to fix it in the past, it still doesn't work. Sometimes, while running
> a KVM guest, cpuacct_charge() accounts most of the guest time as
> system time. This doesn't match the user/system times shown in
> cpuacct.stat or /proc/<pid>/stat.
> 
> Demonstration:
>  # git clone https://github.com/aryabinin/kvmsample
>  # make
>  # mkdir /sys/fs/cgroup/cpuacct/test
>  # echo $$ > /sys/fs/cgroup/cpuacct/test/tasks
>  # ./kvmsample &
>  # for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done
>  1976535645
>  2979839428
>  3979832704
>  4983603153
>  5983604157

Thanks for expanding on this, and fixing broken cpuacct_charge.

For the series,
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-09-10 19:04 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
2021-02-17 12:00 [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Andrey Ryabinin
2021-02-17 12:00 ` [PATCH 2/4] cgroup: Fix 'usage_usec' time in root's cpu.stat Andrey Ryabinin
2021-03-17 22:13   ` Daniel Jordan
2021-02-17 12:00 ` [PATCH 3/4] sched/cpuacct: fix user/system in shown cpuacct.usage* Andrey Ryabinin
2021-03-17 22:22   ` Daniel Jordan
2021-08-20  9:37     ` Andrey Ryabinin
2021-02-17 12:00 ` [PATCH 4/4] sched/cpuacct: Make user/system times in cpuacct.stat more precise Andrey Ryabinin
2021-03-17 22:25   ` Daniel Jordan
2021-03-17 22:09 ` [PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat Daniel Jordan
2021-08-20  9:37   ` Andrey Ryabinin
2021-08-20  9:40 ` [PATCH v2 1/5] cputime, cpuacct: " Andrey Ryabinin
2021-08-20  9:40   ` [PATCH v2 2/5] cpuacct: convert BUG_ON() to WARN_ON_ONCE() Andrey Ryabinin
2021-08-20  9:40   ` [PATCH v2 3/5] cgroup: Fix 'usage_usec' time in root's cpu.stat Andrey Ryabinin
2021-08-20  9:40   ` [PATCH v2 4/5] sched/cpuacct: fix user/system in shown cpuacct.usage* Andrey Ryabinin
2021-09-10 19:03     ` Daniel Jordan
2021-08-20  9:40   ` [PATCH v2 5/5] sched/cpuacct: Make user/system times in cpuacct.stat more precise Andrey Ryabinin
2021-09-07 17:19   ` [PATCH v2 1/5] cputime, cpuacct: Include guest time in user time in cpuacct.stat Tejun Heo
