* [PATCH 0/9] Per-cgroup /proc/stat
@ 2011-09-14 20:04 Glauber Costa
  2011-09-14 20:04 ` [PATCH 1/9] Remove parent field in cpuacct cgroup Glauber Costa
                   ` (9 more replies)
  0 siblings, 10 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra, jbottomley

[[ For those getting this twice: I sent it previously to containers
   ml, but I guess it was out. Sending now to a broader audience anyway ]]

Hi,

This patchset is a simple initial proposal for a per-cgroup/container
display of /proc/stat. The display method is based on Daniel's idea of
exposing a file that can be bind mounted (Daniel, is that more or less
what you had in mind?)

To grab the stats themselves, I am (ab)using cpuacct cgroup. percpu counters
are dropped in favor of normal percpu pointers, so we can easily track
per-cpu quantities.

In case you guys like this idea, my TODO list would include the removal
of the show stat code in fs/proc/stat.c altogether, and the displaying
of some fields I haven't touched yet.

Also, to demonstrate one of the potential ideas for such a method, I
implemented a feature commonly found in hypervisors - steal time - on top
of it. I argue that containers can/should also display steal time when
available. Turns out that due to the fact that we run on the same kernel,
steal time is quite easy to implement once we have per-container tick
accounting in place.
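
Just to illustrate where this ends up (a simplified, untested sketch of
what patch 9 does; the helper name is made up for illustration): once each
group has a per-cpu stat array, the steal time a container sees on a given
cpu is basically the time its sibling groups got to run there:

	static u64 container_steal_on_cpu(struct cpuacct *ca, int cpu)
	{
		struct cgroup *sib;
		u64 steal = 0;

		/* whatever our siblings ran is time stolen from us */
		list_for_each_entry(sib, &ca->css.cgroup->sibling, sibling) {
			u64 *stat = per_cpu_ptr(cgroup_ca(sib)->cpustat, cpu);

			steal += stat[CPUACCT_STAT_USER] +
				 stat[CPUACCT_STAT_SYSTEM];
		}
		return steal;
	}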

Please let me know what you guys think

Glauber Costa (9):
  Remove parent field in cpuacct cgroup
  Make cpuacct fields per cpu variables
  Include nice values in cpuacct
  Include irq and softirq fields in cpuacct
  Include guest fields in cpuacct
  Include idle and iowait fields in cpuacct
  Create cpuacct.proc.stat file
  per-cgroup boot time
  Report steal time for cgroup

 kernel/sched.c |  265 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 234 insertions(+), 31 deletions(-)

-- 
1.7.6



* [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
@ 2011-09-14 20:04 ` Glauber Costa
  2011-09-19 16:03   ` Peter Zijlstra
  2011-09-14 20:04 ` [PATCH 2/9] Make cpuacct fields per cpu variables Glauber Costa
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra,
	jbottomley, Glauber Costa

The cgroup core already has parent pointers, so we don't need to define
our own.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |   18 ++++++++++++------
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..4cde3eb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9096,7 +9096,6 @@ struct cpuacct {
 	/* cpuusage holds pointer to a u64-type object on every cpu */
 	u64 __percpu *cpuusage;
 	struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
-	struct cpuacct *parent;
 };
 
 struct cgroup_subsys cpuacct_subsys;
@@ -9115,6 +9114,16 @@ static inline struct cpuacct *task_ca(struct task_struct *tsk)
 			    struct cpuacct, css);
 }
 
+static inline struct cpuacct *parent_ca(struct cpuacct *ca)
+{
+	struct cpuacct *parent = NULL;
+	struct cgroup *cgrp = ca->css.cgroup;
+	if (cgrp->parent)
+		parent = container_of(cgroup_subsys_state(cgrp->parent, cpuacct_subsys_id),
+				    struct cpuacct, css);
+	return parent;
+}
+
 /* create a new cpu accounting group */
 static struct cgroup_subsys_state *cpuacct_create(
 	struct cgroup_subsys *ss, struct cgroup *cgrp)
@@ -9133,9 +9142,6 @@ static struct cgroup_subsys_state *cpuacct_create(
 		if (percpu_counter_init(&ca->cpustat[i], 0))
 			goto out_free_counters;
 
-	if (cgrp->parent)
-		ca->parent = cgroup_ca(cgrp->parent);
-
 	return &ca->css;
 
 out_free_counters:
@@ -9302,7 +9308,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 
 	ca = task_ca(tsk);
 
-	for (; ca; ca = ca->parent) {
+	for (; ca; ca = parent_ca(ca)) {
 		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
 		*cpuusage += cputime;
 	}
@@ -9344,7 +9350,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
 
 	do {
 		__percpu_counter_add(&ca->cpustat[idx], val, batch);
-		ca = ca->parent;
+		ca = parent_ca(ca);
 	} while (ca);
 	rcu_read_unlock();
 }
-- 
1.7.6



* [PATCH 2/9] Make cpuacct fields per cpu variables
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
  2011-09-14 20:04 ` [PATCH 1/9] Remove parent field in cpuacct cgroup Glauber Costa
@ 2011-09-14 20:04 ` Glauber Costa
  2011-09-19 16:10   ` Peter Zijlstra
  2011-09-14 20:04 ` [PATCH 3/9] Include nice values in cpuacct Glauber Costa
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra,
	jbottomley, Glauber Costa

Turn them from percpu counters into plain per-cpu variables.
This will be useful in later patches.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |   34 ++++++++++++++++++----------------
 1 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4cde3eb..8fd3f8b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9095,7 +9095,7 @@ struct cpuacct {
 	struct cgroup_subsys_state css;
 	/* cpuusage holds pointer to a u64-type object on every cpu */
 	u64 __percpu *cpuusage;
-	struct percpu_counter cpustat[CPUACCT_STAT_NSTATS];
+	u64 __percpu *cpustat;
 };
 
 struct cgroup_subsys cpuacct_subsys;
@@ -9129,7 +9129,6 @@ static struct cgroup_subsys_state *cpuacct_create(
 	struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
 	struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
-	int i;
 
 	if (!ca)
 		goto out;
@@ -9138,15 +9137,13 @@ static struct cgroup_subsys_state *cpuacct_create(
 	if (!ca->cpuusage)
 		goto out_free_ca;
 
-	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-		if (percpu_counter_init(&ca->cpustat[i], 0))
-			goto out_free_counters;
+	ca->cpustat = __alloc_percpu(sizeof(u64) * CPUACCT_STAT_NSTATS, __alignof__(u64));
+	if (!ca->cpustat)
+		goto out_free_usage;
 
 	return &ca->css;
 
-out_free_counters:
-	while (--i >= 0)
-		percpu_counter_destroy(&ca->cpustat[i]);
+out_free_usage:
 	free_percpu(ca->cpuusage);
 out_free_ca:
 	kfree(ca);
@@ -9159,10 +9156,8 @@ static void
 cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
 	struct cpuacct *ca = cgroup_ca(cgrp);
-	int i;
 
-	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-		percpu_counter_destroy(&ca->cpustat[i]);
+	free_percpu(ca->cpustat);
 	free_percpu(ca->cpuusage);
 	kfree(ca);
 }
@@ -9258,11 +9253,17 @@ static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
 		struct cgroup_map_cb *cb)
 {
 	struct cpuacct *ca = cgroup_ca(cgrp);
-	int i;
+	int i, cpu;
+	u64 acc[CPUACCT_STAT_NSTATS] = { 0, };
+
+	for_each_present_cpu(cpu) {
+		u64 *vec = per_cpu_ptr(ca->cpustat, cpu);
+		for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
+			acc[i] += vec[i];
+	}
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
-		s64 val = percpu_counter_read(&ca->cpustat[i]);
-		val = cputime64_to_clock_t(val);
+		s64 val = cputime64_to_clock_t(acc[i]);
 		cb->fill(cb, cpuacct_stat_desc[i], val);
 	}
 	return 0;
@@ -9340,7 +9341,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val)
 {
 	struct cpuacct *ca;
-	int batch = CPUACCT_BATCH;
+	u64 *cpustat;
 
 	if (unlikely(!cpuacct_subsys.active))
 		return;
@@ -9349,7 +9350,8 @@ static void cpuacct_update_stats(struct task_struct *tsk,
 	ca = task_ca(tsk);
 
 	do {
-		__percpu_counter_add(&ca->cpustat[idx], val, batch);
+		cpustat = per_cpu_ptr(ca->cpustat, smp_processor_id());
+		cpustat[idx] += val;
 		ca = parent_ca(ca);
 	} while (ca);
 	rcu_read_unlock();
-- 
1.7.6



* [PATCH 3/9] Include nice values in cpuacct
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
  2011-09-14 20:04 ` [PATCH 1/9] Remove parent field in cpuacct cgroup Glauber Costa
  2011-09-14 20:04 ` [PATCH 2/9] Make cpuacct fields per cpu variables Glauber Costa
@ 2011-09-14 20:04 ` Glauber Costa
  2011-09-19 16:19   ` Peter Zijlstra
  2011-09-14 20:04 ` [PATCH 4/9] Include irq and softirq fields " Glauber Costa
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra,
	jbottomley, Glauber Costa

Besides user and system.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8fd3f8b..93aa666 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1446,6 +1446,7 @@ static const u32 prio_to_wmult[40] = {
 /* Time spent by the tasks of the cpu accounting group executing in ... */
 enum cpuacct_stat_index {
 	CPUACCT_STAT_USER,	/* ... user mode */
+	CPUACCT_STAT_NICE,	/* ... user nice */
 	CPUACCT_STAT_SYSTEM,	/* ... kernel mode */
 
 	CPUACCT_STAT_NSTATS,
@@ -3758,6 +3759,7 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 		       cputime_t cputime_scaled)
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	int stat = CPUACCT_STAT_USER;
 	cputime64_t tmp;
 
 	/* Add user time to process. */
@@ -3767,12 +3769,13 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
-	if (TASK_NICE(p) > 0)
+	if (TASK_NICE(p) > 0) {
 		cpustat->nice = cputime64_add(cpustat->nice, tmp);
-	else
+		stat = CPUACCT_STAT_NICE;
+	} else
 		cpustat->user = cputime64_add(cpustat->user, tmp);
 
-	cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
+	cpuacct_update_stats(p, stat, cputime);
 	/* Account for user time used */
 	acct_update_integrals(p);
 }
-- 
1.7.6



* [PATCH 4/9] Include irq and softirq fields in cpuacct
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
                   ` (2 preceding siblings ...)
  2011-09-14 20:04 ` [PATCH 3/9] Include nice values in cpuacct Glauber Costa
@ 2011-09-14 20:04 ` Glauber Costa
  2011-09-19 18:38   ` Peter Zijlstra
  2011-09-14 20:04 ` [PATCH 5/9] Include guest " Glauber Costa
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra,
	jbottomley, Glauber Costa

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |   26 ++++++++++++++++++--------
 1 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 93aa666..a36d995 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1448,6 +1448,8 @@ enum cpuacct_stat_index {
 	CPUACCT_STAT_USER,	/* ... user mode */
 	CPUACCT_STAT_NICE,	/* ... user nice */
 	CPUACCT_STAT_SYSTEM,	/* ... kernel mode */
+	CPUACCT_STAT_IRQ,	/* ... irq ticks */
+	CPUACCT_STAT_SOFTIRQ,	/* ... softirq ticks */
 
 	CPUACCT_STAT_NSTATS,
 };
@@ -3819,7 +3821,8 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
  */
 static inline
 void __account_system_time(struct task_struct *p, cputime_t cputime,
-			cputime_t cputime_scaled, cputime64_t *target_cputime64)
+			   cputime_t cputime_scaled,
+			   cputime64_t *target_cputime64, int stat)
 {
 	cputime64_t tmp = cputime_to_cputime64(cputime);
 
@@ -3830,10 +3833,10 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add system time to cpustat. */
 	*target_cputime64 = cputime64_add(*target_cputime64, tmp);
-	cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
 
 	/* Account for system time used */
 	acct_update_integrals(p);
+	cpuacct_update_stats(p, stat, cputime);
 }
 
 /*
@@ -3848,20 +3851,23 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 {
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
 	cputime64_t *target_cputime64;
+	int stat = CPUACCT_STAT_SYSTEM;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
 		return;
 	}
 
-	if (hardirq_count() - hardirq_offset)
+	if (hardirq_count() - hardirq_offset) {
 		target_cputime64 = &cpustat->irq;
-	else if (in_serving_softirq())
+		stat = CPUACCT_STAT_IRQ;
+	} else if (in_serving_softirq()) {
 		target_cputime64 = &cpustat->softirq;
-	else
+		stat = CPUACCT_STAT_SOFTIRQ;
+	} else
 		target_cputime64 = &cpustat->system;
 
-	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
+	__account_system_time(p, cputime, cputime_scaled, target_cputime64, stat);
 }
 
 /*
@@ -3941,22 +3947,26 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
 	cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	int stat = CPUACCT_STAT_SYSTEM;
 
 	if (steal_account_process_tick())
 		return;
 
 	if (irqtime_account_hi_update()) {
 		cpustat->irq = cputime64_add(cpustat->irq, tmp);
+		cpuacct_update_stats(p, CPUACCT_STAT_IRQ, tmp);
 	} else if (irqtime_account_si_update()) {
 		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+		cpuacct_update_stats(p, CPUACCT_STAT_SOFTIRQ, tmp);
 	} else if (this_cpu_ksoftirqd() == p) {
+		stat = CPUACCT_STAT_SOFTIRQ;
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
 		 * Also, p->stime needs to be updated for ksoftirqd.
 		 */
 		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat->softirq);
+					&cpustat->softirq, stat);
 	} else if (user_tick) {
 		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else if (p == rq->idle) {
@@ -3965,7 +3975,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 		account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else {
 		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat->system);
+					&cpustat->system, stat);
 	}
 }
 
-- 
1.7.6



* [PATCH 5/9] Include guest fields in cpuacct
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
                   ` (3 preceding siblings ...)
  2011-09-14 20:04 ` [PATCH 4/9] Include irq and softirq fields " Glauber Costa
@ 2011-09-14 20:04 ` Glauber Costa
  2011-09-14 20:04 ` [PATCH 6/9] Include idle and iowait " Glauber Costa
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra,
	jbottomley, Glauber Costa

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index a36d995..f0309a0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1450,6 +1450,8 @@ enum cpuacct_stat_index {
 	CPUACCT_STAT_SYSTEM,	/* ... kernel mode */
 	CPUACCT_STAT_IRQ,	/* ... irq ticks */
 	CPUACCT_STAT_SOFTIRQ,	/* ... softirq ticks */
+	CPUACCT_STAT_GUEST,	/* ... guest mode */
+	CPUACCT_STAT_GUEST_NICE,	/* ... guest nice */
 
 	CPUACCT_STAT_NSTATS,
 };
@@ -3793,6 +3795,8 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 {
 	cputime64_t tmp;
 	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	int stat = CPUACCT_STAT_NICE;
+	int statnice = CPUACCT_STAT_GUEST_NICE;
 
 	tmp = cputime_to_cputime64(cputime);
 
@@ -3809,7 +3813,12 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 	} else {
 		cpustat->user = cputime64_add(cpustat->user, tmp);
 		cpustat->guest = cputime64_add(cpustat->guest, tmp);
+		stat = CPUACCT_STAT_USER;
+		statnice = CPUACCT_STAT_GUEST;
 	}
+
+	cpuacct_update_stats(p, stat, tmp);
+	cpuacct_update_stats(p, statnice, tmp);
 }
 
 /*
-- 
1.7.6



* [PATCH 6/9] Include idle and iowait fields in cpuacct
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
                   ` (4 preceding siblings ...)
  2011-09-14 20:04 ` [PATCH 5/9] Include guest " Glauber Costa
@ 2011-09-14 20:04 ` Glauber Costa
  2011-09-20  9:21   ` Peter Zijlstra
  2011-09-14 20:04 ` [PATCH 7/9] Create cpuacct.proc.stat file Glauber Costa
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra,
	jbottomley, Glauber Costa

These two are slightly different from the others, though.
(Note to reviewers: might it be better to put them in a separate
array?)

Since idle/iowait are a property of the system - by definition,
no process from any cgroup is running when the system is idle -
they are system wide. So what these fields really hold are baselines
taken when the cgroup was created. This allows the cgroup to start
counting idle/iowait from 0.
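
For illustration, the read side in patch 7 then reports the per-cgroup
values roughly like this (sketch only, following cpuacct_proc_stat()):

	/* system-wide value minus the baseline taken at cgroup creation */
	idle = cputime64_sub(kstat_cpu(i).cpustat.idle,
			     cpustat[CPUACCT_STAT_IDLE]);
	iowait = cputime64_sub(kstat_cpu(i).cpustat.iowait,
			       cpustat[CPUACCT_STAT_IOWAIT]);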

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index f0309a0..6c953f5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1452,6 +1452,8 @@ enum cpuacct_stat_index {
 	CPUACCT_STAT_SOFTIRQ,	/* ... softirq ticks */
 	CPUACCT_STAT_GUEST,	/* ... guest mode */
 	CPUACCT_STAT_GUEST_NICE,	/* ... guest nice */
+	CPUACCT_STAT_IDLE,	/* ... idle base ticks when created */
+	CPUACCT_STAT_IOWAIT,	/* ... iowait when created */
 
 	CPUACCT_STAT_NSTATS,
 };
@@ -9151,6 +9153,8 @@ static struct cgroup_subsys_state *cpuacct_create(
 	struct cgroup_subsys *ss, struct cgroup *cgrp)
 {
 	struct cpuacct *ca = kzalloc(sizeof(*ca), GFP_KERNEL);
+	u64 *acct;
+	int i;
 
 	if (!ca)
 		goto out;
@@ -9163,6 +9167,11 @@ static struct cgroup_subsys_state *cpuacct_create(
 	if (!ca->cpustat)
 		goto out_free_usage;
 
+	for_each_possible_cpu(i) {
+		acct = per_cpu_ptr(ca->cpustat, i);
+		acct[CPUACCT_STAT_IDLE] = kstat_cpu(i).cpustat.idle;
+		acct[CPUACCT_STAT_IOWAIT] = kstat_cpu(i).cpustat.iowait;
+	}
 	return &ca->css;
 
 out_free_usage:
-- 
1.7.6



* [PATCH 7/9] Create cpuacct.proc.stat file
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
                   ` (5 preceding siblings ...)
  2011-09-14 20:04 ` [PATCH 6/9] Include idle and iowait " Glauber Costa
@ 2011-09-14 20:04 ` Glauber Costa
  2011-09-20  9:22   ` Peter Zijlstra
  2011-09-14 20:04 ` [PATCH 8/9] per-cgroup boot time Glauber Costa
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra,
	jbottomley, Glauber Costa

When doing containers, this file represents how /proc/stat should look
from inside the container. It shows user, system, etc., from the
perspective of the current cgroup.

In this submission, not all fields are converted yet.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |  128 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 128 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6c953f5..4611c54 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9300,6 +9300,129 @@ static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+#ifndef arch_idle_time
+#define arch_idle_time(cpu) 0
+#endif
+
+static int cpuacct_proc_stat(struct cgroup *cgrp, struct cftype *cft,
+		struct seq_file *p)
+{
+	int i, j;
+	unsigned long jif;
+	cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
+	cputime64_t guest, guest_nice;
+	u64 sum = 0;
+	u64 sum_softirq = 0;
+	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
+	struct timespec boottime;
+	struct cpuacct *ca = cgroup_ca(cgrp);
+	u64 *cpustat;
+
+	user = nice = system = idle = iowait =
+		irq = softirq = steal = cputime64_zero;
+	guest = guest_nice = cputime64_zero;
+	getboottime(&boottime);
+	jif = boottime.tv_sec;
+
+	for_each_possible_cpu(i) {
+		cpustat = per_cpu_ptr(ca->cpustat, i);
+		user = cputime64_add(user, cpustat[CPUACCT_STAT_USER]);
+		nice = cputime64_add(nice, cpustat[CPUACCT_STAT_NICE]);
+		system = cputime64_add(system, cpustat[CPUACCT_STAT_SYSTEM]);
+		idle = cputime64_add(idle, kstat_cpu(i).cpustat.idle);
+		idle = cputime64_sub(idle, cpustat[CPUACCT_STAT_IDLE]);
+		idle = cputime64_add(idle, arch_idle_time(i));
+		iowait = cputime64_add(iowait, kstat_cpu(i).cpustat.iowait);
+		iowait = cputime64_sub(iowait, cpustat[CPUACCT_STAT_IOWAIT]);
+		irq = cputime64_add(irq, cpustat[CPUACCT_STAT_IRQ]);
+		softirq = cputime64_add(softirq, cpustat[CPUACCT_STAT_SOFTIRQ]);
+		steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
+		guest = cputime64_add(guest, cpustat[CPUACCT_STAT_GUEST]);
+		guest_nice = cputime64_add(guest_nice,
+			cpustat[CPUACCT_STAT_GUEST_NICE]);
+		sum += kstat_cpu_irqs_sum(i);
+		sum += arch_irq_stat_cpu(i);
+
+		for (j = 0; j < NR_SOFTIRQS; j++) {
+			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
+
+			per_softirq_sums[j] += softirq_stat;
+			sum_softirq += softirq_stat;
+		}
+	}
+	sum += arch_irq_stat();
+
+	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu %llu "
+		"%llu\n",
+		(unsigned long long)cputime64_to_clock_t(user),
+		(unsigned long long)cputime64_to_clock_t(nice),
+		(unsigned long long)cputime64_to_clock_t(system),
+		(unsigned long long)cputime64_to_clock_t(idle),
+		(unsigned long long)cputime64_to_clock_t(iowait),
+		(unsigned long long)cputime64_to_clock_t(irq),
+		(unsigned long long)cputime64_to_clock_t(softirq),
+		(unsigned long long)cputime64_to_clock_t(steal),
+		(unsigned long long)cputime64_to_clock_t(guest),
+		(unsigned long long)cputime64_to_clock_t(guest_nice));
+	for_each_online_cpu(i) {
+		cpustat = per_cpu_ptr(ca->cpustat, i);
+		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
+		user = cpustat[CPUACCT_STAT_USER];
+		nice = cpustat[CPUACCT_STAT_NICE];
+		system = cpustat[CPUACCT_STAT_SYSTEM];
+		idle = kstat_cpu(i).cpustat.idle;
+		idle = cputime64_sub(idle, cpustat[CPUACCT_STAT_IDLE]);
+		idle = cputime64_add(idle, arch_idle_time(i));
+		iowait = kstat_cpu(i).cpustat.iowait;
+		iowait = cputime64_sub(iowait, cpustat[CPUACCT_STAT_IOWAIT]);
+		irq = cpustat[CPUACCT_STAT_IRQ];
+		softirq = cpustat[CPUACCT_STAT_SOFTIRQ];
+		steal = kstat_cpu(i).cpustat.steal;
+		guest = cpustat[CPUACCT_STAT_GUEST];
+		guest_nice = cpustat[CPUACCT_STAT_GUEST_NICE];
+		seq_printf(p,
+			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
+			"%llu\n",
+			i,
+			(unsigned long long)cputime64_to_clock_t(user),
+			(unsigned long long)cputime64_to_clock_t(nice),
+			(unsigned long long)cputime64_to_clock_t(system),
+			(unsigned long long)cputime64_to_clock_t(idle),
+			(unsigned long long)cputime64_to_clock_t(iowait),
+			(unsigned long long)cputime64_to_clock_t(irq),
+			(unsigned long long)cputime64_to_clock_t(softirq),
+			(unsigned long long)cputime64_to_clock_t(steal),
+			(unsigned long long)cputime64_to_clock_t(guest),
+			(unsigned long long)cputime64_to_clock_t(guest_nice));
+	}
+	seq_printf(p, "intr %llu", (unsigned long long)sum);
+
+	/* sum again ? it could be updated? */
+	for_each_irq_nr(j)
+		seq_printf(p, " %u", kstat_irqs(j));
+
+	seq_printf(p,
+		"\nctxt %llu\n"
+		"btime %lu\n"
+		"processes %lu\n"
+		"procs_running %lu\n"
+		"procs_blocked %lu\n",
+		nr_context_switches(),
+		(unsigned long)jif,
+		total_forks,
+		nr_running(),
+		nr_iowait());
+
+	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
+
+	for (i = 0; i < NR_SOFTIRQS; i++)
+		seq_printf(p, " %u", per_softirq_sums[i]);
+	seq_putc(p, '\n');
+
+	return 0;
+
+}
+
 static struct cftype files[] = {
 	{
 		.name = "usage",
@@ -9314,6 +9437,11 @@ static struct cftype files[] = {
 		.name = "stat",
 		.read_map = cpuacct_stats_show,
 	},
+	{
+		.name = "proc.stat",
+		.read_seq_string = cpuacct_proc_stat,
+	},
+
 };
 
 static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
-- 
1.7.6



* [PATCH 8/9] per-cgroup boot time
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
                   ` (6 preceding siblings ...)
  2011-09-14 20:04 ` [PATCH 7/9] Create cpuacct.proc.stat file Glauber Costa
@ 2011-09-14 20:04 ` Glauber Costa
  2011-09-20  9:25   ` Peter Zijlstra
  2011-09-14 20:04 ` [PATCH 9/9] Report steal time for cgroup Glauber Costa
  2011-09-14 20:13 ` [PATCH 0/9] Per-cgroup /proc/stat Peter Zijlstra
  9 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra,
	jbottomley, Glauber Costa

Record the time at which the cgroup was created. This can be
used to provide more accurate boottime information in
cpuacct.proc.stat.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4611c54..8f254d0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9120,6 +9120,7 @@ struct cpuacct {
 	/* cpuusage holds pointer to a u64-type object on every cpu */
 	u64 __percpu *cpuusage;
 	u64 __percpu *cpustat;
+	struct timespec start_time;
 };
 
 struct cgroup_subsys cpuacct_subsys;
@@ -9172,6 +9173,8 @@ static struct cgroup_subsys_state *cpuacct_create(
 		acct[CPUACCT_STAT_IDLE] = kstat_cpu(i).cpustat.idle;
 		acct[CPUACCT_STAT_IOWAIT] = kstat_cpu(i).cpustat.iowait;
 	}
+
+	get_monotonic_boottime(&ca->start_time);
 	return &ca->css;
 
 out_free_usage:
@@ -9316,13 +9319,16 @@ static int cpuacct_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
 	struct cpuacct *ca = cgroup_ca(cgrp);
+	struct timespec ts;
 	u64 *cpustat;
 
 	user = nice = system = idle = iowait =
 		irq = softirq = steal = cputime64_zero;
 	guest = guest_nice = cputime64_zero;
 	getboottime(&boottime);
-	jif = boottime.tv_sec;
+	getboottime(&boottime);
+	ts = timespec_add(boottime, ca->start_time);
+	jif = ts.tv_sec;
 
 	for_each_possible_cpu(i) {
 		cpustat = per_cpu_ptr(ca->cpustat, i);
-- 
1.7.6



* [PATCH 9/9] Report steal time for cgroup
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
                   ` (7 preceding siblings ...)
  2011-09-14 20:04 ` [PATCH 8/9] per-cgroup boot time Glauber Costa
@ 2011-09-14 20:04 ` Glauber Costa
  2011-09-20  9:29   ` Peter Zijlstra
  2011-09-14 20:13 ` [PATCH 0/9] Per-cgroup /proc/stat Peter Zijlstra
  9 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: xemul, paul, lizf, daniel.lezcano, mingo, a.p.zijlstra,
	jbottomley, Glauber Costa

This patch introduces functionality commonly found in
hypervisors: steal time.

For those not particularly familiar with it, steal time
is defined as any time in which a virtual machine (or container)
wanted to perform cpu work, but could not due to another
VM/container being scheduled in its place. Note that idle
time is never defined as steal time.

Assuming each container will live in its own cgroup, we can
very easily and nicely calculate steal time as the sum of the
user/system time recorded in our sibling cgroups.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |   30 ++++++++++++++++++++++++++++++
 1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8f254d0..2dad1e0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9321,6 +9321,7 @@ static int cpuacct_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	struct cpuacct *ca = cgroup_ca(cgrp);
 	struct timespec ts;
 	u64 *cpustat;
+	struct cgroup *sib;
 
 	user = nice = system = idle = iowait =
 		irq = softirq = steal = cputime64_zero;
@@ -9343,6 +9344,20 @@ static int cpuacct_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		irq = cputime64_add(irq, cpustat[CPUACCT_STAT_IRQ]);
 		softirq = cputime64_add(softirq, cpustat[CPUACCT_STAT_SOFTIRQ]);
 		steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
+		rcu_read_lock();
+		list_for_each_entry(sib, &ca->css.cgroup->sibling, sibling) {
+			u64 *cpustat_sib;
+			struct cpuacct *ca_sib = cgroup_ca(sib);
+			if (!ca_sib)
+				continue;
+
+			cpustat_sib = per_cpu_ptr(ca_sib->cpustat, i);
+			steal = cputime64_add(steal, cpustat_sib[CPUACCT_STAT_USER]);
+			steal = cputime64_add(steal, cpustat_sib[CPUACCT_STAT_SYSTEM]);
+			steal = cputime64_add(steal, cpustat_sib[CPUACCT_STAT_IRQ]);
+			steal = cputime64_add(steal, cpustat_sib[CPUACCT_STAT_SOFTIRQ]);
+		}
+		rcu_read_unlock();
 		guest = cputime64_add(guest, cpustat[CPUACCT_STAT_GUEST]);
 		guest_nice = cputime64_add(guest_nice,
 			cpustat[CPUACCT_STAT_GUEST_NICE]);
@@ -9384,6 +9399,21 @@ static int cpuacct_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		irq = cpustat[CPUACCT_STAT_IRQ];
 		softirq = cpustat[CPUACCT_STAT_SOFTIRQ];
 		steal = kstat_cpu(i).cpustat.steal;
+		rcu_read_lock();
+		list_for_each_entry(sib, &ca->css.cgroup->sibling, sibling) {
+			u64 *cpustat_sib;
+			struct cpuacct *ca_sib = cgroup_ca(sib);
+
+			if (!ca_sib)
+				continue;
+			cpustat_sib = per_cpu_ptr(ca_sib->cpustat, i);
+			steal = cputime64_add(steal, cpustat_sib[CPUACCT_STAT_USER]);
+			steal = cputime64_add(steal, cpustat_sib[CPUACCT_STAT_SYSTEM]);
+			steal = cputime64_add(steal, cpustat_sib[CPUACCT_STAT_IRQ]);
+			steal = cputime64_add(steal, cpustat_sib[CPUACCT_STAT_SOFTIRQ]);
+		}
+		rcu_read_unlock();
+
 		guest = cpustat[CPUACCT_STAT_GUEST];
 		guest_nice = cpustat[CPUACCT_STAT_GUEST_NICE];
 		seq_printf(p,
-- 
1.7.6



* Re: [PATCH 0/9] Per-cgroup /proc/stat
  2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
                   ` (8 preceding siblings ...)
  2011-09-14 20:04 ` [PATCH 9/9] Report steal time for cgroup Glauber Costa
@ 2011-09-14 20:13 ` Peter Zijlstra
  2011-09-14 20:20   ` Glauber Costa
  2011-09-14 20:23   ` Andi Kleen
  9 siblings, 2 replies; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-14 20:13 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> [[ For those getting this twice: I sent it previously to containers
>    ml, but I guess it was out. Sending now to a broader audience anyway ]]
> 
> Hi,
> 
> This patchset is a simple initial proposal for a per-cgroup/container
> display of /proc/stat. The display method is based on Daniel's idea of
> exposing a file that can be bind mounted (Daniel, is that more or less
> what you had in mind?)
> 
> To grab the stats themselves, I am (ab)using cpuacct cgroup. percpu counters
> are dropped in favor of normal percpu pointers, so we can easily track
> per-cpu quantities.
> 
> In case you guys like this idea, my TODO list would include the removal
> of the show stat code in fs/proc/stat.c altogether, and the displaying
> of some fields I haven't touched yet.
> 
> Also, to demonstrate one of the potential ideas for such a method, I
> implemented a feature commonly found in hypervisors - steal time - on top
> of it. I argue that containers can/should also display steal time when
> available. Turns out that due to the fact that we run on the same kernel,
> steal time is quite easy to implement once we have per-container tick
> accounting in place.
> 
> Please let me know what you guys think
> 
> Glauber Costa (9):
>   Remove parent field in cpuacct cgroup
>   Make cpuacct fields per cpu variables
>   Include nice values in cpuacct
>   Include irq and softirq fields in cpuacct
>   Include guest fields in cpuacct
>   Include idle and iowait fields in cpuacct
>   Create cpuacct.proc.stat file
>   per-cgroup boot time
>   Report steal time for cgroup
> 
>  kernel/sched.c |  265 +++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 files changed, 234 insertions(+), 31 deletions(-)

I hate it already.. it just smells of more senseless accounting
overhead.

Guys we should seriously trim back a lot of that code, not grow ever
more and more. The sad fact is that if you build a kernel with
cpu-cgroup support the context switch cost is more than double that of a
kernel without, and then you haven't even started creating cgroups yet.

Also, how doesn't all this duplicate part of cpuacct-cgroup?

/me won't actually look at the patches for a little while longer.


* Re: [PATCH 0/9] Per-cgroup /proc/stat
  2011-09-14 20:13 ` [PATCH 0/9] Per-cgroup /proc/stat Peter Zijlstra
@ 2011-09-14 20:20   ` Glauber Costa
  2011-09-15  8:53     ` Peter Zijlstra
  2011-09-14 20:23   ` Andi Kleen
  1 sibling, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-14 20:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/14/2011 05:13 PM, Peter Zijlstra wrote:
> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>> [[ For those getting this twice: I sent it previously to containers
>>     ml, but I guess it was out. Sending now to a broader audience anyway ]]
>>
>> Hi,
>>
>> This patchset is a simple initial proposal for a per-cgroup/container
>> display of /proc/stat. The display method is based on Daniel's idea of
>> exposing a file that can be bind mounted (Daniel, is that more or less
>> what you had in mind?)
>>
>> To grab the stats themselves, I am (ab)using cpuacct cgroup. percpu counters
>> are dropped in favor of normal percpu pointers, so we can easily track
>> per-cpu quantities.
>>
>> In case you guys like this idea, my TODO list would include the removal
>> of the show stat code in fs/proc/stat.c altogether, and the displaying
>> of some fields I haven't touched yet.
>>
>> Also, to demonstrate one of the potential ideas for such a method, I
>> implemented a feature commonly found in hypervisors - steal time - on top
>> of it. I argue that containers can/should also display steal time when
>> available. Turns out that due to the fact that we run on the same kernel,
>> steal time is quite easy to implement once we have per-container tick
>> accounting in place.
>>
>> Please let me know what you guys think
>>
>> Glauber Costa (9):
>>    Remove parent field in cpuacct cgroup
>>    Make cpuacct fields per cpu variables
>>    Include nice values in cpuacct
>>    Include irq and softirq fields in cpuacct
>>    Include guest fields in cpuacct
>>    Include idle and iowait fields in cpuacct
>>    Create cpuacct.proc.stat file
>>    per-cgroup boot time
>>    Report steal time for cgroup
>>
>>   kernel/sched.c |  265 +++++++++++++++++++++++++++++++++++++++++++++++++-------
>>   1 files changed, 234 insertions(+), 31 deletions(-)
>
> I hate it already.. it just smells of more senseless accounting
> overhead.
>
> Guys we should seriously trim back a lot of that code, not grow ever
> more and more. The sad fact is that if you build a kernel with
> cpu-cgroup support the context switch cost is more than double that of a
> kernel without, and then you haven't even started creating cgroups yet.
>
> Also, how doesn't all this duplicate part of cpuacct-cgroup?
>
> /me won't actually look at the patches for a little while longer.
Hey Peter,

Answering just a single point here, if you look closely, it does not 
duplicate  anything from cpuacct. What it does, is to divide it in more
fine grained groups than just user/system. But it is not even called 
more than it already used to be. Also, I change the counters to per-cpu 
variables instead of percpu counters (so we can access per-cpu data). If 
there is any perf. change wrt the current code, it comes from that, and 
since percpu variables are cheaper to update (and summing up is much 
less frequent), it will end up even cheaper.
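
Concretely, the update in the charge path goes from a batched (and, past
the batch, lock-taking) percpu_counter operation to a plain per-cpu add;
compare (from patch 2):

	/* before */
	__percpu_counter_add(&ca->cpustat[idx], val, CPUACCT_BATCH);

	/* after */
	cpustat = per_cpu_ptr(ca->cpustat, smp_processor_id());
	cpustat[idx] += val;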

The steal time feature is really trivial once it is in place.

About your point of the context switch cost, how would you feel if we 
optimized it out using static_branch() like it was done for kvm steal time?
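
Something along these lines (rough, untested sketch, API names from memory):

	static struct jump_label_key cpuacct_cgroup_active;

	/* in cpuacct_create(), for any non-root group: */
	if (cgrp->parent)
		jump_label_inc(&cpuacct_cgroup_active);
	/* (with the matching jump_label_dec() in cpuacct_destroy()) */

	/* at the top of cpuacct_charge() / cpuacct_update_stats(): */
	if (!static_branch(&cpuacct_cgroup_active))
		return;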

I can also commit to taking a look at making the overall performance 
suck less here, but it is really orthogonal to what I just posted.



* Re: [PATCH 0/9] Per-cgroup /proc/stat
  2011-09-14 20:13 ` [PATCH 0/9] Per-cgroup /proc/stat Peter Zijlstra
  2011-09-14 20:20   ` Glauber Costa
@ 2011-09-14 20:23   ` Andi Kleen
  2011-09-15  8:56     ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2011-09-14 20:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, linux-kernel, xemul, paul, lizf, daniel.lezcano,
	mingo, jbottomley

Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
>
> Guys we should seriously trim back a lot of that code, not grow ever
> more and more. The sad fact is that if you build a kernel with
> cpu-cgroup support the context switch cost is more than double that of a
> kernel without, and then you haven't even started creating cgroups yet.

That sounds indeed quite bad. Is it known why it is so costly?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH 0/9] Per-cgroup /proc/stat
  2011-09-14 20:20   ` Glauber Costa
@ 2011-09-15  8:53     ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-15  8:53 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:20 -0300, Glauber Costa wrote:


> Answering just a single point here, if you look closely, it does not 
> duplicate  anything from cpuacct. What it does, is to divide it in more
> fine grained groups than just user/system. But it is not even called 
> more than it already used to be. Also, I change the counters to per-cpu 
> variables instead of percpu counters (so we can access per-cpu data). If 
> there is any perf. change wrt the current code, it comes from that, and 
> since percpu variables are cheaper to update (and summing up is much 
> less frequent), it will end up even cheaper.

I just saw the word statistics and 234 lines added to sched.c and went
mental ;-)

> The steal time feature is really trivial once it is in place.
> 
> About your point of the context switch cost, how would you feel if we 
> optimized it out using static_branch() like it was done for kvm steal time?

Both pjt and I tried, but it's hard; part of the problem is also data
structure bloat.


* Re: [PATCH 0/9] Per-cgroup /proc/stat
  2011-09-14 20:23   ` Andi Kleen
@ 2011-09-15  8:56     ` Peter Zijlstra
  2011-09-19 23:07       ` Paul Turner
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-15  8:56 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Glauber Costa, linux-kernel, xemul, paul, lizf, daniel.lezcano,
	mingo, jbottomley, Paul Turner

On Wed, 2011-09-14 at 13:23 -0700, Andi Kleen wrote:
> Peter Zijlstra <a.p.zijlstra@chello.nl> writes:
> >
> > Guys we should seriously trim back a lot of that code, not grow ever
> > more and more. The sad fact is that if you build a kernel with
> > cpu-cgroup support the context switch cost is more than double that of a
> > kernel without, and then you haven't even started creating cgroups yet.
> 
> That sounds indeed quite bad. Is it known why it is so costly?

Mostly because all data structures grow and all code paths grow, some by
quite a bit; it's spread all over the place, lots of little cuts etc..

pjt and I tried trimming some of the code paths with static_branch() but
didn't really get anywhere.. need to get back to looking at this stuff
sometime soon.


* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-14 20:04 ` [PATCH 1/9] Remove parent field in cpuacct cgroup Glauber Costa
@ 2011-09-19 16:03   ` Peter Zijlstra
  2011-09-19 16:09     ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 16:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> +       for (; ca; ca = parent_ca(ca)) {

It might be good to check that the loop condition and null condition in
the parent_ca() function get folded. Otherwise there's a double branch
in that loop.

Note that this function is one of the reasons I dislike cpuacct, it adds
a second cgroup hierarchy traversal to every context switch.

Cache-miss heaven all this.


* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 16:03   ` Peter Zijlstra
@ 2011-09-19 16:09     ` Glauber Costa
  2011-09-19 16:19       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-19 16:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/19/2011 01:03 PM, Peter Zijlstra wrote:
> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>> +       for (; ca; ca = parent_ca(ca)) {
>
> It might be good to check that the loop condition and null condition in
> the parent_ca() function get folded. Otherwise there's a double branch
> in that loop.
>
> Note that this function is one of the reasons I dislike cpuacct, it adds
> a second cgroup hierarchy traversal to every context switch.
>
Well, it is not that hard to optimize this.

Those values are always updated, but they don't really need to, unless 
they are read.

So what we can do, is introduce a marker in the cgroup, representing the 
last read value. Parent is untouched. We then update parent when 1) 
reading this value, 2) cgroup destroy, 3) cpu hotplug. (humm, and maybe 
we don't even need to do it in cpu hotplug, since the per-cpu variables 
will still be accessible... )

How about it ?
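
Roughly something like this (field and helper names made up, untested):

	/*
	 * Each group remembers how much it has already folded into its
	 * parent; the fold happens on read/destroy instead of every tick.
	 */
	static void cpuacct_fold_into_parent(struct cpuacct *ca)
	{
		struct cpuacct *parent = parent_ca(ca);
		int i;

		if (!parent)
			return;

		for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
			/* cpuacct_sum_percpu() would just add up the per-cpu
			 * values, the way cpuacct_stats_show() already does */
			u64 total = cpuacct_sum_percpu(ca, i);
			u64 delta = total - ca->last_folded[i];

			ca->last_folded[i] = total;
			parent->from_children[i] += delta;
		}
	}

The charge path would then only ever touch the local group, and a read
would fold the children in first.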


* Re: [PATCH 2/9] Make cpuacct fields per cpu variables
  2011-09-14 20:04 ` [PATCH 2/9] Make cpuacct fields per cpu variables Glauber Costa
@ 2011-09-19 16:10   ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 16:10 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> @@ -9258,11 +9253,17 @@ static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
>                 struct cgroup_map_cb *cb)
>  {
>         struct cpuacct *ca = cgroup_ca(cgrp);
> -       int i;
> +       int i, cpu;
> +       u64 acc[CPUACCT_STAT_NSTATS] = { 0, };
> +
> +       for_each_present_cpu(cpu) {
> +               u64 *vec = per_cpu_ptr(ca->cpustat, cpu);
> +               for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
> +                       acc[i] += vec[i];
> +       }
>  
>         for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> -               s64 val = percpu_counter_read(&ca->cpustat[i]);
> -               val = cputime64_to_clock_t(val);
> +               s64 val = cputime64_to_clock_t(acc[i]);
>                 cb->fill(cb, cpuacct_stat_desc[i], val);
>         }
>         return 0; 

The changelog doesn't mention this function is a lot more expensive now.

Someone might have daft software polling this frequently.. at least it's
still fully preemptible so it's not much of a problem, but it is
something to be aware of.


* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 16:09     ` Glauber Costa
@ 2011-09-19 16:19       ` Peter Zijlstra
  2011-09-19 16:30         ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 16:19 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Mon, 2011-09-19 at 13:09 -0300, Glauber Costa wrote:
> On 09/19/2011 01:03 PM, Peter Zijlstra wrote:
> > On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> >> +       for (; ca; ca = parent_ca(ca)) {
> >
> > It might be good to check that the loop condition and null condition in
> > the parent_ca() function get folded. Otherwise there's a double branch
> > in that loop.
> >
> > Note that this function is one of the reasons I dislike cpuacct, it adds
> > a second cgroup hierarchy traversal to every context switch.
> >
> Well, it is not that hard to optimize this.
> 
> Those values are always updated, but they don't really need to, unless 
> they are read.
> 
> So what we can do, is introduce a marker in the cgroup, representing the 
> last read value. Parent is untouched. We then update parent when 1) 
> reading this value, 2) cgroup destroy, 3) cpu hotplug. (humm, and maybe 
> we don't even need to do it in cpu hotplug, since the per-cpu variables 
> will still be accessible... )
> 
> How about it ?

Updating that value would involve iterating all tasks in the entire
cgroup subtree nested at whatever cgroup you're wanting to read.

The delayed update would be an entire subtree walk, that can be quite
expensive. Who wants these numbers and what for and at what frequency?
Does that really make sense?


* Re: [PATCH 3/9] Include nice values in cpuacct
  2011-09-14 20:04 ` [PATCH 3/9] Include nice values in cpuacct Glauber Costa
@ 2011-09-19 16:19   ` Peter Zijlstra
  2011-09-19 16:26     ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 16:19 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> Besides user and system.


Why!? 


* Re: [PATCH 3/9] Include nice values in cpuacct
  2011-09-19 16:19   ` Peter Zijlstra
@ 2011-09-19 16:26     ` Glauber Costa
  2011-09-19 18:36       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-19 16:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, jbottomley

On 09/19/2011 01:19 PM, Peter Zijlstra wrote:
> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>> Besides user and system.
>
>
> Why!?
Because this is the whole purpose of this series. Think that in a 
reasonable use case, each cgroup will be tied to a container system. 
Each of them wants to have its own statistics, in /proc/stat. The best 
way I thought of doing it was to piggyback into cpuacct, instead of 
coming up with a 124th accounting mechanism just for that.

I also used it to demonstrate how we can easily implement a new feature 
found in hypervisors out of it, but this is really just the cherry on top.


* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 16:19       ` Peter Zijlstra
@ 2011-09-19 16:30         ` Glauber Costa
  2011-09-19 16:39           ` Peter Zijlstra
  2011-09-19 18:35           ` Peter Zijlstra
  0 siblings, 2 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-19 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/19/2011 01:19 PM, Peter Zijlstra wrote:
> On Mon, 2011-09-19 at 13:09 -0300, Glauber Costa wrote:
>> On 09/19/2011 01:03 PM, Peter Zijlstra wrote:
>>> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>>>> +       for (; ca; ca = parent_ca(ca)) {
>>>
>>> It might be good to check that the loop condition and null condition in
>>> the parent_ca() function get folded. Otherwise there's a double branch
>>> in that loop.
>>>
>>> Note that this function is one of the reasons I dislike cpuacct, it adds
>>> a second cgroup hierarchy traversal to every context switch.
>>>
>> Well, it is not that hard to optimize this.
>>
>> Those values are always updated, but they don't really need to, unless
>> they are read.
>>
>> So what we can do, is introduce a marker in the cgroup, representing the
>> last read value. Parent is untouched. We then update parent when 1)
>> reading this value, 2) cgroup destroy, 3) cpu hotplug. (humm, and maybe
>> we don't even need to do it in cpu hotplug, since the per-cpu variables
>> will still be accessible... )
>>
>> How about it ?
>
> Updating that value would involve iterating all tasks in the entire
> cgroup subtree nested at whatever cgroup you're wanting to read.

No, it would not, because nothing is stored in the task; everything is 
stored in the cgroup. So it is O(h(n)), where n is the number of cgroups 
and h(n) is the height of the cgroup tree.

> The delayed update would be an entire subtree walk, that can be quite
> expensive.
But the subtrees are small, because we are talking about the cgroup 
subtree, which can grow quite a lot in breadth, but rarely in depth.

> Who wants these numbers and what for and at what frequency?
> Does that really make sense?

Whoever wants /proc/stat numbers. Once, or maybe twice a sec would be 
the normal interval here for most use cases, I guess (top inside a 
container, for instance).

Even people doing much more frequent updates here would not come 
close to doing it every tick, therefore making this option cheaper than 
traversing the tree at each tick.

Btw, this works for cpuacct. For cpuusage, I am not sure this 
optimization is a valid one. Since this value is at least intended to 
provide a basis for cpu capping in the near future (Well, it is not 
there, but I think it is), it is expected to be used much more 
frequently by the kernel itself.


* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 16:30         ` Glauber Costa
@ 2011-09-19 16:39           ` Peter Zijlstra
  2011-09-19 16:41             ` Glauber Costa
  2011-09-19 18:35           ` Peter Zijlstra
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 16:39 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Mon, 2011-09-19 at 13:30 -0300, Glauber Costa wrote:
> Since this value is at least intended to 
> provide a basis for cpu capping in the near future (Well, it is not 
> there, but I think it is), it is expected to be used much more 
> frequently by the kernel itself. 

We just merged a bandwidth limiter quite separate from cpuacct.


* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 16:39           ` Peter Zijlstra
@ 2011-09-19 16:41             ` Glauber Costa
  2011-09-19 18:40               ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-19 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/19/2011 01:39 PM, Peter Zijlstra wrote:
> On Mon, 2011-09-19 at 13:30 -0300, Glauber Costa wrote:
>> Since this value is at least intended to
>> provide a basis for cpu capping in the near future (Well, it is not
>> there, but I think it is), it is expected to be used much more
>> frequently by the kernel itself.
>
> We just merged a bandwidth limiter quite separate from cpuacct.

Note I was talking about cpuusage. Well, in this case, I myself fail
to see the value of the whole cpuusage thing... Someone that sees 
further could maybe help us out here? (but this is sideways to the whole 
discussion we're having anyway)


* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 16:30         ` Glauber Costa
  2011-09-19 16:39           ` Peter Zijlstra
@ 2011-09-19 18:35           ` Peter Zijlstra
  2011-09-19 18:38             ` Glauber Costa
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 18:35 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Mon, 2011-09-19 at 13:30 -0300, Glauber Costa wrote:
> For cpuusage, I am not sure this optimization is a valid one 

I was talking about cpuusage, cpuacct_charge() is called for every
ctxsw/tick.

But even for cpuacct tick stuff, wouldn't you need to sum all your child
cgroups to update the current cgroup? and that up the whole tree?


* Re: [PATCH 3/9] Include nice values in cpuacct
  2011-09-19 16:26     ` Glauber Costa
@ 2011-09-19 18:36       ` Peter Zijlstra
  2011-09-19 18:37         ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 18:36 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, jbottomley

On Mon, 2011-09-19 at 13:26 -0300, Glauber Costa wrote:
> On 09/19/2011 01:19 PM, Peter Zijlstra wrote:
> > On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> >> Besides user and system.
> >
> >
> > Why!?
> Because this is the whole purpose of this series. Think that in a 
> reasonable use case, each cgroup will be tied to a container system. 
> Each of them wants to have its own statistics, in /proc/stat. The best 
> way I thought of doing it was to piggyback into cpuacct, instead of 
> coming up with a 124th accounting mechanism just for that.
> 
> I also used it to demonstrate how we can easily implement a new feature 
> found in hypervisors out of it, but this is really just the cherry on top.


OK, but then amend the changelog with something like:

Since we want to provide a full /proc/stat equivalent for cgroups, also
implement nice time.


* Re: [PATCH 3/9] Include nice values in cpuacct
  2011-09-19 18:36       ` Peter Zijlstra
@ 2011-09-19 18:37         ` Glauber Costa
  0 siblings, 0 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-19 18:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, jbottomley

On 09/19/2011 03:36 PM, Peter Zijlstra wrote:
> On Mon, 2011-09-19 at 13:26 -0300, Glauber Costa wrote:
>> On 09/19/2011 01:19 PM, Peter Zijlstra wrote:
>>> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>>>> Besides user and system.
>>>
>>>
>>> Why!?
>> Because this is the whole purpose of this series. Think that in a
>> reasonable use case, each cgroup will be tied to a container system.
>> Each of them wants to have its own statistics, in /proc/stat. The best
>> way I thought of doing it was to piggyback into cpuacct, instead of
>> coming up with a 124th accounting mechanism just for that.
>>
>> I also used it to demonstrate how we can easily implement a new feature
>> found in hypervisors out of it, but this is really just the cherry on top.
>
>
> OK, but then amend the changelog with something like:
>
> Since we want to provide a full /proc/stat equivalent for cgroups, also
> implement nice time.
Fair request.



* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 18:35           ` Peter Zijlstra
@ 2011-09-19 18:38             ` Glauber Costa
  2011-09-19 18:44               ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-19 18:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/19/2011 03:35 PM, Peter Zijlstra wrote:
> On Mon, 2011-09-19 at 13:30 -0300, Glauber Costa wrote:
>> For cpuusage, I am not sure this optimization is a valid one
>
> I was talking about cpuusage, cpuacct_charge() is called for every
> ctxsw/tick.

I am not touching it right now.

> But even for cpuacct tick stuff, wouldn't you need to sum all your child
> cgroups to update the current cgroup? and that up the whole tree?

Of course I would. But as I said, it does not need to be done every 
tick, in case it poses such a cacheline mayhem as you fear.

Since we'll only really need those values when someone reads it - which 
is a far less frequent operation than the tick resolution - and when a 
cgroup is destroyed - even less frequent operation - it should work well.
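
Something in this direction is what I have in mind for the read side
(just a sketch: ca_own_stat() is a made-up helper for the group's own
counters, and locking/RCU is hand-waved):

static u64 cpuacct_read_recursive(struct cpuacct *ca, int idx)
{
	struct cgroup *child;
	u64 sum = ca_own_stat(ca, idx);	/* this group's own counter */

	/* fold children in only when somebody actually reads the file */
	list_for_each_entry(child, &ca->css.cgroup->children, sibling)
		sum += cpuacct_read_recursive(cgroup_ca(child), idx);

	return sum;
}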

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 4/9] Include irq and softirq fields in cpuacct
  2011-09-14 20:04 ` [PATCH 4/9] Include irq and softirq fields " Glauber Costa
@ 2011-09-19 18:38   ` Peter Zijlstra
  2011-09-19 18:40     ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 18:38 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> +       CPUACCT_STAT_IRQ,       /* ... irq ticks */
> +       CPUACCT_STAT_SOFTIRQ,   /* ... softirq ticks */

Would it make sense to make the cpustat thing an array as well? That way
the if forest will compute an array index and the cgroup and cpustat
parts can then use the 'same' code.

That avoids having to pass a cputime64_t* and int around for pretty much
the same thing.
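
Something along these lines, I mean (illustrative only; it assumes the
global cpustat is turned into an array too, and cpuacct_account_field()
is a made-up name):

static void account_tick_field(struct task_struct *p, u64 cputime)
{
	int idx;

	if (hardirq_count())
		idx = CPUACCT_STAT_IRQ;
	else if (in_serving_softirq())
		idx = CPUACCT_STAT_SOFTIRQ;
	else
		idx = CPUACCT_STAT_SYSTEM;

	/* both the global and the cgroup side use the very same index */
	kstat_cpu(smp_processor_id()).cpustat[idx] += cputime;
	cpuacct_account_field(p, idx, cputime);
}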

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 4/9] Include irq and softirq fields in cpuacct
  2011-09-19 18:38   ` Peter Zijlstra
@ 2011-09-19 18:40     ` Glauber Costa
  0 siblings, 0 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-19 18:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/19/2011 03:38 PM, Peter Zijlstra wrote:
> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>> +       CPUACCT_STAT_IRQ,       /* ... irq ticks */
>> +       CPUACCT_STAT_SOFTIRQ,   /* ... softirq ticks */
>
> Would it make sense to make the cpustat thing an array as well? That way
> the if forest will compute an array index and the cgroup and cpustat
> parts can then use the 'same' code.
>
> That avoids having to pass a cputime64_t* and int around for pretty much
> the same thing.
I think so, and I go further: I think that when cgroups are compiled in 
- by far the common case these days - cpustat should be just cpuacct for 
the root cgroup. With the right macro trickery, it can even be made quite 
transparent...
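
i.e. something in the direction of (very rough; root_cpuacct is a
made-up statically allocated root group, and it assumes the array-style
cpustat from the previous mail):

#ifdef CONFIG_CGROUP_CPUACCT
#define cpustat_field(cpu, idx)	(per_cpu_ptr(root_cpuacct.cpustat, (cpu))[(idx)])
#else
#define cpustat_field(cpu, idx)	(kstat_cpu(cpu).cpustat[(idx)])
#endif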

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 16:41             ` Glauber Costa
@ 2011-09-19 18:40               ` Peter Zijlstra
  2011-09-20 17:29                 ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 18:40 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo,
	jbottomley, Srivatsa Vaddagiri

On Mon, 2011-09-19 at 13:41 -0300, Glauber Costa wrote:
> On 09/19/2011 01:39 PM, Peter Zijlstra wrote:
> > On Mon, 2011-09-19 at 13:30 -0300, Glauber Costa wrote:
> >> Since this value is at least intended to
> >> provide a basis for cpu capping in the near future (Well, it is not
> >> there, but I think it is), it is expected to be used much more
> >> frequently by the kernel itself.
> >
> > We just merged a bandwidth limiter quite separate from cpuacct.
> 
> Note I was talking about cpuusage. Well, in this case, I myself fail
> to see the value of the whole cpuusage thing... Someone that sees 
> further could maybe help us out here? (but this is sideways to the whole 
> discussion we're having anyway)

Would probably do to cc the cpuacct authors in this patch set ;-) I
think it came from IBM, so I added vatsa.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 18:38             ` Glauber Costa
@ 2011-09-19 18:44               ` Peter Zijlstra
  2011-09-19 19:14                 ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 18:44 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Mon, 2011-09-19 at 15:38 -0300, Glauber Costa wrote:
> On 09/19/2011 03:35 PM, Peter Zijlstra wrote:
> > On Mon, 2011-09-19 at 13:30 -0300, Glauber Costa wrote:
> >> For cpuusage, I am not sure this optimization is a valid one
> >
> > I was talking about cpuusage, cpuacct_charge() is called for every
> > ctxsw/tick.
> 
> I am not touching it right now.

See hunk #4 of this particular patch:

@@ -9302,7 +9308,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)

That's cpuusage muck ;-)

> > But even for cpuacct tick stuff, wouldn't you need to sum all your child
> > cgroups to update the current cgroup? and that up the whole tree?
> 
> Of course I would. But as I said, it does not need to be done every 
> tick, in case it poses such a cacheline mayhem as you fear.
> 
> Since we'll only really need those values when someone reads it - which 
> is a far less frequent operation than the tick resolution - and when a 
> cgroup is destroyed - even less frequent operation - it should work well.

Quite possible, yes. Although if you create a cgroup with 100 subgroups
and poll very frequently.. it all depends on the avg use case etc.. and
since I don't use any of this stuff someone needs to tell me about how
the trade-offs work out in practice.

So explicit changelogs with numbers and agreements from multiple users
go a long way to make me feel good ;-)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 18:44               ` Peter Zijlstra
@ 2011-09-19 19:14                 ` Glauber Costa
  2011-09-19 19:18                   ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-19 19:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, jbottomley

On 09/19/2011 03:44 PM, Peter Zijlstra wrote:
> On Mon, 2011-09-19 at 15:38 -0300, Glauber Costa wrote:
>> On 09/19/2011 03:35 PM, Peter Zijlstra wrote:
>>> On Mon, 2011-09-19 at 13:30 -0300, Glauber Costa wrote:
>>>> For cpuusage, I am not sure this optimization is a valid one
>>>
>>> I was talking about cpuusage, cpuacct_charge() is called for every
>>> ctxsw/tick.
>>
>> I am not touching it right now.
>
> See hunk #4 of this particular patch:
>
> @@ -9302,7 +9308,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
>
> That's cpuusage muck ;-)

Well, sure. But then, the only thing I am doing in this particular hunk 
is changing the definition of how to grab a parent... It can easily even 
be left aside.

>>> But even for cpuacct tick stuff, wouldn't you need to sum all your child
>>> cgroups to update the current cgroup? and that up the whole tree?
>>
>> Of course I would. But as I said, it does not need to be done every
>> tick, in case it poses such a cacheline mayhem as you fear.
>>
>> Since we'll only really need those values when someone reads it - which
>> is a far less frequent operation than the tick resolution - and when a
>> cgroup is destroyed - even less frequent operation - it should work well.
>
> Quite possible, yes. Although if you create a cgroup with 100 subgroups
> and poll very frequently.. it all depends on the avg use case etc.. and
> since I don't use any of this stuff someone needs to tell me about how
> the trade-offs work out in practice.
>
> So explicit changelogs with numbers and agreements from multiple users
> go a long way to make me feel good ;-)

Well, thinking again about that, maybe it is not that much of a good 
idea. systemd is already starting to spawn a lot of cgroups... and if we 
update the tree bottom-up, we actually do have a much larger cost, since 
we need to touch all its breadth...

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 19:14                 ` Glauber Costa
@ 2011-09-19 19:18                   ` Peter Zijlstra
  2011-09-19 19:19                     ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-19 19:18 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, jbottomley

On Mon, 2011-09-19 at 16:14 -0300, Glauber Costa wrote:
> not that much of a good idea. systemd 

I think that just about says it right :-)

/me runs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 19:18                   ` Peter Zijlstra
@ 2011-09-19 19:19                     ` Glauber Costa
  0 siblings, 0 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-19 19:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, jbottomley

On 09/19/2011 04:18 PM, Peter Zijlstra wrote:
> On Mon, 2011-09-19 at 16:14 -0300, Glauber Costa wrote:
>> not that much of a good idea. systemd
>
> I think that just about says it right :-)
>
> /me runs
Awesome the way you make it look like I said that. =p


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/9] Per-cgroup /proc/stat
  2011-09-15  8:56     ` Peter Zijlstra
@ 2011-09-19 23:07       ` Paul Turner
  2011-09-20  8:33         ` Peter Zijlstra
  2011-09-20 21:37         ` Glauber Costa
  0 siblings, 2 replies; 58+ messages in thread
From: Paul Turner @ 2011-09-19 23:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Glauber Costa, linux-kernel, xemul, paul, lizf,
	daniel.lezcano, mingo, jbottomley

On 09/15/11 01:56, Peter Zijlstra wrote:
> On Wed, 2011-09-14 at 13:23 -0700, Andi Kleen wrote:
>> Peter Zijlstra<a.p.zijlstra@chello.nl>  writes:
>>>
>>> Guys we should seriously trim back a lot of that code, not grow ever
>>> more and more. The sad fact is that if you build a kernel with
>>> cpu-cgroup support the context switch cost is more than double that of a
>>> kernel without, and then you haven't even started creating cgroups yet.
>>
>> That sounds indeed quite bad. Is it known why it is so costly?
>
> Mostly because all data structures grow and all code paths grow, some by
> quite a bit, its spread all over the place, lots of little cuts etc..
>
> pjt and I tried trimming some of the code paths with static_branch() but
> didn't really get anywhere.. need to get back to looking at this stuff
> sometime soon.

When I get some time I think I'm just going to post a patch[*] that 
merges the useful _field_ (usage, usage_percpu) from cpuacct into cpu 
since we are *already* doing the accounting on the entity level making 
this addition free.

At that point we could !CONFIG_CGROUP_CPUACCT by default and deprecate 
the beast without breaking ABI for those who really need it (either 
because their applications have hard-coded paths or because they really 
like cgroup user/sys time -- which we COULD duplicate into cpu but I'm 
inclined not to).

[*]: the only real caveat is how loudly people scream about the code 
duplication; I think it's worth it if it lets us kill cpuacct in the 
long run.

Another unrelated optimization on this path I have sitting around in 
patches/ to push at some point is keeping the left-most entity out of 
tree; since the worst case is an entity with a lower-vruntime comes 
along and we insert the previous left-most and the best case is we get 
to pick it without futzing with the rb-tree.  I think this was good for 
a percent or two when I hacked it together before.
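
(Shape of the idea only, not the actual patch: the field and helper
names below are made up, and the vruntime comparison is simplified; the
real thing would use entity_before() to cope with wraparound.)

static void enqueue_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct sched_entity *lm = cfs_rq->cached_leftmost;	/* kept out of the tree */

	if (!lm) {
		cfs_rq->cached_leftmost = se;		/* best case: no tree op at all */
	} else if (se->vruntime < lm->vruntime) {
		cfs_rq->cached_leftmost = se;		/* worst case: new leftmost... */
		rbtree_insert(cfs_rq, lm);		/* ...old one goes back into the tree */
	} else {
		rbtree_insert(cfs_rq, se);
	}
}

/* pick_next_entity() then just returns cached_leftmost without touching the rb-tree */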

Another idea I have kicking around for this path is the introduction of 
a link_entity which bridges over nr_running=1 chains (break it 
opportunistically when an element in the chain goes to nr_running=2). 
This one requires some pretty careful accounting around the breaking of 
a chain though so I'm not touching it until I get the new load tracking 
code out.  (Incidentally when I benchmarked it before LPC I had it 
working out to be a little more efficient than the current math good for 
~2-3% on pipe_test.)

- Paul

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/9] Per-cgroup /proc/stat
  2011-09-19 23:07       ` Paul Turner
@ 2011-09-20  8:33         ` Peter Zijlstra
  2011-09-20 21:37         ` Glauber Costa
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-20  8:33 UTC (permalink / raw)
  To: Paul Turner
  Cc: Andi Kleen, Glauber Costa, linux-kernel, xemul, paul, lizf,
	daniel.lezcano, mingo, jbottomley

On Mon, 2011-09-19 at 16:07 -0700, Paul Turner wrote:
> At that point we could !CONFIG_CGROUP_CPUACCT by default and deprecate 
> the beast without breaking ABI for those who really need it (either 
> because their applications have hard-coded paths or because they really 
> like cgroup user/sys time -- which we COULD duplicate into cpu but I'm 
> inclined not to).
> 
> [*]: the only real caveat is how loudly people scream about the code 
> duplication; I think it's worth it if it let's us kill cpuacct in the 
> long run. 

Right, so this patch is all about extending the stat crap, so that would
have to be duplicated as well.

IIRC the reason cpuacct is a separate controller is that it then
doesn't add the overhead it incurs to the regular cpu controller.

But I figure we could add some control knobs on the cpu cgroup mount to
enable/disable certain features.

Having two separate cgroup hierarchies for the scheduler is quite daft
of course.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/9] Include idle and iowait fields in cpuacct
  2011-09-14 20:04 ` [PATCH 6/9] Include idle and iowait " Glauber Costa
@ 2011-09-20  9:21   ` Peter Zijlstra
  2011-09-20 12:36     ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-20  9:21 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> These are slightly different from the others though:
> (note to reviewers: might be better to put those in a separate
> array?)
> 
> Since idle/iowait are a property of the system - by definition,
> no process from any cgroup is running when the system is idle,
> they are system wide. So what these fields really mean, are baselines
> for when the cgroup was created. It allows the cgroup to start
> counting idle/iowait from 0.

Alternatively you can make iowait based on nr_uninterruptible per cgroup
and count all ticks _this_ cgroup was idle.

Now all that would need moar accounting...

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/9] Create cpuacct.proc.stat file
  2011-09-14 20:04 ` [PATCH 7/9] Create cpuacct.proc.stat file Glauber Costa
@ 2011-09-20  9:22   ` Peter Zijlstra
  2011-09-20 12:22     ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-20  9:22 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> When doing containers, this file represents how /proc/stat should look
> from inside it. It is user, system, etc, from the perspective of
> the current cgroup.
> 
It would be very nice indeed if you could unify fs/proc/stat.c and this.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 8/9] per-cgroup boot time
  2011-09-14 20:04 ` [PATCH 8/9] per-cgroup boot time Glauber Costa
@ 2011-09-20  9:25   ` Peter Zijlstra
  2011-09-20 12:37     ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-20  9:25 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> Record the time in which the cgroup was created. This can be
> used to provide a more accurate boottime information in
> cpuacct.proc.stat.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> ---
>  kernel/sched.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 4611c54..8f254d0 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9120,6 +9120,7 @@ struct cpuacct {
>  	/* cpuusage holds pointer to a u64-type object on every cpu */
>  	u64 __percpu *cpuusage;
>  	u64 __percpu *cpustat;
> +	struct timespec start_time;
>  };
>  
>  struct cgroup_subsys cpuacct_subsys;
> @@ -9172,6 +9173,8 @@ static struct cgroup_subsys_state *cpuacct_create(
>  		acct[CPUACCT_STAT_IDLE] = kstat_cpu(i).cpustat.idle;
>  		acct[CPUACCT_STAT_IOWAIT] = kstat_cpu(i).cpustat.iowait;
>  	}
> +
> +	get_monotonic_boottime(&ca->start_time);
>  	return &ca->css;
>  
>  out_free_usage:
> @@ -9316,13 +9319,16 @@ static int cpuacct_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>  	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
>  	struct timespec boottime;
>  	struct cpuacct *ca = cgroup_ca(cgrp);
> +	struct timespec ts;
>  	u64 *cpustat;
>  
>  	user = nice = system = idle = iowait =
>  		irq = softirq = steal = cputime64_zero;
>  	guest = guest_nice = cputime64_zero;
>  	getboottime(&boottime);
> -	jif = boottime.tv_sec;
> +	getboottime(&boottime);
> +	ts = timespec_add(boottime, ca->start_time);
> +	jif = ts.tv_sec;
>  
>  	for_each_possible_cpu(i) {
>  		cpustat = per_cpu_ptr(ca->cpustat, i);


I'm confused, what does it do? You take a boot time timestamp at cgroup
creation, add that to all boot-time readings and print the result. How
does that make sense? Subtracting the start_time, maybe, that would make
the cgroup creation time 0, adding, not so much.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 9/9] Report steal time for cgroup
  2011-09-14 20:04 ` [PATCH 9/9] Report steal time for cgroup Glauber Costa
@ 2011-09-20  9:29   ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-20  9:29 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> +               list_for_each_entry(sib, &ca->css.cgroup->sibling, sibling) {

Oh hey, look at that, cgroups have a sibling list as well as a parent
pointer.. good thing I duplicated all that for task_group :-)

[ sadly we can't actually remove that, because we can have groups not
represented in the cgroup muck ]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 7/9] Create cpuacct.proc.stat file
  2011-09-20  9:22   ` Peter Zijlstra
@ 2011-09-20 12:22     ` Glauber Costa
  0 siblings, 0 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-20 12:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/20/2011 06:22 AM, Peter Zijlstra wrote:
> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>> When doing containers, this file represents how /proc/stat should look
>> from inside it. It is user, system, etc, from the perspective of
>> the current cgroup.
>>
> It would be very nice indeed if you could unify fs/proc/stat.c and this.

I think it is a good idea.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/9] Include idle and iowait fields in cpuacct
  2011-09-20  9:21   ` Peter Zijlstra
@ 2011-09-20 12:36     ` Glauber Costa
  2011-09-20 12:58       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-20 12:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/20/2011 06:21 AM, Peter Zijlstra wrote:
> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>> These are slightly different from the others though:
>> (note to reviewers: might be better to put those in a separate
>> array?)
>>
>> Since idle/iowait are a property of the system - by definition,
>> no process from any cgroup is running when the system is idle,
>> they are system wide. So what these fields really mean, are baselines
>> for when the cgroup was created. It allows the cgroup to start
>> counting idle/iowait from 0.
>
> Alternatively you can make iowait based on nr_uninterruptible per cgroup
> and count all ticks _this_ cgroup was idle.
You think?

Humm,humm... maybe...
iowait can indeed be seen as a process group characteristic. I was 
mainly concerned about overhead here, specially for the idle case:

If we are idle, there is no task context we can draw from, since the 
task in the cpu is the idle task. So we end up having to touch all 
cgroups... Or am I missing something?

Sounds expensive.


> Now all that would need moar accounting...


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 8/9] per-cgroup boot time
  2011-09-20  9:25   ` Peter Zijlstra
@ 2011-09-20 12:37     ` Glauber Costa
  2011-09-20 13:04       ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-20 12:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/20/2011 06:25 AM, Peter Zijlstra wrote:
> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>> Record the time in which the cgroup was created. This can be
>> used to provide a more accurate boottime information in
>> cpuacct.proc.stat.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> ---
>>   kernel/sched.c |    8 +++++++-
>>   1 files changed, 7 insertions(+), 1 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 4611c54..8f254d0 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -9120,6 +9120,7 @@ struct cpuacct {
>>   	/* cpuusage holds pointer to a u64-type object on every cpu */
>>   	u64 __percpu *cpuusage;
>>   	u64 __percpu *cpustat;
>> +	struct timespec start_time;
>>   };
>>
>>   struct cgroup_subsys cpuacct_subsys;
>> @@ -9172,6 +9173,8 @@ static struct cgroup_subsys_state *cpuacct_create(
>>   		acct[CPUACCT_STAT_IDLE] = kstat_cpu(i).cpustat.idle;
>>   		acct[CPUACCT_STAT_IOWAIT] = kstat_cpu(i).cpustat.iowait;
>>   	}
>> +
>> +	get_monotonic_boottime(&ca->start_time);
>>   	return&ca->css;
>>
>>   out_free_usage:
>> @@ -9316,13 +9319,16 @@ static int cpuacct_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>>   	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
>>   	struct timespec boottime;
>>   	struct cpuacct *ca = cgroup_ca(cgrp);
>> +	struct timespec ts;
>>   	u64 *cpustat;
>>
>>   	user = nice = system = idle = iowait =
>>   		irq = softirq = steal = cputime64_zero;
>>   	guest = guest_nice = cputime64_zero;
>>   	getboottime(&boottime);
>> -	jif = boottime.tv_sec;
>> +	getboottime(&boottime);
>> +	ts = timespec_add(boottime, ca->start_time);
>> +	jif = ts.tv_sec;
>>
>>   	for_each_possible_cpu(i) {
>>   		cpustat = per_cpu_ptr(ca->cpustat, i);
>
>
> I'm confused, what does it do? You take a boot time timestamp at cgroup
> creation, add that to all boot-time readings and print the result. How
> does that make sense? Subtracting the start_time, maybe, that would make
> the cgroup creation time 0, adding, not so much.

Boot time represents the time at which the machine was booted. In this 
context, it is the time at which the container/cgroup was created. So it 
has to be an addition... I don't really understand your question

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/9] Include idle and iowait fields in cpuacct
  2011-09-20 12:36     ` Glauber Costa
@ 2011-09-20 12:58       ` Peter Zijlstra
  2011-09-20 12:58         ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-20 12:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Tue, 2011-09-20 at 09:36 -0300, Glauber Costa wrote:
> On 09/20/2011 06:21 AM, Peter Zijlstra wrote:
> > On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> >> These are slightly different from the others though:
> >> (note to reviewers: might be better to put those in a separate
> >> array?)
> >>
> >> Since idle/iowait are a property of the system - by definition,
> >> no process from any cgroup is running when the system is idle,
> >> they are system wide. So what these fields really mean, are baselines
> >> for when the cgroup was created. It allows the cgroup to start
> >> counting idle/iowait from 0.
> >
> > Alternatively you can make iowait based on nr_uninterruptible per cgroup
> > and count all ticks _this_ cgroup was idle.
> You think?
> 
> Humm,humm... maybe...
> iowait can indeed be seen as a process group characteristic. I was 
> mainly concerned about overhead here, specially for the idle case:

The overhead of accounting per cgroup nr_uninterruptible is the worst I
think, that's in the sleep/wakeup paths.

> If we are idle, there is no task context we can draw from, since the 
> task in the cpu is the idle task. So we end up having to touch all 
> cgroups... Or am I missing something?
> 
> Sounds expensive.

Count the total number of ticks on the cpu (I think we already have
that) and subtract the number of ticks in this cgroup (I think we also
already have that), which should yield: number of ticks not in this
cgroup, aka number of ticks this cgroup was idle.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/9] Include idle and iowait fields in cpuacct
  2011-09-20 12:58       ` Peter Zijlstra
@ 2011-09-20 12:58         ` Glauber Costa
  2011-09-20 13:05           ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-20 12:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/20/2011 09:58 AM, Peter Zijlstra wrote:
> On Tue, 2011-09-20 at 09:36 -0300, Glauber Costa wrote:
>> On 09/20/2011 06:21 AM, Peter Zijlstra wrote:
>>> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>>>> These are slightly different from the others though:
>>>> (note to reviewers: might be better to put those in a separate
>>>> array?)
>>>>
>>>> Since idle/iowait are a property of the system - by definition,
>>>> no process from any cgroup is running when the system is idle,
>>>> they are system wide. So what these fields really mean, are baselines
>>>> for when the cgroup was created. It allows the cgroup to start
>>>> counting idle/iowait from 0.
>>>
>>> Alternatively you can make iowait based on nr_uninterruptible per cgroup
>>> and count all ticks _this_ cgroup was idle.
>> You think?
>>
>> Humm,humm... maybe...
>> iowait can indeed be seen as a process group characteristic. I was
>> mainly concerned about overhead here, specially for the idle case:
>
> The overhead of accounting per cgroup nr_uninterruptible is the worst I
> think, that's in the sleep/wakeup paths.
>
>> If we are idle, there is no task context we can draw from, since the
>> task in the cpu is the idle task. So we end up having to touch all
>> cgroups... Or am I missing something?
>>
>> Sounds expensive.
>
> Count the total number of ticks on the cpu (I think we already have
> that) and subtract the number of ticks in this cgroup (I think we also
> already have that), which should yield: number of ticks not in this
> cgroup, aka number of ticks this cgroup was idle.
No , no... remember steal time.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 8/9] per-cgroup boot time
  2011-09-20 12:37     ` Glauber Costa
@ 2011-09-20 13:04       ` Peter Zijlstra
  2011-09-20 13:06         ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-20 13:04 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Tue, 2011-09-20 at 09:37 -0300, Glauber Costa wrote:

> >> +	getboottime(&boottime);
> >> +	ts = timespec_add(boottime, ca->start_time);
> >> +	jif = ts.tv_sec;
> >>
> >>   	for_each_possible_cpu(i) {
> >>   		cpustat = per_cpu_ptr(ca->cpustat, i);
> >
> >
> > I'm confused, what does it do? You take a boot time timestamp at cgroup
> > creation, add that to all boot-time readings and print the result. How
> > does that make sense? Subtracting the start_time, maybe, that would make
> > the cgroup creation time 0, adding, not so much.
> 
> Boot time represents the time at which the machine was booted. In this 
> context, it is the time at which the container/cgroup was created. So it 
> has to be an addition... I don't really understand your question

I think we're all properly confused now :-)

getboottime() gives a time since boot, right? 

You take stamp at cgroup creation: say 50s after boot.

Then on usage you take a new getboottime() reading (which per definition
is > 50s) and add your 50s that you read previous. This results in the
cgroup having 50s of boot-time _MORE_ than the machine. Say you read at
123s, you then add your 50s timestamp, resulting in 173s to report.

If instead you did a subtraction: 123-50=73, you would have reported the
time since cgroup creation.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/9] Include idle and iowait fields in cpuacct
  2011-09-20 12:58         ` Glauber Costa
@ 2011-09-20 13:05           ` Peter Zijlstra
  2011-09-20 13:29             ` Glauber Costa
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-20 13:05 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Tue, 2011-09-20 at 09:58 -0300, Glauber Costa wrote:
> On 09/20/2011 09:58 AM, Peter Zijlstra wrote:
> > On Tue, 2011-09-20 at 09:36 -0300, Glauber Costa wrote:
> >> On 09/20/2011 06:21 AM, Peter Zijlstra wrote:
> >>> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
> >>>> These are slightly different from the others though:
> >>>> (note to reviewers: might be better to put those in a separate
> >>>> array?)
> >>>>
> >>>> Since idle/iowait are a property of the system - by definition,
> >>>> no process from any cgroup is running when the system is idle,
> >>>> they are system wide. So what these fields really mean, are baselines
> >>>> for when the cgroup was created. It allows the cgroup to start
> >>>> counting idle/iowait from 0.
> >>>
> >>> Alternatively you can make iowait based on nr_uninterruptible per cgroup
> >>> and count all ticks _this_ cgroup was idle.
> >> You think?
> >>
> >> Humm,humm... maybe...
> >> iowait can indeed be seen as a process group characteristic. I was
> >> mainly concerned about overhead here, specially for the idle case:
> >
> > The overhead of accounting per cgroup nr_uninterruptible is the worst I
> > think, that's in the sleep/wakeup paths.
> >
> >> If we are idle, there is no task context we can draw from, since the
> >> task in the cpu is the idle task. So we end up having to touch all
> >> cgroups... Or am I missing something?
> >>
> >> Sounds expensive.
> >
> > Count the total number of ticks on the cpu (I think we already have
> > that) and subtract the number of ticks in this cgroup (I think we also
> > already have that), which should yield: number of ticks not in this
> > cgroup, aka number of ticks this cgroup was idle.
> No , no... remember steal time.

Of course I don't.. that's virt stuff, I repress that with all my might.
But add or subtract steal ticks someplace and it doesn't come out right?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 8/9] per-cgroup boot time
  2011-09-20 13:04       ` Peter Zijlstra
@ 2011-09-20 13:06         ` Glauber Costa
  2011-09-20 13:31           ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Glauber Costa @ 2011-09-20 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On 09/20/2011 10:04 AM, Peter Zijlstra wrote:
> On Tue, 2011-09-20 at 09:37 -0300, Glauber Costa wrote:
>
>>>> +	getboottime(&boottime);
>>>> +	ts = timespec_add(boottime, ca->start_time);
>>>> +	jif = ts.tv_sec;
>>>>
>>>>    	for_each_possible_cpu(i) {
>>>>    		cpustat = per_cpu_ptr(ca->cpustat, i);
>>>
>>>
>>> I'm confused, what does it do? You take a boot time timestamp at cgroup
>>> creation, add that to all boot-time readings and print the result. How
>>> does that make sense? Subtracting the start_time, maybe, that would make
>>> the cgroup creation time 0, adding, not so much.
>>
>> Boot time represents the time at which the machine was booted. In this
>> context, it is the time at which the container/cgroup was created. So it
>> has to be an addition... I don't really understand your question
>
> I think we're all properly confused now :-)
>
> getboottime() gives a time since boot, right?
>
> You take stamp at cgroup creation: say 50s after boot.
>
> Then on usage you take a new getboottime() reading (which per definition
> is>  50s) and add your 50s that you read previous. This results in the
> cgroup having 50s of boot-time _MORE_ than the machine. Say you read at
> 123s, you then add your 50s timestamp, resulting in 173s to report.
>
> If instead you did a subtraction: 123-50=73, you would have reported the
> time since cgroup creation.
>
no.

/**
  * getboottime - Return the real time of system boot.
  * @ts:         pointer to the timespec to be set
  *
  * Returns the wall-time of boot in a timespec.
...
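
Spelling it out with made-up numbers: getboottime() gives the wall-clock
time the machine booted, and start_time is how long after boot the
cgroup was created, so

	wall time the machine booted		1000
	cgroup created 50s after boot		  50
	boottime + start_time			1050

which is the wall-clock time the container "booted" - exactly what its
btime field should say. Subtracting would give nonsense here.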

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 6/9] Include idle and iowait fields in cpuacct
  2011-09-20 13:05           ` Peter Zijlstra
@ 2011-09-20 13:29             ` Glauber Costa
  0 siblings, 0 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-20 13:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, jbottomley

On 09/20/2011 10:05 AM, Peter Zijlstra wrote:
> On Tue, 2011-09-20 at 09:58 -0300, Glauber Costa wrote:
>> On 09/20/2011 09:58 AM, Peter Zijlstra wrote:
>>> On Tue, 2011-09-20 at 09:36 -0300, Glauber Costa wrote:
>>>> On 09/20/2011 06:21 AM, Peter Zijlstra wrote:
>>>>> On Wed, 2011-09-14 at 17:04 -0300, Glauber Costa wrote:
>>>>>> These are slightly different from the others though:
>>>>>> (note to reviewers: might be better to put those in a separate
>>>>>> array?)
>>>>>>
>>>>>> Since idle/iowait are a property of the system - by definition,
>>>>>> no process from any cgroup is running when the system is idle,
>>>>>> they are system wide. So what these fields really mean, are baselines
>>>>>> for when the cgroup was created. It allows the cgroup to start
>>>>>> counting idle/iowait from 0.
>>>>>
>>>>> Alternatively you can make iowait based on nr_uninterruptible per cgroup
>>>>> and count all ticks _this_ cgroup was idle.
>>>> You think?
>>>>
>>>> Humm,humm... maybe...
>>>> iowait can indeed be seen as a process group characteristic. I was
>>>> mainly concerned about overhead here, specially for the idle case:
>>>
>>> The overhead of accounting per cgroup nr_uninterruptible is the worst I
>>> think, that's in the sleep/wakeup paths.
>>>
>>>> If we are idle, there is no task context we can draw from, since the
>>>> task in the cpu is the idle task. So we end up having to touch all
>>>> cgroups... Or am I missing something?
>>>>
>>>> Sounds expensive.
>>>
>>> Count the total number of ticks on the cpu (I think we already have
>>> that) and subtract the number of ticks in this cgroup (I think we also
>>> already have that), which should yield: number of ticks not in this
>>> cgroup, aka number of ticks this cgroup was idle.
>> No , no... remember steal time.
>
> Of course I don't.. that's virt stuff, I repress that with all my might.
> But add or subtract steal ticks someplace and it doesn't come out right?

That's what I am here for...

But back to your answer:

 >>> Count the total number of ticks on the cpu (I think we already have
 >>> that) and subtract the number of ticks in this cgroup (I think we also
 >>> already have that),

Not sure if we have ticks in this cgroup... anyway, it can be done. We 
need a baseline for what was the tick situation when the cgroup started 
anyway.
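
Roughly what I'd expect the read side to end up looking like, then
(every helper and field name below is invented, nothing of this exists
yet):

static u64 ca_idle_estimate(struct cpuacct *ca, int cpu)
{
	/* ticks elapsed on this cpu since the cgroup was created */
	u64 total = cpu_total_ticks(cpu) - ca->tick_baseline[cpu];
	/* ticks actually charged to this group: user+nice+system+irq+softirq */
	u64 busy  = ca_busy_ticks(ca, cpu);
	/* ticks its tasks were runnable while other groups ran */
	u64 steal = ca_steal_ticks(ca, cpu);

	return total - busy - steal;	/* idle, from this group's point of view */
}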


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 8/9] per-cgroup boot time
  2011-09-20 13:06         ` Glauber Costa
@ 2011-09-20 13:31           ` Peter Zijlstra
  0 siblings, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-20 13:31 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, xemul, paul, lizf, daniel.lezcano, mingo, jbottomley

On Tue, 2011-09-20 at 10:06 -0300, Glauber Costa wrote:
> On 09/20/2011 10:04 AM, Peter Zijlstra wrote:
> > On Tue, 2011-09-20 at 09:37 -0300, Glauber Costa wrote:
> >
> >>>> +	getboottime(&boottime);
> >>>> +	ts = timespec_add(boottime, ca->start_time);
> >>>> +	jif = ts.tv_sec;
> >>>>
> >>>>    	for_each_possible_cpu(i) {
> >>>>    		cpustat = per_cpu_ptr(ca->cpustat, i);
> >>>
> >>>
> >>> I'm confused, what does it do? You take a boot time timestamp at cgroup
> >>> creation, add that to all boot-time readings and print the result. How
> >>> does that make sense? Subtracting the start_time, maybe, that would make
> >>> the cgroup creation time 0, adding, not so much.
> >>
> >> Boot time represents the time at which the machine was booted. In this
> >> context, it is the time at which the container/cgroup was created. So it
> >> has to be an addition... I don't really understand your question
> >
> > I think we're all properly confused now :-)
> >
> > getboottime() gives a time since boot, right?
> >
> > You take stamp at cgroup creation: say 50s after boot.
> >
> > Then on usage you take a new getboottime() reading (which per definition
> > is>  50s) and add your 50s that you read previous. This results in the
> > cgroup having 50s of boot-time _MORE_ than the machine. Say you read at
> > 123s, you then add your 50s timestamp, resulting in 173s to report.
> >
> > If instead you did a subtraction: 123-50=73, you would have reported the
> > time since cgroup creation.
> >
> no.
> 
> /**
>   * getboottime - Return the real time of system boot.
>   * @ts:         pointer to the timespec to be set
>   *
>   * Returns the wall-time of boot in a timespec.
> ...

Argh, ok. 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-19 18:40               ` Peter Zijlstra
@ 2011-09-20 17:29                 ` Srivatsa Vaddagiri
  2011-09-22 15:11                   ` Balbir Singh
  0 siblings, 1 reply; 58+ messages in thread
From: Srivatsa Vaddagiri @ 2011-09-20 17:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, linux-kernel, xemul, paul, lizf, daniel.lezcano,
	mingo, jbottomley

* Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-19 20:40:53]:

> > Note I was talking about cpuusage. Well, in this case, I myself fail
> > to see the value of the whole cpuusage thing...

If I recall, cpuacct was prototyped by Paul Menage (and later modified
by Balbir) as a way to demonstrate cgroup controller ..It was one 
of those "first" cgroups controllers written to demonstrate how cgroups worked.

With regard to its value, it does provide a handy interface to measure
cpu utilization of cgroups (vs having to manually sum up task's
utilization) and afaik libvirt makes use of it (although I
don't think anyone reads it at "high" frequency).

http://www.redhat.com/archives/libvir-list/2011-April/msg00801.html
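
(For reference, a consumer typically just reads the file, something like
the below; the mount point and group path are only examples:)

#include <stdio.h>

int main(void)
{
	unsigned long long ns;
	FILE *f = fopen("/sys/fs/cgroup/cpuacct/libvirt/cpuacct.usage", "r");

	if (f && fscanf(f, "%llu", &ns) == 1)
		printf("group cpu time: %llu ns\n", ns);
	if (f)
		fclose(f);
	return 0;
}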

> >  Someone that sees 
> > further could maybe help us out here? (but this is sideways to the whole 
> > discussion we're having anyway)
> 
> Would probably do to cc the cpuacct authors in this patch set ;-) I
> think it came from IBM, so I added vatsa.

- vatsa

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 0/9] Per-cgroup /proc/stat
  2011-09-19 23:07       ` Paul Turner
  2011-09-20  8:33         ` Peter Zijlstra
@ 2011-09-20 21:37         ` Glauber Costa
  1 sibling, 0 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-20 21:37 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Andi Kleen, linux-kernel, xemul, paul, lizf,
	daniel.lezcano, jbottomley

On 09/19/2011 08:07 PM, Paul Turner wrote:
> On 09/15/11 01:56, Peter Zijlstra wrote:
>> On Wed, 2011-09-14 at 13:23 -0700, Andi Kleen wrote:
>>> Peter Zijlstra<a.p.zijlstra@chello.nl> writes:
>>>>
>>>> Guys we should seriously trim back a lot of that code, not grow ever
>>>> more and more. The sad fact is that if you build a kernel with
>>>> cpu-cgroup support the context switch cost is more than double that
>>>> of a
>>>> kernel without, and then you haven't even started creating cgroups yet.
>>>
>>> That sounds indeed quite bad. Is it known why it is so costly?
>>
>> Mostly because all data structures grow and all code paths grow, some by
>> quite a bit, its spread all over the place, lots of little cuts etc..
>>
>> pjt and I tried trimming some of the code paths with static_branch() but
>> didn't really get anywhere.. need to get back to looking at this stuff
>> sometime soon.
>
> When I get some time I think I'm just going to post a patch[*] that
> merges the useful _field_ (usage, usage_percpu) from cpuacct into cpu
> since we are *already* doing the accounting on the entity level making
> this addition free.
agree.

> At that point we could !CONFIG_CGROUP_CPUACCT by default and deprecate
> the beast without breaking ABI for those who really need it (either
> because their applications have hard-coded paths or because they really
> like cgroup user/sys time -- which we COULD duplicate into cpu but I'm
> inclined not to).

Well, why? Now that I look into it, one of the nice ways to achieve 
what I am proposing in this patchset is:
  1) get rid of cpuacct.
  2) do all the accounting in the cpu cgroup, and then merge it into fs/proc/stat.c

> [*]: the only real caveat is how loudly people scream about the code
> duplication; I think it's worth it if it lets us kill cpuacct in the
> long run.

One way to deprecate it is probably to disallow writing any tasks to 
cpuacct's tasks file. We then expose whatever information there is in cpu/.

It may get ugly, since we'll need to touch core cgroup code, but it is 
nice from a user PoV.
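
Something as simple as this on the cpuacct side would probably be enough
(sketch only; the exact can_attach prototype varies between kernel
versions, and maybe -EPERM is the better error):

/* never allow a task to be attached to a deprecated cpuacct hierarchy */
static int cpuacct_can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
			      struct task_struct *tsk)
{
	return -EINVAL;
}

and then wiring it up as cpuacct_subsys.can_attach.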

> Another unrelated optimization on this path I have sitting around in
> patches/ to push at some point is keeping the left-most entity out of
> tree; since the worst case is an entity with a lower-vruntime comes
> along and we insert the previous left-most and the best case is we get
> to pick it without futzing with the rb-tree. I think this was good for a
> percent or two when I hacked it together before.
>
> Another idea I have kicking around for this path is the introduction of
> a link_entity which bridges over nr_running=1 chains (break it
> opportunistically when an element in the chain goes to nr_running=2).
> This one requires some pretty careful accounting around the breaking of
> a chain though so I'm not touching it until I get the new load tracking
> code out. (Incidentally when I benchmarked it before LPC I had it
> working out to be a little more efficient than the current math good for
> ~2-3% on pipe_test.)
>
> - Paul


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-20 17:29                 ` Srivatsa Vaddagiri
@ 2011-09-22 15:11                   ` Balbir Singh
  2011-09-22 15:17                     ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Balbir Singh @ 2011-09-22 15:11 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Peter Zijlstra, Glauber Costa, linux-kernel, xemul, paul, lizf,
	daniel.lezcano, mingo, jbottomley

On Tue, Sep 20, 2011 at 10:59 PM, Srivatsa Vaddagiri
<vatsa@linux.vnet.ibm.com> wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-19 20:40:53]:
>
>> > Note I was talking about cpuusage. Well, in this case, I myself fail
>> > to see the value of the whole cpuusage thing...
>
> If I recall, cpuacct was prototyped by Paul Menage (and later modified
> by Balbir) as a way to demonstrate cgroup controller ..It was one
> of those "first" cgroups controllers written to demonstrate how cgroups worked.
>
> With regard to its value, it does provide a handy interface to measure
> cpu utilization of cgroups (vs having to manually sum up task's
> utilization) and afaik libvirt makes use of it (although I
> don't think anyone reads it at "high" frequency).
>

Android makes heavy use of cpuacct as well. We invented it to do cpu
accounting independent of control

Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-22 15:11                   ` Balbir Singh
@ 2011-09-22 15:17                     ` Peter Zijlstra
  2011-09-23  8:09                       ` Balbir Singh
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-22 15:17 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Srivatsa Vaddagiri, Glauber Costa, linux-kernel, xemul, paul,
	lizf, daniel.lezcano, mingo, jbottomley

On Thu, 2011-09-22 at 20:41 +0530, Balbir Singh wrote:
> On Tue, Sep 20, 2011 at 10:59 PM, Srivatsa Vaddagiri
> <vatsa@linux.vnet.ibm.com> wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2011-09-19 20:40:53]:
> >
> >> > Note I was talking about cpuusage. Well, in this case, I myself fail
> >> > to see the value of the whole cpuusage thing...
> >
> > If I recall, cpuacct was prototyped by Paul Menage (and later modified
> > by Balbir) as a way to demonstrate cgroup controller ..It was one
> > of those "first" cgroups controllers written to demonstrate how cgroups worked.
> >
> > With regard to its value, it does provide a handy interface to measure
> > cpu utilization of cgroups (vs having to manually sum up task's
> > utilization) and afaik libvirt makes use of it (although I
> > don't think anyone reads it at "high" frequency).
> >
> 
> Android makes heavy use of cpuacct as well. We invented it to do cpu
> accounting independent of control

But its a massive waste of time to have to iterate a double hierarchy,
cache-miss heaven.

Or so people really have independent cgroups as well? I thought people
just co-mount all this crap.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-22 15:17                     ` Peter Zijlstra
@ 2011-09-23  8:09                       ` Balbir Singh
  2011-09-23 14:35                         ` Peter Zijlstra
  2011-09-23 15:45                         ` Glauber Costa
  0 siblings, 2 replies; 58+ messages in thread
From: Balbir Singh @ 2011-09-23  8:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srivatsa Vaddagiri, Glauber Costa, linux-kernel, xemul, paul,
	lizf, daniel.lezcano, mingo, jbottomley

>> Android makes heavy use of cpuacct as well. We invented it to do cpu
>> accounting independent of control
>
> But its a massive waste of time to have to iterate a double hierarchy,
> cache-miss heaven.
>

By double hierarchy you mean both cpu and cpuacct? What if they are
not mounted together?

> Or so people really have independent cgroups as well? I thought people
> just co-mount all this crap.
>

Please see
http://code.google.com/p/cgroupmgr/wiki/cgroupsForAndroid
http://stackoverflow.com/questions/6814207/android-cpuacct-usage-file

AFAIK, some versions mount cpuacct at /acct and independent of cpu cgroups
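
(In C terms that setup boils down to mount calls like the ones below;
Android really does it from init.rc, so this is only illustrative, and
the co-mounted variant is what Peter is assuming everybody does:)

#include <sys/mount.h>

/* separate hierarchy, accounting only: mount -t cgroup -o cpuacct none /acct */
static int mount_acct_separate(void)
{
	return mount("none", "/acct", "cgroup", 0, "cpuacct");
}

/* co-mounted with the cpu controller on one hierarchy */
static int mount_cpu_comounted(void)
{
	return mount("none", "/sys/fs/cgroup/cpu", "cgroup", 0, "cpu,cpuacct");
}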

Balbir

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-23  8:09                       ` Balbir Singh
@ 2011-09-23 14:35                         ` Peter Zijlstra
  2011-09-23 15:45                         ` Glauber Costa
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Zijlstra @ 2011-09-23 14:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Srivatsa Vaddagiri, Glauber Costa, linux-kernel, xemul, paul,
	lizf, daniel.lezcano, mingo, jbottomley

On Fri, 2011-09-23 at 13:39 +0530, Balbir Singh wrote:
> >> Android makes heavy use of cpuacct as well. We invented it to do cpu
> >> accounting independent of control
> >
> > But its a massive waste of time to have to iterate a double hierarchy,
> > cache-miss heaven.
> >
> 
> By double hierarchy you mean both cpu and cpuacct? 

Yes

> What if they are not mounted together?

I was asking if there was a real reason not to have that.

> > Or so people really have independent cgroups as well? I thought people
> > just co-mount all this crap.
> >
> 
> Please see
> http://code.google.com/p/cgroupmgr/wiki/cgroupsForAndroid
> http://stackoverflow.com/questions/6814207/android-cpuacct-usage-file
> 
> AFAIK, some versions mount cpuacct at /acct and independent of cpu cgroups

They do now, but can that be fixed? Is there a sane use-case?

The overhead of all this crap really must come down.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 1/9] Remove parent field in cpuacct cgroup
  2011-09-23  8:09                       ` Balbir Singh
  2011-09-23 14:35                         ` Peter Zijlstra
@ 2011-09-23 15:45                         ` Glauber Costa
  1 sibling, 0 replies; 58+ messages in thread
From: Glauber Costa @ 2011-09-23 15:45 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Peter Zijlstra, Srivatsa Vaddagiri, linux-kernel, xemul, paul,
	lizf, daniel.lezcano, mingo, jbottomley

On 09/23/2011 05:09 AM, Balbir Singh wrote:
>>> Android makes heavy use of cpuacct as well. We invented it to do cpu
>>> accounting independent of control
>>
>> But its a massive waste of time to have to iterate a double hierarchy,
>> cache-miss heaven.
>>
>
> By double hierarchy you mean both cpu and cpuacct? What if they are
> not mounted together?
>
>> Or so people really have independent cgroups as well? I thought people
>> just co-mount all this crap.
>>
>
> Please see
> http://code.google.com/p/cgroupmgr/wiki/cgroupsForAndroid
> http://stackoverflow.com/questions/6814207/android-cpuacct-usage-file
>
> AFAIK, some versions mount cpuacct at /acct and independent of cpu cgroups
>
> Balbir
But then, are they using both cpuacct and cpu cgroup for an independent 
group of tasks?

The problem here is that we're gathering the same statistics 3 times 
just because we need/want to present them in a different way. This is 
really bad design.



^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread

Thread overview: 58+ messages
2011-09-14 20:04 [PATCH 0/9] Per-cgroup /proc/stat Glauber Costa
2011-09-14 20:04 ` [PATCH 1/9] Remove parent field in cpuacct cgroup Glauber Costa
2011-09-19 16:03   ` Peter Zijlstra
2011-09-19 16:09     ` Glauber Costa
2011-09-19 16:19       ` Peter Zijlstra
2011-09-19 16:30         ` Glauber Costa
2011-09-19 16:39           ` Peter Zijlstra
2011-09-19 16:41             ` Glauber Costa
2011-09-19 18:40               ` Peter Zijlstra
2011-09-20 17:29                 ` Srivatsa Vaddagiri
2011-09-22 15:11                   ` Balbir Singh
2011-09-22 15:17                     ` Peter Zijlstra
2011-09-23  8:09                       ` Balbir Singh
2011-09-23 14:35                         ` Peter Zijlstra
2011-09-23 15:45                         ` Glauber Costa
2011-09-19 18:35           ` Peter Zijlstra
2011-09-19 18:38             ` Glauber Costa
2011-09-19 18:44               ` Peter Zijlstra
2011-09-19 19:14                 ` Glauber Costa
2011-09-19 19:18                   ` Peter Zijlstra
2011-09-19 19:19                     ` Glauber Costa
2011-09-14 20:04 ` [PATCH 2/9] Make cpuacct fields per cpu variables Glauber Costa
2011-09-19 16:10   ` Peter Zijlstra
2011-09-14 20:04 ` [PATCH 3/9] Include nice values in cpuacct Glauber Costa
2011-09-19 16:19   ` Peter Zijlstra
2011-09-19 16:26     ` Glauber Costa
2011-09-19 18:36       ` Peter Zijlstra
2011-09-19 18:37         ` Glauber Costa
2011-09-14 20:04 ` [PATCH 4/9] Include irq and softirq fields " Glauber Costa
2011-09-19 18:38   ` Peter Zijlstra
2011-09-19 18:40     ` Glauber Costa
2011-09-14 20:04 ` [PATCH 5/9] Include guest " Glauber Costa
2011-09-14 20:04 ` [PATCH 6/9] Include idle and iowait " Glauber Costa
2011-09-20  9:21   ` Peter Zijlstra
2011-09-20 12:36     ` Glauber Costa
2011-09-20 12:58       ` Peter Zijlstra
2011-09-20 12:58         ` Glauber Costa
2011-09-20 13:05           ` Peter Zijlstra
2011-09-20 13:29             ` Glauber Costa
2011-09-14 20:04 ` [PATCH 7/9] Create cpuacct.proc.stat file Glauber Costa
2011-09-20  9:22   ` Peter Zijlstra
2011-09-20 12:22     ` Glauber Costa
2011-09-14 20:04 ` [PATCH 8/9] per-cgroup boot time Glauber Costa
2011-09-20  9:25   ` Peter Zijlstra
2011-09-20 12:37     ` Glauber Costa
2011-09-20 13:04       ` Peter Zijlstra
2011-09-20 13:06         ` Glauber Costa
2011-09-20 13:31           ` Peter Zijlstra
2011-09-14 20:04 ` [PATCH 9/9] Report steal time for cgroup Glauber Costa
2011-09-20  9:29   ` Peter Zijlstra
2011-09-14 20:13 ` [PATCH 0/9] Per-cgroup /proc/stat Peter Zijlstra
2011-09-14 20:20   ` Glauber Costa
2011-09-15  8:53     ` Peter Zijlstra
2011-09-14 20:23   ` Andi Kleen
2011-09-15  8:56     ` Peter Zijlstra
2011-09-19 23:07       ` Paul Turner
2011-09-20  8:33         ` Peter Zijlstra
2011-09-20 21:37         ` Glauber Costa
