* [RFD 0/9] per-cgroup /proc/stat statistics
@ 2011-09-23 22:20 Glauber Costa
  2011-09-23 22:20 ` [RFD 1/9] Change cpustat fields to an array Glauber Costa
                   ` (9 more replies)
  0 siblings, 10 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel; +Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley

Hi,

Since I've already sent an RFC about this, I am now sending an RFD.
If you're eager for meaning, this can then be a "Request for Doctors",
since Peter is likely to have a heart attack now.

So here's the deal:

* My main goal here was to demonstrate that we can avoid double accounting
  in a lot of places. So what I did was get rid of the original and first
  kstat mechanism, and use only cgroup accounting for that. Since the parent
  is always updated, the original stats are the ones of the root cgroup.
  (A sketch of this hierarchical update follows below.)
* I believe that all those cpu cgroups are confusing and should be unified.
  Not that we can simply get rid of them, but my goal here is to provide all
  the information they do in the cpu cgroup. If the set of tasks needed for
  accounting is not independent of the one in the cpu cgroup, we can avoid
  the double accounting. I default cpuacct to n, but leave it available to
  people who want to use it alone.
* I'm also doing what I was doing originally: providing a per-cgroup version
  of the /proc/stat file.
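
Here is a minimal sketch of that hierarchical update, with names borrowed
from later patches in this series (locking and the exact accounting sites
are simplified; treat it as an illustration rather than the final code):

	/*
	 * Sketch only: every accounting site walks from the task's group
	 * up to the root, so the root task_group always ends up holding
	 * the system-wide totals that /proc/stat used to read from the
	 * global per-cpu kstat.
	 */
	static void account_field(struct task_struct *p, int index,
				  cputime64_t val)
	{
		struct task_group *tg = task_group(p);

		do {
			struct kernel_stat *kstat = this_cpu_ptr(tg->cpustat);

			kstat->cpustat[index] =
				cputime64_add(kstat->cpustat[index], val);
			tg = tg->parent;
		} while (tg);
	}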


Glauber Costa (9):
  Change cpustat fields to an array.
  Move /proc/stat logic inside sched.c
  Display /proc/stat information per cgroup
  Make total_forks per-cgroup
  per-cgroup boot time
  Report steal time for cgroup
  provide a version of cpuacct statistics inside cpu cgroup
  provide a version of cpuusage statistics inside cpu cgroup
  Change CPUACCT to default n

 fs/proc/stat.c              |  117 +---------
 fs/proc/uptime.c            |    2 +-
 include/linux/kernel_stat.h |   48 +++--
 include/linux/sched.h       |    5 +
 init/Kconfig                |    1 +
 kernel/fork.c               |    9 +-
 kernel/sched.c              |  595 +++++++++++++++++++++++++++++++++++--------
 kernel/sched_fair.c         |    2 +-
 kernel/sched_rt.c           |    2 +-
 9 files changed, 530 insertions(+), 251 deletions(-)

-- 
1.7.6



* [RFD 1/9] Change cpustat fields to an array.
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
@ 2011-09-23 22:20 ` Glauber Costa
  2011-09-27 21:00   ` Peter Zijlstra
  2011-09-27 21:03   ` Peter Zijlstra
  2011-09-23 22:20 ` [RFD 2/9] Move /proc/stat logic inside sched.c Glauber Costa
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, Glauber Costa

This will give us a bit more flexibility to deal with the
fields in this structure. This is a preparation patch for
later patches in this series.

I tried to keep the accessor macros unchanged, so we don't need
to patch all users. At some point I gave up on kstat_this_cpu,
but to be fair, this one is not used outside of sched.c, so it is
not a big deal.
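
As an illustration of the flexibility this buys (a hypothetical helper,
not part of this patch): a single routine can now update any field by
index, which the later patches in the series rely on.

	/*
	 * Hypothetical example: with the array layout one helper covers
	 * every field, instead of one statement per named struct member.
	 * Preemption must be disabled, as for any kstat_this_cpu user.
	 */
	static inline void kstat_account(enum cpu_usage_stat idx,
					 cputime64_t val)
	{
		kstat_this_cpu->cpustat[idx] =
			cputime64_add(kstat_this_cpu->cpustat[idx], val);
	}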

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/proc/stat.c              |   40 ++++++++++++------------
 fs/proc/uptime.c            |    2 +-
 include/linux/kernel_stat.h |   46 +++++++++++++++++-----------
 kernel/sched.c              |   68 ++++++++++++++++++++++++-------------------
 4 files changed, 87 insertions(+), 69 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 9758b65..ec708c7 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -39,18 +39,18 @@ static int show_stat(struct seq_file *p, void *v)
 	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
-		user = cputime64_add(user, kstat_cpu(i).cpustat.user);
-		nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
-		system = cputime64_add(system, kstat_cpu(i).cpustat.system);
-		idle = cputime64_add(idle, kstat_cpu(i).cpustat.idle);
+		user = cputime64_add(user, kstat_cpu(i).cpustat[USER]);
+		nice = cputime64_add(nice, kstat_cpu(i).cpustat[NICE]);
+		system = cputime64_add(system, kstat_cpu(i).cpustat[SYSTEM]);
+		idle = cputime64_add(idle, kstat_cpu(i).cpustat[IDLE]);
 		idle = cputime64_add(idle, arch_idle_time(i));
-		iowait = cputime64_add(iowait, kstat_cpu(i).cpustat.iowait);
-		irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
-		softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
-		steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
-		guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
+		iowait = cputime64_add(iowait, kstat_cpu(i).cpustat[IOWAIT]);
+		irq = cputime64_add(irq, kstat_cpu(i).cpustat[IRQ]);
+		softirq = cputime64_add(softirq, kstat_cpu(i).cpustat[SOFTIRQ]);
+		steal = cputime64_add(steal, kstat_cpu(i).cpustat[STEAL]);
+		guest = cputime64_add(guest, kstat_cpu(i).cpustat[GUEST]);
 		guest_nice = cputime64_add(guest_nice,
-			kstat_cpu(i).cpustat.guest_nice);
+			kstat_cpu(i).cpustat[GUEST_NICE]);
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
 
@@ -78,17 +78,17 @@ static int show_stat(struct seq_file *p, void *v)
 	for_each_online_cpu(i) {
 
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kstat_cpu(i).cpustat.user;
-		nice = kstat_cpu(i).cpustat.nice;
-		system = kstat_cpu(i).cpustat.system;
-		idle = kstat_cpu(i).cpustat.idle;
+		user = kstat_cpu(i).cpustat[USER];
+		nice = kstat_cpu(i).cpustat[NICE];
+		system = kstat_cpu(i).cpustat[SYSTEM];
+		idle = kstat_cpu(i).cpustat[IDLE];
 		idle = cputime64_add(idle, arch_idle_time(i));
-		iowait = kstat_cpu(i).cpustat.iowait;
-		irq = kstat_cpu(i).cpustat.irq;
-		softirq = kstat_cpu(i).cpustat.softirq;
-		steal = kstat_cpu(i).cpustat.steal;
-		guest = kstat_cpu(i).cpustat.guest;
-		guest_nice = kstat_cpu(i).cpustat.guest_nice;
+		iowait = kstat_cpu(i).cpustat[IOWAIT];
+		irq = kstat_cpu(i).cpustat[IRQ];
+		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
+		steal = kstat_cpu(i).cpustat[STEAL];
+		guest = kstat_cpu(i).cpustat[GUEST];
+		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
 		seq_printf(p,
 			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
 			"%llu\n",
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 766b1d4..b0e053d 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -15,7 +15,7 @@ static int uptime_proc_show(struct seq_file *m, void *v)
 	cputime_t idletime = cputime_zero;
 
 	for_each_possible_cpu(i)
-		idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
+		idletime = cputime64_add(idletime, kstat_cpu(i).cpustat[IDLE]);
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	monotonic_to_bootbased(&uptime);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 0cce2db..93f64f3 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -6,6 +6,7 @@
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/interrupt.h>
+#include <linux/sched.h>
 #include <asm/irq.h>
 #include <asm/cputime.h>
 
@@ -15,21 +16,22 @@
  * used by rstatd/perfmeter
  */
 
-struct cpu_usage_stat {
-	cputime64_t user;
-	cputime64_t nice;
-	cputime64_t system;
-	cputime64_t softirq;
-	cputime64_t irq;
-	cputime64_t idle;
-	cputime64_t iowait;
-	cputime64_t steal;
-	cputime64_t guest;
-	cputime64_t guest_nice;
+enum cpu_usage_stat {
+	USER,
+	NICE,
+	SYSTEM,
+	SOFTIRQ,
+	IRQ,
+	IDLE,
+	IOWAIT,
+	STEAL,
+	GUEST,
+	GUEST_NICE,
+	NR_STATS,
 };
 
 struct kernel_stat {
-	struct cpu_usage_stat	cpustat;
+	cputime64_t cpustat[NR_STATS];
 #ifndef CONFIG_GENERIC_HARDIRQS
        unsigned int irqs[NR_IRQS];
 #endif
@@ -39,9 +41,17 @@ struct kernel_stat {
 
 DECLARE_PER_CPU(struct kernel_stat, kstat);
 
-#define kstat_cpu(cpu)	per_cpu(kstat, cpu)
+struct kernel_stat *task_group_kstat(struct task_struct *p);
+
+#ifdef CONFIG_CGROUP_SCHED
+#define kstat_cpu(cpu) (*per_cpu_ptr(task_group_kstat(current), cpu))
+
 /* Must have preemption disabled for this to be meaningful. */
-#define kstat_this_cpu	__get_cpu_var(kstat)
+#define kstat_this_cpu	this_cpu_ptr(task_group_kstat(current))
+#else
+#define kstat_cpu(cpu) per_cpu(kstat, cpu)
+#define kstat_this_cpu (&__get_cpu_var(kstat))
+#endif
 
 extern unsigned long long nr_context_switches(void);
 
@@ -52,8 +62,8 @@ struct irq_desc;
 static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
 					    struct irq_desc *desc)
 {
-	__this_cpu_inc(kstat.irqs[irq]);
-	__this_cpu_inc(kstat.irqs_sum);
+	kstat_this_cpu->irqs[irq]++;
+	kstat_this_cpu->irqs_sum++;
 }
 
 static inline unsigned int kstat_irqs_cpu(unsigned int irq, int cpu)
@@ -67,14 +77,14 @@ extern unsigned int kstat_irqs_cpu(unsigned int irq, int cpu);
 #define kstat_incr_irqs_this_cpu(irqno, DESC)		\
 do {							\
 	__this_cpu_inc(*(DESC)->kstat_irqs);		\
-	__this_cpu_inc(kstat.irqs_sum);			\
+	kstat_this_cpu->irqs_sum++;			\
 } while (0)
 
 #endif
 
 static inline void kstat_incr_softirqs_this_cpu(unsigned int irq)
 {
-	__this_cpu_inc(kstat.softirqs[irq]);
+	kstat_this_cpu->softirqs[irq]++;
 }
 
 static inline unsigned int kstat_softirqs_cpu(unsigned int irq, int cpu)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3ed4107..2f6bab4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -278,6 +278,7 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+	struct kernel_stat __percpu *cpustat;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -623,6 +624,9 @@ static inline struct task_group *task_group(struct task_struct *p)
 	struct task_group *tg;
 	struct cgroup_subsys_state *css;
 
+	if (!p->mm)
+		return &root_task_group;
+
 	css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
 			lockdep_is_held(&p->pi_lock) ||
 			lockdep_is_held(&task_rq(p)->lock));
@@ -631,6 +635,12 @@ static inline struct task_group *task_group(struct task_struct *p)
 	return autogroup_task_group(p, tg);
 }
 
+struct kernel_stat *task_group_kstat(struct task_struct *p)
+{
+	struct task_group *tg = task_group(p);
+
+	return tg->cpustat;
+}
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 {
@@ -653,6 +663,8 @@ static inline struct task_group *task_group(struct task_struct *p)
 	return NULL;
 }
 
+DEFINE_PER_CPU(struct kernel_stat, kstat);
+EXPORT_PER_CPU_SYMBOL(kstat);
 #endif /* CONFIG_CGROUP_SCHED */
 
 static void update_rq_clock_task(struct rq *rq, s64 delta);
@@ -2004,14 +2016,14 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t *cpustat = kstat_this_cpu->cpustat;
 	unsigned long flags;
 	u64 latest_ns;
 	int ret = 0;
 
 	local_irq_save(flags);
 	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
+	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
 		ret = 1;
 	local_irq_restore(flags);
 	return ret;
@@ -2019,14 +2031,14 @@ static int irqtime_account_hi_update(void)
 
 static int irqtime_account_si_update(void)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t *cpustat = kstat_this_cpu->cpustat;
 	unsigned long flags;
 	u64 latest_ns;
 	int ret = 0;
 
 	local_irq_save(flags);
 	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
+	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[SOFTIRQ]))
 		ret = 1;
 	local_irq_restore(flags);
 	return ret;
@@ -3669,10 +3681,6 @@ unlock:
 
 #endif
 
-DEFINE_PER_CPU(struct kernel_stat, kstat);
-
-EXPORT_PER_CPU_SYMBOL(kstat);
-
 /*
  * Return any ns on the sched_clock that have not yet been accounted in
  * @p in case that task is currently running.
@@ -3757,7 +3765,7 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
 void account_user_time(struct task_struct *p, cputime_t cputime,
 		       cputime_t cputime_scaled)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t *cpustat = kstat_this_cpu->cpustat;
 	cputime64_t tmp;
 
 	/* Add user time to process. */
@@ -3768,9 +3776,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
 	if (TASK_NICE(p) > 0)
-		cpustat->nice = cputime64_add(cpustat->nice, tmp);
+		cpustat[NICE] = cputime64_add(cpustat[NICE], tmp);
 	else
-		cpustat->user = cputime64_add(cpustat->user, tmp);
+		cpustat[USER] = cputime64_add(cpustat[USER], tmp);
 
 	cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
 	/* Account for user time used */
@@ -3787,7 +3795,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 			       cputime_t cputime_scaled)
 {
 	cputime64_t tmp;
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t *cpustat = kstat_this_cpu->cpustat;
 
 	tmp = cputime_to_cputime64(cputime);
 
@@ -3799,11 +3807,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add guest time to cpustat. */
 	if (TASK_NICE(p) > 0) {
-		cpustat->nice = cputime64_add(cpustat->nice, tmp);
-		cpustat->guest_nice = cputime64_add(cpustat->guest_nice, tmp);
+		cpustat[NICE] = cputime64_add(cpustat[NICE], tmp);
+		cpustat[GUEST_NICE] = cputime64_add(cpustat[GUEST_NICE], tmp);
 	} else {
-		cpustat->user = cputime64_add(cpustat->user, tmp);
-		cpustat->guest = cputime64_add(cpustat->guest, tmp);
+		cpustat[USER] = cputime64_add(cpustat[USER], tmp);
+		cpustat[GUEST] = cputime64_add(cpustat[GUEST], tmp);
 	}
 }
 
@@ -3843,7 +3851,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 void account_system_time(struct task_struct *p, int hardirq_offset,
 			 cputime_t cputime, cputime_t cputime_scaled)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t *cpustat = kstat_this_cpu->cpustat;
 	cputime64_t *target_cputime64;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
@@ -3852,11 +3860,11 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	}
 
 	if (hardirq_count() - hardirq_offset)
-		target_cputime64 = &cpustat->irq;
+		target_cputime64 = &cpustat[IRQ];
 	else if (in_serving_softirq())
-		target_cputime64 = &cpustat->softirq;
+		target_cputime64 = &cpustat[SOFTIRQ];
 	else
-		target_cputime64 = &cpustat->system;
+		target_cputime64 = &cpustat[SYSTEM];
 
 	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
 }
@@ -3867,10 +3875,10 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
  */
 void account_steal_time(cputime_t cputime)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t *cpustat = kstat_this_cpu->cpustat;
 	cputime64_t cputime64 = cputime_to_cputime64(cputime);
 
-	cpustat->steal = cputime64_add(cpustat->steal, cputime64);
+	cpustat[STEAL] = cputime64_add(cpustat[STEAL], cputime64);
 }
 
 /*
@@ -3879,14 +3887,14 @@ void account_steal_time(cputime_t cputime)
  */
 void account_idle_time(cputime_t cputime)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t *cpustat = kstat_this_cpu->cpustat;
 	cputime64_t cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
 
 	if (atomic_read(&rq->nr_iowait) > 0)
-		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
+		cpustat[IOWAIT] = cputime64_add(cpustat[IOWAIT], cputime64);
 	else
-		cpustat->idle = cputime64_add(cpustat->idle, cputime64);
+		cpustat[IDLE] = cputime64_add(cpustat[IDLE], cputime64);
 }
 
 static __always_inline bool steal_account_process_tick(void)
@@ -3937,15 +3945,15 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 {
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
 	cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	cputime64_t *cpustat = kstat_this_cpu->cpustat;
 
 	if (steal_account_process_tick())
 		return;
 
 	if (irqtime_account_hi_update()) {
-		cpustat->irq = cputime64_add(cpustat->irq, tmp);
+		cpustat[IRQ] = cputime64_add(cpustat[IRQ], tmp);
 	} else if (irqtime_account_si_update()) {
-		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+		cpustat[SOFTIRQ] = cputime64_add(cpustat[SOFTIRQ], tmp);
 	} else if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
@@ -3953,7 +3961,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 		 * Also, p->stime needs to be updated for ksoftirqd.
 		 */
 		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat->softirq);
+					&cpustat[SOFTIRQ]);
 	} else if (user_tick) {
 		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else if (p == rq->idle) {
@@ -3962,7 +3970,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 		account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else {
 		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat->system);
+					&cpustat[SYSTEM]);
 	}
 }
 
-- 
1.7.6



* [RFD 2/9] Move /proc/stat logic inside sched.c
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
  2011-09-23 22:20 ` [RFD 1/9] Change cpustat fields to an array Glauber Costa
@ 2011-09-23 22:20 ` Glauber Costa
  2011-09-23 22:20 ` [RFD 3/9] Display /proc/stat information per cgroup Glauber Costa
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, Glauber Costa

This patch moves all of the /proc/stat display code inside
sched.c. The goal is to, later on, have a different version
of it per cgroup. In container environments, this is useful
for giving each container a different and independent view of
the statistics displayed in this file.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/proc/stat.c        |  117 +-----------------------------------------
 include/linux/sched.h |    4 ++
 kernel/sched.c        |  136 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 141 insertions(+), 116 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index ec708c7..c9b2ae9 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -11,124 +11,9 @@
 #include <linux/irqnr.h>
 #include <asm/cputime.h>
 
-#ifndef arch_irq_stat_cpu
-#define arch_irq_stat_cpu(cpu) 0
-#endif
-#ifndef arch_irq_stat
-#define arch_irq_stat() 0
-#endif
-#ifndef arch_idle_time
-#define arch_idle_time(cpu) 0
-#endif
-
 static int show_stat(struct seq_file *p, void *v)
 {
-	int i, j;
-	unsigned long jif;
-	cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
-	cputime64_t guest, guest_nice;
-	u64 sum = 0;
-	u64 sum_softirq = 0;
-	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
-	struct timespec boottime;
-
-	user = nice = system = idle = iowait =
-		irq = softirq = steal = cputime64_zero;
-	guest = guest_nice = cputime64_zero;
-	getboottime(&boottime);
-	jif = boottime.tv_sec;
-
-	for_each_possible_cpu(i) {
-		user = cputime64_add(user, kstat_cpu(i).cpustat[USER]);
-		nice = cputime64_add(nice, kstat_cpu(i).cpustat[NICE]);
-		system = cputime64_add(system, kstat_cpu(i).cpustat[SYSTEM]);
-		idle = cputime64_add(idle, kstat_cpu(i).cpustat[IDLE]);
-		idle = cputime64_add(idle, arch_idle_time(i));
-		iowait = cputime64_add(iowait, kstat_cpu(i).cpustat[IOWAIT]);
-		irq = cputime64_add(irq, kstat_cpu(i).cpustat[IRQ]);
-		softirq = cputime64_add(softirq, kstat_cpu(i).cpustat[SOFTIRQ]);
-		steal = cputime64_add(steal, kstat_cpu(i).cpustat[STEAL]);
-		guest = cputime64_add(guest, kstat_cpu(i).cpustat[GUEST]);
-		guest_nice = cputime64_add(guest_nice,
-			kstat_cpu(i).cpustat[GUEST_NICE]);
-		sum += kstat_cpu_irqs_sum(i);
-		sum += arch_irq_stat_cpu(i);
-
-		for (j = 0; j < NR_SOFTIRQS; j++) {
-			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
-
-			per_softirq_sums[j] += softirq_stat;
-			sum_softirq += softirq_stat;
-		}
-	}
-	sum += arch_irq_stat();
-
-	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu %llu "
-		"%llu\n",
-		(unsigned long long)cputime64_to_clock_t(user),
-		(unsigned long long)cputime64_to_clock_t(nice),
-		(unsigned long long)cputime64_to_clock_t(system),
-		(unsigned long long)cputime64_to_clock_t(idle),
-		(unsigned long long)cputime64_to_clock_t(iowait),
-		(unsigned long long)cputime64_to_clock_t(irq),
-		(unsigned long long)cputime64_to_clock_t(softirq),
-		(unsigned long long)cputime64_to_clock_t(steal),
-		(unsigned long long)cputime64_to_clock_t(guest),
-		(unsigned long long)cputime64_to_clock_t(guest_nice));
-	for_each_online_cpu(i) {
-
-		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kstat_cpu(i).cpustat[USER];
-		nice = kstat_cpu(i).cpustat[NICE];
-		system = kstat_cpu(i).cpustat[SYSTEM];
-		idle = kstat_cpu(i).cpustat[IDLE];
-		idle = cputime64_add(idle, arch_idle_time(i));
-		iowait = kstat_cpu(i).cpustat[IOWAIT];
-		irq = kstat_cpu(i).cpustat[IRQ];
-		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
-		steal = kstat_cpu(i).cpustat[STEAL];
-		guest = kstat_cpu(i).cpustat[GUEST];
-		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
-		seq_printf(p,
-			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
-			"%llu\n",
-			i,
-			(unsigned long long)cputime64_to_clock_t(user),
-			(unsigned long long)cputime64_to_clock_t(nice),
-			(unsigned long long)cputime64_to_clock_t(system),
-			(unsigned long long)cputime64_to_clock_t(idle),
-			(unsigned long long)cputime64_to_clock_t(iowait),
-			(unsigned long long)cputime64_to_clock_t(irq),
-			(unsigned long long)cputime64_to_clock_t(softirq),
-			(unsigned long long)cputime64_to_clock_t(steal),
-			(unsigned long long)cputime64_to_clock_t(guest),
-			(unsigned long long)cputime64_to_clock_t(guest_nice));
-	}
-	seq_printf(p, "intr %llu", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_printf(p, " %u", kstat_irqs(j));
-
-	seq_printf(p,
-		"\nctxt %llu\n"
-		"btime %lu\n"
-		"processes %lu\n"
-		"procs_running %lu\n"
-		"procs_blocked %lu\n",
-		nr_context_switches(),
-		(unsigned long)jif,
-		total_forks,
-		nr_running(),
-		nr_iowait());
-
-	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
-
-	for (i = 0; i < NR_SOFTIRQS; i++)
-		seq_printf(p, " %u", per_softirq_sums[i]);
-	seq_putc(p, '\n');
-
-	return 0;
+	return cpu_cgroup_proc_stat(NULL, NULL, p);
 }
 
 static int stat_open(struct inode *inode, struct file *file)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4ac2c05..64c5ba5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2712,6 +2712,10 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+struct cgroup;
+struct cftype;
+int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
+			 struct seq_file *p);
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/sched.c b/kernel/sched.c
index 2f6bab4..4d831fe 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9049,6 +9049,142 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifndef arch_irq_stat_cpu
+#define arch_irq_stat_cpu(cpu) 0
+#endif
+#ifndef arch_irq_stat
+#define arch_irq_stat() 0
+#endif
+#ifndef arch_idle_time
+#define arch_idle_time(cpu) 0
+#endif
+
+int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_file *p)
+{
+	int i, j;
+	unsigned long jif;
+	cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
+	cputime64_t guest, guest_nice;
+	u64 sum = 0;
+	u64 sum_softirq = 0;
+	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
+	struct timespec boottime;
+	struct task_group *tg;
+
+	if (cgrp)
+		tg = cgroup_tg(cgrp);
+	else
+		tg = &root_task_group;
+
+	user = nice = system = idle = iowait =
+		irq = softirq = steal = cputime64_zero;
+	guest = guest_nice = cputime64_zero;
+	getboottime(&boottime);
+	jif = boottime.tv_sec;
+
+	for_each_possible_cpu(i) {
+		struct kernel_stat *kstat, *root_kstat;
+		kstat = per_cpu_ptr(tg->cpustat, i);
+		root_kstat = per_cpu_ptr(root_task_group.cpustat, i);
+
+		user = cputime64_add(user, kstat->cpustat[USER]);
+		nice = cputime64_add(nice, kstat->cpustat[NICE]);
+		system = cputime64_add(system, kstat->cpustat[SYSTEM]);
+		idle = cputime64_add(idle, root_kstat->cpustat[IDLE]);
+		idle = cputime64_add(idle, arch_idle_time(i));
+		iowait = cputime64_add(iowait, root_kstat->cpustat[IOWAIT]);
+		irq = cputime64_add(irq, kstat->cpustat[IRQ]);
+		softirq = cputime64_add(softirq, kstat->cpustat[SOFTIRQ]);
+		steal = cputime64_add(steal, kstat->cpustat[STEAL]);
+		guest = cputime64_add(guest, kstat->cpustat[GUEST]);
+		guest_nice = cputime64_add(guest_nice,
+			kstat->cpustat[GUEST_NICE]);
+		sum += kstat_cpu_irqs_sum(i);
+		sum += arch_irq_stat_cpu(i);
+
+		for (j = 0; j < NR_SOFTIRQS; j++) {
+			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
+
+			per_softirq_sums[j] += softirq_stat;
+			sum_softirq += softirq_stat;
+		}
+	}
+	sum += arch_irq_stat();
+
+	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu %llu "
+		"%llu\n",
+		(unsigned long long)cputime64_to_clock_t(user),
+		(unsigned long long)cputime64_to_clock_t(nice),
+		(unsigned long long)cputime64_to_clock_t(system),
+		(unsigned long long)cputime64_to_clock_t(idle),
+		(unsigned long long)cputime64_to_clock_t(iowait),
+		(unsigned long long)cputime64_to_clock_t(irq),
+		(unsigned long long)cputime64_to_clock_t(softirq),
+		(unsigned long long)cputime64_to_clock_t(steal),
+		(unsigned long long)cputime64_to_clock_t(guest),
+		(unsigned long long)cputime64_to_clock_t(guest_nice));
+	for_each_online_cpu(i) {
+		struct kernel_stat *kstat, *root_kstat;
+		kstat = per_cpu_ptr(tg->cpustat, i);
+		root_kstat = per_cpu_ptr(root_task_group.cpustat, i);
+
+		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
+		user = kstat->cpustat[USER];
+		nice = kstat->cpustat[NICE];
+		system = kstat->cpustat[SYSTEM];
+		idle = root_kstat->cpustat[IDLE];
+		idle = cputime64_add(idle, arch_idle_time(i));
+		iowait = root_kstat->cpustat[IOWAIT];
+		irq = kstat->cpustat[IRQ];
+		softirq = kstat->cpustat[SOFTIRQ];
+		steal = kstat->cpustat[STEAL];
+		guest = kstat->cpustat[GUEST];
+		guest_nice = kstat->cpustat[GUEST_NICE];
+		seq_printf(p,
+			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
+			"%llu\n",
+			i,
+			(unsigned long long)cputime64_to_clock_t(user),
+			(unsigned long long)cputime64_to_clock_t(nice),
+			(unsigned long long)cputime64_to_clock_t(system),
+			(unsigned long long)cputime64_to_clock_t(idle),
+			(unsigned long long)cputime64_to_clock_t(iowait),
+			(unsigned long long)cputime64_to_clock_t(irq),
+			(unsigned long long)cputime64_to_clock_t(softirq),
+			(unsigned long long)cputime64_to_clock_t(steal),
+			(unsigned long long)cputime64_to_clock_t(guest),
+			(unsigned long long)cputime64_to_clock_t(guest_nice));
+	}
+	seq_printf(p, "intr %llu", (unsigned long long)sum);
+
+	/* sum again ? it could be updated? */
+	for_each_irq_nr(j)
+		seq_printf(p, " %u", kstat_irqs(j));
+
+	seq_printf(p,
+		"\nctxt %llu\n"
+		"btime %lu\n"
+		"processes %lu\n"
+		"procs_running %lu\n"
+		"procs_blocked %lu\n",
+		nr_context_switches(),
+		(unsigned long)jif,
+		total_forks,
+		nr_running(),
+		nr_iowait());
+
+	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
+
+	for (i = 0; i < NR_SOFTIRQS; i++)
+		seq_printf(p, " %u", per_softirq_sums[i]);
+	seq_putc(p, '\n');
+
+	return 0;
+}
+
+
+
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
-- 
1.7.6



* [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
  2011-09-23 22:20 ` [RFD 1/9] Change cpustat fields to an array Glauber Costa
  2011-09-23 22:20 ` [RFD 2/9] Move /proc/stat logic inside sched.c Glauber Costa
@ 2011-09-23 22:20 ` Glauber Costa
  2011-09-27 17:01   ` Balbir Singh
                     ` (2 more replies)
  2011-09-23 22:20 ` [RFD 4/9] Make total_forks per-cgroup Glauber Costa
                   ` (6 subsequent siblings)
  9 siblings, 3 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, Glauber Costa

Each cgroup gets its own file, cpu.proc.stat, which displays
the exact same format as /proc/stat. Users that want
access to a per-cgroup version of this information
can query that file for it.
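
Idle and iowait time are only accounted at the root, so each new cgroup
snapshots the root's values at creation time (the IDLE_BASE and
IOWAIT_BASE fields below) and reports the difference. A condensed sketch
of that calculation, using a hypothetical helper name (the patch
open-codes it in cpu_cgroup_proc_stat, including arch_idle_time):

	/*
	 * Sketch: a cgroup's idle time is the root's idle time on that
	 * cpu minus the snapshot taken when the cgroup was created.
	 */
	static cputime64_t tg_idle_time(struct task_group *tg, int cpu)
	{
		struct kernel_stat *kstat = per_cpu_ptr(tg->cpustat, cpu);
		struct kernel_stat *root_kstat =
				per_cpu_ptr(root_task_group.cpustat, cpu);

		return cputime64_sub(root_kstat->cpustat[IDLE],
				     kstat->cpustat[IDLE_BASE]);
	}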

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 include/linux/kernel_stat.h |    2 +
 kernel/sched.c              |   91 +++++++++++++++++++++++++++++++++---------
 2 files changed, 73 insertions(+), 20 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 93f64f3..e82c937 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -27,6 +27,8 @@ enum cpu_usage_stat {
 	STEAL,
 	GUEST,
 	GUEST_NICE,
+	IDLE_BASE,
+	IOWAIT_BASE,
 	NR_STATS,
 };
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 4d831fe..a272257 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3756,6 +3756,20 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
+static inline void task_cgroup_account_field(struct task_struct *p,
+					     cputime64_t tmp, int index)
+{
+	struct kernel_stat *kstat;
+	struct task_group *tg = task_group(p);
+
+	do {
+		kstat = this_cpu_ptr(tg->cpustat);
+		kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
+						      tmp);
+		tg = tg->parent;
+	} while (tg);
+}
+
 /*
  * Account user cpu time to a process.
  * @p: the process that the cpu time gets accounted to
@@ -3763,9 +3777,8 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
  * @cputime_scaled: cputime scaled by cpu frequency
  */
 void account_user_time(struct task_struct *p, cputime_t cputime,
-		       cputime_t cputime_scaled)
+		cputime_t cputime_scaled)
 {
-	cputime64_t *cpustat = kstat_this_cpu->cpustat;
 	cputime64_t tmp;
 
 	/* Add user time to process. */
@@ -3775,10 +3788,11 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
+
 	if (TASK_NICE(p) > 0)
-		cpustat[NICE] = cputime64_add(cpustat[NICE], tmp);
+		task_cgroup_account_field(p, tmp, NICE);
 	else
-		cpustat[USER] = cputime64_add(cpustat[USER], tmp);
+		task_cgroup_account_field(p, tmp, USER);
 
 	cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
 	/* Account for user time used */
@@ -3824,7 +3838,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
  */
 static inline
 void __account_system_time(struct task_struct *p, cputime_t cputime,
-			cputime_t cputime_scaled, cputime64_t *target_cputime64)
+			cputime_t cputime_scaled, int index)
 {
 	cputime64_t tmp = cputime_to_cputime64(cputime);
 
@@ -3834,7 +3848,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 	account_group_system_time(p, cputime);
 
 	/* Add system time to cpustat. */
-	*target_cputime64 = cputime64_add(*target_cputime64, tmp);
+	task_cgroup_account_field(p, tmp, index);
 	cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
 
 	/* Account for system time used */
@@ -3851,8 +3865,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 void account_system_time(struct task_struct *p, int hardirq_offset,
 			 cputime_t cputime, cputime_t cputime_scaled)
 {
-	cputime64_t *cpustat = kstat_this_cpu->cpustat;
-	cputime64_t *target_cputime64;
+	int index;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
@@ -3860,13 +3873,13 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	}
 
 	if (hardirq_count() - hardirq_offset)
-		target_cputime64 = &cpustat[IRQ];
+		index = IRQ;
 	else if (in_serving_softirq())
-		target_cputime64 = &cpustat[SOFTIRQ];
+		index = SOFTIRQ;
 	else
-		target_cputime64 = &cpustat[SYSTEM];
+		index = SYSTEM;
 
-	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
+	__account_system_time(p, cputime, cputime_scaled, index);
 }
 
 /*
@@ -3941,27 +3954,29 @@ static __always_inline bool steal_account_process_tick(void)
  * softirq as those do not count in task exec_runtime any more.
  */
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
-						struct rq *rq)
+					 struct rq *rq)
 {
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
 	cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
-	cputime64_t *cpustat = kstat_this_cpu->cpustat;
+	struct task_group *tg;
 
 	if (steal_account_process_tick())
 		return;
 
+	tg = task_group(p);
+
 	if (irqtime_account_hi_update()) {
-		cpustat[IRQ] = cputime64_add(cpustat[IRQ], tmp);
+		task_cgroup_account_field(p, tmp, IRQ);
 	} else if (irqtime_account_si_update()) {
-		cpustat[SOFTIRQ] = cputime64_add(cpustat[SOFTIRQ], tmp);
+		task_cgroup_account_field(p, tmp, SOFTIRQ);
 	} else if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
 		 * Also, p->stime needs to be updated for ksoftirqd.
 		 */
-		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat[SOFTIRQ]);
+		__account_system_time(p, cputime_one_jiffy,
+				      one_jiffy_scaled, SOFTIRQ);
 	} else if (user_tick) {
 		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else if (p == rq->idle) {
@@ -3969,8 +3984,8 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	} else if (p->flags & PF_VCPU) { /* System time or guest time */
 		account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else {
-		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat[SYSTEM]);
+		__account_system_time(p, cputime_one_jiffy,
+				      one_jiffy_scaled, SYSTEM);
 	}
 }
 
@@ -8038,6 +8053,7 @@ void __init sched_init(void)
 {
 	int i, j;
 	unsigned long alloc_size = 0, ptr;
+	struct kernel_stat *kstat;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
@@ -8092,6 +8108,18 @@ void __init sched_init(void)
 	INIT_LIST_HEAD(&root_task_group.children);
 	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
+
+	root_task_group.cpustat = alloc_percpu(struct kernel_stat);
+	/* Failing that early an allocation means we're screwed anyway */
+	BUG_ON(!root_task_group.cpustat);
+	for_each_possible_cpu(i) {
+		kstat = per_cpu_ptr(root_task_group.cpustat, i);
+		kstat->cpustat[IDLE] = 0;
+		kstat->cpustat[IDLE_BASE] = 0;
+		kstat->cpustat[IOWAIT_BASE] = 0;
+		kstat->cpustat[IOWAIT] = 0;
+	}
+
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
@@ -8526,6 +8554,7 @@ static void free_sched_group(struct task_group *tg)
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
+	free_percpu(tg->cpustat);
 	kfree(tg);
 }
 
@@ -8534,6 +8563,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 {
 	struct task_group *tg;
 	unsigned long flags;
+	int i;
 
 	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
 	if (!tg)
@@ -8545,6 +8575,19 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	tg->cpustat = alloc_percpu(struct kernel_stat);
+	if (!tg->cpustat)
+		goto err;
+
+	for_each_possible_cpu(i) {
+		struct kernel_stat *kstat, *root_kstat;
+
+		kstat = per_cpu_ptr(tg->cpustat, i);
+		root_kstat = per_cpu_ptr(root_task_group.cpustat, i);
+		kstat->cpustat[IDLE_BASE]  = root_kstat->cpustat[IDLE];
+		kstat->cpustat[IOWAIT_BASE] = root_kstat->cpustat[IOWAIT];
+	}
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_add_rcu(&tg->list, &task_groups);
 
@@ -9092,7 +9135,9 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
 		system = cputime64_add(system, kstat->cpustat[SYSTEM]);
 		idle = cputime64_add(idle, root_kstat->cpustat[IDLE]);
 		idle = cputime64_add(idle, arch_idle_time(i));
+		idle = cputime64_sub(idle, kstat->cpustat[IDLE_BASE]);
 		iowait = cputime64_add(iowait, root_kstat->cpustat[IOWAIT]);
+		iowait = cputime64_sub(iowait, kstat->cpustat[IOWAIT_BASE]);
 		irq = cputime64_add(irq, kstat->cpustat[IRQ]);
 		softirq = cputime64_add(softirq, kstat->cpustat[SOFTIRQ]);
 		steal = cputime64_add(steal, kstat->cpustat[STEAL]);
@@ -9134,7 +9179,9 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
 		system = kstat->cpustat[SYSTEM];
 		idle = root_kstat->cpustat[IDLE];
 		idle = cputime64_add(idle, arch_idle_time(i));
+		idle = cputime64_sub(idle, kstat->cpustat[IDLE_BASE]);
 		iowait = root_kstat->cpustat[IOWAIT];
+		iowait = cputime64_sub(iowait, kstat->cpustat[IOWAIT_BASE]);
 		irq = kstat->cpustat[IRQ];
 		softirq = kstat->cpustat[SOFTIRQ];
 		steal = kstat->cpustat[STEAL];
@@ -9205,6 +9252,10 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+	{
+		.name = "proc.stat",
+		.read_seq_string = cpu_cgroup_proc_stat,
+	},
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
-- 
1.7.6



* [RFD 4/9] Make total_forks per-cgroup
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
                   ` (2 preceding siblings ...)
  2011-09-23 22:20 ` [RFD 3/9] Display /proc/stat information per cgroup Glauber Costa
@ 2011-09-23 22:20 ` Glauber Costa
  2011-09-27 22:00   ` Peter Zijlstra
  2011-09-23 22:20 ` [RFD 5/9] per-cgroup boot time Glauber Costa
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, Glauber Costa

This patch counts the total number of forks per cgroup.
The information is propagated up to the parents, so the total
number of forks in the system is the count in the root cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 include/linux/sched.h |    1 +
 kernel/fork.c         |    9 ++++-----
 kernel/sched.c        |   35 +++++++++++++++++++++++++++++++++--
 3 files changed, 38 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64c5ba5..4ba9dde 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2716,6 +2716,7 @@ struct cgroup;
 struct cftype;
 int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 			 struct seq_file *p);
+void task_group_new_fork(struct task_struct *p);
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..9e0d8a6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,10 +76,6 @@
 
 #include <trace/events/sched.h>
 
-/*
- * Protected counters by write_lock_irq(&tasklist_lock)
- */
-unsigned long total_forks;	/* Handle normal Linux uptimes. */
 int nr_threads;			/* The idle threads do not count.. */
 
 int max_threads;		/* tunable limit on nr_threads */
@@ -1039,6 +1035,8 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
 	INIT_LIST_HEAD(&tsk->cpu_timers[2]);
 }
 
+struct task_group *task_group(struct task_struct *p);
+
 /*
  * This creates a new process as a copy of the old one,
  * but does not actually start it yet.
@@ -1372,7 +1370,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		nr_threads++;
 	}
 
-	total_forks++;
+	task_group_new_fork(p);
+
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
diff --git a/kernel/sched.c b/kernel/sched.c
index a272257..9c6e44e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -278,6 +278,7 @@ struct task_group {
 #ifdef CONFIG_SCHED_AUTOGROUP
 	struct autogroup *autogroup;
 #endif
+	unsigned long total_forks;
 	struct kernel_stat __percpu *cpustat;
 };
 
@@ -641,6 +642,17 @@ struct kernel_stat *task_group_kstat(struct task_struct *p)
 
 	return tg->cpustat;
 }
+
+void task_group_new_fork(struct task_struct *p)
+{
+	struct task_group *tg = task_group(p);
+
+	do {
+		tg->total_forks++;
+		tg = tg->parent;
+	} while (tg);
+}
+
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 {
@@ -655,7 +667,17 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
+static unsigned long task_group_total_forks(struct task_group *tg)
+{
+	return tg->total_forks;
+}
+
+
 #else /* CONFIG_CGROUP_SCHED */
+/*
+ * Protected counters by write_lock_irq(&tasklist_lock)
+ */
+unsigned long total_forks;	/* Handle normal Linux uptimes. */
 
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
 static inline struct task_group *task_group(struct task_struct *p)
@@ -663,6 +685,16 @@ static inline struct task_group *task_group(struct task_struct *p)
 	return NULL;
 }
 
+void task_group_new_fork(struct task_struct *p)
+{
+	total_forks++;
+}
+
+unsigned long task_group_total_forks(struct task_group *tg)
+{
+	return total_forks;
+}
+
 DEFINE_PER_CPU(struct kernel_stat, kstat);
 EXPORT_PER_CPU_SYMBOL(kstat);
 #endif /* CONFIG_CGROUP_SCHED */
@@ -8119,7 +8151,6 @@ void __init sched_init(void)
 		kstat->cpustat[IOWAIT_BASE] = 0;
 		kstat->cpustat[IOWAIT] = 0;
 	}
-
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
@@ -9216,7 +9247,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
 		"procs_blocked %lu\n",
 		nr_context_switches(),
 		(unsigned long)jif,
-		total_forks,
+		task_group_total_forks(tg),
 		nr_running(),
 		nr_iowait());
 
-- 
1.7.6



* [RFD 5/9] per-cgroup boot time
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
                   ` (3 preceding siblings ...)
  2011-09-23 22:20 ` [RFD 4/9] Make total_forks per-cgroup Glauber Costa
@ 2011-09-23 22:20 ` Glauber Costa
  2011-09-23 22:20 ` [RFD 6/9] Report steal time for cgroup Glauber Costa
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, Glauber Costa

Record the time at which the cgroup was created. This can be
used to provide more accurate boottime information in
cpu.proc.stat.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 9c6e44e..7612410 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -280,6 +280,7 @@ struct task_group {
 #endif
 	unsigned long total_forks;
 	struct kernel_stat __percpu *cpustat;
+	struct timespec start_time;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -8141,6 +8142,7 @@ void __init sched_init(void)
 	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
 
+	root_task_group.start_time = (struct timespec){0, 0};
 	root_task_group.cpustat = alloc_percpu(struct kernel_stat);
 	/* Failing that early an allocation means we're screwed anyway */
 	BUG_ON(!root_task_group.cpustat);
@@ -8619,6 +8621,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 		kstat->cpustat[IOWAIT_BASE] = root_kstat->cpustat[IOWAIT];
 	}
 
+	get_monotonic_boottime(&tg->start_time);
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_add_rcu(&tg->list, &task_groups);
 
@@ -9154,6 +9158,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
 		irq = softirq = steal = cputime64_zero;
 	guest = guest_nice = cputime64_zero;
 	getboottime(&boottime);
+	boottime = timespec_add(boottime, tg->start_time);
 	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
-- 
1.7.6



* [RFD 6/9] Report steal time for cgroup
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
                   ` (4 preceding siblings ...)
  2011-09-23 22:20 ` [RFD 5/9] per-cgroup boot time Glauber Costa
@ 2011-09-23 22:20 ` Glauber Costa
  2011-09-23 22:20 ` [RFD 7/9] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, Glauber Costa

This patch introduces functionality commonly found in
hypervisors: steal time.

For those not particularly familiar with it, steal time
is defined as any time in which a virtual machine (or container)
wanted to perform cpu work, but could not due to another
VM/container being scheduled in its place. Note that idle
time is never defined as steal time.

Assuming each container lives in its own cgroup, we can
very easily calculate steal time as all the user/system
time recorded in our sibling cgroups.
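
A condensed sketch of that calculation, using a hypothetical helper (the
patch itself open-codes the sibling walk in both loops of
cpu_cgroup_proc_stat):

	/*
	 * Sketch: everything our siblings ran (user, system, irq, softirq)
	 * is time this cgroup could not run, i.e. steal time from its
	 * point of view, on top of whatever steal time the host reported.
	 */
	static cputime64_t tg_steal_time(struct task_group *tg, int cpu)
	{
		struct task_group *sib;
		cputime64_t steal = per_cpu_ptr(tg->cpustat, cpu)->cpustat[STEAL];

		rcu_read_lock();
		list_for_each_entry(sib, &tg->siblings, siblings) {
			struct kernel_stat *ksib = per_cpu_ptr(sib->cpustat, cpu);

			steal = cputime64_add(steal, ksib->cpustat[USER]);
			steal = cputime64_add(steal, ksib->cpustat[SYSTEM]);
			steal = cputime64_add(steal, ksib->cpustat[IRQ]);
			steal = cputime64_add(steal, ksib->cpustat[SOFTIRQ]);
		}
		rcu_read_unlock();

		return steal;
	}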

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7612410..8a510ab 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9148,6 +9148,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
 	struct task_group *tg;
+	struct task_group *sib;
 
 	if (cgrp)
 		tg = cgroup_tg(cgrp);
@@ -9177,6 +9178,19 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
 		irq = cputime64_add(irq, kstat->cpustat[IRQ]);
 		softirq = cputime64_add(softirq, kstat->cpustat[SOFTIRQ]);
 		steal = cputime64_add(steal, kstat->cpustat[STEAL]);
+		rcu_read_lock();
+		list_for_each_entry(sib, &tg->siblings, siblings) {
+			struct kernel_stat *ksib;
+			if (!sib)
+				continue;
+
+			ksib = per_cpu_ptr(sib->cpustat, i);
+			steal = cputime64_add(steal, ksib->cpustat[USER]);
+			steal = cputime64_add(steal, ksib->cpustat[SYSTEM]);
+			steal = cputime64_add(steal, ksib->cpustat[IRQ]);
+			steal = cputime64_add(steal, ksib->cpustat[SOFTIRQ]);
+		}
+		rcu_read_unlock();
 		guest = cputime64_add(guest, kstat->cpustat[GUEST]);
 		guest_nice = cputime64_add(guest_nice,
 			kstat->cpustat[GUEST_NICE]);
@@ -9221,6 +9235,20 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
 		irq = kstat->cpustat[IRQ];
 		softirq = kstat->cpustat[SOFTIRQ];
 		steal = kstat->cpustat[STEAL];
+		rcu_read_lock();
+		list_for_each_entry(sib, &tg->siblings, siblings) {
+			struct kernel_stat *ksib;
+			if (!sib)
+				continue;
+
+			ksib = per_cpu_ptr(sib->cpustat, i);
+			steal = cputime64_add(steal, ksib->cpustat[USER]);
+			steal = cputime64_add(steal, ksib->cpustat[SYSTEM]);
+			steal = cputime64_add(steal, ksib->cpustat[IRQ]);
+			steal = cputime64_add(steal, ksib->cpustat[SOFTIRQ]);
+		}
+		rcu_read_unlock();
+
 		guest = kstat->cpustat[GUEST];
 		guest_nice = kstat->cpustat[GUEST_NICE];
 		seq_printf(p,
-- 
1.7.6



* [RFD 7/9] provide a version of cpuacct statistics inside cpu cgroup
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
                   ` (5 preceding siblings ...)
  2011-09-23 22:20 ` [RFD 6/9] Report steal time for cgroup Glauber Costa
@ 2011-09-23 22:20 ` Glauber Costa
  2011-09-23 22:20 ` [RFD 8/9] provide a version of cpuusage " Glauber Costa
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa, Balbir Singh

For users interested in using the information currently displayed
at cpuacct.stat, we provide it inside the cpu cgroup.

This has the advantage of accounting this information only once,
instead of twice, when there is no need for an independent group
of tasks for controlling and accounting, leading to a lot less overhead.
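
A condensed sketch of how the two cpuacct.stat fields map onto the new
per-cgroup cpustat array (hypothetical helper; the patch does the same
thing inline in cpu_cgroup_stats_show):

	/*
	 * Sketch: "user" covers user + nice, "system" covers system +
	 * irq + softirq, summed over the online cpus, matching what
	 * cpuacct.stat reports today.
	 */
	static void tg_cpuacct_stat(struct task_group *tg, s64 *user, s64 *sys)
	{
		int cpu;

		*user = *sys = 0;
		for_each_online_cpu(cpu) {
			struct kernel_stat *kstat = per_cpu_ptr(tg->cpustat, cpu);

			*user += kstat->cpustat[USER] + kstat->cpustat[NICE];
			*sys  += kstat->cpustat[SYSTEM] + kstat->cpustat[IRQ] +
				 kstat->cpustat[SOFTIRQ];
		}
	}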

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
 kernel/sched.c |   42 +++++++++++++++++++++++++++++++++++++-----
 1 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8a510ab..b687441 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9296,6 +9296,39 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
 
 
 
+static const char *cpuacct_stat_desc[] = {
+	[CPUACCT_STAT_USER] = "user",
+	[CPUACCT_STAT_SYSTEM] = "system",
+};
+
+static int cpu_cgroup_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	int cpu;
+	s64 val = 0;
+
+	for_each_online_cpu(cpu) {
+		struct kernel_stat *kstat = per_cpu_ptr(tg->cpustat, cpu);
+		val += kstat->cpustat[USER];
+		val += kstat->cpustat[NICE];
+	}
+	val = cputime64_to_clock_t(val);
+	cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_USER], val);
+
+	val = 0;
+	for_each_online_cpu(cpu) {
+		struct kernel_stat *kstat = per_cpu_ptr(tg->cpustat, cpu);
+		val += kstat->cpustat[SYSTEM];
+		val += kstat->cpustat[IRQ];
+		val += kstat->cpustat[SOFTIRQ];
+	}
+
+	val = cputime64_to_clock_t(val);
+	cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_SYSTEM], val);
+
+	return 0;
+}
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -9320,6 +9353,10 @@ static struct cftype cpu_files[] = {
 		.name = "proc.stat",
 		.read_seq_string = cpu_cgroup_proc_stat,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_cgroup_stats_show,
+	},
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -9503,11 +9540,6 @@ static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
 	return 0;
 }
 
-static const char *cpuacct_stat_desc[] = {
-	[CPUACCT_STAT_USER] = "user",
-	[CPUACCT_STAT_SYSTEM] = "system",
-};
-
 static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
 		struct cgroup_map_cb *cb)
 {
-- 
1.7.6



* [RFD 8/9] provide a version of cpuusage statistics inside cpu cgroup
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
                   ` (6 preceding siblings ...)
  2011-09-23 22:20 ` [RFD 7/9] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
@ 2011-09-23 22:20 ` Glauber Costa
  2011-09-23 22:20 ` [RFD 9/9] Change CPUACCT to default n Glauber Costa
  2011-09-27 22:11 ` [RFD 0/9] per-cgroup /proc/stat statistics Peter Zijlstra
  9 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa, Balbir Singh

For users interested in using the information currently displayed
at cpuacct.usage and cpuacct.usage_percpu, we provide them inside
the cpu cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
 kernel/sched.c      |  220 +++++++++++++++++++++++++++++++++++----------------
 kernel/sched_fair.c |    2 +-
 kernel/sched_rt.c   |    2 +-
 3 files changed, 155 insertions(+), 69 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index b687441..fc873c9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -280,6 +280,7 @@ struct task_group {
 #endif
 	unsigned long total_forks;
 	struct kernel_stat __percpu *cpustat;
+	u64 __percpu *cpuusage;
 	struct timespec start_time;
 };
 
@@ -2083,6 +2084,8 @@ static int irqtime_account_si_update(void)
 
 #endif
 
+static void cpuusage_charge(struct task_struct *tsk, u64 cputime);
+
 #include "sched_idletask.c"
 #include "sched_fair.c"
 #include "sched_rt.c"
@@ -8144,8 +8147,10 @@ void __init sched_init(void)
 
 	root_task_group.start_time = (struct timespec){0, 0};
 	root_task_group.cpustat = alloc_percpu(struct kernel_stat);
+	root_task_group.cpuusage = alloc_percpu(u64);
 	/* Failing that early an allocation means we're screwed anyway */
 	BUG_ON(!root_task_group.cpustat);
+	BUG_ON(!root_task_group.cpuusage);
 	for_each_possible_cpu(i) {
 		kstat = per_cpu_ptr(root_task_group.cpustat, i);
 		kstat->cpustat[IDLE] = 0;
@@ -8587,7 +8592,10 @@ static void free_sched_group(struct task_group *tg)
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
-	free_percpu(tg->cpustat);
+	if (tg->cpustat)
+		free_percpu(tg->cpustat);
+	if (tg->cpuusage)
+		free_percpu(tg->cpuusage);
 	kfree(tg);
 }
 
@@ -8608,6 +8616,10 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	tg->cpuusage = alloc_percpu(u64);
+	if (!tg->cpuusage)
+		goto err;
+
 	tg->cpustat = alloc_percpu(struct kernel_stat);
 	if (!tg->cpustat)
 		goto err;
@@ -9296,6 +9308,87 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
 
 
 
+static u64 cpuacct_cpuusage_read(u64 *cpuusage, int cpu)
+{
+	u64 data;
+
+#ifndef CONFIG_64BIT
+	/*
+	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
+	 */
+	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	data = *cpuusage;
+	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+	data = *cpuusage;
+#endif
+
+	return data;
+}
+
+static void cpuacct_cpuusage_write(u64 *cpuusage, int cpu, u64 val)
+{
+#ifndef CONFIG_64BIT
+	/*
+	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+	 */
+	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	*cpuusage = val;
+	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+	*cpuusage = val;
+#endif
+}
+
+static u64 cpu_cgroup_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	u64 totalcpuusage = 0;
+	int i;
+
+	for_each_present_cpu(i) {
+		u64 *cpuusage = per_cpu_ptr(tg->cpuusage, i);
+		totalcpuusage += cpuacct_cpuusage_read(cpuusage, i);
+	}
+
+	return totalcpuusage;
+}
+
+static int cpu_cgroup_cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
+								u64 reset)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	int err = 0;
+	int i;
+
+	if (reset) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	for_each_present_cpu(i) {
+		u64 *cpuusage = per_cpu_ptr(tg->cpuusage, i);
+		cpuacct_cpuusage_write(cpuusage, i, 0);
+	}
+
+out:
+	return err;
+}
+
+static int cpu_cgroup_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
+				      struct seq_file *m)
+{
+	u64 percpu;
+	int i;
+
+	for_each_present_cpu(i) {
+		percpu = cpu_cgroup_cpuusage_read(cgroup, cft);
+		seq_printf(m, "%llu ", (unsigned long long) percpu);
+	}
+	seq_printf(m, "\n");
+	return 0;
+}
+
 static const char *cpuacct_stat_desc[] = {
 	[CPUACCT_STAT_USER] = "user",
 	[CPUACCT_STAT_SYSTEM] = "system",
@@ -9357,6 +9450,15 @@ static struct cftype cpu_files[] = {
 		.name = "stat",
 		.read_map = cpu_cgroup_stats_show,
 	},
+	{
+		.name = "usage",
+		.read_u64 = cpu_cgroup_cpuusage_read,
+		.write_u64 = cpu_cgroup_cpuusage_write,
+	},
+	{
+		.name = "usage_percpu",
+		.read_seq_string = cpu_cgroup_percpu_seq_read,
+	},
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -9458,41 +9560,6 @@ cpuacct_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
 	kfree(ca);
 }
 
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
-{
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-	u64 data;
-
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
-	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-	data = *cpuusage;
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	data = *cpuusage;
-#endif
-
-	return data;
-}
-
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
-{
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
-	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-	*cpuusage = val;
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	*cpuusage = val;
-#endif
-}
-
 /* return total cpu usage (in nanoseconds) of a group */
 static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
 {
@@ -9500,8 +9567,10 @@ static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
 	u64 totalcpuusage = 0;
 	int i;
 
-	for_each_present_cpu(i)
-		totalcpuusage += cpuacct_cpuusage_read(ca, i);
+	for_each_present_cpu(i) {
+		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, i);
+		totalcpuusage += cpuacct_cpuusage_read(cpuusage, i);
+	}
 
 	return totalcpuusage;
 }
@@ -9518,8 +9587,10 @@ static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
 		goto out;
 	}
 
-	for_each_present_cpu(i)
-		cpuacct_cpuusage_write(ca, i, 0);
+	for_each_present_cpu(i) {
+		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, i);
+		cpuacct_cpuusage_write(cpuusage, i, 0);
+	}
 
 out:
 	return err;
@@ -9576,33 +9647,6 @@ static int cpuacct_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
 }
 
 /*
- * charge this task's execution time to its accounting group.
- *
- * called with rq->lock held.
- */
-static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
-{
-	struct cpuacct *ca;
-	int cpu;
-
-	if (unlikely(!cpuacct_subsys.active))
-		return;
-
-	cpu = task_cpu(tsk);
-
-	rcu_read_lock();
-
-	ca = task_ca(tsk);
-
-	for (; ca; ca = ca->parent) {
-		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-		*cpuusage += cputime;
-	}
-
-	rcu_read_unlock();
-}
-
-/*
  * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
  * in cputime_t units. As a result, cpuacct_update_stats calls
  * percpu_counter_add with values large enough to always overflow the
@@ -9650,3 +9694,45 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+/*
+ * charge this task's execution time to its accounting group.
+ *
+ * called with rq->lock held.
+ */
+static void cpuusage_charge(struct task_struct *tsk, u64 cputime)
+{
+	int cpu;
+
+#ifdef CONFIG_CGROUP_CPUACCT
+	struct cpuacct *ca;
+#endif
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *tg;
+#endif
+	cpu = task_cpu(tsk);
+
+	rcu_read_lock();
+
+#ifdef CONFIG_CGROUP_CPUACCT
+	ca = task_ca(tsk);
+
+	if (unlikely(!cpuacct_subsys.active))
+		goto no_cpuacct;
+
+	for (; ca; ca = ca->parent) {
+		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+		*cpuusage += cputime;
+	}
+no_cpuacct:
+#endif
+
+#ifdef CONFIG_CGROUP_SCHED
+	tg = task_group(tsk);
+	for (; tg; tg = tg->parent) {
+		u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+		*cpuusage += cputime;
+	}
+#endif
+	rcu_read_unlock();
+}
+
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index bc8ee99..38b4549 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -580,7 +580,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
-		cpuacct_charge(curtask, delta_exec);
+		cpuusage_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
 }
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 97540f0..a21b58e 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -676,7 +676,7 @@ static void update_curr_rt(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock_task;
-	cpuacct_charge(curr, delta_exec);
+	cpuusage_charge(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
 
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFD 9/9] Change CPUACCT to default n
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
                   ` (7 preceding siblings ...)
  2011-09-23 22:20 ` [RFD 8/9] provide a version of cpuusage " Glauber Costa
@ 2011-09-23 22:20 ` Glauber Costa
  2011-09-27 22:11 ` [RFD 0/9] per-cgroup /proc/stat statistics Peter Zijlstra
  9 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-23 22:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley,
	Glauber Costa, Balbir Singh

Now that we're providing all cpuacct capability in the cpu cgroup,
default CPUACCT to n. We still maintain it for compatibility, for
anyone who needs an independent set of tasks to be managed by it
relative to the cpu cgroup, but we encourage using the cpu cgroup for that.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
 init/Kconfig |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index d627783..891d6c1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -626,6 +626,7 @@ config PROC_PID_CPUSET
 
 config CGROUP_CPUACCT
 	bool "Simple CPU accounting cgroup subsystem"
+	default n
 	help
 	  Provides a simple Resource Controller for monitoring the
 	  total CPU consumed by the tasks in a cgroup.
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-23 22:20 ` [RFD 3/9] Display /proc/stat information per cgroup Glauber Costa
@ 2011-09-27 17:01   ` Balbir Singh
  2011-09-27 18:42     ` Glauber Costa
  2011-09-27 21:48   ` Peter Zijlstra
  2011-09-27 21:52   ` Peter Zijlstra
  2 siblings, 1 reply; 44+ messages in thread
From: Balbir Singh @ 2011-09-27 17:01 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley

On Sat, Sep 24, 2011 at 3:50 AM, Glauber Costa <glommer@parallels.com> wrote:
> Each cgroup has its own file, cpu.proc.stat that will
> display the exact same format as /proc/stat. Users
> that want to have access to a per-cgroup version of
> this information, can query it for that purpose.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
...

> +static inline void task_cgroup_account_field(struct task_struct *p,
> +                                            cputime64_t tmp, int index)
> +{
> +       struct kernel_stat *kstat;
> +       struct task_group *tg = task_group(p);
> +
> +       do {
> +               kstat = this_cpu_ptr(tg->cpustat);
> +               kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
> +                                                     tmp);
> +               tg = tg->parent;
> +       } while (tg);
> +}

What protects the walk (tg = tg->parent)? Could you please document it?

> +
>  /*
>  * Account user cpu time to a process.
>  * @p: the process that the cpu time gets accounted to
> @@ -3763,9 +3777,8 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
>  * @cputime_scaled: cputime scaled by cpu frequency
>  */
>  void account_user_time(struct task_struct *p, cputime_t cputime,
> -                      cputime_t cputime_scaled)
> +               cputime_t cputime_scaled)
>  {
> -       cputime64_t *cpustat = kstat_this_cpu->cpustat;
>        cputime64_t tmp;
>
>        /* Add user time to process. */
> @@ -3775,10 +3788,11 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
>
>        /* Add user time to cpustat. */
>        tmp = cputime_to_cputime64(cputime);
> +
>        if (TASK_NICE(p) > 0)
> -               cpustat[NICE] = cputime64_add(cpustat[NICE], tmp);
> +               task_cgroup_account_field(p, tmp, NICE);
>        else
> -               cpustat[USER] = cputime64_add(cpustat[USER], tmp);
> +               task_cgroup_account_field(p, tmp, USER);
>
>        cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
>        /* Account for user time used */
> @@ -3824,7 +3838,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
>  */
>  static inline
>  void __account_system_time(struct task_struct *p, cputime_t cputime,
> -                       cputime_t cputime_scaled, cputime64_t *target_cputime64)
> +                       cputime_t cputime_scaled, int index)
>  {
>        cputime64_t tmp = cputime_to_cputime64(cputime);
>
> @@ -3834,7 +3848,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
>        account_group_system_time(p, cputime);
>
>        /* Add system time to cpustat. */
> -       *target_cputime64 = cputime64_add(*target_cputime64, tmp);
> +       task_cgroup_account_field(p, tmp, index);
>        cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
>
>        /* Account for system time used */
> @@ -3851,8 +3865,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
>  void account_system_time(struct task_struct *p, int hardirq_offset,
>                         cputime_t cputime, cputime_t cputime_scaled)
>  {
> -       cputime64_t *cpustat = kstat_this_cpu->cpustat;
> -       cputime64_t *target_cputime64;
> +       int index;
>
>        if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
>                account_guest_time(p, cputime, cputime_scaled);
> @@ -3860,13 +3873,13 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
>        }
>
>        if (hardirq_count() - hardirq_offset)
> -               target_cputime64 = &cpustat[IRQ];
> +               index = IRQ;
>        else if (in_serving_softirq())
> -               target_cputime64 = &cpustat[SOFTIRQ];
> +               index = SOFTIRQ;
>        else
> -               target_cputime64 = &cpustat[SYSTEM];
> +               index = SYSTEM;
>
> -       __account_system_time(p, cputime, cputime_scaled, target_cputime64);
> +       __account_system_time(p, cputime, cputime_scaled, index);
>  }
>
>  /*
> @@ -3941,27 +3954,29 @@ static __always_inline bool steal_account_process_tick(void)
>  * softirq as those do not count in task exec_runtime any more.
>  */
>  static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> -                                               struct rq *rq)
> +                                        struct rq *rq)
>  {
>        cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
>        cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
> -       cputime64_t *cpustat = kstat_this_cpu->cpustat;
> +       struct task_group *tg;
>
>        if (steal_account_process_tick())
>                return;
>
> +       tg = task_group(p);
> +
>        if (irqtime_account_hi_update()) {
> -               cpustat[IRQ] = cputime64_add(cpustat[IRQ], tmp);
> +               task_cgroup_account_field(p, tmp, IRQ);
>        } else if (irqtime_account_si_update()) {
> -               cpustat[SOFTIRQ] = cputime64_add(cpustat[SOFTIRQ], tmp);
> +               task_cgroup_account_field(p, tmp, SOFTIRQ);
>        } else if (this_cpu_ksoftirqd() == p) {
>                /*
>                 * ksoftirqd time do not get accounted in cpu_softirq_time.
>                 * So, we have to handle it separately here.
>                 * Also, p->stime needs to be updated for ksoftirqd.
>                 */
> -               __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
> -                                       &cpustat[SOFTIRQ]);
> +               __account_system_time(p, cputime_one_jiffy,
> +                                     one_jiffy_scaled, SOFTIRQ);
>        } else if (user_tick) {
>                account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
>        } else if (p == rq->idle) {
> @@ -3969,8 +3984,8 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
>        } else if (p->flags & PF_VCPU) { /* System time or guest time */
>                account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
>        } else {
> -               __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
> -                                       &cpustat[SYSTEM]);
> +               __account_system_time(p, cputime_one_jiffy,
> +                                     one_jiffy_scaled, SYSTEM);
>        }
>  }
>
> @@ -8038,6 +8053,7 @@ void __init sched_init(void)
>  {
>        int i, j;
>        unsigned long alloc_size = 0, ptr;
> +       struct kernel_stat *kstat;
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>        alloc_size += 2 * nr_cpu_ids * sizeof(void **);
> @@ -8092,6 +8108,18 @@ void __init sched_init(void)
>        INIT_LIST_HEAD(&root_task_group.children);
>        INIT_LIST_HEAD(&root_task_group.siblings);
>        autogroup_init(&init_task);
> +
> +       root_task_group.cpustat = alloc_percpu(struct kernel_stat);
> +       /* Failing that early an allocation means we're screwed anyway */
> +       BUG_ON(!root_task_group.cpustat);
> +       for_each_possible_cpu(i) {

for_each_possible_cpu might be overkill, no?

> +               kstat = per_cpu_ptr(root_task_group.cpustat, i);
> +               kstat->cpustat[IDLE] = 0;
> +               kstat->cpustat[IDLE_BASE] = 0;
> +               kstat->cpustat[IOWAIT_BASE] = 0;
> +               kstat->cpustat[IOWAIT] = 0;
> +       }
> +
>  #endif /* CONFIG_CGROUP_SCHED */
>
>        for_each_possible_cpu(i) {
> @@ -8526,6 +8554,7 @@ static void free_sched_group(struct task_group *tg)
>        free_fair_sched_group(tg);
>        free_rt_sched_group(tg);
>        autogroup_free(tg);
> +       free_percpu(tg->cpustat);
>        kfree(tg);
>  }
>
> @@ -8534,6 +8563,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>  {
>        struct task_group *tg;
>        unsigned long flags;
> +       int i;
>
>        tg = kzalloc(sizeof(*tg), GFP_KERNEL);
>        if (!tg)
> @@ -8545,6 +8575,19 @@ struct task_group *sched_create_group(struct task_group *parent)
>        if (!alloc_rt_sched_group(tg, parent))
>                goto err;
>
> +       tg->cpustat = alloc_percpu(struct kernel_stat);
> +       if (!tg->cpustat)
> +               goto err;
> +
> +       for_each_possible_cpu(i) {
> +               struct kernel_stat *kstat, *root_kstat;
> +
> +               kstat = per_cpu_ptr(tg->cpustat, i);
> +               root_kstat = per_cpu_ptr(root_task_group.cpustat, i);
> +               kstat->cpustat[IDLE_BASE]  = root_kstat->cpustat[IDLE];
> +               kstat->cpustat[IOWAIT_BASE] = root_kstat->cpustat[IOWAIT];
> +       }
> +
>        spin_lock_irqsave(&task_group_lock, flags);
>        list_add_rcu(&tg->list, &task_groups);
>
> @@ -9092,7 +9135,9 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
>                system = cputime64_add(system, kstat->cpustat[SYSTEM]);
>                idle = cputime64_add(idle, root_kstat->cpustat[IDLE]);
>                idle = cputime64_add(idle, arch_idle_time(i));
> +               idle = cputime64_sub(idle, kstat->cpustat[IDLE_BASE]);
>                iowait = cputime64_add(iowait, root_kstat->cpustat[IOWAIT]);
> +               iowait = cputime64_sub(iowait, kstat->cpustat[IOWAIT_BASE]);
>                irq = cputime64_add(irq, kstat->cpustat[IRQ]);
>                softirq = cputime64_add(softirq, kstat->cpustat[SOFTIRQ]);
>                steal = cputime64_add(steal, kstat->cpustat[STEAL]);
> @@ -9134,7 +9179,9 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
>                system = kstat->cpustat[SYSTEM];
>                idle = root_kstat->cpustat[IDLE];
>                idle = cputime64_add(idle, arch_idle_time(i));
> +               idle = cputime64_sub(idle, kstat->cpustat[IDLE_BASE]);
>                iowait = root_kstat->cpustat[IOWAIT];
> +               iowait = cputime64_sub(iowait, kstat->cpustat[IOWAIT_BASE]);
>                irq = kstat->cpustat[IRQ];
>                softirq = kstat->cpustat[SOFTIRQ];
>                steal = kstat->cpustat[STEAL];
> @@ -9205,6 +9252,10 @@ static struct cftype cpu_files[] = {
>                .write_u64 = cpu_rt_period_write_uint,
>        },
>  #endif
> +       {
> +               .name = "proc.stat",
> +               .read_seq_string = cpu_cgroup_proc_stat,

Looks fine to me

Balbir Singh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-27 17:01   ` Balbir Singh
@ 2011-09-27 18:42     ` Glauber Costa
  2011-09-27 22:21       ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-09-27 18:42 UTC (permalink / raw)
  To: Balbir Singh
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley

On 09/27/2011 02:01 PM, Balbir Singh wrote:
> On Sat, Sep 24, 2011 at 3:50 AM, Glauber Costa<glommer@parallels.com>  wrote:
>> Each cgroup has its own file, cpu.proc.stat that will
>> display the exact same format as /proc/stat. Users
>> that want to have access to a per-cgroup version of
>> this information, can query it for that purpose.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
> ...

Hi Balbir, thanks for reviewing this.

>> +static inline void task_cgroup_account_field(struct task_struct *p,
>> +                                            cputime64_t tmp, int index)
>> +{
>> +       struct kernel_stat *kstat;
>> +       struct task_group *tg = task_group(p);
>> +
>> +       do {
>> +               kstat = this_cpu_ptr(tg->cpustat);
>> +               kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
>> +                                                     tmp);
>> +               tg = tg->parent;
>> +       } while (tg);
>> +}
>
> What protects the walk (tg = tg->parent)? Could you please document it
I think it's the fact that the hierarchy only grows downwards, so the
parent never changes (or am I wrong?)

And since we run all this with preempt disabled and with the runqueue 
locked, we should have no problems.

Do you agree?

>> +
>>   /*
>>   * Account user cpu time to a process.
>>   * @p: the process that the cpu time gets accounted to
>> @@ -3763,9 +3777,8 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
>>   * @cputime_scaled: cputime scaled by cpu frequency
>>   */
>>   void account_user_time(struct task_struct *p, cputime_t cputime,
>> -                      cputime_t cputime_scaled)
>> +               cputime_t cputime_scaled)
>>   {
>> -       cputime64_t *cpustat = kstat_this_cpu->cpustat;
>>         cputime64_t tmp;
>>
>>         /* Add user time to process. */
>> @@ -3775,10 +3788,11 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
>>
>>         /* Add user time to cpustat. */
>>         tmp = cputime_to_cputime64(cputime);
>> +
>>         if (TASK_NICE(p) > 0)
>> -               cpustat[NICE] = cputime64_add(cpustat[NICE], tmp);
>> +               task_cgroup_account_field(p, tmp, NICE);
>>         else
>> -               cpustat[USER] = cputime64_add(cpustat[USER], tmp);
>> +               task_cgroup_account_field(p, tmp, USER);
>>
>>         cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
>>         /* Account for user time used */
>> @@ -3824,7 +3838,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
>>   */
>>   static inline
>>   void __account_system_time(struct task_struct *p, cputime_t cputime,
>> -                       cputime_t cputime_scaled, cputime64_t *target_cputime64)
>> +                       cputime_t cputime_scaled, int index)
>>   {
>>         cputime64_t tmp = cputime_to_cputime64(cputime);
>>
>> @@ -3834,7 +3848,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
>>         account_group_system_time(p, cputime);
>>
>>         /* Add system time to cpustat. */
>> -       *target_cputime64 = cputime64_add(*target_cputime64, tmp);
>> +       task_cgroup_account_field(p, tmp, index);
>>         cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
>>
>>         /* Account for system time used */
>> @@ -3851,8 +3865,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
>>   void account_system_time(struct task_struct *p, int hardirq_offset,
>>                          cputime_t cputime, cputime_t cputime_scaled)
>>   {
>> -       cputime64_t *cpustat = kstat_this_cpu->cpustat;
>> -       cputime64_t *target_cputime64;
>> +       int index;
>>
>>         if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
>>                 account_guest_time(p, cputime, cputime_scaled);
>> @@ -3860,13 +3873,13 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
>>         }
>>
>>         if (hardirq_count() - hardirq_offset)
>> -               target_cputime64 = &cpustat[IRQ];
>> +               index = IRQ;
>>         else if (in_serving_softirq())
>> -               target_cputime64 = &cpustat[SOFTIRQ];
>> +               index = SOFTIRQ;
>>         else
>> -               target_cputime64 = &cpustat[SYSTEM];
>> +               index = SYSTEM;
>>
>> -       __account_system_time(p, cputime, cputime_scaled, target_cputime64);
>> +       __account_system_time(p, cputime, cputime_scaled, index);
>>   }
>>
>>   /*
>> @@ -3941,27 +3954,29 @@ static __always_inline bool steal_account_process_tick(void)
>>   * softirq as those do not count in task exec_runtime any more.
>>   */
>>   static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
>> -                                               struct rq *rq)
>> +                                        struct rq *rq)
>>   {
>>         cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
>>         cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
>> -       cputime64_t *cpustat = kstat_this_cpu->cpustat;
>> +       struct task_group *tg;
>>
>>         if (steal_account_process_tick())
>>                 return;
>>
>> +       tg = task_group(p);
>> +
>>         if (irqtime_account_hi_update()) {
>> -               cpustat[IRQ] = cputime64_add(cpustat[IRQ], tmp);
>> +               task_cgroup_account_field(p, tmp, IRQ);
>>         } else if (irqtime_account_si_update()) {
>> -               cpustat[SOFTIRQ] = cputime64_add(cpustat[SOFTIRQ], tmp);
>> +               task_cgroup_account_field(p, tmp, SOFTIRQ);
>>         } else if (this_cpu_ksoftirqd() == p) {
>>                 /*
>>                  * ksoftirqd time do not get accounted in cpu_softirq_time.
>>                  * So, we have to handle it separately here.
>>                  * Also, p->stime needs to be updated for ksoftirqd.
>>                  */
>> -               __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
>> -                                       &cpustat[SOFTIRQ]);
>> +               __account_system_time(p, cputime_one_jiffy,
>> +                                     one_jiffy_scaled, SOFTIRQ);
>>         } else if (user_tick) {
>>                 account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
>>         } else if (p == rq->idle) {
>> @@ -3969,8 +3984,8 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
>>         } else if (p->flags & PF_VCPU) { /* System time or guest time */
>>                 account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
>>         } else {
>> -               __account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
>> -                                       &cpustat[SYSTEM]);
>> +               __account_system_time(p, cputime_one_jiffy,
>> +                                     one_jiffy_scaled, SYSTEM);
>>         }
>>   }
>>
>> @@ -8038,6 +8053,7 @@ void __init sched_init(void)
>>   {
>>         int i, j;
>>         unsigned long alloc_size = 0, ptr;
>> +       struct kernel_stat *kstat;
>>
>>   #ifdef CONFIG_FAIR_GROUP_SCHED
>>         alloc_size += 2 * nr_cpu_ids * sizeof(void **);
>> @@ -8092,6 +8108,18 @@ void __init sched_init(void)
>>         INIT_LIST_HEAD(&root_task_group.children);
>>         INIT_LIST_HEAD(&root_task_group.siblings);
>>         autogroup_init(&init_task);
>> +
>> +       root_task_group.cpustat = alloc_percpu(struct kernel_stat);
>> +       /* Failing that early an allocation means we're screwed anyway */
>> +       BUG_ON(!root_task_group.cpustat);
>> +       for_each_possible_cpu(i) {
>
> for_each_possible_cpu might be an overkill, no?
>
>> +               kstat = per_cpu_ptr(root_task_group.cpustat, i);
>> +               kstat->cpustat[IDLE] = 0;
>> +               kstat->cpustat[IDLE_BASE] = 0;
>> +               kstat->cpustat[IOWAIT_BASE] = 0;
>> +               kstat->cpustat[IOWAIT] = 0;
>> +       }
>> +
>>   #endif /* CONFIG_CGROUP_SCHED */
>>
>>         for_each_possible_cpu(i) {
>> @@ -8526,6 +8554,7 @@ static void free_sched_group(struct task_group *tg)
>>         free_fair_sched_group(tg);
>>         free_rt_sched_group(tg);
>>         autogroup_free(tg);
>> +       free_percpu(tg->cpustat);
>>         kfree(tg);
>>   }
>>
>> @@ -8534,6 +8563,7 @@ struct task_group *sched_create_group(struct task_group *parent)
>>   {
>>         struct task_group *tg;
>>         unsigned long flags;
>> +       int i;
>>
>>         tg = kzalloc(sizeof(*tg), GFP_KERNEL);
>>         if (!tg)
>> @@ -8545,6 +8575,19 @@ struct task_group *sched_create_group(struct task_group *parent)
>>         if (!alloc_rt_sched_group(tg, parent))
>>                 goto err;
>>
>> +       tg->cpustat = alloc_percpu(struct kernel_stat);
>> +       if (!tg->cpustat)
>> +               goto err;
>> +
>> +       for_each_possible_cpu(i) {
>> +               struct kernel_stat *kstat, *root_kstat;
>> +
>> +               kstat = per_cpu_ptr(tg->cpustat, i);
>> +               root_kstat = per_cpu_ptr(root_task_group.cpustat, i);
>> +               kstat->cpustat[IDLE_BASE]  = root_kstat->cpustat[IDLE];
>> +               kstat->cpustat[IOWAIT_BASE] = root_kstat->cpustat[IOWAIT];
>> +       }
>> +
>>         spin_lock_irqsave(&task_group_lock, flags);
>>         list_add_rcu(&tg->list, &task_groups);
>>
>> @@ -9092,7 +9135,9 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
>>                 system = cputime64_add(system, kstat->cpustat[SYSTEM]);
>>                 idle = cputime64_add(idle, root_kstat->cpustat[IDLE]);
>>                 idle = cputime64_add(idle, arch_idle_time(i));
>> +               idle = cputime64_sub(idle, kstat->cpustat[IDLE_BASE]);
>>                 iowait = cputime64_add(iowait, root_kstat->cpustat[IOWAIT]);
>> +               iowait = cputime64_sub(iowait, kstat->cpustat[IOWAIT_BASE]);
>>                 irq = cputime64_add(irq, kstat->cpustat[IRQ]);
>>                 softirq = cputime64_add(softirq, kstat->cpustat[SOFTIRQ]);
>>                 steal = cputime64_add(steal, kstat->cpustat[STEAL]);
>> @@ -9134,7 +9179,9 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft, struct seq_fil
>>                 system = kstat->cpustat[SYSTEM];
>>                 idle = root_kstat->cpustat[IDLE];
>>                 idle = cputime64_add(idle, arch_idle_time(i));
>> +               idle = cputime64_sub(idle, kstat->cpustat[IDLE_BASE]);
>>                 iowait = root_kstat->cpustat[IOWAIT];
>> +               iowait = cputime64_sub(iowait, kstat->cpustat[IOWAIT_BASE]);
>>                 irq = kstat->cpustat[IRQ];
>>                 softirq = kstat->cpustat[SOFTIRQ];
>>                 steal = kstat->cpustat[STEAL];
>> @@ -9205,6 +9252,10 @@ static struct cftype cpu_files[] = {
>>                 .write_u64 = cpu_rt_period_write_uint,
>>         },
>>   #endif
>> +       {
>> +               .name = "proc.stat",
>> +               .read_seq_string = cpu_cgroup_proc_stat,
>
> Looks fine to me
>
> Balbir Singh
Awesome.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 1/9] Change cpustat fields to an array.
  2011-09-23 22:20 ` [RFD 1/9] Change cpustat fields to an array Glauber Costa
@ 2011-09-27 21:00   ` Peter Zijlstra
  2011-09-28 15:13     ` Glauber Costa
  2011-09-28 18:19     ` Glauber Costa
  2011-09-27 21:03   ` Peter Zijlstra
  1 sibling, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-27 21:00 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>  /* Must have preemption disabled for this to be meaningful. */
> -#define kstat_this_cpu __get_cpu_var(kstat)
> +#define kstat_this_cpu this_cpu_ptr(task_group_kstat(current))

This just lost you a debug check: the former would whinge when called
without preemption disabled, the new one won't. It's part of the
this_cpu feature set to make debugging impossible.

> +#else
> +#define kstat_cpu(cpu) per_cpu(kstat, cpu)
> +#define kstat_this_cpu (&__get_cpu_var(kstat))
> +#endif
>  
>  extern unsigned long long nr_context_switches(void);
>  
> @@ -52,8 +62,8 @@ struct irq_desc;
>  static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
>                                             struct irq_desc *desc)
>  {
> -       __this_cpu_inc(kstat.irqs[irq]);
> -       __this_cpu_inc(kstat.irqs_sum);
> +       kstat_this_cpu->irqs[irq]++;
> +       kstat_this_cpu->irqs_sum++;

It might be worth looking at the asm output of that, I think you made it
worse, but I'm not quite sure how smart gcc is, it might just figure out
what you meant.

>  } 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 1/9] Change cpustat fields to an array.
  2011-09-23 22:20 ` [RFD 1/9] Change cpustat fields to an array Glauber Costa
  2011-09-27 21:00   ` Peter Zijlstra
@ 2011-09-27 21:03   ` Peter Zijlstra
  2011-09-28 15:14     ` Glauber Costa
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-27 21:03 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> @@ -623,6 +624,9 @@ static inline struct task_group *task_group(struct
> task_struct *p)
>         struct task_group *tg;
>         struct cgroup_subsys_state *css;
>  
> +       if (!p->mm)
> +               return &root_task_group;
> +
>         css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
>                         lockdep_is_held(&p->pi_lock) ||
>                         lockdep_is_held(&task_rq(p)->lock)); 

Hmm, why is that? Aren't kthreads part of the cgroup muck just as much
as normal tasks are?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-23 22:20 ` [RFD 3/9] Display /proc/stat information per cgroup Glauber Costa
  2011-09-27 17:01   ` Balbir Singh
@ 2011-09-27 21:48   ` Peter Zijlstra
  2011-09-28 15:14     ` Glauber Costa
  2011-09-27 21:52   ` Peter Zijlstra
  2 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-27 21:48 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> +static inline void task_cgroup_account_field(struct task_struct *p,
> +                                            cputime64_t tmp, int index)
> +{
> +       struct kernel_stat *kstat;
> +       struct task_group *tg = task_group(p);
> +
> +       do {
> +               kstat = this_cpu_ptr(tg->cpustat);
> +               kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
> +                                                     tmp);

So aside from the cputime64_t nonsense you could actually write that as:

		__this_cpu_add(tg->cpustat[index], tmp);

which should yield better asm, I think; a quick grep seems to confirm
cputime64_t is indeed a u64 all over, so it's fair to just ignore that.

> +               tg = tg->parent;
> +       } while (tg);
> +} 
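
Roughly, the accumulation loop would then collapse to something like
this (untested sketch; relies on cputime64_t really being a plain u64,
per the above):

	do {
		/* per-cpu add into this group's stats, then walk up */
		__this_cpu_add(tg->cpustat->cpustat[index], tmp);
		tg = tg->parent;
	} while (tg);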



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-23 22:20 ` [RFD 3/9] Display /proc/stat information per cgroup Glauber Costa
  2011-09-27 17:01   ` Balbir Singh
  2011-09-27 21:48   ` Peter Zijlstra
@ 2011-09-27 21:52   ` Peter Zijlstra
  2011-09-28 15:15     ` Glauber Costa
  2 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-27 21:52 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> +       root_task_group.cpustat = alloc_percpu(struct kernel_stat);
> +       /* Failing that early an allocation means we're screwed anyway */
> +       BUG_ON(!root_task_group.cpustat);
> +       for_each_possible_cpu(i) {
> +               kstat = per_cpu_ptr(root_task_group.cpustat, i);
> +               kstat->cpustat[IDLE] = 0;
> +               kstat->cpustat[IDLE_BASE] = 0;
> +               kstat->cpustat[IOWAIT_BASE] = 0;
> +               kstat->cpustat[IOWAIT] = 0;

so alloc_percpu() will actually zero your memory for you, which you rely
on for all other fields as well, so why reset these ones?

> +       }



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-23 22:20 ` [RFD 4/9] Make total_forks per-cgroup Glauber Costa
@ 2011-09-27 22:00   ` Peter Zijlstra
  2011-09-28  8:13     ` Martin Schwidefsky
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-27 22:00 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley,
	Martin Schwidefsky, Heiko Carstens

On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> @@ -1039,6 +1035,8 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
>         INIT_LIST_HEAD(&tsk->cpu_timers[2]);
>  }
>  
> +struct task_group *task_group(struct task_struct *p);

That doesn't appear to be actually used in this file..

Also, since there's already a for_each_possible_cpu() loop in that
proc/stat function, would it yield some code improvement to make
total_forks a cpu_usage_stat?

I guess the whole cputime64_t crap gets in the way of that being
natural... 

We could of course kill off the cputime64_t thing, it's pretty pointless
and it's a u64 all over the board. I think Martin or Heiko created this
stuff (although I might be wrong, my git tree doesn't go back that far).



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 0/9] per-cgroup /proc/stat statistics
  2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
                   ` (8 preceding siblings ...)
  2011-09-23 22:20 ` [RFD 9/9] Change CPUACCT to default n Glauber Costa
@ 2011-09-27 22:11 ` Peter Zijlstra
  2011-09-28 15:21   ` Glauber Costa
  9 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-27 22:11 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> Hi,
> 
> Since I've sent already a RFC about it, I am sending now a RFD.
> If you eager for meaning, this can then be a "Request for Doctors",
> since Peter is likely to have a heart attack now.

:-)

All we need is to ensure the case of cgroups enabled but not used isn't
actually more expensive than what we have now; after that, if people
create a 100-deep cgroup hierarchy they get what they asked for.

From a conceptual pov this patch-set is a lot saner than the previous
one, doesn't duplicate nearly as much and actually tries to improve the
code (although I suspect simply killing off cputime64_t as a whole will
get us even more).

> So here's the deal:
> 
> * My main goal here was to demonstrate that we can avoid double accounting
>   in a lot of places. So what I did was getting rid of the original and first
>   kstat mechanism, and use only cgroups accounting for that. Since the parent
>   is always updated, the original stats are the one for the root cgroup.

Right, current patch-set won't compile for those who have CGROUP=n
kernels though, need to find something for that. Shouldn't be too hard
though. It looks like you only need to provide static per-cpu storage
and a custom version of task_cgroup_account_field().
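
I.e. for the CGROUP=n case something along these lines should do
(untested sketch, reusing the existing per-cpu kstat variable):

	static inline void task_cgroup_account_field(struct task_struct *p,
						     cputime64_t tmp, int index)
	{
		/* no cgroup hierarchy to walk, account to the global kstat only */
		cputime64_t *cpustat = kstat_this_cpu->cpustat;

		cpustat[index] = cputime64_add(cpustat[index], tmp);
	}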

> * I believe that all those cpu cgroups are confusing and should be unified. Not
>   that we can simply get rid of it, but my goal here is to provide all the
>   information they do, in cpu cgroup. If the set of tasks needed for accounting
>   is not independent of the ones in cpu cgroup, we can avoid double accounting
>   for that. I default cpuacct to n, but leave it to people that wants to use it
>   alone. 

Amen! Ideally we place cpuacct on the deprecated list or somesuch..

> * Well, I'm also doing what I was doing originally: Providing a per-cgroup version
>   of the /proc/stat file.

Right, so how much sense does it make to keep calling it proc.stat?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-27 18:42     ` Glauber Costa
@ 2011-09-27 22:21       ` Peter Zijlstra
  2011-09-28 15:22         ` Glauber Costa
  2011-09-28 15:23         ` Glauber Costa
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-27 22:21 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Balbir Singh, linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Tue, 2011-09-27 at 15:42 -0300, Glauber Costa wrote:
> >> +static inline void task_cgroup_account_field(struct task_struct *p,
> >> +                                            cputime64_t tmp, int index)
> >> +{
> >> +       struct kernel_stat *kstat;
> >> +       struct task_group *tg = task_group(p);
> >> +
> >> +       do {
> >> +               kstat = this_cpu_ptr(tg->cpustat);
> >> +               kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
> >> +                                                     tmp);
> >> +               tg = tg->parent;
> >> +       } while (tg);
> >> +}
> >
> > What protects the walk (tg = tg->parent)? Could you please document it
> I think that the fact that the hierarchy only grows down, thus parent 
> never changes (or am I wrong?)
> 
> And since we run all this with preempt disabled and with the runqueue 
> locked, we should have no problems.
> 
> Do you agree? 

Right, so the tg can't be destroyed unless its empty, us finding this
task in it means its not empty, we require rq->lock or p->pi_lock to
move the task.

However, afaict we don't actually have any of those locks.

That said, it should be sufficient to wrap the whole thing in
rcu_read_lock(), accounting one tick funny because a cgroup move race
isn't the end of the world and it's really no different from moving the
task a little later anyway.

task_group() should complain about this if you compile a kernel with
CONFIG_PROVE_RCU.
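
Something along these lines, that is (sketch only; the same walk as
before, just wrapped in rcu_read_lock()):

	static inline void task_cgroup_account_field(struct task_struct *p,
						     cputime64_t tmp, int index)
	{
		struct kernel_stat *kstat;
		struct task_group *tg;

		rcu_read_lock();
		for (tg = task_group(p); tg; tg = tg->parent) {
			/* per-cpu stats of this group and all its ancestors */
			kstat = this_cpu_ptr(tg->cpustat);
			kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
							      tmp);
		}
		rcu_read_unlock();
	}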

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-27 22:00   ` Peter Zijlstra
@ 2011-09-28  8:13     ` Martin Schwidefsky
  2011-09-28 10:35       ` Peter Zijlstra
  2011-09-28 15:26       ` Glauber Costa
  0 siblings, 2 replies; 44+ messages in thread
From: Martin Schwidefsky @ 2011-09-28  8:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On Wed, 28 Sep 2011 00:00:37 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> > @@ -1039,6 +1035,8 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
> >         INIT_LIST_HEAD(&tsk->cpu_timers[2]);
> >  }
> >  
> > +struct task_group *task_group(struct task_struct *p);
> 
> That doesn't appear to be actually used in this file..
> 
> Also, since there's already a for_each_possible_cpu() loop in that
> proc/stat function, would it yield some code improvement to make
> total_forks a cpu_usage_stat?
> 
> I guess the whole cputime64_t crap gets in the way of that being
> natural... 
> 
> We could of course kill off the cputime64_t thing, its pretty pointless
> and its a u64 all over the board. I think Martin or Heiko created this
> stuff (although I might be wrong, my git tree doesn't go back that far).

The reason to introduce cputime_t has been that different architecture
needed differently sized integers for their respective representation
of cputime. On x86-32 the number of ticks is recorded in a u32, on s390
we needed a u64 for the cpu timer values. cputime64_t is needed for
cpustat and other sums of cputime that would overflow a cputime_t
(in particular on x86-32 with the u32 cputime_t and the u64 cputime64_t).

We could now convert everything to u64, but that would cause x86-32 to
use 64-bit arithmetic for the tick counter. Whether that is acceptable I
can't say.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28  8:13     ` Martin Schwidefsky
@ 2011-09-28 10:35       ` Peter Zijlstra
  2011-09-28 12:42         ` Martin Schwidefsky
  2011-09-28 15:27         ` Glauber Costa
  2011-09-28 15:26       ` Glauber Costa
  1 sibling, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-28 10:35 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Glauber Costa, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On Wed, 2011-09-28 at 10:13 +0200, Martin Schwidefsky wrote:
> On Wed, 28 Sep 2011 00:00:37 +0200
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> > > @@ -1039,6 +1035,8 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
> > >         INIT_LIST_HEAD(&tsk->cpu_timers[2]);
> > >  }
> > >  
> > > +struct task_group *task_group(struct task_struct *p);
> > 
> > That doesn't appear to be actually used in this file..
> > 
> > Also, since there's already a for_each_possible_cpu() loop in that
> > proc/stat function, would it yield some code improvement to make
> > total_forks a cpu_usage_stat?
> > 
> > I guess the whole cputime64_t crap gets in the way of that being
> > natural... 
> > 
> > We could of course kill off the cputime64_t thing, its pretty pointless
> > and its a u64 all over the board. I think Martin or Heiko created this
> > stuff (although I might be wrong, my git tree doesn't go back that far).
> 
> The reason to introduce cputime_t has been that different architecture
> needed differently sized integers for their respective representation
> of cputime. On x86-32 the number of ticks is recorded in a u32, on s390
> we needed a u64 for the cpu timer values. cputime64_t is needed for
> cpustat and other sums of cputime that would overflow a cputime_t
> (in particular on x86-32 with the u32 cputime_t and the u64 cputime64_t).
> 
> Now we would convert everything to u64 but that would cause x86-32 to
> use 64-bit arithmetic for the tick counter. If that is acceptable I
> can't say.

Right, so the main point was about cputime64_t, we might as well use a
u64 for that throughout and ditch the silly cputime64_$op() accessors
and write normal code.

But even if cputime_t differs between 32-bit and 64-bit machines, there is
no reason to actually use cputime_add(); C can do this.

The only reason to use things like cputime_add() is if you use a non
simple type, but that doesn't seem to be the case.

So I think we can simplify the code lots by doing away with cputime64_t
and all the cputime_*() functions. We can keep cputime_t, or we can use
unsigned long, which I think will end up doing pretty much the same.
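
Concretely, accounting code like

	idle = cputime64_add(idle, kstat->cpustat[IDLE]);

would then just read

	idle += kstat->cpustat[IDLE];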

That is, am I missing some added value of all this cputime*() foo?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28 10:35       ` Peter Zijlstra
@ 2011-09-28 12:42         ` Martin Schwidefsky
  2011-09-28 12:53           ` Peter Zijlstra
  2011-09-28 15:28           ` Glauber Costa
  2011-09-28 15:27         ` Glauber Costa
  1 sibling, 2 replies; 44+ messages in thread
From: Martin Schwidefsky @ 2011-09-28 12:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On Wed, 28 Sep 2011 12:35:24 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2011-09-28 at 10:13 +0200, Martin Schwidefsky wrote:
> > On Wed, 28 Sep 2011 00:00:37 +0200
> > Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> > > On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> > > > @@ -1039,6 +1035,8 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
> > > >         INIT_LIST_HEAD(&tsk->cpu_timers[2]);
> > > >  }
> > > >  
> > > > +struct task_group *task_group(struct task_struct *p);
> > > 
> > > That doesn't appear to be actually used in this file..
> > > 
> > > Also, since there's already a for_each_possible_cpu() loop in that
> > > proc/stat function, would it yield some code improvement to make
> > > total_forks a cpu_usage_stat?
> > > 
> > > I guess the whole cputime64_t crap gets in the way of that being
> > > natural... 
> > > 
> > > We could of course kill off the cputime64_t thing, its pretty pointless
> > > and its a u64 all over the board. I think Martin or Heiko created this
> > > stuff (although I might be wrong, my git tree doesn't go back that far).
> > 
> > The reason to introduce cputime_t has been that different architecture
> > needed differently sized integers for their respective representation
> > of cputime. On x86-32 the number of ticks is recorded in a u32, on s390
> > we needed a u64 for the cpu timer values. cputime64_t is needed for
> > cpustat and other sums of cputime that would overflow a cputime_t
> > (in particular on x86-32 with the u32 cputime_t and the u64 cputime64_t).
> > 
> > Now we would convert everything to u64 but that would cause x86-32 to
> > use 64-bit arithmetic for the tick counter. If that is acceptable I
> > can't say.
> 
> Right, so the main point was about cputime64_t, we might as well use a
> u64 for that throughout and ditch the silly cputime64_$op() accessors
> and write normal code.
> 
> But even if cputime_t differs between 32 and 64 bit machines, there is
> no reason actually use cputime_add(), C can do this.
> 
> The only reason to use things like cputime_add() is if you use a non
> simple type, but that doesn't seem to be the case.
> 
> So I think we can simplify the code lots by doing away with cputime64_t
> and all the cputime_*() functions. We can keep cputime_t, or we can use
> unsigned long, which I think will end up doing pretty much the same.
> 
> That is, am I missing some added value of all this cputime*() foo?

C can do the math as long as the encoding of the cputime is simple enough.
Can we demand that a cputime value be an integral type?

What I did when I wrote all that stuff is to define cputime_t as a struct
that contains a single u64. That way I found all the places in the kernel
that used a cputime and could convert the code accordingly.
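
Roughly like this (an illustrative reconstruction, not the original
patch):

	typedef struct {
		u64 val;
	} cputime_t;

	#define cputime_add(a, b)	((cputime_t) { (a).val + (b).val })
	#define cputime_eq(a, b)	((a).val == (b).val)

so any place doing plain integer arithmetic on a cputime_t simply
fails to compile.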

My fear is that if the cputime_xxx operations are removed, code will
sneak in again that just uses an unsigned long instead of a cputime_t.
That would break any arch that requires something bigger than a u32 for
its cputime. I really have to find my old debugging patch and see if we
already have bit rot in regard to cputime_t.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28 12:42         ` Martin Schwidefsky
@ 2011-09-28 12:53           ` Peter Zijlstra
  2011-09-28 15:29             ` Glauber Costa
  2011-09-28 15:28           ` Glauber Costa
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-28 12:53 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Glauber Costa, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On Wed, 2011-09-28 at 14:42 +0200, Martin Schwidefsky wrote:

> > That is, am I missing some added value of all this cputime*() foo?
> 
> C can do the math as long as the encoding of the cputime is simple enough.
> Can we demand that a cputime value needs to be an integral type ?

I'd like to think we can ;-)

> What I did when I wrote all that stuff is to define cputime_t as a struct
> that contains a single u64. That way I found all the places in the kernel
> that used a cputime and could convert the code accordingly.

Indeed, that makes it a non-simple type and breaks all the C arith bits.

> My fear is that if the cputime_xxx operations are removed, code will
> sneak in again that just uses an unsigned long instead of a cputime_t.
> That would break any arch that requires something bigger than a u32 for
> its cputime.

Which is only a problem for 32bit archs, of which s390 is the only one
that matters, right? Hurm, could we do something with sparse? Lots of
people run sparse.
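
One option might be a sparse-restricted typedef, something like
(untested):

	typedef u64 __nocast cputime_t;

so assigning a plain integer to a cputime_t would at least warn when
the code is run through sparse.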


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 1/9] Change cpustat fields to an array.
  2011-09-27 21:00   ` Peter Zijlstra
@ 2011-09-28 15:13     ` Glauber Costa
  2011-09-28 15:23       ` Peter Zijlstra
  2011-09-28 18:19     ` Glauber Costa
  1 sibling, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/27/2011 06:00 PM, Peter Zijlstra wrote:
> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>>   /* Must have preemption disabled for this to be meaningful. */
>> -#define kstat_this_cpu __get_cpu_var(kstat)
>> +#define kstat_this_cpu this_cpu_ptr(task_group_kstat(current))
>
> This just lost you a debug check: the former would whinge when called
> without preemption disabled, the new one won't. It's part of the
> this_cpu feature set to make debugging impossible.

Why is that?

from percpu.h:
#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

So I don't get it.

>> +#else
>> +#define kstat_cpu(cpu) per_cpu(kstat, cpu)
>> +#define kstat_this_cpu (&__get_cpu_var(kstat))
>> +#endif
>>
>>   extern unsigned long long nr_context_switches(void);
>>
>> @@ -52,8 +62,8 @@ struct irq_desc;
>>   static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
>>                                              struct irq_desc *desc)
>>   {
>> -       __this_cpu_inc(kstat.irqs[irq]);
>> -       __this_cpu_inc(kstat.irqs_sum);
>> +       kstat_this_cpu->irqs[irq]++;
>> +       kstat_this_cpu->irqs_sum++;
>
> It might be worth looking at the asm output of that, I think you made it
> worse, but I'm not quite sure how smart gcc is, it might just figure out
> what you meant.
>
>>   }

Yes, it is indeed a bit worse, but the reason appears to be an extra
call to something related to task_group(). That one is inline, but it
calls more stuff along the way, and that is hard to dodge once we
move to this path. It might be acceptable to lose some cycles here to
gain more back once cpuacct is out.

That said, I'll also see if we can squeeze something more clever out of it.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 1/9] Change cpustat fields to an array.
  2011-09-27 21:03   ` Peter Zijlstra
@ 2011-09-28 15:14     ` Glauber Costa
  0 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/27/2011 06:03 PM, Peter Zijlstra wrote:
> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>> @@ -623,6 +624,9 @@ static inline struct task_group *task_group(struct
>> task_struct *p)
>>          struct task_group *tg;
>>          struct cgroup_subsys_state *css;
>>
>> +       if (!p->mm)
>> +               return &root_task_group;
>> +
>>          css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
>>                          lockdep_is_held(&p->pi_lock) ||
>>                          lockdep_is_held(&task_rq(p)->lock));
>
> Hmm, why is that? Aren't kthreads part of the cgroup muck just as much
> as normal tasks are?
You're right. I had something from other work I'm doing in mind
and got confused.
Will remove it.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-27 21:48   ` Peter Zijlstra
@ 2011-09-28 15:14     ` Glauber Costa
  0 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/27/2011 06:48 PM, Peter Zijlstra wrote:
> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>> +static inline void task_cgroup_account_field(struct task_struct *p,
>> +                                            cputime64_t tmp, int index)
>> +{
>> +       struct kernel_stat *kstat;
>> +       struct task_group *tg = task_group(p);
>> +
>> +       do {
>> +               kstat = this_cpu_ptr(tg->cpustat);
>> +               kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
>> +                                                     tmp);
>
> So aside from the cputime64_t nonsense you could actually write that as:
>
> 		__this_cpu_add(tg->cpustat[index], tmp);
>
> which should yield better asm I think, a quick grep seems to confirm
> cputime64_t is indeed a u64 all over so its fair to just ignore that.

Ok, It would indeed be better.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-27 21:52   ` Peter Zijlstra
@ 2011-09-28 15:15     ` Glauber Costa
  0 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:15 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/27/2011 06:52 PM, Peter Zijlstra wrote:
> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>> +       root_task_group.cpustat = alloc_percpu(struct kernel_stat);
>> +       /* Failing that early an allocation means we're screwed anyway */
>> +       BUG_ON(!root_task_group.cpustat);
>> +       for_each_possible_cpu(i) {
>> +               kstat = per_cpu_ptr(root_task_group.cpustat, i);
>> +               kstat->cpustat[IDLE] = 0;
>> +               kstat->cpustat[IDLE_BASE] = 0;
>> +               kstat->cpustat[IOWAIT_BASE] = 0;
>> +               kstat->cpustat[IOWAIT] = 0;
>
> so alloc_percpu() will actually zero your memory for you, which you rely
> on for all other fields as well, so why reset these ones?

Just to be style-consistent with the non-root initialization, where
we actually assign values. But I can easily remove it.

>
>> +       }
>
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 0/9] per-cgroup /proc/stat statistics
  2011-09-27 22:11 ` [RFD 0/9] per-cgroup /proc/stat statistics Peter Zijlstra
@ 2011-09-28 15:21   ` Glauber Costa
  2011-09-28 15:27     ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:21 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/27/2011 07:11 PM, Peter Zijlstra wrote:
> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>> Hi,
>>
>> Since I've sent already a RFC about it, I am sending now a RFD.
>> If you eager for meaning, this can then be a "Request for Doctors",
>> since Peter is likely to have a heart attack now.
>
> :-)
>
> All we need is to ensure the case of cgroups enabled but not used isn't
> actually more expensive that what we have now, after that, if people
> create a 100 deep cgroup hierarchy they get what they asked.
>
>  From a conceptual pov this patch-set is a lot saner than the previous
> one, doesn't duplicate nearly as much and actually tries to improve the
> code (although I suspect simply killing off cputime64_t as a whole will
> get us even more).

Are you actually planning to do that by yourself ?

>
>> So here's the deal:
>>
>> * My main goal here was to demonstrate that we can avoid double accounting
>>    in a lot of places. So what I did was getting rid of the original and first
>>    kstat mechanism, and use only cgroups accounting for that. Since the parent
>>    is always updated, the original stats are the one for the root cgroup.
>
> Right, current patch-set won't compile for those who have CGROUP=n

yet.

> kernels though, need to find something for that. Shouldn't be too hard
> though. It looks like you only need to provide static per-cpu storage
> and a custom version of task_cgroup_account_field().
Precisely. I was more worried about getting acceptance of the whole 
thing first...

>
>> * I believe that all those cpu cgroups are confusing and should be unified. Not
>>    that we can simply get rid of it, but my goal here is to provide all the
>>    information they do, in cpu cgroup. If the set of tasks needed for accounting
>>    is not independent of the ones in cpu cgroup, we can avoid double accounting
>>    for that. I default cpuacct to n, but leave it to people that wants to use it
>>    alone.
>
> Amen! Ideally we place cpuacct on the deprecated list or somesuch..
If no one opposes, I can actually include that in the official 
submission, that should be the next one.

>
>> * Well, I'm also doing what I was doing originally: Providing a per-cgroup version
>>    of the /proc/stat file.
>
> Right, so how much sense does it make to keep calling it proc.stat?

For the root cgroup, it doesn't. But I don't see why we need to special 
case it.

For all others, it is pretty much the whole point of the series... We 
could of course automatically display it, and I considered that for 
simplicity. For my use case, it would even work flawlessly, but in the 
general case of people using cgroups for resource control only, they 
might be interested in seeing whole system usage when they poll 
/proc/stat. So /proc/stat shows system-wide information, and proc.stat 
shows the cgroup's. In containers we want to tie those together, but 
that is a different story.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-27 22:21       ` Peter Zijlstra
@ 2011-09-28 15:22         ` Glauber Costa
  2011-09-28 15:23         ` Glauber Costa
  1 sibling, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/27/2011 07:21 PM, Peter Zijlstra wrote:
> On Tue, 2011-09-27 at 15:42 -0300, Glauber Costa wrote:
>>>> +static inline void task_cgroup_account_field(struct task_struct *p,
>>>> +                                            cputime64_t tmp, int index)
>>>> +{
>>>> +       struct kernel_stat *kstat;
>>>> +       struct task_group *tg = task_group(p);
>>>> +
>>>> +       do {
>>>> +               kstat = this_cpu_ptr(tg->cpustat);
>>>> +               kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
>>>> +                                                     tmp);
>>>> +               tg = tg->parent;
>>>> +       } while (tg);
>>>> +}
>>>
>>> What protects the walk (tg = tg->parent)? Could you please document it
>> I think that the fact that the hierarchy only grows down, thus parent
>> never changes (or am I wrong?)
>>
>> And since we run all this with preempt disabled and with the runqueue
>> locked, we should have no problems.
>>
>> Do you agree?
>
> Right, so the tg can't be destroyed unless its empty, us finding this
> task in it means its not empty, we require rq->lock or p->pi_lock to
> move the task.
>
> However, afaict we don't actually have any of those locks.
>
> That said, it should be sufficient to wrap the whole thing in
> rcu_read_lock(), accounting one tick funny because a cgroup move race
> isn't the end of the world and its really no different than moving the
> task a little later anyway.
>
> task_group() should complain about this if you compile a kernel with
> CONFIG_PROVE_RCU.


Yeah, wrapping it around rcu_read_lock locks fine.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 3/9] Display /proc/stat information per cgroup
  2011-09-27 22:21       ` Peter Zijlstra
  2011-09-28 15:22         ` Glauber Costa
@ 2011-09-28 15:23         ` Glauber Costa
  1 sibling, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/27/2011 07:21 PM, Peter Zijlstra wrote:
> On Tue, 2011-09-27 at 15:42 -0300, Glauber Costa wrote:
>>>> +static inline void task_cgroup_account_field(struct task_struct *p,
>>>> +                                            cputime64_t tmp, int index)
>>>> +{
>>>> +       struct kernel_stat *kstat;
>>>> +       struct task_group *tg = task_group(p);
>>>> +
>>>> +       do {
>>>> +               kstat = this_cpu_ptr(tg->cpustat);
>>>> +               kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
>>>> +                                                     tmp);
>>>> +               tg = tg->parent;
>>>> +       } while (tg);
>>>> +}
>>>
>>> What protects the walk (tg = tg->parent)? Could you please document it
>> I think that the fact that the hierarchy only grows down, thus parent
>> never changes (or am I wrong?)
>>
>> And since we run all this with preempt disabled and with the runqueue
>> locked, we should have no problems.
>>
>> Do you agree?
>
> Right, so the tg can't be destroyed unless its empty, us finding this
> task in it means its not empty, we require rq->lock or p->pi_lock to
> move the task.
>
> However, afaict we don't actually have any of those locks.
>
> That said, it should be sufficient to wrap the whole thing in
> rcu_read_lock(), accounting one tick funny because a cgroup move race
> isn't the end of the world and its really no different than moving the
> task a little later anyway.
>
> task_group() should complain about this if you compile a kernel with
> CONFIG_PROVE_RCU.


Yeah, wrapping it around rcu_read_lock looks fine.
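
Concretely, something along these lines (just a sketch mirroring the 
quoted hunk, with the RCU read lock added):

static inline void task_cgroup_account_field(struct task_struct *p,
					     cputime64_t tmp, int index)
{
	struct kernel_stat *kstat;
	struct task_group *tg;

	/* The RCU read side pins the task_group hierarchy while we walk it;
	 * a racing cgroup move just means one tick lands in the old group. */
	rcu_read_lock();
	tg = task_group(p);
	do {
		kstat = this_cpu_ptr(tg->cpustat);
		kstat->cpustat[index] = cputime64_add(kstat->cpustat[index],
						      tmp);
		tg = tg->parent;
	} while (tg);
	rcu_read_unlock();
}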

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 1/9] Change cpustat fields to an array.
  2011-09-28 15:13     ` Glauber Costa
@ 2011-09-28 15:23       ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-28 15:23 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Wed, 2011-09-28 at 12:13 -0300, Glauber Costa wrote:
> On 09/27/2011 06:00 PM, Peter Zijlstra wrote:
> > On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> >>   /* Must have preemption disabled for this to be meaningful. */
> >> -#define kstat_this_cpu __get_cpu_var(kstat)
> >> +#define kstat_this_cpu this_cpu_ptr(task_group_kstat(current))
> >
> > This just lost you a debug check, the former would whinge when called
> > without preemption, the new one wont. Its part of the this_cpu feature
> > set to make debugging impossible.
> 
> Why is that?
> 
> from percpu.h:
> #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
> 
> So I don't get it.

Ah, my bad, it's not actually part of the this_cpu*() API; what a 
wonderful mess that is.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28  8:13     ` Martin Schwidefsky
  2011-09-28 10:35       ` Peter Zijlstra
@ 2011-09-28 15:26       ` Glauber Costa
  1 sibling, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:26 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Peter Zijlstra, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On 09/28/2011 05:13 AM, Martin Schwidefsky wrote:
> On Wed, 28 Sep 2011 00:00:37 +0200
> Peter Zijlstra<a.p.zijlstra@chello.nl>  wrote:
>
>> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>>> @@ -1039,6 +1035,8 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
>>>          INIT_LIST_HEAD(&tsk->cpu_timers[2]);
>>>   }
>>>
>>> +struct task_group *task_group(struct task_struct *p);
>>
>> That doesn't appear to be actually used in this file..
>>
>> Also, since there's already a for_each_possible_cpu() loop in that
>> proc/stat function, would it yield some code improvement to make
>> total_forks a cpu_usage_stat?
>>
>> I guess the whole cputime64_t crap gets in the way of that being
>> natural...
>>
>> We could of course kill off the cputime64_t thing, its pretty pointless
>> and its a u64 all over the board. I think Martin or Heiko created this
>> stuff (although I might be wrong, my git tree doesn't go back that far).
>
> The reason to introduce cputime_t has been that different architecture
> needed differently sized integers for their respective representation
> of cputime. On x86-32 the number of ticks is recorded in a u32, on s390
> we needed a u64 for the cpu timer values. cputime64_t is needed for
> cpustat and other sums of cputime that would overflow a cputime_t
> (in particular on x86-32 with the u32 cputime_t and the u64 cputime64_t).
>
> Now we would convert everything to u64 but that would cause x86-32 to
> use 64-bit arithmetic for the tick counter. If that is acceptable I
> can't say.
>
If we get rid of cputime64_t, it doesn't mean we need to get rid of 
cputime_t. Or am I missing the point here?

We would still have to call an analogue of cputime_to_cputime64 before 
using it, but I gather that where it matters, gcc can even turn it into 
a no-op. And for all the rest, we do u64 math and typing and are happy.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 0/9] per-cgroup /proc/stat statistics
  2011-09-28 15:21   ` Glauber Costa
@ 2011-09-28 15:27     ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-28 15:27 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Wed, 2011-09-28 at 12:21 -0300, Glauber Costa wrote:
> >  From a conceptual pov this patch-set is a lot saner than the previous
> > one, doesn't duplicate nearly as much and actually tries to improve the
> > code (although I suspect simply killing off cputime64_t as a whole will
> > get us even more).
> 
> Are you actually planning to do that by yourself ?

I could, but I think I need to sort that out with Martin first.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28 10:35       ` Peter Zijlstra
  2011-09-28 12:42         ` Martin Schwidefsky
@ 2011-09-28 15:27         ` Glauber Costa
  1 sibling, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On 09/28/2011 07:35 AM, Peter Zijlstra wrote:
> On Wed, 2011-09-28 at 10:13 +0200, Martin Schwidefsky wrote:
>> On Wed, 28 Sep 2011 00:00:37 +0200
>> Peter Zijlstra<a.p.zijlstra@chello.nl>  wrote:
>>
>>> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>>>> @@ -1039,6 +1035,8 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
>>>>          INIT_LIST_HEAD(&tsk->cpu_timers[2]);
>>>>   }
>>>>
>>>> +struct task_group *task_group(struct task_struct *p);
>>>
>>> That doesn't appear to be actually used in this file..
>>>
>>> Also, since there's already a for_each_possible_cpu() loop in that
>>> proc/stat function, would it yield some code improvement to make
>>> total_forks a cpu_usage_stat?
>>>
>>> I guess the whole cputime64_t crap gets in the way of that being
>>> natural...
>>>
>>> We could of course kill off the cputime64_t thing, its pretty pointless
>>> and its a u64 all over the board. I think Martin or Heiko created this
>>> stuff (although I might be wrong, my git tree doesn't go back that far).
>>
>> The reason to introduce cputime_t has been that different architecture
>> needed differently sized integers for their respective representation
>> of cputime. On x86-32 the number of ticks is recorded in a u32, on s390
>> we needed a u64 for the cpu timer values. cputime64_t is needed for
>> cpustat and other sums of cputime that would overflow a cputime_t
>> (in particular on x86-32 with the u32 cputime_t and the u64 cputime64_t).
>>
>> Now we would convert everything to u64 but that would cause x86-32 to
>> use 64-bit arithmetic for the tick counter. If that is acceptable I
>> can't say.
>
> Right, so the main point was about cputime64_t, we might as well use a
> u64 for that throughout and ditch the silly cputime64_$op() accessors
> and write normal code.
>
> But even if cputime_t differs between 32 and 64 bit machines, there is
> no reason actually use cputime_add(), C can do this.
>
> The only reason to use things like cputime_add() is if you use a non
> simple type, but that doesn't seem to be the case.
>
> So I think we can simplify the code lots by doing away with cputime64_t
> and all the cputime_*() functions. We can keep cputime_t, or we can use
> unsigned long, which I think will end up doing pretty much the same.
For cputime64, I'm with you here (so far).
>
> That is, am I missing some added value of all this cputime*() foo?
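
To make the simplification concrete, a hedged before/after sketch 
(reusing names from the accounting hunk earlier in the thread, purely 
for illustration):

	/* today, with the cputime64_*() accessors */
	kstat->cpustat[index] = cputime64_add(kstat->cpustat[index], tmp);

	/* with cpustat[] holding plain u64, ordinary C arithmetic suffices */
	kstat->cpustat[index] += tmp;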


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28 12:42         ` Martin Schwidefsky
  2011-09-28 12:53           ` Peter Zijlstra
@ 2011-09-28 15:28           ` Glauber Costa
  1 sibling, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:28 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Peter Zijlstra, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On 09/28/2011 09:42 AM, Martin Schwidefsky wrote:
> On Wed, 28 Sep 2011 12:35:24 +0200
> Peter Zijlstra<a.p.zijlstra@chello.nl>  wrote:
>
>> On Wed, 2011-09-28 at 10:13 +0200, Martin Schwidefsky wrote:
>>> On Wed, 28 Sep 2011 00:00:37 +0200
>>> Peter Zijlstra<a.p.zijlstra@chello.nl>  wrote:
>>>
>>>> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>>>>> @@ -1039,6 +1035,8 @@ static void posix_cpu_timers_init(struct task_struct *tsk)
>>>>>          INIT_LIST_HEAD(&tsk->cpu_timers[2]);
>>>>>   }
>>>>>
>>>>> +struct task_group *task_group(struct task_struct *p);
>>>>
>>>> That doesn't appear to be actually used in this file..
>>>>
>>>> Also, since there's already a for_each_possible_cpu() loop in that
>>>> proc/stat function, would it yield some code improvement to make
>>>> total_forks a cpu_usage_stat?
>>>>
>>>> I guess the whole cputime64_t crap gets in the way of that being
>>>> natural...
>>>>
>>>> We could of course kill off the cputime64_t thing, its pretty pointless
>>>> and its a u64 all over the board. I think Martin or Heiko created this
>>>> stuff (although I might be wrong, my git tree doesn't go back that far).
>>>
>>> The reason to introduce cputime_t has been that different architecture
>>> needed differently sized integers for their respective representation
>>> of cputime. On x86-32 the number of ticks is recorded in a u32, on s390
>>> we needed a u64 for the cpu timer values. cputime64_t is needed for
>>> cpustat and other sums of cputime that would overflow a cputime_t
>>> (in particular on x86-32 with the u32 cputime_t and the u64 cputime64_t).
>>>
>>> Now we would convert everything to u64 but that would cause x86-32 to
>>> use 64-bit arithmetic for the tick counter. If that is acceptable I
>>> can't say.
>>
>> Right, so the main point was about cputime64_t, we might as well use a
>> u64 for that throughout and ditch the silly cputime64_$op() accessors
>> and write normal code.
>>
>> But even if cputime_t differs between 32 and 64 bit machines, there is
>> no reason actually use cputime_add(), C can do this.
>>
>> The only reason to use things like cputime_add() is if you use a non
>> simple type, but that doesn't seem to be the case.
>>
>> So I think we can simplify the code lots by doing away with cputime64_t
>> and all the cputime_*() functions. We can keep cputime_t, or we can use
>> unsigned long, which I think will end up doing pretty much the same.
>>
>> That is, am I missing some added value of all this cputime*() foo?
>
> C can do the math as long as the encoding of the cputime is simple enough.
> Can we demand that a cputime value needs to be an integral type ?
>
> What I did when I wrote all that stuff is to define cputime_t as a struct
> that contains a single u64. That way I found all the places in the kernel
> that used a cputime and could convert the code accordingly.
>
> My fear is that if the cputime_xxx operations are removed, code will
> sneak in again that just uses an unsigned long instead of a cputime_t.
> That would break any arch that requires something bigger than a u32 for
> its cputime. I really have to find my old debugging patch and see if we
> already have bit rot in regard to cputime_t.
>
Martin,

The proposal is to keep cputime_t as is, and only get rid of its 
64-bit counterpart, cputime64_t. So I think we're safe as far as 
cputime_t is concerned.
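
For readers following along, the struct-wrapping trick Martin describes 
would look roughly like this (a sketch, not the actual s390 definition); 
it is exactly what stops plain C arithmetic from compiling and forces 
every use site through the accessors:

typedef struct {
	u64 val;	/* wrapped so that "a + b" is a compile error */
} cputime_t;

#define cputime_add(a, b)	((cputime_t){ .val = (a).val + (b).val })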


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28 12:53           ` Peter Zijlstra
@ 2011-09-28 15:29             ` Glauber Costa
  2011-09-28 15:33               ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On 09/28/2011 09:53 AM, Peter Zijlstra wrote:
> On Wed, 2011-09-28 at 14:42 +0200, Martin Schwidefsky wrote:
>
>>> That is, am I missing some added value of all this cputime*() foo?
>>
>> C can do the math as long as the encoding of the cputime is simple enough.
>> Can we demand that a cputime value needs to be an integral type ?
>
> I'd like to think we can ;-)
>
>> What I did when I wrote all that stuff is to define cputime_t as a struct
>> that contains a single u64. That way I found all the places in the kernel
>> that used a cputime and could convert the code accordingly.
>
> Indeed, that makes it a non-simple type and breaks all the C arith bits.
>
>> My fear is that if the cputime_xxx operations are removed, code will
>> sneak in again that just uses an unsigned long instead of a cputime_t.
>> That would break any arch that requires something bigger than a u32 for
>> its cputime.
>
> Which is only a problem for 32bit archs, of which s390 is the only one
> that matters, right? Hurm,. could we do something with sparse? Lots of
> people run sparse.
>
Well, I think x86-32 is unlikely to ever really go away.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28 15:29             ` Glauber Costa
@ 2011-09-28 15:33               ` Peter Zijlstra
  2011-09-28 15:35                 ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-28 15:33 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Martin Schwidefsky, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On Wed, 2011-09-28 at 12:29 -0300, Glauber Costa wrote:
> On 09/28/2011 09:53 AM, Peter Zijlstra wrote:
> > On Wed, 2011-09-28 at 14:42 +0200, Martin Schwidefsky wrote:
> >
> >>> That is, am I missing some added value of all this cputime*() foo?
> >>
> >> C can do the math as long as the encoding of the cputime is simple enough.
> >> Can we demand that a cputime value needs to be an integral type ?
> >
> > I'd like to think we can ;-)
> >
> >> What I did when I wrote all that stuff is to define cputime_t as a struct
> >> that contains a single u64. That way I found all the places in the kernel
> >> that used a cputime and could convert the code accordingly.
> >
> > Indeed, that makes it a non-simple type and breaks all the C arith bits.
> >
> >> My fear is that if the cputime_xxx operations are removed, code will
> >> sneak in again that just uses an unsigned long instead of a cputime_t.
> >> That would break any arch that requires something bigger than a u32 for
> >> its cputime.
> >
> > Which is only a problem for 32bit archs, of which s390 is the only one
> > that matters, right? Hurm,. could we do something with sparse? Lots of
> > people run sparse.
> >
> Well, I think x86-32 is unlikely to ever really go away.

Sadly I'd agree with you, but that's not really the point: the only 
32-bit arch that has a !32-bit cputime_t is s390.

But yeah, death to ia32 (and everything else 32bit fwiw)!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28 15:33               ` Peter Zijlstra
@ 2011-09-28 15:35                 ` Glauber Costa
  2011-09-28 15:37                   ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On 09/28/2011 12:33 PM, Peter Zijlstra wrote:
> On Wed, 2011-09-28 at 12:29 -0300, Glauber Costa wrote:
>> On 09/28/2011 09:53 AM, Peter Zijlstra wrote:
>>> On Wed, 2011-09-28 at 14:42 +0200, Martin Schwidefsky wrote:
>>>
>>>>> That is, am I missing some added value of all this cputime*() foo?
>>>>
>>>> C can do the math as long as the encoding of the cputime is simple enough.
>>>> Can we demand that a cputime value needs to be an integral type ?
>>>
>>> I'd like to think we can ;-)
>>>
>>>> What I did when I wrote all that stuff is to define cputime_t as a struct
>>>> that contains a single u64. That way I found all the places in the kernel
>>>> that used a cputime and could convert the code accordingly.
>>>
>>> Indeed, that makes it a non-simple type and breaks all the C arith bits.
>>>
>>>> My fear is that if the cputime_xxx operations are removed, code will
>>>> sneak in again that just uses an unsigned long instead of a cputime_t.
>>>> That would break any arch that requires something bigger than a u32 for
>>>> its cputime.
>>>
>>> Which is only a problem for 32bit archs, of which s390 is the only one
>>> that matters, right? Hurm,. could we do something with sparse? Lots of
>>> people run sparse.
>>>
>> Well, I think x86-32 is unlikely to ever really go away.
>
> Sadly I'd agree with you, but that's not really the point, the only 32
> bit arch that has !32 bit cputime_t is s390.

Ah, I see.

Right, then.

So let me get this straight: the proposal here is really to get rid of 
all cputime_t, not only cputime64_t?

>
> But yeah, death to ia32 (and everything else 32bit fwiw)!
Well, we need to kill the 16bit stuff still lying around first =)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28 15:35                 ` Glauber Costa
@ 2011-09-28 15:37                   ` Peter Zijlstra
  2011-09-28 15:39                     ` Glauber Costa
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-28 15:37 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Martin Schwidefsky, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On Wed, 2011-09-28 at 12:35 -0300, Glauber Costa wrote:
> 
> So let me get this straight: The proposal here is really to get rid of 
> all cputime_t , not only cputime64_t ? 

For now cputime64_t would be sufficient; I just wanted to extend the 
argument with Martin to see if we really need the cputime*() trickery 
at all.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 4/9] Make total_forks per-cgroup
  2011-09-28 15:37                   ` Peter Zijlstra
@ 2011-09-28 15:39                     ` Glauber Costa
  0 siblings, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Martin Schwidefsky, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, Heiko Carstens

On 09/28/2011 12:37 PM, Peter Zijlstra wrote:
> On Wed, 2011-09-28 at 12:35 -0300, Glauber Costa wrote:
>>
>> So let me get this straight: The proposal here is really to get rid of
>> all cputime_t , not only cputime64_t ?
>
> For now cputime64_t would be sufficient, just wanted to extend the
> argument with Martin to see if we really need the cputime*() trickery at
> all.
>
>
Well, at least the 64-bit part I can do myself in this series, since I 
am already touching it. I'd keep a transition point at which we convert 
ticks to u64 whenever needed, and we can get rid of it later if you 
really do ditch cputime_t.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 1/9] Change cpustat fields to an array.
  2011-09-27 21:00   ` Peter Zijlstra
  2011-09-28 15:13     ` Glauber Costa
@ 2011-09-28 18:19     ` Glauber Costa
  2011-09-28 19:09       ` Peter Zijlstra
  1 sibling, 1 reply; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 18:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/27/2011 06:00 PM, Peter Zijlstra wrote:
> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>>   /* Must have preemption disabled for this to be meaningful. */
>> -#define kstat_this_cpu __get_cpu_var(kstat)
>> +#define kstat_this_cpu this_cpu_ptr(task_group_kstat(current))
>
> This just lost you a debug check, the former would whinge when called
> without preemption, the new one wont. Its part of the this_cpu feature
> set to make debugging impossible.
>
>> +#else
>> +#define kstat_cpu(cpu) per_cpu(kstat, cpu)
>> +#define kstat_this_cpu (&__get_cpu_var(kstat))
>> +#endif
>>
>>   extern unsigned long long nr_context_switches(void);
>>
>> @@ -52,8 +62,8 @@ struct irq_desc;
>>   static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
>>                                              struct irq_desc *desc)
>>   {
>> -       __this_cpu_inc(kstat.irqs[irq]);
>> -       __this_cpu_inc(kstat.irqs_sum);
>> +       kstat_this_cpu->irqs[irq]++;
>> +       kstat_this_cpu->irqs_sum++;
>
> It might be worth looking at the asm output of that, I think you made it
> worse, but I'm not quite sure how smart gcc is, it might just figure out
> what you meant.

I'd say leave it alone.
The biggest difference is that we don't have access to task_group(), or 
to any of the fields in struct task_group. Because of that, we end up 
having to export a function to deal with it.

Users inside sched.c won't have this problem. Outside of it, we'll add a 
function call to some paths. True, mostly handle_irq paths, but I don't 
think that's what's going to kill us.

Now if we really really want to save it, we'd have to move struct 
task_group and its friends to a more visible location like a header...



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 1/9] Change cpustat fields to an array.
  2011-09-28 18:19     ` Glauber Costa
@ 2011-09-28 19:09       ` Peter Zijlstra
  2011-09-28 20:04         ` Glauber Costa
  2011-10-01 17:47         ` Glauber Costa
  0 siblings, 2 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-09-28 19:09 UTC (permalink / raw)
  To: Glauber Costa; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On Wed, 2011-09-28 at 15:19 -0300, Glauber Costa wrote:
> On 09/27/2011 06:00 PM, Peter Zijlstra wrote:
> > On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
> >>   /* Must have preemption disabled for this to be meaningful. */
> >> -#define kstat_this_cpu __get_cpu_var(kstat)
> >> +#define kstat_this_cpu this_cpu_ptr(task_group_kstat(current))
> >
> > This just lost you a debug check, the former would whinge when called
> > without preemption, the new one wont. Its part of the this_cpu feature
> > set to make debugging impossible.
> >
> >> +#else
> >> +#define kstat_cpu(cpu) per_cpu(kstat, cpu)
> >> +#define kstat_this_cpu (&__get_cpu_var(kstat))
> >> +#endif
> >>
> >>   extern unsigned long long nr_context_switches(void);
> >>
> >> @@ -52,8 +62,8 @@ struct irq_desc;
> >>   static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
> >>                                              struct irq_desc *desc)
> >>   {
> >> -       __this_cpu_inc(kstat.irqs[irq]);
> >> -       __this_cpu_inc(kstat.irqs_sum);
> >> +       kstat_this_cpu->irqs[irq]++;
> >> +       kstat_this_cpu->irqs_sum++;
> >
> > It might be worth looking at the asm output of that, I think you made it
> > worse, but I'm not quite sure how smart gcc is, it might just figure out
> > what you meant.
> 
> I'd say leave it alone.
> The biggest difference is that we don't have access to task_group(), or 
> any of the fields in struct task_group. Because of that, we end up 
> having to export a function to do the job of dealing with it.
> 
> Users inside sched.c won't have this problem. Outside of it, we'll add a 
> call to some paths. True, mostly handle_irq paths, but I don't think 
> that's what's going to kill us.
> 
> Now if we really really want to save it, we'd have to move struct 
> task_group and its friends to a more visible location like a header...

I'm not quite getting how task_group is relevant here.

The above will do something like:

	mov gs:$per-cpu-offset-of-kstat, reg
	inc reg + idx*8

whereas __this_cpu_inc() could end up like:

	inc gs:$per-cpu-offset-of-kstat + idx*8

or whatnot. Now clearly gcc could be smart and optimize the temporary
reg thing away in the earlier case, or it might not, I really don't know
how smart that thing is.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 1/9] Change cpustat fields to an array.
  2011-09-28 19:09       ` Peter Zijlstra
@ 2011-09-28 20:04         ` Glauber Costa
  2011-10-01 17:47         ` Glauber Costa
  1 sibling, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-09-28 20:04 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/28/2011 04:09 PM, Peter Zijlstra wrote:
> On Wed, 2011-09-28 at 15:19 -0300, Glauber Costa wrote:
>> On 09/27/2011 06:00 PM, Peter Zijlstra wrote:
>>> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>>>>    /* Must have preemption disabled for this to be meaningful. */
>>>> -#define kstat_this_cpu __get_cpu_var(kstat)
>>>> +#define kstat_this_cpu this_cpu_ptr(task_group_kstat(current))
>>>
>>> This just lost you a debug check, the former would whinge when called
>>> without preemption, the new one wont. Its part of the this_cpu feature
>>> set to make debugging impossible.
>>>
>>>> +#else
>>>> +#define kstat_cpu(cpu) per_cpu(kstat, cpu)
>>>> +#define kstat_this_cpu (&__get_cpu_var(kstat))
>>>> +#endif
>>>>
>>>>    extern unsigned long long nr_context_switches(void);
>>>>
>>>> @@ -52,8 +62,8 @@ struct irq_desc;
>>>>    static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
>>>>                                               struct irq_desc *desc)
>>>>    {
>>>> -       __this_cpu_inc(kstat.irqs[irq]);
>>>> -       __this_cpu_inc(kstat.irqs_sum);
>>>> +       kstat_this_cpu->irqs[irq]++;
>>>> +       kstat_this_cpu->irqs_sum++;
>>>
>>> It might be worth looking at the asm output of that, I think you made it
>>> worse, but I'm not quite sure how smart gcc is, it might just figure out
>>> what you meant.
>>
>> I'd say leave it alone.
>> The biggest difference is that we don't have access to task_group(), or
>> any of the fields in struct task_group. Because of that, we end up
>> having to export a function to do the job of dealing with it.
>>
>> Users inside sched.c won't have this problem. Outside of it, we'll add a
>> call to some paths. True, mostly handle_irq paths, but I don't think
>> that's what's going to kill us.
>>
>> Now if we really really want to save it, we'd have to move struct
>> task_group and its friends to a more visible location like a header...
>
> I'm not quite getting how task_group is relevant here.
>
> The above will do something like:
>
> 	mov gs:$per-cpu-offset-of-kstat, reg
> 	inc reg + idx*8

Except that the offset of kstat is not fixed. It is dynamically 
allocated, so you have to calculate it. You can't really do that without 
knowing at least the address of the task_group, and then the offset of 
kstat inside it.
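
In other words (a rough sketch; task_group_kstat() is the helper 
introduced by the patch), the dynamic case has to load the 
task_group-relative base pointer before the per-cpu offset can be 
applied:

	/* static per-cpu variable: can become a single %gs-relative inc */
	__this_cpu_inc(kstat.irqs_sum);

	/* per-cpu area allocated per task_group: the base pointer must be
	 * fetched first, so gcc cannot fold this into one instruction */
	this_cpu_ptr(task_group_kstat(current))->irqs_sum++;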


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFD 1/9] Change cpustat fields to an array.
  2011-09-28 19:09       ` Peter Zijlstra
  2011-09-28 20:04         ` Glauber Costa
@ 2011-10-01 17:47         ` Glauber Costa
  1 sibling, 0 replies; 44+ messages in thread
From: Glauber Costa @ 2011-10-01 17:47 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, paul, lizf, daniel.lezcano, jbottomley

On 09/28/2011 11:09 PM, Peter Zijlstra wrote:
> On Wed, 2011-09-28 at 15:19 -0300, Glauber Costa wrote:
>> On 09/27/2011 06:00 PM, Peter Zijlstra wrote:
>>> On Fri, 2011-09-23 at 19:20 -0300, Glauber Costa wrote:
>>>>    /* Must have preemption disabled for this to be meaningful. */
>>>> -#define kstat_this_cpu __get_cpu_var(kstat)
>>>> +#define kstat_this_cpu this_cpu_ptr(task_group_kstat(current))
>>>
>>> This just lost you a debug check, the former would whinge when called
>>> without preemption, the new one wont. Its part of the this_cpu feature
>>> set to make debugging impossible.
>>>
>>>> +#else
>>>> +#define kstat_cpu(cpu) per_cpu(kstat, cpu)
>>>> +#define kstat_this_cpu (&__get_cpu_var(kstat))
>>>> +#endif
>>>>
>>>>    extern unsigned long long nr_context_switches(void);
>>>>
>>>> @@ -52,8 +62,8 @@ struct irq_desc;
>>>>    static inline void kstat_incr_irqs_this_cpu(unsigned int irq,
>>>>                                               struct irq_desc *desc)
>>>>    {
>>>> -       __this_cpu_inc(kstat.irqs[irq]);
>>>> -       __this_cpu_inc(kstat.irqs_sum);
>>>> +       kstat_this_cpu->irqs[irq]++;
>>>> +       kstat_this_cpu->irqs_sum++;
>>>
>>> It might be worth looking at the asm output of that, I think you made it
>>> worse, but I'm not quite sure how smart gcc is, it might just figure out
>>> what you meant.
>>
>> I'd say leave it alone.
>> The biggest difference is that we don't have access to task_group(), or
>> any of the fields in struct task_group. Because of that, we end up
>> having to export a function to do the job of dealing with it.
>>
>> Users inside sched.c won't have this problem. Outside of it, we'll add a
>> call to some paths. True, mostly handle_irq paths, but I don't think
>> that's what's going to kill us.
>>
>> Now if we really really want to save it, we'd have to move struct
>> task_group and its friends to a more visible location like a header...
>
> I'm not quite getting how task_group is relevant here.
>
> The above will do something like:
>
> 	mov gs:$per-cpu-offset-of-kstat, reg
> 	inc reg + idx*8
>
> whereas __this_cpu_inc() could end up like:
>
> 	inc gs:$per-cpu-offset-of-kstat + idx*8
>
> or whatnot. Now clearly gcc could be smart and optimize the temporary
> reg thing away in the earlier case, or it might not, I really don't know
> how smart that thing is.
Btw, the asm output with CGROUP_SCHED disabled seems to be no worse 
than what is there now.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2011-10-01 17:48 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-23 22:20 [RFD 0/9] per-cgroup /proc/stat statistics Glauber Costa
2011-09-23 22:20 ` [RFD 1/9] Change cpustat fields to an array Glauber Costa
2011-09-27 21:00   ` Peter Zijlstra
2011-09-28 15:13     ` Glauber Costa
2011-09-28 15:23       ` Peter Zijlstra
2011-09-28 18:19     ` Glauber Costa
2011-09-28 19:09       ` Peter Zijlstra
2011-09-28 20:04         ` Glauber Costa
2011-10-01 17:47         ` Glauber Costa
2011-09-27 21:03   ` Peter Zijlstra
2011-09-28 15:14     ` Glauber Costa
2011-09-23 22:20 ` [RFD 2/9] Move /proc/stat logic inside sched.c Glauber Costa
2011-09-23 22:20 ` [RFD 3/9] Display /proc/stat information per cgroup Glauber Costa
2011-09-27 17:01   ` Balbir Singh
2011-09-27 18:42     ` Glauber Costa
2011-09-27 22:21       ` Peter Zijlstra
2011-09-28 15:22         ` Glauber Costa
2011-09-28 15:23         ` Glauber Costa
2011-09-27 21:48   ` Peter Zijlstra
2011-09-28 15:14     ` Glauber Costa
2011-09-27 21:52   ` Peter Zijlstra
2011-09-28 15:15     ` Glauber Costa
2011-09-23 22:20 ` [RFD 4/9] Make total_forks per-cgroup Glauber Costa
2011-09-27 22:00   ` Peter Zijlstra
2011-09-28  8:13     ` Martin Schwidefsky
2011-09-28 10:35       ` Peter Zijlstra
2011-09-28 12:42         ` Martin Schwidefsky
2011-09-28 12:53           ` Peter Zijlstra
2011-09-28 15:29             ` Glauber Costa
2011-09-28 15:33               ` Peter Zijlstra
2011-09-28 15:35                 ` Glauber Costa
2011-09-28 15:37                   ` Peter Zijlstra
2011-09-28 15:39                     ` Glauber Costa
2011-09-28 15:28           ` Glauber Costa
2011-09-28 15:27         ` Glauber Costa
2011-09-28 15:26       ` Glauber Costa
2011-09-23 22:20 ` [RFD 5/9] per-cgroup boot time Glauber Costa
2011-09-23 22:20 ` [RFD 6/9] Report steal time for cgroup Glauber Costa
2011-09-23 22:20 ` [RFD 7/9] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
2011-09-23 22:20 ` [RFD 8/9] provide a version of cpuusage " Glauber Costa
2011-09-23 22:20 ` [RFD 9/9] Change CPUACCT to default n Glauber Costa
2011-09-27 22:11 ` [RFD 0/9] per-cgroup /proc/stat statistics Peter Zijlstra
2011-09-28 15:21   ` Glauber Costa
2011-09-28 15:27     ` Peter Zijlstra
