* [PATCH v2 00/14] per-cgroup /proc/stat
@ 2011-11-01 21:19 Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 01/14] trivial: initialize root cgroup's sibling list Glauber Costa
                   ` (13 more replies)
  0 siblings, 14 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt, fweisbec

Hi,

This is a new shot of the per-cgroup /proc/stat series.
I believe this series is a huge improvement over the last one.
Basically, there is a control file in the root cgroup that controls
whether or not statistics will be collected in a per-cgroup fashion.
Only when it is on is a jump label enabled, allowing the "run to parent"
walk to happen. The parent cgroup's kstat coincides with the system-wide
kstat, so we don't need special code for the case where cgroups are
disabled, either at runtime or at compile time.

cpuusage and cpuacct statistics are now also available in the cpu cgroup.
Since the task_group structure is already walked by the scheduler anyway,
by keeping everything in the same cgroup, we avoid multiple walks.

I also have a follow-up patch that displays the information in this file
automatically in /proc/stat when requested, but I am not including it in
this series. I'd like to give it some more thought: since I am pretty
stubborn, I haven't yet given up on coming up with a way to tie namespaces
and cgroups together in a good fashion for these cases.
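
For illustration, here is a minimal sketch of the accounting path described
above. It is not the actual patch code: the flag, the helper and the
tg->cpustat field are made-up names, and the series gates the walk with a
jump label rather than a plain flag.

	/*
	 * Hypothetical sketch only.  cgroup_cpustats_enabled, account_field()
	 * and tg->cpustat are illustrative names, not taken from the series;
	 * tg->cpustat is assumed to be an alloc_percpu'd struct kernel_cpustat.
	 */
	static bool cgroup_cpustats_enabled;	/* toggled via the root cgroup file */

	static void account_field(struct task_struct *p, int index, u64 tmp)
	{
		struct kernel_cpustat *kcpustat;
		struct task_group *tg;

		/* the root group's kstat is the system-wide kstat */
		kcpustat_this_cpu->cpustat[index] += tmp;

		if (!cgroup_cpustats_enabled)
			return;

		/* "run to parent": charge every ancestor task group as well */
		rcu_read_lock();
		for (tg = task_group(p); tg != &root_task_group; tg = tg->parent) {
			kcpustat = this_cpu_ptr(tg->cpustat);
			kcpustat->cpustat[index] += tmp;
		}
		rcu_read_unlock();
	}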


Glauber Costa (14):
  trivial: initialize root cgroup's sibling list
  Change cpustat fields to an array.
  Move /proc/stat logic inside sched.c
  split kernel stat in two
  Display /proc/stat information per cgroup
  Make total_forks per-cgroup
  per-cgroup boot time
  Report steal time for cgroup
  Keep nr_iowait per cgroup
  Keep number of context switches per-cgroup
  provide a version of cpuacct statistics inside cpu cgroup
  Keep number of running processes per-cgroup
  provide a version of cpuusage statistics inside cpu cgroup
  Change CPUACCT to default n

 Documentation/feature-removal-schedule.txt |    8 +
 arch/s390/appldata/appldata_os.c           |   18 +-
 arch/x86/include/asm/i387.h                |    2 +-
 drivers/cpufreq/cpufreq_conservative.c     |   33 +-
 drivers/cpufreq/cpufreq_ondemand.c         |   33 +-
 drivers/macintosh/rack-meter.c             |    8 +-
 fs/proc/stat.c                             |  144 +------
 fs/proc/uptime.c                           |    8 +-
 include/linux/kernel_stat.h                |   56 ++-
 include/linux/sched.h                      |    5 +
 init/Kconfig                               |    1 +
 kernel/fork.c                              |    7 +-
 kernel/sched.c                             |  660 +++++++++++++++++++++++++---
 kernel/sched_debug.c                       |    3 +-
 kernel/sched_fair.c                        |   10 +
 kernel/sched_rt.c                          |    4 +
 16 files changed, 731 insertions(+), 269 deletions(-)

-- 
1.7.6.4



* [PATCH v2 01/14] trivial: initialize root cgroup's sibling list
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-11 21:34   ` Paul Turner
  2011-11-18 23:42   ` [tip:sched/core] sched, trivial: Initialize " tip-bot for Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 02/14] Change cpustat fields to an array Glauber Costa
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

Even though the root task group has no siblings, its sibling list should
be initialized so that it does not contain bogus values.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Paul Menage <paul@paulmenage.org>
---
 kernel/sched.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d87c6e5..e327665 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8234,6 +8234,7 @@ void __init sched_init(void)
 #ifdef CONFIG_CGROUP_SCHED
 	list_add(&root_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&root_task_group.children);
+	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
 #endif /* CONFIG_CGROUP_SCHED */
 
-- 
1.7.6.4



* [PATCH v2 02/14] Change cpustat fields to an array.
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 01/14] trivial: initialize root cgroup's sibling list Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 03/14] Move /proc/stat logic inside sched.c Glauber Costa
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

This will give us a bit more flexibility to deal with the
fields in this structure. This is a preparation patch for
later patches in this series.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 arch/s390/appldata/appldata_os.c       |   16 ++++----
 arch/x86/include/asm/i387.h            |    2 +-
 drivers/cpufreq/cpufreq_conservative.c |   23 +++++-----
 drivers/cpufreq/cpufreq_ondemand.c     |   23 +++++-----
 drivers/macintosh/rack-meter.c         |    6 +-
 fs/proc/stat.c                         |   63 +++++++++++++---------------
 fs/proc/uptime.c                       |    2 +-
 include/linux/kernel_stat.h            |   30 +++++++------
 kernel/sched.c                         |   71 ++++++++++++++++----------------
 9 files changed, 116 insertions(+), 120 deletions(-)

diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 92f1cb7..3d6b672 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -115,21 +115,21 @@ static void appldata_get_os_data(void *data)
 	j = 0;
 	for_each_online_cpu(i) {
 		os_data->os_cpu[j].per_cpu_user =
-			cputime_to_jiffies(kstat_cpu(i).cpustat.user);
+			cputime_to_jiffies(kstat_cpu(i).cpustat[USER]);
 		os_data->os_cpu[j].per_cpu_nice =
-			cputime_to_jiffies(kstat_cpu(i).cpustat.nice);
+			cputime_to_jiffies(kstat_cpu(i).cpustat[NICE]);
 		os_data->os_cpu[j].per_cpu_system =
-			cputime_to_jiffies(kstat_cpu(i).cpustat.system);
+			cputime_to_jiffies(kstat_cpu(i).cpustat[SYSTEM]);
 		os_data->os_cpu[j].per_cpu_idle =
-			cputime_to_jiffies(kstat_cpu(i).cpustat.idle);
+			cputime_to_jiffies(kstat_cpu(i).cpustat[IDLE]);
 		os_data->os_cpu[j].per_cpu_irq =
-			cputime_to_jiffies(kstat_cpu(i).cpustat.irq);
+			cputime_to_jiffies(kstat_cpu(i).cpustat[IRQ]);
 		os_data->os_cpu[j].per_cpu_softirq =
-			cputime_to_jiffies(kstat_cpu(i).cpustat.softirq);
+			cputime_to_jiffies(kstat_cpu(i).cpustat[SOFTIRQ]);
 		os_data->os_cpu[j].per_cpu_iowait =
-			cputime_to_jiffies(kstat_cpu(i).cpustat.iowait);
+			cputime_to_jiffies(kstat_cpu(i).cpustat[IOWAIT]);
 		os_data->os_cpu[j].per_cpu_steal =
-			cputime_to_jiffies(kstat_cpu(i).cpustat.steal);
+			cputime_to_jiffies(kstat_cpu(i).cpustat[STEAL]);
 		os_data->os_cpu[j].cpu_id = i;
 		j++;
 	}
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index c9e09ea..56fa4d7 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -218,7 +218,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
 #ifdef CONFIG_SMP
 #define safe_address (__per_cpu_offset[0])
 #else
-#define safe_address (kstat_cpu(0).cpustat.user)
+#define safe_address (kstat_cpu(0).cpustat[USER])
 #endif
 
 /*
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index c97b468..2ab538f 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -103,13 +103,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
 	cputime64_t busy_time;
 
 	cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
-	busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
-			kstat_cpu(cpu).cpustat.system);
+	busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
+			kstat_cpu(cpu).cpustat[SYSTEM]);
 
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
+	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
+	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
+	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
+	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
 
 	idle_time = cputime64_sub(cur_wall_time, busy_time);
 	if (wall)
@@ -272,7 +272,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
 		dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&dbs_info->prev_cpu_wall);
 		if (dbs_tuners_ins.ignore_nice)
-			dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+			dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
 	}
 	return count;
 }
@@ -365,7 +365,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cputime64_t cur_nice;
 			unsigned long cur_nice_jiffies;
 
-			cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
+			cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
 					 j_dbs_info->prev_cpu_nice);
 			/*
 			 * Assumption: nice time between sampling periods will
@@ -374,7 +374,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cur_nice_jiffies = (unsigned long)
 					cputime64_to_jiffies64(cur_nice);
 
-			j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+			j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
 			idle_time += jiffies_to_usecs(cur_nice_jiffies);
 		}
 
@@ -501,10 +501,9 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 
 			j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&j_dbs_info->prev_cpu_wall);
-			if (dbs_tuners_ins.ignore_nice) {
+			if (dbs_tuners_ins.ignore_nice)
 				j_dbs_info->prev_cpu_nice =
-						kstat_cpu(j).cpustat.nice;
-			}
+						kstat_cpu(j).cpustat[NICE];
 		}
 		this_dbs_info->down_skip = 0;
 		this_dbs_info->requested_freq = policy->cur;
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index fa8af4e..45d8e17 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -127,13 +127,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
 	cputime64_t busy_time;
 
 	cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
-	busy_time = cputime64_add(kstat_cpu(cpu).cpustat.user,
-			kstat_cpu(cpu).cpustat.system);
+	busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
+			kstat_cpu(cpu).cpustat[SYSTEM]);
 
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.irq);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.softirq);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.steal);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat.nice);
+	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
+	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
+	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
+	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
 
 	idle_time = cputime64_sub(cur_wall_time, busy_time);
 	if (wall)
@@ -345,7 +345,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
 		dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&dbs_info->prev_cpu_wall);
 		if (dbs_tuners_ins.ignore_nice)
-			dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+			dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
 
 	}
 	return count;
@@ -458,7 +458,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cputime64_t cur_nice;
 			unsigned long cur_nice_jiffies;
 
-			cur_nice = cputime64_sub(kstat_cpu(j).cpustat.nice,
+			cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
 					 j_dbs_info->prev_cpu_nice);
 			/*
 			 * Assumption: nice time between sampling periods will
@@ -467,7 +467,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cur_nice_jiffies = (unsigned long)
 					cputime64_to_jiffies64(cur_nice);
 
-			j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat.nice;
+			j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
 			idle_time += jiffies_to_usecs(cur_nice_jiffies);
 		}
 
@@ -646,10 +646,9 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 
 			j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&j_dbs_info->prev_cpu_wall);
-			if (dbs_tuners_ins.ignore_nice) {
+			if (dbs_tuners_ins.ignore_nice)
 				j_dbs_info->prev_cpu_nice =
-						kstat_cpu(j).cpustat.nice;
-			}
+						kstat_cpu(j).cpustat[NICE];
 		}
 		this_dbs_info->cpu = cpu;
 		this_dbs_info->rate_mult = 1;
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index 2637c13..c80e49a 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -83,11 +83,11 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
 {
 	cputime64_t retval;
 
-	retval = cputime64_add(kstat_cpu(cpu).cpustat.idle,
-			kstat_cpu(cpu).cpustat.iowait);
+	retval = cputime64_add(kstat_cpu(cpu).cpustat[IDLE],
+			kstat_cpu(cpu).cpustat[IOWAIT]);
 
 	if (rackmeter_ignore_nice)
-		retval = cputime64_add(retval, kstat_cpu(cpu).cpustat.nice);
+		retval = cputime64_add(retval, kstat_cpu(cpu).cpustat[NICE]);
 
 	return retval;
 }
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 42b274d..b7b74ad 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -22,29 +22,27 @@
 #define arch_idle_time(cpu) 0
 #endif
 
-static cputime64_t get_idle_time(int cpu)
+static u64 get_idle_time(int cpu)
 {
-	u64 idle_time = get_cpu_idle_time_us(cpu, NULL);
-	cputime64_t idle;
+	u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
 
 	if (idle_time == -1ULL) {
 		/* !NO_HZ so we can rely on cpustat.idle */
-		idle = kstat_cpu(cpu).cpustat.idle;
-		idle = cputime64_add(idle, arch_idle_time(cpu));
+		idle = kstat_cpu(cpu).cpustat[IDLE];
+		idle += arch_idle_time(cpu);
 	} else
 		idle = usecs_to_cputime(idle_time);
 
 	return idle;
 }
 
-static cputime64_t get_iowait_time(int cpu)
+static u64 get_iowait_time(int cpu)
 {
-	u64 iowait_time = get_cpu_iowait_time_us(cpu, NULL);
-	cputime64_t iowait;
+	u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
 
 	if (iowait_time == -1ULL)
 		/* !NO_HZ so we can rely on cpustat.iowait */
-		iowait = kstat_cpu(cpu).cpustat.iowait;
+		iowait = kstat_cpu(cpu).cpustat[IOWAIT];
 	else
 		iowait = usecs_to_cputime(iowait_time);
 
@@ -55,33 +53,30 @@ static int show_stat(struct seq_file *p, void *v)
 {
 	int i, j;
 	unsigned long jif;
-	cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;
-	cputime64_t guest, guest_nice;
+	u64 user, nice, system, idle, iowait, irq, softirq, steal;
+	u64 guest, guest_nice;
 	u64 sum = 0;
 	u64 sum_softirq = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
 
 	user = nice = system = idle = iowait =
-		irq = softirq = steal = cputime64_zero;
-	guest = guest_nice = cputime64_zero;
+		irq = softirq = steal = 0;
+	guest = guest_nice = 0;
 	getboottime(&boottime);
 	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
-		user = cputime64_add(user, kstat_cpu(i).cpustat.user);
-		nice = cputime64_add(nice, kstat_cpu(i).cpustat.nice);
-		system = cputime64_add(system, kstat_cpu(i).cpustat.system);
-		idle = cputime64_add(idle, get_idle_time(i));
-		iowait = cputime64_add(iowait, get_iowait_time(i));
-		irq = cputime64_add(irq, kstat_cpu(i).cpustat.irq);
-		softirq = cputime64_add(softirq, kstat_cpu(i).cpustat.softirq);
-		steal = cputime64_add(steal, kstat_cpu(i).cpustat.steal);
-		guest = cputime64_add(guest, kstat_cpu(i).cpustat.guest);
-		guest_nice = cputime64_add(guest_nice,
-			kstat_cpu(i).cpustat.guest_nice);
-		sum += kstat_cpu_irqs_sum(i);
-		sum += arch_irq_stat_cpu(i);
+		user += kstat_cpu(i).cpustat[USER];
+		nice += kstat_cpu(i).cpustat[NICE];
+		system += kstat_cpu(i).cpustat[SYSTEM];
+		idle += get_idle_time(i);
+		iowait += get_iowait_time(i);
+		irq += kstat_cpu(i).cpustat[IRQ];
+		softirq += kstat_cpu(i).cpustat[SOFTIRQ];
+		steal += kstat_cpu(i).cpustat[STEAL];
+		guest += kstat_cpu(i).cpustat[GUEST];
+		guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
 			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -106,16 +101,16 @@ static int show_stat(struct seq_file *p, void *v)
 		(unsigned long long)cputime64_to_clock_t(guest_nice));
 	for_each_online_cpu(i) {
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kstat_cpu(i).cpustat.user;
-		nice = kstat_cpu(i).cpustat.nice;
-		system = kstat_cpu(i).cpustat.system;
+		user = kstat_cpu(i).cpustat[USER];
+		nice = kstat_cpu(i).cpustat[NICE];
+		system = kstat_cpu(i).cpustat[SYSTEM];
 		idle = get_idle_time(i);
 		iowait = get_iowait_time(i);
-		irq = kstat_cpu(i).cpustat.irq;
-		softirq = kstat_cpu(i).cpustat.softirq;
-		steal = kstat_cpu(i).cpustat.steal;
-		guest = kstat_cpu(i).cpustat.guest;
-		guest_nice = kstat_cpu(i).cpustat.guest_nice;
+		irq = kstat_cpu(i).cpustat[IRQ];
+		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
+		steal = kstat_cpu(i).cpustat[STEAL];
+		guest = kstat_cpu(i).cpustat[GUEST];
+		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
 		seq_printf(p,
 			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
 			"%llu\n",
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index 766b1d4..b0e053d 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -15,7 +15,7 @@ static int uptime_proc_show(struct seq_file *m, void *v)
 	cputime_t idletime = cputime_zero;
 
 	for_each_possible_cpu(i)
-		idletime = cputime64_add(idletime, kstat_cpu(i).cpustat.idle);
+		idletime = cputime64_add(idletime, kstat_cpu(i).cpustat[IDLE]);
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	monotonic_to_bootbased(&uptime);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 0cce2db..7bfd0fe 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -6,6 +6,7 @@
 #include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/interrupt.h>
+#include <linux/sched.h>
 #include <asm/irq.h>
 #include <asm/cputime.h>
 
@@ -15,21 +16,22 @@
  * used by rstatd/perfmeter
  */
 
-struct cpu_usage_stat {
-	cputime64_t user;
-	cputime64_t nice;
-	cputime64_t system;
-	cputime64_t softirq;
-	cputime64_t irq;
-	cputime64_t idle;
-	cputime64_t iowait;
-	cputime64_t steal;
-	cputime64_t guest;
-	cputime64_t guest_nice;
+enum cpu_usage_stat {
+	USER,
+	NICE,
+	SYSTEM,
+	SOFTIRQ,
+	IRQ,
+	IDLE,
+	IOWAIT,
+	STEAL,
+	GUEST,
+	GUEST_NICE,
+	NR_STATS,
 };
 
 struct kernel_stat {
-	struct cpu_usage_stat	cpustat;
+	u64 cpustat[NR_STATS];
 #ifndef CONFIG_GENERIC_HARDIRQS
        unsigned int irqs[NR_IRQS];
 #endif
@@ -39,9 +41,9 @@ struct kernel_stat {
 
 DECLARE_PER_CPU(struct kernel_stat, kstat);
 
-#define kstat_cpu(cpu)	per_cpu(kstat, cpu)
 /* Must have preemption disabled for this to be meaningful. */
-#define kstat_this_cpu	__get_cpu_var(kstat)
+#define kstat_this_cpu (&__get_cpu_var(kstat))
+#define kstat_cpu(cpu) per_cpu(kstat, cpu)
 
 extern unsigned long long nr_context_switches(void);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index e327665..e78e1aa 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2158,14 +2158,14 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	u64 *cpustat = kstat_this_cpu->cpustat;
 	unsigned long flags;
 	u64 latest_ns;
 	int ret = 0;
 
 	local_irq_save(flags);
 	latest_ns = this_cpu_read(cpu_hardirq_time);
-	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->irq))
+	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
 		ret = 1;
 	local_irq_restore(flags);
 	return ret;
@@ -2173,14 +2173,14 @@ static int irqtime_account_hi_update(void)
 
 static int irqtime_account_si_update(void)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	u64 *cpustat = kstat_this_cpu->cpustat;
 	unsigned long flags;
 	u64 latest_ns;
 	int ret = 0;
 
 	local_irq_save(flags);
 	latest_ns = this_cpu_read(cpu_softirq_time);
-	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat->softirq))
+	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[SOFTIRQ]))
 		ret = 1;
 	local_irq_restore(flags);
 	return ret;
@@ -3866,8 +3866,8 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 void account_user_time(struct task_struct *p, cputime_t cputime,
 		       cputime_t cputime_scaled)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t tmp;
+	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 tmp;
 
 	/* Add user time to process. */
 	p->utime = cputime_add(p->utime, cputime);
@@ -3876,10 +3876,11 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add user time to cpustat. */
 	tmp = cputime_to_cputime64(cputime);
+
 	if (TASK_NICE(p) > 0)
-		cpustat->nice = cputime64_add(cpustat->nice, tmp);
+		cpustat[NICE] += tmp;
 	else
-		cpustat->user = cputime64_add(cpustat->user, tmp);
+		cpustat[USER] += tmp;
 
 	cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
 	/* Account for user time used */
@@ -3895,8 +3896,8 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 static void account_guest_time(struct task_struct *p, cputime_t cputime,
 			       cputime_t cputime_scaled)
 {
-	cputime64_t tmp;
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	u64 tmp;
+	u64 *cpustat = kstat_this_cpu->cpustat;
 
 	tmp = cputime_to_cputime64(cputime);
 
@@ -3908,11 +3909,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add guest time to cpustat. */
 	if (TASK_NICE(p) > 0) {
-		cpustat->nice = cputime64_add(cpustat->nice, tmp);
-		cpustat->guest_nice = cputime64_add(cpustat->guest_nice, tmp);
+		cpustat[NICE] += tmp;
+		cpustat[GUEST_NICE] += tmp;
 	} else {
-		cpustat->user = cputime64_add(cpustat->user, tmp);
-		cpustat->guest = cputime64_add(cpustat->guest, tmp);
+		cpustat[USER] += tmp;
+		cpustat[GUEST] += tmp;
 	}
 }
 
@@ -3925,9 +3926,9 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
  */
 static inline
 void __account_system_time(struct task_struct *p, cputime_t cputime,
-			cputime_t cputime_scaled, cputime64_t *target_cputime64)
+			cputime_t cputime_scaled, u64 *target_cputime64)
 {
-	cputime64_t tmp = cputime_to_cputime64(cputime);
+	u64 tmp = cputime_to_cputime64(cputime);
 
 	/* Add system time to process. */
 	p->stime = cputime_add(p->stime, cputime);
@@ -3935,7 +3936,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 	account_group_system_time(p, cputime);
 
 	/* Add system time to cpustat. */
-	*target_cputime64 = cputime64_add(*target_cputime64, tmp);
+	*target_cputime64 += tmp;
 	cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
 
 	/* Account for system time used */
@@ -3952,8 +3953,8 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 void account_system_time(struct task_struct *p, int hardirq_offset,
 			 cputime_t cputime, cputime_t cputime_scaled)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t *target_cputime64;
+	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *target_cputime64;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
@@ -3961,11 +3962,11 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	}
 
 	if (hardirq_count() - hardirq_offset)
-		target_cputime64 = &cpustat->irq;
+		target_cputime64 = &cpustat[IRQ];
 	else if (in_serving_softirq())
-		target_cputime64 = &cpustat->softirq;
+		target_cputime64 = &cpustat[SOFTIRQ];
 	else
-		target_cputime64 = &cpustat->system;
+		target_cputime64 = &cpustat[SYSTEM];
 
 	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
 }
@@ -3976,10 +3977,10 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
  */
 void account_steal_time(cputime_t cputime)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t cputime64 = cputime_to_cputime64(cputime);
+	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 cputime64 = cputime_to_cputime64(cputime);
 
-	cpustat->steal = cputime64_add(cpustat->steal, cputime64);
+	cpustat[STEAL] += cputime64;
 }
 
 /*
@@ -3988,14 +3989,14 @@ void account_steal_time(cputime_t cputime)
  */
 void account_idle_time(cputime_t cputime)
 {
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
-	cputime64_t cputime64 = cputime_to_cputime64(cputime);
+	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
 
 	if (atomic_read(&rq->nr_iowait) > 0)
-		cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
+		cpustat[IOWAIT] += cputime64;
 	else
-		cpustat->idle = cputime64_add(cpustat->idle, cputime64);
+		cpustat[IDLE] += cputime64;
 }
 
 static __always_inline bool steal_account_process_tick(void)
@@ -4045,16 +4046,16 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 						struct rq *rq)
 {
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
-	cputime64_t tmp = cputime_to_cputime64(cputime_one_jiffy);
-	struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
+	u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
+	u64 *cpustat = kstat_this_cpu->cpustat;
 
 	if (steal_account_process_tick())
 		return;
 
 	if (irqtime_account_hi_update()) {
-		cpustat->irq = cputime64_add(cpustat->irq, tmp);
+		cpustat[IRQ] += tmp;
 	} else if (irqtime_account_si_update()) {
-		cpustat->softirq = cputime64_add(cpustat->softirq, tmp);
+		cpustat[SOFTIRQ] += tmp;
 	} else if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
@@ -4062,7 +4063,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 		 * Also, p->stime needs to be updated for ksoftirqd.
 		 */
 		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat->softirq);
+					&cpustat[SOFTIRQ]);
 	} else if (user_tick) {
 		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else if (p == rq->idle) {
@@ -4071,7 +4072,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 		account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else {
 		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat->system);
+					&cpustat[SYSTEM]);
 	}
 }
 
-- 
1.7.6.4



* [PATCH v2 03/14] Move /proc/stat logic inside sched.c
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 01/14] trivial: initialize root cgroup's sibling list Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 02/14] Change cpustat fields to an array Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-12  1:35   ` Paul Turner
  2011-11-01 21:19 ` [PATCH v2 04/14] split kernel stat in two Glauber Costa
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

This patch moves all of the /proc/stat display code into
sched.c. The goal is to later have a different version
of it per cgroup. In container environments, this is useful
for giving each container a different and independent view of
the statistics displayed in this file.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 fs/proc/stat.c        |  139 +-----------------------------------------------
 include/linux/sched.h |    1 +
 kernel/sched.c        |  142 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 144 insertions(+), 138 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index b7b74ad..6b10387 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -10,147 +10,10 @@
 #include <linux/time.h>
 #include <linux/irqnr.h>
 #include <asm/cputime.h>
-#include <linux/tick.h>
-
-#ifndef arch_irq_stat_cpu
-#define arch_irq_stat_cpu(cpu) 0
-#endif
-#ifndef arch_irq_stat
-#define arch_irq_stat() 0
-#endif
-#ifndef arch_idle_time
-#define arch_idle_time(cpu) 0
-#endif
-
-static u64 get_idle_time(int cpu)
-{
-	u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
-
-	if (idle_time == -1ULL) {
-		/* !NO_HZ so we can rely on cpustat.idle */
-		idle = kstat_cpu(cpu).cpustat[IDLE];
-		idle += arch_idle_time(cpu);
-	} else
-		idle = usecs_to_cputime(idle_time);
-
-	return idle;
-}
-
-static u64 get_iowait_time(int cpu)
-{
-	u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
-
-	if (iowait_time == -1ULL)
-		/* !NO_HZ so we can rely on cpustat.iowait */
-		iowait = kstat_cpu(cpu).cpustat[IOWAIT];
-	else
-		iowait = usecs_to_cputime(iowait_time);
-
-	return iowait;
-}
 
 static int show_stat(struct seq_file *p, void *v)
 {
-	int i, j;
-	unsigned long jif;
-	u64 user, nice, system, idle, iowait, irq, softirq, steal;
-	u64 guest, guest_nice;
-	u64 sum = 0;
-	u64 sum_softirq = 0;
-	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
-	struct timespec boottime;
-
-	user = nice = system = idle = iowait =
-		irq = softirq = steal = 0;
-	guest = guest_nice = 0;
-	getboottime(&boottime);
-	jif = boottime.tv_sec;
-
-	for_each_possible_cpu(i) {
-		user += kstat_cpu(i).cpustat[USER];
-		nice += kstat_cpu(i).cpustat[NICE];
-		system += kstat_cpu(i).cpustat[SYSTEM];
-		idle += get_idle_time(i);
-		iowait += get_iowait_time(i);
-		irq += kstat_cpu(i).cpustat[IRQ];
-		softirq += kstat_cpu(i).cpustat[SOFTIRQ];
-		steal += kstat_cpu(i).cpustat[STEAL];
-		guest += kstat_cpu(i).cpustat[GUEST];
-		guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
-
-		for (j = 0; j < NR_SOFTIRQS; j++) {
-			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
-
-			per_softirq_sums[j] += softirq_stat;
-			sum_softirq += softirq_stat;
-		}
-	}
-	sum += arch_irq_stat();
-
-	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu %llu "
-		"%llu\n",
-		(unsigned long long)cputime64_to_clock_t(user),
-		(unsigned long long)cputime64_to_clock_t(nice),
-		(unsigned long long)cputime64_to_clock_t(system),
-		(unsigned long long)cputime64_to_clock_t(idle),
-		(unsigned long long)cputime64_to_clock_t(iowait),
-		(unsigned long long)cputime64_to_clock_t(irq),
-		(unsigned long long)cputime64_to_clock_t(softirq),
-		(unsigned long long)cputime64_to_clock_t(steal),
-		(unsigned long long)cputime64_to_clock_t(guest),
-		(unsigned long long)cputime64_to_clock_t(guest_nice));
-	for_each_online_cpu(i) {
-		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kstat_cpu(i).cpustat[USER];
-		nice = kstat_cpu(i).cpustat[NICE];
-		system = kstat_cpu(i).cpustat[SYSTEM];
-		idle = get_idle_time(i);
-		iowait = get_iowait_time(i);
-		irq = kstat_cpu(i).cpustat[IRQ];
-		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
-		steal = kstat_cpu(i).cpustat[STEAL];
-		guest = kstat_cpu(i).cpustat[GUEST];
-		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
-		seq_printf(p,
-			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
-			"%llu\n",
-			i,
-			(unsigned long long)cputime64_to_clock_t(user),
-			(unsigned long long)cputime64_to_clock_t(nice),
-			(unsigned long long)cputime64_to_clock_t(system),
-			(unsigned long long)cputime64_to_clock_t(idle),
-			(unsigned long long)cputime64_to_clock_t(iowait),
-			(unsigned long long)cputime64_to_clock_t(irq),
-			(unsigned long long)cputime64_to_clock_t(softirq),
-			(unsigned long long)cputime64_to_clock_t(steal),
-			(unsigned long long)cputime64_to_clock_t(guest),
-			(unsigned long long)cputime64_to_clock_t(guest_nice));
-	}
-	seq_printf(p, "intr %llu", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_printf(p, " %u", kstat_irqs(j));
-
-	seq_printf(p,
-		"\nctxt %llu\n"
-		"btime %lu\n"
-		"processes %lu\n"
-		"procs_running %lu\n"
-		"procs_blocked %lu\n",
-		nr_context_switches(),
-		(unsigned long)jif,
-		total_forks,
-		nr_running(),
-		nr_iowait());
-
-	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
-
-	for (i = 0; i < NR_SOFTIRQS; i++)
-		seq_printf(p, " %u", per_softirq_sums[i]);
-	seq_putc(p, '\n');
-
-	return 0;
+	return cpu_cgroup_proc_stat(p);
 }
 
 static int stat_open(struct inode *inode, struct file *file)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e8acce7..8311551 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2713,6 +2713,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+int cpu_cgroup_proc_stat(struct seq_file *p);
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/sched.c b/kernel/sched.c
index e78e1aa..3f42916 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -71,6 +71,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/tick.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -9494,6 +9495,147 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 
 #endif	/* CONFIG_CGROUP_SCHED */
 
+#ifndef arch_irq_stat_cpu
+#define arch_irq_stat_cpu(cpu) 0
+#endif
+#ifndef arch_irq_stat
+#define arch_irq_stat() 0
+#endif
+#ifndef arch_idle_time
+#define arch_idle_time(cpu) 0
+#endif
+
+static u64 get_idle_time(int cpu)
+{
+	u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
+
+	if (idle_time == -1ULL) {
+		/* !NO_HZ so we can rely on cpustat.idle */
+		idle = kstat_cpu(cpu).cpustat[IDLE];
+		idle += arch_idle_time(cpu);
+	} else
+		idle = usecs_to_cputime(idle_time);
+
+	return idle;
+}
+
+static u64 get_iowait_time(int cpu)
+{
+	u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
+
+	if (iowait_time == -1ULL)
+		/* !NO_HZ so we can rely on cpustat.iowait */
+		iowait = kstat_cpu(cpu).cpustat[IOWAIT];
+	else
+		iowait = usecs_to_cputime(iowait_time);
+
+	return iowait;
+}
+
+int cpu_cgroup_proc_stat(struct seq_file *p)
+{
+	int i, j;
+	unsigned long jif;
+	u64 user, nice, system, idle, iowait, irq, softirq, steal;
+	u64 guest, guest_nice;
+	u64 sum = 0;
+	u64 sum_softirq = 0;
+	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
+	struct timespec boottime;
+
+	user = nice = system = idle = iowait =
+		irq = softirq = steal = 0;
+	guest = guest_nice = 0;
+	getboottime(&boottime);
+	jif = boottime.tv_sec;
+
+	for_each_possible_cpu(i) {
+		user += kstat_cpu(i).cpustat[USER];
+		nice += kstat_cpu(i).cpustat[NICE];
+		system += kstat_cpu(i).cpustat[SYSTEM];
+		idle += get_idle_time(i);
+		iowait += get_iowait_time(i);
+		irq += kstat_cpu(i).cpustat[IRQ];
+		softirq += kstat_cpu(i).cpustat[SOFTIRQ];
+		steal += kstat_cpu(i).cpustat[STEAL];
+		guest += kstat_cpu(i).cpustat[GUEST];
+		guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
+
+		for (j = 0; j < NR_SOFTIRQS; j++) {
+			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
+
+			per_softirq_sums[j] += softirq_stat;
+			sum_softirq += softirq_stat;
+		}
+	}
+	sum += arch_irq_stat();
+
+	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu %llu "
+		"%llu\n",
+		(unsigned long long)cputime64_to_clock_t(user),
+		(unsigned long long)cputime64_to_clock_t(nice),
+		(unsigned long long)cputime64_to_clock_t(system),
+		(unsigned long long)cputime64_to_clock_t(idle),
+		(unsigned long long)cputime64_to_clock_t(iowait),
+		(unsigned long long)cputime64_to_clock_t(irq),
+		(unsigned long long)cputime64_to_clock_t(softirq),
+		(unsigned long long)cputime64_to_clock_t(steal),
+		(unsigned long long)cputime64_to_clock_t(guest),
+		(unsigned long long)cputime64_to_clock_t(guest_nice));
+	for_each_online_cpu(i) {
+		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
+		user = kstat_cpu(i).cpustat[USER];
+		nice = kstat_cpu(i).cpustat[NICE];
+		system = kstat_cpu(i).cpustat[SYSTEM];
+		idle = get_idle_time(i);
+		iowait = get_iowait_time(i);
+		irq = kstat_cpu(i).cpustat[IRQ];
+		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
+		steal = kstat_cpu(i).cpustat[STEAL];
+		guest = kstat_cpu(i).cpustat[GUEST];
+		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
+		seq_printf(p,
+			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
+			"%llu\n",
+			i,
+			(unsigned long long)cputime64_to_clock_t(user),
+			(unsigned long long)cputime64_to_clock_t(nice),
+			(unsigned long long)cputime64_to_clock_t(system),
+			(unsigned long long)cputime64_to_clock_t(idle),
+			(unsigned long long)cputime64_to_clock_t(iowait),
+			(unsigned long long)cputime64_to_clock_t(irq),
+			(unsigned long long)cputime64_to_clock_t(softirq),
+			(unsigned long long)cputime64_to_clock_t(steal),
+			(unsigned long long)cputime64_to_clock_t(guest),
+			(unsigned long long)cputime64_to_clock_t(guest_nice));
+	}
+	seq_printf(p, "intr %llu", (unsigned long long)sum);
+
+	/* sum again ? it could be updated? */
+	for_each_irq_nr(j)
+		seq_printf(p, " %u", kstat_irqs(j));
+
+	seq_printf(p,
+		"\nctxt %llu\n"
+		"btime %lu\n"
+		"processes %lu\n"
+		"procs_running %lu\n"
+		"procs_blocked %lu\n",
+		nr_context_switches(),
+		(unsigned long)jif,
+		total_forks,
+		nr_running(),
+		nr_iowait());
+
+	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
+
+	for (i = 0; i < NR_SOFTIRQS; i++)
+		seq_printf(p, " %u", per_softirq_sums[i]);
+	seq_putc(p, '\n');
+
+	return 0;
+}
+
 #ifdef CONFIG_CGROUP_CPUACCT
 
 /*
-- 
1.7.6.4



* [PATCH v2 04/14] split kernel stat in two
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (2 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 03/14] Move /proc/stat logic inside sched.c Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 05/14] Display /proc/stat information per cgroup Glauber Costa
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

In a later patch, we will use cpustat information per task group.
However, some of its fields are naturally global, such as the irq
counters, and there is no need to impose the task-group overhead on
them. So it is better to separate the two.
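
Spelled out, the resulting layout (taken directly from the kernel_stat.h
hunk below) is:

	struct kernel_cpustat {
		u64 cpustat[NR_STATS];	/* user, nice, system, idle, ... */
	};

	struct kernel_stat {
	#ifndef CONFIG_GENERIC_HARDIRQS
		unsigned int irqs[NR_IRQS];
	#endif
		/* irqs_sum, softirqs[] and friends stay here: they are
		 * inherently global and gain nothing from per-group tracking */
	};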

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 arch/s390/appldata/appldata_os.c       |   16 +++++-----
 arch/x86/include/asm/i387.h            |    2 +-
 drivers/cpufreq/cpufreq_conservative.c |   20 ++++++------
 drivers/cpufreq/cpufreq_ondemand.c     |   20 ++++++------
 drivers/macintosh/rack-meter.c         |    6 ++--
 include/linux/kernel_stat.h            |    8 ++++-
 kernel/sched.c                         |   54 ++++++++++++++++---------------
 7 files changed, 67 insertions(+), 59 deletions(-)

diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 3d6b672..695388a 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -115,21 +115,21 @@ static void appldata_get_os_data(void *data)
 	j = 0;
 	for_each_online_cpu(i) {
 		os_data->os_cpu[j].per_cpu_user =
-			cputime_to_jiffies(kstat_cpu(i).cpustat[USER]);
+			cputime_to_jiffies(kcpustat_cpu(i).cpustat[USER]);
 		os_data->os_cpu[j].per_cpu_nice =
-			cputime_to_jiffies(kstat_cpu(i).cpustat[NICE]);
+			cputime_to_jiffies(kcpustat_cpu(i).cpustat[NICE]);
 		os_data->os_cpu[j].per_cpu_system =
-			cputime_to_jiffies(kstat_cpu(i).cpustat[SYSTEM]);
+			cputime_to_jiffies(kcpustat_cpu(i).cpustat[SYSTEM]);
 		os_data->os_cpu[j].per_cpu_idle =
-			cputime_to_jiffies(kstat_cpu(i).cpustat[IDLE]);
+			cputime_to_jiffies(kcpustat_cpu(i).cpustat[IDLE]);
 		os_data->os_cpu[j].per_cpu_irq =
-			cputime_to_jiffies(kstat_cpu(i).cpustat[IRQ]);
+			cputime_to_jiffies(kcpustat_cpu(i).cpustat[IRQ]);
 		os_data->os_cpu[j].per_cpu_softirq =
-			cputime_to_jiffies(kstat_cpu(i).cpustat[SOFTIRQ]);
+			cputime_to_jiffies(kcpustat_cpu(i).cpustat[SOFTIRQ]);
 		os_data->os_cpu[j].per_cpu_iowait =
-			cputime_to_jiffies(kstat_cpu(i).cpustat[IOWAIT]);
+			cputime_to_jiffies(kcpustat_cpu(i).cpustat[IOWAIT]);
 		os_data->os_cpu[j].per_cpu_steal =
-			cputime_to_jiffies(kstat_cpu(i).cpustat[STEAL]);
+			cputime_to_jiffies(kcpustat_cpu(i).cpustat[STEAL]);
 		os_data->os_cpu[j].cpu_id = i;
 		j++;
 	}
diff --git a/arch/x86/include/asm/i387.h b/arch/x86/include/asm/i387.h
index 56fa4d7..1f1b536 100644
--- a/arch/x86/include/asm/i387.h
+++ b/arch/x86/include/asm/i387.h
@@ -218,7 +218,7 @@ static inline void fpu_fxsave(struct fpu *fpu)
 #ifdef CONFIG_SMP
 #define safe_address (__per_cpu_offset[0])
 #else
-#define safe_address (kstat_cpu(0).cpustat[USER])
+#define safe_address (__get_cpu_var(kernel_cpustat).cpustat[USER])
 #endif
 
 /*
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 2ab538f..a3a739f 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -103,13 +103,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
 	cputime64_t busy_time;
 
 	cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
-	busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
-			kstat_cpu(cpu).cpustat[SYSTEM]);
+	busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
+			kcpustat_cpu(cpu).cpustat[SYSTEM]);
 
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
+	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[IRQ]);
+	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
+	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
+	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
 
 	idle_time = cputime64_sub(cur_wall_time, busy_time);
 	if (wall)
@@ -272,7 +272,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
 		dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&dbs_info->prev_cpu_wall);
 		if (dbs_tuners_ins.ignore_nice)
-			dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
+			dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
 	}
 	return count;
 }
@@ -365,7 +365,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cputime64_t cur_nice;
 			unsigned long cur_nice_jiffies;
 
-			cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
+			cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
 					 j_dbs_info->prev_cpu_nice);
 			/*
 			 * Assumption: nice time between sampling periods will
@@ -374,7 +374,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cur_nice_jiffies = (unsigned long)
 					cputime64_to_jiffies64(cur_nice);
 
-			j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
+			j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
 			idle_time += jiffies_to_usecs(cur_nice_jiffies);
 		}
 
@@ -503,7 +503,7 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 						&j_dbs_info->prev_cpu_wall);
 			if (dbs_tuners_ins.ignore_nice)
 				j_dbs_info->prev_cpu_nice =
-						kstat_cpu(j).cpustat[NICE];
+						kcpustat_cpu(j).cpustat[NICE];
 		}
 		this_dbs_info->down_skip = 0;
 		this_dbs_info->requested_freq = policy->cur;
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 45d8e17..46e89663 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -127,13 +127,13 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
 	cputime64_t busy_time;
 
 	cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
-	busy_time = cputime64_add(kstat_cpu(cpu).cpustat[USER],
-			kstat_cpu(cpu).cpustat[SYSTEM]);
+	busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
+			kcpustat_cpu(cpu).cpustat[SYSTEM]);
 
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[IRQ]);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[SOFTIRQ]);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[STEAL]);
-	busy_time = cputime64_add(busy_time, kstat_cpu(cpu).cpustat[NICE]);
+	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[IRQ]);
+	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
+	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
+	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
 
 	idle_time = cputime64_sub(cur_wall_time, busy_time);
 	if (wall)
@@ -345,7 +345,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
 		dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&dbs_info->prev_cpu_wall);
 		if (dbs_tuners_ins.ignore_nice)
-			dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
+			dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
 
 	}
 	return count;
@@ -458,7 +458,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cputime64_t cur_nice;
 			unsigned long cur_nice_jiffies;
 
-			cur_nice = cputime64_sub(kstat_cpu(j).cpustat[NICE],
+			cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
 					 j_dbs_info->prev_cpu_nice);
 			/*
 			 * Assumption: nice time between sampling periods will
@@ -467,7 +467,7 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cur_nice_jiffies = (unsigned long)
 					cputime64_to_jiffies64(cur_nice);
 
-			j_dbs_info->prev_cpu_nice = kstat_cpu(j).cpustat[NICE];
+			j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
 			idle_time += jiffies_to_usecs(cur_nice_jiffies);
 		}
 
@@ -648,7 +648,7 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 						&j_dbs_info->prev_cpu_wall);
 			if (dbs_tuners_ins.ignore_nice)
 				j_dbs_info->prev_cpu_nice =
-						kstat_cpu(j).cpustat[NICE];
+						kcpustat_cpu(j).cpustat[NICE];
 		}
 		this_dbs_info->cpu = cpu;
 		this_dbs_info->rate_mult = 1;
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index c80e49a..c8e67b0 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -83,11 +83,11 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
 {
 	cputime64_t retval;
 
-	retval = cputime64_add(kstat_cpu(cpu).cpustat[IDLE],
-			kstat_cpu(cpu).cpustat[IOWAIT]);
+	retval = cputime64_add(kcpustat_cpu(cpu).cpustat[IDLE],
+			kcpustat_cpu(cpu).cpustat[IOWAIT]);
 
 	if (rackmeter_ignore_nice)
-		retval = cputime64_add(retval, kstat_cpu(cpu).cpustat[NICE]);
+		retval = cputime64_add(retval, kcpustat_cpu(cpu).cpustat[NICE]);
 
 	return retval;
 }
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 7bfd0fe..f0e31a9 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -30,8 +30,11 @@ enum cpu_usage_stat {
 	NR_STATS,
 };
 
-struct kernel_stat {
+struct kernel_cpustat {
 	u64 cpustat[NR_STATS];
+};
+
+struct kernel_stat {
 #ifndef CONFIG_GENERIC_HARDIRQS
        unsigned int irqs[NR_IRQS];
 #endif
@@ -40,10 +43,13 @@ struct kernel_stat {
 };
 
 DECLARE_PER_CPU(struct kernel_stat, kstat);
+DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
 
 /* Must have preemption disabled for this to be meaningful. */
 #define kstat_this_cpu (&__get_cpu_var(kstat))
+#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
 #define kstat_cpu(cpu) per_cpu(kstat, cpu)
+#define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
 
 extern unsigned long long nr_context_switches(void);
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 3f42916..c225e41 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2159,7 +2159,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
 	u64 latest_ns;
 	int ret = 0;
@@ -2174,7 +2174,7 @@ static int irqtime_account_hi_update(void)
 
 static int irqtime_account_si_update(void)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
 	u64 latest_ns;
 	int ret = 0;
@@ -3804,8 +3804,10 @@ unlock:
 #endif
 
 DEFINE_PER_CPU(struct kernel_stat, kstat);
+DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
 
 EXPORT_PER_CPU_SYMBOL(kstat);
+EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
 
 /*
  * Return any ns on the sched_clock that have not yet been accounted in
@@ -3867,7 +3869,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 void account_user_time(struct task_struct *p, cputime_t cputime,
 		       cputime_t cputime_scaled)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	u64 tmp;
 
 	/* Add user time to process. */
@@ -3898,7 +3900,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 			       cputime_t cputime_scaled)
 {
 	u64 tmp;
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 
 	tmp = cputime_to_cputime64(cputime);
 
@@ -3954,7 +3956,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 void account_system_time(struct task_struct *p, int hardirq_offset,
 			 cputime_t cputime, cputime_t cputime_scaled)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	u64 *target_cputime64;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
@@ -3978,7 +3980,7 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
  */
 void account_steal_time(cputime_t cputime)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	u64 cputime64 = cputime_to_cputime64(cputime);
 
 	cpustat[STEAL] += cputime64;
@@ -3990,7 +3992,7 @@ void account_steal_time(cputime_t cputime)
  */
 void account_idle_time(cputime_t cputime)
 {
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	u64 cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
 
@@ -4048,7 +4050,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 {
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
-	u64 *cpustat = kstat_this_cpu->cpustat;
+	u64 *cpustat = kcpustat_this_cpu->cpustat;
 
 	if (steal_account_process_tick())
 		return;
@@ -9511,7 +9513,7 @@ static u64 get_idle_time(int cpu)
 
 	if (idle_time == -1ULL) {
 		/* !NO_HZ so we can rely on cpustat.idle */
-		idle = kstat_cpu(cpu).cpustat[IDLE];
+		idle = kcpustat_cpu(cpu).cpustat[IDLE];
 		idle += arch_idle_time(cpu);
 	} else
 		idle = usecs_to_cputime(idle_time);
@@ -9525,7 +9527,7 @@ static u64 get_iowait_time(int cpu)
 
 	if (iowait_time == -1ULL)
 		/* !NO_HZ so we can rely on cpustat.iowait */
-		iowait = kstat_cpu(cpu).cpustat[IOWAIT];
+		iowait = kcpustat_cpu(cpu).cpustat[IOWAIT];
 	else
 		iowait = usecs_to_cputime(iowait_time);
 
@@ -9550,16 +9552,16 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
 	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
-		user += kstat_cpu(i).cpustat[USER];
-		nice += kstat_cpu(i).cpustat[NICE];
-		system += kstat_cpu(i).cpustat[SYSTEM];
+		user += kcpustat_cpu(i).cpustat[USER];
+		nice += kcpustat_cpu(i).cpustat[NICE];
+		system += kcpustat_cpu(i).cpustat[SYSTEM];
 		idle += get_idle_time(i);
 		iowait += get_iowait_time(i);
-		irq += kstat_cpu(i).cpustat[IRQ];
-		softirq += kstat_cpu(i).cpustat[SOFTIRQ];
-		steal += kstat_cpu(i).cpustat[STEAL];
-		guest += kstat_cpu(i).cpustat[GUEST];
-		guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
+		irq += kcpustat_cpu(i).cpustat[IRQ];
+		softirq += kcpustat_cpu(i).cpustat[SOFTIRQ];
+		steal += kcpustat_cpu(i).cpustat[STEAL];
+		guest += kcpustat_cpu(i).cpustat[GUEST];
+		guest_nice += kcpustat_cpu(i).cpustat[GUEST_NICE];
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
 			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -9584,16 +9586,16 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
 		(unsigned long long)cputime64_to_clock_t(guest_nice));
 	for_each_online_cpu(i) {
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kstat_cpu(i).cpustat[USER];
-		nice = kstat_cpu(i).cpustat[NICE];
-		system = kstat_cpu(i).cpustat[SYSTEM];
+		user = kcpustat_cpu(i).cpustat[USER];
+		nice = kcpustat_cpu(i).cpustat[NICE];
+		system = kcpustat_cpu(i).cpustat[SYSTEM];
 		idle = get_idle_time(i);
 		iowait = get_iowait_time(i);
-		irq = kstat_cpu(i).cpustat[IRQ];
-		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
-		steal = kstat_cpu(i).cpustat[STEAL];
-		guest = kstat_cpu(i).cpustat[GUEST];
-		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
+		irq = kcpustat_cpu(i).cpustat[IRQ];
+		softirq = kcpustat_cpu(i).cpustat[SOFTIRQ];
+		steal = kcpustat_cpu(i).cpustat[STEAL];
+		guest = kcpustat_cpu(i).cpustat[GUEST];
+		guest_nice = kcpustat_cpu(i).cpustat[GUEST_NICE];
 		seq_printf(p,
 			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
 			"%llu\n",
-- 
1.7.6.4



* [PATCH v2 05/14] Display /proc/stat information per cgroup
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (3 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 04/14] split kernel stat in two Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 06/14] Make total_forks per-cgroup Glauber Costa
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

Each cgroup has its own file, cpu.proc.stat, that
displays exactly the same format as /proc/stat. Users
who want access to a per-cgroup version of this
information can query that file instead.

This is only meaningful when the top-level file
sched_stats is set to 1. This newly introduced file
controls whether the statistics are collected
globally or per cgroup.
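
For illustration, a userspace sketch of the intended usage. The mount point
and group name below are assumptions; only the file names (sched_stats,
cpu.proc.stat) come from this patch.

	/* Hypothetical usage: assumes the cpu controller is mounted at
	 * /sys/fs/cgroup/cpu and that a group named "lxc0" already exists. */
	#include <stdio.h>

	int main(void)
	{
		char line[512];
		FILE *f;

		/* enable per-cgroup collection via the top-level control file */
		f = fopen("/sys/fs/cgroup/cpu/sched_stats", "w");
		if (f) {
			fputs("1\n", f);
			fclose(f);
		}

		/* read the group's view; same format as /proc/stat */
		f = fopen("/sys/fs/cgroup/cpu/lxc0/cpu.proc.stat", "r");
		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
		return 0;
	}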

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 arch/s390/appldata/appldata_os.c       |    2 +
 drivers/cpufreq/cpufreq_conservative.c |   16 ++-
 drivers/cpufreq/cpufreq_ondemand.c     |   16 ++-
 drivers/macintosh/rack-meter.c         |    2 +
 fs/proc/stat.c                         |    2 +-
 fs/proc/uptime.c                       |    8 +-
 include/linux/kernel_stat.h            |   20 ++-
 include/linux/sched.h                  |    5 +-
 kernel/sched.c                         |  254 ++++++++++++++++++++++++--------
 9 files changed, 254 insertions(+), 71 deletions(-)

diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index 695388a..0612a7c 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -114,6 +114,7 @@ static void appldata_get_os_data(void *data)
 
 	j = 0;
 	for_each_online_cpu(i) {
+		kstat_lock();
 		os_data->os_cpu[j].per_cpu_user =
 			cputime_to_jiffies(kcpustat_cpu(i).cpustat[USER]);
 		os_data->os_cpu[j].per_cpu_nice =
@@ -131,6 +132,7 @@ static void appldata_get_os_data(void *data)
 		os_data->os_cpu[j].per_cpu_steal =
 			cputime_to_jiffies(kcpustat_cpu(i).cpustat[STEAL]);
 		os_data->os_cpu[j].cpu_id = i;
+		kstat_unlock();
 		j++;
 	}
 
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index a3a739f..ca98530 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -102,6 +102,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
 	cputime64_t cur_wall_time;
 	cputime64_t busy_time;
 
+	kstat_lock();
 	cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
 	busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
 			kcpustat_cpu(cpu).cpustat[SYSTEM]);
@@ -110,6 +111,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
 	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
 	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
 	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
+	kstat_unlock();
 
 	idle_time = cputime64_sub(cur_wall_time, busy_time);
 	if (wall)
@@ -271,8 +273,11 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
 		dbs_info = &per_cpu(cs_cpu_dbs_info, j);
 		dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&dbs_info->prev_cpu_wall);
-		if (dbs_tuners_ins.ignore_nice)
+		if (dbs_tuners_ins.ignore_nice) {
+			kstat_lock();
 			dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
+			kstat_unlock();
+		}
 	}
 	return count;
 }
@@ -365,8 +370,10 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cputime64_t cur_nice;
 			unsigned long cur_nice_jiffies;
 
+			kstat_lock();
 			cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
 					 j_dbs_info->prev_cpu_nice);
+			kstat_unlock();
 			/*
 			 * Assumption: nice time between sampling periods will
 			 * be less than 2^32 jiffies for 32 bit sys
@@ -374,7 +381,9 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cur_nice_jiffies = (unsigned long)
 					cputime64_to_jiffies64(cur_nice);
 
+			kstat_lock();
 			j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
+			kstat_unlock();
 			idle_time += jiffies_to_usecs(cur_nice_jiffies);
 		}
 
@@ -501,9 +510,12 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 
 			j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&j_dbs_info->prev_cpu_wall);
-			if (dbs_tuners_ins.ignore_nice)
+			if (dbs_tuners_ins.ignore_nice) {
+				kstat_lock();
 				j_dbs_info->prev_cpu_nice =
 						kcpustat_cpu(j).cpustat[NICE];
+				kstat_unlock();
+			}
 		}
 		this_dbs_info->down_skip = 0;
 		this_dbs_info->requested_freq = policy->cur;
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 46e89663..4076453 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -126,6 +126,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
 	cputime64_t cur_wall_time;
 	cputime64_t busy_time;
 
+	kstat_lock();
 	cur_wall_time = jiffies64_to_cputime64(get_jiffies_64());
 	busy_time = cputime64_add(kcpustat_cpu(cpu).cpustat[USER],
 			kcpustat_cpu(cpu).cpustat[SYSTEM]);
@@ -134,6 +135,7 @@ static inline cputime64_t get_cpu_idle_time_jiffy(unsigned int cpu,
 	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[SOFTIRQ]);
 	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[STEAL]);
 	busy_time = cputime64_add(busy_time, kcpustat_cpu(cpu).cpustat[NICE]);
+	kstat_unlock();
 
 	idle_time = cputime64_sub(cur_wall_time, busy_time);
 	if (wall)
@@ -344,8 +346,11 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b,
 		dbs_info = &per_cpu(od_cpu_dbs_info, j);
 		dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&dbs_info->prev_cpu_wall);
-		if (dbs_tuners_ins.ignore_nice)
+		if (dbs_tuners_ins.ignore_nice) {
+			kstat_lock();
 			dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
+			kstat_unlock();
+		}
 
 	}
 	return count;
@@ -458,8 +463,10 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cputime64_t cur_nice;
 			unsigned long cur_nice_jiffies;
 
+			kstat_lock();
 			cur_nice = cputime64_sub(kcpustat_cpu(j).cpustat[NICE],
 					 j_dbs_info->prev_cpu_nice);
+			kstat_unlock();
 			/*
 			 * Assumption: nice time between sampling periods will
 			 * be less than 2^32 jiffies for 32 bit sys
@@ -467,7 +474,9 @@ static void dbs_check_cpu(struct cpu_dbs_info_s *this_dbs_info)
 			cur_nice_jiffies = (unsigned long)
 					cputime64_to_jiffies64(cur_nice);
 
+			kstat_lock();
 			j_dbs_info->prev_cpu_nice = kcpustat_cpu(j).cpustat[NICE];
+			kstat_unlock();
 			idle_time += jiffies_to_usecs(cur_nice_jiffies);
 		}
 
@@ -646,9 +655,12 @@ static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 
 			j_dbs_info->prev_cpu_idle = get_cpu_idle_time(j,
 						&j_dbs_info->prev_cpu_wall);
-			if (dbs_tuners_ins.ignore_nice)
+			if (dbs_tuners_ins.ignore_nice) {
+				kstat_lock();
 				j_dbs_info->prev_cpu_nice =
 						kcpustat_cpu(j).cpustat[NICE];
+				kstat_unlock();
+			}
 		}
 		this_dbs_info->cpu = cpu;
 		this_dbs_info->rate_mult = 1;
diff --git a/drivers/macintosh/rack-meter.c b/drivers/macintosh/rack-meter.c
index c8e67b0..196244f 100644
--- a/drivers/macintosh/rack-meter.c
+++ b/drivers/macintosh/rack-meter.c
@@ -83,11 +83,13 @@ static inline cputime64_t get_cpu_idle_time(unsigned int cpu)
 {
 	cputime64_t retval;
 
+	kstat_lock();
 	retval = cputime64_add(kcpustat_cpu(cpu).cpustat[IDLE],
 			kcpustat_cpu(cpu).cpustat[IOWAIT]);
 
 	if (rackmeter_ignore_nice)
 		retval = cputime64_add(retval, kcpustat_cpu(cpu).cpustat[NICE]);
+	kstat_unlock();
 
 	return retval;
 }
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 6b10387..c9b2ae9 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -13,7 +13,7 @@
 
 static int show_stat(struct seq_file *p, void *v)
 {
-	return cpu_cgroup_proc_stat(p);
+	return cpu_cgroup_proc_stat(NULL, NULL, p);
 }
 
 static int stat_open(struct inode *inode, struct file *file)
diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index b0e053d..edd62c1 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -14,8 +14,12 @@ static int uptime_proc_show(struct seq_file *m, void *v)
 	int i;
 	cputime_t idletime = cputime_zero;
 
-	for_each_possible_cpu(i)
-		idletime = cputime64_add(idletime, kstat_cpu(i).cpustat[IDLE]);
+	for_each_possible_cpu(i) {
+		kstat_lock();
+		idletime = cputime64_add(idletime, kcpustat_cpu(i).cpustat[IDLE] -
+						   kcpustat_cpu(i).cpustat[IDLE_BASE]);
+		kstat_unlock();
+	}
 
 	do_posix_clock_monotonic_gettime(&uptime);
 	monotonic_to_bootbased(&uptime);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index f0e31a9..9b7463f 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -27,6 +27,8 @@ enum cpu_usage_stat {
 	STEAL,
 	GUEST,
 	GUEST_NICE,
+	STEAL_BASE,
+	IDLE_BASE,
 	NR_STATS,
 };
 
@@ -43,13 +45,25 @@ struct kernel_stat {
 };
 
 DECLARE_PER_CPU(struct kernel_stat, kstat);
-DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
 
-/* Must have preemption disabled for this to be meaningful. */
 #define kstat_this_cpu (&__get_cpu_var(kstat))
-#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
 #define kstat_cpu(cpu) per_cpu(kstat, cpu)
+
+#ifdef CONFIG_CGROUP_SCHED
+struct kernel_cpustat *task_group_kstat(struct task_struct *p);
+
+#define kcpustat_this_cpu      this_cpu_ptr(task_group_kstat(current))
+#define kcpustat_cpu(cpu) (*per_cpu_ptr(task_group_kstat(current), cpu))
+#define kstat_lock()   rcu_read_lock()
+#define kstat_unlock() rcu_read_unlock()
+#else
+DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
+#define kcpustat_this_cpu (&__get_cpu_var(kernel_cpustat))
 #define kcpustat_cpu(cpu) per_cpu(kernel_cpustat, cpu)
+#define kstat_lock()
+#define kstat_unlock()
+#endif
+
 
 extern unsigned long long nr_context_switches(void);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8311551..16713ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2713,7 +2713,10 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
-int cpu_cgroup_proc_stat(struct seq_file *p);
+struct cgroup;
+struct cftype;
+int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
+			 struct seq_file *p);
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/sched.c b/kernel/sched.c
index c225e41..7ffafc0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -302,6 +302,7 @@ struct task_group {
 #endif
 
 	struct cfs_bandwidth cfs_bandwidth;
+	struct kernel_cpustat __percpu *cpustat;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -741,6 +742,12 @@ static inline int cpu_of(struct rq *rq)
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
+DEFINE_PER_CPU(struct kernel_stat, kstat);
+DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
+
+EXPORT_PER_CPU_SYMBOL(kstat);
+EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
+
 #ifdef CONFIG_CGROUP_SCHED
 
 /*
@@ -764,6 +771,21 @@ static inline struct task_group *task_group(struct task_struct *p)
 	return autogroup_task_group(p, tg);
 }
 
+static struct jump_label_key sched_cgroup_enabled;
+static int sched_has_sched_stats = 0;
+
+struct kernel_cpustat *task_group_kstat(struct task_struct *p)
+{
+	if (static_branch(&sched_cgroup_enabled)) {
+		struct task_group *tg;
+		tg = task_group(p);
+		return tg->cpustat;
+	}
+
+	return &kernel_cpustat;
+}
+EXPORT_SYMBOL(task_group_kstat);
+
 /* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 {
@@ -777,7 +799,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 	p->rt.parent = task_group(p)->rt_se[cpu];
 #endif
 }
-
 #else /* CONFIG_CGROUP_SCHED */
 
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -785,9 +806,36 @@ static inline struct task_group *task_group(struct task_struct *p)
 {
 	return NULL;
 }
-
 #endif /* CONFIG_CGROUP_SCHED */
 
+static inline void task_group_account_field(struct task_struct *p,
+					     u64 tmp, int index)
+{
+	/*
+	 * Since all updates are sure to touch the root cgroup, we
+	 * get ourselves ahead and touch it first. If the root cgroup
+	 * is the only cgroup, then nothing else should be necessary.
+	 *
+	 */
+	__get_cpu_var(kernel_cpustat).cpustat[index] += tmp;
+
+#ifdef CONFIG_CGROUP_SCHED
+	if (static_branch(&sched_cgroup_enabled)) {
+		struct kernel_cpustat *kcpustat;
+		struct task_group *tg;
+
+		rcu_read_lock();
+		tg = task_group(p);
+		while (tg && (tg != &root_task_group)) {
+			kcpustat = this_cpu_ptr(tg->cpustat);
+			kcpustat->cpustat[index] += tmp;
+			tg = tg->parent;
+		}
+		rcu_read_unlock();
+	}
+#endif
+}
+
 static void update_rq_clock_task(struct rq *rq, s64 delta);
 
 static void update_rq_clock(struct rq *rq)
@@ -2159,30 +2207,36 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 static int irqtime_account_hi_update(void)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	unsigned long flags;
 	u64 latest_ns;
+	u64 *cpustat;
 	int ret = 0;
 
 	local_irq_save(flags);
 	latest_ns = this_cpu_read(cpu_hardirq_time);
+	kstat_lock();
+	cpustat = kcpustat_this_cpu->cpustat;
 	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[IRQ]))
 		ret = 1;
+	kstat_unlock();
 	local_irq_restore(flags);
 	return ret;
 }
 
 static int irqtime_account_si_update(void)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	u64 *cpustat;
 	unsigned long flags;
 	u64 latest_ns;
 	int ret = 0;
 
 	local_irq_save(flags);
 	latest_ns = this_cpu_read(cpu_softirq_time);
+	kstat_lock();
+	cpustat = kcpustat_this_cpu->cpustat;
 	if (cputime64_gt(nsecs_to_cputime64(latest_ns), cpustat[SOFTIRQ]))
 		ret = 1;
+	kstat_unlock();
 	local_irq_restore(flags);
 	return ret;
 }
@@ -3803,12 +3857,6 @@ unlock:
 
 #endif
 
-DEFINE_PER_CPU(struct kernel_stat, kstat);
-DEFINE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
-
-EXPORT_PER_CPU_SYMBOL(kstat);
-EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
-
 /*
  * Return any ns on the sched_clock that have not yet been accounted in
  * @p in case that task is currently running.
@@ -3869,7 +3917,6 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 void account_user_time(struct task_struct *p, cputime_t cputime,
 		       cputime_t cputime_scaled)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	u64 tmp;
 
 	/* Add user time to process. */
@@ -3881,9 +3928,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 	tmp = cputime_to_cputime64(cputime);
 
 	if (TASK_NICE(p) > 0)
-		cpustat[NICE] += tmp;
+		task_group_account_field(p, tmp, NICE);
 	else
-		cpustat[USER] += tmp;
+		task_group_account_field(p, tmp, USER);
 
 	cpuacct_update_stats(p, CPUACCT_STAT_USER, cputime);
 	/* Account for user time used */
@@ -3900,7 +3947,6 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 			       cputime_t cputime_scaled)
 {
 	u64 tmp;
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
 
 	tmp = cputime_to_cputime64(cputime);
 
@@ -3912,11 +3958,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add guest time to cpustat. */
 	if (TASK_NICE(p) > 0) {
-		cpustat[NICE] += tmp;
-		cpustat[GUEST_NICE] += tmp;
+		task_group_account_field(p, tmp, NICE);
+		task_group_account_field(p, tmp, GUEST_NICE);
 	} else {
-		cpustat[USER] += tmp;
-		cpustat[GUEST] += tmp;
+		task_group_account_field(p, tmp, USER);
+		task_group_account_field(p, tmp, GUEST);
 	}
 }
 
@@ -3929,7 +3975,7 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
  */
 static inline
 void __account_system_time(struct task_struct *p, cputime_t cputime,
-			cputime_t cputime_scaled, u64 *target_cputime64)
+			cputime_t cputime_scaled, int index)
 {
 	u64 tmp = cputime_to_cputime64(cputime);
 
@@ -3939,7 +3985,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 	account_group_system_time(p, cputime);
 
 	/* Add system time to cpustat. */
-	*target_cputime64 += tmp;
+	task_group_account_field(p, tmp, index);
 	cpuacct_update_stats(p, CPUACCT_STAT_SYSTEM, cputime);
 
 	/* Account for system time used */
@@ -3956,8 +4002,7 @@ void __account_system_time(struct task_struct *p, cputime_t cputime,
 void account_system_time(struct task_struct *p, int hardirq_offset,
 			 cputime_t cputime, cputime_t cputime_scaled)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
-	u64 *target_cputime64;
+	int index;
 
 	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
 		account_guest_time(p, cputime, cputime_scaled);
@@ -3965,13 +4010,13 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 	}
 
 	if (hardirq_count() - hardirq_offset)
-		target_cputime64 = &cpustat[IRQ];
+		index = IRQ;
 	else if (in_serving_softirq())
-		target_cputime64 = &cpustat[SOFTIRQ];
+		index = SOFTIRQ;
 	else
-		target_cputime64 = &cpustat[SYSTEM];
+		index = SYSTEM;
 
-	__account_system_time(p, cputime, cputime_scaled, target_cputime64);
+	__account_system_time(p, cputime, cputime_scaled, index);
 }
 
 /*
@@ -3980,10 +4025,8 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
  */
 void account_steal_time(cputime_t cputime)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
 	u64 cputime64 = cputime_to_cputime64(cputime);
-
-	cpustat[STEAL] += cputime64;
+	__get_cpu_var(kernel_cpustat).cpustat[STEAL] += cputime64;
 }
 
 /*
@@ -3992,14 +4035,19 @@ void account_steal_time(cputime_t cputime)
  */
 void account_idle_time(cputime_t cputime)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
+	struct kernel_cpustat *kcpustat;
 	u64 cputime64 = cputime_to_cputime64(cputime);
 	struct rq *rq = this_rq();
 
+	kstat_lock();
+	kcpustat = kcpustat_this_cpu;
+
 	if (atomic_read(&rq->nr_iowait) > 0)
-		cpustat[IOWAIT] += cputime64;
+		kcpustat->cpustat[IOWAIT] += cputime64;
 	else
-		cpustat[IDLE] += cputime64;
+		/* idle is always accounted to the root cgroup */
+		__get_cpu_var(kernel_cpustat).cpustat[IDLE] += cputime64;
+	kstat_unlock();
 }
 
 static __always_inline bool steal_account_process_tick(void)
@@ -4046,27 +4094,26 @@ static __always_inline bool steal_account_process_tick(void)
  * softirq as those do not count in task exec_runtime any more.
  */
 static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
-						struct rq *rq)
+					 struct rq *rq)
 {
 	cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
 	u64 tmp = cputime_to_cputime64(cputime_one_jiffy);
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
 
 	if (steal_account_process_tick())
 		return;
 
 	if (irqtime_account_hi_update()) {
-		cpustat[IRQ] += tmp;
+		task_group_account_field(p, tmp, IRQ);
 	} else if (irqtime_account_si_update()) {
-		cpustat[SOFTIRQ] += tmp;
+		task_group_account_field(p, tmp, SOFTIRQ);
 	} else if (this_cpu_ksoftirqd() == p) {
 		/*
 		 * ksoftirqd time do not get accounted in cpu_softirq_time.
 		 * So, we have to handle it separately here.
 		 * Also, p->stime needs to be updated for ksoftirqd.
 		 */
-		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat[SOFTIRQ]);
+		__account_system_time(p, cputime_one_jiffy,
+				      one_jiffy_scaled, SOFTIRQ);
 	} else if (user_tick) {
 		account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else if (p == rq->idle) {
@@ -4074,8 +4121,8 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
 	} else if (p->flags & PF_VCPU) { /* System time or guest time */
 		account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
 	} else {
-		__account_system_time(p, cputime_one_jiffy, one_jiffy_scaled,
-					&cpustat[SYSTEM]);
+		__account_system_time(p, cputime_one_jiffy,
+				      one_jiffy_scaled, SYSTEM);
 	}
 }
 
@@ -8240,6 +8287,8 @@ void __init sched_init(void)
 	INIT_LIST_HEAD(&root_task_group.children);
 	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
+
+	root_task_group.cpustat = &kernel_cpustat;
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
@@ -8677,6 +8726,7 @@ static void free_sched_group(struct task_group *tg)
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
+	free_percpu(tg->cpustat);
 	kfree(tg);
 }
 
@@ -8685,6 +8735,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 {
 	struct task_group *tg;
 	unsigned long flags;
+	int i;
 
 	tg = kzalloc(sizeof(*tg), GFP_KERNEL);
 	if (!tg)
@@ -8696,6 +8747,21 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	tg->cpustat = alloc_percpu(struct kernel_cpustat);
+	if (!tg->cpustat)
+		goto err;
+
+	for_each_possible_cpu(i) {
+		struct kernel_cpustat *kcpustat, *root_kstat;
+
+		kstat_lock();
+		kcpustat = per_cpu_ptr(tg->cpustat, i);
+		root_kstat = per_cpu_ptr(root_task_group.cpustat, i);
+		kcpustat->cpustat[IDLE_BASE]  = root_kstat->cpustat[IDLE];
+		kcpustat->cpustat[STEAL_BASE]  = root_kstat->cpustat[STEAL];
+		kstat_unlock();
+	}
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_add_rcu(&tg->list, &task_groups);
 
@@ -9440,6 +9506,23 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static u64 cpu_has_sched_stats(struct cgroup *cgrp, struct cftype *cft)
+{
+	return sched_has_sched_stats;
+}
+
+static int cpu_set_sched_stats(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	if (!val && sched_has_sched_stats)
+		jump_label_dec(&sched_cgroup_enabled);
+
+	if (val && !sched_has_sched_stats)
+		jump_label_inc(&sched_cgroup_enabled);
+
+	sched_has_sched_stats = !!val;
+	return 0;
+}
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -9476,11 +9559,33 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+	{
+		.name = "proc.stat",
+		.read_seq_string = cpu_cgroup_proc_stat,
+	},
+};
+
+/*
+ * Files appearing here will be shown at the top level only. Although we could
+ * show them unconditionally, and then return an error when read/written from
+ * non-root cgroups, this is less confusing for users.
+ */
+static struct cftype cpu_root_files[] = {
+	{
+		.name = "sched_stats",
+		.read_u64 = cpu_has_sched_stats,
+		.write_u64 = cpu_set_sched_stats,
+	},
 };
 
 static int cpu_cgroup_populate(struct cgroup_subsys *ss, struct cgroup *cont)
 {
-	return cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files));
+	int ret;
+	ret = cgroup_add_files(cont, ss, cpu_files, ARRAY_SIZE(cpu_files));
+	if (!ret)
+		ret = cgroup_add_files(cont, ss, cpu_root_files,
+				       ARRAY_SIZE(cpu_root_files));
+	return ret;
 }
 
 struct cgroup_subsys cpu_cgroup_subsys = {
@@ -9513,7 +9618,7 @@ static u64 get_idle_time(int cpu)
 
 	if (idle_time == -1ULL) {
 		/* !NO_HZ so we can rely on cpustat.idle */
-		idle = kcpustat_cpu(cpu).cpustat[IDLE];
+		idle = per_cpu(kernel_cpustat, cpu).cpustat[IDLE];
 		idle += arch_idle_time(cpu);
 	} else
 		idle = usecs_to_cputime(idle_time);
@@ -9527,14 +9632,15 @@ static u64 get_iowait_time(int cpu)
 
 	if (iowait_time == -1ULL)
 		/* !NO_HZ so we can rely on cpustat.iowait */
-		iowait = kcpustat_cpu(cpu).cpustat[IOWAIT];
+		iowait = per_cpu(kernel_cpustat, cpu).cpustat[IOWAIT];
 	else
 		iowait = usecs_to_cputime(iowait_time);
 
 	return iowait;
 }
 
-int cpu_cgroup_proc_stat(struct seq_file *p)
+int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
+			 struct seq_file *p)
 {
 	int i, j;
 	unsigned long jif;
@@ -9544,6 +9650,14 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
 	u64 sum_softirq = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *tg;
+
+	if (cgrp)
+		tg = cgroup_tg(cgrp);
+	else
+		tg = &root_task_group;
+#endif
 
 	user = nice = system = idle = iowait =
 		irq = softirq = steal = 0;
@@ -9552,16 +9666,26 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
 	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
-		user += kcpustat_cpu(i).cpustat[USER];
-		nice += kcpustat_cpu(i).cpustat[NICE];
-		system += kcpustat_cpu(i).cpustat[SYSTEM];
+		struct kernel_cpustat *kcpustat;
+		kstat_lock();
+#ifdef CONFIG_CGROUP_SCHED
+		kcpustat = per_cpu_ptr(tg->cpustat, i);
+#else
+		kcpustat = &per_cpu(kernel_cpustat, i);
+#endif
+		user += kcpustat->cpustat[USER];
+		nice += kcpustat->cpustat[NICE];
+		system += kcpustat->cpustat[SYSTEM];
 		idle += get_idle_time(i);
+		idle -= kcpustat->cpustat[IDLE_BASE];
 		iowait += get_iowait_time(i);
-		irq += kcpustat_cpu(i).cpustat[IRQ];
-		softirq += kcpustat_cpu(i).cpustat[SOFTIRQ];
-		steal += kcpustat_cpu(i).cpustat[STEAL];
-		guest += kcpustat_cpu(i).cpustat[GUEST];
-		guest_nice += kcpustat_cpu(i).cpustat[GUEST_NICE];
+		irq += kcpustat->cpustat[IRQ];
+		softirq += kcpustat->cpustat[SOFTIRQ];
+		steal += kcpustat->cpustat[STEAL];
+		steal -= kcpustat->cpustat[STEAL_BASE];
+		guest += kcpustat->cpustat[GUEST];
+		guest_nice += kcpustat->cpustat[GUEST_NICE];
+		kstat_unlock();
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
 			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -9585,17 +9709,27 @@ int cpu_cgroup_proc_stat(struct seq_file *p)
 		(unsigned long long)cputime64_to_clock_t(guest),
 		(unsigned long long)cputime64_to_clock_t(guest_nice));
 	for_each_online_cpu(i) {
+		struct kernel_cpustat *kcpustat;
+		kstat_lock();
+#ifdef CONFIG_CGROUP_SCHED
+		kcpustat = per_cpu_ptr(tg->cpustat, i);
+#else
+		kcpustat = &per_cpu(kernel_cpustat, i);
+#endif
 		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */
-		user = kcpustat_cpu(i).cpustat[USER];
-		nice = kcpustat_cpu(i).cpustat[NICE];
-		system = kcpustat_cpu(i).cpustat[SYSTEM];
+		user = kcpustat->cpustat[USER];
+		nice = kcpustat->cpustat[NICE];
+		system = kcpustat->cpustat[SYSTEM];
 		idle = get_idle_time(i);
+		idle -= kcpustat->cpustat[IDLE_BASE];
 		iowait = get_iowait_time(i);
-		irq = kcpustat_cpu(i).cpustat[IRQ];
-		softirq = kcpustat_cpu(i).cpustat[SOFTIRQ];
-		steal = kcpustat_cpu(i).cpustat[STEAL];
-		guest = kcpustat_cpu(i).cpustat[GUEST];
-		guest_nice = kcpustat_cpu(i).cpustat[GUEST_NICE];
+		irq = kcpustat->cpustat[IRQ];
+		softirq = kcpustat->cpustat[SOFTIRQ];
+		steal = kcpustat->cpustat[STEAL];
+		steal -= kcpustat->cpustat[STEAL_BASE];
+		guest = kcpustat->cpustat[GUEST];
+		guest_nice = kcpustat->cpustat[GUEST_NICE];
+		kstat_unlock();
 		seq_printf(p,
 			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
 			"%llu\n",
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 06/14] Make total_forks per-cgroup
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (4 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 05/14] Display /proc/stat information per cgroup Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 07/14] per-cgroup boot time Glauber Costa
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

This patch counts the total number of forks per cgroup.
The count is propagated to the parent, so the root cgroup's
value is the total number of forks in the whole system.

To achieve that, total_forks is made per-cpu. There is no
strict need for it to be per-cpu, but doing so lets us bundle
it inside the cpustat structure that is already present.
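
As a rough illustration of the propagation, here is a user-space
sketch. It is not the kernel code (which goes through
task_group_account_field()) and the names are made up; it only shows
why reading the root group yields the system-wide total:

#include <stdio.h>

struct group {
	struct group *parent;
	unsigned long long total_forks;
};

/* bump the counter in the task's group and in every ancestor */
static void account_fork(struct group *g)
{
	for (; g; g = g->parent)
		g->total_forks++;
}

int main(void)
{
	struct group root = { NULL, 0 };
	struct group child = { &root, 0 };

	account_fork(&child);	/* fork inside the child cgroup */
	account_fork(&root);	/* fork inside the root cgroup  */

	/* prints root=2 child=1 */
	printf("root=%llu child=%llu\n", root.total_forks, child.total_forks);
	return 0;
}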

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 include/linux/kernel_stat.h |    1 +
 include/linux/sched.h       |    1 +
 kernel/fork.c               |    7 ++-----
 kernel/sched.c              |    9 ++++++++-
 4 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 9b7463f..a0f1182 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -29,6 +29,7 @@ enum cpu_usage_stat {
 	GUEST_NICE,
 	STEAL_BASE,
 	IDLE_BASE,
+	TOTAL_FORKS,
 	NR_STATS,
 };
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 16713ea..2195ac7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2717,6 +2717,7 @@ struct cgroup;
 struct cftype;
 int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 			 struct seq_file *p);
+void task_group_new_fork(struct task_struct *p);
 #endif /* __KERNEL__ */
 
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..ec2b729 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -76,10 +76,6 @@
 
 #include <trace/events/sched.h>
 
-/*
- * Protected counters by write_lock_irq(&tasklist_lock)
- */
-unsigned long total_forks;	/* Handle normal Linux uptimes. */
 int nr_threads;			/* The idle threads do not count.. */
 
 int max_threads;		/* tunable limit on nr_threads */
@@ -1372,7 +1368,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		nr_threads++;
 	}
 
-	total_forks++;
+	task_group_new_fork(p);
+
 	spin_unlock(&current->sighand->siglock);
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
diff --git a/kernel/sched.c b/kernel/sched.c
index 7ffafc0..df0c4de 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -836,6 +836,11 @@ static inline void task_group_account_field(struct task_struct *p,
 #endif
 }
 
+void task_group_new_fork(struct task_struct *p)
+{
+	task_group_account_field(p, 1, TOTAL_FORKS);
+}
+
 static void update_rq_clock_task(struct rq *rq, s64 delta);
 
 static void update_rq_clock(struct rq *rq)
@@ -9648,6 +9653,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	u64 guest, guest_nice;
 	u64 sum = 0;
 	u64 sum_softirq = 0;
+	u64 total_forks = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
 #ifdef CONFIG_CGROUP_SCHED
@@ -9685,6 +9691,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		steal -= kcpustat->cpustat[STEAL_BASE];
 		guest += kcpustat->cpustat[GUEST];
 		guest_nice += kcpustat->cpustat[GUEST_NICE];
+		total_forks += kcpustat->cpustat[TOTAL_FORKS];
 		kstat_unlock();
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
@@ -9754,7 +9761,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	seq_printf(p,
 		"\nctxt %llu\n"
 		"btime %lu\n"
-		"processes %lu\n"
+		"processes %llu\n"
 		"procs_running %lu\n"
 		"procs_blocked %lu\n",
 		nr_context_switches(),
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 07/14] per-cgroup boot time
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (5 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 06/14] Make total_forks per-cgroup Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 08/14] Report steal time for cgroup Glauber Costa
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

Record the time at which the cgroup was created. This can be
used to provide more accurate boottime information in the
per-cgroup cpu.proc.stat.
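
As a rough user-space illustration of the arithmetic (illustrative
only; the kernel side uses getboottime() and timespec_add(), and the
numbers below are made up), the btime a cgroup reports is the system
boot time shifted by the moment the cgroup was created:

#include <stdio.h>
#include <time.h>

/* add two timespecs, normalizing nanoseconds, like timespec_add() */
static struct timespec ts_add(struct timespec a, struct timespec b)
{
	struct timespec r = { .tv_sec = a.tv_sec + b.tv_sec,
			      .tv_nsec = a.tv_nsec + b.tv_nsec };

	if (r.tv_nsec >= 1000000000L) {
		r.tv_sec++;
		r.tv_nsec -= 1000000000L;
	}
	return r;
}

int main(void)
{
	struct timespec boottime = { .tv_sec = 1320100000 };	/* system boot      */
	struct timespec tg_start = { .tv_sec = 3600 };		/* created 1h later */
	struct timespec btime = ts_add(boottime, tg_start);

	printf("btime %ld\n", (long)btime.tv_sec);
	return 0;
}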

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index df0c4de..7dd4dea 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -303,6 +303,7 @@ struct task_group {
 
 	struct cfs_bandwidth cfs_bandwidth;
 	struct kernel_cpustat __percpu *cpustat;
+	struct timespec start_time;
 };
 
 /* task_group_lock serializes the addition/removal of task groups */
@@ -8293,6 +8294,7 @@ void __init sched_init(void)
 	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
 
+	root_task_group.start_time = (struct timespec){0, 0};
 	root_task_group.cpustat = &kernel_cpustat;
 #endif /* CONFIG_CGROUP_SCHED */
 
@@ -8767,6 +8769,8 @@ struct task_group *sched_create_group(struct task_group *parent)
 		kstat_unlock();
 	}
 
+	get_monotonic_boottime(&tg->start_time);
+
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_add_rcu(&tg->list, &task_groups);
 
@@ -9669,6 +9673,10 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		irq = softirq = steal = 0;
 	guest = guest_nice = 0;
 	getboottime(&boottime);
+#ifdef CONFIG_CGROUP_SCHED
+	if (static_branch(&sched_cgroup_enabled))
+		boottime = timespec_add(boottime, tg->start_time);
+#endif
 	jif = boottime.tv_sec;
 
 	for_each_possible_cpu(i) {
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 08/14] Report steal time for cgroup
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (6 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 07/14] per-cgroup boot time Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 09/14] Keep nr_iowait per cgroup Glauber Costa
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

This patch introduces functionality commonly found in
hypervisors: steal time.

For those not familiar with it, steal time is any time in
which a virtual machine (or container) wanted to perform cpu
work but could not, because another VM/container was scheduled
in its place. Note that idle time is never counted as steal
time.

Assuming each container lives in its own cgroup, we can easily
calculate steal time as all the user/system time recorded in
our sibling cgroups.
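
A rough user-space sketch of that computation follows. It is not the
kernel code (which walks tg->siblings for each cpu) and the names and
numbers are made up; it only shows the sum being taken:

#include <stdio.h>

struct group_stat {
	unsigned long long user, system, irq, softirq;
};

/* steal = work done by everybody else while we wanted the cpu */
static unsigned long long steal_time(const struct group_stat *sib, int n)
{
	unsigned long long steal = 0;
	int i;

	for (i = 0; i < n; i++)
		steal += sib[i].user + sib[i].system +
			 sib[i].irq + sib[i].softirq;
	return steal;
}

int main(void)
{
	/* two sibling containers sharing the cpu with us */
	struct group_stat siblings[] = {
		{ 100, 20, 1, 2 },
		{  50, 10, 0, 1 },
	};

	printf("steal=%llu\n", steal_time(siblings, 2));	/* 184 */
	return 0;
}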

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |   31 +++++++++++++++++++++++++++++++
 1 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7dd4dea..c7ac150 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9662,6 +9662,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	struct timespec boottime;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *tg;
+	struct task_group *sib;
 
 	if (cgrp)
 		tg = cgroup_tg(cgrp);
@@ -9700,6 +9701,21 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		guest += kcpustat->cpustat[GUEST];
 		guest_nice += kcpustat->cpustat[GUEST_NICE];
 		total_forks += kcpustat->cpustat[TOTAL_FORKS];
+#ifdef CONFIG_CGROUP_SCHED
+		if (static_branch(&sched_cgroup_enabled)) {
+			list_for_each_entry(sib, &tg->siblings, siblings) {
+				struct kernel_cpustat *ksib;
+				if (!sib)
+					continue;
+
+				ksib = per_cpu_ptr(sib->cpustat, i);
+				steal += ksib->cpustat[USER] +
+					 ksib->cpustat[SYSTEM] +
+					 ksib->cpustat[IRQ] +
+					 ksib->cpustat[SOFTIRQ];
+			}
+		}
+#endif
 		kstat_unlock();
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
@@ -9744,6 +9760,21 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		steal -= kcpustat->cpustat[STEAL_BASE];
 		guest = kcpustat->cpustat[GUEST];
 		guest_nice = kcpustat->cpustat[GUEST_NICE];
+#ifdef CONFIG_CGROUP_SCHED
+		if (static_branch(&sched_cgroup_enabled)) {
+			list_for_each_entry(sib, &tg->siblings, siblings) {
+				struct kernel_cpustat *ksib;
+				if (!sib)
+					continue;
+
+				ksib = per_cpu_ptr(sib->cpustat, i);
+				steal += ksib->cpustat[USER] +
+					 ksib->cpustat[SYSTEM] +
+					 ksib->cpustat[IRQ] +
+					 ksib->cpustat[SOFTIRQ];
+			}
+		}
+#endif
 		kstat_unlock();
 		seq_printf(p,
 			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 09/14] Keep nr_iowait per cgroup
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (7 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 08/14] Report steal time for cgroup Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-10 10:27   ` Andrew Wagin
  2011-11-01 21:19 ` [PATCH v2 10/14] Keep number of context switches per-cgroup Glauber Costa
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

Since we can know precisely which processes are waiting for I/O,
keep nr_iowait per-cgroup. This is used by the idle tick to decide
whether the system should be considered idle or waiting for I/O.

When only the root cgroup is in use, this should not behave much
differently from before.
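
Roughly, the idle-tick decision then looks like the user-space sketch
below (illustrative only, not the kernel code; the real counter is the
per-cpu atomic nr_iowait bumped in io_schedule()):

#include <stdio.h>

struct group_stat {
	int nr_iowait;		/* tasks of this group in io_schedule() */
	unsigned long long iowait, idle;
};

static void account_idle_tick(struct group_stat *g, unsigned long long tick)
{
	if (g->nr_iowait > 0)
		g->iowait += tick;	/* somebody is waiting for I/O */
	else
		g->idle += tick;	/* plain idle time */
}

int main(void)
{
	struct group_stat g = { 0, 0, 0 };

	account_idle_tick(&g, 1);	/* nothing waiting: idle   */
	g.nr_iowait++;			/* a task starts waiting   */
	account_idle_tick(&g, 1);	/* now counted as iowait   */
	g.nr_iowait--;

	printf("idle=%llu iowait=%llu\n", g.idle, g.iowait);
	return 0;
}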

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 include/linux/kernel_stat.h |    1 +
 kernel/sched.c              |   83 +++++++++++++++++++++++++++++++++++-------
 2 files changed, 70 insertions(+), 14 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index a0f1182..77e91f6 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -35,6 +35,7 @@ enum cpu_usage_stat {
 
 struct kernel_cpustat {
 	u64 cpustat[NR_STATS];
+	atomic_t nr_iowait;
 };
 
 struct kernel_stat {
diff --git a/kernel/sched.c b/kernel/sched.c
index c7ac150..800728e 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -639,8 +639,6 @@ struct rq {
 	u64 clock;
 	u64 clock_task;
 
-	atomic_t nr_iowait;
-
 #ifdef CONFIG_SMP
 	struct root_domain *rd;
 	struct sched_domain *sd;
@@ -817,6 +815,7 @@ static inline void task_group_account_field(struct task_struct *p,
 	 * get ourselves ahead and touch it first. If the root cgroup
 	 * is the only cgroup, then nothing else should be necessary.
 	 *
+	 * Same thing applies to the iowait related functions.
 	 */
 	__get_cpu_var(kernel_cpustat).cpustat[index] += tmp;
 
@@ -837,6 +836,50 @@ static inline void task_group_account_field(struct task_struct *p,
 #endif
 }
 
+static inline void task_group_nr_iowait_inc(struct task_struct *p, int cpu)
+{
+
+	atomic_inc(&per_cpu(kernel_cpustat, cpu).nr_iowait);
+
+#ifdef CONFIG_CGROUP_SCHED
+	if (static_branch(&sched_cgroup_enabled)) {
+		struct kernel_cpustat *kcpustat;
+		struct task_group *tg;
+
+		rcu_read_lock();
+		tg = task_group(p);
+		while (tg && (tg != &root_task_group)) {
+			kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+			atomic_inc(&kcpustat->nr_iowait);
+			tg = tg->parent;
+		}
+		rcu_read_unlock();
+	}
+#endif
+}
+
+static inline void task_group_nr_iowait_dec(struct task_struct *p, int cpu)
+{
+
+	atomic_dec(&per_cpu(kernel_cpustat, cpu).nr_iowait);
+
+#ifdef CONFIG_CGROUP_SCHED
+	if (static_branch(&sched_cgroup_enabled)) {
+		struct kernel_cpustat *kcpustat;
+		struct task_group *tg;
+
+		rcu_read_lock();
+		tg = task_group(p);
+		while (tg && (tg != &root_task_group)) {
+			kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+			atomic_dec(&kcpustat->nr_iowait);
+			tg = tg->parent;
+		}
+		rcu_read_unlock();
+	}
+#endif
+}
+
 void task_group_new_fork(struct task_struct *p)
 {
 	task_group_account_field(p, 1, TOTAL_FORKS);
@@ -3442,16 +3485,24 @@ unsigned long nr_iowait(void)
 {
 	unsigned long i, sum = 0;
 
-	for_each_possible_cpu(i)
-		sum += atomic_read(&cpu_rq(i)->nr_iowait);
+	for_each_possible_cpu(i) {
+		kstat_lock();
+		sum += atomic_read(&per_cpu(kernel_cpustat, i).nr_iowait);
+		kstat_unlock();
+	}
 
 	return sum;
 }
 
 unsigned long nr_iowait_cpu(int cpu)
 {
-	struct rq *this = cpu_rq(cpu);
-	return atomic_read(&this->nr_iowait);
+	unsigned long ret;
+
+	kstat_lock();
+	ret = atomic_read(&per_cpu(kernel_cpustat, cpu).nr_iowait);
+	kstat_unlock();
+
+	return ret;
 }
 
 unsigned long this_cpu_load(void)
@@ -4043,12 +4094,11 @@ void account_idle_time(cputime_t cputime)
 {
 	struct kernel_cpustat *kcpustat;
 	u64 cputime64 = cputime_to_cputime64(cputime);
-	struct rq *rq = this_rq();
 
 	kstat_lock();
 	kcpustat = kcpustat_this_cpu;
 
-	if (atomic_read(&rq->nr_iowait) > 0)
+	if (atomic_read(&kcpustat->nr_iowait) > 0)
 		kcpustat->cpustat[IOWAIT] += cputime64;
 	else
 		/* idle is always accounted to the root cgroup */
@@ -5915,14 +5965,15 @@ EXPORT_SYMBOL_GPL(yield_to);
 void __sched io_schedule(void)
 {
 	struct rq *rq = raw_rq();
+	int cpu = cpu_of(rq);
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
+	task_group_nr_iowait_inc(current, cpu);
 	blk_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	task_group_nr_iowait_dec(current, cpu);
 	delayacct_blkio_end();
 }
 EXPORT_SYMBOL(io_schedule);
@@ -5930,15 +5981,16 @@ EXPORT_SYMBOL(io_schedule);
 long __sched io_schedule_timeout(long timeout)
 {
 	struct rq *rq = raw_rq();
+	int cpu = cpu_of(rq);
 	long ret;
 
 	delayacct_blkio_start();
-	atomic_inc(&rq->nr_iowait);
+	task_group_nr_iowait_inc(current, cpu);
 	blk_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
-	atomic_dec(&rq->nr_iowait);
+	task_group_nr_iowait_dec(current, cpu);
 	delayacct_blkio_end();
 	return ret;
 }
@@ -8363,7 +8415,6 @@ void __init sched_init(void)
 #endif
 #endif
 		init_rq_hrtick(rq);
-		atomic_set(&rq->nr_iowait, 0);
 	}
 
 	set_load_weight(&init_task);
@@ -8766,6 +8817,7 @@ struct task_group *sched_create_group(struct task_group *parent)
 		root_kstat = per_cpu_ptr(root_task_group.cpustat, i);
 		kcpustat->cpustat[IDLE_BASE]  = root_kstat->cpustat[IDLE];
 		kcpustat->cpustat[STEAL_BASE]  = root_kstat->cpustat[STEAL];
+		atomic_set(&kcpustat->nr_iowait, 0);
 		kstat_unlock();
 	}
 
@@ -9660,6 +9712,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	u64 total_forks = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
+	unsigned long tg_iowait = 0;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *tg;
 	struct task_group *sib;
@@ -9701,6 +9754,8 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		guest += kcpustat->cpustat[GUEST];
 		guest_nice += kcpustat->cpustat[GUEST_NICE];
 		total_forks += kcpustat->cpustat[TOTAL_FORKS];
+		tg_iowait += atomic_read(&kcpustat->nr_iowait);
+
 #ifdef CONFIG_CGROUP_SCHED
 		if (static_branch(&sched_cgroup_enabled)) {
 			list_for_each_entry(sib, &tg->siblings, siblings) {
@@ -9807,7 +9862,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		(unsigned long)jif,
 		total_forks,
 		nr_running(),
-		nr_iowait());
+		tg_iowait),
 
 	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
 
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 10/14] Keep number of context switches per-cgroup
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (8 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 09/14] Keep nr_iowait per cgroup Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 11/14] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

This patch ties the number of context switches to a cgroup.
No impact is expected when per-cgroup statistics collection is
disabled in the root cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 include/linux/kernel_stat.h |    2 ++
 kernel/sched.c              |   24 ++++++++++++++++++------
 kernel/sched_debug.c        |    3 ++-
 3 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 77e91f6..2c32b24 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -30,6 +30,7 @@ enum cpu_usage_stat {
 	STEAL_BASE,
 	IDLE_BASE,
 	TOTAL_FORKS,
+	NR_SWITCHES,
 	NR_STATS,
 };
 
@@ -68,6 +69,7 @@ DECLARE_PER_CPU(struct kernel_cpustat, kernel_cpustat);
 
 
 extern unsigned long long nr_context_switches(void);
+extern unsigned long long nr_context_switches_cpu(int cpu);
 
 #ifndef CONFIG_GENERIC_HARDIRQS
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 800728e..4f91781 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -611,7 +611,6 @@ struct rq {
 	/* capture load from *all* tasks on this cpu: */
 	struct load_weight load;
 	unsigned long nr_load_updates;
-	u64 nr_switches;
 
 	struct cfs_rq cfs;
 	struct rt_rq rt;
@@ -3475,11 +3474,23 @@ unsigned long long nr_context_switches(void)
 	int i;
 	unsigned long long sum = 0;
 
-	for_each_possible_cpu(i)
-		sum += cpu_rq(i)->nr_switches;
+	for_each_possible_cpu(i) {
+		kstat_lock();
+		sum += per_cpu(kernel_cpustat, i).cpustat[NR_SWITCHES];
+		kstat_unlock();
+	}
 
 	return sum;
 }
+unsigned long long nr_context_switches_cpu(int cpu)
+{
+	unsigned long long ret;
+
+	kstat_lock();
+	ret = per_cpu(kernel_cpustat, cpu).cpustat[NR_SWITCHES];
+	kstat_unlock();
+	return ret;
+}
 
 unsigned long nr_iowait(void)
 {
@@ -4554,9 +4565,9 @@ need_resched:
 	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
-		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;
+		task_group_account_field(prev, 1, NR_SWITCHES);
 
 		context_switch(rq, prev, next); /* unlocks the rq */
 		/*
@@ -9713,6 +9724,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec boottime;
 	unsigned long tg_iowait = 0;
+	u64 tg_nr_switches = 0;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *tg;
 	struct task_group *sib;
@@ -9754,8 +9766,8 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		guest += kcpustat->cpustat[GUEST];
 		guest_nice += kcpustat->cpustat[GUEST_NICE];
 		total_forks += kcpustat->cpustat[TOTAL_FORKS];
+		tg_nr_switches += kcpustat->cpustat[NR_SWITCHES];
 		tg_iowait += atomic_read(&kcpustat->nr_iowait);
-
 #ifdef CONFIG_CGROUP_SCHED
 		if (static_branch(&sched_cgroup_enabled)) {
 			list_for_each_entry(sib, &tg->siblings, siblings) {
@@ -9858,7 +9870,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		"processes %llu\n"
 		"procs_running %lu\n"
 		"procs_blocked %lu\n",
-		nr_context_switches(),
+		tg_nr_switches,
 		(unsigned long)jif,
 		total_forks,
 		nr_running(),
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index a6710a1..c7464601 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -246,6 +246,7 @@ static void print_cpu(struct seq_file *m, int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
+	unsigned long long nr_switches = nr_context_switches_cpu(cpu);
 
 #ifdef CONFIG_X86
 	{
@@ -266,8 +267,8 @@ static void print_cpu(struct seq_file *m, int cpu)
 	P(nr_running);
 	SEQ_printf(m, "  .%-30s: %lu\n", "load",
 		   rq->load.weight);
-	P(nr_switches);
 	P(nr_load_updates);
+	SEQ_printf(m, "  .%-30s: %Ld\n","nr_switches", nr_switches);
 	P(nr_uninterruptible);
 	PN(next_balance);
 	P(curr->pid);
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 11/14] provide a version of cpuacct statistics inside cpu cgroup
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (9 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 10/14] Keep number of context switches per-cgroup Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 12/14] Keep number of running processes per-cgroup Glauber Costa
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa, Balbir Singh

For users interested in the information currently displayed
in cpuacct.stat, we now provide it inside the cpu cgroup.

This has the advantage of accounting the information only once
instead of twice whenever there is no need for an independent
group of tasks for accounting versus control, leading to much
less overhead.
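
A rough sketch of the aggregation, as a user-space model (illustrative
only; the real code reads the group's per-cpu kernel_cpustat and
converts with cputime64_to_clock_t()): "user" collapses USER + NICE and
"system" collapses SYSTEM + IRQ + SOFTIRQ, summed over all cpus:

#include <stdio.h>

enum { USER, NICE, SYSTEM, IRQ, SOFTIRQ, NR_STATS };

int main(void)
{
	/* made-up per-cpu counters for one group, two cpus */
	unsigned long long stat[2][NR_STATS] = {
		{ 100, 10, 40, 2, 3 },
		{ 200,  5, 60, 1, 4 },
	};
	unsigned long long usr = 0, sys = 0;
	int cpu;

	for (cpu = 0; cpu < 2; cpu++) {
		usr += stat[cpu][USER] + stat[cpu][NICE];
		sys += stat[cpu][SYSTEM] + stat[cpu][IRQ] +
		       stat[cpu][SOFTIRQ];
	}

	/* same two lines cpuacct.stat shows today */
	printf("user %llu\nsystem %llu\n", usr, sys);
	return 0;
}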

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
 kernel/sched.c |   43 ++++++++++++++++++++++++++++++++++++++-----
 1 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 4f91781..6ef6e1b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9237,6 +9237,11 @@ int sched_rt_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+static const char *cpuacct_stat_desc[] = {
+	[CPUACCT_STAT_USER] = "user",
+	[CPUACCT_STAT_SYSTEM] = "system",
+};
+
 #ifdef CONFIG_CGROUP_SCHED
 
 /* return corresponding task_group object of a cgroup */
@@ -9595,6 +9600,35 @@ static int cpu_set_sched_stats(struct cgroup *cgrp, struct cftype *cft, u64 val)
 	return 0;
 }
 
+static int cpu_cgroup_stats_show(struct cgroup *cgrp, struct cftype *cft,
+		struct cgroup_map_cb *cb)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	int cpu;
+	s64 val = 0;
+
+	for_each_present_cpu(cpu) {
+		struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+		val += kcpustat->cpustat[USER];
+		val += kcpustat->cpustat[NICE];
+	}
+	val = cputime64_to_clock_t(val);
+	cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_USER], val);
+
+	val = 0;
+	for_each_online_cpu(cpu) {
+		struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+		val += kcpustat->cpustat[SYSTEM];
+		val += kcpustat->cpustat[IRQ];
+		val += kcpustat->cpustat[SOFTIRQ];
+	}
+
+	val = cputime64_to_clock_t(val);
+	cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_SYSTEM], val);
+
+	return 0;
+}
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -9635,6 +9669,10 @@ static struct cftype cpu_files[] = {
 		.name = "proc.stat",
 		.read_seq_string = cpu_cgroup_proc_stat,
 	},
+	{
+		.name = "stat",
+		.read_map = cpu_cgroup_stats_show,
+	},
 };
 
 /*
@@ -10047,11 +10085,6 @@ static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
 	return 0;
 }
 
-static const char *cpuacct_stat_desc[] = {
-	[CPUACCT_STAT_USER] = "user",
-	[CPUACCT_STAT_SYSTEM] = "system",
-};
-
 static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
 		struct cgroup_map_cb *cb)
 {
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 12/14] Keep number of running processes per-cgroup
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (10 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 11/14] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-14 14:42   ` Andrew Wagin
  2011-11-01 21:19 ` [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup Glauber Costa
  2011-11-01 21:19 ` [PATCH v2 14/14] Change CPUACCT to default n Glauber Costa
  13 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa

This relies on the fact that the scheduler classes already keep
the number of running processes internally, in their per-group
runqueues.
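
In rough user-space terms (illustrative only; the kernel reads
cfs.nr_running and rt.rt_nr_running from each runqueue, and falls back
to the runqueue-wide count for the root group), the procs_running
value is just a sum over cpus of the per-class counters:

#include <stdio.h>

struct class_counts {
	unsigned int cfs_nr_running;	/* fair-class tasks of the group */
	unsigned int rt_nr_running;	/* rt-class tasks of the group   */
};

int main(void)
{
	struct class_counts percpu[2] = { { 3, 1 }, { 2, 0 } };
	unsigned long nr_running = 0;
	int cpu;

	for (cpu = 0; cpu < 2; cpu++)
		nr_running += percpu[cpu].cfs_nr_running +
			      percpu[cpu].rt_nr_running;

	printf("procs_running %lu\n", nr_running);	/* 6 */
	return 0;
}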

Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 kernel/sched.c |   15 ++++++++++++++-
 1 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 6ef6e1b..d93cfd4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9763,6 +9763,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 	struct timespec boottime;
 	unsigned long tg_iowait = 0;
 	u64 tg_nr_switches = 0;
+	unsigned long tg_nr_running = 0;
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group *tg;
 	struct task_group *sib;
@@ -9822,6 +9823,18 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		}
 #endif
 		kstat_unlock();
+#ifdef CONFIG_CGROUP_SCHED
+		/* root task group has autogrouping, so this doesn't hold */
+		if  (tg != &root_task_group) {
+#ifdef CONFIG_FAIR_GROUP_SCHED
+			tg_nr_running += cpu_rq(i)->cfs.nr_running;
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+			tg_nr_running += cpu_rq(i)->rt.rt_nr_running;
+#endif
+		} else
+#endif
+			tg_nr_running += cpu_rq(i)->nr_running;
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
 			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -9911,7 +9924,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
 		tg_nr_switches,
 		(unsigned long)jif,
 		total_forks,
-		nr_running(),
+		tg_nr_running,
 		tg_iowait),
 
 	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (11 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 12/14] Keep number of running processes per-cgroup Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-09 11:51   ` Andrew Wagin
  2011-11-01 21:19 ` [PATCH v2 14/14] Change CPUACCT to default n Glauber Costa
  13 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa, Balbir Singh

For users interested in the information currently displayed
in cpuacct.usage and cpuacct.usage_percpu, we now provide it
inside the cpu cgroup.
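
The write side resets the counter by snapshotting a baseline rather
than zeroing the scheduler's running total; a read is then presumably
meant to report the delta since the last snapshot. A rough user-space
sketch of that idiom follows (illustrative only, not the patch itself;
in the kernel the baseline is prev_sum_exec_runtime):

#include <stdio.h>

struct usage {
	unsigned long long sum_exec_runtime;		/* keeps growing     */
	unsigned long long prev_sum_exec_runtime;	/* baseline at reset */
};

static unsigned long long usage_read(const struct usage *u)
{
	return u->sum_exec_runtime - u->prev_sum_exec_runtime;
}

static void usage_reset(struct usage *u)
{
	/* never zero the live counter, only move the baseline */
	u->prev_sum_exec_runtime = u->sum_exec_runtime;
}

int main(void)
{
	struct usage u = { 0, 0 };

	u.sum_exec_runtime += 500;	/* some cpu time is consumed */
	printf("before reset: %llu\n", usage_read(&u));
	usage_reset(&u);		/* what writing 0 to cpu.usage maps to */
	u.sum_exec_runtime += 70;
	printf("after reset:  %llu\n", usage_read(&u));
	return 0;
}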

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
 kernel/sched.c      |   83 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched_fair.c |   10 ++++++
 kernel/sched_rt.c   |    4 ++
 3 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d93cfd4..b9296cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -303,6 +303,7 @@ struct task_group {
 
 	struct cfs_bandwidth cfs_bandwidth;
 	struct kernel_cpustat __percpu *cpustat;
+	u64 __percpu *cpuusage;
 	struct timespec start_time;
 };
 
@@ -344,6 +345,8 @@ struct cfs_rq {
 #ifndef CONFIG_64BIT
 	u64 min_vruntime_copy;
 #endif
+	u64 sum_exec_runtime;
+	u64 prev_sum_exec_runtime;
 
 	struct rb_root tasks_timeline;
 	struct rb_node *rb_leftmost;
@@ -547,7 +550,10 @@ struct rt_rq {
 	struct rq *rq;
 	struct list_head leaf_rt_rq_list;
 	struct task_group *tg;
+
 #endif
+	u64 sum_exec_runtime;
+	u64 prev_sum_exec_runtime;
 };
 
 #ifdef CONFIG_SMP
@@ -8359,6 +8365,10 @@ void __init sched_init(void)
 
 	root_task_group.start_time = (struct timespec){0, 0};
 	root_task_group.cpustat = &kernel_cpustat;
+
+	root_task_group.cpuusage = alloc_percpu(u64);
+	/* Failing that early an allocation means we're screwed anyway */
+	BUG_ON(!root_task_group.cpuusage);
 #endif /* CONFIG_CGROUP_SCHED */
 
 	for_each_possible_cpu(i) {
@@ -8796,6 +8806,7 @@ static void free_sched_group(struct task_group *tg)
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
 	free_percpu(tg->cpustat);
+	free_percpu(tg->cpuusage);
 	kfree(tg);
 }
 
@@ -8816,6 +8827,10 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!alloc_rt_sched_group(tg, parent))
 		goto err;
 
+	tg->cpuusage = alloc_percpu(u64);
+	if (!tg->cpuusage)
+		goto err;
+
 	tg->cpustat = alloc_percpu(struct kernel_cpustat);
 	if (!tg->cpustat)
 		goto err;
@@ -9629,6 +9644,65 @@ static int cpu_cgroup_stats_show(struct cgroup *cgrp, struct cftype *cft,
 	return 0;
 }
 
+static u64 cpu_cgroup_usage_cpu(struct cgroup *cgrp, int i)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	u64 ret = 0;
+
+	ret = tg->cfs_rq[i]->sum_exec_runtime;
+
+	return ret;
+}
+
+static u64 cpu_cgroup_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	u64 totalcpuusage = 0;
+	int i;
+
+	for_each_present_cpu(i)
+		totalcpuusage += cpu_cgroup_usage_cpu(cgrp, i);
+
+	return totalcpuusage;
+}
+
+static int cpu_cgroup_cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
+								u64 reset)
+{
+	int err = 0;
+	int i;
+	struct task_group *tg = cgroup_tg(cgrp);
+
+	if (reset) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	for_each_present_cpu(i)
+		if (tg == &root_task_group)
+			cpu_rq(i)->cfs.prev_sum_exec_runtime =
+						cpu_rq(i)->cfs.sum_exec_runtime;
+		else
+			tg->se[i]->prev_sum_exec_runtime =
+						tg->se[i]->sum_exec_runtime;
+
+out:
+	return err;
+}
+
+static int cpu_cgroup_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
+				      struct seq_file *m)
+{
+	u64 percpu;
+	int i;
+
+	for_each_present_cpu(i) {
+		percpu = cpu_cgroup_usage_cpu(cgroup, i);
+		seq_printf(m, "%llu ", (unsigned long long) percpu);
+	}
+	seq_printf(m, "\n");
+	return 0;
+}
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -9673,6 +9747,15 @@ static struct cftype cpu_files[] = {
 		.name = "stat",
 		.read_map = cpu_cgroup_stats_show,
 	},
+	{
+		.name = "usage",
+		.read_u64 = cpu_cgroup_cpuusage_read,
+		.write_u64 = cpu_cgroup_cpuusage_write,
+	},
+	{
+		.name = "usage_percpu",
+		.read_seq_string = cpu_cgroup_percpu_seq_read,
+	},
 };
 
 /*
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..030b8eb 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -560,6 +560,16 @@ __update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
 		      max((u64)delta_exec, curr->statistics.exec_max));
 
 	curr->sum_exec_runtime += delta_exec;
+
+	/*
+	 * sched_entities are moved around runqueues and cpus at all times.
+	 * we want to record the total exec time of a particular entity (curr)
+	 * but we are also interested in the total time this particular runqueue
+	 * got. So we have to increase the total runtime in two different locations
+	 */
+	if (static_branch(&sched_cgroup_enabled))
+		cfs_rq->sum_exec_runtime += delta_exec;
+
 	schedstat_add(cfs_rq, exec_clock, delta_exec);
 	delta_exec_weighted = calc_delta_fair(delta_exec, curr);
 
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 056cbd2..2edaeb4 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -686,6 +686,10 @@ static void update_curr_rt(struct rq *rq)
 	schedstat_set(curr->se.statistics.exec_max, max(curr->se.statistics.exec_max, delta_exec));
 
 	curr->se.sum_exec_runtime += delta_exec;
+
+	if (static_branch(&sched_cgroup_enabled))
+		rq->rt.sum_exec_runtime += delta_exec;
+
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock_task;
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
                   ` (12 preceding siblings ...)
  2011-11-01 21:19 ` [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup Glauber Costa
@ 2011-11-01 21:19 ` Glauber Costa
  2011-11-11 21:33   ` Paul Turner
  13 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-01 21:19 UTC (permalink / raw)
  To: linux-kernel
  Cc: paul, lizf, daniel.lezcano, a.p.zijlstra, jbottomley, pjt,
	fweisbec, Glauber Costa, Balbir Singh

Now that we provide all of the cpuacct capability in the cpu
cgroup, default CPUACCT to n. We still keep it, for compatibility,
for anyone who needs an independent set of tasks to be managed by
it relative to the cpu cgroup, but we encourage the use of the cpu
cgroup for that.

Deprecation is proposed for 3.5.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Balbir Singh <bsingharora@gmail.com>
---
 Documentation/feature-removal-schedule.txt |    8 ++++++++
 init/Kconfig                               |    1 +
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index d5ac362..91165fe 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -609,3 +609,11 @@ When:	3.5
 Why:	The iwlagn module has been renamed iwlwifi.  The alias will be around
 	for backward compatibility for several cycles and then dropped.
 Who:	Don Fry <donald.h.fry@intel.com>
+
+----------------------------
+
+What:	cpuacct cgroup
+When:	3.5
+Why:	Same functionality is provided by the CGROUP_SCHED, with a lower
+	footprint.
+Who:	Glauber Costa <glommer@parallels.com>
diff --git a/init/Kconfig b/init/Kconfig
index 31ba0fd..d3d958b 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -626,6 +626,7 @@ config PROC_PID_CPUSET
 
 config CGROUP_CPUACCT
 	bool "Simple CPU accounting cgroup subsystem"
+	default n
 	help
 	  Provides a simple Resource Controller for monitoring the
 	  total CPU consumed by the tasks in a cgroup.
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup
  2011-11-01 21:19 ` [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup Glauber Costa
@ 2011-11-09 11:51   ` Andrew Wagin
  2011-11-09 11:58     ` Glauber Costa
  0 siblings, 1 reply; 50+ messages in thread
From: Andrew Wagin @ 2011-11-09 11:51 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, pjt, fweisbec, Balbir Singh

Hi Glauber,

> diff --git a/kernel/sched.c b/kernel/sched.c
> index d93cfd4..b9296cd 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -303,6 +303,7 @@ struct task_group {
>
>        struct cfs_bandwidth cfs_bandwidth;
>        struct kernel_cpustat __percpu *cpustat;
> +       u64 __percpu *cpuusage;

I can't find where you use this variable; it can probably be removed.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup
  2011-11-09 11:51   ` Andrew Wagin
@ 2011-11-09 11:58     ` Glauber Costa
  2011-11-09 14:18       ` Andrew Wagin
  0 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-09 11:58 UTC (permalink / raw)
  To: Andrew Wagin
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, pjt, fweisbec, Balbir Singh

On 11/09/2011 09:51 AM, Andrew Wagin wrote:
> Hi Glauber,
>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index d93cfd4..b9296cd 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -303,6 +303,7 @@ struct task_group {
>>
>>         struct cfs_bandwidth cfs_bandwidth;
>>         struct kernel_cpustat __percpu *cpustat;
>> +       u64 __percpu *cpuusage;
>
> I can't find where you use this variable, probably it may be removed.

Indeed.
This is a leftover from the previous attempt.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup
  2011-11-09 11:58     ` Glauber Costa
@ 2011-11-09 14:18       ` Andrew Wagin
  2011-11-09 15:30         ` Peter Zijlstra
  2011-11-09 16:51         ` Glauber Costa
  0 siblings, 2 replies; 50+ messages in thread
From: Andrew Wagin @ 2011-11-09 14:18 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, pjt, fweisbec, Balbir Singh

And look at cfs_rq->prev_sum_exec_runtime; it is probably unused as well.
The usage of cfs_rq->sum_exec_runtime looks strange.

2011/11/9 Glauber Costa <glommer@parallels.com>:
> On 11/09/2011 09:51 AM, Andrew Wagin wrote:
>>
>> Hi Glauber,
>>
>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>> index d93cfd4..b9296cd 100644
>>> --- a/kernel/sched.c
>>> +++ b/kernel/sched.c
>>> @@ -303,6 +303,7 @@ struct task_group {
>>>
>>>        struct cfs_bandwidth cfs_bandwidth;
>>>        struct kernel_cpustat __percpu *cpustat;
>>> +       u64 __percpu *cpuusage;
>>
>> I can't find where you use this variable, probably it may be removed.
>
> Indeed.
> This is a leftover from the previous attempt.
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup
  2011-11-09 14:18       ` Andrew Wagin
@ 2011-11-09 15:30         ` Peter Zijlstra
  2011-11-09 16:51         ` Glauber Costa
  1 sibling, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2011-11-09 15:30 UTC (permalink / raw)
  To: Andrew Wagin
  Cc: Glauber Costa, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, pjt, fweisbec, Balbir Singh

On Wed, 2011-11-09 at 17:18 +0300, Andrew Wagin wrote:
> And look at cfs_rq->prev_sum_exec_runtime, probably it is not used too.
> Usage of cfs_rq->sum_exec_runtime looks strange.

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup
  2011-11-09 14:18       ` Andrew Wagin
  2011-11-09 15:30         ` Peter Zijlstra
@ 2011-11-09 16:51         ` Glauber Costa
  2011-11-10  8:59           ` Andrew Wagin
  1 sibling, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-09 16:51 UTC (permalink / raw)
  To: Andrew Wagin
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, pjt, fweisbec, Balbir Singh

On 11/09/2011 12:18 PM, Andrew Wagin wrote:
> And look at cfs_rq->prev_sum_exec_runtime, probably it is not used too.
> Usage of cfs_rq->sum_exec_runtime looks strange.

No, it is not.
se->sum_exec_runtime is a per-se measure. It follows the task group as 
it moves from rq to rq. cfs_rq->sum_exec_runtime (and its rt 
counterpart) is a measurement of the accumulated runtime of all tasks 
that ever passed through this rq.
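
Restating that with the two accumulation points from the patch (both
hunks are quoted earlier in this thread):

	/* kernel/sched_fair.c, __update_curr() */
	curr->sum_exec_runtime += delta_exec;           /* per-entity: moves with curr from rq to rq */
	if (static_branch(&sched_cgroup_enabled))
		cfs_rq->sum_exec_runtime += delta_exec; /* per-rq: everything that ever ran on this rq */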

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup
  2011-11-09 16:51         ` Glauber Costa
@ 2011-11-10  8:59           ` Andrew Wagin
  0 siblings, 0 replies; 50+ messages in thread
From: Andrew Wagin @ 2011-11-10  8:59 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, pjt, fweisbec, Balbir Singh

2011/11/9 Glauber Costa <glommer@parallels.com>:
> On 11/09/2011 12:18 PM, Andrew Wagin wrote:
>>
>> And look at cfs_rq->prev_sum_exec_runtime, probably it is not used too.
>> Usage of cfs_rq->sum_exec_runtime looks strange.
>
> No, it is not.
> se->sum_exec_runtime is a per-se measure. It follows the task group as it
> moves from rq to rq. cfs->rq->sum_exec_runtime (and it's rt counterpart) is
> a measurement of accumulated runtime of all tasks that ever passed through
> this rq.
>

1. cfs_rq->sum_exec_runtime and cfs_rq->exec_clock are the same thing.

I enabled sched_stat by default and executed the following commands:
# cat /proc/sched_debug  | grep exec_clock
  .exec_clock                    : 2959.768408
  .exec_clock                    : 2340.166419
# cat /cgroup/cpu.usage_percpu
2960772019 2341358764

2. Resetting cpu.usage doesn't work for the root cgroup.
[root@dhcp-10-30-20-19 ~]# echo 0 > /cgroup/cpu.usage
[root@dhcp-10-30-20-19 ~]# cat /cgroup/cpu.usage
6087490986

Probably for this reason cfs_rq->prev_sum_exec_runtime is unused.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 09/14] Keep nr_iowait per cgroup
  2011-11-01 21:19 ` [PATCH v2 09/14] Keep nr_iowait per cgroup Glauber Costa
@ 2011-11-10 10:27   ` Andrew Wagin
  0 siblings, 0 replies; 50+ messages in thread
From: Andrew Wagin @ 2011-11-10 10:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, pjt, fweisbec

This code has a race condition:
a task may be moved to another group between task_group_nr_iowait_inc()
and task_group_nr_iowait_dec().

# dd if=/dev/sda of=//dev/null bs=4095 & pid=$!
# echo $pid > /cgroup/1/tasks
# kill $pid
# cat /cgroup/1/cpu.proc.stat  | grep procs_blocked
procs_blocked 18446744073709551615

> +static inline void task_group_nr_iowait_inc(struct task_struct *p, int cpu)
> +{
> +
> +       atomic_inc(&per_cpu(kernel_cpustat, cpu).nr_iowait);
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +       if (static_branch(&sched_cgroup_enabled)) {
> +               struct kernel_cpustat *kcpustat;
> +               struct task_group *tg;
> +
> +               rcu_read_lock();
> +               tg = task_group(p);
> +               while (tg && (tg != &root_task_group)) {
> +                       kcpustat = per_cpu_ptr(tg->cpustat, cpu);
> +                       atomic_inc(&kcpustat->nr_iowait);
> +                       tg = tg->parent;
> +               }
> +               rcu_read_unlock();
> +       }
> +#endif
> +}
> +
> +static inline void task_group_nr_iowait_dec(struct task_struct *p, int cpu)
> +{
> +
> +       atomic_dec(&per_cpu(kernel_cpustat, cpu).nr_iowait);
> +
> +#ifdef CONFIG_CGROUP_SCHED
> +       if (static_branch(&sched_cgroup_enabled)) {
> +               struct kernel_cpustat *kcpustat;
> +               struct task_group *tg;
> +
> +               rcu_read_lock();
> +               tg = task_group(p);
> +               while (tg && (tg != &root_task_group)) {
> +                       kcpustat = per_cpu_ptr(tg->cpustat, cpu);
> +                       atomic_dec(&kcpustat->nr_iowait);
> +                       tg = tg->parent;
> +               }
> +               rcu_read_unlock();
> +       }
> +#endif
> +}
> +
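
One conceivable way to close this race (purely illustrative, not
something proposed in the thread) is to remember which group was charged
at inc time and credit the same group at dec time, e.g. via a
hypothetical per-task pointer:

	/* sketch only: p->iowait_tg is a made-up field */
	static inline void task_group_nr_iowait_inc(struct task_struct *p, int cpu)
	{
		struct task_group *tg;

		atomic_inc(&per_cpu(kernel_cpustat, cpu).nr_iowait);

		rcu_read_lock();
		tg = task_group(p);
		p->iowait_tg = tg;	/* remember which group got charged */
		for (; tg && tg != &root_task_group; tg = tg->parent)
			atomic_inc(&per_cpu_ptr(tg->cpustat, cpu)->nr_iowait);
		rcu_read_unlock();
	}

task_group_nr_iowait_dec() would then walk p->iowait_tg instead of
task_group(p). Keeping the recorded group valid across the sleep (taking
a reference, or otherwise guaranteeing it is not freed) is glossed over
here.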

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-01 21:19 ` [PATCH v2 14/14] Change CPUACCT to default n Glauber Costa
@ 2011-11-11 21:33   ` Paul Turner
  2011-11-12 10:29     ` Glauber Costa
  0 siblings, 1 reply; 50+ messages in thread
From: Paul Turner @ 2011-11-11 21:33 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, fweisbec, Balbir Singh

On 11/01/2011 02:19 PM, Glauber Costa wrote:
> Now that we're providing all cpuacct capability in cpu cgroup,
> default CPUACCT to n. We still maintain it for compatiblity for
> anyone that need an independent set of tasks to be managed by it
> relatively to cpu cgroup, but encourage the use of cpucgroup for that.
>
> Proposing schedule of deprecation for 3.5
>


I'd like to see a little more separation here, beyond a single linear series.


We're doing the following things:

1. Migrating the existing cpuacct functionality into cpu
2. Deprecating cpuacct
3. Adding new functionality

I would like to consider (3) separately from (1) and (2), which we can and 
should accomplish immediately due to the overhead it is currently introducing. 
It seems less than optimal to hinge resolving that on reaching agreement for 
the new bits.

It also helps that the migrated functionality in (1) is just exporting state 
that is already being maintained by cpu, so those changes end up being quite 
small and non-invasive.

- Paul

> Signed-off-by: Glauber Costa<glommer@parallels.com>
> CC: Balbir Singh<bsingharora@gmail.com>
> ---
>   Documentation/feature-removal-schedule.txt |    8 ++++++++
>   init/Kconfig                               |    1 +
>   2 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
> index d5ac362..91165fe 100644
> --- a/Documentation/feature-removal-schedule.txt
> +++ b/Documentation/feature-removal-schedule.txt
> @@ -609,3 +609,11 @@ When:	3.5
>   Why:	The iwlagn module has been renamed iwlwifi.  The alias will be around
>   	for backward compatibility for several cycles and then dropped.
>   Who:	Don Fry<donald.h.fry@intel.com>
> +
> +----------------------------
> +
> +What:	cpuacct cgroup
> +When:	3.5
> +Why:	Same functionality is provided by the CGROUP_SCHED, with a lower
> +	footprint.
> +Who:	Glauber Costa<glommer@parallels.com>
> diff --git a/init/Kconfig b/init/Kconfig
> index 31ba0fd..d3d958b 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -626,6 +626,7 @@ config PROC_PID_CPUSET
>
>   config CGROUP_CPUACCT
>   	bool "Simple CPU accounting cgroup subsystem"
> +	default n
>   	help
>   	  Provides a simple Resource Controller for monitoring the
>   	  total CPU consumed by the tasks in a cgroup.


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 01/14] trivial: initialize root cgroup's sibling list
  2011-11-01 21:19 ` [PATCH v2 01/14] trivial: initialize root cgroup's sibling list Glauber Costa
@ 2011-11-11 21:34   ` Paul Turner
  2011-11-14 19:44     ` Glauber Costa
  2011-11-18 23:42   ` [tip:sched/core] sched, trivial: Initialize " tip-bot for Glauber Costa
  1 sibling, 1 reply; 50+ messages in thread
From: Paul Turner @ 2011-11-11 21:34 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, fweisbec

On 11/01/2011 02:19 PM, Glauber Costa wrote:
> Even though there are no siblings, the list should be
> initialized not to contain bogus values.
>
> Signed-off-by: Glauber Costa<glommer@parallels.com>
> Acked-by: Paul Menage<paul@paulmenage.org>

Signed-off-by: Paul Turner <pjt@google.com>

> ---
>   kernel/sched.c |    1 +
>   1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index d87c6e5..e327665 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -8234,6 +8234,7 @@ void __init sched_init(void)
>   #ifdef CONFIG_CGROUP_SCHED
>   	list_add(&root_task_group.list,&task_groups);
>   	INIT_LIST_HEAD(&root_task_group.children);
> +	INIT_LIST_HEAD(&root_task_group.siblings);
>   	autogroup_init(&init_task);
>   #endif /* CONFIG_CGROUP_SCHED */
>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 03/14] Move /proc/stat logic inside sched.c
  2011-11-01 21:19 ` [PATCH v2 03/14] Move /proc/stat logic inside sched.c Glauber Costa
@ 2011-11-12  1:35   ` Paul Turner
  2011-11-12 10:27     ` Glauber Costa
  0 siblings, 1 reply; 50+ messages in thread
From: Paul Turner @ 2011-11-12  1:35 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, fweisbec

On 11/01/2011 02:19 PM, Glauber Costa wrote:
> This patch moves all of the /proc/stat display code inside
> sched.c. The goal is to later on, have a different version
> of it per-cgroup. In containers environment, this is useful
> to give each container a different and independent view of
> the statistics displayed in this file.
>
> Signed-off-by: Glauber Costa<glommer@parallels.com>
> ---
>   fs/proc/stat.c        |  139 +-----------------------------------------------
>   include/linux/sched.h |    1 +
>   kernel/sched.c        |  142 +++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 144 insertions(+), 138 deletions(-)
>

This feels a little contrived.  sched.c isn't the right place for much of this 
code.  Why do you want to move it all instead of exporting the functionality 
(e.g. remove static?).

> diff --git a/kernel/sched.c b/kernel/sched.c
> index e78e1aa..3f42916 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -71,6 +71,7 @@
>   #include<linux/ctype.h>
>   #include<linux/ftrace.h>
>   #include<linux/slab.h>
> +#include<linux/tick.h>
>
>   #include<asm/tlb.h>
>   #include<asm/irq_regs.h>
> @@ -9494,6 +9495,147 @@ struct cgroup_subsys cpu_cgroup_subsys = {
>
>   #endif	/* CONFIG_CGROUP_SCHED */
>
> +#ifndef arch_irq_stat_cpu
> +#define arch_irq_stat_cpu(cpu) 0
> +#endif
> +#ifndef arch_irq_stat
> +#define arch_irq_stat() 0
> +#endif
> +#ifndef arch_idle_time
> +#define arch_idle_time(cpu) 0
> +#endif
> +
> +static u64 get_idle_time(int cpu)
> +{
> +	u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
> +
> +	if (idle_time == -1ULL) {
> +		/* !NO_HZ so we can rely on cpustat.idle */
> +		idle = kstat_cpu(cpu).cpustat[IDLE];
> +		idle += arch_idle_time(cpu);
> +	} else
> +		idle = usecs_to_cputime(idle_time);
> +
> +	return idle;
> +}
> +
> +static u64 get_iowait_time(int cpu)
> +{
> +	u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
> +
> +	if (iowait_time == -1ULL)
> +		/* !NO_HZ so we can rely on cpustat.iowait */
> +		iowait = kstat_cpu(cpu).cpustat[IOWAIT];
> +	else
> +		iowait = usecs_to_cputime(iowait_time);
> +
> +	return iowait;
> +}
> +
> +int cpu_cgroup_proc_stat(struct seq_file *p)
> +{
> +	int i, j;
> +	unsigned long jif;
> +	u64 user, nice, system, idle, iowait, irq, softirq, steal;
> +	u64 guest, guest_nice;
> +	u64 sum = 0;
> +	u64 sum_softirq = 0;
> +	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
> +	struct timespec boottime;
> +
> +	user = nice = system = idle = iowait =
> +		irq = softirq = steal = 0;
> +	guest = guest_nice = 0;
> +	getboottime(&boottime);
> +	jif = boottime.tv_sec;
> +
> +	for_each_possible_cpu(i) {
> +		user += kstat_cpu(i).cpustat[USER];
> +		nice += kstat_cpu(i).cpustat[NICE];
> +		system += kstat_cpu(i).cpustat[SYSTEM];
> +		idle += get_idle_time(i);
> +		iowait += get_iowait_time(i);
> +		irq += kstat_cpu(i).cpustat[IRQ];
> +		softirq += kstat_cpu(i).cpustat[SOFTIRQ];
> +		steal += kstat_cpu(i).cpustat[STEAL];
> +		guest += kstat_cpu(i).cpustat[GUEST];
> +		guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
> +
> +		for (j = 0; j<  NR_SOFTIRQS; j++) {
> +			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
> +
> +			per_softirq_sums[j] += softirq_stat;
> +			sum_softirq += softirq_stat;
> +		}
> +	}
> +	sum += arch_irq_stat();
> +
> +	seq_printf(p, "cpu  %llu %llu %llu %llu %llu %llu %llu %llu %llu "
> +		"%llu\n",
> +		(unsigned long long)cputime64_to_clock_t(user),
> +		(unsigned long long)cputime64_to_clock_t(nice),
> +		(unsigned long long)cputime64_to_clock_t(system),
> +		(unsigned long long)cputime64_to_clock_t(idle),
> +		(unsigned long long)cputime64_to_clock_t(iowait),
> +		(unsigned long long)cputime64_to_clock_t(irq),
> +		(unsigned long long)cputime64_to_clock_t(softirq),
> +		(unsigned long long)cputime64_to_clock_t(steal),
> +		(unsigned long long)cputime64_to_clock_t(guest),
> +		(unsigned long long)cputime64_to_clock_t(guest_nice));
> +	for_each_online_cpu(i) {
> +		/* Copy values here to work around gcc-2.95.3, gcc-2.96 */

GCC 3.2 is the listed current minimum requirement.  If this code is being 
revisited this could be cleaned up.

> +		user = kstat_cpu(i).cpustat[USER];
> +		nice = kstat_cpu(i).cpustat[NICE];
> +		system = kstat_cpu(i).cpustat[SYSTEM];
> +		idle = get_idle_time(i);
> +		iowait = get_iowait_time(i);
> +		irq = kstat_cpu(i).cpustat[IRQ];
> +		softirq = kstat_cpu(i).cpustat[SOFTIRQ];
> +		steal = kstat_cpu(i).cpustat[STEAL];
> +		guest = kstat_cpu(i).cpustat[GUEST];
> +		guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
> +		seq_printf(p,
> +			"cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
> +			"%llu\n",
> +			i,
> +			(unsigned long long)cputime64_to_clock_t(user),
> +			(unsigned long long)cputime64_to_clock_t(nice),
> +			(unsigned long long)cputime64_to_clock_t(system),
> +			(unsigned long long)cputime64_to_clock_t(idle),
> +			(unsigned long long)cputime64_to_clock_t(iowait),
> +			(unsigned long long)cputime64_to_clock_t(irq),
> +			(unsigned long long)cputime64_to_clock_t(softirq),
> +			(unsigned long long)cputime64_to_clock_t(steal),
> +			(unsigned long long)cputime64_to_clock_t(guest),
> +			(unsigned long long)cputime64_to_clock_t(guest_nice));
> +	}
> +	seq_printf(p, "intr %llu", (unsigned long long)sum);
> +
> +	/* sum again ? it could be updated? */
> +	for_each_irq_nr(j)
> +		seq_printf(p, " %u", kstat_irqs(j));
> +
> +	seq_printf(p,
> +		"\nctxt %llu\n"
> +		"btime %lu\n"
> +		"processes %lu\n"
> +		"procs_running %lu\n"
> +		"procs_blocked %lu\n",
> +		nr_context_switches(),
> +		(unsigned long)jif,
> +		total_forks,
> +		nr_running(),
> +		nr_iowait());
> +
> +	seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
> +
> +	for (i = 0; i<  NR_SOFTIRQS; i++)
> +		seq_printf(p, " %u", per_softirq_sums[i]);
> +	seq_putc(p, '\n');
> +
> +	return 0;
> +}
> +
>   #ifdef CONFIG_CGROUP_CPUACCT
>
>   /*


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 03/14] Move /proc/stat logic inside sched.c
  2011-11-12  1:35   ` Paul Turner
@ 2011-11-12 10:27     ` Glauber Costa
  0 siblings, 0 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-12 10:27 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, fweisbec

On 11/11/2011 11:35 PM, Paul Turner wrote:
> On 11/01/2011 02:19 PM, Glauber Costa wrote:
>> This patch moves all of the /proc/stat display code inside
>> sched.c. The goal is to later on, have a different version
>> of it per-cgroup. In containers environment, this is useful
>> to give each container a different and independent view of
>> the statistics displayed in this file.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> ---
>> fs/proc/stat.c | 139 +-----------------------------------------------
>> include/linux/sched.h | 1 +
>> kernel/sched.c | 142 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 144 insertions(+), 138 deletions(-)
>>
>
> This feels a little contrived. sched.c isn't the right place for much of
> this code. Why do you want to move it all instead of exporting the
> functionality (e.g. remove static?).

Because later on in the series I start using task_group, which is a 
scheduler-internal data structure (the very goal here is achieving 
per-cgroup data).

It seemed better to me to do it this way - much as I agree with you that 
a lot of this may not belong in sched.c - than to use external functions. 
Since all we have outside sched.c is a task_struct, we'd have to derive a 
task_group from it every time. Also, as you can see in the follow-up 
patches, which task_group to use depends on the caller and possibly on 
some runtime variables, so the computation is not as trivially fast as 
just reading a field in a structure.
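
To make that concrete, outside sched.c every call site would first need
something like the following (a rough sketch, paraphrasing the helpers
sched.c uses internally):

	struct task_group *tg;

	rcu_read_lock();
	tg = task_group(p);	/* css lookup through p->cgroups, plus autogroup handling */
	/* ... read or update the per-group statistics ... */
	rcu_read_unlock();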


>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index e78e1aa..3f42916 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -71,6 +71,7 @@
>> #include<linux/ctype.h>
>> #include<linux/ftrace.h>
>> #include<linux/slab.h>
>> +#include<linux/tick.h>
>>
>> #include<asm/tlb.h>
>> #include<asm/irq_regs.h>
>> @@ -9494,6 +9495,147 @@ struct cgroup_subsys cpu_cgroup_subsys = {
>>
>> #endif /* CONFIG_CGROUP_SCHED */
>>
>> +#ifndef arch_irq_stat_cpu
>> +#define arch_irq_stat_cpu(cpu) 0
>> +#endif
>> +#ifndef arch_irq_stat
>> +#define arch_irq_stat() 0
>> +#endif
>> +#ifndef arch_idle_time
>> +#define arch_idle_time(cpu) 0
>> +#endif
>> +
>> +static u64 get_idle_time(int cpu)
>> +{
>> + u64 idle, idle_time = get_cpu_idle_time_us(cpu, NULL);
>> +
>> + if (idle_time == -1ULL) {
>> + /* !NO_HZ so we can rely on cpustat.idle */
>> + idle = kstat_cpu(cpu).cpustat[IDLE];
>> + idle += arch_idle_time(cpu);
>> + } else
>> + idle = usecs_to_cputime(idle_time);
>> +
>> + return idle;
>> +}
>> +
>> +static u64 get_iowait_time(int cpu)
>> +{
>> + u64 iowait, iowait_time = get_cpu_iowait_time_us(cpu, NULL);
>> +
>> + if (iowait_time == -1ULL)
>> + /* !NO_HZ so we can rely on cpustat.iowait */
>> + iowait = kstat_cpu(cpu).cpustat[IOWAIT];
>> + else
>> + iowait = usecs_to_cputime(iowait_time);
>> +
>> + return iowait;
>> +}
>> +
>> +int cpu_cgroup_proc_stat(struct seq_file *p)
>> +{
>> + int i, j;
>> + unsigned long jif;
>> + u64 user, nice, system, idle, iowait, irq, softirq, steal;
>> + u64 guest, guest_nice;
>> + u64 sum = 0;
>> + u64 sum_softirq = 0;
>> + unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
>> + struct timespec boottime;
>> +
>> + user = nice = system = idle = iowait =
>> + irq = softirq = steal = 0;
>> + guest = guest_nice = 0;
>> + getboottime(&boottime);
>> + jif = boottime.tv_sec;
>> +
>> + for_each_possible_cpu(i) {
>> + user += kstat_cpu(i).cpustat[USER];
>> + nice += kstat_cpu(i).cpustat[NICE];
>> + system += kstat_cpu(i).cpustat[SYSTEM];
>> + idle += get_idle_time(i);
>> + iowait += get_iowait_time(i);
>> + irq += kstat_cpu(i).cpustat[IRQ];
>> + softirq += kstat_cpu(i).cpustat[SOFTIRQ];
>> + steal += kstat_cpu(i).cpustat[STEAL];
>> + guest += kstat_cpu(i).cpustat[GUEST];
>> + guest_nice += kstat_cpu(i).cpustat[GUEST_NICE];
>> +
>> + for (j = 0; j< NR_SOFTIRQS; j++) {
>> + unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
>> +
>> + per_softirq_sums[j] += softirq_stat;
>> + sum_softirq += softirq_stat;
>> + }
>> + }
>> + sum += arch_irq_stat();
>> +
>> + seq_printf(p, "cpu %llu %llu %llu %llu %llu %llu %llu %llu %llu "
>> + "%llu\n",
>> + (unsigned long long)cputime64_to_clock_t(user),
>> + (unsigned long long)cputime64_to_clock_t(nice),
>> + (unsigned long long)cputime64_to_clock_t(system),
>> + (unsigned long long)cputime64_to_clock_t(idle),
>> + (unsigned long long)cputime64_to_clock_t(iowait),
>> + (unsigned long long)cputime64_to_clock_t(irq),
>> + (unsigned long long)cputime64_to_clock_t(softirq),
>> + (unsigned long long)cputime64_to_clock_t(steal),
>> + (unsigned long long)cputime64_to_clock_t(guest),
>> + (unsigned long long)cputime64_to_clock_t(guest_nice));
>> + for_each_online_cpu(i) {
>> + /* Copy values here to work around gcc-2.95.3, gcc-2.96 */
>
> GCC 3.2 is the listed current minimum requirement. If this code is being
> revisited this could be cleaned up.
>
>> + user = kstat_cpu(i).cpustat[USER];
>> + nice = kstat_cpu(i).cpustat[NICE];
>> + system = kstat_cpu(i).cpustat[SYSTEM];
>> + idle = get_idle_time(i);
>> + iowait = get_iowait_time(i);
>> + irq = kstat_cpu(i).cpustat[IRQ];
>> + softirq = kstat_cpu(i).cpustat[SOFTIRQ];
>> + steal = kstat_cpu(i).cpustat[STEAL];
>> + guest = kstat_cpu(i).cpustat[GUEST];
>> + guest_nice = kstat_cpu(i).cpustat[GUEST_NICE];
>> + seq_printf(p,
>> + "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu "
>> + "%llu\n",
>> + i,
>> + (unsigned long long)cputime64_to_clock_t(user),
>> + (unsigned long long)cputime64_to_clock_t(nice),
>> + (unsigned long long)cputime64_to_clock_t(system),
>> + (unsigned long long)cputime64_to_clock_t(idle),
>> + (unsigned long long)cputime64_to_clock_t(iowait),
>> + (unsigned long long)cputime64_to_clock_t(irq),
>> + (unsigned long long)cputime64_to_clock_t(softirq),
>> + (unsigned long long)cputime64_to_clock_t(steal),
>> + (unsigned long long)cputime64_to_clock_t(guest),
>> + (unsigned long long)cputime64_to_clock_t(guest_nice));
>> + }
>> + seq_printf(p, "intr %llu", (unsigned long long)sum);
>> +
>> + /* sum again ? it could be updated? */
>> + for_each_irq_nr(j)
>> + seq_printf(p, " %u", kstat_irqs(j));
>> +
>> + seq_printf(p,
>> + "\nctxt %llu\n"
>> + "btime %lu\n"
>> + "processes %lu\n"
>> + "procs_running %lu\n"
>> + "procs_blocked %lu\n",
>> + nr_context_switches(),
>> + (unsigned long)jif,
>> + total_forks,
>> + nr_running(),
>> + nr_iowait());
>> +
>> + seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);
>> +
>> + for (i = 0; i< NR_SOFTIRQS; i++)
>> + seq_printf(p, " %u", per_softirq_sums[i]);
>> + seq_putc(p, '\n');
>> +
>> + return 0;
>> +}
>> +
>> #ifdef CONFIG_CGROUP_CPUACCT
>>
>> /*
>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-11 21:33   ` Paul Turner
@ 2011-11-12 10:29     ` Glauber Costa
  2011-11-15 11:02       ` Paul Turner
  0 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-12 10:29 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, fweisbec, Balbir Singh

On 11/11/2011 07:33 PM, Paul Turner wrote:
> On 11/01/2011 02:19 PM, Glauber Costa wrote:
>> Now that we're providing all cpuacct capability in cpu cgroup,
>> default CPUACCT to n. We still maintain it for compatiblity for
>> anyone that need an independent set of tasks to be managed by it
>> relatively to cpu cgroup, but encourage the use of cpucgroup for that.
>>
>> Proposing schedule of deprecation for 3.5
>>
>
>
> I'd like to see a little more separation beyond 1 linear series here.
>
>
> We're doing the following things
>
> 1. Migrating the existing cpuacct functionality into cpu
> 2. Deprecating cpuacct
> 3. Adding new functionality
>
> I would like to consider (3) separately from 1/2 which we can and should
> accomplish immediately due to the over-head it's currently introducing.
> It seems less than optimal to hinge resolving that on reaching agreement
> for the new bits.
>
> It also helps that the the migrated functionality in (1) is just
> exporting state that is already being maintained by cpu so those changes
> end up being quite small and non-invasive.

So you'd like me to first submit a new series that does not do per-cgroup 
/proc/stat?

> - Paul
>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> CC: Balbir Singh<bsingharora@gmail.com>
>> ---
>> Documentation/feature-removal-schedule.txt | 8 ++++++++
>> init/Kconfig | 1 +
>> 2 files changed, 9 insertions(+), 0 deletions(-)
>>
>> diff --git a/Documentation/feature-removal-schedule.txt
>> b/Documentation/feature-removal-schedule.txt
>> index d5ac362..91165fe 100644
>> --- a/Documentation/feature-removal-schedule.txt
>> +++ b/Documentation/feature-removal-schedule.txt
>> @@ -609,3 +609,11 @@ When: 3.5
>> Why: The iwlagn module has been renamed iwlwifi. The alias will be around
>> for backward compatibility for several cycles and then dropped.
>> Who: Don Fry<donald.h.fry@intel.com>
>> +
>> +----------------------------
>> +
>> +What: cpuacct cgroup
>> +When: 3.5
>> +Why: Same functionality is provided by the CGROUP_SCHED, with a lower
>> + footprint.
>> +Who: Glauber Costa<glommer@parallels.com>
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 31ba0fd..d3d958b 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -626,6 +626,7 @@ config PROC_PID_CPUSET
>>
>> config CGROUP_CPUACCT
>> bool "Simple CPU accounting cgroup subsystem"
>> + default n
>> help
>> Provides a simple Resource Controller for monitoring the
>> total CPU consumed by the tasks in a cgroup.
>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 12/14] Keep number of running processes per-cgroup
  2011-11-01 21:19 ` [PATCH v2 12/14] Keep number of running processes per-cgroup Glauber Costa
@ 2011-11-14 14:42   ` Andrew Wagin
  0 siblings, 0 replies; 50+ messages in thread
From: Andrew Wagin @ 2011-11-14 14:42 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, pjt, fweisbec

> This relies on the fact that the scheduler classes keep the number
> of running processes internally in their entities.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> ---
....

> +#ifdef CONFIG_CGROUP_SCHED
> +               /* root task group has autogrouping, so this doesn't hold */
> +               if  (tg != &root_task_group) {
> +#ifdef CONFIG_FAIR_GROUP_SCHED
> +                       tg_nr_running += cpu_rq(i)->cfs.nr_running;

I think you want to say tg->cfs_rq[i]->nr_running, otherwise it will
show nr_running for root_task_group.

# mount -t cgroup -o cpu xxx /mnt/
# echo $$ > /mnt/1/tasks
# while :; do :; done &
[1] 1679
# while :; do :; done &
[2] 1682
# while :; do :; done &
[3] 1684
# while :; do :; done &
[4] 1686
# while :; do :; done &
[5] 1688
# cat /mnt/1/cpu.proc.stat  | grep procs_running
procs_running 2

> +#endif
> +#ifdef CONFIG_RT_GROUP_SCHED
> +                       tg_nr_running += cpu_rq(i)->rt.rt_nr_running;
> +#endif
> +               } else
> +#endif
> +                       tg_nr_running += cpu_rq(i)->nr_running;
>
>                for (j = 0; j < NR_SOFTIRQS; j++) {
>                        unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
> @@ -9911,7 +9924,7 @@ int cpu_cgroup_proc_stat(struct cgroup *cgrp, struct cftype *cft,
>                tg_nr_switches,
>                (unsigned long)jif,
>                total_forks,
> -               nr_running(),
> +               tg_nr_running,
>                tg_iowait),
>
>        seq_printf(p, "softirq %llu", (unsigned long long)sum_softirq);

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 01/14] trivial: initialize root cgroup's sibling list
  2011-11-11 21:34   ` Paul Turner
@ 2011-11-14 19:44     ` Glauber Costa
  2011-11-14 21:44       ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-14 19:44 UTC (permalink / raw)
  To: Paul Turner
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, fweisbec

On 11/11/2011 07:34 PM, Paul Turner wrote:
> On 11/01/2011 02:19 PM, Glauber Costa wrote:
>> Even though there are no siblings, the list should be
>> initialized not to contain bogus values.
>>
>> Signed-off-by: Glauber Costa<glommer@parallels.com>
>> Acked-by: Paul Menage<paul@paulmenage.org>
>
> Signed-off-by: Paul Turner <pjt@google.com>
>
Since this is pretty trivial, I would like to keep it out of my next 
submission. Would any of you pick this one up individually into your trees?

Thanks


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 01/14] trivial: initialize root cgroup's sibling list
  2011-11-14 19:44     ` Glauber Costa
@ 2011-11-14 21:44       ` Peter Zijlstra
  0 siblings, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2011-11-14 21:44 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Paul Turner, linux-kernel, paul, lizf, daniel.lezcano,
	jbottomley, fweisbec

On Mon, 2011-11-14 at 17:44 -0200, Glauber Costa wrote:
> Since this is pretty trivial, I would like to keep it out of my next 
> submission. Would any of you pick this one individually to your trees?
> 
Done, thanks!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-12 10:29     ` Glauber Costa
@ 2011-11-15 11:02       ` Paul Turner
  2011-11-16 10:21         ` Balbir Singh
  2011-11-25  2:05         ` Li Zefan
  0 siblings, 2 replies; 50+ messages in thread
From: Paul Turner @ 2011-11-15 11:02 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, paul, lizf, daniel.lezcano, a.p.zijlstra,
	jbottomley, fweisbec, Balbir Singh

On Sat, Nov 12, 2011 at 2:29 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 11/11/2011 07:33 PM, Paul Turner wrote:
>>
>> On 11/01/2011 02:19 PM, Glauber Costa wrote:
>>>
>>> Now that we're providing all cpuacct capability in cpu cgroup,
>>> default CPUACCT to n. We still maintain it for compatiblity for
>>> anyone that need an independent set of tasks to be managed by it
>>> relatively to cpu cgroup, but encourage the use of cpucgroup for that.
>>>
>>> Proposing schedule of deprecation for 3.5
>>>
>>
>>
>> I'd like to see a little more separation beyond 1 linear series here.
>>
>>
>> We're doing the following things
>>
>> 1. Migrating the existing cpuacct functionality into cpu
>> 2. Deprecating cpuacct
>> 3. Adding new functionality
>>
>> I would like to consider (3) separately from 1/2 which we can and should
>> accomplish immediately due to the over-head it's currently introducing.
>> It seems less than optimal to hinge resolving that on reaching agreement
>> for the new bits.
>>
>> It also helps that the the migrated functionality in (1) is just
>> exporting state that is already being maintained by cpu so those changes
>> end up being quite small and non-invasive.
>
> So you'd like me to submit a new series, that does not do per-cgroup
> /proc/stat first?
>

Yes, I think there's a fair amount of discussion remaining on
/proc/stat and the right way to integrate it with the cpu subsystem
despite it being a not entirely natural fit.  Something I proposed at
Prague and that we could explore here is the idea of a co-mounted
controller.  In this example it would only be mountable with cpu so
you could always depend on the cpu hierarchy being there; likewise we
can put (jump-labeled) touchpoints within the cpu-subsystem to call
out for updates as appropriate when the co-mount exists.

However, this does not force us to bring all of
the required data-structures into cpu/sched and allows us to have the
config enabled with no memory cost if users choose to mount cpu and
not "cpu-proc".  This is just one proposal and there may be other ways
to do it, but I'm really not keen on stuffing it into cpu.  What do
you think?
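
For concreteness ("cpu_proc" is a made-up name here, not an existing
controller), a co-mount from userspace would look just like today's
multi-controller mounts:

	# mount -t cgroup -o cpu,cpuacct none /cgroup     # what co-mounting looks like today
	# mount -t cgroup -o cpu,cpu_proc none /cgroup    # hypothetical cpu + per-cgroup stats controller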

Note: There's some clean-up to sched.c going on to pull out
data-structures and make it not a monolithic .c-#include beast.  This
will remove some of the symbol visibility requirements that would have
required you to be in cpu/sched.c before.

On the other hand, I don't think much discussion remains for cpuacct,
everyone's pretty unanimous in that they'd like to see it deprecated.
By splitting this up we can close out that quickly while we figure out the
best way to resolve the above.

Thanks!

- Paul

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-15 11:02       ` Paul Turner
@ 2011-11-16 10:21         ` Balbir Singh
  2011-11-16 23:52           ` KAMEZAWA Hiroyuki
  2011-11-25  2:05         ` Li Zefan
  1 sibling, 1 reply; 50+ messages in thread
From: Balbir Singh @ 2011-11-16 10:21 UTC (permalink / raw)
  To: Paul Turner
  Cc: Glauber Costa, linux-kernel, paul, lizf, daniel.lezcano,
	a.p.zijlstra, jbottomley, fweisbec

>
> On the other hand, I don't think much discussion remains for cpuacct,
> everyone's pretty unanimous in that they'd like to see it deprecated.
> By splitting this up we can close out that quickly while we figure out the
> best way to resolve the above.
>

I'd give it a thumbs up, if we can create sched groups and provide
accounting without control - like we can for the memory cgroup today.

Balbir

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-16 10:21         ` Balbir Singh
@ 2011-11-16 23:52           ` KAMEZAWA Hiroyuki
  2011-11-17  2:49             ` Glauber Costa
  0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-16 23:52 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Paul Turner, Glauber Costa, linux-kernel, paul, lizf,
	daniel.lezcano, a.p.zijlstra, jbottomley, fweisbec

On Wed, 16 Nov 2011 15:51:27 +0530
Balbir Singh <bsingharora@gmail.com> wrote:

> >
> > On the other hand, I don't think much discussion remains for cpuacct,
> > everyone's pretty unanimous in that they'd like to see it deprecated.
> > By splitting this up we can close out that quickly while we figure out the
> > best way to resolve the above.
> >
> 
> I'd give it a thumbs up, if we can create sched groups and provide
> accounting without control - like we can for the memory cgroup today.
> 

Isn't it possible?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-16 23:52           ` KAMEZAWA Hiroyuki
@ 2011-11-17  2:49             ` Glauber Costa
  2011-11-17  2:58               ` Balbir Singh
  0 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-17  2:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Balbir Singh, Paul Turner, linux-kernel, paul, lizf,
	daniel.lezcano, a.p.zijlstra, jbottomley, fweisbec

On 11/16/2011 09:52 PM, KAMEZAWA Hiroyuki wrote:
> On Wed, 16 Nov 2011 15:51:27 +0530
> Balbir Singh<bsingharora@gmail.com>  wrote:
>
>>>
>>> On the other hand, I don't think much discussion remains for cpuacct,
>>> everyone's pretty unanimous in that they'd like to see it deprecated.
>>> By splitting this up we can close out that quickly while we figure out the
>>> best way to resolve the above.
>>>
>>
>> I'd give it a thumbs up, if we can create sched groups and provide
>> accounting without control - like we can for the memory cgroup today.
>>
>
> Isn't it possible ?
>
> Thanks,
> -Kame
>
I must say I don't really understand what exactly you propose, and how 
it is different from what we have today.

My take is that you are talking about a single cgroup in which you can 
have the functionality of both cpuacct and cpu, but surrounded by knobs 
that allow you to turn them off individually.

Am I right?

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-17  2:49             ` Glauber Costa
@ 2011-11-17  2:58               ` Balbir Singh
  2011-11-17 15:58                 ` Glauber Costa
  0 siblings, 1 reply; 50+ messages in thread
From: Balbir Singh @ 2011-11-17  2:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: KAMEZAWA Hiroyuki, Paul Turner, linux-kernel, paul, lizf,
	daniel.lezcano, a.p.zijlstra, jbottomley, fweisbec

On Thu, Nov 17, 2011 at 8:19 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 11/16/2011 09:52 PM, KAMEZAWA Hiroyuki wrote:
>>
>> On Wed, 16 Nov 2011 15:51:27 +0530
>> Balbir Singh<bsingharora@gmail.com>  wrote:
>>
>>>>
>>>> On the other hand, I don't think much discussion remains for cpuacct,
>>>> everyone's pretty unanimous in that they'd like to see it deprecated.
>>>> By splitting this up we can close out that quickly while we figure out
>>>> the
>>>> best way to resolve the above.
>>>>
>>>
>>> I'd give it a thumbs up, if we can create sched groups and provide
>>> accounting without control - like we can for the memory cgroup today.
>>>
>>
>> Isn't it possible ?
>>
>> Thanks,
>> -Kame
>>
> I must say I don't really understand what exactly you propose, and how it is
> different from what we have today.
>
> My take is that you are talking about a single cgroup in which you can have
> the functionality of both cpuacct and cpu, but surrounded by knobs that
> allows you to turn them off individually.
>
> Am I right?
>

No, here is what I am asking for.

I don't want CPU control, just accounting, so I create the following groups:

                          a
                        /   \
                       v     v
                       b     c

Today, with the cpu controller, the moment I create a, b and c, they
get default shares, and if I put tasks in them their bandwidth is decided
by those shares. What if I don't want control, but I do want to account
for their time?

Balbir
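
Concretely (a sketch, not from the mail above): with the cpu controller
there is no way to build that tree without also picking up the control
knobs:

	# mount -t cgroup -o cpu none /cgroup
	# mkdir -p /cgroup/a/b /cgroup/a/c
	# cat /cgroup/a/b/cpu.shares
	1024	<- the default share, which starts shaping bandwidth as soon as tasks are placed here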

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-17  2:58               ` Balbir Singh
@ 2011-11-17 15:58                 ` Glauber Costa
  2011-11-21  1:59                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-17 15:58 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Paul Turner, linux-kernel, paul, lizf,
	daniel.lezcano, a.p.zijlstra, jbottomley, fweisbec

On 11/17/2011 12:58 AM, Balbir Singh wrote:
> On Thu, Nov 17, 2011 at 8:19 AM, Glauber Costa<glommer@parallels.com>  wrote:
>> On 11/16/2011 09:52 PM, KAMEZAWA Hiroyuki wrote:
>>>
>>> On Wed, 16 Nov 2011 15:51:27 +0530
>>> Balbir Singh<bsingharora@gmail.com>    wrote:
>>>
>>>>>
>>>>> On the other hand, I don't think much discussion remains for cpuacct,
>>>>> everyone's pretty unanimous in that they'd like to see it deprecated.
>>>>> By splitting this up we can close out that quickly while we figure out
>>>>> the
>>>>> best way to resolve the above.
>>>>>
>>>>
>>>> I'd give it a thumbs up, if we can create sched groups and provide
>>>> accounting without control - like we can for the memory cgroup today.
>>>>
>>>
>>> Isn't it possible ?
>>>
>>> Thanks,
>>> -Kame
>>>
>> I must say I don't really understand what exactly you propose, and how it is
>> different from what we have today.
>>
>> My take is that you are talking about a single cgroup in which you can have
>> the functionality of both cpuacct and cpu, but surrounded by knobs that
>> allows you to turn them off individually.
>>
>> Am I right?
>>
>
> No here is what I am asking for
>
> I don't want CPU control, just accounting, so I create the following groups
>
>                            a
>                         /    \
>                        V    V
>                        b     c
>
> Today, with the cpu controller, the moment I create a, b and c, they
> get default shares and if I put tasks, their b/w is decided by the
> shares, what if I don't want control, but I want to account for their
> time only?
>
> Balbir

I think that if this is really a requirement, cpuacct should stay. I was 
working under the assumption that it was not really an important case - 
so thanks for the clarification. Peter and Paul can chime in here, but I 
think this requirement poses constraints on the cpu cgroup, and 
consequently on the scheduler - both in its current incarnation and in 
whatever comes in the future - that may not be acceptable. What I am 
concerned about is that it might force the scheduler to always test 
whether the grouping has a scheduling effect - and then walk the group 
if it does not, etc. In summary, whether or not we can bundle processes 
together for scheduling purposes, we'll likely need separate data 
structures anyway.

A lot of the code I wrote can be reused to at least make it faster in 
the case in which only the root is mounted - for cpuacct.stat at least.

However, the big question remains: the most expensive operation for 
cpuacct also seems to be the most important one, cpuusage, which was a 
big part of the motivation to bundle them all together. Maybe then Paul's 
co-mounting idea starts to make sense, but it will still be quite slow 
for your usage, in which the groups are clearly different.

I think the best I can come up with right now is to base my work on 
cpuacct - I am fine with that, and it is actually how my first version 
looked - and then think about a way to make cpuusage faster later...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [tip:sched/core] sched, trivial: Initialize root cgroup's sibling list
  2011-11-01 21:19 ` [PATCH v2 01/14] trivial: initialize root cgroup's sibling list Glauber Costa
  2011-11-11 21:34   ` Paul Turner
@ 2011-11-18 23:42   ` tip-bot for Glauber Costa
  1 sibling, 0 replies; 50+ messages in thread
From: tip-bot for Glauber Costa @ 2011-11-18 23:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, glommer, pjt, tglx, paul, mingo

Commit-ID:  f4d6f6c2649c2c47261db4dcc3110d6f22202ea2
Gitweb:     http://git.kernel.org/tip/f4d6f6c2649c2c47261db4dcc3110d6f22202ea2
Author:     Glauber Costa <glommer@parallels.com>
AuthorDate: Tue, 1 Nov 2011 19:19:07 -0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 16 Nov 2011 08:48:22 +0100

sched, trivial: Initialize root cgroup's sibling list

Even though there are no siblings, the list should be
initialized to not contain bogus values.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Acked-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320182360-20043-2-git-send-email-glommer@parallels.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d9d79a4..0df6986 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8275,6 +8275,7 @@ void __init sched_init(void)
 #ifdef CONFIG_CGROUP_SCHED
 	list_add(&root_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&root_task_group.children);
+	INIT_LIST_HEAD(&root_task_group.siblings);
 	autogroup_init(&init_task);
 #endif /* CONFIG_CGROUP_SCHED */
 

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-17 15:58                 ` Glauber Costa
@ 2011-11-21  1:59                   ` KAMEZAWA Hiroyuki
  2011-11-24 13:24                     ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-11-21  1:59 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Balbir Singh, Paul Turner, linux-kernel, paul, lizf,
	daniel.lezcano, a.p.zijlstra, jbottomley, fweisbec

On Thu, 17 Nov 2011 13:58:42 -0200
Glauber Costa <glommer@parallels.com> wrote:

> On 11/17/2011 12:58 AM, Balbir Singh wrote:
> > On Thu, Nov 17, 2011 at 8:19 AM, Glauber Costa<glommer@parallels.com>  wrote:
> >> On 11/16/2011 09:52 PM, KAMEZAWA Hiroyuki wrote:
> >>>
> >>> On Wed, 16 Nov 2011 15:51:27 +0530
> >>> Balbir Singh<bsingharora@gmail.com>    wrote:
> >>>
> >>>>>
> >>>>> On the other hand, I don't think much discussion remains for cpuacct,
> >>>>> everyone's pretty unanimous in that they'd like to see it deprecated.
> >>>>> By splitting this up we can close out that quickly while we figure out
> >>>>> the
> >>>>> best way to resolve the above.
> >>>>>
> >>>>
> >>>> I'd give it a thumbs up, if we can create sched groups and provide
> >>>> accounting without control - like we can for the memory cgroup today.
> >>>>
> >>>
> >>> Isn't it possible ?
> >>>
> >>> Thanks,
> >>> -Kame
> >>>
> >> I must say I don't really understand what exactly you propose, and how it is
> >> different from what we have today.
> >>
> >> My take is that you are talking about a single cgroup in which you can have
> >> the functionality of both cpuacct and cpu, but surrounded by knobs that
> >> allows you to turn them off individually.
> >>
> >> Am I right?
> >>
> >
> > No here is what I am asking for
> >
> > I don't want CPU control, just accounting, so I create the following groups
> >
> >                            a
> >                         /    \
> >                        V    V
> >                        b     c
> >
> > Today, with the cpu controller, the moment I create a, b and c, they
> > get default shares and if I put tasks, their b/w is decided by the
> > shares, what if I don't want control, but I want to account for their
> > time only?
> >
> > Balbir
> 
> I think that if this really a requirement, cpuacct should stay. I was 
> working under the assumption that it was not really an important case - 
> so thanks for the clarification. Peter and Paul can chime in here, but I 
> think that this requirement poses constraints to the cpu cgroup and 
> consequently the scheduler - both in its current incarnation and in what 
> come in the future - that may not be acceptable. What I am concerned 
> about is that it might mandate the scheduler to always test whether or 
> not the grouping has a scheduling effect or not - and then walk the 
> group if it is not, etc. In a summary, if we can or cannot bundle 
> processes together for scheduling purposes, we'll likely need separate 
> data structures anyway.
> 
> A lot of the code I wrote can be reused to at least make it faster in 
> the case in which only the root is mounted - for cpuacct.stat at least.
> 
> However, the big question remains: The most expensive operation for 
> cpuacct also seem to be the most important, cpuusage, which was a big 
> part of the motivation to bundle them all together. Maybe then Paul's 
> co-mounting idea starts to make sense, but it will still be quite slow 
> for your usage, in which the groups are clearly different.
> 
> I think the best I can come up with right now, is to base my work on 
> cpuacct - I am fine with that, and it was actually how my first version 
> looked like - and then think about a way to make cpuusage faster later...
> 

Could you share your analysis of why the cpuacct cgroup is slow?

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-21  1:59                   ` KAMEZAWA Hiroyuki
@ 2011-11-24 13:24                     ` Peter Zijlstra
  2011-11-24 16:07                       ` Glauber Costa
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2011-11-24 13:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Glauber Costa, Balbir Singh, Paul Turner, linux-kernel, paul,
	lizf, daniel.lezcano, jbottomley, fweisbec

On Mon, 2011-11-21 at 10:59 +0900, KAMEZAWA Hiroyuki wrote:
> Could you share your analysis why cpuacct cgroup is slow ?

It adds accounting to all scheduler hot paths, accounting that is mostly
duplicate of accounting already done. It also does another cgroup
hierarchy walk, separate from existing controllers, hitting all cold
cachelines again.

IOW, it sucks chunks and I should never have allowed it to be merged.
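
For context, the extra walk referred to here is cpuacct_charge(), roughly
(paraphrased; may not match the exact source of that era):

	static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
	{
		struct cpuacct *ca;
		int cpu = task_cpu(tsk);

		rcu_read_lock();
		/* a second hierarchy walk, separate from the cpu cgroup's own,
		 * touching otherwise-cold cpuacct cachelines on every charge */
		for (ca = task_ca(tsk); ca; ca = ca->parent)
			*per_cpu_ptr(ca->cpuusage, cpu) += cputime;
		rcu_read_unlock();
	}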

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-24 13:24                     ` Peter Zijlstra
@ 2011-11-24 16:07                       ` Glauber Costa
  2011-11-24 16:29                         ` Peter Zijlstra
  0 siblings, 1 reply; 50+ messages in thread
From: Glauber Costa @ 2011-11-24 16:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Paul Turner, linux-kernel, paul,
	lizf, daniel.lezcano, jbottomley, fweisbec

On 11/24/2011 11:24 AM, Peter Zijlstra wrote:
> On Mon, 2011-11-21 at 10:59 +0900, KAMEZAWA Hiroyuki wrote:
>> Could you share your analysis why cpuacct cgroup is slow ?
>
> It adds accounting to all scheduler hot paths, accounting that is mostly
> duplicate of accounting already done. It also does another cgroup
> hierarchy walk, separate from existing controllers, hitting all cold
> cachelines again.
>
> IOW, it sucks chunks and I should never have allowed it to be merged.

OTOH, if the use case for it includes grouping processes in the cpu and 
cpuacct cgroups independently of one another - which, I've just learned, 
it apparently does - there isn't much we can do except try to make it 
cheaper.

The direction I am going right now wrt this is to clean this up and 
reduce its impact when the root cgroup is the only one active. All other 
users are actually using the functionality, so let them pay...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-24 16:07                       ` Glauber Costa
@ 2011-11-24 16:29                         ` Peter Zijlstra
  2011-11-24 16:38                           ` Glauber Costa
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2011-11-24 16:29 UTC (permalink / raw)
  To: Glauber Costa
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Paul Turner, linux-kernel, paul,
	lizf, daniel.lezcano, jbottomley, fweisbec

On Thu, 2011-11-24 at 14:07 -0200, Glauber Costa wrote:
> OTOH, if the use case for it includes separating processes for the cpu 
> and cpuacct cgroups in an independent manner - which apparently it does, 
> I've just learned, there isn't much we can do except try to make it cheaper.

Yeah it allows that, but is that really useful? If we buy that argument,
shouldn't we split up controllers to be as minimal as possible so that
you get as great a number of independent cgroups as possible?

That way lies madness if you ask me. The two biggest controllers we
currently have are cpu and memcg, and they aren't as orthogonal as you
might think; see how cpusets has both a task-affinity and a node-affinity
side.

The more comprehensive these controllers become, the greater also the
overlap in functionality and thus a reduction in separation.

Really, what is the killer case for separating all this nonsense? And
no, 'because $ustomer wants it' doesn't count.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-24 16:29                         ` Peter Zijlstra
@ 2011-11-24 16:38                           ` Glauber Costa
  2011-11-24 16:58                             ` Peter Zijlstra
  2011-11-25  5:38                             ` Balbir Singh
  0 siblings, 2 replies; 50+ messages in thread
From: Glauber Costa @ 2011-11-24 16:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Paul Turner, linux-kernel, paul,
	lizf, daniel.lezcano, jbottomley, fweisbec

On 11/24/2011 02:29 PM, Peter Zijlstra wrote:
> On Thu, 2011-11-24 at 14:07 -0200, Glauber Costa wrote:
>> OTOH, if the use case for it includes separating processes for the cpu
>> and cpuacct cgroups in an independent manner - which apparently it does,
>> I've just learned, there isn't much we can do except try to make it cheaper.
>
> Yeah it allows that, but is that really useful? If we buy that argument
> shouldn't we split up controllers to be as minimal as possible to that
> you get as great a number of independent cgroups as possible?
>
> That way lies madness if you ask me. The two biggest controllers we
> currently have are cpu and memcg, and they aren't as orthogonal as you
> might think; see how cpusets has both a task-affinity and a node-affinity
> side.
>
> The more comprehensive these controllers become, the greater also the
> overlap in functionality and thus a reduction in separation.
>
> Really, what is the killer case for separating all this nonsense? And
> no: 'Because $ustomer wants it', doesn't count.

It's hard for me to say that, since I come from a virtualization 
background: for us, a single cgroup would do just fine: even the 
division between mem and cpu is not needed. However, I've been learning 
recently that the use cases for that are quite diverse. So I'll have to 
leave the answer to Balbir, and other interested parties.

Furthermore, what I need to implement can be done with either one: 
I am really just bouncing between the two implementations trying to find 
a common ground... (*)

However, now that there are users for it, if this use case is really 
important, it gets harder to change it. That's more like a lesson for 
the future, for the new cgroups being proposed.

But even here, if we start adopting the policy of "cgroups must have 
zero impact for the mounted-but-not-used case", what's so wrong with having 
many small ones if their purposes differ? (more of a rhetorical question)

* I wonder if the first part of the patches, which changes kstat to an 
array instead of a u64, could be merged in the meantime? I now have 
both a task_group and a cpuacct implementation, and they both depend on it 
somehow. Peter? Paul?
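
For readers who have not seen patch 02/14, a rough before/after of the
change being referred to; the field and enum names below are approximate,
not the exact ones from the patch.

#include <stdint.h>

/* Before: one named field per statistic. */
struct kstat_named {
    uint64_t user, nice, system, idle, iowait, irq, softirq, steal, guest;
};

/* After: one array indexed by an enum, so per-cgroup code can copy, sum
 * or export all counters with a single loop instead of naming each field. */
enum cpu_usage_stat {
    STAT_USER, STAT_NICE, STAT_SYSTEM, STAT_IDLE, STAT_IOWAIT,
    STAT_IRQ, STAT_SOFTIRQ, STAT_STEAL, STAT_GUEST, NR_STATS
};

struct kstat_array {
    uint64_t cpustat[NR_STATS];
};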

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-24 16:38                           ` Glauber Costa
@ 2011-11-24 16:58                             ` Peter Zijlstra
  2011-11-25  5:38                             ` Balbir Singh
  1 sibling, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2011-11-24 16:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Paul Turner, linux-kernel, paul,
	lizf, daniel.lezcano, jbottomley, fweisbec

On Thu, 2011-11-24 at 14:38 -0200, Glauber Costa wrote:
> It's hard for me to say that, since I come from a virtualization 
> background: for us, a single cgroup would do just fine: even the 
> division between mem and cpu is not needed. However, I've been learning 
> recently that the use cases for that are quite diverse. So I'll have to 
> leave the answer to Balbir, and other interested parties. 

Hehe, I'm the one running a CGROUP=n kernel, so what do I know. It's just
that all this stuff hurts my brain.

Maybe we should just add functionality that requires controllers to be
co-mounted, and this extra functionality will then make people co-mount,
since, you know, features... :-)

I know I've wanted to do that several times; so far I've always resisted
because it looked like pain, but hey, if I can cause more pain somewhere
else and my life gets easier in the long run, I might change my mind ;-)

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-15 11:02       ` Paul Turner
  2011-11-16 10:21         ` Balbir Singh
@ 2011-11-25  2:05         ` Li Zefan
  2011-11-25 10:09           ` Peter Zijlstra
  2011-11-26 13:07           ` Paul Turner
  1 sibling, 2 replies; 50+ messages in thread
From: Li Zefan @ 2011-11-25  2:05 UTC (permalink / raw)
  To: Paul Turner
  Cc: Glauber Costa, linux-kernel, paul, daniel.lezcano, a.p.zijlstra,
	jbottomley, fweisbec, Balbir Singh

> despite it being a not entirely natural fit.  Something I proposed at
> Prague and that we could explore here is the idea of a co-mounted
> controller.  In this example it would only be mountable with cpu so
> you could always depend on the cpu hierarchy being there; likewise we
> can put (jump-labeled) touchpoints within the cpu-subsystem to call
> out for updates as appropriate when the co-mount exists.
> 

IIUC, this co-mounting idea is something I implemented years ago:

https://lkml.org/lkml/2008/6/18/389

The use case and the reason it was rejected:

https://lkml.org/lkml/2008/7/1/97

--
Li Zefan

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-24 16:38                           ` Glauber Costa
  2011-11-24 16:58                             ` Peter Zijlstra
@ 2011-11-25  5:38                             ` Balbir Singh
  2011-11-25 10:19                               ` Peter Zijlstra
  1 sibling, 1 reply; 50+ messages in thread
From: Balbir Singh @ 2011-11-25  5:38 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Peter Zijlstra, KAMEZAWA Hiroyuki, Paul Turner, linux-kernel,
	paul, lizf, daniel.lezcano, jbottomley, fweisbec

On Thu, Nov 24, 2011 at 10:08 PM, Glauber Costa <glommer@parallels.com> wrote:
> It's hard for me to say that, since I come from a virtualization background:
> for us, a single cgroup would do just fine: even the division between mem
> and cpu is not needed. However, I've been learning recently that the use
> cases for that are quite diverse. So I'll have to leave the answer to
> Balbir, and other interested parties.
>

It is not about $customer. I am OK with a design that allows
accounting independent of control. Put it another way: when I look at
cgroups, I see the following functionality

1. Accounting and feedback
2. Control

Why do 1 and 2 have to co-exist? A good case would be that we might
need just stats and might want to implement control based on 1. But if
I have to do both 1 and 2 together, how do we decide on control
values?

Balbir

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-25  2:05         ` Li Zefan
@ 2011-11-25 10:09           ` Peter Zijlstra
  2011-11-26 13:07           ` Paul Turner
  1 sibling, 0 replies; 50+ messages in thread
From: Peter Zijlstra @ 2011-11-25 10:09 UTC (permalink / raw)
  To: Li Zefan
  Cc: Paul Turner, Glauber Costa, linux-kernel, paul, daniel.lezcano,
	jbottomley, fweisbec, Balbir Singh

On Fri, 2011-11-25 at 10:05 +0800, Li Zefan wrote:
> > despite it being a not entirely natural fit.  Something I proposed at
> > Prague and that we could explore here is the idea of a co-mounted
> > controller.  In this example it would only be mountable with cpu so
> > you could always depend on the cpu hierarchy being there; likewise we
> > can put (jump-labeled) touchpoints within the cpu-subsystem to call
> > out for updates as appropriate when the co-mount exists.
> > 
> 
> IIUC, this co-mounting idea is something I implemented years ago:
> 
> https://lkml.org/lkml/2008/6/18/389
> 
> The use case and the reason it was rejected:
> 
> https://lkml.org/lkml/2008/7/1/97

"This allows one subsystem to require that it only be mounted when some
other subsystems are also present in the proposed hierarchy."

That's not exactly what I meant. What I meant was that if you, say,
co-mount cpu and cpuset, the combined mount provides more features/better
performance than if you don't, which then provides a natural incentive to
actually co-mount the stuff.

Your patch looks like it forces the co-mount.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-25  5:38                             ` Balbir Singh
@ 2011-11-25 10:19                               ` Peter Zijlstra
  2011-11-26 13:18                                 ` Paul Turner
  0 siblings, 1 reply; 50+ messages in thread
From: Peter Zijlstra @ 2011-11-25 10:19 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Glauber Costa, KAMEZAWA Hiroyuki, Paul Turner, linux-kernel,
	paul, lizf, daniel.lezcano, jbottomley, fweisbec

On Fri, 2011-11-25 at 11:08 +0530, Balbir Singh wrote:
> It is not about $customer. I am OK with a design that allows
> accounting independent of control. Put it another way: when I look at
> cgroups, I see the following functionality
> 
> 1. Accounting and feedback
> 2. Control
> 
> Why do 1 and 2 have to co-exist? A good case would be that we might
> need just stats and might want to implement control based on 1.

I would say that 2 always requires 1 (provided they are of course on the
same subject), for the very simple reason that you need to know the
current state (as provided by 1) to control it (2).

Therefore separating them leads to useless duplication.

> But if
> I have to do both 1 and 2 together, how do we decide on control
> values? 

Uh, what?

What was not answered is, is there a sane reason to have both on
different hierarchies? I think the whole different hierarchy per
controller thing is one of the biggest trainwrecks of cgroups.

It allows for great confusion, but I haven't yet seen an up-side to it.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-25  2:05         ` Li Zefan
  2011-11-25 10:09           ` Peter Zijlstra
@ 2011-11-26 13:07           ` Paul Turner
  1 sibling, 0 replies; 50+ messages in thread
From: Paul Turner @ 2011-11-26 13:07 UTC (permalink / raw)
  To: Li Zefan
  Cc: Glauber Costa, linux-kernel, paul, daniel.lezcano, a.p.zijlstra,
	jbottomley, fweisbec, Balbir Singh

On Thu, Nov 24, 2011 at 6:05 PM, Li Zefan <lizf@cn.fujitsu.com> wrote:
>> despite it being a not entirely natural fit.  Something I proposed at
>> Prague and that we could explore here is the idea of a co-mounted
>> controller.  In this example it would only be mountable with cpu so
>> you could always depend on the cpu hierarchy being there; likewise we
>> can put (jump-labeled) touchpoints within the cpu-subsystem to call
>> out for updates as appropriate when the co-mount exists.
>>
>
> IIUC, this co-mounting idea is something I implemented years ago:
>
> https://lkml.org/lkml/2008/6/18/389
>
> The use case and the reason it was rejected:
>
> https://lkml.org/lkml/2008/7/1/97
>

Rejection is a bit of a strong statement -- the idea seemed amenable
enough but lacked a strong use-case.  That said, taking a deep look at some
of what Glauber is trying to do in this series, I don't think it's
something that would help here.

For this discussion, the motivation for a co-mount would be to
piggy-back on the cpu subsystem's own hierarchy walks to reduce
overhead.  However, this is not structured in a way that can take
advantage of that, and, looking at what Glauber is attempting to
collect, it's not clear that it can be.

I think this moves the discussion towards whether we should consider
deprecating some of the exported fields (namely usage and
usage_per_cpu) from cpuacct, instead of the entire controller as we had
initially desired.  This would allow cpuacct to exist with a much
lower overhead, especially within the context-switch path.
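
As a rough illustration of the piggy-backing idea (a user-space model with
invented names, not kernel code): the extra accounting hangs off state the
cpu controller already walks, behind a flag that is set only when the
co-mount exists, so no second hierarchy walk is added.

#include <stdbool.h>
#include <stdint.h>

struct tg_model {
    uint64_t acct_usage;       /* extra stat updated on the existing walk */
    struct tg_model *parent;   /* NULL at the root */
};

/* Set at mount time when the accounting piece is co-mounted with cpu;
 * stands in for a jump-labeled touchpoint inside the cpu subsystem. */
static bool comounted_accounting;

void existing_cpu_walk(struct tg_model *tg, uint64_t delta)
{
    for (; tg; tg = tg->parent) {
        /* ... per-level work the cpu controller already does ... */
        if (comounted_accounting)      /* touchpoint: no second walk added */
            tg->acct_usage += delta;
    }
}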

- Paul

> --
> Li Zefan
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-25 10:19                               ` Peter Zijlstra
@ 2011-11-26 13:18                                 ` Paul Turner
  2011-11-28  8:29                                   ` Balbir Singh
  0 siblings, 1 reply; 50+ messages in thread
From: Paul Turner @ 2011-11-26 13:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Balbir Singh, Glauber Costa, KAMEZAWA Hiroyuki, linux-kernel,
	paul, lizf, daniel.lezcano, jbottomley, fweisbec

On Fri, Nov 25, 2011 at 2:19 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Fri, 2011-11-25 at 11:08 +0530, Balbir Singh wrote:
>> It is not about $customer. I am OK with a design that allows
>> accounting independent of control. Put it another way: when I look at
>> cgroups, I see the following functionality
>>
>> 1. Accounting and feedback
>> 2. Control
>>
>> Why do 1 and 2 have to co-exist? A good case would be that we might
>> need just stats and might want to implement control based on 1.
>
> I would say that 2 always requires 1 (provided they are of course on the
> same subject), for the very simple reason that you need to know the
> current state (as provided by 1) to control it (2).
>
> Therefore separating them leads to useless duplication.

Almost all of the stats that are coming in here are completely useless
for control and don't belong in the cpu controller.

While it's not as pure, we really have two tiers of statistics:

1) Accounting needed for control
2) Everything else

What causes so much overhead for cpuacct currently is that it tries
to maintain a separate accounting of usage.  If this were removed from
the controller then it would be *much* cheaper, even with all the extra
stats added here.

This series adds a lot of statistics in the second category; they
really have no common touch points on the cpu controller path.  That
second category could be re-worded as 'accounting how scheduled time
is spent', as opposed to the actual accounting of scheduling decisions.
I would really rather not see anything in that category weighing down
the cpu controller any more than it already is.
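
To make the two tiers concrete, a toy model (invented names, not kernel
code): tier-1 numbers are charged on the hot path because control decisions
depend on them, while tier-2 "how time is spent" numbers can be bumped at
tick granularity, off the context-switch path.

#include <stdint.h>

struct group_model {
    /* Tier 1: state the scheduler must keep anyway to act on its policy. */
    uint64_t runtime_for_control;

    /* Tier 2: "how scheduled time is spent" buckets, only read by users. */
    uint64_t user_time, system_time;
};

/* Hot path (every runtime accumulation / context switch): tier 1 only. */
void account_for_control(struct group_model *g, uint64_t delta)
{
    g->runtime_for_control += delta;
}

/* Tick path (HZ times a second): cheap even with many extra statistics,
 * because it stays off the context-switch path. */
void account_tick(struct group_model *g, int user_mode, uint64_t tick)
{
    if (user_mode)
        g->user_time += tick;
    else
        g->system_time += tick;
}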

- Paul
>
>> But if
>> I have to do both 1 and 2 together, how do we decide on control
>> values?
>
> Uh, what?
>
> What was not answered is, is there a sane reason to have both on
> different hierarchies? I think the whole different hierarchy per
> controller thing is one of the biggest trainwrecks of cgroups.
>
> It allows for great confusion, but I haven't yet seen an up-side to it.
>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v2 14/14] Change CPUACCT to default n
  2011-11-26 13:18                                 ` Paul Turner
@ 2011-11-28  8:29                                   ` Balbir Singh
  0 siblings, 0 replies; 50+ messages in thread
From: Balbir Singh @ 2011-11-28  8:29 UTC (permalink / raw)
  To: Paul Turner
  Cc: Peter Zijlstra, Glauber Costa, KAMEZAWA Hiroyuki, linux-kernel,
	paul, lizf, daniel.lezcano, jbottomley, fweisbec

On Sat, Nov 26, 2011 at 6:48 PM, Paul Turner <pjt@google.com> wrote:
> On Fri, Nov 25, 2011 at 2:19 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>> On Fri, 2011-11-25 at 11:08 +0530, Balbir Singh wrote:
>>> It is not about $customer. I am OK with a design that allows
>>> accounting independent of control. Put it another way: when I look at
>>> cgroups, I see the following functionality
>>>
>>> 1. Accounting and feedback
>>> 2. Control
>>>
>>> Why do 1 and 2 have to co-exist? A good case would be that we might
>>> need just stats and might want to implement control based on 1.
>>
>> I would say that 2 always requires 1 (provided they are of course on the
>> same subject), for the very simple reason that you need to know the
>> current state (as provided by 1) to control it (2).
>>
>> Therefore separating them leads to useless duplication.
>
> Almost all of the stats that are coming in here are completely useless
> for control and don't belong in the cpu controller.
>
> While it's not as pure, we really have two tiers of statistics:
>
> 1) Accounting needed for control
> 2) Everything else
>
> What causes so much overhead for cpuacct currently is that it tries
> to maintain a separate accounting of usage.  If this were removed from
> the controller then it would be *much* cheaper, even with all the extra
> stats added here.
>

Sure, if we can integrate the stats without control, I am all for it :)

Balbir Singh

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2011-11-28  8:29 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-01 21:19 [PATCH v2 00/14] per-cgroup /proc/stat Glauber Costa
2011-11-01 21:19 ` [PATCH v2 01/14] trivial: initialize root cgroup's sibling list Glauber Costa
2011-11-11 21:34   ` Paul Turner
2011-11-14 19:44     ` Glauber Costa
2011-11-14 21:44       ` Peter Zijlstra
2011-11-18 23:42   ` [tip:sched/core] sched, trivial: Initialize " tip-bot for Glauber Costa
2011-11-01 21:19 ` [PATCH v2 02/14] Change cpustat fields to an array Glauber Costa
2011-11-01 21:19 ` [PATCH v2 03/14] Move /proc/stat logic inside sched.c Glauber Costa
2011-11-12  1:35   ` Paul Turner
2011-11-12 10:27     ` Glauber Costa
2011-11-01 21:19 ` [PATCH v2 04/14] split kernel stat in two Glauber Costa
2011-11-01 21:19 ` [PATCH v2 05/14] Display /proc/stat information per cgroup Glauber Costa
2011-11-01 21:19 ` [PATCH v2 06/14] Make total_forks per-cgroup Glauber Costa
2011-11-01 21:19 ` [PATCH v2 07/14] per-cgroup boot time Glauber Costa
2011-11-01 21:19 ` [PATCH v2 08/14] Report steal time for cgroup Glauber Costa
2011-11-01 21:19 ` [PATCH v2 09/14] Keep nr_iowait per cgroup Glauber Costa
2011-11-10 10:27   ` Andrew Wagin
2011-11-01 21:19 ` [PATCH v2 10/14] Keep number of context switches per-cgroup Glauber Costa
2011-11-01 21:19 ` [PATCH v2 11/14] provide a version of cpuacct statistics inside cpu cgroup Glauber Costa
2011-11-01 21:19 ` [PATCH v2 12/14] Keep number of running processes per-cgroup Glauber Costa
2011-11-14 14:42   ` Andrew Wagin
2011-11-01 21:19 ` [PATCH v2 13/14] provide a version of cpuusage statistics inside cpu cgroup Glauber Costa
2011-11-09 11:51   ` Andrew Wagin
2011-11-09 11:58     ` Glauber Costa
2011-11-09 14:18       ` Andrew Wagin
2011-11-09 15:30         ` Peter Zijlstra
2011-11-09 16:51         ` Glauber Costa
2011-11-10  8:59           ` Andrew Wagin
2011-11-01 21:19 ` [PATCH v2 14/14] Change CPUACCT to default n Glauber Costa
2011-11-11 21:33   ` Paul Turner
2011-11-12 10:29     ` Glauber Costa
2011-11-15 11:02       ` Paul Turner
2011-11-16 10:21         ` Balbir Singh
2011-11-16 23:52           ` KAMEZAWA Hiroyuki
2011-11-17  2:49             ` Glauber Costa
2011-11-17  2:58               ` Balbir Singh
2011-11-17 15:58                 ` Glauber Costa
2011-11-21  1:59                   ` KAMEZAWA Hiroyuki
2011-11-24 13:24                     ` Peter Zijlstra
2011-11-24 16:07                       ` Glauber Costa
2011-11-24 16:29                         ` Peter Zijlstra
2011-11-24 16:38                           ` Glauber Costa
2011-11-24 16:58                             ` Peter Zijlstra
2011-11-25  5:38                             ` Balbir Singh
2011-11-25 10:19                               ` Peter Zijlstra
2011-11-26 13:18                                 ` Paul Turner
2011-11-28  8:29                                   ` Balbir Singh
2011-11-25  2:05         ` Li Zefan
2011-11-25 10:09           ` Peter Zijlstra
2011-11-26 13:07           ` Paul Turner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).