* [PATCH v5 00/11] per-cgroup cpu-stat
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra, Paul Turner

[ update: I thought I had posted this already before leaving for the holidays.
  However, now that I am checking for replies, I can't find either the replies
  or the original mail in my inboxes or archives. I am reposting it to be safe,
  so sorry if you end up getting this twice ]

Hi all,

This is an attempt to provide userspace with enough information to reconstruct
a per-container version of files like "/proc/stat". In particular, we are
interested in knowing the per-cgroup slices of user time, system time, wait
time, number of processes, and a variety of other statistics.

This task is made more complicated by the fact that multiple controllers are
involved in collecting those statistics: cpu and cpuacct. So the first thing I
am doing here is resurrecting Tejun's patches that aim at deprecating cpuacct.

This is one of the major differences from earlier attempts: all data is provided
by the cpu controller, resulting in greater simplicity.

This also tries to hook into the existing scheduler hierarchy walks instead of
providing new ones.
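
To give an idea of the intended consumer, here is a minimal userspace sketch
(not part of the series) that reads the per-cgroup "user"/"system" pairs from
cpuacct.stat, which the cpu controller serves after patch 3. The mount point
and group name are assumptions made only for this example.

/*
 * Hedged sketch: print per-cgroup user/system time in seconds, roughly
 * the per-container analogue of the first fields of /proc/stat.
 * Assumes the cpu controller is mounted at /sys/fs/cgroup/cpu and that
 * a group named "mygroup" exists; both are illustrative.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/cpu/mygroup/cpuacct.stat";
	char key[32];
	unsigned long long ticks;
	long hz = sysconf(_SC_CLK_TCK);	/* values are reported in clock ticks */
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	/* format: one "<name> <value>" pair per line, e.g. "user 1234" */
	while (fscanf(f, "%31s %llu", key, &ticks) == 2)
		printf("%s: %.2f s\n", key, (double)ticks / hz);
	fclose(f);
	return 0;
}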

Glauber Costa (7):
  don't call cpuacct_charge in stop_task.c
  sched: adjust exec_clock to use it as cpu usage metric
  cpuacct: don't actually do anything.
  account guest time per-cgroup as well.
  record per-cgroup number of context switches
  sched: change nr_context_switches calculation.
  sched: introduce cgroup file stat_percpu

Peter Zijlstra (1):
  sched: Push put_prev_task() into pick_next_task()

Tejun Heo (3):
  cgroup: implement CFTYPE_NO_PREFIX
  cgroup, sched: let cpu serve the same files as cpuacct
  cgroup, sched: deprecate cpuacct

 include/linux/cgroup.h   |   1 +
 include/linux/sched.h    |   8 +-
 init/Kconfig             |  11 +-
 kernel/cgroup.c          |  57 ++++++-
 kernel/sched/core.c      | 375 +++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/cputime.c   |  28 +++-
 kernel/sched/fair.c      |  38 ++++-
 kernel/sched/idle_task.c |   9 +-
 kernel/sched/rt.c        |  42 ++++--
 kernel/sched/sched.h     |  18 ++-
 kernel/sched/stop_task.c |   8 +-
 11 files changed, 551 insertions(+), 44 deletions(-)

-- 
1.7.11.7


* [PATCH v5 01/11] don't call cpuacct_charge in stop_task.c
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Glauber Costa, Mike Galbraith, Thomas Gleixner

Commit 8f618968 changed stop_task to do the same bookkeeping as the
other classes. However, the call to cpuacct_charge() doesn't affect
the scheduler decisions at all, and doesn't need to be moved over.

Moreover, being a kthread, the migration thread won't belong to any
cgroup anyway, rendering this call quite useless.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Mike Galbraith <mgalbraith@suse.de>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/stop_task.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index da5eb5b..fda1cbe 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -68,7 +68,6 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock_task;
-	cpuacct_charge(curr, delta_exec);
 }
 
 static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
-- 
1.7.11.7


* [PATCH v5 02/11] cgroup: implement CFTYPE_NO_PREFIX
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Peter Zijlstra, Glauber Costa

From: Tejun Heo <tj@kernel.org>

When cgroup files are created, cgroup core automatically prepends the
name of the subsystem as prefix.  This patch adds CFTYPE_NO_PREFIX
which disables the automatic prefix.

This will be used to deprecate cpuacct which will make cpu create and
serve the cpuacct files.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Glauber Costa <glommer@parallels.com>
---
 include/linux/cgroup.h | 1 +
 kernel/cgroup.c        | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7d73905..7d193f9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -277,6 +277,7 @@ struct cgroup_map_cb {
 /* cftype->flags */
 #define CFTYPE_ONLY_ON_ROOT	(1U << 0)	/* only create on root cg */
 #define CFTYPE_NOT_ON_ROOT	(1U << 1)	/* don't create on root cg */
+#define CFTYPE_NO_PREFIX	(1U << 2)	/* skip subsys prefix */
 
 #define MAX_CFTYPE_NAME		64
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4855892..2f98398 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2723,7 +2723,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys,
 
 	simple_xattrs_init(&cft->xattrs);
 
-	if (subsys && !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
+	if (subsys && !(cft->flags & CFTYPE_NO_PREFIX) &&
+	    !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
 		strcpy(name, subsys->name);
 		strcat(name, ".");
 	}
-- 
1.7.11.7
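
A minimal sketch of how a controller declares a file with the new flag
(patch 3 in this series does exactly this for the cpuacct.* files): with
CFTYPE_NO_PREFIX the entry below is created as "cpuacct.usage", whereas
without the flag cgroup core would create it as "cpu.cpuacct.usage". The
array name is illustrative; the read handler is the one added later in
patch 3.

/* illustrative fragment, mirroring the usage added in patch 3 */
static struct cftype cpu_compat_files[] = {
	{
		.name = "cpuacct.usage",
		.flags = CFTYPE_NO_PREFIX,	/* keep legacy name, no "cpu." prefix */
		.read_u64 = cpucg_cpuusage_read,
	},
	{ }	/* terminate */
};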


* [PATCH v5 03/11] cgroup, sched: let cpu serve the same files as cpuacct
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Glauber Costa, Peter Zijlstra, Michal Hocko,
	Kay Sievers, Lennart Poettering, Dave Jones, Ben Hutchings

From: Tejun Heo <tj@kernel.org>

cpuacct being on a separate hierarchy is one of the main cgroup
related complaints from scheduler side and the consensus seems to be

* Allowing cpuacct to be a separate controller was a mistake.  In
  general multiple controllers on the same type of resource should be
  avoided, especially accounting-only ones.

* Statistics provided by cpuacct are useful and should instead be
  served by cpu.

This patch makes cpu maintain and serve all cpuacct.* files and makes
cgroup core ignore cpuacct if it's co-mounted with cpu.  This is a
step in deprecating cpuacct.  The next patch will allow disabling or
dropping cpuacct without affecting userland too much.

Note that this creates some discrepancies in /proc/cgroups and
/proc/PID/cgroup.  The co-mounted cpuacct won't be reflected correctly
there.  cpuacct will eventually be removed completely, probably except
for the statistics filenames, and I'd like to keep the amount of
compatibility hackery to a minimum.

The cpu statistics implementation isn't optimized in any way.  It's
mostly a verbatim copy from cpuacct.  The goal is to allow quick
disabling and removal of CONFIG_CGROUP_CPUACCT and to create a base on
top of which cpu can implement proper optimizations.

[ glommer: don't call *_charge in stop_task.c ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Paul Turner <pjt@google.com>
---
 kernel/cgroup.c        |  13 ++++
 kernel/sched/core.c    | 173 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/cputime.c |  18 ++++-
 kernel/sched/fair.c    |   1 +
 kernel/sched/rt.c      |   1 +
 kernel/sched/sched.h   |   7 ++
 6 files changed, 212 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2f98398..0750669d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1248,6 +1248,19 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 	/* Consistency checks */
 
 	/*
+	 * cpuacct is deprecated and cpu will serve the same stat files.
+	 * If co-mount with cpu is requested, ignore cpuacct.  Note that
+	 * this creates some discrepancies in /proc/cgroups and
+	 * /proc/PID/cgroup.
+	 *
+	 * https://lkml.org/lkml/2012/9/13/542
+	 */
+#if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+	if ((opts->subsys_bits & (1 << cpu_cgroup_subsys_id)) &&
+	    (opts->subsys_bits & (1 << cpuacct_subsys_id)))
+		opts->subsys_bits &= ~(1 << cpuacct_subsys_id);
+#endif
+	/*
 	 * Option noprefix was introduced just for backward compatibility
 	 * with the old cpuset, so we allow noprefix only if mounting just
 	 * the cpuset subsystem.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 257002c..6516694 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6811,6 +6811,7 @@ int in_sched_functions(unsigned long addr)
 #ifdef CONFIG_CGROUP_SCHED
 struct task_group root_task_group;
 LIST_HEAD(task_groups);
+static DEFINE_PER_CPU(u64, root_tg_cpuusage);
 #endif
 
 DECLARE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
@@ -6869,6 +6870,8 @@ void __init sched_init(void)
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 #ifdef CONFIG_CGROUP_SCHED
+	root_task_group.cpustat = &kernel_cpustat;
+	root_task_group.cpuusage = &root_tg_cpuusage;
 	list_add(&root_task_group.list, &task_groups);
 	INIT_LIST_HEAD(&root_task_group.children);
 	INIT_LIST_HEAD(&root_task_group.siblings);
@@ -7152,6 +7155,8 @@ static void free_sched_group(struct task_group *tg)
 	free_fair_sched_group(tg);
 	free_rt_sched_group(tg);
 	autogroup_free(tg);
+	free_percpu(tg->cpuusage);
+	free_percpu(tg->cpustat);
 	kfree(tg);
 }
 
@@ -7165,6 +7170,11 @@ struct task_group *sched_create_group(struct task_group *parent)
 	if (!tg)
 		return ERR_PTR(-ENOMEM);
 
+	tg->cpuusage = alloc_percpu(u64);
+	tg->cpustat = alloc_percpu(struct kernel_cpustat);
+	if (!tg->cpuusage || !tg->cpustat)
+		goto err;
+
 	if (!alloc_fair_sched_group(tg, parent))
 		goto err;
 
@@ -7256,6 +7266,24 @@ void sched_move_task(struct task_struct *tsk)
 
 	task_rq_unlock(rq, tsk, &flags);
 }
+
+void task_group_charge(struct task_struct *tsk, u64 cputime)
+{
+	struct task_group *tg;
+	int cpu = task_cpu(tsk);
+
+	rcu_read_lock();
+
+	tg = container_of(task_subsys_state(tsk, cpu_cgroup_subsys_id),
+			  struct task_group, css);
+
+	for (; tg; tg = tg->parent) {
+		u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+		*cpuusage += cputime;
+	}
+
+	rcu_read_unlock();
+}
 #endif /* CONFIG_CGROUP_SCHED */
 
 #if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -7612,6 +7640,134 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup *old_cgrp,
 	sched_move_task(task);
 }
 
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+	u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+	u64 data;
+
+#ifndef CONFIG_64BIT
+	/*
+	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
+	 */
+	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	data = *cpuusage;
+	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+	data = *cpuusage;
+#endif
+
+	return data;
+}
+
+static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
+{
+	u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+
+#ifndef CONFIG_64BIT
+	/*
+	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+	 */
+	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	*cpuusage = val;
+	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+	*cpuusage = val;
+#endif
+}
+
+/* return total cpu usage (in nanoseconds) of a group */
+static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct task_group *tg;
+	u64 totalcpuusage = 0;
+	int i;
+
+	tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+			  struct task_group, css);
+
+	for_each_present_cpu(i)
+		totalcpuusage += task_group_cpuusage_read(tg, i);
+
+	return totalcpuusage;
+}
+
+static int cpucg_cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
+				u64 reset)
+{
+	struct task_group *tg;
+	int err = 0;
+	int i;
+
+	tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+			  struct task_group, css);
+
+	if (reset) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	for_each_present_cpu(i)
+		task_group_cpuusage_write(tg, i, 0);
+
+out:
+	return err;
+}
+
+static int cpucg_percpu_seq_read(struct cgroup *cgrp, struct cftype *cft,
+				 struct seq_file *m)
+{
+	struct task_group *tg;
+	u64 percpu;
+	int i;
+
+	tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+			  struct task_group, css);
+
+	for_each_present_cpu(i) {
+		percpu = task_group_cpuusage_read(tg, i);
+		seq_printf(m, "%llu ", (unsigned long long) percpu);
+	}
+	seq_printf(m, "\n");
+	return 0;
+}
+
+static const char *cpucg_stat_desc[] = {
+	[CPUACCT_STAT_USER] = "user",
+	[CPUACCT_STAT_SYSTEM] = "system",
+};
+
+static int cpucg_stats_show(struct cgroup *cgrp, struct cftype *cft,
+			    struct cgroup_map_cb *cb)
+{
+	struct task_group *tg;
+	int cpu;
+	s64 val = 0;
+
+	tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+			  struct task_group, css);
+
+	for_each_online_cpu(cpu) {
+		struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+		val += kcpustat->cpustat[CPUTIME_USER];
+		val += kcpustat->cpustat[CPUTIME_NICE];
+	}
+	val = cputime64_to_clock_t(val);
+	cb->fill(cb, cpucg_stat_desc[CPUACCT_STAT_USER], val);
+
+	val = 0;
+	for_each_online_cpu(cpu) {
+		struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+		val += kcpustat->cpustat[CPUTIME_SYSTEM];
+		val += kcpustat->cpustat[CPUTIME_IRQ];
+		val += kcpustat->cpustat[CPUTIME_SOFTIRQ];
+	}
+
+	val = cputime64_to_clock_t(val);
+	cb->fill(cb, cpucg_stat_desc[CPUACCT_STAT_SYSTEM], val);
+
+	return 0;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
 				u64 shareval)
@@ -7918,6 +8074,23 @@ static struct cftype cpu_files[] = {
 		.write_u64 = cpu_rt_period_write_uint,
 	},
 #endif
+	/* cpuacct.* which used to be served by a separate cpuacct controller */
+	{
+		.name = "cpuacct.usage",
+		.flags = CFTYPE_NO_PREFIX,
+		.read_u64 = cpucg_cpuusage_read,
+		.write_u64 = cpucg_cpuusage_write,
+	},
+	{
+		.name = "cpuacct.usage_percpu",
+		.flags = CFTYPE_NO_PREFIX,
+		.read_seq_string = cpucg_percpu_seq_read,
+	},
+	{
+		.name = "cpuacct.stat",
+		.flags = CFTYPE_NO_PREFIX,
+		.read_map = cpucg_stats_show,
+	},
 	{ }	/* terminate */
 };
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 293b202..a4fe53e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -114,8 +114,10 @@ static int irqtime_account_si_update(void)
 static inline void task_group_account_field(struct task_struct *p, int index,
 					    u64 tmp)
 {
+#ifdef CONFIG_CGROUP_SCHED
+	struct task_group *tg;
+#endif
 #ifdef CONFIG_CGROUP_CPUACCT
-	struct kernel_cpustat *kcpustat;
 	struct cpuacct *ca;
 #endif
 	/*
@@ -125,7 +127,19 @@ static inline void task_group_account_field(struct task_struct *p, int index,
 	 *
 	 */
 	__get_cpu_var(kernel_cpustat).cpustat[index] += tmp;
+#ifdef CONFIG_CGROUP_SCHED
+	rcu_read_lock();
+	tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
+			  struct task_group, css);
 
+	while (tg && (tg != &root_task_group)) {
+		struct kernel_cpustat *kcpustat = this_cpu_ptr(tg->cpustat);
+
+		kcpustat->cpustat[index] += tmp;
+		tg = tg->parent;
+	}
+	rcu_read_unlock();
+#endif
 #ifdef CONFIG_CGROUP_CPUACCT
 	if (unlikely(!cpuacct_subsys.active))
 		return;
@@ -133,6 +147,8 @@ static inline void task_group_account_field(struct task_struct *p, int index,
 	rcu_read_lock();
 	ca = task_ca(p);
 	while (ca && (ca != &root_cpuacct)) {
+		struct kernel_cpustat *kcpustat = this_cpu_ptr(ca->cpustat);
+
 		kcpustat = this_cpu_ptr(ca->cpustat);
 		kcpustat->cpustat[index] += tmp;
 		ca = parent_ca(ca);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eea870..c15bc92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -707,6 +707,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
+		task_group_charge(curtask, delta_exec);
 		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 418feb0..f7e05d87 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -935,6 +935,7 @@ static void update_curr_rt(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock_task;
+	task_group_charge(curr, delta_exec);
 	cpuacct_charge(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc88644..84a339d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -104,6 +104,10 @@ struct cfs_bandwidth {
 struct task_group {
 	struct cgroup_subsys_state css;
 
+	/* statistics */
+	u64 __percpu *cpuusage;
+	struct kernel_cpustat __percpu *cpustat;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* schedulable entities of this group on each cpu */
 	struct sched_entity **se;
@@ -590,6 +594,8 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
+extern void task_group_charge(struct task_struct *tsk, u64 cputime);
+
 #else /* CONFIG_CGROUP_SCHED */
 
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -597,6 +603,7 @@ static inline struct task_group *task_group(struct task_struct *p)
 {
 	return NULL;
 }
+static inline void task_group_charge(struct task_struct *tsk, u64 cputime) { }
 
 #endif /* CONFIG_CGROUP_SCHED */
 
-- 
1.7.11.7
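
For illustration, a small userspace sketch of the behaviour this introduces:
asking for a cpu,cpuacct co-mount now binds only the cpu controller, which
nevertheless exposes cpuacct.usage, cpuacct.usage_percpu and cpuacct.stat.
The mount point below is an assumption for the example.

/*
 * Hedged sketch: request a cpu,cpuacct co-mount.  With this patch the
 * cpuacct option is dropped during option parsing and the cpuacct.*
 * files are served by the cpu controller instead.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	const char *target = "/sys/fs/cgroup/cpu";	/* illustrative mount point */

	if (mount("cgroup", target, "cgroup", 0, "cpu,cpuacct") < 0) {
		perror("mount");
		return 1;
	}
	printf("cpu mounted at %s; cpuacct.* files served by cpu\n", target);
	return 0;
}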


* [PATCH v5 04/11] cgroup, sched: deprecate cpuacct
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Peter Zijlstra, Glauber Costa, Michal Hocko,
	Kay Sievers, Lennart Poettering, Dave Jones, Ben Hutchings

From: Tejun Heo <tj@kernel.org>

Now that cpu serves the same files as cpuacct and using cpuacct
separately from cpu is deprecated, we can deprecate cpuacct.  To avoid
disturbing userland which has been co-mounting cpu and cpuacct,
implement some hackery in cgroup core so that cpuacct co-mounting
still works even if cpuacct is disabled.

The goal of this patch is to accelerate disabling and removal of
cpuacct by decoupling kernel-side deprecation from userland changes.
Userland is recommended to do the following.

* If /proc/cgroups lists cpuacct, always co-mount it with cpu under
  e.g. /sys/fs/cgroup/cpu.

* Optionally create symlinks for compatibility -
  e.g. /sys/fs/cgroup/cpuacct and /sys/fs/cgroup/cpu,cpuacct both
  pointing to /sys/fs/cgroup/cpu - whether cpuacct exists or not.

This compatibility hack will eventually go away.

[ glommer@parallels.com: subsys_bits => subsys_mask ]

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Paul Turner <pjt@google.com>
---
 init/Kconfig        | 11 ++++++++++-
 kernel/cgroup.c     | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/core.c |  2 ++
 3 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 7d30240..4e411ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -815,11 +815,20 @@ config PROC_PID_CPUSET
 	default y
 
 config CGROUP_CPUACCT
-	bool "Simple CPU accounting cgroup subsystem"
+	bool "DEPRECATED: Simple CPU accounting cgroup subsystem"
+	default n
 	help
 	  Provides a simple Resource Controller for monitoring the
 	  total CPU consumed by the tasks in a cgroup.
 
+	  This cgroup subsystem is deprecated.  The CPU cgroup
+	  subsystem serves the same accounting files and "cpuacct"
+	  mount option is ignored if specified with "cpu".  As long as
+	  userland co-mounts cpu and cpuacct, disabling this
+	  controller should be mostly unnoticeable - one notable
+	  difference is that /proc/PID/cgroup won't list cpuacct
+	  anymore.
+
 config RESOURCE_COUNTERS
 	bool "Resource counters"
 	help
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0750669d..4ddb335 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1136,6 +1136,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 	unsigned long mask = (unsigned long)-1;
 	int i;
 	bool module_pin_failed = false;
+	bool cpuacct_requested = false;
 
 	BUG_ON(!mutex_is_locked(&cgroup_mutex));
 
@@ -1225,8 +1226,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 
 			break;
 		}
-		if (i == CGROUP_SUBSYS_COUNT)
+		/* handle deprecated cpuacct specially, see below */
+		if (!strcmp(token, "cpuacct")) {
+			cpuacct_requested = true;
+			one_ss = true;
+		} else if (i == CGROUP_SUBSYS_COUNT) {
 			return -ENOENT;
+		}
 	}
 
 	/*
@@ -1253,12 +1259,29 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
 	 * this creates some discrepancies in /proc/cgroups and
 	 * /proc/PID/cgroup.
 	 *
+	 * Accept and ignore "cpuacct" option if comounted with "cpu" even
+	 * when cpuacct itself is disabled to allow quick disabling and
+	 * removal of cpuacct.  This will be removed eventually.
+	 *
 	 * https://lkml.org/lkml/2012/9/13/542
 	 */
+	if (cpuacct_requested) {
+		bool comounted = false;
+
+#if IS_ENABLED(CONFIG_CGROUP_SCHED)
+		comounted = opts->subsys_mask & (1 << cpu_cgroup_subsys_id);
+#endif
+		if (!comounted) {
+			pr_warning("cgroup: mounting cpuacct separately from cpu is deprecated\n");
+#if !IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+			return -EINVAL;
+#endif
+		}
+	}
 #if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
-	if ((opts->subsys_bits & (1 << cpu_cgroup_subsys_id)) &&
-	    (opts->subsys_bits & (1 << cpuacct_subsys_id)))
-		opts->subsys_bits &= ~(1 << cpuacct_subsys_id);
+	if ((opts->subsys_mask & (1 << cpu_cgroup_subsys_id)) &&
+	    (opts->subsys_mask & (1 << cpuacct_subsys_id)))
+		opts->subsys_mask &= ~(1 << cpuacct_subsys_id);
 #endif
 	/*
 	 * Option noprefix was introduced just for backward compatibility
@@ -4806,6 +4829,7 @@ const struct file_operations proc_cgroup_operations = {
 /* Display information about each subsystem and each hierarchy */
 static int proc_cgroupstats_show(struct seq_file *m, void *v)
 {
+	struct cgroup_subsys *ss;
 	int i;
 
 	seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n");
@@ -4816,7 +4840,7 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
 	 */
 	mutex_lock(&cgroup_mutex);
 	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
-		struct cgroup_subsys *ss = subsys[i];
+		ss = subsys[i];
 		if (ss == NULL)
 			continue;
 		seq_printf(m, "%s\t%d\t%d\t%d\n",
@@ -4824,6 +4848,19 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
 			   ss->root->number_of_cgroups, !ss->disabled);
 	}
 	mutex_unlock(&cgroup_mutex);
+
+	/*
+	 * Fake /proc/cgroups entry for cpuacct to trick userland into
+	 * cpu,cpuacct comounts.  This is to allow quick disabling and
+	 * removal of cpuacct and will be removed eventually.
+	 */
+#if IS_ENABLED(CONFIG_CGROUP_SCHED) && !IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+	ss = subsys[cpu_cgroup_subsys_id];
+	if (ss) {
+		seq_printf(m, "cpuacct\t%d\t%d\t%d\n", ss->root->hierarchy_id,
+			   ss->root->number_of_cgroups, !ss->disabled);
+	}
+#endif
 	return 0;
 }
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6516694..a62b771 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8110,6 +8110,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 
 #ifdef CONFIG_CGROUP_CPUACCT
 
+#warning CONFIG_CGROUP_CPUACCT is deprecated, read the Kconfig help message
+
 /*
  * CPU accounting code for task groups.
  *
-- 
1.7.11.7
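
A hedged sketch of the userland recommendation above: check whether
/proc/cgroups lists cpuacct (either a real entry or the fake compatibility
entry added by this patch) and, if so, co-mount it with cpu. The
tab-separated format assumed below is the one printed by
proc_cgroupstats_show().

/* illustrative only: pick cgroup mount options based on /proc/cgroups */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	const char *opts = "cpu";	/* default: cpu alone */
	FILE *f = fopen("/proc/cgroups", "r");

	if (!f) {
		perror("/proc/cgroups");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "cpuacct\t", 8)) {
			/* cpuacct listed: always co-mount it with cpu */
			opts = "cpu,cpuacct";
			break;
		}
	}
	fclose(f);
	printf("mount -t cgroup -o %s cgroup /sys/fs/cgroup/cpu\n", opts);
	return 0;
}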


* [PATCH v5 05/11] sched: adjust exec_clock to use it as cpu usage metric
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Glauber Costa, Dave Jones, Ben Hutchings,
	Lennart Poettering, Kay Sievers

exec_clock already provides per-group cpu usage metrics, and can be
reused by cpuacct in case cpu and cpuacct are co-mounted.

However, it is currently only maintained for tasks in the fair class.
Doing the same for rt is easy, and can be done in an already existing
hierarchy loop. This is an improvement over the independent hierarchy
walk executed by cpuacct.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Jones <davej@redhat.com>
CC: Ben Hutchings <ben@decadent.org.uk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
CC: Lennart Poettering <lennart@poettering.net>
CC: Kay Sievers <kay.sievers@vrfy.org>
CC: Tejun Heo <tj@kernel.org>
---
 kernel/sched/rt.c    | 1 +
 kernel/sched/sched.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f7e05d87..7f6f6c6 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -945,6 +945,7 @@ static void update_curr_rt(struct rq *rq)
 
 	for_each_sched_rt_entity(rt_se) {
 		rt_rq = rt_rq_of_se(rt_se);
+		schedstat_add(rt_rq, exec_clock, delta_exec);
 
 		if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
 			raw_spin_lock(&rt_rq->rt_runtime_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84a339d..01ca8a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -210,6 +210,7 @@ struct cfs_rq {
 	unsigned int nr_running, h_nr_running;
 
 	u64 exec_clock;
+	u64 prev_exec_clock;
 	u64 min_vruntime;
 #ifndef CONFIG_64BIT
 	u64 min_vruntime_copy;
@@ -312,6 +313,8 @@ struct rt_rq {
 	struct plist_head pushable_tasks;
 #endif
 	int rt_throttled;
+	u64 exec_clock;
+	u64 prev_exec_clock;
 	u64 rt_time;
 	u64 rt_runtime;
 	/* Nests inside the rq lock: */
-- 
1.7.11.7
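
The prev_exec_clock fields introduced here are consumed by the next patch:
per-group usage is read as the delta against the running exec_clock, and a
"reset" simply snapshots the current value. A sketch of the intent (the
helper names below are illustrative; the rt case is symmetric):

#ifdef CONFIG_FAIR_GROUP_SCHED
static inline u64 cfs_usage_since_reset(struct task_group *tg, int cpu)
{
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];

	/* nanoseconds of group runtime accumulated since the last reset */
	return cfs_rq->exec_clock - cfs_rq->prev_exec_clock;
}

static inline void cfs_usage_reset(struct task_group *tg, int cpu)
{
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];

	cfs_rq->prev_exec_clock = cfs_rq->exec_clock;
}
#endif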


* [PATCH v5 06/11] cpuacct: don't actually do anything.
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Glauber Costa, Peter Zijlstra, Michal Hocko,
	Kay Sievers, Lennart Poettering, Dave Jones, Ben Hutchings

All the information needed for cpuusage (and cpuusage_percpu) is
already present in schedstats, and it is already recorded in a sane
hierarchical way.

If we have CONFIG_SCHEDSTATS, we don't really need to do any extra
work: the former charging functions become empty inlines.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Lennart Poettering <mzxreary@0pointer.de>
Cc: Dave Jones <davej@redhat.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Paul Turner <pjt@google.com>
---
 kernel/sched/core.c  | 102 ++++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  10 +++--
 2 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a62b771..f8a9acf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7267,6 +7267,7 @@ void sched_move_task(struct task_struct *tsk)
 	task_rq_unlock(rq, tsk, &flags);
 }
 
+#ifndef CONFIG_SCHEDSTATS
 void task_group_charge(struct task_struct *tsk, u64 cputime)
 {
 	struct task_group *tg;
@@ -7284,6 +7285,7 @@ void task_group_charge(struct task_struct *tsk, u64 cputime)
 
 	rcu_read_unlock();
 }
+#endif
 #endif /* CONFIG_CGROUP_SCHED */
 
 #if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -7640,22 +7642,92 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup *old_cgrp,
 	sched_move_task(task);
 }
 
-static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+/*
+ * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ */
+static inline void lock_rq_dword(int cpu)
 {
-	u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
-	u64 data;
-
 #ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
-	 */
 	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-	data = *cpuusage;
+#endif
+}
+
+static inline void unlock_rq_dword(int cpu)
+{
+#ifndef CONFIG_64BIT
 	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#endif
+}
+
+#ifdef CONFIG_SCHEDSTATS
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+	return tg->cfs_rq[cpu]->exec_clock - tg->cfs_rq[cpu]->prev_exec_clock;
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+	tg->cfs_rq[cpu]->prev_exec_clock = tg->cfs_rq[cpu]->exec_clock;
+}
 #else
-	data = *cpuusage;
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+	return 0;
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+	return tg->rt_rq[cpu]->exec_clock - tg->rt_rq[cpu]->prev_exec_clock;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+	tg->rt_rq[cpu]->prev_exec_clock = tg->rt_rq[cpu]->exec_clock;
+}
+#else
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+	return 0;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
 #endif
 
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+	u64 ret = 0;
+
+	lock_rq_dword(cpu);
+	ret = cfs_exec_clock(tg, cpu) + rt_exec_clock(tg, cpu);
+	unlock_rq_dword(cpu);
+
+	return ret;
+}
+
+static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
+{
+	lock_rq_dword(cpu);
+	cfs_exec_clock_reset(tg, cpu);
+	rt_exec_clock_reset(tg, cpu);
+	unlock_rq_dword(cpu);
+}
+#else
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+	u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+	u64 data;
+
+	lock_rq_dword(cpu);
+	data = *cpuusage;
+	unlock_rq_dword(cpu);
+
 	return data;
 }
 
@@ -7663,17 +7735,11 @@ static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
 {
 	u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
 
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
-	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+	lock_rq_dword(cpu);
 	*cpuusage = val;
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	*cpuusage = val;
-#endif
+	unlock_rq_dword(cpu);
 }
+#endif
 
 /* return total cpu usage (in nanoseconds) of a group */
 static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 01ca8a4..640aa14 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -597,8 +597,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
 #endif
 }
 
-extern void task_group_charge(struct task_struct *tsk, u64 cputime);
-
 #else /* CONFIG_CGROUP_SCHED */
 
 static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -606,10 +604,14 @@ static inline struct task_group *task_group(struct task_struct *p)
 {
 	return NULL;
 }
-static inline void task_group_charge(struct task_struct *tsk, u64 cputime) { }
-
 #endif /* CONFIG_CGROUP_SCHED */
 
+#if defined(CONFIG_CGROUP_SCHED) && !defined(CONFIG_SCHEDSTATS)
+extern void task_group_charge(struct task_struct *tsk, u64 cputime);
+#else
+static inline void task_group_charge(struct task_struct *tsk, u64 cputime) {}
+#endif
+
 static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 {
 	set_task_rq(p, cpu);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 07/11] account guest time per-cgroup as well.
  2013-01-09 11:45 ` Glauber Costa
@ 2013-01-09 11:45   ` Glauber Costa
  -1 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Glauber Costa

We already track multiple tick statistics per-cgroup, using
the task_group_account_field facility. This patch accounts
guest_time in that manner as well.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
---
 kernel/sched/cputime.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a4fe53e..db93aa7 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -191,8 +191,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
 static void account_guest_time(struct task_struct *p, cputime_t cputime,
 			       cputime_t cputime_scaled)
 {
-	u64 *cpustat = kcpustat_this_cpu->cpustat;
-
 	/* Add guest time to process. */
 	p->utime += cputime;
 	p->utimescaled += cputime_scaled;
@@ -201,11 +199,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
 
 	/* Add guest time to cpustat. */
 	if (TASK_NICE(p) > 0) {
-		cpustat[CPUTIME_NICE] += (__force u64) cputime;
-		cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime;
+		task_group_account_field(p, CPUTIME_NICE, (__force u64) cputime);
+		task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime);
 	} else {
-		cpustat[CPUTIME_USER] += (__force u64) cputime;
-		cpustat[CPUTIME_GUEST] += (__force u64) cputime;
+		task_group_account_field(p, CPUTIME_USER, (__force u64) cputime);
+		task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime);
 	}
 }
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 08/11] sched: Push put_prev_task() into pick_next_task()
@ 2013-01-09 11:45   ` Glauber Costa
  0 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Glauber Costa

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

In order to avoid having to do put/set on a whole cgroup hierarchy
when we context switch, push the put into pick_next_task() so that
both operations are in the same function. Further changes then allow
us to possibly optimize away redundant work.

[ glommer@parallels.com: incorporated mailing list feedback ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Glauber Costa <glommer@parallels.com>
---
 include/linux/sched.h    |  8 +++++++-
 kernel/sched/core.c      | 20 +++++++-------------
 kernel/sched/fair.c      |  6 +++++-
 kernel/sched/idle_task.c |  6 +++++-
 kernel/sched/rt.c        | 27 ++++++++++++++++-----------
 kernel/sched/stop_task.c |  5 ++++-
 6 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 206bb08..31d86e5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1082,7 +1082,13 @@ struct sched_class {
 
 	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
-	struct task_struct * (*pick_next_task) (struct rq *rq);
+	/*
+	 * It is the responsibility of the pick_next_task() method that will
+	 * return the next task to call put_prev_task() on the @prev task or
+	 * something equivalent.
+	 */
+	struct task_struct * (*pick_next_task) (struct rq *rq,
+						struct task_struct *prev);
 	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f8a9acf..c36df03 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2807,18 +2807,11 @@ static inline void schedule_debug(struct task_struct *prev)
 	schedstat_inc(this_rq(), sched_count);
 }
 
-static void put_prev_task(struct rq *rq, struct task_struct *prev)
-{
-	if (prev->on_rq || rq->skip_clock_update < 0)
-		update_rq_clock(rq);
-	prev->sched_class->put_prev_task(rq, prev);
-}
-
 /*
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq)
+pick_next_task(struct rq *rq, struct task_struct *prev)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -2828,13 +2821,13 @@ pick_next_task(struct rq *rq)
 	 * the fair class we can call that function directly:
 	 */
 	if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
-		p = fair_sched_class.pick_next_task(rq);
+		p = fair_sched_class.pick_next_task(rq, prev);
 		if (likely(p))
 			return p;
 	}
 
 	for_each_class(class) {
-		p = class->pick_next_task(rq);
+		p = class->pick_next_task(rq, prev);
 		if (p)
 			return p;
 	}
@@ -2929,8 +2922,9 @@ need_resched:
 	if (unlikely(!rq->nr_running))
 		idle_balance(cpu, rq);
 
-	put_prev_task(rq, prev);
-	next = pick_next_task(rq);
+	if (prev->on_rq || rq->skip_clock_update < 0)
+		update_rq_clock(rq);
+	next = pick_next_task(rq, prev);
 	clear_tsk_need_resched(prev);
 	rq->skip_clock_update = 0;
 
@@ -4880,7 +4874,7 @@ static void migrate_tasks(unsigned int dead_cpu)
 		if (rq->nr_running == 1)
 			break;
 
-		next = pick_next_task(rq);
+		next = pick_next_task(rq, NULL);
 		BUG_ON(!next);
 		next->sched_class->put_prev_task(rq, next);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c15bc92..d59a106 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3595,7 +3595,8 @@ preempt:
 		set_last_buddy(se);
 }
 
-static struct task_struct *pick_next_task_fair(struct rq *rq)
+static struct task_struct *
+pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *p;
 	struct cfs_rq *cfs_rq = &rq->cfs;
@@ -3604,6 +3605,9 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
 	if (!cfs_rq->nr_running)
 		return NULL;
 
+	if (prev)
+		prev->sched_class->put_prev_task(rq, prev);
+
 	do {
 		se = pick_next_entity(cfs_rq);
 		set_next_entity(cfs_rq, se);
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index b6baf37..07e6027 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -22,8 +22,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 	resched_task(rq->idle);
 }
 
-static struct task_struct *pick_next_task_idle(struct rq *rq)
+static struct task_struct *
+pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 {
+	if (prev)
+		prev->sched_class->put_prev_task(rq, prev);
+
 	schedstat_inc(rq, sched_goidle);
 	return rq->idle;
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7f6f6c6..80c58fe 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1368,15 +1368,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 {
 	struct sched_rt_entity *rt_se;
 	struct task_struct *p;
-	struct rt_rq *rt_rq;
-
-	rt_rq = &rq->rt;
-
-	if (!rt_rq->rt_nr_running)
-		return NULL;
-
-	if (rt_rq_throttled(rt_rq))
-		return NULL;
+	struct rt_rq *rt_rq  = &rq->rt;
 
 	do {
 		rt_se = pick_next_rt_entity(rq, rt_rq);
@@ -1390,9 +1382,22 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	return p;
 }
 
-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *
+pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 {
-	struct task_struct *p = _pick_next_task_rt(rq);
+	struct task_struct *p;
+	struct rt_rq *rt_rq = &rq->rt;
+
+	if (!rt_rq->rt_nr_running)
+		return NULL;
+
+	if (rt_rq_throttled(rt_rq))
+		return NULL;
+
+	if (prev)
+		prev->sched_class->put_prev_task(rq, prev);
+
+	p = _pick_next_task_rt(rq);
 
 	/* The running task is never eligible for pushing */
 	if (p)
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index fda1cbe..5f10918 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,12 +23,15 @@ check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
 	/* we're never preempted */
 }
 
-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *
+pick_next_task_stop(struct rq *rq, struct task_struct *prev)
 {
 	struct task_struct *stop = rq->stop;
 
 	if (stop && stop->on_rq) {
 		stop->se.exec_start = rq->clock_task;
+		if (prev)
+			prev->sched_class->put_prev_task(rq, prev);
 		return stop;
 	}
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 09/11] record per-cgroup number of context switches
  2013-01-09 11:45 ` Glauber Costa
@ 2013-01-09 11:45   ` Glauber Costa
  -1 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Glauber Costa

Context switches are, at this moment, a property of the runqueue. When
running containers, we would like to be able to present a separate
figure for each container (or cgroup, in this context).

The chosen way to accomplish this is to increment a per-cfs_rq or
per-rt_rq counter, depending on the task, for each of the sched
entities involved, up to the parent. It is trivial to note that for the
parent cgroup, we always add 1 by doing this. Also, we are not
introducing any new hierarchy walks here: an already existing walk is
reused (illustrated below).
There are, however, two main issues:

 1. the traditional context switch code only increments nr_switches
 when a different task is being inserted in the rq. Occasionally,
 albeit unlikely, we will pick the same task as before. Since for cfs
 and rt we only know which task will be next after the walk, we need
 to do the walk again, decrementing by 1. Since this is very unlikely,
 it seems a fair price to pay.

 2. Those figures do not include switches from and to the idle or stop
 task. Those need to be recorded separately, which will happen in a
 follow-up patch.
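
As a purely illustrative example (the cgroup names are made up),
consider a hierarchy root/containerA/web on a given CPU. When the fair
class picks a task belonging to "web", the existing top-down walk
increments the counter of every cfs_rq it descends through:

	root		cfs_rq[cpu]->nr_switches++	(sees every fair switch)
	 containerA	cfs_rq[cpu]->nr_switches++
	  web		cfs_rq[cpu]->nr_switches++	(cgroup of the picked task)

If the picked task turns out to be the same as before (issue 1 above),
the same chain is walked again to decrement the counters back.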

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
---
 kernel/sched/fair.c  | 18 ++++++++++++++++++
 kernel/sched/rt.c    | 15 +++++++++++++--
 kernel/sched/sched.h |  3 +++
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d59a106..0dd9c50 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3609,6 +3609,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 		prev->sched_class->put_prev_task(rq, prev);
 
 	do {
+		if (likely(prev))
+			cfs_rq->nr_switches++;
 		se = pick_next_entity(cfs_rq);
 		set_next_entity(cfs_rq, se);
 		cfs_rq = group_cfs_rq(se);
@@ -3618,6 +3620,22 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 
+	/*
+	 * This condition is extremely unlikely, and most of the time its cost
+	 * is just this unlikely branch, which is extremely cheap. But we
+	 * still need to have it, because when we first loop through cfs_rq's,
+	 * we can't possibly know which task we will pick. The call to
+	 * set_next_entity above is not meant to mess up the tree in this case,
+	 * so it should give us the same chain, in the same order.
+	 */
+	if (unlikely(p == prev)) {
+		se = &p->se;
+		for_each_sched_entity(se) {
+			cfs_rq = cfs_rq_of(se);
+			cfs_rq->nr_switches--;
+		}
+	}
+
 	return p;
 }
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 80c58fe..19ceed9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1364,13 +1364,16 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
 	return next;
 }
 
-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *
+_pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 {
 	struct sched_rt_entity *rt_se;
 	struct task_struct *p;
 	struct rt_rq *rt_rq  = &rq->rt;
 
 	do {
+		if (likely(prev))
+			rt_rq->rt_nr_switches++;
 		rt_se = pick_next_rt_entity(rq, rt_rq);
 		BUG_ON(!rt_se);
 		rt_rq = group_rt_rq(rt_se);
@@ -1379,6 +1382,14 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
 	p = rt_task_of(rt_se);
 	p->se.exec_start = rq->clock_task;
 
+	/* See fair.c for an explanation on this */
+	if (unlikely(p == prev)) {
+		for_each_sched_rt_entity(rt_se) {
+			rt_rq = rt_rq_of_se(rt_se);
+			rt_rq->rt_nr_switches--;
+		}
+	}
+
 	return p;
 }
 
@@ -1397,7 +1408,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 	if (prev)
 		prev->sched_class->put_prev_task(rq, prev);
 
-	p = _pick_next_task_rt(rq);
+	p = _pick_next_task_rt(rq, prev);
 
 	/* The running task is never eligible for pushing */
 	if (p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 640aa14..a426abc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -229,6 +229,7 @@ struct cfs_rq {
 	unsigned int nr_spread_over;
 #endif
 
+	u64 nr_switches;
 #ifdef CONFIG_SMP
 /*
  * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
@@ -298,6 +299,8 @@ static inline int rt_bandwidth_enabled(void)
 struct rt_rq {
 	struct rt_prio_array active;
 	unsigned int rt_nr_running;
+	u64 rt_nr_switches;
+
 #if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
 	struct {
 		int curr; /* highest queued rt task prio */
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 10/11] sched: change nr_context_switches calculation.
  2013-01-09 11:45 ` Glauber Costa
@ 2013-01-09 11:45   ` Glauber Costa
  -1 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Glauber Costa

This patch changes the calculation of nr_context_switches. The per-rq
variable "nr_switches" is now used only to account for the number of
transitions to the idle or stop tasks, and is no longer incremented in
the schedule() path.

The total can still be derived because transitions to the fair and rt
classes are recorded in the root_task_group: adding those quantities to
the per-rq figure yields the global count, as sketched below.
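
Schematically (an illustration only, mirroring the code below), summed
over all possible CPUs the global count becomes:

	nr_context_switches() =
		  cfs_nr_switches(&root_task_group, cpu)	/* fair-class switches */
		+ rt_nr_switches(&root_task_group, cpu)		/* rt-class switches   */
		+ cpu_rq(cpu)->nr_switches			/* idle and stop tasks */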

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
---
 kernel/sched/core.c      | 17 +++++++++++++++--
 kernel/sched/idle_task.c |  3 +++
 kernel/sched/stop_task.c |  2 ++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c36df03..6bb56f0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2001,13 +2001,27 @@ unsigned long nr_uninterruptible(void)
 	return sum;
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define cfs_nr_switches(tg, cpu) (tg)->cfs_rq[cpu]->nr_switches
+#else
+#define cfs_nr_switches(tg, cpu) cpu_rq(cpu)->cfs.nr_switches
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_nr_switches(tg, cpu) (tg)->rt_rq[cpu]->rt_nr_switches
+#else
+#define rt_nr_switches(tg, cpu) cpu_rq(cpu)->rt.rt_nr_switches
+#endif
+
 unsigned long long nr_context_switches(void)
 {
 	int i;
 	unsigned long long sum = 0;
 
-	for_each_possible_cpu(i)
+	for_each_possible_cpu(i) {
+		sum += cfs_nr_switches(&root_task_group, i);
+		sum += rt_nr_switches(&root_task_group, i);
 		sum += cpu_rq(i)->nr_switches;
+	}
 
 	return sum;
 }
@@ -2929,7 +2943,6 @@ need_resched:
 	rq->skip_clock_update = 0;
 
 	if (likely(prev != next)) {
-		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;
 
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 07e6027..652d98c 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -28,6 +28,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev)
 	if (prev)
 		prev->sched_class->put_prev_task(rq, prev);
 
+	if (prev != rq->idle)
+		rq->nr_switches++;
+
 	schedstat_inc(rq, sched_goidle);
 	return rq->idle;
 }
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 5f10918..d1e9b82 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -32,6 +32,8 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev)
 		stop->se.exec_start = rq->clock_task;
 		if (prev)
 			prev->sched_class->put_prev_task(rq, prev);
+		if (prev != rq->stop)
+			rq->nr_switches++;
 		return stop;
 	}
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH v5 11/11] sched: introduce cgroup file stat_percpu
@ 2013-01-09 11:45   ` Glauber Costa
  0 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-09 11:45 UTC (permalink / raw)
  To: cgroups
  Cc: linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Glauber Costa

The file cpu.stat_percpu shows various pieces of scheduler-related
information that are usually available to the top level through other
files.

For instance, most of the meaningful data in /proc/stat is presented
here. Given this file, a container can easily construct a local copy of
/proc/stat for internal consumption.

The data we export is comprised of:
* all the tick information, previously available only through cpuacct,
  like user time, system time, etc.

* wait time, which can be used to construct information analogous to
  steal time in hypervisors,

* nr_switches and nr_running, which are cgroup-local versions of
  their global counterparts.

The file includes a header, so fields can come and go if needed.
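
As an illustration only (the mount point and all numbers are made up;
the field list is the header printed by cpu_stats_percpu_show() below),
reading the file for a group could look like:

	# cat /sys/fs/cgroup/cpu/mygroup/cpu.stat_percpu
	user nice system irq softirq guest guest_nice wait nr_switches nr_running
	cpu0 130216 0 4519 0 342 0 0 81734 1290 2
	cpu1 128937 12 5102 0 297 0 0 79412 1275 1

A management agent inside the container can then map each row onto the
corresponding "cpuN" line of the /proc/stat view it exposes.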

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
---
 kernel/sched/core.c  | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c  | 13 +++++++
 kernel/sched/sched.h |  1 +
 3 files changed, 111 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6bb56f0..5135b50 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8111,6 +8111,97 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+#ifdef CONFIG_SCHEDSTATS
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define fair_rq(field, tg, i)  (tg)->cfs_rq[i]->field
+#else
+#define fair_rq(field, tg, i)  0
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_rq(field, tg, i)  (tg)->rt_rq[i]->field
+#else
+#define rt_rq(field, tg, i)  0
+#endif
+
+static u64 tg_nr_switches(struct task_group *tg, int cpu)
+{
+	/* nr_switches, which counts idle and stop task, is added to all tgs */
+	return cpu_rq(cpu)->nr_switches +
+		cfs_nr_switches(tg, cpu) + rt_nr_switches(tg, cpu);
+}
+
+static u64 tg_nr_running(struct task_group *tg, int cpu)
+{
+	/*
+	 * because of autogrouped groups in root_task_group, the
+	 * following does not hold.
+	 */
+	if (tg != &root_task_group)
+		return rt_rq(rt_nr_running, tg, cpu) + fair_rq(nr_running, tg, cpu);
+
+	return cpu_rq(cpu)->nr_running;
+}
+
+static u64 tg_wait(struct task_group *tg, int cpu)
+{
+	u64 val;
+
+	if (tg != &root_task_group)
+		val = cfs_read_wait(tg->se[cpu]);
+	else
+		/*
+		 * There are many errors here that we are accumulating.
+		 * However, we only provide this in the interest of having
+		 * a consistent interface for all cgroups. Everybody
+		 * probing the root cgroup should be getting its figures
+		 * from system-wide files such as /proc/stat. That would be faster
+		 * to begin with...
+		 */
+		val = kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL] * TICK_NSEC;
+
+	return val;
+}
+
+static inline void do_fill_seq(struct seq_file *m, struct task_group *tg,
+			       int cpu, int index)
+{
+	u64 val = 0;
+	struct kernel_cpustat *kcpustat;
+	kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+	val = cputime64_to_clock_t(kcpustat->cpustat[index]) * TICK_NSEC;
+	seq_put_decimal_ull(m, ' ', val);
+}
+
+static int cpu_stats_percpu_show(struct cgroup *cgrp, struct cftype *cft,
+				 struct seq_file *m)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	int cpu;
+
+	seq_printf(m, "user nice system irq softirq guest guest_nice ");
+	seq_printf(m, "wait nr_switches nr_running\n");
+
+	for_each_online_cpu(cpu) {
+		seq_printf(m, "cpu%d", cpu);
+		do_fill_seq(m, tg, cpu, CPUTIME_USER);
+		do_fill_seq(m, tg, cpu, CPUTIME_NICE);
+		do_fill_seq(m, tg, cpu, CPUTIME_SYSTEM);
+		do_fill_seq(m, tg, cpu, CPUTIME_IRQ);
+		do_fill_seq(m, tg, cpu, CPUTIME_SOFTIRQ);
+		do_fill_seq(m, tg, cpu, CPUTIME_GUEST);
+		do_fill_seq(m, tg, cpu, CPUTIME_GUEST_NICE);
+		seq_put_decimal_ull(m, ' ', tg_wait(tg, cpu));
+		seq_put_decimal_ull(m, ' ', tg_nr_switches(tg, cpu));
+		seq_put_decimal_ull(m, ' ', tg_nr_running(tg, cpu));
+		seq_putc(m, '\n');
+	}
+
+	return 0;
+}
+#endif
+
 static struct cftype cpu_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -8164,6 +8255,12 @@ static struct cftype cpu_files[] = {
 		.flags = CFTYPE_NO_PREFIX,
 		.read_map = cpucg_stats_show,
 	},
+#ifdef CONFIG_SCHEDSTATS
+	{
+		.name = "stat_percpu",
+		.read_seq_string = cpu_stats_percpu_show,
+	},
+#endif
 	{ }	/* terminate */
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0dd9c50..778b249 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -721,6 +721,19 @@ update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	schedstat_set(se->statistics.wait_start, rq_of(cfs_rq)->clock);
 }
 
+#ifdef CONFIG_SCHEDSTATS
+u64 cfs_read_wait(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 value = se->statistics.wait_sum;
+
+	if (!se->statistics.wait_start)
+		return value;
+
+	return value + rq_of(cfs_rq)->clock - se->statistics.wait_start;
+}
+#endif
+
 /*
  * Task is being enqueued - update stats:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a426abc..0a12980 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1195,6 +1195,7 @@ extern void init_cfs_rq(struct cfs_rq *cfs_rq);
 extern void init_rt_rq(struct rt_rq *rt_rq, struct rq *rq);
 
 extern void account_cfs_bandwidth_used(int enabled, int was_enabled);
+extern u64 cfs_read_wait(struct sched_entity *se);
 
 #ifdef CONFIG_NO_HZ
 enum rq_nohz_flag_bits {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
@ 2013-01-09 14:41   ` Tejun Heo
  0 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-01-09 14:41 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, linux-kernel, Andrew Morton, Peter Zijlstra, Paul Turner

Hello, Glauber.

On Wed, Jan 09, 2013 at 03:45:27PM +0400, Glauber Costa wrote:
> [ update: I thought I posted this already before leaving for holidays. However,
>   now that I am checking for replies, I can't find nor replies nor the original
>   mail in my boxes or archives. I am posting again for safety sake, but sorry
>   you are getting this twice by any chance ]

I think I saw the series a while back.

> This is an attempt to provide userspace with enough information to reconstruct
> per-container version of files like "/proc/stat". In particular, we are
> interested in knowing the per-cgroup slices of user time, system time, wait
> time, number of processes, and a variety of statistics.
> 
> This task is made more complicated by the fact that multiple controllers are
> involved in collecting those statistics: cpu and cpuacct. So the first thing I
> am doing here, is ressurecting Tejun's patches that aim at deprecating cpuacct.
> 
> This is one of the major differences from earlier attempts: all data is provided
> by the cpu controller, resulting in greater simplicity.
> 
> This also tries to hook into the existing scheduler hierarchy walks instead of
> providing new ones.

The cgroup part looks good to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/11] sched: introduce cgroup file stat_percpu
@ 2013-01-09 20:42     ` Andrew Morton
  0 siblings, 0 replies; 62+ messages in thread
From: Andrew Morton @ 2013-01-09 20:42 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, linux-kernel, Tejun Heo, Peter Zijlstra, Paul Turner

On Wed,  9 Jan 2013 15:45:38 +0400
Glauber Costa <glommer@parallels.com> wrote:

> The file cpu.stat_percpu will show various scheduler related
> information, that are usually available to the top level through other
> files.
> 
> For instance, most of the meaningful data in /proc/stat is presented
> here. Given this file, a container can easily construct a local copy of
> /proc/stat for internal consumption.
> 
> The data we export is comprised of:
> * all the tick information, previously available only through cpuacct,
>   like user time, system time, etc.
> 
> * wait time, which can be used to construct analogous information to
>   steal time in hypervisors,
> 
> * nr_switches and nr_running, which are cgroup-local versions of
>   their global counterparts.
> 
> The file includes a header, so fields can come and go if needed.

Please update this changelog to fully describe the proposed userspace
interfaces.  That means full pathnames and example output. 
Understanding these interfaces is the most important part of reviewing
this patchset, so this info should be prominent.

Also, this patchset appears to alter (or remove?) existing userspace
interfaces?  If so, please clearly describe those alterations and also
share your thinking on the back-compatibility issues.

Also, I'm not seeing any changes to Documentation/ in this patchset.
How do we explain the interface to our users?


From a quick read, it appears that the output will be something along
the lines of:

user nice system irq softirq guest guest_nice wait nr_switches nr_running
cpu0 nn nn nn nn nn ...
cpu1 nn nn nn nn nn ...

which looks pretty terrible.  Apart from being very hard to read, it
means that we can never remove fields.  A nicer and more extensible
interface would be

cpu:0 nice:nn system:nn irq:nn ...




^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/11] sched: introduce cgroup file stat_percpu
  2013-01-09 20:42     ` Andrew Morton
@ 2013-01-09 21:10       ` Glauber Costa
  -1 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-09 21:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, linux-kernel, Tejun Heo, Peter Zijlstra, Paul Turner

On 01/10/2013 12:42 AM, Andrew Morton wrote:
> On Wed,  9 Jan 2013 15:45:38 +0400
> Glauber Costa <glommer@parallels.com> wrote:
> 
>> The file cpu.stat_percpu will show various scheduler related
>> information, that are usually available to the top level through other
>> files.
>>
>> For instance, most of the meaningful data in /proc/stat is presented
>> here. Given this file, a container can easily construct a local copy of
>> /proc/stat for internal consumption.
>>
>> The data we export is comprised of:
>> * all the tick information, previously available only through cpuacct,
>>   like user time, system time, etc.
>>
>> * wait time, which can be used to construct analogous information to
>>   steal time in hypervisors,
>>
>> * nr_switches and nr_running, which are cgroup-local versions of
>>   their global counterparts.
>>
>> The file includes a header, so fields can come and go if needed.
> 
> Please update this changelog to fully describe the proposed userspace
> interfaces.  That means full pathnames and example output. 
> Understanding these interfaces is the most important part of reviewing
> this patchset, so this info should be prominent.
> 
> Also, this patchset appears to alter (or remove?) existing userspace
> interfaces?  If so, please clearly describe those alterations and also
> share your thinking on the back-compatibility issues.
> 
> Also, I'm not seeing any changes to Documentation/ in this patchset.
> How do we explain the interface to our users?
> 
> 
> From a quick read, it appears that the output will be something along
> the lines of:
> 
> user nice system irq softirq guest guest_nice wait nr_switches nr_running
> cpu0 nn nn nn nn nn ...
> cpu1 nn nn nn nn nn ...
> 
> which looks pretty terrible.  Apart from being very hard to read, it
> means that we can never remove fields.  A nicer and more extensible
> interface would be
> 
> cpu:0 nice:nn system:nn irq:nn ...
> 

Ok.

The actual output format is what matters the least to me, so I can
change to whatever pleases you guys.

I just don't see how it is that we can never remove fields. My very
motivation for adding a header in the first place was to give us the
ability to extend this.

For a next round, I will include a Documentation file as you requested.
I could, for instance, explicitly mention that people parsing this
should first query the first line, which constitutes a header.

The main advantage I see in this approach is that there is far less
data to be written when using a header. Although your way works, it means we
will write the strings "nice", "system", etc. once for every cpu. Quite a waste.
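For illustration, here is a minimal sketch of what the header-based file and
a consumer of it could look like. The field names are taken from the listing
quoted above; the path, the numbers and the parser itself are purely
hypothetical, not the final interface:

/* Hypothetical cpu.stat_percpu layout, assuming the header-first format
 * discussed in this thread:
 *
 *   user nice system irq softirq guest guest_nice wait nr_switches nr_running
 *   cpu0 100 0 40 1 2 0 0 7 1234 3
 *   cpu1  90 0 38 0 1 0 0 9 1190 2
 *
 * The parser reads the header exactly once and indexes every per-cpu row by
 * it, so fields can be added (or dropped) without breaking old consumers.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	/* mount point and group name are assumptions */
	FILE *f = fopen("/sys/fs/cgroup/cpu/mygroup/cpu.stat_percpu", "r");
	char header[512], line[512];
	char *fields[32];
	char *tok, *cpu, *val;
	int nfields = 0, i;

	if (!f || !fgets(header, sizeof(header), f))
		return 1;

	/* read the field names from the header once */
	for (tok = strtok(header, " \n"); tok && nfields < 32;
	     tok = strtok(NULL, " \n"))
		fields[nfields++] = tok;

	/* every following line is "cpuN <one value per header field>" */
	while (fgets(line, sizeof(line), f)) {
		cpu = strtok(line, " \n");
		for (i = 0; cpu && i < nfields; i++) {
			val = strtok(NULL, " \n");
			if (!val)
				break;
			printf("%s %s=%s\n", cpu, fields[i], val);
		}
	}
	fclose(f);
	return 0;
}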

> 


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/11] sched: introduce cgroup file stat_percpu
  2013-01-09 21:10       ` Glauber Costa
@ 2013-01-09 21:17         ` Andrew Morton
  -1 siblings, 0 replies; 62+ messages in thread
From: Andrew Morton @ 2013-01-09 21:17 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, linux-kernel, Tejun Heo, Peter Zijlstra, Paul Turner

On Thu, 10 Jan 2013 01:10:02 +0400
Glauber Costa <glommer@parallels.com> wrote:

> The main advantage I see in this approach is that there is far less
> data to be written when using a header. Although your way works, it means we
> will write the strings "nice", "system", etc. once for every cpu. Quite a waste.

Yes, overhead can be a significant issue with this type of interface. 
But we already incurred a massive overhead by using a human-readable
ascii interface.  If performance is an issue, perhaps the whole thing
should be grafted onto taskstats instead.  Or create a new
taskstats-like thing.

btw, a more typical interface would be

cat /.../cpu0
nice:nn
system:nn
irq:nn

- the traditional one-per-line name:value tuples.  But I'd assumed that
having a file per CPU would be awkward.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/11] sched: introduce cgroup file stat_percpu
  2013-01-09 21:17         ` Andrew Morton
@ 2013-01-09 21:27           ` Glauber Costa
  -1 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-09 21:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, linux-kernel, Tejun Heo, Peter Zijlstra, Paul Turner

On 01/10/2013 01:17 AM, Andrew Morton wrote:
> On Thu, 10 Jan 2013 01:10:02 +0400
> Glauber Costa <glommer@parallels.com> wrote:
> 
>> The main advantage I see in this approach is that there is far less
>> data to be written when using a header. Although your way works, it means we
>> will write the strings "nice", "system", etc. once for every cpu. Quite a waste.
> 
> Yes, overhead can be a significant issue with this type of interface. 
> But we already incurred a massive overhead by using a human-readable
> ascii interface.  If performance is an issue, perhaps the whole thing
> should be grafted onto taskstats instead.  Or create a new
> taskstats-like thing.

I think this would be a little alienish in the already alien world of
cgroups.

However, I was not so much talking about plain performance overhead as
measurable in milliseconds-to-parse, but rather just alluding to the fact
that we would be writing the same set of strings multiple times when a
header would do just fine.

This is the same method used for instance by slabinfo.

> 
> btw, a more typical interface would be
> 
> cat /.../cpu0
> nice:nn
> system:nn
> irq:nn
> 

Well, yes. But welcome to cgroups: directories have a meaning, so the
only way to organize stuff with plain files in the current hierarchy
is by filling it with files. As many files as we have cpus.

At this point you are certain to miss all the other files present in the
directory.


> - the traditional one-per-line name:value tuples.  But I'd assumed that
> having a file per CPU would be awkward.
> 
Indeed.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 03/11] cgroup, sched: let cpu serve the same files as cpuacct
@ 2013-01-14  8:34     ` Sha Zhengju
  0 siblings, 0 replies; 62+ messages in thread
From: Sha Zhengju @ 2013-01-14  8:34 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Peter Zijlstra, Michal Hocko, Kay Sievers,
	Lennart Poettering, Dave Jones, Ben Hutchings

On Wed, Jan 9, 2013 at 7:45 PM, Glauber Costa <glommer@parallels.com> wrote:
> From: Tejun Heo <tj@kernel.org>
>
> cpuacct being on a separate hierarchy is one of the main cgroup
> related complaints from scheduler side and the consensus seems to be
>
> * Allowing cpuacct to be a separate controller was a mistake.  In
>   general multiple controllers on the same type of resource should be
>   avoided, especially accounting-only ones.
>
> * Statistics provided by cpuacct are useful and should instead be
>   served by cpu.
>
> This patch makes cpu maintain and serve all cpuacct.* files and make
> cgroup core ignore cpuacct if it's co-mounted with cpu.  This is a
> step in deprecating cpuacct.  The next patch will allow disabling or
> dropping cpuacct without affecting userland too much.
>
> Note that this creates some discrepancies in /proc/cgroups and
> /proc/PID/cgroup.  The co-mounted cpuacct won't be reflected correctly
> there.  cpuacct will eventually be removed completely probably except
> for the statistics filenames and I'd like to keep the amount of
> compatbility hackery to minimum as much as possible.
>
> The cpu statistics implementation isn't optimized in any way.  It's
> mostly verbatim copy from cpuacct.  The goal is allowing quick
> disabling and removal of CONFIG_CGROUP_CPUACCT and creating a base on
> top of which cpu can implement proper optimization.
>
> [ glommer: don't call *_charge in stop_task.c ]
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Kay Sievers <kay.sievers@vrfy.org>
> Cc: Lennart Poettering <mzxreary@0pointer.de>
> Cc: Dave Jones <davej@redhat.com>
> Cc: Ben Hutchings <ben@decadent.org.uk>
> Cc: Paul Turner <pjt@google.com>
> ---
>  kernel/cgroup.c        |  13 ++++
>  kernel/sched/core.c    | 173 +++++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/cputime.c |  18 ++++-
>  kernel/sched/fair.c    |   1 +
>  kernel/sched/rt.c      |   1 +
>  kernel/sched/sched.h   |   7 ++
>  6 files changed, 212 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2f98398..0750669d 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1248,6 +1248,19 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>         /* Consistency checks */
>
>         /*
> +        * cpuacct is deprecated and cpu will serve the same stat files.
> +        * If co-mount with cpu is requested, ignore cpuacct.  Note that
> +        * this creates some discrepancies in /proc/cgroups and
> +        * /proc/PID/cgroup.
> +        *
> +        * https://lkml.org/lkml/2012/9/13/542
> +        */
> +#if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
> +       if ((opts->subsys_bits & (1 << cpu_cgroup_subsys_id)) &&
> +           (opts->subsys_bits & (1 << cpuacct_subsys_id)))
> +               opts->subsys_bits &= ~(1 << cpuacct_subsys_id);
> +#endif
> +       /*
>          * Option noprefix was introduced just for backward compatibility
>          * with the old cpuset, so we allow noprefix only if mounting just
>          * the cpuset subsystem.
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 257002c..6516694 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6811,6 +6811,7 @@ int in_sched_functions(unsigned long addr)
>  #ifdef CONFIG_CGROUP_SCHED
>  struct task_group root_task_group;
>  LIST_HEAD(task_groups);
> +static DEFINE_PER_CPU(u64, root_tg_cpuusage);
>  #endif
>
>  DECLARE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
> @@ -6869,6 +6870,8 @@ void __init sched_init(void)
>  #endif /* CONFIG_RT_GROUP_SCHED */
>
>  #ifdef CONFIG_CGROUP_SCHED
> +       root_task_group.cpustat = &kernel_cpustat;
> +       root_task_group.cpuusage = &root_tg_cpuusage;
>         list_add(&root_task_group.list, &task_groups);
>         INIT_LIST_HEAD(&root_task_group.children);
>         INIT_LIST_HEAD(&root_task_group.siblings);
> @@ -7152,6 +7155,8 @@ static void free_sched_group(struct task_group *tg)
>         free_fair_sched_group(tg);
>         free_rt_sched_group(tg);
>         autogroup_free(tg);
> +       free_percpu(tg->cpuusage);
> +       free_percpu(tg->cpustat);
>         kfree(tg);
>  }
>
> @@ -7165,6 +7170,11 @@ struct task_group *sched_create_group(struct task_group *parent)
>         if (!tg)
>                 return ERR_PTR(-ENOMEM);
>
> +       tg->cpuusage = alloc_percpu(u64);
> +       tg->cpustat = alloc_percpu(struct kernel_cpustat);
> +       if (!tg->cpuusage || !tg->cpustat)
> +               goto err;
> +
>         if (!alloc_fair_sched_group(tg, parent))
>                 goto err;
>
> @@ -7256,6 +7266,24 @@ void sched_move_task(struct task_struct *tsk)
>
>         task_rq_unlock(rq, tsk, &flags);
>  }
> +
> +void task_group_charge(struct task_struct *tsk, u64 cputime)
> +{
> +       struct task_group *tg;
> +       int cpu = task_cpu(tsk);
> +
> +       rcu_read_lock();
> +
> +       tg = container_of(task_subsys_state(tsk, cpu_cgroup_subsys_id),
> +                         struct task_group, css);
> +
> +       for (; tg; tg = tg->parent) {
> +               u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
> +               *cpuusage += cputime;
> +       }
> +
> +       rcu_read_unlock();
> +}
>  #endif /* CONFIG_CGROUP_SCHED */
>
>  #if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
> @@ -7612,6 +7640,134 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup *old_cgrp,
>         sched_move_task(task);
>  }
>
> +static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
> +{
> +       u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
> +       u64 data;
> +
> +#ifndef CONFIG_64BIT
> +       /*
> +        * Take rq->lock to make 64-bit read safe on 32-bit platforms.
> +        */
> +       raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> +       data = *cpuusage;
> +       raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> +#else
> +       data = *cpuusage;
> +#endif
> +
> +       return data;
> +}
> +
> +static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
> +{
> +       u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
> +
> +#ifndef CONFIG_64BIT
> +       /*
> +        * Take rq->lock to make 64-bit write safe on 32-bit platforms.
> +        */
> +       raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> +       *cpuusage = val;
> +       raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> +#else
> +       *cpuusage = val;
> +#endif
> +}
> +
> +/* return total cpu usage (in nanoseconds) of a group */
> +static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct task_group *tg;
> +       u64 totalcpuusage = 0;
> +       int i;
> +
> +       tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
> +                         struct task_group, css);
> +
> +       for_each_present_cpu(i)
> +               totalcpuusage += task_group_cpuusage_read(tg, i);
> +
> +       return totalcpuusage;
> +}
> +
> +static int cpucg_cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
> +                               u64 reset)
> +{
> +       struct task_group *tg;
> +       int err = 0;
> +       int i;
> +
> +       tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
> +                         struct task_group, css);
> +
> +       if (reset) {
> +               err = -EINVAL;
> +               goto out;
> +       }
> +
> +       for_each_present_cpu(i)
> +               task_group_cpuusage_write(tg, i, 0);
> +
> +out:
> +       return err;
> +}
> +
> +static int cpucg_percpu_seq_read(struct cgroup *cgrp, struct cftype *cft,
> +                                struct seq_file *m)
> +{
> +       struct task_group *tg;
> +       u64 percpu;
> +       int i;
> +
> +       tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
> +                         struct task_group, css);
> +
> +       for_each_present_cpu(i) {
> +               percpu = task_group_cpuusage_read(tg, i);
> +               seq_printf(m, "%llu ", (unsigned long long) percpu);
> +       }
> +       seq_printf(m, "\n");
> +       return 0;
> +}
> +
> +static const char *cpucg_stat_desc[] = {
> +       [CPUACCT_STAT_USER] = "user",
> +       [CPUACCT_STAT_SYSTEM] = "system",
> +};
> +
> +static int cpucg_stats_show(struct cgroup *cgrp, struct cftype *cft,
> +                           struct cgroup_map_cb *cb)
> +{
> +       struct task_group *tg;
> +       int cpu;
> +       s64 val = 0;
> +
> +       tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
> +                         struct task_group, css);
> +
> +       for_each_online_cpu(cpu) {
> +               struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
> +               val += kcpustat->cpustat[CPUTIME_USER];
> +               val += kcpustat->cpustat[CPUTIME_NICE];
> +       }
> +       val = cputime64_to_clock_t(val);
> +       cb->fill(cb, cpucg_stat_desc[CPUACCT_STAT_USER], val);
> +
> +       val = 0;
> +       for_each_online_cpu(cpu) {
> +               struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
> +               val += kcpustat->cpustat[CPUTIME_SYSTEM];
> +               val += kcpustat->cpustat[CPUTIME_IRQ];
> +               val += kcpustat->cpustat[CPUTIME_SOFTIRQ];
> +       }
> +
> +       val = cputime64_to_clock_t(val);
> +       cb->fill(cb, cpucg_stat_desc[CPUACCT_STAT_SYSTEM], val);
> +
> +       return 0;
> +}
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
>                                 u64 shareval)
> @@ -7918,6 +8074,23 @@ static struct cftype cpu_files[] = {
>                 .write_u64 = cpu_rt_period_write_uint,
>         },
>  #endif
> +       /* cpuacct.* which used to be served by a separate cpuacct controller */
> +       {
> +               .name = "cpuacct.usage",
> +               .flags = CFTYPE_NO_PREFIX,
> +               .read_u64 = cpucg_cpuusage_read,
> +               .write_u64 = cpucg_cpuusage_write,
> +       },
> +       {
> +               .name = "cpuacct.usage_percpu",
> +               .flags = CFTYPE_NO_PREFIX,
> +               .read_seq_string = cpucg_percpu_seq_read,
> +       },
> +       {
> +               .name = "cpuacct.stat",
> +               .flags = CFTYPE_NO_PREFIX,
> +               .read_map = cpucg_stats_show,
> +       },
>         { }     /* terminate */
>  };
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 293b202..a4fe53e 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -114,8 +114,10 @@ static int irqtime_account_si_update(void)
>  static inline void task_group_account_field(struct task_struct *p, int index,
>                                             u64 tmp)
>  {
> +#ifdef CONFIG_CGROUP_SCHED
> +       struct task_group *tg;
> +#endif
>  #ifdef CONFIG_CGROUP_CPUACCT
> -       struct kernel_cpustat *kcpustat;
>         struct cpuacct *ca;
>  #endif
>         /*
> @@ -125,7 +127,19 @@ static inline void task_group_account_field(struct task_struct *p, int index,
>          *
>          */
>         __get_cpu_var(kernel_cpustat).cpustat[index] += tmp;
> +#ifdef CONFIG_CGROUP_SCHED
> +       rcu_read_lock();
> +       tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
> +                         struct task_group, css);
>
> +       while (tg && (tg != &root_task_group)) {
> +               struct kernel_cpustat *kcpustat = this_cpu_ptr(tg->cpustat);
> +
> +               kcpustat->cpustat[index] += tmp;
> +               tg = tg->parent;
> +       }
> +       rcu_read_unlock();
> +#endif
>  #ifdef CONFIG_CGROUP_CPUACCT
>         if (unlikely(!cpuacct_subsys.active))
>                 return;
> @@ -133,6 +147,8 @@ static inline void task_group_account_field(struct task_struct *p, int index,
>         rcu_read_lock();
>         ca = task_ca(p);
>         while (ca && (ca != &root_cpuacct)) {
> +               struct kernel_cpustat *kcpustat = this_cpu_ptr(ca->cpustat);
> +
>                 kcpustat = this_cpu_ptr(ca->cpustat);
Is this reassignment unnecessary?


>                 kcpustat->cpustat[index] += tmp;
>                 ca = parent_ca(ca);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5eea870..c15bc92 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -707,6 +707,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
>                 struct task_struct *curtask = task_of(curr);
>
>                 trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
> +               task_group_charge(curtask, delta_exec);
>                 cpuacct_charge(curtask, delta_exec);
>                 account_group_exec_runtime(curtask, delta_exec);
>         }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 418feb0..f7e05d87 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -935,6 +935,7 @@ static void update_curr_rt(struct rq *rq)
>         account_group_exec_runtime(curr, delta_exec);
>
>         curr->se.exec_start = rq->clock_task;
> +       task_group_charge(curr, delta_exec);
>         cpuacct_charge(curr, delta_exec);
>
>         sched_rt_avg_update(rq, delta_exec);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index fc88644..84a339d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -104,6 +104,10 @@ struct cfs_bandwidth {
>  struct task_group {
>         struct cgroup_subsys_state css;
>
> +       /* statistics */
> +       u64 __percpu *cpuusage;
> +       struct kernel_cpustat __percpu *cpustat;
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         /* schedulable entities of this group on each cpu */
>         struct sched_entity **se;
> @@ -590,6 +594,8 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
>  #endif
>  }
>
> +extern void task_group_charge(struct task_struct *tsk, u64 cputime);
> +
>  #else /* CONFIG_CGROUP_SCHED */
>
>  static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
> @@ -597,6 +603,7 @@ static inline struct task_group *task_group(struct task_struct *p)
>  {
>         return NULL;
>  }
> +static inline void task_group_charge(struct task_struct *tsk, u64 cputime) { }
>
>  #endif /* CONFIG_CGROUP_SCHED */
>
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 03/11] cgroup, sched: let cpu serve the same files as cpuacct
@ 2013-01-14 14:55       ` Glauber Costa
  0 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-14 14:55 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: cgroups, linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Peter Zijlstra, Michal Hocko, Kay Sievers,
	Lennart Poettering, Dave Jones, Ben Hutchings

On 01/14/2013 12:34 AM, Sha Zhengju wrote:
>> +               struct kernel_cpustat *kcpustat = this_cpu_ptr(ca->cpustat);
>> > +
>> >                 kcpustat = this_cpu_ptr(ca->cpustat);
> Is this reassignment unnecessary?
> 
> 
No.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 03/11] cgroup, sched: let cpu serve the same files as cpuacct
@ 2013-01-15 10:19         ` Sha Zhengju
  0 siblings, 0 replies; 62+ messages in thread
From: Sha Zhengju @ 2013-01-15 10:19 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Peter Zijlstra, Michal Hocko, Kay Sievers,
	Lennart Poettering, Dave Jones, Ben Hutchings

On Mon, Jan 14, 2013 at 10:55 PM, Glauber Costa <glommer@parallels.com> wrote:
> On 01/14/2013 12:34 AM, Sha Zhengju wrote:
>>> +               struct kernel_cpustat *kcpustat = this_cpu_ptr(ca->cpustat);
>>> > +
>>> >                 kcpustat = this_cpu_ptr(ca->cpustat);
>> Is this reassignment unnecessary?
>>
>>
> No.
>

No?  No!

In task_group_account_field(), the following two hunks have similar
behavior but different code, so one of them must contain a mistake.

Hunk #1:
+#ifdef CONFIG_CGROUP_SCHED
+       rcu_read_lock();
+       tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
+                         struct task_group, css);

+       while (tg && (tg != &root_task_group)) {
+               struct kernel_cpustat *kcpustat = this_cpu_ptr(tg->cpustat);   **HERE**
+
+               kcpustat->cpustat[index] += tmp;
+               tg = tg->parent;
+       }
+       rcu_read_unlock();
+#endif

Hunk #2:
#ifdef CONFIG_CGROUP_CPUACCT
        if (unlikely(!cpuacct_subsys.active))
                return;

        rcu_read_lock();
        ca = task_ca(p);
        while (ca && (ca != &root_cpuacct)) {
+               struct kernel_cpustat *kcpustat = this_cpu_ptr(ca->cpustat); **HERE**
+
                kcpustat = this_cpu_ptr(ca->cpustat);   **HERE, which is unnecessary**
                kcpustat->cpustat[index] += tmp;
                ca = parent_ca(ca);
        }
        rcu_read_unlock();
#endif


Also, you can prove it with the following test case:
#include <stdio.h>
#include <unistd.h>	/* for sleep() */

int main(void)
{
        int i = 0;
        int array[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int *index = &array[0];

        while (i < 10) {
                /* re-initialized on every iteration, just like kcpustat above */
                int *ptr = index;

                printf("ptr=%d  %p, index = %d\n", *ptr, (void *)ptr, *index);
                index++;
                i++;
                sleep(1);
        }

        return 0;
}

-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 03/11] cgroup, sched: let cpu serve the same files as cpuacct
@ 2013-01-15 17:52           ` Glauber Costa
  0 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-15 17:52 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: cgroups, linux-kernel, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Peter Zijlstra, Michal Hocko, Kay Sievers,
	Lennart Poettering, Dave Jones, Ben Hutchings

On 01/15/2013 02:19 AM, Sha Zhengju wrote:
> On Mon, Jan 14, 2013 at 10:55 PM, Glauber Costa <glommer@parallels.com> wrote:
>> On 01/14/2013 12:34 AM, Sha Zhengju wrote:
>>>> +               struct kernel_cpustat *kcpustat = this_cpu_ptr(ca->cpustat);
>>>>> +
>>>>>                 kcpustat = this_cpu_ptr(ca->cpustat);
>>> Is this reassignment unnecessary?
>>>
>>>
>> No.
>>
> 
> No?  No!
> 
I misread your question as "is this reassignment necessary?", for
which the answer is no: there is no reason to do it, it is just a
bad chunk.
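
In other words, the cpuacct loop should presumably end up mirroring the
CONFIG_CGROUP_SCHED hunk, with the redundant assignment simply dropped
(a sketch only, not the final patch):

        rcu_read_lock();
        ca = task_ca(p);
        while (ca && (ca != &root_cpuacct)) {
                struct kernel_cpustat *kcpustat = this_cpu_ptr(ca->cpustat);

                kcpustat->cpustat[index] += tmp;
                ca = parent_ca(ca);
        }
        rcu_read_unlock();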


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
  2013-01-09 11:45 ` Glauber Costa
                   ` (12 preceding siblings ...)
  (?)
@ 2013-01-16  0:33 ` Colin Cross
  2013-01-21 12:14     ` Glauber Costa
  -1 siblings, 1 reply; 62+ messages in thread
From: Colin Cross @ 2013-01-16  0:33 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups, lkml, Andrew Morton, Tejun Heo, Peter Zijlstra, Paul Turner

On Wed, Jan 9, 2013 at 3:45 AM, Glauber Costa <glommer@parallels.com> wrote:
> [ update: I thought I posted this already before leaving for holidays. However,
>   now that I am checking for replies, I can't find nor replies nor the original
>   mail in my boxes or archives. I am posting again for safety sake, but sorry
>   you are getting this twice by any chance ]
>
> Hi all,
>
> This is an attempt to provide userspace with enough information to reconstruct
> per-container version of files like "/proc/stat". In particular, we are
> interested in knowing the per-cgroup slices of user time, system time, wait
> time, number of processes, and a variety of statistics.
>
> This task is made more complicated by the fact that multiple controllers are
> involved in collecting those statistics: cpu and cpuacct. So the first thing I
> am doing here, is ressurecting Tejun's patches that aim at deprecating cpuacct.
>
> This is one of the major differences from earlier attempts: all data is provided
> by the cpu controller, resulting in greater simplicity.

Android userspace is currently using both cpu and cpuacct, and not
co-mounting them.  They are used for fundamentally different purposes,
such that creating a single hierarchy for both of them while maintaining
the existing behavior is not possible.

We use the cpu cgroup primarily as a priority container.  A simple
view is that each thread is assigned to a foreground cgroup when it is
user-visible, and a background cgroup when it is not.  The foreground
cgroup is assigned a significantly higher cpu.shares value such that
when each group is fully loaded the background group will get 5% and
the foreground group will get 95%.

We use the cpuacct cgroup to measure cpu usage per uid, primarily to
estimate one cause of battery usage.  Each uid gets a cgroup, and when
spawning a task for a new uid we put it in the appropriate cgroup.

We could create a new uid cgroup for cpuacct inside the foreground and
background cgroups used for scheduling, but that would drastically
change the way scheduling works when multiple uids have active
threads.  With separate cpu and cpuacct mounts, every active
foreground thread will get equal cpu time.  With co-mounted cpu and
cpuacct cgroups, cpu time will be shared between each accounting
group, and then sub-shared inside that group.

A concrete example:
Two uids, 1 and 2.  Uid 1 has one thread A, uid 2 has two threads B
and C.  All threads are foreground and running continuously.

With separate cpu and cpuacct mounts, we have:
/cpu/foreground/tasks:
A
B
C
/cpuacct/uid/1/tasks:
A
/cpuacct/uid/2/tasks:
B
C

A, B, and C each will get 33% of the cpu time.

With co-mounted cpu and cpuacct mounts:
/cpu/foreground/1/tasks:
A
/cpu/foreground/2/tasks
B
C

A will get 50% of the cpu time; B and C will each get 25% of the cpu time.
I don't see any way to add new subgroups for accounting without
partitioning the cpu time for each subgroup.
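
For reference, the separate-mount setup above looks roughly like this in
shell terms (mount points and shares values are simplified, not the exact
ones Android uses):

    # separate hierarchies: one for scheduling policy, one for accounting
    mount -t cgroup -o cpu none /dev/cpuctl
    mount -t cgroup -o cpuacct none /acct

    mkdir /dev/cpuctl/foreground /dev/cpuctl/background
    echo 950 > /dev/cpuctl/foreground/cpu.shares   # ~95% when both are busy
    echo 50  > /dev/cpuctl/background/cpu.shares   # ~5% when both are busy

    mkdir -p /acct/uid/1 /acct/uid/2
    echo $TID_A > /dev/cpuctl/foreground/tasks     # scheduling group
    echo $TID_A > /acct/uid/1/tasks                # accounting group
    cat /acct/uid/1/cpuacct.usage                  # per-uid cpu time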

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
  2013-01-16  0:33 ` Colin Cross
@ 2013-01-21 12:14     ` Glauber Costa
  0 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-21 12:14 UTC (permalink / raw)
  To: Colin Cross
  Cc: cgroups, lkml, Andrew Morton, Tejun Heo, Peter Zijlstra,
	Paul Turner, Tejun Heo

On 01/16/2013 04:33 AM, Colin Cross wrote:
> On Wed, Jan 9, 2013 at 3:45 AM, Glauber Costa <glommer@parallels.com> wrote:
>> [ update: I thought I posted this already before leaving for holidays. However,
>>   now that I am checking for replies, I can't find nor replies nor the original
>>   mail in my boxes or archives. I am posting again for safety sake, but sorry
>>   you are getting this twice by any chance ]
>>
>> Hi all,
>>
>> This is an attempt to provide userspace with enough information to reconstruct
>> per-container version of files like "/proc/stat". In particular, we are
>> interested in knowing the per-cgroup slices of user time, system time, wait
>> time, number of processes, and a variety of statistics.
>>
>> This task is made more complicated by the fact that multiple controllers are
>> involved in collecting those statistics: cpu and cpuacct. So the first thing I
>> am doing here, is ressurecting Tejun's patches that aim at deprecating cpuacct.
>>
>> This is one of the major differences from earlier attempts: all data is provided
>> by the cpu controller, resulting in greater simplicity.
> 
> Android userspace is currently using both cpu and cpuacct, and not
> co-mounting them.  They are used for fundamentally different uses such
> that creating a single hierarchy for both of them while maintaining
> the existing behavior is not possible.
> 
> We use the cpu cgroup primarily as a priority container.  A simple
> view is that each thread is assigned to a foreground cgroup when it is
> user-visible, and a background cgroup when it is not.  The foreground
> cgroup is assigned a significantly higher cpu.shares value such that
> when each group is fully loaded the background group will get 5% and
> the foreground group will get 95%.
> 
> We use the cpuacct cgroup to measure cpu usage per uid, primarily to
> estimate one cause of battery usage.  Each uid gets a cgroup, and when
> spawning a task for a new uid we put it in the appropriate cgroup.

As we are all in a way sons of Linus the Great, the fact that you have
this use case should by itself be a reason for us not to deprecate it.

I still view this, however, as an uncommon use case. And from the
scheduler PoV, we still have all the duplicate hierarchy walks. So
assuming we carry on all the changes in this patchset except the
deprecation, would that be okay for you?

This way we could take steps to make sure the scheduler codepaths for
cpuacct are not taken during normal comounted operation, and you could
still have your setup unchanged.

Tejun, any words here?


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
@ 2013-01-23  1:02       ` Tejun Heo
  0 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-01-23  1:02 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Colin Cross, cgroups, lkml, Andrew Morton, Peter Zijlstra, Paul Turner

Hello,

On Mon, Jan 21, 2013 at 04:14:27PM +0400, Glauber Costa wrote:
> > Android userspace is currently using both cpu and cpuacct, and not
> > co-mounting them.  They are used for fundamentally different uses such
> > that creating a single hierarchy for both of them while maintaining
> > the existing behavior is not possible.
> > 
> > We use the cpu cgroup primarily as a priority container.  A simple
> > view is that each thread is assigned to a foreground cgroup when it is
> > user-visible, and a background cgroup when it is not.  The foreground
> > cgroup is assigned a significantly higher cpu.shares value such that
> > when each group is fully loaded the background group will get 5% and
> > the foreground group will get 95%.
> > 
> > We use the cpuacct cgroup to measure cpu usage per uid, primarily to
> > estimate one cause of battery usage.  Each uid gets a cgroup, and when
> > spawning a task for a new uid we put it in the appropriate cgroup.
> 
> As we are all in a way sons of Linus the Great, the fact that you have
> this usecase should be by itself a reason for us not to deprecate it.
> 
> I still view this, however, as a not common use case. And from the
> scheduler PoV, we still have all the duplicate hierarchy walks. So
> assuming we would carry on all the changes in this patchset, except the
> deprecation, would it be okay for you?
> 
> This way we could take steps to make sure the scheduler codepaths for
> cpuacct are not taking during normal comounted operation, and you could
> still have your setup unchanged.
> 
> Tejun, any words here?

I think the only thing we can do is keep cpuacct around.  We can
still optimize comounted cpu and cpuacct as the usual case.  That
said, I'd really like to avoid growing new use cases for separate
hierarchies for cpu and cpuacct (well, any controller actually).
Having multiple hierarchies is fundamentally broken in that we can't
say whether a given resource belongs to a certain cgroup independently
of the current task, and we're definitely moving towards a unified
hierarchy.

We are not gonna break multiple hierarchies, but we won't go the extra
mile to optimize or enable new features for them, so it would be best
to move away from them.

Maybe we can generate a warning message on separate mounts?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
@ 2013-01-23  1:53         ` Colin Cross
  0 siblings, 0 replies; 62+ messages in thread
From: Colin Cross @ 2013-01-23  1:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, cgroups, lkml, Andrew Morton, Peter Zijlstra, Paul Turner

On Tue, Jan 22, 2013 at 5:02 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Mon, Jan 21, 2013 at 04:14:27PM +0400, Glauber Costa wrote:
>> > Android userspace is currently using both cpu and cpuacct, and not
>> > co-mounting them.  They are used for fundamentally different uses such
>> > that creating a single hierarchy for both of them while maintaining
>> > the existing behavior is not possible.
>> >
>> > We use the cpu cgroup primarily as a priority container.  A simple
>> > view is that each thread is assigned to a foreground cgroup when it is
>> > user-visible, and a background cgroup when it is not.  The foreground
>> > cgroup is assigned a significantly higher cpu.shares value such that
>> > when each group is fully loaded the background group will get 5% and
>> > the foreground group will get 95%.
>> >
>> > We use the cpuacct cgroup to measure cpu usage per uid, primarily to
>> > estimate one cause of battery usage.  Each uid gets a cgroup, and when
>> > spawning a task for a new uid we put it in the appropriate cgroup.
>>
>> As we are all in a way sons of Linus the Great, the fact that you have
>> this usecase should be by itself a reason for us not to deprecate it.
>>
>> I still view this, however, as a not common use case. And from the
>> scheduler PoV, we still have all the duplicate hierarchy walks. So
>> assuming we would carry on all the changes in this patchset, except the
>> deprecation, would it be okay for you?
>>
>> This way we could take steps to make sure the scheduler codepaths for
>> cpuacct are not taking during normal comounted operation, and you could
>> still have your setup unchanged.
>>
>> Tejun, any words here?
>
> I think the only thing we can do is keeping cpuacct around.  We can
> still optimize comounted cpu and cpuacct as the usual case.  That
> said, I'd really like to avoid growing new use cases for separate
> hierarchies for cpu and cpuacct (well, any controller actually).
> Having multiple hierarchies is fundamentally broken in that we can't
> say whether a given resource belongs to certain cgroup independently
> from the current task, and we're definitnely moving towards unified
> hierarchy.

I understand why it makes sense from a code perspective to combine cpu
and cpuacct, but by combining them you are enforcing a strange
requirement: to measure the cpu usage of a group of processes, you
force them to be treated as a single scheduling entity by their parent
group, effectively splitting their time as if they were a single task.
That doesn't make any sense to me.

> We are not gonna break multiple hierarchies but won't go extra miles
> to optimize or enable new features on it, so it would be best to move
> away from it.

I don't see how I can move away from it with the current design.

> Maybe we can generate a warning message on separate mounts?
>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
@ 2013-01-23  8:12           ` Glauber Costa
  0 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-23  8:12 UTC (permalink / raw)
  To: Colin Cross
  Cc: Tejun Heo, cgroups, lkml, Andrew Morton, Peter Zijlstra, Paul Turner

On 01/23/2013 05:53 AM, Colin Cross wrote:
> On Tue, Jan 22, 2013 at 5:02 PM, Tejun Heo <tj@kernel.org> wrote:
>> Hello,
>>
>> On Mon, Jan 21, 2013 at 04:14:27PM +0400, Glauber Costa wrote:
>>>> Android userspace is currently using both cpu and cpuacct, and not
>>>> co-mounting them.  They are used for fundamentally different uses such
>>>> that creating a single hierarchy for both of them while maintaining
>>>> the existing behavior is not possible.
>>>>
>>>> We use the cpu cgroup primarily as a priority container.  A simple
>>>> view is that each thread is assigned to a foreground cgroup when it is
>>>> user-visible, and a background cgroup when it is not.  The foreground
>>>> cgroup is assigned a significantly higher cpu.shares value such that
>>>> when each group is fully loaded the background group will get 5% and
>>>> the foreground group will get 95%.
>>>>
>>>> We use the cpuacct cgroup to measure cpu usage per uid, primarily to
>>>> estimate one cause of battery usage.  Each uid gets a cgroup, and when
>>>> spawning a task for a new uid we put it in the appropriate cgroup.
>>>
>>> As we are all in a way sons of Linus the Great, the fact that you have
>>> this usecase should be by itself a reason for us not to deprecate it.
>>>
>>> I still view this, however, as a not common use case. And from the
>>> scheduler PoV, we still have all the duplicate hierarchy walks. So
>>> assuming we would carry on all the changes in this patchset, except the
>>> deprecation, would it be okay for you?
>>>
>>> This way we could take steps to make sure the scheduler codepaths for
>>> cpuacct are not taking during normal comounted operation, and you could
>>> still have your setup unchanged.
>>>
>>> Tejun, any words here?
>>
>> I think the only thing we can do is keeping cpuacct around.  We can
>> still optimize comounted cpu and cpuacct as the usual case.  That
>> said, I'd really like to avoid growing new use cases for separate
>> hierarchies for cpu and cpuacct (well, any controller actually).
>> Having multiple hierarchies is fundamentally broken in that we can't
>> say whether a given resource belongs to certain cgroup independently
>> from the current task, and we're definitnely moving towards unified
>> hierarchy.
> 
> I understand why it makes sense from a code perspective to combine cpu
> and cpuacct, but by combining them you are enforcing a strange
> requirement that to measure the cpu usage of a group of processes you
> force them to be treated as a single scheduling entity by their parent
> group, effectively splitting their time as if they were a single task.
>  That doesn't make any sense to me.
> 
That is a bit backwards.

The question is not whether it makes sense to enforce that tasks having
their cputime measured need to be grouped for scheduling purposes, but
rather whether it makes sense to collect timing information collectively
for something that is not a scheduling entity.

The fact that you can do it today is an artifact of the way cgroups
were implemented in the first place. If controllers had been bound to a
single hierarchy from the very beginning, I really doubt you would have
had any luck convincing people that allowing separate hierarchy grouping
would be necessary for this.

Again, all that said, now that I survived 2012, I would like to be alive
next year as well. And if we break your use case, Linus will kill us. So
we don't plan to do it.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/11] sched: introduce cgroup file stat_percpu
@ 2013-01-23 14:20       ` Glauber Costa
  0 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-23 14:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, linux-kernel, Tejun Heo, Peter Zijlstra, Paul Turner,
	Randy Dunlap

[-- Attachment #1: Type: text/plain, Size: 581 bytes --]

On 01/10/2013 12:42 AM, Andrew Morton wrote:
> Also, I'm not seeing any changes to Docmentation/ in this patchset. 
> How do we explain the interface to our users?

There is little point in adding Documentation for the new file alone, since
the cpu cgroup itself is not documented. I took the liberty of writing a
baseline myself, attached here, to provide a starting point for the upcoming
changes. It would be very nice if you guys could review the file as-is,
since it would save me at least one patchset iteration.

When the contents are settled, I intend to document the new file there.

Thanks.


[-- Attachment #2: cpu.txt --]
[-- Type: text/plain, Size: 3599 bytes --]

CPU Controller
--------------

The CPU controller is responsible for grouping tasks together that will be
viewed by the scheduler as a single unit. The CFS scheduler will first divide
CPU time equally between all entities in the same level, and then proceed by
doing the same in the next level. Basic use cases for that are described in the
main cgroup documentation file, cgroups.txt.

Users of this functionality should be aware that deep hierarchies will of
course impose scheduler overhead, since the scheduler will have to take extra
steps and look up additional data structures to make its final decision.

Through the CPU controller, the scheduler is also able to cap the CPU
utilization of a particular group. This is particularly useful in environments
in which CPU is paid for by the hour, and one values predictability over
performance.

CPU Accounting
--------------

The CPU cgroup will also provide additional files under the prefix "cpuacct".
Those files provide accounting statistics and were previously provided by the
separate cpuacct controller. Although the cpuacct controller will still be kept
around for compatibility reasons, its usage is discouraged. If both the CPU and
cpuacct controllers are present in the system, distributors are encouraged to
always mount them together.

Files
-----

The CPU controller exposes the following files to the user:

 - cpu.shares: The weight of this group relative to its siblings, used by the
   CFS scheduler to distribute CPU time proportionally among groups at the
   same level. The default value is 1024.

 - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for
 bandwidth decisions. This defaults to 100000us or 100ms. Larger periods will
 improve throughput at the expense of latency, since the scheduler will be able
 to sustain a cpu-bound workload for longer. The opposite is true for smaller
 periods. Note that this only affects non-RT tasks that are scheduled by the
 CFS scheduler.

- cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us
  for which the current group will be allowed to run. For instance, if it is
  set to half of cfs_period_us, the cgroup will only be able to run for at most
  50 % of the time. One should note that this represents aggregate time over
  all CPUs in the system. Therefore, in order to allow full usage of two CPUs,
  for instance, one should set this value to twice the value of cfs_period_us
  (see the example at the end of this file).

- cpu.stat: statistics about the bandwidth controls. No data will be presented
  if cpu.cfs_quota_us is not set. The file presents three
  numbers:
	nr_periods: how many full periods have elapsed.
	nr_throttled: number of times the group exhausted its allowed bandwidth.
	throttled_time: total time the tasks were not run due to being over quota.

 - cpu.rt_runtime_us and cpu.rt_period_us: Those files are the RT-task
   analogues of the CFS files cfs_quota_us and cfs_period_us. One important
   difference, though, is that while the cfs quotas are upper bounds that
   won't necessarily be met, the rt runtimes form a stricter guarantee.
   Therefore, no overlap is allowed. One implication is that, given a
   hierarchy with multiple children, the sum of all children's rt_runtime_us
   may not exceed the parent's runtime. Also, an rt_runtime_us of 0 means
   that no rt tasks can ever run in this cgroup.

 - cpuacct.usage: The aggregate CPU time, in microseconds, consumed by all tasks
   in this group.

 - cpuacct.usage_percpu: The CPU time, in microseconds, consumed by all tasks in
   this group, separated by CPU. The format is a space-separated array of time
   values, one for each present CPU.

 - cpuacct.stat: aggregate user and system time consumed by tasks in this group.
   The format is user: x\nsystem: y.
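
Example
-------

A short usage sketch. The mount point, group name and values below are only
illustrative:

    # mount the cpu controller and create a group
    mkdir -p /sys/fs/cgroup/cpu
    mount -t cgroup -o cpu none /sys/fs/cgroup/cpu
    mkdir /sys/fs/cgroup/cpu/mygroup

    # half the default weight (the default cpu.shares value is 1024)
    echo 512 > /sys/fs/cgroup/cpu/mygroup/cpu.shares

    # cap the group at the equivalent of one full CPU:
    # a quota equal to the period allows one CPU's worth of time per period
    echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us
    echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us

    # move a task into the group and read back its statistics
    echo $PID > /sys/fs/cgroup/cpu/mygroup/tasks
    cat /sys/fs/cgroup/cpu/mygroup/cpu.stat
    cat /sys/fs/cgroup/cpu/mygroup/cpuacct.usage_percpu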


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 11/11] sched: introduce cgroup file stat_percpu
@ 2013-01-23 14:26             ` Glauber Costa
  0 siblings, 0 replies; 62+ messages in thread
From: Glauber Costa @ 2013-01-23 14:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, linux-kernel, Tejun Heo, Peter Zijlstra, Paul Turner

On 01/10/2013 01:27 AM, Glauber Costa wrote:
> On 01/10/2013 01:17 AM, Andrew Morton wrote:
>> On Thu, 10 Jan 2013 01:10:02 +0400
>> Glauber Costa <glommer@parallels.com> wrote:
>>
>>> The main advantage I see in this approach, is that there is way less
>>> data to be written using a header. Although your way works, it means we
>>> will write the strings "nice", "system", etc. #cpu times. Quite a waste.
>>
>> Yes, overhead can be a significant issue with this type of interface. 
>> But we already incurred a massive overhead by using a human-readable
>> ascii interface.  If performance is an issue, perhaps the whole thing
>> should be grafted onto taskstats instead.  Or create a new
>> taskstats-like thing.
> 
> I think this would be a little alienish in the already alien world of
> cgroups.
> 
> However, I was not so much talking about plain performance overhead as
> measurable in miliseconds-to-parse, but rather just alluding to the fact
> that we would be writing the same set of strings multiple times when a
> header would do just fine.
> 
> This is the same method used for instance by slabinfo.
> 
>>
>> btw, a more typical interface would be
>>
>> cat /.../cpu0
>> nice:nn
>> system:nn
>> irq:nn
>>
> 
> Well, yes. But welcome to cgroups: directories have a meaning, so the
> only way to organize stuff is with plain files in the current hierarchy
> is by filling it with files. As many files as we have cpus.
> 
> At this point you are certain to miss all the other files present in the
> directory.
> 
> 
>> - the traditional one-per-line name:value tuples.  But I'd assumed that
>> having a file per CPU would be aawkward.
>>
> Indeed.
> 

Andrew,

Given my arguments above, which interface would you prefer for me to
settle on? I still don't see any problems with the header, especially
given the fact that it exists precisely to allow fields to come and go
if needed.
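
Just to illustrate what I mean by a header layout (field names and numbers
here are only illustrative, not necessarily what the patch writes):

    # cat cpu.stat_percpu
    cpu user nice system irq softirq guest idle iowait
    cpu0 4215 17 1030 5 12 0 88321 402
    cpu1 3987 9 1122 4 10 0 89455 377

so the field names are written once, instead of once for every cpu as they
would be with a name:value pair per cpu.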



^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
@ 2013-01-23 16:56           ` Tejun Heo
  0 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-01-23 16:56 UTC (permalink / raw)
  To: Colin Cross
  Cc: Glauber Costa, cgroups, lkml, Andrew Morton, Peter Zijlstra, Paul Turner

Hello, Colin.

On Tue, Jan 22, 2013 at 05:53:59PM -0800, Colin Cross wrote:
> I understand why it makes sense from a code perspective to combine cpu
> and cpuacct, but by combining them you are enforcing a strange
> requirement that to measure the cpu usage of a group of processes you

Well, "strange" is in the eye of the beholder.  The thing is that
cgroup, as its name suggests, is a facility to control and enforce
resources for groups of tasks.  As accounting is often a part of
resource control, it happens as part of it too, but at least I think
cpuacct becoming a separate controller wasn't a technically sound
choice, and I intend to stop the growth of usages outside resource control.

An over-arching theme of the problems in cgroup is having too much
unorganized flexibility, to the extent that it impedes the originally
intended goals.  The braindead hierarchy implementations make the
whole hierarchy completely meaningless.  Multiple hierarchies make it
impossible to tag and control resources in any sane way when a
resource crosses different resource, and thus controller,
boundaries.

So, well, that's the direction cgroup is headed: a narrower focus on
actual resource control, and actively shutting out misuses of cgroup as
a generic task grouping mechanism.

> force them to be treated as a single scheduling entity by their parent
> group, effectively splitting their time as if they were a single task.
>  That doesn't make any sense to me.


> > We are not gonna break multiple hierarchies but won't go extra miles
> > to optimize or enable new features on it, so it would be best to move
> > away from it.
> 
> I don't see how I can move away from it with the current design.

What I don't get is why you don't put each application into its own
cgroup and tune its config variables, which is the intended usage
anyway.  You say that that would make the scheduler not give more cpu
time to applications with more threads, but isn't that the right thing
to do?  Why does the number of threads an application uses have any
bearing on how much CPU time it gets?  One is an implementation detail
while the other is a policy decision.  Also, if you wanna factor
the number of threads into the policy decision for whatever reason,
you can easily do so by factoring that number into the decision,
right?  That way, at least the decision would be explicit.
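
For example, something as simple as this (made-up path and numbers, just to
illustrate "factoring it in"):

    # scale an app group's weight by the number of threads it was granted
    nthreads=$(wc -l < /sys/fs/cgroup/cpu/app_1234/tasks)
    echo $((1024 * nthreads)) > /sys/fs/cgroup/cpu/app_1234/cpu.shares

would make the weight an explicit policy choice instead of an accident of
how many threads happen to be runnable.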

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
  2013-01-23 16:56           ` Tejun Heo
  (?)
@ 2013-01-23 22:41           ` Colin Cross
  2013-01-23 23:06               ` Tejun Heo
  -1 siblings, 1 reply; 62+ messages in thread
From: Colin Cross @ 2013-01-23 22:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, cgroups, lkml, Andrew Morton, Peter Zijlstra, Paul Turner

On Wed, Jan 23, 2013 at 8:56 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Colin.
>
> On Tue, Jan 22, 2013 at 05:53:59PM -0800, Colin Cross wrote:
>> I understand why it makes sense from a code perspective to combine cpu
>> and cpuacct, but by combining them you are enforcing a strange
>> requirement that to measure the cpu usage of a group of processes you
>
> Well, "strange" is in the eyes of the beholder.  The thing is that
> cgroup, as its name suggests, is a facility to control and enforce
> resources to groups of tasks.  As accounting is often a part of
> resource control, it happens as part of it too, but at least I think
> cpuacct becoming a separate controller wasn't a technically sound
> choice and intend to stop growth of usages outside resource control.
>
> An over-arching theme of the problems in cgroup is having too much
> unorganized flexibility to the extent where it impededs the original
> intended goals.  The braindead hierarchy implementations make the
> whole hierarchy completely meaningless.  Multiple hierarchies make it
> impossible to tag and control resources in any sane way when a
> resource exists across different resource and thus controller
> boundaries.
>
> So, well, that's the direction cgroup is headed.  Narrower focus on
> actual resource control and actively shutting out misuses of cgroup as
> generic task grouping mechanism.
>
>> force them to be treated as a single scheduling entity by their parent
>> group, effectively splitting their time as if they were a single task.
>>  That doesn't make any sense to me.
>
>
>> > We are not gonna break multiple hierarchies but won't go extra miles
>> > to optimize or enable new features on it, so it would be best to move
>> > away from it.
>>
>> I don't see how I can move away from it with the current design.
>
> What I don't get is why you don't put each applications into their
> cgroups and tune their config variables, which is the intended usage
> anyway.  You say that that would make the scheduler not give more cpu
> time to applications with more threads, but isn't that the right thing
> to do?  Why does the number of threads an application uses have any
> bearing on how much CPU time it gets?  One is an implementation detail
> while the other is a policy decision.  Also, if you wanna factor in
> the number of threads into the policy decision for whatever reason,
> you can easily do so by factoring in that number into the decision,
> right?  That way, at least the decision would be explicit.

I think some of it is just historic: we previously did not group
application threads in the scheduler, so it would cause a change in
behavior if we started grouping them.  I will investigate switching to
a co-mounted hierarchy so hopefully you can deprecate cpuacct in the
future.

We can't factor the number of threads into the policy decision,
because it depends on how many threads are runnable at any time in any
particular application, and we have no way to track that.  It would
have to be a cgroup scheduler feature.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
@ 2013-01-23 23:06               ` Tejun Heo
  0 siblings, 0 replies; 62+ messages in thread
From: Tejun Heo @ 2013-01-23 23:06 UTC (permalink / raw)
  To: Colin Cross
  Cc: Glauber Costa, cgroups, lkml, Andrew Morton, Peter Zijlstra, Paul Turner

Hello, Colin.

On Wed, Jan 23, 2013 at 02:41:46PM -0800, Colin Cross wrote:
> I think some of it is just historic, we previously did not group
> application threads in the scheduler, so it would cause a change in
> behavior if we started grouping them.  I will investigate switching to
> a co-mounted hierarchy so hopefully you can deprecate cpuacct in the
> future.

Yeah, it's gonna be many years, if ever, before we can actually
deprecate cpuacct and multiple hierarchies, but it would be really nice
to move at least the popular uses away from them sooner rather than later.

Also, maybe I'm misunderstanding what you were saying, but isn't it the
case that only a single application is "foreground" in at least vanilla
android?  Maybe multi-window support is scheduled for future releases,
but it wouldn't count as a behavior change in that case, right?

At any rate, IMHO, it's simply better and more correct not to depend on
the number of threads in use as a measure of CPU resource
distribution.

> We can't factor the number of threads into the policy decision,
> because it depends on how many threads are runnable at any time in any
> particular application, and we have no way to track that.  It would
> have to be a cgroup scheduler feature.

My understanding of android is very limited, but the number of threads
in dalvik apps is controlled by the base system rather than by the
application itself, no?  If so, factoring that into scheduling params
shouldn't be difficult.  For native processes, if the number of
threads just *has* to be factored in some way, we can resort to
sampling.  That said, as native apps can easily thread-bomb their way
out of fairness, there are way more reasons to avoid basing the
resource policy decision on the number of threads in use.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [PATCH v5 00/11] per-cgroup cpu-stat
@ 2013-01-23 23:53                 ` Colin Cross
  0 siblings, 0 replies; 62+ messages in thread
From: Colin Cross @ 2013-01-23 23:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, cgroups, lkml, Andrew Morton, Peter Zijlstra, Paul Turner

On Wed, Jan 23, 2013 at 3:06 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Colin.
>
> On Wed, Jan 23, 2013 at 02:41:46PM -0800, Colin Cross wrote:
>> I think some of it is just historic, we previously did not group
>> application threads in the scheduler, so it would cause a change in
>> behavior if we started grouping them.  I will investigate switching to
>> a co-mounted hierarchy so hopefully you can deprecate cpuacct in the
>> future.
>
> Yeah, it's gonna be many years, if ever, before we can actually
> deprecate cpuacct and multiple hierarchies but it would be really nice
> to move at least popular uses away from them sooner than later.
>
> Also, maybe I'm misunderstanding what you were saying but isn't it the
> case that only single application is "foreground" in at least vanilla
> android?  Maybe multi-window support is scheduled for future releases
> but it wouldn't count as behavior change in that case, right?

Not exactly.  Only one "app" is ever foreground, but there are
numerous other app-like things in their own process containers that
the user is aware of, and that therefore get foreground resources.  For
example, a service that is playing music in the background may get
treated like a foreground app.  Also, a foreground app may spawn
threads and request that they run in the background, so a single
container for an app is not sufficient.

> At any rate, IMHO, it's simply the better and correct to not depend on
> the number of threads in use as a measure of CPU resource
> distribution.
>
>> We can't factor the number of threads into the policy decision,
>> because it depends on how many threads are runnable at any time in any
>> particular application, and we have no way to track that.  It would
>> have to be a cgroup scheduler feature.
>
> My understanding of android is very limited but the number of threads
> in dalvik apps are controlled by the base system rather than
> application itself, no?  If so, factoring that into scheduling params
> shouldn't be difficult.  For native processes, if the number of
> threads just *have* to be factored in some way, we can resort to
> sampling.  That said, as native apps can easily thread-bomb out of
> fairness, there are way more reasons to avoid basing the resource
> policy decision on the number of threads in use.

A pure-dalvik app will spawn threads through a controlled interface
that could count threads (but still could not count threads that are
runnable vs. sleeping), but any app can use JNI to link to a native
library and use pthreads directly if it wants to.  I agree per-app
containers may have some fairness advantages between multiple apps; I
just have to figure out what it will mean for apps vs. system
services, etc.
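
To make the fairness point concrete, here is a rough sketch under the
same illustrative assumptions (the cgroupfs paths and the shares value
are made up for the example): give every app its own cpu group with
equal cpu.shares, so a 20-thread app and a 2-thread app split a
contended CPU roughly evenly instead of 10:1 by thread count.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: give each app its own equally weighted cpu group. */
static int put_app_in_own_group(const char *group, pid_t pid)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/%s", group);
	mkdir(path, 0755);	/* ignore EEXIST for this sketch */

	snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/%s/cpu.shares", group);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "1024\n");	/* equal weight regardless of thread count */
	fclose(f);

	snprintf(path, sizeof(path), "/sys/fs/cgroup/cpu/%s/cgroup.procs", group);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", (int)pid);	/* move the whole process at once */
	fclose(f);
	return 0;
}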

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2013-01-24  0:01 UTC | newest]

Thread overview: 62+ messages
2013-01-09 11:45 [PATCH v5 00/11] per-cgroup cpu-stat Glauber Costa
2013-01-09 11:45 ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 01/11] don't call cpuacct_charge in stop_task.c Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 02/11] cgroup: implement CFTYPE_NO_PREFIX Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 03/11] cgroup, sched: let cpu serve the same files as cpuacct Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-14  8:34   ` Sha Zhengju
2013-01-14  8:34     ` Sha Zhengju
2013-01-14 14:55     ` Glauber Costa
2013-01-14 14:55       ` Glauber Costa
2013-01-15 10:19       ` Sha Zhengju
2013-01-15 10:19         ` Sha Zhengju
2013-01-15 17:52         ` Glauber Costa
2013-01-15 17:52           ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 04/11] cgroup, sched: deprecate cpuacct Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 05/11] sched: adjust exec_clock to use it as cpu usage metric Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 06/11] cpuacct: don't actually do anything Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 07/11] account guest time per-cgroup as well Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 08/11] sched: Push put_prev_task() into pick_next_task() Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 09/11] record per-cgroup number of context switches Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 10/11] sched: change nr_context_switches calculation Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 11:45 ` [PATCH v5 11/11] sched: introduce cgroup file stat_percpu Glauber Costa
2013-01-09 11:45   ` Glauber Costa
2013-01-09 20:42   ` Andrew Morton
2013-01-09 20:42     ` Andrew Morton
2013-01-09 21:10     ` Glauber Costa
2013-01-09 21:10       ` Glauber Costa
2013-01-09 21:17       ` Andrew Morton
2013-01-09 21:17         ` Andrew Morton
2013-01-09 21:27         ` Glauber Costa
2013-01-09 21:27           ` Glauber Costa
2013-01-23 14:26           ` Glauber Costa
2013-01-23 14:26             ` Glauber Costa
2013-01-23 14:20     ` Glauber Costa
2013-01-23 14:20       ` Glauber Costa
2013-01-09 14:41 ` [PATCH v5 00/11] per-cgroup cpu-stat Tejun Heo
2013-01-09 14:41   ` Tejun Heo
2013-01-16  0:33 ` Colin Cross
2013-01-21 12:14   ` Glauber Costa
2013-01-21 12:14     ` Glauber Costa
2013-01-23  1:02     ` Tejun Heo
2013-01-23  1:02       ` Tejun Heo
2013-01-23  1:53       ` Colin Cross
2013-01-23  1:53         ` Colin Cross
2013-01-23  8:12         ` Glauber Costa
2013-01-23  8:12           ` Glauber Costa
2013-01-23 16:56         ` Tejun Heo
2013-01-23 16:56           ` Tejun Heo
2013-01-23 22:41           ` Colin Cross
2013-01-23 23:06             ` Tejun Heo
2013-01-23 23:06               ` Tejun Heo
2013-01-23 23:53               ` Colin Cross
2013-01-23 23:53                 ` Colin Cross
