linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/5] forced comounts for cgroups.
@ 2012-09-04 14:18 Glauber Costa
  2012-09-04 14:18 ` [RFC 1/5] cgroup: allow some comounts to be forced Glauber Costa
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-04 14:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt, lennart,
	kay.sievers, tj

Hi,

As we have been extensively discussing, the cost and pain points for cgroups
come from many places. But at least one of those is the arbitrary nature of
hierarchies. Many people, including at least Tejun and me would like this to go
away altogether. Problem so far, is breaking compatiblity with existing setups

I am proposing here a default-n Kconfig option that will guarantee that the cpu
cgroups (for now) will be comounted. I started with them because the
cpu/cpuacct division is clearly the worst offender. Also, the default-n is here
so distributions will have time to adapt: Forcing this flag to be on without
userspace changes will just lead to cgroups failing to mount, which we don't
want.

Although I've tested it and it works, I haven't compile-tested all possible
config combinations. So this is mostly for your eyes. If this gets traction,
I'll submit it properly, along with any changes that you might require.

Thanks.

Glauber Costa (5):
  cgroup: allow some comounts to be forced.
  sched: adjust exec_clock to use it as cpu usage metric
  sched: do not call cpuacct_charge when cpu and cpuacct are comounted
  cpuacct: do not gather cpuacct statistics when not mounted
  sched: add cpusets to comounts list

 include/linux/cgroup.h |   6 ++
 init/Kconfig           |  23 ++++++++
 kernel/cgroup.c        |  29 +++++++++-
 kernel/cpuset.c        |   4 ++
 kernel/sched/core.c    | 149 +++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/rt.c      |   1 +
 kernel/sched/sched.h   |  20 ++++++-
 7 files changed, 220 insertions(+), 12 deletions(-)

-- 
1.7.11.4


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC 1/5] cgroup: allow some comounts to be forced.
  2012-09-04 14:18 [RFC 0/5] forced comounts for cgroups Glauber Costa
@ 2012-09-04 14:18 ` Glauber Costa
  2012-09-04 14:18 ` [RFC 2/5] sched: adjust exec_clock to use it as cpu usage metric Glauber Costa
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-04 14:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt, lennart,
	kay.sievers, tj, Glauber Costa

One of the pain points we have today with cgroups, is the excessive
flexibility coming from the fact that controllers can be mounted at
will, without any relationship with each other.

Although this is nice in principle, this comes with a cost that is not
always welcome in practice. The very fact of this being possible is
already enough to trigger those costs. We cannot assume a common
hierarchy between controllers, and then hierarchy walks have to be done
more than once. This happens in hotpaths as well.

This patch introduces a Kconfig option, default n, that will force some
controllers to be comounted. After some time, we may be able to
deprecate this mode of operation.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Jones <davej@redhat.com>
CC: Ben Hutchings <ben@decadent.org.uk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
CC: Lennart Poettering <lennart@poettering.net>
CC: Kay Sievers <kay.sievers@vrfy.org>
CC: Tejun Heo <tj@kernel.org>
---
 include/linux/cgroup.h |  6 ++++++
 init/Kconfig           |  4 ++++
 kernel/cgroup.c        | 29 ++++++++++++++++++++++++++++-
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d3f5fba..f986ad1 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -531,6 +531,12 @@ struct cgroup_subsys {
 
 	/* should be defined only by modular subsystems */
 	struct module *module;
+
+#ifdef CONFIG_CGROUP_FORCE_COMOUNT
+	/* List of groups that we must be comounted with */
+	int comounts;
+	int must_comount[3];
+#endif
 };
 
 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
diff --git a/init/Kconfig b/init/Kconfig
index f64f888..d7d693d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -680,6 +680,10 @@ config CGROUP_CPUACCT
 	  Provides a simple Resource Controller for monitoring the
 	  total CPU consumed by the tasks in a cgroup.
 
+config CGROUP_FORCE_COMOUNT
+	bool
+	default n
+
 config RESOURCE_COUNTERS
 	bool "Resource counters"
 	help
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b303dfc..137ac62 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1058,6 +1058,33 @@ static int rebind_subsystems(struct cgroupfs_root *root,
 	if (root->number_of_cgroups > 1)
 		return -EBUSY;
 
+#ifdef CONFIG_CGROUP_FORCE_COMOUNT
+	/*
+	 * Some subsystems should not be allowed to be freely mounted in
+	 * separate hierarchies. They may not be present, but if they are, they
+	 * should be together. For compatibility with older kernels, we'll allow
+	 * this to live inside a separate Kconfig option. Each subsys will be
+	 * able to tell us which other subsys it expects to be mounted with.
+	 *
+	 * We do a separate path for this, to avoid unwinding our modifications
+	 * in case of an error.
+	 */
+	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+		unsigned long bit = 1UL << i;
+		int j;
+
+		if (!(bit & added_bits))
+			continue;
+
+		for (j = 0; j < subsys[i]->comounts; j++) {
+			int comount_id = subsys[i]->must_comount[j];
+			struct cgroup_subsys *ss = subsys[comount_id];
+			if ((ss->root != &rootnode) && (ss->root != root))
+				return -EINVAL;
+		}
+	}
+#endif
+
 	/* Process each subsystem */
 	for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
 		struct cgroup_subsys *ss = subsys[i];
@@ -1634,7 +1661,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 			goto unlock_drop;
 
 		ret = rebind_subsystems(root, root->subsys_bits);
-		if (ret == -EBUSY) {
+		if ((ret == -EBUSY) || (ret == -EINVAL)) {
 			free_cg_links(&tmp_cg_links);
 			goto unlock_drop;
 		}
-- 
1.7.11.4


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC 2/5] sched: adjust exec_clock to use it as cpu usage metric
  2012-09-04 14:18 [RFC 0/5] forced comounts for cgroups Glauber Costa
  2012-09-04 14:18 ` [RFC 1/5] cgroup: allow some comounts to be forced Glauber Costa
@ 2012-09-04 14:18 ` Glauber Costa
  2012-09-04 14:18 ` [RFC 3/5] sched: do not call cpuacct_charge when cpu and cpuacct are comounted Glauber Costa
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-04 14:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt, lennart,
	kay.sievers, tj, Glauber Costa

exec_clock already provides per-group cpu usage metrics, and can be
reused by cpuacct in case cpu and cpuacct are comounted.

However, it is only provided by tasks in fair class. Doing the same for
rt is easy, and can be done in an already existing hierarchy loop. This
is an improvement over the independent hierarchy walk executed by
cpuacct.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Jones <davej@redhat.com>
CC: Ben Hutchings <ben@decadent.org.uk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
CC: Lennart Poettering <lennart@poettering.net>
CC: Kay Sievers <kay.sievers@vrfy.org>
CC: Tejun Heo <tj@kernel.org>
---
 kernel/sched/rt.c    | 1 +
 kernel/sched/sched.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 573e1ca..40ef6af 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -930,6 +930,7 @@ static void update_curr_rt(struct rq *rq)
 
 	for_each_sched_rt_entity(rt_se) {
 		rt_rq = rt_rq_of_se(rt_se);
+		schedstat_add(rt_rq, exec_clock, delta_exec);
 
 		if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
 			raw_spin_lock(&rt_rq->rt_runtime_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 55844f2..8da579d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -204,6 +204,7 @@ struct cfs_rq {
 	unsigned int nr_running, h_nr_running;
 
 	u64 exec_clock;
+	u64 prev_exec_clock;
 	u64 min_vruntime;
 #ifndef CONFIG_64BIT
 	u64 min_vruntime_copy;
@@ -295,6 +296,8 @@ struct rt_rq {
 	struct plist_head pushable_tasks;
 #endif
 	int rt_throttled;
+	u64 exec_clock;
+	u64 prev_exec_clock;
 	u64 rt_time;
 	u64 rt_runtime;
 	/* Nests inside the rq lock: */
-- 
1.7.11.4


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC 3/5] sched: do not call cpuacct_charge when cpu and cpuacct are comounted
  2012-09-04 14:18 [RFC 0/5] forced comounts for cgroups Glauber Costa
  2012-09-04 14:18 ` [RFC 1/5] cgroup: allow some comounts to be forced Glauber Costa
  2012-09-04 14:18 ` [RFC 2/5] sched: adjust exec_clock to use it as cpu usage metric Glauber Costa
@ 2012-09-04 14:18 ` Glauber Costa
  2012-09-04 14:18 ` [RFC 4/5] cpuacct: do not gather cpuacct statistics when not mounted Glauber Costa
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-04 14:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt, lennart,
	kay.sievers, tj, Glauber Costa

cpuacct_charge() incurs in some quite expensive operations to achieve
its measurement goal. To make matters worse, this cost is not constant,
but grows with the depth of the cgroup hierarchy tree. Also, all this
data is already available anyway in the scheduler core.

The fact that the cpuacct cgroup cannot be guaranteed to be mounted in
the same hierarchy as the scheduler core cgroup (cpu), forces us to go
gather them all again.

With the introduction of CONFIG_CGROUP_FORCE_COMOUNT_CPU, we will be
able to be absolutely sure that such a coupling exists. After that, the
hierarchy walks can be completely abandoned.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Jones <davej@redhat.com>
CC: Ben Hutchings <ben@decadent.org.uk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
CC: Lennart Poettering <lennart@poettering.net>
CC: Kay Sievers <kay.sievers@vrfy.org>
CC: Tejun Heo <tj@kernel.org>
---
 init/Kconfig         |  19 +++++++
 kernel/sched/core.c  | 141 +++++++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/sched.h |  14 ++++-
 3 files changed, 163 insertions(+), 11 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index d7d693d..694944e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -684,6 +684,25 @@ config CGROUP_FORCE_COMOUNT
 	bool
 	default n
 
+config CGROUP_FORCE_COMOUNT_CPU
+	bool "Enforce single hierarchy for the cpu related cgroups"
+	depends on CGROUP_SCHED || CPUSETS || CGROUP_CPUACCT
+	select SCHEDSTATS
+	select CGROUP_FORCE_COMOUNT
+	default n
+	help
+	  Throughout cgroup's life, it was always possible to mount the
+	  controllers in completely independent hierarchies. However, the
+	  costs incurred by allowing are considerably big. Hotpaths in the
+	  scheduler needs to call expensive hierarchy walks more than once in
+	  the same place just to account for the fact that multiple controllers
+	  can be mounted in different places.
+
+	  Setting this option will disallow cpu, cpuacct and cpuset to be
+	  mounted in different hierarchies. Distributions are highly encouraged
+	  to set this option and comount those groups.
+
+
 config RESOURCE_COUNTERS
 	bool "Resource counters"
 	help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 468bdd4..e46871d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8282,6 +8282,15 @@ static struct cftype cpu_files[] = {
 	{ }	/* terminate */
 };
 
+bool cpuacct_from_cpu;
+
+#ifdef CONFIG_CGROUP_FORCE_COMOUNT_CPU
+void cpu_cgroup_bind(struct cgroup *root)
+{
+	cpuacct_from_cpu = root->root == root_task_group.css.cgroup->root;
+#endif
+}
+
 struct cgroup_subsys cpu_cgroup_subsys = {
 	.name		= "cpu",
 	.create		= cpu_cgroup_create,
@@ -8291,6 +8300,11 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.exit		= cpu_cgroup_exit,
 	.subsys_id	= cpu_cgroup_subsys_id,
 	.base_cftypes	= cpu_files,
+#ifdef CONFIG_CGROUP_FORCE_COMOUNT_CPU
+	.comounts	= 1,
+	.must_comount	= { cpuacct_subsys_id, },
+	.bind		= cpu_cgroup_bind,
+#endif
 	.early_init	= 1,
 };
 
@@ -8345,8 +8359,102 @@ static void cpuacct_destroy(struct cgroup *cgrp)
 	kfree(ca);
 }
 
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
+#ifdef CONFIG_CGROUP_SCHED
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static struct cfs_rq *
+cpu_cgroup_cfs_rq(struct cgroup *cgrp, int cpu)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+
+	if (tg == &root_task_group)
+		return &cpu_rq(cpu)->cfs;
+
+	return tg->cfs_rq[cpu];
+}
+
+static void cpu_cgroup_update_cpuusage_cfs(struct cgroup *cgrp, int cpu)
+{
+	struct cfs_rq *cfs = cpu_cgroup_cfs_rq(cgrp, cpu);
+	cfs->prev_exec_clock = cfs->exec_clock;
+}
+static u64 cpu_cgroup_cpuusage_cfs(struct cgroup *cgrp, int cpu)
+{
+	struct cfs_rq *cfs = cpu_cgroup_cfs_rq(cgrp, cpu);
+	return cfs->exec_clock - cfs->prev_exec_clock;
+}
+#else
+static void cpu_cgroup_update_cpuusage_cfs(struct cgroup *cgrp, int cpu)
 {
+}
+
+static u64 cpu_cgroup_cpuusage_cfs(struct cgroup *cgrp, int cpu)
+{
+	return 0;
+}
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+static struct rt_rq *
+cpu_cgroup_rt_rq(struct cgroup *cgrp, int cpu)
+{
+	struct task_group *tg = cgroup_tg(cgrp);
+	if (tg == &root_task_group)
+		return &cpu_rq(cpu)->rt;
+
+	return tg->rt_rq[cpu];
+
+}
+static void cpu_cgroup_update_cpuusage_rt(struct cgroup *cgrp, int cpu)
+{
+	struct rt_rq *rt = cpu_cgroup_rt_rq(cgrp, cpu);
+	rt->prev_exec_clock = rt->exec_clock;
+}
+
+static u64 cpu_cgroup_cpuusage_rt(struct cgroup *cgrp, int cpu)
+{
+	struct rt_rq *rt = cpu_cgroup_rt_rq(cgrp, cpu);
+	return rt->exec_clock - rt->prev_exec_clock;
+}
+#else
+static void cpu_cgroup_update_cpuusage_rt(struct cgroup *cgrp, int cpu)
+{
+}
+static u64 cpu_cgroup_cpuusage_rt(struct cgroup *cgrp, int cpu)
+{
+	return 0;
+}
+#endif
+
+static int cpu_cgroup_cpuusage_write(struct cgroup *cgrp, int cpu, u64 val)
+{
+	cpu_cgroup_update_cpuusage_cfs(cgrp, cpu);
+	cpu_cgroup_update_cpuusage_rt(cgrp, cpu);
+	return 0;
+}
+
+static u64 cpu_cgroup_cpuusage_read(struct cgroup *cgrp, int cpu)
+{
+	return cpu_cgroup_cpuusage_cfs(cgrp, cpu) +
+	       cpu_cgroup_cpuusage_rt(cgrp, cpu);
+}
+
+#else
+static u64 cpu_cgroup_cpuusage_read(struct cgroup *cgrp, int i)
+{
+	BUG();
+	return 0;
+}
+
+static int cpu_cgroup_cpuusage_write(struct cgroup *cgrp, int cpu, u64 val)
+{
+	BUG();
+	return 0;
+}
+#endif /* CONFIG_CGROUP_SCHED */
+
+static u64 cpuacct_cpuusage_read(struct cgroup *cgrp, int cpu)
+{
+	struct cpuacct *ca = cgroup_ca(cgrp);
 	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
 	u64 data;
 
@@ -8364,8 +8472,9 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
 	return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+static void cpuacct_cpuusage_write(struct cgroup *cgrp, int cpu, u64 val)
 {
+	struct cpuacct *ca = cgroup_ca(cgrp);
 	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
 
 #ifndef CONFIG_64BIT
@@ -8380,15 +8489,21 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
 #endif
 }
 
+static u64 cpuusage_read_percpu(struct cgroup *cgrp, int cpu)
+{
+	if (cpuacct_from_cpu)
+		return cpu_cgroup_cpuusage_read(cgrp, cpu);
+	return cpuacct_cpuusage_read(cgrp, cpu);
+}
+
 /* return total cpu usage (in nanoseconds) of a group */
 static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
 {
-	struct cpuacct *ca = cgroup_ca(cgrp);
 	u64 totalcpuusage = 0;
 	int i;
 
 	for_each_present_cpu(i)
-		totalcpuusage += cpuacct_cpuusage_read(ca, i);
+		totalcpuusage += cpuusage_read_percpu(cgrp, i);
 
 	return totalcpuusage;
 }
@@ -8396,7 +8511,6 @@ static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
 static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
 								u64 reset)
 {
-	struct cpuacct *ca = cgroup_ca(cgrp);
 	int err = 0;
 	int i;
 
@@ -8405,8 +8519,12 @@ static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
 		goto out;
 	}
 
-	for_each_present_cpu(i)
-		cpuacct_cpuusage_write(ca, i, 0);
+	for_each_present_cpu(i) {
+		if (cpuacct_from_cpu)
+			cpu_cgroup_cpuusage_write(cgrp, i, 0);
+		else
+			cpuacct_cpuusage_write(cgrp, i, 0);
+	}
 
 out:
 	return err;
@@ -8415,12 +8533,11 @@ out:
 static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
 				   struct seq_file *m)
 {
-	struct cpuacct *ca = cgroup_ca(cgroup);
 	u64 percpu;
 	int i;
 
 	for_each_present_cpu(i) {
-		percpu = cpuacct_cpuusage_read(ca, i);
+		percpu = cpuusage_read_percpu(cgroup, i);
 		seq_printf(m, "%llu ", (unsigned long long) percpu);
 	}
 	seq_printf(m, "\n");
@@ -8483,7 +8600,7 @@ static struct cftype files[] = {
  *
  * called with rq->lock held.
  */
-void cpuacct_charge(struct task_struct *tsk, u64 cputime)
+void __cpuacct_charge(struct task_struct *tsk, u64 cputime)
 {
 	struct cpuacct *ca;
 	int cpu;
@@ -8511,5 +8628,9 @@ struct cgroup_subsys cpuacct_subsys = {
 	.destroy = cpuacct_destroy,
 	.subsys_id = cpuacct_subsys_id,
 	.base_cftypes = files,
+#ifdef CONFIG_CGROUP_FORCE_COMOUNT_CPU
+	.comounts = 1,
+	.must_comount = { cpu_cgroup_subsys_id, },
+#endif
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8da579d..1da9fa8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -885,6 +885,9 @@ extern void update_idle_cpu_load(struct rq *this_rq);
 
 #ifdef CONFIG_CGROUP_CPUACCT
 #include <linux/cgroup.h>
+
+extern bool cpuacct_from_cpu;
+
 /* track cpu usage of a group of tasks and its child groups */
 struct cpuacct {
 	struct cgroup_subsys_state css;
@@ -914,7 +917,16 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 	return cgroup_ca(ca->css.cgroup->parent);
 }
 
-extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
+extern void __cpuacct_charge(struct task_struct *tsk, u64 cputime);
+
+static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime)
+{
+#ifdef CONFIG_CGROUP_FORCE_COMOUNT_CPU
+	if (likely(!cpuacct_from_cpu))
+		return;
+#endif
+	__cpuacct_charge(tsk, cputime);
+}
 #else
 static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 #endif
-- 
1.7.11.4


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC 4/5] cpuacct: do not gather cpuacct statistics when not mounted
  2012-09-04 14:18 [RFC 0/5] forced comounts for cgroups Glauber Costa
                   ` (2 preceding siblings ...)
  2012-09-04 14:18 ` [RFC 3/5] sched: do not call cpuacct_charge when cpu and cpuacct are comounted Glauber Costa
@ 2012-09-04 14:18 ` Glauber Costa
  2012-09-04 14:18 ` [RFC 5/5] sched: add cpusets to comounts list Glauber Costa
  2012-09-04 21:46 ` [RFC 0/5] forced comounts for cgroups Tejun Heo
  5 siblings, 0 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-04 14:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt, lennart,
	kay.sievers, tj, Glauber Costa

Currently, the only test that prevents us from running the expensive
cpuacct_charge() is cpuacct_subsys.active == true. This will hold at all
times after the subsystem is activated, even if it is not mounted.

IOW, use it or not, you pay it. By hooking with the bind() callback, we
can detect when cpuacct is mounted or umounted, and stop collecting
statistics when this cgroup is not in use.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Jones <davej@redhat.com>
CC: Ben Hutchings <ben@decadent.org.uk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
CC: Lennart Poettering <lennart@poettering.net>
CC: Kay Sievers <kay.sievers@vrfy.org>
CC: Tejun Heo <tj@kernel.org>
---
 kernel/sched/core.c  | 8 ++++++++
 kernel/sched/sched.h | 3 +++
 2 files changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e46871d..d654bd1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8595,6 +8595,13 @@ static struct cftype files[] = {
 	{ }	/* terminate */
 };
 
+bool cpuacct_mounted;
+
+void cpuacct_bind(struct cgroup *root)
+{
+	cpuacct_mounted = root->root == root_cpuacct.css.cgroup->root;
+}
+
 /*
  * charge this task's execution time to its accounting group.
  *
@@ -8628,6 +8635,7 @@ struct cgroup_subsys cpuacct_subsys = {
 	.destroy = cpuacct_destroy,
 	.subsys_id = cpuacct_subsys_id,
 	.base_cftypes = files,
+	.bind = cpuacct_bind,
 #ifdef CONFIG_CGROUP_FORCE_COMOUNT_CPU
 	.comounts = 1,
 	.must_comount = { cpu_cgroup_subsys_id, },
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1da9fa8..d33f777 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -887,6 +887,7 @@ extern void update_idle_cpu_load(struct rq *this_rq);
 #include <linux/cgroup.h>
 
 extern bool cpuacct_from_cpu;
+extern bool cpuacct_mounted;
 
 /* track cpu usage of a group of tasks and its child groups */
 struct cpuacct {
@@ -921,6 +922,8 @@ extern void __cpuacct_charge(struct task_struct *tsk, u64 cputime);
 
 static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 {
+	if (unlikely(!cpuacct_mounted))
+		return;
 #ifdef CONFIG_CGROUP_FORCE_COMOUNT_CPU
 	if (likely(!cpuacct_from_cpu))
 		return;
-- 
1.7.11.4


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC 5/5] sched: add cpusets to comounts list
  2012-09-04 14:18 [RFC 0/5] forced comounts for cgroups Glauber Costa
                   ` (3 preceding siblings ...)
  2012-09-04 14:18 ` [RFC 4/5] cpuacct: do not gather cpuacct statistics when not mounted Glauber Costa
@ 2012-09-04 14:18 ` Glauber Costa
  2012-09-04 21:46 ` [RFC 0/5] forced comounts for cgroups Tejun Heo
  5 siblings, 0 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-04 14:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt, lennart,
	kay.sievers, tj, Glauber Costa

Although we have not yet identified any place where cpusets could be
improved performance-wise by guaranteeing comounts with the other two
cpu cgroups, it is a sane choice to mount them together.

We can preemptively benefit from it and avoid a growing mess, by
guaranteeing that subsystems that mostly contraint the same kind of
resource will live together. With cgroups is never that simple, and
things crosses boundaries quite often. But I hope this can be seen as a
potential improvement.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Dave Jones <davej@redhat.com>
CC: Ben Hutchings <ben@decadent.org.uk>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Paul Turner <pjt@google.com>
CC: Lennart Poettering <lennart@poettering.net>
CC: Kay Sievers <kay.sievers@vrfy.org>
CC: Tejun Heo <tj@kernel.org>
---
 kernel/cpuset.c     | 4 ++++
 kernel/sched/core.c | 8 ++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 8c8bd65..f8e1c49 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1879,6 +1879,10 @@ struct cgroup_subsys cpuset_subsys = {
 	.post_clone = cpuset_post_clone,
 	.subsys_id = cpuset_subsys_id,
 	.base_cftypes = files,
+#ifdef CONFIG_CGROUP_FORCE_COMOUNT_CPU
+	.comounts = 2,
+	.must_comount = { cpu_cgroup_subsys_id, cpuacct_subsys_id, },
+#endif
 	.early_init = 1,
 };
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d654bd1..aeff02c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8301,8 +8301,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 	.subsys_id	= cpu_cgroup_subsys_id,
 	.base_cftypes	= cpu_files,
 #ifdef CONFIG_CGROUP_FORCE_COMOUNT_CPU
-	.comounts	= 1,
-	.must_comount	= { cpuacct_subsys_id, },
+	.comounts	= 2,
+	.must_comount	= { cpuacct_subsys_id, cpuset_subsys_id, },
 	.bind		= cpu_cgroup_bind,
 #endif
 	.early_init	= 1,
@@ -8637,8 +8637,8 @@ struct cgroup_subsys cpuacct_subsys = {
 	.base_cftypes = files,
 	.bind = cpuacct_bind,
 #ifdef CONFIG_CGROUP_FORCE_COMOUNT_CPU
-	.comounts = 1,
-	.must_comount = { cpu_cgroup_subsys_id, },
+	.comounts = 2,
+	.must_comount = { cpu_cgroup_subsys_id, cpuset_subsys_id, },
 #endif
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
-- 
1.7.11.4


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-04 14:18 [RFC 0/5] forced comounts for cgroups Glauber Costa
                   ` (4 preceding siblings ...)
  2012-09-04 14:18 ` [RFC 5/5] sched: add cpusets to comounts list Glauber Costa
@ 2012-09-04 21:46 ` Tejun Heo
  2012-09-05  8:03   ` Glauber Costa
  5 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-04 21:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

Hello, Glauber.

On Tue, Sep 04, 2012 at 06:18:15PM +0400, Glauber Costa wrote:
> As we have been extensively discussing, the cost and pain points for cgroups
> come from many places. But at least one of those is the arbitrary nature of
> hierarchies. Many people, including at least Tejun and me would like this to go
> away altogether. Problem so far, is breaking compatiblity with existing setups
> 
> I am proposing here a default-n Kconfig option that will guarantee that the cpu
> cgroups (for now) will be comounted. I started with them because the
> cpu/cpuacct division is clearly the worst offender. Also, the default-n is here
> so distributions will have time to adapt: Forcing this flag to be on without
> userspace changes will just lead to cgroups failing to mount, which we don't
> want.
> 
> Although I've tested it and it works, I haven't compile-tested all possible
> config combinations. So this is mostly for your eyes. If this gets traction,
> I'll submit it properly, along with any changes that you might require.

As I said during the discussion, I'm skeptical about how useful this
is.  This can't nudge existing users in any meaningfully gradual way.
Kconfig doesn't make it any better.  It's still an abrupt behavior
change when seen from userland.

Also, I really don't see much point in enforcing this almost arbitrary
grouping of controllers.  It doesn't simplify anything and using
cpuacct in more granular way than cpu actually is one of the better
justified use of multiple hierarchies.  Also, what about memcg and
blkcg?  Do they *really* coincide?  Note that both blkcg and memcg
involve non-trivial overhead and blkcg is essentially broken
hierarchy-wise.

Currently, from userland visible behavior POV, the crazy parts are

1. The flat hierarchy thing.  This just should go away.

2. Orthogonal multiple hierarchies.

I think we agree that #1 should go away one way or the other.  I
*really* wanna get rid of #2 but am not sure how.  I'll give it
another stab once the writeback thing is resolved.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-04 21:46 ` [RFC 0/5] forced comounts for cgroups Tejun Heo
@ 2012-09-05  8:03   ` Glauber Costa
  2012-09-05  8:14     ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Glauber Costa @ 2012-09-05  8:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

On 09/05/2012 01:46 AM, Tejun Heo wrote:
> Hello, Glauber.
> 
> On Tue, Sep 04, 2012 at 06:18:15PM +0400, Glauber Costa wrote:
>> As we have been extensively discussing, the cost and pain points for cgroups
>> come from many places. But at least one of those is the arbitrary nature of
>> hierarchies. Many people, including at least Tejun and me would like this to go
>> away altogether. Problem so far, is breaking compatiblity with existing setups
>>
>> I am proposing here a default-n Kconfig option that will guarantee that the cpu
>> cgroups (for now) will be comounted. I started with them because the
>> cpu/cpuacct division is clearly the worst offender. Also, the default-n is here
>> so distributions will have time to adapt: Forcing this flag to be on without
>> userspace changes will just lead to cgroups failing to mount, which we don't
>> want.
>>
>> Although I've tested it and it works, I haven't compile-tested all possible
>> config combinations. So this is mostly for your eyes. If this gets traction,
>> I'll submit it properly, along with any changes that you might require.
> 
> As I said during the discussion, I'm skeptical about how useful this
> is.  This can't nudge existing users in any meaningfully gradual way.
> Kconfig doesn't make it any better.  It's still an abrupt behavior
> change when seen from userland.
>

The goal here is to have distributions to do it, because they tend to
have a well defined lifecycle management, much more than upstream. Whoever
sets this option, can coordinate with upstream.

Aside from enforcing it, we can pretty much warn() as well, to direct
people towards flipping the switch.

> Also, I really don't see much point in enforcing this almost arbitrary
> grouping of controllers.  It doesn't simplify anything and using
> cpuacct in more granular way than cpu actually is one of the better
> justified use of multiple hierarchies.  Also, what about memcg and
> blkcg?  Do they *really* coincide?  Note that both blkcg and memcg
> involve non-trivial overhead and blkcg is essentially broken
> hierarchy-wise.
> 

Where did I mention memcg or blkcg in this patch ?


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  8:03   ` Glauber Costa
@ 2012-09-05  8:14     ` Tejun Heo
  2012-09-05  8:17       ` Glauber Costa
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  8:14 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

Hello, Glauber.

On Wed, Sep 05, 2012 at 12:03:25PM +0400, Glauber Costa wrote:
> The goal here is to have distributions to do it, because they tend to
> have a well defined lifecycle management, much more than upstream. Whoever
> sets this option, can coordinate with upstream.

Distros can just co-mount them during boot.  What's the point of the
config options?

> > Also, I really don't see much point in enforcing this almost arbitrary
> > grouping of controllers.  It doesn't simplify anything and using
> > cpuacct in more granular way than cpu actually is one of the better
> > justified use of multiple hierarchies.  Also, what about memcg and
> > blkcg?  Do they *really* coincide?  Note that both blkcg and memcg
> > involve non-trivial overhead and blkcg is essentially broken
> > hierarchy-wise.
> 
> Where did I mention memcg or blkcg in this patch ?

Differing hierarchies in memcg and blkcg currently is the most
prominent case where the intersection in writeback is problematic and
your proposed solution doesn't help one way or the other.  What's the
point?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  8:14     ` Tejun Heo
@ 2012-09-05  8:17       ` Glauber Costa
  2012-09-05  8:29         ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Glauber Costa @ 2012-09-05  8:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

On 09/05/2012 12:14 PM, Tejun Heo wrote:
> Hello, Glauber.
> 
> On Wed, Sep 05, 2012 at 12:03:25PM +0400, Glauber Costa wrote:
>> The goal here is to have distributions to do it, because they tend to
>> have a well defined lifecycle management, much more than upstream. Whoever
>> sets this option, can coordinate with upstream.
> 
> Distros can just co-mount them during boot.  What's the point of the
> config options?
> 

Pretty simple. The kernel can't assume the distro did. And then we still
need to pay a stupid big price in the scheduler.

After this patchset, We can assume this. And cpuusage can totally be
derived from the cpu cgroup. Because much more than "they can comount",
we can assume they did.

>>> Also, I really don't see much point in enforcing this almost arbitrary
>>> grouping of controllers.  It doesn't simplify anything and using
>>> cpuacct in more granular way than cpu actually is one of the better
>>> justified use of multiple hierarchies.  Also, what about memcg and
>>> blkcg?  Do they *really* coincide?  Note that both blkcg and memcg
>>> involve non-trivial overhead and blkcg is essentially broken
>>> hierarchy-wise.
>>
>> Where did I mention memcg or blkcg in this patch ?
> 
> Differing hierarchies in memcg and blkcg currently is the most
> prominent case where the intersection in writeback is problematic and
> your proposed solution doesn't help one way or the other.  What's the
> point?
> 

The point is that I am focusing at one problem at a time. But FWIW, I
don't see why memcg/blkcg can't use a step just like this one in a
separate pass.

If the goal is comounting them eventually, at some point when the issues
are sorted out, just do it. Get a switch like this one, and then you
will start being able to assume a lot of things in the code. Miracles
can happen.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  8:17       ` Glauber Costa
@ 2012-09-05  8:29         ` Tejun Heo
  2012-09-05  8:35           ` Glauber Costa
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  8:29 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

Hello, Glauber.

On Wed, Sep 05, 2012 at 12:17:11PM +0400, Glauber Costa wrote:
> > Distros can just co-mount them during boot.  What's the point of the
> > config options?
> 
> Pretty simple. The kernel can't assume the distro did. And then we still
> need to pay a stupid big price in the scheduler.
> 
> After this patchset, We can assume this. And cpuusage can totally be
> derived from the cpu cgroup. Because much more than "they can comount",
> we can assume they did.

As long as cpuacct and cpu are separate, I think it makes sense to
assume that they at least could be at different granularity.  As for
optimization for co-mounted case, if that is *really* necessary,
couldn't it be done dynamically?  It's not like CONFIG_XXX blocks are
pretty things and they're worse for runtime code path coverage.

> > Differing hierarchies in memcg and blkcg currently is the most
> > prominent case where the intersection in writeback is problematic and
> > your proposed solution doesn't help one way or the other.  What's the
> > point?
> 
> The point is that I am focusing at one problem at a time. But FWIW, I
> don't see why memcg/blkcg can't use a step just like this one in a
> separate pass.
> 
> If the goal is comounting them eventually, at some point when the issues
> are sorted out, just do it. Get a switch like this one, and then you
> will start being able to assume a lot of things in the code. Miracles
> can happen.

The problem is that I really don't see how this leads to where we
eventually wanna be.  Orthogonal hierarchies are bad because,

* It complicates the code.  This doesn't really help there much.

* Intersections between controllers are cumbersome to handle.  Again,
  this doesn't help much.

And this restricts the only valid use case for multiple hierarchies
which is applying differing level of granularity depending on
controllers.  So, I don't know.  Doesn't seem like a good idea to me.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  8:29         ` Tejun Heo
@ 2012-09-05  8:35           ` Glauber Costa
  2012-09-05  8:47             ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Glauber Costa @ 2012-09-05  8:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

On 09/05/2012 12:29 PM, Tejun Heo wrote:
> Hello, Glauber.
> 
> On Wed, Sep 05, 2012 at 12:17:11PM +0400, Glauber Costa wrote:
>>> Distros can just co-mount them during boot.  What's the point of the
>>> config options?
>>
>> Pretty simple. The kernel can't assume the distro did. And then we still
>> need to pay a stupid big price in the scheduler.
>>
>> After this patchset, We can assume this. And cpuusage can totally be
>> derived from the cpu cgroup. Because much more than "they can comount",
>> we can assume they did.
> 
> As long as cpuacct and cpu are separate, I think it makes sense to
> assume that they at least could be at different granularity.  

If they are comounted, and more: forceably comounted, I don't see how to
call them separate. At the very best, they are this way for
compatibility purposes only, to lay a path that would allow us to get
rid of the separation eventually.

> As for
> optimization for co-mounted case, if that is *really* necessary,
> couldn't it be done dynamically?  It's not like CONFIG_XXX blocks are
> pretty things and they're worse for runtime code path coverage.
> 

I've done it dynamically, as you know. But if you think that complicated
the code less than this, we're operating by very different standards...

CONFIG options can make the code uglier, but it is a lot more
predictable. It also guarantee no state changes will happen during the
lifecycle of the machine. Doing it dynamically makes the code prettier,
but still extensively large, and prone to subtle bugs, as we've already
seen in practice.

>>> Differing hierarchies in memcg and blkcg currently is the most
>>> prominent case where the intersection in writeback is problematic and
>>> your proposed solution doesn't help one way or the other.  What's the
>>> point?
>>
>> The point is that I am focusing at one problem at a time. But FWIW, I
>> don't see why memcg/blkcg can't use a step just like this one in a
>> separate pass.
>>
>> If the goal is comounting them eventually, at some point when the issues
>> are sorted out, just do it. Get a switch like this one, and then you
>> will start being able to assume a lot of things in the code. Miracles
>> can happen.
> 
> The problem is that I really don't see how this leads to where we
> eventually wanna be.  Orthogonal hierarchies are bad because,
> 
> * It complicates the code.  This doesn't really help there much.
> 

Way I see it, it is the price we pay for having screwed up before.
And Kconfig options doesn't necessarily complicate the code. They make
it bigger, and possibly slightly harder to follow. But I myself

> * Intersections between controllers are cumbersome to handle.  Again,
>   this doesn't help much.
>

They are only cumbersome because we can't assume nothing. The cpuacct is
the perfect example. Once we can start assuming, they become a lot less so.


> And this restricts the only valid use case for multiple hierarchies
> which is applying differing level of granularity depending on
> controllers.  So, I don't know.  Doesn't seem like a good idea to me.
> 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  8:35           ` Glauber Costa
@ 2012-09-05  8:47             ` Tejun Heo
  2012-09-05  8:55               ` Glauber Costa
  2012-09-05  9:06               ` Peter Zijlstra
  0 siblings, 2 replies; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  8:47 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

Hello, Glauber.

On Wed, Sep 05, 2012 at 12:35:11PM +0400, Glauber Costa wrote:
> > As long as cpuacct and cpu are separate, I think it makes sense to
> > assume that they at least could be at different granularity.  
> 
> If they are comounted, and more: forceably comounted, I don't see how to
> call them separate. At the very best, they are this way for
> compatibility purposes only, to lay a path that would allow us to get
> rid of the separation eventually.

I think this is where we disagree.  I didn't mean that all controllers
should be using exactly the same hierarchy when I was talking about
unified hierarchy.  I do think it's useful and maybe even essential to
allow differing levels of granularity.  cpu and cpuacct could be a
valid example for this.  Likely blkcg and memcg too.

So, I think it's desirable for all controllers to be able to handle
hierarchies the same way and to have the ability to tag something as
belonging to certain group in the hierarchy for all controllers but I
don't think it's desirable or feasible to require all of them to
follow exactly the same grouping at all levels.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  8:47             ` Tejun Heo
@ 2012-09-05  8:55               ` Glauber Costa
  2012-09-05  9:07                 ` Tejun Heo
  2012-09-05  9:06               ` Peter Zijlstra
  1 sibling, 1 reply; 39+ messages in thread
From: Glauber Costa @ 2012-09-05  8:55 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

On 09/05/2012 12:47 PM, Tejun Heo wrote:
> Hello, Glauber.
> 
> On Wed, Sep 05, 2012 at 12:35:11PM +0400, Glauber Costa wrote:
>>> As long as cpuacct and cpu are separate, I think it makes sense to
>>> assume that they at least could be at different granularity.  
>>
>> If they are comounted, and more: forceably comounted, I don't see how to
>> call them separate. At the very best, they are this way for
>> compatibility purposes only, to lay a path that would allow us to get
>> rid of the separation eventually.
> 
> I think this is where we disagree.  I didn't mean that all controllers
> should be using exactly the same hierarchy when I was talking about
> unified hierarchy.  I do think it's useful and maybe even essential to
> allow differing levels of granularity.  cpu and cpuacct could be a
> valid example for this.  Likely blkcg and memcg too.
> 
> So, I think it's desirable for all controllers to be able to handle
> hierarchies the same way and to have the ability to tag something as
> belonging to certain group in the hierarchy for all controllers but I
> don't think it's desirable or feasible to require all of them to
> follow exactly the same grouping at all levels.
> 

By "different levels of granularity" do you mean having just a subset of
them turned on at a particular place?

If yes, having them guaranteed to be comounted is still perceived by me
as a good first step. A natural following would be to turn them on/off
on a per-group basis.





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  8:47             ` Tejun Heo
  2012-09-05  8:55               ` Glauber Costa
@ 2012-09-05  9:06               ` Peter Zijlstra
  2012-09-05  9:07                 ` Peter Zijlstra
                                   ` (2 more replies)
  1 sibling, 3 replies; 39+ messages in thread
From: Peter Zijlstra @ 2012-09-05  9:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On Wed, 2012-09-05 at 01:47 -0700, Tejun Heo wrote:
> I think this is where we disagree.  I didn't mean that all controllers
> should be using exactly the same hierarchy when I was talking about
> unified hierarchy.  I do think it's useful and maybe even essential to
> allow differing levels of granularity.  cpu and cpuacct could be a
> valid example for this.  Likely blkcg and memcg too.
> 
> So, I think it's desirable for all controllers to be able to handle
> hierarchies the same way and to have the ability to tag something as
> belonging to certain group in the hierarchy for all controllers but I
> don't think it's desirable or feasible to require all of them to
> follow exactly the same grouping at all levels. 

*confused* I always thought that was exactly what you meant with unified
hierarchy.

Doing all this runtime is just going to make the mess even bigger,
because now we have to deal with even more stupid cases.

So either we go and try to contain this mess as proposed by Glauber or
we go delete controllers.. I've had it with this crap.

---
 Documentation/cgroups/00-INDEX    |   2 -
 Documentation/cgroups/cpuacct.txt |  49 --------
 include/linux/cgroup_subsys.h     |   6 -
 init/Kconfig                      |   6 -
 kernel/sched/core.c               | 247 --------------------------------------
 kernel/sched/fair.c               |   1 -
 kernel/sched/rt.c                 |   1 -
 kernel/sched/sched.h              |  45 -------
 kernel/sched/stop_task.c          |   1 -
 9 files changed, 358 deletions(-)

diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
index 3f58fa3..9f100cc 100644
--- a/Documentation/cgroups/00-INDEX
+++ b/Documentation/cgroups/00-INDEX
@@ -2,8 +2,6 @@
 	- this file
 cgroups.txt
 	- Control Groups definition, implementation details, examples and API.
-cpuacct.txt
-	- CPU Accounting Controller; account CPU usage for groups of tasks.
 cpusets.txt
 	- documents the cpusets feature; assign CPUs and Mem to a set of tasks.
 devices.txt
diff --git a/Documentation/cgroups/cpuacct.txt b/Documentation/cgroups/cpuacct.txt
deleted file mode 100644
index 9d73cc0..0000000
--- a/Documentation/cgroups/cpuacct.txt
+++ /dev/null
@@ -1,49 +0,0 @@
-CPU Accounting Controller
--------------------------
-
-The CPU accounting controller is used to group tasks using cgroups and
-account the CPU usage of these groups of tasks.
-
-The CPU accounting controller supports multi-hierarchy groups. An accounting
-group accumulates the CPU usage of all of its child groups and the tasks
-directly present in its group.
-
-Accounting groups can be created by first mounting the cgroup filesystem.
-
-# mount -t cgroup -ocpuacct none /sys/fs/cgroup
-
-With the above step, the initial or the parent accounting group becomes
-visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
-the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
-/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained
-by this group which is essentially the CPU time obtained by all the tasks
-in the system.
-
-New accounting groups can be created under the parent group /sys/fs/cgroup.
-
-# cd /sys/fs/cgroup
-# mkdir g1
-# echo $$ > g1/tasks
-
-The above steps create a new group g1 and move the current shell
-process (bash) into it. CPU time consumed by this bash and its children
-can be obtained from g1/cpuacct.usage and the same is accumulated in
-/sys/fs/cgroup/cpuacct.usage also.
-
-cpuacct.stat file lists a few statistics which further divide the
-CPU time obtained by the cgroup into user and system times. Currently
-the following statistics are supported:
-
-user: Time spent by tasks of the cgroup in user mode.
-system: Time spent by tasks of the cgroup in kernel mode.
-
-user and system are in USER_HZ unit.
-
-cpuacct controller uses percpu_counter interface to collect user and
-system times. This has two side effects:
-
-- It is theoretically possible to see wrong values for user and system times.
-  This is because percpu_counter_read() on 32bit systems isn't safe
-  against concurrent writes.
-- It is possible to see slightly outdated values for user and system times
-  due to the batch processing nature of percpu_counter.
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index dfae957..73b7cc1 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -25,12 +25,6 @@ SUBSYS(cpu_cgroup)
 
 /* */
 
-#ifdef CONFIG_CGROUP_CPUACCT
-SUBSYS(cpuacct)
-#endif
-
-/* */
-
 #ifdef CONFIG_MEMCG
 SUBSYS(mem_cgroup)
 #endif
diff --git a/init/Kconfig b/init/Kconfig
index af6c7f8..3ac9e1c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -674,12 +674,6 @@ config PROC_PID_CPUSET
 	depends on CPUSETS
 	default y
 
-config CGROUP_CPUACCT
-	bool "Simple CPU accounting cgroup subsystem"
-	help
-	  Provides a simple Resource Controller for monitoring the
-	  total CPU consumed by the tasks in a cgroup.
-
 config RESOURCE_COUNTERS
 	bool "Resource counters"
 	help
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4376c9f..47c7cdb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2809,18 +2809,9 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	return ns;
 }
 
-#ifdef CONFIG_CGROUP_CPUACCT
-struct cgroup_subsys cpuacct_subsys;
-struct cpuacct root_cpuacct;
-#endif
-
 static inline void task_group_account_field(struct task_struct *p, int index,
 					    u64 tmp)
 {
-#ifdef CONFIG_CGROUP_CPUACCT
-	struct kernel_cpustat *kcpustat;
-	struct cpuacct *ca;
-#endif
 	/*
 	 * Since all updates are sure to touch the root cgroup, we
 	 * get ourselves ahead and touch it first. If the root cgroup
@@ -2828,20 +2819,6 @@ static inline void task_group_account_field(struct task_struct *p, int index,
 	 *
 	 */
 	__get_cpu_var(kernel_cpustat).cpustat[index] += tmp;
-
-#ifdef CONFIG_CGROUP_CPUACCT
-	if (unlikely(!cpuacct_subsys.active))
-		return;
-
-	rcu_read_lock();
-	ca = task_ca(p);
-	while (ca && (ca != &root_cpuacct)) {
-		kcpustat = this_cpu_ptr(ca->cpustat);
-		kcpustat->cpustat[index] += tmp;
-		ca = parent_ca(ca);
-	}
-	rcu_read_unlock();
-#endif
 }
 
 
@@ -7351,12 +7328,6 @@ void __init sched_init(void)
 
 #endif /* CONFIG_CGROUP_SCHED */
 
-#ifdef CONFIG_CGROUP_CPUACCT
-	root_cpuacct.cpustat = &kernel_cpustat;
-	root_cpuacct.cpuusage = alloc_percpu(u64);
-	/* Too early, not expected to fail */
-	BUG_ON(!root_cpuacct.cpuusage);
-#endif
 	for_each_possible_cpu(i) {
 		struct rq *rq;
 
@@ -8409,221 +8380,3 @@ struct cgroup_subsys cpu_cgroup_subsys = {
 };
 
 #endif	/* CONFIG_CGROUP_SCHED */
-
-#ifdef CONFIG_CGROUP_CPUACCT
-
-/*
- * CPU accounting code for task groups.
- *
- * Based on the work by Paul Menage (menage@google.com) and Balbir Singh
- * (balbir@in.ibm.com).
- */
-
-/* create a new cpu accounting group */
-static struct cgroup_subsys_state *cpuacct_create(struct cgroup *cgrp)
-{
-	struct cpuacct *ca;
-
-	if (!cgrp->parent)
-		return &root_cpuacct.css;
-
-	ca = kzalloc(sizeof(*ca), GFP_KERNEL);
-	if (!ca)
-		goto out;
-
-	ca->cpuusage = alloc_percpu(u64);
-	if (!ca->cpuusage)
-		goto out_free_ca;
-
-	ca->cpustat = alloc_percpu(struct kernel_cpustat);
-	if (!ca->cpustat)
-		goto out_free_cpuusage;
-
-	return &ca->css;
-
-out_free_cpuusage:
-	free_percpu(ca->cpuusage);
-out_free_ca:
-	kfree(ca);
-out:
-	return ERR_PTR(-ENOMEM);
-}
-
-/* destroy an existing cpu accounting group */
-static void cpuacct_destroy(struct cgroup *cgrp)
-{
-	struct cpuacct *ca = cgroup_ca(cgrp);
-
-	free_percpu(ca->cpustat);
-	free_percpu(ca->cpuusage);
-	kfree(ca);
-}
-
-static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
-{
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-	u64 data;
-
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
-	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-	data = *cpuusage;
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	data = *cpuusage;
-#endif
-
-	return data;
-}
-
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
-{
-	u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
-	 */
-	raw_spin_lock_irq(&cpu_rq(cpu)->lock);
-	*cpuusage = val;
-	raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	*cpuusage = val;
-#endif
-}
-
-/* return total cpu usage (in nanoseconds) of a group */
-static u64 cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
-{
-	struct cpuacct *ca = cgroup_ca(cgrp);
-	u64 totalcpuusage = 0;
-	int i;
-
-	for_each_present_cpu(i)
-		totalcpuusage += cpuacct_cpuusage_read(ca, i);
-
-	return totalcpuusage;
-}
-
-static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
-								u64 reset)
-{
-	struct cpuacct *ca = cgroup_ca(cgrp);
-	int err = 0;
-	int i;
-
-	if (reset) {
-		err = -EINVAL;
-		goto out;
-	}
-
-	for_each_present_cpu(i)
-		cpuacct_cpuusage_write(ca, i, 0);
-
-out:
-	return err;
-}
-
-static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
-				   struct seq_file *m)
-{
-	struct cpuacct *ca = cgroup_ca(cgroup);
-	u64 percpu;
-	int i;
-
-	for_each_present_cpu(i) {
-		percpu = cpuacct_cpuusage_read(ca, i);
-		seq_printf(m, "%llu ", (unsigned long long) percpu);
-	}
-	seq_printf(m, "\n");
-	return 0;
-}
-
-static const char *cpuacct_stat_desc[] = {
-	[CPUACCT_STAT_USER] = "user",
-	[CPUACCT_STAT_SYSTEM] = "system",
-};
-
-static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
-			      struct cgroup_map_cb *cb)
-{
-	struct cpuacct *ca = cgroup_ca(cgrp);
-	int cpu;
-	s64 val = 0;
-
-	for_each_online_cpu(cpu) {
-		struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
-		val += kcpustat->cpustat[CPUTIME_USER];
-		val += kcpustat->cpustat[CPUTIME_NICE];
-	}
-	val = cputime64_to_clock_t(val);
-	cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_USER], val);
-
-	val = 0;
-	for_each_online_cpu(cpu) {
-		struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu);
-		val += kcpustat->cpustat[CPUTIME_SYSTEM];
-		val += kcpustat->cpustat[CPUTIME_IRQ];
-		val += kcpustat->cpustat[CPUTIME_SOFTIRQ];
-	}
-
-	val = cputime64_to_clock_t(val);
-	cb->fill(cb, cpuacct_stat_desc[CPUACCT_STAT_SYSTEM], val);
-
-	return 0;
-}
-
-static struct cftype files[] = {
-	{
-		.name = "usage",
-		.read_u64 = cpuusage_read,
-		.write_u64 = cpuusage_write,
-	},
-	{
-		.name = "usage_percpu",
-		.read_seq_string = cpuacct_percpu_seq_read,
-	},
-	{
-		.name = "stat",
-		.read_map = cpuacct_stats_show,
-	},
-	{ }	/* terminate */
-};
-
-/*
- * charge this task's execution time to its accounting group.
- *
- * called with rq->lock held.
- */
-void cpuacct_charge(struct task_struct *tsk, u64 cputime)
-{
-	struct cpuacct *ca;
-	int cpu;
-
-	if (unlikely(!cpuacct_subsys.active))
-		return;
-
-	cpu = task_cpu(tsk);
-
-	rcu_read_lock();
-
-	ca = task_ca(tsk);
-
-	for (; ca; ca = parent_ca(ca)) {
-		u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-		*cpuusage += cputime;
-	}
-
-	rcu_read_unlock();
-}
-
-struct cgroup_subsys cpuacct_subsys = {
-	.name = "cpuacct",
-	.create = cpuacct_create,
-	.destroy = cpuacct_destroy,
-	.subsys_id = cpuacct_subsys_id,
-	.base_cftypes = files,
-};
-#endif	/* CONFIG_CGROUP_CPUACCT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01d3eda..bff5b6e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -706,7 +706,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
 		struct task_struct *curtask = task_of(curr);
 
 		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
-		cpuacct_charge(curtask, delta_exec);
 		account_group_exec_runtime(curtask, delta_exec);
 	}
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 944cb68..8e5805e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -934,7 +934,6 @@ static void update_curr_rt(struct rq *rq)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock_task;
-	cpuacct_charge(curr, delta_exec);
 
 	sched_rt_avg_update(rq, delta_exec);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f6714d0..00ca3f6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -833,15 +833,6 @@ static const u32 prio_to_wmult[40] = {
  /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
 };
 
-/* Time spent by the tasks of the cpu accounting group executing in ... */
-enum cpuacct_stat_index {
-	CPUACCT_STAT_USER,	/* ... user mode */
-	CPUACCT_STAT_SYSTEM,	/* ... kernel mode */
-
-	CPUACCT_STAT_NSTATS,
-};
-
-
 #define sched_class_highest (&stop_sched_class)
 #define for_each_class(class) \
    for (class = sched_class_highest; class; class = class->next)
@@ -881,42 +872,6 @@ extern void init_rt_bandwidth(struct rt_bandwidth *rt_b, u64 period, u64 runtime
 
 extern void update_idle_cpu_load(struct rq *this_rq);
 
-#ifdef CONFIG_CGROUP_CPUACCT
-#include <linux/cgroup.h>
-/* track cpu usage of a group of tasks and its child groups */
-struct cpuacct {
-	struct cgroup_subsys_state css;
-	/* cpuusage holds pointer to a u64-type object on every cpu */
-	u64 __percpu *cpuusage;
-	struct kernel_cpustat __percpu *cpustat;
-};
-
-/* return cpu accounting group corresponding to this container */
-static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
-{
-	return container_of(cgroup_subsys_state(cgrp, cpuacct_subsys_id),
-			    struct cpuacct, css);
-}
-
-/* return cpu accounting group to which this task belongs */
-static inline struct cpuacct *task_ca(struct task_struct *tsk)
-{
-	return container_of(task_subsys_state(tsk, cpuacct_subsys_id),
-			    struct cpuacct, css);
-}
-
-static inline struct cpuacct *parent_ca(struct cpuacct *ca)
-{
-	if (!ca || !ca->css.cgroup->parent)
-		return NULL;
-	return cgroup_ca(ca->css.cgroup->parent);
-}
-
-extern void cpuacct_charge(struct task_struct *tsk, u64 cputime);
-#else
-static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
-#endif
-
 static inline void inc_nr_running(struct rq *rq)
 {
 	rq->nr_running++;
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index da5eb5b..fda1cbe 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -68,7 +68,6 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq->clock_task;
-	cpuacct_charge(curr, delta_exec);
 }
 
 static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:07                 ` Tejun Heo
@ 2012-09-05  9:06                   ` Glauber Costa
  2012-09-05  9:14                     ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Glauber Costa @ 2012-09-05  9:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

On 09/05/2012 01:07 PM, Tejun Heo wrote:
> Hello, Glauber.
> 
> On Wed, Sep 05, 2012 at 12:55:21PM +0400, Glauber Costa wrote:
>>> So, I think it's desirable for all controllers to be able to handle
>>> hierarchies the same way and to have the ability to tag something as
>>> belonging to certain group in the hierarchy for all controllers but I
>>> don't think it's desirable or feasible to require all of them to
>>> follow exactly the same grouping at all levels.
>>
>> By "different levels of granularity" do you mean having just a subset of
>> them turned on at a particular place?
> 
> Heh, this is tricky to describe and I'm not really following what you
> mean. 

Do we really want to start cleaning up all this by changing the
interface to something that is described as "tricky" ?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:06               ` Peter Zijlstra
@ 2012-09-05  9:07                 ` Peter Zijlstra
  2012-09-05  9:22                   ` Tejun Heo
  2012-09-05  9:11                 ` Tejun Heo
  2012-09-05  9:32                 ` Tejun Heo
  2 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2012-09-05  9:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On Wed, 2012-09-05 at 11:06 +0200, Peter Zijlstra wrote:
> 
> So either we go and try to contain this mess as proposed by Glauber or
> we go delete controllers.. I've had it with this crap.
> 
> 

Glauber, the other approach is sending a patch that doesn't touch
cgroup.c but only the controllers and I'll merge it regardless of what
tj thinks.

We need some movement here.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  8:55               ` Glauber Costa
@ 2012-09-05  9:07                 ` Tejun Heo
  2012-09-05  9:06                   ` Glauber Costa
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  9:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

Hello, Glauber.

On Wed, Sep 05, 2012 at 12:55:21PM +0400, Glauber Costa wrote:
> > So, I think it's desirable for all controllers to be able to handle
> > hierarchies the same way and to have the ability to tag something as
> > belonging to certain group in the hierarchy for all controllers but I
> > don't think it's desirable or feasible to require all of them to
> > follow exactly the same grouping at all levels.
> 
> By "different levels of granularity" do you mean having just a subset of
> them turned on at a particular place?

Heh, this is tricky to describe and I'm not really following what you
mean.  They're all on the same tree but a controller should be able to
handle a given subtree as single group.  e.g. if you draw the tree,
different controllers should be able to draw different enclosing
circles and operate on the simplifed tree.  How flexible that should
be, I don't know.  Maybe it would be enough to be able to say "treat
all children of this node as belonging to this node for controllers X
and Y".

> If yes, having them guaranteed to be comounted is still perceived by me
> as a good first step. A natural following would be to turn them on/off
> on a per-group basis.

I don't agree with that.  If we do it that way, we would lose
differing granularity from forcing co-mounting and then restore it
later when the subtree handling is implemented.  If we can do away
with differing granularity, that's fine; otherwise, it doesn't make
much sense to remove and then restore it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:06               ` Peter Zijlstra
  2012-09-05  9:07                 ` Peter Zijlstra
@ 2012-09-05  9:11                 ` Tejun Heo
  2012-09-05  9:12                   ` Glauber Costa
  2012-09-05  9:32                 ` Tejun Heo
  2 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  9:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

Hello, Peter.

On Wed, Sep 05, 2012 at 11:06:33AM +0200, Peter Zijlstra wrote:
> *confused* I always thought that was exactly what you meant with unified
> hierarchy.

No, I never counted out differing granularity.

> Doing all this runtime is just going to make the mess even bigger,
> because now we have to deal with even more stupid cases.
> 
> So either we go and try to contain this mess as proposed by Glauber or
> we go delete controllers.. I've had it with this crap.

If cpuacct can really go away, that's great, but I don't think the
problem at hand is unsolvable, so let's not jump it.  cpuacct and cpu
aren't the onlfy problem cases after all.  We need to solve it for
other controllers too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:11                 ` Tejun Heo
@ 2012-09-05  9:12                   ` Glauber Costa
  2012-09-05  9:19                     ` Tejun Heo
  2012-09-05  9:26                     ` Peter Zijlstra
  0 siblings, 2 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-05  9:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On 09/05/2012 01:11 PM, Tejun Heo wrote:
> Hello, Peter.
> 
> On Wed, Sep 05, 2012 at 11:06:33AM +0200, Peter Zijlstra wrote:
>> *confused* I always thought that was exactly what you meant with unified
>> hierarchy.
> 
> No, I never counted out differing granularity.
> 

Can you elaborate on which interface do you envision to make it work?
They will clearly be mounted in the same hierarchy, or as said
alternatively, comounted.

If you can turn them on/off on a per-subtree basis, which interface
exactly do you propose for that?

Would a pair of cgroup core files like available_controllers and
current_controllers are a lot of drivers do, suffice?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:06                   ` Glauber Costa
@ 2012-09-05  9:14                     ` Tejun Heo
  0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  9:14 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, cgroups, linux-mm, davej, ben, a.p.zijlstra, pjt,
	lennart, kay.sievers

On Wed, Sep 05, 2012 at 01:06:39PM +0400, Glauber Costa wrote:
> > Heh, this is tricky to describe and I'm not really following what you
> > mean. 
> 
> Do we really want to start cleaning up all this by changing the
> interface to something that is described as "tricky" ?

The concept is not tricky.  I just can't find the appropriate words.
I *suspect* this can mostly re-use the existing css_set thing.  It
mostly becomes that css_set belongs to the unified hierarchy rather
than each task.  The user interface part isn't trivial and maybe
"don't nest beyond this level" is the only thing reasonable.  Not sure
yet whether that would be enough tho.  Need to think more about it.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:12                   ` Glauber Costa
@ 2012-09-05  9:19                     ` Tejun Heo
  2012-09-05  9:30                       ` Glauber Costa
  2012-09-05  9:26                     ` Peter Zijlstra
  1 sibling, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  9:19 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Peter Zijlstra, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On Wed, Sep 05, 2012 at 01:12:34PM +0400, Glauber Costa wrote:
> > No, I never counted out differing granularity.
> 
> Can you elaborate on which interface do you envision to make it work?
> They will clearly be mounted in the same hierarchy, or as said
> alternatively, comounted.

I'm not sure yet.  At the simplest, mask of controllers which should
honor (or ignore) nesting beyond the node.  That should be
understandable enough.  Not sure whether that would be flexible enough
yet tho.  In the end, they should be comounted but again I don't think
enforcing comounting at the moment is a step towards that.  It's more
like a step sideways.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:07                 ` Peter Zijlstra
@ 2012-09-05  9:22                   ` Tejun Heo
  0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  9:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

Hey,

On Wed, Sep 05, 2012 at 11:07:21AM +0200, Peter Zijlstra wrote:
> Glauber, the other approach is sending a patch that doesn't touch
> cgroup.c but only the controllers and I'll merge it regardless of what
> tj thinks.
> 
> We need some movement here.

Peter, I don't think the proposed patch is helpful at this point.
While movement is necessary, it's not like moving towards any
direction is helpful.  They might just become another cruft which
needs to be maintained.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:12                   ` Glauber Costa
  2012-09-05  9:19                     ` Tejun Heo
@ 2012-09-05  9:26                     ` Peter Zijlstra
  2012-09-05  9:31                       ` Glauber Costa
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2012-09-05  9:26 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tejun Heo, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On Wed, 2012-09-05 at 13:12 +0400, Glauber Costa wrote:
> On 09/05/2012 01:11 PM, Tejun Heo wrote:
> > Hello, Peter.
> > 
> > On Wed, Sep 05, 2012 at 11:06:33AM +0200, Peter Zijlstra wrote:
> >> *confused* I always thought that was exactly what you meant with unified
> >> hierarchy.
> > 
> > No, I never counted out differing granularity.
> > 
> 
> Can you elaborate on which interface do you envision to make it work?
> They will clearly be mounted in the same hierarchy, or as said
> alternatively, comounted.
> 
> If you can turn them on/off on a per-subtree basis, which interface
> exactly do you propose for that?

I wouldn't, screw that. That would result in the exact same problem
we're trying to fix. I want a single hierarchy walk, that's expensive
enough.

> Would a pair of cgroup core files like available_controllers and
> current_controllers are a lot of drivers do, suffice?

No.. its not a 'feature' I care to support for 'my' controllers.

I simply don't want to have to do two (or more) hierarchy walks for
accounting on every schedule event, all that pointer chasing is stupidly
expensive.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:19                     ` Tejun Heo
@ 2012-09-05  9:30                       ` Glauber Costa
  0 siblings, 0 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-05  9:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On 09/05/2012 01:19 PM, Tejun Heo wrote:
> On Wed, Sep 05, 2012 at 01:12:34PM +0400, Glauber Costa wrote:
>>> No, I never counted out differing granularity.
>>
>> Can you elaborate on which interface do you envision to make it work?
>> They will clearly be mounted in the same hierarchy, or as said
>> alternatively, comounted.
> 
> I'm not sure yet.  At the simplest, mask of controllers which should
> honor (or ignore) nesting beyond the node.  That should be
> understandable enough.  Not sure whether that would be flexible enough
> yet tho.  In the end, they should be comounted but again I don't think
> enforcing comounting at the moment is a step towards that.  It's more
> like a step sideways.
> 

Tejun,

>From the code PoV, guaranteed comounting is what allow us to make
optimizations. "Maybe comounting" will maybe simplify the interface, but
will buy us nothing in the performance level.

I am more than happy to respin it with an added interface for masking
cgroups, if you believe this is a requirement.

But hinting me about what you would like to see on that front would be
really helpful.

Re-asking my question:

cpufreq, clocksources, ftrace, etc, they all use an interface that at
this point can be considered quite standard.

Applying the same logic, each cgroup would have a pair of files:

available_controllers, current_controllers, that you can just control by
writing to.

This can get slightly funny when we consider the right semantics for the
hierarchy, but really, everything will. And it is not like we'll have
anything crazy, we just need to tailor it with care.

If you think there is any chance of this getting us somewhere, I'll code
it. But that would be something to be sent *together* with what I've
just done. As I've said, if we can't guarantee the comounting, we would
still lose all the optimization opportunities.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:26                     ` Peter Zijlstra
@ 2012-09-05  9:31                       ` Glauber Costa
  2012-09-05  9:45                         ` Tejun Heo
  2012-09-05 10:20                         ` Peter Zijlstra
  0 siblings, 2 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-05  9:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On 09/05/2012 01:26 PM, Peter Zijlstra wrote:
> On Wed, 2012-09-05 at 13:12 +0400, Glauber Costa wrote:
>> On 09/05/2012 01:11 PM, Tejun Heo wrote:
>>> Hello, Peter.
>>>
>>> On Wed, Sep 05, 2012 at 11:06:33AM +0200, Peter Zijlstra wrote:
>>>> *confused* I always thought that was exactly what you meant with unified
>>>> hierarchy.
>>>
>>> No, I never counted out differing granularity.
>>>
>>
>> Can you elaborate on which interface do you envision to make it work?
>> They will clearly be mounted in the same hierarchy, or as said
>> alternatively, comounted.
>>
>> If you can turn them on/off on a per-subtree basis, which interface
>> exactly do you propose for that?
> 
> I wouldn't, screw that. That would result in the exact same problem
> we're trying to fix. I want a single hierarchy walk, that's expensive
> enough.
> 
>> Would a pair of cgroup core files like available_controllers and
>> current_controllers are a lot of drivers do, suffice?
> 
> No.. its not a 'feature' I care to support for 'my' controllers.
> 
> I simply don't want to have to do two (or more) hierarchy walks for
> accounting on every schedule event, all that pointer chasing is stupidly
> expensive.
> 

You wouldn't have to do more than one hierarchy walks for that. What
Tejun seems to want, is the ability to not have a particular controller
at some point in the tree. But if they exist, they are always together.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:06               ` Peter Zijlstra
  2012-09-05  9:07                 ` Peter Zijlstra
  2012-09-05  9:11                 ` Tejun Heo
@ 2012-09-05  9:32                 ` Tejun Heo
  2012-09-05 10:04                   ` Peter Zijlstra
  2 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

Hey, again.

On Wed, Sep 05, 2012 at 11:06:33AM +0200, Peter Zijlstra wrote:
> Doing all this runtime is just going to make the mess even bigger,
> because now we have to deal with even more stupid cases.
> 
> So either we go and try to contain this mess as proposed by Glauber or
> we go delete controllers.. I've had it with this crap.

cpuacct is rather unique tho.  I think it's gonna be silly whether the
hierarchy is unified or not.

1. If they always can live on the exact same hierarchy, there's no
   point in having the two separate.  Just merge them.

2. If they need differing levels of granularity, they either need to
   do it completely separately as they do now or have some form of
   dynamic optimization if absolutely necesary.

So, I think that choice is rather separate from other issues.  If
cpuacct is gonna be kept, I'd just keep it separate and warn that it
incurs extra overhead for the current users if for nothing else.
Otherwise, kill it or merge it into cpu.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:31                       ` Glauber Costa
@ 2012-09-05  9:45                         ` Tejun Heo
  2012-09-05  9:48                           ` Glauber Costa
  2012-09-05 10:20                         ` Peter Zijlstra
  1 sibling, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  9:45 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Peter Zijlstra, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

Hello,

On Wed, Sep 05, 2012 at 01:31:56PM +0400, Glauber Costa wrote:
> > I simply don't want to have to do two (or more) hierarchy walks for
> > accounting on every schedule event, all that pointer chasing is stupidly
> > expensive.
> 
> You wouldn't have to do more than one hierarchy walks for that. What
> Tejun seems to want, is the ability to not have a particular controller
> at some point in the tree. But if they exist, they are always together.

Nope, as I wrote in the other reply, for cpu and cpuacct, either just
merge them or kill cpuacct if you want to avoid silliness from walking
multiple times.  Does cpuset cause problem in this regard too?  Or can
it be handled similarly to other controllers?

I think the confusion here is that we're talking about two different
issues.  As for cpuacct, I can see why strict co-mounting can be
attractive but then again if that's gonna be required, there's no
point in having them separate, right?  If that's the way you want it,
just trigger WARN_ON() if cpu and cpuacct aren't co-mounted and later
on kill cpuacct.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:45                         ` Tejun Heo
@ 2012-09-05  9:48                           ` Glauber Costa
  2012-09-05  9:56                             ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Glauber Costa @ 2012-09-05  9:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On 09/05/2012 01:45 PM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 05, 2012 at 01:31:56PM +0400, Glauber Costa wrote:
>>> > > I simply don't want to have to do two (or more) hierarchy walks for
>>> > > accounting on every schedule event, all that pointer chasing is stupidly
>>> > > expensive.
>> > 
>> > You wouldn't have to do more than one hierarchy walks for that. What
>> > Tejun seems to want, is the ability to not have a particular controller
>> > at some point in the tree. But if they exist, they are always together.
> Nope, as I wrote in the other reply, 

Would you mind, then, stopping for a moment and telling us what it is,
then, that you envision?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:48                           ` Glauber Costa
@ 2012-09-05  9:56                             ` Tejun Heo
  0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2012-09-05  9:56 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Peter Zijlstra, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On Wed, Sep 05, 2012 at 01:48:30PM +0400, Glauber Costa wrote:
> > Nope, as I wrote in the other reply, 
> 
> Would you mind, then, stopping for a moment and telling us what it is,
> then, that you envision?

I thought I already explained it a couple times in this thread (also
in the big thread from several months ago).  It's nearing three in the
morning here.  I'll try to explain it better tomorrow.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:32                 ` Tejun Heo
@ 2012-09-05 10:04                   ` Peter Zijlstra
  2012-09-06 20:46                     ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2012-09-05 10:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On Wed, 2012-09-05 at 02:32 -0700, Tejun Heo wrote:
> Hey, again.
> 
> On Wed, Sep 05, 2012 at 11:06:33AM +0200, Peter Zijlstra wrote:
> > Doing all this runtime is just going to make the mess even bigger,
> > because now we have to deal with even more stupid cases.
> > 
> > So either we go and try to contain this mess as proposed by Glauber or
> > we go delete controllers.. I've had it with this crap.
> 
> cpuacct is rather unique tho.  I think it's gonna be silly whether the
> hierarchy is unified or not.
> 
> 1. If they always can live on the exact same hierarchy, there's no
>    point in having the two separate.  Just merge them.
> 
> 2. If they need differing levels of granularity, they either need to
>    do it completely separately as they do now or have some form of
>    dynamic optimization if absolutely necesary.
> 
> So, I think that choice is rather separate from other issues.  If
> cpuacct is gonna be kept, I'd just keep it separate and warn that it
> incurs extra overhead for the current users if for nothing else.
> Otherwise, kill it or merge it into cpu.

Quite, hence my 'proposal' to remove cpuacct.

There was some whining last time Glauber proposed this, but the one
whining never convinced and has gone away from Linux, so lets just do
this.

Lets make cpuacct print a deprecated msg to dmesg for a few releases and
make cpu do all this.

The co-mounting stuff would have been nice for cpusets as well, knowing
all your tasks are affine to a subset of cpus allows for a few
optimizations (smaller cpumask iterations), but I guess we'll have to do
that dynamically, we'll just have to see how ugly that is.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05  9:31                       ` Glauber Costa
  2012-09-05  9:45                         ` Tejun Heo
@ 2012-09-05 10:20                         ` Peter Zijlstra
  2012-09-06 20:38                           ` Tejun Heo
  1 sibling, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2012-09-05 10:20 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tejun Heo, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On Wed, 2012-09-05 at 13:31 +0400, Glauber Costa wrote:
> 
> You wouldn't have to do more than one hierarchy walks for that. What
> Tejun seems to want, is the ability to not have a particular controller
> at some point in the tree. But if they exist, they are always together. 

Right, but the accounting is very much tied to the control structures, I
suppose we could change that, but my jet-leg addled brain isn't seeing
anything particularly nice atm.

But I don't really see the point though, this kind of interface would
only ever work for the non-controlling and controlling controller
combination (confused yet ;-), and I don't think we have many of those.

I would really rather see a simplification of the entire cgroup
interface space as opposed to making it more complex. And adding this
subtree 'feature' only makes it more complex.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05 10:20                         ` Peter Zijlstra
@ 2012-09-06 20:38                           ` Tejun Heo
  2012-09-06 22:39                             ` Glauber Costa
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-06 20:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

Hello, Peter, Glauber.

(I'm gonna write up cgroup core todos which should explain / address
this issue too.  ATM I'm a bit overwhelmed with stuff accumulated
while traveling.)

On Wed, Sep 05, 2012 at 12:20:53PM +0200, Peter Zijlstra wrote:
> But I don't really see the point though, this kind of interface would
> only ever work for the non-controlling and controlling controller
> combination (confused yet ;-), and I don't think we have many of those.

It's more than that.  One may not want to apply the same level of
granularity to different resources.  e.g. depending on the setup, IOs
may need to be further categorized and controlled than memory or vice
versa.

> I would really rather see a simplification of the entire cgroup
> interface space as opposed to making it more complex. And adding this
> subtree 'feature' only makes it more complex.

It does in the meantime but I think most of it can piggyback on the
existing css_set mechanism.  No matter what we do, this isn't gonna be
a short and easy transition.  More than half of the controllers don't
even support proper hierarchy yet.  We can't move to any kind of
unified hierarchy without getting that settled first.  I *think* I
have a plan which can mostly work now.  I'll write more later.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-05 10:04                   ` Peter Zijlstra
@ 2012-09-06 20:46                     ` Tejun Heo
  2012-09-06 21:11                       ` Paul Turner
  0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2012-09-06 20:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Glauber Costa, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers, Dhaval Giani, Frederic Weisbecker

Hello,

cc'ing Dhaval and Frederic.  They were interested in the subject
before and Dhaval was pretty vocal about cpuacct having a separate
hierarchy (or at least granularity).

On Wed, Sep 05, 2012 at 12:04:47PM +0200, Peter Zijlstra wrote:
> > cpuacct is rather unique tho.  I think it's gonna be silly whether the
> > hierarchy is unified or not.
> > 
> > 1. If they always can live on the exact same hierarchy, there's no
> >    point in having the two separate.  Just merge them.
> > 
> > 2. If they need differing levels of granularity, they either need to
> >    do it completely separately as they do now or have some form of
> >    dynamic optimization if absolutely necesary.
> > 
> > So, I think that choice is rather separate from other issues.  If
> > cpuacct is gonna be kept, I'd just keep it separate and warn that it
> > incurs extra overhead for the current users if for nothing else.
> > Otherwise, kill it or merge it into cpu.
> 
> Quite, hence my 'proposal' to remove cpuacct.
> 
> There was some whining last time Glauber proposed this, but the one
> whining never convinced and has gone away from Linux, so lets just do
> this.
> 
> Lets make cpuacct print a deprecated msg to dmesg for a few releases and
> make cpu do all this.

I like it.  Currently cpuacct is the only problematic one in this
regard (cpuset to a much lesser extent) and it would be great to make
it go away.

Dhaval, Frederic, Paul, if you guys object, please voice your
opinions.

> The co-mounting stuff would have been nice for cpusets as well, knowing
> all your tasks are affine to a subset of cpus allows for a few
> optimizations (smaller cpumask iterations), but I guess we'll have to do
> that dynamically, we'll just have to see how ugly that is.

Forced co-mounting sounds rather silly to me.  If the two are always
gonna be co-mounted, why not just merge them and switch the
functionality depending on configuration?  I'm fairly sure the code
would be simpler that way.

If cpuset and cpu being separate is important enough && the overhead
of doing things separately for cpuset isn't too high, I wouldn't
bother too much with dynamic optimization but that's your call.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-06 20:46                     ` Tejun Heo
@ 2012-09-06 21:11                       ` Paul Turner
  2012-09-06 22:36                         ` Glauber Costa
  2012-09-08 13:36                         ` Dhaval Giani
  0 siblings, 2 replies; 39+ messages in thread
From: Paul Turner @ 2012-09-06 21:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Glauber Costa, linux-kernel, cgroups, linux-mm,
	davej, ben, lennart, kay.sievers, Dhaval Giani,
	Frederic Weisbecker

On Thu, Sep 6, 2012 at 1:46 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> cc'ing Dhaval and Frederic.  They were interested in the subject
> before and Dhaval was pretty vocal about cpuacct having a separate
> hierarchy (or at least granularity).

Really?  Time just has _not_ borne out this use-case.  I'll let Dhaval
make a case for this but he should expect violent objection.

>
> On Wed, Sep 05, 2012 at 12:04:47PM +0200, Peter Zijlstra wrote:
>> > cpuacct is rather unique tho.  I think it's gonna be silly whether the
>> > hierarchy is unified or not.
>> >
>> > 1. If they always can live on the exact same hierarchy, there's no
>> >    point in having the two separate.  Just merge them.
>> >
>> > 2. If they need differing levels of granularity, they either need to
>> >    do it completely separately as they do now or have some form of
>> >    dynamic optimization if absolutely necesary.
>> >
>> > So, I think that choice is rather separate from other issues.  If
>> > cpuacct is gonna be kept, I'd just keep it separate and warn that it
>> > incurs extra overhead for the current users if for nothing else.
>> > Otherwise, kill it or merge it into cpu.
>>
>> Quite, hence my 'proposal' to remove cpuacct.
>>
>> There was some whining last time Glauber proposed this, but the one
>> whining never convinced and has gone away from Linux, so lets just do
>> this.
>>
>> Lets make cpuacct print a deprecated msg to dmesg for a few releases and
>> make cpu do all this.
>
> I like it.  Currently cpuacct is the only problematic one in this
> regard (cpuset to a much lesser extent) and it would be great to make
> it go away.
>
> Dhaval, Frederic, Paul, if you guys object, please voice your
> opinions.
>
>> The co-mounting stuff would have been nice for cpusets as well, knowing
>> all your tasks are affine to a subset of cpus allows for a few
>> optimizations (smaller cpumask iterations), but I guess we'll have to do
>> that dynamically, we'll just have to see how ugly that is.
>
> Forced co-mounting sounds rather silly to me.  If the two are always
> gonna be co-mounted, why not just merge them and switch the
> functionality depending on configuration?  I'm fairly sure the code
> would be simpler that way.

It would be simpler but the problem is we'd break any userspace that
was just doing mount cpuacct?

Further, even if it were mounting both, userspace code still has to be
changed to read from "cpu.export" instead of "cpuacct.export".

I think a sane path on this front is:

Immediately:
Don't allow cpuacct and cpu to be co-mounted on separate hierarchies
simultaneously.

That is:
mount none /dev/cgroup/cpuacct -t cgroupfs -o cpuacct : still works
mount none /dev/cgroup/cpu -t cgroupfs -o cpu : still works
mount none /dev/cgroup/cpux -t cgroupfs -o cpuacct,cpu : still works

But the combination:
mount none /dev/cgroup/cpu -t cgroupfs -o cpu : still works
mount none /dev/cgroup/cpuacct -t cgroupfs -o cpu : EINVAL [or vice versa].

Also:
WARN_ON when mounting cpuacct without cpu, strongly explaining that
ANY such configuration is deprecated.

Glauber's patchset goes most of the way towards enabling this.

In a release or two:
Make the restriction strict; don't allow individual mounting of
cpuacct, force it to be mounted ONLY with cpu.

Glauber's patchset gives us this.

Finally:
Mirror the interfaces to cpu, print nasty syslog messages about ANY
mounts of cpuacct
Follow that up by eventually removing cpuacct completely

--

In general I think this sets a hard precedent of never allowing an
accounting controller to exist with a control one for a given area,
e.g. cpu, networking, mm, etc.

In the cases where one of these exists already, any attempts to extend
(acounting or control) must extend the existing.

>
> If cpuset and cpu being separate is important enough && the overhead
> of doing things separately for cpuset isn't too high, I wouldn't
> bother too much with dynamic optimization but that's your call.
>

Given the choice we would just straight out ripped it out long ago.
Breaking the user-space ABI is the problem.

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-06 21:11                       ` Paul Turner
@ 2012-09-06 22:36                         ` Glauber Costa
  2012-09-08 13:36                         ` Dhaval Giani
  1 sibling, 0 replies; 39+ messages in thread
From: Glauber Costa @ 2012-09-06 22:36 UTC (permalink / raw)
  To: Paul Turner
  Cc: Tejun Heo, Peter Zijlstra, linux-kernel, cgroups, linux-mm,
	davej, ben, lennart, kay.sievers, Dhaval Giani,
	Frederic Weisbecker

On 09/07/2012 01:11 AM, Paul Turner wrote:
> On Thu, Sep 6, 2012 at 1:46 PM, Tejun Heo <tj@kernel.org> wrote:
>> Hello,
>>
>> cc'ing Dhaval and Frederic.  They were interested in the subject
>> before and Dhaval was pretty vocal about cpuacct having a separate
>> hierarchy (or at least granularity).
> 
> Really?  Time just has _not_ borne out this use-case.  I'll let Dhaval
> make a case for this but he should expect violent objection.
> 

I strongly advise against physical violence. In case it is really
necessary, please break his legs only.

>> On Wed, Sep 05, 2012 at 12:04:47PM +0200, Peter Zijlstra wrote:
>>>> cpuacct is rather unique tho.  I think it's gonna be silly whether the
>>>> hierarchy is unified or not.
>>>>
>>>> 1. If they always can live on the exact same hierarchy, there's no
>>>>    point in having the two separate.  Just merge them.
>>>>
>>>> 2. If they need differing levels of granularity, they either need to
>>>>    do it completely separately as they do now or have some form of
>>>>    dynamic optimization if absolutely necesary.
>>>>
>>>> So, I think that choice is rather separate from other issues.  If
>>>> cpuacct is gonna be kept, I'd just keep it separate and warn that it
>>>> incurs extra overhead for the current users if for nothing else.
>>>> Otherwise, kill it or merge it into cpu.
>>>
>>> Quite, hence my 'proposal' to remove cpuacct.
>>>
>>> There was some whining last time Glauber proposed this, but the one
>>> whining never convinced and has gone away from Linux, so lets just do
>>> this.
>>>
>>> Lets make cpuacct print a deprecated msg to dmesg for a few releases and
>>> make cpu do all this.
>>
>> I like it.  Currently cpuacct is the only problematic one in this
>> regard (cpuset to a much lesser extent) and it would be great to make
>> it go away.
>>
>> Dhaval, Frederic, Paul, if you guys object, please voice your
>> opinions.
>>
>>> The co-mounting stuff would have been nice for cpusets as well, knowing
>>> all your tasks are affine to a subset of cpus allows for a few
>>> optimizations (smaller cpumask iterations), but I guess we'll have to do
>>> that dynamically, we'll just have to see how ugly that is.
>>
>> Forced co-mounting sounds rather silly to me.  If the two are always
>> gonna be co-mounted, why not just merge them and switch the
>> functionality depending on configuration?  I'm fairly sure the code
>> would be simpler that way.
> 
> It would be simpler but the problem is we'd break any userspace that
> was just doing mount cpuacct?
> 
> Further, even if it were mounting both, userspace code still has to be
> changed to read from "cpu.export" instead of "cpuacct.export".
> 

Only if we remove cpuacct. What we can do, and I thought about doing, is
just merging cpuacct functionality into cpu. Then we move cpuacct to
default no. It will be there for userspace if they absolutely want to
use it.

> I think a sane path on this front is:
> 
> Immediately:
> Don't allow cpuacct and cpu to be co-mounted on separate hierarchies
> simultaneously.
> 
that is precisely what my patch does, except it is a bit more generic.

> That is:
> mount none /dev/cgroup/cpuacct -t cgroupfs -o cpuacct : still works
> mount none /dev/cgroup/cpu -t cgroupfs -o cpu : still works
> mount none /dev/cgroup/cpux -t cgroupfs -o cpuacct,cpu : still works
> 
> But the combination:
> mount none /dev/cgroup/cpu -t cgroupfs -o cpu : still works
> mount none /dev/cgroup/cpuacct -t cgroupfs -o cpu : EINVAL [or vice versa].
> 
> Also:
> WARN_ON when mounting cpuacct without cpu, strongly explaining that
> ANY such configuration is deprecated.
> 
> Glauber's patchset goes most of the way towards enabling this.
>
yes.

> In a release or two:
> Make the restriction strict; don't allow individual mounting of
> cpuacct, force it to be mounted ONLY with cpu.
> 
> Glauber's patchset gives us this.
> 
> Finally:
> Mirror the interfaces to cpu, print nasty syslog messages about ANY
> mounts of cpuacct
> Follow that up by eventually removing cpuacct completely
>
Why don't start with mirroring? It gives more time for people to start
switching to it.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-06 20:38                           ` Tejun Heo
@ 2012-09-06 22:39                             ` Glauber Costa
  2012-09-06 22:45                               ` Tejun Heo
  0 siblings, 1 reply; 39+ messages in thread
From: Glauber Costa @ 2012-09-06 22:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

On 09/07/2012 12:38 AM, Tejun Heo wrote:
> Hello, Peter, Glauber.
> 
> (I'm gonna write up cgroup core todos which should explain / address
> this issue too.  ATM I'm a bit overwhelmed with stuff accumulated
> while traveling.)
> 

Yes, please.

While you rightfully claim that you explained it a couple of times, it
all seems to be quite fuzzy. I don't blame it on you: the current state
of the interface leads to this.

So another detailed explanation of what you envision at this point,
considering the discussions we had in the previous days, would be really
helpful,


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-06 22:39                             ` Glauber Costa
@ 2012-09-06 22:45                               ` Tejun Heo
  0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2012-09-06 22:45 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Peter Zijlstra, linux-kernel, cgroups, linux-mm, davej, ben, pjt,
	lennart, kay.sievers

Hello, Glauber.

On Thu, Sep 6, 2012 at 3:39 PM, Glauber Costa <glommer@parallels.com> wrote:
> Yes, please.
>
> While you rightfully claim that you explained it a couple of times, it
> all seems to be quite fuzzy. I don't blame it on you: the current state
> of the interface leads to this.

Heh, I drank two cups of coffee and two glasses of wine that evening.
Coffee won and I couldn't sleep till around 4am with splitting
headache. I'm not too confident about what I wrote that night. :)

> So another detailed explanation of what you envision at this point,
> considering the discussions we had in the previous days, would be really
> helpful,

Definitely, will do. Please give me a few days to sort through
immediately pending stuff.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC 0/5] forced comounts for cgroups.
  2012-09-06 21:11                       ` Paul Turner
  2012-09-06 22:36                         ` Glauber Costa
@ 2012-09-08 13:36                         ` Dhaval Giani
  1 sibling, 0 replies; 39+ messages in thread
From: Dhaval Giani @ 2012-09-08 13:36 UTC (permalink / raw)
  To: Paul Turner
  Cc: Tejun Heo, Peter Zijlstra, Glauber Costa, linux-kernel, cgroups,
	linux-mm, davej, ben, lennart, kay.sievers, Frederic Weisbecker,
	Balbir Singh, Bharata B Rao

On Thu, Sep 6, 2012 at 5:11 PM, Paul Turner <pjt@google.com> wrote:
> On Thu, Sep 6, 2012 at 1:46 PM, Tejun Heo <tj@kernel.org> wrote:
>> Hello,
>>
>> cc'ing Dhaval and Frederic.  They were interested in the subject
>> before and Dhaval was pretty vocal about cpuacct having a separate
>> hierarchy (or at least granularity).
>
> Really?  Time just has _not_ borne out this use-case.  I'll let Dhaval
> make a case for this but he should expect violent objection.
>

I am not objecting directly! I am aware of a few users who are (or at
least were) using cpu and cpuacct separately because they want to be
able to account without control. Having said that, there are tons of
flaws in the current approach, because the accounting without control
is just plain wrong. I have copied a few other folks who might be able
to shed light on those users and if we should still consider them.

[And the lesser number of controllers, the better it is!]

Thanks!
Dhaval

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2012-09-08 13:36 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-04 14:18 [RFC 0/5] forced comounts for cgroups Glauber Costa
2012-09-04 14:18 ` [RFC 1/5] cgroup: allow some comounts to be forced Glauber Costa
2012-09-04 14:18 ` [RFC 2/5] sched: adjust exec_clock to use it as cpu usage metric Glauber Costa
2012-09-04 14:18 ` [RFC 3/5] sched: do not call cpuacct_charge when cpu and cpuacct are comounted Glauber Costa
2012-09-04 14:18 ` [RFC 4/5] cpuacct: do not gather cpuacct statistics when not mounted Glauber Costa
2012-09-04 14:18 ` [RFC 5/5] sched: add cpusets to comounts list Glauber Costa
2012-09-04 21:46 ` [RFC 0/5] forced comounts for cgroups Tejun Heo
2012-09-05  8:03   ` Glauber Costa
2012-09-05  8:14     ` Tejun Heo
2012-09-05  8:17       ` Glauber Costa
2012-09-05  8:29         ` Tejun Heo
2012-09-05  8:35           ` Glauber Costa
2012-09-05  8:47             ` Tejun Heo
2012-09-05  8:55               ` Glauber Costa
2012-09-05  9:07                 ` Tejun Heo
2012-09-05  9:06                   ` Glauber Costa
2012-09-05  9:14                     ` Tejun Heo
2012-09-05  9:06               ` Peter Zijlstra
2012-09-05  9:07                 ` Peter Zijlstra
2012-09-05  9:22                   ` Tejun Heo
2012-09-05  9:11                 ` Tejun Heo
2012-09-05  9:12                   ` Glauber Costa
2012-09-05  9:19                     ` Tejun Heo
2012-09-05  9:30                       ` Glauber Costa
2012-09-05  9:26                     ` Peter Zijlstra
2012-09-05  9:31                       ` Glauber Costa
2012-09-05  9:45                         ` Tejun Heo
2012-09-05  9:48                           ` Glauber Costa
2012-09-05  9:56                             ` Tejun Heo
2012-09-05 10:20                         ` Peter Zijlstra
2012-09-06 20:38                           ` Tejun Heo
2012-09-06 22:39                             ` Glauber Costa
2012-09-06 22:45                               ` Tejun Heo
2012-09-05  9:32                 ` Tejun Heo
2012-09-05 10:04                   ` Peter Zijlstra
2012-09-06 20:46                     ` Tejun Heo
2012-09-06 21:11                       ` Paul Turner
2012-09-06 22:36                         ` Glauber Costa
2012-09-08 13:36                         ` Dhaval Giani

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).