linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
@ 2012-09-10 22:31 Tejun Heo
  2012-09-10 22:33 ` [PATCH REPOST " Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-10 22:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michal Hocko, Li Zefan, Glauber Costa, Peter Zijlstra,
	Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn,
	Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Currently, cgroup hierarchy support is a mess.  cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children.  blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup.  Others show yet different behaviors.

These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy.  Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support.  The goal of
this patch is two-fold.

* Move users away from using hierarchy on currently non-hierarchical
  subsystems, so that implementing proper hierarchy support on those
  doesn't surprise them.

* Keep track of which controllers are broken how and nudge the
  subsystems to implement proper hierarchy support.

For now, start with a single warning message.  We can whine louder
later on.

(I tried to document what's broken and how it should be fixed.  If I
 got something wrong, please let me know.)

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 block/blk-cgroup.c        |    8 ++++++++
 include/linux/cgroup.h    |   15 +++++++++++++++
 kernel/cgroup.c           |   11 ++++++++++-
 kernel/cgroup_freezer.c   |    8 ++++++++
 kernel/events/core.c      |    7 +++++++
 mm/memcontrol.c           |   12 +++++++++---
 net/core/netprio_cgroup.c |   12 +++++++++++-
 net/sched/cls_cgroup.c    |    9 +++++++++
 security/device_cgroup.c  |    9 +++++++++
 9 files changed, 86 insertions(+), 5 deletions(-)

--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -737,6 +737,14 @@ struct cgroup_subsys blkio_subsys = {
 	.subsys_id = blkio_subsys_id,
 	.base_cftypes = blkcg_files,
 	.module = THIS_MODULE,
+
+	/*
+	 * blkio subsystem is utterly broken in terms of hierarchy support.
+	 * It treats all cgroups equally regardless of where they're
+	 * located in the hierarchy - all cgroups are treated as if they're
+	 * right below the root.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 EXPORT_SYMBOL_GPL(blkio_subsys);
 
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -503,6 +503,21 @@ struct cgroup_subsys {
 	 */
 	bool __DEPRECATED_clear_css_refs;
 
+	/*
+	 * If %false, this subsystem is properly hierarchical -
+	 * configuration, resource accounting and restriction on a parent
+	 * cgroup cover those of its children.  If %true, hierarchy support
+	 * is broken in some ways - some subsystems ignore hierarchy
+	 * completely while others are only implemented half-way.
+	 *
+	 * It's now diallowed to create nested cgroups if the subsystem is
+	 * broken and cgroup core will emit a warning message on such
+	 * cases.  Eventually, all subsystems will be made properly
+	 * hierarchical and this will go away.
+	 */
+	bool broken_hierarchy;
+	bool warned_broken_hierarchy;
+
 #define MAX_CGROUP_TYPE_NAMELEN 32
 	const char *name;
 
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4075,8 +4075,17 @@ static long cgroup_create(struct cgroup 
 		set_bit(CGRP_CLONE_CHILDREN, &cgrp->flags);
 
 	for_each_subsys(root, ss) {
-		struct cgroup_subsys_state *css = ss->create(cgrp);
+		struct cgroup_subsys_state *css;
+
+		if (ss->broken_hierarchy && !ss->warned_broken_hierarchy &&
+		    parent->parent) {
+			pr_warning("cgroup: \"%s\" is not properly hierarchical yet, do not nest cgroups.\n",
+				   ss->name);
+			pr_warning("cgroup: \"memory\" requires setting use_hierarchy to 1 on the root.\n");
+			ss->warned_broken_hierarchy = true;
+		}
 
+		css = ss->create(cgrp);
 		if (IS_ERR(css)) {
 			err = PTR_ERR(css);
 			goto err_destroy;
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -373,4 +373,12 @@ struct cgroup_subsys freezer_subsys = {
 	.can_attach	= freezer_can_attach,
 	.fork		= freezer_fork,
 	.base_cftypes	= files,
+
+	/*
+	 * freezer subsys doesn't handle hierarchy at all.  Frozen state
+	 * should be inherited through the hierarchy - if a parent is
+	 * frozen, all its children should be frozen.  Fix it and remove
+	 * the following.
+	 */
+	.broken_hierarchy = true,
 };
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7285,5 +7285,12 @@ struct cgroup_subsys perf_subsys = {
 	.destroy	= perf_cgroup_destroy,
 	.exit		= perf_cgroup_exit,
 	.attach		= perf_cgroup_attach,
+
+	/*
+	 * perf_event cgroup doesn't handle nesting correctly.
+	 * ctx->nr_cgroups adjustments should be propagated through the
+	 * cgroup hierarchy.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 #endif /* CONFIG_CGROUP_PERF */
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3855,12 +3855,17 @@ static int mem_cgroup_hierarchy_write(st
 	 */
 	if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
 				(val == 1 || val == 0)) {
-		if (list_empty(&cont->children))
+		if (list_empty(&cont->children)) {
 			memcg->use_hierarchy = val;
-		else
+			/* we're fully hierarchical iff root uses hierarchy */
+			if (mem_cgroup_is_root(memcg))
+				mem_cgroup_subsys.broken_hierarchy = !val;
+		} else {
 			retval = -EBUSY;
-	} else
+		}
+	} else {
 		retval = -EINVAL;
+	}
 
 out:
 	cgroup_unlock();
@@ -4953,6 +4958,7 @@ mem_cgroup_create(struct cgroup *cont)
 						&per_cpu(memcg_stock, cpu);
 			INIT_WORK(&stock->work, drain_local_stock);
 		}
+		mem_cgroup_subsys.broken_hierarchy = !memcg->use_hierarchy;
 		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -330,7 +330,17 @@ struct cgroup_subsys net_prio_subsys = {
 	.subsys_id	= net_prio_subsys_id,
 #endif
 	.base_cftypes	= ss_files,
-	.module		= THIS_MODULE
+	.module		= THIS_MODULE,
+
+	/*
+	 * net_prio has artificial limit on the number of cgroups and
+	 * disallows nesting making it impossible to co-mount it with other
+	 * hierarchical subsystems.  Remove the artificially low PRIOIDX_SZ
+	 * limit and properly nest configuration such that children follow
+	 * their parents' configurations by default and are allowed to
+	 * override and remove the following.
+	 */
+	.broken_hierarchy = trye,
 };
 
 static int netprio_device_event(struct notifier_block *unused,
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -82,6 +82,15 @@ struct cgroup_subsys net_cls_subsys = {
 #endif
 	.base_cftypes	= ss_files,
 	.module		= THIS_MODULE,
+
+	/*
+	 * While net_cls cgroup has the rudimentary hierarchy support of
+	 * inheriting the parent's classid on cgroup creation, it doesn't
+	 * properly propagates config changes in ancestors to their
+	 * descendents.  A child should follow the parent's configuration
+	 * but be allowed to override it.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 
 struct cls_cgroup_head {
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -457,6 +457,15 @@ struct cgroup_subsys devices_subsys = {
 	.destroy = devcgroup_destroy,
 	.subsys_id = devices_subsys_id,
 	.base_cftypes = dev_cgroup_files,
+
+	/*
+	 * While devices cgroup has the rudimentary hierarchy support which
+	 * checks the parent's restriction, it doesn't properly propagates
+	 * config changes in ancestors to their descendents.  A child
+	 * should only be allowed to add more restrictions to the parent's
+	 * configuration.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 
 int __devcgroup_inode_permission(struct inode *inode, int mask)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-10 22:31 [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them Tejun Heo
@ 2012-09-10 22:33 ` Tejun Heo
  2012-09-11 10:04   ` Michal Hocko
                     ` (3 more replies)
  2012-09-11 14:51 ` [PATCH " Vivek Goyal
       [not found] ` <505055E5.90903@parallels.com>
  2 siblings, 4 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-10 22:33 UTC (permalink / raw)
  To: linux-kernel, containers, cgroups
  Cc: Michal Hocko, Li Zefan, Glauber Costa, Peter Zijlstra,
	Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn,
	Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

(forgot cc'ing containers / cgroups mailing lists and used the old
 address for Li.  Reposting.  Sorry for the noise.)

Currently, cgroup hierarchy support is a mess.  cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children.  blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup.  Others show yet different behaviors.

These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy.  Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support.  The goal of
this patch is two-fold.

* Move users away from using hierarchy on currently non-hierarchical
  subsystems, so that implementing proper hierarchy support on those
  doesn't surprise them.

* Keep track of which controllers are broken how and nudge the
  subsystems to implement proper hierarchy support.

For now, start with a single warning message.  We can whine louder
later on.

(I tried to document what's broken and how it should be fixed.  If I
 got something wrong, please let me know.)

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 block/blk-cgroup.c        |    8 ++++++++
 include/linux/cgroup.h    |   15 +++++++++++++++
 kernel/cgroup.c           |   11 ++++++++++-
 kernel/cgroup_freezer.c   |    8 ++++++++
 kernel/events/core.c      |    7 +++++++
 mm/memcontrol.c           |   12 +++++++++---
 net/core/netprio_cgroup.c |   12 +++++++++++-
 net/sched/cls_cgroup.c    |    9 +++++++++
 security/device_cgroup.c  |    9 +++++++++
 9 files changed, 86 insertions(+), 5 deletions(-)

--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -737,6 +737,14 @@ struct cgroup_subsys blkio_subsys = {
 	.subsys_id = blkio_subsys_id,
 	.base_cftypes = blkcg_files,
 	.module = THIS_MODULE,
+
+	/*
+	 * blkio subsystem is utterly broken in terms of hierarchy support.
+	 * It treats all cgroups equally regardless of where they're
+	 * located in the hierarchy - all cgroups are treated as if they're
+	 * right below the root.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 EXPORT_SYMBOL_GPL(blkio_subsys);
 
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -503,6 +503,21 @@ struct cgroup_subsys {
 	 */
 	bool __DEPRECATED_clear_css_refs;
 
+	/*
+	 * If %false, this subsystem is properly hierarchical -
+	 * configuration, resource accounting and restriction on a parent
+	 * cgroup cover those of its children.  If %true, hierarchy support
+	 * is broken in some ways - some subsystems ignore hierarchy
+	 * completely while others are only implemented half-way.
+	 *
+	 * It's now diallowed to create nested cgroups if the subsystem is
+	 * broken and cgroup core will emit a warning message on such
+	 * cases.  Eventually, all subsystems will be made properly
+	 * hierarchical and this will go away.
+	 */
+	bool broken_hierarchy;
+	bool warned_broken_hierarchy;
+
 #define MAX_CGROUP_TYPE_NAMELEN 32
 	const char *name;
 
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4075,8 +4075,17 @@ static long cgroup_create(struct cgroup 
 		set_bit(CGRP_CLONE_CHILDREN, &cgrp->flags);
 
 	for_each_subsys(root, ss) {
-		struct cgroup_subsys_state *css = ss->create(cgrp);
+		struct cgroup_subsys_state *css;
+
+		if (ss->broken_hierarchy && !ss->warned_broken_hierarchy &&
+		    parent->parent) {
+			pr_warning("cgroup: \"%s\" is not properly hierarchical yet, do not nest cgroups.\n",
+				   ss->name);
+			pr_warning("cgroup: \"memory\" requires setting use_hierarchy to 1 on the root.\n");
+			ss->warned_broken_hierarchy = true;
+		}
 
+		css = ss->create(cgrp);
 		if (IS_ERR(css)) {
 			err = PTR_ERR(css);
 			goto err_destroy;
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -373,4 +373,12 @@ struct cgroup_subsys freezer_subsys = {
 	.can_attach	= freezer_can_attach,
 	.fork		= freezer_fork,
 	.base_cftypes	= files,
+
+	/*
+	 * freezer subsys doesn't handle hierarchy at all.  Frozen state
+	 * should be inherited through the hierarchy - if a parent is
+	 * frozen, all its children should be frozen.  Fix it and remove
+	 * the following.
+	 */
+	.broken_hierarchy = true,
 };
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7285,5 +7285,12 @@ struct cgroup_subsys perf_subsys = {
 	.destroy	= perf_cgroup_destroy,
 	.exit		= perf_cgroup_exit,
 	.attach		= perf_cgroup_attach,
+
+	/*
+	 * perf_event cgroup doesn't handle nesting correctly.
+	 * ctx->nr_cgroups adjustments should be propagated through the
+	 * cgroup hierarchy.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 #endif /* CONFIG_CGROUP_PERF */
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3855,12 +3855,17 @@ static int mem_cgroup_hierarchy_write(st
 	 */
 	if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
 				(val == 1 || val == 0)) {
-		if (list_empty(&cont->children))
+		if (list_empty(&cont->children)) {
 			memcg->use_hierarchy = val;
-		else
+			/* we're fully hierarchical iff root uses hierarchy */
+			if (mem_cgroup_is_root(memcg))
+				mem_cgroup_subsys.broken_hierarchy = !val;
+		} else {
 			retval = -EBUSY;
-	} else
+		}
+	} else {
 		retval = -EINVAL;
+	}
 
 out:
 	cgroup_unlock();
@@ -4953,6 +4958,7 @@ mem_cgroup_create(struct cgroup *cont)
 						&per_cpu(memcg_stock, cpu);
 			INIT_WORK(&stock->work, drain_local_stock);
 		}
+		mem_cgroup_subsys.broken_hierarchy = !memcg->use_hierarchy;
 		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -330,7 +330,17 @@ struct cgroup_subsys net_prio_subsys = {
 	.subsys_id	= net_prio_subsys_id,
 #endif
 	.base_cftypes	= ss_files,
-	.module		= THIS_MODULE
+	.module		= THIS_MODULE,
+
+	/*
+	 * net_prio has artificial limit on the number of cgroups and
+	 * disallows nesting making it impossible to co-mount it with other
+	 * hierarchical subsystems.  Remove the artificially low PRIOIDX_SZ
+	 * limit and properly nest configuration such that children follow
+	 * their parents' configurations by default and are allowed to
+	 * override and remove the following.
+	 */
+	.broken_hierarchy = trye,
 };
 
 static int netprio_device_event(struct notifier_block *unused,
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -82,6 +82,15 @@ struct cgroup_subsys net_cls_subsys = {
 #endif
 	.base_cftypes	= ss_files,
 	.module		= THIS_MODULE,
+
+	/*
+	 * While net_cls cgroup has the rudimentary hierarchy support of
+	 * inheriting the parent's classid on cgroup creation, it doesn't
+	 * properly propagates config changes in ancestors to their
+	 * descendents.  A child should follow the parent's configuration
+	 * but be allowed to override it.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 
 struct cls_cgroup_head {
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -457,6 +457,15 @@ struct cgroup_subsys devices_subsys = {
 	.destroy = devcgroup_destroy,
 	.subsys_id = devices_subsys_id,
 	.base_cftypes = dev_cgroup_files,
+
+	/*
+	 * While devices cgroup has the rudimentary hierarchy support which
+	 * checks the parent's restriction, it doesn't properly propagates
+	 * config changes in ancestors to their descendents.  A child
+	 * should only be allowed to add more restrictions to the parent's
+	 * configuration.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 
 int __devcgroup_inode_permission(struct inode *inode, int mask)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-10 22:33 ` [PATCH REPOST " Tejun Heo
@ 2012-09-11 10:04   ` Michal Hocko
  2012-09-11 17:07     ` Tejun Heo
       [not found]     ` <5050568B.9090601@parallels.com>
  2012-09-11 12:38   ` Li Zefan
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 41+ messages in thread
From: Michal Hocko @ 2012-09-11 10:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, containers, cgroups, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

I like the approach in general but see the comments bellow:

On Mon 10-09-12 15:33:55, Tejun Heo wrote:
[...]
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3855,12 +3855,17 @@ static int mem_cgroup_hierarchy_write(st
>  	 */
>  	if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
>  				(val == 1 || val == 0)) {
> -		if (list_empty(&cont->children))
> +		if (list_empty(&cont->children)) {
>  			memcg->use_hierarchy = val;
> -		else
> +			/* we're fully hierarchical iff root uses hierarchy */
> +			if (mem_cgroup_is_root(memcg))
> +				mem_cgroup_subsys.broken_hierarchy = !val;
> +		} else {
>  			retval = -EBUSY;
> -	} else
> +		}
> +	} else {
>  		retval = -EINVAL;
> +	}
>  
>  out:
>  	cgroup_unlock();
> @@ -4953,6 +4958,7 @@ mem_cgroup_create(struct cgroup *cont)
>  						&per_cpu(memcg_stock, cpu);
>  			INIT_WORK(&stock->work, drain_local_stock);
>  		}
> +		mem_cgroup_subsys.broken_hierarchy = !memcg->use_hierarchy;

Hmmm, this will warn even if we have
root (default use_hierarchy=0)
 \
  A (use_hierarchy=1)
   \
    B <- here

which is unfortunate because it will add a noise to a reasonable
configuration.
I think this is fixable if you move the warning after 
cgroup_subsys_state::create and broken_hierarchy would be set only if
parent is not root and use_hierarchy==0 in mem_cgroup_create. Something
like:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 795e525..d5c93ab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4973,6 +4973,13 @@ mem_cgroup_create(struct cgroup *cont)
 	} else {
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
+		/*
+		 * Deeper hierachy with use_hierarchy == false doesn't make
+		 * much sense so let cgroup subsystem know about this unfortunate
+		 * state in our controller.
+		 */
+		if (parent && parent != root_mem_cgroup)
+			mem_cgroup_subsys.broken_hierarchy = true;
 	}
 	memcg->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&memcg->oom_notify);

What do you think?

>  		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -330,7 +330,17 @@ struct cgroup_subsys net_prio_subsys = {
>  	.subsys_id	= net_prio_subsys_id,
>  #endif
>  	.base_cftypes	= ss_files,
> -	.module		= THIS_MODULE
> +	.module		= THIS_MODULE,
> +
> +	/*
> +	 * net_prio has artificial limit on the number of cgroups and
> +	 * disallows nesting making it impossible to co-mount it with other
> +	 * hierarchical subsystems.  Remove the artificially low PRIOIDX_SZ
> +	 * limit and properly nest configuration such that children follow
> +	 * their parents' configurations by default and are allowed to
> +	 * override and remove the following.
> +	 */
> +	.broken_hierarchy = trye,

typo

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-10 22:33 ` [PATCH REPOST " Tejun Heo
  2012-09-11 10:04   ` Michal Hocko
@ 2012-09-11 12:38   ` Li Zefan
  2012-09-11 17:08     ` Tejun Heo
  2012-09-11 18:23   ` [PATCH UPDATED " Tejun Heo
  2012-09-13 12:16   ` [PATCH REPOST " Daniel P. Berrange
  3 siblings, 1 reply; 41+ messages in thread
From: Li Zefan @ 2012-09-11 12:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, containers, cgroups, Michal Hocko, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On 2012/9/11 6:33, Tejun Heo wrote:
> (forgot cc'ing containers / cgroups mailing lists and used the old
>  address for Li.  Reposting.  Sorry for the noise.)
> 
> Currently, cgroup hierarchy support is a mess.  cpu related subsystems
> behave correctly - configuration, accounting and control on a parent
> properly cover its children.  blkio and freezer completely ignore
> hierarchy and treat all cgroups as if they're directly under the root
> cgroup.  Others show yet different behaviors.
> 
> These differing interpretations of cgroup hierarchy make using cgroup
> confusing and it impossible to co-mount controllers into the same
> hierarchy and obtain sane behavior.
> 
> Eventually, we want full hierarchy support from all subsystems and
> probably a unified hierarchy.  Users using separate hierarchies
> expecting completely different behaviors depending on the mounted
> subsystem is deterimental to making any progress on this front.
> 
> This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
> for controllers which are lacking in hierarchy support.  The goal of
> this patch is two-fold.
> 
> * Move users away from using hierarchy on currently non-hierarchical
>   subsystems, so that implementing proper hierarchy support on those
>   doesn't surprise them.
> 
> * Keep track of which controllers are broken how and nudge the
>   subsystems to implement proper hierarchy support.
> 
> For now, start with a single warning message.  We can whine louder
> later on.
> 
> (I tried to document what's broken and how it should be fixed.  If I
>  got something wrong, please let me know.)
> 

So isn't cpuset broken too? child cpuset's cpu mask isn't necessary a subset
of the parent's if the cpu_exclusive flag is not set.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-10 22:31 [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them Tejun Heo
  2012-09-10 22:33 ` [PATCH REPOST " Tejun Heo
@ 2012-09-11 14:51 ` Vivek Goyal
  2012-09-11 14:54   ` Vivek Goyal
  2012-09-11 17:16   ` Tejun Heo
       [not found] ` <505055E5.90903@parallels.com>
  2 siblings, 2 replies; 41+ messages in thread
From: Vivek Goyal @ 2012-09-11 14:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Michal Hocko, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Mon, Sep 10, 2012 at 03:31:25PM -0700, Tejun Heo wrote:
> Currently, cgroup hierarchy support is a mess.  cpu related subsystems
> behave correctly - configuration, accounting and control on a parent
> properly cover its children.  blkio and freezer completely ignore
> hierarchy and treat all cgroups as if they're directly under the root
> cgroup.  Others show yet different behaviors.
> 
> These differing interpretations of cgroup hierarchy make using cgroup
> confusing and it impossible to co-mount controllers into the same
> hierarchy and obtain sane behavior.
> 
> Eventually, we want full hierarchy support from all subsystems and
> probably a unified hierarchy.  Users using separate hierarchies
> expecting completely different behaviors depending on the mounted
> subsystem is deterimental to making any progress on this front.
> 
> This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
> for controllers which are lacking in hierarchy support.  The goal of
> this patch is two-fold.
> 
> * Move users away from using hierarchy on currently non-hierarchical
>   subsystems, so that implementing proper hierarchy support on those
>   doesn't surprise them.

I know two current/potential users. systemd and libvirt. They are
anyway going to create hierarchy irrespective of the fact whether
controller supports it or not.

So even if we start screaming, nothing is going to change there, I 
suspect. Just that by default they expect every controller supports
hiearchies. 

> 
> * Keep track of which controllers are broken how and nudge the
>   subsystems to implement proper hierarchy support.

I thought we can easily keep track of this in a simple .txt file and
we really don't have to  provide explicit warnings.

I think for these controllers it is a known fact that they don't support
hiearchy yet. I am skeptical that providing explicit warnings is going
to help.

Thanks
Vivek


> 
> For now, start with a single warning message.  We can whine louder
> later on.
> 
> (I tried to document what's broken and how it should be fixed.  If I
>  got something wrong, please let me know.)
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Li Zefan <lizf@cn.fujitsu.com>
> Cc: Glauber Costa <glommer@parallels.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Paul Turner <pjt@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Thomas Graf <tgraf@suug.ch>
> Cc: Serge E. Hallyn <serue@us.ibm.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
> Cc: Neil Horman <nhorman@tuxdriver.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  block/blk-cgroup.c        |    8 ++++++++
>  include/linux/cgroup.h    |   15 +++++++++++++++
>  kernel/cgroup.c           |   11 ++++++++++-
>  kernel/cgroup_freezer.c   |    8 ++++++++
>  kernel/events/core.c      |    7 +++++++
>  mm/memcontrol.c           |   12 +++++++++---
>  net/core/netprio_cgroup.c |   12 +++++++++++-
>  net/sched/cls_cgroup.c    |    9 +++++++++
>  security/device_cgroup.c  |    9 +++++++++
>  9 files changed, 86 insertions(+), 5 deletions(-)
> 
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -737,6 +737,14 @@ struct cgroup_subsys blkio_subsys = {
>  	.subsys_id = blkio_subsys_id,
>  	.base_cftypes = blkcg_files,
>  	.module = THIS_MODULE,
> +
> +	/*
> +	 * blkio subsystem is utterly broken in terms of hierarchy support.
> +	 * It treats all cgroups equally regardless of where they're
> +	 * located in the hierarchy - all cgroups are treated as if they're
> +	 * right below the root.  Fix it and remove the following.
> +	 */
> +	.broken_hierarchy = true,
>  };
>  EXPORT_SYMBOL_GPL(blkio_subsys);
>  
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -503,6 +503,21 @@ struct cgroup_subsys {
>  	 */
>  	bool __DEPRECATED_clear_css_refs;
>  
> +	/*
> +	 * If %false, this subsystem is properly hierarchical -
> +	 * configuration, resource accounting and restriction on a parent
> +	 * cgroup cover those of its children.  If %true, hierarchy support
> +	 * is broken in some ways - some subsystems ignore hierarchy
> +	 * completely while others are only implemented half-way.
> +	 *
> +	 * It's now diallowed to create nested cgroups if the subsystem is
> +	 * broken and cgroup core will emit a warning message on such
> +	 * cases.  Eventually, all subsystems will be made properly
> +	 * hierarchical and this will go away.
> +	 */
> +	bool broken_hierarchy;
> +	bool warned_broken_hierarchy;
> +
>  #define MAX_CGROUP_TYPE_NAMELEN 32
>  	const char *name;
>  
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4075,8 +4075,17 @@ static long cgroup_create(struct cgroup 
>  		set_bit(CGRP_CLONE_CHILDREN, &cgrp->flags);
>  
>  	for_each_subsys(root, ss) {
> -		struct cgroup_subsys_state *css = ss->create(cgrp);
> +		struct cgroup_subsys_state *css;
> +
> +		if (ss->broken_hierarchy && !ss->warned_broken_hierarchy &&
> +		    parent->parent) {
> +			pr_warning("cgroup: \"%s\" is not properly hierarchical yet, do not nest cgroups.\n",
> +				   ss->name);
> +			pr_warning("cgroup: \"memory\" requires setting use_hierarchy to 1 on the root.\n");
> +			ss->warned_broken_hierarchy = true;
> +		}
>  
> +		css = ss->create(cgrp);
>  		if (IS_ERR(css)) {
>  			err = PTR_ERR(css);
>  			goto err_destroy;
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -373,4 +373,12 @@ struct cgroup_subsys freezer_subsys = {
>  	.can_attach	= freezer_can_attach,
>  	.fork		= freezer_fork,
>  	.base_cftypes	= files,
> +
> +	/*
> +	 * freezer subsys doesn't handle hierarchy at all.  Frozen state
> +	 * should be inherited through the hierarchy - if a parent is
> +	 * frozen, all its children should be frozen.  Fix it and remove
> +	 * the following.
> +	 */
> +	.broken_hierarchy = true,
>  };
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7285,5 +7285,12 @@ struct cgroup_subsys perf_subsys = {
>  	.destroy	= perf_cgroup_destroy,
>  	.exit		= perf_cgroup_exit,
>  	.attach		= perf_cgroup_attach,
> +
> +	/*
> +	 * perf_event cgroup doesn't handle nesting correctly.
> +	 * ctx->nr_cgroups adjustments should be propagated through the
> +	 * cgroup hierarchy.  Fix it and remove the following.
> +	 */
> +	.broken_hierarchy = true,
>  };
>  #endif /* CONFIG_CGROUP_PERF */
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3855,12 +3855,17 @@ static int mem_cgroup_hierarchy_write(st
>  	 */
>  	if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
>  				(val == 1 || val == 0)) {
> -		if (list_empty(&cont->children))
> +		if (list_empty(&cont->children)) {
>  			memcg->use_hierarchy = val;
> -		else
> +			/* we're fully hierarchical iff root uses hierarchy */
> +			if (mem_cgroup_is_root(memcg))
> +				mem_cgroup_subsys.broken_hierarchy = !val;
> +		} else {
>  			retval = -EBUSY;
> -	} else
> +		}
> +	} else {
>  		retval = -EINVAL;
> +	}
>  
>  out:
>  	cgroup_unlock();
> @@ -4953,6 +4958,7 @@ mem_cgroup_create(struct cgroup *cont)
>  						&per_cpu(memcg_stock, cpu);
>  			INIT_WORK(&stock->work, drain_local_stock);
>  		}
> +		mem_cgroup_subsys.broken_hierarchy = !memcg->use_hierarchy;
>  		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>  	} else {
>  		parent = mem_cgroup_from_cont(cont->parent);
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -330,7 +330,17 @@ struct cgroup_subsys net_prio_subsys = {
>  	.subsys_id	= net_prio_subsys_id,
>  #endif
>  	.base_cftypes	= ss_files,
> -	.module		= THIS_MODULE
> +	.module		= THIS_MODULE,
> +
> +	/*
> +	 * net_prio has artificial limit on the number of cgroups and
> +	 * disallows nesting making it impossible to co-mount it with other
> +	 * hierarchical subsystems.  Remove the artificially low PRIOIDX_SZ
> +	 * limit and properly nest configuration such that children follow
> +	 * their parents' configurations by default and are allowed to
> +	 * override and remove the following.
> +	 */
> +	.broken_hierarchy = trye,
>  };
>  
>  static int netprio_device_event(struct notifier_block *unused,
> --- a/net/sched/cls_cgroup.c
> +++ b/net/sched/cls_cgroup.c
> @@ -82,6 +82,15 @@ struct cgroup_subsys net_cls_subsys = {
>  #endif
>  	.base_cftypes	= ss_files,
>  	.module		= THIS_MODULE,
> +
> +	/*
> +	 * While net_cls cgroup has the rudimentary hierarchy support of
> +	 * inheriting the parent's classid on cgroup creation, it doesn't
> +	 * properly propagates config changes in ancestors to their
> +	 * descendents.  A child should follow the parent's configuration
> +	 * but be allowed to override it.  Fix it and remove the following.
> +	 */
> +	.broken_hierarchy = true,
>  };
>  
>  struct cls_cgroup_head {
> --- a/security/device_cgroup.c
> +++ b/security/device_cgroup.c
> @@ -457,6 +457,15 @@ struct cgroup_subsys devices_subsys = {
>  	.destroy = devcgroup_destroy,
>  	.subsys_id = devices_subsys_id,
>  	.base_cftypes = dev_cgroup_files,
> +
> +	/*
> +	 * While devices cgroup has the rudimentary hierarchy support which
> +	 * checks the parent's restriction, it doesn't properly propagates
> +	 * config changes in ancestors to their descendents.  A child
> +	 * should only be allowed to add more restrictions to the parent's
> +	 * configuration.  Fix it and remove the following.
> +	 */
> +	.broken_hierarchy = true,
>  };
>  
>  int __devcgroup_inode_permission(struct inode *inode, int mask)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 14:51 ` [PATCH " Vivek Goyal
@ 2012-09-11 14:54   ` Vivek Goyal
  2012-09-11 17:16   ` Tejun Heo
  1 sibling, 0 replies; 41+ messages in thread
From: Vivek Goyal @ 2012-09-11 14:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Michal Hocko, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Tue, Sep 11, 2012 at 10:51:06AM -0400, Vivek Goyal wrote:
> On Mon, Sep 10, 2012 at 03:31:25PM -0700, Tejun Heo wrote:
> > Currently, cgroup hierarchy support is a mess.  cpu related subsystems
> > behave correctly - configuration, accounting and control on a parent
> > properly cover its children.  blkio and freezer completely ignore
> > hierarchy and treat all cgroups as if they're directly under the root
> > cgroup.  Others show yet different behaviors.
> > 
> > These differing interpretations of cgroup hierarchy make using cgroup
> > confusing and it impossible to co-mount controllers into the same
> > hierarchy and obtain sane behavior.
> > 
> > Eventually, we want full hierarchy support from all subsystems and
> > probably a unified hierarchy.  Users using separate hierarchies
> > expecting completely different behaviors depending on the mounted
> > subsystem is deterimental to making any progress on this front.
> > 
> > This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
> > for controllers which are lacking in hierarchy support.  The goal of
> > this patch is two-fold.
> > 
> > * Move users away from using hierarchy on currently non-hierarchical
> >   subsystems, so that implementing proper hierarchy support on those
> >   doesn't surprise them.
> 
> I know two current/potential users. systemd and libvirt. They are
> anyway going to create hierarchy irrespective of the fact whether
> controller supports it or not.

Just to be little precise here. AFAIK, systemd only mounts blkio
controller on a separate hiearchy but does not create child cgroups
under it. So all the services should still run in root cgroup.

For libvirt, I think by default they had enabled using blkio
controller and they were creating hiearchies under that for each
virtual machine/container.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 10:04   ` Michal Hocko
@ 2012-09-11 17:07     ` Tejun Heo
  2012-09-12 15:47       ` Michal Hocko
       [not found]     ` <5050568B.9090601@parallels.com>
  1 sibling, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-11 17:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, containers, cgroups, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello, Michal.

On Tue, Sep 11, 2012 at 12:04:33PM +0200, Michal Hocko wrote:
> >  	cgroup_unlock();
> > @@ -4953,6 +4958,7 @@ mem_cgroup_create(struct cgroup *cont)
> >  						&per_cpu(memcg_stock, cpu);
> >  			INIT_WORK(&stock->work, drain_local_stock);
> >  		}
> > +		mem_cgroup_subsys.broken_hierarchy = !memcg->use_hierarchy;
> 
> Hmmm, this will warn even if we have
> root (default use_hierarchy=0)
>  \
>   A (use_hierarchy=1)
>    \
>     B <- here
> 
> which is unfortunate because it will add a noise to a reasonable
> configuration.

I suppose you're talking about having root group not performing any
accounting and/or control?  I suppose such could be a valid use case
(is it really necessary tho?)  but I don't think .use_hierarchy is the
right interface for that.  If it's absolutely necessary, I think it
should be a root-only flag (even if that ends up using the same code
path).  Eventually, we really want to kill .use_hierarchy, or at least
make it to RO 1.  As it's currently defined, it's just way too
confusing.

> >  		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
> >  	} else {
> >  		parent = mem_cgroup_from_cont(cont->parent);
> > --- a/net/core/netprio_cgroup.c
> > +++ b/net/core/netprio_cgroup.c
> > @@ -330,7 +330,17 @@ struct cgroup_subsys net_prio_subsys = {
> >  	.subsys_id	= net_prio_subsys_id,
> >  #endif
> >  	.base_cftypes	= ss_files,
> > -	.module		= THIS_MODULE
> > +	.module		= THIS_MODULE,
> > +
> > +	/*
> > +	 * net_prio has artificial limit on the number of cgroups and
> > +	 * disallows nesting making it impossible to co-mount it with other
> > +	 * hierarchical subsystems.  Remove the artificially low PRIOIDX_SZ
> > +	 * limit and properly nest configuration such that children follow
> > +	 * their parents' configurations by default and are allowed to
> > +	 * override and remove the following.
> > +	 */
> > +	.broken_hierarchy = trye,
> 
> typo

Heh, I thought I enabled all controllers.  Thanks. :)

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 12:38   ` Li Zefan
@ 2012-09-11 17:08     ` Tejun Heo
  2012-09-11 17:43       ` Tejun Heo
  0 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-11 17:08 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-kernel, containers, cgroups, Michal Hocko, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello, Li.

On Tue, Sep 11, 2012 at 08:38:51PM +0800, Li Zefan wrote:
> > (I tried to document what's broken and how it should be fixed.  If I
> >  got something wrong, please let me know.)
> > 
> 
> So isn't cpuset broken too? child cpuset's cpu mask isn't necessary a subset
> of the parent's if the cpu_exclusive flag is not set.

Heh, didn't even look at that.  Just assumed it was in the same boat
w/ cpu{acct}.  Will take a look.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 14:51 ` [PATCH " Vivek Goyal
  2012-09-11 14:54   ` Vivek Goyal
@ 2012-09-11 17:16   ` Tejun Heo
  2012-09-11 17:35     ` Vivek Goyal
  1 sibling, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-11 17:16 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, Michal Hocko, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello, Vivek.

On Tue, Sep 11, 2012 at 10:51:06AM -0400, Vivek Goyal wrote:
> > * Move users away from using hierarchy on currently non-hierarchical
> >   subsystems, so that implementing proper hierarchy support on those
> >   doesn't surprise them.
> 
> I know two current/potential users. systemd and libvirt. They are
> anyway going to create hierarchy irrespective of the fact whether
> controller supports it or not.

systemd mounts all controllers by default but it only creates
hierarchy for controllers other than cpu, so we're in the clear there.

> So even if we start screaming, nothing is going to change there, I 
> suspect. Just that by default they expect every controller supports
> hiearchies. 

As for libvirt, that's exactly the case where we want to be warning.
The problem is that they're currently creating a hierarchy and
expecting (or at least experiencing) flat behavior.  We want them to
be either switch to explicit flat cgroups or at least know very well
that they're doing something unsupported and the behavior will change
beneath them.

> > * Keep track of which controllers are broken how and nudge the
> >   subsystems to implement proper hierarchy support.
> 
> I thought we can easily keep track of this in a simple .txt file and
> we really don't have to  provide explicit warnings.
> 
> I think for these controllers it is a known fact that they don't support
> hiearchy yet. I am skeptical that providing explicit warnings is going
> to help.

We should change blkio to support full hierarchy and soon and the
current users of the broken hierarchy should be warned loudly and
clearly; otherwise, this whole excercise is pointless, so I don't
think it makes sense to suppress warnings for known broken ones.  In
fact, my concern there is that it's too quiet and polite and I'm
planning on promoting it to full-blown WARN in time.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 17:16   ` Tejun Heo
@ 2012-09-11 17:35     ` Vivek Goyal
  2012-09-11 17:55       ` Tejun Heo
  0 siblings, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2012-09-11 17:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Michal Hocko, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Tue, Sep 11, 2012 at 10:16:01AM -0700, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Tue, Sep 11, 2012 at 10:51:06AM -0400, Vivek Goyal wrote:
> > > * Move users away from using hierarchy on currently non-hierarchical
> > >   subsystems, so that implementing proper hierarchy support on those
> > >   doesn't surprise them.
> > 
> > I know two current/potential users. systemd and libvirt. They are
> > anyway going to create hierarchy irrespective of the fact whether
> > controller supports it or not.
> 
> systemd mounts all controllers by default but it only creates
> hierarchy for controllers other than cpu, so we're in the clear there.
> 
> > So even if we start screaming, nothing is going to change there, I 
> > suspect. Just that by default they expect every controller supports
> > hiearchies. 
> 
> As for libvirt, that's exactly the case where we want to be warning.
> The problem is that they're currently creating a hierarchy and
> expecting (or at least experiencing) flat behavior.  We want them to
> be either switch to explicit flat cgroups or at least know very well
> that they're doing something unsupported and the behavior will change
> beneath them.

It is kind of strange. First kernel allows creation of hiearchy for
non-hierarchical controllers and it also gives warning for user space to
not do that.

If creating hiearchy for flat controllers is wrong then kernel should
not allow it in first place and enforce it, instead of just giving a
warning to user space to not create the hierarchy.

Initially I had blocked the creation of hierarchy deeper than 1 level
but later had to remove it as people wanted libvirt to use blkio
controller and seemed to be fine with flat support. In fact there
were people who insisted on flat support as they thought that made
more sense.

> 
> > > * Keep track of which controllers are broken how and nudge the
> > >   subsystems to implement proper hierarchy support.
> > 
> > I thought we can easily keep track of this in a simple .txt file and
> > we really don't have to  provide explicit warnings.
> > 
> > I think for these controllers it is a known fact that they don't support
> > hiearchy yet. I am skeptical that providing explicit warnings is going
> > to help.
> 
> We should change blkio to support full hierarchy and soon and the
> current users of the broken hierarchy should be warned loudly and
> clearly; otherwise, this whole excercise is pointless,

Sure. Just that CFQ code now has become really complicated and messy
(especially with that 3 service trees per group) so making it hierarchical
is significant amount of effort.

And regarding change of behavior, we can always intoduce a .hierarchy
file like cgroup which needs to be explicitly set to make controller
truly hiearchical. That way behavior does not change in a subtle manner
in future kernel releases. (Not that I am a fan of hierarchy file, just
that it is cost we pay for not implementing hierarchical controller to
begin with).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 17:08     ` Tejun Heo
@ 2012-09-11 17:43       ` Tejun Heo
       [not found]         ` <505057D8.4010908@parallels.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-11 17:43 UTC (permalink / raw)
  To: Li Zefan
  Cc: linux-kernel, containers, cgroups, Michal Hocko, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello,

On Tue, Sep 11, 2012 at 10:08:37AM -0700, Tejun Heo wrote:
> > So isn't cpuset broken too? child cpuset's cpu mask isn't necessary a subset
> > of the parent's if the cpu_exclusive flag is not set.
> 
> Heh, didn't even look at that.  Just assumed it was in the same boat
> w/ cpu{acct}.  Will take a look.

Hmm... Looked at it and I don't think hierarchy support is broken w/
or w/o exclusive.  Can you please elaborate?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 17:35     ` Vivek Goyal
@ 2012-09-11 17:55       ` Tejun Heo
  2012-09-11 18:16         ` Vivek Goyal
       [not found]         ` <50505C39.1050600@parallels.com>
  0 siblings, 2 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-11 17:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, Michal Hocko, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello, Vivek.

On Tue, Sep 11, 2012 at 01:35:25PM -0400, Vivek Goyal wrote:
> It is kind of strange. First kernel allows creation of hiearchy for
> non-hierarchical controllers and it also gives warning for user space to
> not do that.
> 
> If creating hiearchy for flat controllers is wrong then kernel should
> not allow it in first place and enforce it, instead of just giving a
> warning to user space to not create the hierarchy.
> 
> Initially I had blocked the creation of hierarchy deeper than 1 level
> but later had to remove it as people wanted libvirt to use blkio
> controller and seemed to be fine with flat support. In fact there
> were people who insisted on flat support as they thought that made
> more sense.

Yes, it is an ugly situation and we'll have to drag ourselves out of
this mess gradually.  I hope it hadn't happened like this but what
happened already happened and I can't see a better way out.  If you
can, please share.

> Sure. Just that CFQ code now has become really complicated and messy
> (especially with that 3 service trees per group) so making it hierarchical
> is significant amount of effort.

I think the wording for the warning wasn't entirely accurate.  The
thing that we wanna warn about is that the hierarchy behavior isn't
complete yet and it may change in the future.  If we can absolutely
declare that cfq is and will stay broken in terms of hierarchy
support, that could work too but I don't really think that's a good
idea.  It's something we need to do one way or the other.

> And regarding change of behavior, we can always intoduce a .hierarchy
> file like cgroup which needs to be explicitly set to make controller
> truly hiearchical. That way behavior does not change in a subtle manner
> in future kernel releases. (Not that I am a fan of hierarchy file, just
> that it is cost we pay for not implementing hierarchical controller to
> begin with).

I think that way of thinking is what led to this horrible mess in the
first place.  More flexibility doesn't equal better.  We can't keep
piling things on top.  I'd like to at least have an exit strategy.
use_hierarchy drives us further away from where we wanna be.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 17:55       ` Tejun Heo
@ 2012-09-11 18:16         ` Vivek Goyal
  2012-09-11 18:22           ` Tejun Heo
       [not found]         ` <50505C39.1050600@parallels.com>
  1 sibling, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2012-09-11 18:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Michal Hocko, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Tue, Sep 11, 2012 at 10:55:15AM -0700, Tejun Heo wrote:
> Hello, Vivek.
> 
> On Tue, Sep 11, 2012 at 01:35:25PM -0400, Vivek Goyal wrote:
> > It is kind of strange. First kernel allows creation of hiearchy for
> > non-hierarchical controllers and it also gives warning for user space to
> > not do that.
> > 
> > If creating hiearchy for flat controllers is wrong then kernel should
> > not allow it in first place and enforce it, instead of just giving a
> > warning to user space to not create the hierarchy.
> > 
> > Initially I had blocked the creation of hierarchy deeper than 1 level
> > but later had to remove it as people wanted libvirt to use blkio
> > controller and seemed to be fine with flat support. In fact there
> > were people who insisted on flat support as they thought that made
> > more sense.
> 
> Yes, it is an ugly situation and we'll have to drag ourselves out of
> this mess gradually.  I hope it hadn't happened like this but what
> happened already happened and I can't see a better way out.  If you
> can, please share.
> 
> > Sure. Just that CFQ code now has become really complicated and messy
> > (especially with that 3 service trees per group) so making it hierarchical
> > is significant amount of effort.
> 
> I think the wording for the warning wasn't entirely accurate.  The
> thing that we wanna warn about is that the hierarchy behavior isn't
> complete yet and it may change in the future.  If we can absolutely
> declare that cfq is and will stay broken in terms of hierarchy
> support, that could work too but I don't really think that's a good
> idea.  It's something we need to do one way or the other.
> 
> > And regarding change of behavior, we can always intoduce a .hierarchy
> > file like cgroup which needs to be explicitly set to make controller
> > truly hiearchical. That way behavior does not change in a subtle manner
> > in future kernel releases. (Not that I am a fan of hierarchy file, just
> > that it is cost we pay for not implementing hierarchical controller to
> > begin with).
> 
> I think that way of thinking is what led to this horrible mess in the
> first place.  More flexibility doesn't equal better.  We can't keep
> piling things on top.  I'd like to at least have an exit strategy.
> use_hierarchy drives us further away from where we wanna be.

Ok, so whole point of warning seems to be so that we can change the
behavior in future and say to user space they few kernel releases back we
had started printing a warning that creating hierarchy is wrong and 
move to a flat setup. So don't complain to us now.?

Are you planning to get rid of .user_hierarchy file from memory cgroup
too? If you are planning not to put such a file in blkio controller,
then it will make sense to remove it from mem_cgorup too.

.use_hierarchy was not set by default in memory controller and I think one
of the reasons was increased overhead of accounting in deep hierarchies.
Has that been sorted out?

The point I am trying to make is that deep hierarchies (5-6 levels) are
/going to be a reality and if accounting overhead is not manageable then
enabling hierarchy by default might not be a practical solution even
if you implement hierarchy support (like memory cgroup), and in that
case retaining .use_hierarchy will make sense.

IIUC, are you saying that now none of the controller will have flat
hiearchy support because there is no way to be able to create flat 
hierarchy. (Any new group is child of root group). So are we moving
towards a model where every controller is hierarhical and there is
no concept of flat hierarchy.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 18:16         ` Vivek Goyal
@ 2012-09-11 18:22           ` Tejun Heo
  2012-09-11 18:38             ` Vivek Goyal
  0 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-11 18:22 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, Michal Hocko, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello, Vivek.

On Tue, Sep 11, 2012 at 02:16:00PM -0400, Vivek Goyal wrote:
> Ok, so whole point of warning seems to be so that we can change the
> behavior in future and say to user space they few kernel releases back we
> had started printing a warning that creating hierarchy is wrong and 
> move to a flat setup. So don't complain to us now.?

Yes, pretty much.  At the moment, it's simply broken.

> Are you planning to get rid of .user_hierarchy file from memory cgroup
> too? If you are planning not to put such a file in blkio controller,
> then it will make sense to remove it from mem_cgorup too.

Yes, or at least make it RO 1 eventually.

> The point I am trying to make is that deep hierarchies (5-6 levels) are
> /going to be a reality and if accounting overhead is not manageable then
> enabling hierarchy by default might not be a practical solution even
> if you implement hierarchy support (like memory cgroup), and in that
> case retaining .use_hierarchy will make sense.

That doesn't make any sense to me.  If you don't want feature and
overhead of hierarchy, you just need to not create a hierarchy.  If
hierarchical behavior isn't needed, why create hierarchy at all?

> IIUC, are you saying that now none of the controller will have flat
> hiearchy support because there is no way to be able to create flat 
> hierarchy. (Any new group is child of root group). So are we moving
> towards a model where every controller is hierarhical and there is
> no concept of flat hierarchy.

Yeap.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH UPDATED RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-10 22:33 ` [PATCH REPOST " Tejun Heo
  2012-09-11 10:04   ` Michal Hocko
  2012-09-11 12:38   ` Li Zefan
@ 2012-09-11 18:23   ` Tejun Heo
  2012-09-11 20:50     ` Aristeu Rozanski
  2012-09-13 12:16   ` [PATCH REPOST " Daniel P. Berrange
  3 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-11 18:23 UTC (permalink / raw)
  To: linux-kernel, containers, cgroups
  Cc: Michal Hocko, Li Zefan, Glauber Costa, Peter Zijlstra,
	Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn,
	Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Currently, cgroup hierarchy support is a mess.  cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children.  blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup.  Others show yet different behaviors.

These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy.  Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support.  The goal of
this patch is two-fold.

* Move users away from using hierarchy on currently non-hierarchical
  subsystems, so that implementing proper hierarchy support on those
  doesn't surprise them.

* Keep track of which controllers are broken how and nudge the
  subsystems to implement proper hierarchy support.

For now, start with a single warning message.  We can whine louder
later on.

(I tried to document what's broken and how it should be fixed.  If I
 got something wrong, please let me know.)

v2: Fixed a typo spotted by Michal.  Warning message updated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
Here's a slightly updated version w/ updated warning message and typo
fixed.

Thanks.

 block/blk-cgroup.c        |    8 ++++++++
 include/linux/cgroup.h    |   15 +++++++++++++++
 kernel/cgroup.c           |   12 +++++++++++-
 kernel/cgroup_freezer.c   |    8 ++++++++
 kernel/events/core.c      |    7 +++++++
 mm/memcontrol.c           |   12 +++++++++---
 net/core/netprio_cgroup.c |   12 +++++++++++-
 net/sched/cls_cgroup.c    |    9 +++++++++
 security/device_cgroup.c  |    9 +++++++++
 9 files changed, 87 insertions(+), 5 deletions(-)

--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -737,6 +737,14 @@ struct cgroup_subsys blkio_subsys = {
 	.subsys_id = blkio_subsys_id,
 	.base_cftypes = blkcg_files,
 	.module = THIS_MODULE,
+
+	/*
+	 * blkio subsystem is utterly broken in terms of hierarchy support.
+	 * It treats all cgroups equally regardless of where they're
+	 * located in the hierarchy - all cgroups are treated as if they're
+	 * right below the root.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 EXPORT_SYMBOL_GPL(blkio_subsys);
 
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -503,6 +503,21 @@ struct cgroup_subsys {
 	 */
 	bool __DEPRECATED_clear_css_refs;
 
+	/*
+	 * If %false, this subsystem is properly hierarchical -
+	 * configuration, resource accounting and restriction on a parent
+	 * cgroup cover those of its children.  If %true, hierarchy support
+	 * is broken in some ways - some subsystems ignore hierarchy
+	 * completely while others are only implemented half-way.
+	 *
+	 * It's now diallowed to create nested cgroups if the subsystem is
+	 * broken and cgroup core will emit a warning message on such
+	 * cases.  Eventually, all subsystems will be made properly
+	 * hierarchical and this will go away.
+	 */
+	bool broken_hierarchy;
+	bool warned_broken_hierarchy;
+
 #define MAX_CGROUP_TYPE_NAMELEN 32
 	const char *name;
 
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4075,8 +4075,18 @@ static long cgroup_create(struct cgroup 
 		set_bit(CGRP_CLONE_CHILDREN, &cgrp->flags);
 
 	for_each_subsys(root, ss) {
-		struct cgroup_subsys_state *css = ss->create(cgrp);
+		struct cgroup_subsys_state *css;
+
+		if (ss->broken_hierarchy && !ss->warned_broken_hierarchy &&
+		    parent->parent) {
+			pr_warning("cgroup: %s (%d) created nested cgroup for controller \"%s\" which has incomplete hierarchy support. Nested cgroups may change behavior in the future.\n",
+				   current->comm, current->pid, ss->name);
+			if (!strcmp(ss->name, "memory"))
+				pr_warning("cgroup: \"memory\" requires setting use_hierarchy to 1 on the root.\n");
+			ss->warned_broken_hierarchy = true;
+		}
 
+		css = ss->create(cgrp);
 		if (IS_ERR(css)) {
 			err = PTR_ERR(css);
 			goto err_destroy;
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -373,4 +373,12 @@ struct cgroup_subsys freezer_subsys = {
 	.can_attach	= freezer_can_attach,
 	.fork		= freezer_fork,
 	.base_cftypes	= files,
+
+	/*
+	 * freezer subsys doesn't handle hierarchy at all.  Frozen state
+	 * should be inherited through the hierarchy - if a parent is
+	 * frozen, all its children should be frozen.  Fix it and remove
+	 * the following.
+	 */
+	.broken_hierarchy = true,
 };
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7285,5 +7285,12 @@ struct cgroup_subsys perf_subsys = {
 	.destroy	= perf_cgroup_destroy,
 	.exit		= perf_cgroup_exit,
 	.attach		= perf_cgroup_attach,
+
+	/*
+	 * perf_event cgroup doesn't handle nesting correctly.
+	 * ctx->nr_cgroups adjustments should be propagated through the
+	 * cgroup hierarchy.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 #endif /* CONFIG_CGROUP_PERF */
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3855,12 +3855,17 @@ static int mem_cgroup_hierarchy_write(st
 	 */
 	if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
 				(val == 1 || val == 0)) {
-		if (list_empty(&cont->children))
+		if (list_empty(&cont->children)) {
 			memcg->use_hierarchy = val;
-		else
+			/* we're fully hierarchical iff root uses hierarchy */
+			if (mem_cgroup_is_root(memcg))
+				mem_cgroup_subsys.broken_hierarchy = !val;
+		} else {
 			retval = -EBUSY;
-	} else
+		}
+	} else {
 		retval = -EINVAL;
+	}
 
 out:
 	cgroup_unlock();
@@ -4953,6 +4958,7 @@ mem_cgroup_create(struct cgroup *cont)
 						&per_cpu(memcg_stock, cpu);
 			INIT_WORK(&stock->work, drain_local_stock);
 		}
+		mem_cgroup_subsys.broken_hierarchy = !memcg->use_hierarchy;
 		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -330,7 +330,17 @@ struct cgroup_subsys net_prio_subsys = {
 	.subsys_id	= net_prio_subsys_id,
 #endif
 	.base_cftypes	= ss_files,
-	.module		= THIS_MODULE
+	.module		= THIS_MODULE,
+
+	/*
+	 * net_prio has artificial limit on the number of cgroups and
+	 * disallows nesting making it impossible to co-mount it with other
+	 * hierarchical subsystems.  Remove the artificially low PRIOIDX_SZ
+	 * limit and properly nest configuration such that children follow
+	 * their parents' configurations by default and are allowed to
+	 * override and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 
 static int netprio_device_event(struct notifier_block *unused,
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -82,6 +82,15 @@ struct cgroup_subsys net_cls_subsys = {
 #endif
 	.base_cftypes	= ss_files,
 	.module		= THIS_MODULE,
+
+	/*
+	 * While net_cls cgroup has the rudimentary hierarchy support of
+	 * inheriting the parent's classid on cgroup creation, it doesn't
+	 * properly propagates config changes in ancestors to their
+	 * descendents.  A child should follow the parent's configuration
+	 * but be allowed to override it.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 
 struct cls_cgroup_head {
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -457,6 +457,15 @@ struct cgroup_subsys devices_subsys = {
 	.destroy = devcgroup_destroy,
 	.subsys_id = devices_subsys_id,
 	.base_cftypes = dev_cgroup_files,
+
+	/*
+	 * While devices cgroup has the rudimentary hierarchy support which
+	 * checks the parent's restriction, it doesn't properly propagates
+	 * config changes in ancestors to their descendents.  A child
+	 * should only be allowed to add more restrictions to the parent's
+	 * configuration.  Fix it and remove the following.
+	 */
+	.broken_hierarchy = true,
 };
 
 int __devcgroup_inode_permission(struct inode *inode, int mask)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 18:22           ` Tejun Heo
@ 2012-09-11 18:38             ` Vivek Goyal
  0 siblings, 0 replies; 41+ messages in thread
From: Vivek Goyal @ 2012-09-11 18:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, Michal Hocko, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Tue, Sep 11, 2012 at 11:22:10AM -0700, Tejun Heo wrote:

[..]
> > The point I am trying to make is that deep hierarchies (5-6 levels) are
> > /going to be a reality and if accounting overhead is not manageable then
> > enabling hierarchy by default might not be a practical solution even
> > if you implement hierarchy support (like memory cgroup), and in that
> > case retaining .use_hierarchy will make sense.
> 
> That doesn't make any sense to me.  If you don't want feature and
> overhead of hierarchy, you just need to not create a hierarchy.  If
> hierarchical behavior isn't needed, why create hierarchy at all?

I think ease of use and creation in user space. Different subsystems
(systemd/libvirt etc), don't have to fight each other by keeping
cgroups at same level. So systemd can control top 2 level of hiearchies
and then libvirt can manage another 2-3 levels. systemd is not enforcing
any upper limits. And if libvirt wants to enforce upper memory limits per
VM, they can still easily do that using flat controller (and without
incurring the overhead of hierarchical accounting).

Having said that, I think somebody had mentioned that it would be nice
to have hierarchical features to that a limit can be imposed on a
group of virtual machines without having all virtual machines in
same cgroup.

So yes agreed that creating hierarchy and still not expecting hierarchical
behavior does not make much sense. I think it kind of worked for limited
requirements (per cgroup upper limit). I think once you make hierarchy
default and if accounting overhead shows up, then there will be noises
about introducing regression.

Anyway, thanks for the explanation. This roadmap of converting everything
to hierarchical by default sounds sane. (Hoepefully we will be able to
manage the overhead problem).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH UPDATED RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 18:23   ` [PATCH UPDATED " Tejun Heo
@ 2012-09-11 20:50     ` Aristeu Rozanski
  2012-09-11 20:51       ` Tejun Heo
  0 siblings, 1 reply; 41+ messages in thread
From: Aristeu Rozanski @ 2012-09-11 20:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, containers, cgroups, Michal Hocko, Li Zefan,
	Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner,
	Thomas Graf, Serge E. Hallyn, Vivek Goyal, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman,
	Aneesh Kumar K.V

Hi Tejun,
On Tue, Sep 11, 2012 at 11:23:56AM -0700, Tejun Heo wrote:
> +			/* we're fully hierarchical iff root uses hierarchy */

minor nit: s/iff/if/

-- 
Aristeu


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH UPDATED RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 20:50     ` Aristeu Rozanski
@ 2012-09-11 20:51       ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-11 20:51 UTC (permalink / raw)
  To: Aristeu Rozanski
  Cc: linux-kernel, containers, cgroups, Michal Hocko, Li Zefan,
	Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner,
	Thomas Graf, Serge E. Hallyn, Vivek Goyal, Paul Mackerras,
	Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman,
	Aneesh Kumar K.V

Hello,

On Tue, Sep 11, 2012 at 1:50 PM, Aristeu Rozanski <aris@redhat.com> wrote:
> Hi Tejun,
> On Tue, Sep 11, 2012 at 11:23:56AM -0700, Tejun Heo wrote:
>> +                     /* we're fully hierarchical iff root uses hierarchy */
>
> minor nit: s/iff/if/

That was intentional iff == if and only if. :)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-11 17:07     ` Tejun Heo
@ 2012-09-12 15:47       ` Michal Hocko
  2012-09-12 16:41         ` Tejun Heo
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2012-09-12 15:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, containers, cgroups, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Tue 11-09-12 10:07:46, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Sep 11, 2012 at 12:04:33PM +0200, Michal Hocko wrote:
> > >  	cgroup_unlock();
> > > @@ -4953,6 +4958,7 @@ mem_cgroup_create(struct cgroup *cont)
> > >  						&per_cpu(memcg_stock, cpu);
> > >  			INIT_WORK(&stock->work, drain_local_stock);
> > >  		}
> > > +		mem_cgroup_subsys.broken_hierarchy = !memcg->use_hierarchy;
> > 
> > Hmmm, this will warn even if we have
> > root (default use_hierarchy=0)
> >  \
> >   A (use_hierarchy=1)
> >    \
> >     B <- here
> > 
> > which is unfortunate because it will add a noise to a reasonable
> > configuration.
> 
> I suppose you're talking about having root group not performing any
> accounting and/or control?

It doesn't do any controlling because you cannot set any limit for it.
Root cgroup has always been special.

> I suppose such could be a valid use case
> (is it really necessary tho?)  but I don't think .use_hierarchy is the
> right interface for that.  

I am not sure I understand what you mean by that. My only concern with
is that we shouldn't complain if somebody doesn't do anything wrong.
And creating a group under root without any other children, no matter
what use_hierarchy says, is a valid use case and we shouldn't make too
much noise about that.
The only difference in such a use case is that hierarchical stats will
include numbers from the group only if root had use_hierarchy==1. There
are no other side effects.

> If it's absolutely necessary, I think it should be a root-only flag
> (even if that ends up using the same code path).  Eventually, we
> really want to kill .use_hierarchy, or at least make it to RO 1.  As
> it's currently defined, it's just way too confusing.

Agreed on that, definitely.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found]     ` <5050568B.9090601@parallels.com>
@ 2012-09-12 15:49       ` Michal Hocko
  2012-09-12 17:11         ` Tejun Heo
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2012-09-12 15:49 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tejun Heo, linux-kernel, containers, cgroups, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Wed 12-09-12 13:31:55, Glauber Costa wrote:
> On 09/11/2012 02:04 PM, Michal Hocko wrote:
> > I like the approach in general but see the comments bellow:
> > 
> > On Mon 10-09-12 15:33:55, Tejun Heo wrote:
> > [...]
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -3855,12 +3855,17 @@ static int mem_cgroup_hierarchy_write(st
> >>  	 */
> >>  	if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
> >>  				(val == 1 || val == 0)) {
> >> -		if (list_empty(&cont->children))
> >> +		if (list_empty(&cont->children)) {
> >>  			memcg->use_hierarchy = val;
> >> -		else
> >> +			/* we're fully hierarchical iff root uses hierarchy */
> >> +			if (mem_cgroup_is_root(memcg))
> >> +				mem_cgroup_subsys.broken_hierarchy = !val;
> >> +		} else {
> >>  			retval = -EBUSY;
> >> -	} else
> >> +		}
> >> +	} else {
> >>  		retval = -EINVAL;
> >> +	}
> >>  
> >>  out:
> >>  	cgroup_unlock();
> >> @@ -4953,6 +4958,7 @@ mem_cgroup_create(struct cgroup *cont)
> >>  						&per_cpu(memcg_stock, cpu);
> >>  			INIT_WORK(&stock->work, drain_local_stock);
> >>  		}
> >> +		mem_cgroup_subsys.broken_hierarchy = !memcg->use_hierarchy;
> > 
> > Hmmm, this will warn even if we have
> > root (default use_hierarchy=0)
> >  \
> >   A (use_hierarchy=1)
> >    \
> >     B <- here
> > 
> > which is unfortunate because it will add a noise to a reasonable
> > configuration.
> > I think this is fixable if you move the warning after 
> > cgroup_subsys_state::create and broken_hierarchy would be set only if
> > parent is not root and use_hierarchy==0 in mem_cgroup_create. Something
> > like:
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 795e525..d5c93ab 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4973,6 +4973,13 @@ mem_cgroup_create(struct cgroup *cont)
> >  	} else {
> >  		res_counter_init(&memcg->res, NULL);
> >  		res_counter_init(&memcg->memsw, NULL);
> > +		/*
> > +		 * Deeper hierachy with use_hierarchy == false doesn't make
> > +		 * much sense so let cgroup subsystem know about this unfortunate
> > +		 * state in our controller.
> > +		 */
> > +		if (parent && parent != root_mem_cgroup)
> > +			mem_cgroup_subsys.broken_hierarchy = true;
> >  	}
> >  	memcg->last_scanned_node = MAX_NUMNODES;
> >  	INIT_LIST_HEAD(&memcg->oom_notify);
> > 
> > What do you think?
> > 
> I side with Tejun's original intentions.
> 
> While I respect your goal of not warning about any configuration
> with max_level = 1, I believe the only sane configuration as soon
> as we get any 2nd-level child is use_hierarchy = 1 for everybody.
> 
> Everything aside from it should be warned.

Defintely. And that what the above guarantess, doesn't it?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found]         ` <505057D8.4010908@parallels.com>
@ 2012-09-12 16:34           ` Tejun Heo
  2012-09-13  6:48             ` Li Zefan
  0 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-12 16:34 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Li Zefan, linux-kernel, containers, cgroups, Michal Hocko,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello,

On Wed, Sep 12, 2012 at 01:37:28PM +0400, Glauber Costa wrote:
> "If a cpuset is cpu or mem exclusive, no other cpuset, other than
> a direct ancestor or descendant, may share any of the same CPUs or
> Memory Nodes."
> 
> So I think it tricked me as well. I was under the impression that
> "exclusive" would also disallow the kids.

You two are confusing me even more.  AFAICS, the hierarchical
properties don't seem to change whether exclusive is set or not.  It
still ensures children can't have something parent doesn't allow and
exclusive applies to whether to share something with siblings, so I
don't think anything is broken hierarchy-wise.  Am I missing
something?  If so, please be explicit and elaborate where and how it's
broken.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-12 15:47       ` Michal Hocko
@ 2012-09-12 16:41         ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-12 16:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, containers, cgroups, Li Zefan, Glauber Costa,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello,

On Wed, Sep 12, 2012 at 05:47:45PM +0200, Michal Hocko wrote:
> > If it's absolutely necessary, I think it should be a root-only flag
> > (even if that ends up using the same code path).  Eventually, we
> > really want to kill .use_hierarchy, or at least make it to RO 1.  As
> > it's currently defined, it's just way too confusing.
> 
> Agreed on that, definitely.

The thing is that if we're gonna be headed that way, we're gonna have
to nudge people to .use_hierarchy = 1 everywher first.  If
.use_hierarchy = 0 at root is a valid use case which absolutedly needs
to be maintained (from what I read, I don't think it is), let's move
it to a different root-only flag; otherwise, it should be phased out.
As such, while the user might not be necessarily wrong, we still need
to unify the behavior, I think.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found] ` <505055E5.90903@parallels.com>
@ 2012-09-12 17:03   ` Tejun Heo
       [not found]     ` <5051C954.2080600@parallels.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-12 17:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, Michal Hocko, Li Zefan, Peter Zijlstra,
	Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn,
	Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello, Glauber.

On Wed, Sep 12, 2012 at 01:29:09PM +0400, Glauber Costa wrote:
> Haven't gone through the whole patch yet, and not sure how you actually
> touch memcg in here. And I absolutely know we have discussed this
> before, but I still stand that for the memcg case, in which hierarchy
> can be enabled by a crazy boolean, we should be enabling it somehow. It
> is fine if we don't want to change the default without warning first,
> but a Kconfig option to make this default would really help. We should
> tell everybody with a well defined lifecycle to just enable it.

I'm not really sure how useful the Kconfig would be.  I'm not gonna
nack it but am not sure it's useful either.  Michal seems to be in the
same boat, so I suppose there's no strong opposition.

I don't think it would make life easier for distros.  It could differ
depending on distros but in my experience with SUSE trivial patches
flipping the default aren't big deals especially if upstream has
transition plan in place.  The difficulty here is that somebody needs
to assess the situation and preferably make that decision conciously -
the mechanism to do so be it a one liner patch or Kconfig option
doesn't really matter and across the transition period we would want
to keep the memcg behavior consistent regardless which kernel is in
use.

The problem with Kconfig is that we shouldn't enable the new behavior
by default as that would change the behavior silently and if we can't
do that it's just something which is buried under the sea of config
options.  Kconfig or no, we need to coordinate with the distros.

>From upstream, my current plan for .use_hierarchy is

* Warn about broken hierarchy usage in increasing verbosity.

* After a couple releases, warn about creating any mem cgroup if
  .use_hierarchy == 0 at root.

* After a couple releases, switch .use_hierarchy to 1 by default and
  loudly warn on any attempts to set it to zero.

* Rip out flat hierarchy support and fail any attempt to set
  .use_hierarchy to 0.

It'll take some months but I don't think it's too crazy and I think
the whole process should take longer than eight months to ensure any
active distro notices it.

Distros should set .use_hierarchy to 1 on mounting memcg.  This
probably should happen on a new release w/ accompanying release note.
I'll try to coordinate it at least for the popular ones.

> > +	 * It's now diallowed to create nested cgroups if the subsystem is
> typo, disallowed.

Ooh, will fix.

> > +	 * broken and cgroup core will emit a warning message on such
> > +	 * cases.  Eventually, all subsystems will be made properly
> > +	 * hierarchical and this will go away.
> > +	 */
> > +	bool broken_hierarchy;
> > +	bool warned_broken_hierarchy;
> > +
> 
> why do we need the extra bool? Isn't WARN_ON_ONCE() suitable here?

We want to warn once per subsys instead of once for the whole system.

> > +	/*
> > +	 * net_prio has artificial limit on the number of cgroups and
> > +	 * disallows nesting making it impossible to co-mount it with other
> > +	 * hierarchical subsystems.  Remove the artificially low PRIOIDX_SZ
> > +	 * limit and properly nest configuration such that children follow
> > +	 * their parents' configurations by default and are allowed to
> > +	 * override and remove the following.
> > +	 */
> > +	.broken_hierarchy = trye,
> >  };
> 
> "trye" doesn't seem to be a recognized word.

Yeah, fixed.

> >  static int netprio_device_event(struct notifier_block *unused,
> > --- a/net/sched/cls_cgroup.c
> > +++ b/net/sched/cls_cgroup.c
> > @@ -82,6 +82,15 @@ struct cgroup_subsys net_cls_subsys = {
> >  #endif
> >  	.base_cftypes	= ss_files,
> >  	.module		= THIS_MODULE,
> > +
> > +	/*
> > +	 * While net_cls cgroup has the rudimentary hierarchy support of
> > +	 * inheriting the parent's classid on cgroup creation, it doesn't
> > +	 * properly propagates config changes in ancestors to their
> > +	 * descendents.  A child should follow the parent's configuration
> > +	 * but be allowed to override it.  Fix it and remove the following.
> > +	 */
> > +	.broken_hierarchy = true,
> >  };
> >  
> 
> Since all this cgroup provides is a marking, it is not terribly obvious
> to me what "proper hierarchy" would mean. Input from the authors would
> be strongly advisable here.

Setting mark on a parent should be reflected on all its children w/o
their own explicit settings.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found]         ` <50505C39.1050600@parallels.com>
@ 2012-09-12 17:09           ` Tejun Heo
  2012-09-13 14:53             ` Block IO controller hierarchy suppport (Was: Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them) Vivek Goyal
       [not found]             ` <5051CBAA.5040308@parallels.com>
  0 siblings, 2 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-12 17:09 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Vivek Goyal, linux-kernel, Michal Hocko, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello, Glauber.

On Wed, Sep 12, 2012 at 01:56:09PM +0400, Glauber Costa wrote:
> "Being able to provide hierarchical support" is an example of such
> restrictions. And as Vivek said himself, for the CFQ case should be doable.

Yeah, it's mostly that cfq was already a hairy monster before blkcg
was added to it and unfortunately we didn't make it any cleaner in the
process and blkcg itself has a lot of other issues including being
completely broken w.r.t. writeback writes.  In addition there are two
sub-controllers - the cfq one and blk-throttle.  So, it's just that
there are too many scary things to do and not enough man power or
maybe interest.  I hope we could just declare cgroup isn't supported
on block devices but that doesn't seem feasible at this point either.

I might / probably work on it and am hoping to coerce Vivek into it
too.  If you wanna jump in, please be my guest.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-12 15:49       ` Michal Hocko
@ 2012-09-12 17:11         ` Tejun Heo
  2012-09-13 12:14           ` Michal Hocko
       [not found]           ` <5051CB24.4010801@parallels.com>
  0 siblings, 2 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-12 17:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Glauber Costa, linux-kernel, containers, cgroups, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello,

On Wed, Sep 12, 2012 at 05:49:07PM +0200, Michal Hocko wrote:
> > While I respect your goal of not warning about any configuration
> > with max_level = 1, I believe the only sane configuration as soon
> > as we get any 2nd-level child is use_hierarchy = 1 for everybody.
> > 
> > Everything aside from it should be warned.
> 
> Defintely. And that what the above guarantess, doesn't it?

I'm getting a bit worried that I might not be fully understanding what
your concern is.  Can you please elaborate what your worries are and
the transition plan that you have in your mind regarding
.use_hierarchy?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-12 16:34           ` Tejun Heo
@ 2012-09-13  6:48             ` Li Zefan
  0 siblings, 0 replies; 41+ messages in thread
From: Li Zefan @ 2012-09-13  6:48 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, linux-kernel, containers, cgroups, Michal Hocko,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On 2012/9/13 0:34, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 12, 2012 at 01:37:28PM +0400, Glauber Costa wrote:
>> "If a cpuset is cpu or mem exclusive, no other cpuset, other than
>> a direct ancestor or descendant, may share any of the same CPUs or
>> Memory Nodes."
>>
>> So I think it tricked me as well. I was under the impression that
>> "exclusive" would also disallow the kids.
> 
> You two are confusing me even more.  AFAICS, the hierarchical
> properties don't seem to change whether exclusive is set or not.  It
> still ensures children can't have something parent doesn't allow and
> exclusive applies to whether to share something with siblings, so I
> don't think anything is broken hierarchy-wise.  Am I missing
> something?  If so, please be explicit and elaborate where and how it's
> broken.
> 

Ignore it. I misunderstood the exclusive flag. Sorry for the noise.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-12 17:11         ` Tejun Heo
@ 2012-09-13 12:14           ` Michal Hocko
  2012-09-13 17:18             ` Tejun Heo
       [not found]           ` <5051CB24.4010801@parallels.com>
  1 sibling, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2012-09-13 12:14 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, linux-kernel, containers, cgroups, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Wed 12-09-12 10:11:20, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 12, 2012 at 05:49:07PM +0200, Michal Hocko wrote:
> > > While I respect your goal of not warning about any configuration
> > > with max_level = 1, I believe the only sane configuration as soon
> > > as we get any 2nd-level child is use_hierarchy = 1 for everybody.
> > > 
> > > Everything aside from it should be warned.
> > 
> > Defintely. And that what the above guarantess, doesn't it?
> 
> I'm getting a bit worried that I might not be fully understanding what
> your concern is.  Can you please elaborate what your worries are and
> the transition plan that you have in your mind regarding
> .use_hierarchy?

I would like to see use_hierarchy go away. The only concern I have is 
to warn only if somebody is doing something wrong (aka flat
hierarchies). Or better put it this way. Do not warn in cases which do
not change if use_hierarchy is gone or default changes to 1.
An example:
root (use_hierarchy=0)
 | \
 |  A (use_hierarchy=0)
 |
 B (use_hierarachy=1)
 |\
 C D

is a perfectly sane configuration and I do not see any reason to fill
logs with some scary warnings when A is created. There will be no
semantical change in this setup When use_hierchy is gone.

So the only thing I am proposing here is to warn only if something
should be fixed in the configuration in order to be prepared for fully
hierarchical (and that is a second level of children from root with
use_hierachy==0).

Does it make more sense now?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-10 22:33 ` [PATCH REPOST " Tejun Heo
                     ` (2 preceding siblings ...)
  2012-09-11 18:23   ` [PATCH UPDATED " Tejun Heo
@ 2012-09-13 12:16   ` Daniel P. Berrange
  2012-09-13 17:52     ` Tejun Heo
  3 siblings, 1 reply; 41+ messages in thread
From: Daniel P. Berrange @ 2012-09-13 12:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-kernel, containers, cgroups, Neil Horman, Michal Hocko,
	Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo,
	Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner,
	Ingo Molnar, Vivek Goyal

On Mon, Sep 10, 2012 at 03:33:55PM -0700, Tejun Heo wrote:
> (forgot cc'ing containers / cgroups mailing lists and used the old
>  address for Li.  Reposting.  Sorry for the noise.)
> 
> Currently, cgroup hierarchy support is a mess.  cpu related subsystems
> behave correctly - configuration, accounting and control on a parent
> properly cover its children.  blkio and freezer completely ignore
> hierarchy and treat all cgroups as if they're directly under the root
> cgroup.  Others show yet different behaviors.
> 
> These differing interpretations of cgroup hierarchy make using cgroup
> confusing and it impossible to co-mount controllers into the same
> hierarchy and obtain sane behavior.
> 
> Eventually, we want full hierarchy support from all subsystems and
> probably a unified hierarchy.  Users using separate hierarchies
> expecting completely different behaviors depending on the mounted
> subsystem is deterimental to making any progress on this front.
> 
> This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
> for controllers which are lacking in hierarchy support.  The goal of
> this patch is two-fold.
> 
> * Move users away from using hierarchy on currently non-hierarchical
>   subsystems, so that implementing proper hierarchy support on those
>   doesn't surprise them.
> 
> * Keep track of which controllers are broken how and nudge the
>   subsystems to implement proper hierarchy support.
> 
> For now, start with a single warning message.  We can whine louder
> later on.

If you want application developers / users to understand this, then
you really need to update the Documentation/cgroups/cgroups.txt file
to provide suitable recommendations on use of hierarchies. As it
stands today it arguably encourages apps to make use of arbitrary
hierarchies. While some of the other docs (blkio-controller.txt) do
describe the limitations of their implementation, they don't ever
go as far as recommending against usage of hierarchies, which is
what you seem to be saying.

Just printing warning messages in the logs with no docs to explain
the issues which cause the warnings is not friendly to users, and
IMHO will just lead people to ignore the warnings.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Block IO controller hierarchy suppport (Was: Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them)
  2012-09-12 17:09           ` Tejun Heo
@ 2012-09-13 14:53             ` Vivek Goyal
  2012-09-13 22:06               ` Tejun Heo
       [not found]             ` <5051CBAA.5040308@parallels.com>
  1 sibling, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2012-09-13 14:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, linux-kernel, Michal Hocko, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Wed, Sep 12, 2012 at 10:09:33AM -0700, Tejun Heo wrote:

[..]
> Yeah, it's mostly that cfq was already a hairy monster before blkcg
> was added to it and unfortunately we didn't make it any cleaner in the
> process and blkcg itself has a lot of other issues including being
> completely broken w.r.t. writeback writes.  In addition there are two
> sub-controllers - the cfq one and blk-throttle.  So, it's just that
> there are too many scary things to do and not enough man power or
> maybe interest.  I hope we could just declare cgroup isn't supported
> on block devices but that doesn't seem feasible at this point either.
> 
> I might / probably work on it and am hoping to coerce Vivek into it
> too.  If you wanna jump in, please be my guest.

Biggest problem with blkcg CFQ implementation is idling on cgroup. If
we don't idle on cgroup, then we don't get the service differentiaton
for most of the workloads and if we do idle then performance starts
to suck very soon (The moment few cgroups are created). And hierarchy
will just exacertbate this problem because then one will try to idle
at each group in hierarchy.

This problem is something similar to CFQ's idling on sequential queues
and iopriority. Because we never idled on random IO queue, ioprios never
worked on random IO queues. And same is true for buffered write queues.
Similary, if you don't idle on groups, then for most of the workloads,
service differentiation is not visible. Only the one which are highly
sequential on nature, one can see service differentiation.

That's one fundamental problem for which we need to have a good answer
before we try to do more work on blkcg. Because we can write as much
code but at the end of the day it might still not be useful because
of the above mentioned issue I faced.

And that's the reason I think blkcg is primarly useful when you create
number of cgroups very small and move offending/problem creating worklods
in that cgroup and keep all other running in root cgroup. That way you
get less idling due to less number of cgroups at the same time you have
provided more isolation from offending workloads.

So if anybody has ideas on how to address above issue, I am all ears.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-13 12:14           ` Michal Hocko
@ 2012-09-13 17:18             ` Tejun Heo
  2012-09-13 17:39               ` Michal Hocko
  0 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-13 17:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Glauber Costa, linux-kernel, containers, cgroups, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello, Michal.

On Thu, Sep 13, 2012 at 02:14:38PM +0200, Michal Hocko wrote:
> I would like to see use_hierarchy go away. The only concern I have is 
> to warn only if somebody is doing something wrong (aka flat
> hierarchies). Or better put it this way. Do not warn in cases which do
> not change if use_hierarchy is gone or default changes to 1.
> An example:
> root (use_hierarchy=0)
>  | \
>  |  A (use_hierarchy=0)
>  |
>  B (use_hierarachy=1)
>  |\
>  C D
> 
> is a perfectly sane configuration and I do not see any reason to fill
> logs with some scary warnings when A is created. There will be no
> semantical change in this setup When use_hierchy is gone.
> 
> So the only thing I am proposing here is to warn only if something
> should be fixed in the configuration in order to be prepared for fully
> hierarchical (and that is a second level of children from root with
> use_hierachy==0).
> 
> Does it make more sense now?

Ah, okay, so what you're saying is that we shouldn't warn if 0
.use_hierarchys don't make any behavior difference from when they're
all 1, right?  If so, I have no objection.  Will incorporate your
updated version.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found]           ` <5051CB24.4010801@parallels.com>
@ 2012-09-13 17:21             ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-13 17:21 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Michal Hocko, linux-kernel, containers, cgroups, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hello, Glauber.

On Thu, Sep 13, 2012 at 04:01:40PM +0400, Glauber Costa wrote:
> This is getting confusing for me as well, because I don't know if your
> reply was targeted towards me or Michal. As for me, I am in agreement
> with what you did, and I merely replied to Michal's concern and
> suggestion of not warning in the special 1-st level only setups saying I
> side with you.

Oh, I was replying to Michal.  I thought there was behavior difference
with the .use_hierarchy=0 cases that Michal was talking about but I
don't think that's the case and Michal is trying to say that we
shouldn't warn if the configuration behaves identical to
root.use_hierarchy=1 even if it contains some .use_hierarchy=0's,
which I'm fine with.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-13 17:18             ` Tejun Heo
@ 2012-09-13 17:39               ` Michal Hocko
       [not found]                 ` <5052E87A.1050405@parallels.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2012-09-13 17:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, linux-kernel, containers, cgroups, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Thu 13-09-12 10:18:32, Tejun Heo wrote:
> Hello, Michal.
> 
> On Thu, Sep 13, 2012 at 02:14:38PM +0200, Michal Hocko wrote:
> > I would like to see use_hierarchy go away. The only concern I have is 
> > to warn only if somebody is doing something wrong (aka flat
> > hierarchies). Or better put it this way. Do not warn in cases which do
> > not change if use_hierarchy is gone or default changes to 1.
> > An example:
> > root (use_hierarchy=0)
> >  | \
> >  |  A (use_hierarchy=0)
> >  |
> >  B (use_hierarachy=1)
> >  |\
> >  C D
> > 
> > is a perfectly sane configuration and I do not see any reason to fill
> > logs with some scary warnings when A is created. There will be no
> > semantical change in this setup When use_hierchy is gone.
> > 
> > So the only thing I am proposing here is to warn only if something
> > should be fixed in the configuration in order to be prepared for fully
> > hierarchical (and that is a second level of children from root with
> > use_hierachy==0).
> > 
> > Does it make more sense now?
> 
> Ah, okay, so what you're saying is that we shouldn't warn if 0
> .use_hierarchys don't make any behavior difference from when they're
> all 1, right?  

Exactly. 1st level of children under the root is exactly this kind of
setup.

> If so, I have no objection.  Will incorporate your updated version.

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found]     ` <5051C954.2080600@parallels.com>
@ 2012-09-13 17:48       ` Tejun Heo
       [not found]         ` <5052E9BC.2020908@parallels.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-13 17:48 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-kernel, Michal Hocko, Li Zefan, Peter Zijlstra,
	Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn,
	Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V,
	Dave Jones, Lennart Poettering, Kay Sievers

Hello, Glauber.

On Thu, Sep 13, 2012 at 03:53:56PM +0400, Glauber Costa wrote:
> Here is where the Kconfig option comes to play. If we do it in the
> kernel, userspace doesn't have to do anything. I spoke with Lennart and
> Kay, and at least from a systemd PoV, they would much rather not provide
> a hack in userspace for a file that is scheduled to go away in any case
> - which I personally believe is a fair request.
> 
> It is a default, so the effect for the user is the same: After the
> machine boots, use_hierarchy = 1, and he can still flip to 0 for some time.

Alright, let's go Kconfig.  Let's just make sure that the transitional
nature is clearly labeled and the fact that the default config will
generate a warning when nested cgroups are created in memcg.  We can
then coordinate the flip with distros.  Can you please repost the
Kconfig patch?

> > Setting mark on a parent should be reflected on all its children w/o
> > their own explicit settings.
> 
> That is clear, and better behavior than we have today. What I mean, is
> that by setting its own marking, the child can pretty much "escape" the
> group.
> 
> The ideal solution - from this point of view only - would be to have
> more than one marking, and mark with all the way down to the root. So if
> you have an iptables rule to match one marking, it still applies to the
> kids. And you can still have extra markings.
> 
> I am not sure this is feasible, though, in which case your solution
> could be a good compromise. But please let's aim for it.

I don't think it supports multiple tags.  If that's possible, it would
be nice but I don't think it's a must.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
  2012-09-13 12:16   ` [PATCH REPOST " Daniel P. Berrange
@ 2012-09-13 17:52     ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-13 17:52 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: linux-kernel, containers, cgroups, Neil Horman, Michal Hocko,
	Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo,
	Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner,
	Ingo Molnar, Vivek Goyal

Hello, Daniel.

On Thu, Sep 13, 2012 at 01:16:29PM +0100, Daniel P. Berrange wrote:
> If you want application developers / users to understand this, then
> you really need to update the Documentation/cgroups/cgroups.txt file
> to provide suitable recommendations on use of hierarchies. As it
> stands today it arguably encourages apps to make use of arbitrary
> hierarchies. While some of the other docs (blkio-controller.txt) do
> describe the limitations of their implementation, they don't ever
> go as far as recommending against usage of hierarchies, which is
> what you seem to be saying.
>
> Just printing warning messages in the logs with no docs to explain
> the issues which cause the warnings is not friendly to users, and
> IMHO will just lead people to ignore the warnings.

Oh yeah, we definitely need to update the docs.  No doubt about it.
Will do.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found]             ` <5051CBAA.5040308@parallels.com>
@ 2012-09-13 17:54               ` Tejun Heo
       [not found]                 ` <5052E931.8000007@parallels.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-13 17:54 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Vivek Goyal, linux-kernel, Michal Hocko, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

Hey, Glauber.

On Thu, Sep 13, 2012 at 04:03:54PM +0400, Glauber Costa wrote:
> > I might / probably work on it and am hoping to coerce Vivek into it
> > too.  If you wanna jump in, please be my guest.
> 
> I am devoting part of my time to general cgroup stuff, I can definitely
> help you with that. Advantage is that you won't have to coerce me.

Yeah, that's awesome.  Would you be interested in looking into
implementing hierarchy support for blkcg too?  Or are you more
interested mostly in cgroup core and memcg?

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Block IO controller hierarchy suppport (Was: Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them)
  2012-09-13 14:53             ` Block IO controller hierarchy suppport (Was: Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them) Vivek Goyal
@ 2012-09-13 22:06               ` Tejun Heo
  2012-09-14  2:53                 ` Vivek Goyal
  0 siblings, 1 reply; 41+ messages in thread
From: Tejun Heo @ 2012-09-13 22:06 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Glauber Costa, linux-kernel, Michal Hocko, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, rni,
	ctalbott

Hey, Vivek.

(cc'ing Rakesh and Chad who work on iosched in google).

On Thu, Sep 13, 2012 at 10:53:41AM -0400, Vivek Goyal wrote:
> Biggest problem with blkcg CFQ implementation is idling on cgroup. If
> we don't idle on cgroup, then we don't get the service differentiaton
> for most of the workloads and if we do idle then performance starts
> to suck very soon (The moment few cgroups are created). And hierarchy
> will just exacertbate this problem because then one will try to idle
> at each group in hierarchy.
> 
> This problem is something similar to CFQ's idling on sequential queues
> and iopriority. Because we never idled on random IO queue, ioprios never
> worked on random IO queues. And same is true for buffered write queues.
> Similary, if you don't idle on groups, then for most of the workloads,
> service differentiation is not visible. Only the one which are highly
> sequential on nature, one can see service differentiation.
> 
> That's one fundamental problem for which we need to have a good answer
> before we try to do more work on blkcg. Because we can write as much
> code but at the end of the day it might still not be useful because
> of the above mentioned issue I faced.

I talked with Rakesh about this as the modified cfq-iosched used in
google supports proper hierarchy and the feature is heavily depended
upon.  I was told that nesting doesn't really change anything.  The
only thing which matters is the number of active cgroups and whether
they're nested or how deep doesn't matter - IIUC there's no need to
idle for internal nodes if they don't have IOs pending.

He draw me some diagrams which made sense for me and the code
apparently actually works, so there doesn't seem to be any fundamental
issue in implementing hierarchy support in cfq.

Thoughts?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Block IO controller hierarchy suppport (Was: Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them)
  2012-09-13 22:06               ` Tejun Heo
@ 2012-09-14  2:53                 ` Vivek Goyal
       [not found]                   ` <5052E8DA.1000106@parallels.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Vivek Goyal @ 2012-09-14  2:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Glauber Costa, linux-kernel, Michal Hocko, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, rni,
	ctalbott

On Thu, Sep 13, 2012 at 03:06:13PM -0700, Tejun Heo wrote:
> Hey, Vivek.
> 
> (cc'ing Rakesh and Chad who work on iosched in google).
> 
> On Thu, Sep 13, 2012 at 10:53:41AM -0400, Vivek Goyal wrote:
> > Biggest problem with blkcg CFQ implementation is idling on cgroup. If
> > we don't idle on cgroup, then we don't get the service differentiaton
> > for most of the workloads and if we do idle then performance starts
> > to suck very soon (The moment few cgroups are created). And hierarchy
> > will just exacertbate this problem because then one will try to idle
> > at each group in hierarchy.
> > 
> > This problem is something similar to CFQ's idling on sequential queues
> > and iopriority. Because we never idled on random IO queue, ioprios never
> > worked on random IO queues. And same is true for buffered write queues.
> > Similary, if you don't idle on groups, then for most of the workloads,
> > service differentiation is not visible. Only the one which are highly
> > sequential on nature, one can see service differentiation.
> > 
> > That's one fundamental problem for which we need to have a good answer
> > before we try to do more work on blkcg. Because we can write as much
> > code but at the end of the day it might still not be useful because
> > of the above mentioned issue I faced.
> 
> I talked with Rakesh about this as the modified cfq-iosched used in
> google supports proper hierarchy and the feature is heavily depended
> upon.  I was told that nesting doesn't really change anything.  The
> only thing which matters is the number of active cgroups and whether
> they're nested or how deep doesn't matter - IIUC there's no need to
> idle for internal nodes if they don't have IOs pending.
> 
> He draw me some diagrams which made sense for me and the code
> apparently actually works, so there doesn't seem to be any fundamental
> issue in implementing hierarchy support in cfq.

Hmm...., They probably are right. Idling only on leaf groups can make
sure that none of the groups loses its fair share of quota. Thinking
loud...


				root
                                /  \
                               T1  G1
                                  /  \
                                 T2   G2
                                       \
                                        T3 

So if task T3 finishes and there is no active IO from T3, we will idle
on group G2 (in the hope that soon some IO will show up from T3 or from
some other task in G2). And that alone should make sure all the group
nodes in the path to root (G1) get their fair share at their respective
level and no additional idling should be needed.

So sounds like hierarhcy will not cause additional idling. Idling will
solely depend on leaf active groups. 

So how bad group/no-idle service tree idling is proving to be. In my
experience if we use it on anything other than single spindle sata disk,
performance hit will start showing up immediately.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Block IO controller hierarchy suppport (Was: Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them)
       [not found]                   ` <5052E8DA.1000106@parallels.com>
@ 2012-09-14 13:22                     ` Vivek Goyal
  0 siblings, 0 replies; 41+ messages in thread
From: Vivek Goyal @ 2012-09-14 13:22 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tejun Heo, linux-kernel, Michal Hocko, Li Zefan, Peter Zijlstra,
	Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn,
	Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
	Neil Horman, Aneesh Kumar K.V, rni, ctalbott

On Fri, Sep 14, 2012 at 12:20:42PM +0400, Glauber Costa wrote:
> On 09/14/2012 06:53 AM, Vivek Goyal wrote:
> > On Thu, Sep 13, 2012 at 03:06:13PM -0700, Tejun Heo wrote:
> >> Hey, Vivek.
> >>
> >> (cc'ing Rakesh and Chad who work on iosched in google).
> >>
> >> On Thu, Sep 13, 2012 at 10:53:41AM -0400, Vivek Goyal wrote:
> >>> Biggest problem with blkcg CFQ implementation is idling on cgroup. If
> >>> we don't idle on cgroup, then we don't get the service differentiaton
> >>> for most of the workloads and if we do idle then performance starts
> >>> to suck very soon (The moment few cgroups are created). And hierarchy
> >>> will just exacertbate this problem because then one will try to idle
> >>> at each group in hierarchy.
> >>>
> >>> This problem is something similar to CFQ's idling on sequential queues
> >>> and iopriority. Because we never idled on random IO queue, ioprios never
> >>> worked on random IO queues. And same is true for buffered write queues.
> >>> Similary, if you don't idle on groups, then for most of the workloads,
> >>> service differentiation is not visible. Only the one which are highly
> >>> sequential on nature, one can see service differentiation.
> >>>
> >>> That's one fundamental problem for which we need to have a good answer
> >>> before we try to do more work on blkcg. Because we can write as much
> >>> code but at the end of the day it might still not be useful because
> >>> of the above mentioned issue I faced.
> >>
> >> I talked with Rakesh about this as the modified cfq-iosched used in
> >> google supports proper hierarchy and the feature is heavily depended
> >> upon.  I was told that nesting doesn't really change anything.  The
> >> only thing which matters is the number of active cgroups and whether
> >> they're nested or how deep doesn't matter - IIUC there's no need to
> >> idle for internal nodes if they don't have IOs pending.
> >>
> >> He draw me some diagrams which made sense for me and the code
> >> apparently actually works, so there doesn't seem to be any fundamental
> >> issue in implementing hierarchy support in cfq.
> > 
> > Hmm...., They probably are right. Idling only on leaf groups can make
> > sure that none of the groups loses its fair share of quota. Thinking
> > loud...
> > 
> > 
> > 				root
> >                                 /  \
> >                                T1  G1
> >                                   /  \
> >                                  T2   G2
> >                                        \
> >                                         T3 
> > 
> > So if task T3 finishes and there is no active IO from T3, we will idle
> > on group G2 (in the hope that soon some IO will show up from T3 or from
> > some other task in G2). And that alone should make sure all the group
> > nodes in the path to root (G1) get their fair share at their respective
> > level and no additional idling should be needed.
> > 
> > So sounds like hierarhcy will not cause additional idling. Idling will
> > solely depend on leaf active groups. 
> > 
> 
> What if the G's also have tasks? That is a valid configuration, after all.

Are you saying what if groups have tasks?I think in above configuration I
have shown Task T2 in G1 and task T3 in G2. So groups do have tasks here.

So say at a given time, G2 is not on service tree (not doing any IO) and
T2 does IO, then we will end up doing idling on group G1. But that's the
idling we will anyway do even if we were in flat mode.


				     pivot
				    /  |  \
		      	          T1   G1  G2
				       |    |
				       T2   T3

So in above flat mode we will idle both on G1 and G2 (assuming T2 and T3)
are doing IO. When converted to hierarchy, I think total idling time
still remains same and hierarchy is not imposing addition idling. 

That's a different thing that I am not happy with amount of idling we
do even now as it kills performance.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found]                 ` <5052E931.8000007@parallels.com>
@ 2012-09-14 18:56                   ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-14 18:56 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Vivek Goyal, linux-kernel, Michal Hocko, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Fri, Sep 14, 2012 at 12:22:09PM +0400, Glauber Costa wrote:
> This is not about being interested anymore
> This thing is more pressing than global warming.

Heh, yeah, let's save the polar bears. :)

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH REPOST RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found]                 ` <5052E87A.1050405@parallels.com>
@ 2012-09-14 19:15                   ` Tejun Heo
  0 siblings, 0 replies; 41+ messages in thread
From: Tejun Heo @ 2012-09-14 19:15 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Michal Hocko, linux-kernel, containers, cgroups, Li Zefan,
	Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf,
	Serge E. Hallyn, Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Fri, Sep 14, 2012 at 12:19:06PM +0400, Glauber Costa wrote:
> I want oppose it as well, but I believe part of this exercise is to make
> the need to have hierarchy widespread. Warning on the case
> 1st-level-only case helps with that, even if we make more noise than we
> should.
> 
> The reason I supported Tejun's proposal originally, is that I think that
> if we make the wrong amount of noise, being wrong by a surplus is better
> than being wrong by a deficit, in this case.

I think both are valid points and don't think it makes a lot of
difference either way.  Michal being the maintainer of the code, I'm
taking his approach for this one.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them
       [not found]         ` <5052E9BC.2020908@parallels.com>
@ 2012-09-17  7:59           ` Daniel Wagner
  0 siblings, 0 replies; 41+ messages in thread
From: Daniel Wagner @ 2012-09-17  7:59 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Tejun Heo, linux-kernel, Michal Hocko, Li Zefan, Peter Zijlstra,
	Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn,
	Vivek Goyal, Paul Mackerras, Ingo Molnar,
	Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V,
	Dave Jones, Lennart Poettering, Kay Sievers

On 14.09.2012 10:24, Glauber Costa wrote:
> On 09/13/2012 09:48 PM, Tejun Heo wrote:
>> Hello, Glauber.
>>
>> On Thu, Sep 13, 2012 at 03:53:56PM +0400, Glauber Costa wrote:
>>> Here is where the Kconfig option comes to play. If we do it in the
>>> kernel, userspace doesn't have to do anything. I spoke with Lennart and
>>> Kay, and at least from a systemd PoV, they would much rather not provide
>>> a hack in userspace for a file that is scheduled to go away in any case
>>> - which I personally believe is a fair request.
>>>
>>> It is a default, so the effect for the user is the same: After the
>>> machine boots, use_hierarchy = 1, and he can still flip to 0 for some time.
>>
>> Alright, let's go Kconfig.  Let's just make sure that the transitional
>> nature is clearly labeled and the fact that the default config will
>> generate a warning when nested cgroups are created in memcg.  We can
>> then coordinate the flip with distros.  Can you please repost the
>> Kconfig patch?
>>
>
> Just wait a bit. If you are merging your earlier patch, I'd like to take
> that in consideration. In that case, I'd rebase.
>
>>>> Setting mark on a parent should be reflected on all its children w/o
>>>> their own explicit settings.
>>>
>>> That is clear, and better behavior than we have today. What I mean, is
>>> that by setting its own marking, the child can pretty much "escape" the
>>> group.

To escape net_prio the process needs CAP_NET_ADMIN to overwrite 
SO_PRIORITY. net_cls has already its own marking.

>>> The ideal solution - from this point of view only - would be to have
>>> more than one marking, and mark with all the way down to the root. So if
>>> you have an iptables rule to match one marking, it still applies to the
>>> kids. And you can still have extra markings.

I think what you are describing is something like a generic socket 
marker which and a cgroup matcher for iptables, no? What about SO_MARK. 
As noted above it would be nice if the processes could not escape with 
setting SO_MARK (it also need SO_NET_ADMIN).

>>> I am not sure this is feasible, though, in which case your solution
>>> could be a good compromise. But please let's aim for it.
>>
>> I don't think it supports multiple tags.  If that's possible, it would
>> be nice but I don't think it's a must.
>>
>
> It is not about "it supports", but more about "can it support?"
> But I honestly can't answer this question. net_cls people need to
> come and tell us about it.

I struggle to understand for what the multiple tags are good for. Any 
examples?

Another question on hierarchies on the networking controllers: Are there 
any real dependencies between a cgroup and its children. For resources 
like CPU cycles or memory it makes perfectly sense.

I don't see a *direct* relation ship between the parent cgroup and it's 
children for the networking controllers. There seems to be some sort of 
relation ship for SO_PRIORITY. As the current net_prio implementation 
does not allow hierarchies it avoids to answer this question.

net_cls allows hierarchies but has no restriction on setting the classid 
for TC. The traffic shaping happens completely independent of the cgroup 
hierarchies and there is no relation ship at all.

Do we need any sort of formal definition for hierarchal dependencies for 
the networking controlllers?



^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2012-09-17  7:59 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-10 22:31 [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them Tejun Heo
2012-09-10 22:33 ` [PATCH REPOST " Tejun Heo
2012-09-11 10:04   ` Michal Hocko
2012-09-11 17:07     ` Tejun Heo
2012-09-12 15:47       ` Michal Hocko
2012-09-12 16:41         ` Tejun Heo
     [not found]     ` <5050568B.9090601@parallels.com>
2012-09-12 15:49       ` Michal Hocko
2012-09-12 17:11         ` Tejun Heo
2012-09-13 12:14           ` Michal Hocko
2012-09-13 17:18             ` Tejun Heo
2012-09-13 17:39               ` Michal Hocko
     [not found]                 ` <5052E87A.1050405@parallels.com>
2012-09-14 19:15                   ` Tejun Heo
     [not found]           ` <5051CB24.4010801@parallels.com>
2012-09-13 17:21             ` Tejun Heo
2012-09-11 12:38   ` Li Zefan
2012-09-11 17:08     ` Tejun Heo
2012-09-11 17:43       ` Tejun Heo
     [not found]         ` <505057D8.4010908@parallels.com>
2012-09-12 16:34           ` Tejun Heo
2012-09-13  6:48             ` Li Zefan
2012-09-11 18:23   ` [PATCH UPDATED " Tejun Heo
2012-09-11 20:50     ` Aristeu Rozanski
2012-09-11 20:51       ` Tejun Heo
2012-09-13 12:16   ` [PATCH REPOST " Daniel P. Berrange
2012-09-13 17:52     ` Tejun Heo
2012-09-11 14:51 ` [PATCH " Vivek Goyal
2012-09-11 14:54   ` Vivek Goyal
2012-09-11 17:16   ` Tejun Heo
2012-09-11 17:35     ` Vivek Goyal
2012-09-11 17:55       ` Tejun Heo
2012-09-11 18:16         ` Vivek Goyal
2012-09-11 18:22           ` Tejun Heo
2012-09-11 18:38             ` Vivek Goyal
     [not found]         ` <50505C39.1050600@parallels.com>
2012-09-12 17:09           ` Tejun Heo
2012-09-13 14:53             ` Block IO controller hierarchy suppport (Was: Re: [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them) Vivek Goyal
2012-09-13 22:06               ` Tejun Heo
2012-09-14  2:53                 ` Vivek Goyal
     [not found]                   ` <5052E8DA.1000106@parallels.com>
2012-09-14 13:22                     ` Vivek Goyal
     [not found]             ` <5051CBAA.5040308@parallels.com>
2012-09-13 17:54               ` [PATCH RFC cgroup/for-3.7] cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them Tejun Heo
     [not found]                 ` <5052E931.8000007@parallels.com>
2012-09-14 18:56                   ` Tejun Heo
     [not found] ` <505055E5.90903@parallels.com>
2012-09-12 17:03   ` Tejun Heo
     [not found]     ` <5051C954.2080600@parallels.com>
2012-09-13 17:48       ` Tejun Heo
     [not found]         ` <5052E9BC.2020908@parallels.com>
2012-09-17  7:59           ` Daniel Wagner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).