* [PATCHSET sched,cgroup] sched: Implement interface for cgroup unified hierarchy @ 2015-08-03 22:41 Tejun Heo 2015-08-03 22:41 ` [PATCH 1/3] cgroup: define controller file conventions Tejun Heo ` (2 more replies) 0 siblings, 3 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-03 22:41 UTC (permalink / raw) To: mingo, peterz; +Cc: hannes, lizefan, cgroups, linux-kernel, kernel-team Hello, This patchset implements cpu controller's interface for unified hierarchy. While cpu controller didn't have structural issues that memcg and blkcg had, there still are minor issues such as cpuacct and use of different time units and its interface can be made consistent with other controllers so that cgroup as a whole presents uniform ways to achieve similar things with different resources. This patchset contains the following three patches. 0001-cgroup-define-controller-file-conventions.patch 0002-sched-Misc-preps-for-cgroup-unified-hierarchy-interf.patch 0003-sched-Implement-interface-for-cgroup-unified-hierarc.patch The "Controller file conventions" section in Documentation/cgroups/unified-hierarchy.txt which is added by the first patch codifies the syntax and semantics for controller knobs and the next two patches implement the new interface for the cpu controller. The first patch is needed by blkcg too, so once the changes get acked I'll set up a branch containing the patch so that it can be pulled from both sched and blkcg. This patchset is on top of v4.2-rc1 and also available in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-sched-unified-intf diffstat follows, thanks. Documentation/cgroups/unified-hierarchy.txt | 128 +++++++++++++++++++- include/linux/cgroup.h | 9 + kernel/sched/core.c | 173 +++++++++++++++++++++++++++- kernel/sched/cpuacct.c | 57 ++++++--- kernel/sched/cpuacct.h | 5 5 files changed, 342 insertions(+), 30 deletions(-) -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* [PATCH 1/3] cgroup: define controller file conventions 2015-08-03 22:41 [PATCHSET sched,cgroup] sched: Implement interface for cgroup unified hierarchy Tejun Heo @ 2015-08-03 22:41 ` Tejun Heo 2015-08-04 8:42 ` Peter Zijlstra ` (2 more replies) 2015-08-03 22:41 ` [PATCH 2/3] sched: Misc preps for cgroup unified hierarchy interface Tejun Heo 2015-08-03 22:41 ` [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy Tejun Heo 2 siblings, 3 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-03 22:41 UTC (permalink / raw) To: mingo, peterz Cc: hannes, lizefan, cgroups, linux-kernel, kernel-team, Tejun Heo Traditionally, each cgroup controller implemented whatever interface it wanted leading to interfaces which are widely inconsistent. Examining the requirements of the controllers readily yield that there are only a few control schemes shared among all. Two major controllers already had to implement new interface for the unified hierarchy due to significant structural changes. Let's take the chance to establish common conventions throughout all controllers. This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight based control knobs and documents the conventions that controllers should follow on the unified hierarchy. Except for io.weight knob, all existing unified hierarchy knobs are already compliant. A follow-up patch will update io.weight. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> --- Documentation/cgroups/unified-hierarchy.txt | 75 ++++++++++++++++++++++++++--- include/linux/cgroup.h | 9 ++++ 2 files changed, 76 insertions(+), 8 deletions(-) diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt index 86847a7..fc372b8 100644 --- a/Documentation/cgroups/unified-hierarchy.txt +++ b/Documentation/cgroups/unified-hierarchy.txt @@ -23,10 +23,13 @@ CONTENTS 5. Other Changes 5-1. [Un]populated Notification 5-2. Other Core Changes - 5-3. Per-Controller Changes - 5-3-1. blkio - 5-3-2. cpuset - 5-3-3. memory + 5-3. Controller file conventions + 5-3-1. Format + 5-3-2. Control knobs + 5-4. Per-Controller Changes + 5-4-1. blkio + 5-4-2. cpuset + 5-4-3. memory 6. Planned Changes 6-1. CAP for resource control @@ -372,14 +375,70 @@ supported and the interface files "release_agent" and - The "cgroup.clone_children" file is removed. -5-3. Per-Controller Changes +5-3. Controller file conventions -5-3-1. blkio +5-3-1. Format + +In general, all controller files should be in one of the following +formats whenever possible. + +- Values only files + + VAL0 VAL1...\n + +- Flat keyed files + + KEY0 VAL0\n + KEY1 VAL1\n + ... + +- Nested keyed files + + KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... + KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... + ... + +For a writeable file, the format for writing should generally match +reading; however, controllers may allow omitting later fields or +implement restricted shortcuts for most common use cases. + +For both flat and nested keyed files, only the values for a single key +can be written at a time. For nested keyed files, the sub key pairs +may be specified in any order and not all pairs have to be specified. + + +5-3-2. Control knobs + +- Settings for a single feature should generally be implemented in a + single file. + +- In general, the root cgroup should be exempt from resource control + and thus shouldn't have resource control knobs. 
+ +- If a controller implements ratio based resource distribution, the + control knob should be named "weight" and have the range [1, 10000] + and 100 should be the default value. The values are chosen to allow + enough and symmetric bias in both directions while keeping it + intuitive (the default is 100%). + +- If a controller implements an absolute resource limit, the control + knob should be named "max". The special token "max" should be used + to represent no limit for both reading and writing. + +- If a setting has configurable default value and specific overrides, + the default settings should be keyed with "default" and appear as + the first entry in the file. Specific entries can use "default" as + its value to indicate inheritance of the default value. + + +5-4. Per-Controller Changes + +5-4-1. blkio - blk-throttle becomes properly hierarchical. -5-3-2. cpuset +5-4-2. cpuset - Tasks are kept in empty cpusets after hotplug and take on the masks of the nearest non-empty ancestor, instead of being moved to it. @@ -388,7 +447,7 @@ supported and the interface files "release_agent" and masks of the nearest non-empty ancestor. -5-3-3. memory +5-4-3. memory - use_hierarchy is on by default and the cgroup file for the flag is not created. diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index a593e29..c6bf9d3 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -22,6 +22,15 @@ #ifdef CONFIG_CGROUPS +/* + * All weight knobs on the default hierarhcy should use the following min, + * default and max values. The default value is the logarithmic center of + * MIN and MAX and allows 100x to be expressed in both directions. + */ +#define CGROUP_WEIGHT_MIN 1 +#define CGROUP_WEIGHT_DFL 100 +#define CGROUP_WEIGHT_MAX 10000 + /* a css_task_iter should be treated as an opaque object */ struct css_task_iter { struct cgroup_subsys *ss; -- 2.4.3 ^ permalink raw reply related [flat|nested] 92+ messages in thread
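The three file formats above become clearer with a concrete consumer. Below is a minimal user-space sketch in C of parsing a nested keyed file as specified by the convention; the path and key names are illustrative assumptions, not files defined by this patch.

/*
 * Sketch: parse a "nested keyed" cgroup file as defined above, i.e.
 * lines of the form "KEY SUB_KEY0=VAL0 SUB_KEY1=VAL1 ...".  The path
 * is purely illustrative.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/test/example.nested"; /* hypothetical */
	char line[512];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		char *save, *tok = strtok_r(line, " \n", &save);

		if (!tok)
			continue;
		printf("key %s:\n", tok);	/* first token is the key */

		/* remaining tokens are SUB_KEY=VAL pairs; order is not fixed */
		while ((tok = strtok_r(NULL, " \n", &save))) {
			char *eq = strchr(tok, '=');

			if (eq) {
				*eq = '\0';
				printf("  %s -> %s\n", tok, eq + 1);
			}
		}
	}
	fclose(f);
	return 0;
}

Flat keyed files are the degenerate case of the same loop with a single "KEY VAL" pair per line, and, per the convention, a write may only target one key at a time.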
* Re: [PATCH 1/3] cgroup: define controller file conventions 2015-08-03 22:41 ` [PATCH 1/3] cgroup: define controller file conventions Tejun Heo @ 2015-08-04 8:42 ` Peter Zijlstra 2015-08-04 14:51 ` Tejun Heo 2015-08-04 8:48 ` Peter Zijlstra 2015-08-04 19:31 ` [PATCH v2 " Tejun Heo 2 siblings, 1 reply; 92+ messages in thread From: Peter Zijlstra @ 2015-08-04 8:42 UTC (permalink / raw) To: Tejun Heo; +Cc: mingo, hannes, lizefan, cgroups, linux-kernel, kernel-team On Mon, Aug 03, 2015 at 06:41:27PM -0400, Tejun Heo wrote: > +- If a controller implements an absolute resource limit, the control > + knob should be named "max". The special token "max" should be used > + to represent no limit for both reading and writing. So what do you do with minimal resource guarantees? That's still an absolute resource limit and 'max' is obviously the wrong name. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 1/3] cgroup: define controller file conventions 2015-08-04 8:42 ` Peter Zijlstra @ 2015-08-04 14:51 ` Tejun Heo 0 siblings, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-04 14:51 UTC (permalink / raw) To: Peter Zijlstra; +Cc: mingo, hannes, lizefan, cgroups, linux-kernel, kernel-team Hello, On Tue, Aug 04, 2015 at 10:42:57AM +0200, Peter Zijlstra wrote: > On Mon, Aug 03, 2015 at 06:41:27PM -0400, Tejun Heo wrote: > > +- If a controller implements an absolute resource limit, the control > > + knob should be named "max". The special token "max" should be used > > + to represent no limit for both reading and writing. > > So what do you do with minimal resource guarantees? That's still an > absolute resource limit and 'max' is obviously the wrong name. The whole spectrum is min, low, high, max, where min and max are the absolute guarantee and upper limit, and low and high are their best-effort counterparts. Will update the doc. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 1/3] cgroup: define controller file conventions 2015-08-03 22:41 ` [PATCH 1/3] cgroup: define controller file conventions Tejun Heo 2015-08-04 8:42 ` Peter Zijlstra @ 2015-08-04 8:48 ` Peter Zijlstra 2015-08-04 14:53 ` Tejun Heo 2015-08-04 19:31 ` [PATCH v2 " Tejun Heo 2 siblings, 1 reply; 92+ messages in thread From: Peter Zijlstra @ 2015-08-04 8:48 UTC (permalink / raw) To: Tejun Heo; +Cc: mingo, hannes, lizefan, cgroups, linux-kernel, kernel-team On Mon, Aug 03, 2015 at 06:41:27PM -0400, Tejun Heo wrote: > > This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight > based control knobs and documents the conventions that controllers > should follow on the unified hierarchy. Except for io.weight knob, > all existing unified hierarchy knobs are already compliant. A > follow-up patch will update io.weight. > +- If a controller implements ratio based resource distribution, the > + control knob should be named "weight" and have the range [1, 10000] > + and 100 should be the default value. The values are chosen to allow > + enough and symmetric bias in both directions while keeping it > + intuitive (the default is 100%). Aside from 100% being a sane 'default', what it actually is is a unit. 100% == 1. So I would suggest naming the thing CGROUP_WEIGHT_UNIT := 100, > +/* > + * All weight knobs on the default hierarhcy should use the following min, > + * default and max values. The default value is the logarithmic center of > + * MIN and MAX and allows 100x to be expressed in both directions. > + */ > +#define CGROUP_WEIGHT_MIN 1 > +#define CGROUP_WEIGHT_DFL 100 > +#define CGROUP_WEIGHT_MAX 10000 That said, I'm not entirely keen on having to change this. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 1/3] cgroup: define controller file conventions 2015-08-04 8:48 ` Peter Zijlstra @ 2015-08-04 14:53 ` Tejun Heo 0 siblings, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-04 14:53 UTC (permalink / raw) To: Peter Zijlstra; +Cc: mingo, hannes, lizefan, cgroups, linux-kernel, kernel-team Hello, Peter. On Tue, Aug 04, 2015 at 10:48:55AM +0200, Peter Zijlstra wrote: > > +- If a controller implements ratio based resource distribution, the > > + control knob should be named "weight" and have the range [1, 10000] > > + and 100 should be the default value. The values are chosen to allow > > + enough and symmetric bias in both directions while keeping it > > + intuitive (the default is 100%). > > Aside from 100% being a sane 'default', what it actually is is a unit. > 100% == 1. > > So I would suggest naming the thing CGROUP_WEIGHT_UNIT := 100, It's a minor point either way but I think people would generally find default more familiar. > > +/* > > + * All weight knobs on the default hierarhcy should use the following min, > > + * default and max values. The default value is the logarithmic center of > > + * MIN and MAX and allows 100x to be expressed in both directions. > > + */ > > +#define CGROUP_WEIGHT_MIN 1 > > +#define CGROUP_WEIGHT_DFL 100 > > +#define CGROUP_WEIGHT_MAX 10000 > > That said, I'm not entirely keen on having to change this. Yeah, changing the scale is an icky thing to do but I think the benefits of unifying the scales across different controllers outweigh here. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
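Whatever the knob ends up being called, the scale itself is easy to sanity-check. The following is a small user-space sketch, not kernel code, that applies the same rounded conversions the cpu patch later in this series uses (shares = weight * 1024 / 100 and back, with 1024 being the scheduler's nice-0 share) and verifies that every weight in [1, 10000] survives the round trip.

/*
 * Sketch: check that the weight <-> shares mapping used later in the
 * series round-trips exactly over the whole [1, 10000] range.
 * DIV_ROUND_CLOSEST is re-implemented here for user space.
 */
#include <stdio.h>

#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

int main(void)
{
	unsigned long long weight, shares, back;

	for (weight = 1; weight <= 10000; weight++) {
		shares = DIV_ROUND_CLOSEST(weight * 1024, 100);	/* weight -> shares */
		back = DIV_ROUND_CLOSEST(shares * 100, 1024);	/* shares -> weight */
		if (back != weight) {
			printf("round trip broke at weight %llu (got %llu)\n",
			       weight, back);
			return 1;
		}
	}
	printf("all weights in [1, 10000] round-trip cleanly\n");
	return 0;
}

In share units the mapped range works out to roughly [10, 102400], which is narrower than the raw scheduler range but, as the later patch description argues, still wider than what the nice-level mappings cover.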
* [PATCH v2 1/3] cgroup: define controller file conventions 2015-08-03 22:41 ` [PATCH 1/3] cgroup: define controller file conventions Tejun Heo 2015-08-04 8:42 ` Peter Zijlstra 2015-08-04 8:48 ` Peter Zijlstra @ 2015-08-04 19:31 ` Tejun Heo 2015-08-05 0:39 ` Kamezawa Hiroyuki 2015-08-17 21:34 ` Johannes Weiner 2 siblings, 2 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-04 19:31 UTC (permalink / raw) To: mingo, peterz; +Cc: hannes, lizefan, cgroups, linux-kernel, kernel-team >From 6abc8ca19df0078de17dc38340db3002ed489ce7 Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Tue, 4 Aug 2015 15:20:55 -0400 Traditionally, each cgroup controller implemented whatever interface it wanted leading to interfaces which are widely inconsistent. Examining the requirements of the controllers readily yield that there are only a few control schemes shared among all. Two major controllers already had to implement new interface for the unified hierarchy due to significant structural changes. Let's take the chance to establish common conventions throughout all controllers. This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight based control knobs and documents the conventions that controllers should follow on the unified hierarchy. Except for io.weight knob, all existing unified hierarchy knobs are already compliant. A follow-up patch will update io.weight. v2: Added descriptions of min, low and high knobs. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <peterz@infradead.org> --- Hello, Added low/high descriptions and applied to the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.3-unified-base The branch currently only contains this patch and will stay stable so that it can be pulled from. I kept the base weight as DFL for now. If we decide to change it, I'll apply the change on top. Thanks. Documentation/cgroups/unified-hierarchy.txt | 80 ++++++++++++++++++++++++++--- include/linux/cgroup.h | 9 ++++ 2 files changed, 81 insertions(+), 8 deletions(-) diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt index 86847a7..1ee9caf 100644 --- a/Documentation/cgroups/unified-hierarchy.txt +++ b/Documentation/cgroups/unified-hierarchy.txt @@ -23,10 +23,13 @@ CONTENTS 5. Other Changes 5-1. [Un]populated Notification 5-2. Other Core Changes - 5-3. Per-Controller Changes - 5-3-1. blkio - 5-3-2. cpuset - 5-3-3. memory + 5-3. Controller File Conventions + 5-3-1. Format + 5-3-2. Control Knobs + 5-4. Per-Controller Changes + 5-4-1. blkio + 5-4-2. cpuset + 5-4-3. memory 6. Planned Changes 6-1. CAP for resource control @@ -372,14 +375,75 @@ supported and the interface files "release_agent" and - The "cgroup.clone_children" file is removed. -5-3. Per-Controller Changes +5-3. Controller File Conventions -5-3-1. blkio +5-3-1. Format + +In general, all controller files should be in one of the following +formats whenever possible. + +- Values only files + + VAL0 VAL1...\n + +- Flat keyed files + + KEY0 VAL0\n + KEY1 VAL1\n + ... + +- Nested keyed files + + KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... + KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... + ... + +For a writeable file, the format for writing should generally match +reading; however, controllers may allow omitting later fields or +implement restricted shortcuts for most common use cases. 
+ +For both flat and nested keyed files, only the values for a single key +can be written at a time. For nested keyed files, the sub key pairs +may be specified in any order and not all pairs have to be specified. + + +5-3-2. Control Knobs + +- Settings for a single feature should generally be implemented in a + single file. + +- In general, the root cgroup should be exempt from resource control + and thus shouldn't have resource control knobs. + +- If a controller implements ratio based resource distribution, the + control knob should be named "weight" and have the range [1, 10000] + and 100 should be the default value. The values are chosen to allow + enough and symmetric bias in both directions while keeping it + intuitive (the default is 100%). + +- If a controller implements an absolute resource guarantee and/or + limit, the control knobs should be named "min" and "max" + respectively. If a controller implements best effort resource + gurantee and/or limit, the control knobs should be named "low" and + "high" respectively. + + In the above four control files, the special token "max" should be + used to represent upward infinity for both reading and writing. + +- If a setting has configurable default value and specific overrides, + the default settings should be keyed with "default" and appear as + the first entry in the file. Specific entries can use "default" as + its value to indicate inheritance of the default value. + + +5-4. Per-Controller Changes + +5-4-1. blkio - blk-throttle becomes properly hierarchical. -5-3-2. cpuset +5-4-2. cpuset - Tasks are kept in empty cpusets after hotplug and take on the masks of the nearest non-empty ancestor, instead of being moved to it. @@ -388,7 +452,7 @@ supported and the interface files "release_agent" and masks of the nearest non-empty ancestor. -5-3-3. memory +5-4-3. memory - use_hierarchy is on by default and the cgroup file for the flag is not created. diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index a593e29..c6bf9d3 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -22,6 +22,15 @@ #ifdef CONFIG_CGROUPS +/* + * All weight knobs on the default hierarhcy should use the following min, + * default and max values. The default value is the logarithmic center of + * MIN and MAX and allows 100x to be expressed in both directions. + */ +#define CGROUP_WEIGHT_MIN 1 +#define CGROUP_WEIGHT_DFL 100 +#define CGROUP_WEIGHT_MAX 10000 + /* a css_task_iter should be treated as an opaque object */ struct css_task_iter { struct cgroup_subsys *ss; -- 2.4.3 ^ permalink raw reply related [flat|nested] 92+ messages in thread
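The "max" token is the convention most likely to surprise tooling that expects pure numbers. Below is a hedged user-space sketch of reading and writing a limit knob that follows it, using memory.max as the example; the cgroup path is an assumption and error handling is kept minimal.

/*
 * Sketch: handle a v2 limit knob that uses the special "max" token for
 * "no limit", per the convention above.  The path is illustrative only.
 */
#include <stdio.h>
#include <string.h>
#include <limits.h>

#define NO_LIMIT ULLONG_MAX

static int read_limit(const char *path, unsigned long long *limit)
{
	char buf[32];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	if (!strncmp(buf, "max", 3)) {
		*limit = NO_LIMIT;		/* "max" means no limit */
		return 0;
	}
	return sscanf(buf, "%llu", limit) == 1 ? 0 : -1;
}

static int write_limit(const char *path, unsigned long long limit)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (limit == NO_LIMIT)
		fputs("max\n", f);		/* writing "max" lifts the limit */
	else
		fprintf(f, "%llu\n", limit);
	return fclose(f);
}

int main(void)
{
	const char *path = "/sys/fs/cgroup/test/memory.max";	/* hypothetical */
	unsigned long long limit;

	if (!read_limit(path, &limit))
		printf("current limit: %llu%s\n", limit,
		       limit == NO_LIMIT ? " (no limit)" : "");
	return write_limit(path, NO_LIMIT) ? 1 : 0;
}

The keyed "default" entry convention works the same way conceptually: the special case is spelled as a token rather than encoded as a magic number, so tooling should key off the strings.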
* Re: [PATCH v2 1/3] cgroup: define controller file conventions 2015-08-04 19:31 ` [PATCH v2 " Tejun Heo @ 2015-08-05 0:39 ` Kamezawa Hiroyuki 2015-08-05 7:47 ` Michal Hocko 2015-08-17 21:34 ` Johannes Weiner 1 sibling, 1 reply; 92+ messages in thread From: Kamezawa Hiroyuki @ 2015-08-05 0:39 UTC (permalink / raw) To: Tejun Heo, mingo, peterz Cc: hannes, lizefan, cgroups, linux-kernel, kernel-team On 2015/08/05 4:31, Tejun Heo wrote: > From 6abc8ca19df0078de17dc38340db3002ed489ce7 Mon Sep 17 00:00:00 2001 > From: Tejun Heo <tj@kernel.org> > Date: Tue, 4 Aug 2015 15:20:55 -0400 > > Traditionally, each cgroup controller implemented whatever interface > it wanted leading to interfaces which are widely inconsistent. > Examining the requirements of the controllers readily yield that there > are only a few control schemes shared among all. > > Two major controllers already had to implement new interface for the > unified hierarchy due to significant structural changes. Let's take > the chance to establish common conventions throughout all controllers. > > This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight > based control knobs and documents the conventions that controllers > should follow on the unified hierarchy. Except for io.weight knob, > all existing unified hierarchy knobs are already compliant. A > follow-up patch will update io.weight. > > v2: Added descriptions of min, low and high knobs. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > Cc: Li Zefan <lizefan@huawei.com> > Cc: Peter Zijlstra <peterz@infradead.org> > --- > Hello, > > Added low/high descriptions and applied to the following git branch. > > git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.3-unified-base > > The branch currently only contains this patch and will stay stable so > that it can be pulled from. I kept the base weight as DFL for now. > If we decide to change it, I'll apply the change on top. > > Thanks. > > Documentation/cgroups/unified-hierarchy.txt | 80 ++++++++++++++++++++++++++--- > include/linux/cgroup.h | 9 ++++ > 2 files changed, 81 insertions(+), 8 deletions(-) > > diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt > index 86847a7..1ee9caf 100644 > --- a/Documentation/cgroups/unified-hierarchy.txt > +++ b/Documentation/cgroups/unified-hierarchy.txt > @@ -23,10 +23,13 @@ CONTENTS > 5. Other Changes > 5-1. [Un]populated Notification > 5-2. Other Core Changes > - 5-3. Per-Controller Changes > - 5-3-1. blkio > - 5-3-2. cpuset > - 5-3-3. memory > + 5-3. Controller File Conventions > + 5-3-1. Format > + 5-3-2. Control Knobs > + 5-4. Per-Controller Changes > + 5-4-1. blkio > + 5-4-2. cpuset > + 5-4-3. memory > 6. Planned Changes > 6-1. CAP for resource control > > @@ -372,14 +375,75 @@ supported and the interface files "release_agent" and > - The "cgroup.clone_children" file is removed. > > > -5-3. Per-Controller Changes > +5-3. Controller File Conventions > > -5-3-1. blkio > +5-3-1. Format > + > +In general, all controller files should be in one of the following > +formats whenever possible. > + > +- Values only files > + > + VAL0 VAL1...\n > + > +- Flat keyed files > + > + KEY0 VAL0\n > + KEY1 VAL1\n > + ... > + > +- Nested keyed files > + > + KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... > + KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... > + ... 
> + > +For a writeable file, the format for writing should generally match > +reading; however, controllers may allow omitting later fields or > +implement restricted shortcuts for most common use cases. > + > +For both flat and nested keyed files, only the values for a single key > +can be written at a time. For nested keyed files, the sub key pairs > +may be specified in any order and not all pairs have to be specified. > + > + > +5-3-2. Control Knobs > + > +- Settings for a single feature should generally be implemented in a > + single file. > + > +- In general, the root cgroup should be exempt from resource control > + and thus shouldn't have resource control knobs. > + > +- If a controller implements ratio based resource distribution, the > + control knob should be named "weight" and have the range [1, 10000] > + and 100 should be the default value. The values are chosen to allow > + enough and symmetric bias in both directions while keeping it > + intuitive (the default is 100%). > + > +- If a controller implements an absolute resource guarantee and/or > + limit, the control knobs should be named "min" and "max" > + respectively. If a controller implements best effort resource > + gurantee and/or limit, the control knobs should be named "low" and > + "high" respectively. > + > + In the above four control files, the special token "max" should be > + used to represent upward infinity for both reading and writing. > + so, for memory controller, we'll have (in alphabet order) memory.failcnt memory.force_empty (<= should this be removed ?) memory.kmem.failcnt memory.kmem.max memory.kmem.max_usage memory.kmem.slabinfo memory.kmem.tcp.failcnt memory.kmem.tcp.max memory.kmem.tcp.max_usage memory.kmem.tcp.usage memory.kmem.usage memory.max memory.max_usage memory.move_charge_at_immigrate memory.numa_stat memory.oom_control memory.pressure_level memory.high memory.swapiness memory.usage memory.use_hierarchy (<= removed) ? -Kame ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH v2 1/3] cgroup: define controller file conventions 2015-08-05 0:39 ` Kamezawa Hiroyuki @ 2015-08-05 7:47 ` Michal Hocko 2015-08-06 2:30 ` Kamezawa Hiroyuki 0 siblings, 1 reply; 92+ messages in thread From: Michal Hocko @ 2015-08-05 7:47 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Tejun Heo, mingo, peterz, hannes, lizefan, cgroups, linux-kernel, kernel-team On Wed 05-08-15 09:39:40, KAMEZAWA Hiroyuki wrote: [...] > so, for memory controller, we'll have We currently have only current, low, high, max and events currently. All other knobs are either deprecated or waiting for a usecase to emerge before they get added. > (in alphabet order) > memory.failcnt > memory.force_empty (<= should this be removed ?) > memory.kmem.failcnt > memory.kmem.max > memory.kmem.max_usage > memory.kmem.slabinfo > memory.kmem.tcp.failcnt > memory.kmem.tcp.max > memory.kmem.tcp.max_usage > memory.kmem.tcp.usage > memory.kmem.usage > memory.max > memory.max_usage > memory.move_charge_at_immigrate > memory.numa_stat > memory.oom_control > memory.pressure_level > memory.high > memory.swapiness > memory.usage > memory.use_hierarchy (<= removed) -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH v2 1/3] cgroup: define controller file conventions 2015-08-05 7:47 ` Michal Hocko @ 2015-08-06 2:30 ` Kamezawa Hiroyuki 2015-08-07 18:17 ` Michal Hocko 0 siblings, 1 reply; 92+ messages in thread From: Kamezawa Hiroyuki @ 2015-08-06 2:30 UTC (permalink / raw) To: Michal Hocko Cc: Tejun Heo, mingo, peterz, hannes, lizefan, cgroups, linux-kernel, kernel-team On 2015/08/05 16:47, Michal Hocko wrote: > On Wed 05-08-15 09:39:40, KAMEZAWA Hiroyuki wrote: > [...] >> so, for memory controller, we'll have > > We currently have only current, low, high, max and events currently. > All other knobs are either deprecated or waiting for a usecase to emerge > before they get added. > Sure. I think the following have users. - *.stat - for checking the health of a cgroup, or for debugging - *.pressure_level - for notifying memory pressure - *.swappiness - for adjusting LRU activity per application type. - *.oom_control - for surviving/notifying out of memory. A memcg's OOM can be recovered from by raising the limit rather than killing. But I know people say this knob is not useful. This will require discussion. Hm. If we don't want to increase the number of files, would NETLINK or a system call be another choice of subsystem-specific interface? -Kame >> (in alphabet order) >> memory.failcnt >> memory.force_empty (<= should this be removed ?) >> memory.kmem.failcnt >> memory.kmem.max >> memory.kmem.max_usage >> memory.kmem.slabinfo >> memory.kmem.tcp.failcnt >> memory.kmem.tcp.max >> memory.kmem.tcp.max_usage >> memory.kmem.tcp.usage >> memory.kmem.usage >> memory.max >> memory.max_usage >> memory.move_charge_at_immigrate >> memory.numa_stat >> memory.oom_control >> memory.pressure_level >> memory.high >> memory.swapiness >> memory.usage >> memory.use_hierarchy (<= removed) > ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH v2 1/3] cgroup: define controller file conventions 2015-08-06 2:30 ` Kamezawa Hiroyuki @ 2015-08-07 18:17 ` Michal Hocko 2015-08-17 22:04 ` Johannes Weiner 0 siblings, 1 reply; 92+ messages in thread From: Michal Hocko @ 2015-08-07 18:17 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Tejun Heo, mingo, peterz, hannes, lizefan, cgroups, linux-kernel, kernel-team On Thu 06-08-15 11:30:08, KAMEZAWA Hiroyuki wrote: [...] > Sure. I think following has users. > - *.stat - for chekcing health of cgroup ,or for debug Yes but we want to have something which is closer to meminfo/vmstat IMO > - *.pressure_level - for notifying memory pressure Notifications are definitely useful I am just not sure this interface is the right one. We have seen some requests to adjust the interface to get new semantics (edge vs. level triggered). This should be sorted out before we expose the knob. > - *.swappiness - for adjusting LRU activity per application type. Yes, and I wanted to post a patch to export it several times but then I realized that this should be done only as long as vm.swappiness stays and it is not deprecated. And more and more I think about swappiness the less sure I am about it's usefulness. It is not doing much for quite some time because we are heavily biasing to the pagecache reclaim and the knob is more and more misleading. It is also not offering what people might want it to do. E.g. it doesn't allow for preferring swapout which might be useful when the swap is backed by a really fast storage. Maybe we will need a new metric here so I wouldn't rush exporting memcg alternative much. > - *.oom_control - for surviving/notifiyng out of memory > memcg's oom can be recovered if limit goes up rather than kill. I think it is very much useful - when used wisely. I have seen many calls for user defined OOM policies but then we have seen those that are more creative like having the policy maker live in the same memcg which requires some hacks to prevent from self-deadlocks. So overall this is very attractive but we might need to think about a better interface. BPF sounds like a potential way to go. I feel the memcg and the global approaches should be consistent as much as possible wrt. API. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH v2 1/3] cgroup: define controller file conventions 2015-08-07 18:17 ` Michal Hocko @ 2015-08-17 22:04 ` Johannes Weiner 0 siblings, 0 replies; 92+ messages in thread From: Johannes Weiner @ 2015-08-17 22:04 UTC (permalink / raw) To: Michal Hocko Cc: Kamezawa Hiroyuki, Tejun Heo, mingo, peterz, lizefan, cgroups, linux-kernel, kernel-team On Fri, Aug 07, 2015 at 08:17:23PM +0200, Michal Hocko wrote: > On Thu 06-08-15 11:30:08, KAMEZAWA Hiroyuki wrote: > > - *.oom_control - for surviving/notifiyng out of memory > > memcg's oom can be recovered if limit goes up rather than kill. > > I think it is very much useful - when used wisely. I have seen many > calls for user defined OOM policies but then we have seen those that are > more creative like having the policy maker live in the same memcg which > requires some hacks to prevent from self-deadlocks. > So overall this is very attractive but we might need to think about a > better interface. BPF sounds like a potential way to go. I feel the > memcg and the global approaches should be consistent as much as possible > wrt. API. I'm not sure I still see a usecase for this. The whole idea behind memory.high is to give the user the chance to monitor the group's health and then act upon that. You can freeze the group if you must, gather information, kill tasks. This is the way to implement a custom OOM policy. memory.max on the other hand tells the *kernel* when to OOM, with all the implications that a kernel OOM has. Don't configure that when you don't want your tasks killed. ^ permalink raw reply [flat|nested] 92+ messages in thread
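A minimal sketch of the monitoring pattern Johannes describes, assuming a delegated group at a hypothetical path: compare memory.current against memory.high and let user space decide what to do when the group runs hot. A real policy agent would react to notifications rather than poll once a second; this only illustrates the split between kernel enforcement (memory.max) and user-space policy (memory.high plus a monitor).

/*
 * Sketch: user-space "custom OOM policy" built on memory.high, as
 * suggested above.  Paths are assumptions; a real daemon would use
 * notifications instead of a one-second polling loop.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long long read_val(const char *path)
{
	char buf[32];
	long long v = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f)) {
		if (!strncmp(buf, "max", 3))
			v = -1;			/* not configured */
		else
			sscanf(buf, "%lld", &v);
	}
	fclose(f);
	return v;
}

int main(void)
{
	const char *grp = "/sys/fs/cgroup/workload";	/* hypothetical */
	char cur_path[256], high_path[256];

	snprintf(cur_path, sizeof(cur_path), "%s/memory.current", grp);
	snprintf(high_path, sizeof(high_path), "%s/memory.high", grp);

	for (;;) {
		long long cur = read_val(cur_path);
		long long high = read_val(high_path);

		if (cur >= 0 && high >= 0 && cur > high) {
			/*
			 * Policy decision goes here: log, gather state,
			 * freeze the group, or kill a task of our choosing.
			 */
			fprintf(stderr, "over memory.high: %lld > %lld\n",
				cur, high);
		}
		sleep(1);
	}
}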
* Re: [PATCH v2 1/3] cgroup: define controller file conventions 2015-08-04 19:31 ` [PATCH v2 " Tejun Heo 2015-08-05 0:39 ` Kamezawa Hiroyuki @ 2015-08-17 21:34 ` Johannes Weiner 1 sibling, 0 replies; 92+ messages in thread From: Johannes Weiner @ 2015-08-17 21:34 UTC (permalink / raw) To: Tejun Heo; +Cc: mingo, peterz, lizefan, cgroups, linux-kernel, kernel-team On Tue, Aug 04, 2015 at 03:31:01PM -0400, Tejun Heo wrote: > From 6abc8ca19df0078de17dc38340db3002ed489ce7 Mon Sep 17 00:00:00 2001 > From: Tejun Heo <tj@kernel.org> > Date: Tue, 4 Aug 2015 15:20:55 -0400 > > Traditionally, each cgroup controller implemented whatever interface > it wanted leading to interfaces which are widely inconsistent. > Examining the requirements of the controllers readily yield that there > are only a few control schemes shared among all. > > Two major controllers already had to implement new interface for the > unified hierarchy due to significant structural changes. Let's take > the chance to establish common conventions throughout all controllers. > > This patch defines CGROUP_WEIGHT_MIN/DFL/MAX to be used on all weight > based control knobs and documents the conventions that controllers > should follow on the unified hierarchy. Except for io.weight knob, > all existing unified hierarchy knobs are already compliant. A > follow-up patch will update io.weight. > > v2: Added descriptions of min, low and high knobs. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Acked-by: Johannes Weiner <hannes@cmpxchg.org> > Cc: Li Zefan <lizefan@huawei.com> > Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> ^ permalink raw reply [flat|nested] 92+ messages in thread
* [PATCH 2/3] sched: Misc preps for cgroup unified hierarchy interface 2015-08-03 22:41 [PATCHSET sched,cgroup] sched: Implement interface for cgroup unified hierarchy Tejun Heo 2015-08-03 22:41 ` [PATCH 1/3] cgroup: define controller file conventions Tejun Heo @ 2015-08-03 22:41 ` Tejun Heo 2015-08-03 22:41 ` [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy Tejun Heo 2 siblings, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-03 22:41 UTC (permalink / raw) To: mingo, peterz Cc: hannes, lizefan, cgroups, linux-kernel, kernel-team, Tejun Heo Make the following changes in preparation for the cpu controller interface implementation for the unified hierarchy. This patch doesn't cause any functional differences. * s/cpu_stats_show()/cpu_cfs_stats_show()/ * s/cpu_files/cpu_legacy_files/ * Separate out cpuacct_stats_read() from cpuacct_stats_show(). While at it, remove pointless cpuacct_stat_desc[] array. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> --- kernel/sched/core.c | 8 ++++---- kernel/sched/cpuacct.c | 33 +++++++++++++++------------------ 2 files changed, 19 insertions(+), 22 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 78b4bad10..6137037 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8359,7 +8359,7 @@ static int __cfs_schedulable(struct task_group *tg, u64 period, u64 quota) return ret; } -static int cpu_stats_show(struct seq_file *sf, void *v) +static int cpu_cfs_stats_show(struct seq_file *sf, void *v) { struct task_group *tg = css_tg(seq_css(sf)); struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; @@ -8399,7 +8399,7 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css, } #endif /* CONFIG_RT_GROUP_SCHED */ -static struct cftype cpu_files[] = { +static struct cftype cpu_legacy_files[] = { #ifdef CONFIG_FAIR_GROUP_SCHED { .name = "shares", @@ -8420,7 +8420,7 @@ static struct cftype cpu_files[] = { }, { .name = "stat", - .seq_show = cpu_stats_show, + .seq_show = cpu_cfs_stats_show, }, #endif #ifdef CONFIG_RT_GROUP_SCHED @@ -8447,7 +8447,7 @@ struct cgroup_subsys cpu_cgrp_subsys = { .can_attach = cpu_cgroup_can_attach, .attach = cpu_cgroup_attach, .exit = cpu_cgroup_exit, - .legacy_cftypes = cpu_files, + .legacy_cftypes = cpu_legacy_files, .early_init = 1, }; diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index dd7cbb5..42b2dd5 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -177,36 +177,33 @@ static int cpuacct_percpu_seq_show(struct seq_file *m, void *V) return 0; } -static const char * const cpuacct_stat_desc[] = { - [CPUACCT_STAT_USER] = "user", - [CPUACCT_STAT_SYSTEM] = "system", -}; - -static int cpuacct_stats_show(struct seq_file *sf, void *v) +static void cpuacct_stats_read(struct cpuacct *ca, u64 *userp, u64 *sysp) { - struct cpuacct *ca = css_ca(seq_css(sf)); int cpu; - s64 val = 0; + *userp = 0; for_each_online_cpu(cpu) { struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu); - val += kcpustat->cpustat[CPUTIME_USER]; - val += kcpustat->cpustat[CPUTIME_NICE]; + *userp += kcpustat->cpustat[CPUTIME_USER]; + *userp += kcpustat->cpustat[CPUTIME_NICE]; } - val = cputime64_to_clock_t(val); - seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[CPUACCT_STAT_USER], val); - val = 0; + *sysp = 0; for_each_online_cpu(cpu) { struct kernel_cpustat *kcpustat = per_cpu_ptr(ca->cpustat, cpu); - val += 
kcpustat->cpustat[CPUTIME_SYSTEM]; - val += kcpustat->cpustat[CPUTIME_IRQ]; - val += kcpustat->cpustat[CPUTIME_SOFTIRQ]; + *sysp += kcpustat->cpustat[CPUTIME_SYSTEM]; + *sysp += kcpustat->cpustat[CPUTIME_IRQ]; + *sysp += kcpustat->cpustat[CPUTIME_SOFTIRQ]; } +} - val = cputime64_to_clock_t(val); - seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[CPUACCT_STAT_SYSTEM], val); +static int cpuacct_stats_show(struct seq_file *sf, void *v) +{ + cputime64_t user, sys; + cpuacct_stats_read(css_ca(seq_css(sf)), &user, &sys); + seq_printf(sf, "user %lld\n", cputime64_to_clock_t(user)); + seq_printf(sf, "system %lld\n", cputime64_to_clock_t(sys)); return 0; } -- 2.4.3 ^ permalink raw reply related [flat|nested] 92+ messages in thread
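For reference, the legacy cpuacct.stat output this refactoring keeps intact reports its two values in clock ticks (USER_HZ, typically 100 per second, i.e. the centisecond granularity the series complains about). A small hedged user-space reader converting them to seconds might look like the following; the path assumes cpuacct mounted as a legacy hierarchy.

/*
 * Sketch: read the legacy cpuacct.stat ("user N\nsystem N\n", values in
 * USER_HZ ticks) and convert to seconds.  Path is illustrative only.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/cpuacct/test/cpuacct.stat"; /* hypothetical */
	double tick = (double)sysconf(_SC_CLK_TCK);	/* usually 100 */
	char key[16];
	unsigned long long val;
	FILE *f = fopen(path, "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fscanf(f, "%15s %llu", key, &val) == 2)
		printf("%-8s %.2f sec\n", key, val / tick);
	fclose(f);
	return 0;
}

The v2 cpu.stat introduced in the next patch sidesteps the conversion entirely by reporting user_usec and system_usec directly.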
* [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-03 22:41 [PATCHSET sched,cgroup] sched: Implement interface for cgroup unified hierarchy Tejun Heo 2015-08-03 22:41 ` [PATCH 1/3] cgroup: define controller file conventions Tejun Heo 2015-08-03 22:41 ` [PATCH 2/3] sched: Misc preps for cgroup unified hierarchy interface Tejun Heo @ 2015-08-03 22:41 ` Tejun Heo 2015-08-04 9:07 ` Peter Zijlstra 2015-08-04 19:32 ` [PATCH v2 " Tejun Heo 2 siblings, 2 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-03 22:41 UTC (permalink / raw) To: mingo, peterz Cc: hannes, lizefan, cgroups, linux-kernel, kernel-team, Tejun Heo While the cpu controller doesn't have any functional problems, there are a couple interface issues which can be addressed in the v2 interface. * cpuacct being a separate controller. This separation is artificial and rather pointless as demonstrated by most use cases co-mounting the two controllers. It also forces certain information to be accounted twice. * Use of different time units. Writable control knobs use microseconds, some stat fields use nanoseconds while other cpuacct stat fields use centiseconds. * Control knobs which can't be used in the root cgroup still show up in the root. * Control knob names and semantics aren't consistent with other controllers. This patchset implements cpu controller's interface on the unified hierarchy which adheres to the controller file conventions described in Documentation/cgroups/unified-hierarchy.txt. Overall, the following changes are made. * cpuacct is implictly enabled and disabled by cpu and its information is reported through "cpu.stat" which now uses microseconds for all time durations. All time duration fields now have "_usec" appended to them for clarity. While this doesn't solve the double accounting immediately, once majority of users switch to v2, cpu can directly account and report the relevant stats and cpuacct can be disabled on the unified hierarchy. Note that cpuacct.usage_percpu is currently not included in "cpu.stat". If this information is actually called for, it can be added later. * "cpu.shares" is replaced with "cpu.weight" and operates on the standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000). The weight is scaled to scheduler weight so that 100 maps to 1024 and the ratio relationship is preserved - if weight is W and its scaled value is S, W / 100 == S / 1024. While the mapped range is a bit smaller than the orignal scheduler weight range, the dead zones on both sides are relatively small and covers wider range than the nice value mappings. This file doesn't make sense in the root cgroup and isn't create on root. * "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max" which contains both quota and period. * "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by "cpu.rt.max" which contains both runtime and period. 
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> --- Documentation/cgroups/unified-hierarchy.txt | 53 +++++++++ kernel/sched/core.c | 165 ++++++++++++++++++++++++++++ kernel/sched/cpuacct.c | 24 ++++ kernel/sched/cpuacct.h | 5 + 4 files changed, 247 insertions(+) diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt index fc372b8..24c3e89 100644 --- a/Documentation/cgroups/unified-hierarchy.txt +++ b/Documentation/cgroups/unified-hierarchy.txt @@ -30,6 +30,7 @@ CONTENTS 5-4-1. blkio 5-4-2. cpuset 5-4-3. memory + 5-4-4. cpu, cpuacct 6. Planned Changes 6-1. CAP for resource control @@ -532,6 +533,58 @@ may be specified in any order and not all pairs have to be specified. memory.low, memory.high, and memory.max will use the string "max" to indicate and set the highest possible value. +5-4-4. cpu, cpuacct + +- cpuacct is no longer an independent controller. It's implicitly + enabled by cpu and its information is reported in cpu.stat. + +- All time durations, including all stats, are now in microseconds. + +- The interface is updated as follows. + + cpu.stat + + Currently reports the following six stats. All time stats are + in microseconds. + + usage_usec + user_usec + system_usec + nr_periods + nr_throttled + throttled_usec + + cpu.weight + + The weight setting. The weight is between 1 and 10000 and + defaults to 100. + + This file is available only on non-root cgroups. + + cpu.max + + The maximum bandwidth setting. It's in the following format. + + $MAX $PERIOD + + which indicates that the group may consume upto $MAX in each + $PERIOD duration. "max" for $MAX indicates no limit. If only + one number is written, $MAX is updated. + + This file is available only on non-root cgroups. + + cpu.rt.max + + The maximum realtime runtime setting. It's in the following + format. + + $MAX $PERIOD + + which indicates that the group may consume upto $MAX in each + $PERIOD duration. If only one number is written, $MAX is + updated. + + 6. Planned Changes 6-1. CAP for resource control diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6137037..0fb1dd7 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8438,6 +8438,163 @@ static struct cftype cpu_legacy_files[] = { { } /* terminate */ }; +static int cpu_stats_show(struct seq_file *sf, void *v) +{ + cpuacct_cpu_stats_show(sf); + +#ifdef CONFIG_FAIR_GROUP_SCHED + { + struct task_group *tg = css_tg(seq_css(sf)); + struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; + + seq_printf(sf, "nr_periods %d\n" + "nr_throttled %d\n" + "throttled_usec %llu\n", + cfs_b->nr_periods, cfs_b->nr_throttled, + cfs_b->throttled_time / NSEC_PER_USEC); + } +#endif + return 0; +} + +#ifdef CONFIG_FAIR_GROUP_SCHED +static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct task_group *tg = css_tg(css); + u64 weight = scale_load_down(tg->shares); + + return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024); +} + +static int cpu_weight_write_u64(struct cgroup_subsys_state *css, + struct cftype *cftype, u64 weight) +{ + /* + * cgroup weight knobs should use the common MIN, DFL and MAX + * values which are 1, 100 and 10000 respectively. While it loses + * a bit of range on both ends, it maps pretty well onto the shares + * value used by scheduler and the round-trip conversions preserve + * the original value over the entire range. 
+ */ + if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX) + return -ERANGE; + + weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL); + + return sched_group_set_shares(css_tg(css), scale_load(weight)); +} +#endif + +/* caller should put the current value in *@periodp before calling */ +static int __maybe_unused cpu_max_parse(char *buf, u64 *periodp, u64 *quotap) +{ + char tok[21]; /* U64_MAX */ + + if (!sscanf(buf, "%s %llu", tok, periodp)) + return -EINVAL; + + *periodp *= NSEC_PER_USEC; + + if (sscanf(tok, "%llu", quotap)) + *quotap *= NSEC_PER_USEC; + else if (!strcmp(tok, "max")) + *quotap = RUNTIME_INF; + else + return -EINVAL; + + return 0; +} + +static void __maybe_unused cpu_max_print(struct seq_file *sf, long period, + long quota) +{ + if (quota < 0) + seq_puts(sf, "max"); + else + seq_printf(sf, "%ld", quota); + + seq_printf(sf, " %ld\n", period); +} + +#ifdef CONFIG_CFS_BANDWIDTH +static int cpu_max_show(struct seq_file *sf, void *v) +{ + struct task_group *tg = css_tg(seq_css(sf)); + + cpu_max_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg)); + return 0; +} + +static ssize_t cpu_max_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct task_group *tg = css_tg(of_css(of)); + u64 period = tg_get_cfs_period(tg); + u64 quota; + int ret; + + ret = cpu_max_parse(buf, &period, "a); + if (!ret) + ret = tg_set_cfs_bandwidth(tg, period, quota); + return ret ?: nbytes; +} +#endif +#ifdef CONFIG_RT_GROUP_SCHED +static int cpu_rt_max_show(struct seq_file *sf, void *v) +{ + struct task_group *tg = css_tg(seq_css(sf)); + + cpu_max_print(sf, sched_group_rt_period(tg), sched_group_rt_runtime(tg)); + return 0; +} + +static ssize_t cpu_rt_max_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct task_group *tg = css_tg(of_css(of)); + u64 period = sched_group_rt_period(tg); + u64 runtime; + int ret; + + ret = cpu_max_parse(buf, &period, &runtime); + if (!ret) + ret = tg_set_rt_bandwidth(tg, period, runtime); + return ret ?: nbytes; +} +#endif + +static struct cftype cpu_files[] = { + { + .name = "stat", + .seq_show = cpu_stats_show, + }, +#ifdef CONFIG_FAIR_GROUP_SCHED + { + .name = "weight", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_weight_read_u64, + .write_u64 = cpu_weight_write_u64, + }, +#endif +#ifdef CONFIG_CFS_BANDWIDTH + { + .name = "max", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cpu_max_show, + .write = cpu_max_write, + }, +#endif +#ifdef CONFIG_RT_GROUP_SCHED + { + .name = "rt.max", + .seq_show = cpu_rt_max_show, + .write = cpu_rt_max_write, + }, +#endif + { } /* terminate */ +}; + struct cgroup_subsys cpu_cgrp_subsys = { .css_alloc = cpu_cgroup_css_alloc, .css_free = cpu_cgroup_css_free, @@ -8448,7 +8605,15 @@ struct cgroup_subsys cpu_cgrp_subsys = { .attach = cpu_cgroup_attach, .exit = cpu_cgroup_exit, .legacy_cftypes = cpu_legacy_files, + .dfl_cftypes = cpu_files, .early_init = 1, +#ifdef CONFIG_CGROUP_CPUACCT + /* + * cpuacct is enabled together with cpu on the unified hierarchy + * and its stats are reported through "cpu.stat". 
+ */ + .depends_on = 1 << cpuacct_cgrp_id, +#endif }; #endif /* CONFIG_CGROUP_SCHED */ diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 42b2dd5..b4d32a6 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -224,6 +224,30 @@ static struct cftype files[] = { { } /* terminate */ }; +/* used to print cpuacct stats in cpu.stat on the unified hierarchy */ +void cpuacct_cpu_stats_show(struct seq_file *sf) +{ + struct cgroup_subsys_state *css; + u64 usage, user, sys; + + css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys); + + usage = cpuusage_read(css, seq_cft(sf)); + cpuacct_stats_read(css_ca(css), &user, &sys); + + user *= TICK_NSEC; + sys *= TICK_NSEC; + do_div(usage, NSEC_PER_USEC); + do_div(user, NSEC_PER_USEC); + do_div(sys, NSEC_PER_USEC); + + seq_printf(sf, "usage_usec %llu\n" + "user_usec %llu\n" + "system_usec %llu\n", usage, user, sys); + + css_put(css); +} + /* * charge this task's execution time to its accounting group. * diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h index ed60562..44eace9 100644 --- a/kernel/sched/cpuacct.h +++ b/kernel/sched/cpuacct.h @@ -2,6 +2,7 @@ extern void cpuacct_charge(struct task_struct *tsk, u64 cputime); extern void cpuacct_account_field(struct task_struct *p, int index, u64 val); +extern void cpuacct_cpu_stats_show(struct seq_file *sf); #else @@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct *p, int index, u64 val) { } +static inline void cpuacct_cpu_stats_show(struct seq_file *sf) +{ +} + #endif -- 2.4.3 ^ permalink raw reply related [flat|nested] 92+ messages in thread
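Put together, the new files are intended to be driven roughly like the sketch below: enable the controller for the child, set a weight and a bandwidth limit, then read everything back from cpu.stat. This assumes a kernel with the series applied and the unified hierarchy mounted at /sys/fs/cgroup with an existing child group named "test"; the paths and numbers are illustrative.

/*
 * Sketch: exercise the proposed v2 cpu interface from user space.  The
 * mount point, child group name and values are all assumptions.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fprintf(f, "%s\n", val);
	return fclose(f);
}

static void dump(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	while (fgets(buf, sizeof(buf), f))
		fputs(buf, stdout);
	fclose(f);
}

int main(void)
{
	/* make the cpu controller available to children of the root */
	write_str("/sys/fs/cgroup/cgroup.subtree_control", "+cpu");

	/* twice the default weight (CGROUP_WEIGHT_DFL == 100) */
	write_str("/sys/fs/cgroup/test/cpu.weight", "200");

	/* at most 50ms of CPU per 100ms period; writing "max 100000" lifts it */
	write_str("/sys/fs/cgroup/test/cpu.max", "50000 100000");

	/* usage_usec/user_usec/system_usec plus the bandwidth stats */
	dump("/sys/fs/cgroup/test/cpu.stat");
	return 0;
}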
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-03 22:41 ` [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy Tejun Heo @ 2015-08-04 9:07 ` Peter Zijlstra 2015-08-04 15:10 ` Tejun Heo 2015-08-04 19:32 ` [PATCH v2 " Tejun Heo 1 sibling, 1 reply; 92+ messages in thread From: Peter Zijlstra @ 2015-08-04 9:07 UTC (permalink / raw) To: Tejun Heo; +Cc: mingo, hannes, lizefan, cgroups, linux-kernel, kernel-team On Mon, Aug 03, 2015 at 06:41:29PM -0400, Tejun Heo wrote: > While the cpu controller doesn't have any functional problems, there > are a couple interface issues which can be addressed in the v2 > interface. > > * cpuacct being a separate controller. This separation is artificial > and rather pointless as demonstrated by most use cases co-mounting > the two controllers. It also forces certain information to be > accounted twice. > > * Use of different time units. Writable control knobs use > microseconds, some stat fields use nanoseconds while other cpuacct > stat fields use centiseconds. > > * Control knobs which can't be used in the root cgroup still show up > in the root. > > * Control knob names and semantics aren't consistent with other > controllers. What about the unified hierarchy stuff cannot deal with per-task controllers? _That_ was the biggest problem from what I can remember, and I see no proposed resolution for that here. > This patchset implements cpu controller's interface on the unified > hierarchy which adheres to the controller file conventions described > in Documentation/cgroups/unified-hierarchy.txt. Overall, the > following changes are made. > > * cpuacct is implictly enabled and disabled by cpu and its information > is reported through "cpu.stat" which now uses microseconds for all > time durations. All time duration fields now have "_usec" appended > to them for clarity. While this doesn't solve the double accounting > immediately, once majority of users switch to v2, cpu can directly > account and report the relevant stats and cpuacct can be disabled on > the unified hierarchy. > > Note that cpuacct.usage_percpu is currently not included in > "cpu.stat". If this information is actually called for, it can be > added later. Since you're rev'ing the interface, can't we simply kill the old cpuacct and implement the missing pieces in cpu directly ? > * "cpu.shares" is replaced with "cpu.weight" and operates on the > standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000). > The weight is scaled to scheduler weight so that 100 maps to 1024 > and the ratio relationship is preserved - if weight is W and its > scaled value is S, W / 100 == S / 1024. While the mapped range is a > bit smaller than the orignal scheduler weight range, the dead zones > on both sides are relatively small and covers wider range than the > nice value mappings. This file doesn't make sense in the root > cgroup and isn't create on root. Not too thrilled about this, but if people can live with the reduced resolution then I suppose we can do. > * "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max" > which contains both quota and period. This is indeed a maximum limit, however > * "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by > "cpu.rt.max" which contains both runtime and period. the RT thing is conceptually more of a minimum guarantee, than a maximum, even though the current implementation is both, there are plans to allow (controlled) relaxation of the maximum part. 
Also, if you're going to rev the interface, there's more changes we should make. I'll have to go dig them out. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-04 9:07 ` Peter Zijlstra @ 2015-08-04 15:10 ` Tejun Heo 2015-08-05 9:10 ` Peter Zijlstra 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-04 15:10 UTC (permalink / raw) To: Peter Zijlstra; +Cc: mingo, hannes, lizefan, cgroups, linux-kernel, kernel-team Hello, Peter. On Tue, Aug 04, 2015 at 11:07:11AM +0200, Peter Zijlstra wrote: > What about the unified hierarchy stuff cannot deal with per-task > controllers? > > _That_ was the biggest problem from what I can remember, and I see no > proposed resolution for that here. I've been thinking about it and I'm now convinced that cgroups just is the wrong interface to require each application to be programming against. I wrote this in the CAT thread too but cgroups may be an okay management / administration interface but is a horrible programming interface to be used by individual applications. For things which don't require hierarchy, the obvious thing to do is implementing a usual syscall-like interface be it a separate syscall, an prctl command, an ioctl or whatever. For things which require building a hierarchy of member threads, the right thing to do is making it a part of the usual process hierarchy - this is *the* hierarchy that applications are familiar with and have the facilities to deal with, so we can, for example, add a clone or unshare flag which puts the calling threads in a new child group and then let that use the fore-mentioned syscall-like interface to configure whatever it wants to configure. In the long term, this is *way* better than letting individual applications fumble with cgroup hierarchy delegation and pseudo filesystem access. If hierarchical weight and/or bandwidth limiting for thread hierarchy is absolutely necessary, doing this shouldn't be too difficult and I suspect it wouldn't be all that different from autogroup. > > * cpuacct is implictly enabled and disabled by cpu and its information > > is reported through "cpu.stat" which now uses microseconds for all > > time durations. All time duration fields now have "_usec" appended > > to them for clarity. While this doesn't solve the double accounting > > immediately, once majority of users switch to v2, cpu can directly > > account and report the relevant stats and cpuacct can be disabled on > > the unified hierarchy. > > > > Note that cpuacct.usage_percpu is currently not included in > > "cpu.stat". If this information is actually called for, it can be > > added later. > > Since you're rev'ing the interface, can't we simply kill the old cpuacct > and implement the missing pieces in cpu directly ? Yeah, that's the plan. For the transitional period however, we'd have a lot more usages where cpuacct is mounted in a legacy hierarchy so I didn't want to incur the overhead of duplicate accounting for those cases and the dependency mechanism is already there making it trivial. > > * "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max" > > which contains both quota and period. > > This is indeed a maximum limit, however > > > * "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by > > "cpu.rt.max" which contains both runtime and period. > > the RT thing is conceptually more of a minimum guarantee, than a > maximum, even though the current implementation is both, there are plans > to allow (controlled) relaxation of the maximum part. Ah, I see. Yeah, then it should be cpu.rt.min. I'll just remove the file until the relaxation part is determined. 
> Also, if you're going to rev the interface, there's more changes we > should make. I'll have to go dig them out. Great, please let me know what you have on mind. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-04 15:10 ` Tejun Heo @ 2015-08-05 9:10 ` Peter Zijlstra 2015-08-05 14:31 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Peter Zijlstra @ 2015-08-05 9:10 UTC (permalink / raw) To: Tejun Heo Cc: mingo, hannes, lizefan, cgroups, linux-kernel, kernel-team, Paul Turner, Linus Torvalds, Andrew Morton On Tue, Aug 04, 2015 at 11:10:17AM -0400, Tejun Heo wrote: > Hello, Peter. > > On Tue, Aug 04, 2015 at 11:07:11AM +0200, Peter Zijlstra wrote: > > What about the unified hierarchy stuff cannot deal with per-task > > controllers? > > > > _That_ was the biggest problem from what I can remember, and I see no > > proposed resolution for that here. > > I've been thinking about it and I'm now convinced that cgroups just is > the wrong interface to require each application to be programming > against. But people are doing it. So you must give them something. You cannot just tell them to go away. So where are the people doing this in this discussion? Or are you one-sidedly forcing things? IIRC Google was doing this. The whole libvirt trainwreck also does this (the programming against cgroups, not the per task thing afaik). You also cannot mandate system-disease, not everybody will want to run that monster. From what I understood last time, Google has no interest what so ever of using it. > I wrote this in the CAT thread too but cgroups may be an > okay management / administration interface but is a horrible > programming interface to be used by individual applications. Yeah, I need to catch up on that CAT thread, but the reality is, people use it as a programming interface, whether you like it or not. > For things which don't require hierarchy, the obvious thing to do is > implementing a usual syscall-like interface be it a separate syscall, > an prctl command, an ioctl or whatever. And then you get /proc extensions to observe them, then people make those /proc extensions writable and before you know it you've got an equal or bigger mess back than you started out with :-( > For things which require > building a hierarchy of member threads, the right thing to do is > making it a part of the usual process hierarchy - this is *the* > hierarchy that applications are familiar with and have the facilities > to deal with, so we can, for example, add a clone or unshare flag > which puts the calling threads in a new child group and then let that > use the fore-mentioned syscall-like interface to configure whatever it > wants to configure. And then you get to add support to cgroups to migrate hierarchies, is that complexity you're waiting for? Not to mention that its an unwieldy interface because then you get spawn spawning threads etc.. Seeing how its impossible for the main thread to create N tasks in one subgroup and another M tasks in another subgroup. Instead they get to spawn a thread A, with which they then need to communicate to spawn a further N tasks, then spawn a thread B, and again communicate for another M tasks. That's a rather awkward change to how people usually spawn threads. Also, what to do when a thread changes profile? I can imagine a situation where a task accepts a connection and depending on the kind of request it gets, gets placed into a certain sub-group. But there's no migration facility, so you get to go hand the work around, which is expensive. If there would be a migration facility, you've just lost naming, so how are you going to denote the subgroups? 
> In the long term, this is *way* better than > letting individual applications fumble with cgroup hierarchy > delegation and pseudo filesystem access. You're worried about the intersection between what a task does and what the administrator does, and that's a valid worry. But I'm really not convinced this is going to make it better. We already have relative file ops (openat(), mkdirat(), unlinkat() etc..) can't we make sure they do the right thing in the face of a process (hierarchy) getting migrated by the administrator. That way, things at least _can_ work right, and I think being able to do the right thing trumps not being able to make a mess -- people are people, they'll always make a mess. > If hierarchical weight and/or bandwidth limiting for thread hierarchy > is absolutely necessary, doing this shouldn't be too difficult and I > suspect it wouldn't be all that different from autogroup. Autogroups are a bit icky and have the 'advantage' of not intersecting with regular cgroups (much). The above has intricate intersection with the cgroup stuff. As said, your migrate process becomes a move hierarchy. You further get more 'hidden' cgroups. /proc files that report what cgroup a task is in will report a cgroup that's not actually present in the filesystem (autogroups already does this, it confuses people). And as stated you take away a lot of things that are now possible. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-05 9:10 ` Peter Zijlstra @ 2015-08-05 14:31 ` Tejun Heo 2015-08-17 20:35 ` Tejun Heo [not found] ` <CAPM31RJTf0v=2v90kN6-HM9xUGab_k++upO0Ym=irmfO9+BbFw@mail.gmail.com> 0 siblings, 2 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-05 14:31 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, hannes, lizefan, cgroups, linux-kernel, kernel-team, Paul Turner, Linus Torvalds, Andrew Morton Hello, On Wed, Aug 05, 2015 at 11:10:36AM +0200, Peter Zijlstra wrote: > > I've been thinking about it and I'm now convinced that cgroups just is > > the wrong interface to require each application to be programming > > against. > > But people are doing it. So you must give them something. You cannot > just tell them to go away. Sure, more on specifics later, but, first of all, the transition to v2 is a gradual process. The new and old hierarchies can co-exist, so nothing forces abrupt transitions. Also, we do want to start as restricted as possible and then widen it gradually as necessary. > So where are the people doing this in this discussion? Or are you > one-sidedly forcing things? IIRC Google was doing this. We've been having those discussions for years in person and on the cgroup mailing list. IIRC, the google case was for blkcg where they have an IO proxy process which wanna issue IOs as different cgroups depending on who's the original issuer. They created multiple threads, put them in different cgroups and bounce the IOs to the matching one; however, this is already pretty silly as they have to bounce IOs to different threads. What makes a lot more sense here is the ability to tag an IO as coming from a specific cgroup (or a process's cgroup) and there was discussion of using an extra field in aio request to indicate this, which is an a lot better solution for the problem, can also express different IO priority and pretty easy to implement. > The whole libvirt trainwreck also does this (the programming against > cgroups, not the per task thing afaik). AFAIK, libvirt is doing multiple backends anyway and as long as the delegation rules are clear, libvirt managing its own subhierarchy is not a problem. It's an administration software stack which requires fairly close integration with the userland part of operating system. > You also cannot mandate system-disease, not everybody will want to run > that monster. From what I understood last time, Google has no interest > what so ever of using it. But what would require tight coupling of individual applications and something like systemd is the kernel failing to set up a reasonable boundary between management and application interfaces. If the kernel provides a useable API for individual applications to use, they'll program against it and the management part can be whatever. If we fail to do that, individual applications will have to talk to external agent to coordinate access to management interface and that's what'll end up creating hard dependency on specific system agents from applications like apache or mysql or whatever. We really don't want that. The kernel *NEEDS* to clearly distinguish those two to prevent that from happening. > > I wrote this in the CAT thread too but cgroups may be an > > okay management / administration interface but is a horrible > > programming interface to be used by individual applications. > > Yeah, I need to catch up on that CAT thread, but the reality is, people > use it as a programming interface, whether you like it or not. 
And that's one of the major fuck ups on cgroup's part that must be rectified. Look at the interface being proposed there. It's exposing direct hardware details w/o much abstraction which is fine for a system management interface but at the same time it's intended to be exposed to individual applications. This lack of distinction makes people skip the attention that they should be paying when they're designing interface exposed to individual programs. Worse, this makes these things fly under the review scrutiny that public API accessible to applications usually receives. Yet, that's what these things end up to be. This just has to stop. cgroups can't continue to be this ghetto shortcut to implementing half-assed APIs. > > For things which don't require hierarchy, the obvious thing to do is > > implementing a usual syscall-like interface be it a separate syscall, > > an prctl command, an ioctl or whatever. > > And then you get /proc extensions to observe them, then people make > those /proc extensions writable and before you know it you've got an > equal or bigger mess back than you started out with :-( What we should be doing is pushing them into the same arena as any other publicly accessible API. I don't think there can be a shortcut to this. > > For things which require > > building a hierarchy of member threads, the right thing to do is > > making it a part of the usual process hierarchy - this is *the* > > hierarchy that applications are familiar with and have the facilities > > to deal with, so we can, for example, add a clone or unshare flag > > which puts the calling threads in a new child group and then let that > > use the fore-mentioned syscall-like interface to configure whatever it > > wants to configure. > > And then you get to add support to cgroups to migrate hierarchies, is > that complexity you're waiting for? Absolutely, if it comes to that, that's what we should do. The only other option is spilling and getting locked into half-baked interface to applications which not only harm userland but also kernel. > Not to mention that its an unwieldy interface because then you get spawn > spawning threads etc.. Seeing how its impossible for the main thread to > create N tasks in one subgroup and another M tasks in another subgroup. > > Instead they get to spawn a thread A, with which they then need to > communicate to spawn a further N tasks, then spawn a thread B, and again > communicate for another M tasks. > > That's a rather awkward change to how people usually spawn threads. It is within the usual purview of how userland deals with hierarchies of processes / threads and I don't think it's necessarily bad and more importantly I don't think the use case or the perceived awkwardness justifies introducing a wholely new mechanism. > Also, what to do when a thread changes profile? I can imagine a > situation where a task accepts a connection and depending on the kind of > request it gets, gets placed into a certain sub-group. Migration is a very expensive operation. The obvious thing to do for such cases is having pools of workers for different profiles. Also, as mentioned before, for more specific cases like IO, it makes a lot more sense to override things per operation rather than moving threads around. > But there's no migration facility, so you get to go hand the work > around, which is expensive. That's a lot cheaper than migrating. > If there would be a migration facility, you've just lost naming, so how > are you going to denote the subgroups? 
I don't think we want migration in sub-process hierarchy but in the off chance we do the naming can follow the same pid/program group/session id scheme, which, again, is a lot easier to deal with from applications. > > In the long term, this is *way* better than > > letting individual applications fumble with cgroup hierarchy > > delegation and pseudo filesystem access. > > You're worried about the intersection between what a task does and what > the administrator does, and that's a valid worry. But I'm really not > convinced this is going to make it better. > > We already have relative file ops (openat(), mkdirat(), unlinkat() > etc..) can't we make sure they do the right thing in the face of a > process (hierarchy) getting migrated by the administrator. But those are relative to the current directory per operation and there's no way to define a transaction across multiple file operations. There's no way to prevent a process from being migrated inbetween openat() and subsequent write(). > That way, things at least _can_ work right, and I think being able to do > the right thing trumps not being able to make a mess -- people are > people, they'll always make a mess. It can't, at least not in the usual manner that file system operations are defined. This is an interface which requires central coordination (even for delegation) and a horrible one to expose to individual applications. > > If hierarchical weight and/or bandwidth limiting for thread hierarchy > > is absolutely necessary, doing this shouldn't be too difficult and I > > suspect it wouldn't be all that different from autogroup. > > Autogroups are a bit icky and have the 'advantage' of not intersecting > with regular cgroups (much). The above has intricate intersection with > the cgroup stuff. > > As said, your migrate process becomes a move hierarchy. You further get > more 'hidden' cgroups. /proc files that report what cgroup a task is in > will report a cgroup that's not actually present in the filesystem > (autogroups already does this, it confuses people). And as stated you > take away a lot of things that are now possible. I don't think it's a lot that per-process is gonna take away. Per-thread use cases are pretty niche to begin with and most can and should be implemented better using a more fitting mechanism. As for having to deal with more complexity in cgroup core, that's fine. If it comes to that, we'll have to bite the bullet and do it. Sure, we want to be simpler but not at the cost of messing up userland API and please note that what we lost with cgroups is this tension. This tension between the difficulty and complexity of implementing something which can be used by applications and the necessity or desirability of the proposed use cases is crucial in steering kernel development and the APIs it exposes. Abusing cgroups like we've been doing bypasses that tension and we of course end up locked into an extremely crappy interfaces and mechanisms which could never be justified in the first place. This is about time we stopped this disaster train. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
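To make the openat()/write() race above concrete: nothing stops the administrator from migrating the process between the lookup and the write, so the new value lands in a group the process may no longer be in. The paths and the cpu.weight knob name below are illustrative, loosely following the conventions patch 1 proposes; error handling is omitted.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* The application believes it lives under .../app and wants to
         * reconfigure its own "workers" subgroup. */
        int dfd = open("/sys/fs/cgroup/app/workers", O_RDONLY | O_DIRECTORY);
        int fd = openat(dfd, "cpu.weight", O_WRONLY);

        /*
         * <-- an administrator may migrate the whole process to a
         *     different cgroup at this point; there is no way for the
         *     application to exclude that -->
         */

        dprintf(fd, "200\n");   /* still updates .../app/workers, which
                                 * may no longer have anything to do
                                 * with this process */
        close(fd);
        close(dfd);
        return 0;
}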
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-05 14:31 ` Tejun Heo @ 2015-08-17 20:35 ` Tejun Heo [not found] ` <CAPM31RJTf0v=2v90kN6-HM9xUGab_k++upO0Ym=irmfO9+BbFw@mail.gmail.com> 1 sibling, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-17 20:35 UTC (permalink / raw) To: Peter Zijlstra Cc: mingo, hannes, lizefan, cgroups, linux-kernel, kernel-team, Paul Turner, Linus Torvalds, Andrew Morton Hello, Peter. Do we have an agreement on the sched changes? Thanks a lot. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
[parent not found: <CAPM31RJTf0v=2v90kN6-HM9xUGab_k++upO0Ym=irmfO9+BbFw@mail.gmail.com>]
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy [not found] ` <CAPM31RJTf0v=2v90kN6-HM9xUGab_k++upO0Ym=irmfO9+BbFw@mail.gmail.com> @ 2015-08-18 4:03 ` Paul Turner 2015-08-18 20:31 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-18 4:03 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Apologies for the repeat. Gmail ate its plain text setting for some reason. Shame bells. On Mon, Aug 17, 2015 at 9:02 PM, Paul Turner <pjt@google.com> wrote: > > > On Wed, Aug 5, 2015 at 7:31 AM, Tejun Heo <tj@kernel.org> wrote: >> >> Hello, >> >> On Wed, Aug 05, 2015 at 11:10:36AM +0200, Peter Zijlstra wrote: >> > > I've been thinking about it and I'm now convinced that cgroups just is >> > > the wrong interface to require each application to be programming >> > > against. >> > >> > But people are doing it. So you must give them something. You cannot >> > just tell them to go away. >> >> Sure, more on specifics later, but, first of all, the transition to v2 >> is a gradual process. The new and old hierarchies can co-exist, so >> nothing forces abrupt transitions. Also, we do want to start as >> restricted as possible and then widen it gradually as necessary. >> >> > So where are the people doing this in this discussion? Or are you >> > one-sidedly forcing things? IIRC Google was doing this. >> >> We've been having those discussions for years in person and on the >> cgroup mailing list. IIRC, the google case was for blkcg where they >> have an IO proxy process which wanna issue IOs as different cgroups >> depending on who's the original issuer. They created multiple >> threads, put them in different cgroups and bounce the IOs to the >> matching one; however, this is already pretty silly as they have to >> bounce IOs to different threads. What makes a lot more sense here is >> the ability to tag an IO as coming from a specific cgroup (or a >> process's cgroup) and there was discussion of using an extra field in >> aio request to indicate this, which is an a lot better solution for >> the problem, can also express different IO priority and pretty easy to >> implement. >> > > So we have two major types of use that are relevant to this interface: > > 1) Proxy agents. When a control systems want to perform work on behalf of a > container, they will sometimes move the acting thread into the relevant > control groups so that it can be accounted on that container's behalf. > [This is more relevant for non-persistent resources such as CPU time or I/O > priorities than charges that will outlive the work such as memory > allocations.] > > I agree (1) is at best a bit of a hack and can be worked around on the type > of time-frame these interfaces move at. > > 2) Control within an address-space. For subsystems with fungible resources, > e.g. CPU, it can be useful for an address space to partition its own > threads. Losing the capability to do this against the CPU controller would > be a large set-back for instance. Occasionally, it is useful to share these > groupings between address spaces when processes are cooperative, but this is > less of a requirement. > > This is important to us. > > >> > The whole libvirt trainwreck also does this (the programming against >> > cgroups, not the per task thing afaik). 
>> >> AFAIK, libvirt is doing multiple backends anyway and as long as the >> delegation rules are clear, libvirt managing its own subhierarchy is >> not a problem. It's an administration software stack which requires >> fairly close integration with the userland part of operating system. >> >> > You also cannot mandate system-disease, not everybody will want to run >> > that monster. From what I understood last time, Google has no interest >> > what so ever of using it. >> >> But what would require tight coupling of individual applications and >> something like systemd is the kernel failing to set up a reasonable >> boundary between management and application interfaces. If the kernel >> provides a useable API for individual applications to use, they'll >> program against it and the management part can be whatever. If we >> fail to do that, individual applications will have to talk to external >> agent to coordinate access to management interface > > > It's notable here that for a managed system, the agent coordinating access > *must* be external > >> >> and that's what'll >> end up creating hard dependency on specific system agents from >> applications like apache or mysql or whatever. We really don't want >> that. The kernel *NEEDS* to clearly distinguish those two to prevent >> that from happening. >> >> > > I wrote this in the CAT thread too but cgroups may be an >> > > okay management / administration interface but is a horrible >> > > programming interface to be used by individual applications. >> > >> > Yeah, I need to catch up on that CAT thread, but the reality is, people >> > use it as a programming interface, whether you like it or not. >> >> And that's one of the major fuck ups on cgroup's part that must be >> rectified. Look at the interface being proposed there. It's exposing >> direct hardware details w/o much abstraction which is fine for a >> system management interface but at the same time it's intended to be >> exposed to individual applications. > > > FWIW this is something we've had no significant problems managing with > separate mount mounts and file system protections. Yes, there are some > potential warts around atomicity; but we've not found them too onerous. > > What I don't quite follow here is the assumption that CAT should would be > necessarily exposed to individual applications? What's wrong with subsystems > that are primarily intended only for system management agents, we already > have several of these. > > >> >> This lack of distinction makes >> people skip the attention that they should be paying when they're >> designing interface exposed to individual programs. Worse, this makes >> these things fly under the review scrutiny that public API accessible >> to applications usually receives. Yet, that's what these things end >> up to be. This just has to stop. cgroups can't continue to be this >> ghetto shortcut to implementing half-assed APIs. > > > I certainly don't disagree on this point :). But as above, I don't quite > follow why an API being in cgroups must mean it's accessible to an > application controlled by that group. This has certainly not been a > requirement for our use. > >> >> >> > > For things which don't require hierarchy, the obvious thing to do is >> > > implementing a usual syscall-like interface be it a separate syscall, >> > > an prctl command, an ioctl or whatever. 
>> > >> > And then you get /proc extensions to observe them, then people make >> > those /proc extensions writable and before you know it you've got an >> > equal or bigger mess back than you started out with :-( >> >> What we should be doing is pushing them into the same arena as any >> other publicly accessible API. I don't think there can be a shortcut >> to this. >> > > Are you explicitly opposed to non-hierarchical partitions, however? Cpuset > is [typically] an example of this, where the interface wants to control > unified properties across a set of processes. Without necessarily being > usefully hierarchical. (This is just to understand your core position, I'm > not proposing cpuset should shape *anything*.) > >> >> > > For things which require >> > > building a hierarchy of member threads, the right thing to do is >> > > making it a part of the usual process hierarchy - this is *the* >> > > hierarchy that applications are familiar with and have the facilities >> > > to deal with, so we can, for example, add a clone or unshare flag >> > > which puts the calling threads in a new child group and then let that >> > > use the fore-mentioned syscall-like interface to configure whatever it >> > > wants to configure. >> > >> > And then you get to add support to cgroups to migrate hierarchies, is >> > that complexity you're waiting for? >> >> Absolutely, if it comes to that, that's what we should do. The only >> other option is spilling and getting locked into half-baked interface >> to applications which not only harm userland but also kernel. >> >> > Not to mention that its an unwieldy interface because then you get spawn >> > spawning threads etc.. Seeing how its impossible for the main thread to >> > create N tasks in one subgroup and another M tasks in another subgroup. >> > >> > Instead they get to spawn a thread A, with which they then need to >> > communicate to spawn a further N tasks, then spawn a thread B, and again >> > communicate for another M tasks. >> > >> > That's a rather awkward change to how people usually spawn threads. >> >> It is within the usual purview of how userland deals with hierarchies >> of processes / threads and I don't think it's necessarily bad and more >> importantly I don't think the use case or the perceived awkwardness >> justifies introducing a wholely new mechanism. >> >> > Also, what to do when a thread changes profile? I can imagine a >> > situation where a task accepts a connection and depending on the kind of >> > request it gets, gets placed into a certain sub-group. >> >> Migration is a very expensive operation. The obvious thing to do for >> such cases is having pools of workers for different profiles. Also, >> as mentioned before, for more specific cases like IO, it makes a lot >> more sense to override things per operation rather than moving threads >> around. >> >> > But there's no migration facility, so you get to go hand the work >> > around, which is expensive. >> >> That's a lot cheaper than migrating. >> >> > If there would be a migration facility, you've just lost naming, so how >> > are you going to denote the subgroups? >> >> I don't think we want migration in sub-process hierarchy but in the >> off chance we do the naming can follow the same pid/program >> group/session id scheme, which, again, is a lot easier to deal with >> from applications. > > > I don't have many objections with hand-off versus migration above, however, > I think that this is a big drawback. 
Threads are expensive to create and > are often cached rather than released. While migration may be expensive, > creating a more thread is more so. The important to reconfigure a thread's > personality at run-time is important. > >> >> > > In the long term, this is *way* better than >> > > letting individual applications fumble with cgroup hierarchy >> > > delegation and pseudo filesystem access. >> > >> > You're worried about the intersection between what a task does and what >> > the administrator does, and that's a valid worry. But I'm really not >> > convinced this is going to make it better. >> > >> > We already have relative file ops (openat(), mkdirat(), unlinkat() >> > etc..) can't we make sure they do the right thing in the face of a >> > process (hierarchy) getting migrated by the administrator. >> >> But those are relative to the current directory per operation and >> there's no way to define a transaction across multiple file >> operations. There's no way to prevent a process from being migrated >> inbetween openat() and subsequent write(). > > > A forwarding /proc/thread_self/cgroup accessor, or similar, would be another > way to address some of these issues. > >> >> >> > That way, things at least _can_ work right, and I think being able to do >> > the right thing trumps not being able to make a mess -- people are >> > people, they'll always make a mess. >> >> It can't, at least not in the usual manner that file system operations >> are defined. This is an interface which requires central coordination >> (even for delegation) and a horrible one to expose to individual >> applications. >> >> > > If hierarchical weight and/or bandwidth limiting for thread hierarchy >> > > is absolutely necessary, doing this shouldn't be too difficult and I >> > > suspect it wouldn't be all that different from autogroup. >> > >> > Autogroups are a bit icky and have the 'advantage' of not intersecting >> > with regular cgroups (much). The above has intricate intersection with >> > the cgroup stuff. >> > >> > As said, your migrate process becomes a move hierarchy. You further get >> > more 'hidden' cgroups. /proc files that report what cgroup a task is in >> > will report a cgroup that's not actually present in the filesystem >> > (autogroups already does this, it confuses people). And as stated you >> > take away a lot of things that are now possible. >> >> I don't think it's a lot that per-process is gonna take away. >> Per-thread use cases are pretty niche to begin with and most can and >> should be implemented better using a more fitting mechanism. As for >> having to deal with more complexity in cgroup core, that's fine. If >> it comes to that, we'll have to bite the bullet and do it. Sure, we >> want to be simpler but not at the cost of messing up userland API and >> please note that what we lost with cgroups is this tension. > > > I don't quite agree here. Losing per-thread control within the cpu > controller is likely going to mean that much of it ends up being > reimplemented as some duplicate-in-appearance interface that gets us back to > where we are today. I recognize that these controllers (cpu, cpuacct) are > square pegs in that per-process makes sense for most other sub-systems; but > unfortunately, their needs and use-cases are real / dependent on their > present form. 
> >> >> This tension between the difficulty and complexity of implementing >> something which can be used by applications and the necessity or >> desirability of the proposed use cases is crucial in steering kernel >> development and the APIs it exposes. Abusing cgroups like we've been >> doing bypasses that tension and we of course end up locked into an >> extremely crappy interfaces and mechanisms which could never be >> justified in the first place. This is about time we stopped this >> disaster train. >> >> Thanks. >> >> -- >> tejun > > ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-18 4:03 ` Paul Turner @ 2015-08-18 20:31 ` Tejun Heo 2015-08-18 23:39 ` Kamezawa Hiroyuki ` (2 more replies) 0 siblings, 3 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-18 20:31 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Paul. On Mon, Aug 17, 2015 at 09:03:30PM -0700, Paul Turner wrote: > > 2) Control within an address-space. For subsystems with fungible resources, > > e.g. CPU, it can be useful for an address space to partition its own > > threads. Losing the capability to do this against the CPU controller would > > be a large set-back for instance. Occasionally, it is useful to share these > > groupings between address spaces when processes are cooperative, but this is > > less of a requirement. > > > > This is important to us. Sure, let's build a proper interface for that. Do you actually need sub-hierarchy inside a process? Can you describe your use case in detail and why having hierarchical CPU cycle distribution is essential for your use case? > >> And that's one of the major fuck ups on cgroup's part that must be > >> rectified. Look at the interface being proposed there. It's exposing > >> direct hardware details w/o much abstraction which is fine for a > >> system management interface but at the same time it's intended to be > >> exposed to individual applications. > > > > FWIW this is something we've had no significant problems managing with > > separate mount mounts and file system protections. Yes, there are some > > potential warts around atomicity; but we've not found them too onerous. You guys control the whole stack. Of course, you can get away with an interface which are pretty messed up in terms of layering and isolation; however, generic kernel interface cannot be designed according to that standard. > > What I don't quite follow here is the assumption that CAT should would be > > necessarily exposed to individual applications? What's wrong with subsystems > > that are primarily intended only for system management agents, we already > > have several of these. Why would you assume that threads of a process wouldn't want to configure it ever? How is this different from CPU affinity? > >> This lack of distinction makes > >> people skip the attention that they should be paying when they're > >> designing interface exposed to individual programs. Worse, this makes > >> these things fly under the review scrutiny that public API accessible > >> to applications usually receives. Yet, that's what these things end > >> up to be. This just has to stop. cgroups can't continue to be this > >> ghetto shortcut to implementing half-assed APIs. > > > > I certainly don't disagree on this point :). But as above, I don't quite > > follow why an API being in cgroups must mean it's accessible to an > > application controlled by that group. This has certainly not been a > > requirement for our use. I don't follow what you're trying to way with the above paragraph. Are you still talking about CAT? If so, that use case isn't the only one. I'm pretty sure there are people who would want to configure cache allocation at thread level. > >> What we should be doing is pushing them into the same arena as any > >> other publicly accessible API. I don't think there can be a shortcut > >> to this. > > > > Are you explicitly opposed to non-hierarchical partitions, however? 
Cpuset > > is [typically] an example of this, where the interface wants to control > > unified properties across a set of processes. Without necessarily being > > usefully hierarchical. (This is just to understand your core position, I'm > > not proposing cpuset should shape *anything*.) I'm having trouble following what you're trying to say. FWIW, cpuset is fully hierarchical. > >> I don't think we want migration in sub-process hierarchy but in the > >> off chance we do the naming can follow the same pid/program > >> group/session id scheme, which, again, is a lot easier to deal with > >> from applications. > > > > I don't have many objections with hand-off versus migration above, however, > > I think that this is a big drawback. Threads are expensive to create and > > are often cached rather than released. While migration may be expensive, > > creating a more thread is more so. The important to reconfigure a thread's > > personality at run-time is important. The core problem here is picking the hot path. If cgroups as a whole doesn't pick a position here, controllers have to assume that migration might not be a very cold path which naturally leads to overall designs and synchronization schemes which concede hot path performance to accomodate migration. We simply can't afford to do that - we end up losing way more in way hotter paths for something which may be marginally useful in some corner cases. So, this is a trade-off we're consciously making. If there are common-enough use cases which require jumping across different cgroup domains, we'll try to figure out a way to accomodate those but by default migration is a very cold and expensive path. > >> But those are relative to the current directory per operation and > >> there's no way to define a transaction across multiple file > >> operations. There's no way to prevent a process from being migrated > >> inbetween openat() and subsequent write(). > > > > A forwarding /proc/thread_self/cgroup accessor, or similar, would be another > > way to address some of these issues. That sounds horrible to me. What if the process wants to do RMW a config? What if the permissions are different after an intervening migration? What if the sub-hierarchy no longer exists or has been replaced by a hierarchy with the same topology but actualy is a different one? > > I don't quite agree here. Losing per-thread control within the cpu > > controller is likely going to mean that much of it ends up being > > reimplemented as some duplicate-in-appearance interface that gets us back to > > where we are today. I recognize that these controllers (cpu, cpuacct) are > > square pegs in that per-process makes sense for most other sub-systems; but > > unfortunately, their needs and use-cases are real / dependent on their > > present form. Let's build an API which actually looks and behaves like an API which is properly isolated from what external agents may do to the process. I can't see how that would be "back to where we are today". All of those are pretty critical attributes for a public kernel API and utterly broken right now. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-18 20:31 ` Tejun Heo @ 2015-08-18 23:39 ` Kamezawa Hiroyuki 2015-08-19 16:23 ` Tejun Heo 2015-08-19 3:23 ` Mike Galbraith 2015-08-21 19:26 ` Paul Turner 2 siblings, 1 reply; 92+ messages in thread From: Kamezawa Hiroyuki @ 2015-08-18 23:39 UTC (permalink / raw) To: Tejun Heo, Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On 2015/08/19 5:31, Tejun Heo wrote: > Hello, Paul. > > On Mon, Aug 17, 2015 at 09:03:30PM -0700, Paul Turner wrote: >>> 2) Control within an address-space. For subsystems with fungible resources, >>> e.g. CPU, it can be useful for an address space to partition its own >>> threads. Losing the capability to do this against the CPU controller would >>> be a large set-back for instance. Occasionally, it is useful to share these >>> groupings between address spaces when processes are cooperative, but this is >>> less of a requirement. >>> >>> This is important to us. > > Sure, let's build a proper interface for that. Do you actually need > sub-hierarchy inside a process? Can you describe your use case in > detail and why having hierarchical CPU cycle distribution is essential > for your use case? An actual per-thread use case among our customers is qemu-kvm + cpuset. Customers pin each vcpu and qemu-kvm's worker threads to cpus. For example, pinning 4 vcpus to cpus 2-6 and pinning the qemu main thread and others (vhost) to cpus 0-1. This is actual kvm tuning done for our customers to guarantee performance. In another case, the cpu cgroup's throttling feature is used per vcpu for vm cpu sizing. Thanks, -Kame ^ permalink raw reply [flat|nested] 92+ messages in thread
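For reference, the arrangement Kame describes maps onto the current v1 cpuset interface roughly as follows; the mount point, group names, cpu ranges and tids are all illustrative, and error handling is omitted.

#include <stdio.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f) {
                fputs(val, f);
                fclose(f);
        }
}

int main(void)
{
        /* parent group for the whole VM, then one child for the vcpus
         * and one for the main thread / vhost helpers */
        mkdir("/sys/fs/cgroup/cpuset/qemu", 0755);
        write_str("/sys/fs/cgroup/cpuset/qemu/cpuset.cpus", "0-6");
        write_str("/sys/fs/cgroup/cpuset/qemu/cpuset.mems", "0");

        mkdir("/sys/fs/cgroup/cpuset/qemu/vcpus", 0755);
        write_str("/sys/fs/cgroup/cpuset/qemu/vcpus/cpuset.cpus", "2-6");
        write_str("/sys/fs/cgroup/cpuset/qemu/vcpus/cpuset.mems", "0");

        mkdir("/sys/fs/cgroup/cpuset/qemu/emulator", 0755);
        write_str("/sys/fs/cgroup/cpuset/qemu/emulator/cpuset.cpus", "0-1");
        write_str("/sys/fs/cgroup/cpuset/qemu/emulator/cpuset.mems", "0");

        /* per-thread placement via the v1 "tasks" file: each vcpu tid
         * goes into vcpus/, the main thread and vhost tids into
         * emulator/ (tids below are placeholders) */
        write_str("/sys/fs/cgroup/cpuset/qemu/vcpus/tasks", "12345");
        write_str("/sys/fs/cgroup/cpuset/qemu/emulator/tasks", "12300");
        return 0;
}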
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-18 23:39 ` Kamezawa Hiroyuki @ 2015-08-19 16:23 ` Tejun Heo 0 siblings, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-19 16:23 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Kame. On Wed, Aug 19, 2015 at 08:39:43AM +0900, Kamezawa Hiroyuki wrote: > An actual per-thread use case among our customers is qemu-kvm + cpuset. > Customers pin each vcpu and qemu-kvm's worker threads to cpus. > For example, pinning 4 vcpus to cpus 2-6 and pinning the qemu main thread and > others (vhost) to cpus 0-1. taskset and/or teach qemu how to configure its worker threads? > This is actual kvm tuning done for our customers to guarantee performance. > > In another case, the cpu cgroup's throttling feature is used per vcpu for vm cpu sizing. Yeap, this is something we likely want to implement in an accessible way. For kvm, per-thread throttling configuration is enough, right? Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
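The non-cgroup alternative Tejun points at is plain per-thread affinity, i.e. qemu (or a wrapper) doing something along these lines for each thread; the cpu numbers match Kame's example and are illustrative.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the inclusive cpu range [first, last]. */
static int pin_self(int first, int last)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        for (int cpu = first; cpu <= last; cpu++)
                CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/*
 * Each vcpu thread:       pin_self(2, 6);
 * Main thread / vhost:    pin_self(0, 1);
 *
 * From the outside, "taskset -p -c 2-6 <tid>" does the same thing.
 * Per-vcpu bandwidth limits, on the other hand, have no non-cgroup
 * equivalent today, which is the part that would need to be made
 * accessible.
 */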
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-18 20:31 ` Tejun Heo 2015-08-18 23:39 ` Kamezawa Hiroyuki @ 2015-08-19 3:23 ` Mike Galbraith 2015-08-19 16:41 ` Tejun Heo 2015-08-21 19:26 ` Paul Turner 2 siblings, 1 reply; 92+ messages in thread From: Mike Galbraith @ 2015-08-19 3:23 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Tue, 2015-08-18 at 13:31 -0700, Tejun Heo wrote: > So, this is a trade-off we're consciously making. If there are > common-enough use cases which require jumping across different cgroup > domains, we'll try to figure out a way to accomodate those but by > default migration is a very cold and expensive path. Hm. I know of a big data outfit to which attach/detach performance was important enough for them to have plucked an old experimental overhead reduction hack (mine) off lkml, and shipped it. It must have mattered a LOT for them (not suicidal crash test dummies) to have done that. -Mike ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-19 3:23 ` Mike Galbraith @ 2015-08-19 16:41 ` Tejun Heo 2015-08-20 4:00 ` Mike Galbraith 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-19 16:41 UTC (permalink / raw) To: Mike Galbraith Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Mike. On Wed, Aug 19, 2015 at 05:23:40AM +0200, Mike Galbraith wrote: > Hm. I know of a big data outfit to which attach/detach performance was > important enough for them to have plucked an old experimental overhead > reduction hack (mine) off lkml, and shipped it. It must have mattered a > LOT for them (not suicidal crash test dummies) to have done that. There haven't been any guidelines on cgroup usage. Of course people have been developing in all directions. It's a natural learning process and there are use cases which can be served by migrating processes back and forth. Nobody is trying to prevent that; however, if one examines how resources and their associations need to be tracked for accounting and control, it's evident that there are inherent trade-offs between migration and the stuff which happens while not migrating and it's clear which side is more important. Most problems can be solved in different ways and I'm doubtful that e.g. bouncing jobs to worker threads would be more expensive than migrating the worker back and forth in a lot of cases. If migrating threads around floats somebody's boat, that's fine but that has never been and can't be the focus of design and optimization, not at the cost of the actual hot paths. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-19 16:41 ` Tejun Heo @ 2015-08-20 4:00 ` Mike Galbraith 2015-08-20 7:52 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Mike Galbraith @ 2015-08-20 4:00 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Wed, 2015-08-19 at 09:41 -0700, Tejun Heo wrote: > Most problems can be solved in different ways and I'm doubtful that > e.g. bouncing jobs to worker threads would be more expensive than > migrating the worker back and forth in a lot of cases. If migrating > threads around floats somebody's boat, that's fine but that has never > been and can't be the focus of design and optimization, not at the > cost of the actual hot paths. If create/attach/detach/destroy aren't hot paths, what is? Those are fork/exec/exit cgroup analogs. If you have thousands upon thousands of potentially active cgroups (aka customers), you wouldn't want to keep them all around just in case when you can launch cgroup tasks the same way we launch any other task. You wouldn't contemplate slowing down fork/exec/exit, but create/attach/detach/destroy are one and the same.. they need to be just as fast/light as they can be, as they are part and parcel of the higher level process. That's why my hack ended up in a large enterprise outfit's product, it was _needed_ to fix up cgroups performance suckage. That suckage was fixed up properly quite a bit later. Anyway, if what they or anybody like them can currently do with their job launcher/manager gizmos is negatively impacted, they can gripe for themselves. All I'm saying is that there are definitely users out there to whom create/attach/detach/destroy are highly important. -Mike ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-20 4:00 ` Mike Galbraith @ 2015-08-20 7:52 ` Tejun Heo 2015-08-20 8:47 ` Mike Galbraith 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-20 7:52 UTC (permalink / raw) To: Mike Galbraith Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hey, Mike. On Thu, Aug 20, 2015 at 06:00:59AM +0200, Mike Galbraith wrote: > If create/attach/detach/destroy aren't hot paths, what is? Those are > fork/exec/exit cgroup analogs. If you have thousands upon thousands of Things like page faults? cgroup controllers hook into subsystems and their hot path operations get affected by the method of cgroup association. Also, migration and create/destroy are completely different. create/destroy don't need much synchronization - a new task is made visible only after the initial association is set up and a dying task's association is destroyed only after the task isn't referenced by anybody. There's nothing dynamic about those compared to migration. > potentially active cgroups (aka customers), you wouldn't want to keep > them all around just in case when you can launch cgroup tasks the same > way we launch any other task. You wouldn't contemplate slowing down > fork/exec/exit, but create/attach/detach/destroy are one and the same.. > they need to be just as fast/light as they can be, as they are part and > parcel of the higher level process. You're conflating two completely different operations. Also, when I say migration is a relatively expensive operation, I'm comparing it to bouncing a request to another thread as opposed to bouncing the issuing thread to different cgroup request-by-request. > That's why my hack ended up in a large enterprise outfit's product, it > was _needed_ to fix up cgroups performance suckage. That suckage was > fixed up properly quite a bit later. Hmm... I bet you're talking about the removal of synchronize_rcu() in migration path, sure, that was a silly thing to have there but also that comparison is likely a couple orders of magnitude off of what the thread was originally talking about. > Anyway, if what they or anybody like them can currently do with their > job launcher/manager gizmos is negatively impacted, they can gripe for > themselves. All I'm saying is that there are definitely users out there > to whom create/attach/detach/destroy are highly important. Hmmm... I think this discussion got pretty badly derailed at this point. If I'm not mistaken, you're talking about tens or a few hundred millisecs of latency per migration which no longer exists and won't ever come back and the discussion originally was about something like migrating thread for issuing several IO requests versus bouncing that to a dedicated issuer thread in that domain. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
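For concreteness, the contrast here (migrating the issuing thread per request versus handing the request to a thread already seated in the target cgroup) can be sketched like this; the v1 path and group names are illustrative and error handling is trimmed.

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Migrating the issuing thread is a cgroup-core attach operation per
 * request: a write of the caller's tid into the target group. */
static int move_self_to_cgroup(const char *tasks_path)
{
        FILE *f = fopen(tasks_path, "w");

        if (!f)
                return -1;
        fprintf(f, "%ld\n", (long)syscall(SYS_gettid));
        return fclose(f);
}

/*
 * Per-request migration (the expensive pattern):
 *
 *     move_self_to_cgroup("/sys/fs/cgroup/blkio/customerA/tasks");
 *     do_io(request);
 *     move_self_to_cgroup("/sys/fs/cgroup/blkio/proxy/tasks");
 *
 * The alternative is to keep a few worker threads parked in each
 * customer group and hand requests to them over an ordinary in-process
 * queue, so the hot path never touches cgroupfs at all.
 */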
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-20 7:52 ` Tejun Heo @ 2015-08-20 8:47 ` Mike Galbraith 0 siblings, 0 replies; 92+ messages in thread From: Mike Galbraith @ 2015-08-20 8:47 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Thu, 2015-08-20 at 00:52 -0700, Tejun Heo wrote: > Hmmm... I think this discussion got pretty badly derailed at this > point. If I'm not mistaken, you're talking about tens or a few > hundred millisecs of latency per migration which no longer exists and > won't ever come back and the discussion originally was about something > like migrating thread for issuing several IO requests versus bouncing > that to a dedicated issuer thread in that domain. Yes, ms latencies ever coming back is the concern, whether that be due to something akin to the old synchronize_rcu() horror.. or some handoff of whatever to whomever. -Mike ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-18 20:31 ` Tejun Heo 2015-08-18 23:39 ` Kamezawa Hiroyuki 2015-08-19 3:23 ` Mike Galbraith @ 2015-08-21 19:26 ` Paul Turner 2015-08-22 18:29 ` Tejun Heo 2 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-21 19:26 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Tue, Aug 18, 2015 at 1:31 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Paul. > > On Mon, Aug 17, 2015 at 09:03:30PM -0700, Paul Turner wrote: >> > 2) Control within an address-space. For subsystems with fungible resources, >> > e.g. CPU, it can be useful for an address space to partition its own >> > threads. Losing the capability to do this against the CPU controller would >> > be a large set-back for instance. Occasionally, it is useful to share these >> > groupings between address spaces when processes are cooperative, but this is >> > less of a requirement. >> > >> > This is important to us. > > Sure, let's build a proper interface for that. Do you actually need > sub-hierarchy inside a process? Can you describe your use case in > detail and why having hierarchical CPU cycle distribution is essential > for your use case? > One common example here is a thread-pool. Having a hierarchical constraint allows users to specify what proportion of time it should receive, independent of how many threads are placed in the pool. A very concrete example of the above is a virtual machine in which you want to guarantee scheduling for the vCPU threads which must schedule beside many hypervisor support threads. A hierarchy is the only way to fix the ratio at which these compete. An example that's not the cpu controller is that we use cpusets to expose to applications their "shared" and "private" cores. (These sets are dynamic based on what is coscheduled on a given machine.) >> >> And that's one of the major fuck ups on cgroup's part that must be >> >> rectified. Look at the interface being proposed there. It's exposing >> >> direct hardware details w/o much abstraction which is fine for a >> >> system management interface but at the same time it's intended to be >> >> exposed to individual applications. >> > >> > FWIW this is something we've had no significant problems managing with >> > separate mount mounts and file system protections. Yes, there are some >> > potential warts around atomicity; but we've not found them too onerous. > > You guys control the whole stack. Of course, you can get away with an > interface which are pretty messed up in terms of layering and > isolation; however, generic kernel interface cannot be designed > according to that standard. I feel like two points are being conflated here: Yes, it is sufficiently generic that it's possible to configure nonsensical things. But, it is also possible to lock things down presently. This is, for better or worse, the direction that general user-space has also taken with centralized management daemons such as systemd. Setting design aside for a moment -- which I fully agree with you that there is room for large improvement in. The largest idiosyncrasy today is that the configuration above does depend on having a stable mount point for applications to manage their sub-hierarchies. 
Migrations would improve this greatly, but this is a bit of a detour because you're looking to fix the fundamental design rather than improve the state of the world and that's probably a good thing :) > >> > What I don't quite follow here is the assumption that CAT should would be >> > necessarily exposed to individual applications? What's wrong with subsystems >> > that are primarily intended only for system management agents, we already >> > have several of these. > > Why would you assume that threads of a process wouldn't want to > configure it ever? How is this different from CPU affinity? In general cache and CPU behave differently. Generally for it to make sense between threads in a process they would have to have wholly disjoint memory, at which point the only sane long-term implementation is separate processes and the management moves up a level anyway. That said, there are surely cases in which it might be convenient to use at a per-thread level to correct a specific performance anomaly. But at that point, you have certainly reached the level of hammer that you can coordinate with an external daemon if necessary. > >> >> This lack of distinction makes >> >> people skip the attention that they should be paying when they're >> >> designing interface exposed to individual programs. Worse, this makes >> >> these things fly under the review scrutiny that public API accessible >> >> to applications usually receives. Yet, that's what these things end >> >> up to be. This just has to stop. cgroups can't continue to be this >> >> ghetto shortcut to implementing half-assed APIs. >> > >> > I certainly don't disagree on this point :). But as above, I don't quite >> > follow why an API being in cgroups must mean it's accessible to an >> > application controlled by that group. This has certainly not been a >> > requirement for our use. > > I don't follow what you're trying to way with the above paragraph. > Are you still talking about CAT? If so, that use case isn't the only > one. I'm pretty sure there are people who would want to configure > cache allocation at thread level. I'm not agreeing with you that "in cgroups" means "must be usable by applications within that hierarchy". A cgroup subsystem used as a partitioning API only by system management daemons is entirely reasonable. CAT is a reasonable example of this. > >> >> What we should be doing is pushing them into the same arena as any >> >> other publicly accessible API. I don't think there can be a shortcut >> >> to this. >> > >> > Are you explicitly opposed to non-hierarchical partitions, however? Cpuset >> > is [typically] an example of this, where the interface wants to control >> > unified properties across a set of processes. Without necessarily being >> > usefully hierarchical. (This is just to understand your core position, I'm >> > not proposing cpuset should shape *anything*.) > > I'm having trouble following what you're trying to say. FWIW, cpuset > is fully hierarchical. I think where I was going with this is better addressed above. Here all I meant is that it's difficult to construct useful sub-hierarchies on the cpuset side, especially for memory. But this is a little x86-centric so let's drop it. > >> >> I don't think we want migration in sub-process hierarchy but in the >> >> off chance we do the naming can follow the same pid/program >> >> group/session id scheme, which, again, is a lot easier to deal with >> >> from applications. 
>> > >> > I don't have many objections with hand-off versus migration above, however, >> > I think that this is a big drawback. Threads are expensive to create and >> > are often cached rather than released. While migration may be expensive, >> > creating a more thread is more so. The important to reconfigure a thread's >> > personality at run-time is important. > > The core problem here is picking the hot path. If cgroups as a whole > doesn't pick a position here, controllers have to assume that > migration might not be a very cold path which naturally leads to > overall designs and synchronization schemes which concede hot path > performance to accomodate migration. We simply can't afford to do > that - we end up losing way more in way hotter paths for something > which may be marginally useful in some corner cases. > > So, this is a trade-off we're consciously making. If there are > common-enough use cases which require jumping across different cgroup > domains, we'll try to figure out a way to accomodate those but by > default migration is a very cold and expensive path. > The core here was the need for allowing sub-process migration. I'm not sure I follow the performance trade-off argument; haven't we historically seen the opposite? That migration has been a slow-path without optimizations and people pushing to make it faster? This seems a hard generalization to make for something that's inherently tied to a particular controller. I don't care if we try turning that dial back to assume it's a cold path once more, only that it's supported. >> >> But those are relative to the current directory per operation and >> >> there's no way to define a transaction across multiple file >> >> operations. There's no way to prevent a process from being migrated >> >> inbetween openat() and subsequent write(). >> > >> > A forwarding /proc/thread_self/cgroup accessor, or similar, would be another >> > way to address some of these issues. > > That sounds horrible to me. What if the process wants to do RMW a > config? Locking within a process is easy. > What if the permissions are different after an intervening > migration? This is a side-effect of migration not being properly supported. > What if the sub-hierarchy no longer exists or has been > replaced by a hierarchy with the same topology but actualy is a > different one? The easy answer is that: Only a process should be managing its sub-hierarchy. That's the nice thing about hierarchies. The harder answer is: How do we handle non-fungible resources such as CPU assignments within a hierarchy? This is a big part of why I make arguments for certain partitions being management-software only above. This is imperfect, but better then where we stand today. > >> > I don't quite agree here. Losing per-thread control within the cpu >> > controller is likely going to mean that much of it ends up being >> > reimplemented as some duplicate-in-appearance interface that gets us back to >> > where we are today. I recognize that these controllers (cpu, cpuacct) are >> > square pegs in that per-process makes sense for most other sub-systems; but >> > unfortunately, their needs and use-cases are real / dependent on their >> > present form. > > Let's build an API which actually looks and behaves like an API which > is properly isolated from what external agents may do to the process. > I can't see how that would be "back to where we are today". All of > those are pretty critical attributes for a public kernel API and > utterly broken right now. 
Sure, but I don't think you can throw out per-thread control for all controllers to enable this. Which makes everything else harder. An intermediary step in unification might be that we move from N mounts to 2. Those that can be managed at the process level, and those that can't. It's a compromise, but may allow cleaner abstractions for the former case. ^ permalink raw reply [flat|nested] 92+ messages in thread
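To tie this back to the earlier vCPU example: with today's v1 cpu controller the fixed vcpu-to-helper ratio is expressed as a sub-hierarchy of shares plus per-thread membership via the "tasks" file, roughly as sketched below. The mount point, names, share values and tids are illustrative and error handling is omitted.

#include <stdio.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (f) {
                fputs(val, f);
                fclose(f);
        }
}

int main(void)
{
        /* one group per VM with two children competing at a fixed 8:1
         * ratio, regardless of how many threads sit in either child */
        mkdir("/sys/fs/cgroup/cpu/vm0", 0755);
        mkdir("/sys/fs/cgroup/cpu/vm0/vcpus", 0755);
        mkdir("/sys/fs/cgroup/cpu/vm0/helpers", 0755);

        write_str("/sys/fs/cgroup/cpu/vm0/vcpus/cpu.shares", "8192");
        write_str("/sys/fs/cgroup/cpu/vm0/helpers/cpu.shares", "1024");

        /* per-thread membership: vcpu tids into vcpus/tasks, I/O and
         * emulation helper tids into helpers/tasks (placeholder tids) */
        write_str("/sys/fs/cgroup/cpu/vm0/vcpus/tasks", "4242");
        write_str("/sys/fs/cgroup/cpu/vm0/helpers/tasks", "4243");
        return 0;
}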
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-21 19:26 ` Paul Turner @ 2015-08-22 18:29 ` Tejun Heo 2015-08-24 15:47 ` Austin S Hemmelgarn 2015-08-24 20:52 ` Paul Turner 0 siblings, 2 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-22 18:29 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Paul. On Fri, Aug 21, 2015 at 12:26:30PM -0700, Paul Turner wrote: ... > A very concrete example of the above is a virtual machine in which you > want to guarantee scheduling for the vCPU threads which must schedule > beside many hypervisor support threads. A hierarchy is the only way > to fix the ratio at which these compete. Just to learn more, what sort of hypervisor support threads are we talking about? They would have to consume considerable amount of cpu cycles for problems like this to be relevant and be dynamic in numbers in a way which letting them competing against vcpus makes sense. Do IO helpers meet these criteria? > An example that's not the cpu controller is that we use cpusets to > expose to applications their "shared" and "private" cores. (These > sets are dynamic based on what is coscheduled on a given machine.) Can you please go into more details with these? > > Why would you assume that threads of a process wouldn't want to > > configure it ever? How is this different from CPU affinity? > > In general cache and CPU behave differently. Generally for it to make > sense between threads in a process they would have to have wholly > disjoint memory, at which point the only sane long-term implementation > is separate processes and the management moves up a level anyway. > > That said, there are surely cases in which it might be convenient to > use at a per-thread level to correct a specific performance anomaly. > But at that point, you have certainly reached the level of hammer that > you can coordinate with an external daemon if necessary. So, I'm not super familiar with all the use cases but the whole cache allocation thing is almost by nature a specific niche thing and I feel pretty reluctant to blow off per-thread usages as too niche to worry about. > > I don't follow what you're trying to way with the above paragraph. > > Are you still talking about CAT? If so, that use case isn't the only > > one. I'm pretty sure there are people who would want to configure > > cache allocation at thread level. > > I'm not agreeing with you that "in cgroups" means "must be usable by > applications within that hierarchy". A cgroup subsystem used as a > partitioning API only by system management daemons is entirely > reasonable. CAT is a reasonable example of this. I see. The same argument. I don't think CAT just being system management thing makes sense. > > So, this is a trade-off we're consciously making. If there are > > common-enough use cases which require jumping across different cgroup > > domains, we'll try to figure out a way to accomodate those but by > > default migration is a very cold and expensive path. > > The core here was the need for allowing sub-process migration. I'm > not sure I follow the performance trade-off argument; haven't we > historically seen the opposite? That migration has been a slow-path > without optimizations and people pushing to make it faster? This > seems a hard generalization to make for something that's inherently > tied to a particular controller. It isn't something tied to a particular controller. 
Some controllers may get impacted less than others but there's an inherent connection between how dynamic an association is and how expensive the locking around it needs to be and we need to set up basic behavior and usage conventions so that different controllers are designed and implemented assuming similar usage patterns; otherwise, we end up with the chaotic shit show that we have had where everything behaves differently and nobody knows what's the right way to do things and we end up locked into weird requirements which some controller induced for no good reason but cause significant pain on use cases which actually matter. > I don't care if we try turning that dial back to assume it's a cold > path once more, only that it's supported. It has always been a cold path and I'm not saying this is gonna be noticeably worse in the future but usages like bouncing threads on a request-by-request basis are and will be clearly worse than bouncing to threads which are already in the target domain. > >> > A forwarding /proc/thread_self/cgroup accessor, or similar, would be another > >> > way to address some of these issues. > > > > That sounds horrible to me. What if the process wants to do RMW a > > config? > > Locking within a process is easy. It's not contained in the process at all. What if an external entity decides to migrate the process into another cgroup in between? > > What if the permissions are different after an intervening > > migration? > > This is a side-effect of migration not being properly supported. > > > What if the sub-hierarchy no longer exists or has been > > replaced by a hierarchy with the same topology but actualy is a > > different one? > > The easy answer is that: Only a process should be managing its > sub-hierarchy. That's the nice thing about hierarchies. cgroupfs is a horrible place to implement that part of the interface. It doesn't make any sense to combine those two into the same hierarchy. You're agreeing with the identified problem but somehow still suggesting doing what we've been doing when the root cause of the said problem is conflating and interlocking these two separate things. > The harder answer is: How do we handle non-fungible resources such as > CPU assignments within a hierarchy? This is a big part of why I make > arguments for certain partitions being management-software only above. > This is imperfect, but better then where we stand today. I'm not following. Why is that different? > > Let's build an API which actually looks and behaves like an API which > > is properly isolated from what external agents may do to the process. > > I can't see how that would be "back to where we are today". All of > > those are pretty critical attributes for a public kernel API and > > utterly broken right now. > > Sure, but I don't think you can throw out per-thread control for all > controllers to enable this. Which makes everything else harder. A > intermediary step in unification might be that we move from N mounts > to 2. Those that can be managed at the process level, and those that > can't. It's a compromise, but may allow cleaner abstractions for the > former case. The transition can already be gradual. Why would you add yet another transition step? Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-22 18:29 ` Tejun Heo @ 2015-08-24 15:47 ` Austin S Hemmelgarn 2015-08-24 17:04 ` Tejun Heo 2015-08-24 20:52 ` Paul Turner 1 sibling, 1 reply; 92+ messages in thread From: Austin S Hemmelgarn @ 2015-08-24 15:47 UTC (permalink / raw) To: Tejun Heo, Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 2041 bytes --] On 2015-08-22 14:29, Tejun Heo wrote: > Hello, Paul. > > On Fri, Aug 21, 2015 at 12:26:30PM -0700, Paul Turner wrote: > ... >> A very concrete example of the above is a virtual machine in which you >> want to guarantee scheduling for the vCPU threads which must schedule >> beside many hypervisor support threads. A hierarchy is the only way >> to fix the ratio at which these compete. > > Just to learn more, what sort of hypervisor support threads are we > talking about? They would have to consume considerable amount of cpu > cycles for problems like this to be relevant and be dynamic in numbers > in a way which letting them competing against vcpus makes sense. Do > IO helpers meet these criteria? > Depending on the configuration, yes they can. VirtualBox has some rather CPU intensive threads that aren't vCPU threads (their emulated APIC thread immediately comes to mind), and so does QEMU depending on the emulated hardware configuration (it gets more noticeable when the disk images are stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is pretty typical usage for large virtualization deployments). I've seen cases first hand where the vCPU's can make no reasonable progress because they are constantly getting crowded out by other threads. The use of the term 'hypervisor support threads' for this is probably not the best way of describing the contention, as it's almost always a full system virtualization issue, and the contending threads are usually storage back-end access threads. I would argue that there are better ways to deal properly with this (Isolate the non vCPU threads on separate physical CPU's from the hardware emulation threads), but such methods require large systems to be practical at any scale, and many people don't have the budget for such large systems, and this way of doing things is much more flexible for small scale use cases (for example, someone running one or two VM's on a laptop under QEMU or VirtualBox). [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 15:47 ` Austin S Hemmelgarn @ 2015-08-24 17:04 ` Tejun Heo 2015-08-24 19:18 ` Mike Galbraith ` (2 more replies) 0 siblings, 3 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-24 17:04 UTC (permalink / raw) To: Austin S Hemmelgarn Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Austin. On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote: > >Just to learn more, what sort of hypervisor support threads are we > >talking about? They would have to consume considerable amount of cpu > >cycles for problems like this to be relevant and be dynamic in numbers > >in a way which letting them competing against vcpus makes sense. Do > >IO helpers meet these criteria? > > > Depending on the configuration, yes they can. VirtualBox has some rather > CPU intensive threads that aren't vCPU threads (their emulated APIC thread > immediately comes to mind), and so does QEMU depending on the emulated And the number of those threads fluctuate widely and dynamically? > hardware configuration (it gets more noticeable when the disk images are > stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is > pretty typical usage for large virtualization deployments). I've seen cases > first hand where the vCPU's can make no reasonable progress because they are > constantly getting crowded out by other threads. That alone doesn't require hierarchical resource distribution tho. Setting nice levels reasonably is likely to alleviate most of the problem. > The use of the term 'hypervisor support threads' for this is probably not > the best way of describing the contention, as it's almost always a full > system virtualization issue, and the contending threads are usually storage > back-end access threads. > > I would argue that there are better ways to deal properly with this (Isolate > the non vCPU threads on separate physical CPU's from the hardware emulation > threads), but such methods require large systems to be practical at any > scale, and many people don't have the budget for such large systems, and > this way of doing things is much more flexible for small scale use cases > (for example, someone running one or two VM's on a laptop under QEMU or > VirtualBox). I don't know. "Someone running one or two VM's on a laptop under QEMU" doesn't really sound like the use case which absolutely requires hierarchical cpu cycle distribution. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 17:04 ` Tejun Heo @ 2015-08-24 19:18 ` Mike Galbraith 2015-08-24 20:00 ` Austin S Hemmelgarn 2015-08-24 20:54 ` Paul Turner 2 siblings, 0 replies; 92+ messages in thread From: Mike Galbraith @ 2015-08-24 19:18 UTC (permalink / raw) To: Tejun Heo Cc: Austin S Hemmelgarn, Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, 2015-08-24 at 13:04 -0400, Tejun Heo wrote: > Hello, Austin. > > On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote: > > >Just to learn more, what sort of hypervisor support threads are we > > >talking about? They would have to consume considerable amount of cpu > > >cycles for problems like this to be relevant and be dynamic in numbers > > >in a way which letting them competing against vcpus makes sense. Do > > >IO helpers meet these criteria? > > > > > Depending on the configuration, yes they can. VirtualBox has some rather > > CPU intensive threads that aren't vCPU threads (their emulated APIC thread > > immediately comes to mind), and so does QEMU depending on the emulated > > And the number of those threads fluctuate widely and dynamically? > > > hardware configuration (it gets more noticeable when the disk images are > > stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is > > pretty typical usage for large virtualization deployments). I've seen cases > > first hand where the vCPU's can make no reasonable progress because they are > > constantly getting crowded out by other threads. Hm. Serious CPU starvation would seem to require quite a few hungry threads, but even a few IO threads with kick butt hardware under them could easily tilt fairness heavily in favor of VPUs generating IO. > That alone doesn't require hierarchical resource distribution tho. > Setting nice levels reasonably is likely to alleviate most of the > problem. Unless the CPU controller is in use. -Mike ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 17:04 ` Tejun Heo 2015-08-24 19:18 ` Mike Galbraith @ 2015-08-24 20:00 ` Austin S Hemmelgarn 2015-08-24 20:25 ` Tejun Heo 2015-08-24 20:54 ` Paul Turner 2 siblings, 1 reply; 92+ messages in thread From: Austin S Hemmelgarn @ 2015-08-24 20:00 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton [-- Attachment #1: Type: text/plain, Size: 3313 bytes --] On 2015-08-24 13:04, Tejun Heo wrote: > Hello, Austin. > > On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote: >>> Just to learn more, what sort of hypervisor support threads are we >>> talking about? They would have to consume considerable amount of cpu >>> cycles for problems like this to be relevant and be dynamic in numbers >>> in a way which letting them competing against vcpus makes sense. Do >>> IO helpers meet these criteria? >>> >> Depending on the configuration, yes they can. VirtualBox has some rather >> CPU intensive threads that aren't vCPU threads (their emulated APIC thread >> immediately comes to mind), and so does QEMU depending on the emulated > > And the number of those threads fluctuate widely and dynamically? It depends, usually there isn't dynamic fluctuation unless there is a lot of hot[un]plugging of virtual devices going on (which can be the case for situations with tight host/guest integration), but the number of threads can vary widely between configurations (most of the VM's I run under QEMU have about 16 threads on average, but I've seen instances with more than 100 threads). The most likely case to cause wide and dynamic fluctuations of threads would be systems set up to dynamically hot[un]plug vCPU's based on system load (such systems have other issues to contend with also, but they do exist). >> hardware configuration (it gets more noticeable when the disk images are >> stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is >> pretty typical usage for large virtualization deployments). I've seen cases >> first hand where the vCPU's can make no reasonable progress because they are >> constantly getting crowded out by other threads. > > That alone doesn't require hierarchical resource distribution tho. > Setting nice levels reasonably is likely to alleviate most of the > problem. In the cases I've dealt with this myself, nice levels didn't cut it, and I had to resort to SCHED_RR with particular care to avoid priority inversions. >> The use of the term 'hypervisor support threads' for this is probably not >> the best way of describing the contention, as it's almost always a full >> system virtualization issue, and the contending threads are usually storage >> back-end access threads. >> >> I would argue that there are better ways to deal properly with this (Isolate >> the non vCPU threads on separate physical CPU's from the hardware emulation >> threads), but such methods require large systems to be practical at any >> scale, and many people don't have the budget for such large systems, and >> this way of doing things is much more flexible for small scale use cases >> (for example, someone running one or two VM's on a laptop under QEMU or >> VirtualBox). > > I don't know. "Someone running one or two VM's on a laptop under > QEMU" doesn't really sound like the use case which absolutely requires > hierarchical cpu cycle distribution. It depends on the use case. 
I never have more than 2 VM's running on my laptop (always under QEMU, setting up Xen is kind of pointless on a quad core system with only 8G of RAM), and I take extensive advantage of the cpu cgroup to partition resources among various services on the host. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 92+ messages in thread
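For reference, a rough sketch of the per-thread workaround Austin describes above, assuming the vCPU thread IDs have already been identified; the TIDs and the priority value are purely illustrative:

  # move each vCPU thread to SCHED_RR so it preempts the SCHED_OTHER
  # helper threads; the helpers stay at the default policy
  for tid in 12345 12346 12347 12348; do
          chrt --rr -p 10 "$tid"
  done

As Austin notes, realtime policies make priority inversion a real concern here, since the vCPUs regularly wait on the very helper threads they would now always preempt.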
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 20:00 ` Austin S Hemmelgarn @ 2015-08-24 20:25 ` Tejun Heo 2015-08-24 21:00 ` Paul Turner 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-24 20:25 UTC (permalink / raw) To: Austin S Hemmelgarn Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Austin. On Mon, Aug 24, 2015 at 04:00:49PM -0400, Austin S Hemmelgarn wrote: > >That alone doesn't require hierarchical resource distribution tho. > >Setting nice levels reasonably is likely to alleviate most of the > >problem. > > In the cases I've dealt with this myself, nice levels didn't cut it, and I > had to resort to SCHED_RR with particular care to avoid priority inversions. I wonder why. The difference between -20 and 20 is around 2500x in terms of weight. That should have been enough for expressing whatever precedence the vcpus should have over other threads. > >I don't know. "Someone running one or two VM's on a laptop under > >QEMU" doesn't really sound like the use case which absolutely requires > >hierarchical cpu cycle distribution. > > It depends on the use case. I never have more than 2 VM's running on my > laptop (always under QEMU, setting up Xen is kind of pointless ona quad core > system with only 8G of RAM), and I take extensive advantage of the cpu > cgroup to partition resources among various services on the host. Hmmm... I'm trying to understand the usecases where having hierarchy inside a process are actually required so that we don't end up doing something complex unnecessarily. So far, it looks like an easy alternative for qemu would be teaching it to manage priorities of its threads given that the threads are mostly static - vcpus going up and down are explicit operations which can trigger priority adjustments if necessary, which is unlikely to begin with. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 20:25 ` Tejun Heo @ 2015-08-24 21:00 ` Paul Turner 2015-08-24 21:12 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-24 21:00 UTC (permalink / raw) To: Tejun Heo Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 1:25 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Austin. > > On Mon, Aug 24, 2015 at 04:00:49PM -0400, Austin S Hemmelgarn wrote: >> >That alone doesn't require hierarchical resource distribution tho. >> >Setting nice levels reasonably is likely to alleviate most of the >> >problem. >> >> In the cases I've dealt with this myself, nice levels didn't cut it, and I >> had to resort to SCHED_RR with particular care to avoid priority inversions. > > I wonder why. The difference between -20 and 20 is around 2500x in > terms of weight. That should have been enough for expressing whatever > precedence the vcpus should have over other threads. This strongly perturbs the load-balancer which performs busiest cpu selection by weight. Note that also we do not necessarily want total dominance by vCPU threads, the hypervisor threads are almost always doing work on their behalf and we want to provision them with _some_ time. A sub-hierarchy allows this to be performed in a way that is independent of how many vCPUs or support threads that are present. > >> >I don't know. "Someone running one or two VM's on a laptop under >> >QEMU" doesn't really sound like the use case which absolutely requires >> >hierarchical cpu cycle distribution. >> >> It depends on the use case. I never have more than 2 VM's running on my >> laptop (always under QEMU, setting up Xen is kind of pointless ona quad core >> system with only 8G of RAM), and I take extensive advantage of the cpu >> cgroup to partition resources among various services on the host. > > Hmmm... I'm trying to understand the usecases where having hierarchy > inside a process are actually required so that we don't end up doing > something complex unnecessarily. So far, it looks like an easy > alternative for qemu would be teaching it to manage priorities of its > threads given that the threads are mostly static - vcpus going up and > down are explicit operations which can trigger priority adjustments if > necessary, which is unlikely to begin with. What you're proposing is both unnecessarily complex and imprecise. Arbitrating competition between groups of threads is exactly why we support sub-hierarchies within cpu. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 21:00 ` Paul Turner @ 2015-08-24 21:12 ` Tejun Heo 2015-08-24 21:15 ` Paul Turner 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-24 21:12 UTC (permalink / raw) To: Paul Turner Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Paul. On Mon, Aug 24, 2015 at 02:00:54PM -0700, Paul Turner wrote: > > Hmmm... I'm trying to understand the usecases where having hierarchy > > inside a process are actually required so that we don't end up doing > > something complex unnecessarily. So far, it looks like an easy > > alternative for qemu would be teaching it to manage priorities of its > > threads given that the threads are mostly static - vcpus going up and > > down are explicit operations which can trigger priority adjustments if > > necessary, which is unlikely to begin with. > > What you're proposing is both unnecessarily complex and imprecise. > Arbitrating competition between groups of threads is exactly why we > support sub-hierarchies within cpu. Sure, and to make that behave half-way acceptable, we'll have to take on a significant amount of effort and likely complexity and I'm trying to see whether the usecases are actually justifiable. I get that a priority based solution will be less precise and more complex on the application side but by how much, and is the added precision enough to justify the extra facilities to support that? If it is, sure, let's get to it but it'd be great if the concrete problem cases are properly identified and understood. I'll continue on the other reply. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 21:12 ` Tejun Heo @ 2015-08-24 21:15 ` Paul Turner 0 siblings, 0 replies; 92+ messages in thread From: Paul Turner @ 2015-08-24 21:15 UTC (permalink / raw) To: Tejun Heo Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 2:12 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Paul. > > On Mon, Aug 24, 2015 at 02:00:54PM -0700, Paul Turner wrote: >> > Hmmm... I'm trying to understand the usecases where having hierarchy >> > inside a process are actually required so that we don't end up doing >> > something complex unnecessarily. So far, it looks like an easy >> > alternative for qemu would be teaching it to manage priorities of its >> > threads given that the threads are mostly static - vcpus going up and >> > down are explicit operations which can trigger priority adjustments if >> > necessary, which is unlikely to begin with. >> >> What you're proposing is both unnecessarily complex and imprecise. >> Arbitrating competition between groups of threads is exactly why we >> support sub-hierarchies within cpu. > > Sure, and to make that behave half-way acceptable, we'll have to take > on significant amount of effort and likely complexity and I'm trying > to see whether the usecases are actually justifiable. I get that > priority based solution will be less precise and more complex on the > application side but by how much and does the added precision enough > to justify the extra facilities to support that? If it is, sure, > let's get to it but it'd be great if the concrete prolem cases are > properly identified and understood. I'll continue on the other reply. > No problem, I think the conversation is absolutely constructive/important to have and am happy to help drill down. > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 17:04 ` Tejun Heo 2015-08-24 19:18 ` Mike Galbraith 2015-08-24 20:00 ` Austin S Hemmelgarn @ 2015-08-24 20:54 ` Paul Turner 2015-08-24 21:02 ` Tejun Heo 2 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-24 20:54 UTC (permalink / raw) To: Tejun Heo Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 10:04 AM, Tejun Heo <tj@kernel.org> wrote: > Hello, Austin. > > On Mon, Aug 24, 2015 at 11:47:02AM -0400, Austin S Hemmelgarn wrote: >> >Just to learn more, what sort of hypervisor support threads are we >> >talking about? They would have to consume considerable amount of cpu >> >cycles for problems like this to be relevant and be dynamic in numbers >> >in a way which letting them competing against vcpus makes sense. Do >> >IO helpers meet these criteria? >> > >> Depending on the configuration, yes they can. VirtualBox has some rather >> CPU intensive threads that aren't vCPU threads (their emulated APIC thread >> immediately comes to mind), and so does QEMU depending on the emulated > > And the number of those threads fluctuate widely and dynamically? > >> hardware configuration (it gets more noticeable when the disk images are >> stored on a SAN and served through iSCSI, NBD, FCoE, or ATAoE, which is >> pretty typical usage for large virtualization deployments). I've seen cases >> first hand where the vCPU's can make no reasonable progress because they are >> constantly getting crowded out by other threads. > > That alone doesn't require hierarchical resource distribution tho. > Setting nice levels reasonably is likely to alleviate most of the > problem. Nice is not sufficient here. There could be arbitrarily many threads within the hypervisor that are not actually hosting guest CPU threads. The only way to have this competition occur at a reasonably fixed ratio is a sub-hierarchy. > >> The use of the term 'hypervisor support threads' for this is probably not >> the best way of describing the contention, as it's almost always a full >> system virtualization issue, and the contending threads are usually storage >> back-end access threads. >> >> I would argue that there are better ways to deal properly with this (Isolate >> the non vCPU threads on separate physical CPU's from the hardware emulation >> threads), but such methods require large systems to be practical at any >> scale, and many people don't have the budget for such large systems, and >> this way of doing things is much more flexible for small scale use cases >> (for example, someone running one or two VM's on a laptop under QEMU or >> VirtualBox). > > I don't know. "Someone running one or two VM's on a laptop under > QEMU" doesn't really sound like the use case which absolutely requires > hierarchical cpu cycle distribution. > We run more than 'one or two' VMs using this configuration. :) ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 20:54 ` Paul Turner @ 2015-08-24 21:02 ` Tejun Heo 2015-08-24 21:10 ` Paul Turner 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-24 21:02 UTC (permalink / raw) To: Paul Turner Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, On Mon, Aug 24, 2015 at 01:54:08PM -0700, Paul Turner wrote: > > That alone doesn't require hierarchical resource distribution tho. > > Setting nice levels reasonably is likely to alleviate most of the > > problem. > > Nice is not sufficient here. There could be arbitrarily many threads > within the hypervisor that are not actually hosting guest CPU threads. > The only way to have this competition occur at a reasonably fixed > ratio is a sub-hierarchy. I get that having hierarchy of threads would be nicer but am having a bit of difficulty seeing why adjusting priorities of threads wouldn't be sufficient. It's not like threads of the same process competing with each other is a new problem. People have been dealing with it for ages. Hierarchical management can be a nice plus but we want the problem and proposed solution to be justifiable. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 21:02 ` Tejun Heo @ 2015-08-24 21:10 ` Paul Turner 2015-08-24 21:17 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-24 21:10 UTC (permalink / raw) To: Tejun Heo Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 2:02 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Mon, Aug 24, 2015 at 01:54:08PM -0700, Paul Turner wrote: >> > That alone doesn't require hierarchical resource distribution tho. >> > Setting nice levels reasonably is likely to alleviate most of the >> > problem. >> >> Nice is not sufficient here. There could be arbitrarily many threads >> within the hypervisor that are not actually hosting guest CPU threads. >> The only way to have this competition occur at a reasonably fixed >> ratio is a sub-hierarchy. > > I get that having hierarchy of threads would be nicer but am having a > bit of difficulty seeing why adjusting priorities of threads wouldn't > be sufficient. It's not like threads of the same process competing > with each other is a new problem. People have been dealing with it > for ages. Hierarchical management can be a nice plus but we want the > problem and proposed solution to be justifiable. Consider what happens with load asymmetry: Suppose that we have 10 vcpu threads and 100 support threads. Suppose that we want the support threads to receive up to 10% of the time available to the VM as a whole on that machine. If I have one particular support thread that is busy, I want it to receive that entire 10% (maybe a guest is pounding on scsi for example, or in the thread-pool case, I've passed a single expensive computation). Conversely, suppose the guest is doing lots of different things and several support threads are active, I want the time to be shared between them. There is no way to implement this with nice. Either a single thread can consume 10%, and the group can dominate, or the group cannot dominate and the single thread can be starved. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
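A minimal sketch of the sub-hierarchy Paul is describing, expressed with today's cgroup v1 cpu controller; the mount point, group names, share values and TID variables are assumptions chosen to approximate the 10 vcpu / 100 support thread example with a roughly 90/10 split:

  # fixed ratio between the two thread pools, independent of how many
  # threads each pool contains or how many of them are runnable
  mkdir -p /sys/fs/cgroup/cpu/vm0/vcpus /sys/fs/cgroup/cpu/vm0/support
  echo 9216 > /sys/fs/cgroup/cpu/vm0/vcpus/cpu.shares     # ~90% under contention
  echo 1024 > /sys/fs/cgroup/cpu/vm0/support/cpu.shares   # ~10% under contention

  # per-thread placement: TIDs are written into the v1 "tasks" file
  echo "$VCPU_TID"    > /sys/fs/cgroup/cpu/vm0/vcpus/tasks
  echo "$SUPPORT_TID" > /sys/fs/cgroup/cpu/vm0/support/tasks

Because group scheduling is work-conserving, a single busy support thread can take the entire 10% when it is the only one active, while several active support threads share that same 10% between them, which is the behavior the example asks for and which per-thread nice values cannot express.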
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 21:10 ` Paul Turner @ 2015-08-24 21:17 ` Tejun Heo 2015-08-24 21:19 ` Paul Turner 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-24 21:17 UTC (permalink / raw) To: Paul Turner Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, On Mon, Aug 24, 2015 at 02:10:17PM -0700, Paul Turner wrote: > Suppose that we have 10 vcpu threads and 100 support threads. > Suppose that we want the support threads to receive up to 10% of the > time available to the VM as a whole on that machine. > > If I have one particular support thread that is busy, I want it to > receive that entire 10% (maybe a guest is pounding on scsi for > example, or in the thread-pool case, I've passed a single expensive > computation). Conversely, suppose the guest is doing lots of > different things and several support threads are active, I want the > time to be shared between them. > > There is no way to implement this with nice. Either a single thread > can consume 10%, and the group can dominate, or the group cannot > dominate and the single thread can be starved. Would it be possible for you to give realistic and concrete examples? I'm not trying to play down the use cases but concrete examples are usually helpful at putting things in perspective. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 21:17 ` Tejun Heo @ 2015-08-24 21:19 ` Paul Turner 2015-08-24 21:40 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-24 21:19 UTC (permalink / raw) To: Tejun Heo Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 2:17 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Mon, Aug 24, 2015 at 02:10:17PM -0700, Paul Turner wrote: >> Suppose that we have 10 vcpu threads and 100 support threads. >> Suppose that we want the support threads to receive up to 10% of the >> time available to the VM as a whole on that machine. >> >> If I have one particular support thread that is busy, I want it to >> receive that entire 10% (maybe a guest is pounding on scsi for >> example, or in the thread-pool case, I've passed a single expensive >> computation). Conversely, suppose the guest is doing lots of >> different things and several support threads are active, I want the >> time to be shared between them. >> >> There is no way to implement this with nice. Either a single thread >> can consume 10%, and the group can dominate, or the group cannot >> dominate and the single thread can be starved. > > Would it be possible for you to give realistic and concrete examples? > I'm not trying to play down the use cases but concrete examples are > usually helpful at putting things in perspective. I don't think there's anything that's not realistic or concrete about the example above. The "suppose" parts were only for qualifying the pool sizes for vcpu and non-vcpu threads above since discussion of implementation using nice is dependent on knowing these counts. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 21:19 ` Paul Turner @ 2015-08-24 21:40 ` Tejun Heo 2015-08-24 22:03 ` Paul Turner 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-24 21:40 UTC (permalink / raw) To: Paul Turner Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 02:19:29PM -0700, Paul Turner wrote: > > Would it be possible for you to give realistic and concrete examples? > > I'm not trying to play down the use cases but concrete examples are > > usually helpful at putting things in perspective. > > I don't think there's anything that's not realistic or concrete about > the example above. The "suppose" parts were only for qualifying the > pool sizes for vcpu and non-vcpu threads above since discussion of > implementation using nice is dependent on knowing these counts. Hmm... I was hoping for actual configurations and usage scenarios. Preferably something people can set up and play with. I take it that the CPU intensive helper threads are usually IO workers? Is the scenario where the VM is set up with a lot of IO devices and different ones may consume a large amount of CPU cycles at any given point? Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 21:40 ` Tejun Heo @ 2015-08-24 22:03 ` Paul Turner 2015-08-24 22:49 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-24 22:03 UTC (permalink / raw) To: Tejun Heo Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 2:40 PM, Tejun Heo <tj@kernel.org> wrote: > On Mon, Aug 24, 2015 at 02:19:29PM -0700, Paul Turner wrote: >> > Would it be possible for you to give realistic and concrete examples? >> > I'm not trying to play down the use cases but concrete examples are >> > usually helpful at putting things in perspective. >> >> I don't think there's anything that's not realistic or concrete about >> the example above. The "suppose" parts were only for qualifying the >> pool sizes for vcpu and non-vcpu threads above since discussion of >> implementation using nice is dependent on knowing these counts. > > Hmm... I was hoping for an actual configurations and usage scenarios. > Preferably something people can set up and play with. This is much easier to set up and play with synthetically. Just create the 10 threads and 100 threads above then experiment with configurations designed at guaranteeing the set of 100 threads relatively uniform throughput regardless of how many are active. I don't think trying to run a VM stack adds anything except complexity of reproduction here. > I take that the > CPU intensive helper threads are usually IO workers? Is the scenario > where the VM is set up with a lot of IO devices and different ones may > consume large amount of CPU cycles at any given point? Yes, generally speaking there are a few major classes of IO (flash, disk, network) that a guest may invoke. Each of these backends is separate and chooses its own threading. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 22:03 ` Paul Turner @ 2015-08-24 22:49 ` Tejun Heo 2015-08-24 23:15 ` Paul Turner 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-24 22:49 UTC (permalink / raw) To: Paul Turner Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote: > > Hmm... I was hoping for an actual configurations and usage scenarios. > > Preferably something people can set up and play with. > > This is much easier to set up and play with synthetically. Just > create the 10 threads and 100 threads above then experiment with > configurations designed at guaranteeing the set of 100 threads > relatively uniform throughput regardless of how many are active. I > don't think trying to run a VM stack adds anything except complexity > of reproduction here. Well, but that loses most of the details and why such use cases matter to begin with. We can imagine up stuff to induce an arbitrary set of requirements. > > I take that the > > CPU intensive helper threads are usually IO workers? Is the scenario > > where the VM is set up with a lot of IO devices and different ones may > > consume large amount of CPU cycles at any given point? > > Yes, generally speaking there are a few major classes of IO (flash, > disk, network) that a guest may invoke. Each of these backends is > separate and chooses its own threading. Hmmm... if that's the case, would limiting iops on those IO devices (or classes of them) work? qemu already implements an IO limit mechanism after all. Anyways, a point here is that threads of the same process competing isn't a new problem. There are many ways to make those threads play nice as the application itself often has to be involved anyway, especially for something like qemu which is heavily involved in provisioning resources. cgroups can be a nice brute-force add-on which lets sysadmins do wild things but it's inherently hacky and incomplete for coordinating threads. For example, what is it gonna do if qemu cloned vcpus and IO helpers dynamically off of the same parent thread? It requires the application's cooperation anyway but at the same time is painful for those applications to actually interact with. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 22:49 ` Tejun Heo @ 2015-08-24 23:15 ` Paul Turner 2015-08-25 2:36 ` Kamezawa Hiroyuki ` (2 more replies) 0 siblings, 3 replies; 92+ messages in thread From: Paul Turner @ 2015-08-24 23:15 UTC (permalink / raw) To: Tejun Heo Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 3:49 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote: >> > Hmm... I was hoping for an actual configurations and usage scenarios. >> > Preferably something people can set up and play with. >> >> This is much easier to set up and play with synthetically. Just >> create the 10 threads and 100 threads above then experiment with >> configurations designed at guaranteeing the set of 100 threads >> relatively uniform throughput regardless of how many are active. I >> don't think trying to run a VM stack adds anything except complexity >> of reproduction here. > > Well, but that loses most of details and why such use cases matter to > begin with. We can imagine up stuff to induce arbitrary set of > requirements. All that's being proved or disproved here is that it's difficult to coordinate the consumption of asymmetric thread pools using nice. The constraints here were drawn from a real-world example. > >> > I take that the >> > CPU intensive helper threads are usually IO workers? Is the scenario >> > where the VM is set up with a lot of IO devices and different ones may >> > consume large amount of CPU cycles at any given point? >> >> Yes, generally speaking there are a few major classes of IO (flash, >> disk, network) that a guest may invoke. Each of these backends is >> separate and chooses its own threading. > > Hmmm... if that's the case, would limiting iops on those IO devices > (or classes of them) work? qemu already implements IO limit mechanism > after all. No. 1) They should proceed at the maximum rate that they can that's still within their provisioning budget. 2) The cost/IO is both inconsistent and changes over time. Attempting to micro-optimize every backend for this is infeasible, this is exactly the type of problem that the scheduler can usefully help arbitrate. 3) Even pretending (2) is fixable, dynamically dividing these right-to-work tokens between different I/O device backends is extremely complex. > > Anyways, a point here is that threads of the same process competing > isn't a new problem. There are many ways to make those threads play > nice as the application itself often has to be involved anyway, > especially for something like qemu which is heavily involved in > provisioning resources. It's certainly not a new problem, but it's a real one, and it's _hard_. You're proposing removing the best known solution. > > cgroups can be a nice brute-force add-on which lets sysadmins do wild > things but it's inherently hacky and incomplete for coordinating > threads. For example, what is it gonna do if qemu cloned vcpus and IO > helpers dynamically off of the same parent thread? We're talking about sub-process usage here. This is the application coordinating itself, NOT the sysadmin. Processes are becoming larger and larger, we need many of the same controls within them that we have between them. > It requires > application's cooperation anyway but at the same time is painful to > actually interact from those applications. 
As discussed elsewhere on thread this is really not a problem if you define consistent rules with respect to which parts are managed by who. The argument of potential interference is no different to messing with an application's on-disk configuration behind its back. Alternate strawmen which greatly improve this from where we are today have also been proposed. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 23:15 ` Paul Turner @ 2015-08-25 2:36 ` Kamezawa Hiroyuki 2015-08-25 21:13 ` Tejun Heo 2015-08-25 9:24 ` Ingo Molnar 2015-08-25 19:18 ` Tejun Heo 2 siblings, 1 reply; 92+ messages in thread From: Kamezawa Hiroyuki @ 2015-08-25 2:36 UTC (permalink / raw) To: Paul Turner, Tejun Heo Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On 2015/08/25 8:15, Paul Turner wrote: > On Mon, Aug 24, 2015 at 3:49 PM, Tejun Heo <tj@kernel.org> wrote: >> Hello, >> >> On Mon, Aug 24, 2015 at 03:03:05PM -0700, Paul Turner wrote: >>>> Hmm... I was hoping for an actual configurations and usage scenarios. >>>> Preferably something people can set up and play with. >>> >>> This is much easier to set up and play with synthetically. Just >>> create the 10 threads and 100 threads above then experiment with >>> configurations designed at guaranteeing the set of 100 threads >>> relatively uniform throughput regardless of how many are active. I >>> don't think trying to run a VM stack adds anything except complexity >>> of reproduction here. >> >> Well, but that loses most of details and why such use cases matter to >> begin with. We can imagine up stuff to induce arbitrary set of >> requirements. > > All that's being proved or disproved here is that it's difficult to > coordinate the consumption of asymmetric thread pools using nice. The > constraints here were drawn from a real-world example. > >> >>>> I take that the >>>> CPU intensive helper threads are usually IO workers? Is the scenario >>>> where the VM is set up with a lot of IO devices and different ones may >>>> consume large amount of CPU cycles at any given point? >>> >>> Yes, generally speaking there are a few major classes of IO (flash, >>> disk, network) that a guest may invoke. Each of these backends is >>> separate and chooses its own threading. >> >> Hmmm... if that's the case, would limiting iops on those IO devices >> (or classes of them) work? qemu already implements IO limit mechanism >> after all. > > No. > > 1) They should proceed at the maximum rate that they can that's still > within their provisioning budget. > 2) The cost/IO is both inconsistent and changes over time. Attempting > to micro-optimize every backend for this is infeasible, this is > exactly the type of problem that the scheduler can usefully help > arbitrate. > 3) Even pretending (2) is fixable, dynamically dividing these > right-to-work tokens between different I/O device backends is > extremely complex. > I think I should explain my customer's use case of qemu + cpuset/cpu (via libvirt): (1) Isolating hypervisor threads. As already discussed, hypervisor threads are isolated by cpuset. But their purpose is to avoid _latency_ spikes caused by hypervisor behavior. So, "nice" cannot be the solution, as already discussed. (2) Fixed-rate vcpu service. Using the cpu controller's quota/period feature, my customer creates vcpu models like Low (1GHz), Mid (2GHz), and High (3GHz) for their IaaS system. To do this, each vcpu should be quota-limited independently, with per-thread cpu control. In particular, method (1) is used by several enterprise customers to stabilize their systems. Sub-process control should be provided in some way. Thanks, -Kame >> >> Anyways, a point here is that threads of the same process competing >> isn't a new problem.
There are many ways to make those threads play >> nice as the application itself often has to be involved anyway, >> especially for something like qemu which is heavily involved in >> provisioning resources. > > It's certainly not a new problem, but it's a real one, and it's > _hard_. You're proposing removing the best known solution. > >> >> cgroups can be a nice brute-force add-on which lets sysadmins do wild >> things but it's inherently hacky and incomplete for coordinating >> threads. For example, what is it gonna do if qemu cloned vcpus and IO >> helpers dynamically off of the same parent thread? > > We're talking about sub-process usage here. This is the application > coordinating itself, NOT the sysadmin. Processes are becoming larger > and larger, we need many of the same controls within them that we have > between them. > >> It requires >> application's cooperation anyway but at the same time is painful to >> actually interact from those applications. > > As discussed elsewhere on thread this is really not a problem if you > define consistent rules with respect to which parts are managed by > who. The argument of potential interference is no different to > messing with an application's on-disk configuration behind its back. > Alternate strawmen which greatly improve this from where we are today > have also been proposed. > >> >> Thanks. >> >> -- >> tejun > ^ permalink raw reply [flat|nested] 92+ messages in thread
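For Kame's use case (2), a minimal sketch of per-vcpu rate capping with the existing v1 interface; the paths, the period/quota values and the TID variable are illustrative only:

  # one group per vCPU thread, each capped independently;
  # 66ms of runtime per 100ms period approximates a "2GHz" model on a 3GHz host
  mkdir -p /sys/fs/cgroup/cpu/vm0/vcpu0
  echo 100000 > /sys/fs/cgroup/cpu/vm0/vcpu0/cpu.cfs_period_us
  echo 66000  > /sys/fs/cgroup/cpu/vm0/vcpu0/cpu.cfs_quota_us
  echo "$VCPU0_TID" > /sys/fs/cgroup/cpu/vm0/vcpu0/tasks

Both of Kame's cases rely on placing individual threads (via the v1 "tasks" file) rather than whole processes, which is exactly the per-thread granularity being debated.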
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-25 2:36 ` Kamezawa Hiroyuki @ 2015-08-25 21:13 ` Tejun Heo 0 siblings, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-25 21:13 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Paul Turner, Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Kame. On Tue, Aug 25, 2015 at 11:36:25AM +0900, Kamezawa Hiroyuki wrote: > I think I should explain my customer's use case of qemu + cpuset/cpu (via libvirt) > > (1) Isolating hypervisor thread. > As already discussed, hypervisor threads are isolated by cpuset. But their purpose > is to avoid _latency_ spike caused by hypervisor behavior. So, "nice" cannot be solution > as already discussed. > > (2) Fixed rate vcpu service. > With using cpu controller's quota/period feature, my customer creates vcpu models like > Low(1GHz), Mid(2GHz), High(3GHz) for IaaS system. > > To do this, each vcpus should be quota-limited independently, with per-thread cpu control. > > Especially, the method (1) is used in several enterprise customers for stabilizing their system. > > Sub-process control should be provided by some way. Can you please take a look at the proposal on my reply to Paul's email? AFAICS, both of above cases should be fine with that. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 23:15 ` Paul Turner 2015-08-25 2:36 ` Kamezawa Hiroyuki @ 2015-08-25 9:24 ` Ingo Molnar 2015-08-25 10:00 ` Peter Zijlstra 2015-08-25 19:18 ` Tejun Heo 2 siblings, 1 reply; 92+ messages in thread From: Ingo Molnar @ 2015-08-25 9:24 UTC (permalink / raw) To: Paul Turner, Tejun Heo Cc: Tejun Heo, Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton * Paul Turner <pjt@google.com> wrote: > > Anyways, a point here is that threads of the same process competing > > isn't a new problem. There are many ways to make those threads play > > nice as the application itself often has to be involved anyway, > > especially for something like qemu which is heavily involved in > > provisioning resources. > > It's certainly not a new problem, but it's a real one, and it's > _hard_. You're proposing removing the best known solution. Also, just to make sure this is resolved properly, I'm NAK-ing the current scheduler bits in this series: NAKed-by: Ingo Molnar <mingo@kernel.org> until all of pjt's API design concerns are resolved. This is conceptual, it is not a 'we can fix it later' detail. Tejun, please keep me Cc:-ed to future versions of this series so that I can lift the NAK if things get resolved. Thanks, Ingo ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-25 9:24 ` Ingo Molnar @ 2015-08-25 10:00 ` Peter Zijlstra 0 siblings, 0 replies; 92+ messages in thread From: Peter Zijlstra @ 2015-08-25 10:00 UTC (permalink / raw) To: Ingo Molnar Cc: Paul Turner, Tejun Heo, Austin S Hemmelgarn, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Tue, Aug 25, 2015 at 11:24:42AM +0200, Ingo Molnar wrote: > > * Paul Turner <pjt@google.com> wrote: > > > > Anyways, a point here is that threads of the same process competing > > > isn't a new problem. There are many ways to make those threads play > > > nice as the application itself often has to be involved anyway, > > > especially for something like qemu which is heavily involved in > > > provisioning resources. > > > > It's certainly not a new problem, but it's a real one, and it's > > _hard_. You're proposing removing the best known solution. > > Also, just to make sure this is resolved properly, I'm NAK-ing the current > scheduler bits in this series: > > NAKed-by: Ingo Molnar <mingo@kernel.org> > > until all of pjt's API design concerns are resolved. This is conceptual, it is not > a 'we can fix it later' detail. > > Tejun, please keep me Cc:-ed to future versions of this series so that I can lift > the NAK if things get resolved. You can add: NAKed-by: Peter Zijlstra <peterz@infradead.org> to that. There have been at least 3 different groups of people: - Mike, representing Suse customers - Kamezawa-san, representing Fujitsu customers - Paul, representing Google that claim per-thread control groups are in use and important. Any replacement _must_ provide for this use case up front; it's not something that can be cobbled on later. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 23:15 ` Paul Turner 2015-08-25 2:36 ` Kamezawa Hiroyuki 2015-08-25 9:24 ` Ingo Molnar @ 2015-08-25 19:18 ` Tejun Heo 2 siblings, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-25 19:18 UTC (permalink / raw) To: Paul Turner Cc: Austin S Hemmelgarn, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Paul. On Mon, Aug 24, 2015 at 04:15:59PM -0700, Paul Turner wrote: > > Hmmm... if that's the case, would limiting iops on those IO devices > > (or classes of them) work? qemu already implements IO limit mechanism > > after all. > > No. > > 1) They should proceed at the maximum rate that they can that's still > within their provisioning budget. Ooh, right. > 2) The cost/IO is both inconsistent and changes over time. Attempting > to micro-optimize every backend for this is infeasible, this is > exactly the type of problem that the scheduler can usefully help > arbitrate. > 3) Even pretending (2) is fixable, dynamically dividing these > right-to-work tokens between different I/O device backends is > extremely complex. > > > Anyways, a point here is that threads of the same process competing > > isn't a new problem. There are many ways to make those threads play > > nice as the application itself often has to be involved anyway, > > especially for something like qemu which is heavily involved in > > provisioning resources. > > It's certainly not a new problem, but it's a real one, and it's > _hard_. You're proposing removing the best known solution. Well, I'm trying to figure out whether we actually need it and implement something sane if so. We actually can't do hierarchical resource distribution with existing mechanisms, so if that is something which is beneficial enough, let's go ahead and figure it out. > > cgroups can be a nice brute-force add-on which lets sysadmins do wild > > things but it's inherently hacky and incomplete for coordinating > > threads. For example, what is it gonna do if qemu cloned vcpus and IO > > helpers dynamically off of the same parent thread? > > We're talking about sub-process usage here. This is the application > coordinating itself, NOT the sysadmin. Processes are becoming larger > and larger, we need many of the same controls within them that we have > between them. > > > It requires > > application's cooperation anyway but at the same time is painful to > > actually interact from those applications. > > As discussed elsewhere on thread this is really not a problem if you > define consistent rules with respect to which parts are managed by > who. The argument of potential interference is no different to > messing with an application's on-disk configuration behind its back. > Alternate strawmen which greatly improve this from where we are today > have also been proposed. Let's continue in the other sub-thread but it's not just system management and applications not stepping on each other's toes although even just that is extremely painful with the current interface. cgroup membership is inherently tied to process tree no matter who's managing it which requires coordination from the application side for sub-process management and at that point it's really matter of putting one and one together. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-22 18:29 ` Tejun Heo 2015-08-24 15:47 ` Austin S Hemmelgarn @ 2015-08-24 20:52 ` Paul Turner 2015-08-24 21:36 ` Tejun Heo 1 sibling, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-24 20:52 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Sat, Aug 22, 2015 at 11:29 AM, Tejun Heo <tj@kernel.org> wrote: > Hello, Paul. > > On Fri, Aug 21, 2015 at 12:26:30PM -0700, Paul Turner wrote: > ... >> A very concrete example of the above is a virtual machine in which you >> want to guarantee scheduling for the vCPU threads which must schedule >> beside many hypervisor support threads. A hierarchy is the only way >> to fix the ratio at which these compete. > > Just to learn more, what sort of hypervisor support threads are we > talking about? They would have to consume considerable amount of cpu > cycles for problems like this to be relevant and be dynamic in numbers > in a way which letting them competing against vcpus makes sense. Do > IO helpers meet these criteria? I'm not sure what you mean by an IO helper. By support threads I mean any threads that are used in the hypervisor implementation that are not hosting a vCPU. > >> An example that's not the cpu controller is that we use cpusets to >> expose to applications their "shared" and "private" cores. (These >> sets are dynamic based on what is coscheduled on a given machine.) > > Can you please go into more details with these? We typically share our machines between many jobs, these jobs can have cores that are "private" (and not shared with other jobs) and cores that are "shared" (general purpose cores accessible to all jobs on the same machine). The pool of cpus in the "shared" pool is dynamic as jobs entering and leaving the machine take or release their associated "private" cores. By creating the appropriate sub-containers within the cpuset group we allow jobs to pin specific threads to run on their (typically) private cores. This also allows the management daemons additional flexibility as it's possible to update which cores we place as private, without synchronization with the application. Note that sched_setaffinity() is a non-starter here. > >> > Why would you assume that threads of a process wouldn't want to >> > configure it ever? How is this different from CPU affinity? >> >> In general cache and CPU behave differently. Generally for it to make >> sense between threads in a process they would have to have wholly >> disjoint memory, at which point the only sane long-term implementation >> is separate processes and the management moves up a level anyway. >> >> That said, there are surely cases in which it might be convenient to >> use at a per-thread level to correct a specific performance anomaly. >> But at that point, you have certainly reached the level of hammer that >> you can coordinate with an external daemon if necessary. > > So, I'm not super familiar with all the use cases but the whole cache > allocation thing is almost by nature a specific niche thing and I feel > pretty reluctant to blow off per-thread usages as too niche to worry > about. Let me try to restate: I think that we can specify the usage is specifically niche that it will *typically* be used by higher level management daemons which prefer a more technical and specific interface. 
This does not preclude use by threads, it just makes it less convenient; I think that we should be optimizing for flexibility over ease-of-use for a very small number of cases here. > >> > I don't follow what you're trying to way with the above paragraph. >> > Are you still talking about CAT? If so, that use case isn't the only >> > one. I'm pretty sure there are people who would want to configure >> > cache allocation at thread level. >> >> I'm not agreeing with you that "in cgroups" means "must be usable by >> applications within that hierarchy". A cgroup subsystem used as a >> partitioning API only by system management daemons is entirely >> reasonable. CAT is a reasonable example of this. > > I see. The same argument. I don't think CAT just being system > management thing makes sense. > >> > So, this is a trade-off we're consciously making. If there are >> > common-enough use cases which require jumping across different cgroup >> > domains, we'll try to figure out a way to accomodate those but by >> > default migration is a very cold and expensive path. >> >> The core here was the need for allowing sub-process migration. I'm >> not sure I follow the performance trade-off argument; haven't we >> historically seen the opposite? That migration has been a slow-path >> without optimizations and people pushing to make it faster? This >> seems a hard generalization to make for something that's inherently >> tied to a particular controller. > > It isn't something tied to a particular controller. Some controllers > may get impacted less by than others but there's an inherent > connection between how dynamic an association is and how expensive the > locking around it needs to be and we need to set up basic behavior and > usage conventions so that different controllers are designed and > implemented assuming similar usage patterns; otherwise, we end up with > the chaotic shit show that we have had where everything behaves > differently and nobody knows what's the right way to do things and we > end up locked into weird requirements which some controller induced > for no good reason but cause significant pain on use cases which > actually matter. > >> I don't care if we try turning that dial back to assume it's a cold >> path once more, only that it's supported. > > It has always been a cold path and I'm not saying this is gonna be > noticeably worse in the future but usages like bouncing threads on > request-by-request basis are and will be clearly worse than bouncing > to threads which are already in the target domain. > >> >> > A forwarding /proc/thread_self/cgroup accessor, or similar, would be another >> >> > way to address some of these issues. >> > >> > That sounds horrible to me. What if the process wants to do RMW a >> > config? >> >> Locking within a process is easy. > > It's not contained in the process at all. What if an external entity > decides to migrate the process into another cgroup inbetween? > If we have 'atomic' moves and a way to access our sub-containers from the process in a consistent fashion (e.g. relative paths) then this is not an issue. >> > What if the permissions are different after an intervening >> > migration? >> >> This is a side-effect of migration not being properly supported. >> >> > What if the sub-hierarchy no longer exists or has been >> > replaced by a hierarchy with the same topology but actualy is a >> > different one? >> >> The easy answer is that: Only a process should be managing its >> sub-hierarchy. That's the nice thing about hierarchies. 
> > cgroupfs is a horrible place to implement that part of interface. It > doesn't make any sense to combine those two into the same hierarchy. > You're agreeing to the identified problem but somehow still suggesting > doing what we've been doing when the root cause of the said problem is > conflating and interlocking these two separate things. I am not endorsing the world we are in today, only describing how it can be somewhat sanely managed. Some of these lessons could be formalized in imagining the world of tomorrow. E.g. the sub-process mounts could appear within some (non-movable) alternate file-system path. > >> The harder answer is: How do we handle non-fungible resources such as >> CPU assignments within a hierarchy? This is a big part of why I make >> arguments for certain partitions being management-software only above. >> This is imperfect, but better then where we stand today. > > I'm not following. Why is that different? This is generally any time a change in the external-to-application's cgroup-parent requires changes in the sub-hierarchy. This is most visible with a resource such as a cpu which is uniquely identified, but similarly applies to any limits. > >> > Let's build an API which actually looks and behaves like an API which >> > is properly isolated from what external agents may do to the process. >> > I can't see how that would be "back to where we are today". All of >> > those are pretty critical attributes for a public kernel API and >> > utterly broken right now. >> >> Sure, but I don't think you can throw out per-thread control for all >> controllers to enable this. Which makes everything else harder. A >> intermediary step in unification might be that we move from N mounts >> to 2. Those that can be managed at the process level, and those that >> can't. It's a compromise, but may allow cleaner abstractions for the >> former case. > > The transition can already be gradual. Why would you add yet another > transition step? Because what's being proposed today does not offer any replacement for the sub-process control that we depend on today? Why would we embark on merging the new interface before these details are sufficiently resolved? > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
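The private/shared core arrangement Paul describes maps onto the existing v1 cpuset interface roughly as follows. This is only a sketch: the mount point, group names and CPU numbers are assumptions, and a real management daemon would do far more error handling.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
                perror(path);
                exit(1);
        }
}

int main(void)
{
        char buf[32];

        /* sub-container advertising the job's private cores */
        mkdir("/sys/fs/cgroup/cpuset/job0/private", 0755);

        /* cpus and mems must both be populated before tasks can attach */
        write_str("/sys/fs/cgroup/cpuset/job0/private/cpuset.cpus", "2-3");
        write_str("/sys/fs/cgroup/cpuset/job0/private/cpuset.mems", "0");

        /* writing a TID to "tasks" moves just the calling thread */
        snprintf(buf, sizeof(buf), "%ld", (long)syscall(SYS_gettid));
        write_str("/sys/fs/cgroup/cpuset/job0/private/tasks", buf);
        return 0;
}

The management daemon can later rewrite cpuset.cpus of the parent and its sub-groups as jobs enter and leave the machine, which is the "without synchronization with the application" property mentioned above.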
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 20:52 ` Paul Turner @ 2015-08-24 21:36 ` Tejun Heo 2015-08-24 21:58 ` Paul Turner 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-24 21:36 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Paul. On Mon, Aug 24, 2015 at 01:52:01PM -0700, Paul Turner wrote: > We typically share our machines between many jobs, these jobs can have > cores that are "private" (and not shared with other jobs) and cores > that are "shared" (general purpose cores accessible to all jobs on the > same machine). > > The pool of cpus in the "shared" pool is dynamic as jobs entering and > leaving the machine take or release their associated "private" cores. > > By creating the appropriate sub-containers within the cpuset group we > allow jobs to pin specific threads to run on their (typically) private > cores. This also allows the management daemons additional flexibility > as it's possible to update which cores we place as private, without > synchronization with the application. Note that sched_setaffinity() > is a non-starter here. Why isn't it? Because the programs themselves might try to override it? > Let me try to restate: > I think that we can specify the usage is specifically niche that it > will *typically* be used by higher level management daemons which I really don't think that's the case. > prefer a more technical and specific interface. This does not > preclude use by threads, it just makes it less convenient; I think > that we should be optimizing for flexibility over ease-of-use for a > very small number of cases here. It's more like there are two niche sets of use cases. If a programmable interface or cgroups has to be picked as an exclusive alternative, it's pretty clear that programmable interface is the way to go. > > It's not contained in the process at all. What if an external entity > > decides to migrate the process into another cgroup inbetween? > > > > If we have 'atomic' moves and a way to access our sub-containers from > the process in a consistent fashion (e.g. relative paths) then this is > not an issue. But it gets so twisted. Relative paths aren't enough. It actually has to proxy accesses to already open files. At that point, why would we even keep it as a file-system based interface? > I am not endorsing the world we are in today, only describing how it > can be somewhat sanely managed. Some of these lessons could be > formalized in imagining the world of tomorrow. E.g. the sub-process > mounts could appear within some (non-movable) alternate file-system > path. Ditto. Wouldn't it be better to implement something which resemables conventional programming interface rather than contorting the filesystem semantics? > >> The harder answer is: How do we handle non-fungible resources such as > >> CPU assignments within a hierarchy? This is a big part of why I make > >> arguments for certain partitions being management-software only above. > >> This is imperfect, but better then where we stand today. > > > > I'm not following. Why is that different? > > This is generally any time a change in the external-to-application's > cgroup-parent requires changes in the sub-hierarchy. This is most > visible with a resource such as a cpu which is uniquely identified, > but similarly applies to any limits. So, except for cpuset, this doesn't matter for controllers. 
All limits are hierarchical and that's it. For cpuset, it's tricky because a nested cgroup might end up with no intersecting execution resource. The kernel can't have threads which don't have any execution resources and the solution has been assuming the resources from higher-ups till there's some. Application control has always behaved the same way. If the configured affinity becomes empty, the scheduler ignored it. > > The transition can already be gradual. Why would you add yet another > > transition step? > > Because what's being proposed today does not offer any replacement for > the sub-process control that we depend on today? Why would we embark > on merging the new interface before these details are sufficiently > resolved? Because the details on this particular issue can be hashed out in the future? There's nothing permanently blocking any direction that we might choose in the future and what's working today will keep working. Why block the whole thing which can be useful for the majority of use cases for this particular corner case? Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
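For reference, the thread-granular programmable interface being weighed against cpuset in this exchange is sched_setaffinity(2). A minimal sketch (the CPU number is arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(2, &set);

        /* pid 0 means the calling thread, not the whole process */
        if (sched_setaffinity(0, sizeof(set), &set))
                perror("sched_setaffinity");
        return 0;
}

As the rest of the thread discusses, a subsequent cpuset.cpus update on the containing cpuset overwrites a mask set this way, which is the interaction Paul objects to below.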
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 21:36 ` Tejun Heo @ 2015-08-24 21:58 ` Paul Turner 2015-08-24 22:19 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-24 21:58 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 2:36 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, Paul. > > On Mon, Aug 24, 2015 at 01:52:01PM -0700, Paul Turner wrote: >> We typically share our machines between many jobs, these jobs can have >> cores that are "private" (and not shared with other jobs) and cores >> that are "shared" (general purpose cores accessible to all jobs on the >> same machine). >> >> The pool of cpus in the "shared" pool is dynamic as jobs entering and >> leaving the machine take or release their associated "private" cores. >> >> By creating the appropriate sub-containers within the cpuset group we >> allow jobs to pin specific threads to run on their (typically) private >> cores. This also allows the management daemons additional flexibility >> as it's possible to update which cores we place as private, without >> synchronization with the application. Note that sched_setaffinity() >> is a non-starter here. > > Why isn't it? Because the programs themselves might try to override > it? The major reasons are: 1) Isolation. Doing everything with sched_setaffinity means that programs can use arbitrary resources if they desire. 1a) These restrictions need to also apply to threads created by library code. Which may be 3rd party. 2) Interaction between cpusets and sched_setaffinity. For necessary reasons, a cpuset update always overwrites all extant sched_setaffinity values. ...And we need some cpusets for (1)....And we need periodic updates for access to shared cores. 3) Virtualization of CPU ids. (Multiple applications all binding to core 1 is a bad thing.) > >> Let me try to restate: >> I think that we can specify the usage is specifically niche that it >> will *typically* be used by higher level management daemons which > > I really don't think that's the case. > Can you provide examples of non-exceptional usage in this fashion? >> prefer a more technical and specific interface. This does not >> preclude use by threads, it just makes it less convenient; I think >> that we should be optimizing for flexibility over ease-of-use for a >> very small number of cases here. > > It's more like there are two niche sets of use cases. If a > programmable interface or cgroups has to be picked as an exclusive > alternative, it's pretty clear that programmable interface is the way > to go. I strongly disagree here: The *major obvious use* is partitioning of a system, which must act on groups of processes. Cgroups is the only interface we have which satisfies this today. > >> > It's not contained in the process at all. What if an external entity >> > decides to migrate the process into another cgroup inbetween? >> > >> >> If we have 'atomic' moves and a way to access our sub-containers from >> the process in a consistent fashion (e.g. relative paths) then this is >> not an issue. > > But it gets so twisted. Relative paths aren't enough. It actually > has to proxy accesses to already open files. At that point, why would > we even keep it as a file-system based interface? Well no, this can just be reversed and we can have the relative paths be the actual files which the hierarchy points back at. 
Ultimately, they could potentially not even be exposed in the regular hierarchy. At this point we could not expose anything that does not support sub-process splits within processes' hierarchy and we're at a more reasonable state of affairs. There is real value in being able to duplicate interface between process and sub-process level control. > >> I am not endorsing the world we are in today, only describing how it >> can be somewhat sanely managed. Some of these lessons could be >> formalized in imagining the world of tomorrow. E.g. the sub-process >> mounts could appear within some (non-movable) alternate file-system >> path. > > Ditto. Wouldn't it be better to implement something which resemables > conventional programming interface rather than contorting the > filesystem semantics? > Maybe? This is a trade-off, some of which is built on the assumptions we're now debating. There is also value, cost-wise, in iterative improvement of what we have today rather than trying to nuke it from orbit. I do not know which of these is the right choice, it likely depends strongly on where we end up for sub-process interfaces. If we do support those I'm not sure it makes sense for them to have an entirely different API from process-level coordination, at which point the file-system overload is a trade-off rather than a cost. >> >> The harder answer is: How do we handle non-fungible resources such as >> >> CPU assignments within a hierarchy? This is a big part of why I make >> >> arguments for certain partitions being management-software only above. >> >> This is imperfect, but better then where we stand today. >> > >> > I'm not following. Why is that different? >> >> This is generally any time a change in the external-to-application's >> cgroup-parent requires changes in the sub-hierarchy. This is most >> visible with a resource such as a cpu which is uniquely identified, >> but similarly applies to any limits. > > So, except for cpuset, this doesn't matter for controllers. All > limits are hierarchical and that's it. Well no, it still matters because I might want to lower the limit below what children have set. > For cpuset, it's tricky > because a nested cgroup might end up with no intersecting execution > resource. The kernel can't have threads which don't have any > execution resources and the solution has been assuming the resources > from higher-ups till there's some. Application control has always > behaved the same way. If the configured affinity becomes empty, the > scheduler ignored it. Actually no, any configuration change that would result in this state is rejected. It's not possible to configure an empty cpuset once tasks are in it, or attach tasks to an empty set. It's also not possible to create this state using setaffinity, these restrictions are always over-ridden by updates, even if they do not need to be. > >> > The transition can already be gradual. Why would you add yet another >> > transition step? >> >> Because what's being proposed today does not offer any replacement for >> the sub-process control that we depend on today? Why would we embark >> on merging the new interface before these details are sufficiently >> resolved? > > Because the details on this particular issue can be hashed out in the > future? There's nothing permanently blocking any direction that we > might choose in the future and what's working today will keep working. > Why block the whole thing which can be useful for the majority of use > cases for this particular corner case? 
Because I do not think sub-process hierarchies are the corner case that you're making them out to be for these controllers and that has real implications for the ultimate direction of this interface. Also. If we are making disruptive changes here, I would want to discuss merging cpu, cpuset, and cpuacct. What this merge looks like depends on the above. ^ permalink raw reply [flat|nested] 92+ messages in thread
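One concrete reading of the "relative paths" idea: a thread can locate its own position in the hierarchy through /proc/self/cgroup and operate on sub-groups relative to that, leaving absolute placement to the external agent. A rough sketch, assuming cpuset is mounted on its own v1 hierarchy at the usual place; note it does nothing about the race with a concurrent migration that Tejun raises:

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
        char line[512], dir[512], *p;
        FILE *f = fopen("/proc/self/cgroup", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                p = strstr(line, ":cpuset:");
                if (!p)
                        continue;
                line[strcspn(line, "\n")] = '\0';
                /* create a sub-group relative to wherever we currently sit */
                snprintf(dir, sizeof(dir), "/sys/fs/cgroup/cpuset%s/vcpus",
                         p + strlen(":cpuset:"));
                mkdir(dir, 0755);
                break;
        }
        fclose(f);
        return 0;
}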
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 21:58 ` Paul Turner @ 2015-08-24 22:19 ` Tejun Heo 2015-08-24 23:06 ` Paul Turner 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-08-24 22:19 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hey, On Mon, Aug 24, 2015 at 02:58:23PM -0700, Paul Turner wrote: > > Why isn't it? Because the programs themselves might try to override > > it? > > The major reasons are: > > 1) Isolation. Doing everything with sched_setaffinity means that > programs can use arbitrary resources if they desire. > 1a) These restrictions need to also apply to threads created by > library code. Which may be 3rd party. > 2) Interaction between cpusets and sched_setaffinity. For necessary > reasons, a cpuset update always overwrites all extant > sched_setaffinity values. ...And we need some cpusets for (1)....And > we need periodic updates for access to shared cores. This is an erratic behavior on cpuset's part tho. Nothing else behaves this way and it's borderline buggy. > 3) Virtualization of CPU ids. (Multiple applications all binding to > core 1 is a bad thing.) This is about who's setting the affinity, right? As long as an agent which knows system details sets it, which mechanism doesn't really matter. > >> Let me try to restate: > >> I think that we can specify the usage is specifically niche that it > >> will *typically* be used by higher level management daemons which > > > > I really don't think that's the case. > > > > Can you provide examples of non-exceptional usage in this fashion? I heard of two use cases. One is sytem-partitioning that you're talking about and the other is preventing threads of the same process from stepping on each other's toes. There was a fancy word for the cacheline cannibalizing behavior which shows up in those scenarios. > > It's more like there are two niche sets of use cases. If a > > programmable interface or cgroups has to be picked as an exclusive > > alternative, it's pretty clear that programmable interface is the way > > to go. > > I strongly disagree here: > The *major obvious use* is partitioning of a system, which must act I don't know. Why is that more major obvious use? This is super duper fringe to begin with. It's like tallying up beans. Sure, some may be taller than others but they're all still beans and I'm not even sure there's a big difference between the two use cases here. > on groups of processes. Cgroups is the only interface we have which > satisfies this today. Well, not really. cgroups is more convenient / better at these things but not the only way to do it. People have been doing isolation to varying degrees with other mechanisms for ages. > > Ditto. Wouldn't it be better to implement something which resemables > > conventional programming interface rather than contorting the > > filesystem semantics? > > Maybe? This is a trade-off, some of which is built on the assumptions > we're now debating. > > There is also value, cost-wise, in iterative improvement of what we > have today rather than trying to nuke it from orbit. I do not know > which of these is the right choice, it likely depends strongly on > where we end up for sub-process interfaces. If we do support those > I'm not sure it makes sense for them to have an entirely different API > from process-level coordination, at which point the file-system > overload is a trade-off rather than a cost. 
Yeah, I understand the similarity part but don't buy that the benefit there is big enough to introduce a kernel API which is expected to be used by individual programs which is radically different from how processes / threads are organized and applications interact with the kernel. These are a lot more grave issues and if we end up paying some complexity from kernel side internally, so be it. > > So, except for cpuset, this doesn't matter for controllers. All > > limits are hierarchical and that's it. > > Well no, it still matters because I might want to lower the limit > below what children have set. All controllers only get what their ancestors can hand down to them. That's basic hierarchical behavior. > > For cpuset, it's tricky > > because a nested cgroup might end up with no intersecting execution > > resource. The kernel can't have threads which don't have any > > execution resources and the solution has been assuming the resources > > from higher-ups till there's some. Application control has always > > behaved the same way. If the configured affinity becomes empty, the > > scheduler ignored it. > > Actually no, any configuration change that would result in this state > is rejected. > > It's not possible to configure an empty cpuset once tasks are in it, > or attach tasks to an empty set. > It's also not possible to create this state using setaffinity, these > restrictions are always over-ridden by updates, even if they do not > need to be. So, even in traditional hierarchies, this isn't true. You can get to no-resource config through cpu hot-unplug and cpuset currently ejects tasks to the closest ancestor with execution resources. > > Because the details on this particular issue can be hashed out in the > > future? There's nothing permanently blocking any direction that we > > might choose in the future and what's working today will keep working. > > Why block the whole thing which can be useful for the majority of use > > cases for this particular corner case? > > Because I do not think sub-process hierarchies are the corner case > that you're making them out to be for these controllers and that has > real implications for the ultimate direction of this interface. If that's the case and we fail miserably at creating a reasonable programming interface for that, we can always revive thread granularity. This is mostly a policy decision after all. > Also. If we are making disruptive changes here, I would want to > discuss merging cpu, cpuset, and cpuacct. What this merge looks like > depends on the above. So, the proposed patches already merge cpu and cpuacct, at least in appearance. Given the kitchen-sink nature of cpuset, I don't think it makes sense to fuse it with cpu. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 22:19 ` Tejun Heo @ 2015-08-24 23:06 ` Paul Turner 2015-08-25 21:02 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-08-24 23:06 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Mon, Aug 24, 2015 at 3:19 PM, Tejun Heo <tj@kernel.org> wrote: > Hey, > > On Mon, Aug 24, 2015 at 02:58:23PM -0700, Paul Turner wrote: >> > Why isn't it? Because the programs themselves might try to override >> > it? >> >> The major reasons are: >> >> 1) Isolation. Doing everything with sched_setaffinity means that >> programs can use arbitrary resources if they desire. >> 1a) These restrictions need to also apply to threads created by >> library code. Which may be 3rd party. >> 2) Interaction between cpusets and sched_setaffinity. For necessary >> reasons, a cpuset update always overwrites all extant >> sched_setaffinity values. ...And we need some cpusets for (1)....And >> we need periodic updates for access to shared cores. > > This is an erratic behavior on cpuset's part tho. Nothing else > behaves this way and it's borderline buggy. > It's actually the only sane possible interaction here. If you don't overwrite the masks you can no longer manage cpusets with a multi-threaded application. If you partially overwrite the masks you can create a host of inconsistent behaviors where an application suddenly loses parallelism. The *only* consistent way is to clobber all masks uniformly. Then either arrange for some notification to the application to re-sync, or use sub-sub-containers within the cpuset hierarchy to advertise finer-partitions. (Generally speaking, there is no real way to mate these APIs and part of the reason we use sub-containers here. What's being proposed will make this worse rather than better.) >> 3) Virtualization of CPU ids. (Multiple applications all binding to >> core 1 is a bad thing.) > > This is about who's setting the affinity, right? As long as an agent > which knows system details sets it, which mechanism doesn't really > matter. Yes, there are other ways to implement this. > >> >> Let me try to restate: >> >> I think that we can specify the usage is specifically niche that it >> >> will *typically* be used by higher level management daemons which >> > >> > I really don't think that's the case. >> > >> >> Can you provide examples of non-exceptional usage in this fashion? > > I heard of two use cases. One is sytem-partitioning that you're > talking about and the other is preventing threads of the same process > from stepping on each other's toes. There was a fancy word for the > cacheline cannibalizing behavior which shows up in those scenarios. So this is a single example right, since the system partitioning case is the one in which it's exclusively used by a higher level management daemon. The case of an process with specifically identified threads in conflict certainly seems exceptional in the level of optimization both in implementation and analysis present. I would expect in this case that either they are comfortable with the more technical API, or they can coordinate with an external controller. Which is much less overloaded both by number of callers and by number of interfaces than it is in the cpuset case. > >> > It's more like there are two niche sets of use cases. 
If a >> > programmable interface or cgroups has to be picked as an exclusive >> > alternative, it's pretty clear that programmable interface is the way >> > to go. >> >> I strongly disagree here: >> The *major obvious use* is partitioning of a system, which must act > > I don't know. Why is that more major obvious use? This is super > duper fringe to begin with. It's like tallying up beans. Sure, some > may be taller than others but they're all still beans and I'm not even > sure there's a big difference between the two use cases here. I don't think the case of having a large compute farm with "unimportant" and "important" work is particularly fringe. Reducing the impact on the "important" work so that we can scavenge more cycles for the latency insensitive "unimportant" is very real. > >> on groups of processes. Cgroups is the only interface we have which >> satisfies this today. > > Well, not really. cgroups is more convenient / better at these things > but not the only way to do it. People have been doing isolation to > varying degrees with other mechanisms for ages. > Right, but it's exactly because of _how bad_ those other mechanisms _are_ that cgroups was originally created. Its growth was not managed well from there, but let's not step away from the fact that this interface was created to solve this problem. >> > Ditto. Wouldn't it be better to implement something which resemables >> > conventional programming interface rather than contorting the >> > filesystem semantics? >> >> Maybe? This is a trade-off, some of which is built on the assumptions >> we're now debating. >> >> There is also value, cost-wise, in iterative improvement of what we >> have today rather than trying to nuke it from orbit. I do not know >> which of these is the right choice, it likely depends strongly on >> where we end up for sub-process interfaces. If we do support those >> I'm not sure it makes sense for them to have an entirely different API >> from process-level coordination, at which point the file-system >> overload is a trade-off rather than a cost. > > Yeah, I understand the similarity part but don't buy that the benefit > there is big enough to introduce a kernel API which is expected to be > used by individual programs which is radically different from how > processes / threads are organized and applications interact with the > kernel. Sorry, I don't quite follow, in what way is it radically different? What is magically different about a process versus a thread in this sub-division? > These are a lot more grave issues and if we end up paying > some complexity from kernel side internally, so be it. > >> > So, except for cpuset, this doesn't matter for controllers. All >> > limits are hierarchical and that's it. >> >> Well no, it still matters because I might want to lower the limit >> below what children have set. > > All controllers only get what their ancestors can hand down to them. > That's basic hierarchical behavior. > And many users want non work-conserving systems in which we can add and remove idle resources. This means that how much bandwidth an ancestor has is not fixed in stone. >> > For cpuset, it's tricky >> > because a nested cgroup might end up with no intersecting execution >> > resource. The kernel can't have threads which don't have any >> > execution resources and the solution has been assuming the resources >> > from higher-ups till there's some. Application control has always >> > behaved the same way. If the configured affinity becomes empty, the >> > scheduler ignored it. 
>> >> Actually no, any configuration change that would result in this state >> is rejected. >> >> It's not possible to configure an empty cpuset once tasks are in it, >> or attach tasks to an empty set. >> It's also not possible to create this state using setaffinity, these >> restrictions are always over-ridden by updates, even if they do not >> need to be. > > So, even in traditional hierarchies, this isn't true. You can get to > no-resource config through cpu hot-unplug and cpuset currently ejects > tasks to the closest ancestor with execution resources. This is exactly congruent with what I said. It's not possible to have tasks attached to an empty cpuset. Ejection is only maintaining this in the face of a non-failable operation. > >> > Because the details on this particular issue can be hashed out in the >> > future? There's nothing permanently blocking any direction that we >> > might choose in the future and what's working today will keep working. >> > Why block the whole thing which can be useful for the majority of use >> > cases for this particular corner case? >> >> Because I do not think sub-process hierarchies are the corner case >> that you're making them out to be for these controllers and that has >> real implications for the ultimate direction of this interface. > > If that's the case and we fail miserably at creating a reasonable > programming interface for that, we can always revive thread > granularity. This is mostly a policy decision after all. These interfaces should be presented side-by-side. This is not a reasonable patch-later part of the interface as we depend on it today. > >> Also. If we are making disruptive changes here, I would want to >> discuss merging cpu, cpuset, and cpuacct. What this merge looks like >> depends on the above. > > So, the proposed patches already merge cpu and cpuacct, at least in > appearance. Given the kitchen-sink nature of cpuset, I don't think it > makes sense to fuse it with cpu. Arguments in favor of this: a) Today the load-balancer has _no_ understanding of group level cpu-affinity masks. b) With SCHED_NUMA, we can benefit from also being able to apply (b) to understand which nodes are usable. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
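The rejection Paul refers to is visible from user space: once a v1 cpuset has tasks attached, an attempt to empty its cpus mask fails at write time (cpuset's validate_change() refuses it, currently with -ENOSPC). A sketch, reusing the assumed group path from the earlier example:

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *path = "/sys/fs/cgroup/cpuset/job0/private/cpuset.cpus";
        FILE *f = fopen(path, "w");

        if (!f)
                return 1;
        /* emptying the mask while tasks are attached is rejected */
        if (fputs("\n", f) == EOF || fclose(f) == EOF)
                fprintf(stderr, "rejected: %s\n", strerror(errno));
        return 0;
}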
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-24 23:06 ` Paul Turner @ 2015-08-25 21:02 ` Tejun Heo 2015-09-02 17:03 ` Tejun Heo 2015-09-09 12:49 ` Paul Turner 0 siblings, 2 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-25 21:02 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, On Mon, Aug 24, 2015 at 04:06:39PM -0700, Paul Turner wrote: > > This is an erratic behavior on cpuset's part tho. Nothing else > > behaves this way and it's borderline buggy. > > It's actually the only sane possible interaction here. > > If you don't overwrite the masks you can no longer manage cpusets with > a multi-threaded application. > If you partially overwrite the masks you can create a host of > inconsistent behaviors where an application suddenly loses > parallelism. It's a layering problem. It'd be fine if cpuset either did "layer per-thread affinities below w/ config change notification" or "ignore and/or reject per-thread affinities". What we have now is two layers manipulating the same field without any mechanism for coordination. > The *only* consistent way is to clobber all masks uniformly. Then > either arrange for some notification to the application to re-sync, or > use sub-sub-containers within the cpuset hierarchy to advertise > finer-partitions. I don't get it. How is that the only consistent way? Why is making irreversible changes even a good way? Just layer the masks and trigger notification on changes. > I don't think the case of having a large compute farm with > "unimportant" and "important" work is particularly fringe. Reducing > the impact on the "important" work so that we can scavenge more cycles > for the latency insensitive "unimportant" is very real. What if optimizing cache allocation across competing threads of a process can yield, say, 3% gain across large compute farm? Is that fringe? > Right, but it's exactly because of _how bad_ those other mechanisms > _are_ that cgroups was originally created. Its growth was not > managed well from there, but let's not step away from the fact that > this interface was created to solve this problem. Sure, at the same time, please don't forget that there are ample reasons we can't replace more basic mechanisms with cgroups. I'm not saying this can't be part of cgroup but rather that it's misguided to do plunge into cgroups as the first and only step. More importantly, I am extremely doubtful that we understand the usage scenarios and their benefits very well at this point and want to avoid over-committing to something we'll look back and regret. As it currently stands, this has a high likelihood of becoming a mismanaged growth. For the cache allocation thing, I'd strongly suggest something way simpler and non-commmittal - e.g. create a char device node with simple configuration and basic access control. If this *really* turns out to be useful and its configuration complex enough to warrant cgroup integration, let's do it then, and if we actually end up there, I bet the interface that we'd come up with at that point would be different from what people are proposing now. > > Yeah, I understand the similarity part but don't buy that the benefit > > there is big enough to introduce a kernel API which is expected to be > > used by individual programs which is radically different from how > > processes / threads are organized and applications interact with the > > kernel. 
> > Sorry, I don't quite follow, in what way is it radically different? > What is magically different about a process versus a thread in this > sub-division? I meant that cgroupfs as opposed to most other programming interfaces that we publish to applications. We already have process / thread hierarchy which is created through forking/cloning and conventions built around them for interaction. No sane application programming interface requires individual applications to open a file somewhere, echo some values to it and use directory operations to manage its organization. Will get back to this later. > > All controllers only get what their ancestors can hand down to them. > > That's basic hierarchical behavior. > > And many users want non work-conserving systems in which we can add > and remove idle resources. This means that how much bandwidth an > ancestor has is not fixed in stone. I'm having a hard time following you on this part of the discussion. Can you give me an example? > > If that's the case and we fail miserably at creating a reasonable > > programming interface for that, we can always revive thread > > granularity. This is mostly a policy decision after all. > > These interfaces should be presented side-by-side. This is not a > reasonable patch-later part of the interface as we depend on it today. Revival of thread affinity is trivial and will stay that way for a long time and the transition is already gradual, so it'll be a lost opportunity but there is quite a bit of maneuvering room. Anyways, on with the sub-process interface. Skipping description of the problems with the current setup here as I've repated it a couple times in this thread already. On the other sub-thread, I said that process/thread tree and cgroup association are inherently tied. This is because a new child task is always born into the parent's cgroup and the only reason cgroup works on system management use cases is because system management often controls enough of how processes are created. The flexible migration that cgroup supports may suggest that an external agent with enough information can define and manage sub-process hierarchy without involving the target application but this doesn't necessarily work because such information is often only available in the application itself and the internal thread hierarchy should be agreeable to the hierarchy that's being imposed upon it - when threads are dynamically created, different parts of the hierarchy should be created by different parent thread. Also, the problem with external and in-application manipulations stepping on each other's toes is mostly not caused by individual config settings but by the possibility that they may try to set up different hierarchies or modify the existing one in a way which is not expected by the other. Given that thread hierarchy already needs to be compatible with resource hierarchy, is something unix programs already understands and thus can render itself to an a lot more conventional interface which doesn't cause organizational conflicts, I think it's logical to use that for sub-process resource distribution. So, it comes down to sth like the following set_resource($TID, $FLAGS, $KEY, $VAL) - If $TID isn't already a resource group leader, it creates a sub-cgroup, sets $KEY to $VAL and moves $PID and all its descendants to it. - If $TID is already a resource group leader, set $KEY to $VAL. - If the process is moved to another cgroup, the sub-hierarchy is preserved. 
The reality is a bit more complex and cgroup core would need to handle implicit leaf cgroups and duplicating sub-hierarchy. The biggest complexity would be extending atomic multi-thread migrations to accomodate multiple targets but it already does atomic multi-task migrations and performing the migrations back-to-back should work. Controller side changes wouldn't be much. Copying configs to clone sub-hierarchy and specifying which are availble should be about it. This should give applications a simple and straight-forward interface to program against while avoiding all the issues with exposing cgroupfs directly to individual applications. > > So, the proposed patches already merge cpu and cpuacct, at least in > > appearance. Given the kitchen-sink nature of cpuset, I don't think it > > makes sense to fuse it with cpu. > > Arguments in favor of this: > a) Today the load-balancer has _no_ understanding of group level > cpu-affinity masks. > b) With SCHED_NUMA, we can benefit from also being able to apply (b) > to understand which nodes are usable. Controllers can cooperate with each other on the unified hierarchy - cpu can just query the matching cpuset css about the relevant attributes and the results will always be properly hierarchical for cpu too. There's no reason to merge the two controllers for that. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
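To make the shape of the proposal concrete, here is a hypothetical user-space view of set_resource(). Nothing below exists in the kernel or libc; the stub, the flags value and the "cpu.weight" key are placeholders chosen only to illustrate the semantics listed above.

#include <stdio.h>
#include <sys/types.h>

/* stub standing in for the proposed call; no such entry point exists today */
static int set_resource(pid_t tid, unsigned int flags,
                        const char *key, const char *val)
{
        printf("set_resource(%d, %u, %s, %s)\n", (int)tid, flags, key, val);
        return 0;
}

int main(void)
{
        pid_t vcpu_tid = 1234, helper_tid = 1235;       /* placeholders */

        /*
         * First call on a thread that is not yet a resource group leader:
         * per the proposal, a sub-group is created, the knob is set and
         * the thread plus its descendants are moved into it.
         */
        set_resource(vcpu_tid, 0, "cpu.weight", "200");

        /* helper threads get their own, lower-weight group */
        set_resource(helper_tid, 0, "cpu.weight", "50");

        /* a later call on an existing group leader only updates the knob */
        set_resource(vcpu_tid, 0, "cpu.weight", "300");
        return 0;
}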
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-25 21:02 ` Tejun Heo @ 2015-09-02 17:03 ` Tejun Heo 2015-09-09 12:49 ` Paul Turner 1 sibling, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-09-02 17:03 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Paul? Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-25 21:02 ` Tejun Heo 2015-09-02 17:03 ` Tejun Heo @ 2015-09-09 12:49 ` Paul Turner 2015-09-12 14:40 ` Tejun Heo 1 sibling, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-09-09 12:49 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton [ Picking this back up, I was out of the country last week. Note that we are also wrestling with some DMARC issues as it was just activated for Google.com so apologies if people do not receive this directly. ] On Tue, Aug 25, 2015 at 2:02 PM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Mon, Aug 24, 2015 at 04:06:39PM -0700, Paul Turner wrote: >> > This is an erratic behavior on cpuset's part tho. Nothing else >> > behaves this way and it's borderline buggy. >> >> It's actually the only sane possible interaction here. >> >> If you don't overwrite the masks you can no longer manage cpusets with >> a multi-threaded application. >> If you partially overwrite the masks you can create a host of >> inconsistent behaviors where an application suddenly loses >> parallelism. > > It's a layering problem. It'd be fine if cpuset either did "layer > per-thread affinities below w/ config change notification" or "ignore > and/or reject per-thread affinities". What we have now is two layers > manipulating the same field without any mechanism for coordination. > I think this is a mischaracterization. With respect to the two proposed solutions: a) Notifications do not solve this problem. b) Rejecting per-thread affinities is a non-starter. It's absolutely needed. (Aside: This would also wholly break the existing sched_setaffinity/getaffinity syscalls.) I do not think this is a layering problem. This is more like C++: there is no sane way to concurrently use all the features available, however, reasonably self-consistent subsets may be chosen. >> The *only* consistent way is to clobber all masks uniformly. Then >> either arrange for some notification to the application to re-sync, or >> use sub-sub-containers within the cpuset hierarchy to advertise >> finer-partitions. > > I don't get it. How is that the only consistent way? Why is making > irreversible changes even a good way? Just layer the masks and > trigger notification on changes. I'm not sure if you're agreeing or disagreeing here. Are you saying there's another consistent way from "clobbering then triggering notification changes" since it seems like that's what is rejected and then described. This certainly does not include any provisions for reversibility (which I think is a non-starter). With respect to layering: Are you proposing we maintain a separate mask for sched_setaffinity and cpusets, then do different things depending on their intersection, or lack thereof? I feel this would introduce more consistencies than it would solve as these masks would not be separately inspectable from user-space, leading to confusing interactions as they are changed. There are also very real challenges with how any notification is implemented, independent of delivery: The 'main' of an application often does not have good control or even understanding over its own threads since many may be library managed. Designation of responsibility for updating these masks is difficult. That said, I think a notification would still be a useful improvement here and that some applications would benefit. 
At the very least, I do not think that cpuset's behavior here can be dismissed as unreasonable. > >> I don't think the case of having a large compute farm with >> "unimportant" and "important" work is particularly fringe. Reducing >> the impact on the "important" work so that we can scavenge more cycles >> for the latency insensitive "unimportant" is very real. > > What if optimizing cache allocation across competing threads of a > process can yield, say, 3% gain across large compute farm? Is that > fringe? Frankly, yes. If you have a compute farm sufficiently dedicated to a single application I'd say that's a fairly specialized use. I see no reason why a more 'technical' API should be a challenge for such a user. The fundamental point here was that it's ok for the API of some controllers to be targeted at system rather than user control in terms of interface. This does not restrict their use by users where appropriate. > >> Right, but it's exactly because of _how bad_ those other mechanisms >> _are_ that cgroups was originally created. Its growth was not >> managed well from there, but let's not step away from the fact that >> this interface was created to solve this problem. > > Sure, at the same time, please don't forget that there are ample > reasons we can't replace more basic mechanisms with cgroups. I'm not > saying this can't be part of cgroup but rather that it's misguided to > do plunge into cgroups as the first and only step. > So there is definitely a proliferation of discussion regarding applying cgroups to other problems which I agree we need to take a step back and re-examine. However, here we're fundamentally talking about APIs designed to partition resources which is the problem that cgroups was introduced to address. If we want to introduce another API to do that below the process level we need to talk about why it's fundamentally different for processes versus threads, and whether we want two APIs that interface with the same underlying kernel mechanics. > More importantly, I am extremely doubtful that we understand the usage > scenarios and their benefits very well at this point and want to avoid > over-committing to something we'll look back and regret. As it > currently stands, this has a high likelihood of becoming a mismanaged > growth. I don't disagree with you with respect to new controllers, but I worry this is forking the discussion somewhat. There are two issues being conflated here: 1) The need for per-thread resource control and what such an API should look like. 2) The proliferation of new controllers, such as CAT. We should try to focus on (1) here as that is the most immediate for forward progress. We can certainly draw anecdotes from (2) but we do know (1) applies to existing controllers (e.g. cpu/cpuacct/cpuset). > > For the cache allocation thing, I'd strongly suggest something way > simpler and non-commmittal - e.g. create a char device node with > simple configuration and basic access control. If this *really* turns > out to be useful and its configuration complex enough to warrant > cgroup integration, let's do it then, and if we actually end up there, > I bet the interface that we'd come up with at that point would be > different from what people are proposing now. As above, I really want to focus on (1) so I will be brief here: Making it a char device requires yet-another adhoc method of describing process groupings that a configuration should apply to and yet-another set of rules for its inheritance. 
Once we merge it, we're committed to backwards support of the interface either way, I do not see what reimplementing things as a char device or sysfs or seqfile or other buys us over it being cgroupfs in this instance. I think that the real problem here is that stuff gets merged that does not follow the rules of how something implemented with cgroups must behave (typically respect with to a hierarchy); which is obviously increasingly incompatible with a model where we have a single hierarchy. But, provided that we can actually define those rules; I do not see the gain in denying the admission of new controller which is wholly consistent with them. It does not really measurably add to the complexity of the implementation (and it greatly reduces it where grouping semantics are desired). > >> > Yeah, I understand the similarity part but don't buy that the benefit >> > there is big enough to introduce a kernel API which is expected to be >> > used by individual programs which is radically different from how >> > processes / threads are organized and applications interact with the >> > kernel. >> >> Sorry, I don't quite follow, in what way is it radically different? >> What is magically different about a process versus a thread in this >> sub-division? > > I meant that cgroupfs as opposed to most other programming interfaces > that we publish to applications. We already have process / thread > hierarchy which is created through forking/cloning and conventions > built around them for interaction. I do not think the process/thread hierarchy is particularly comparable as it is both immutable and not a partition. It expresses resource parenting only. The only common operation performed on it is killing a sub-tree. > No sane application programming > interface requires individual applications to open a file somewhere, > echo some values to it and use directory operations to manage its > organization. Will get back to this later. > >> > All controllers only get what their ancestors can hand down to them. >> > That's basic hierarchical behavior. >> >> And many users want non work-conserving systems in which we can add >> and remove idle resources. This means that how much bandwidth an >> ancestor has is not fixed in stone. > > I'm having a hard time following you on this part of the discussion. > Can you give me an example? For example, when a system is otherwise idle we might choose to give an application additional memory or cpu resources. These may be reclaimed in the future, such an update requires updating children to be compatible with a parents' new limits. > >> > If that's the case and we fail miserably at creating a reasonable >> > programming interface for that, we can always revive thread >> > granularity. This is mostly a policy decision after all. >> >> These interfaces should be presented side-by-side. This is not a >> reasonable patch-later part of the interface as we depend on it today. > > Revival of thread affinity is trivial and will stay that way for a > long time and the transition is already gradual, so it'll be a lost > opportunity but there is quite a bit of maneuvering room. Anyways, on > with the sub-process interface. > > Skipping description of the problems with the current setup here as > I've repated it a couple times in this thread already. > > On the other sub-thread, I said that process/thread tree and cgroup > association are inherently tied. 
This is because a new child task is > always born into the parent's cgroup and the only reason cgroup works > on system management use cases is because system management often > controls enough of how processes are created. > > The flexible migration that cgroup supports may suggest that an > external agent with enough information can define and manage > sub-process hierarchy without involving the target application but > this doesn't necessarily work because such information is often only > available in the application itself and the internal thread hierarchy > should be agreeable to the hierarchy that's being imposed upon it - > when threads are dynamically created, different parts of the hierarchy > should be created by different parent thread. I think what's more important here is that you can define it to work. This does require cooperation between the external agent and the application in the layout of the application's hierarchy. But this is something we do use. A good example would be the surfacing of public and private cpus previously discussed to the application. > > Also, the problem with external and in-application manipulations > stepping on each other's toes is mostly not caused by individual > config settings but by the possibility that they may try to set up > different hierarchies or modify the existing one in a way which is not > expected by the other. How is this different from say signals or ptrace or any file-system modification? This does not seem a problem inherent to cgroups. > > Given that thread hierarchy already needs to be compatible with > resource hierarchy, is something unix programs already understands and > thus can render itself to an a lot more conventional interface which > doesn't cause organizational conflicts, I think it's logical to use > that for sub-process resource distribution. > > So, it comes down to sth like the following > > set_resource($TID, $FLAGS, $KEY, $VAL) > > - If $TID isn't already a resource group leader, it creates a > sub-cgroup, sets $KEY to $VAL and moves $PID and all its descendants > to it. > > - If $TID is already a resource group leader, set $KEY to $VAL. > > - If the process is moved to another cgroup, the sub-hierarchy is > preserved. > Honestly, I find this API awkward: 1) It depends on "anchor" threads to define groupings. 2) It does not allow thread-level hierarchies to be created 3) When coordination with an external agent is desired this defines no common interface that can be shared. Directories are an extremely useful container. Are you proposing applications would need to somehow publish the list of anchor-threads from (1)? What if I want to set up state that an application will attaches threads to [consider cpuset example above]? 4) How is the cgroup property to $KEY translation defined? This feels like an ioctl and no more natural than the file-system. It also does not seem to resolve your concerns regarding races; the application must still coordinate internally when concurrently calling set_resource(). 5) How does an external agent coordinate when a resource must be removed from a sub-hierarchy? On a larger scale, what properties do you feel this separate API provides that would not be also supported by instead exposing sub-process hierarchies via /proc/self or similar. Perhaps it would help to enumerate the the key problems we're trying to solve with the choice of this interface? 1) Thread spanning trees within the cgroup hierarchy. 
(Immediately fixed, only processes are present on the cgroup-mount) 2) Interactions with the parent process moving within the hierarchy 3) Potentially supporting move operations within a hierarchy Are there other cruxes? > The reality is a bit more complex and cgroup core would need to handle > implicit leaf cgroups and duplicating sub-hierarchy. The biggest > complexity would be extending atomic multi-thread migrations to > accomodate multiple targets but it already does atomic multi-task > migrations and performing the migrations back-to-back should work. > Controller side changes wouldn't be much. Copying configs to clone > sub-hierarchy and specifying which are availble should be about it. > > This should give applications a simple and straight-forward interface > to program against while avoiding all the issues with exposing > cgroupfs directly to individual applications. Is your primary concern here (2) above? E.g. that moving the parent process means that the location we write to for sub-process updates is not consistent? Or other? For issues involving synchronization, what's proposed at least feels no different to what we face today. > >> > So, the proposed patches already merge cpu and cpuacct, at least in >> > appearance. Given the kitchen-sink nature of cpuset, I don't think it >> > makes sense to fuse it with cpu. >> >> Arguments in favor of this: >> a) Today the load-balancer has _no_ understanding of group level >> cpu-affinity masks. >> b) With SCHED_NUMA, we can benefit from also being able to apply (b) >> to understand which nodes are usable. > > Controllers can cooperate with each other on the unified hierarchy - > cpu can just query the matching cpuset css about the relevant > attributes and the results will always be properly hierarchical for > cpu too. There's no reason to merge the two controllers for that Let's shelve this for now. > > Thanks. > > -- > tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-09 12:49 ` Paul Turner @ 2015-09-12 14:40 ` Tejun Heo 2015-09-17 14:35 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 92+ messages in thread From: Tejun Heo @ 2015-09-12 14:40 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote: > I do not think this is a layering problem. This is more like C++: > there is no sane way to concurrently use all the features available, > however, reasonably self-consistent subsets may be chosen. That's just admitting failure. > > I don't get it. How is that the only consistent way? Why is making > > irreversible changes even a good way? Just layer the masks and > > trigger notification on changes. > > I'm not sure if you're agreeing or disagreeing here. Are you saying > there's another consistent way from "clobbering then triggering > notification changes" since it seems like that's what is rejected and > then described. This certainly does not include any provisions for > reversibility (which I think is a non-starter). > > With respect to layering: Are you proposing we maintain a separate > mask for sched_setaffinity and cpusets, then do different things > depending on their intersection, or lack thereof? I feel this would > introduce more consistencies than it would solve as these masks would > not be separately inspectable from user-space, leading to confusing > interactions as they are changed. So, one of the problems is that the kernel can't have tasks w/o runnable CPUs, so we have to some workaround when, for whatever reason, a task ends up with no CPU that it can run on. The other is that we never established a consistent way to deal with it in global case either. You say cpuset isn't a layering thing but that simply isn't true. It's a cgroup-scope CPU mask. It layers atop task affinities restricting what they can be configured to, limiting the effective cpumask to the intersection of actually existing CPUs and overriding individual affinity setting when the intersection doesn't exist. The kernel does not update all CPU affinity masks when a CPU goes down or comes up. It just enforces the intersection and when the intersection becomes empty, ignores it. cgroup-scoped behaviors should reflect what the system does in the global case in general, and the global behavior here, although missing some bits, is a lot saner than what cpuset is currently doing. > There are also very real challenges with how any notification is > implemented, independent of delivery: > The 'main' of an application often does not have good control or even > understanding over its own threads since many may be library managed. > Designation of responsibility for updating these masks is difficult. > That said, I think a notification would still be a useful improvement > here and that some applications would benefit. And this is the missing piece in the global case too. We've just never solved this problem properly but that does not mean we should go off and do something completely different for cgroup case. Clobbering is fine if there's a single entity controlling everything but at that level it's nothing more than a shorthand for running taskset on member tasks. > At the very least, I do not think that cpuset's behavior here can be > dismissed as unreasonable. It sure is very misguided. 
> > What if optimizing cache allocation across competing threads of a > > process can yield, say, 3% gain across large compute farm? Is that > > fringe? > > Frankly, yes. If you have a compute farm sufficiently dedicated to a > single application I'd say that's a fairly specialized use. I see no > reason why a more 'technical' API should be a challenge for such a > user. The fundamental point here was that it's ok for the API of some > controllers to be targeted at system rather than user control in terms > of interface. This does not restrict their use by users where > appropriate. We can go back and forth forever on this but I'm fairly sure everything CAT related is niche at this point. > So there is definitely a proliferation of discussion regarding > applying cgroups to other problems which I agree we need to take a > step back and re-examine. > > However, here we're fundamentally talking about APIs designed to > partition resources which is the problem that cgroups was introduced > to address. If we want to introduce another API to do that below the > process level we need to talk about why it's fundamentally different > for processes versus threads, and whether we want two APIs that > interface with the same underlying kernel mechanics. More on this below but going full-on cgroup controller w/ thread-level interface is akin to introducing syscalls for this. That really is what it is. > > For the cache allocation thing, I'd strongly suggest something way > > simpler and non-commmittal - e.g. create a char device node with > > simple configuration and basic access control. If this *really* turns > > out to be useful and its configuration complex enough to warrant > > cgroup integration, let's do it then, and if we actually end up there, > > I bet the interface that we'd come up with at that point would be > > different from what people are proposing now. > > As above, I really want to focus on (1) so I will be brief here: > > Making it a char device requires yet-another adhoc method of > describing process groupings that a configuration should apply to and > yet-another set of rules for its inheritance. Once we merge it, we're Actually, we *always* had a method of describing process groupings called process hierarchy. cgroup provides dyanmic classfication atop, but not completely as the hierarchy still dictates where new processes end up. > committed to backwards support of the interface either way, I do not > see what reimplementing things as a char device or sysfs or seqfile or > other buys us over it being cgroupfs in this instance. > > I think that the real problem here is that stuff gets merged that does > not follow the rules of how something implemented with cgroups must > behave (typically respect with to a hierarchy); which is obviously > increasingly incompatible with a model where we have a single > hierarchy. But, provided that we can actually define those rules; I > do not see the gain in denying the admission of new controller which > is wholly consistent with them. It does not really measurably add to > the complexity of the implementation (and it greatly reduces it where > grouping semantics are desired). CAT is really a bad example. I'd say no as a cgroup controller or as a new set of syscalls. It simply isn't developed enough yet and we don't want to commit that much. System resource partitioning which can't easily be achieved in different ways can surely be a part of cgroup but we don't wanna do that willy nilly. 
We actually wanna deliberate on what the actual resources and their abstractions are which we have traditionally been horrible at. > > I'm having a hard time following you on this part of the discussion. > > Can you give me an example? > > For example, when a system is otherwise idle we might choose to give > an application additional memory or cpu resources. These may be > reclaimed in the future, such an update requires updating children to > be compatible with a parents' new limits. There are four types of resource control that cgroup does - weights, limits, guarantees, and strict allocations. Weights are obviously work-preserving. Limiters and strict allocators shouldn't be. Guarantees are limiters applied in the other direction and are work-preserving, and strict allocations are just that: strict. I still don't quite get what you were trying to say. What was the point here? > > The flexible migration that cgroup supports may suggest that an > > external agent with enough information can define and manage > > sub-process hierarchy without involving the target application but > > this doesn't necessarily work because such information is often only > > available in the application itself and the internal thread hierarchy > > should be agreeable to the hierarchy that's being imposed upon it - > > when threads are dynamically created, different parts of the hierarchy > > should be created by different parent thread. > > I think what's more important here is that you can define it to work. > This does require cooperation between the external agent and the > application in the layout of the application's hierarchy. But this is > something we do use. A good example would be the surfacing of public > and private cpus previously discussed to the application. So, if you do that, it's fine, but this is the same as your previous C++ argument. This shouldn't be the standard we design these interfaces on. If it can be clearly layered in a consistent way, we should do that and that doesn't prevent internal and external entities cooperating. > > Also, the problem with external and in-application manipulations > > stepping on each other's toes is mostly not caused by individual > > config settings but by the possibility that they may try to set up > > different hierarchies or modify the existing one in a way which is not > > expected by the other. > > How is this different from say signals or ptrace or any file-system > modification? This does not seem a problem inherent to cgroups. ptrace is obviously a special case but we don't let external agents meddle with signal handlers or change cwd of a process. In most cases, there are distinctions between what's internal to a process and what's not. > > Given that thread hierarchy already needs to be compatible with > > resource hierarchy, is something unix programs already understands and > > thus can render itself to an a lot more conventional interface which > > doesn't cause organizational conflicts, I think it's logical to use > > that for sub-process resource distribution. > > > > So, it comes down to sth like the following > > > > set_resource($TID, $FLAGS, $KEY, $VAL) > > > > - If $TID isn't already a resource group leader, it creates a > > sub-cgroup, sets $KEY to $VAL and moves $PID and all its descendants > > to it. > > > > - If $TID is already a resource group leader, set $KEY to $VAL. > > > > - If the process is moved to another cgroup, the sub-hierarchy is > > preserved. 
> > > > Honestly, I find this API awkward: > > 1) It depends on "anchor" threads to define groupings. So does cgroupfs. Any kind of thread or process grouping can't escape that as soon as things start forking and if things don't fork whether something is anchor or not doesn't make much difference. > 2) It does not allow thread-level hierarchies to be created Huh? That's the only thing it would do. This obviously wouldn't get applied to processes. It's strictly about threads. > 3) When coordination with an external agent is desired this defines no > common interface that can be shared. Directories are an extremely > useful container. Are you proposing applications would need to > somehow publish the list of anchor-threads from (1)? Again, this is an invariant no matter what we do. As I wrote numerous times in this thread, this information is only known to the process itself. If an external agent wants to manipulate these from outside, it just has to know which thread is doing what. The difference is that this doesn't require the process itself to coordinate with an external agent when operating on itself. > What if I want to set up state that an application will attaches > threads to [consider cpuset example above]? It feels like we're running in circles. Process-level control stays the same. That part is not an issue. Thread-level control requires cooperation from the process itself no matter what and should stay confined to the limits imposed on the process as a whole. Frankly, the cpuset example doesn't make much sense to me because there is nothing hierarchical about it and it isn't even layered properly. Can you describe what you're actually trying to achieve? But no matter the specifics of the example, it's almost trivial to achieve whatever end results. > 4) How is the cgroup property to $KEY translation defined? This feels > like an ioctl and no more natural than the file-system. It also does How are they even comparable? Sure ioctl inputs are variable-formed and its definitions aren't closely scrutinized but other than those it's a programmable system-call interface and how programs use and interact with them is completely different from how a program interacts with cgroupfs. It doesn't have to parse out the path, compose the knob path, open and format the data into it all the while not being sure whether the file it's operating on is even the right one anymore or the sub-hierarchy it's assuming is still there. > not seem to resolve your concerns regarding races; the application > must still coordinate internally when concurrently calling > set_resource(). I have no idea where you're going with this. When did the internal synchronization inside a process become an issue? Sure, if a thread does *(int *)0 = 0, we can't protect other threads from it. Also, why would it be a problem? If two callers perform set_resource() on the same thread, one will be executed after the other. What are you talking about? > 5) How does an external agent coordinate when a resource must be > removed from a sub-hierarchy? That sort of restriction should generally be put at the higher level. Thread-level resource control should be cooperative with the application if at all necessary and in those cases just setting the limit on the sub-hierarchy would work. If the process is adversarial, it can mess up whatever external agent tries to do inside the process by messing up its thread forking hierarchy. It just doesn't matter. 
> On a larger scale, what properties do you feel this separate API > provides that would not be also supported by instead exposing > sub-process hierarchies via /proc/self or similar. > > Perhaps it would help to enumerate the the key problems we're trying > to solve with the choice of this interface? > 1) Thread spanning trees within the cgroup hierarchy. (Immediately > fixed, only processes are present on the cgroup-mount) > 2) Interactions with the parent process moving within the hierarchy > 3) Potentially supporting move operations within a hierarchy > > Are there other cruxes? It's a lot easier for applications to program against and it makes it explicit that grouping threads is the domain of the process itself, which is true no matter what we do, and everybody follows the same grouping inside the process thus removing the problems around different entities manipulating the sub-hierarchy in incompatible ways. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
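The global behaviour Tejun is appealing to can be summarised in a short sketch (simplified pseudo-kernel C, not the actual scheduler code; the helper name is invented): the configured affinity mask is never rewritten, only intersected with the CPUs that actually exist, and ignored when that intersection is empty.

/*
 * Simplified sketch of the intersection rule described above, not the
 * real implementation.  tsk_cpus_allowed() and cpu_active_mask are the
 * kernel primitives also used in the patch later in this thread.
 */
#include <linux/cpumask.h>
#include <linux/sched.h>

static void effective_cpus(struct task_struct *p, struct cpumask *out)
{
	cpumask_and(out, tsk_cpus_allowed(p), cpu_active_mask);
	if (cpumask_empty(out)) {
		/*
		 * Nothing runnable is left: fall back to whatever is
		 * online instead of clobbering the task's configured
		 * affinity, so the setting survives the CPU returning.
		 */
		cpumask_copy(out, cpu_active_mask);
	}
}

The cpuset behaviour being criticised differs in that the stored masks get overwritten on such events, which is what makes the change irreversible from the application's point of view.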
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-12 14:40 ` Tejun Heo @ 2015-09-17 14:35 ` Peter Zijlstra 2015-09-17 14:53 ` Tejun Heo 2015-09-17 15:10 ` Peter Zijlstra 2015-09-17 23:29 ` Tejun Heo 2015-09-18 11:27 ` Paul Turner 2 siblings, 2 replies; 92+ messages in thread From: Peter Zijlstra @ 2015-09-17 14:35 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Sat, Sep 12, 2015 at 10:40:07AM -0400, Tejun Heo wrote: > So, one of the problems is that the kernel can't have tasks w/o > runnable CPUs, so we have to some workaround when, for whatever > reason, a task ends up with no CPU that it can run on. No, just refuse that configuration. > You say cpuset isn't a layering thing but that simply isn't true. > It's a cgroup-scope CPU mask. It layers atop task affinities > restricting what they can be configured to, limiting the effective > cpumask to the intersection of actually existing CPUs and overriding > individual affinity setting when the intersection doesn't exist. No, just fail. > The kernel does not update all CPU affinity masks when a CPU goes down > or comes up. I'd be happy to fail a CPU down for user tasks where this is the last runnable CPU of. ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-17 14:35 ` Peter Zijlstra @ 2015-09-17 14:53 ` Tejun Heo 2015-09-17 15:42 ` Peter Zijlstra 2015-09-17 15:10 ` Peter Zijlstra 1 sibling, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-09-17 14:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Paul Turner, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, On Thu, Sep 17, 2015 at 04:35:27PM +0200, Peter Zijlstra wrote: > On Sat, Sep 12, 2015 at 10:40:07AM -0400, Tejun Heo wrote: > > So, one of the problems is that the kernel can't have tasks w/o > > runnable CPUs, so we have to some workaround when, for whatever > > reason, a task ends up with no CPU that it can run on. > > No, just refuse that configuration. Well, you've been saying that but that's not our behavior on cpu hotunplug either and it applies the same. If we reject cpu hotunplugs while tasks are affined to it, we can do the same in cpuset too. > > The kernel does not update all CPU affinity masks when a CPU goes down > > or comes up. > > I'd be happy to fail a CPU down for user tasks where this is the last > runnable CPU of. So, yeah, we need to keep these things consistent across global and cgroup cases. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-17 14:53 ` Tejun Heo @ 2015-09-17 15:42 ` Peter Zijlstra 0 siblings, 0 replies; 92+ messages in thread From: Peter Zijlstra @ 2015-09-17 15:42 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Thu, Sep 17, 2015 at 10:53:09AM -0400, Tejun Heo wrote: > > I'd be happy to fail a CPU down for user tasks where this is the last > > runnable CPU of. > > So, yeah, we need to keep these things consistent across global and > cgroup cases. > Ok, I'll go extend the sysctl_sched_strict_affinity to the cpuset bits too then :-) ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-17 14:35 ` Peter Zijlstra 2015-09-17 14:53 ` Tejun Heo @ 2015-09-17 15:10 ` Peter Zijlstra 2015-09-17 15:52 ` Tejun Heo 1 sibling, 1 reply; 92+ messages in thread From: Peter Zijlstra @ 2015-09-17 15:10 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Thu, Sep 17, 2015 at 04:35:27PM +0200, Peter Zijlstra wrote: > I'd be happy to fail a CPU down for user tasks where this is the last > runnable CPU of. A little like so. Completely untested. --- Subject: sched: Refuse to unplug a CPU if this will violate user task affinity Its bad policy to allow unplugging a CPU for which a user set explicit affinity, either strictly on this CPU or in case this was the last online CPU in its mask. Either would end up forcing the thread on a random other CPU, violating the sys_sched_setaffinity() constraint. Disallow this by default; root might not be aware of all user affinities, but can negotiate and change affinities for all tasks. Provide a sysctl to go back to the old behaviour. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> --- include/linux/sched/sysctl.h | 1 + kernel/sched/core.c | 46 ++++++++++++++++++++++++++++++++++++++++++++ kernel/sysctl.c | 9 +++++++++ 3 files changed, 56 insertions(+) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index c9e4731cf10b..9444b549914b 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -39,6 +39,7 @@ extern unsigned int sysctl_sched_latency; extern unsigned int sysctl_sched_min_granularity; extern unsigned int sysctl_sched_wakeup_granularity; extern unsigned int sysctl_sched_child_runs_first; +extern unsigned int sysctl_sched_strict_affinity; enum sched_tunable_scaling { SCHED_TUNABLESCALING_NONE, diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6ab415aa15c4..457c8b912fc6 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -284,6 +284,11 @@ __read_mostly int scheduler_running; */ int sysctl_sched_rt_runtime = 950000; +/* + * Disallows cpu unplug if that would result in a task without runnable CPUs. + */ +unsigned int sysctl_sched_strict_affinity = 1; + /* cpus with isolated domains */ cpumask_var_t cpu_isolated_map; @@ -5430,6 +5435,42 @@ static void set_rq_offline(struct rq *rq) } /* + * Test if there's a user task for which @cpu is the last runnable CPU + */ +static bool migration_possible(int cpu) +{ + struct task_struct *g, *p; + bool ret = true; + int next; + + read_lock(&tasklist_lock); + for_each_process_thread(g, p) { + /* if its running elsewhere, this cannot be its last cpu */ + if (task_cpu(p) != cpu) + continue; + + /* we only care about user state */ + if (p->flags & PF_KTHREAD) + continue; + + next = -1; +again: + next = cpumask_next_and(next, tsk_cpus_allowed(p), cpu_active_mask); + if (next >= nr_cpu_ids) { + printk(KERN_WARNING "task %s-%d refused unplug of CPU %d\n", + p->comm, p->pid, cpu); + ret = false; + break; + } + if (next == cpu) + goto again; + } + read_unlock(&tasklist_lock); + + return ret; +} + +/* * migration_call - callback that gets triggered when a CPU is added. * Here we can start up the necessary migration thread for the new CPU. 
*/ @@ -5440,6 +5481,11 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu) unsigned long flags; struct rq *rq = cpu_rq(cpu); + if (action == CPU_DOWN_PREPARE && sysctl_sched_strict_affinity) { + if (!migration_possible(cpu)) + return notifier_from_errno(-EBUSY); + } + switch (action & ~CPU_TASKS_FROZEN) { case CPU_UP_PREPARE: diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e69201d8094e..9d0edcc73cc3 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -283,6 +283,15 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, +#ifdef CONFIG_SMP + { + .procname = "sched_strict_affinity", + .data = &sysctl_sched_strict_affinity, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif /* CONFIG_SMP */ #ifdef CONFIG_SCHED_DEBUG { .procname = "sched_min_granularity_ns", ^ permalink raw reply related [flat|nested] 92+ messages in thread
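For reference, the user-visible effect of the patch, with the new knob left at its default of 1 (the patch registers it under /proc/sys/kernel/sched_strict_affinity), would be roughly the following. This is a small test sketch assuming the usual sysfs hotplug interface; run as root, and the exact errno reported by the sysfs write may differ.

/*
 * Rough illustration of the behaviour added by the patch above: with
 * strict affinity enabled, offlining a CPU that is the last runnable
 * CPU of a user task is refused (-EBUSY from the notifier).
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	cpu_set_t set;
	int fd;

	/* pin this task to CPU 1 only */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	/* now try to offline CPU 1 - expected to be refused */
	fd = open("/sys/devices/system/cpu/cpu1/online", O_WRONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, "0", 1) < 0)
		fprintf(stderr, "offline refused: %s\n", strerror(errno));
	close(fd);
	return 0;
}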
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-17 15:10 ` Peter Zijlstra @ 2015-09-17 15:52 ` Tejun Heo 2015-09-17 15:53 ` Peter Zijlstra 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-09-17 15:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Paul Turner, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, On Thu, Sep 17, 2015 at 05:10:49PM +0200, Peter Zijlstra wrote: > Subject: sched: Refuse to unplug a CPU if this will violate user task affinity > > Its bad policy to allow unplugging a CPU for which a user set explicit > affinity, either strictly on this CPU or in case this was the last > online CPU in its mask. > > Either would end up forcing the thread on a random other CPU, violating > the sys_sched_setaffinity() constraint. Shouldn't this at least handle suspend differently? Otherwise any userland task would be able to block suspend. > Disallow this by default; root might not be aware of all user > affinities, but can negotiate and change affinities for all tasks. > > Provide a sysctl to go back to the old behaviour. I don't think a sysctl is a good way to control this as that breaks the invariant - all tasks always have some cpus online in their affinity mask - which otherwise can be guaranteed. If we wanna go this way, let's please start the discussion in a separate thread with a detailed explanation of the implications of the change. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-17 15:52 ` Tejun Heo @ 2015-09-17 15:53 ` Peter Zijlstra 0 siblings, 0 replies; 92+ messages in thread From: Peter Zijlstra @ 2015-09-17 15:53 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Thu, Sep 17, 2015 at 11:52:45AM -0400, Tejun Heo wrote: > Hello, > > On Thu, Sep 17, 2015 at 05:10:49PM +0200, Peter Zijlstra wrote: > > Subject: sched: Refuse to unplug a CPU if this will violate user task affinity > > > > Its bad policy to allow unplugging a CPU for which a user set explicit > > affinity, either strictly on this CPU or in case this was the last > > online CPU in its mask. > > > > Either would end up forcing the thread on a random other CPU, violating > > the sys_sched_setaffinity() constraint. > > Shouldn't this at least handle suspend differently? Otherwise any > userland task would be able to block suspend. It does, it will allow suspend. ^ permalink raw reply [flat|nested] 92+ messages in thread
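The suspend exemption follows from the hunk itself: the added check compares the notifier action for exact equality, so the frozen variant used on the suspend path (CPU_DOWN_PREPARE with CPU_TASKS_FROZEN or'd in) never matches it, while the pre-existing switch masks that flag off as before. Quoted from the patch above, with explanatory comments added:

	/* from the patch: exact match, so CPU_DOWN_PREPARE_FROZEN
	 * (the suspend path) skips the strict-affinity check entirely */
	if (action == CPU_DOWN_PREPARE && sysctl_sched_strict_affinity) {
		if (!migration_possible(cpu))
			return notifier_from_errno(-EBUSY);
	}

	/* regular hotplug handling still masks the FROZEN bit */
	switch (action & ~CPU_TASKS_FROZEN) {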
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-12 14:40 ` Tejun Heo 2015-09-17 14:35 ` Peter Zijlstra @ 2015-09-17 23:29 ` Tejun Heo 2015-09-18 11:27 ` Paul Turner 2 siblings, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-09-17 23:29 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Paul? Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-12 14:40 ` Tejun Heo 2015-09-17 14:35 ` Peter Zijlstra 2015-09-17 23:29 ` Tejun Heo @ 2015-09-18 11:27 ` Paul Turner 2015-10-01 18:46 ` Tejun Heo 2 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-09-18 11:27 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Sat, Sep 12, 2015 at 7:40 AM, Tejun Heo <tj@kernel.org> wrote: > Hello, > > On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote: >> I do not think this is a layering problem. This is more like C++: >> there is no sane way to concurrently use all the features available, >> however, reasonably self-consistent subsets may be chosen. > > That's just admitting failure. > Alternatively: accepting there are varied use-cases to support. >> > I don't get it. How is that the only consistent way? Why is making >> > irreversible changes even a good way? Just layer the masks and >> > trigger notification on changes. >> >> I'm not sure if you're agreeing or disagreeing here. Are you saying >> there's another consistent way from "clobbering then triggering >> notification changes" since it seems like that's what is rejected and >> then described. This certainly does not include any provisions for >> reversibility (which I think is a non-starter). >> >> With respect to layering: Are you proposing we maintain a separate >> mask for sched_setaffinity and cpusets, then do different things >> depending on their intersection, or lack thereof? I feel this would >> introduce more consistencies than it would solve as these masks would >> not be separately inspectable from user-space, leading to confusing >> interactions as they are changed. > > So, one of the problems is that the kernel can't have tasks w/o > runnable CPUs, so we have to some workaround when, for whatever > reason, a task ends up with no CPU that it can run on. The other is > that we never established a consistent way to deal with it in global > case either. > > You say cpuset isn't a layering thing but that simply isn't true. > It's a cgroup-scope CPU mask. It layers atop task affinities > restricting what they can be configured to, limiting the effective > cpumask to the intersection of actually existing CPUs and overriding > individual affinity setting when the intersection doesn't exist. > > The kernel does not update all CPU affinity masks when a CPU goes down > or comes up. It just enforces the intersection and when the > intersection becomes empty, ignores it. cgroup-scoped behaviors > should reflect what the system does in the global case in general, and > the global behavior here, although missing some bits, is a lot saner > than what cpuset is currently doing. You are conflating two things here: 1) How we maintain these masks 2) The interactions on updates I absolutely agree with you that we want to maintain (1) in a non-pointwise format. I've already talked about that in other replies on this thread. However for (2) I feel you are: i) Underestimating the complexity of synchronizing updates with user-space. ii) Introducing more non-desirable behaviors [partial overwrite] than those you object to [total overwrite]. > >> There are also very real challenges with how any notification is >> implemented, independent of delivery: >> The 'main' of an application often does not have good control or even >> understanding over its own threads since many may be library managed. 
>> Designation of responsibility for updating these masks is difficult. >> That said, I think a notification would still be a useful improvement >> here and that some applications would benefit. > > And this is the missing piece in the global case too. We've just > never solved this problem properly but that does not mean we should go > off and do something completely different for cgroup case. Clobbering > is fine if there's a single entity controlling everything but at that > level it's nothing more than a shorthand for running taskset on member > tasks. > >From user-space's perspective it always involves some out-of-band clobber since what's specified by cpusets takes precedence. However the result of overlaying the masks is that different update combinations will have very different effects, varying from greatly expanding parallelism to greatly restricting it. Further, these effects are hard to predict since anything returned by getaffinity is obscured by whatever the instantaneous cpuset-level masks happen to be. >> At the very least, I do not think that cpuset's behavior here can be >> dismissed as unreasonable. > > It sure is very misguided. > It's the most consistent choice; you've not given any reasons above why a solution with only partial consistency is any better. Any choice here is difficult to coordinate, that two APIs allow manipulation of the same property means that we must always choose some compromise here. I prefer the one with the least surprises. >> > What if optimizing cache allocation across competing threads of a >> > process can yield, say, 3% gain across large compute farm? Is that >> > fringe? >> >> Frankly, yes. If you have a compute farm sufficiently dedicated to a >> single application I'd say that's a fairly specialized use. I see no >> reason why a more 'technical' API should be a challenge for such a >> user. The fundamental point here was that it's ok for the API of some >> controllers to be targeted at system rather than user control in terms >> of interface. This does not restrict their use by users where >> appropriate. > > We can go back and forth forever on this but I'm fairly sure > everything CAT related is niche at this point. I agree it makes sense to restrict to the partitioning operations desired and not the resource being controlled. Be it CAT or other. > >> So there is definitely a proliferation of discussion regarding >> applying cgroups to other problems which I agree we need to take a >> step back and re-examine. >> >> However, here we're fundamentally talking about APIs designed to >> partition resources which is the problem that cgroups was introduced >> to address. If we want to introduce another API to do that below the >> process level we need to talk about why it's fundamentally different >> for processes versus threads, and whether we want two APIs that >> interface with the same underlying kernel mechanics. > > More on this below but going full-on cgroup controller w/ thread-level > interface is akin to introducing syscalls for this. That really is > what it is. > >> > For the cache allocation thing, I'd strongly suggest something way >> > simpler and non-commmittal - e.g. create a char device node with >> > simple configuration and basic access control. 
If this *really* turns >> > out to be useful and its configuration complex enough to warrant >> > cgroup integration, let's do it then, and if we actually end up there, >> > I bet the interface that we'd come up with at that point would be >> > different from what people are proposing now. >> >> As above, I really want to focus on (1) so I will be brief here: >> >> Making it a char device requires yet-another adhoc method of >> describing process groupings that a configuration should apply to and >> yet-another set of rules for its inheritance. Once we merge it, we're > > Actually, we *always* had a method of describing process groupings > called process hierarchy. cgroup provides dyanmic classfication atop, > but not completely as the hierarchy still dictates where new processes > end up. > As before, the process hierarchy is essentially static and only really used for resource parenting, not resource partitioning. I think you are somewhat trivializing the dynamic aspect here. As soon as the hierarchy is non-static then you have to reinvent some mechanism of describing and interacting with that hierarchy. This does not apply to the process hierarchy. I do not yet see a good reason why the threads arbitrarily not sharing an address space necessitates the use of an entirely different API. The only problems stated so far in this discussion have been: 1) Actual issues involving relative paths, which are potentially solvable. 2) Aesthetic distaste for using file-system abstractions >> committed to backwards support of the interface either way, I do not >> see what reimplementing things as a char device or sysfs or seqfile or >> other buys us over it being cgroupfs in this instance. >> >> I think that the real problem here is that stuff gets merged that does >> not follow the rules of how something implemented with cgroups must >> behave (typically respect with to a hierarchy); which is obviously >> increasingly incompatible with a model where we have a single >> hierarchy. But, provided that we can actually define those rules; I >> do not see the gain in denying the admission of new controller which >> is wholly consistent with them. It does not really measurably add to >> the complexity of the implementation (and it greatly reduces it where >> grouping semantics are desired). > > CAT is really a bad example. I'd say no as a cgroup controller or as > a new set of syscalls. It simply isn't developed enough yet and we > don't want to commit that much. System resource partitioning which > can't easily be achieved in different ways can surely be a part of > cgroup but we don't wanna do that willy nilly. We actually wanna > deliberate on what the actual resources and their abstractions are > which we have tradtionally been horrible at. None of that paragraph was actually about CAT. It's that don't understand this disjunction that we should arbitrarily partition some things with cgroups, but actively prefer not to use it for others. Fundamentally cgroups was originally an API about defining partitions and attaching control semantics to them, which this statement feels like a step away from. I do not understand the claim that we should try not to use it for problems that want partitioning. > >> > I'm having a hard time following you on this part of the discussion. >> > Can you give me an example? >> >> For example, when a system is otherwise idle we might choose to give >> an application additional memory or cpu resources. 
These may be >> reclaimed in the future, such an update requires updating children to >> be compatible with a parents' new limits. > > There are four types of resource control that cgroup does - weights, > limits, guarantees, and strict allocations. Weights are obviously > work-preserving. Limiters and strict allocators shouldn't be. > Guarantees are limiters applied the other direction and > work-preserving and strict allocations are strict allocations. I > still don't quite get what you were trying to say. What was the point > here? You asked for an example where updating a parent's limits required the modification of a descendant. This was one. > >> > The flexible migration that cgroup supports may suggest that an >> > external agent with enough information can define and manage >> > sub-process hierarchy without involving the target application but >> > this doesn't necessarily work because such information is often only >> > available in the application itself and the internal thread hierarchy >> > should be agreeable to the hierarchy that's being imposed upon it - >> > when threads are dynamically created, different parts of the hierarchy >> > should be created by different parent thread. >> >> I think what's more important here is that you can define it to work. >> This does require cooperation between the external agent and the >> application in the layout of the application's hierarchy. But this is >> something we do use. A good example would be the surfacing of public >> and private cpus previously discussed to the application. > > So, if you do that, it's fine, but this is the same as your previous > c++ argument. This shouldn't be the standard we design these > interfaces on. If it can be clearly layered in a consistent way, we > should do that and that doesn't prevent internal and external entities > cooperating. Sorry, I don't understand this statement. It's a requirement, not a standard. All that's being said is that this is a real thing that the API presently supports. The alternatives you are proposing do not cleanly support this. I make the C++ argument exactly because this is something not all users are likely to require. > >> > Also, the problem with external and in-application manipulations >> > stepping on each other's toes is mostly not caused by individual >> > config settings but by the possibility that they may try to set up >> > different hierarchies or modify the existing one in a way which is not >> > expected by the other. >> >> How is this different from say signals or ptrace or any file-system >> modification? This does not seem a problem inherent to cgroups. > > ptrace is obviously a special case but we don't let external agents > meddle with signal handlers or change cwd of a process. In most > cases, there are distinctions between what's internal to a process and > what's not. But, given the right capabilities or user, we do allow them to send signals, modify the file-system, etc. There is nothing about using a VFS interface that precludes extending the same protections. > >> > Given that thread hierarchy already needs to be compatible with >> > resource hierarchy, is something unix programs already understands and >> > thus can render itself to an a lot more conventional interface which >> > doesn't cause organizational conflicts, I think it's logical to use >> > that for sub-process resource distribution. 
>> > >> > So, it comes down to sth like the following >> > >> > set_resource($TID, $FLAGS, $KEY, $VAL) >> > >> > - If $TID isn't already a resource group leader, it creates a >> > sub-cgroup, sets $KEY to $VAL and moves $PID and all its descendants >> > to it. >> > >> > - If $TID is already a resource group leader, set $KEY to $VAL. >> > >> > - If the process is moved to another cgroup, the sub-hierarchy is >> > preserved. >> > >> >> Honestly, I find this API awkward: >> >> 1) It depends on "anchor" threads to define groupings. > > So does cgroupfs. Any kind of thread or process grouping can't escape > that as soon as things start forking and if things don't fork whether > something is anchor or not doesn't make much difference. The difference is that this ignores how applications are actually written: A container that is independent of its members (e.g. a cgroup directory) can be created and configured by an application's Init() or within the construction of a data-structure that will use it without dependency on those resources yet being used. As an example: The resources associated with thread pools are often dynamically managed. What you're proposing means that some initialization must now be moved into the first thread that pool creates (as opposed to the pool's initilization), that synchronization and identification of this thread is now required, and that it must be treated differently to other threads in the pool (it can no longer be reclaimed). > >> 2) It does not allow thread-level hierarchies to be created > > Huh? That's the only thing it would do. This obviously wouldn't get > applied to processes. It's strictly about threads. This allows a single *partition*, not a hierarchy. As machines become larger, so are many of the processes we run on them. These larger processes manage resources between threads on scales that we would previously partition between processes. > >> 3) When coordination with an external agent is desired this defines no >> common interface that can be shared. Directories are an extremely >> useful container. Are you proposing applications would need to >> somehow publish the list of anchor-threads from (1)? > > Again, this is an invariant no matter what we do. As I wrote numerous > times in this thread, this information is only known to the process > itself. If an external agent want to manipulate these from outside, > it just has to know which thread is doing what. The difference is > that this doesn't require the process itself to coordinate with > external agent when operating on itself. Nothing about what was previously state would require any coordination with the process and an external agent when operating on itself. What's the basis for this claim? This also ignores the cases previously discussed in which the external agent is providing state for threads within a process to attach to. An example of this is repeated below. This isn't even covering that this requires the invention of entirely new user-level APIs and coordination for somehow publishing these magic tids. > >> What if I want to set up state that an application will attaches >> threads to [consider cpuset example above]? > > It feels like we're running in circles. Process-level control stays > the same. That part is not an issue. Thread-level control requires > cooperation from the process itself no matter what and should stay > confined to the limits imposed on the process as a whole. 
> > Frankly, cpuset example doesn't make much sense to me because there is > nothing hierarchical about it and it isn't even layered properly. Can > you describe what you're actually trying to achieve? But no matter > the specifities of the example, it's almost trivial to achieve > whatever end results. This has been previously detailed, repeating it here: Machines are shared resources, we partition the available cpus into shared and private sets. These sets are dynamic as when a new application arrives requesting private cpus, we must reserve some cpus that were previously shared. We use sub-cpusets to advertise to applications which of their cpus are shared and which are private. They can then attach threads to these containers -- which are dynamically updated as cores shift between public and private configurations. > >> 4) How is the cgroup property to $KEY translation defined? This feels >> like an ioctl and no more natural than the file-system. It also does > > How are they even comparable? Sure ioctl inputs are variable-formed > and its definitions aren't closely scrutinized but other than those > it's a programmable system-call interface and how programs use and > interact with them is completely different from how a program > interacts with cgroupfs. They're exactly comparable in that every cgroup.<property> api now needs some magic equivalent $KEY defined. I don't understand how you're proposing these would be generated or managed. > It doesn't have to parse out the path, > compose the knob path, open and format the data into it There's nothing hard about this. Further, we already have to do exactly this at the process level; which means abstractions for this already exist; removing this property does not change their presence of requirement, but instead means they must be duplicated for the in-thread case. Even ignoring that the libraries for this can be shared between thread and process, this is also generally easier to work with than magic $KEY values. > all the while > not being sure whether the file it's operating on is even the right > one anymore or the sub-hierarchcy it's assuming is still there. One possible resolution to this has been proposed several times: Have the sub-process hierarchy exposed in an independent and fixed location. > >> not seem to resolve your concerns regarding races; the application >> must still coordinate internally when concurrently calling >> set_resource(). > > I have no idea where you're going with this. When did the internal > synchronization inside a process become an issue? Sure, if a thread > does *(int *)=0, we can't protect other threads from it. Also, why > would it be a problem? If two perform set_resource() on the same > thread, one will be executed after the other. What are you talking > about? It was my impression that you'd had atomicity concerns regarding file-system operations such as writes for updates previously. If you have no concerns within a sub-processes operation then this can be discarded. > >> 5) How does an external agent coordinate when a resource must be >> removed from a sub-hierarchy? > > That sort of restriction should generally be put at the higher level. > Thread-level resource control should be cooperative with the > application if at all necessary and in those cases just set the limit > on the sub-hierarchy would work. > Could you expand on how you envision this being cooperative? This seems tricky to me, particularly when limits are involved. How do I even arbitrate which external agents are allowed control? 
> If the process is adversarial, it can mess up whatever external agent > tries to do inside the process by messing up its thread forking > >> On a larger scale, what properties do you feel this separate API >> provides that would not be also supported by instead exposing >> sub-process hierarchies via /proc/self or similar. >> >> Perhaps it would help to enumerate the the key problems we're trying >> to solve with the choice of this interface? >> 1) Thread spanning trees within the cgroup hierarchy. (Immediately >> fixed, only processes are present on the cgroup-mount) >> 2) Interactions with the parent process moving within the hierarchy >> 3) Potentially supporting move operations within a hierarchy >> >> Are there other cruxes? > > It's a lot easier for applications to program against and it makes it > explicit that grouping thrads is the domain of the process itself, So I was really trying to make sure we covered the interface problems we're trying to solve here. Are there major ones not listed there? However, I strongly disagree with this statement. It is much easier for applications to work with named abstract objects then having magic threads that it must track and treat specially. My implementation must now look like this: 1) I instantiate some abstraction which uses cgroups. 2) In construction I must now coordinate with my chosen threading implementation (an exciting new dependency) to create a new thread and get its tid. This thread must exist for as long as the associated data-structure. I must pay a kernel stack, at least one page of thread stack and however much TLS I've declared in my real threads. 3) In destruction I must now wake and free the thread created in (2). 4) If I choose to treat it as a real thread, I must be careful, this thread is special and cannot be arbitrarily released like other threads. 5) To do anything I must go grok the documentation to look up the magic $KEY. If I get this wrong I am going to have a fun time debugging it since things are no longer reasonably inspect-able. If I must work with a cgroup that adds features over time things are even more painful since $KEY may or may not exist. Is any of the above unfair with respect to what you've described above? This isn't even beginning to consider the additional pain that a language implementing its own run-time such as Go might incur. > which is true no matter what we do, and everybody follows the same > grouping inside the process thus removing the problems around > different entities manipulating the sub-hierarchy in incompatible > ways. Option B: We expose sub-process hierarchies via /proc/self/cgroups or similar. They do not appear within the process only cgroup hierarchy. Only the same user (or a privileged one) has access to this internal hierarchy. This can be arbitrarily restricted further. Applications continue to use almost exactly the same cgroup interfaces that exist today, however, the problem of path computation and non-stable paths are now eliminated. Really, what problems does this not solve? It eliminates the unstable mount point, your concerns regarding external entity manipulation, and allows for the parent processes to be moved. It provides a reasonable place for coordination to occur, with standard mechanisms for access control. It allows for state to be easily inspected, it does not require new documentation, allows the creation of sub-hierarchies, does not require special threads. 
This was previously raised as a straw man, but I have not yet seen or thought of good arguments against it. Thanks, - Paul ^ permalink raw reply [flat|nested] 92+ messages in thread
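The thread-pool pattern Paul keeps returning to is worth spelling out, since it is what the anchor-thread model has trouble with. Below is a rough sketch of how it looks against today's cgroupfs, assuming a v1-style cpu controller mount and that the process may write to its own cgroup directory; the /sys/fs/cgroup/cpu/myservice/pool path is made up for the example.

/*
 * Sketch of the existing cgroupfs usage pattern described above: the
 * pool creates and configures its sub-cgroup at initialisation, before
 * any worker exists, and each worker attaches itself on startup.
 * Paths and access rights are assumptions for illustration only.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define POOL_CG "/sys/fs/cgroup/cpu/myservice/pool"	/* hypothetical */

/* called once from the pool constructor, before any worker exists */
static int pool_cgroup_init(long shares)
{
	FILE *f;

	if (mkdir(POOL_CG, 0755) && errno != EEXIST)
		return -1;
	f = fopen(POOL_CG "/cpu.shares", "w");
	if (!f)
		return -1;
	fprintf(f, "%ld\n", shares);
	return fclose(f);
}

/* called by each worker thread as it starts */
static int pool_cgroup_attach(void)
{
	FILE *f = fopen(POOL_CG "/tasks", "w");

	if (!f)
		return -1;
	fprintf(f, "%ld\n", (long)syscall(SYS_gettid));
	return fclose(f);
}

Under the set_resource() proposal the group cannot exist before its first member thread, which is the substance of objection 1) above; under Paul's "Option B" the same two functions would presumably operate on paths under /proc/self rather than the global mount.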
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-09-18 11:27 ` Paul Turner @ 2015-10-01 18:46 ` Tejun Heo 2015-10-15 11:42 ` Paul Turner 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-10-01 18:46 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Paul. Sorry about the delay. Things were kinda hectic in the past couple weeks. On Fri, Sep 18, 2015 at 04:27:07AM -0700, Paul Turner wrote: > On Sat, Sep 12, 2015 at 7:40 AM, Tejun Heo <tj@kernel.org> wrote: > > On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote: > >> I do not think this is a layering problem. This is more like C++: > >> there is no sane way to concurrently use all the features available, > >> however, reasonably self-consistent subsets may be chosen. > > > > That's just admitting failure. > > > > Alternatively: accepting there are varied use-cases to > support. Analogies like this can go awry but as we're in it anyway, let's push it a bit further. One of the reasons why C++ isn't lauded as an example of great engineering is while it does support a vast number of use-cases or rather usage-scenarios (it's not necessarily focused on utility but just how things are done) it fails to distill the essence of the actual utility out of them and condense it. It's not just an aesthetic argument. That failure exacts heavy costs on its users and is one of the reasons why C++ projects are more prone to horrible disarrays unless specific precautions are taken. I'm not against supporting valid and useful use-cases but not all usage-scenarios are equal. If we can achieve the same eventual goals with reasonable trade-offs in a simpler and more straight-forward way, that's what we should do even though that'd require some modifications to specific usage-scenarios. ie. the usage-scenarios need to scrutinized so that the core of the utility can be extracted and abstracted in the, hopefully, minimal way. This is what worries me when you liken the situation to C++. You probably were trying to make a different point but I'm not sure we're on the same page and I think we need to agree at least on this in principle; otherwise, we'll just keep talking past each other. > > The kernel does not update all CPU affinity masks when a CPU goes down > > or comes up. It just enforces the intersection and when the > > intersection becomes empty, ignores it. cgroup-scoped behaviors > > should reflect what the system does in the global case in general, and > > the global behavior here, although missing some bits, is a lot saner > > than what cpuset is currently doing. > > You are conflating two things here: > 1) How we maintain these masks > 2) The interactions on updates > > I absolutely agree with you that we want to maintain (1) in a > non-pointwise format. I've already talked about that in other replies > on this thread. > > However for (2) I feel you are: > i) Underestimating the complexity of synchronizing updates with user-space. > ii) Introducing more non-desirable behaviors [partial overwrite] than > those you object to [total overwrite]. The thing which bothers me the most is that cpuset behavior is different from global case for no good reason. We don't have a model right now. It's schizophrenic. 
And what I was trying to say was that maybe this is because we never had a working model in the global case either but if that's the case we need to solve the global case too or at least figure out where we wanna be in the long term. > It's the most consistent choice; you've not given any reasons above > why a solution with only partial consistency is any better. > > Any choice here is difficult to coordinate, that two APIs allow > manipulation of the same property means that we must always > choose some compromise here. I prefer the one with the least > surprises. I don't think the current situation around affinity mask handling can be considered consistent and cpuset is pouring more inconsistencies into it. We need to figure it out one way or the other. ... > I do not yet see a good reason why the threads arbitrarily not sharing an > address space necessitates the use of an entirely different API. The > only problems stated so far in this discussion have been: > 1) Actual issues involving relative paths, which are potentially solvable. Also the ownership of organization. If the use-cases can be reasonably served with static grouping, I think it'd definitely be a worthwhile trade-off to make. It's different from process level grouping. There, we can simply state that this is to be arbitrated in the userland and that arbitration isn't that difficult as it's among administration stack of userspace. In-process attributes are different. The process itself can manipulate its own attributes but it's also common for external tools to peek into processes and set certain attributes. Even when the two parties aren't coordinated, this is usually fine because there's no reason for applications to depend on what those attribute are set to and even when the different entities do different things, the combination is still something coherent. Now, if you make the in-process grouping dynamic and accessible to external entities (and if we aren't gonna do that, why even bother?), this breaks down and we have some of the same problems we have with allowing applications to directly manipulate cgroup sub-directories. This is a fundamental problem. Setting attributes can be shared but organization is an exclusive process. You can't share that without close coordination. Assigning the full responsiblity of in-process organization to the application itself and tying it to static parental relationship allows for solid common grounds where these resource operations can be performed by different entities without causing structural issues just like other similar operations. Another point for assigning this responsibility to the application itself is that it can't be done without the application's cooperation anyway because the group membership of new threads is determined by the group the parent belongs to. > 2) Aesthetic distaste for using file-system abstractions It's not that but more about what the file-system interface implies. It's not just different. It breaks a lot of expectations a lot of application visible kernel interface provides as explained above. There are reasons why we usually don't do things this way. ... > >> 1) It depends on "anchor" threads to define groupings. > > > > So does cgroupfs. Any kind of thread or process grouping can't escape > > that as soon as things start forking and if things don't fork whether > > something is anchor or not doesn't make much difference. 
> > The difference is that this ignores how applications are actually written: It does require the applications to follow certain protocols to organize itself but this is a pretty trivial thing to do and comes with the benefit that we don't need to introduce a completely new grouping concept to applications. > A container that is independent of its members (e.g. a cgroup > directory) can be created and configured by an application's Init() or > within the construction of a data-structure that will use it without > dependency on those resources yet being used. > > As an example: > The resources associated with thread pools are often dynamically > managed. What you're proposing means that some initialization must > now be moved into the first thread that pool creates (as opposed to > the pool's initilization), that synchronization and identification of > this thread is now required, and that it must be treated differently > to other threads in the pool (it can no longer be reclaimed). That should be like a two hour job for most applications. This is a trivial thing to do. It's difficult for me to consider the difficulty of doing this a major decision point. > >> 2) It does not allow thread-level hierarchies to be created > > > > Huh? That's the only thing it would do. This obviously wouldn't get > > applied to processes. It's strictly about threads. > > This allows a single *partition*, not a hierarchy. As machines > become larger, so are many of the processes we run on them. These > larger processes manage resources between threads on scales that we > would previously partition between processes. I don't get it. Why wouldn't it allow hierarchy? > >> 3) When coordination with an external agent is desired this defines no > >> common interface that can be shared. Directories are an extremely > >> useful container. Are you proposing applications would need to > >> somehow publish the list of anchor-threads from (1)? > > > > Again, this is an invariant no matter what we do. As I wrote numerous > > times in this thread, this information is only known to the process > > itself. If an external agent want to manipulate these from outside, > > it just has to know which thread is doing what. The difference is > > that this doesn't require the process itself to coordinate with > > external agent when operating on itself. > > Nothing about what was previously state would require any coordination > with the process and an external agent when operating on itself. > What's the basis for this claim? I hope this is explained now. > This also ignores the cases previously discussed in which the external > agent is providing state for threads within a process to attach to. > An example of this is repeated below. > > This isn't even covering that this requires the invention of entirely > new user-level APIs and coordination for somehow publishing these > magic tids. We already have those tids. > >> What if I want to set up state that an application will attaches > >> threads to [consider cpuset example above]? > > > > It feels like we're running in circles. Process-level control stays > > the same. That part is not an issue. Thread-level control requires > > cooperation from the process itself no matter what and should stay > > confined to the limits imposed on the process as a whole. > > > > Frankly, cpuset example doesn't make much sense to me because there is > > nothing hierarchical about it and it isn't even layered properly. Can > > you describe what you're actually trying to achieve? 
But no matter > > the specifities of the example, it's almost trivial to achieve > > whatever end results. > > This has been previously detailed, repeating it here: > > Machines are shared resources, we partition the available cpus into > shared and private sets. These sets are dynamic as when a new > application arrives requesting private cpus, we must reserve some cpus > that were previously shared. > > We use sub-cpusets to advertise to applications which of their cpus > are shared and which are private. They can then attach threads to > these containers -- which are dynamically updated as cores shift > between public and private configurations. I see but you can easily do that the other way too, right? Let the applications publish where they put their threads and let the external entity set configs on them. > >> 4) How is the cgroup property to $KEY translation defined? This feels > >> like an ioctl and no more natural than the file-system. It also does > > > > How are they even comparable? Sure ioctl inputs are variable-formed > > and its definitions aren't closely scrutinized but other than those > > it's a programmable system-call interface and how programs use and > > interact with them is completely different from how a program > > interacts with cgroupfs. > > They're exactly comparable in that every cgroup.<property> api now > needs some magic equivalent $KEY defined. I don't understand how > you're proposing these would be generated or managed. Not everything. Just the ones which make sense in-process. This is exactly the process we need to go through when introducing new syscalls. Why is this a surprise? We want to scrutinize them, hard. > > It doesn't have to parse out the path, > > compose the knob path, open and format the data into it > > There's nothing hard about this. Further, we already have to do > exactly this at the process level; which means abstractions for this I'm not following. Why would it need to do that already? > already exist; removing this property does not change their presence > of requirement, but instead means they must be duplicated for the > in-thread case. > > Even ignoring that the libraries for this can be shared between thread > and process, this is also generally easier to work with than magic > $KEY values. This is like saying syscalls are worse in terms of progammability compared to opening and writing formatted strings for setting attributes. If that's what you're saying, let's just agree to disgree on this one. > > all the while > > not being sure whether the file it's operating on is even the right > > one anymore or the sub-hierarchcy it's assuming is still there. > > One possible resolution to this has been proposed several times: > Have the sub-process hierarchy exposed in an independent and fixed location. > > >> not seem to resolve your concerns regarding races; the application > >> must still coordinate internally when concurrently calling > >> set_resource(). > > > > I have no idea where you're going with this. When did the internal > > synchronization inside a process become an issue? Sure, if a thread > > does *(int *)=0, we can't protect other threads from it. Also, why > > would it be a problem? If two perform set_resource() on the same > > thread, one will be executed after the other. What are you talking > > about? > > It was my impression that you'd had atomicity concerns regarding > file-system operations such as writes for updates previously. 
If you > have no concerns within a sub-processes operation then this can be > discarded. That's comparing apples and oranges. Threads being moved around and hierarchies changing beneath them present a whole different issues than someone else setting an attribute to a different value. The operations might fail, might set properties on the wrong group. > >> 5) How does an external agent coordinate when a resource must be > >> removed from a sub-hierarchy? > > > > That sort of restriction should generally be put at the higher level. > > Thread-level resource control should be cooperative with the > > application if at all necessary and in those cases just set the limit > > on the sub-hierarchy would work. > > > > Could you expand on how you envision this being cooperative? This > seems tricky to me, particularly when limits are involved. > > How do I even arbitrate which external agents are allowed control? I think we're talking past each other. If you wanna put restrictions on the process as whole, do it at the higher level. If you wanna fiddle with in-process resource distribution, you just have to assume that the application itself is cooperative or at least not malicious. No matter what external entities try to do, the application can circumvent because that's what ultimately determines the grouping. > So I was really trying to make sure we covered the interface problems > we're trying to solve here. Are there major ones not listed there? > > However, I strongly disagree with this statement. It is much easier > for applications to work with named abstract objects then having magic > threads that it must track and treat specially. How is that different? Sure, the name is created by the threads but once you set the resource, the tid would be the resource group ID and the thread can go away. It's still an object named by an ID. The only difference is that the process of creating the hierarchy is tied to the process that threads are created in. > My implementation must now look like this: > 1) I instantiate some abstraction which uses cgroups. > 2) In construction I must now coordinate with my chosen threading > implementation (an exciting new dependency) to create a new thread and > get its tid. This thread must exist for as long as the associated > data-structure. I must pay a kernel stack, at least one page of > thread stack and however much TLS I've declared in my real threads. > 3) In destruction I must now wake and free the thread created in (2). > 4) If I choose to treat it as a real thread, I must be careful, this > thread is special and cannot be arbitrarily released like other > threads. > 5) To do anything I must go grok the documentation to look up the > magic $KEY. If I get this wrong I am going to have a fun time > debugging it since things are no longer reasonably inspect-able. If I > must work with a cgroup that adds features over time things are even > more painful since $KEY may or may not exist. > > Is any of the above unfair with respect to what you've described above? Yeah, as I wrote above. > This isn't even beginning to consider the additional pain that a > language implementing its own run-time such as Go might incur. Yeap, it does require userland runtime to have a way to make the thread creation history visible to the operating system. It doesn't look like a big price. Again, I'm looking for a balance. You're citing inconveniences from userland side and yeah I get that. 
Making things more rigid and static requires some adjustments from userland but we gain from it too. No need to worry about structural inconsistencies and the varied failure modes which can cascade from that. If the only possible solution is C++-esque everything-goes way, sure, we'll have to do that but that's not the case. We can implement and provide the core functionality in a more controlled manner. > Option B: > We expose sub-process hierarchies via /proc/self/cgroups or similar. > They do not appear within the process only cgroup hierarchy. > Only the same user (or a privileged one) has access to this internal > hierarchy. This can be arbitrarily restricted further. > Applications continue to use almost exactly the same cgroup > interfaces that exist today, however, the problem of path computation > and non-stable paths are now eliminated. > > Really, what problems does this not solve? > > It eliminates the unstable mount point, your concerns regarding > external entity manipulation, and allows for the parent processes to > be moved. It provides a reasonable place for coordination to occur, > with standard mechanisms for access control. It allows for state to > be easily inspected, it does not require new documentation, allows the > creation of sub-hierarchies, does not require special threads. > > This was previously raised as a straw man, but I have not yet seen or > thought of good arguments against it. It allows for structural inconsistencies where applications can end up performing operations which are non-sensical. Breaking that invariant is substantial. Why would we do that if Can we at least agree that we're now venturing into an area where things aren't really critical? The core functionality here is being able to hierarchically categorize threads and assign resource limits to them. Can we agree that the minimum core functionality is met in both approaches? Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
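The shared/private cpuset pattern Paul describes in this exchange can be made concrete with a short sketch. This is only an illustration of that usage, assuming a v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset and an application group named "app"; the group names and paths are invented for the example, while cpuset.cpus, cpuset.mems and tasks are the existing cpuset interface files.

	/*
	 * Sketch of the shared/private cpuset advertisement pattern,
	 * under the assumptions stated above. Not part of any proposal
	 * in this thread.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>
	#include <sys/types.h>
	#include <unistd.h>

	static int write_str(const char *path, const char *val)
	{
		int fd = open(path, O_WRONLY);
		ssize_t ret;

		if (fd < 0)
			return -1;
		ret = write(fd, val, strlen(val));
		close(fd);
		return ret < 0 ? -1 : 0;
	}

	/* External agent: advertise which CPUs are currently private to "app". */
	static int publish_private_cpus(const char *cpus)
	{
		mkdir("/sys/fs/cgroup/cpuset/app/private", 0755);  /* may already exist */
		/* cpuset requires mems to be populated before tasks can attach */
		write_str("/sys/fs/cgroup/cpuset/app/private/cpuset.mems", "0");
		return write_str("/sys/fs/cgroup/cpuset/app/private/cpuset.cpus", cpus);
	}

	/* Application: attach one of its threads to the advertised set. */
	static int attach_thread(pid_t tid)
	{
		char buf[32];

		snprintf(buf, sizeof(buf), "%d", tid);
		return write_str("/sys/fs/cgroup/cpuset/app/private/tasks", buf);
	}

Tejun's counter-suggestion in the message above is the reverse flow: the application publishes where its threads are grouped and the external agent applies the configuration to those groups.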
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-01 18:46 ` Tejun Heo @ 2015-10-15 11:42 ` Paul Turner 2015-10-23 22:21 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Paul Turner @ 2015-10-15 11:42 UTC (permalink / raw) To: Tejun Heo Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Thu, Oct 1, 2015 at 11:46 AM, Tejun Heo <tj@kernel.org> wrote: > Hello, Paul. > > Sorry about the delay. Things were kinda hectic in the past couple > weeks. Likewise :-( > > On Fri, Sep 18, 2015 at 04:27:07AM -0700, Paul Turner wrote: >> On Sat, Sep 12, 2015 at 7:40 AM, Tejun Heo <tj@kernel.org> wrote: >> > On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote: >> >> I do not think this is a layering problem. This is more like C++: >> >> there is no sane way to concurrently use all the features available, >> >> however, reasonably self-consistent subsets may be chosen. >> > >> > That's just admitting failure. >> > >> >> Alternatively: accepting there are varied use-cases to >> support. > > Analogies like this can go awry but as we're in it anyway, let's push > it a bit further. One of the reasons why C++ isn't lauded as an > example of great engineering is while it does support a vast number of > use-cases or rather usage-scenarios (it's not necessarily focused on > utility but just how things are done) it fails to distill the essence > of the actual utility out of them and condense it. It's not just an > aesthetic argument. That failure exacts heavy costs on its users and > is one of the reasons why C++ projects are more prone to horrible > disarrays unless specific precautions are taken. > > I'm not against supporting valid and useful use-cases but not all > usage-scenarios are equal. If we can achieve the same eventual goals > with reasonable trade-offs in a simpler and more straight-forward way, > that's what we should do even though that'd require some modifications > to specific usage-scenarios. ie. the usage-scenarios need to > scrutinized so that the core of the utility can be extracted and > abstracted in the, hopefully, minimal way. > > This is what worries me when you liken the situation to C++. You > probably were trying to make a different point but I'm not sure we're > on the same page and I think we need to agree at least on this in > principle; otherwise, we'll just keep talking past each other. I agree with trying to reach a minimal core functionality that satisfies all use-cases. I am only saying however, that I think that I do not think we can reduce to an api so minimal that all users will use all parts of it. We have to fit more than one usage model in. > >> > The kernel does not update all CPU affinity masks when a CPU goes down >> > or comes up. It just enforces the intersection and when the >> > intersection becomes empty, ignores it. cgroup-scoped behaviors >> > should reflect what the system does in the global case in general, and >> > the global behavior here, although missing some bits, is a lot saner >> > than what cpuset is currently doing. >> >> You are conflating two things here: >> 1) How we maintain these masks >> 2) The interactions on updates >> >> I absolutely agree with you that we want to maintain (1) in a >> non-pointwise format. I've already talked about that in other replies >> on this thread. >> >> However for (2) I feel you are: >> i) Underestimating the complexity of synchronizing updates with user-space. 
>> ii) Introducing more non-desirable behaviors [partial overwrite] than >> those you object to [total overwrite]. > > The thing which bothers me the most is that cpuset behavior is > different from global case for no good reason. I've tried to explain above that I believe there are reasonable reasons for it working the way it does from an interface perspective. I do not think they can be so quickly discarded out of hand. However, I think we should continue winnowing focus and first resolve the model of interaction for sub-process hierarchies, > We don't have a model > right now. It's schizophrenic. And what I was trying to say was that > maybe this is because we never had a working model in the global case > either but if that's the case we need to solve the global case too or > at least figure out where we wanna be in the long term. > >> It's the most consistent choice; you've not given any reasons above >> why a solution with only partial consistency is any better. >> >> Any choice here is difficult to coordinate, that two APIs allow >> manipulation of the same property means that we must always >> choose some compromise here. I prefer the one with the least >> surprises. > > I don't think the current situation around affinity mask handling can > be considered consistent and cpuset is pouring more inconsistencies > into it. We need to figure it out one way or the other. > > ... >> I do not yet see a good reason why the threads arbitrarily not sharing an >> address space necessitates the use of an entirely different API. The >> only problems stated so far in this discussion have been: >> 1) Actual issues involving relative paths, which are potentially solvable. > > Also the ownership of organization. If the use-cases can be > reasonably served with static grouping, I think it'd definitely be a > worthwhile trade-off to make. It's different from process level > grouping. There, we can simply state that this is to be arbitrated in > the userland and that arbitration isn't that difficult as it's among > administration stack of userspace. > > In-process attributes are different. The process itself can > manipulate its own attributes but it's also common for external tools > to peek into processes and set certain attributes. Even when the two > parties aren't coordinated, this is usually fine because there's no > reason for applications to depend on what those attribute are set to > and even when the different entities do different things, the > combination is still something coherent. > > Now, if you make the in-process grouping dynamic and accessible to > external entities (and if we aren't gonna do that, why even bother?), > this breaks down and we have some of the same problems we have with > allowing applications to directly manipulate cgroup sub-directories. > This is a fundamental problem. Setting attributes can be shared but > organization is an exclusive process. You can't share that without > close coordination. Your concern here is centered on permissions, not the interface. This seems directly remedied by exactly: Any sub-process hierarchy we exposed would be locked down in terms of write access. These would not be generally writable. You're absolutely correct that you can't share without close coordination, and granting the appropriate permissions is part of that. 
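The write-access lockdown Paul refers to can be sketched with ordinary VFS ownership operations, assuming the sub-process subtree is exposed as regular cgroupfs directories; the mount location and group name here are assumptions made for illustration only.

	/*
	 * Minimal sketch of delegating a per-process subtree to the
	 * application's user so that only that user (or root) can write it.
	 */
	#include <stdio.h>
	#include <sys/stat.h>
	#include <sys/types.h>
	#include <unistd.h>

	static int delegate_subtree(const char *dir, uid_t app_uid, gid_t app_gid)
	{
		char path[256];

		if (chown(dir, app_uid, app_gid))		/* the directory itself */
			return -1;
		snprintf(path, sizeof(path), "%s/cgroup.procs", dir);
		if (chown(path, app_uid, app_gid))		/* let it move its own tasks */
			return -1;
		return chmod(dir, 0755);			/* owner-writable, world-readable */
	}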
> > Assigning the full responsiblity of in-process organization to the > application itself and tying it to static parental relationship allows > for solid common grounds where these resource operations can be > performed by different entities without causing structural issues just > like other similar operations. But cases have already been presented above where the full responsibility cannot be delegated to the application. Because we explicitly depend on constraints being provided by the external environment. > > Another point for assigning this responsibility to the application > itself is that it can't be done without the application's cooperation > anyway because the group membership of new threads is determined by > the group the parent belongs to. > >> 2) Aesthetic distaste for using file-system abstractions > > It's not that but more about what the file-system interface implies. > It's not just different. It breaks a lot of expectations a lot of > application visible kernel interface provides as explained above. > There are reasons why we usually don't do things this way. The arguments you've made above are largely centered on permissions and the right to make modifications. I don't see what other expectations you believe are being broken here. This still feels like an aesthetic objection. > > ... >> >> 1) It depends on "anchor" threads to define groupings. >> > >> > So does cgroupfs. Any kind of thread or process grouping can't escape >> > that as soon as things start forking and if things don't fork whether >> > something is anchor or not doesn't make much difference. >> >> The difference is that this ignores how applications are actually written: > > It does require the applications to follow certain protocols to > organize itself but this is a pretty trivial thing to do and comes > with the benefit that we don't need to introduce a completely new > grouping concept to applications. I strongly disagree here: Applications today do _not_ use sub-process clone hierarchies today. As a result, this _is_ introducing a completely new grouping concept because it's one applications have never cared about outside of a shell implementation. > >> A container that is independent of its members (e.g. a cgroup >> directory) can be created and configured by an application's Init() or >> within the construction of a data-structure that will use it without >> dependency on those resources yet being used. >> >> As an example: >> The resources associated with thread pools are often dynamically >> managed. What you're proposing means that some initialization must >> now be moved into the first thread that pool creates (as opposed to >> the pool's initilization), that synchronization and identification of >> this thread is now required, and that it must be treated differently >> to other threads in the pool (it can no longer be reclaimed). > > That should be like a two hour job for most applications. This is a > trivial thing to do. It's difficult for me to consider the difficulty > of doing this a major decision point. > You are seriously underestimating the complexity and API overhead this introduces. It cannot be claimed trivial and discarded; it's not. >> >> 2) It does not allow thread-level hierarchies to be created >> > >> > Huh? That's the only thing it would do. This obviously wouldn't get >> > applied to processes. It's strictly about threads. >> >> This allows a single *partition*, not a hierarchy. As machines >> become larger, so are many of the processes we run on them. 
These >> larger processes manage resources between threads on scales that we >> would previously partition between processes. > > I don't get it. Why wouldn't it allow hierarchy? "- If $TID isn't already a resource group leader, it creates a sub-cgroup, sets $KEY to $VAL and moves $PID and all its descendants to it. - If $TID is already a resource group leader, set $KEY to $VAL." This only allows resource groups at the root level to be created. There is no way to make $TID2 a resource group leader, parented by $TID1. > >> >> 3) When coordination with an external agent is desired this defines no >> >> common interface that can be shared. Directories are an extremely >> >> useful container. Are you proposing applications would need to >> >> somehow publish the list of anchor-threads from (1)? >> > >> > Again, this is an invariant no matter what we do. As I wrote numerous >> > times in this thread, this information is only known to the process >> > itself. If an external agent want to manipulate these from outside, >> > it just has to know which thread is doing what. The difference is >> > that this doesn't require the process itself to coordinate with >> > external agent when operating on itself. >> >> Nothing about what was previously state would require any coordination >> with the process and an external agent when operating on itself. >> What's the basis for this claim? > > I hope this is explained now. See above regarding permissions. > >> This also ignores the cases previously discussed in which the external >> agent is providing state for threads within a process to attach to. >> An example of this is repeated below. >> >> This isn't even covering that this requires the invention of entirely >> new user-level APIs and coordination for somehow publishing these >> magic tids. > > We already have those tids. External management applications do not. This was covering that would now need a new API to handle their publishing. Whereas using the VFS handles this naturally. > >> >> What if I want to set up state that an application will attaches >> >> threads to [consider cpuset example above]? >> > >> > It feels like we're running in circles. Process-level control stays >> > the same. That part is not an issue. Thread-level control requires >> > cooperation from the process itself no matter what and should stay >> > confined to the limits imposed on the process as a whole. >> > >> > Frankly, cpuset example doesn't make much sense to me because there is >> > nothing hierarchical about it and it isn't even layered properly. Can >> > you describe what you're actually trying to achieve? But no matter >> > the specifities of the example, it's almost trivial to achieve >> > whatever end results. >> >> This has been previously detailed, repeating it here: >> >> Machines are shared resources, we partition the available cpus into >> shared and private sets. These sets are dynamic as when a new >> application arrives requesting private cpus, we must reserve some cpus >> that were previously shared. >> >> We use sub-cpusets to advertise to applications which of their cpus >> are shared and which are private. They can then attach threads to >> these containers -- which are dynamically updated as cores shift >> between public and private configurations. > > I see but you can easily do that the other way too, right? Let the > applications publish where they put their threads and let the external > entity set configs on them. And what API controls the right to do this? 
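To make the quoted $TID semantics and the nesting objection easier to follow, here is a hypothetical sketch. Nothing in it is a real interface: set_resource() and its signature are invented for illustration, and "cpu.weight" is used only as an example key.

	/*
	 * Hypothetical sketch only: set_resource() does not exist as a
	 * syscall; it is stubbed here so the example compiles.
	 */
	#include <sys/types.h>

	static int set_resource(pid_t tid, const char *key, const char *val)
	{
		(void)tid; (void)key; (void)val;
		return -1;	/* under the proposal this would be a kernel call */
	}

	static void sketch(pid_t pool_leader, pid_t worker)
	{
		/*
		 * pool_leader is not yet a resource group leader: per the
		 * quoted wording, this creates a sub-group, applies the
		 * setting and moves pool_leader and its descendants into it.
		 */
		set_resource(pool_leader, "cpu.weight", "200");

		/*
		 * Point of contention: Paul reads the quoted wording as only
		 * creating groups at one level; Tejun clarifies later in the
		 * thread that a new group for a TID would nest under the
		 * group that TID is already in.
		 */
		set_resource(worker, "cpu.weight", "50");
	}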
> >> >> 4) How is the cgroup property to $KEY translation defined? This feels >> >> like an ioctl and no more natural than the file-system. It also does >> > >> > How are they even comparable? Sure ioctl inputs are variable-formed >> > and its definitions aren't closely scrutinized but other than those >> > it's a programmable system-call interface and how programs use and >> > interact with them is completely different from how a program >> > interacts with cgroupfs. >> >> They're exactly comparable in that every cgroup.<property> api now >> needs some magic equivalent $KEY defined. I don't understand how >> you're proposing these would be generated or managed. > > Not everything. Just the ones which make sense in-process. This is > exactly the process we need to go through when introducing new > syscalls. Why is this a surprise? We want to scrutinize them, hard. I'm talking only about the control->$KEY mapping. Yes it would be a subset, but this seems a large step back in usability. > >> > It doesn't have to parse out the path, >> > compose the knob path, open and format the data into it >> >> There's nothing hard about this. Further, we already have to do >> exactly this at the process level; which means abstractions for this > > I'm not following. Why would it need to do that already? Because the process-level interface will continue to work the way it does today. That means we still need to implement these operations. This same library code could be shared for applications to use on their private, sub-process, controls. > >> already exist; removing this property does not change their presence >> of requirement, but instead means they must be duplicated for the >> in-thread case. >> >> Even ignoring that the libraries for this can be shared between thread >> and process, this is also generally easier to work with than magic >> $KEY values. > > This is like saying syscalls are worse in terms of progammability > compared to opening and writing formatted strings for setting > attributes. If that's what you're saying, let's just agree to disgree > on this one. The goal of such a system is as much administration as it is a programmable interface. There's a reason much configuration is specified by sysctls and not syscalls. > >> > all the while >> > not being sure whether the file it's operating on is even the right >> > one anymore or the sub-hierarchcy it's assuming is still there. >> >> One possible resolution to this has been proposed several times: >> Have the sub-process hierarchy exposed in an independent and fixed location. >> >> >> not seem to resolve your concerns regarding races; the application >> >> must still coordinate internally when concurrently calling >> >> set_resource(). >> > >> > I have no idea where you're going with this. When did the internal >> > synchronization inside a process become an issue? Sure, if a thread >> > does *(int *)=0, we can't protect other threads from it. Also, why >> > would it be a problem? If two perform set_resource() on the same >> > thread, one will be executed after the other. What are you talking >> > about? >> >> It was my impression that you'd had atomicity concerns regarding >> file-system operations such as writes for updates previously. If you >> have no concerns within a sub-processes operation then this can be >> discarded. > > That's comparing apples and oranges. Threads being moved around and > hierarchies changing beneath them present a whole different issues > than someone else setting an attribute to a different value. 
The > operations might fail, might set properties on the wrong group. > There are no differences between using VFS and your proposed API for this. >> >> 5) How does an external agent coordinate when a resource must be >> >> removed from a sub-hierarchy? >> > >> > That sort of restriction should generally be put at the higher level. >> > Thread-level resource control should be cooperative with the >> > application if at all necessary and in those cases just set the limit >> > on the sub-hierarchy would work. >> > >> >> Could you expand on how you envision this being cooperative? This >> seems tricky to me, particularly when limits are involved. >> >> How do I even arbitrate which external agents are allowed control? > > I think we're talking past each other. If you wanna put restrictions > on the process as whole, do it at the higher level. If you wanna > fiddle with in-process resource distribution, you just have to assume > that the application itself is cooperative or at least not malicious. > No matter what external entities try to do, the application can > circumvent because that's what ultimately determines the grouping. I think you misunderstood here. What I'm saying is equivalently: - How do I bless a 'good' external agent to be allowed to make modificaitons - How do I make sure a malicious external process is not able to make modifications > >> So I was really trying to make sure we covered the interface problems >> we're trying to solve here. Are there major ones not listed there? >> >> However, I strongly disagree with this statement. It is much easier >> for applications to work with named abstract objects then having magic >> threads that it must track and treat specially. > > How is that different? Sure, the name is created by the threads but > once you set the resource, the tid would be the resource group ID and > the thread can go away. It's still an object named by an ID. Huh?? If the thread goes away, then the tid can be re-used -- within the same process. Now you have non-unique IDs to operate on?? > The > only difference is that the process of creating the hierarchy is tied > to the process that threads are created in. > >> My implementation must now look like this: >> 1) I instantiate some abstraction which uses cgroups. >> 2) In construction I must now coordinate with my chosen threading >> implementation (an exciting new dependency) to create a new thread and >> get its tid. This thread must exist for as long as the associated >> data-structure. I must pay a kernel stack, at least one page of >> thread stack and however much TLS I've declared in my real threads. >> 3) In destruction I must now wake and free the thread created in (2). >> 4) If I choose to treat it as a real thread, I must be careful, this >> thread is special and cannot be arbitrarily released like other >> threads. >> 5) To do anything I must go grok the documentation to look up the >> magic $KEY. If I get this wrong I am going to have a fun time >> debugging it since things are no longer reasonably inspect-able. If I >> must work with a cgroup that adds features over time things are even >> more painful since $KEY may or may not exist. >> >> Is any of the above unfair with respect to what you've described above? > > Yeah, as I wrote above. > >> This isn't even beginning to consider the additional pain that a >> language implementing its own run-time such as Go might incur. > > Yeap, it does require userland runtime to have a way to make the > thread creation history visible to the operating system. 
It doesn't > look like a big price. Again, I'm looking for a balance. I know that the current API charges a minimal price here. I strongly believe that what you're proposing carries a significant price. > > You're citing inconveniences from userland side and yeah I get that. > Making things more rigid and static requires some adjustments from > userland but we gain from it too. No need to worry about structural > inconsistencies and the varied failure modes which can cascade from > that. See below. > > If the only possible solution is C++-esque everything-goes way, sure, > we'll have to do that but that's not the case. We can implement and > provide the core functionality in a more controlled manner. > >> Option B: >> We expose sub-process hierarchies via /proc/self/cgroups or similar. >> They do not appear within the process only cgroup hierarchy. >> Only the same user (or a privileged one) has access to this internal >> hierarchy. This can be arbitrarily restricted further. >> Applications continue to use almost exactly the same cgroup >> interfaces that exist today, however, the problem of path computation >> and non-stable paths are now eliminated. >> >> Really, what problems does this not solve? >> >> It eliminates the unstable mount point, your concerns regarding >> external entity manipulation, and allows for the parent processes to >> be moved. It provides a reasonable place for coordination to occur, >> with standard mechanisms for access control. It allows for state to >> be easily inspected, it does not require new documentation, allows the >> creation of sub-hierarchies, does not require special threads. >> >> This was previously raised as a straw man, but I have not yet seen or >> thought of good arguments against it. > > It allows for structural inconsistencies where applications can end up > performing operations which are non-sensical. Breaking that invariant > is substantial. Why would we do that if Can you please provide an example? I don't know what inconsistencies you mean here. In particular, I do not see anything that your proposed interface resolves versus this; while being _significantly_ simpler for applications to use and implement. > > Can we at least agree that we're now venturing into an area where > things aren't really critical? The core functionality here is being > able to hierarchically categorize threads and assign resource limits > to them. Can we agree that the minimum core functionality is met in > both approaches? I'm not sure entirely how to respond here. I am deeply concerned that the API you're proposing is not tenable for providing this core functionality. I worry that you're introducing serious new challenges and too quickly discarding them as manageable. ^ permalink raw reply [flat|nested] 92+ messages in thread
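For reference, the path-composition sequence both sides keep returning to ("parse out the path, compose the knob path, open and format the data into it") looks roughly like this from inside a process. This is a sketch assuming a unified hierarchy mounted at /sys/fs/cgroup and an illustrative knob name; it is also exactly the sequence whose raciness is debated above, since the group can be moved between the read and the write.

	/*
	 * Sketch: resolve the caller's own cgroup from /proc/self/cgroup
	 * and write a formatted value to one of its knobs.
	 */
	#include <stdio.h>

	static int set_own_cgroup_knob(const char *knob, const char *val)
	{
		char line[512], path[512], cg[256] = "";
		FILE *f = fopen("/proc/self/cgroup", "r");
		int ret;

		if (!f)
			return -1;
		/* the unified-hierarchy entry looks like "0::/parent/child" */
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "0::%255[^\n]", cg) == 1)
				break;
		}
		fclose(f);
		if (!cg[0])
			return -1;

		snprintf(path, sizeof(path), "/sys/fs/cgroup%s/%s", cg, knob);
		f = fopen(path, "w");
		if (!f)
			return -1;
		ret = fputs(val, f) < 0 ? -1 : 0;
		fclose(f);
		return ret;
	}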
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-15 11:42 ` Paul Turner @ 2015-10-23 22:21 ` Tejun Heo 2015-10-24 4:36 ` Mike Galbraith 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-10-23 22:21 UTC (permalink / raw) To: Paul Turner Cc: Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Paul. On Thu, Oct 15, 2015 at 04:42:37AM -0700, Paul Turner wrote: > > The thing which bothers me the most is that cpuset behavior is > > different from global case for no good reason. > > I've tried to explain above that I believe there are reasonable > reasons for it working the way it does from an interface perspective. > I do not think they can be so quickly discarded out of hand. However, > I think we should continue winnowing focus and first resolve the model > of interaction for sub-process hierarchies, One way or the other, I think the kernel needs to sort out how task affinity masks are handled when the available CPUs change, be that from CPU hotplug or cpuset config changes. On forcing all affinity masks to the set of available CPUs, I'm still not convinced that it's a useful extra behavior to implement for cpuset, especially given that the same can be achieved from userland without too much difficulty. This goes back to the argument for implementing the minimal set of functionality which can be used as building blocks. Updating all task affinity masks is an irreversible destructive operation. It doesn't enable anything which can't be done otherwise but does end up restricting how the feature can be used. But yeah, let's shelve this subject for now. > > Now, if you make the in-process grouping dynamic and accessible to > > external entities (and if we aren't gonna do that, why even bother?), > > this breaks down and we have some of the same problems we have with > > allowing applications to directly manipulate cgroup sub-directories. > > This is a fundamental problem. Setting attributes can be shared but > > organization is an exclusive process. You can't share that without > > close coordination. > > Your concern here is centered on permissions, not the interface. > > This seems directly remedied by exactly: > Any sub-process hierarchy we exposed would be locked down in terms > of write access. These would not be generally writable. You're > absolutely correct that you can't share without close coordination, > and granting the appropriate permissions is part of that. It is not about permissions. It is about designing an interface which guarantees a certain set of invariants regardless of privileges - even root can't violate such invariants short of injecting code into and modifying the behavior of the target process. This isn't anything unusual. In fact, permission based access control is something which is added if and only if allowing and controlling accesses from multiple parties is necessary and needs to be explicitly justified. If coordination in terms of thread hierarchy organization from the target process is needed for allowing external entities to twiddle with resource distribution, no capability is lost by making the organization solely the responsibility of the target process while gaining a lot stronger set of behavioral invariants. I can't see strong enough justifications for allowing external entities to manipulate in-process thread organization. 
> > assigning the full responsiblity of in-process organization to the > > application itself and tying it to static parental relationship allows > > for solid common grounds where these resource operations can be > > performed by different entities without causing structural issues just > > like other similar operations. > > But cases have already been presented above where the full > responsibility cannot be delegated to the application. Because we > explicitly depend on constraints being provided by the external > environment. I don't think such cases have been presented. The only thing necessary is the target processes organizing threads in a way which allows external agents to apply external constraints. > > It's not that but more about what the file-system interface implies. > > It's not just different. It breaks a lot of expectations a lot of > > application visible kernel interface provides as explained above. > > There are reasons why we usually don't do things this way. > > The arguments you've made above are largely centered on permissions > and the right to make modifications. I don't see what other > expectations you believe are being broken here. This still feels like > an aesthetic objection. I hope my points are clear by now. > > It does require the applications to follow certain protocols to > > organize itself but this is a pretty trivial thing to do and comes > > with the benefit that we don't need to introduce a completely new > > grouping concept to applications. > > I strongly disagree here: Applications today do _not_ use sub-process > clone hierarchies today. As a result, this _is_ introducing a > completely new grouping concept because it's one applications have > never cared about outside of a shell implementation. It is a logical extension of how the kernel organizes processes in the system. It's a lot more native to how programs usually interact with the system than meddling with a pseudo file system. > > That should be like a two hour job for most applications. This is a > > trivial thing to do. It's difficult for me to consider the difficulty > > of doing this a major decision point. > > You are seriously underestimating the complexity and API overhead this > introduces. It cannot be claimed trivial and discarded; it's not. You're exaggerating. Requiring applications to organize threads according to, most likely, their logical roles, is not an unreasonable burden for enabling hierarchical resource control. While I can understand the reluctance for users who are currently making use of task-granular cgroups, please realize that we're trying to introduce a whole new class of features directly visible to applications. Future usages will vastly outnumber that of the current cgroup hack. In addition, it's not like the current users are required to migrate immediately. > "- If $TID isn't already a resource group leader, it creates a > sub-cgroup, sets $KEY to $VAL and moves $PID and all its descendants > to it. > > - If $TID is already a resource group leader, set $KEY to $VAL." > > This only allows resource groups at the root level to be created. > There is no way to make $TID2 a resource group leader, parented by > $TID1. I probably should have written it better but obviously a new resource group for $TID would be nested under the resource group $TID is already in. > > We already have those tids. > > External management applications do not. This was covering that would > now need a new API to handle their publishing. Whereas using the VFS > handles this naturally. 
I suppose you're suggesting that naming conventions in the per-process cgroup hierarchy can be used as a mechanism to carry such information, am I right? If so, it's trivial to solve. Just let the application tag the TID based resource groups with an integer or string identifying hints. > > I see but you can easily do that the other way too, right? Let the > > applications publish where they put their threads and let the external > > entity set configs on them. > > And what API controls the right to do this? Exactly the same as prlimit(2)? In fact, while details will dictate what will happen exactly, we might even just extend prlimit(2) instead of introducing completely new syscalls. Please not that the fact that prlimit(2) can be so easily referred to is not an accident. This is because what's being proposed is a natural extension of the model the kernel already uses. > > Not everything. Just the ones which make sense in-process. This is > > exactly the process we need to go through when introducing new > > syscalls. Why is this a surprise? We want to scrutinize them, hard. > > I'm talking only about the control->$KEY mapping. Yes it would be a > subset, but this seems a large step back in usability. I don't understand. This is introducing a whole new set of syscalls to be used by applications and we *need* to scrutinize and restrict what's being exposed. Furthermore, as there are inherent differences in system management interface and application programming interface, we should filter what's to be exposed to individual applications regardless of the specific mechanism for the interface. For example, it doesn't make any sense to expose "cgroup.procs" or "release_agent" on in-process interface. It'd be a step back in usability only for users who have been using cgroups in fringing ways which can't be justified for ratification and we do want to actively filter those out. It may cause a short-term pain for some but the whole thing is an a lot larger problem. Let's please think long term. > > I'm not following. Why would it need to do that already? > > Because the process-level interface will continue to work the way it > does today. That means we still need to implement these operations. > > This same library code could be shared for applications to use on > their private, sub-process, controls. This doesn't make any sense. The reason why cgroup users need low level access libraries is because the file system interface is too unwiedly to program directly against. The fact the system management interface requires such library can't possibly be an argument against the kernel providing a programmable interface to applications. > > This is like saying syscalls are worse in terms of progammability > > compared to opening and writing formatted strings for setting > > attributes. If that's what you're saying, let's just agree to disgree > > on this one. > > The goal of such a system is as much administration as it is a > programmable interface. There's a reason much configuration is > specified by sysctls and not syscalls. And there are reasons why individual applications usually don't program directly against sysctl or other system management interfaces. It's the kernel's job to provide abstractions so that those two spheres can be separated reasonably. We don't want system management meddling with thread organization of applications. That's the application's domain. Applying attributes on top sure can be done from outside. > > That's comparing apples and oranges. 
Threads being moved around and > > hierarchies changing beneath them present a whole different issues > > than someone else setting an attribute to a different value. The > > operations might fail, might set properties on the wrong group. > > There are no differences between using VFS and your proposed API for this. I hope this part is clear now. > I think you misunderstood here. What I'm saying is equivalently: > - How do I bless a 'good' external agent to be allowed to make modificaitons > - How do I make sure a malicious external process is not able to make > modifications I'm lost why these are even being asked. Why would it be any different from other syscalls which manipulate similar attributes? > > How is that different? Sure, the name is created by the threads but > > once you set the resource, the tid would be the resource group ID and > > the thread can go away. It's still an object named by an ID. > > Huh?? If the thread goes away, then the tid can be re-used -- within > the same process. Now you have non-unique IDs to operate on?? The TID can be pinned on group creation or we can track thread hierarchy (while collapsing irrelevant dead parts) to allow setting attributes on siblings instead. These are details which can be fleshed out as design and implementation progresses. Let's please concentrate on the general approach for now. > > It allows for structural inconsistencies where applications can end up > > performing operations which are non-sensical. Breaking that invariant > > is substantial. Why would we do that if > > Can you please provide an example? I don't know what inconsistencies > you mean here. In particular, I do not see anything that your > proposed interface resolves versus this; while being _significantly_ > simpler for applications to use and implement. The fact that in-process hierarchy can be manipulated by external entities, regardless of permissions, means that the organization can be changed underneath the application in a way which can cause various failures and unexpected behaviors when the application later on performs operations assuming the original organization. > > Can we at least agree that we're now venturing into an area where > > things aren't really critical? The core functionality here is being > > able to hierarchically categorize threads and assign resource limits > > to them. Can we agree that the minimum core functionality is met in > > both approaches? > > I'm not sure entirely how to respond here. I am deeply concerned that > the API you're proposing is not tenable for providing this core > functionality. I worry that you're introducing serious new challenges > and too quickly discarding them as manageable. The capability to obtain here is allowing threads of a process to be organized hierarchically and controlling resource distribution along that hierarchy. I'm asking whether you agree that such core capability can be obtained in both approaches. I think you're underestimating the gravity of adding a whole new set of interfaces to be used by applications. This is something which will be with us decades later. I can understand the reluctance coming for the existing users; however, in perspective, that is not a concern that we can or should hinge major decisions on, so I beg you to take a step back from immediate concerns and take a longer-term look at the problem. 
Also, while holding off the v2 interface for the cpu controller is an understandable method of exerting political (I don't mean it in a derogatory way) pressure on resolving the in-process resource management issue, I don't think our specific disagreements affect the system-level interface in any way. Given the size of the problem, implementing a proper solution for this problem will likely take quite a while even after we agree on the approach. As, AFAICS, there aren't technical reasons to hold back the v2 interface, can we please proceed there? I promise to keep working on in-process resource distribution to the best of my abilities. It's something I want to solve anyway. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
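Since prlimit(2) is cited above as the model for the access-control question, here is how the existing call is used today; the extension to resource-group attributes floated in this message does not exist, so only the current syscall is shown.

	/*
	 * prlimit(2) as it exists: one process adjusting a limit of another.
	 * The kernel enforces the permission model (matching UIDs/GIDs or
	 * CAP_SYS_RESOURCE), which is the analogy Tejun draws for any
	 * future per-thread resource-group interface.
	 */
	#define _GNU_SOURCE
	#include <sys/resource.h>
	#include <sys/types.h>

	/* External agent raises the file-descriptor limit of another process. */
	static int raise_nofile(pid_t pid, rlim_t lim)
	{
		struct rlimit new_lim = { .rlim_cur = lim, .rlim_max = lim };

		return prlimit(pid, RLIMIT_NOFILE, &new_lim, NULL);
	}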
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-23 22:21 ` Tejun Heo @ 2015-10-24 4:36 ` Mike Galbraith 2015-10-25 2:18 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Mike Galbraith @ 2015-10-24 4:36 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Sat, 2015-10-24 at 07:21 +0900, Tejun Heo wrote: > It'd be a step back in usability only for users who have been using > cgroups in fringing ways which can't be justified for ratification and > we do want to actively filter those out. Of all the cgroup signal currently in existence, seems the Google signal has to have the most volume under the curve by a mile. If you were to filter that signal out, what remained would be a flat line of noise. -Mike ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-24 4:36 ` Mike Galbraith @ 2015-10-25 2:18 ` Tejun Heo 2015-10-25 3:43 ` Mike Galbraith 2015-10-25 3:54 ` Linus Torvalds 0 siblings, 2 replies; 92+ messages in thread From: Tejun Heo @ 2015-10-25 2:18 UTC (permalink / raw) To: Mike Galbraith Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Mike. On Sat, Oct 24, 2015 at 06:36:07AM +0200, Mike Galbraith wrote: > On Sat, 2015-10-24 at 07:21 +0900, Tejun Heo wrote: > > > It'd be a step back in usability only for users who have been using > > cgroups in fringing ways which can't be justified for ratification and > > we do want to actively filter those out. > > Of all the cgroup signal currently in existence, seems the Google signal > has to have the most volume under the curve by a mile. If you were to > filter that signal out, what remained would be a flat line of noise. This is a weird direction to take the discussion, but let me provide a counter argument. Google sure is an important user of the kernel and likely the most extensive user of cgroup. At the same time, its kernel efforts are largely in service of a few very big internal customers which are in control of large part of the entire software stack. The things that are important for general audience of the kernel in the long term don't necessarily coincide with what such efforts need or want. I'd even venture to say as much as the inputs coming out of google are interesting and important, they're also a lot more prone to lock-in effects to short term solutions and status-quo given their priorities. This is not to denigrate google's kernel efforts but just to counter-balance "it's google" as a shortcut for proper technical discussions. There are good reasons why cgroup is the design disaster as it is now and chasing each usage scenario and hack which provides the least immediate resistance without paying the effort to extract the actual requirements and common solutions is an important one. It is critical to provide back-pressure for long-term thinking and solutions; otherwise, we're bound to repeat the errors and end up with something which everyone loves to hate. We definitely need to weigh the inputs from heavy users but also need to discern the actual problems which need to be solved from the specific mechanisms chosen to solve them. Let's please keep the discussions technical. That's the best way to reach a viable long-term solution which can benefit a lot wider audience in the long term. Even though that might not be the path of least immediate resistance, I believe that google will be an eventual beneficiary too. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-25 2:18 ` Tejun Heo @ 2015-10-25 3:43 ` Mike Galbraith 2015-10-27 3:16 ` Tejun Heo 2015-10-25 3:54 ` Linus Torvalds 1 sibling, 1 reply; 92+ messages in thread From: Mike Galbraith @ 2015-10-25 3:43 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Sun, 2015-10-25 at 11:18 +0900, Tejun Heo wrote: > Hello, Mike. > > On Sat, Oct 24, 2015 at 06:36:07AM +0200, Mike Galbraith wrote: > > On Sat, 2015-10-24 at 07:21 +0900, Tejun Heo wrote: > > > > > It'd be a step back in usability only for users who have been using > > > cgroups in fringing ways which can't be justified for ratification and > > > we do want to actively filter those out. > > > > Of all the cgroup signal currently in existence, seems the Google signal > > has to have the most volume under the curve by a mile. If you were to > > filter that signal out, what remained would be a flat line of noise. > > This is a weird direction to take the discussion, but let me provide a > counter argument. I don't think it's weird, it's just a thought wrt where pigeon holing could lead: If you filter out current users who do so in a manner you consider to be in some way odd, when all the filtering is done, you may find that you've filtered out the vast majority of current deployment. > Google sure is an important user of the kernel and likely the most > extensive user of cgroup. At the same time, its kernel efforts are > largely in service of a few very big internal customers which are in > control of large part of the entire software stack. The things that > are important for general audience of the kernel in the long term > don't necessarily coincide with what such efforts need or want. I'm not at all sure of this, but I suspect that SUSE's gigabuck size cgroup power user will land in the same "fringe" pigeon hole. If so, that would be another sizeable dent in volume. My point is that these power users likely _are_ your general audience. > I'd even venture to say as much as the inputs coming out of google are > interesting and important, they're also a lot more prone to lock-in > effects to short term solutions and status-quo given their priorities. > This is not to denigrate google's kernel efforts but just to > counter-balance "it's google" as a shortcut for proper technical > discussions. > > There are good reasons why cgroup is the design disaster as it is now > and chasing each usage scenario and hack which provides the least > immediate resistance without paying the effort to extract the actual > requirements and common solutions is an important one. It is critical > to provide back-pressure for long-term thinking and solutions; > otherwise, we're bound to repeat the errors and end up with something > which everyone loves to hate. > > We definitely need to weigh the inputs from heavy users but also need > to discern the actual problems which need to be solved from the > specific mechanisms chosen to solve them. Let's please keep the > discussions technical. Sure, it was just a thought wrt "actively filter those out" and who all "those" may end up being. -Mike ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-25 3:43 ` Mike Galbraith @ 2015-10-27 3:16 ` Tejun Heo 2015-10-27 5:42 ` Mike Galbraith 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-10-27 3:16 UTC (permalink / raw) To: Mike Galbraith Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, Mike. On Sun, Oct 25, 2015 at 04:43:33AM +0100, Mike Galbraith wrote: > I don't think it's weird, it's just a thought wrt where pigeon holing > could lead: If you filter out current users who do so in a manner you > consider to be in some way odd, when all the filtering is done, you may > find that you've filtered out the vast majority of current deployment. I think you misunderstood what I wrote. It's not about excluding existing odd use cases. It's about examining the usages and extracting the required capabilities and building an interface which is well defined and blends well with the rest of programming interface provided by the kernel so that those can be achieved in a saner way. If doing acrobatics with the current interface is necessary to achieve certain capabilities, we need to come up with a better interface for those. If fringe usages can be satisfied using better constructs, we should implement that and encourage transition to a better mechanism. > I'm not at all sure of this, but I suspect that SUSE's gigabuck size > cgroup power user will land in the same "fringe" pigeon hole. If so, > that would be another sizeable dent in volume. > > My point is that these power users likely _are_ your general audience. Sure, that doesn't mean we shouldn't scrutinize the interface we implement to support those users. Also, cgroup definitely had a negative spiral effect where eccentric mechanisms and interfaces discouraged wider general usage, fortifying the argument that "we're the main users", which in turn fed back into even weirder things being added. Everybody, including the "heavy" users, suffers from such failures in the long term. We sure want to support all the valid use cases from heavy users in a reasonable way but that doesn't mean we say yes to everything. > Sure, it was just a thought wrt "actively filter those out" and who all > "those" may end up being. I hope what I meant is clearer now. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-27 3:16 ` Tejun Heo @ 2015-10-27 5:42 ` Mike Galbraith 2015-10-27 5:46 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Mike Galbraith @ 2015-10-27 5:42 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Tue, 2015-10-27 at 12:16 +0900, Tejun Heo wrote: > Hello, Mike. > > On Sun, Oct 25, 2015 at 04:43:33AM +0100, Mike Galbraith wrote: > > I don't think it's weird, it's just a thought wrt where pigeon holing > > could lead: If you filter out current users who do so in a manner you > > consider to be in some way odd, when all the filtering is done, you may > > find that you've filtered out the vast majority of current deployment. > > I think you misunderstood what I wrote. It's not about excluding > existing odd use cases. It's about examining the usages and > extracting the required capabilities and building an interface which > is well defined and blends well with the rest of programming interface > provided by the kernel so that those can be achieved in a saner way. Sure, sounds fine, I just fervently hope that the below is foul swamp gas having nothing what so ever to do with your definition of "saner". http://www.linuxfoundation.org/news-media/blogs/browse/2013/08/all-about-linux-kernel-cgroup%E2%80%99s-redesign http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ I'm not into begging. I really don't want to have to ask anyone to pretty please do for me what I can currently do all by my little self without having to give a rats ass less whether what I want to do fits in the world view of this or that obnoxious little control freak. -Mike ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-27 5:42 ` Mike Galbraith @ 2015-10-27 5:46 ` Tejun Heo 2015-10-27 5:56 ` Mike Galbraith 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-10-27 5:46 UTC (permalink / raw) To: Mike Galbraith Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton Hello, On Tue, Oct 27, 2015 at 06:42:11AM +0100, Mike Galbraith wrote: > Sure, sounds fine, I just fervently hope that the below is foul swamp > gas having nothing what so ever to do with your definition of "saner". lol, idk, you keep taking things in weird directions. Let's just stay technical, okay? > http://www.linuxfoundation.org/news-media/blogs/browse/2013/08/all-about-linux-kernel-cgroup%E2%80%99s-redesign > > http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ > > I'm not into begging. I really don't want to have to ask anyone to > pretty please do for me what I can currently do all by my little self > without having to give a rats ass less whether what I want to do fits in > the world view of this or that obnoxious little control freak. Well, if you think certain things are being missed, please speak up. Not in some media campaign way but with technical reasoning and justifications. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-27 5:46 ` Tejun Heo @ 2015-10-27 5:56 ` Mike Galbraith 2015-10-27 6:00 ` Tejun Heo 0 siblings, 1 reply; 92+ messages in thread From: Mike Galbraith @ 2015-10-27 5:56 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Tue, 2015-10-27 at 14:46 +0900, Tejun Heo wrote: > Hello, > > On Tue, Oct 27, 2015 at 06:42:11AM +0100, Mike Galbraith wrote: > > Sure, sounds fine, I just fervently hope that the below is foul swamp > > gas having nothing what so ever to do with your definition of "saner". > > lol, idk, you keep taking things in weird directions. Let's just stay > technical, okay? > > > http://www.linuxfoundation.org/news-media/blogs/browse/2013/08/all-about-linux-kernel-cgroup%E2%80%99s-redesign > > > > http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ > > > > I'm not into begging. I really don't want to have to ask anyone to > > pretty please do for me what I can currently do all by my little self > > without having to give a rats ass less whether what I want to do fits in > > the world view of this or that obnoxious little control freak. > > Well, if you think certain things are being missed, please speak up. > Not in some media campaign way but with technical reasoning and > justifications. Inserting a middle-man is extremely unlikely to improve performance. -Mike ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-27 5:56 ` Mike Galbraith @ 2015-10-27 6:00 ` Tejun Heo 2015-10-27 6:08 ` Mike Galbraith 0 siblings, 1 reply; 92+ messages in thread From: Tejun Heo @ 2015-10-27 6:00 UTC (permalink / raw) To: Mike Galbraith Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Tue, Oct 27, 2015 at 06:56:42AM +0100, Mike Galbraith wrote: > > Well, if you think certain things are being missed, please speak up. > > Not in some media campaign way but with technical reasoning and > > justifications. > > Inserting a middle-man is extremely unlikely to improve performance. I'm not following you at all. Technical reasoning and justifications is a middle-man? I don't think anything productive is likely to come out of this conversation. Let's just end this sub-thread. Thanks. -- tejun ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-27 6:00 ` Tejun Heo @ 2015-10-27 6:08 ` Mike Galbraith 0 siblings, 0 replies; 92+ messages in thread From: Mike Galbraith @ 2015-10-27 6:08 UTC (permalink / raw) To: Tejun Heo Cc: Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, lizefan, cgroups, LKML, kernel-team, Linus Torvalds, Andrew Morton On Tue, 2015-10-27 at 15:00 +0900, Tejun Heo wrote: > On Tue, Oct 27, 2015 at 06:56:42AM +0100, Mike Galbraith wrote: > > > Well, if you think certain things are being missed, please speak up. > > > Not in some media campaign way but with technical reasoning and > > > justifications. > > > > Inserting a middle-man is extremely unlikely to improve performance. > > I'm not following you at all. Technical reasoning and justifications > is a middle-man? No, user <-> systemd or whatever <-> kernel ^^^^^^^^^^^^^^^^^^^ > I don't think anything productive is likely to come out of this > conversation. Let's just end this sub-thread. Agreed. -Mike ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-25 2:18 ` Tejun Heo 2015-10-25 3:43 ` Mike Galbraith @ 2015-10-25 3:54 ` Linus Torvalds 2015-10-25 9:33 ` Ingo Molnar 1 sibling, 1 reply; 92+ messages in thread From: Linus Torvalds @ 2015-10-25 3:54 UTC (permalink / raw) To: Tejun Heo Cc: Mike Galbraith, Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, Li Zefan, cgroups, LKML, kernel-team, Andrew Morton On Sun, Oct 25, 2015 at 11:18 AM, Tejun Heo <tj@kernel.org> wrote: > > We definitely need to weigh the inputs from heavy users but also need > to discern the actual problems which need to be solved from the > specific mechanisms chosen to solve them. Let's please keep the > discussions technical. That's the best way to reach a viable > long-term solution which can benefit a lot wider audience in the long > term. Even though that might not be the path of least immediate > resistance, I believe that google will be an eventual beneficiary too. So here's a somewhat odd request I got to hear very recently (at LinuxCon EU in Ireland).. At least some game engine writers apparently would like to be able to set scheduling priorities for threads within a single process, because they may want the game as a whole to have a certain priority, but then some of the threads are critical for latency and may want certain guaranteed resources (eg audio or actual gameplay) while others are very much background things (garbage collection etc). I suspect that's a very non-google use. We apparently don't really support that kind of per-thread model right now at all. Do they want cgroups? Maybe not. You can apparently do something like this under Windows and OS X, but not under Linux (and I'm reporting second-hand here, I don't know the exact details). I'm just bringing it up as a somewhat unusual non-server thing that is certainly very relevant despite being different. Linus ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-25 3:54 ` Linus Torvalds @ 2015-10-25 9:33 ` Ingo Molnar 2015-10-25 10:41 ` Theodore Ts'o 0 siblings, 1 reply; 92+ messages in thread From: Ingo Molnar @ 2015-10-25 9:33 UTC (permalink / raw) To: Linus Torvalds Cc: Tejun Heo, Mike Galbraith, Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, Li Zefan, cgroups, LKML, kernel-team, Andrew Morton * Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, Oct 25, 2015 at 11:18 AM, Tejun Heo <tj@kernel.org> wrote: > > > > We definitely need to weigh the inputs from heavy users but also need to > > discern the actual problems which need to be solved from the specific > > mechanisms chosen to solve them. Let's please keep the discussions technical. > > That's the best way to reach a viable long-term solution which can benefit a > > lot wider audience in the long term. Even though that might not be the path > > of least immediate resistance, I believe that google will be an eventual > > beneficiary too. > > So here's a somewhat odd request I got to hear very recently (at LinuxCon EU in > Ireland).. > > At least some game engine writers apparently would like to be able to set > scheduling priorities for threads within a single process, because they may want > the game as a whole to have a certain priority, but then some of the threads are > critical for latency and may want certain guaranteed resources (eg audio or > actual gameplay) while others are very much background things (garbage > collection etc). > > I suspect that's a very non-google use. We apparently don't really support that > kind of per-thread model right now at all. Hm, that's weird - all our sched_*() system call APIs that set task scheduling priorities are fundamentally per thread, not per process. Same goes for the old sys_nice() interface. The scheduler has no real notion of 'process', and certainly not at the system call level. This was always so and is expected to remain so in the future as well - and this is unrelated to cgroups. > Do they want cgroups? Maybe not. You can apparently do something like this under > Windows and OS X, but not under Linux (and I'm reporting second-hand here, I > don't know the exact details). I'm just bringing it up as a somewhat unusual > non-server thing that is certainly very relevant despite being different. So I'd really like to hear about specifics, and they might be banging on open doors! Thanks, Ingo ^ permalink raw reply [flat|nested] 92+ messages in thread
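To make Ingo's point concrete, here is a minimal sketch (not from the thread; it assumes root or CAP_SYS_NICE for the SCHED_FIFO part, and the thread roles are purely illustrative). sched_setscheduler() with a pid of 0 changes the policy of the calling thread only, so a latency-critical thread can be promoted while the rest of the process keeps the default policy:

  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  static void *audio_thread(void *arg)
  {
          struct sched_param sp = { .sched_priority = 10 };

          /* pid 0 means "the calling thread"; only this task changes class */
          if (sched_setscheduler(0, SCHED_FIFO, &sp))
                  perror("sched_setscheduler");

          /* ... latency-critical work ... */
          return NULL;
  }

  static void *gc_thread(void *arg)
  {
          /* background work, stays SCHED_OTHER at the default priority */
          return NULL;
  }

  int main(void)
  {
          pthread_t audio, gc;

          pthread_create(&audio, NULL, audio_thread, NULL);
          pthread_create(&gc, NULL, gc_thread, NULL);
          pthread_join(audio, NULL);
          pthread_join(gc, NULL);
          return 0;
  }

Nothing cgroup-specific is needed for this part of the use case; the thread's open question is whether cgroup-style resource control should also be available at thread granularity.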
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-25 9:33 ` Ingo Molnar @ 2015-10-25 10:41 ` Theodore Ts'o 2015-10-25 10:47 ` Florian Weimer 0 siblings, 1 reply; 92+ messages in thread From: Theodore Ts'o @ 2015-10-25 10:41 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Tejun Heo, Mike Galbraith, Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, Li Zefan, cgroups, LKML, kernel-team, Andrew Morton On Sun, Oct 25, 2015 at 10:33:32AM +0100, Ingo Molnar wrote: > > Hm, that's weird - all our sched_*() system call APIs that set task scheduling > priorities are fundamentally per thread, not per process. Same goes for the old > sys_nice() interface. The scheduler has no real notion of 'process', and certainly > not at the system call level. > I suspect the main issue is that the games programmers were trying to access it via libc / pthreads, which hides a lot of the power available at the raw syscall level. This is probably more of a "tutorial needed for userspace programmers" issue, at a guess. - Ted ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-25 10:41 ` Theodore Ts'o @ 2015-10-25 10:47 ` Florian Weimer 2015-10-25 11:58 ` Theodore Ts'o 0 siblings, 1 reply; 92+ messages in thread From: Florian Weimer @ 2015-10-25 10:47 UTC (permalink / raw) To: Theodore Ts'o, Ingo Molnar, Linus Torvalds, Tejun Heo, Mike Galbraith, Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, Li Zefan, cgroups, LKML, kernel-team, Andrew Morton On 10/25/2015 11:41 AM, Theodore Ts'o wrote: > On Sun, Oct 25, 2015 at 10:33:32AM +0100, Ingo Molnar wrote: >> >> Hm, that's weird - all our sched_*() system call APIs that set task scheduling >> priorities are fundamentally per thread, not per process. Same goes for the old >> sys_nice() interface. The scheduler has no real notion of 'process', and certainly >> not at the system call level. >> > > I suspect the main issue is that the games programmers were trying to > access it via libc / pthreads, which hides a lot of the power > available at the raw syscall level. This is probably more of a > "tutorial needed for userspace programmers" issue, at a guess. If this refers to the lack of exposure of thread IDs in glibc, we are willing to change that on glibc side. The discussion has progressed to the point where it is now about the question whether it should be part of the GNU API (like sched_setaffinity), or live in glibc as a Linux-specific extension (like sched_getcpu). More input is certainly welcome. Old concerns about support for n:m threading implementations in glibc are no longer relevant because too much code using well-documented interfaces would break. Florian ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-25 10:47 ` Florian Weimer @ 2015-10-25 11:58 ` Theodore Ts'o 2015-10-25 13:17 ` Florian Weimer 0 siblings, 1 reply; 92+ messages in thread From: Theodore Ts'o @ 2015-10-25 11:58 UTC (permalink / raw) To: Florian Weimer Cc: Ingo Molnar, Linus Torvalds, Tejun Heo, Mike Galbraith, Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, Li Zefan, cgroups, LKML, kernel-team, Andrew Morton On Sun, Oct 25, 2015 at 11:47:04AM +0100, Florian Weimer wrote: > On 10/25/2015 11:41 AM, Theodore Ts'o wrote: > > On Sun, Oct 25, 2015 at 10:33:32AM +0100, Ingo Molnar wrote: > >> > >> Hm, that's weird - all our sched_*() system call APIs that set task scheduling > >> priorities are fundamentally per thread, not per process. Same goes for the old > >> sys_nice() interface. The scheduler has no real notion of 'process', and certainly > >> not at the system call level. > >> > > > > I suspect the main issue is that the games programmers were trying to > > access it via libc / pthreads, which hides a lot of the power > > available at the raw syscall level. This is probably more of a > > "tutorial needed for userspace programmers" issue, at a guess. > > If this refers to the lack of exposure of thread IDs in glibc, we are > willing to change that on glibc side. The discussion has progressed to > the point where it is now about the question whether it should be part > of the GNU API (like sched_setaffinity), or live in glibc as a > Linux-specific extension (like sched_getcpu). More input is certainly > welcome. Well, I was thinking we could just teach them to use "syscall(SYS_gettid)". On a different subject, I'm going to start telling people to use "syscall(SYS_getrandom)", since I think that's going to be easier than having asking people to change their Makefiles to link against some Linux-specific library, but that's a different debate, and I recognize the glibc folks aren't willing to bend on that one. Cheers, - Ted ^ permalink raw reply [flat|nested] 92+ messages in thread
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-25 11:58 ` Theodore Ts'o @ 2015-10-25 13:17 ` Florian Weimer 2015-10-25 13:40 ` Getrandom wrapper Theodore Ts'o 2015-10-26 14:10 ` [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy Peter Zijlstra 0 siblings, 2 replies; 92+ messages in thread From: Florian Weimer @ 2015-10-25 13:17 UTC (permalink / raw) To: Theodore Ts'o, Ingo Molnar, Linus Torvalds, Tejun Heo, Mike Galbraith, Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, Li Zefan, cgroups, LKML, kernel-team, Andrew Morton On 10/25/2015 12:58 PM, Theodore Ts'o wrote: > Well, I was thinking we could just teach them to use > "syscall(SYS_gettid)". Right, and that's easier if TIDs are officially part of the GNU API. I think the worry is that some future system might have TIDs which do not share the PID space, or are real descriptors (that they need explicit open and close operations). > On a different subject, I'm going to start telling people to use > "syscall(SYS_getrandom)", since I think that's going to be easier than > having asking people to change their Makefiles to link against some > Linux-specific library, but that's a different debate, and I recognize > the glibc folks aren't willing to bend on that one. I think we can reach consensus for an implementation which makes this code unsigned char session_key[32]; getrandom (session_key, sizeof (session_key), 0); install_session_key (session_key); correct. That is, no error handling code for ENOMEM, ENOSYS, EINTR, ENOMEM or short reads is necessary. It seems that several getrandom wrappers currently built into applications do not get this completely right. Florian ^ permalink raw reply [flat|nested] 92+ messages in thread
* Getrandom wrapper 2015-10-25 13:17 ` Florian Weimer @ 2015-10-25 13:40 ` Theodore Ts'o 2015-10-26 13:32 ` Florian Weimer 2015-10-26 14:10 ` [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy Peter Zijlstra 1 sibling, 1 reply; 92+ messages in thread From: Theodore Ts'o @ 2015-10-25 13:40 UTC (permalink / raw) To: Florian Weimer Cc: Ingo Molnar, Linus Torvalds, Tejun Heo, Mike Galbraith, Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, Li Zefan, cgroups, LKML, kernel-team, Andrew Morton On Sun, Oct 25, 2015 at 02:17:23PM +0100, Florian Weimer wrote: > > I think we can reach consensus for an implementation which makes this code > > unsigned char session_key[32]; > getrandom (session_key, sizeof (session_key), 0); > install_session_key (session_key); > > correct. That is, no error handling code for ENOMEM, ENOSYS, EINTR, > ENOMEM or short reads is necessary. It seems that several getrandom > wrappers currently built into applications do not get this completely right. The only error handling code that is necessary is a fallback for ENOSYS. getrandom(2) won't return ENOMEM, and if the number of bytes requested is less than or equal to 256 bytes, it won't return EINTR either. If the user requests more than 256 bytes, they're doing something insane and almost certainly not cryptographic, and so letting it be interruptible should be fine. (OpenBSD will outright *fail* a request greater than 256 bytes with an EIO error in their getentropy(2) system call. But that means the insane application won't get any randomness at all in their overly large, insane request, and if they're that insane, they're probably not checking error conditions either.) As far as ENOSYS is concerned, a fallback gets tricky; you could try to open /dev/urandom, and read from it, but that can fail due to EMFILE, ENFILE, ENOENT (if they are chrooted and /dev wasn't properly populated). So attempting a fallback for ENOSYS can actually expand the number of potential error conditions for the userspace application to (fail to) handle. I suppose you could attempt the fallback and call abort(2) if the fallback fails, which is probably the safe and secure thing to do, but applications might not appreciate getting terminated without getting a chance to do something (but if the something is just calling random(3), maybe not giving them a chance to do something insane is the appropriate thing to do....) - Ted ^ permalink raw reply [flat|nested] 92+ messages in thread
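For illustration, a rough sketch of the kind of wrapper being discussed (an assumption about how it could be structured, not glibc code; it expects SYS_getrandom to be defined by the installed kernel headers): retry on EINTR, fall back to /dev/urandom only on ENOSYS, and abort if even the fallback cannot be set up:

  #include <errno.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static void get_random_bytes_or_die(unsigned char *p, size_t len)
  {
          while (len) {
                  long n = syscall(SYS_getrandom, p, len, 0);

                  if (n > 0) {
                          p += n;
                          len -= n;
                          continue;
                  }
                  if (n < 0 && errno == EINTR)
                          continue;
                  if (n < 0 && errno == ENOSYS)
                          break;          /* old kernel, use the fallback below */
                  abort();                /* unexpected failure */
          }

          if (!len)
                  return;

          /* ENOSYS fallback; this is where EMFILE/ENFILE/ENOENT can show up */
          int fd = open("/dev/urandom", O_RDONLY);

          if (fd < 0)
                  abort();
          while (len) {
                  ssize_t n = read(fd, p, len);

                  if (n > 0) {
                          p += n;
                          len -= n;
                  } else if (n < 0 && errno == EINTR) {
                          continue;
                  } else {
                          abort();        /* EOF or hard error */
                  }
          }
          close(fd);
  }

Whether aborting is acceptable is exactly the policy question above; a library wrapper might prefer to report an error instead of terminating the caller.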
* Re: Getrandom wrapper 2015-10-25 13:40 ` Getrandom wrapper Theodore Ts'o @ 2015-10-26 13:32 ` Florian Weimer 0 siblings, 0 replies; 92+ messages in thread From: Florian Weimer @ 2015-10-26 13:32 UTC (permalink / raw) To: Theodore Ts'o, Ingo Molnar, Linus Torvalds, Tejun Heo, Mike Galbraith, Paul Turner, Peter Zijlstra, Ingo Molnar, Johannes Weiner, Li Zefan, cgroups, LKML, kernel-team, Andrew Morton On 10/25/2015 02:40 PM, Theodore Ts'o wrote: > On Sun, Oct 25, 2015 at 02:17:23PM +0100, Florian Weimer wrote: >> >> I think we can reach consensus for an implementation which makes this code >> >> unsigned char session_key[32]; >> getrandom (session_key, sizeof (session_key), 0); >> install_session_key (session_key); >> >> correct. That is, no error handling code for ENOMEM, ENOSYS, EINTR, >> ENOMEM or short reads is necessary. It seems that several getrandom >> wrappers currently built into applications do not get this completely right. > > The only error handling code that is necessary is a fallback for > ENOSYS. getrandom(2) won't return ENOMEM, and if the number of bytes > requested is less than or equal to 256 bytes, it won't return EINTR > either. Not even during early boot? The code suggests that you can get EINTR if the non-blocking pool isn't initialized yet. With VMs, that initialization can happen quite some time after boot, when the userland is well under way. > As far as ENOSYS is concerned, a fallback gets tricky; you could try > to open /dev/urandom, and read from it, but that can fail due to > EMFILE, ENFILE, ENOENT (if they are chrooted and /dev wasn't properly > populated). So attempting a fallback for ENOSYS can actually expand > the number of potential error conditions for the userspace application > to (fail to) handle. I suppose you could attempt the fallback and > call abort(2) if the fallback fails, which is probably the safe and > secure thing to do, but applications might not appreciate getting > terminated without getting a chance to do something (but if the > something is just calling random(3), maybe not giving them a chance to > do something insane is the appropriate thing to do....) I'm more worried that the fallback code could be triggered unexpectedly on some obscure code path that is not tested regularly, and runs into a failure. I suspect a high-quality implementation of getrandom would have to open /dev/random and /dev/urandom when the getrandom symbol is resolved, and report failure at that point, to avoid late surprises. Florian ^ permalink raw reply [flat|nested] 92+ messages in thread
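One way to get the fail-early behaviour Florian describes might look like the following sketch (again an assumption about structure, not glibc code): probe for the syscall once at startup and pre-open the fallback device, so a chroot without /dev or a file descriptor shortage is reported immediately rather than on a rarely exercised path:

  #include <errno.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int urandom_fd = -1;
  static int have_getrandom;

  __attribute__((constructor))
  static void rng_init(void)
  {
          char c;

          /*
           * Zero-byte probe with GRND_NONBLOCK (0x0001): success or EAGAIN
           * both mean the syscall exists; only ENOSYS means it does not.
           */
          if (syscall(SYS_getrandom, &c, 0, 0x0001) >= 0 || errno != ENOSYS) {
                  have_getrandom = 1;
                  return;
          }

          urandom_fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
          if (urandom_fd < 0)
                  abort();        /* fail loudly at startup, not later */
  }

The actual read path would then consult have_getrandom and urandom_fd; a resolver-time (IFUNC-style) variant of the same idea is what the message above seems to be hinting at.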
* Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy 2015-10-25 13:17 ` Florian Weimer 2015-10-25 13:40 ` Getrandom wrapper Theodore Ts'o @ 2015-10-26 14:10 ` Peter Zijlstra 1 sibling, 0 replies; 92+ messages in thread From: Peter Zijlstra @ 2015-10-26 14:10 UTC (permalink / raw) To: Florian Weimer Cc: Theodore Ts'o, Ingo Molnar, Linus Torvalds, Tejun Heo, Mike Galbraith, Paul Turner, Ingo Molnar, Johannes Weiner, Li Zefan, cgroups, LKML, kernel-team, Andrew Morton On Sun, Oct 25, 2015 at 02:17:23PM +0100, Florian Weimer wrote: > On 10/25/2015 12:58 PM, Theodore Ts'o wrote: > > > Well, I was thinking we could just teach them to use > > "syscall(SYS_gettid)". > > Right, and that's easier if TIDs are officially part of the GNU API. > > I think the worry is that some future system might have TIDs which do > not share the PID space, or are real descriptors (that they need > explicit open and close operations). For the scheduler the sharing of pid/tid space is not an issue. Semantically all [1] scheduler syscalls take a tid. There isn't a single syscall that iterates the thread group. Even sys_setpriority() interprets its @who argument as a tid when @which == PRIO_PROCESS (PRIO_PGRP looks to be the actual process). [1] as seen from: git grep SYSCALL kernel/sched/ ^ permalink raw reply [flat|nested] 92+ messages in thread
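As a small illustration of the behaviour Peter describes (Linux-specific, and at the time gettid had no glibc wrapper, hence the raw syscall): setpriority() with PRIO_PROCESS and a thread id adjusts the nice value of that single thread, not the whole thread group. A sketch:

  #include <stdio.h>
  #include <sys/resource.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Renice one thread, identified by its kernel tid. */
  static int renice_thread(pid_t tid, int nice_val)
  {
          /* On Linux, PRIO_PROCESS + tid acts on that thread only. */
          return setpriority(PRIO_PROCESS, tid, nice_val);
  }

  int main(void)
  {
          pid_t tid = syscall(SYS_gettid);        /* equals getpid() in the main thread */

          if (renice_thread(tid, 10))
                  perror("setpriority");
          return 0;
  }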
* [PATCH v2 3/3] sched: Implement interface for cgroup unified hierarchy 2015-08-03 22:41 ` [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy Tejun Heo 2015-08-04 9:07 ` Peter Zijlstra @ 2015-08-04 19:32 ` Tejun Heo 1 sibling, 0 replies; 92+ messages in thread From: Tejun Heo @ 2015-08-04 19:32 UTC (permalink / raw) To: mingo, peterz; +Cc: hannes, lizefan, cgroups, linux-kernel, kernel-team From f85c07ea11a52068c45cdd5f5528ed7c842c833a Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Tue, 4 Aug 2015 15:24:08 -0400 While the cpu controller doesn't have any functional problems, there are a couple of interface issues which can be addressed in the v2 interface. * cpuacct being a separate controller. This separation is artificial and rather pointless as demonstrated by most use cases co-mounting the two controllers. It also forces certain information to be accounted twice. * Use of different time units. Writable control knobs use microseconds, some stat fields use nanoseconds while other cpuacct stat fields use centiseconds. * Control knobs which can't be used in the root cgroup still show up in the root. * Control knob names and semantics aren't consistent with other controllers. This patch implements the cpu controller's interface on the unified hierarchy which adheres to the controller file conventions described in Documentation/cgroups/unified-hierarchy.txt. Overall, the following changes are made. * cpuacct is implicitly enabled and disabled by cpu and its information is reported through "cpu.stat" which now uses microseconds for all time durations. All time duration fields now have "_usec" appended to them for clarity. While this doesn't solve the double accounting immediately, once the majority of users switch to v2, cpu can directly account and report the relevant stats and cpuacct can be disabled on the unified hierarchy. Note that cpuacct.usage_percpu is currently not included in "cpu.stat". If this information is actually called for, it can be added later. * "cpu.shares" is replaced with "cpu.weight" and operates on the standard scale defined by CGROUP_WEIGHT_MIN/DFL/MAX (1, 100, 10000). The weight is scaled to scheduler weight so that 100 maps to 1024 and the ratio relationship is preserved - if weight is W and its scaled value is S, W / 100 == S / 1024. While the mapped range is a bit smaller than the original scheduler weight range, the dead zones on both sides are relatively small and cover a wider range than the nice value mappings. This file doesn't make sense in the root cgroup and isn't created on root. * "cpu.cfs_quota_us" and "cpu.cfs_period_us" are replaced by "cpu.max" which contains both quota and period. * "cpu.rt_runtime_us" and "cpu.rt_period_us" are replaced by "cpu.rt.max" which contains both runtime and period. v2: cpu_stats_show() was incorrectly using CONFIG_FAIR_GROUP_SCHED for CFS bandwidth stats and also using raw division for u64. Use CONFIG_CFS_BANDWIDTH and do_div() instead. The semantics of "cpu.rt.max" is not fully decided yet. Dropped for now. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> --- Hello, Fixed build issues for certain configs and removed cpu.rt.max for now. The git branch has been updated accordingly. git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-sched-unified-intf Thanks.
Documentation/cgroups/unified-hierarchy.txt | 53 +++++++++++ kernel/sched/core.c | 140 ++++++++++++++++++++++++++++ kernel/sched/cpuacct.c | 24 +++++ kernel/sched/cpuacct.h | 5 + 4 files changed, 222 insertions(+) diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt index 1ee9caf..09b4a4e 100644 --- a/Documentation/cgroups/unified-hierarchy.txt +++ b/Documentation/cgroups/unified-hierarchy.txt @@ -30,6 +30,7 @@ CONTENTS 5-4-1. blkio 5-4-2. cpuset 5-4-3. memory + 5-4-4. cpu, cpuacct 6. Planned Changes 6-1. CAP for resource control @@ -537,6 +538,58 @@ may be specified in any order and not all pairs have to be specified. memory.low, memory.high, and memory.max will use the string "max" to indicate and set the highest possible value. +5-4-4. cpu, cpuacct + +- cpuacct is no longer an independent controller. It's implicitly + enabled by cpu and its information is reported in cpu.stat. + +- All time durations, including all stats, are now in microseconds. + +- The interface is updated as follows. + + cpu.stat + + Currently reports the following six stats. All time stats are + in microseconds. + + usage_usec + user_usec + system_usec + nr_periods + nr_throttled + throttled_usec + + cpu.weight + + The weight setting. The weight is between 1 and 10000 and + defaults to 100. + + This file is available only on non-root cgroups. + + cpu.max + + The maximum bandwidth setting. It's in the following format. + + $MAX $PERIOD + + which indicates that the group may consume upto $MAX in each + $PERIOD duration. "max" for $MAX indicates no limit. If only + one number is written, $MAX is updated. + + This file is available only on non-root cgroups. + + cpu.rt.max + + The maximum realtime runtime setting. It's in the following + format. + + $MAX $PERIOD + + which indicates that the group may consume upto $MAX in each + $PERIOD duration. If only one number is written, $MAX is + updated. + + 6. Planned Changes 6-1. CAP for resource control diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6137037..1e72cdd 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8438,6 +8438,138 @@ static struct cftype cpu_legacy_files[] = { { } /* terminate */ }; +static int cpu_stats_show(struct seq_file *sf, void *v) +{ + cpuacct_cpu_stats_show(sf); + +#ifdef CONFIG_CFS_BANDWIDTH + { + struct task_group *tg = css_tg(seq_css(sf)); + struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; + u64 throttled_usec; + + throttled_usec = cfs_b->throttled_time; + do_div(throttled_usec, NSEC_PER_USEC); + + seq_printf(sf, "nr_periods %d\n" + "nr_throttled %d\n" + "throttled_usec %llu\n", + cfs_b->nr_periods, cfs_b->nr_throttled, + throttled_usec); + } +#endif + return 0; +} + +#ifdef CONFIG_FAIR_GROUP_SCHED +static u64 cpu_weight_read_u64(struct cgroup_subsys_state *css, + struct cftype *cft) +{ + struct task_group *tg = css_tg(css); + u64 weight = scale_load_down(tg->shares); + + return DIV_ROUND_CLOSEST_ULL(weight * CGROUP_WEIGHT_DFL, 1024); +} + +static int cpu_weight_write_u64(struct cgroup_subsys_state *css, + struct cftype *cftype, u64 weight) +{ + /* + * cgroup weight knobs should use the common MIN, DFL and MAX + * values which are 1, 100 and 10000 respectively. While it loses + * a bit of range on both ends, it maps pretty well onto the shares + * value used by scheduler and the round-trip conversions preserve + * the original value over the entire range. 
+ */ + if (weight < CGROUP_WEIGHT_MIN || weight > CGROUP_WEIGHT_MAX) + return -ERANGE; + + weight = DIV_ROUND_CLOSEST_ULL(weight * 1024, CGROUP_WEIGHT_DFL); + + return sched_group_set_shares(css_tg(css), scale_load(weight)); +} +#endif + +static void __maybe_unused cpu_period_quota_print(struct seq_file *sf, + long period, long quota) +{ + if (quota < 0) + seq_puts(sf, "max"); + else + seq_printf(sf, "%ld", quota); + + seq_printf(sf, " %ld\n", period); +} + +/* caller should put the current value in *@periodp before calling */ +static int __maybe_unused cpu_period_quota_parse(char *buf, + u64 *periodp, u64 *quotap) +{ + char tok[21]; /* U64_MAX */ + + if (!sscanf(buf, "%s %llu", tok, periodp)) + return -EINVAL; + + *periodp *= NSEC_PER_USEC; + + if (sscanf(tok, "%llu", quotap)) + *quotap *= NSEC_PER_USEC; + else if (!strcmp(tok, "max")) + *quotap = RUNTIME_INF; + else + return -EINVAL; + + return 0; +} + +#ifdef CONFIG_CFS_BANDWIDTH +static int cpu_max_show(struct seq_file *sf, void *v) +{ + struct task_group *tg = css_tg(seq_css(sf)); + + cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg)); + return 0; +} + +static ssize_t cpu_max_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct task_group *tg = css_tg(of_css(of)); + u64 period = tg_get_cfs_period(tg); + u64 quota; + int ret; + + ret = cpu_period_quota_parse(buf, &period, &quota); + if (!ret) + ret = tg_set_cfs_bandwidth(tg, period, quota); + return ret ?: nbytes; +} +#endif + +static struct cftype cpu_files[] = { + { + .name = "stat", + .seq_show = cpu_stats_show, + }, +#ifdef CONFIG_FAIR_GROUP_SCHED + { + .name = "weight", + .flags = CFTYPE_NOT_ON_ROOT, + .read_u64 = cpu_weight_read_u64, + .write_u64 = cpu_weight_write_u64, + }, +#endif +#ifdef CONFIG_CFS_BANDWIDTH + { + .name = "max", + .flags = CFTYPE_NOT_ON_ROOT, + .seq_show = cpu_max_show, + .write = cpu_max_write, + }, +#endif + { } /* terminate */ +}; + struct cgroup_subsys cpu_cgrp_subsys = { .css_alloc = cpu_cgroup_css_alloc, .css_free = cpu_cgroup_css_free, @@ -8448,7 +8580,15 @@ struct cgroup_subsys cpu_cgrp_subsys = { .attach = cpu_cgroup_attach, .exit = cpu_cgroup_exit, .legacy_cftypes = cpu_legacy_files, + .dfl_cftypes = cpu_files, .early_init = 1, +#ifdef CONFIG_CGROUP_CPUACCT + /* + * cpuacct is enabled together with cpu on the unified hierarchy + * and its stats are reported through "cpu.stat". + */ + .depends_on = 1 << cpuacct_cgrp_id, +#endif }; #endif /* CONFIG_CGROUP_SCHED */ diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 42b2dd5..b4d32a6 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -224,6 +224,30 @@ static struct cftype files[] = { { } /* terminate */ }; +/* used to print cpuacct stats in cpu.stat on the unified hierarchy */ +void cpuacct_cpu_stats_show(struct seq_file *sf) +{ + struct cgroup_subsys_state *css; + u64 usage, user, sys; + + css = cgroup_get_e_css(seq_css(sf)->cgroup, &cpuacct_cgrp_subsys); + + usage = cpuusage_read(css, seq_cft(sf)); + cpuacct_stats_read(css_ca(css), &user, &sys); + + user *= TICK_NSEC; + sys *= TICK_NSEC; + do_div(usage, NSEC_PER_USEC); + do_div(user, NSEC_PER_USEC); + do_div(sys, NSEC_PER_USEC); + + seq_printf(sf, "usage_usec %llu\n" + "user_usec %llu\n" + "system_usec %llu\n", usage, user, sys); + + css_put(css); +} + /* * charge this task's execution time to its accounting group.
* diff --git a/kernel/sched/cpuacct.h b/kernel/sched/cpuacct.h index ed60562..44eace9 100644 --- a/kernel/sched/cpuacct.h +++ b/kernel/sched/cpuacct.h @@ -2,6 +2,7 @@ extern void cpuacct_charge(struct task_struct *tsk, u64 cputime); extern void cpuacct_account_field(struct task_struct *p, int index, u64 val); +extern void cpuacct_cpu_stats_show(struct seq_file *sf); #else @@ -14,4 +15,8 @@ cpuacct_account_field(struct task_struct *p, int index, u64 val) { } +static inline void cpuacct_cpu_stats_show(struct seq_file *sf) +{ +} + #endif -- 2.4.3 ^ permalink raw reply related [flat|nested] 92+ messages in thread
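Not part of the patch, but a userspace sketch of how the new knobs are meant to be used (assuming a unified-hierarchy mount at /sys/fs/cgroup and an already-created child group named "build"; both are assumptions here): cpu.weight takes a value between 1 and 10000 (200 corresponds to roughly 2048 in cpu.shares terms), cpu.max takes "$MAX $PERIOD" in microseconds, and cpu.stat reports the merged cpuacct numbers in microseconds:

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int write_knob(const char *path, const char *val)
  {
          int fd = open(path, O_WRONLY);
          ssize_t n;

          if (fd < 0)
                  return -1;
          n = write(fd, val, strlen(val));
          close(fd);
          return n < 0 ? -1 : 0;
  }

  int main(void)
  {
          char buf[256];
          ssize_t n;
          int fd;

          /* twice the default weight of 100 */
          write_knob("/sys/fs/cgroup/build/cpu.weight", "200");

          /* at most 50ms of CPU time in every 100ms period */
          write_knob("/sys/fs/cgroup/build/cpu.max", "50000 100000");

          /* usage_usec, user_usec, system_usec, nr_periods, ... */
          fd = open("/sys/fs/cgroup/build/cpu.stat", O_RDONLY);
          if (fd >= 0) {
                  n = read(fd, buf, sizeof(buf) - 1);
                  if (n > 0) {
                          buf[n] = '\0';
                          fputs(buf, stdout);
                  }
                  close(fd);
          }
          return 0;
  }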