* [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-04  0:54 ` Zefan Li
  0 siblings, 0 replies; 43+ messages in thread
From: Zefan Li @ 2015-05-04  0:54 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Tejun Heo, LKML, Cgroups

It's allowed to promote a task from normal to realtime after it has been
attached to a non-root cgroup, but it will fail if the attaching happens
after it has become realtime. I don't see how this restriction is useful.

We are moving toward unified hierarchy where all the cgroup controllers
are bound together, so it would make cgroups easier to use if we have less
restrictions on attaching tasks between cgroups.

Signed-off-by: Zefan Li <lizefan@huawei.com>
---
 kernel/sched/core.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe22f75..55c21f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7800,6 +7800,11 @@ static int sched_rt_global_constraints(void)
 
 	return ret;
 }
+
+static int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
+{
+	return 1;
+}
 #endif /* CONFIG_RT_GROUP_SCHED */
 
 static int sched_dl_global_validate(void)
@@ -8002,16 +8007,9 @@ static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
 {
 	struct task_struct *task;
 
-	cgroup_taskset_for_each(task, tset) {
-#ifdef CONFIG_RT_GROUP_SCHED
+	cgroup_taskset_for_each(task, tset)
 		if (!sched_rt_can_attach(css_tg(css), task))
 			return -EINVAL;
-#else
-		/* We don't support RT-tasks being in separate groups */
-		if (task->sched_class != &fair_sched_class)
-			return -EINVAL;
-#endif
-	}
 	return 0;
 }
 
-- 
1.9.1
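
For readers following along, the effect of the change on
!CONFIG_RT_GROUP_SCHED kernels can be modelled in user space roughly as
below. This is only a sketch: the model_* names and the two-value class
enum are made up for illustration and are not kernel symbols.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's scheduling classes. */
enum model_class { MODEL_FAIR, MODEL_RT };

struct model_task {
	enum model_class sched_class;
};

/* Old !CONFIG_RT_GROUP_SCHED check: only fair-class tasks may attach. */
static int model_can_attach_old(const struct model_task *tasks, int n)
{
	for (int i = 0; i < n; i++)
		if (tasks[i].sched_class != MODEL_FAIR)
			return -EINVAL;
	return 0;
}

/* Stub corresponding to the one the patch adds: every task is accepted. */
static int model_rt_can_attach(const struct model_task *tsk)
{
	(void)tsk;
	return 1;
}

/* New check: the same loop now serves both configurations. */
static int model_can_attach_new(const struct model_task *tasks, int n)
{
	for (int i = 0; i < n; i++)
		if (!model_rt_can_attach(&tasks[i]))
			return -EINVAL;
	return 0;
}
```

With one fair task and one RT task in the set, the old check returns
-EINVAL and the new one returns 0.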


* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-04  3:13   ` Mike Galbraith
  0 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2015-05-04  3:13 UTC (permalink / raw)
  To: Zefan Li; +Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, LKML, Cgroups

On Mon, 2015-05-04 at 08:54 +0800, Zefan Li wrote:
> It's allowed to promote a task from normal to realtime after it has been
> attached to a non-root cgroup, but it will fail if the attaching happens
> after it has become realtime. I don't see how this restriction is useful.

In the CONFIG_RT_GROUP_SCHED case, promotion will fail if there is no
bandwidth allocated.

> We are moving toward unified hierarchy where all the cgroup controllers
> are bound together, so it would make cgroups easier to use if we have less
> restrictions on attaching tasks between cgroups.

Forcing group scheduling overhead on users if they want cpuset or memory
cgroup functionality would be far from wonderful.  Am I interpreting the
implications of this unification/binding properly?

(I hope not, surely the plan is not to utterly _destroy_ cgroup utility)

	-Mike


* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-04  4:39     ` Zefan Li
  0 siblings, 0 replies; 43+ messages in thread
From: Zefan Li @ 2015-05-04  4:39 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, LKML, Cgroups

On 2015/5/4 11:13, Mike Galbraith wrote:
> On Mon, 2015-05-04 at 08:54 +0800, Zefan Li wrote:
>> It's allowed to promote a task from normal to realtime after it has been
>> attached to a non-root cgroup, but it will fail if the attaching happens
>> after it has become realtime. I don't see how this restriction is useful.
> 
> In the CONFIG_RT_GROUP_SCHED case, promotion will fail if there is no
> bandwidth allocated.
> 

Right. I forgot to mention that this patch affects only !CONFIG_RT_GROUP_SCHED,
though that should be obvious from reading the change.

>> We are moving toward unified hierarchy where all the cgroup controllers
>> are bound together, so it would make cgroups easier to use if we have less
>> restrictions on attaching tasks between cgroups.
> 
> Forcing group scheduling overhead on users if they want cpuset or memory
> cgroup functionality would be far from wonderful.  Am I interpreting the
> implications of this unification/binding properly?
> 
> (I hope not, surely the plan is not to utterly _destroy_ cgroup utility)
> 

Some degree of flexibility is provided so that you may disable some controllers
in a subtree. For example:

root                  ---> child1
(cpuset,memory,cpu)        (cpuset,memory)
                      \
                       \-> child2
                           (cpu)


* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-04  5:10       ` Mike Galbraith
  0 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2015-05-04  5:10 UTC (permalink / raw)
  To: Zefan Li; +Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, LKML, Cgroups

On Mon, 2015-05-04 at 12:39 +0800, Zefan Li wrote:

> >> We are moving toward unified hierarchy where all the cgroup controllers
> >> are bound together, so it would make cgroups easier to use if we have less
> >> restrictions on attaching tasks between cgroups.
> > 
> > Forcing group scheduling overhead on users if they want cpuset or memory
> > cgroup functionality would be far from wonderful.  Am I interpreting the
> > implications of this unification/binding properly?
> > 
> > (I hope not, surely the plan is not to utterly _destroy_ cgroup utility)
> > 
> 
> Some degree of flexibility is provided so that you may disable some controllers
> in a subtree. For example:
> 
> root                  ---> child1
> (cpuset,memory,cpu)        (cpuset,memory)
>                       \
>                        \-> child2
>                            (cpu)

Whew, that's a relief.  Thanks.

	-Mike



* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-04  5:39         ` Mike Galbraith
  0 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2015-05-04  5:39 UTC (permalink / raw)
  To: Zefan Li; +Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, LKML, Cgroups

On Mon, 2015-05-04 at 07:10 +0200, Mike Galbraith wrote:
> On Mon, 2015-05-04 at 12:39 +0800, Zefan Li wrote:
> 
> > >> We are moving toward unified hierarchy where all the cgroup controllers
> > >> are bound together, so it would make cgroups easier to use if we have less
> > >> restrictions on attaching tasks between cgroups.
> > > 
> > > Forcing group scheduling overhead on users if they want cpuset or memory
> > > cgroup functionality would be far from wonderful.  Am I interpreting the
> > > implications of this unification/binding properly?
> > > 
> > > (I hope not, surely the plan is not to utterly _destroy_ cgroup utility)
> > > 
> > 
> > Some degree of flexibility is provided so that you may disable some controllers
> > in a subtree. For example:
> > 
> > root                  ---> child1
> > (cpuset,memory,cpu)        (cpuset,memory)
> >                       \
> >                        \-> child2
> >                            (cpu)
> 
> Whew, that's a relief.  Thanks.

But somehow I'm not feeling a whole lot better.

"May" means if you don't explicitly take some action to disable group
scheduling, you get it (I don't care if I have an off button), but that
would also seemingly mean that we would then have rt tasks in taskgroups
with no bandwidth allocated, ie you have to make group scheduling for rt
tasks meaningless until a bandwidth appeared, and to make bandwidth
appear, you'd have to stop the world, distribute, continue, no?

The current "just say no" seems a lot more sensible.

	-Mike


* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-04  9:11           ` Zefan Li
  0 siblings, 0 replies; 43+ messages in thread
From: Zefan Li @ 2015-05-04  9:11 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, LKML, Cgroups

>>> Some degree of flexibility is provided so that you may disable some controllers
>>> in a subtree. For example:
>>>
>>> root                  ---> child1
>>> (cpuset,memory,cpu)        (cpuset,memory)
>>>                       \
>>>                        \-> child2
>>>                            (cpu)
>>
>> Whew, that's a relief.  Thanks.
> 
> But somehow I'm not feeling a whole lot better.
> 
> "May" means if you don't explicitly take some action to disable group
> scheduling, you get it (I don't care if I have an off button), but that
> would also seemingly mean that we would then have rt tasks in taskgroups
> with no bandwidth allocated, ie you have to make group scheduling for rt
> tasks meaningless until a bandwidth appeared, and to make bandwidth
> appear, you'd have to stop the world, distribute, continue, no?
> 
> The current "just say no" seems a lot more sensible.
> 

I just realized we allow removing/adding controllers from/to cgroups
while there are tasks in them, which isn't safe unless we eliminate all
can_attach callbacks. We've done so for some cgroup subsystems, but
there are still a few of them...


* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-04 12:08             ` Mike Galbraith
  0 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2015-05-04 12:08 UTC (permalink / raw)
  To: Zefan Li; +Cc: Ingo Molnar, Peter Zijlstra, Tejun Heo, LKML, Cgroups

On Mon, 2015-05-04 at 17:11 +0800, Zefan Li wrote:
> >>> Some degree of flexibility is provided so that you may disable some controllers
> >>> in a subtree. For example:
> >>>
> >>> root                  ---> child1
> >>> (cpuset,memory,cpu)        (cpuset,memory)
> >>>                       \
> >>>                        \-> child2
> >>>                            (cpu)
> >>
> >> Whew, that's a relief.  Thanks.
> > 
> > But somehow I'm not feeling a whole lot better.
> > 
> > "May" means if you don't explicitly take some action to disable group
> > scheduling, you get it (I don't care if I have an off button), but that
> > would also seemingly mean that we would then have rt tasks in taskgroups
> > with no bandwidth allocated, ie you have to make group scheduling for rt
> > tasks meaningless until a bandwidth appeared, and to make bandwidth
> > appear, you'd have to stop the world, distribute, continue, no?
> > 
> > The current "just say no" seems a lot more sensible.
> > 
> 
> I just realized we allow removing/adding controllers from/to cgroups
> while there are tasks in them, which isn't safe unless we eliminate all
> can_attach callbacks. We've done so for some cgroup subsystems, but
> there are still a few of them...

I was pondering the future (or so I thought), but seems it turned into
the past while I wasn't looking.  Oh well, you found a bug anyway.

	-Mike


* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-04 12:37             ` Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2015-05-04 12:37 UTC (permalink / raw)
  To: Zefan Li; +Cc: Mike Galbraith, Ingo Molnar, Tejun Heo, LKML, Cgroups

On Mon, May 04, 2015 at 05:11:10PM +0800, Zefan Li wrote:

> Some degree of flexibility is provided so that you may disable some controllers
> in a subtree. For example:
>
> root                  ---> child1
> (cpuset,memory,cpu)        (cpuset,memory)
>                       \
>                        \-> child2
>                            (cpu)

Uhm, how does that work? Would a task's effective cgroup be the
first parent that has the controller enabled?

In particular, in your example, if T were part of child1, would its cpu
controller be root?

> I just realized we allow removing/adding controllers from/to cgroups
> while there are tasks in them, which isn't safe unless we eliminate all
> can_attach callbacks. We've done so for some cgroup subsystems, but
> there are still a few of them...

You can't remove can_attach(), we must be able to disallow joining a
cgroup.

If that results in you not being able to change the cgroup setup with
tasks in, so be it -- that seems like a sane restriction anyhow.


* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-04 12:37             ` Peter Zijlstra
  (?)
@ 2015-05-04 14:09             ` Mike Galbraith
  2015-05-05  3:46                 ` Zefan Li
  -1 siblings, 1 reply; 43+ messages in thread
From: Mike Galbraith @ 2015-05-04 14:09 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zefan Li, Ingo Molnar, Tejun Heo, LKML, Cgroups

On Mon, 2015-05-04 at 14:37 +0200, Peter Zijlstra wrote:
> On Mon, May 04, 2015 at 05:11:10PM +0800, Zefan Li wrote:
> 
> > Some degree of flexibility is provided so that you may disable some controllers
> > in a subtree. For example:
> >
> > root                  ---> child1
> > (cpuset,memory,cpu)        (cpuset,memory)
> >                       \
> >                        \-> child2
> >                            (cpu)
> 
> Uhm, how does that work? Would a task's effective cgroup be the
> first parent that has the controller enabled?
> 
> In particular, in your example, if T were part of child1, would its cpu
> controller be root?

That's what I'd hope for.  I wanted to try that cgroup.subtree_control
gizmo to see for myself, but I don't have one, and probably won't get
one until I introduce systemd to my axe (again, it's a slow learner).

	-Mike



* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-05  3:46                 ` Zefan Li
  0 siblings, 0 replies; 43+ messages in thread
From: Zefan Li @ 2015-05-05  3:46 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Peter Zijlstra, Ingo Molnar, Tejun Heo, LKML, Cgroups

On 2015/5/4 22:09, Mike Galbraith wrote:
> On Mon, 2015-05-04 at 14:37 +0200, Peter Zijlstra wrote:
>> On Mon, May 04, 2015 at 05:11:10PM +0800, Zefan Li wrote:
>>
>>> Some degree of flexibility is provided so that you may disable some controllers
>>> in a subtree. For example:
>>>
>>> root                  ---> child1
>>> (cpuset,memory,cpu)        (cpuset,memory)
>>>                       \
>>>                        \-> child2
>>>                            (cpu)
>>
>> Uhm, how does that work? Would a task's effective cgroup be the
>> first parent that has the controller enabled?
>>
>> In particular, in your example, if T were part of child1, would its cpu
>> controller be root?

correct.
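
The lookup confirmed here can be sketched in user space as a walk
towards the root. This is an illustration under assumptions, not the
cgroup core's code: the CTRL_* bits, struct model_cgroup and
model_effective() are hypothetical names, and the "enabled" mask simply
mirrors the labels in the diagram.

```c
#include <stddef.h>

/* Hypothetical controller bits; the names mirror the diagram only. */
enum { CTRL_CPUSET = 1 << 0, CTRL_MEMORY = 1 << 1, CTRL_CPU = 1 << 2 };

struct model_cgroup {
	const char *name;
	unsigned int enabled;               /* controllers active on this node */
	const struct model_cgroup *parent;  /* NULL for the root */
};

/*
 * Return the nearest cgroup, starting with cg itself and walking towards
 * the root, that has the given controller enabled. Returns NULL if no
 * node on the path has it.
 */
static const struct model_cgroup *
model_effective(const struct model_cgroup *cg, unsigned int ctrl)
{
	while (cg && !(cg->enabled & ctrl))
		cg = cg->parent;
	return cg;
}
```

With the layout from the diagram, a task in child1 resolves to root for
cpu and to child1 itself for memory.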

> 
> That's what I'd hope for.  I wanted to try that cgroup.subtree_control
> gizmo to see for myself, but I don't have one, and probably won't get
> one until I introduce systemd to my axe (again, it's a slow learner).
> 

I'm testing in an environment without systemd.

You need to mount cgroup with a special option:

  # mount -t cgroup -o __DEVEL__sane_behavior xxx /where

If a cgroup controller has already been mounted without this option,
you won't see it in the unified hierarchy, so you first need to
remove all cgroups in it and unmount it.


* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-05  3:54               ` Zefan Li
  0 siblings, 0 replies; 43+ messages in thread
From: Zefan Li @ 2015-05-05  3:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, Ingo Molnar, Tejun Heo, LKML, Cgroups

On 2015/5/4 20:37, Peter Zijlstra wrote:
> On Mon, May 04, 2015 at 05:11:10PM +0800, Zefan Li wrote:
> 
>> Some degree of flexibility is provided so that you may disable some controllers
>> in a subtree. For example:
>>
>> root                  ---> child1
>> (cpuset,memory,cpu)        (cpuset,memory)
>>                       \
>>                        \-> child2
>>                            (cpu)
> 
> Uhm, how does that work? Would a task's effective cgroup be the
> first parent that has the controller enabled?
> 
> In particular, in your example, if T were part of child1, would its cpu
> controller be root?
> 
>> I just realized we allow removing/adding controllers from/to cgroups
>> while there are tasks in them, which isn't safe unless we eliminate all
>> can_attach callbacks. We've done so for some cgroup subsystems, but
>> there are still a few of them...
> 
> You can't remove can_attach(), we must be able to disallow joining a
> cgroup.
> 
> If that results in you not being able to change the cgroup setup with
> tasks in, so be it -- that seems like a sane restriction anyhow.
> 

I wasn't thinking about removing can_attach() before I noticed this issue.

But I was wondering whether we could change the default value of
cpu.rt_runtime_us from 0 to -1. By default, RT tasks could then be attached
to a newly created cgroup without users having to configure anything, and
those tasks would be confined by the parent cgroup, which is what we have with
cfs bw control. This requires some changes to the code, but I guess it's doable?
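
To make the idea concrete, here is a rough user-space model. It is a
sketch under assumptions, not kernel code: struct model_tg and both
functions are hypothetical, and treating -1 as "defer to the nearest
ancestor with an explicit limit" is just one possible reading of the
proposal.

```c
#include <stddef.h>

/* Hypothetical reading of -1 in the proposal: no limit at this level. */
#define MODEL_RUNTIME_INHERIT (-1LL)

struct model_tg {
	long long rt_runtime_us;        /* 0 means no RT bandwidth allocated */
	const struct model_tg *parent;  /* NULL for the root group */
};

/* Attach check as described in this thread: zero bandwidth rejects RT tasks. */
static int model_rt_group_can_attach(const struct model_tg *tg, int task_is_rt)
{
	if (task_is_rt && tg->rt_runtime_us == 0)
		return 0;
	return 1;
}

/* Under the proposal, a -1 group is confined by its closest limited ancestor. */
static long long model_effective_rt_runtime(const struct model_tg *tg)
{
	while (tg->parent && tg->rt_runtime_us == MODEL_RUNTIME_INHERIT)
		tg = tg->parent;
	return tg->rt_runtime_us;
}
```

In this model a child created with the current default of 0 refuses an
RT task, while a child created with -1 accepts it and is bounded by
whatever its parent allows.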


* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-05  6:02                   ` Mike Galbraith
  0 siblings, 0 replies; 43+ messages in thread
From: Mike Galbraith @ 2015-05-05  6:02 UTC (permalink / raw)
  To: Zefan Li; +Cc: Peter Zijlstra, Ingo Molnar, Tejun Heo, LKML, Cgroups

On Tue, 2015-05-05 at 11:46 +0800, Zefan Li wrote:
> On 2015/5/4 22:09, Mike Galbraith wrote:
> > On Mon, 2015-05-04 at 14:37 +0200, Peter Zijlstra wrote:
> >> On Mon, May 04, 2015 at 05:11:10PM +0800, Zefan Li wrote:
> >>
> >>> Some degree of flexibility is provided so that you may disable some controllers
> >>> in a subtree. For example:
> >>>
> >>> root                  ---> child1
> >>> (cpuset,memory,cpu)        (cpuset,memory)
> >>>                       \
> >>>                        \-> child2
> >>>                            (cpu)
> >>
> >> Uhm, how does that work? Would a task's effective cgroup be the
> >> first parent that has a controller enabled?
> >>
> >> In particular, in your example, if T were part of child1, would its cpu
> >> controller be root?
> 
> correct.
> 
> > 
> > That's what I'd hope for.  I wanted to try that cgroup.subtree_control
> > gizmo to see for myself, but I don't have one, and probably won't get
> > one until I introduce systemd to my axe (again, it's a slow learner).
> > 
> 
> I'm testing in an environment without systemd.

Lucky you.

> You need to mount cgroup with a special option:
> 
>   # mount -t cgroup -o __DEVEL__sane_behavior xxx /where
> 
> If a cgroup controller has already been mounted without this option,
> you won't see it in the unified hierarchy, so firstly you need to
> delete all cgroups in it and umount it.

Yeah, I found the flag, and systemd is indeed in the way.  You already
verified what subtree_control does, so I needn't squabble with the vile
thing over cgroups possession... immediately anyway.

	-Mike


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-05 14:09           ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2015-05-05 14:09 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Zefan Li, Ingo Molnar, Peter Zijlstra, LKML, Cgroups

Hello, Mike.

On Mon, May 04, 2015 at 07:39:24AM +0200, Mike Galbraith wrote:
> > > Some degree of flexibility is provided so that you may disable some controllers
> > > in a subtree. For example:
> > > 
> > > root                  ---> child1
> > > (cpuset,memory,cpu)        (cpuset,memory)
> > >                       \
> > >                        \-> child2
> > >                            (cpu)
> > 
> > Whew, that's a relief.  Thanks.
> 
> But somehow I'm not feeling a whole lot better.
> 
> "May" means if you don't explicitly take some action to disable group
> scheduling, you get it (I don't care if I have an off button), but that

In the new interface, hierarchy setup and controller configuration are
two separate steps.  Creating subhierarchy doesn't enable controller
automatically and as long as specific controllers are concerned
nothing changes when subhierarchy is created and processes are moved
inbetween them.  If control over specific resources is necessary in a
given hierarchy, the matching controllers should be enabled
explicitly.
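
As an illustration of the lookup discussed in this thread (a task's
effective cgroup for a controller is the nearest ancestor that has it
enabled), here is a minimal userspace sketch. It is not kernel code:
struct toy_cgroup, the CTRL_* bits and toy_effective() are hypothetical
names made up for this example.

```c
#include <stddef.h>

/* Hypothetical controller bits; not the kernel's real subsystem IDs. */
enum { CTRL_CPUSET = 1 << 0, CTRL_MEMORY = 1 << 1, CTRL_CPU = 1 << 2 };

/* Toy cgroup node: a parent pointer plus the controllers enabled on it. */
struct toy_cgroup {
	const struct toy_cgroup *parent;	/* NULL for the root */
	unsigned int enabled;			/* bitmask of CTRL_* */
};

/*
 * Walk towards the root and return the nearest cgroup (including cg
 * itself) that has the controller enabled, or NULL if none does.
 */
static const struct toy_cgroup *
toy_effective(const struct toy_cgroup *cg, unsigned int ctrl)
{
	while (cg && !(cg->enabled & ctrl))
		cg = cg->parent;
	return cg;
}
```

With the root/child1/child2 layout quoted above, a task in child1 would
resolve to root for cpu, while a task in child2 would resolve to child2.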

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05  3:54               ` Zefan Li
  (?)
@ 2015-05-05 14:10               ` Peter Zijlstra
  2015-05-05 14:18                 ` Tejun Heo
  -1 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2015-05-05 14:10 UTC (permalink / raw)
  To: Zefan Li; +Cc: Mike Galbraith, Ingo Molnar, Tejun Heo, LKML, Cgroups

On Tue, May 05, 2015 at 11:54:31AM +0800, Zefan Li wrote:

> But I was wondering if we can change the default value of cpu.rt_runtime_us
> from 0 to -1? So by default the RT tasks can be attached to a newly-created
> cgroup without users having to make any configuration, and those tasks are
> confined by the parent cgroup, which is what we have with cfs bw control.
> This requires some changes to the code, but I guess it's do-able?

It's tricky.

Imagine:

	  root
	 /    \
	A      B
       / \    / \
      a1 a2  b1 b2

Now if they all have -1, I cannot set a bw on any except the leaf nodes
([ab][12]), because the sum of the child bw must be smaller than or
equal to the parent bandwidth, and -1 is effectively inf.

Similarly, if A has bw enabled I cannot create a new child with -1,
for the same reason.

Now you can kludge around some of this, for example you can make the
default depend on the parent setting etc.. But that's horribly
inconsistent.

So I really prefer not to go that way; if people use RR/FIFO they had
better bloody know what they're doing; which includes setting up the
system.

The whole RR/FIFO thing is so enormously broken (by definition; this
truly is unfixable) that you simply _cannot_ automate it.
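
The constraint described above (the children's bandwidth must sum to no
more than the parent's, with -1 treated as infinite) can be sketched as
follows. This is a simplified illustration, not the kernel's actual
check: it assumes all groups share one period so runtimes can be
compared directly, and TOY_RUNTIME_INF and toy_rt_children_fit() are
hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_RUNTIME_INF (-1L)	/* stands in for a runtime of -1 */

/*
 * Return true if a parent with runtime parent_rt can host n children
 * with the given runtimes. Assumes every group uses the same period.
 */
static bool toy_rt_children_fit(long parent_rt, const long *child_rt,
				size_t n)
{
	long sum = 0;
	size_t i;

	if (parent_rt == TOY_RUNTIME_INF)
		return true;	/* unlimited parent: any children fit */

	for (i = 0; i < n; i++) {
		if (child_rt[i] == TOY_RUNTIME_INF)
			return false;	/* inf child under a finite parent */
		sum += child_rt[i];
		if (sum > parent_rt)
			return false;	/* children exceed the parent */
	}
	return true;
}
```

In this model, giving A a finite runtime while a1 and a2 still sit at -1
fails, as does creating a new -1 child under a finite A.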

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 14:10               ` Peter Zijlstra
@ 2015-05-05 14:18                 ` Tejun Heo
  2015-05-05 15:19                   ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2015-05-05 14:18 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

Hello, Peter.

On Tue, May 05, 2015 at 04:10:49PM +0200, Peter Zijlstra wrote:
> Imagine:
> 
> 	  root
> 	 /    \
> 	A      B
>        / \    / \
>       a1 a2  b1 b2
> 
> Now if they all have -1, I cannot set a bw on any except the leaf nodes
> ([ab][12]). Because the sum of child bw must strictly be smaller or
> equal to the parent bandwidth, and -1 is effectively inf.
> 
> Similarly, if A has bw enabled I cannot create a new child with -1.
> Because above.
> 
> Now you can kludge around some of this, for example you can make the
> default depend on the parent setting etc.. But that's horribly
> inconsistent.

I don't think we can kludge this.  For all other resources, we're
defining the limits that can't be crossed so nesting them w/ -1 by
default is fine.  RR slices are different in that we're really slicing
up and guaranteeing a portion of something finite, so unlimited by
default thing doesn't really work here.

> So I really prefer not to go that way; if people use RR/FIFO they had
> better bloody know what they're doing; which includes setting up the
> system.

The problem is that this is tied to the normal cpu controller.  Users
who don't have any intention of mucking with RT scheduling end up
being dragged into it.  Given the strict nature of RR slicing, I
don't even think it's actually useful to make the slicing
hierarchical.  From cgroup's POV, it'd be best if RR slicing can be
detached.

> The whole RR/FIFO thing is so enormously broken (by definition; this
> truly is unfixable) that you simply _cannot_ automate it.

Yeah, exactly.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-04 12:37             ` Peter Zijlstra
                               ` (2 preceding siblings ...)
  (?)
@ 2015-05-05 14:41             ` Tejun Heo
  2015-05-05 15:11               ` Peter Zijlstra
  -1 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2015-05-05 14:41 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

Hello, Peter.

On Mon, May 04, 2015 at 02:37:38PM +0200, Peter Zijlstra wrote:
> > I just realized we allow removing/adding controllers from/to cgroups
> > while there are tasks in them, which isn't safe unless we eliminate all
> > can_attach callbacks. We've done so for some cgroup subsystems, but
> > there are still a few of them...
> 
> You can't remove can_attach(), we must be able to disallow joining a
> cgroup.
> 
> If that results in you not being able to change the cgroup setup with
> tasks in, so be it -- that seems like a sane restriction anyhow.

This is really an interface policy issue.  For all other controllers,
it's almost trivial to let organizational operations (setting up
hierarchies, moving processes around) overrule controller
configurations.  The main benefit of doing this is that this decouples
organizational operations from resource control.  Users can depend on
the fact that allowed organizational operations won't fail due to
specific controller configuration issues.

This also works well with controllers accepting target configurations
regardless of the current state and enforcing rules to converge to the
configured state instead.  e.g. if you set max memory lower than the
currently used, the config will be accepted and the controller will
keep trying to make the current state converge to the target state.
This is important as rejecting configuration can lead to a chasing game
between configuration attempts and run-away resource consumption.
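
The memory example above (accept the target, then converge towards it)
can be sketched like this. It is a toy model only, not memcg code:
struct toy_memcg, toy_set_max() and toy_reclaim_step() are hypothetical
names, and "reclaim" is reduced to subtracting from a counter.

```c
/* Toy memory group: current usage and the configured target limit. */
struct toy_memcg {
	long usage;
	long max;
};

/* Configuration never fails, even if new_max is below current usage. */
static void toy_set_max(struct toy_memcg *m, long new_max)
{
	m->max = new_max;
}

/*
 * One convergence step: reclaim at most `batch` units of the excess
 * over the limit. Returns how much excess is still left afterwards.
 */
static long toy_reclaim_step(struct toy_memcg *m, long batch)
{
	long excess = m->usage - m->max;

	if (excess <= 0)
		return 0;	/* already within the target */
	if (batch > excess)
		batch = excess;
	m->usage -= batch;
	return excess - batch;
}
```

The point of the split is that toy_set_max() has no failure path, so the
caller never has to race against growing usage to get a limit accepted.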

Now, RR slices are the special case here because it's inherently
different from every other resource cgroup is concerned with.  It
simply doesn't fit into the same model that other resources follow.
There are several options we can try.

1. Decouple RR slices from cpu controller.  This would be the best
   route to follow.  RR slices need a hard allocator no matter what we
   do.  There isn't much point in imposing hierarchical structure on
   top of it.

2. Implement special case behavior so that it can follow the same
   model.  e.g. resetting RR scheduling config when the effective cpu
   cgroup changes or carrying the amount of slice being consumed with
   the process being moved.  No matter how this is done, it's gonna be
   a clear compromise as we're forcing this into the model which
   doesn't quite fit it.  That said, given how RR slices are a special
   case to begin with, I think this can be acceptable.

3. Take compromise in the other direction - add exceptions to
   organizational operations but clearly limit the failure modes.  We
   prolly want to structure code in a way to enforce this.

4. If #1 can be done in time but not right now, simply disallow any
   RR/FIFO in !root cgroups on the unified hierarchy for now.

What do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 14:41             ` Tejun Heo
@ 2015-05-05 15:11               ` Peter Zijlstra
  2015-05-05 16:13                 ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2015-05-05 15:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

On Tue, May 05, 2015 at 10:41:04AM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Mon, May 04, 2015 at 02:37:38PM +0200, Peter Zijlstra wrote:
> > > I just realized we allow removing/adding controllers from/to cgroups
> > > while there are tasks in them, which isn't safe unless we eliminate all
> > > can_attach callbacks. We've done so for some cgroup subsystems, but
> > > there are still a few of them...
> > 
> > You can't remove can_attach(), we must be able to disallow joining a
> > cgroup.
> > 
> > If that results in you not being able to change the cgroup setup with
> > tasks in, so be it -- that seems like a sane restriction anyhow.
> 
> This is really an interface policy issue.  For all other controllers,
> it's almost trivial to let organizational operations (setting up
> hierarchies, moving processes around) overrule controller
> configurations.  The main benefit of doing this is that this decouples
> organizational operations from resource control.  Users can depend on
> the fact that allowed organizational operations won't fail due to
> specific controller configuration issues.

But but but... that doesn't make any damn sense! Why would you want to
do something mad like that?

To me the organization is very much part of the control structure. It
cannot be an invariant. Treating it like that destroys the whole notion
of a hierarchy.

> This also works well with controllers accepting target configurations
> regardless of the current state and enforcing rules to converge to the
> configured state instead.

I think we had a long discussion on that which we never finished. I'm
not much for converging to a state. Either it can or it can not and you
hard fail.

With this soft lets just accept any old crap mentality you cannot
provide guarantees.

> e.g. if you set max memory lower than the
> currently used, the config will be accepted and the controller will
> keep trying to make the current state converge to the target state.
> This is important as rejecting configuration can lead to chasing game
> between configuration attempts and run-away resource consumption.

This is an entirely different issue; albeit with its own pitfalls, what
if you put the max too low and you run into a never ending reclaim loop?
Attempting to attain the unattainable.

> Now, RR slices are the special case here because it's inherently
> different from every other resource cgroup is concerned with. 

I don't think so, any controller which wants to carve up a fixed
resource in non proportional ways is going to run into this.

It's just that you don't want this, but that doesn't render it less
useful.

> It
> simply doesn't fit into the same model that other resources follow.
> There are several options we can try.
> 
> 1. Decouple RR slices from cpu controller.  This would be the best
>    route to follow.  RR slices need a hard allocator no matter what we
>    do.  There isn't much point in imposing hierarchical structure on
>    top of it.

The same is true of SCHED_DEADLINE, we hard divide a fixed amount. We've
not currently exposed it to cgroups, but we want to eventually.

As to not having a hierarchy; you're the one destroying it by saying the
organization should be decoupled from the controller.

And no, a hierarchy still makes perfect sense; think of containers, they
might not even see the parent.

> 3. Take compromise in the other direction - add exceptions to
>    organizational operations but clearly limit the failure modes.  We
>    prolly want to structure code in a way to enforce this.

I'm for failure modes, as you should well know by now ;-)

I really think you're moving in the wrong direction with the whole
cgroup stuff if you just want to willy nilly allow everything.

Also, who's the one doing a PID controller which will hard fail fork?
How are you going to do away with can_attach() there? Surely you need to
disallow another task joining when it's at its maximum number of allowed
PIDs, the same condition you're going to fail fork().

So no; hard failure is good and desired. It allows guarantees, which is
a good and desired feature of control.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 14:18                 ` Tejun Heo
@ 2015-05-05 15:19                   ` Peter Zijlstra
  2015-05-05 16:31                     ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2015-05-05 15:19 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

On Tue, May 05, 2015 at 10:18:38AM -0400, Tejun Heo wrote:
> > Now you can kludge around some of this, for example you can make the
> > default depend on the parent setting etc.. But that's horribly
> > inconsistent.
> 
> I don't think we can kludge this.  For all other resources, we're
> defining the limits that can't be crossed so nesting them w/ -1 by
> default is fine.  RR slices are different in that we're really slicing
> up and guaranteeing a portion of something finite, so unlimited by
> default thing doesn't really work here.

Note that you _could_ do the same thing with IO bandwidth; esp. with
these modern no-seek-penalty devices this could make sense.


> > So I really prefer not to go that way; if people use RR/FIFO they had
> > better bloody know what they're doing; which includes setting up the
> > system.
> 
> The problem is that this is tied to the normal cpu controller.  Users
> who don't have any intention of mucking with RT scheduling end up
> being dragged into it.  Given the strict nature of RR slicing, I
> don't even think it's actually useful to make the slicing
> hierarchical.  From cgroup's POV, it'd be best if RR slicing can be
> detached.

Like in the other mail; hierarchy still makes perfect sense for the
container case.

> > The whole RR/FIFO thing is so enormously broken (by definition; this
> > truly is unfixable) that you simply _cannot_ automate it.
> 
> Yeah, exactly.

I don't think you're quite agreeing to the same reasons I am. My main
objection to the whole SCHED_RR/FIFO thing as defined by POSIX is that
it does not in fact allow the OS to do what an OS _should_ do, namely
resource arbitration and control.

The whole rt-cgroup controller tries to somewhat contain that, but
fundamentally once you use RR/FIFO you've given up your system to
userspace control -- which btw is why it's usually limited to root.

SCHED_DEADLINE avoids all these problems, at the cost of a more complex
setup.

But the fact that both need fixed portions of a limited total does not
in fact mean they're broken.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 15:11               ` Peter Zijlstra
@ 2015-05-05 16:13                 ` Tejun Heo
  2015-05-05 16:50                   ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2015-05-05 16:13 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

Hello, Peter.

On Tue, May 05, 2015 at 05:11:13PM +0200, Peter Zijlstra wrote:
...
> But but but... that doesn't make any damn sense! Why would you want to
> do something mad like that?
> 
> To me the organization is very much part of the control structure. It
> cannot be an invariant. Treating it like that destroys the whole notion
> of a hierarchy.

You and I don't really agree on this.  The disagreement is fine but
what I don't get is why this is such a big deal.  How would it break
the whole notion of a hierarchy?  A user isn't allowed to escape the
subhierarchy it's allowed in no matter what.  Whether organizational
operations supersede configurations or not doesn't matter as long as
the user is confined under the right hierarchy.

Furthermore, in the majority of use cases, organizational operations are
used to set up the hierarchy when starting up a group and then left
alone.  For a stateful controller like memcg, process migrations are
inherently expensive and intrusive, so the usage model isn't
arbitrary.  This is a corner case issue and doesn't really affect the
whole model.

> > e.g. if you set max memory lower than the
> > currently used, the config will be accepted and the controller will
> > keep trying to make the current state converge to the target state.
> > This is important as rejecting configuration can lead to chasing game
> > between configuration attempts and run-away resource consumption.
> 
> This is an entirely different issue; albeit with its own pitfalls, what
> if you put the max too low and you run into a never ending reclaim loop?
> Attempting to attain the unattainable.

That's an oom condition and memcg handles it accordingly.

> > Now, RR slices are the special case here because it's inherently
> > different from every other resource cgroup is concerned with. 
> 
> I don't think so, any controller which wants to carve up a fixed
> resource in non proportional ways is going to run into this.
> 
> Its just that you don't want this, but that doesn't render it less
> useful.

Well, of the resources that we handle right now, it is a special case
and a sucky one at that because it ties itself to regular cpu
controller which doesn't need that behavior.

> > It
> > simply doesn't fit into the same model that other resources follow.
> > There are several options we can try.
> > 
> > 1. Decouple RR slices from cpu controller.  This would be the best
> >    route to follow.  RR slices need a hard allocator no matter what we
> >    do.  There isn't much point in imposing hierarchical structure on
> >    top of it.
> 
> The same is true of SCHED_DEADLINE, we hard divide a fixed amount. We've
> not currently exposed it to cgroups, but we want to eventually.
> 
> As to not having a hierarchy; you're the one destroying it by saying the
> organization should be decoupled from the controller.

I don't get this part.  How does making organization supersede
configuration destroy hierarchy?

> And, no a hierarchy still makes perfect sense, think of containers, they
> might not even see the parent.

The mode of configuration is different tho.  No matter what we do, if
we want to automate this sort of distribution with resource as limited
as realtime slices, it'll need a separate allocator which can carve
out resources on demand.  This can't be ratio-distributed or
soft-capped and having to tie this together with regular cpu
controller is annoying.

> > 3. Take compromise in the other direction - add exceptions to
> >    organizational operations but clearly limit the failure modes.  We
> >    prolly want to structure code in a way to enforce this.
> 
> I'm for failure modes, as you should well know by now ;-)
> 
> I really think you're moving in the wrong direction with the whole
> cgroup stuff if you just want to willy nilly allow everything.

Well, let's agree to disagree on that one.  It's not about allowing
willy nilly everything but separating out the specification of intent
from the current state and you also saw how coupling the two tightly
messed up cpuset.  It can make configuration tedious enough to the
point where it becomes impractical to use under certain circumstances.

The thing is, allowing to specify configurations doesn't prevent the
user from enforcing stricter rules.  The current state is always
visible to the user and if it fails to converge, the user can take
whatever actions that it needs to take to remedy the situation.

> Also, who's the one doing a PID controller which will hard fail fork?
> How are you going to do away with can_attach() there? Surely you need to
> dis-allow another task joining when its at its maximum number of allowed
> PIDs, the same condition you're going to fail fork().

It allows migrations into an already capped cgroup.  It just won't allow
new forks.  This isn't different from allowing limit to be lowered
below the current and we *do* want that because otherwise it becomes a
race between whoever is setting the config and whoever is consuming
the resources.  You always wanna be able to say "stop giving out
resources now".
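
The behavior described above (migration into a capped group succeeds,
only fork is refused) can be sketched as below. This is a toy model, not
the actual PID controller: struct toy_pids, toy_pids_attach() and
toy_pids_try_fork() are hypothetical names.

```c
#include <stdbool.h>

/* Toy pids group: tasks currently charged and the configured limit. */
struct toy_pids {
	long count;
	long max;
};

/* Migration into the group: always charged, even past the limit. */
static void toy_pids_attach(struct toy_pids *p)
{
	p->count++;
}

/* Fork inside the group: refused once count has reached the limit. */
static bool toy_pids_try_fork(struct toy_pids *p)
{
	if (p->count >= p->max)
		return false;
	p->count++;
	return true;
}
```

Lowering max below count is also accepted in this model; it simply makes
every later toy_pids_try_fork() fail until the count drops again.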

> So no; hard failure is good and desired. It allows guarantees, which is
> a good and desired feature of control.

Isn't that too sweeping a statement?  We want them in some places but
not necessarily in all places.  The hard failures aren't going away.
They're just localized to specific areas where they're easier to
handle.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 15:19                   ` Peter Zijlstra
@ 2015-05-05 16:31                     ` Tejun Heo
  2015-05-05 19:00                       ` Peter Zijlstra
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2015-05-05 16:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

Hello, Peter.

On Tue, May 05, 2015 at 05:19:49PM +0200, Peter Zijlstra wrote:
> > I don't think we can kludge this.  For all other resources, we're
> > defining the limits that can't be crossed so nesting them w/ -1 by
> > default is fine.  RR slices are different in that we're really slicing
> > up and guaranteeing a portion of something finite, so unlimited by
> > default thing doesn't really work here.
> 
> Note that you _could_ do the same thing with IO bandwidth; esp. with
> these modern no-seek-penalty devices this could make sense.

Yeah, maybe.  It currently is too unpredictable to do that (at least
from OS side w/ all the layering) but that is a possibility.

> > The problem is that this is tied to the normal cpu controller.  Users
> > who don't have any intention of mucking with RT scheduling end up
> > being dragged into it.  Given the strict nature of RR slicing, I
> > don't even think it's actually useful to make the slicing
> > hierarchical.  From cgroup's POV, it'd be best if RR slicing can be
> > detached.
> 
> Like in the other mail; hierarchy still makes perfect sense for the
> container case.

We'd still need an on-demand arbitration mechanism across containers
no matter what we do which might as well take care of everything.  But
please see below.

> > > The whole RR/FIFO thing is so enormously broken (by definition; this
> > > truly is unfixable) that you simply _cannot_ automate it.
> > 
> > Yeah, exactly.
> 
> I don't think you're quite agreeing to the same reasons I am. My main
> objection to the whole SCHED_RR/FIFO thing as defined by POSIX is that
> it does not in fact allow the OS to do what an OS _should_ do, namely
> resource arbitration and control.
> 
> The whole rt-cgroup controller tries to somewhat contain that, but
> fundamentally once you use RR/FIFO you've given up your system to
> userspace control -- which btw is why it's usually limited to root.
> 
> SCHED_DEADLINE avoids all these problems, at the cost of a more complex
> setup.
> 
> But the fact that both need fixed portions of a limited total does not
> in fact mean they're broken.

But that does make them pretty different from others.  What bothers me
the most about RR slices right now is that it's tightly coupled with
the rest of cpu controller while having a very different set of
characteristics.  Maybe this is something mandated by the underlying
structure and we have to live with it but it definitely isn't an ideal
situation.

What I don't want to happen is controllers failing migrations
willy-nilly for random reasons leaving users baffled, which we've
actually been doing unfortunately.  Maybe we need to deal with this
fixed resource arbitration as a separate class and allow them to fail
migration w/ -EBUSY.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 16:13                 ` Tejun Heo
@ 2015-05-05 16:50                   ` Peter Zijlstra
  2015-05-05 18:29                     ` Thomas Gleixner
  2015-05-05 18:31                     ` Tejun Heo
  0 siblings, 2 replies; 43+ messages in thread
From: Peter Zijlstra @ 2015-05-05 16:50 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

On Tue, May 05, 2015 at 12:13:35PM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Tue, May 05, 2015 at 05:11:13PM +0200, Peter Zijlstra wrote:
> ...
> > But but but... that doesn't make any damn sense! Why would you want to
> > do something mad like that?
> > 
> > To me the organization is very much part of the control structure. It
> > cannot be an invariant. Treating it like that destroys the whole notion
> > of a hierarchy.
> 
> You and I don't really agree on this.  The disagreement is fine but
> what I don't get is why this is such a big deal.  How would it break
> the whole notion of a hierarchy?  A user isn't allowed to escape the
> subhierarchy it's allowed in no matter what.  Whether organizational
> operations supersede configurations or not doesn't matter as long as
> the user is confined under the right hierarchy.

I really don't get what you're saying there. If it's not allowed to
'escape' there must be some equivalent of can_attach().

Otherwise you simply cannot reject the move.

> Furthermore, in the majority of use cases, organizational operations are
> used to set up the hierarchy when starting up a group and then left
> alone.  For a stateful controller like memcg, process migrations are
> inherently expensive and intrusive, so the usage model isn't
> arbitrary.  This is a corner case issue and doesn't really affect the
> whole model.

Again, I don't follow, so why is can_attach() bad?

> > I don't think so, any controller which wants to carve up a fixed
> > resource in non proportional ways is going to run into this.
> > 
> > It's just that you don't want this, but that doesn't render it less
> > useful.
> 
> Well, of the resources that we handle right now, it is a special case
> and a sucky one at that because it ties itself to regular cpu
> controller which doesn't need that behavior.

It doesn't 'tie' itself to the cpu controller; it's a fundamental part of
the cpu controller. The cpu controller is about all computation time, and
RR/FIFO is very much a part of that.

And RR/FIFO is extra special in that if you grant it to a process, that
process can suck your machine dry of this time. This is why you must
configure it.

People should not unknowingly let programs use RR/FIFO. Also what sorts
of 'problems' are people having because of this? What kind of
applications 'require' RR/FIFO on a normal desktop?

> > As to not having a hierarchy; you're the one destroying it by saying the
> > organization should be decoupled from the controller.
> 
> I don't get this part.  How does making organization supersede
> configuration destroy hierarchy?

If you want to unconditionally allow task migration between groups, the
hierarchy doesn't actually mean anything.

You can't enforce hierarchical constraints. Which to me is the entire
point of having a hierarchy.

> > And, no a hierarchy still makes perfect sense, think of containers, they
> > might not even see the parent.
> 
> The mode of configuration is different tho.  No matter what we do, if
> we want to automate this sort of distribution with resource as limited
> as realtime slices, it'll need a separate allocator which can carve
> out resources on demand.

But you don't want to automate, full stop.

> This can't be ratio-distributed or
> soft-capped and having to tie this together with regular cpu
> controller is annoying.

Welcome to actual world issues. Stop pretending this stuff is easy and
can be hidden from the user.

IF people want to use RR/FIFO they had better damn well know what
they're doing. There is no way around that. There are just too many
things that can go wrong with it.

If they don't want to deal with these problems, then tell them to go
away. Do _NOT_ pretend it's easy and fudge it for them.

This on-demand carving thing you mention, that's a _MASSIVE_ fudge. Just
don't even go there.

> > I really think you're moving in the wrong direction with the whole
> > cgroup stuff if you just want to willy nilly allow everything.
> 
> Well, let's agree to disagree on that one.  It's not about allowing
> willy nilly everything but separating out the specification of intent
> from the current state and you also saw how coupling the two tightly
> messed up cpuset.  It can make configuration tedious enough to the
> point where it becomes impractical to use under certain circumstances.

Well, no I didn't see how cpusets was messed up. You see that is where
we start to disagree.

The improvement I wanted to cpusets was to simply disallow hotplug when
there were tasks that could not go elsewhere.

> The thing is, allowing to specify configurations doesn't prevent the
> user from enforcing stricter rules.  The current state is always
> visible to the user and if it fails to converge, the user can take
> whatever actions that it needs to take to remedy the situation.

Right, so how about failing hotplug if there's (user) tasks pinned to a
cpu? That's clearly visible and the user can go fix it if he really
wants to do the unplug.

That's a very similar thing, but you've argued against it.

That said, this is not the point we're now arguing about; I want the
hierarchy to actually mean something, and the only way to do that is to
allow can_attach().

Without can_attach() one cannot provide hierarchical constraints.
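
As a rough illustration of the hierarchical constraint argued for here,
the following user-space C sketch models a can_attach()-style check. It
is not the kernel's implementation: struct group, model_can_attach() and
model_attach() are hypothetical names, the resource is a plain task
count, and the task is assumed to arrive from outside the modelled
subtree.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space model of a group with a hard task limit. */
struct group {
	struct group *parent;	/* NULL for the root of the model */
	int nr_tasks;		/* tasks in this group and its descendants */
	int max_tasks;		/* hard limit for this subtree */
};

/* Return 0 if one more task fits in dst and in every ancestor. */
int model_can_attach(const struct group *dst)
{
	for (const struct group *g = dst; g != NULL; g = g->parent)
		if (g->nr_tasks + 1 > g->max_tasks)
			return -EBUSY;
	return 0;
}

/* Attach only if the check passes, then account the task at every level. */
int model_attach(struct group *dst)
{
	int ret = model_can_attach(dst);

	if (ret)
		return ret;
	for (struct group *g = dst; g != NULL; g = g->parent)
		g->nr_tasks++;
	return 0;
}
```

In this model a child with a generous limit still cannot take a task
once any ancestor is full, which is the sense in which the rejection
makes the hierarchy "mean something".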

> > Also, who's the one doing a PID controller which will hard fail fork?
> > How are you going to do away with can_attach() there? Surely you need to
> > disallow another task joining when it's at its maximum number of allowed
> > PIDs, the same condition you're going to fail fork().
> 
> It allows migrations into an already capped cgroup.

OMFG, that's so broken. This basically renders the entire cap useless.

So you now have: no more than 'N' tasks, except <big-gaping-hole>.

> It just won't allow
> new forks.  This isn't different from allowing limit to be lowered
> below the current and we *do* want that because otherwise it becomes a
> race between whoever is setting the config and whoever is consuming
> the resources.  You always wanna be able to say "stop giving out
> resources now".

Ah, that is what you've been trying to say with your memcg example. Well
see this cannot work for realtime (and anybody else who wants to provide
actual guarantees).

You simply cannot lower the max below the current usage, end of story.
Because it will _NOT_ converge. Tasks were promised that time and will
continue using it.

If you want to lower it, first take some tasks out. Idem the cpu
affinity vs hotplug.

Same for your PID controller btw, it will NOT converge, tasks won't
magically go away just because you want them to.

Also, there is no problem failing any of these settings; it's 'obvious'
what the problem is. When they return -EBUSY or whatnot, the resource is
taken and you need to go free some up.
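
A minimal sketch of that rule, assuming a single counter with a hard
limit: both charging past the limit and lowering the limit below current
usage fail with -EBUSY, so usage can never exceed the limit. The names
(struct hard_limit, hard_limit_charge(), hard_limit_uncharge(),
hard_limit_set_max()) are made up for illustration and do not correspond
to any kernel interface.

```c
#include <errno.h>

/* Hypothetical hard-limited resource; invariant: 0 <= usage <= max. */
struct hard_limit {
	long usage;
	long max;
};

/* Take 'amount' of the resource, or fail without changing anything. */
int hard_limit_charge(struct hard_limit *hl, long amount)
{
	if (amount < 0)
		return -EINVAL;
	if (amount > hl->max - hl->usage)
		return -EBUSY;
	hl->usage += amount;
	return 0;
}

/* Give back up to 'amount'; usage never drops below zero. */
void hard_limit_uncharge(struct hard_limit *hl, long amount)
{
	if (amount > hl->usage)
		amount = hl->usage;
	if (amount > 0)
		hl->usage -= amount;
}

/* Refuse a new limit that the current usage already exceeds. */
int hard_limit_set_max(struct hard_limit *hl, long new_max)
{
	if (new_max < 0)
		return -EINVAL;
	if (new_max < hl->usage)
		return -EBUSY;
	hl->max = new_max;
	return 0;
}
```

The caller that gets -EBUSY from hard_limit_set_max() has to uncharge
(take tasks out) first and then retry, which is the ordering described
above.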

> > So no; hard failure is good and desired. It allows guarantees, which is
> > a good and desired feature of control.
> 
> Isn't that too sweeping a statement?  We want them in some places but
> not necessarily in all places.  The hard failures aren't going away.
> They're just localized to specific areas where they're easier to
> handle.

Easier how? I'm really not seeing how any of this is making things
easier for anybody.

All I'm seeing is that you're making cgroups useless for people who want
to guarantee things (eg. the realtime people).

Are you really going to force us to abandon cgroups and invent yet
another grouping thing?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 16:50                   ` Peter Zijlstra
@ 2015-05-05 18:29                     ` Thomas Gleixner
  2015-05-05 19:00                         ` Tejun Heo
  2015-05-05 18:31                     ` Tejun Heo
  1 sibling, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2015-05-05 18:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Tejun Heo, Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

On Tue, 5 May 2015, Peter Zijlstra wrote:
> On Tue, May 05, 2015 at 12:13:35PM -0400, Tejun Heo wrote:
> > On Tue, May 05, 2015 at 05:11:13PM +0200, Peter Zijlstra wrote:
> > > 
> > > So no; hard failure is good and desired. It allows guarantees, which is
> > > a good and desired feature of control.
> > 
> > Isn't that too sweeping a statement?  We want them in some places but
> > not necessarily in all places.  The hard failures aren't going away.
> > They're just localized to specific areas where they're easier to
> > handle.
> 
> Easier how? I'm really not seeing how any of this is making things
> easier for anybody.
> 
> All I'm seeing is that you're making cgroups useless for people who want
> to guarantee things (eg. the realtime people).

I fully agree and after reading through this thread I really have to
say that this whole notion of relaxing the admission control and then trying
to magically converge to the resource limits is horrible in all
aspects.

Hierarchies must have a strictly inherited and overall consistent
resource management and therefore resource limitation. Otherwise they
are just useless.

The idea of allowing overcommitment and magically converging back
to the limits yells heuristics all over the place and we all know how
reliable heuristics are.

Tejun, you try to make the whole configuration and placement simpler
for the user, but all you achieve is that you act like all these
politicians who promise tax cuts and whatever and forget about them
once the elections are over. How is that going to make stuff simpler
for users/admins? Not at all.

Instead of failing hard at placement/configuration time they get
surprised by hard to understand fallout of magic convergence
heuristics. That's crap and no matter how you argue it stays crap.

As Peter said several times: hard failure is good and desired. It's
very clear information on which people can act. If the failure
modes are willy-nilly today, as you wrote somewhere, then we need to
fix that and make them consistent and understandable and not replace
them by half-baked heuristics which postpone the failure to some point
where it is even less understandable.

If there are issues with run-away problems, i.e. upping a resource
limit which gets eaten up from the existing tasks before you can admit
a new one, then your magic convergence thing is again the wrong
answer. The right approach is:

      1) Up the limit and make a reservation at the same time
      2) Admit the new task and allow it to consume the reservation
      3) Set it effective

You can apply this to ALL sorts of resource controllers and you give
the user a very simple to understand mechanism to control and
configure his system.
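
The three steps can be sketched in user-space C as below. This is one
possible reading of the scheme, not an existing kernel mechanism: struct
pool and the pool_*() functions are hypothetical, and "set it effective"
is interpreted here as releasing whatever is left of the reservation to
ordinary consumers.

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical pool; invariant: usage + reserved <= max. */
struct pool {
	long max;	/* hard limit */
	long usage;	/* consumed by admitted tasks */
	long reserved;	/* held back for a pending admission */
};

/* Step 1: raise the limit and reserve the increase in one operation. */
int pool_raise_and_reserve(struct pool *p, long amount)
{
	if (amount <= 0 || amount > LONG_MAX - p->max)
		return -EINVAL;
	p->max += amount;
	p->reserved += amount;
	return 0;
}

/* Ordinary consumption by existing tasks must not touch the reservation. */
int pool_charge(struct pool *p, long amount)
{
	if (amount < 0)
		return -EINVAL;
	if (amount > p->max - p->reserved - p->usage)
		return -EBUSY;
	p->usage += amount;
	return 0;
}

/* Step 2: admit the new task, paying for it out of the reservation. */
int pool_admit(struct pool *p, long amount)
{
	if (amount < 0)
		return -EINVAL;
	if (amount > p->reserved)
		return -EBUSY;
	p->reserved -= amount;
	p->usage += amount;
	return 0;
}

/* Step 3: make the new limit effective for everyone. */
void pool_commit(struct pool *p)
{
	p->reserved = 0;
}
```

Because the increase is reserved in the same step that raises the limit,
existing tasks cannot eat it before the new task is admitted.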

> Are you really going to force us to abandon cgroups and invent yet
> another grouping thing?

Sigh no. I think cgroups can be fixed, if we just adhere to the basic
principles of hierarchical resource management and remove/reject all
magic "we'll fix that for you" nonsense.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 16:50                   ` Peter Zijlstra
  2015-05-05 18:29                     ` Thomas Gleixner
@ 2015-05-05 18:31                     ` Tejun Heo
  1 sibling, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2015-05-05 18:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

Hello, again.

On Tue, May 05, 2015 at 06:50:06PM +0200, Peter Zijlstra wrote:
> I really don't get what you're saying there. If its not allowed to
> 'escape' there must be some equivalent of can_attach().
> 
> Otherwise you simply cannot reject the move.

A given user isn't allowed to move processes into a cgroup outside its
subhierarchy and the hierarchical resource control keeps the
subhierarchy under the limits no matter what the user does inside it.
Whether can_attach can fail or not is peripheral in this sense - if a
user can move processes into a cgroup outside its allowed scope, the
user can already escape regardless of the specifics of configuration.

Any user of cgroups should be confined to its scope and when it's
confined that way, the hierarchical limits are enforced no matter what
happens in its subhierarchy.

> > Furthermore, in the majority of use cases, organizational operations are
> > used to set up the hierarchy when starting up a group and then left
> > alone.  For a stateful controller like memcg, process migrations are
> > inherently expensive and intrusive, so the usage model isn't
> > arbitrary.  This is a corner case issue and doesn't really affect the
> > whole model.
> 
> Again, I don't follow, so why is can_attach() bad?

It's more like can_attach failures don't add much for other
controllers.  Please see below.

> People should not unknowingly let programs use RR/FIFO. Also what sorts
> of 'problems' are people having because of this? What kind of
> applications 'require' RR/FIFO on a normal desktop?

The cases I hear about are mostly audio applications which end up in
whatever default cgroups other applications are put in w/o an easy way
to configure the hierarchy for RR slices.  As I wrote way back, if
these can't be decoupled, whoever is setting up cpu cgroup hierarchies
will also have to take part in distributing realtime slices.

This might not necessarily be a bad thing.  It's just different from
everything else cgroups deal with at this point.

> > I don't get this part.  How does making organization supersede
> > configuration destroy hierarchy?
> 
> If you want to unconditionally allow task migration between groups, the
> hierarchy doesn't actually mean anything.
>
> You can't enforce hierarchical constraints. Which to me is the entire
> point of having a hierarchy.

No, hierarchy still puts restrictions on who can do what where.
Whether organization operations supersede configurations or not
doesn't affect this at all.  Again, if you can stow away processes out
of your domain, you're escaping the hierarchical constraints all the
same.  Delegations need to be scoped no matter what.

> > This can't be ratio-distributed or
> > soft-capped and having to tie this together with regular cpu
> > controller is annoying.
> 
> Welcome to actual world issues. Stop pretending this stuff is easy and
> can be hidden from the user.
> 
> IF people want to use RR/FIFO they had better damn well know what
> they're doing. There is no way around that. There are just too many
> things that can go wrong with it.
> 
> If they don't want to deal with these problems, then tell them to go
> away. Do _NOT_ pretend it's easy and fudge it for them.
> 
> This on-demand carving thing you mention, that's a _MASSIVE_ fudge. Just
> don't even go there.

How is on-demand allocation fudging?  You can do it manually or you
can have policies set up to allocate the specific resource.  This is
really beside the point tho.  What I was trying to say was that this
takes a different approach from other non-hard resources.

> > Well, let's agree to disagree on that one.  It's not about allowing
> > willy nilly everything but separating out the specification of intent
> > from the current state and you also saw how coupling the two tightly
> > messed up cpuset.  It can make configuration tedious enough to the
> > point where it becomes impractical to use under certain circumstances.
> 
> Well, no I didn't see how cpusets was messed up. You see that is where
> we start to disagree.

Yeah, seems that way.  Let's agree to disagree here.

> The improvement I wanted to cpusets was to simply disallow hotplug when
> there were tasks that could not go elsewhere.

Would that mean we're also gonna disallow hotunplug if some threads
are pinned to that cpu?  And the kernel would still be changing
configurations in a non-reversible way.  Again, how does that jibe
with plain affinities?

> That said, this is not the point we're now arguing about; I want the
> hierarchy to actually mean something, and the only way to do that is to
> allow can_attach().
> 
> Without can_attach() one cannot provide hierarchical constraints.

I don't think this is the point either.  The point is how to deal with
hard resources that can't be permissive by default.

> > > Also, who's the one doing a PID controller which will hard fail fork?
> > > How are you going to do away with can_attach() there? Surely you need to
> > > disallow another task joining when it's at its maximum number of allowed
> > > PIDs, the same condition you're going to fail fork().
> > 
> > It allows migrations into an already capped cgroup.
> 
> OMFG, that's so broken. This basically renders the entire cap useless.
> 
> So you now have: no more than 'N' tasks, except <big-gaping-hole>.

We need that "hole" anyway as I wrote below.  The rule is "no new fork
if there are more than N tasks in the group" and that's it.
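
The rule as stated can be modelled in a few lines of user-space C. This
is only a sketch of the semantics described in this message, with
hypothetical names (struct pid_group and the pid_group_*() functions);
the -EAGAIN return for a refused fork is an assumption of the model.

```c
#include <errno.h>

/* Hypothetical model of a group with a cap on the number of tasks. */
struct pid_group {
	int nr_tasks;
	int max_tasks;
};

/* fork() inside the group: refused once the group is at or over the cap. */
int pid_group_fork(struct pid_group *g)
{
	if (g->nr_tasks >= g->max_tasks)
		return -EAGAIN;
	g->nr_tasks++;
	return 0;
}

/* Migration into the group never fails, so it may push usage over the cap. */
void pid_group_migrate_in(struct pid_group *g)
{
	g->nr_tasks++;
}

/* A task exits or is moved out of the group. */
void pid_group_exit(struct pid_group *g)
{
	if (g->nr_tasks > 0)
		g->nr_tasks--;
}
```

In this model nr_tasks can exceed max_tasks after a migration, and forks
stay blocked until enough tasks exit or leave; that over-limit state is
the "hole" being debated.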

...
> Ah, that is what you've been trying to say with your memcg example. Well
> see this cannot work for realtime (and anybody else who wants to provide
> actual guarantees).
> 
> You simply cannot lower the max below the current usage, end of story.
> Because it will _NOT_ converge. Tasks were promised that time and will
> continue using it.

So, this is the key issue.  These resources are fundamentally
different.

> If you want to lower it, first take some tasks out. Idem the cpu
> affinity vs hotplug.
> 
> Same for your PID controller btw, it will NOT converge, tasks won't
> magically go away just because you want them to.

It won't automatically converge of course.  It just won't allow new
forks.  Moving processes into the cgroup is at the same level as
lowering the limit below current.  If one is allowed, the other is
allowed too and neither can allow the user to escape its hierarchical
limit as long as the user is properly contained in its subhierarchy.

> Also, there is no problem failing any of these settings; it's 'obvious'
> what the problem is. When they return -EBUSY or whatnot, the resource is
> taken and you need to go free some up.

Hmm... so, I kinda agree here.  If we clearly define and constrain how
we use -EBUSY (hard resources only), it can work out.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 18:29                     ` Thomas Gleixner
@ 2015-05-05 19:00                         ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2015-05-05 19:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

Hello, Thomas.

On Tue, May 05, 2015 at 08:29:28PM +0200, Thomas Gleixner wrote:
> I fully agree and after reading through this thread I really have to
> say that this whole notion of relaxing the admission control and then trying
> to magically converge to the resource limits is horrible in all
> aspects.

This comes down to controllers allowing limits to be configured below
current usage.  We need to allow and define what happens in that
situation and moving a process into a full cgroup inherently follows
the same pattern albeit from the other direction.

> The idea of allowing overcommitment and magically converging back
> to the limits yells heuristics all over the place and we all know how
> reliable heuristics are.

It's not magic heuristics.  This is a core part of normal operation.

> As Peter said several times: hard failure is good and desired. It's
> very clear information on which people can act. If the failure
> modes are willy-nilly today, as you wrote somewhere, then we need to
> fix that and make them consistent and understandable and not replace
> them by half-baked heuristics which postpone the failure to some point
> where it is even less understandable.

There are no such magic heuristics because controllers need well
defined behaviors when current is above limit anyway and behave
exactly the same way no matter how that state is reached.  For
resources like RR slices, this doesn't work and that's why this is an
issue, so yeah this is the process of finding out what must be able to
fail.

> If there are issues with run-away problems, i.e. upping a resource
> limit which gets eaten up from the existing tasks before you can admit
> a new one, then your magic convergence thing is again the wrong
> answer. The right approach is:
> 
>       1) Up the limit and make a reservation at the same time
>       2) Admit the new task and allow it to consume the reservation
>       3) Set it effective

I don't really think this is a scenario we need to worry about.  If we
choose to fail migration, let's just fail it.  There's no point in
building a mechanism to work around misbehavior from its users.

> > Are you really going to force us to abandon cgroups and invent yet
> > another grouping thing?
> 
> Sigh no. I think cgroups can be fixed, if we just adhere to the basic
> principles of hierarchical resource management and remove/reject all
> magic "we'll fix that for you" nonsense.

So, let's do -EBUSY for hard resource failures which have to be exact.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 16:31                     ` Tejun Heo
@ 2015-05-05 19:00                       ` Peter Zijlstra
  2015-05-05 19:06                           ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2015-05-05 19:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

On Tue, May 05, 2015 at 12:31:12PM -0400, Tejun Heo wrote:
> 
> What I don't want to happen is controllers failing migrations
> willy-nilly for random reasons leaving users baffled, which we've
> actually been doing unfortunately.  Maybe we need to deal with this
> fixed resource arbitration as a separate class and allow them to fail
> migration w/ -EBUSY.

Ah, _that_ was the problem.

Which is something created by this co-mounting of controllers.

You could of course store the ss-id of the failing operation in
task_struct and have a file reporting the name of the ss-id.

That way, there is a simple way to find out which controller failed the
migrate.
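
A user-space sketch of that idea follows. Everything in it is
hypothetical: struct subsys, struct task_model, try_migrate() and
report_last_failure() are illustrative names, the two stand-in
controllers simply return fixed values, and the report prints the
numeric errno rather than a symbolic name.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical controller: a name plus a can_attach-style callback. */
struct subsys {
	const char *name;
	int (*can_attach)(int pid);
};

/* Hypothetical per-task record of the last rejected migration. */
struct task_model {
	int pid;
	int last_fail_ssid;	/* index into the controller array, -1 if none */
	int last_fail_errno;	/* negative errno from that controller */
};

/* Stand-in controllers used only for the example. */
int always_ok(int pid)   { (void)pid; return 0; }
int always_busy(int pid) { (void)pid; return -EBUSY; }

/* Ask each controller in turn and remember the first one that refuses. */
int try_migrate(struct task_model *t, const struct subsys *ss, int nr_ss)
{
	t->last_fail_ssid = -1;
	t->last_fail_errno = 0;
	for (int i = 0; i < nr_ss; i++) {
		int ret = ss[i].can_attach(t->pid);

		if (ret) {
			t->last_fail_ssid = i;
			t->last_fail_errno = ret;
			return ret;
		}
	}
	return 0;
}

/* What a reporting file could show: "<controller>:<errno>" or "none". */
int report_last_failure(const struct task_model *t, const struct subsys *ss,
			char *buf, size_t len)
{
	if (t->last_fail_ssid < 0)
		return snprintf(buf, len, "none");
	return snprintf(buf, len, "%s:%d",
			ss[t->last_fail_ssid].name, t->last_fail_errno);
}
```

With controllers { "memory", always_ok } and { "cpu", always_busy }, a
failed try_migrate() leaves a record naming "cpu" as the controller that
rejected the move.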


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-05 19:06                           ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2015-05-05 19:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

Hello, Peter.

On Tue, May 05, 2015 at 09:00:57PM +0200, Peter Zijlstra wrote:
> On Tue, May 05, 2015 at 12:31:12PM -0400, Tejun Heo wrote:
> > What I don't want to happen is controllers failing migrations
> > willy-nilly for random reasons leaving users baffled, which we've
> > actually been doing unfortunately.  Maybe we need to deal with this
> > fixed resource arbitration as a separate class and allow them to fail
> > migration w/ -EBUSY.
> 
> Ah, _that_ was the problem.
> 
> Which is something created by this co-mounting of controllers.

Yeah, partly, but also that it's an extra failure mode which isn't
necessary for most controllers.

> You could of course store the ss-id of the failing operation in
> task_struct and have a file reporting the name of the ss-id.
> 
> That way, there is a simple way to find out which controller failed the
> migrate.

Given that the resources which can fail are very limited, I don't
think we need that right now as long as we limit and document the
possible failure cases clearly.  Hopefully, this won't devolve into
a collection of arbitrary failures.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
@ 2015-05-06  8:49                             ` Peter Zijlstra
  0 siblings, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2015-05-06  8:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

On Tue, May 05, 2015 at 03:06:03PM -0400, Tejun Heo wrote:
> Hello, Peter.
> 
> On Tue, May 05, 2015 at 09:00:57PM +0200, Peter Zijlstra wrote:
> > On Tue, May 05, 2015 at 12:31:12PM -0400, Tejun Heo wrote:
> > > What I don't want to happen is controllers failing migrations
> > > willy-nilly for random reasons leaving users baffled, which we've
> > > actually been doing unfortunately.  Maybe we need to deal with this
> > > fixed resource arbitration as a separate class and allow them to fail
> > > migration w/ -EBUSY.
> > 
> > Ah, _that_ was the problem.
> > 
> > Which is something created by this co-mounting of controllers.
> 
> Yeah, partly, but also that it's an extra failure mode which isn't
> necessary for most controllers.

I can agree with reducing failure modes, but we should not do it at the
cost of functionality.

> > You could of course store the ss-id of the failing operation in
> > task_struct and have a file reporting the name of the ss-id.
> > 
> > That way, there is a simple way to find out which controller failed the
> > migrate.
> 
> Given that the resources which can fail are very limited, I don't
> think we need that right now as long as we limit and document the
> possible failure cases clearly.  Hopefully, this won't devolve into
> a collection of arbitrary failures.

Right, but something like that would be fairly trivial to implement and
would give immediate resolution.

For example:

$ echo 123 > /cgroups/monkey/business/tasks
-EBUSY
$ cat /cgroups/monkey/business/errno
cpu:-EBUSY

(in fact, for a trivial implementation it doesn't matter which
cgroup/errno you cat)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] sched: Relax a restriction in sched_rt_can_attach()
  2015-05-05 19:00                         ` Tejun Heo
  (?)
@ 2015-05-06  9:12                         ` Thomas Gleixner
  -1 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2015-05-06  9:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Peter Zijlstra, Zefan Li, Mike Galbraith, Ingo Molnar, LKML, Cgroups

On Tue, 5 May 2015, Tejun Heo wrote:
> On Tue, May 05, 2015 at 08:29:28PM +0200, Thomas Gleixner wrote:
> > As Peter said several times: hard failure is good and desired. It's
> > very clear information on which people can act. If the failure
> > modes are willy-nilly today, as you wrote somewhere, then we need to
> > fix that and make them consistent and understandable, not replace
> > them with half-baked heuristics which postpone the failure to some
> > point where it is even less understandable.
> 
> There are no such magic heuristics because controllers need well
> defined behaviors when current is above limit anyway and behave
> exactly the same way no matter how that state is reached.  For

How would something go above the limit in the first place if your
resource management is done properly?

  If a group has a resource limit, then it is not allowed to exceed
  that resource. So any attempt to use more resources must fail,
  period. There is no way to go above the limit.

  If you try to lower the limits of an existing group below the level
  which is already used, then this limit restriction attempt must
  fail.

That's the basic principle of resource management. If you try to
avoid it, then you have a massive design failure. It's that simple.
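
[Editorial sketch, not part of the original mail.] The two rules stated above can be written down as an invariant, `usage <= limit`, that both operations must preserve. This is a minimal userspace model under stated assumptions: `struct res_group`, `res_charge()` and `res_set_limit()` are hypothetical names, and the choice of -EBUSY as the error code is an assumption borrowed from the earlier discussion, not something this mail specifies.

```c
#include <errno.h>

/* Hypothetical resource group; invariant: usage <= limit at all times. */
struct res_group {
	unsigned long limit;
	unsigned long usage;
};

/* Rule 1: an attempt to use more than the limit allows must fail. */
static int res_charge(struct res_group *g, unsigned long amount)
{
	/* limit - usage cannot underflow because usage <= limit holds. */
	if (amount > g->limit - g->usage)
		return -EBUSY;
	g->usage += amount;
	return 0;
}

/* Rule 2: lowering the limit below what is already in use must fail. */
static int res_set_limit(struct res_group *g, unsigned long new_limit)
{
	if (new_limit < g->usage)
		return -EBUSY;
	g->limit = new_limit;
	return 0;
}
```

With only these two entry points changing the state, there is no sequence of calls that leaves the group above its limit, which is the property the argument relies on.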

Thanks,

	tglx







Thread overview: 43+ messages
2015-05-04  0:54 [PATCH] sched: Relax a restriction in sched_rt_can_attach() Zefan Li
2015-05-04  0:54 ` Zefan Li
2015-05-04  3:13 ` Mike Galbraith
2015-05-04  3:13   ` Mike Galbraith
2015-05-04  4:39   ` Zefan Li
2015-05-04  4:39     ` Zefan Li
2015-05-04  5:10     ` Mike Galbraith
2015-05-04  5:10       ` Mike Galbraith
2015-05-04  5:39       ` Mike Galbraith
2015-05-04  5:39         ` Mike Galbraith
2015-05-04  9:11         ` Zefan Li
2015-05-04  9:11           ` Zefan Li
2015-05-04 12:08           ` Mike Galbraith
2015-05-04 12:08             ` Mike Galbraith
2015-05-04 12:37           ` Peter Zijlstra
2015-05-04 12:37             ` Peter Zijlstra
2015-05-04 14:09             ` Mike Galbraith
2015-05-05  3:46               ` Zefan Li
2015-05-05  3:46                 ` Zefan Li
2015-05-05  6:02                 ` Mike Galbraith
2015-05-05  6:02                   ` Mike Galbraith
2015-05-05  3:54             ` Zefan Li
2015-05-05  3:54               ` Zefan Li
2015-05-05 14:10               ` Peter Zijlstra
2015-05-05 14:18                 ` Tejun Heo
2015-05-05 15:19                   ` Peter Zijlstra
2015-05-05 16:31                     ` Tejun Heo
2015-05-05 19:00                       ` Peter Zijlstra
2015-05-05 19:06                         ` Tejun Heo
2015-05-05 19:06                           ` Tejun Heo
2015-05-06  8:49                           ` Peter Zijlstra
2015-05-06  8:49                             ` Peter Zijlstra
2015-05-05 14:41             ` Tejun Heo
2015-05-05 15:11               ` Peter Zijlstra
2015-05-05 16:13                 ` Tejun Heo
2015-05-05 16:50                   ` Peter Zijlstra
2015-05-05 18:29                     ` Thomas Gleixner
2015-05-05 19:00                       ` Tejun Heo
2015-05-05 19:00                         ` Tejun Heo
2015-05-06  9:12                         ` Thomas Gleixner
2015-05-05 18:31                     ` Tejun Heo
2015-05-05 14:09         ` Tejun Heo
2015-05-05 14:09           ` Tejun Heo
