linux-mm.kvack.org archive mirror
* cgroup-aware OOM killer, how to move forward
@ 2018-07-11 22:40 Roman Gushchin
  2018-07-12 12:07 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Roman Gushchin @ 2018-07-11 22:40 UTC (permalink / raw)
  To: linux-mm; +Cc: akpm, rientjes, mhocko, hannes, tj, gthelen

Hello!

I was thinking about how to move forward with the cgroup-aware OOM killer.
It looks to me that we all agree on the "cleanup" part of the patchset:
it's a nice feature to be able to kill all tasks in a cgroup
to guarantee a consistent state of the workload.
All our disagreements are about the victim selection algorithm.

So I wonder if the right thing to do is to split the problem:
we can agree on the "cleanup" part, which is useful by itself,
merge it upstream, and then return to the victim selection
algorithm.

So, here is my proposal:
let's introduce a memory.group_oom knob with the following semantics:
if the knob is set, the OOM killer kills either none or all of the
tasks in the cgroup*.
It works perfectly well with the current OOM killer (as a "cleanup" option)
and allows _any_ further approach to OOM victim selection.
It also doesn't require any mount/boot/tree-wide options.

How does it sound?

If we can agree on this, I will prepare the patchset.
It's quite small and straightforward in comparison to the current version.

Thanks!


* More precisely: when the OOM killer selects a task to kill,
it will traverse the cgroup tree from the victim's cgroup up to the OOM
domain (the OOMing memcg or the root), looking for the highest-level cgroup
with group_oom set. If such a cgroup exists, it will kill all tasks in it.
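
For illustration only, here is a minimal userspace C model of the traversal
described above. The struct cgrp type and the oom_kill_scope() helper are
invented for this sketch and do not correspond to the kernel's mem_cgroup
API; the sketch only encodes the rule "walk from the victim's cgroup up to
the OOM domain and remember the highest-level cgroup with group_oom set".

/*
 * Minimal userspace model of the proposed memory.group_oom semantics.
 * All names here are invented for illustration and are not kernel API.
 */
#include <stdbool.h>
#include <stddef.h>

struct cgrp {
        struct cgrp *parent;    /* NULL for the root cgroup */
        bool group_oom;         /* value of memory.group_oom */
};

/*
 * Walk from the victim's cgroup up to (and including) the OOM domain,
 * remembering the highest-level cgroup that has group_oom set.
 * Returns the cgroup whose whole set of tasks should be killed, or NULL
 * if only the selected victim task should be killed.
 */
static struct cgrp *oom_kill_scope(struct cgrp *victim, struct cgrp *oom_domain)
{
        struct cgrp *scope = NULL;
        struct cgrp *pos;

        for (pos = victim; pos; pos = pos->parent) {
                if (pos->group_oom)
                        scope = pos;    /* keep the highest one seen so far */
                if (pos == oom_domain)
                        break;          /* never look above the OOM domain */
        }
        return scope;
}

Note that the walk is bounded only by the OOM domain: a cgroup with
group_oom unset on the way up does not stop it, which is what the
"highest-level cgroup" wording above implies.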

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-11 22:40 cgroup-aware OOM killer, how to move forward Roman Gushchin
@ 2018-07-12 12:07 ` Michal Hocko
  2018-07-12 15:55   ` Roman Gushchin
  2018-07-13 21:34 ` David Rientjes
  2018-07-24 11:59 ` Tetsuo Handa
  2 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-12 12:07 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm, akpm, rientjes, hannes, tj, gthelen

On Wed 11-07-18 15:40:03, Roman Gushchin wrote:
> Hello!
> 
> I was thinking on how to move forward with the cgroup-aware OOM killer.
> It looks to me, that we all agree on the "cleanup" part of the patchset:
> it's a nice feature to be able to kill all tasks in the cgroup
> to guarantee the consistent state of the workload.
> All our disagreements are related to the victim selection algorithm.
> 
> So, I wonder, if the right thing to do is to split the problem.
> We can agree on the "cleanup" part, which is useful by itself,
> merge it upstream, and then return to the victim selection
> algorithm.

Could you be more specific which patches are those please?

> So, here is my proposal:
> let's introduce the memory.group_oom knob with the following semantics:
> if the knob is set, the OOM killer can kill either none, either all
> tasks in the cgroup*.
> It can perfectly work with the current OOM killer (as a "cleanup" option),
> and allows _any_ further approach on the OOM victim selection.
> It also doesn't require any mount/boot/tree-wide options.
> 
> How does it sound?

Well, I guess we have already discussed that. One problem I can see with
that approach is that there is a disconnect between the oom-killable
entity and the oom candidate entity. This will matter when we start
seeing reports that the wrong container has been torn down because there
were larger ones running, all just because the latter ones consist
of smaller tasks.

Is this a fundamental roadblock? I am not sure but I would tend to say
_no_ because the oom victim selection has always been an implementation
detail. We just need to kill _somebody_ to release _some_ memory. Killing
the whole workload is a sensible thing to do.

So I would be ok with that even though I am still not sure why we should
start with something half done when your original implementation was
much more consistent. Sure, there is some disagreement, but I suspect
that we will get stuck with an intermediate solution later on again for
the very same reasons. I have summarized [1] the current contention points
and I would really appreciate it if somebody who wasn't involved in the
previous discussions could just join there and weigh the arguments. OOM
selection policy is just a heuristic with some potential drawbacks, and
somebody might object and block otherwise useful features for others
forever.  So we should really find some consensus on what is reasonable and
what is just over the line.

[1] http://lkml.kernel.org/r/20180605114729.GB19202@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-12 12:07 ` Michal Hocko
@ 2018-07-12 15:55   ` Roman Gushchin
  0 siblings, 0 replies; 52+ messages in thread
From: Roman Gushchin @ 2018-07-12 15:55 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, akpm, rientjes, hannes, tj, gthelen

On Thu, Jul 12, 2018 at 02:07:03PM +0200, Michal Hocko wrote:
> On Wed 11-07-18 15:40:03, Roman Gushchin wrote:
> > Hello!
> > 
> > I was thinking on how to move forward with the cgroup-aware OOM killer.
> > It looks to me, that we all agree on the "cleanup" part of the patchset:
> > it's a nice feature to be able to kill all tasks in the cgroup
> > to guarantee the consistent state of the workload.
> > All our disagreements are related to the victim selection algorithm.
> > 
> > So, I wonder, if the right thing to do is to split the problem.
> > We can agree on the "cleanup" part, which is useful by itself,
> > merge it upstream, and then return to the victim selection
> > algorithm.
> 
> Could you be more specific which patches are those please?

It's not quite a part of the existing patchset, but I had such a version
during my work on the current patchset, and it was really small and cute.
I need some time to restore/rebase it.

> 
> > So, here is my proposal:
> > let's introduce the memory.group_oom knob with the following semantics:
> > if the knob is set, the OOM killer can kill either none, either all
> > tasks in the cgroup*.
> > It can perfectly work with the current OOM killer (as a "cleanup" option),
> > and allows _any_ further approach on the OOM victim selection.
> > It also doesn't require any mount/boot/tree-wide options.
> > 
> > How does it sound?
> 
> Well, I guess we have already discussed that. One problem I can see with
> that approach is that there is a disconnection between what is the oom
> killable entity and oom candidate entity. This will matter when we start
> seeing reports that a wrong container has been torn down because there
> were larger ones running. All that just because the latter ones consists
> of smaller tasks.
> 
> Is this a fundamental roadblock? I am not sure but I would tend to say
> _no_ because the oom victim selection has always been an implementation
> detail. We just need to kill _somebody_ to release _some_ memory. Kill
> the whole workload is a sensible thing to do.

Yes. We also use Johannes's memory pressure metrics for making OOM
decisions internally, which is working nicely. In this case the in-kernel
OOM decision logic serves more as a backup solution, and consistency
is the only thing that really matters.

> 
> So I would be ok with that even though I am still not sure why we should
> start with something half done when your original implementation was
> much more consistent. Sure there is some disagreement but I suspect
> that we will get stuck with an intermediate solution later on again for
> very same reasons. I have summarized [1] current contention points and
> I would really appreciate if somebody who wasn't really involved in the
> previous discussions could just join there and weight arguments. OOM
> selection policy is just a heuristic with some potential drawbacks and
> somebody might object and block otherwise useful features for others for
> ever.  So we should really find some consensus on what is reasonable and
> what is just over the line.

I would definitely prefer to just land the existing version, and I prefer
it over this proposal, but it doesn't seem to be moving forward well...

Maybe making the described step first might help.

Thanks,
Roman

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-11 22:40 cgroup-aware OOM killer, how to move forward Roman Gushchin
  2018-07-12 12:07 ` Michal Hocko
@ 2018-07-13 21:34 ` David Rientjes
  2018-07-13 22:16   ` Roman Gushchin
  2018-07-24 11:59 ` Tetsuo Handa
  2 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-07-13 21:34 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Wed, 11 Jul 2018, Roman Gushchin wrote:

> I was thinking on how to move forward with the cgroup-aware OOM killer.
> It looks to me, that we all agree on the "cleanup" part of the patchset:
> it's a nice feature to be able to kill all tasks in the cgroup
> to guarantee the consistent state of the workload.
> All our disagreements are related to the victim selection algorithm.
> 
> So, I wonder, if the right thing to do is to split the problem.
> We can agree on the "cleanup" part, which is useful by itself,
> merge it upstream, and then return to the victim selection
> algorithm.
> 
> So, here is my proposal:
> let's introduce the memory.group_oom knob with the following semantics:
> if the knob is set, the OOM killer can kill either none, either all
> tasks in the cgroup*.
> It can perfectly work with the current OOM killer (as a "cleanup" option),
> and allows _any_ further approach on the OOM victim selection.
> It also doesn't require any mount/boot/tree-wide options.
> 
> How does it sound?
> 

No objection, of course, this was always the mechanism vs policy 
separation that I was referring to.  Having the ability to kill all 
processes attached to the cgroup when one of its processes is selected is 
useful, and we have our own patches that do just that, with the exception 
that it's triggerable by the user.

One of the things that I really like about cgroup v2, though, is what 
appears to be an implicit, but rather apparent, goal to minimize the 
number of files for each controller.  It's very clean.  So I'd suggest 
that we consider memory.group_oom, or however it is named, to allow for 
future development.

For example, rather than simply being binary, we'd probably want the 
ability to kill all eligible processes attached directly to the victim's 
mem cgroup *or* all processes attached to its subtree as well.

I'd suggest it be implemented to accept a string, "default"/"process", 
"local" or "tree"/"hierarchy", or better names, to define the group oom 
mechanism for the mem cgroup that is oom when one of its processes is 
selected as a victim.
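
Purely as a sketch of the suggested interface (the value names come from
the paragraph above; none of this is an existing kernel knob), the write
side of such a string-valued file could map the accepted strings onto an
enum:

/*
 * Hypothetical parsing of the suggested tristate.  The value names
 * follow the suggestion above and are not an existing kernel interface.
 */
#include <string.h>

enum group_oom_mode {
        GROUP_OOM_PROCESS,      /* kill only the selected victim task */
        GROUP_OOM_CGROUP,       /* kill all tasks in the victim's cgroup */
        GROUP_OOM_TREE,         /* kill all tasks in the oom cgroup's subtree */
};

static int parse_group_oom(const char *buf, enum group_oom_mode *mode)
{
        if (!strcmp(buf, "default") || !strcmp(buf, "process"))
                *mode = GROUP_OOM_PROCESS;
        else if (!strcmp(buf, "local") || !strcmp(buf, "cgroup"))
                *mode = GROUP_OOM_CGROUP;
        else if (!strcmp(buf, "tree") || !strcmp(buf, "hierarchy"))
                *mode = GROUP_OOM_TREE;
        else
                return -1;      /* would be -EINVAL in the kernel */
        return 0;
}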

> * More precisely: if the OOM killer kills a task,
> it will traverse the cgroup tree up to the OOM domain (OOMing memcg or root),
> looking for the highest-level cgroup with group_oom set. Then it will
> kill all tasks in such cgroup, if it does exist.
> 

All such processes that are not oom disabled, yes.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 21:34 ` David Rientjes
@ 2018-07-13 22:16   ` Roman Gushchin
  2018-07-13 22:39     ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-13 22:16 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, Jul 13, 2018 at 02:34:49PM -0700, David Rientjes wrote:
> On Wed, 11 Jul 2018, Roman Gushchin wrote:
> 
> > I was thinking on how to move forward with the cgroup-aware OOM killer.
> > It looks to me, that we all agree on the "cleanup" part of the patchset:
> > it's a nice feature to be able to kill all tasks in the cgroup
> > to guarantee the consistent state of the workload.
> > All our disagreements are related to the victim selection algorithm.
> > 
> > So, I wonder, if the right thing to do is to split the problem.
> > We can agree on the "cleanup" part, which is useful by itself,
> > merge it upstream, and then return to the victim selection
> > algorithm.
> > 
> > So, here is my proposal:
> > let's introduce the memory.group_oom knob with the following semantics:
> > if the knob is set, the OOM killer can kill either none, either all
> > tasks in the cgroup*.
> > It can perfectly work with the current OOM killer (as a "cleanup" option),
> > and allows _any_ further approach on the OOM victim selection.
> > It also doesn't require any mount/boot/tree-wide options.
> > 
> > How does it sound?
> > 
> 
> No objection, of course, this was always the mechanism vs policy 
> separation that I was referring to.  Having the ability to kill all 
> processes attached to the cgroup when one of its processes is selected is 
> useful, and we have our own patches that do just that, with the exception 
> that it's triggerable by the user.

Perfect! I'll prepare the patchset.

> 
> One of the things that I really like about cgroup v2, though, is what 
> appears to be an implicit, but rather apparent, goal to minimize the 
> number of files for each controller.  It's very clean.  So I'd suggest 
> that we consider memory.group_oom, or however it is named, to allow for 
> future development.
> 
> For example, rather than simply being binary, we'd probably want the 
> ability to kill all eligible processes attached directly to the victim's 
> mem cgroup *or* all processes attached to its subtree as well.
> 
> I'd suggest it be implemented to accept a string, "default"/"process", 
> "local" or "tree"/"hierarchy", or better names, to define the group oom 
> mechanism for the mem cgroup that is oom when one of its processes is 
> selected as a victim.

I would prefer to keep it boolean to match the simplicity of the cgroup v2 API.
In a v2 hierarchy processes can't be attached to non-leaf cgroups,
so I don't see a place for the third meaning.

> 
> > * More precisely: if the OOM killer kills a task,
> > it will traverse the cgroup tree up to the OOM domain (OOMing memcg or root),
> > looking for the highest-level cgroup with group_oom set. Then it will
> > kill all tasks in such cgroup, if it does exist.
> > 
> 
> All such processes that are not oom disabled, yes.
> 

Yep, of course.

Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 22:16   ` Roman Gushchin
@ 2018-07-13 22:39     ` David Rientjes
  2018-07-13 23:05       ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-07-13 22:39 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, 13 Jul 2018, Roman Gushchin wrote:

> > No objection, of course, this was always the mechanism vs policy 
> > separation that I was referring to.  Having the ability to kill all 
> > processes attached to the cgroup when one of its processes is selected is 
> > useful, and we have our own patches that do just that, with the exception 
> > that it's triggerable by the user.
> 
> Perfect! I'll prepare the patchset.
> 

I mean, I separated it out completely in my own 
https://marc.info/?l=linux-kernel&m=152175564704473 as part of a patch 
series that completely fixes all of the issues with the cgroup aware oom 
killer, so of course there's no objection to separating it out.

> > 
> > One of the things that I really like about cgroup v2, though, is what 
> > appears to be an implicit, but rather apparent, goal to minimize the 
> > number of files for each controller.  It's very clean.  So I'd suggest 
> > that we consider memory.group_oom, or however it is named, to allow for 
> > future development.
> > 
> > For example, rather than simply being binary, we'd probably want the 
> > ability to kill all eligible processes attached directly to the victim's 
> > mem cgroup *or* all processes attached to its subtree as well.
> > 
> > I'd suggest it be implemented to accept a string, "default"/"process", 
> > "local" or "tree"/"hierarchy", or better names, to define the group oom 
> > mechanism for the mem cgroup that is oom when one of its processes is 
> > selected as a victim.
> 
> I would prefer to keep it boolean to match the simplicity of cgroup v2 API.
> In v2 hierarchy processes can't be attached to non-leaf cgroups,
> so I don't see the place for the 3rd meaning.
> 

Not all cgroup v2 files need to be boolean, and the only way you could add 
a subtree oom kill later is to introduce yet another file.  Please make it 
a tristate so that you can define a mechanism of default (process only), 
local cgroup, or subtree, and so we can avoid adding another option later 
that conflicts with the proposed one.  This should be easy.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 22:39     ` David Rientjes
@ 2018-07-13 23:05       ` Roman Gushchin
  2018-07-13 23:11         ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-13 23:05 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, Jul 13, 2018 at 03:39:15PM -0700, David Rientjes wrote:
> On Fri, 13 Jul 2018, Roman Gushchin wrote:
> 
> > > 
> > > One of the things that I really like about cgroup v2, though, is what 
> > > appears to be an implicit, but rather apparent, goal to minimize the 
> > > number of files for each controller.  It's very clean.  So I'd suggest 
> > > that we consider memory.group_oom, or however it is named, to allow for 
> > > future development.
> > > 
> > > For example, rather than simply being binary, we'd probably want the 
> > > ability to kill all eligible processes attached directly to the victim's 
> > > mem cgroup *or* all processes attached to its subtree as well.
> > > 
> > > I'd suggest it be implemented to accept a string, "default"/"process", 
> > > "local" or "tree"/"hierarchy", or better names, to define the group oom 
> > > mechanism for the mem cgroup that is oom when one of its processes is 
> > > selected as a victim.
> > 
> > I would prefer to keep it boolean to match the simplicity of cgroup v2 API.
> > In v2 hierarchy processes can't be attached to non-leaf cgroups,
> > so I don't see the place for the 3rd meaning.
> > 
> 
> All cgroup v2 files do not need to be boolean and the only way you can add 
> a subtree oom kill is to introduce yet another file later.  Please make it 
> tristate so that you can define a mechanism of default (process only), 
> local cgroup, or subtree, and so we can avoid adding another option later 
> that conflicts with the proposed one.  This should be easy.

David, we're adding a cgroup v2 knob, and in cgroup v2 a memory cgroup
has either a sub-tree or attached processes, not both. So there is no
difference between the local cgroup and the subtree.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 23:05       ` Roman Gushchin
@ 2018-07-13 23:11         ` David Rientjes
  2018-07-13 23:16           ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-07-13 23:11 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, 13 Jul 2018, Roman Gushchin wrote:

> > All cgroup v2 files do not need to be boolean and the only way you can add 
> > a subtree oom kill is to introduce yet another file later.  Please make it 
> > tristate so that you can define a mechanism of default (process only), 
> > local cgroup, or subtree, and so we can avoid adding another option later 
> > that conflicts with the proposed one.  This should be easy.
> 
> David, we're adding a cgroup v2 knob, and in cgroup v2 a memory cgroup
> either has a sub-tree, either attached processes. So, there is no difference
> between local cgroup and subtree.
> 

Uhm, what?  We're talking about a common ancestor reaching its limit, so 
it's oom, and it has multiple immediate children with their own processes 
attached.  The difference is killing all processes attached to the 
victim's cgroup or all processes under the oom mem cgroup's subtree.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 23:11         ` David Rientjes
@ 2018-07-13 23:16           ` Roman Gushchin
  2018-07-17  4:19             ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-13 23:16 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, Jul 13, 2018 at 04:11:51PM -0700, David Rientjes wrote:
> On Fri, 13 Jul 2018, Roman Gushchin wrote:
> 
> > > All cgroup v2 files do not need to be boolean and the only way you can add 
> > > a subtree oom kill is to introduce yet another file later.  Please make it 
> > > tristate so that you can define a mechanism of default (process only), 
> > > local cgroup, or subtree, and so we can avoid adding another option later 
> > > that conflicts with the proposed one.  This should be easy.
> > 
> > David, we're adding a cgroup v2 knob, and in cgroup v2 a memory cgroup
> > either has a sub-tree, either attached processes. So, there is no difference
> > between local cgroup and subtree.
> > 
> 
> Uhm, what?  We're talking about a common ancestor reaching its limit, so 
> it's oom, and it has multiple immediate children with their own processes 
> attached.  The difference is killing all processes attached to the 
> victim's cgroup or all processes under the oom mem cgroup's subtree.
> 

But it's a binary decision, no?
If memory.group_oom is set, the whole sub-tree will be killed; otherwise not.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 23:16           ` Roman Gushchin
@ 2018-07-17  4:19             ` David Rientjes
  2018-07-17 12:41               ` Michal Hocko
  2018-07-17 17:38               ` Roman Gushchin
  0 siblings, 2 replies; 52+ messages in thread
From: David Rientjes @ 2018-07-17  4:19 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, 13 Jul 2018, Roman Gushchin wrote:

> > > > All cgroup v2 files do not need to be boolean and the only way you can add 
> > > > a subtree oom kill is to introduce yet another file later.  Please make it 
> > > > tristate so that you can define a mechanism of default (process only), 
> > > > local cgroup, or subtree, and so we can avoid adding another option later 
> > > > that conflicts with the proposed one.  This should be easy.
> > > 
> > > David, we're adding a cgroup v2 knob, and in cgroup v2 a memory cgroup
> > > either has a sub-tree, either attached processes. So, there is no difference
> > > between local cgroup and subtree.
> > > 
> > 
> > Uhm, what?  We're talking about a common ancestor reaching its limit, so 
> > it's oom, and it has multiple immediate children with their own processes 
> > attached.  The difference is killing all processes attached to the 
> > victim's cgroup or all processes under the oom mem cgroup's subtree.
> > 
> 
> But it's a binary decision, no?
> If memory.group_oom set, the whole sub-tree will be killed. Otherwise not.
> 

No, if memory.max is reached and memory.group_oom is set, my understanding 
of your proposal is that a process is chosen and all eligible processes 
attached to its mem cgroup are oom killed.  My desire for a tristate is so 
that it can be specified that all processes attached to the *subtree* are 
oom killed.  With the single unified hierarchy mandated by cgroup v2, we can 
separate descendant cgroups for use with other controllers and enforce 
memory.max by an ancestor.

Making this a boolean value only prevents it from becoming 
extensible.  If memory.group_oom is only effective for the victim's mem 
cgroup, it becomes impossible to specify that all processes in the subtree 
should be oom killed as a result of the ancestor limit without adding yet 
another tunable.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17  4:19             ` David Rientjes
@ 2018-07-17 12:41               ` Michal Hocko
  2018-07-17 17:38               ` Roman Gushchin
  1 sibling, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-17 12:41 UTC (permalink / raw)
  To: David Rientjes; +Cc: Roman Gushchin, linux-mm, akpm, hannes, tj, gthelen

On Mon 16-07-18 21:19:18, David Rientjes wrote:
> On Fri, 13 Jul 2018, Roman Gushchin wrote:
> 
> > > > > All cgroup v2 files do not need to be boolean and the only way you can add 
> > > > > a subtree oom kill is to introduce yet another file later.  Please make it 
> > > > > tristate so that you can define a mechanism of default (process only), 
> > > > > local cgroup, or subtree, and so we can avoid adding another option later 
> > > > > that conflicts with the proposed one.  This should be easy.
> > > > 
> > > > David, we're adding a cgroup v2 knob, and in cgroup v2 a memory cgroup
> > > > either has a sub-tree, either attached processes. So, there is no difference
> > > > between local cgroup and subtree.
> > > > 
> > > 
> > > Uhm, what?  We're talking about a common ancestor reaching its limit, so 
> > > it's oom, and it has multiple immediate children with their own processes 
> > > attached.  The difference is killing all processes attached to the 
> > > victim's cgroup or all processes under the oom mem cgroup's subtree.
> > > 
> > 
> > But it's a binary decision, no?
> > If memory.group_oom set, the whole sub-tree will be killed. Otherwise not.
> > 
> 
> No, if memory.max is reached and memory.group_oom is set, my understanding 
> of your proposal is that a process is chosen and all eligible processes 
> attached to its mem cgroup are oom killed.  My desire for a tristate is so 
> that it can be specified that all processes attached to the *subtree* are 
> oom killed.  With single unified hierarchy mandated by cgroup v2, we can 
> separate descendant cgroups for use with other controllers and enforce 
> memory.max by an ancestor.
> 
> Making this a boolean value is only preventing it from becoming 
> extensible.  If memory.group_oom only is effective for the victim's mem 
> cgroup, it becomes impossible to specify that all processes in the subtree 
> should be oom killed as a result of the ancestor limit without adding yet 
> another tunable.

No, this is mangling the interface again. I have already objected to
this [1]. group_oom only says to tear the whole group down. How you
select this particular group is a completely different story. Conflating
those two things is a bad interface to start with. Killing the whole
cgroup or only a single process should be invariant for the particular
memcg, regardless of what the oom selection policy above it in the
hierarchy is. Either you are an indivisible workload or you are not, full
stop.

If we really need better control over how to select subtrees then this
should be a separate control knob. What it should look like is a matter
of discussion, but it will be hard to find any consensus if the
single-knob-for-single-purpose approach is not clear.

[1] http://lkml.kernel.org/r/20180117160004.GH2900@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17  4:19             ` David Rientjes
  2018-07-17 12:41               ` Michal Hocko
@ 2018-07-17 17:38               ` Roman Gushchin
  2018-07-17 19:49                 ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-17 17:38 UTC (permalink / raw)
  To: David Rientjes; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Mon, Jul 16, 2018 at 09:19:18PM -0700, David Rientjes wrote:
> On Fri, 13 Jul 2018, Roman Gushchin wrote:
> 
> > > > > All cgroup v2 files do not need to be boolean and the only way you can add 
> > > > > a subtree oom kill is to introduce yet another file later.  Please make it 
> > > > > tristate so that you can define a mechanism of default (process only), 
> > > > > local cgroup, or subtree, and so we can avoid adding another option later 
> > > > > that conflicts with the proposed one.  This should be easy.
> > > > 
> > > > David, we're adding a cgroup v2 knob, and in cgroup v2 a memory cgroup
> > > > either has a sub-tree, either attached processes. So, there is no difference
> > > > between local cgroup and subtree.
> > > > 
> > > 
> > > Uhm, what?  We're talking about a common ancestor reaching its limit, so 
> > > it's oom, and it has multiple immediate children with their own processes 
> > > attached.  The difference is killing all processes attached to the 
> > > victim's cgroup or all processes under the oom mem cgroup's subtree.
> > > 
> > 
> > But it's a binary decision, no?
> > If memory.group_oom set, the whole sub-tree will be killed. Otherwise not.
> > 
> 
> No, if memory.max is reached and memory.group_oom is set, my understanding 
> of your proposal is that a process is chosen and all eligible processes 
> attached to its mem cgroup are oom killed.  My desire for a tristate is so 
> that it can be specified that all processes attached to the *subtree* are 
> oom killed.  With single unified hierarchy mandated by cgroup v2, we can 
> separate descendant cgroups for use with other controllers and enforce 
> memory.max by an ancestor.
> 
> Making this a boolean value is only preventing it from becoming 
> extensible.  If memory.group_oom only is effective for the victim's mem 
> cgroup, it becomes impossible to specify that all processes in the subtree 
> should be oom killed as a result of the ancestor limit without adding yet 
> another tunable.

Let me illustrate my proposal with examples. Let's say we have the following
hierarchy, and the biggest process (or the process with the highest
oom_score_adj) is in D.

  /
  |
  A
  |
  B
 / \
C   D

Let's look at different examples and intended behavior:
1) system-wide OOM
  - default settings: the biggest process is killed
  - D/memory.group_oom=1: all processes in D are killed
  - A/memory.group_oom=1: all processes in A are killed
2) memcg oom in B
  - default settings: the biggest process is killed
  - A/memory.group_oom=1: the biggest process is killed
  - B/memory.group_oom=1: all processes in B are killed
  - D/memory.group_oom=1: all processes in D are killed

Please note that processes can't be attached directly to A and B,
so "all processes in A are killed" means all processes in A's sub-tree
are killed. OOM-disabled processes (oom_score_adj=-1000) are excluded.
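
For what it's worth, the examples above can be checked mechanically against
the "highest-level group_oom ancestor within the OOM domain" rule. Everything
below is invented userspace code for illustration, not kernel API:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cgrp {
        struct cgrp *parent;
        bool group_oom;
};

static struct cgrp *oom_kill_scope(struct cgrp *victim, struct cgrp *oom_domain)
{
        struct cgrp *scope = NULL;
        struct cgrp *pos;

        for (pos = victim; pos; pos = pos->parent) {
                if (pos->group_oom)
                        scope = pos;
                if (pos == oom_domain)
                        break;
        }
        return scope;           /* NULL: kill only the selected task */
}

int main(void)
{
        /* hierarchy: / -> A -> B -> {C, D}; the victim task lives in D */
        struct cgrp root = { NULL, false };
        struct cgrp A = { &root, false };
        struct cgrp B = { &A, false };
        struct cgrp D = { &B, false };

        /* 1) system-wide OOM: the OOM domain is the root cgroup */
        assert(oom_kill_scope(&D, &root) == NULL);      /* defaults */
        D.group_oom = true;
        assert(oom_kill_scope(&D, &root) == &D);        /* all of D */
        D.group_oom = false;
        A.group_oom = true;
        assert(oom_kill_scope(&D, &root) == &A);        /* all of A's sub-tree */

        /* 2) memcg oom in B: A is above the OOM domain, so it is ignored */
        assert(oom_kill_scope(&D, &B) == NULL);
        B.group_oom = true;
        assert(oom_kill_scope(&D, &B) == &B);           /* all of B */
        B.group_oom = false;
        D.group_oom = true;
        assert(oom_kill_scope(&D, &B) == &D);           /* all of D */

        return 0;
}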

I believe that this model is complete and doesn't require any further
extension.

Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 17:38               ` Roman Gushchin
@ 2018-07-17 19:49                 ` Michal Hocko
  2018-07-17 20:06                   ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-17 19:49 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen

On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
[...]
> Let me show my proposal on examples. Let's say we have the following hierarchy,
> and the biggest process (or the process with highest oom_score_adj) is in D.
> 
>   /
>   |
>   A
>   |
>   B
>  / \
> C   D
> 
> Let's look at different examples and intended behavior:
> 1) system-wide OOM
>   - default settings: the biggest process is killed
>   - D/memory.group_oom=1: all processes in D are killed
>   - A/memory.group_oom=1: all processes in A are killed
> 2) memcg oom in B
>   - default settings: the biggest process is killed
>   - A/memory.group_oom=1: the biggest process is killed

Huh? Why would you even consider A here when the oom is below it?
/me confused

>   - B/memory.group_oom=1: all processes in B are killed

    - B/memory.group_oom=0 &&
>   - D/memory.group_oom=1: all processes in D are killed

What about?
    - B/memory.group_oom=1 && D/memory.group_oom=0

Is this a sane configuration?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 19:49                 ` Michal Hocko
@ 2018-07-17 20:06                   ` Roman Gushchin
  2018-07-17 20:41                     ` David Rientjes
  2018-07-18  8:12                     ` Michal Hocko
  0 siblings, 2 replies; 52+ messages in thread
From: Roman Gushchin @ 2018-07-17 20:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen

On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote:
> On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
> [...]
> > Let me show my proposal on examples. Let's say we have the following hierarchy,
> > and the biggest process (or the process with highest oom_score_adj) is in D.
> > 
> >   /
> >   |
> >   A
> >   |
> >   B
> >  / \
> > C   D
> > 
> > Let's look at different examples and intended behavior:
> > 1) system-wide OOM
> >   - default settings: the biggest process is killed
> >   - D/memory.group_oom=1: all processes in D are killed
> >   - A/memory.group_oom=1: all processes in A are killed
> > 2) memcg oom in B
> >   - default settings: the biggest process is killed
> >   - A/memory.group_oom=1: the biggest process is killed
> 
> Huh? Why would you even consider A here when the oom is below it?
> /me confused

I do not.
This is exactly a counter-example: A's memory.group_oom
is not considered at all in this case,
because A is above the OOMing cgroup.

> 
> >   - B/memory.group_oom=1: all processes in B are killed
> 
>     - B/memory.group_oom=0 &&
> >   - D/memory.group_oom=1: all processes in D are killed
> 
> What about?
>     - B/memory.group_oom=1 && D/memory.group_oom=0

All tasks in B are killed.

Group_oom set to 1 means that the workload can't tolerate
killing of a random process, so in this case it's better
to guarantee consistency for B.

Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 20:06                   ` Roman Gushchin
@ 2018-07-17 20:41                     ` David Rientjes
  2018-07-17 20:52                       ` Roman Gushchin
  2018-07-18  8:19                       ` Michal Hocko
  2018-07-18  8:12                     ` Michal Hocko
  1 sibling, 2 replies; 52+ messages in thread
From: David Rientjes @ 2018-07-17 20:41 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Michal Hocko, linux-mm, akpm, hannes, tj, gthelen

On Tue, 17 Jul 2018, Roman Gushchin wrote:

> > > Let me show my proposal on examples. Let's say we have the following hierarchy,
> > > and the biggest process (or the process with highest oom_score_adj) is in D.
> > > 
> > >   /
> > >   |
> > >   A
> > >   |
> > >   B
> > >  / \
> > > C   D
> > > 
> > > Let's look at different examples and intended behavior:
> > > 1) system-wide OOM
> > >   - default settings: the biggest process is killed
> > >   - D/memory.group_oom=1: all processes in D are killed
> > >   - A/memory.group_oom=1: all processes in A are killed
> > > 2) memcg oom in B
> > >   - default settings: the biggest process is killed
> > >   - A/memory.group_oom=1: the biggest process is killed
> > 
> > Huh? Why would you even consider A here when the oom is below it?
> > /me confused
> 
> I do not.
> This is exactly a counter-example: A's memory.group_oom
> is not considered at all in this case,
> because A is above ooming cgroup.
> 

I think the confusion is that this says A/memory.group_oom=1 and then the 
biggest process is killed, which doesn't seem like it matches the 
description you want to give memory.group_oom.

> > >   - B/memory.group_oom=1: all processes in B are killed
> > 
> >     - B/memory.group_oom=0 &&
> > >   - D/memory.group_oom=1: all processes in D are killed
> > 
> > What about?
> >     - B/memory.group_oom=1 && D/memory.group_oom=0
> 
> All tasks in B are killed.
> 
> Group_oom set to 1 means that the workload can't tolerate
> killing of a random process, so in this case it's better
> to guarantee consistency for B.
> 

This example is missing the usecase that I was referring to, i.e. killing 
all processes attached to a subtree because the limit on a common ancestor 
has been reached.

In your example, I would think that the memory.group_oom settings of /A and 
/A/B are meaningless because there are no processes attached to them.

IIUC, your proposal is to select the victim by whatever means, check the 
memory.group_oom setting of that victim, and then either kill the victim 
or all processes attached to that local mem cgroup depending on the 
setting.

However, if C and D here are only limited by a common ancestor, /A or 
/A/B, there is no way to specify that the subtree itself should be oom 
killed.  That was where I thought a tristate value would be helpful such 
that you can define all processes attached to the subtree should be oom 
killed when a mem cgroup has reached memory.max.

I was purposefully overloading memory.group_oom because the actual value 
of memory.group_oom given your semantics here is not relevant for /A or 
/A/B.  I think an additional memory.group_oom_tree or whatever it would be 
called would lead to unnecessary confusion because then we have a model 
where one tunable means something based on the value of the other.

Given the no-internal-process constraint of cgroup v2, my suggestion was a 
value, "tree", that could specify that a mem cgroup reaching its limit 
causes all processes attached to its subtree to be killed.  This is 
required only because of the single unified hierarchy of cgroup v2: we may 
want to bind a subset of processes to be controlled by another 
controller separately but still want all processes oom killed when 
reaching the limit of a common ancestor.

Thus, the semantic would be: if the oom mem cgroup is "tree", kill all 
processes in its subtree; otherwise, it can be "cgroup" or "process" to 
determine what is oom killed depending on the victim selection.

Having the "tree" behavior could definitely be implemented as a separate 
tunable; but then the values of /A/memory.group_oom and 
/A/B/memory.group_oom are irrelevant and, to me, that seems like it would be 
more confusing.
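
To make the distinction concrete, here is a sketch in an invented userspace
model (not kernel code): the "cgroup"/"local" scope kills only the tasks
attached to one cgroup, while the "tree" scope is the same operation applied
recursively to everything under the cgroup that hit its limit.

#include <stddef.h>

struct task {
        struct task *next;
};

struct cgrp {
        struct cgrp *first_child;
        struct cgrp *next_sibling;
        struct task *tasks;             /* tasks attached to this cgroup */
};

static void kill_task(struct task *t)
{
        (void)t;        /* a real implementation would send SIGKILL here */
}

/* "cgroup"/"local" scope: only tasks attached to this one cgroup die */
static void kill_cgroup_tasks(struct cgrp *cg)
{
        struct task *t;

        for (t = cg->tasks; t; t = t->next)
                kill_task(t);
}

/* "tree" scope: the same thing applied recursively to the whole subtree */
static void kill_subtree_tasks(struct cgrp *cg)
{
        struct cgrp *child;

        kill_cgroup_tasks(cg);
        for (child = cg->first_child; child; child = child->next_sibling)
                kill_subtree_tasks(child);
}

The disagreement in this sub-thread is about which cgroup the recursive
variant should be applied to: the victim's group_oom scope, or the cgroup
that actually hit memory.max.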

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 20:41                     ` David Rientjes
@ 2018-07-17 20:52                       ` Roman Gushchin
  2018-07-20  8:30                         ` David Rientjes
  2018-07-18  8:19                       ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-17 20:52 UTC (permalink / raw)
  To: David Rientjes; +Cc: Michal Hocko, linux-mm, akpm, hannes, tj, gthelen

On Tue, Jul 17, 2018 at 01:41:33PM -0700, David Rientjes wrote:
> On Tue, 17 Jul 2018, Roman Gushchin wrote:
> 
> > > > Let me show my proposal on examples. Let's say we have the following hierarchy,
> > > > and the biggest process (or the process with highest oom_score_adj) is in D.
> > > > 
> > > >   /
> > > >   |
> > > >   A
> > > >   |
> > > >   B
> > > >  / \
> > > > C   D
> > > > 
> > > > Let's look at different examples and intended behavior:
> > > > 1) system-wide OOM
> > > >   - default settings: the biggest process is killed
> > > >   - D/memory.group_oom=1: all processes in D are killed
> > > >   - A/memory.group_oom=1: all processes in A are killed
> > > > 2) memcg oom in B
> > > >   - default settings: the biggest process is killed
> > > >   - A/memory.group_oom=1: the biggest process is killed
> > > 
> > > Huh? Why would you even consider A here when the oom is below it?
> > > /me confused
> > 
> > I do not.
> > This is exactly a counter-example: A's memory.group_oom
> > is not considered at all in this case,
> > because A is above ooming cgroup.
> > 
> 
> I think the confusion is that this says A/memory.group_oom=1 and then the 
> biggest process is killed, which doesn't seem like it matches the 
> description you want to give memory.group_oom.

It matches perfectly, as the description says that the kernel will
look for the highest-level cgroup with group_oom set, up to the OOM domain.
Here B is the OOM domain, so A's settings are irrelevant.

> 
> > > >   - B/memory.group_oom=1: all processes in B are killed
> > > 
> > >     - B/memory.group_oom=0 &&
> > > >   - D/memory.group_oom=1: all processes in D are killed
> > > 
> > > What about?
> > >     - B/memory.group_oom=1 && D/memory.group_oom=0
> > 
> > All tasks in B are killed.
> > 
> > Group_oom set to 1 means that the workload can't tolerate
> > killing of a random process, so in this case it's better
> > to guarantee consistency for B.
> > 
> 
> This example is missing the usecase that I was referring to, i.e. killing 
> all processes attached to a subtree because the limit on a common ancestor 
> has been reached.
> 
> In your example, I would think that the memory.group_oom setting of /A and 
> /A/B are meaningless because there are no processes attached to them.
> 
> IIUC, your proposal is to select the victim by whatever means, check the 
> memory.group_oom setting of that victim, and then either kill the victim 
> or all processes attached to that local mem cgroup depending on the 
> setting.

Sorry, I don't get what you are saying.
In cgroup v2 processes can't be attached to A and B.
There is no such thing as a "local mem cgroup" at all.

Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 20:06                   ` Roman Gushchin
  2018-07-17 20:41                     ` David Rientjes
@ 2018-07-18  8:12                     ` Michal Hocko
  2018-07-18 15:28                       ` Roman Gushchin
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-18  8:12 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen

On Tue 17-07-18 13:06:42, Roman Gushchin wrote:
> On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote:
> > On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
> > [...]
> > > Let me show my proposal on examples. Let's say we have the following hierarchy,
> > > and the biggest process (or the process with highest oom_score_adj) is in D.
> > > 
> > >   /
> > >   |
> > >   A
> > >   |
> > >   B
> > >  / \
> > > C   D
> > > 
> > > Let's look at different examples and intended behavior:
> > > 1) system-wide OOM
> > >   - default settings: the biggest process is killed
> > >   - D/memory.group_oom=1: all processes in D are killed
> > >   - A/memory.group_oom=1: all processes in A are killed
> > > 2) memcg oom in B
> > >   - default settings: the biggest process is killed
> > >   - A/memory.group_oom=1: the biggest process is killed
> > 
> > Huh? Why would you even consider A here when the oom is below it?
> > /me confused
> 
> I do not.
> This is exactly a counter-example: A's memory.group_oom
> is not considered at all in this case,
> because A is above ooming cgroup.

OK, it confused me.

> > 
> > >   - B/memory.group_oom=1: all processes in B are killed
> > 
> >     - B/memory.group_oom=0 &&
> > >   - D/memory.group_oom=1: all processes in D are killed
> > 
> > What about?
> >     - B/memory.group_oom=1 && D/memory.group_oom=0
> 
> All tasks in B are killed.

so essentially find a task, traverse the memcg hierarchy from the
victim's memcg up to the oom root as long as memcg.group_oom = 1?
If the resulting memcg.group_oom == 1 then kill the whole sub tree.
Right?

> Group_oom set to 1 means that the workload can't tolerate
> killing of a random process, so in this case it's better
> to guarantee consistency for B.

OK, but then if D itself is OOM we do not care about consistency
all of a sudden? I have a hard time thinking of a sensible usecase.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 20:41                     ` David Rientjes
  2018-07-17 20:52                       ` Roman Gushchin
@ 2018-07-18  8:19                       ` Michal Hocko
  1 sibling, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-18  8:19 UTC (permalink / raw)
  To: David Rientjes; +Cc: Roman Gushchin, linux-mm, akpm, hannes, tj, gthelen

On Tue 17-07-18 13:41:33, David Rientjes wrote:
[...]
> Thus, the semantic would be: if oom mem cgroup is "tree", kill all 
> processes in subtree; otherwise, it can be "cgroup" or "process" to 
> determine what is oom killed depending on the victim selection.

Why should an intermediate node be any different from the leaf? If you
want to tear down the whole subtree, just make it oom_cgroup = true and
be done with that. Why do we even need to call it "tree"?
 
> Having the "tree" behavior could definitely be implemented as a separate 
> tunable; but then then value of /A/memory.group_oom and 
> /A/B/memory.group_oom are irrelevant and, to me, seems like it would be 
> more confusing.

I am sorry, I do not follow. How are the following two different?
A (tree)	A (group)
|		|
B (tree)	B (group)
|		|
C (process)	C (group=false)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-18  8:12                     ` Michal Hocko
@ 2018-07-18 15:28                       ` Roman Gushchin
  2018-07-19  7:38                         ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-18 15:28 UTC (permalink / raw)
  To: Michal Hocko; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen

On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote:
> On Tue 17-07-18 13:06:42, Roman Gushchin wrote:
> > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote:
> > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
> > > [...]
> > > > Let me show my proposal on examples. Let's say we have the following hierarchy,
> > > > and the biggest process (or the process with highest oom_score_adj) is in D.
> > > > 
> > > >   /
> > > >   |
> > > >   A
> > > >   |
> > > >   B
> > > >  / \
> > > > C   D
> > > > 
> > > > Let's look at different examples and intended behavior:
> > > > 1) system-wide OOM
> > > >   - default settings: the biggest process is killed
> > > >   - D/memory.group_oom=1: all processes in D are killed
> > > >   - A/memory.group_oom=1: all processes in A are killed
> > > > 2) memcg oom in B
> > > >   - default settings: the biggest process is killed
> > > >   - A/memory.group_oom=1: the biggest process is killed
> > > 
> > > Huh? Why would you even consider A here when the oom is below it?
> > > /me confused
> > 
> > I do not.
> > This is exactly a counter-example: A's memory.group_oom
> > is not considered at all in this case,
> > because A is above ooming cgroup.
> 
> OK, it confused me.
> 
> > > 
> > > >   - B/memory.group_oom=1: all processes in B are killed
> > > 
> > >     - B/memory.group_oom=0 &&
> > > >   - D/memory.group_oom=1: all processes in D are killed
> > > 
> > > What about?
> > >     - B/memory.group_oom=1 && D/memory.group_oom=0
> > 
> > All tasks in B are killed.
> 
> so essentially find a task, traverse the memcg hierarchy from the
> victim's memcg up to the oom root as long as memcg.group_oom = 1?
> If the resulting memcg.group_oom == 1 then kill the whole sub tree.
> Right?

Yes.

> 
> > Group_oom set to 1 means that the workload can't tolerate
> > killing of a random process, so in this case it's better
> > to guarantee consistency for B.
> 
> OK, but then if D itself is OOM then we do not care about consistency
> all of the sudden? I have hard time to think about a sensible usecase.

I mean that if, while traversing the hierarchy up to the oom root, we meet
a memcg with group_oom set to 0, we shouldn't stop traversing.

Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-18 15:28                       ` Roman Gushchin
@ 2018-07-19  7:38                         ` Michal Hocko
  2018-07-19 17:05                           ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-19  7:38 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen

On Wed 18-07-18 08:28:50, Roman Gushchin wrote:
> On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote:
> > On Tue 17-07-18 13:06:42, Roman Gushchin wrote:
> > > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote:
> > > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
> > > > [...]
> > > > > Let me show my proposal on examples. Let's say we have the following hierarchy,
> > > > > and the biggest process (or the process with highest oom_score_adj) is in D.
> > > > > 
> > > > >   /
> > > > >   |
> > > > >   A
> > > > >   |
> > > > >   B
> > > > >  / \
> > > > > C   D
> > > > > 
> > > > > Let's look at different examples and intended behavior:
> > > > > 1) system-wide OOM
> > > > >   - default settings: the biggest process is killed
> > > > >   - D/memory.group_oom=1: all processes in D are killed
> > > > >   - A/memory.group_oom=1: all processes in A are killed
> > > > > 2) memcg oom in B
> > > > >   - default settings: the biggest process is killed
> > > > >   - A/memory.group_oom=1: the biggest process is killed
> > > > 
> > > > Huh? Why would you even consider A here when the oom is below it?
> > > > /me confused
> > > 
> > > I do not.
> > > This is exactly a counter-example: A's memory.group_oom
> > > is not considered at all in this case,
> > > because A is above ooming cgroup.
> > 
> > OK, it confused me.
> > 
> > > > 
> > > > >   - B/memory.group_oom=1: all processes in B are killed
> > > > 
> > > >     - B/memory.group_oom=0 &&
> > > > >   - D/memory.group_oom=1: all processes in D are killed
> > > > 
> > > > What about?
> > > >     - B/memory.group_oom=1 && D/memory.group_oom=0
> > > 
> > > All tasks in B are killed.
> > 
> > so essentially find a task, traverse the memcg hierarchy from the
> > victim's memcg up to the oom root as long as memcg.group_oom = 1?
> > If the resulting memcg.group_oom == 1 then kill the whole sub tree.
> > Right?
> 
> Yes.
> 
> > 
> > > Group_oom set to 1 means that the workload can't tolerate
> > > killing of a random process, so in this case it's better
> > > to guarantee consistency for B.
> > 
> > OK, but then if D itself is OOM then we do not care about consistency
> > all of the sudden? I have hard time to think about a sensible usecase.
> 
> I mean if traversing the hierarchy up to the oom root we meet
> a memcg with group_oom set to 0, we shouldn't stop traversing.

Well, I am still struggling with the semantics of a group, no-group, group
configuration. Why does it make any sense? In other words, when can we
consider a cgroup to be an indivisible workload in one oom context while
it is fine for it to lose a head or an arm in another?

Anyway, your previous implementation would allow the same configuration
as well, so this is nothing really new. The new selection policy you are
proposing just makes it more obvious. So that doesn't mean this is a
roadblock, but I think we should be really thinking hard about this.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-19  7:38                         ` Michal Hocko
@ 2018-07-19 17:05                           ` Roman Gushchin
  2018-07-20  8:32                             ` David Rientjes
  2018-07-23 14:17                             ` Michal Hocko
  0 siblings, 2 replies; 52+ messages in thread
From: Roman Gushchin @ 2018-07-19 17:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen

On Thu, Jul 19, 2018 at 09:38:43AM +0200, Michal Hocko wrote:
> On Wed 18-07-18 08:28:50, Roman Gushchin wrote:
> > On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote:
> > > On Tue 17-07-18 13:06:42, Roman Gushchin wrote:
> > > > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote:
> > > > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
> > > > > [...]
> > > > > > Let me show my proposal on examples. Let's say we have the following hierarchy,
> > > > > > and the biggest process (or the process with highest oom_score_adj) is in D.
> > > > > > 
> > > > > >   /
> > > > > >   |
> > > > > >   A
> > > > > >   |
> > > > > >   B
> > > > > >  / \
> > > > > > C   D
> > > > > > 
> > > > > > Let's look at different examples and intended behavior:
> > > > > > 1) system-wide OOM
> > > > > >   - default settings: the biggest process is killed
> > > > > >   - D/memory.group_oom=1: all processes in D are killed
> > > > > >   - A/memory.group_oom=1: all processes in A are killed
> > > > > > 2) memcg oom in B
> > > > > >   - default settings: the biggest process is killed
> > > > > >   - A/memory.group_oom=1: the biggest process is killed
> > > > > 
> > > > > Huh? Why would you even consider A here when the oom is below it?
> > > > > /me confused
> > > > 
> > > > I do not.
> > > > This is exactly a counter-example: A's memory.group_oom
> > > > is not considered at all in this case,
> > > > because A is above ooming cgroup.
> > > 
> > > OK, it confused me.
> > > 
> > > > > 
> > > > > >   - B/memory.group_oom=1: all processes in B are killed
> > > > > 
> > > > >     - B/memory.group_oom=0 &&
> > > > > >   - D/memory.group_oom=1: all processes in D are killed
> > > > > 
> > > > > What about?
> > > > >     - B/memory.group_oom=1 && D/memory.group_oom=0
> > > > 
> > > > All tasks in B are killed.
> > > 
> > > so essentially find a task, traverse the memcg hierarchy from the
> > > victim's memcg up to the oom root as long as memcg.group_oom = 1?
> > > If the resulting memcg.group_oom == 1 then kill the whole sub tree.
> > > Right?
> > 
> > Yes.
> > 
> > > 
> > > > Group_oom set to 1 means that the workload can't tolerate
> > > > killing of a random process, so in this case it's better
> > > > to guarantee consistency for B.
> > > 
> > > OK, but then if D itself is OOM then we do not care about consistency
> > > all of the sudden? I have hard time to think about a sensible usecase.
> > 
> > I mean if traversing the hierarchy up to the oom root we meet
> > a memcg with group_oom set to 0, we shouldn't stop traversing.
> 
> Well, I am still fighting with the semantic of group, no-group, group
> configuration. Why does it make any sense? In other words when can we
> consider a cgroup to be a indivisible workload for one oom context while
> it is fine to lose head or arm from another?

Hm, so the question is whether we should traverse up to the OOMing cgroup
or up to the first cgroup with memory.group_oom=0?

I looked at an example, and it *might* be that the latter is better,
especially if we make the default value inheritable.

Let's say we have a sub-tree with a workload and some control stuff.
The workload is tolerant of OOMs (we can handle them in userspace, for
example), but the control stuff is not.
Then it probably makes no sense to kill the entire sub-tree
if a task in C has to be killed. But it makes perfect sense if we
have to kill a task in B.

  /
  |
  A, delegated sub-tree, group_oom=1
 / \
B   C, workload, group_oom=0
^
some control stuff here, group_oom=1

Does this make sense?
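
In terms of the illustrative userspace model used earlier (invented types,
not kernel code), the two options differ only in when the upward walk gives
up:

#include <stdbool.h>
#include <stddef.h>

struct cgrp {
        struct cgrp *parent;
        bool group_oom;
};

/* variant 1: walk all the way up to the OOM domain, remembering the
 * highest-level cgroup with group_oom set, even across group_oom=0 */
static struct cgrp *scope_up_to_domain(struct cgrp *victim,
                                       struct cgrp *oom_domain)
{
        struct cgrp *scope = NULL;
        struct cgrp *pos;

        for (pos = victim; pos; pos = pos->parent) {
                if (pos->group_oom)
                        scope = pos;
                if (pos == oom_domain)
                        break;
        }
        return scope;
}

/* variant 2: give up at the first cgroup with group_oom unset */
static struct cgrp *scope_up_to_first_clear(struct cgrp *victim,
                                            struct cgrp *oom_domain)
{
        struct cgrp *scope = NULL;
        struct cgrp *pos;

        for (pos = victim; pos; pos = pos->parent) {
                if (!pos->group_oom)
                        break;
                scope = pos;
                if (pos == oom_domain)
                        break;
        }
        return scope;
}

With the layout above (A and the control cgroup B have group_oom=1, the
workload cgroup C has group_oom=0), the second variant kills only the
selected task when the victim is in C, but tears down all of A when the
victim is in B, which matches the behaviour described here.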

> Anyway, your previous implementation would allow the same configuration
> as well, so this is nothing really new. The new selection policy you are
> proposing just makes it more obvious. So that doesn't mean this is a
> roadblock but I think we should be really thinking hard about this.

I agree.

Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 20:52                       ` Roman Gushchin
@ 2018-07-20  8:30                         ` David Rientjes
  2018-07-20 11:21                           ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: David Rientjes @ 2018-07-20  8:30 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Michal Hocko, linux-mm, akpm, hannes, tj, gthelen

On Tue, 17 Jul 2018, Roman Gushchin wrote:

> > This example is missing the usecase that I was referring to, i.e. killing 
> > all processes attached to a subtree because the limit on a common ancestor 
> > has been reached.
> > 
> > In your example, I would think that the memory.group_oom setting of /A and 
> > /A/B are meaningless because there are no processes attached to them.
> > 
> > IIUC, your proposal is to select the victim by whatever means, check the 
> > memory.group_oom setting of that victim, and then either kill the victim 
> > or all processes attached to that local mem cgroup depending on the 
> > setting.
> 
> Sorry, I don't get what are you saying.
> In cgroup v2 processes can't be attached to A and B.
> There is no such thing as "local mem cgroup" at all.
> 

Read the second paragraph, yes, there are no processes attached to either 
mem cgroup.  I'm saying "group oom" can take on two different meanings: 
one for the behavior when the mem cgroup reaches its limit (a direct 
ancestor with no processes attached) and one for the mem cgroup of the 
process chosen for oom kill.  I know that you care about the latter.  My 
*only* suggestion was for the tunable to take a string instead of a 
boolean so it is extensible for future use.  This seems like something so 
trivial.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-19 17:05                           ` Roman Gushchin
@ 2018-07-20  8:32                             ` David Rientjes
  2018-07-23 14:17                             ` Michal Hocko
  1 sibling, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-07-20  8:32 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Michal Hocko, linux-mm, akpm, hannes, tj, gthelen

On Thu, 19 Jul 2018, Roman Gushchin wrote:

> Hm, so the question is should we traverse up to the OOMing cgroup,
> or up to the first cgroup with memory.group_oom=0?
> 
> I looked at an example, and it *might* be the latter is better,
> especially if we'll make the default value inheritable.
> 
> Let's say we have a sub-tree with a workload and some control stuff.
> Workload is tolerable to OOM's (we can handle it in userspace, for
> example), but the control stuff is not.
> Then it probably makes no sense to kill the entire sub-tree,
> if a task in C has to be killed. But makes perfect sense if we
> have to kill a task in B.
> 
>   /
>   |
>   A, delegated sub-tree, group_oom=1
>  / \
> B   C, workload, group_oom=0
> ^
> some control stuff here, group_oom=1
> 
> Does this make sense?
> 

The *only* suggestion here was that memory.group_oom take a string instead 
of a boolean value so that it can be extended later, especially if 
introducing another tunable is problematic because it clashes with 
the semantics of this one.  This is *so* trivial to do.  Anything that is 
going to care about setting up cgroup oom killing will have no problems 
writing a string instead of an integer.  I'm asking that you don't back 
the interface into a corner where extending it later is problematic.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-20  8:30                         ` David Rientjes
@ 2018-07-20 11:21                           ` Tejun Heo
  2018-07-20 16:13                             ` Roman Gushchin
  2018-07-20 20:28                             ` David Rientjes
  0 siblings, 2 replies; 52+ messages in thread
From: Tejun Heo @ 2018-07-20 11:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Roman Gushchin, Michal Hocko, linux-mm, akpm, hannes, gthelen

Hello,

On Fri, Jul 20, 2018 at 01:30:00AM -0700, David Rientjes wrote:
...
> process chosen for oom kill.  I know that you care about the latter.  My 
> *only* suggestion was for the tunable to take a string instead of a 
> boolean so it is extensible for future use.  This seems like something so 
> trivial.

So, I'd much prefer it as a boolean.  It's a fundamentally binary
property: either handle the cgroup as a unit when it's chosen as the oom
victim or not, nothing more.  I don't see the (interface-wise) benefits of
preparing for further oom policy extensions.  If that happens, it
should be through a separate interface file.  The number of files
isn't the most important criterion an interface is designed around.

Roman, can you rename it tho to memory.oom.group?  That's how other
interface files are scoped, and it'd be better in case we add further
oom-related interface files later.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-20 11:21                           ` Tejun Heo
@ 2018-07-20 16:13                             ` Roman Gushchin
  2018-07-20 20:28                             ` David Rientjes
  1 sibling, 0 replies; 52+ messages in thread
From: Roman Gushchin @ 2018-07-20 16:13 UTC (permalink / raw)
  To: Tejun Heo; +Cc: David Rientjes, Michal Hocko, linux-mm, akpm, hannes, gthelen

On Fri, Jul 20, 2018 at 04:21:31AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jul 20, 2018 at 01:30:00AM -0700, David Rientjes wrote:
> ...
> > process chosen for oom kill.  I know that you care about the latter.  My 
> > *only* suggestion was for the tunable to take a string instead of a 
> > boolean so it is extensible for future use.  This seems like something so 
> > trivial.
> 
> So, I'd much prefer it as boolean.  It's a fundamentally binary
> property, either handle the cgroup as a unit when chosen as oom victim
> or not, nothing more.  I don't see the (interface-wise) benefits of
> preparing for further oom policy extensions.  If that happens, it
> should be through a separate interface file.  The number of files
> isn't the most important criteria interface is designed on.
> 
> Roman, can you rename it tho to memory.oom.group?  That's how other
> interface files are scoped and it'd be better if we try to add further
> oom related interface files later.

Yes, sure, this looks good to me.

Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-20 11:21                           ` Tejun Heo
  2018-07-20 16:13                             ` Roman Gushchin
@ 2018-07-20 20:28                             ` David Rientjes
  2018-07-20 20:47                               ` Roman Gushchin
  2018-07-23 14:12                               ` Michal Hocko
  1 sibling, 2 replies; 52+ messages in thread
From: David Rientjes @ 2018-07-20 20:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, Michal Hocko, linux-mm, akpm, hannes, gthelen

On Fri, 20 Jul 2018, Tejun Heo wrote:

> > process chosen for oom kill.  I know that you care about the latter.  My 
> > *only* suggestion was for the tunable to take a string instead of a 
> > boolean so it is extensible for future use.  This seems like something so 
> > trivial.
> 
> So, I'd much prefer it as boolean.  It's a fundamentally binary
> property, either handle the cgroup as a unit when chosen as oom victim
> or not, nothing more.

With the single hierarchy mandate of cgroup v2, the need arises to 
separate processes from a single job into subcontainers for use with 
controllers other than mem cgroup.  In that case, we have no functionality 
to oom kill all processes in the subtree.

A boolean can kill all processes attached to the victim's mem cgroup, but 
cannot kill all processes in a subtree if the limit of a common ancestor 
is reached.  The common ancestor is needed to enforce a single memory 
limit but allow for processes to be constrained separately with other 
controllers. 

So if group oom takes on a boolean type, then we mandate that all 
processes to be killed must share the same cgroup which cannot always be 
done.  Thus, I was suggesting that group oom can also configure for 
subtree killing when the limit of a shared ancestor is reached.  This is 
unique only to non-leaf cgroups.  So non-leaf and leaf cgroups have 
mutually exclusive group oom settings; if we have two tunables, which this 
would otherwise require, the setting of one would always be irrelevant 
based on non-leaf or leaf.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-20 20:28                             ` David Rientjes
@ 2018-07-20 20:47                               ` Roman Gushchin
  2018-07-23 23:06                                 ` David Rientjes
  2018-07-23 14:12                               ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-20 20:47 UTC (permalink / raw)
  To: David Rientjes; +Cc: Tejun Heo, Michal Hocko, linux-mm, akpm, hannes, gthelen

On Fri, Jul 20, 2018 at 01:28:56PM -0700, David Rientjes wrote:
> On Fri, 20 Jul 2018, Tejun Heo wrote:
> 
> > > process chosen for oom kill.  I know that you care about the latter.  My 
> > > *only* suggestion was for the tunable to take a string instead of a 
> > > boolean so it is extensible for future use.  This seems like something so 
> > > trivial.
> > 
> > So, I'd much prefer it as boolean.  It's a fundamentally binary
> > property, either handle the cgroup as a unit when chosen as oom victim
> > or not, nothing more.
> 
> With the single hierarchy mandate of cgroup v2, the need arises to 
> separate processes from a single job into subcontainers for use with 
> controllers other than mem cgroup.  In that case, we have no functionality 
> to oom kill all processes in the subtree.
> 
> A boolean can kill all processes attached to the victim's mem cgroup, but 
> cannot kill all processes in a subtree if the limit of a common ancestor 
> is reached.

Why so?

Once again, my proposal:
as soon as the OOM killer has selected a victim task,
we look at the victim task's memory cgroup.
If memory.oom.group is not set, we're done.
Otherwise we traverse the memory cgroup tree up to
the OOMing cgroup (or root) as long as memory.oom.group is set,
and kill the last such cgroup entirely (including all children).

Please note:
we do not look at memory.oom.group of the OOMing cgroup,
we're looking at the memcg of the victim task.
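
As a sketch of the teardown step only ("kill the last cgroup entirely,
including all children"), in a made-up user-space model; kill_cgroup_tree()
and the fixed-size arrays are purely illustrative, not the mm/oom_kill.c
implementation:

#include <signal.h>
#include <sys/types.h>

#define MAX_KIDS  8
#define MAX_TASKS 16

struct cg {
    struct cg *children[MAX_KIDS];
    int nr_children;
    pid_t tasks[MAX_TASKS];     /* tasks attached to this cgroup */
    int nr_tasks;
};

/*
 * Once the kill root has been picked (the upward walk described above),
 * every task in its subtree is killed, not only the task the OOM killer
 * originally selected.
 */
static void kill_cgroup_tree(const struct cg *cg)
{
    int i;

    for (i = 0; i < cg->nr_tasks; i++)
        kill(cg->tasks[i], SIGKILL);
    for (i = 0; i < cg->nr_children; i++)
        kill_cgroup_tree(cg->children[i]);
}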

If this model doesn't work well for your case,
please describe it with an example. I'm not sure
I understand your problem anymore.

Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-20 20:28                             ` David Rientjes
  2018-07-20 20:47                               ` Roman Gushchin
@ 2018-07-23 14:12                               ` Michal Hocko
  1 sibling, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-23 14:12 UTC (permalink / raw)
  To: David Rientjes; +Cc: Tejun Heo, Roman Gushchin, linux-mm, akpm, hannes, gthelen

On Fri 20-07-18 13:28:56, David Rientjes wrote:
> On Fri, 20 Jul 2018, Tejun Heo wrote:
> 
> > > process chosen for oom kill.  I know that you care about the latter.  My 
> > > *only* suggestion was for the tunable to take a string instead of a 
> > > boolean so it is extensible for future use.  This seems like something so 
> > > trivial.
> > 
> > So, I'd much prefer it as boolean.  It's a fundamentally binary
> > property, either handle the cgroup as a unit when chosen as oom victim
> > or not, nothing more.
> 
> With the single hierarchy mandate of cgroup v2, the need arises to 
> separate processes from a single job into subcontainers for use with 
> controllers other than mem cgroup.  In that case, we have no functionality 
> to oom kill all processes in the subtree.
> 
> A boolean can kill all processes attached to the victim's mem cgroup, but 
> cannot kill all processes in a subtree if the limit of a common ancestor 
> is reached.  The common ancestor is needed to enforce a single memory 
> limit but allow for processes to be constrained separately with other 
> controllers. 

I think you misunderstood the proposed semantic. oom.group is a property
of any (including intermediate) memcg. Once set, all the processes in its
domain are killed in one go because they are considered an indivisible
workload. Note how this doesn't say anything about _how_ we select
a victim. That is not important and is in fact an implementation
detail. All we care about is that a selected victim is a part of an
indivisible workload and we have to tear down all of it. Future
extensions can talk more about how we select the victim, but the
fundamental property of a group being an indivisible workload or a group of
semi-related processes is a 0/1 thing IMHO.

Now there still are questions to iron out for that model. E.g. should
we allow a subtree of an oom.group == 1 cgroup to be group == 0? In other
words, something would be an indivisible workload for one OOM context while
it is not for a more restrictive OOM scope. If yes, then what is the
usecase?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-19 17:05                           ` Roman Gushchin
  2018-07-20  8:32                             ` David Rientjes
@ 2018-07-23 14:17                             ` Michal Hocko
  2018-07-23 15:09                               ` Tejun Heo
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-23 14:17 UTC (permalink / raw)
  To: Roman Gushchin, hannes, tj; +Cc: David Rientjes, linux-mm, akpm, gthelen

On Thu 19-07-18 10:05:47, Roman Gushchin wrote:
> On Thu, Jul 19, 2018 at 09:38:43AM +0200, Michal Hocko wrote:
> > On Wed 18-07-18 08:28:50, Roman Gushchin wrote:
> > > On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote:
> > > > On Tue 17-07-18 13:06:42, Roman Gushchin wrote:
> > > > > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote:
> > > > > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
> > > > > > [...]
> > > > > > > Let me show my proposal on examples. Let's say we have the following hierarchy,
> > > > > > > and the biggest process (or the process with highest oom_score_adj) is in D.
> > > > > > > 
> > > > > > >   /
> > > > > > >   |
> > > > > > >   A
> > > > > > >   |
> > > > > > >   B
> > > > > > >  / \
> > > > > > > C   D
> > > > > > > 
> > > > > > > Let's look at different examples and intended behavior:
> > > > > > > 1) system-wide OOM
> > > > > > >   - default settings: the biggest process is killed
> > > > > > >   - D/memory.group_oom=1: all processes in D are killed
> > > > > > >   - A/memory.group_oom=1: all processes in A are killed
> > > > > > > 2) memcg oom in B
> > > > > > >   - default settings: the biggest process is killed
> > > > > > >   - A/memory.group_oom=1: the biggest process is killed
> > > > > > 
> > > > > > Huh? Why would you even consider A here when the oom is below it?
> > > > > > /me confused
> > > > > 
> > > > > I do not.
> > > > > This is exactly a counter-example: A's memory.group_oom
> > > > > is not considered at all in this case,
> > > > > because A is above ooming cgroup.
> > > > 
> > > > OK, it confused me.
> > > > 
> > > > > > 
> > > > > > >   - B/memory.group_oom=1: all processes in B are killed
> > > > > > 
> > > > > >     - B/memory.group_oom=0 &&
> > > > > > >   - D/memory.group_oom=1: all processes in D are killed
> > > > > > 
> > > > > > What about?
> > > > > >     - B/memory.group_oom=1 && D/memory.group_oom=0
> > > > > 
> > > > > All tasks in B are killed.
> > > > 
> > > > so essentially find a task, traverse the memcg hierarchy from the
> > > > victim's memcg up to the oom root as long as memcg.group_oom = 1?
> > > > If the resulting memcg.group_oom == 1 then kill the whole sub tree.
> > > > Right?
> > > 
> > > Yes.
> > > 
> > > > 
> > > > > Group_oom set to 1 means that the workload can't tolerate
> > > > > killing of a random process, so in this case it's better
> > > > > to guarantee consistency for B.
> > > > 
> > > > OK, but then if D itself is OOM then we do not care about consistency
> > > > all of the sudden? I have hard time to think about a sensible usecase.
> > > 
> > > I mean if traversing the hierarchy up to the oom root we meet
> > > a memcg with group_oom set to 0, we shouldn't stop traversing.
> > 
> > Well, I am still fighting with the semantic of group, no-group, group
> > configuration. Why does it make any sense? In other words when can we
> > consider a cgroup to be a indivisible workload for one oom context while
> > it is fine to lose head or arm from another?
> 
> Hm, so the question is should we traverse up to the OOMing cgroup,
> or up to the first cgroup with memory.group_oom=0?
> 
> I looked at an example, and it *might* be the latter is better,
> especially if we'll make the default value inheritable.
> 
> Let's say we have a sub-tree with a workload and some control stuff.
> Workload is tolerable to OOM's (we can handle it in userspace, for
> example), but the control stuff is not.
> Then it probably makes no sense to kill the entire sub-tree,
> if a task in C has to be killed. But makes perfect sense if we
> have to kill a task in B.
> 
>   /
>   |
>   A, delegated sub-tree, group_oom=1
>  / \
> B   C, workload, group_oom=0
> ^
> some control stuff here, group_oom=1
> 
> Does this make sense?

I am not sure. If you are going to delegate then you are basically
losing control of the group_oom at A-level. Is this good? What if I
_want_ to tear down the whole thing if it starts misbehaving because I
do not trust it?

The more I think about it the more I am concluding that we should start
with a more constrained model and require that once the parent is
group_oom == 1 then children have to be as well. If we ever find a usecase
that requires a different scheme we can weaken it later. We cannot do that
the other way around.

Tejun, Johannes what do you think about that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-23 14:17                             ` Michal Hocko
@ 2018-07-23 15:09                               ` Tejun Heo
  2018-07-24  7:32                                 ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2018-07-23 15:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

Hello,

On Mon, Jul 23, 2018 at 04:17:48PM +0200, Michal Hocko wrote:
> I am not sure. If you are going to delegate then you are basically
> losing control of the group_oom at A-level. Is this good? What if I
> _want_ to tear down the whole thing if it starts misbehaving because I
> do not trust it?
> 
> The more I think about it the more I am concluding that we should start
> with a more constrained model and require that once parent is
> group_oom == 1 then children have to as well. If we ever find a usecase
> to require a different scheme we can weaken it later. We cannot do that
> other way around.
> 
> Tejun, Johannes what do you think about that?

I'd find the cgroup closest to the root which has the oom group set
and kill the entire subtree.  There's no reason to put any
restrictions on what each cgroup can configure.  The only thing which
matters is that the effective behavior is what the highest in the
ancestry configures, and, at the system level, it'd conceptually map
to panic_on_oom.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-20 20:47                               ` Roman Gushchin
@ 2018-07-23 23:06                                 ` David Rientjes
  0 siblings, 0 replies; 52+ messages in thread
From: David Rientjes @ 2018-07-23 23:06 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Tejun Heo, Michal Hocko, linux-mm, akpm, hannes, gthelen

On Fri, 20 Jul 2018, Roman Gushchin wrote:

> > > > process chosen for oom kill.  I know that you care about the latter.  My 
> > > > *only* suggestion was for the tunable to take a string instead of a 
> > > > boolean so it is extensible for future use.  This seems like something so 
> > > > trivial.
> > > 
> > > So, I'd much prefer it as boolean.  It's a fundamentally binary
> > > property, either handle the cgroup as a unit when chosen as oom victim
> > > or not, nothing more.
> > 
> > With the single hierarchy mandate of cgroup v2, the need arises to 
> > separate processes from a single job into subcontainers for use with 
> > controllers other than mem cgroup.  In that case, we have no functionality 
> > to oom kill all processes in the subtree.
> > 
> > A boolean can kill all processes attached to the victim's mem cgroup, but 
> > cannot kill all processes in a subtree if the limit of a common ancestor 
> > is reached.
> 
> Why so?
> 
> Once again my proposal:
> as soon as the OOM killer selected a victim task,
> we'll look at the victim task's memory cgroup.
> If memory.oom.group is not set, we're done.
> Otherwise let's traverse the memory cgroup tree up to
> the OOMing cgroup (or root) as long as memory.oom.group is set.
> Kill the last cgroup entirely (including all children).
> 

I know this is your proposal; I'm suggesting a context-based extension 
based on which mem cgroup is oom: the common ancestor or the leaf.

Consider /A, /A/b, and /A/c, and memory.oom_group is 1 for all of them.  
When /A, /A/b, or /A/c is oom, all processes attached to /A and its 
subtree are oom killed per your semantics.  That occurs when any of the 
three mem cgroups is oom.

I'm suggesting that it may become useful to kill an entire subtree when 
the common ancestor, /A, is oom, but not when /A/b or /A/c is oom.  There 
is no way to specify this with the proposal and trees where the limits of
/A/b + /A/c > /A exist.  We want all processes killed in /A/b or /A/c if 
they reach their individual limits.  We want all processes killed in /A's 
subtree if /A reaches its limit.
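
One possible reading of this extension, modeled with a hypothetical
string-valued knob; the "local"/"subtree" values and the helper below are
made up purely to illustrate the idea, not an existing or agreed-upon
interface:

/* Hypothetical string values for an extensible memory.oom.group knob. */
enum oom_group_mode {
    OOM_GROUP_OFF,      /* "0":       kill only the selected task */
    OOM_GROUP_LOCAL,    /* "local":   kill all tasks in the victim's cgroup */
    OOM_GROUP_SUBTREE,  /* "subtree": kill the whole subtree when this
                         *            cgroup itself is the OOM domain */
};

struct cg {
    struct cg *parent;
    enum oom_group_mode mode;
};

static struct cg *kill_root(struct cg *victim_cg, struct cg *oom_domain)
{
    /* /A set to "subtree": tear down everything below it on /A's own OOM */
    if (oom_domain->mode == OOM_GROUP_SUBTREE)
        return oom_domain;
    /* /A/b or /A/c set to "local": their own OOM kills only their tasks */
    if (victim_cg->mode == OOM_GROUP_LOCAL)
        return victim_cg;
    return NULL;        /* kill only the selected task */
}

With /A set to "subtree" and /A/b, /A/c set to "local", an OOM of /A tears
down the whole subtree, while an OOM of /A/b or /A/c kills only the tasks of
that leaf.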

I am not asking for that support to be implemented immediately if you do 
not have a need for it.  But I am asking that your interface to do so is 
extensible so that we may implement it.  Given the no internal process 
constraint of cgroup v2, defining this as two separate tunables would 
always have one be effective and the other be irrelevant, so I suggest it 
is overloaded.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-23 15:09                               ` Tejun Heo
@ 2018-07-24  7:32                                 ` Michal Hocko
  2018-07-24 13:08                                   ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-24  7:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

On Mon 23-07-18 08:09:29, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 23, 2018 at 04:17:48PM +0200, Michal Hocko wrote:
> > I am not sure. If you are going to delegate then you are basically
> > losing control of the group_oom at A-level. Is this good? What if I
> > _want_ to tear down the whole thing if it starts misbehaving because I
> > do not trust it?
> > 
> > The more I think about it the more I am concluding that we should start
> > with a more contrained model and require that once parent is
> > group_oom == 1 then children have to as well. If we ever find a usecase
> > to require a different scheme we can weaker it later. We cannot do that
> > other way around.
> > 
> > Tejun, Johannes what do you think about that?
> 
> I'd find the cgroup closest to the root which has the oom group set
> and kill the entire subtree.

Yes, this is what we have been discussing. In fact it would match the
behavior which is still sitting in the mmotm tree where we compare
groups.

> There's no reason to put any
> restrictions on what each cgroup can configure.  The only thing which
> matters is that the effective behavior is what the highest in the
> ancestry configures, and, at the system level, it'd conceptually map
> to panic_on_oom.

Hmm, so do we inherit group_oom? If not, how do we prevent
unexpected behavior?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-11 22:40 cgroup-aware OOM killer, how to move forward Roman Gushchin
  2018-07-12 12:07 ` Michal Hocko
  2018-07-13 21:34 ` David Rientjes
@ 2018-07-24 11:59 ` Tetsuo Handa
  2018-07-25  0:10   ` Roman Gushchin
  2 siblings, 1 reply; 52+ messages in thread
From: Tetsuo Handa @ 2018-07-24 11:59 UTC (permalink / raw)
  To: Roman Gushchin, linux-mm; +Cc: akpm, rientjes, mhocko, hannes, tj, gthelen

Roman, will you check this cleanup patch? This patch applies on top of next-20180724.
I assumed that your series does not kill processes whose termination the current
thread should not wait for.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24  7:32                                 ` Michal Hocko
@ 2018-07-24 13:08                                   ` Tejun Heo
  2018-07-24 13:26                                     ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2018-07-24 13:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

Hello,

On Tue, Jul 24, 2018 at 09:32:30AM +0200, Michal Hocko wrote:
> > I'd find the cgroup closest to the root which has the oom group set
> > and kill the entire subtree.
> 
> Yes, this is what we have been discussing. In fact it would match the
> behavior which is still sitting in the mmotm tree where we compare
> groups.

Yeah, I would too.  Everyone except David seems to agree that that's a
good enough approach for now.

> > There's no reason to put any
> > restrictions on what each cgroup can configure.  The only thing which
> > matters is that the effective behavior is what the highest in the
> > ancestry configures, and, at the system level, it'd conceptually map
> > to panic_on_oom.
> 
> Hmm, so do we inherit group_oom? If not, how do we prevent from
> unexpected behavior?

Hmm... I guess we're debating two options here.  Please consider the
following hierarchy.

      R
      |
      A (group oom == 1)
     / \
    B   C
    |
    D

1. No matter what B, C or D sets, as long as A sets group oom, any oom
   kill inside A's subtree kills the entire subtree.

2. A's group oom policy applies iff the source of the OOM is either at
   or above A - ie. iff the OOM is system-wide or caused by memory.max
   of A.

In #1, it doesn't matter what B, C or D sets, so it's kinda moot to
discuss whether they inherit A's setting or not.  A's is, if set,
always overriding.  In #2, what B, C or D sets matters if they also
set their own memory.max, so there's no reason for them to inherit
anything.

I'm actually okay with either option.  #2 is more flexible than #1 but
given that this is a cgroup-owned property which is likely to be set
on a per-application basis, #1 is likely good enough.

IIRC, we did #2 in the original implementation and the simplified one
is doing #1, right?
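
Seen from the victim's side, the two options differ only in how far up the
oom.group lookup is allowed to go; a toy model (the helper and the option
flag are illustrative only):

struct cg {
    struct cg *parent;
    int oom_group;
};

/*
 * Pick the cgroup to tear down for a victim living somewhere under A.
 * Option 1: the highest ancestor with oom.group set always wins,
 *           regardless of which cgroup's limit triggered the OOM.
 * Option 2: only ancestors up to (and including) the OOM domain are
 *           considered, so A's setting is ignored for an OOM in D.
 */
static struct cg *kill_root(struct cg *victim_cg, struct cg *oom_domain,
                            int option)
{
    struct cg *group = NULL;
    struct cg *c;

    for (c = victim_cg; c; c = c->parent) {
        if (c->oom_group)
            group = c;              /* remember the highest one seen */
        if (option == 2 && c == oom_domain)
            break;                  /* #2: never look above the domain */
    }
    return group;
}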

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 13:08                                   ` Tejun Heo
@ 2018-07-24 13:26                                     ` Michal Hocko
  2018-07-24 13:31                                       ` Tejun Heo
  2018-07-30  8:03                                       ` Michal Hocko
  0 siblings, 2 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-24 13:26 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 06:08:36, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 24, 2018 at 09:32:30AM +0200, Michal Hocko wrote:
[...]
> > > There's no reason to put any
> > > restrictions on what each cgroup can configure.  The only thing which
> > > matters is that the effective behavior is what the highest in the
> > > ancestry configures, and, at the system level, it'd conceptually map
> > > to panic_on_oom.
> > 
> > Hmm, so do we inherit group_oom? If not, how do we prevent from
> > unexpected behavior?
> 
> Hmm... I guess we're debating two options here.  Please consider the
> following hierarchy.
> 
>       R
>       |
>       A (group oom == 1)
>      / \
>     B   C
>     |
>     D
> 
> 1. No matter what B, C or D sets, as long as A sets group oom, any oom
>    kill inside A's subtree kills the entire subtree.
> 
> 2. A's group oom policy applies iff the source of the OOM is either at
>    or above A - ie. iff the OOM is system-wide or caused by memory.max
>    of A.
> 
> In #1, it doesn't matter what B, C or D sets, so it's kinda moot to
> discuss whether they inherit A's setting or not.  A's is, if set,
> always overriding.  In #2, what B, C or D sets matters if they also
> set their own memory.max, so there's no reason for them to inherit
> anything.
> 
> I'm actually okay with either option.  #2 is more flexible than #1 but
> given that this is a cgroup owned property which is likely to be set
> on per-application basis, #1 is likely good enough.
> 
> IIRC, we did #2 in the original implementation and the simplified one
> is doing #1, right?

No, we've been discussing #2 unless I have misunderstood something.
I find it rather non-intuitive that a property outside of the oom domain
controls the behavior inside the domain. I will keep thinking about that
though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 13:26                                     ` Michal Hocko
@ 2018-07-24 13:31                                       ` Tejun Heo
  2018-07-24 13:50                                         ` Michal Hocko
  2018-07-30  8:03                                       ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2018-07-24 13:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

Hello,

On Tue, Jul 24, 2018 at 03:26:40PM +0200, Michal Hocko wrote:
> > 1. No matter what B, C or D sets, as long as A sets group oom, any oom
> >    kill inside A's subtree kills the entire subtree.
> > 
> > 2. A's group oom policy applies iff the source of the OOM is either at
> >    or above A - ie. iff the OOM is system-wide or caused by memory.max
> >    of A.
> > 
> > In #1, it doesn't matter what B, C or D sets, so it's kinda moot to
> > discuss whether they inherit A's setting or not.  A's is, if set,
> > always overriding.  In #2, what B, C or D sets matters if they also
> > set their own memory.max, so there's no reason for them to inherit
> > anything.
> > 
> > I'm actually okay with either option.  #2 is more flexible than #1 but
> > given that this is a cgroup owned property which is likely to be set
> > on per-application basis, #1 is likely good enough.
> > 
> > IIRC, we did #2 in the original implementation and the simplified one
> > is doing #1, right?
> 
> No, we've been discussing #2 unless I have misunderstood something.
> I find it rather non-intuitive that a property outside of the oom domain
> controls the behavior inside the domain. I will keep thinking about that
> though.

So, one good way of thinking about this, I think, could be considering
it as a scoped panic_on_oom.  However panic_on_oom interacts with
memcg ooms, scoping that down to the cgroup level should likely be how we
define group oom.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 13:31                                       ` Tejun Heo
@ 2018-07-24 13:50                                         ` Michal Hocko
  2018-07-24 13:55                                           ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-24 13:50 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 06:31:10, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 24, 2018 at 03:26:40PM +0200, Michal Hocko wrote:
> > > 1. No matter what B, C or D sets, as long as A sets group oom, any oom
> > >    kill inside A's subtree kills the entire subtree.
> > > 
> > > 2. A's group oom policy applies iff the source of the OOM is either at
> > >    or above A - ie. iff the OOM is system-wide or caused by memory.max
> > >    of A.
> > > 
> > > In #1, it doesn't matter what B, C or D sets, so it's kinda moot to
> > > discuss whether they inherit A's setting or not.  A's is, if set,
> > > always overriding.  In #2, what B, C or D sets matters if they also
> > > set their own memory.max, so there's no reason for them to inherit
> > > anything.
> > > 
> > > I'm actually okay with either option.  #2 is more flexible than #1 but
> > > given that this is a cgroup owned property which is likely to be set
> > > on per-application basis, #1 is likely good enough.
> > > 
> > > IIRC, we did #2 in the original implementation and the simplified one
> > > is doing #1, right?
> > 
> > No, we've been discussing #2 unless I have misunderstood something.
> > I find it rather non-intuitive that a property outside of the oom domain
> > controls the behavior inside the domain. I will keep thinking about that
> > though.
> 
> So, one good way of thinking about this, I think, could be considering
> it as a scoped panic_on_oom.  However panic_on_oom interacts with
> memcg ooms, scoping that to cgroup level should likely be how we
> define group oom.

So what are we going to do if we have a reasonable usecase where somebody
really wants to have group kill behavior depending on the oom domain?
I have a hard time imagining such a usecase, but my experience tells me
that users will find a way I have never thought about.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 13:50                                         ` Michal Hocko
@ 2018-07-24 13:55                                           ` Tejun Heo
  2018-07-24 14:25                                             ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2018-07-24 13:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

Hello,

On Tue, Jul 24, 2018 at 03:50:22PM +0200, Michal Hocko wrote:
> > So, one good way of thinking about this, I think, could be considering
> > it as a scoped panic_on_oom.  However panic_on_oom interacts with
> > memcg ooms, scoping that to cgroup level should likely be how we
> > define group oom.
> 
> So what are we going to do if we have a reasonable usecase when somebody
> really wants to have group kill behavior depending on the oom domain?
> I have hard time to imagine such a usecase but my experience tells me
> that users will find a way I have never thought about.

So, I don't know when that happened, but panic_on_oom actually has 0, 1,
2 settings - 0 no group oom, 1 system kill on oom of any origin, 2
system kill only if it was a system oom.  Maybe we should just follow
that but start with 1?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 13:55                                           ` Tejun Heo
@ 2018-07-24 14:25                                             ` Michal Hocko
  2018-07-24 14:28                                               ` Tejun Heo
  0 siblings, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-24 14:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 06:55:28, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 24, 2018 at 03:50:22PM +0200, Michal Hocko wrote:
> > > So, one good way of thinking about this, I think, could be considering
> > > it as a scoped panic_on_oom.  However panic_on_oom interacts with
> > > memcg ooms, scoping that to cgroup level should likely be how we
> > > define group oom.
> > 
> > So what are we going to do if we have a reasonable usecase when somebody
> > really wants to have group kill behavior depending on the oom domain?
> > I have hard time to imagine such a usecase but my experience tells me
> > that users will find a way I have never thought about.
> 
> So, I don't know when that happened, but panic_on_oom actually has 0, 1,
> 2 settings - 0 no group oom, 1 system kill on oom of any origin, 2
> system kill only if it was a system oom.  Maybe we should just follow
> that but just start with 1?

I am sorry but I do not follow. Besides that modeling the behavior on
panic_on_oom doesn't really sound very appealing to me. The knob is a
crude hack mostly motivated by debugging (at least its non-global
variants).

So can we get back to workloads and shape the semantic on top of that
please?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 14:25                                             ` Michal Hocko
@ 2018-07-24 14:28                                               ` Tejun Heo
  2018-07-24 14:35                                                 ` Tejun Heo
  2018-07-24 14:43                                                 ` Michal Hocko
  0 siblings, 2 replies; 52+ messages in thread
From: Tejun Heo @ 2018-07-24 14:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

Hello,

On Tue, Jul 24, 2018 at 04:25:54PM +0200, Michal Hocko wrote:
> I am sorry but I do not follow. Besides that modeling the behavior on
> panic_on_oom doesn't really sound very appealing to me. The knob is a
> crude hack mostly motivated by debugging (at least its non-global
> variants).

Hmm... we actually do use that quite a bit in production (moving away
from it gradually).

> So can we get back to workloads and shape the semantic on top of that
> please?

I didn't realize we were that off track.  Don't both map to what we
were discussing almost perfectly?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 14:28                                               ` Tejun Heo
@ 2018-07-24 14:35                                                 ` Tejun Heo
  2018-07-24 14:43                                                 ` Michal Hocko
  1 sibling, 0 replies; 52+ messages in thread
From: Tejun Heo @ 2018-07-24 14:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

Hello,

Lemme elaborate just a bit more.

On Tue, Jul 24, 2018 at 07:28:20AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 24, 2018 at 04:25:54PM +0200, Michal Hocko wrote:
> > I am sorry but I do not follow. Besides that modeling the behavior on
> > panic_on_oom doesn't really sound very appealing to me. The knob is a
> > crude hack mostly motivated by debugging (at least its non-global
> > variants).
> 
> Hmm... we actually do use that quite a bit in production (moving away
> from it gradually).

So, the reason panic_on_oom is used is very similar to the reason one
would want group oom kill - workload integrity after an oom kill.
panic_on_oom is an expensive way of avoiding partial kills and the
resulting possibly inconsistent state.  Group oom can scope that down
so that we can maintain integrity per-application or per-domain rather
than at the system level, making it way cheaper.

> > So can we get back to workloads and shape the semantic on top of that
> > please?
> 
> I didn't realize we were that off track.  Don't both map to what we
> were discussing almost perfectly?

I guess the reason why panic_on_oom developed the two behaviors is
likely that the initial behavior - panicking on any oom - was too
inflexible.  We're scoping it down, so whatever problems we used to
have with panic_on_oom are less pronounced with group oom.  So, I don't
think this matters all that much in terms of practical usefulness.
Both always killing and factoring in the oom origin seem fine to me.
Let's just pick one.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 14:28                                               ` Tejun Heo
  2018-07-24 14:35                                                 ` Tejun Heo
@ 2018-07-24 14:43                                                 ` Michal Hocko
  2018-07-24 14:49                                                   ` Tejun Heo
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-24 14:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 07:28:20, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 24, 2018 at 04:25:54PM +0200, Michal Hocko wrote:
[...]
> > So can we get back to workloads and shape the semantic on top of that
> > please?
> 
> I didn't realize we were that off track.  Don't both map to what we
> were discussing almost perfectly?

If yes, then I do not see it ;) Mostly because panic_on_oom doesn't have
any scope. It is an all-or-nothing thing. You can only control whether
memcg OOMs should be considered or not, because considering them by
default would be inherently dangerous.

oom_group has a scope and that scope is exactly what we are trying to
find a proper semantic for. And especially what to do if descendants in
the hierarchy disagree with parent(s). While I do not see a sensible
configuration where the scope of the OOM should define whether the workload
is indivisible, I would like to avoid a "carved in stone" semantic that
couldn't be changed later.

So IMHO the best option would be to simply have children inherit
group_oom. This would allow users to do their weird stuff but the default
configuration would be consistent.

A more restrictive variant would be to disallow changing children to
mismatch a parent with oom_group == 1. This would have the slight advantage
that those users would get back to us with their usecases and we could
either loosen the restriction or explain that what they are doing is
questionable and help with a more appropriate configuration.
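
The restrictive variant would boil down to a check at write time, roughly as
in this sketch (helper name and error code are illustrative; this was only a
proposal):

#include <errno.h>

struct cg {
    struct cg *parent;
    int oom_group;
};

/*
 * Proposed restriction: once the parent declares itself an indivisible
 * unit (oom_group == 1), a child may not opt back out.
 */
static int oom_group_write(struct cg *cg, int val)
{
    if (!val && cg->parent && cg->parent->oom_group)
        return -EINVAL;

    cg->oom_group = val;
    return 0;
}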
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 14:43                                                 ` Michal Hocko
@ 2018-07-24 14:49                                                   ` Tejun Heo
  2018-07-24 15:52                                                     ` Roman Gushchin
  2018-07-25 11:58                                                     ` Michal Hocko
  0 siblings, 2 replies; 52+ messages in thread
From: Tejun Heo @ 2018-07-24 14:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

Hello, Michal.

On Tue, Jul 24, 2018 at 04:43:51PM +0200, Michal Hocko wrote:
> If yes, then I do not see it ;) Mostly because panic_on_oom doesn't have
> any scope. It is all or nothing thing. You can only control whether
> memcg OOMs should be considered or not because this is inherently
> dangerous to be the case by default.

Oh yeah, so, panic_on_oom is like group oom on the root cgroup, right?
If 1, it treats the whole system as a single unit and kills it no
matter the oom domain.  If 2, it only does so if the oom is not caused
by restrictions in subdomains.

> oom_group has a scope and that scope is exactly what we are trying to
> find a proper semantic for. And especially what to do if descendants in
> the hierarchy disagree with parent(s). While I do not see a sensible
> configuration where the scope of the OOM should define the workload is
> indivisible I would like to prevent from "carved in stone" semantic that
> couldn't be changed later.

And we can scope it down the same way down the cgroup hierarchy.

> So IMHO the best option would be to simply inherit the group_oom to
> children. This would allow users to do their weird stuff but the default
> configuration would be consistent.

Persistent config inheritance is a big no-no.  It really sucks because
it makes the inherited state sticky and there's no way of telling why
the current setting is the way it is without knowing the past
configurations of the hierarchy.  We actually had a pretty bad
incident due to memcg swappiness inheritance recently (top level
cgroups would inherit sysctl swappiness during boot before sysctl
initialized them, and then post-boot it isn't clear why the settings
are the way they are).

Nothing in cgroup2 does persistent inheritance.  If something explicit
is necessary, we do .effective so that the effective config is clearly
visible.
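
By contrast, an .effective-style file would derive the value from the
ancestry at read time instead of copying it into children; a toy
illustration only (the OR-of-ancestors rule here is an example, not a
proposed interface):

struct cg {
    struct cg *parent;
    int oom_group;              /* what was actually written to this cgroup */
};

/*
 * Nothing is copied into children when an ancestor flips its setting;
 * the value in effect is recomputed from the hierarchy whenever it is
 * read, so the origin of the behavior stays visible.
 */
static int oom_group_effective(const struct cg *cg)
{
    for (; cg; cg = cg->parent)
        if (cg->oom_group)
            return 1;
    return 0;
}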

> A more restrictive variant would be to disallow changing children to
> mismatch the parent oom_group == 1. This would have a slight advantage
> that those users would get back to us with their usecases and we can
> either loose the restriction or explain that what they are doing is
> questionable and help with a more appropriate configuration.

That's a nack too because, across delegation, from a child's pov it
isn't clear why it can't change configs, and it's also easy to
introduce a situation where a child can lock its ancestors out of
changing their configs.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 14:49                                                   ` Tejun Heo
@ 2018-07-24 15:52                                                     ` Roman Gushchin
  2018-07-25 12:00                                                       ` Michal Hocko
  2018-07-25 11:58                                                     ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-24 15:52 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Michal Hocko, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue, Jul 24, 2018 at 07:49:40AM -0700, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Jul 24, 2018 at 04:43:51PM +0200, Michal Hocko wrote:
> > If yes, then I do not see it ;) Mostly because panic_on_oom doesn't have
> > any scope. It is all or nothing thing. You can only control whether
> > memcg OOMs should be considered or not because this is inherently
> > dangerous to be the case by default.
> 
> Oh yeah, so, panic_on_oom is like group oom on the root cgroup, right?
> If 1, it treats the whole system as a single unit and kills it no
> matter the oom domain.  If 2, it only does so if the oom is not caused
> by restrictions in subdomains.
> 
> > oom_group has a scope and that scope is exactly what we are trying to
> > find a proper semantic for. And especially what to do if descendants in
> > the hierarchy disagree with parent(s). While I do not see a sensible
> > configuration where the scope of the OOM should define the workload is
> > indivisible I would like to prevent from "carved in stone" semantic that
> > couldn't be changed later.
> 
> And we can scope it down the same way down the cgroup hierarchy.
> 
> > So IMHO the best option would be to simply inherit the group_oom to
> > children. This would allow users to do their weird stuff but the default
> > configuration would be consistent.

I think that the problem occurs because of the default value (0).

Let's imagine we could make the default 1.
It would mean that by default we kill the whole sub-tree up to the top-level
cgroup, and that does guarantee consistency.
If on some level userspace _knows_ how to handle OOM, it opts out
by setting oom.group to 0.

E.g. systemd _knows_ that services running in system.slice are
independent and knows how to detect that they are dead and restart them.
So, it sets system.slice/memory.oom.group to 0.
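
The opt-out itself would then be a single write to the knob, e.g. (assuming
the knob ends up as memory.oom.group under the usual cgroup2 mount point):

#include <stdio.h>

int main(void)
{
    /* system.slice supervises its services itself, so opt out of group OOM */
    FILE *f = fopen("/sys/fs/cgroup/system.slice/memory.oom.group", "w");

    if (!f)
        return 1;
    fputs("0\n", f);
    return fclose(f) ? 1 : 0;
}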

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 11:59 ` Tetsuo Handa
@ 2018-07-25  0:10   ` Roman Gushchin
  2018-07-25 12:23     ` Tetsuo Handa
  0 siblings, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-25  0:10 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, mhocko, hannes, tj, gthelen

On Tue, Jul 24, 2018 at 08:59:58PM +0900, Tetsuo Handa wrote:
> Roman, will you check this cleanup patch? This patch applies on top of next-20180724.
> I assumed that your series do not kill processes which current thread should not
> wait for termination.

Hi Tetsuo!

> 
> From 86ba99fbf73a9eda0df5ee4ae70c075781e83f81 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue, 24 Jul 2018 14:00:45 +0900
> Subject: [PATCH] mm,oom: Check pending victims earlier in out_of_memory().
> 
> The "mm, oom: cgroup-aware OOM killer" patchset introduced INFLIGHT_VICTIM
> in order to replace open-coded ((void *)-1UL). But (regarding CONFIG_MMU=y
> case) we have a list of inflight OOM victim threads which are connected to
> oom_reaper_list. Thus we can check whether there are inflight OOM victims
> before starting process/memcg list traversal. Since it is likely that only
> few threads are linked to oom_reaper_list, checking all victims' OOM domain
> will not matter.

I have a couple of top-level concerns:
1) You're doubling the size of the oom_reaper footprint in task_struct.
   I doubt that the code cleanup is really worth it. If it's absolutely necessary
   to resolve the lockup which you mentioned, it should be explained
   explicitly.

2) There are several cosmetic changes in this patch, which make reviewing
   harder. Can you please split it into several parts?


Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 14:49                                                   ` Tejun Heo
  2018-07-24 15:52                                                     ` Roman Gushchin
@ 2018-07-25 11:58                                                     ` Michal Hocko
  1 sibling, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-25 11:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 07:49:40, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Jul 24, 2018 at 04:43:51PM +0200, Michal Hocko wrote:
[...]
> > So IMHO the best option would be to simply inherit the group_oom to
> > children. This would allow users to do their weird stuff but the default
> > configuration would be consistent.
> 
> Persistent config inheritance is a big no no.  It really sucks because
> it makes the inherited state sticky and there's no way of telling why
> the current setting is the way it is without knowing the past
> configurations of the hierarchy.  We actually had a pretty bad
> incident due to memcg swappiness inheritance recently (top level
> cgroups would inherit sysctl swappiness during boot before sysctl
> initialized them and then post-boot it isn't clear why the settings
> are the way they're).
> 
> Nothing in cgroup2 does persistent inheritance.  If something explicit
> is necessary, we do .effective so that the effective config is clearly
> visible.
> 
> > A more restrictive variant would be to disallow changing children to
> > mismatch the parent oom_group == 1. This would have a slight advantage
> > that those users would get back to us with their usecases and we can
> > either loose the restriction or explain that what they are doing is
> > questionable and help with a more appropriate configuration.
> 
> That's a nack too because across delegation, from a child pov, it
> isn't clear why it can't change configs and it's also easy to
> introduce a situation where a child can lock its ancestors out of
> changing their configs.

OK, fair points. I will keep thinking about this some more. I still
cannot shake a bad feeling about the semantics and how poor users are
going to scratch their heads over what the heck is going on here. I will
follow up in the other email where we are discussing both options once
I am able to sort this out myself.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 15:52                                                     ` Roman Gushchin
@ 2018-07-25 12:00                                                       ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-25 12:00 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Tejun Heo, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 08:52:51, Roman Gushchin wrote:
> On Tue, Jul 24, 2018 at 07:49:40AM -0700, Tejun Heo wrote:
> > Hello, Michal.
> > 
> > On Tue, Jul 24, 2018 at 04:43:51PM +0200, Michal Hocko wrote:
> > > If yes, then I do not see it ;) Mostly because panic_on_oom doesn't have
> > > any scope. It is all or nothing thing. You can only control whether
> > > memcg OOMs should be considered or not because this is inherently
> > > dangerous to be the case by default.
> > 
> > Oh yeah, so, panic_on_oom is like group oom on the root cgroup, right?
> > If 1, it treats the whole system as a single unit and kills it no
> > matter the oom domain.  If 2, it only does so if the oom is not caused
> > by restrictions in subdomains.
> > 
> > > oom_group has a scope and that scope is exactly what we are trying to
> > > find a proper semantic for. And especially what to do if descendants in
> > > the hierarchy disagree with parent(s). While I do not see a sensible
> > > configuration where the scope of the OOM should define the workload is
> > > indivisible I would like to prevent from "carved in stone" semantic that
> > > couldn't be changed later.
> > 
> > And we can scope it down the same way down the cgroup hierarchy.
> > 
> > > So IMHO the best option would be to simply inherit the group_oom to
> > > children. This would allow users to do their weird stuff but the default
> > > configuration would be consistent.
> 
> I think, that the problem occurs because of the default value (0).
> 
> Let's imagine we can make default to 1.
> It means, that by default we kill the whole sub-tree up to the top-level
> cgroup, and it does guarantee consistency.
> If on some level userspace _knows_ how to handle OOM, it opts-out
> by setting oom.group to 0.

Apart from the fact that a default group_oom is absolutely unacceptable, as
explained earlier, I still fail to see how this makes the situation any
different. So say you know that you are not group oom; what will happen now?
As soon as we check parents we are screwed the same way. Not to mention that
a global oom would mean basically killing the world...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-25  0:10   ` Roman Gushchin
@ 2018-07-25 12:23     ` Tetsuo Handa
  2018-07-25 13:01       ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Tetsuo Handa @ 2018-07-25 12:23 UTC (permalink / raw)
  To: Roman Gushchin, akpm, mhocko; +Cc: linux-mm, rientjes, hannes, tj, gthelen

Michal, I think that I hit a WQ OOM lockup caused by the lack of a guaranteed schedule_timeout_*() for WQ.
I think we should immediately send https://marc.info/?l=linux-mm&m=153062798103081
(or both http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org and
https://marc.info/?l=linux-mm&m=152723708623015 ) to linux.git so that we can send
them to stable kernels without waiting for the "mm, oom: cgroup-aware OOM killer" patchset.



Log file is at http://I-love.SAKURA.ne.jp/tmp/serial-20180725.txt.xz .
(Output up to the last two OOM killer invocations is omitted from the excerpt below.)

$ grep lockup serial-20180725.txt
[ 3512.401044] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 72s!
[ 3545.156052] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 105s!
[ 3575.364044] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 135s!
[ 3693.429456] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 253s!
[ 3785.795525] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 346s!
[ 3906.627292] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 467s!

$ grep out_of_memory serial-20180725.txt
[ 3439.486723]  out_of_memory+0x3fc/0x780
[ 3439.674515]  out_of_memory+0x3fc/0x780

$ grep sysrq: serial-20180725.txt
[ 3473.136366] sysrq: SysRq : Show State
[ 3575.779715] sysrq: SysRq : Show backtrace of all active CPUs
[ 3621.473841] sysrq: SysRq : Show Memory
[ 3654.129996] sysrq: SysRq : Show State
[ 3733.447398] sysrq: SysRq : Manual OOM execution
[ 3808.956565] sysrq: SysRq : Show backtrace of all active CPUs
[ 3857.741374] sysrq: SysRq : Manual OOM execution
[ 3871.848122] sysrq: SysRq : Show backtrace of all active CPUs
[ 3930.529273] sysrq: SysRq : Show backtrace of all active CPUs
[ 3946.694858] sysrq: SysRq : Terminate All Tasks

$ grep -A 100 -F 3575.779715 serial-20180725.txt
[ 3575.779715] sysrq: SysRq : Show backtrace of all active CPUs
[ 3575.782255] NMI backtrace for cpu 0
[ 3575.784104] CPU: 0 PID: 82423 Comm: kworker/0:3 Not tainted 4.18.0-rc6-next-20180724+ #246
[ 3575.787142] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[ 3575.790847] Workqueue: events_freezable_power_ disk_events_workfn
[ 3575.793317] Call Trace:
[ 3575.794818]  <IRQ>
[ 3575.796180]  dump_stack+0x85/0xcb
[ 3575.797858]  nmi_cpu_backtrace+0xa1/0xb0
[ 3575.799710]  ? lapic_can_unplug_cpu+0x90/0x90
[ 3575.801712]  nmi_trigger_cpumask_backtrace+0xf0/0x130
[ 3575.803894]  __handle_sysrq+0xbe/0x200
[ 3575.805733]  sysrq_filter+0x82/0x400
[ 3575.807513]  input_to_handler+0x43/0xe0
[ 3575.809339]  input_pass_values.part.9+0x1ab/0x260
[ 3575.811404]  input_handle_event+0xd6/0x570
[ 3575.813330]  input_event+0x45/0x70
[ 3575.815020]  atkbd_interrupt+0x4ff/0x780
[ 3575.816810]  serio_interrupt+0x3b/0x80
[ 3575.818517]  i8042_interrupt+0x25d/0x410
[ 3575.820305]  __handle_irq_event_percpu+0x2c/0xe0
[ 3575.822256]  handle_irq_event_percpu+0x2b/0x70
[ 3575.824148]  handle_irq_event+0x2f/0x50
[ 3575.825882]  handle_edge_irq+0x6c/0x180
[ 3575.827598]  handle_irq+0xa0/0x110
[ 3575.829158]  do_IRQ+0x4e/0x100
[ 3575.830611]  common_interrupt+0xf/0xf
[ 3575.832244]  </IRQ>
[ 3575.833439] RIP: 0010:lock_release+0x36/0x1d0
[ 3575.835346] Code: 25 c0 5d 01 00 48 83 ec 18 44 8b 8b d4 0b 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 45 85 c9 0f 85 e1 00 00 00 9c <58> 66 66 90 66 90 48 89 c5 fa 66 66 90 66 66 90 44 8b 05 43 54 0f
[ 3575.841978] RSP: 0000:ffffabfb43b3b898 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffcc
[ 3575.844683] RAX: 0000000000000000 RBX: ffff8bfb94e8c240 RCX: 0000000000000000
[ 3575.847296] RDX: ffffffff8919647a RSI: 0000000000000001 RDI: ffffffff8a0597c0
[ 3575.849896] RBP: ffffabfb43b3b908 R08: ffffffff891963d0 R09: 0000000000000000
[ 3575.852506] R10: ffffabfb43b3b8a8 R11: 0000000000000001 R12: ffffffffffffffff
[ 3575.855104] R13: ffff8bfbb6373028 R14: 0000000000000000 R15: ffff8bfbb61a3b20
[ 3575.857713]  ? __list_lru_init+0x280/0x280
[ 3575.859567]  ? list_lru_count_one+0xaa/0x1e0
[ 3575.861453]  ? list_lru_count_one+0xc2/0x1e0
[ 3575.863290]  ? super_cache_count+0x74/0xf0
[ 3575.865135]  ? do_shrink_slab+0x35/0x190
[ 3575.866878]  ? shrink_slab+0x1d3/0x2c0
[ 3575.868565]  ? shrink_slab+0x240/0x2c0
[ 3575.870238]  ? shrink_node+0xe3/0x460
[ 3575.871863]  ? do_try_to_free_pages+0xcb/0x380
[ 3575.873700]  ? try_to_free_pages+0xbb/0xf0
[ 3575.875450]  ? __alloc_pages_slowpath+0x3c1/0xc50
[ 3575.877311]  ? __alloc_pages_nodemask+0x2a6/0x2c0
[ 3575.879187]  ? bio_copy_kern+0xcd/0x200
[ 3575.880976]  ? blk_rq_map_kern+0xb6/0x130
[ 3575.882760]  ? scsi_execute+0x64/0x250
[ 3575.884459]  ? sr_check_events+0x9a/0x2b0 [sr_mod]
[ 3575.886424]  ? __mutex_unlock_slowpath+0x46/0x2b0
[ 3575.888257]  ? cdrom_check_events+0xf/0x30 [cdrom]
[ 3575.890119]  ? sr_block_check_events+0x7c/0xb0 [sr_mod]
[ 3575.892152]  ? disk_check_events+0x5e/0x150
[ 3575.893890]  ? process_one_work+0x290/0x4a0
[ 3575.895660]  ? process_one_work+0x227/0x4a0
[ 3575.897411]  ? worker_thread+0x28/0x3d0
[ 3575.899006]  ? process_one_work+0x4a0/0x4a0
[ 3575.900727]  ? kthread+0x107/0x120
[ 3575.902191]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 3575.904132]  ? ret_from_fork+0x24/0x30
[ 3621.473841] sysrq: SysRq : Show Memory
[ 3621.475640] Mem-Info:
[ 3621.476934] active_anon:831608 inactive_anon:5544 isolated_anon:0
[ 3621.476934]  active_file:0 inactive_file:0 isolated_file:0
[ 3621.476934]  unevictable:0 dirty:0 writeback:0 unstable:0
[ 3621.476934]  slab_reclaimable:3620 slab_unreclaimable:31654
[ 3621.476934]  mapped:1463 shmem:10471 pagetables:8448 bounce:0
[ 3621.476934]  free:21327 free_pcp:66 free_cma:0
[ 3621.488816] Node 0 active_anon:3326432kB inactive_anon:22176kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:5852kB dirty:0kB writeback:0kB shmem:41884kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2816000kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 3621.497230] Node 0 DMA free:14712kB min:288kB low:360kB high:432kB active_anon:1128kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3621.505476] lowmem_reserve[]: 0 2679 3607 3607
[ 3621.507341] Node 0 DMA32 free:53596kB min:49968kB low:62460kB high:74952kB active_anon:2686544kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2743316kB mlocked:0kB kernel_stack:48kB pagetables:120kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3621.516007] lowmem_reserve[]: 0 0 928 928
[ 3621.517799] Node 0 Normal free:17000kB min:17320kB low:21648kB high:25976kB active_anon:638644kB inactive_anon:22168kB active_file:56kB inactive_file:200kB unevictable:0kB writepending:0kB present:1048576kB managed:951040kB mlocked:0kB kernel_stack:17888kB pagetables:33668kB bounce:0kB free_pcp:264kB local_pcp:264kB free_cma:0kB
[ 3621.527702] lowmem_reserve[]: 0 0 0 0
[ 3621.529458] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (M) 1*32kB (U) 1*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (M) = 14712kB
[ 3621.533965] Node 0 DMA32: 13*4kB (UM) 17*8kB (UM) 20*16kB (UM) 67*32kB (UM) 50*64kB (U) 27*128kB (UM) 27*256kB (UM) 7*512kB (U) 33*1024kB (UM) 0*2048kB 0*4096kB = 53596kB
[ 3621.539631] Node 0 Normal: 29*4kB (UME) 3*8kB (UE) 13*16kB (UME) 74*32kB (UME) 67*64kB (UME) 38*128kB (UE) 17*256kB (UE) 2*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 17244kB
[ 3621.545487] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3621.548666] 10482 total pagecache pages
[ 3621.550795] 0 pages in swap cache
[ 3621.552891] Swap cache stats: add 0, delete 0, find 0/0
[ 3621.555539] Free swap  = 0kB
[ 3621.557299] Total swap = 0kB
[ 3621.559054] 1048422 pages RAM
[ 3621.560836] 0 pages HighMem/MovableOnly
[ 3621.562838] 120864 pages reserved
[ 3621.564724] 0 pages cma reserved
[ 3621.566586] 0 pages hwpoisoned
[ 3654.129996] sysrq: SysRq : Show State
[ 3654.132038]   task                        PC stack   pid father
[ 3654.134654] systemd         D12240     1      0 0x00000000
[ 3654.137087] Call Trace:
[ 3654.138710]  ? __schedule+0x245/0x7f0
[ 3654.140687]  ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[ 3654.143007]  schedule+0x23/0x80
[ 3654.144835]  schedule_preempt_disabled+0x5/0x10
[ 3654.146999]  __mutex_lock+0x3f5/0x9b0

$ grep -A 61 -F 3930.529273 serial-20180725.txt
[ 3930.529273] sysrq: SysRq : Show backtrace of all active CPUs
[ 3930.532004] NMI backtrace for cpu 0
[ 3930.533974] CPU: 0 PID: 82423 Comm: kworker/0:3 Not tainted 4.18.0-rc6-next-20180724+ #246
[ 3930.537201] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
[ 3930.541173] Workqueue: events_freezable_power_ disk_events_workfn
[ 3930.543773] Call Trace:
[ 3930.545341]  <IRQ>
[ 3930.546766]  dump_stack+0x85/0xcb
[ 3930.548558]  nmi_cpu_backtrace+0xa1/0xb0
[ 3930.550566]  ? lapic_can_unplug_cpu+0x90/0x90
[ 3930.552640]  nmi_trigger_cpumask_backtrace+0xf0/0x130
[ 3930.554930]  __handle_sysrq+0xbe/0x200
[ 3930.556834]  sysrq_filter+0x82/0x400
[ 3930.558669]  input_to_handler+0x43/0xe0
[ 3930.560558]  input_pass_values.part.9+0x1ab/0x260
[ 3930.562688]  input_handle_event+0xd6/0x570
[ 3930.564602]  input_event+0x45/0x70
[ 3930.566309]  atkbd_interrupt+0x4ff/0x780
[ 3930.568187]  serio_interrupt+0x3b/0x80
[ 3930.569981]  i8042_interrupt+0x25d/0x410
[ 3930.571819]  __handle_irq_event_percpu+0x2c/0xe0
[ 3930.573860]  handle_irq_event_percpu+0x2b/0x70
[ 3930.575796]  handle_irq_event+0x2f/0x50
[ 3930.577561]  handle_edge_irq+0x6c/0x180
[ 3930.579313]  handle_irq+0xa0/0x110
[ 3930.580994]  do_IRQ+0x4e/0x100
[ 3930.582607]  common_interrupt+0xf/0xf
[ 3930.584360]  </IRQ>
[ 3930.585642] RIP: 0010:lock_acquire+0x1/0x70
[ 3930.587495] Code: c6 08 3c e0 89 48 c7 c7 c4 d6 df 89 e8 78 8e fa ff 0f 0b e9 3a ff ff ff e8 cc 90 fa ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 <53> 65 48 8b 2c 25 c0 5d 01 00 8b 85 d4 0b 00 00 85 c0 75 54 4d 89
[ 3930.594222] RSP: 0000:ffffabfb43b3b8d0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffcc
[ 3930.597083] RAX: 0000000000000000 RBX: ffff8bfbb14b9ed8 RCX: 0000000000000002
[ 3930.599825] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8a0597c0
[ 3930.602605] RBP: ffffabfb43b3b908 R08: 0000000000000000 R09: 0000000000000000
[ 3930.605348] R10: ffffabfb43b3b8a8 R11: 0000000000000001 R12: ffffabfb43b3b9b0
[ 3930.608076] R13: ffff8bfbb8b90008 R14: 0000000000000000 R15: ffff8bfbb17e1880
[ 3930.610804]  list_lru_count_one+0x3b/0x1e0
[ 3930.612662]  ? __list_lru_init+0x280/0x280
[ 3930.614559]  super_cache_count+0x74/0xf0
[ 3930.616362]  do_shrink_slab+0x35/0x190
[ 3930.618100]  ? shrink_slab+0x1d3/0x2c0
[ 3930.619844]  shrink_slab+0x240/0x2c0
[ 3930.621531]  shrink_node+0xe3/0x460
[ 3930.623184]  do_try_to_free_pages+0xcb/0x380
[ 3930.625049]  try_to_free_pages+0xbb/0xf0
[ 3930.626869]  __alloc_pages_slowpath+0x3c1/0xc50
[ 3930.628803]  __alloc_pages_nodemask+0x2a6/0x2c0
[ 3930.630904]  bio_copy_kern+0xcd/0x200
[ 3930.632639]  blk_rq_map_kern+0xb6/0x130
[ 3930.634408]  scsi_execute+0x64/0x250
[ 3930.636079]  sr_check_events+0x9a/0x2b0 [sr_mod]
[ 3930.637998]  ? __mutex_unlock_slowpath+0x46/0x2b0
[ 3930.639934]  cdrom_check_events+0xf/0x30 [cdrom]
[ 3930.641822]  sr_block_check_events+0x7c/0xb0 [sr_mod]
[ 3930.643889]  disk_check_events+0x5e/0x150
[ 3930.645611]  process_one_work+0x290/0x4a0
[ 3930.647628]  ? process_one_work+0x227/0x4a0
[ 3930.649590]  worker_thread+0x28/0x3d0
[ 3930.651394]  ? process_one_work+0x4a0/0x4a0
[ 3930.653216]  kthread+0x107/0x120
[ 3930.654709]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 3930.656745]  ret_from_fork+0x24/0x30



Roman Gushchin wrote:
> On Tue, Jul 24, 2018 at 08:59:58PM +0900, Tetsuo Handa wrote:
> > Roman, will you check this cleanup patch? This patch applies on top of next-20180724.
> > I assumed that your series do not kill processes which current thread should not
> > wait for termination.
> 
> Hi Tetsuo!
> 
> > 
> > From 86ba99fbf73a9eda0df5ee4ae70c075781e83f81 Mon Sep 17 00:00:00 2001
> > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Date: Tue, 24 Jul 2018 14:00:45 +0900
> > Subject: [PATCH] mm,oom: Check pending victims earlier in out_of_memory().
> > 
> > The "mm, oom: cgroup-aware OOM killer" patchset introduced INFLIGHT_VICTIM
> > in order to replace open-coded ((void *)-1UL). But (regarding CONFIG_MMU=y
> > case) we have a list of inflight OOM victim threads which are connected to
> > oom_reaper_list. Thus we can check whether there are inflight OOM victims
> > before starting process/memcg list traversal. Since it is likely that only
> > few threads are linked to oom_reaper_list, checking all victims' OOM domain
> > will not matter.
> 
> I have a couple of top-level concerns:
> 1) You're doubling the size of oom_reaper memory footprint in task_struct.
>    I doubt, that code cleanup really worth it. If it's absolutely necessary
>    to resolve the lockup, which you mentioned, it should be explained
>    explicitly.

Doubling the size of the oom_reaper memory footprint in task_struct is
unrelated to mitigating the lockup. The patch for mitigating the lockup is
very simple (as shown above). But since that patch conflicts with your series,
we can't apply it unless we rework it on top of your series. I would really
appreciate it if you rebased your series on top of that patch.

Doubling the size is not unique to my changes either. David is trying to avoid
unnecessarily killing additional processes by using a timeout
( http://lkml.kernel.org/r/alpine.DEB.2.21.1807241443390.206335@chino.kir.corp.google.com )
and I agree that using "struct list_head" can make the code easier to read,
especially since your series is about to generate many OOM victims.
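
To illustrate what I mean by "struct list_head" (a sketch only; the field and
function names below are illustrative and not taken from either series):

	static LIST_HEAD(oom_victim_list);
	static DEFINE_SPINLOCK(oom_victim_lock);

	/* task_struct would grow a "struct list_head oom_victim" member. */

	static void queue_oom_victim(struct task_struct *p)
	{
		get_task_struct(p);
		spin_lock(&oom_victim_lock);
		list_add_tail(&p->oom_victim, &oom_victim_list);
		spin_unlock(&oom_victim_lock);
	}

	static bool oom_has_pending_victims(void)
	{
		bool ret;

		spin_lock(&oom_victim_lock);
		ret = !list_empty(&oom_victim_list);
		spin_unlock(&oom_victim_lock);
		return ret;
	}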

> 
> 2) There are several cosmetic changes in this patch, which makes reviewing
>    harder. Can you, please, split it into several parts.

The patch would be shorter ( https://marc.info/?l=linux-mm&m=153062797803080 )
if we could apply it before your series.

FYI, a split-up version of the series is at https://marc.info/?l=linux-mm&m=153062796003073 .

> 
> 
> Thanks!
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-25 12:23     ` Tetsuo Handa
@ 2018-07-25 13:01       ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-25 13:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Roman Gushchin, akpm, linux-mm, rientjes, hannes, tj, gthelen

On Wed 25-07-18 21:23:11, Tetsuo Handa wrote:
> Michal, I think that I hit a WQ OOM lockup caused by the lack of a guaranteed
> schedule_timeout_*() for WQ workers. I think we should immediately send
> https://marc.info/?l=linux-mm&m=153062798103081
> (or both http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org and
> https://marc.info/?l=linux-mm&m=152723708623015 ) to linux.git, so that we can
> send it to stable kernels without waiting for the "mm, oom: cgroup-aware OOM
> killer" patchset.

Then do not pollute unrelated threads. Really! The patch to drop the
sleep from under the oom_lock is in the mmotm tree already. I really do not
see why you make it out to be more important than it really is (ohhh, is it
the dubious CVE you have filed?).

If you have hit a WQ OOM lockup because of a missing schedule_timeout,
then report it in a new email thread and we can discuss the proper way
of handling it.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 13:26                                     ` Michal Hocko
  2018-07-24 13:31                                       ` Tejun Heo
@ 2018-07-30  8:03                                       ` Michal Hocko
  2018-07-30 14:04                                         ` Tejun Heo
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-30  8:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 15:26:40, Michal Hocko wrote:
> On Tue 24-07-18 06:08:36, Tejun Heo wrote:
> > Hello,
> > 
> > On Tue, Jul 24, 2018 at 09:32:30AM +0200, Michal Hocko wrote:
> [...]
> > > > There's no reason to put any
> > > > restrictions on what each cgroup can configure.  The only thing which
> > > > matters is is that the effective behavior is what the highest in the
> > > > ancestry configures, and, at the system level, it'd conceptually map
> > > > to panic_on_oom.
> > > 
> > > Hmm, so do we inherit group_oom? If not, how do we prevent from
> > > unexpected behavior?
> > 
> > Hmm... I guess we're debating two options here.  Please consider the
> > following hierarchy.
> > 
> >       R
> >       |
> >       A (group oom == 1)
> >      / \
> >     B   C
> >     |
> >     D
> > 
> > 1. No matter what B, C or D sets, as long as A sets group oom, any oom
> >    kill inside A's subtree kills the entire subtree.
> > 
> > 2. A's group oom policy applies iff the source of the OOM is either at
> >    or above A - ie. iff the OOM is system-wide or caused by memory.max
> >    of A.
> > 
> > In #1, it doesn't matter what B, C or D sets, so it's kinda moot to
> > discuss whether they inherit A's setting or not.  A's is, if set,
> > always overriding.  In #2, what B, C or D sets matters if they also
> > set their own memory.max, so there's no reason for them to inherit
> > anything.
> > 
> > I'm actually okay with either option.  #2 is more flexible than #1 but
> > given that this is a cgroup owned property which is likely to be set
> > on per-application basis, #1 is likely good enough.
> > 
> > IIRC, we did #2 in the original implementation and the simplified one
> > is doing #1, right?
> 
> No, we've been discussing #2 unless I have misunderstood something.
> I find it rather non-intuitive that a property outside of the oom domain
> controls the behavior inside the domain. I will keep thinking about that
> though.

So the more I think about it, the more I am convinced that option #1 is
simply wrong, for the reason I've mentioned above: consulting a property
outside of the oom hierarchy is tricky and non-intuitive.

So the implementation should be:
	struct mem_cgroup *oom_group = NULL;
	struct mem_cgroup *memcg = mem_cgroup_from_task(oom_victim);

	/*
	 * Walk from the victim's memcg up to the OOM domain (the OOMing
	 * memcg, or root for a global OOM) and remember the highest level
	 * with group_oom set; nothing above the OOM domain is consulted.
	 */
	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
		if (memcg->group_oom)
			oom_group = memcg;
		if (memcg == root)
			break;
	}

And the documented semantic:
oom.group - denotes a memory cgroup (or subhierarchy) which represents
an indivisible workload; should any process be selected as an oom
victim due to an OOM event (at this cgroup level or above), then all
processes belonging to the group/hierarchy are killed together.

Please be careful when defining different oom.group policies within the
same hierarchy because OOM events at different hierarchy levels can 
have surprising effects. Example
	R
	|
	A (oom.group = 1)
       / \
      B   C (oom.group = 0)

oom victim living in B resp. C.

OOM event at R - (e.g. global OOM) or A will kill all tasks in A subtree.
OOM event at B resp. C will only kill a single process from those
memcgs. 
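
For completeness, the consumer side in the oom killer would then be something
along these lines (a sketch only; mem_cgroup_get_oom_group(),
__oom_kill_process() and oom_kill_memcg_member() are placeholders standing in
for the walk above and for the existing single-task kill path, not existing
functions):

	static void oom_kill_victim(struct oom_control *oc,
				    struct task_struct *victim)
	{
		struct mem_cgroup *group;

		/*
		 * The walk above, wrapped in a helper: return the highest
		 * memcg with group_oom set within the OOM domain (with a
		 * css reference held), or NULL.
		 */
		group = mem_cgroup_get_oom_group(victim, oc->memcg);

		__oom_kill_process(victim);

		if (group) {
			/* also kill every other task belonging to the group */
			mem_cgroup_scan_tasks(group, oom_kill_memcg_member, NULL);
			css_put(&group->css);
		}
	}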
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-30  8:03                                       ` Michal Hocko
@ 2018-07-30 14:04                                         ` Tejun Heo
  2018-07-30 15:29                                           ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2018-07-30 14:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

Hello,

On Mon, Jul 30, 2018 at 10:03:57AM +0200, Michal Hocko wrote:
> Please be careful when defining different oom.group policies within the
> same hierarchy because OOM events at different hierarchy levels can 
> have surprising effects. Example
> 	R
> 	|
> 	A (oom.group = 1)
>        / \
>       B   C (oom.group = 0)
> 
> oom victim living in B resp. C.
> 
> OOM event at R - (e.g. global OOM) or A will kill all tasks in A subtree.
> OOM event at B resp. C will only kill a single process from those
> memcgs. 

That behavior makes perfect sense to me and it maps to panic_on_oom==2
which works.  Roman, what do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: cgroup-aware OOM killer, how to move forward
  2018-07-30 14:04                                         ` Tejun Heo
@ 2018-07-30 15:29                                           ` Roman Gushchin
  0 siblings, 0 replies; 52+ messages in thread
From: Roman Gushchin @ 2018-07-30 15:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Michal Hocko, hannes, David Rientjes, linux-mm, akpm, gthelen

On Mon, Jul 30, 2018 at 07:04:26AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jul 30, 2018 at 10:03:57AM +0200, Michal Hocko wrote:
> > Please be careful when defining different oom.group policies within the
> > same hierarchy because OOM events at different hierarchy levels can 
> > have surprising effects. Example
> > 	R
> > 	|
> > 	A (oom.group = 1)
> >        / \
> >       B   C (oom.group = 0)
> > 
> > oom victim living in B resp. C.
> > 
> > OOM event at R - (e.g. global OOM) or A will kill all tasks in A subtree.
> > OOM event at B resp. C will only kill a single process from those
> > memcgs. 
> 
> That behavior makes perfect sense to me and it maps to panic_on_oom==2
> which works.  Roman, what do you think?

I'm totally fine with this behavior; it is what I suggested initially.
I'll post the patchset soon.

Thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2018-07-30 15:29 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-11 22:40 cgroup-aware OOM killer, how to move forward Roman Gushchin
2018-07-12 12:07 ` Michal Hocko
2018-07-12 15:55   ` Roman Gushchin
2018-07-13 21:34 ` David Rientjes
2018-07-13 22:16   ` Roman Gushchin
2018-07-13 22:39     ` David Rientjes
2018-07-13 23:05       ` Roman Gushchin
2018-07-13 23:11         ` David Rientjes
2018-07-13 23:16           ` Roman Gushchin
2018-07-17  4:19             ` David Rientjes
2018-07-17 12:41               ` Michal Hocko
2018-07-17 17:38               ` Roman Gushchin
2018-07-17 19:49                 ` Michal Hocko
2018-07-17 20:06                   ` Roman Gushchin
2018-07-17 20:41                     ` David Rientjes
2018-07-17 20:52                       ` Roman Gushchin
2018-07-20  8:30                         ` David Rientjes
2018-07-20 11:21                           ` Tejun Heo
2018-07-20 16:13                             ` Roman Gushchin
2018-07-20 20:28                             ` David Rientjes
2018-07-20 20:47                               ` Roman Gushchin
2018-07-23 23:06                                 ` David Rientjes
2018-07-23 14:12                               ` Michal Hocko
2018-07-18  8:19                       ` Michal Hocko
2018-07-18  8:12                     ` Michal Hocko
2018-07-18 15:28                       ` Roman Gushchin
2018-07-19  7:38                         ` Michal Hocko
2018-07-19 17:05                           ` Roman Gushchin
2018-07-20  8:32                             ` David Rientjes
2018-07-23 14:17                             ` Michal Hocko
2018-07-23 15:09                               ` Tejun Heo
2018-07-24  7:32                                 ` Michal Hocko
2018-07-24 13:08                                   ` Tejun Heo
2018-07-24 13:26                                     ` Michal Hocko
2018-07-24 13:31                                       ` Tejun Heo
2018-07-24 13:50                                         ` Michal Hocko
2018-07-24 13:55                                           ` Tejun Heo
2018-07-24 14:25                                             ` Michal Hocko
2018-07-24 14:28                                               ` Tejun Heo
2018-07-24 14:35                                                 ` Tejun Heo
2018-07-24 14:43                                                 ` Michal Hocko
2018-07-24 14:49                                                   ` Tejun Heo
2018-07-24 15:52                                                     ` Roman Gushchin
2018-07-25 12:00                                                       ` Michal Hocko
2018-07-25 11:58                                                     ` Michal Hocko
2018-07-30  8:03                                       ` Michal Hocko
2018-07-30 14:04                                         ` Tejun Heo
2018-07-30 15:29                                           ` Roman Gushchin
2018-07-24 11:59 ` Tetsuo Handa
2018-07-25  0:10   ` Roman Gushchin
2018-07-25 12:23     ` Tetsuo Handa
2018-07-25 13:01       ` Michal Hocko
