* cgroup-aware OOM killer, how to move forward
@ 2018-07-11 22:40 Roman Gushchin
  2018-07-12 12:07 ` Michal Hocko
  ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread

From: Roman Gushchin @ 2018-07-11 22:40 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, rientjes, mhocko, hannes, tj, gthelen

Hello!

I was thinking about how to move forward with the cgroup-aware OOM killer.
It looks to me that we all agree on the "cleanup" part of the patchset:
it's a nice feature to be able to kill all tasks in a cgroup to guarantee
the consistent state of the workload. All our disagreements are related
to the victim selection algorithm.

So I wonder if the right thing to do is to split the problem. We can
agree on the "cleanup" part, which is useful by itself, merge it
upstream, and then return to the victim selection algorithm.

So, here is my proposal: let's introduce the memory.group_oom knob with
the following semantics: if the knob is set, the OOM killer can kill
either none or all tasks in the cgroup*. It can work perfectly with the
current OOM killer (as a "cleanup" option), and allows _any_ further
approach to OOM victim selection. It also doesn't require any
mount/boot/tree-wide options.

How does it sound? If we can agree on this, I will prepare the patchset.
It's quite small and straightforward in comparison to the current version.

Thanks!

* More precisely: if the OOM killer kills a task, it will traverse the
cgroup tree up to the OOM domain (the OOMing memcg or root), looking for
the highest-level cgroup with group_oom set. Then it will kill all tasks
in such a cgroup, if one exists.

^ permalink raw reply [flat|nested] 52+ messages in thread
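[Editor's note: the footnote's upward traversal can be sketched as a small
Python model. This is purely illustrative — the `Memcg` class and function
names are invented for the sketch and are not kernel code.]

```python
class Memcg:
    """Toy model of a memory cgroup tree node (illustrative only)."""
    def __init__(self, name, parent=None, group_oom=False):
        self.name, self.parent, self.group_oom = name, parent, group_oom
        self.children, self.tasks = [], []
        if parent:
            parent.children.append(self)

def expand_victim(victim_memcg, oom_domain):
    """Walk from the victim's memcg up to the OOM domain (inclusive),
    remembering the highest-level cgroup with group_oom set."""
    chosen, memcg = None, victim_memcg
    while memcg is not None:
        if memcg.group_oom:
            chosen = memcg
        if memcg is oom_domain:
            break
        memcg = memcg.parent
    return chosen

def all_tasks(memcg):
    tasks = list(memcg.tasks)
    for child in memcg.children:
        tasks += all_tasks(child)
    return tasks

def tasks_to_kill(victim_task, victim_memcg, oom_domain):
    group = expand_victim(victim_memcg, oom_domain)
    if group is None:
        return [victim_task]   # no group_oom anywhere: kill one task
    return all_tasks(group)    # group_oom set: tear the whole group down
```

With this rule, setting group_oom anywhere between the victim and the OOM
domain widens the kill from one task to the highest such group's subtree.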
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-11 22:40 cgroup-aware OOM killer, how to move forward Roman Gushchin
@ 2018-07-12 12:07 ` Michal Hocko
  2018-07-12 15:55   ` Roman Gushchin
  2018-07-13 21:34 ` David Rientjes
  2018-07-24 11:59 ` Tetsuo Handa
  2 siblings, 1 reply; 52+ messages in thread

From: Michal Hocko @ 2018-07-12 12:07 UTC (permalink / raw)
To: Roman Gushchin; +Cc: linux-mm, akpm, rientjes, hannes, tj, gthelen

On Wed 11-07-18 15:40:03, Roman Gushchin wrote:
> Hello!
>
> I was thinking about how to move forward with the cgroup-aware OOM killer.
> It looks to me that we all agree on the "cleanup" part of the patchset:
> it's a nice feature to be able to kill all tasks in a cgroup to guarantee
> the consistent state of the workload. All our disagreements are related
> to the victim selection algorithm.
>
> So I wonder if the right thing to do is to split the problem. We can
> agree on the "cleanup" part, which is useful by itself, merge it
> upstream, and then return to the victim selection algorithm.

Could you be more specific which patches those are, please?

> So, here is my proposal: let's introduce the memory.group_oom knob with
> the following semantics: if the knob is set, the OOM killer can kill
> either none or all tasks in the cgroup*. It can work perfectly with the
> current OOM killer (as a "cleanup" option), and allows _any_ further
> approach to OOM victim selection. It also doesn't require any
> mount/boot/tree-wide options.
>
> How does it sound?

Well, I guess we have already discussed that. One problem I can see with
that approach is that there is a disconnect between what is the oom
killable entity and the oom candidate entity. This will matter when we
start seeing reports that the wrong container has been torn down because
there were larger ones running, all just because the latter ones consist
of smaller tasks.

Is this a fundamental roadblock? I am not sure, but I would tend to say
_no_, because the oom victim selection has always been an implementation
detail. We just need to kill _somebody_ to release _some_ memory.
Killing the whole workload is a sensible thing to do. So I would be ok
with that, even though I am still not sure why we should start with
something half done when your original implementation was much more
consistent. Sure, there is some disagreement, but I suspect that we will
get stuck with an intermediate solution later on again for the very same
reasons.

I have summarized [1] the current contention points and I would really
appreciate it if somebody who wasn't really involved in the previous
discussions could join there and weigh the arguments. The OOM selection
policy is just a heuristic with some potential drawbacks, and somebody
might object and block otherwise useful features for others forever. So
we should really find some consensus on what is reasonable and what is
just over the line.

[1] http://lkml.kernel.org/r/20180605114729.GB19202@dhcp22.suse.cz
--
Michal Hocko
SUSE Labs
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-12 12:07 ` Michal Hocko
@ 2018-07-12 15:55   ` Roman Gushchin
  0 siblings, 0 replies; 52+ messages in thread

From: Roman Gushchin @ 2018-07-12 15:55 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, akpm, rientjes, hannes, tj, gthelen

On Thu, Jul 12, 2018 at 02:07:03PM +0200, Michal Hocko wrote:
> On Wed 11-07-18 15:40:03, Roman Gushchin wrote:
> > I was thinking about how to move forward with the cgroup-aware OOM killer.
> > It looks to me that we all agree on the "cleanup" part of the patchset:
> > it's a nice feature to be able to kill all tasks in a cgroup to guarantee
> > the consistent state of the workload. All our disagreements are related
> > to the victim selection algorithm.
> >
> > So I wonder if the right thing to do is to split the problem. We can
> > agree on the "cleanup" part, which is useful by itself, merge it
> > upstream, and then return to the victim selection algorithm.
>
> Could you be more specific which patches those are, please?

It's not quite a part of the existing patchset, but I had such a version
during my work on the current patchset, and it was really small and cute.
I need some time to restore/rebase it.

> > So, here is my proposal: let's introduce the memory.group_oom knob with
> > the following semantics: if the knob is set, the OOM killer can kill
> > either none or all tasks in the cgroup*. It can work perfectly with the
> > current OOM killer (as a "cleanup" option), and allows _any_ further
> > approach to OOM victim selection. It also doesn't require any
> > mount/boot/tree-wide options.
> >
> > How does it sound?
>
> Well, I guess we have already discussed that. One problem I can see with
> that approach is that there is a disconnect between what is the oom
> killable entity and the oom candidate entity. This will matter when we
> start seeing reports that the wrong container has been torn down because
> there were larger ones running, all just because the latter ones consist
> of smaller tasks.
>
> Is this a fundamental roadblock? I am not sure, but I would tend to say
> _no_, because the oom victim selection has always been an implementation
> detail. We just need to kill _somebody_ to release _some_ memory.
> Killing the whole workload is a sensible thing to do.

Yes. We also use Johannes's memory pressure metrics for making OOM
decisions internally, which is working nicely. In this case the in-kernel
OOM decision logic serves more as a backup solution, and consistency is
the only thing which really matters.

> So I would be ok with that, even though I am still not sure why we should
> start with something half done when your original implementation was
> much more consistent. Sure, there is some disagreement, but I suspect
> that we will get stuck with an intermediate solution later on again for
> the very same reasons. I have summarized [1] the current contention
> points and I would really appreciate it if somebody who wasn't really
> involved in the previous discussions could join there and weigh the
> arguments. The OOM selection policy is just a heuristic with some
> potential drawbacks, and somebody might object and block otherwise
> useful features for others forever. So we should really find some
> consensus on what is reasonable and what is just over the line.

I would definitely prefer just to land the existing version, and I prefer
it over this proposal. But it doesn't seem to be going forward well...
Maybe making the described step first might help.

Thanks,
Roman
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-11 22:40 cgroup-aware OOM killer, how to move forward Roman Gushchin
  2018-07-12 12:07 ` Michal Hocko
@ 2018-07-13 21:34 ` David Rientjes
  2018-07-13 22:16   ` Roman Gushchin
  2018-07-24 11:59 ` Tetsuo Handa
  2 siblings, 1 reply; 52+ messages in thread

From: David Rientjes @ 2018-07-13 21:34 UTC (permalink / raw)
To: Roman Gushchin; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Wed, 11 Jul 2018, Roman Gushchin wrote:

> I was thinking about how to move forward with the cgroup-aware OOM killer.
> It looks to me that we all agree on the "cleanup" part of the patchset:
> it's a nice feature to be able to kill all tasks in a cgroup to guarantee
> the consistent state of the workload. All our disagreements are related
> to the victim selection algorithm.
>
> So I wonder if the right thing to do is to split the problem. We can
> agree on the "cleanup" part, which is useful by itself, merge it
> upstream, and then return to the victim selection algorithm.
>
> So, here is my proposal: let's introduce the memory.group_oom knob with
> the following semantics: if the knob is set, the OOM killer can kill
> either none or all tasks in the cgroup*. It can work perfectly with the
> current OOM killer (as a "cleanup" option), and allows _any_ further
> approach to OOM victim selection. It also doesn't require any
> mount/boot/tree-wide options.
>
> How does it sound?

No objection, of course; this was always the mechanism vs policy
separation that I was referring to. Having the ability to kill all
processes attached to the cgroup when one of its processes is selected is
useful, and we have our own patches that do just that, with the exception
that it's triggerable by the user.

One of the things that I really like about cgroup v2, though, is what
appears to be an implicit, but rather apparent, goal to minimize the
number of files for each controller. It's very clean. So I'd suggest
that we consider memory.group_oom, or however it is named, to allow for
future development.

For example, rather than simply being binary, we'd probably want the
ability to kill all eligible processes attached directly to the victim's
mem cgroup *or* all processes attached to its subtree as well.

I'd suggest it be implemented to accept a string — "default"/"process",
"local", or "tree"/"hierarchy", or better names — to define the group oom
mechanism for the mem cgroup that is oom when one of its processes is
selected as a victim.

> * More precisely: if the OOM killer kills a task, it will traverse the
> cgroup tree up to the OOM domain (the OOMing memcg or root), looking for
> the highest-level cgroup with group_oom set. Then it will kill all tasks
> in such a cgroup, if one exists.

All such processes that are not oom disabled, yes.
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 21:34 ` David Rientjes
@ 2018-07-13 22:16   ` Roman Gushchin
  2018-07-13 22:39     ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread

From: Roman Gushchin @ 2018-07-13 22:16 UTC (permalink / raw)
To: David Rientjes; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, Jul 13, 2018 at 02:34:49PM -0700, David Rientjes wrote:
> On Wed, 11 Jul 2018, Roman Gushchin wrote:
>
> > So, here is my proposal: let's introduce the memory.group_oom knob with
> > the following semantics: if the knob is set, the OOM killer can kill
> > either none or all tasks in the cgroup*. It can work perfectly with the
> > current OOM killer (as a "cleanup" option), and allows _any_ further
> > approach to OOM victim selection. It also doesn't require any
> > mount/boot/tree-wide options.
> >
> > How does it sound?
>
> No objection, of course; this was always the mechanism vs policy
> separation that I was referring to. Having the ability to kill all
> processes attached to the cgroup when one of its processes is selected is
> useful, and we have our own patches that do just that, with the exception
> that it's triggerable by the user.

Perfect! I'll prepare the patchset.

> One of the things that I really like about cgroup v2, though, is what
> appears to be an implicit, but rather apparent, goal to minimize the
> number of files for each controller. It's very clean. So I'd suggest
> that we consider memory.group_oom, or however it is named, to allow for
> future development.
>
> For example, rather than simply being binary, we'd probably want the
> ability to kill all eligible processes attached directly to the victim's
> mem cgroup *or* all processes attached to its subtree as well.
>
> I'd suggest it be implemented to accept a string — "default"/"process",
> "local", or "tree"/"hierarchy", or better names — to define the group oom
> mechanism for the mem cgroup that is oom when one of its processes is
> selected as a victim.

I would prefer to keep it boolean to match the simplicity of the cgroup
v2 API. In a v2 hierarchy processes can't be attached to non-leaf
cgroups, so I don't see the place for a third meaning.

> > * More precisely: if the OOM killer kills a task, it will traverse the
> > cgroup tree up to the OOM domain (the OOMing memcg or root), looking for
> > the highest-level cgroup with group_oom set. Then it will kill all tasks
> > in such a cgroup, if one exists.
>
> All such processes that are not oom disabled, yes.

Yep, of course.

Thanks!
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 22:16   ` Roman Gushchin
@ 2018-07-13 22:39     ` David Rientjes
  2018-07-13 23:05       ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread

From: David Rientjes @ 2018-07-13 22:39 UTC (permalink / raw)
To: Roman Gushchin; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, 13 Jul 2018, Roman Gushchin wrote:

> > No objection, of course; this was always the mechanism vs policy
> > separation that I was referring to. Having the ability to kill all
> > processes attached to the cgroup when one of its processes is selected is
> > useful, and we have our own patches that do just that, with the exception
> > that it's triggerable by the user.
>
> Perfect! I'll prepare the patchset.

I mean, I separated it out completely in my own
https://marc.info/?l=linux-kernel&m=152175564704473 as part of a patch
series that completely fixes all of the issues with the cgroup-aware oom
killer, so of course there's no objection to separating it out.

> > One of the things that I really like about cgroup v2, though, is what
> > appears to be an implicit, but rather apparent, goal to minimize the
> > number of files for each controller. It's very clean. So I'd suggest
> > that we consider memory.group_oom, or however it is named, to allow for
> > future development.
> >
> > For example, rather than simply being binary, we'd probably want the
> > ability to kill all eligible processes attached directly to the victim's
> > mem cgroup *or* all processes attached to its subtree as well.
> >
> > I'd suggest it be implemented to accept a string — "default"/"process",
> > "local", or "tree"/"hierarchy", or better names — to define the group oom
> > mechanism for the mem cgroup that is oom when one of its processes is
> > selected as a victim.
>
> I would prefer to keep it boolean to match the simplicity of the cgroup
> v2 API. In a v2 hierarchy processes can't be attached to non-leaf
> cgroups, so I don't see the place for a third meaning.

Not all cgroup v2 files need to be boolean, and the only way you can add
a subtree oom kill later is to introduce yet another file. Please make it
a tristate so that you can define a mechanism of default (process only),
local cgroup, or subtree, and so we can avoid adding another option later
that conflicts with the proposed one. This should be easy.
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 22:39     ` David Rientjes
@ 2018-07-13 23:05       ` Roman Gushchin
  2018-07-13 23:11         ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread

From: Roman Gushchin @ 2018-07-13 23:05 UTC (permalink / raw)
To: David Rientjes; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, Jul 13, 2018 at 03:39:15PM -0700, David Rientjes wrote:
> On Fri, 13 Jul 2018, Roman Gushchin wrote:
>
> > I would prefer to keep it boolean to match the simplicity of the cgroup
> > v2 API. In a v2 hierarchy processes can't be attached to non-leaf
> > cgroups, so I don't see the place for a third meaning.
>
> Not all cgroup v2 files need to be boolean, and the only way you can add
> a subtree oom kill later is to introduce yet another file. Please make it
> a tristate so that you can define a mechanism of default (process only),
> local cgroup, or subtree, and so we can avoid adding another option later
> that conflicts with the proposed one. This should be easy.

David, we're adding a cgroup v2 knob, and in cgroup v2 a memory cgroup
either has a sub-tree or attached processes. So there is no difference
between the local cgroup and the subtree.
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 23:05       ` Roman Gushchin
@ 2018-07-13 23:11         ` David Rientjes
  2018-07-13 23:16           ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread

From: David Rientjes @ 2018-07-13 23:11 UTC (permalink / raw)
To: Roman Gushchin; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, 13 Jul 2018, Roman Gushchin wrote:

> > Not all cgroup v2 files need to be boolean, and the only way you can add
> > a subtree oom kill later is to introduce yet another file. Please make it
> > a tristate so that you can define a mechanism of default (process only),
> > local cgroup, or subtree, and so we can avoid adding another option later
> > that conflicts with the proposed one. This should be easy.
>
> David, we're adding a cgroup v2 knob, and in cgroup v2 a memory cgroup
> either has a sub-tree or attached processes. So there is no difference
> between the local cgroup and the subtree.

Uhm, what? We're talking about a common ancestor reaching its limit, so
it's oom, and it has multiple immediate children with their own processes
attached. The difference is killing all processes attached to the
victim's cgroup or all processes under the oom mem cgroup's subtree.
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 23:11         ` David Rientjes
@ 2018-07-13 23:16           ` Roman Gushchin
  2018-07-17  4:19             ` David Rientjes
  0 siblings, 1 reply; 52+ messages in thread

From: Roman Gushchin @ 2018-07-13 23:16 UTC (permalink / raw)
To: David Rientjes; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, Jul 13, 2018 at 04:11:51PM -0700, David Rientjes wrote:
> On Fri, 13 Jul 2018, Roman Gushchin wrote:
>
> > David, we're adding a cgroup v2 knob, and in cgroup v2 a memory cgroup
> > either has a sub-tree or attached processes. So there is no difference
> > between the local cgroup and the subtree.
>
> Uhm, what? We're talking about a common ancestor reaching its limit, so
> it's oom, and it has multiple immediate children with their own processes
> attached. The difference is killing all processes attached to the
> victim's cgroup or all processes under the oom mem cgroup's subtree.

But it's a binary decision, no? If memory.group_oom is set, the whole
sub-tree will be killed. Otherwise not.
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-13 23:16           ` Roman Gushchin
@ 2018-07-17  4:19             ` David Rientjes
  2018-07-17 12:41               ` Michal Hocko
  2018-07-17 17:38               ` Roman Gushchin
  0 siblings, 2 replies; 52+ messages in thread

From: David Rientjes @ 2018-07-17 4:19 UTC (permalink / raw)
To: Roman Gushchin; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Fri, 13 Jul 2018, Roman Gushchin wrote:

> > Uhm, what? We're talking about a common ancestor reaching its limit, so
> > it's oom, and it has multiple immediate children with their own processes
> > attached. The difference is killing all processes attached to the
> > victim's cgroup or all processes under the oom mem cgroup's subtree.
>
> But it's a binary decision, no? If memory.group_oom is set, the whole
> sub-tree will be killed. Otherwise not.

No: if memory.max is reached and memory.group_oom is set, my understanding
of your proposal is that a process is chosen and all eligible processes
attached to its mem cgroup are oom killed. My desire for a tristate is so
that it can be specified that all processes attached to the *subtree* are
oom killed. With the single unified hierarchy mandated by cgroup v2, we
can separate descendant cgroups for use with other controllers and enforce
memory.max by an ancestor.

Making this a boolean value only prevents it from becoming extensible.
If memory.group_oom is only effective for the victim's mem cgroup, it
becomes impossible to specify that all processes in the subtree should be
oom killed as a result of the ancestor limit without adding yet another
tunable.
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17  4:19             ` David Rientjes
@ 2018-07-17 12:41               ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread

From: Michal Hocko @ 2018-07-17 12:41 UTC (permalink / raw)
To: David Rientjes; +Cc: Roman Gushchin, linux-mm, akpm, hannes, tj, gthelen

On Mon 16-07-18 21:19:18, David Rientjes wrote:
> On Fri, 13 Jul 2018, Roman Gushchin wrote:
>
> > But it's a binary decision, no? If memory.group_oom is set, the whole
> > sub-tree will be killed. Otherwise not.
>
> No: if memory.max is reached and memory.group_oom is set, my understanding
> of your proposal is that a process is chosen and all eligible processes
> attached to its mem cgroup are oom killed. My desire for a tristate is so
> that it can be specified that all processes attached to the *subtree* are
> oom killed. With the single unified hierarchy mandated by cgroup v2, we
> can separate descendant cgroups for use with other controllers and enforce
> memory.max by an ancestor.
>
> Making this a boolean value only prevents it from becoming extensible.
> If memory.group_oom is only effective for the victim's mem cgroup, it
> becomes impossible to specify that all processes in the subtree should be
> oom killed as a result of the ancestor limit without adding yet another
> tunable.

No, this is mangling the interface again. I have already objected to
this [1]. group_oom only tells us to tear the whole group down. How you
select this particular group is a completely different story. Conflating
those two things is a bad interface to start with. Killing the whole
cgroup or only a single process should be invariant for the particular
memcg, regardless of what the oom selection policy above it in the
hierarchy is. Either you are an indivisible workload or you are not, full
stop.

If we really need better control over how to select subtrees, then this
should be a separate control knob. What it should look like is a matter
of discussion, but it will be hard to find any consensus if the
single-knob-for-single-purpose approach is not clear.

[1] http://lkml.kernel.org/r/20180117160004.GH2900@dhcp22.suse.cz
--
Michal Hocko
SUSE Labs
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17  4:19             ` David Rientjes
  2018-07-17 12:41               ` Michal Hocko
@ 2018-07-17 17:38               ` Roman Gushchin
  2018-07-17 19:49                 ` Michal Hocko
  1 sibling, 1 reply; 52+ messages in thread

From: Roman Gushchin @ 2018-07-17 17:38 UTC (permalink / raw)
To: David Rientjes; +Cc: linux-mm, akpm, mhocko, hannes, tj, gthelen

On Mon, Jul 16, 2018 at 09:19:18PM -0700, David Rientjes wrote:
> No: if memory.max is reached and memory.group_oom is set, my understanding
> of your proposal is that a process is chosen and all eligible processes
> attached to its mem cgroup are oom killed. My desire for a tristate is so
> that it can be specified that all processes attached to the *subtree* are
> oom killed. With the single unified hierarchy mandated by cgroup v2, we
> can separate descendant cgroups for use with other controllers and enforce
> memory.max by an ancestor.
>
> Making this a boolean value only prevents it from becoming extensible.
> If memory.group_oom is only effective for the victim's mem cgroup, it
> becomes impossible to specify that all processes in the subtree should be
> oom killed as a result of the ancestor limit without adding yet another
> tunable.

Let me show my proposal with examples. Let's say we have the following
hierarchy, and the biggest process (or the process with the highest
oom_score_adj) is in D.

      /
      |
      A
      |
      B
     / \
    C   D

Let's look at different examples and the intended behavior:

1) system-wide OOM
   - default settings: the biggest process is killed
   - D/memory.group_oom=1: all processes in D are killed
   - A/memory.group_oom=1: all processes in A are killed
2) memcg oom in B
   - default settings: the biggest process is killed
   - A/memory.group_oom=1: the biggest process is killed
   - B/memory.group_oom=1: all processes in B are killed
   - D/memory.group_oom=1: all processes in D are killed

Please note that processes can't be attached directly to A and B, so "all
processes in A are killed" means that all processes in the sub-tree are
killed. Immortal processes (oom_score_adj=-1000) are excluded.

I believe that this model is complete and doesn't require any further
extension.

Thanks!
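[Editor's note: the behavior table above follows from a single rule — on
OOM, walk from the chosen victim up to the OOM domain and tear down the
highest-level cgroup with group_oom set. A self-contained Python model of
the example hierarchy (illustrative only, not kernel code):]

```python
class Memcg:
    """Toy memcg tree node for modeling the group_oom rule."""
    def __init__(self, parent=None):
        self.parent, self.group_oom = parent, False
        self.children, self.tasks = [], set()
        if parent:
            parent.children.append(self)

def killed(victim_memcg, oom_domain):
    # Find the highest-level ancestor, up to the OOM domain, with
    # group_oom set; cgroups above the OOM domain are never considered.
    chosen, m = None, victim_memcg
    while m is not None:
        if m.group_oom:
            chosen = m
        if m is oom_domain:
            break
        m = m.parent
    if chosen is None:
        return {'victim'}          # default: only the biggest process
    out, stack = set(), [chosen]   # group_oom: the whole subtree dies
    while stack:
        m = stack.pop()
        out |= m.tasks
        stack.extend(m.children)
    return out

# The hierarchy from the example: the victim lives in D.
root = Memcg(); A = Memcg(root); B = Memcg(A)
C = Memcg(B); D = Memcg(B)
C.tasks, D.tasks = {'c'}, {'victim', 'd'}
```

Running the table's scenarios against this model reproduces every row,
including the counter-example (memcg oom in B ignores A's setting) and
Michal's B=1/D=0 case, where all of B dies.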
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 17:38               ` Roman Gushchin
@ 2018-07-17 19:49                 ` Michal Hocko
  2018-07-17 20:06                   ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread

From: Michal Hocko @ 2018-07-17 19:49 UTC (permalink / raw)
To: Roman Gushchin; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen

On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
[...]
> Let me show my proposal with examples. Let's say we have the following
> hierarchy, and the biggest process (or the process with the highest
> oom_score_adj) is in D.
>
>       /
>       |
>       A
>       |
>       B
>      / \
>     C   D
>
> Let's look at different examples and the intended behavior:
>
> 1) system-wide OOM
>    - default settings: the biggest process is killed
>    - D/memory.group_oom=1: all processes in D are killed
>    - A/memory.group_oom=1: all processes in A are killed
> 2) memcg oom in B
>    - default settings: the biggest process is killed
>    - A/memory.group_oom=1: the biggest process is killed

Huh? Why would you even consider A here when the oom is below it?
/me confused

>    - B/memory.group_oom=1: all processes in B are killed
>    - B/memory.group_oom=0 &&
>      D/memory.group_oom=1: all processes in D are killed

What about:
- B/memory.group_oom=1 && D/memory.group_oom=0?
Is this a sane configuration?
--
Michal Hocko
SUSE Labs
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 19:49                 ` Michal Hocko
@ 2018-07-17 20:06                   ` Roman Gushchin
  2018-07-17 20:41                     ` David Rientjes
  2018-07-18  8:12                     ` Michal Hocko
  0 siblings, 2 replies; 52+ messages in thread

From: Roman Gushchin @ 2018-07-17 20:06 UTC (permalink / raw)
To: Michal Hocko; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen

On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote:
> On Tue 17-07-18 10:38:45, Roman Gushchin wrote:
> [...]
> > 2) memcg oom in B
> >    - default settings: the biggest process is killed
> >    - A/memory.group_oom=1: the biggest process is killed
>
> Huh? Why would you even consider A here when the oom is below it?
> /me confused

I do not. This is exactly a counter-example: A's memory.group_oom is not
considered at all in this case, because A is above the ooming cgroup.

> >    - B/memory.group_oom=1: all processes in B are killed
> >    - B/memory.group_oom=0 &&
> >      D/memory.group_oom=1: all processes in D are killed
>
> What about:
> - B/memory.group_oom=1 && D/memory.group_oom=0?

All tasks in B are killed.

group_oom set to 1 means that the workload can't tolerate the killing of
a random process, so in this case it's better to guarantee consistency
for B.

Thanks!
* Re: cgroup-aware OOM killer, how to move forward 2018-07-17 20:06 ` Roman Gushchin @ 2018-07-17 20:41 ` David Rientjes 2018-07-17 20:52 ` Roman Gushchin 2018-07-18 8:19 ` Michal Hocko 2018-07-18 8:12 ` Michal Hocko 1 sibling, 2 replies; 52+ messages in thread From: David Rientjes @ 2018-07-17 20:41 UTC (permalink / raw) To: Roman Gushchin; +Cc: Michal Hocko, linux-mm, akpm, hannes, tj, gthelen On Tue, 17 Jul 2018, Roman Gushchin wrote: > > > Let me show my proposal on examples. Let's say we have the following hierarchy, > > > and the biggest process (or the process with highest oom_score_adj) is in D. > > > > > > / > > > | > > > A > > > | > > > B > > > / \ > > > C D > > > > > > Let's look at different examples and intended behavior: > > > 1) system-wide OOM > > > - default settings: the biggest process is killed > > > - D/memory.group_oom=1: all processes in D are killed > > > - A/memory.group_oom=1: all processes in A are killed > > > 2) memcg oom in B > > > - default settings: the biggest process is killed > > > - A/memory.group_oom=1: the biggest process is killed > > > > Huh? Why would you even consider A here when the oom is below it? > > /me confused > > I do not. > This is exactly a counter-example: A's memory.group_oom > is not considered at all in this case, > because A is above ooming cgroup. > I think the confusion is that this says A/memory.group_oom=1 and then the biggest process is killed, which doesn't seem like it matches the description you want to give memory.group_oom. > > > - B/memory.group_oom=1: all processes in B are killed > > > > - B/memory.group_oom=0 && > > > - D/memory.group_oom=1: all processes in D are killed > > > > What about? > > - B/memory.group_oom=1 && D/memory.group_oom=0 > > All tasks in B are killed. > > Group_oom set to 1 means that the workload can't tolerate > killing of a random process, so in this case it's better > to guarantee consistency for B. > This example is missing the usecase that I was referring to, i.e. 
killing all processes attached to a subtree because the limit on a common ancestor has been reached. In your example, I would think that the memory.group_oom setting of /A and /A/B are meaningless because there are no processes attached to them. IIUC, your proposal is to select the victim by whatever means, check the memory.group_oom setting of that victim, and then either kill the victim or all processes attached to that local mem cgroup depending on the setting. However, if C and D here are only limited by a common ancestor, /A or /A/B, there is no way to specify that the subtree itself should be oom killed. That was where I thought a tristate value would be helpful, such that you can define that all processes attached to the subtree should be oom killed when a mem cgroup has reached memory.max. I was purposefully overloading memory.group_oom because the actual value of memory.group_oom given your semantics here is not relevant for /A or /A/B. I think an additional memory.group_oom_tree or whatever it would be called would lead to unnecessary confusion because then we have a model where one tunable means something based on the value of the other. Given the no internal process constraint of cgroup v2, my suggestion was a value, "tree", that could specify that a mem cgroup reaching its limit could cause all processes attached to its subtree to be killed. This is required only because of the single unified hierarchy of cgroup v2: we want to bind a subset of processes to be controlled by another controller separately, but still want all processes oom killed when reaching the limit of a common ancestor. Thus, the semantic would be: if oom mem cgroup is "tree", kill all processes in subtree; otherwise, it can be "cgroup" or "process" to determine what is oom killed depending on the victim selection.
Having the "tree" behavior could definitely be implemented as a separate tunable; but then the values of /A/memory.group_oom and /A/B/memory.group_oom are irrelevant and, to me, it seems like it would be more confusing. ^ permalink raw reply [flat|nested] 52+ messages in thread
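David's tristate idea can be contrasted with the boolean knob in a toy model. Everything below is an assumption about a proposed-but-unmerged interface: "tree" on the memcg that hit its limit kills the whole subtree, while "cgroup"/"process" on the victim's memcg decides between killing the victim's cgroup or just the victim. Names and values are hypothetical:

```python
class Memcg:
    """Toy memory cgroup carrying the suggested tristate knob; the
    values "process"/"cgroup"/"tree" are taken from the discussion,
    not from any merged interface."""
    def __init__(self, name, parent=None, mode="process"):
        self.name = name
        self.parent = parent
        self.mode = mode

def oom_kill_scope(oom_memcg, victim_memcg):
    """Decide what gets killed when oom_memcg hits its limit and a
    victim task living in victim_memcg has been selected."""
    if oom_memcg.mode == "tree":
        # The limit was hit on a common ancestor configured as an
        # indivisible subtree: tear down everything under it, even
        # though no process is attached to it directly.
        return ("subtree", oom_memcg)
    if victim_memcg.mode == "cgroup":
        return ("cgroup", victim_memcg)
    return ("process", victim_memcg)

# /A limits the job as a whole; /A/b and /A/c split it up for other
# controllers (no processes are attached to /A itself).
a = Memcg("A", mode="tree")
b = Memcg("b", parent=a, mode="cgroup")
c = Memcg("c", parent=a, mode="process")

scope, cg = oom_kill_scope(a, b)
print(scope, cg.name)  # -> subtree A
```

The point of contention in the thread is visible here: the "tree" branch keys off the cgroup that reached its limit, while the other two key off the victim's cgroup, which is why one tunable's value can render the other irrelevant.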
* Re: cgroup-aware OOM killer, how to move forward 2018-07-17 20:41 ` David Rientjes @ 2018-07-17 20:52 ` Roman Gushchin 2018-07-20 8:30 ` David Rientjes 2018-07-18 8:19 ` Michal Hocko 1 sibling, 1 reply; 52+ messages in thread From: Roman Gushchin @ 2018-07-17 20:52 UTC (permalink / raw) To: David Rientjes; +Cc: Michal Hocko, linux-mm, akpm, hannes, tj, gthelen On Tue, Jul 17, 2018 at 01:41:33PM -0700, David Rientjes wrote: > On Tue, 17 Jul 2018, Roman Gushchin wrote: > > > > > Let me show my proposal on examples. Let's say we have the following hierarchy, > > > > and the biggest process (or the process with highest oom_score_adj) is in D. > > > > > > > > / > > > > | > > > > A > > > > | > > > > B > > > > / \ > > > > C D > > > > > > > > Let's look at different examples and intended behavior: > > > > 1) system-wide OOM > > > > - default settings: the biggest process is killed > > > > - D/memory.group_oom=1: all processes in D are killed > > > > - A/memory.group_oom=1: all processes in A are killed > > > > 2) memcg oom in B > > > > - default settings: the biggest process is killed > > > > - A/memory.group_oom=1: the biggest process is killed > > > > > > Huh? Why would you even consider A here when the oom is below it? > > > /me confused > > > > I do not. > > This is exactly a counter-example: A's memory.group_oom > > is not considered at all in this case, > > because A is above ooming cgroup. > > > > I think the confusion is that this says A/memory.group_oom=1 and then the > biggest process is killed, which doesn't seem like it matches the > description you want to give memory.group_oom. It matches perfectly, as the description says that the kernel will look for the most high-level cgroup with group_oom set up to the OOM domain. Here B is oom domain, so A's settings are irrelevant. > > > > > - B/memory.group_oom=1: all processes in B are killed > > > > > > - B/memory.group_oom=0 && > > > > - D/memory.group_oom=1: all processes in D are killed > > > > > > What about? 
> > > - B/memory.group_oom=1 && D/memory.group_oom=0 > > > > All tasks in B are killed. > > > > Group_oom set to 1 means that the workload can't tolerate > > killing of a random process, so in this case it's better > > to guarantee consistency for B. > > > > This example is missing the usecase that I was referring to, i.e. killing > all processes attached to a subtree because the limit on a common ancestor > has been reached. > > In your example, I would think that the memory.group_oom setting of /A and > /A/B are meaningless because there are no processes attached to them. > > IIUC, your proposal is to select the victim by whatever means, check the > memory.group_oom setting of that victim, and then either kill the victim > or all processes attached to that local mem cgroup depending on the > setting. Sorry, I don't get what you are saying. In cgroup v2 processes can't be attached to A and B. There is no such thing as "local mem cgroup" at all. Thanks! ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-17 20:52 ` Roman Gushchin @ 2018-07-20 8:30 ` David Rientjes 2018-07-20 11:21 ` Tejun Heo 0 siblings, 1 reply; 52+ messages in thread From: David Rientjes @ 2018-07-20 8:30 UTC (permalink / raw) To: Roman Gushchin; +Cc: Michal Hocko, linux-mm, akpm, hannes, tj, gthelen On Tue, 17 Jul 2018, Roman Gushchin wrote: > > This example is missing the usecase that I was referring to, i.e. killing > > all processes attached to a subtree because the limit on a common ancestor > > has been reached. > > > > In your example, I would think that the memory.group_oom setting of /A and > > /A/B are meaningless because there are no processes attached to them. > > > > IIUC, your proposal is to select the victim by whatever means, check the > > memory.group_oom setting of that victim, and then either kill the victim > > or all processes attached to that local mem cgroup depending on the > > setting. > > Sorry, I don't get what you are saying. > In cgroup v2 processes can't be attached to A and B. > There is no such thing as "local mem cgroup" at all. > Read the second paragraph, yes, there are no processes attached to either mem cgroup. I'm saying "group oom" can take on two different meanings: one for the behavior when the mem cgroup reaches its limit (a direct ancestor with no processes attached) and one for the mem cgroup of the process chosen for oom kill. I know that you care about the latter. My *only* suggestion was for the tunable to take a string instead of a boolean so it is extensible for future use. This seems like something so trivial. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-20 8:30 ` David Rientjes @ 2018-07-20 11:21 ` Tejun Heo 2018-07-20 16:13 ` Roman Gushchin 2018-07-20 20:28 ` David Rientjes 0 siblings, 2 replies; 52+ messages in thread From: Tejun Heo @ 2018-07-20 11:21 UTC (permalink / raw) To: David Rientjes Cc: Roman Gushchin, Michal Hocko, linux-mm, akpm, hannes, gthelen Hello, On Fri, Jul 20, 2018 at 01:30:00AM -0700, David Rientjes wrote: ... > process chosen for oom kill. I know that you care about the latter. My > *only* suggestion was for the tunable to take a string instead of a > boolean so it is extensible for future use. This seems like something so > trivial. So, I'd much prefer it as boolean. It's a fundamentally binary property, either handle the cgroup as a unit when chosen as oom victim or not, nothing more. I don't see the (interface-wise) benefits of preparing for further oom policy extensions. If that happens, it should be through a separate interface file. The number of files isn't the most important criteria interface is designed on. Roman, can you rename it tho to memory.oom.group? That's how other interface files are scoped and it'd be better if we try to add further oom related interface files later. Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-20 11:21 ` Tejun Heo @ 2018-07-20 16:13 ` Roman Gushchin 2018-07-20 20:28 ` David Rientjes 1 sibling, 0 replies; 52+ messages in thread From: Roman Gushchin @ 2018-07-20 16:13 UTC (permalink / raw) To: Tejun Heo; +Cc: David Rientjes, Michal Hocko, linux-mm, akpm, hannes, gthelen On Fri, Jul 20, 2018 at 04:21:31AM -0700, Tejun Heo wrote: > Hello, > > On Fri, Jul 20, 2018 at 01:30:00AM -0700, David Rientjes wrote: > ... > > process chosen for oom kill. I know that you care about the latter. My > > *only* suggestion was for the tunable to take a string instead of a > > boolean so it is extensible for future use. This seems like something so > > trivial. > > So, I'd much prefer it as boolean. It's a fundamentally binary > property, either handle the cgroup as a unit when chosen as oom victim > or not, nothing more. I don't see the (interface-wise) benefits of > preparing for further oom policy extensions. If that happens, it > should be through a separate interface file. The number of files > isn't the most important criteria interface is designed on. > > Roman, can you rename it tho to memory.oom.group? That's how other > interface files are scoped and it'd be better if we try to add further > oom related interface files later. Yes, sure, this looks good to me. Thanks! ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-20 11:21 ` Tejun Heo 2018-07-20 16:13 ` Roman Gushchin @ 2018-07-20 20:28 ` David Rientjes 2018-07-20 20:47 ` Roman Gushchin 2018-07-23 14:12 ` Michal Hocko 1 sibling, 2 replies; 52+ messages in thread From: David Rientjes @ 2018-07-20 20:28 UTC (permalink / raw) To: Tejun Heo; +Cc: Roman Gushchin, Michal Hocko, linux-mm, akpm, hannes, gthelen On Fri, 20 Jul 2018, Tejun Heo wrote: > > process chosen for oom kill. I know that you care about the latter. My > > *only* suggestion was for the tunable to take a string instead of a > > boolean so it is extensible for future use. This seems like something so > > trivial. > > So, I'd much prefer it as boolean. It's a fundamentally binary > property, either handle the cgroup as a unit when chosen as oom victim > or not, nothing more. With the single hierarchy mandate of cgroup v2, the need arises to separate processes from a single job into subcontainers for use with controllers other than mem cgroup. In that case, we have no functionality to oom kill all processes in the subtree. A boolean can kill all processes attached to the victim's mem cgroup, but cannot kill all processes in a subtree if the limit of a common ancestor is reached. The common ancestor is needed to enforce a single memory limit but allow for processes to be constrained separately with other controllers. So if group oom takes on a boolean type, then we mandate that all processes to be killed must share the same cgroup which cannot always be done. Thus, I was suggesting that group oom can also configure for subtree killing when the limit of a shared ancestor is reached. This is unique only to non-leaf cgroups. So non-leaf and leaf cgroups have mutually exclusive group oom settings; if we have two tunables, which this would otherwise require, the setting of one would always be irrelevant based on non-leaf or leaf. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-20 20:28 ` David Rientjes @ 2018-07-20 20:47 ` Roman Gushchin 2018-07-23 23:06 ` David Rientjes 0 siblings, 1 reply; 52+ messages in thread From: Roman Gushchin @ 2018-07-20 20:47 UTC (permalink / raw) To: David Rientjes; +Cc: Tejun Heo, Michal Hocko, linux-mm, akpm, hannes, gthelen On Fri, Jul 20, 2018 at 01:28:56PM -0700, David Rientjes wrote: > On Fri, 20 Jul 2018, Tejun Heo wrote: > > > > > process chosen for oom kill. I know that you care about the latter. My > > > *only* suggestion was for the tunable to take a string instead of a > > > boolean so it is extensible for future use. This seems like something so > > > trivial. > > > > So, I'd much prefer it as boolean. It's a fundamentally binary > > property, either handle the cgroup as a unit when chosen as oom victim > > or not, nothing more. > > With the single hierarchy mandate of cgroup v2, the need arises to > separate processes from a single job into subcontainers for use with > controllers other than mem cgroup. In that case, we have no functionality > to oom kill all processes in the subtree. > > A boolean can kill all processes attached to the victim's mem cgroup, but > cannot kill all processes in a subtree if the limit of a common ancestor > is reached. Why so? Once again my proposal: as soon as the OOM killer has selected a victim task, we'll look at the victim task's memory cgroup. If memory.oom.group is not set, we're done. Otherwise let's traverse the memory cgroup tree up to the OOMing cgroup (or root) as long as memory.oom.group is set. Kill the last cgroup entirely (including all children). Please note: we do not look at memory.oom.group of the OOMing cgroup, we're looking at the memcg of the victim task. If this model doesn't work well for your case, please describe it with an example. I'm not sure I understand your problem anymore. Thanks! ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-20 20:47 ` Roman Gushchin @ 2018-07-23 23:06 ` David Rientjes 0 siblings, 0 replies; 52+ messages in thread From: David Rientjes @ 2018-07-23 23:06 UTC (permalink / raw) To: Roman Gushchin; +Cc: Tejun Heo, Michal Hocko, linux-mm, akpm, hannes, gthelen On Fri, 20 Jul 2018, Roman Gushchin wrote: > > > > process chosen for oom kill. I know that you care about the latter. My > > > > *only* suggestion was for the tunable to take a string instead of a > > > > boolean so it is extensible for future use. This seems like something so > > > > trivial. > > > > > > So, I'd much prefer it as boolean. It's a fundamentally binary > > > property, either handle the cgroup as a unit when chosen as oom victim > > > or not, nothing more. > > > > With the single hierarchy mandate of cgroup v2, the need arises to > > separate processes from a single job into subcontainers for use with > > controllers other than mem cgroup. In that case, we have no functionality > > to oom kill all processes in the subtree. > > > > A boolean can kill all processes attached to the victim's mem cgroup, but > > cannot kill all processes in a subtree if the limit of a common ancestor > > is reached. > > Why so? > > Once again my proposal: > as soon as the OOM killer selected a victim task, > we'll look at the victim task's memory cgroup. > If memory.oom.group is not set, we're done. > Otherwise let's traverse the memory cgroup tree up to > the OOMing cgroup (or root) as long as memory.oom.group is set. > Kill the last cgroup entirely (including all children). > I know this is your proposal, I'm suggesting a context-based extension based on which mem cgroup is oom: the common ancestor or the leaf. Consider /A, /A/b, and /A/c, and memory.oom_group is 1 for all of them. When /A, /A/b, or /A/c is oom, all processes attached to /A and its subtree are oom killed per your semantic. That occurs when any of the three mem cgroups are oom. 
I'm suggesting that it may become useful to kill an entire subtree when the common ancestor, /A, is oom, but not when /A/b or /A/c is oom. There is no way to specify this with the proposal for trees where the limits satisfy /A/b + /A/c > /A. We want all processes killed in /A/b or /A/c if they reach their individual limits. We want all processes killed in /A's subtree if /A reaches its limit. I am not asking for that support to be implemented immediately if you do not have a need for it. But I am asking that your interface be extensible so that we may implement it. Given the no internal process constraint of cgroup v2, defining this as two separate tunables would always have one be effective and the other be irrelevant, so I suggest it is overloaded. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-20 20:28 ` David Rientjes 2018-07-20 20:47 ` Roman Gushchin @ 2018-07-23 14:12 ` Michal Hocko 1 sibling, 0 replies; 52+ messages in thread From: Michal Hocko @ 2018-07-23 14:12 UTC (permalink / raw) To: David Rientjes; +Cc: Tejun Heo, Roman Gushchin, linux-mm, akpm, hannes, gthelen On Fri 20-07-18 13:28:56, David Rientjes wrote: > On Fri, 20 Jul 2018, Tejun Heo wrote: > > > > process chosen for oom kill. I know that you care about the latter. My > > > *only* suggestion was for the tunable to take a string instead of a > > > boolean so it is extensible for future use. This seems like something so > > > trivial. > > > > So, I'd much prefer it as boolean. It's a fundamentally binary > > property, either handle the cgroup as a unit when chosen as oom victim > > or not, nothing more. > > With the single hierarchy mandate of cgroup v2, the need arises to > separate processes from a single job into subcontainers for use with > controllers other than mem cgroup. In that case, we have no functionality > to oom kill all processes in the subtree. > > A boolean can kill all processes attached to the victim's mem cgroup, but > cannot kill all processes in a subtree if the limit of a common ancestor > is reached. The common ancestor is needed to enforce a single memory > limit but allow for processes to be constrained separately with other > controllers. I think you misunderstood the proposed semantic. oom.group is a property of any (including intermediate) memcg. Once set, all the processes in its domain are killed in one go because they are considered an indivisible workload. Note how this doesn't say anything about _how_ we select a victim. That is not important and is in fact an implementation detail. All we care about is that a selected victim is a part of an indivisible workload and we have to tear down all of it.
Future extensions can talk more about how we select the victim, but the fundamental property of a group being an indivisible workload or a group of semi-related processes is a 0/1, IMHO. Now there still are questions to iron out for that model. E.g. should we allow a subtree of an oom.group == 1 cgroup to be group == 0? In other words, something would be an indivisible workload for one OOM context while it is not for a more restrictive OOM scope. If yes, then what is the usecase? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-17 20:41             ` David Rientjes
  2018-07-17 20:52               ` Roman Gushchin
@ 2018-07-18  8:19               ` Michal Hocko
  1 sibling, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-18 8:19 UTC (permalink / raw)
  To: David Rientjes; +Cc: Roman Gushchin, linux-mm, akpm, hannes, tj, gthelen

On Tue 17-07-18 13:41:33, David Rientjes wrote:
[...]
> Thus, the semantic would be: if oom mem cgroup is "tree", kill all
> processes in subtree; otherwise, it can be "cgroup" or "process" to
> determine what is oom killed depending on the victim selection.

Why should an intermediate node be any different from the leaf? If you
want to tear down the whole subtree, just make it oom_cgroup = true and
be done with that. Why do we even need to call it tree?

> Having the "tree" behavior could definitely be implemented as a separate
> tunable; but then the values of /A/memory.group_oom and
> /A/B/memory.group_oom are irrelevant and, to me, it seems like it would
> be more confusing.

I am sorry, I do not follow. How are the following two different?

	A (tree)	A (group)
	|		|
	B (tree)	B (group)
	|		|
	C (process)	C (group=false)
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-17 20:06 ` Roman Gushchin 2018-07-17 20:41 ` David Rientjes @ 2018-07-18 8:12 ` Michal Hocko 2018-07-18 15:28 ` Roman Gushchin 1 sibling, 1 reply; 52+ messages in thread From: Michal Hocko @ 2018-07-18 8:12 UTC (permalink / raw) To: Roman Gushchin; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen On Tue 17-07-18 13:06:42, Roman Gushchin wrote: > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote: > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote: > > [...] > > > Let me show my proposal on examples. Let's say we have the following hierarchy, > > > and the biggest process (or the process with highest oom_score_adj) is in D. > > > > > > / > > > | > > > A > > > | > > > B > > > / \ > > > C D > > > > > > Let's look at different examples and intended behavior: > > > 1) system-wide OOM > > > - default settings: the biggest process is killed > > > - D/memory.group_oom=1: all processes in D are killed > > > - A/memory.group_oom=1: all processes in A are killed > > > 2) memcg oom in B > > > - default settings: the biggest process is killed > > > - A/memory.group_oom=1: the biggest process is killed > > > > Huh? Why would you even consider A here when the oom is below it? > > /me confused > > I do not. > This is exactly a counter-example: A's memory.group_oom > is not considered at all in this case, > because A is above ooming cgroup. OK, it confused me. > > > > > - B/memory.group_oom=1: all processes in B are killed > > > > - B/memory.group_oom=0 && > > > - D/memory.group_oom=1: all processes in D are killed > > > > What about? > > - B/memory.group_oom=1 && D/memory.group_oom=0 > > All tasks in B are killed. so essentially find a task, traverse the memcg hierarchy from the victim's memcg up to the oom root as long as memcg.group_oom = 1? If the resulting memcg.group_oom == 1 then kill the whole sub tree. Right? 
> Group_oom set to 1 means that the workload can't tolerate > killing of a random process, so in this case it's better > to guarantee consistency for B. OK, but then if D itself is OOM we do not care about consistency all of a sudden? I have a hard time thinking of a sensible usecase. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-18 8:12 ` Michal Hocko @ 2018-07-18 15:28 ` Roman Gushchin 2018-07-19 7:38 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Roman Gushchin @ 2018-07-18 15:28 UTC (permalink / raw) To: Michal Hocko; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote: > On Tue 17-07-18 13:06:42, Roman Gushchin wrote: > > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote: > > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote: > > > [...] > > > > Let me show my proposal on examples. Let's say we have the following hierarchy, > > > > and the biggest process (or the process with highest oom_score_adj) is in D. > > > > > > > > / > > > > | > > > > A > > > > | > > > > B > > > > / \ > > > > C D > > > > > > > > Let's look at different examples and intended behavior: > > > > 1) system-wide OOM > > > > - default settings: the biggest process is killed > > > > - D/memory.group_oom=1: all processes in D are killed > > > > - A/memory.group_oom=1: all processes in A are killed > > > > 2) memcg oom in B > > > > - default settings: the biggest process is killed > > > > - A/memory.group_oom=1: the biggest process is killed > > > > > > Huh? Why would you even consider A here when the oom is below it? > > > /me confused > > > > I do not. > > This is exactly a counter-example: A's memory.group_oom > > is not considered at all in this case, > > because A is above ooming cgroup. > > OK, it confused me. > > > > > > > > - B/memory.group_oom=1: all processes in B are killed > > > > > > - B/memory.group_oom=0 && > > > > - D/memory.group_oom=1: all processes in D are killed > > > > > > What about? > > > - B/memory.group_oom=1 && D/memory.group_oom=0 > > > > All tasks in B are killed. > > so essentially find a task, traverse the memcg hierarchy from the > victim's memcg up to the oom root as long as memcg.group_oom = 1? 
> If the resulting memcg.group_oom == 1 then kill the whole sub tree. > Right? Yes. > > > Group_oom set to 1 means that the workload can't tolerate > > killing of a random process, so in this case it's better > > to guarantee consistency for B. > > OK, but then if D itself is OOM we do not care about consistency > all of a sudden? I have a hard time thinking of a sensible usecase. I mean that if, while traversing the hierarchy up to the oom root, we meet a memcg with group_oom set to 0, we shouldn't stop traversing. Thanks! ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-18 15:28 ` Roman Gushchin @ 2018-07-19 7:38 ` Michal Hocko 2018-07-19 17:05 ` Roman Gushchin 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2018-07-19 7:38 UTC (permalink / raw) To: Roman Gushchin; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen On Wed 18-07-18 08:28:50, Roman Gushchin wrote: > On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote: > > On Tue 17-07-18 13:06:42, Roman Gushchin wrote: > > > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote: > > > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote: > > > > [...] > > > > > Let me show my proposal on examples. Let's say we have the following hierarchy, > > > > > and the biggest process (or the process with highest oom_score_adj) is in D. > > > > > > > > > > / > > > > > | > > > > > A > > > > > | > > > > > B > > > > > / \ > > > > > C D > > > > > > > > > > Let's look at different examples and intended behavior: > > > > > 1) system-wide OOM > > > > > - default settings: the biggest process is killed > > > > > - D/memory.group_oom=1: all processes in D are killed > > > > > - A/memory.group_oom=1: all processes in A are killed > > > > > 2) memcg oom in B > > > > > - default settings: the biggest process is killed > > > > > - A/memory.group_oom=1: the biggest process is killed > > > > > > > > Huh? Why would you even consider A here when the oom is below it? > > > > /me confused > > > > > > I do not. > > > This is exactly a counter-example: A's memory.group_oom > > > is not considered at all in this case, > > > because A is above ooming cgroup. > > > > OK, it confused me. > > > > > > > > > > > - B/memory.group_oom=1: all processes in B are killed > > > > > > > > - B/memory.group_oom=0 && > > > > > - D/memory.group_oom=1: all processes in D are killed > > > > > > > > What about? > > > > - B/memory.group_oom=1 && D/memory.group_oom=0 > > > > > > All tasks in B are killed. 
> > so essentially find a task, traverse the memcg hierarchy from the > > victim's memcg up to the oom root as long as memcg.group_oom = 1? > > If the resulting memcg.group_oom == 1 then kill the whole sub tree. > > Right? > > Yes. > > > > > > Group_oom set to 1 means that the workload can't tolerate > > > killing of a random process, so in this case it's better > > > to guarantee consistency for B. > > > > OK, but then if D itself is OOM we do not care about consistency > > all of a sudden? I have a hard time thinking of a sensible usecase. > > I mean that if, while traversing the hierarchy up to the oom root, we meet > a memcg with group_oom set to 0, we shouldn't stop traversing. Well, I am still struggling with the semantics of a group, no-group, group configuration. Why does it make any sense? In other words, when can we consider a cgroup to be an indivisible workload for one oom context while it is fine for it to lose a head or an arm in another? Anyway, your previous implementation would allow the same configuration as well, so this is nothing really new. The new selection policy you are proposing just makes it more obvious. So that doesn't mean this is a roadblock, but I think we should be really thinking hard about this. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-19 7:38 ` Michal Hocko @ 2018-07-19 17:05 ` Roman Gushchin 2018-07-20 8:32 ` David Rientjes 2018-07-23 14:17 ` Michal Hocko 0 siblings, 2 replies; 52+ messages in thread From: Roman Gushchin @ 2018-07-19 17:05 UTC (permalink / raw) To: Michal Hocko; +Cc: David Rientjes, linux-mm, akpm, hannes, tj, gthelen On Thu, Jul 19, 2018 at 09:38:43AM +0200, Michal Hocko wrote: > On Wed 18-07-18 08:28:50, Roman Gushchin wrote: > > On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote: > > > On Tue 17-07-18 13:06:42, Roman Gushchin wrote: > > > > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote: > > > > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote: > > > > > [...] > > > > > > Let me show my proposal on examples. Let's say we have the following hierarchy, > > > > > > and the biggest process (or the process with highest oom_score_adj) is in D. > > > > > > > > > > > > / > > > > > > | > > > > > > A > > > > > > | > > > > > > B > > > > > > / \ > > > > > > C D > > > > > > > > > > > > Let's look at different examples and intended behavior: > > > > > > 1) system-wide OOM > > > > > > - default settings: the biggest process is killed > > > > > > - D/memory.group_oom=1: all processes in D are killed > > > > > > - A/memory.group_oom=1: all processes in A are killed > > > > > > 2) memcg oom in B > > > > > > - default settings: the biggest process is killed > > > > > > - A/memory.group_oom=1: the biggest process is killed > > > > > > > > > > Huh? Why would you even consider A here when the oom is below it? > > > > > /me confused > > > > > > > > I do not. > > > > This is exactly a counter-example: A's memory.group_oom > > > > is not considered at all in this case, > > > > because A is above ooming cgroup. > > > > > > OK, it confused me. 
> > > > > > > > > - B/memory.group_oom=1: all processes in B are killed > > > > > > > > > > - B/memory.group_oom=0 && > > > > > > - D/memory.group_oom=1: all processes in D are killed > > > > > > > > > > What about? > > > > > - B/memory.group_oom=1 && D/memory.group_oom=0 > > > > > > > > All tasks in B are killed. > > > > > > so essentially find a task, traverse the memcg hierarchy from the > > > victim's memcg up to the oom root as long as memcg.group_oom = 1? > > > If the resulting memcg.group_oom == 1 then kill the whole sub tree. > > > Right? > > > > Yes. > > > > > > > > > Group_oom set to 1 means that the workload can't tolerate > > > > killing of a random process, so in this case it's better > > > > to guarantee consistency for B. > > > > > > OK, but then if D itself is OOM we do not care about consistency > > > all of a sudden? I have a hard time thinking of a sensible usecase. > > > > I mean that if, while traversing the hierarchy up to the oom root, we meet > > a memcg with group_oom set to 0, we shouldn't stop traversing. > > Well, I am still struggling with the semantics of a group, no-group, group > configuration. Why does it make any sense? In other words, when can we > consider a cgroup to be an indivisible workload for one oom context while > it is fine for it to lose a head or an arm in another? Hm, so the question is: should we traverse up to the OOMing cgroup, or up to the first cgroup with memory.group_oom=0? I looked at an example, and it *might* be that the latter is better, especially if we make the default value inheritable. Let's say we have a sub-tree with a workload and some control stuff. The workload is tolerant of OOMs (we can handle them in userspace, for example), but the control stuff is not. Then it probably makes no sense to kill the entire sub-tree if a task in C has to be killed. But it makes perfect sense if we have to kill a task in B.
/ | A, delegated sub-tree, group_oom=1 / \ B C, workload, group_oom=0 ^ some control stuff here, group_oom=1 Does this make sense? > Anyway, your previous implementation would allow the same configuration > as well, so this is nothing really new. The new selection policy you are > proposing just makes it more obvious. So that doesn't mean this is a > roadblock but I think we should be really thinking hard about this. I agree. Thanks! ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-19 17:05 ` Roman Gushchin @ 2018-07-20 8:32 ` David Rientjes 2018-07-23 14:17 ` Michal Hocko 1 sibling, 0 replies; 52+ messages in thread From: David Rientjes @ 2018-07-20 8:32 UTC (permalink / raw) To: Roman Gushchin; +Cc: Michal Hocko, linux-mm, akpm, hannes, tj, gthelen On Thu, 19 Jul 2018, Roman Gushchin wrote: > Hm, so the question is: should we traverse up to the OOMing cgroup, > or up to the first cgroup with memory.group_oom=0? > > I looked at an example, and it *might* be that the latter is better, > especially if we'll make the default value inheritable. > > Let's say we have a sub-tree with a workload and some control stuff. > The workload is tolerant of OOMs (we can handle it in userspace, for > example), but the control stuff is not. > Then it probably makes no sense to kill the entire sub-tree > if a task in C has to be killed. But it makes perfect sense if we > have to kill a task in B. > > / > | > A, delegated sub-tree, group_oom=1 > / \ > B C, workload, group_oom=0 > ^ > some control stuff here, group_oom=1 > > Does this make sense? > The *only* suggestion here was that memory.group_oom take a string instead of a boolean value so that it can be extended later, especially if introducing another tunable is problematic because it clashes with the semantics of this one. This is *so* trivial to do. Anything that is going to care about setting up cgroup oom killing will have no problems writing a string instead of an integer. I'm asking that you don't back the interface into a corner where extending it later is problematic. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-19 17:05 ` Roman Gushchin 2018-07-20 8:32 ` David Rientjes @ 2018-07-23 14:17 ` Michal Hocko 2018-07-23 15:09 ` Tejun Heo 1 sibling, 1 reply; 52+ messages in thread From: Michal Hocko @ 2018-07-23 14:17 UTC (permalink / raw) To: Roman Gushchin, hannes, tj; +Cc: David Rientjes, linux-mm, akpm, gthelen On Thu 19-07-18 10:05:47, Roman Gushchin wrote: > On Thu, Jul 19, 2018 at 09:38:43AM +0200, Michal Hocko wrote: > > On Wed 18-07-18 08:28:50, Roman Gushchin wrote: > > > On Wed, Jul 18, 2018 at 10:12:30AM +0200, Michal Hocko wrote: > > > > On Tue 17-07-18 13:06:42, Roman Gushchin wrote: > > > > > On Tue, Jul 17, 2018 at 09:49:46PM +0200, Michal Hocko wrote: > > > > > > On Tue 17-07-18 10:38:45, Roman Gushchin wrote: > > > > > > [...] > > > > > > > Let me show my proposal on examples. Let's say we have the following hierarchy, > > > > > > > and the biggest process (or the process with highest oom_score_adj) is in D. > > > > > > > > > > > > > > / > > > > > > > | > > > > > > > A > > > > > > > | > > > > > > > B > > > > > > > / \ > > > > > > > C D > > > > > > > > > > > > > > Let's look at different examples and intended behavior: > > > > > > > 1) system-wide OOM > > > > > > > - default settings: the biggest process is killed > > > > > > > - D/memory.group_oom=1: all processes in D are killed > > > > > > > - A/memory.group_oom=1: all processes in A are killed > > > > > > > 2) memcg oom in B > > > > > > > - default settings: the biggest process is killed > > > > > > > - A/memory.group_oom=1: the biggest process is killed > > > > > > > > > > > > Huh? Why would you even consider A here when the oom is below it? > > > > > > /me confused > > > > > > > > > > I do not. > > > > > This is exactly a counter-example: A's memory.group_oom > > > > > is not considered at all in this case, > > > > > because A is above ooming cgroup. > > > > > > > > OK, it confused me. 
> > > > > > > > > > > > > > > > > - B/memory.group_oom=1: all processes in B are killed > > > > > > > > > > > > - B/memory.group_oom=0 && > > > > > > > - D/memory.group_oom=1: all processes in D are killed > > > > > > > > > > > > What about? > > > > > > - B/memory.group_oom=1 && D/memory.group_oom=0 > > > > > > > > > > All tasks in B are killed. > > > > > > > > so essentially find a task, traverse the memcg hierarchy from the > > > > victim's memcg up to the oom root as long as memcg.group_oom = 1? > > > > If the resulting memcg.group_oom == 1 then kill the whole sub tree. > > > > Right? > > > > > > Yes. > > > > > > > > > > > > Group_oom set to 1 means that the workload can't tolerate > > > > > killing of a random process, so in this case it's better > > > > > to guarantee consistency for B. > > > > > > > > OK, but then if D itself is OOM then we do not care about consistency > > > > all of the sudden? I have hard time to think about a sensible usecase. > > > > > > I mean if traversing the hierarchy up to the oom root we meet > > > a memcg with group_oom set to 0, we shouldn't stop traversing. > > > > Well, I am still fighting with the semantic of group, no-group, group > > configuration. Why does it make any sense? In other words when can we > > consider a cgroup to be a indivisible workload for one oom context while > > it is fine to lose head or arm from another? > > Hm, so the question is should we traverse up to the OOMing cgroup, > or up to the first cgroup with memory.group_oom=0? > > I looked at an example, and it *might* be the latter is better, > especially if we'll make the default value inheritable. > > Let's say we have a sub-tree with a workload and some control stuff. > Workload is tolerable to OOM's (we can handle it in userspace, for > example), but the control stuff is not. > Then it probably makes no sense to kill the entire sub-tree, > if a task in C has to be killed. But makes perfect sense if we > have to kill a task in B. 
> > / > | > A, delegated sub-tree, group_oom=1 > / \ > B C, workload, group_oom=0 > ^ > some control stuff here, group_oom=1 > > Does this make sense? I am not sure. If you are going to delegate then you are basically losing control of the group_oom at A-level. Is this good? What if I _want_ to tear down the whole thing if it starts misbehaving because I do not trust it? The more I think about it the more I am concluding that we should start with a more constrained model and require that once the parent has group_oom == 1 then the children have to as well. If we ever find a usecase that requires a different scheme, we can weaken it later. We cannot do that the other way around. Tejun, Johannes, what do you think about that? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-23 14:17 ` Michal Hocko @ 2018-07-23 15:09 ` Tejun Heo 2018-07-24 7:32 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Tejun Heo @ 2018-07-23 15:09 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen Hello, On Mon, Jul 23, 2018 at 04:17:48PM +0200, Michal Hocko wrote: > I am not sure. If you are going to delegate then you are basically > losing control of the group_oom at A-level. Is this good? What if I > _want_ to tear down the whole thing if it starts misbehaving because I > do not trust it? > > The more I think about it the more I am concluding that we should start > with a more constrained model and require that once the parent has > group_oom == 1 then the children have to as well. If we ever find a usecase > that requires a different scheme, we can weaken it later. We cannot do that > the other way around. > > Tejun, Johannes, what do you think about that? I'd find the cgroup closest to the root which has the oom group set and kill the entire subtree. There's no reason to put any restrictions on what each cgroup can configure. The only thing which matters is that the effective behavior is what the highest in the ancestry configures, and, at the system level, it'd conceptually map to panic_on_oom. Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-23 15:09 ` Tejun Heo @ 2018-07-24 7:32 ` Michal Hocko 2018-07-24 13:08 ` Tejun Heo 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2018-07-24 7:32 UTC (permalink / raw) To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen On Mon 23-07-18 08:09:29, Tejun Heo wrote: > Hello, > > On Mon, Jul 23, 2018 at 04:17:48PM +0200, Michal Hocko wrote: > > I am not sure. If you are going to delegate then you are basically > > losing control of the group_oom at A-level. Is this good? What if I > > _want_ to tear down the whole thing if it starts misbehaving because I > > do not trust it? > > > > The more I think about it the more I am concluding that we should start > > with a more contrained model and require that once parent is > > group_oom == 1 then children have to as well. If we ever find a usecase > > to require a different scheme we can weaker it later. We cannot do that > > other way around. > > > > Tejun, Johannes what do you think about that? > > I'd find the cgroup closest to the root which has the oom group set > and kill the entire subtree. Yes, this is what we have been discussing. In fact it would match the behavior which is still sitting in the mmotm tree where we compare groups. > There's no reason to put any > restrictions on what each cgroup can configure. The only thing which > matters is is that the effective behavior is what the highest in the > ancestry configures, and, at the system level, it'd conceptually map > to panic_on_oom. Hmm, so do we inherit group_oom? If not, how do we prevent from unexpected behavior? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 7:32 ` Michal Hocko @ 2018-07-24 13:08 ` Tejun Heo 2018-07-24 13:26 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Tejun Heo @ 2018-07-24 13:08 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen Hello, On Tue, Jul 24, 2018 at 09:32:30AM +0200, Michal Hocko wrote: > > I'd find the cgroup closest to the root which has the oom group set > > and kill the entire subtree. > > Yes, this is what we have been discussing. In fact it would match the > behavior which is still sitting in the mmotm tree where we compare > groups. Yeah, I'd too. Everyone except David seems to agree that that's a good enough approach for now. > > There's no reason to put any > > restrictions on what each cgroup can configure. The only thing which > > matters is is that the effective behavior is what the highest in the > > ancestry configures, and, at the system level, it'd conceptually map > > to panic_on_oom. > > Hmm, so do we inherit group_oom? If not, how do we prevent from > unexpected behavior? Hmm... I guess we're debating two options here. Please consider the following hierarchy. R | A (group oom == 1) / \ B C | D 1. No matter what B, C or D sets, as long as A sets group oom, any oom kill inside A's subtree kills the entire subtree. 2. A's group oom policy applies iff the source of the OOM is either at or above A - ie. iff the OOM is system-wide or caused by memory.max of A. In #1, it doesn't matter what B, C or D sets, so it's kinda moot to discuss whether they inherit A's setting or not. A's is, if set, always overriding. In #2, what B, C or D sets matters if they also set their own memory.max, so there's no reason for them to inherit anything. I'm actually okay with either option. #2 is more flexible than #1 but given that this is a cgroup owned property which is likely to be set on per-application basis, #1 is likely good enough. 
IIRC, we did #2 in the original implementation and the simplified one is doing #1, right? Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 13:08 ` Tejun Heo @ 2018-07-24 13:26 ` Michal Hocko 2018-07-24 13:31 ` Tejun Heo 2018-07-30 8:03 ` Michal Hocko 0 siblings, 2 replies; 52+ messages in thread From: Michal Hocko @ 2018-07-24 13:26 UTC (permalink / raw) To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen On Tue 24-07-18 06:08:36, Tejun Heo wrote: > Hello, > > On Tue, Jul 24, 2018 at 09:32:30AM +0200, Michal Hocko wrote: [...] > > > There's no reason to put any > > > restrictions on what each cgroup can configure. The only thing which > > > matters is is that the effective behavior is what the highest in the > > > ancestry configures, and, at the system level, it'd conceptually map > > > to panic_on_oom. > > > > Hmm, so do we inherit group_oom? If not, how do we prevent from > > unexpected behavior? > > Hmm... I guess we're debating two options here. Please consider the > following hierarchy. > > R > | > A (group oom == 1) > / \ > B C > | > D > > 1. No matter what B, C or D sets, as long as A sets group oom, any oom > kill inside A's subtree kills the entire subtree. > > 2. A's group oom policy applies iff the source of the OOM is either at > or above A - ie. iff the OOM is system-wide or caused by memory.max > of A. > > In #1, it doesn't matter what B, C or D sets, so it's kinda moot to > discuss whether they inherit A's setting or not. A's is, if set, > always overriding. In #2, what B, C or D sets matters if they also > set their own memory.max, so there's no reason for them to inherit > anything. > > I'm actually okay with either option. #2 is more flexible than #1 but > given that this is a cgroup owned property which is likely to be set > on per-application basis, #1 is likely good enough. > > IIRC, we did #2 in the original implementation and the simplified one > is doing #1, right? No, we've been discussing #2 unless I have misunderstood something. 
I find it rather non-intuitive that a property outside of the oom domain controls the behavior inside the domain. I will keep thinking about that though. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 13:26 ` Michal Hocko @ 2018-07-24 13:31 ` Tejun Heo 2018-07-24 13:50 ` Michal Hocko 2018-07-30 8:03 ` Michal Hocko 1 sibling, 1 reply; 52+ messages in thread From: Tejun Heo @ 2018-07-24 13:31 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen Hello, On Tue, Jul 24, 2018 at 03:26:40PM +0200, Michal Hocko wrote: > > 1. No matter what B, C or D sets, as long as A sets group oom, any oom > > kill inside A's subtree kills the entire subtree. > > > > 2. A's group oom policy applies iff the source of the OOM is either at > > or above A - ie. iff the OOM is system-wide or caused by memory.max > > of A. > > > > In #1, it doesn't matter what B, C or D sets, so it's kinda moot to > > discuss whether they inherit A's setting or not. A's is, if set, > > always overriding. In #2, what B, C or D sets matters if they also > > set their own memory.max, so there's no reason for them to inherit > > anything. > > > > I'm actually okay with either option. #2 is more flexible than #1 but > > given that this is a cgroup owned property which is likely to be set > > on per-application basis, #1 is likely good enough. > > > > IIRC, we did #2 in the original implementation and the simplified one > > is doing #1, right? > > No, we've been discussing #2 unless I have misunderstood something. > I find it rather non-intuitive that a property outside of the oom domain > controls the behavior inside the domain. I will keep thinking about that > though. So, one good way of thinking about this, I think, could be considering it as a scoped panic_on_oom. However panic_on_oom interacts with memcg ooms, scoping that to cgroup level should likely be how we define group oom. Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 13:31 ` Tejun Heo @ 2018-07-24 13:50 ` Michal Hocko 2018-07-24 13:55 ` Tejun Heo 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2018-07-24 13:50 UTC (permalink / raw) To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen On Tue 24-07-18 06:31:10, Tejun Heo wrote: > Hello, > > On Tue, Jul 24, 2018 at 03:26:40PM +0200, Michal Hocko wrote: > > > 1. No matter what B, C or D sets, as long as A sets group oom, any oom > > > kill inside A's subtree kills the entire subtree. > > > > > > 2. A's group oom policy applies iff the source of the OOM is either at > > > or above A - ie. iff the OOM is system-wide or caused by memory.max > > > of A. > > > > > > In #1, it doesn't matter what B, C or D sets, so it's kinda moot to > > > discuss whether they inherit A's setting or not. A's is, if set, > > > always overriding. In #2, what B, C or D sets matters if they also > > > set their own memory.max, so there's no reason for them to inherit > > > anything. > > > > > > I'm actually okay with either option. #2 is more flexible than #1 but > > > given that this is a cgroup owned property which is likely to be set > > > on per-application basis, #1 is likely good enough. > > > > > > IIRC, we did #2 in the original implementation and the simplified one > > > is doing #1, right? > > > > No, we've been discussing #2 unless I have misunderstood something. > > I find it rather non-intuitive that a property outside of the oom domain > > controls the behavior inside the domain. I will keep thinking about that > > though. > > So, one good way of thinking about this, I think, could be considering > it as a scoped panic_on_oom. However panic_on_oom interacts with > memcg ooms, scoping that to cgroup level should likely be how we > define group oom. 
So what are we going to do if we have a reasonable usecase where somebody really wants to have group kill behavior depending on the oom domain? I have a hard time imagining such a usecase, but my experience tells me that users will find a way I have never thought of. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 13:50 ` Michal Hocko @ 2018-07-24 13:55 ` Tejun Heo 2018-07-24 14:25 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Tejun Heo @ 2018-07-24 13:55 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen Hello, On Tue, Jul 24, 2018 at 03:50:22PM +0200, Michal Hocko wrote: > > So, one good way of thinking about this, I think, could be considering > > it as a scoped panic_on_oom. However panic_on_oom interacts with > > memcg ooms, scoping that to cgroup level should likely be how we > > define group oom. > > So what are we going to do if we have a reasonable usecase where somebody > really wants to have group kill behavior depending on the oom domain? > I have a hard time imagining such a usecase, but my experience tells me > that users will find a way I have never thought of. So, I don't know when that happened but panic_on_oom actually has 0, 1, 2 settings - 0 no group oom, 1 system kill on oom of any origin, 2 system kill only if it was a system oom. Maybe we should just follow that, but start with 1? Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 13:55 ` Tejun Heo @ 2018-07-24 14:25 ` Michal Hocko 2018-07-24 14:28 ` Tejun Heo 0 siblings, 1 reply; 52+ messages in thread From: Michal Hocko @ 2018-07-24 14:25 UTC (permalink / raw) To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen On Tue 24-07-18 06:55:28, Tejun Heo wrote: > Hello, > > On Tue, Jul 24, 2018 at 03:50:22PM +0200, Michal Hocko wrote: > > > So, one good way of thinking about this, I think, could be considering > > > it as a scoped panic_on_oom. However panic_on_oom interacts with > > > memcg ooms, scoping that to cgroup level should likely be how we > > > define group oom. > > > > So what are we going to do if we have a reasonable usecase when somebody > > really wants to have group kill behavior depending on the oom domain? > > I have hard time to imagine such a usecase but my experience tells me > > that users will find a way I have never thought about. > > So, I don't know when that happend but panic_on_oom actually has 0, 1, > 2 settings - 0 no group oom, 1 system kill on oom of any origin, 2 > system kill only if it was a system oom. Maybe we should just follow > that but just start with 1? I am sorry but I do not follow. Besides that modeling the behavior on panic_on_oom doesn't really sound very appealing to me. The knob is a crude hack mostly motivated by debugging (at least its non-global variants). So can we get back to workloads and shape the semantic on top of that please? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 14:25 ` Michal Hocko @ 2018-07-24 14:28 ` Tejun Heo 2018-07-24 14:35 ` Tejun Heo 2018-07-24 14:43 ` Michal Hocko 0 siblings, 2 replies; 52+ messages in thread From: Tejun Heo @ 2018-07-24 14:28 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen Hello, On Tue, Jul 24, 2018 at 04:25:54PM +0200, Michal Hocko wrote: > I am sorry but I do not follow. Besides that modeling the behavior on > panic_on_oom doesn't really sound very appealing to me. The knob is a > crude hack mostly motivated by debugging (at least its non-global > variants). Hmm... we actually do use that quite a bit in production (moving away from it gradually). > So can we get back to workloads and shape the semantic on top of that > please? I didn't realize we were that off track. Don't both map to what we were discussing almost perfectly? Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 14:28 ` Tejun Heo @ 2018-07-24 14:35 ` Tejun Heo 0 siblings, 0 replies; 52+ messages in thread From: Tejun Heo @ 2018-07-24 14:35 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen Hello, Lemme elaborate just a bit more. On Tue, Jul 24, 2018 at 07:28:20AM -0700, Tejun Heo wrote: > Hello, > > On Tue, Jul 24, 2018 at 04:25:54PM +0200, Michal Hocko wrote: > > I am sorry but I do not follow. Besides that modeling the behavior on > > panic_on_oom doesn't really sound very appealing to me. The knob is a > > crude hack mostly motivated by debugging (at least its non-global > > variants). > > Hmm... we actually do use that quite a bit in production (moving away > from it gradually). So, the reason panic_on_oom is used is very similar to the reason one would want group oom kill - workload integrity after an oom kill. panic_on_oom is an expensive way of avoiding partial kills and the resulting possibly inconsistent state. Group oom can scope that down so that we can maintain integrity per-application or domain rather than at the system level, making it way cheaper. > > So can we get back to workloads and shape the semantic on top of that > > please? > > I didn't realize we were that off track. Don't both map to what we > were discussing almost perfectly? I guess the reason panic_on_oom developed the two behaviors is that the initial behavior - panicking on any oom - was too inflexible. We're scoping it down, so whatever problems we used to have with panic_on_oom are less pronounced with group oom. So, I don't think this matters all that much in terms of practical usefulness. Both always killing and factoring in oom origin seem fine to me. Let's just pick one. Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 14:28 ` Tejun Heo 2018-07-24 14:35 ` Tejun Heo @ 2018-07-24 14:43 ` Michal Hocko 2018-07-24 14:49 ` Tejun Heo 1 sibling, 1 reply; 52+ messages in thread From: Michal Hocko @ 2018-07-24 14:43 UTC (permalink / raw) To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen On Tue 24-07-18 07:28:20, Tejun Heo wrote: > Hello, > > On Tue, Jul 24, 2018 at 04:25:54PM +0200, Michal Hocko wrote: [...] > > So can we get back to workloads and shape the semantic on top of that > > please? > > I didn't realize we were that off track. Don't both map to what we > were discussing almost perfectly? If yes, then I do not see it ;) Mostly because panic_on_oom doesn't have any scope. It is an all or nothing thing. You can only control whether memcg OOMs should be considered or not, because considering them by default would be inherently dangerous. oom_group has a scope, and that scope is exactly what we are trying to find a proper semantic for. And especially what to do if descendants in the hierarchy disagree with parent(s). While I do not see a sensible configuration where the scope of the OOM should define whether the workload is indivisible, I would like to prevent a "carved in stone" semantic that couldn't be changed later. So IMHO the best option would be to simply inherit the group_oom to children. This would allow users to do their weird stuff but the default configuration would be consistent. A more restrictive variant would be to disallow changing children to mismatch the parent oom_group == 1. This would have a slight advantage that those users would get back to us with their usecases and we can either loosen the restriction or explain that what they are doing is questionable and help with a more appropriate configuration. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 14:43 ` Michal Hocko @ 2018-07-24 14:49 ` Tejun Heo 2018-07-24 15:52 ` Roman Gushchin 2018-07-25 11:58 ` Michal Hocko 0 siblings, 2 replies; 52+ messages in thread From: Tejun Heo @ 2018-07-24 14:49 UTC (permalink / raw) To: Michal Hocko Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen Hello, Michal. On Tue, Jul 24, 2018 at 04:43:51PM +0200, Michal Hocko wrote: > If yes, then I do not see it ;) Mostly because panic_on_oom doesn't have > any scope. It is all or nothing thing. You can only control whether > memcg OOMs should be considered or not because this is inherently > dangerous to be the case by default. Oh yeah, so, panic_on_oom is like group oom on the root cgroup, right? If 1, it treats the whole system as a single unit and kills it no matter the oom domain. If 2, it only does so if the oom is not caused by restrictions in subdomains. > oom_group has a scope and that scope is exactly what we are trying to > find a proper semantic for. And especially what to do if descendants in > the hierarchy disagree with parent(s). While I do not see a sensible > configuration where the scope of the OOM should define the workload is > indivisible I would like to prevent from "carved in stone" semantic that > couldn't be changed later. And we can scope it down the same way down the cgroup hierarchy. > So IMHO the best option would be to simply inherit the group_oom to > children. This would allow users to do their weird stuff but the default > configuration would be consistent. Persistent config inheritance is a big no no. It really sucks because it makes the inherited state sticky and there's no way of telling why the current setting is the way it is without knowing the past configurations of the hierarchy. 
We actually had a pretty bad incident due to memcg swappiness inheritance recently (top level cgroups would inherit sysctl swappiness during boot before sysctl initialized them, and then post-boot it isn't clear why the settings are the way they are). Nothing in cgroup2 does persistent inheritance. If something explicit is necessary, we do .effective so that the effective config is clearly visible. > A more restrictive variant would be to disallow changing children to > mismatch the parent oom_group == 1. This would have a slight advantage > that those users would get back to us with their usecases and we can > either loosen the restriction or explain that what they are doing is > questionable and help with a more appropriate configuration. That's a nack too because across delegation, from a child pov, it isn't clear why it can't change configs, and it's also easy to introduce a situation where a child can lock its ancestors out of changing their configs. Thanks. -- tejun ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward 2018-07-24 14:49 ` Tejun Heo @ 2018-07-24 15:52 ` Roman Gushchin 2018-07-25 12:00 ` Michal Hocko 0 siblings, 1 reply; 52+ messages in thread From: Roman Gushchin @ 2018-07-24 15:52 UTC (permalink / raw) To: Tejun Heo; +Cc: Michal Hocko, hannes, David Rientjes, linux-mm, akpm, gthelen On Tue, Jul 24, 2018 at 07:49:40AM -0700, Tejun Heo wrote: > Hello, Michal. > > On Tue, Jul 24, 2018 at 04:43:51PM +0200, Michal Hocko wrote: > > If yes, then I do not see it ;) Mostly because panic_on_oom doesn't have > > any scope. It is an all or nothing thing. You can only control whether > > memcg OOMs should be considered or not, because considering them by default > > would be inherently dangerous. > > Oh yeah, so, panic_on_oom is like group oom on the root cgroup, right? > If 1, it treats the whole system as a single unit and kills it no > matter the oom domain. If 2, it only does so if the oom is not caused > by restrictions in subdomains. > > > oom_group has a scope, and that scope is exactly what we are trying to > > find a proper semantic for. And especially what to do if descendants in > > the hierarchy disagree with parent(s). While I do not see a sensible > > configuration where the scope of the OOM should define whether the workload > > is indivisible, I would like to prevent a "carved in stone" semantic that > > couldn't be changed later. > > And we can scope it down the same way down the cgroup hierarchy. > > > So IMHO the best option would be to simply inherit the group_oom to > > children. This would allow users to do their weird stuff but the default > > configuration would be consistent. I think that the problem occurs because of the default value (0). Let's imagine we make the default 1. It means that by default we kill the whole sub-tree up to the top-level cgroup, and it does guarantee consistency. If on some level userspace _knows_ how to handle OOM, it opts out by setting oom.group to 0. 
E.g. systemd _knows_ that services working in the system slice are independent and knows how to detect that they are dead and restart them. So it sets system.slice/memory.oom.group to 0. ^ permalink raw reply [flat|nested] 52+ messages in thread
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 15:52 ` Roman Gushchin
@ 2018-07-25 12:00 ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-25 12:00 UTC (permalink / raw)
  To: Roman Gushchin; +Cc: Tejun Heo, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 08:52:51, Roman Gushchin wrote:
> On Tue, Jul 24, 2018 at 07:49:40AM -0700, Tejun Heo wrote:
> > Hello, Michal.
> >
> > On Tue, Jul 24, 2018 at 04:43:51PM +0200, Michal Hocko wrote:
> > > If yes, then I do not see it ;) Mostly because panic_on_oom doesn't have
> > > any scope. It is an all-or-nothing thing. You can only control whether
> > > memcg OOMs should be considered or not, because this is inherently
> > > dangerous to be the case by default.
> >
> > Oh yeah, so panic_on_oom is like group oom on the root cgroup, right?
> > If 1, it treats the whole system as a single unit and kills it no
> > matter the oom domain. If 2, it only does so if the oom is not caused
> > by restrictions in subdomains.
> >
> > > oom_group has a scope, and that scope is exactly what we are trying to
> > > find a proper semantic for. And especially what to do if descendants in
> > > the hierarchy disagree with their parent(s). While I do not see a sensible
> > > configuration where the scope of the OOM should define an indivisible
> > > workload, I would like to prevent a "carved in stone" semantic that
> > > couldn't be changed later.
> >
> > And we can scope it down the same way down the cgroup hierarchy.
> >
> > > So IMHO the best option would be to simply inherit group_oom to the
> > > children. This would allow users to do their weird stuff, but the default
> > > configuration would be consistent.
>
> I think the problem occurs because of the default value (0).
>
> Let's imagine we could make the default 1.
> It means that by default we kill the whole sub-tree up to the top-level
> cgroup, and that does guarantee consistency.
> If on some level userspace _knows_ how to handle OOM, it opts out
> by setting oom.group to 0.

Apart from the fact that a default group_oom of 1 is absolutely
unacceptable, as explained earlier, I still fail to see how this makes
the situation any different. Say you know that you are not group oom;
what happens then? As soon as we check the parents we are screwed the
same way. Not to mention that a global OOM would basically mean killing
the whole world...
--
Michal Hocko
SUSE Labs
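Roman's default-to-1 proposal and Michal's objection to it can be made concrete with a small user-space sketch. This is plain Python with hypothetical names, not kernel code; it only models the "walk up and remember the highest cgroup with the flag set" rule being discussed:

```python
# Toy model of the "default oom.group = 1" proposal. Cgroups are nodes with
# a parent pointer and a group flag; the kill scope is the highest cgroup
# (within the OOM domain) that still has the flag set.

class Cgroup:
    def __init__(self, name, parent=None, group=1):  # proposal: default is 1
        self.name, self.parent, self.group = name, parent, group

def kill_scope(victim_cgroup, oom_domain):
    """Return the highest cgroup, up to the OOM domain, with group set."""
    scope, memcg = None, victim_cgroup
    while memcg is not None:
        if memcg.group:
            scope = memcg
        if memcg is oom_domain:
            break
        memcg = memcg.parent
    return scope

root = Cgroup("/", group=0)                       # root opted out
system = Cgroup("system.slice", root, group=0)    # systemd opted out: it can restart services
workload = Cgroup("workload.slice", root)         # left at the default (1)
job = Cgroup("workload.slice/job", workload)      # also default (1)

# A memcg OOM inside the job kills just the job's subtree.
print(kill_scope(job, job).name)
# But a global OOM with a victim in the job escalates the kill scope to the
# whole workload slice, because every default-1 ancestor up to the OOM
# domain counts.
print(kill_scope(job, root).name)
```

With a default of 1 everywhere and no opt-outs at all, the scope of a global OOM becomes the entire system, which is the "killing the world" concern raised above.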
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 14:49 ` Tejun Heo
  2018-07-24 15:52 ` Roman Gushchin
@ 2018-07-25 11:58 ` Michal Hocko
  1 sibling, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-25 11:58 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 07:49:40, Tejun Heo wrote:
> Hello, Michal.
>
> On Tue, Jul 24, 2018 at 04:43:51PM +0200, Michal Hocko wrote:
[...]
> > So IMHO the best option would be to simply inherit group_oom to the
> > children. This would allow users to do their weird stuff, but the default
> > configuration would be consistent.
>
> Persistent config inheritance is a big no-no. It really sucks, because
> it makes the inherited state sticky, and there is no way of telling why
> the current setting is the way it is without knowing the past
> configurations of the hierarchy. We actually had a pretty bad incident
> due to memcg swappiness inheritance recently (top-level cgroups would
> inherit the sysctl swappiness during boot, before the sysctl had
> initialized them, and post-boot it wasn't clear why the settings were
> the way they were).
>
> Nothing in cgroup2 does persistent inheritance. If something explicit
> is necessary, we expose .effective so that the effective config is
> clearly visible.
>
> > A more restrictive variant would be to disallow changing children to
> > mismatch a parent with oom_group == 1. This would have the slight
> > advantage that those users would get back to us with their use cases,
> > and we could either loosen the restriction or explain that what they
> > are doing is questionable and help with a more appropriate
> > configuration.
>
> That's a nack too, because across delegation, from a child's point of
> view, it isn't clear why it can't change its configs, and it's also
> easy to introduce a situation where a child can lock its ancestors out
> of changing their configs.

OK, fair points. I will keep thinking about this some more.
I still cannot shake a bad feeling about the semantic, and about how poor
users are going to scratch their heads wondering what the heck is going on
here. I will follow up in another email, where we are discussing both
options, once I am able to sort this out myself.
--
Michal Hocko
SUSE Labs
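The distinction Tejun draws, persistent inheritance versus a cgroup2-style `.effective` value, can be sketched in a few lines of user-space Python. This is a hypothetical model for illustration, not the actual memcg implementation: inheritance copies the parent's value once at creation time (sticky and history-dependent), while an effective value keeps each cgroup's own setting untouched and derives the result from the ancestry on every read.

```python
# Persistent inheritance vs. a derived ".effective" value.

class Node:
    def __init__(self, parent=None, setting=0):
        self.parent, self.setting = parent, setting

def effective(node):
    """cgroup2 style: the effective value is derived from the ancestry on read."""
    return max(node.setting, effective(node.parent)) if node.parent else node.setting

def create_child_inheriting(parent):
    """Persistent inheritance: value copied once; nothing ties it to the parent."""
    return Node(parent, setting=parent.setting)

a = Node(setting=1)
b_eff = Node(a, setting=0)           # own setting stays 0; effective comes from a
b_inh = create_child_inheriting(a)   # sticky copy of a's current value

a.setting = 0                        # the parent later changes its mind
print(effective(b_eff))              # follows the ancestry: now 0
print(b_inh.setting)                 # stuck at the copied value: still 1
```

The second print is exactly the swappiness incident described above: the copied value survives with no visible reason for why it is set.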
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 13:26 ` Michal Hocko
  2018-07-24 13:31 ` Tejun Heo
@ 2018-07-30 8:03 ` Michal Hocko
  2018-07-30 14:04 ` Tejun Heo
  1 sibling, 1 reply; 52+ messages in thread
From: Michal Hocko @ 2018-07-30 8:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

On Tue 24-07-18 15:26:40, Michal Hocko wrote:
> On Tue 24-07-18 06:08:36, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Jul 24, 2018 at 09:32:30AM +0200, Michal Hocko wrote:
> [...]
> > > > There's no reason to put any
> > > > restrictions on what each cgroup can configure. The only thing which
> > > > matters is that the effective behavior is what the highest in the
> > > > ancestry configures, and, at the system level, it'd conceptually map
> > > > to panic_on_oom.
> > >
> > > Hmm, so do we inherit group_oom? If not, how do we prevent
> > > unexpected behavior?
> >
> > Hmm... I guess we're debating two options here. Please consider the
> > following hierarchy.
> >
> >        R
> >        |
> >        A (group oom == 1)
> >       / \
> >      B   C
> >      |
> >      D
> >
> > 1. No matter what B, C or D sets, as long as A sets group oom, any oom
> >    kill inside A's subtree kills the entire subtree.
> >
> > 2. A's group oom policy applies iff the source of the OOM is either at
> >    or above A - ie. iff the OOM is system-wide or caused by the
> >    memory.max of A.
> >
> > In #1, it doesn't matter what B, C or D sets, so it's kinda moot to
> > discuss whether they inherit A's setting or not. A's is, if set,
> > always overriding. In #2, what B, C or D sets matters if they also
> > set their own memory.max, so there's no reason for them to inherit
> > anything.
> >
> > I'm actually okay with either option. #2 is more flexible than #1, but
> > given that this is a cgroup-owned property which is likely to be set
> > on a per-application basis, #1 is likely good enough.
> >
> > IIRC, we did #2 in the original implementation and the simplified one
> > is doing #1, right?
>
> No, we've been discussing #2 unless I have misunderstood something.
> I find it rather non-intuitive that a property outside of the oom domain
> controls the behavior inside the domain. I will keep thinking about that
> though.

So the more I think about it, the more I am convinced that #1 is simply
wrong, for the reason mentioned above: consulting a property outside of
the OOM hierarchy is tricky and non-intuitive. So the implementation
should be

	oom_group = NULL;
	memcg = mem_cgroup_from_task(oom_victim);
	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
		if (memcg->group_oom)
			oom_group = memcg;
		if (memcg == root)
			break;
	}

And the documented semantic:

  oom.group - denotes a memory cgroup (or subhierarchy) which represents
  an indivisible workload. Should any process be selected as an OOM
  victim due to an OOM event (at this cgroup level or above), then all
  processes belonging to the group/hierarchy are killed together.

  Please be careful when defining different oom.group policies within
  the same hierarchy, because OOM events at different hierarchy levels
  can have surprising effects. Example:

	 R
	 |
	 A (oom.group = 1)
	/ \
       B   C (oom.group = 0)

  with the oom victim living in B resp. C.

  An OOM event at R (e.g. a global OOM) or at A will kill all tasks in
  A's subtree. An OOM event at B resp. C will only kill a single process
  from those memcgs.

--
Michal Hocko
SUSE Labs
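The selection loop above can be transcribed into a user-space Python sketch so the documented semantic can be checked against the example hierarchy. A victim's memcg is walked upward to the OOM domain, remembering the highest memcg with oom.group set. These are toy objects for illustration, not kernel code:

```python
# User-space rendering of the "highest oom.group ancestor within the OOM
# domain" walk, checked against the R / A / B,C example from the thread.

class Memcg:
    def __init__(self, name, parent=None, group_oom=0):
        self.name, self.parent, self.group_oom = name, parent, group_oom

def find_oom_group(victim_memcg, root):
    """Walk from the victim's memcg up to the OOM domain root, remembering
    the highest memcg with group_oom set; None means kill a single task."""
    oom_group = None
    memcg = victim_memcg
    while memcg is not None:
        if memcg.group_oom:
            oom_group = memcg
        if memcg is root:
            break
        memcg = memcg.parent
    return oom_group

#      R
#      |
#      A (oom.group = 1)
#     / \
#    B   C (oom.group = 0)
R = Memcg("R")
A = Memcg("A", R, group_oom=1)
B = Memcg("B", A)
C = Memcg("C", A, group_oom=0)

for victim, domain in [(B, R), (C, R), (B, A), (B, B), (C, C)]:
    grp = find_oom_group(victim, domain)
    scope = "all of " + grp.name if grp else "a single task"
    print(f"OOM at {domain.name}, victim in {victim.name}: kill {scope}")
```

The loop reproduces the documented behavior: OOM events at R or A select all of A, while OOM events scoped to B or C fall back to killing a single process.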
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-30 8:03 ` Michal Hocko
@ 2018-07-30 14:04 ` Tejun Heo
  2018-07-30 15:29 ` Roman Gushchin
  0 siblings, 1 reply; 52+ messages in thread
From: Tejun Heo @ 2018-07-30 14:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, hannes, David Rientjes, linux-mm, akpm, gthelen

Hello,

On Mon, Jul 30, 2018 at 10:03:57AM +0200, Michal Hocko wrote:
> Please be careful when defining different oom.group policies within the
> same hierarchy, because OOM events at different hierarchy levels can
> have surprising effects. Example:
>
>	 R
>	 |
>	 A (oom.group = 1)
>	/ \
>      B   C (oom.group = 0)
>
> with the oom victim living in B resp. C.
>
> An OOM event at R (e.g. a global OOM) or at A will kill all tasks in
> A's subtree. An OOM event at B resp. C will only kill a single process
> from those memcgs.

That behavior makes perfect sense to me, and it maps to panic_on_oom==2,
which works. Roman, what do you think?

Thanks.
--
tejun
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-30 14:04 ` Tejun Heo
@ 2018-07-30 15:29 ` Roman Gushchin
  0 siblings, 0 replies; 52+ messages in thread
From: Roman Gushchin @ 2018-07-30 15:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Michal Hocko, hannes, David Rientjes, linux-mm, akpm, gthelen

On Mon, Jul 30, 2018 at 07:04:26AM -0700, Tejun Heo wrote:
> Hello,
>
> On Mon, Jul 30, 2018 at 10:03:57AM +0200, Michal Hocko wrote:
> > Please be careful when defining different oom.group policies within the
> > same hierarchy, because OOM events at different hierarchy levels can
> > have surprising effects. Example:
> >
> >	 R
> >	 |
> >	 A (oom.group = 1)
> >	/ \
> >      B   C (oom.group = 0)
> >
> > with the oom victim living in B resp. C.
> >
> > An OOM event at R (e.g. a global OOM) or at A will kill all tasks in
> > A's subtree. An OOM event at B resp. C will only kill a single process
> > from those memcgs.
>
> That behavior makes perfect sense to me, and it maps to panic_on_oom==2,
> which works. Roman, what do you think?

I'm totally fine with this behavior; it's what I suggested initially.
I'll post the patchset soon.

Thanks!
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-11 22:40 cgroup-aware OOM killer, how to move forward Roman Gushchin
  2018-07-12 12:07 ` Michal Hocko
  2018-07-13 21:34 ` David Rientjes
@ 2018-07-24 11:59 ` Tetsuo Handa
  2018-07-25 0:10 ` Roman Gushchin
  2 siblings, 1 reply; 52+ messages in thread
From: Tetsuo Handa @ 2018-07-24 11:59 UTC (permalink / raw)
  To: Roman Gushchin, linux-mm; +Cc: akpm, rientjes, mhocko, hannes, tj, gthelen

Roman, will you check this cleanup patch? It applies on top of next-20180724.
I assumed that your series does not kill processes whose termination the
current thread should not wait for.
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-24 11:59 ` Tetsuo Handa
@ 2018-07-25 0:10 ` Roman Gushchin
  2018-07-25 12:23 ` Tetsuo Handa
  0 siblings, 1 reply; 52+ messages in thread
From: Roman Gushchin @ 2018-07-25 0:10 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, akpm, rientjes, mhocko, hannes, tj, gthelen

On Tue, Jul 24, 2018 at 08:59:58PM +0900, Tetsuo Handa wrote:
> Roman, will you check this cleanup patch? This patch applies on top of
> next-20180724. I assumed that your series do not kill processes which
> current thread should not wait for termination.

Hi Tetsuo!

> From 86ba99fbf73a9eda0df5ee4ae70c075781e83f81 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Tue, 24 Jul 2018 14:00:45 +0900
> Subject: [PATCH] mm,oom: Check pending victims earlier in out_of_memory().
>
> The "mm, oom: cgroup-aware OOM killer" patchset introduced INFLIGHT_VICTIM
> in order to replace the open-coded ((void *)-1UL). But (in the CONFIG_MMU=y
> case) we have a list of inflight OOM victim threads which are connected to
> oom_reaper_list. Thus we can check whether there are inflight OOM victims
> before starting the process/memcg list traversal. Since it is likely that
> only a few threads are linked to oom_reaper_list, checking all victims'
> OOM domains will not matter.

I have a couple of top-level concerns:

1) You're doubling the size of the oom_reaper memory footprint in
   task_struct. I doubt that the code cleanup is really worth it. If it's
   absolutely necessary to resolve the lockup you mentioned, that should
   be explained explicitly.

2) There are several cosmetic changes in this patch, which makes reviewing
   harder. Can you please split it into several parts?

Thanks!
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-25 0:10 ` Roman Gushchin
@ 2018-07-25 12:23 ` Tetsuo Handa
  2018-07-25 13:01 ` Michal Hocko
  0 siblings, 1 reply; 52+ messages in thread
From: Tetsuo Handa @ 2018-07-25 12:23 UTC (permalink / raw)
  To: Roman Gushchin, akpm, mhocko; +Cc: linux-mm, rientjes, hannes, tj, gthelen

Michal, I think that I hit a WQ OOM lockup caused by the lack of a
guaranteed schedule_timeout_*() for WQ. I think we should immediately send
https://marc.info/?l=linux-mm&m=153062798103081
(or both http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org
and https://marc.info/?l=linux-mm&m=152723708623015 ) to linux.git, so that
we can send them to the stable kernels without waiting for the "mm, oom:
cgroup-aware OOM killer" patchset.

Log file is at http://I-love.SAKURA.ne.jp/tmp/serial-20180725.txt.xz .
( Output up to the last two OOM killer invocations is omitted from the
excerpt below. )

$ grep lockup serial-20180725.txt
[ 3512.401044] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 72s!
[ 3545.156052] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 105s!
[ 3575.364044] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 135s!
[ 3693.429456] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 253s!
[ 3785.795525] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 346s!
[ 3906.627292] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 467s!
$ grep out_of_memory serial-20180725.txt [ 3439.486723] out_of_memory+0x3fc/0x780 [ 3439.674515] out_of_memory+0x3fc/0x780 $ grep sysrq: serial-20180725.txt [ 3473.136366] sysrq: SysRq : Show State [ 3575.779715] sysrq: SysRq : Show backtrace of all active CPUs [ 3621.473841] sysrq: SysRq : Show Memory [ 3654.129996] sysrq: SysRq : Show State [ 3733.447398] sysrq: SysRq : Manual OOM execution [ 3808.956565] sysrq: SysRq : Show backtrace of all active CPUs [ 3857.741374] sysrq: SysRq : Manual OOM execution [ 3871.848122] sysrq: SysRq : Show backtrace of all active CPUs [ 3930.529273] sysrq: SysRq : Show backtrace of all active CPUs [ 3946.694858] sysrq: SysRq : Terminate All Tasks $ grep -A 100 -F 3575.779715 serial-20180725.txt [ 3575.779715] sysrq: SysRq : Show backtrace of all active CPUs [ 3575.782255] NMI backtrace for cpu 0 [ 3575.784104] CPU: 0 PID: 82423 Comm: kworker/0:3 Not tainted 4.18.0-rc6-next-20180724+ #246 [ 3575.787142] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 3575.790847] Workqueue: events_freezable_power_ disk_events_workfn [ 3575.793317] Call Trace: [ 3575.794818] <IRQ> [ 3575.796180] dump_stack+0x85/0xcb [ 3575.797858] nmi_cpu_backtrace+0xa1/0xb0 [ 3575.799710] ? 
lapic_can_unplug_cpu+0x90/0x90 [ 3575.801712] nmi_trigger_cpumask_backtrace+0xf0/0x130 [ 3575.803894] __handle_sysrq+0xbe/0x200 [ 3575.805733] sysrq_filter+0x82/0x400 [ 3575.807513] input_to_handler+0x43/0xe0 [ 3575.809339] input_pass_values.part.9+0x1ab/0x260 [ 3575.811404] input_handle_event+0xd6/0x570 [ 3575.813330] input_event+0x45/0x70 [ 3575.815020] atkbd_interrupt+0x4ff/0x780 [ 3575.816810] serio_interrupt+0x3b/0x80 [ 3575.818517] i8042_interrupt+0x25d/0x410 [ 3575.820305] __handle_irq_event_percpu+0x2c/0xe0 [ 3575.822256] handle_irq_event_percpu+0x2b/0x70 [ 3575.824148] handle_irq_event+0x2f/0x50 [ 3575.825882] handle_edge_irq+0x6c/0x180 [ 3575.827598] handle_irq+0xa0/0x110 [ 3575.829158] do_IRQ+0x4e/0x100 [ 3575.830611] common_interrupt+0xf/0xf [ 3575.832244] </IRQ> [ 3575.833439] RIP: 0010:lock_release+0x36/0x1d0 [ 3575.835346] Code: 25 c0 5d 01 00 48 83 ec 18 44 8b 8b d4 0b 00 00 65 48 8b 04 25 28 00 00 00 48 89 44 24 10 31 c0 45 85 c9 0f 85 e1 00 00 00 9c <58> 66 66 90 66 90 48 89 c5 fa 66 66 90 66 66 90 44 8b 05 43 54 0f [ 3575.841978] RSP: 0000:ffffabfb43b3b898 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffcc [ 3575.844683] RAX: 0000000000000000 RBX: ffff8bfb94e8c240 RCX: 0000000000000000 [ 3575.847296] RDX: ffffffff8919647a RSI: 0000000000000001 RDI: ffffffff8a0597c0 [ 3575.849896] RBP: ffffabfb43b3b908 R08: ffffffff891963d0 R09: 0000000000000000 [ 3575.852506] R10: ffffabfb43b3b8a8 R11: 0000000000000001 R12: ffffffffffffffff [ 3575.855104] R13: ffff8bfbb6373028 R14: 0000000000000000 R15: ffff8bfbb61a3b20 [ 3575.857713] ? __list_lru_init+0x280/0x280 [ 3575.859567] ? list_lru_count_one+0xaa/0x1e0 [ 3575.861453] ? list_lru_count_one+0xc2/0x1e0 [ 3575.863290] ? super_cache_count+0x74/0xf0 [ 3575.865135] ? do_shrink_slab+0x35/0x190 [ 3575.866878] ? shrink_slab+0x1d3/0x2c0 [ 3575.868565] ? shrink_slab+0x240/0x2c0 [ 3575.870238] ? shrink_node+0xe3/0x460 [ 3575.871863] ? do_try_to_free_pages+0xcb/0x380 [ 3575.873700] ? 
try_to_free_pages+0xbb/0xf0 [ 3575.875450] ? __alloc_pages_slowpath+0x3c1/0xc50 [ 3575.877311] ? __alloc_pages_nodemask+0x2a6/0x2c0 [ 3575.879187] ? bio_copy_kern+0xcd/0x200 [ 3575.880976] ? blk_rq_map_kern+0xb6/0x130 [ 3575.882760] ? scsi_execute+0x64/0x250 [ 3575.884459] ? sr_check_events+0x9a/0x2b0 [sr_mod] [ 3575.886424] ? __mutex_unlock_slowpath+0x46/0x2b0 [ 3575.888257] ? cdrom_check_events+0xf/0x30 [cdrom] [ 3575.890119] ? sr_block_check_events+0x7c/0xb0 [sr_mod] [ 3575.892152] ? disk_check_events+0x5e/0x150 [ 3575.893890] ? process_one_work+0x290/0x4a0 [ 3575.895660] ? process_one_work+0x227/0x4a0 [ 3575.897411] ? worker_thread+0x28/0x3d0 [ 3575.899006] ? process_one_work+0x4a0/0x4a0 [ 3575.900727] ? kthread+0x107/0x120 [ 3575.902191] ? kthread_create_worker_on_cpu+0x70/0x70 [ 3575.904132] ? ret_from_fork+0x24/0x30 [ 3621.473841] sysrq: SysRq : Show Memory [ 3621.475640] Mem-Info: [ 3621.476934] active_anon:831608 inactive_anon:5544 isolated_anon:0 [ 3621.476934] active_file:0 inactive_file:0 isolated_file:0 [ 3621.476934] unevictable:0 dirty:0 writeback:0 unstable:0 [ 3621.476934] slab_reclaimable:3620 slab_unreclaimable:31654 [ 3621.476934] mapped:1463 shmem:10471 pagetables:8448 bounce:0 [ 3621.476934] free:21327 free_pcp:66 free_cma:0 [ 3621.488816] Node 0 active_anon:3326432kB inactive_anon:22176kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:5852kB dirty:0kB writeback:0kB shmem:41884kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 2816000kB writeback_tmp:0kB unstable:0kB all_unreclaimable? 
no [ 3621.497230] Node 0 DMA free:14712kB min:288kB low:360kB high:432kB active_anon:1128kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15960kB managed:15876kB mlocked:0kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 3621.505476] lowmem_reserve[]: 0 2679 3607 3607 [ 3621.507341] Node 0 DMA32 free:53596kB min:49968kB low:62460kB high:74952kB active_anon:2686544kB inactive_anon:8kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:3129152kB managed:2743316kB mlocked:0kB kernel_stack:48kB pagetables:120kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 3621.516007] lowmem_reserve[]: 0 0 928 928 [ 3621.517799] Node 0 Normal free:17000kB min:17320kB low:21648kB high:25976kB active_anon:638644kB inactive_anon:22168kB active_file:56kB inactive_file:200kB unevictable:0kB writepending:0kB present:1048576kB managed:951040kB mlocked:0kB kernel_stack:17888kB pagetables:33668kB bounce:0kB free_pcp:264kB local_pcp:264kB free_cma:0kB [ 3621.527702] lowmem_reserve[]: 0 0 0 0 [ 3621.529458] Node 0 DMA: 0*4kB 1*8kB (M) 1*16kB (M) 1*32kB (U) 1*64kB (U) 2*128kB (UM) 2*256kB (UM) 1*512kB (M) 1*1024kB (U) 0*2048kB 3*4096kB (M) = 14712kB [ 3621.533965] Node 0 DMA32: 13*4kB (UM) 17*8kB (UM) 20*16kB (UM) 67*32kB (UM) 50*64kB (U) 27*128kB (UM) 27*256kB (UM) 7*512kB (U) 33*1024kB (UM) 0*2048kB 0*4096kB = 53596kB [ 3621.539631] Node 0 Normal: 29*4kB (UME) 3*8kB (UE) 13*16kB (UME) 74*32kB (UME) 67*64kB (UME) 38*128kB (UE) 17*256kB (UE) 2*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 17244kB [ 3621.545487] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 3621.548666] 10482 total pagecache pages [ 3621.550795] 0 pages in swap cache [ 3621.552891] Swap cache stats: add 0, delete 0, find 0/0 [ 3621.555539] Free swap = 0kB [ 3621.557299] Total swap = 0kB [ 3621.559054] 1048422 pages RAM [ 3621.560836] 0 pages HighMem/MovableOnly [ 3621.562838] 120864 
pages reserved [ 3621.564724] 0 pages cma reserved [ 3621.566586] 0 pages hwpoisoned [ 3654.129996] sysrq: SysRq : Show State [ 3654.132038] task PC stack pid father [ 3654.134654] systemd D12240 1 0 0x00000000 [ 3654.137087] Call Trace: [ 3654.138710] ? __schedule+0x245/0x7f0 [ 3654.140687] ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs] [ 3654.143007] schedule+0x23/0x80 [ 3654.144835] schedule_preempt_disabled+0x5/0x10 [ 3654.146999] __mutex_lock+0x3f5/0x9b0 $ grep -A 61 -F 3930.529273 serial-20180725.txt [ 3930.529273] sysrq: SysRq : Show backtrace of all active CPUs [ 3930.532004] NMI backtrace for cpu 0 [ 3930.533974] CPU: 0 PID: 82423 Comm: kworker/0:3 Not tainted 4.18.0-rc6-next-20180724+ #246 [ 3930.537201] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017 [ 3930.541173] Workqueue: events_freezable_power_ disk_events_workfn [ 3930.543773] Call Trace: [ 3930.545341] <IRQ> [ 3930.546766] dump_stack+0x85/0xcb [ 3930.548558] nmi_cpu_backtrace+0xa1/0xb0 [ 3930.550566] ? 
lapic_can_unplug_cpu+0x90/0x90 [ 3930.552640] nmi_trigger_cpumask_backtrace+0xf0/0x130 [ 3930.554930] __handle_sysrq+0xbe/0x200 [ 3930.556834] sysrq_filter+0x82/0x400 [ 3930.558669] input_to_handler+0x43/0xe0 [ 3930.560558] input_pass_values.part.9+0x1ab/0x260 [ 3930.562688] input_handle_event+0xd6/0x570 [ 3930.564602] input_event+0x45/0x70 [ 3930.566309] atkbd_interrupt+0x4ff/0x780 [ 3930.568187] serio_interrupt+0x3b/0x80 [ 3930.569981] i8042_interrupt+0x25d/0x410 [ 3930.571819] __handle_irq_event_percpu+0x2c/0xe0 [ 3930.573860] handle_irq_event_percpu+0x2b/0x70 [ 3930.575796] handle_irq_event+0x2f/0x50 [ 3930.577561] handle_edge_irq+0x6c/0x180 [ 3930.579313] handle_irq+0xa0/0x110 [ 3930.580994] do_IRQ+0x4e/0x100 [ 3930.582607] common_interrupt+0xf/0xf [ 3930.584360] </IRQ> [ 3930.585642] RIP: 0010:lock_acquire+0x1/0x70 [ 3930.587495] Code: c6 08 3c e0 89 48 c7 c7 c4 d6 df 89 e8 78 8e fa ff 0f 0b e9 3a ff ff ff e8 cc 90 fa ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 55 <53> 65 48 8b 2c 25 c0 5d 01 00 8b 85 d4 0b 00 00 85 c0 75 54 4d 89 [ 3930.594222] RSP: 0000:ffffabfb43b3b8d0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffcc [ 3930.597083] RAX: 0000000000000000 RBX: ffff8bfbb14b9ed8 RCX: 0000000000000002 [ 3930.599825] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8a0597c0 [ 3930.602605] RBP: ffffabfb43b3b908 R08: 0000000000000000 R09: 0000000000000000 [ 3930.605348] R10: ffffabfb43b3b8a8 R11: 0000000000000001 R12: ffffabfb43b3b9b0 [ 3930.608076] R13: ffff8bfbb8b90008 R14: 0000000000000000 R15: ffff8bfbb17e1880 [ 3930.610804] list_lru_count_one+0x3b/0x1e0 [ 3930.612662] ? __list_lru_init+0x280/0x280 [ 3930.614559] super_cache_count+0x74/0xf0 [ 3930.616362] do_shrink_slab+0x35/0x190 [ 3930.618100] ? 
shrink_slab+0x1d3/0x2c0 [ 3930.619844] shrink_slab+0x240/0x2c0 [ 3930.621531] shrink_node+0xe3/0x460 [ 3930.623184] do_try_to_free_pages+0xcb/0x380 [ 3930.625049] try_to_free_pages+0xbb/0xf0 [ 3930.626869] __alloc_pages_slowpath+0x3c1/0xc50 [ 3930.628803] __alloc_pages_nodemask+0x2a6/0x2c0 [ 3930.630904] bio_copy_kern+0xcd/0x200 [ 3930.632639] blk_rq_map_kern+0xb6/0x130 [ 3930.634408] scsi_execute+0x64/0x250 [ 3930.636079] sr_check_events+0x9a/0x2b0 [sr_mod] [ 3930.637998] ? __mutex_unlock_slowpath+0x46/0x2b0 [ 3930.639934] cdrom_check_events+0xf/0x30 [cdrom] [ 3930.641822] sr_block_check_events+0x7c/0xb0 [sr_mod] [ 3930.643889] disk_check_events+0x5e/0x150 [ 3930.645611] process_one_work+0x290/0x4a0 [ 3930.647628] ? process_one_work+0x227/0x4a0 [ 3930.649590] worker_thread+0x28/0x3d0 [ 3930.651394] ? process_one_work+0x4a0/0x4a0 [ 3930.653216] kthread+0x107/0x120 [ 3930.654709] ? kthread_create_worker_on_cpu+0x70/0x70 [ 3930.656745] ret_from_fork+0x24/0x30 Roman Gushchin wrote: > On Tue, Jul 24, 2018 at 08:59:58PM +0900, Tetsuo Handa wrote: > > Roman, will you check this cleanup patch? This patch applies on top of next-20180724. > > I assumed that your series do not kill processes which current thread should not > > wait for termination. > > Hi Tetsuo! > > > > > From 86ba99fbf73a9eda0df5ee4ae70c075781e83f81 Mon Sep 17 00:00:00 2001 > > From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Date: Tue, 24 Jul 2018 14:00:45 +0900 > > Subject: [PATCH] mm,oom: Check pending victims earlier in out_of_memory(). > > > > The "mm, oom: cgroup-aware OOM killer" patchset introduced INFLIGHT_VICTIM > > in order to replace open-coded ((void *)-1UL). But (regarding CONFIG_MMU=y > > case) we have a list of inflight OOM victim threads which are connected to > > oom_reaper_list. Thus we can check whether there are inflight OOM victims > > before starting process/memcg list traversal. 
Since it is likely that only
> > few threads are linked to oom_reaper_list, checking all victims' OOM domain
> > will not matter.
>
> I have a couple of top-level concerns:
> 1) You're doubling the size of oom_reaper memory footprint in task_struct.
>    I doubt, that code cleanup really worth it. If it's absolutely necessary
>    to resolve the lockup, which you mentioned, it should be explained
>    explicitly.

Doubling the size of the oom_reaper memory footprint in task_struct is
irrelevant to mitigating the lockup. The patch for mitigating the lockup is
very simple (like the one shown above). But since that patch conflicts with
your series, we can't apply it unless we rebase it on top of your series.
I would really appreciate it if you rebased your series after applying that
patch.

Doubling the size is not unique to my changes. David is trying to avoid
unnecessary killing of additional processes using a timeout
( http://lkml.kernel.org/r/alpine.DEB.2.21.1807241443390.206335@chino.kir.corp.google.com )
and I agree that using "struct list_head" can make the code easier to read,
especially when your series is about to generate many OOM victims.

> 2) There are several cosmetic changes in this patch, which makes reviewing
>    harder. Can you, please, split it into several parts.

The patch is shorter ( https://marc.info/?l=linux-mm&m=153062797803080 )
if we could apply it before your series. FYI, a split series is at
https://marc.info/?l=linux-mm&m=153062796003073 .

> Thanks!
* Re: cgroup-aware OOM killer, how to move forward
  2018-07-25 12:23 ` Tetsuo Handa
@ 2018-07-25 13:01 ` Michal Hocko
  0 siblings, 0 replies; 52+ messages in thread
From: Michal Hocko @ 2018-07-25 13:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Roman Gushchin, akpm, linux-mm, rientjes, hannes, tj, gthelen

On Wed 25-07-18 21:23:11, Tetsuo Handa wrote:
> Michal, I think that I hit a WQ OOM lockup caused by the lack of a
> guaranteed schedule_timeout_*() for WQ. I think we should immediately send
> https://marc.info/?l=linux-mm&m=153062798103081
> (or both http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org
> and https://marc.info/?l=linux-mm&m=152723708623015 ) to linux.git, so
> that we can send them to the stable kernels without waiting for the
> "mm, oom: cgroup-aware OOM killer" patchset.

Then do not pollute unrelated threads. Really! The patch to drop the sleep
from the oom_lock is in the mmotm tree already. I really do not see why
you make it more important than it really is (ohhh, is it the dubious CVE
you have filed?). If you have hit a WQ OOM lockup because of a missing
schedule_timeout, then report it in a new email thread and we can discuss
the proper way of handling it.

Thanks!
--
Michal Hocko
SUSE Labs
end of thread, newest: ~2018-07-30 15:29 UTC

Thread overview: 52+ messages

2018-07-11 22:40 cgroup-aware OOM killer, how to move forward Roman Gushchin
2018-07-12 12:07 ` Michal Hocko
2018-07-12 15:55 ` Roman Gushchin
2018-07-13 21:34 ` David Rientjes
2018-07-13 22:16 ` Roman Gushchin
2018-07-13 22:39 ` David Rientjes
2018-07-13 23:05 ` Roman Gushchin
2018-07-13 23:11 ` David Rientjes
2018-07-13 23:16 ` Roman Gushchin
2018-07-17 4:19 ` David Rientjes
2018-07-17 12:41 ` Michal Hocko
2018-07-17 17:38 ` Roman Gushchin
2018-07-17 19:49 ` Michal Hocko
2018-07-17 20:06 ` Roman Gushchin
2018-07-17 20:41 ` David Rientjes
2018-07-17 20:52 ` Roman Gushchin
2018-07-20 8:30 ` David Rientjes
2018-07-20 11:21 ` Tejun Heo
2018-07-20 16:13 ` Roman Gushchin
2018-07-20 20:28 ` David Rientjes
2018-07-20 20:47 ` Roman Gushchin
2018-07-23 23:06 ` David Rientjes
2018-07-23 14:12 ` Michal Hocko
2018-07-18 8:19 ` Michal Hocko
2018-07-18 8:12 ` Michal Hocko
2018-07-18 15:28 ` Roman Gushchin
2018-07-19 7:38 ` Michal Hocko
2018-07-19 17:05 ` Roman Gushchin
2018-07-20 8:32 ` David Rientjes
2018-07-23 14:17 ` Michal Hocko
2018-07-23 15:09 ` Tejun Heo
2018-07-24 7:32 ` Michal Hocko
2018-07-24 13:08 ` Tejun Heo
2018-07-24 13:26 ` Michal Hocko
2018-07-24 13:31 ` Tejun Heo
2018-07-24 13:50 ` Michal Hocko
2018-07-24 13:55 ` Tejun Heo
2018-07-24 14:25 ` Michal Hocko
2018-07-24 14:28 ` Tejun Heo
2018-07-24 14:35 ` Tejun Heo
2018-07-24 14:43 ` Michal Hocko
2018-07-24 14:49 ` Tejun Heo
2018-07-24 15:52 ` Roman Gushchin
2018-07-25 12:00 ` Michal Hocko
2018-07-25 11:58 ` Michal Hocko
2018-07-30 8:03 ` Michal Hocko
2018-07-30 14:04 ` Tejun Heo
2018-07-30 15:29 ` Roman Gushchin
2018-07-24 11:59 ` Tetsuo Handa
2018-07-25 0:10 ` Roman Gushchin
2018-07-25 12:23 ` Tetsuo Handa
2018-07-25 13:01 ` Michal Hocko