From: Michal Hocko <mhocko@kernel.org>
To: Roman Gushchin <guro@fb.com>, Johannes Weiner <hannes@cmpxchg.org>,
	Tejun Heo <tj@kernel.org>, kernel-team@fb.com
Cc: David Rientjes <rientjes@google.com>, linux-mm@kvack.org,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Andrew Morton <akpm@linux-foundation.org>,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [v8 0/4] cgroup-aware OOM killer
Date: Mon, 25 Sep 2017 14:24:00 +0200
Message-ID: <20170925122400.4e7jh5zmuzvbggpe@dhcp22.suse.cz>
In-Reply-To: <20170920215341.GA5382@castle>

I would really appreciate some feedback from Tejun, Johannes here.

On Wed 20-09-17 14:53:41, Roman Gushchin wrote:
> On Mon, Sep 18, 2017 at 08:14:05AM +0200, Michal Hocko wrote:
> > On Fri 15-09-17 08:23:01, Roman Gushchin wrote:
> > > On Fri, Sep 15, 2017 at 12:58:26PM +0200, Michal Hocko wrote:
[...]
> > > > But then you just enforce a structural restriction on your configuration
> > > > because
> > > >        root
> > > >        /  \
> > > >       A    D
> > > >      / \
> > > >     B   C
> > > >
> > > > is a different thing than
> > > >        root
> > > >       / | \
> > > >      B  C  D
> > > >
> > >
> > > I actually don't have a strong argument against an approach to select
> > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > not much difference.
>
> I've tried to implement this approach, and it's really arguable.
> Although your example looks reasonable, the opposite example is also valid:
> you might want to compare whole hierarchies, and it's a quite typical usecase.
>
> Assume, you have several containerized workloads on a machine (probably,
> each will be contained in a memcg with memory.max set), with some hierarchy
> of cgroups inside. Then in case of global memory shortage we want to reclaim
> some memory from the biggest workload, and the selection should not depend
> on group_oom settings. It would be really strange if setting group_oom
> raised the chances of being killed.
>
> In other words, let's imagine processes as leaf nodes in memcg tree. We decided
> to select the biggest memcg and kill one or more processes inside (depending
> on group_oom setting), but the memcg selection doesn't depend on it.
> We do not compare processes from different cgroups, as well as cgroups with
> processes. The same should apply to cgroups: why do we want to compare cgroups
> from different sub-trees?
>
> While size-based comparison can be implemented with this approach,
> the priority-based is really weird (as David mentioned).
> If priorities have no hierarchical meaning at all, we lack the very important
> ability to enforce hierarchy oom_priority. Otherwise we have to invent some
> complex rules of oom_priority propagation (e.g. if someone is raising
> the oom_priority in parent, should it be applied to children immediately, etc).

I would really forget about the priority at this stage. This needs
really much more thinking and I consider David's usecase too
specialized to use it as a template for a general purpose oom
prioritization. I might be wrong here of course...

> The oom_group knob meaning also becomes more complex. It affects both
> the victim selection and OOM action. _ANY_ mechanism which allows to affect
> OOM victim selection (either priorities or a bpf-based approach) should
> not have global system-wide meaning, it breaks everything.
>
> I do understand your point, but the same is true for other stuff, right?
> E.g. cpu time distribution (and io, etc) depends on hierarchy configuration.
> It's a limitation, but it's ok, as user should create a hierarchy which
> reflects some logical relations between processes and groups of processes.
> Otherwise we're going to the configuration hell.

And that is _exactly_ my concern. We surely do not want to tell people
that they have to consider their cgroup tree structure to control the
global oom behavior. You simply do not have that constraint with
leaf-only semantic and if kill-all intermediate nodes are used then
there is an explicit opt-in for the hierarchy considerations.

> In any case, OOM is a last resort mechanism. The goal is to reclaim some memory
> and do not crash the system or do not leave it in totally broken state.
> Any really complex mm in userspace should be applied _before_ OOM happens.
> So, I don't think we have to support all possible configurations here,
> if we're able to achieve the main goal (kill some processes and do not leave
> broken systems/containers).

True but we want to have the semantic reasonably understandable. And it
is quite hard to explain that the oom killer hasn't selected the largest
memcg just because it happened to be in a deeper hierarchy which has
been configured to cover a different resource.

I am sorry to repeat myself and I will not argue if there is a
prevalent agreement that level-by-level comparison is considered
desirable and documented behavior but, by all means, do not define this
semantic based on priority requirements and/or implementation details.

-- 
Michal Hocko
SUSE Labs
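
The structural dependence discussed in the message above can be sketched
as follows. This is a minimal Python model, not kernel code: the Memcg
class, the usage numbers (B=3, C=4, D=5, arbitrary units) and both
selection functions are hypothetical, and the kill-all (group_oom) case
is left out to keep the comparison small.

```python
from dataclasses import dataclass, field


@dataclass
class Memcg:
    """Hypothetical stand-in for a memory cgroup (not a kernel structure)."""
    name: str
    own_usage: int = 0  # memory charged directly to this group, arbitrary units
    children: list = field(default_factory=list)

    def usage(self):
        # Hierarchical usage: own charge plus everything below.
        return self.own_usage + sum(c.usage() for c in self.children)


def leaves(node):
    """Return all leaf groups under node (node itself if it has no children)."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def pick_leaf_only(root):
    """Leaf-only semantic: compare every leaf, regardless of where it sits."""
    return max(leaves(root), key=lambda m: m.usage())


def pick_level_by_level(root):
    """Level-by-level semantic: descend into the largest child at each level."""
    node = root
    while node.children:
        node = max(node.children, key=lambda m: m.usage())
    return node


def make_nested():
    # root -> {A -> {B, C}, D}
    a = Memcg("A", children=[Memcg("B", 3), Memcg("C", 4)])
    return Memcg("root", children=[a, Memcg("D", 5)])


def make_flat():
    # root -> {B, C, D}
    return Memcg("root", children=[Memcg("B", 3), Memcg("C", 4), Memcg("D", 5)])


nested = make_nested()
flat = make_flat()

for label, tree in (("nested", nested), ("flat", flat)):
    print(label,
          "leaf-only:", pick_leaf_only(tree).name,
          "level-by-level:", pick_level_by_level(tree).name)
```

With these made-up numbers the leaf-only rule picks D in both layouts,
while the level-by-level rule picks C in the nested layout (A=7 beats
D=5, then C beats B) and D in the flat one. The victim therefore changes
with the tree shape only under level-by-level comparison.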