Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

From: Michal Hocko <mhocko@kernel.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>,
	linux-mm@kvack.org, Vladimir Davydov <vdavydov.dev@gmail.com>,
	Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>,
	David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>,
	kernel-team@fb.com, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer
Date: Wed, 6 Sep 2017 10:28:59 +0200	[thread overview]
Message-ID: <20170906082859.qlqenftxuib64j35@dhcp22.suse.cz> (raw)
In-Reply-To: <20170905215344.GA27427@cmpxchg.org>

On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > should be preserved by default and specific workloads should opt in for
> > the cgroup aware logic. Changing the global behavior depending on
> > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > wrong approach to take. I think we should instead go with 
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> > 
> > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > biggest_task which is the default. You are adding cgroup and the more I
> > think about the more I agree that it doesn't really make sense to try to
> > fit thew new semantic into the existing one (compare tasks to kill-all
> > memcgs). Just introduce a new strategy and define a new semantic from
> > scratch. Memcg priority and kill-all are a natural extension of this new
> > strategy. This will make the life easier and easier to understand by
> > users.
> 
> oom_kill_allocating_task is actually a really good example of why
> cgroup-awareness *should* be the new default.
> 
> Before we had the oom killer victim selection, we simply killed the
> faulting/allocating task. While a valid answer to the problem, it's
> not very fair or representative of what the user wants or intends.
> 
> Then we added code to kill the biggest offender instead, which should
> have been the case from the start and was hence made the new default.
> The oom_kill_allocating_task was added on the off-chance that there
> might be setups who, for historical reasons, rely on the old behavior.
> But our default was chosen based on what behavior is fair, expected,
> and most reflective of the user's intentions.

I am not sure this is how things evolved actually. This is way before
my time so my git log interpretation might be imprecise. We do have
oom_badness heuristic since out_of_memory has been introduced and
oom_kill_allocating_task has been introduced much later because of large
boxes with zillions of tasks (SGI I suspect) which took too long to
select a victim so David has added this heuristic.

> The cgroup-awareness in the OOM killer is exactly the same thing. It
> should have been the default from the beginning, because the user
> configures a group of tasks to be an interdependent, terminal unit of
> memory consumption, and it's undesirable for the OOM killer to ignore
> this intention and compare members across these boundaries.

I would agree if that was true in general. I can completely see how the
cgroup awareness is useful in e.g. containerized environments (especially
with kill-all enabled) but memcgs are used in a large variety of
usecases and I cannot really say all of them really demand the new
semantic. Say I have a workload which doesn't want to see reclaim
interference from others on the same machine. Why should I kill a
process from that particular memcg just because it is the largest one
when there is a memory hog/leak outside of this memcg?

>From my point of view the safest (in a sense of the least surprise)
way to go with opt-in for the new heuristic. I am pretty sure all who
would benefit from the new behavior will enable it while others will not
regress in unexpected way.

We can talk about the way _how_ to control these oom strategies, of
course. But I would be really reluctant to change the default which is
used for years and people got used to it.
-- 
Michal Hocko
SUSE Labs