linux-kernel.vger.kernel.org archive mirror
From: David Rientjes <rientjes@google.com>
To: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <guro@fb.com>,
	linux-mm@kvack.org, Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Andrew Morton <akpm@linux-foundation.org>,
	kernel-team@fb.com, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v8 0/4] cgroup-aware OOM killer
Date: Fri, 22 Sep 2017 13:39:55 -0700 (PDT)	[thread overview]
Message-ID: <alpine.DEB.2.10.1709221316290.68140@chino.kir.corp.google.com> (raw)
In-Reply-To: <20170922154426.GF828415@devbig577.frc2.facebook.com>

On Fri, 22 Sep 2017, Tejun Heo wrote:

> > It doesn't have anything to do with my particular usecase, but rather the 
> > ability of userspace to influence the decisions of the kernel.  Previous 
> > to this patchset, when selection is done based on process size, userspace 
> > has full control over selection.  After this patchset, userspace has no 
> > control other than setting all processes to be oom disabled if the largest 
> > memory consumer is to be protected.  Roman's memory.oom_priority provides 
> > a perfect solution for userspace to be able to influence this decision 
> > making and causes no change in behavior for users who choose not to tune 
> > memory.oom_priority.  The nack originates from the general need for 
> > userspace influence over oom victim selection and to avoid userspace 
> > needing to take the rather drastic measure of setting all processes to be 
> > oom disabled to prevent oom kill in kernels before oom priorities are 
> > introduced.
> 
> Overall, I think that OOM killing is the wrong place to implement
> sophisticated intelligence in.  It's too late to be smart - the
> workload already has suffered significantly and there's only very
> limited amount of computing which can be performed.  That said, if
> there's a useful and general enough mechanism to configure OOM killer
> behavior from userland, that can definitely be useful.
> 

What is under discussion is a new way to compare sibling cgroups when 
selecting a victim for oom kill: a new heuristic based on a 
characteristic of the memory cgroup rather than the individual process.  
We want the behavior this patchset implements.  The only thing missing 
is a way for userspace to influence that decision making, in the same 
way that /proc/pid/oom_score_adj allows userspace to influence the 
current heuristic.

The current heuristic, based on processes, is coupled with the 
per-process /proc/pid/oom_score_adj.  The proposed heuristic, based on 
memory cgroups, has no way to be influenced by userspace, and it needs 
one.  Roman's per-memcg memory.oom_priority is the appropriate and 
needed analogue.  It is not "sophisticated intelligence"; it merely 
allows userspace to protect vital memory cgroups when opting into the 
new features (cgroups compared based on size, and memory.oom_group) 
that we very much want.
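To make the parallel concrete, a minimal sketch (all paths, PIDs, and
values are illustrative; memory.oom_priority is the interface proposed
by this patchset, not an existing file, and the exact comparison
semantics among siblings are defined by the patchset itself):

```shell
# Existing per-process knob: bias the current process-size heuristic.
# -1000 disables oom kill for the process entirely.
echo -500 > /proc/$(pidof important-daemon)/oom_score_adj

# Proposed per-memcg analogue: bias the new cgroup-aware comparison
# among siblings, so a vital cgroup can be protected without
# oom-disabling every process inside it.
echo 10 > /sys/fs/cgroup/batch/memory.oom_priority
echo 0  > /sys/fs/cgroup/vital/memory.oom_priority
```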

> We even change the whole scheduling behaviors and try really hard to
> not get locked into specific implementation details which exclude
> future improvements.  Guaranteeing OOM killing selection would be
> crazy.  Why would we prevent ourselves from doing things better in the
> future?  We aren't talking about the semantics of read(2) here.  This
> is a kernel emergency mechanism to avoid deadlock at the last moment.
> 

We merely want to prefer that other memory cgroups are oom killed on 
system oom conditions before important ones, regardless of whether the 
important one is using more memory than the others under the new 
heuristic this patchset introduces.  This is exactly the role that 
/proc/pid/oom_score_adj plays for the current heuristic.

> Here's a really simple use case.  Imagine a system which hosts two
> containers of services and one is somewhat favored over the other and
> wants to set up cgroup hierarchy so that resources are split at the
> top level between the two containers.  oom_priority is set accordingly
> too.  Let's say a low priority maintenance job in the higher priority
> container goes berserk, as they often do, and pushes the system into
> OOM.
> 
> With the proposed static oom_priority mechanism, the only
> configuration which can be expressed is "kill all of the lower top
> level subtree before any of the higher one", which is a silly
> restriction leading to silly behavior and a direct result of
> conflating resource distribution network with level-by-level OOM
> killing decision.
> 

The problem you're describing is an issue with the top-level limits 
after this patchset is merged, not with memory.oom_priority at all.

If they are truly split evenly, this patchset kills the largest process 
from the hierarchy with the most charged memory.  That's unchanged if the 
two priorities are equal.  By changing the priority to be more preferred 
for a hierarchy, you indeed prefer oom kills from the lower priority 
hierarchy.  You've opted in.  One hierarchy is more important than the 
other, regardless of any hypothetical low priority maintenance job going 
berserk.

If you have this low priority maintenance job charging memory to the 
high priority hierarchy, you're already misconfigured unless you adjust 
/proc/pid/oom_score_adj, because in today's kernels it will already 
cause any process larger than itself to be oom killed anyway.

A better configuration would be to attach this hypothetical low 
priority maintenance job to its own sibling cgroup with its own memory 
limit, to avoid exactly that problem: the job going berserk and 
charging so much memory to the high priority container that one of that 
container's processes gets oom killed.
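On cgroup v2, that might look like the following (the path, limit, and
PID variable are illustrative; memory.max and cgroup.procs are the
standard v2 interfaces):

```shell
# Give the maintenance job its own sibling cgroup with a hard limit,
# so its charges are accounted separately and can never push the
# high priority container into an oom kill.
mkdir /sys/fs/cgroup/maintenance
echo 512M > /sys/fs/cgroup/maintenance/memory.max
echo "$MAINT_PID" > /sys/fs/cgroup/maintenance/cgroup.procs
```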


Thread overview: 78+ messages
2017-09-11 13:17 [v8 0/4] cgroup-aware OOM killer Roman Gushchin
2017-09-11 13:17 ` [v8 1/4] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-09-11 20:51   ` David Rientjes
2017-09-14 13:42   ` Michal Hocko
2017-09-11 13:17 ` [v8 2/4] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-09-13 20:46   ` David Rientjes
2017-09-13 21:59     ` Roman Gushchin
2017-09-11 13:17 ` [v8 3/4] mm, oom: add cgroup v2 mount option for " Roman Gushchin
2017-09-11 20:48   ` David Rientjes
2017-09-12 20:01     ` Roman Gushchin
2017-09-12 20:23       ` David Rientjes
2017-09-13 12:23       ` Michal Hocko
2017-09-11 13:17 ` [v8 4/4] mm, oom, docs: describe the " Roman Gushchin
2017-09-11 20:44 ` [v8 0/4] " David Rientjes
2017-09-13 12:29   ` Michal Hocko
2017-09-13 20:46     ` David Rientjes
2017-09-14 13:34       ` Michal Hocko
2017-09-14 20:07         ` David Rientjes
2017-09-13 21:56     ` Roman Gushchin
2017-09-14 13:40       ` Michal Hocko
2017-09-14 16:05         ` Roman Gushchin
2017-09-15 10:58           ` Michal Hocko
2017-09-15 15:23             ` Roman Gushchin
2017-09-15 19:55               ` David Rientjes
2017-09-15 21:08                 ` Roman Gushchin
2017-09-18  6:20                   ` Michal Hocko
2017-09-18 15:02                     ` Roman Gushchin
2017-09-21  8:30                       ` David Rientjes
2017-09-19 20:54                   ` David Rientjes
2017-09-20 22:24                     ` Roman Gushchin
2017-09-21  8:27                       ` David Rientjes
2017-09-18  6:16                 ` Michal Hocko
2017-09-19 20:51                   ` David Rientjes
2017-09-18  6:14               ` Michal Hocko
2017-09-20 21:53                 ` Roman Gushchin
2017-09-25 12:24                   ` Michal Hocko
2017-09-25 17:00                     ` Johannes Weiner
2017-09-25 18:15                       ` Roman Gushchin
2017-09-25 20:25                         ` Michal Hocko
2017-09-26 10:59                           ` Roman Gushchin
2017-09-26 11:21                             ` Michal Hocko
2017-09-26 12:13                               ` Roman Gushchin
2017-09-26 13:30                                 ` Michal Hocko
2017-09-26 17:26                                   ` Johannes Weiner
2017-09-27  3:37                                     ` Tim Hockin
2017-09-27  7:43                                       ` Michal Hocko
2017-09-27 10:19                                         ` Roman Gushchin
2017-09-27 15:35                                         ` Tim Hockin
2017-09-27 16:23                                           ` Roman Gushchin
2017-09-27 18:11                                             ` Tim Hockin
2017-10-01 23:29                                               ` Shakeel Butt
2017-10-02 11:56                                                 ` Tetsuo Handa
2017-10-02 12:24                                                 ` Michal Hocko
2017-10-02 12:47                                                   ` Roman Gushchin
2017-10-02 14:29                                                     ` Michal Hocko
2017-10-02 19:00                                                   ` Shakeel Butt
2017-10-02 19:28                                                     ` Michal Hocko
2017-10-02 19:45                                                       ` Shakeel Butt
2017-10-02 19:56                                                         ` Michal Hocko
2017-10-02 20:00                                                           ` Tim Hockin
2017-10-02 20:08                                                             ` Michal Hocko
2017-10-02 20:20                                                             ` Shakeel Butt
2017-10-02 20:24                                                           ` Shakeel Butt
2017-10-02 20:34                                                             ` Johannes Weiner
2017-10-02 20:55                                                             ` Michal Hocko
2017-09-25 22:21                       ` David Rientjes
2017-09-26  8:46                         ` Michal Hocko
2017-09-26 21:04                           ` David Rientjes
2017-09-27  7:37                             ` Michal Hocko
2017-09-27  9:57                               ` Roman Gushchin
2017-09-21 14:21   ` Johannes Weiner
2017-09-21 21:17     ` David Rientjes
2017-09-21 21:51       ` Johannes Weiner
2017-09-22 20:53         ` David Rientjes
2017-09-22 15:44       ` Tejun Heo
2017-09-22 20:39         ` David Rientjes [this message]
2017-09-22 21:05           ` Tejun Heo
2017-09-23  8:16             ` David Rientjes
