linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>,
	linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] fs, mm: account filp and names caches to kmemcg
Date: Wed, 25 Oct 2017 12:44:02 -0400	[thread overview]
Message-ID: <20171025164402.GA11582@cmpxchg.org> (raw)
In-Reply-To: <20171025141221.xm4cqp2z6nunr6vy@dhcp22.suse.cz>

On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 09:11:51, Johannes Weiner wrote:
> > On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote:
> [...]
> > > ... we shouldn't make it more loose though.
> > 
> > Then we can end this discussion right now. I pointed out right from
> > the start that the only way to replace -ENOMEM with OOM killing in the
> > syscall is to force charges. If we don't, we either deadlock or still
> > return -ENOMEM occasionally. Nobody has refuted that this is the case.
> 
> Yes this is true. I guess we are back to the non-failing allocations
> discussion...  Currently we are too ENOMEM happy for memcg !PF paths which
> can lead to weird issues Greg has pointed out earlier. Going to opposite
> direction to basically never ENOMEM and rather pretend a success (which
> allows runaways for extreme setups with no oom eligible tasks) sounds
> like going from one extreme to another. This basically means that those
> charges will effectively GFP_NOFAIL. Too much to guarantee IMHO.

We're talking in circles.

Overrunning the hard limit in the syscall path isn't worse than an
infinite loop in the page fault path. Once you make tasks ineligible
for OOM, that's what you have to expect. It's the OOM disabling that
results in extreme consequences across the board.

It's silly to keep bringing this up as an example.

> > They're a challenge as it is.  The only sane options are to stick with
> > the status quo,
> 
> One thing that really worries me about the current status quo is that
> the behavior depends on whether you run under memcg or not. The global
> policy is "almost never fail unless something horrible is going on".
> But we _do not_ guarantee that ENOMEM stays inside the kernel.
> 
> So if we need to do something about that I would think we need an
> universal solution rather than something memcg specific. Sure global
> ENOMEMs are so rare that nobody will probably trigger those but that is
> just a wishful thinking...

The difference in current behavior comes from the fact that real-life
workloads could trigger a deadlock with memcg looping indefinitely in
the charge path, but not with the allocator looping indefinitely in
the allocation path.

That also means that if we change it like you proposed, it'll be much
more likely to trigger -ENOMEM from syscalls inside memcg than it is
outside.

The margins are simply tighter inside cgroups. You often have only a
few closely interacting tasks in there, and the probability for
something to go wrong is much higher than on the system scope.

But also, the system is constrained by physical limits. It's a hard
wall; once you hit it, the kernel is dead, so we don't have a lot of
options. Memcg on the other hand is a completely artificial limit. It
doesn't make sense to me to pretend it's a hard wall to the point
where we actively *create* for ourselves the downsides of a physical
limitation.

> So how about we start with a BIG FAT WARNING for the failure case?
> Something resembling warn_alloc for the failure case.
>
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5d9323028870..3ba62c73eee5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
>  	 * we would fall back to the global oom killer in pagefault_out_of_memory
>  	 */
> -	if (!memcg->oom_kill_disable &&
> -			mem_cgroup_out_of_memory(memcg, mask, order))
> -		return true;
> +	if (!memcg->oom_kill_disable) {
> +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> +			return true;
> +
> +		WARN(!current->memcg_may_oom,
> +				"Memory cgroup charge failed because of no reclaimable memory! "
> +				"This looks like a misconfiguration or a kernel bug.");
> +	}

That's crazy!

We shouldn't create interfaces that make it possible to accidentally
livelock the kernel. Then warn about it and let it crash. That is a
DOS-level lack of OS abstraction.

In such a situation, we should ignore oom_score_adj or ignore the hard
limit. Even panic() would be better from a machine management point of
view than leaving random tasks inside infinite loops.

Why is OOM-disabling a thing? Why isn't this simply a "kill everything
else before you kill me"? It's crashing the kernel in trying to
protect a userspace application. How is that not insane?

  reply	other threads:[~2017-10-25 16:44 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-10-05 22:21 [PATCH] fs, mm: account filp and names caches to kmemcg Shakeel Butt
2017-10-06  7:59 ` Michal Hocko
2017-10-06 19:33   ` Shakeel Butt
2017-10-09  6:24     ` Michal Hocko
2017-10-09 17:52       ` Greg Thelen
2017-10-09 18:04         ` Michal Hocko
2017-10-09 18:17           ` Michal Hocko
2017-10-10  9:10             ` Michal Hocko
2017-10-10 22:21               ` Shakeel Butt
2017-10-11  9:09                 ` Michal Hocko
2017-10-09 20:26         ` Johannes Weiner
2017-10-10  9:14           ` Michal Hocko
2017-10-10 14:17             ` Johannes Weiner
2017-10-10 14:24               ` Michal Hocko
2017-10-12 19:03                 ` Johannes Weiner
2017-10-12 23:57                   ` Greg Thelen
2017-10-13  6:51                     ` Michal Hocko
2017-10-13  6:35                   ` Michal Hocko
2017-10-13  7:00                     ` Michal Hocko
2017-10-13 15:24                       ` Michal Hocko
2017-10-24 12:18                         ` Michal Hocko
2017-10-24 17:54                           ` Johannes Weiner
2017-10-24 16:06                         ` Johannes Weiner
2017-10-24 16:22                           ` Michal Hocko
2017-10-24 17:23                             ` Johannes Weiner
2017-10-24 17:55                               ` Michal Hocko
2017-10-24 18:58                                 ` Johannes Weiner
2017-10-24 20:15                                   ` Michal Hocko
2017-10-25  6:51                                     ` Greg Thelen
2017-10-25  7:15                                       ` Michal Hocko
2017-10-25 13:11                                         ` Johannes Weiner
2017-10-25 14:12                                           ` Michal Hocko
2017-10-25 16:44                                             ` Johannes Weiner [this message]
2017-10-25 17:29                                               ` Michal Hocko
2017-10-25 18:11                                                 ` Johannes Weiner
2017-10-25 19:00                                                   ` Michal Hocko
2017-10-25 21:13                                                     ` Johannes Weiner
2017-10-25 22:49                                                       ` Greg Thelen
2017-10-26  7:49                                                         ` Michal Hocko
2017-10-26 12:45                                                           ` Tetsuo Handa
2017-10-26 14:31                                                         ` Johannes Weiner
2017-10-26 19:56                                                           ` Greg Thelen
2017-10-27  8:20                                                             ` Michal Hocko
2017-10-27 20:50                                               ` Shakeel Butt
2017-10-30  8:29                                                 ` Michal Hocko
2017-10-30 19:28                                                   ` Shakeel Butt
2017-10-31  8:00                                                     ` Michal Hocko
2017-10-31 16:49                                                       ` Johannes Weiner
2017-10-31 18:50                                                         ` Michal Hocko
2017-10-24 15:45                     ` Johannes Weiner
2017-10-24 16:30                       ` Michal Hocko
2017-10-10 23:32 ` Al Viro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171025164402.GA11582@cmpxchg.org \
    --to=hannes@cmpxchg.org \
    --cc=akpm@linux-foundation.org \
    --cc=gthelen@google.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=shakeelb@google.com \
    --cc=vdavydov.dev@gmail.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).