From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S932211AbdJYVOL (ORCPT <rfc822;w@1wt.eu>);
        Wed, 25 Oct 2017 17:14:11 -0400
Received: from gum.cmpxchg.org ([85.214.110.215]:45200 "EHLO gum.cmpxchg.org"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751668AbdJYVOJ (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
        Wed, 25 Oct 2017 17:14:09 -0400
Date: Wed, 25 Oct 2017 17:13:59 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>, Shakeel Butt <shakeelb@google.com>,
        Alexander Viro <viro@zeniv.linux.org.uk>,
        Vladimir Davydov <vdavydov.dev@gmail.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        Linux MM <linux-mm@kvack.org>, linux-fsdevel@vger.kernel.org,
        LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] fs, mm: account filp and names caches to kmemcg
Message-ID: <20171025211359.GA17899@cmpxchg.org>
References: <20171024185854.GA6154@cmpxchg.org>
 <20171024201522.3z2fjnfywgx2egqx@dhcp22.suse.cz>
 <xr93r2tr67pp.fsf@gthelen.svl.corp.google.com>
 <20171025071522.xyw4lsvdv4xsbhbo@dhcp22.suse.cz>
 <20171025131151.GA8210@cmpxchg.org>
 <20171025141221.xm4cqp2z6nunr6vy@dhcp22.suse.cz>
 <20171025164402.GA11582@cmpxchg.org>
 <20171025172924.i7du5wnkeihx2fgl@dhcp22.suse.cz>
 <20171025181106.GA14967@cmpxchg.org>
 <20171025190057.mqmnprhce7kvsfz7@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20171025190057.mqmnprhce7kvsfz7@dhcp22.suse.cz>
User-Agent: Mutt/1.9.1 (2017-09-22)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
> > "Safe" is a vague term, and it doesn't make much sense to me in this
> > situation. The OOM behavior should be predictable and consistent.
> > 
> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
> > don't have to do that in memcg because we're not physically limited.
> 
> OK, so here seems to be the biggest disconnect. Being physically or
> artificially constrained shouldn't make much difference IMHO. In both
> cases the resource is simply limited for the consumer. And once all the
> attempts to fit within the limit fail then the request for the resource
> has to fail.

It's a huge difference. In the global case, we have to make trade-offs
to not deadlock the kernel. In the memcg case, we have to make a trade
off between desirable OOM behavior and desirable meaning of memory.max.

If we can borrow a resource temporarily from the ether to resolve the
OOM situation, I don't see why we shouldn't. We're only briefly
ignoring the limit to make sure the allocating task isn't preventing
the OOM victim from exiting or the OOM reaper from reaping. It's more
of an implementation detail than interface.

The only scenario you brought up where this might be the permanent
overrun is the single, oom-disabled task. And I explained why that is
a silly argument, why that's the least problematic consequence of
oom-disabling, and why it probably shouldn't even be configurable.

The idea that memory.max must never be breached is an extreme and
narrow view. As Greg points out, there are allocations we do not even
track. There are other scenarios that force allocations. They may
violate the limit on paper, but they're not notably weakening the goal
of memory.max - isolating workloads from each other.

Let's look at it this way.

There are two deadlock relationships the OOM killer needs to solve
between the triggerer and the potential OOM victim:

	#1 Memory. The triggerer needs memory that the victim has,
	    but the victim needs some additional memory to release it.

	#2 Locks. The triggerer needs memory that the victim has, but
	    the victim needs a lock the triggerer holds to release it.

We have no qualms letting the victim temporarily (until the victim's
exit) ignore memory.max to resolve the memory deadlock #1.

I don't understand why it's such a stretch to let the triggerer
temporarily (until the victim's exit) ignore memory.max to resolve the
locks deadlock #2. [1]

We need both for the OOM killer to function correctly.

We've solved #1 both for memcg and globally. But we haven't solved #2.
Global can still deadlock, and memcg copped out and returns -ENOMEM.

Adding speculative OOM killing before the -ENOMEM makes things more
muddy and unpredictable. It doesn't actually solve deadlock #2.

[1] And arguably that's what we should be doing in the global case
    too: give the triggerer access to reserves. If you recall this
    thread here: https://patchwork.kernel.org/patch/6088511/

> > > So the only change I am really proposing is to keep retrying as long
> > > as the oom killer makes a forward progress and ENOMEM otherwise.
> > 
> > That's the behavior change I'm against.
> 
> So just to make it clear you would be OK with the retry on successful
> OOM killer invocation and force charge on oom failure, right?

Yeah, that sounds reasonable to me.