From: Johannes Weiner <hannes@cmpxchg.org>
To: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Yafang Shao <laoar.shao@gmail.com>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom
Date: Tue, 28 Jul 2020 14:04:13 -0400
Message-ID: <20200728180237.GA387006@cmpxchg.org>
In-Reply-To: <alpine.DEB.2.23.453.2007152349510.2921049@chino.kir.corp.google.com>

On Wed, Jul 15, 2020 at 11:56:11PM -0700, David Rientjes wrote:
> On Thu, 16 Jul 2020, Michal Hocko wrote:
> 
> > > I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> > > right before printing the oom info and killing the process is a very 
> > > invasive patch.  Any strong preference against doing it that way?  I think 
> > > moving the check as late as possible to save a process from being killed 
> > > when racing with an exiter or killed process (including perhaps current) 
> > > has a pretty clear motivation.
> > 
> > We have been through this discussion several times in the past, IIRC.
> > The conclusion has been that the allocator (the charging path for
> > the memcg) is the one to define the OOM situation. This is an inherently
> > racy situation as long as we are not synchronizing oom with the world,
> > which I believe we agree we do not want to do. There are a few exceptions
> > that bail out early from the oom under certain situations, and the trend
> > has been to remove some of the existing ones rather than add new ones,
> > because they had subtle side effects and were prone to lockups.
> > 
> > As much as it might sound attractive to move the mem_cgroup_margin()
> > check, resp. the last allocation attempt, closer to the actual oom
> > killing, I haven't seen any convincing data to support that such a
> > change would make a big difference. select_bad_process() is not a free
> > operation, as it scales with the number of tasks in the oom domain,
> > but it shouldn't be super expensive. The oom reporting is by far the
> > most expensive part of the operation.
> > 
> > That being said, really convincing data should be presented in order
> > to make such a change. I do not think we want to do it just in case.
> 
> It's not possible to present data because we've had such a check for years 
> in our fleet, so I can't say that it has prevented X unnecessary oom kills 
> compared to doing the check prior to calling out_of_memory().  I'm hoping 
> that can be understood.

That makes sense. I would just be curious whether you can remember
what data points you looked at at the time you made the change.

Were you inspecting individual OOM kills and saw in the memory dump
after the fact that the kill would have been unnecessary?

Or were you looking at aggregate OOM kill rates in your fleet and saw
a measurable reduction?

> Since Yafang is facing the same issue, and there is no significant 
> downside to doing the mem_cgroup_margin() check prior to 
> oom_kill_process() (or checking task_will_free_mem(current)), and it's 
> acknowledged that it *can* prevent unnecessary oom killing, which is a 
> very good thing, I'd like to understand why there is such resistance to it.

As someone who has been fairly vocal against this in the past, I have
to admit I don't feel too strongly about it anymore.

My argument in the past has been that if you're going for a
probabilistic reduction of OOM kills, injecting sleeps in between
reclaim and the last allocation attempt could also increase the chance
of coincidental parallel memory frees. And that just always seemed
somewhat arbitrary to me. I was also worried we'd be opening a can of
worms by allowing this type of tweaking of the OOM killer, when OOM
kills are supposed to be a once-in-a-blue-moon deadlock breaker,
rather than a common resource enforcement strategy.

However, I also have to admit that while artificial sleeps would
indeed be pretty arbitrary and likely controversial, the time it takes
to select a victim is unavoidable. It's something we need to do in any
case. If checking the margin one last time after that helps you bring
down the kill rate somewhat, I'd say go for it. It's a natural line,
and I would agree that the change isn't very invasive.
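For concreteness, a minimal sketch of the kind of late check being
discussed could look something like the below. This is illustrative
only, not the submitted patch: the helper name is made up, and it
assumes mem_cgroup_margin() (currently static in mm/memcontrol.c) is
made visible to mm/oom_kill.c.

/*
 * Sketch: re-check the memcg margin after select_bad_process() has
 * picked a victim, but before the OOM report is dumped and the kill
 * is issued.
 */
static bool oom_kill_still_needed(struct oom_control *oc)
{
	/* Global (non-memcg) OOM: there is no memcg margin to consult. */
	if (!is_memcg_oom(oc))
		return true;

	/*
	 * A racing exit, uncharge, or userspace intervention may have
	 * freed enough memory while the victim was being selected.
	 */
	return mem_cgroup_margin(oc->memcg) < (1 << oc->order);
}

out_of_memory() would then call this right after select_bad_process()
and skip oom_kill_process() when it returns false.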

> Killing a user process is a serious matter.  I would fully agree if the 
> margin were only one page: it's still better to kill something off.  But 
> when a process has uncharged memory by means induced by a process waiting 
> on an oom notification, such as a userspace kill or dropping of caches 
> from your malloc implementation, that uncharge can be quite substantial 
> and oom killing is then unnecessary.
> 
> I can refresh the patch and send it formally.

No objection from me.
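
And if task_will_free_mem(current) were re-checked at that same late
point, as mentioned above, a combined bail-out could look roughly like
the sketch below (again only illustrative, reusing the made-up helper
from the earlier sketch; not the submitted patch):

/*
 * Hypothetical combination of the two late checks mentioned in this
 * thread: skip the kill if current is already exiting and will free
 * its memory, or if the memcg margin has recovered.
 */
static bool oom_skip_kill(struct oom_control *oc)
{
	if (task_will_free_mem(current)) {
		/*
		 * Same treatment out_of_memory() already gives an
		 * exiting current earlier on.
		 */
		mark_oom_victim(current);
		wake_oom_reaper(current);
		return true;
	}
	return !oom_kill_still_needed(oc);
}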

