From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 16 Jul 2020 09:12:40 +0200
From: Michal Hocko <mhocko@kernel.org>
To: David Rientjes <rientjes@google.com>
Cc: Yafang Shao <laoar.shao@gmail.com>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom
Message-ID: <20200716071240.GD31089@dhcp22.suse.cz>
References: <1594735034-19190-1-git-send-email-laoar.shao@gmail.com>
 <alpine.DEB.2.23.453.2007141137110.2381754@chino.kir.corp.google.com>
 <CALOAHbDU1DCzEcat3VWovtf26Ka9XOaj_Zt92meKeb-mXP-SFQ@mail.gmail.com>
 <alpine.DEB.2.23.453.2007141934320.2615810@chino.kir.corp.google.com>
 <CALOAHbB3wHgUPVJvg6trwWpNzeM=atgvoJ4wzih0g0DFdmStYw@mail.gmail.com>
 <alpine.DEB.2.23.453.2007142016240.2667860@chino.kir.corp.google.com>
 <CALOAHbDpoFzR-jeDbTLUzQSE-nU+F3BXNLXJgX-07EUJq6+woA@mail.gmail.com>
 <alpine.DEB.2.23.453.2007151024350.2788464@chino.kir.corp.google.com>
 <20200716060814.GA31089@dhcp22.suse.cz>
 <alpine.DEB.2.23.453.2007152349510.2921049@chino.kir.corp.google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <alpine.DEB.2.23.453.2007152349510.2921049@chino.kir.corp.google.com>
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Wed 15-07-20 23:56:11, David Rientjes wrote:
> On Thu, 16 Jul 2020, Michal Hocko wrote:
> 
> > > I don't think moving the mem_cgroup_margin() check to out_of_memory() 
> > > right before printing the oom info and killing the process is a very 
> > > invasive patch.  Any strong preference against doing it that way?  I think 
> > > moving the check as late as possible to save a process from being killed 
> > > when racing with an exiter or killed process (including perhaps current) 
> > > has a pretty clear motivation.
> > 
> > We have been through this discussion several times in the past IIRC
> > The conclusion has been that the allocator (charging path for
> > the memcg) is the one to define OOM situation. This is an inherently
> > racy situation as long as we are not synchronizing oom with the world,
> > which I believe we agree, we do not want to do. There are few exceptions
> > to bail out early from the oom under certain situations and the trend
> > was to remove some of the existing ones rather than adding new because
> > they had subtle side effects and were prone to lockups.
> > 
> > As much as it might sound attractive to move mem_cgroup_margin resp.
> > last allocation attempt closer to the actual oom killing I haven't seen
> > any convincing data that would support that such a change would make a
> > big difference. select_bad_process is not a free operation as it scales
> > with the number of tasks in the oom domain but it shouldn't be a super
> > expensive. The oom reporting is by far the most expensive part of the
> > operation.
> > 
> > That being said, really convincing data should be presented in order
> > to do such a change. I do not think we want to do that just in case.
> 
> It's not possible to present data because we've had such a check for years 
> in our fleet so I can't say that it has prevented X unnecessary oom kills 
> compared to doing the check prior to calling out_of_memory().  I'm hoping 
> that can be understood.
> 
> Since Yafang is facing the same issue, and there is no significant 
> downside to doing the mem_cgroup_margin() check prior to 
> oom_kill_process() (or checking task_will_free_mem(current)), and it's 
> acknowledged that it *can* prevent unnecessary oom killing, which is a 
> very good thing, I'd like to understand why such resistance to it.

Because exactly this kind of argument has led to quite a few "should be
fine" heuristics which backfired: do not kill an exiting task, sacrifice
a child instead of the victim, just to name a few. All of them make some
sense at a glance but they can seriously backfire, as experience has
taught us.

Really, I do not see what is so hard to understand: each heuristic,
especially in an area as subtle as oom definitely is, needs data to
justify it. "We have been running this for years" is really not an
argument. Sure, arguing that your workload leads to x false positives
and that just shifting the check to later saves y of them sounds like
a relevant argument to me.

> Killing a user process is a serious matter.

It definitely is and I believe that nobody is questioning that. The oom
situation is a serious matter on its own. It says that the system has
failed to balance the memory consumption and the oom killer is the
_last_ resort action to be taken.

> I would fully agree if the 
> margin is only one page: it's still better to kill something off.

And exactly these kinds of details make any heuristic subtle and hard
to maintain.

> But 
> when a process has uncharged memory by means induced by a process waiting 
> on oom notication, such as a userspace kill or dropping of caches from 
> your malloc implementation, that uncharge can be quite substantial and oom 
> killing is then unnecessary.

That should happen quite some time before the hard limit is reached and
we have means to achieve that. The high limit is there to help with
proactive reclaim before the OOM happens on the hard limit. If you are
stuck with v1 then disabling the oom killer and shifting the whole
logic into userspace is another option.
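
To illustrate the high limit point, here is a minimal sketch of setting
the cgroup v2 high limit below the hard limit so that reclaim kicks in
before the hard limit is hit. The memory.high and memory.max file names
are the cgroup v2 interface files; the helper name write_limits, the
high_percent parameter and its 90% default are assumptions made up for
this example, not a recommended policy.

```python
import os


def write_limits(cgroup_dir, max_bytes, high_percent=90):
    """Write memory.max and a lower memory.high into a cgroup v2 directory.

    Hypothetical helper for illustration only. high_percent is an
    assumed tuning knob: the high limit as a percentage of the hard limit.
    """
    if not 0 < high_percent < 100:
        raise ValueError("high_percent must be between 1 and 99")

    # Integer arithmetic keeps the result exact for byte counts.
    high_bytes = max_bytes * high_percent // 100

    # memory.high: usage above this is throttled and reclaimed,
    #              the OOM killer is not invoked for it.
    # memory.max:  the hard limit; the memcg OOM killer runs when the
    #              charge cannot be satisfied below it.
    for name, value in (("memory.high", high_bytes),
                        ("memory.max", max_bytes)):
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(f"{value}\n")

    return high_bytes
```

Pointing cgroup_dir at a real cgroup v2 directory requires write
permission on that group; the size of the gap between the two limits is
workload dependent and would need to be measured.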
 
> I can refresh the patch and send it formally.

-- 
Michal Hocko
SUSE Labs