Date: Tue, 14 Jul 2020 19:44:03 -0700 (PDT)
From: David Rientjes
To: Yafang Shao
Cc: Michal Hocko, Tetsuo Handa, Andrew Morton, Johannes Weiner, Linux MM
Subject: Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom
References: <1594735034-19190-1-git-send-email-laoar.shao@gmail.com>

On Wed, 15 Jul 2020, Yafang Shao wrote:

> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 1962232..15e0e18 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >  		.gfp_mask = gfp_mask,
> > >  		.order = order,
> > >  	};
> > > -	bool ret;
> > > +	bool ret = true;
> > >
> > >  	if (mutex_lock_killable(&oom_lock))
> > >  		return true;
> > > +
> > > +	if (mem_cgroup_margin(memcg) >= (1 << order))
> > > +		goto unlock;
> > > +
> > >  	/*
> > >  	 * A few threads which were not waiting at mutex_lock_killable() can
> > >  	 * fail to bail out. Therefore, check again after holding oom_lock.
> > >  	 */
> > >  	ret = should_force_charge() || out_of_memory(&oc);
> > > +
> > > +unlock:
> > >  	mutex_unlock(&oom_lock);
> > >  	return ret;
> > >  }
> >
> > Hi Yafang,
> >
> > We've run with a patch very much like this for several years and it works
> > quite successfully to prevent the unnecessary oom killing of processes.
> >
> > We do this in out_of_memory() directly, however, because we found that we
> > could prevent even *more* unnecessary killing if we checked this at the
> > "point of no return" because the selection of processes takes some
> > additional time when we might resolve the oom condition.
> >
>
> Hi David,
>
> Your proposal could also resolve the issue,

It has successfully resolved it for several years in our kernel. We tried
an approach similar to yours, but saw many instances where memcg oom kills
continued to proceed even though the memcg information dumped to the
kernel log showed memory available.

If this was a page or two that became available due to memory freeing,
it's not a significant difference. But if this races with an oom
notification and a process exiting or being SIGKILL'd, it becomes much
harder to explain to a user why their process was oom killed when there
are tens of megabytes of memory available as shown by the kernel log (the
freeing/exiting happened during a particularly long iteration of processes
attached to the memcg, for example).

That's what motivated the change to do this in out_of_memory() directly:
we found that it prevented even more unnecessary oom kills, which is a
very good thing. It may only be easily observable, and make a significant
difference, at very large scale, however.

> but I'm wondering why do it specifically for memcg oom?
> Doesn't it apply to global oom?
> For example, in the global oom, when selecting the processes, the
> others might free some pages and then it might allocate pages
> successfully.
It's more complex because memory being allocated from the page allocator
must be physically contiguous; it's not a simple matter of comparing the
margin of available memory to the memory being charged. It could
theoretically be done, but I haven't seen a use case where it has actually
mattered, as opposed to memcg oom where it can happen quite readily at
scale.

When memory is uncharged to a memcg because of large freeing or a process
exiting, that memory is immediately chargeable by another process in the
same hierarchy because of its isolation, as opposed to the page allocator
where that memory is up for grabs and anything on the system could
allocate it.

> > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to
> > generic mm code, but in the interest of preventing any unnecessary oom
> > kill we've found it to be helpful.
> >
> > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> >  void mem_cgroup_split_huge_fixup(struct page *head);
> >  #endif
> >
> > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> > +
> >  #else /* CONFIG_MEMCG */
> >
> >  #define MEM_CGROUP_ID_SHIFT	0
> > @@ -825,6 +827,11 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> >  {
> >  }
> >
> > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > +{
> > +	return 0;
> > +}
> > +
> >  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> >  						  bool in_low_reclaim)
> >  {
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> >   * Returns the maximum amount of memory @mem can be charged with, in
> >   * pages.
> >   */
> > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> >  {
> >  	unsigned long margin = 0;
> >  	unsigned long count;
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
> >  		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
> >  			panic("System is deadlocked on memory\n");
> >  	}
> > -	if (oc->chosen && oc->chosen != (void *)-1UL)
> > +	if (oc->chosen && oc->chosen != (void *)-1UL) {
> > +		if (is_memcg_oom(oc)) {
> > +			/*
> > +			 * If a memcg is now under its limit or current will be
> > +			 * exiting and freeing memory, avoid needlessly killing
> > +			 * chosen.
> > +			 */
> > +			if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
> > +			    task_will_free_mem(current)) {
> > +				put_task_struct(oc->chosen);
> > +				return true;
> > +			}
> > +		}
> > +
> >  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> >  				 "Memory cgroup out of memory");
> > +	}
> >  	return !!oc->chosen;
> >  }
>
> --
> Thanks
> Yafang