References: <1594735034-19190-1-git-send-email-laoar.shao@gmail.com>
From: Yafang Shao
Date: Wed, 15 Jul 2020 11:10:07 +0800
Subject: Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom
To: David Rientjes
Cc: Michal Hocko, Tetsuo Handa, Andrew Morton, Johannes Weiner, Linux MM

On Wed, Jul 15, 2020 at 10:44 AM David Rientjes wrote:
>
> On Wed, 15 Jul 2020, Yafang Shao wrote:
>
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 1962232..15e0e18 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >  		.gfp_mask = gfp_mask,
> > > >  		.order = order,
> > > >  	};
> > > > -	bool ret;
> > > > +	bool ret = true;
> > > >
> > > >  	if (mutex_lock_killable(&oom_lock))
> > > >  		return true;
> > > > +
> > > > +	if (mem_cgroup_margin(memcg) >= (1 << order))
> > > > +		goto unlock;
> > > > +
> > > >  	/*
> > > >  	 * A few threads which were not waiting at mutex_lock_killable() can
> > > >  	 * fail to bail out. Therefore, check again after holding oom_lock.
> > > >  	 */
> > > >  	ret = should_force_charge() || out_of_memory(&oc);
> > > > +
> > > > +unlock:
> > > >  	mutex_unlock(&oom_lock);
> > > >  	return ret;
> > > >  }
> > >
> > > Hi Yafang,
> > >
> > > We've run with a patch very much like this for several years and it works
> > > quite successfully to prevent the unnecessary oom killing of processes.
> > >
> > > We do this in out_of_memory() directly, however, because we found that we
> > > could prevent even *more* unnecessary killing if we checked this at the
> > > "point of no return", because the selection of processes takes some
> > > additional time during which we might resolve the oom condition.
> > >
> >
> > Hi David,
> >
> > Your proposal could also resolve the issue,
>
> It has successfully resolved it for several years in our kernel. We tried
> an approach similar to yours but saw many instances where memcg oom kills
> continued to proceed even though the memcg information dumped to the
> kernel log showed memory available.
>
> If this was a page or two that became available due to memory freeing,
> it's not a significant difference. Instead, if this races with an oom
> notification and a process exiting or being SIGKILL'd, it becomes much
> harder to explain to a user why their process was oom killed when there
> are tens of megabytes of memory available as shown by the kernel log (the
> freeing/exiting happened during a particularly long iteration of processes
> attached to the memcg, for example).
>
> That's what motivated moving this check into out_of_memory() directly;
> we found that it prevented even more unnecessary oom kills, which is a
> very good thing. It may only be easily observable and make a significant
> difference at very large scale, however.
>

Thanks for the clarification. If it is this race which causes the issue and we want to reduce the race window, I'm not sure whether it is proper to check the memcg margin in out_of_memory() or to do it right before calling do_send_sig_info(). Per my understanding, dump_header() always takes much more time than select_bad_process(), especially if there are slow consoles, so the race can easily happen while dump_header() or other information is being dumped. But if we check the memcg margin only after dumping all that oom info, it would be strange to dump so many oom logs without killing a process.
> > but I'm wondering why do
> > it specifically for memcg oom?
> > Doesn't it apply to global oom?
> > For example, in the global oom, while we are selecting the processes,
> > the others might free some pages and then the allocation might
> > succeed.
> >
>
> It's more complex because memory being allocated from the page allocator
> must be physically contiguous; it's not a simple matter of comparing the
> margin of available memory to the memory being charged. It could
> theoretically be done, but I haven't seen a use case where it has actually
> mattered, as opposed to memcg oom where it can happen quite readily at
> scale. When memory is uncharged to a memcg because of large freeing or a
> process exiting, that memory is immediately chargeable by another process
> in the same hierarchy because of its isolation, as opposed to the page
> allocator where that memory is up for grabs and anything on the system
> could allocate it.
>
> > > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to
> > > generic mm code, but in the interest of preventing any unnecessary oom
> > > kill we've found it to be helpful.
> > >
> > > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  void mem_cgroup_split_huge_fixup(struct page *head);
> > >  #endif
> > >
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> > > +
> > >  #else /* CONFIG_MEMCG */
> > >
> > >  #define MEM_CGROUP_ID_SHIFT	0
> > > @@ -825,6 +827,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  {
> > >  }
> > >
> > > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > >  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> > >  						  bool in_low_reclaim)
> > >  {
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> > >   * Returns the maximum amount of memory @mem can be charged with, in
> > >   * pages.
> > >   */
> > > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > >  {
> > >  	unsigned long margin = 0;
> > >  	unsigned long count;
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
> > >  		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
> > >  			panic("System is deadlocked on memory\n");
> > >  	}
> > > -	if (oc->chosen && oc->chosen != (void *)-1UL)
> > > +	if (oc->chosen && oc->chosen != (void *)-1UL) {
> > > +		if (is_memcg_oom(oc)) {
> > > +			/*
> > > +			 * If a memcg is now under its limit or current will be
> > > +			 * exiting and freeing memory, avoid needlessly killing
> > > +			 * chosen.
> > > +			 */
> > > +			if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
> > > +			    task_will_free_mem(current)) {
> > > +				put_task_struct(oc->chosen);
> > > +				return true;
> > > +			}
> > > +		}
> > > +
> > >  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> > >  				"Memory cgroup out of memory");
> > > +	}
> > >  	return !!oc->chosen;
> > >  }
> > >
> > >
> >
> > --
> > Thanks
> > Yafang

--
Thanks
Yafang