References: <1594735034-19190-1-git-send-email-laoar.shao@gmail.com>
From: Yafang Shao
Date: Wed, 15 Jul 2020 11:10:07 +0800
Subject: Re: [PATCH v2] memcg, oom: check memcg margin for parallel oom
To: David Rientjes
Cc: Michal Hocko, Tetsuo Handa, Andrew Morton, Johannes Weiner, Linux MM

On Wed, Jul 15, 2020 at 10:44 AM David Rientjes wrote:
>
> On Wed, 15 Jul 2020, Yafang Shao wrote:
>
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 1962232..15e0e18 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1560,15 +1560,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >  		.gfp_mask = gfp_mask,
> > > >  		.order = order,
> > > >  	};
> > > > -	bool ret;
> > > > +	bool ret = true;
> > > >
> > > >  	if (mutex_lock_killable(&oom_lock))
> > > >  		return true;
> > > > +
> > > > +	if (mem_cgroup_margin(memcg) >= (1 << order))
> > > > +		goto unlock;
> > > > +
> > > >  	/*
> > > >  	 * A few threads which were not waiting at mutex_lock_killable() can
> > > >  	 * fail to bail out. Therefore, check again after holding oom_lock.
> > > >  	 */
> > > >  	ret = should_force_charge() || out_of_memory(&oc);
> > > > +
> > > > +unlock:
> > > >  	mutex_unlock(&oom_lock);
> > > >  	return ret;
> > > >  }
> > >
> > > Hi Yafang,
> > >
> > > We've run with a patch very much like this for several years and it works
> > > quite successfully to prevent the unnecessary oom killing of processes.
> > >
> > > We do this in out_of_memory() directly, however, because we found that we
> > > could prevent even *more* unnecessary killing if we checked this at the
> > > "point of no return", because the selection of processes takes some
> > > additional time during which we might resolve the oom condition.
> > >
> >
> > Hi David,
> >
> > Your proposal could also resolve the issue,
>
> It has successfully resolved it for several years in our kernel. We tried
> an approach similar to yours but saw many instances where memcg oom kills
> continued to proceed even though the memcg information dumped to the
> kernel log showed memory available.
>
> If this was a page or two that became available due to memory freeing,
> it's not a significant difference. Instead, if this races with an oom
> notification and a process exiting or being SIGKILL'd, it becomes much
> harder to explain to a user why their process was oom killed when there
> are tens of megabytes of memory available as shown by the kernel log (the
> freeing/exiting happened during a particularly long iteration of processes
> attached to the memcg, for example).
>
> That's what motivated moving this check into out_of_memory() directly;
> we found that it prevented even more unnecessary oom kills, which is a
> very good thing. It may only be easily observable and make a significant
> difference at very large scale, however.
>

Thanks for the clarification. If it is this race which causes the issue and we want to reduce the race window, I'm not sure whether it is proper to check the memcg margin in out_of_memory() or to do it right before calling do_send_sig_info(). Per my understanding, dump_header() always takes much more time than select_bad_process(), especially if there are slow consoles, so the race can easily happen while dump_header() or other information is being dumped. But if we check the memcg margin only after dumping all that oom info, it would be strange to dump so many oom logs without killing a process.
> > but I'm wondering why do
> > it specifically for memcg oom?
> > Doesn't it apply to global oom?
> > For example, in the global oom, while we are selecting the processes,
> > the others might free some pages and then the allocation might
> > succeed.
> >
>
> It's more complex because memory being allocated from the page allocator
> must be physically contiguous; it's not a simple matter of comparing the
> margin of available memory to the memory being charged. It could
> theoretically be done, but I haven't seen a use case where it has actually
> mattered, as opposed to memcg oom where it can happen quite readily at
> scale. When memory is uncharged to a memcg because of large freeing or a
> process exiting, that memory is immediately chargeable by another process
> in the same hierarchy because of its isolation, as opposed to the page
> allocator where that memory is up for grabs and anything on the system
> could allocate it.
>
> > > Some may argue that this is unnecessarily exposing mem_cgroup_margin() to
> > > generic mm code, but in the interest of preventing any unnecessary oom
> > > kill we've found it to be helpful.
> > >
> > > I proposed a variant of this in https://lkml.org/lkml/2020/3/11/1089.
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -798,6 +798,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  void mem_cgroup_split_huge_fixup(struct page *head);
> > >  #endif
> > >
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
> > > +
> > >  #else /* CONFIG_MEMCG */
> > >
> > >  #define MEM_CGROUP_ID_SHIFT	0
> > > @@ -825,6 +827,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > >  {
> > >  }
> > >
> > > +static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > >  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> > >  						  bool in_low_reclaim)
> > >  {
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1282,7 +1282,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> > >   * Returns the maximum amount of memory @mem can be charged with, in
> > >   * pages.
> > >   */
> > > -static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > > +unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
> > >  {
> > >  	unsigned long margin = 0;
> > >  	unsigned long count;
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -1109,9 +1109,23 @@ bool out_of_memory(struct oom_control *oc)
> > >  		if (!is_sysrq_oom(oc) && !is_memcg_oom(oc))
> > >  			panic("System is deadlocked on memory\n");
> > >  	}
> > > -	if (oc->chosen && oc->chosen != (void *)-1UL)
> > > +	if (oc->chosen && oc->chosen != (void *)-1UL) {
> > > +		if (is_memcg_oom(oc)) {
> > > +			/*
> > > +			 * If a memcg is now under its limit or current will be
> > > +			 * exiting and freeing memory, avoid needlessly killing
> > > +			 * chosen.
> > > +			 */
> > > +			if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order) ||
> > > +			    task_will_free_mem(current)) {
> > > +				put_task_struct(oc->chosen);
> > > +				return true;
> > > +			}
> > > +		}
> > > +
> > >  		oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
> > >  				"Memory cgroup out of memory");
> > > +	}
> > >  	return !!oc->chosen;
> > >  }
> > >
> > >
> >
> > --
> > Thanks
> > Yafang

--
Thanks
Yafang