Date: Tue, 16 Feb 2021 13:53:12 -0800 (PST)
From: David Rientjes
To: Michal Hocko
cc: Eiichi Tsukata, corbet@lwn.net, mike.kravetz@oracle.com, mcgrof@kernel.org, keescook@chromium.org, yzaikin@google.com, akpm@linux-foundation.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, felipe.franciosi@nutanix.com
Subject: Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom
References: <20210216030713.79101-1-eiichi.tsukata@nutanix.com>

On Tue, 16 Feb 2021, Michal Hocko wrote:

> > Hugepages can be preallocated to avoid unpredictable allocation latency.
> > If we run into 4k page shortage, the kernel can trigger OOM even though
> > there were free hugepages. When OOM is triggered by user address page
> > fault handler, we can use oom notifier to free hugepages in user space
> > but if it's triggered by memory allocation for kernel, there is no way
> > to synchronously handle it in user space.
>
> Can you expand some more on what kind of problem do you see?
> Hugetlb pages are, by definition, a preallocated, unreclaimable and
> admin controlled pool of pages.

Small nit: true of non-surplus hugetlb pages.

> Under those conditions it is expected
> and required that the sizing would be done very carefully. Why is that a
> problem in your particular setup/scenario?
>
> If the sizing is really done properly and then a random process can
> trigger OOM then this can lead to malfunctioning of those workloads
> which do depend on hugetlb pool, right? So isn't this a kinda DoS
> scenario?
>
> > This patch introduces a new sysctl vm.sacrifice_hugepage_on_oom. If
> > enabled, it first tries to free a hugepage if available before invoking
> > the oom-killer. The default value is disabled not to change the current
> > behavior.
>
> Why is this interface not hugepage size aware? It is quite different to
> release a GB huge page or 2MB one. Or is it expected to release the
> smallest one? To the implementation...
>
> [...]
> > +static int sacrifice_hugepage(void)
> > +{
> > +	int ret;
> > +
> > +	spin_lock(&hugetlb_lock);
> > +	ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);
>
> ... no it is going to release the default huge page. This will be 2MB in
> most cases but this is not given.
>
> Unless I am mistaken this will free up also reserved hugetlb pages. This
> would mean that a page fault would SIGBUS which is very likely not
> something we want to do right? You also want to use oom nodemask rather
> than a full one.
>
> Overall, I am not really happy about this feature even when above is
> fixed, but let's hear more the actual problem first.
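To make those two concrete objections easier to see in code, here is a rough,
untested sketch (not part of the posted patch, only an illustration) of what a
reservation- and nodemask-aware variant of the helper might look like, assuming
the oom path passes its own nodemask and falls back to &node_states[N_MEMORY]
when that is NULL:

/*
 * Untested sketch, not the posted patch: skip hugepages that are still
 * backing reservations and honour a caller-supplied nodemask
 * (e.g. oc->nodemask) instead of hardcoding &node_states[N_MEMORY].
 */
static int sacrifice_hugepage(nodemask_t *nodes_allowed)
{
	struct hstate *h;
	int ret = 0;

	spin_lock(&hugetlb_lock);
	for_each_hstate(h) {
		/* Don't steal pages already reserved for existing mappings */
		if (h->free_huge_pages <= h->resv_huge_pages)
			continue;
		ret = free_pool_huge_page(h, nodes_allowed, 0);
		if (ret)
			break;
	}
	spin_unlock(&hugetlb_lock);

	return ret;
}

Note that for_each_hstate() walks the pools in registration order, so even
this does not answer the question of which page size should be given up first.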
Shouldn't this behavior be possible as an oomd plugin instead, perhaps
triggered by psi? I'm not sure if oomd is intended only to kill something
(oomkilld? lol) or if it can be made to do sysadmin-level behavior, such as
shrinking the hugetlb pool, to solve the oom condition.

If so, it seems like we want to do this at the absolute last minute. In other
words, reclaim has failed to free memory by other means, so we would like to
shrink the hugetlb pool. (That is why it's implemented as a predecessor to oom
rather than as part of reclaim in general.)

Do we have the ability to suppress the oom killer until oomd has a chance to
react in this scenario?
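For illustration only (not a real oomd plugin, and the threshold and polling
interval are made up for the example), a much-simplified userspace loop in
that spirit could watch the PSI memory interface and hand default-size
hugepages back to the buddy allocator through nr_hugepages; actual oomd would
use PSI triggers and proper policy rather than naive polling:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the 10s "full" memory stall average from /proc/pressure/memory. */
static double read_psi_full_avg10(void)
{
	FILE *f = fopen("/proc/pressure/memory", "r");
	char line[256];
	double avg10 = 0.0;

	if (!f)
		return 0.0;
	while (fgets(line, sizeof(line), f))
		if (strncmp(line, "full", 4) == 0)
			sscanf(line, "full avg10=%lf", &avg10);
	fclose(f);
	return avg10;
}

static long read_nr_hugepages(void)
{
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "r");
	long nr = -1;

	if (f) {
		if (fscanf(f, "%ld", &nr) != 1)
			nr = -1;
		fclose(f);
	}
	return nr;
}

/* Shrink the default hugetlb pool by one page. */
static void shrink_pool_by_one(long nr)
{
	FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

	if (f) {
		fprintf(f, "%ld\n", nr - 1);
		fclose(f);
	}
}

int main(void)
{
	for (;;) {
		long nr = read_nr_hugepages();

		/* Arbitrary example threshold: >50% "full" stall over 10s. */
		if (nr > 0 && read_psi_full_avg10() > 50.0)
			shrink_pool_by_one(nr);
		sleep(1);
	}
}

The open question from above still applies: unless the oom killer can be held
off until such a daemon reacts, a kernel-side last resort may fire first.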