From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1C390C4332F for ; Wed, 22 Sep 2021 09:14:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 053D961211 for ; Wed, 22 Sep 2021 09:14:17 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234253AbhIVJPp (ORCPT ); Wed, 22 Sep 2021 05:15:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38966 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233741AbhIVJPo (ORCPT ); Wed, 22 Sep 2021 05:15:44 -0400 Received: from mail-qk1-x72e.google.com (mail-qk1-x72e.google.com [IPv6:2607:f8b0:4864:20::72e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C44B9C061574; Wed, 22 Sep 2021 02:14:14 -0700 (PDT) Received: by mail-qk1-x72e.google.com with SMTP id i132so7584901qke.1; Wed, 22 Sep 2021 02:14:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=r8/U5lhFhptIEj0YScg0hSUpvj2Pgosre/gr4ODgB5s=; b=dDuVpMj2Ae6qRegobVXGfopn8pjKX0AMRL82MzRNRDu+GttaejhnWni8uoZrQjRrP4 Yui66eF5A3wtcHwksE0seeMUwPjbzBB5EThhWJdOEoCcrJyiF5sE08i5TD9NdL870kYh EdUeX94tHmSvtVl1OzvJuyM9r6oS7IaDL5AuNfueb2DKYcFpSNv0oKFjimzXaWa5+n6a haDUDoVb03wMseGCw01qPiTwottZx62jRqbavs7d+k4kYH5B1pXAmP2xgb57G75gFRvm cdrOxGq4dtoASjeUUOKH+Rc6XMI7oIvGn/4FdmTdkzU+PI+Ai50wye/nfOaSRY4UGB7J ymvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=r8/U5lhFhptIEj0YScg0hSUpvj2Pgosre/gr4ODgB5s=; b=FKmokXaeD3UrDJxkVJJMl02Z5nEdFwHY+5gxt57rC0tb1N50oGKXm8Zc8ijqWpATyd FID9aDfJntSl7vFHpFj0KXb4YmCZI23+09/YCf9w4wre6iZv/b6CNNQIU4SpkWMzJ1LJ 6myCjC52RzRhEOsCCQdO6NSNSzAVuSdlBc63syHKlVesvCo0xD2Dxc1Wh54iNpHcAaow shPMMZ5JkyeID0Ug4GWz1OxsyiCGDI4XVA1f0odqxkyu4a0nlIU4DJw0wH0oNQCMirEF Ynv0yizA1yb/dQ9O/+2an5WSXDRjOOgmFm5/PSUb6NDUOPG4PH/xQ39qo73vhg8d0SlH yvgA== X-Gm-Message-State: AOAM530eIn8mmwS4BcFMxoGDCAjBbdfMXkoNOUdNJK+vKvT+A6gagYQp 7AeBIVt5PDWvPaONnJX34aY1caZZIS5l0gjl4P8= X-Google-Smtp-Source: ABdhPJw0isA2c57xhycThegrIms8reGZh0oxHfr3J2t/iZmmcplr+bD/0vczyCZa/RuTZdPPQ2dPbpQZeondKZiAoVc= X-Received: by 2002:a25:d9c8:: with SMTP id q191mr46022669ybg.61.1632302053817; Wed, 22 Sep 2021 02:14:13 -0700 (PDT) MIME-Version: 1.0 References: <20210918113201.4133-1-yaozhenguo1@gmail.com> In-Reply-To: From: zhenguo yao Date: Wed, 22 Sep 2021 17:14:02 +0800 Message-ID: Subject: Re: [PATCH v5] hugetlbfs: Extend the definition of hugepages parameter to support node allocation To: Mike Rapoport Cc: Mike Kravetz , mpe@ellerman.id.au, benh@kernel.crashing.org, paulus@samba.org, corbet@lwn.net, Andrew Morton , yaozhenguo@jd.com, Matthew Wilcox , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Linux Memory Management List Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Thanks for your review. Mike Rapoport =E4=BA=8E2021=E5=B9=B49=E6=9C=8819=E6=97=A5= =E5=91=A8=E6=97=A5 =E4=B8=8A=E5=8D=883:28=E5=86=99=E9=81=93=EF=BC=9A > > On Sat, Sep 18, 2021 at 07:32:01PM +0800, yaozhenguo wrote: > > We can specify the number of hugepages to allocate at boot. But the > > hugepages is balanced in all nodes at present. In some scenarios, > > we only need hugepages in one node. For example: DPDK needs hugepages > > which are in the same node as NIC. if DPDK needs four hugepages of 1G > > size in node1 and system has 16 numa nodes. We must reserve 64 hugepage= s > > in kernel cmdline. But, only four hugepages are used. The others should > > be free after boot. If the system memory is low(for example: 64G), it w= ill > > be an impossible task. So, Extending hugepages parameter to support > > specifying hugepages at a specific node. > > For example add following parameter: > > > > hugepagesz=3D1G hugepages=3D0:1,1:3 > > > > It will allocate 1 hugepage in node0 and 3 hugepages in node1. > > > > Signed-off-by: yaozhenguo > > --- > > v4->v5 changes: > > - remove BUG_ON in __alloc_bootmem_huge_page. > > - add nid parameter in __alloc_bootmem_huge_page to support > > called in both specific node alloc and normal alloc. > > - do normal alloc if architecture can't support node specific all= oc. > > - return -1 in powerpc version of alloc_bootmem_huge_page when > > nid is not NUMA_NO_NODE. > > I think, node format should not be enabled on powerpc. Rather than add a > new return value to alloc_bootmem_huge_page() and add complex checks for > -1, 0 and 1, it would be simpler to add a new weak function. The default > implementation will return true and powerpc will have an override that > returns false. > This function can be called during command line parsing and if it returne= d > false the entire per-node allocation path would be disabled. > Seems more clearly and simpler. I will change it. > > > > --- > > .../admin-guide/kernel-parameters.txt | 8 +- > > Documentation/admin-guide/mm/hugetlbpage.rst | 12 +- > > arch/powerpc/mm/hugetlbpage.c | 8 +- > > include/linux/hugetlb.h | 5 +- > > mm/hugetlb.c | 155 ++++++++++++++++-- > > 5 files changed, 166 insertions(+), 22 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Document= ation/admin-guide/kernel-parameters.txt > > index bdb22006f..a2046b2c5 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -1588,9 +1588,11 @@ > > the number of pages of hugepagesz to be allocated= . > > If this is the first HugeTLB parameter on the com= mand > > line, it specifies the number of pages to allocat= e for > > - the default huge page size. See also > > - Documentation/admin-guide/mm/hugetlbpage.rst. > > - Format: > > + the default huge page size. If using node format,= the > > + number of pages to allocate per-node can be speci= fied. > > + See also Documentation/admin-guide/mm/hugetlbpage= .rst. > > + Format: or (node format) > > + :[,:] > > > > hugepagesz=3D > > [HW] The size of the HugeTLB pages. This is used= in > > diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentati= on/admin-guide/mm/hugetlbpage.rst > > index 8abaeb144..d70828c07 100644 > > --- a/Documentation/admin-guide/mm/hugetlbpage.rst > > +++ b/Documentation/admin-guide/mm/hugetlbpage.rst > > @@ -128,7 +128,9 @@ hugepages > > implicitly specifies the number of huge pages of default size to > > allocate. If the number of huge pages of default size is implici= tly > > specified, it can not be overwritten by a hugepagesz,hugepages > > - parameter pair for the default size. > > + parameter pair for the default size. This parameter also has a > > + node format. The node format specifies the number of huge pages > > + to allocate on specific nodes. > > > > For example, on an architecture with 2M default huge page size:: > > > > @@ -138,6 +140,14 @@ hugepages > > indicating that the hugepages=3D512 parameter is ignored. If a h= ugepages > > parameter is preceded by an invalid hugepagesz parameter, it will > > be ignored. > > + > > + Node format example:: > > + > > + hugepagesz=3D2M hugepages=3D0:1,1:2 > > + > > + It will allocate 1 2M hugepage on node0 and 2 2M hugepages on nod= e1. > > + If the node number is invalid, the parameter will be ignored. > > + > > default_hugepagesz > > Specify the default huge page size. This parameter can > > only be specified once on the command line. default_hugepagesz c= an > > diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpag= e.c > > index 9a75ba078..4a9cea714 100644 > > --- a/arch/powerpc/mm/hugetlbpage.c > > +++ b/arch/powerpc/mm/hugetlbpage.c > > @@ -19,6 +19,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -232,14 +233,17 @@ static int __init pseries_alloc_bootmem_huge_page= (struct hstate *hstate) > > #endif > > > > > > -int __init alloc_bootmem_huge_page(struct hstate *h) > > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid) > > { > > > > #ifdef CONFIG_PPC_BOOK3S_64 > > + /* Don't support node specific alloc */ > > + if (nid !=3D NUMA_NO_NODE) > > + return -1; > > if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled()) > > return pseries_alloc_bootmem_huge_page(h); > > #endif > > - return __alloc_bootmem_huge_page(h); > > + return __alloc_bootmem_huge_page(h, nid); > > } > > > > #ifndef CONFIG_PPC_BOOK3S_64 > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > > index f7ca1a387..a9bdbc870 100644 > > --- a/include/linux/hugetlb.h > > +++ b/include/linux/hugetlb.h > > @@ -605,6 +605,7 @@ struct hstate { > > unsigned long nr_overcommit_huge_pages; > > struct list_head hugepage_activelist; > > struct list_head hugepage_freelists[MAX_NUMNODES]; > > + unsigned int max_huge_pages_node[MAX_NUMNODES]; > > unsigned int nr_huge_pages_node[MAX_NUMNODES]; > > unsigned int free_huge_pages_node[MAX_NUMNODES]; > > unsigned int surplus_huge_pages_node[MAX_NUMNODES]; > > @@ -637,8 +638,8 @@ void restore_reserve_on_error(struct hstate *h, str= uct vm_area_struct *vma, > > unsigned long address, struct page *page)= ; > > > > /* arch callback */ > > -int __init __alloc_bootmem_huge_page(struct hstate *h); > > -int __init alloc_bootmem_huge_page(struct hstate *h); > > +int __init __alloc_bootmem_huge_page(struct hstate *h, int nid); > > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid); > > > > void __init hugetlb_add_hstate(unsigned order); > > bool __init arch_hugetlb_valid_size(unsigned long size); > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index dfc940d52..f7aaca6ff 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate; > > static unsigned long __initdata default_hstate_max_huge_pages; > > static bool __initdata parsed_valid_hugepagesz =3D true; > > static bool __initdata parsed_default_hugepagesz; > > +static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata= ; > > > > /* > > * Protects updates to hugepage_freelists, hugepage_activelist, nr_hug= e_pages, > > @@ -2774,17 +2775,30 @@ struct page *alloc_huge_page(struct vm_area_str= uct *vma, > > vma_end_reservation(h, vma, addr); > > return ERR_PTR(-ENOSPC); > > } > > - > > -int alloc_bootmem_huge_page(struct hstate *h) > > +/* > > + * Architecture version must provide there own node specific hugepage > > ^ its > > > + * allocation code. If not, please return -1 when nid is not NUMA_NO_N= ODE. > > + */ > > +int alloc_bootmem_huge_page(struct hstate *h, int nid) > > __attribute__ ((weak, alias("__alloc_bootmem_huge_page"))); > > -int __alloc_bootmem_huge_page(struct hstate *h) > > +int __alloc_bootmem_huge_page(struct hstate *h, int nid) > > { > > struct huge_bootmem_page *m; > > int nr_nodes, node; > > + void *addr =3D NULL; > > > > + /* do node specific alloc */ > > + if (nid !=3D NUMA_NO_NODE && nid < nodes_weight(node_states[N_MEM= ORY])) { > > + addr =3D memblock_alloc_try_nid_raw(huge_page_size(h), hu= ge_page_size(h), > > + 0, MEMBLOCK_ALLOC_ACCESSIBLE, nid); > > + if (addr) { > > + m =3D addr; > > + goto found; > > + } else { > > + return 0; > > + } > > if (!addr) > return 0; > m =3D addr; > goto found; > > seems simpler to me. > > Besides, I think addr is not needed at all, m can be assigned directly. I= 'd > also rename it to e.g. page. > Yes, addr seems useless. I will remove it. > > + } > > for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEM= ORY]) { > > - void *addr; > > - > > addr =3D memblock_alloc_try_nid_raw( > > huge_page_size(h), huge_page_size(h), > > 0, MEMBLOCK_ALLOC_ACCESSIBLE, node); > > @@ -2801,7 +2815,6 @@ int __alloc_bootmem_huge_page(struct hstate *h) > > return 0; > > > > found: > > - BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h))); > > /* Put them into a private list first because mem_map is not up y= et */ > > INIT_LIST_HEAD(&m->list); > > list_add(&m->list, &huge_boot_pages); > > @@ -2841,11 +2854,83 @@ static void __init gather_bootmem_prealloc(void= ) > > cond_resched(); > > } > > } > > +/* > > + * do node specific hugepage allocation > > + * return 0: allocation is failed > > + * 1: allocation is successful > > + * -1: alloc_bootmem_huge_page can't support node specific alloca= tion > > + */ > > +static int __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h,= int nid) > > +{ > > + unsigned long i; > > + char buf[32]; > > + int ret =3D 1; > > + > > + for (i =3D 0; i < h->max_huge_pages_node[nid]; ++i) { > > + if (hstate_is_gigantic(h)) { > > + ret =3D alloc_bootmem_huge_page(h, nid); > > + if (ret =3D=3D 0 || ret =3D=3D -1) > > + break; > > + } else { > > + struct page *page; > > + gfp_t gfp_mask =3D htlb_alloc_mask(h) | __GFP_THI= SNODE; > > + > > + page =3D alloc_fresh_huge_page(h, gfp_mask, nid, > > + &node_states[N_MEMORY], NULL); > > + if (!page) { > > + ret =3D 0; > > + break; > > + } > > + put_page(page); /* free it into the hugepage allo= cator */ > > + } > > + cond_resched(); > > + } > > + if (i =3D=3D h->max_huge_pages_node[nid]) > > + return ret; > > + > > + string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32); > > + pr_warn("HugeTLB: allocating %u of page size %s failed node%d. O= nly allocated %lu hugepages.\n", > > + h->max_huge_pages_node[nid], buf, nid, i); > > + /* > > + * we need h->max_huge_pages to do normal alloc, > > + * when alloc_bootm_huge_page can't support node specific allocat= ion. > > + */ > > + if (ret !=3D -1) > > + h->max_huge_pages -=3D (h->max_huge_pages_node[nid] - i); > > + h->max_huge_pages_node[nid] =3D i; > > + return ret; > > +} > > > > static void __init hugetlb_hstate_alloc_pages(struct hstate *h) > > { > > unsigned long i; > > nodemask_t *node_alloc_noretry; > > + bool hugetlb_node_set =3D false; > > + > > + /* skip gigantic hugepages allocation if hugetlb_cma enabled */ > > + if (hstate_is_gigantic(h) && hugetlb_cma_size) { > > + pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot = time allocation\n"); > > + return; > > + } > > + > > + /* do node alloc */ > > + for (i =3D 0; i < nodes_weight(node_states[N_MEMORY]); i++) { > > + if (h->max_huge_pages_node[i] > 0) { > > + /* > > + * if hugetlb_hstate_alloc_pages_onenode return -= 1 > > + * architecture version alloc_bootmem_huge_page > > + * can't support node specific alloc for gigantic > > + * hugepage, so goto normal alloc. > > + */ > > + if (hugetlb_hstate_alloc_pages_onenode(h, i) =3D= =3D -1) > > + pr_warn_once("HugeTLB: architecture can't= support node specific alloc, skip!\n"); > > + else > > + hugetlb_node_set =3D true; > > + } > > + } > > + > > + if (hugetlb_node_set) > > + return; > > If I understand correctly, this means that even if the allocation fails f= or > some of the nodes explicitly set in hugepages=3Dnid:pages,..., we won't t= ry > allocate more huge pages and silently ignore the failure. Is this the > intended behaviour? > Yes, If the allocation is failed in some specific nodes. We just print warning messages in hugetlb_hstate_alloc_pages_onenode to remind the user to use appropriate hugepages parameters or check the system and do not try to allocate more huge pages. hugetlb_node_set is used here to judge whether it is node specific alloc in this =E2=80=9Chstate=E2=80=9D. If it is a node specific allocation for this =E2=80=9Chstate=E2=80=9D, we s= hould not do old path(bellow code) allocation whether the node specific allocation is successful or not. > > > > if (!hstate_is_gigantic(h)) { > > /* > > @@ -2867,11 +2952,7 @@ static void __init hugetlb_hstate_alloc_pages(st= ruct hstate *h) > > > > for (i =3D 0; i < h->max_huge_pages; ++i) { > > if (hstate_is_gigantic(h)) { > > - if (hugetlb_cma_size) { > > - pr_warn_once("HugeTLB: hugetlb_cma is ena= bled, skip boot time allocation\n"); > > - goto free; > > - } > > - if (!alloc_bootmem_huge_page(h)) > > + if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE)) > > break; > > } else if (!alloc_pool_huge_page(h, > > &node_states[N_MEMORY], > > @@ -2887,7 +2968,6 @@ static void __init hugetlb_hstate_alloc_pages(str= uct hstate *h) > > h->max_huge_pages, buf, i); > > h->max_huge_pages =3D i; > > } > > -free: > > kfree(node_alloc_noretry); > > } > > > > -- > Sincerely yours, > Mike. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7532CC433F5 for ; Wed, 22 Sep 2021 09:14:16 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id E3D89610A0 for ; Wed, 22 Sep 2021 09:14:15 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org E3D89610A0 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 783296B006C; Wed, 22 Sep 2021 05:14:15 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 70BA06B0072; Wed, 22 Sep 2021 05:14:15 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 585F16B0073; Wed, 22 Sep 2021 05:14:15 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0001.hostedemail.com [216.40.44.1]) by kanga.kvack.org (Postfix) with ESMTP id 40AFF6B006C for ; Wed, 22 Sep 2021 05:14:15 -0400 (EDT) Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id E4ECC82499A8 for ; Wed, 22 Sep 2021 09:14:14 +0000 (UTC) X-FDA: 78614648028.26.8E743A3 Received: from mail-qk1-f173.google.com (mail-qk1-f173.google.com [209.85.222.173]) by imf18.hostedemail.com (Postfix) with ESMTP id A1E844002093 for ; Wed, 22 Sep 2021 09:14:14 +0000 (UTC) Received: by mail-qk1-f173.google.com with SMTP id c7so7539923qka.2 for ; Wed, 22 Sep 2021 02:14:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=r8/U5lhFhptIEj0YScg0hSUpvj2Pgosre/gr4ODgB5s=; b=dDuVpMj2Ae6qRegobVXGfopn8pjKX0AMRL82MzRNRDu+GttaejhnWni8uoZrQjRrP4 Yui66eF5A3wtcHwksE0seeMUwPjbzBB5EThhWJdOEoCcrJyiF5sE08i5TD9NdL870kYh EdUeX94tHmSvtVl1OzvJuyM9r6oS7IaDL5AuNfueb2DKYcFpSNv0oKFjimzXaWa5+n6a haDUDoVb03wMseGCw01qPiTwottZx62jRqbavs7d+k4kYH5B1pXAmP2xgb57G75gFRvm cdrOxGq4dtoASjeUUOKH+Rc6XMI7oIvGn/4FdmTdkzU+PI+Ai50wye/nfOaSRY4UGB7J ymvw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=r8/U5lhFhptIEj0YScg0hSUpvj2Pgosre/gr4ODgB5s=; b=PQKtFKBR8bb2wKbdU31dEDeDEwpOYesDbMp47WgYfRwoFLYiDI2+1jF+8tdc0Rk6OV 6hGN3nPzQAGBo8Tn3kt86wek5SJ21+L1b3FGCB9d5R0yB2WMigcBh40mWv+3XnvqPxUV O4px3mPgr8Aw622kNpIB+Z0rIa9AmCkoFvPvflNolyK1t9hjQ+k+S3se8pmcBX1/tDNd INhUNFjdn8W7hwsNNuXIebg/+ZyHruE1XC7Noj2Ts8XXXkjLIWrajPN5Xtpxu81fypJO to0APe6ntJt3uAdmSixWwBdldcNbT7dU5Xz6SAzRvC+7dlXDBZo0uN0K9C+WdUNGRCzc ngZQ== X-Gm-Message-State: AOAM530DNLFqr98IdFXniYFK/rJgqC8yIucfHBwuevpXLcKnClGJg5el +4TR7YofqDyrJaVuqA6mjacrU9raFywo+KWTDUw= X-Google-Smtp-Source: ABdhPJw0isA2c57xhycThegrIms8reGZh0oxHfr3J2t/iZmmcplr+bD/0vczyCZa/RuTZdPPQ2dPbpQZeondKZiAoVc= X-Received: by 2002:a25:d9c8:: with SMTP id q191mr46022669ybg.61.1632302053817; Wed, 22 Sep 2021 02:14:13 -0700 (PDT) MIME-Version: 1.0 References: <20210918113201.4133-1-yaozhenguo1@gmail.com> In-Reply-To: From: zhenguo yao Date: Wed, 22 Sep 2021 17:14:02 +0800 Message-ID: Subject: Re: [PATCH v5] hugetlbfs: Extend the definition of hugepages parameter to support node allocation To: Mike Rapoport Cc: Mike Kravetz , mpe@ellerman.id.au, benh@kernel.crashing.org, paulus@samba.org, corbet@lwn.net, Andrew Morton , yaozhenguo@jd.com, Matthew Wilcox , linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Linux Memory Management List Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=dDuVpMj2; spf=pass (imf18.hostedemail.com: domain of yaozhenguo1@gmail.com designates 209.85.222.173 as permitted sender) smtp.mailfrom=yaozhenguo1@gmail.com; dmarc=pass (policy=none) header.from=gmail.com X-Stat-Signature: jocm6mtau9nidhowdxihie8qjnegmu71 X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: A1E844002093 X-HE-Tag: 1632302054-422370 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Thanks for your review. Mike Rapoport =E4=BA=8E2021=E5=B9=B49=E6=9C=8819=E6=97=A5= =E5=91=A8=E6=97=A5 =E4=B8=8A=E5=8D=883:28=E5=86=99=E9=81=93=EF=BC=9A > > On Sat, Sep 18, 2021 at 07:32:01PM +0800, yaozhenguo wrote: > > We can specify the number of hugepages to allocate at boot. But the > > hugepages is balanced in all nodes at present. In some scenarios, > > we only need hugepages in one node. For example: DPDK needs hugepages > > which are in the same node as NIC. if DPDK needs four hugepages of 1G > > size in node1 and system has 16 numa nodes. We must reserve 64 hugepage= s > > in kernel cmdline. But, only four hugepages are used. The others should > > be free after boot. If the system memory is low(for example: 64G), it w= ill > > be an impossible task. So, Extending hugepages parameter to support > > specifying hugepages at a specific node. > > For example add following parameter: > > > > hugepagesz=3D1G hugepages=3D0:1,1:3 > > > > It will allocate 1 hugepage in node0 and 3 hugepages in node1. > > > > Signed-off-by: yaozhenguo > > --- > > v4->v5 changes: > > - remove BUG_ON in __alloc_bootmem_huge_page. > > - add nid parameter in __alloc_bootmem_huge_page to support > > called in both specific node alloc and normal alloc. > > - do normal alloc if architecture can't support node specific all= oc. > > - return -1 in powerpc version of alloc_bootmem_huge_page when > > nid is not NUMA_NO_NODE. > > I think, node format should not be enabled on powerpc. Rather than add a > new return value to alloc_bootmem_huge_page() and add complex checks for > -1, 0 and 1, it would be simpler to add a new weak function. The default > implementation will return true and powerpc will have an override that > returns false. > This function can be called during command line parsing and if it returne= d > false the entire per-node allocation path would be disabled. > Seems more clearly and simpler. I will change it. > > > > --- > > .../admin-guide/kernel-parameters.txt | 8 +- > > Documentation/admin-guide/mm/hugetlbpage.rst | 12 +- > > arch/powerpc/mm/hugetlbpage.c | 8 +- > > include/linux/hugetlb.h | 5 +- > > mm/hugetlb.c | 155 ++++++++++++++++-- > > 5 files changed, 166 insertions(+), 22 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Document= ation/admin-guide/kernel-parameters.txt > > index bdb22006f..a2046b2c5 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -1588,9 +1588,11 @@ > > the number of pages of hugepagesz to be allocated= . > > If this is the first HugeTLB parameter on the com= mand > > line, it specifies the number of pages to allocat= e for > > - the default huge page size. See also > > - Documentation/admin-guide/mm/hugetlbpage.rst. > > - Format: > > + the default huge page size. If using node format,= the > > + number of pages to allocate per-node can be speci= fied. > > + See also Documentation/admin-guide/mm/hugetlbpage= .rst. > > + Format: or (node format) > > + :[,:] > > > > hugepagesz=3D > > [HW] The size of the HugeTLB pages. This is used= in > > diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentati= on/admin-guide/mm/hugetlbpage.rst > > index 8abaeb144..d70828c07 100644 > > --- a/Documentation/admin-guide/mm/hugetlbpage.rst > > +++ b/Documentation/admin-guide/mm/hugetlbpage.rst > > @@ -128,7 +128,9 @@ hugepages > > implicitly specifies the number of huge pages of default size to > > allocate. If the number of huge pages of default size is implici= tly > > specified, it can not be overwritten by a hugepagesz,hugepages > > - parameter pair for the default size. > > + parameter pair for the default size. This parameter also has a > > + node format. The node format specifies the number of huge pages > > + to allocate on specific nodes. > > > > For example, on an architecture with 2M default huge page size:: > > > > @@ -138,6 +140,14 @@ hugepages > > indicating that the hugepages=3D512 parameter is ignored. If a h= ugepages > > parameter is preceded by an invalid hugepagesz parameter, it will > > be ignored. > > + > > + Node format example:: > > + > > + hugepagesz=3D2M hugepages=3D0:1,1:2 > > + > > + It will allocate 1 2M hugepage on node0 and 2 2M hugepages on nod= e1. > > + If the node number is invalid, the parameter will be ignored. > > + > > default_hugepagesz > > Specify the default huge page size. This parameter can > > only be specified once on the command line. default_hugepagesz c= an > > diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpag= e.c > > index 9a75ba078..4a9cea714 100644 > > --- a/arch/powerpc/mm/hugetlbpage.c > > +++ b/arch/powerpc/mm/hugetlbpage.c > > @@ -19,6 +19,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -232,14 +233,17 @@ static int __init pseries_alloc_bootmem_huge_page= (struct hstate *hstate) > > #endif > > > > > > -int __init alloc_bootmem_huge_page(struct hstate *h) > > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid) > > { > > > > #ifdef CONFIG_PPC_BOOK3S_64 > > + /* Don't support node specific alloc */ > > + if (nid !=3D NUMA_NO_NODE) > > + return -1; > > if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled()) > > return pseries_alloc_bootmem_huge_page(h); > > #endif > > - return __alloc_bootmem_huge_page(h); > > + return __alloc_bootmem_huge_page(h, nid); > > } > > > > #ifndef CONFIG_PPC_BOOK3S_64 > > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h > > index f7ca1a387..a9bdbc870 100644 > > --- a/include/linux/hugetlb.h > > +++ b/include/linux/hugetlb.h > > @@ -605,6 +605,7 @@ struct hstate { > > unsigned long nr_overcommit_huge_pages; > > struct list_head hugepage_activelist; > > struct list_head hugepage_freelists[MAX_NUMNODES]; > > + unsigned int max_huge_pages_node[MAX_NUMNODES]; > > unsigned int nr_huge_pages_node[MAX_NUMNODES]; > > unsigned int free_huge_pages_node[MAX_NUMNODES]; > > unsigned int surplus_huge_pages_node[MAX_NUMNODES]; > > @@ -637,8 +638,8 @@ void restore_reserve_on_error(struct hstate *h, str= uct vm_area_struct *vma, > > unsigned long address, struct page *page)= ; > > > > /* arch callback */ > > -int __init __alloc_bootmem_huge_page(struct hstate *h); > > -int __init alloc_bootmem_huge_page(struct hstate *h); > > +int __init __alloc_bootmem_huge_page(struct hstate *h, int nid); > > +int __init alloc_bootmem_huge_page(struct hstate *h, int nid); > > > > void __init hugetlb_add_hstate(unsigned order); > > bool __init arch_hugetlb_valid_size(unsigned long size); > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index dfc940d52..f7aaca6ff 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -66,6 +66,7 @@ static struct hstate * __initdata parsed_hstate; > > static unsigned long __initdata default_hstate_max_huge_pages; > > static bool __initdata parsed_valid_hugepagesz =3D true; > > static bool __initdata parsed_default_hugepagesz; > > +static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata= ; > > > > /* > > * Protects updates to hugepage_freelists, hugepage_activelist, nr_hug= e_pages, > > @@ -2774,17 +2775,30 @@ struct page *alloc_huge_page(struct vm_area_str= uct *vma, > > vma_end_reservation(h, vma, addr); > > return ERR_PTR(-ENOSPC); > > } > > - > > -int alloc_bootmem_huge_page(struct hstate *h) > > +/* > > + * Architecture version must provide there own node specific hugepage > > ^ its > > > + * allocation code. If not, please return -1 when nid is not NUMA_NO_N= ODE. > > + */ > > +int alloc_bootmem_huge_page(struct hstate *h, int nid) > > __attribute__ ((weak, alias("__alloc_bootmem_huge_page"))); > > -int __alloc_bootmem_huge_page(struct hstate *h) > > +int __alloc_bootmem_huge_page(struct hstate *h, int nid) > > { > > struct huge_bootmem_page *m; > > int nr_nodes, node; > > + void *addr =3D NULL; > > > > + /* do node specific alloc */ > > + if (nid !=3D NUMA_NO_NODE && nid < nodes_weight(node_states[N_MEM= ORY])) { > > + addr =3D memblock_alloc_try_nid_raw(huge_page_size(h), hu= ge_page_size(h), > > + 0, MEMBLOCK_ALLOC_ACCESSIBLE, nid); > > + if (addr) { > > + m =3D addr; > > + goto found; > > + } else { > > + return 0; > > + } > > if (!addr) > return 0; > m =3D addr; > goto found; > > seems simpler to me. > > Besides, I think addr is not needed at all, m can be assigned directly. I= 'd > also rename it to e.g. page. > Yes, addr seems useless. I will remove it. > > + } > > for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEM= ORY]) { > > - void *addr; > > - > > addr =3D memblock_alloc_try_nid_raw( > > huge_page_size(h), huge_page_size(h), > > 0, MEMBLOCK_ALLOC_ACCESSIBLE, node); > > @@ -2801,7 +2815,6 @@ int __alloc_bootmem_huge_page(struct hstate *h) > > return 0; > > > > found: > > - BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h))); > > /* Put them into a private list first because mem_map is not up y= et */ > > INIT_LIST_HEAD(&m->list); > > list_add(&m->list, &huge_boot_pages); > > @@ -2841,11 +2854,83 @@ static void __init gather_bootmem_prealloc(void= ) > > cond_resched(); > > } > > } > > +/* > > + * do node specific hugepage allocation > > + * return 0: allocation is failed > > + * 1: allocation is successful > > + * -1: alloc_bootmem_huge_page can't support node specific alloca= tion > > + */ > > +static int __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h,= int nid) > > +{ > > + unsigned long i; > > + char buf[32]; > > + int ret =3D 1; > > + > > + for (i =3D 0; i < h->max_huge_pages_node[nid]; ++i) { > > + if (hstate_is_gigantic(h)) { > > + ret =3D alloc_bootmem_huge_page(h, nid); > > + if (ret =3D=3D 0 || ret =3D=3D -1) > > + break; > > + } else { > > + struct page *page; > > + gfp_t gfp_mask =3D htlb_alloc_mask(h) | __GFP_THI= SNODE; > > + > > + page =3D alloc_fresh_huge_page(h, gfp_mask, nid, > > + &node_states[N_MEMORY], NULL); > > + if (!page) { > > + ret =3D 0; > > + break; > > + } > > + put_page(page); /* free it into the hugepage allo= cator */ > > + } > > + cond_resched(); > > + } > > + if (i =3D=3D h->max_huge_pages_node[nid]) > > + return ret; > > + > > + string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32); > > + pr_warn("HugeTLB: allocating %u of page size %s failed node%d. O= nly allocated %lu hugepages.\n", > > + h->max_huge_pages_node[nid], buf, nid, i); > > + /* > > + * we need h->max_huge_pages to do normal alloc, > > + * when alloc_bootm_huge_page can't support node specific allocat= ion. > > + */ > > + if (ret !=3D -1) > > + h->max_huge_pages -=3D (h->max_huge_pages_node[nid] - i); > > + h->max_huge_pages_node[nid] =3D i; > > + return ret; > > +} > > > > static void __init hugetlb_hstate_alloc_pages(struct hstate *h) > > { > > unsigned long i; > > nodemask_t *node_alloc_noretry; > > + bool hugetlb_node_set =3D false; > > + > > + /* skip gigantic hugepages allocation if hugetlb_cma enabled */ > > + if (hstate_is_gigantic(h) && hugetlb_cma_size) { > > + pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot = time allocation\n"); > > + return; > > + } > > + > > + /* do node alloc */ > > + for (i =3D 0; i < nodes_weight(node_states[N_MEMORY]); i++) { > > + if (h->max_huge_pages_node[i] > 0) { > > + /* > > + * if hugetlb_hstate_alloc_pages_onenode return -= 1 > > + * architecture version alloc_bootmem_huge_page > > + * can't support node specific alloc for gigantic > > + * hugepage, so goto normal alloc. > > + */ > > + if (hugetlb_hstate_alloc_pages_onenode(h, i) =3D= =3D -1) > > + pr_warn_once("HugeTLB: architecture can't= support node specific alloc, skip!\n"); > > + else > > + hugetlb_node_set =3D true; > > + } > > + } > > + > > + if (hugetlb_node_set) > > + return; > > If I understand correctly, this means that even if the allocation fails f= or > some of the nodes explicitly set in hugepages=3Dnid:pages,..., we won't t= ry > allocate more huge pages and silently ignore the failure. Is this the > intended behaviour? > Yes, If the allocation is failed in some specific nodes. We just print warning messages in hugetlb_hstate_alloc_pages_onenode to remind the user to use appropriate hugepages parameters or check the system and do not try to allocate more huge pages. hugetlb_node_set is used here to judge whether it is node specific alloc in this =E2=80=9Chstate=E2=80=9D. If it is a node specific allocation for this =E2=80=9Chstate=E2=80=9D, we s= hould not do old path(bellow code) allocation whether the node specific allocation is successful or not. > > > > if (!hstate_is_gigantic(h)) { > > /* > > @@ -2867,11 +2952,7 @@ static void __init hugetlb_hstate_alloc_pages(st= ruct hstate *h) > > > > for (i =3D 0; i < h->max_huge_pages; ++i) { > > if (hstate_is_gigantic(h)) { > > - if (hugetlb_cma_size) { > > - pr_warn_once("HugeTLB: hugetlb_cma is ena= bled, skip boot time allocation\n"); > > - goto free; > > - } > > - if (!alloc_bootmem_huge_page(h)) > > + if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE)) > > break; > > } else if (!alloc_pool_huge_page(h, > > &node_states[N_MEMORY], > > @@ -2887,7 +2968,6 @@ static void __init hugetlb_hstate_alloc_pages(str= uct hstate *h) > > h->max_huge_pages, buf, i); > > h->max_huge_pages =3D i; > > } > > -free: > > kfree(node_alloc_noretry); > > } > > > > -- > Sincerely yours, > Mike.