From: Muchun Song
Date: Mon, 25 Jan 2021 15:41:29 +0800
Subject: Re: [External] Re: [PATCH v13 05/12] mm: hugetlb: allocate the vmemmap pages associated with each HugeTLB page
To: David Rientjes
Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo@redhat.com, bp@alien8.de,
    x86@kernel.org, hpa@zytor.com, dave.hansen@linux.intel.com, luto@kernel.org,
    Peter Zijlstra, viro@zeniv.linux.org.uk, Andrew Morton, paulmck@kernel.org,
    mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com, Randy Dunlap,
    oneukum@suse.com, anshuman.khandual@arm.com, jroedel@suse.de, Mina Almasry,
    Matthew Wilcox, Oscar Salvador, Michal Hocko, "Song Bao Hua (Barry Song)",
    David Hildenbrand, HORIGUCHI NAOYA (堀口 直也), Xiongchun duan,
    linux-doc@vger.kernel.org, LKML, Linux Memory Management List, linux-fsdevel
References: <20210117151053.24600-1-songmuchun@bytedance.com>
    <20210117151053.24600-6-songmuchun@bytedance.com>
    <6a68fde-583d-b8bb-a2c8-fbe32e03b@google.com>

On Mon, Jan 25, 2021 at 2:40 PM Muchun Song wrote:
>
> On Mon, Jan 25, 2021 at 8:05 AM David Rientjes wrote:
> >
> >
> > On Sun, 17 Jan 2021, Muchun Song wrote:
> >
> > > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > > index ce4be1fa93c2..3b146d5949f3 100644
> > > --- a/mm/sparse-vmemmap.c
> > > +++ b/mm/sparse-vmemmap.c
> > > @@ -29,6 +29,7 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#include
> > >
> > >  #include
> > >  #include
> > > @@ -40,7 +41,8 @@
> > >   * @remap_pte:		called for each non-empty PTE (lowest-level) entry.
> > >   * @reuse_page:		the page which is reused for the tail vmemmap pages.
> > >   * @reuse_addr:		the virtual address of the @reuse_page page.
> > > - * @vmemmap_pages:	the list head of the vmemmap pages that can be freed.
> > > + * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
> > > + *			or is mapped from.
> > >   */
> > >  struct vmemmap_remap_walk {
> > >  	void (*remap_pte)(pte_t *pte, unsigned long addr,
> > > @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
> > >  	struct list_head *vmemmap_pages;
> > >  };
> > >
> > > +/* The gfp mask of allocating vmemmap page */
> > > +#define GFP_VMEMMAP_PAGE	\
> > > +	(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> > > +
> >
> > This is unnecessary, just use the gfp mask directly in allocator.
>
> Will do. Thanks.
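
For reference, a minimal sketch of that suggestion (illustration only, not the
final patch): drop the GFP_VMEMMAP_PAGE macro and pass the gfp mask straight to
the allocator at the call site in alloc_vmemmap_page_list():

	/* Pass the gfp mask directly instead of hiding it behind a macro. */
	page = alloc_pages_node(nid,
				GFP_KERNEL | __GFP_RETRY_MAYFAIL |
				__GFP_NOWARN | __GFP_THISNODE, 0);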
> > >
> > >  static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> > >  			      unsigned long end,
> > >  			      struct vmemmap_remap_walk *walk)
> > > @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
> > >  	free_vmemmap_page_list(&vmemmap_pages);
> > >  }
> > >
> > > +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> > > +				struct vmemmap_remap_walk *walk)
> > > +{
> > > +	pgprot_t pgprot = PAGE_KERNEL;
> > > +	struct page *page;
> > > +	void *to;
> > > +
> > > +	BUG_ON(pte_page(*pte) != walk->reuse_page);
> > > +
> > > +	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> > > +	list_del(&page->lru);
> > > +	to = page_to_virt(page);
> > > +	copy_page(to, (void *)walk->reuse_addr);
> > > +
> > > +	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> > > +}
> > > +
> > > +static void alloc_vmemmap_page_list(struct list_head *list,
> > > +				    unsigned long start, unsigned long end)
> > > +{
> > > +	unsigned long addr;
> > > +
> > > +	for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > +		struct page *page;
> > > +		int nid = page_to_nid((const void *)addr);
> > > +
> > > +retry:
> > > +		page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> > > +		if (unlikely(!page)) {
> > > +			msleep(100);
> > > +			/*
> > > +			 * We should retry infinitely, because we cannot
> > > +			 * handle allocation failures. Once we allocate
> > > +			 * vmemmap pages successfully, then we can free
> > > +			 * a HugeTLB page.
> > > +			 */
> > > +			goto retry;
> >
> > Ugh, I don't think this will work, there's no guarantee that we'll ever
> > succeed and now we can't free a 2MB hugepage because we cannot allocate a
> > 4KB page. We absolutely have to ensure we make forward progress here.
>
> This can trigger an OOM when there is no memory and kill someone to release
> some memory. Right?
>
> >
> > We're going to be freeing the hugetlb page after this succeeds, can we
> > not use part of the hugetlb page that we're freeing for this memory
> > instead?
>
> It seems a good idea. We can try to allocate memory first; if successful,
> just use the new page to remap (it can reduce memory fragmentation).
> If not, we can use part of the hugetlb page to remap. What's your opinion
> about this?

If the HugeTLB page is a gigantic page allocated from CMA, we cannot use
part of the hugetlb page to remap in that case. Right?

> > >
> > > +		}
> > > +		list_add_tail(&page->lru, list);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> > > + *			 to the page which is from the @vmemmap_pages
> > > + *			 respectively.
> > > + * @start:	start address of the vmemmap virtual address range.
> > > + * @end:	end address of the vmemmap virtual address range.
> > > + * @reuse:	reuse address.
> > > + */
> > > +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> > > +			 unsigned long reuse)
> > > +{
> > > +	LIST_HEAD(vmemmap_pages);
> > > +	struct vmemmap_remap_walk walk = {
> > > +		.remap_pte	= vmemmap_restore_pte,
> > > +		.reuse_addr	= reuse,
> > > +		.vmemmap_pages	= &vmemmap_pages,
> > > +	};
> > > +
> > > +	might_sleep();
> > > +
> > > +	/* See the comment in the vmemmap_remap_free(). */
> > > +	BUG_ON(start - reuse != PAGE_SIZE);
> > > +
> > > +	alloc_vmemmap_page_list(&vmemmap_pages, start, end);
> > > +	vmemmap_remap_range(reuse, end, &walk);
> > > +}
> > > +
> > >  /*
> > >   * Allocate a block of memory to be used to back the virtual memory map
> > >   * or to back the page tables that are used to create the mapping.
> > > --
> > > 2.11.0
> > >
> > >
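
For reference, a rough sketch of the fallback idea discussed above (illustration
only, not merged code): prefer freshly allocated vmemmap pages and only hand out
subpages of the HugeTLB page being freed when the allocation fails. The function
name, @hpage and @spare_idx are made-up here; the caller would have to split the
compound page first and must skip this path for gigantic pages backed by CMA,
since that memory has to be returned to CMA as a whole.

#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *get_vmemmap_page(int nid, struct page *hpage,
				     unsigned int *spare_idx)
{
	struct page *page;

	/* Try a fresh page first; it avoids eating into the freed hugepage. */
	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_RETRY_MAYFAIL |
				     __GFP_NOWARN | __GFP_THISNODE, 0);
	if (page)
		return page;

	/* Fall back to the next unused subpage of the HugeTLB page. */
	return hpage + (*spare_idx)++;
}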