From mboxrd@z Thu Jan 1 00:00:00 1970
From: Muchun Song
Date: Wed, 21 Apr 2021 11:42:31 +0800
Subject: Re: [External] Re: [PATCH v20 6/9] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page
References: <20210415084005.25049-1-songmuchun@bytedance.com>
 <20210415084005.25049-7-songmuchun@bytedance.com>
 <5f914142-009b-3bfc-9cb2-46154f610e29@oracle.com>
 <8de3d7a0-f100-5d50-fe54-b83af07570f4@oracle.com>
In-Reply-To: <8de3d7a0-f100-5d50-fe54-b83af07570f4@oracle.com>
To: Mike Kravetz
Cc: Jonathan Corbet, Thomas Gleixner, Ingo Molnar, bp@alien8.de, X86 ML, hpa@zytor.com,
 dave.hansen@linux.intel.com, luto@kernel.org, Peter Zijlstra, Alexander Viro,
 Andrew Morton, paulmck@kernel.org, pawan.kumar.gupta@linux.intel.com,
 Randy Dunlap, oneukum@suse.com, anshuman.khandual@arm.com, jroedel@suse.de,
 Mina Almasry, David Rientjes, Matthew Wilcox, Oscar Salvador, Michal Hocko,
 "Song Bao Hua (Barry Song)", David Hildenbrand, HORIGUCHI NAOYA (堀口 直也),
 Joao Martins, Xiongchun duan, fam.zheng@bytedance.com,
 linux-doc@vger.kernel.org, LKML, Linux Memory Management List, linux-fsdevel
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-fsdevel@vger.kernel.org

On Wed, Apr 21, 2021 at 1:48 AM Mike Kravetz wrote:
>
> On 4/20/21 1:46 AM, Muchun Song wrote:
> > On Tue, Apr 20, 2021 at 7:20 AM Mike Kravetz wrote:
> >>
> >> On 4/15/21 1:40 AM, Muchun Song wrote:
> >>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> >>> index 0abed7e766b8..6e970a7d3480 100644
> >>> --- a/include/linux/hugetlb.h
> >>> +++ b/include/linux/hugetlb.h
> >>> @@ -525,6 +525,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >>>  * code knows it has only reference. All other examinations and
> >>>  * modifications require hugetlb_lock.
> >>>  * HPG_freed - Set when page is on the free lists.
> >>> + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
> >>>  * Synchronization: hugetlb_lock held for examination and modification.
> >>
> >> I like the per-page flag.  In previous versions of the series, you just
> >> checked free_vmemmap_pages_per_hpage() to determine if vmemmap
> >> should be allocated.  Is there any change in functionality that makes it
> >> necessary to set the flag in each page, or is it mostly for flexibility
> >> going forward?
> >
> > Actually, only the routine of dissolving the page cares whether
> > the page is on the buddy free list when update_and_free_page
> > returns.  But we cannot change the return type of
> > update_and_free_page() (e.g. change the return type from 'void' to
> > 'int').  Why?  If the hugepage is freed through a kworker, we cannot
> > know the return value when update_and_free_page() returns, so
> > adding a return value seems odd.
> >
> > In the dissolving routine, we can allocate the vmemmap pages first;
> > if that succeeds, we can be sure that update_and_free_page() will
> > successfully free the page.  So I need some way to mark a page that
> > does not need vmemmap pages allocated.
> >
> > On the surface, we seem to have a straightforward way to do this.
> >
> > Add a new parameter 'alloc_vmemmap' to update_and_free_page() to
> > indicate that the caller has already allocated the vmemmap pages,
> > so update_and_free_page() does not need to allocate them.  Just
> > like below.
> >
> > void update_and_free_page(struct hstate *h, struct page *page, bool atomic,
> >                           bool alloc_vmemmap)
> > {
> >         if (alloc_vmemmap)
> >                 // allocate vmemmap pages
> > }
> >
> > But if the page is freed through a kworker, how do we pass
> > 'alloc_vmemmap' to the kworker?  We can embed this information in
> > the per-page flag instead, so with HPG_vmemmap_optimized the
> > 'alloc_vmemmap' parameter is no longer necessary.
> >
> > So it seems that introducing HPG_vmemmap_optimized is a good choice.
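
(Illustrative aside: with the per-page flag, the vmemmap restore helper can
check the page itself, so a kworker freeing the page needs no extra
'alloc_vmemmap' argument.  The helper name alloc_huge_page_vmemmap and the
HPageVmemmapOptimized accessors follow the series' naming; the body below is
a minimal sketch, not the patch's implementation.)

int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
{
        /* Nothing to do if the vmemmap pages were never freed. */
        if (!HPageVmemmapOptimized(head))
                return 0;

        /*
         * ... allocate and remap the vmemmap pages for 'head' here,
         * returning -ENOMEM on failure ...
         */

        ClearHPageVmemmapOptimized(head);
        return 0;
}
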
> Thanks for the explanation!
>
> Agree that the flag is a good choice.  How about adding a comment like
> this above the alloc_huge_page_vmemmap call in dissolve_free_huge_page?
>
> /*
>  * Normally update_and_free_page will allocate required vmemmap before
>  * freeing the page.  update_and_free_page will fail to free the page
>  * if it cannot allocate required vmemmap.  We need to adjust
>  * max_huge_pages if the page is not freed.  Attempt to allocate
>  * vmemmap here so that we can take appropriate action on failure.
>  */

Thanks. I will add this comment.

>
> ...
> >>> +static void add_hugetlb_page(struct hstate *h, struct page *page,
> >>> +                            bool adjust_surplus)
> >>> +{
> >>
> >> We need to be a bit careful with hugepage specific flags that may be
> >> set.  The routine remove_hugetlb_page, which is called for 'page' before
> >> this routine, will not clear any of the hugepage specific flags.  If the
> >> calling path goes through free_huge_page, most but not all flags are
> >> cleared.
> >>
> >> We had a discussion about clearing the page->private field in Oscar's
> >> series.  In the case of 'new' pages we can assume page->private is
> >> cleared, but perhaps we should not make that assumption here.  Since we
> >> hope to rarely call this routine, it might be safer to do something
> >> like:
> >>
> >>     set_page_private(page, 0);
> >>     SetHPageVmemmapOptimized(page);
> >
> > Agree. Thanks for your reminder. I will fix this.
> >
> >>> +     int nid = page_to_nid(page);
> >>> +
> >>> +     lockdep_assert_held(&hugetlb_lock);
> >>> +
> >>> +     INIT_LIST_HEAD(&page->lru);
> >>> +     h->nr_huge_pages++;
> >>> +     h->nr_huge_pages_node[nid]++;
> >>> +
> >>> +     if (adjust_surplus) {
> >>> +             h->surplus_huge_pages++;
> >>> +             h->surplus_huge_pages_node[nid]++;
> >>> +     }
> >>> +
> >>> +     set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> >>> +
> >>> +     /*
> >>> +      * The refcount can possibly be increased by memory-failure or
> >>> +      * soft_offline handlers.
> >>> +      */
> >>> +     if (likely(put_page_testzero(page))) {
> >>
> >> In the existing code there is no such test.  Is the need for the test
> >> because of something introduced in the new code?
> >
> > No.
> >
> >> Or, should this test be in the existing code?
> >
> > Yes. gather_surplus_pages should be fixed. I can fix it
> > in a separate patch.
> >
> > The possible bad scenario:
> >
> > CPU0:                                   CPU1:
> >                                         set_compound_page_dtor(HUGETLB_PAGE_DTOR);
> > memory_failure_hugetlb
> >   get_hwpoison_page
> >     __get_hwpoison_page
> >       get_page_unless_zero
> >                                         put_page_testzero()
> >
> > put_page(page)
> >
> > More details and discussion can be found at:
> >
> > https://lore.kernel.org/linux-doc/CAMZfGtVRSBkKe=tKAKLY8dp_hywotq3xL+EJZNjXuSKt3HK3bQ@mail.gmail.com/
> >
>
> Thank you!  I did not remember that discussion.
>
> It would be helpful to add a separate patch for gather_surplus_pages.
> Otherwise, we have the VM_BUG_ON there and not in add_hugetlb_page.
>

Agree. Will do.

> --
> Mike Kravetz
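
(Illustrative sketch, not the final patch: add_hugetlb_page() from the hunk
above with Mike's suggested set_page_private()/SetHPageVmemmapOptimized()
calls folded in.  The body of the put_page_testzero() success path, namely
arch_clear_hugepage_flags() and enqueue_huge_page(), is assumed here and is
not part of the quoted hunk.)

static void add_hugetlb_page(struct hstate *h, struct page *page,
                             bool adjust_surplus)
{
        int nid = page_to_nid(page);

        lockdep_assert_held(&hugetlb_lock);

        INIT_LIST_HEAD(&page->lru);
        h->nr_huge_pages++;
        h->nr_huge_pages_node[nid]++;

        if (adjust_surplus) {
                h->surplus_huge_pages++;
                h->surplus_huge_pages_node[nid]++;
        }

        set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);

        /*
         * remove_hugetlb_page()/free_huge_page() do not clear every
         * hugetlb-specific flag or page->private, so reset them explicitly
         * and mark the page as still vmemmap-optimized (its vmemmap pages
         * remain freed).
         */
        set_page_private(page, 0);
        SetHPageVmemmapOptimized(page);

        /*
         * The refcount can possibly be increased by memory-failure or
         * soft_offline handlers; only put the page back on the free list
         * if we hold the last reference.
         */
        if (likely(put_page_testzero(page))) {
                arch_clear_hugepage_flags(page);
                enqueue_huge_page(h, page);
        }
}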