From: Mike Kravetz <mike.kravetz@oracle.com>
To: Muchun Song <muchun.song@linux.dev>
Cc: Usama Arif <usama.arif@bytedance.com>,
	Linux-MM <linux-mm@kvack.org>,
	"Mike Rapoport (IBM)" <rppt@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Muchun Song <songmuchun@bytedance.com>,
	fam.zheng@bytedance.com, liangma@liangbit.com,
	punit.agrawal@bytedance.com
Subject: Re: [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
Date: Fri, 8 Sep 2023 11:29:50 -0700
Message-ID: <20230908182950.GA6564@monkey>
In-Reply-To: <C51EF081-ABC9-4770-B329-2CAC2CE979FA@linux.dev>

On 09/08/23 10:39, Muchun Song wrote:
> 
> 
> > On Sep 8, 2023, at 02:37, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > 
> > On 09/06/23 12:26, Usama Arif wrote:
> >> The new boot flow when it comes to initialization of gigantic pages
> >> is as follows:
> >> - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
> >> the region after the first struct page is marked as noinit.
> >> - This results in only the first struct page to be
> >> initialized in reserve_bootmem_region. As the tail struct pages are
> >> not initialized at this point, there can be a significant saving
> >> in boot time if HVO succeeds later on.
> >> - Later on in the boot, the head page is prepped and the first
> >> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
> >> are initialized.
> >> - HVO is attempted. If it is not successful, then the rest of the
> >> tail struct pages are initialized. If it is successful, no more
> >> tail struct pages need to be initialized saving significant boot time.
> >> 
> >> Signed-off-by: Usama Arif <usama.arif@bytedance.com>
> >> ---
> >> mm/hugetlb.c         | 61 +++++++++++++++++++++++++++++++++++++-------
> >> mm/hugetlb_vmemmap.c |  2 +-
> >> mm/hugetlb_vmemmap.h |  9 ++++---
> >> mm/internal.h        |  3 +++
> >> mm/mm_init.c         |  2 +-
> >> 5 files changed, 62 insertions(+), 15 deletions(-)
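[To put rough numbers on the saving described in the changelog above, here is a back-of-envelope sketch. The sizes are typical x86-64 assumptions (4 KiB base pages, a 64-byte struct page, 1 GiB gigantic pages), not values read from the patch itself:

```python
# Rough estimate of how many tail struct pages HVO lets us skip
# initializing for one gigantic page. All constants below are
# typical x86-64 assumptions, not taken from any kernel source.
PAGE_SIZE = 4096                           # base page size in bytes
STRUCT_PAGE_SIZE = 64                      # assumed sizeof(struct page)
GIGANTIC_PAGE_SIZE = 1 << 30               # 1 GiB hugepage
HUGETLB_VMEMMAP_RESERVE_SIZE = PAGE_SIZE   # vmemmap kept after HVO

total_struct_pages = GIGANTIC_PAGE_SIZE // PAGE_SIZE        # one per base page
kept = HUGETLB_VMEMMAP_RESERVE_SIZE // STRUCT_PAGE_SIZE     # head + first tails
skipped = total_struct_pages - kept

print(f"struct pages per 1 GiB page : {total_struct_pages}")  # 262144
print(f"initialized when HVO works  : {kept}")                # 64
print(f"skipped (saved at boot)     : {skipped}")             # 262080
```

So under these assumptions only 64 of 262144 struct pages per gigantic page need initializing when HVO succeeds, which is where the boot-time saving comes from.]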
> > 
> > As mentioned, in general this looks good.  One small point below.
> > 
> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >> index c32ca241df4b..540e0386514e 100644
> >> --- a/mm/hugetlb.c
> >> +++ b/mm/hugetlb.c
> >> @@ -3169,6 +3169,15 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> >> }
> >> 
> >> found:
> >> +
> >> +	/*
> >> +	 * Only initialize the head struct page in memmap_init_reserved_pages,
> >> +	 * rest of the struct pages will be initialized by the HugeTLB
> >> +	 * subsystem itself. The head struct page is used to get folio
> >> +	 * information by the HugeTLB subsystem like zone id and node id.
> >> +	 */
> >> +	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
> >> +				      huge_page_size(h) - PAGE_SIZE);
> >> 	/* Put them into a private list first because mem_map is not up yet */
> >> 	INIT_LIST_HEAD(&m->list);
> >> 	list_add(&m->list, &huge_boot_pages);
> >> @@ -3176,6 +3185,40 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> >> 	return 1;
> >> }
> >> 
> >> +/* Initialize [start_page:end_page_number] tail struct pages of a hugepage */
> >> +static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio,
> >> +					unsigned long start_page_number,
> >> +					unsigned long end_page_number)
> >> +{
> >> +	enum zone_type zone = zone_idx(folio_zone(folio));
> >> +	int nid = folio_nid(folio);
> >> +	unsigned long head_pfn = folio_pfn(folio);
> >> +	unsigned long pfn, end_pfn = head_pfn + end_page_number;
> >> +
> >> +	for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) {
> >> +		struct page *page = pfn_to_page(pfn);
> >> +
> >> +		__init_single_page(page, pfn, zone, nid);
> >> +		prep_compound_tail((struct page *)folio, pfn - head_pfn);
> >> +		set_page_count(page, 0);
> >> +	}
> >> +}
> >> +
> >> +static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
> >> +					      struct hstate *h,
> >> +					      unsigned long nr_pages)
> >> +{
> >> +	int ret;
> >> +
> >> +	/* Prepare folio head */
> >> +	__folio_clear_reserved(folio);
> >> +	__folio_set_head(folio);
> >> + 	ret = page_ref_freeze(&folio->page, 1);
> >> + 	VM_BUG_ON(!ret);
> > 
> > In the current code, we print a warning and free the associated pages to
> > buddy if we ever experience an increased ref count.  The routine
> > hugetlb_folio_init_tail_vmemmap does not check for this.
> > 
> > I do not believe speculative/temporary ref counts this early in the boot
> > process are possible.  It would be great to get input from someone else.
> 
> Yes, it is a very early stage and the other tail struct pages haven't been
> initialized yet, so no one should reference them. It is the same case
> as with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
> 
> > 
> > When I wrote the existing code, it was fairly easy to WARN and continue
> > if we encountered an increased ref count.  Things would be a bit more
> 
> In your case, I think it is not in the boot process, right?

They were calls in the same routine: gather_bootmem_prealloc().

> > complicated here.  So, it may not be worth the effort.
> 
> Agree. Note that the tail struct pages are not initialized here; if we
> want to handle the head page, how would we handle the tail pages? It
> really cannot be resolved. We should make the same assumption as
> CONFIG_DEFERRED_STRUCT_PAGE_INIT: that no one should reference those pages.

Agree that speculative refs should not happen this early.  How about making
the following changes?
- Instead of set_page_count() in hugetlb_folio_init_tail_vmemmap, do a
  page_ref_freeze and VM_BUG_ON if the ref count is not 1.
- In the commit message, mention 'The WARN_ON for increased ref count in
  gather_bootmem_prealloc was changed to a VM_BUG_ON.  This is OK as
  there should be no speculative references this early in the boot process.
  The VM_BUG_ON's are there just in case such code is introduced.'
-- 
Mike Kravetz

Thread overview: 17+ messages
2023-09-06 11:26 [v4 0/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
2023-09-06 11:26 ` [v4 1/4] mm: hugetlb_vmemmap: Use nid of the head page to reallocate it Usama Arif
2023-09-06 11:26 ` [v4 2/4] memblock: pass memblock_type to memblock_setclr_flag Usama Arif
2023-09-06 11:26 ` [v4 3/4] memblock: introduce MEMBLOCK_RSRV_NOINIT flag Usama Arif
2023-09-06 11:35   ` Muchun Song
2023-09-06 12:01   ` Mike Rapoport
2023-09-06 11:26 ` [v4 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO Usama Arif
2023-09-06 18:10   ` Mike Kravetz
2023-09-06 21:27     ` [External] " Usama Arif
2023-09-06 21:59       ` Mike Kravetz
2023-09-07 10:14         ` Usama Arif
2023-09-07 18:24           ` Mike Kravetz
2023-09-07 18:37   ` Mike Kravetz
2023-09-08  2:39     ` Muchun Song
2023-09-08 18:29       ` Mike Kravetz [this message]
2023-09-08 20:48         ` [External] " Usama Arif
2023-09-22 14:42 ` [v4 0/4] " Pasha Tatashin
