Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize

From: Jane Chu <jane.chu@oracle.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: Will Deacon <will@kernel.org>,
	Nanyong Sun <sunnanyong@huawei.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	muchun.song@linux.dev, akpm@linux-foundation.org,
	anshuman.khandual@arm.com, wangkefeng.wang@huawei.com,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
Date: Thu, 8 Feb 2024 11:21:19 -0800	[thread overview]
Message-ID: <78eee4ef-99ed-46a3-a776-a74bcd83ba44@oracle.com> (raw)
In-Reply-To: <ZcT4DH7VE1XLBvVc@casper.infradead.org>

On 2/8/2024 7:49 AM, Matthew Wilcox wrote:

> On Wed, Feb 07, 2024 at 06:24:52PM -0800, Jane Chu wrote:
>> On 2/7/2024 6:17 AM, Matthew Wilcox wrote:
>>> While this array of ~512 pages have been allocated to hugetlbfs, and one
>>> would think that there would be no way that there could still be
>>> references to them, another CPU can have a pointer to this struct page
>>> (eg attempting a speculative page cache reference or
>>> get_user_pages_fast()).  That means it will try to call
>>> atomic_add_unless(&page->_refcount, 1, 0);
>>>
>>> Actually, I wonder if this isn't a problem on x86 too?  Do we need to
>>> explicitly go through an RCU grace period before freeing the pages
>>> for use by somebody else?
>>>
>> Sorry, not sure what I'm missing, please help.
> Having written out the analysis, I now think it can't happen on x86,
> but let's walk through it because it's non-obvious (and I think it
> illustrates what people are afraid of on Arm).
>
> CPU A calls either get_user_pages_fast() or __filemap_get_folio().
> Let's do the latter this time.
>
>          folio = filemap_get_entry(mapping, index);
> filemap_get_entry:
> 	rcu_read_lock();
>          folio = xas_load(&xas);
>          if (!folio_try_get_rcu(folio))
>                  goto repeat;
>          if (unlikely(folio != xas_reload(&xas))) {
>                  folio_put(folio);
>                  goto repeat;
>          }
> folio_try_get_rcu:
> 	folio_ref_try_add_rcu(folio, 1);
> folio_ref_try_add_rcu:
>          if (unlikely(!folio_ref_add_unless(folio, count, 0))) {
>                  /* Either the folio has been freed, or will be freed. */
>                  return false;
> folio_ref_add_unless:
>          return page_ref_add_unless(&folio->page, nr, u);
> page_ref_add_unless:
> 	atomic_add_unless(&page->_refcount, nr, u);
>
> A rather deep callchain there, but for our purposes the important part
> is: we take the RCU read lock, we look up a folio, we increment its
> refcount if it's not zero, then check that looking up this index gets
> the same folio; if it doesn't, we decrement the refcount again and retry
> the lookup.
>
> For this analysis, we can be preempted at any point after we've got the
> folio pointer from xa_load().
>
>>  From hugetlb allocation perspective,  one of the scenarios is run time
>> hugetlb page allocation (say 2M pages), starting from the buddy allocator
>> returns compound pages, then the head page is set to frozen, then the
>> folio(compound pages) is put thru the HVO process, one of which is
>> vmemmap_split_pmd() in case a vmemmap page is a PMD page.
>>
>> Until the HVO process completes, none of the vmemmap represented pages are
>> available to any threads, so what are the causes for IRQ threads to access
>> their vmemmap pages?
> Yup, this sounds like enough, but it's not.  The problem is the person
> who's looking up the folio in the pagecache under RCU.  They've got
> the folio pointer and have been preempted.  So now what happens to our
> victim folio?
>
> Something happens to remove it from the page cache.  Maybe the file is
> truncated, perhaps vmscan comes along and kicks it out.  Either way, it's
> removed from the xarray and gets its refcount set to 0.  If the lookup
> were to continue at this time, everything would be fine because it would
> see a refcount of 0 and not increment it (in page_ref_add_unless()).
> And this is where my analysis of RCU tends to go wrong, because I only
> think of interleaving event A and B.  I don't think about B and then C
> happening before A resumes.  But it can!  Let's follow the journey of
> this struct page.
>
> Now that it's been removed from the page cache, it's allocated by hugetlb,
> as you describe.  And it's one of the tail pages towards the end of
> the 512 contiguous struct pages.  That means that we alter vmemmap so
> that the pointer to struct page now points to a different struct page
> (one of the earlier ones).  Then the original page of vmemmap containing
> our lucky struct page is returned to the page allocator.  At this point,
> it no longer contains struct pages; it can contain literally anything.
>
> Where my analysis went wrong was that CPU A _no longer has a pointer
> to it_.  CPU A has a pointer into vmemmap.  So it will access the
> replacement struct page (which definitely has a refcount 0) instead of
> the one which has been freed.  I had thought that CPU A would access the
> original memory which has now been allocated to someone else.  But no,
> it can't because its pointer is virtual, not physical.
>
>
> ---
>
> Now I'm thinking more about this and there's another scenario which I
> thought might go wrong, and doesn't.  For 7 of the 512 pages which are
> freed, the struct page pointer gathered by CPU A will not point to a
> page with a refcount of 0.  Instead it will point to an alias of the
> head page with a positive refcount.  For those pages, CPU A will see
> folio_try_get_rcu() succeed.  Then it will call xas_reload() and see
> the folio isn't there any more, so it will call folio_put() on something
> which used to be a folio, and isn't any more.
>
> But folio_put() calls folio_put_testzero() which calls put_page_testzero()
> without asserting that the pointer is actually to a folio.
> So everything's fine, but really only by coincidence; I don't think
> anybody's thought about this scenario before (maybe Muchun has, but I
> don't remember it being discussed).

Wow!  Marvelous analysis, thank you!

So is the solution simple as making folio_put_testzero() to check 
whether the folio pointer actually points to a folio?

or there is more to consider?

Thanks a lot!

-jane