Re: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	James Houghton <jthoughton@google.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	David Hildenbrand <david@redhat.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Nadav Amit <nadav.amit@gmail.com>,
	Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
Date: Fri, 4 Nov 2022 08:44:08 -0700	[thread overview]
Message-ID: <Y2UzSOqJr9glD39i@monkey> (raw)
In-Reply-To: <Y2UplvB0dgQYIZWm@x1n>

On 11/04/22 11:02, Peter Xu wrote:
> Hi, Mike,
> 
> On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> > On 10/30/22 17:29, Peter Xu wrote:
> > > Resolution
> > > ==========
> > > 
> > > What this patch proposed is, besides using the vma lock, we can also use
> > > RCU to protect the pgtable page from being freed from under us when
> > > huge_pte_offset() is used.  The idea is kind of similar to RCU fast-gup.
> > > Note that fast-gup is very safe regarding pmd unsharing even before vma
> > > lock, because fast-gup relies on RCU to protect walking any pgtable page,
> > > including another mm's.
> > > 
> > > To apply the same idea to huge_pte_offset(), it means with proper RCU
> > > protection the pte_t* pointer returned from huge_pte_offset() can also be
> > > always safe to access and de-reference, along with the pgtable lock that
> > > was bound to the pgtable page.
> > > 
> > > Patch Layout
> > > ============
> > > 
> > > Patch 1 is a trivial cleanup that I noticed when working on this.  Please
> > > shoot if anyone think I should just post it separately, or hopefully I can
> > > still just carry it over.
> > > 
> > > Patch 2 is the gut of the patchset, describing how we should use the helper
> > > huge_pte_offset() correctly. Only a comment patch but should be the most
> > > important one, as the follow up patches are just trying to follow the rule
> > > it setup here.
> > > 
> > > The rest patches resolve all the call sites of huge_pte_offset() to make
> > > sure either it's with the vma lock (which is perfectly good enough for
> > > safety in this case; the last patch commented on all those callers to make
> > > sure we won't miss a single case, and why they're safe).  Besides, each of
> > > the patch will add rcu protection to one caller of huge_pte_offset().
> > > 
> > > Tests
> > > =====
> > > 
> > > Only lightly tested on hugetlb kselftests including uffd, no more errors
> > > triggered than current mm-unstable (hugetlb-madvise fails before/after
> > > here, with error "Unexpected number of free huge pages line 207"; haven't
> > > really got time to look into it).
> > 
> > Do not worry about the madvise test failure, that is caused by a recent
> > change.
> > 
> > Unless I am missing something, the basic strategy in this series is to
> > wrap calls to huge_pte_offset and subsequent ptep access with
> > rcu_read_lock/unlock calls.  I must embarrassingly admit that it has
> > been a loooong time since I had to look at rcu usage and may not know
> > what I am talking about.  However, I seem to recall that one needs to
> > somehow flag the data items being protected from update/freeing.  I
> > do not see anything like that in the huge_pmd_unshare routine where
> > pmd page pointer is updated.  Or, is it where the pmd page pointer is
> > referenced in huge_pte_offset?
> 
> Right.  The RCU proposed here is trying to protect the pmd pgtable page
> that will normally be freed in rcu pattern.  Please refer to
> tlb_remove_table_free() (which can be called from tlb_finish_mmu()) where
> it's released with RCU API:
> 
> 	call_rcu(&batch->rcu, tlb_remove_table_rcu);
> 

Thanks!  That is the piece of the puzzle I was missing.

> I mentioned fast-gup just to refererence on the same usage as fast-gup has
> the same risk if without RCU or similar protections that is IPI-based, but
> I definitely can be even clearer, and I will enrich the cover letter in the
> next post.
> 
> In short, my understanding is pgtable pages (including the shared PUD page
> for hugetlb) needs to be freed with caution because there can be softwares
> that are walking the pages with no locks.  In our case, even though
> huge_pte_offset() is with the mmap lock, due to the pmd sharing it's not
> always having the same mmap lock as when the pgtable needs to be freed, so
> it's similar to having no lock here, imo.  Then huge_pte_offset() needs to
> be protected just like what we do with fast-gup.
> 
> Please also feel free to refer to the comment chunk at the start of
> asm-generic/tlb.h for more information on the mmu gather API.
> 
> > 
> > Please ignore if you are certain of this rcu usage, otherwise I will
> > spend some time reeducating myself.

Sorry for any misunderstanding.  I am very happy with the RFC and the
work you have done.  I was just missing the piece about rcu
synchronization when the page table was removed.

> I'm not certain, and I'd like to get any form of comment. :)
> 
> Sorry if this RFC version is confusing, but if it can try to at least
> explain what the problem we have and if we can agree on the problem first
> then that'll already be a step forward to me.  So far that's more important
> than how we resolve it, using RCU or vma lock or anything else.
> 
> For a non-rfc series, I think I need to be more careful on some details,
> e.g., the RCU protection for pgtable page is only used when the arch
> supports MMU_GATHER_RCU_TABLE_FREE.  I thought that's always supported at
> least for pmd sharing enabled archs, but I'm actually wrong:
> 
> arch/arm64/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> arch/riscv/Kconfig:     select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
> arch/x86/Kconfig:       select ARCH_WANT_HUGE_PMD_SHARE
> 
> arch/arm/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
> arch/arm64/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE
> arch/powerpc/Kconfig:   select MMU_GATHER_RCU_TABLE_FREE
> arch/s390/Kconfig:      select MMU_GATHER_RCU_TABLE_FREE
> arch/sparc/Kconfig:     select MMU_GATHER_RCU_TABLE_FREE if SMP
> arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> arch/x86/Kconfig:       select MMU_GATHER_RCU_TABLE_FREE        if PARAVIRT
> 
> I think it means at least on RISCV RCU_TABLE_FREE is not enabled and we'll
> need to rely on the IPIs (e.g. I think we need to replace rcu_read_lock()
> with local_irq_disable() on RISCV only for what this patchset wanted to
> do).  In the next version, I plan to add a helper, let's name it
> huge_pte_walker_lock() for now, and it should be one of the three options:
> 
>   - if !ARCH_WANT_HUGE_PMD_SHARE:      it's no-op
>   - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
>   - else:                              it should be local_irq_disable()
> 
> With that, I think we'll strictly follow what we have with fast-gup, at the
> meantime it should add zero overhead on archs that does not have pmd sharing.
> 
> Hope above helps a bit on extending the missing pieces of the cover
> letter.  Or again if anything missing I'd be more than glad to know..
-- 
Mike Kravetz