From: Mike Kravetz <mike.kravetz@oracle.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Andrew Morton <akpm@linux-foundation.org>,
James Houghton <jthoughton@google.com>,
Miaohe Lin <linmiaohe@huawei.com>,
David Hildenbrand <david@redhat.com>,
Muchun Song <songmuchun@bytedance.com>,
Andrea Arcangeli <aarcange@redhat.com>,
Nadav Amit <nadav.amit@gmail.com>,
Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare
Date: Fri, 4 Nov 2022 08:44:08 -0700
Message-ID: <Y2UzSOqJr9glD39i@monkey>
In-Reply-To: <Y2UplvB0dgQYIZWm@x1n>

On 11/04/22 11:02, Peter Xu wrote:
> Hi, Mike,
>
> On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> > On 10/30/22 17:29, Peter Xu wrote:
> > > Resolution
> > > ==========
> > >
> > > What this patch proposed is, besides using the vma lock, we can also use
> > > RCU to protect the pgtable page from being freed from under us when
> > > huge_pte_offset() is used. The idea is kind of similar to RCU fast-gup.
> > > Note that fast-gup is very safe regarding pmd unsharing even before vma
> > > lock, because fast-gup relies on RCU to protect walking any pgtable page,
> > > including another mm's.
> > >
> > > To apply the same idea to huge_pte_offset(), it means with proper RCU
> > > protection the pte_t* pointer returned from huge_pte_offset() can also be
> > > always safe to access and de-reference, along with the pgtable lock that
> > > was bound to the pgtable page.
> > >
> > > Patch Layout
> > > ============
> > >
> > > Patch 1 is a trivial cleanup that I noticed when working on this. Please
> > > shoot if anyone think I should just post it separately, or hopefully I can
> > > still just carry it over.
> > >
> > > Patch 2 is the gut of the patchset, describing how we should use the helper
> > > huge_pte_offset() correctly. Only a comment patch but should be the most
> > > important one, as the follow up patches are just trying to follow the rule
> > > it setup here.
> > >
> > > The rest patches resolve all the call sites of huge_pte_offset() to make
> > > sure either it's with the vma lock (which is perfectly good enough for
> > > safety in this case; the last patch commented on all those callers to make
> > > sure we won't miss a single case, and why they're safe). Besides, each of
> > > the patch will add rcu protection to one caller of huge_pte_offset().
> > >
> > > Tests
> > > =====
> > >
> > > Only lightly tested on hugetlb kselftests including uffd, no more errors
> > > triggered than current mm-unstable (hugetlb-madvise fails before/after
> > > here, with error "Unexpected number of free huge pages line 207"; haven't
> > > really got time to look into it).
> >
> > Do not worry about the madvise test failure, that is caused by a recent
> > change.
> >
> > Unless I am missing something, the basic strategy in this series is to
> > wrap calls to huge_pte_offset and subsequent ptep access with
> > rcu_read_lock/unlock calls. I must embarrassingly admit that it has
> > been a loooong time since I had to look at rcu usage and may not know
> > what I am talking about. However, I seem to recall that one needs to
> > somehow flag the data items being protected from update/freeing. I
> > do not see anything like that in the huge_pmd_unshare routine where
> > pmd page pointer is updated. Or, is it where the pmd page pointer is
> > referenced in huge_pte_offset?
>
> Right. The RCU proposed here is trying to protect the pmd pgtable page
> that will normally be freed in rcu pattern. Please refer to
> tlb_remove_table_free() (which can be called from tlb_finish_mmu()) where
> it's released with RCU API:
>
> call_rcu(&batch->rcu, tlb_remove_table_rcu);
>

Thanks!  That is the piece of the puzzle I was missing.

> I mentioned fast-gup just to reference the same usage, as fast-gup has
> the same risk without RCU or a similar IPI-based protection, but
> I definitely can be even clearer, and I will enrich the cover letter in the
> next post.
>
> In short, my understanding is that pgtable pages (including the shared PUD
> page for hugetlb) need to be freed with caution because there can be
> software walking the pages with no locks. In our case, even though
> huge_pte_offset() is called with the mmap lock, due to pmd sharing it's not
> always the same mmap lock as the one held when the pgtable is freed, so
> it's similar to having no lock here, imo. Then huge_pte_offset() needs to
> be protected just like what we do with fast-gup.
>
> Please also feel free to refer to the comment chunk at the start of
> asm-generic/tlb.h for more information on the mmu gather API.
>
> >
> > Please ignore if you are certain of this rcu usage, otherwise I will
> > spend some time reeducating myself.

Sorry for any misunderstanding.  I am very happy with the RFC and the
work you have done.  I was just missing the piece about rcu
synchronization when the page table was removed.

> I'm not certain, and I'd like to get any form of comment. :)
>
> Sorry if this RFC version is confusing, but if it can at least explain
> what problem we have, and if we can agree on the problem first, then
> that'll already be a step forward to me. So far that's more important
> than how we resolve it, using RCU or vma lock or anything else.
>
> For a non-rfc series, I think I need to be more careful on some details,
> e.g., the RCU protection for pgtable page is only used when the arch
> supports MMU_GATHER_RCU_TABLE_FREE. I thought that's always supported at
> least for pmd sharing enabled archs, but I'm actually wrong:
>
> arch/arm64/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> arch/riscv/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
> arch/x86/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE
>
> arch/arm/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
> arch/arm64/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
> arch/powerpc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
> arch/s390/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
> arch/sparc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP
> arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> arch/x86/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
>
> I think it means at least on RISCV RCU_TABLE_FREE is not enabled and we'll
> need to rely on the IPIs (e.g. I think we need to replace rcu_read_lock()
> with local_irq_disable() on RISCV only for what this patchset wanted to
> do). In the next version, I plan to add a helper, let's name it
> huge_pte_walker_lock() for now, and it should be one of the three options:
>
> - if !ARCH_WANT_HUGE_PMD_SHARE: it's no-op
> - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
> - else: it should be local_irq_disable()
>
> With that, I think we'll strictly follow what we have with fast-gup; in the
> meantime it should add zero overhead on archs that do not have pmd sharing.
>
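
A minimal userspace sketch of the three-way selection listed above, offered
only to make the proposal concrete. The CONFIG_* symbols are the Kconfig
names quoted in this thread and huge_pte_walker_lock() is the name proposed
above; huge_pte_walker_unlock(), the stub_* functions, rcu_depth and irqs_off
are hypothetical stand-ins so the logic compiles outside the kernel. Whether
the real helper would use local_irq_disable() or local_irq_save() is not
settled here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model one configuration. This one mimics RISC-V as described above:
 * PMD sharing is enabled but MMU_GATHER_RCU_TABLE_FREE is not selected.
 */
#define CONFIG_ARCH_WANT_HUGE_PMD_SHARE 1
/* #define CONFIG_MMU_GATHER_RCU_TABLE_FREE 1 */

enum walker_guard { GUARD_NONE, GUARD_RCU, GUARD_IRQ_OFF };

/* Compile-time choice, in the same order as the three options above. */
#if !defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE)
#define WALKER_GUARD GUARD_NONE
#elif defined(CONFIG_MMU_GATHER_RCU_TABLE_FREE)
#define WALKER_GUARD GUARD_RCU
#else
#define WALKER_GUARD GUARD_IRQ_OFF
#endif

/* Userspace stand-ins for kernel state (assumptions, not kernel code). */
static int rcu_depth;
static bool irqs_off;

static void stub_rcu_read_lock(void)     { rcu_depth++; }
static void stub_rcu_read_unlock(void)   { assert(rcu_depth > 0); rcu_depth--; }
static void stub_local_irq_disable(void) { irqs_off = true; }
static void stub_local_irq_enable(void)  { assert(irqs_off); irqs_off = false; }

/* Take whatever protection keeps a shared pgtable page from being freed. */
static void huge_pte_walker_lock(void)
{
	switch (WALKER_GUARD) {
	case GUARD_RCU:
		stub_rcu_read_lock();
		break;
	case GUARD_IRQ_OFF:
		stub_local_irq_disable();
		break;
	case GUARD_NONE:
		break;		/* no pmd sharing: nothing to protect against */
	}
}

/* Hypothetical counterpart; must pair with huge_pte_walker_lock(). */
static void huge_pte_walker_unlock(void)
{
	switch (WALKER_GUARD) {
	case GUARD_RCU:
		stub_rcu_read_unlock();
		break;
	case GUARD_IRQ_OFF:
		stub_local_irq_enable();
		break;
	case GUARD_NONE:
		break;
	}
}
```

A caller would bracket huge_pte_offset() and every dereference of the
returned pointer with the lock/unlock pair. Uncommenting the
CONFIG_MMU_GATHER_RCU_TABLE_FREE line switches the sketch to the RCU branch.
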
> Hope the above helps a bit in filling in the missing pieces of the cover
> letter. Or again, if anything is missing I'd be more than glad to know.

--
Mike Kravetz