linux-kernel.vger.kernel.org archive mirror
* [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS
@ 2022-10-25 20:37 Rik van Riel
  2022-10-26  3:07 ` Mike Kravetz
  2022-10-26 19:38 ` Mike Kravetz
  0 siblings, 2 replies; 5+ messages in thread
From: Rik van Riel @ 2022-10-25 20:37 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Chris Mason, David Hildenbrand, linux-mm, linux-kernel,
	kernel-team, Andrew Morton


Hi Mike,

After getting promising results initially, we discovered there
is yet another bug left with hugetlbfs MADV_DONTNEED.

This one involves a page fault on a hugetlbfs address, while
another thread in the same process is in the middle of MADV_DONTNEED
on that same memory address.

The code in __unmap_hugepage_range() will clear the page table
entry, and then at some point later the lazy TLB code will 
actually free the huge page back into the hugetlbfs free page
pool.

Meanwhile, hugetlb_no_page will call alloc_huge_page, and that
will fail because the code calling __unmap_hugepage_range() has
not actually returned the page to the free list yet.

The result is that the process gets killed with SIGBUS.
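
Here is a minimal reproducer sketch of the race (illustrative only, not
our actual workload; it assumes 2MB huge pages, a hugetlb pool with
exactly enough pages for the mapping, and no hugetlb overcommit):

/*
 * One thread keeps touching a private hugetlb page while the main
 * thread keeps doing MADV_DONTNEED on the same range.  If a fault
 * lands in the window between the pte being cleared and
 * tlb_finish_mmu() returning the page to the pool, alloc_huge_page
 * fails and the touching thread dies with SIGBUS.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <stdio.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumes 2MB huge pages */

static char *map;

static void *toucher(void *arg)
{
	(void)arg;
	for (;;)
		*(volatile char *)map = 1;	/* may SIGBUS */
	return NULL;
}

int main(void)
{
	pthread_t t;

	map = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	*(volatile char *)map = 1;		/* populate the huge page */

	pthread_create(&t, NULL, toucher, NULL);
	for (;;)
		madvise(map, HPAGE_SIZE, MADV_DONTNEED);
}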

I have thought of a few different solutions to this problem, but
none of them look good:
- Make MADV_DONTNEED take a write lock on mmap_sem, to exclude
  page faults. This could make MADV_DONTNEED on VMAs with 4kB
  pages unacceptably slow.
- Some sort of atomic counter, kept by __unmap_hugepage_range(),
  indicating that huge pages may still be sitting in the tlb
  gather and will only be freed later by tlb_finish_mmu().  This
  would involve changes to the MMU gather code, outside of
  hugetlbfs.
- Some sort of generation counter that tracks tlb_gather_mmu
  cycles in progress, with the alloc_huge_page failure path
  waiting until all mmu gather operations that started before
  it have finished, before retrying the allocation. This requires
  changes to the generic code, outside of hugetlbfs.  (A rough
  sketch of this idea follows below.)
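
To make the third idea a little more concrete, here is a crude, purely
hypothetical sketch.  None of this exists in the kernel; the new names
are invented, and a real version would need proper wait/wake primitives
instead of spinning, plus a way to avoid waiting on gathers that
started after the failed allocation:

/* Invented: count of mmu_gather operations currently in flight. */
static atomic_t mmu_gathers_in_flight = ATOMIC_INIT(0);

/* Would be called from tlb_gather_mmu(). */
static inline void account_gather_start(void)
{
	atomic_inc(&mmu_gathers_in_flight);
}

/* Would be called from tlb_finish_mmu(), after the gathered pages
 * have actually been freed back to the hugetlb pool. */
static inline void account_gather_finish(void)
{
	atomic_dec(&mmu_gathers_in_flight);
}

/*
 * Invented helper for the hugetlb fault path: instead of sending
 * SIGBUS on the first failed allocation, wait for in-flight gathers
 * to drain and retry once.  This is conservative (it also waits for
 * gathers that started after the failure) and spins, so it is only
 * meant to illustrate the idea.
 */
static struct page *alloc_huge_page_retry(struct vm_area_struct *vma,
					  unsigned long addr)
{
	struct page *page = alloc_huge_page(vma, addr, 0);

	if (IS_ERR(page)) {
		while (atomic_read(&mmu_gathers_in_flight))
			cpu_relax();
		page = alloc_huge_page(vma, addr, 0);
	}
	return page;
}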

What are the reasonable alternatives here?

Should we see if anybody can come up with a simple solution
to the problem, or would it be better to just disable
MADV_DONTNEED on hugetlbfs for now?

-- 
All Rights Reversed.



* Re: [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS
  2022-10-25 20:37 [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS Rik van Riel
@ 2022-10-26  3:07 ` Mike Kravetz
  2022-10-26  4:13   ` Mike Kravetz
  2022-10-26 19:38 ` Mike Kravetz
  1 sibling, 1 reply; 5+ messages in thread
From: Mike Kravetz @ 2022-10-26  3:07 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Chris Mason, David Hildenbrand, linux-mm, linux-kernel,
	kernel-team, Andrew Morton

On 10/25/22 16:37, Rik van Riel wrote:
> Hi Mike,
> 
> After getting promising results initially, we discovered there
> is yet another bug left with hugetlbfs MADV_DONTNEED.
> 
> This one involves a page fault on a hugetlbfs address, while
> another thread in the same process is in the middle of MADV_DONTNEED
> on that same memory address.
> 
> The code in __unmap_hugepage_range() will clear the page table
> entry, and then at some point later the lazy TLB code will 
> actually free the huge page back into the hugetlbfs free page
> pool.
> 
> Meanwhile, hugetlb_no_page will call alloc_huge_page, and that
> will fail because the code calling __unmap_hugepage_range() has
> not actually returned the page to the free list yet.
> 
> The result is that the process gets killed with SIGBUS.

Thanks for the details Rik.  That makes perfect sense.

> I have thought of a few different solutions to this problem, but
> none of them look good:
> - Make MADV_DONTNEED take a write lock on mmap_sem, to exclude
>   page faults. This could make MADV_DONTNEED on VMAs with 4kB
>   pages unacceptably slow.
> - Some sort of atomic counter, kept by __unmap_hugepage_range(),
>   indicating that huge pages may still be sitting in the tlb
>   gather and will only be freed later by tlb_finish_mmu().  This
>   would involve changes to the MMU gather code, outside of
>   hugetlbfs.
> - Some sort of generation counter that tracks tlb_gather_mmu
>   cycles in progress, with the alloc_huge_page failure path
>   waiting until all mmu gather operations that started before
>   it have finished, before retrying the allocation. This requires
>   changes to the generic code, outside of hugetlbfs.
> 
> What are the reasonable alternatives here?
> 
> Should we see if anybody can come up with a simple solution
> to the problem, or would it be better to just disable
> MADV_DONTNEED on hugetlbfs for now?

I am thinking that we could use the vma_lock to prevent faults on the
vma until the MADV_DONTNEED processing is totally complete.  IIUC, it is
the tlb_finish_mmu call that actually drops the ref count on the pages
and returns them to the free list.  Currently, we hold the vma_lock in
write mode during unmap, and acquire it in read mode during faults.
However, we drop it in the MADV_DONTNEED path (just a bit) before calling
tlb_finish_mmu.  So, holding it a bit longer may address this issue rather
simply.

Below is another version of the "hugetlb: don't delete vma_lock in hugetlb
MADV_DONTNEED processing" patch.  It is very much like the version in
Andrew's tree but the code has been shuffled a bit to hold the vma_lock
longer in the MADV_DONTNEED path.  Can you take a look and see if that
addresses the issue for you?

From 9a1f047ce42f76b3befd673da5ee2bd49bea6c75 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 25 Oct 2022 20:02:58 -0700
Subject: [PATCH v4] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
 processing

madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
page tables associated with the address range.  For hugetlb vmas,
zap_page_range will call __unmap_hugepage_range_final.  However,
__unmap_hugepage_range_final assumes the passed vma is about to be
removed and deletes the vma_lock to prevent pmd sharing as the vma is
on the way out.  In the case of madvise(MADV_DONTNEED) the vma remains,
but the missing vma_lock prevents pmd sharing and could potentially
lead to issues with truncation/fault races.

This issue was originally reported here [1] as a BUG triggered in
page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
prevent pmd sharing.  Subsequent faults on this vma were confused:
VM_MAYSHARE indicates a sharable vma, but since it was no longer set,
page_mapping was not set in new pages added to the page table.  This
resulted in pages that appeared anonymous in a VM_SHARED vma and
triggered the BUG.

Create a new routine clear_hugetlb_page_range() that can be called from
madvise(MADV_DONTNEED) for hugetlb vmas.  It has the same setup as
zap_page_range, but does not delete the vma_lock.

[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/

Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Reported-by: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: <stable@vger.kernel.org>
---
 include/linux/hugetlb.h |  7 +++++++
 mm/hugetlb.c            | 30 ++++++++++++++++++++++++++++++
 mm/madvise.c            |  5 ++++-
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..0246e77be3a3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,6 +158,8 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
+void clear_hugetlb_page_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end);
 void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
@@ -426,6 +428,11 @@ static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	BUG();
 }
 
+static void __maybe_unused clear_hugetlb_page_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end)
+{
+}
+
 static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
 			unsigned int flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 931789a8f734..0a52c3938ce9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5221,9 +5221,39 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
 	 * will be issues when accessed by someone else.
 	 */
 	__hugetlb_vma_unlock_write_free(vma);
+	i_mmap_unlock_write(vma->vm_file->f_mapping);
+}
+
+#ifdef CONFIG_ADVISE_SYSCALLS
+/*
+ * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
+ * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
+ * the associated vma_lock.
+ */
+void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end)
+{
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				start, end);
+	tlb_gather_mmu(&tlb, vma->vm_mm);
+	update_hiwater_rss(vma->vm_mm);
+
+	hugetlb_vma_lock_write(vma);
+	i_mmap_lock_write(vma->vm_file->f_mapping);
+
+	__unmap_hugepage_range(&tlb, vma, start, end, NULL, 0);
 
 	i_mmap_unlock_write(vma->vm_file->f_mapping);
+
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
+
+	hugetlb_vma_unlock_write(vma);
 }
+#endif
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..90577a669635 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range(vma, start, end - start);
+	if (!is_vm_hugetlb_page(vma))
+		zap_page_range(vma, start, end - start);
+	else
+		clear_hugetlb_page_range(vma, start, end);
 	return 0;
 }
 
-- 
2.37.3



* Re: [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS
  2022-10-26  3:07 ` Mike Kravetz
@ 2022-10-26  4:13   ` Mike Kravetz
  0 siblings, 0 replies; 5+ messages in thread
From: Mike Kravetz @ 2022-10-26  4:13 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Chris Mason, David Hildenbrand, linux-mm, linux-kernel,
	kernel-team, Andrew Morton

On 10/25/22 20:07, Mike Kravetz wrote:
> On 10/25/22 16:37, Rik van Riel wrote:
> > Hi Mike,
> > 
> > After getting promising results initially, we discovered there
> > is yet another bug left with hugetlbfs MADV_DONTNEED.
> > 
> > This one involves a page fault on a hugetlbfs address, while
> > another thread in the same process is in the middle of MADV_DONTNEED
> > on that same memory address.
> > 
> > The code in __unmap_hugepage_range() will clear the page table
> > entry, and then at some point later the lazy TLB code will 
> > actually free the huge page back into the hugetlbfs free page
> > pool.
> > 
> > Meanwhile, hugetlb_no_page will call alloc_huge_page, and that
> > will fail because the code calling __unmap_hugepage_range() has
> > not actually returned the page to the free list yet.
> > 
> > The result is that the process gets killed with SIGBUS.
> 
> Thanks for the details Rik.  That makes perfect sense.
> 
> > I have thought of a few different solutions to this problem, but
> > none of them look good:
> > - Make MADV_DONTNEED take a write lock on mmap_sem, to exclude
> >   page faults. This could make MADV_DONTNEED on VMAs with 4kB
> >   pages unacceptably slow.
> > - Some sort of atomic counter, kept by __unmap_hugepage_range(),
> >   indicating that huge pages may still be sitting in the tlb
> >   gather and will only be freed later by tlb_finish_mmu().  This
> >   would involve changes to the MMU gather code, outside of
> >   hugetlbfs.
> > - Some sort of generation counter that tracks tlb_gather_mmu
> >   cycles in progress, with the alloc_huge_page failure path
> >   waiting until all mmu gather operations that started before
> >   it have finished, before retrying the allocation. This requires
> >   changes to the generic code, outside of hugetlbfs.
> > 
> > What are the reasonable alternatives here?
> > 
> > Should we see if anybody can come up with a simple solution
> > to the problem, or would it be better to just disable
> > MADV_DONTNEED on hugetlbfs for now?
> 
> I am thinking that we could use the vma_lock to prevent faults on the
> vma until the MADV_DONTNEED processing is totally complete.  IIUC, it is
> the tlb_finish_mmu call that actually drops the ref count on the pages
> and returns them to the free list.  Currently, we hold the vma_lock in
> write mode during unmap, and acquire it in read mode during faults.
> However, we drop it in the MADV_DONTNEED path (just a bit) before calling
> tlb_finish_mmu.  So, holding it a bit longer may address this issue rather
> simply.

Sorry, that is wrong!  The vma_lock only applies to sharable mappings
and this is certainly about non-sharable mappings.  Please disregard.

I will continue to think about this and try to come up with a solution.
If not, we may need to disable hugetlb MADV_DONTNEED.

I really don't want to modify too much (if any) non-hugetlb code to
accommodate this.
-- 
Mike Kravetz


* Re: [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS
  2022-10-25 20:37 [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS Rik van Riel
  2022-10-26  3:07 ` Mike Kravetz
@ 2022-10-26 19:38 ` Mike Kravetz
  2022-10-27 15:07   ` Johannes Weiner
  1 sibling, 1 reply; 5+ messages in thread
From: Mike Kravetz @ 2022-10-26 19:38 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Chris Mason, David Hildenbrand, Axel Rasmussen, Peter Xu,
	James Houghton, linux-mm, linux-kernel, kernel-team,
	Andrew Morton

On 10/25/22 16:37, Rik van Riel wrote:
> Hi Mike,
> 
> After getting promising results initially, we discovered there
> is yet another bug left with hugetlbfs MADV_DONTNEED.
> 
> This one involves a page fault on a hugetlbfs address, while
> another thread in the same process is in the middle of MADV_DONTNEED
> on that same memory address.
> 
> The code in __unmap_hugepage_range() will clear the page table
> entry, and then at some point later the lazy TLB code will 
> actually free the huge page back into the hugetlbfs free page
> pool.
> 
> Meanwhile, hugetlb_no_page will call alloc_huge_page, and that
> will fail because the code calling __unmap_hugepage_range() has
> not actually returned the page to the free list yet.
> 
> The result is that the process gets killed with SIGBUS.
> 
> I have thought of a few different solutions to this problem, but
> none of them look good:
> - Make MADV_DONTNEED take a write lock on mmap_sem, to exclude
>   page faults. This could make MADV_DONTNEED on VMAs with 4kB
>   pages unacceptably slow.
> - Some sort of atomic counter, kept by __unmap_hugepage_range(),
>   indicating that huge pages may still be sitting in the tlb
>   gather and will only be freed later by tlb_finish_mmu().  This
>   would involve changes to the MMU gather code, outside of
>   hugetlbfs.
> - Some sort of generation counter that tracks tlb_gather_mmu
>   cycles in progress, with the alloc_huge_page failure path
>   waiting until all mmu gather operations that started before
>   it have finished, before retrying the allocation. This requires
>   changes to the generic code, outside of hugetlbfs.

I've had a little more time to think about this.  Sorry about the noise
yesterday evening!

My assumption is that you are still working with library code that gets
passed an address range and is not aware it is a hugetlb mapping.  Correct?

Part of the issue here is that hugetlb pages are a limited resource.  As
such we have this somewhat ugly 'kill with SIGBUS' behavior if we are
unable to allocate a hugetlb page for a page fault.  I would expect
applications that know they are dealing with hugetlb mappings and call
MADV_DONTNEED to at least be prepared for this situation.  I know that may
sound crazy, but let me explain why.

The first sentence in the description of MADV_DONTNEED says:
"Do  not  expect access in the near future.  (For the time being, the
 application is finished with the given range, so the  kernel can free
 resources associated with it.)"
As mentioned, hugetlb pages are a limited resource.  So, I would expect an
application to MADV_DONTNEED part of a hugetlb range so that the pages can
be used somewhere else.  If the application has need for pages in that range
later, it should not expect that those pages are still available.  They
will only be available if whoever may be using those pages has made them
available again.  Some type of coordination by application code is
required to manage the use of the limited hugetlb pages.

In hugetlbfs there is the concept of 'reservations' to somewhat manage
the expectation that hugetlb pages are available for page faults.  I
suspect most people know that reservations are created at mmap time for
the size of the mapping.  The corresponding number of hugetlb pages are
reserved for the mapping.  As the mapping is populated, the number of
reservations is decremented.

Now consider a populated hugetlb mapping, and someone does a MADV_DONTNEED
on a page of that mapping.  When the page was originally faulted in, the
reservation associated with that address was consumed.  So, before the
MADV_DONTNEED there is no reservation for that specific address within
the mapping.  In addition, when the MADV_DONTNEED is performed, there is no
effort made to restore or create a reservation for the address.  I 'believe'
this is the desired behavior.  If a reservation were created as the result
of MADV_DONTNEED, the freed page is not really 'free' as it is still reserved
for this address/mapping.

If we believe that MADV_DONTNEED truly frees the underlying page and does not
create a reservation, then getting a SIGBUS as the result of a subsequent
fault is certainly a possibility.  The situation you describe above results
in SIGBUS because the fault happens before the page hits the free list.
However, even if we hold off faults until the page hits the free list,
there is nothing to prevent someone else from taking the page off the free
list before the fault happens.  This would also result in SIGBUS.  Do
note that all this assumes there are no free hugetlb pages before the
MADV_DONTNEED.  This is why I think an application that knows it is
performing an MADV_DONTNEED on a hugetlb mapping cannot expect with
100% certainty that subsequent faults on that range will succeed.

Of course, that does not help in this specific situation.  Since the
application/library code does not know that it is dealing with a hugetlb
mapping, the best thing would be for the MADV_DONTNEED not to succeed.
However, since MADV_DONTNEED for hugetlbfs was introduced, there has
been some interest in and desire for its use.  Mostly this is around
userfaultfd and live
migration.  So, I would really hate to just remove hugetlb MADV_DONTNEED
support.  I have cc'ed some of those people here.

> What are the reasonable alternatives here?
> 
> Should we see if anybody can come up with a simple solution
> to the problem, or would it be better to just disable
> MADV_DONTNEED on hugetlbfs for now?

Here is one thought (perhaps crazy).  Do not allow MADV_DONTNEED on
hugetlb mappings.  This would help with the case of library code being
passed a hugetlb mapping.  For the use cases where MADV_DONTNEED on
hugetlb is desirable,
create ?MADV_DONTNEED_HUGETLB? that only operates on hugetlb mappings.
In this way the caller must be aware they are operating on a hugetlb
mapping.
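
For concreteness, a purely hypothetical sketch of what that gating
might look like in madvise_dontneed_single_vma().  MADV_DONTNEED_HUGETLB
and the extra 'behavior' argument are invented here; nothing like this
exists today:

static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
					unsigned long start, unsigned long end,
					int behavior)
{
	if (is_vm_hugetlb_page(vma)) {
		/* hugetlb only via the explicit, hugetlb-aware advice */
		if (behavior != MADV_DONTNEED_HUGETLB)
			return -EINVAL;
		/* hugetlb-aware zap, e.g. the helper from the earlier patch */
		clear_hugetlb_page_range(vma, start, end);
	} else {
		if (behavior == MADV_DONTNEED_HUGETLB)
			return -EINVAL;
		zap_page_range(vma, start, end - start);
	}
	return 0;
}
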
-- 
Mike Kravetz


* Re: [BUG] hugetlbfs_no_page vs MADV_DONTNEED race leading to SIGBUS
  2022-10-26 19:38 ` Mike Kravetz
@ 2022-10-27 15:07   ` Johannes Weiner
  0 siblings, 0 replies; 5+ messages in thread
From: Johannes Weiner @ 2022-10-27 15:07 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Rik van Riel, Chris Mason, David Hildenbrand, Axel Rasmussen,
	Peter Xu, James Houghton, linux-mm, linux-kernel, kernel-team,
	Andrew Morton

On Wed, Oct 26, 2022 at 12:38:43PM -0700, Mike Kravetz wrote:
> On 10/25/22 16:37, Rik van Riel wrote:
> > What are the reasonable alternatives here?
> > 
> > Should we see if anybody can come up with a simple solution
> > to the problem, or would it be better to just disable
> > MADV_DONTNEED on hugetlbfs for now?
> 
> Here is one thought (perhaps crazy).  Do not allow MADV_DONTNEED on
> hugetlb mappings.  This would help with the library code passed hugetlb
> mapping.  For the use cases where MADV_DONTNEED on hugetlb is desirable,
> create ?MADV_DONTNEED_HUGETLB? that only operates on hugetlb mappings.
> In this way the caller must be aware they are operating on a hugetlb
> mapping.

What could also work is implementing mlock() for hugetlb, and having
MADV_DONTNEED respect it.

That would allow libraries/general-purpose allocators to continue
using MADV_DONTNEED without being aware of the underlying memory
situation. Whoever is responsible for setting up the memory pool could
just mlock() - or not, if hugetlb overcommit is enabled - and the GP
allocator would do the right thing in both scenarios.
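
Roughly, from the pool owner's side (just a sketch under the proposed
semantics, i.e. mlock() actually taking effect on hugetlb mappings and
MADV_DONTNEED refusing locked ranges the way it already does for
regular memory; sizes and flags are illustrative):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#define POOL_SIZE (64UL * 2 * 1024 * 1024)	/* 64 x 2MB huge pages */

int main(void)
{
	void *pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (pool == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Pool owner pins the hugetlb range ... */
	if (mlock(pool, POOL_SIZE))
		perror("mlock");

	/*
	 * ... so a general-purpose allocator handed this range could keep
	 * calling MADV_DONTNEED blindly: under the proposed behavior it
	 * would fail on the locked range instead of giving huge pages
	 * back to the pool.
	 */
	if (madvise(pool, POOL_SIZE, MADV_DONTNEED))
		perror("madvise(MADV_DONTNEED)");

	return 0;
}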

[ Our setup code is actually already calling mlock() on the hugetlb
  ranges. We never wanted DONTNEED to free hugetlb pages - it just
  happened to work so far because DONTNEED wasn't implemented for
  them. If both DONTNEED and mlock() were implemented, we'd be good. ]

Johannes


