* [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range()
@ 2020-11-10 23:44 Jason Gunthorpe
  2020-11-10 23:44 ` [PATCH v4 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2020-11-10 23:44 UTC (permalink / raw)
  To: linux-kernel, Peter Xu, Linus Torvalds
  Cc: Ahmed S. Darwish, Andrea Arcangeli, Andrew Morton,
	Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara,
	Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov

As discussed and suggested by Linus, use a seqcount to close the small
race between gup_fast and copy_page_range().

Ahmed confirms that raw_write_seqcount_begin() is the correct API to use
in this case and that it does not trigger any lockdep warnings.

I was able to test it using two threads, one forking and the other using
ibv_reg_mr() to trigger fast GUP. Modifying copy_page_range() to sleep
made the race window large enough to hit reliably and exercise the logic.
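
The test program itself is not included here; purely as illustration, a
minimal reproducer along those lines could look like the sketch below
(hypothetical code, not the program actually used: libibverbs error
handling and teardown are omitted, and an RDMA device at index 0 is
assumed):

#include <infiniband/verbs.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static struct ibv_pd *pd;
static void *buf;

/* Repeatedly pin/unpin buf; ibv_reg_mr() drives gup_fast(FOLL_PIN) */
static void *pin_loop(void *unused)
{
	for (;;) {
		struct ibv_mr *mr = ibv_reg_mr(pd, buf, 1UL << 20,
					       IBV_ACCESS_LOCAL_WRITE);
		if (mr)
			ibv_dereg_mr(mr);
	}
	return NULL;
}

int main(void)
{
	struct ibv_device **devs = ibv_get_device_list(NULL);
	struct ibv_context *ctx = ibv_open_device(devs[0]);
	pthread_t thr;

	pd = ibv_alloc_pd(ctx);
	buf = malloc(1UL << 20);
	pthread_create(&thr, NULL, pin_loop, NULL);

	for (;;) {	/* each fork() runs copy_page_range() on this mm */
		pid_t pid = fork();

		if (pid == 0)
			_exit(0);
		waitpid(pid, NULL, 0);
	}
}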

v4:
 - Use read_seqcount_retry() not read_seqcount_t_retry()
v3: https://lore.kernel.org/r/0-v3-7358966cab09+14e9-gup_fork_jgg@nvidia.com
 - Revise comment for write_protect_seq
 - Revise comment in copy_page_range
 - Use raw_write_seqcount_begin() not raw_write_seqcount_t_begin()
v2: https://lore.kernel.org/r/0-v2-dfe9ecdb6c74+2066-gup_fork_jgg@nvidia.com
 - Use start not addr in lockless_pages_from_mm
 - Replace unsigned long casts by using the proper variable type
 - Update comments
 - Use raw_write_seqcount_t_begin() instead of open coding
 - Update commit messages
v1: https://lore.kernel.org/r/0-v1-281e425c752f+2df-gup_fork_jgg@nvidia.com

To: linux-kernel@vger.kernel.org
To: Peter Xu <peterx@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Linux-MM <linux-mm@kvack.org>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: "Ahmed S. Darwish" <a.darwish@linutronix.de>

Jason Gunthorpe (2):
  mm: reorganize internal_get_user_pages_fast()
  mm: prevent gup_fast from racing with COW during fork

 arch/x86/kernel/tboot.c    |   1 +
 drivers/firmware/efi/efi.c |   1 +
 include/linux/mm_types.h   |   8 +++
 kernel/fork.c              |   1 +
 mm/gup.c                   | 117 +++++++++++++++++++++++--------------
 mm/init-mm.c               |   1 +
 mm/memory.c                |  13 ++++-
 7 files changed, 96 insertions(+), 46 deletions(-)

-- 
2.29.2




* [PATCH v4 1/2] mm: reorganize internal_get_user_pages_fast()
  2020-11-10 23:44 [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range() Jason Gunthorpe
@ 2020-11-10 23:44 ` Jason Gunthorpe
  2020-11-10 23:44 ` [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork Jason Gunthorpe
  2020-11-11 18:27 ` [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range() Linus Torvalds
  2 siblings, 0 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2020-11-10 23:44 UTC (permalink / raw)
  To: linux-kernel, Peter Xu, Linus Torvalds
  Cc: Ahmed S. Darwish, Andrea Arcangeli, Andrew Morton,
	Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn,
	John Hubbard, Kirill Shutemov, Kirill Tkhai, Linux-MM,
	Michal Hocko, Oleg Nesterov

The next patch in this series makes the lockless flow a little more
complex, so move the entire block into a new function and remove a level
of indentation. Tidy a bit of cruft:

 - addr is always the same as start, so use start

 - Use the modern check_add_overflow() for computing end = start + len

 - nr_pinned/pages << PAGE_SHIFT needs an unsigned long LHS to avoid
   shift overflow; make the variables unsigned long so no cast is needed
   at either site (nr_pinned was missing its cast; see the sketch below)

 - The handling of ret and nr_pinned can be streamlined a bit

No functional change.
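
Purely as illustration (not part of the patch), the shift overflow in
question: with int variables the shift is performed in 32 bits and can
overflow before being widened to unsigned long. Assuming PAGE_SHIFT == 12:

	int nr_pages = 0x100000;	/* 4GiB worth of 4k pages */
	/* shift done in int: 2^32 does not fit, undefined behaviour */
	unsigned long bad  = nr_pages << PAGE_SHIFT;
	/* widened first: yields 0x100000000 as intended */
	unsigned long good = (unsigned long)nr_pages << PAGE_SHIFT;

Declaring the variables unsigned long, as the patch does, avoids the
cast at every use.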

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 mm/gup.c | 99 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 54 insertions(+), 45 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6d2609c3..c7e24301860abb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2677,13 +2677,43 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 	return ret;
 }
 
-static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
+static unsigned long lockless_pages_from_mm(unsigned long start,
+					    unsigned long end,
+					    unsigned int gup_flags,
+					    struct page **pages)
+{
+	unsigned long flags;
+	int nr_pinned = 0;
+
+	if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
+	    !gup_fast_permitted(start, end))
+		return 0;
+
+	/*
+	 * Disable interrupts. The nested form is used, in order to allow full,
+	 * general purpose use of this routine.
+	 *
+	 * With interrupts disabled, we block page table pages from being freed
+	 * from under us. See struct mmu_table_batch comments in
+	 * include/asm-generic/tlb.h for more details.
+	 *
+	 * We do not adopt an rcu_read_lock() here as we also want to block IPIs
+	 * that come from THPs splitting.
+	 */
+	local_irq_save(flags);
+	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
+	local_irq_restore(flags);
+	return nr_pinned;
+}
+
+static int internal_get_user_pages_fast(unsigned long start,
+					unsigned long nr_pages,
 					unsigned int gup_flags,
 					struct page **pages)
 {
-	unsigned long addr, len, end;
-	unsigned long flags;
-	int nr_pinned = 0, ret = 0;
+	unsigned long len, end;
+	unsigned long nr_pinned;
+	int ret;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
 				       FOLL_FORCE | FOLL_PIN | FOLL_GET |
@@ -2697,54 +2727,33 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 		might_lock_read(&current->mm->mmap_lock);
 
 	start = untagged_addr(start) & PAGE_MASK;
-	addr = start;
-	len = (unsigned long) nr_pages << PAGE_SHIFT;
-	end = start + len;
-
-	if (end <= start)
+	len = nr_pages << PAGE_SHIFT;
+	if (check_add_overflow(start, len, &end))
 		return 0;
 	if (unlikely(!access_ok((void __user *)start, len)))
 		return -EFAULT;
 
-	/*
-	 * Disable interrupts. The nested form is used, in order to allow
-	 * full, general purpose use of this routine.
-	 *
-	 * With interrupts disabled, we block page table pages from being
-	 * freed from under us. See struct mmu_table_batch comments in
-	 * include/asm-generic/tlb.h for more details.
-	 *
-	 * We do not adopt an rcu_read_lock(.) here as we also want to
-	 * block IPIs that come from THPs splitting.
-	 */
-	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) && gup_fast_permitted(start, end)) {
-		unsigned long fast_flags = gup_flags;
-
-		local_irq_save(flags);
-		gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned);
-		local_irq_restore(flags);
-		ret = nr_pinned;
-	}
-
-	if (nr_pinned < nr_pages && !(gup_flags & FOLL_FAST_ONLY)) {
-		/* Try to get the remaining pages with get_user_pages */
-		start += nr_pinned << PAGE_SHIFT;
-		pages += nr_pinned;
+	nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages);
+	if (nr_pinned == nr_pages || gup_flags & FOLL_FAST_ONLY)
+		return nr_pinned;
 
-		ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned,
-					      gup_flags, pages);
-
-		/* Have to be a bit careful with return values */
-		if (nr_pinned > 0) {
-			if (ret < 0)
-				ret = nr_pinned;
-			else
-				ret += nr_pinned;
-		}
+	/* Slow path: try to get the remaining pages with get_user_pages */
+	start += nr_pinned << PAGE_SHIFT;
+	pages += nr_pinned;
+	ret = __gup_longterm_unlocked(start, nr_pages - nr_pinned, gup_flags,
+				      pages);
+	if (ret < 0) {
+		/*
+		 * The caller has to unpin the pages we already pinned so
+		 * returning -errno is not an option
+		 */
+		if (nr_pinned)
+			return nr_pinned;
+		return ret;
 	}
-
-	return ret;
+	return ret + nr_pinned;
 }
+
 /**
  * get_user_pages_fast_only() - pin user pages in memory
  * @start:      starting user address
-- 
2.29.2




* [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork
  2020-11-10 23:44 [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range() Jason Gunthorpe
  2020-11-10 23:44 ` [PATCH v4 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe
@ 2020-11-10 23:44 ` Jason Gunthorpe
  2020-11-11 19:59   ` Peter Xu
  2020-11-12  7:41   ` Ahmed S. Darwish
  2020-11-11 18:27 ` [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range() Linus Torvalds
  2 siblings, 2 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2020-11-10 23:44 UTC (permalink / raw)
  To: linux-kernel, Peter Xu, Linus Torvalds
  Cc: Ahmed S. Darwish, Andrea Arcangeli, Andrew Morton,
	Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jan Kara,
	Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov

Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
fork() for ptes") pages under a FOLL_PIN will not be write protected
during COW for fork. This means that pages returned from
pin_user_pages(FOLL_WRITE) should not become write protected while the pin
is active.

However, there is a small race where get_user_pages_fast(FOLL_PIN) can
establish a FOLL_PIN at the same time copy_present_page() is write
protecting it:

        CPU 0                             CPU 1
   get_user_pages_fast()
    internal_get_user_pages_fast()
                                       copy_page_range()
                                         pte_alloc_map_lock()
                                           copy_present_page()
                                             atomic_read(has_pinned) == 0
					     page_maybe_dma_pinned() == false
     atomic_set(has_pinned, 1);
     gup_pgd_range()
      gup_pte_range()
       pte_t pte = gup_get_pte(ptep)
       pte_access_permitted(pte)
       try_grab_compound_head()
                                             pte = pte_wrprotect(pte)
	                                     set_pte_at();
                                         pte_unmap_unlock()
      // GUP now returns with a write protected page

The first attempt to resolve this by using the write protect caused
problems (and was missing a barrier); see commit f3c64eda3e50 ("mm: avoid
early COW write protect games during fork()").

Instead wrap copy_p4d_range() with the write side of a seqcount and check
the read side around gup_pgd_range(). If there is a collision then
get_user_pages_fast() fails and falls back to slow GUP.

Slow GUP is safe against this race because copy_page_range() is only
called while holding the exclusive side of the mmap_lock on the src
mm_struct.
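
In outline, the scheme is (a condensed sketch; the real hunks follow
below):

	/* fork() side, serialized by the exclusive mmap_lock of src_mm */
	raw_write_seqcount_begin(&src_mm->write_protect_seq);
	copy_p4d_range(...);		/* may write protect PTEs */
	raw_write_seqcount_end(&src_mm->write_protect_seq);

	/* gup_fast side, FOLL_PIN only */
	seq = raw_read_seqcount(&current->mm->write_protect_seq);
	if (seq & 1)		/* odd: a fork is in progress, use slow GUP */
		return 0;
	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
	if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
		/* collided with fork(): undo and fall back to slow GUP */
		unpin_user_pages(pages, nr_pinned);
		return 0;
	}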

Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 arch/x86/kernel/tboot.c    |  1 +
 drivers/firmware/efi/efi.c |  1 +
 include/linux/mm_types.h   |  8 ++++++++
 kernel/fork.c              |  1 +
 mm/gup.c                   | 18 ++++++++++++++++++
 mm/init-mm.c               |  1 +
 mm/memory.c                | 13 ++++++++++++-
 7 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 992fb1415c0f1f..6a2f542d9588a4 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -93,6 +93,7 @@ static struct mm_struct tboot_mm = {
 	.pgd            = swapper_pg_dir,
 	.mm_users       = ATOMIC_INIT(2),
 	.mm_count       = ATOMIC_INIT(1),
+	.write_protect_seq = SEQCNT_ZERO(tboot_mm.write_protect_seq),
 	MMAP_LOCK_INITIALIZER(init_mm)
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 5e5480a0a32d7d..2520f6e05f4d44 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -57,6 +57,7 @@ struct mm_struct efi_mm = {
 	.mm_rb			= RB_ROOT,
 	.mm_users		= ATOMIC_INIT(2),
 	.mm_count		= ATOMIC_INIT(1),
+	.write_protect_seq      = SEQCNT_ZERO(efi_mm.write_protect_seq),
 	MMAP_LOCK_INITIALIZER(efi_mm)
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5a9238f6caad97..915f4f100383b5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -14,6 +14,7 @@
 #include <linux/uprobes.h>
 #include <linux/page-flags-layout.h>
 #include <linux/workqueue.h>
+#include <linux/seqlock.h>
 
 #include <asm/mmu.h>
 
@@ -446,6 +447,13 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;
 
+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
+
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d266388d3804c..dc55f68a6ee36d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1007,6 +1007,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->vmacache_seqnum = 0;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
+	seqcount_init(&mm->write_protect_seq);
 	mmap_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->core_state = NULL;
diff --git a/mm/gup.c b/mm/gup.c
index c7e24301860abb..9c6a2f5001c5c2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2684,11 +2684,18 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
 {
 	unsigned long flags;
 	int nr_pinned = 0;
+	unsigned seq;
 
 	if (!IS_ENABLED(CONFIG_HAVE_FAST_GUP) ||
 	    !gup_fast_permitted(start, end))
 		return 0;
 
+	if (gup_flags & FOLL_PIN) {
+		seq = raw_read_seqcount(&current->mm->write_protect_seq);
+		if (seq & 1)
+			return 0;
+	}
+
 	/*
 	 * Disable interrupts. The nested form is used, in order to allow full,
 	 * general purpose use of this routine.
@@ -2703,6 +2710,17 @@ static unsigned long lockless_pages_from_mm(unsigned long start,
 	local_irq_save(flags);
 	gup_pgd_range(start, end, gup_flags, pages, &nr_pinned);
 	local_irq_restore(flags);
+
+	/*
+	 * When pinning pages for DMA there could be a concurrent write protect
+	 * from fork() via copy_page_range(); in this case, always fail fast GUP.
+	 */
+	if (gup_flags & FOLL_PIN) {
+		if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
+			unpin_user_pages(pages, nr_pinned);
+			return 0;
+		}
+	}
 	return nr_pinned;
 }
 
diff --git a/mm/init-mm.c b/mm/init-mm.c
index 3a613c85f9ede2..153162669f8062 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -31,6 +31,7 @@ struct mm_struct init_mm = {
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
+	.write_protect_seq = SEQCNT_ZERO(init_mm.write_protect_seq),
 	MMAP_LOCK_INITIALIZER(init_mm)
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e50268..783aabfa66d29f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1171,6 +1171,15 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
 					0, src_vma, src_mm, addr, end);
 		mmu_notifier_invalidate_range_start(&range);
+		/*
+		 * Disabling preemption is not needed for the write side, as
+		 * the read side doesn't spin, but goes to the mmap_lock.
+		 *
+		 * Use the raw variant of the seqcount_t write API to avoid
+		 * lockdep complaining about preemptibility.
+		 */
+		mmap_assert_write_locked(src_mm);
+		raw_write_seqcount_begin(&src_mm->write_protect_seq);
 	}
 
 	ret = 0;
@@ -1187,8 +1196,10 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		}
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
-	if (is_cow)
+	if (is_cow) {
+		raw_write_seqcount_end(&src_mm->write_protect_seq);
 		mmu_notifier_invalidate_range_end(&range);
+	}
 	return ret;
 }
 
-- 
2.29.2




* Re: [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range()
  2020-11-10 23:44 [PATCH v4 0/2] Add a seqcount between gup_fast and copy_page_range() Jason Gunthorpe
  2020-11-10 23:44 ` [PATCH v4 1/2] mm: reorganize internal_get_user_pages_fast() Jason Gunthorpe
  2020-11-10 23:44 ` [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork Jason Gunthorpe
@ 2020-11-11 18:27 ` Linus Torvalds
  2 siblings, 0 replies; 7+ messages in thread
From: Linus Torvalds @ 2020-11-11 18:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Linux Kernel Mailing List, Peter Xu, Ahmed S. Darwish,
	Andrea Arcangeli, Andrew Morton, Aneesh Kumar K.V,
	Christoph Hellwig, Hugh Dickins, Jan Kara, Jann Horn,
	John Hubbard, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky,
	Linux-MM, Michal Hocko, Oleg Nesterov

On Tue, Nov 10, 2020 at 3:44 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> As discussed and suggested by Linus, use a seqcount to close the small
> race between gup_fast and copy_page_range().
>
> Ahmed confirms that raw_write_seqcount_begin() is the correct API to use
> in this case and that it does not trigger any lockdep warnings.
>
> I was able to test it using two threads, one forking and the other using
> ibv_reg_mr() to trigger fast GUP. Modifying copy_page_range() to sleep
> made the race window large enough to hit reliably and exercise the logic.

Looks all good to me.

             Linus



* Re: [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork
  2020-11-10 23:44 ` [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork Jason Gunthorpe
@ 2020-11-11 19:59   ` Peter Xu
  2020-11-12  7:41   ` Ahmed S. Darwish
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Xu @ 2020-11-11 19:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, Linus Torvalds, Ahmed S. Darwish, Andrea Arcangeli,
	Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins,
	Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov

On Tue, Nov 10, 2020 at 07:44:09PM -0400, Jason Gunthorpe wrote:
> Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
> fork() for ptes") pages under a FOLL_PIN will not be write protected
> during COW for fork. This means that pages returned from
> pin_user_pages(FOLL_WRITE) should not become write protected while the pin
> is active.
> 
> However, there is a small race where get_user_pages_fast(FOLL_PIN) can
> establish a FOLL_PIN at the same time copy_present_page() is write
> protecting it:
> 
>         CPU 0                             CPU 1
>    get_user_pages_fast()
>     internal_get_user_pages_fast()
>                                        copy_page_range()
>                                          pte_alloc_map_lock()
>                                            copy_present_page()
>                                              atomic_read(has_pinned) == 0
> 					     page_maybe_dma_pinned() == false
>      atomic_set(has_pinned, 1);
>      gup_pgd_range()
>       gup_pte_range()
>        pte_t pte = gup_get_pte(ptep)
>        pte_access_permitted(pte)
>        try_grab_compound_head()
>                                              pte = pte_wrprotect(pte)
> 	                                     set_pte_at();
>                                          pte_unmap_unlock()
>       // GUP now returns with a write protected page
> 
> The first attempt to resolve this by using the write protect caused
> problems (and was missing a barrier); see commit f3c64eda3e50 ("mm: avoid
> early COW write protect games during fork()").
> 
> Instead wrap copy_p4d_range() with the write side of a seqcount and check
> the read side around gup_pgd_range(). If there is a collision then
> get_user_pages_fast() fails and falls back to slow GUP.
> 
> Slow GUP is safe against this race because copy_page_range() is only
> called while holding the exclusive side of the mmap_lock on the src
> mm_struct.
> 
> Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu




* Re: [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork
  2020-11-10 23:44 ` [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork Jason Gunthorpe
  2020-11-11 19:59   ` Peter Xu
@ 2020-11-12  7:41   ` Ahmed S. Darwish
  2020-11-12 14:31     ` Jason Gunthorpe
  1 sibling, 1 reply; 7+ messages in thread
From: Ahmed S. Darwish @ 2020-11-12  7:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, Peter Xu, Linus Torvalds, Andrea Arcangeli,
	Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins,
	Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov

On Tue, Nov 10, 2020 at 07:44:09PM -0400, Jason Gunthorpe wrote:
...
>
> Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---

Thanks for the v3 and v4 updates.

For the seqcount_t parts:

  Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de>



* Re: [PATCH v4 2/2] mm: prevent gup_fast from racing with COW during fork
  2020-11-12  7:41   ` Ahmed S. Darwish
@ 2020-11-12 14:31     ` Jason Gunthorpe
  0 siblings, 0 replies; 7+ messages in thread
From: Jason Gunthorpe @ 2020-11-12 14:31 UTC (permalink / raw)
  To: Ahmed S. Darwish
  Cc: linux-kernel, Peter Xu, Linus Torvalds, Andrea Arcangeli,
	Andrew Morton, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins,
	Jan Kara, Jann Horn, John Hubbard, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Linux-MM, Michal Hocko, Oleg Nesterov

On Thu, Nov 12, 2020 at 08:41:58AM +0100, Ahmed S. Darwish wrote:
> On Tue, Nov 10, 2020 at 07:44:09PM -0400, Jason Gunthorpe wrote:
> ...
> >
> > Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
> > Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Thanks for the v3 and v4 updates.
> 
> For the seqcount_t parts:
> 
>   Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de>

Thank you for your help!

Jason


