linux-mm.kvack.org archive mirror
* [RFC PATCH] mm: Fetch the dirty bit before we reset the pte
@ 2020-10-08  9:26 Aneesh Kumar K.V
  2020-10-08 17:02 ` Linus Torvalds
  0 siblings, 1 reply; 5+ messages in thread
From: Aneesh Kumar K.V @ 2020-10-08  9:26 UTC (permalink / raw)
  To: linuxppc-dev, mpe
  Cc: npiggin, Aneesh Kumar K.V, Peter Xu, Jason Gunthorpe,
	John Hubbard, linux-mm, linux-kernel, Andrew Morton, Jan Kara,
	Michal Hocko, Kirill Shutemov, Hugh Dickins, Linus Torvalds

In copy_present_page, after we mark the pte non-writable, we should
check for previous dirty bit updates and make sure we don't lose the dirty
bit on reset.

Also, avoid marking the pte write-protected again if copy_present_page
already marked it write-protected.

Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/memory.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index bfe202ef6244..f57b1f04d50a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -848,6 +848,9 @@ copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (likely(!page_maybe_dma_pinned(page)))
 		return 1;
 
+	if (pte_dirty(*src_pte))
+		pte = pte_mkdirty(pte);
+
 	/*
 	 * Uhhuh. It looks like the page might be a pinned page,
 	 * and we actually need to copy it. Now we can set the
@@ -904,6 +907,11 @@ copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (retval <= 0)
 			return retval;
 
+		/*
+		 * Fetch the src pte value again, copy_present_page
+		 * could modify it.
+		 */
+		pte = *src_pte;
 		get_page(page);
 		page_dup_rmap(page, false);
 		rss[mm_counter(page)]++;
-- 
2.26.2



^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] mm: Fetch the dirty bit before we reset the pte
  2020-10-08  9:26 [RFC PATCH] mm: Fetch the dirty bit before we reset the pte Aneesh Kumar K.V
@ 2020-10-08 17:02 ` Linus Torvalds
  2020-10-08 17:06   ` Linus Torvalds
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Linus Torvalds @ 2020-10-08 17:02 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linuxppc-dev, Michael Ellerman, Nick Piggin, Peter Xu,
	Jason Gunthorpe, John Hubbard, Linux-MM,
	Linux Kernel Mailing List, Andrew Morton, Jan Kara, Michal Hocko,
	Kirill Shutemov, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 746 bytes --]

On Thu, Oct 8, 2020 at 2:27 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> In copy_present_page, after we mark the pte non-writable, we should
> check for previous dirty bit updates and make sure we don't lose the dirty
> bit on reset.

No, we'll just remove that entirely.

Do you have a test-case that shows a problem? I have a patch that I
was going to delay until 5.10 because I didn't think it mattered in
practice..

The second part of this patch would be to add a sequence count
protection to fast-GUP pinning, so that GUP and fork() couldn't race,
but I haven't written that part.

Here's the first patch anyway. If you actually have a test-case where
this matters, I guess I need to apply it now..

                   Linus

[-- Attachment #2: fork-cleanup --]
[-- Type: application/octet-stream, Size: 2530 bytes --]

 mm/memory.c | 46 ++++++++++------------------------------------
 1 file changed, 10 insertions(+), 36 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fcfc4ca36eba..4a7e89d35ecf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -806,8 +806,6 @@ copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		return 1;
 
 	/*
-	 * The trick starts.
-	 *
 	 * What we want to do is to check whether this page may
 	 * have been pinned by the parent process.  If so,
 	 * instead of wrprotect the pte on both sides, we copy
@@ -815,46 +813,22 @@ copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * the pinned page won't be randomly replaced in the
 	 * future.
 	 *
-	 * To achieve this, we do the following:
-	 *
-	 * 1. Write-protect the pte if it's writable.  This is
-	 *    to protect concurrent write fast-gup with
-	 *    FOLL_PIN, so that we'll fail the fast-gup with
-	 *    the write bit removed.
-	 *
-	 * 2. Check page_maybe_dma_pinned() to see whether this
-	 *    page may have been pinned.
+	 * The page pinning checks are just "has this mm ever
+	 * seen pinning", along with the (inexact) check of
+	 * the page count. That might give false positives
+	 * for pinning, but it will work correctly.
 	 *
-	 * The order of these steps is important to serialize
-	 * against the fast-gup code (gup_pte_range()) on the
-	 * pte check and try_grab_compound_head(), so that
-	 * we'll make sure either we'll capture that fast-gup
-	 * so we'll copy the pinned page here, or we'll fail
-	 * that fast-gup.
-	 *
-	 * NOTE! Even if we don't end up copying the page,
-	 * we won't undo this wrprotect(), because the normal
-	 * reference copy will need it anyway.
-	 */
-	if (pte_write(pte))
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-
-	/*
-	 * These are the "normally we can just copy by reference"
-	 * checks.
+	 * Another heuristic is to just check the mapcount for
+	 * this page. If it is mapped elsewhere, it already is
+	 * not an exclusively pinned page, and doing another
+	 * "copy by reference" isn't going to matter.
 	 */
 	if (likely(!atomic_read(&src_mm->has_pinned)))
 		return 1;
 	if (likely(!page_maybe_dma_pinned(page)))
 		return 1;
-
-	/*
-	 * Uhhuh. It looks like the page might be a pinned page,
-	 * and we actually need to copy it. Now we can set the
-	 * source pte back to being writable.
-	 */
-	if (pte_write(pte))
-		set_pte_at(src_mm, addr, src_pte, pte);
+	if (__page_mapcount(page) > 1)
+		return 1;
 
 	new_page = *prealloc;
 	if (!new_page)


* Re: [RFC PATCH] mm: Fetch the dirty bit before we reset the pte
  2020-10-08 17:02 ` Linus Torvalds
@ 2020-10-08 17:06   ` Linus Torvalds
  2020-10-08 17:30   ` Linus Torvalds
  2020-10-08 17:32   ` Aneesh Kumar K.V
  2 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2020-10-08 17:06 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Leon Romanovsky
  Cc: linuxppc-dev, Michael Ellerman, Nick Piggin, Peter Xu,
	Jason Gunthorpe, John Hubbard, Linux-MM,
	Linux Kernel Mailing List, Andrew Morton, Jan Kara, Michal Hocko,
	Kirill Shutemov, Hugh Dickins

[ Just adding Leon to the participants ]

This patch (not attached again, Leon has seen it before) has been
tested for the last couple of weeks for the rdma case, so I have no
problems applying it now, just to keep everybody in the loop.

             Linus

On Thu, Oct 8, 2020 at 10:02 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Oct 8, 2020 at 2:27 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
> >
> > In copy_present_page, after we mark the pte non-writable, we should
> > check for previous dirty bit updates and make sure we don't lose the dirty
> > bit on reset.
>
> No, we'll just remove that entirely.
>
> Do you have a test-case that shows a problem? I have a patch that I
> was going to delay until 5.10 because I didn't think it mattered in
> practice..
>
> The second part of this patch would be to add a sequence count
> protection to fast-GUP pinning, so that GUP and fork() couldn't race,
> but I haven't written that part.
>
> Here's the first patch anyway. If you actually have a test-case where
> this matters, I guess I need to apply it now..
>
>                    Linus



* Re: [RFC PATCH] mm: Fetch the dirty bit before we reset the pte
  2020-10-08 17:02 ` Linus Torvalds
  2020-10-08 17:06   ` Linus Torvalds
@ 2020-10-08 17:30   ` Linus Torvalds
  2020-10-08 17:32   ` Aneesh Kumar K.V
  2 siblings, 0 replies; 5+ messages in thread
From: Linus Torvalds @ 2020-10-08 17:30 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Leon Romanovsky
  Cc: linuxppc-dev, Michael Ellerman, Nick Piggin, Peter Xu,
	Jason Gunthorpe, John Hubbard, Linux-MM,
	Linux Kernel Mailing List, Andrew Morton, Jan Kara, Michal Hocko,
	Kirill Shutemov, Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 632 bytes --]

On Thu, Oct 8, 2020 at 10:02 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Here's the first patch anyway. If you actually have a test-case where
> this matters, I guess I need to apply it now..

Actually, I removed the "__page_mapcount()" part of that patch, to
keep it minimal and _only_ do remove the wrprotect trick.

We can do the __page_mapcount() optimization and the mm sequence count
for 5.10 (although so far nobody has actually written the seqcount
patch - I think it would be a trivial few-liner, but I guess it won't
make 5.10 at this point).

So here's what I ended up with.

                      Linus

[-- Attachment #2: 0001-mm-avoid-early-COW-write-protect-games-during-fork.patch --]
[-- Type: text/x-patch, Size: 5138 bytes --]

From f3c64eda3e5097ec3198cb271f5f504d65d67131 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Mon, 28 Sep 2020 12:50:03 -0700
Subject: [PATCH] mm: avoid early COW write protect games during fork()

In commit 70e806e4e645 ("mm: Do early cow for pinned pages during fork()
for ptes") we write-protected the PTE before doing the page pinning
check, in order to avoid a race with concurrent fast-GUP pinning (which
doesn't take the mm semaphore or the page table lock).

That trick doesn't actually work - it doesn't handle memory ordering
properly, and doing so would be prohibitively expensive.

It also isn't really needed.  While we're moving in the direction of
allowing and supporting page pinning without marking the pinned area
with MADV_DONTFORK, the fact is that we've never really supported this
kind of odd "concurrent fork() and page pinning", and doing the
serialization on a pte level is just wrong.

We can add serialization with a per-mm sequence counter, so we know how
to solve that race properly, but we'll do that at a more appropriate
time.  Right now this just removes the write protect games.

It also turns out that the write protect games actually break on Power,
as reported by Aneesh Kumar:

 "Architecture like ppc64 expects set_pte_at to be not used for updating
  a valid pte. This is further explained in commit 56eecdb912b5 ("mm:
  Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"

and the code triggered a warning there:

  WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
  Call Trace:
    copy_present_page mm/memory.c:857 [inline]
    copy_present_pte mm/memory.c:899 [inline]
    copy_pte_range mm/memory.c:1014 [inline]
    copy_pmd_range mm/memory.c:1092 [inline]
    copy_pud_range mm/memory.c:1127 [inline]
    copy_p4d_range mm/memory.c:1150 [inline]
    copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
    dup_mmap kernel/fork.c:592 [inline]
    dup_mm+0x77c/0xab0 kernel/fork.c:1355
    copy_mm kernel/fork.c:1411 [inline]
    copy_process+0x1f00/0x2740 kernel/fork.c:2070
    _do_fork+0xc4/0x10b0 kernel/fork.c:2429

Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Tested-by: Leon Romanovsky <leonro@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 mm/memory.c | 41 ++++-------------------------------------
 1 file changed, 4 insertions(+), 37 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fcfc4ca36eba..eeae590e526a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -806,8 +806,6 @@ copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		return 1;
 
 	/*
-	 * The trick starts.
-	 *
 	 * What we want to do is to check whether this page may
 	 * have been pinned by the parent process.  If so,
 	 * instead of wrprotect the pte on both sides, we copy
@@ -815,47 +813,16 @@ copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * the pinned page won't be randomly replaced in the
 	 * future.
 	 *
-	 * To achieve this, we do the following:
-	 *
-	 * 1. Write-protect the pte if it's writable.  This is
-	 *    to protect concurrent write fast-gup with
-	 *    FOLL_PIN, so that we'll fail the fast-gup with
-	 *    the write bit removed.
-	 *
-	 * 2. Check page_maybe_dma_pinned() to see whether this
-	 *    page may have been pinned.
-	 *
-	 * The order of these steps is important to serialize
-	 * against the fast-gup code (gup_pte_range()) on the
-	 * pte check and try_grab_compound_head(), so that
-	 * we'll make sure either we'll capture that fast-gup
-	 * so we'll copy the pinned page here, or we'll fail
-	 * that fast-gup.
-	 *
-	 * NOTE! Even if we don't end up copying the page,
-	 * we won't undo this wrprotect(), because the normal
-	 * reference copy will need it anyway.
-	 */
-	if (pte_write(pte))
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-
-	/*
-	 * These are the "normally we can just copy by reference"
-	 * checks.
+	 * The page pinning checks are just "has this mm ever
+	 * seen pinning", along with the (inexact) check of
+	 * the page count. That might give false positives
+	 * for pinning, but it will work correctly.
 	 */
 	if (likely(!atomic_read(&src_mm->has_pinned)))
 		return 1;
 	if (likely(!page_maybe_dma_pinned(page)))
 		return 1;
 
-	/*
-	 * Uhhuh. It looks like the page might be a pinned page,
-	 * and we actually need to copy it. Now we can set the
-	 * source pte back to being writable.
-	 */
-	if (pte_write(pte))
-		set_pte_at(src_mm, addr, src_pte, pte);
-
 	new_page = *prealloc;
 	if (!new_page)
 		return -EAGAIN;
-- 
2.28.0.218.gc12ef3d349



* Re: [RFC PATCH] mm: Fetch the dirty bit before we reset the pte
  2020-10-08 17:02 ` Linus Torvalds
  2020-10-08 17:06   ` Linus Torvalds
  2020-10-08 17:30   ` Linus Torvalds
@ 2020-10-08 17:32   ` Aneesh Kumar K.V
  2 siblings, 0 replies; 5+ messages in thread
From: Aneesh Kumar K.V @ 2020-10-08 17:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linuxppc-dev, Michael Ellerman, Nick Piggin, Peter Xu,
	Jason Gunthorpe, John Hubbard, Linux-MM,
	Linux Kernel Mailing List, Andrew Morton, Jan Kara, Michal Hocko,
	Kirill Shutemov, Hugh Dickins

On 10/8/20 10:32 PM, Linus Torvalds wrote:
> On Thu, Oct 8, 2020 at 2:27 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> In copy_present_page, after we mark the pte non-writable, we should
>> check for previous dirty bit updates and make sure we don't lose the dirty
>> bit on reset.
> 
> No, we'll just remove that entirely.
> 
> Do you have a test-case that shows a problem? I have a patch that I
> was going to delay until 5.10 because I didn't think it mattered in
> practice..
> 

Unfortunately, I don't have a test case. This was observed by code
inspection while I was fixing a syzkaller report.

> The second part of this patch would be to add a sequence count
> protection to fast-GUP pinning, so that GUP and fork() couldn't race,
> but I haven't written that part.
> 
> Here's the first patch anyway. If you actually have a test-case where
> this matters, I guess I need to apply it now..
> 
>                     Linus
> 


-aneesh



end of thread, other threads:[~2020-10-08 17:33 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-08  9:26 [RFC PATCH] mm: Fetch the dirty bit before we reset the pte Aneesh Kumar K.V
2020-10-08 17:02 ` Linus Torvalds
2020-10-08 17:06   ` Linus Torvalds
2020-10-08 17:30   ` Linus Torvalds
2020-10-08 17:32   ` Aneesh Kumar K.V
