linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] mm/gup: Allow real explicit breaking of COW
@ 2020-08-10 14:57 Peter Xu
  2020-08-10 16:47 ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2020-08-10 14:57 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Andrew Morton, peterx, Marty Mcfadden, Andrea Arcangeli,
	Linus Torvalds, Jann Horn, Christoph Hellwig, Oleg Nesterov,
	Kirill Shutemov, Jan Kara

Starting from commit 17839856fd58 ("gup: document and work around "COW can
break either way" issue", 2020-06-02), explicit copy-on-write behavior is
enforced for private gup pages even if it's a read-only.  It is achieved by
always passing FOLL_WRITE to emulate a write.

That should fix the COW issue that we were facing, however above commit could
also break userfaultfd-wp and applications like umapsort [1,2].

One general routine of umap-like program is: userspace library will manage page
allocations, and it will evict the least recently used pages from memory to
external storages (e.g., file systems).  Below are the general steps to evict
an in-memory page in the uffd service thread when the page pool is full:

  (1) UFFDIO_WRITEPROTECT with mode=WP on some to-be-evicted page P, so that
      further writes to page P will block (keep page P clean)
  (2) Copy page P to external storage (e.g. file system)
  (3) MADV_DONTNEED to evict page P

Here step (1) makes sure that the page to dump will always be up-to-date, so
that the page snapshot in the file system is consistent with the one that was
in the memory.  However with commit 17839856fd58, step (2) can potentially hang
itself because e.g. if we use write() to a file system fd to dump the page
data, that will be a translated read gup request in the file system driver to
read the page content, then the read gup will be translated to a write gup due
to the new enforced COW behavior.  This write gup will further trigger
handle_userfault() and hang the uffd service thread itself.

I think the problem will go away too if we replace the write() to the file
system into a memory write to a mmaped region in the userspace library, because
normal page faults will not enforce COW, only gup is affected.  However we
cannot forbid users to use write() or any form of kernel level read gup.

One solution is actually already mentioned in commit 17839856fd58, which is to
provide an explicit BREAK_COW scemantics for enforced COW.  Then we can still
use FAULT_FLAG_WRITE to identify whether this is a "real write request" or an
"enfornced COW (read) request".

[1] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
[2] https://github.com/LLNL/umap

CC: Marty Mcfadden <mcfadden8@llnl.gov>
CC: Andrea Arcangeli <aarcange@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Jann Horn <jannh@google.com>
CC: Christoph Hellwig <hch@lst.de>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Kirill Shutemov <kirill@shutemov.name>
CC: Jan Kara <jack@suse.cz>
Fixes: 17839856fd588f4ab6b789f482ed3ffd7c403e1f
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v2:
- apply FAULT_FLAG_BREAK_COW correctly when FOLL_BREAK_COW [Christoph]
- removed comments above do_wp_page which seems redundant
---
 include/linux/mm.h | 3 +++
 mm/gup.c           | 6 ++++--
 mm/memory.c        | 7 ++++---
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6a82f9bccd7..dacba5c7942f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -409,6 +409,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read)
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -439,6 +440,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE			0x80
 #define FAULT_FLAG_INSTRUCTION  		0x100
 #define FAULT_FLAG_INTERRUPTIBLE		0x200
+#define FAULT_FLAG_BREAK_COW			0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -2756,6 +2758,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_BREAK_COW  0x100000 /* request for explicit COW (even for read) */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index d8a33dd1430d..c33e84ab9c36 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -870,6 +870,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		return -ENOENT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
+	if (*flags & FOLL_BREAK_COW)
+		fault_flags |= FAULT_FLAG_BREAK_COW;
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
 	if (locked)
@@ -1076,7 +1078,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			}
 			if (is_vm_hugetlb_page(vma)) {
 				if (should_force_cow_break(vma, foll_flags))
-					foll_flags |= FOLL_WRITE;
+					foll_flags |= FOLL_BREAK_COW;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
 						foll_flags, locked);
@@ -1095,7 +1097,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 
 		if (should_force_cow_break(vma, foll_flags))
-			foll_flags |= FOLL_WRITE;
+			foll_flags |= FOLL_BREAK_COW;
 
 retry:
 		/*
diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..7659b0e27a98 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,7 +2900,8 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+	    userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3290,7 +3291,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		put_page(swapcache);
 	}
 
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
@@ -4241,7 +4242,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
 		goto unlock;
 	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
 		entry = pte_mkdirty(entry);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm/gup: Allow real explicit breaking of COW
  2020-08-10 14:57 [PATCH v2] mm/gup: Allow real explicit breaking of COW Peter Xu
@ 2020-08-10 16:47 ` Linus Torvalds
  2020-08-10 19:15   ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2020-08-10 16:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Marty Mcfadden, Andrea Arcangeli, Jann Horn, Christoph Hellwig,
	Oleg Nesterov, Kirill Shutemov, Jan Kara

On Mon, Aug 10, 2020 at 7:57 AM Peter Xu <peterx@redhat.com> wrote:
>
> One solution is actually already mentioned in commit 17839856fd58, which is to
> provide an explicit BREAK_COW scemantics for enforced COW.  Then we can still
> use FAULT_FLAG_WRITE to identify whether this is a "real write request" or an
> "enfornced COW (read) request".

I think the patch is fine, but during discussions we also discussed
having the flag the other way around, in order to have the default be
"break COW", and the use cases that explicitly know they can handle
the ambiguity would have to say so explicitly with a "don't break COW"
bit.

I don't think this matters in theory, but in practice I think it would
be a good thing as documentation.

Because FAULT_FLAG_BREAK_COW doesn't really tell you anything:
breaking COW is "always safe".

In contrast, a "FAULT_FLAG_DONT_COW" bit could be accompanied with a
note like "I don't care which side I get - I'm not going to keep track
of the page, I just want random data, and it's ok if I get it from
another forked process".

In fact, maybe it shouldn't be called "DONT_COW", but more along the
lines of those semantics of "READ_WRONG_SIDE_COW_OK", so that people
who use the bit have to _think_ about it.

I think comments in general should be there.

Looking at your patch, for example, I go "Hmm" when I see this part:

-       if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+       if ((vmf->flags & FAULT_FLAG_WRITE) &&
+           userfaultfd_pte_wp(vma, *vmf->pte)) {
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                return handle_userfault(vmf, VM_UFFD_WP);
        }

and I go "ok, for reads with COW breaking, we'll just silently break
the COW and not do VM_UFFD_WP?"

An explanation of why that is the right thing to do would be good. And
no, I don't mean "UFFD doesn't want a WP fault in this case". I mean
literally why "we do want the silent COW, but UFFD doesn't care about
it".

End result: I think the patch is fine, but the reason we had
discussion about it and the reason it wasn't done initially was that
you get all these kinds of subtle behavior differences, and I think
they need clarifying.

               Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm/gup: Allow real explicit breaking of COW
  2020-08-10 16:47 ` Linus Torvalds
@ 2020-08-10 19:15   ` Peter Xu
  2020-08-10 20:51     ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2020-08-10 19:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Marty Mcfadden, Andrea Arcangeli, Jann Horn, Christoph Hellwig,
	Oleg Nesterov, Kirill Shutemov, Jan Kara

On Mon, Aug 10, 2020 at 09:47:22AM -0700, Linus Torvalds wrote:
> On Mon, Aug 10, 2020 at 7:57 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > One solution is actually already mentioned in commit 17839856fd58, which is to
> > provide an explicit BREAK_COW scemantics for enforced COW.  Then we can still
> > use FAULT_FLAG_WRITE to identify whether this is a "real write request" or an
> > "enfornced COW (read) request".
> 
> I think the patch is fine, but during discussions we also discussed
> having the flag the other way around, in order to have the default be
> "break COW", and the use cases that explicitly know they can handle
> the ambiguity would have to say so explicitly with a "don't break COW"
> bit.
> 
> I don't think this matters in theory, but in practice I think it would
> be a good thing as documentation.
> 
> Because FAULT_FLAG_BREAK_COW doesn't really tell you anything:
> breaking COW is "always safe".
> 
> In contrast, a "FAULT_FLAG_DONT_COW" bit could be accompanied with a
> note like "I don't care which side I get - I'm not going to keep track
> of the page, I just want random data, and it's ok if I get it from
> another forked process".

My previous understanding was that although COW is always safe, we should still
avoid it when unnecessary because it's still expensive.  Currently we will do
enforced COW only if should_force_cow_break() returns true, which seems to be a
good justification of when to ask for the enforcement.

> 
> In fact, maybe it shouldn't be called "DONT_COW", but more along the
> lines of those semantics of "READ_WRONG_SIDE_COW_OK", so that people
> who use the bit have to _think_ about it.
> 
> I think comments in general should be there.

How about squashing below comments into the patch?

diff --git a/include/linux/mm.h b/include/linux/mm.h
index eab4b833680e..09a2dde93b4b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -409,7 +409,8 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
- * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read)
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read).
+ *                        Please read FOLL_BREAK_COW for more information.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -2807,7 +2808,11 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
 #define FOLL_PIN       0x40000 /* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
-#define FOLL_BREAK_COW  0x100000 /* request for explicit COW (even for read) */
+/*
+ * request for explicit COW (even for read).  Normally COW is always safe for
+ * gup, but we don't always pass it to avoid unnecessary COW.
+ */
+#define FOLL_BREAK_COW  0x100000

> 
> Looking at your patch, for example, I go "Hmm" when I see this part:
> 
> -       if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> +       if ((vmf->flags & FAULT_FLAG_WRITE) &&
> +           userfaultfd_pte_wp(vma, *vmf->pte)) {
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>                 return handle_userfault(vmf, VM_UFFD_WP);
>         }
> 
> and I go "ok, for reads with COW breaking, we'll just silently break
> the COW and not do VM_UFFD_WP?"
> 
> An explanation of why that is the right thing to do would be good. And
> no, I don't mean "UFFD doesn't want a WP fault in this case". I mean
> literally why "we do want the silent COW, but UFFD doesn't care about
> it".

Userfaultfd-wp should not care about this because it's not a write operation,
so we don't need to capture it.  We should only trigger uffd-wp when the COWed
page got written later again either by a real write gup or a write #pf.

This is a good question anyway...  Because I totally forgot to check the other
part of the COW logic to keep the UFFD_WP bit properly during COW.  Without
that extra change in wp_page_copy() we could potentially lose uffd-wp bit when
there's enforced COW on read-only gups.

Moreover, I definitely missed some special handling for huge pages too here and
there.  As a summary of above...

> 
> End result: I think the patch is fine, but the reason we had
> discussion about it and the reason it wasn't done initially was that
> you get all these kinds of subtle behavior differences, and I think
> they need clarifying.

... I think I'll squash these into v2:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index eab4b833680e..09a2dde93b4b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -409,7 +409,8 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
- * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read)
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read).
+ *                        Please read FOLL_BREAK_COW for more information.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -2807,7 +2808,11 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
 #define FOLL_PIN       0x40000 /* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
-#define FOLL_BREAK_COW  0x100000 /* request for explicit COW (even for read) */
+/*
+ * request for explicit COW (even for read).  Normally COW is always safe for
+ * gup, but we don't always pass it to avoid unnecessary COW.
+ */
+#define FOLL_BREAK_COW  0x100000

 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 78c84bee7e29..9d4d308811a8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1296,7 +1296,17 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
        if (reuse_swap_page(page, NULL)) {
                pmd_t entry;
                entry = pmd_mkyoung(orig_pmd);
-               entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+               entry = pmd_mkdirty(entry);
+               if (pmd_uffd_wp(orig_pmd))
+                       /*
+                        * This can happen when an uffd-wp protected page is
+                        * copied due to enfornced COW.  When it happens, we
+                        * need to keep the uffd-wp bit even after COW, and
+                        * make sure write bit is kept cleared.
+                        */
+                       entry = pmd_mkuffd_wp(pmd_wrprotect(entry));
+               else
+                       entry = maybe_pmd_mkwrite(entry, vma);
                if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
                        update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
                unlock_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index f11c87bfc60e..2ed1b68b2323 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2703,7 +2703,17 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
                flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
                entry = mk_pte(new_page, vma->vm_page_prot);
                entry = pte_sw_mkyoung(entry);
-               entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+               entry = pte_mkdirty(entry);
+               if (pte_uffd_wp(vmf->orig_pte))
+                       /*
+                        * This can happen when an uffd-wp protected page is
+                        * copied due to enfornced COW.  When it happens, we
+                        * need to keep the uffd-wp bit even after COW, and
+                        * make sure write bit is kept cleared.
+                        */
+                       entry = pte_mkuffd_wp(pte_wrprotect(entry));
+               else
+                       entry = maybe_mkwrite(entry, vma);
                /*
                 * Clear the pte entry and flush it first, before updating the
                 * pte with the new entry. This will avoid a race condition
@@ -4115,7 +4125,8 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
        if (vma_is_anonymous(vmf->vma)) {
-               if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
+               if ((vmf->flags & FAULT_FLAG_WRITE) &&
+                   userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
                        return handle_userfault(vmf, VM_UFFD_WP);
                return do_huge_pmd_wp_page(vmf, orig_pmd);
        }
@@ -4279,7 +4290,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
                .pgoff = linear_page_index(vma, address),
                .gfp_mask = __get_fault_gfp_mask(vma),
        };
-       unsigned int dirty = flags & FAULT_FLAG_WRITE;
+       bool cow = flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW);
        struct mm_struct *mm = vma->vm_mm;
        pgd_t *pgd;
        p4d_t *p4d;
@@ -4306,7 +4317,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,

                        /* NUMA case for anonymous PUDs would go here */

-                       if (dirty && !pud_write(orig_pud)) {
+                       if (cow && !pud_write(orig_pud)) {
                                ret = wp_huge_pud(&vmf, orig_pud);
                                if (!(ret & VM_FAULT_FALLBACK))
                                        return ret;
@@ -4344,7 +4355,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
                        if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
                                return do_huge_pmd_numa_page(&vmf, orig_pmd);

-                       if (dirty && !pmd_write(orig_pmd)) {
+                       if (cow && !pmd_write(orig_pmd)) {
                                ret = wp_huge_pmd(&vmf, orig_pmd);
                                if (!(ret & VM_FAULT_FALLBACK))
                                        return ret;

Thanks,

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm/gup: Allow real explicit breaking of COW
  2020-08-10 19:15   ` Peter Xu
@ 2020-08-10 20:51     ` Linus Torvalds
  2020-08-10 20:59       ` Linus Torvalds
  2020-08-10 21:57       ` Peter Xu
  0 siblings, 2 replies; 10+ messages in thread
From: Linus Torvalds @ 2020-08-10 20:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Marty Mcfadden, Andrea Arcangeli, Jann Horn, Christoph Hellwig,
	Oleg Nesterov, Kirill Shutemov, Jan Kara

On Mon, Aug 10, 2020 at 12:15 PM Peter Xu <peterx@redhat.com> wrote:
>
> My previous understanding was that although COW is always safe, we should still
> avoid it when unnecessary because it's still expensive.  Currently we will do
> enforced COW only if should_force_cow_break() returns true, which seems to be a
> good justification of when to ask for the enforcement.

It's not the _enforcement_ I worry about.

It's the _users_.

So COW is always the safe thing to do, but as you say, it can be
expensive (although we actually seemed to get a robot that claimed it
sped up some test benchmark, it wasn't clear why).

But because not breaking COW can be very subtly unsafe - considering
that we had this behavior for over two decades - I think the users
that decide to not break COW need to explain why it is safe for them.

See what I'm saying?

That's why an explicit  FOLL_READ_WRONG_SIDE_COW_OK flag would be
good, because it would force people to *think* about what they are
doing, and explain why it's ok to do that unsafe thing, and get a page
that may actually end up not being your page at all!


> + * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read).
> + *                        Please read FOLL_BREAK_COW for more information.

This comment is useless - because it's aimed at the wrong people.

It's aimed at the people who already know they want to break COW. They
understand, and they are doing the safe thing.

The case I worry about is the people who do NOT say "break COW". Not
because they are smart and know what they are doing, but because they
don't think about the issue.

> Userfaultfd-wp should not care about this because it's not a write operation,

Ok, I will *definitely* not be applying tyhis patch, beause you don't
even understand what the problem is.

The fact is, A READ operation that doesn't break COW can GET THE WRONG ANSWER.

Why?

If you have two (threaded) processes A and B, who have that shared COW
page, what can happen is:

 - A does a READ operation using get_follow_page, and gets the shared page

 - another thread in A ends up doing an unmap operation

 - B does write to the page, and causes a COW, but now doesn't see the
A mapping at all any more, so takes over the page and writes to it

 - the original get_follow_page() thread in A is now holding a page
that contains data that wasn't written by A

See? That's the whole point. It doesn't _matter_ if you're only
reading the data, without the COW you may be reading the _wrong_ data.

So no. I will not be applying the patch, because you apparently didn't
understand why reading needs to do a COW.

                     Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm/gup: Allow real explicit breaking of COW
  2020-08-10 20:51     ` Linus Torvalds
@ 2020-08-10 20:59       ` Linus Torvalds
  2020-08-10 21:57       ` Peter Xu
  1 sibling, 0 replies; 10+ messages in thread
From: Linus Torvalds @ 2020-08-10 20:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Marty Mcfadden, Andrea Arcangeli, Jann Horn, Christoph Hellwig,
	Oleg Nesterov, Kirill Shutemov, Jan Kara

On Mon, Aug 10, 2020 at 1:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> See? That's the whole point. It doesn't _matter_ if you're only
> reading the data, without the COW you may be reading the _wrong_ data.

Example safe users:

 - if you hold the mmap_sem the whole time, you're fine

 - if you only look up the page mapping for some tracing reasons, you're fine

 - if you can explain some other reason why you don't care that some
other process might be changing the data from under you, you might be
fine.

But the point is, you need to _explain_ it. Not just say "I don't want COW".

                  Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm/gup: Allow real explicit breaking of COW
  2020-08-10 20:51     ` Linus Torvalds
  2020-08-10 20:59       ` Linus Torvalds
@ 2020-08-10 21:57       ` Peter Xu
  2020-08-10 23:19         ` Linus Torvalds
  1 sibling, 1 reply; 10+ messages in thread
From: Peter Xu @ 2020-08-10 21:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Marty Mcfadden, Andrea Arcangeli, Jann Horn, Christoph Hellwig,
	Oleg Nesterov, Kirill Shutemov, Jan Kara

On Mon, Aug 10, 2020 at 01:51:49PM -0700, Linus Torvalds wrote:
> On Mon, Aug 10, 2020 at 12:15 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > My previous understanding was that although COW is always safe, we should still
> > avoid it when unnecessary because it's still expensive.  Currently we will do
> > enforced COW only if should_force_cow_break() returns true, which seems to be a
> > good justification of when to ask for the enforcement.
> 
> It's not the _enforcement_ I worry about.
> 
> It's the _users_.
> 
> So COW is always the safe thing to do, but as you say, it can be
> expensive (although we actually seemed to get a robot that claimed it
> sped up some test benchmark, it wasn't clear why).
> 
> But because not breaking COW can be very subtly unsafe - considering
> that we had this behavior for over two decades - I think the users
> that decide to not break COW need to explain why it is safe for them.
> 
> See what I'm saying?
> 
> That's why an explicit  FOLL_READ_WRONG_SIDE_COW_OK flag would be
> good, because it would force people to *think* about what they are
> doing, and explain why it's ok to do that unsafe thing, and get a page
> that may actually end up not being your page at all!

Disregarding the name that we'd prefer, I thought the new flag should only be
used internally in GUP just like FOLL_COW, or am I wrong?

Now FOLL_BREAK_COW (or whatever we'd like to call it) is applied as long as for
FOLL_PIN and FOLL_GET irrelevant to who calls __get_user_pages().  Shouldn't
that be enough that the new flag will be applied correctly always?...

I'd be fine if you want me to rename the flag into FOLL_READ_WRONG_SIDE_COW_OK.
It's just a bit awkward to read and check, because otherwise for the COW path
we'll need to use "(FAULT_FLAG_WRITE || !FAULT_FLAG_READ_WRONG_SIDE_COW_OK)".

> 
> 
> > + * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read).
> > + *                        Please read FOLL_BREAK_COW for more information.
> 
> This comment is useless - because it's aimed at the wrong people.
> 
> It's aimed at the people who already know they want to break COW. They
> understand, and they are doing the safe thing.
> 
> The case I worry about is the people who do NOT say "break COW". Not
> because they are smart and know what they are doing, but because they
> don't think about the issue.
> 
> > Userfaultfd-wp should not care about this because it's not a write operation,
> 
> Ok, I will *definitely* not be applying tyhis patch, beause you don't
> even understand what the problem is.
> 
> The fact is, A READ operation that doesn't break COW can GET THE WRONG ANSWER.
> 
> Why?
> 
> If you have two (threaded) processes A and B, who have that shared COW
> page, what can happen is:
> 
>  - A does a READ operation using get_follow_page, and gets the shared page
> 
>  - another thread in A ends up doing an unmap operation
> 
>  - B does write to the page, and causes a COW, but now doesn't see the
> A mapping at all any more, so takes over the page and writes to it
> 
>  - the original get_follow_page() thread in A is now holding a page
> that contains data that wasn't written by A
> 
> See? That's the whole point. It doesn't _matter_ if you're only
> reading the data, without the COW you may be reading the _wrong_ data.

Yeah, that's why I totally agree we need to do enforced COW even for a read gup
as long as the page can be further referenced (GET|PIN).  However frankly
speaking I didn't follow the rest on what's wrong with "Userfaultfd-wp should
not care about this because it's not a write operation" that I mentiond.  Is
that the major part of the objection?

I'm afraid I still think that's the right thing to do...

Could you help elaborate more?  It would be better if you can help to point out
the code changes that are wrong or suspecious, maybe that'll help me to
understand what I might have missed too.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm/gup: Allow real explicit breaking of COW
  2020-08-10 21:57       ` Peter Xu
@ 2020-08-10 23:19         ` Linus Torvalds
  2020-08-10 23:38           ` Jann Horn
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2020-08-10 23:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Marty Mcfadden, Andrea Arcangeli, Jann Horn, Christoph Hellwig,
	Oleg Nesterov, Kirill Shutemov, Jan Kara

On Mon, Aug 10, 2020 at 2:57 PM Peter Xu <peterx@redhat.com> wrote:
>
> Yeah, that's why I totally agree we need to do enforced COW even for a read gup
> as long as the page can be further referenced (GET|PIN).  However frankly
> speaking I didn't follow the rest on what's wrong with "Userfaultfd-wp should
> not care about this because it's not a write operation" that I mentiond.  Is
> that the major part of the objection?

You didn't _explain_ why it's ok.

You said "it's only reading".

And I told you why "only reading" is not an argument against COW.

So my objection is that you haven't actually explained why it's ok for
userfaultfd-wp.

I can easily believe that it is, but my whole point is that this is
subtle, and it needs documentation and explanation.

That's also why I think the bit should be the other way: the people
who need to explain are the ones that disable COW, and the reversed
bit would mark that case.

The people who are ok with COW don't need to explain anything, and for
the same reason I think they shouldn't need to add that
FAULT_FLAG_BREAK_COW.

See what my argument is? It's entirely about documenting the special case.

               Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm/gup: Allow real explicit breaking of COW
  2020-08-10 23:19         ` Linus Torvalds
@ 2020-08-10 23:38           ` Jann Horn
  2020-08-11 15:05             ` Linus Torvalds
  0 siblings, 1 reply; 10+ messages in thread
From: Jann Horn @ 2020-08-10 23:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Xu, Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Marty Mcfadden, Andrea Arcangeli, Christoph Hellwig,
	Oleg Nesterov, Kirill Shutemov, Jan Kara

On Tue, Aug 11, 2020 at 1:19 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Aug 10, 2020 at 2:57 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Yeah, that's why I totally agree we need to do enforced COW even for a read gup
> > as long as the page can be further referenced (GET|PIN).  However frankly
> > speaking I didn't follow the rest on what's wrong with "Userfaultfd-wp should
> > not care about this because it's not a write operation" that I mentiond.  Is
> > that the major part of the objection?
>
> You didn't _explain_ why it's ok.
>
> You said "it's only reading".
>
> And I told you why "only reading" is not an argument against COW.

The way I understand Peter, he doesn't want to avoid doing COW; he
wants to decouple userfaultfd-WP's fault handling from COW, so that
userfaultfd-wp notifies only when a previously-write-protected page is
actually written to. In other words, he wants the COW to basically
happen as it happens now, but it should only create a readonly PTE;
and if someone later triggers a real write fault, the fault handling
path would run again, and this time userfaultfd-wp would be notified
before that readonly PTE is turned into a writable one.

The FOLL flag would only be generated by the GUP path, not passed in
by any callers. It would cause the PTEs generated by breaking COW to
be read-only (I think this part is missing from the patch?), and it
would cause userfaultfd-WP to not synchronously block the fault.

I hope I summarized Peter's idea correctly?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm/gup: Allow real explicit breaking of COW
  2020-08-10 23:38           ` Jann Horn
@ 2020-08-11 15:05             ` Linus Torvalds
  2020-08-11 15:27               ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Linus Torvalds @ 2020-08-11 15:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: Peter Xu, Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Marty Mcfadden, Andrea Arcangeli, Christoph Hellwig,
	Oleg Nesterov, Kirill Shutemov, Jan Kara

On Mon, Aug 10, 2020 at 4:39 PM Jann Horn <jannh@google.com> wrote:
>
> The way I understand Peter, he doesn't want to avoid doing COW; he
> wants to decouple userfaultfd-WP's fault handling from COW, so that
> userfaultfd-wp notifies only when a previously-write-protected page is
> actually written to. In other words, he wants the COW to basically
> happen as it happens now, but it should only create a readonly PTE;
> and if someone later triggers a real write fault, the fault handling
> path would run again, and this time userfaultfd-wp would be notified
> before that readonly PTE is turned into a writable one.

Ahh.

A light goes on.

Thank you.

And apologies to Peter - I misread that patch entirely.

That said, now that I (finally) understand what Peter wants to do, I
don't think the patch does what you say.

Because the GUP will now indeed avoid userfaultfd-wp unless it's
_actually_ a write, but then any reads will cause a COW that turns
things writable. There is no second fault.

So now later writes will never cause any userfaultfd-wp notifications at all.

Which for all I know might be acceptable and ok, but it seems to be
against userfaultfd rules, and against the whole synchronization idea.

So I think the patch is broken, but I'm less fundamentally concerned about it.

Because at that point, it's "only" userfaultfd that might break.

                  Linus

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] mm/gup: Allow real explicit breaking of COW
  2020-08-11 15:05             ` Linus Torvalds
@ 2020-08-11 15:27               ` Peter Xu
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2020-08-11 15:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jann Horn, Linux-MM, Linux Kernel Mailing List, Andrew Morton,
	Marty Mcfadden, Andrea Arcangeli, Christoph Hellwig,
	Oleg Nesterov, Kirill Shutemov, Jan Kara

On Tue, Aug 11, 2020 at 08:05:22AM -0700, Linus Torvalds wrote:
> On Mon, Aug 10, 2020 at 4:39 PM Jann Horn <jannh@google.com> wrote:
> >
> > The way I understand Peter, he doesn't want to avoid doing COW; he
> > wants to decouple userfaultfd-WP's fault handling from COW, so that
> > userfaultfd-wp notifies only when a previously-write-protected page is
> > actually written to. In other words, he wants the COW to basically
> > happen as it happens now, but it should only create a readonly PTE;
> > and if someone later triggers a real write fault, the fault handling
> > path would run again, and this time userfaultfd-wp would be notified
> > before that readonly PTE is turned into a writable one.

Yes, that's what I meant.  Appreciate your help, Jann.

> 
> Ahh.
> 
> A light goes on.
> 
> Thank you.
> 
> And apologies to Peter - I misread that patch entirely.
> 
> That said, now that I (finally) understand what Peter wants to do, I
> don't think the patch does what you say.
> 
> Because the GUP will now indeed avoid userfaultfd-wp unless it's
> _actually_ a write, but then any reads will cause a COW that turns
> things writable. There is no second fault.
> 
> So now later writes will never cause any userfaultfd-wp notifications at all.
> 
> Which for all I know might be acceptable and ok, but it seems to be
> against userfaultfd rules, and against the whole synchronization idea.
> 
> So I think the patch is broken, but I'm less fundamentally concerned about it.
> 
> Because at that point, it's "only" userfaultfd that might break.

Right, v2 is broken on that.  That's why I pasted another chunk in my previous
reply to still inherit the UFFD_WP bit even for COW [1].  Previously it was not
needed because UFFD_WP must have been turned off for the pte/pmd/.. before COW
happens.  However with enforced COW that's not guaranteed any more.

I'll post v3 soon.

[1] https://lore.kernel.org/lkml/20200810191520.GA132381@xz-x1/

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-08-11 15:28 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-10 14:57 [PATCH v2] mm/gup: Allow real explicit breaking of COW Peter Xu
2020-08-10 16:47 ` Linus Torvalds
2020-08-10 19:15   ` Peter Xu
2020-08-10 20:51     ` Linus Torvalds
2020-08-10 20:59       ` Linus Torvalds
2020-08-10 21:57       ` Peter Xu
2020-08-10 23:19         ` Linus Torvalds
2020-08-10 23:38           ` Jann Horn
2020-08-11 15:05             ` Linus Torvalds
2020-08-11 15:27               ` Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).