* [PATCH 1/1] gup: document and work around "COW can break either way" issue
@ 2021-10-12  1:53 Suren Baghdasaryan
  0 siblings, 0 replies; 17+ messages in thread
From: Suren Baghdasaryan @ 2021-10-12  1:53 UTC (permalink / raw)
  To: stable
  Cc: gregkh, jannh, torvalds, vbabka, peterx, aarcange, david, jgg,
	ktkhai, shli, namit, hch, oleg, kirill, jack, willy, linux-mm,
	linux-kernel, kernel-team, surenb

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.

Doing a "get_user_pages()" on a copy-on-write page for reading can be
ambiguous: the page can be COW'ed at any time afterwards, and the
direction of a COW event isn't defined.

Yes, whoever writes to it will generally do the COW, but if the thread
that did the get_user_pages() unmapped the page before the write (and
that could happen due to memory pressure in addition to any outright
action), the writer could also just take over the old page instead.

End result: the get_user_pages() call might result in a page pointer
that is no longer associated with the original VM, and is associated
with - and controlled by - another VM having taken it over instead.

So when doing a get_user_pages() on a COW mapping, the only really safe
thing to do would be to break the COW when getting the page, even when
only getting it for reading.

At the same time, some users simply don't even care.

For example, the perf code wants to look up the page not because it
cares about the page, but because the code simply wants to look up the
physical address of the access for informational purposes, and doesn't
really care about races when a page might be unmapped and remapped
elsewhere.

This adds logic to force a COW event by setting FOLL_WRITE on any
copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
pointer as a result.

The current semantics end up being:

 - __get_user_pages_fast(): no change. If you don't ask for a write,
   you won't break COW. You'd better know what you're doing.

 - get_user_pages_fast(): the fast-case "look it up in the page tables
   without anything getting mmap_sem" now refuses to follow a read-only
   page, since it might need COW breaking.  Which happens in the slow
   path - the fast path doesn't know if the memory might be COW or not.

 - get_user_pages() (including the slow-path fallback for gup_fast()):
   for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
   very similar semantics to FOLL_FORCE.
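
[ A hedged, caller's-eye sketch of the new default - not part of the
  patch, assuming the 4.9-era get_user_pages() signature from
  <linux/mm.h>; pin_one_page_for_read() is a hypothetical helper:

	static int pin_one_page_for_read(unsigned long user_addr)
	{
		struct page *page;
		long ret;

		/* A read-only request: no FOLL_WRITE in gup_flags. */
		ret = get_user_pages(user_addr, 1, 0, &page, NULL);
		if (ret != 1)
			return ret < 0 ? ret : -EFAULT;
		/*
		 * If user_addr lies in a private COW mapping, the COW has
		 * already been broken by the time we get here, so "page"
		 * belongs to this mm and cannot be taken over by the
		 * other side of a fork().
		 */
		put_page(page);
		return 0;
	}
]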

If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path.  So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.

[ It might be worth noting that we've always had this ambiguity, and it
  could arguably be seen as a user-space issue.

  You only get private COW mappings that could break either way in
  situations where user space is doing cooperative things (ie fork()
  before an execve() etc), but it _is_ surprising and very subtle, and
  fork() is supposed to give you independent address spaces.

  So let's treat this as a kernel issue and make the semantics of
  get_user_pages() easier to understand. Note that obviously a true
  shared mapping will still get a page that can change under us, so this
  does _not_ mean that get_user_pages() somehow returns any "stable"
  page ]
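
[ For illustration only - a minimal userspace sketch of the window
  described above, assuming the classic read-side GUP via vmsplice().
  Real reproducers are considerably more involved, and which way the
  COW actually breaks depends on timing and reclaim:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/uio.h>
	#include <unistd.h>

	int main(void)
	{
		char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		int pipefd[2];

		pipe(pipefd);
		memset(buf, 'A', 4096);

		if (fork() == 0) {
			/* Child: read-only GUP of the shared COW page. */
			struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
			vmsplice(pipefd[1], &iov, 1, 0);
			pause();
		}
		/*
		 * Parent: this write would normally COW the parent's copy,
		 * but if the child side went away first, the parent could
		 * instead take over the very page the pipe still holds.
		 */
		buf[0] = 'B';
		return 0;
	}
]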

[surenb: backport notes
        Since gup_pgd_range does not exist, made appropriate changes to
        the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
        Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
        gup_huge_pd and gup_pud_range.
        Removed FOLL_PIN usage in should_force_cow_break since it's
        missing in the earlier kernels.]

Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[surenb: backport to 4.9 kernel]
Cc: stable@vger.kernel.org # 4.9.x
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/gup.c         | 48 ++++++++++++++++++++++++++++++++++++++++--------
 mm/huge_memory.c |  7 +++----
 2 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6bb7a8eb7f82..301dd96ef176 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -61,13 +61,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+}
+
+/*
+ * A (separate) COW fault might break the page the other way and
+ * get_user_pages() would return the page from what is now the wrong
+ * VM. So we need to force a COW break at GUP time even for reads.
+ */
+static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
+{
+	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -577,12 +586,18 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (!vma || check_vma_flags(vma, gup_flags))
 				return i ? : -EFAULT;
 			if (is_vm_hugetlb_page(vma)) {
+				if (should_force_cow_break(vma, foll_flags))
+					foll_flags |= FOLL_WRITE;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags);
+						foll_flags);
 				continue;
 			}
 		}
+
+		if (should_force_cow_break(vma, foll_flags))
+			foll_flags |= FOLL_WRITE;
+
 retry:
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -1503,6 +1518,10 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 /*
  * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
  * the regular GUP. It will only return non-negative values.
+ *
+ * Careful, careful! COW breaking can go either way, so a non-write
+ * access can get ambiguous page results. If you call this function without
+ * 'write' set, you'd better be sure that you're ok with that ambiguity.
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
@@ -1532,6 +1551,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 *
 	 * We do not adopt an rcu_read_lock(.) here as we also want to
 	 * block IPIs that come from THPs splitting.
+	 *
+	 * NOTE! We allow read-only gup_fast() here, but you'd better be
+	 * careful about possible COW pages. You'll get _a_ COW page, but
+	 * not necessarily the one you intended to get depending on what
+	 * COW event happens after this. COW may break the page copy in a
+	 * random direction.
 	 */
 
 	local_irq_save(flags);
@@ -1542,15 +1567,22 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none(pgd))
 			break;
+		/*
+		 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+		 * because get_user_pages() may need to cause an early COW in
+		 * order to avoid confusing the normal COW routines. So only
+		 * targets that are already writable are safe to do by just
+		 * looking at the page tables.
+		 */
 		if (unlikely(pgd_huge(pgd))) {
-			if (!gup_huge_pgd(pgd, pgdp, addr, next, write,
+			if (!gup_huge_pgd(pgd, pgdp, addr, next, 1,
 					  pages, &nr))
 				break;
 		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
-					 PGDIR_SHIFT, next, write, pages, &nr))
+					 PGDIR_SHIFT, next, 1, pages, &nr))
 				break;
-		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+		} else if (!gup_pud_range(pgd, addr, next, 1, pages, &nr))
 			break;
 	} while (pgdp++, addr = next, addr != end);
 	local_irq_restore(flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 14cd0ef33b62..c5424483c066 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1135,13 +1135,12 @@ int do_huge_pmd_wp_page(struct fault_env *fe, pmd_t orig_pmd)
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
-- 
2.33.0.882.g93a45727a2-goog



* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12  1:52 Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2021-10-12 10:06 ` Greg KH
@ 2021-10-13 21:58 ` John Hubbard
  3 siblings, 0 replies; 17+ messages in thread
From: John Hubbard @ 2021-10-13 21:58 UTC (permalink / raw)
  To: Suren Baghdasaryan, stable
  Cc: gregkh, jannh, torvalds, vbabka, peterx, aarcange, david, jgg,
	ktkhai, shli, namit, hch, oleg, kirill, jack, willy, linux-mm,
	linux-kernel, kernel-team

On 10/11/21 18:52, Suren Baghdasaryan wrote:
...
> [surenb: backport notes
>          Since gup_pgd_range does not exist, made appropriate changes on
>          the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
> 	Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
>          gup_huge_pd and gup_pud_range.
> 	Removed FOLL_PIN usage in should_force_cow_break since it's missing in
> 	the earlier kernels.]
> 

This backport looks accurate. At first I thought you had missed the
comment-only change to i915_gem_userptr.c, but then I noticed that the
older branch still uses non-fast gup, so it's not applicable there after
all. :)

I agree with others that this whole area is still shaky, but it does sound
as if this will help.


Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA

> Reported-by: Jann Horn <jannh@google.com>
> Tested-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: Kirill Shutemov <kirill@shutemov.name>
> Acked-by: Jan Kara <jack@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> [surenb: backport to 4.4 kernel]
> Cc: stable@vger.kernel.org # 4.4.x
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>   mm/gup.c         | 48 ++++++++++++++++++++++++++++++++++++++++--------
>   mm/huge_memory.c |  7 +++----
>   2 files changed, 43 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 4c5857889e9d..c80cdc408228 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -59,13 +59,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>   }
>   
>   /*
> - * FOLL_FORCE can write to even unwritable pte's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> + * but only after we've gone through a COW cycle and they are dirty.
>    */
>   static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>   {
> -	return pte_write(pte) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> +}
> +
> +/*
> + * A (separate) COW fault might break the page the other way and
> + * get_user_pages() would return the page from what is now the wrong
> + * VM. So we need to force a COW break at GUP time even for reads.
> + */
> +static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> +{
> +	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
>   }
>   
>   static struct page *follow_page_pte(struct vm_area_struct *vma,
> @@ -509,12 +518,18 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>   			if (!vma || check_vma_flags(vma, gup_flags))
>   				return i ? : -EFAULT;
>   			if (is_vm_hugetlb_page(vma)) {
> +				if (should_force_cow_break(vma, foll_flags))
> +					foll_flags |= FOLL_WRITE;
>   				i = follow_hugetlb_page(mm, vma, pages, vmas,
>   						&start, &nr_pages, i,
> -						gup_flags);
> +						foll_flags);
>   				continue;
>   			}
>   		}
> +
> +		if (should_force_cow_break(vma, foll_flags))
> +			foll_flags |= FOLL_WRITE;
> +
>   retry:
>   		/*
>   		 * If we have a pending SIGKILL, don't keep faulting pages and
> @@ -1346,6 +1361,10 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>   /*
>    * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
>    * the regular GUP. It will only return non-negative values.
> + *
> + * Careful, careful! COW breaking can go either way, so a non-write
> + * access can get ambiguous page results. If you call this function without
> + * 'write' set, you'd better be sure that you're ok with that ambiguity.
>    */
>   int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   			  struct page **pages)
> @@ -1375,6 +1394,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   	 *
>   	 * We do not adopt an rcu_read_lock(.) here as we also want to
>   	 * block IPIs that come from THPs splitting.
> +	 *
> +	 * NOTE! We allow read-only gup_fast() here, but you'd better be
> +	 * careful about possible COW pages. You'll get _a_ COW page, but
> +	 * not necessarily the one you intended to get depending on what
> +	 * COW event happens after this. COW may break the page copy in a
> +	 * random direction.
>   	 */
>   
>   	local_irq_save(flags);
> @@ -1385,15 +1410,22 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>   		next = pgd_addr_end(addr, end);
>   		if (pgd_none(pgd))
>   			break;
> +		/*
> +		 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> +		 * because get_user_pages() may need to cause an early COW in
> +		 * order to avoid confusing the normal COW routines. So only
> +		 * targets that are already writable are safe to do by just
> +		 * looking at the page tables.
> +		 */
>   		if (unlikely(pgd_huge(pgd))) {
> -			if (!gup_huge_pgd(pgd, pgdp, addr, next, write,
> +			if (!gup_huge_pgd(pgd, pgdp, addr, next, 1,
>   					  pages, &nr))
>   				break;
>   		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
>   			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
> -					 PGDIR_SHIFT, next, write, pages, &nr))
> +					 PGDIR_SHIFT, next, 1, pages, &nr))
>   				break;
> -		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> +		} else if (!gup_pud_range(pgd, addr, next, 1, pages, &nr))
>   			break;
>   	} while (pgdp++, addr = next, addr != end);
>   	local_irq_restore(flags);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6404e4fcb4ed..fae45c56e2ee 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1268,13 +1268,12 @@ out_unlock:
>   }
>   
>   /*
> - * FOLL_FORCE can write to even unwritable pmd's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
> + * but only after we've gone through a COW cycle and they are dirty.
>    */
>   static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
>   {
> -	return pmd_write(pmd) ||
> -	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
> +	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
>   }
>   
>   struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> 



* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12 18:57       ` Thadeu Lima de Souza Cascardo
@ 2021-10-13  8:56         ` Greg KH
  0 siblings, 0 replies; 17+ messages in thread
From: Greg KH @ 2021-10-13  8:56 UTC (permalink / raw)
  To: Thadeu Lima de Souza Cascardo
  Cc: Vlastimil Babka, Jan Kara, Suren Baghdasaryan, stable, jannh,
	torvalds, peterx, aarcange, david, jgg, ktkhai, shli, namit, hch,
	oleg, kirill, willy, linux-mm, linux-kernel, kernel-team

On Tue, Oct 12, 2021 at 03:57:51PM -0300, Thadeu Lima de Souza Cascardo wrote:
> On Tue, Oct 12, 2021 at 10:42:40AM +0200, Greg KH wrote:
> > On Tue, Oct 12, 2021 at 10:14:27AM +0200, Vlastimil Babka wrote:
> > > On 10/12/21 10:06, Jan Kara wrote:
> > > > On Mon 11-10-21 18:52:44, Suren Baghdasaryan wrote:
> > > >> From: Linus Torvalds <torvalds@linux-foundation.org>
> > > >> 
> > > >> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
> > > >> 
> > > >> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> > > >> ambiguous: the page can be COW'ed at any time afterwards, and the
> > > >> direction of a COW event isn't defined.
> > > >> 
> > > >> Yes, whoever writes to it will generally do the COW, but if the thread
> > > >> that did the get_user_pages() unmapped the page before the write (and
> > > >> that could happen due to memory pressure in addition to any outright
> > > >> action), the writer could also just take over the old page instead.
> > > >> 
> > > >> End result: the get_user_pages() call might result in a page pointer
> > > >> that is no longer associated with the original VM, and is associated
> > > >> with - and controlled by - another VM having taken it over instead.
> > > >> 
> > > >> So when doing a get_user_pages() on a COW mapping, the only really safe
> > > >> thing to do would be to break the COW when getting the page, even when
> > > >> only getting it for reading.
> > > >> 
> > > >> At the same time, some users simply don't even care.
> > > >> 
> > > >> For example, the perf code wants to look up the page not because it
> > > >> cares about the page, but because the code simply wants to look up the
> > > >> physical address of the access for informational purposes, and doesn't
> > > >> really care about races when a page might be unmapped and remapped
> > > >> elsewhere.
> > > >> 
> > > >> This adds logic to force a COW event by setting FOLL_WRITE on any
> > > >> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> > > >> pointer as a result.
> > > >> 
> > > >> The current semantics end up being:
> > > >> 
> > > >>  - __get_user_pages_fast(): no change. If you don't ask for a write,
> > > >>    you won't break COW. You'd better know what you're doing.
> > > >> 
> > > >>  - get_user_pages_fast(): the fast-case "look it up in the page tables
> > > >>    without anything getting mmap_sem" now refuses to follow a read-only
> > > >>    page, since it might need COW breaking.  Which happens in the slow
> > > >>    path - the fast path doesn't know if the memory might be COW or not.
> > > >> 
> > > >>  - get_user_pages() (including the slow-path fallback for gup_fast()):
> > > >>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
> > > >>    very similar semantics to FOLL_FORCE.
> > > >> 
> > > >> If it turns out that we want finer granularity (ie "only break COW when
> > > >> it might actually matter" - things like the zero page are special and
> > > >> don't need to be broken) we might need to push these semantics deeper
> > > >> into the lookup fault path.  So if people care enough, it's possible
> > > >> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> > > >> with the internal FOLL_COW flag we already have for tracking "I had a
> > > >> COW".
> > > >> 
> > > >> Alternatively, if it turns out that different callers might want to
> > > >> explicitly control the forced COW break behavior, we might even want to
> > > >> make such a flag visible to the users of get_user_pages() instead of
> > > >> using the above default semantics.
> > > >> 
> > > >> But for now, this is mostly commentary on the issue (this commit message
> > > >> being a lot bigger than the patch, and that patch in turn is almost all
> > > >> comments), with that minimal "enable COW breaking early" logic using the
> > > >> existing FOLL_WRITE behavior.
> > > >> 
> > > >> [ It might be worth noting that we've always had this ambiguity, and it
> > > >>   could arguably be seen as a user-space issue.
> > > >> 
> > > >>   You only get private COW mappings that could break either way in
> > > >>   situations where user space is doing cooperative things (ie fork()
> > > >>   before an execve() etc), but it _is_ surprising and very subtle, and
> > > >>   fork() is supposed to give you independent address spaces.
> > > >> 
> > > >>   So let's treat this as a kernel issue and make the semantics of
> > > >>   get_user_pages() easier to understand. Note that obviously a true
> > > >>   shared mapping will still get a page that can change under us, so this
> > > >>   does _not_ mean that get_user_pages() somehow returns any "stable"
> > > >>   page ]
> > > >> 
> > > >> [surenb: backport notes
> > > >>         Since gup_pgd_range does not exist, made appropriate changes on
> > > >>         the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
> > > >> 	Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
> > > >>         gup_huge_pd and gup_pud_range.
> > > >> 	Removed FOLL_PIN usage in should_force_cow_break since it's missing in
> > > >> 	the earlier kernels.]
> > > > 
> > > > I'd be really careful with backporting this to stable. There was a lot of
> > > > userspace breakage caused by this change, if I remember right, which needed
> > > > to be fixed up later. There is a nice summary at
> > > > https://lwn.net/Articles/849638/ and https://lwn.net/Articles/849876/ and
> > > > some problems are still being found...
> > > 
> > > Yeah, that was my initial reaction. But it looks like back in April we
> > > agreed that backporting only this commit could be feasible - the relevant
> > > subthread starts around here [1]. The known breakage for just this commit
> > > was uffd functionality introduced only in 5.7, and strace on dax on pmem
> > > (which was never properly root-caused). 5.4 stable has already had the
> > > backport for a year, and Suren posted 4.14 and 4.19 in April after [1].
> > > It looks like nobody reported issues? Continuing with 4.4 and 4.9 at least
> > > makes this consistent, although the risk of breaking something is always
> > > there and the CVE is probably not worth it, but whatever...
> > 
> > I have had people "complain" that the issue was not fixed on these older
> > kernels; whether that is just because those groups have an "it has a CVE
> > so it must be fixed!" policy or not is hard to tell.
> > 
> > But this seems to be exploitable, and we have a reproducer somewhere
> > around here, so it would be nice to get this resolved for the simple
> > reason that it is a bug that we should fix if possible.
> > 
> > So I would err on the side of "let's merge this", as fixing a known issue
> > is ALWAYS better than the fear of "maybe something might break".  We can
> > always revert if the latter happens in testing.
> > 
> > thanks,
> > 
> > greg k-h
> 
> When we backported this to the Ubuntu kernel based on 4.4, we found a
> regression that required commit 38e088546522e1e86d2b8f401a1354ad3a9b3303
> ("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing") as a fix.
> 
> I tested that this was also the case with the 4.4.y stable-rc tree and I am
> providing our backport below, which I also tested. The reproducer that
> triggers the regression reads from /proc/self/mem. Writing to
> /proc/self/mem seems to have been broken on 4.4 for a while and is also
> fixed by this backport, so it should be considered in any case.

Thank you for the backport, and for letting us know; it's now queued up!

greg k-h


* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12  8:42     ` Greg KH
@ 2021-10-12 18:57       ` Thadeu Lima de Souza Cascardo
  2021-10-13  8:56         ` Greg KH
  0 siblings, 1 reply; 17+ messages in thread
From: Thadeu Lima de Souza Cascardo @ 2021-10-12 18:57 UTC (permalink / raw)
  To: Greg KH
  Cc: Vlastimil Babka, Jan Kara, Suren Baghdasaryan, stable, jannh,
	torvalds, peterx, aarcange, david, jgg, ktkhai, shli, namit, hch,
	oleg, kirill, willy, linux-mm, linux-kernel, kernel-team

[-- Attachment #1: Type: text/plain, Size: 7152 bytes --]

On Tue, Oct 12, 2021 at 10:42:40AM +0200, Greg KH wrote:
> On Tue, Oct 12, 2021 at 10:14:27AM +0200, Vlastimil Babka wrote:
> > On 10/12/21 10:06, Jan Kara wrote:
> > > On Mon 11-10-21 18:52:44, Suren Baghdasaryan wrote:
> > >> From: Linus Torvalds <torvalds@linux-foundation.org>
> > >> 
> > >> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
> > >> 
> > >> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> > >> ambiguous: the page can be COW'ed at any time afterwards, and the
> > >> direction of a COW event isn't defined.
> > >> 
> > >> Yes, whoever writes to it will generally do the COW, but if the thread
> > >> that did the get_user_pages() unmapped the page before the write (and
> > >> that could happen due to memory pressure in addition to any outright
> > >> action), the writer could also just take over the old page instead.
> > >> 
> > >> End result: the get_user_pages() call might result in a page pointer
> > >> that is no longer associated with the original VM, and is associated
> > >> with - and controlled by - another VM having taken it over instead.
> > >> 
> > >> So when doing a get_user_pages() on a COW mapping, the only really safe
> > >> thing to do would be to break the COW when getting the page, even when
> > >> only getting it for reading.
> > >> 
> > >> At the same time, some users simply don't even care.
> > >> 
> > >> For example, the perf code wants to look up the page not because it
> > >> cares about the page, but because the code simply wants to look up the
> > >> physical address of the access for informational purposes, and doesn't
> > >> really care about races when a page might be unmapped and remapped
> > >> elsewhere.
> > >> 
> > >> This adds logic to force a COW event by setting FOLL_WRITE on any
> > >> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> > >> pointer as a result.
> > >> 
> > >> The current semantics end up being:
> > >> 
> > >>  - __get_user_pages_fast(): no change. If you don't ask for a write,
> > >>    you won't break COW. You'd better know what you're doing.
> > >> 
> > >>  - get_user_pages_fast(): the fast-case "look it up in the page tables
> > >>    without anything getting mmap_sem" now refuses to follow a read-only
> > >>    page, since it might need COW breaking.  Which happens in the slow
> > >>    path - the fast path doesn't know if the memory might be COW or not.
> > >> 
> > >>  - get_user_pages() (including the slow-path fallback for gup_fast()):
> > >>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
> > >>    very similar semantics to FOLL_FORCE.
> > >> 
> > >> If it turns out that we want finer granularity (ie "only break COW when
> > >> it might actually matter" - things like the zero page are special and
> > >> don't need to be broken) we might need to push these semantics deeper
> > >> into the lookup fault path.  So if people care enough, it's possible
> > >> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> > >> with the internal FOLL_COW flag we already have for tracking "I had a
> > >> COW".
> > >> 
> > >> Alternatively, if it turns out that different callers might want to
> > >> explicitly control the forced COW break behavior, we might even want to
> > >> make such a flag visible to the users of get_user_pages() instead of
> > >> using the above default semantics.
> > >> 
> > >> But for now, this is mostly commentary on the issue (this commit message
> > >> being a lot bigger than the patch, and that patch in turn is almost all
> > >> comments), with that minimal "enable COW breaking early" logic using the
> > >> existing FOLL_WRITE behavior.
> > >> 
> > >> [ It might be worth noting that we've always had this ambiguity, and it
> > >>   could arguably be seen as a user-space issue.
> > >> 
> > >>   You only get private COW mappings that could break either way in
> > >>   situations where user space is doing cooperative things (ie fork()
> > >>   before an execve() etc), but it _is_ surprising and very subtle, and
> > >>   fork() is supposed to give you independent address spaces.
> > >> 
> > >>   So let's treat this as a kernel issue and make the semantics of
> > >>   get_user_pages() easier to understand. Note that obviously a true
> > >>   shared mapping will still get a page that can change under us, so this
> > >>   does _not_ mean that get_user_pages() somehow returns any "stable"
> > >>   page ]
> > >> 
> > >> [surenb: backport notes
> > >>         Since gup_pgd_range does not exist, made appropriate changes on
> > >>         the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
> > >> 	Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
> > >>         gup_huge_pd and gup_pud_range.
> > >> 	Removed FOLL_PIN usage in should_force_cow_break since it's missing in
> > >> 	the earlier kernels.]
> > > 
> > > I'd be really careful with backporting this to stable. There was a lot of
> > > userspace breakage caused by this change, if I remember right, which needed
> > > to be fixed up later. There is a nice summary at
> > > https://lwn.net/Articles/849638/ and https://lwn.net/Articles/849876/ and
> > > some problems are still being found...
> > 
> > Yeah, that was my initial reaction. But it looks like back in April we
> > agreed that backporting only this commit could be feasible - the relevant
> > subthread starts around here [1]. The known breakage for just this commit
> > was uffd functionality introduced only in 5.7, and strace on dax on pmem
> > (which was never properly root-caused). 5.4 stable has already had the
> > backport for a year, and Suren posted 4.14 and 4.19 in April after [1].
> > It looks like nobody reported issues? Continuing with 4.4 and 4.9 at least
> > makes this consistent, although the risk of breaking something is always
> > there and the CVE is probably not worth it, but whatever...
> 
> I have had people "complain" that the issue was not fixed on these older
> kernels; whether that is just because those groups have an "it has a CVE
> so it must be fixed!" policy or not is hard to tell.
> 
> But this seems to be exploitable, and we have a reproducer somewhere
> around here, so it would be nice to get this resolved for the simple
> reason that it is a bug that we should fix if possible.
> 
> So I would err on the side of "let's merge this", as fixing a known issue
> is ALWAYS better than the fear of "maybe something might break".  We can
> always revert if the latter happens in testing.
> 
> thanks,
> 
> greg k-h

When we backported this to the Ubuntu kernel based on 4.4, we found a
regression that required commit 38e088546522e1e86d2b8f401a1354ad3a9b3303
("mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing") as a fix.

I tested that this was also the case with the 4.4.y stable-rc tree and I am
providing our backport below, which I also tested. The reproducer that
triggers the regression reads from /proc/self/mem. Writing to
/proc/self/mem seems to have been broken on 4.4 for a while and is also
fixed by this backport, so it should be considered in any case.

Cascardo.

[-- Attachment #2: 0001-mm-check-VMA-flags-to-avoid-invalid-PROT_NONE-NUMA-b.patch --]
[-- Type: text/x-diff, Size: 4547 bytes --]

From 757597f803f5770ebd7c70d4bf0068aa525c033a Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lstoakes@gmail.com>
Date: Mon, 5 Apr 2021 22:36:00 +0200
Subject: [PATCH 4.4] mm: check VMA flags to avoid invalid PROT_NONE NUMA
 balancing

commit 38e088546522e1e86d2b8f401a1354ad3a9b3303 upstream.

The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
PMDs respectively as requiring balancing upon a subsequent page fault.
User-defined PROT_NONE memory regions which also have this flag set will
not normally invoke the NUMA balancing code as do_page_fault() will send
a segfault to the process before handle_mm_fault() is even called.

However if access_remote_vm() is invoked to access a PROT_NONE region of
memory, handle_mm_fault() is called via faultin_page() and
__get_user_pages() without any access checks being performed, meaning
the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
region.

A simple means of triggering this problem is to access PROT_NONE mmap'd
memory using /proc/self/mem which reliably results in the NUMA handling
functions being invoked when CONFIG_NUMA_BALANCING is set.

This issue was reported in bugzilla (issue 99101) which includes some
simple repro code.
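
[ A hedged reconstruction of that trigger - the bugzilla repro itself is
  not reproduced here, and error handling is omitted:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdint.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* A user-defined PROT_NONE region, as described above. */
		char *p = mmap(NULL, 4096, PROT_NONE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		int fd = open("/proc/self/mem", O_RDONLY);
		char buf[16];

		/*
		 * access_remote_vm() reaches handle_mm_fault() without any
		 * access checks, so with CONFIG_NUMA_BALANCING this read
		 * can hit the NUMA-balancing path on a genuinely PROT_NONE
		 * pte and trip the old BUG_ON().
		 */
		pread(fd, buf, sizeof(buf), (off_t)(uintptr_t)p);
		return 0;
	}
]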

There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
added at commit c0e7cad to avoid accidentally provoking strange
behaviour by attempting to apply NUMA balancing to pages that are in
fact PROT_NONE.  The BUG_ON()'s are consistently triggered by the repro.

This patch moves the PROT_NONE check into mm/memory.c rather than
invoking BUG_ON() as faulting in these pages via faultin_page() is a
valid reason for reaching the NUMA check with the PROT_NONE page table
flag set and is therefore not always a bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
Reported-by: Trevor Saunders <tbsaunde@tbsaunde.org>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
[cascardo: context adjustments were necessary]
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
---
 mm/huge_memory.c |  3 ---
 mm/memory.c      | 12 +++++++-----
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fae45c56e2ee..2f53786098c5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1340,9 +1340,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool was_writable;
 	int flags = 0;
 
-	/* A PROT_NONE fault should not end up here */
-	BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
-
 	ptl = pmd_lock(mm, pmdp);
 	if (unlikely(!pmd_same(pmd, *pmdp)))
 		goto out_unlock;
diff --git a/mm/memory.c b/mm/memory.c
index 360d28224a8e..6bfc6a021c4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3209,9 +3209,6 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool was_writable = pte_write(pte);
 	int flags = 0;
 
-	/* A PROT_NONE fault should not end up here */
-	BUG_ON(!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)));
-
 	/*
 	* The "pte" at this point cannot be used safely without
 	* validation through pte_unmap_same(). It's of NUMA type but
@@ -3304,6 +3301,11 @@ static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 	return VM_FAULT_FALLBACK;
 }
 
+static inline bool vma_is_accessible(struct vm_area_struct *vma)
+{
+	return vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE);
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -3350,7 +3352,7 @@ static int handle_pte_fault(struct mm_struct *mm,
 					pte, pmd, flags, entry);
 	}
 
-	if (pte_protnone(entry))
+	if (pte_protnone(entry) && vma_is_accessible(vma))
 		return do_numa_page(mm, vma, address, entry, pte, pmd);
 
 	ptl = pte_lockptr(mm, pmd);
@@ -3425,7 +3427,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (pmd_trans_splitting(orig_pmd))
 				return 0;
 
-			if (pmd_protnone(orig_pmd))
+			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(mm, vma, address,
 							     orig_pmd, pmd);
 
-- 
2.30.2



* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12 10:06 ` Greg KH
@ 2021-10-12 16:16   ` Suren Baghdasaryan
  0 siblings, 0 replies; 17+ messages in thread
From: Suren Baghdasaryan @ 2021-10-12 16:16 UTC (permalink / raw)
  To: Greg KH
  Cc: stable, Jann Horn, Linus Torvalds, Vlastimil Babka, Peter Xu,
	Andrea Arcangeli, David Hildenbrand, Jason Gunthorpe,
	Kirill Tkhai, Shaohua Li, Nadav Amit, Christoph Hellwig,
	Oleg Nesterov, Kirill A. Shutemov, jack, Matthew Wilcox,
	linux-mm, LKML, kernel-team

On Tue, Oct 12, 2021 at 3:06 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Mon, Oct 11, 2021 at 06:52:44PM -0700, Suren Baghdasaryan wrote:
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> >
> > commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
>
> Both backports now queued up, thanks.

Thanks!

>
> greg k-h
>


* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12  1:52 Suren Baghdasaryan
  2021-10-12  1:55 ` Suren Baghdasaryan
  2021-10-12  8:06 ` Jan Kara
@ 2021-10-12 10:06 ` Greg KH
  2021-10-12 16:16   ` Suren Baghdasaryan
  2021-10-13 21:58 ` John Hubbard
  3 siblings, 1 reply; 17+ messages in thread
From: Greg KH @ 2021-10-12 10:06 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: stable, jannh, torvalds, vbabka, peterx, aarcange, david, jgg,
	ktkhai, shli, namit, hch, oleg, kirill, jack, willy, linux-mm,
	linux-kernel, kernel-team

On Mon, Oct 11, 2021 at 06:52:44PM -0700, Suren Baghdasaryan wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.

Both backports now queued up, thanks.

greg k-h


* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12  8:14   ` Vlastimil Babka
@ 2021-10-12  8:42     ` Greg KH
  2021-10-12 18:57       ` Thadeu Lima de Souza Cascardo
  0 siblings, 1 reply; 17+ messages in thread
From: Greg KH @ 2021-10-12  8:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Jan Kara, Suren Baghdasaryan, stable, jannh, torvalds, peterx,
	aarcange, david, jgg, ktkhai, shli, namit, hch, oleg, kirill,
	willy, linux-mm, linux-kernel, kernel-team

On Tue, Oct 12, 2021 at 10:14:27AM +0200, Vlastimil Babka wrote:
> On 10/12/21 10:06, Jan Kara wrote:
> > On Mon 11-10-21 18:52:44, Suren Baghdasaryan wrote:
> >> From: Linus Torvalds <torvalds@linux-foundation.org>
> >> 
> >> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
> >> 
> >> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> >> ambiguous: the page can be COW'ed at any time afterwards, and the
> >> direction of a COW event isn't defined.
> >> 
> >> Yes, whoever writes to it will generally do the COW, but if the thread
> >> that did the get_user_pages() unmapped the page before the write (and
> >> that could happen due to memory pressure in addition to any outright
> >> action), the writer could also just take over the old page instead.
> >> 
> >> End result: the get_user_pages() call might result in a page pointer
> >> that is no longer associated with the original VM, and is associated
> >> with - and controlled by - another VM having taken it over instead.
> >> 
> >> So when doing a get_user_pages() on a COW mapping, the only really safe
> >> thing to do would be to break the COW when getting the page, even when
> >> only getting it for reading.
> >> 
> >> At the same time, some users simply don't even care.
> >> 
> >> For example, the perf code wants to look up the page not because it
> >> cares about the page, but because the code simply wants to look up the
> >> physical address of the access for informational purposes, and doesn't
> >> really care about races when a page might be unmapped and remapped
> >> elsewhere.
> >> 
> >> This adds logic to force a COW event by setting FOLL_WRITE on any
> >> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> >> pointer as a result.
> >> 
> >> The current semantics end up being:
> >> 
> >>  - __get_user_pages_fast(): no change. If you don't ask for a write,
> >>    you won't break COW. You'd better know what you're doing.
> >> 
> >>  - get_user_pages_fast(): the fast-case "look it up in the page tables
> >>    without anything getting mmap_sem" now refuses to follow a read-only
> >>    page, since it might need COW breaking.  Which happens in the slow
> >>    path - the fast path doesn't know if the memory might be COW or not.
> >> 
> >>  - get_user_pages() (including the slow-path fallback for gup_fast()):
> >>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
> >>    very similar semantics to FOLL_FORCE.
> >> 
> >> If it turns out that we want finer granularity (ie "only break COW when
> >> it might actually matter" - things like the zero page are special and
> >> don't need to be broken) we might need to push these semantics deeper
> >> into the lookup fault path.  So if people care enough, it's possible
> >> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> >> with the internal FOLL_COW flag we already have for tracking "I had a
> >> COW".
> >> 
> >> Alternatively, if it turns out that different callers might want to
> >> explicitly control the forced COW break behavior, we might even want to
> >> make such a flag visible to the users of get_user_pages() instead of
> >> using the above default semantics.
> >> 
> >> But for now, this is mostly commentary on the issue (this commit message
> >> being a lot bigger than the patch, and that patch in turn is almost all
> >> comments), with that minimal "enable COW breaking early" logic using the
> >> existing FOLL_WRITE behavior.
> >> 
> >> [ It might be worth noting that we've always had this ambiguity, and it
> >>   could arguably be seen as a user-space issue.
> >> 
> >>   You only get private COW mappings that could break either way in
> >>   situations where user space is doing cooperative things (ie fork()
> >>   before an execve() etc), but it _is_ surprising and very subtle, and
> >>   fork() is supposed to give you independent address spaces.
> >> 
> >>   So let's treat this as a kernel issue and make the semantics of
> >>   get_user_pages() easier to understand. Note that obviously a true
> >>   shared mapping will still get a page that can change under us, so this
> >>   does _not_ mean that get_user_pages() somehow returns any "stable"
> >>   page ]
> >> 
> >> [surenb: backport notes
> >>         Since gup_pgd_range does not exist, made appropriate changes on
> >>         the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
> >> 	Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
> >>         gup_huge_pd and gup_pud_range.
> >> 	Removed FOLL_PIN usage in should_force_cow_break since it's missing in
> >> 	the earlier kernels.]
> > 
> > I'd be really careful with backporting this to stable. There was a lot of
> > userspace breakage caused by this change, if I remember right, which needed
> > to be fixed up later. There is a nice summary at
> > https://lwn.net/Articles/849638/ and https://lwn.net/Articles/849876/ and
> > some problems are still being found...
> 
> Yeah, that was my initial reaction. But it looks like back in April we
> agreed that backporting only this commit could be feasible - the relevant
> subthread starts around here [1]. The known breakage for just this commit
> was uffd functionality introduced only in 5.7, and strace on dax on pmem
> (which was never properly root-caused). 5.4 stable has already had the
> backport for a year, and Suren posted 4.14 and 4.19 in April after [1].
> It looks like nobody reported issues? Continuing with 4.4 and 4.9 at least
> makes this consistent, although the risk of breaking something is always
> there and the CVE is probably not worth it, but whatever...

I have had people "complain" that the issue was not fixed on these older
kernels; whether that is just because those groups have an "it has a CVE
so it must be fixed!" policy or not is hard to tell.

But this seems to be exploitable, and we have a reproducer somewhere
around here, so it would be nice to get this resolved for the simple
reason that it is a bug that we should fix if possible.

So I would err on the side of "let's merge this", as fixing a known issue
is ALWAYS better than the fear of "maybe something might break".  We can
always revert if the latter happens in testing.

thanks,

greg k-h


* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12  8:06 ` Jan Kara
@ 2021-10-12  8:14   ` Vlastimil Babka
  2021-10-12  8:42     ` Greg KH
  0 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2021-10-12  8:14 UTC (permalink / raw)
  To: Jan Kara, Suren Baghdasaryan
  Cc: stable, gregkh, jannh, torvalds, peterx, aarcange, david, jgg,
	ktkhai, shli, namit, hch, oleg, kirill, willy, linux-mm,
	linux-kernel, kernel-team

On 10/12/21 10:06, Jan Kara wrote:
> On Mon 11-10-21 18:52:44, Suren Baghdasaryan wrote:
>> From: Linus Torvalds <torvalds@linux-foundation.org>
>> 
>> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
>> 
>> Doing a "get_user_pages()" on a copy-on-write page for reading can be
>> ambiguous: the page can be COW'ed at any time afterwards, and the
>> direction of a COW event isn't defined.
>> 
>> Yes, whoever writes to it will generally do the COW, but if the thread
>> that did the get_user_pages() unmapped the page before the write (and
>> that could happen due to memory pressure in addition to any outright
>> action), the writer could also just take over the old page instead.
>> 
>> End result: the get_user_pages() call might result in a page pointer
>> that is no longer associated with the original VM, and is associated
>> with - and controlled by - another VM having taken it over instead.
>> 
>> So when doing a get_user_pages() on a COW mapping, the only really safe
>> thing to do would be to break the COW when getting the page, even when
>> only getting it for reading.
>> 
>> At the same time, some users simply don't even care.
>> 
>> For example, the perf code wants to look up the page not because it
>> cares about the page, but because the code simply wants to look up the
>> physical address of the access for informational purposes, and doesn't
>> really care about races when a page might be unmapped and remapped
>> elsewhere.
>> 
>> This adds logic to force a COW event by setting FOLL_WRITE on any
>> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
>> pointer as a result.
>> 
>> The current semantics end up being:
>> 
>>  - __get_user_pages_fast(): no change. If you don't ask for a write,
>>    you won't break COW. You'd better know what you're doing.
>> 
>>  - get_user_pages_fast(): the fast-case "look it up in the page tables
>>    without anything getting mmap_sem" now refuses to follow a read-only
>>    page, since it might need COW breaking.  Which happens in the slow
>>    path - the fast path doesn't know if the memory might be COW or not.
>> 
>>  - get_user_pages() (including the slow-path fallback for gup_fast()):
>>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
>>    very similar semantics to FOLL_FORCE.
>> 
>> If it turns out that we want finer granularity (ie "only break COW when
>> it might actually matter" - things like the zero page are special and
>> don't need to be broken) we might need to push these semantics deeper
>> into the lookup fault path.  So if people care enough, it's possible
>> that we might end up adding a new internal FOLL_BREAK_COW flag to go
>> with the internal FOLL_COW flag we already have for tracking "I had a
>> COW".
>> 
>> Alternatively, if it turns out that different callers might want to
>> explicitly control the forced COW break behavior, we might even want to
>> make such a flag visible to the users of get_user_pages() instead of
>> using the above default semantics.
>> 
>> But for now, this is mostly commentary on the issue (this commit message
>> being a lot bigger than the patch, and that patch in turn is almost all
>> comments), with that minimal "enable COW breaking early" logic using the
>> existing FOLL_WRITE behavior.
>> 
>> [ It might be worth noting that we've always had this ambiguity, and it
>>   could arguably be seen as a user-space issue.
>> 
>>   You only get private COW mappings that could break either way in
>>   situations where user space is doing cooperative things (ie fork()
>>   before an execve() etc), but it _is_ surprising and very subtle, and
>>   fork() is supposed to give you independent address spaces.
>> 
>>   So let's treat this as a kernel issue and make the semantics of
>>   get_user_pages() easier to understand. Note that obviously a true
>>   shared mapping will still get a page that can change under us, so this
>>   does _not_ mean that get_user_pages() somehow returns any "stable"
>>   page ]
>> 
>> [surenb: backport notes
>>         Since gup_pgd_range does not exist, made appropriate changes on
>>         the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
>> 	Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
>>         gup_huge_pd and gup_pud_range.
>> 	Removed FOLL_PIN usage in should_force_cow_break since it's missing in
>> 	the earlier kernels.]
> 
> I'd be really careful with backporting this to stable. There was a lot of
> userspace breakage caused by this change, if I remember right, which needed
> to be fixed up later. There is a nice summary at
> https://lwn.net/Articles/849638/ and https://lwn.net/Articles/849876/ and
> some problems are still being found...

Yeah, that was my initial reaction. But it looks like back in April we
agreed that backporting only this commit could be feasible - the relevant
subthread starts around here [1]. The known breakage for just this commit
was uffd functionality introduced only in 5.7, and strace on dax on pmem
(which was never properly root-caused). 5.4 stable has already had the
backport for a year, and Suren posted 4.14 and 4.19 in April after [1].
It looks like nobody reported issues? Continuing with 4.4 and 4.9 at least
makes this consistent, although the risk of breaking something is always
there and the CVE is probably not worth it, but whatever...

[1]
https://lore.kernel.org/all/CAHk-=wj0JH6PnG7dW51Sr5ZqhomqSaSLTQV7z4Si2dLeSVcO_g@mail.gmail.com/


* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12  1:52 Suren Baghdasaryan
  2021-10-12  1:55 ` Suren Baghdasaryan
@ 2021-10-12  8:06 ` Jan Kara
  2021-10-12  8:14   ` Vlastimil Babka
  2021-10-12 10:06 ` Greg KH
  2021-10-13 21:58 ` John Hubbard
  3 siblings, 1 reply; 17+ messages in thread
From: Jan Kara @ 2021-10-12  8:06 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: stable, gregkh, jannh, torvalds, vbabka, peterx, aarcange, david,
	jgg, ktkhai, shli, namit, hch, oleg, kirill, jack, willy,
	linux-mm, linux-kernel, kernel-team

On Mon 11-10-21 18:52:44, Suren Baghdasaryan wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
> 
> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> ambiguous: the page can be COW'ed at any time afterwards, and the
> direction of a COW event isn't defined.
> 
> Yes, whoever writes to it will generally do the COW, but if the thread
> that did the get_user_pages() unmapped the page before the write (and
> that could happen due to memory pressure in addition to any outright
> action), the writer could also just take over the old page instead.
> 
> End result: the get_user_pages() call might result in a page pointer
> that is no longer associated with the original VM, and is associated
> with - and controlled by - another VM having taken it over instead.
> 
> So when doing a get_user_pages() on a COW mapping, the only really safe
> thing to do would be to break the COW when getting the page, even when
> only getting it for reading.
> 
> At the same time, some users simply don't even care.
> 
> For example, the perf code wants to look up the page not because it
> cares about the page, but because the code simply wants to look up the
> physical address of the access for informational purposes, and doesn't
> really care about races when a page might be unmapped and remapped
> elsewhere.
> 
> This adds logic to force a COW event by setting FOLL_WRITE on any
> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> pointer as a result.
> 
> The current semantics end up being:
> 
>  - __get_user_pages_fast(): no change. If you don't ask for a write,
>    you won't break COW. You'd better know what you're doing.
> 
>  - get_user_pages_fast(): the fast-case "look it up in the page tables
>    without anything getting mmap_sem" now refuses to follow a read-only
>    page, since it might need COW breaking.  Which happens in the slow
>    path - the fast path doesn't know if the memory might be COW or not.
> 
>  - get_user_pages() (including the slow-path fallback for gup_fast()):
>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
>    very similar semantics to FOLL_FORCE.
> 
> If it turns out that we want finer granularity (ie "only break COW when
> it might actually matter" - things like the zero page are special and
> don't need to be broken) we might need to push these semantics deeper
> into the lookup fault path.  So if people care enough, it's possible
> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> with the internal FOLL_COW flag we already have for tracking "I had a
> COW".
> 
> Alternatively, if it turns out that different callers might want to
> explicitly control the forced COW break behavior, we might even want to
> make such a flag visible to the users of get_user_pages() instead of
> using the above default semantics.
> 
> But for now, this is mostly commentary on the issue (this commit message
> being a lot bigger than the patch, and that patch in turn is almost all
> comments), with that minimal "enable COW breaking early" logic using the
> existing FOLL_WRITE behavior.
> 
> [ It might be worth noting that we've always had this ambiguity, and it
>   could arguably be seen as a user-space issue.
> 
>   You only get private COW mappings that could break either way in
>   situations where user space is doing cooperative things (ie fork()
>   before an execve() etc), but it _is_ surprising and very subtle, and
>   fork() is supposed to give you independent address spaces.
> 
>   So let's treat this as a kernel issue and make the semantics of
>   get_user_pages() easier to understand. Note that obviously a true
>   shared mapping will still get a page that can change under us, so this
>   does _not_ mean that get_user_pages() somehow returns any "stable"
>   page ]
> 
> [surenb: backport notes
>         Since gup_pgd_range does not exist, made appropriate changes to
>         the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
>         Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
>         gup_huge_pd and gup_pud_range.
>         Removed FOLL_PIN usage in should_force_cow_break since it's missing in
>         the earlier kernels.]

I'd be really careful with backporting this to stable. There was a lot of
userspace breakage caused by this change, if I remember right, which needed
to be fixed up later. There is a nice summary at
https://lwn.net/Articles/849638/ and https://lwn.net/Articles/849876/ and
some problems are still being found...

								Honza

> 
> Reported-by: Jann Horn <jannh@google.com>
> Tested-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: Kirill Shutemov <kirill@shutemov.name>
> Acked-by: Jan Kara <jack@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> [surenb: backport to 4.4 kernel]
> Cc: stable@vger.kernel.org # 4.4.x
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/gup.c         | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  mm/huge_memory.c |  7 +++----
>  2 files changed, 43 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 4c5857889e9d..c80cdc408228 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -59,13 +59,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  }
>  
>  /*
> - * FOLL_FORCE can write to even unwritable pte's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>  {
> -	return pte_write(pte) ||
> -		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> +}
> +
> +/*
> + * A (separate) COW fault might break the page the other way and
> + * get_user_pages() would return the page from what is now the wrong
> + * VM. So we need to force a COW break at GUP time even for reads.
> + */
> +static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> +{
> +	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
>  }
>  
>  static struct page *follow_page_pte(struct vm_area_struct *vma,
> @@ -509,12 +518,18 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			if (!vma || check_vma_flags(vma, gup_flags))
>  				return i ? : -EFAULT;
>  			if (is_vm_hugetlb_page(vma)) {
> +				if (should_force_cow_break(vma, foll_flags))
> +					foll_flags |= FOLL_WRITE;
>  				i = follow_hugetlb_page(mm, vma, pages, vmas,
>  						&start, &nr_pages, i,
> -						gup_flags);
> +						foll_flags);
>  				continue;
>  			}
>  		}
> +
> +		if (should_force_cow_break(vma, foll_flags))
> +			foll_flags |= FOLL_WRITE;
> +
>  retry:
>  		/*
>  		 * If we have a pending SIGKILL, don't keep faulting pages and
> @@ -1346,6 +1361,10 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  /*
>   * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
>   * the regular GUP. It will only return non-negative values.
> + *
> + * Careful, careful! COW breaking can go either way, so a non-write
> + * access can get ambiguous page results. If you call this function without
> + * 'write' set, you'd better be sure that you're ok with that ambiguity.
>   */
>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			  struct page **pages)
> @@ -1375,6 +1394,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  	 *
>  	 * We do not adopt an rcu_read_lock(.) here as we also want to
>  	 * block IPIs that come from THPs splitting.
> +	 *
> +	 * NOTE! We allow read-only gup_fast() here, but you'd better be
> +	 * careful about possible COW pages. You'll get _a_ COW page, but
> +	 * not necessarily the one you intended to get depending on what
> +	 * COW event happens after this. COW may break the page copy in a
> +	 * random direction.
>  	 */
>  
>  	local_irq_save(flags);
> @@ -1385,15 +1410,22 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  		next = pgd_addr_end(addr, end);
>  		if (pgd_none(pgd))
>  			break;
> +		/*
> +		 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> +		 * because get_user_pages() may need to cause an early COW in
> +		 * order to avoid confusing the normal COW routines. So only
> +		 * targets that are already writable are safe to do by just
> +		 * looking at the page tables.
> +		 */
>  		if (unlikely(pgd_huge(pgd))) {
> -			if (!gup_huge_pgd(pgd, pgdp, addr, next, write,
> +			if (!gup_huge_pgd(pgd, pgdp, addr, next, 1,
>  					  pages, &nr))
>  				break;
>  		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
>  			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
> -					 PGDIR_SHIFT, next, write, pages, &nr))
> +					 PGDIR_SHIFT, next, 1, pages, &nr))
>  				break;
> -		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> +		} else if (!gup_pud_range(pgd, addr, next, 1, pages, &nr))
>  			break;
>  	} while (pgdp++, addr = next, addr != end);
>  	local_irq_restore(flags);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6404e4fcb4ed..fae45c56e2ee 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1268,13 +1268,12 @@ out_unlock:
>  }
>  
>  /*
> - * FOLL_FORCE can write to even unwritable pmd's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
>  {
> -	return pmd_write(pmd) ||
> -	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
> +	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
>  }
>  
>  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> -- 
> 2.33.0.882.g93a45727a2-goog
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12  1:55 ` Suren Baghdasaryan
@ 2021-10-12  5:45   ` Greg Kroah-Hartman
  0 siblings, 0 replies; 17+ messages in thread
From: Greg Kroah-Hartman @ 2021-10-12  5:45 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: stable, Jann Horn, Linus Torvalds, Vlastimil Babka, Peter Xu,
	Andrea Arcangeli, David Hildenbrand, Jason Gunthorpe,
	Kirill Tkhai, Shaohua Li, Nadav Amit, Christoph Hellwig,
	Oleg Nesterov, Kirill A. Shutemov, jack, Matthew Wilcox,
	linux-mm, LKML, kernel-team

On Mon, Oct 11, 2021 at 06:55:39PM -0700, Suren Baghdasaryan wrote:
> On Mon, Oct 11, 2021 at 6:52 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> >
> > From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> Sorry Greg, I must have screwed up something and this "From:" line
> appeared twice. Do you want me to resend these backports, or can you
> simply drop the duplicate line?

I will fix it up, no worries.

greg k-h

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-10-12  1:52 Suren Baghdasaryan
@ 2021-10-12  1:55 ` Suren Baghdasaryan
  2021-10-12  5:45   ` Greg Kroah-Hartman
  2021-10-12  8:06 ` Jan Kara
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Suren Baghdasaryan @ 2021-10-12  1:55 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, Jann Horn, Linus Torvalds, Vlastimil Babka,
	Peter Xu, Andrea Arcangeli, David Hildenbrand, Jason Gunthorpe,
	Kirill Tkhai, Shaohua Li, Nadav Amit, Christoph Hellwig,
	Oleg Nesterov, Kirill A. Shutemov, jack, Matthew Wilcox,
	linux-mm, LKML, kernel-team

On Mon, Oct 11, 2021 at 6:52 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
>
> From: Linus Torvalds <torvalds@linux-foundation.org>

Sorry Greg, I must have screwed up something and this "From:" line
appeared twice. Do you want me to resend these backports, or can you
simply drop the duplicate line?

>
> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
>
> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> ambiguous: the page can be COW'ed at any time afterwards, and the
> direction of a COW event isn't defined.
>
> Yes, whoever writes to it will generally do the COW, but if the thread
> that did the get_user_pages() unmapped the page before the write (and
> that could happen due to memory pressure in addition to any outright
> action), the writer could also just take over the old page instead.
>
> End result: the get_user_pages() call might result in a page pointer
> that is no longer associated with the original VM, and is associated
> with - and controlled by - another VM having taken it over instead.
>
> So when doing a get_user_pages() on a COW mapping, the only really safe
> thing to do would be to break the COW when getting the page, even when
> only getting it for reading.
>
> At the same time, some users simply don't even care.
>
> For example, the perf code wants to look up the page not because it
> cares about the page, but because the code simply wants to look up the
> physical address of the access for informational purposes, and doesn't
> really care about races when a page might be unmapped and remapped
> elsewhere.
>
> This adds logic to force a COW event by setting FOLL_WRITE on any
> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> pointer as a result.
>
> The current semantics end up being:
>
>  - __get_user_pages_fast(): no change. If you don't ask for a write,
>    you won't break COW. You'd better know what you're doing.
>
>  - get_user_pages_fast(): the fast-case "look it up in the page tables
>    without anything getting mmap_sem" now refuses to follow a read-only
>    page, since it might need COW breaking.  Which happens in the slow
>    path - the fast path doesn't know if the memory might be COW or not.
>
>  - get_user_pages() (including the slow-path fallback for gup_fast()):
>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
>    very similar semantics to FOLL_FORCE.
>
> If it turns out that we want finer granularity (ie "only break COW when
> it might actually matter" - things like the zero page are special and
> don't need to be broken) we might need to push these semantics deeper
> into the lookup fault path.  So if people care enough, it's possible
> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> with the internal FOLL_COW flag we already have for tracking "I had a
> COW".
>
> Alternatively, if it turns out that different callers might want to
> explicitly control the forced COW break behavior, we might even want to
> make such a flag visible to the users of get_user_pages() instead of
> using the above default semantics.
>
> But for now, this is mostly commentary on the issue (this commit message
> being a lot bigger than the patch, and that patch in turn is almost all
> comments), with that minimal "enable COW breaking early" logic using the
> existing FOLL_WRITE behavior.
>
> [ It might be worth noting that we've always had this ambiguity, and it
>   could arguably be seen as a user-space issue.
>
>   You only get private COW mappings that could break either way in
>   situations where user space is doing cooperative things (ie fork()
>   before an execve() etc), but it _is_ surprising and very subtle, and
>   fork() is supposed to give you independent address spaces.
>
>   So let's treat this as a kernel issue and make the semantics of
>   get_user_pages() easier to understand. Note that obviously a true
>   shared mapping will still get a page that can change under us, so this
>   does _not_ mean that get_user_pages() somehow returns any "stable"
>   page ]
>
> [surenb: backport notes
>         Since gup_pgd_range does not exist, made appropriate changes to
>         the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
>         Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
>         gup_huge_pd and gup_pud_range.
>         Removed FOLL_PIN usage in should_force_cow_break since it's missing in
>         the earlier kernels.]
>
> Reported-by: Jann Horn <jannh@google.com>
> Tested-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: Kirill Shutemov <kirill@shutemov.name>
> Acked-by: Jan Kara <jack@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> [surenb: backport to 4.4 kernel]
> Cc: stable@vger.kernel.org # 4.4.x
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/gup.c         | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  mm/huge_memory.c |  7 +++----
>  2 files changed, 43 insertions(+), 12 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 4c5857889e9d..c80cdc408228 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -59,13 +59,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  }
>
>  /*
> - * FOLL_FORCE can write to even unwritable pte's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
>  {
> -       return pte_write(pte) ||
> -               ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
> +       return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
> +}
> +
> +/*
> + * A (separate) COW fault might break the page the other way and
> + * get_user_pages() would return the page from what is now the wrong
> + * VM. So we need to force a COW break at GUP time even for reads.
> + */
> +static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
> +{
> +       return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
>  }
>
>  static struct page *follow_page_pte(struct vm_area_struct *vma,
> @@ -509,12 +518,18 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                         if (!vma || check_vma_flags(vma, gup_flags))
>                                 return i ? : -EFAULT;
>                         if (is_vm_hugetlb_page(vma)) {
> +                               if (should_force_cow_break(vma, foll_flags))
> +                                       foll_flags |= FOLL_WRITE;
>                                 i = follow_hugetlb_page(mm, vma, pages, vmas,
>                                                 &start, &nr_pages, i,
> -                                               gup_flags);
> +                                               foll_flags);
>                                 continue;
>                         }
>                 }
> +
> +               if (should_force_cow_break(vma, foll_flags))
> +                       foll_flags |= FOLL_WRITE;
> +
>  retry:
>                 /*
>                  * If we have a pending SIGKILL, don't keep faulting pages and
> @@ -1346,6 +1361,10 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
>  /*
>   * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
>   * the regular GUP. It will only return non-negative values.
> + *
> + * Careful, careful! COW breaking can go either way, so a non-write
> + * access can get ambiguous page results. If you call this function without
> + * 'write' set, you'd better be sure that you're ok with that ambiguity.
>   */
>  int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>                           struct page **pages)
> @@ -1375,6 +1394,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>          *
>          * We do not adopt an rcu_read_lock(.) here as we also want to
>          * block IPIs that come from THPs splitting.
> +        *
> +        * NOTE! We allow read-only gup_fast() here, but you'd better be
> +        * careful about possible COW pages. You'll get _a_ COW page, but
> +        * not necessarily the one you intended to get depending on what
> +        * COW event happens after this. COW may break the page copy in a
> +        * random direction.
>          */
>
>         local_irq_save(flags);
> @@ -1385,15 +1410,22 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>                 next = pgd_addr_end(addr, end);
>                 if (pgd_none(pgd))
>                         break;
> +               /*
> +                * The FAST_GUP case requires FOLL_WRITE even for pure reads,
> +                * because get_user_pages() may need to cause an early COW in
> +                * order to avoid confusing the normal COW routines. So only
> +                * targets that are already writable are safe to do by just
> +                * looking at the page tables.
> +                */
>                 if (unlikely(pgd_huge(pgd))) {
> -                       if (!gup_huge_pgd(pgd, pgdp, addr, next, write,
> +                       if (!gup_huge_pgd(pgd, pgdp, addr, next, 1,
>                                           pages, &nr))
>                                 break;
>                 } else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
>                         if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
> -                                        PGDIR_SHIFT, next, write, pages, &nr))
> +                                        PGDIR_SHIFT, next, 1, pages, &nr))
>                                 break;
> -               } else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> +               } else if (!gup_pud_range(pgd, addr, next, 1, pages, &nr))
>                         break;
>         } while (pgdp++, addr = next, addr != end);
>         local_irq_restore(flags);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 6404e4fcb4ed..fae45c56e2ee 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1268,13 +1268,12 @@ out_unlock:
>  }
>
>  /*
> - * FOLL_FORCE can write to even unwritable pmd's, but only
> - * after we've gone through a COW cycle and they are dirty.
> + * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
> + * but only after we've gone through a COW cycle and they are dirty.
>   */
>  static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
>  {
> -       return pmd_write(pmd) ||
> -              ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
> +       return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
>  }
>
>  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> --
> 2.33.0.882.g93a45727a2-goog
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/1] gup: document and work around "COW can break either way" issue
@ 2021-10-12  1:52 Suren Baghdasaryan
  2021-10-12  1:55 ` Suren Baghdasaryan
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Suren Baghdasaryan @ 2021-10-12  1:52 UTC (permalink / raw)
  To: stable
  Cc: gregkh, jannh, torvalds, vbabka, peterx, aarcange, david, jgg,
	ktkhai, shli, namit, hch, oleg, kirill, jack, willy, linux-mm,
	linux-kernel, kernel-team, surenb

From: Linus Torvalds <torvalds@linux-foundation.org>

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.

Doing a "get_user_pages()" on a copy-on-write page for reading can be
ambiguous: the page can be COW'ed at any time afterwards, and the
direction of a COW event isn't defined.

Yes, whoever writes to it will generally do the COW, but if the thread
that did the get_user_pages() unmapped the page before the write (and
that could happen due to memory pressure in addition to any outright
action), the writer could also just take over the old page instead.

End result: the get_user_pages() call might result in a page pointer
that is no longer associated with the original VM, and is associated
with - and controlled by - another VM having taken it over instead.

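To make that concrete, a minimal userspace sketch of the setup (illustration
only - vmsplice() is used here as a convenient read-side GUP user, and the
actual outcome depends on timing and memory pressure):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	/* Private anonymous mapping: shared copy-on-write after fork(). */
	char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };
	int pipefd[2];

	memset(buf, 'A', 4096);
	pipe(pipefd);

	if (fork() == 0) {
		/* Child: vmsplice() takes a read-only GUP reference on
		 * the still-shared page. */
		vmsplice(pipefd[1], &iov, 1, 0);
		_exit(0);
	}

	/* Parent: this write triggers the COW. Which side keeps the
	 * original physical page is not defined, so the page the child
	 * pinned may now belong to - and be written by - the parent. */
	buf[0] = 'B';
	return 0;
}
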
So when doing a get_user_pages() on a COW mapping, the only really safe
thing to do would be to break the COW when getting the page, even when
only getting it for reading.

At the same time, some users simply don't even care.

For example, the perf code wants to look up the page not because it
cares about the page, but because the code simply wants to look up the
physical address of the access for informational purposes, and doesn't
really care about races when a page might be unmapped and remapped
elsewhere.

This adds logic to force a COW event by setting FOLL_WRITE on any
copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
pointer as a result.

The current semantics end up being:

 - __get_user_pages_fast(): no change. If you don't ask for a write,
   you won't break COW. You'd better know what you're doing.

 - get_user_pages_fast(): the fast-case "look it up in the page tables
   without anything getting mmap_sem" now refuses to follow a read-only
   page, since it might need COW breaking.  Which happens in the slow
   path - the fast path doesn't know if the memory might be COW or not.

 - get_user_pages() (including the slow-path fallback for gup_fast()):
   for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
   very similar semantics to FOLL_FORCE.

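Seen from a caller on this older tree, those rules sketch out roughly as
follows (hedged illustration; the int-based 'write' signatures match the
code below, but the surrounding function is hypothetical and reference
handling is omitted):

	static int example_lookup(struct mm_struct *mm, unsigned long addr)
	{
		struct page *page;
		int nr;

		/* (1) IRQ-safe variant: a read-only request stays
		 *     read-only, ambiguity and all. */
		nr = __get_user_pages_fast(addr, 1, 0, &page);

		/* (2) gup_fast(): a read-only request now refuses
		 *     read-only ptes in the lockless walk and falls back
		 *     to the slow path, which may break COW early. */
		nr = get_user_pages_fast(addr, 1, 0, &page);

		/* (3) Slow path: FOLL_GET on a COW mapping implies
		 *     FOLL_WRITE internally, much like FOLL_FORCE. */
		nr = get_user_pages(current, mm, addr, 1, 0, 0, &page, NULL);

		return nr;
	}
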
If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path.  So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.

[ It might be worth noting that we've always had this ambiguity, and it
  could arguably be seen as a user-space issue.

  You only get private COW mappings that could break either way in
  situations where user space is doing cooperative things (ie fork()
  before an execve() etc), but it _is_ surprising and very subtle, and
  fork() is supposed to give you independent address spaces.

  So let's treat this as a kernel issue and make the semantics of
  get_user_pages() easier to understand. Note that obviously a true
  shared mapping will still get a page that can change under us, so this
  does _not_ mean that get_user_pages() somehow returns any "stable"
  page ]

[surenb: backport notes
        Since gup_pgd_range does not exist, made appropriate changes to
        the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
        Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
        gup_huge_pd and gup_pud_range.
        Removed FOLL_PIN usage in should_force_cow_break since it's missing in
        the earlier kernels.]

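(Concretely, where upstream passes "gup_flags | FOLL_WRITE" into the
flag-based fast-path helpers, the 4.4 helpers take a plain int, so the
backport pins it to 1 - as in the mm/gup.c hunk below:)

	/* 4.4: 'write' is an int argument, forced on even for reads. */
	gup_pud_range(pgd, addr, next, 1, pages, &nr);
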
Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[surenb: backport to 4.4 kernel]
Cc: stable@vger.kernel.org # 4.4.x
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/gup.c         | 48 ++++++++++++++++++++++++++++++++++++++++--------
 mm/huge_memory.c |  7 +++----
 2 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 4c5857889e9d..c80cdc408228 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -59,13 +59,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+}
+
+/*
+ * A (separate) COW fault might break the page the other way and
+ * get_user_pages() would return the page from what is now the wrong
+ * VM. So we need to force a COW break at GUP time even for reads.
+ */
+static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
+{
+	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -509,12 +518,18 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (!vma || check_vma_flags(vma, gup_flags))
 				return i ? : -EFAULT;
 			if (is_vm_hugetlb_page(vma)) {
+				if (should_force_cow_break(vma, foll_flags))
+					foll_flags |= FOLL_WRITE;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags);
+						foll_flags);
 				continue;
 			}
 		}
+
+		if (should_force_cow_break(vma, foll_flags))
+			foll_flags |= FOLL_WRITE;
+
 retry:
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -1346,6 +1361,10 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 /*
  * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
  * the regular GUP. It will only return non-negative values.
+ *
+ * Careful, careful! COW breaking can go either way, so a non-write
+ * access can get ambiguous page results. If you call this function without
+ * 'write' set, you'd better be sure that you're ok with that ambiguity.
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
@@ -1375,6 +1394,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 *
 	 * We do not adopt an rcu_read_lock(.) here as we also want to
 	 * block IPIs that come from THPs splitting.
+	 *
+	 * NOTE! We allow read-only gup_fast() here, but you'd better be
+	 * careful about possible COW pages. You'll get _a_ COW page, but
+	 * not necessarily the one you intended to get depending on what
+	 * COW event happens after this. COW may break the page copy in a
+	 * random direction.
 	 */
 
 	local_irq_save(flags);
@@ -1385,15 +1410,22 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none(pgd))
 			break;
+		/*
+		 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+		 * because get_user_pages() may need to cause an early COW in
+		 * order to avoid confusing the normal COW routines. So only
+		 * targets that are already writable are safe to do by just
+		 * looking at the page tables.
+		 */
 		if (unlikely(pgd_huge(pgd))) {
-			if (!gup_huge_pgd(pgd, pgdp, addr, next, write,
+			if (!gup_huge_pgd(pgd, pgdp, addr, next, 1,
 					  pages, &nr))
 				break;
 		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
-					 PGDIR_SHIFT, next, write, pages, &nr))
+					 PGDIR_SHIFT, next, 1, pages, &nr))
 				break;
-		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+		} else if (!gup_pud_range(pgd, addr, next, 1, pages, &nr))
 			break;
 	} while (pgdp++, addr = next, addr != end);
 	local_irq_restore(flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6404e4fcb4ed..fae45c56e2ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1268,13 +1268,12 @@ out_unlock:
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
-- 
2.33.0.882.g93a45727a2-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/1] gup: document and work around "COW can break either way" issue
  2021-04-21 22:56 ` Suren Baghdasaryan
  (?)
@ 2021-04-23 15:05 ` Greg KH
  -1 siblings, 0 replies; 17+ messages in thread
From: Greg KH @ 2021-04-23 15:05 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: stable, jannh, torvalds, vbabka, peterx, aarcange, david, jgg,
	ktkhai, shli, namit, linux-mm, linux-kernel, kernel-team,
	Christoph Hellwig, Oleg Nesterov, Kirill Shutemov, Jan Kara,
	Matthew Wilcox

On Wed, Apr 21, 2021 at 03:56:13PM -0700, Suren Baghdasaryan wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.
> 
> Doing a "get_user_pages()" on a copy-on-write page for reading can be
> ambiguous: the page can be COW'ed at any time afterwards, and the
> direction of a COW event isn't defined.
> 
> Yes, whoever writes to it will generally do the COW, but if the thread
> that did the get_user_pages() unmapped the page before the write (and
> that could happen due to memory pressure in addition to any outright
> action), the writer could also just take over the old page instead.
> 
> End result: the get_user_pages() call might result in a page pointer
> that is no longer associated with the original VM, and is associated
> with - and controlled by - another VM having taken it over instead.
> 
> So when doing a get_user_pages() on a COW mapping, the only really safe
> thing to do would be to break the COW when getting the page, even when
> only getting it for reading.
> 
> At the same time, some users simply don't even care.
> 
> For example, the perf code wants to look up the page not because it
> cares about the page, but because the code simply wants to look up the
> physical address of the access for informational purposes, and doesn't
> really care about races when a page might be unmapped and remapped
> elsewhere.
> 
> This adds logic to force a COW event by setting FOLL_WRITE on any
> copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
> pointer as a result.
> 
> The current semantics end up being:
> 
>  - __get_user_pages_fast(): no change. If you don't ask for a write,
>    you won't break COW. You'd better know what you're doing.
> 
>  - get_user_pages_fast(): the fast-case "look it up in the page tables
>    without anything getting mmap_sem" now refuses to follow a read-only
>    page, since it might need COW breaking.  Which happens in the slow
>    path - the fast path doesn't know if the memory might be COW or not.
> 
>  - get_user_pages() (including the slow-path fallback for gup_fast()):
>    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
>    very similar semantics to FOLL_FORCE.
> 
> If it turns out that we want finer granularity (ie "only break COW when
> it might actually matter" - things like the zero page are special and
> don't need to be broken) we might need to push these semantics deeper
> into the lookup fault path.  So if people care enough, it's possible
> that we might end up adding a new internal FOLL_BREAK_COW flag to go
> with the internal FOLL_COW flag we already have for tracking "I had a
> COW".
> 
> Alternatively, if it turns out that different callers might want to
> explicitly control the forced COW break behavior, we might even want to
> make such a flag visible to the users of get_user_pages() instead of
> using the above default semantics.
> 
> But for now, this is mostly commentary on the issue (this commit message
> being a lot bigger than the patch, and that patch in turn is almost all
> comments), with that minimal "enable COW breaking early" logic using the
> existing FOLL_WRITE behavior.
> 
> [ It might be worth noting that we've always had this ambiguity, and it
>   could arguably be seen as a user-space issue.
> 
>   You only get private COW mappings that could break either way in
>   situations where user space is doing cooperative things (ie fork()
>   before an execve() etc), but it _is_ surprising and very subtle, and
>   fork() is supposed to give you independent address spaces.
> 
>   So let's treat this as a kernel issue and make the semantics of
>   get_user_pages() easier to understand. Note that obviously a true
>   shared mapping will still get a page that can change under us, so this
>   does _not_ mean that get_user_pages() somehow returns any "stable"
>   page ]
> 
> [surenb: backport notes]
> Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_pgd_range.
> Removed FOLL_PIN usage in should_force_cow_break since it's missing in
> the earlier kernels.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Tested-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Acked-by: Kirill Shutemov <kirill@shutemov.name>
> Acked-by: Jan Kara <jack@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> [surenb: backport to 4.19 kernel]
> Cc: stable@vger.kernel.org # 4.19.x
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/gup.c         | 44 ++++++++++++++++++++++++++++++++++++++------
>  mm/huge_memory.c |  7 +++----
>  2 files changed, 41 insertions(+), 10 deletions(-)

Thanks for these backports, I've now queued them up.

greg k-h

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/1] gup: document and work around "COW can break either way" issue
@ 2021-04-21 22:57 ` Suren Baghdasaryan
  0 siblings, 0 replies; 17+ messages in thread
From: Suren Baghdasaryan @ 2021-04-21 22:57 UTC (permalink / raw)
  To: stable
  Cc: gregkh, jannh, torvalds, vbabka, peterx, aarcange, david, jgg,
	ktkhai, shli, namit, linux-mm, linux-kernel, kernel-team, surenb,
	Christoph Hellwig, Oleg Nesterov, Kirill Shutemov, Jan Kara,
	Matthew Wilcox

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.

Doing a "get_user_pages()" on a copy-on-write page for reading can be
ambiguous: the page can be COW'ed at any time afterwards, and the
direction of a COW event isn't defined.

Yes, whoever writes to it will generally do the COW, but if the thread
that did the get_user_pages() unmapped the page before the write (and
that could happen due to memory pressure in addition to any outright
action), the writer could also just take over the old page instead.

End result: the get_user_pages() call might result in a page pointer
that is no longer associated with the original VM, and is associated
with - and controlled by - another VM having taken it over instead.

So when doing a get_user_pages() on a COW mapping, the only really safe
thing to do would be to break the COW when getting the page, even when
only getting it for reading.

At the same time, some users simply don't even care.

For example, the perf code wants to look up the page not because it
cares about the page, but because the code simply wants to look up the
physical address of the access for informational purposes, and doesn't
really care about races when a page might be unmapped and remapped
elsewhere.

This adds logic to force a COW event by setting FOLL_WRITE on any
copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
pointer as a result.

The current semantics end up being:

 - __get_user_pages_fast(): no change. If you don't ask for a write,
   you won't break COW. You'd better know what you're doing.

 - get_user_pages_fast(): the fast-case "look it up in the page tables
   without anything getting mmap_sem" now refuses to follow a read-only
   page, since it might need COW breaking.  Which happens in the slow
   path - the fast path doesn't know if the memory might be COW or not.

 - get_user_pages() (including the slow-path fallback for gup_fast()):
   for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
   very similar semantics to FOLL_FORCE.

If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path.  So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.

[ It might be worth noting that we've always had this ambiguity, and it
  could arguably be seen as a user-space issue.

  You only get private COW mappings that could break either way in
  situations where user space is doing cooperative things (ie fork()
  before an execve() etc), but it _is_ surprising and very subtle, and
  fork() is supposed to give you independent address spaces.

  So let's treat this as a kernel issue and make the semantics of
  get_user_pages() easier to understand. Note that obviously a true
  shared mapping will still get a page that can change under us, so this
  does _not_ mean that get_user_pages() somehow returns any "stable"
  page ]

[surenb: backport notes]
Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_pgd_range.
Removed FOLL_PIN usage in should_force_cow_break since it's missing in
the earlier kernels.

Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[surenb: backport to 4.14 kernel]
Cc: stable@vger.kernel.org # 4.14.x
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/gup.c         | 44 ++++++++++++++++++++++++++++++++++++++------
 mm/huge_memory.c |  7 +++----
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 12b9626b1a9e..cfe0a56f8e27 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -61,13 +61,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+}
+
+/*
+ * A (separate) COW fault might break the page the other way and
+ * get_user_pages() would return the page from what is now the wrong
+ * VM. So we need to force a COW break at GUP time even for reads.
+ */
+static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
+{
+	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -694,12 +703,18 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (!vma || check_vma_flags(vma, gup_flags))
 				return i ? : -EFAULT;
 			if (is_vm_hugetlb_page(vma)) {
+				if (should_force_cow_break(vma, foll_flags))
+					foll_flags |= FOLL_WRITE;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags, nonblocking);
+						foll_flags, nonblocking);
 				continue;
 			}
 		}
+
+		if (should_force_cow_break(vma, foll_flags))
+			foll_flags |= FOLL_WRITE;
+
 retry:
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -1796,6 +1811,10 @@ bool gup_fast_permitted(unsigned long start, int nr_pages, int write)
 /*
  * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
  * the regular GUP. It will only return non-negative values.
+ *
+ * Careful, careful! COW breaking can go either way, so a non-write
+ * access can get ambiguous page results. If you call this function without
+ * 'write' set, you'd better be sure that you're ok with that ambiguity.
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
@@ -1823,6 +1842,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 *
 	 * We do not adopt an rcu_read_lock(.) here as we also want to
 	 * block IPIs that come from THPs splitting.
+	 *
+	 * NOTE! We allow read-only gup_fast() here, but you'd better be
+	 * careful about possible COW pages. You'll get _a_ COW page, but
+	 * not necessarily the one you intended to get depending on what
+	 * COW event happens after this. COW may break the page copy in a
+	 * random direction.
 	 */
 
 	if (gup_fast_permitted(start, nr_pages, write)) {
@@ -1868,9 +1893,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					(void __user *)start, len)))
 		return -EFAULT;
 
+	/*
+	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+	 * because get_user_pages() may need to cause an early COW in
+	 * order to avoid confusing the normal COW routines. So only
+	 * targets that are already writable are safe to do by just
+	 * looking at the page tables.
+	 */
 	if (gup_fast_permitted(start, nr_pages, write)) {
 		local_irq_disable();
-		gup_pgd_range(addr, end, write, pages, &nr);
+		gup_pgd_range(addr, end, 1, pages, &nr);
 		local_irq_enable();
 		ret = nr;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9dbfa7286c61..513f0cf173ad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1367,13 +1367,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
-- 
2.31.1.498.g6c1eba8ee3d-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 1/1] gup: document and work around "COW can break either way" issue
@ 2021-04-21 22:57 ` Suren Baghdasaryan
  0 siblings, 0 replies; 17+ messages in thread
From: Suren Baghdasaryan @ 2021-04-21 22:57 UTC (permalink / raw)
  To: stable
  Cc: gregkh, jannh, torvalds, vbabka, peterx, aarcange, david, jgg,
	ktkhai, shli, namit, linux-mm, linux-kernel, kernel-team, surenb,
	Christoph Hellwig, Oleg Nesterov, Kirill Shutemov, Jan Kara,
	Matthew Wilcox

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.

Doing a "get_user_pages()" on a copy-on-write page for reading can be
ambiguous: the page can be COW'ed at any time afterwards, and the
direction of a COW event isn't defined.

Yes, whoever writes to it will generally do the COW, but if the thread
that did the get_user_pages() unmapped the page before the write (and
that could happen due to memory pressure in addition to any outright
action), the writer could also just take over the old page instead.

End result: the get_user_pages() call might result in a page pointer
that is no longer associated with the original VM, and is associated
with - and controlled by - another VM having taken it over instead.

So when doing a get_user_pages() on a COW mapping, the only really safe
thing to do would be to break the COW when getting the page, even when
only getting it for reading.

At the same time, some users simply don't even care.

For example, the perf code wants to look up the page not because it
cares about the page, but because the code simply wants to look up the
physical address of the access for informational purposes, and doesn't
really care about races when a page might be unmapped and remapped
elsewhere.

This adds logic to force a COW event by setting FOLL_WRITE on any
copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
pointer as a result.

The current semantics end up being:

 - __get_user_pages_fast(): no change. If you don't ask for a write,
   you won't break COW. You'd better know what you're doing.

 - get_user_pages_fast(): the fast-case "look it up in the page tables
   without anything getting mmap_sem" now refuses to follow a read-only
   page, since it might need COW breaking.  Which happens in the slow
   path - the fast path doesn't know if the memory might be COW or not.

 - get_user_pages() (including the slow-path fallback for gup_fast()):
   for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
   very similar semantics to FOLL_FORCE.

If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path.  So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.

[ It might be worth noting that we've always had this ambiguity, and it
  could arguably be seen as a user-space issue.

  You only get private COW mappings that could break either way in
  situations where user space is doing cooperative things (ie fork()
  before an execve() etc), but it _is_ surprising and very subtle, and
  fork() is supposed to give you independent address spaces.

  So let's treat this as a kernel issue and make the semantics of
  get_user_pages() easier to understand. Note that obviously a true
  shared mapping will still get a page that can change under us, so this
  does _not_ mean that get_user_pages() somehow returns any "stable"
  page ]

[surenb: backport notes]
Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_pgd_range.
Removed FOLL_PIN usage in should_force_cow_break since it's missing in
the earlier kernels.

Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[surenb: backport to 4.14 kernel]
Cc: stable@vger.kernel.org # 4.14.x
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/gup.c         | 44 ++++++++++++++++++++++++++++++++++++++------
 mm/huge_memory.c |  7 +++----
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 12b9626b1a9e..cfe0a56f8e27 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -61,13 +61,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+}
+
+/*
+ * A (separate) COW fault might break the page the other way and
+ * get_user_pages() would return the page from what is now the wrong
+ * VM. So we need to force a COW break at GUP time even for reads.
+ */
+static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
+{
+	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -694,12 +703,18 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (!vma || check_vma_flags(vma, gup_flags))
 				return i ? : -EFAULT;
 			if (is_vm_hugetlb_page(vma)) {
+				if (should_force_cow_break(vma, foll_flags))
+					foll_flags |= FOLL_WRITE;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags, nonblocking);
+						foll_flags, nonblocking);
 				continue;
 			}
 		}
+
+		if (should_force_cow_break(vma, foll_flags))
+			foll_flags |= FOLL_WRITE;
+
 retry:
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -1796,6 +1811,10 @@ bool gup_fast_permitted(unsigned long start, int nr_pages, int write)
 /*
  * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
  * the regular GUP. It will only return non-negative values.
+ *
+ * Careful, careful! COW breaking can go either way, so a non-write
+ * access can get ambiguous page results. If you call this function without
+ * 'write' set, you'd better be sure that you're ok with that ambiguity.
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
@@ -1823,6 +1842,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 *
 	 * We do not adopt an rcu_read_lock(.) here as we also want to
 	 * block IPIs that come from THPs splitting.
+	 *
+	 * NOTE! We allow read-only gup_fast() here, but you'd better be
+	 * careful about possible COW pages. You'll get _a_ COW page, but
+	 * not necessarily the one you intended to get depending on what
+	 * COW event happens after this. COW may break the page copy in a
+	 * random direction.
 	 */
 
 	if (gup_fast_permitted(start, nr_pages, write)) {
@@ -1868,9 +1893,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					(void __user *)start, len)))
 		return -EFAULT;
 
+	/*
+	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+	 * because get_user_pages() may need to cause an early COW in
+	 * order to avoid confusing the normal COW routines. So only
+	 * targets that are already writable are safe to do by just
+	 * looking at the page tables.
+	 */
 	if (gup_fast_permitted(start, nr_pages, write)) {
 		local_irq_disable();
-		gup_pgd_range(addr, end, write, pages, &nr);
+		gup_pgd_range(addr, end, 1, pages, &nr);
 		local_irq_enable();
 		ret = nr;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9dbfa7286c61..513f0cf173ad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1367,13 +1367,12 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
-- 
2.31.1.498.g6c1eba8ee3d-goog



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 1/1] gup: document and work around "COW can break either way" issue
@ 2021-04-21 22:56 ` Suren Baghdasaryan
  0 siblings, 0 replies; 17+ messages in thread
From: Suren Baghdasaryan @ 2021-04-21 22:56 UTC (permalink / raw)
  To: stable
  Cc: gregkh, jannh, torvalds, vbabka, peterx, aarcange, david, jgg,
	ktkhai, shli, namit, linux-mm, linux-kernel, kernel-team, surenb,
	Christoph Hellwig, Oleg Nesterov, Kirill Shutemov, Jan Kara,
	Matthew Wilcox

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.

Doing a "get_user_pages()" on a copy-on-write page for reading can be
ambiguous: the page can be COW'ed at any time afterwards, and the
direction of a COW event isn't defined.

Yes, whoever writes to it will generally do the COW, but if the thread
that did the get_user_pages() unmapped the page before the write (and
that could happen due to memory pressure in addition to any outright
action), the writer could also just take over the old page instead.

End result: the get_user_pages() call might result in a page pointer
that is no longer associated with the original VM, and is associated
with - and controlled by - another VM having taken it over instead.

So when doing a get_user_pages() on a COW mapping, the only really safe
thing to do would be to break the COW when getting the page, even when
only getting it for reading.

At the same time, some users simply don't even care.

For example, the perf code wants to look up the page not because it
cares about the page, but because the code simply wants to look up the
physical address of the access for informational purposes, and doesn't
really care about races when a page might be unmapped and remapped
elsewhere.

This adds logic to force a COW event by setting FOLL_WRITE on any
copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
pointer as a result.

The current semantics end up being:

 - __get_user_pages_fast(): no change. If you don't ask for a write,
   you won't break COW. You'd better know what you're doing.

 - get_user_pages_fast(): the fast-case "look it up in the page tables
   without anything getting mmap_sem" now refuses to follow a read-only
   page, since it might need COW breaking.  Which happens in the slow
   path - the fast path doesn't know if the memory might be COW or not.

 - get_user_pages() (including the slow-path fallback for gup_fast()):
   for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
   very similar semantics to FOLL_FORCE.

If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path.  So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.
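
That "minimal logic" amounts to a single predicate.  Open-coding
is_cow_mapping() (this is only a restatement of the mm/gup.c hunk
below, under the assumption that is_cow_mapping() still tests
(VM_SHARED | VM_MAYWRITE) == VM_MAYWRITE):

static inline bool should_force_cow_break(struct vm_area_struct *vma,
					  unsigned int flags)
{
	/* A private mapping that may gain write permission, looked up
	 * with a page reference taken (FOLL_GET): force the COW break
	 * now by treating the lookup as a write. */
	return (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE &&
	       (flags & FOLL_GET);
}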

[ It might be worth noting that we've always had this ambiguity, and it
  could arguably be seen as a user-space issue.

  You only get private COW mappings that could break either way in
  situations where user space is doing cooperative things (ie fork()
  before an execve() etc), but it _is_ surprising and very subtle, and
  fork() is supposed to give you independent address spaces.

  So let's treat this as a kernel issue and make the semantics of
  get_user_pages() easier to understand. Note that obviously a true
  shared mapping will still get a page that can change under us, so this
  does _not_ mean that get_user_pages() somehow returns any "stable"
  page ]
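
[ To make the user-space side concrete, a toy illustration of such a
  cooperative COW setup - plain Linux C, nothing from the patch:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	strcpy(p, "cow");
	if (fork() == 0) {
		/* Parent and child now share the page copy-on-write.
		 * This write faults, and which process keeps the
		 * original physical page is undefined - exactly the
		 * ambiguity a read-only GUP reference runs into. */
		p[0] = 'C';
		_exit(0);
	}
	wait(NULL);
	printf("parent still sees: %s\n", p);	/* prints "cow" */
	return 0;
} ]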

[surenb: backport notes]
Replaced (gup_flags | FOLL_WRITE) with write=1 in the gup_pgd_range()
call, since the 4.19 fast path takes a plain 'write' argument rather
than gup_flags (see the snippet below).
Removed the FOLL_PIN check from should_force_cow_break() since
FOLL_PIN does not exist in these earlier kernels.
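
Concretely, where the upstream commit ends up doing

	gup_pgd_range(addr, end, gup_flags | FOLL_WRITE, pages, &nr);

the 4.19 fast path still takes a plain 'write' integer rather than
gup_flags, so the backport passes a hard-coded 1 instead, as visible
in the get_user_pages_fast() hunk below.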

Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[surenb: backport to 4.19 kernel]
Cc: stable@vger.kernel.org # 4.19.x
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/gup.c         | 44 ++++++++++++++++++++++++++++++++++++++------
 mm/huge_memory.c |  7 +++----
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f3088d25bd92..44569927f0ea 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -61,13 +61,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+}
+
+/*
+ * A (separate) COW fault might break the page the other way and
+ * get_user_pages() would return the page from what is now the wrong
+ * VM. So we need to force a COW break at GUP time even for reads.
+ */
+static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
+{
+	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -710,12 +719,18 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (!vma || check_vma_flags(vma, gup_flags))
 				return i ? : -EFAULT;
 			if (is_vm_hugetlb_page(vma)) {
+				if (should_force_cow_break(vma, foll_flags))
+					foll_flags |= FOLL_WRITE;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags, nonblocking);
+						foll_flags, nonblocking);
 				continue;
 			}
 		}
+
+		if (should_force_cow_break(vma, foll_flags))
+			foll_flags |= FOLL_WRITE;
+
 retry:
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -1804,6 +1819,10 @@ bool gup_fast_permitted(unsigned long start, int nr_pages, int write)
  * the regular GUP.
  * Note a difference with get_user_pages_fast: this always returns the
  * number of pages pinned, 0 if no pages were pinned.
+ *
+ * Careful, careful! COW breaking can go either way, so a non-write
+ * access can get ambiguous page results. If you call this function without
+ * 'write' set, you'd better be sure that you're ok with that ambiguity.
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
@@ -1831,6 +1850,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 *
 	 * We do not adopt an rcu_read_lock(.) here as we also want to
 	 * block IPIs that come from THPs splitting.
+	 *
+	 * NOTE! We allow read-only gup_fast() here, but you'd better be
+	 * careful about possible COW pages. You'll get _a_ COW page, but
+	 * not necessarily the one you intended to get depending on what
+	 * COW event happens after this. COW may break the page copy in a
+	 * random direction.
 	 */
 
 	if (gup_fast_permitted(start, nr_pages, write)) {
@@ -1876,9 +1901,16 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 					(void __user *)start, len)))
 		return -EFAULT;
 
+	/*
+	 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+	 * because get_user_pages() may need to cause an early COW in
+	 * order to avoid confusing the normal COW routines. So only
+	 * targets that are already writable are safe to do by just
+	 * looking at the page tables.
+	 */
 	if (gup_fast_permitted(start, nr_pages, write)) {
 		local_irq_disable();
-		gup_pgd_range(addr, end, write, pages, &nr);
+		gup_pgd_range(addr, end, 1, pages, &nr);
 		local_irq_enable();
 		ret = nr;
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf9e2bbffdc1..7c374c0fcf0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1432,13 +1432,12 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
-- 
2.31.1.368.gbe11c130af-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread


end of thread, other threads:[~2021-10-13 21:58 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-12  1:53 [PATCH 1/1] gup: document and work around "COW can break either way" issue Suren Baghdasaryan
  -- strict thread matches above, loose matches on Subject: below --
2021-10-12  1:52 Suren Baghdasaryan
2021-10-12  1:55 ` Suren Baghdasaryan
2021-10-12  5:45   ` Greg Kroah-Hartman
2021-10-12  8:06 ` Jan Kara
2021-10-12  8:14   ` Vlastimil Babka
2021-10-12  8:42     ` Greg KH
2021-10-12 18:57       ` Thadeu Lima de Souza Cascardo
2021-10-13  8:56         ` Greg KH
2021-10-12 10:06 ` Greg KH
2021-10-12 16:16   ` Suren Baghdasaryan
2021-10-13 21:58 ` John Hubbard
2021-04-21 22:57 Suren Baghdasaryan
2021-04-21 22:57 ` Suren Baghdasaryan
2021-04-21 22:56 Suren Baghdasaryan
2021-04-21 22:56 ` Suren Baghdasaryan
2021-04-23 15:05 ` Greg KH
