linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/4] mm/gup: some cleanups
@ 2022-02-03  9:32 John Hubbard
  2022-02-03  9:32 ` [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups John Hubbard
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: John Hubbard @ 2022-02-03  9:32 UTC (permalink / raw)
  To: Andrew Morton, Peter Xu, Jason Gunthorpe
  Cc: David Hildenbrand, Lukas Bulwahn, Jan Kara, Claudio Imbrenda,
	Kirill A . Shutemov, Alex Williamson, Andrea Arcangeli, LKML,
	linux-mm, John Hubbard

Hi Peter, Jason and all,

Changes since v2:

    * Patch 2: Removed an unnecessary line that was causing a
      clang-analyzer complaint, as reported by Lukas Bulwahn
      [1] (thanks!), and retested locally.

      Jason: I've boldly left your Reviewed-by tag on this patch,
      because I am predicting you'll agree with it...

    * Added Reviewed-by tags from Jan Kara, Christoph Hellwig, and
      Jason Gunthorpe that have collected since v2.


Changes since v1:
    * Patch 4: changed from get_user_pages() to get_user_pages_fast().

    * Patch 4: Rewrote the commit description--thanks to Jan Kara for
               that feedback.

    * Patch 1: Removed Jerome's Cc from patch 1, due to a stale email
               address.

    * Added Reviewed-by's from David Hildenbrand and Jason Gunthorpe.

Original cover letter, updated as necessary:

I'm including Peter's patch as the first one in this tiny series. (The
commit description has my r-b tag in place of my Cc, and removes
Jerome's Cc because he is no longer at redhat.com.) The second patch is
what I had in mind for a follow-up to that, when we were discussing that
fix [2].

Plus, a couple more small removals that I had queued up:

The third patch removes a completely unused routine:
pin_user_pages_locked().

The fourth patch removes a similar routine, get_user_pages_locked(),
which has only one caller. That caller now uses get_user_pages_fast()
instead.

v1 of this patchset is here:
https://lore.kernel.org/all/20220131051752.447699-1-jhubbard@nvidia.com/

v2 of this patchset is here:
https://lore.kernel.org/r/20220201101108.306062-1-jhubbard@nvidia.com

[1] https://lore.kernel.org/r/CAKXUXMxFK9bo8jDoRZbQ0r2j-JwAGg3Xc5cpAcLaHfwHddJ7ew@mail.gmail.com

[2] https://lore.kernel.org/all/20220125033700.69705-1-peterx@redhat.com/


thanks,
John Hubbard

John Hubbard (3):
  mm/gup: clean up follow_pfn_pte() slightly
  mm/gup: remove unused pin_user_pages_locked()
  mm/gup: remove get_user_pages_locked()

Peter Xu (1):
  mm: Fix invalid page pointer returned with FOLL_PIN gups

 include/linux/mm.h |  4 --
 mm/gup.c           | 99 +++-------------------------------------------
 mm/mempolicy.c     | 21 +++++-----
 3 files changed, 15 insertions(+), 109 deletions(-)


base-commit: 88808fbbead481aedb46640a5ace69c58287f56a
--
2.35.1




* [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups
  2022-02-03  9:32 [PATCH v3 0/4] mm/gup: some cleanups John Hubbard
@ 2022-02-03  9:32 ` John Hubbard
  2022-02-03 12:10   ` Claudio Imbrenda
  2022-02-03 14:00   ` Christoph Hellwig
  2022-02-03  9:32 ` [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly John Hubbard
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: John Hubbard @ 2022-02-03  9:32 UTC (permalink / raw)
  To: Andrew Morton, Peter Xu, Jason Gunthorpe
  Cc: David Hildenbrand, Lukas Bulwahn, Jan Kara, Claudio Imbrenda,
	Kirill A . Shutemov, Alex Williamson, Andrea Arcangeli, LKML,
	linux-mm, John Hubbard

From: Peter Xu <peterx@redhat.com>

Alex reported invalid page pointer returned with pin_user_pages_remote() from
vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched
pinning with struct vfio_batch").  This problem breaks NVIDIA vfio mdev.

It turns out that it's not the fault of the vfio commit; however, after vfio
switched to a full page buffer to store the page pointers, it started to
expose the problem more easily.

The problem is that for VM_PFNMAP vmas we should normally fail with an
-EFAULT, and then vfio will carry on to handle the MMIO regions.  However,
when the bug triggered, follow_page_mask() returned -EEXIST for such a page,
which caused GUP to jump over the current page, leaving that entry in **pages
untouched.  The caller is not aware of this, and hence will reference the
page as usual even though the pointer data can be anything.
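
To illustrate, here is a condensed sketch of the skip path (an
approximation of the __get_user_pages() loop, not the exact code):

	/* __get_user_pages() main loop, heavily condensed: */
	page = follow_page_mask(vma, start, foll_flags, &ctx);
	if (PTR_ERR(page) == -EEXIST)
		goto next_page;		/* pages[i] is never written */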

We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn
mapping unless FOLL_GET is requested") which seems very reasonable.  It could
be that when we reworked GUP with FOLL_PIN we could have overlooked that
special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if
that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when
it needs to return an -EEXIST.

Attaching the Fixes to the FOLL_PIN rework commit, as it happened later than
1027e4436b6a.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Debugged-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index f0af462ac1e2..65575ae3602f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, unsigned int flags)
 {
 	/* No page to get reference */
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return -EFAULT;
 
 	if (flags & FOLL_TOUCH) {
-- 
2.35.1



* [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly
  2022-02-03  9:32 [PATCH v3 0/4] mm/gup: some cleanups John Hubbard
  2022-02-03  9:32 ` [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups John Hubbard
@ 2022-02-03  9:32 ` John Hubbard
  2022-02-03 13:31   ` Claudio Imbrenda
  2022-02-03 13:53   ` Jan Kara
  2022-02-03  9:32 ` [PATCH v3 3/4] mm/gup: remove unused pin_user_pages_locked() John Hubbard
  2022-02-03  9:32 ` [PATCH v3 4/4] mm/gup: remove get_user_pages_locked() John Hubbard
  3 siblings, 2 replies; 19+ messages in thread
From: John Hubbard @ 2022-02-03  9:32 UTC (permalink / raw)
  To: Andrew Morton, Peter Xu, Jason Gunthorpe
  Cc: David Hildenbrand, Lukas Bulwahn, Jan Kara, Claudio Imbrenda,
	Kirill A . Shutemov, Alex Williamson, Andrea Arcangeli, LKML,
	linux-mm, John Hubbard, Jason Gunthorpe

Regardless of any FOLL_* flags, get_user_pages() and its variants should
handle PFN-only entries by stopping early, if the caller expected
**pages to be filled in.

This makes for a more reliable API, as compared to the previous approach
of skipping over such entries (and thus leaving them silently
unwritten).
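
As an illustration of the contract this preserves, a hypothetical caller
can now rely on every entry up to the returned count being valid (a
sketch only, with error handling elided):

	struct page *pages[16];
	int i, ret;

	ret = get_user_pages_fast(start, 16, FOLL_WRITE, pages);
	for (i = 0; i < ret; i++) {
		/* safe: entries [0, ret) are always filled in */
		put_page(pages[i]);
	}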

Cc: Peter Xu <peterx@redhat.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 65575ae3602f..cad3f28492e3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 		pte_t *pte, unsigned int flags)
 {
-	/* No page to get reference */
-	if (flags & (FOLL_GET | FOLL_PIN))
-		return -EFAULT;
-
 	if (flags & FOLL_TOUCH) {
 		pte_t entry = *pte;
 
@@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
 		} else if (PTR_ERR(page) == -EEXIST) {
 			/*
 			 * Proper page table entry exists, but no corresponding
-			 * struct page.
+			 * struct page. If the caller expects **pages to be
+			 * filled in, bail out now, because that can't be done
+			 * for this page.
 			 */
+			if (pages)
+				goto out;
+
 			goto next_page;
 		} else if (IS_ERR(page)) {
 			ret = PTR_ERR(page);
-- 
2.35.1



* [PATCH v3 3/4] mm/gup: remove unused pin_user_pages_locked()
  2022-02-03  9:32 [PATCH v3 0/4] mm/gup: some cleanups John Hubbard
  2022-02-03  9:32 ` [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups John Hubbard
  2022-02-03  9:32 ` [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly John Hubbard
@ 2022-02-03  9:32 ` John Hubbard
  2022-02-03 11:52   ` Claudio Imbrenda
  2022-02-03  9:32 ` [PATCH v3 4/4] mm/gup: remove get_user_pages_locked() John Hubbard
  3 siblings, 1 reply; 19+ messages in thread
From: John Hubbard @ 2022-02-03  9:32 UTC (permalink / raw)
  To: Andrew Morton, Peter Xu, Jason Gunthorpe
  Cc: David Hildenbrand, Lukas Bulwahn, Jan Kara, Claudio Imbrenda,
	Kirill A . Shutemov, Alex Williamson, Andrea Arcangeli, LKML,
	linux-mm, John Hubbard, Jason Gunthorpe, Christoph Hellwig

This routine was used for a short while, but then the calling code was
refactored and the only caller was removed.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  2 --
 mm/gup.c           | 29 -----------------------------
 2 files changed, 31 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b192..80c540c17d83 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1918,8 +1918,6 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    struct vm_area_struct **vmas);
 long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
-long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
-		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
diff --git a/mm/gup.c b/mm/gup.c
index cad3f28492e3..b0ecbfe03928 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3119,32 +3119,3 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 	return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages_unlocked);
-
-/*
- * pin_user_pages_locked() is the FOLL_PIN variant of get_user_pages_locked().
- * Behavior is the same, except that this one sets FOLL_PIN and rejects
- * FOLL_GET.
- */
-long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
-			   unsigned int gup_flags, struct page **pages,
-			   int *locked)
-{
-	/*
-	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas.  As there are no users of this flag in this call we simply
-	 * disallow this option for now.
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-		return -EINVAL;
-
-	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
-	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
-		return -EINVAL;
-
-	gup_flags |= FOLL_PIN;
-	return __get_user_pages_locked(current->mm, start, nr_pages,
-				       pages, NULL, locked,
-				       gup_flags | FOLL_TOUCH);
-}
-EXPORT_SYMBOL(pin_user_pages_locked);
-- 
2.35.1



* [PATCH v3 4/4] mm/gup: remove get_user_pages_locked()
  2022-02-03  9:32 [PATCH v3 0/4] mm/gup: some cleanups John Hubbard
                   ` (2 preceding siblings ...)
  2022-02-03  9:32 ` [PATCH v3 3/4] mm/gup: remove unused pin_user_pages_locked() John Hubbard
@ 2022-02-03  9:32 ` John Hubbard
  2022-02-03 12:04   ` Claudio Imbrenda
  2022-02-03 14:01   ` Christoph Hellwig
  3 siblings, 2 replies; 19+ messages in thread
From: John Hubbard @ 2022-02-03  9:32 UTC (permalink / raw)
  To: Andrew Morton, Peter Xu, Jason Gunthorpe
  Cc: David Hildenbrand, Lukas Bulwahn, Jan Kara, Claudio Imbrenda,
	Kirill A . Shutemov, Alex Williamson, Andrea Arcangeli, LKML,
	linux-mm, John Hubbard, Jason Gunthorpe

There is only one caller of get_user_pages_locked(). The purpose of
get_user_pages_locked() is to allow for unlocking the mmap_lock when
reading a page from the disk during a page fault (hidden behind
VM_FAULT_RETRY). The idea is to reduce contention on the heavily-used
mmap_lock. (Thanks to Jan Kara for clearly pointing that out, and in
fact I've used some of his wording here.)

However, it is unlikely for lookup_node() to take a page fault. With
that in mind, change over to calling get_user_pages_fast(). This
simplifies the code, runs a little faster in the expected case, and
allows removing get_user_pages_locked() entirely.
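
And for the unlikely case in which a fault is required after all, nothing
is lost: get_user_pages_fast() falls back internally to the slow path,
which takes and drops mmap_lock itself. Roughly (a sketch of the
internals, not the exact code):

	nr_pinned = lockless_pages_from_mm(start, end, gup_flags, pages);
	if (nr_pinned < nr_pages)
		/* slow path: takes mmap_read_lock() internally */
		ret = __gup_longterm_unlocked(start + (nr_pinned << PAGE_SHIFT),
					      nr_pages - nr_pinned, gup_flags,
					      pages + nr_pinned);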

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  2 --
 mm/gup.c           | 59 ----------------------------------------------
 mm/mempolicy.c     | 21 +++++++----------
 3 files changed, 9 insertions(+), 73 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80c540c17d83..528ef1cb4f3a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1916,8 +1916,6 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas);
-long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
-		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
diff --git a/mm/gup.c b/mm/gup.c
index b0ecbfe03928..7da49df59110 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2118,65 +2118,6 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(get_user_pages);
 
-/**
- * get_user_pages_locked() - variant of get_user_pages()
- *
- * @start:      starting user address
- * @nr_pages:   number of pages from start to pin
- * @gup_flags:  flags modifying lookup behaviour
- * @pages:      array that receives pointers to the pages pinned.
- *              Should be at least nr_pages long. Or NULL, if caller
- *              only intends to ensure the pages are faulted in.
- * @locked:     pointer to lock flag indicating whether lock is held and
- *              subsequently whether VM_FAULT_RETRY functionality can be
- *              utilised. Lock must initially be held.
- *
- * It is suitable to replace the form:
- *
- *      mmap_read_lock(mm);
- *      do_something()
- *      get_user_pages(mm, ..., pages, NULL);
- *      mmap_read_unlock(mm);
- *
- *  to:
- *
- *      int locked = 1;
- *      mmap_read_lock(mm);
- *      do_something()
- *      get_user_pages_locked(mm, ..., pages, &locked);
- *      if (locked)
- *          mmap_read_unlock(mm);
- *
- * We can leverage the VM_FAULT_RETRY functionality in the page fault
- * paths better by using either get_user_pages_locked() or
- * get_user_pages_unlocked().
- *
- */
-long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
-			   unsigned int gup_flags, struct page **pages,
-			   int *locked)
-{
-	/*
-	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas.  As there are no users of this flag in this call we simply
-	 * disallow this option for now.
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-		return -EINVAL;
-	/*
-	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
-	 * never directly by the caller, so enforce that:
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
-		return -EINVAL;
-
-	return __get_user_pages_locked(current->mm, start, nr_pages,
-				       pages, NULL, locked,
-				       gup_flags | FOLL_TOUCH);
-}
-EXPORT_SYMBOL(get_user_pages_locked);
-
 /*
  * get_user_pages_unlocked() is suitable to replace the form:
  *
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 028e8dd82b44..3f8dc58da3e8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -907,17 +907,14 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 static int lookup_node(struct mm_struct *mm, unsigned long addr)
 {
 	struct page *p = NULL;
-	int err;
+	int ret;
 
-	int locked = 1;
-	err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
-	if (err > 0) {
-		err = page_to_nid(p);
+	ret = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
+	if (ret > 0) {
+		ret = page_to_nid(p);
 		put_page(p);
 	}
-	if (locked)
-		mmap_read_unlock(mm);
-	return err;
+	return ret;
 }
 
 /* Retrieve NUMA policy */
@@ -968,14 +965,14 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
 			/*
-			 * Take a refcount on the mpol, lookup_node()
-			 * will drop the mmap_lock, so after calling
-			 * lookup_node() only "pol" remains valid, "vma"
-			 * is stale.
+			 * Take a refcount on the mpol, because we are about to
+			 * drop the mmap_lock, after which only "pol" remains
+			 * valid, "vma" is stale.
 			 */
 			pol_refcount = pol;
 			vma = NULL;
 			mpol_get(pol);
+			mmap_read_unlock(mm);
 			err = lookup_node(mm, addr);
 			if (err < 0)
 				goto out;
-- 
2.35.1



* Re: [PATCH v3 3/4] mm/gup: remove unused pin_user_pages_locked()
  2022-02-03  9:32 ` [PATCH v3 3/4] mm/gup: remove unused pin_user_pages_locked() John Hubbard
@ 2022-02-03 11:52   ` Claudio Imbrenda
  0 siblings, 0 replies; 19+ messages in thread
From: Claudio Imbrenda @ 2022-02-03 11:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Kirill A . Shutemov, Alex Williamson,
	Andrea Arcangeli, LKML, linux-mm, Jason Gunthorpe,
	Christoph Hellwig

On Thu, 3 Feb 2022 01:32:31 -0800
John Hubbard <jhubbard@nvidia.com> wrote:

> This routine was used for a short while, but then the calling code was
> refactored and the only caller was removed.
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

if it's not used anymore, good riddance

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>

> ---
>  include/linux/mm.h |  2 --
>  mm/gup.c           | 29 -----------------------------
>  2 files changed, 31 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 213cc569b192..80c540c17d83 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1918,8 +1918,6 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  		    struct vm_area_struct **vmas);
>  long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages, int *locked);
> -long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
> -		    unsigned int gup_flags, struct page **pages, int *locked);
>  long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  		    struct page **pages, unsigned int gup_flags);
>  long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> diff --git a/mm/gup.c b/mm/gup.c
> index cad3f28492e3..b0ecbfe03928 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3119,32 +3119,3 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  	return get_user_pages_unlocked(start, nr_pages, pages, gup_flags);
>  }
>  EXPORT_SYMBOL(pin_user_pages_unlocked);
> -
> -/*
> - * pin_user_pages_locked() is the FOLL_PIN variant of get_user_pages_locked().
> - * Behavior is the same, except that this one sets FOLL_PIN and rejects
> - * FOLL_GET.
> - */
> -long pin_user_pages_locked(unsigned long start, unsigned long nr_pages,
> -			   unsigned int gup_flags, struct page **pages,
> -			   int *locked)
> -{
> -	/*
> -	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
> -	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> -	 * vmas.  As there are no users of this flag in this call we simply
> -	 * disallow this option for now.
> -	 */
> -	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> -		return -EINVAL;
> -
> -	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> -	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> -		return -EINVAL;
> -
> -	gup_flags |= FOLL_PIN;
> -	return __get_user_pages_locked(current->mm, start, nr_pages,
> -				       pages, NULL, locked,
> -				       gup_flags | FOLL_TOUCH);
> -}
> -EXPORT_SYMBOL(pin_user_pages_locked);



* Re: [PATCH v3 4/4] mm/gup: remove get_user_pages_locked()
  2022-02-03  9:32 ` [PATCH v3 4/4] mm/gup: remove get_user_pages_locked() John Hubbard
@ 2022-02-03 12:04   ` Claudio Imbrenda
  2022-02-03 14:01   ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Claudio Imbrenda @ 2022-02-03 12:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Kirill A . Shutemov, Alex Williamson,
	Andrea Arcangeli, LKML, linux-mm, Jason Gunthorpe

On Thu, 3 Feb 2022 01:32:32 -0800
John Hubbard <jhubbard@nvidia.com> wrote:

> There is only one caller of get_user_pages_locked(). The purpose of
> get_user_pages_locked() is to allow for unlocking the mmap_lock when
> reading a page from the disk during a page fault (hidden behind
> VM_FAULT_RETRY). The idea is to reduce contention on the heavily-used
> mmap_lock. (Thanks to Jan Kara for clearly pointing that out, and in
> fact I've used some of his wording here.)
> 
> However, it is unlikely for lookup_node() to take a page fault. With
> that in mind, change over to calling get_user_pages_fast(). This
> simplifies the code, runs a little faster in the expected case, and
> allows removing get_user_pages_locked() entirely.
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

I have always disliked these functions that might or might not unlock
the lock under the hood. I'm happy to see one more go.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>

> ---
>  include/linux/mm.h |  2 --
>  mm/gup.c           | 59 ----------------------------------------------
>  mm/mempolicy.c     | 21 +++++++----------
>  3 files changed, 9 insertions(+), 73 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80c540c17d83..528ef1cb4f3a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1916,8 +1916,6 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
>  long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages,
>  		    struct vm_area_struct **vmas);
> -long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> -		    unsigned int gup_flags, struct page **pages, int *locked);
>  long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  		    struct page **pages, unsigned int gup_flags);
>  long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> diff --git a/mm/gup.c b/mm/gup.c
> index b0ecbfe03928..7da49df59110 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2118,65 +2118,6 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
>  }
>  EXPORT_SYMBOL(get_user_pages);
>  
> -/**
> - * get_user_pages_locked() - variant of get_user_pages()
> - *
> - * @start:      starting user address
> - * @nr_pages:   number of pages from start to pin
> - * @gup_flags:  flags modifying lookup behaviour
> - * @pages:      array that receives pointers to the pages pinned.
> - *              Should be at least nr_pages long. Or NULL, if caller
> - *              only intends to ensure the pages are faulted in.
> - * @locked:     pointer to lock flag indicating whether lock is held and
> - *              subsequently whether VM_FAULT_RETRY functionality can be
> - *              utilised. Lock must initially be held.
> - *
> - * It is suitable to replace the form:
> - *
> - *      mmap_read_lock(mm);
> - *      do_something()
> - *      get_user_pages(mm, ..., pages, NULL);
> - *      mmap_read_unlock(mm);
> - *
> - *  to:
> - *
> - *      int locked = 1;
> - *      mmap_read_lock(mm);
> - *      do_something()
> - *      get_user_pages_locked(mm, ..., pages, &locked);
> - *      if (locked)
> - *          mmap_read_unlock(mm);
> - *
> - * We can leverage the VM_FAULT_RETRY functionality in the page fault
> - * paths better by using either get_user_pages_locked() or
> - * get_user_pages_unlocked().
> - *
> - */
> -long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> -			   unsigned int gup_flags, struct page **pages,
> -			   int *locked)
> -{
> -	/*
> -	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
> -	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> -	 * vmas.  As there are no users of this flag in this call we simply
> -	 * disallow this option for now.
> -	 */
> -	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> -		return -EINVAL;
> -	/*
> -	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
> -	 * never directly by the caller, so enforce that:
> -	 */
> -	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> -		return -EINVAL;
> -
> -	return __get_user_pages_locked(current->mm, start, nr_pages,
> -				       pages, NULL, locked,
> -				       gup_flags | FOLL_TOUCH);
> -}
> -EXPORT_SYMBOL(get_user_pages_locked);
> -
>  /*
>   * get_user_pages_unlocked() is suitable to replace the form:
>   *
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 028e8dd82b44..3f8dc58da3e8 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -907,17 +907,14 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
>  static int lookup_node(struct mm_struct *mm, unsigned long addr)
>  {
>  	struct page *p = NULL;
> -	int err;
> +	int ret;
>  
> -	int locked = 1;
> -	err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
> -	if (err > 0) {
> -		err = page_to_nid(p);
> +	ret = get_user_pages_fast(addr & PAGE_MASK, 1, 0, &p);
> +	if (ret > 0) {
> +		ret = page_to_nid(p);
>  		put_page(p);
>  	}
> -	if (locked)
> -		mmap_read_unlock(mm);
> -	return err;
> +	return ret;
>  }
>  
>  /* Retrieve NUMA policy */
> @@ -968,14 +965,14 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  	if (flags & MPOL_F_NODE) {
>  		if (flags & MPOL_F_ADDR) {
>  			/*
> -			 * Take a refcount on the mpol, lookup_node()
> -			 * will drop the mmap_lock, so after calling
> -			 * lookup_node() only "pol" remains valid, "vma"
> -			 * is stale.
> +			 * Take a refcount on the mpol, because we are about to
> +			 * drop the mmap_lock, after which only "pol" remains
> +			 * valid, "vma" is stale.
>  			 */
>  			pol_refcount = pol;
>  			vma = NULL;
>  			mpol_get(pol);
> +			mmap_read_unlock(mm);
>  			err = lookup_node(mm, addr);
>  			if (err < 0)
>  				goto out;



* Re: [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups
  2022-02-03  9:32 ` [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups John Hubbard
@ 2022-02-03 12:10   ` Claudio Imbrenda
  2022-02-03 21:25     ` John Hubbard
  2022-02-03 14:00   ` Christoph Hellwig
  1 sibling, 1 reply; 19+ messages in thread
From: Claudio Imbrenda @ 2022-02-03 12:10 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Kirill A . Shutemov, Alex Williamson,
	Andrea Arcangeli, LKML, linux-mm

On Thu, 3 Feb 2022 01:32:29 -0800
John Hubbard <jhubbard@nvidia.com> wrote:

> From: Peter Xu <peterx@redhat.com>
> 
> Alex reported invalid page pointer returned with pin_user_pages_remote() from
> vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched
> pinning with struct vfio_batch").  This problem breaks NVIDIA vfio mdev.
> 
> It turns out that it's not the fault of the vfio commit; however, after vfio
> switched to a full page buffer to store the page pointers, it started to
> expose the problem more easily.
> 
> The problem is that for VM_PFNMAP vmas we should normally fail with an
> -EFAULT, and then vfio will carry on to handle the MMIO regions.  However,
> when the bug triggered, follow_page_mask() returned -EEXIST for such a page,
> which caused GUP to jump over the current page, leaving that entry in **pages
> untouched.  The caller is not aware of this, and hence will reference the
> page as usual even though the pointer data can be anything.
> 
> We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn
> mapping unless FOLL_GET is requested") which seems very reasonable.  It could
> be that when we reworked GUP with FOLL_PIN we could have overlooked that
> special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if
> that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when
> it needs to return an -EEXIST.
> 
> Attaching the Fixes to the FOLL_PIN rework commit, as it happened later than
> 1027e4436b6a.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Debugged-by: Alex Williamson <alex.williamson@redhat.com>
> Tested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

you can add 

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>

although maybe this would look better if it were squashed into the next
patch, as others have also suggested

> ---
>  mm/gup.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index f0af462ac1e2..65575ae3602f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  		pte_t *pte, unsigned int flags)
>  {
>  	/* No page to get reference */
> -	if (flags & FOLL_GET)
> +	if (flags & (FOLL_GET | FOLL_PIN))
>  		return -EFAULT;
>  
>  	if (flags & FOLL_TOUCH) {



* Re: [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly
  2022-02-03  9:32 ` [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly John Hubbard
@ 2022-02-03 13:31   ` Claudio Imbrenda
  2022-02-03 20:53     ` John Hubbard
  2022-02-03 13:53   ` Jan Kara
  1 sibling, 1 reply; 19+ messages in thread
From: Claudio Imbrenda @ 2022-02-03 13:31 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Kirill A . Shutemov, Alex Williamson,
	Andrea Arcangeli, LKML, linux-mm, Jason Gunthorpe

On Thu, 3 Feb 2022 01:32:30 -0800
John Hubbard <jhubbard@nvidia.com> wrote:

> Regardless of any FOLL_* flags, get_user_pages() and its variants should
> handle PFN-only entries by stopping early, if the caller expected
> **pages to be filled in.
> 
> This makes for a more reliable API, as compared to the previous approach
> of skipping over such entries (and thus leaving them silently
> unwritten).
> 
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 65575ae3602f..cad3f28492e3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
>  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  		pte_t *pte, unsigned int flags)
>  {
> -	/* No page to get reference */
> -	if (flags & (FOLL_GET | FOLL_PIN))
> -		return -EFAULT;
> -
>  	if (flags & FOLL_TOUCH) {
>  		pte_t entry = *pte;
>  
> @@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
>  		} else if (PTR_ERR(page) == -EEXIST) {
>  			/*
>  			 * Proper page table entry exists, but no corresponding
> -			 * struct page.
> +			 * struct page. If the caller expects **pages to be
> +			 * filled in, bail out now, because that can't be done
> +			 * for this page.
>  			 */
> +			if (pages)
> +				goto out;
> +
>  			goto next_page;
>  		} else if (IS_ERR(page)) {
>  			ret = PTR_ERR(page);

I'm not an expert, can you explain why this is better, and why it does
not cause new issues?

If I understand correctly, the problem you are trying to solve is that
in some cases you might try to get n pages, but you only get m < n
pages instead, because some don't have an associated struct page, and
the missing pages might even be in the middle.

The `pages` array would contain the list of pages actually pinned
(getted?), but this won't tell which of the requested pages have been
pinned (e.g. if some pages in the middle of the run were skipped)

With your patch you will stop at the first page without a struct page,
meaning that if the caller tries again, it will get 0 pages. Why won't
this cause issues?

Why will this not cause problems when the `pages` parameter is NULL?


sorry for the dumb questions, but this seems a rather important change,
and I think in these circumstances you can't have too much
documentation.



* Re: [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly
  2022-02-03  9:32 ` [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly John Hubbard
  2022-02-03 13:31   ` Claudio Imbrenda
@ 2022-02-03 13:53   ` Jan Kara
  2022-02-03 15:01     ` Jason Gunthorpe
  1 sibling, 1 reply; 19+ messages in thread
From: Jan Kara @ 2022-02-03 13:53 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Claudio Imbrenda, Kirill A . Shutemov,
	Alex Williamson, Andrea Arcangeli, LKML, linux-mm,
	Jason Gunthorpe

On Thu 03-02-22 01:32:30, John Hubbard wrote:
> Regardless of any FOLL_* flags, get_user_pages() and its variants should
> handle PFN-only entries by stopping early, if the caller expected
> **pages to be filled in.
> 
> This makes for a more reliable API, as compared to the previous approach
> of skipping over such entries (and thus leaving them silently
> unwritten).
> 
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 11 ++++++-----
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 65575ae3602f..cad3f28492e3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
>  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>  		pte_t *pte, unsigned int flags)
>  {
> -	/* No page to get reference */
> -	if (flags & (FOLL_GET | FOLL_PIN))
> -		return -EFAULT;
> -
>  	if (flags & FOLL_TOUCH) {
>  		pte_t entry = *pte;
>  

This will also modify the error code returned from follow_page(). A quick
audit shows that at least the user in mm/migrate.c will propagate this
error code to userspace and I'm not sure the change in error code will not
break something... EEXIST is a bit strange error code to get from
move_pages(2).
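
(For reference, the propagation path is roughly the following, condensed
from do_pages_move() in mm/migrate.c; a sketch from memory, not the exact
code:)

	err = add_page_for_migration(mm, addr, current_node, &pagelist,
				     flags & MPOL_MF_MOVE_ALL);
	/* a follow_page() error such as -EEXIST ends up here: */
	err = store_status(status, i, err, 1);	/* visible to userspace */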

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups
  2022-02-03  9:32 ` [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups John Hubbard
  2022-02-03 12:10   ` Claudio Imbrenda
@ 2022-02-03 14:00   ` Christoph Hellwig
  2022-02-03 21:13     ` John Hubbard
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2022-02-03 14:00 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Claudio Imbrenda, Kirill A . Shutemov,
	Alex Williamson, Andrea Arcangeli, LKML, linux-mm

On Thu, Feb 03, 2022 at 01:32:29AM -0800, John Hubbard wrote:
> From: Peter Xu <peterx@redhat.com>
> 
> Alex reported invalid page pointer returned with pin_user_pages_remote() from
> vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched
> pinning with struct vfio_batch").  This problem breaks NVIDIA vfio mdev.

There still isn't any NVIDIA vfio mdev driver in the tree, so this
changelog still doesn't make sense.


* Re: [PATCH v3 4/4] mm/gup: remove get_user_pages_locked()
  2022-02-03  9:32 ` [PATCH v3 4/4] mm/gup: remove get_user_pages_locked() John Hubbard
  2022-02-03 12:04   ` Claudio Imbrenda
@ 2022-02-03 14:01   ` Christoph Hellwig
  2022-02-03 21:27     ` John Hubbard
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2022-02-03 14:01 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Claudio Imbrenda, Kirill A . Shutemov,
	Alex Williamson, Andrea Arcangeli, LKML, linux-mm,
	Jason Gunthorpe

On Thu, Feb 03, 2022 at 01:32:32AM -0800, John Hubbard wrote:
> There is only one caller of get_user_pages_locked(). The purpose of
> get_user_pages_locked() is to allow for unlocking the mmap_lock when
> reading a page from the disk during a page fault (hidden behind
> VM_FAULT_RETRY). The idea is to reduce contention on the heavily-used
> mmap_lock. (Thanks to Jan Kara for clearly pointing that out, and in
> fact I've used some of his wording here.)
> 
> However, it is unlikely for lookup_node() to take a page fault. With
> that in mind, change over to calling get_user_pages_fast(). This
> simplifies the code, runs a little faster in the expected case, and
> allows removing get_user_pages_locked() entirely.

Maybe split the lookup_node changes into a separate patch, as that
allows to document that change even better.


* Re: [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly
  2022-02-03 13:53   ` Jan Kara
@ 2022-02-03 15:01     ` Jason Gunthorpe
  2022-02-03 15:18       ` Matthew Wilcox
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2022-02-03 15:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Andrew Morton, Peter Xu, David Hildenbrand,
	Lukas Bulwahn, Claudio Imbrenda, Kirill A . Shutemov,
	Alex Williamson, Andrea Arcangeli, LKML, linux-mm

On Thu, Feb 03, 2022 at 02:53:52PM +0100, Jan Kara wrote:
> On Thu 03-02-22 01:32:30, John Hubbard wrote:
> > Regardless of any FOLL_* flags, get_user_pages() and its variants should
> > handle PFN-only entries by stopping early, if the caller expected
> > **pages to be filled in.
> > 
> > This makes for a more reliable API, as compared to the previous approach
> > of skipping over such entries (and thus leaving them silently
> > unwritten).
> > 
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> >  mm/gup.c | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 65575ae3602f..cad3f28492e3 100644
> > +++ b/mm/gup.c
> > @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
> >  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> >  		pte_t *pte, unsigned int flags)
> >  {
> > -	/* No page to get reference */
> > -	if (flags & (FOLL_GET | FOLL_PIN))
> > -		return -EFAULT;
> > -
> >  	if (flags & FOLL_TOUCH) {
> >  		pte_t entry = *pte;
> >  
> 
> This will also modify the error code returned from follow_page(). 

Er, but isn't that the whole point of this entire design? It is what
the commit that added it says:

commit 1027e4436b6a5c413c95d95e50d0f26348a602ac
Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Date:   Fri Sep 4 15:47:55 2015 -0700

    mm: make GUP handle pfn mapping unless FOLL_GET is requested
    
    With DAX, pfn mapping becoming more common.  The patch adjusts GUP code to
    cover pfn mapping for cases when we don't need struct page to proceed.
    
    To make it possible, let's change follow_page() code to return -EEXIST
    error code if proper page table entry exists, but no corresponding struct
    page.  __get_user_page() would ignore the error code and move to the next
    page frame.
    
    The immediate effect of the change is working MAP_POPULATE and mlock() on
    DAX mappings.

> A quick audit shows that at least the user in mm/migrate.c will
> propagate this error code to userspace and I'm not sure the change
> in error code will not break something... EEXIST is a bit strange
> error code to get from move_pages(2).

That makes sense, maybe move_pages() should squash the -EEXIST return
code to something else?

Jason


* Re: [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly
  2022-02-03 15:01     ` Jason Gunthorpe
@ 2022-02-03 15:18       ` Matthew Wilcox
  2022-02-03 21:19         ` John Hubbard
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2022-02-03 15:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, John Hubbard, Andrew Morton, Peter Xu,
	David Hildenbrand, Lukas Bulwahn, Claudio Imbrenda,
	Kirill A . Shutemov, Alex Williamson, Andrea Arcangeli, LKML,
	linux-mm

On Thu, Feb 03, 2022 at 11:01:23AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 03, 2022 at 02:53:52PM +0100, Jan Kara wrote:
> > On Thu 03-02-22 01:32:30, John Hubbard wrote:
> > > Regardless of any FOLL_* flags, get_user_pages() and its variants should
> > > handle PFN-only entries by stopping early, if the caller expected
> > > **pages to be filled in.
> > > 
> > > This makes for a more reliable API, as compared to the previous approach
> > > of skipping over such entries (and thus leaving them silently
> > > unwritten).
> > > 
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
> > > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> > >  mm/gup.c | 11 ++++++-----
> > >  1 file changed, 6 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 65575ae3602f..cad3f28492e3 100644
> > > +++ b/mm/gup.c
> > > @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
> > >  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > >  		pte_t *pte, unsigned int flags)
> > >  {
> > > -	/* No page to get reference */
> > > -	if (flags & (FOLL_GET | FOLL_PIN))
> > > -		return -EFAULT;
> > > -
> > >  	if (flags & FOLL_TOUCH) {
> > >  		pte_t entry = *pte;
> > >  
> > 
> > This will also modify the error code returned from follow_page(). 
> 
> Er, but isn't that the whole point of this entire design? It is what
> the commit that added it says:
> 
> commit 1027e4436b6a5c413c95d95e50d0f26348a602ac
> Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Date:   Fri Sep 4 15:47:55 2015 -0700
> 
>     mm: make GUP handle pfn mapping unless FOLL_GET is requested
>     
>     With DAX, pfn mapping becoming more common.  The patch adjusts GUP code to
>     cover pfn mapping for cases when we don't need struct page to proceed.
>     
>     To make it possible, let's change follow_page() code to return -EEXIST
>     error code if proper page table entry exists, but no corresponding struct
>     page.  __get_user_page() would ignore the error code and move to the next
>     page frame.
>     
>     The immediate effect of the change is working MAP_POPULATE and mlock() on
>     DAX mappings.
> 
> > A quick audit shows that at least the user in mm/migrate.c will
> > propagate this error code to userspace and I'm not sure the change
> > in error code will not break something... EEXIST is a bit strange
> > error code to get from move_pages(2).
> 
> That makes sense, maybe move_pages() should squash the -EEXIST return
> code to something else?

I think EFAULT is the closest:
              This  is  a  zero  page  or the memory area is not mapped by the
              process.

EBUSY implies it can be tried again later.



* Re: [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly
  2022-02-03 13:31   ` Claudio Imbrenda
@ 2022-02-03 20:53     ` John Hubbard
  0 siblings, 0 replies; 19+ messages in thread
From: John Hubbard @ 2022-02-03 20:53 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Kirill A . Shutemov, Alex Williamson,
	Andrea Arcangeli, LKML, linux-mm, Jason Gunthorpe

On 2/3/22 05:31, Claudio Imbrenda wrote:
...
>> @@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
>>   		} else if (PTR_ERR(page) == -EEXIST) {
>>   			/*
>>   			 * Proper page table entry exists, but no corresponding
>> -			 * struct page.
>> +			 * struct page. If the caller expects **pages to be
>> +			 * filled in, bail out now, because that can't be done
>> +			 * for this page.
>>   			 */
>> +			if (pages)
>> +				goto out;
>> +
>>   			goto next_page;
>>   		} else if (IS_ERR(page)) {
>>   			ret = PTR_ERR(page);
> 
> I'm not an expert, can you explain why this is better, and why it does
> not cause new issues?
> 
> If I understand correctly, the problem you are trying to solve is that
> in some cases you might try to get n pages, but you only get m < n
> pages instead, because some don't have an associated struct page, and
> the missing pages might even be in the middle.
> 
> The `pages` array would contain the list of pages actually pinned
> (getted?), but this won't tell which of the requested pages have been
> pinned (e.g. if some pages in the middle of the run were skipped)
> 

The get_user_pages() API doesn't leave holes in the middle, ever.
Instead, it stops at the first error, and reports the number of pages
that were successfully pinned. And the caller is responsible for
unpinning.

From __get_user_pages()'s kerneldoc documentation:

  * Returns either number of pages pinned (which may be less than the
  * number requested), or an error. Details about the return value:
  *
  * -- If nr_pages is 0, returns 0.
  * -- If nr_pages is >0, but no pages were pinned, returns -errno.
  * -- If nr_pages is >0, and some pages were pinned, returns the number of
  *    pages pinned. Again, this may be less than nr_pages.
  * -- 0 return value is possible when the fault would need to be retried.
  *
  * The caller is responsible for releasing returned @pages, via put_page().

So the **pages array doesn't have holes, and the caller just counts up
from the beginning of **pages and stops at the returned count (which may
be less than nr_pages).
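
A hypothetical caller handling a partial pin looks something like this
(again, just a sketch):

	ret = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);
	if (ret < 0)
		return ret;		/* nothing was pinned */
	if (ret < nr_pages) {
		/* release the partial run and report failure */
		unpin_user_pages(pages, ret);
		return -EFAULT;
	}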


> With your patch you will stop at the first page without a struct page,
> meaning that if the caller tries again, it will get 0 pages. Why won't
> this cause issues?

Callers are already written to deal with this case.

> 
> Why will this not cause problems when the `pages` parameter is NULL?

The behavior is unchanged here if pages == NULL. But maybe you meant:
if pages != NULL. And in that case, the new behavior is to stop early
and return m < n, which is (I am claiming) better than just leaving
garbage values in **pages.

Another approach would be to fill in PTR_ERR(page) values, but GUP is
a well-established and widely used API, and that would be a large
change that would require changing a lot of caller code.

> 
> 
> sorry for the dumb questions, but this seems a rather important change,
> and I think in these circumstances you can't have too much
> documentation.
> 

Thanks for reviewing this!


thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups
  2022-02-03 14:00   ` Christoph Hellwig
@ 2022-02-03 21:13     ` John Hubbard
  0 siblings, 0 replies; 19+ messages in thread
From: John Hubbard @ 2022-02-03 21:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Claudio Imbrenda, Kirill A . Shutemov,
	Alex Williamson, Andrea Arcangeli, LKML, linux-mm

On 2/3/22 06:00, Christoph Hellwig wrote:
> On Thu, Feb 03, 2022 at 01:32:29AM -0800, John Hubbard wrote:
>> From: Peter Xu <peterx@redhat.com>
>>
>> Alex reported invalid page pointer returned with pin_user_pages_remote() from
>> vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched
>> pinning with struct vfio_batch").  This problem breaks NVIDIA vfio mdev.
> 
> There still isn't any NVIDIA vfio mdev driver in the tree, so this
> changelog still doesn't make sense.

I'll remove that last sentence (and put in a tiny note that I'm scribbling
on Peter's commit description a little bit more), in order to avoid any
references to out-of-tree things.

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly
  2022-02-03 15:18       ` Matthew Wilcox
@ 2022-02-03 21:19         ` John Hubbard
  0 siblings, 0 replies; 19+ messages in thread
From: John Hubbard @ 2022-02-03 21:19 UTC (permalink / raw)
  To: Matthew Wilcox, Jason Gunthorpe
  Cc: Jan Kara, Andrew Morton, Peter Xu, David Hildenbrand,
	Lukas Bulwahn, Claudio Imbrenda, Kirill A . Shutemov,
	Alex Williamson, Andrea Arcangeli, LKML, linux-mm

On 2/3/22 07:18, Matthew Wilcox wrote:
...
>>> This will also modify the error code returned from follow_page().
>>
>> Er, but isn't that the whole point of this entire design? It is what
>> the commit that added it says:
>>
>> commit 1027e4436b6a5c413c95d95e50d0f26348a602ac
>> Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Date:   Fri Sep 4 15:47:55 2015 -0700
>>
>>      mm: make GUP handle pfn mapping unless FOLL_GET is requested
>>      
>>      With DAX, pfn mapping becoming more common.  The patch adjusts GUP code to
>>      cover pfn mapping for cases when we don't need struct page to proceed.
>>      
>>      To make it possible, let's change follow_page() code to return -EEXIST
>>      error code if proper page table entry exists, but no corresponding struct
>>      page.  __get_user_page() would ignore the error code and move to the next
>>      page frame.
>>      
>>      The immediate effect of the change is working MAP_POPULATE and mlock() on
>>      DAX mappings.
>>
>>> A quick audit shows that at least the user in mm/migrate.c will
>>> propagate this error code to userspace and I'm not sure the change
>>> in error code will not break something... EEXIST is a bit strange
>>> error code to get from move_pages(2).
>>
>> That makes sense, maybe move_pages() should squash the -EEXIST return
>> code to something else?
> 
> I think EFAULT is the closest:
>                This  is  a  zero  page  or the memory area is not mapped by the
>                process.
> 
> EBUSY implies it can be tried again later.
> 

OK. I definitely need to rework the commit description now, but the diffs are
looking like this:

diff --git a/mm/gup.c b/mm/gup.c
index 65575ae3602f..cad3f28492e3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
  static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
  		pte_t *pte, unsigned int flags)
  {
-	/* No page to get reference */
-	if (flags & (FOLL_GET | FOLL_PIN))
-		return -EFAULT;
-
  	if (flags & FOLL_TOUCH) {
  		pte_t entry = *pte;

@@ -1180,8 +1176,13 @@ static long __get_user_pages(struct mm_struct *mm,
  		} else if (PTR_ERR(page) == -EEXIST) {
  			/*
  			 * Proper page table entry exists, but no corresponding
-			 * struct page.
+			 * struct page. If the caller expects **pages to be
+			 * filled in, bail out now, because that can't be done
+			 * for this page.
  			 */
+			if (pages)
+				goto out;
+
  			goto next_page;
  		} else if (IS_ERR(page)) {
  			ret = PTR_ERR(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..be0d5ae36dc1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1761,6 +1761,13 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
  			continue;
  		}

+		/*
+		 * The move_pages() man page does not have an -EEXIST choice, so
+		 * use -EFAULT instead.
+		 */
+		if (err == -EEXIST)
+			err = -EFAULT;
+
  		/*
  		 * If the page is already on the target node (!err), store the
  		 * node, otherwise, store the err.

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups
  2022-02-03 12:10   ` Claudio Imbrenda
@ 2022-02-03 21:25     ` John Hubbard
  0 siblings, 0 replies; 19+ messages in thread
From: John Hubbard @ 2022-02-03 21:25 UTC (permalink / raw)
  To: Claudio Imbrenda
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Kirill A . Shutemov, Alex Williamson,
	Andrea Arcangeli, LKML, linux-mm

On 2/3/22 04:10, Claudio Imbrenda wrote:
> On Thu, 3 Feb 2022 01:32:29 -0800
> John Hubbard <jhubbard@nvidia.com> wrote:
> 
>> From: Peter Xu <peterx@redhat.com>
>>
>> Alex reported invalid page pointer returned with pin_user_pages_remote() from
>> vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched
>> pinning with struct vfio_batch").  This problem breaks NVIDIA vfio mdev.
>>
>> It turns out that it's not the fault of the vfio commit; however, after vfio
>> switched to a full page buffer to store the page pointers, it started to
>> expose the problem more easily.
>>
>> The problem is that for VM_PFNMAP vmas we should normally fail with an
>> -EFAULT, and then vfio will carry on to handle the MMIO regions.  However,
>> when the bug triggered, follow_page_mask() returned -EEXIST for such a page,
>> which caused GUP to jump over the current page, leaving that entry in **pages
>> untouched.  The caller is not aware of this, and hence will reference the
>> page as usual even though the pointer data can be anything.
>>
>> We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn
>> mapping unless FOLL_GET is requested") which seems very reasonable.  It could
>> be that when we reworked GUP with FOLL_PIN we could have overlooked that
>> special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if
>> that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when
>> it needs to return an -EEXIST.
>>
>> Attaching the Fixes to the FOLL_PIN rework commit, as it happened later than
>> 1027e4436b6a.
>>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
>> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
>> Reported-by: Alex Williamson <alex.williamson@redhat.com>
>> Debugged-by: Alex Williamson <alex.williamson@redhat.com>
>> Tested-by: Alex Williamson <alex.williamson@redhat.com>
>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> 
> you can add
> 
> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
> 

Thanks!

> although maybe this would look better if it were squashed into the next
> patch, as others have also suggested
> 

I was thinking about that. It seems like this patch here cleanly addresses
an oversight, and it is tiny and reasonably suitable for backporting.

Patch 2, on the other hand, is less of a fix, and more of a "let's improve
things". And it is now expanding to cover move_pages() too. So maybe it
is better to leave them separate, after all.


thanks,
-- 
John Hubbard
NVIDIA

>> ---
>>   mm/gup.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index f0af462ac1e2..65575ae3602f 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>>   		pte_t *pte, unsigned int flags)
>>   {
>>   	/* No page to get reference */
>> -	if (flags & FOLL_GET)
>> +	if (flags & (FOLL_GET | FOLL_PIN))
>>   		return -EFAULT;
>>   
>>   	if (flags & FOLL_TOUCH) {
> 
> 



* Re: [PATCH v3 4/4] mm/gup: remove get_user_pages_locked()
  2022-02-03 14:01   ` Christoph Hellwig
@ 2022-02-03 21:27     ` John Hubbard
  0 siblings, 0 replies; 19+ messages in thread
From: John Hubbard @ 2022-02-03 21:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Peter Xu, Jason Gunthorpe, David Hildenbrand,
	Lukas Bulwahn, Jan Kara, Claudio Imbrenda, Kirill A . Shutemov,
	Alex Williamson, Andrea Arcangeli, LKML, linux-mm,
	Jason Gunthorpe

On 2/3/22 06:01, Christoph Hellwig wrote:
> On Thu, Feb 03, 2022 at 01:32:32AM -0800, John Hubbard wrote:
>> There is only one caller of get_user_pages_locked(). The purpose of
>> get_user_pages_locked() is to allow for unlocking the mmap_lock when
>> reading a page from the disk during a page fault (hidden behind
>> VM_FAULT_RETRY). The idea is to reduce contention on the heavily-used
>> mmap_lock. (Thanks to Jan Kara for clearly pointing that out, and in
>> fact I've used some of his wording here.)
>>
>> However, it is unlikely for lookup_node() to take a page fault. With
>> that in mind, change over to calling get_user_pages_fast(). This
>> simplifies the code, runs a little faster in the expected case, and
>> allows removing get_user_pages_locked() entirely.
> 
> Maybe split the lookup_node changes into a separate patch, as that
> allows to document that change even better.

OK, I'll do that.


thanks,
-- 
John Hubbard
NVIDIA


Thread overview: 19+ messages
2022-02-03  9:32 [PATCH v3 0/4] mm/gup: some cleanups John Hubbard
2022-02-03  9:32 ` [PATCH v3 1/4] mm: Fix invalid page pointer returned with FOLL_PIN gups John Hubbard
2022-02-03 12:10   ` Claudio Imbrenda
2022-02-03 21:25     ` John Hubbard
2022-02-03 14:00   ` Christoph Hellwig
2022-02-03 21:13     ` John Hubbard
2022-02-03  9:32 ` [PATCH v3 2/4] mm/gup: clean up follow_pfn_pte() slightly John Hubbard
2022-02-03 13:31   ` Claudio Imbrenda
2022-02-03 20:53     ` John Hubbard
2022-02-03 13:53   ` Jan Kara
2022-02-03 15:01     ` Jason Gunthorpe
2022-02-03 15:18       ` Matthew Wilcox
2022-02-03 21:19         ` John Hubbard
2022-02-03  9:32 ` [PATCH v3 3/4] mm/gup: remove unused pin_user_pages_locked() John Hubbard
2022-02-03 11:52   ` Claudio Imbrenda
2022-02-03  9:32 ` [PATCH v3 4/4] mm/gup: remove get_user_pages_locked() John Hubbard
2022-02-03 12:04   ` Claudio Imbrenda
2022-02-03 14:01   ` Christoph Hellwig
2022-02-03 21:27     ` John Hubbard
