linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v6 00/12] mm/gup: track FOLL_PIN pages
@ 2020-02-11  0:15 John Hubbard
  2020-02-11  0:15 ` [PATCH v6 01/12] mm/gup: split get_user_pages_remote() into two routines John Hubbard
                   ` (11 more replies)
  0 siblings, 12 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard

Hi,

Jan and Kirill: I've tentatively removed your review and ACK,
respectively, from patch 12 (the last dump_page patch), because even
though it is logically the same as what you reviewed in v5, its base
is now Matthew's new patch instead of my earlier patch. (Trying to err
on the side of caution with these tags.)

There is a git repo and branch, for convenience in reviewing:

    git@github.com:johnhubbard/linux.git  track_user_pages_v6

============================================================
Changes since v5:

* Rebased onto Linux 5.6.0-rc1.

* Swapped in Matthew Wilcox's more comprehensive dump_page() patch, and
  moved it later in this series so that it immediately precedes my
  subsequent dump_page() patch, for slightly easier reviews and commit
  log history.

* Fixed "the last bug!" in the /proc/vmstat patch, by moving the
  mod_node_page_state() call in put_compound_page() so that it only
  happens in the FOLL_PIN case.

* Added a couple more ACKs from Kirill.

* Tweaked the "Future steps" in this cover letter to add a little
  detail about what comes next.

============================================================
Changes since v4:

* Added documentation about the huge page behavior of the new
  /proc/vmstat items.

* Added a missing mod_node_page_state() call to put_compound_head().

* Fixed a tracepoint call in page_ref_sub_return().

* Added a trailing underscore to a URL in pin_user_pages.rst, to fix
  a broken generated link.

* Added ACKs and reviewed-by's from Jan Kara and Kirill Shutemov.

* Rebased onto today's linux.git, and

* I am experimenting here with "git format-patch --base=<commit>".
  This generated the "base-commit:" tag you'll see at the end of this
  cover letter.  I was inspired to do so after trying out a new
  get-lore-mbox.py tool (it's very nice), mentioned in a recent LWN
  article (https://lwn.net/Articles/811528/ ). That tool relies on the
  base-commit tag for some things.

============================================================
Changes since v3:

* Rebased onto latest linux.git

* Added ACKs and reviewed-by's from Kirill Shutemov and Jan Kara.

* /proc/vmstat:
    * Renamed items, after realizing that I hate the previous names:
         nr_foll_pin_requested --> nr_foll_pin_acquired
         nr_foll_pin_returned  --> nr_foll_pin_released

    * Removed the CONFIG_DEBUG_VM guard, and collapsed away a wrapper
      routine: now just calls mod_node_page_state() directly.

* Tweaked the WARN_ON_ONCE() statements in mm/hugetlb.c to be more
  informative, and added comments above them as well.

* Fixed gup_benchmark: signed int --> unsigned long.

* One or two minor formatting changes.

============================================================
Changes since v2:

* Rebased onto linux.git, because the akpm tree for 5.6 has been merged.

* Split the tracking patch into even more patches, as requested.

* Merged Matthew Wilcox's dump_page() changes into mine, as part of the
  first patch.

* Renamed: page_dma_pinned() --> page_maybe_dma_pinned(), in response to
  Kirill Shutemov's review.

* Moved a WARN to the top of a routine, and fixed a typo in the commit
  description of patch #7, also as suggested by Kirill.

============================================================
Changes since v1:

* Split the tracking patch into 6 smaller patches

* Rebased onto today's linux-next/akpm (there weren't any conflicts).

* Fixed an "unsigned int" vs. "int" problem in gup_benchmark, reported
  by Nathan Chancellor. (I don't see it in my local builds, probably
  because they use gcc, but an LLVM test found the mismatch.)

* Fixed a huge page pincount problem (add/subtract vs.
  increment/decrement), spotted by Jan Kara.
============================================================

There is a reasonable case to be made for merging two of the patches
(patches 7 and 8), given that patch 7 provides tracking that has upper
limits on the number of pins that can be done with huge pages. Let me
know if anyone wants those merged, but unless there is some weird chance
of someone grabbing patch 7 and not patch 8, I don't really see the
need. Meanwhile, it's easier to review in this form.

Also, patch 3 has been revived. Earlier reviewers asked for it to be
merged into the tracking patch (one cannot please everyone, heh), but
now it's back out on its own.

This activates tracking of FOLL_PIN pages. This is in support of fixing
the get_user_pages()+DMA problem described in [1]-[4].

FOLL_PIN support is now in the main linux tree. However, the
patch to use FOLL_PIN to track pages was *not* submitted, because Leon
saw an RDMA test suite failure that involved (I think) page refcount
overflows when huge pages were used.

This patch definitively solves that kind of overflow problem, by adding
an exact pin count for compound pages of order > 1, stored in the 3rd
struct page of the compound page. If available, that form of pin
counting is used instead of the GUP_PIN_COUNTING_BIAS approach. Thanks
again to Jan Kara for that idea.
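
For readers who want the gist without reading the patches themselves,
here is a rough sketch of how pins get recorded. This is illustrative
only and deliberately simplified from patches 6 and 7: the function
name sketch_record_pins() is made up, and compound_pincount_ptr() is
meant to stand for the per-compound-page pin count accessor that this
series introduces.

    static void sketch_record_pins(struct page *head, int refs)
    {
            if (PageCompound(head) && compound_order(head) > 1) {
                    /* Exact counting: a dedicated field in the 3rd struct page. */
                    page_ref_add(head, refs);
                    atomic_add(refs, compound_pincount_ptr(head));
            } else {
                    /* Fuzzy counting: overload _refcount with a large bias. */
                    page_ref_add(head, refs * GUP_PIN_COUNTING_BIAS);
            }
    }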

Other interesting changes:

* dump_page(): added one or two new things to report for compound
  pages: head refcount (for all compound pages), and map_pincount (for
  compound pages of order > 1).

* Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
  huge page refcount upper limit problems, and added notes about how it
  works now. Also added a note about the dump_page() enhancements.

* Added some comments in gup.c and mm.h, to explain that there are two
  ways to count pinned pages: exact (for compound pages of order > 1)
  and fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).

============================================================
General notes about the tracking patch:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], [4] and in a remarkable number of email threads since about
2017. :)

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, call
sites opt in by changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:

    pin_user_pages() (sets FOLL_PIN)
    unpin_user_page()
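
As a concrete (but purely illustrative) sketch of such a conversion at
a hypothetical driver-style call site; the two function names below are
made up, while pin_user_pages(), unpin_user_pages(), get_user_pages()
and put_page() are the real APIs:

    static long sketch_pin_user_buffer(unsigned long uaddr,
                                       unsigned long nr_pages,
                                       struct page **pages)
    {
            /* Before: get_user_pages(uaddr, nr_pages, FOLL_WRITE, pages, NULL); */
            return pin_user_pages(uaddr, nr_pages, FOLL_WRITE, pages, NULL);
    }

    static void sketch_release_user_buffer(struct page **pages,
                                           unsigned long nr_pages)
    {
            /* Before: a loop calling put_page(pages[i]). */
            unpin_user_pages(pages, nr_pages);
    }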

============================================================
Future steps:

* Convert more subsystems from get_user_pages() to pin_user_pages().
  The first probably needs to be bio/biovecs, because any filesystem
  testing is too difficult without those in place.

* Change VFS and filesystems to respond appropriately when encountering
  dma-pinned pages.

* Work with Ira and others to connect this all up with file system
  leases.

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/

[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/

[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/

[4] LWN kernel index: get_user_pages()
    https://lwn.net/Kernel/Index/#Memory_management-get_user_pages


John Hubbard (11):
  mm/gup: split get_user_pages_remote() into two routines
  mm/gup: pass a flags arg to __gup_device_* functions
  mm: introduce page_ref_sub_return()
  mm/gup: pass gup flags to two more routines
  mm/gup: require FOLL_GET for get_user_pages_fast()
  mm/gup: track FOLL_PIN pages
  mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
  mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
    coverage
  mm: dump_page(): additional diagnostics for huge pinned pages

Matthew Wilcox (Oracle) (1):
  mm: Improve dump_page() for compound pages

 Documentation/core-api/pin_user_pages.rst  |  86 ++--
 include/linux/mm.h                         | 108 ++++-
 include/linux/mm_types.h                   |   7 +-
 include/linux/mmzone.h                     |   2 +
 include/linux/page_ref.h                   |   9 +
 mm/debug.c                                 |  44 +-
 mm/gup.c                                   | 451 ++++++++++++++++-----
 mm/gup_benchmark.c                         |  71 +++-
 mm/huge_memory.c                           |  29 +-
 mm/hugetlb.c                               |  60 ++-
 mm/page_alloc.c                            |   2 +
 mm/rmap.c                                  |   6 +
 mm/vmstat.c                                |   2 +
 tools/testing/selftests/vm/gup_benchmark.c |  15 +-
 tools/testing/selftests/vm/run_vmtests     |  22 +
 15 files changed, 734 insertions(+), 180 deletions(-)


base-commit: bb6d3fb354c5ee8d6bde2d576eb7220ea09862b9
-- 
2.25.0



* [PATCH v6 01/12] mm/gup: split get_user_pages_remote() into two routines
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11  0:15 ` [PATCH v6 02/12] mm/gup: pass a flags arg to __gup_device_* functions John Hubbard
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

An upcoming patch requires reusing the implementation of
get_user_pages_remote(). Split up get_user_pages_remote() into an outer
routine that checks flags, and an implementation routine that will be
reused. This makes subsequent changes much easier to understand.

There should be no change in behavior due to this patch.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 56 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 33 insertions(+), 23 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 1b521e0ac1de..b699500da077 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1557,6 +1557,37 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
 }
 #endif /* CONFIG_FS_DAX || CONFIG_CMA */
 
+#ifdef CONFIG_MMU
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * Parts of FOLL_LONGTERM behavior are incompatible with
+	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
+	 * vmas. However, this only comes up if locked is set, and there are
+	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
+	 * allow what we can.
+	 */
+	if (gup_flags & FOLL_LONGTERM) {
+		if (WARN_ON_ONCE(locked))
+			return -EINVAL;
+		/*
+		 * This will check the vmas (even if our vmas arg is NULL)
+		 * and return -ENOTSUPP if DAX isn't allowed in this case:
+		 */
+		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+					     vmas, gup_flags | FOLL_TOUCH |
+					     FOLL_REMOTE);
+	}
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked,
+				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+}
+
 /*
  * get_user_pages_remote() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -1619,7 +1650,6 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
  * should use get_user_pages because it cannot pass
  * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
-#ifdef CONFIG_MMU
 long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
@@ -1632,28 +1662,8 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
-	/*
-	 * Parts of FOLL_LONGTERM behavior are incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas. However, this only comes up if locked is set, and there are
-	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
-	 * allow what we can.
-	 */
-	if (gup_flags & FOLL_LONGTERM) {
-		if (WARN_ON_ONCE(locked))
-			return -EINVAL;
-		/*
-		 * This will check the vmas (even if our vmas arg is NULL)
-		 * and return -ENOTSUPP if DAX isn't allowed in this case:
-		 */
-		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
-					     vmas, gup_flags | FOLL_TOUCH |
-					     FOLL_REMOTE);
-	}
-
-	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
-				       locked,
-				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+	return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				       pages, vmas, locked);
 }
 EXPORT_SYMBOL(get_user_pages_remote);
 
-- 
2.25.0



* [PATCH v6 02/12] mm/gup: pass a flags arg to __gup_device_* functions
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
  2020-02-11  0:15 ` [PATCH v6 01/12] mm/gup: split get_user_pages_remote() into two routines John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11  0:15 ` [PATCH v6 03/12] mm: introduce page_ref_sub_return() John Hubbard
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

A subsequent patch requires access to gup flags, so pass the flags
argument through to the __gup_device_* functions.

Also placate checkpatch.pl by shortening a nearby line.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index b699500da077..9e117998274c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1963,7 +1963,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+			     unsigned long end, unsigned int flags,
+			     struct page **pages, int *nr)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
@@ -1989,13 +1990,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -2006,13 +2008,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -2023,14 +2026,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 }
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
 }
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
@@ -2146,7 +2151,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
+		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
+					     pages, nr);
 	}
 
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
@@ -2167,7 +2173,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages, int *nr)
+			unsigned long end, unsigned int flags,
+			struct page **pages, int *nr)
 {
 	struct page *head, *page;
 	int refs;
@@ -2178,7 +2185,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
+		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
+					     pages, nr);
 	}
 
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-- 
2.25.0



* [PATCH v6 03/12] mm: introduce page_ref_sub_return()
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
  2020-02-11  0:15 ` [PATCH v6 01/12] mm/gup: split get_user_pages_remote() into two routines John Hubbard
  2020-02-11  0:15 ` [PATCH v6 02/12] mm/gup: pass a flags arg to __gup_device_* functions John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11  0:15 ` [PATCH v6 04/12] mm/gup: pass gup flags to two more routines John Hubbard
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

An upcoming patch requires subtracting a large chunk of refcounts from
a page, and checking what the resulting refcount is. This is a little
different from the usual "check for zero refcount" that many of the
page ref functions already do. However, it is similar to a few other
routines that (like this one) are generally useful for things such as
1-based refcounting.

Add page_ref_sub_return(), which subtracts a chunk of refcounts
atomically, and returns an atomic snapshot of the result.
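
A rough usage sketch, just to show the intent (the caller below is
hypothetical and not part of this patch):

    static void sketch_drop_refs(struct page *page, int nr_refs)
    {
            int remaining = page_ref_sub_return(page, nr_refs);

            if (remaining == 1) {
                    /* Only the "base" (1-based) reference remains. */
            } else if (remaining == 0) {
                    /* Last reference dropped: the caller frees the page. */
            }
    }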

Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/page_ref.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 14d14beb1f7f..d27701199a4d 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -102,6 +102,15 @@ static inline void page_ref_sub(struct page *page, int nr)
 		__page_ref_mod(page, -nr);
 }
 
+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+	int ret = atomic_sub_return(nr, &page->_refcount);
+
+	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return))
+		__page_ref_mod_and_return(page, -nr, ret);
+	return ret;
+}
+
 static inline void page_ref_inc(struct page *page)
 {
 	atomic_inc(&page->_refcount);
-- 
2.25.0



* [PATCH v6 04/12] mm/gup: pass gup flags to two more routines
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (2 preceding siblings ...)
  2020-02-11  0:15 ` [PATCH v6 03/12] mm: introduce page_ref_sub_return() John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11  0:15 ` [PATCH v6 05/12] mm/gup: require FOLL_GET for get_user_pages_fast() John Hubbard
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

In preparation for an upcoming patch, pass the gup flags argument to two
more routines: put_compound_head() and undo_dev_pagemap().

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 9e117998274c..e5f75e886663 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1870,6 +1870,7 @@ static inline pte_t gup_get_pte(pte_t *ptep)
 #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
 
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
+					    unsigned int flags,
 					    struct page **pages)
 {
 	while ((*nr) - nr_start) {
@@ -1909,7 +1910,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
-				undo_dev_pagemap(nr, nr_start, pages);
+				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
 		} else if (pte_special(pte))
@@ -1974,7 +1975,7 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, pages);
+			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
 		SetPageReferenced(page);
@@ -2001,7 +2002,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
@@ -2019,7 +2020,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
@@ -2053,7 +2054,7 @@ static int record_subpages(struct page *page, unsigned long addr,
 	return nr;
 }
 
-static void put_compound_head(struct page *page, int refs)
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
 	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
 	/*
@@ -2103,7 +2104,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 		return 0;
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2163,7 +2164,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2197,7 +2198,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2226,7 +2227,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 
 	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
-- 
2.25.0



* [PATCH v6 05/12] mm/gup: require FOLL_GET for get_user_pages_fast()
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (3 preceding siblings ...)
  2020-02-11  0:15 ` [PATCH v6 04/12] mm/gup: pass gup flags to two more routines John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11  0:15 ` [PATCH v6 06/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

Internal to mm/gup.c, require that get_user_pages_fast()
and __get_user_pages_fast() identify themselves, by setting
FOLL_GET. This is required in order to be able to make decisions
based on "FOLL_PIN, or FOLL_GET, or both or neither are set", in
upcoming patches.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index e5f75e886663..c8affbea2019 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2390,6 +2390,14 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	unsigned long len, end;
 	unsigned long flags;
 	int nr = 0;
+	/*
+	 * Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
+	 * because gup fast is always a "pin with a +1 page refcount" request.
+	 */
+	unsigned int gup_flags = FOLL_GET;
+
+	if (write)
+		gup_flags |= FOLL_WRITE;
 
 	start = untagged_addr(start) & PAGE_MASK;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
@@ -2415,7 +2423,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
 		local_irq_save(flags);
-		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
+		gup_pgd_range(start, end, gup_flags, pages, &nr);
 		local_irq_restore(flags);
 	}
 
@@ -2454,7 +2462,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 	int nr = 0, ret = 0;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-				       FOLL_FORCE | FOLL_PIN)))
+				       FOLL_FORCE | FOLL_PIN | FOLL_GET)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2521,6 +2529,13 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
+	/*
+	 * The caller may or may not have explicitly set FOLL_GET; either way is
+	 * OK. However, internally (within mm/gup.c), gup fast variants must set
+	 * FOLL_GET, because gup fast is always a "pin with a +1 page refcount"
+	 * request.
+	 */
+	gup_flags |= FOLL_GET;
 	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
-- 
2.25.0



* [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (4 preceding siblings ...)
  2020-02-11  0:15 ` [PATCH v6 05/12] mm/gup: require FOLL_GET for get_user_pages_fast() John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-04-24 18:18   ` [regression] " Alex Williamson
  2020-02-11  0:15 ` [PATCH v6 07/12] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages John Hubbard
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

Add tracking of pages that were pinned via FOLL_PIN. This tracking is
implemented via overloading of page->_refcount: pins are added by
adding GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a
fuzzy indication of pinning, and it can have false positives (and that's
OK). Please see the pre-existing
Documentation/core-api/pin_user_pages.rst for details.

As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
(typically via pin_user_pages*()) are required to ultimately free such
pages via unpin_user_page().

Please also note the limitation, discussed in pin_user_pages.rst under
the "TODO: for 1GB and larger huge pages" section. (That limitation will
be removed in a following patch.)

The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
thought of as "FOLL_GET for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_maybe_dma_pinned(struct page *page);

What to do in response to encountering such a page is left to later
patchsets. There is discussion about this in [1], [2], [3], and [4].
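
As a purely illustrative sketch of how such a future caller might use
it (the function below is hypothetical; only page_maybe_dma_pinned() is
from this patch):

    static bool sketch_should_back_off(struct page *page)
    {
            /*
             * false: definitely not dma-pinned.
             * true:  probably dma-pinned, but possibly a false positive,
             *        which callers must be able to tolerate.
             */
            return page_maybe_dma_pinned(page);
    }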

This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/
[4] LWN kernel index: get_user_pages():
    https://lwn.net/Kernel/Index/#Memory_management-get_user_pages

Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/core-api/pin_user_pages.rst |   6 +-
 include/linux/mm.h                        |  82 +++++--
 mm/gup.c                                  | 254 +++++++++++++++++-----
 mm/huge_memory.c                          |  29 ++-
 mm/hugetlb.c                              |  54 +++--
 5 files changed, 334 insertions(+), 91 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 1d490155ecd7..9829345428f8 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -173,8 +173,8 @@ CASE 4: Pinning for struct page manipulation only
 -------------------------------------------------
 Here, normal GUP calls are sufficient, so neither flag needs to be set.
 
-page_dma_pinned(): the whole point of pinning
-=============================================
+page_maybe_dma_pinned(): the whole point of pinning
+===================================================
 
 The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
 to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
@@ -186,7 +186,7 @@ and debates (see the References at the end of this document). It's a TODO item
 here: fill in the details once that's worked out. Meanwhile, it's safe to say
 that having this available: ::
 
-        static inline bool page_dma_pinned(struct page *page)
+        static inline bool page_maybe_dma_pinned(struct page *page)
 
 ...is a prerequisite to solving the long-running gup+DMA problem.
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52269e56c514..8d4f9f4094f4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1001,6 +1001,8 @@ static inline void get_page(struct page *page)
 	page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
 	page = compound_head(page);
@@ -1029,29 +1031,79 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:            pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many pin_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
+ *
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special handling.
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
  *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkclean and other calls that race to set up page
+ * table entries.
  */
-static inline void unpin_user_page(struct page *page)
-{
-	put_page(page);
-}
+#define GUP_PIN_COUNTING_BIAS (1U << 10)
 
+void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 				 bool make_dirty);
-
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
+/**
+ * page_maybe_dma_pinned() - report if a page is pinned for DMA.
+ *
+ * This function checks if a page has been pinned via a call to
+ * pin_user_pages*().
+ *
+ * For non-huge pages, the return value is partially fuzzy: false is not fuzzy,
+ * because it means "definitely not pinned for DMA", but true means "probably
+ * pinned for DMA, but possibly a false positive due to having at least
+ * GUP_PIN_COUNTING_BIAS worth of normal page references".
+ *
+ * False positives are OK, because: a) it's unlikely for a page to get that many
+ * refcounts, and b) all the callers of this routine are expected to be able to
+ * deal gracefully with a false positive.
+ *
+ * For more information, please see Documentation/vm/pin_user_pages.rst.
+ *
+ * @page:	pointer to page to be queried.
+ * @Return:	True, if it is likely that the page has been "dma-pinned".
+ *		False, if the page is definitely not dma-pinned.
+ */
+static inline bool page_maybe_dma_pinned(struct page *page)
+{
+	/*
+	 * page_ref_count() is signed. If that refcount overflows, then
+	 * page_ref_count() returns a negative value, and callers will avoid
+	 * further incrementing the refcount.
+	 *
+	 * Here, for that overflow case, use the signed bit to count a little
+	 * bit higher via unsigned math, and thus still get an accurate result.
+	 */
+	return ((unsigned int)page_ref_count(compound_head(page))) >=
+		GUP_PIN_COUNTING_BIAS;
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/mm/gup.c b/mm/gup.c
index c8affbea2019..a2356482e1ea 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -44,6 +44,135 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
 	return head;
 }
 
+/*
+ * try_grab_compound_head() - attempt to elevate a page's refcount, by a
+ * flags-dependent amount.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
+ * same time. (That's true throughout the get_user_pages*() and
+ * pin_user_pages*() APIs.) Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: head page (with refcount appropriately incremented) for success, or
+ * NULL upon failure. If neither FOLL_GET nor FOLL_PIN was set, that's
+ * considered failure, and furthermore, a likely bug in the caller, so a warning
+ * is also emitted.
+ */
+static __maybe_unused struct page *try_grab_compound_head(struct page *page,
+							  int refs,
+							  unsigned int flags)
+{
+	if (flags & FOLL_GET)
+		return try_get_compound_head(page, refs);
+	else if (flags & FOLL_PIN) {
+		refs *= GUP_PIN_COUNTING_BIAS;
+		return try_get_compound_head(page, refs);
+	}
+
+	WARN_ON_ONCE(1);
+	return NULL;
+}
+
+/**
+ * try_grab_page() - elevate a page's refcount by a flag-dependent amount
+ *
+ * This might not do anything at all, depending on the flags argument.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ *
+ * @page:    pointer to page to be grabbed
+ * @flags:   gup flags: these are the FOLL_* flag values.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
+ * time. Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: true for success, or if no action was required (if neither FOLL_PIN
+ * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
+ * FOLL_PIN was set, but the page could not be grabbed.
+ */
+bool __must_check try_grab_page(struct page *page, unsigned int flags)
+{
+	WARN_ON_ONCE((flags & (FOLL_GET | FOLL_PIN)) == (FOLL_GET | FOLL_PIN));
+
+	if (flags & FOLL_GET)
+		return try_get_page(page);
+	else if (flags & FOLL_PIN) {
+		page = compound_head(page);
+
+		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+			return false;
+
+		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+	}
+
+	return true;
+}
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	int count;
+
+	if (!page_is_devmap_managed(page))
+		return false;
+
+	count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+
+	/*
+	 * devmap page refcounts are 1-based, rather than 0-based: if
+	 * refcount is 1, then the page is free and the refcount is
+	 * stable because nobody holds a reference on the page.
+	 */
+	if (count == 1)
+		free_devmap_managed_page(page);
+	else if (!count)
+		__put_page(page);
+
+	return true;
+}
+#else
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
+
+/**
+ * unpin_user_page() - release a dma-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
+ */
+void unpin_user_page(struct page *page)
+{
+	page = compound_head(page);
+
+	/*
+	 * For devmap managed pages we need to catch refcount transition from
+	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
+	 * page is free and we need to inform the device driver through
+	 * callback. See include/linux/memremap.h and HMM for details.
+	 */
+	if (__unpin_devmap_managed_user_page(page))
+		return;
+
+	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+		__put_page(page);
+}
+EXPORT_SYMBOL(unpin_user_page);
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -230,10 +359,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	page = vm_normal_page(vma, address, pte);
-	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
+	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
-		 * Only return device mapping pages in the FOLL_GET case since
-		 * they are only valid while holding the pgmap reference.
+		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
+		 * case since they are only valid while holding the pgmap
+		 * reference.
 		 */
 		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
 		if (*pgmap)
@@ -271,11 +401,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		goto retry;
 	}
 
-	if (flags & FOLL_GET) {
-		if (unlikely(!try_get_page(page))) {
-			page = ERR_PTR(-ENOMEM);
-			goto out;
-		}
+	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
+	if (unlikely(!try_grab_page(page, flags))) {
+		page = ERR_PTR(-ENOMEM);
+		goto out;
 	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
@@ -537,7 +666,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	/* make this handle hugepd */
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
-		BUG_ON(flags & FOLL_GET);
+		WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
 		return page;
 	}
 
@@ -1675,6 +1804,15 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 {
 	return 0;
 }
+
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
+{
+	return 0;
+}
 #endif /* !CONFIG_MMU */
 
 /*
@@ -1877,7 +2015,10 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
 		struct page *page = pages[--(*nr)];
 
 		ClearPageReferenced(page);
-		put_page(page);
+		if (flags & FOLL_PIN)
+			unpin_user_page(page);
+		else
+			put_page(page);
 	}
 }
 
@@ -1919,7 +2060,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
-		head = try_get_compound_head(page, 1);
+		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
 
@@ -1980,7 +2121,10 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
-		get_page(page);
+		if (unlikely(!try_grab_page(page, flags))) {
+			undo_dev_pagemap(nr, nr_start, flags, pages);
+			return 0;
+		}
 		(*nr)++;
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);
@@ -2056,6 +2200,9 @@ static int record_subpages(struct page *page, unsigned long addr,
 
 static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
+	if (flags & FOLL_PIN)
+		refs *= GUP_PIN_COUNTING_BIAS;
+
 	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
 	/*
 	 * Calling put_page() for each ref is unnecessarily slow. Only the last
@@ -2099,7 +2246,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(head, refs);
+	head = try_grab_compound_head(head, refs, flags);
 	if (!head)
 		return 0;
 
@@ -2159,7 +2306,7 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pmd_page(orig), refs);
+	head = try_grab_compound_head(pmd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
@@ -2193,7 +2340,7 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pud_page(orig), refs);
+	head = try_grab_compound_head(pud_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
@@ -2222,7 +2369,7 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pgd_page(orig), refs);
+	head = try_grab_compound_head(pgd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
@@ -2505,11 +2652,11 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 
 /**
  * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
  *
  * Attempt to pin user pages in memory without taking mm->mmap_sem.
  * If not successful, it will fall back to taking the lock and
@@ -2543,9 +2690,12 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 /**
  * pin_user_pages_fast() - pin user pages in memory without taking locks
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_fast().
+ * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
+ * get_user_pages_fast() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2553,21 +2703,24 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_fast(start, nr_pages, gup_flags, pages);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
 /**
  * pin_user_pages_remote() - pin pages of a remote process (task != current)
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_remote().
+ * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
+ * get_user_pages_remote() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2577,22 +2730,24 @@ long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 			   unsigned int gup_flags, struct page **pages,
 			   struct vm_area_struct **vmas, int *locked)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
-				     vmas, locked);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				       pages, vmas, locked);
 }
 EXPORT_SYMBOL(pin_user_pages_remote);
 
 /**
  * pin_user_pages() - pin user pages in memory for use by other devices
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages().
+ * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
+ * FOLL_PIN is set.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2601,11 +2756,12 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __gup_longterm_locked(current, current->mm, start, nr_pages,
+				     pages, vmas, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..580098e115bd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -958,6 +958,11 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 */
 	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (flags & FOLL_WRITE && !pmd_write(*pmd))
 		return NULL;
 
@@ -973,7 +978,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
@@ -981,7 +986,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1101,6 +1107,11 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	if (flags & FOLL_WRITE && !pud_write(*pud))
 		return NULL;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (pud_present(*pud) && pud_devmap(*pud))
 		/* pass */;
 	else
@@ -1112,8 +1123,10 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	/*
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
+	 *
+	 * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
@@ -1121,7 +1134,8 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1497,8 +1511,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if (!try_grab_page(page, flags))
+		return ERR_PTR(-ENOMEM);
+
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd, flags);
+
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/*
 		 * We don't mlock() pte-mapped THPs. This way we can avoid
@@ -1535,8 +1554,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 skip_mlock:
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-	if (flags & FOLL_GET)
-		get_page(page);
 
 out:
 	return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd8737a94bec..ba1de6bc1402 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4375,19 +4375,6 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
 		page = pte_page(huge_ptep_get(pte));
 
-		/*
-		 * Instead of doing 'try_get_page()' below in the same_page
-		 * loop, just check the count once here.
-		 */
-		if (unlikely(page_count(page) <= 0)) {
-			if (pages) {
-				spin_unlock(ptl);
-				remainder = 0;
-				err = -ENOMEM;
-				break;
-			}
-		}
-
 		/*
 		 * If subpage information not requested, update counters
 		 * and skip the same_page loop below.
@@ -4405,7 +4392,22 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page(pages[i]);
+			/*
+			 * try_grab_page() should always succeed here, because:
+			 * a) we hold the ptl lock, and b) we've just checked
+			 * that the huge page is present in the page tables. If
+			 * the huge page is present, then the tail pages must
+			 * also be present. The ptl prevents the head page and
+			 * tail pages from being rearranged in any way. So this
+			 * page must be available at this point, unless the page
+			 * refcount overflowed:
+			 */
+			if (WARN_ON_ONCE(!try_grab_page(pages[i], flags))) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				break;
+			}
 		}
 
 		if (vmas)
@@ -4965,6 +4967,12 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	pte_t pte;
+
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 retry:
 	ptl = pmd_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -4977,8 +4985,18 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	pte = huge_ptep_get((pte_t *)pmd);
 	if (pte_present(pte)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
-		if (flags & FOLL_GET)
-			get_page(page);
+		/*
+		 * try_grab_page() should always succeed here, because: a) we
+		 * hold the pmd (ptl) lock, and b) we've just checked that the
+		 * huge pmd (head) page is present in the page tables. The ptl
+		 * prevents the head page and tail pages from being rearranged
+		 * in any way. So this page must be available at this point,
+		 * unless the page refcount overflowed:
+		 */
+		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
+			page = NULL;
+			goto out;
+		}
 	} else {
 		if (is_hugetlb_entry_migration(pte)) {
 			spin_unlock(ptl);
@@ -4999,7 +5017,7 @@ struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
@@ -5008,7 +5026,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 struct page * __weak
 follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
-- 
2.25.0



* [PATCH v6 07/12] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (5 preceding siblings ...)
  2020-02-11  0:15 ` [PATCH v6 06/12] mm/gup: track FOLL_PIN pages John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11  0:15 ` [PATCH v6 08/12] mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting John Hubbard
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

For huge pages (and in fact, any compound page), the
GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, because each
tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS
(1024). That limits the number of huge pages that can be pinned.
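
To illustrate the scale of the problem (illustrative arithmetic only,
assuming 4 KB base pages and the GUP_PIN_COUNTING_BIAS value of 1024
mentioned above, this is what a single pin adds to the head
page->_refcount):

    2 MB huge page:     512 subpages * 1024 =     524,288
    1 GB huge page: 262,144 subpages * 1024 = 268,435,456  (~2^28)

With a signed 32-bit page->_refcount, a 1 GB huge page is left with only
about 3 bits of headroom, i.e. roughly 8 pins.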

This patch removes that limitation by using an exact form of pin
counting for compound pages of order > 1. The "order > 1" restriction is
required because this approach uses the 3rd struct page in the compound
page, and order-1 compound pages have only two struct pages, so the
approach cannot be used there.

A new struct page field, hpage_pinned_refcount, has been added,
replacing a padding field in the union (so no new space is used).
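
As a simplified sketch of where these pieces live (field placement is
taken from the mm_types.h and mm.h hunks below; this is not a complete
struct page layout):

	/*
	 * page[0]: head page; its _refcount still gets at least +1 per
	 *          pin, so the page really is held
	 * page[1]: first tail; compound_order, compound_mapcount, ...
	 * page[2]: second tail; hpage_pinned_refcount, the exact pin
	 *          count, reached via compound_pincount_ptr()
	 */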

This enhancement also has a useful side effect: huge pages and compound
pages (of order > 1) do not suffer from the "potential false positives"
problem that is discussed in the page_maybe_dma_pinned() comment block.
That is because these compound pages have room for an exact pin count,
so they no longer need to overload page->_refcount for pin tracking.

Documentation/core-api/pin_user_pages.rst is updated accordingly.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/core-api/pin_user_pages.rst | 40 +++++-------
 include/linux/mm.h                        | 26 ++++++++
 include/linux/mm_types.h                  |  7 +-
 mm/gup.c                                  | 78 ++++++++++++++++++++---
 mm/hugetlb.c                              |  6 ++
 mm/page_alloc.c                           |  2 +
 mm/rmap.c                                 |  6 ++
 7 files changed, 133 insertions(+), 32 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 9829345428f8..7e5dd8b1b3f2 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -52,8 +52,22 @@ Which flags are set by each wrapper
 
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
-pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+pages* array, and the function then pins pages by incrementing each by a special
+value: GUP_PIN_COUNTING_BIAS.
+
+For huge pages (and in fact, any compound page of more than 2 pages), the
+GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
+is achieved, by using the 3rd struct page in the compound page. A new struct
+page field, hpage_pinned_refcount, has been added in order to support this.
+
+This approach for compound pages avoids the counting upper limit problems that
+are discussed below. Those limitations would have been aggravated severely by
+huge pages, because each tail page adds a refcount to the head page. And in
+fact, testing revealed that, without a separate hpage_pinned_refcount field,
+page overflows were seen in some huge page stress tests.
+
+This also means that huge pages and compound pages (of order > 1) do not suffer
+from the false positives problem that is mentioned below.::
 
  Function
  --------
@@ -99,27 +113,6 @@ pages:
 This also leads to limitations: there are only 31-10==21 bits available for a
 counter that increments 10 bits at a time.
 
-TODO: for 1GB and larger huge pages, this is cutting it close. That's because
-when pin_user_pages() follows such pages, it increments the head page by "1"
-(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
-pin_user_pages()) for each tail page. So if you have a 1GB huge page:
-
-* There are 256K (18 bits) worth of 4 KB tail pages.
-* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
-  10 bits at a time)
-* There are 21 - 18 == 3 bits available to count. Except that there aren't,
-  because you need to allow for a few normal get_page() calls on the head page,
-  as well. Fortunately, the approach of using addition, rather than "hard"
-  bitfields, within page->_refcount, allows for sharing these bits gracefully.
-  But we're still looking at about 8 references.
-
-This, however, is a missing feature more than anything else, because it's easily
-solved by addressing an obvious inefficiency in the original get_user_pages()
-approach of retrieving pages: stop treating all the pages as if they were
-PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
-this, so some work is required. Once that's in place, this limitation mostly
-disappears from view, because there will be ample refcounting range available.
-
 * Callers must specifically request "dma-pinned tracking of pages". In other
   words, just calling get_user_pages() will not suffice; a new set of functions,
   pin_user_page() and related, must be used.
@@ -228,5 +221,6 @@ References
 * `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
 * `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
 * `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_
 
 John Hubbard, October, 2019
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8d4f9f4094f4..2f9ca976402b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page)
 	return page[1].compound_order;
 }
 
+static inline bool hpage_pincount_available(struct page *page)
+{
+	/*
+	 * Can the page->hpage_pinned_refcount field be used? That field is in
+	 * the 3rd page of the compound page, so the smallest (2-page) compound
+	 * pages cannot support it.
+	 */
+	page = compound_head(page);
+	return PageCompound(page) && compound_order(page) > 1;
+}
+
+static inline int compound_pincount(struct page *page)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	page = compound_head(page);
+	return atomic_read(compound_pincount_ptr(page));
+}
+
 static inline void set_compound_order(struct page *page, unsigned int order)
 {
 	page[1].compound_order = order;
@@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
  * refcounts, and b) all the callers of this routine are expected to be able to
  * deal gracefully with a false positive.
  *
+ * For huge pages, the result will be exactly correct. That's because we have
+ * more tracking data available: the 3rd struct page in the compound page is
+ * used to track the pincount (instead of using the GUP_PIN_COUNTING_BIAS
+ * scheme).
+ *
  * For more information, please see Documentation/vm/pin_user_pages.rst.
  *
  * @page:	pointer to page to be queried.
@@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages);
  */
 static inline bool page_maybe_dma_pinned(struct page *page)
 {
+	if (hpage_pincount_available(page))
+		return compound_pincount(page) > 0;
+
 	/*
 	 * page_ref_count() is signed. If that refcount overflows, then
 	 * page_ref_count() returns a negative value, and callers will avoid
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c28911c3afa8..dd555e6d23f3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -137,7 +137,7 @@ struct page {
 		};
 		struct {	/* Second tail page of compound page */
 			unsigned long _compound_pad_1;	/* compound_head */
-			unsigned long _compound_pad_2;
+			atomic_t hpage_pinned_refcount;
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
@@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page)
 	return &page[1].compound_mapcount;
 }
 
+static inline atomic_t *compound_pincount_ptr(struct page *page)
+{
+	return &page[2].hpage_pinned_refcount;
+}
+
 /*
  * Used for sizing the vmemmap region on some architectures
  */
diff --git a/mm/gup.c b/mm/gup.c
index a2356482e1ea..4d0d94405639 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,22 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+static void hpage_pincount_add(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+	atomic_add(refs, compound_pincount_ptr(page));
+}
+
+static void hpage_pincount_sub(struct page *page, int refs)
+{
+	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
+	VM_BUG_ON_PAGE(page != compound_head(page), page);
+
+	atomic_sub(refs, compound_pincount_ptr(page));
+}
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 	if (flags & FOLL_GET)
 		return try_get_compound_head(page, refs);
 	else if (flags & FOLL_PIN) {
-		refs *= GUP_PIN_COUNTING_BIAS;
-		return try_get_compound_head(page, refs);
+		/*
+		 * When pinning a compound page of order > 1 (which is what
+		 * hpage_pincount_available() checks for), use an exact count to
+		 * track it, via hpage_pincount_add/_sub().
+		 *
+		 * However, be sure to *also* increment the normal page refcount
+		 * field at least once, so that the page really is pinned.
+		 */
+		if (!hpage_pincount_available(page))
+			refs *= GUP_PIN_COUNTING_BIAS;
+
+		page = try_get_compound_head(page, refs);
+		if (!page)
+			return NULL;
+
+		if (hpage_pincount_available(page))
+			hpage_pincount_add(page, refs);
+
+		return page;
 	}
 
 	WARN_ON_ONCE(1);
@@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 	if (flags & FOLL_GET)
 		return try_get_page(page);
 	else if (flags & FOLL_PIN) {
+		int refs = 1;
+
 		page = compound_head(page);
 
 		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
 			return false;
 
-		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+		if (hpage_pincount_available(page))
+			hpage_pincount_add(page, 1);
+		else
+			refs = GUP_PIN_COUNTING_BIAS;
+
+		/*
+		 * Similar to try_grab_compound_head(): even if using the
+		 * hpage_pincount_add/_sub() routines, be sure to
+		 * *also* increment the normal page refcount field at least
+		 * once, so that the page really is pinned.
+		 */
+		page_ref_add(page, refs);
 	}
 
 	return true;
@@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 static bool __unpin_devmap_managed_user_page(struct page *page)
 {
-	int count;
+	int count, refs = 1;
 
 	if (!page_is_devmap_managed(page))
 		return false;
 
-	count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+	if (hpage_pincount_available(page))
+		hpage_pincount_sub(page, 1);
+	else
+		refs = GUP_PIN_COUNTING_BIAS;
+
+	count = page_ref_sub_return(page, refs);
 
 	/*
 	 * devmap page refcounts are 1-based, rather than 0-based: if
@@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
  */
 void unpin_user_page(struct page *page)
 {
+	int refs = 1;
+
 	page = compound_head(page);
 
 	/*
@@ -168,7 +221,12 @@ void unpin_user_page(struct page *page)
 	if (__unpin_devmap_managed_user_page(page))
 		return;
 
-	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+	if (hpage_pincount_available(page))
+		hpage_pincount_sub(page, 1);
+	else
+		refs = GUP_PIN_COUNTING_BIAS;
+
+	if (page_ref_sub_and_test(page, refs))
 		__put_page(page);
 }
 EXPORT_SYMBOL(unpin_user_page);
@@ -2200,8 +2258,12 @@ static int record_subpages(struct page *page, unsigned long addr,
 
 static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
-	if (flags & FOLL_PIN)
-		refs *= GUP_PIN_COUNTING_BIAS;
+	if (flags & FOLL_PIN) {
+		if (hpage_pincount_available(page))
+			hpage_pincount_sub(page, refs);
+		else
+			refs *= GUP_PIN_COUNTING_BIAS;
+	}
 
 	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
 	/*
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ba1de6bc1402..3d31a235b53d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page,
 	struct page *p = page + 1;
 
 	atomic_set(compound_mapcount_ptr(page), 0);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		clear_compound_head(p);
 		set_page_refcounted(p);
@@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
 }
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c4eb750a199..b2fe61035b7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -689,6 +689,8 @@ void prep_compound_page(struct page *page, unsigned int order)
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/mm/rmap.c b/mm/rmap.c
index b3e381919835..e45b9b991e2f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page,
 		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 		/* increment count (starts at -1) */
 		atomic_set(compound_mapcount_ptr(page), 0);
+		if (hpage_pincount_available(page))
+			atomic_set(compound_pincount_ptr(page), 0);
+
 		__inc_node_page_state(page, NR_ANON_THPS);
 	} else {
 		/* Anon THP always mapped first with PMD */
@@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page,
 {
 	BUG_ON(address < vma->vm_start || address >= vma->vm_end);
 	atomic_set(compound_mapcount_ptr(page), 0);
+	if (hpage_pincount_available(page))
+		atomic_set(compound_pincount_ptr(page), 0);
+
 	__page_set_anon_rmap(page, vma, address, 1);
 }
 #endif /* CONFIG_HUGETLB_PAGE */
-- 
2.25.0



* [PATCH v6 08/12] mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (6 preceding siblings ...)
  2020-02-11  0:15 ` [PATCH v6 07/12] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-12  9:17   ` Jan Kara
  2020-02-11  0:15 ` [PATCH v6 09/12] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
unpin_user_pages*(), we need some visibility into whether all of this is
working correctly.

Add two new fields to /proc/vmstat:

    nr_foll_pin_acquired
    nr_foll_pin_released

These are documented in Documentation/core-api/pin_user_pages.rst.
They represent the number of pages (since boot time) that have been
pinned ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"),
via pin_user_pages*() and unpin_user_pages*().

In the absence of long-running DMA or RDMA operations that hold pages
pinned, the above two fields will normally be equal to each other.
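
As a minimal sketch of the kind of caller these counters track (an
illustrative example, not code from this series; it only assumes
pin_user_pages_fast()/unpin_user_pages() from <linux/mm.h>, and error
handling is trimmed):

	static int example_dma_to_user_buf(unsigned long user_addr,
					   int nr_pages, struct page **pages)
	{
		/* Each page successfully pinned bumps nr_foll_pin_acquired. */
		int nr = pin_user_pages_fast(user_addr, nr_pages, FOLL_WRITE,
					     pages);
		if (nr <= 0)
			return nr ? nr : -EFAULT;

		/* ... set up and run DMA against pages[0..nr-1] here ... */

		/* Each page unpinned bumps nr_foll_pin_released. */
		unpin_user_pages(pages, nr);
		return 0;
	}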

Also: update Documentation/core-api/pin_user_pages.rst, to remove an
earlier (now confirmed untrue) claim about a performance problem with
/proc/vmstat.

Also: update Documentation/core-api/pin_user_pages.rst to rename the
new /proc/vmstat entries to the names listed here.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/core-api/pin_user_pages.rst | 33 +++++++++++++++++++----
 include/linux/mmzone.h                    |  2 ++
 mm/gup.c                                  | 13 +++++++++
 mm/vmstat.c                               |  2 ++
 4 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 7e5dd8b1b3f2..5c8a5f89756b 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -208,12 +208,35 @@ has the following new calls to exercise the new pin*() wrapper functions:
 You can monitor how many total dma-pinned pages have been acquired and released
 since the system was booted, via two new /proc/vmstat entries: ::
 
-    /proc/vmstat/nr_foll_pin_requested
-    /proc/vmstat/nr_foll_pin_requested
+    /proc/vmstat/nr_foll_pin_acquired
+    /proc/vmstat/nr_foll_pin_released
 
-Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in unpin_user_page(), when they
-are activated.
+Under normal conditions, these two values will be equal unless there are any
+long-term [R]DMA pins in place, or during pin/unpin transitions.
+
+* nr_foll_pin_acquired: This is the number of logical pins that have been
+  acquired since the system was powered on. For huge pages, the head page is
+  pinned once for each page (head page and each tail page) within the huge page.
+  This follows the same sort of behavior that get_user_pages() uses for huge
+  pages: the head page is refcounted once for each tail or head page in the huge
+  page, when get_user_pages() is applied to a huge page.
+
+* nr_foll_pin_released: The number of logical pins that have been released since
+  the system was powered on. Note that pages are released (unpinned) on a
+  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
+  Because of the pin count behavior described above in "nr_foll_pin_acquired",
+  the accounting balances out, so that after doing this::
+
+    pin_user_pages(huge_page);
+    for (each page in huge_page)
+        unpin_user_page(page);
+
+...the following is expected::
+
+    nr_foll_pin_released == nr_foll_pin_acquired
+
+(...unless it was already out of balance due to a long-term RDMA pin being in
+place.)
 
 References
 ==========
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..4bca42eeb439 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -243,6 +243,8 @@ enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
+	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/gup.c b/mm/gup.c
index 4d0d94405639..441f7a48f370 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -86,6 +86,8 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 	if (flags & FOLL_GET)
 		return try_get_compound_head(page, refs);
 	else if (flags & FOLL_PIN) {
+		int orig_refs = refs;
+
 		/*
 		 * When pinning a compound page of order > 1 (which is what
 		 * hpage_pincount_available() checks for), use an exact count to
@@ -104,6 +106,9 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 		if (hpage_pincount_available(page))
 			hpage_pincount_add(page, refs);
 
+		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
+				    orig_refs);
+
 		return page;
 	}
 
@@ -158,6 +163,8 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
 		 * once, so that the page really is pinned.
 		 */
 		page_ref_add(page, refs);
+
+		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 1);
 	}
 
 	return true;
@@ -178,6 +185,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
 
 	count = page_ref_sub_return(page, refs);
 
+	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
 	/*
 	 * devmap page refcounts are 1-based, rather than 0-based: if
 	 * refcount is 1, then the page is free and the refcount is
@@ -228,6 +236,8 @@ void unpin_user_page(struct page *page)
 
 	if (page_ref_sub_and_test(page, refs))
 		__put_page(page);
+
+	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
 }
 EXPORT_SYMBOL(unpin_user_page);
 
@@ -2259,6 +2269,9 @@ static int record_subpages(struct page *page, unsigned long addr,
 static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
 	if (flags & FOLL_PIN) {
+		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
+				    refs);
+
 		if (hpage_pincount_available(page))
 			hpage_pincount_sub(page, refs);
 		else
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..c9c0d71f917f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1168,6 +1168,8 @@ const char * const vmstat_text[] = {
 	"nr_dirtied",
 	"nr_written",
 	"nr_kernel_misc_reclaimable",
+	"nr_foll_pin_acquired",
+	"nr_foll_pin_released",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.25.0



* [PATCH v6 09/12] mm/gup_benchmark: support pin_user_pages() and related calls
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (7 preceding siblings ...)
  2020-02-11  0:15 ` [PATCH v6 08/12] mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11  0:15 ` [PATCH v6 10/12] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages():      via the '-b' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().

Also, for the commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new page_maybe_dma_pinned() routine.
Those commands are:

    PIN_FAST_BENCHMARK     : calls pin_user_pages_fast()
    PIN_BENCHMARK          : calls pin_user_pages()

In between the calls to pin_*() and unpin_user_pages(),
check each page: if page_maybe_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup_benchmark.c                         | 71 ++++++++++++++++++++--
 tools/testing/selftests/vm/gup_benchmark.c | 15 ++++-
 2 files changed, 80 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 8dba38e79a9f..be690fa66a46 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -19,6 +21,48 @@ struct gup_benchmark {
 	__u64 expansion[10];	/* For future use */
 };
 
+static void put_back_pages(unsigned int cmd, struct page **pages,
+			   unsigned long nr_pages)
+{
+	unsigned long i;
+
+	switch (cmd) {
+	case GUP_FAST_BENCHMARK:
+	case GUP_LONGTERM_BENCHMARK:
+	case GUP_BENCHMARK:
+		for (i = 0; i < nr_pages; i++)
+			put_page(pages[i]);
+		break;
+
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		unpin_user_pages(pages, nr_pages);
+		break;
+	}
+}
+
+static void verify_dma_pinned(unsigned int cmd, struct page **pages,
+			      unsigned long nr_pages)
+{
+	unsigned long i;
+	struct page *page;
+
+	switch (cmd) {
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		for (i = 0; i < nr_pages; i++) {
+			page = pages[i];
+			if (WARN(!page_maybe_dma_pinned(page),
+				 "pages[%lu] is NOT dma-pinned\n", i)) {
+
+				dump_page(page, "gup_benchmark failure");
+				break;
+			}
+		}
+		break;
+	}
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
 		struct gup_benchmark *gup)
 {
@@ -66,6 +110,14 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = get_user_pages(addr, nr, gup->flags, pages + i,
 					    NULL);
 			break;
+		case PIN_FAST_BENCHMARK:
+			nr = pin_user_pages_fast(addr, nr, gup->flags,
+						 pages + i);
+			break;
+		case PIN_BENCHMARK:
+			nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+					    NULL);
+			break;
 		default:
 			kvfree(pages);
 			ret = -EINVAL;
@@ -78,15 +130,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 	}
 	end_time = ktime_get();
 
+	/* Shifting the meaning of nr_pages: now it is actual number pinned: */
+	nr_pages = i;
+
 	gup->get_delta_usec = ktime_us_delta(end_time, start_time);
 	gup->size = addr - gup->addr;
 
+	/*
+	 * Take an un-benchmark-timed moment to verify DMA pinned
+	 * state: print a warning if any non-dma-pinned pages are found:
+	 */
+	verify_dma_pinned(cmd, pages, nr_pages);
+
 	start_time = ktime_get();
-	for (i = 0; i < nr_pages; i++) {
-		if (!pages[i])
-			break;
-		put_page(pages[i]);
-	}
+
+	put_back_pages(cmd, pages, nr_pages);
+
 	end_time = ktime_get();
 	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
 
@@ -105,6 +164,8 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
 	case GUP_FAST_BENCHMARK:
 	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
 		break;
 	default:
 		return -EINVAL;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index 389327e9b30a..43b4dfe161a2 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,10 @@
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
 
+/* Similar to above, but use FOLL_PIN instead of FOLL_GET. */
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 5, struct gup_benchmark)
+
 /* Just the flags we need, copied from mm.h: */
 #define FOLL_WRITE	0x01	/* check pte is writable */
 
@@ -40,8 +44,14 @@ int main(int argc, char **argv)
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUwSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:f:abtTLUuwSH")) != -1) {
 		switch (opt) {
+		case 'a':
+			cmd = PIN_FAST_BENCHMARK;
+			break;
+		case 'b':
+			cmd = PIN_BENCHMARK;
+			break;
 		case 'm':
 			size = atoi(optarg) * MB;
 			break;
@@ -63,6 +73,9 @@ int main(int argc, char **argv)
 		case 'U':
 			cmd = GUP_BENCHMARK;
 			break;
+		case 'u':
+			cmd = GUP_FAST_BENCHMARK;
+			break;
 		case 'w':
 			write = 1;
 			break;
-- 
2.25.0



* [PATCH v6 10/12] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (8 preceding siblings ...)
  2020-02-11  0:15 ` [PATCH v6 09/12] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11  0:15 ` [PATCH v6 11/12] mm: Improve dump_page() for compound pages John Hubbard
  2020-02-11  0:15 ` [PATCH v6 12/12] mm: dump_page(): additional diagnostics for huge pinned pages John Hubbard
  11 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard

It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it to the run_vmtests suite
is going to cause no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 tools/testing/selftests/vm/run_vmtests | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index a692ea828317..df6a6bf3f238 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -112,6 +112,28 @@ echo "NOTE: The above hugetlb tests provide minimal coverage.  Use"
 echo "      https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "      hugetlb regression testing."
 
+echo "--------------------------------------------"
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo "--------------------------------------------"
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
+echo "------------------------------------------"
+echo "running gup_benchmark -b (pin_user_pages)"
+echo "------------------------------------------"
+./gup_benchmark -b
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 echo "-------------------"
 echo "running userfaultfd"
 echo "-------------------"
-- 
2.25.0



* [PATCH v6 11/12] mm: Improve dump_page() for compound pages
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (9 preceding siblings ...)
  2020-02-11  0:15 ` [PATCH v6 10/12] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11  0:15 ` [PATCH v6 12/12] mm: dump_page(): additional diagnostics for huge pinned pages John Hubbard
  11 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

There was no protection against a corrupted struct page having an
implausible compound_head().  Sanity check that a compound page has
a head within reach of the maximum allocatable page (this will need
to be adjusted if one of the plans to allocate 1GB pages comes to
fruition).  In addition,

 - Print the mapping pointer using %p instead of %px.  The actual value of
   the pointer can be read out of the raw page dump and using %p gives a
   chance to correlate it with an earlier printk of the mapping pointer
 - Print the mapping pointer from the head page, not the tail page
   (the tail ->mapping pointer may be in use for other purposes, eg part
   of a list_head)
 - Print the order of the page for compound pages
 - Dump the raw head page as well as the raw page
 - Print the refcount from the head page, not the tail page

Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/debug.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/mm/debug.c b/mm/debug.c
index ecccd9f17801..f5ffb0784559 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -44,8 +44,10 @@ const struct trace_print_flags vmaflag_names[] = {
 
 void __dump_page(struct page *page, const char *reason)
 {
+	struct page *head = compound_head(page);
 	struct address_space *mapping;
 	bool page_poisoned = PagePoisoned(page);
+	bool compound = PageCompound(page);
 	/*
 	 * Accessing the pageblock without the zone lock. It could change to
 	 * "isolate" again in the meantime, but since we are just dumping the
@@ -66,25 +68,32 @@ void __dump_page(struct page *page, const char *reason)
 		goto hex_only;
 	}
 
-	mapping = page_mapping(page);
+	if (page < head || (page >= head + MAX_ORDER_NR_PAGES)) {
+		/* Corrupt page, cannot call page_mapping */
+		mapping = page->mapping;
+		head = page;
+		compound = false;
+	} else {
+		mapping = page_mapping(page);
+	}
 
 	/*
 	 * Avoid VM_BUG_ON() in page_mapcount().
 	 * page->_mapcount space in struct page is used by sl[aou]b pages to
 	 * encode own info.
 	 */
-	mapcount = PageSlab(page) ? 0 : page_mapcount(page);
+	mapcount = PageSlab(head) ? 0 : page_mapcount(page);
 
-	if (PageCompound(page))
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%px "
-			"index:%#lx compound_mapcount: %d\n",
-			page, page_ref_count(page), mapcount,
-			page->mapping, page_to_pgoff(page),
-			compound_mapcount(page));
+	if (compound)
+		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
+			"index:%#lx head:%px order:%u compound_mapcount:%d\n",
+			page, page_ref_count(head), mapcount,
+			mapping, page_to_pgoff(page), head,
+			compound_order(head), compound_mapcount(page));
 	else
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%px index:%#lx\n",
+		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n",
 			page, page_ref_count(page), mapcount,
-			page->mapping, page_to_pgoff(page));
+			mapping, page_to_pgoff(page));
 	if (PageKsm(page))
 		type = "ksm ";
 	else if (PageAnon(page))
@@ -106,6 +115,10 @@ void __dump_page(struct page *page, const char *reason)
 	print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
 			sizeof(unsigned long), page,
 			sizeof(struct page), false);
+	if (head != page)
+		print_hex_dump(KERN_WARNING, "head: ", DUMP_PREFIX_NONE, 32,
+			sizeof(unsigned long), head,
+			sizeof(struct page), false);
 
 	if (reason)
 		pr_warn("page dumped because: %s\n", reason);
-- 
2.25.0



* [PATCH v6 12/12] mm: dump_page(): additional diagnostics for huge pinned pages
  2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (10 preceding siblings ...)
  2020-02-11  0:15 ` [PATCH v6 11/12] mm: Improve dump_page() for compound pages John Hubbard
@ 2020-02-11  0:15 ` John Hubbard
  2020-02-11 13:21   ` Kirill A. Shutemov
  11 siblings, 1 reply; 28+ messages in thread
From: John Hubbard @ 2020-02-11  0:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Christoph Hellwig, Dan Williams, Dave Chinner,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, John Hubbard, Kirill A . Shutemov

As part of pin_user_pages() and related API calls, pages are
"dma-pinned". For the case of compound pages of order > 1, the per-page
accounting of dma pins is accomplished via the 3rd struct page in the
compound page. In order to support debugging of any pin_user_pages()-
related problems, enhance dump_page() so as to report the pin count
in that case.

Documentation/core-api/pin_user_pages.rst is also updated accordingly.
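
With this in place, dump_page() on a pinned THP emits a line roughly
like the following (all values are made up for illustration; only the
format comes from this patch):

    page:ffffea0004510000 refcount:515 mapcount:1 mapping:00000000f1a2b3c4 index:0x7f3d40200 head:ffffea0004510000 order:9 compound_mapcount:1 compound_pincount:512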

Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/core-api/pin_user_pages.rst |  7 +++++++
 mm/debug.c                                | 21 ++++++++++++++++-----
 2 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 5c8a5f89756b..2e939ff10b86 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -238,6 +238,13 @@ long-term [R]DMA pins in place, or during pin/unpin transitions.
 (...unless it was already out of balance due to a long-term RDMA pin being in
 place.)
 
+Other diagnostics
+=================
+
+dump_page() has been enhanced slightly, to handle these new counting fields, and
+to better report on compound pages in general. Specifically, for compound pages
+with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
+
 References
 ==========
 
diff --git a/mm/debug.c b/mm/debug.c
index f5ffb0784559..2189357f0987 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -85,11 +85,22 @@ void __dump_page(struct page *page, const char *reason)
 	mapcount = PageSlab(head) ? 0 : page_mapcount(page);
 
 	if (compound)
-		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
-			"index:%#lx head:%px order:%u compound_mapcount:%d\n",
-			page, page_ref_count(head), mapcount,
-			mapping, page_to_pgoff(page), head,
-			compound_order(head), compound_mapcount(page));
+		if (hpage_pincount_available(page)) {
+			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
+				"index:%#lx head:%px order:%u "
+				"compound_mapcount:%d compound_pincount:%d\n",
+				page, page_ref_count(head), mapcount,
+				mapping, page_to_pgoff(page), head,
+				compound_order(head), compound_mapcount(page),
+				compound_pincount(page));
+		} else {
+			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
+				"index:%#lx head:%px order:%u "
+				"compound_mapcount:%d\n",
+				page, page_ref_count(head), mapcount,
+				mapping, page_to_pgoff(page), head,
+				compound_order(head), compound_mapcount(page));
+		}
 	else
 		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n",
 			page, page_ref_count(page), mapcount,
-- 
2.25.0



* Re: [PATCH v6 12/12] mm: dump_page(): additional diagnostics for huge pinned pages
  2020-02-11  0:15 ` [PATCH v6 12/12] mm: dump_page(): additional diagnostics for huge pinned pages John Hubbard
@ 2020-02-11 13:21   ` Kirill A. Shutemov
  2020-02-12  2:10     ` John Hubbard
  0 siblings, 1 reply; 28+ messages in thread
From: Kirill A. Shutemov @ 2020-02-11 13:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Christoph Hellwig, Dan Williams,
	Dave Chinner, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jonathan Corbet, Jérôme Glisse, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, Kirill A . Shutemov

On Mon, Feb 10, 2020 at 04:15:36PM -0800, John Hubbard wrote:
> As part of pin_user_pages() and related API calls, pages are
> "dma-pinned". For the case of compound pages of order > 1, the per-page
> accounting of dma pins is accomplished via the 3rd struct page in the
> compound page. In order to support debugging of any pin_user_pages()-
> related problems, enhance dump_page() so as to report the pin count
> in that case.
> 
> Documentation/core-api/pin_user_pages.rst is also updated accordingly.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  Documentation/core-api/pin_user_pages.rst |  7 +++++++
>  mm/debug.c                                | 21 ++++++++++++++++-----
>  2 files changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index 5c8a5f89756b..2e939ff10b86 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -238,6 +238,13 @@ long-term [R]DMA pins in place, or during pin/unpin transitions.
>  (...unless it was already out of balance due to a long-term RDMA pin being in
>  place.)
>  
> +Other diagnostics
> +=================
> +
> +dump_page() has been enhanced slightly, to handle these new counting fields, and
> +to better report on compound pages in general. Specifically, for compound pages
> +with order > 1, the exact (hpage_pinned_refcount) pincount is reported.
> +
>  References
>  ==========
>  
> diff --git a/mm/debug.c b/mm/debug.c
> index f5ffb0784559..2189357f0987 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -85,11 +85,22 @@ void __dump_page(struct page *page, const char *reason)
>  	mapcount = PageSlab(head) ? 0 : page_mapcount(page);
>  
>  	if (compound)
> -		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
> -			"index:%#lx head:%px order:%u compound_mapcount:%d\n",
> -			page, page_ref_count(head), mapcount,
> -			mapping, page_to_pgoff(page), head,
> -			compound_order(head), compound_mapcount(page));
> +		if (hpage_pincount_available(page)) {
> +			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
> +				"index:%#lx head:%px order:%u "
> +				"compound_mapcount:%d compound_pincount:%d\n",
> +				page, page_ref_count(head), mapcount,
> +				mapping, page_to_pgoff(page), head,
> +				compound_order(head), compound_mapcount(page),
> +				compound_pincount(page));
> +		} else {
> +			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
> +				"index:%#lx head:%px order:%u "
> +				"compound_mapcount:%d\n",
> +				page, page_ref_count(head), mapcount,
> +				mapping, page_to_pgoff(page), head,
> +				compound_order(head), compound_mapcount(page));
> +		}

Have you considered using pr_cont() here. I guess it would be easier to
read.

You can use my Ack anyway.


>  	else
>  		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n",
>  			page, page_ref_count(page), mapcount,
> -- 
> 2.25.0
> 

-- 
 Kirill A. Shutemov


* Re: [PATCH v6 12/12] mm: dump_page(): additional diagnostics for huge pinned pages
  2020-02-11 13:21   ` Kirill A. Shutemov
@ 2020-02-12  2:10     ` John Hubbard
  0 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2020-02-12  2:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Al Viro, Christoph Hellwig, Dan Williams,
	Dave Chinner, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jonathan Corbet, Jérôme Glisse, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	LKML, Kirill A . Shutemov

On 2/11/20 5:21 AM, Kirill A. Shutemov wrote:
...
>> diff --git a/mm/debug.c b/mm/debug.c
>> index f5ffb0784559..2189357f0987 100644
>> --- a/mm/debug.c
>> +++ b/mm/debug.c
>> @@ -85,11 +85,22 @@ void __dump_page(struct page *page, const char *reason)
>>  	mapcount = PageSlab(head) ? 0 : page_mapcount(page);
>>  
>>  	if (compound)
>> -		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
>> -			"index:%#lx head:%px order:%u compound_mapcount:%d\n",
>> -			page, page_ref_count(head), mapcount,
>> -			mapping, page_to_pgoff(page), head,
>> -			compound_order(head), compound_mapcount(page));
>> +		if (hpage_pincount_available(page)) {
>> +			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
>> +				"index:%#lx head:%px order:%u "
>> +				"compound_mapcount:%d compound_pincount:%d\n",
>> +				page, page_ref_count(head), mapcount,
>> +				mapping, page_to_pgoff(page), head,
>> +				compound_order(head), compound_mapcount(page),
>> +				compound_pincount(page));
>> +		} else {
>> +			pr_warn("page:%px refcount:%d mapcount:%d mapping:%p "
>> +				"index:%#lx head:%px order:%u "
>> +				"compound_mapcount:%d\n",
>> +				page, page_ref_count(head), mapcount,
>> +				mapping, page_to_pgoff(page), head,
>> +				compound_order(head), compound_mapcount(page));
>> +		}
> 
> Have you considered using pr_cont() here. I guess it would be easier to
> read.

Yes, and it does have the advantage of removing some of the code duplication above. 
On the other hand, though, it leaves the end result (the long lines being printed) 
the same, and introduces a window in which the output can get garbled by another 
thread that is printk'-ing. And actually, what I'd really like is to shorten the
printed output lines, as I mentioned in [1].

So overall, given that this series has been fairly difficult to get finalized, 
and it's now in Andrew's tree at last, I'd *really* like to leave it as-is right 
now, and build on top of it. So I will submit a follow-on patch to formally propose
shortening the printed lines, and that can live or die independently of this series,
which is hopefully over now.

> 
> You can use my Ack anyway.


Thanks, and I appreciate all of your reviews and bug spotting and ideas for improvements 
on this series, it's been really helpful.


> 
> 
>>  	else
>>  		pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n",
>>  			page, page_ref_count(page), mapcount,
>> -- 
>> 2.25.0
>>
> 


[1] https://lore.kernel.org/r/96e1f693-0e7b-2817-f13d-1946ff7654a1@nvidia.com

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v6 08/12] mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting
  2020-02-11  0:15 ` [PATCH v6 08/12] mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting John Hubbard
@ 2020-02-12  9:17   ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2020-02-12  9:17 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Christoph Hellwig, Dan Williams,
	Dave Chinner, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jonathan Corbet, Jérôme Glisse, Kirill A . Shutemov,
	Michal Hocko, Mike Kravetz, Shuah Khan, Vlastimil Babka,
	Matthew Wilcox, linux-doc, linux-fsdevel, linux-kselftest,
	linux-rdma, linux-mm, LKML, Kirill A . Shutemov

On Mon 10-02-20 16:15:32, John Hubbard wrote:
> Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
> unpin_user_pages*(), we need some visibility into whether all of this is
> working correctly.
> 
> Add two new fields to /proc/vmstat:
> 
>     nr_foll_pin_acquired
>     nr_foll_pin_released
> 
> These are documented in Documentation/core-api/pin_user_pages.rst.
> They represent the number of pages (since boot time) that have been
> pinned ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"),
> via pin_user_pages*() and unpin_user_pages*().
> 
> In the absence of long-running DMA or RDMA operations that hold pages
> pinned, the above two fields will normally be equal to each other.
> 
> Also: update Documentation/core-api/pin_user_pages.rst, to remove an
> earlier (now confirmed untrue) claim about a performance problem with
> /proc/vmstat.
> 
> Also: updated Documentation/core-api/pin_user_pages.rst to rename the
> new /proc/vmstat entries, to the names listed here.
> 
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

The patch looks good to me now. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  Documentation/core-api/pin_user_pages.rst | 33 +++++++++++++++++++----
>  include/linux/mmzone.h                    |  2 ++
>  mm/gup.c                                  | 13 +++++++++
>  mm/vmstat.c                               |  2 ++
>  4 files changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
> index 7e5dd8b1b3f2..5c8a5f89756b 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -208,12 +208,35 @@ has the following new calls to exercise the new pin*() wrapper functions:
>  You can monitor how many total dma-pinned pages have been acquired and released
>  since the system was booted, via two new /proc/vmstat entries: ::
>  
> -    /proc/vmstat/nr_foll_pin_requested
> -    /proc/vmstat/nr_foll_pin_requested
> +    /proc/vmstat/nr_foll_pin_acquired
> +    /proc/vmstat/nr_foll_pin_released
>  
> -Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
> -because there is a noticeable performance drop in unpin_user_page(), when they
> -are activated.
> +Under normal conditions, these two values will be equal unless there are any
> +long-term [R]DMA pins in place, or during pin/unpin transitions.
> +
> +* nr_foll_pin_acquired: This is the number of logical pins that have been
> +  acquired since the system was powered on. For huge pages, the head page is
> +  pinned once for each page (head page and each tail page) within the huge page.
> +  This follows the same sort of behavior that get_user_pages() uses for huge
> +  pages: the head page is refcounted once for each tail or head page in the huge
> +  page, when get_user_pages() is applied to a huge page.
> +
> +* nr_foll_pin_released: The number of logical pins that have been released since
> +  the system was powered on. Note that pages are released (unpinned) on a
> +  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
> +  Becaused of the pin count behavior described above in "nr_foll_pin_acquired",
> +  the accounting balances out, so that after doing this::
> +
> +    pin_user_pages(huge_page);
> +    for (each page in huge_page)
> +        unpin_user_page(page);
> +
> +...the following is expected::
> +
> +    nr_foll_pin_released == nr_foll_pin_acquired
> +
> +(...unless it was already out of balance due to a long-term RDMA pin being in
> +place.)
>  
>  References
>  ==========
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 462f6873905a..4bca42eeb439 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -243,6 +243,8 @@ enum node_stat_item {
>  	NR_DIRTIED,		/* page dirtyings since bootup */
>  	NR_WRITTEN,		/* page writings since bootup */
>  	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
> +	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
> +	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
>  	NR_VM_NODE_STAT_ITEMS
>  };
>  
> diff --git a/mm/gup.c b/mm/gup.c
> index 4d0d94405639..441f7a48f370 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -86,6 +86,8 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
>  	if (flags & FOLL_GET)
>  		return try_get_compound_head(page, refs);
>  	else if (flags & FOLL_PIN) {
> +		int orig_refs = refs;
> +
>  		/*
>  		 * When pinning a compound page of order > 1 (which is what
>  		 * hpage_pincount_available() checks for), use an exact count to
> @@ -104,6 +106,9 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page,
>  		if (hpage_pincount_available(page))
>  			hpage_pincount_add(page, refs);
>  
> +		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
> +				    orig_refs);
> +
>  		return page;
>  	}
>  
> @@ -158,6 +163,8 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags)
>  		 * once, so that the page really is pinned.
>  		 */
>  		page_ref_add(page, refs);
> +
> +		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 1);
>  	}
>  
>  	return true;
> @@ -178,6 +185,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>  
>  	count = page_ref_sub_return(page, refs);
>  
> +	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
>  	/*
>  	 * devmap page refcounts are 1-based, rather than 0-based: if
>  	 * refcount is 1, then the page is free and the refcount is
> @@ -228,6 +236,8 @@ void unpin_user_page(struct page *page)
>  
>  	if (page_ref_sub_and_test(page, refs))
>  		__put_page(page);
> +
> +	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
>  }
>  EXPORT_SYMBOL(unpin_user_page);
>  
> @@ -2259,6 +2269,9 @@ static int record_subpages(struct page *page, unsigned long addr,
>  static void put_compound_head(struct page *page, int refs, unsigned int flags)
>  {
>  	if (flags & FOLL_PIN) {
> +		mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED,
> +				    refs);
> +
>  		if (hpage_pincount_available(page))
>  			hpage_pincount_sub(page, refs);
>  		else
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 78d53378db99..c9c0d71f917f 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1168,6 +1168,8 @@ const char * const vmstat_text[] = {
>  	"nr_dirtied",
>  	"nr_written",
>  	"nr_kernel_misc_reclaimable",
> +	"nr_foll_pin_acquired",
> +	"nr_foll_pin_released",
>  
>  	/* enum writeback_stat_item counters */
>  	"nr_dirty_threshold",
> -- 
> 2.25.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [regression] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-02-11  0:15 ` [PATCH v6 06/12] mm/gup: track FOLL_PIN pages John Hubbard
@ 2020-04-24 18:18   ` Alex Williamson
  2020-04-24 19:20     ` John Hubbard
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2020-04-24 18:18 UTC (permalink / raw)
  To: John Hubbard, LKML
  Cc: Andrew Morton, Al Viro, Christoph Hellwig, Dan Williams,
	Dave Chinner, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jonathan Corbet, Jérôme Glisse, Kirill A . Shutemov,
	Michal Hocko, Mike Kravetz, Shuah Khan, Vlastimil Babka,
	Matthew Wilcox, linux-doc, linux-fsdevel, linux-kselftest,
	linux-rdma, linux-mm, Kirill A . Shutemov

[-- Attachment #1: Type: text/plain, Size: 6259 bytes --]

On Mon, 10 Feb 2020 16:15:30 -0800
John Hubbard <jhubbard@nvidia.com> wrote:

> Add tracking of pages that were pinned via FOLL_PIN. This tracking is
> implemented via overloading of page->_refcount: pins are added by
> adding GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a
> fuzzy indication of pinning, and it can have false positives (and that's
> OK). Please see the pre-existing
> Documentation/core-api/pin_user_pages.rst for details.
> 
> As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
> (typically via pin_user_pages*()) are required to ultimately free such
> pages via unpin_user_page().
> 
> Please also note the limitation, discussed in pin_user_pages.rst under
> the "TODO: for 1GB and larger huge pages" section. (That limitation will
> be removed in a following patch.)
> 
> The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
> thought of as "FOLL_GET for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_maybe_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page is left to later
> patchsets. There is discussion about this in [1], [2], [3], and [4].
> 
> This also changes a BUG_ON() to a WARN_ON() in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
>     https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
>     https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
>     https://lwn.net/Articles/753027/
> [4] LWN kernel index: get_user_pages():
>     https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  Documentation/core-api/pin_user_pages.rst |   6 +-
>  include/linux/mm.h                        |  82 +++++--
>  mm/gup.c                                  | 254 +++++++++++++++++-----
>  mm/huge_memory.c                          |  29 ++-
>  mm/hugetlb.c                              |  54 +++--
>  5 files changed, 334 insertions(+), 91 deletions(-)
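
For illustration, a minimal sketch of the fuzzy check described in the
commit message above, using only what the text states (a 1024 bias added
to page->_refcount); the exact-count handling for huge pages is omitted
and these names are illustrative, not the patch's own:

/*
 * Sketch only: FOLL_PIN adds GUP_PIN_COUNTING_BIAS to the page refcount,
 * so "maybe pinned" is a simple threshold test.  False positives
 * (heavily referenced but unpinned pages) are expected and acceptable.
 */
#define GUP_PIN_COUNTING_BIAS_SKETCH	1024

static inline bool page_maybe_dma_pinned_sketch(struct page *page)
{
	return page_ref_count(page) >= GUP_PIN_COUNTING_BIAS_SKETCH;
}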

Hi John,

I'm seeing a regression bisected back to this commit (3faa52c03f44
mm/gup: track FOLL_PIN pages).  I've attached some vfio-pci test code
that reproduces this by mmap'ing a page of MMIO space of a device and
then tries to map that through the IOMMU, so this should be attempting
a gup/pin of a PFNMAP page.  Previously this failed gracefully (-EFAULT),
but now results in:

BUG: unable to handle page fault for address: ffffae5cbfe5e938
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP NOPTI
CPU: 18 PID: 3365 Comm: vfio-pci-dma-ma Tainted: G           OE     5.6.0+ #6
Hardware name: AMD Corporation Diesel/Diesel, BIOS TDL100CB 03/17/2020
RIP: 0010:get_pfnblock_flags_mask+0x22/0x70
Code: c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 8b 05 bc e1 d9 01 48 89 f7 49 89 c8 48 c1 ef 0f 48 85 c0 74 48 48 89 f1 48 c1 e9 17 <48> 8b 04 c8 48 85 c0 74 0b 40 0f b6 ff 48 c1 e7 04 48 01 f8 48 c1
RSP: 0018:ffffb55289b3fcc8 EFLAGS: 00010216
RAX: ffff9e5cbff50000 RBX: 0000000000000001 RCX: 000001fffffe1d27
RDX: 0000000000000002 RSI: ffffff0e93acd633 RDI: 0001fffffe1d2759
RBP: ffffb55289b3fd88 R08: 0000000000000007 R09: ffff9e48a52476a8
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000001 R15: ffff9e48ab358cc0
FS:  00007f4ef7269740(0000) GS:ffff9e48afa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffae5cbfe5e938 CR3: 0000000c61eda000 CR4: 00000000003406e0
Call Trace:
 __gup_longterm_locked+0x274/0x620
 vaddr_get_pfn+0x74/0x110 [vfio_iommu_type1]
 vfio_pin_pages_remote+0x6e/0x370 [vfio_iommu_type1]
 vfio_iommu_type1_ioctl+0x8e5/0xaac [vfio_iommu_type1]
 ksys_ioctl+0x86/0xc0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x5b/0x1f0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4ef6d7d307
Code: 44 00 00 48 8b 05 69 1b 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 1b 2d 00 f7 d8 64 89 01 48
RSP: 002b:00007fff76ada738 EFLAGS: 00000213 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4ef6d7d307
RDX: 00007fff76ada760 RSI: 0000000000003b71 RDI: 0000000000000003
RBP: 00007fff76ada930 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000213 R12: 0000000000400950
R13: 00007fff76adaa10 R14: 0000000000000000 R15: 0000000000000000
Modules linked in: vfio_pci(OE) vfio_virqfd(OE) vfio_iommu_type1(OE) vfio(OE) amd64_edac_mod edac_mce_amd kvm_amd kvm rfkill sunrpc ipmi_ssif vfat irqbypass fat ipmi_si crct10dif_pclmul crc32_pclmul sp5100_tco ghash_clmulni_intel ipmi_devintf pcspkr joydev ccp i2c_piix4 k10temp ipmi_msghandler pinctrl_amd acpi_cpufreq ip_tables nouveau ast video mxm_wmi drm_vram_helper wmi drm_ttm_helper i2c_algo_bit drm_kms_helper cec ttm drm i40e e1000e crc32c_intel
CR2: ffffae5cbfe5e938
---[ end trace a384ab7cc8e37d46 ]---
RIP: 0010:get_pfnblock_flags_mask+0x22/0x70
Code: c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 8b 05 bc e1 d9 01 48 89 f7 49 89 c8 48 c1 ef 0f 48 85 c0 74 48 48 89 f1 48 c1 e9 17 <48> 8b 04 c8 48 85 c0 74 0b 40 0f b6 ff 48 c1 e7 04 48 01 f8 48 c1
RSP: 0018:ffffb55289b3fcc8 EFLAGS: 00010216
RAX: ffff9e5cbff50000 RBX: 0000000000000001 RCX: 000001fffffe1d27
RDX: 0000000000000002 RSI: ffffff0e93acd633 RDI: 0001fffffe1d2759
RBP: ffffb55289b3fd88 R08: 0000000000000007 R09: ffff9e48a52476a8
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000001 R15: ffff9e48ab358cc0
FS:  00007f4ef7269740(0000) GS:ffff9e48afa80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffae5cbfe5e938 CR3: 0000000c61eda000 CR4: 00000000003406e0

Thanks,
Alex

[-- Attachment #2: vfio-pci-dma-map-mmio.c --]
[-- Type: text/x-c++src, Size: 4876 bytes --]

#include <errno.h>
#include <libgen.h>
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>

#include <linux/ioctl.h>
#include <linux/vfio.h>
#include <linux/pci_regs.h>

void usage(char *name)
{
	fprintf(stderr, "usage: %s <ssss:bb:dd.f>\n", name);
	fprintf(stderr, "\tssss: PCI segment, ex. 0000\n");
	fprintf(stderr, "\tbb:   PCI bus, ex. 01\n");
	fprintf(stderr, "\tdd:   PCI device, ex. 06\n");
	fprintf(stderr, "\tf:    PCI function, ex. 0\n");
}

int main(int argc, char **argv)
{
	int seg, bus, slot, func;
	int ret, container, group, device, groupid;
	char path[50], iommu_group_path[50], *group_name;
	struct stat st;
	ssize_t len;
	void *map = MAP_FAILED;
	int i;
	unsigned int bar;
	struct vfio_group_status group_status = {
		.argsz = sizeof(group_status)
	};
	struct vfio_region_info region_info = {
		.argsz = sizeof(region_info)
	};
	struct vfio_region_info config_info = {
		.argsz = sizeof(config_info)
	};
	struct vfio_iommu_type1_dma_map dma_map = {
		.argsz = sizeof(dma_map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
	};

	if (argc != 2) {
		usage(argv[0]);
		return -1;
	}

	ret = sscanf(argv[1], "%04x:%02x:%02x.%d", &seg, &bus, &slot, &func);
	if (ret != 4) {
		fprintf(stderr, "Invalid device\n");
		usage(argv[0]);
		return -1;
	}

	/* Boilerplate vfio setup */
	container = open("/dev/vfio/vfio", O_RDWR);
	if (container < 0) {
		fprintf(stderr, "Failed to open /dev/vfio/vfio, %d (%s)\n",
		       container, strerror(errno));
		return container;
	}

	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/",
		 seg, bus, slot, func);

	ret = stat(path, &st);
	if (ret < 0) {
		fprintf(stderr, "No such device\n");
		return ret;
	}

	strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1);

	len = readlink(path, iommu_group_path, sizeof(iommu_group_path));
	if (len <= 0) {
		fprintf(stderr, "No iommu_group for device\n");
		return -1;
	}

	iommu_group_path[len] = 0;
	group_name = basename(iommu_group_path);

	if (sscanf(group_name, "%d", &groupid) != 1) {
		fprintf(stderr, "Unknown group\n");
		return -1;
	}

	snprintf(path, sizeof(path), "/dev/vfio/%d", groupid);
	group = open(path, O_RDWR);
	if (group < 0) {
		fprintf(stderr, "Failed to open %s, %d (%s)\n",
		       path, group, strerror(errno));
		return group;
	}

	ret = ioctl(group, VFIO_GROUP_GET_STATUS, &group_status);
	if (ret) {
		fprintf(stderr, "ioctl(VFIO_GROUP_GET_STATUS) failed\n");
		return ret;
	}

	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
		fprintf(stderr,
			"Group not viable, all devices attached to vfio?\n");
		return -1;
	}

	ret = ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	if (ret) {
		fprintf(stderr, "Failed to set group container\n");
		return ret;
	}

	ret = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
	if (ret) {
		fprintf(stderr, "Failed to set IOMMU\n");
		return ret;
	}

	snprintf(path, sizeof(path), "%04x:%02x:%02x.%d", seg, bus, slot, func);

	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, path);
	if (device < 0) {
		fprintf(stderr, "Failed to get device\n");
		return -ENODEV;
	}

	config_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
	ret = ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &config_info);
	if (ret) {
		fprintf(stderr, "Failed to get config space region info\n");
		return ret;
	}

	for (i = 0; i < 6; i++) {
		if (pread(device, &bar, sizeof(bar),
		          config_info.offset + PCI_BASE_ADDRESS_0 + (4 * i)) !=
		    sizeof(bar)) {
			fprintf(stderr, "Error reading BAR%d\n", i);
			return -errno;
		}

		if (!(bar & PCI_BASE_ADDRESS_SPACE)) {
			break;

			/*
			 * Reached via "goto tryagain" below when a memory BAR
			 * can't be mmap'd: skip the high half of a 64-bit BAR
			 * and keep scanning.
			 */
tryagain:
			if (bar & PCI_BASE_ADDRESS_MEM_TYPE_64)
				i++;
		}
	}

	if (i >= 6) {
		fprintf(stderr, "No memory BARs found\n");
		return -ENODEV;
	}

	region_info.index = VFIO_PCI_BAR0_REGION_INDEX + i;
	ret = ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &region_info);
	if (ret) {
		fprintf(stderr, "Failed to get BAR%d region info\n", i);
		return ret;
	}
  
	if (!(region_info.flags & VFIO_REGION_INFO_FLAG_MMAP)) {
		printf("No mmap support, try next\n");
		goto tryagain;
	}

	if (region_info.size < getpagesize()) {
		printf("Too small for mmap, try next\n");
		goto tryagain;
	}

	map = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
		   MAP_SHARED, device, region_info.offset);
	if (map == MAP_FAILED) {
		fprintf(stderr, "Error mmap'ing BAR: %m\n");
		goto tryagain;
	}

	dma_map.size = getpagesize();
	dma_map.vaddr = (__u64)map;
	dma_map.iova = 1024 * 1024 * 1024; /* 1GB IOVA, arbitrary */

	ret = ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);
	if (ret) {
		fprintf(stderr, "Failed to DMA map: %m\n");
		return ret;
	}
		
	printf("Passed\n");
	return 0;
}

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-24 18:18   ` [regression] " Alex Williamson
@ 2020-04-24 19:20     ` John Hubbard
  2020-04-24 20:15       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: John Hubbard @ 2020-04-24 19:20 UTC (permalink / raw)
  To: Alex Williamson, LKML
  Cc: Andrew Morton, Al Viro, Christoph Hellwig, Dan Williams,
	Dave Chinner, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jonathan Corbet, Jérôme Glisse, Kirill A . Shutemov,
	Michal Hocko, Mike Kravetz, Shuah Khan, Vlastimil Babka,
	Matthew Wilcox, linux-doc, linux-fsdevel, linux-kselftest,
	linux-rdma, linux-mm, Kirill A . Shutemov

On 2020-04-24 11:18, Alex Williamson wrote:
...
> Hi John,
> 
> I'm seeing a regression bisected back to this commit (3faa52c03f44
> mm/gup: track FOLL_PIN pages).  I've attached some vfio-pci test code
> that reproduces this by mmap'ing a page of MMIO space of a device and
> then tries to map that through the IOMMU, so this should be attempting
> a gup/pin of a PFNMAP page.  Previously this failed gracefully (-EFAULT),
> but now results in:


Hi Alex,

Thanks for this report, and especially for source code to test it, 
seeing as how I can't immediately spot the problem just from the crash
data so far.  I'll get set up and attempt a repro.

Actually this looks like it should be relatively easier than the usual 
sort of "oops, we leaked a pin_user_pages() or unpin_user_pages() call,
good luck finding which one" report that I fear the most. :) This one 
looks more like a crash that happens directly, when calling into the 
pin_user_pages_remote() code. Which should be a lot easier to solve...

btw, if you are set up for it, it would be nice to know what source file 
and line number corresponds to the RIP (get_pfnblock_flags_mask+0x22) 
below. But if not, no problem, because I've likely got to do the repro 
in any case.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-24 19:20     ` John Hubbard
@ 2020-04-24 20:15       ` Alex Williamson
  2020-04-24 22:58         ` John Hubbard
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2020-04-24 20:15 UTC (permalink / raw)
  To: John Hubbard
  Cc: LKML, Andrew Morton, Al Viro, Christoph Hellwig, Dan Williams,
	Dave Chinner, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jonathan Corbet, Jérôme Glisse, Kirill A . Shutemov,
	Michal Hocko, Mike Kravetz, Shuah Khan, Vlastimil Babka,
	Matthew Wilcox, linux-doc, linux-fsdevel, linux-kselftest,
	linux-rdma, linux-mm, Kirill A . Shutemov

On Fri, 24 Apr 2020 12:20:03 -0700
John Hubbard <jhubbard@nvidia.com> wrote:

> On 2020-04-24 11:18, Alex Williamson wrote:
> ...
> > Hi John,
> > 
> > I'm seeing a regression bisected back to this commit (3faa52c03f44
> > mm/gup: track FOLL_PIN pages).  I've attached some vfio-pci test code
> > that reproduces this by mmap'ing a page of MMIO space of a device and
> > then tries to map that through the IOMMU, so this should be attempting
> > a gup/pin of a PFNMAP page.  Previously this failed gracefully (-EFAULT),
> > but now results in:  
> 
> 
> Hi Alex,
> 
> Thanks for this report, and especially for source code to test it, 
> seeing as how I can't immediately spot the problem just from the crash
> data so far.  I'll get set up and attempt a repro.
> 
> Actually this looks like it should be relatively easier than the usual 
> sort of "oops, we leaked a pin_user_pages() or unpin_user_pages() call,
> good luck finding which one" report that I fear the most. :) This one 
> looks more like a crash that happens directly, when calling into the 
> pin_user_pages_remote() code. Which should be a lot easier to solve...
> 
> btw, if you are set up for it, it would be nice to know what source file 
> and line number corresponds to the RIP (get_pfnblock_flags_mask+0x22) 
> below. But if not, no problem, because I've likely got to do the repro 
> in any case.

Hey John,

TBH I'm feeling a lot less confident about this bisect.  This was
readily reproducible to me on a clean tree a bit ago, but now it
eludes me.  Let me go back and figure out what's going on before you
spend any more time on it.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-24 20:15       ` Alex Williamson
@ 2020-04-24 22:58         ` John Hubbard
  2020-04-28 16:54           ` [regression?] " Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: John Hubbard @ 2020-04-24 22:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: LKML, Andrew Morton, Al Viro, Christoph Hellwig, Dan Williams,
	Dave Chinner, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jonathan Corbet, Jérôme Glisse, Kirill A . Shutemov,
	Michal Hocko, Mike Kravetz, Shuah Khan, Vlastimil Babka,
	Matthew Wilcox, linux-doc, linux-fsdevel, linux-kselftest,
	linux-rdma, linux-mm, Kirill A . Shutemov

On 2020-04-24 13:15, Alex Williamson wrote:
> On Fri, 24 Apr 2020 12:20:03 -0700
> John Hubbard <jhubbard@nvidia.com> wrote:
> 
>> On 2020-04-24 11:18, Alex Williamson wrote:
>> ...
>>> Hi John,
>>>
>>> I'm seeing a regression bisected back to this commit (3faa52c03f44
>>> mm/gup: track FOLL_PIN pages).  I've attached some vfio-pci test code
>>> that reproduces this by mmap'ing a page of MMIO space of a device and
>>> then tries to map that through the IOMMU, so this should be attempting
>>> a gup/pin of a PFNMAP page.  Previously this failed gracefully (-EFAULT),
>>> but now results in:
>>
>>
>> Hi Alex,
>>
>> Thanks for this report, and especially for source code to test it,
>> seeing as how I can't immediately spot the problem just from the crash
>> data so far.  I'll get set up and attempt a repro.
>>
>> Actually this looks like it should be relatively easier than the usual
>> sort of "oops, we leaked a pin_user_pages() or unpin_user_pages() call,
>> good luck finding which one" report that I fear the most. :) This one
>> looks more like a crash that happens directly, when calling into the
>> pin_user_pages_remote() code. Which should be a lot easier to solve...
>>
>> btw, if you are set up for it, it would be nice to know what source file
>> and line number corresponds to the RIP (get_pfnblock_flags_mask+0x22)
>> below. But if not, no problem, because I've likely got to do the repro
>> in any case.
> 
> Hey John,
> 
> TBH I'm feeling a lot less confident about this bisect.  This was
> readily reproducible to me on a clean tree a bit ago, but now it
> eludes me.  Let me go back and figure out what's going on before you
> spend any more time on it.  Thanks,
> 

OK. But I'm keeping the repro program! :)  It made it quick and easy to
set up a vfio test, so it was worth doing in any case.

Anyway, I wanted to double check this just out of paranoia, and so
now I have a data point for you: your test program runs and passes for
me using today's linux.git kernel, with an NVIDIA GPU as the vfio
device, and the kernel log is clean. No hint of any problems.

I traced it a little bit:

# sudo bpftrace -e 'kprobe:__get_user_pages { @[kstack()] = count(); }'
Attaching 1 probe...
^C
...
@[
     __get_user_pages+1
     __gup_longterm_locked+176
     vaddr_get_pfn+104
     vfio_pin_pages_remote+113
     vfio_dma_do_map+760
     vfio_iommu_type1_ioctl+761
     ksys_ioctl+135
     __x64_sys_ioctl+22
     do_syscall_64+90
     entry_SYSCALL_64_after_hwframe+73
]: 1

...and also verified that it's not actually pinning any pages with that
path:

$ grep foll_pin /proc/vmstat
nr_foll_pin_acquired 0
nr_foll_pin_released 0


Good luck and let me know if it starts pointing to FOLL_PIN or gup, etc.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-24 22:58         ` John Hubbard
@ 2020-04-28 16:54           ` Alex Williamson
  2020-04-28 17:49             ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2020-04-28 16:54 UTC (permalink / raw)
  To: John Hubbard
  Cc: LKML, Andrew Morton, Al Viro, Christoph Hellwig, Dan Williams,
	Dave Chinner, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jonathan Corbet, Jérôme Glisse, Kirill A . Shutemov,
	Michal Hocko, Mike Kravetz, Shuah Khan, Vlastimil Babka,
	Matthew Wilcox, linux-doc, linux-fsdevel, linux-kselftest,
	linux-rdma, linux-mm, Kirill A . Shutemov

[-- Attachment #1: Type: text/plain, Size: 5876 bytes --]

On Fri, 24 Apr 2020 15:58:29 -0700
John Hubbard <jhubbard@nvidia.com> wrote:

> On 2020-04-24 13:15, Alex Williamson wrote:
> > On Fri, 24 Apr 2020 12:20:03 -0700
> > John Hubbard <jhubbard@nvidia.com> wrote:
> >   
> >> On 2020-04-24 11:18, Alex Williamson wrote:
> >> ...  
> >>> Hi John,
> >>>
> >>> I'm seeing a regression bisected back to this commit (3faa52c03f44
> >>> mm/gup: track FOLL_PIN pages).  I've attached some vfio-pci test code
> >>> that reproduces this by mmap'ing a page of MMIO space of a device and
> >>> then tries to map that through the IOMMU, so this should be attempting
> >>> a gup/pin of a PFNMAP page.  Previously this failed gracefully (-EFAULT),
> >>> but now results in:  
> >>
> >>
> >> Hi Alex,
> >>
> >> Thanks for this report, and especially for source code to test it,
> >> seeing as how I can't immediately spot the problem just from the crash
> >> data so far.  I'll get set up and attempt a repro.
> >>
> >> Actually this looks like it should be relatively easier than the usual
> >> sort of "oops, we leaked a pin_user_pages() or unpin_user_pages() call,
> >> good luck finding which one" report that I fear the most. :) This one
> >> looks more like a crash that happens directly, when calling into the
> >> pin_user_pages_remote() code. Which should be a lot easier to solve...
> >>
> >> btw, if you are set up for it, it would be nice to know what source file
> >> and line number corresponds to the RIP (get_pfnblock_flags_mask+0x22)
> >> below. But if not, no problem, because I've likely got to do the repro
> >> in any case.  
> > 
> > Hey John,
> > 
> > TBH I'm feeling a lot less confident about this bisect.  This was
> > readily reproducible to me on a clean tree a bit ago, but now it
> > eludes me.  Let me go back and figure out what's going on before you
> > spend any more time on it.  Thanks,
> >   
> 
> OK. But I'm keeping the repro program! :)  It made it quick and easy to
> set up a vfio test, so it was worth doing in any case.

Great, because I've traced my steps, re-bisected, and come back to the
same upstream commit with the same test program.  The major difference
is that I thought I was seeing this on pure upstream, but some vfio
code that I'm trying to prepare for upstream snuck in, so this isn't a
pure upstream regression.  However, the changes I was making worked on
v5.6 and do not work with this commit.  Maybe still a latent regression,
maybe a bug in my changes.

> Anyway, I wanted to double check this just out of paranoia, and so
> now I have a data point for you: your test program runs and passes for
> me using today's linux.git kernel, with an NVIDIA GPU as the vfio
> device, and the kernel log is clean. No hint of any problems.

Yep, I agree.  The vfio change I'm experimenting with is to move the
remap_pfn_range() from vfio_pci_mmap() to a vm_ops.fault handler.  This
is why I have the test program creating an mmap of the device mmio
space and then immediately mapping it through the iommu without
touching it.  If the vma gets faulted in the dma mapping path via
pin_user_pages_remote(), I see the crash I reported initially.  If the
test program is modified to access the mmap before doing the dma
mapping, everything works normally.  In either case, the fault handler
is called and satisfies the fault with remap_pfn_range() and returns
VM_FAULT_NOPAGE (vfio patch attached).

Here's the crash I'm seeing with some further debugging:

BUG: unable to handle page fault for address: ffffa5b8bfe14f38
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP NOPTI
CPU: 70 PID: 3343 Comm: vfio-pci-dma-ma Not tainted 5.6.0-3faa52c03f44+ #20
Hardware name: AMD Corporation Diesel/Diesel, BIOS TDL100CB 03/17/2020
RIP: 0010:get_pfnblock_flags_mask+0x22/0x70
Code: c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 48 8b 05 bc e1 d9 01 48 89 f7 49 89 c8 48 c1 ef 0f 48 85 c0 74 48 48 89 f1 48 c1 e9 17 <48> 8b 04 c8 48 85 c0 74 0b 40 0f b6 ff 48 c1 e7 04 48 01 f8 48 c1
RSP: 0018:ffffb2e3c910fcc8 EFLAGS: 00010216
RAX: ffff95b8bff50000 RBX: 0000000000000001 RCX: 000001fffffd89e7
RDX: 0000000000000002 RSI: fffffec4f3a899ba RDI: 0001fffffd89e751
RBP: ffffb2e3c910fd88 R08: 0000000000000007 R09: ffff95a4aa79fce8
R10: 0000000000000000 R11: ffffb2e3c910f840 R12: 0000000100000000
R13: 0000000000000000 R14: 0000000000000001 R15: ffff959caa266e80
FS:  00007f1a95023740(0000) GS:ffff959caf180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffa5b8bfe14f38 CR3: 0000000462a1e000 CR4: 00000000003406e0
Call Trace:
 __gup_longterm_locked+0x274/0x620
 vaddr_get_pfn+0x74/0x110 [vfio_iommu_type1]
 vfio_pin_pages_remote+0x6e/0x370 [vfio_iommu_type1]
 vfio_iommu_type1_ioctl+0x8e5/0xaac [vfio_iommu_type1]

get_pfnblock_flags_mask+0x22/0x70 is here:

include/linux/mmzone.h:1254
static inline struct mem_section *__nr_to_section(unsigned long nr)
{
#ifdef CONFIG_SPARSEMEM_EXTREME
        if (!mem_section)
                return NULL;
#endif
====>   if (!mem_section[SECTION_NR_TO_ROOT(nr)])
                return NULL;
        return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
}

The call trace is:

__gup_longterm_locked
  check_and_migrate_cma_pages
    is_migrate_cma_page
      get_pageblock_migratetype
        get_pfnblock_flags_mask
          __get_pfnblock_flags_mask
            get_pageblock_bitmap
              __pfn_to_section
                __nr_to_section

Any ideas why we see this difference between the vma being faulted in
outside of the page pinning path versus faulted by it?
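
The crash above is consistent with an MMIO PFN (one with no struct page
behind it) reaching the pageblock helpers via
check_and_migrate_cma_pages(), which index mem_section[] by PFN.  Purely
to illustrate that reading (not a proposed fix), the distinguishing
check would be something like:

/*
 * Sketch: a PFN from a VM_PFNMAP/MMIO mapping may have no memory section
 * behind it, so nothing that indexes mem_section[] (as __nr_to_section()
 * does above) should ever see it.
 */
static bool example_pfn_is_struct_page_backed(unsigned long pfn)
{
	return pfn_valid(pfn);	/* typically false for an MMIO BAR pfn */
}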

FWIW, here's the stack dump for getting to the fault handler in the
latter case:

 vfio_pci_mmap_fault+0x22/0x130 [vfio_pci]
 __do_fault+0x38/0xd0
 __handle_mm_fault+0xd4b/0x1380
 handle_mm_fault+0xe2/0x1f0
 __get_user_pages+0x188/0x820
 __gup_longterm_locked+0xc8/0x620

Thanks!
Alex

[-- Attachment #2: vfio-fault.diff --]
[-- Type: text/x-patch, Size: 3854 bytes --]

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 379a02c36e37..3b8db2ef6247 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1195,6 +1195,76 @@ static ssize_t vfio_pci_write(void *device_data, const char __user *buf,
 	return vfio_pci_rw(device_data, (char __user *)buf, count, ppos, true);
 }
 
+static int vfio_pci_add_vma(struct vfio_pci_device *vdev,
+			    struct vm_area_struct *vma)
+{
+	struct vfio_pci_mmap_vma *mmap_vma;
+
+	mmap_vma = kzalloc(sizeof(*mmap_vma), GFP_KERNEL);
+	if (!mmap_vma)
+		return -ENOMEM;
+
+	mmap_vma->vma = vma;
+
+	mutex_lock(&vdev->vma_lock);
+	list_add(&mmap_vma->vma_next, &vdev->vma_list);
+	mutex_unlock(&vdev->vma_lock);
+
+	return 0;
+}
+
+/*
+ * Zap mmaps on open so that we can fault them in on access and therefore
+ * our vma_list only tracks mappings accessed since last zap.
+ */
+static void vfio_pci_mmap_open(struct vm_area_struct *vma)
+{
+	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
+}
+
+static void vfio_pci_mmap_close(struct vm_area_struct *vma)
+{
+	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_pci_mmap_vma *mmap_vma;
+
+	mutex_lock(&vdev->vma_lock);
+	list_for_each_entry(mmap_vma, &vdev->vma_list, vma_next) {
+		if (mmap_vma->vma == vma) {
+			list_del(&mmap_vma->vma_next);
+			kfree(mmap_vma);
+			break;
+		}
+	}
+	mutex_unlock(&vdev->vma_lock);
+}
+
+static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_device *vdev = vma->vm_private_data;
+
+	printk("Fault: %p\n", vma);
+	if (vfio_pci_add_vma(vdev, vma)) {
+		printk("SIGOOM\n");
+		return VM_FAULT_OOM;
+	}
+
+	if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
+			    vma->vm_end - vma->vm_start, vma->vm_page_prot)) {
+		printk("SIGBUS\n");
+		return VM_FAULT_SIGBUS;
+	}
+
+	printk("OK\n");
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vfio_pci_mmap_ops = {
+	.open = vfio_pci_mmap_open,
+	.close = vfio_pci_mmap_close,
+	.fault = vfio_pci_mmap_fault,
+};
+
 static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 {
 	struct vfio_pci_device *vdev = device_data;
@@ -1253,8 +1323,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
 
+	vma->vm_ops = &vfio_pci_mmap_ops;
+
+#if 1
+	return 0;
+#else
 	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
-			       req_len, vma->vm_page_prot);
+			       vma->vm_end - vma->vm_start, vma->vm_page_prot);
+#endif
 }
 
 static void vfio_pci_request(void *device_data, unsigned int count)
@@ -1330,6 +1406,8 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	spin_lock_init(&vdev->irqlock);
 	mutex_init(&vdev->ioeventfds_lock);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
+	mutex_init(&vdev->vma_lock);
+	INIT_LIST_HEAD(&vdev->vma_list);
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
 	if (ret) {
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8a2c7607d513..7faed79fe033 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -84,6 +84,11 @@ struct vfio_pci_reflck {
 	struct mutex		lock;
 };
 
+struct vfio_pci_mmap_vma {
+	struct vm_area_struct	*vma;
+	struct list_head	vma_next;
+};
+
 struct vfio_pci_device {
 	struct pci_dev		*pdev;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
@@ -122,6 +127,8 @@ struct vfio_pci_device {
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
+	struct mutex		vma_lock;
+	struct list_head	vma_list;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-28 16:54           ` [regression?] " Alex Williamson
@ 2020-04-28 17:49             ` Jason Gunthorpe
  2020-04-28 19:07               ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2020-04-28 17:49 UTC (permalink / raw)
  To: Alex Williamson
  Cc: John Hubbard, LKML, Andrew Morton, Al Viro, Christoph Hellwig,
	Dan Williams, Dave Chinner, Ira Weiny, Jan Kara, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	Kirill A . Shutemov

On Tue, Apr 28, 2020 at 10:54:55AM -0600, Alex Williamson wrote:
>  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  {
>  	struct vfio_pci_device *vdev = device_data;
> @@ -1253,8 +1323,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
>  
> +	vma->vm_ops = &vfio_pci_mmap_ops;
> +
> +#if 1
> +	return 0;
> +#else
>  	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> -			       req_len, vma->vm_page_prot);
> +			       vma->vm_end - vma->vm_start, vma->vm_page_prot);

The remap_pfn_range here is what tells get_user_pages this is a
non-struct page mapping:

	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;

Which has to be set when the VMA is created, they shouldn't be
modified during fault.
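
To make that concrete, a hedged sketch of an mmap handler that defers
remap_pfn_range() to .fault but still sets the flags up front (names
reuse those from the diff above; this is not the actual vfio code):

/*
 * Sketch: even when the PTEs are filled lazily from the fault handler,
 * the VMA must be marked as a raw PFN mapping at mmap time so that
 * get_user_pages()/pin_user_pages() see VM_PFNMAP from the start.
 */
static int example_mmap(void *device_data, struct vm_area_struct *vma)
{
	/* ... existing offset/size validation, vm_pgoff/vm_page_prot setup ... */

	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
	vma->vm_ops = &vfio_pci_mmap_ops;	/* .fault calls remap_pfn_range() later */

	return 0;
}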

Also the vma code above looked a little strange to me, if you do send
something like this cc me and I can look at it. I did some work like
this for rdma a while ago..

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-28 17:49             ` Jason Gunthorpe
@ 2020-04-28 19:07               ` Alex Williamson
  2020-04-28 19:22                 ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2020-04-28 19:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, LKML, Andrew Morton, Al Viro, Christoph Hellwig,
	Dan Williams, Dave Chinner, Ira Weiny, Jan Kara, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	Kirill A . Shutemov

On Tue, 28 Apr 2020 14:49:57 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Apr 28, 2020 at 10:54:55AM -0600, Alex Williamson wrote:
> >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> >  {
> >  	struct vfio_pci_device *vdev = device_data;
> > @@ -1253,8 +1323,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> >  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> >  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> >  
> > +	vma->vm_ops = &vfio_pci_mmap_ops;
> > +
> > +#if 1
> > +	return 0;
> > +#else
> >  	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > -			       req_len, vma->vm_page_prot);
> > +			       vma->vm_end - vma->vm_start, vma->vm_page_prot);  
> 
> The remap_pfn_range here is what tells get_user_pages this is a
> non-struct page mapping:
> 
> 	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> 
> Which has to be set when the VMA is created, they shouldn't be
> modified during fault.

Aha, thanks Jason!  So fundamentally, pin_user_pages_remote() should
never have been faulting in this vma since the pages are non-struct
page backed.  Maybe I was just getting lucky before this commit.  For a
VM_PFNMAP, vaddr_get_pfn() only needs pin_user_pages_remote() to return
error and the vma information that we setup in vfio_pci_mmap().  We
only need the fault handler to trigger for user access, which is what I
see with this change.  That should work for me.
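
A rough sketch of the fallback Alex describes, under the assumption that
pin_user_pages_remote() fails for the PFNMAP vma (illustrative only, not
the vaddr_get_pfn() source; a real version would also validate the pfn):

/*
 * Sketch: when gup/pin fails, derive the PFN from the vma that
 * vfio_pci_mmap() set up, while holding mmap_sem.
 */
static int example_pfnmap_fallback(struct mm_struct *mm, unsigned long vaddr,
				   unsigned long *pfn)
{
	struct vm_area_struct *vma;
	int ret = -EFAULT;

	down_read(&mm->mmap_sem);
	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
	if (vma && (vma->vm_flags & VM_PFNMAP)) {
		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
		ret = 0;
	}
	up_read(&mm->mmap_sem);
	return ret;
}

(As Jason explains below, carrying that PFN beyond mmap_sem, e.g. into
the IOMMU, is exactly the lifetime problem under discussion.)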

> Also the vma code above looked a little strange to me, if you do send
> something like this cc me and I can look at it. I did some work like
> this for rdma a while ago..

Cool, I'll do that.  I'd like to be able to zap the vmas from user
access at a later point and I have doubts that I'm holding the
refs/locks that I need to for that.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-28 19:07               ` Alex Williamson
@ 2020-04-28 19:22                 ` Jason Gunthorpe
  2020-04-28 20:12                   ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2020-04-28 19:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: John Hubbard, LKML, Andrew Morton, Al Viro, Christoph Hellwig,
	Dan Williams, Dave Chinner, Ira Weiny, Jan Kara, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-doc, linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	Kirill A . Shutemov

On Tue, Apr 28, 2020 at 01:07:52PM -0600, Alex Williamson wrote:
> On Tue, 28 Apr 2020 14:49:57 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Tue, Apr 28, 2020 at 10:54:55AM -0600, Alex Williamson wrote:
> > >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > >  {
> > >  	struct vfio_pci_device *vdev = device_data;
> > > @@ -1253,8 +1323,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > >  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > >  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> > >  
> > > +	vma->vm_ops = &vfio_pci_mmap_ops;
> > > +
> > > +#if 1
> > > +	return 0;
> > > +#else
> > >  	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > > -			       req_len, vma->vm_page_prot);
> > > +			       vma->vm_end - vma->vm_start, vma->vm_page_prot);  
> > 
> > The remap_pfn_range here is what tells get_user_pages this is a
> > non-struct page mapping:
> > 
> > 	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> > 
> > Which has to be set when the VMA is created, they shouldn't be
> > modified during fault.
> 
> Aha, thanks Jason!  So fundamentally, pin_user_pages_remote() should
> never have been faulting in this vma since the pages are non-struct
> page backed. 

gup should not try to pin them.. I think the VM will still call fault
though, not sure from memory?

> Maybe I was just getting lucky before this commit.  For a
> VM_PFNMAP, vaddr_get_pfn() only needs pin_user_pages_remote() to return
> error and the vma information that we setup in vfio_pci_mmap().

I've written on this before, vfio should not be passing pages to the
iommu that it cannot pin eg it should not touch VM_PFNMAP vma's in the
first place.

It is a use-after-free security issue the way it is..

> only need the fault handler to trigger for user access, which is what I
> see with this change.  That should work for me.
> 
> > Also the vma code above looked a little strange to me, if you do send
> > something like this cc me and I can look at it. I did some work like
> > this for rdma a while ago..
> 
> Cool, I'll do that.  I'd like to be able to zap the vmas from user
> access at a later point and I have doubts that I'm holding the
> refs/locks that I need to for that.  Thanks,

Check rdma_umap_ops, it does what you described (actually it replaces
them with 0 page, but along the way it zaps too).

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-28 19:22                 ` Jason Gunthorpe
@ 2020-04-28 20:12                   ` Alex Williamson
  2020-04-29  0:29                     ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2020-04-28 20:12 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-doc
  Cc: John Hubbard, LKML, Andrew Morton, Al Viro, Christoph Hellwig,
	Dan Williams, Dave Chinner, Ira Weiny, Jan Kara, Jonathan Corbet,
	Jérôme Glisse, Kirill A . Shutemov, Michal Hocko,
	Mike Kravetz, Shuah Khan, Vlastimil Babka, Matthew Wilcox,
	linux-fsdevel, linux-kselftest, linux-rdma, linux-mm,
	Kirill A . Shutemov

On Tue, 28 Apr 2020 16:22:51 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Apr 28, 2020 at 01:07:52PM -0600, Alex Williamson wrote:
> > On Tue, 28 Apr 2020 14:49:57 -0300
> > Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >   
> > > On Tue, Apr 28, 2020 at 10:54:55AM -0600, Alex Williamson wrote:  
> > > >  static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > > >  {
> > > >  	struct vfio_pci_device *vdev = device_data;
> > > > @@ -1253,8 +1323,14 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
> > > >  	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > > >  	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> > > >  
> > > > +	vma->vm_ops = &vfio_pci_mmap_ops;
> > > > +
> > > > +#if 1
> > > > +	return 0;
> > > > +#else
> > > >  	return remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
> > > > -			       req_len, vma->vm_page_prot);
> > > > +			       vma->vm_end - vma->vm_start, vma->vm_page_prot);    
> > > 
> > > The remap_pfn_range here is what tells get_user_pages this is a
> > > non-struct page mapping:
> > > 
> > > 	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
> > > 
> > > Which has to be set when the VMA is created, they shouldn't be
> > > modified during fault.  
> > 
> > Aha, thanks Jason!  So fundamentally, pin_user_pages_remote() should
> > never have been faulting in this vma since the pages are non-struct
> > page backed.   
> 
> gup should not try to pin them.. I think the VM will still call fault
> though, not sure from memory?

Hmm, at commit 3faa52c03f44 the behavior is that I don't see a fault on
pin, maybe that's a bug.  But trying to rebase to current top of tree,
now my DMA mapping gets an -EFAULT, so something is still funky :-\

> > Maybe I was just getting lucky before this commit.  For a
> > VM_PFNMAP, vaddr_get_pfn() only needs pin_user_pages_remote() to return
> > error and the vma information that we setup in vfio_pci_mmap().  
> 
> I've written on this before, vfio should not be passing pages to the
> iommu that it cannot pin eg it should not touch VM_PFNMAP vma's in the
> first place.
> 
> It is a use-after-free security issue the way it is..

Where is the use-after-free?  Here I'm trying to map device mmio space
through the iommu, which we need to enable p2p when the user owns
multiple devices.  The device is owned by the user, bound to vfio-pci,
and can't be unbound while the user has it open.  The iommu mappings
are torn down on release.  I guess I don't understand the problem.

> > only need the fault handler to trigger for user access, which is what I
> > see with this change.  That should work for me.
> >   
> > > Also the vma code above looked a little strange to me, if you do send
> > > something like this cc me and I can look at it. I did some work like
> > > this for rdma a while ago..  
> > 
> > Cool, I'll do that.  I'd like to be able to zap the vmas from user
> > access at a later point and I have doubts that I'm holding the
> > refs/locks that I need to for that.  Thanks,  
> 
> Check rdma_umap_ops, it does what you described (actually it replaces
> them with 0 page, but along the way it zaps too).

Ok, thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-28 20:12                   ` Alex Williamson
@ 2020-04-29  0:29                     ` Jason Gunthorpe
  2020-04-29 19:56                       ` Alex Williamson
  0 siblings, 1 reply; 28+ messages in thread
From: Jason Gunthorpe @ 2020-04-29  0:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linux-doc, John Hubbard, LKML, Andrew Morton, Al Viro,
	Christoph Hellwig, Dan Williams, Dave Chinner, Ira Weiny,
	Jan Kara, Jonathan Corbet, Jérôme Glisse,
	Kirill A . Shutemov, Michal Hocko, Mike Kravetz, Shuah Khan,
	Vlastimil Babka, Matthew Wilcox, linux-fsdevel, linux-kselftest,
	linux-rdma, linux-mm, Kirill A . Shutemov

On Tue, Apr 28, 2020 at 02:12:23PM -0600, Alex Williamson wrote:

> > > Maybe I was just getting lucky before this commit.  For a
> > > VM_PFNMAP, vaddr_get_pfn() only needs pin_user_pages_remote() to return
> > > error and the vma information that we setup in vfio_pci_mmap().  
> > 
> > I've written on this before, vfio should not be passing pages to the
> > iommu that it cannot pin eg it should not touch VM_PFNMAP vma's in the
> > first place.
> > 
> > It is a use-after-free security issue the way it is..
> 
> Where is the use-after-free?  Here I'm trying to map device mmio space
> through the iommu, which we need to enable p2p when the user owns
> multiple devices.

Yes, I gathered what the intent was..

> The device is owned by the user, bound to vfio-pci, and can't be
> unbound while the user has it open.  The iommu mappings are torn
> down on release.  I guess I don't understand the problem.

For PFNMAP VMAs the lifecycle rule is basically that the PFN inside
the VMA can only be used inside the mmap_sem that read it. Ie you
cannot take a PFN outside the mmap_sem and continue to use it.

This is because the owner of the VMA owns the lifetime of that PFN,
and under the write side of the mmap_sem it can zap the PFN, or close
the VMA. Afterwards the VMA owner knows that there are no active
reference to the PFN in the system and can reclaim the PFN

ie the PFNMAP has no per-page pin counter. All lifetime revolves around
the mmap_sem and the vma.

What vfio does is take the PFN out of the mmap_sem and program it into
the iommu.

So when the VMA owner decides the PFN has no references, it actually
doesn't: vfio continues to access it beyond its permitted lifetime.

HW like mlx5 and GPUs have BAR pages which have security
properties. Once the PFN is returned to the driver the security
context of the PFN can be reset and re-assigned to another
process. Using VFIO a hostile user space can retain access to the BAR
page and upon its reassignment access a security context they were not
permitted to access.

This is why GUP does not return PFNMAP pages and vfio should not carry
a reference outside the mmap_sem. It breaks all the lifetime rules.

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-29  0:29                     ` Jason Gunthorpe
@ 2020-04-29 19:56                       ` Alex Williamson
  2020-04-29 23:03                         ` Jason Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Alex Williamson @ 2020-04-29 19:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-doc, John Hubbard, LKML, Andrew Morton, Al Viro,
	Christoph Hellwig, Dan Williams, Dave Chinner, Ira Weiny,
	Jan Kara, Jonathan Corbet, Jérôme Glisse,
	Kirill A . Shutemov, Michal Hocko, Mike Kravetz, Shuah Khan,
	Vlastimil Babka, Matthew Wilcox, linux-fsdevel, linux-kselftest,
	linux-rdma, linux-mm, Kirill A . Shutemov

On Tue, 28 Apr 2020 21:29:03 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:

> On Tue, Apr 28, 2020 at 02:12:23PM -0600, Alex Williamson wrote:
> 
> > > > Maybe I was just getting lucky before this commit.  For a
> > > > VM_PFNMAP, vaddr_get_pfn() only needs pin_user_pages_remote() to return
> > > > error and the vma information that we setup in vfio_pci_mmap().    
> > > 
> > > I've written on this before, vfio should not be passing pages to the
> > > iommu that it cannot pin eg it should not touch VM_PFNMAP vma's in the
> > > first place.
> > > 
> > > It is a use-after-free security issue the way it is..  
> > 
> > Where is the use-after-free?  Here I'm trying to map device mmio space
> > through the iommu, which we need to enable p2p when the user owns
> > multiple devices.  
> 
> Yes, I gathered what the intent was..
> 
> > The device is owned by the user, bound to vfio-pci, and can't be
> > unbound while the user has it open.  The iommu mappings are torn
> > down on release.  I guess I don't understand the problem.  
> 
> For PFNMAP VMAs the lifecycle rule is basically that the PFN inside
> the VMA can only be used inside the mmap_sem that read it. Ie you
> cannot take a PFN outside the mmap_sem and continue to use it.
> 
> This is because the owner of the VMA owns the lifetime of that PFN,
> and under the write side of the mmap_sem it can zap the PFN, or close
> the VMA. Afterwards the VMA owner knows that there are no active
> reference to the PFN in the system and can reclaim the PFN
> 
> ie the PFNMAP has no per-page pin counter. All lifetime revolves around
> the mmap_sem and the vma.
> 
> What vfio does is take the PFN out of the mmap_sem and program it into
> the iommu.
> 
> So when the VMA owner decides the PFN has no references, it actually
> doesn't: vfio continues to access it beyond its permitted lifetime.
> 
> HW like mlx5 and GPUs have BAR pages which have security
> properties. Once the PFN is returned to the driver the security
> context of the PFN can be reset and re-assigned to another
> process. Using VFIO a hostile user space can retain access to the BAR
> page and upon its reassignment access a security context they were not
> permitted to access.
> 
> This is why GUP does not return PFNMAP pages and vfio should not carry
> a reference outside the mmap_sem. It breaks all the lifetime rules.

Thanks for the explanation.  I'm inferring that there is no solution to
this, but why can't we use mmu notifiers to invalidate the iommu on zap
or close?  I know that at least QEMU won't consider these sorts of
mapping fatal, so we could possibly change the default and make support
for such mappings opt-in, but I don't know if I'd break DPDK, or
potentially users within QEMU that make use of p2p between devices.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [regression?] Re: [PATCH v6 06/12] mm/gup: track FOLL_PIN pages
  2020-04-29 19:56                       ` Alex Williamson
@ 2020-04-29 23:03                         ` Jason Gunthorpe
  0 siblings, 0 replies; 28+ messages in thread
From: Jason Gunthorpe @ 2020-04-29 23:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linux-doc, John Hubbard, LKML, Andrew Morton, Al Viro,
	Christoph Hellwig, Dan Williams, Dave Chinner, Ira Weiny,
	Jan Kara, Jonathan Corbet, Jérôme Glisse,
	Kirill A . Shutemov, Michal Hocko, Mike Kravetz, Shuah Khan,
	Vlastimil Babka, Matthew Wilcox, linux-fsdevel, linux-kselftest,
	linux-rdma, linux-mm, Kirill A . Shutemov

On Wed, Apr 29, 2020 at 01:56:33PM -0600, Alex Williamson wrote:
> On Tue, 28 Apr 2020 21:29:03 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
> 
> > On Tue, Apr 28, 2020 at 02:12:23PM -0600, Alex Williamson wrote:
> > 
> > > > > Maybe I was just getting lucky before this commit.  For a
> > > > > VM_PFNMAP, vaddr_get_pfn() only needs pin_user_pages_remote() to return
> > > > > error and the vma information that we setup in vfio_pci_mmap().    
> > > > 
> > > > I've written on this before, vfio should not be passing pages to the
> > > > iommu that it cannot pin eg it should not touch VM_PFNMAP vma's in the
> > > > first place.
> > > > 
> > > > It is a use-after-free security issue the way it is..  
> > > 
> > > Where is the use-after-free?  Here I'm trying to map device mmio space
> > > through the iommu, which we need to enable p2p when the user owns
> > > multiple devices.  
> > 
> > Yes, I gathered what the intent was..
> > 
> > > The device is owned by the user, bound to vfio-pci, and can't be
> > > unbound while the user has it open.  The iommu mappings are torn
> > > down on release.  I guess I don't understand the problem.  
> > 
> > For PFNMAP VMAs the lifecycle rule is basically that the PFN inside
> > the VMA can only be used inside the mmap_sem that read it. Ie you
> > cannot take a PFN outside the mmap_sem and continue to use it.
> > 
> > This is because the owner of the VMA owns the lifetime of that PFN,
> > and under the write side of the mmap_sem it can zap the PFN, or close
> > the VMA. Afterwards the VMA owner knows that there are no active
> > reference to the PFN in the system and can reclaim the PFN
> > 
> > ie the PFNMAP has no per-page pin counter. All lifetime revolves around
> > the mmap_sem and the vma.
> > 
> > What vfio does is take the PFN out of the mmap_sem and program it into
> > the iommu.
> > 
> > So when the VMA owner decides the PFN has no references, it actually
> > doesn't: vfio continues to access it beyond its permitted lifetime.
> > 
> > HW like mlx5 and GPUs have BAR pages which have security
> > properties. Once the PFN is returned to the driver the security
> > context of the PFN can be reset and re-assigned to another
> > process. Using VFIO a hostile user space can retain access to the BAR
> > page and upon its reassignment access a security context they were not
> > permitted to access.
> > 
> > This is why GUP does not return PFNMAP pages and vfio should not carry
> > a reference outside the mmap_sem. It breaks all the lifetime rules.
> 
> Thanks for the explanation.  I'm inferring that there is no solution to
> this, 

Not a particularly good one unfortunately. I've been wanting to use
P2P_DMA pages to solve these kinds of things but they are kind of
expensive.

I have a copy of some draft patches trying to do this

> but why can't we use mmu notifiers to invalidate the iommu on zap or
> close?

Hum.. I think with the new mmu interval notifiers vfio might be able
to manage that without a huge amount of trouble. But the iommu
invalidation needs to be synchronous from a mmu notifier callback - is
that feasible?
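
For reference, a hedged sketch of what a synchronous invalidation hook
with the interval-notifier API might look like (the struct and the
example_iommu_unmap() helper are made-up placeholders, not a proposed
vfio change):

struct example_dma {
	struct mmu_interval_notifier mni;
	/* iova, size, iommu domain, ... */
};

static bool example_invalidate(struct mmu_interval_notifier *mni,
			       const struct mmu_notifier_range *range,
			       unsigned long cur_seq)
{
	struct example_dma *dma = container_of(mni, struct example_dma, mni);

	mmu_interval_set_seq(mni, cur_seq);
	/* A synchronous IOMMU unmap of the affected range would go here. */
	example_iommu_unmap(dma, range->start, range->end);
	return true;
}

static const struct mmu_interval_notifier_ops example_mn_ops = {
	.invalidate = example_invalidate,
};

/*
 * Registered against the user range when the DMA mapping is created:
 *   mmu_interval_notifier_insert(&dma->mni, current->mm, vaddr, size,
 *				  &example_mn_ops);
 */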

But even so, we have all this stuff now for authorizing PCI P2P which
this design completely ignores as well. :(

> I know that at least QEMU won't consider these sorts of mapping
> fatal, so we could possibly change the default and make support for
> such mappings opt-in, but I don't know if I'd break DPDK, or
> potentially users within QEMU that make use of p2p between devices.

I'd heard this was mostly for GPU device assignment? I'd be surprised
if DPDK used this..

Jason

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-04-29 23:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-11  0:15 [PATCH v6 00/12] mm/gup: track FOLL_PIN pages John Hubbard
2020-02-11  0:15 ` [PATCH v6 01/12] mm/gup: split get_user_pages_remote() into two routines John Hubbard
2020-02-11  0:15 ` [PATCH v6 02/12] mm/gup: pass a flags arg to __gup_device_* functions John Hubbard
2020-02-11  0:15 ` [PATCH v6 03/12] mm: introduce page_ref_sub_return() John Hubbard
2020-02-11  0:15 ` [PATCH v6 04/12] mm/gup: pass gup flags to two more routines John Hubbard
2020-02-11  0:15 ` [PATCH v6 05/12] mm/gup: require FOLL_GET for get_user_pages_fast() John Hubbard
2020-02-11  0:15 ` [PATCH v6 06/12] mm/gup: track FOLL_PIN pages John Hubbard
2020-04-24 18:18   ` [regression] " Alex Williamson
2020-04-24 19:20     ` John Hubbard
2020-04-24 20:15       ` Alex Williamson
2020-04-24 22:58         ` John Hubbard
2020-04-28 16:54           ` [regression?] " Alex Williamson
2020-04-28 17:49             ` Jason Gunthorpe
2020-04-28 19:07               ` Alex Williamson
2020-04-28 19:22                 ` Jason Gunthorpe
2020-04-28 20:12                   ` Alex Williamson
2020-04-29  0:29                     ` Jason Gunthorpe
2020-04-29 19:56                       ` Alex Williamson
2020-04-29 23:03                         ` Jason Gunthorpe
2020-02-11  0:15 ` [PATCH v6 07/12] mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages John Hubbard
2020-02-11  0:15 ` [PATCH v6 08/12] mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting John Hubbard
2020-02-12  9:17   ` Jan Kara
2020-02-11  0:15 ` [PATCH v6 09/12] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
2020-02-11  0:15 ` [PATCH v6 10/12] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
2020-02-11  0:15 ` [PATCH v6 11/12] mm: Improve dump_page() for compound pages John Hubbard
2020-02-11  0:15 ` [PATCH v6 12/12] mm: dump_page(): additional diagnostics for huge pinned pages John Hubbard
2020-02-11 13:21   ` Kirill A. Shutemov
2020-02-12  2:10     ` John Hubbard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).