Linux-kselftest Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
@ 2019-12-16 22:25 John Hubbard
  2019-12-16 22:25 ` [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines John Hubbard
                   ` (26 more replies)
  0 siblings, 27 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Hi,

This implements an API naming change (put_user_page*() -->
unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
extends that tracking to a few select subsystems. More subsystems will
be added in follow up work.

Christoph Hellwig, a point of interest:

a) I've moved the bulk of the code out of the inline functions, as
   requested, for the devmap changes (patch 4: "mm: devmap: refactor
   1-based refcounting for ZONE_DEVICE pages").

Changes since v10: Remaining fixes resulting from Jan Kara's reviews:

* Shifted to using the sign bit in page_dma_pinned() to allow accurate
  results even in the overflow case. See the comments in that routine
  for details. This allowed getting rid of the new
  page_ref_zero_or_close_to_bias_overflow(), in favor of a simple
  sign check via "page_ref_count() <= 0").

* Simplified some of the huge_memory.c changes, and simplified a gup.c
  WARN invocation.

* Now using a standard -ENOMEM for most try_grab_page() failures.

* Got rid of tabs in the comment headers (I had thought they were
  required there, but it's actually the reverse: they are not
  allowed there).

* Rebased against 5.5-rc2 and retested.

* Added Jan Kara's reviewed-by tag for patch 23 (the main patch of the
  series).

Changes since v9: Fixes resulting from Jan Kara's and Jonathan Corbet's
reviews:

* Removed reviewed-by tags from the "mm/gup: track FOLL_PIN pages" (those
  were improperly inherited from the much smaller refactoring patch that
  was merged into it).

* Made try_grab_compound_head() and try_grab_page() behavior similar in
  their behavior with flags, in order to avoid "gotchas" later.

* follow_trans_huge_pmd(): moved the try_grab_page() to earlier in the
  routine, in order to avoid having to undo mlock_vma_page().

* follow_hugetlb_page(): removed a refcount overflow check that is now
  extraneous (and weaker than what try_grab_page() provides a few lines
  further down).

* Fixed up two Documentation flaws, pointed out by Jonathan Corbet's
  review.

Changes since v8:

* Merged the "mm/gup: pass flags arg to __gup_device_* functions" patch
  into the "mm/gup: track FOLL_PIN pages" patch, as requested by
  Christoph and Jan.

* Changed void grab_page() to bool try_grab_page(), and handled errors
  at the call sites. (From Jan's review comments.) try_grab_page()
  attempts to avoid page refcount overflows, even when counting up with
  GUP_PIN_COUNTING_BIAS increments.

* Fixed a bug that I'd introduced, when changing a BUG() to a WARN().

* Added Jan's reviewed-by tag to the " mm/gup: allow FOLL_FORCE for
  get_user_pages_fast()" patch.

* Documentation: pin_user_pages.rst: fixed an incorrect gup_benchmark
  invocation, left over from the pin_longterm days, spotted while preparing
  this version.

* Rebased onto today's linux.git (-rc1), and re-tested.

Changes since v7:

* Rebased onto Linux 5.5-rc1

* Reworked the grab_page() and try_grab_compound_head(), for API
  consistency and less diffs (thanks to Jan Kara's reviews).

* Added Leon Romanovsky's reviewed-by tags for two of the IB-related
  patches.

* patch 4 refactoring changes, as mentioned above.

There is a git repo and branch, for convenience:

    git@github.com:johnhubbard/linux.git pin_user_pages_tracking_v8

For the remaining list of "changes since version N", those are all in
v7, which is here:

  https://lore.kernel.org/r/20191121071354.456618-1-jhubbard@nvidia.com

============================================================
Overview:

This is a prerequisite to solving the problem of proper interactions
between file-backed pages, and [R]DMA activities, as discussed in [1],
[2], [3], and in a remarkable number of email threads since about
2017. :)

A new internal gup flag, FOLL_PIN is introduced, and thoroughly
documented in the last patch's Documentation/vm/pin_user_pages.rst.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:
    pin_user_pages() (sets FOLL_PIN)
    unpin_user_page()

============================================================
Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've
  been able to runtime test the core get_user_pages() and
  pin_user_pages() and related routines, but not so much on several of
  the call sites--but those are generally just a couple of lines
  changed, each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Runtime testing for the call sites so far is pretty light:

    * io_uring: Some directed tests from liburing exercise this, and
                they pass.
    * process_vm_access.c: A small directed test passes.
    * gup_benchmark: the enhanced version hits the new gup.c code, and
                     passes.
    * infiniband: ran "ib_write_bw", which exercises the umem.c changes,
                  but not the other changes.
    * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                      not ready just yet)
    * powerpc: it compiles...
    * drm/via: compiles...
    * goldfish: compiles...
    * net/xdp: compiles...
    * media/v4l2: compiles...

[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/

Dan Williams (1):
  mm: Cleanup __put_devmap_managed_page() vs ->page_free()

John Hubbard (24):
  mm/gup: factor out duplicate code from four routines
  mm/gup: move try_get_compound_head() to top, fix minor issues
  mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  goldish_pipe: rename local pin_user_pages() routine
  mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
  vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
  mm/gup: allow FOLL_FORCE for get_user_pages_fast()
  IB/umem: use get_user_pages_fast() to pin DMA pages
  mm/gup: introduce pin_user_pages*() and FOLL_PIN
  goldish_pipe: convert to pin_user_pages() and put_user_page()
  IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  drm/via: set FOLL_PIN via pin_user_pages_fast()
  fs/io_uring: set FOLL_PIN via pin_user_pages()
  net/xdp: set FOLL_PIN via pin_user_pages()
  media/v4l2-core: set pages dirty upon releasing DMA buffers
  media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page()
    conversion
  vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  powerpc: book3s64: convert to pin_user_pages() and put_user_page()
  mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding
    "1"
  mm, tree-wide: rename put_user_page*() to unpin_user_page*()
  mm/gup: track FOLL_PIN pages
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
    coverage

 Documentation/core-api/index.rst            |   1 +
 Documentation/core-api/pin_user_pages.rst   | 232 ++++++++
 arch/powerpc/mm/book3s64/iommu_api.c        |  10 +-
 drivers/gpu/drm/via/via_dmablit.c           |   6 +-
 drivers/infiniband/core/umem.c              |  19 +-
 drivers/infiniband/core/umem_odp.c          |  13 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |   4 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   8 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   4 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   8 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   4 +-
 drivers/infiniband/sw/siw/siw_mem.c         |   4 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |   8 +-
 drivers/nvdimm/pmem.c                       |   6 -
 drivers/platform/goldfish/goldfish_pipe.c   |  35 +-
 drivers/vfio/vfio_iommu_type1.c             |  35 +-
 fs/io_uring.c                               |   6 +-
 include/linux/mm.h                          | 155 ++++-
 include/linux/mmzone.h                      |   2 +
 include/linux/page_ref.h                    |  10 +
 mm/gup.c                                    | 626 +++++++++++++++-----
 mm/gup_benchmark.c                          |  74 ++-
 mm/huge_memory.c                            |  29 +-
 mm/hugetlb.c                                |  38 +-
 mm/memremap.c                               |  76 ++-
 mm/process_vm_access.c                      |  28 +-
 mm/swap.c                                   |  24 +
 mm/vmstat.c                                 |   2 +
 net/xdp/xdp_umem.c                          |   4 +-
 tools/testing/selftests/vm/gup_benchmark.c  |  21 +-
 tools/testing/selftests/vm/run_vmtests      |  22 +
 31 files changed, 1145 insertions(+), 369 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

--
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-18 15:52   ` Kirill A. Shutemov
  2019-12-16 22:25 ` [PATCH v11 02/25] mm/gup: move try_get_compound_head() to top, fix minor issues John Hubbard
                   ` (25 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Christoph Hellwig,
	Aneesh Kumar K . V

There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the the successful
end of each routine, to increment *nr.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 91 ++++++++++++++++++++++----------------------------------
 1 file changed, 36 insertions(+), 55 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 7646bf993b25..f764432914c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1978,6 +1978,25 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
+static int record_subpages(struct page *page, unsigned long addr,
+			   unsigned long end, struct page **pages)
+{
+	int nr;
+
+	for (nr = 0; addr != end; addr += PAGE_SIZE)
+		pages[nr++] = page++;
+
+	return nr;
+}
+
+static void put_compound_head(struct page *page, int refs)
+{
+	/* Do a get_page() first, in case refs == page->_refcount */
+	get_page(page);
+	page_ref_sub(page, refs);
+	put_page(page);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
@@ -2007,32 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	/* hugepages are never "special" */
 	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-	refs = 0;
 	head = pte_page(pte);
-
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(head, refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		/* Could be optimized better */
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2079,28 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
 	}
 
-	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(pmd_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2120,28 +2118,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
 	}
 
-	refs = 0;
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(pud_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
@@ -2157,28 +2146,20 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 
 	BUILD_BUG_ON(pgd_devmap(orig));
-	refs = 0;
+
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = record_subpages(page, addr, end, pages + *nr);
 
 	head = try_get_compound_head(pgd_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
+	*nr += refs;
 	SetPageReferenced(head);
 	return 1;
 }
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 02/25] mm/gup: move try_get_compound_head() to top, fix minor issues
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
  2019-12-16 22:25 ` [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 03/25] mm: Cleanup __put_devmap_managed_page() vs ->page_free() John Hubbard
                   ` (24 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Christoph Hellwig

An upcoming patch uses try_get_compound_head() more widely,
so move it to the top of gup.c.

Also fix a tiny spelling error and a checkpatch.pl warning.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index f764432914c4..3ecce297a47f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,21 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+	struct page *head = compound_head(page);
+
+	if (WARN_ON_ONCE(page_ref_count(head) < 0))
+		return NULL;
+	if (unlikely(!page_cache_add_speculative(head, refs)))
+		return NULL;
+	return head;
+}
+
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -1807,20 +1822,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
 	}
 }
 
-/*
- * Return the compund head page with ref appropriately incremented,
- * or NULL if that failed.
- */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
-{
-	struct page *head = compound_head(page);
-	if (WARN_ON_ONCE(page_ref_count(head) < 0))
-		return NULL;
-	if (unlikely(!page_cache_add_speculative(head, refs)))
-		return NULL;
-	return head;
-}
-
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 03/25] mm: Cleanup __put_devmap_managed_page() vs ->page_free()
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
  2019-12-16 22:25 ` [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines John Hubbard
  2019-12-16 22:25 ` [PATCH v11 02/25] mm/gup: move try_get_compound_head() to top, fix minor issues John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages John Hubbard
                   ` (23 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Christoph Hellwig

From: Dan Williams <dan.j.williams@intel.com>

After the removal of the device-public infrastructure there are only 2
->page_free() call backs in the kernel. One of those is a device-private
callback in the nouveau driver, the other is a generic wakeup needed in
the DAX case. In the hopes that all ->page_free() callbacks can be
migrated to common core kernel functionality, move the device-private
specific actions in __put_devmap_managed_page() under the
is_device_private_page() conditional, including the ->page_free()
callback. For the other page types just open-code the generic wakeup.

Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
case.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/nvdimm/pmem.c |  6 ----
 mm/memremap.c         | 80 ++++++++++++++++++++++++-------------------
 2 files changed, 44 insertions(+), 42 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index ad8e4df1282b..4eae441f86c9 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -337,13 +337,7 @@ static void pmem_release_disk(void *__pmem)
 	put_disk(pmem->disk);
 }
 
-static void pmem_pagemap_page_free(struct page *page)
-{
-	wake_up_var(&page->_refcount);
-}
-
 static const struct dev_pagemap_ops fsdax_pagemap_ops = {
-	.page_free		= pmem_pagemap_page_free,
 	.kill			= pmem_pagemap_kill,
 	.cleanup		= pmem_pagemap_cleanup,
 };
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..e899fa876a62 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -27,7 +27,8 @@ static void devmap_managed_enable_put(void)
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
 {
-	if (!pgmap->ops || !pgmap->ops->page_free) {
+	if (pgmap->type == MEMORY_DEVICE_PRIVATE &&
+	    (!pgmap->ops || !pgmap->ops->page_free)) {
 		WARN(1, "Missing page_free method\n");
 		return -EINVAL;
 	}
@@ -414,44 +415,51 @@ void __put_devmap_managed_page(struct page *page)
 {
 	int count = page_ref_dec_return(page);
 
-	/*
-	 * If refcount is 1 then page is freed and refcount is stable as nobody
-	 * holds a reference on the page.
-	 */
-	if (count == 1) {
-		/* Clear Active bit in case of parallel mark_page_accessed */
-		__ClearPageActive(page);
-		__ClearPageWaiters(page);
+	/* still busy */
+	if (count > 1)
+		return;
 
-		mem_cgroup_uncharge(page);
+	/* only triggered by the dev_pagemap shutdown path */
+	if (count == 0) {
+		__put_page(page);
+		return;
+	}
 
-		/*
-		 * When a device_private page is freed, the page->mapping field
-		 * may still contain a (stale) mapping value. For example, the
-		 * lower bits of page->mapping may still identify the page as
-		 * an anonymous page. Ultimately, this entire field is just
-		 * stale and wrong, and it will cause errors if not cleared.
-		 * One example is:
-		 *
-		 *  migrate_vma_pages()
-		 *    migrate_vma_insert_page()
-		 *      page_add_new_anon_rmap()
-		 *        __page_set_anon_rmap()
-		 *          ...checks page->mapping, via PageAnon(page) call,
-		 *            and incorrectly concludes that the page is an
-		 *            anonymous page. Therefore, it incorrectly,
-		 *            silently fails to set up the new anon rmap.
-		 *
-		 * For other types of ZONE_DEVICE pages, migration is either
-		 * handled differently or not done at all, so there is no need
-		 * to clear page->mapping.
-		 */
-		if (is_device_private_page(page))
-			page->mapping = NULL;
+	/* notify page idle for dax */
+	if (!is_device_private_page(page)) {
+		wake_up_var(&page->_refcount);
+		return;
+	}
 
-		page->pgmap->ops->page_free(page);
-	} else if (!count)
-		__put_page(page);
+	/* Clear Active bit in case of parallel mark_page_accessed */
+	__ClearPageActive(page);
+	__ClearPageWaiters(page);
+
+	mem_cgroup_uncharge(page);
+
+	/*
+	 * When a device_private page is freed, the page->mapping field
+	 * may still contain a (stale) mapping value. For example, the
+	 * lower bits of page->mapping may still identify the page as an
+	 * anonymous page. Ultimately, this entire field is just stale
+	 * and wrong, and it will cause errors if not cleared.  One
+	 * example is:
+	 *
+	 *  migrate_vma_pages()
+	 *    migrate_vma_insert_page()
+	 *      page_add_new_anon_rmap()
+	 *        __page_set_anon_rmap()
+	 *          ...checks page->mapping, via PageAnon(page) call,
+	 *            and incorrectly concludes that the page is an
+	 *            anonymous page. Therefore, it incorrectly,
+	 *            silently fails to set up the new anon rmap.
+	 *
+	 * For other types of ZONE_DEVICE pages, migration is either
+	 * handled differently or not done at all, so there is no need
+	 * to clear page->mapping.
+	 */
+	page->mapping = NULL;
+	page->pgmap->ops->page_free(page);
 }
 EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (2 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 03/25] mm: Cleanup __put_devmap_managed_page() vs ->page_free() John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-18 16:04   ` Kirill A. Shutemov
  2019-12-19  5:27   ` [PATCH v11 04/25] " Dan Williams
  2019-12-16 22:25 ` [PATCH v11 05/25] goldish_pipe: rename local pin_user_pages() routine John Hubbard
                   ` (22 subsequent siblings)
  26 siblings, 2 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Christoph Hellwig

An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
  and limit the functionality to "read only": return a bool,
  with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle checking
  what kind of page it is, and what kind of refcount handling it
  requires.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
  and limit the functionality to unconditionally freeing a devmap
  page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 17 +++++++++++++----
 mm/memremap.c      | 16 ++--------------
 mm/swap.c          | 24 ++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c97ea3b694e6..77a4df06c8a7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
 	if (!static_branch_unlikely(&devmap_managed_key))
 		return false;
@@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page *page)
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_FS_DAX:
-		__put_devmap_managed_page(page);
 		return true;
 	default:
 		break;
@@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page *page)
 	return false;
 }
 
+bool put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
+static inline bool page_is_devmap_managed(struct page *page)
+{
+	return false;
+}
+
 static inline bool put_devmap_managed_page(struct page *page)
 {
 	return false;
@@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page)
 	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (put_devmap_managed_page(page))
+	if (page_is_devmap_managed(page)) {
+		put_devmap_managed_page(page);
 		return;
+	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
diff --git a/mm/memremap.c b/mm/memremap.c
index e899fa876a62..2ba773859031 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-	int count = page_ref_dec_return(page);
-
-	/* still busy */
-	if (count > 1)
-		return;
-
-	/* only triggered by the dev_pagemap shutdown path */
-	if (count == 0) {
-		__put_page(page);
-		return;
-	}
-
 	/* notify page idle for dax */
 	if (!is_device_private_page(page)) {
 		wake_up_var(&page->_refcount);
@@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..49f7c2eea0ba 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1102,3 +1102,27 @@ void __init swap_setup(void)
 	 * _really_ don't want to cluster much more
 	 */
 }
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+bool put_devmap_managed_page(struct page *page)
+{
+	bool is_devmap = page_is_devmap_managed(page);
+
+	if (is_devmap) {
+		int count = page_ref_dec_return(page);
+
+		/*
+		 * devmap page refcounts are 1-based, rather than 0-based: if
+		 * refcount is 1, then the page is free and the refcount is
+		 * stable because nobody holds a reference on the page.
+		 */
+		if (count == 1)
+			free_devmap_managed_page(page);
+		else if (!count)
+			__put_page(page);
+	}
+
+	return is_devmap;
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
+#endif
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 05/25] goldish_pipe: rename local pin_user_pages() routine
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (3 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM John Hubbard
                   ` (21 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

1. Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "goldfish_pin_pages()".

An upcoming patch will introduce a global pin_user_pages()
function.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/platform/goldfish/goldfish_pipe.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index cef0133aa47a..ef50c264db71 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
 	}
 }
 
-static int pin_user_pages(unsigned long first_page,
-			  unsigned long last_page,
-			  unsigned int last_page_size,
-			  int is_write,
-			  struct page *pages[MAX_BUFFERS_PER_COMMAND],
-			  unsigned int *iter_last_page_size)
+static int goldfish_pin_pages(unsigned long first_page,
+			      unsigned long last_page,
+			      unsigned int last_page_size,
+			      int is_write,
+			      struct page *pages[MAX_BUFFERS_PER_COMMAND],
+			      unsigned int *iter_last_page_size)
 {
 	int ret;
 	int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 	if (mutex_lock_interruptible(&pipe->lock))
 		return -ERESTARTSYS;
 
-	pages_count = pin_user_pages(first_page, last_page,
-				     last_page_size, is_write,
-				     pipe->pages, &iter_last_page_size);
+	pages_count = goldfish_pin_pages(first_page, last_page,
+					 last_page_size, is_write,
+					 pipe->pages, &iter_last_page_size);
 	if (pages_count < 0) {
 		mutex_unlock(&pipe->lock);
 		return pages_count;
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (4 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 05/25] goldish_pipe: rename local pin_user_pages() routine John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-18 16:19   ` Kirill A. Shutemov
  2019-12-16 22:25 ` [PATCH v11 07/25] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call John Hubbard
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Jason Gunthorpe

As it says in the updated comment in gup.c: current FOLL_LONGTERM
behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
FS DAX check requirement on vmas.

However, the corresponding restriction in get_user_pages_remote() was
slightly stricter than is actually required: it forbade all
FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
that do not set the "locked" arg.

Update the code and comments to loosen the restriction, allowing
FOLL_LONGTERM in some cases.

Also, copy the DAX check ("if a VMA is DAX, don't allow long term
pinning") from the VFIO call site, all the way into the internals
of get_user_pages_remote() and __gup_longterm_locked(). That is:
get_user_pages_remote() calls __gup_longterm_locked(), which in turn
calls check_dax_vmas(). This check will then be removed from the VFIO
call site in a subsequent patch.

Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
and to Dan Williams for helping clarify the DAX refactoring.

Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 3ecce297a47f..c0c56888e7cc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,13 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
+						  struct mm_struct *mm,
+						  unsigned long start,
+						  unsigned long nr_pages,
+						  struct page **pages,
+						  struct vm_area_struct **vmas,
+						  unsigned int flags);
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -1179,13 +1186,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 		struct vm_area_struct **vmas, int *locked)
 {
 	/*
-	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
+	 * Parts of FOLL_LONGTERM behavior are incompatible with
 	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas.  As there are no users of this flag in this call we simply
-	 * disallow this option for now.
+	 * vmas. However, this only comes up if locked is set, and there are
+	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
+	 * allow what we can.
 	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
-		return -EINVAL;
+	if (gup_flags & FOLL_LONGTERM) {
+		if (WARN_ON_ONCE(locked))
+			return -EINVAL;
+		/*
+		 * This will check the vmas (even if our vmas arg is NULL)
+		 * and return -ENOTSUPP if DAX isn't allowed in this case:
+		 */
+		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+					     vmas, gup_flags | FOLL_TOUCH |
+					     FOLL_REMOTE);
+	}
 
 	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
 				       locked,
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 07/25] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (5 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 08/25] mm/gup: allow FOLL_FORCE for get_user_pages_fast() John Hubbard
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Jason Gunthorpe

Update VFIO to take advantage of the recently loosened restriction on
FOLL_LONGTERM with get_user_pages_remote(). Also, now it is possible to
fix a bug: the VFIO caller is logically a FOLL_LONGTERM user, but it
wasn't setting FOLL_LONGTERM.

Also, remove an unnessary pair of calls that were releasing and
reacquiring the mmap_sem. There is no need to avoid holding mmap_sem
just in order to call page_to_pfn().

Also, now that the the DAX check ("if a VMA is DAX, don't allow long
term pinning") is in the internals of get_user_pages_remote() and
__gup_longterm_locked(), there's no need for it at the VFIO call site.
So remove it.

Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 30 +++++-------------------------
 1 file changed, 5 insertions(+), 25 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..b800fc9a0251 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -322,7 +322,6 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
-	struct vm_area_struct *vmas[1];
 	unsigned int flags = 0;
 	int ret;
 
@@ -330,33 +329,14 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		flags |= FOLL_WRITE;
 
 	down_read(&mm->mmap_sem);
-	if (mm == current->mm) {
-		ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-				     vmas);
-	} else {
-		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-					    vmas, NULL);
-		/*
-		 * The lifetime of a vaddr_get_pfn() page pin is
-		 * userspace-controlled. In the fs-dax case this could
-		 * lead to indefinite stalls in filesystem operations.
-		 * Disallow attempts to pin fs-dax pages via this
-		 * interface.
-		 */
-		if (ret > 0 && vma_is_fsdax(vmas[0])) {
-			ret = -EOPNOTSUPP;
-			put_page(page[0]);
-		}
-	}
-	up_read(&mm->mmap_sem);
-
+	ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+				    page, NULL, NULL);
 	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
-		return 0;
+		ret = 0;
+		goto done;
 	}
 
-	down_read(&mm->mmap_sem);
-
 	vaddr = untagged_addr(vaddr);
 
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
@@ -366,7 +346,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		if (is_invalid_reserved_pfn(*pfn))
 			ret = 0;
 	}
-
+done:
 	up_read(&mm->mmap_sem);
 	return ret;
 }
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 08/25] mm/gup: allow FOLL_FORCE for get_user_pages_fast()
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (6 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 07/25] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 09/25] IB/umem: use get_user_pages_fast() to pin DMA pages John Hubbard
                   ` (18 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Christoph Hellwig, Leon Romanovsky

Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
This, combined with the fact that get_user_pages_fast() falls back to
"slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
if you need FOLL_FORCE, you cannot call get_user_pages_fast().

There does not appear to be any reason for filtering out FOLL_FORCE.
There is nothing in the _fast() implementation that requires that we
avoid writing to the pages. So it appears to have been an oversight.

Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().

Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index c0c56888e7cc..958ab0757389 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2414,7 +2414,8 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 	unsigned long addr, len, end;
 	int nr = 0, ret = 0;
 
-	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
+				       FOLL_FORCE)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 09/25] IB/umem: use get_user_pages_fast() to pin DMA pages
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (7 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 08/25] mm/gup: allow FOLL_FORCE for get_user_pages_fast() John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Leon Romanovsky, Christoph Hellwig,
	Jason Gunthorpe

And get rid of the mmap_sem calls, as part of that. Note
that get_user_pages_fast() will, if necessary, fall back to
__gup_longterm_unlocked(), which takes the mmap_sem as needed.

Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 7a3b99597ead..214e87aa609d 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,16 +266,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 	sg = umem->sg_head.sgl;
 
 	while (npages) {
-		down_read(&mm->mmap_sem);
-		ret = get_user_pages(cur_base,
-				     min_t(unsigned long, npages,
-					   PAGE_SIZE / sizeof (struct page *)),
-				     gup_flags | FOLL_LONGTERM,
-				     page_list, NULL);
-		if (ret < 0) {
-			up_read(&mm->mmap_sem);
+		ret = get_user_pages_fast(cur_base,
+					  min_t(unsigned long, npages,
+						PAGE_SIZE /
+						sizeof(struct page *)),
+					  gup_flags | FOLL_LONGTERM, page_list);
+		if (ret < 0)
 			goto umem_release;
-		}
 
 		cur_base += ret * PAGE_SIZE;
 		npages   -= ret;
@@ -283,8 +280,6 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 		sg = ib_umem_add_sg_table(sg, page_list, ret,
 			dma_get_max_seg_size(context->device->dma_device),
 			&umem->sg_nents);
-
-		up_read(&mm->mmap_sem);
 	}
 
 	sg_mark_end(sg);
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (8 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 09/25] IB/umem: use get_user_pages_fast() to pin DMA pages John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 11/25] goldish_pipe: convert to pin_user_pages() and put_user_page() John Hubbard
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Mike Rapoport

Introduce pin_user_pages*() variations of get_user_pages*() calls,
and also pin_longterm_pages*() variations.

For now, these are placeholder calls, until the various call sites
are converted to use the correct get_user_pages*() or
pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also
introduced, and thoroughly documented.

    pin_user_pages()
    pin_user_pages_remote()
    pin_user_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
set it. That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
        * Start with "pin_user_pages" instead of "get_user_pages". That
          makes it easy to find and audit the call sites.
        * Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded upon it.)

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>  # Documentation
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/core-api/index.rst          |   1 +
 Documentation/core-api/pin_user_pages.rst | 232 ++++++++++++++++++++++
 include/linux/mm.h                        |  63 ++++--
 mm/gup.c                                  | 161 +++++++++++++--
 4 files changed, 423 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/core-api/pin_user_pages.rst

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
    generic-radix-tree
    memory-allocation
    mm-api
+   pin_user_pages
    gfp_mask-from-fs-io
    timekeeping
    boot-time-mm
diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
new file mode 100644
index 000000000000..71849830cd48
--- /dev/null
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+pin_user_pages() and related calls
+====================================================
+
+.. contents:: :local:
+
+Overview
+========
+
+This document describes the following functions::
+
+ pin_user_pages()
+ pin_user_pages_fast()
+ pin_user_pages_remote()
+
+Basic description of FOLL_PIN
+=============================
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
+sites. This allows the associated wrapper functions  (pin_user_pages*() and
+others) to set the correct combination of these flags, and to check for problems
+as well.
+
+FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
+This is in order to avoid creating a large number of wrapper functions to cover
+all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
+pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
+that's a natural dividing line, and a good point to make separate wrapper calls.
+In other words, use pin_user_pages*() for DMA-pinned pages, and
+get_user_pages*() for other cases. There are four cases described later on in
+this document, to further clarify that concept.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===================================
+
+For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
+flags the caller provides. The caller is required to pass in a non-null struct
+pages* array, and the function then pin pages by incrementing each by a special
+value. For now, that value is +1, just like get_user_pages*().::
+
+ Function
+ --------
+ pin_user_pages          FOLL_PIN is always set internally by this function.
+ pin_user_pages_fast     FOLL_PIN is always set internally by this function.
+ pin_user_pages_remote   FOLL_PIN is always set internally by this function.
+
+For these get_user_pages*() functions, FOLL_GET might not even be specified.
+Behavior is a little more complex than above. If FOLL_GET was *not* specified,
+but the caller passed in a non-null struct pages* array, then the function
+sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
+of each page by +1.::
+
+ Function
+ --------
+ get_user_pages           FOLL_GET is sometimes set internally by this function.
+ get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
+ get_user_pages_remote    FOLL_GET is sometimes set internally by this function.
+
+Tracking dma-pinned pages
+=========================
+
+Some of the key design constraints, and solutions, for tracking dma-pinned
+pages:
+
+* An actual reference count, per struct page, is required. This is because
+  multiple processes may pin and unpin a page.
+
+* False positives (reporting that a page is dma-pinned, when in fact it is not)
+  are acceptable, but false negatives are not.
+
+* struct page may not be increased in size for this, and all fields are already
+  used.
+
+* Given the above, we can overload the page->_refcount field by using, sort of,
+  the upper bits in that field for a dma-pinned count. "Sort of", means that,
+  rather than dividing page->_refcount into bit fields, we simple add a medium-
+  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
+  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
+  on it 1024 times, then it will appear to have a single dma-pinned count.
+  And again, that's acceptable.
+
+This also leads to limitations: there are only 31-10==21 bits available for a
+counter that increments 10 bits at a time.
+
+TODO: for 1GB and larger huge pages, this is cutting it close. That's because
+when pin_user_pages() follows such pages, it increments the head page by "1"
+(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
+pin_user_pages()) for each tail page. So if you have a 1GB huge page:
+
+* There are 256K (18 bits) worth of 4 KB tail pages.
+* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
+  10 bits at a time)
+* There are 21 - 18 == 3 bits available to count. Except that there aren't,
+  because you need to allow for a few normal get_page() calls on the head page,
+  as well. Fortunately, the approach of using addition, rather than "hard"
+  bitfields, within page->_refcount, allows for sharing these bits gracefully.
+  But we're still looking at about 8 references.
+
+This, however, is a missing feature more than anything else, because it's easily
+solved by addressing an obvious inefficiency in the original get_user_pages()
+approach of retrieving pages: stop treating all the pages as if they were
+PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
+this, so some work is required. Once that's in place, this limitation mostly
+disappears from view, because there will be ample refcounting range available.
+
+* Callers must specifically request "dma-pinned tracking of pages". In other
+  words, just calling get_user_pages() will not suffice; a new set of functions,
+  pin_user_page() and related, must be used.
+
+FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
+==========================================================
+
+Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
+these categories:
+
+CASE 1: Direct IO (DIO)
+-----------------------
+There are GUP references to pages that are serving
+as DIO buffers. These buffers are needed for a relatively short time (so they
+are not "long term"). No special synchronization with page_mkclean() or
+munmap() is provided. Therefore, flags to set at the call site are: ::
+
+    FOLL_PIN
+
+...but rather than setting FOLL_PIN directly, call sites should use one of
+the pin_user_pages*() routines that set FOLL_PIN.
+
+CASE 2: RDMA
+------------
+There are GUP references to pages that are serving as DMA
+buffers. These buffers are needed for a long time ("long term"). No special
+synchronization with page_mkclean() or munmap() is provided. Therefore, flags
+to set at the call site are: ::
+
+    FOLL_PIN | FOLL_LONGTERM
+
+NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
+because DAX pages do not have a separate page cache, and so "pinning" implies
+locking down file system blocks, which is not (yet) supported in that way.
+
+CASE 3: Hardware with page faulting support
+-------------------------------------------
+Here, a well-written driver doesn't normally need to pin pages at all. However,
+if the driver does choose to do so, it can register MMU notifiers for the range,
+and will be called back upon invalidation. Either way (avoiding page pinning, or
+using MMU notifiers to unpin upon request), there is proper synchronization with
+both filesystem and mm (page_mkclean(), munmap(), etc).
+
+Therefore, neither flag needs to be set.
+
+In this case, ideally, neither get_user_pages() nor pin_user_pages() should be
+called. Instead, the software should be written so that it does not pin pages.
+This allows mm and filesystems to operate more efficiently and reliably.
+
+CASE 4: Pinning for struct page manipulation only
+-------------------------------------------------
+Here, normal GUP calls are sufficient, so neither flag needs to be set.
+
+page_dma_pinned(): the whole point of pinning
+=============================================
+
+The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
+to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
+(and file system writeback code in general) to make informed decisions about
+what to do when a page cannot be unmapped due to such pins.
+
+What to do in those cases is the subject of a years-long series of discussions
+and debates (see the References at the end of this document). It's a TODO item
+here: fill in the details once that's worked out. Meanwhile, it's safe to say
+that having this available: ::
+
+        static inline bool page_dma_pinned(struct page *page)
+
+...is a prerequisite to solving the long-running gup+DMA problem.
+
+Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
+===================================================================
+
+Another way of thinking about these flags is as a progression of restrictions:
+FOLL_GET is for struct page manipulation, without affecting the data that the
+struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
+short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
+a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
+restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
+will be pinned longterm, and whose data will be accessed.
+
+Unit testing
+============
+This file::
+
+ tools/testing/selftests/vm/gup_benchmark.c
+
+has the following new calls to exercise the new pin*() wrapper functions:
+
+* PIN_FAST_BENCHMARK (./gup_benchmark -a)
+* PIN_BENCHMARK (./gup_benchmark -b)
+
+You can monitor how many total dma-pinned pages have been acquired and released
+since the system was booted, via two new /proc/vmstat entries: ::
+
+    /proc/vmstat/nr_foll_pin_requested
+    /proc/vmstat/nr_foll_pin_requested
+
+Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
+because there is a noticeable performance drop in put_user_page(), when they
+are activated.
+
+References
+==========
+
+* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
+* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
+* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+
+John Hubbard, October, 2019
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77a4df06c8a7..0fb9929e00af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1048,16 +1048,14 @@ static inline void put_page(struct page *page)
  * put_user_page() - release a gup-pinned page
  * @page:            pointer to page to be released
  *
- * Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * put_user_page(), or one of the put_user_pages*() routines. This is so that
+ * eventually such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
  *
  * put_user_page() and put_page() are not interchangeable, despite this early
  * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with get_user_page() calls.
+ * be perfectly matched up with pin*() calls.
  */
 static inline void put_user_page(struct page *page)
 {
@@ -1515,9 +1513,16 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas, int *locked);
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked);
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas);
 long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
@@ -1525,6 +1530,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 
 int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
@@ -2588,13 +2595,15 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
 
 /*
- * NOTE on FOLL_LONGTERM:
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
  *
  * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control.  This is contrasted with
- * iov_iter_get_pages() where usages which are transient.
+ * period _often_ under userspace control.  This is in contrast to
+ * iov_iter_get_pages(), whose usages are transient.
  *
  * FIXME: For pages which are part of a filesystem, mappings are subject to the
  * lifetime enforced by the filesystem and we need guarantees that longterm
@@ -2609,11 +2618,39 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
  * Currently only get_user_pages() and get_user_pages_fast() support this flag
  * and calls to get_user_pages_[un]locked are specifically not allowed.  This
  * is due to an incompatibility with the FS DAX check and
- * FAULT_FLAG_ALLOW_RETRY
+ * FAULT_FLAG_ALLOW_RETRY.
  *
- * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
- * that region.  And so CMA attempts to migrate the page before pinning when
+ * In the CMA case: long term pins in a CMA region would unnecessarily fragment
+ * that region.  And so, CMA attempts to migrate the page before pinning, when
  * FOLL_LONGTERM is specified.
+ *
+ * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
+ * but an additional pin counting system) will be invoked. This is intended for
+ * anything that gets a page reference and then touches page data (for example,
+ * Direct IO). This lets the filesystem know that some non-file-system entity is
+ * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
+ * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
+ * a call to put_user_page().
+ *
+ * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
+ * and separate refcounting mechanisms, however, and that means that each has
+ * its own acquire and release mechanisms:
+ *
+ *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
+ *
+ *     FOLL_PIN: pin_user_pages*() to acquire, and put_user_pages to release.
+ *
+ * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
+ * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
+ * calls applied to them, and that's perfectly OK. This is a constraint on the
+ * callers, not on the pages.)
+ *
+ * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
+ * directly by the caller. That's in order to help avoid mismatches when
+ * releasing pages: get_user_pages*() pages must be released via put_page(),
+ * while pin_user_pages*() pages must be released via put_user_page().
+ *
+ * Please see Documentation/vm/pin_user_pages.rst for more information.
  */
 
 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
diff --git a/mm/gup.c b/mm/gup.c
index 958ab0757389..4862ff982bc3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -201,6 +201,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t *ptep, pte;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return ERR_PTR(-EINVAL);
 retry:
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
@@ -818,7 +822,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 	start = untagged_addr(start);
 
-	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
+	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
 
 	/*
 	 * If FOLL_FORCE is set then do not force a full fault as the hinting
@@ -1042,7 +1046,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		BUG_ON(*locked != 1);
 	}
 
-	if (pages)
+	/*
+	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
+	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
+	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
+	 * for FOLL_GET, not for the newer FOLL_PIN.
+	 *
+	 * FOLL_PIN always expects pages to be non-null, but no need to assert
+	 * that here, as any failures will be obvious enough.
+	 */
+	if (pages && !(flags & FOLL_PIN))
 		flags |= FOLL_GET;
 
 	pages_done = 0;
@@ -1185,6 +1198,13 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *locked)
 {
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
 	/*
 	 * Parts of FOLL_LONGTERM behavior are incompatible with
 	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
@@ -1400,6 +1420,14 @@ static long __get_user_pages_locked(struct task_struct *tsk,
 finish_or_fault:
 	return i ? : -EFAULT;
 }
+
+long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked)
+{
+	return 0;
+}
 #endif /* !CONFIG_MMU */
 
 #if defined(CONFIG_FS_DAX) || defined (CONFIG_CMA)
@@ -1654,6 +1682,13 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas)
 {
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
 	return __gup_longterm_locked(current, current->mm, start, nr_pages,
 				     pages, vmas, gup_flags | FOLL_TOUCH);
 }
@@ -2392,30 +2427,15 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 	return ret;
 }
 
-/**
- * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
- *
- * Attempt to pin user pages in memory without taking mm->mmap_sem.
- * If not successful, it will fall back to taking the lock and
- * calling get_user_pages().
- *
- * Returns number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno.
- */
-int get_user_pages_fast(unsigned long start, int nr_pages,
-			unsigned int gup_flags, struct page **pages)
+static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
+					unsigned int gup_flags,
+					struct page **pages)
 {
 	unsigned long addr, len, end;
 	int nr = 0, ret = 0;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-				       FOLL_FORCE)))
+				       FOLL_FORCE | FOLL_PIN)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2455,4 +2475,103 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 
 	return ret;
 }
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @gup_flags:	flags modifying pin behaviour
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number requested.
+ * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns
+ * -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
+
+/**
+ * pin_user_pages_fast() - pin user pages in memory without taking locks
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages_fast().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast);
+
+/**
+ * pin_user_pages_remote() - pin pages of a remote process (task != current)
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages_remote().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
+				     vmas, locked);
+}
+EXPORT_SYMBOL(pin_user_pages_remote);
+
+/**
+ * pin_user_pages() - pin user pages in memory for use by other devices
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+EXPORT_SYMBOL(pin_user_pages);
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 11/25] goldish_pipe: convert to pin_user_pages() and put_user_page()
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (9 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 12/25] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP John Hubbard
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

1. Call the new global pin_user_pages_fast(), from pin_goldfish_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified because
the page[] loop is now in gup.c instead of here, so just delete the
local release_user_pages() entirely, and call
put_user_pages_dirty_lock() directly, instead.

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/platform/goldfish/goldfish_pipe.c | 17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index ef50c264db71..2a5901efecde 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int goldfish_pin_pages(unsigned long first_page,
 		*iter_last_page_size = last_page_size;
 	}
 
-	ret = get_user_pages_fast(first_page, requested_pages,
+	ret = pin_user_pages_fast(first_page, requested_pages,
 				  !is_write ? FOLL_WRITE : 0,
 				  pages);
 	if (ret <= 0)
@@ -285,18 +285,6 @@ static int goldfish_pin_pages(unsigned long first_page,
 	return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-			       int is_write, s32 consumed_size)
-{
-	int i;
-
-	for (i = 0; i < pages_count; i++) {
-		if (!is_write && consumed_size > 0)
-			set_page_dirty(pages[i]);
-		put_page(pages[i]);
-	}
-}
-
 /* Populate the call parameters, merging adjacent pages together */
 static void populate_rw_params(struct page **pages,
 			       int pages_count,
@@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
 	*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-	release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
+	put_user_pages_dirty_lock(pipe->pages, pages_count,
+				  !is_write && *consumed_size > 0);
 
 	mutex_unlock(&pipe->lock);
 	return 0;
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 12/25] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (10 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 11/25] goldish_pipe: convert to pin_user_pages() and put_user_page() John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 13/25] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
                   ` (14 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Jason Gunthorpe

Convert infiniband to use the new pin_user_pages*() calls.

Also, revert earlier changes to Infiniband ODP that had it using
put_user_page(). ODP is "Case 3" in
Documentation/core-api/pin_user_pages.rst, which is to say, normal
get_user_pages() and put_page() is the API to use there.

The new pin_user_pages*() calls replace corresponding get_user_pages*()
calls, and set the FOLL_PIN flag. The FOLL_PIN flag requires that the
caller must return the pages via put_user_page*() calls, but infiniband
was already doing that as part of an earlier commit.

Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c              |  2 +-
 drivers/infiniband/core/umem_odp.c          | 13 ++++++-------
 drivers/infiniband/hw/hfi1/user_pages.c     |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  2 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  2 +-
 drivers/infiniband/sw/siw/siw_mem.c         |  2 +-
 8 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 214e87aa609d..55daefaa9b88 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -266,7 +266,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 	sg = umem->sg_head.sgl;
 
 	while (npages) {
-		ret = get_user_pages_fast(cur_base,
+		ret = pin_user_pages_fast(cur_base,
 					  min_t(unsigned long, npages,
 						PAGE_SIZE /
 						sizeof(struct page *)),
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e42d44e501fd..abc3bb6578cc 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -308,9 +308,8 @@ EXPORT_SYMBOL(ib_umem_odp_release);
  * The function returns -EFAULT if the DMA mapping operation fails. It returns
  * -EAGAIN if a concurrent invalidation prevents us from updating the page.
  *
- * The page is released via put_user_page even if the operation failed. For
- * on-demand pinning, the page is released whenever it isn't stored in the
- * umem.
+ * The page is released via put_page even if the operation failed. For on-demand
+ * pinning, the page is released whenever it isn't stored in the umem.
  */
 static int ib_umem_odp_map_dma_single_page(
 		struct ib_umem_odp *umem_odp,
@@ -363,7 +362,7 @@ static int ib_umem_odp_map_dma_single_page(
 	}
 
 out:
-	put_user_page(page);
+	put_page(page);
 	return ret;
 }
 
@@ -473,7 +472,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 					ret = -EFAULT;
 					break;
 				}
-				put_user_page(local_page_list[j]);
+				put_page(local_page_list[j]);
 				continue;
 			}
 
@@ -500,8 +499,8 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 			 * ib_umem_odp_map_dma_single_page().
 			 */
 			if (npages - (j + 1) > 0)
-				put_user_pages(&local_page_list[j+1],
-					       npages - (j + 1));
+				release_pages(&local_page_list[j+1],
+					      npages - (j + 1));
 			break;
 		}
 	}
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index 469acb961fbd..9a94761765c0 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -106,7 +106,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 	int ret;
 	unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
 
-	ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+	ret = pin_user_pages_fast(vaddr, npages, gup_flags, pages);
 	if (ret < 0)
 		return ret;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index edccfd6e178f..8269ab040c21 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,7 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 		goto out;
 	}
 
-	ret = get_user_pages_fast(uaddr & PAGE_MASK, 1,
+	ret = pin_user_pages_fast(uaddr & PAGE_MASK, 1,
 				  FOLL_WRITE | FOLL_LONGTERM, pages);
 	if (ret < 0)
 		goto out;
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 6bf764e41891..7fc4b5f81fcd 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -108,7 +108,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
 
 	down_read(&current->mm->mmap_sem);
 	for (got = 0; got < num_pages; got += ret) {
-		ret = get_user_pages(start_page + got * PAGE_SIZE,
+		ret = pin_user_pages(start_page + got * PAGE_SIZE,
 				     num_pages - got,
 				     FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE,
 				     p + got, NULL);
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 05190edc2611..1a3cc2957e3a 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -670,7 +670,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
 		else
 			j = npages;
 
-		ret = get_user_pages_fast(addr, j, FOLL_LONGTERM, pages);
+		ret = pin_user_pages_fast(addr, j, FOLL_LONGTERM, pages);
 		if (ret != j) {
 			i = 0;
 			j = ret;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 62e6ffa9ad78..600896727d34 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -141,7 +141,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	ret = 0;
 
 	while (npages) {
-		ret = get_user_pages(cur_base,
+		ret = pin_user_pages(cur_base,
 				     min_t(unsigned long, npages,
 				     PAGE_SIZE / sizeof(struct page *)),
 				     gup_flags | FOLL_LONGTERM,
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index e99983f07663..e53b07dcfed5 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -426,7 +426,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 		while (nents) {
 			struct page **plist = &umem->page_chunk[i].plist[got];
 
-			rv = get_user_pages(first_page_va, nents,
+			rv = pin_user_pages(first_page_va, nents,
 					    foll_flags | FOLL_LONGTERM,
 					    plist, NULL);
 			if (rv < 0)
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 13/25] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (11 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 12/25] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 14/25] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Convert process_vm_access to use the new pin_user_pages_remote()
call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Also, release the pages via put_user_page*().

Also, rename "pages" to "pinned_pages", as this makes for
easier reading of process_vm_rw_single_vec().

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/process_vm_access.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7bef6c0..fd20ab675b85 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
 		if (copy > len)
 			copy = len;
 
-		if (vm_write) {
+		if (vm_write)
 			copied = copy_page_from_iter(page, offset, copy, iter);
-			set_page_dirty_lock(page);
-		} else {
+		else
 			copied = copy_page_to_iter(page, offset, copy, iter);
-		}
+
 		len -= copied;
 		if (copied < copy && iov_iter_count(iter))
 			return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		flags |= FOLL_WRITE;
 
 	while (!rc && nr_pages && iov_iter_count(iter)) {
-		int pages = min(nr_pages, max_pages_per_loop);
+		int pinned_pages = min(nr_pages, max_pages_per_loop);
 		int locked = 1;
 		size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		 * current/current->mm
 		 */
 		down_read(&mm->mmap_sem);
-		pages = get_user_pages_remote(task, mm, pa, pages, flags,
-					      process_pages, NULL, &locked);
+		pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+						     flags, process_pages,
+						     NULL, &locked);
 		if (locked)
 			up_read(&mm->mmap_sem);
-		if (pages <= 0)
+		if (pinned_pages <= 0)
 			return -EFAULT;
 
-		bytes = pages * PAGE_SIZE - start_offset;
+		bytes = pinned_pages * PAGE_SIZE - start_offset;
 		if (bytes > len)
 			bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
 					 vm_write);
 		len -= bytes;
 		start_offset = 0;
-		nr_pages -= pages;
-		pa += pages * PAGE_SIZE;
-		while (pages)
-			put_page(process_pages[--pages]);
+		nr_pages -= pinned_pages;
+		pa += pinned_pages * PAGE_SIZE;
+
+		/* If vm_write is set, the pages need to be made dirty: */
+		put_user_pages_dirty_lock(process_pages, pinned_pages,
+					  vm_write);
 	}
 
 	return rc;
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 14/25] drm/via: set FOLL_PIN via pin_user_pages_fast()
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (12 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 13/25] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 15/25] fs/io_uring: set FOLL_PIN via pin_user_pages() John Hubbard
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Daniel Vetter

Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the drm/via driver was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
is to change get_user_pages() to pin_user_pages().

Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/gpu/drm/via/via_dmablit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 3db000aacd26..37c5e572993a 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  drm_via_dmablit_t *xfer)
 	vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
 	if (NULL == vsg->pages)
 		return -ENOMEM;
-	ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+	ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
 			vsg->num_pages,
 			vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
 			vsg->pages);
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 15/25] fs/io_uring: set FOLL_PIN via pin_user_pages()
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (13 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 14/25] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 16/25] net/xdp: " John Hubbard
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Convert fs/io_uring to use the new pin_user_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

In partial anticipation of this work, the io_uring code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/io_uring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9b1833fedc5c..c6ff9cc7fe71 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4539,7 +4539,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 		ret = 0;
 		down_read(&current->mm->mmap_sem);
-		pret = get_user_pages(ubuf, nr_pages,
+		pret = pin_user_pages(ubuf, nr_pages,
 				      FOLL_WRITE | FOLL_LONGTERM,
 				      pages, vmas);
 		if (pret == nr_pages) {
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 16/25] net/xdp: set FOLL_PIN via pin_user_pages()
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (14 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 15/25] fs/io_uring: set FOLL_PIN via pin_user_pages() John Hubbard
@ 2019-12-16 22:25 ` " John Hubbard
  2019-12-16 22:25 ` [PATCH v11 17/25] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Convert net/xdp to use the new pin_longterm_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
here is to change get_user_pages() to pin_user_pages().

Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 net/xdp/xdp_umem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 3049af269fbf..d071003b5e76 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -291,7 +291,7 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
 		return -ENOMEM;
 
 	down_read(&current->mm->mmap_sem);
-	npgs = get_user_pages(umem->address, umem->npgs,
+	npgs = pin_user_pages(umem->address, umem->npgs,
 			      gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
 	up_read(&current->mm->mmap_sem);
 
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 17/25] media/v4l2-core: set pages dirty upon releasing DMA buffers
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (15 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 16/25] net/xdp: " John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 18/25] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Christoph Hellwig, Hans Verkuil,
	stable

After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a
bare put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages
were potentially receiving data from the device.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..28262190c3ab 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
 	BUG_ON(dma->sglen);
 
 	if (dma->pages) {
-		for (i = 0; i < dma->nr_pages; i++)
+		for (i = 0; i < dma->nr_pages; i++) {
+			if (dma->direction == DMA_FROM_DEVICE)
+				set_page_dirty_lock(dma->pages[i]);
 			put_page(dma->pages[i]);
+		}
 		kfree(dma->pages);
 		dma->pages = NULL;
 	}
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 18/25] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (16 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 17/25] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 19/25] vfio, mm: " John Hubbard
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Hans Verkuil

1. Change v4l2 from get_user_pages() to pin_user_pages().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..162a2633b1e3 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
 
-	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
+	err = pin_user_pages(data & PAGE_MASK, dma->nr_pages,
 			     flags | FOLL_LONGTERM, dma->pages, NULL);
 
 	if (err != dma->nr_pages) {
 		dma->nr_pages = (err >= 0) ? err : 0;
-		dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+		dprintk(1, "pin_user_pages: err=%d [%d]\n", err,
 			dma->nr_pages);
 		return err < 0 ? err : -EINVAL;
 	}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
 	BUG_ON(dma->sglen);
 
 	if (dma->pages) {
-		for (i = 0; i < dma->nr_pages; i++) {
-			if (dma->direction == DMA_FROM_DEVICE)
-				set_page_dirty_lock(dma->pages[i]);
-			put_page(dma->pages[i]);
-		}
+		put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+					  dma->direction == DMA_FROM_DEVICE);
 		kfree(dma->pages);
 		dma->pages = NULL;
 	}
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 19/25] vfio, mm: pin_user_pages (FOLL_PIN) and put_user_page() conversion
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (17 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 18/25] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
@ 2019-12-16 22:25 ` " John Hubbard
  2019-12-16 22:25 ` [PATCH v11 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page() John Hubbard
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

1. Change vfio from get_user_pages_remote(), to
pin_user_pages_remote().

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is
probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de

Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b800fc9a0251..18bfc2fc8e6d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -309,9 +309,8 @@ static int put_pfn(unsigned long pfn, int prot)
 {
 	if (!is_invalid_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
-		if (prot & IOMMU_WRITE)
-			SetPageDirty(page);
-		put_page(page);
+
+		put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
 		return 1;
 	}
 	return 0;
@@ -329,7 +328,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		flags |= FOLL_WRITE;
 
 	down_read(&mm->mmap_sem);
-	ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
+	ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM,
 				    page, NULL, NULL);
 	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page()
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (18 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 19/25] vfio, mm: " John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 21/25] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" John Hubbard
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

1. Convert from get_user_pages() to pin_user_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page().

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 arch/powerpc/mm/book3s64/iommu_api.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..a86547822034 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	for (entry = 0; entry < entries; entry += chunk) {
 		unsigned long n = min(entries - entry, chunk);
 
-		ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
+		ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
 				FOLL_WRITE | FOLL_LONGTERM,
 				mem->hpages + entry, NULL);
 		if (ret == n) {
@@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	return 0;
 
 free_exit:
-	/* free the reference taken */
-	for (i = 0; i < pinned; i++)
-		put_page(mem->hpages[i]);
+	/* free the references taken */
+	put_user_pages(mem->hpages, pinned);
 
 	vfree(mem->hpas);
 	kfree(mem);
@@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 		if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
 			SetPageDirty(page);
 
-		put_page(page);
+		put_user_page(page);
+
 		mem->hpas[i] = 0;
 	}
 }
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 21/25] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1"
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (19 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page() John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 22/25] mm, tree-wide: rename put_user_page*() to unpin_user_page*() John Hubbard
                   ` (5 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Fix the gup benchmark flags to use the symbolic FOLL_WRITE,
instead of a hard-coded "1" value.

Also, clean up the filtering of gup flags a little, by just doing
it once before issuing any of the get_user_pages*() calls. This
makes it harder to overlook, instead of having little "gup_flags & 1"
phrases in the function calls.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup_benchmark.c                         | 9 ++++++---
 tools/testing/selftests/vm/gup_benchmark.c | 6 +++++-
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d7f8db..7fc44d25eca7 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -48,18 +48,21 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = (next - addr) / PAGE_SIZE;
 		}
 
+		/* Filter out most gup flags: only allow a tiny subset here: */
+		gup->flags &= FOLL_WRITE;
+
 		switch (cmd) {
 		case GUP_FAST_BENCHMARK:
-			nr = get_user_pages_fast(addr, nr, gup->flags & 1,
+			nr = get_user_pages_fast(addr, nr, gup->flags,
 						 pages + i);
 			break;
 		case GUP_LONGTERM_BENCHMARK:
 			nr = get_user_pages(addr, nr,
-					    (gup->flags & 1) | FOLL_LONGTERM,
+					    gup->flags | FOLL_LONGTERM,
 					    pages + i, NULL);
 			break;
 		case GUP_BENCHMARK:
-			nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
+			nr = get_user_pages(addr, nr, gup->flags, pages + i,
 					    NULL);
 			break;
 		default:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index 485cf06ef013..389327e9b30a 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,9 @@
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
 
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE	0x01	/* check pte is writable */
+
 struct gup_benchmark {
 	__u64 get_delta_usec;
 	__u64 put_delta_usec;
@@ -85,7 +88,8 @@ int main(int argc, char **argv)
 	}
 
 	gup.nr_pages_per_call = nr_pages;
-	gup.flags = write;
+	if (write)
+		gup.flags |= FOLL_WRITE;
 
 	fd = open("/sys/kernel/debug/gup_benchmark", O_RDWR);
 	if (fd == -1)
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 22/25] mm, tree-wide: rename put_user_page*() to unpin_user_page*()
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (20 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 21/25] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 23/25] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

In order to provide a clearer, more symmetric API for pinning
and unpinning DMA pages. This way, pin_user_pages*() calls
match up with unpin_user_pages*() calls, and the API is a lot
closer to being self-explanatory.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/core-api/pin_user_pages.rst   |  2 +-
 arch/powerpc/mm/book3s64/iommu_api.c        |  4 +--
 drivers/gpu/drm/via/via_dmablit.c           |  4 +--
 drivers/infiniband/core/umem.c              |  2 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |  2 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |  6 ++--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 ++--
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  2 +-
 drivers/infiniband/sw/siw/siw_mem.c         |  2 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |  4 +--
 drivers/platform/goldfish/goldfish_pipe.c   |  4 +--
 drivers/vfio/vfio_iommu_type1.c             |  2 +-
 fs/io_uring.c                               |  4 +--
 include/linux/mm.h                          | 26 ++++++++---------
 mm/gup.c                                    | 32 ++++++++++-----------
 mm/process_vm_access.c                      |  4 +--
 net/xdp/xdp_umem.c                          |  2 +-
 18 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 71849830cd48..1d490155ecd7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -219,7 +219,7 @@ since the system was booted, via two new /proc/vmstat entries: ::
     /proc/vmstat/nr_foll_pin_requested
 
 Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
-because there is a noticeable performance drop in put_user_page(), when they
+because there is a noticeable performance drop in unpin_user_page(), when they
 are activated.
 
 References
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index a86547822034..eba73ebd8ae5 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -168,7 +168,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 
 free_exit:
 	/* free the references taken */
-	put_user_pages(mem->hpages, pinned);
+	unpin_user_pages(mem->hpages, pinned);
 
 	vfree(mem->hpas);
 	kfree(mem);
@@ -214,7 +214,7 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 		if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
 			SetPageDirty(page);
 
-		put_user_page(page);
+		unpin_user_page(page);
 
 		mem->hpas[i] = 0;
 	}
diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 37c5e572993a..719d036c9384 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -188,8 +188,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t *vsg)
 		kfree(vsg->desc_pages);
 		/* fall through */
 	case dr_via_pages_locked:
-		put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
-					  (vsg->direction == DMA_FROM_DEVICE));
+		unpin_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
+					   (vsg->direction == DMA_FROM_DEVICE));
 		/* fall through */
 	case dr_via_pages_alloc:
 		vfree(vsg->pages);
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 55daefaa9b88..a6094766b6f5 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,7 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 
 	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
 		page = sg_page_iter_page(&sg_iter);
-		put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
+		unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
 	}
 
 	sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index 9a94761765c0..3b505006c0a6 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,7 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 			     size_t npages, bool dirty)
 {
-	put_user_pages_dirty_lock(p, npages, dirty);
+	unpin_user_pages_dirty_lock(p, npages, dirty);
 
 	if (mm) { /* during close after signal, mm can be NULL */
 		atomic64_sub(npages, &mm->pinned_vm);
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index 8269ab040c21..78a48aea3faf 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -482,7 +482,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 
 	ret = pci_map_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
 	if (ret < 0) {
-		put_user_page(pages[0]);
+		unpin_user_page(pages[0]);
 		goto out;
 	}
 
@@ -490,7 +490,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 				 mthca_uarc_virt(dev, uar, i));
 	if (ret) {
 		pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-		put_user_page(sg_page(&db_tab->page[i].mem));
+		unpin_user_page(sg_page(&db_tab->page[i].mem));
 		goto out;
 	}
 
@@ -556,7 +556,7 @@ void mthca_cleanup_user_db_tab(struct mthca_dev *dev, struct mthca_uar *uar,
 		if (db_tab->page[i].uvirt) {
 			mthca_UNMAP_ICM(dev, mthca_uarc_virt(dev, uar, i), 1);
 			pci_unmap_sg(dev->pdev, &db_tab->page[i].mem, 1, PCI_DMA_TODEVICE);
-			put_user_page(sg_page(&db_tab->page[i].mem));
+			unpin_user_page(sg_page(&db_tab->page[i].mem));
 		}
 	}
 
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 7fc4b5f81fcd..342e3172ca40 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,7 +40,7 @@
 static void __qib_release_user_pages(struct page **p, size_t num_pages,
 				     int dirty)
 {
-	put_user_pages_dirty_lock(p, num_pages, dirty);
+	unpin_user_pages_dirty_lock(p, num_pages, dirty);
 }
 
 /**
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 1a3cc2957e3a..a67599b5a550 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -317,7 +317,7 @@ static int qib_user_sdma_page_to_frags(const struct qib_devdata *dd,
 		 * the caller can ignore this page.
 		 */
 		if (put) {
-			put_user_page(page);
+			unpin_user_page(page);
 		} else {
 			/* coalesce case */
 			kunmap(page);
@@ -631,7 +631,7 @@ static void qib_user_sdma_free_pkt_frag(struct device *dev,
 			kunmap(pkt->addr[i].page);
 
 		if (pkt->addr[i].put_page)
-			put_user_page(pkt->addr[i].page);
+			unpin_user_page(pkt->addr[i].page);
 		else
 			__free_page(pkt->addr[i].page);
 	} else if (pkt->addr[i].kvaddr) {
@@ -706,7 +706,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
 	/* if error, return all pages not managed by pkt */
 free_pages:
 	while (i < j)
-		put_user_page(pages[i++]);
+		unpin_user_page(pages[i++]);
 
 done:
 	return ret;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 600896727d34..bd9f944b68fc 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -75,7 +75,7 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
 		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
 			page = sg_page(sg);
 			pa = sg_phys(sg);
-			put_user_pages_dirty_lock(&page, 1, dirty);
+			unpin_user_pages_dirty_lock(&page, 1, dirty);
 			usnic_dbg("pa: %pa\n", &pa);
 		}
 		kfree(chunk);
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index e53b07dcfed5..e2061dc0b043 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -63,7 +63,7 @@ struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int stag_index)
 static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
 			   bool dirty)
 {
-	put_user_pages_dirty_lock(chunk->plist, num_pages, dirty);
+	unpin_user_pages_dirty_lock(chunk->plist, num_pages, dirty);
 }
 
 void siw_umem_release(struct siw_umem *umem, bool dirty)
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 162a2633b1e3..13b65ed9e74c 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
 	BUG_ON(dma->sglen);
 
 	if (dma->pages) {
-		put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
-					  dma->direction == DMA_FROM_DEVICE);
+		unpin_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+					    dma->direction == DMA_FROM_DEVICE);
 		kfree(dma->pages);
 		dma->pages = NULL;
 	}
diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index 2a5901efecde..1ab207ec9c94 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -360,8 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
 	*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-	put_user_pages_dirty_lock(pipe->pages, pages_count,
-				  !is_write && *consumed_size > 0);
+	unpin_user_pages_dirty_lock(pipe->pages, pages_count,
+				    !is_write && *consumed_size > 0);
 
 	mutex_unlock(&pipe->lock);
 	return 0;
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 18bfc2fc8e6d..a177bf2c6683 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -310,7 +310,7 @@ static int put_pfn(unsigned long pfn, int prot)
 	if (!is_invalid_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
 
-		put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
+		unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
 		return 1;
 	}
 	return 0;
diff --git a/fs/io_uring.c b/fs/io_uring.c
index c6ff9cc7fe71..d4acb1eec456 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4418,7 +4418,7 @@ static int io_sqe_buffer_unregister(struct io_ring_ctx *ctx)
 		struct io_mapped_ubuf *imu = &ctx->user_bufs[i];
 
 		for (j = 0; j < imu->nr_bvecs; j++)
-			put_user_page(imu->bvec[j].bv_page);
+			unpin_user_page(imu->bvec[j].bv_page);
 
 		if (ctx->account_mem)
 			io_unaccount_mem(ctx->user, imu->nr_bvecs);
@@ -4563,7 +4563,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 			 * release any pages we did get
 			 */
 			if (pret > 0)
-				put_user_pages(pages, pret);
+				unpin_user_pages(pages, pret);
 			if (ctx->account_mem)
 				io_unaccount_mem(ctx->user, nr_pages);
 			kvfree(imu->bvec);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0fb9929e00af..6a1a357e7d86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1045,27 +1045,27 @@ static inline void put_page(struct page *page)
 }
 
 /**
- * put_user_page() - release a gup-pinned page
+ * unpin_user_page() - release a gup-pinned page
  * @page:            pointer to page to be released
  *
  * Pages that were pinned via pin_user_pages*() must be released via either
- * put_user_page(), or one of the put_user_pages*() routines. This is so that
- * eventually such pages can be separately tracked and uniquely handled. In
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that eventually such pages can be separately tracked and uniquely handled. In
  * particular, interactions with RDMA and filesystems need special handling.
  *
- * put_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. put_user_page() calls must
+ * unpin_user_page() and put_page() are not interchangeable, despite this early
+ * implementation that makes them look the same. unpin_user_page() calls must
  * be perfectly matched up with pin*() calls.
  */
-static inline void put_user_page(struct page *page)
+static inline void unpin_user_page(struct page *page)
 {
 	put_page(page);
 }
 
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
-			       bool make_dirty);
+void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+				 bool make_dirty);
 
-void put_user_pages(struct page **pages, unsigned long npages);
+void unpin_user_pages(struct page **pages, unsigned long npages);
 
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
@@ -2595,7 +2595,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
-#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
+#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
@@ -2630,7 +2630,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
  * Direct IO). This lets the filesystem know that some non-file-system entity is
  * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
  * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
- * a call to put_user_page().
+ * a call to unpin_user_page().
  *
  * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
  * and separate refcounting mechanisms, however, and that means that each has
@@ -2638,7 +2638,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
  *
  *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
  *
- *     FOLL_PIN: pin_user_pages*() to acquire, and put_user_pages to release.
+ *     FOLL_PIN: pin_user_pages*() to acquire, and unpin_user_pages to release.
  *
  * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
  * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
@@ -2648,7 +2648,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
  * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
  * directly by the caller. That's in order to help avoid mismatches when
  * releasing pages: get_user_pages*() pages must be released via put_page(),
- * while pin_user_pages*() pages must be released via put_user_page().
+ * while pin_user_pages*() pages must be released via unpin_user_page().
  *
  * Please see Documentation/vm/pin_user_pages.rst for more information.
  */
diff --git a/mm/gup.c b/mm/gup.c
index 4862ff982bc3..73aedcefa4bd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -52,7 +52,7 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
 }
 
 /**
- * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
+ * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
  * @npages: number of pages in the @pages array.
  * @make_dirty: whether to mark the pages dirty
@@ -62,19 +62,19 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
  *
  * For each page in the @pages array, make that page (or its head page, if a
  * compound page) dirty, if @make_dirty is true, and if the page was previously
- * listed as clean. In any case, releases all pages using put_user_page(),
- * possibly via put_user_pages(), for the non-dirty case.
+ * listed as clean. In any case, releases all pages using unpin_user_page(),
+ * possibly via unpin_user_pages(), for the non-dirty case.
  *
- * Please see the put_user_page() documentation for details.
+ * Please see the unpin_user_page() documentation for details.
  *
  * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
  * required, then the caller should a) verify that this is really correct,
  * because _lock() is usually required, and b) hand code it:
- * set_page_dirty_lock(), put_user_page().
+ * set_page_dirty_lock(), unpin_user_page().
  *
  */
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
-			       bool make_dirty)
+void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+				 bool make_dirty)
 {
 	unsigned long index;
 
@@ -85,7 +85,7 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 	 */
 
 	if (!make_dirty) {
-		put_user_pages(pages, npages);
+		unpin_user_pages(pages, npages);
 		return;
 	}
 
@@ -113,21 +113,21 @@ void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 		 */
 		if (!PageDirty(page))
 			set_page_dirty_lock(page);
-		put_user_page(page);
+		unpin_user_page(page);
 	}
 }
-EXPORT_SYMBOL(put_user_pages_dirty_lock);
+EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 
 /**
- * put_user_pages() - release an array of gup-pinned pages.
+ * unpin_user_pages() - release an array of gup-pinned pages.
  * @pages:  array of pages to be marked dirty and released.
  * @npages: number of pages in the @pages array.
  *
- * For each page in the @pages array, release the page using put_user_page().
+ * For each page in the @pages array, release the page using unpin_user_page().
  *
- * Please see the put_user_page() documentation for details.
+ * Please see the unpin_user_page() documentation for details.
  */
-void put_user_pages(struct page **pages, unsigned long npages)
+void unpin_user_pages(struct page **pages, unsigned long npages)
 {
 	unsigned long index;
 
@@ -137,9 +137,9 @@ void put_user_pages(struct page **pages, unsigned long npages)
 	 * single operation to the head page should suffice.
 	 */
 	for (index = 0; index < npages; index++)
-		put_user_page(pages[index]);
+		unpin_user_page(pages[index]);
 }
-EXPORT_SYMBOL(put_user_pages);
+EXPORT_SYMBOL(unpin_user_pages);
 
 #ifdef CONFIG_MMU
 static struct page *no_page_table(struct vm_area_struct *vma,
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index fd20ab675b85..de41e830cdac 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -126,8 +126,8 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		pa += pinned_pages * PAGE_SIZE;
 
 		/* If vm_write is set, the pages need to be made dirty: */
-		put_user_pages_dirty_lock(process_pages, pinned_pages,
-					  vm_write);
+		unpin_user_pages_dirty_lock(process_pages, pinned_pages,
+					    vm_write);
 	}
 
 	return rc;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index d071003b5e76..ac182c38f7b0 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -212,7 +212,7 @@ static int xdp_umem_map_pages(struct xdp_umem *umem)
 
 static void xdp_umem_unpin_pages(struct xdp_umem *umem)
 {
-	put_user_pages_dirty_lock(umem->pgs, umem->npgs, true);
+	unpin_user_pages_dirty_lock(umem->pgs, umem->npgs, true);
 
 	kfree(umem->pgs);
 	umem->pgs = NULL;
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 23/25] mm/gup: track FOLL_PIN pages
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (21 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 22/25] mm, tree-wide: rename put_user_page*() to unpin_user_page*() John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-17 14:19   ` [PATCH v12 " John Hubbard
  2019-12-16 22:25 ` [PATCH v11 24/25] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
                   ` (3 subsequent siblings)
  26 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Kirill A . Shutemov

Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via unpin_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], and [3].

This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/

Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/core-api/pin_user_pages.rst |   2 +-
 include/linux/mm.h                        |  83 ++++-
 include/linux/mmzone.h                    |   2 +
 include/linux/page_ref.h                  |  10 +
 mm/gup.c                                  | 409 +++++++++++++++++-----
 mm/huge_memory.c                          |  29 +-
 mm/hugetlb.c                              |  38 +-
 mm/vmstat.c                               |   2 +
 8 files changed, 439 insertions(+), 136 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 1d490155ecd7..2db14df1f2d7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -53,7 +53,7 @@ Which flags are set by each wrapper
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
 pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+value: GUP_PIN_COUNTING_BIAS.::
 
  Function
  --------
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a1a357e7d86..bb44c4d2ada7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1016,6 +1016,8 @@ static inline void get_page(struct page *page)
 	page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
 	page = compound_head(page);
@@ -1044,29 +1046,80 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:            pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many pin_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
+ *
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special handling.
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
  *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkclean and other calls that race to set up page
+ * table entries.
  */
-static inline void unpin_user_page(struct page *page)
-{
-	put_page(page);
-}
+#define GUP_PIN_COUNTING_BIAS (1U << 10)
 
+void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 				 bool make_dirty);
-
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
+/**
+ * page_dma_pinned() - report if a page is pinned for DMA.
+ *
+ * This function checks if a page has been pinned via a call to
+ * pin_user_pages*().
+ *
+ * The return value is partially fuzzy: false is not fuzzy, because it means
+ * "definitely not pinned for DMA", but true means "probably pinned for DMA, but
+ * possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS worth
+ * of normal page references".
+ *
+ * False positives are OK, because: a) it's unlikely for a page to get that many
+ * refcounts, and b) all the callers of this routine are expected to be able to
+ * deal gracefully with a false positive.
+ *
+ * For more information, please see Documentation/vm/pin_user_pages.rst.
+ *
+ * @page:	pointer to page to be queried.
+ * @Return:	True, if it is likely that the page has been "dma-pinned".
+ *		False, if the page is definitely not dma-pinned.
+ */
+static inline bool page_dma_pinned(struct page *page)
+{
+	/*
+	 * page_ref_count() is signed. If that refcount overflows, then
+	 * page_ref_count() returns a negative value, and callers will avoid
+	 * further incrementing the refcount.
+	 *
+	 * Here, for that overflow case, use the signed bit to count a little
+	 * bit higher via unsigned math, and thus still get an accurate result
+	 * from page_dma_pinned().
+	 */
+	return ((unsigned int)page_ref_count(compound_head(page))) >=
+		GUP_PIN_COUNTING_BIAS;
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 89d8ff06c9ce..a7418f7a44da 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -244,6 +244,8 @@ enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+	NR_FOLL_PIN_REQUESTED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
+	NR_FOLL_PIN_RETURNED,	/* pages returned via unpin_user_page() */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 14d14beb1f7f..b9cbe553d1e7 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -102,6 +102,16 @@ static inline void page_ref_sub(struct page *page, int nr)
 		__page_ref_mod(page, -nr);
 }
 
+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+	int ret = atomic_sub_return(nr, &page->_refcount);
+
+	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
+		__page_ref_mod(page, -nr);
+
+	return ret;
+}
+
 static inline void page_ref_inc(struct page *page)
 {
 	atomic_inc(&page->_refcount);
diff --git a/mm/gup.c b/mm/gup.c
index 73aedcefa4bd..c2793a86450e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -36,6 +36,20 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
 						  struct page **pages,
 						  struct vm_area_struct **vmas,
 						  unsigned int flags);
+
+#ifdef CONFIG_DEBUG_VM
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+	mod_node_page_state(page_pgdat(page), item, count);
+}
+#else
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+}
+#endif
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -51,6 +65,156 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
 	return head;
 }
 
+/**
+ * try_pin_compound_head() - mark a compound page as being used by
+ * pin_user_pages*().
+ *
+ * This is the FOLL_PIN counterpart to try_get_compound_head().
+ *
+ * @page:	pointer to page to be marked
+ * @Return:	the compound head page, with ref appropriately incremented,
+ * or NULL upon failure.
+ */
+__must_check struct page *try_pin_compound_head(struct page *page, int refs)
+{
+	struct page *head = try_get_compound_head(page,
+						  GUP_PIN_COUNTING_BIAS * refs);
+	if (!head)
+		return NULL;
+
+	__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, refs);
+	return head;
+}
+
+/*
+ * try_grab_compound_head() - attempt to elevate a page's refcount, by a
+ * flags-dependent amount.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
+ * same time. (That's true throughout the get_user_pages*() and
+ * pin_user_pages*() APIs.) Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: head page (with refcount appropriately incremented) for success, or
+ * NULL upon failure. If neither FOLL_GET nor FOLL_PIN was set, that's
+ * considered failure, and furthermore, a likely bug in the caller, so a warning
+ * is also emitted.
+ */
+static __maybe_unused struct page *try_grab_compound_head(struct page *page,
+							  int refs,
+							  unsigned int flags)
+{
+	if (flags & FOLL_GET)
+		return try_get_compound_head(page, refs);
+	else if (flags & FOLL_PIN)
+		return try_pin_compound_head(page, refs);
+
+	WARN_ON_ONCE(1);
+	return NULL;
+}
+
+/**
+ * try_grab_page() - elevate a page's refcount by a flag-dependent amount
+ *
+ * This might not do anything at all, depending on the flags argument.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ *
+ * @page:    pointer to page to be grabbed
+ * @flags:   gup flags: these are the FOLL_* flag values.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
+ * time. Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: true for success, or if no action was required (if neither FOLL_PIN
+ * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
+ * FOLL_PIN was set, but the page could not be grabbed.
+ */
+bool __must_check try_grab_page(struct page *page, unsigned int flags)
+{
+	if (flags & FOLL_GET)
+		return try_get_page(page);
+	else if (flags & FOLL_PIN) {
+		page = compound_head(page);
+		WARN_ON_ONCE(flags & FOLL_GET);
+
+		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+			return false;
+
+		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+		__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
+	}
+
+	return true;
+}
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	bool is_devmap = page_is_devmap_managed(page);
+
+	if (is_devmap) {
+		int count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+
+		__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+		/*
+		 * devmap page refcounts are 1-based, rather than 0-based: if
+		 * refcount is 1, then the page is free and the refcount is
+		 * stable because nobody holds a reference on the page.
+		 */
+		if (count == 1)
+			free_devmap_managed_page(page);
+		else if (!count)
+			__put_page(page);
+	}
+
+	return is_devmap;
+}
+#else
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
+
+/**
+ * unpin_user_page() - release a dma-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
+ */
+void unpin_user_page(struct page *page)
+{
+	page = compound_head(page);
+
+	/*
+	 * For devmap managed pages we need to catch refcount transition from
+	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
+	 * page is free and we need to inform the device driver through
+	 * callback. See include/linux/memremap.h and HMM for details.
+	 */
+	if (__unpin_devmap_managed_user_page(page))
+		return;
+
+	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+		__put_page(page);
+
+	__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+}
+EXPORT_SYMBOL(unpin_user_page);
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -237,10 +401,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	page = vm_normal_page(vma, address, pte);
-	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
+	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
-		 * Only return device mapping pages in the FOLL_GET case since
-		 * they are only valid while holding the pgmap reference.
+		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
+		 * case since they are only valid while holding the pgmap
+		 * reference.
 		 */
 		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
 		if (*pgmap)
@@ -278,11 +443,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		goto retry;
 	}
 
-	if (flags & FOLL_GET) {
-		if (unlikely(!try_get_page(page))) {
-			page = ERR_PTR(-ENOMEM);
-			goto out;
-		}
+	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
+	if (unlikely(!try_grab_page(page, flags))) {
+		page = ERR_PTR(-ENOMEM);
+		goto out;
 	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
@@ -544,7 +708,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	/* make this handle hugepd */
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
-		BUG_ON(flags & FOLL_GET);
+		WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
 		return page;
 	}
 
@@ -1131,6 +1295,36 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 	return pages_done;
 }
 
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * Parts of FOLL_LONGTERM behavior are incompatible with
+	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
+	 * vmas. However, this only comes up if locked is set, and there are
+	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
+	 * allow what we can.
+	 */
+	if (gup_flags & FOLL_LONGTERM) {
+		if (WARN_ON_ONCE(locked))
+			return -EINVAL;
+		/*
+		 * This will check the vmas (even if our vmas arg is NULL)
+		 * and return -ENOTSUPP if DAX isn't allowed in this case:
+		 */
+		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+					     vmas, gup_flags | FOLL_TOUCH |
+					     FOLL_REMOTE);
+	}
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked,
+				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+}
+
 /*
  * get_user_pages_remote() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -1205,28 +1399,8 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
-	/*
-	 * Parts of FOLL_LONGTERM behavior are incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas. However, this only comes up if locked is set, and there are
-	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
-	 * allow what we can.
-	 */
-	if (gup_flags & FOLL_LONGTERM) {
-		if (WARN_ON_ONCE(locked))
-			return -EINVAL;
-		/*
-		 * This will check the vmas (even if our vmas arg is NULL)
-		 * and return -ENOTSUPP if DAX isn't allowed in this case:
-		 */
-		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
-					     vmas, gup_flags | FOLL_TOUCH |
-					     FOLL_REMOTE);
-	}
-
-	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
-				       locked,
-				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+	return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				       pages, vmas, locked);
 }
 EXPORT_SYMBOL(get_user_pages_remote);
 
@@ -1421,10 +1595,11 @@ static long __get_user_pages_locked(struct task_struct *tsk,
 	return i ? : -EFAULT;
 }
 
-long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
-			   unsigned long start, unsigned long nr_pages,
-			   unsigned int gup_flags, struct page **pages,
-			   struct vm_area_struct **vmas, int *locked)
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
 {
 	return 0;
 }
@@ -1864,13 +2039,17 @@ static inline pte_t gup_get_pte(pte_t *ptep)
 #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
 
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
+					    unsigned int flags,
 					    struct page **pages)
 {
 	while ((*nr) - nr_start) {
 		struct page *page = pages[--(*nr)];
 
 		ClearPageReferenced(page);
-		put_page(page);
+		if (flags & FOLL_PIN)
+			unpin_user_page(page);
+		else
+			put_page(page);
 	}
 }
 
@@ -1903,7 +2082,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
-				undo_dev_pagemap(nr, nr_start, pages);
+				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
 		} else if (pte_special(pte))
@@ -1912,7 +2091,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
-		head = try_get_compound_head(page, 1);
+		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
 
@@ -1957,7 +2136,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+			     unsigned long end, unsigned int flags,
+			     struct page **pages, int *nr)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
@@ -1967,12 +2147,15 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, pages);
+			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
-		get_page(page);
+		if (unlikely(!try_grab_page(page, flags))) {
+			undo_dev_pagemap(nr, nr_start, flags, pages);
+			return 0;
+		}
 		(*nr)++;
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);
@@ -1983,48 +2166,52 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
 }
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
 }
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
 }
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
@@ -2042,8 +2229,11 @@ static int record_subpages(struct page *page, unsigned long addr,
 	return nr;
 }
 
-static void put_compound_head(struct page *page, int refs)
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
+	if (flags & FOLL_PIN)
+		refs *= GUP_PIN_COUNTING_BIAS;
+
 	/* Do a get_page() first, in case refs == page->_refcount */
 	get_page(page);
 	page_ref_sub(page, refs);
@@ -2083,12 +2273,12 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(head, refs);
+	head = try_grab_compound_head(head, refs, flags);
 	if (!head)
 		return 0;
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2136,18 +2326,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
+		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
+					     pages, nr);
 	}
 
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pmd_page(orig), refs);
+	head = try_grab_compound_head(pmd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2157,7 +2348,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages, int *nr)
+			unsigned long end, unsigned int flags,
+			struct page **pages, int *nr)
 {
 	struct page *head, *page;
 	int refs;
@@ -2168,18 +2360,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
+		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
+					     pages, nr);
 	}
 
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pud_page(orig), refs);
+	head = try_grab_compound_head(pud_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2203,12 +2396,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pgd_page(orig), refs);
+	head = try_grab_compound_head(pgd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
 	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2371,6 +2564,14 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	unsigned long len, end;
 	unsigned long flags;
 	int nr = 0;
+	/*
+	 * Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
+	 * because gup fast is always a "pin with a +1 page refcount" request.
+	 */
+	unsigned int gup_flags = FOLL_GET;
+
+	if (write)
+		gup_flags |= FOLL_WRITE;
 
 	start = untagged_addr(start) & PAGE_MASK;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
@@ -2396,7 +2597,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
 		local_irq_save(flags);
-		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
+		gup_pgd_range(start, end, gup_flags, pages, &nr);
 		local_irq_restore(flags);
 	}
 
@@ -2435,7 +2636,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 	int nr = 0, ret = 0;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-				       FOLL_FORCE | FOLL_PIN)))
+				       FOLL_FORCE | FOLL_PIN | FOLL_GET)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2478,11 +2679,11 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 
 /**
  * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
  *
  * Attempt to pin user pages in memory without taking mm->mmap_sem.
  * If not successful, it will fall back to taking the lock and
@@ -2502,6 +2703,13 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
+	/*
+	 * The caller may or may not have explicitly set FOLL_GET; either way is
+	 * OK. However, internally (within mm/gup.c), gup fast variants must set
+	 * FOLL_GET, because gup fast is always a "pin with a +1 page refcount"
+	 * request.
+	 */
+	gup_flags |= FOLL_GET;
 	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
@@ -2509,9 +2717,12 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 /**
  * pin_user_pages_fast() - pin user pages in memory without taking locks
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_fast().
+ * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
+ * get_user_pages_fast() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2519,21 +2730,24 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_fast(start, nr_pages, gup_flags, pages);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
 /**
  * pin_user_pages_remote() - pin pages of a remote process (task != current)
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_remote().
+ * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
+ * get_user_pages_remote() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2543,22 +2757,24 @@ long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 			   unsigned int gup_flags, struct page **pages,
 			   struct vm_area_struct **vmas, int *locked)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
-				     vmas, locked);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				       pages, vmas, locked);
 }
 EXPORT_SYMBOL(pin_user_pages_remote);
 
 /**
  * pin_user_pages() - pin user pages in memory for use by other devices
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages().
+ * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
+ * FOLL_PIN is set.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2567,11 +2783,12 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __gup_longterm_locked(current, current->mm, start, nr_pages,
+				     pages, vmas, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41a0fbddc96b..a71646a4c4d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -945,6 +945,11 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 */
 	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (flags & FOLL_WRITE && !pmd_write(*pmd))
 		return NULL;
 
@@ -960,7 +965,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
@@ -968,7 +973,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1088,6 +1094,11 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	if (flags & FOLL_WRITE && !pud_write(*pud))
 		return NULL;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (pud_present(*pud) && pud_devmap(*pud))
 		/* pass */;
 	else
@@ -1099,8 +1110,10 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	/*
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
+	 *
+	 * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
@@ -1108,7 +1121,8 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1484,8 +1498,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if (!try_grab_page(page, flags))
+		return ERR_PTR(-ENOMEM);
+
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd, flags);
+
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/*
 		 * We don't mlock() pte-mapped THPs. This way we can avoid
@@ -1522,8 +1541,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 skip_mlock:
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-	if (flags & FOLL_GET)
-		get_page(page);
 
 out:
 	return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..0e21bbe9f017 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4326,19 +4326,6 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
 		page = pte_page(huge_ptep_get(pte));
 
-		/*
-		 * Instead of doing 'try_get_page()' below in the same_page
-		 * loop, just check the count once here.
-		 */
-		if (unlikely(page_count(page) <= 0)) {
-			if (pages) {
-				spin_unlock(ptl);
-				remainder = 0;
-				err = -ENOMEM;
-				break;
-			}
-		}
-
 		/*
 		 * If subpage information not requested, update counters
 		 * and skip the same_page loop below.
@@ -4356,7 +4343,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page(pages[i]);
+			if (!try_grab_page(pages[i], flags)) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				WARN_ON_ONCE(1);
+				break;
+			}
 		}
 
 		if (vmas)
@@ -4916,6 +4909,12 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	pte_t pte;
+
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 retry:
 	ptl = pmd_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -4928,8 +4927,11 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	pte = huge_ptep_get((pte_t *)pmd);
 	if (pte_present(pte)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
-		if (flags & FOLL_GET)
-			get_page(page);
+		if (unlikely(!try_grab_page(page, flags))) {
+			WARN_ON_ONCE(1);
+			page = NULL;
+			goto out;
+		}
 	} else {
 		if (is_hugetlb_entry_migration(pte)) {
 			spin_unlock(ptl);
@@ -4950,7 +4952,7 @@ struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
@@ -4959,7 +4961,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 struct page * __weak
 follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..b56808bae1b4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1168,6 +1168,8 @@ const char * const vmstat_text[] = {
 	"nr_dirtied",
 	"nr_written",
 	"nr_kernel_misc_reclaimable",
+	"nr_foll_pin_requested",
+	"nr_foll_pin_returned",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 24/25] mm/gup_benchmark: support pin_user_pages() and related calls
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (22 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 23/25] mm/gup: track FOLL_PIN pages John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-16 22:25 ` [PATCH v11 25/25] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
                   ` (2 subsequent siblings)
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages_fast(): via the '-a' command line option
* pin_user_pages():      via the '-b' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().

Also, for the commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new is_dma_pinned() routine.
Those commands are:

    PIN_FAST_BENCHMARK     : calls pin_user_pages_fast()
    PIN_BENCHMARK          : calls pin_user_pages()

In between the calls to pin_*() and unpin_user_pages(),
check each page: if page_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup_benchmark.c                         | 65 ++++++++++++++++++++--
 tools/testing/selftests/vm/gup_benchmark.c | 15 ++++-
 2 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7fc44d25eca7..76d32db48af8 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,8 @@
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -19,6 +21,42 @@ struct gup_benchmark {
 	__u64 expansion[10];	/* For future use */
 };
 
+static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages)
+{
+	int i;
+
+	switch (cmd) {
+	case GUP_FAST_BENCHMARK:
+	case GUP_LONGTERM_BENCHMARK:
+	case GUP_BENCHMARK:
+		for (i = 0; i < nr_pages; i++)
+			put_page(pages[i]);
+		break;
+
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		unpin_user_pages(pages, nr_pages);
+		break;
+	}
+}
+
+static void verify_dma_pinned(int cmd, struct page **pages,
+			      unsigned long nr_pages)
+{
+	int i;
+
+	switch (cmd) {
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
+		for (i = 0; i < nr_pages; i++) {
+			if (WARN(!page_dma_pinned(pages[i]),
+				 "pages[%d] is NOT dma-pinned\n", i))
+				break;
+		}
+		break;
+	}
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
 		struct gup_benchmark *gup)
 {
@@ -65,6 +103,14 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = get_user_pages(addr, nr, gup->flags, pages + i,
 					    NULL);
 			break;
+		case PIN_FAST_BENCHMARK:
+			nr = pin_user_pages_fast(addr, nr, gup->flags,
+						 pages + i);
+			break;
+		case PIN_BENCHMARK:
+			nr = pin_user_pages(addr, nr, gup->flags, pages + i,
+					    NULL);
+			break;
 		default:
 			return -1;
 		}
@@ -75,15 +121,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 	}
 	end_time = ktime_get();
 
+	/* Shifting the meaning of nr_pages: now it is actual number pinned: */
+	nr_pages = i;
+
 	gup->get_delta_usec = ktime_us_delta(end_time, start_time);
 	gup->size = addr - gup->addr;
 
+	/*
+	 * Take an un-benchmark-timed moment to verify DMA pinned
+	 * state: print a warning if any non-dma-pinned pages are found:
+	 */
+	verify_dma_pinned(cmd, pages, nr_pages);
+
 	start_time = ktime_get();
-	for (i = 0; i < nr_pages; i++) {
-		if (!pages[i])
-			break;
-		put_page(pages[i]);
-	}
+
+	put_back_pages(cmd, pages, nr_pages);
+
 	end_time = ktime_get();
 	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
 
@@ -101,6 +154,8 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
 	case GUP_FAST_BENCHMARK:
 	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
+	case PIN_FAST_BENCHMARK:
+	case PIN_BENCHMARK:
 		break;
 	default:
 		return -EINVAL;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index 389327e9b30a..43b4dfe161a2 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,10 @@
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
 
+/* Similar to above, but use FOLL_PIN instead of FOLL_GET. */
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 5, struct gup_benchmark)
+
 /* Just the flags we need, copied from mm.h: */
 #define FOLL_WRITE	0x01	/* check pte is writable */
 
@@ -40,8 +44,14 @@ int main(int argc, char **argv)
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUwSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:f:abtTLUuwSH")) != -1) {
 		switch (opt) {
+		case 'a':
+			cmd = PIN_FAST_BENCHMARK;
+			break;
+		case 'b':
+			cmd = PIN_BENCHMARK;
+			break;
 		case 'm':
 			size = atoi(optarg) * MB;
 			break;
@@ -63,6 +73,9 @@ int main(int argc, char **argv)
 		case 'U':
 			cmd = GUP_BENCHMARK;
 			break;
+		case 'u':
+			cmd = GUP_FAST_BENCHMARK;
+			break;
 		case 'w':
 			write = 1;
 			break;
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v11 25/25] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (23 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 24/25] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
@ 2019-12-16 22:25 ` John Hubbard
  2019-12-17  7:39 ` [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN Jan Kara
  2019-12-19 13:26 ` Leon Romanovsky
  26 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-16 22:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it the the run_vmtests suite
is going to cause no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provide a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 tools/testing/selftests/vm/run_vmtests | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index a692ea828317..df6a6bf3f238 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -112,6 +112,28 @@ echo "NOTE: The above hugetlb tests provide minimal coverage.  Use"
 echo "      https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "      hugetlb regression testing."
 
+echo "--------------------------------------------"
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo "--------------------------------------------"
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
+echo "------------------------------------------"
+echo "running gup_benchmark -b (pin_user_pages)"
+echo "------------------------------------------"
+./gup_benchmark -b
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 echo "-------------------"
 echo "running userfaultfd"
 echo "-------------------"
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (24 preceding siblings ...)
  2019-12-16 22:25 ` [PATCH v11 25/25] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
@ 2019-12-17  7:39 ` Jan Kara
  2019-12-19 13:26 ` Leon Romanovsky
  26 siblings, 0 replies; 67+ messages in thread
From: Jan Kara @ 2019-12-17  7:39 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML

Hi!

On Mon 16-12-19 14:25:12, John Hubbard wrote:
> Hi,
> 
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> extends that tracking to a few select subsystems. More subsystems will
> be added in follow up work.

Just a note for Andrew and others watching this series: At this point I'm fine
with the series so if someone still has some review feedback or wants to
check the series, now is the right time. Otherwise I think Andrew can push
the series to MM tree so that it will get wider testing exposure and is
prepared for the next merge window.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v12 23/25] mm/gup: track FOLL_PIN pages
  2019-12-16 22:25 ` [PATCH v11 23/25] mm/gup: track FOLL_PIN pages John Hubbard
@ 2019-12-17 14:19   ` " John Hubbard
  0 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-17 14:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Kirill A . Shutemov

Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via unpin_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);

What to do in response to encountering such a page, is left to later
patchsets. There is discussion about this in [1], [2], and [3].

This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().

[1] Some slow progress on get_user_pages() (Apr 2, 2019):
    https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018):
    https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018):
    https://lwn.net/Articles/753027/

Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---

Hi,

The kbuild test robot noticed that try_pin_compound_head() can be
declared static, in mm/gup.c. This updated patch does that.

thanks,
John Hubbard
NVIDIA

 Documentation/core-api/pin_user_pages.rst |   2 +-
 include/linux/mm.h                        |  83 ++++-
 include/linux/mmzone.h                    |   2 +
 include/linux/page_ref.h                  |  10 +
 mm/gup.c                                  | 410 +++++++++++++++++-----
 mm/huge_memory.c                          |  29 +-
 mm/hugetlb.c                              |  38 +-
 mm/vmstat.c                               |   2 +
 8 files changed, 440 insertions(+), 136 deletions(-)

diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst
index 1d490155ecd7..2db14df1f2d7 100644
--- a/Documentation/core-api/pin_user_pages.rst
+++ b/Documentation/core-api/pin_user_pages.rst
@@ -53,7 +53,7 @@ Which flags are set by each wrapper
 For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
 flags the caller provides. The caller is required to pass in a non-null struct
 pages* array, and the function then pin pages by incrementing each by a special
-value. For now, that value is +1, just like get_user_pages*().::
+value: GUP_PIN_COUNTING_BIAS.::
 
  Function
  --------
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6a1a357e7d86..bb44c4d2ada7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1016,6 +1016,8 @@ static inline void get_page(struct page *page)
 	page_ref_inc(page);
 }
 
+bool __must_check try_grab_page(struct page *page, unsigned int flags);
+
 static inline __must_check bool try_get_page(struct page *page)
 {
 	page = compound_head(page);
@@ -1044,29 +1046,80 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
-/**
- * unpin_user_page() - release a gup-pinned page
- * @page:            pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many pin_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, pin_user_pages() becomes special: such pages are marked as
+ * distinct from normal pages. As such, the unpin_user_page() call (and its
+ * variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
+ *
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to pin_user_pages() and unpin_user_page() becomes
+ * simpler, due to the fact that adding an even power of two to the page
+ * refcount has the effect of using only the upper N bits, for the code that
+ * counts up using the bias value. This means that the lower bits are left for
+ * the exclusive use of the original code that increments and decrements by one
+ * (or at least, by much smaller values than the bias value).
  *
- * Pages that were pinned via pin_user_pages*() must be released via either
- * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
- * that eventually such pages can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special handling.
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
  *
- * unpin_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. unpin_user_page() calls must
- * be perfectly matched up with pin*() calls.
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkclean and other calls that race to set up page
+ * table entries.
  */
-static inline void unpin_user_page(struct page *page)
-{
-	put_page(page);
-}
+#define GUP_PIN_COUNTING_BIAS (1U << 10)
 
+void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 				 bool make_dirty);
-
 void unpin_user_pages(struct page **pages, unsigned long npages);
 
+/**
+ * page_dma_pinned() - report if a page is pinned for DMA.
+ *
+ * This function checks if a page has been pinned via a call to
+ * pin_user_pages*().
+ *
+ * The return value is partially fuzzy: false is not fuzzy, because it means
+ * "definitely not pinned for DMA", but true means "probably pinned for DMA, but
+ * possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS worth
+ * of normal page references".
+ *
+ * False positives are OK, because: a) it's unlikely for a page to get that many
+ * refcounts, and b) all the callers of this routine are expected to be able to
+ * deal gracefully with a false positive.
+ *
+ * For more information, please see Documentation/vm/pin_user_pages.rst.
+ *
+ * @page:	pointer to page to be queried.
+ * @Return:	True, if it is likely that the page has been "dma-pinned".
+ *		False, if the page is definitely not dma-pinned.
+ */
+static inline bool page_dma_pinned(struct page *page)
+{
+	/*
+	 * page_ref_count() is signed. If that refcount overflows, then
+	 * page_ref_count() returns a negative value, and callers will avoid
+	 * further incrementing the refcount.
+	 *
+	 * Here, for that overflow case, use the signed bit to count a little
+	 * bit higher via unsigned math, and thus still get an accurate result
+	 * from page_dma_pinned().
+	 */
+	return ((unsigned int)page_ref_count(compound_head(page))) >=
+		GUP_PIN_COUNTING_BIAS;
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 89d8ff06c9ce..a7418f7a44da 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -244,6 +244,8 @@ enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+	NR_FOLL_PIN_REQUESTED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
+	NR_FOLL_PIN_RETURNED,	/* pages returned via unpin_user_page() */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 14d14beb1f7f..b9cbe553d1e7 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -102,6 +102,16 @@ static inline void page_ref_sub(struct page *page, int nr)
 		__page_ref_mod(page, -nr);
 }
 
+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+	int ret = atomic_sub_return(nr, &page->_refcount);
+
+	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
+		__page_ref_mod(page, -nr);
+
+	return ret;
+}
+
 static inline void page_ref_inc(struct page *page)
 {
 	atomic_inc(&page->_refcount);
diff --git a/mm/gup.c b/mm/gup.c
index 73aedcefa4bd..39b2f683bd2e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -36,6 +36,20 @@ static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
 						  struct page **pages,
 						  struct vm_area_struct **vmas,
 						  unsigned int flags);
+
+#ifdef CONFIG_DEBUG_VM
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+	mod_node_page_state(page_pgdat(page), item, count);
+}
+#else
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+}
+#endif
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -51,6 +65,157 @@ static inline struct page *try_get_compound_head(struct page *page, int refs)
 	return head;
 }
 
+/**
+ * try_pin_compound_head() - mark a compound page as being used by
+ * pin_user_pages*().
+ *
+ * This is the FOLL_PIN counterpart to try_get_compound_head().
+ *
+ * @page:	pointer to page to be marked
+ * @Return:	the compound head page, with ref appropriately incremented,
+ * or NULL upon failure.
+ */
+static __must_check struct page *try_pin_compound_head(struct page *page,
+						       int refs)
+{
+	struct page *head = try_get_compound_head(page,
+						  GUP_PIN_COUNTING_BIAS * refs);
+	if (!head)
+		return NULL;
+
+	__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, refs);
+	return head;
+}
+
+/*
+ * try_grab_compound_head() - attempt to elevate a page's refcount, by a
+ * flags-dependent amount.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
+ * same time. (That's true throughout the get_user_pages*() and
+ * pin_user_pages*() APIs.) Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: head page (with refcount appropriately incremented) for success, or
+ * NULL upon failure. If neither FOLL_GET nor FOLL_PIN was set, that's
+ * considered failure, and furthermore, a likely bug in the caller, so a warning
+ * is also emitted.
+ */
+static __maybe_unused struct page *try_grab_compound_head(struct page *page,
+							  int refs,
+							  unsigned int flags)
+{
+	if (flags & FOLL_GET)
+		return try_get_compound_head(page, refs);
+	else if (flags & FOLL_PIN)
+		return try_pin_compound_head(page, refs);
+
+	WARN_ON_ONCE(1);
+	return NULL;
+}
+
+/**
+ * try_grab_page() - elevate a page's refcount by a flag-dependent amount
+ *
+ * This might not do anything at all, depending on the flags argument.
+ *
+ * "grab" names in this file mean, "look at flags to decide whether to use
+ * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
+ *
+ * @page:    pointer to page to be grabbed
+ * @flags:   gup flags: these are the FOLL_* flag values.
+ *
+ * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the same
+ * time. Cases:
+ *
+ *    FOLL_GET: page's refcount will be incremented by 1.
+ *    FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
+ *
+ * Return: true for success, or if no action was required (if neither FOLL_PIN
+ * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
+ * FOLL_PIN was set, but the page could not be grabbed.
+ */
+bool __must_check try_grab_page(struct page *page, unsigned int flags)
+{
+	if (flags & FOLL_GET)
+		return try_get_page(page);
+	else if (flags & FOLL_PIN) {
+		page = compound_head(page);
+		WARN_ON_ONCE(flags & FOLL_GET);
+
+		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
+			return false;
+
+		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
+		__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
+	}
+
+	return true;
+}
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	bool is_devmap = page_is_devmap_managed(page);
+
+	if (is_devmap) {
+		int count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+
+		__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+		/*
+		 * devmap page refcounts are 1-based, rather than 0-based: if
+		 * refcount is 1, then the page is free and the refcount is
+		 * stable because nobody holds a reference on the page.
+		 */
+		if (count == 1)
+			free_devmap_managed_page(page);
+		else if (!count)
+			__put_page(page);
+	}
+
+	return is_devmap;
+}
+#else
+static bool __unpin_devmap_managed_user_page(struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
+
+/**
+ * unpin_user_page() - release a dma-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * unpin_user_page(), or one of the unpin_user_pages*() routines. This is so
+ * that such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
+ */
+void unpin_user_page(struct page *page)
+{
+	page = compound_head(page);
+
+	/*
+	 * For devmap managed pages we need to catch refcount transition from
+	 * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
+	 * page is free and we need to inform the device driver through
+	 * callback. See include/linux/memremap.h and HMM for details.
+	 */
+	if (__unpin_devmap_managed_user_page(page))
+		return;
+
+	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+		__put_page(page);
+
+	__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+}
+EXPORT_SYMBOL(unpin_user_page);
+
 /**
  * unpin_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -237,10 +402,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	page = vm_normal_page(vma, address, pte);
-	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
+	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
-		 * Only return device mapping pages in the FOLL_GET case since
-		 * they are only valid while holding the pgmap reference.
+		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
+		 * case since they are only valid while holding the pgmap
+		 * reference.
 		 */
 		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
 		if (*pgmap)
@@ -278,11 +444,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		goto retry;
 	}
 
-	if (flags & FOLL_GET) {
-		if (unlikely(!try_get_page(page))) {
-			page = ERR_PTR(-ENOMEM);
-			goto out;
-		}
+	/* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
+	if (unlikely(!try_grab_page(page, flags))) {
+		page = ERR_PTR(-ENOMEM);
+		goto out;
 	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
@@ -544,7 +709,7 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	/* make this handle hugepd */
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
-		BUG_ON(flags & FOLL_GET);
+		WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
 		return page;
 	}
 
@@ -1131,6 +1296,36 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 	return pages_done;
 }
 
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * Parts of FOLL_LONGTERM behavior are incompatible with
+	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
+	 * vmas. However, this only comes up if locked is set, and there are
+	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
+	 * allow what we can.
+	 */
+	if (gup_flags & FOLL_LONGTERM) {
+		if (WARN_ON_ONCE(locked))
+			return -EINVAL;
+		/*
+		 * This will check the vmas (even if our vmas arg is NULL)
+		 * and return -ENOTSUPP if DAX isn't allowed in this case:
+		 */
+		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
+					     vmas, gup_flags | FOLL_TOUCH |
+					     FOLL_REMOTE);
+	}
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked,
+				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+}
+
 /*
  * get_user_pages_remote() - pin user pages in memory
  * @tsk:	the task_struct to use for page fault accounting, or
@@ -1205,28 +1400,8 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
-	/*
-	 * Parts of FOLL_LONGTERM behavior are incompatible with
-	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
-	 * vmas. However, this only comes up if locked is set, and there are
-	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
-	 * allow what we can.
-	 */
-	if (gup_flags & FOLL_LONGTERM) {
-		if (WARN_ON_ONCE(locked))
-			return -EINVAL;
-		/*
-		 * This will check the vmas (even if our vmas arg is NULL)
-		 * and return -ENOTSUPP if DAX isn't allowed in this case:
-		 */
-		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
-					     vmas, gup_flags | FOLL_TOUCH |
-					     FOLL_REMOTE);
-	}
-
-	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
-				       locked,
-				       gup_flags | FOLL_TOUCH | FOLL_REMOTE);
+	return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				       pages, vmas, locked);
 }
 EXPORT_SYMBOL(get_user_pages_remote);
 
@@ -1421,10 +1596,11 @@ static long __get_user_pages_locked(struct task_struct *tsk,
 	return i ? : -EFAULT;
 }
 
-long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
-			   unsigned long start, unsigned long nr_pages,
-			   unsigned int gup_flags, struct page **pages,
-			   struct vm_area_struct **vmas, int *locked)
+static long __get_user_pages_remote(struct task_struct *tsk,
+				    struct mm_struct *mm,
+				    unsigned long start, unsigned long nr_pages,
+				    unsigned int gup_flags, struct page **pages,
+				    struct vm_area_struct **vmas, int *locked)
 {
 	return 0;
 }
@@ -1864,13 +2040,17 @@ static inline pte_t gup_get_pte(pte_t *ptep)
 #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
 
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
+					    unsigned int flags,
 					    struct page **pages)
 {
 	while ((*nr) - nr_start) {
 		struct page *page = pages[--(*nr)];
 
 		ClearPageReferenced(page);
-		put_page(page);
+		if (flags & FOLL_PIN)
+			unpin_user_page(page);
+		else
+			put_page(page);
 	}
 }
 
@@ -1903,7 +2083,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
-				undo_dev_pagemap(nr, nr_start, pages);
+				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
 		} else if (pte_special(pte))
@@ -1912,7 +2092,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
-		head = try_get_compound_head(page, 1);
+		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
 
@@ -1957,7 +2137,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+			     unsigned long end, unsigned int flags,
+			     struct page **pages, int *nr)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
@@ -1967,12 +2148,15 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, pages);
+			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
-		get_page(page);
+		if (unlikely(!try_grab_page(page, flags))) {
+			undo_dev_pagemap(nr, nr_start, flags, pages);
+			return 0;
+		}
 		(*nr)++;
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);
@@ -1983,48 +2167,52 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
 }
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
 }
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
 }
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
@@ -2042,8 +2230,11 @@ static int record_subpages(struct page *page, unsigned long addr,
 	return nr;
 }
 
-static void put_compound_head(struct page *page, int refs)
+static void put_compound_head(struct page *page, int refs, unsigned int flags)
 {
+	if (flags & FOLL_PIN)
+		refs *= GUP_PIN_COUNTING_BIAS;
+
 	/* Do a get_page() first, in case refs == page->_refcount */
 	get_page(page);
 	page_ref_sub(page, refs);
@@ -2083,12 +2274,12 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(head, refs);
+	head = try_grab_compound_head(head, refs, flags);
 	if (!head)
 		return 0;
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2136,18 +2327,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
+		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
+					     pages, nr);
 	}
 
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pmd_page(orig), refs);
+	head = try_grab_compound_head(pmd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2157,7 +2349,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages, int *nr)
+			unsigned long end, unsigned int flags,
+			struct page **pages, int *nr)
 {
 	struct page *head, *page;
 	int refs;
@@ -2168,18 +2361,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
+		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
+					     pages, nr);
 	}
 
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pud_page(orig), refs);
+	head = try_grab_compound_head(pud_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2203,12 +2397,12 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	refs = record_subpages(page, addr, end, pages + *nr);
 
-	head = try_get_compound_head(pgd_page(orig), refs);
+	head = try_grab_compound_head(pgd_page(orig), refs, flags);
 	if (!head)
 		return 0;
 
 	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
-		put_compound_head(head, refs);
+		put_compound_head(head, refs, flags);
 		return 0;
 	}
 
@@ -2371,6 +2565,14 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	unsigned long len, end;
 	unsigned long flags;
 	int nr = 0;
+	/*
+	 * Internally (within mm/gup.c), gup fast variants must set FOLL_GET,
+	 * because gup fast is always a "pin with a +1 page refcount" request.
+	 */
+	unsigned int gup_flags = FOLL_GET;
+
+	if (write)
+		gup_flags |= FOLL_WRITE;
 
 	start = untagged_addr(start) & PAGE_MASK;
 	len = (unsigned long) nr_pages << PAGE_SHIFT;
@@ -2396,7 +2598,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	if (IS_ENABLED(CONFIG_HAVE_FAST_GUP) &&
 	    gup_fast_permitted(start, end)) {
 		local_irq_save(flags);
-		gup_pgd_range(start, end, write ? FOLL_WRITE : 0, pages, &nr);
+		gup_pgd_range(start, end, gup_flags, pages, &nr);
 		local_irq_restore(flags);
 	}
 
@@ -2435,7 +2637,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 	int nr = 0, ret = 0;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-				       FOLL_FORCE | FOLL_PIN)))
+				       FOLL_FORCE | FOLL_PIN | FOLL_GET)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2478,11 +2680,11 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 
 /**
  * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
+ * @start:      starting user address
+ * @nr_pages:   number of pages from start to pin
+ * @gup_flags:  flags modifying pin behaviour
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at least nr_pages long.
  *
  * Attempt to pin user pages in memory without taking mm->mmap_sem.
  * If not successful, it will fall back to taking the lock and
@@ -2502,6 +2704,13 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
 		return -EINVAL;
 
+	/*
+	 * The caller may or may not have explicitly set FOLL_GET; either way is
+	 * OK. However, internally (within mm/gup.c), gup fast variants must set
+	 * FOLL_GET, because gup fast is always a "pin with a +1 page refcount"
+	 * request.
+	 */
+	gup_flags |= FOLL_GET;
 	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
@@ -2509,9 +2718,12 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 /**
  * pin_user_pages_fast() - pin user pages in memory without taking locks
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_fast().
+ * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
+ * get_user_pages_fast() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2519,21 +2731,24 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
 int pin_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_fast(start, nr_pages, gup_flags, pages);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
 }
 EXPORT_SYMBOL_GPL(pin_user_pages_fast);
 
 /**
  * pin_user_pages_remote() - pin pages of a remote process (task != current)
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages_remote().
+ * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
+ * get_user_pages_remote() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2543,22 +2758,24 @@ long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 			   unsigned int gup_flags, struct page **pages,
 			   struct vm_area_struct **vmas, int *locked)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
-				     vmas, locked);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags,
+				       pages, vmas, locked);
 }
 EXPORT_SYMBOL(pin_user_pages_remote);
 
 /**
  * pin_user_pages() - pin user pages in memory for use by other devices
  *
- * For now, this is a placeholder function, until various call sites are
- * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
- * this is identical to get_user_pages().
+ * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
+ * FOLL_PIN is set.
+ *
+ * FOLL_PIN means that the pages must be released via unpin_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
  *
  * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
  * is NOT intended for Case 2 (RDMA: long-term pins).
@@ -2567,11 +2784,12 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas)
 {
-	/*
-	 * This is a placeholder, until the pin functionality is activated.
-	 * Until then, just behave like the corresponding get_user_pages*()
-	 * routine.
-	 */
-	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __gup_longterm_locked(current, current->mm, start, nr_pages,
+				     pages, vmas, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41a0fbddc96b..a71646a4c4d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -945,6 +945,11 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 */
 	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (flags & FOLL_WRITE && !pmd_write(*pmd))
 		return NULL;
 
@@ -960,7 +965,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
@@ -968,7 +973,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1088,6 +1094,11 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	if (flags & FOLL_WRITE && !pud_write(*pud))
 		return NULL;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (pud_present(*pud) && pud_devmap(*pud))
 		/* pass */;
 	else
@@ -1099,8 +1110,10 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	/*
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
+	 *
+	 * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
@@ -1108,7 +1121,8 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+	if (!try_grab_page(page, flags))
+		page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1484,8 +1498,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+	if (!try_grab_page(page, flags))
+		return ERR_PTR(-ENOMEM);
+
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd, flags);
+
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/*
 		 * We don't mlock() pte-mapped THPs. This way we can avoid
@@ -1522,8 +1541,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 skip_mlock:
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
-	if (flags & FOLL_GET)
-		get_page(page);
 
 out:
 	return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..0e21bbe9f017 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4326,19 +4326,6 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		pfn_offset = (vaddr & ~huge_page_mask(h)) >> PAGE_SHIFT;
 		page = pte_page(huge_ptep_get(pte));
 
-		/*
-		 * Instead of doing 'try_get_page()' below in the same_page
-		 * loop, just check the count once here.
-		 */
-		if (unlikely(page_count(page) <= 0)) {
-			if (pages) {
-				spin_unlock(ptl);
-				remainder = 0;
-				err = -ENOMEM;
-				break;
-			}
-		}
-
 		/*
 		 * If subpage information not requested, update counters
 		 * and skip the same_page loop below.
@@ -4356,7 +4343,13 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page(pages[i]);
+			if (!try_grab_page(pages[i], flags)) {
+				spin_unlock(ptl);
+				remainder = 0;
+				err = -ENOMEM;
+				WARN_ON_ONCE(1);
+				break;
+			}
 		}
 
 		if (vmas)
@@ -4916,6 +4909,12 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	pte_t pte;
+
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 retry:
 	ptl = pmd_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -4928,8 +4927,11 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	pte = huge_ptep_get((pte_t *)pmd);
 	if (pte_present(pte)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
-		if (flags & FOLL_GET)
-			get_page(page);
+		if (unlikely(!try_grab_page(page, flags))) {
+			WARN_ON_ONCE(1);
+			page = NULL;
+			goto out;
+		}
 	} else {
 		if (is_hugetlb_entry_migration(pte)) {
 			spin_unlock(ptl);
@@ -4950,7 +4952,7 @@ struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
@@ -4959,7 +4961,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 struct page * __weak
 follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..b56808bae1b4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1168,6 +1168,8 @@ const char * const vmstat_text[] = {
 	"nr_dirtied",
 	"nr_written",
 	"nr_kernel_misc_reclaimable",
+	"nr_foll_pin_requested",
+	"nr_foll_pin_returned",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines
  2019-12-16 22:25 ` [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines John Hubbard
@ 2019-12-18 15:52   ` Kirill A. Shutemov
  2019-12-18 22:15     ` John Hubbard
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2019-12-18 15:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Christoph Hellwig, Aneesh Kumar K . V

On Mon, Dec 16, 2019 at 02:25:13PM -0800, John Hubbard wrote:
> +static void put_compound_head(struct page *page, int refs)
> +{
> +	/* Do a get_page() first, in case refs == page->_refcount */
> +	get_page(page);
> +	page_ref_sub(page, refs);
> +	put_page(page);
> +}

It's not terribly efficient. Maybe something like:

	VM_BUG_ON_PAGE(page_ref_count(page) < ref, page);
	if (refs > 2)
		page_ref_sub(page, refs - 1);
	put_page(page);

?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  2019-12-16 22:25 ` [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages John Hubbard
@ 2019-12-18 16:04   ` Kirill A. Shutemov
  2019-12-19  0:32     ` John Hubbard
  2019-12-19  0:40     ` [PATCH v12] " John Hubbard
  2019-12-19  5:27   ` [PATCH v11 04/25] " Dan Williams
  1 sibling, 2 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2019-12-18 16:04 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Christoph Hellwig

On Mon, Dec 16, 2019 at 02:25:16PM -0800, John Hubbard wrote:
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
> 
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>   and limit the functionality to "read only": return a bool,
>   with no side effects.
> 
> * Add a new routine, put_devmap_managed_page(), to handle checking
>   what kind of page it is, and what kind of refcount handling it
>   requires.
> 
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>   and limit the functionality to unconditionally freeing a devmap
>   page.

What the reason to separate put_devmap_managed_page() from
free_devmap_managed_page() if free_devmap_managed_page() has exacly one
caller? Is it preparation for the next patches?

> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 17 +++++++++++++----
>  mm/memremap.c      | 16 ++--------------
>  mm/swap.c          | 24 ++++++++++++++++++++++++
>  3 files changed, 39 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c97ea3b694e6..77a4df06c8a7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void free_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>  	if (!static_branch_unlikely(&devmap_managed_key))
>  		return false;
> @@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page *page)
>  	switch (page->pgmap->type) {
>  	case MEMORY_DEVICE_PRIVATE:
>  	case MEMORY_DEVICE_FS_DAX:
> -		__put_devmap_managed_page(page);
>  		return true;
>  	default:
>  		break;
> @@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page *page)
>  	return false;
>  }
>  
> +bool put_devmap_managed_page(struct page *page);
> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
> +static inline bool page_is_devmap_managed(struct page *page)
> +{
> +	return false;
> +}
> +
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
>  	return false;
> @@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page)
>  	 * need to inform the device driver through callback. See
>  	 * include/linux/memremap.h and HMM for details.
>  	 */
> -	if (put_devmap_managed_page(page))
> +	if (page_is_devmap_managed(page)) {
> +		put_devmap_managed_page(page);

put_devmap_managed_page() has yet another page_is_devmap_managed() check
inside. It looks strange.

>  		return;
> +	}
>  
>  	if (put_page_testzero(page))
>  		__put_page(page);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index e899fa876a62..2ba773859031 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
>  {
> -	int count = page_ref_dec_return(page);
> -
> -	/* still busy */
> -	if (count > 1)
> -		return;
> -
> -	/* only triggered by the dev_pagemap shutdown path */
> -	if (count == 0) {
> -		__put_page(page);
> -		return;
> -	}
> -
>  	/* notify page idle for dax */
>  	if (!is_device_private_page(page)) {
>  		wake_up_var(&page->_refcount);
> @@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
>  	page->mapping = NULL;
>  	page->pgmap->ops->page_free(page);
>  }
> -EXPORT_SYMBOL(__put_devmap_managed_page);
> +EXPORT_SYMBOL(free_devmap_managed_page);
>  #endif /* CONFIG_DEV_PAGEMAP_OPS */
> diff --git a/mm/swap.c b/mm/swap.c
> index 5341ae93861f..49f7c2eea0ba 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1102,3 +1102,27 @@ void __init swap_setup(void)
>  	 * _really_ don't want to cluster much more
>  	 */
>  }
> +
> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +bool put_devmap_managed_page(struct page *page)
> +{
> +	bool is_devmap = page_is_devmap_managed(page);
> +
> +	if (is_devmap) {

Reversing the condition would save you an indentation level.

> +		int count = page_ref_dec_return(page);
> +
> +		/*
> +		 * devmap page refcounts are 1-based, rather than 0-based: if
> +		 * refcount is 1, then the page is free and the refcount is
> +		 * stable because nobody holds a reference on the page.
> +		 */
> +		if (count == 1)
> +			free_devmap_managed_page(page);
> +		else if (!count)
> +			__put_page(page);
> +	}
> +
> +	return is_devmap;
> +}
> +EXPORT_SYMBOL(put_devmap_managed_page);
> +#endif
> -- 
> 2.24.1
> 
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
  2019-12-16 22:25 ` [PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM John Hubbard
@ 2019-12-18 16:19   ` Kirill A. Shutemov
  2019-12-18 22:15     ` John Hubbard
  0 siblings, 1 reply; 67+ messages in thread
From: Kirill A. Shutemov @ 2019-12-18 16:19 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Jason Gunthorpe

On Mon, Dec 16, 2019 at 02:25:18PM -0800, John Hubbard wrote:
> As it says in the updated comment in gup.c: current FOLL_LONGTERM
> behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the
> FS DAX check requirement on vmas.
> 
> However, the corresponding restriction in get_user_pages_remote() was
> slightly stricter than is actually required: it forbade all
> FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
> that do not set the "locked" arg.
> 
> Update the code and comments to loosen the restriction, allowing
> FOLL_LONGTERM in some cases.
> 
> Also, copy the DAX check ("if a VMA is DAX, don't allow long term
> pinning") from the VFIO call site, all the way into the internals
> of get_user_pages_remote() and __gup_longterm_locked(). That is:
> get_user_pages_remote() calls __gup_longterm_locked(), which in turn
> calls check_dax_vmas(). This check will then be removed from the VFIO
> call site in a subsequent patch.
> 
> Thanks to Jason Gunthorpe for pointing out a clean way to fix this,
> and to Dan Williams for helping clarify the DAX refactoring.
> 
> Tested-by: Alex Williamson <alex.williamson@redhat.com>
> Acked-by: Alex Williamson <alex.williamson@redhat.com>
> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  mm/gup.c | 27 ++++++++++++++++++++++-----
>  1 file changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 3ecce297a47f..c0c56888e7cc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,13 @@ struct follow_page_context {
>  	unsigned int page_mask;
>  };
>  
> +static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
> +						  struct mm_struct *mm,
> +						  unsigned long start,
> +						  unsigned long nr_pages,
> +						  struct page **pages,
> +						  struct vm_area_struct **vmas,
> +						  unsigned int flags);

Any particular reason for the forward declaration? Maybe move
get_user_pages_remote() down?

>  /*
>   * Return the compound head page with ref appropriately incremented,
>   * or NULL if that failed.
> @@ -1179,13 +1186,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>  		struct vm_area_struct **vmas, int *locked)
>  {
>  	/*
> -	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
> +	 * Parts of FOLL_LONGTERM behavior are incompatible with
>  	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> -	 * vmas.  As there are no users of this flag in this call we simply
> -	 * disallow this option for now.
> +	 * vmas. However, this only comes up if locked is set, and there are
> +	 * callers that do request FOLL_LONGTERM, but do not set locked. So,
> +	 * allow what we can.
>  	 */
> -	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
> -		return -EINVAL;
> +	if (gup_flags & FOLL_LONGTERM) {
> +		if (WARN_ON_ONCE(locked))
> +			return -EINVAL;
> +		/*
> +		 * This will check the vmas (even if our vmas arg is NULL)
> +		 * and return -ENOTSUPP if DAX isn't allowed in this case:
> +		 */
> +		return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
> +					     vmas, gup_flags | FOLL_TOUCH |
> +					     FOLL_REMOTE);
> +	}
>  
>  	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
>  				       locked,
> -- 
> 2.24.1
> 

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM
  2019-12-18 16:19   ` Kirill A. Shutemov
@ 2019-12-18 22:15     ` John Hubbard
  0 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-18 22:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Jason Gunthorpe

On 12/18/19 8:19 AM, Kirill A. Shutemov wrote:
...
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 3ecce297a47f..c0c56888e7cc 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -29,6 +29,13 @@ struct follow_page_context {
>>   	unsigned int page_mask;
>>   };
>>   
>> +static __always_inline long __gup_longterm_locked(struct task_struct *tsk,
>> +						  struct mm_struct *mm,
>> +						  unsigned long start,
>> +						  unsigned long nr_pages,
>> +						  struct page **pages,
>> +						  struct vm_area_struct **vmas,
>> +						  unsigned int flags);
> 
> Any particular reason for the forward declaration? Maybe move
> get_user_pages_remote() down?
> 

Yes, that's exactly why: I was thinking it would be cleaner to put in the
forward declaration, rather than moving code blocks, but either way seems
reasonable. I'll go ahead and move the code blocks and delete the forward
declaration, now that someone has weighed in in favor of that.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines
  2019-12-18 15:52   ` Kirill A. Shutemov
@ 2019-12-18 22:15     ` John Hubbard
  2019-12-18 22:45       ` Kirill A. Shutemov
  0 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-18 22:15 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Christoph Hellwig, Aneesh Kumar K . V

On 12/18/19 7:52 AM, Kirill A. Shutemov wrote:
> On Mon, Dec 16, 2019 at 02:25:13PM -0800, John Hubbard wrote:
>> +static void put_compound_head(struct page *page, int refs)
>> +{
>> +	/* Do a get_page() first, in case refs == page->_refcount */
>> +	get_page(page);
>> +	page_ref_sub(page, refs);
>> +	put_page(page);
>> +}
> 
> It's not terribly efficient. Maybe something like:
> 
> 	VM_BUG_ON_PAGE(page_ref_count(page) < ref, page);
> 	if (refs > 2)
> 		page_ref_sub(page, refs - 1);
> 	put_page(page);
> 
> ?

OK, but how about this instead? I don't see the need for a "2", as that
is a magic number that requires explanation. Whereas "1" is not a magic
number--here it means: either there are "many" (>1) refs, or not.

And the routine won't be called with refs less than about 32 (2MB huge
page, 64KB base page == 32 subpages) anyway.

	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
	/*
	 * Calling put_page() for each ref is unnecessarily slow. Only the last
	 * ref needs a put_page().
	 */
	if (refs > 1)
		page_ref_sub(page, refs - 1);
	put_page(page);


thanks,
-- 
John Hubbard
NVIDIA
  

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines
  2019-12-18 22:15     ` John Hubbard
@ 2019-12-18 22:45       ` Kirill A. Shutemov
  0 siblings, 0 replies; 67+ messages in thread
From: Kirill A. Shutemov @ 2019-12-18 22:45 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Christoph Hellwig, Aneesh Kumar K . V

On Wed, Dec 18, 2019 at 02:15:53PM -0800, John Hubbard wrote:
> On 12/18/19 7:52 AM, Kirill A. Shutemov wrote:
> > On Mon, Dec 16, 2019 at 02:25:13PM -0800, John Hubbard wrote:
> > > +static void put_compound_head(struct page *page, int refs)
> > > +{
> > > +	/* Do a get_page() first, in case refs == page->_refcount */
> > > +	get_page(page);
> > > +	page_ref_sub(page, refs);
> > > +	put_page(page);
> > > +}
> > 
> > It's not terribly efficient. Maybe something like:
> > 
> > 	VM_BUG_ON_PAGE(page_ref_count(page) < ref, page);
> > 	if (refs > 2)
> > 		page_ref_sub(page, refs - 1);
> > 	put_page(page);
> > 
> > ?
> 
> OK, but how about this instead? I don't see the need for a "2", as that
> is a magic number that requires explanation. Whereas "1" is not a magic
> number--here it means: either there are "many" (>1) refs, or not.

Yeah, it's my thinko. Sure, it has to be '1' (or >= 2, which is less readable).

> And the routine won't be called with refs less than about 32 (2MB huge
> page, 64KB base page == 32 subpages) anyway.

It's hard to make predictions about future :P

> 	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
> 	/*
> 	 * Calling put_page() for each ref is unnecessarily slow. Only the last
> 	 * ref needs a put_page().
> 	 */
> 	if (refs > 1)
> 		page_ref_sub(page, refs - 1);
> 	put_page(page);

Looks good to me.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  2019-12-18 16:04   ` Kirill A. Shutemov
@ 2019-12-19  0:32     ` John Hubbard
  2019-12-19  0:40     ` [PATCH v12] " John Hubbard
  1 sibling, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-19  0:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Christoph Hellwig

On 12/18/19 8:04 AM, Kirill A. Shutemov wrote:
> On Mon, Dec 16, 2019 at 02:25:16PM -0800, John Hubbard wrote:
>> An upcoming patch changes and complicates the refcounting and
>> especially the "put page" aspects of it. In order to keep
>> everything clean, refactor the devmap page release routines:
>>
>> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>>    and limit the functionality to "read only": return a bool,
>>    with no side effects.
>>
>> * Add a new routine, put_devmap_managed_page(), to handle checking
>>    what kind of page it is, and what kind of refcount handling it
>>    requires.
>>
>> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>>    and limit the functionality to unconditionally freeing a devmap
>>    page.
> 
> What the reason to separate put_devmap_managed_page() from
> free_devmap_managed_page() if free_devmap_managed_page() has exacly one
> caller? Is it preparation for the next patches?


Yes. A later patch, #23, adds another caller: __unpin_devmap_managed_user_page().

...
>> @@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page *page)
>>   	return false;
>>   }
>>   
>> +bool put_devmap_managed_page(struct page *page);
>> +
>>   #else /* CONFIG_DEV_PAGEMAP_OPS */
>> +static inline bool page_is_devmap_managed(struct page *page)
>> +{
>> +	return false;
>> +}
>> +
>>   static inline bool put_devmap_managed_page(struct page *page)
>>   {
>>   	return false;
>> @@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page)
>>   	 * need to inform the device driver through callback. See
>>   	 * include/linux/memremap.h and HMM for details.
>>   	 */
>> -	if (put_devmap_managed_page(page))
>> +	if (page_is_devmap_managed(page)) {
>> +		put_devmap_managed_page(page);
> 
> put_devmap_managed_page() has yet another page_is_devmap_managed() check
> inside. It looks strange.
> 

Good point, it's an extra unnecessary check. So to clean it up, I'll note
that the "if" check is required here in put_page(), in order to stay out of
non-inlined function calls in the hot path (put_page()). So I'll do the
following:

* Leave the above code as it is here

* Simplify put_devmap_managed_page(), it was trying to do two separate things,
   and those two things have different requirements. So change it to a void
   function, with a WARN_ON_ONCE to assert that page_is_devmap_managed() is true,

* And change the other caller (release_pages()) to do that check.

...
>> @@ -1102,3 +1102,27 @@ void __init swap_setup(void)
>>   	 * _really_ don't want to cluster much more
>>   	 */
>>   }
>> +
>> +#ifdef CONFIG_DEV_PAGEMAP_OPS
>> +bool put_devmap_managed_page(struct page *page)
>> +{
>> +	bool is_devmap = page_is_devmap_managed(page);
>> +
>> +	if (is_devmap) {
> 
> Reversing the condition would save you an indentation level.

Yes. Done.

I'll also git-reply with an updated patch so you can see what it looks like.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* [PATCH v12] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  2019-12-18 16:04   ` Kirill A. Shutemov
  2019-12-19  0:32     ` John Hubbard
@ 2019-12-19  0:40     ` " John Hubbard
  1 sibling, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-19  0:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A . Shutemov, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jens Axboe, Jonathan Corbet, Jérôme Glisse,
	Magnus Karlsson, Mauro Carvalho Chehab, Michael Ellerman,
	Michal Hocko, Mike Kravetz, Paul Mackerras, Shuah Khan,
	Vlastimil Babka, bpf, dri-devel, kvm, linux-block, linux-doc,
	linux-fsdevel, linux-kselftest, linux-media, linux-rdma,
	linuxppc-dev, netdev, linux-mm, LKML, John Hubbard,
	Christoph Hellwig

An upcoming patch changes and complicates the refcounting and
especially the "put page" aspects of it. In order to keep
everything clean, refactor the devmap page release routines:

* Rename put_devmap_managed_page() to page_is_devmap_managed(),
  and limit the functionality to "read only": return a bool,
  with no side effects.

* Add a new routine, put_devmap_managed_page(), to handle
  decrementing the refcount for ZONE_DEVICE pages.

* Change callers (just release_pages() and put_page()) to check
  page_is_devmap_managed() before calling the new
  put_devmap_managed_page() routine. This is a performance
  point: put_page() is a hot path, so we need to avoid non-
  inline function calls where possible.

* Rename __put_devmap_managed_page() to free_devmap_managed_page(),
  and limit the functionality to unconditionally freeing a devmap
  page.

This is originally based on a separate patch by Ira Weiny, which
applied to an early version of the put_user_page() experiments.
Since then, Jérôme Glisse suggested the refactoring described above.

Cc: Christoph Hellwig <hch@lst.de>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 18 +++++++++++++-----
 mm/memremap.c      | 16 ++--------------
 mm/swap.c          | 27 ++++++++++++++++++++++++++-
 3 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c97ea3b694e6..87b54126e46d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
 	if (!static_branch_unlikely(&devmap_managed_key))
 		return false;
@@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page *page)
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_FS_DAX:
-		__put_devmap_managed_page(page);
 		return true;
 	default:
 		break;
@@ -971,11 +971,17 @@ static inline bool put_devmap_managed_page(struct page *page)
 	return false;
 }
 
+void put_devmap_managed_page(struct page *page);
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool put_devmap_managed_page(struct page *page)
+static inline bool page_is_devmap_managed(struct page *page)
 {
 	return false;
 }
+
+static inline void put_devmap_managed_page(struct page *page)
+{
+}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1028,8 +1034,10 @@ static inline void put_page(struct page *page)
 	 * need to inform the device driver through callback. See
 	 * include/linux/memremap.h and HMM for details.
 	 */
-	if (put_devmap_managed_page(page))
+	if (page_is_devmap_managed(page)) {
+		put_devmap_managed_page(page);
 		return;
+	}
 
 	if (put_page_testzero(page))
 		__put_page(page);
diff --git a/mm/memremap.c b/mm/memremap.c
index e899fa876a62..2ba773859031 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void free_devmap_managed_page(struct page *page)
 {
-	int count = page_ref_dec_return(page);
-
-	/* still busy */
-	if (count > 1)
-		return;
-
-	/* only triggered by the dev_pagemap shutdown path */
-	if (count == 0) {
-		__put_page(page);
-		return;
-	}
-
 	/* notify page idle for dax */
 	if (!is_device_private_page(page)) {
 		wake_up_var(&page->_refcount);
@@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
 	page->mapping = NULL;
 	page->pgmap->ops->page_free(page);
 }
-EXPORT_SYMBOL(__put_devmap_managed_page);
+EXPORT_SYMBOL(free_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 5341ae93861f..cf39d24ada2a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -813,8 +813,10 @@ void release_pages(struct page **pages, int nr)
 			 * processing, and instead, expect a call to
 			 * put_page_testzero().
 			 */
-			if (put_devmap_managed_page(page))
+			if (page_is_devmap_managed(page)) {
+				put_devmap_managed_page(page);
 				continue;
+			}
 		}
 
 		page = compound_head(page);
@@ -1102,3 +1104,26 @@ void __init swap_setup(void)
 	 * _really_ don't want to cluster much more
 	 */
 }
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void put_devmap_managed_page(struct page *page)
+{
+	int count;
+
+	if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
+		return;
+
+	count = page_ref_dec_return(page);
+
+	/*
+	 * devmap page refcounts are 1-based, rather than 0-based: if
+	 * refcount is 1, then the page is free and the refcount is
+	 * stable because nobody holds a reference on the page.
+	 */
+	if (count == 1)
+		free_devmap_managed_page(page);
+	else if (!count)
+		__put_page(page);
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
+#endif
-- 
2.24.1


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  2019-12-16 22:25 ` [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages John Hubbard
  2019-12-18 16:04   ` Kirill A. Shutemov
@ 2019-12-19  5:27   ` " Dan Williams
  2019-12-19  5:48     ` John Hubbard
  1 sibling, 1 reply; 67+ messages in thread
From: Dan Williams @ 2019-12-19  5:27 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Daniel Vetter,
	Dave Chinner, David Airlie, David S . Miller, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, Maling list - DRI developers,
	KVM list, linux-block, Linux Doc Mailing List, linux-fsdevel,
	linux-kselftest, Linux-media, linux-rdma, linuxppc-dev, Netdev,
	Linux MM, LKML, Christoph Hellwig

On Mon, Dec 16, 2019 at 2:26 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
>
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>   and limit the functionality to "read only": return a bool,
>   with no side effects.
>
> * Add a new routine, put_devmap_managed_page(), to handle checking
>   what kind of page it is, and what kind of refcount handling it
>   requires.
>
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>   and limit the functionality to unconditionally freeing a devmap
>   page.
>
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
>
> Cc: Christoph Hellwig <hch@lst.de>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h | 17 +++++++++++++----
>  mm/memremap.c      | 16 ++--------------
>  mm/swap.c          | 24 ++++++++++++++++++++++++
>  3 files changed, 39 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c97ea3b694e6..77a4df06c8a7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -952,9 +952,10 @@ static inline bool is_zone_device_page(const struct page *page)
>  #endif
>
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void free_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>         if (!static_branch_unlikely(&devmap_managed_key))
>                 return false;
> @@ -963,7 +964,6 @@ static inline bool put_devmap_managed_page(struct page *page)
>         switch (page->pgmap->type) {
>         case MEMORY_DEVICE_PRIVATE:
>         case MEMORY_DEVICE_FS_DAX:
> -               __put_devmap_managed_page(page);
>                 return true;
>         default:
>                 break;
> @@ -971,7 +971,14 @@ static inline bool put_devmap_managed_page(struct page *page)
>         return false;
>  }
>
> +bool put_devmap_managed_page(struct page *page);
> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
> +static inline bool page_is_devmap_managed(struct page *page)
> +{
> +       return false;
> +}
> +
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
>         return false;
> @@ -1028,8 +1035,10 @@ static inline void put_page(struct page *page)
>          * need to inform the device driver through callback. See
>          * include/linux/memremap.h and HMM for details.
>          */
> -       if (put_devmap_managed_page(page))
> +       if (page_is_devmap_managed(page)) {
> +               put_devmap_managed_page(page);
>                 return;
> +       }
>
>         if (put_page_testzero(page))
>                 __put_page(page);
> diff --git a/mm/memremap.c b/mm/memremap.c
> index e899fa876a62..2ba773859031 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -411,20 +411,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
>  {
> -       int count = page_ref_dec_return(page);
> -
> -       /* still busy */
> -       if (count > 1)
> -               return;
> -
> -       /* only triggered by the dev_pagemap shutdown path */
> -       if (count == 0) {
> -               __put_page(page);
> -               return;
> -       }
> -
>         /* notify page idle for dax */
>         if (!is_device_private_page(page)) {
>                 wake_up_var(&page->_refcount);
> @@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
>         page->mapping = NULL;
>         page->pgmap->ops->page_free(page);
>  }
> -EXPORT_SYMBOL(__put_devmap_managed_page);
> +EXPORT_SYMBOL(free_devmap_managed_page);

This patch does not have a module consumer for
free_devmap_managed_page(), so the export should move to the patch
that needs the new export.

Also the only reason that put_devmap_managed_page() is EXPORT_SYMBOL
instead of EXPORT_SYMBOL_GPL is that there was no practical way to
hide the devmap details from evey module in the kernel that did
put_page(). I would expect free_devmap_managed_page() to
EXPORT_SYMBOL_GPL if it is not inlined into an existing exported
static inline api.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  2019-12-19  5:27   ` [PATCH v11 04/25] " Dan Williams
@ 2019-12-19  5:48     ` John Hubbard
  2019-12-19  6:52       ` Dan Williams
  0 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-19  5:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Daniel Vetter,
	Dave Chinner, David Airlie, David S . Miller, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, Maling list - DRI developers,
	KVM list, linux-block, Linux Doc Mailing List, linux-fsdevel,
	linux-kselftest, Linux-media, linux-rdma, linuxppc-dev, Netdev,
	Linux MM, LKML, Christoph Hellwig

On 12/18/19 9:27 PM, Dan Williams wrote:
...
>> @@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
>>          page->mapping = NULL;
>>          page->pgmap->ops->page_free(page);
>>   }
>> -EXPORT_SYMBOL(__put_devmap_managed_page);
>> +EXPORT_SYMBOL(free_devmap_managed_page);
> 
> This patch does not have a module consumer for
> free_devmap_managed_page(), so the export should move to the patch
> that needs the new export.

Hi Dan,

OK, I know that's a policy--although it seems quite pointless here given
that this is definitely going to need an EXPORT.

At the moment, the series doesn't use it in any module at all, so I'll just
delete the EXPORT for now.

> 
> Also the only reason that put_devmap_managed_page() is EXPORT_SYMBOL
> instead of EXPORT_SYMBOL_GPL is that there was no practical way to
> hide the devmap details from evey module in the kernel that did
> put_page(). I would expect free_devmap_managed_page() to
> EXPORT_SYMBOL_GPL if it is not inlined into an existing exported
> static inline api.
> 

Sure, I'll change it to EXPORT_SYMBOL_GPL when the time comes. We do have
to be careful that we don't shut out normal put_page() types of callers,
but...glancing through the current callers, that doesn't look to be a problem.
Good. So it should be OK to do EXPORT_SYMBOL_GPL here.

Are you *sure* you don't want to just pre-emptively EXPORT now, and save
looking at it again?

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  2019-12-19  5:48     ` John Hubbard
@ 2019-12-19  6:52       ` Dan Williams
  2019-12-19  7:33         ` John Hubbard
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Williams @ 2019-12-19  6:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Daniel Vetter,
	Dave Chinner, David Airlie, David S . Miller, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, Maling list - DRI developers,
	KVM list, linux-block, Linux Doc Mailing List, linux-fsdevel,
	linux-kselftest, Linux-media, linux-rdma, linuxppc-dev, Netdev,
	Linux MM, LKML, Christoph Hellwig

On Wed, Dec 18, 2019 at 9:51 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/18/19 9:27 PM, Dan Williams wrote:
> ...
> >> @@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
> >>          page->mapping = NULL;
> >>          page->pgmap->ops->page_free(page);
> >>   }
> >> -EXPORT_SYMBOL(__put_devmap_managed_page);
> >> +EXPORT_SYMBOL(free_devmap_managed_page);
> >
> > This patch does not have a module consumer for
> > free_devmap_managed_page(), so the export should move to the patch
> > that needs the new export.
>
> Hi Dan,
>
> OK, I know that's a policy--although it seems quite pointless here given
> that this is definitely going to need an EXPORT.
>
> At the moment, the series doesn't use it in any module at all, so I'll just
> delete the EXPORT for now.
>
> >
> > Also the only reason that put_devmap_managed_page() is EXPORT_SYMBOL
> > instead of EXPORT_SYMBOL_GPL is that there was no practical way to
> > hide the devmap details from evey module in the kernel that did
> > put_page(). I would expect free_devmap_managed_page() to
> > EXPORT_SYMBOL_GPL if it is not inlined into an existing exported
> > static inline api.
> >
>
> Sure, I'll change it to EXPORT_SYMBOL_GPL when the time comes. We do have
> to be careful that we don't shut out normal put_page() types of callers,
> but...glancing through the current callers, that doesn't look to be a problem.
> Good. So it should be OK to do EXPORT_SYMBOL_GPL here.
>
> Are you *sure* you don't want to just pre-emptively EXPORT now, and save
> looking at it again?

I'm positive. There is enough history for "trust me the consumer is
coming" turning out not to be true to justify the hassle in my mind. I
do trust you, but things happen.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages
  2019-12-19  6:52       ` Dan Williams
@ 2019-12-19  7:33         ` John Hubbard
  0 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-19  7:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Daniel Vetter,
	Dave Chinner, David Airlie, David S . Miller, Ira Weiny,
	Jan Kara, Jason Gunthorpe, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, Maling list - DRI developers,
	KVM list, linux-block, Linux Doc Mailing List, linux-fsdevel,
	linux-kselftest, Linux-media, linux-rdma, linuxppc-dev, Netdev,
	Linux MM, LKML, Christoph Hellwig

On 12/18/19 10:52 PM, Dan Williams wrote:
> On Wed, Dec 18, 2019 at 9:51 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 12/18/19 9:27 PM, Dan Williams wrote:
>> ...
>>>> @@ -461,5 +449,5 @@ void __put_devmap_managed_page(struct page *page)
>>>>           page->mapping = NULL;
>>>>           page->pgmap->ops->page_free(page);
>>>>    }
>>>> -EXPORT_SYMBOL(__put_devmap_managed_page);
>>>> +EXPORT_SYMBOL(free_devmap_managed_page);
>>>
>>> This patch does not have a module consumer for
>>> free_devmap_managed_page(), so the export should move to the patch
>>> that needs the new export.
>>
>> Hi Dan,
>>
>> OK, I know that's a policy--although it seems quite pointless here given
>> that this is definitely going to need an EXPORT.
>>
>> At the moment, the series doesn't use it in any module at all, so I'll just
>> delete the EXPORT for now.
>>
>>>
>>> Also the only reason that put_devmap_managed_page() is EXPORT_SYMBOL
>>> instead of EXPORT_SYMBOL_GPL is that there was no practical way to
>>> hide the devmap details from evey module in the kernel that did
>>> put_page(). I would expect free_devmap_managed_page() to
>>> EXPORT_SYMBOL_GPL if it is not inlined into an existing exported
>>> static inline api.
>>>
>>
>> Sure, I'll change it to EXPORT_SYMBOL_GPL when the time comes. We do have
>> to be careful that we don't shut out normal put_page() types of callers,
>> but...glancing through the current callers, that doesn't look to be a problem.
>> Good. So it should be OK to do EXPORT_SYMBOL_GPL here.
>>
>> Are you *sure* you don't want to just pre-emptively EXPORT now, and save
>> looking at it again?
> 
> I'm positive. There is enough history for "trust me the consumer is
> coming" turning out not to be true to justify the hassle in my mind. I
> do trust you, but things happen.
> 

OK, it's deleted locally. Thanks for looking at the patch. I'll post a v12 series
that includes the change, once it looks like reviews are slowing down.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
                   ` (25 preceding siblings ...)
  2019-12-17  7:39 ` [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN Jan Kara
@ 2019-12-19 13:26 ` Leon Romanovsky
  2019-12-19 20:30   ` John Hubbard
  26 siblings, 1 reply; 67+ messages in thread
From: Leon Romanovsky @ 2019-12-19 13:26 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> Hi,
>
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> extends that tracking to a few select subsystems. More subsystems will
> be added in follow up work.

Hi John,

The patchset generates kernel panics in our IB testing. In our tests, we
allocated single memory block and registered multiple MRs using the single
block.

The possible bad flow is:
 ib_umem_geti() ->
  pin_user_pages_fast(FOLL_WRITE) ->
   internal_get_user_pages_fast(FOLL_WRITE) ->
    gup_pgd_range() ->
     gup_huge_pd() ->
      gup_hugepte() ->
       try_grab_compound_head() ->

 108 static __maybe_unused struct page *try_grab_compound_head(struct page *page,
 109                                                           int refs,
 110                                                           unsigned int flags)
 111 {
 112         if (flags & FOLL_GET)
 113                 return try_get_compound_head(page, refs);
 114         else if (flags & FOLL_PIN)
 115                 return try_pin_compound_head(page, refs);
 116
 117         WARN_ON_ONCE(1);
 118         return NULL;
 119 }

# (master) $ dmesg
[10924.722220] mlx5_core 0000:00:08.0 eth2: Link up
[10924.725383] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
[10960.902254] ------------[ cut here ]------------
[10960.905614] WARNING: CPU: 3 PID: 8838 at mm/gup.c:61 try_grab_compound_head+0x92/0xd0
[10960.907313] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last unloaded: mlxkvl]
[10960.917806] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G           OE     5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
[10960.920530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[10960.923024] RIP: 0010:try_grab_compound_head+0x92/0xd0
[10960.924329] Code: e4 8d 14 06 48 8d 4f 34 f0 0f b1 57 34 0f 94 c2 84 d2 75 cb 85 c0 74 cd 8d 14 06 f0 0f b1 11 0f 94 c2 84 d2 75 b9 66 90 eb ea <0f> 0b 31 ff eb b7 85 c0 66 0f 1f 44 00 00 74 ab 8d 14 06 f0 0f b1
[10960.928512] RSP: 0018:ffffc9000129f880 EFLAGS: 00010082
[10960.929831] RAX: 0000000080000001 RBX: 00007f6397446000 RCX: 000fffffffe00000
[10960.931422] RDX: 0000000000040000 RSI: 0000000000011800 RDI: ffffea000f5d8000
[10960.933005] RBP: ffffc9000129f93c R08: ffffc9000129f93c R09: 0000000000200000
[10960.934584] R10: ffff88840774b200 R11: ffff888000000230 R12: 00007f6397446000
[10960.936212] R13: 0000000000000046 R14: 80000003d76000e7 R15: 0000000000000080
[10960.937793] FS:  00007f63a0590740(0000) GS:ffff88842f980000(0000) knlGS:0000000000000000
[10960.939962] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10960.941367] CR2: 00000000023e9008 CR3: 0000000406d0a002 CR4: 00000000007606e0
[10960.942975] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10960.944654] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[10960.946394] PKRU: 55555554
[10960.947310] Call Trace:
[10960.948193]  gup_pgd_range+0x61e/0x950
[10960.949585]  internal_get_user_pages_fast+0x98/0x1c0
[10960.951313]  ib_umem_get+0x2b3/0x5a0 [ib_uverbs]
[10960.952929]  mr_umem_get+0xd8/0x280 [mlx5_ib]
[10960.954150]  ? xas_store+0x49/0x550
[10960.955187]  mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib]
[10960.956478]  ? xas_load+0x9/0x80
[10960.957474]  ? xa_load+0x54/0x90
[10960.958465]  ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs]
[10960.959926]  ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs]
[10960.961192]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 [ib_uverbs]
[10960.963208]  ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs]
[10960.964603]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[10960.965949]  ? mem_cgroup_commit_charge+0x6a/0x140
[10960.967177]  ? page_add_new_anon_rmap+0x58/0xc0
[10960.968360]  ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs]
[10960.969595]  do_vfs_ioctl+0xa6/0x640
[10960.970631]  ? syscall_trace_enter+0x1f8/0x2e0
[10960.971829]  ksys_ioctl+0x60/0x90
[10960.972825]  __x64_sys_ioctl+0x16/0x20
[10960.973888]  do_syscall_64+0x48/0x130
[10960.974949]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10960.976219] RIP: 0033:0x7f639fe9b267
[10960.977260] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[10960.981413] RSP: 002b:00007fff5335ca08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[10960.983472] RAX: ffffffffffffffda RBX: 00007fff5335ca98 RCX: 00007f639fe9b267
[10960.985037] RDX: 00007fff5335ca80 RSI: 00000000c0181b01 RDI: 0000000000000003
[10960.986603] RBP: 00007fff5335ca60 R08: 0000000000000003 R09: 00007f63a055e010
[10960.988194] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f63a055e150
[10960.989903] R13: 00007fff5335ca60 R14: 00007fff5335cc38 R15: 00007f6397246000
[10960.991544] ---[ end trace 1f0ee07a75a16a93 ]---
[10960.992773] ------------[ cut here ]------------
[10960.993995] WARNING: CPU: 3 PID: 8838 at mm/gup.c:150 try_grab_page+0x55/0x70
[10960.995758] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last unloaded: mlxkvl]
[10961.008579] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G        W  OE     5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
[10961.011416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[10961.013766] RIP: 0010:try_grab_page+0x55/0x70
[10961.014921] Code: 00 04 00 00 b8 01 00 00 00 f3 c3 48 8b 47 08 a8 01 75 1c 8b 47 34 85 c0 7e 1d f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb cb <0f> 0b 31 c0 c3 48 8d 78 ff 66 90 eb dc 0f 0b 31 c0 c3 66 0f 1f 84
[10961.019058] RSP: 0018:ffffc9000129f7e8 EFLAGS: 00010282
[10961.020351] RAX: 0000000080000001 RBX: 0000000000050201 RCX: 000000000f5d8000
[10961.021921] RDX: 000ffffffffff000 RSI: 0000000000040000 RDI: ffffea000f5d8000
[10961.023494] RBP: 00007f6397400000 R08: ffffea000f986cc0 R09: ffff8883c758bdd0
[10961.025067] R10: 0000000000000001 R11: ffff888000000230 R12: ffff888407701c00
[10961.026637] R13: ffff8883e61b35d0 R14: ffffea000f5d8000 R15: 0000000000050201
[10961.028217] FS:  00007f63a0590740(0000) GS:ffff88842f980000(0000) knlGS:0000000000000000
[10961.030353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10961.031721] CR2: 00000000023e9008 CR3: 0000000406d0a002 CR4: 00000000007606e0
[10961.033305] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10961.034884] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[10961.036456] PKRU: 55555554
[10961.037369] Call Trace:
[10961.038285]  follow_trans_huge_pmd+0x10c/0x300
[10961.039555]  follow_page_mask+0x64a/0x760
[10961.040762]  __get_user_pages+0xf1/0x720
[10961.041851]  ? apic_timer_interrupt+0xa/0x20
[10961.042996]  internal_get_user_pages_fast+0x14b/0x1c0
[10961.044266]  ib_umem_get+0x2b3/0x5a0 [ib_uverbs]
[10961.045474]  mr_umem_get+0xd8/0x280 [mlx5_ib]
[10961.046652]  ? xas_store+0x49/0x550
[10961.047696]  mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib]
[10961.048967]  ? xas_load+0x9/0x80
[10961.049949]  ? xa_load+0x54/0x90
[10961.050935]  ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs]
[10961.052378]  ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs]
[10961.053635]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 [ib_uverbs]
[10961.055646]  ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs]
[10961.057033]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[10961.058381]  ? mem_cgroup_commit_charge+0x6a/0x140
[10961.059611]  ? page_add_new_anon_rmap+0x58/0xc0
[10961.060796]  ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs]
[10961.062034]  do_vfs_ioctl+0xa6/0x640
[10961.063081]  ? syscall_trace_enter+0x1f8/0x2e0
[10961.064253]  ksys_ioctl+0x60/0x90
[10961.065252]  __x64_sys_ioctl+0x16/0x20
[10961.066315]  do_syscall_64+0x48/0x130
[10961.067382]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10961.068647] RIP: 0033:0x7f639fe9b267
[10961.069691] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[10961.073882] RSP: 002b:00007fff5335ca08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[10961.075949] RAX: ffffffffffffffda RBX: 00007fff5335ca98 RCX: 00007f639fe9b267
[10961.077545] RDX: 00007fff5335ca80 RSI: 00000000c0181b01 RDI: 0000000000000003
[10961.079128] RBP: 00007fff5335ca60 R08: 0000000000000003 R09: 00007f63a055e010
[10961.080709] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f63a055e150
[10961.082278] R13: 00007fff5335ca60 R14: 00007fff5335cc38 R15: 00007f6397246000
[10961.083873] ---[ end trace 1f0ee07a75a16a94 ]---

Thanks

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-19 13:26 ` Leon Romanovsky
@ 2019-12-19 20:30   ` John Hubbard
  2019-12-19 21:07     ` Jason Gunthorpe
  2019-12-20  9:21     ` Jan Kara
  0 siblings, 2 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-19 20:30 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
>> Hi,
>>
>> This implements an API naming change (put_user_page*() -->
>> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
>> extends that tracking to a few select subsystems. More subsystems will
>> be added in follow up work.
> 
> Hi John,
> 
> The patchset generates kernel panics in our IB testing. In our tests, we
> allocated single memory block and registered multiple MRs using the single
> block.
> 
> The possible bad flow is:
>   ib_umem_geti() ->
>    pin_user_pages_fast(FOLL_WRITE) ->
>     internal_get_user_pages_fast(FOLL_WRITE) ->
>      gup_pgd_range() ->
>       gup_huge_pd() ->
>        gup_hugepte() ->
>         try_grab_compound_head() ->

Hi Leon,

Thanks very much for the detailed report! So we're overflowing...

At first look, this seems likely to be hitting a weak point in the
GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
(there's a writeup in Documentation/core-api/pin_user_page.rst, lines
99-121). Basically it's pretty easy to overflow the page->_refcount
with huge pages if the pages have a *lot* of subpages.

We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
Do you have any idea how many pins (repeated pins on the same page, which
it sounds like you have) might be involved in your test case,
and the huge page and system page sizes? That would allow calculating
if we're likely overflowing for that reason.

So, ideas and next steps:

1. Assuming that you *are* hitting this, I think I may have to fall back to
implementing the "deferred" part of this design, as part of this series, after
all. That means:

   For the pin/unpin calls at least, stop treating all pages as if they are
   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
   That's not how it works now, and the need to hand back a huge array of
   subpages is part of the problem. This affects the callers too, so it's not
   a super quick change to make. (I was really hoping not to have to do this
   yet.)

2. OR, maybe if you're *close* the the overflow, I could buy some development
time by moving the boundary between pinned vs get_page() refcounts, by
reducing GUP_PIN_COUNTING_BIAS. That's less robust, but I don't want
to rule it out just yet. After all, 1024 is a big chunk to count up with,
and even if get_page() calls add up to, say, 512 refs on a page, it's still
just a false positive on page_dma_pinned(). And false positives, if transient,
are OK.

3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
test setup, but I have done only the tiniest bit of user space IB coding, so
if you have any test programs that aren't too hard to deal with that could
possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
that I'm not an advanced IB programmer. At all. :)

4. (minor note to self) This also uncovers a minor weakness in diagnostics:
there's no page dump in these reports, because I chickened out and didn't
include my WARN_ONCE_PAGE() macro that would have provided it. Although,
even without it, it's obvious that this is a page overflow.


thanks,
-- 
John Hubbard
NVIDIA


> 
>   108 static __maybe_unused struct page *try_grab_compound_head(struct page *page,
>   109                                                           int refs,
>   110                                                           unsigned int flags)
>   111 {
>   112         if (flags & FOLL_GET)
>   113                 return try_get_compound_head(page, refs);
>   114         else if (flags & FOLL_PIN)
>   115                 return try_pin_compound_head(page, refs);
>   116
>   117         WARN_ON_ONCE(1);
>   118         return NULL;
>   119 }
> 
> # (master) $ dmesg
> [10924.722220] mlx5_core 0000:00:08.0 eth2: Link up
> [10924.725383] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
> [10960.902254] ------------[ cut here ]------------
> [10960.905614] WARNING: CPU: 3 PID: 8838 at mm/gup.c:61 try_grab_compound_head+0x92/0xd0
> [10960.907313] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last unloaded: mlxkvl]
> [10960.917806] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G           OE     5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
> [10960.920530] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> [10960.923024] RIP: 0010:try_grab_compound_head+0x92/0xd0
> [10960.924329] Code: e4 8d 14 06 48 8d 4f 34 f0 0f b1 57 34 0f 94 c2 84 d2 75 cb 85 c0 74 cd 8d 14 06 f0 0f b1 11 0f 94 c2 84 d2 75 b9 66 90 eb ea <0f> 0b 31 ff eb b7 85 c0 66 0f 1f 44 00 00 74 ab 8d 14 06 f0 0f b1
> [10960.928512] RSP: 0018:ffffc9000129f880 EFLAGS: 00010082
> [10960.929831] RAX: 0000000080000001 RBX: 00007f6397446000 RCX: 000fffffffe00000
> [10960.931422] RDX: 0000000000040000 RSI: 0000000000011800 RDI: ffffea000f5d8000
> [10960.933005] RBP: ffffc9000129f93c R08: ffffc9000129f93c R09: 0000000000200000
> [10960.934584] R10: ffff88840774b200 R11: ffff888000000230 R12: 00007f6397446000
> [10960.936212] R13: 0000000000000046 R14: 80000003d76000e7 R15: 0000000000000080
> [10960.937793] FS:  00007f63a0590740(0000) GS:ffff88842f980000(0000) knlGS:0000000000000000
> [10960.939962] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [10960.941367] CR2: 00000000023e9008 CR3: 0000000406d0a002 CR4: 00000000007606e0
> [10960.942975] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [10960.944654] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [10960.946394] PKRU: 55555554
> [10960.947310] Call Trace:
> [10960.948193]  gup_pgd_range+0x61e/0x950
> [10960.949585]  internal_get_user_pages_fast+0x98/0x1c0
> [10960.951313]  ib_umem_get+0x2b3/0x5a0 [ib_uverbs]
> [10960.952929]  mr_umem_get+0xd8/0x280 [mlx5_ib]
> [10960.954150]  ? xas_store+0x49/0x550
> [10960.955187]  mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib]
> [10960.956478]  ? xas_load+0x9/0x80
> [10960.957474]  ? xa_load+0x54/0x90
> [10960.958465]  ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs]
> [10960.959926]  ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs]
> [10960.961192]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 [ib_uverbs]
> [10960.963208]  ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs]
> [10960.964603]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
> [10960.965949]  ? mem_cgroup_commit_charge+0x6a/0x140
> [10960.967177]  ? page_add_new_anon_rmap+0x58/0xc0
> [10960.968360]  ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs]
> [10960.969595]  do_vfs_ioctl+0xa6/0x640
> [10960.970631]  ? syscall_trace_enter+0x1f8/0x2e0
> [10960.971829]  ksys_ioctl+0x60/0x90
> [10960.972825]  __x64_sys_ioctl+0x16/0x20
> [10960.973888]  do_syscall_64+0x48/0x130
> [10960.974949]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [10960.976219] RIP: 0033:0x7f639fe9b267
> [10960.977260] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
> [10960.981413] RSP: 002b:00007fff5335ca08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [10960.983472] RAX: ffffffffffffffda RBX: 00007fff5335ca98 RCX: 00007f639fe9b267
> [10960.985037] RDX: 00007fff5335ca80 RSI: 00000000c0181b01 RDI: 0000000000000003
> [10960.986603] RBP: 00007fff5335ca60 R08: 0000000000000003 R09: 00007f63a055e010
> [10960.988194] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f63a055e150
> [10960.989903] R13: 00007fff5335ca60 R14: 00007fff5335cc38 R15: 00007f6397246000
> [10960.991544] ---[ end trace 1f0ee07a75a16a93 ]---
> [10960.992773] ------------[ cut here ]------------
> [10960.993995] WARNING: CPU: 3 PID: 8838 at mm/gup.c:150 try_grab_page+0x55/0x70
> [10960.995758] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp rpcrdma rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core kvm_intel mlx5_core rfkill mlxfw sunrpc virtio_net pci_hyperv_intf kvm irqbypass net_failover crc32_pclmul i2c_piix4 ptp crc32c_intel failover pcspkr ghash_clmulni_intel i2c_core pps_core sch_fq_codel ip_tables ata_generic pata_acpi serio_raw ata_piix floppy [last unloaded: mlxkvl]
> [10961.008579] CPU: 3 PID: 8838 Comm: consume_mtts Tainted: G        W  OE     5.5.0-rc2-for-upstream-perf-2019-12-18_10-06-50-78 #1
> [10961.011416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> [10961.013766] RIP: 0010:try_grab_page+0x55/0x70
> [10961.014921] Code: 00 04 00 00 b8 01 00 00 00 f3 c3 48 8b 47 08 a8 01 75 1c 8b 47 34 85 c0 7e 1d f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb cb <0f> 0b 31 c0 c3 48 8d 78 ff 66 90 eb dc 0f 0b 31 c0 c3 66 0f 1f 84
> [10961.019058] RSP: 0018:ffffc9000129f7e8 EFLAGS: 00010282
> [10961.020351] RAX: 0000000080000001 RBX: 0000000000050201 RCX: 000000000f5d8000
> [10961.021921] RDX: 000ffffffffff000 RSI: 0000000000040000 RDI: ffffea000f5d8000
> [10961.023494] RBP: 00007f6397400000 R08: ffffea000f986cc0 R09: ffff8883c758bdd0
> [10961.025067] R10: 0000000000000001 R11: ffff888000000230 R12: ffff888407701c00
> [10961.026637] R13: ffff8883e61b35d0 R14: ffffea000f5d8000 R15: 0000000000050201
> [10961.028217] FS:  00007f63a0590740(0000) GS:ffff88842f980000(0000) knlGS:0000000000000000
> [10961.030353] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [10961.031721] CR2: 00000000023e9008 CR3: 0000000406d0a002 CR4: 00000000007606e0
> [10961.033305] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [10961.034884] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [10961.036456] PKRU: 55555554
> [10961.037369] Call Trace:
> [10961.038285]  follow_trans_huge_pmd+0x10c/0x300
> [10961.039555]  follow_page_mask+0x64a/0x760
> [10961.040762]  __get_user_pages+0xf1/0x720
> [10961.041851]  ? apic_timer_interrupt+0xa/0x20
> [10961.042996]  internal_get_user_pages_fast+0x14b/0x1c0
> [10961.044266]  ib_umem_get+0x2b3/0x5a0 [ib_uverbs]
> [10961.045474]  mr_umem_get+0xd8/0x280 [mlx5_ib]
> [10961.046652]  ? xas_store+0x49/0x550
> [10961.047696]  mlx5_ib_reg_user_mr+0x149/0x7a0 [mlx5_ib]
> [10961.048967]  ? xas_load+0x9/0x80
> [10961.049949]  ? xa_load+0x54/0x90
> [10961.050935]  ? lookup_get_idr_uobject.part.10+0x12/0x80 [ib_uverbs]
> [10961.052378]  ib_uverbs_reg_mr+0x138/0x2a0 [ib_uverbs]
> [10961.053635]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xb1/0xf0 [ib_uverbs]
> [10961.055646]  ib_uverbs_cmd_verbs.isra.8+0x997/0xb30 [ib_uverbs]
> [10961.057033]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
> [10961.058381]  ? mem_cgroup_commit_charge+0x6a/0x140
> [10961.059611]  ? page_add_new_anon_rmap+0x58/0xc0
> [10961.060796]  ib_uverbs_ioctl+0xbc/0x130 [ib_uverbs]
> [10961.062034]  do_vfs_ioctl+0xa6/0x640
> [10961.063081]  ? syscall_trace_enter+0x1f8/0x2e0
> [10961.064253]  ksys_ioctl+0x60/0x90
> [10961.065252]  __x64_sys_ioctl+0x16/0x20
> [10961.066315]  do_syscall_64+0x48/0x130
> [10961.067382]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [10961.068647] RIP: 0033:0x7f639fe9b267
> [10961.069691] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
> [10961.073882] RSP: 002b:00007fff5335ca08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [10961.075949] RAX: ffffffffffffffda RBX: 00007fff5335ca98 RCX: 00007f639fe9b267
> [10961.077545] RDX: 00007fff5335ca80 RSI: 00000000c0181b01 RDI: 0000000000000003
> [10961.079128] RBP: 00007fff5335ca60 R08: 0000000000000003 R09: 00007f63a055e010
> [10961.080709] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f63a055e150
> [10961.082278] R13: 00007fff5335ca60 R14: 00007fff5335cc38 R15: 00007f6397246000
> [10961.083873] ---[ end trace 1f0ee07a75a16a94 ]---
> 
> Thanks
> 

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-19 20:30   ` John Hubbard
@ 2019-12-19 21:07     ` Jason Gunthorpe
  2019-12-19 21:13       ` John Hubbard
                         ` (2 more replies)
  2019-12-20  9:21     ` Jan Kara
  1 sibling, 3 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2019-12-19 21:07 UTC (permalink / raw)
  To: John Hubbard
  Cc: Leon Romanovsky, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > Hi,
> > > 
> > > This implements an API naming change (put_user_page*() -->
> > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > extends that tracking to a few select subsystems. More subsystems will
> > > be added in follow up work.
> > 
> > Hi John,
> > 
> > The patchset generates kernel panics in our IB testing. In our tests, we
> > allocated single memory block and registered multiple MRs using the single
> > block.
> > 
> > The possible bad flow is:
> >   ib_umem_geti() ->
> >    pin_user_pages_fast(FOLL_WRITE) ->
> >     internal_get_user_pages_fast(FOLL_WRITE) ->
> >      gup_pgd_range() ->
> >       gup_huge_pd() ->
> >        gup_hugepte() ->
> >         try_grab_compound_head() ->
> 
> Hi Leon,
> 
> Thanks very much for the detailed report! So we're overflowing...
> 
> At first look, this seems likely to be hitting a weak point in the
> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> 99-121). Basically it's pretty easy to overflow the page->_refcount
> with huge pages if the pages have a *lot* of subpages.
> 
> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.

Considering that establishing these pins is entirely under user
control, we can't have a limit here.

If the number of allowed pins are exhausted then the
pin_user_pages_fast() must fail back to the user.

> 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
> test setup, but I have done only the tiniest bit of user space IB coding, so
> if you have any test programs that aren't too hard to deal with that could
> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> that I'm not an advanced IB programmer. At all. :)

Clone this:

https://github.com/linux-rdma/rdma-core.git

Install all the required deps to build it (notably cython), see the README.md

$ ./build.sh
$ build/bin/run_tests.py 

If you get things that far I think Leon can get a reproduction for you

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-19 21:07     ` Jason Gunthorpe
@ 2019-12-19 21:13       ` John Hubbard
  2019-12-20 13:34         ` Jason Gunthorpe
  2019-12-19 22:58       ` John Hubbard
  2019-12-20 18:29       ` Leon Romanovsky
  2 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-19 21:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
>> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
>>> On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
>>>> Hi,
>>>>
>>>> This implements an API naming change (put_user_page*() -->
>>>> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
>>>> extends that tracking to a few select subsystems. More subsystems will
>>>> be added in follow up work.
>>>
>>> Hi John,
>>>
>>> The patchset generates kernel panics in our IB testing. In our tests, we
>>> allocated single memory block and registered multiple MRs using the single
>>> block.
>>>
>>> The possible bad flow is:
>>>    ib_umem_geti() ->
>>>     pin_user_pages_fast(FOLL_WRITE) ->
>>>      internal_get_user_pages_fast(FOLL_WRITE) ->
>>>       gup_pgd_range() ->
>>>        gup_huge_pd() ->
>>>         gup_hugepte() ->
>>>          try_grab_compound_head() ->
>>
>> Hi Leon,
>>
>> Thanks very much for the detailed report! So we're overflowing...
>>
>> At first look, this seems likely to be hitting a weak point in the
>> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
>> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
>> 99-121). Basically it's pretty easy to overflow the page->_refcount
>> with huge pages if the pages have a *lot* of subpages.
>>
>> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> 
> Considering that establishing these pins is entirely under user
> control, we can't have a limit here.

There's already a limit, it's just a much larger one. :) What does "no limit"
really mean, numerically, to you in this case?

> 
> If the number of allowed pins are exhausted then the
> pin_user_pages_fast() must fail back to the user.


I'll poke around the IB call stack and see how much of that return path
is in place, if any. Because it's the same situation for get_user_pages_fast().
This code just added a warning on overflow so we could spot it early.

> 
>> 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
>> test setup, but I have done only the tiniest bit of user space IB coding, so
>> if you have any test programs that aren't too hard to deal with that could
>> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
>> that I'm not an advanced IB programmer. At all. :)
> 
> Clone this:
> 
> https://github.com/linux-rdma/rdma-core.git
> 
> Install all the required deps to build it (notably cython), see the README.md
> 
> $ ./build.sh
> $ build/bin/run_tests.py
> 
> If you get things that far I think Leon can get a reproduction for you
> 

OK, here goes.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-19 21:07     ` Jason Gunthorpe
  2019-12-19 21:13       ` John Hubbard
@ 2019-12-19 22:58       ` John Hubbard
  2019-12-20 18:48         ` Leon Romanovsky
  2019-12-20 18:29       ` Leon Romanovsky
  2 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-19 22:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
...
>> 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
>> test setup, but I have done only the tiniest bit of user space IB coding, so
>> if you have any test programs that aren't too hard to deal with that could
>> possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
>> that I'm not an advanced IB programmer. At all. :)
> 
> Clone this:
> 
> https://github.com/linux-rdma/rdma-core.git
> 
> Install all the required deps to build it (notably cython), see the README.md
> 
> $ ./build.sh
> $ build/bin/run_tests.py
> 
> If you get things that far I think Leon can get a reproduction for you
> 

Cool, it's up and running (1 failure, 3 skipped, out of 67 tests).

This is a great test suite to have running, I'll add it to my scripts. Here's the
full output in case the failure or skip cases are a problem:

$ sudo ./build/bin/run_tests.py --verbose

test_create_ah (tests.test_addr.AHTest) ... ok
test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE tests on IB link layer"
test_destroy_ah (tests.test_addr.AHTest) ... ok
test_create_comp_channel (tests.test_cq.CCTest) ... ok
test_destroy_comp_channel (tests.test_cq.CCTest) ... ok
test_create_cq_ex (tests.test_cq.CQEXTest) ... ok
test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok
test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok
test_create_cq (tests.test_cq.CQTest) ... ok
test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok
test_destroy_cq (tests.test_cq.CQTest) ... ok
test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
test_create_dm (tests.test_device.DMTest) ... ok
test_create_dm_bad_flow (tests.test_device.DMTest) ... ok
test_destroy_dm (tests.test_device.DMTest) ... ok
test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok
test_dm_read (tests.test_device.DMTest) ... ok
test_dm_write (tests.test_device.DMTest) ... ok
test_dm_write_bad_flow (tests.test_device.DMTest) ... ok
test_dev_list (tests.test_device.DeviceTest) ... ok
test_open_dev (tests.test_device.DeviceTest) ... ok
test_query_device (tests.test_device.DeviceTest) ... ok
test_query_device_ex (tests.test_device.DeviceTest) ... ok
test_query_gid (tests.test_device.DeviceTest) ... ok
test_query_port (tests.test_device.DeviceTest) ... FAIL
test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok
test_create_dm_mr (tests.test_mr.DMMRTest) ... ok
test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok
test_buffer (tests.test_mr.MRTest) ... ok
test_dereg_mr (tests.test_mr.MRTest) ... ok
test_dereg_mr_twice (tests.test_mr.MRTest) ... ok
test_lkey (tests.test_mr.MRTest) ... ok
test_read (tests.test_mr.MRTest) ... ok
test_reg_mr (tests.test_mr.MRTest) ... ok
test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok
test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok
test_rkey (tests.test_mr.MRTest) ... ok
test_write (tests.test_mr.MRTest) ... ok
test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok
test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok
test_reg_mw_type1 (tests.test_mr.MWTest) ... ok
test_reg_mw_type2 (tests.test_mr.MWTest) ... ok
test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok
test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok
test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not supported - ODP recv not supported'
test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok
test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
test_alloc_pd (tests.test_pd.PDTest) ... ok
test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok
test_dealloc_pd (tests.test_pd.PDTest) ... ok
test_destroy_pd_twice (tests.test_pd.PDTest) ... ok
test_multiple_pd_creation (tests.test_pd.PDTest) ... ok
test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok
test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok
test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok
test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok
test_create_qp_no_attr (tests.test_qp.QPTest) ... ok
test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok
test_create_qp_with_attr (tests.test_qp.QPTest) ... ok
test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok
test_modify_qp (tests.test_qp.QPTest) ... ok
test_query_qp (tests.test_qp.QPTest) ... ok
test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices with net interface'

======================================================================
FAIL: test_query_port (tests.test_device.DeviceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port
     self.verify_port_attr(port_attr)
   File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr
     assert 'Invalid' not in d.speed_to_str(attr.active_speed)
AssertionError

----------------------------------------------------------------------
Ran 67 tests in 10.058s

FAILED (failures=1, skipped=3)


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-19 20:30   ` John Hubbard
  2019-12-19 21:07     ` Jason Gunthorpe
@ 2019-12-20  9:21     ` Jan Kara
  2019-12-21  0:02       ` John Hubbard
  2019-12-21  0:33       ` Dan Williams
  1 sibling, 2 replies; 67+ messages in thread
From: Jan Kara @ 2019-12-20  9:21 UTC (permalink / raw)
  To: John Hubbard
  Cc: Leon Romanovsky, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jens Axboe, Jonathan Corbet, Jérôme Glisse,
	Magnus Karlsson, Mauro Carvalho Chehab, Michael Ellerman,
	Michal Hocko, Mike Kravetz, Paul Mackerras, Shuah Khan,
	Vlastimil Babka, bpf, dri-devel, kvm, linux-block, linux-doc,
	linux-fsdevel, linux-kselftest, linux-media, linux-rdma,
	linuxppc-dev, netdev, linux-mm, LKML, Maor Gottlieb

On Thu 19-12-19 12:30:31, John Hubbard wrote:
> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > Hi,
> > > 
> > > This implements an API naming change (put_user_page*() -->
> > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > extends that tracking to a few select subsystems. More subsystems will
> > > be added in follow up work.
> > 
> > Hi John,
> > 
> > The patchset generates kernel panics in our IB testing. In our tests, we
> > allocated single memory block and registered multiple MRs using the single
> > block.
> > 
> > The possible bad flow is:
> >   ib_umem_geti() ->
> >    pin_user_pages_fast(FOLL_WRITE) ->
> >     internal_get_user_pages_fast(FOLL_WRITE) ->
> >      gup_pgd_range() ->
> >       gup_huge_pd() ->
> >        gup_hugepte() ->
> >         try_grab_compound_head() ->
> 
> Hi Leon,
> 
> Thanks very much for the detailed report! So we're overflowing...
> 
> At first look, this seems likely to be hitting a weak point in the
> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> 99-121). Basically it's pretty easy to overflow the page->_refcount
> with huge pages if the pages have a *lot* of subpages.
> 
> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> Do you have any idea how many pins (repeated pins on the same page, which
> it sounds like you have) might be involved in your test case,
> and the huge page and system page sizes? That would allow calculating
> if we're likely overflowing for that reason.
> 
> So, ideas and next steps:
> 
> 1. Assuming that you *are* hitting this, I think I may have to fall back to
> implementing the "deferred" part of this design, as part of this series, after
> all. That means:
> 
>   For the pin/unpin calls at least, stop treating all pages as if they are
>   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
>   That's not how it works now, and the need to hand back a huge array of
>   subpages is part of the problem. This affects the callers too, so it's not
>   a super quick change to make. (I was really hoping not to have to do this
>   yet.)

Does that mean that you would need to make all GUP users huge page aware?
Otherwise I don't see how what you suggest would work... And I don't think
making all GUP users huge page aware is realistic (effort-wise) or even
wanted (maintenance overhead in all those places).

I believe there might be also a different solution for this: For
transparent huge pages, we could find a space in 'struct page' of the
second page in the huge page for proper pin counter and just account pins
there so we'd have full width of 32-bits for it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-19 21:13       ` John Hubbard
@ 2019-12-20 13:34         ` Jason Gunthorpe
  2019-12-21  0:32           ` Dan Williams
  0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2019-12-20 13:34 UTC (permalink / raw)
  To: John Hubbard
  Cc: Leon Romanovsky, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
> On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > > Hi,
> > > > > 
> > > > > This implements an API naming change (put_user_page*() -->
> > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > > extends that tracking to a few select subsystems. More subsystems will
> > > > > be added in follow up work.
> > > > 
> > > > Hi John,
> > > > 
> > > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > > allocated single memory block and registered multiple MRs using the single
> > > > block.
> > > > 
> > > > The possible bad flow is:
> > > >    ib_umem_geti() ->
> > > >     pin_user_pages_fast(FOLL_WRITE) ->
> > > >      internal_get_user_pages_fast(FOLL_WRITE) ->
> > > >       gup_pgd_range() ->
> > > >        gup_huge_pd() ->
> > > >         gup_hugepte() ->
> > > >          try_grab_compound_head() ->
> > > 
> > > Hi Leon,
> > > 
> > > Thanks very much for the detailed report! So we're overflowing...
> > > 
> > > At first look, this seems likely to be hitting a weak point in the
> > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> > > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > > with huge pages if the pages have a *lot* of subpages.
> > > 
> > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > 
> > Considering that establishing these pins is entirely under user
> > control, we can't have a limit here.
> 
> There's already a limit, it's just a much larger one. :) What does "no limit"
> really mean, numerically, to you in this case?

I guess I mean 'hidden limit' - hitting the limit and failing would
be managable.

I think 7 is probably too low though, but we are not using 1GB huge
pages, only 2M..

> > If the number of allowed pins are exhausted then the
> > pin_user_pages_fast() must fail back to the user.
> 
> I'll poke around the IB call stack and see how much of that return
> path is in place, if any. Because it's the same situation for
> get_user_pages_fast().  This code just added a warning on overflow
> so we could spot it early.

All GUP callers must be prepared for failure, IB should be fine...

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-19 21:07     ` Jason Gunthorpe
  2019-12-19 21:13       ` John Hubbard
  2019-12-19 22:58       ` John Hubbard
@ 2019-12-20 18:29       ` Leon Romanovsky
  2019-12-20 23:54         ` John Hubbard
  2 siblings, 1 reply; 67+ messages in thread
From: Leon Romanovsky @ 2019-12-20 18:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On Thu, Dec 19, 2019 at 05:07:43PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > Hi,
> > > >
> > > > This implements an API naming change (put_user_page*() -->
> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > extends that tracking to a few select subsystems. More subsystems will
> > > > be added in follow up work.
> > >
> > > Hi John,
> > >
> > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > allocated single memory block and registered multiple MRs using the single
> > > block.
> > >
> > > The possible bad flow is:
> > >   ib_umem_geti() ->
> > >    pin_user_pages_fast(FOLL_WRITE) ->
> > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > >      gup_pgd_range() ->
> > >       gup_huge_pd() ->
> > >        gup_hugepte() ->
> > >         try_grab_compound_head() ->
> >
> > Hi Leon,
> >
> > Thanks very much for the detailed report! So we're overflowing...
> >
> > At first look, this seems likely to be hitting a weak point in the
> > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > with huge pages if the pages have a *lot* of subpages.
> >
> > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
>
> Considering that establishing these pins is entirely under user
> control, we can't have a limit here.
>
> If the number of allowed pins are exhausted then the
> pin_user_pages_fast() must fail back to the user.
>
> > 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
> > test setup, but I have done only the tiniest bit of user space IB coding, so
> > if you have any test programs that aren't too hard to deal with that could
> > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> > that I'm not an advanced IB programmer. At all. :)
>
> Clone this:
>
> https://github.com/linux-rdma/rdma-core.git
>
> Install all the required deps to build it (notably cython), see the README.md
>
> $ ./build.sh
> $ build/bin/run_tests.py
>
> If you get things that far I think Leon can get a reproduction for you

I'm not so optimistic about that.

Thanks

>
> Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-19 22:58       ` John Hubbard
@ 2019-12-20 18:48         ` Leon Romanovsky
  2019-12-20 23:13           ` John Hubbard
  0 siblings, 1 reply; 67+ messages in thread
From: Leon Romanovsky @ 2019-12-20 18:48 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On Thu, Dec 19, 2019 at 02:58:43PM -0800, John Hubbard wrote:
> On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> ...
> > > 3. It would be nice if I could reproduce this. I have a two-node mlx5 Infiniband
> > > test setup, but I have done only the tiniest bit of user space IB coding, so
> > > if you have any test programs that aren't too hard to deal with that could
> > > possibly hit this, or be tweaked to hit it, I'd be grateful. Keeping in mind
> > > that I'm not an advanced IB programmer. At all. :)
> >
> > Clone this:
> >
> > https://github.com/linux-rdma/rdma-core.git
> >
> > Install all the required deps to build it (notably cython), see the README.md
> >
> > $ ./build.sh
> > $ build/bin/run_tests.py
> >
> > If you get things that far I think Leon can get a reproduction for you
> >
>
> Cool, it's up and running (1 failure, 3 skipped, out of 67 tests).
>
> This is a great test suite to have running, I'll add it to my scripts. Here's the
> full output in case the failure or skip cases are a problem:
>
> $ sudo ./build/bin/run_tests.py --verbose
>
> test_create_ah (tests.test_addr.AHTest) ... ok
> test_create_ah_roce (tests.test_addr.AHTest) ... skipped "Can't run RoCE tests on IB link layer"
> test_destroy_ah (tests.test_addr.AHTest) ... ok
> test_create_comp_channel (tests.test_cq.CCTest) ... ok
> test_destroy_comp_channel (tests.test_cq.CCTest) ... ok
> test_create_cq_ex (tests.test_cq.CQEXTest) ... ok
> test_create_cq_ex_bad_flow (tests.test_cq.CQEXTest) ... ok
> test_destroy_cq_ex (tests.test_cq.CQEXTest) ... ok
> test_create_cq (tests.test_cq.CQTest) ... ok
> test_create_cq_bad_flow (tests.test_cq.CQTest) ... ok
> test_destroy_cq (tests.test_cq.CQTest) ... ok
> test_rc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_ud_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_xrc_traffic_cq_ex (tests.test_cqex.CqExTestCase) ... ok
> test_create_dm (tests.test_device.DMTest) ... ok
> test_create_dm_bad_flow (tests.test_device.DMTest) ... ok
> test_destroy_dm (tests.test_device.DMTest) ... ok
> test_destroy_dm_bad_flow (tests.test_device.DMTest) ... ok
> test_dm_read (tests.test_device.DMTest) ... ok
> test_dm_write (tests.test_device.DMTest) ... ok
> test_dm_write_bad_flow (tests.test_device.DMTest) ... ok
> test_dev_list (tests.test_device.DeviceTest) ... ok
> test_open_dev (tests.test_device.DeviceTest) ... ok
> test_query_device (tests.test_device.DeviceTest) ... ok
> test_query_device_ex (tests.test_device.DeviceTest) ... ok
> test_query_gid (tests.test_device.DeviceTest) ... ok
> test_query_port (tests.test_device.DeviceTest) ... FAIL
> test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok
> test_create_dm_mr (tests.test_mr.DMMRTest) ... ok
> test_destroy_dm_mr (tests.test_mr.DMMRTest) ... ok
> test_buffer (tests.test_mr.MRTest) ... ok
> test_dereg_mr (tests.test_mr.MRTest) ... ok
> test_dereg_mr_twice (tests.test_mr.MRTest) ... ok
> test_lkey (tests.test_mr.MRTest) ... ok
> test_read (tests.test_mr.MRTest) ... ok
> test_reg_mr (tests.test_mr.MRTest) ... ok
> test_reg_mr_bad_flags (tests.test_mr.MRTest) ... ok
> test_reg_mr_bad_flow (tests.test_mr.MRTest) ... ok
> test_rkey (tests.test_mr.MRTest) ... ok
> test_write (tests.test_mr.MRTest) ... ok
> test_dereg_mw_type1 (tests.test_mr.MWTest) ... ok
> test_dereg_mw_type2 (tests.test_mr.MWTest) ... ok
> test_reg_mw_type1 (tests.test_mr.MWTest) ... ok
> test_reg_mw_type2 (tests.test_mr.MWTest) ... ok
> test_reg_mw_wrong_type (tests.test_mr.MWTest) ... ok
> test_odp_rc_traffic (tests.test_odp.OdpTestCase) ... ok
> test_odp_ud_traffic (tests.test_odp.OdpTestCase) ... skipped 'ODP is not supported - ODP recv not supported'
> test_odp_xrc_traffic (tests.test_odp.OdpTestCase) ... ok
> test_default_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_mem_align_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_without_allocators (tests.test_parent_domain.ParentDomainTestCase) ... ok
> test_alloc_pd (tests.test_pd.PDTest) ... ok
> test_create_pd_none_ctx (tests.test_pd.PDTest) ... ok
> test_dealloc_pd (tests.test_pd.PDTest) ... ok
> test_destroy_pd_twice (tests.test_pd.PDTest) ... ok
> test_multiple_pd_creation (tests.test_pd.PDTest) ... ok
> test_create_qp_ex_no_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_no_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_with_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_ex_with_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_no_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_no_attr_connected (tests.test_qp.QPTest) ... ok
> test_create_qp_with_attr (tests.test_qp.QPTest) ... ok
> test_create_qp_with_attr_connected (tests.test_qp.QPTest) ... ok
> test_modify_qp (tests.test_qp.QPTest) ... ok
> test_query_qp (tests.test_qp.QPTest) ... ok
> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices with net interface'
>
> ======================================================================
> FAIL: test_query_port (tests.test_device.DeviceTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port
>     self.verify_port_attr(port_attr)
>   File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr
>     assert 'Invalid' not in d.speed_to_str(attr.active_speed)
> AssertionError

I'm very curious how did you get this assert "d.speed_to_str" covers all
known speeds according to the IBTA.

Thanks

>
> ----------------------------------------------------------------------
> Ran 67 tests in 10.058s
>
> FAILED (failures=1, skipped=3)
>
>
> thanks,
> --
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-20 18:48         ` Leon Romanovsky
@ 2019-12-20 23:13           ` John Hubbard
  0 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-20 23:13 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On 12/20/19 10:48 AM, Leon Romanovsky wrote:
...
>> test_query_qp (tests.test_qp.QPTest) ... ok
>> test_rdmacm_sync_traffic (tests.test_rdmacm.CMTestCase) ... skipped 'No devices with net interface'
>>
>> ======================================================================
>> FAIL: test_query_port (tests.test_device.DeviceTest)
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 129, in test_query_port
>>     self.verify_port_attr(port_attr)
>>   File "/kernel_work/rdma-core/tests/test_device.py", line 113, in verify_port_attr
>>     assert 'Invalid' not in d.speed_to_str(attr.active_speed)
>> AssertionError
> 
> I'm very curious how did you get this assert "d.speed_to_str" covers all
> known speeds according to the IBTA.
> 

Hi Leon,

Short answer: I can make that one pass, with a small fix the the rdma-core test
suite:

commit a1b9fb0846e1b2356d7a16f4fbdd1960cf8dcbe5 (HEAD -> fix_speed_to_str)
Author: John Hubbard <jhubbard@nvidia.com>
Date:   Fri Dec 20 15:07:47 2019 -0800

    device: fix speed_to_str(), to handle disabled links
    
    For disabled links, the raw speed token is 0. However,
    speed_to_str() doesn't have that in the list. This leads
    to an assertion when running tests (test_query_port) when
    one link is down and other link(s) are up.
    
    Fix this by returning '(Disabled/down)' for the zero speed
    case.

diff --git a/pyverbs/device.pyx b/pyverbs/device.pyx
index 33d133fd..f8b7826b 100755
--- a/pyverbs/device.pyx
+++ b/pyverbs/device.pyx
@@ -923,8 +923,8 @@ def width_to_str(width):
 
 
 def speed_to_str(speed):
-    l = {1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps', 8: '10.0 Gbps',
-         16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
+    l = {0: '(Disabled/down)', 1: '2.5 Gbps', 2: '5.0 Gbps', 4: '5.0 Gbps',
+         8: '10.0 Gbps', 16: '14.0 Gbps', 32: '25.0 Gbps', 64: '50.0 Gbps'}
     try:
         return '{s} ({n})'.format(s=l[speed], n=speed)
     except KeyError:


Longer answer:
==============

It looks like this test suite assumes that every link is connected! (Probably
in most test systems, they are.) But in my setup, the ConnectX cards each have
two slots, and I only have (and only need) one cable. So one link is up, and
the other is disabled. 

This leads to the other problem, which is that if a link is disabled, the
test suite finds a "0" token for attr.active_speed. That token is not in the
approved list, and so d.speed_to_str() asserts.

With some diagnostics added, I can see it checking each link: one passes, and
the other asserts:

diff --git a/tests/test_device.py b/tests/test_device.py
index 524e0e89..7b33d7db 100644
--- a/tests/test_device.py
+++ b/tests/test_device.py
@@ -110,6 +110,12 @@ class DeviceTest(unittest.TestCase):
         assert 'Invalid' not in d.translate_mtu(attr.max_mtu)
         assert 'Invalid' not in d.translate_mtu(attr.active_mtu)
         assert 'Invalid' not in d.width_to_str(attr.active_width)
+        print("")
+        print('Diagnostics ===========================================')
+        print('phys_state:    ', d.phys_state_to_str(attr.phys_state))
+        print('active_width): ', d.width_to_str(attr.active_width))
+        print('active_speed:  ',   d.speed_to_str(attr.active_speed))
+        print('END of Diagnostics ====================================')
         assert 'Invalid' not in d.speed_to_str(attr.active_speed)
         assert 'Invalid' not in d.translate_link_layer(attr.link_layer)
         assert attr.max_msg_sz > 0x1000

         assert attr.max_msg_sz > 0x1000

...and the test run from that is:

# ./build/bin/run_tests.py --verbose tests.test_device.DeviceTest
test_dev_list (tests.test_device.DeviceTest) ... ok
test_open_dev (tests.test_device.DeviceTest) ... ok
test_query_device (tests.test_device.DeviceTest) ... ok
test_query_device_ex (tests.test_device.DeviceTest) ... ok
test_query_gid (tests.test_device.DeviceTest) ... ok
test_query_port (tests.test_device.DeviceTest) ... 
Diagnostics ===========================================
phys_state:     Link up (5)
active_width):  4X (2)
active_speed:   25.0 Gbps (32)
END of Diagnostics ====================================

Diagnostics ===========================================
phys_state:     Disabled (3)
active_width):  4X (2)
active_speed:   Invalid speed
END of Diagnostics ====================================
FAIL
test_query_port_bad_flow (tests.test_device.DeviceTest) ... ok

======================================================================
FAIL: test_query_port (tests.test_device.DeviceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/kernel_work/rdma-core/tests/test_device.py", line 135, in test_query_port
    self.verify_port_attr(port_attr)
  File "/kernel_work/rdma-core/tests/test_device.py", line 119, in verify_port_attr
    assert 'Invalid' not in d.speed_to_str(attr.active_speed)
AssertionError

----------------------------------------------------------------------
Ran 7 tests in 0.055s

FAILED (failures=1)



thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-20 18:29       ` Leon Romanovsky
@ 2019-12-20 23:54         ` John Hubbard
  2019-12-21 10:08           ` Leon Romanovsky
  2019-12-22 13:23           ` Leon Romanovsky
  0 siblings, 2 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-20 23:54 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML, Maor Gottlieb

On 12/20/19 10:29 AM, Leon Romanovsky wrote:
...
>> $ ./build.sh
>> $ build/bin/run_tests.py
>>
>> If you get things that far I think Leon can get a reproduction for you
> 
> I'm not so optimistic about that.
> 

OK, I'm going to proceed for now on the assumption that I've got an overflow
problem that happens when huge pages are pinned. If I can get more information,
great, otherwise it's probably enough.

One thing: for your repro, if you know the huge page size, and the system
page size for that case, that would really help. Also the number of pins per
page, more or less, that you'd expect. Because Jason says that only 2M huge 
pages are used...

Because the other possibility is that the refcount really is going negative, 
likely due to a mismatched pin/unpin somehow.

If there's not an obvious repro case available, but you do have one (is it easy
to repro, though?), then *if* you have the time, I could point you to a github
branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb44c4d2ada7..8526fd03b978 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
  * get_user_pages and page_mkclean and other calls that race to set up page
  * table entries.
  */
-#define GUP_PIN_COUNTING_BIAS (1U << 10)
+#define GUP_PIN_COUNTING_BIAS (1U << 8)
 
 void unpin_user_page(struct page *page);
 void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,

If that fails to repro, then we would be zeroing in on the root cause. 

The branch is here (I just tested it and it seems healthy):

git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-20  9:21     ` Jan Kara
@ 2019-12-21  0:02       ` John Hubbard
  2019-12-21  0:33       ` Dan Williams
  1 sibling, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-21  0:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Leon Romanovsky, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On 12/20/19 1:21 AM, Jan Kara wrote:
...
>> So, ideas and next steps:
>>
>> 1. Assuming that you *are* hitting this, I think I may have to fall back to
>> implementing the "deferred" part of this design, as part of this series, after
>> all. That means:
>>
>>   For the pin/unpin calls at least, stop treating all pages as if they are
>>   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
>>   That's not how it works now, and the need to hand back a huge array of
>>   subpages is part of the problem. This affects the callers too, so it's not
>>   a super quick change to make. (I was really hoping not to have to do this
>>   yet.)
> 
> Does that mean that you would need to make all GUP users huge page aware?
> Otherwise I don't see how what you suggest would work... And I don't think
> making all GUP users huge page aware is realistic (effort-wise) or even
> wanted (maintenance overhead in all those places).
> 

Well, pretty much yes. It's really just the pin_user_pages*() callers, but
the internals, follow_page() and such, are so interconnected right now that
it would probably blow up into a huge effort, as you point out.

> I believe there might be also a different solution for this: For
> transparent huge pages, we could find a space in 'struct page' of the
> second page in the huge page for proper pin counter and just account pins
> there so we'd have full width of 32-bits for it.
> 
> 								Honza
> 

OK, let me pursue that. Given that I shouldn't need to handle pages
splitting, it should be not *too* bad.

I am starting to think that I should just post the first 9 or so 
prerequisite patches (first 9 patches, plus the v4l2 fix that arguably should 
have been earlier in the sequence I guess), as 5.6 candidates, while I go
back to the drawing board here. 

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-20 13:34         ` Jason Gunthorpe
@ 2019-12-21  0:32           ` Dan Williams
  2019-12-23 18:24             ` Jason Gunthorpe
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Williams @ 2019-12-21  0:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, Leon Romanovsky, Andrew Morton, Al Viro,
	Alex Williamson, Benjamin Herrenschmidt, Björn Töpel,
	Christoph Hellwig, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	Maling list - DRI developers, KVM list, linux-block,
	Linux Doc Mailing List, linux-fsdevel, linux-kselftest,
	Linux-media, linux-rdma, linuxppc-dev, Netdev, Linux MM, LKML,
	Maor Gottlieb

On Fri, Dec 20, 2019 at 5:34 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 19, 2019 at 01:13:54PM -0800, John Hubbard wrote:
> > On 12/19/19 1:07 PM, Jason Gunthorpe wrote:
> > > On Thu, Dec 19, 2019 at 12:30:31PM -0800, John Hubbard wrote:
> > > > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > > > Hi,
> > > > > >
> > > > > > This implements an API naming change (put_user_page*() -->
> > > > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > > > extends that tracking to a few select subsystems. More subsystems will
> > > > > > be added in follow up work.
> > > > >
> > > > > Hi John,
> > > > >
> > > > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > > > allocated single memory block and registered multiple MRs using the single
> > > > > block.
> > > > >
> > > > > The possible bad flow is:
> > > > >    ib_umem_geti() ->
> > > > >     pin_user_pages_fast(FOLL_WRITE) ->
> > > > >      internal_get_user_pages_fast(FOLL_WRITE) ->
> > > > >       gup_pgd_range() ->
> > > > >        gup_huge_pd() ->
> > > > >         gup_hugepte() ->
> > > > >          try_grab_compound_head() ->
> > > >
> > > > Hi Leon,
> > > >
> > > > Thanks very much for the detailed report! So we're overflowing...
> > > >
> > > > At first look, this seems likely to be hitting a weak point in the
> > > > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > > > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> > > > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > > > with huge pages if the pages have a *lot* of subpages.
> > > >
> > > > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > >
> > > Considering that establishing these pins is entirely under user
> > > control, we can't have a limit here.
> >
> > There's already a limit, it's just a much larger one. :) What does "no limit"
> > really mean, numerically, to you in this case?
>
> I guess I mean 'hidden limit' - hitting the limit and failing would
> be managable.
>
> I think 7 is probably too low though, but we are not using 1GB huge
> pages, only 2M..

What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-20  9:21     ` Jan Kara
  2019-12-21  0:02       ` John Hubbard
@ 2019-12-21  0:33       ` Dan Williams
  2019-12-21  0:41         ` John Hubbard
  1 sibling, 1 reply; 67+ messages in thread
From: Dan Williams @ 2019-12-21  0:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Leon Romanovsky, Andrew Morton, Al Viro,
	Alex Williamson, Benjamin Herrenschmidt, Björn Töpel,
	Christoph Hellwig, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	Maling list - DRI developers, KVM list, linux-block,
	Linux Doc Mailing List, linux-fsdevel, linux-kselftest,
	Linux-media, linux-rdma, linuxppc-dev, Netdev, Linux MM, LKML,
	Maor Gottlieb

On Fri, Dec 20, 2019 at 1:22 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 19-12-19 12:30:31, John Hubbard wrote:
> > On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > > Hi,
> > > >
> > > > This implements an API naming change (put_user_page*() -->
> > > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > > extends that tracking to a few select subsystems. More subsystems will
> > > > be added in follow up work.
> > >
> > > Hi John,
> > >
> > > The patchset generates kernel panics in our IB testing. In our tests, we
> > > allocated single memory block and registered multiple MRs using the single
> > > block.
> > >
> > > The possible bad flow is:
> > >   ib_umem_geti() ->
> > >    pin_user_pages_fast(FOLL_WRITE) ->
> > >     internal_get_user_pages_fast(FOLL_WRITE) ->
> > >      gup_pgd_range() ->
> > >       gup_huge_pd() ->
> > >        gup_hugepte() ->
> > >         try_grab_compound_head() ->
> >
> > Hi Leon,
> >
> > Thanks very much for the detailed report! So we're overflowing...
> >
> > At first look, this seems likely to be hitting a weak point in the
> > GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> > (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> > 99-121). Basically it's pretty easy to overflow the page->_refcount
> > with huge pages if the pages have a *lot* of subpages.
> >
> > We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> > Do you have any idea how many pins (repeated pins on the same page, which
> > it sounds like you have) might be involved in your test case,
> > and the huge page and system page sizes? That would allow calculating
> > if we're likely overflowing for that reason.
> >
> > So, ideas and next steps:
> >
> > 1. Assuming that you *are* hitting this, I think I may have to fall back to
> > implementing the "deferred" part of this design, as part of this series, after
> > all. That means:
> >
> >   For the pin/unpin calls at least, stop treating all pages as if they are
> >   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
> >   That's not how it works now, and the need to hand back a huge array of
> >   subpages is part of the problem. This affects the callers too, so it's not
> >   a super quick change to make. (I was really hoping not to have to do this
> >   yet.)
>
> Does that mean that you would need to make all GUP users huge page aware?
> Otherwise I don't see how what you suggest would work... And I don't think
> making all GUP users huge page aware is realistic (effort-wise) or even
> wanted (maintenance overhead in all those places).
>
> I believe there might be also a different solution for this: For
> transparent huge pages, we could find a space in 'struct page' of the
> second page in the huge page for proper pin counter and just account pins
> there so we'd have full width of 32-bits for it.

That would require THP accounting for dax pages. It is something that
was probably going to be needed, but this would seem to force the
issue.

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-21  0:33       ` Dan Williams
@ 2019-12-21  0:41         ` John Hubbard
  2019-12-21  0:51           ` Dan Williams
  0 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-21  0:41 UTC (permalink / raw)
  To: Dan Williams, Jan Kara
  Cc: Leon Romanovsky, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jason Gunthorpe, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, Maling list - DRI developers,
	KVM list, linux-block, Linux Doc Mailing List, linux-fsdevel,
	linux-kselftest, Linux-media, linux-rdma, linuxppc-dev, Netdev,
	Linux MM, LKML, Maor Gottlieb

On 12/20/19 4:33 PM, Dan Williams wrote:
...
>> I believe there might be also a different solution for this: For
>> transparent huge pages, we could find a space in 'struct page' of the
>> second page in the huge page for proper pin counter and just account pins
>> there so we'd have full width of 32-bits for it.
> 
> That would require THP accounting for dax pages. It is something that
> was probably going to be needed, but this would seem to force the
> issue.
> 

Thanks for mentioning that, it wasn't obvious to me yet. 

How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
test setup? I'd hate to go into this without having that coverage up
and running. It's been sketchy enough as it is. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-21  0:41         ` John Hubbard
@ 2019-12-21  0:51           ` Dan Williams
  2019-12-21  0:53             ` John Hubbard
  0 siblings, 1 reply; 67+ messages in thread
From: Dan Williams @ 2019-12-21  0:51 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Leon Romanovsky, Andrew Morton, Al Viro,
	Alex Williamson, Benjamin Herrenschmidt, Björn Töpel,
	Christoph Hellwig, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	Maling list - DRI developers, KVM list, linux-block,
	Linux Doc Mailing List, linux-fsdevel, linux-kselftest,
	Linux-media, linux-rdma, linuxppc-dev, Netdev, Linux MM, LKML,
	Maor Gottlieb

On Fri, Dec 20, 2019 at 4:41 PM John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 12/20/19 4:33 PM, Dan Williams wrote:
> ...
> >> I believe there might be also a different solution for this: For
> >> transparent huge pages, we could find a space in 'struct page' of the
> >> second page in the huge page for proper pin counter and just account pins
> >> there so we'd have full width of 32-bits for it.
> >
> > That would require THP accounting for dax pages. It is something that
> > was probably going to be needed, but this would seem to force the
> > issue.
> >
>
> Thanks for mentioning that, it wasn't obvious to me yet.
>
> How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
> test setup? I'd hate to go into this without having that coverage up
> and running. It's been sketchy enough as it is. :)

You too can have the power of the gods for the low low price of a
kernel command line parameter, or a qemu setup.

Details here:

https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
https://nvdimm.wiki.kernel.org/pmem_in_qemu

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-21  0:51           ` Dan Williams
@ 2019-12-21  0:53             ` John Hubbard
  0 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-21  0:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Leon Romanovsky, Andrew Morton, Al Viro,
	Alex Williamson, Benjamin Herrenschmidt, Björn Töpel,
	Christoph Hellwig, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	Maling list - DRI developers, KVM list, linux-block,
	Linux Doc Mailing List, linux-fsdevel, linux-kselftest,
	Linux-media, linux-rdma, linuxppc-dev, Netdev, Linux MM, LKML,
	Maor Gottlieb

On 12/20/19 4:51 PM, Dan Williams wrote:
> On Fri, Dec 20, 2019 at 4:41 PM John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> On 12/20/19 4:33 PM, Dan Williams wrote:
>> ...
>>>> I believe there might be also a different solution for this: For
>>>> transparent huge pages, we could find a space in 'struct page' of the
>>>> second page in the huge page for proper pin counter and just account pins
>>>> there so we'd have full width of 32-bits for it.
>>>
>>> That would require THP accounting for dax pages. It is something that
>>> was probably going to be needed, but this would seem to force the
>>> issue.
>>>
>>
>> Thanks for mentioning that, it wasn't obvious to me yet.
>>
>> How easy is it for mere mortals outside of Intel, to set up a DAX (nvdimm?)
>> test setup? I'd hate to go into this without having that coverage up
>> and running. It's been sketchy enough as it is. :)
> 
> You too can have the power of the gods for the low low price of a
> kernel command line parameter, or a qemu setup.
> 
> Details here:
> 
> https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
> https://nvdimm.wiki.kernel.org/pmem_in_qemu
> 

Sweeeet! Now I can really cause some damage. :)

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-20 23:54         ` John Hubbard
@ 2019-12-21 10:08           ` Leon Romanovsky
  2019-12-21 23:59             ` John Hubbard
  2019-12-22 13:23           ` Leon Romanovsky
  1 sibling, 1 reply; 67+ messages in thread
From: Leon Romanovsky @ 2019-12-21 10:08 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> ...
> >> $ ./build.sh
> >> $ build/bin/run_tests.py
> >>
> >> If you get things that far I think Leon can get a reproduction for you
> >
> > I'm not so optimistic about that.
> >
>
> OK, I'm going to proceed for now on the assumption that I've got an overflow
> problem that happens when huge pages are pinned. If I can get more information,
> great, otherwise it's probably enough.
>
> One thing: for your repro, if you know the huge page size, and the system
> page size for that case, that would really help. Also the number of pins per
> page, more or less, that you'd expect. Because Jason says that only 2M huge
> pages are used...
>
> Because the other possibility is that the refcount really is going negative,
> likely due to a mismatched pin/unpin somehow.
>
> If there's not an obvious repro case available, but you do have one (is it easy
> to repro, though?), then *if* you have the time, I could point you to a github
> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:

I'll see what I can do this Sunday.

>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bb44c4d2ada7..8526fd03b978 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
>   * get_user_pages and page_mkclean and other calls that race to set up page
>   * table entries.
>   */
> -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> +#define GUP_PIN_COUNTING_BIAS (1U << 8)
>
>  void unpin_user_page(struct page *page);
>  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>
> If that fails to repro, then we would be zeroing in on the root cause.
>
> The branch is here (I just tested it and it seems healthy):
>
> git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
>
>
>
> thanks,
> --
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-21 10:08           ` Leon Romanovsky
@ 2019-12-21 23:59             ` John Hubbard
  0 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2019-12-21 23:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb

On 12/21/19 2:08 AM, Leon Romanovsky wrote:
> On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
>> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
>> ...
>>>> $ ./build.sh
>>>> $ build/bin/run_tests.py
>>>>
>>>> If you get things that far I think Leon can get a reproduction for you
>>>
>>> I'm not so optimistic about that.
>>>
>>
>> OK, I'm going to proceed for now on the assumption that I've got an overflow
>> problem that happens when huge pages are pinned. If I can get more information,
>> great, otherwise it's probably enough.
>>
>> One thing: for your repro, if you know the huge page size, and the system
>> page size for that case, that would really help. Also the number of pins per
>> page, more or less, that you'd expect. Because Jason says that only 2M huge
>> pages are used...
>>
>> Because the other possibility is that the refcount really is going negative,
>> likely due to a mismatched pin/unpin somehow.
>>
>> If there's not an obvious repro case available, but you do have one (is it easy
>> to repro, though?), then *if* you have the time, I could point you to a github
>> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
> 
> I'll see what I can do this Sunday.
> 

The other data point that might shed light on whether it's a mismatch (this only
works if the system is not actually crashing, though), is checking the new
vmstat items, like this:

$ grep foll_pin /proc/vmstat
nr_foll_pin_requested 16288188
nr_foll_pin_returned 16288188

...but OTOH, if you've got long-term pins, then those are *supposed* to be
mismatched, so it only really helps in between tests.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-20 23:54         ` John Hubbard
  2019-12-21 10:08           ` Leon Romanovsky
@ 2019-12-22 13:23           ` Leon Romanovsky
  2019-12-25  2:03             ` John Hubbard
  1 sibling, 1 reply; 67+ messages in thread
From: Leon Romanovsky @ 2019-12-22 13:23 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb, Ran Rozenstein

On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> ...
> >> $ ./build.sh
> >> $ build/bin/run_tests.py
> >>
> >> If you get things that far I think Leon can get a reproduction for you
> >
> > I'm not so optimistic about that.
> >
>
> OK, I'm going to proceed for now on the assumption that I've got an overflow
> problem that happens when huge pages are pinned. If I can get more information,
> great, otherwise it's probably enough.
>
> One thing: for your repro, if you know the huge page size, and the system
> page size for that case, that would really help. Also the number of pins per
> page, more or less, that you'd expect. Because Jason says that only 2M huge
> pages are used...
>
> Because the other possibility is that the refcount really is going negative,
> likely due to a mismatched pin/unpin somehow.
>
> If there's not an obvious repro case available, but you do have one (is it easy
> to repro, though?), then *if* you have the time, I could point you to a github
> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bb44c4d2ada7..8526fd03b978 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
>   * get_user_pages and page_mkclean and other calls that race to set up page
>   * table entries.
>   */
> -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> +#define GUP_PIN_COUNTING_BIAS (1U << 8)
>
>  void unpin_user_page(struct page *page);
>  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>
> If that fails to repro, then we would be zeroing in on the root cause.
>
> The branch is here (I just tested it and it seems healthy):
>
> git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags

Hi,

We tested the following branch and here comes results:
[root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
nr_foll_pin_requested 0
nr_foll_pin_returned 0

[root@serer consume_mtts]# (master) $ dmesg
[  425.221459] ------------[ cut here ]------------
[  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 try_grab_compound_head+0x90/0xa0
[  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
[  425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G           O      5.5.0-rc2+ #1
[  425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
[  425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
[  425.245739] RSP: 0018:ffffc900006878a8 EFLAGS: 00010082
[  425.247124] RAX: 0000000080000001 RBX: 00007f780488a000 RCX: 0000000000000bb0
[  425.248956] RDX: ffffea000e031087 RSI: 0000000000008a00 RDI: ffffea000dc58000
[  425.250761] RBP: ffffea000e031080 R08: ffffc90000687974 R09: 000fffffffe00000
[  425.252661] R10: 0000000000000000 R11: ffff888362560000 R12: 000000000000008a
[  425.254487] R13: 80000003716000e7 R14: 00007f780488a000 R15: ffffc90000687974
[  425.256309] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
[  425.258401] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  425.259949] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
[  425.261884] Call Trace:
[  425.262735]  gup_pgd_range+0x517/0x5a0
[  425.263819]  internal_get_user_pages_fast+0x210/0x250
[  425.265193]  ib_umem_get+0x298/0x550 [ib_uverbs]
[  425.266476]  mr_umem_get+0xc9/0x260 [mlx5_ib]
[  425.267699]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
[  425.269134]  ? xas_load+0x8/0x80
[  425.270074]  ? xa_load+0x48/0x90
[  425.271038]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
[  425.272757]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
[  425.274120]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
[  425.276058]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
[  425.277657]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[  425.279155]  ? __alloc_pages_nodemask+0x148/0x2b0
[  425.280445]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
[  425.281755]  do_vfs_ioctl+0x9d/0x650
[  425.282766]  ksys_ioctl+0x70/0x80
[  425.283745]  __x64_sys_ioctl+0x16/0x20
[  425.284912]  do_syscall_64+0x42/0x130
[  425.285973]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  425.287377] RIP: 0033:0x7f780d2df267
[  425.288449] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[  425.293073] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  425.295034] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
[  425.296895] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
[  425.298689] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
[  425.300480] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
[  425.302290] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
[  425.304113] ---[ end trace 1ecbefdb403190dd ]---
[  425.305434] ------------[ cut here ]------------
[  425.307147] WARNING: CPU: 1 PID: 6738 at mm/gup.c:150 try_grab_page+0x56/0x60
[  425.309111] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
[  425.316461] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G        W  O      5.5.0-rc2+ #1
[  425.318582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  425.320958] RIP: 0010:try_grab_page+0x56/0x60
[  425.322167] Code: 7e 28 f0 81 47 34 00 01 00 00 c3 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 7e 0f f0 ff 47 34 b8 01 00 00 00 c3 <0f> 0b 31 c0 c3 0f 0b 31 c0 c3 0f 1f 44 00 00 41 57 41 56 41 55 41
[  425.326814] RSP: 0018:ffffc90000687830 EFLAGS: 00010282
[  425.328226] RAX: 0000000000000001 RBX: ffffea000dc58000 RCX: ffffea000e031087
[  425.330104] RDX: 0000000080000001 RSI: 0000000000040000 RDI: ffffea000dc58000
[  425.331980] RBP: 00007f7804800000 R08: 000ffffffffff000 R09: 80000003716000e7
[  425.333898] R10: ffff88834af80120 R11: ffff8883ac16f000 R12: ffff88834af80120
[  425.335704] R13: ffff88837c0915c0 R14: 0000000000050201 R15: 00007f7804800000
[  425.337638] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
[  425.339734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  425.341369] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
[  425.343160] Call Trace:
[  425.343967]  follow_trans_huge_pmd+0x16f/0x2e0
[  425.345263]  follow_p4d_mask+0x51c/0x630
[  425.346344]  __get_user_pages+0x1a1/0x6c0
[  425.347463]  internal_get_user_pages_fast+0x17b/0x250
[  425.348918]  ib_umem_get+0x298/0x550 [ib_uverbs]
[  425.350174]  mr_umem_get+0xc9/0x260 [mlx5_ib]
[  425.351383]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
[  425.352849]  ? xas_load+0x8/0x80
[  425.353776]  ? xa_load+0x48/0x90
[  425.354730]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
[  425.356410]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
[  425.357843]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
[  425.359749]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
[  425.361405]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
[  425.362898]  ? __alloc_pages_nodemask+0x148/0x2b0
[  425.364206]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
[  425.365564]  do_vfs_ioctl+0x9d/0x650
[  425.366567]  ksys_ioctl+0x70/0x80
[  425.367537]  __x64_sys_ioctl+0x16/0x20
[  425.368698]  do_syscall_64+0x42/0x130
[  425.369782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  425.371117] RIP: 0033:0x7f780d2df267
[  425.372159] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
[  425.376774] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  425.378740] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
[  425.380598] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
[  425.382411] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
[  425.384312] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
[  425.386132] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
[  425.387964] ---[ end trace 1ecbefdb403190de ]---

Thanks

>
>
>
> thanks,
> --
> John Hubbard
> NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-21  0:32           ` Dan Williams
@ 2019-12-23 18:24             ` Jason Gunthorpe
  0 siblings, 0 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2019-12-23 18:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: John Hubbard, Leon Romanovsky, Andrew Morton, Al Viro,
	Alex Williamson, Benjamin Herrenschmidt, Björn Töpel,
	Christoph Hellwig, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	Maling list - DRI developers, KVM list, linux-block,
	Linux Doc Mailing List, linux-fsdevel, linux-kselftest,
	Linux-media, linux-rdma, linuxppc-dev, Netdev, Linux MM, LKML,
	Maor Gottlieb

On Fri, Dec 20, 2019 at 04:32:13PM -0800, Dan Williams wrote:

> > > There's already a limit, it's just a much larger one. :) What does "no limit"
> > > really mean, numerically, to you in this case?
> >
> > I guess I mean 'hidden limit' - hitting the limit and failing would
> > be managable.
> >
> > I think 7 is probably too low though, but we are not using 1GB huge
> > pages, only 2M..
> 
> What about RDMA to 1GB-hugetlbfs and 1GB-device-dax mappings?

I don't think the failing testing is doing that.

It is also less likely that 1GB regions will need multi-mapping, IMHO.

Jason

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-22 13:23           ` Leon Romanovsky
@ 2019-12-25  2:03             ` John Hubbard
  2019-12-25  5:26               ` Leon Romanovsky
  0 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-25  2:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb, Ran Rozenstein

On 12/22/19 5:23 AM, Leon Romanovsky wrote:
> On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
>> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
>> ...
>>>> $ ./build.sh
>>>> $ build/bin/run_tests.py
>>>>
>>>> If you get things that far I think Leon can get a reproduction for you
>>>
>>> I'm not so optimistic about that.
>>>
>>
>> OK, I'm going to proceed for now on the assumption that I've got an overflow
>> problem that happens when huge pages are pinned. If I can get more information,
>> great, otherwise it's probably enough.
>>
>> One thing: for your repro, if you know the huge page size, and the system
>> page size for that case, that would really help. Also the number of pins per
>> page, more or less, that you'd expect. Because Jason says that only 2M huge
>> pages are used...
>>
>> Because the other possibility is that the refcount really is going negative,
>> likely due to a mismatched pin/unpin somehow.
>>
>> If there's not an obvious repro case available, but you do have one (is it easy
>> to repro, though?), then *if* you have the time, I could point you to a github
>> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index bb44c4d2ada7..8526fd03b978 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
>>    * get_user_pages and page_mkclean and other calls that race to set up page
>>    * table entries.
>>    */
>> -#define GUP_PIN_COUNTING_BIAS (1U << 10)
>> +#define GUP_PIN_COUNTING_BIAS (1U << 8)
>>
>>   void unpin_user_page(struct page *page);
>>   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>
>> If that fails to repro, then we would be zeroing in on the root cause.
>>
>> The branch is here (I just tested it and it seems healthy):
>>
>> git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
> 
> Hi,
> 
> We tested the following branch and here comes results:

Thanks for this testing run!

> [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
> nr_foll_pin_requested 0
> nr_foll_pin_returned 0
> 

Zero pinned pages!

...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
not happening. And although the backtraces below show some of my new
routines (like try_grab_page), they also confirm the above: there is no
pin_user_page*() call in the stack.

In particular, it looks like ib_umem_get() is calling through to
get_user_pages*(), rather than pin_user_pages*(). I don't see how this
is possible, because the code on my screen shows ib_umem_get() calling
pin_user_pages_fast().

Any thoughts or ideas are welcome here.

However, glossing over all of that and assuming that the new
GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
see any overflow. I'm less confident now that this is a true refcount
overflow.

Also, any information that would get me closer to being able to attempt
my own reproduction of the problem are *very* welcome. :)

thanks,
-- 
John Hubbard
NVIDIA

> [root@serer consume_mtts]# (master) $ dmesg
> [  425.221459] ------------[ cut here ]------------
> [  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 try_grab_compound_head+0x90/0xa0
> [  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
> [  425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G           O      5.5.0-rc2+ #1
> [  425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> [  425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
> [  425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
> [  425.245739] RSP: 0018:ffffc900006878a8 EFLAGS: 00010082
> [  425.247124] RAX: 0000000080000001 RBX: 00007f780488a000 RCX: 0000000000000bb0
> [  425.248956] RDX: ffffea000e031087 RSI: 0000000000008a00 RDI: ffffea000dc58000
> [  425.250761] RBP: ffffea000e031080 R08: ffffc90000687974 R09: 000fffffffe00000
> [  425.252661] R10: 0000000000000000 R11: ffff888362560000 R12: 000000000000008a
> [  425.254487] R13: 80000003716000e7 R14: 00007f780488a000 R15: ffffc90000687974
> [  425.256309] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
> [  425.258401] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  425.259949] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
> [  425.261884] Call Trace:
> [  425.262735]  gup_pgd_range+0x517/0x5a0
> [  425.263819]  internal_get_user_pages_fast+0x210/0x250
> [  425.265193]  ib_umem_get+0x298/0x550 [ib_uverbs]
> [  425.266476]  mr_umem_get+0xc9/0x260 [mlx5_ib]
> [  425.267699]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
> [  425.269134]  ? xas_load+0x8/0x80
> [  425.270074]  ? xa_load+0x48/0x90
> [  425.271038]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
> [  425.272757]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
> [  425.274120]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
> [  425.276058]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
> [  425.277657]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
> [  425.279155]  ? __alloc_pages_nodemask+0x148/0x2b0
> [  425.280445]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
> [  425.281755]  do_vfs_ioctl+0x9d/0x650
> [  425.282766]  ksys_ioctl+0x70/0x80
> [  425.283745]  __x64_sys_ioctl+0x16/0x20
> [  425.284912]  do_syscall_64+0x42/0x130
> [  425.285973]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  425.287377] RIP: 0033:0x7f780d2df267
> [  425.288449] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
> [  425.293073] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  425.295034] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
> [  425.296895] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
> [  425.298689] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
> [  425.300480] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
> [  425.302290] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
> [  425.304113] ---[ end trace 1ecbefdb403190dd ]---
> [  425.305434] ------------[ cut here ]------------
> [  425.307147] WARNING: CPU: 1 PID: 6738 at mm/gup.c:150 try_grab_page+0x56/0x60
> [  425.309111] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
> [  425.316461] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G        W  O      5.5.0-rc2+ #1
> [  425.318582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> [  425.320958] RIP: 0010:try_grab_page+0x56/0x60
> [  425.322167] Code: 7e 28 f0 81 47 34 00 01 00 00 c3 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 7e 0f f0 ff 47 34 b8 01 00 00 00 c3 <0f> 0b 31 c0 c3 0f 0b 31 c0 c3 0f 1f 44 00 00 41 57 41 56 41 55 41
> [  425.326814] RSP: 0018:ffffc90000687830 EFLAGS: 00010282
> [  425.328226] RAX: 0000000000000001 RBX: ffffea000dc58000 RCX: ffffea000e031087
> [  425.330104] RDX: 0000000080000001 RSI: 0000000000040000 RDI: ffffea000dc58000
> [  425.331980] RBP: 00007f7804800000 R08: 000ffffffffff000 R09: 80000003716000e7
> [  425.333898] R10: ffff88834af80120 R11: ffff8883ac16f000 R12: ffff88834af80120
> [  425.335704] R13: ffff88837c0915c0 R14: 0000000000050201 R15: 00007f7804800000
> [  425.337638] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
> [  425.339734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  425.341369] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
> [  425.343160] Call Trace:
> [  425.343967]  follow_trans_huge_pmd+0x16f/0x2e0
> [  425.345263]  follow_p4d_mask+0x51c/0x630
> [  425.346344]  __get_user_pages+0x1a1/0x6c0
> [  425.347463]  internal_get_user_pages_fast+0x17b/0x250
> [  425.348918]  ib_umem_get+0x298/0x550 [ib_uverbs]
> [  425.350174]  mr_umem_get+0xc9/0x260 [mlx5_ib]
> [  425.351383]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
> [  425.352849]  ? xas_load+0x8/0x80
> [  425.353776]  ? xa_load+0x48/0x90
> [  425.354730]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
> [  425.356410]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
> [  425.357843]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
> [  425.359749]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
> [  425.361405]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
> [  425.362898]  ? __alloc_pages_nodemask+0x148/0x2b0
> [  425.364206]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
> [  425.365564]  do_vfs_ioctl+0x9d/0x650
> [  425.366567]  ksys_ioctl+0x70/0x80
> [  425.367537]  __x64_sys_ioctl+0x16/0x20
> [  425.368698]  do_syscall_64+0x42/0x130
> [  425.369782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  425.371117] RIP: 0033:0x7f780d2df267
> [  425.372159] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
> [  425.376774] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  425.378740] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
> [  425.380598] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
> [  425.382411] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
> [  425.384312] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
> [  425.386132] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
> [  425.387964] ---[ end trace 1ecbefdb403190de ]---
> 
> Thanks
> 
>>
>>
>>
>> thanks,
>> --
>> John Hubbard
>> NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-25  2:03             ` John Hubbard
@ 2019-12-25  5:26               ` Leon Romanovsky
  2019-12-27 21:56                 ` John Hubbard
  0 siblings, 1 reply; 67+ messages in thread
From: Leon Romanovsky @ 2019-12-25  5:26 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb, Ran Rozenstein

On Tue, Dec 24, 2019 at 06:03:50PM -0800, John Hubbard wrote:
> On 12/22/19 5:23 AM, Leon Romanovsky wrote:
> > On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> > > On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> > > ...
> > > > > $ ./build.sh
> > > > > $ build/bin/run_tests.py
> > > > >
> > > > > If you get things that far I think Leon can get a reproduction for you
> > > >
> > > > I'm not so optimistic about that.
> > > >
> > >
> > > OK, I'm going to proceed for now on the assumption that I've got an overflow
> > > problem that happens when huge pages are pinned. If I can get more information,
> > > great, otherwise it's probably enough.
> > >
> > > One thing: for your repro, if you know the huge page size, and the system
> > > page size for that case, that would really help. Also the number of pins per
> > > page, more or less, that you'd expect. Because Jason says that only 2M huge
> > > pages are used...
> > >
> > > Because the other possibility is that the refcount really is going negative,
> > > likely due to a mismatched pin/unpin somehow.
> > >
> > > If there's not an obvious repro case available, but you do have one (is it easy
> > > to repro, though?), then *if* you have the time, I could point you to a github
> > > branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index bb44c4d2ada7..8526fd03b978 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
> > >    * get_user_pages and page_mkclean and other calls that race to set up page
> > >    * table entries.
> > >    */
> > > -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> > > +#define GUP_PIN_COUNTING_BIAS (1U << 8)
> > >
> > >   void unpin_user_page(struct page *page);
> > >   void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
> > >
> > > If that fails to repro, then we would be zeroing in on the root cause.
> > >
> > > The branch is here (I just tested it and it seems healthy):
> > >
> > > git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
> >
> > Hi,
> >
> > We tested the following branch and here comes results:
>
> Thanks for this testing run!
>
> > [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
> > nr_foll_pin_requested 0
> > nr_foll_pin_returned 0
> >
>
> Zero pinned pages!

Maybe we are missing some CONFIG_* option?
https://lore.kernel.org/linux-rdma/12a28917-f8c9-5092-2f01-92bb74714cae@nvidia.com/T/#mf900896f5dfc86cdee9246219990c632ed77115f

>
> ...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
> not happening. And although the backtraces below show some of my new
> routines (like try_grab_page), they also confirm the above: there is no
> pin_user_page*() call in the stack.
>
> In particular, it looks like ib_umem_get() is calling through to
> get_user_pages*(), rather than pin_user_pages*(). I don't see how this
> is possible, because the code on my screen shows ib_umem_get() calling
> pin_user_pages_fast().
>
> Any thoughts or ideas are welcome here.
>
> However, glossing over all of that and assuming that the new
> GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
> see any overflow. I'm less confident now that this is a true refcount
> overflow.

Earlier in this email thread, I posted possible function call chain which
doesn't involve refcount overflow, but for some reason the refcount
overflow was chosen as a way to explore.

>
> Also, any information that would get me closer to being able to attempt
> my own reproduction of the problem are *very* welcome. :)

It is ancient verification test (~10y) which is not an easy task to
make it understandable and standalone :).

>
> thanks,
> --
> John Hubbard
> NVIDIA
>
> > [root@serer consume_mtts]# (master) $ dmesg
> > [  425.221459] ------------[ cut here ]------------
> > [  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 try_grab_compound_head+0x90/0xa0
> > [  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
> > [  425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G           O      5.5.0-rc2+ #1
> > [  425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> > [  425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
> > [  425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
> > [  425.245739] RSP: 0018:ffffc900006878a8 EFLAGS: 00010082
> > [  425.247124] RAX: 0000000080000001 RBX: 00007f780488a000 RCX: 0000000000000bb0
> > [  425.248956] RDX: ffffea000e031087 RSI: 0000000000008a00 RDI: ffffea000dc58000
> > [  425.250761] RBP: ffffea000e031080 R08: ffffc90000687974 R09: 000fffffffe00000
> > [  425.252661] R10: 0000000000000000 R11: ffff888362560000 R12: 000000000000008a
> > [  425.254487] R13: 80000003716000e7 R14: 00007f780488a000 R15: ffffc90000687974
> > [  425.256309] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
> > [  425.258401] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  425.259949] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
> > [  425.261884] Call Trace:
> > [  425.262735]  gup_pgd_range+0x517/0x5a0
> > [  425.263819]  internal_get_user_pages_fast+0x210/0x250
> > [  425.265193]  ib_umem_get+0x298/0x550 [ib_uverbs]
> > [  425.266476]  mr_umem_get+0xc9/0x260 [mlx5_ib]
> > [  425.267699]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
> > [  425.269134]  ? xas_load+0x8/0x80
> > [  425.270074]  ? xa_load+0x48/0x90
> > [  425.271038]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
> > [  425.272757]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
> > [  425.274120]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
> > [  425.276058]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
> > [  425.277657]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
> > [  425.279155]  ? __alloc_pages_nodemask+0x148/0x2b0
> > [  425.280445]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
> > [  425.281755]  do_vfs_ioctl+0x9d/0x650
> > [  425.282766]  ksys_ioctl+0x70/0x80
> > [  425.283745]  __x64_sys_ioctl+0x16/0x20
> > [  425.284912]  do_syscall_64+0x42/0x130
> > [  425.285973]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  425.287377] RIP: 0033:0x7f780d2df267
> > [  425.288449] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
> > [  425.293073] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > [  425.295034] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
> > [  425.296895] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
> > [  425.298689] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
> > [  425.300480] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
> > [  425.302290] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
> > [  425.304113] ---[ end trace 1ecbefdb403190dd ]---
> > [  425.305434] ------------[ cut here ]------------
> > [  425.307147] WARNING: CPU: 1 PID: 6738 at mm/gup.c:150 try_grab_page+0x56/0x60
> > [  425.309111] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
> > [  425.316461] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G        W  O      5.5.0-rc2+ #1
> > [  425.318582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> > [  425.320958] RIP: 0010:try_grab_page+0x56/0x60
> > [  425.322167] Code: 7e 28 f0 81 47 34 00 01 00 00 c3 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 7e 0f f0 ff 47 34 b8 01 00 00 00 c3 <0f> 0b 31 c0 c3 0f 0b 31 c0 c3 0f 1f 44 00 00 41 57 41 56 41 55 41
> > [  425.326814] RSP: 0018:ffffc90000687830 EFLAGS: 00010282
> > [  425.328226] RAX: 0000000000000001 RBX: ffffea000dc58000 RCX: ffffea000e031087
> > [  425.330104] RDX: 0000000080000001 RSI: 0000000000040000 RDI: ffffea000dc58000
> > [  425.331980] RBP: 00007f7804800000 R08: 000ffffffffff000 R09: 80000003716000e7
> > [  425.333898] R10: ffff88834af80120 R11: ffff8883ac16f000 R12: ffff88834af80120
> > [  425.335704] R13: ffff88837c0915c0 R14: 0000000000050201 R15: 00007f7804800000
> > [  425.337638] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
> > [  425.339734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  425.341369] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
> > [  425.343160] Call Trace:
> > [  425.343967]  follow_trans_huge_pmd+0x16f/0x2e0
> > [  425.345263]  follow_p4d_mask+0x51c/0x630
> > [  425.346344]  __get_user_pages+0x1a1/0x6c0
> > [  425.347463]  internal_get_user_pages_fast+0x17b/0x250
> > [  425.348918]  ib_umem_get+0x298/0x550 [ib_uverbs]
> > [  425.350174]  mr_umem_get+0xc9/0x260 [mlx5_ib]
> > [  425.351383]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
> > [  425.352849]  ? xas_load+0x8/0x80
> > [  425.353776]  ? xa_load+0x48/0x90
> > [  425.354730]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
> > [  425.356410]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
> > [  425.357843]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
> > [  425.359749]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
> > [  425.361405]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
> > [  425.362898]  ? __alloc_pages_nodemask+0x148/0x2b0
> > [  425.364206]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
> > [  425.365564]  do_vfs_ioctl+0x9d/0x650
> > [  425.366567]  ksys_ioctl+0x70/0x80
> > [  425.367537]  __x64_sys_ioctl+0x16/0x20
> > [  425.368698]  do_syscall_64+0x42/0x130
> > [  425.369782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  425.371117] RIP: 0033:0x7f780d2df267
> > [  425.372159] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
> > [  425.376774] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > [  425.378740] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
> > [  425.380598] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
> > [  425.382411] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
> > [  425.384312] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
> > [  425.386132] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
> > [  425.387964] ---[ end trace 1ecbefdb403190de ]---
> >
> > Thanks
> >
> > >
> > >
> > >
> > > thanks,
> > > --
> > > John Hubbard
> > > NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-25  5:26               ` Leon Romanovsky
@ 2019-12-27 21:56                 ` John Hubbard
  2019-12-29  4:33                   ` John Hubbard
  0 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-27 21:56 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb, Ran Rozenstein

On 12/24/19 9:26 PM, Leon Romanovsky wrote:
...
>>>> The branch is here (I just tested it and it seems healthy):
>>>>
>>>> git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
>>>
>>> Hi,
>>>
>>> We tested the following branch and here comes results:
>>
>> Thanks for this testing run!
>>
>>> [root@server consume_mtts]# (master) $ grep foll_pin /proc/vmstat
>>> nr_foll_pin_requested 0
>>> nr_foll_pin_returned 0
>>>
>>
>> Zero pinned pages!
> 
> Maybe we are missing some CONFIG_* option?
> https://lore.kernel.org/linux-rdma/12a28917-f8c9-5092-2f01-92bb74714cae@nvidia.com/T/#mf900896f5dfc86cdee9246219990c632ed77115f


Ah OK, it must be that CONFIG_DEBUG_VM is not set, thanks!


>> ...now I'm confused. Somehow FOLL_PIN and pin_user_pages*() calls are
>> not happening. And although the backtraces below show some of my new
>> routines (like try_grab_page), they also confirm the above: there is no
>> pin_user_page*() call in the stack.
>>
>> In particular, it looks like ib_umem_get() is calling through to
>> get_user_pages*(), rather than pin_user_pages*(). I don't see how this
>> is possible, because the code on my screen shows ib_umem_get() calling
>> pin_user_pages_fast().
>>

It must be that pin_user_pages() is in the call stack, but just not getting
printed. There's no other way to explain this.

>> Any thoughts or ideas are welcome here.
>>
>> However, glossing over all of that and assuming that the new
>> GUP_PIN_COUNTING_BIAS of 256 is applied, it's interesting that we still
>> see any overflow. I'm less confident now that this is a true refcount
>> overflow.
> 
> Earlier in this email thread, I posted possible function call chain which
> doesn't involve refcount overflow, but for some reason the refcount
> overflow was chosen as a way to explore.
> 

Well, both of the WARN() calls are asserting that the refcount went negative
(well, one asserts negative, and the other asserts "<=0"). So it's pretty
hard to talk our way out of a refcount overflow here.

>>
>> Also, any information that would get me closer to being able to attempt
>> my own reproduction of the problem are *very* welcome. :)
> 
> It is ancient verification test (~10y) which is not an easy task to
> make it understandable and standalone :).
> 

Is this the only test that fails, btw? No other test failures or hints of
problems?

(Also, maybe hopeless, but can *anyone* on the RDMA list provide some
characterization of the test, such as how many pins per page, what page
sizes are used? I'm still hoping to write a test to trigger something
close to this...)

I do have a couple more ideas for test runs:

1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
page->_refcount into a no-op, and so if all is well (it may not be!) with the
rest of the patch, then we'd expect this problem to not reappear.

2. Active /proc/vmstat *foll_pin* statistics unconditionally (just for these
tests, of course), so we can see if there is a get/put mismatch. However, that
will change the timing, and so it must be attempted independently of (1), in
order to see if it ends up hiding the repro.

I've updated this branch to implement (1), but not (2), hoping you can give
this one a spin?

     git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags


thanks,
-- 
John Hubbard
NVIDIA


>>
>>> [root@serer consume_mtts]# (master) $ dmesg
>>> [  425.221459] ------------[ cut here ]------------
>>> [  425.225894] WARNING: CPU: 1 PID: 6738 at mm/gup.c:61 try_grab_compound_head+0x90/0xa0
>>> [  425.228021] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
>>> [  425.235266] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G           O      5.5.0-rc2+ #1
>>> [  425.237480] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>>> [  425.239738] RIP: 0010:try_grab_compound_head+0x90/0xa0
>>> [  425.241170] Code: 06 48 8d 4f 34 f0 0f b1 57 34 74 cd 85 c0 74 cf 8d 14 06 f0 0f b1 11 74 c0 eb f1 8d 14 06 f0 0f b1 11 74 b5 85 c0 75 f3 eb b5 <0f> 0b 31 c0 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41
>>> [  425.245739] RSP: 0018:ffffc900006878a8 EFLAGS: 00010082
>>> [  425.247124] RAX: 0000000080000001 RBX: 00007f780488a000 RCX: 0000000000000bb0
>>> [  425.248956] RDX: ffffea000e031087 RSI: 0000000000008a00 RDI: ffffea000dc58000
>>> [  425.250761] RBP: ffffea000e031080 R08: ffffc90000687974 R09: 000fffffffe00000
>>> [  425.252661] R10: 0000000000000000 R11: ffff888362560000 R12: 000000000000008a
>>> [  425.254487] R13: 80000003716000e7 R14: 00007f780488a000 R15: ffffc90000687974
>>> [  425.256309] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
>>> [  425.258401] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  425.259949] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
>>> [  425.261884] Call Trace:
>>> [  425.262735]  gup_pgd_range+0x517/0x5a0
>>> [  425.263819]  internal_get_user_pages_fast+0x210/0x250
>>> [  425.265193]  ib_umem_get+0x298/0x550 [ib_uverbs]
>>> [  425.266476]  mr_umem_get+0xc9/0x260 [mlx5_ib]
>>> [  425.267699]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
>>> [  425.269134]  ? xas_load+0x8/0x80
>>> [  425.270074]  ? xa_load+0x48/0x90
>>> [  425.271038]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
>>> [  425.272757]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
>>> [  425.274120]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
>>> [  425.276058]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
>>> [  425.277657]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
>>> [  425.279155]  ? __alloc_pages_nodemask+0x148/0x2b0
>>> [  425.280445]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
>>> [  425.281755]  do_vfs_ioctl+0x9d/0x650
>>> [  425.282766]  ksys_ioctl+0x70/0x80
>>> [  425.283745]  __x64_sys_ioctl+0x16/0x20
>>> [  425.284912]  do_syscall_64+0x42/0x130
>>> [  425.285973]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  425.287377] RIP: 0033:0x7f780d2df267
>>> [  425.288449] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
>>> [  425.293073] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [  425.295034] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
>>> [  425.296895] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
>>> [  425.298689] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
>>> [  425.300480] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
>>> [  425.302290] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
>>> [  425.304113] ---[ end trace 1ecbefdb403190dd ]---
>>> [  425.305434] ------------[ cut here ]------------
>>> [  425.307147] WARNING: CPU: 1 PID: 6738 at mm/gup.c:150 try_grab_page+0x56/0x60
>>> [  425.309111] Modules linked in: mlx5_ib mlx5_core mlxfw mlx4_ib mlx4_en ptp pps_core mlx4_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre ip_tunnel rdma_rxe ip6_udp_tunnel udp_tunnel rdma_ucm ib_uverbs ib_ipoib ib_umad ib_srp scsi_transport_srp rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core [last unloaded: mlxfw]
>>> [  425.316461] CPU: 1 PID: 6738 Comm: consume_mtts Tainted: G        W  O      5.5.0-rc2+ #1
>>> [  425.318582] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>>> [  425.320958] RIP: 0010:try_grab_page+0x56/0x60
>>> [  425.322167] Code: 7e 28 f0 81 47 34 00 01 00 00 c3 48 8b 47 08 48 8d 50 ff a8 01 48 0f 45 fa 8b 47 34 85 c0 7e 0f f0 ff 47 34 b8 01 00 00 00 c3 <0f> 0b 31 c0 c3 0f 0b 31 c0 c3 0f 1f 44 00 00 41 57 41 56 41 55 41
>>> [  425.326814] RSP: 0018:ffffc90000687830 EFLAGS: 00010282
>>> [  425.328226] RAX: 0000000000000001 RBX: ffffea000dc58000 RCX: ffffea000e031087
>>> [  425.330104] RDX: 0000000080000001 RSI: 0000000000040000 RDI: ffffea000dc58000
>>> [  425.331980] RBP: 00007f7804800000 R08: 000ffffffffff000 R09: 80000003716000e7
>>> [  425.333898] R10: ffff88834af80120 R11: ffff8883ac16f000 R12: ffff88834af80120
>>> [  425.335704] R13: ffff88837c0915c0 R14: 0000000000050201 R15: 00007f7804800000
>>> [  425.337638] FS:  00007f780d9d3740(0000) GS:ffff8883b1c80000(0000) knlGS:0000000000000000
>>> [  425.339734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  425.341369] CR2: 0000000002334048 CR3: 000000039c68c001 CR4: 00000000001606a0
>>> [  425.343160] Call Trace:
>>> [  425.343967]  follow_trans_huge_pmd+0x16f/0x2e0
>>> [  425.345263]  follow_p4d_mask+0x51c/0x630
>>> [  425.346344]  __get_user_pages+0x1a1/0x6c0
>>> [  425.347463]  internal_get_user_pages_fast+0x17b/0x250
>>> [  425.348918]  ib_umem_get+0x298/0x550 [ib_uverbs]
>>> [  425.350174]  mr_umem_get+0xc9/0x260 [mlx5_ib]
>>> [  425.351383]  mlx5_ib_reg_user_mr+0xcc/0x7e0 [mlx5_ib]
>>> [  425.352849]  ? xas_load+0x8/0x80
>>> [  425.353776]  ? xa_load+0x48/0x90
>>> [  425.354730]  ? lookup_get_idr_uobject.part.10+0x12/0x70 [ib_uverbs]
>>> [  425.356410]  ib_uverbs_reg_mr+0x127/0x280 [ib_uverbs]
>>> [  425.357843]  ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xc2/0xf0 [ib_uverbs]
>>> [  425.359749]  ib_uverbs_cmd_verbs.isra.6+0x5be/0xbe0 [ib_uverbs]
>>> [  425.361405]  ? uverbs_disassociate_api+0xd0/0xd0 [ib_uverbs]
>>> [  425.362898]  ? __alloc_pages_nodemask+0x148/0x2b0
>>> [  425.364206]  ib_uverbs_ioctl+0xc0/0x120 [ib_uverbs]
>>> [  425.365564]  do_vfs_ioctl+0x9d/0x650
>>> [  425.366567]  ksys_ioctl+0x70/0x80
>>> [  425.367537]  __x64_sys_ioctl+0x16/0x20
>>> [  425.368698]  do_syscall_64+0x42/0x130
>>> [  425.369782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [  425.371117] RIP: 0033:0x7f780d2df267
>>> [  425.372159] Code: b3 66 90 48 8b 05 19 3c 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 3b 2c 00 f7 d8 64 89 01 48
>>> [  425.376774] RSP: 002b:00007ffce49a88a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>> [  425.378740] RAX: ffffffffffffffda RBX: 00007ffce49a8938 RCX: 00007f780d2df267
>>> [  425.380598] RDX: 00007ffce49a8920 RSI: 00000000c0181b01 RDI: 0000000000000003
>>> [  425.382411] RBP: 00007ffce49a8900 R08: 0000000000000003 R09: 00007f780d9a1010
>>> [  425.384312] R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f780d9a1150
>>> [  425.386132] R13: 00007ffce49a8900 R14: 00007ffce49a8ad8 R15: 00007f780468a000
>>> [  425.387964] ---[ end trace 1ecbefdb403190de ]---
>>>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-27 21:56                 ` John Hubbard
@ 2019-12-29  4:33                   ` John Hubbard
  2020-01-06  9:01                     ` Jan Kara
  0 siblings, 1 reply; 67+ messages in thread
From: John Hubbard @ 2019-12-29  4:33 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb, Ran Rozenstein

On 12/27/19 1:56 PM, John Hubbard wrote:
...
>> It is ancient verification test (~10y) which is not an easy task to
>> make it understandable and standalone :).
>>
> 
> Is this the only test that fails, btw? No other test failures or hints of
> problems?
> 
> (Also, maybe hopeless, but can *anyone* on the RDMA list provide some
> characterization of the test, such as how many pins per page, what page
> sizes are used? I'm still hoping to write a test to trigger something
> close to this...)
> 
> I do have a couple more ideas for test runs:
> 
> 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
> page->_refcount into a no-op, and so if all is well (it may not be!) with the
> rest of the patch, then we'd expect this problem to not reappear.
> 
> 2. Active /proc/vmstat *foll_pin* statistics unconditionally (just for these
> tests, of course), so we can see if there is a get/put mismatch. However, that
> will change the timing, and so it must be attempted independently of (1), in
> order to see if it ends up hiding the repro.
> 
> I've updated this branch to implement (1), but not (2), hoping you can give
> this one a spin?
> 
>     git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
> 
> 

Also, looking ahead:

a) if the problem disappears with the latest above test, then we likely have
   a huge page refcount overflow, and there are a couple of different ways to
   fix it. 

b) if it still reproduces with the above, then it's some other random mistake,
   and in that case I'd be inclined to do a sort of guided (or classic, unguided)
   git bisect of the series. Because it could be any of several patches.

   If that's too much trouble, then I'd have to fall back to submitting a few
   patches at a time and working my way up to the tracking patch...


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2019-12-29  4:33                   ` John Hubbard
@ 2020-01-06  9:01                     ` Jan Kara
  2020-01-07  1:26                       ` John Hubbard
  0 siblings, 1 reply; 67+ messages in thread
From: Jan Kara @ 2020-01-06  9:01 UTC (permalink / raw)
  To: John Hubbard
  Cc: Leon Romanovsky, Jason Gunthorpe, Andrew Morton, Al Viro,
	Alex Williamson, Benjamin Herrenschmidt, Björn Töpel,
	Christoph Hellwig, Dan Williams, Daniel Vetter, Dave Chinner,
	David Airlie, David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb, Ran Rozenstein

On Sat 28-12-19 20:33:32, John Hubbard wrote:
> On 12/27/19 1:56 PM, John Hubbard wrote:
> ...
> >> It is ancient verification test (~10y) which is not an easy task to
> >> make it understandable and standalone :).
> >>
> > 
> > Is this the only test that fails, btw? No other test failures or hints of
> > problems?
> > 
> > (Also, maybe hopeless, but can *anyone* on the RDMA list provide some
> > characterization of the test, such as how many pins per page, what page
> > sizes are used? I'm still hoping to write a test to trigger something
> > close to this...)
> > 
> > I do have a couple more ideas for test runs:
> > 
> > 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
> > page->_refcount into a no-op, and so if all is well (it may not be!) with the
> > rest of the patch, then we'd expect this problem to not reappear.
> > 
> > 2. Active /proc/vmstat *foll_pin* statistics unconditionally (just for these
> > tests, of course), so we can see if there is a get/put mismatch. However, that
> > will change the timing, and so it must be attempted independently of (1), in
> > order to see if it ends up hiding the repro.
> > 
> > I've updated this branch to implement (1), but not (2), hoping you can give
> > this one a spin?
> > 
> >     git@github.com:johnhubbard/linux.git  pin_user_pages_tracking_v11_with_diags
> > 
> > 
> 
> Also, looking ahead:
> 
> a) if the problem disappears with the latest above test, then we likely have
>    a huge page refcount overflow, and there are a couple of different ways to
>    fix it. 
> 
> b) if it still reproduces with the above, then it's some other random mistake,
>    and in that case I'd be inclined to do a sort of guided (or classic, unguided)
>    git bisect of the series. Because it could be any of several patches.
> 
>    If that's too much trouble, then I'd have to fall back to submitting a few
>    patches at a time and working my way up to the tracking patch...

It could also be that an ordinary page reference is dropped with 'unpin'
thus underflowing the page refcount...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN
  2020-01-06  9:01                     ` Jan Kara
@ 2020-01-07  1:26                       ` John Hubbard
  0 siblings, 0 replies; 67+ messages in thread
From: John Hubbard @ 2020-01-07  1:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: Leon Romanovsky, Jason Gunthorpe, Andrew Morton, Al Viro,
	Alex Williamson, Benjamin Herrenschmidt, Björn Töpel,
	Christoph Hellwig, Dan Williams, Daniel Vetter, Dave Chinner,
	David Airlie, David S . Miller, Ira Weiny, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, Maor Gottlieb, Ran Rozenstein

On 1/6/20 1:01 AM, Jan Kara wrote:
...
>> Also, looking ahead:
>>
>> a) if the problem disappears with the latest above test, then we likely have
>>     a huge page refcount overflow, and there are a couple of different ways to
>>     fix it.
>>
>> b) if it still reproduces with the above, then it's some other random mistake,
>>     and in that case I'd be inclined to do a sort of guided (or classic, unguided)
>>     git bisect of the series. Because it could be any of several patches.
>>
>>     If that's too much trouble, then I'd have to fall back to submitting a few
>>     patches at a time and working my way up to the tracking patch...
> 
> It could also be that an ordinary page reference is dropped with 'unpin'
> thus underflowing the page refcount...
> 
> 								Honza
> 

Yes.

And, I think I'm about out of time for this release cycle, so I'm probably going to
submit the prerequisite patches (patches 1-10, or more boldly, 1-22), for candidates
for 5.6.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 67+ messages in thread

end of thread, back to index

Thread overview: 67+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-16 22:25 [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN John Hubbard
2019-12-16 22:25 ` [PATCH v11 01/25] mm/gup: factor out duplicate code from four routines John Hubbard
2019-12-18 15:52   ` Kirill A. Shutemov
2019-12-18 22:15     ` John Hubbard
2019-12-18 22:45       ` Kirill A. Shutemov
2019-12-16 22:25 ` [PATCH v11 02/25] mm/gup: move try_get_compound_head() to top, fix minor issues John Hubbard
2019-12-16 22:25 ` [PATCH v11 03/25] mm: Cleanup __put_devmap_managed_page() vs ->page_free() John Hubbard
2019-12-16 22:25 ` [PATCH v11 04/25] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages John Hubbard
2019-12-18 16:04   ` Kirill A. Shutemov
2019-12-19  0:32     ` John Hubbard
2019-12-19  0:40     ` [PATCH v12] " John Hubbard
2019-12-19  5:27   ` [PATCH v11 04/25] " Dan Williams
2019-12-19  5:48     ` John Hubbard
2019-12-19  6:52       ` Dan Williams
2019-12-19  7:33         ` John Hubbard
2019-12-16 22:25 ` [PATCH v11 05/25] goldish_pipe: rename local pin_user_pages() routine John Hubbard
2019-12-16 22:25 ` [PATCH v11 06/25] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM John Hubbard
2019-12-18 16:19   ` Kirill A. Shutemov
2019-12-18 22:15     ` John Hubbard
2019-12-16 22:25 ` [PATCH v11 07/25] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call John Hubbard
2019-12-16 22:25 ` [PATCH v11 08/25] mm/gup: allow FOLL_FORCE for get_user_pages_fast() John Hubbard
2019-12-16 22:25 ` [PATCH v11 09/25] IB/umem: use get_user_pages_fast() to pin DMA pages John Hubbard
2019-12-16 22:25 ` [PATCH v11 10/25] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
2019-12-16 22:25 ` [PATCH v11 11/25] goldish_pipe: convert to pin_user_pages() and put_user_page() John Hubbard
2019-12-16 22:25 ` [PATCH v11 12/25] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP John Hubbard
2019-12-16 22:25 ` [PATCH v11 13/25] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
2019-12-16 22:25 ` [PATCH v11 14/25] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
2019-12-16 22:25 ` [PATCH v11 15/25] fs/io_uring: set FOLL_PIN via pin_user_pages() John Hubbard
2019-12-16 22:25 ` [PATCH v11 16/25] net/xdp: " John Hubbard
2019-12-16 22:25 ` [PATCH v11 17/25] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
2019-12-16 22:25 ` [PATCH v11 18/25] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
2019-12-16 22:25 ` [PATCH v11 19/25] vfio, mm: " John Hubbard
2019-12-16 22:25 ` [PATCH v11 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page() John Hubbard
2019-12-16 22:25 ` [PATCH v11 21/25] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" John Hubbard
2019-12-16 22:25 ` [PATCH v11 22/25] mm, tree-wide: rename put_user_page*() to unpin_user_page*() John Hubbard
2019-12-16 22:25 ` [PATCH v11 23/25] mm/gup: track FOLL_PIN pages John Hubbard
2019-12-17 14:19   ` [PATCH v12 " John Hubbard
2019-12-16 22:25 ` [PATCH v11 24/25] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
2019-12-16 22:25 ` [PATCH v11 25/25] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
2019-12-17  7:39 ` [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN Jan Kara
2019-12-19 13:26 ` Leon Romanovsky
2019-12-19 20:30   ` John Hubbard
2019-12-19 21:07     ` Jason Gunthorpe
2019-12-19 21:13       ` John Hubbard
2019-12-20 13:34         ` Jason Gunthorpe
2019-12-21  0:32           ` Dan Williams
2019-12-23 18:24             ` Jason Gunthorpe
2019-12-19 22:58       ` John Hubbard
2019-12-20 18:48         ` Leon Romanovsky
2019-12-20 23:13           ` John Hubbard
2019-12-20 18:29       ` Leon Romanovsky
2019-12-20 23:54         ` John Hubbard
2019-12-21 10:08           ` Leon Romanovsky
2019-12-21 23:59             ` John Hubbard
2019-12-22 13:23           ` Leon Romanovsky
2019-12-25  2:03             ` John Hubbard
2019-12-25  5:26               ` Leon Romanovsky
2019-12-27 21:56                 ` John Hubbard
2019-12-29  4:33                   ` John Hubbard
2020-01-06  9:01                     ` Jan Kara
2020-01-07  1:26                       ` John Hubbard
2019-12-20  9:21     ` Jan Kara
2019-12-21  0:02       ` John Hubbard
2019-12-21  0:33       ` Dan Williams
2019-12-21  0:41         ` John Hubbard
2019-12-21  0:51           ` Dan Williams
2019-12-21  0:53             ` John Hubbard

Linux-kselftest Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-kselftest/0 linux-kselftest/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-kselftest linux-kselftest/ https://lore.kernel.org/linux-kselftest \
		linux-kselftest@vger.kernel.org
	public-inbox-index linux-kselftest

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kselftest


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git