Linux-Fsdevel Archive on lore.kernel.org
* [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM
@ 2019-11-03 21:17 John Hubbard
  2019-11-03 21:17 ` [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions John Hubbard
                   ` (17 more replies)
  0 siblings, 18 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Hi,

Changes since v1:

* Changed the function signature of __huge_pt_done() from int to void.
* Renamed __remove_refs_from_head() to put_compound_head().
* Improved the comment documentation in mm.h and gup.c
* Merged Documentation/vm/pin_user_pages.rst into the "introduce
  FOLL_PIN" patch.
* Fixed Documentation/vm/pin_user_pages.rst:
     * Fixed up a TODO about DAX.
     * 31, not 32 bits total are available for counting
* Deleted some stale comments from the commit description of the
  VFIO patch.
* Added Reviewed-by tags from Ira Weiny and Jens Axboe, and Acked-by
  from Björn Töpel.

======================================================================
Original cover letter (edited to fix up the patch description numbers)

This applies cleanly to linux-next and mmotm, and also to linux.git if
linux-next's commit 20cac10710c9 ("mm/gup_benchmark: fix MAP_HUGETLB
case") is first applied there.

This provides tracking of dma-pinned pages. This is a prerequisite to
solving the larger problem of proper interactions between file-backed
pages, and [R]DMA activities, as discussed in [1], [2], [3], and in
a remarkable number of email threads since about 2017. :)

A new internal gup flag, FOLL_PIN, is introduced and thoroughly
documented in Documentation/vm/pin_user_pages.rst, which is added by
the "introduce FOLL_PIN" patch.

I believe that this will provide a good starting point for doing the
layout lease work that Ira Weiny has been working on. That's because
these new wrapper functions provide a clean, constrained, systematically
named set of functionality that, again, is required in order to even
know if a page is "dma-pinned".

In contrast to earlier approaches, the page tracking can be
incrementally applied to the kernel call sites that, until now, have
been simply calling get_user_pages() ("gup"). In other words, opt-in by
changing from this:

    get_user_pages() (sets FOLL_GET)
    put_page()

to this:
    pin_user_pages() (sets FOLL_PIN)
    put_user_page()

Because there are interdependencies with FOLL_LONGTERM, a similar
conversion was applied to its callers. The change was from this:

    get_user_pages(FOLL_LONGTERM) (also sets FOLL_GET)
    put_page()

to this:
    pin_longterm_pages() (sets FOLL_PIN | FOLL_LONGTERM)
    put_user_page()
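
To illustrate the pairing rule above, here is a minimal userspace sketch.
It is not kernel code and not part of this series: struct fake_page and the
fake_*() helpers are hypothetical stand-ins for struct page,
get_user_pages()/put_page() and pin_user_pages()/put_user_page(), and keeping
the pin count in a separate field is an assumption made only to keep the
model small.

```c
#include <assert.h>

/* Hypothetical userspace stand-in for struct page (not the kernel layout). */
struct fake_page {
	int refcount;	/* ordinary references, FOLL_GET style */
	int pincount;	/* dma-pin references, FOLL_PIN style (separate field is an assumption) */
};

/* Models the reference taken by get_user_pages(). */
static void fake_get(struct fake_page *p)
{
	p->refcount++;
}

/* Models put_page(): releases a reference taken by fake_get(). */
static void fake_put(struct fake_page *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/* Models the pin taken by pin_user_pages(). */
static void fake_pin(struct fake_page *p)
{
	p->pincount++;
}

/* Models put_user_page(): releases a pin taken by fake_pin(). */
static void fake_unpin(struct fake_page *p)
{
	assert(p->pincount > 0);
	p->pincount--;
}

/* Only pages acquired through the pin path report as dma-pinned. */
static int fake_is_dma_pinned(const struct fake_page *p)
{
	return p->pincount > 0;
}
```

The only point of the model is that each acquire style has exactly one
matching release, which is what allows call sites to be converted one at a
time.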

============================================================
Patch summary:

* Patches 1-4: refactoring and preparatory cleanup, independent fixes
    (Patch 4: V4L2-core bug fix (can be separately applied))

* Patch 5: introduce pin_user_pages(), FOLL_PIN, but no functional
           changes yet
* Patches 6-11: Convert existing put_user_page() callers, to use the
                new pin*()
* Patch 12: Activate tracking of FOLL_PIN pages.
* Patches 13-15: convert FOLL_LONGTERM callers
* Patches 16-17: gup_benchmark and run_vmtests support
* Patch 18: enforce FOLL_LONGTERM as a gup-internal (only) flag

============================================================
Testing:

* I've done some overall kernel testing (LTP, and a few other goodies),
  and some directed testing to exercise some of the changes. And as you
  can see, gup_benchmark is enhanced to exercise this. Basically, I've been
  able to runtime test the core get_user_pages() and pin_user_pages() and
  related routines, but not so much on several of the call sites--but those
  are generally just a couple of lines changed, each.

  Not much of the kernel is actually using this, which on one hand
  reduces risk quite a lot. But on the other hand, testing coverage
  is low. So I'd love it if, in particular, the Infiniband and PowerPC
  folks could do a smoke test of this series for me.

  Also, my runtime testing for the call sites so far is very weak:

    * io_uring: Some directed tests from liburing exercise this, and they pass.
    * process_vm_access.c: A small directed test passes.
    * gup_benchmark: the enhanced version hits the new gup.c code, and passes.
    * infiniband (still only have crude "IB pingpong" working, on a
                  good day: it's not exercising my conversions at runtime...)
    * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                      not ready just yet)
    * powerpc: it compiles...
    * drm/via: compiles...
    * goldfish: compiles...
    * net/xdp: compiles...
    * media/v4l2: compiles...

============================================================
Next:

* Get the block/bio_vec sites converted to use pin_user_pages().

* Work with Ira and Dave Chinner to weave this together with the
  layout lease stuff.

============================================================

[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
[2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
[3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/

John Hubbard (18):
  mm/gup: pass flags arg to __gup_device_* functions
  mm/gup: factor out duplicate code from four routines
  goldish_pipe: rename local pin_user_pages() routine
  media/v4l2-core: set pages dirty upon releasing DMA buffers
  mm/gup: introduce pin_user_pages*() and FOLL_PIN
  goldish_pipe: convert to pin_user_pages() and put_user_page()
  infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  drm/via: set FOLL_PIN via pin_user_pages_fast()
  fs/io_uring: set FOLL_PIN via pin_user_pages()
  net/xdp: set FOLL_PIN via pin_user_pages()
  mm/gup: track FOLL_PIN pages
  media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page()
    conversion
  vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
  powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
  mm/gup_benchmark: support pin_user_pages() and related calls
  selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN
    coverage
  mm/gup: remove support for gup(FOLL_LONGTERM)

 Documentation/vm/index.rst                  |   1 +
 Documentation/vm/pin_user_pages.rst         | 212 +++++++
 arch/powerpc/mm/book3s64/iommu_api.c        |  15 +-
 drivers/gpu/drm/via/via_dmablit.c           |   2 +-
 drivers/infiniband/core/umem.c              |   5 +-
 drivers/infiniband/core/umem_odp.c          |  10 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |   4 +-
 drivers/infiniband/hw/mthca/mthca_memfree.c |   3 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   8 +-
 drivers/infiniband/hw/qib/qib_user_sdma.c   |   2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   9 +-
 drivers/infiniband/sw/siw/siw_mem.c         |   5 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |  10 +-
 drivers/platform/goldfish/goldfish_pipe.c   |  35 +-
 drivers/vfio/vfio_iommu_type1.c             |  15 +-
 fs/io_uring.c                               |   5 +-
 include/linux/mm.h                          | 142 ++++-
 include/linux/mmzone.h                      |   2 +
 include/linux/page_ref.h                    |  10 +
 mm/gup.c                                    | 594 ++++++++++++++++----
 mm/gup_benchmark.c                          |  81 ++-
 mm/huge_memory.c                            |  32 +-
 mm/hugetlb.c                                |  28 +-
 mm/memremap.c                               |   4 +-
 mm/process_vm_access.c                      |  28 +-
 mm/vmstat.c                                 |   2 +
 net/xdp/xdp_umem.c                          |   4 +-
 tools/testing/selftests/vm/gup_benchmark.c  |  28 +-
 tools/testing/selftests/vm/run_vmtests      |  22 +
 29 files changed, 1054 insertions(+), 264 deletions(-)
 create mode 100644 Documentation/vm/pin_user_pages.rst

-- 
2.23.0


* [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
@ 2019-11-03 21:17 ` John Hubbard
  2019-11-04 16:39   ` Jerome Glisse
  2019-11-03 21:17 ` [PATCH v2 02/18] mm/gup: factor out duplicate code from four routines John Hubbard
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Kirill A . Shutemov

A subsequent patch requires access to gup flags, so
pass the flags argument through to the __gup_device_*
functions.

Also placate checkpatch.pl by shortening a nearby line.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..85caf76b3012 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+			     unsigned long end, unsigned int flags,
+			     struct page **pages, int *nr)
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
@@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
@@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	unsigned long fault_pfn;
 	int nr_start = *nr;
 
 	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
+	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
@@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 }
 #else
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
 }
 
 static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
-		unsigned long end, struct page **pages, int *nr)
+				 unsigned long end, unsigned int flags,
+				 struct page **pages, int *nr)
 {
 	BUILD_BUG();
 	return 0;
@@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
+		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
+					     pages, nr);
 	}
 
 	refs = 0;
@@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 }
 
 static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
-		unsigned long end, unsigned int flags, struct page **pages, int *nr)
+			unsigned long end, unsigned int flags,
+			struct page **pages, int *nr)
 {
 	struct page *head, *page;
 	int refs;
@@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
-		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
+		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
+					     pages, nr);
 	}
 
 	refs = 0;
-- 
2.23.0


* [PATCH v2 02/18] mm/gup: factor out duplicate code from four routines
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
  2019-11-03 21:17 ` [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions John Hubbard
@ 2019-11-03 21:17 ` John Hubbard
  2019-11-04 16:51   ` Jerome Glisse
  2019-11-03 21:17 ` [PATCH v2 03/18] goldish_pipe: rename local pin_user_pages() routine John Hubbard
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard, Christoph Hellwig,
	Aneesh Kumar K . V

There are four locations in gup.c that have a fair amount of code
duplication. This means that changing one requires making the same
changes in four places, not to mention reading the same code four
times, and wondering if there are subtle differences.

Factor out the common code into static functions, thus reducing the
overall line count and the code's complexity.

Also, take the opportunity to slightly improve the efficiency of the
error cases, by doing a mass subtraction of the refcount, surrounded
by get_page()/put_page().

Also, further simplify (slightly), by waiting until the successful
end of each routine, to increment *nr.
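
The two ideas above can be modeled in userspace as a rough sketch. This is
not the kernel code: struct fake_page, its "freed" field, FAKE_PAGE_SIZE and
the fake_*() helpers are hypothetical stand-ins. The model assumes that
page_ref_sub() is a plain subtraction that never frees a page, and that
put_page() is what handles the final release, which would explain the
get_page()/put_page() bracket around the bulk subtraction.

```c
#include <assert.h>

#define FAKE_PAGE_SIZE 4096UL	/* assumed page size for this model */

/* Hypothetical stand-in for struct page; "freed" records the final put. */
struct fake_page {
	int refcount;
	int freed;
};

/* Models get_page(). */
static void fake_get_page(struct fake_page *p)
{
	p->refcount++;
}

/* Models page_ref_sub(): a plain subtraction that never frees the page. */
static void fake_page_ref_sub(struct fake_page *p, int nr)
{
	p->refcount -= nr;
}

/* Models put_page(): the only place in this model that frees the page. */
static void fake_put_page(struct fake_page *p)
{
	if (--p->refcount == 0)
		p->freed = 1;
}

/* Same shape as put_compound_head() in this patch. */
static void fake_put_compound_head(struct fake_page *p, int refs)
{
	assert(refs >= 0 && refs <= p->refcount);
	fake_get_page(p);		/* hold the page across the subtraction */
	fake_page_ref_sub(p, refs);	/* drop all refs in one step */
	fake_put_page(p);		/* the final drop goes through the put path */
}

/*
 * Same loop as __record_subpages(): fills pages[] starting at index nr and
 * returns how many entries were written. nr is passed by value, so the
 * caller's counter only changes if the caller later commits the result.
 */
static int fake_record_subpages(struct fake_page *page, unsigned long addr,
				unsigned long end, struct fake_page **pages,
				int nr)
{
	int nr_recorded_pages = 0;

	do {
		pages[nr] = page;
		nr++;
		page++;
		nr_recorded_pages++;
	} while (addr += FAKE_PAGE_SIZE, addr != end);
	return nr_recorded_pages;
}
```

In this model an error path simply returns without touching the caller's
counter, so there is nothing to undo, and the references are dropped in a
single step rather than in a loop.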

Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c | 104 ++++++++++++++++++++++++-------------------------------
 1 file changed, 45 insertions(+), 59 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 85caf76b3012..199da99e8ffc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1969,6 +1969,34 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
+static int __record_subpages(struct page *page, unsigned long addr,
+			     unsigned long end, struct page **pages, int nr)
+{
+	int nr_recorded_pages = 0;
+
+	do {
+		pages[nr] = page;
+		nr++;
+		page++;
+		nr_recorded_pages++;
+	} while (addr += PAGE_SIZE, addr != end);
+	return nr_recorded_pages;
+}
+
+static void put_compound_head(struct page *page, int refs)
+{
+	/* Do a get_page() first, in case refs == page->_refcount */
+	get_page(page);
+	page_ref_sub(page, refs);
+	put_page(page);
+}
+
+static void __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
+{
+	*nr += nr_recorded_pages;
+	SetPageReferenced(head);
+}
+
 #ifdef CONFIG_ARCH_HAS_HUGEPD
 static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
 				      unsigned long sz)
@@ -1998,33 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	/* hugepages are never "special" */
 	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-	refs = 0;
 	head = pte_page(pte);
-
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
-	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = __record_subpages(page, addr, end, pages, *nr);
 
 	head = try_get_compound_head(head, refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		/* Could be optimized better */
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
-	SetPageReferenced(head);
+	__huge_pt_done(head, refs, nr);
 	return 1;
 }
 
@@ -2071,29 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 					     pages, nr);
 	}
 
-	refs = 0;
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = __record_subpages(page, addr, end, pages, *nr);
 
 	head = try_get_compound_head(pmd_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
-	SetPageReferenced(head);
+	__huge_pt_done(head, refs, nr);
 	return 1;
 }
 
@@ -2114,29 +2119,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 					     pages, nr);
 	}
 
-	refs = 0;
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = __record_subpages(page, addr, end, pages, *nr);
 
 	head = try_get_compound_head(pud_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
-	SetPageReferenced(head);
+	__huge_pt_done(head, refs, nr);
 	return 1;
 }
 
@@ -2151,29 +2146,20 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 		return 0;
 
 	BUILD_BUG_ON(pgd_devmap(orig));
-	refs = 0;
+
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
-	do {
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (addr += PAGE_SIZE, addr != end);
+	refs = __record_subpages(page, addr, end, pages, *nr);
 
 	head = try_get_compound_head(pgd_page(orig), refs);
-	if (!head) {
-		*nr -= refs;
+	if (!head)
 		return 0;
-	}
 
 	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
-		*nr -= refs;
-		while (refs--)
-			put_page(head);
+		put_compound_head(head, refs);
 		return 0;
 	}
 
-	SetPageReferenced(head);
+	__huge_pt_done(head, refs, nr);
 	return 1;
 }
 
-- 
2.23.0


* [PATCH v2 03/18] goldish_pipe: rename local pin_user_pages() routine
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
  2019-11-03 21:17 ` [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions John Hubbard
  2019-11-03 21:17 ` [PATCH v2 02/18] mm/gup: factor out duplicate code from four routines John Hubbard
@ 2019-11-03 21:17 ` John Hubbard
  2019-11-04 16:52   ` Jerome Glisse
  2019-11-03 21:17 ` [PATCH v2 04/18] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Avoid naming conflicts: rename local static function from
"pin_user_pages()" to "pin_goldfish_pages()".

An upcoming patch will introduce a global pin_user_pages()
function.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/platform/goldfish/goldfish_pipe.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index cef0133aa47a..7ed2a21a0bac 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
 	}
 }
 
-static int pin_user_pages(unsigned long first_page,
-			  unsigned long last_page,
-			  unsigned int last_page_size,
-			  int is_write,
-			  struct page *pages[MAX_BUFFERS_PER_COMMAND],
-			  unsigned int *iter_last_page_size)
+static int pin_goldfish_pages(unsigned long first_page,
+			      unsigned long last_page,
+			      unsigned int last_page_size,
+			      int is_write,
+			      struct page *pages[MAX_BUFFERS_PER_COMMAND],
+			      unsigned int *iter_last_page_size)
 {
 	int ret;
 	int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
@@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 	if (mutex_lock_interruptible(&pipe->lock))
 		return -ERESTARTSYS;
 
-	pages_count = pin_user_pages(first_page, last_page,
-				     last_page_size, is_write,
-				     pipe->pages, &iter_last_page_size);
+	pages_count = pin_goldfish_pages(first_page, last_page,
+					 last_page_size, is_write,
+					 pipe->pages, &iter_last_page_size);
 	if (pages_count < 0) {
 		mutex_unlock(&pipe->lock);
 		return pages_count;
-- 
2.23.0


* [PATCH v2 04/18] media/v4l2-core: set pages dirty upon releasing DMA buffers
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (2 preceding siblings ...)
  2019-11-03 21:17 ` [PATCH v2 03/18] goldish_pipe: rename local pin_user_pages() routine John Hubbard
@ 2019-11-03 21:17 ` John Hubbard
  2019-11-10 10:10   ` Hans Verkuil
  2019-11-03 21:18 ` [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

After DMA is complete, and the device and CPU caches are synchronized,
it's still required to mark the CPU pages as dirty, if the data was
coming from the device. However, this driver was just issuing a
bare put_page() call, without any set_page_dirty*() call.

Fix the problem, by calling set_page_dirty_lock() if the CPU pages
were potentially receiving data from the device.

Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..28262190c3ab 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
 	BUG_ON(dma->sglen);
 
 	if (dma->pages) {
-		for (i = 0; i < dma->nr_pages; i++)
+		for (i = 0; i < dma->nr_pages; i++) {
+			if (dma->direction == DMA_FROM_DEVICE)
+				set_page_dirty_lock(dma->pages[i]);
 			put_page(dma->pages[i]);
+		}
 		kfree(dma->pages);
 		dma->pages = NULL;
 	}
-- 
2.23.0


* [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (3 preceding siblings ...)
  2019-11-03 21:17 ` [PATCH v2 04/18] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-04 17:33   ` Jerome Glisse
                     ` (2 more replies)
  2019-11-03 21:18 ` [PATCH v2 06/18] goldish_pipe: convert to pin_user_pages() and put_user_page() John Hubbard
                   ` (12 subsequent siblings)
  17 siblings, 3 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Introduce pin_user_pages*() variations of get_user_pages*() calls,
and also pin_longterm_pages*() variations.

These variants all set FOLL_PIN, which is also introduced, and
thoroughly documented.

The pin_longterm*() variants also set FOLL_LONGTERM, in addition
to FOLL_PIN:

    pin_user_pages()
    pin_user_pages_remote()
    pin_user_pages_fast()

    pin_longterm_pages()
    pin_longterm_pages_remote()
    pin_longterm_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* These are gup-internal flags, so the call sites should not directly
  set FOLL_PIN or FOLL_LONGTERM. That behavior is enforced with
  assertions, for the new FOLL_PIN flag. However, for the pre-existing
  FOLL_LONGTERM flag, which has some call sites that still directly
  set FOLL_LONGTERM, there is no assertion yet.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
  will:
        * Start with "pin_user_pages" instead of "get_user_pages". That
          makes it easy to find and audit the call sites.
        * Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().
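
The documentation added by this patch describes tracking pins by adding a
bias to page->_refcount. A rough userspace model of that arithmetic follows.
It is a sketch under stated assumptions, not the kernel implementation:
GUP_PIN_COUNTING_BIAS and its value of 1024 come from the documentation,
while struct fake_page, the fake_*() helpers (including the name
fake_page_dma_pinned()) and max_pins_of_1gb_page() are hypothetical and exist
only for illustration.

```c
#include <assert.h>

#define GUP_PIN_COUNTING_BIAS 1024	/* value stated in the documentation: 10 bits */

/* Hypothetical stand-in for struct page: one shared counter, as in the docs. */
struct fake_page {
	int refcount;
};

/* Models get_page(): +1. */
static void fake_get_page(struct fake_page *p)
{
	p->refcount++;
}

/* Models put_page(): -1. */
static void fake_put_page(struct fake_page *p)
{
	p->refcount--;
}

/* Models a FOLL_PIN acquire: adds the bias instead of 1. */
static void fake_pin_page(struct fake_page *p)
{
	p->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Models put_user_page(): removes one bias worth of references. */
static void fake_put_user_page(struct fake_page *p)
{
	assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
	p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* Fuzzy check: false positives are possible, false negatives are not. */
static int fake_page_dma_pinned(const struct fake_page *p)
{
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}

/*
 * Rough upper bound on simultaneous pins of one 1 GB huge page when every
 * 4 KB subpage adds one bias to the head page: 31 counter bits, minus 10
 * bias bits, minus 18 bits for the 256K subpages.
 */
static long max_pins_of_1gb_page(void)
{
	const int counter_bits = 31;
	const int bias_bits = 10;
	const int subpage_bits = 18;

	return 1L << (counter_bits - bias_bits - subpage_bits);
}
```

The model shows both properties the documentation calls out: 1024 ordinary
get_page() calls look like a single pin (an acceptable false positive), and a
1 GB huge page leaves room for only about 8 pins.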

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded on it slightly.)

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 Documentation/vm/index.rst          |   1 +
 Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++
 include/linux/mm.h                  |  62 ++++++-
 mm/gup.c                            | 265 +++++++++++++++++++++++++---
 4 files changed, 514 insertions(+), 26 deletions(-)
 create mode 100644 Documentation/vm/pin_user_pages.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index e8d943b21cf9..7194efa3554a 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -44,6 +44,7 @@ descriptions of data structures and algorithms.
    page_migration
    page_frags
    page_owner
+   pin_user_pages
    remap_file_pages
    slub
    split_page_table_lock
diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
new file mode 100644
index 000000000000..3910f49ca98c
--- /dev/null
+++ b/Documentation/vm/pin_user_pages.rst
@@ -0,0 +1,212 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+pin_user_pages() and related calls
+====================================================
+
+.. contents:: :local:
+
+Overview
+========
+
+This document describes the following functions: ::
+
+ pin_user_pages
+ pin_user_pages_fast
+ pin_user_pages_remote
+
+ pin_longterm_pages
+ pin_longterm_pages_fast
+ pin_longterm_pages_remote
+
+Basic description of FOLL_PIN
+=============================
+
+A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
+significant interactions and interdependencies with FOLL_LONGTERM, so both are
+covered here.
+
+Both FOLL_PIN and FOLL_LONGTERM are "internal" to gup, meaning that neither
+FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows
+the associated wrapper functions (pin_user_pages and others) to set the correct
+combination of these flags, and to check for problems as well.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===================================
+
+Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to
+whatever flags the caller provides::
+
+ Function                    gup flags (FOLL_PIN or FOLL_LONGTERM only)
+ --------                    ------------------------------------------
+ pin_user_pages              FOLL_PIN
+ pin_user_pages_fast         FOLL_PIN
+ pin_user_pages_remote       FOLL_PIN
+
+ pin_longterm_pages          FOLL_PIN | FOLL_LONGTERM
+ pin_longterm_pages_fast     FOLL_PIN | FOLL_LONGTERM
+ pin_longterm_pages_remote   FOLL_PIN | FOLL_LONGTERM
+
+Tracking dma-pinned pages
+=========================
+
+Some of the key design constraints, and solutions, for tracking dma-pinned
+pages:
+
+* An actual reference count, per struct page, is required. This is because
+  multiple processes may pin and unpin a page.
+
+* False positives (reporting that a page is dma-pinned, when in fact it is not)
+  are acceptable, but false negatives are not.
+
+* struct page may not be increased in size for this, and all fields are already
+  used.
+
+* Given the above, we can overload the page->_refcount field by using, sort of,
+  the upper bits in that field for a dma-pinned count. "Sort of" means that,
+  rather than dividing page->_refcount into bit fields, we simply add a
+  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024:
+  10 bits) to page->_refcount. This provides fuzzy behavior: if a page has
+  get_page() called on it 1024 times, then it will appear to have a single
+  dma-pinned count. And again, that's acceptable.
+
+This also leads to limitations: there are only 31-10==21 bits available for a
+counter that increments 10 bits at a time.
+
+TODO: for 1GB and larger huge pages, this is cutting it close. That's because
+when pin_user_pages() follows such pages, it increments the head page by "1"
+(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
+pin_user_pages()) for each tail page. So if you have a 1GB huge page:
+
+* There are 256K (18 bits) worth of 4 KB tail pages.
+* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
+  10 bits at a time)
+* There are 21 - 18 == 3 bits available to count. Except that there aren't,
+  because you need to allow for a few normal get_page() calls on the head page,
+  as well. Fortunately, the approach of using addition, rather than "hard"
+  bitfields, within page->_refcount, allows for sharing these bits gracefully.
+  But we're still looking at about 8 references.
+
+This, however, is a missing feature more than anything else, because it's easily
+solved by addressing an obvious inefficiency in the original get_user_pages()
+approach of retrieving pages: stop treating all the pages as if they were
+PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
+this, so some work is required. Once that's in place, this limitation mostly
+disappears from view, because there will be ample refcounting range available.
+
+* Callers must specifically request "dma-pinned tracking of pages". In other
+  words, just calling get_user_pages() will not suffice; a new set of functions,
+  pin_user_pages() and related, must be used.
+
+FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
+==========================================================
+
+Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
+these categories:
+
+CASE 1: Direct IO (DIO)
+-----------------------
+There are GUP references to pages that are serving
+as DIO buffers. These buffers are needed for a relatively short time (so they
+are not "long term"). No special synchronization with page_mkclean() or
+munmap() is provided. Therefore, flags to set at the call site are: ::
+
+    FOLL_PIN
+
+...but rather than setting FOLL_PIN directly, call sites should use one of
+the pin_user_pages*() routines that set FOLL_PIN.
+
+CASE 2: RDMA
+------------
+There are GUP references to pages that are serving as DMA
+buffers. These buffers are needed for a long time ("long term"). No special
+synchronization with page_mkclean() or munmap() is provided. Therefore, flags
+to set at the call site are: ::
+
+    FOLL_PIN | FOLL_LONGTERM
+
+NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
+because DAX pages do not have a separate page cache, and so "pinning" implies
+locking down file system blocks, which is not (yet) supported in that way.
+
+CASE 3: ODP
+-----------
+(Mellanox/Infiniband On Demand Paging: the hardware supports
+replayable page faulting). There are GUP references to pages serving as DMA
+buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
+and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
+needs to be set.
+
+CASE 4: Pinning for struct page manipulation only
+-------------------------------------------------
+Here, normal GUP calls are sufficient, so neither flag needs to be set.
+
+page_dma_pinned(): the whole point of pinning
+=============================================
+
+The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
+to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
+(and file system writeback code in general) to make informed decisions about
+what to do when a page cannot be unmapped due to such pins.
+
+What to do in those cases is the subject of a years-long series of discussions
+and debates (see the References at the end of this document). It's a TODO item
+here: fill in the details once that's worked out. Meanwhile, it's safe to say
+that having this available: ::
+
+        static inline bool page_dma_pinned(struct page *page)
+
+...is a prerequisite to solving the long-running gup+DMA problem.
+
+Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
+===================================================================
+
+Another way of thinking about these flags is as a progression of restrictions:
+FOLL_GET is for struct page manipulation, without affecting the data that the
+struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
+short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
+a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
+restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
+will be pinned longterm, and whose data will be accessed.
+
+Unit testing
+============
+This file::
+
+ tools/testing/selftests/vm/gup_benchmark.c
+
+has the following new calls to exercise the new pin*() wrapper functions:
+
+* PIN_FAST_BENCHMARK (./gup_benchmark -a)
+* PIN_LONGTERM_BENCHMARK (./gup_benchmark -a)
+* PIN_BENCHMARK (./gup_benchmark -a)
+
+You can monitor how many total dma-pinned pages have been acquired and released
+since the system was booted, via two new /proc/vmstat entries: ::
+
+    /proc/vmstat/nr_foll_pin_requested
+    /proc/vmstat/nr_foll_pin_returned
+
+Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
+because there is a noticeable performance drop in put_user_page(), when they
+are activated.
+
+References
+==========
+
+* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
+* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
+* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+
+John Hubbard, October, 2019
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cc292273e6ba..cdfb6fedb271 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1526,9 +1526,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas, int *locked);
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked);
+long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			       unsigned long start, unsigned long nr_pages,
+			       unsigned int gup_flags, struct page **pages,
+			       struct vm_area_struct **vmas, int *locked);
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas);
+long pin_longterm_pages(unsigned long start, unsigned long nr_pages,
+			unsigned int gup_flags, struct page **pages,
+			struct vm_area_struct **vmas);
 long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
@@ -1536,6 +1550,10 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 
 int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages);
+int pin_longterm_pages_fast(unsigned long start, int nr_pages,
+			    unsigned int gup_flags, struct page **pages);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
@@ -2594,13 +2612,15 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
 
 /*
- * NOTE on FOLL_LONGTERM:
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
  *
  * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control.  This is contrasted with
- * iov_iter_get_pages() where usages which are transient.
+ * period _often_ under userspace control.  This is in contrast to
+ * iov_iter_get_pages(), where usages are transient.
  *
  * FIXME: For pages which are part of a filesystem, mappings are subject to the
  * lifetime enforced by the filesystem and we need guarantees that longterm
@@ -2615,11 +2635,41 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
  * Currently only get_user_pages() and get_user_pages_fast() support this flag
  * and calls to get_user_pages_[un]locked are specifically not allowed.  This
  * is due to an incompatibility with the FS DAX check and
- * FAULT_FLAG_ALLOW_RETRY
+ * FAULT_FLAG_ALLOW_RETRY.
  *
- * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
- * that region.  And so CMA attempts to migrate the page before pinning when
+ * In the CMA case: long term pins in a CMA region would unnecessarily fragment
+ * that region.  And so, CMA attempts to migrate the page before pinning, when
  * FOLL_LONGTERM is specified.
+ *
+ * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
+ * but an additional pin counting system) will be invoked. This is intended for
+ * anything that gets a page reference and then touches page data (for example,
+ * Direct IO). This lets the filesystem know that some non-file-system entity is
+ * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
+ * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
+ * a call to put_user_page().
+ *
+ * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
+ * and separate refcounting mechanisms, however, and that means that each has
+ * its own acquire and release mechanisms:
+ *
+ *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
+ *
+ *     FOLL_PIN: pin_user_pages*() or pin_longterm_pages*() to acquire, and
+ *               put_user_pages to release.
+ *
+ * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
+ * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
+ * calls applied to them, and that's perfectly OK. This is a constraint on the
+ * callers, not on the pages.)
+ *
+ * FOLL_PIN and FOLL_LONGTERM should be set internally by the pin_user_page*()
+ * and pin_longterm_*() APIs, never directly by the caller. That's in order to
+ * help avoid mismatches when releasing pages: get_user_pages*() pages must be
+ * released via put_page(), while pin_user_pages*() pages must be released via
+ * put_user_page().
+ *
+ * Please see Documentation/vm/pin_user_pages.rst for more information.
  */
 
 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
diff --git a/mm/gup.c b/mm/gup.c
index 199da99e8ffc..1aea48427879 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -179,6 +179,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	pte_t *ptep, pte;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return ERR_PTR(-EINVAL);
 retry:
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
@@ -790,7 +794,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 
 	start = untagged_addr(start);
 
-	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
+	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
 
 	/*
 	 * If FOLL_FORCE is set then do not force a full fault as the hinting
@@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		BUG_ON(*locked != 1);
 	}
 
-	if (pages)
+	/*
+	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
+	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
+	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
+	 * for FOLL_GET, not for the newer FOLL_PIN.
+	 *
+	 * FOLL_PIN always expects pages to be non-null, but no need to assert
+	 * that here, as any failures will be obvious enough.
+	 */
+	if (pages && !(flags & FOLL_PIN))
 		flags |= FOLL_GET;
 
 	pages_done = 0;
@@ -1151,6 +1164,14 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *locked)
 {
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_page*() and
+	 * pin_longterm_*() APIs, never directly by the caller, so enforce that
+	 * with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
 	/*
 	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
 	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
@@ -1608,6 +1629,14 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas)
 {
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_page*() and
+	 * pin_longterm_*() APIs, never directly by the caller, so enforce that
+	 * with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
 	return __gup_longterm_locked(current, current->mm, start, nr_pages,
 				     pages, vmas, gup_flags | FOLL_TOUCH);
 }
@@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 	return ret;
 }
 
-/**
- * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
- *
- * Attempt to pin user pages in memory without taking mm->mmap_sem.
- * If not successful, it will fall back to taking the lock and
- * calling get_user_pages().
- *
- * Returns number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno.
- */
-int get_user_pages_fast(unsigned long start, int nr_pages,
-			unsigned int gup_flags, struct page **pages)
+static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
+					unsigned int gup_flags,
+					struct page **pages)
 {
 	unsigned long addr, len, end;
 	int nr = 0, ret = 0;
@@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 
 	return ret;
 }
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @gup_flags:	flags modifying pin behaviour
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number requested.
+ * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns
+ * -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_page*() and
+	 * pin_longterm_*() APIs, never directly by the caller, so enforce that:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
+
+/**
+ * pin_user_pages_fast() - pin user pages in memory without taking locks
+ *
+ * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
+ * get_user_pages_fast() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast);
+
+/**
+ * pin_longterm_pages_fast() - pin user pages in memory without taking locks
+ *
+ * Nearly the same as get_user_pages_fast(), except that FOLL_PIN and
+ * FOLL_LONGTERM are set. See get_user_pages_fast() for documentation on the
+ * function arguments, because the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
+ *
+ * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
+ * typically by a non-CPU device, and we cannot be sure that waiting for a
+ * pinned page to become unpinned will be effective.
+ *
+ * This is intended for Case 2 (RDMA: long-term pins) of the FOLL_PIN
+ * documentation.
+ */
+int pin_longterm_pages_fast(unsigned long start, int nr_pages,
+			    unsigned int gup_flags, struct page **pages)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= (FOLL_PIN | FOLL_LONGTERM);
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_longterm_pages_fast);
+
+/**
+ * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
+ * return the pages to the user.
+ *
+ * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
+ * get_user_pages_remote() for documentation on the function arguments, because
+ * the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked, gup_flags);
+}
+EXPORT_SYMBOL(pin_user_pages_remote);
+
+/**
+ * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
+ * return the pages to the user.
+ *
+ * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
+ * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
+ * documentation on the function arguments, because the arguments here are
+ * identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
+ *
+ * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
+ * typically by a non-CPU device, and we cannot be sure that waiting for a
+ * pinned page to become unpinned will be effective.
+ *
+ * This is intended for Case 2 (RDMA: long-term pins) in
+ * Documentation/vm/pin_user_pages.rst.
+ */
+long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			       unsigned long start, unsigned long nr_pages,
+			       unsigned int gup_flags, struct page **pages,
+			       struct vm_area_struct **vmas, int *locked)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	/*
+	 * FIXME: as noted in the get_user_pages_remote() implementation, it
+	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
+	 * needs to be set, but for now the best we can do is a "TODO" item.
+	 */
+	gup_flags |= FOLL_REMOTE | FOLL_PIN;
+
+	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
+				       locked, gup_flags);
+}
+EXPORT_SYMBOL(pin_longterm_pages_remote);
+
+/**
+ * pin_user_pages() - pin user pages in memory for use by other devices
+ *
+ * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
+ * FOLL_PIN is set.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for details.
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN;
+	return __gup_longterm_locked(current, current->mm, start, nr_pages,
+				     pages, vmas, gup_flags);
+}
+EXPORT_SYMBOL(pin_user_pages);
+
+/**
+ * pin_longterm_pages() - pin user pages in memory for long-term use (RDMA,
+ * typically)
+ *
+ * Nearly the same as get_user_pages(), except that FOLL_PIN and FOLL_LONGTERM
+ * are set. See get_user_pages() for documentation on the function
+ * arguments, because the arguments here are identical.
+ *
+ * FOLL_PIN means that the pages must be released via put_user_page(). Please
+ * see Documentation/vm/pin_user_pages.rst for further details.
+ *
+ * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
+ * typically by a non-CPU device, and we cannot be sure that waiting for a
+ * pinned page to become unpinned will be effective.
+ *
+ * This is intended for Case 2 (RDMA: long-term pins) in
+ * Documentation/vm/pin_user_pages.rst.
+ */
+long pin_longterm_pages(unsigned long start, unsigned long nr_pages,
+			unsigned int gup_flags, struct page **pages,
+			struct vm_area_struct **vmas)
+{
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
+		return -EINVAL;
+
+	gup_flags |= FOLL_PIN | FOLL_LONGTERM;
+	return __gup_longterm_locked(current, current->mm, start, nr_pages,
+				     pages, vmas, gup_flags);
+}
+EXPORT_SYMBOL(pin_longterm_pages);
-- 
2.23.0



* [PATCH v2 06/18] goldish_pipe: convert to pin_user_pages() and put_user_page()
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (4 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-03 21:18 ` [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*() John Hubbard
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

1. Call the new global pin_user_pages_fast(), from pin_goldfish_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

Another side effect is that the release code is simplified because
the page[] loop is now in gup.c instead of here, so just delete the
local release_user_pages() entirely, and call
put_user_pages_dirty_lock() directly, instead.
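
To make the simplification concrete, here is a minimal userspace sketch (not
kernel code) of the two release shapes. Everything in it is a hypothetical
stand-in: struct mock_page replaces struct page, and
mock_put_user_pages_dirty_lock() only models the role that
put_user_pages_dirty_lock() plays in this patch. The old helper mirrors the
deleted release_user_pages() loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: just enough state to observe. */
struct mock_page {
	int refs;	/* references still held by the driver */
	bool dirty;	/* set when the page is marked dirty */
};

/* Old shape: the driver walks pages[] itself and decides per page. */
static void old_release_user_pages(struct mock_page **pages, int pages_count,
				   int is_write, int consumed_size)
{
	int i;

	for (i = 0; i < pages_count; i++) {
		if (!is_write && consumed_size > 0)
			pages[i]->dirty = true;	/* models set_page_dirty() */
		pages[i]->refs--;		/* models put_page() */
	}
}

/* New shape: the loop lives in one shared helper; the caller passes a bool. */
static void mock_put_user_pages_dirty_lock(struct mock_page **pages,
					   size_t npages, bool make_dirty)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		if (make_dirty)
			pages[i]->dirty = true;	/* models set_page_dirty_lock() */
		pages[i]->refs--;		/* models put_user_page() */
	}
}
```

With the second shape, the call site collapses to a single call whose last
argument is the old per-page condition, "!is_write && consumed_size > 0",
evaluated once.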

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/platform/goldfish/goldfish_pipe.c | 17 +++--------------
 1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
index 7ed2a21a0bac..635a8bc1b480 100644
--- a/drivers/platform/goldfish/goldfish_pipe.c
+++ b/drivers/platform/goldfish/goldfish_pipe.c
@@ -274,7 +274,7 @@ static int pin_goldfish_pages(unsigned long first_page,
 		*iter_last_page_size = last_page_size;
 	}
 
-	ret = get_user_pages_fast(first_page, requested_pages,
+	ret = pin_user_pages_fast(first_page, requested_pages,
 				  !is_write ? FOLL_WRITE : 0,
 				  pages);
 	if (ret <= 0)
@@ -285,18 +285,6 @@ static int pin_goldfish_pages(unsigned long first_page,
 	return ret;
 }
 
-static void release_user_pages(struct page **pages, int pages_count,
-			       int is_write, s32 consumed_size)
-{
-	int i;
-
-	for (i = 0; i < pages_count; i++) {
-		if (!is_write && consumed_size > 0)
-			set_page_dirty(pages[i]);
-		put_page(pages[i]);
-	}
-}
-
 /* Populate the call parameters, merging adjacent pages together */
 static void populate_rw_params(struct page **pages,
 			       int pages_count,
@@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
 
 	*consumed_size = pipe->command_buffer->rw_params.consumed_size;
 
-	release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
+	put_user_pages_dirty_lock(pipe->pages, pages_count,
+				  !is_write && *consumed_size > 0);
 
 	mutex_unlock(&pipe->lock);
 	return 0;
-- 
2.23.0



* [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (5 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 06/18] goldish_pipe: convert to pin_user_pages() and put_user_page() John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-04 20:33   ` Jason Gunthorpe
  2019-11-03 21:18 ` [PATCH v2 08/18] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Convert infiniband to use the new wrapper calls, and stop
explicitly setting FOLL_LONGTERM at the call sites.

The new pin_longterm_*() calls replace get_user_pages*()
calls, and set both FOLL_LONGTERM and a new FOLL_PIN
flag. The FOLL_PIN flag requires that the caller must
return the pages via put_user_page*() calls, but
infiniband was already doing that as part of an earlier
commit.
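
As a rough illustration of what the call-site conversion relies on, the
following userspace sketch models only the flag handling of
pin_longterm_pages() and pin_longterm_pages_fast(). The function
model_pin_longterm_flags() is a hypothetical name, not part of the series.
FOLL_LONGTERM and FOLL_PIN use the values from the mm.h hunk of patch 05;
the FOLL_WRITE and FOLL_GET values are assumed from include/linux/mm.h,
which this thread does not show.

```c
#include <errno.h>

/* Values from the mm.h hunk in patch 05. */
#define FOLL_LONGTERM	0x10000
#define FOLL_PIN	0x40000
/* Assumed from include/linux/mm.h; not shown in this thread. */
#define FOLL_WRITE	0x01
#define FOLL_GET	0x04

/*
 * Hypothetical model of the wrapper's flag handling: reject FOLL_GET,
 * then add the two internal flags on the caller's behalf.
 * Returns 0 and stores the effective flags, or -EINVAL.
 */
static int model_pin_longterm_flags(unsigned int gup_flags,
				    unsigned int *effective)
{
	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
	if (gup_flags & FOLL_GET)
		return -EINVAL;

	*effective = gup_flags | FOLL_PIN | FOLL_LONGTERM;
	return 0;
}
```

A call site that used to pass "FOLL_WRITE | FOLL_LONGTERM" to
get_user_pages_fast() now passes only FOLL_WRITE and ends up with
FOLL_WRITE | FOLL_PIN | FOLL_LONGTERM. Note that, per the FIXME in patch 05,
pin_longterm_pages_remote() does not yet set FOLL_LONGTERM, so this model
does not describe that variant.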

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c              |  5 ++---
 drivers/infiniband/core/umem_odp.c          | 10 +++++-----
 drivers/infiniband/hw/hfi1/user_pages.c     |  4 ++--
 drivers/infiniband/hw/mthca/mthca_memfree.c |  3 +--
 drivers/infiniband/hw/qib/qib_user_pages.c  |  8 ++++----
 drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |  9 ++++-----
 drivers/infiniband/sw/siw/siw_mem.c         |  5 ++---
 8 files changed, 21 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 24244a2f68cc..c5a78d3e674b 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 
 	while (npages) {
 		down_read(&mm->mmap_sem);
-		ret = get_user_pages(cur_base,
+		ret = pin_longterm_pages(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
-				     gup_flags | FOLL_LONGTERM,
-				     page_list, NULL);
+				     gup_flags, page_list, NULL);
 		if (ret < 0) {
 			up_read(&mm->mmap_sem);
 			goto umem_release;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 163ff7ba92b7..a38b67b83db5 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -534,7 +534,7 @@ static int ib_umem_odp_map_dma_single_page(
 	} else if (umem_odp->page_list[page_index] == page) {
 		umem_odp->dma_list[page_index] |= access_mask;
 	} else {
-		pr_err("error: got different pages in IB device and from get_user_pages. IB device page: %p, gup page: %p\n",
+		pr_err("error: got different pages in IB device and from pin_longterm_pages. IB device page: %p, gup page: %p\n",
 		       umem_odp->page_list[page_index], page);
 		/* Better remove the mapping now, to prevent any further
 		 * damage. */
@@ -639,11 +639,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 		/*
 		 * Note: this might result in redundent page getting. We can
 		 * avoid this by checking dma_list to be 0 before calling
-		 * get_user_pages. However, this make the code much more
-		 * complex (and doesn't gain us much performance in most use
-		 * cases).
+		 * pin_longterm_pages. However, this makes the code much
+		 * more complex (and doesn't gain us much performance in most
+		 * use cases).
 		 */
-		npages = get_user_pages_remote(owning_process, owning_mm,
+		npages = pin_longterm_pages_remote(owning_process, owning_mm,
 				user_virt, gup_num_pages,
 				flags, local_page_list, NULL, NULL);
 		up_read(&owning_mm->mmap_sem);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index 469acb961fbd..9b55b0a73e29 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -104,9 +104,9 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 			    bool writable, struct page **pages)
 {
 	int ret;
-	unsigned int gup_flags = FOLL_LONGTERM | (writable ? FOLL_WRITE : 0);
+	unsigned int gup_flags = (writable ? FOLL_WRITE : 0);
 
-	ret = get_user_pages_fast(vaddr, npages, gup_flags, pages);
+	ret = pin_longterm_pages_fast(vaddr, npages, gup_flags, pages);
 	if (ret < 0)
 		return ret;
 
diff --git a/drivers/infiniband/hw/mthca/mthca_memfree.c b/drivers/infiniband/hw/mthca/mthca_memfree.c
index edccfd6e178f..beec7e4b8a96 100644
--- a/drivers/infiniband/hw/mthca/mthca_memfree.c
+++ b/drivers/infiniband/hw/mthca/mthca_memfree.c
@@ -472,8 +472,7 @@ int mthca_map_user_db(struct mthca_dev *dev, struct mthca_uar *uar,
 		goto out;
 	}
 
-	ret = get_user_pages_fast(uaddr & PAGE_MASK, 1,
-				  FOLL_WRITE | FOLL_LONGTERM, pages);
+	ret = pin_longterm_pages_fast(uaddr & PAGE_MASK, 1, FOLL_WRITE, pages);
 	if (ret < 0)
 		goto out;
 
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index 6bf764e41891..684a14e14d9b 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -108,10 +108,10 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
 
 	down_read(&current->mm->mmap_sem);
 	for (got = 0; got < num_pages; got += ret) {
-		ret = get_user_pages(start_page + got * PAGE_SIZE,
-				     num_pages - got,
-				     FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE,
-				     p + got, NULL);
+		ret = pin_longterm_pages(start_page + got * PAGE_SIZE,
+					 num_pages - got,
+					 FOLL_WRITE | FOLL_FORCE,
+					 p + got, NULL);
 		if (ret < 0) {
 			up_read(&current->mm->mmap_sem);
 			goto bail_release;
diff --git a/drivers/infiniband/hw/qib/qib_user_sdma.c b/drivers/infiniband/hw/qib/qib_user_sdma.c
index 05190edc2611..fd86a9d19370 100644
--- a/drivers/infiniband/hw/qib/qib_user_sdma.c
+++ b/drivers/infiniband/hw/qib/qib_user_sdma.c
@@ -670,7 +670,7 @@ static int qib_user_sdma_pin_pages(const struct qib_devdata *dd,
 		else
 			j = npages;
 
-		ret = get_user_pages_fast(addr, j, FOLL_LONGTERM, pages);
+		ret = pin_longterm_pages_fast(addr, j, 0, pages);
 		if (ret != j) {
 			i = 0;
 			j = ret;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 62e6ffa9ad78..6b90ca1c3771 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -141,11 +141,10 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	ret = 0;
 
 	while (npages) {
-		ret = get_user_pages(cur_base,
-				     min_t(unsigned long, npages,
-				     PAGE_SIZE / sizeof(struct page *)),
-				     gup_flags | FOLL_LONGTERM,
-				     page_list, NULL);
+		ret = pin_longterm_pages(cur_base,
+					 min_t(unsigned long, npages,
+					     PAGE_SIZE / sizeof(struct page *)),
+					 gup_flags, page_list, NULL);
 
 		if (ret < 0)
 			goto out;
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index e99983f07663..20e663d7ada8 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -426,9 +426,8 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 		while (nents) {
 			struct page **plist = &umem->page_chunk[i].plist[got];
 
-			rv = get_user_pages(first_page_va, nents,
-					    foll_flags | FOLL_LONGTERM,
-					    plist, NULL);
+			rv = pin_longterm_pages(first_page_va, nents,
+						foll_flags, plist, NULL);
 			if (rv < 0)
 				goto out_sem_up;
 
-- 
2.23.0



* [PATCH v2 08/18] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (6 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*() John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-04 17:41   ` Jerome Glisse
  2019-11-03 21:18 ` [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Convert process_vm_access to use the new pin_user_pages_remote()
call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

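Also, release the pages via put_user_pages_dirty_lock(), which marks
them dirty when vm_write is set. This replaces the per-page
set_page_dirty_lock() call in process_vm_rw_pages().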

Also, rename "pages" to "pinned_pages", as this makes for
easier reading of process_vm_rw_single_vec().
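
The release step can be modeled in user space. The sketch below is not
kernel code: struct fake_page and fake_put_user_pages_dirty_lock() are
hypothetical stand-ins that only mirror the (pages, npages, make_dirty)
calling convention used in this patch, where process_vm_rw_single_vec()
passes vm_write as make_dirty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel's layout. */
struct fake_page {
	int pins;	/* outstanding pins on this page */
	bool dirty;	/* set when the page has been marked dirty */
};

/*
 * Model of the release step: optionally dirty each page, then drop one
 * pin. The argument order follows the call made in this patch.
 */
void fake_put_user_pages_dirty_lock(struct fake_page **pages,
				    size_t npages, bool make_dirty)
{
	size_t i;

	for (i = 0; i < npages; i++) {
		/* Releasing a page that holds no pin would be a caller bug. */
		assert(pages[i]->pins > 0);

		if (make_dirty)
			pages[i]->dirty = true;
		pages[i]->pins--;
	}
}
```

In this model every page in the batch is dirtied when make_dirty is
true, not only the pages that the copy loop actually wrote to.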

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/process_vm_access.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 357aa7bef6c0..fd20ab675b85 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
 		if (copy > len)
 			copy = len;
 
-		if (vm_write) {
+		if (vm_write)
 			copied = copy_page_from_iter(page, offset, copy, iter);
-			set_page_dirty_lock(page);
-		} else {
+		else
 			copied = copy_page_to_iter(page, offset, copy, iter);
-		}
+
 		len -= copied;
 		if (copied < copy && iov_iter_count(iter))
 			return -EFAULT;
@@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		flags |= FOLL_WRITE;
 
 	while (!rc && nr_pages && iov_iter_count(iter)) {
-		int pages = min(nr_pages, max_pages_per_loop);
+		int pinned_pages = min(nr_pages, max_pages_per_loop);
 		int locked = 1;
 		size_t bytes;
 
@@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		 * current/current->mm
 		 */
 		down_read(&mm->mmap_sem);
-		pages = get_user_pages_remote(task, mm, pa, pages, flags,
-					      process_pages, NULL, &locked);
+		pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
+						     flags, process_pages,
+						     NULL, &locked);
 		if (locked)
 			up_read(&mm->mmap_sem);
-		if (pages <= 0)
+		if (pinned_pages <= 0)
 			return -EFAULT;
 
-		bytes = pages * PAGE_SIZE - start_offset;
+		bytes = pinned_pages * PAGE_SIZE - start_offset;
 		if (bytes > len)
 			bytes = len;
 
@@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
 					 vm_write);
 		len -= bytes;
 		start_offset = 0;
-		nr_pages -= pages;
-		pa += pages * PAGE_SIZE;
-		while (pages)
-			put_page(process_pages[--pages]);
+		nr_pages -= pinned_pages;
+		pa += pinned_pages * PAGE_SIZE;
+
+		/* If vm_write is set, the pages need to be made dirty: */
+		put_user_pages_dirty_lock(process_pages, pinned_pages,
+					  vm_write);
 	}
 
 	return rc;
-- 
2.23.0



* [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast()
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (7 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 08/18] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-04 17:44   ` Jerome Glisse
  2019-11-03 21:18 ` [PATCH v2 10/18] fs/io_uring: set FOLL_PIN via pin_user_pages() John Hubbard
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Convert drm/via to use the new pin_user_pages_fast() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages, and therefore for any code that calls
put_user_page().

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/gpu/drm/via/via_dmablit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
index 3db000aacd26..37c5e572993a 100644
--- a/drivers/gpu/drm/via/via_dmablit.c
+++ b/drivers/gpu/drm/via/via_dmablit.c
@@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  drm_via_dmablit_t *xfer)
 	vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
 	if (NULL == vsg->pages)
 		return -ENOMEM;
-	ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
+	ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
 			vsg->num_pages,
 			vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
 			vsg->pages);
-- 
2.23.0



* [PATCH v2 10/18] fs/io_uring: set FOLL_PIN via pin_user_pages()
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (8 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-03 21:18 ` [PATCH v2 11/18] net/xdp: " John Hubbard
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Convert fs/io_uring to use the new pin_longterm_pages() call, which sets
FOLL_PIN and FOLL_LONGTERM. Setting FOLL_PIN is now required for code
that requires tracking of pinned pages, and therefore for any code that
calls put_user_page().

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/io_uring.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f9a38998f2fc..0f307f2c7cac 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3433,9 +3433,8 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 
 		ret = 0;
 		down_read(&current->mm->mmap_sem);
-		pret = get_user_pages(ubuf, nr_pages,
-				      FOLL_WRITE | FOLL_LONGTERM,
-				      pages, vmas);
+		pret = pin_longterm_pages(ubuf, nr_pages, FOLL_WRITE, pages,
+					  vmas);
 		if (pret == nr_pages) {
 			/* don't support file backed memory */
 			for (j = 0; j < nr_pages; j++) {
-- 
2.23.0



* [PATCH v2 11/18] net/xdp: set FOLL_PIN via pin_user_pages()
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (9 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 10/18] fs/io_uring: set FOLL_PIN via pin_user_pages() John Hubbard
@ 2019-11-03 21:18 ` " John Hubbard
  2019-11-03 21:18 ` [PATCH v2 12/18] mm/gup: track FOLL_PIN pages John Hubbard
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Convert net/xdp to use the new pin_longterm_pages() call, which sets
FOLL_PIN. Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.
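
The flag handling behind this conversion can be sketched in user space.
The bit values and the sketch_pin_longterm_flags() helper below are
assumptions made for illustration; they are not the kernel's FOLL_*
definitions or the body of pin_longterm_pages(). The sketch only shows
the idea visible in the diff: the call site stops passing FOLL_LONGTERM
and the wrapper supplies the pin and long-term bits.

```c
#include <assert.h>

/*
 * Illustrative flag bits only: these values are assumptions made for
 * this sketch and are NOT the kernel's FOLL_* definitions.
 */
#define SKETCH_FOLL_WRITE	0x01u
#define SKETCH_FOLL_LONGTERM	0x02u
#define SKETCH_FOLL_PIN		0x04u

/*
 * Model of what a pin_longterm_pages()-style wrapper does with its
 * flags: the caller passes only the access flags, and the wrapper adds
 * the pin and long-term bits itself.
 */
unsigned int sketch_pin_longterm_flags(unsigned int gup_flags)
{
	/*
	 * This check is the sketch's own choice; the series simply drops
	 * FOLL_LONGTERM from each converted call site.
	 */
	assert(!(gup_flags & (SKETCH_FOLL_PIN | SKETCH_FOLL_LONGTERM)));

	return gup_flags | SKETCH_FOLL_PIN | SKETCH_FOLL_LONGTERM;
}
```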

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 net/xdp/xdp_umem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 3049af269fbf..66c814863cfd 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -291,8 +291,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
 		return -ENOMEM;
 
 	down_read(&current->mm->mmap_sem);
-	npgs = get_user_pages(umem->address, umem->npgs,
-			      gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
+	npgs = pin_longterm_pages(umem->address, umem->npgs, gup_flags,
+				  &umem->pgs[0], NULL);
 	up_read(&current->mm->mmap_sem);
 
 	if (npgs != umem->npgs) {
-- 
2.23.0



* [PATCH v2 12/18] mm/gup: track FOLL_PIN pages
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (10 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 11/18] net/xdp: " John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-04 18:52   ` Jerome Glisse
  2019-11-03 21:18 ` [PATCH v2 13/18] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Add tracking of pages that were pinned via FOLL_PIN.

As mentioned in the FOLL_PIN documentation, callers who effectively set
FOLL_PIN are required to ultimately free such pages via put_user_page().
The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
for DIO and/or RDMA use".
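
Why the release call has to match the acquire call can be seen in a
small user-space model. This is a sketch under stated assumptions, not
kernel code: struct sketch_page and the sketch_*() helpers are
hypothetical, and only the bias value (1 << 10) is taken from the
GUP_PIN_COUNTING_BIAS definition in this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the GUP_PIN_COUNTING_BIAS introduced by this patch. */
#define SKETCH_PIN_BIAS (1 << 10)

/* Hypothetical stand-in for a page's refcount; not struct page. */
struct sketch_page {
	int refcount;
};

/* Model of a FOLL_PIN acquisition: one pin adds the whole bias. */
void sketch_pin(struct sketch_page *page)
{
	page->refcount += SKETCH_PIN_BIAS;
}

/*
 * Model of put_user_page(): subtract the bias and report whether the
 * count reached zero, which is where the real code frees the page.
 */
bool sketch_put_user_page(struct sketch_page *page)
{
	assert(page->refcount >= SKETCH_PIN_BIAS);
	page->refcount -= SKETCH_PIN_BIAS;
	return page->refcount == 0;
}

/* Model of put_page(): drop exactly one ordinary reference. */
bool sketch_put_page(struct sketch_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
	return page->refcount == 0;
}
```

Releasing a pinned page with the put_page() model removes 1 where
SKETCH_PIN_BIAS was added, so the count stays at or above the bias and
the page keeps looking pinned.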

Pages that have been pinned via FOLL_PIN are identifiable via a
new function call:

   bool page_dma_pinned(struct page *page);
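
The check behind this function is a threshold test on the refcount, as
the mm.h hunk of this patch shows. A user-space sketch follows; the
sketch_*() names are hypothetical, they operate on a plain int rather
than a struct page, and the two "apparent" helpers are only a debugging
aid suggested by the power-of-two bias, not functions from the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the GUP_PIN_COUNTING_BIAS introduced by this patch. */
#define SKETCH_PIN_BIAS (1 << 10)

/* Model of page_dma_pinned(): a threshold test on the raw refcount. */
bool sketch_page_dma_pinned(int refcount)
{
	assert(refcount >= 0);
	return refcount >= SKETCH_PIN_BIAS;
}

/* Upper bits: how many pins the count appears to hold. */
int sketch_apparent_pins(int refcount)
{
	return refcount / SKETCH_PIN_BIAS;
}

/* Lower bits: how many ordinary references the count appears to hold. */
int sketch_apparent_refs(int refcount)
{
	return refcount % SKETCH_PIN_BIAS;
}
```

A page that holds 1024 ordinary references and no pins also passes the
threshold test, which is why a true result only means the page is
likely pinned, while a false result is definite.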

What to do in response to encountering such a page is left to later
patchsets. This is discussed in [1].

This also changes a BUG_ON() to a WARN_ON_ONCE() in follow_page_mask().

This also makes a couple of trivial, non-functional fixes to
try_get_compound_head(), and moves that function to the top of the
file.

This includes the following fix from Ira Weiny:

DAX requires detection of a page crossing to a ref count of 1.  Fix this
for GUP pages by introducing put_devmap_managed_user_page() which
accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
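
The transition this fix has to catch can be modeled in user space. The
sketch_devmap_put() helper below is hypothetical and works on a plain
int; it only illustrates that a pinned devmap page reaches a count of 1
when the full bias is subtracted, not when the count is decremented by
one.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the GUP_PIN_COUNTING_BIAS introduced by this patch. */
#define SKETCH_PIN_BIAS (1 << 10)

/*
 * Model of releasing a devmap-managed page. "amount" is 1 for an
 * ordinary put and SKETCH_PIN_BIAS for a FOLL_PIN release. Returns true
 * when the count lands on 1, the value that DAX treats as "the page is
 * free" according to the text above.
 */
bool sketch_devmap_put(int *refcount, int amount)
{
	/* The count must stay at 1 or above after the subtraction. */
	assert(*refcount > amount);

	*refcount -= amount;
	return *refcount == 1;
}
```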

[1] https://lwn.net/Articles/784574/ "Some slow progress on
get_user_pages()"

Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h       |  80 +++++++++++----
 include/linux/mmzone.h   |   2 +
 include/linux/page_ref.h |  10 ++
 mm/gup.c                 | 213 +++++++++++++++++++++++++++++++--------
 mm/huge_memory.c         |  32 +++++-
 mm/hugetlb.c             |  28 ++++-
 mm/memremap.c            |   4 +-
 mm/vmstat.c              |   2 +
 8 files changed, 300 insertions(+), 71 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cdfb6fedb271..03b3600843b7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page)
 #endif
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page);
+void __put_devmap_managed_page(struct page *page, int count);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-static inline bool put_devmap_managed_page(struct page *page)
+
+static inline bool page_is_devmap_managed(struct page *page)
 {
 	if (!static_branch_unlikely(&devmap_managed_key))
 		return false;
@@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page)
 	switch (page->pgmap->type) {
 	case MEMORY_DEVICE_PRIVATE:
 	case MEMORY_DEVICE_FS_DAX:
-		__put_devmap_managed_page(page);
 		return true;
 	default:
 		break;
@@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page)
 	return false;
 }
 
+static inline bool put_devmap_managed_page(struct page *page)
+{
+	bool is_devmap = page_is_devmap_managed(page);
+
+	if (is_devmap) {
+		int count = page_ref_dec_return(page);
+
+		__put_devmap_managed_page(page, count);
+	}
+
+	return is_devmap;
+}
+
 #else /* CONFIG_DEV_PAGEMAP_OPS */
 static inline bool put_devmap_managed_page(struct page *page)
 {
@@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
 	return true;
 }
 
+__must_check bool user_page_ref_inc(struct page *page);
+
 static inline void put_page(struct page *page)
 {
 	page = compound_head(page);
@@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page)
 		__put_page(page);
 }
 
-/**
- * put_user_page() - release a gup-pinned page
- * @page:            pointer to page to be released
+/*
+ * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
+ * the page's refcount so that two separate items are tracked: the original page
+ * reference count, and also a new count of how many get_user_pages() calls were
+ * made against the page. ("gup-pinned" is another term for the latter).
+ *
+ * With this scheme, get_user_pages() becomes special: such pages are marked
+ * as distinct from normal pages. As such, the new put_user_page() call (and
+ * its variants) must be used in order to release gup-pinned pages.
+ *
+ * Choice of value:
  *
- * Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
+ * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
+ * counts with respect to get_user_pages() and put_user_page() becomes simpler,
+ * due to the fact that adding an even power of two to the page refcount has
+ * the effect of using only the upper N bits, for the code that counts up using
+ * the bias value. This means that the lower bits are left for the exclusive
+ * use of the original code that increments and decrements by one (or at least,
+ * by much smaller values than the bias value).
  *
- * put_user_page() and put_page() are not interchangeable, despite this early
- * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with get_user_page() calls.
+ * Of course, once the lower bits overflow into the upper bits (and this is
+ * OK, because subtraction recovers the original values), then visual inspection
+ * no longer suffices to directly view the separate counts. However, for normal
+ * applications that don't have huge page reference counts, this won't be an
+ * issue.
+ *
+ * Locking: the lockless algorithm described in page_cache_get_speculative()
+ * and page_cache_gup_pin_speculative() provides safe operation for
+ * get_user_pages and page_mkclean and other calls that race to set up page
+ * table entries.
  */
-static inline void put_user_page(struct page *page)
-{
-	put_page(page);
-}
+#define GUP_PIN_COUNTING_BIAS (1UL << 10)
 
+void put_user_page(struct page *page);
 void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 			       bool make_dirty);
-
 void put_user_pages(struct page **pages, unsigned long npages);
 
+/**
+ * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
+ * or pin_longterm_pages*()
+ * @page:	pointer to page to be queried.
+ * @Return:	True, if it is likely that the page has been "dma-pinned".
+ *		False, if the page is definitely not dma-pinned.
+ */
+static inline bool page_dma_pinned(struct page *page)
+{
+	return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS;
+}
+
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
 #define SECTION_IN_PAGE_FLAGS
 #endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bda20282746b..0485cba38d23 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -244,6 +244,8 @@ enum node_stat_item {
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
+	NR_FOLL_PIN_REQUESTED,	/* via: pin_user_pages(), gup flag: FOLL_PIN */
+	NR_FOLL_PIN_RETURNED,	/* pages returned via put_user_page() */
 	NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 14d14beb1f7f..b9cbe553d1e7 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -102,6 +102,16 @@ static inline void page_ref_sub(struct page *page, int nr)
 		__page_ref_mod(page, -nr);
 }
 
+static inline int page_ref_sub_return(struct page *page, int nr)
+{
+	int ret = atomic_sub_return(nr, &page->_refcount);
+
+	if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
+		__page_ref_mod(page, -nr);
+
+	return ret;
+}
+
 static inline void page_ref_inc(struct page *page)
 {
 	atomic_inc(&page->_refcount);
diff --git a/mm/gup.c b/mm/gup.c
index 1aea48427879..c9727e65fad3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,6 +29,102 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
 
+/*
+ * Return the compound head page with ref appropriately incremented,
+ * or NULL if that failed.
+ */
+static inline struct page *try_get_compound_head(struct page *page, int refs)
+{
+	struct page *head = compound_head(page);
+
+	if (WARN_ON_ONCE(page_ref_count(head) < 0))
+		return NULL;
+	if (unlikely(!page_cache_add_speculative(head, refs)))
+		return NULL;
+	return head;
+}
+
+#ifdef CONFIG_DEBUG_VM
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+	mod_node_page_state(page_pgdat(page), item, count);
+}
+#else
+static inline void __update_proc_vmstat(struct page *page,
+					enum node_stat_item item, int count)
+{
+}
+#endif
+
+/**
+ * user_page_ref_inc() - mark a page as being used by get_user_pages(FOLL_PIN).
+ *
+ * @page:	pointer to page to be marked
+ * @Return:	true for success, false for failure
+ */
+__must_check bool user_page_ref_inc(struct page *page)
+{
+	page = try_get_compound_head(page, GUP_PIN_COUNTING_BIAS);
+	if (!page)
+		return false;
+
+	__update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
+	return true;
+}
+
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+static bool __put_devmap_managed_user_page(struct page *page)
+{
+	bool is_devmap = page_is_devmap_managed(page);
+
+	if (is_devmap) {
+		int count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
+
+		__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+		__put_devmap_managed_page(page, count);
+	}
+
+	return is_devmap;
+}
+#else
+static bool __put_devmap_managed_user_page(struct page *page)
+{
+	return false;
+}
+#endif /* CONFIG_DEV_PAGEMAP_OPS */
+
+/**
+ * put_user_page() - release a gup-pinned page
+ * @page:            pointer to page to be released
+ *
+ * Pages that were pinned via get_user_pages*() must be released via
+ * either put_user_page(), or one of the put_user_pages*() routines
+ * below. This is so that eventually, pages that are pinned via
+ * get_user_pages*() can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special
+ * handling.
+ */
+void put_user_page(struct page *page)
+{
+	page = compound_head(page);
+
+	/*
+	 * For devmap managed pages we need to catch the refcount transition
+	 * from GUP_PIN_COUNTING_BIAS to 1. When the refcount reaches one, the
+	 * page is free and we need to inform the device driver through a
+	 * callback. See include/linux/memremap.h and HMM for details.
+	 */
+	if (__put_devmap_managed_user_page(page))
+		return;
+
+	if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
+		__put_page(page);
+
+	__update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
+}
+EXPORT_SYMBOL(put_user_page);
+
 /**
  * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
  * @pages:  array of pages to be maybe marked dirty, and definitely released.
@@ -215,10 +311,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	}
 
 	page = vm_normal_page(vma, address, pte);
-	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
+	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
-		 * Only return device mapping pages in the FOLL_GET case since
-		 * they are only valid while holding the pgmap reference.
+		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
+		 * case since they are only valid while holding the pgmap
+		 * reference.
 		 */
 		*pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
 		if (*pgmap)
@@ -261,6 +358,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 			page = ERR_PTR(-ENOMEM);
 			goto out;
 		}
+	} else if (flags & FOLL_PIN) {
+		if (unlikely(!user_page_ref_inc(page))) {
+			page = ERR_PTR(-ENOMEM);
+			goto out;
+		}
 	}
 	if (flags & FOLL_TOUCH) {
 		if ((flags & FOLL_WRITE) &&
@@ -522,8 +624,8 @@ static struct page *follow_page_mask(struct vm_area_struct *vma,
 	/* make this handle hugepd */
 	page = follow_huge_addr(mm, address, flags & FOLL_WRITE);
 	if (!IS_ERR(page)) {
-		BUG_ON(flags & FOLL_GET);
-		return page;
+		WARN_ON_ONCE(flags & (FOLL_GET | FOLL_PIN));
+		return NULL;
 	}
 
 	pgd = pgd_offset(mm, address);
@@ -1812,30 +1914,20 @@ static inline pte_t gup_get_pte(pte_t *ptep)
 #endif /* CONFIG_GUP_GET_PTE_LOW_HIGH */
 
 static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
+					    unsigned int flags,
 					    struct page **pages)
 {
 	while ((*nr) - nr_start) {
 		struct page *page = pages[--(*nr)];
 
 		ClearPageReferenced(page);
-		put_page(page);
+		if (flags & FOLL_PIN)
+			put_user_page(page);
+		else
+			put_page(page);
 	}
 }
 
-/*
- * Return the compund head page with ref appropriately incremented,
- * or NULL if that failed.
- */
-static inline struct page *try_get_compound_head(struct page *page, int refs)
-{
-	struct page *head = compound_head(page);
-	if (WARN_ON_ONCE(page_ref_count(head) < 0))
-		return NULL;
-	if (unlikely(!page_cache_add_speculative(head, refs)))
-		return NULL;
-	return head;
-}
-
 #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
 static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 			 unsigned int flags, struct page **pages, int *nr)
@@ -1865,7 +1957,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 
 			pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
 			if (unlikely(!pgmap)) {
-				undo_dev_pagemap(nr, nr_start, pages);
+				undo_dev_pagemap(nr, nr_start, flags, pages);
 				goto pte_unmap;
 			}
 		} else if (pte_special(pte))
@@ -1874,9 +1966,15 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
-		head = try_get_compound_head(page, 1);
-		if (!head)
-			goto pte_unmap;
+		if (flags & FOLL_PIN) {
+			head = page;
+			if (unlikely(!user_page_ref_inc(head)))
+				goto pte_unmap;
+		} else {
+			head = try_get_compound_head(page, 1);
+			if (!head)
+				goto pte_unmap;
+		}
 
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 			put_page(head);
@@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
 
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
-			undo_dev_pagemap(nr, nr_start, pages);
+			undo_dev_pagemap(nr, nr_start, flags, pages);
 			return 0;
 		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
-		get_page(page);
+
+		if (flags & FOLL_PIN) {
+			if (unlikely(!user_page_ref_inc(page))) {
+				undo_dev_pagemap(nr, nr_start, flags, pages);
+				return 0;
+			}
+		} else
+			get_page(page);
+
 		(*nr)++;
 		pfn++;
 	} while (addr += PAGE_SIZE, addr != end);
@@ -1957,7 +2063,7 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 		return 0;
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
@@ -1975,7 +2081,7 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 		return 0;
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
-		undo_dev_pagemap(nr, nr_start, pages);
+		undo_dev_pagemap(nr, nr_start, flags, pages);
 		return 0;
 	}
 	return 1;
@@ -2059,9 +2165,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
 	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
 	refs = __record_subpages(page, addr, end, pages, *nr);
 
-	head = try_get_compound_head(head, refs);
-	if (!head)
-		return 0;
+	if (flags & FOLL_PIN) {
+		head = page;
+		if (unlikely(!user_page_ref_inc(head)))
+			return 0;
+		head = page;
+	} else {
+		head = try_get_compound_head(head, refs);
+		if (!head)
+			return 0;
+	}
 
 	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 		put_compound_head(head, refs);
@@ -2118,9 +2231,15 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
 	refs = __record_subpages(page, addr, end, pages, *nr);
 
-	head = try_get_compound_head(pmd_page(orig), refs);
-	if (!head)
-		return 0;
+	if (flags & FOLL_PIN) {
+		head = page;
+		if (unlikely(!user_page_ref_inc(head)))
+			return 0;
+	} else {
+		head = try_get_compound_head(pmd_page(orig), refs);
+		if (!head)
+			return 0;
+	}
 
 	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
 		put_compound_head(head, refs);
@@ -2151,9 +2270,15 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
 	refs = __record_subpages(page, addr, end, pages, *nr);
 
-	head = try_get_compound_head(pud_page(orig), refs);
-	if (!head)
-		return 0;
+	if (flags & FOLL_PIN) {
+		head = page;
+		if (unlikely(!user_page_ref_inc(head)))
+			return 0;
+	} else {
+		head = try_get_compound_head(pud_page(orig), refs);
+		if (!head)
+			return 0;
+	}
 
 	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
 		put_compound_head(head, refs);
@@ -2179,9 +2304,15 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
 	refs = __record_subpages(page, addr, end, pages, *nr);
 
-	head = try_get_compound_head(pgd_page(orig), refs);
-	if (!head)
-		return 0;
+	if (flags & FOLL_PIN) {
+		head = page;
+		if (unlikely(!user_page_ref_inc(head)))
+			return 0;
+	} else {
+		head = try_get_compound_head(pgd_page(orig), refs);
+		if (!head)
+			return 0;
+	}
 
 	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
 		put_compound_head(head, refs);
@@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
 	unsigned long addr, len, end;
 	int nr = 0, ret = 0;
 
-	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
+	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 13cc93785006..66bf4c8b88f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -945,6 +945,11 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 */
 	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (flags & FOLL_WRITE && !pmd_write(*pmd))
 		return NULL;
 
@@ -960,7 +965,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
@@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+
+	if (flags & FOLL_GET)
+		get_page(page);
+	else if (flags & FOLL_PIN)
+		if (unlikely(!user_page_ref_inc(page)))
+			page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1088,6 +1098,11 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	if (flags & FOLL_WRITE && !pud_write(*pud))
 		return NULL;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 	if (pud_present(*pud) && pud_devmap(*pud))
 		/* pass */;
 	else
@@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	 * device mapped pages can only be returned if the
 	 * caller will manage the page reference count.
 	 */
-	if (!(flags & FOLL_GET))
+	if (!(flags & (FOLL_GET | FOLL_PIN)))
 		return ERR_PTR(-EEXIST);
 
 	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
@@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
 	if (!*pgmap)
 		return ERR_PTR(-EFAULT);
 	page = pfn_to_page(pfn);
-	get_page(page);
+
+	if (flags & FOLL_GET)
+		get_page(page);
+	else if (flags & FOLL_PIN)
+		if (unlikely(!user_page_ref_inc(page)))
+			page = ERR_PTR(-ENOMEM);
 
 	return page;
 }
@@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 skip_mlock:
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
 	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+
 	if (flags & FOLL_GET)
 		get_page(page);
+	else if (flags & FOLL_PIN)
+		if (unlikely(!user_page_ref_inc(page)))
+			page = NULL;
 
 out:
 	return page;
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b45a95363a84..da335b1cd798 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 same_page:
 		if (pages) {
 			pages[i] = mem_map_offset(page, pfn_offset);
-			get_page(pages[i]);
+
+			if (flags & FOLL_GET)
+				get_page(pages[i]);
+			else if (flags & FOLL_PIN)
+				if (unlikely(!user_page_ref_inc(pages[i]))) {
+					spin_unlock(ptl);
+					remainder = 0;
+					err = -ENOMEM;
+					WARN_ON_ONCE(1);
+					break;
+				}
 		}
 
 		if (vmas)
@@ -5022,6 +5032,12 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	struct page *page = NULL;
 	spinlock_t *ptl;
 	pte_t pte;
+
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return NULL;
+
 retry:
 	ptl = pmd_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	pte = huge_ptep_get((pte_t *)pmd);
 	if (pte_present(pte)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+
 		if (flags & FOLL_GET)
 			get_page(page);
+		else if (flags & FOLL_PIN)
+			if (unlikely(!user_page_ref_inc(page))) {
+				page = NULL;
+				goto out;
+			}
 	} else {
 		if (is_hugetlb_entry_migration(pte)) {
 			spin_unlock(ptl);
@@ -5056,7 +5078,7 @@ struct page * __weak
 follow_huge_pud(struct mm_struct *mm, unsigned long address,
 		pud_t *pud, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pud) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
@@ -5065,7 +5087,7 @@ follow_huge_pud(struct mm_struct *mm, unsigned long address,
 struct page * __weak
 follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int flags)
 {
-	if (flags & FOLL_GET)
+	if (flags & (FOLL_GET | FOLL_PIN))
 		return NULL;
 
 	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
diff --git a/mm/memremap.c b/mm/memremap.c
index 03ccbdfeb697..3b1c69df1d2a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -410,10 +410,8 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void __put_devmap_managed_page(struct page *page)
+void __put_devmap_managed_page(struct page *page, int count)
 {
-	int count = page_ref_dec_return(page);
-
 	/*
 	 * If refcount is 1 then page is freed and refcount is stable as nobody
 	 * holds a reference on the page.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..65c027d9b637 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1167,6 +1167,8 @@ const char * const vmstat_text[] = {
 	"nr_dirtied",
 	"nr_written",
 	"nr_kernel_misc_reclaimable",
+	"nr_foll_pin_requested",
+	"nr_foll_pin_returned",
 
 	/* enum writeback_stat_item counters */
 	"nr_dirty_threshold",
-- 
2.23.0


* [PATCH v2 13/18] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (11 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 12/18] mm/gup: track FOLL_PIN pages John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-10 10:11   ` Hans Verkuil
  2019-11-03 21:18 ` [PATCH v2 14/18] vfio, mm: " John Hubbard
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

1. Change v4l2 from get_user_pages(FOLL_LONGTERM), to
pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() call over to
put_user_pages_dirty_lock().
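
The acquire/release pairing described above can be sketched in plain
userspace C. This is only a toy model: struct page_model and the *_model()
functions are hypothetical names invented for the illustration, and the
reference handling is reduced to a simple counter. It is meant to show why
one batched release call can replace the open-coded "dirty, then put" loop,
not how the kernel implements it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only the fields this sketch needs. */
struct page_model {
	int refcount;
	bool dirty;
};

/* Model of a pin-style acquire: take one reference on each page. */
static long pin_pages_model(struct page_model **pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		pages[i]->refcount++;
	return (long)nr;
}

/*
 * Model of the batched release: optionally mark each page dirty, then
 * drop the reference that pin_pages_model() took.
 */
static void put_pages_dirty_model(struct page_model **pages, size_t nr,
				  bool make_dirty)
{
	for (size_t i = 0; i < nr; i++) {
		assert(pages[i]->refcount > 0);
		if (make_dirty)
			pages[i]->dirty = true;
		pages[i]->refcount--;
	}
}
```

A caller would pin an array of pages once, and later hand the same array
and count to the release helper, passing "true" only when the device may
have written to the pages.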

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/media/v4l2-core/videobuf-dma-sg.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 28262190c3ab..9b9c5b37bf59 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
 	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
 		data, size, dma->nr_pages);
 
-	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
-			     flags | FOLL_LONGTERM, dma->pages, NULL);
+	err = pin_longterm_pages(data & PAGE_MASK, dma->nr_pages,
+				 flags, dma->pages, NULL);
 
 	if (err != dma->nr_pages) {
 		dma->nr_pages = (err >= 0) ? err : 0;
-		dprintk(1, "get_user_pages: err=%d [%d]\n", err,
+		dprintk(1, "pin_longterm_pages: err=%d [%d]\n", err,
 			dma->nr_pages);
 		return err < 0 ? err : -EINVAL;
 	}
@@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
 	BUG_ON(dma->sglen);
 
 	if (dma->pages) {
-		for (i = 0; i < dma->nr_pages; i++) {
-			if (dma->direction == DMA_FROM_DEVICE)
-				set_page_dirty_lock(dma->pages[i]);
-			put_page(dma->pages[i]);
-		}
+		put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
+					  dma->direction == DMA_FROM_DEVICE);
 		kfree(dma->pages);
 		dma->pages = NULL;
 	}
-- 
2.23.0


* [PATCH v2 14/18] vfio, mm: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (12 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 13/18] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
@ 2019-11-03 21:18 ` " John Hubbard
  2019-11-03 21:18 ` [PATCH v2 15/18] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page() John Hubbard
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

1. Change vfio from get_user_pages(FOLL_LONGTERM), to
pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.

2. Because all FOLL_PIN-acquired pages must be released via
put_user_page(), also convert the put_page() calls over to
put_user_page() and put_user_pages_dirty_lock().

Note that this effectively changes the code's behavior in
vfio_iommu_type1.c: put_pfn(): it now ultimately calls
set_page_dirty_lock(), instead of set_page_dirty(). This is
probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de

Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/vfio/vfio_iommu_type1.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d864277ea16f..795e13f3ef08 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -327,9 +327,8 @@ static int put_pfn(unsigned long pfn, int prot)
 {
 	if (!is_invalid_reserved_pfn(pfn)) {
 		struct page *page = pfn_to_page(pfn);
-		if (prot & IOMMU_WRITE)
-			SetPageDirty(page);
-		put_page(page);
+
+		put_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
 		return 1;
 	}
 	return 0;
@@ -349,11 +348,11 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
 	down_read(&mm->mmap_sem);
 	if (mm == current->mm) {
-		ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
-				     vmas);
+		ret = pin_longterm_pages(vaddr, 1, flags, page, vmas);
 	} else {
-		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-					    vmas, NULL);
+		ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,
+						flags, page, vmas,
+						NULL);
 		/*
 		 * The lifetime of a vaddr_get_pfn() page pin is
 		 * userspace-controlled. In the fs-dax case this could
@@ -363,7 +362,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		 */
 		if (ret > 0 && vma_is_fsdax(vmas[0])) {
 			ret = -EOPNOTSUPP;
-			put_page(page[0]);
+			put_user_page(page[0]);
 		}
 	}
 	up_read(&mm->mmap_sem);
-- 
2.23.0


* [PATCH v2 15/18] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page()
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (13 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 14/18] vfio, mm: " John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-03 21:18 ` [PATCH v2 16/18] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

1. Convert from get_user_pages(FOLL_LONGTERM) to pin_longterm_pages().

2. As required by pin_user_pages(), release these pages via
put_user_page(). In this case, do so via put_user_pages_dirty_lock().

That has the side effect of calling set_page_dirty_lock(), instead
of set_page_dirty(). This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

3. Release each page in mem->hpages[] (instead of mem->hpas[]), because
that is the array that pin_longterm_pages() filled in. This is more
accurate and should be a little safer from a maintenance point of
view.
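
For illustration, here is a small userspace model of item 3. It assumes,
as the removed lines in the diff below suggest, that the dirty state is
kept per entry as a flag bit in hpas[i], while hpages[i] holds the page
that was actually pinned. The names struct page_model, DIRTY_BIT_MODEL and
release_entries_model() are hypothetical and exist only in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-entry dirty flag, kept in the low bit of hpas[i]. */
#define DIRTY_BIT_MODEL 0x1UL

/* Hypothetical stand-in for struct page. */
struct page_model {
	int refcount;
	bool dirty;
};

/*
 * Release every pinned page through hpages[] (the array the pin call
 * filled in), taking the dirty decision from the matching hpas[] entry.
 */
static void release_entries_model(struct page_model **hpages,
				  unsigned long *hpas, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		if (!hpages[i])
			continue;

		/* Test this entry's bit, not the mask constant itself. */
		bool dirty = (hpas[i] & DIRTY_BIT_MODEL) != 0;

		if (dirty)
			hpages[i]->dirty = true;
		hpages[i]->refcount--;
		hpas[i] = 0;
	}
}
```

Only the entry whose flag bit is set ends up dirty; passing the mask
constant on its own as the "dirty" argument would mark every page.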

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 arch/powerpc/mm/book3s64/iommu_api.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..69d79cb50d47 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -103,9 +103,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	for (entry = 0; entry < entries; entry += chunk) {
 		unsigned long n = min(entries - entry, chunk);
 
-		ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
-				FOLL_WRITE | FOLL_LONGTERM,
-				mem->hpages + entry, NULL);
+		ret = pin_longterm_pages(ua + (entry << PAGE_SHIFT), n,
+					 FOLL_WRITE, mem->hpages + entry, NULL);
 		if (ret == n) {
 			pinned += n;
 			continue;
@@ -167,9 +166,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 	return 0;
 
 free_exit:
-	/* free the reference taken */
-	for (i = 0; i < pinned; i++)
-		put_page(mem->hpages[i]);
+	/* free the references taken */
+	put_user_pages(mem->hpages, pinned);
 
 	vfree(mem->hpas);
 	kfree(mem);
@@ -212,10 +210,9 @@ static void mm_iommu_unpin(struct mm_iommu_table_group_mem_t *mem)
 		if (!page)
 			continue;
 
-		if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
-			SetPageDirty(page);
+		put_user_pages_dirty_lock(&mem->hpages[i], 1,
+			mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY);
 
-		put_page(page);
 		mem->hpas[i] = 0;
 	}
 }
-- 
2.23.0


* [PATCH v2 16/18] mm/gup_benchmark: support pin_user_pages() and related calls
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (14 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 15/18] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page() John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-03 21:18 ` [PATCH v2 17/18] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
  2019-11-03 21:18 ` [PATCH v2 18/18] mm/gup: remove support for gup(FOLL_LONGTERM) John Hubbard
  17 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Up until now, gup_benchmark supported testing of the
following kernel functions:

* get_user_pages(): via the '-U' command line option
* get_user_pages_longterm(): via the '-L' command line option
* get_user_pages_fast(): as the default (no options required)

Add test coverage for the new corresponding pin_*() functions:

* pin_user_pages(): via the '-c' command line option
* pin_longterm_pages(): via the '-b' command line option
* pin_user_pages_fast(): via the '-a' command line option

Also, add an option for clarity: '-u' for what is now (still) the
default choice: get_user_pages_fast().
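
The option-to-command mapping listed above can be summarized as a lookup.
The sketch below is not the selftest's code: the enum and
cmd_for_option_model() are hypothetical names, and the numeric values
simply mirror the ioctl sequence numbers (1 through 6) used in the diff.

```c
#include <assert.h>

/* Hypothetical command ids; values follow the ioctl numbers in the diff. */
enum bench_cmd_model {
	CMD_NONE_MODEL = 0,
	GUP_FAST_MODEL = 1,
	GUP_LONGTERM_MODEL = 2,
	GUP_MODEL = 3,
	PIN_FAST_MODEL = 4,
	PIN_LONGTERM_MODEL = 5,
	PIN_MODEL = 6,
};

/* Map a command line option character to the benchmark command it selects. */
static enum bench_cmd_model cmd_for_option_model(int opt)
{
	switch (opt) {
	case 'u': return GUP_FAST_MODEL;	/* get_user_pages_fast() */
	case 'L': return GUP_LONGTERM_MODEL;	/* longterm gup */
	case 'U': return GUP_MODEL;		/* get_user_pages() */
	case 'a': return PIN_FAST_MODEL;	/* pin_user_pages_fast() */
	case 'b': return PIN_LONGTERM_MODEL;	/* pin_longterm_pages() */
	case 'c': return PIN_MODEL;		/* pin_user_pages() */
	default:  return CMD_NONE_MODEL;	/* not a command selector */
	}
}
```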

Also, for the three commands that set FOLL_PIN, verify that the pages
really are dma-pinned, via the new page_dma_pinned() routine.
Those commands are:

    PIN_FAST_BENCHMARK     : calls pin_user_pages_fast()
    PIN_LONGTERM_BENCHMARK : calls pin_longterm_pages()
    PIN_BENCHMARK          : calls pin_user_pages()

In between the calls to pin_*() and put_user_pages(),
check each page: if page_dma_pinned() returns false, then
WARN and return.

Do this outside of the benchmark timestamps, so that it doesn't
affect reported times.
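
As a rough illustration of the verification step, the userspace model
below assumes that a pin adds a large bias to a page's reference count,
so that "pinned" can be tested with a simple comparison. Both that scheme
and the value of PIN_BIAS_MODEL are assumptions made for this sketch; the
real test is whatever page_dma_pinned() does. All *_model names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed bias, for this sketch only. */
#define PIN_BIAS_MODEL 1024

/* Hypothetical stand-in for struct page. */
struct page_model {
	int refcount;
};

/* A pin adds the bias; a plain get adds a single reference. */
static void pin_page_model(struct page_model *p)
{
	p->refcount += PIN_BIAS_MODEL;
}

static void get_page_model(struct page_model *p)
{
	p->refcount += 1;
}

/* A page counts as pinned once its refcount reaches the bias. */
static bool page_pinned_model(const struct page_model *p)
{
	return p->refcount >= PIN_BIAS_MODEL;
}

/* Return the index of the first page that is not pinned, or -1 if none. */
static long first_unpinned_model(struct page_model **pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (!page_pinned_model(pages[i]))
			return (long)i;
	return -1;
}
```

In the benchmark, a check of this shape would run after the timed pin
loop and before the timed release loop, so its cost is not reported.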

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup_benchmark.c                         | 74 ++++++++++++++++++++--
 tools/testing/selftests/vm/gup_benchmark.c | 23 ++++++-
 2 files changed, 91 insertions(+), 6 deletions(-)

diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d7f8db..2bb0f5df4803 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -8,6 +8,9 @@
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_LONGTERM_BENCHMARK	_IOWR('g', 5, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 6, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -19,6 +22,44 @@ struct gup_benchmark {
 	__u64 expansion[10];	/* For future use */
 };
 
+static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages)
+{
+	int i;
+
+	switch (cmd) {
+	case GUP_FAST_BENCHMARK:
+	case GUP_LONGTERM_BENCHMARK:
+	case GUP_BENCHMARK:
+		for (i = 0; i < nr_pages; i++)
+			put_page(pages[i]);
+		break;
+
+	case PIN_FAST_BENCHMARK:
+	case PIN_LONGTERM_BENCHMARK:
+	case PIN_BENCHMARK:
+		put_user_pages(pages, nr_pages);
+		break;
+	}
+}
+
+static void verify_dma_pinned(int cmd, struct page **pages,
+			      unsigned long nr_pages)
+{
+	int i;
+
+	switch (cmd) {
+	case PIN_FAST_BENCHMARK:
+	case PIN_LONGTERM_BENCHMARK:
+	case PIN_BENCHMARK:
+		for (i = 0; i < nr_pages; i++) {
+			if (WARN(!page_dma_pinned(pages[i]),
+				 "pages[%d] is NOT dma-pinned\n", i))
+				break;
+		}
+		break;
+	}
+}
+
 static int __gup_benchmark_ioctl(unsigned int cmd,
 		struct gup_benchmark *gup)
 {
@@ -62,6 +103,19 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
 					    NULL);
 			break;
+		case PIN_FAST_BENCHMARK:
+			nr = pin_user_pages_fast(addr, nr, gup->flags & 1,
+						 pages + i);
+			break;
+		case PIN_LONGTERM_BENCHMARK:
+			nr = pin_longterm_pages(addr, nr,
+						(gup->flags & 1),
+						pages + i, NULL);
+			break;
+		case PIN_BENCHMARK:
+			nr = pin_user_pages(addr, nr, gup->flags & 1, pages + i,
+					    NULL);
+			break;
 		default:
 			return -1;
 		}
@@ -72,15 +126,22 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 	}
 	end_time = ktime_get();
 
+	/* Shifting the meaning of nr_pages: now it is actual number pinned: */
+	nr_pages = i;
+
 	gup->get_delta_usec = ktime_us_delta(end_time, start_time);
 	gup->size = addr - gup->addr;
 
+	/*
+	 * Take an un-benchmark-timed moment to verify DMA pinned
+	 * state: print a warning if any non-dma-pinned pages are found:
+	 */
+	verify_dma_pinned(cmd, pages, nr_pages);
+
 	start_time = ktime_get();
-	for (i = 0; i < nr_pages; i++) {
-		if (!pages[i])
-			break;
-		put_page(pages[i]);
-	}
+
+	put_back_pages(cmd, pages, nr_pages);
+
 	end_time = ktime_get();
 	gup->put_delta_usec = ktime_us_delta(end_time, start_time);
 
@@ -98,6 +159,9 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
 	case GUP_FAST_BENCHMARK:
 	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
+	case PIN_FAST_BENCHMARK:
+	case PIN_LONGTERM_BENCHMARK:
+	case PIN_BENCHMARK:
 		break;
 	default:
 		return -EINVAL;
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index 485cf06ef013..c5c934c0f402 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -18,6 +18,15 @@
 #define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
 
+/*
+ * Similar to above, but use FOLL_PIN instead of FOLL_GET. This is done
+ * by calling pin_user_pages_fast(), pin_longterm_pages(), and pin_user_pages(),
+ * respectively.
+ */
+#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
+#define PIN_LONGTERM_BENCHMARK	_IOWR('g', 5, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 6, struct gup_benchmark)
+
 struct gup_benchmark {
 	__u64 get_delta_usec;
 	__u64 put_delta_usec;
@@ -37,8 +46,17 @@ int main(int argc, char **argv)
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:f:tTLUwSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:f:abctTLUuwSH")) != -1) {
 		switch (opt) {
+		case 'a':
+			cmd = PIN_FAST_BENCHMARK;
+			break;
+		case 'b':
+			cmd = PIN_LONGTERM_BENCHMARK;
+			break;
+		case 'c':
+			cmd = PIN_BENCHMARK;
+			break;
 		case 'm':
 			size = atoi(optarg) * MB;
 			break;
@@ -60,6 +78,9 @@ int main(int argc, char **argv)
 		case 'U':
 			cmd = GUP_BENCHMARK;
 			break;
+		case 'u':
+			cmd = GUP_FAST_BENCHMARK;
+			break;
 		case 'w':
 			write = 1;
 			break;
-- 
2.23.0


* [PATCH v2 17/18] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (15 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 16/18] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  2019-11-03 21:18 ` [PATCH v2 18/18] mm/gup: remove support for gup(FOLL_LONGTERM) John Hubbard
  17 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

It's good to have basic unit test coverage of the new FOLL_PIN
behavior. Fortunately, the gup_benchmark unit test is extremely
fast (a few milliseconds), so adding it to the run_vmtests suite
is going to cause no noticeable change in running time.

So, add two new invocations to run_vmtests:

1) Run gup_benchmark with normal get_user_pages().

2) Run gup_benchmark with pin_user_pages(). This is much like
the first call, except that it sets FOLL_PIN.

Running these two in quick succession also provides a visual
comparison of the running times, which is convenient.

The new invocations are fairly early in the run_vmtests script,
because with test suites, it's usually preferable to put the
shorter, faster tests first, all other things being equal.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 tools/testing/selftests/vm/run_vmtests | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 951c507a27f7..93e8dc9a7cad 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -104,6 +104,28 @@ echo "NOTE: The above hugetlb tests provide minimal coverage.  Use"
 echo "      https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "      hugetlb regression testing."
 
+echo "--------------------------------------------"
+echo "running 'gup_benchmark -U' (normal/slow gup)"
+echo "--------------------------------------------"
+./gup_benchmark -U
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
+echo "------------------------------------------"
+echo "running gup_benchmark -c (pin_user_pages)"
+echo "------------------------------------------"
+./gup_benchmark -c
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 echo "-------------------"
 echo "running userfaultfd"
 echo "-------------------"
-- 
2.23.0


* [PATCH v2 18/18] mm/gup: remove support for gup(FOLL_LONGTERM)
  2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
                   ` (16 preceding siblings ...)
  2019-11-03 21:18 ` [PATCH v2 17/18] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
@ 2019-11-03 21:18 ` John Hubbard
  17 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-03 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML, John Hubbard

Now that all other kernel callers of get_user_pages(FOLL_LONGTERM)
have been converted to pin_longterm_pages(), lock it down:

1) Add an assertion to get_user_pages(), preventing callers from
   passing FOLL_LONGTERM (in addition to the existing assertion that
   prevents FOLL_PIN).

2) Remove the associated GUP_LONGTERM_BENCHMARK test.
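
The lock-down in item 1 amounts to a flag check at the public entry
point, with the pin wrapper being the only place that sets the internal
flags. A minimal userspace sketch follows; the flag values and every
*_model name are hypothetical, and gup_internal_model() is a stub that
always succeeds.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define FOLL_WRITE_MODEL	0x01u
#define FOLL_LONGTERM_MODEL	0x02u
#define FOLL_PIN_MODEL		0x04u

/* Stub for the shared internal path; always reports success here. */
static long gup_internal_model(unsigned int flags)
{
	(void)flags;
	return 0;
}

/* Public "get" entry point: callers may not pass the internal-only flags. */
static long get_user_pages_model(unsigned int flags)
{
	if (flags & (FOLL_PIN_MODEL | FOLL_LONGTERM_MODEL))
		return -EINVAL;
	return gup_internal_model(flags);
}

/* Pin wrapper: the one place that sets both internal flags. */
static long pin_longterm_pages_model(unsigned int flags)
{
	return gup_internal_model(flags | FOLL_PIN_MODEL |
				  FOLL_LONGTERM_MODEL);
}
```

With this shape, a caller that still passes the longterm flag to the
"get" entry point gets -EINVAL back and has to switch to the pin wrapper.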

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c                                   | 8 ++++----
 mm/gup_benchmark.c                         | 9 +--------
 tools/testing/selftests/vm/gup_benchmark.c | 7 ++-----
 3 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index c9727e65fad3..317f7602495d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1732,11 +1732,11 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
 		struct vm_area_struct **vmas)
 {
 	/*
-	 * FOLL_PIN must only be set internally by the pin_user_page*() and
-	 * pin_longterm_*() APIs, never directly by the caller, so enforce that
-	 * with an assertion:
+	 * FOLL_PIN and FOLL_LONGTERM must only be set internally by the
+	 * pin_user_page*() and pin_longterm_*() APIs, never directly by the
+	 * caller, so enforce that with an assertion:
 	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+	if (WARN_ON_ONCE(gup_flags & (FOLL_PIN | FOLL_LONGTERM)))
 		return -EINVAL;
 
 	return __gup_longterm_locked(current, current->mm, start, nr_pages,
diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 2bb0f5df4803..de6941855b7e 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -6,7 +6,7 @@
 #include <linux/debugfs.h>
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
+/* Command 2 has been deleted. */
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
 #define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
 #define PIN_LONGTERM_BENCHMARK	_IOWR('g', 5, struct gup_benchmark)
@@ -28,7 +28,6 @@ static void put_back_pages(int cmd, struct page **pages, unsigned long nr_pages)
 
 	switch (cmd) {
 	case GUP_FAST_BENCHMARK:
-	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
 		for (i = 0; i < nr_pages; i++)
 			put_page(pages[i]);
@@ -94,11 +93,6 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
 			nr = get_user_pages_fast(addr, nr, gup->flags & 1,
 						 pages + i);
 			break;
-		case GUP_LONGTERM_BENCHMARK:
-			nr = get_user_pages(addr, nr,
-					    (gup->flags & 1) | FOLL_LONGTERM,
-					    pages + i, NULL);
-			break;
 		case GUP_BENCHMARK:
 			nr = get_user_pages(addr, nr, gup->flags & 1, pages + i,
 					    NULL);
@@ -157,7 +151,6 @@ static long gup_benchmark_ioctl(struct file *filep, unsigned int cmd,
 
 	switch (cmd) {
 	case GUP_FAST_BENCHMARK:
-	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
 	case PIN_FAST_BENCHMARK:
 	case PIN_LONGTERM_BENCHMARK:
diff --git a/tools/testing/selftests/vm/gup_benchmark.c b/tools/testing/selftests/vm/gup_benchmark.c
index c5c934c0f402..5ef3cf8f3da5 100644
--- a/tools/testing/selftests/vm/gup_benchmark.c
+++ b/tools/testing/selftests/vm/gup_benchmark.c
@@ -15,7 +15,7 @@
 #define PAGE_SIZE sysconf(_SC_PAGESIZE)
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
+/* Command 2 has been deleted. */
 #define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
 
 /*
@@ -46,7 +46,7 @@ int main(int argc, char **argv)
 	char *file = "/dev/zero";
 	char *p;
 
-	while ((opt = getopt(argc, argv, "m:r:n:f:abctTLUuwSH")) != -1) {
+	while ((opt = getopt(argc, argv, "m:r:n:f:abctTUuwSH")) != -1) {
 		switch (opt) {
 		case 'a':
 			cmd = PIN_FAST_BENCHMARK;
@@ -72,9 +72,6 @@ int main(int argc, char **argv)
 		case 'T':
 			thp = 0;
 			break;
-		case 'L':
-			cmd = GUP_LONGTERM_BENCHMARK;
-			break;
 		case 'U':
 			cmd = GUP_BENCHMARK;
 			break;
-- 
2.23.0



* Re: [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions
  2019-11-03 21:17 ` [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions John Hubbard
@ 2019-11-04 16:39   ` Jerome Glisse
  0 siblings, 0 replies; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 16:39 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML,
	Kirill A . Shutemov

On Sun, Nov 03, 2019 at 01:17:56PM -0800, John Hubbard wrote:
> A subsequent patch requires access to gup flags, so
> pass the flags argument through to the __gup_device_*
> functions.
> 
> Also placate checkpatch.pl by shortening a nearby line.
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  mm/gup.c | 28 ++++++++++++++++++----------
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..85caf76b3012 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> -		unsigned long end, struct page **pages, int *nr)
> +			     unsigned long end, unsigned int flags,
> +			     struct page **pages, int *nr)
>  {
>  	int nr_start = *nr;
>  	struct dev_pagemap *pgmap = NULL;
> @@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> -		unsigned long end, struct page **pages, int *nr)
> +				 unsigned long end, unsigned int flags,
> +				 struct page **pages, int *nr)
>  {
>  	unsigned long fault_pfn;
>  	int nr_start = *nr;
>  
>  	fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> +	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>  		return 0;
>  
>  	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> @@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> -		unsigned long end, struct page **pages, int *nr)
> +				 unsigned long end, unsigned int flags,
> +				 struct page **pages, int *nr)
>  {
>  	unsigned long fault_pfn;
>  	int nr_start = *nr;
>  
>  	fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> -	if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> +	if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>  		return 0;
>  
>  	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> @@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  }
>  #else
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> -		unsigned long end, struct page **pages, int *nr)
> +				 unsigned long end, unsigned int flags,
> +				 struct page **pages, int *nr)
>  {
>  	BUILD_BUG();
>  	return 0;
>  }
>  
>  static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
> -		unsigned long end, struct page **pages, int *nr)
> +				 unsigned long end, unsigned int flags,
> +				 struct page **pages, int *nr)
>  {
>  	BUILD_BUG();
>  	return 0;
> @@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  	if (pmd_devmap(orig)) {
>  		if (unlikely(flags & FOLL_LONGTERM))
>  			return 0;
> -		return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
> +		return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
> +					     pages, nr);
>  	}
>  
>  	refs = 0;
> @@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  }
>  
>  static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> -		unsigned long end, unsigned int flags, struct page **pages, int *nr)
> +			unsigned long end, unsigned int flags,
> +			struct page **pages, int *nr)
>  {
>  	struct page *head, *page;
>  	int refs;
> @@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  	if (pud_devmap(orig)) {
>  		if (unlikely(flags & FOLL_LONGTERM))
>  			return 0;
> -		return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
> +		return __gup_device_huge_pud(orig, pudp, addr, end, flags,
> +					     pages, nr);
>  	}
>  
>  	refs = 0;
> -- 
> 2.23.0
> 



* Re: [PATCH v2 02/18] mm/gup: factor out duplicate code from four routines
  2019-11-03 21:17 ` [PATCH v2 02/18] mm/gup: factor out duplicate code from four routines John Hubbard
@ 2019-11-04 16:51   ` Jerome Glisse
  0 siblings, 0 replies; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 16:51 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML,
	Christoph Hellwig, Aneesh Kumar K . V

On Sun, Nov 03, 2019 at 01:17:57PM -0800, John Hubbard wrote:
> There are four locations in gup.c that have a fair amount of code
> duplication. This means that changing one requires making the same
> changes in four places, not to mention reading the same code four
> times, and wondering if there are subtle differences.
> 
> Factor out the common code into static functions, thus reducing the
> overall line count and the code's complexity.
> 
> Also, take the opportunity to slightly improve the efficiency of the
> error cases, by doing a mass subtraction of the refcount, surrounded
> by get_page()/put_page().
> 
> Also, further simplify (slightly) by waiting until the successful
> end of each routine to increment *nr.
> 
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Good cleanup.

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
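
For readers following the thread, the "mass subtraction of the refcount,
surrounded by get_page()/put_page()" from the commit message quoted above can
be modelled in userspace. This is only a sketch under stated assumptions:
struct sketch_head and every sketch_* helper are hypothetical stand-ins, a
plain int replaces the kernel's atomic refcount, and "freeing" is reduced to
bumping a counter. It shows why the extra reference is taken first: when refs
equals the current count, the final drop still goes through the normal put
path, so the release logic runs exactly once.

```c
#include <assert.h>

/* Hypothetical stand-in for a compound head page; not the kernel's struct page. */
struct sketch_head {
	int refcount;	/* plain int here; the kernel uses an atomic counter */
	int freed;	/* how many times the release path ran */
};

static void sketch_get_page(struct sketch_head *p)
{
	p->refcount++;
}

/* Drop one reference; the release path runs when the count reaches zero. */
static void sketch_put_page(struct sketch_head *p)
{
	if (--p->refcount == 0)
		p->freed++;
}

/* The old error path: drop the references one at a time. */
static void sketch_put_refs_loop(struct sketch_head *p, int refs)
{
	while (refs--)
		sketch_put_page(p);
}

/*
 * The factored-out variant: take one extra reference, subtract all refs in
 * one step, then drop the extra reference so that a count of zero is still
 * detected by the ordinary put path.
 */
static void sketch_put_compound_head(struct sketch_head *p, int refs)
{
	assert(refs >= 0 && refs <= p->refcount);
	sketch_get_page(p);
	p->refcount -= refs;
	sketch_put_page(p);
}
```

Both helpers leave the model in the same state; the second one simply touches
the counter three times instead of refs times.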

> ---
>  mm/gup.c | 104 ++++++++++++++++++++++++-------------------------------
>  1 file changed, 45 insertions(+), 59 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 85caf76b3012..199da99e8ffc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1969,6 +1969,34 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
>  }
>  #endif
>  
> +static int __record_subpages(struct page *page, unsigned long addr,
> +			     unsigned long end, struct page **pages, int nr)
> +{
> +	int nr_recorded_pages = 0;
> +
> +	do {
> +		pages[nr] = page;
> +		nr++;
> +		page++;
> +		nr_recorded_pages++;
> +	} while (addr += PAGE_SIZE, addr != end);
> +	return nr_recorded_pages;
> +}
> +
> +static void put_compound_head(struct page *page, int refs)
> +{
> +	/* Do a get_page() first, in case refs == page->_refcount */
> +	get_page(page);
> +	page_ref_sub(page, refs);
> +	put_page(page);
> +}
> +
> +static void __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
> +{
> +	*nr += nr_recorded_pages;
> +	SetPageReferenced(head);
> +}
> +
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
>  				      unsigned long sz)
> @@ -1998,33 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  	/* hugepages are never "special" */
>  	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  
> -	refs = 0;
>  	head = pte_page(pte);
> -
>  	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> -	do {
> -		VM_BUG_ON(compound_head(page) != head);
> -		pages[*nr] = page;
> -		(*nr)++;
> -		page++;
> -		refs++;
> -	} while (addr += PAGE_SIZE, addr != end);
> +	refs = __record_subpages(page, addr, end, pages, *nr);
>  
>  	head = try_get_compound_head(head, refs);
> -	if (!head) {
> -		*nr -= refs;
> +	if (!head)
>  		return 0;
> -	}
>  
>  	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> -		/* Could be optimized better */
> -		*nr -= refs;
> -		while (refs--)
> -			put_page(head);
> +		put_compound_head(head, refs);
>  		return 0;
>  	}
>  
> -	SetPageReferenced(head);
> +	__huge_pt_done(head, refs, nr);
>  	return 1;
>  }
>  
> @@ -2071,29 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
>  					     pages, nr);
>  	}
>  
> -	refs = 0;
>  	page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> -	do {
> -		pages[*nr] = page;
> -		(*nr)++;
> -		page++;
> -		refs++;
> -	} while (addr += PAGE_SIZE, addr != end);
> +	refs = __record_subpages(page, addr, end, pages, *nr);
>  
>  	head = try_get_compound_head(pmd_page(orig), refs);
> -	if (!head) {
> -		*nr -= refs;
> +	if (!head)
>  		return 0;
> -	}
>  
>  	if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> -		*nr -= refs;
> -		while (refs--)
> -			put_page(head);
> +		put_compound_head(head, refs);
>  		return 0;
>  	}
>  
> -	SetPageReferenced(head);
> +	__huge_pt_done(head, refs, nr);
>  	return 1;
>  }
>  
> @@ -2114,29 +2119,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
>  					     pages, nr);
>  	}
>  
> -	refs = 0;
>  	page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> -	do {
> -		pages[*nr] = page;
> -		(*nr)++;
> -		page++;
> -		refs++;
> -	} while (addr += PAGE_SIZE, addr != end);
> +	refs = __record_subpages(page, addr, end, pages, *nr);
>  
>  	head = try_get_compound_head(pud_page(orig), refs);
> -	if (!head) {
> -		*nr -= refs;
> +	if (!head)
>  		return 0;
> -	}
>  
>  	if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> -		*nr -= refs;
> -		while (refs--)
> -			put_page(head);
> +		put_compound_head(head, refs);
>  		return 0;
>  	}
>  
> -	SetPageReferenced(head);
> +	__huge_pt_done(head, refs, nr);
>  	return 1;
>  }
>  
> @@ -2151,29 +2146,20 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
>  		return 0;
>  
>  	BUILD_BUG_ON(pgd_devmap(orig));
> -	refs = 0;
> +
>  	page = pgd_page(orig) + ((addr & ~PGDIR_MASK) >> PAGE_SHIFT);
> -	do {
> -		pages[*nr] = page;
> -		(*nr)++;
> -		page++;
> -		refs++;
> -	} while (addr += PAGE_SIZE, addr != end);
> +	refs = __record_subpages(page, addr, end, pages, *nr);
>  
>  	head = try_get_compound_head(pgd_page(orig), refs);
> -	if (!head) {
> -		*nr -= refs;
> +	if (!head)
>  		return 0;
> -	}
>  
>  	if (unlikely(pgd_val(orig) != pgd_val(*pgdp))) {
> -		*nr -= refs;
> -		while (refs--)
> -			put_page(head);
> +		put_compound_head(head, refs);
>  		return 0;
>  	}
>  
> -	SetPageReferenced(head);
> +	__huge_pt_done(head, refs, nr);
>  	return 1;
>  }
>  
> -- 
> 2.23.0
> 



* Re: [PATCH v2 03/18] goldish_pipe: rename local pin_user_pages() routine
  2019-11-03 21:17 ` [PATCH v2 03/18] goldish_pipe: rename local pin_user_pages() routine John Hubbard
@ 2019-11-04 16:52   ` Jerome Glisse
  0 siblings, 0 replies; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 16:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Sun, Nov 03, 2019 at 01:17:58PM -0800, John Hubbard wrote:
> 1. Avoid naming conflicts: rename local static function from
> "pin_user_pages()" to "pin_goldfish_pages()".
> 
> An upcoming patch will introduce a global pin_user_pages()
> function.
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  drivers/platform/goldfish/goldfish_pipe.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/platform/goldfish/goldfish_pipe.c b/drivers/platform/goldfish/goldfish_pipe.c
> index cef0133aa47a..7ed2a21a0bac 100644
> --- a/drivers/platform/goldfish/goldfish_pipe.c
> +++ b/drivers/platform/goldfish/goldfish_pipe.c
> @@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
>  	}
>  }
>  
> -static int pin_user_pages(unsigned long first_page,
> -			  unsigned long last_page,
> -			  unsigned int last_page_size,
> -			  int is_write,
> -			  struct page *pages[MAX_BUFFERS_PER_COMMAND],
> -			  unsigned int *iter_last_page_size)
> +static int pin_goldfish_pages(unsigned long first_page,
> +			      unsigned long last_page,
> +			      unsigned int last_page_size,
> +			      int is_write,
> +			      struct page *pages[MAX_BUFFERS_PER_COMMAND],
> +			      unsigned int *iter_last_page_size)
>  {
>  	int ret;
>  	int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
> @@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe *pipe,
>  	if (mutex_lock_interruptible(&pipe->lock))
>  		return -ERESTARTSYS;
>  
> -	pages_count = pin_user_pages(first_page, last_page,
> -				     last_page_size, is_write,
> -				     pipe->pages, &iter_last_page_size);
> +	pages_count = pin_goldfish_pages(first_page, last_page,
> +					 last_page_size, is_write,
> +					 pipe->pages, &iter_last_page_size);
>  	if (pages_count < 0) {
>  		mutex_unlock(&pipe->lock);
>  		return pages_count;
> -- 
> 2.23.0
> 



* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-03 21:18 ` [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
@ 2019-11-04 17:33   ` Jerome Glisse
  2019-11-04 19:04     ` John Hubbard
  2019-11-04 20:33   ` David Rientjes
  2019-11-05 13:10   ` Mike Rapoport
  2 siblings, 1 reply; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 17:33 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Sun, Nov 03, 2019 at 01:18:00PM -0800, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
>     pin_user_pages()
>     pin_user_pages_remote()
>     pin_user_pages_fast()
> 
>     pin_longterm_pages()
>     pin_longterm_pages_remote()
>     pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
>         * Start with "pin_user_pages" instead of "get_user_pages". That
>           makes it easy to find and audit the call sites.
>         * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded on it slightly.)
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

A few nitpicks below; nonetheless:

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

> ---
>  Documentation/vm/index.rst          |   1 +
>  Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++
>  include/linux/mm.h                  |  62 ++++++-
>  mm/gup.c                            | 265 +++++++++++++++++++++++++---
>  4 files changed, 514 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/vm/pin_user_pages.rst
> 

[...]

> diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
> new file mode 100644
> index 000000000000..3910f49ca98c
> --- /dev/null
> +++ b/Documentation/vm/pin_user_pages.rst

[...]

> +
> +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
> +==========================================================
> +
> +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
> +these categories:
> +
> +CASE 1: Direct IO (DIO)
> +-----------------------
> +There are GUP references to pages that are serving
> +as DIO buffers. These buffers are needed for a relatively short time (so they
> +are not "long term"). No special synchronization with page_mkclean() or
> +munmap() is provided. Therefore, flags to set at the call site are: ::
> +
> +    FOLL_PIN
> +
> +...but rather than setting FOLL_PIN directly, call sites should use one of
> +the pin_user_pages*() routines that set FOLL_PIN.
> +
> +CASE 2: RDMA
> +------------
> +There are GUP references to pages that are serving as DMA
> +buffers. These buffers are needed for a long time ("long term"). No special
> +synchronization with page_mkclean() or munmap() is provided. Therefore, flags
> +to set at the call site are: ::
> +
> +    FOLL_PIN | FOLL_LONGTERM
> +
> +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
> +because DAX pages do not have a separate page cache, and so "pinning" implies
> +locking down file system blocks, which is not (yet) supported in that way.
> +
> +CASE 3: ODP
> +-----------
> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> +replayable page faulting). There are GUP references to pages serving as DMA
> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> +needs to be set.

I would not include ODP or anything like it here: such users no longer go
through GUP, and I believe mentioning them here is more confusing than
helpful. I would, however, include some text in this documentation
explaining that hardware that supports page faults is superior, as it does
not incur any of the issues described here.

> +
> +CASE 4: Pinning for struct page manipulation only
> +-------------------------------------------------
> +Here, normal GUP calls are sufficient, so neither flag needs to be set.
> +

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..1aea48427879 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c

[...]

> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  		BUG_ON(*locked != 1);
>  	}
>  
> -	if (pages)
> +	/*
> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
> +	 * for FOLL_GET, not for the newer FOLL_PIN.
> +	 *
> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
> +	 * that here, as any failures will be obvious enough.
> +	 */
> +	if (pages && !(flags & FOLL_PIN))
>  		flags |= FOLL_GET;

Did you look at the users that pass pages without FOLL_GET set?
I believe it would be better to first fix them so that they end up
with FOLL_GET set, and then error out if pages != NULL but
neither FOLL_GET nor FOLL_PIN is set.
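
A minimal userspace sketch of the stricter check suggested here, for
illustration only. The SKETCH_FOLL_* values and sketch_check_gup_flags() are
hypothetical names invented for this example; the real flag values live in
include/linux/mm.h, and nothing like this function exists in the posted patch,
which instead keeps setting FOLL_GET implicitly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; not the kernel's definitions. */
#define SKETCH_FOLL_GET 0x1u
#define SKETCH_FOLL_PIN 0x2u

/*
 * Return 0 if the pages/flags combination is acceptable, -EINVAL otherwise.
 * Unlike the posted patch, a non-NULL pages[] without FOLL_GET or FOLL_PIN is
 * rejected here rather than being silently given FOLL_GET.
 */
static int sketch_check_gup_flags(const void *pages, unsigned int flags)
{
	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
	if ((flags & SKETCH_FOLL_GET) && (flags & SKETCH_FOLL_PIN))
		return -EINVAL;

	/* Asking for pages[] without saying how they are held is a caller bug. */
	if (pages && !(flags & (SKETCH_FOLL_GET | SKETCH_FOLL_PIN)))
		return -EINVAL;

	return 0;
}

/* Usage example: exercise each combination once. */
static void sketch_selfcheck(void)
{
	int dummy = 0;

	assert(sketch_check_gup_flags(&dummy, SKETCH_FOLL_GET) == 0);
	assert(sketch_check_gup_flags(&dummy, SKETCH_FOLL_PIN) == 0);
	assert(sketch_check_gup_flags(&dummy, 0) == -EINVAL);
	assert(sketch_check_gup_flags(NULL, 0) == 0);
}
```

Such a check would only be safe once every existing caller that passes
pages[] also sets FOLL_GET explicitly, which is the ordering proposed above.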

>  
>  	pages_done = 0;

> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>  	return ret;
>  }
>  
> -/**
> - * get_user_pages_fast() - pin user pages in memory
> - * @start:	starting user address
> - * @nr_pages:	number of pages from start to pin
> - * @gup_flags:	flags modifying pin behaviour
> - * @pages:	array that receives pointers to the pages pinned.
> - *		Should be at least nr_pages long.
> - *
> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
> - * If not successful, it will fall back to taking the lock and
> - * calling get_user_pages().
> - *
> - * Returns number of pages pinned. This may be fewer than the number
> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
> - * were pinned, returns -errno.
> - */
> -int get_user_pages_fast(unsigned long start, int nr_pages,
> -			unsigned int gup_flags, struct page **pages)
> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
> +					unsigned int gup_flags,
> +					struct page **pages)

Usually functions are renamed to _old_func_name, i.e. with a _ added in
front. So here it would become _get_user_pages_fast, but I know some people
don't like that, as we sometimes end up with ___function_overloaded :)

>  {
>  	unsigned long addr, len, end;
>  	int nr = 0, ret = 0;


> @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,

[...]

> +/**
> + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
> + * return the pages to the user.

Not a fan of "(typically)"; maybe:
pin_user_pages_remote() - pin pages of a remote process (task != current)

I think the remote part is more important here than DIO. Remote is used by
things other than DIO.

> + *
> + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
> + * get_user_pages_remote() for documentation on the function arguments, because
> + * the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			   unsigned long start, unsigned long nr_pages,
> +			   unsigned int gup_flags, struct page **pages,
> +			   struct vm_area_struct **vmas, int *locked)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
> +
> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> +				       locked, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_user_pages_remote);
> +
> +/**
> + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
> + * return the pages to the user.

I think you copy-pasted this from pin_user_pages_remote() :)

> + *
> + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
> + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
> + * documentation on the function arguments, because the arguments here are
> + * identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> + * pinned page to become unpin will be effective.
> + *
> + * This is intended for Case 2 (RDMA: long-term pins) in
> + * Documentation/vm/pin_user_pages.rst.
> + */
> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			       unsigned long start, unsigned long nr_pages,
> +			       unsigned int gup_flags, struct page **pages,
> +			       struct vm_area_struct **vmas, int *locked)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	/*
> +	 * FIXME: as noted in the get_user_pages_remote() implementation, it
> +	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
> +	 * needs to be set, but for now the best we can do is a "TODO" item.
> +	 */
> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;

Wouldn't it be better not to add pin_longterm_pages_remote() until
it can be properly implemented?



* Re: [PATCH v2 08/18] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()
  2019-11-03 21:18 ` [PATCH v2 08/18] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
@ 2019-11-04 17:41   ` Jerome Glisse
  0 siblings, 0 replies; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 17:41 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Sun, Nov 03, 2019 at 01:18:03PM -0800, John Hubbard wrote:
> Convert process_vm_access to use the new pin_user_pages_remote()
> call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
> code that requires tracking of pinned pages.
> 
> Also, release the pages via put_user_page*().
> 
> Also, rename "pages" to "pinned_pages", as this makes for
> easier reading of process_vm_rw_single_vec().
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
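
For readers following the thread, the release step described in the commit
message quoted above (unpin a batch of pages and, for writes, mark them dirty
in the same call) can be modelled in userspace. This is a sketch under stated
assumptions: struct sketch_upage and sketch_put_user_pages_dirty() are
hypothetical, a plain counter stands in for whatever tracking FOLL_PIN uses,
and no locking is modelled, unlike the put_user_pages_dirty_lock() call used
in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pinned page; not the kernel's struct page. */
struct sketch_upage {
	int pincount;	/* simplified pin tracking for this sketch */
	bool dirty;
};

/*
 * Release npages pinned pages. If make_dirty is set, every page in the batch
 * is marked dirty before its pin is dropped, so the caller no longer has to
 * dirty pages one by one while copying.
 */
static void sketch_put_user_pages_dirty(struct sketch_upage **pages,
					size_t npages, bool make_dirty)
{
	for (size_t i = 0; i < npages; i++) {
		assert(pages[i]->pincount > 0);	/* must have been pinned */
		if (make_dirty)
			pages[i]->dirty = true;
		pages[i]->pincount--;
	}
}
```

In this model the dirtying applies to the whole released batch, whereas the
removed set_page_dirty_lock() call in the copy loop only touched pages that
the loop actually visited.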

> ---
>  mm/process_vm_access.c | 28 +++++++++++++++-------------
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 357aa7bef6c0..fd20ab675b85 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
>  		if (copy > len)
>  			copy = len;
>  
> -		if (vm_write) {
> +		if (vm_write)
>  			copied = copy_page_from_iter(page, offset, copy, iter);
> -			set_page_dirty_lock(page);
> -		} else {
> +		else
>  			copied = copy_page_to_iter(page, offset, copy, iter);
> -		}
> +
>  		len -= copied;
>  		if (copied < copy && iov_iter_count(iter))
>  			return -EFAULT;
> @@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
>  		flags |= FOLL_WRITE;
>  
>  	while (!rc && nr_pages && iov_iter_count(iter)) {
> -		int pages = min(nr_pages, max_pages_per_loop);
> +		int pinned_pages = min(nr_pages, max_pages_per_loop);
>  		int locked = 1;
>  		size_t bytes;
>  
> @@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
>  		 * current/current->mm
>  		 */
>  		down_read(&mm->mmap_sem);
> -		pages = get_user_pages_remote(task, mm, pa, pages, flags,
> -					      process_pages, NULL, &locked);
> +		pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
> +						     flags, process_pages,
> +						     NULL, &locked);
>  		if (locked)
>  			up_read(&mm->mmap_sem);
> -		if (pages <= 0)
> +		if (pinned_pages <= 0)
>  			return -EFAULT;
>  
> -		bytes = pages * PAGE_SIZE - start_offset;
> +		bytes = pinned_pages * PAGE_SIZE - start_offset;
>  		if (bytes > len)
>  			bytes = len;
>  
> @@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
>  					 vm_write);
>  		len -= bytes;
>  		start_offset = 0;
> -		nr_pages -= pages;
> -		pa += pages * PAGE_SIZE;
> -		while (pages)
> -			put_page(process_pages[--pages]);
> +		nr_pages -= pinned_pages;
> +		pa += pinned_pages * PAGE_SIZE;
> +
> +		/* If vm_write is set, the pages need to be made dirty: */
> +		put_user_pages_dirty_lock(process_pages, pinned_pages,
> +					  vm_write);
>  	}
>  
>  	return rc;
> -- 
> 2.23.0
> 



* Re: [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast()
  2019-11-03 21:18 ` [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
@ 2019-11-04 17:44   ` Jerome Glisse
  2019-11-04 18:22     ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 17:44 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Sun, Nov 03, 2019 at 01:18:04PM -0800, John Hubbard wrote:
> Convert drm/via to use the new pin_user_pages_fast() call, which sets
> FOLL_PIN. Setting FOLL_PIN is now required for code that requires
> tracking of pinned pages, and therefore for any code that calls
> put_user_page().
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Please be more explicit that via_dmablit.c is already using put_user_page(),
as I expect that any conversion to pin_user_pages*() must be paired with
a put_user_page(). I find the above commit message a bit unclear from that POV.

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>


> ---
>  drivers/gpu/drm/via/via_dmablit.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
> index 3db000aacd26..37c5e572993a 100644
> --- a/drivers/gpu/drm/via/via_dmablit.c
> +++ b/drivers/gpu/drm/via/via_dmablit.c
> @@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  drm_via_dmablit_t *xfer)
>  	vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
>  	if (NULL == vsg->pages)
>  		return -ENOMEM;
> -	ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
> +	ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
>  			vsg->num_pages,
>  			vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
>  			vsg->pages);
> -- 
> 2.23.0
> 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast()
  2019-11-04 17:44   ` Jerome Glisse
@ 2019-11-04 18:22     ` John Hubbard
  0 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-04 18:22 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On 11/4/19 9:44 AM, Jerome Glisse wrote:
> On Sun, Nov 03, 2019 at 01:18:04PM -0800, John Hubbard wrote:
>> Convert drm/via to use the new pin_user_pages_fast() call, which sets
>> FOLL_PIN. Setting FOLL_PIN is now required for code that requires
>> tracking of pinned pages, and therefore for any code that calls
>> put_user_page().
>>
>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> 
> Please be more explicit that via_dmablit.c is already using put_user_page(),
> as I am expecting that any conversion to pin_user_pages*() must be paired
> with a put_user_page(). I find the above commit message a bit unclear from
> that POV.
> 

OK. This one, and the fs/io_uring (patch 9) and net/xdp (patch 10) were all
cases that had put_user_page() pre-existing. I will add something like the 
following to each commit description, for v3:

In partial anticipation of this work, the drm/via driver was already 
calling put_user_page() instead of put_page(). Therefore, in order to
convert from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required
is to change get_user_pages() to pin_user_pages().
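
To illustrate why the acquire and release calls have to come from the same
model, here is a rough userspace sketch. It is not kernel code: struct
toy_page, TOY_PIN_BIAS and the toy_* helpers are hypothetical stand-ins, and
it assumes (per patch 12 of this series) that a pin adds
GUP_PIN_COUNTING_BIAS to the refcount and that put_user_page() subtracts it.

```c
#include <assert.h>

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct toy_page {
	int refcount;
};

/* Assumed to mirror GUP_PIN_COUNTING_BIAS (1UL << 10) from patch 12. */
#define TOY_PIN_BIAS (1 << 10)

/* get_user_pages()/put_page() model: one reference per call. */
static void toy_get_page(struct toy_page *p) { p->refcount += 1; }
static void toy_put_page(struct toy_page *p) { p->refcount -= 1; }

/* pin_user_pages()/put_user_page() model: one bias per call. */
static void toy_pin_user_page(struct toy_page *p) { p->refcount += TOY_PIN_BIAS; }
static void toy_put_user_page(struct toy_page *p) { p->refcount -= TOY_PIN_BIAS; }

/* Refcount left after one acquire/release pair on a page that starts at 1. */
int toy_round_trip(int pinned_acquire, int pinned_release)
{
	struct toy_page page = { .refcount = 1 };

	if (pinned_acquire)
		toy_pin_user_page(&page);
	else
		toy_get_page(&page);

	if (pinned_release)
		toy_put_user_page(&page);
	else
		toy_put_page(&page);

	return page.refcount;
}

void toy_demo(void)
{
	/* Matched pairs return the page to its starting refcount. */
	assert(toy_round_trip(0, 0) == 1);
	assert(toy_round_trip(1, 1) == 1);

	/* A driver that releases via put_user_page() but still acquires
	 * via get_user_pages() would leave the count unbalanced. */
	assert(toy_round_trip(0, 1) == 2 - TOY_PIN_BIAS);
}
```

Under those assumptions, a driver that already calls put_user_page() only
balances out once its acquire side is switched to pin_user_pages*().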

thanks,

John Hubbard
NVIDIA

> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> 
>> ---
>>  drivers/gpu/drm/via/via_dmablit.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/via/via_dmablit.c b/drivers/gpu/drm/via/via_dmablit.c
>> index 3db000aacd26..37c5e572993a 100644
>> --- a/drivers/gpu/drm/via/via_dmablit.c
>> +++ b/drivers/gpu/drm/via/via_dmablit.c
>> @@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  drm_via_dmablit_t *xfer)
>>  	vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
>>  	if (NULL == vsg->pages)
>>  		return -ENOMEM;
>> -	ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
>> +	ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
>>  			vsg->num_pages,
>>  			vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
>>  			vsg->pages);
>> -- 
>> 2.23.0
>>
> 
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages
  2019-11-03 21:18 ` [PATCH v2 12/18] mm/gup: track FOLL_PIN pages John Hubbard
@ 2019-11-04 18:52   ` Jerome Glisse
  2019-11-04 22:49     ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 18:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>    bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> This also has a couple of trivial, non-functional change fixes to
> try_get_compound_head(). That function got moved to the top of the
> file.

Maybe split that as a separate trivial patch.

> 
> This includes the following fix from Ira Weiny:
> 
> DAX requires detection of a page crossing to a ref count of 1.  Fix this
> for GUP pages by introducing put_devmap_managed_user_page() which
> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.

Please do the put_devmap_managed_page() changes in a separate
patch, it would be a lot easier to follow, also on that front
see comments below.

> 
> [1] https://lwn.net/Articles/784574/ "Some slow progress on
> get_user_pages()"
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Suggested-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h       |  80 +++++++++++----
>  include/linux/mmzone.h   |   2 +
>  include/linux/page_ref.h |  10 ++
>  mm/gup.c                 | 213 +++++++++++++++++++++++++++++++--------
>  mm/huge_memory.c         |  32 +++++-
>  mm/hugetlb.c             |  28 ++++-
>  mm/memremap.c            |   4 +-
>  mm/vmstat.c              |   2 +
>  8 files changed, 300 insertions(+), 71 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cdfb6fedb271..03b3600843b7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void __put_devmap_managed_page(struct page *page, int count);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>  	if (!static_branch_unlikely(&devmap_managed_key))
>  		return false;
> @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page *page)
>  	switch (page->pgmap->type) {
>  	case MEMORY_DEVICE_PRIVATE:
>  	case MEMORY_DEVICE_FS_DAX:
> -		__put_devmap_managed_page(page);
>  		return true;
>  	default:
>  		break;
> @@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page *page)
>  	return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> +	bool is_devmap = page_is_devmap_managed(page);
> +
> +	if (is_devmap) {
> +		int count = page_ref_dec_return(page);
> +
> +		__put_devmap_managed_page(page, count);
> +	}
> +
> +	return is_devmap;
> +}

I think __put_devmap_managed_page() should be renamed to
free_devmap_managed_page(), and that the count != 1 case should
move into this inline function, i.e.:

static inline bool put_devmap_managed_page(struct page *page)
{
	bool is_devmap = page_is_devmap_managed(page);

	if (is_devmap) {
		int count = page_ref_dec_return(page);

		/*
		 * If refcount is 1 then page is freed and refcount is stable as nobody
		 * holds a reference on the page.
		 */
		if (count == 1)
			free_devmap_managed_page(page, count);
		else if (!count)
			__put_page(page);
	}

	return is_devmap;
}
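
For what it is worth, the control flow suggested above can be exercised in
userspace with a small model. This is only a sketch: struct toy_page, the
toy_action enum and toy_put_devmap_managed_page() are hypothetical names, and
the "free" and "__put_page" steps are merely recorded rather than performed.

```c
#include <assert.h>
#include <stdbool.h>

/* What the model would have done after dropping one reference. */
enum toy_action { TOY_NONE, TOY_FREE_DEVMAP, TOY_PUT_PAGE };

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct toy_page {
	int refcount;
	bool devmap_managed;
	enum toy_action last_action;
};

/* Models the suggested inline: drop one reference, then act on the result. */
bool toy_put_devmap_managed_page(struct toy_page *page)
{
	int count;

	if (!page->devmap_managed)
		return false;	/* caller falls back to the normal put path */

	count = --page->refcount;
	if (count == 1)
		page->last_action = TOY_FREE_DEVMAP;	/* back to the driver */
	else if (count == 0)
		page->last_action = TOY_PUT_PAGE;
	else
		page->last_action = TOY_NONE;

	return true;
}

void toy_demo(void)
{
	struct toy_page dax = { .refcount = 2, .devmap_managed = true };
	struct toy_page normal = { .refcount = 2, .devmap_managed = false };

	/* 2 -> 1 is the transition that frees a devmap managed page. */
	assert(toy_put_devmap_managed_page(&dax));
	assert(dax.refcount == 1 && dax.last_action == TOY_FREE_DEVMAP);

	/* Ordinary pages are left untouched. */
	assert(!toy_put_devmap_managed_page(&normal));
	assert(normal.refcount == 2);
}
```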


> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
>  	return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +

What about making it an inline here, as it is pretty small?


>  static inline void put_page(struct page *page)
>  {
>  	page = compound_head(page);
> @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page)
>  		__put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:            pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original page
> + * reference count, and also a new count of how many get_user_pages() calls were
> + * made against the page. ("gup-pinned" is another term for the latter).
> + *
> + * With this scheme, get_user_pages() becomes special: such pages are marked
> + * as distinct from normal pages. As such, the new put_user_page() call (and
> + * its variants) must be used in order to release gup-pinned pages.
> + *
> + * Choice of value:
>   *
> - * Pages that were pinned via get_user_pages*() must be released via
> - * either put_user_page(), or one of the put_user_pages*() routines
> - * below. This is so that eventually, pages that are pinned via
> - * get_user_pages*() can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special
> - * handling.
> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page reference
> + * counts with respect to get_user_pages() and put_user_page() becomes simpler,
> + * due to the fact that adding an even power of two to the page refcount has
> + * the effect of using only the upper N bits, for the code that counts up using
> + * the bias value. This means that the lower bits are left for the exclusive
> + * use of the original code that increments and decrements by one (or at least,
> + * by much smaller values than the bias value).
>   *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with get_user_page() calls.
> + * Of course, once the lower bits overflow into the upper bits (and this is
> + * OK, because subtraction recovers the original values), then visual inspection
> + * no longer suffices to directly view the separate counts. However, for normal
> + * applications that don't have huge page reference counts, this won't be an
> + * issue.
> + *
> + * Locking: the lockless algorithm described in page_cache_get_speculative()
> + * and page_cache_gup_pin_speculative() provides safe operation for
> + * get_user_pages and page_mkclean and other calls that race to set up page
> + * table entries.
>   */
> -static inline void put_user_page(struct page *page)
> -{
> -	put_page(page);
> -}
> +#define GUP_PIN_COUNTING_BIAS (1UL << 10)
>  
> +void put_user_page(struct page *page);
>  void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>  			       bool make_dirty);
> -
>  void put_user_pages(struct page **pages, unsigned long npages);
>  
> +/**
> + * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
> + * or pin_longterm_pages*()
> + * @page:	pointer to page to be queried.
> + * @Return:	True, if it is likely that the page has been "dma-pinned".
> + *		False, if the page is definitely not dma-pinned.
> + */

Maybe add a small comment about wrap around :)
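
A quick userspace sketch of the arithmetic such a comment could describe.
Assumptions: a 32-bit signed refcount and the 1 << 10 bias from this patch;
toy_page_dma_pinned() and toy_max_pins() are hypothetical helpers, not
kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror GUP_PIN_COUNTING_BIAS (1UL << 10). */
#define TOY_PIN_BIAS (1 << 10)

/* Same test as page_dma_pinned(), applied to a bare refcount value. */
bool toy_page_dma_pinned(int32_t refcount)
{
	return refcount >= TOY_PIN_BIAS;
}

/* Number of pins that fit in a signed 32-bit counter before it wraps. */
int32_t toy_max_pins(void)
{
	return INT32_MAX / TOY_PIN_BIAS;
}

void toy_demo(void)
{
	/* One ordinary reference: not pinned. */
	assert(!toy_page_dma_pinned(1));

	/* One ordinary reference plus one pin: reported as pinned. */
	assert(toy_page_dma_pinned(1 + TOY_PIN_BIAS));

	/* 1024 ordinary references and no pin: a false positive, which is
	 * why the kernel-doc says "likely" rather than "definitely". */
	assert(toy_page_dma_pinned(TOY_PIN_BIAS));

	/* Roughly 2 million pins fit; one more would overflow the counter. */
	assert(toy_max_pins() == 2097151);
}
```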

> +static inline bool page_dma_pinned(struct page *page)
> +{
> +	return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS;
> +}
> +
>  #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
>  #define SECTION_IN_PAGE_FLAGS
>  #endif

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index 1aea48427879..c9727e65fad3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c

[...]

> @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>  
>  		pgmap = get_dev_pagemap(pfn, pgmap);
>  		if (unlikely(!pgmap)) {
> -			undo_dev_pagemap(nr, nr_start, pages);
> +			undo_dev_pagemap(nr, nr_start, flags, pages);
>  			return 0;
>  		}
>  		SetPageReferenced(page);
>  		pages[*nr] = page;
> -		get_page(page);
> +
> +		if (flags & FOLL_PIN) {
> +			if (unlikely(!user_page_ref_inc(page))) {
> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> +				return 0;
> +			}

Maybe add a comment about a case that should never happen, i.e.
user_page_ref_inc() failing after the second iteration of the
loop, as it would be broken and a bug to call undo_dev_pagemap()
after the first iteration of that loop.

Also, I believe that this should never happen, because if the first
iteration succeeds then __page_cache_add_speculative() will
succeed for all the iterations.

Note that the pgmap case above follows that too, i.e. the call to
get_dev_pagemap() can only fail on the first iteration of the loop;
well, I assume you can never have a huge device page that spans
different pgmaps, i.e. different devices (which is a reasonable
assumption). So maybe this code needs fixing, i.e.:

		pgmap = get_dev_pagemap(pfn, pgmap);
		if (unlikely(!pgmap))
			return 0;


> +		} else
> +			get_page(page);
> +
>  		(*nr)++;
>  		pfn++;
>  	} while (addr += PAGE_SIZE, addr != end);

[...]

> @@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
>  	unsigned long addr, len, end;
>  	int nr = 0, ret = 0;
>  
> -	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> +	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))

Maybe add a comment to explain, something like:

/*
 * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN
 *
 * Note that get_user_pages_fast() implies the FOLL_GET flag by default, but
 * callers can override this default to the pin case by setting FOLL_PIN.
 */
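
The mask check being commented can be tried out in isolation. A minimal
userspace sketch follows; the TOY_FOLL_* values are placeholders chosen for
illustration (not copied from the kernel headers), and
toy_check_fast_flags() is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder flag bits; the real values live in include/linux/mm.h. */
#define TOY_FOLL_WRITE		0x00001u
#define TOY_FOLL_GET		0x00004u
#define TOY_FOLL_LONGTERM	0x10000u
#define TOY_FOLL_PIN		0x40000u

/* Reject any flag outside the set the fast path accepts. */
int toy_check_fast_flags(unsigned int gup_flags)
{
	const unsigned int allowed =
		TOY_FOLL_WRITE | TOY_FOLL_LONGTERM | TOY_FOLL_PIN;

	if (gup_flags & ~allowed)
		return -EINVAL;
	return 0;
}

void toy_demo(void)
{
	/* No flags, or any mix of the three allowed flags, is accepted. */
	assert(toy_check_fast_flags(0) == 0);
	assert(toy_check_fast_flags(TOY_FOLL_WRITE | TOY_FOLL_PIN) == 0);

	/* FOLL_GET is implied by default, so passing it is rejected. */
	assert(toy_check_fast_flags(TOY_FOLL_GET) == -EINVAL);
}
```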

>  		return -EINVAL;
>  
>  	start = untagged_addr(start) & PAGE_MASK;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 13cc93785006..66bf4c8b88f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c

[...]

> @@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	if (!*pgmap)
>  		return ERR_PTR(-EFAULT);
>  	page = pfn_to_page(pfn);
> -	get_page(page);
> +
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +	else if (flags & FOLL_PIN)
> +		if (unlikely(!user_page_ref_inc(page)))
> +			page = ERR_PTR(-ENOMEM);

While I agree that user_page_ref_inc() (i.e. page_cache_add_speculative())
should never fail here, as we are holding the pmd lock and thus no one
can unmap the pmd and free the page it points to, I believe you should
return -EFAULT as for the pgmap, and not -ENOMEM, as the pgmap should
not fail either for the same reason. Thus it would be better to have
a consistent error. Maybe also add a comment explaining that it should
not fail here.

>  
>  	return page;
>  }

[...]

> @@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>  	 * device mapped pages can only be returned if the
>  	 * caller will manage the page reference count.
>  	 */
> -	if (!(flags & FOLL_GET))
> +	if (!(flags & (FOLL_GET | FOLL_PIN)))
>  		return ERR_PTR(-EEXIST);

Maybe add a comment that FOLL_GET or FOLL_PIN must be set.

>  	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
> @@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>  	if (!*pgmap)
>  		return ERR_PTR(-EFAULT);
>  	page = pfn_to_page(pfn);
> -	get_page(page);
> +
> +	if (flags & FOLL_GET)
> +		get_page(page);
> +	else if (flags & FOLL_PIN)
> +		if (unlikely(!user_page_ref_inc(page)))
> +			page = ERR_PTR(-ENOMEM);

Same as for follow_devmap_pmd() see above.

>  
>  	return page;
>  }
> @@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  skip_mlock:
>  	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
>  	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
> +
>  	if (flags & FOLL_GET)
>  		get_page(page);
> +	else if (flags & FOLL_PIN)
> +		if (unlikely(!user_page_ref_inc(page)))
> +			page = NULL;

This should not fail either, as we are holding the pmd lock; maybe add
a comment. Dunno if we want a WARN() or something to catch this
degenerate case, or dump the page.

>  
>  out:
>  	return page;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b45a95363a84..da335b1cd798 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  same_page:
>  		if (pages) {
>  			pages[i] = mem_map_offset(page, pfn_offset);
> -			get_page(pages[i]);
> +
> +			if (flags & FOLL_GET)
> +				get_page(pages[i]);
> +			else if (flags & FOLL_PIN)
> +				if (unlikely(!user_page_ref_inc(pages[i]))) {
> +					spin_unlock(ptl);
> +					remainder = 0;
> +					err = -ENOMEM;
> +					WARN_ON_ONCE(1);
> +					break;
> +				}
>  		}

user_page_ref_inc() should not fail here either because we hold the
ptl, so the WARN_ON_ONCE() is right, but maybe add a comment.

>  
>  		if (vmas)

[...]

> @@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	pte = huge_ptep_get((pte_t *)pmd);
>  	if (pte_present(pte)) {
>  		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
> +
>  		if (flags & FOLL_GET)
>  			get_page(page);
> +		else if (flags & FOLL_PIN)
> +			if (unlikely(!user_page_ref_inc(page))) {
> +				page = NULL;
> +				goto out;
> +			}

This should not fail either (again holding pmd lock), dunno if we want
a warn or something to catch this degenerate case.

>  	} else {
>  		if (is_hugetlb_entry_migration(pte)) {
>  			spin_unlock(ptl);

[...]


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 17:33   ` Jerome Glisse
@ 2019-11-04 19:04     ` John Hubbard
  2019-11-04 19:18       ` Jerome Glisse
  0 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-04 19:04 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On 11/4/19 9:33 AM, Jerome Glisse wrote:
...
> 
> Few nitpick belows, nonetheless:
> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> [...]
>> +
>> +CASE 3: ODP
>> +-----------
>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
>> +replayable page faulting). There are GUP references to pages serving as DMA
>> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
>> +needs to be set.
> 
> I would not include ODP or anything like it here, they do not use
> GUP anymore and i believe it is more confusing here. I would how-
> ever include some text in this documentation explaining that hard-
> ware that support page fault is superior as it does not incur any
> of the issues described here.

OK, agreed, here's a new write up that I'll put in v3:


CASE 3: ODP
-----------
Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
Here, a well-written driver doesn't normally need to pin pages at all. However,
if the driver does choose to do so, it can register MMU notifiers for the range,
and will be called back upon invalidation. Either way (avoiding page pinning, or
using MMU notifiers to unpin upon request), there is proper synchronization with 
both filesystem and mm (page_mkclean(), munmap(), etc).

Therefore, neither flag needs to be set.

It's worth mentioning here that pinning pages should not be the first design
choice. If page fault capable hardware is available, then the software should
be written so that it does not pin pages. This allows mm and filesystems to
operate more efficiently and reliably.

> [...]
> 
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 199da99e8ffc..1aea48427879 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
> 
> [...]
> 
>> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>>  		BUG_ON(*locked != 1);
>>  	}
>>  
>> -	if (pages)
>> +	/*
>> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
>> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
>> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
>> +	 * for FOLL_GET, not for the newer FOLL_PIN.
>> +	 *
>> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
>> +	 * that here, as any failures will be obvious enough.
>> +	 */
>> +	if (pages && !(flags & FOLL_PIN))
>>  		flags |= FOLL_GET;
> 
> Did you look at user that have pages and not FOLL_GET set ?
> I believe it would be better to first fix them to end up
> with FOLL_GET set and then error out if pages is != NULL but
> nor FOLL_GET or FOLL_PIN is set.
> 

I was perhaps overly cautious, and didn't go there. However, it's probably
doable, given that there was already the following in __get_user_pages():

    VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));

...which will have conditioned people and code to set FOLL_GET together with
pages. So I agree that the time is right.

In order to make bisecting future failures simpler, I can insert a patch right 
before this one, that changes the FOLL_GET setting into an assert, like this:

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..be338961e80d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1014,8 +1014,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
                BUG_ON(*locked != 1);
        }
 
-       if (pages)
-               flags |= FOLL_GET;
+       if (pages && WARN_ON_ONCE(!(flags & FOLL_GET)))
+               return -EINVAL;
 
        pages_done = 0;
        lock_dropped = false;


...and then add in FOLL_PIN, with this patch.
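
Roughly, the end state after both patches would behave like the userspace
sketch below. It is only a model: the TOY_FOLL_* values are placeholders, not
the kernel's, and toy_check_pages_flags() is a hypothetical helper that
ignores the WARN_ON_ONCE() side of things.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder flag bits; the real values live in include/linux/mm.h. */
#define TOY_FOLL_GET	0x00004u
#define TOY_FOLL_PIN	0x40000u

/*
 * A caller that wants pages[] filled in must say how the references are
 * to be taken: exactly one of FOLL_GET or FOLL_PIN.
 */
int toy_check_pages_flags(const void *pages, unsigned int flags)
{
	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
	if ((flags & TOY_FOLL_GET) && (flags & TOY_FOLL_PIN))
		return -EINVAL;

	/* pages[] requested, but neither reference model was selected. */
	if (pages && !(flags & (TOY_FOLL_GET | TOY_FOLL_PIN)))
		return -EINVAL;

	return 0;
}

void toy_demo(void)
{
	int dummy;	/* any non-NULL address will do for the model */

	assert(toy_check_pages_flags(&dummy, TOY_FOLL_GET) == 0);
	assert(toy_check_pages_flags(&dummy, TOY_FOLL_PIN) == 0);
	assert(toy_check_pages_flags(&dummy, 0) == -EINVAL);
	assert(toy_check_pages_flags(NULL, 0) == 0);
}
```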

>>  
>>  	pages_done = 0;
> 
>> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>>  	return ret;
>>  }
>>  
>> -/**
>> - * get_user_pages_fast() - pin user pages in memory
>> - * @start:	starting user address
>> - * @nr_pages:	number of pages from start to pin
>> - * @gup_flags:	flags modifying pin behaviour
>> - * @pages:	array that receives pointers to the pages pinned.
>> - *		Should be at least nr_pages long.
>> - *
>> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
>> - * If not successful, it will fall back to taking the lock and
>> - * calling get_user_pages().
>> - *
>> - * Returns number of pages pinned. This may be fewer than the number
>> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
>> - * were pinned, returns -errno.
>> - */
>> -int get_user_pages_fast(unsigned long start, int nr_pages,
>> -			unsigned int gup_flags, struct page **pages)
>> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
>> +					unsigned int gup_flags,
>> +					struct page **pages)
> 
> Usualy function are rename to _old_func_name ie add _ in front. So
> here it would become _get_user_pages_fast but i know some people
> don't like that as sometimes we endup with ___function_overloaded :)

Exactly: the __get_user_pages* names were already used for *non*-internal
routines, so I attempted to pick the next best naming prefix.

> 
>>  {
>>  	unsigned long addr, len, end;
>>  	int nr = 0, ret = 0;
> 
> 
>> @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> 
> [...]
> 
>> +/**
>> + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
>> + * return the pages to the user.
> 
> Not a fan of (typically) maybe:
> pin_user_pages_remote() - pin pages of a remote process (task != current)
> 
> I think here the remote part if more important that DIO. Remote is use by
> other thing that DIO.

Yes, good point. I'll use your wording:

 * pin_user_pages_remote() - pin pages of a remote process (task != current)



> 
>> + *
>> + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
>> + * get_user_pages_remote() for documentation on the function arguments, because
>> + * the arguments here are identical.
>> + *
>> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
>> + * see Documentation/vm/pin_user_pages.rst for details.
>> + *
>> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
>> + * is NOT intended for Case 2 (RDMA: long-term pins).
>> + */
>> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>> +			   unsigned long start, unsigned long nr_pages,
>> +			   unsigned int gup_flags, struct page **pages,
>> +			   struct vm_area_struct **vmas, int *locked)
>> +{
>> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
>> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>> +		return -EINVAL;
>> +
>> +	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
>> +
>> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
>> +				       locked, gup_flags);
>> +}
>> +EXPORT_SYMBOL(pin_user_pages_remote);
>> +
>> +/**
>> + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
>> + * return the pages to the user.
> 
> I think you copy pasted this from pin_user_pages_remote() :)

I admit to nothing, with respect to copy-paste! :)

This one can simply be:

 * pin_longterm_pages_remote() - pin pages of a remote process (task != current)


> 
>> + *
>> + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
>> + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
>> + * documentation on the function arguments, because the arguments here are
>> + * identical.
>> + *
>> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
>> + * see Documentation/vm/pin_user_pages.rst for further details.
>> + *
>> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
>> + * typically by a non-CPU device, and we cannot be sure that waiting for a
>> + * pinned page to become unpin will be effective.
>> + *
>> + * This is intended for Case 2 (RDMA: long-term pins) in
>> + * Documentation/vm/pin_user_pages.rst.
>> + */
>> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>> +			       unsigned long start, unsigned long nr_pages,
>> +			       unsigned int gup_flags, struct page **pages,
>> +			       struct vm_area_struct **vmas, int *locked)
>> +{
>> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
>> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
>> +		return -EINVAL;
>> +
>> +	/*
>> +	 * FIXME: as noted in the get_user_pages_remote() implementation, it
>> +	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
>> +	 * needs to be set, but for now the best we can do is a "TODO" item.
>> +	 */
>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> 
> Wouldn't it be better to not add pin_longterm_pages_remote() until
> it can be properly implemented ?
> 

Well, the problem is that I need each call site that requires FOLL_PIN
to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
there is a hard, bright rule, which is: if and only if a caller sets
FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
be called.

So this leaves me with only two reasonable choices:

a) Convert the call site as above: pin_longterm_pages_remote(), which sets
FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
site *might* not need any changes to adopt the working gup.c code.

b) Convert the call site to pin_user_pages_remote(), which also sets
FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
There would also be a comment at the call site, to the effect of, "this
is the wrong call to make: it really requires FOLL_LONGTERM behavior".

When the FOLL_LONGTERM situation is fixed, the call site will need to be
changed to pin_longterm_pages_remote().

So you can probably see why I picked (a).


thanks,

John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 19:04     ` John Hubbard
@ 2019-11-04 19:18       ` Jerome Glisse
  2019-11-04 19:30         ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 19:18 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> ...
> > 
> > Few nitpick belows, nonetheless:
> > 
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > [...]
> >> +
> >> +CASE 3: ODP
> >> +-----------
> >> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> >> +replayable page faulting). There are GUP references to pages serving as DMA
> >> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> >> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> >> +needs to be set.
> > 
> > I would not include ODP or anything like it here, they do not use
> > GUP anymore and i believe it is more confusing here. I would how-
> > ever include some text in this documentation explaining that hard-
> > ware that support page fault is superior as it does not incur any
> > of the issues described here.
> 
> OK, agreed, here's a new write up that I'll put in v3:
> 
> 
> CASE 3: ODP
> -----------

ODP is RDMA; maybe use "Hardware with page fault support" instead.

> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> Here, a well-written driver doesn't normally need to pin pages at all. However,
> if the driver does choose to do so, it can register MMU notifiers for the range,
> and will be called back upon invalidation. Either way (avoiding page pinning, or
> using MMU notifiers to unpin upon request), there is proper synchronization with 
> both filesystem and mm (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.

In fact GUP should never be used with those.

> 
> It's worth mentioning here that pinning pages should not be the first design
> choice. If page fault capable hardware is available, then the software should
> be written so that it does not pin pages. This allows mm and filesystems to
> operate more efficiently and reliably.
> 
> > [...]
> > 
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 199da99e8ffc..1aea48427879 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> > 
> > [...]
> > 
> >> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
> >>  		BUG_ON(*locked != 1);
> >>  	}
> >>  
> >> -	if (pages)
> >> +	/*
> >> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> >> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
> >> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
> >> +	 * for FOLL_GET, not for the newer FOLL_PIN.
> >> +	 *
> >> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
> >> +	 * that here, as any failures will be obvious enough.
> >> +	 */
> >> +	if (pages && !(flags & FOLL_PIN))
> >>  		flags |= FOLL_GET;
> > 
> > Did you look at users that pass pages without FOLL_GET set?
> > I believe it would be better to first fix them to end up
> > with FOLL_GET set, and then error out if pages != NULL but
> > neither FOLL_GET nor FOLL_PIN is set.
> > 
> 
> I was perhaps overly cautious, and didn't go there. However, it's probably
> doable, given that there was already the following in __get_user_pages():
> 
>     VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> 
> ...which will have conditioned people and code to set FOLL_GET together with
> pages. So I agree that the time is right.
> 
> In order to make bisecting future failures simpler, I can insert a patch right 
> before this one, that changes the FOLL_GET setting into an assert, like this:
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..be338961e80d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1014,8 +1014,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>                 BUG_ON(*locked != 1);
>         }
>  
> -       if (pages)
> -               flags |= FOLL_GET;
> +       if (pages && WARN_ON_ONCE(!(flags & FOLL_GET)))
> +               return -EINVAL;
>  
>         pages_done = 0;
>         lock_dropped = false;
> 
> 
> ...and then add in FOLL_PIN, with this patch.

Looks good, but double check that it does not happen; I will try
to check on my side too.
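
The stricter rule discussed above (pages[] supplied means the caller must
pick FOLL_GET or FOLL_PIN, and never both) can be modelled in plain
userspace C. This is only a sketch under stated assumptions: the DEMO_*
flag values and demo_check_gup_flags() are hypothetical names for
illustration, not the kernel's definitions or the code proposed in the
patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative values only; the real flags live in include/linux/mm.h. */
#define DEMO_FOLL_GET 0x1u
#define DEMO_FOLL_PIN 0x2u

/*
 * Hypothetical helper: validate the flag/pages combination instead of
 * silently setting FOLL_GET on behalf of the caller.
 * Returns 0 if the combination is acceptable, -EINVAL otherwise.
 */
static int demo_check_gup_flags(const void *pages, unsigned int flags)
{
	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
	if ((flags & DEMO_FOLL_GET) && (flags & DEMO_FOLL_PIN))
		return -EINVAL;

	/* A pages[] array with neither flag set is an error, not a fixup. */
	if (pages && !(flags & (DEMO_FOLL_GET | DEMO_FOLL_PIN)))
		return -EINVAL;

	return 0;
}
```

In the kernel the failing branches would presumably be wrapped in
WARN_ON_ONCE(), as in the diff quoted above, so that an offending call
site shows up once in the log rather than failing silently.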

> 
> >>  
> >>  	pages_done = 0;
> > 
> >> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
> >>  	return ret;
> >>  }
> >>  
> >> -/**
> >> - * get_user_pages_fast() - pin user pages in memory
> >> - * @start:	starting user address
> >> - * @nr_pages:	number of pages from start to pin
> >> - * @gup_flags:	flags modifying pin behaviour
> >> - * @pages:	array that receives pointers to the pages pinned.
> >> - *		Should be at least nr_pages long.
> >> - *
> >> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
> >> - * If not successful, it will fall back to taking the lock and
> >> - * calling get_user_pages().
> >> - *
> >> - * Returns number of pages pinned. This may be fewer than the number
> >> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
> >> - * were pinned, returns -errno.
> >> - */
> >> -int get_user_pages_fast(unsigned long start, int nr_pages,
> >> -			unsigned int gup_flags, struct page **pages)
> >> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
> >> +					unsigned int gup_flags,
> >> +					struct page **pages)
> > 
> > Usually functions are renamed to _old_func_name, i.e. add _ in front. So
> > here it would become _get_user_pages_fast, but I know some people
> > don't like that, as sometimes we end up with ___function_overloaded :)
> 
> Exactly: the __get_user_pages* names were already used for *non*-internal
> routines, so I attempted to pick the next best naming prefix.

Didn't know we were that far in the ___ :)

> > 
> >>  {
> >>  	unsigned long addr, len, end;
> >>  	int nr = 0, ret = 0;
> > 
> > 
> >> @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
> > 
> > [...]
> > 
> >> +/**
> >> + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
> >> + * return the pages to the user.
> > 
> > Not a fan of (typically) maybe:
> > pin_user_pages_remote() - pin pages of a remote process (task != current)
> > 
> > I think here the remote part is more important than DIO. Remote is used by
> > other things than DIO.
> 
> Yes, good point. I'll use your wording:
> 
>  * pin_user_pages_remote() - pin pages of a remote process (task != current)
> 
> 
> 
> > 
> >> + *
> >> + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
> >> + * get_user_pages_remote() for documentation on the function arguments, because
> >> + * the arguments here are identical.
> >> + *
> >> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> >> + * see Documentation/vm/pin_user_pages.rst for details.
> >> + *
> >> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> >> + * is NOT intended for Case 2 (RDMA: long-term pins).
> >> + */
> >> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> >> +			   unsigned long start, unsigned long nr_pages,
> >> +			   unsigned int gup_flags, struct page **pages,
> >> +			   struct vm_area_struct **vmas, int *locked)
> >> +{
> >> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> >> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >> +		return -EINVAL;
> >> +
> >> +	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
> >> +
> >> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> >> +				       locked, gup_flags);
> >> +}
> >> +EXPORT_SYMBOL(pin_user_pages_remote);
> >> +
> >> +/**
> >> + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
> >> + * return the pages to the user.
> > 
> > I think you copy pasted this from pin_user_pages_remote() :)
> 
> I admit to nothing, with respect to copy-paste! :)
> 
> This one can simply be:
> 
>  * pin_longterm_pages_remote() - pin pages of a remote process (task != current)
> 
> 
> > 
> >> + *
> >> + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
> >> + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
> >> + * documentation on the function arguments, because the arguments here are
> >> + * identical.
> >> + *
> >> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> >> + * see Documentation/vm/pin_user_pages.rst for further details.
> >> + *
> >> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> >> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> >> + * pinned page to become unpinned will be effective.
> >> + *
> >> + * This is intended for Case 2 (RDMA: long-term pins) in
> >> + * Documentation/vm/pin_user_pages.rst.
> >> + */
> >> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> >> +			       unsigned long start, unsigned long nr_pages,
> >> +			       unsigned int gup_flags, struct page **pages,
> >> +			       struct vm_area_struct **vmas, int *locked)
> >> +{
> >> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> >> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> >> +		return -EINVAL;
> >> +
> >> +	/*
> >> +	 * FIXME: as noted in the get_user_pages_remote() implementation, it
> >> +	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
> >> +	 * needs to be set, but for now the best we can do is a "TODO" item.
> >> +	 */
> >> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> > 
> > Wouldn't it be better to not add pin_longterm_pages_remote() until
> > it can be properly implemented ?
> > 
> 
> Well, the problem is that I need each call site that requires FOLL_PIN
> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
> there is a hard, bright rule, which is: if and only if a caller sets
> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
> be called.
> 
> So this leaves me with only two reasonable choices:
> 
> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
> site *might* not need any changes to adopt the working gup.c code.
> 
> b) Convert the call site to pin_user_pages_remote(), which also sets
> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
> There would also be a comment at the call site, to the effect of, "this
> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
> 
> When the FOLL_LONGTERM situation is fixed, the call site will need to be
> changed to pin_longterm_pages_remote().
> 
> So you can probably see why I picked (a).

But right now nobody uses FOLL_LONGTERM together with FOLL_REMOTE, so you
should never need pin_longterm_pages_remote(). My fear is that longterm has
implications, and it would be better not to lose them by adding a wrapper
that does not do what its name says.

So do not introduce pin_longterm_pages_remote() until its first user
appears. This is option c).
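
As an illustration of the "bright rule" under discussion (a wrapper always
sets FOLL_PIN, and a pinned page must be released through the matching
put_user_page()-style call), here is a small userspace model. Everything
in it is an assumption made for the sketch: struct demo_page, the DEMO_*
flags and the demo_*() functions are hypothetical and do not reflect how
the kernel actually accounts for pins.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative values only; not the kernel's flag definitions. */
#define DEMO_FOLL_GET 0x1u
#define DEMO_FOLL_PIN 0x2u

/* Toy stand-in for struct page; separate counters make mismatches visible. */
struct demo_page {
	int refs;	/* references taken the FOLL_GET way */
	int pins;	/* references taken the FOLL_PIN way */
};

/* Hypothetical core routine: accounting depends only on the flags. */
static int demo_gup(struct demo_page *page, unsigned int flags)
{
	if (flags & DEMO_FOLL_PIN)
		page->pins++;
	else if (flags & DEMO_FOLL_GET)
		page->refs++;
	return 0;
}

/* Hypothetical wrapper: rejects FOLL_GET and always sets FOLL_PIN. */
static int demo_pin_user_page(struct demo_page *page, unsigned int gup_flags)
{
	if (gup_flags & DEMO_FOLL_GET)
		return -EINVAL;
	return demo_gup(page, gup_flags | DEMO_FOLL_PIN);
}

/* Hypothetical release call: only valid for pages that were pinned. */
static int demo_put_user_page(struct demo_page *page)
{
	if (page->pins == 0)
		return -EINVAL;
	page->pins--;
	return 0;
}
```

The point of routing every pin through a wrapper is that no call site sets
FOLL_PIN by hand, so the pairing with the release call can be audited by
function name alone.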

Cheers,
Jérôme



* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 19:18       ` Jerome Glisse
@ 2019-11-04 19:30         ` John Hubbard
  2019-11-04 19:52           ` Jerome Glisse
  0 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-04 19:30 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On 11/4/19 11:18 AM, Jerome Glisse wrote:
> On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
>> On 11/4/19 9:33 AM, Jerome Glisse wrote:
>> ...
>>>
>>> Few nitpick belows, nonetheless:
>>>
>>> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
>>> [...]
>>>> +
>>>> +CASE 3: ODP
>>>> +-----------
>>>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
>>>> +replayable page faulting). There are GUP references to pages serving as DMA
>>>> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
>>>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
>>>> +needs to be set.
>>>
>>> I would not include ODP or anything like it here; they do not use
>>> GUP anymore, and I believe it is more confusing here. I would,
>>> however, include some text in this documentation explaining that
>>> hardware that supports page faults is superior, as it does not incur
>>> any of the issues described here.
>>
>> OK, agreed, here's a new write up that I'll put in v3:
>>
>>
>> CASE 3: ODP
>> -----------
> 
> ODP is RDMA-specific; maybe use "Hardware with page fault support" instead.
> 
>> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.

OK, so:

    "RDMA hardware with page faulting support."

for the first sentence.


>> Here, a well-written driver doesn't normally need to pin pages at all. However,
>> if the driver does choose to do so, it can register MMU notifiers for the range,
>> and will be called back upon invalidation. Either way (avoiding page pinning, or
>> using MMU notifiers to unpin upon request), there is proper synchronization with 
>> both filesystem and mm (page_mkclean(), munmap(), etc).
>>
>> Therefore, neither flag needs to be set.
> 
> In fact, GUP should never be used with those.


Yes. The next paragraph says that, but maybe not strongly enough.


>>
>> It's worth mentioning here that pinning pages should not be the first design
>> choice. If page fault capable hardware is available, then the software should
>> be written so that it does not pin pages. This allows mm and filesystems to
>> operate more efficiently and reliably.

Here's what we have after the above changes:

CASE 3: ODP
-----------
RDMA hardware with page faulting support. Here, a well-written driver doesn't
normally need to pin pages at all. However, if the driver does choose to do so,
it can register MMU notifiers for the range, and will be called back upon
invalidation. Either way (avoiding page pinning, or using MMU notifiers to unpin
upon request), there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc).

Therefore, neither flag needs to be set.

In this case, ideally, neither get_user_pages() nor pin_user_pages() should be 
called. Instead, the software should be written so that it does not pin pages. 
This allows mm and filesystems to operate more efficiently and reliably.
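
To make the notifier-based alternative concrete, the following is a toy
userspace model of "invalidate on request, re-fault on next access". It is
a sketch under loud assumptions: struct demo_dev_mapping and both demo_*()
functions are hypothetical, and the real MMU notifier API is considerably
richer than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a device mapping covering the range [start, end). */
struct demo_dev_mapping {
	unsigned long start;
	unsigned long end;
	bool mapped;	/* true while the device may access the range */
	int faults;	/* number of device page faults replayed so far */
};

/*
 * Hypothetical invalidation callback: invoked before the range
 * [start, end) changes. The device mapping is dropped if it overlaps.
 */
static void demo_invalidate_range(struct demo_dev_mapping *m,
				  unsigned long start, unsigned long end)
{
	if (start < m->end && end > m->start)
		m->mapped = false;
}

/*
 * Hypothetical device access: if the mapping was invalidated, the
 * hardware faults and the mapping is re-established, with no page pin
 * held in between.
 */
static void demo_device_access(struct demo_dev_mapping *m)
{
	if (!m->mapped) {
		m->faults++;
		m->mapped = true;
	}
}
```

Because nothing stays pinned between accesses, the filesystem and mm side
never has to wait on the device, which is the property the text above
relies on.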

>>> [...]
>>>
>>>> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>>>>  		BUG_ON(*locked != 1);
>>>>  	}
>>>>  
>>>> -	if (pages)
>>>> +	/*
>>>> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
>>>> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
>>>> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
>>>> +	 * for FOLL_GET, not for the newer FOLL_PIN.
>>>> +	 *
>>>> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
>>>> +	 * that here, as any failures will be obvious enough.
>>>> +	 */
>>>> +	if (pages && !(flags & FOLL_PIN))
>>>>  		flags |= FOLL_GET;
>>>
>>> Did you look at users that pass pages without FOLL_GET set?
>>> I believe it would be better to first fix them to end up
>>> with FOLL_GET set, and then error out if pages != NULL but
>>> neither FOLL_GET nor FOLL_PIN is set.
>>>
>>
>> I was perhaps overly cautious, and didn't go there. However, it's probably
>> doable, given that there was already the following in __get_user_pages():
>>
>>     VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
>>
>> ...which will have conditioned people and code to set FOLL_GET together with
>> pages. So I agree that the time is right.
>>
>> In order to make bisecting future failures simpler, I can insert a patch right 
>> before this one, that changes the FOLL_GET setting into an assert, like this:
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 8f236a335ae9..be338961e80d 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1014,8 +1014,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>>                 BUG_ON(*locked != 1);
>>         }
>>  
>> -       if (pages)
>> -               flags |= FOLL_GET;
>> +       if (pages && WARN_ON_ONCE(!(flags & FOLL_GET)))
>> +               return -EINVAL;
>>  
>>         pages_done = 0;
>>         lock_dropped = false;
>>
>>
>> ...and then add in FOLL_PIN, with this patch.
> 
> Looks good, but double check that it does not happen; I will try
> to check on my side too.

Yes, I'll look.

...
>>>> +	 */
>>>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
>>>
>>> Wouldn't it be better to not add pin_longterm_pages_remote() until
>>> it can be properly implemented ?
>>>
>>
>> Well, the problem is that I need each call site that requires FOLL_PIN
>> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
>> there is a hard, bright rule, which is: if and only if a caller sets
>> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
>> be called.
>>
>> So this leaves me with only two reasonable choices:
>>
>> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
>> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
>> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
>> site *might* not need any changes to adopt the working gup.c code.
>>
>> b) Convert the call site to pin_user_pages_remote(), which also sets
>> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
>> There would also be a comment at the call site, to the effect of, "this
>> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
>>
>> When the FOLL_LONGTERM situation is fixed, the call site will need to be
>> changed to pin_longterm_pages_remote().
>>
>> So you can probably see why I picked (a).
> 
> But right now nobody uses FOLL_LONGTERM together with FOLL_REMOTE, so you
> should never need pin_longterm_pages_remote(). My fear is that longterm has
> implications, and it would be better not to lose them by adding a wrapper
> that does not do what its name says.
> 
> So do not introduce pin_longterm_pages_remote() until its first user
> appears. This is option c).
> 

Almost forgot, though: there is already another user: Infiniband:

drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,



thanks,

John Hubbard
NVIDIA


* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 19:30         ` John Hubbard
@ 2019-11-04 19:52           ` Jerome Glisse
  2019-11-04 20:09             ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 19:52 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 11:30:32AM -0800, John Hubbard wrote:
> On 11/4/19 11:18 AM, Jerome Glisse wrote:
> > On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> >> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> >> ...
> >>>
> >>> Few nitpick belows, nonetheless:
> >>>
> >>> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> >>> [...]
> >>>> +
> >>>> +CASE 3: ODP
> >>>> +-----------
> >>>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> >>>> +replayable page faulting). There are GUP references to pages serving as DMA
> >>>> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> >>>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> >>>> +needs to be set.
> >>>
> >>> I would not include ODP or anything like it here; they do not use
> >>> GUP anymore, and I believe it is more confusing here. I would,
> >>> however, include some text in this documentation explaining that
> >>> hardware that supports page faults is superior, as it does not incur
> >>> any of the issues described here.
> >>
> >> OK, agreed, here's a new write up that I'll put in v3:
> >>
> >>
> >> CASE 3: ODP
> >> -----------
> > 
> > ODP is RDMA-specific; maybe use "Hardware with page fault support" instead.
> > 
> >> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> 
> OK, so:
> 
>     "RDMA hardware with page faulting support."
> 
> for the first sentence.

I would drop RDMA completely. RDMA is just one example; there are GPUs, FPGAs and
others in that category. See below.

> 
> 
> >> Here, a well-written driver doesn't normally need to pin pages at all. However,
> >> if the driver does choose to do so, it can register MMU notifiers for the range,
> >> and will be called back upon invalidation. Either way (avoiding page pinning, or
> >> using MMU notifiers to unpin upon request), there is proper synchronization with 
> >> both filesystem and mm (page_mkclean(), munmap(), etc).
> >>
> >> Therefore, neither flag needs to be set.
> > 
> > In fact, GUP should never be used with those.
> 
> 
> Yes. The next paragraph says that, but maybe not strongly enough.
> 
> 
> >>
> >> It's worth mentioning here that pinning pages should not be the first design
> >> choice. If page fault capable hardware is available, then the software should
> >> be written so that it does not pin pages. This allows mm and filesystems to
> >> operate more efficiently and reliably.
> 
> Here's what we have after the above changes:
> 
> CASE 3: ODP
> -----------
> RDMA hardware with page faulting support. Here, a well-written driver doesn't

CASE 3: Hardware with page fault support
----------------------------------------

Here, a well-written ....


> normally need to pin pages at all. However, if the driver does choose to do so,
> it can register MMU notifiers for the range, and will be called back upon
> invalidation. Either way (avoiding page pinning, or using MMU notifiers to unpin
> upon request), there is proper synchronization with both filesystem and mm
> (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.
> 
> In this case, ideally, neither get_user_pages() nor pin_user_pages() should be 
> called. Instead, the software should be written so that it does not pin pages. 
> This allows mm and filesystems to operate more efficiently and reliably.
> 
> >>> [...]
> >>>
> >>>> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
> >>>>  		BUG_ON(*locked != 1);
> >>>>  	}
> >>>>  
> >>>> -	if (pages)
> >>>> +	/*
> >>>> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> >>>> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
> >>>> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
> >>>> +	 * for FOLL_GET, not for the newer FOLL_PIN.
> >>>> +	 *
> >>>> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
> >>>> +	 * that here, as any failures will be obvious enough.
> >>>> +	 */
> >>>> +	if (pages && !(flags & FOLL_PIN))
> >>>>  		flags |= FOLL_GET;
> >>>
> >>> Did you look at users that pass pages without FOLL_GET set?
> >>> I believe it would be better to first fix them to end up
> >>> with FOLL_GET set, and then error out if pages != NULL but
> >>> neither FOLL_GET nor FOLL_PIN is set.
> >>>
> >>
> >> I was perhaps overly cautious, and didn't go there. However, it's probably
> >> doable, given that there was already the following in __get_user_pages():
> >>
> >>     VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> >>
> >> ...which will have conditioned people and code to set FOLL_GET together with
> >> pages. So I agree that the time is right.
> >>
> >> In order to make bisecting future failures simpler, I can insert a patch right 
> >> before this one, that changes the FOLL_GET setting into an assert, like this:
> >>
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 8f236a335ae9..be338961e80d 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> >> @@ -1014,8 +1014,8 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
> >>                 BUG_ON(*locked != 1);
> >>         }
> >>  
> >> -       if (pages)
> >> -               flags |= FOLL_GET;
> >> +       if (pages && WARN_ON_ONCE(!(flags & FOLL_GET)))
> >> +               return -EINVAL;
> >>  
> >>         pages_done = 0;
> >>         lock_dropped = false;
> >>
> >>
> >> ...and then add in FOLL_PIN, with this patch.
> > 
> > Looks good, but double check that it does not happen; I will try
> > to check on my side too.
> 
> Yes, I'll look.
> 
> ...
> >>>> +	 */
> >>>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> >>>
> >>> Wouldn't it be better to not add pin_longterm_pages_remote() until
> >>> it can be properly implemented ?
> >>>
> >>
> >> Well, the problem is that I need each call site that requires FOLL_PIN
> >> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
> >> there is a hard, bright rule, which is: if and only if a caller sets
> >> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
> >> be called.
> >>
> >> So this leaves me with only two reasonable choices:
> >>
> >> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
> >> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
> >> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
> >> site *might* not need any changes to adopt the working gup.c code.
> >>
> >> b) Convert the call site to pin_user_pages_remote(), which also sets
> >> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
> >> There would also be a comment at the call site, to the effect of, "this
> >> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
> >>
> >> When the FOLL_LONGTERM situation is fixed, the call site will need to be
> >> changed to pin_longterm_pages_remote().
> >>
> >> So you can probably see why I picked (a).
> > 
> > But right now nobody uses FOLL_LONGTERM together with FOLL_REMOTE, so you
> > should never need pin_longterm_pages_remote(). My fear is that longterm has
> > implications, and it would be better not to lose them by adding a wrapper
> > that does not do what its name says.
> > 
> > So do not introduce pin_longterm_pages_remote() until its first user
> > appears. This is option c).
> > 
> 
> Almost forgot, though: there is already another user: Infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,

ODP does not need that. I thought the HMM conversion was already upstream,
but it seems not. In any case, ODP does not need the longterm case, so it is
best to revert that user to gup_fast or something until it gets converted
to HMM.

Cheers,
Jérôme



* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 19:52           ` Jerome Glisse
@ 2019-11-04 20:09             ` John Hubbard
  2019-11-04 20:31               ` Jason Gunthorpe
  2019-11-04 20:31               ` Jerome Glisse
  0 siblings, 2 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-04 20:09 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

Jason, a question for you at the bottom.

On 11/4/19 11:52 AM, Jerome Glisse wrote:
...
>> CASE 3: ODP
>> -----------
>> RDMA hardware with page faulting support. Here, a well-written driver doesn't
> 
> CASE 3: Hardware with page fault support
> ----------------------------------------
> 
> Here, a well-written ....
> 

Ah, OK. So just drop the first sentence, yes.

...
>>>>>> +	 */
>>>>>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
>>>>>
>>>>> Wouldn't it be better to not add pin_longterm_pages_remote() until
>>>>> it can be properly implemented ?
>>>>>
>>>>
>>>> Well, the problem is that I need each call site that requires FOLL_PIN
>>>> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
>>>> there is a hard, bright rule, which is: if and only if a caller sets
>>>> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
>>>> be called.
>>>>
>>>> So this leaves me with only two reasonable choices:
>>>>
>>>> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
>>>> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
>>>> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
>>>> site *might* not need any changes to adopt the working gup.c code.
>>>>
>>>> b) Convert the call site to pin_user_pages_remote(), which also sets
>>>> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
>>>> There would also be a comment at the call site, to the effect of, "this
>>>> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
>>>>
>>>> When the FOLL_LONGTERM situation is fixed, the call site will need to be
>>>> changed to pin_longterm_pages_remote().
>>>>
>>>> So you can probably see why I picked (a).
>>>
>>> But right now nobody uses FOLL_LONGTERM together with FOLL_REMOTE, so you
>>> should never need pin_longterm_pages_remote(). My fear is that longterm has
>>> implications, and it would be better not to lose them by adding a wrapper
>>> that does not do what its name says.
>>>
>>> So do not introduce pin_longterm_pages_remote() until its first user
>>> appears. This is option c).
>>>
>>
>> Almost forgot, though: there is already another user: Infiniband:
>>
>> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> 
> ODP does not need that. I thought the HMM conversion was already upstream,
> but it seems not. In any case, ODP does not need the longterm case, so it is
> best to revert that user to gup_fast or something until it gets converted
> to HMM.
> 

Note for Jason: the (a) or (b) items are talking about the vfio case, which is
one of the two call sites that now use pin_longterm_pages_remote(), and the
other one is infiniband:

drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,


Jerome, Jason: I really don't want to revert the put_page() to put_user_page() 
conversions that are already throughout the IB driver--pointless churn, right?
I'd rather either delete them in Jason's tree, or go with what I have here
while waiting for the deletion.

Maybe we should just settle on (a) or (b), so that the IB driver ends up with
the wrapper functions? In fact, if it's getting deleted, then I'd prefer leaving
it at (a), since that's simple...

Jason should weigh in on how he wants this to go, with respect to branching
and merging, since it sounds like that will conflict with the hmm branch 
(ha, I'm overdue in reviewing his mmu notifier series, that's what I get for
being late).

thanks,

John Hubbard
NVIDIA


* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 20:09             ` John Hubbard
@ 2019-11-04 20:31               ` Jason Gunthorpe
  2019-11-04 20:40                 ` John Hubbard
  2019-11-04 20:31               ` Jerome Glisse
  1 sibling, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2019-11-04 20:31 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:

> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> one of the two call sites that now use pin_longterm_pages_remote(), and the
> other one is infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,

This is a mistake; it is not a longterm pin and does not need FOLL_PIN
semantics.

> Jason should weigh in on how he wants this to go, with respect to branching
> and merging, since it sounds like that will conflict with the hmm branch 

I think since you don't need to change this site things should be
fine?

Jason


* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 20:09             ` John Hubbard
  2019-11-04 20:31               ` Jason Gunthorpe
@ 2019-11-04 20:31               ` Jerome Glisse
  2019-11-04 20:37                 ` Jason Gunthorpe
  1 sibling, 1 reply; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 20:31 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:
> Jason, a question for you at the bottom.
> 
> On 11/4/19 11:52 AM, Jerome Glisse wrote:
> ...
> >> CASE 3: ODP
> >> -----------
> >> RDMA hardware with page faulting support. Here, a well-written driver doesn't
> > 
> > CASE3: Hardware with page fault support
> > ---------------------------------------
> > 
> > Here, a well-written ....
> > 
> 
> Ah, OK. So just drop the first sentence, yes.
> 
> ...
> >>>>>> +	 */
> >>>>>> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> >>>>>
> >>>>> Wouldn't it be better to not add pin_longterm_pages_remote() until
> >>>>> it can be properly implemented ?
> >>>>>
> >>>>
> >>>> Well, the problem is that I need each call site that requires FOLL_PIN
> >>>> to use a proper wrapper. It's the FOLL_PIN that is the focus here, because
> >>>> there is a hard, bright rule, which is: if and only if a caller sets
> >>>> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
> >>>> be called.
> >>>>
> >>>> So this leaves me with only two reasonable choices:
> >>>>
> >>>> a) Convert the call site as above: pin_longterm_pages_remote(), which sets
> >>>> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
> >>>> as it has been so far. When the FOLL_LONGTERM situation is fixed, the call
> >>>> site *might* not need any changes to adopt the working gup.c code.
> >>>>
> >>>> b) Convert the call site to pin_user_pages_remote(), which also sets
> >>>> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
> >>>> There would also be a comment at the call site, to the effect of, "this
> >>>> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
> >>>>
> >>>> When the FOLL_LONGTERM situation is fixed, the call site will need to be
> >>>> changed to pin_longterm_pages_remote().
> >>>>
> >>>> So you can probably see why I picked (a).
> >>>
> >>> But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
> >>> never have the need for pin_longterm_pages_remote(). My fear is that
> >>> longterm has implication and it would be better to not drop this implication
> >>> by adding a wrapper that does not do what the name says.
> >>>
> >>> So do not introduce pin_longterm_pages_remote() until its first user
> >>> happens. This is option c)
> >>>
> >>
> >> Almost forgot, though: there is already another user: Infiniband:
> >>
> >> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> > 
> > odp do not need that, i thought the HMM convertion was already upstream
> > but seems not, in any case odp do not need the longterm case it only
> > so best is to revert that user to gup_fast or something until it get
> > converted to HMM.
> > 
> 
> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> one of the two call sites that now use pin_longterm_pages_remote(), and the
> other one is infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,

vfio should be reverted until it can be properly implemented.
The issue is that when you fix the implementation you might
break existing vfio users and thus regress the kernel from the
user's point of view. So I would rather have the change to vfio
reverted. I believe it was not well understood when it got upstream;
btw, in my 5.4 tree it is still gup_remote, not longterm.


> Jerome, Jason: I really don't want to revert the put_page() to put_user_page() 
> conversions that are already throughout the IB driver--pointless churn, right?
> I'd rather either delete them in Jason's tree, or go with what I have here
> while waiting for the deletion.
> 
> Maybe we should just settle on (a) or (b), so that the IB driver ends up with
> the wrapper functions? In fact, if it's getting deleted, then I'd prefer leaving
> it at (a), since that's simple...
> 
> Jason should weigh in on how he wants this to go, with respect to branching
> and merging, since it sounds like that will conflict with the hmm branch 
> (ha, I'm overdue in reviewing his mmu notifier series, that's what I get for
> being late).
> 
> thanks,
> 
> John Hubbard
> NVIDIA


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-03 21:18 ` [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
  2019-11-04 17:33   ` Jerome Glisse
@ 2019-11-04 20:33   ` David Rientjes
  2019-11-04 20:48     ` Jerome Glisse
  2019-11-05 13:10   ` Mike Rapoport
  2 siblings, 1 reply; 57+ messages in thread
From: David Rientjes @ 2019-11-04 20:33 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML



On Sun, 3 Nov 2019, John Hubbard wrote:

> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
>     pin_user_pages()
>     pin_user_pages_remote()
>     pin_user_pages_fast()
> 
>     pin_longterm_pages()
>     pin_longterm_pages_remote()
>     pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 

Hi John,

I'm curious what consideration is given to the pageblock migrate types
that FOLL_PIN and FOLL_LONGTERM pages originate from, assuming that
longterm pins would want to originate from MIGRATE_UNMOVABLE pageblocks
for the purposes of anti-fragmentation?
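
For reference, the wrapper naming quoted at the top of this message can
be sketched as a small userspace model. This is only an illustration:
the flag bit values and every model_* name are assumptions made for the
sketch and are not the kernel's definitions. It shows that the
pin_user_pages*() family adds FOLL_PIN, that the pin_longterm_pages*()
family adds FOLL_LONGTERM as well, and that FOLL_PIN is what decides
whether put_user_page() is required.

```c
#include <assert.h>

/* Illustrative bit values only; the real FOLL_* definitions are in include/linux/mm.h. */
#define FOLL_LONGTERM	0x1u
#define FOLL_REMOTE	0x2u
#define FOLL_PIN	0x4u

/* Stand-in for the common gup.c code: it just reports the flags it was handed. */
static unsigned int model_gup(unsigned int gup_flags)
{
	return gup_flags;
}

/* Models pin_user_pages*(): adds FOLL_PIN only. */
static unsigned int model_pin_user_pages(unsigned int gup_flags)
{
	return model_gup(gup_flags | FOLL_PIN);
}

/* Models pin_longterm_pages*(): adds FOLL_PIN and FOLL_LONGTERM. */
static unsigned int model_pin_longterm_pages(unsigned int gup_flags)
{
	return model_gup(gup_flags | FOLL_PIN | FOLL_LONGTERM);
}

/* The rule under discussion: put_user_page() is required iff FOLL_PIN was set. */
static int model_needs_put_user_page(unsigned int effective_flags)
{
	return (effective_flags & FOLL_PIN) != 0;
}

/* Small usage example for the helpers above. */
static void model_wrapper_selfcheck(void)
{
	assert(model_needs_put_user_page(model_pin_user_pages(0)));
	assert(model_pin_longterm_pages(0) & FOLL_LONGTERM);
	assert(!model_needs_put_user_page(model_gup(FOLL_REMOTE)));
}
```

Nothing in this model says anything about which pageblock a pinned page
comes from; it only captures the flag bookkeeping.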

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  2019-11-03 21:18 ` [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*() John Hubbard
@ 2019-11-04 20:33   ` Jason Gunthorpe
  2019-11-04 20:48     ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2019-11-04 20:33 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Sun, Nov 03, 2019 at 01:18:02PM -0800, John Hubbard wrote:
> Convert infiniband to use the new wrapper calls, and stop
> explicitly setting FOLL_LONGTERM at the call sites.
> 
> The new pin_longterm_*() calls replace get_user_pages*()
> calls, and set both FOLL_LONGTERM and a new FOLL_PIN
> flag. The FOLL_PIN flag requires that the caller must
> return the pages via put_user_page*() calls, but
> infiniband was already doing that as part of an earlier
> commit.
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>  drivers/infiniband/core/umem.c              |  5 ++---
>  drivers/infiniband/core/umem_odp.c          | 10 +++++-----
>  drivers/infiniband/hw/hfi1/user_pages.c     |  4 ++--
>  drivers/infiniband/hw/mthca/mthca_memfree.c |  3 +--
>  drivers/infiniband/hw/qib/qib_user_pages.c  |  8 ++++----
>  drivers/infiniband/hw/qib/qib_user_sdma.c   |  2 +-
>  drivers/infiniband/hw/usnic/usnic_uiom.c    |  9 ++++-----
>  drivers/infiniband/sw/siw/siw_mem.c         |  5 ++---
>  8 files changed, 21 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 24244a2f68cc..c5a78d3e674b 100644
> +++ b/drivers/infiniband/core/umem.c
> @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
>  
>  	while (npages) {
>  		down_read(&mm->mmap_sem);
> -		ret = get_user_pages(cur_base,
> +		ret = pin_longterm_pages(cur_base,
>  				     min_t(unsigned long, npages,
>  					   PAGE_SIZE / sizeof (struct page *)),
> -				     gup_flags | FOLL_LONGTERM,
> -				     page_list, NULL);
> +				     gup_flags, page_list, NULL);

FWIW, this one should be converted to fast as well, I think we finally
got rid of all the blockers for that?

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 20:31               ` Jerome Glisse
@ 2019-11-04 20:37                 ` Jason Gunthorpe
  2019-11-04 20:57                   ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2019-11-04 20:37 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
> > Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> > one of the two call sites that now use pin_longterm_pages_remote(), and the
> > other one is infiniband:
> > 
> > drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> > drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,
> 
> vfio should be reverted until it can be properly implemented.
> The issue is that when you fix the implementation you might
> break vfio existing user and thus regress the kernel from user
> point of view. So i rather have the change to vfio reverted,
> i believe it was not well understood when it got upstream,
> between in my 5.4 tree it is still gup_remote not longterm.

It is clearly a bug, vfio must use LONGTERM, and does right above this
remote call:

        if (mm == current->mm) {
                ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
                                     vmas);
        } else {
                ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
                                            vmas, NULL);


I'm not even sure that it really makes any sense to build an 'if' like
that, surely just always call remote??

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 20:31               ` Jason Gunthorpe
@ 2019-11-04 20:40                 ` John Hubbard
  0 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-04 20:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On 11/4/19 12:31 PM, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:
> 
>> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
>> one of the two call sites that now use pin_longterm_pages_remote(), and the
>> other one is infiniband:
>>
>> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> 
> This is a mistake, it is not a longterm pin and does not need FOLL_PIN
> semantics

OK! So it really just wants to be get_user_pages_remote() / put_page()? I'll
change it back to that.

> 
>> Jason should weigh in on how he wants this to go, with respect to branching
>> and merging, since it sounds like that will conflict with the hmm branch 
> 
> I think since you don't need to change this site things should be
> fine?
> 

Right. 


thanks,

John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  2019-11-04 20:33   ` Jason Gunthorpe
@ 2019-11-04 20:48     ` John Hubbard
  2019-11-04 20:57       ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-04 20:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On 11/4/19 12:33 PM, Jason Gunthorpe wrote:
...
>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>> index 24244a2f68cc..c5a78d3e674b 100644
>> +++ b/drivers/infiniband/core/umem.c
>> @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
>>  
>>  	while (npages) {
>>  		down_read(&mm->mmap_sem);
>> -		ret = get_user_pages(cur_base,
>> +		ret = pin_longterm_pages(cur_base,
>>  				     min_t(unsigned long, npages,
>>  					   PAGE_SIZE / sizeof (struct page *)),
>> -				     gup_flags | FOLL_LONGTERM,
>> -				     page_list, NULL);
>> +				     gup_flags, page_list, NULL);
> 
> FWIW, this one should be converted to fast as well, I think we finally
> got rid of all the blockers for that?
> 

I'm not aware of any blockers on the gup.c end, anyway. The only broken thing we
have there is "gup remote + FOLL_LONGTERM". But we can do "gup fast + LONGTERM". 

Unless I'm really missing something, in which case several other call sites
would need changes.

I'll change it to pin_longterm_pages_fast().

thanks,

John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 20:33   ` David Rientjes
@ 2019-11-04 20:48     ` Jerome Glisse
  0 siblings, 0 replies; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 20:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: John Hubbard, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jason Gunthorpe,
	Jens Axboe, Jonathan Corbet, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML

On Mon, Nov 04, 2019 at 12:33:09PM -0800, David Rientjes wrote:
> 
> 
> On Sun, 3 Nov 2019, John Hubbard wrote:
> 
> > Introduce pin_user_pages*() variations of get_user_pages*() calls,
> > and also pin_longterm_pages*() variations.
> > 
> > These variants all set FOLL_PIN, which is also introduced, and
> > thoroughly documented.
> > 
> > The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> > to FOLL_PIN:
> > 
> >     pin_user_pages()
> >     pin_user_pages_remote()
> >     pin_user_pages_fast()
> > 
> >     pin_longterm_pages()
> >     pin_longterm_pages_remote()
> >     pin_longterm_pages_fast()
> > 
> > All pages that are pinned via the above calls, must be unpinned via
> > put_user_page().
> > 
> 
> Hi John,
> 
> I'm curious what consideration is given to what pageblock migrate types 
> that FOLL_PIN and FOLL_LONGTERM pages originate from, assuming that 
> longterm would want to originate from MIGRATE_UNMOVABLE pageblocks for the 
> purposes of anti-fragmentation?

We do not control the pageblock: GUP can happen on _any_ page that is
mapped inside a process (an anonymous private vma or a regular
file-backed one).

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  2019-11-04 20:48     ` John Hubbard
@ 2019-11-04 20:57       ` Jason Gunthorpe
  2019-11-04 22:03         ` John Hubbard
  2019-11-07  2:26         ` Ira Weiny
  0 siblings, 2 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2019-11-04 20:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 12:48:13PM -0800, John Hubbard wrote:
> On 11/4/19 12:33 PM, Jason Gunthorpe wrote:
> ...
> >> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> >> index 24244a2f68cc..c5a78d3e674b 100644
> >> +++ b/drivers/infiniband/core/umem.c
> >> @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
> >>  
> >>  	while (npages) {
> >>  		down_read(&mm->mmap_sem);
> >> -		ret = get_user_pages(cur_base,
> >> +		ret = pin_longterm_pages(cur_base,
> >>  				     min_t(unsigned long, npages,
> >>  					   PAGE_SIZE / sizeof (struct page *)),
> >> -				     gup_flags | FOLL_LONGTERM,
> >> -				     page_list, NULL);
> >> +				     gup_flags, page_list, NULL);
> > 
> > FWIW, this one should be converted to fast as well, I think we finally
> > got rid of all the blockers for that?
> > 
> 
> I'm not aware of any blockers on the gup.c end, anyway. The only broken thing we
> have there is "gup remote + FOLL_LONGTERM". But we can do "gup fast + LONGTERM". 

I mean the use of the mmap_sem here is finally in a way where we can
just delete the mmap_sem and use _fast
 
ie, AFAIK there is no need for the mmap_sem to be held during
ib_umem_add_sg_table()

This should probably be a standalone patch however

Jason

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 20:37                 ` Jason Gunthorpe
@ 2019-11-04 20:57                   ` John Hubbard
  2019-11-04 21:15                     ` Jason Gunthorpe
  0 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-04 20:57 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jens Axboe, Jonathan Corbet,
	Magnus Karlsson, Mauro Carvalho Chehab, Michael Ellerman,
	Michal Hocko, Mike Kravetz, Paul Mackerras, Shuah Khan,
	Vlastimil Babka, bpf, dri-devel, kvm, linux-block, linux-doc,
	linux-fsdevel, linux-kselftest, linux-media, linux-rdma,
	linuxppc-dev, netdev, linux-mm, LKML

On 11/4/19 12:37 PM, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
>>> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
>>> one of the two call sites that now use pin_longterm_pages_remote(), and the
>>> other one is infiniband:
>>>
>>> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
>>> drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,
>>
>> vfio should be reverted until it can be properly implemented.
>> The issue is that when you fix the implementation you might
>> break vfio existing user and thus regress the kernel from user
>> point of view. So i rather have the change to vfio reverted,
>> i believe it was not well understood when it got upstream,
>> between in my 5.4 tree it is still gup_remote not longterm.
> 
> It is clearly a bug, vfio must use LONGTERM, and does right above this
> remote call:
> 
>         if (mm == current->mm) {
>                 ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
>                                      vmas);
>         } else {
>                 ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
>                                             vmas, NULL);
> 
> 
> I'm not even sure that it really makes any sense to build a 'if' like
> that, surely just always call remote??
> 


Right, and I thought about this when converting, and realized that the above 
code is working around the current gup.c limitations, which are "cannot support
gup remote with FOLL_LONGTERM".

Given that observation, the code is getting itself some FOLL_LONGTERM support
for the non-remote case, and only hitting the limitation if the mm really is
non-current.

And if you look at my patch, it keeps the same behavior, while adding in the
new wrapper calls.

So...thoughts, preferences?


thanks,

John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 20:57                   ` John Hubbard
@ 2019-11-04 21:15                     ` Jason Gunthorpe
  2019-11-04 21:34                       ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Jason Gunthorpe @ 2019-11-04 21:15 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 12:57:59PM -0800, John Hubbard wrote:
> On 11/4/19 12:37 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 04, 2019 at 03:31:53PM -0500, Jerome Glisse wrote:
> >>> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> >>> one of the two call sites that now use pin_longterm_pages_remote(), and the
> >>> other one is infiniband:
> >>>
> >>> drivers/infiniband/core/umem_odp.c:646:         npages = pin_longterm_pages_remote(owning_process, owning_mm,
> >>> drivers/vfio/vfio_iommu_type1.c:353:            ret = pin_longterm_pages_remote(NULL, mm, vaddr, 1,
> >>
> >> vfio should be reverted until it can be properly implemented.
> >> The issue is that when you fix the implementation you might
> >> break vfio existing user and thus regress the kernel from user
> >> point of view. So i rather have the change to vfio reverted,
> >> i believe it was not well understood when it got upstream,
> >> between in my 5.4 tree it is still gup_remote not longterm.
> > 
> > It is clearly a bug, vfio must use LONGTERM, and does right above this
> > remote call:
> > 
> >         if (mm == current->mm) {
> >                 ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
> >                                      vmas);
> >         } else {
> >                 ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
> >                                             vmas, NULL);
> > 
> > 
> > I'm not even sure that it really makes any sense to build a 'if' like
> > that, surely just always call remote??
> > 
> 
> 
> Right, and I thought about this when converting, and realized that the above 
> code is working around the current gup.c limitations, which are "cannot support
> gup remote with FOLL_LONGTERM".

But AFAICT it doesn't have a problem, the protection test is just too
strict, and I guess the control flow needs a bit of fixing..

The issue is this:

static __always_inline long __get_user_pages_locked():
{
        if (locked) {
                /* if VM_FAULT_RETRY can be returned, vmas become invalid */
                BUG_ON(vmas);
                /* check caller initialized locked */
                BUG_ON(*locked != 1);
        }


so remote could be written as:

if (gup_flags & FOLL_LONGTERM) {
   if (WARN_ON_ONCE(locked))
        return -EINVAL;
   return __gup_longterm_locked(...)
}

return __get_user_pages_locked(...)

??

Jason
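
The control flow proposed above can be modelled in userspace as follows.
This is a minimal sketch, not gup.c code: the FOLL_LONGTERM bit value,
the path constants and the model_* names are assumptions made for the
example, and WARN_ON_ONCE() is reduced to a plain check. It only shows
which path a remote call would take for a given combination of
gup_flags and "locked".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative bit value only; the real FOLL_LONGTERM is in include/linux/mm.h. */
#define FOLL_LONGTERM	0x1u

/* Hypothetical markers for the helper the model would have called. */
#define MODEL_PATH_LONGTERM_LOCKED	1L	/* stands for __gup_longterm_locked() */
#define MODEL_PATH_LOCKED		2L	/* stands for __get_user_pages_locked() */

/*
 * Models the proposed body of the remote variant: FOLL_LONGTERM is allowed
 * as long as the caller does not pass a "locked" pointer.
 */
static long model_gup_remote(unsigned int gup_flags, int *locked)
{
	if (gup_flags & FOLL_LONGTERM) {
		if (locked)		/* WARN_ON_ONCE(locked) in the proposal */
			return -EINVAL;
		return MODEL_PATH_LONGTERM_LOCKED;
	}

	return MODEL_PATH_LOCKED;
}

/* Small usage example: a vfio-style caller that does not set "locked". */
static void model_remote_selfcheck(void)
{
	assert(model_gup_remote(FOLL_LONGTERM, NULL) == MODEL_PATH_LONGTERM_LOCKED);
}
```

A caller that passes both FOLL_LONGTERM and a non-NULL "locked" gets
-EINVAL in this model, which corresponds to the WARN_ON_ONCE() branch.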

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-04 21:15                     ` Jason Gunthorpe
@ 2019-11-04 21:34                       ` John Hubbard
  0 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-04 21:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Ira Weiny, Jan Kara, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On 11/4/19 1:15 PM, Jason Gunthorpe wrote:
...
>> Right, and I thought about this when converting, and realized that the above 
>> code is working around the current gup.c limitations, which are "cannot support
>> gup remote with FOLL_LONGTERM".
> 
> But AFAICT it doesn't have a problem, the protection test is just too
> strict, and I guess the control flow needs a bit of fixing..
> 
> The issue is this:
> 
> static __always_inline long __get_user_pages_locked():
> {
>         if (locked) {
>                 /* if VM_FAULT_RETRY can be returned, vmas become invalid */
>                 BUG_ON(vmas);
>                 /* check caller initialized locked */
>                 BUG_ON(*locked != 1);
>         }
> 
> 
> so remote could be written as:
> 
> if (gup_flags & FOLL_LONGTERM) {
>    if (WARN_ON_ONCE(locked))
>         return -EINVAL;
>    return __gup_longterm_locked(...)
> }
> 
> return __get_user_pages_locked(...)
> 
> ??

Yes, that loosens it up just enough for the vfio case (which doesn't set 
"locked") to get through, great! OK, I'll put that (the above plus 
corresponding vfio fix) in a separate patch first. 

This should clear things up nicely.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  2019-11-04 20:57       ` Jason Gunthorpe
@ 2019-11-04 22:03         ` John Hubbard
  2019-11-05  2:32           ` Jason Gunthorpe
  2019-11-07  2:26         ` Ira Weiny
  1 sibling, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-04 22:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On 11/4/19 12:57 PM, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2019 at 12:48:13PM -0800, John Hubbard wrote:
>> On 11/4/19 12:33 PM, Jason Gunthorpe wrote:
>> ...
>>>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>>>> index 24244a2f68cc..c5a78d3e674b 100644
>>>> +++ b/drivers/infiniband/core/umem.c
>>>> @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
>>>>  
>>>>  	while (npages) {
>>>>  		down_read(&mm->mmap_sem);
>>>> -		ret = get_user_pages(cur_base,
>>>> +		ret = pin_longterm_pages(cur_base,
>>>>  				     min_t(unsigned long, npages,
>>>>  					   PAGE_SIZE / sizeof (struct page *)),
>>>> -				     gup_flags | FOLL_LONGTERM,
>>>> -				     page_list, NULL);
>>>> +				     gup_flags, page_list, NULL);
>>>
>>> FWIW, this one should be converted to fast as well, I think we finally
>>> got rid of all the blockers for that?
>>>
>>
>> I'm not aware of any blockers on the gup.c end, anyway. The only broken thing we
>> have there is "gup remote + FOLL_LONGTERM". But we can do "gup fast + LONGTERM". 
> 
> I mean the use of the mmap_sem here is finally in a way where we can
> just delete the mmap_sem and use _fast
>  
> ie, AFAIK there is no need for the mmap_sem to be held during
> ib_umem_add_sg_table()
> 
> This should probably be a standalone patch however
> 

Yes. Oh, actually I guess the patch flow should be: change to 
get_user_pages_fast() and remove the mmap_sem calls, as one patch. And then change 
to pin_longterm_pages_fast() as the next patch. Otherwise, the internal fallback
from _fast to slow gup would attempt to take the mmap_sem (again) in the same
thread, which is not good. :)

Or just defer the change until after this series. Either way is fine, let me
know if you prefer one over the other.

The patch itself is trivial, but runtime testing to gain confidence that
it's solid is much harder. Is there a stress test you would recommend for that?
(I'm not promising I can quickly run it yet--my local IB setup is still nascent 
at best.)


thanks,
-- 
John Hubbard
NVIDIA
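
The ordering concern above (switching to a _fast call while the caller
still holds mmap_sem) can be illustrated with a small userspace model.
It is a sketch under a simplifying assumption: the lock is treated as
strictly non-recursive, which is stricter than a real rwsem. The
model_sem type and all model_* functions are hypothetical names used
only here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for mmap_sem, treated as strictly non-recursive. */
struct model_sem {
	bool held;
};

/* Returns false where taking the lock again would be a self-deadlock. */
static bool model_down_read(struct model_sem *sem)
{
	if (sem->held)
		return false;
	sem->held = true;
	return true;
}

static void model_up_read(struct model_sem *sem)
{
	sem->held = false;
}

/*
 * Models a _fast call: try the lockless walk first; on a miss, fall back
 * to the slow path, which takes the lock itself.
 */
static bool model_gup_fast(struct model_sem *sem, bool lockless_hit)
{
	if (lockless_hit)
		return true;
	if (!model_down_read(sem))
		return false;	/* caller already holds the lock */
	model_up_read(sem);
	return true;
}

/* Small usage example: the fallback only works once the caller drops the lock. */
static void model_fast_selfcheck(void)
{
	struct model_sem sem = { .held = false };

	assert(model_down_read(&sem));		/* old call site: lock held */
	assert(!model_gup_fast(&sem, false));	/* fallback would self-deadlock */
	model_up_read(&sem);			/* first patch: drop the lock */
	assert(model_gup_fast(&sem, false));	/* fallback now succeeds */
}
```

In this model the fallback fails only when the caller still holds the
lock, which is why removing the mmap_sem calls has to come no later than
the switch to the _fast variant.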


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages
  2019-11-04 18:52   ` Jerome Glisse
@ 2019-11-04 22:49     ` John Hubbard
  2019-11-04 23:49       ` Jerome Glisse
  0 siblings, 1 reply; 57+ messages in thread
From: John Hubbard @ 2019-11-04 22:49 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On 11/4/19 10:52 AM, Jerome Glisse wrote:
> On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
>> Add tracking of pages that were pinned via FOLL_PIN.
>>
>> As mentioned in the FOLL_PIN documentation, callers who effectively set
>> FOLL_PIN are required to ultimately free such pages via put_user_page().
>> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
>> for DIO and/or RDMA use".
>>
>> Pages that have been pinned via FOLL_PIN are identifiable via a
>> new function call:
>>
>>    bool page_dma_pinned(struct page *page);
>>
>> What to do in response to encountering such a page, is left to later
>> patchsets. There is discussion about this in [1].
>>
>> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
>>
>> This also has a couple of trivial, non-functional change fixes to
>> try_get_compound_head(). That function got moved to the top of the
>> file.
> 
> Maybe split that as a separate trivial patch.


Will do.


> 
>>
>> This includes the following fix from Ira Weiny:
>>
>> DAX requires detection of a page crossing to a ref count of 1.  Fix this
>> for GUP pages by introducing put_devmap_managed_user_page() which
>> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
> 
> Please do the put_devmap_managed_page() changes in a separate
> patch, it would be a lot easier to follow, also on that front
> see comments below.


Oh! OK. It makes sense when you say it out loud. :)


...
>> +static inline bool put_devmap_managed_page(struct page *page)
>> +{
>> +	bool is_devmap = page_is_devmap_managed(page);
>> +
>> +	if (is_devmap) {
>> +		int count = page_ref_dec_return(page);
>> +
>> +		__put_devmap_managed_page(page, count);
>> +	}
>> +
>> +	return is_devmap;
>> +}
> 
> I think the __put_devmap_managed_page() should be renamed
> to free_devmap_managed_page() and that the count != 1
> case should move to this inline function, ie:
> 
> static inline bool put_devmap_managed_page(struct page *page)
> {
> 	bool is_devmap = page_is_devmap_managed(page);
> 
> 	if (is_devmap) {
> 		int count = page_ref_dec_return(page);
> 
> 		/*
> 		 * If refcount is 1 then page is freed and refcount is stable as nobody
> 		 * holds a reference on the page.
> 		 */
> 		if (count == 1)
> 			free_devmap_managed_page(page, count);
> 		else if (!count)
> 			__put_page(page);
> 	}
> 
> 	return is_devmap;
> }
> 

Thanks, that does look cleaner and easier to read.
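
For anyone skimming the thread, the two refcount transitions in the suggested version can be modelled in user space. This is only a sketch: the enum, the function name and the bare int refcount are made up for illustration and are not kernel API.

```c
#include <assert.h>

/* Hypothetical outcomes of dropping one reference on a devmap-managed page. */
enum devmap_put_action {
	DEVMAP_PUT_NOTHING,	/* other references remain */
	DEVMAP_PUT_FREE_DEVMAP,	/* count hit 1: would call free_devmap_managed_page() */
	DEVMAP_PUT_RELEASE,	/* count hit 0: would call __put_page() */
};

/*
 * User-space model of the suggested put_devmap_managed_page() body.
 * A plain int stands in for the atomic page refcount.
 */
static enum devmap_put_action model_put_devmap_managed_page(int *refcount)
{
	int count = --(*refcount);	/* models page_ref_dec_return() */

	if (count == 1)
		return DEVMAP_PUT_FREE_DEVMAP;
	if (count == 0)
		return DEVMAP_PUT_RELEASE;
	return DEVMAP_PUT_NOTHING;
}
```

Starting from a count of 3, successive calls would report "nothing", then "free devmap", then "release", which is the ordering the inline function above relies on.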

> 
>> +
>>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>>  static inline bool put_devmap_managed_page(struct page *page)
>>  {
>> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
>>  	return true;
>>  }
>>  
>> +__must_check bool user_page_ref_inc(struct page *page);
>> +
> 
> What about having it as an inline here as it is pretty small.


You mean move it to a static inline function in mm.h? It's worse than it 
looks, though: *everything* that it calls is also a static function, local
to gup.c. So I'd have to expose both try_get_compound_head() and
__update_proc_vmstat(). And that also means calling mod_node_page_state() from
mm.h, and it goes south right about there. :)


...  
>> +/**
>> + * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
>> + * or pin_longterm_pages*()
>> + * @page:	pointer to page to be queried.
>> + * @Return:	True, if it is likely that the page has been "dma-pinned".
>> + *		False, if the page is definitely not dma-pinned.
>> + */
> 
> Maybe add a small comment about wrap around :)


I don't *think* the count can wrap around, due to the checks in user_page_ref_inc().

But it's true that the documentation is a little light here...What did you have 
in mind?


> [...]
> 
>> @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>  
>>  		pgmap = get_dev_pagemap(pfn, pgmap);
>>  		if (unlikely(!pgmap)) {
>> -			undo_dev_pagemap(nr, nr_start, pages);
>> +			undo_dev_pagemap(nr, nr_start, flags, pages);
>>  			return 0;
>>  		}
>>  		SetPageReferenced(page);
>>  		pages[*nr] = page;
>> -		get_page(page);
>> +
>> +		if (flags & FOLL_PIN) {
>> +			if (unlikely(!user_page_ref_inc(page))) {
>> +				undo_dev_pagemap(nr, nr_start, flags, pages);
>> +				return 0;
>> +			}
> 
> Maybe add a comment about a case that should never happen, ie
> user_page_ref_inc() failing after the second iteration of the
> loop, as it would be broken and a bug to call undo_dev_pagemap()
> after the first iteration of that loop.
> 
> Also I believe that this should never happen, as if the first
> iteration succeeds then __page_cache_add_speculative() will
> succeed for all the iterations.
> 
> Note that the pgmap case above follows that too, ie the call to
> get_dev_pagemap() can only fail on the first iteration of the loop.
> Well, I assume you can never have a huge device page that spans
> different pgmaps, ie different devices (which is a reasonable
> assumption). So maybe this code needs fixing, ie:
> 
> 		pgmap = get_dev_pagemap(pfn, pgmap);
> 		if (unlikely(!pgmap))
> 			return 0;
> 
> 

OK, yes that does make sense. And I think a comment is adequate,
no need to check for bugs during every tail page iteration. So how 
about this, as a preliminary patch:

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..a4a81e125832 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1892,17 +1892,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 static int __gup_device_huge(unsigned long pfn, unsigned long addr,
                unsigned long end, struct page **pages, int *nr)
 {
-       int nr_start = *nr;
-       struct dev_pagemap *pgmap = NULL;
+       /*
+        * Huge pages should never cross dev_pagemap boundaries. Therefore, use
+        * this same pgmap for the entire huge page.
+        */
+       struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);
+
+       if (unlikely(!pgmap))
+               return 0;
 
        do {
                struct page *page = pfn_to_page(pfn);
 
-               pgmap = get_dev_pagemap(pfn, pgmap);
-               if (unlikely(!pgmap)) {
-                       undo_dev_pagemap(nr, nr_start, pages);
-                       return 0;
-               }
                SetPageReferenced(page);
                pages[*nr] = page;
                get_page(page);




>> +		} else
>> +			get_page(page);
>> +
>>  		(*nr)++;
>>  		pfn++;
>>  	} while (addr += PAGE_SIZE, addr != end);
> 
> [...]
> 
>> @@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
>>  	unsigned long addr, len, end;
>>  	int nr = 0, ret = 0;
>>  
>> -	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
>> +	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
> 
> Maybe add a comments to explain, something like:
> 
> /*
>  * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN
>  *
>  * Note that get_user_pages_fast() imply FOLL_GET flag by default but
>  * callers can over-ride this default to pin case by setting FOLL_PIN.
>  */

Good idea. Here's the draft now:

/*
 * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN.
 *
 * Note that get_user_pages_fast() implies FOLL_GET flag by default, but
 * callers can override this default by setting FOLL_PIN instead of
 * FOLL_GET.
 */
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
        return -EINVAL;
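
To make the mask check concrete, here is a small user-space sketch of the same test. The flag values and the helper name check_gup_fast_flags() are assumptions for illustration only; the real definitions are in include/linux/mm.h and the real check lives inline in internal_get_user_pages_fast().

```c
#include <assert.h>
#include <errno.h>

/* Illustrative values only; not copied from include/linux/mm.h. */
#define FOLL_WRITE	0x01u
#define FOLL_GET	0x04u
#define FOLL_LONGTERM	0x10000u
#define FOLL_PIN	0x40000u

/* The only flags a caller may pass to the fast path. */
#define GUP_FAST_ALLOWED_FLAGS	(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)

/*
 * Hypothetical helper: returns 0 if every bit in gup_flags is allowed,
 * -EINVAL if any other bit is set.
 */
static int check_gup_fast_flags(unsigned int gup_flags)
{
	if (gup_flags & ~GUP_FAST_ALLOWED_FLAGS)
		return -EINVAL;
	return 0;
}

/* Usage example: FOLL_GET is implied, so passing it explicitly is rejected. */
static void demo_flag_checks(void)
{
	assert(check_gup_fast_flags(FOLL_WRITE | FOLL_PIN) == 0);
	assert(check_gup_fast_flags(FOLL_GET) == -EINVAL);
}
```

The point of the mask is that callers never pass FOLL_GET themselves: either they pass nothing and get the FOLL_GET default, or they pass FOLL_PIN to select the pin behavior.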

> 
>>  		return -EINVAL;
>>  
>>  	start = untagged_addr(start) & PAGE_MASK;
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 13cc93785006..66bf4c8b88f1 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
> 
> [...]
> 
>> @@ -968,7 +973,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
>>  	if (!*pgmap)
>>  		return ERR_PTR(-EFAULT);
>>  	page = pfn_to_page(pfn);
>> -	get_page(page);
>> +
>> +	if (flags & FOLL_GET)
>> +		get_page(page);
>> +	else if (flags & FOLL_PIN)
>> +		if (unlikely(!user_page_ref_inc(page)))
>> +			page = ERR_PTR(-ENOMEM);
> 
> I agree that user_page_ref_inc() (ie page_cache_add_speculative())
> should never fail here, as we are holding the pmd lock and thus no one
> can unmap the pmd and free the page it points to. Still, I believe you should
> return -EFAULT like for the pgmap and not -ENOMEM, as the pgmap should
> not fail either for the same reason. Thus it would be better to have
> a consistent error. Maybe also add a comment explaining that it should
> not fail here.
> 

OK. I'll take a pass through and fix up the remaining points about these
sorts of cases below, as well, in v3. Those all make sense.

>>  
>>  	return page;
>>  }
> 
> [...]
> 
>> @@ -1100,7 +1115,7 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>>  	 * device mapped pages can only be returned if the
>>  	 * caller will manage the page reference count.
>>  	 */
>> -	if (!(flags & FOLL_GET))
>> +	if (!(flags & (FOLL_GET | FOLL_PIN)))
>>  		return ERR_PTR(-EEXIST);
> 
> Maybe add a comment that FOLL_GET or FOLL_PIN must be set.
> 
>>  	pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
>> @@ -1108,7 +1123,12 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
>>  	if (!*pgmap)
>>  		return ERR_PTR(-EFAULT);
>>  	page = pfn_to_page(pfn);
>> -	get_page(page);
>> +
>> +	if (flags & FOLL_GET)
>> +		get_page(page);
>> +	else if (flags & FOLL_PIN)
>> +		if (unlikely(!user_page_ref_inc(page)))
>> +			page = ERR_PTR(-ENOMEM);
> 
> Same as for follow_devmap_pmd() see above.
> 
>>  
>>  	return page;
>>  }
>> @@ -1522,8 +1542,12 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>>  skip_mlock:
>>  	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
>>  	VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
>> +
>>  	if (flags & FOLL_GET)
>>  		get_page(page);
>> +	else if (flags & FOLL_PIN)
>> +		if (unlikely(!user_page_ref_inc(page)))
>> +			page = NULL;
> 
> This should not fail either, as we are holding the pmd lock; maybe add
> a comment. Dunno if we want a WARN() or something to catch this
> degenerate case, or dump the page.
> 
>>  
>>  out:
>>  	return page;
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index b45a95363a84..da335b1cd798 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4462,7 +4462,17 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>>  same_page:
>>  		if (pages) {
>>  			pages[i] = mem_map_offset(page, pfn_offset);
>> -			get_page(pages[i]);
>> +
>> +			if (flags & FOLL_GET)
>> +				get_page(pages[i]);
>> +			else if (flags & FOLL_PIN)
>> +				if (unlikely(!user_page_ref_inc(pages[i]))) {
>> +					spin_unlock(ptl);
>> +					remainder = 0;
>> +					err = -ENOMEM;
>> +					WARN_ON_ONCE(1);
>> +					break;
>> +				}
>>  		}
> 
> user_page_ref_inc() should not fail here either because we hold the
> ptl, so the WARN_ON_ONCE() is right but maybe add a comment.
> 
>>  
>>  		if (vmas)
> 
> [...]
> 
>> @@ -5034,8 +5050,14 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>>  	pte = huge_ptep_get((pte_t *)pmd);
>>  	if (pte_present(pte)) {
>>  		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
>> +
>>  		if (flags & FOLL_GET)
>>  			get_page(page);
>> +		else if (flags & FOLL_PIN)
>> +			if (unlikely(!user_page_ref_inc(page))) {
>> +				page = NULL;
>> +				goto out;
>> +			}
> 
> This should not fail either (again holding pmd lock), dunno if we want
> a warn or something to catch this degenerate case.
> 
>>  	} else {
>>  		if (is_hugetlb_entry_migration(pte)) {
>>  			spin_unlock(ptl);
> 
> [...]
> 
> 

Those are all good points, working on them now.

thanks,
-- 
John Hubbard
NVIDIA


* Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages
  2019-11-04 22:49     ` John Hubbard
@ 2019-11-04 23:49       ` Jerome Glisse
  2019-11-05  0:18         ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Jerome Glisse @ 2019-11-04 23:49 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 02:49:18PM -0800, John Hubbard wrote:
> On 11/4/19 10:52 AM, Jerome Glisse wrote:
> > On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
> >> Add tracking of pages that were pinned via FOLL_PIN.
> >>
> >> As mentioned in the FOLL_PIN documentation, callers who effectively set
> >> FOLL_PIN are required to ultimately free such pages via put_user_page().
> >> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> >> for DIO and/or RDMA use".
> >>
> >> Pages that have been pinned via FOLL_PIN are identifiable via a
> >> new function call:
> >>
> >>    bool page_dma_pinned(struct page *page);
> >>
> >> What to do in response to encountering such a page, is left to later
> >> patchsets. There is discussion about this in [1].
> >>
> >> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> >>
> >> This also has a couple of trivial, non-functional change fixes to
> >> try_get_compound_head(). That function got moved to the top of the
> >> file.
> > 
> > Maybe split that as a separate trivial patch.
> 
> 
> Will do.
> 
> 
> > 
> >>
> >> This includes the following fix from Ira Weiny:
> >>
> >> DAX requires detection of a page crossing to a ref count of 1.  Fix this
> >> for GUP pages by introducing put_devmap_managed_user_page() which
> >> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
> > 
> > Please do the put_devmap_managed_page() changes in a separate
> > patch, it would be a lot easier to follow, also on that front
> > see comments below.
> 
> 
> Oh! OK. It makes sense when you say it out loud. :)
> 
> 
> ...
> >> +static inline bool put_devmap_managed_page(struct page *page)
> >> +{
> >> +	bool is_devmap = page_is_devmap_managed(page);
> >> +
> >> +	if (is_devmap) {
> >> +		int count = page_ref_dec_return(page);
> >> +
> >> +		__put_devmap_managed_page(page, count);
> >> +	}
> >> +
> >> +	return is_devmap;
> >> +}
> > 
> > I think the __put_devmap_managed_page() should be renamed
> > to free_devmap_managed_page() and that the count != 1
> > case should move to this inline function, ie:
> > 
> > static inline bool put_devmap_managed_page(struct page *page)
> > {
> > 	bool is_devmap = page_is_devmap_managed(page);
> > 
> > 	if (is_devmap) {
> > 		int count = page_ref_dec_return(page);
> > 
> > 		/*
> > 		 * If refcount is 1 then page is freed and refcount is stable as nobody
> > 		 * holds a reference on the page.
> > 		 */
> > 		if (count == 1)
> > 			free_devmap_managed_page(page, count);
> > 		else if (!count)
> > 			__put_page(page);
> > 	}
> > 
> > 	return is_devmap;
> > }
> > 
> 
> Thanks, that does look cleaner and easier to read.
> 
> > 
> >> +
> >>  #else /* CONFIG_DEV_PAGEMAP_OPS */
> >>  static inline bool put_devmap_managed_page(struct page *page)
> >>  {
> >> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct page *page)
> >>  	return true;
> >>  }
> >>  
> >> +__must_check bool user_page_ref_inc(struct page *page);
> >> +
> > 
> > What about having it as an inline here as it is pretty small.
> 
> 
> You mean move it to a static inline function in mm.h? It's worse than it 
> looks, though: *everything* that it calls is also a static function, local
> to gup.c. So I'd have to expose both try_get_compound_head() and
> __update_proc_vmstat(). And that also means calling mod_node_page_state() from
> mm.h, and it goes south right about there. :)

Ok fair enough

> ...  
> >> +/**
> >> + * page_dma_pinned() - report if a page is pinned by a call to pin_user_pages*()
> >> + * or pin_longterm_pages*()
> >> + * @page:	pointer to page to be queried.
> >> + * @Return:	True, if it is likely that the page has been "dma-pinned".
> >> + *		False, if the page is definitely not dma-pinned.
> >> + */
> > 
> > Maybe add a small comment about wrap around :)
> 
> 
> I don't *think* the count can wrap around, due to the checks in user_page_ref_inc().
> 
> But it's true that the documentation is a little light here...What did you have 
> in mind?

About the false positive case (and how unlikely it is) and that wrap
around is properly handled. Maybe just a pointer to the documentation
so that people know they can go look there for details. I know my
brain tends to forget where to look for things so I like to be constantly
reminded: hey, the doc is Documentation/foobar :)

> > [...]
> > 
> >> @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> >>  
> >>  		pgmap = get_dev_pagemap(pfn, pgmap);
> >>  		if (unlikely(!pgmap)) {
> >> -			undo_dev_pagemap(nr, nr_start, pages);
> >> +			undo_dev_pagemap(nr, nr_start, flags, pages);
> >>  			return 0;
> >>  		}
> >>  		SetPageReferenced(page);
> >>  		pages[*nr] = page;
> >> -		get_page(page);
> >> +
> >> +		if (flags & FOLL_PIN) {
> >> +			if (unlikely(!user_page_ref_inc(page))) {
> >> +				undo_dev_pagemap(nr, nr_start, flags, pages);
> >> +				return 0;
> >> +			}
> > 
> > Maybe add a comment about a case that should never happen, ie
> > user_page_ref_inc() failing after the second iteration of the
> > loop, as it would be broken and a bug to call undo_dev_pagemap()
> > after the first iteration of that loop.
> > 
> > Also I believe that this should never happen, as if the first
> > iteration succeeds then __page_cache_add_speculative() will
> > succeed for all the iterations.
> > 
> > Note that the pgmap case above follows that too, ie the call to
> > get_dev_pagemap() can only fail on the first iteration of the loop.
> > Well, I assume you can never have a huge device page that spans
> > different pgmaps, ie different devices (which is a reasonable
> > assumption). So maybe this code needs fixing, ie:
> > 
> > 		pgmap = get_dev_pagemap(pfn, pgmap);
> > 		if (unlikely(!pgmap))
> > 			return 0;
> > 
> > 
> 
> OK, yes that does make sense. And I think a comment is adequate,
> no need to check for bugs during every tail page iteration. So how 
> about this, as a preliminary patch:

Actually I thought about it and I think that there is a pgmap
per section and thus maybe one device can have multiple pgmaps,
and that would be an issue for pages bigger than the section size
(ie bigger than 128MB iirc). I will go double check that, but
maybe Dan can chime in.

In any case my comment above is correct for the page ref
increment: if the first one succeeds then the others will too,
or otherwise it means someone is doing too many put_page()/
put_user_page() which is _bad_ :)

> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..a4a81e125832 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1892,17 +1892,18 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>                 unsigned long end, struct page **pages, int *nr)
>  {
> -       int nr_start = *nr;
> -       struct dev_pagemap *pgmap = NULL;
> +       /*
> +        * Huge pages should never cross dev_pagemap boundaries. Therefore, use
> +        * this same pgmap for the entire huge page.
> +        */
> +       struct dev_pagemap *pgmap = get_dev_pagemap(pfn, NULL);
> +
> +       if (unlikely(!pgmap))
> +               return 0;
>  
>         do {
>                 struct page *page = pfn_to_page(pfn);
>  
> -               pgmap = get_dev_pagemap(pfn, pgmap);
> -               if (unlikely(!pgmap)) {
> -                       undo_dev_pagemap(nr, nr_start, pages);
> -                       return 0;
> -               }
>                 SetPageReferenced(page);
>                 pages[*nr] = page;
>                 get_page(page);
> 
> 
> 
> 
> >> +		} else
> >> +			get_page(page);
> >> +
> >>  		(*nr)++;
> >>  		pfn++;
> >>  	} while (addr += PAGE_SIZE, addr != end);
> > 
> > [...]
> > 
> >> @@ -2409,7 +2540,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
> >>  	unsigned long addr, len, end;
> >>  	int nr = 0, ret = 0;
> >>  
> >> -	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> >> +	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
> > 
> > Maybe add a comments to explain, something like:
> > 
> > /*
> >  * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN
> >  *
> >  * Note that get_user_pages_fast() imply FOLL_GET flag by default but
> >  * callers can over-ride this default to pin case by setting FOLL_PIN.
> >  */
> 
> Good idea. Here's the draft now:
> 
> /*
>  * The only flags allowed here are: FOLL_WRITE, FOLL_LONGTERM, FOLL_PIN.
>  *
>  * Note that get_user_pages_fast() implies FOLL_GET flag by default, but
>  * callers can override this default by setting FOLL_PIN instead of
>  * FOLL_GET.
>  */
> if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN)))
>         return -EINVAL;

Looks good to me.

...

Cheers,
Jérôme



* Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages
  2019-11-04 23:49       ` Jerome Glisse
@ 2019-11-05  0:18         ` John Hubbard
  0 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-05  0:18 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

Hi Dan, there is a question for you further down:


On 11/4/19 3:49 PM, Jerome Glisse wrote:
> On Mon, Nov 04, 2019 at 02:49:18PM -0800, John Hubbard wrote:
...
>>> Maybe add a small comment about wrap around :)
>>
>>
>> I don't *think* the count can wrap around, due to the checks in user_page_ref_inc().
>>
>> But it's true that the documentation is a little light here...What did you have 
>> in mind?
> 
> About the false positive case (and how unlikely it is) and that wrap
> around is properly handled. Maybe just a pointer to the documentation
> so that people know they can go look there for details. I know my
> brain tends to forget where to look for things so I like to be constantly
> reminded: hey, the doc is Documentation/foobar :)
> 

I see. OK, here's a version with a thoroughly overhauled comment header:

/**
 * page_dma_pinned() - report if a page is pinned for DMA.
 *
 * This function checks if a page has been pinned via a call to
 * pin_user_pages*() or pin_longterm_pages*().
 *
 * The return value is partially fuzzy: false is not fuzzy, because it means
 * "definitely not pinned for DMA", but true means "probably pinned for DMA, but
 * possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS worth
 * of normal page references".
 *
 * False positives are OK, because: a) it's unlikely for a page to get that many
 * refcounts, and b) all the callers of this routine are expected to be able to
 * deal gracefully with a false positive.
 *
 * For more information, please see Documentation/vm/pin_user_pages.rst.
 *
 * @page:	pointer to page to be queried.
 * @Return:	True, if it is likely that the page has been "dma-pinned".
 *		False, if the page is definitely not dma-pinned.
 */
static inline bool page_dma_pinned(struct page *page)
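
To illustrate the "definitely not" versus "probably" asymmetry described in that comment, here is a user-space model of the biased refcount. It is a sketch under assumptions: struct fake_page, the fake_*() helpers and the bias value of 1024 are invented for illustration, and a plain int stands in for the atomic page refcount.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value for illustration; the series defines the real one. */
#define GUP_PIN_COUNTING_BIAS	1024

/* Stand-in for struct page: only the refcount matters here. */
struct fake_page {
	int refcount;
};

/* Models an ordinary get_page(): one reference. */
static void fake_get_page(struct fake_page *page)
{
	page->refcount += 1;
}

/* Models a FOLL_PIN acquisition: the refcount grows by the whole bias. */
static void fake_pin_page(struct fake_page *page)
{
	page->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Models put_user_page(): drops exactly one pin. */
static void fake_unpin_page(struct fake_page *page)
{
	assert(page->refcount >= GUP_PIN_COUNTING_BIAS);
	page->refcount -= GUP_PIN_COUNTING_BIAS;
}

/*
 * false: definitely not pinned (fewer references than a single pin adds).
 * true:  probably pinned, or a false positive caused by at least
 *        GUP_PIN_COUNTING_BIAS ordinary references.
 */
static bool fake_page_dma_pinned(const struct fake_page *page)
{
	return page->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model a page with one ordinary reference reports false, the same page after fake_pin_page() reports true, and a page that merely collected GUP_PIN_COUNTING_BIAS ordinary references also reports true, which is the false positive the comment tells callers to tolerate.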


>>> [...]
>>>
>>>> @@ -1930,12 +2028,20 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
>>>>  
>>>>  		pgmap = get_dev_pagemap(pfn, pgmap);
>>>>  		if (unlikely(!pgmap)) {
>>>> -			undo_dev_pagemap(nr, nr_start, pages);
>>>> +			undo_dev_pagemap(nr, nr_start, flags, pages);
>>>>  			return 0;
>>>>  		}
>>>>  		SetPageReferenced(page);
>>>>  		pages[*nr] = page;
>>>> -		get_page(page);
>>>> +
>>>> +		if (flags & FOLL_PIN) {
>>>> +			if (unlikely(!user_page_ref_inc(page))) {
>>>> +				undo_dev_pagemap(nr, nr_start, flags, pages);
>>>> +				return 0;
>>>> +			}
>>>
>>> Maybe add a comment about a case that should never happen, ie
>>> user_page_ref_inc() failing after the second iteration of the
>>> loop, as it would be broken and a bug to call undo_dev_pagemap()
>>> after the first iteration of that loop.
>>>
>>> Also I believe that this should never happen, as if the first
>>> iteration succeeds then __page_cache_add_speculative() will
>>> succeed for all the iterations.
>>>
>>> Note that the pgmap case above follows that too, ie the call to
>>> get_dev_pagemap() can only fail on the first iteration of the loop.
>>> Well, I assume you can never have a huge device page that spans
>>> different pgmaps, ie different devices (which is a reasonable
>>> assumption). So maybe this code needs fixing, ie:
>>>
>>> 		pgmap = get_dev_pagemap(pfn, pgmap);
>>> 		if (unlikely(!pgmap))
>>> 			return 0;
>>>
>>>
>>
>> OK, yes that does make sense. And I think a comment is adequate,
>> no need to check for bugs during every tail page iteration. So how 
>> about this, as a preliminary patch:
> 
> Actually I thought about it and I think that there is a pgmap
> per section and thus maybe one device can have multiple pgmaps,
> and that would be an issue for pages bigger than the section size
> (ie bigger than 128MB iirc). I will go double check that, but
> maybe Dan can chime in.
> 
> In any case my comment above is correct for the page ref
> increment: if the first one succeeds then the others will too,
> or otherwise it means someone is doing too many put_page()/
> put_user_page() which is _bad_ :)
> 

I'll wait to hear from Dan before doing anything rash. :)


thanks,

John Hubbard
NVIDIA


* Re: [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  2019-11-04 22:03         ` John Hubbard
@ 2019-11-05  2:32           ` Jason Gunthorpe
  0 siblings, 0 replies; 57+ messages in thread
From: Jason Gunthorpe @ 2019-11-05  2:32 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 02:03:43PM -0800, John Hubbard wrote:
> On 11/4/19 12:57 PM, Jason Gunthorpe wrote:
> > On Mon, Nov 04, 2019 at 12:48:13PM -0800, John Hubbard wrote:
> >> On 11/4/19 12:33 PM, Jason Gunthorpe wrote:
> >> ...
> >>>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> >>>> index 24244a2f68cc..c5a78d3e674b 100644
> >>>> +++ b/drivers/infiniband/core/umem.c
> >>>> @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
> >>>>  
> >>>>  	while (npages) {
> >>>>  		down_read(&mm->mmap_sem);
> >>>> -		ret = get_user_pages(cur_base,
> >>>> +		ret = pin_longterm_pages(cur_base,
> >>>>  				     min_t(unsigned long, npages,
> >>>>  					   PAGE_SIZE / sizeof (struct page *)),
> >>>> -				     gup_flags | FOLL_LONGTERM,
> >>>> -				     page_list, NULL);
> >>>> +				     gup_flags, page_list, NULL);
> >>>
> >>> FWIW, this one should be converted to fast as well, I think we finally
> >>> got rid of all the blockers for that?
> >>>
> >>
> >> I'm not aware of any blockers on the gup.c end, anyway. The only broken thing we
> >> have there is "gup remote + FOLL_LONGTERM". But we can do "gup fast + LONGTERM". 
> > 
> > I mean the use of the mmap_sem here is finally in a way where we can
> > just delete the mmap_sem and use _fast
> >  
> > ie, AFAIK there is no need for the mmap_sem to be held during
> > ib_umem_add_sg_table()
> > 
> > This should probably be a standalone patch however
> > 
> 
> Yes. Oh, actually I guess the patch flow should be: change to 
> get_user_pages_fast() and remove the mmap_sem calls, as one patch. And then change 
> to pin_longterm_pages_fast() as the next patch. Otherwise, the internal fallback
> from _fast to slow gup would attempt to take the mmap_sem (again) in the same
> thread, which is not good. :)
> 
> Or just defer the change until after this series. Either way is fine, let me
> know if you prefer one over the other.
> 
> The patch itself is trivial, but runtime testing to gain confidence that
> it's solid is much harder. Is there a stress test you would recommend for that?
> (I'm not promising I can quickly run it yet--my local IB setup is still nascent 
> at best.)

If you make a patch we can probably get it tested; it is something
we should do that I keep forgetting about.

Jason


* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-03 21:18 ` [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
  2019-11-04 17:33   ` Jerome Glisse
  2019-11-04 20:33   ` David Rientjes
@ 2019-11-05 13:10   ` Mike Rapoport
  2019-11-05 19:00     ` John Hubbard
  2 siblings, 1 reply; 57+ messages in thread
From: Mike Rapoport @ 2019-11-05 13:10 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML

On Sun, Nov 03, 2019 at 01:18:00PM -0800, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
>     pin_user_pages()
>     pin_user_pages_remote()
>     pin_user_pages_fast()
> 
>     pin_longterm_pages()
>     pin_longterm_pages_remote()
>     pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
>         * Start with "pin_user_pages" instead of "get_user_pages". That
>           makes it easy to find and audit the call sites.
>         * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded on it slightly.)
> 
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  Documentation/vm/index.rst          |   1 +
>  Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++

I think it belongs to Documentation/core-api.

>  include/linux/mm.h                  |  62 ++++++-
>  mm/gup.c                            | 265 +++++++++++++++++++++++++---
>  4 files changed, 514 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/vm/pin_user_pages.rst
> 
> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index e8d943b21cf9..7194efa3554a 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -44,6 +44,7 @@ descriptions of data structures and algorithms.
>     page_migration
>     page_frags
>     page_owner
> +   pin_user_pages
>     remap_file_pages
>     slub
>     split_page_table_lock
> diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
> new file mode 100644
> index 000000000000..3910f49ca98c
> --- /dev/null
> +++ b/Documentation/vm/pin_user_pages.rst
> @@ -0,0 +1,212 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +====================================================
> +pin_user_pages() and related calls
> +====================================================

I know this is too much to ask, but having pin_user_pages() a part of more
general GUP description would be really great :)

> +
> +.. contents:: :local:
> +
> +Overview
> +========
> +
> +This document describes the following functions: ::
> +
> + pin_user_pages
> + pin_user_pages_fast
> + pin_user_pages_remote
> +
> + pin_longterm_pages
> + pin_longterm_pages_fast
> + pin_longterm_pages_remote
> +
> +Basic description of FOLL_PIN
> +=============================
> +
> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has

Consider reading this after, say, half a year ;-)

> +significant interactions and interdependencies with FOLL_LONGTERM, so both are
> +covered here.
> +
> +Both FOLL_PIN and FOLL_LONGTERM are "internal" to gup, meaning that neither
> +FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows
> +the associated wrapper functions (pin_user_pages and others) to set the correct
> +combination of these flags, and to check for problems as well.
> +
> +FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
> +multiple threads and call sites are free to pin the same struct pages, via both
> +FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
> +other, not the struct page(s).
> +
> +The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
> +uses a different reference counting technique.
> +
> +FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
> +FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.
> +
> +Which flags are set by each wrapper
> +===================================
> +
> +Only FOLL_PIN and FOLL_LONGTERM are covered here. These flags are added to
> +whatever flags the caller provides::
> +
> + Function                    gup flags (FOLL_PIN or FOLL_LONGTERM only)
> + --------                    ------------------------------------------
> + pin_user_pages              FOLL_PIN
> + pin_user_pages_fast         FOLL_PIN
> + pin_user_pages_remote       FOLL_PIN
> +
> + pin_longterm_pages          FOLL_PIN | FOLL_LONGTERM
> + pin_longterm_pages_fast     FOLL_PIN | FOLL_LONGTERM
> + pin_longterm_pages_remote   FOLL_PIN | FOLL_LONGTERM
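
For readers who prefer code to tables, here is a minimal userspace sketch of
the mapping above. The FOLL_PIN and FOLL_LONGTERM values are the ones from the
include/linux/mm.h hunk of this patch; the enum and the wrapper_flags() helper
are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Flag values as defined in the include/linux/mm.h hunk of this patch. */
#define FOLL_LONGTERM	0x10000u
#define FOLL_PIN	0x40000u

/* Hypothetical enum, for illustration only; not part of the patch. */
enum pin_wrapper {
	PIN_USER_PAGES,
	PIN_USER_PAGES_FAST,
	PIN_USER_PAGES_REMOTE,
	PIN_LONGTERM_PAGES,
	PIN_LONGTERM_PAGES_FAST,
	PIN_LONGTERM_PAGES_REMOTE,
};

/* Return the caller's flags plus whatever the table says the wrapper adds. */
static unsigned int wrapper_flags(enum pin_wrapper w, unsigned int caller_flags)
{
	switch (w) {
	case PIN_USER_PAGES:
	case PIN_USER_PAGES_FAST:
	case PIN_USER_PAGES_REMOTE:
		return caller_flags | FOLL_PIN;
	case PIN_LONGTERM_PAGES:
	case PIN_LONGTERM_PAGES_FAST:
	case PIN_LONGTERM_PAGES_REMOTE:
		return caller_flags | FOLL_PIN | FOLL_LONGTERM;
	}
	return caller_flags;
}
```

Note that this models the documented table. The mm/gup.c hunk further down
has a FIXME in pin_longterm_pages_remote() saying that FOLL_LONGTERM cannot
be set there yet, so the code and the table differ for that one wrapper.
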
> +
> +Tracking dma-pinned pages
> +=========================
> +
> +Some of the key design constraints, and solutions, for tracking dma-pinned
> +pages:
> +
> +* An actual reference count, per struct page, is required. This is because
> +  multiple processes may pin and unpin a page.
> +
> +* False positives (reporting that a page is dma-pinned, when in fact it is not)
> +  are acceptable, but false negatives are not.
> +
> +* struct page may not be increased in size for this, and all fields are already
> +  used.
> +
> +* Given the above, we can overload the page->_refcount field by using, sort of,
> +  the upper bits in that field for a dma-pinned count. "Sort of", means that,
> +  rather than dividing page->_refcount into bit fields, we simply add a
> +  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024:
> +  10 bits) to page->_refcount. This provides fuzzy behavior: if a page has
> +  get_page() called on it 1024 times, then it will appear to have a single
> +  dma-pinned count. And again, that's acceptable.
> +
> +This also leads to limitations: there are only 31-10==21 bits available for a
> +counter that increments 10 bits at a time.
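
The biased counting described above can be modeled in a few lines of
userspace C. This is only a sketch: struct fake_page and the model_*() helpers
are hypothetical stand-ins, and the ">= GUP_PIN_COUNTING_BIAS" test is an
assumption about how page_dma_pinned() would be implemented, since this patch
shows only its signature.

```c
#include <assert.h>
#include <stdbool.h>

#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct fake_page {
	int refcount;
};

/* Ordinary reference: +1 / -1, as with get_page() / put_page(). */
static void model_get_page(struct fake_page *p)
{
	p->refcount += 1;
}

static void model_put_page(struct fake_page *p)
{
	assert(p->refcount >= 1);
	p->refcount -= 1;
}

/* dma-pin: add or subtract the bias in the same counter. */
static void model_pin_page(struct fake_page *p)
{
	p->refcount += GUP_PIN_COUNTING_BIAS;
}

static void model_unpin_page(struct fake_page *p)
{
	assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
	p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* Assumed check: "maybe pinned" once the count reaches the bias. */
static bool model_page_dma_pinned(const struct fake_page *p)
{
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model a single pin is always reported (no false negatives), while
1024 plain model_get_page() calls on an otherwise unreferenced page are also
reported as pinned, which is the acceptable false positive from the text.
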
> +
> +TODO: for 1GB and larger huge pages, this is cutting it close. That's because
> +when pin_user_pages() follows such pages, it increments the head page by "1"
> +(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
> +pin_user_pages()) for each tail page. So if you have a 1GB huge page:
> +
> +* There are 256K (18 bits) worth of 4 KB tail pages.
> +* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
> +  10 bits at a time)
> +* There are 21 - 18 == 3 bits available to count. Except that there aren't,
> +  because you need to allow for a few normal get_page() calls on the head page,
> +  as well. Fortunately, the approach of using addition, rather than "hard"
> +  bitfields, within page->_refcount, allows for sharing these bits gracefully.
> +  But we're still looking at about 8 references.
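
The bit budget in the list above can be checked with a little arithmetic.
The sketch below uses the 31 usable bits and the 10-bit bias quoted in the
text; the helper names are hypothetical, and the result is an upper bound
that ignores the ordinary get_page() references the text mentions.

```c
#include <assert.h>

#define REFCOUNT_BITS	31	/* usable bits in page->_refcount, per the text */
#define BIAS_BITS	10	/* log2(GUP_PIN_COUNTING_BIAS) */

/* log2 of the number of base pages in a huge page (sizes are powers of 2). */
static unsigned int tail_page_bits(unsigned long long huge_size,
				   unsigned long long base_size)
{
	unsigned long long n;
	unsigned int bits = 0;

	assert(base_size > 0 && huge_size >= base_size);
	for (n = huge_size / base_size; n > 1; n >>= 1)
		bits++;
	return bits;
}

/*
 * Upper bound on how many times every tail page of a huge page can be
 * pinned, when each tail page adds one bias to the head page's count.
 */
static unsigned long long max_whole_page_pins(unsigned long long huge_size,
					      unsigned long long base_size)
{
	unsigned int pin_bits = REFCOUNT_BITS - BIAS_BITS;
	unsigned int tail_bits = tail_page_bits(huge_size, base_size);

	if (tail_bits > pin_bits)
		return 0;
	return 1ULL << (pin_bits - tail_bits);
}
```

For a 1GB huge page with 4 KB base pages this gives 18 tail-page bits and a
bound of 8, matching the numbers above. For a 2MB huge page the same
calculation leaves 12 bits, so the limitation mostly matters for 1GB pages.
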
> +
> +This, however, is a missing feature more than anything else, because it's easily
> +solved by addressing an obvious inefficiency in the original get_user_pages()
> +approach of retrieving pages: stop treating all the pages as if they were
> +PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
> +this, so some work is required. Once that's in place, this limitation mostly
> +disappears from view, because there will be ample refcounting range available.
> +
> +* Callers must specifically request "dma-pinned tracking of pages". In other
> +  words, just calling get_user_pages() will not suffice; a new set of functions,
> +  pin_user_pages() and related, must be used.
> +
> +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
> +==========================================================
> +
> +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
> +these categories:
> +
> +CASE 1: Direct IO (DIO)
> +-----------------------
> +There are GUP references to pages that are serving
> +as DIO buffers. These buffers are needed for a relatively short time (so they
> +are not "long term"). No special synchronization with page_mkclean() or
> +munmap() is provided. Therefore, flags to set at the call site are: ::
> +
> +    FOLL_PIN
> +
> +...but rather than setting FOLL_PIN directly, call sites should use one of
> +the pin_user_pages*() routines that set FOLL_PIN.
> +
> +CASE 2: RDMA
> +------------
> +There are GUP references to pages that are serving as DMA
> +buffers. These buffers are needed for a long time ("long term"). No special
> +synchronization with page_mkclean() or munmap() is provided. Therefore, flags
> +to set at the call site are: ::
> +
> +    FOLL_PIN | FOLL_LONGTERM
> +
> +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
> +because DAX pages do not have a separate page cache, and so "pinning" implies
> +locking down file system blocks, which is not (yet) supported in that way.
> +
> +CASE 3: ODP
> +-----------
> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> +replayable page faulting). There are GUP references to pages serving as DMA
> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> +needs to be set.
> +
> +CASE 4: Pinning for struct page manipulation only
> +-------------------------------------------------
> +Here, normal GUP calls are sufficient, so neither flag needs to be set.
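
The four cases can be summarized as a small lookup, sketched below. The enum,
the struct and choice_for_case() are hypothetical names for illustration; the
flag values come from the mm.h hunk of this patch, and the acquire/release
pairings follow the FOLL_GET and FOLL_PIN comment in that same hunk.

```c
#include <assert.h>
#include <string.h>

#define FOLL_LONGTERM	0x10000u
#define FOLL_PIN	0x40000u

/* Hypothetical names for the four cases described in the document. */
enum gup_case {
	CASE_DIO = 1,
	CASE_RDMA,
	CASE_ODP,
	CASE_STRUCT_PAGE_ONLY,
};

struct gup_choice {
	unsigned int flags;	/* FOLL_PIN / FOLL_LONGTERM bits that end up set */
	const char *acquire;	/* function family the call site should use */
	const char *release;	/* how the pages must be released */
};

static struct gup_choice choice_for_case(enum gup_case c)
{
	struct gup_choice plain = { 0, "get_user_pages*()", "put_page()" };
	struct gup_choice dio = {
		FOLL_PIN, "pin_user_pages*()", "put_user_page()"
	};
	struct gup_choice rdma = {
		FOLL_PIN | FOLL_LONGTERM, "pin_longterm_pages*()",
		"put_user_page()"
	};

	switch (c) {
	case CASE_DIO:
		return dio;
	case CASE_RDMA:
		return rdma;
	case CASE_ODP:
	case CASE_STRUCT_PAGE_ONLY:
		break;
	}
	return plain;	/* cases 3 and 4: normal gup calls, neither flag */
}
```

The call site only picks the function family; the flags column is what the
wrappers set internally, never something the caller passes in.
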
> +
> +page_dma_pinned(): the whole point of pinning
> +=============================================
> +
> +The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
> +to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
> +(and file system writeback code in general) to make informed decisions about
> +what to do when a page cannot be unmapped due to such pins.
> +
> +What to do in those cases is the subject of a years-long series of discussions
> +and debates (see the References at the end of this document). It's a TODO item
> +here: fill in the details once that's worked out. Meanwhile, it's safe to say
> +that having this available: ::
> +
> +        static inline bool page_dma_pinned(struct page *page)
> +
> +...is a prerequisite to solving the long-running gup+DMA problem.
> +
> +Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
> +===================================================================
> +
> +Another way of thinking about these flags is as a progression of restrictions:
> +FOLL_GET is for struct page manipulation, without affecting the data that the
> +struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
> +short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
> +a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
> +restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
> +will be pinned longterm, and whose data will be accessed.
> +
> +Unit testing
> +============
> +This file::
> +
> + tools/testing/selftests/vm/gup_benchmark.c
> +
> +has the following new calls to exercise the new pin*() wrapper functions:
> +
> +* PIN_FAST_BENCHMARK (./gup_benchmark -a)
> +* PIN_LONGTERM_BENCHMARK (./gup_benchmark -a)
> +* PIN_BENCHMARK (./gup_benchmark -a)
> +
> +You can monitor how many total dma-pinned pages have been acquired and released
> +since the system was booted, via two new /proc/vmstat entries: ::
> +
> +    /proc/vmstat/nr_foll_pin_requested
> +    /proc/vmstat/nr_foll_pin_returned
> +
> +Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
> +because there is a noticeable performance drop in put_user_page(), when they
> +are activated.
> +
> +References
> +==========
> +
> +* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
> +* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
> +* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
> +
> +John Hubbard, October, 2019
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cc292273e6ba..cdfb6fedb271 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1526,9 +1526,23 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>  			    unsigned long start, unsigned long nr_pages,
>  			    unsigned int gup_flags, struct page **pages,
>  			    struct vm_area_struct **vmas, int *locked);
> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			   unsigned long start, unsigned long nr_pages,
> +			   unsigned int gup_flags, struct page **pages,
> +			   struct vm_area_struct **vmas, int *locked);
> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			       unsigned long start, unsigned long nr_pages,
> +			       unsigned int gup_flags, struct page **pages,
> +			       struct vm_area_struct **vmas, int *locked);
>  long get_user_pages(unsigned long start, unsigned long nr_pages,
>  			    unsigned int gup_flags, struct page **pages,
>  			    struct vm_area_struct **vmas);
> +long pin_user_pages(unsigned long start, unsigned long nr_pages,
> +		    unsigned int gup_flags, struct page **pages,
> +		    struct vm_area_struct **vmas);
> +long pin_longterm_pages(unsigned long start, unsigned long nr_pages,
> +			unsigned int gup_flags, struct page **pages,
> +			struct vm_area_struct **vmas);
>  long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages, int *locked);
>  long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> @@ -1536,6 +1550,10 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
>  
>  int get_user_pages_fast(unsigned long start, int nr_pages,
>  			unsigned int gup_flags, struct page **pages);
> +int pin_user_pages_fast(unsigned long start, int nr_pages,
> +			unsigned int gup_flags, struct page **pages);
> +int pin_longterm_pages_fast(unsigned long start, int nr_pages,
> +			    unsigned int gup_flags, struct page **pages);
>  
>  int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
>  int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
> @@ -2594,13 +2612,15 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>  #define FOLL_ANON	0x8000	/* don't do file mappings */
>  #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
>  #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
> +#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
>  
>  /*
> - * NOTE on FOLL_LONGTERM:
> + * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
> + * other. Here is what they mean, and how to use them:
>   *
>   * FOLL_LONGTERM indicates that the page will be held for an indefinite time
> - * period _often_ under userspace control.  This is contrasted with
> - * iov_iter_get_pages() where usages which are transient.
> + * period _often_ under userspace control.  This is in contrast to
> + * iov_iter_get_pages(), where usages are transient.
>   *
>   * FIXME: For pages which are part of a filesystem, mappings are subject to the
>   * lifetime enforced by the filesystem and we need guarantees that longterm
> @@ -2615,11 +2635,41 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>   * Currently only get_user_pages() and get_user_pages_fast() support this flag
>   * and calls to get_user_pages_[un]locked are specifically not allowed.  This
>   * is due to an incompatibility with the FS DAX check and
> - * FAULT_FLAG_ALLOW_RETRY
> + * FAULT_FLAG_ALLOW_RETRY.
>   *
> - * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
> - * that region.  And so CMA attempts to migrate the page before pinning when
> + * In the CMA case: long term pins in a CMA region would unnecessarily fragment
> + * that region.  And so, CMA attempts to migrate the page before pinning, when
>   * FOLL_LONGTERM is specified.
> + *
> + * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
> + * but an additional pin counting system) will be invoked. This is intended for
> + * anything that gets a page reference and then touches page data (for example,
> + * Direct IO). This lets the filesystem know that some non-file-system entity is
> + * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
> + * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
> + * a call to put_user_page().
> + *
> + * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
> + * and separate refcounting mechanisms, however, and that means that each has
> + * its own acquire and release mechanisms:
> + *
> + *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
> + *
> + *     FOLL_PIN: pin_user_pages*() or pin_longterm_pages*() to acquire, and
> + *               put_user_pages to release.
> + *
> + * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
> + * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
> + * calls applied to them, and that's perfectly OK. This is a constraint on the
> + * callers, not on the pages.)
> + *
> + * FOLL_PIN and FOLL_LONGTERM should be set internally by the pin_user_page*()
> + * and pin_longterm_*() APIs, never directly by the caller. That's in order to
> + * help avoid mismatches when releasing pages: get_user_pages*() pages must be
> + * released via put_page(), while pin_user_pages*() pages must be released via
> + * put_user_page().
> + *
> + * Please see Documentation/vm/pin_user_pages.rst for more information.
>   */
>  
>  static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..1aea48427879 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -179,6 +179,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  	pte_t *ptep, pte;
>  
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> +			 (FOLL_PIN | FOLL_GET)))
> +		return ERR_PTR(-EINVAL);
>  retry:
>  	if (unlikely(pmd_bad(*pmd)))
>  		return no_page_table(vma, flags);
> @@ -790,7 +794,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  
>  	start = untagged_addr(start);
>  
> -	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> +	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
>  
>  	/*
>  	 * If FOLL_FORCE is set then do not force a full fault as the hinting
> @@ -1014,7 +1018,16 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
>  		BUG_ON(*locked != 1);
>  	}
>  
> -	if (pages)
> +	/*
> +	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> +	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
> +	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
> +	 * for FOLL_GET, not for the newer FOLL_PIN.
> +	 *
> +	 * FOLL_PIN always expects pages to be non-null, but no need to assert
> +	 * that here, as any failures will be obvious enough.
> +	 */
> +	if (pages && !(flags & FOLL_PIN))
>  		flags |= FOLL_GET;
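
The effect of this two-line change is easy to tabulate. Below is a minimal
userspace sketch of just the flag fixup; fixup_flags() is a hypothetical name,
FOLL_PIN is the value added by this patch, and the FOLL_GET value is assumed
from mainline mm.h because it is not visible in this hunk.

```c
#include <assert.h>
#include <stddef.h>

#define FOLL_GET	0x04u		/* assumed value; not shown in this patch */
#define FOLL_PIN	0x40000u	/* value added by this patch */

/*
 * Model of the fixup in __get_user_pages_locked(): a caller that passes a
 * pages[] array gets FOLL_GET implicitly, unless FOLL_PIN is already set.
 */
static unsigned int fixup_flags(const void *pages, unsigned int flags)
{
	if (pages && !(flags & FOLL_PIN))
		flags |= FOLL_GET;
	return flags;
}
```

With a non-NULL pages[] and no FOLL_PIN, FOLL_GET is added as before; with
FOLL_PIN set the flags pass through unchanged, so the two never end up set
together on this path.
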
>  
>  	pages_done = 0;
> @@ -1151,6 +1164,14 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned int gup_flags, struct page **pages,
>  		struct vm_area_struct **vmas, int *locked)
>  {
> +	/*
> +	 * FOLL_PIN must only be set internally by the pin_user_page*() and
> +	 * pin_longterm_*() APIs, never directly by the caller, so enforce that
> +	 * with an assertion:
> +	 */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> +		return -EINVAL;
> +
>  	/*
>  	 * FIXME: Current FOLL_LONGTERM behavior is incompatible with
>  	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
> @@ -1608,6 +1629,14 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
>  		unsigned int gup_flags, struct page **pages,
>  		struct vm_area_struct **vmas)
>  {
> +	/*
> +	 * FOLL_PIN must only be set internally by the pin_user_page*() and
> +	 * pin_longterm_*() APIs, never directly by the caller, so enforce that
> +	 * with an assertion:
> +	 */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> +		return -EINVAL;
> +
>  	return __gup_longterm_locked(current, current->mm, start, nr_pages,
>  				     pages, vmas, gup_flags | FOLL_TOUCH);
>  }
> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
>  	return ret;
>  }
>  
> -/**
> - * get_user_pages_fast() - pin user pages in memory
> - * @start:	starting user address
> - * @nr_pages:	number of pages from start to pin
> - * @gup_flags:	flags modifying pin behaviour
> - * @pages:	array that receives pointers to the pages pinned.
> - *		Should be at least nr_pages long.
> - *
> - * Attempt to pin user pages in memory without taking mm->mmap_sem.
> - * If not successful, it will fall back to taking the lock and
> - * calling get_user_pages().
> - *
> - * Returns number of pages pinned. This may be fewer than the number
> - * requested. If nr_pages is 0 or negative, returns 0. If no pages
> - * were pinned, returns -errno.
> - */
> -int get_user_pages_fast(unsigned long start, int nr_pages,
> -			unsigned int gup_flags, struct page **pages)
> +static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
> +					unsigned int gup_flags,
> +					struct page **pages)
>  {
>  	unsigned long addr, len, end;
>  	int nr = 0, ret = 0;
> @@ -2435,4 +2449,215 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
>  
>  	return ret;
>  }
> +
> +/**
> + * get_user_pages_fast() - pin user pages in memory
> + * @start:	starting user address
> + * @nr_pages:	number of pages from start to pin
> + * @gup_flags:	flags modifying pin behaviour
> + * @pages:	array that receives pointers to the pages pinned.
> + *		Should be at least nr_pages long.
> + *
> + * Attempt to pin user pages in memory without taking mm->mmap_sem.
> + * If not successful, it will fall back to taking the lock and
> + * calling get_user_pages().
> + *
> + * Returns number of pages pinned. This may be fewer than the number requested.
> + * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns
> + * -errno.
> + */
> +int get_user_pages_fast(unsigned long start, int nr_pages,
> +			unsigned int gup_flags, struct page **pages)
> +{
> +	/*
> +	 * FOLL_PIN must only be set internally by the pin_user_page*() and
> +	 * pin_longterm_*() APIs, never directly by the caller, so enforce that:
> +	 */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
> +		return -EINVAL;
> +
> +	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> +}
>  EXPORT_SYMBOL_GPL(get_user_pages_fast);
> +
> +/**
> + * pin_user_pages_fast() - pin user pages in memory without taking locks
> + *
> + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
> + * get_user_pages_fast() for documentation on the function arguments, because
> + * the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +int pin_user_pages_fast(unsigned long start, int nr_pages,
> +			unsigned int gup_flags, struct page **pages)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_PIN;
> +	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> +}
> +EXPORT_SYMBOL_GPL(pin_user_pages_fast);
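
The get/pin wrapper pair above follows one pattern: each public entry point
rejects the other side's flag, then calls the shared internal function. A
minimal userspace model of that pattern is sketched below. The model_*()
names are hypothetical, the internal function is a stub that only records the
flags it receives, WARN_ON_ONCE() is left out, and the FOLL_GET value is
assumed from mainline mm.h.

```c
#include <assert.h>
#include <errno.h>

#define FOLL_GET	0x04u		/* assumed value; not shown in this patch */
#define FOLL_PIN	0x40000u	/* value added by this patch */

/* Stub for internal_get_user_pages_fast(): records the flags it was given. */
static int model_internal_gup_fast(unsigned int gup_flags,
				   unsigned int *seen_flags)
{
	*seen_flags = gup_flags;
	return 0;
}

/* get_*() side: FOLL_PIN may only be set by the pin_*() wrappers. */
static int model_get_user_pages_fast(unsigned int gup_flags,
				     unsigned int *seen_flags)
{
	if (gup_flags & FOLL_PIN)
		return -EINVAL;
	return model_internal_gup_fast(gup_flags, seen_flags);
}

/* pin_*() side: FOLL_GET and FOLL_PIN are mutually exclusive. */
static int model_pin_user_pages_fast(unsigned int gup_flags,
				     unsigned int *seen_flags)
{
	if (gup_flags & FOLL_GET)
		return -EINVAL;
	return model_internal_gup_fast(gup_flags | FOLL_PIN, seen_flags);
}
```

Because both checks live in the public wrappers, the shared internal function
never sees a caller-supplied FOLL_PIN, which is what lets the release side
(put_page() versus put_user_page()) be inferred from the function that was
called.
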
> +
> +/**
> + * pin_longterm_pages_fast() - pin user pages in memory without taking locks
> + *
> + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN and
> + * FOLL_LONGTERM are set. See get_user_pages_fast() for documentation on the
> + * function arguments, because the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> + * pinned page to become unpinned will be effective.
> + *
> + * This is intended for Case 2 (RDMA: long-term pins) of the FOLL_PIN
> + * documentation.
> + */
> +int pin_longterm_pages_fast(unsigned long start, int nr_pages,
> +			    unsigned int gup_flags, struct page **pages)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= (FOLL_PIN | FOLL_LONGTERM);
> +	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> +}
> +EXPORT_SYMBOL_GPL(pin_longterm_pages_fast);
> +
> +/**
> + * pin_user_pages_remote() - pin pages for (typically) use by Direct IO, and
> + * return the pages to the user.
> + *
> + * Nearly the same as get_user_pages_remote(), except that FOLL_PIN is set. See
> + * get_user_pages_remote() for documentation on the function arguments, because
> + * the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			   unsigned long start, unsigned long nr_pages,
> +			   unsigned int gup_flags, struct page **pages,
> +			   struct vm_area_struct **vmas, int *locked)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_TOUCH | FOLL_REMOTE | FOLL_PIN;
> +
> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> +				       locked, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_user_pages_remote);
> +
> +/**
> + * pin_longterm_pages_remote() - pin pages for (typically) use by Direct IO, and
> + * return the pages to the user.
> + *
> + * Nearly the same as get_user_pages_remote(), but note that FOLL_TOUCH is not
> + * set, and FOLL_PIN and FOLL_LONGTERM are set. See get_user_pages_remote() for
> + * documentation on the function arguments, because the arguments here are
> + * identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> + * pinned page to become unpinned will be effective.
> + *
> + * This is intended for Case 2 (RDMA: long-term pins) in
> + * Documentation/vm/pin_user_pages.rst.
> + */
> +long pin_longterm_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
> +			       unsigned long start, unsigned long nr_pages,
> +			       unsigned int gup_flags, struct page **pages,
> +			       struct vm_area_struct **vmas, int *locked)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	/*
> +	 * FIXME: as noted in the get_user_pages_remote() implementation, it
> +	 * is not yet possible to safely set FOLL_LONGTERM here. FOLL_LONGTERM
> +	 * needs to be set, but for now the best we can do is a "TODO" item.
> +	 */
> +	gup_flags |= FOLL_REMOTE | FOLL_PIN;
> +
> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas,
> +				       locked, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_longterm_pages_remote);
> +
> +/**
> + * pin_user_pages() - pin user pages in memory for use by other devices
> + *
> + * Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
> + * FOLL_PIN is set.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +long pin_user_pages(unsigned long start, unsigned long nr_pages,
> +		    unsigned int gup_flags, struct page **pages,
> +		    struct vm_area_struct **vmas)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_PIN;
> +	return __gup_longterm_locked(current, current->mm, start, nr_pages,
> +				     pages, vmas, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_user_pages);
> +
> +/**
> + * pin_longterm_pages() - pin user pages in memory for long-term use (RDMA,
> + * typically)
> + *
> + * Nearly the same as get_user_pages(), except that FOLL_PIN and FOLL_LONGTERM
> + * are set. See get_user_pages() for documentation on the function
> + * arguments, because the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * FOLL_LONGTERM means that the pages are being pinned for "long term" use,
> + * typically by a non-CPU device, and we cannot be sure that waiting for a
> + * pinned page to become unpinned will be effective.
> + *
> + * This is intended for Case 2 (RDMA: long-term pins) in
> + * Documentation/vm/pin_user_pages.rst.
> + */
> +long pin_longterm_pages(unsigned long start, unsigned long nr_pages,
> +			unsigned int gup_flags, struct page **pages,
> +			struct vm_area_struct **vmas)
> +{
> +	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
> +	if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> +		return -EINVAL;
> +
> +	gup_flags |= FOLL_PIN | FOLL_LONGTERM;
> +	return __gup_longterm_locked(current, current->mm, start, nr_pages,
> +				     pages, vmas, gup_flags);
> +}
> +EXPORT_SYMBOL(pin_longterm_pages);
> -- 
> 2.23.0
> 
> 

-- 
Sincerely yours,
Mike.


* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-05 13:10   ` Mike Rapoport
@ 2019-11-05 19:00     ` John Hubbard
  2019-11-07  2:25       ` Ira Weiny
  2019-11-07  8:07       ` Mike Rapoport
  0 siblings, 2 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-05 19:00 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML

On 11/5/19 5:10 AM, Mike Rapoport wrote:
...
>> ---
>>  Documentation/vm/index.rst          |   1 +
>>  Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++
> 
> I think it belongs to Documentation/core-api.

Done:

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index ab0eae1c153a..413f7d7c8642 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
    generic-radix-tree
    memory-allocation
    mm-api
+   pin_user_pages
    gfp_mask-from-fs-io
    timekeeping
    boot-time-mm


...
>> diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
>> new file mode 100644
>> index 000000000000..3910f49ca98c
>> --- /dev/null
>> +++ b/Documentation/vm/pin_user_pages.rst
>> @@ -0,0 +1,212 @@
>> +.. SPDX-License-Identifier: GPL-2.0
>> +
>> +====================================================
>> +pin_user_pages() and related calls
>> +====================================================
> 
> I know this is too much to ask, but having pin_user_pages() a part of more
> general GUP description would be really great :)
> 

Yes, definitely. But until I saw the reaction to the pin_user_pages() API
family, I didn't want to write too much--it could have all been tossed out
in favor of a whole different API. But now that we've had some initial
reviews, I'm much more confident in being able to write about the larger 
API set.

So yes, I'll put that on my pending list.


...
>> +This document describes the following functions: ::
>> +
>> + pin_user_pages
>> + pin_user_pages_fast
>> + pin_user_pages_remote
>> +
>> + pin_longterm_pages
>> + pin_longterm_pages_fast
>> + pin_longterm_pages_remote
>> +
>> +Basic description of FOLL_PIN
>> +=============================
>> +
>> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
> 
> Consider reading this after, say, half a year ;-)
> 

OK, OK. I knew when I wrote that that it was not going to stay new forever, but
somehow failed to write the right thing anyway. :) 

Here's a revised set of paragraphs:

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows
the associated wrapper functions (pin_user_pages() and others) to set the
correct combination of these flags, and to check for problems as well.
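
As a rough illustration of the wrapper idea in the paragraph above, here is a
userspace sketch, not the kernel code. Everything prefixed with sk_ or SK_ is
hypothetical: the flag values are made up, sk_gup_internal() is a stand-in
that only records the flags it receives, and rejecting callers that pass the
internal flags is just one assumed way to "check for problems".

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values for this sketch; the kernel's values differ. */
#define SK_FOLL_WRITE    0x01u
#define SK_FOLL_LONGTERM 0x02u
#define SK_FOLL_PIN      0x04u

/* Stand-in for the internal gup path: records the flags it was handed
 * and pretends that every requested page was pinned. */
static long sk_gup_internal(unsigned long nr_pages, unsigned int flags,
                            unsigned int *flags_seen)
{
    *flags_seen = flags;
    return (long)nr_pages;
}

/* Short-term pin wrapper: the caller must not pass the internal flags;
 * the wrapper adds SK_FOLL_PIN itself. */
static long sk_pin_user_pages(unsigned long nr_pages, unsigned int flags,
                              unsigned int *flags_seen)
{
    if (flags & (SK_FOLL_PIN | SK_FOLL_LONGTERM))
        return -EINVAL;
    return sk_gup_internal(nr_pages, flags | SK_FOLL_PIN, flags_seen);
}

/* Long-term pin wrapper: same check, but sets both internal flags. */
static long sk_pin_longterm_pages(unsigned long nr_pages, unsigned int flags,
                                  unsigned int *flags_seen)
{
    if (flags & (SK_FOLL_PIN | SK_FOLL_LONGTERM))
        return -EINVAL;
    return sk_gup_internal(nr_pages,
                           flags | SK_FOLL_PIN | SK_FOLL_LONGTERM,
                           flags_seen);
}
```

The point of the sketch is only that call sites choose a wrapper by name and
pass ordinary flags, while the pin-related flags are combined in one place.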


thanks,

John Hubbard
NVIDIA


* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-05 19:00     ` John Hubbard
@ 2019-11-07  2:25       ` Ira Weiny
  2019-11-07  8:07       ` Mike Rapoport
  1 sibling, 0 replies; 57+ messages in thread
From: Ira Weiny @ 2019-11-07  2:25 UTC (permalink / raw)
  To: John Hubbard
  Cc: Mike Rapoport, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML

> 
> 
> ...
> >> +This document describes the following functions: ::
> >> +
> >> + pin_user_pages
> >> + pin_user_pages_fast
> >> + pin_user_pages_remote
> >> +
> >> + pin_longterm_pages
> >> + pin_longterm_pages_fast
> >> + pin_longterm_pages_remote
> >> +
> >> +Basic description of FOLL_PIN
> >> +=============================
> >> +
> >> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
> > 
> > Consider reading this after, say, half a year ;-)
> > 
> 
> OK, OK. I knew when I wrote that that it was not going to stay new forever, but
> somehow failed to write the right thing anyway. :) 
> 
> Here's a revised set of paragraphs:
> 
> Basic description of FOLL_PIN
> =============================
> 
> FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
> ("gup") family of functions. FOLL_PIN has significant interactions and
> interdependencies with FOLL_LONGTERM, so both are covered here.
> 
> Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
> FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows
> the associated wrapper functions (pin_user_pages() and others) to set the
> correct combination of these flags, and to check for problems as well.

I like this revision as well.

Ira



* Re: [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*()
  2019-11-04 20:57       ` Jason Gunthorpe
  2019-11-04 22:03         ` John Hubbard
@ 2019-11-07  2:26         ` Ira Weiny
  1 sibling, 0 replies; 57+ messages in thread
From: Ira Weiny @ 2019-11-07  2:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, Andrew Morton, Al Viro, Alex Williamson,
	Benjamin Herrenschmidt, Björn Töpel, Christoph Hellwig,
	Dan Williams, Daniel Vetter, Dave Chinner, David Airlie,
	David S . Miller, Jan Kara, Jens Axboe, Jonathan Corbet,
	Jérôme Glisse, Magnus Karlsson, Mauro Carvalho Chehab,
	Michael Ellerman, Michal Hocko, Mike Kravetz, Paul Mackerras,
	Shuah Khan, Vlastimil Babka, bpf, dri-devel, kvm, linux-block,
	linux-doc, linux-fsdevel, linux-kselftest, linux-media,
	linux-rdma, linuxppc-dev, netdev, linux-mm, LKML

On Mon, Nov 04, 2019 at 04:57:38PM -0400, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2019 at 12:48:13PM -0800, John Hubbard wrote:
> > On 11/4/19 12:33 PM, Jason Gunthorpe wrote:
> > ...
> > >> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > >> index 24244a2f68cc..c5a78d3e674b 100644
> > >> +++ b/drivers/infiniband/core/umem.c
> > >> @@ -272,11 +272,10 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
> > >>  
> > >>  	while (npages) {
> > >>  		down_read(&mm->mmap_sem);
> > >> -		ret = get_user_pages(cur_base,
> > >> +		ret = pin_longterm_pages(cur_base,
> > >>  				     min_t(unsigned long, npages,
> > >>  					   PAGE_SIZE / sizeof (struct page *)),
> > >> -				     gup_flags | FOLL_LONGTERM,
> > >> -				     page_list, NULL);
> > >> +				     gup_flags, page_list, NULL);
> > > 
> > > FWIW, this one should be converted to fast as well, I think we finally
> > > got rid of all the blockers for that?
> > > 
> > 
> > I'm not aware of any blockers on the gup.c end, anyway. The only broken thing we
> > have there is "gup remote + FOLL_LONGTERM". But we can do "gup fast + LONGTERM". 
> 
> I mean the use of the mmap_sem here is finally in a way where we can
> just delete the mmap_sem and use _fast

Yay!  I agree if we can do this we should.

Thanks,
Ira

>  
> ie, AFAIK there is no need for the mmap_sem to be held during
> ib_umem_add_sg_table()
> 
> This should probably be a standalone patch however
> 
> Jason
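
For readers following the quoted ib_umem_get() hunk: the loop pins pages in
batches of at most PAGE_SIZE / sizeof(struct page *) per call, presumably
because page_list is a single page of pointers. A minimal userspace sketch of
that batching pattern follows. All sk_ names are hypothetical, the 4096-byte
page size is an assumption, and unlike the real code the sketch does not
release already-pinned pages on error.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SIZE 4096UL   /* assumed page size for this sketch */

struct sk_page;               /* opaque stand-in for struct page */

/* Hypothetical pin callback: returns pages pinned, or a negative errno. */
typedef long (*sk_pin_fn)(unsigned long start, unsigned long nr,
                          struct sk_page **list, void *ctx);

/* Pin npages starting at start, never asking for more pages per call
 * than fit in one page worth of pointers. Returns the total pinned,
 * or the first negative error from the callback. */
static long sk_pin_in_chunks(unsigned long start, unsigned long npages,
                             struct sk_page **page_list,
                             sk_pin_fn pin, void *ctx)
{
    const unsigned long max_chunk = SK_PAGE_SIZE / sizeof(struct sk_page *);
    unsigned long pinned = 0;

    while (npages) {
        unsigned long want = npages < max_chunk ? npages : max_chunk;
        long ret = pin(start, want, page_list, ctx);

        if (ret < 0)
            return ret;       /* caller is expected to unwind */
        if (ret == 0)
            break;            /* no progress: stop instead of spinning */

        start  += (unsigned long)ret * SK_PAGE_SIZE;
        npages -= (unsigned long)ret;
        pinned += (unsigned long)ret;
    }
    return (long)pinned;
}

/* Fake callback used for illustration: counts calls, pins everything. */
static long sk_fake_pin(unsigned long start, unsigned long nr,
                        struct sk_page **list, void *ctx)
{
    (void)start;
    (void)list;
    ++*(unsigned long *)ctx;
    return (long)nr;
}
```

Whether the lock is held around each batch is orthogonal to the batching
itself, which is why the switch to a _fast variant discussed above can be
done as a standalone change.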


* Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN
  2019-11-05 19:00     ` John Hubbard
  2019-11-07  2:25       ` Ira Weiny
@ 2019-11-07  8:07       ` Mike Rapoport
  1 sibling, 0 replies; 57+ messages in thread
From: Mike Rapoport @ 2019-11-07  8:07 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML

On Tue, Nov 05, 2019 at 11:00:06AM -0800, John Hubbard wrote:
> On 11/5/19 5:10 AM, Mike Rapoport wrote:
> ...
> >> ---
> >>  Documentation/vm/index.rst          |   1 +
> >>  Documentation/vm/pin_user_pages.rst | 212 ++++++++++++++++++++++
> > 
> > I think it belongs to Documentation/core-api.
> 
> Done:
> 
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index ab0eae1c153a..413f7d7c8642 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -31,6 +31,7 @@ Core utilities
>     generic-radix-tree
>     memory-allocation
>     mm-api
> +   pin_user_pages
>     gfp_mask-from-fs-io
>     timekeeping
>     boot-time-mm

Thanks!
 
> ...
> >> diff --git a/Documentation/vm/pin_user_pages.rst b/Documentation/vm/pin_user_pages.rst
> >> new file mode 100644
> >> index 000000000000..3910f49ca98c
> >> --- /dev/null
> >> +++ b/Documentation/vm/pin_user_pages.rst
> >> @@ -0,0 +1,212 @@
> >> +.. SPDX-License-Identifier: GPL-2.0
> >> +
> >> +====================================================
> >> +pin_user_pages() and related calls
> >> +====================================================
> > 
> > I know this is too much to ask, but having pin_user_pages() a part of more
> > general GUP description would be really great :)
> > 
> 
> Yes, definitely. But until I saw the reaction to the pin_user_pages() API
> family, I didn't want to write too much--it could have all been tossed out
> in favor of a whole different API. But now that we've had some initial
> reviews, I'm much more confident in being able to write about the larger 
> API set.
> 
> So yes, I'll put that on my pending list.
> 
> 
> ...
> >> +This document describes the following functions: ::
> >> +
> >> + pin_user_pages
> >> + pin_user_pages_fast
> >> + pin_user_pages_remote
> >> +
> >> + pin_longterm_pages
> >> + pin_longterm_pages_fast
> >> + pin_longterm_pages_remote
> >> +
> >> +Basic description of FOLL_PIN
> >> +=============================
> >> +
> >> +A new flag for get_user_pages ("gup") has been added: FOLL_PIN. FOLL_PIN has
> > 
> > Consider reading this after, say, half a year ;-)
> > 
> 
> OK, OK. I knew when I wrote that that it was not going to stay new forever, but
> somehow failed to write the right thing anyway. :) 
> 
> Here's a revised set of paragraphs:
> 
> Basic description of FOLL_PIN
> =============================
> 
> FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
> ("gup") family of functions. FOLL_PIN has significant interactions and
> interdependencies with FOLL_LONGTERM, so both are covered here.
> 
> Both FOLL_PIN and FOLL_LONGTERM are internal to gup, meaning that neither
> FOLL_PIN nor FOLL_LONGTERM should appear at the gup call sites. This allows
> the associated wrapper functions (pin_user_pages() and others) to set the
> correct combination of these flags, and to check for problems as well.

Great, thanks! 
 
> thanks,
> 
> John Hubbard
> NVIDIA

-- 
Sincerely yours,
Mike.


* Re: [PATCH v2 04/18] media/v4l2-core: set pages dirty upon releasing DMA buffers
  2019-11-03 21:17 ` [PATCH v2 04/18] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
@ 2019-11-10 10:10   ` Hans Verkuil
  2019-11-11 21:46     ` John Hubbard
  0 siblings, 1 reply; 57+ messages in thread
From: Hans Verkuil @ 2019-11-10 10:10 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML

On 11/3/19 10:17 PM, John Hubbard wrote:
> After DMA is complete, and the device and CPU caches are synchronized,
> it's still required to mark the CPU pages as dirty, if the data was
> coming from the device. However, this driver was just issuing a
> bare put_page() call, without any set_page_dirty*() call.
> 
> Fix the problem, by calling set_page_dirty_lock() if the CPU pages
> were potentially receiving data from the device.
> 
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>

Looks good, thanks!

	Hans

> ---
>  drivers/media/v4l2-core/videobuf-dma-sg.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
> index 66a6c6c236a7..28262190c3ab 100644
> --- a/drivers/media/v4l2-core/videobuf-dma-sg.c
> +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
> @@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
>  	BUG_ON(dma->sglen);
>  
>  	if (dma->pages) {
> -		for (i = 0; i < dma->nr_pages; i++)
> +		for (i = 0; i < dma->nr_pages; i++) {
> +			if (dma->direction == DMA_FROM_DEVICE)
> +				set_page_dirty_lock(dma->pages[i]);
>  			put_page(dma->pages[i]);
> +		}
>  		kfree(dma->pages);
>  		dma->pages = NULL;
>  	}
> 



* Re: [PATCH v2 13/18] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion
  2019-11-03 21:18 ` [PATCH v2 13/18] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
@ 2019-11-10 10:11   ` Hans Verkuil
  0 siblings, 0 replies; 57+ messages in thread
From: Hans Verkuil @ 2019-11-10 10:11 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML

On 11/3/19 10:18 PM, John Hubbard wrote:
> 1. Change v4l2 from get_user_pages(FOLL_LONGTERM), to
> pin_longterm_pages(), which sets both FOLL_LONGTERM and FOLL_PIN.
> 
> 2. Because all FOLL_PIN-acquired pages must be released via
> put_user_page(), also convert the put_page() call over to
> put_user_pages_dirty_lock().
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>

Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>

Looks good, thanks!

	Hans


> ---
>  drivers/media/v4l2-core/videobuf-dma-sg.c | 13 +++++--------
>  1 file changed, 5 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
> index 28262190c3ab..9b9c5b37bf59 100644
> --- a/drivers/media/v4l2-core/videobuf-dma-sg.c
> +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
> @@ -183,12 +183,12 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
>  	dprintk(1, "init user [0x%lx+0x%lx => %d pages]\n",
>  		data, size, dma->nr_pages);
>  
> -	err = get_user_pages(data & PAGE_MASK, dma->nr_pages,
> -			     flags | FOLL_LONGTERM, dma->pages, NULL);
> +	err = pin_longterm_pages(data & PAGE_MASK, dma->nr_pages,
> +				 flags, dma->pages, NULL);
>  
>  	if (err != dma->nr_pages) {
>  		dma->nr_pages = (err >= 0) ? err : 0;
> -		dprintk(1, "get_user_pages: err=%d [%d]\n", err,
> +		dprintk(1, "pin_longterm_pages: err=%d [%d]\n", err,
>  			dma->nr_pages);
>  		return err < 0 ? err : -EINVAL;
>  	}
> @@ -349,11 +349,8 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
>  	BUG_ON(dma->sglen);
>  
>  	if (dma->pages) {
> -		for (i = 0; i < dma->nr_pages; i++) {
> -			if (dma->direction == DMA_FROM_DEVICE)
> -				set_page_dirty_lock(dma->pages[i]);
> -			put_page(dma->pages[i]);
> -		}
> +		put_user_pages_dirty_lock(dma->pages, dma->nr_pages,
> +					  dma->direction == DMA_FROM_DEVICE);
>  		kfree(dma->pages);
>  		dma->pages = NULL;
>  	}
> 
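
The second hunk of the quoted patch folds the per-page "dirty, then put" loop
into a single batched call that takes a make_dirty argument. A small userspace
model of that shape is sketched below. struct sk_page and
sk_put_pages_dirty() are hypothetical stand-ins; the sketch models only the
ordering (mark dirty before dropping the reference) and none of the page
locking implied by the _lock suffix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: just a reference count and a dirty bit. */
struct sk_page {
    int  refcount;
    bool dirty;
};

/* Release npages pinned pages. If make_dirty is true, each page is
 * marked dirty before its reference is dropped, mirroring the order
 * of the open-coded loop that the patch removes. */
static void sk_put_pages_dirty(struct sk_page **pages, unsigned long npages,
                               bool make_dirty)
{
    for (unsigned long i = 0; i < npages; i++) {
        if (make_dirty)
            pages[i]->dirty = true;
        assert(pages[i]->refcount > 0);   /* must still hold a reference */
        pages[i]->refcount--;
    }
}
```

In the patch, the make_dirty argument is the expression
dma->direction == DMA_FROM_DEVICE, so pages are dirtied only when the device
may have written to them.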



* Re: [PATCH v2 04/18] media/v4l2-core: set pages dirty upon releasing DMA buffers
  2019-11-10 10:10   ` Hans Verkuil
@ 2019-11-11 21:46     ` John Hubbard
  0 siblings, 0 replies; 57+ messages in thread
From: John Hubbard @ 2019-11-11 21:46 UTC (permalink / raw)
  To: Hans Verkuil, Andrew Morton
  Cc: Al Viro, Alex Williamson, Benjamin Herrenschmidt,
	Björn Töpel, Christoph Hellwig, Dan Williams,
	Daniel Vetter, Dave Chinner, David Airlie, David S . Miller,
	Ira Weiny, Jan Kara, Jason Gunthorpe, Jens Axboe,
	Jonathan Corbet, Jérôme Glisse, Magnus Karlsson,
	Mauro Carvalho Chehab, Michael Ellerman, Michal Hocko,
	Mike Kravetz, Paul Mackerras, Shuah Khan, Vlastimil Babka, bpf,
	dri-devel, kvm, linux-block, linux-doc, linux-fsdevel,
	linux-kselftest, linux-media, linux-rdma, linuxppc-dev, netdev,
	linux-mm, LKML

On 11/10/19 2:10 AM, Hans Verkuil wrote:
> On 11/3/19 10:17 PM, John Hubbard wrote:
>> After DMA is complete, and the device and CPU caches are synchronized,
>> it's still required to mark the CPU pages as dirty, if the data was
>> coming from the device. However, this driver was just issuing a
>> bare put_page() call, without any set_page_dirty*() call.
>>
>> Fix the problem, by calling set_page_dirty_lock() if the CPU pages
>> were potentially receiving data from the device.
>>
>> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> 
> Acked-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
> 
> Looks good, thanks!
> 

Hi Hans, it's great that you could take a look at this and the other v4l2 
patch, much appreciated.


thanks,
-- 
John Hubbard
NVIDIA
>> ---
>>  drivers/media/v4l2-core/videobuf-dma-sg.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
>> index 66a6c6c236a7..28262190c3ab 100644
>> --- a/drivers/media/v4l2-core/videobuf-dma-sg.c
>> +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
>> @@ -349,8 +349,11 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
>>  	BUG_ON(dma->sglen);
>>  
>>  	if (dma->pages) {
>> -		for (i = 0; i < dma->nr_pages; i++)
>> +		for (i = 0; i < dma->nr_pages; i++) {
>> +			if (dma->direction == DMA_FROM_DEVICE)
>> +				set_page_dirty_lock(dma->pages[i]);
>>  			put_page(dma->pages[i]);
>> +		}
>>  		kfree(dma->pages);
>>  		dma->pages = NULL;
>>  	}
>>
> 


Thread overview: 57+ messages
2019-11-03 21:17 [PATCH v2 00/18] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM John Hubbard
2019-11-03 21:17 ` [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions John Hubbard
2019-11-04 16:39   ` Jerome Glisse
2019-11-03 21:17 ` [PATCH v2 02/18] mm/gup: factor out duplicate code from four routines John Hubbard
2019-11-04 16:51   ` Jerome Glisse
2019-11-03 21:17 ` [PATCH v2 03/18] goldish_pipe: rename local pin_user_pages() routine John Hubbard
2019-11-04 16:52   ` Jerome Glisse
2019-11-03 21:17 ` [PATCH v2 04/18] media/v4l2-core: set pages dirty upon releasing DMA buffers John Hubbard
2019-11-10 10:10   ` Hans Verkuil
2019-11-11 21:46     ` John Hubbard
2019-11-03 21:18 ` [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN John Hubbard
2019-11-04 17:33   ` Jerome Glisse
2019-11-04 19:04     ` John Hubbard
2019-11-04 19:18       ` Jerome Glisse
2019-11-04 19:30         ` John Hubbard
2019-11-04 19:52           ` Jerome Glisse
2019-11-04 20:09             ` John Hubbard
2019-11-04 20:31               ` Jason Gunthorpe
2019-11-04 20:40                 ` John Hubbard
2019-11-04 20:31               ` Jerome Glisse
2019-11-04 20:37                 ` Jason Gunthorpe
2019-11-04 20:57                   ` John Hubbard
2019-11-04 21:15                     ` Jason Gunthorpe
2019-11-04 21:34                       ` John Hubbard
2019-11-04 20:33   ` David Rientjes
2019-11-04 20:48     ` Jerome Glisse
2019-11-05 13:10   ` Mike Rapoport
2019-11-05 19:00     ` John Hubbard
2019-11-07  2:25       ` Ira Weiny
2019-11-07  8:07       ` Mike Rapoport
2019-11-03 21:18 ` [PATCH v2 06/18] goldish_pipe: convert to pin_user_pages() and put_user_page() John Hubbard
2019-11-03 21:18 ` [PATCH v2 07/18] infiniband: set FOLL_PIN, FOLL_LONGTERM via pin_longterm_pages*() John Hubbard
2019-11-04 20:33   ` Jason Gunthorpe
2019-11-04 20:48     ` John Hubbard
2019-11-04 20:57       ` Jason Gunthorpe
2019-11-04 22:03         ` John Hubbard
2019-11-05  2:32           ` Jason Gunthorpe
2019-11-07  2:26         ` Ira Weiny
2019-11-03 21:18 ` [PATCH v2 08/18] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() John Hubbard
2019-11-04 17:41   ` Jerome Glisse
2019-11-03 21:18 ` [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast() John Hubbard
2019-11-04 17:44   ` Jerome Glisse
2019-11-04 18:22     ` John Hubbard
2019-11-03 21:18 ` [PATCH v2 10/18] fs/io_uring: set FOLL_PIN via pin_user_pages() John Hubbard
2019-11-03 21:18 ` [PATCH v2 11/18] net/xdp: " John Hubbard
2019-11-03 21:18 ` [PATCH v2 12/18] mm/gup: track FOLL_PIN pages John Hubbard
2019-11-04 18:52   ` Jerome Glisse
2019-11-04 22:49     ` John Hubbard
2019-11-04 23:49       ` Jerome Glisse
2019-11-05  0:18         ` John Hubbard
2019-11-03 21:18 ` [PATCH v2 13/18] media/v4l2-core: pin_longterm_pages (FOLL_PIN) and put_user_page() conversion John Hubbard
2019-11-10 10:11   ` Hans Verkuil
2019-11-03 21:18 ` [PATCH v2 14/18] vfio, mm: " John Hubbard
2019-11-03 21:18 ` [PATCH v2 15/18] powerpc: book3s64: convert to pin_longterm_pages() and put_user_page() John Hubbard
2019-11-03 21:18 ` [PATCH v2 16/18] mm/gup_benchmark: support pin_user_pages() and related calls John Hubbard
2019-11-03 21:18 ` [PATCH v2 17/18] selftests/vm: run_vmtests: invoke gup_benchmark with basic FOLL_PIN coverage John Hubbard
2019-11-03 21:18 ` [PATCH v2 18/18] mm/gup: remove support for gup(FOLL_LONGTERM) John Hubbard
