All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, alex.williamson@redhat.com,
	aneesh.kumar@linux.ibm.com, axboe@kernel.dk,
	bjorn.topel@intel.com, corbet@lwn.net, dan.j.williams@intel.com,
	daniel.vetter@ffwll.ch, hch@lst.de, hverkuil-cisco@xs4all.nl,
	ira.weiny@intel.com, jack@suse.cz, jgg@mellanox.com,
	jgg@ziepe.ca, jglisse@redhat.com, jhubbard@nvidia.com,
	kirill@shutemov.name, leonro@mellanox.com, linux-mm@kvack.org,
	mchehab@kernel.org, mm-commits@vger.kernel.org,
	rppt@linux.ibm.com, torvalds@linux-foundation.org
Subject: [patch 034/118] mm/gup: introduce pin_user_pages*() and FOLL_PIN
Date: Thu, 30 Jan 2020 22:12:54 -0800	[thread overview]
Message-ID: <20200131061254.ou_9FoBzx%akpm@linux-foundation.org> (raw)
In-Reply-To: <20200130221021.5f0211c56346d5485af07923@linux-foundation.org>

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: introduce pin_user_pages*() and FOLL_PIN

Introduce pin_user_pages*() variations of get_user_pages*() calls, and
also pin_longterm_pages*() variations.

For now, these are placeholder calls, until the various call sites are
converted to use the correct get_user_pages*() or pin_user_pages*() API.

These variants will eventually all set FOLL_PIN, which is also introduced,
and thoroughly documented.

    pin_user_pages()
    pin_user_pages_remote()
    pin_user_pages_fast()

All pages that are pinned via the above calls, must be unpinned via
put_user_page().

The underlying rules are:

* FOLL_PIN is a gup-internal flag, so the call sites should not directly
  set it.  That behavior is enforced with assertions.

* Call sites that want to indicate that they are going to do DirectIO
  ("DIO") or something with similar characteristics, should call a
  get_user_pages()-like wrapper call that sets FOLL_PIN.  These wrappers
  will:

    * Start with "pin_user_pages" instead of "get_user_pages".  That
      makes it easy to find and audit the call sites.

    * Set FOLL_PIN

* For pages that are received via FOLL_PIN, those pages must be returned
  via put_user_page().

Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in this
documentation.  (I've reworded it and expanded upon it.)

Link: http://lkml.kernel.org/r/20200107224558.2362728-12-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>		[Documentation]
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/index.rst          |    1 
 Documentation/core-api/pin_user_pages.rst |  232 ++++++++++++++++++++
 include/linux/mm.h                        |   63 ++++-
 mm/gup.c                                  |  164 ++++++++++++--
 4 files changed, 426 insertions(+), 34 deletions(-)

--- a/Documentation/core-api/index.rst~mm-gup-introduce-pin_user_pages-and-foll_pin
+++ a/Documentation/core-api/index.rst
@@ -31,6 +31,7 @@ Core utilities
    generic-radix-tree
    memory-allocation
    mm-api
+   pin_user_pages
    gfp_mask-from-fs-io
    timekeeping
    boot-time-mm
--- /dev/null
+++ a/Documentation/core-api/pin_user_pages.rst
@@ -0,0 +1,232 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+pin_user_pages() and related calls
+====================================================
+
+.. contents:: :local:
+
+Overview
+========
+
+This document describes the following functions::
+
+ pin_user_pages()
+ pin_user_pages_fast()
+ pin_user_pages_remote()
+
+Basic description of FOLL_PIN
+=============================
+
+FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
+("gup") family of functions. FOLL_PIN has significant interactions and
+interdependencies with FOLL_LONGTERM, so both are covered here.
+
+FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
+sites. This allows the associated wrapper functions  (pin_user_pages*() and
+others) to set the correct combination of these flags, and to check for problems
+as well.
+
+FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
+This is in order to avoid creating a large number of wrapper functions to cover
+all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
+pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
+that's a natural dividing line, and a good point to make separate wrapper calls.
+In other words, use pin_user_pages*() for DMA-pinned pages, and
+get_user_pages*() for other cases. There are four cases described later on in
+this document, to further clarify that concept.
+
+FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
+multiple threads and call sites are free to pin the same struct pages, via both
+FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
+other, not the struct page(s).
+
+The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
+uses a different reference counting technique.
+
+FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
+FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.
+
+Which flags are set by each wrapper
+===================================
+
+For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
+flags the caller provides. The caller is required to pass in a non-null struct
+pages* array, and the function then pin pages by incrementing each by a special
+value. For now, that value is +1, just like get_user_pages*().::
+
+ Function
+ --------
+ pin_user_pages          FOLL_PIN is always set internally by this function.
+ pin_user_pages_fast     FOLL_PIN is always set internally by this function.
+ pin_user_pages_remote   FOLL_PIN is always set internally by this function.
+
+For these get_user_pages*() functions, FOLL_GET might not even be specified.
+Behavior is a little more complex than above. If FOLL_GET was *not* specified,
+but the caller passed in a non-null struct pages* array, then the function
+sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
+of each page by +1.::
+
+ Function
+ --------
+ get_user_pages           FOLL_GET is sometimes set internally by this function.
+ get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
+ get_user_pages_remote    FOLL_GET is sometimes set internally by this function.
+
+Tracking dma-pinned pages
+=========================
+
+Some of the key design constraints, and solutions, for tracking dma-pinned
+pages:
+
+* An actual reference count, per struct page, is required. This is because
+  multiple processes may pin and unpin a page.
+
+* False positives (reporting that a page is dma-pinned, when in fact it is not)
+  are acceptable, but false negatives are not.
+
+* struct page may not be increased in size for this, and all fields are already
+  used.
+
+* Given the above, we can overload the page->_refcount field by using, sort of,
+  the upper bits in that field for a dma-pinned count. "Sort of", means that,
+  rather than dividing page->_refcount into bit fields, we simple add a medium-
+  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
+  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
+  on it 1024 times, then it will appear to have a single dma-pinned count.
+  And again, that's acceptable.
+
+This also leads to limitations: there are only 31-10==21 bits available for a
+counter that increments 10 bits at a time.
+
+TODO: for 1GB and larger huge pages, this is cutting it close. That's because
+when pin_user_pages() follows such pages, it increments the head page by "1"
+(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
+pin_user_pages()) for each tail page. So if you have a 1GB huge page:
+
+* There are 256K (18 bits) worth of 4 KB tail pages.
+* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
+  10 bits at a time)
+* There are 21 - 18 == 3 bits available to count. Except that there aren't,
+  because you need to allow for a few normal get_page() calls on the head page,
+  as well. Fortunately, the approach of using addition, rather than "hard"
+  bitfields, within page->_refcount, allows for sharing these bits gracefully.
+  But we're still looking at about 8 references.
+
+This, however, is a missing feature more than anything else, because it's easily
+solved by addressing an obvious inefficiency in the original get_user_pages()
+approach of retrieving pages: stop treating all the pages as if they were
+PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
+this, so some work is required. Once that's in place, this limitation mostly
+disappears from view, because there will be ample refcounting range available.
+
+* Callers must specifically request "dma-pinned tracking of pages". In other
+  words, just calling get_user_pages() will not suffice; a new set of functions,
+  pin_user_page() and related, must be used.
+
+FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
+==========================================================
+
+Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
+these categories:
+
+CASE 1: Direct IO (DIO)
+-----------------------
+There are GUP references to pages that are serving
+as DIO buffers. These buffers are needed for a relatively short time (so they
+are not "long term"). No special synchronization with page_mkclean() or
+munmap() is provided. Therefore, flags to set at the call site are: ::
+
+    FOLL_PIN
+
+...but rather than setting FOLL_PIN directly, call sites should use one of
+the pin_user_pages*() routines that set FOLL_PIN.
+
+CASE 2: RDMA
+------------
+There are GUP references to pages that are serving as DMA
+buffers. These buffers are needed for a long time ("long term"). No special
+synchronization with page_mkclean() or munmap() is provided. Therefore, flags
+to set at the call site are: ::
+
+    FOLL_PIN | FOLL_LONGTERM
+
+NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
+because DAX pages do not have a separate page cache, and so "pinning" implies
+locking down file system blocks, which is not (yet) supported in that way.
+
+CASE 3: Hardware with page faulting support
+-------------------------------------------
+Here, a well-written driver doesn't normally need to pin pages at all. However,
+if the driver does choose to do so, it can register MMU notifiers for the range,
+and will be called back upon invalidation. Either way (avoiding page pinning, or
+using MMU notifiers to unpin upon request), there is proper synchronization with
+both filesystem and mm (page_mkclean(), munmap(), etc).
+
+Therefore, neither flag needs to be set.
+
+In this case, ideally, neither get_user_pages() nor pin_user_pages() should be
+called. Instead, the software should be written so that it does not pin pages.
+This allows mm and filesystems to operate more efficiently and reliably.
+
+CASE 4: Pinning for struct page manipulation only
+-------------------------------------------------
+Here, normal GUP calls are sufficient, so neither flag needs to be set.
+
+page_dma_pinned(): the whole point of pinning
+=============================================
+
+The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
+to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
+(and file system writeback code in general) to make informed decisions about
+what to do when a page cannot be unmapped due to such pins.
+
+What to do in those cases is the subject of a years-long series of discussions
+and debates (see the References at the end of this document). It's a TODO item
+here: fill in the details once that's worked out. Meanwhile, it's safe to say
+that having this available: ::
+
+        static inline bool page_dma_pinned(struct page *page)
+
+...is a prerequisite to solving the long-running gup+DMA problem.
+
+Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
+===================================================================
+
+Another way of thinking about these flags is as a progression of restrictions:
+FOLL_GET is for struct page manipulation, without affecting the data that the
+struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
+short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
+a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
+restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
+will be pinned longterm, and whose data will be accessed.
+
+Unit testing
+============
+This file::
+
+ tools/testing/selftests/vm/gup_benchmark.c
+
+has the following new calls to exercise the new pin*() wrapper functions:
+
+* PIN_FAST_BENCHMARK (./gup_benchmark -a)
+* PIN_BENCHMARK (./gup_benchmark -b)
+
+You can monitor how many total dma-pinned pages have been acquired and released
+since the system was booted, via two new /proc/vmstat entries: ::
+
+    /proc/vmstat/nr_foll_pin_requested
+    /proc/vmstat/nr_foll_pin_requested
+
+Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
+because there is a noticeable performance drop in put_user_page(), when they
+are activated.
+
+References
+==========
+
+* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
+* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
+* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
+
+John Hubbard, October, 2019
--- a/include/linux/mm.h~mm-gup-introduce-pin_user_pages-and-foll_pin
+++ a/include/linux/mm.h
@@ -1042,16 +1042,14 @@ static inline void put_page(struct page
  * put_user_page() - release a gup-pinned page
  * @page:            pointer to page to be released
  *
- * Pages that were pinned via get_user_pages*() must be released via
- * either put_user_page(), or one of the put_user_pages*() routines
- * below. This is so that eventually, pages that are pinned via
- * get_user_pages*() can be separately tracked and uniquely handled. In
- * particular, interactions with RDMA and filesystems need special
- * handling.
+ * Pages that were pinned via pin_user_pages*() must be released via either
+ * put_user_page(), or one of the put_user_pages*() routines. This is so that
+ * eventually such pages can be separately tracked and uniquely handled. In
+ * particular, interactions with RDMA and filesystems need special handling.
  *
  * put_user_page() and put_page() are not interchangeable, despite this early
  * implementation that makes them look the same. put_user_page() calls must
- * be perfectly matched up with get_user_page() calls.
+ * be perfectly matched up with pin*() calls.
  */
 static inline void put_user_page(struct page *page)
 {
@@ -1509,9 +1507,16 @@ long get_user_pages_remote(struct task_s
 			    unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas, int *locked);
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked);
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas);
 long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages, int *locked);
 long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
@@ -1519,6 +1524,8 @@ long get_user_pages_unlocked(unsigned lo
 
 int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages);
 
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
@@ -2583,13 +2590,15 @@ struct page *follow_page(struct vm_area_
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_PIN	0x40000	/* pages must be released via put_user_page() */
 
 /*
- * NOTE on FOLL_LONGTERM:
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
  *
  * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control.  This is contrasted with
- * iov_iter_get_pages() where usages which are transient.
+ * period _often_ under userspace control.  This is in contrast to
+ * iov_iter_get_pages(), whose usages are transient.
  *
  * FIXME: For pages which are part of a filesystem, mappings are subject to the
  * lifetime enforced by the filesystem and we need guarantees that longterm
@@ -2604,11 +2613,39 @@ struct page *follow_page(struct vm_area_
  * Currently only get_user_pages() and get_user_pages_fast() support this flag
  * and calls to get_user_pages_[un]locked are specifically not allowed.  This
  * is due to an incompatibility with the FS DAX check and
- * FAULT_FLAG_ALLOW_RETRY
+ * FAULT_FLAG_ALLOW_RETRY.
  *
- * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
- * that region.  And so CMA attempts to migrate the page before pinning when
+ * In the CMA case: long term pins in a CMA region would unnecessarily fragment
+ * that region.  And so, CMA attempts to migrate the page before pinning, when
  * FOLL_LONGTERM is specified.
+ *
+ * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
+ * but an additional pin counting system) will be invoked. This is intended for
+ * anything that gets a page reference and then touches page data (for example,
+ * Direct IO). This lets the filesystem know that some non-file-system entity is
+ * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
+ * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
+ * a call to put_user_page().
+ *
+ * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
+ * and separate refcounting mechanisms, however, and that means that each has
+ * its own acquire and release mechanisms:
+ *
+ *     FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
+ *
+ *     FOLL_PIN: pin_user_pages*() to acquire, and put_user_pages to release.
+ *
+ * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
+ * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
+ * calls applied to them, and that's perfectly OK. This is a constraint on the
+ * callers, not on the pages.)
+ *
+ * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
+ * directly by the caller. That's in order to help avoid mismatches when
+ * releasing pages: get_user_pages*() pages must be released via put_page(),
+ * while pin_user_pages*() pages must be released via put_user_page().
+ *
+ * Please see Documentation/vm/pin_user_pages.rst for more information.
  */
 
 static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
--- a/mm/gup.c~mm-gup-introduce-pin_user_pages-and-foll_pin
+++ a/mm/gup.c
@@ -194,6 +194,10 @@ static struct page *follow_page_pte(stru
 	spinlock_t *ptl;
 	pte_t *ptep, pte;
 
+	/* FOLL_GET and FOLL_PIN are mutually exclusive. */
+	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
+			 (FOLL_PIN | FOLL_GET)))
+		return ERR_PTR(-EINVAL);
 retry:
 	if (unlikely(pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
@@ -811,7 +815,7 @@ static long __get_user_pages(struct task
 
 	start = untagged_addr(start);
 
-	VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
+	VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
 
 	/*
 	 * If FOLL_FORCE is set then do not force a full fault as the hinting
@@ -1035,7 +1039,16 @@ static __always_inline long __get_user_p
 		BUG_ON(*locked != 1);
 	}
 
-	if (pages)
+	/*
+	 * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
+	 * is to set FOLL_GET if the caller wants pages[] filled in (but has
+	 * carelessly failed to specify FOLL_GET), so keep doing that, but only
+	 * for FOLL_GET, not for the newer FOLL_PIN.
+	 *
+	 * FOLL_PIN always expects pages to be non-null, but no need to assert
+	 * that here, as any failures will be obvious enough.
+	 */
+	if (pages && !(flags & FOLL_PIN))
 		flags |= FOLL_GET;
 
 	pages_done = 0;
@@ -1606,12 +1619,20 @@ static __always_inline long __gup_longte
  * should use get_user_pages because it cannot pass
  * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault.
  */
+#ifdef CONFIG_MMU
 long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *locked)
 {
 	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
+	/*
 	 * Parts of FOLL_LONGTERM behavior are incompatible with
 	 * FAULT_FLAG_ALLOW_RETRY because of the FS DAX check requirement on
 	 * vmas. However, this only comes up if locked is set, and there are
@@ -1636,6 +1657,16 @@ long get_user_pages_remote(struct task_s
 }
 EXPORT_SYMBOL(get_user_pages_remote);
 
+#else /* CONFIG_MMU */
+long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked)
+{
+	return 0;
+}
+#endif /* !CONFIG_MMU */
+
 /*
  * This is the same as get_user_pages_remote(), just with a
  * less-flexible calling convention where we assume that the task
@@ -1647,6 +1678,13 @@ long get_user_pages(unsigned long start,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas)
 {
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
 	return __gup_longterm_locked(current, current->mm, start, nr_pages,
 				     pages, vmas, gup_flags | FOLL_TOUCH);
 }
@@ -2389,30 +2427,15 @@ static int __gup_longterm_unlocked(unsig
 	return ret;
 }
 
-/**
- * get_user_pages_fast() - pin user pages in memory
- * @start:	starting user address
- * @nr_pages:	number of pages from start to pin
- * @gup_flags:	flags modifying pin behaviour
- * @pages:	array that receives pointers to the pages pinned.
- *		Should be at least nr_pages long.
- *
- * Attempt to pin user pages in memory without taking mm->mmap_sem.
- * If not successful, it will fall back to taking the lock and
- * calling get_user_pages().
- *
- * Returns number of pages pinned. This may be fewer than the number
- * requested. If nr_pages is 0 or negative, returns 0. If no pages
- * were pinned, returns -errno.
- */
-int get_user_pages_fast(unsigned long start, int nr_pages,
-			unsigned int gup_flags, struct page **pages)
+static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
+					unsigned int gup_flags,
+					struct page **pages)
 {
 	unsigned long addr, len, end;
 	int nr = 0, ret = 0;
 
 	if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
-				       FOLL_FORCE)))
+				       FOLL_FORCE | FOLL_PIN)))
 		return -EINVAL;
 
 	start = untagged_addr(start) & PAGE_MASK;
@@ -2452,4 +2475,103 @@ int get_user_pages_fast(unsigned long st
 
 	return ret;
 }
+
+/**
+ * get_user_pages_fast() - pin user pages in memory
+ * @start:	starting user address
+ * @nr_pages:	number of pages from start to pin
+ * @gup_flags:	flags modifying pin behaviour
+ * @pages:	array that receives pointers to the pages pinned.
+ *		Should be at least nr_pages long.
+ *
+ * Attempt to pin user pages in memory without taking mm->mmap_sem.
+ * If not successful, it will fall back to taking the lock and
+ * calling get_user_pages().
+ *
+ * Returns number of pages pinned. This may be fewer than the number requested.
+ * If nr_pages is 0 or negative, returns 0. If no pages were pinned, returns
+ * -errno.
+ */
+int get_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return -EINVAL;
+
+	return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
+
+/**
+ * pin_user_pages_fast() - pin user pages in memory without taking locks
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages_fast().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+int pin_user_pages_fast(unsigned long start, int nr_pages,
+			unsigned int gup_flags, struct page **pages)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast);
+
+/**
+ * pin_user_pages_remote() - pin pages of a remote process (task != current)
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages_remote().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm,
+			   unsigned long start, unsigned long nr_pages,
+			   unsigned int gup_flags, struct page **pages,
+			   struct vm_area_struct **vmas, int *locked)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, pages,
+				     vmas, locked);
+}
+EXPORT_SYMBOL(pin_user_pages_remote);
+
+/**
+ * pin_user_pages() - pin user pages in memory for use by other devices
+ *
+ * For now, this is a placeholder function, until various call sites are
+ * converted to use the correct get_user_pages*() or pin_user_pages*() API. So,
+ * this is identical to get_user_pages().
+ *
+ * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. It
+ * is NOT intended for Case 2 (RDMA: long-term pins).
+ */
+long pin_user_pages(unsigned long start, unsigned long nr_pages,
+		    unsigned int gup_flags, struct page **pages,
+		    struct vm_area_struct **vmas)
+{
+	/*
+	 * This is a placeholder, until the pin functionality is activated.
+	 * Until then, just behave like the corresponding get_user_pages*()
+	 * routine.
+	 */
+	return get_user_pages(start, nr_pages, gup_flags, pages, vmas);
+}
+EXPORT_SYMBOL(pin_user_pages);
_

  parent reply	other threads:[~2020-01-31  6:12 UTC|newest]

Thread overview: 147+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-31  6:10 incoming Andrew Morton
2020-01-31  6:11 ` [patch 001/118] lib/test_bitmap: correct test data offsets for 32-bit Andrew Morton
2020-01-31  6:11 ` [patch 002/118] memcg: fix a crash in wb_workfn when a device disappears Andrew Morton
2020-01-31  6:11 ` [patch 003/118] mm/mempolicy.c: fix out of bounds write in mpol_parse_str() Andrew Morton
2020-01-31  6:11   ` Andrew Morton
2020-01-31  6:11 ` [patch 004/118] mm/sparse.c: reset section's mem_map when fully deactivated Andrew Morton
2020-01-31  6:11 ` Andrew Morton
2020-01-31  6:11 ` [patch 005/118] mm/migrate.c: also overwrite error when it is bigger than zero Andrew Morton
2020-01-31  6:11 ` [patch 006/118] mm/memory_hotplug: fix remove_memory() lockdep splat Andrew Morton
2020-01-31  6:11   ` Andrew Morton
2020-01-31  6:11 ` [patch 007/118] mm: thp: don't need care deferred split queue in memcg charge move path Andrew Morton
2020-01-31  6:11   ` Andrew Morton
2020-01-31  6:11 ` [patch 008/118] mm: move_pages: report the number of non-attempted pages Andrew Morton
2020-01-31  6:11 ` [patch 009/118] scripts/spelling.txt: add more spellings to spelling.txt Andrew Morton
2020-01-31  6:11 ` [patch 010/118] scripts/spelling.txt: add "issus" typo Andrew Morton
2020-01-31  6:11 ` [patch 011/118] fs: ocfs: remove unnecessary assertion in dlm_migrate_lockres Andrew Morton
2020-01-31  6:11   ` Andrew Morton
2020-01-31  6:11 ` [patch 012/118] ocfs2: remove unneeded semicolons Andrew Morton
2020-01-31  6:11 ` [patch 013/118] ocfs2: make local header paths relative to C files Andrew Morton
2020-01-31  6:11   ` Andrew Morton
2020-01-31  6:11 ` [patch 014/118] ocfs2/dlm: remove redundant assignment to ret Andrew Morton
2020-01-31  6:11 ` [patch 015/118] ocfs2/dlm: move BITS_TO_BYTES() to bitops.h for wider use Andrew Morton
2020-01-31  6:11   ` Andrew Morton
2020-01-31  6:11 ` [patch 016/118] ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans() Andrew Morton
2020-01-31  6:11 ` [patch 017/118] ocfs2: use ocfs2_update_inode_fsync_trans() to access t_tid in handle->h_transaction Andrew Morton
2020-01-31  6:11 ` [patch 018/118] mm/slub.c: avoid slub allocation while holding list_lock Andrew Morton
2020-02-03 17:25   ` Christopher Lameter
2020-01-31  6:12 ` [patch 019/118] mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t Andrew Morton
2020-01-31  6:12 ` [patch 020/118] mm/debug.c: always print flags in dump_page() Andrew Morton
2020-01-31  6:12 ` [patch 021/118] mm/filemap.c: clean up filemap_write_and_wait() Andrew Morton
2020-01-31  6:12   ` Andrew Morton
2020-01-31  6:12 ` [patch 022/118] mm: fix gup_pud_range Andrew Morton
2020-01-31  6:12 ` [patch 023/118] mm/gup.c: use is_vm_hugetlb_page() to check whether to follow huge Andrew Morton
2020-01-31  6:12 ` [patch 024/118] mm/gup: factor out duplicate code from four routines Andrew Morton
2020-01-31  6:12 ` [patch 025/118] mm/gup: move try_get_compound_head() to top, fix minor issues Andrew Morton
2020-01-31  6:12   ` Andrew Morton
2020-01-31  6:12 ` [patch 026/118] mm: Cleanup __put_devmap_managed_page() vs ->page_free() Andrew Morton
2020-01-31  6:12 ` [patch 027/118] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages Andrew Morton
2020-01-31  6:12 ` [patch 028/118] goldish_pipe: rename local pin_user_pages() routine Andrew Morton
2020-01-31  6:12 ` [patch 029/118] mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM Andrew Morton
2020-01-31  6:12 ` [patch 030/118] vfio: fix FOLL_LONGTERM use, simplify get_user_pages_remote() call Andrew Morton
2020-01-31  6:12   ` Andrew Morton
2020-01-31  6:12 ` [patch 031/118] mm/gup: allow FOLL_FORCE for get_user_pages_fast() Andrew Morton
2020-01-31  6:12 ` [patch 032/118] IB/umem: use get_user_pages_fast() to pin DMA pages Andrew Morton
2020-01-31  6:12   ` Andrew Morton
2020-01-31  6:12 ` [patch 033/118] media/v4l2-core: set pages dirty upon releasing DMA buffers Andrew Morton
2020-01-31  6:12 ` Andrew Morton [this message]
2020-01-31  6:12 ` [patch 035/118] goldish_pipe: convert to pin_user_pages() and put_user_page() Andrew Morton
2020-01-31  6:12   ` Andrew Morton
2020-01-31  6:13 ` [patch 036/118] IB/{core,hw,umem}: set FOLL_PIN via pin_user_pages*(), fix up ODP Andrew Morton
2020-01-31  6:13 ` [patch 037/118] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote() Andrew Morton
2020-01-31  6:13 ` [patch 038/118] drm/via: set FOLL_PIN via pin_user_pages_fast() Andrew Morton
2020-01-31  6:13 ` [patch 039/118] fs/io_uring: set FOLL_PIN via pin_user_pages() Andrew Morton
2020-01-31  6:13 ` [patch 040/118] net/xdp: " Andrew Morton
2020-01-31  6:13 ` [patch 041/118] media/v4l2-core: pin_user_pages (FOLL_PIN) and put_user_page() conversion Andrew Morton
2020-01-31  6:13 ` [patch 042/118] vfio, mm: " Andrew Morton
2020-01-31  6:13 ` [patch 043/118] powerpc: book3s64: convert to pin_user_pages() and put_user_page() Andrew Morton
2020-01-31  6:13 ` [patch 044/118] mm/gup_benchmark: use proper FOLL_WRITE flags instead of hard-coding "1" Andrew Morton
2020-01-31  6:13 ` [patch 045/118] mm, tree-wide: rename put_user_page*() to unpin_user_page*() Andrew Morton
2020-01-31  6:13 ` [patch 046/118] mm/swapfile.c: swap_next should increase position index Andrew Morton
2020-01-31  6:13 ` [patch 047/118] mm/memcontrol.c: cleanup some useless code Andrew Morton
2020-01-31  6:13 ` [patch 048/118] mm/page_vma_mapped.c: explicitly compare pfn for normal, hugetlbfs and THP page Andrew Morton
2020-01-31  6:13 ` [patch 049/118] mm, tracing: print symbol name for kmem_alloc_node call_site events Andrew Morton
2020-01-31  6:13 ` [patch 050/118] lib/test_kasan.c: fix memory leak in kmalloc_oob_krealloc_more() Andrew Morton
2020-01-31  6:13 ` [patch 051/118] mm/early_ioremap.c: use %pa to print resource_size_t variables Andrew Morton
2020-01-31  6:13 ` [patch 052/118] mm/page_alloc: skip non present sections on zone initialization Andrew Morton
2020-01-31  6:14 ` [patch 053/118] mm: remove the memory isolate notifier Andrew Morton
2020-01-31  6:14   ` Andrew Morton
2020-01-31  6:14 ` [patch 054/118] mm: remove "count" parameter from has_unmovable_pages() Andrew Morton
2020-01-31  6:14 ` [patch 055/118] mm/vmscan.c: remove unused return value of shrink_node Andrew Morton
2020-01-31  6:14   ` Andrew Morton
2020-01-31  6:14 ` [patch 056/118] mm/vmscan: remove prefetch_prev_lru_page Andrew Morton
2020-01-31  6:14   ` Andrew Morton
2020-01-31  6:14 ` [patch 057/118] mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE Andrew Morton
2020-01-31  6:14 ` [patch 058/118] tools/vm/slabinfo: fix sanity checks enabling Andrew Morton
2020-01-31  6:14 ` [patch 059/118] mm/memblock: define memblock_physmem_add() Andrew Morton
2020-01-31  6:14 ` [patch 060/118] memblock: Use __func__ in remaining memblock_dbg() call sites Andrew Morton
2020-01-31  6:14 ` [patch 061/118] mm, oom: dump stack of victim when reaping failed Andrew Morton
2020-01-31  6:14 ` [patch 062/118] mm/huge_memory.c: use head to check huge zero page Andrew Morton
2020-01-31  6:14 ` [patch 063/118] mm/huge_memory.c: use head to emphasize the purpose of page Andrew Morton
2020-01-31  6:14 ` [patch 064/118] mm/huge_memory.c: reduce critical section protected by split_queue_lock Andrew Morton
2020-01-31  6:14 ` [patch 065/118] mm/migrate: remove useless mask of start address Andrew Morton
2020-01-31  6:14 ` [patch 066/118] mm/migrate: clean up some minor coding style Andrew Morton
2020-01-31  6:14 ` [patch 067/118] mm/migrate: add stable check in migrate_vma_insert_page() Andrew Morton
2020-01-31  6:14 ` [patch 068/118] mm, thp: fix defrag setting if newline is not used Andrew Morton
2020-01-31  6:14 ` [patch 069/118] mm/mmap.c: get rid of odd jump labels in find_mergeable_anon_vma() Andrew Morton
2020-01-31  6:14 ` [patch 070/118] mm/memory_hotplug: pass in nid to online_pages() Andrew Morton
2020-01-31  6:14   ` Andrew Morton
2020-01-31  6:14 ` [patch 071/118] mm/hotplug: silence a lockdep splat with printk() Andrew Morton
2020-01-31  6:15 ` [patch 072/118] mm/page_isolation: fix potential warning from user Andrew Morton
2020-01-31  6:15 ` [patch 073/118] mm/zswap.c: add allocation hysteresis if pool limit is hit Andrew Morton
2020-01-31  6:15 ` [patch 074/118] zswap: potential NULL dereference on error in init_zswap() Andrew Morton
2020-01-31  6:15 ` [patch 075/118] include/linux/mm.h: clean up obsolete check on space in page->flags Andrew Morton
2020-01-31  6:15   ` Andrew Morton
2020-01-31  6:15 ` [patch 076/118] include/linux/mm.h: remove dead code totalram_pages_set() Andrew Morton
2020-01-31  6:15   ` Andrew Morton
2020-01-31  6:15 ` [patch 077/118] include/linux/memory.h: drop fields 'hw' and 'phys_callback' from struct memory_block Andrew Morton
2020-01-31  6:15 ` [patch 078/118] mm: fix comments related to node reclaim Andrew Morton
2020-01-31  6:15 ` [patch 079/118] zram: try to avoid worst-case scenario on same element pages Andrew Morton
2020-01-31  6:15 ` [patch 080/118] drivers/block/zram/zram_drv.c: fix error return codes not being returned in writeback_store Andrew Morton
2020-01-31  6:15 ` [patch 081/118] include/linux/units.h: add helpers for kelvin to/from Celsius conversion Andrew Morton
2020-01-31  6:15 ` [patch 082/118] ACPI: thermal: switch to use <linux/units.h> helpers Andrew Morton
2020-01-31  6:15 ` [patch 083/118] platform/x86: asus-wmi: " Andrew Morton
2020-01-31  6:15   ` Andrew Morton
2020-01-31  6:15 ` [patch 084/118] platform/x86: intel_menlow: " Andrew Morton
2020-01-31  6:15 ` [patch 085/118] thermal: int340x: " Andrew Morton
2020-01-31  6:15 ` [patch 086/118] thermal: intel_pch: " Andrew Morton
2020-01-31  6:15 ` [patch 087/118] nvme: hwmon: " Andrew Morton
2020-01-31  6:15 ` [patch 088/118] thermal: remove kelvin to/from Celsius conversion helpers from <linux/thermal.h> Andrew Morton
2020-01-31  6:15   ` Andrew Morton
2020-01-31  6:16 ` [patch 089/118] iwlegacy: use <linux/units.h> helpers Andrew Morton
2020-01-31  6:16   ` Andrew Morton
2020-01-31  6:16 ` [patch 090/118] iwlwifi: " Andrew Morton
2020-01-31  6:28   ` Luciano Coelho
2020-01-31  6:16 ` [patch 091/118] thermal: armada: remove unused TO_MCELSIUS macro Andrew Morton
2020-01-31  6:16   ` Andrew Morton
2020-01-31  6:16 ` [patch 092/118] iio: adc: qcom-vadc-common: use <linux/units.h> helpers Andrew Morton
2020-01-31  6:16 ` [patch 093/118] lib/zlib: add s390 hardware support for kernel zlib_deflate Andrew Morton
2020-01-31  6:16 ` [patch 094/118] s390/boot: rename HEAP_SIZE due to name collision Andrew Morton
2020-01-31  6:16 ` [patch 095/118] lib/zlib: add s390 hardware support for kernel zlib_inflate Andrew Morton
2020-01-31  6:16 ` [patch 096/118] s390/boot: add dfltcc= kernel command line parameter Andrew Morton
2020-01-31  6:16 ` [patch 097/118] lib/zlib: add zlib_deflate_dfltcc_enabled() function Andrew Morton
2020-01-31  6:16   ` Andrew Morton
2020-01-31  6:16 ` [patch 098/118] btrfs: use larger zlib buffer for s390 hardware compression Andrew Morton
2020-01-31  6:16 ` [patch 099/118] lib/scatterlist.c: adjust indentation in __sg_alloc_table Andrew Morton
2020-01-31  6:16 ` [patch 100/118] uapi: rename ext2_swab() to swab() and share globally in swab.h Andrew Morton
2020-01-31  6:16   ` Andrew Morton
2020-01-31  6:16 ` [patch 101/118] lib/find_bit.c: join _find_next_bit{_le} Andrew Morton
2020-01-31  6:16   ` Andrew Morton
2020-01-31  6:16 ` [patch 102/118] lib/find_bit.c: uninline helper _find_next_bit() Andrew Morton
2020-01-31  6:16 ` [patch 103/118] fs/binfmt_elf.c: smaller code generation around auxv vector fill Andrew Morton
2020-01-31  6:16 ` [patch 104/118] fs/binfmt_elf.c: fix ->start_code calculation Andrew Morton
2020-01-31  6:16 ` [patch 105/118] fs/binfmt_elf.c: don't copy ELF header around Andrew Morton
2023-11-22  7:15   ` Jinjie Ruan
2020-01-31  6:16 ` [patch 106/118] fs/binfmt_elf.c: better codegen around current->mm Andrew Morton
2020-01-31  6:17 ` [patch 107/118] fs/binfmt_elf.c: make BAD_ADDR() unlikely Andrew Morton
2020-01-31  6:17 ` [patch 108/118] fs/binfmt_elf.c: coredump: allocate core ELF header on stack Andrew Morton
2020-01-31  6:17 ` [patch 109/118] fs/binfmt_elf.c: coredump: delete duplicated overflow check Andrew Morton
2020-01-31  6:17 ` [patch 110/118] fs/binfmt_elf.c: coredump: allow process with empty address space to coredump Andrew Morton
2020-01-31  6:17 ` [patch 111/118] init/main.c: log arguments and environment passed to init Andrew Morton
2020-01-31  6:17 ` [patch 112/118] init/main.c: remove unnecessary repair_env_string in do_initcall_level Andrew Morton
2020-01-31  6:17 ` [patch 113/118] init/main.c: fix quoted value handling in unknown_bootoption Andrew Morton
2020-01-31  6:17 ` [patch 114/118] init/main.c: fix misleading "This architecture does not have kernel memory protection" message Andrew Morton
2020-01-31  6:17 ` [patch 115/118] reiserfs: prevent NULL pointer dereference in reiserfs_insert_item() Andrew Morton
2020-01-31  6:17 ` [patch 116/118] execve: warn if process starts with executable stack Andrew Morton
2020-01-31  6:17 ` [patch 117/118] include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc() Andrew Morton
2020-01-31  6:17 ` [patch 118/118] kcov: ignore fault-inject and stacktrace Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200131061254.ou_9FoBzx%akpm@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=alex.williamson@redhat.com \
    --cc=aneesh.kumar@linux.ibm.com \
    --cc=axboe@kernel.dk \
    --cc=bjorn.topel@intel.com \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=daniel.vetter@ffwll.ch \
    --cc=hch@lst.de \
    --cc=hverkuil-cisco@xs4all.nl \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=jgg@mellanox.com \
    --cc=jgg@ziepe.ca \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=kirill@shutemov.name \
    --cc=leonro@mellanox.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mchehab@kernel.org \
    --cc=mm-commits@vger.kernel.org \
    --cc=rppt@linux.ibm.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.