From: John Hubbard <jhubbard@nvidia.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Christoph Hellwig" <hch@infradead.org>,
"Dan Williams" <dan.j.williams@intel.com>,
"Dave Chinner" <david@fromorbit.com>,
"Ira Weiny" <ira.weiny@intel.com>, "Jan Kara" <jack@suse.cz>,
"Jason Gunthorpe" <jgg@ziepe.ca>,
"Jérôme Glisse" <jglisse@redhat.com>,
"Vlastimil Babka" <vbabka@suse.cz>,
LKML <linux-kernel@vger.kernel.org>,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-rdma@vger.kernel.org, "John Hubbard" <jhubbard@nvidia.com>,
"Michal Hocko" <mhocko@kernel.org>
Subject: [PATCH v2 2/3] mm/gup: introduce FOLL_PIN flag for get_user_pages()
Date: Tue, 20 Aug 2019 21:07:26 -0700
Message-ID: <20190821040727.19650-3-jhubbard@nvidia.com>
In-Reply-To: <20190821040727.19650-1-jhubbard@nvidia.com>
As explained in the newly added documentation for FOLL_PIN and
FOLL_LONGTERM, in every case where vaddr_pin_pages() is required,
FOLL_PIN must be set. That reason, plus a desire to keep FOLL_PIN
an internal (to get_user_pages() and follow_page()) detail, is why
vaddr_pin_pages() sets FOLL_PIN.
FOLL_LONGTERM, on the other hand, is only set in *some* cases, but
not all. For that reason, this patch moves the setting of FOLL_LONGTERM
out to the caller.
Also add fairly extensive documentation of the meaning and use
of both FOLL_PIN and FOLL_LONGTERM.
Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
in this documentation. (I've reworded it and expanded on it slightly.)
The motivation behind moving away from "bare" get_user_pages() calls
is described in more detail in commit fc1d8e7cca2d
("mm: introduce put_user_page*(), placeholder versions").
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
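Note (not part of the commit message): the split of responsibilities
described above can be sketched as a small user-space model. This is only
an illustration of the flag handling, not kernel code. The flag values are
copied from the include/linux/mm.h hunk below; the model_*() functions are
hypothetical stand-ins that return the flags which would reach the GUP
core, and they pin nothing.

```c
#include <assert.h>

/* Flag values copied from the include/linux/mm.h hunk in this patch. */
#define FOLL_WRITE	0x01
#define FOLL_LONGTERM	0x10000
#define FOLL_PIN	0x40000

/*
 * Hypothetical stand-in for vaddr_pin_pages(): models only the flag
 * handling after this patch. FOLL_PIN is always set here, so callers
 * never have to know about it.
 */
static unsigned int model_vaddr_pin_pages(unsigned int gup_flags)
{
	gup_flags |= FOLL_PIN;
	return gup_flags;
}

/*
 * Hypothetical long-term caller, modeled on the ib_umem_get() hunk:
 * the caller opts in to FOLL_LONGTERM, because only some callers need it.
 */
static unsigned int model_longterm_caller(unsigned int gup_flags)
{
	unsigned int flags;

	gup_flags |= FOLL_LONGTERM;
	flags = model_vaddr_pin_pages(gup_flags);
	assert(flags & FOLL_PIN);	/* the callee always adds FOLL_PIN */
	return flags;
}

/* Hypothetical short-term caller: it does not ask for FOLL_LONGTERM. */
static unsigned int model_shortterm_caller(unsigned int gup_flags)
{
	return model_vaddr_pin_pages(gup_flags);
}
```

In this model, a long-term caller passing FOLL_WRITE ends up with
FOLL_WRITE | FOLL_LONGTERM | FOLL_PIN, while a short-term caller ends up
with FOLL_WRITE | FOLL_PIN only.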
drivers/infiniband/core/umem.c | 1 +
include/linux/mm.h | 56 ++++++++++++++++++++++++++++++----
mm/gup.c | 2 +-
3 files changed, 52 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index e69eecb0023f..d84f1bfb8d21 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -300,6 +300,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
while (npages) {
down_read(&mm->mmap_sem);
+ gup_flags |= FOLL_LONGTERM;
ret = vaddr_pin_pages(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof (struct page *)),
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc675e94ddf8..6e7de424bf5e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2644,6 +2644,8 @@ static inline vm_fault_t vmf_error(int err)
struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
unsigned int foll_flags);
+/* Flags for follow_page(), get_user_pages ("GUP"), and vaddr_pin_pages(): */
+
#define FOLL_WRITE 0x01 /* check pte is writable */
#define FOLL_TOUCH 0x02 /* mark page accessed */
#define FOLL_GET 0x04 /* do get_page on page */
@@ -2663,13 +2665,15 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_ANON 0x8000 /* don't do file mappings */
#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
+#define FOLL_PIN 0x40000 /* pages must be released via put_user_page() */
/*
- * NOTE on FOLL_LONGTERM:
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
*
* FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control. This is contrasted with
- * iov_iter_get_pages() where usages which are transient.
+ * period _often_ under userspace control. This is in contrast to
+ * iov_iter_get_pages(), where usages are transient.
*
* FIXME: For pages which are part of a filesystem, mappings are subject to the
* lifetime enforced by the filesystem and we need guarantees that longterm
@@ -2684,11 +2688,51 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
* Currently only get_user_pages() and get_user_pages_fast() support this flag
* and calls to get_user_pages_[un]locked are specifically not allowed. This
* is due to an incompatibility with the FS DAX check and
- * FAULT_FLAG_ALLOW_RETRY
+ * FAULT_FLAG_ALLOW_RETRY.
*
- * In the CMA case: longterm pins in a CMA region would unnecessarily fragment
- * that region. And so CMA attempts to migrate the page before pinning when
+ * In the CMA case: long term pins in a CMA region would unnecessarily fragment
+ * that region. And so, CMA attempts to migrate the page before pinning, when
* FOLL_LONGTERM is specified.
+ *
+ * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
+ * but an additional pin counting system) will be invoked. This is intended for
+ * anything that gets a page reference and then touches page data (for example,
+ * Direct IO). This lets the filesystem know that some non-file-system entity is
+ * potentially changing the pages' data. FOLL_PIN pages must be released,
+ * ultimately, by a call to put_user_page(). Typically that will be via one of
+ * the vaddr_unpin_pages() variants.
+ *
+ * FIXME: note that this special tracking is not in place yet. However, the
+ * pages should still be released by put_user_page().
+ *
+ * When and where to use each flag:
+ *
+ * CASE 1: Direct IO (DIO). There are GUP references to pages that are serving
+ * as DIO buffers. These buffers are needed for a relatively short time (so they
+ * are not "long term"). No special synchronization with page_mkclean() or
+ * munmap() is provided. Therefore, flags to set at the call site are:
+ *
+ * FOLL_PIN
+ *
+ * CASE 2: RDMA. There are GUP references to pages that are serving as DMA
+ * buffers. These buffers are needed for a long time ("long term"). No special
+ * synchronization with page_mkclean() or munmap() is provided. Therefore, flags
+ * to set at the call site are:
+ *
+ * FOLL_PIN | FOLL_LONGTERM
+ *
+ * There is also a special case when the pages are DAX pages: in addition to the
+ * above flags, the caller needs a file lease. This is provided via the struct
+ * vaddr_pin argument to vaddr_pin_pages().
+ *
+ * CASE 3: ODP (Mellanox/Infiniband On Demand Paging: the hardware supports
+ * replayable page faulting). There are GUP references to pages serving as DMA
+ * buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
+ * and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
+ * needs to be set.
+ *
+ * CASE 4: pinning for struct page manipulation only. Here, normal GUP calls are
+ * sufficient, so neither flag needs to be set.
*/
static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
diff --git a/mm/gup.c b/mm/gup.c
index e49096d012ea..ba316d960d7a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2490,7 +2490,7 @@ long vaddr_pin_pages(unsigned long addr, unsigned long nr_pages,
{
long ret;
- gup_flags |= FOLL_LONGTERM;
+ gup_flags |= FOLL_PIN;
if (!vaddr_pin || (!vaddr_pin->mm && !vaddr_pin->f_owner))
return -EINVAL;
--
2.22.1
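
For quick reference, the four cases documented in the new mm.h comment can
be summarized as a lookup from use case to flags. This is a minimal
user-space sketch, not kernel code: the enum and the gup_case_flags()
helper are hypothetical names invented for illustration, and only the two
flag values come from the patch itself.

```c
#include <assert.h>

/* Flag values copied from the include/linux/mm.h hunk in this patch. */
#define FOLL_LONGTERM	0x10000
#define FOLL_PIN	0x40000

/* Hypothetical names for the four cases in the new comment block. */
enum gup_use_case {
	GUP_CASE_DIO,		/* CASE 1: Direct IO buffers, short term */
	GUP_CASE_RDMA,		/* CASE 2: RDMA DMA buffers, long term */
	GUP_CASE_ODP,		/* CASE 3: ODP, synchronized via MMU notifiers */
	GUP_CASE_STRUCT_PAGE,	/* CASE 4: struct page manipulation only */
};

/* Return the flags the comment names for each case. */
static unsigned int gup_case_flags(enum gup_use_case c)
{
	switch (c) {
	case GUP_CASE_DIO:
		return FOLL_PIN;
	case GUP_CASE_RDMA:
		return FOLL_PIN | FOLL_LONGTERM;
	case GUP_CASE_ODP:
	case GUP_CASE_STRUCT_PAGE:
		return 0;	/* normal GUP calls are sufficient */
	}
	assert(!"unknown use case");
	return 0;
}
```

The DAX special case under CASE 2 (a file lease passed via struct
vaddr_pin) is not modeled here, since it is not expressed as a flag.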