* [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN
@ 2022-02-27  9:34 jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 1/6] mm/gup: introduce pin_user_page() jhubbard.send.patches
                   ` (5 more replies)
  0 siblings, 6 replies; 27+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Hi,

The feedback on the RFC [1] prompted me to convert the core Direct IO
subsystem all at once. The key differences here, as compared to the RFC,
are:

    * no dio_w_*() wrapper routines,

    * no CONFIG parameter; and

    * new iov_iter_pin_pages*() routines that pin pages without
      affecting other callers of iov_iter_get_pages*(). Those other
      callers (ceph, rds, net, ...) can be converted separately.

Also, many pre-existing callers of unpin_user_pages_dirty_lock() are
wrong, and this series adds a few more callers. So readers may naturally
wonder about that. I recently had a very productive discussion with Ted
Ts'o, who suggested a way to fix the problem, and I'm going to implement
it next. However, I think it's best to do that fix separately from this
series, probably layered on top, although it could go either before or
after.

As part of fixing the "get_user_pages() + file-backed memory" problem
[2], and to support various COW-related fixes as well [3], we need to
convert the Direct IO code from get_user_pages_fast(), to
pin_user_pages_fast(). Because pin_user_pages*() calls require a
corresponding call to unpin_user_page(), the conversion is more
elaborate than just substitution.
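
To illustrate the shape of the change (this is only a sketch, not code
taken from any patch in this series), a typical caller-side pattern goes
from:

    /* FOLL_GET acquisition, released with put_page(): */
    nr = get_user_pages_fast(addr, nr_pages, FOLL_WRITE, pages);
    ...
    for (i = 0; i < nr; i++)
            put_page(pages[i]);

to:

    /* FOLL_PIN acquisition, released with unpin_user_page*(): */
    nr = pin_user_pages_fast(addr, nr_pages, FOLL_WRITE, pages);
    ...
    unpin_user_pages(pages, nr);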

In the main patch (patch 4) I'm a little concerned about the
bio_map_user_iov() changes, because the sole caller,
blk_rq_map_user_iov(), has either a direct-mapped case or a
copy-from-user case, and I'm still not sure that these are properly kept
separate from an unpin-pages point of view. So a close look there by
reviewers
would be welcome.

Testing: this needs lots of filesystem testing.

In this patchset:

Patches 1, 2: provide a few new routines that will be used by the
conversion: pin_user_page(), iov_iter_pin_pages(), and
iov_iter_pin_pages_alloc().

Patch 3: provide a few asserts that only user space pages are being
passed in for Direct IO. (This patch could be folded into another
patch.)

Patch 4: Convert all Direct IO callers that use iomap, or
blockdev_direct_IO(), or bio_iov_iter_get_pages().

Patches 5, 6: convert a few other callers to the new system: NFS-Direct
and fuse.

This is based on linux-next (next-20220225). I've also stashed it here:

    https://github.com/johnhubbard/linux bio_pup_next_20220225


[1] https://lore.kernel.org/r/20220225085025.3052894-1-jhubbard@nvidia.com

[2] https://lwn.net/Articles/753027/ "The trouble with get_user_pages()"

[3] https://lore.kernel.org/all/20211217113049.23850-1-david@redhat.com/T/#u
    (David Hildenbrand's mm/COW fixes)

John Hubbard (6):
  mm/gup: introduce pin_user_page()
  iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  block, fs: assert that key paths use iovecs, and nothing else
  block, bio, fs: convert most filesystems to pin_user_pages_fast()
  NFS: direct-io: convert to FOLL_PIN pages
  fuse: convert direct IO paths to use FOLL_PIN

 block/bio.c          | 29 ++++++++--------
 block/blk-map.c      |  6 ++--
 fs/direct-io.c       | 28 ++++++++--------
 fs/fuse/dev.c        |  7 ++--
 fs/fuse/file.c       | 38 +++++----------------
 fs/iomap/direct-io.c |  2 +-
 fs/nfs/direct.c      | 15 +++------
 include/linux/mm.h   |  1 +
 include/linux/uio.h  |  4 +++
 lib/iov_iter.c       | 78 ++++++++++++++++++++++++++++++++++++++++++++
 mm/gup.c             | 34 +++++++++++++++++++
 11 files changed, 170 insertions(+), 72 deletions(-)


base-commit: 06aeb1495c39c86ccfaf1adadc1d2200179f16eb
-- 
2.35.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages jhubbard.send.patches
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

pin_user_page() is an externally-usable version of try_grab_page(), but
with semantics that match get_page(), so that it can act as a drop-in
replacement for get_page(). Specifically, pin_user_page() has a void
return type.

pin_user_page() elevates a page's refcount using FOLL_PIN rules. This
means that the caller must release the page via unpin_user_page().
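
As a quick illustration (not taken from the patches below), a drop-in
conversion at a call site looks roughly like this:

    get_page(page);        /* old: FOLL_GET-style reference */
    ...
    put_page(page);

becomes:

    pin_user_page(page);   /* new: FOLL_PIN reference */
    ...
    unpin_user_page(page);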

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9bada4096ac..367d7fd28fd0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1946,6 +1946,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+void pin_user_page(struct page *page);
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 428c587acfa2..13c0dced2aee 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3035,6 +3035,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(pin_user_pages);
 
+/**
+ * pin_user_page() - apply a FOLL_PIN reference to a page
+ *
+ * @page: the page to be pinned.
+ *
+ * Similar to get_user_pages(), in that the page's refcount is elevated using
+ * FOLL_PIN rules.
+ *
+ * IMPORTANT: That means that the caller must release the page via
+ * unpin_user_page().
+ *
+ */
+void pin_user_page(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
+
+	/*
+	 * Similar to try_grab_page(): be sure to *also*
+	 * increment the normal page refcount field at least once,
+	 * so that the page really is pinned.
+	 */
+	if (folio_test_large(folio)) {
+		folio_ref_add(folio, 1);
+		atomic_add(1, folio_pincount_ptr(folio));
+	} else {
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+
+	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+}
+EXPORT_SYMBOL(pin_user_page);
+
 /*
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 1/6] mm/gup: introduce pin_user_page() jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27 21:57   ` Jens Axboe
  2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Provide two new specialized routines that only handle user space pages,
and invoke pin_user_pages_fast() on them: iov_iter_pin_pages() and
iov_iter_pin_pages_alloc().

This allows subsequent patches to convert various callers of
iov_iter_get_pages*(), to the new calls, without having to attempt a
mass conversion all at once.
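
For illustration only (the real conversions are in later patches in this
series), a caller is expected to switch over roughly like this:

    bytes = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);
    if (bytes <= 0)
            return bytes ? bytes : -EFAULT;
    ...
    /* Pages acquired this way must be released via unpin_user_page*(): */
    unpin_user_pages(pages, DIV_ROUND_UP(bytes + offset, PAGE_SIZE));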

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/uio.h |  4 +++
 lib/iov_iter.c      | 78 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285fe5a2f..208020c2b75a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -236,6 +236,10 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
+ssize_t iov_iter_pin_pages(struct iov_iter *i, struct page **pages,
+			size_t maxsize, unsigned int maxpages, size_t *start);
+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i, struct page ***pages,
+			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
 void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..e64e8e4edd0c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1560,6 +1560,41 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages);
 
+ssize_t iov_iter_pin_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned int maxpages,
+		   size_t *start)
+{
+	size_t len;
+	int n, res;
+
+	if (maxsize > i->count)
+		maxsize = i->count;
+	if (!maxsize)
+		return 0;
+
+	WARN_ON_ONCE(!iter_is_iovec(i));
+
+	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
+		unsigned long addr;
+
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->nofault)
+			gup_flags |= FOLL_NOFAULT;
+
+		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		res = pin_user_pages_fast(addr, n, gup_flags, pages);
+		if (unlikely(res <= 0))
+			return res;
+		return (res == n ? len : res * PAGE_SIZE) - *start;
+	}
+
+	return -EFAULT;
+}
+EXPORT_SYMBOL(iov_iter_pin_pages);
+
 static struct page **get_pages_array(size_t n)
 {
 	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
@@ -1696,6 +1731,49 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	struct page **p;
+	size_t len;
+	int n, res;
+
+	if (maxsize > i->count)
+		maxsize = i->count;
+	if (!maxsize)
+		return 0;
+
+	WARN_ON_ONCE(!iter_is_iovec(i));
+
+	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
+		unsigned long addr;
+
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->nofault)
+			gup_flags |= FOLL_NOFAULT;
+
+		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		p = get_pages_array(n);
+		if (!p)
+			return -ENOMEM;
+		res = pin_user_pages_fast(addr, n, gup_flags, p);
+		if (unlikely(res <= 0)) {
+			kvfree(p);
+			*pages = NULL;
+			return res;
+		}
+		*pages = p;
+		return (res == n ? len : res * PAGE_SIZE) - *start;
+	}
+
+	return -EFAULT;
+}
+EXPORT_SYMBOL(iov_iter_pin_pages_alloc);
+
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
 {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 1/6] mm/gup: introduce pin_user_page() jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27 21:58   ` Jens Axboe
  2022-02-27 22:15   ` Al Viro
  2022-02-27  9:34 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() jhubbard.send.patches
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 27+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Upcoming changes to Direct IO will change it from acquiring pages via
get_user_pages_fast(), to calling pin_user_pages_fast() instead.

Place a few assertions at key points, that the pages are IOVEC (user
pages), to enforce the assumptions that there are no kernel or pipe or
other odd variations being passed.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c    | 4 ++++
 fs/direct-io.c | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index b15f5466ce08..4679d6539e2d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
+	WARN_ON_ONCE(!iter_is_iovec(iter));
+
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
@@ -1217,6 +1219,8 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
+	WARN_ON_ONCE(!iter_is_iovec(iter));
+
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 38bca4980a1c..7dbbbfef300d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -169,6 +169,8 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
 	ssize_t ret;
 
+	WARN_ON_ONCE(!iter_is_iovec(sdio->iter));
+
 	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
 				&sdio->from);
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
                   ` (2 preceding siblings ...)
  2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27 21:59   ` Jens Axboe
  2022-02-27  9:34 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN jhubbard.send.patches
  5 siblings, 1 reply; 27+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Use pin_user_pages_fast(), pin_user_page(), and unpin_user_page() calls,
in place of get_user_pages_fast(), get_page() and put_page().

This converts the Direct IO parts of most filesystems over to using
FOLL_PIN (pin_user_page*()) page pinning.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c          | 25 ++++++++++++-------------
 block/blk-map.c      |  6 +++---
 fs/direct-io.c       | 26 +++++++++++++-------------
 fs/iomap/direct-io.c |  2 +-
 4 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4679d6539e2d..d078f992a9c9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1102,7 +1102,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		unpin_user_page(bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1130,10 +1130,9 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 
 static void bio_put_pages(struct page **pages, size_t size, size_t off)
 {
-	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
+	size_t nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
 
-	for (i = 0; i < nr; i++)
-		put_page(pages[i]);
+	unpin_user_pages(pages, nr);
 }
 
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
@@ -1144,9 +1143,9 @@ static void bio_put_pages(struct page **pages, size_t size, size_t off)
  * @iter: iov iterator describing the region to be mapped
  *
  * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * pages will have to be released using unpin_user_page() when done. For
+ * multi-segment *iter, this function only adds pages from the next non-empty
+ * segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1169,7 +1168,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	WARN_ON_ONCE(!iter_is_iovec(iter));
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1180,7 +1179,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
 			if (same_page)
-				put_page(page);
+				unpin_user_page(page);
 		} else {
 			if (WARN_ON_ONCE(bio_full(bio, len))) {
 				bio_put_pages(pages + i, left, offset);
@@ -1221,7 +1220,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	WARN_ON_ONCE(!iter_is_iovec(iter));
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1237,7 +1236,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 			break;
 		}
 		if (same_page)
-			put_page(page);
+			unpin_user_page(page);
 		offset = 0;
 	}
 
@@ -1434,8 +1433,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one unpin_user_page() against each page and will run
+ * one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index c7f71d83eff1..ce450683994c 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -252,7 +252,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+		bytes = iov_iter_pin_pages_alloc(iter, &pages, LONG_MAX, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -275,7 +275,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						unpin_user_page(page);
 					break;
 				}
 
@@ -289,7 +289,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			unpin_user_page(pages[j++]);
 		kvfree(pages);
 		/* couldn't stuff something into bio? */
 		if (bytes)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7dbbbfef300d..815407c26f57 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -171,7 +171,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 
 	WARN_ON_ONCE(!iter_is_iovec(sdio->iter));
 
-	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
+	ret = iov_iter_pin_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
 				&sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && (dio->op == REQ_OP_WRITE)) {
@@ -183,7 +183,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
+		pin_user_page(page);
 		dio->pages[0] = page;
 		sdio->head = 0;
 		sdio->tail = 1;
@@ -452,7 +452,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
 	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+		unpin_user_page(dio->pages[sdio->head++]);
 }
 
 /*
@@ -717,7 +717,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 		 */
 		if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
 			sdio->pages_in_io--;
-		get_page(sdio->cur_page);
+		pin_user_page(sdio->cur_page);
 		sdio->final_block_in_bio = sdio->cur_page_block +
 			(sdio->cur_page_len >> sdio->blkbits);
 		ret = 0;
@@ -832,13 +832,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	pin_user_page(page);		/* It is in dio */
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -853,7 +853,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -953,7 +953,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					unpin_user_page(page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -998,7 +998,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio->op == REQ_OP_WRITE) {
-					put_page(page);
+					unpin_user_page(page);
 					return -ENOTBLK;
 				}
 
@@ -1011,7 +1011,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					unpin_user_page(page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1051,7 +1051,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				unpin_user_page(page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1067,7 +1067,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 		}
 
 		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		unpin_user_page(page);
 	}
 out:
 	return ret;
@@ -1289,7 +1289,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		unpin_user_page(sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 67cf9c16f80c..d0986a31a9d1 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -192,7 +192,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
+	pin_user_page(page);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
                   ` (3 preceding siblings ...)
  2022-02-27  9:34 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN jhubbard.send.patches
  5 siblings, 0 replies; 27+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Convert nfs-direct to use pin_user_pages_fast(), and unpin_user_pages(),
in place of get_user_pages_fast() and put_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/nfs/direct.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eabfdab543c8..42111a75c0f7 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -177,13 +177,6 @@ ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	return nfs_file_direct_write(iocb, iter);
 }
 
-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
-{
-	unsigned int i;
-	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
-}
-
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
 			      struct nfs_direct_req *dreq)
 {
@@ -367,7 +360,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec, 
+		result = iov_iter_pin_pages_alloc(iter, &pagevec,
 						  rsize, &pgbase);
 		if (result < 0)
 			break;
@@ -398,7 +391,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		unpin_user_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
@@ -811,7 +804,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec, 
+		result = iov_iter_pin_pages_alloc(iter, &pagevec,
 						  wsize, &pgbase);
 		if (result < 0)
 			break;
@@ -850,7 +843,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		unpin_user_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
                   ` (4 preceding siblings ...)
  2022-02-27  9:34 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-28 15:59   ` Miklos Szeredi
  5 siblings, 1 reply; 27+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Convert the fuse filesystem to support the new iov_iter_get_pages()
behavior. That routine now invokes pin_user_pages_fast(), which means
that such pages must be released via unpin_user_page(), rather than via
put_page().

This commit also removes any possibility of kernel pages being handled,
in the fuse_get_user_pages() call. Although this may seem like a steep
price to pay, Christoph Hellwig actually recommended it a few years ago
for nearly the same situation [1].

[1] https://lore.kernel.org/kvm/20190724061750.GA19397@infradead.org/

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/fuse/dev.c  |  7 +++++--
 fs/fuse/file.c | 38 +++++++++-----------------------------
 2 files changed, 14 insertions(+), 31 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e1b4a846c90d..9db85c4d549a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -675,7 +675,10 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
 			flush_dcache_page(cs->pg);
 			set_page_dirty_lock(cs->pg);
 		}
-		put_page(cs->pg);
+		if (cs->pipebufs)
+			put_page(cs->pg);
+		else
+			unpin_user_page(cs->pg);
 	}
 	cs->pg = NULL;
 }
@@ -730,7 +733,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 		}
 	} else {
 		size_t off;
-		err = iov_iter_get_pages(cs->iter, &page, PAGE_SIZE, 1, &off);
+		err = iov_iter_pin_pages(cs->iter, &page, PAGE_SIZE, 1, &off);
 		if (err < 0)
 			return err;
 		BUG_ON(!err);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 94747bac3489..ecfa5bdde919 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -611,18 +611,6 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
 	args->out_args[0].size = count;
 }
 
-static void fuse_release_user_pages(struct fuse_args_pages *ap,
-				    bool should_dirty)
-{
-	unsigned int i;
-
-	for (i = 0; i < ap->num_pages; i++) {
-		if (should_dirty)
-			set_page_dirty_lock(ap->pages[i]);
-		put_page(ap->pages[i]);
-	}
-}
-
 static void fuse_io_release(struct kref *kref)
 {
 	kfree(container_of(kref, struct fuse_io_priv, refcnt));
@@ -720,7 +708,8 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
 	struct fuse_io_priv *io = ia->io;
 	ssize_t pos = -1;
 
-	fuse_release_user_pages(&ia->ap, io->should_dirty);
+	unpin_user_pages_dirty_lock(ia->ap.pages, ia->ap.num_pages,
+				    io->should_dirty);
 
 	if (err) {
 		/* Nothing */
@@ -1382,25 +1371,14 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 	size_t nbytes = 0;  /* # bytes already packed in req */
 	ssize_t ret = 0;
 
-	/* Special case for kernel I/O: can copy directly into the buffer */
-	if (iov_iter_is_kvec(ii)) {
-		unsigned long user_addr = fuse_get_user_addr(ii);
-		size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
-
-		if (write)
-			ap->args.in_args[1].value = (void *) user_addr;
-		else
-			ap->args.out_args[0].value = (void *) user_addr;
-
-		iov_iter_advance(ii, frag_size);
-		*nbytesp = frag_size;
-		return 0;
-	}
+	/* Only user space buffers are allowed with fuse Direct IO. */
+	if (WARN_ON_ONCE(!iter_is_iovec(ii)))
+		return -EOPNOTSUPP;
 
 	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
 		unsigned npages;
 		size_t start;
-		ret = iov_iter_get_pages(ii, &ap->pages[ap->num_pages],
+		ret = iov_iter_pin_pages(ii, &ap->pages[ap->num_pages],
 					*nbytesp - nbytes,
 					max_pages - ap->num_pages,
 					&start);
@@ -1484,7 +1462,9 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 		}
 
 		if (!io->async || nres < 0) {
-			fuse_release_user_pages(&ia->ap, io->should_dirty);
+			unpin_user_pages_dirty_lock(ia->ap.pages,
+						    ia->ap.num_pages,
+						    io->should_dirty);
 			fuse_io_free(ia);
 		}
 		ia = NULL;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  2022-02-27  9:34 ` [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages jhubbard.send.patches
@ 2022-02-27 21:57   ` Jens Axboe
  2022-02-27 22:09     ` John Hubbard
  2022-02-28 22:49     ` John Hubbard
  0 siblings, 2 replies; 27+ messages in thread
From: Jens Axboe @ 2022-02-27 21:57 UTC (permalink / raw)
  To: jhubbard.send.patches, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

> +ssize_t iov_iter_pin_pages(struct iov_iter *i,
> +		   struct page **pages, size_t maxsize, unsigned int maxpages,
> +		   size_t *start)
> +{
> +	size_t len;
> +	int n, res;
> +
> +	if (maxsize > i->count)
> +		maxsize = i->count;
> +	if (!maxsize)
> +		return 0;
> +
> +	WARN_ON_ONCE(!iter_is_iovec(i));
> +
> +	if (likely(iter_is_iovec(i))) {
> +		unsigned int gup_flags = 0;
> +		unsigned long addr;
> +
> +		if (iov_iter_rw(i) != WRITE)
> +			gup_flags |= FOLL_WRITE;
> +		if (i->nofault)
> +			gup_flags |= FOLL_NOFAULT;
> +
> +		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
> +		n = DIV_ROUND_UP(len, PAGE_SIZE);
> +		res = pin_user_pages_fast(addr, n, gup_flags, pages);
> +		if (unlikely(res <= 0))
> +			return res;
> +		return (res == n ? len : res * PAGE_SIZE) - *start;

Trying to be clever like that just makes the code a lot less readable. I
should not have to reason about a return value. Same in the other
function.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
@ 2022-02-27 21:58   ` Jens Axboe
  2022-02-27 22:12     ` John Hubbard
  2022-02-27 22:15   ` Al Viro
  1 sibling, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2022-02-27 21:58 UTC (permalink / raw)
  To: jhubbard.send.patches, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

> diff --git a/block/bio.c b/block/bio.c
> index b15f5466ce08..4679d6539e2d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>  
> +	WARN_ON_ONCE(!iter_is_iovec(iter));
> +

If these make sense, why aren't they also returning an error?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()
  2022-02-27  9:34 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() jhubbard.send.patches
@ 2022-02-27 21:59   ` Jens Axboe
  2022-02-27 22:13     ` John Hubbard
  0 siblings, 1 reply; 27+ messages in thread
From: Jens Axboe @ 2022-02-27 21:59 UTC (permalink / raw)
  To: jhubbard.send.patches, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

On 2/27/22 2:34 AM, jhubbard.send.patches@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Use pin_user_pages_fast(), pin_user_page(), and unpin_user_page() calls,
> in place of get_user_pages_fast(), get_page() and put_page().
> 
> This converts the Direct IO parts of most filesystems over to using
> FOLL_PIN (pin_user_page*()) page pinning.

The commit message needs to explain why a change is being made, not what
is being done. The latter I can just look at the code for.

Didn't even find it in in your cover letter, had to go to the original
posting for that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  2022-02-27 21:57   ` Jens Axboe
@ 2022-02-27 22:09     ` John Hubbard
  2022-02-28 22:49     ` John Hubbard
  1 sibling, 0 replies; 27+ messages in thread
From: John Hubbard @ 2022-02-27 22:09 UTC (permalink / raw)
  To: Jens Axboe, jhubbard.send.patches, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/27/22 13:57, Jens Axboe wrote:
...
>> +			return res;
>> +		return (res == n ? len : res * PAGE_SIZE) - *start;
> 
> Trying to be clever like that just makes the code a lot less readable. I
> should not have to reason about a return value. Same in the other
> function.
> 

No argument there. This was shamelessly lifted from
iov_iter_get_pages(), and I initially calculated that keeping it
identical to that known-good code, where possible, was better than
fixing it up.

However, I'll go ahead and simplify it--with pleasure--based on this
feedback.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27 21:58   ` Jens Axboe
@ 2022-02-27 22:12     ` John Hubbard
  0 siblings, 0 replies; 27+ messages in thread
From: John Hubbard @ 2022-02-27 22:12 UTC (permalink / raw)
  To: Jens Axboe, jhubbard.send.patches, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/27/22 13:58, Jens Axboe wrote:
>> diff --git a/block/bio.c b/block/bio.c
>> index b15f5466ce08..4679d6539e2d 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>>   	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>>   	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>>   
>> +	WARN_ON_ONCE(!iter_is_iovec(iter));
>> +
> 
> If these make sense, why aren't they also returning an error?
> 

Oops. Got caught up in the luxury of local testing and watching the
logs, but also returning errors is of course the right answer. Will fix.
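
Probably something along these lines (sketch only; the exact error code
is still to be decided), similar to what the fuse patch in this series
already does:

	if (WARN_ON_ONCE(!iter_is_iovec(iter)))
		return -EINVAL;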

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()
  2022-02-27 21:59   ` Jens Axboe
@ 2022-02-27 22:13     ` John Hubbard
  0 siblings, 0 replies; 27+ messages in thread
From: John Hubbard @ 2022-02-27 22:13 UTC (permalink / raw)
  To: Jens Axboe, jhubbard.send.patches, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/27/22 13:59, Jens Axboe wrote:
> On 2/27/22 2:34 AM, jhubbard.send.patches@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Use pin_user_pages_fast(), pin_user_page(), and unpin_user_page() calls,
>> in place of get_user_pages_fast(), get_page() and put_page().
>>
>> This converts the Direct IO parts of most filesystems over to using
>> FOLL_PIN (pin_user_page*()) page pinning.
> 
> The commit message needs to explain why a change is being made, not what
> is being done. The latter I can just look at the code for.
> 
> Didn't even find it in in your cover letter, had to go to the original
> posting for that.
> 

Sorry about that, I'll sit down and write up something thorough for this
patch. It is, after all, the central thing, and it got neglected here.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
  2022-02-27 21:58   ` Jens Axboe
@ 2022-02-27 22:15   ` Al Viro
  2022-02-27 22:27     ` John Hubbard
  2022-02-28  3:29     ` John Hubbard
  1 sibling, 2 replies; 27+ messages in thread
From: Al Viro @ 2022-02-27 22:15 UTC (permalink / raw)
  To: jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Miklos Szeredi,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML, John Hubbard

On Sun, Feb 27, 2022 at 01:34:31AM -0800, jhubbard.send.patches@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Upcoming changes to Direct IO will change it from acquiring pages via
> get_user_pages_fast(), to calling pin_user_pages_fast() instead.
> 
> Place a few assertions at key points, that the pages are IOVEC (user
> pages), to enforce the assumptions that there are no kernel or pipe or
> other odd variations being passed.

Umm...  And what should happen when O_DIRECT file gets passed to splice()?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27 22:15   ` Al Viro
@ 2022-02-27 22:27     ` John Hubbard
  2022-02-28  3:29     ` John Hubbard
  1 sibling, 0 replies; 27+ messages in thread
From: John Hubbard @ 2022-02-27 22:27 UTC (permalink / raw)
  To: Al Viro, jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Miklos Szeredi,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML

On 2/27/22 14:15, Al Viro wrote:
> On Sun, Feb 27, 2022 at 01:34:31AM -0800, jhubbard.send.patches@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Upcoming changes to Direct IO will change it from acquiring pages via
>> get_user_pages_fast(), to calling pin_user_pages_fast() instead.
>>
>> Place a few assertions at key points, that the pages are IOVEC (user
>> pages), to enforce the assumptions that there are no kernel or pipe or
>> other odd variations being passed.
> 
> Umm...  And what should happen when O_DIRECT file gets passed to splice()?

Hi Al,

First of all, full disclosure: I still haven't worked through how
splice() handles pages in all cases. I was hoping to defer that, by
limiting this series to only a subset, not *all*, of the original
callers of iov_iter_get_pages*().

This series leaves the splice() code pointing to iov_iter_get_pages(),
but maybe that's not possible after all.

Any advice or ideas about how to solve the O_DIRECT-to-splice() case
would be very welcome.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27 22:15   ` Al Viro
  2022-02-27 22:27     ` John Hubbard
@ 2022-02-28  3:29     ` John Hubbard
  1 sibling, 0 replies; 27+ messages in thread
From: John Hubbard @ 2022-02-28  3:29 UTC (permalink / raw)
  To: Al Viro, jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Miklos Szeredi,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML

On 2/27/22 14:15, Al Viro wrote:
> On Sun, Feb 27, 2022 at 01:34:31AM -0800, jhubbard.send.patches@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Upcoming changes to Direct IO will change it from acquiring pages via
>> get_user_pages_fast(), to calling pin_user_pages_fast() instead.
>>
>> Place a few assertions at key points, that the pages are IOVEC (user
>> pages), to enforce the assumptions that there are no kernel or pipe or
>> other odd variations being passed.
> 
> Umm...  And what should happen when O_DIRECT file gets passed to splice()?

OK, right, this is not going to work as-is.

Any of the remaining iov_iter_get_pages*() callers that do direct IO
(vmsplice, ceph, ...) will end up calling bio_release_pages(), which is
now unconditionally calling unpin_user_pages(). So this is just
completely broken.

And so, once again, I'm back to, "must convert the whole pile of
iov_iter_get_page*() callers at once".

Back to the drawing board.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-02-27  9:34 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN jhubbard.send.patches
@ 2022-02-28 15:59   ` Miklos Szeredi
  2022-02-28 21:16     ` John Hubbard
  0 siblings, 1 reply; 27+ messages in thread
From: Miklos Szeredi @ 2022-02-28 15:59 UTC (permalink / raw)
  To: jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML, John Hubbard

On Sun, 27 Feb 2022 at 10:34, <jhubbard.send.patches@gmail.com> wrote:
>
> From: John Hubbard <jhubbard@nvidia.com>
>
> Convert the fuse filesystem to support the new iov_iter_get_pages()
> behavior. That routine now invokes pin_user_pages_fast(), which means
> that such pages must be released via unpin_user_page(), rather than via
> put_page().
>
> This commit also removes any possibility of kernel pages being handled,
> in the fuse_get_user_pages() call. Although this may seem like a steep
> price to pay, Christoph Hellwig actually recommended it a few years ago
> for nearly the same situation [1].

This might work for O_DIRECT, but fuse has this mode of operation
which turns normal "buffered" I/O into direct I/O.  And that in turn
will break execve of such files.

So AFAICS we need to keep kvec handling in some way.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-02-28 15:59   ` Miklos Szeredi
@ 2022-02-28 21:16     ` John Hubbard
  2022-03-01  9:41       ` Miklos Szeredi
  0 siblings, 1 reply; 27+ messages in thread
From: John Hubbard @ 2022-02-28 21:16 UTC (permalink / raw)
  To: Miklos Szeredi, jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML

On 2/28/22 07:59, Miklos Szeredi wrote:
> On Sun, 27 Feb 2022 at 10:34, <jhubbard.send.patches@gmail.com> wrote:
>>
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Convert the fuse filesystem to support the new iov_iter_get_pages()
>> behavior. That routine now invokes pin_user_pages_fast(), which means
>> that such pages must be released via unpin_user_page(), rather than via
>> put_page().
>>
>> This commit also removes any possibility of kernel pages being handled,
>> in the fuse_get_user_pages() call. Although this may seem like a steep
>> price to pay, Christoph Hellwig actually recommended it a few years ago
>> for nearly the same situation [1].
> 
> This might work for O_DIRECT, but fuse has this mode of operation
> which turns normal "buffered" I/O into direct I/O.  And that in turn
> will break execve of such files.
> 
> So AFAICS we need to keep kvec handing in some way.
> 

Thanks for bringing that up! Do you have any hints for me, to jump start
a deeper look? And especially, sample programs that exercise this?


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  2022-02-27 21:57   ` Jens Axboe
  2022-02-27 22:09     ` John Hubbard
@ 2022-02-28 22:49     ` John Hubbard
  1 sibling, 0 replies; 27+ messages in thread
From: John Hubbard @ 2022-02-28 22:49 UTC (permalink / raw)
  To: Jens Axboe, jhubbard.send.patches, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/27/22 13:57, Jens Axboe wrote:
>> +ssize_t iov_iter_pin_pages(struct iov_iter *i,
>> +		   struct page **pages, size_t maxsize, unsigned int maxpages,
>> +		   size_t *start)
>> +{
>> +	size_t len;
>> +	int n, res;
>> +
>> +	if (maxsize > i->count)
>> +		maxsize = i->count;
>> +	if (!maxsize)
>> +		return 0;
>> +
>> +	WARN_ON_ONCE(!iter_is_iovec(i));
>> +
>> +	if (likely(iter_is_iovec(i))) {
>> +		unsigned int gup_flags = 0;
>> +		unsigned long addr;
>> +
>> +		if (iov_iter_rw(i) != WRITE)
>> +			gup_flags |= FOLL_WRITE;
>> +		if (i->nofault)
>> +			gup_flags |= FOLL_NOFAULT;
>> +
>> +		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
>> +		n = DIV_ROUND_UP(len, PAGE_SIZE);
>> +		res = pin_user_pages_fast(addr, n, gup_flags, pages);
>> +		if (unlikely(res <= 0))
>> +			return res;
>> +		return (res == n ? len : res * PAGE_SIZE) - *start;
> 
> Trying to be clever like that just makes the code a lot less readable. I
> should not have to reason about a return value. Same in the other
> function.
> 

Here is a differential patch on top of this one, and only showing one of
the two routines. How does this direction look to you?


diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e64e8e4edd0c..8e96f1e9ebc6 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1588,7 +1588,17 @@ ssize_t iov_iter_pin_pages(struct iov_iter *i,
  		res = pin_user_pages_fast(addr, n, gup_flags, pages);
  		if (unlikely(res <= 0))
  			return res;
-		return (res == n ? len : res * PAGE_SIZE) - *start;
+
+		/* Cap len at the number of pages that were actually pinned: */
+		if (res < n)
+			len = res * PAGE_SIZE;
+
+		/*
+		 * The return value is the amount pinned in bytes that the
+		 * caller will actually use. So, reduce it by the offset into
+		 * the first page:
+		 */
+		return len - *start;
  	}

  	return -EFAULT;

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-02-28 21:16     ` John Hubbard
@ 2022-03-01  9:41       ` Miklos Szeredi
  2022-03-02  8:07         ` John Hubbard
  0 siblings, 1 reply; 27+ messages in thread
From: Miklos Szeredi @ 2022-03-01  9:41 UTC (permalink / raw)
  To: John Hubbard
  Cc: jhubbard.send.patches, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Andrew Morton, Chaitanya Kulkarni, linux-block,
	linux-fsdevel, linux-xfs, linux-mm, LKML

On Mon, 28 Feb 2022 at 22:16, John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 2/28/22 07:59, Miklos Szeredi wrote:
> > On Sun, 27 Feb 2022 at 10:34, <jhubbard.send.patches@gmail.com> wrote:
> >>
> >> From: John Hubbard <jhubbard@nvidia.com>
> >>
> >> Convert the fuse filesystem to support the new iov_iter_get_pages()
> >> behavior. That routine now invokes pin_user_pages_fast(), which means
> >> that such pages must be released via unpin_user_page(), rather than via
> >> put_page().
> >>
> >> This commit also removes any possibility of kernel pages being handled,
> >> in the fuse_get_user_pages() call. Although this may seem like a steep
> >> price to pay, Christoph Hellwig actually recommended it a few years ago
> >> for nearly the same situation [1].
> >
> > This might work for O_DIRECT, but fuse has this mode of operation
> > which turns normal "buffered" I/O into direct I/O.  And that in turn
> > will break execve of such files.
> >
> > So AFAICS we need to keep kvec handling in some way.
> >
>
> Thanks for bringing that up! Do you have any hints for me, to jump start

How about just leaving that special code in place?   It bypasses page
refs and directly copies to the kernel buffer, so it should not have
any effect on the user page code.

> a deeper look? And especially, sample programs that exercise this?

Here's one:
# uncomment as appropriate:
#sudo dnf install fuse3-devel
#sudo apt install libfuse3-dev

cat <<EOF > fuse-dio-exec.c
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <unistd.h>

static const char *filename = "/bin/true";

static int test_getattr(const char *path, struct stat *stbuf,
             struct fuse_file_info *fi)
{
    return lstat(filename, stbuf) == -1 ? -errno : 0;
}

static int test_open(const char *path, struct fuse_file_info *fi)
{
    int res;

    res = open(filename, fi->flags);
    if (res == -1)
        return -errno;

    fi->fh = res;
    fi->direct_io = 1;
    return 0;
}

static int test_read(const char *path, char *buf, size_t size, off_t offset,
              struct fuse_file_info *fi)
{
    int res = pread(fi->fh, buf, size, offset);
    return res == -1 ? -errno : res;
}

static int test_release(const char *path, struct fuse_file_info *fi)
{
    close(fi->fh);
    return 0;
}

static const struct fuse_operations test_oper = {
    .getattr    = test_getattr,
    .open        = test_open,
    .release    = test_release,
    .read        = test_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &test_oper, NULL);
}
EOF

gcc -W fuse-dio-exec.c `pkg-config fuse3 --cflags --libs` -o fuse-dio-exec
touch /tmp/true

#run test:
./fuse-dio-exec /tmp/true
/tmp/true
umount /tmp/true

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-03-01  9:41       ` Miklos Szeredi
@ 2022-03-02  8:07         ` John Hubbard
  0 siblings, 0 replies; 27+ messages in thread
From: John Hubbard @ 2022-03-02  8:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: jhubbard.send.patches, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Andrew Morton, Chaitanya Kulkarni, linux-block,
	linux-fsdevel, linux-xfs, linux-mm, LKML

On 3/1/22 01:41, Miklos Szeredi wrote:
...
>>> This might work for O_DIRECT, but fuse has this mode of operation
>>> which turns normal "buffered" I/O into direct I/O.  And that in turn
>>> will break execve of such files.
>>>
>>> So AFAICS we need to keep kvec handling in some way.
>>>
>>
>> Thanks for bringing that up! Do you have any hints for me, to jump start
> 
> How about just leaving that special code in place?   It bypasses page
> refs and directly copies to the kernel buffer, so it should not have
> any effect on the user page code.
> 

Good idea, I'll go that direction.

>> a deeper look? And especially, sample programs that exercise this?
> 
> Here's one:

This is really helpful, exactly what I was looking for.


thanks!
-- 
John Hubbard
NVIDIA

> # uncomment as appropriate:
> #sudo dnf install fuse3-devel
> #sudo apt install libfuse3-dev
> 
> cat <<EOF > fuse-dio-exec.c
> #define FUSE_USE_VERSION 31
> #include <fuse.h>
> #include <errno.h>
> #include <unistd.h>
> 
> static const char *filename = "/bin/true";
> 
> static int test_getattr(const char *path, struct stat *stbuf,
>               struct fuse_file_info *fi)
> {
>      return lstat(filename, stbuf) == -1 ? -errno : 0;
> }
> 
> static int test_open(const char *path, struct fuse_file_info *fi)
> {
>      int res;
> 
>      res = open(filename, fi->flags);
>      if (res == -1)
>          return -errno;
> 
>      fi->fh = res;
>      fi->direct_io = 1;
>      return 0;
> }
> 
> static int test_read(const char *path, char *buf, size_t size, off_t offset,
>                struct fuse_file_info *fi)
> {
>      int res = pread(fi->fh, buf, size, offset);
>      return res == -1 ? -errno : res;
> }
> 
> static int test_release(const char *path, struct fuse_file_info *fi)
> {
>      close(fi->fh);
>      return 0;
> }
> 
> static const struct fuse_operations test_oper = {
>      .getattr    = test_getattr,
>      .open        = test_open,
>      .release    = test_release,
>      .read        = test_read,
> };
> 
> int main(int argc, char *argv[])
> {
>      return fuse_main(argc, argv, &test_oper, NULL);
> }
> EOF
> 
> gcc -W fuse-dio-exec.c `pkg-config fuse3 --cflags --libs` -o fuse-dio-exec
> touch /tmp/true
> 
> #run test:
> ./fuse-dio-exec /tmp/true
> /tmp/true
> umount /tmp/true

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-30 12:17       ` David Hildenbrand
  2022-08-30 21:42         ` John Hubbard
@ 2022-08-31  0:06         ` John Hubbard
  1 sibling, 0 replies; 27+ messages in thread
From: John Hubbard @ 2022-08-31  0:06 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/30/22 05:17, David Hildenbrand wrote:
> (side note: after recent VM_BUG_ON discussions we might want to convert
> the VM_BUG_ON_PAGE in sanity_check_pinned_pages())

Just realized I skipped over this point in my other response.

So, there are 28 "BUG_ON" calls in mm/gup.c, only 3 of which are 
BUILD_BUG_ON().

Maybe a single patch that converts all 25 at once is the way to go. And
of course, any patch that touches one of those lines in the meantime
should convert it to a WARN* variant--whichever lands first.
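
For illustration only, a rough sketch of the kind of conversion being
discussed (the exact conditions in sanity_check_pinned_pages() may differ;
this is not a real patch):

	/* before: an unexpected state crashes the kernel outright */
	VM_BUG_ON_PAGE(!PageAnonExclusive(page), page);

	/* after: warn once and continue, so the system stays debuggable */
	VM_WARN_ON_ONCE_PAGE(!PageAnonExclusive(page), page);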


thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-30 12:17       ` David Hildenbrand
@ 2022-08-30 21:42         ` John Hubbard
  2022-08-31  0:06         ` John Hubbard
  1 sibling, 0 replies; 27+ messages in thread
From: John Hubbard @ 2022-08-30 21:42 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/30/22 05:17, David Hildenbrand wrote:
> On 29.08.22 21:33, John Hubbard wrote:
>> On 8/29/22 05:07, David Hildenbrand wrote:
>>>> +/**
>>>> + * pin_user_page() - apply a FOLL_PIN reference to a page
>>>> + *
>>>> + * @page: the page to be pinned.
>>>> + *
>>>> + * This is similar to get_user_pages(), except that the page's refcount is
>>>> + * elevated using FOLL_PIN, instead of FOLL_GET.
>>
>> Actually, my commit log has a more useful documentation of this routine,
>> and given the questions below, I think I'll change to that:
>>
>>  * pin_user_page() is an externally-usable version of try_grab_page(), but with
>>  * semantics that match get_page(), so that it can act as a drop-in replacement
>>  * for get_page().
>>  *
>>  * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
>>  * that the caller must release the page via unpin_user_page().
> 
> Some thoughts:
> 
> a) Can we generalize such that pages with a dedicated pincount
> (multi-page folios) are also covered? Maybe avoiding the refcount
> terminology would be best.
> 
> b) Should we directly work on folios?
> 
> c) Would it be valid to pass in a tail page right now?

I would fervently prefer to defer those kinds of questions and ideas,
because the call sites are dealing simply in pages. And this is 
really for file system call sites. The folio conversion is a larger
thing. Below...

> 
>>
>>>> + *
>>>> + * IMPORTANT: The caller must release the page via unpin_user_page().
>>>> + *
>>>> + */
>>>> +void pin_user_page(struct page *page)
>>>> +{
>>>> +	struct folio *folio = page_folio(page);
>>>> +
>>>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>>>> +
>>>
>>> We should warn if the page is anon and !exclusive.
>>
>> That would be sort of OK, because pin_user_page() is being created
>> specifically for file system (O_DIRECT cases) use, and so the pages
>> should mostly be file-backed, rather than anon. Although I'm a little
>> vague about whether all of these iov_iter cases are really always
>> file-backed pages, especially for cases such as splice(2) to an
>> O_DIRECT-opened file, that Al Viro mentioned [1].
> 
> If we can, we should document that this interface is not for anonymous
> pages and WARN if pinning an anonymous page via this interface.

Yes. OK, I've rewritten the documentation again, and changed the 
warning to just 

    WARN_ON_ONCE(PageAnon(page));

So it looks like this now, what do you think?

/**
 * pin_user_page() - apply a FOLL_PIN reference to a file-backed page that the
 * caller already owns.
 *
 * @page: the page to be pinned.
 *
 * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
 * that the caller must release the page via unpin_user_page().
 *
 * pin_user_page() is intended as a drop-in replacement for get_page(). This
 * provides a way for callers to do a subsequent unpin_user_page() on the
 * affected page. However, it is only intended for use by callers (file systems,
 * block/bio) that have a file-backed page. Anonymous pages are not expected nor
 * supported, and will generate a warning.
 *
 * pin_user_page() may also be thought of as an externally-usable version of
 * try_grab_page(), but with semantics that match get_page(), so that it can act
 * as a drop-in replacement for get_page().
 *
 * IMPORTANT: The caller must release the page via unpin_user_page().
 *
 */
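
For reference, a minimal caller sketch of the drop-in idea (hypothetical,
not taken from the series itself): wherever a Direct IO path currently does
get_page()/put_page() on a user page it already holds, the converted code
would pair the calls like this:

	/* grab: previously get_page(page); */
	pin_user_page(page);

	/* ... the page sits in a bio while the I/O is in flight ... */

	/* release: previously put_page(page); */
	unpin_user_page(page);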

> 
> The only reasonable way to obtain a pin on an anonymous page is via the
> page table. Here, FOLL_PIN should be used to do the right thing -- for
> example, unshare first (break COW) instead of pinning a shared anonymous
> page.
> 
> Nothing would speak against duplicating such a pin using this interface
> (we'd have to sanity check that the page we're pinning may already be
> pinned), but I assume the pages we pin here are *not* necessarily
> obtained via GUP FOLL_PIN.
> 
> I would be curious under which scenarios we could end up here with an
> anonymous page and how we obtained that reference (if not via GUP).

Let's see if the warning ever fires. I expect not, but if it does,
then I can add the !PageAnonExclusive qualifier to the warning.
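
Something along these lines (a sketch, not yet in the patch):

	/* stricter check, only if the plain PageAnon() warning ever fires: */
	WARN_ON_ONCE(PageAnon(page) && !PageAnonExclusive(page));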

> 
>>
>> Can you walk me through the reasoning for why we need to keep out
>> anon shared pages? 
> 
> We make sure to only pin anonymous pages that are exclusive and check
> when unpinning -- see sanity_check_pinned_pages(), there is also a
> comment in there -- that pinned anonymous pages are in fact still
> exclusive, otherwise we might have a BUG lurking somewhere that can
> result in memory corruptions or leaking information between processes.
> 
> For example, once we'd pinned anonymous pages that are not marked
> exclusive (!PageAnonExclusive), or we'd be sharing a page that is
> pinned, the next write fault would replace the page in the user page
> table due to breaking COW, and the GUP pin would point at a different
> page than the page table.

Right, OK it all clicks together, thanks for that.

> 
> Disallowing pinning of anon pages that may be shared in any case
> (FOLL_LONGTERM or not) simplifies GUP handling and allows for such
> sanity checks.
> 
> (side note: after recent VM_BUG_ON discussions we might want to convert
> the VM_BUG_ON_PAGE in sanity_check_pinned_pages())
> 
>>
>>>
>>> I assume the intent is to use pin_user_page() only to duplicate pins, right?
>>>
>>
>> Well, yes or no, depending on your use of the term "pin":
>>
>> pin_user_page() is used on a page that already has a refcount >= 1 (so
>> no worries about speculative pinning should apply here), but the page
>> does not necessarily have any FOLL_PIN's applied to it yet (so it's not
>> "pinned" in the FOLL_PIN sense).
> 
> Okay, then we should really figure out if/how anonymous pages could end
> up here. I assume they can't, really. But let's see :)
> 
> 

Yes, again, the warning will reveal that. Which reminds me that I also
have it on my list to write some test cases that do things like Al
Viro suggested: ITER_BVEC + O_DIRECT.
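
As a starting point, something like the following userspace sketch (file
name, sizes and error handling are all placeholders) should push pipe
pages, i.e. ITER_BVEC, into an O_DIRECT write path via splice(2):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		int pipefd[2];
		char *buf;
		/* O_DIRECT wants block-aligned lengths and offsets */
		int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

		if (fd < 0 || pipe(pipefd))
			return 1;
		if (posix_memalign((void **)&buf, 4096, 4096))
			return 1;
		memset(buf, 'x', 4096);
		/* fill the pipe: these pages are not user-mapped file pages */
		if (write(pipefd[1], buf, 4096) != 4096)
			return 1;
		/* splice the pipe pages straight into the O_DIRECT file */
		if (splice(pipefd[0], NULL, fd, NULL, 4096, 0) < 0)
			return 1;
		close(fd);
		return 0;
	}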


thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-29 19:33     ` John Hubbard
@ 2022-08-30 12:17       ` David Hildenbrand
  2022-08-30 21:42         ` John Hubbard
  2022-08-31  0:06         ` John Hubbard
  0 siblings, 2 replies; 27+ messages in thread
From: David Hildenbrand @ 2022-08-30 12:17 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 29.08.22 21:33, John Hubbard wrote:
> On 8/29/22 05:07, David Hildenbrand wrote:
>>> +/**
>>> + * pin_user_page() - apply a FOLL_PIN reference to a page
>>> + *
>>> + * @page: the page to be pinned.
>>> + *
>>> + * This is similar to get_user_pages(), except that the page's refcount is
>>> + * elevated using FOLL_PIN, instead of FOLL_GET.
> 
> Actually, my commit log has a more useful documentation of this routine,
> and given the questions below, I think I'll change to that:
> 
>  * pin_user_page() is an externally-usable version of try_grab_page(), but with
>  * semantics that match get_page(), so that it can act as a drop-in replacement
>  * for get_page().
>  *
>  * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
>  * that the caller must release the page via unpin_user_page().

Some thoughts:

a) Can we generalize such that pages with a dedicated pincount
(multi-page folios) are also covered? Maybe avoiding the refcount
terminology would be best.

b) Should we directly work on folios?

c) Would it be valid to pass in a tail page right now?

> 
>>> + *
>>> + * IMPORTANT: The caller must release the page via unpin_user_page().
>>> + *
>>> + */
>>> +void pin_user_page(struct page *page)
>>> +{
>>> +	struct folio *folio = page_folio(page);
>>> +
>>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>>> +
>>
>> We should warn if the page is anon and !exclusive.
> 
> That would be sort of OK, because pin_user_page() is being created
> specifically for file system (O_DIRECT cases) use, and so the pages
> should mostly be file-backed, rather than anon. Although I'm a little
> vague about whether all of these iov_iter cases are really always
> file-backed pages, especially for cases such as splice(2) to an
> O_DIRECT-opened file, that Al Viro mentioned [1].

If we can, we should document that this interface is not for anonymous
pages and WARN if pinning an anonymous page via this interface.

The only reasonable way to obtain a pin on an anonymous page is via the
page table. Here, FOLL_PIN should be used to do the right thing -- for
example, unshare first (break COW) instead of pinning a shared anonymous
page.

Nothing would speak against duplicating such a pin using this interface
(we'd have to sanity check that the page we're pinning may already be
pinned), but I assume the pages we pin here are *not* necessarily
obtained via GUP FOLL_PIN.

I would be curious under which scenarios we could end up here with an
anonymous page and how we obtained that reference (if not via GUP).

> 
> Can you walk me through the reasoning for why we need to keep out
> anon shared pages? 

We make sure to only pin anonymous pages that are exclusive and check
when unpinning -- see sanity_check_pinned_pages(), there is also a
comment in there -- that pinned anonymous pages are in fact still
exclusive, otherwise we might have a BUG lurking somewhere that can
result in memory corruptions or leaking information between processes.

For example, once we'd pinned anonymous pages that are not marked
exclusive (!PageAnonExclusive), or we'd be sharing a page that is
pinned, the next write fault would replace the page in the user page
table due to breaking COW, and the GUP pin would point at a different
page than the page table.

Disallowing pinning of anon pages that may be shared in any case
(FOLL_LONGTERM or not) simplifies GUP handling and allows for such
sanity checks.

(side note: after recent VM_BUG_ON discussions we might want to convert
the VM_BUG_ON_PAGE in sanity_check_pinned_pages())

> 
>>
>> I assume the intent is to use pin_user_page() only to duplicate pins, right?
>>
> 
> Well, yes or no, depending on your use of the term "pin":
> 
> pin_user_page() is used on a page that already has a refcount >= 1 (so
> no worries about speculative pinning should apply here), but the page
> does not necessarily have any FOLL_PIN's applied to it yet (so it's not
> "pinned" in the FOLL_PIN sense).

Okay, then we should really figure out if/how anonymous pages could end
up here. I assume they can't, really. But let's see :)


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-29 12:07   ` David Hildenbrand
@ 2022-08-29 19:33     ` John Hubbard
  2022-08-30 12:17       ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: John Hubbard @ 2022-08-29 19:33 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/29/22 05:07, David Hildenbrand wrote:
>> +/**
>> + * pin_user_page() - apply a FOLL_PIN reference to a page
>> + *
>> + * @page: the page to be pinned.
>> + *
>> + * This is similar to get_user_pages(), except that the page's refcount is
>> + * elevated using FOLL_PIN, instead of FOLL_GET.

Actually, my commit log has a more useful documentation of this routine,
and given the questions below, I think I'll change to that:

 * pin_user_page() is an externally-usable version of try_grab_page(), but with
 * semantics that match get_page(), so that it can act as a drop-in replacement
 * for get_page().
 *
 * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
 * that the caller must release the page via unpin_user_page().


>> + *
>> + * IMPORTANT: The caller must release the page via unpin_user_page().
>> + *
>> + */
>> +void pin_user_page(struct page *page)
>> +{
>> +	struct folio *folio = page_folio(page);
>> +
>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>> +
> 
> We should warn if the page is anon and !exclusive.

That would be sort of OK, because pin_user_page() is being created
specifically for file system (O_DIRECT cases) use, and so the pages
should mostly be file-backed, rather than anon. Although I'm a little
vague about whether all of these iov_iter cases are really always
file-backed pages, especially for cases such as splice(2) to an
O_DIRECT-opened file, that Al Viro mentioned [1].

Can you walk me through the reasoning for why we need to keep out
anon shared pages? 

> 
> I assume the intent is to use pin_user_page() only to duplicate pins, right?
> 

Well, yes or no, depending on your use of the term "pin":

pin_user_page() is used on a page that already has a refcount >= 1 (so
no worries about speculative pinning should apply here), but the page
does not necessarily have any FOLL_PIN's applied to it yet (so it's not
"pinned" in the FOLL_PIN sense).


[1] https://lore.kernel.org/all/Ywq5VrSrY341UVpL@ZenIV/


thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-27  8:36 ` [PATCH 1/6] mm/gup: introduce pin_user_page() John Hubbard
@ 2022-08-29 12:07   ` David Hildenbrand
  2022-08-29 19:33     ` John Hubbard
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2022-08-29 12:07 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 27.08.22 10:36, John Hubbard wrote:
> pin_user_page() is an externally-usable version of try_grab_page(), but
> with semantics that match get_page(), so that it can act as a drop-in
> replacement for get_page(). Specifically, pin_user_page() has a void
> return type.
> 
> pin_user_page() elevates a page's refcount using FOLL_PIN rules. This
> means that the caller must release the page via unpin_user_page().
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/gup.c           | 33 +++++++++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 982f2607180b..85a105157334 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1869,6 +1869,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
>  long get_user_pages(unsigned long start, unsigned long nr_pages,
>  			    unsigned int gup_flags, struct page **pages,
>  			    struct vm_area_struct **vmas);
> +void pin_user_page(struct page *page);
>  long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages,
>  		    struct vm_area_struct **vmas);
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..245ccb41ed8c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3213,6 +3213,39 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  }
>  EXPORT_SYMBOL(pin_user_pages);
>  
> +/**
> + * pin_user_page() - apply a FOLL_PIN reference to a page
> + *
> + * @page: the page to be pinned.
> + *
> + * This is similar to get_user_pages(), except that the page's refcount is
> + * elevated using FOLL_PIN, instead of FOLL_GET.
> + *
> + * IMPORTANT: The caller must release the page via unpin_user_page().
> + *
> + */
> +void pin_user_page(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
> +

We should warn if the page is anon and !exclusive.

I assume the intent is to use pin_user_page() only to duplicate pins, right?


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-27  8:36 [PATCH 0/6] convert most filesystems to pin_user_pages_fast() John Hubbard
@ 2022-08-27  8:36 ` John Hubbard
  2022-08-29 12:07   ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: John Hubbard @ 2022-08-27  8:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML, John Hubbard

pin_user_page() is an externally-usable version of try_grab_page(), but
with semantics that match get_page(), so that it can act as a drop-in
replacement for get_page(). Specifically, pin_user_page() has a void
return type.

pin_user_page() elevates a page's refcount using FOLL_PIN rules. This
means that the caller must release the page via unpin_user_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 982f2607180b..85a105157334 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1869,6 +1869,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+void pin_user_page(struct page *page);
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..245ccb41ed8c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3213,6 +3213,39 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(pin_user_pages);
 
+/**
+ * pin_user_page() - apply a FOLL_PIN reference to a page
+ *
+ * @page: the page to be pinned.
+ *
+ * This is similar to get_user_pages(), except that the page's refcount is
+ * elevated using FOLL_PIN, instead of FOLL_GET.
+ *
+ * IMPORTANT: The caller must release the page via unpin_user_page().
+ *
+ */
+void pin_user_page(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
+
+	/*
+	 * Similar to try_grab_page(): be sure to *also*
+	 * increment the normal page refcount field at least once,
+	 * so that the page really is pinned.
+	 */
+	if (folio_test_large(folio)) {
+		folio_ref_add(folio, 1);
+		atomic_add(1, folio_pincount_ptr(folio));
+	} else {
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+
+	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+}
+EXPORT_SYMBOL(pin_user_page);
+
 /*
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-08-31  0:06 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
2022-02-27  9:34 ` [PATCH 1/6] mm/gup: introduce pin_user_page() jhubbard.send.patches
2022-02-27  9:34 ` [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages jhubbard.send.patches
2022-02-27 21:57   ` Jens Axboe
2022-02-27 22:09     ` John Hubbard
2022-02-28 22:49     ` John Hubbard
2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
2022-02-27 21:58   ` Jens Axboe
2022-02-27 22:12     ` John Hubbard
2022-02-27 22:15   ` Al Viro
2022-02-27 22:27     ` John Hubbard
2022-02-28  3:29     ` John Hubbard
2022-02-27  9:34 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() jhubbard.send.patches
2022-02-27 21:59   ` Jens Axboe
2022-02-27 22:13     ` John Hubbard
2022-02-27  9:34 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages jhubbard.send.patches
2022-02-27  9:34 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN jhubbard.send.patches
2022-02-28 15:59   ` Miklos Szeredi
2022-02-28 21:16     ` John Hubbard
2022-03-01  9:41       ` Miklos Szeredi
2022-03-02  8:07         ` John Hubbard
2022-08-27  8:36 [PATCH 0/6] convert most filesystems to pin_user_pages_fast() John Hubbard
2022-08-27  8:36 ` [PATCH 1/6] mm/gup: introduce pin_user_page() John Hubbard
2022-08-29 12:07   ` David Hildenbrand
2022-08-29 19:33     ` John Hubbard
2022-08-30 12:17       ` David Hildenbrand
2022-08-30 21:42         ` John Hubbard
2022-08-31  0:06         ` John Hubbard
