All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN
@ 2022-02-27  9:34 jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 1/6] mm/gup: introduce pin_user_page() jhubbard.send.patches
                   ` (5 more replies)
  0 siblings, 6 replies; 21+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Hi,

The feedback on the RFC [1] prompted me to convert the core Direct IO
subsystem all at once. The key differences here, as compared to the RFC,
are:

    * no dio_w_*() wrapper routines,

    * no CONFIG parameter; and

    * new iov_iter_pin_pages*() routines that pin pages without
      affecting other callers of iov_iter_get_pages*(). Those other
      callers (ceph, rds, net, ...) can be converted separately.

Also, many pre-existing callers of unpin_user_pages_dirty_lock() are
wrong, and this series adds a few more callers. So readers may naturally
wonder about that. I recently had a very productive discussion with Ted
Ts'o, who suggested a way to fix the problem, and I'm going to implement
it, next. However, I think it's best to do that fix separately from
this, probably layered on top, although it could go either before or
after.

As part of fixing the "get_user_pages() + file-backed memory" problem
[2], and to support various COW-related fixes as well [3], we need to
convert the Direct IO code from get_user_pages_fast(), to
pin_user_pages_fast(). Because pin_user_pages*() calls require a
corresponding call to unpin_user_page(), the conversion is more
elaborate than just substitution.

In the main patch (patch 4) I'm a little concerned about the
bio_map_user_iov() changes, because the sole caller,
blk_rq_map_user_iov(), has either a direct mapped case or a copy from
user case, and I'm still not sure that these are properly kept separate,
from an unpin pages point of view. So a close look there by reviewers
would be welcome.

Testing: this needs lots of filesystem testing.

In this patchset:

Patches 1, 2: provide a few new routines that will be used by
conversion: pin_user_page(), iov_iter_pin_pages(),
iov_iter_pin_pages_alloc().

Patch 3: provide a few asserts that only user space pages are being
passed in for Direct IO. (This patch could be folded into another
patch.)

Patch 4: Convert all Direct IO callers that use iomap, or
blockdev_direct_IO(), or bio_iov_iter_get_pages().

Patch 5, 6: convert a few other callers to the new system: NFS-Direct,
and fuse.

This is based on linux-next (next-20220225). I've also stashed it here:

    https://github.com/johnhubbard/linux bio_pup_next_20220225


[1] https://lore.kernel.org/r/20220225085025.3052894-1-jhubbard@nvidia.com

[2] https://lwn.net/Articles/753027/ "The trouble with get_user_pages()"

[3] https://lore.kernel.org/all/20211217113049.23850-1-david@redhat.com/T/#u
    (David Hildenbrand's mm/COW fixes)

John Hubbard (6):
  mm/gup: introduce pin_user_page()
  iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  block, fs: assert that key paths use iovecs, and nothing else
  block, bio, fs: convert most filesystems to pin_user_pages_fast()
  NFS: direct-io: convert to FOLL_PIN pages
  fuse: convert direct IO paths to use FOLL_PIN

 block/bio.c          | 29 ++++++++--------
 block/blk-map.c      |  6 ++--
 fs/direct-io.c       | 28 ++++++++--------
 fs/fuse/dev.c        |  7 ++--
 fs/fuse/file.c       | 38 +++++----------------
 fs/iomap/direct-io.c |  2 +-
 fs/nfs/direct.c      | 15 +++------
 include/linux/mm.h   |  1 +
 include/linux/uio.h  |  4 +++
 lib/iov_iter.c       | 78 ++++++++++++++++++++++++++++++++++++++++++++
 mm/gup.c             | 34 +++++++++++++++++++
 11 files changed, 170 insertions(+), 72 deletions(-)


base-commit: 06aeb1495c39c86ccfaf1adadc1d2200179f16eb
-- 
2.35.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages jhubbard.send.patches
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 21+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

pin_user_page() is an externally-usable version of try_grab_page(), but
with semantics that match get_page(), so that it can act as a drop-in
replacement for get_page(). Specifically, pin_user_page() has a void
return type.

pin_user_page() elevates a page's refcount is using FOLL_PIN rules. This
means that the caller must release the page via unpin_user_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c9bada4096ac..367d7fd28fd0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1946,6 +1946,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+void pin_user_page(struct page *page);
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 428c587acfa2..13c0dced2aee 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3035,6 +3035,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(pin_user_pages);
 
+/**
+ * pin_user_page() - apply a FOLL_PIN reference to a page ()
+ *
+ * @page: the page to be pinned.
+ *
+ * Similar to get_user_pages(), in that the page's refcount is elevated using
+ * FOLL_PIN rules.
+ *
+ * IMPORTANT: That means that the caller must release the page via
+ * unpin_user_page().
+ *
+ */
+void pin_user_page(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
+
+	/*
+	 * Similar to try_grab_page(): be sure to *also*
+	 * increment the normal page refcount field at least once,
+	 * so that the page really is pinned.
+	 */
+	if (folio_test_large(folio)) {
+		folio_ref_add(folio, 1);
+		atomic_add(1, folio_pincount_ptr(folio));
+	} else {
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+
+	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+}
+EXPORT_SYMBOL(pin_user_page);
+
 /*
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 1/6] mm/gup: introduce pin_user_page() jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27 21:57   ` Jens Axboe
  2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 21+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Provide two new specialized routines that only handle user space pages,
and invoke pin_user_pages_fast() on them: iov_iter_pin_pages() and
iov_iter_pin_pages_alloc().

This allows subsequent patches to convert various callers of
iov_iter_get_pages*(), to the new calls, without having to attempt a
mass conversion all at once.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/uio.h |  4 +++
 lib/iov_iter.c      | 78 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 739285fe5a2f..208020c2b75a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -236,6 +236,10 @@ ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
+ssize_t iov_iter_pin_pages(struct iov_iter *i, struct page **pages,
+			size_t maxsize, unsigned int maxpages, size_t *start);
+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i, struct page ***pages,
+			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
 void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..e64e8e4edd0c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1560,6 +1560,41 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages);
 
+ssize_t iov_iter_pin_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned int maxpages,
+		   size_t *start)
+{
+	size_t len;
+	int n, res;
+
+	if (maxsize > i->count)
+		maxsize = i->count;
+	if (!maxsize)
+		return 0;
+
+	WARN_ON_ONCE(!iter_is_iovec(i));
+
+	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
+		unsigned long addr;
+
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->nofault)
+			gup_flags |= FOLL_NOFAULT;
+
+		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		res = pin_user_pages_fast(addr, n, gup_flags, pages);
+		if (unlikely(res <= 0))
+			return res;
+		return (res == n ? len : res * PAGE_SIZE) - *start;
+	}
+
+	return -EFAULT;
+}
+EXPORT_SYMBOL(iov_iter_pin_pages);
+
 static struct page **get_pages_array(size_t n)
 {
 	return kvmalloc_array(n, sizeof(struct page *), GFP_KERNEL);
@@ -1696,6 +1731,49 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc);
 
+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	struct page **p;
+	size_t len;
+	int n, res;
+
+	if (maxsize > i->count)
+		maxsize = i->count;
+	if (!maxsize)
+		return 0;
+
+	WARN_ON_ONCE(!iter_is_iovec(i));
+
+	if (likely(iter_is_iovec(i))) {
+		unsigned int gup_flags = 0;
+		unsigned long addr;
+
+		if (iov_iter_rw(i) != WRITE)
+			gup_flags |= FOLL_WRITE;
+		if (i->nofault)
+			gup_flags |= FOLL_NOFAULT;
+
+		addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
+		n = DIV_ROUND_UP(len, PAGE_SIZE);
+		p = get_pages_array(n);
+		if (!p)
+			return -ENOMEM;
+		res = pin_user_pages_fast(addr, n, gup_flags, p);
+		if (unlikely(res <= 0)) {
+			kvfree(p);
+			*pages = NULL;
+			return res;
+		}
+		*pages = p;
+		return (res == n ? len : res * PAGE_SIZE) - *start;
+	}
+
+	return -EFAULT;
+}
+EXPORT_SYMBOL(iov_iter_pin_pages_alloc);
+
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
 {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 1/6] mm/gup: introduce pin_user_page() jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27 21:58   ` Jens Axboe
  2022-02-27 22:15   ` Al Viro
  2022-02-27  9:34 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() jhubbard.send.patches
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 21+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Upcoming changes to Direct IO will change it from acquiring pages via
get_user_pages_fast(), to calling pin_user_pages_fast() instead.

Place a few assertions at key points, that the pages are IOVEC (user
pages), to enforce the assumptions that there are no kernel or pipe or
other odd variations being passed.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c    | 4 ++++
 fs/direct-io.c | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index b15f5466ce08..4679d6539e2d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
+	WARN_ON_ONCE(!iter_is_iovec(iter));
+
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
@@ -1217,6 +1219,8 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
+	WARN_ON_ONCE(!iter_is_iovec(iter));
+
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 38bca4980a1c..7dbbbfef300d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -169,6 +169,8 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
 	ssize_t ret;
 
+	WARN_ON_ONCE(!iter_is_iovec(sdio->iter));
+
 	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
 				&sdio->from);
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
                   ` (2 preceding siblings ...)
  2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27 21:59   ` Jens Axboe
  2022-02-27  9:34 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN jhubbard.send.patches
  5 siblings, 1 reply; 21+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Use pin_user_pages_fast(), pin_user_page(), and unpin_user_page() calls,
in place of get_user_pages_fast(), get_page() and put_page().

This converts the Direct IO parts of most filesystems over to using
FOLL_PIN (pin_user_page*()) page pinning.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c          | 25 ++++++++++++-------------
 block/blk-map.c      |  6 +++---
 fs/direct-io.c       | 26 +++++++++++++-------------
 fs/iomap/direct-io.c |  2 +-
 4 files changed, 29 insertions(+), 30 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4679d6539e2d..d078f992a9c9 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1102,7 +1102,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		unpin_user_page(bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1130,10 +1130,9 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 
 static void bio_put_pages(struct page **pages, size_t size, size_t off)
 {
-	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
+	size_t nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
 
-	for (i = 0; i < nr; i++)
-		put_page(pages[i]);
+	unpin_user_pages(pages, nr);
 }
 
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
@@ -1144,9 +1143,9 @@ static void bio_put_pages(struct page **pages, size_t size, size_t off)
  * @iter: iov iterator describing the region to be mapped
  *
  * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * pages will have to be released using unpin_user_page() when done. For
+ * multi-segment *iter, this function only adds pages from the next non-empty
+ * segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1169,7 +1168,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	WARN_ON_ONCE(!iter_is_iovec(iter));
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1180,7 +1179,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
 			if (same_page)
-				put_page(page);
+				unpin_user_page(page);
 		} else {
 			if (WARN_ON_ONCE(bio_full(bio, len))) {
 				bio_put_pages(pages + i, left, offset);
@@ -1221,7 +1220,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	WARN_ON_ONCE(!iter_is_iovec(iter));
 
-	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	size = iov_iter_pin_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1237,7 +1236,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 			break;
 		}
 		if (same_page)
-			put_page(page);
+			unpin_user_page(page);
 		offset = 0;
 	}
 
@@ -1434,8 +1433,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one unpin_user_page() against each page and will run
+ * one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index c7f71d83eff1..ce450683994c 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -252,7 +252,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+		bytes = iov_iter_pin_pages_alloc(iter, &pages, LONG_MAX, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -275,7 +275,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						unpin_user_page(page);
 					break;
 				}
 
@@ -289,7 +289,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			unpin_user_page(pages[j++]);
 		kvfree(pages);
 		/* couldn't stuff something into bio? */
 		if (bytes)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7dbbbfef300d..815407c26f57 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -171,7 +171,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 
 	WARN_ON_ONCE(!iter_is_iovec(sdio->iter));
 
-	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
+	ret = iov_iter_pin_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
 				&sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && (dio->op == REQ_OP_WRITE)) {
@@ -183,7 +183,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
+		pin_user_page(page);
 		dio->pages[0] = page;
 		sdio->head = 0;
 		sdio->tail = 1;
@@ -452,7 +452,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
 	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+		unpin_user_page(dio->pages[sdio->head++]);
 }
 
 /*
@@ -717,7 +717,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 		 */
 		if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
 			sdio->pages_in_io--;
-		get_page(sdio->cur_page);
+		pin_user_page(sdio->cur_page);
 		sdio->final_block_in_bio = sdio->cur_page_block +
 			(sdio->cur_page_len >> sdio->blkbits);
 		ret = 0;
@@ -832,13 +832,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	pin_user_page(page);		/* It is in dio */
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -853,7 +853,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -953,7 +953,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					unpin_user_page(page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -998,7 +998,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio->op == REQ_OP_WRITE) {
-					put_page(page);
+					unpin_user_page(page);
 					return -ENOTBLK;
 				}
 
@@ -1011,7 +1011,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					unpin_user_page(page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1051,7 +1051,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				unpin_user_page(page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1067,7 +1067,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 		}
 
 		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		unpin_user_page(page);
 	}
 out:
 	return ret;
@@ -1289,7 +1289,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		unpin_user_page(sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 67cf9c16f80c..d0986a31a9d1 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -192,7 +192,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
+	pin_user_page(page);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
                   ` (3 preceding siblings ...)
  2022-02-27  9:34 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-27  9:34 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN jhubbard.send.patches
  5 siblings, 0 replies; 21+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Convert nfs-direct to use pin_user_pages_fast(), and unpin_user_pages(),
in place of get_user_pages_fast() and put_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/nfs/direct.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eabfdab543c8..42111a75c0f7 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -177,13 +177,6 @@ ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	return nfs_file_direct_write(iocb, iter);
 }
 
-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
-{
-	unsigned int i;
-	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
-}
-
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
 			      struct nfs_direct_req *dreq)
 {
@@ -367,7 +360,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec, 
+		result = iov_iter_pin_pages_alloc(iter, &pagevec,
 						  rsize, &pgbase);
 		if (result < 0)
 			break;
@@ -398,7 +391,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		unpin_user_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
@@ -811,7 +804,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec, 
+		result = iov_iter_pin_pages_alloc(iter, &pagevec,
 						  wsize, &pgbase);
 		if (result < 0)
 			break;
@@ -850,7 +843,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		unpin_user_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
                   ` (4 preceding siblings ...)
  2022-02-27  9:34 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  2022-02-28 15:59   ` Miklos Szeredi
  5 siblings, 1 reply; 21+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Convert the fuse filesystem to support the new iov_iter_get_pages()
behavior. That routine now invokes pin_user_pages_fast(), which means
that such pages must be released via unpin_user_page(), rather than via
put_page().

This commit also removes any possibility of kernel pages being handled,
in the fuse_get_user_pages() call. Although this may seem like a steep
price to pay, Christoph Hellwig actually recommended it a few years ago
for nearly the same situation [1].

[1] https://lore.kernel.org/kvm/20190724061750.GA19397@infradead.org/

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/fuse/dev.c  |  7 +++++--
 fs/fuse/file.c | 38 +++++++++-----------------------------
 2 files changed, 14 insertions(+), 31 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e1b4a846c90d..9db85c4d549a 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -675,7 +675,10 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
 			flush_dcache_page(cs->pg);
 			set_page_dirty_lock(cs->pg);
 		}
-		put_page(cs->pg);
+		if (cs->pipebufs)
+			put_page(cs->pg);
+		else
+			unpin_user_page(cs->pg);
 	}
 	cs->pg = NULL;
 }
@@ -730,7 +733,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 		}
 	} else {
 		size_t off;
-		err = iov_iter_get_pages(cs->iter, &page, PAGE_SIZE, 1, &off);
+		err = iov_iter_pin_pages(cs->iter, &page, PAGE_SIZE, 1, &off);
 		if (err < 0)
 			return err;
 		BUG_ON(!err);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 94747bac3489..ecfa5bdde919 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -611,18 +611,6 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
 	args->out_args[0].size = count;
 }
 
-static void fuse_release_user_pages(struct fuse_args_pages *ap,
-				    bool should_dirty)
-{
-	unsigned int i;
-
-	for (i = 0; i < ap->num_pages; i++) {
-		if (should_dirty)
-			set_page_dirty_lock(ap->pages[i]);
-		put_page(ap->pages[i]);
-	}
-}
-
 static void fuse_io_release(struct kref *kref)
 {
 	kfree(container_of(kref, struct fuse_io_priv, refcnt));
@@ -720,7 +708,8 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
 	struct fuse_io_priv *io = ia->io;
 	ssize_t pos = -1;
 
-	fuse_release_user_pages(&ia->ap, io->should_dirty);
+	unpin_user_pages_dirty_lock(ia->ap.pages, ia->ap.num_pages,
+				    io->should_dirty);
 
 	if (err) {
 		/* Nothing */
@@ -1382,25 +1371,14 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 	size_t nbytes = 0;  /* # bytes already packed in req */
 	ssize_t ret = 0;
 
-	/* Special case for kernel I/O: can copy directly into the buffer */
-	if (iov_iter_is_kvec(ii)) {
-		unsigned long user_addr = fuse_get_user_addr(ii);
-		size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
-
-		if (write)
-			ap->args.in_args[1].value = (void *) user_addr;
-		else
-			ap->args.out_args[0].value = (void *) user_addr;
-
-		iov_iter_advance(ii, frag_size);
-		*nbytesp = frag_size;
-		return 0;
-	}
+	/* Only user space buffers are allowed with fuse Direct IO. */
+	if (WARN_ON_ONCE(!iter_is_iovec(ii)))
+		return -EOPNOTSUPP;
 
 	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
 		unsigned npages;
 		size_t start;
-		ret = iov_iter_get_pages(ii, &ap->pages[ap->num_pages],
+		ret = iov_iter_pin_pages(ii, &ap->pages[ap->num_pages],
 					*nbytesp - nbytes,
 					max_pages - ap->num_pages,
 					&start);
@@ -1484,7 +1462,9 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 		}
 
 		if (!io->async || nres < 0) {
-			fuse_release_user_pages(&ia->ap, io->should_dirty);
+			unpin_user_pages_dirty_lock(ia->ap.pages,
+						    ia->ap.num_pages,
+						    io->should_dirty);
 			fuse_io_free(ia);
 		}
 		ia = NULL;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  2022-02-27  9:34 ` [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages jhubbard.send.patches
@ 2022-02-27 21:57   ` Jens Axboe
  2022-02-27 22:09     ` John Hubbard
  2022-02-28 22:49     ` John Hubbard
  0 siblings, 2 replies; 21+ messages in thread
From: Jens Axboe @ 2022-02-27 21:57 UTC (permalink / raw)
  To: jhubbard.send.patches, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

> +ssize_t iov_iter_pin_pages(struct iov_iter *i,
> +		   struct page **pages, size_t maxsize, unsigned int maxpages,
> +		   size_t *start)
> +{
> +	size_t len;
> +	int n, res;
> +
> +	if (maxsize > i->count)
> +		maxsize = i->count;
> +	if (!maxsize)
> +		return 0;
> +
> +	WARN_ON_ONCE(!iter_is_iovec(i));
> +
> +	if (likely(iter_is_iovec(i))) {
> +		unsigned int gup_flags = 0;
> +		unsigned long addr;
> +
> +		if (iov_iter_rw(i) != WRITE)
> +			gup_flags |= FOLL_WRITE;
> +		if (i->nofault)
> +			gup_flags |= FOLL_NOFAULT;
> +
> +		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
> +		n = DIV_ROUND_UP(len, PAGE_SIZE);
> +		res = pin_user_pages_fast(addr, n, gup_flags, pages);
> +		if (unlikely(res <= 0))
> +			return res;
> +		return (res == n ? len : res * PAGE_SIZE) - *start;

Trying to be clever like that just makes the code a lot less readable. I
should not have to reason about a return value. Same in the other
function.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
@ 2022-02-27 21:58   ` Jens Axboe
  2022-02-27 22:12     ` John Hubbard
  2022-02-27 22:15   ` Al Viro
  1 sibling, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2022-02-27 21:58 UTC (permalink / raw)
  To: jhubbard.send.patches, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

> diff --git a/block/bio.c b/block/bio.c
> index b15f5466ce08..4679d6539e2d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>  
> +	WARN_ON_ONCE(!iter_is_iovec(iter));
> +

If these make sense, why aren't they also returning an error?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()
  2022-02-27  9:34 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() jhubbard.send.patches
@ 2022-02-27 21:59   ` Jens Axboe
  2022-02-27 22:13     ` John Hubbard
  0 siblings, 1 reply; 21+ messages in thread
From: Jens Axboe @ 2022-02-27 21:59 UTC (permalink / raw)
  To: jhubbard.send.patches, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

On 2/27/22 2:34 AM, jhubbard.send.patches@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Use pin_user_pages_fast(), pin_user_page(), and unpin_user_page() calls,
> in place of get_user_pages_fast(), get_page() and put_page().
> 
> This converts the Direct IO parts of most filesystems over to using
> FOLL_PIN (pin_user_page*()) page pinning.

The commit message needs to explain why a change is being made, not what
is being done. The latter I can just look at the code for.

Didn't even find it in in your cover letter, had to go to the original
posting for that.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  2022-02-27 21:57   ` Jens Axboe
@ 2022-02-27 22:09     ` John Hubbard
  2022-02-28 22:49     ` John Hubbard
  1 sibling, 0 replies; 21+ messages in thread
From: John Hubbard @ 2022-02-27 22:09 UTC (permalink / raw)
  To: Jens Axboe, jhubbard.send.patches, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/27/22 13:57, Jens Axboe wrote:
...
>> +			return res;
>> +		return (res == n ? len : res * PAGE_SIZE) - *start;
> 
> Trying to be clever like that just makes the code a lot less readable. I
> should not have to reason about a return value. Same in the other
> function.
> 

No argument there. This was shamelessly lifted from
iov_iter_get_pages(), and I initially calculated that keeping it
identical to that known-good code, where possible, was better than
fixing it up.

However, I'll go ahead and simplify it--with pleasure--based on this
feedback.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27 21:58   ` Jens Axboe
@ 2022-02-27 22:12     ` John Hubbard
  0 siblings, 0 replies; 21+ messages in thread
From: John Hubbard @ 2022-02-27 22:12 UTC (permalink / raw)
  To: Jens Axboe, jhubbard.send.patches, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/27/22 13:58, Jens Axboe wrote:
>> diff --git a/block/bio.c b/block/bio.c
>> index b15f5466ce08..4679d6539e2d 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>>   	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>>   	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>>   
>> +	WARN_ON_ONCE(!iter_is_iovec(iter));
>> +
> 
> If these make sense, why aren't they also returning an error?
> 

Oops. Got caught up in the luxury of local testing and watching the
logs, but also returning errors is of course the right answer. Will fix.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()
  2022-02-27 21:59   ` Jens Axboe
@ 2022-02-27 22:13     ` John Hubbard
  0 siblings, 0 replies; 21+ messages in thread
From: John Hubbard @ 2022-02-27 22:13 UTC (permalink / raw)
  To: Jens Axboe, jhubbard.send.patches, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/27/22 13:59, Jens Axboe wrote:
> On 2/27/22 2:34 AM, jhubbard.send.patches@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Use pin_user_pages_fast(), pin_user_page(), and unpin_user_page() calls,
>> in place of get_user_pages_fast(), get_page() and put_page().
>>
>> This converts the Direct IO parts of most filesystems over to using
>> FOLL_PIN (pin_user_page*()) page pinning.
> 
> The commit message needs to explain why a change is being made, not what
> is being done. The latter I can just look at the code for.
> 
> Didn't even find it in in your cover letter, had to go to the original
> posting for that.
> 

Sorry about that, I'll sit down and write up something thorough for this
patch. It is after the central thing, and it got neglected here.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
  2022-02-27 21:58   ` Jens Axboe
@ 2022-02-27 22:15   ` Al Viro
  2022-02-27 22:27     ` John Hubbard
  2022-02-28  3:29     ` John Hubbard
  1 sibling, 2 replies; 21+ messages in thread
From: Al Viro @ 2022-02-27 22:15 UTC (permalink / raw)
  To: jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Miklos Szeredi,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML, John Hubbard

On Sun, Feb 27, 2022 at 01:34:31AM -0800, jhubbard.send.patches@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> 
> Upcoming changes to Direct IO will change it from acquiring pages via
> get_user_pages_fast(), to calling pin_user_pages_fast() instead.
> 
> Place a few assertions at key points, that the pages are IOVEC (user
> pages), to enforce the assumptions that there are no kernel or pipe or
> other odd variations being passed.

Umm...  And what should happen when O_DIRECT file gets passed to splice()?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27 22:15   ` Al Viro
@ 2022-02-27 22:27     ` John Hubbard
  2022-02-28  3:29     ` John Hubbard
  1 sibling, 0 replies; 21+ messages in thread
From: John Hubbard @ 2022-02-27 22:27 UTC (permalink / raw)
  To: Al Viro, jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Miklos Szeredi,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML

On 2/27/22 14:15, Al Viro wrote:
> On Sun, Feb 27, 2022 at 01:34:31AM -0800, jhubbard.send.patches@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Upcoming changes to Direct IO will change it from acquiring pages via
>> get_user_pages_fast(), to calling pin_user_pages_fast() instead.
>>
>> Place a few assertions at key points, that the pages are IOVEC (user
>> pages), to enforce the assumptions that there are no kernel or pipe or
>> other odd variations being passed.
> 
> Umm...  And what should happen when O_DIRECT file gets passed to splice()?

Hi Al,

First of all, full disclosure: I still haven't worked through how
splice() handles pages in all cases. I was hoping to defer it, by
limiting this series to not *all* of the original callers of
iov_iter_get_pages*().

This series leaves the splice() code pointing to iov_iter_get_pages(),
but maybe that's not possible after all.

Any advice or ideas about how to solve the O_DIRECT-to-splice() is very
welcome.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-27 22:15   ` Al Viro
  2022-02-27 22:27     ` John Hubbard
@ 2022-02-28  3:29     ` John Hubbard
  1 sibling, 0 replies; 21+ messages in thread
From: John Hubbard @ 2022-02-28  3:29 UTC (permalink / raw)
  To: Al Viro, jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Miklos Szeredi,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML

On 2/27/22 14:15, Al Viro wrote:
> On Sun, Feb 27, 2022 at 01:34:31AM -0800, jhubbard.send.patches@gmail.com wrote:
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Upcoming changes to Direct IO will change it from acquiring pages via
>> get_user_pages_fast(), to calling pin_user_pages_fast() instead.
>>
>> Place a few assertions at key points, that the pages are IOVEC (user
>> pages), to enforce the assumptions that there are no kernel or pipe or
>> other odd variations being passed.
> 
> Umm...  And what should happen when O_DIRECT file gets passed to splice()?

OK, right, this is not going to work as-is.

Any of the remaining iov_iter_get_pages*() callers that do direct IO
(vmsplice, ceph, ...) will end up calling bio_release_pages(), which is
now unconditionally calling unpin_user_pages(). So this is just
completely broken.

And so, once again, I'm back to, "must convert the whole pile of
iov_iter_get_page*() callers at once".

Back to the drawing board.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-02-27  9:34 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN jhubbard.send.patches
@ 2022-02-28 15:59   ` Miklos Szeredi
  2022-02-28 21:16     ` John Hubbard
  0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2022-02-28 15:59 UTC (permalink / raw)
  To: jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML, John Hubbard

On Sun, 27 Feb 2022 at 10:34, <jhubbard.send.patches@gmail.com> wrote:
>
> From: John Hubbard <jhubbard@nvidia.com>
>
> Convert the fuse filesystem to support the new iov_iter_get_pages()
> behavior. That routine now invokes pin_user_pages_fast(), which means
> that such pages must be released via unpin_user_page(), rather than via
> put_page().
>
> This commit also removes any possibility of kernel pages being handled,
> in the fuse_get_user_pages() call. Although this may seem like a steep
> price to pay, Christoph Hellwig actually recommended it a few years ago
> for nearly the same situation [1].

This might work for O_DIRECT, but fuse has this mode of operation
which turns normal "buffered" I/O into direct I/O.  And that in turn
will break execve of such files.

So AFAICS we need to keep kvec handing in some way.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-02-28 15:59   ` Miklos Szeredi
@ 2022-02-28 21:16     ` John Hubbard
  2022-03-01  9:41       ` Miklos Szeredi
  0 siblings, 1 reply; 21+ messages in thread
From: John Hubbard @ 2022-02-28 21:16 UTC (permalink / raw)
  To: Miklos Szeredi, jhubbard.send.patches
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML

On 2/28/22 07:59, Miklos Szeredi wrote:
> On Sun, 27 Feb 2022 at 10:34, <jhubbard.send.patches@gmail.com> wrote:
>>
>> From: John Hubbard <jhubbard@nvidia.com>
>>
>> Convert the fuse filesystem to support the new iov_iter_get_pages()
>> behavior. That routine now invokes pin_user_pages_fast(), which means
>> that such pages must be released via unpin_user_page(), rather than via
>> put_page().
>>
>> This commit also removes any possibility of kernel pages being handled,
>> in the fuse_get_user_pages() call. Although this may seem like a steep
>> price to pay, Christoph Hellwig actually recommended it a few years ago
>> for nearly the same situation [1].
> 
> This might work for O_DIRECT, but fuse has this mode of operation
> which turns normal "buffered" I/O into direct I/O.  And that in turn
> will break execve of such files.
> 
> So AFAICS we need to keep kvec handing in some way.
> 

Thanks for bringing that up! Do you have any hints for me, to jump start
a deeper look? And especially, sample programs that exercise this?


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages
  2022-02-27 21:57   ` Jens Axboe
  2022-02-27 22:09     ` John Hubbard
@ 2022-02-28 22:49     ` John Hubbard
  1 sibling, 0 replies; 21+ messages in thread
From: John Hubbard @ 2022-02-28 22:49 UTC (permalink / raw)
  To: Jens Axboe, jhubbard.send.patches, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/27/22 13:57, Jens Axboe wrote:
>> +ssize_t iov_iter_pin_pages(struct iov_iter *i,
>> +		   struct page **pages, size_t maxsize, unsigned int maxpages,
>> +		   size_t *start)
>> +{
>> +	size_t len;
>> +	int n, res;
>> +
>> +	if (maxsize > i->count)
>> +		maxsize = i->count;
>> +	if (!maxsize)
>> +		return 0;
>> +
>> +	WARN_ON_ONCE(!iter_is_iovec(i));
>> +
>> +	if (likely(iter_is_iovec(i))) {
>> +		unsigned int gup_flags = 0;
>> +		unsigned long addr;
>> +
>> +		if (iov_iter_rw(i) != WRITE)
>> +			gup_flags |= FOLL_WRITE;
>> +		if (i->nofault)
>> +			gup_flags |= FOLL_NOFAULT;
>> +
>> +		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
>> +		n = DIV_ROUND_UP(len, PAGE_SIZE);
>> +		res = pin_user_pages_fast(addr, n, gup_flags, pages);
>> +		if (unlikely(res <= 0))
>> +			return res;
>> +		return (res == n ? len : res * PAGE_SIZE) - *start;
> 
> Trying to be clever like that just makes the code a lot less readable. I
> should not have to reason about a return value. Same in the other
> function.
> 

Here is a differential patch on top of this one, and only showing one of
the two routines. How does this direction look to you?


diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index e64e8e4edd0c..8e96f1e9ebc6 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1588,7 +1588,17 @@ ssize_t iov_iter_pin_pages(struct iov_iter *i,
  		res = pin_user_pages_fast(addr, n, gup_flags, pages);
  		if (unlikely(res <= 0))
  			return res;
-		return (res == n ? len : res * PAGE_SIZE) - *start;
+
+		/* Cap len at the number of pages that were actually pinned: */
+		if (res < n)
+			len = res * PAGE_SIZE;
+
+		/*
+		 * The return value is the amount pinned in bytes that the
+		 * caller will actually use. So, reduce it by the offset into
+		 * the first page:
+		 */
+		return len - *start;
  	}

  	return -EFAULT;

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-02-28 21:16     ` John Hubbard
@ 2022-03-01  9:41       ` Miklos Szeredi
  2022-03-02  8:07         ` John Hubbard
  0 siblings, 1 reply; 21+ messages in thread
From: Miklos Szeredi @ 2022-03-01  9:41 UTC (permalink / raw)
  To: John Hubbard
  Cc: jhubbard.send.patches, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Andrew Morton, Chaitanya Kulkarni, linux-block,
	linux-fsdevel, linux-xfs, linux-mm, LKML

On Mon, 28 Feb 2022 at 22:16, John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 2/28/22 07:59, Miklos Szeredi wrote:
> > On Sun, 27 Feb 2022 at 10:34, <jhubbard.send.patches@gmail.com> wrote:
> >>
> >> From: John Hubbard <jhubbard@nvidia.com>
> >>
> >> Convert the fuse filesystem to support the new iov_iter_get_pages()
> >> behavior. That routine now invokes pin_user_pages_fast(), which means
> >> that such pages must be released via unpin_user_page(), rather than via
> >> put_page().
> >>
> >> This commit also removes any possibility of kernel pages being handled,
> >> in the fuse_get_user_pages() call. Although this may seem like a steep
> >> price to pay, Christoph Hellwig actually recommended it a few years ago
> >> for nearly the same situation [1].
> >
> > This might work for O_DIRECT, but fuse has this mode of operation
> > which turns normal "buffered" I/O into direct I/O.  And that in turn
> > will break execve of such files.
> >
> > So AFAICS we need to keep kvec handing in some way.
> >
>
> Thanks for bringing that up! Do you have any hints for me, to jump start

How about just leaving that special code in place?   It bypasses page
refs and directly copies to the kernel buffer, so it should not have
any affect on the user page code.

> a deeper look? And especially, sample programs that exercise this?

Here's one:
# uncomment as appropriate:
#sudo dnf install fuse3-devel
#sudo apt install libfuse3-dev

cat <<EOF > fuse-dio-exec.c
#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <unistd.h>

static const char *filename = "/bin/true";

static int test_getattr(const char *path, struct stat *stbuf,
             struct fuse_file_info *fi)
{
    return lstat(filename, stbuf) == -1 ? -errno : 0;
}

static int test_open(const char *path, struct fuse_file_info *fi)
{
    int res;

    res = open(filename, fi->flags);
    if (res == -1)
        return -errno;

    fi->fh = res;
    fi->direct_io = 1;
    return 0;
}

static int test_read(const char *path, char *buf, size_t size, off_t offset,
              struct fuse_file_info *fi)
{
    int res = pread(fi->fh, buf, size, offset);
    return res == -1 ? -errno : res;
}

static int test_release(const char *path, struct fuse_file_info *fi)
{
    close(fi->fh);
    return 0;
}

static const struct fuse_operations test_oper = {
    .getattr    = test_getattr,
    .open        = test_open,
    .release    = test_release,
    .read        = test_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &test_oper, NULL);
}
EOF

gcc -W fuse-dio-exec.c `pkg-config fuse3 --cflags --libs` -o fuse-dio-exec
touch /tmp/true

#run test:
./fuse-dio-exec /tmp/true
/tmp/true
umount /tmp/true

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-03-01  9:41       ` Miklos Szeredi
@ 2022-03-02  8:07         ` John Hubbard
  0 siblings, 0 replies; 21+ messages in thread
From: John Hubbard @ 2022-03-02  8:07 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: jhubbard.send.patches, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Andrew Morton, Chaitanya Kulkarni, linux-block,
	linux-fsdevel, linux-xfs, linux-mm, LKML

On 3/1/22 01:41, Miklos Szeredi wrote:
...
>>> This might work for O_DIRECT, but fuse has this mode of operation
>>> which turns normal "buffered" I/O into direct I/O.  And that in turn
>>> will break execve of such files.
>>>
>>> So AFAICS we need to keep kvec handing in some way.
>>>
>>
>> Thanks for bringing that up! Do you have any hints for me, to jump start
> 
> How about just leaving that special code in place?   It bypasses page
> refs and directly copies to the kernel buffer, so it should not have
> any affect on the user page code.
> 

Good idea, I'll go that direction.

>> a deeper look? And especially, sample programs that exercise this?
> 
> Here's one:

This is really helpful, exactly what I was looking for.


thanks!
-- 
John Hubbard
NVIDIA

> # uncomment as appropriate:
> #sudo dnf install fuse3-devel
> #sudo apt install libfuse3-dev
> 
> cat <<EOF > fuse-dio-exec.c
> #define FUSE_USE_VERSION 31
> #include <fuse.h>
> #include <errno.h>
> #include <unistd.h>
> 
> static const char *filename = "/bin/true";
> 
> static int test_getattr(const char *path, struct stat *stbuf,
>               struct fuse_file_info *fi)
> {
>      return lstat(filename, stbuf) == -1 ? -errno : 0;
> }
> 
> static int test_open(const char *path, struct fuse_file_info *fi)
> {
>      int res;
> 
>      res = open(filename, fi->flags);
>      if (res == -1)
>          return -errno;
> 
>      fi->fh = res;
>      fi->direct_io = 1;
>      return 0;
> }
> 
> static int test_read(const char *path, char *buf, size_t size, off_t offset,
>                struct fuse_file_info *fi)
> {
>      int res = pread(fi->fh, buf, size, offset);
>      return res == -1 ? -errno : res;
> }
> 
> static int test_release(const char *path, struct fuse_file_info *fi)
> {
>      close(fi->fh);
>      return 0;
> }
> 
> static const struct fuse_operations test_oper = {
>      .getattr    = test_getattr,
>      .open        = test_open,
>      .release    = test_release,
>      .read        = test_read,
> };
> 
> int main(int argc, char *argv[])
> {
>      return fuse_main(argc, argv, &test_oper, NULL);
> }
> EOF
> 
> gcc -W fuse-dio-exec.c `pkg-config fuse3 --cflags --libs` -o fuse-dio-exec
> touch /tmp/true
> 
> #run test:
> ./fuse-dio-exec /tmp/true
> /tmp/true
> umount /tmp/true

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-03-02  8:07 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
2022-02-27  9:34 ` [PATCH 1/6] mm/gup: introduce pin_user_page() jhubbard.send.patches
2022-02-27  9:34 ` [PATCH 2/6] iov_iter: new iov_iter_pin_pages*(), for FOLL_PIN pages jhubbard.send.patches
2022-02-27 21:57   ` Jens Axboe
2022-02-27 22:09     ` John Hubbard
2022-02-28 22:49     ` John Hubbard
2022-02-27  9:34 ` [PATCH 3/6] block, fs: assert that key paths use iovecs, and nothing else jhubbard.send.patches
2022-02-27 21:58   ` Jens Axboe
2022-02-27 22:12     ` John Hubbard
2022-02-27 22:15   ` Al Viro
2022-02-27 22:27     ` John Hubbard
2022-02-28  3:29     ` John Hubbard
2022-02-27  9:34 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() jhubbard.send.patches
2022-02-27 21:59   ` Jens Axboe
2022-02-27 22:13     ` John Hubbard
2022-02-27  9:34 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages jhubbard.send.patches
2022-02-27  9:34 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN jhubbard.send.patches
2022-02-28 15:59   ` Miklos Szeredi
2022-02-28 21:16     ` John Hubbard
2022-03-01  9:41       ` Miklos Szeredi
2022-03-02  8:07         ` John Hubbard

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.