* [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast()
From: John Hubbard @ 2022-08-31  4:18 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML, John Hubbard

This is v2. Changes since v1 are:

* Incorporated feedback from Al Viro and Jan Kara: this approach now
pins both bvecs (ITER_BVEC) and user pages (user_backed_iter()) with
FOLL_PIN.

* Incorporated David Hildenbrand's feedback: rewrote the pin_user_page()
documentation and added a WARN_ON_ONCE() to somewhat enforce the rule
that this new function is only intended for use on file-backed pages.

* Added a tiny new patch to fix up release_pages()'s number-of-pages
argument, so as to avoid a lot of impedance-matching checks in
subsequent patches.

v1 is here:

https://lore.kernel.org/all/20220827083607.2345453-1-jhubbard@nvidia.com/

Original cover letter still applies; here it is for convenience:

This converts the iomap core and bio_release_pages() to
pin_user_pages_fast(), also referred to as FOLL_PIN here.

The conversion is temporarily guarded by
CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO. In the future (not part of this
series), when we are certain that all filesystems have converted their
Direct IO paths to FOLL_PIN, then we can do the final step, which is to
get rid of CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO and search-and-replace
the dio_w_*() functions with their final names (see bvec.h changes).
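
To illustrate (a sketch only, using names from this series' bvec.h
changes): a guarded call site that today reads

	size = dio_w_iov_iter_pin_pages(iter, pages,
					UINT_MAX - bio->bi_iter.bi_size,
					nr_pages, &offset);

would, after that final search-and-replace, simply read

	size = iov_iter_pin_pages(iter, pages,
				  UINT_MAX - bio->bi_iter.bi_size,
				  nr_pages, &offset);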

I'd like to get this part committed at some point, because it seems to
work well already. And this will help get the remaining items, below,
converted.

Status: although many filesystems have been converted, a few
filesystems and other subsystems remain to be investigated. These
include (you can recreate this list by grepping for iov_iter_get_pages):

	cephfs
	cifs
	9P
	RDS
	net/core: datagram.c, skmsg.c
	net/tls
	fs/splice.c

Testing: this passes some light LTP and xfstests runs, fio, and a few
other things like that, on my local x86_64 test machine, both with and
without CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO set.

Conflicts: Logan, the iov_iter parts of this will conflict with your
[PATCH v9 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags(),
but I think it's easy to resolve.


John Hubbard (7):
  mm: change release_pages() to use unsigned long for npages
  mm/gup: introduce pin_user_page()
  block: add dio_w_*() wrappers for pin, unpin user pages
  iov_iter: new iov_iter_pin_pages*() routines
  block, bio, fs: convert most filesystems to pin_user_pages_fast()
  NFS: direct-io: convert to FOLL_PIN pages
  fuse: convert direct IO paths to use FOLL_PIN

 block/Kconfig        | 24 +++++++++++++
 block/bio.c          | 27 +++++++-------
 block/blk-map.c      |  7 ++--
 fs/direct-io.c       | 40 ++++++++++-----------
 fs/fuse/dev.c        | 11 ++++--
 fs/fuse/file.c       | 32 +++++++++++------
 fs/fuse/fuse_i.h     |  1 +
 fs/iomap/direct-io.c |  2 +-
 fs/nfs/direct.c      | 22 ++++++------
 include/linux/bvec.h | 37 +++++++++++++++++++
 include/linux/mm.h   |  3 +-
 include/linux/uio.h  |  4 +++
 lib/iov_iter.c       | 86 ++++++++++++++++++++++++++++++++++++++++----
 mm/gup.c             | 50 ++++++++++++++++++++++++++
 mm/swap.c            |  6 ++--
 15 files changed, 282 insertions(+), 70 deletions(-)


base-commit: dcf8e5633e2e69ad60b730ab5905608b756a032f
-- 
2.37.2



* [PATCH v2 1/7] mm: change release_pages() to use unsigned long for npages
From: John Hubbard @ 2022-08-31  4:18 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML, John Hubbard

The various callers of release_pages() pass in integers of assorted
types (signed or unsigned) and widths (int or long) for the second
argument (number of pages). To make the subsequent conversions in this
series accurate, and to avoid having to check for overflow (or deal
with type conversion warnings), change release_pages() to accept an
unsigned long for the number of pages.

Also rename the argument from "nr" to "npages" for clarity, since that
line is being changed anyway.
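
For example, a hypothetical caller (illustration only, not part of this
patch) that tracks its page count as an unsigned long can now call
release_pages() directly, with no narrowing conversion or overflow
check:

	/* Hypothetical caller, shown only to illustrate the new type: */
	static void example_put_pages(struct page **pages,
				      unsigned long npages)
	{
		/* npages no longer needs to be clamped to INT_MAX: */
		release_pages(pages, npages);
	}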

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h | 2 +-
 mm/swap.c          | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..61c5dc37370e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1145,7 +1145,7 @@ static inline void folio_put_refs(struct folio *folio, int refs)
 		__folio_put(folio);
 }
 
-void release_pages(struct page **pages, int nr);
+void release_pages(struct page **pages, unsigned long npages);
 
 /**
  * folios_put - Decrement the reference count on an array of folios.
diff --git a/mm/swap.c b/mm/swap.c
index 9cee7f6a3809..ac6482d86187 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -931,15 +931,15 @@ void lru_cache_disable(void)
  * Decrement the reference count on all the pages in @pages.  If it
  * fell to zero, remove the page from the LRU and free it.
  */
-void release_pages(struct page **pages, int nr)
+void release_pages(struct page **pages, unsigned long npages)
 {
-	int i;
+	unsigned long i;
 	LIST_HEAD(pages_to_free);
 	struct lruvec *lruvec = NULL;
 	unsigned long flags = 0;
 	unsigned int lock_batch;
 
-	for (i = 0; i < nr; i++) {
+	for (i = 0; i < npages; i++) {
 		struct folio *folio = page_folio(pages[i]);
 
 		/*
-- 
2.37.2



* [PATCH v2 2/7] mm/gup: introduce pin_user_page()
From: John Hubbard @ 2022-08-31  4:18 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML, John Hubbard

pin_user_page() is an externally-usable version of try_grab_page(), but
with semantics that match get_page(), so that it can act as a drop-in
replacement for get_page(). Specifically, pin_user_page() has a void
return type.

pin_user_page() elevates a page's refcount using FOLL_PIN rules. This
means that the caller must release the page via unpin_user_page().
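
Example usage (a sketch; it assumes the caller already holds a
reference to a file-backed page, per the rules above):

	/* Take an additional, FOLL_PIN-accounted reference: */
	pin_user_page(page);

	/* ...hand the page over to the Direct IO machinery... */

	/* Release via the FOLL_PIN counterpart of put_page(): */
	unpin_user_page(page);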

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 50 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 61c5dc37370e..c6c98d9c38ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1876,6 +1876,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+void pin_user_page(struct page *page);
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..2c231dca39dd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3213,6 +3213,56 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(pin_user_pages);
 
+/**
+ * pin_user_page() - apply a FOLL_PIN reference to a file-backed page that the
+ * caller already owns.
+ *
+ * @page: the page to be pinned.
+ *
+ * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
+ * that the caller must release the page via unpin_user_page().
+ *
+ * pin_user_page() is intended as a drop-in replacement for get_page(). This
+ * provides a way for callers to do a subsequent unpin_user_page() on the
+ * affected page. However, it is only intended for use by callers (file systems,
+ * block/bio) that have a file-backed page. Anonymous pages are not expected nor
+ * supported, and will generate a warning.
+ *
+ * pin_user_page() may also be thought of as an externally-usable version of
+ * try_grab_page(), but with semantics that match get_page(), so that it can act
+ * as a drop-in replacement for get_page().
+ *
+ * IMPORTANT: The caller must release the page via unpin_user_page().
+ *
+ */
+void pin_user_page(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
+
+	/*
+	 * This function is only intended for file-backed callers, who already
+	 * have a page reference.
+	 */
+	WARN_ON_ONCE(PageAnon(page));
+
+	/*
+	 * Similar to try_grab_page(): be sure to *also*
+	 * increment the normal page refcount field at least once,
+	 * so that the page really is pinned.
+	 */
+	if (folio_test_large(folio)) {
+		folio_ref_add(folio, 1);
+		atomic_add(1, folio_pincount_ptr(folio));
+	} else {
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+
+	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+}
+EXPORT_SYMBOL(pin_user_page);
+
 /*
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
-- 
2.37.2



* [PATCH v2 3/7] block: add dio_w_*() wrappers for pin, unpin user pages
From: John Hubbard @ 2022-08-31  4:18 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML, John Hubbard

Background: the Direct IO part of the block infrastructure is being
changed over to use pin_user_page*() and unpin_user_page*() calls, in
place of a mix of get_user_pages_fast(), get_page(), and put_page().
These all have to be changed at the same time, for block, bio, and all
filesystems. However, most filesystems can be converted via the iomap
and core filesystem routines, so let's get those in place first, and
then continue with converting the remaining filesystems (9P, CIFS) and
anything else that feeds pages into a bio that are ultimately released
via bio_release_pages().

Add a new config parameter, CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and
dio_w_*() wrapper functions. The dio_w_ prefix was chosen for
uniqueness, so as to ease a subsequent kernel-wide rename via
search-and-replace. Together, these allow the developer to choose
between the following sets of routines for Direct IO code paths:

a) pin_user_pages_fast()
    pin_user_page()
    unpin_user_page()

b) get_user_pages_fast()
    get_page()
    put_page()

CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO is a temporary setting, and may
be deleted once the conversion is complete. In the meantime, developers
can enable this in order to try out each filesystem.

Please remember that the two /proc/vmstat items below should normally
be equal to each other, except while pin/unpin operations are in
flight. As such, they can be helpful when monitoring test runs:

    nr_foll_pin_acquired
    nr_foll_pin_released
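
For example, a converted Direct IO call site looks approximately like
this (a sketch only; variable names are illustrative):

	/* Resolves to pin_user_pages_fast() or get_user_pages_fast(),
	 * depending on CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO:
	 */
	res = dio_w_pin_user_pages_fast(start, nr_pages, gup_flags, pages);
	if (res > 0) {
		/* ...do Direct IO on the pages... */

		/* Resolves to unpin_user_pages() or release_pages(): */
		dio_w_unpin_user_pages(pages, res);
	}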

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/Kconfig        | 24 ++++++++++++++++++++++++
 include/linux/bvec.h | 37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 444c5ab3b67e..d4fdd606d138 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -48,6 +48,30 @@ config BLK_DEV_BSG_COMMON
 config BLK_ICQ
 	bool
 
+config BLK_USE_PIN_USER_PAGES_FOR_DIO
+	bool "DEVELOPERS ONLY: Enable pin_user_pages() for Direct IO" if EXPERT
+	default n
+
+	help
+	  For Direct IO code, retain the pages via calls to
+	  pin_user_pages_fast(), instead of via get_user_pages_fast().
+	  Likewise, use pin_user_page() instead of get_page(). And then
+	  release such pages via unpin_user_page(), instead of
+	  put_page().
+
+	  This is a temporary setting, which will be deleted once the
+	  conversion is completed, reviewed, and tested. In the meantime,
+	  developers can enable this in order to try out each filesystem.
+	  For that, it's best to monitor these /proc/vmstat items:
+
+		nr_foll_pin_acquired
+		nr_foll_pin_released
+
+	  ...to ensure that they remain equal, when "at rest".
+
+	  Say yes here ONLY if you are actively developing or testing the
+	  block layer or filesystems with pin_user_pages_fast().
+
 config BLK_DEV_BSGLIB
 	bool "Block layer SG support v4 helper lib"
 	select BLK_DEV_BSG_COMMON
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 35c25dff651a..952d869702d2 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -241,4 +241,41 @@ static inline void *bvec_virt(struct bio_vec *bvec)
 	return page_address(bvec->bv_page) + bvec->bv_offset;
 }
 
+#ifdef CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO
+#define dio_w_pin_user_pages_fast(s, n, p, f)	pin_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p)			pin_user_page(p)
+#define dio_w_iov_iter_pin_pages(i, p, m, n, s) iov_iter_pin_pages(i, p, m, n, s)
+#define dio_w_iov_iter_pin_pages_alloc(i, p, m, s) iov_iter_pin_pages_alloc(i, p, m, s)
+#define dio_w_unpin_user_page(p)		unpin_user_page(p)
+#define dio_w_unpin_user_pages(p, n)		unpin_user_pages(p, n)
+#define dio_w_unpin_user_pages_dirty_lock(p, n, d) unpin_user_pages_dirty_lock(p, n, d)
+
+#else
+#define dio_w_pin_user_pages_fast(s, n, p, f)	get_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p)			get_page(p)
+#define dio_w_iov_iter_pin_pages(i, p, m, n, s) iov_iter_get_pages2(i, p, m, n, s)
+#define dio_w_iov_iter_pin_pages_alloc(i, p, m, s) iov_iter_get_pages_alloc2(i, p, m, s)
+#define dio_w_unpin_user_page(p)		put_page(p)
+
+static inline void dio_w_unpin_user_pages(struct page **pages,
+					  unsigned long npages)
+{
+	release_pages(pages, npages);
+}
+
+static inline void dio_w_unpin_user_pages_dirty_lock(struct page **pages,
+						     unsigned long npages,
+						     bool make_dirty)
+{
+	unsigned long i;
+
+	for (i = 0; i < npages; i++) {
+		if (make_dirty)
+			set_page_dirty_lock(pages[i]);
+		put_page(pages[i]);
+	}
+}
+
+#endif
+
 #endif /* __LINUX_BVEC_H */
-- 
2.37.2



* [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
From: John Hubbard @ 2022-08-31  4:18 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML, John Hubbard

Provide two new wrapper routines that are intended for user-space-backed
and ITER_BVEC pages:

    iov_iter_pin_pages()
    iov_iter_pin_pages_alloc()

Internally, these routines call pin_user_pages_fast(), instead of
get_user_pages_fast(), for the user_backed_iter(i) and
iov_iter_is_bvec(i) cases.

As always, callers must use unpin_user_pages() or a suitable FOLL_PIN
variant, to release the pages, if they actually were acquired via
pin_user_pages_fast().

This is a prerequisite to converting bio/block layers over to use
pin_user_pages_fast().
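
Expected usage is along these lines (a sketch, with error handling
abbreviated; DIO_PAGES merely stands in for the caller's array size,
and the iter is assumed to be of a type whose pages actually get
pinned):

	struct page *pages[DIO_PAGES];
	size_t offset;
	ssize_t bytes;

	bytes = iov_iter_pin_pages(i, pages, LONG_MAX, DIO_PAGES, &offset);
	if (bytes <= 0)
		return bytes;
	/* ...submit IO on the pinned pages... */
	unpin_user_pages(pages, DIV_ROUND_UP(offset + bytes, PAGE_SIZE));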

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/uio.h |  4 +++
 lib/iov_iter.c      | 86 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5896af36199c..e26908e443d1 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -251,6 +251,10 @@ ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
+ssize_t iov_iter_pin_pages(struct iov_iter *i, struct page **pages,
+			size_t maxsize, unsigned int maxpages, size_t *start);
+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i, struct page ***pages,
+			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
 void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 4b7fce72e3e5..c63ce0eadfcb 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1425,9 +1425,31 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 	return page;
 }
 
+enum pages_alloc_internal_flags {
+	USE_FOLL_GET,
+	MAYBE_USE_FOLL_PIN
+};
+
+/*
+ * Pins pages, either via get_page(), or via pin_user_page*(). The caller is
+ * responsible for tracking which pinning mechanism was used here, and for
+ * releasing pages via the matching call: put_page() or unpin_user_page().
+ *
+ * The way to figure that out is:
+ *
+ *     a) If how_to_pin == USE_FOLL_GET, then this routine will always pin
+ *        via get_page().
+ *
+ *     b) If how_to_pin == MAYBE_USE_FOLL_PIN, then this routine will pin
+ *        via pin_user_page*() for either the user_backed_iter(i) case or
+ *        the iov_iter_is_bvec(i) case. However, for the other cases (pipe,
+ *        xarray), pages will be pinned via get_page().
+ */
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-		   unsigned int maxpages, size_t *start)
+		   unsigned int maxpages, size_t *start,
+		   enum pages_alloc_internal_flags how_to_pin)
+
 {
 	unsigned int n;
 
@@ -1454,7 +1476,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		n = want_pages_array(pages, maxsize, *start, maxpages);
 		if (!n)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n, gup_flags, *pages);
+
+		if (how_to_pin == MAYBE_USE_FOLL_PIN)
+			res = pin_user_pages_fast(addr, n, gup_flags, *pages);
+		else
+			res = get_user_pages_fast(addr, n, gup_flags, *pages);
+
 		if (unlikely(res <= 0))
 			return res;
 		maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - *start);
@@ -1470,8 +1497,13 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		if (!n)
 			return -ENOMEM;
 		p = *pages;
-		for (int k = 0; k < n; k++)
-			get_page(p[k] = page + k);
+		for (int k = 0; k < n; k++) {
+			p[k] = page + k;
+			if (how_to_pin == MAYBE_USE_FOLL_PIN)
+				pin_user_page(p[k]);
+			else
+				get_page(p[k]);
+		}
 		maxsize = min_t(size_t, maxsize, n * PAGE_SIZE - *start);
 		i->count -= maxsize;
 		i->iov_offset += maxsize;
@@ -1497,10 +1529,29 @@ ssize_t iov_iter_get_pages2(struct iov_iter *i,
 		return 0;
 	BUG_ON(!pages);
 
-	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
+	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start,
+					  USE_FOLL_GET);
 }
 EXPORT_SYMBOL(iov_iter_get_pages2);
 
+/*
+ * A FOLL_PIN variant that calls pin_user_pages_fast() instead of
+ * get_user_pages_fast().
+ */
+ssize_t iov_iter_pin_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned int maxpages,
+		   size_t *start)
+{
+	if (!maxpages)
+		return 0;
+	if (WARN_ON_ONCE(!pages))
+		return -EINVAL;
+
+	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start,
+					  MAYBE_USE_FOLL_PIN);
+}
+EXPORT_SYMBOL(iov_iter_pin_pages);
+
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -1509,7 +1560,8 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 
 	*pages = NULL;
 
-	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
+	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
+					 USE_FOLL_GET);
 	if (len <= 0) {
 		kvfree(*pages);
 		*pages = NULL;
@@ -1518,6 +1570,28 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
 
+/*
+ * A FOLL_PIN variant that calls pin_user_pages_fast() instead of
+ * get_user_pages_fast().
+ */
+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	ssize_t len;
+
+	*pages = NULL;
+
+	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
+					 MAYBE_USE_FOLL_PIN);
+	if (len <= 0) {
+		kvfree(*pages);
+		*pages = NULL;
+	}
+	return len;
+}
+EXPORT_SYMBOL(iov_iter_pin_pages_alloc);
+
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
 {
-- 
2.37.2



* [PATCH v2 5/7] block, bio, fs: convert most filesystems to pin_user_pages_fast()
From: John Hubbard @ 2022-08-31  4:18 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML, John Hubbard

Use dio_w_*() wrapper calls in place of iov_iter_get_pages2(),
iov_iter_get_pages_alloc2(), get_page(), and put_page().

This converts the Direct IO parts of most filesystems over to using
FOLL_PIN (pin_user_page*()) page pinning.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c          | 27 ++++++++++++++-------------
 block/blk-map.c      |  7 ++++---
 fs/direct-io.c       | 40 ++++++++++++++++++++--------------------
 fs/iomap/direct-io.c |  2 +-
 4 files changed, 39 insertions(+), 37 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3d3a2678fea2..6c6110f7054e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1125,7 +1125,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		dio_w_unpin_user_page(bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1162,7 +1162,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		dio_w_unpin_user_page(page);
 	return 0;
 }
 
@@ -1176,7 +1176,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		dio_w_unpin_user_page(page);
 	return 0;
 }
 
@@ -1187,10 +1187,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Pins pages from *iter and appends them to @bio's bvec array. The pages
+ * will have to be released using dio_w_unpin_user_page() when done. For
+ * multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1218,8 +1218,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset);
+	size = dio_w_iov_iter_pin_pages(iter, pages,
+					UINT_MAX - bio->bi_iter.bi_size,
+					nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1252,7 +1253,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		dio_w_unpin_user_page(pages[i++]);
 
 	return ret;
 }
@@ -1444,9 +1445,9 @@ void bio_set_pages_dirty(struct bio *bio)
  * have been written out during the direct-IO read.  So we take another ref on
  * the BIO and re-dirty the pages in process context.
  *
- * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * It is expected that bio_check_pages_dirty() will wholly own the BIO from here
+ * on.  It will run one dio_w_unpin_user_page() against each page and will run
+ * one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index 7196a6b64c80..4e333ad9776d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -254,7 +254,8 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc2(iter, &pages, LONG_MAX, &offs);
+		bytes = dio_w_iov_iter_pin_pages_alloc(iter, &pages, LONG_MAX,
+						       &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -276,7 +277,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						dio_w_unpin_user_page(page);
 					break;
 				}
 
@@ -289,7 +290,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			dio_w_unpin_user_page(pages[j++]);
 		kvfree(pages);
 		/* couldn't stuff something into bio? */
 		if (bytes) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index f669163d5860..05c044c55374 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -169,8 +169,8 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 	const enum req_op dio_op = dio->opf & REQ_OP_MASK;
 	ssize_t ret;
 
-	ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
-				&sdio->from);
+	ret = dio_w_iov_iter_pin_pages(sdio->iter, dio->pages, LONG_MAX,
+				       DIO_PAGES, &sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
 		struct page *page = ZERO_PAGE(0);
@@ -181,7 +181,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
+		dio_w_pin_user_page(page);
 		dio->pages[0] = page;
 		sdio->head = 0;
 		sdio->tail = 1;
@@ -197,7 +197,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		sdio->to = ((ret - 1) & (PAGE_SIZE - 1)) + 1;
 		return 0;
 	}
-	return ret;	
+	return ret;
 }
 
 /*
@@ -324,7 +324,7 @@ static void dio_aio_complete_work(struct work_struct *work)
 static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio);
 
 /*
- * Asynchronous IO callback. 
+ * Asynchronous IO callback.
  */
 static void dio_bio_end_aio(struct bio *bio)
 {
@@ -449,7 +449,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
 	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+		dio_w_unpin_user_page(dio->pages[sdio->head++]);
 }
 
 /*
@@ -716,7 +716,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 		 */
 		if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
 			sdio->pages_in_io--;
-		get_page(sdio->cur_page);
+		dio_w_pin_user_page(sdio->cur_page);
 		sdio->final_block_in_bio = sdio->cur_page_block +
 			(sdio->cur_page_len >> sdio->blkbits);
 		ret = 0;
@@ -725,7 +725,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 	}
 	return ret;
 }
-		
+
 /*
  * Put cur_page under IO.  The section of cur_page which is described by
  * cur_page_offset,cur_page_len is put into a BIO.  The section of cur_page
@@ -787,7 +787,7 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
  * An autonomous function to put a chunk of a page under deferred IO.
  *
  * The caller doesn't actually know (or care) whether this piece of page is in
- * a BIO, or is under IO or whatever.  We just take care of all possible 
+ * a BIO, or is under IO or whatever.  We just take care of all possible
  * situations here.  The separation between the logic of do_direct_IO() and
  * that of submit_page_section() is important for clarity.  Please don't break.
  *
@@ -832,13 +832,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		dio_w_unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	dio_w_pin_user_page(page); /* It is in dio */
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -853,7 +853,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		dio_w_unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -890,7 +890,7 @@ static inline void dio_zero_block(struct dio *dio, struct dio_submit *sdio,
 	 * We need to zero out part of an fs block.  It is either at the
 	 * beginning or the end of the fs block.
 	 */
-	if (end) 
+	if (end)
 		this_chunk_blocks = dio_blocks_per_fs_block - this_chunk_blocks;
 
 	this_chunk_bytes = this_chunk_blocks << sdio->blkbits;
@@ -954,7 +954,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					dio_w_unpin_user_page(page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -999,7 +999,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio_op == REQ_OP_WRITE) {
-					put_page(page);
+					dio_w_unpin_user_page(page);
 					return -ENOTBLK;
 				}
 
@@ -1012,7 +1012,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					dio_w_unpin_user_page(page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1052,7 +1052,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				dio_w_unpin_user_page(page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1067,8 +1067,8 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				break;
 		}
 
-		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		/* Drop the ref which was taken in [get|pin]_user_pages() */
+		dio_w_unpin_user_page(page);
 	}
 out:
 	return ret;
@@ -1288,7 +1288,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		dio_w_unpin_user_page(sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 4eb559a16c9e..fc7763c418d1 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
+	dio_w_pin_user_page(page);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
-- 
2.37.2



* [PATCH v2 6/7] NFS: direct-io: convert to FOLL_PIN pages
From: John Hubbard @ 2022-08-31  4:18 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML, John Hubbard

Convert the NFS Direct IO layer to use pin_user_pages_fast() and
unpin_user_page(), instead of get_user_pages_fast() and put_page().

The use of pin_user_pages_fast() depends upon:

1) CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and

2) User-space-backed pages or ITER_BVEC pages: user_backed_iter(i) or
   iov_iter_is_bvec(i) == true

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/nfs/direct.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1707f46b1335..71b794f39ee2 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -142,11 +142,13 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 	return 0;
 }
 
-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
+static void nfs_direct_release_pages(struct iov_iter *iter, struct page **pages,
+				     unsigned int npages)
 {
-	unsigned int i;
-	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
+	if (user_backed_iter(iter) || iov_iter_is_bvec(iter))
+		dio_w_unpin_user_pages(pages, npages);
+	else
+		release_pages(pages, npages);
 }
 
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
@@ -332,11 +334,11 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
+		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
 						  rsize, &pgbase);
 		if (result < 0)
 			break;
-	
+
 		bytes = result;
 		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
 		for (i = 0; i < npages; i++) {
@@ -362,7 +364,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		nfs_direct_release_pages(iter, pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
@@ -791,8 +793,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
-						  wsize, &pgbase);
+		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
+							wsize, &pgbase);
 		if (result < 0)
 			break;
 
@@ -829,7 +831,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		nfs_direct_release_pages(iter, pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
-- 
2.37.2



* [PATCH v2 7/7] fuse: convert direct IO paths to use FOLL_PIN
From: John Hubbard @ 2022-08-31  4:18 UTC
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML, John Hubbard

Convert the fuse filesystem to use pin_user_pages_fast() and
unpin_user_page(), instead of get_user_pages_fast() and put_page().

The use of pin_user_pages_fast() depends upon:

1) CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and

2) User-space-backed pages or ITER_BVEC pages.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/fuse/dev.c    | 11 +++++++++--
 fs/fuse/file.c   | 32 +++++++++++++++++++++-----------
 fs/fuse/fuse_i.h |  1 +
 3 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 51897427a534..5de98a7a45b1 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -675,7 +675,12 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
 			flush_dcache_page(cs->pg);
 			set_page_dirty_lock(cs->pg);
 		}
-		put_page(cs->pg);
+		if (!cs->pipebufs &&
+		    (user_backed_iter(cs->iter) || iov_iter_is_bvec(cs->iter)))
+			dio_w_unpin_user_page(cs->pg);
+
+		else
+			put_page(cs->pg);
 	}
 	cs->pg = NULL;
 }
@@ -730,7 +735,9 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 		}
 	} else {
 		size_t off;
-		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
+
+		err = dio_w_iov_iter_pin_pages(cs->iter, &page, PAGE_SIZE, 1,
+					       &off);
 		if (err < 0)
 			return err;
 		BUG_ON(!err);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1a3afd469e3a..01da38928d0b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -625,14 +625,19 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
 }
 
 static void fuse_release_user_pages(struct fuse_args_pages *ap,
-				    bool should_dirty)
+				    bool should_dirty, bool is_user_or_bvec)
 {
 	unsigned int i;
 
-	for (i = 0; i < ap->num_pages; i++) {
-		if (should_dirty)
-			set_page_dirty_lock(ap->pages[i]);
-		put_page(ap->pages[i]);
+	if (is_user_or_bvec) {
+		dio_w_unpin_user_pages_dirty_lock(ap->pages, ap->num_pages,
+						  should_dirty);
+	} else {
+		for (i = 0; i < ap->num_pages; i++) {
+			if (should_dirty)
+				set_page_dirty_lock(ap->pages[i]);
+			put_page(ap->pages[i]);
+		}
 	}
 }
 
@@ -733,7 +738,7 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
 	struct fuse_io_priv *io = ia->io;
 	ssize_t pos = -1;
 
-	fuse_release_user_pages(&ia->ap, io->should_dirty);
+	fuse_release_user_pages(&ia->ap, io->should_dirty, io->is_user_or_bvec);
 
 	if (err) {
 		/* Nothing */
@@ -1414,10 +1419,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
 		unsigned npages;
 		size_t start;
-		ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
-					*nbytesp - nbytes,
-					max_pages - ap->num_pages,
-					&start);
+		ret = dio_w_iov_iter_pin_pages(ii, &ap->pages[ap->num_pages],
+					       *nbytesp - nbytes,
+					       max_pages - ap->num_pages,
+					       &start);
 		if (ret < 0)
 			break;
 
@@ -1483,6 +1488,10 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 		fl_owner_t owner = current->files;
 		size_t nbytes = min(count, nmax);
 
+		/* For use in fuse_release_user_pages(): */
+		io->is_user_or_bvec = user_backed_iter(iter) ||
+				      iov_iter_is_bvec(iter);
+
 		err = fuse_get_user_pages(&ia->ap, iter, &nbytes, write,
 					  max_pages);
 		if (err && !nbytes)
@@ -1498,7 +1507,8 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 		}
 
 		if (!io->async || nres < 0) {
-			fuse_release_user_pages(&ia->ap, io->should_dirty);
+			fuse_release_user_pages(&ia->ap, io->should_dirty,
+						io->is_user_or_bvec);
 			fuse_io_free(ia);
 		}
 		ia = NULL;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 488b460e046f..6ee7f72e29eb 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -290,6 +290,7 @@ struct fuse_io_priv {
 	struct kiocb *iocb;
 	struct completion *done;
 	bool blocking;
+	bool is_user_or_bvec;
 };
 
 #define FUSE_IO_PRIV_SYNC(i) \
-- 
2.37.2



* Re: [PATCH v2 7/7] fuse: convert direct IO paths to use FOLL_PIN
From: Miklos Szeredi @ 2022-08-31 10:37 UTC
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On Wed, 31 Aug 2022 at 06:19, John Hubbard <jhubbard@nvidia.com> wrote:
>
> Convert the fuse filesystem to use pin_user_pages_fast() and
> unpin_user_page(), instead of get_user_pages_fast() and put_page().
>
> The use of pin_user_pages_fast() depends upon:
>
> 1) CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and
>
> 2) User-space-backed pages or ITER_BVEC pages.
>
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  fs/fuse/dev.c    | 11 +++++++++--
>  fs/fuse/file.c   | 32 +++++++++++++++++++++-----------
>  fs/fuse/fuse_i.h |  1 +
>  3 files changed, 31 insertions(+), 13 deletions(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 51897427a534..5de98a7a45b1 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -675,7 +675,12 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
>                         flush_dcache_page(cs->pg);
>                         set_page_dirty_lock(cs->pg);
>                 }
> -               put_page(cs->pg);
> +               if (!cs->pipebufs &&
> +                   (user_backed_iter(cs->iter) || iov_iter_is_bvec(cs->iter)))
> +                       dio_w_unpin_user_page(cs->pg);
> +
> +               else
> +                       put_page(cs->pg);

Why not move the logic into a helper and pass a "bool pinned" argument?
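
For illustration, such a helper might look like this (an untested
sketch; "fuse_put_page" is just a placeholder name):

	static void fuse_put_page(struct page *page, bool pinned)
	{
		if (pinned)
			dio_w_unpin_user_page(page);
		else
			put_page(page);
	}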

>         }
>         cs->pg = NULL;
>  }
> @@ -730,7 +735,9 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>                 }
>         } else {
>                 size_t off;
> -               err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
> +
> +               err = dio_w_iov_iter_pin_pages(cs->iter, &page, PAGE_SIZE, 1,
> +                                              &off);
>                 if (err < 0)
>                         return err;
>                 BUG_ON(!err);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 1a3afd469e3a..01da38928d0b 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -625,14 +625,19 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
>  }
>
>  static void fuse_release_user_pages(struct fuse_args_pages *ap,
> -                                   bool should_dirty)
> +                                   bool should_dirty, bool is_user_or_bvec)
>  {
>         unsigned int i;
>
> -       for (i = 0; i < ap->num_pages; i++) {
> -               if (should_dirty)
> -                       set_page_dirty_lock(ap->pages[i]);
> -               put_page(ap->pages[i]);
> +       if (is_user_or_bvec) {
> +               dio_w_unpin_user_pages_dirty_lock(ap->pages, ap->num_pages,
> +                                                 should_dirty);
> +       } else {
> +               for (i = 0; i < ap->num_pages; i++) {
> +                       if (should_dirty)
> +                               set_page_dirty_lock(ap->pages[i]);
> +                       put_page(ap->pages[i]);
> +               }

Same here.

>         }
>  }
>
> @@ -733,7 +738,7 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
>         struct fuse_io_priv *io = ia->io;
>         ssize_t pos = -1;
>
> -       fuse_release_user_pages(&ia->ap, io->should_dirty);
> +       fuse_release_user_pages(&ia->ap, io->should_dirty, io->is_user_or_bvec);
>
>         if (err) {
>                 /* Nothing */
> @@ -1414,10 +1419,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>         while (nbytes < *nbytesp && ap->num_pages < max_pages) {
>                 unsigned npages;
>                 size_t start;
> -               ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
> -                                       *nbytesp - nbytes,
> -                                       max_pages - ap->num_pages,
> -                                       &start);
> +               ret = dio_w_iov_iter_pin_pages(ii, &ap->pages[ap->num_pages],
> +                                              *nbytesp - nbytes,
> +                                              max_pages - ap->num_pages,
> +                                              &start);
>                 if (ret < 0)
>                         break;
>
> @@ -1483,6 +1488,10 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>                 fl_owner_t owner = current->files;
>                 size_t nbytes = min(count, nmax);
>
> +               /* For use in fuse_release_user_pages(): */
> +               io->is_user_or_bvec = user_backed_iter(iter) ||
> +                                     iov_iter_is_bvec(iter);
> +

How about io->is_pinned?  And a iov_iter_is_pinned() helper?
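
For illustration, an untested sketch of what I mean:

	static inline bool iov_iter_is_pinned(const struct iov_iter *i)
	{
		return user_backed_iter(i) || iov_iter_is_bvec(i);
	}

and then fuse_direct_io() could just do:

	io->is_pinned = iov_iter_is_pinned(iter);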

Thanks,
Miklos


* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
From: Al Viro @ 2022-09-01  0:42 UTC
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On Tue, Aug 30, 2022 at 09:18:40PM -0700, John Hubbard wrote:
> Provide two new wrapper routines that are intended for user-space-backed
> and ITER_BVEC pages:
> 
>     iov_iter_pin_pages()
>     iov_iter_pin_pages_alloc()
> 
> Internally, these routines call pin_user_pages_fast(), instead of
> get_user_pages_fast(), for the user_backed_iter(i) and
> iov_iter_is_bvec(i) cases.
> 
> As always, callers must use unpin_user_pages() or a suitable FOLL_PIN
> variant, to release the pages, if they actually were acquired via
> pin_user_pages_fast().
> 
> This is a prerequisite to converting bio/block layers over to use
> pin_user_pages_fast().

What of ITER_PIPE (splice from an O_DIRECT fd to a pipe, for a filesystem
that uses generic_file_splice_read())?


* Re: [PATCH v2 7/7] fuse: convert direct IO paths to use FOLL_PIN
From: John Hubbard @ 2022-09-01  1:33 UTC
  To: Miklos Szeredi
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 8/31/22 03:37, Miklos Szeredi wrote:

Hi Miklos,

Thanks for looking at this; I'll accept all of these suggestions.

> On Wed, 31 Aug 2022 at 06:19, John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> Convert the fuse filesystem to use pin_user_pages_fast() and
>> unpin_user_page(), instead of get_user_pages_fast() and put_page().
>>
>> The use of pin_user_pages_fast() depends upon:
>>
>> 1) CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and
>>
>> 2) User-space-backed pages or ITER_BVEC pages.
>>
>> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
>> ---
>>  fs/fuse/dev.c    | 11 +++++++++--
>>  fs/fuse/file.c   | 32 +++++++++++++++++++++-----------
>>  fs/fuse/fuse_i.h |  1 +
>>  3 files changed, 31 insertions(+), 13 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 51897427a534..5de98a7a45b1 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -675,7 +675,12 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
>>                         flush_dcache_page(cs->pg);
>>                         set_page_dirty_lock(cs->pg);
>>                 }
>> -               put_page(cs->pg);
>> +               if (!cs->pipebufs &&
>> +                   (user_backed_iter(cs->iter) || iov_iter_is_bvec(cs->iter)))
>> +                       dio_w_unpin_user_page(cs->pg);
>> +
>> +               else
>> +                       put_page(cs->pg);
> 
> Why not move the logic into a helper and pass a "bool pinned" argument?

OK, will do. 

It's not yet clear from the discussion in the other thread with Jan and
Al [1] whether I'll end up keeping this check:

    user_backed_iter(cs->iter) || iov_iter_is_bvec(cs->iter)

...but if it stays, then the helper is a good idea.

> 
>>         }
>>         cs->pg = NULL;
>>  }
>> @@ -730,7 +735,9 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>>                 }
>>         } else {
>>                 size_t off;
>> -               err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
>> +
>> +               err = dio_w_iov_iter_pin_pages(cs->iter, &page, PAGE_SIZE, 1,
>> +                                              &off);
>>                 if (err < 0)
>>                         return err;
>>                 BUG_ON(!err);
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 1a3afd469e3a..01da38928d0b 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -625,14 +625,19 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
>>  }
>>
>>  static void fuse_release_user_pages(struct fuse_args_pages *ap,
>> -                                   bool should_dirty)
>> +                                   bool should_dirty, bool is_user_or_bvec)
>>  {
>>         unsigned int i;
>>
>> -       for (i = 0; i < ap->num_pages; i++) {
>> -               if (should_dirty)
>> -                       set_page_dirty_lock(ap->pages[i]);
>> -               put_page(ap->pages[i]);
>> +       if (is_user_or_bvec) {
>> +               dio_w_unpin_user_pages_dirty_lock(ap->pages, ap->num_pages,
>> +                                                 should_dirty);
>> +       } else {
>> +               for (i = 0; i < ap->num_pages; i++) {
>> +                       if (should_dirty)
>> +                               set_page_dirty_lock(ap->pages[i]);
>> +                       put_page(ap->pages[i]);
>> +               }
> 
> Same here.

Yes. Definitely belongs in a helper function. I was thinking, "don't
go that far, because the code will eventually get deleted anyway", but
you are right. :)

> 
>>         }
>>  }
>>
>> @@ -733,7 +738,7 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
>>         struct fuse_io_priv *io = ia->io;
>>         ssize_t pos = -1;
>>
>> -       fuse_release_user_pages(&ia->ap, io->should_dirty);
>> +       fuse_release_user_pages(&ia->ap, io->should_dirty, io->is_user_or_bvec);
>>
>>         if (err) {
>>                 /* Nothing */
>> @@ -1414,10 +1419,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>>         while (nbytes < *nbytesp && ap->num_pages < max_pages) {
>>                 unsigned npages;
>>                 size_t start;
>> -               ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
>> -                                       *nbytesp - nbytes,
>> -                                       max_pages - ap->num_pages,
>> -                                       &start);
>> +               ret = dio_w_iov_iter_pin_pages(ii, &ap->pages[ap->num_pages],
>> +                                              *nbytesp - nbytes,
>> +                                              max_pages - ap->num_pages,
>> +                                              &start);
>>                 if (ret < 0)
>>                         break;
>>
>> @@ -1483,6 +1488,10 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
>>                 fl_owner_t owner = current->files;
>>                 size_t nbytes = min(count, nmax);
>>
>> +               /* For use in fuse_release_user_pages(): */
>> +               io->is_user_or_bvec = user_backed_iter(iter) ||
>> +                                     iov_iter_is_bvec(iter);
>> +
> 
> How about io->is_pinned?  And a iov_iter_is_pinned() helper?

Agreed, is_pinned is a better name, and the helper (if we end up needing
that logic) also sounds good.


[1] https://lore.kernel.org/r/20220831094349.boln4jjajkdtykx3@quack3

thanks,

-- 
John Hubbard
NVIDIA


* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
From: John Hubbard @ 2022-09-01  1:48 UTC
  To: Al Viro
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 8/31/22 17:42, Al Viro wrote:
> On Tue, Aug 30, 2022 at 09:18:40PM -0700, John Hubbard wrote:
>> Provide two new wrapper routines that are intended for user-space-backed
>> and ITER_BVEC pages:
>>
>>     iov_iter_pin_pages()
>>     iov_iter_pin_pages_alloc()
>>
>> Internally, these routines call pin_user_pages_fast(), instead of
>> get_user_pages_fast(), for the user_backed_iter(i) and
>> iov_iter_is_bvec(i) cases.
>>
>> As always, callers must use unpin_user_pages() or a suitable FOLL_PIN
>> variant, to release the pages, if they actually were acquired via
>> pin_user_pages_fast().
>>
>> This is a prerequisite to converting bio/block layers over to use
>> pin_user_pages_fast().
> 
> What of ITER_PIPE (splice from O_DIRECT fd to a pipe, for filesystem
> that uses generic_file_splice_read())?

Yes. And it turns out that I sent this v2 just a little too early: it
does not include Jan Kara's latest idea [1] of including ITER_PIPE and
ITER_XARRAY. That should fix this up.


[1] https://lore.kernel.org/r/20220831094349.boln4jjajkdtykx3@quack3

thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast()
  2022-08-31  4:18 [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast() John Hubbard
                   ` (6 preceding siblings ...)
  2022-08-31  4:18 ` [PATCH v2 7/7] fuse: convert direct IO paths to use FOLL_PIN John Hubbard
@ 2022-09-06  6:36 ` Christoph Hellwig
  2022-09-06  7:10   ` John Hubbard
  7 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-06  6:36 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Tue, Aug 30, 2022 at 09:18:36PM -0700, John Hubbard wrote:
> The conversion is temporarily guarded by
> CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO. In the future (not part of this
> series), when we are certain that all filesystems have converted their
> Direct IO paths to FOLL_PIN, then we can do the final step, which is to
> get rid of CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO and search-and-replace
> the dio_w_*() functions with their final names (see bvec.h changes).

What is the point of these wrappers?  We should be able to
convert one caller at a time in an entirely safe way.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 2/7] mm/gup: introduce pin_user_page()
  2022-08-31  4:18 ` [PATCH v2 2/7] mm/gup: introduce pin_user_page() John Hubbard
@ 2022-09-06  6:37   ` Christoph Hellwig
  2022-09-06  7:12     ` John Hubbard
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-06  6:37 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

No need to export this, it really is an internal helper for
the iov_iter code.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-08-31  4:18 ` [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines John Hubbard
  2022-09-01  0:42   ` Al Viro
@ 2022-09-06  6:47   ` Christoph Hellwig
  2022-09-06  7:44     ` John Hubbard
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-06  6:47 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

I'd take it one step back.  For BVECs we never need a get or pin.  The
block layer already does this, and the other callers should as well.
For KVECs the same is true.  For PIPE and xarray, as you pointed out,
we can probably just do the pin; it is not like these are performance
paths.

So, I'd suggest to:

 - factor out the user backed and bvec cases from
   __iov_iter_get_pages_alloc into helper just to keep
   __iov_iter_get_pages_alloc readable.
 - for the pin case don't use the existing bvec helper at all, but
   copy the logic from the block layer for not pinning.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 5/7] block, bio, fs: convert most filesystems to pin_user_pages_fast()
  2022-08-31  4:18 ` [PATCH v2 5/7] block, bio, fs: convert most filesystems to pin_user_pages_fast() John Hubbard
@ 2022-09-06  6:48   ` Christoph Hellwig
  2022-09-06  7:15     ` John Hubbard
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-06  6:48 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

Please split this into separate patches for block, iomap and legacy
direct I/O.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 6/7] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-31  4:18 ` [PATCH v2 6/7] NFS: direct-io: convert to FOLL_PIN pages John Hubbard
@ 2022-09-06  6:49   ` Christoph Hellwig
  2022-09-06  7:16     ` John Hubbard
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-06  6:49 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Tue, Aug 30, 2022 at 09:18:42PM -0700, John Hubbard wrote:
> static void nfs_direct_release_pages(struct iov_iter *iter, struct page **pages,
> 				     unsigned int npages)
>  {
> -	unsigned int i;
> -	for (i = 0; i < npages; i++)
> -		put_page(pages[i]);
> +	if (user_backed_iter(iter) || iov_iter_is_bvec(iter))
> +		dio_w_unpin_user_pages(pages, npages);
> +	else
> +		release_pages(pages, npages);

Instead of having this magic in all kinds of places, we need a helper
that takes the page array, npages and iter->type and does the right
thing.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast()
  2022-09-06  6:36 ` [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast() Christoph Hellwig
@ 2022-09-06  7:10   ` John Hubbard
  2022-09-06  7:22     ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread
From: John Hubbard @ 2022-09-06  7:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 9/5/22 23:36, Christoph Hellwig wrote:
> On Tue, Aug 30, 2022 at 09:18:36PM -0700, John Hubbard wrote:
>> The conversion is temporarily guarded by
>> CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO. In the future (not part of this
>> series), when we are certain that all filesystems have converted their
>> Direct IO paths to FOLL_PIN, then we can do the final step, which is to
>> get rid of CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO and search-and-replace
>> the dio_w_*() functions with their final names (see bvec.h changes).
> 
> What is the point of these wrappers?  We should be able to
> convert one caller at a time in an entirely safe way.

I would be delighted if that were somehow possible. Every time I think
it's possible, it has fallen apart. The fact that bio_release_pages()
will need to switch over from put_page() to unpin_user_page(), combined
with the fact that there are a lot of callers that submit bios, has
led me to the current approach.

What did you have in mind?

thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 2/7] mm/gup: introduce pin_user_page()
  2022-09-06  6:37   ` Christoph Hellwig
@ 2022-09-06  7:12     ` John Hubbard
  0 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2022-09-06  7:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 9/5/22 23:37, Christoph Hellwig wrote:
> No need to export this, it really is an internal helper for
> the iov_iter code.

Agreed. I'd fallen into a bit of a habit, for gup APIs. But this
one has a much more limited and special role, yes.


thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 5/7] block, bio, fs: convert most filesystems to pin_user_pages_fast()
  2022-09-06  6:48   ` Christoph Hellwig
@ 2022-09-06  7:15     ` John Hubbard
  0 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2022-09-06  7:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 9/5/22 23:48, Christoph Hellwig wrote:
> Please split this into separate patches for block, iomap and legacy
> direct I/O.

Sure, will do.


thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 6/7] NFS: direct-io: convert to FOLL_PIN pages
  2022-09-06  6:49   ` Christoph Hellwig
@ 2022-09-06  7:16     ` John Hubbard
  0 siblings, 0 replies; 55+ messages in thread
From: John Hubbard @ 2022-09-06  7:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 9/5/22 23:49, Christoph Hellwig wrote:
> On Tue, Aug 30, 2022 at 09:18:42PM -0700, John Hubbard wrote:
>> static void nfs_direct_release_pages(struct iov_iter *iter, struct page **pages,
>> 				     unsigned int npages)
>>  {
>> -	unsigned int i;
>> -	for (i = 0; i < npages; i++)
>> -		put_page(pages[i]);
>> +	if (user_backed_iter(iter) || iov_iter_is_bvec(iter))
>> +		dio_w_unpin_user_pages(pages, npages);
>> +	else
>> +		release_pages(pages, npages);
> 
> Instead of having this magic in all kinds of places, we need a helper
> that takes the page array, npages and iter->type and does the right
> thing.

Yes, Miklos asked for much the same thing, too. :)

thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast()
  2022-09-06  7:10   ` John Hubbard
@ 2022-09-06  7:22     ` Christoph Hellwig
  2022-09-06  7:37       ` John Hubbard
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-06  7:22 UTC (permalink / raw)
  To: John Hubbard
  Cc: Christoph Hellwig, Andrew Morton, Jens Axboe, Alexander Viro,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Tue, Sep 06, 2022 at 12:10:54AM -0700, John Hubbard wrote:
> I would be delighted if that were somehow possible. Every time I think
> it's possible, it has fallen apart. The fact that bio_release_pages()
> will need to switch over from put_page() to unpin_user_page(), combined
> with the fact that there are a lot of callers that submit bios, has
> led me to the current approach.

We can (temporarily) pass the gup flag to bio_release_pages or even
better add a new bio_unpin_pages helper that undoes the pin side.
That is: don't try to reuse the old APIs, but add new ones, just like
we do on the lower layers.
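
Something like the below, roughly (a sketch of the unpin side only; it
leaves out the dirtying that bio_release_pages() does for reads):

	static void bio_unpin_pages(struct bio *bio)
	{
		struct bvec_iter_all iter_all;
		struct bio_vec *bvec;

		bio_for_each_segment_all(bvec, bio, iter_all)
			unpin_user_page(bvec->bv_page);
	}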

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast()
  2022-09-06  7:22     ` Christoph Hellwig
@ 2022-09-06  7:37       ` John Hubbard
  2022-09-06  7:46         ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread
From: John Hubbard @ 2022-09-06  7:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 9/6/22 00:22, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 12:10:54AM -0700, John Hubbard wrote:
>> I would be delighted if that were somehow possible. Every time I think
>> it's possible, it has fallen apart. The fact that bio_release_pages()
>> will need to switch over from put_page() to unpin_user_page(), combined
>> with the fact that there are a lot of callers that submit bios, has
>> led me to the current approach.
> 
> We can (temporarily) pass the gup flag to bio_release_pages or even
> better add a new bio_unpin_pages helper that undoes the pin side.
> That is: don't try to reuse the old APIs, but add new ones, just like
> we do on the lower layers.

OK...so, to confirm: the idea is to convert these callsites (below) to
call a new bio_unpin_pages() routine that does unpin_user_page().

$ git grep -nw bio_release_pages
block/bio.c:1474:               bio_release_pages(bio, true);
block/bio.c:1490:       bio_release_pages(bio, false);
block/blk-map.c:308:    bio_release_pages(bio, false);
block/blk-map.c:610:                    bio_release_pages(bio, bio_data_dir(bio) == READ);
block/fops.c:99:        bio_release_pages(&bio, should_dirty);
block/fops.c:165:               bio_release_pages(bio, false);
block/fops.c:289:               bio_release_pages(bio, false);
fs/direct-io.c:510:             bio_release_pages(bio, should_dirty);
fs/iomap/direct-io.c:185:               bio_release_pages(bio, false);
fs/zonefs/super.c:793:  bio_release_pages(bio, false);


And these can probably be done in groups, not as one big patch--your
other email also asked to break them up into block, iomap, and legacy.

That would be nice, I really hate the ugly dio_w*() wrappers, thanks
for this help. :)

thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-06  6:47   ` Christoph Hellwig
@ 2022-09-06  7:44     ` John Hubbard
  2022-09-06  7:48       ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread
From: John Hubbard @ 2022-09-06  7:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 9/5/22 23:47, Christoph Hellwig wrote:
> I'd take it one step back.  For BVECs we never need a get or pin.  The
> block layer already does this, and the other callers should as well.
> For KVECs the same is true.  For PIPE and xarray, as you pointed out,
> we can probably just do the pin; it is not like these are performance
> paths.
> 
> So, I'd suggest to:
> 
>  - factor out the user backed and bvec cases from
>    __iov_iter_get_pages_alloc into helper just to keep
>    __iov_iter_get_pages_alloc readable.

OK, that part is clear.

>  - for the pin case don't use the existing bvec helper at all, but
>    copy the logic from the block layer for not pinning.

I'm almost, but not quite sure I get the idea above. Overall, what
happens to bvec pages? Leave the get_page() pin in place for FOLL_GET
(or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers?

thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast()
  2022-09-06  7:37       ` John Hubbard
@ 2022-09-06  7:46         ` Christoph Hellwig
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-06  7:46 UTC (permalink / raw)
  To: John Hubbard
  Cc: Christoph Hellwig, Andrew Morton, Jens Axboe, Alexander Viro,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Tue, Sep 06, 2022 at 12:37:00AM -0700, John Hubbard wrote:
> On 9/6/22 00:22, Christoph Hellwig wrote:
> > On Tue, Sep 06, 2022 at 12:10:54AM -0700, John Hubbard wrote:
> >> I would be delighted if that were somehow possible. Every time I think
> >> it's possible, it has fallen apart. The fact that bio_release_pages()
> >> will need to switch over from put_page() to unpin_user_page(), combined
> >> with the fact that there are a lot of callers that submit bios, has
> >> led me to the current approach.
> > 
> > We can (temporarily) pass the gup flag to bio_release_pages or even
> > better add a new bio_unpin_pages helper that undoes the pin side.
> > That is: don't try to reuse the old APIs, but add new ones, just like
> > we do on the lower layers.
> 
> OK...so, to confirm: the idea is to convert these callsites (below) to
> call a new bio_unpin_pages() routine that does unpin_user_page().

Yeah.  And to stay symmetric also a new bio_iov_iter_pin_pages for
the pin side.
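
So at the call sites the pairing would look like this (usage sketch,
using the names proposed above):

	/* submission side: FOLL_PIN the iter's pages into the bio */
	ret = bio_iov_iter_pin_pages(bio, iter);
	...
	/* completion side: drop the pins instead of put_page() */
	bio_unpin_pages(bio);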

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-06  7:44     ` John Hubbard
@ 2022-09-06  7:48       ` Christoph Hellwig
  2022-09-06  7:58         ` John Hubbard
  2022-09-06 10:21         ` Jan Kara
  0 siblings, 2 replies; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-06  7:48 UTC (permalink / raw)
  To: John Hubbard
  Cc: Christoph Hellwig, Andrew Morton, Jens Axboe, Alexander Viro,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Tue, Sep 06, 2022 at 12:44:28AM -0700, John Hubbard wrote:
> OK, that part is clear.
> 
> >  - for the pin case don't use the existing bvec helper at all, but
> >    copy the logic from the block layer for not pinning.
> 
> I'm almost, but not quite sure I get the idea above. Overall, what
> happens to bvec pages? Leave the get_page() pin in place for FOLL_GET
> (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers?

Do not change anything for FOLL_GET callers, as they are on the way out
anyway.

For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
not acquiring a reference is obviously safe, and the other callers will
need an audit, but I can't think of why it would ever be unsafe.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-06  7:48       ` Christoph Hellwig
@ 2022-09-06  7:58         ` John Hubbard
  2022-09-07  8:50           ` Christoph Hellwig
  2022-09-06 10:21         ` Jan Kara
  1 sibling, 1 reply; 55+ messages in thread
From: John Hubbard @ 2022-09-06  7:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jens Axboe, Alexander Viro, Miklos Szeredi,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 9/6/22 00:48, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 12:44:28AM -0700, John Hubbard wrote:
>> OK, that part is clear.
>>
>>>  - for the pin case don't use the existing bvec helper at all, but
>>>    copy the logic from the block layer for not pinning.
>>
>> I'm almost, but not quite sure I get the idea above. Overall, what
>> happens to bvec pages? Leave the get_page() pin in place for FOLL_GET
>> (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers?
> 
> Do not change anything for FOLL_GET callers, as they are on the way out
> anyway.
> 

OK, got it.

> For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> not acquiring a reference is obviously safe, and the other callers will
> need an audit, but I can't think of why it would ever be unsafe.

In order to do that, one would need to be confident that such bvec and kvec
pages do not get passed down to bio_release_pages() (or the new
bio_unpin_pages()). Also, I'm missing a key point, because today bvec pages get
a get_page() reference from __iov_iter_get_pages_alloc(). If I just skip
that, then the get/put calls are unbalanced...

I can just hear Al Viro repeating his points about splice() and vmsplice(),
heh. :)


thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-06  7:48       ` Christoph Hellwig
  2022-09-06  7:58         ` John Hubbard
@ 2022-09-06 10:21         ` Jan Kara
  2022-09-07  8:45           ` Christoph Hellwig
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Kara @ 2022-09-06 10:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Hubbard, Andrew Morton, Jens Axboe, Alexander Viro,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Tue 06-09-22 00:48:49, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 12:44:28AM -0700, John Hubbard wrote:
> > OK, that part is clear.
> > 
> > >  - for the pin case don't use the existing bvec helper at all, but
> > >    copy the logic from the block layer for not pinning.
> > 
> > I'm almost, but not quite sure I get the idea above. Overall, what
> > happens to bvec pages? Leave the get_page() pin in place for FOLL_GET
> > (or USE_FOLL_GET), I suppose, but do...what, for FOLL_PIN callers?
> 
> Do not change anything for FOLL_GET callers, as they are on the way out
> anyway.
> 
> For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> not acquiring a reference is obviously safe, and the other callers will
> need an audit, but I can't think of why it would ever be unsafe.

Are you sure about "For file systems not acquiring a reference is obviously
safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters
from pagecache pages. And then we have iter_file_splice_write() which
creates bvec from pipe pages (which can also be pagecache pages if
vmsplice() is used). So perhaps there are no lifetime issues even without
acquiring a reference (but looking at the code I would not say it is
obvious) but I definitely don't see how it would be safe to not get a pin
to signal to filesystem backing the pagecache page that there is DMA
happening to/from the page.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-06 10:21         ` Jan Kara
@ 2022-09-07  8:45           ` Christoph Hellwig
  2022-09-14  3:51             ` Al Viro
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-07  8:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, John Hubbard, Andrew Morton, Jens Axboe,
	Alexander Viro, Miklos Szeredi, Darrick J . Wong,
	Trond Myklebust, Anna Schumaker, David Hildenbrand,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On Tue, Sep 06, 2022 at 12:21:06PM +0200, Jan Kara wrote:
> > For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> > not acquiring a reference is obviously safe, and the other callers will
> > need an audit, but I can't think of why it would ever be unsafe.
> 
> Are you sure about "For file systems not acquiring a reference is obviously
> safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters
> from pagecache pages. And then we have iter_file_splice_write() which
> creates bvec from pipe pages (which can also be pagecache pages if
> vmsplice() is used). So perhaps there are no lifetime issues even without
> acquiring a reference (but looking at the code I would not say it is
> obvious) but I definitely don't see how it would be safe to not get a pin
> to signal to filesystem backing the pagecache page that there is DMA
> happening to/from the page.

I mean in the context of iov_iter_get_pages callers, that is direct
I/O.  Direct callers of iov_iter_bvec which then pass that iov to
->read_iter / ->write_iter will need to hold references (those are
the references that the callers of iov_iter_get_pages rely on!).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-06  7:58         ` John Hubbard
@ 2022-09-07  8:50           ` Christoph Hellwig
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-07  8:50 UTC (permalink / raw)
  To: John Hubbard
  Cc: Christoph Hellwig, Andrew Morton, Jens Axboe, Alexander Viro,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, David Hildenbrand, Logan Gunthorpe,
	linux-block, linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Tue, Sep 06, 2022 at 12:58:59AM -0700, John Hubbard wrote:
> > For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> > not acquiring a reference is obviously safe, and the other callers will
> > need an audit, but I can't think of why it would ever be unsafe.
> 
> In order to do that, one would need to be confident that such bvec and kvec
> pages do not get passed down to bio_release_pages() (or the new
> bio_unpin_pages()). Also, I'm missing a key point, because today bvec pages get
> a get_page() reference from __iov_iter_get_pages_alloc(). If I just skip
> that, then the get/put calls are unbalanced...

Except that for the most relevant callers (bdev and iomap) we never
end up in __iov_iter_get_pages_alloc. What I suggest is that, with
the move to the pin helper, we codify that logic.

For callers of the bio_iov_iter_pin_pages helper, the corresponding
bio_unpin_pages helper will encapsulate that logic.  For direct calls
to iov_iter_pin_pages, we should add an iov_iter_unpin_pages helper that
also encapsulates the logic.  The beauty is that this means the caller
itself does not have to care about any of this, and the logic lives in
one place (well, two, due to the block special case that reuses the
bio_vec array for the pages space).
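
For the direct-call case that helper could be as simple as (sketch;
assuming, per the above, that only user-backed iterators ever got
pinned):

	void iov_iter_unpin_pages(const struct iov_iter *i,
				  struct page **pages, unsigned int npages)
	{
		if (user_backed_iter(i))
			unpin_user_pages(pages, npages);
		/* bvec/kvec/xarray pages were never pinned: nothing to drop */
	}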

> I can just hear Al Viro repeating his points about splice() and vmsplice(),
> heh. :)

splice does take these references before calling into the file system
(see my reply to Jan).

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-07  8:45           ` Christoph Hellwig
@ 2022-09-14  3:51             ` Al Viro
  2022-09-14 14:52               ` Jan Kara
  2022-09-22 14:31               ` Christoph Hellwig
  0 siblings, 2 replies; 55+ messages in thread
From: Al Viro @ 2022-09-14  3:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Wed, Sep 07, 2022 at 01:45:26AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 06, 2022 at 12:21:06PM +0200, Jan Kara wrote:
> > > For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> > > not acquiring a reference is obviously safe, and the other callers will
> > > need an audit, but I can't think of why it would ever be unsafe.
> > 
> > Are you sure about "For file systems not acquiring a reference is obviously
> > safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters
> > from pagecache pages. And then we have iter_file_splice_write() which
> > creates bvec from pipe pages (which can also be pagecache pages if
> > vmsplice() is used). So perhaps there are no lifetime issues even without
> > acquiring a reference (but looking at the code I would not say it is
> > obvious) but I definitely don't see how it would be safe to not get a pin
> > to signal to filesystem backing the pagecache page that there is DMA
> > happening to/from the page.
> 
> I mean in the context of iov_iter_get_pages callers, that is direct
> I/O.  Direct callers of iov_iter_bvec which then pass that iov to
> ->read_iter / ->write_iter will need to hold references (those are
> the references that the callers of iov_iter_get_pages rely on!).

Unless I'm misreading Jan, the question is whether they should get or
pin.  AFAICS, anyone who passes the sucker to ->read_iter() (or ->recvmsg(),
or does direct copy_to_iter()/zero_iter(), etc.) is falling under
=================================================================================
CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()
=================================================================================

Regarding iter_file_splice_write() case, do we need to pin pages
when we are not going to modify the data in those?

The same goes for afs, AFAICS; I started to type "... and everything that passes
WRITE to iov_iter_bvec()", but...
drivers/vhost/vringh.c:1165:            iov_iter_bvec(&iter, READ, iov, ret, translated);
drivers/vhost/vringh.c:1198:            iov_iter_bvec(&iter, WRITE, iov, ret, translated);
is backwards - READ is for data destinations, comes with copy_to_iter(); WRITE is
for data sources and it comes with copy_from_iter()...
I'm really tempted to slap
	if (WARN_ON(i->data_source))
		return 0;
into copy_to_iter() et.al., along with its opposite for copy_from_iter().
And see who comes screaming...  Things like
        if (unlikely(iov_iter_is_pipe(i) || iov_iter_is_discard(i))) {
                WARN_ON(1);
                return 0;
        }
in e.g. csum_and_copy_from_iter() would be replaced by that, and become
easier to understand...
These two are also getting it wrong, BTW:
drivers/target/target_core_file.c:340:  iov_iter_bvec(&iter, READ, bvec, sgl_nents, len);
drivers/target/target_core_file.c:476:  iov_iter_bvec(&iter, READ, bvec, nolb, len);


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-14  3:51             ` Al Viro
@ 2022-09-14 14:52               ` Jan Kara
  2022-09-14 16:42                 ` Al Viro
  2022-09-22 14:31               ` Christoph Hellwig
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Kara @ 2022-09-14 14:52 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Jan Kara, John Hubbard, Andrew Morton,
	Jens Axboe, Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Wed 14-09-22 04:51:17, Al Viro wrote:
> On Wed, Sep 07, 2022 at 01:45:26AM -0700, Christoph Hellwig wrote:
> > On Tue, Sep 06, 2022 at 12:21:06PM +0200, Jan Kara wrote:
> > > > For FOLL_PIN callers, never pin bvec and kvec pages:  For file systems
> > > > not acquiring a reference is obviously safe, and the other callers will
> > > > need an audit, but I can't think of why it would ever be unsafe.
> > > 
> > > Are you sure about "For file systems not acquiring a reference is obviously
> > > safe"? I can see places e.g. in orangefs, afs, etc. which create bvec iters
> > > from pagecache pages. And then we have iter_file_splice_write() which
> > > creates bvec from pipe pages (which can also be pagecache pages if
> > > vmsplice() is used). So perhaps there are no lifetime issues even without
> > > acquiring a reference (but looking at the code I would not say it is
> > > obvious) but I definitely don't see how it would be safe to not get a pin
> > > to signal to filesystem backing the pagecache page that there is DMA
> > > happening to/from the page.
> > 
> > I mean in the context of iov_iter_get_pages callers, that is direct
> > I/O.  Direct callers of iov_iter_bvec which then pass that iov to
> > ->read_iter / ->write_iter will need to hold references (those are
> > the references that the callers of iov_iter_get_pages rely on!).
> 
> Unless I'm misreading Jan, the question is whether they should get or
> pin.  AFAICS, anyone who passes the sucker to ->read_iter() (or ->recvmsg(),
> or does direct copy_to_iter()/zero_iter(), etc.) is falling under
> =================================================================================
> CASE 5: Pinning in order to write to the data within the page
> -------------------------------------------------------------
> Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
> write to a page's data, unpin" can cause a problem. Case 5 may be considered a
> superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
> other words, if the code is neither Case 1 nor Case 2, it may still require
> FOLL_PIN, for patterns like this:
> 
> Correct (uses FOLL_PIN calls):
>     pin_user_pages()
>     write to the data within the pages
>     unpin_user_pages()
> 
> INCORRECT (uses FOLL_GET calls):
>     get_user_pages()
>     write to the data within the pages
>     put_page()
> =================================================================================

Yes, that was my point.

> Regarding iter_file_splice_write() case, do we need to pin pages
> when we are not going to modify the data in those?

Strictly speaking, no.  So far we are pinning pages even if they serve as
the data source, because it is simpler not to bother about the data access
direction, but I'm not really aware of anything that would mandate that.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-14 14:52               ` Jan Kara
@ 2022-09-14 16:42                 ` Al Viro
  2022-09-15  8:16                   ` Jan Kara
  0 siblings, 1 reply; 55+ messages in thread
From: Al Viro @ 2022-09-14 16:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Wed, Sep 14, 2022 at 04:52:33PM +0200, Jan Kara wrote:
> > =================================================================================
> > CASE 5: Pinning in order to write to the data within the page
> > -------------------------------------------------------------
> > Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
> > write to a page's data, unpin" can cause a problem. Case 5 may be considered a
> > superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
> > other words, if the code is neither Case 1 nor Case 2, it may still require
> > FOLL_PIN, for patterns like this:
> > 
> > Correct (uses FOLL_PIN calls):
> >     pin_user_pages()
> >     write to the data within the pages
> >     unpin_user_pages()
> > 
> > INCORRECT (uses FOLL_GET calls):
> >     get_user_pages()
> >     write to the data within the pages
> >     put_page()
> > =================================================================================
> 
> Yes, that was my point.

The thing is, at which point do we pin those pages?  pin_user_pages() works by
userland address; by the time we get to any of those we have struct page
references and no idea whether they are still mapped anywhere.

How would that work?  What protects the area where you want to avoid running
into pinned pages from previously acceptable page getting pinned?  If "they
must have been successfully unmapped" is a part of what you are planning, we
really do have a problem...

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-14 16:42                 ` Al Viro
@ 2022-09-15  8:16                   ` Jan Kara
  2022-09-16  1:55                     ` Al Viro
  2022-09-22  2:22                     ` Al Viro
  0 siblings, 2 replies; 55+ messages in thread
From: Jan Kara @ 2022-09-15  8:16 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Christoph Hellwig, John Hubbard, Andrew Morton,
	Jens Axboe, Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Wed 14-09-22 17:42:40, Al Viro wrote:
> On Wed, Sep 14, 2022 at 04:52:33PM +0200, Jan Kara wrote:
> > > =================================================================================
> > > CASE 5: Pinning in order to write to the data within the page
> > > -------------------------------------------------------------
> > > Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
> > > write to a page's data, unpin" can cause a problem. Case 5 may be considered a
> > > superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
> > > other words, if the code is neither Case 1 nor Case 2, it may still require
> > > FOLL_PIN, for patterns like this:
> > > 
> > > Correct (uses FOLL_PIN calls):
> > >     pin_user_pages()
> > >     write to the data within the pages
> > >     unpin_user_pages()
> > > 
> > > INCORRECT (uses FOLL_GET calls):
> > >     get_user_pages()
> > >     write to the data within the pages
> > >     put_page()
> > > =================================================================================
> > 
> > Yes, that was my point.
> 
> The thing is, at which point do we pin those pages?  pin_user_pages() works by
> userland address; by the time we get to any of those we have struct page
> references and no idea whether they are still mapped anywhere.

Yes, pin_user_pages() currently works by virtual address but there's nothing
fundamental about that. Technically, a pin is currently just another type of
page reference, so we can just as well pin the page when given a struct page.
In fact John Hubbard has added such a helper in this series.
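
I.e. something like this becomes possible (sketch, using the
pin_user_page() helper from patch 2/7 of this series):

	/* we already hold a reference on a known-good struct page */
	pin_user_page(page);		/* take a FOLL_PIN-style reference */
	/* ... DMA or direct data access ... */
	unpin_user_page(page);		/* pairs with the pin above */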

> How would that work?  What protects the area where you want to avoid running
> into pinned pages from previously acceptable page getting pinned?  If "they
> must have been successfully unmapped" is a part of what you are planning, we
> really do have a problem...

But this is a very good question. So far the idea was that we lock the
page, unmap (or writeprotect) the page, and then check pincount == 0 and
that is a reliable method for making sure page data is stable (until we
unlock the page & release other locks blocking page faults and writes). But
once suddently ordinary page references can be used to create pins this
does not work anymore. Hrm.

Just brainstorming ideas now: So we'd either need to obtain the pins early
when we still have the virtual address (but I guess that is often not
practical but should work e.g. for normal direct IO path) or we need some
way to "simulate" the page fault when pinning the page, just don't map it
into page tables in the end. This simulated page fault could be perhaps
avoided if rmap walk shows that the page is already mapped somewhere with
suitable permissions.
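
For reference, the check we rely on today is roughly this (sketch; the
unmap step stands in for whatever unmapping or write-protecting the
caller does):

	lock_page(page);
	/* unmap or write-protect all mappings of the page */
	if (!page_maybe_dma_pinned(page))
		/* page data is stable until the locks are dropped */

and it is exactly this check that silently pinned references would
defeat.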

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-15  8:16                   ` Jan Kara
@ 2022-09-16  1:55                     ` Al Viro
  2022-09-20  5:02                       ` Al Viro
  2022-09-22  2:22                     ` Al Viro
  1 sibling, 1 reply; 55+ messages in thread
From: Al Viro @ 2022-09-16  1:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:

> > How would that work?  What protects the area where you want to avoid running
> > into pinned pages from previously acceptable page getting pinned?  If "they
> > must have been successfully unmapped" is a part of what you are planning, we
> > really do have a problem...
> 
> But this is a very good question. So far the idea was that we lock the
> page, unmap (or writeprotect) the page, and then check pincount == 0 and
> that is a reliable method for making sure page data is stable (until we
> unlock the page & release other locks blocking page faults and writes). But
> once suddently ordinary page references can be used to create pins this
> does not work anymore. Hrm.
> 
> Just brainstorming ideas now: So we'd either need to obtain the pins early
> when we still have the virtual address (but I guess that is often not
> practical but should work e.g. for normal direct IO path) or we need some
> way to "simulate" the page fault when pinning the page, just don't map it
> into page tables in the end. This simulated page fault could be perhaps
> avoided if rmap walk shows that the page is already mapped somewhere with
> suitable permissions.

OK...  I'd done some digging; results so far

	* READ vs. WRITE turned out to be an awful way to specify iov_iter
data direction.  Local iov_iter branch so far:
	get rid of unlikely() on page_copy_sane() calls
	csum_and_copy_to_iter(): handle ITER_DISCARD
	[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
	[fsi] WRITE is "data source", not destination...
	[infiniband] READ is "data destination", not source...
	[s390] zcore: WRITE is "data source", not destination...
	[target] fix iov_iter_bvec() "direction" argument
	[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
	[xen] fix "direction" argument of iov_iter_kvec()
	[trace] READ means "data destination", not source...
	iov_iter: saner checks for attempt to copy to/from iterator
	use less confusing names for iov_iter direction initializers
those 8 commits in the middle consist of fixes, some of them with more than
one call site affected.  Folks keep going "oh, we are going to copy data
into that iterator, must be WRITE".  Wrong - WRITE means "as for write(2)",
i.e. the data _source_, not data destination.  And the same kind of bugs
goes in the opposite direction, of course.
	I think something like ITER_DEST vs. ITER_SOURCE would be less
confusing.

	* anything that goes with ITER_SOURCE doesn't need pin.
	* ITER_IOVEC/ITER_UBUF need pin for get_pages and for nothing else.
Need to grab reference on get_pages, obviously.
	* even more obviously, ITER_DISCARD is irrelevant here.
	* ITER_PIPE only modifies anonymous pages that had been allocated
by iov_iter primitives and hadn't been observed by anything outside until
we are done with said ITER_PIPE.
	* quite a few instances are similar to e.g. REQ_OP_READ handling in
/dev/loop - we work with ITER_BVEC there and we do modify the page contents,
but the damn thing would better be given to us locked and stay locked until
all involved modifications (be it real IO/decoding/whatever) are complete.
That ought to be safe, unless I'm missing something.

That doesn't cover everything; still going through the list...

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-16  1:55                     ` Al Viro
@ 2022-09-20  5:02                       ` Al Viro
  2022-09-22 14:36                         ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread
From: Al Viro @ 2022-09-20  5:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Fri, Sep 16, 2022 at 02:55:53AM +0100, Al Viro wrote:
> 	* READ vs. WRITE turned out to be an awful way to specify iov_iter
> data direction.  Local iov_iter branch so far:
> 	get rid of unlikely() on page_copy_sane() calls
> 	csum_and_copy_to_iter(): handle ITER_DISCARD
> 	[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
> 	[fsi] WRITE is "data source", not destination...
> 	[infiniband] READ is "data destination", not source...
> 	[s390] zcore: WRITE is "data source", not destination...
> 	[target] fix iov_iter_bvec() "direction" argument
> 	[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
> 	[xen] fix "direction" argument of iov_iter_kvec()
> 	[trace] READ means "data destination", not source...
> 	iov_iter: saner checks for attempt to copy to/from iterator
> 	use less confusing names for iov_iter direction initializers
> those 8 commits in the middle consist of fixes, some of them with more than
> one call site affected.  Folks keep going "oh, we are going to copy data
> into that iterator, must be WRITE".  Wrong - WRITE means "as for write(2)",
> i.e. the data _source_, not data destination.  And the same kind of bugs
> goes in the opposite direction, of course.
> 	I think something like ITER_DEST vs. ITER_SOURCE would be less
> confusing.
> 
> 	* anything that goes with ITER_SOURCE doesn't need pin.
> 	* ITER_IOVEC/ITER_UBUF need pin for get_pages and for nothing else.
> Need to grab reference on get_pages, obviously.
> 	* even more obviously, ITER_DISCARD is irrelevant here.
> 	* ITER_PIPE only modifies anonymous pages that had been allocated
> by iov_iter primitives and hadn't been observed by anything outside until
> we are done with said ITER_PIPE.
> 	* quite a few instances are similar to e.g. REQ_OP_READ handling in
> /dev/loop - we work with ITER_BVEC there and we do modify the page contents,
> but the damn thing would better be given to us locked and stay locked until
> all involved modifications (be it real IO/decoding/whatever) are complete.
> That ought to be safe, unless I'm missing something.
> 
> That doesn't cover everything; still going through the list...

More:

nvme target: nvme read requests end up with somebody allocating and filling
sglist, followed by reading from file into it (using ITER_BVEC).  Then the
pages are sent out, presumably.  I would be very surprised if it turned out
to be anything other than anon pages allocated by the driver, but I'd like
to see that confirmed by nvme folks.  Probably doesn't need pinning.

->read_folio() instances - page locked by caller, not unlocked until we are done.

->readahead() instances - pages are in the segment of page cache that had been
populated and locked by the caller; some are ITER_BVEC (with page(s) extracted
by readahead_page()), some - ITER_XARRAY.
other similar places (some of ->write_begin() instances, after having grabbed
a locked page, etc.)

->issue_read() instances - the call graph is scary (in particular, recursion
prevention there is non-obvious), but unless netfs folks say otherwise, I'd
assume that all pages involved are supposed to be locked by the caller.
swap reads (ending up at __swap_read_unplug()) - pages locked by callers.

in some cases (cifs) pages are privately allocated and not visible to anyone
else.

io_import_fixed() sets ITER_BVEC over pinned pages; see io_pin_pages() for
the place where that's done.

In cifs_send_async_read() we take the pages that will eventually go into
ITER_BVEC iterator from iov_iter_get_pages() - that one wants pinning if
the type of ctx->iter demands it. The same goes for setup_aio_ctx_iter() -
iov_iter_get_pages() is used to make an ITER_BVEC counterpart of the
iov_iter passed to ->read_iter(), with the same considerations re pinning.
The same goes for ceph __iter_get_bvecs().

Haven't done yet:

drivers/target/target_core_file.c:292:  iov_iter_bvec(&iter, is_write, aio_cmd->bvecs, sgl_nents, len);
drivers/vhost/vringh.c:1198:            iov_iter_bvec(&iter, ITER_DEST, iov, ret, translated);
fs/afs/dir.c:308:       iov_iter_xarray(&req->def_iter, ITER_DEST, &dvnode->netfs.inode.i_mapping->i_pages,
net/ceph/messenger_v1.c:52:     iov_iter_bvec(&msg.msg_iter, ITER_DEST, &bvec, 1, length);
net/ceph/messenger_v2.c:236:    iov_iter_bvec(&con->v2.in_iter, ITER_DEST, &con->v2.in_bvec, 1, bv->bv_len);
net/sunrpc/svcsock.c:263:       iov_iter_bvec(&msg.msg_iter, ITER_DEST, bvec, i, buflen);
net/sunrpc/xprtsock.c:376:      iov_iter_bvec(&msg->msg_iter, ITER_DEST, bvec, nr, count);

The picture so far looks like we mostly need to take care of pinning when
we obtain the references from iov_iter_get_pages().  What's more, it looks
like ITER_BVEC/ITER_XARRAY/ITER_PIPE we really don't need to pin anything on
get_pages/pin_pages - they are already protected (or, in case of ITER_PIPE,
allocated by iov_iter itself and not reachable by anybody outside).
Might or might not be true for the remaining 7 call sites...

NOTE: all of the above assumes that callers with pre-locked pages are
either synchronous or do not unlock until the completion callbacks.
It does appear to be true; if it is true, I really wonder if we need
to even grab references in iov_iter_pin_pages() for anything other
than ITER_IOVEC/ITER_UBUF.  The right primitive might be
	if user-backed
		pin pages
	else
		just copy the pointers; any lifetime-related issues are
		up to the caller.
	advance iterator in either case
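
Rendered in C, that would come out as something like this (rough sketch;
the two helpers are hypothetical stand-ins, not functions from this
series):

	if (user_backed_iter(i)) {
		/* FOLL_PIN the user pages, e.g. via pin_user_pages_fast() */
		n = pin_pages_from_user_iter(i, pages, maxsize, maxpages, start);
	} else {
		/* bvec/kvec/xarray/pipe: copy out the struct page
		 * pointers without taking any reference at all */
		n = copy_page_pointers(i, pages, maxsize, maxpages, start);
	}
	if (n > 0)
		iov_iter_advance(i, n);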


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-15  8:16                   ` Jan Kara
  2022-09-16  1:55                     ` Al Viro
@ 2022-09-22  2:22                     ` Al Viro
  2022-09-22  6:09                       ` John Hubbard
  2022-09-22 14:38                       ` Christoph Hellwig
  1 sibling, 2 replies; 55+ messages in thread
From: Al Viro @ 2022-09-22  2:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:

> > How would that work?  What protects the area where you want to avoid running
> > into pinned pages from previously acceptable page getting pinned?  If "they
> > must have been successfully unmapped" is a part of what you are planning, we
> > really do have a problem...
> 
> But this is a very good question. So far the idea was that we lock the
> page, unmap (or writeprotect) the page, and then check pincount == 0 and
> that is a reliable method for making sure page data is stable (until we
> unlock the page & release other locks blocking page faults and writes). But
> > once suddenly ordinary page references can be used to create pins this
> does not work anymore. Hrm.
> 
> Just brainstorming ideas now: So we'd either need to obtain the pins early
> when we still have the virtual address (but I guess that is often not
> practical but should work e.g. for normal direct IO path) or we need some
> way to "simulate" the page fault when pinning the page, just don't map it
> into page tables in the end. This simulated page fault could be perhaps
> avoided if rmap walk shows that the page is already mapped somewhere with
> suitable permissions.

OK.  As far as I can see, the rules are along the lines of
	* creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe.
	  That includes
		* page known to be locked by caller
		* page being privately allocated and not visible to anyone else
		* iterator being data source
		* page coming from pin_user_pages(), possibly as the result of
		  iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF.
	* ITER_PIPE pages are always safe
	* pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator
	  had been created with such.
My preference would be to have iov_iter_get_pages() and friends pin if and
only if we have data-destination iov_iter that is user-backed.  For
data-source user-backed we only need FOLL_GET, and for all other flavours
(ITER_BVEC, etc.) we only do get_page(), if we need to grab any references
at all.
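
Spelled out as a flag choice (sketch):

	if (user_backed_iter(i))
		/* pin data destinations, get data sources */
		gup_flags = i->data_source ? FOLL_GET : FOLL_PIN;
	else
		/* ITER_BVEC et al.: plain get_page(), if any reference
		 * needs to be taken at all */
		gup_flags = 0;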

What I'd like to have is the understanding of the places where we drop
the references acquired by iov_iter_get_pages().  How do we decide
whether to unpin?  E.g. pipe_buffer carries a reference to page and no
way to tell whether it's a pinned one; results of iov_iter_get_pages()
on ITER_IOVEC *can* end up there, but thankfully only from data-source
(== WRITE, aka.  ITER_SOURCE) iov_iter.  So for those we don't care.
Then there's nfs_request; AFAICS, we do need to pin the references in
those if they are coming from nfs_direct_read_schedule_iovec(), but
not if they come from readpage_async_filler().  How do we deal with
coalescence, etc.?  It's been a long time since I really looked at
that code...  Christoph, could you give any comments on that one?

Note, BTW, that nfs_request coming from readpage_async_filler() have
pages locked by caller; the ones from nfs_direct_read_schedule_iovec()
do not, and that's where we want them pinned.  Resulting page references
end up (after quite a trip through data structures) stuffed into struct
rpc_rqst ->rc_recv_buf.pages[] and when a response arrives from server,
they get picked by xs_read_bvec() and fed to iov_iter_bvec().  In one
case it's safe since the pages are locked; in another - since they would
come from pin_user_pages().  The call chain at the time they are used
has nothing to do with the originator - sunrpc is looking at the arrived
response to READ that matches an rpc_rqst that had been created by sender
of that request and safety is the sender's responsibility.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-22  2:22                     ` Al Viro
@ 2022-09-22  6:09                       ` John Hubbard
  2022-09-22 11:29                         ` Jan Kara
  2022-09-22 14:38                       ` Christoph Hellwig
  1 sibling, 1 reply; 55+ messages in thread
From: John Hubbard @ 2022-09-22  6:09 UTC (permalink / raw)
  To: Al Viro, Jan Kara
  Cc: Christoph Hellwig, Andrew Morton, Jens Axboe, Miklos Szeredi,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 9/21/22 19:22, Al Viro wrote:
> On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:
> 
>>> How would that work?  What protects the area where you want to avoid running
>>> into pinned pages from previously acceptable page getting pinned?  If "they
>>> must have been successfully unmapped" is a part of what you are planning, we
>>> really do have a problem...
>>
>> But this is a very good question. So far the idea was that we lock the
>> page, unmap (or writeprotect) the page, and then check pincount == 0 and
>> that is a reliable method for making sure page data is stable (until we
>> unlock the page & release other locks blocking page faults and writes). But
>> once suddently ordinary page references can be used to create pins this
>> does not work anymore. Hrm.
>>
>> Just brainstorming ideas now: So we'd either need to obtain the pins early
>> when we still have the virtual address (but I guess that is often not
>> practical but should work e.g. for normal direct IO path) or we need some
>> way to "simulate" the page fault when pinning the page, just don't map it
>> into page tables in the end. This simulated page fault could be perhaps
>> avoided if rmap walk shows that the page is already mapped somewhere with
>> suitable permissions.
> 
> OK.  As far as I can see, the rules are along the lines of
> 	* creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe.
> 	  That includes
> 		* page known to be locked by caller
> 		* page being privately allocated and not visible to anyone else
> 		* iterator being data source
> 		* page coming from pin_user_pages(), possibly as the result of
> 		  iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF.
> 	* ITER_PIPE pages are always safe
> 	* pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator
> 	  had been created with such.
> My preference would be to have iov_iter_get_pages() and friends pin if and
> only if we have data-destination iov_iter that is user-backed.  For
> data-source user-backed we only need FOLL_GET, and for all other flavours
> (ITER_BVEC, etc.) we only do get_page(), if we need to grab any references
> at all.

This rule would mostly work, as long as we can relax it in some cases to
allow pinning of both source and destination pages, instead of just
destination pages. In particular, bio_release_pages() has lost all context
about whether it was a read or a write request, as far as I can tell. And
bio_release_pages() is the primary place to unpin pages for direct IO.
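
For reference, the declaration in question (just the current signature, to
make the problem concrete): by the time this runs, only the bio and a
mark_dirty flag are left, and nothing records whether the pages were pinned
or merely acquired with get_page():

	void bio_release_pages(struct bio *bio, bool mark_dirty);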

> 
> What I'd like to have is the understanding of the places where we drop
> the references acquired by iov_iter_get_pages().  How do we decide
> whether to unpin?  E.g. pipe_buffer carries a reference to page and no
> way to tell whether it's a pinned one; results of iov_iter_get_pages()
> on ITER_IOVEC *can* end up there, but thankfully only from data-source
> (== WRITE, aka.  ITER_SOURCE) iov_iter.  So for those we don't care.
> Then there's nfs_request; AFAICS, we do need to pin the references in
> those if they are coming from nfs_direct_read_schedule_iovec(), but
> not if they come from readpage_async_filler().  How do we deal with
> coalescence, etc.?  It's been a long time since I really looked at
> that code...  Christoph, could you give any comments on that one?
> 
> Note, BTW, that nfs_request coming from readpage_async_filler() have
> pages locked by caller; the ones from nfs_direct_read_schedule_iovec()
> do not, and that's where we want them pinned.  Resulting page references
> end up (after quite a trip through data structures) stuffed into struct
> rpc_rqst ->rq_rcv_buf.pages[] and when a response arrives from the server,
> they get picked by xs_read_bvec() and fed to iov_iter_bvec().  In one
> case it's safe since the pages are locked; in another - since they would
> come from pin_user_pages().  The call chain at the time they are used
> has nothing to do with the originator - sunrpc is looking at the arrived
> response to READ that matches an rpc_rqst that had been created by sender
> of that request and safety is the sender's responsibility.

For NFS Direct, is there any reason it can't be as simple as the
following? (Conceptually, that is; the implementation of
iov_iter_pin_pages_alloc() is not shown here.)


diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1707f46b1335..7dbc705bab83 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -142,13 +142,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 	return 0;
 }
 
-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
-{
-	unsigned int i;
-	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
-}
-
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
 			      struct nfs_direct_req *dreq)
 {
@@ -332,7 +325,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
+		result = iov_iter_pin_pages_alloc(iter, &pagevec,
 						  rsize, &pgbase);
 		if (result < 0)
 			break;
@@ -362,7 +355,16 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+
+		/*
+		 * iov_iter_pin_pages_alloc() calls pin_user_pages_fast() for
+		 * the user_backed_iter() case (only).
+		 */
+		if (user_backed_iter(iter))
+			unpin_user_pages(pagevec, npages);
+		else
+			release_pages(pagevec, npages);
+
 		kvfree(pagevec);
 		if (result < 0)
 			break;
@@ -829,7 +831,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		release_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;

thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-22  6:09                       ` John Hubbard
@ 2022-09-22 11:29                         ` Jan Kara
  2022-09-23  3:19                           ` Al Viro
  2022-09-23  4:34                           ` John Hubbard
  0 siblings, 2 replies; 55+ messages in thread
From: Jan Kara @ 2022-09-22 11:29 UTC (permalink / raw)
  To: John Hubbard
  Cc: Al Viro, Jan Kara, Christoph Hellwig, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Wed 21-09-22 23:09:06, John Hubbard wrote:
> On 9/21/22 19:22, Al Viro wrote:
> > On Thu, Sep 15, 2022 at 10:16:25AM +0200, Jan Kara wrote:
> > 
> >>> How would that work?  What protects the area where you want to avoid running
> >>> into pinned pages from previously acceptable page getting pinned?  If "they
> >>> must have been successfully unmapped" is a part of what you are planning, we
> >>> really do have a problem...
> >>
> >> But this is a very good question. So far the idea was that we lock the
> >> page, unmap (or writeprotect) the page, and then check pincount == 0 and
> >> that is a reliable method for making sure page data is stable (until we
> >> unlock the page & release other locks blocking page faults and writes). But
> >> once suddenly ordinary page references can be used to create pins this
> >> does not work anymore. Hrm.
> >>
> >> Just brainstorming ideas now: So we'd either need to obtain the pins early,
> >> when we still have the virtual address (I guess that is often not practical,
> >> but it should work e.g. for the normal direct IO path), or we need some way
> >> to "simulate" the page fault when pinning the page, just without mapping it
> >> into the page tables in the end. This simulated page fault could perhaps be
> >> avoided if an rmap walk shows that the page is already mapped somewhere with
> >> suitable permissions.
> > 
> > OK.  As far as I can see, the rules are along the lines of
> > 	* creator of ITER_BVEC/ITER_XARRAY is responsible for pages being safe.
> > 	  That includes
> > 		* page known to be locked by caller
> > 		* page being privately allocated and not visible to anyone else
> > 		* iterator being data source
> > 		* page coming from pin_user_pages(), possibly as the result of
> > 		  iov_iter_pin_pages() on ITER_IOVEC/ITER_UBUF.
> > 	* ITER_PIPE pages are always safe
> > 	* pages found in ITER_BVEC/ITER_XARRAY are safe, since the iterator
> > 	  had been created with such.
> > My preference would be to have iov_iter_get_pages() and friends pin if and
> > only if we have data-destination iov_iter that is user-backed.  For
> > data-source user-backed we only need FOLL_GET, and for all other flavours
> > (ITER_BVEC, etc.) we only do get_page(), if we need to grab any references
> > at all.
> 
> This rule would mostly work, as long as we can relax it in some cases to
> allow pinning of both source and destination pages, instead of just
> destination pages. In particular, bio_release_pages() has lost all context
> about whether it was a read or a write request, as far as I can tell. And
> bio_release_pages() is the primary place to unpin pages for direct IO.

Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in
bio_release_pages(). I think we can easily spare another bio flag to tell
whether we need to unpin or not. So as long as all the pages in the created
bio need the same treatment, the situation should be simple.
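
Roughly like this, say (a minimal sketch; BIO_PAGE_PINNED is a hypothetical
flag name, not something from this series):

static void bio_release_page(struct bio *bio, struct page *page)
{
	if (bio_flagged(bio, BIO_PAGE_PINNED))
		unpin_user_page(page);	/* drop a FOLL_PIN pin */
	else
		put_page(page);		/* drop a plain get_page() reference */
}

The flag would be set once at bio creation time, when the pages are obtained
via pin_user_pages*(), so every page in the bio gets the same treatment.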

> > What I'd like to have is the understanding of the places where we drop
> > the references acquired by iov_iter_get_pages().  How do we decide
> > whether to unpin?  E.g. pipe_buffer carries a reference to page and no
> > way to tell whether it's a pinned one; results of iov_iter_get_pages()
> > on ITER_IOVEC *can* end up there, but thankfully only from data-source
> > (== WRITE, aka.  ITER_SOURCE) iov_iter.  So for those we don't care.
> > Then there's nfs_request; AFAICS, we do need to pin the references in
> > those if they are coming from nfs_direct_read_schedule_iovec(), but
> > not if they come from readpage_async_filler().  How do we deal with
> > coalescence, etc.?  It's been a long time since I really looked at
> > that code...  Christoph, could you give any comments on that one?
> > 
> > Note, BTW, that nfs_request coming from readpage_async_filler() have
> > pages locked by caller; the ones from nfs_direct_read_schedule_iovec()
> > do not, and that's where we want them pinned.  Resulting page references
> > end up (after quite a trip through data structures) stuffed into struct
> > rpc_rqst ->rq_rcv_buf.pages[] and when a response arrives from the server,
> > they get picked by xs_read_bvec() and fed to iov_iter_bvec().  In one
> > case it's safe since the pages are locked; in another - since they would
> > come from pin_user_pages().  The call chain at the time they are used
> > has nothing to do with the originator - sunrpc is looking at the arrived
> > response to READ that matches an rpc_rqst that had been created by sender
> > of that request and safety is the sender's responsibility.
> 
> For NFS Direct, is there any reason it can't be as simple as the
> following? (Conceptually, that is; the implementation of
> iov_iter_pin_pages_alloc() is not shown here.)
> 
> 
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index 1707f46b1335..7dbc705bab83 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -142,13 +142,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
>  	return 0;
>  }
>  
> -static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
> -{
> -	unsigned int i;
> -	for (i = 0; i < npages; i++)
> -		put_page(pages[i]);
> -}
> -
>  void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
>  			      struct nfs_direct_req *dreq)
>  {
> @@ -332,7 +325,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
>  		size_t pgbase;
>  		unsigned npages, i;
>  
> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
> +		result = iov_iter_pin_pages_alloc(iter, &pagevec,
>  						  rsize, &pgbase);
>  		if (result < 0)
>  			break;
> @@ -362,7 +355,16 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
>  			pos += req_len;
>  			dreq->bytes_left -= req_len;
>  		}
> -		nfs_direct_release_pages(pagevec, npages);
> +
> +		/*
> +		 * iov_iter_pin_pages_alloc() calls pin_user_pages_fast() for
> +		 * the user_backed_iter() case (only).
> +		 */
> +		if (user_backed_iter(iter))
> +			unpin_user_pages(pagevec, npages);
> +		else
> +			release_pages(pagevec, npages);
> +

I don't think this will work. The pin nfs_direct_read_schedule_iovec()
obtains needs to be released once the IO is completed, not once the IO is
submitted. Notice how nfs_create_request()->__nfs_create_request() gets
another page reference which is released on completion
(nfs_direct_read_completion->nfs_release_request->nfs_page_group_destroy->
nfs_free_request->nfs_clear_request). We need to stop releasing the
obtained pin in nfs_direct_read_schedule_iovec() (and acquiring another
reference in __nfs_create_request()) and instead propagate it to
nfs_clear_request(), where it can get released.
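
In other words, the intended lifetime is roughly this (a conceptual sketch
with made-up names, not actual NFS code; the point is only that the unpin
moves from the submission path to the request destructor):

struct dio_read_req {
	struct page	**pages;
	unsigned long	npages;
};

/* the submission path hands its pins over to the request; no unpin here */

static void dio_read_req_free(struct dio_read_req *req)
{
	/* completion is where the pins are finally dropped */
	unpin_user_pages(req->pages, req->npages);
	kvfree(req->pages);
	kfree(req);
}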

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-14  3:51             ` Al Viro
  2022-09-14 14:52               ` Jan Kara
@ 2022-09-22 14:31               ` Christoph Hellwig
  2022-09-22 14:36                 ` Al Viro
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-22 14:31 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Jan Kara, John Hubbard, Andrew Morton,
	Jens Axboe, Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Wed, Sep 14, 2022 at 04:51:17AM +0100, Al Viro wrote:
> Unless I'm misreading Jan, the question is whether they should get or
> pin.

And I think the answer is:  inside ->read_iter or ->write_iter they
should neither get nor pin.  The callers need to pin the pages if they
are pagecache pages that can potentially be written to through shared
mappings, else a get would be enough.  But the method instance should
not have to care and just be able to rely on the caller making sure
they do not go away.
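
As a sketch of that caller-side rule (the helper name is made up;
pin_user_page() is the primitive introduced earlier in this series):

static void grab_page_for_io(struct page *page, bool data_dest)
{
	if (data_dest && page_mapping(page))
		pin_user_page(page);	/* pagecache page the I/O writes into */
	else
		get_page(page);		/* a plain reference is enough */
}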

> I'm really tempted to slap
> 	if (WARN_ON(i->data_source))
> 		return 0;
> into copy_to_iter() et.al., along with its opposite for copy_from_iter().

Yes, I think that would be useful.  And we could use something more
descriptive than READ/WRITE to start with.
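
Sketched out (the names are assumptions here, modelled on the check Al
proposed plus a pair of more descriptive direction aliases):

#define ITER_SOURCE	1	/* == WRITE: the iterator supplies data */
#define ITER_DEST	0	/* == READ:  the iterator receives data */

static size_t copy_to_iter_checked(const void *addr, size_t bytes,
				   struct iov_iter *i)
{
	/* copying *to* a data-source iterator is a caller bug */
	if (WARN_ON_ONCE(i->data_source))
		return 0;
	return copy_to_iter(addr, bytes, i);
}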

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-20  5:02                       ` Al Viro
@ 2022-09-22 14:36                         ` Christoph Hellwig
  2022-09-22 14:43                           ` David Hildenbrand
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-22 14:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Christoph Hellwig, John Hubbard, Andrew Morton,
	Jens Axboe, Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Tue, Sep 20, 2022 at 06:02:11AM +0100, Al Viro wrote:
> nvme target: nvme read requests end up with somebody allocating and filling
> sglist, followed by reading from file into it (using ITER_BVEC).  Then the
> pages are sent out, presumably.

Yes.

> I would be very surprised if it turned out
> to be anything other than anon pages allocated by the driver, but I'd like
> to see that confirmed by nvme folks.  Probably doesn't need pinning.

They are anon pages allocated by the driver using sgl_alloc().

> drivers/target/target_core_file.c:292:  iov_iter_bvec(&iter, is_write, aio_cmd->bvecs, sgl_nents, len);

Same as nvme target.

> The picture so far looks like we mostly need to take care of pinning when
> we obtain the references from iov_iter_get_pages().  What's more, it looks
> like for ITER_BVEC/ITER_XARRAY/ITER_PIPE we really don't need to pin anything on
> get_pages/pin_pages - they are already protected (or, in case of ITER_PIPE,
> allocated by iov_iter itself and not reachable by anybody outside).

That's what I've been trying to say for a while...
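
Summed up as code, the rule we seem to be converging on is simply (an
illustrative helper, not something from the series):

/*
 * Only user-backed iterators need new references on get_pages/pin_pages;
 * BVEC/XARRAY/PIPE pages are already protected by their creator.
 */
static bool iov_iter_needs_pin(const struct iov_iter *i)
{
	return user_backed_iter(i);
}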


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-22 14:31               ` Christoph Hellwig
@ 2022-09-22 14:36                 ` Al Viro
  0 siblings, 0 replies; 55+ messages in thread
From: Al Viro @ 2022-09-22 14:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu, Sep 22, 2022 at 07:31:36AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 14, 2022 at 04:51:17AM +0100, Al Viro wrote:
> > Unless I'm misreading Jan, the question is whether they should get or
> > pin.
> 
> And I think the answer is:  inside ->read_iter or ->write_iter they
> should neither get nor pin.  The callers need to pin the pages if they
> are pagecache pages that can potentially be written to through shared
> mappings, else a get would be enough.  But the method instance should
> not have to care and just be able to rely on the caller making sure
> they do not go away.

The interesting part, AFAICS, is where we _unpin_ them and how we
keep track of which pages (obtained from iov_iter_get_pages() et al.)
need to be unpinned.

> > I'm really tempted to slap
> > 	if (WARN_ON(i->data_source))
> > 		return 0;
> > into copy_to_iter() et.al., along with its opposite for copy_from_iter().
> 
> Yes, I think that would be useful.  And we could use something more
> descriptive than READ/WRITE to start with.

See #work.iov_iter; done, but it took a bit of fixing the places that
create iov_iter instances.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-22  2:22                     ` Al Viro
  2022-09-22  6:09                       ` John Hubbard
@ 2022-09-22 14:38                       ` Christoph Hellwig
  2022-09-23  4:22                         ` Al Viro
  1 sibling, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-22 14:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, Christoph Hellwig, John Hubbard, Andrew Morton,
	Jens Axboe, Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu, Sep 22, 2022 at 03:22:48AM +0100, Al Viro wrote:
> What I'd like to have is the understanding of the places where we drop
> the references acquired by iov_iter_get_pages().  How do we decide
> whether to unpin?

Add an iov_iter_unpin_pages() that does the right thing based on the
type.  (block will need a modified copy of it as it doesn't keep
the pages array around, but the logic will be the same).
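
Something along these lines (a rough sketch; the helper name follows the
suggestion above, but the exact type checks are assumptions):

void iov_iter_unpin_pages(const struct iov_iter *i, struct page **pages,
			  unsigned long npages)
{
	if (user_backed_iter(i)) {
		/* acquired via pin_user_pages*(), i.e. FOLL_PIN */
		unpin_user_pages(pages, npages);
	} else if (iov_iter_is_bvec(i) || iov_iter_is_xarray(i) ||
		   iov_iter_is_pipe(i)) {
		/* the creator guarantees the lifetime; nothing to drop */
	} else {
		/* plain get_page() references */
		while (npages--)
			put_page(pages[npages]);
	}
}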

> E.g. pipe_buffer carries a reference to page and no
> way to tell whether it's a pinned one; results of iov_iter_get_pages()
> on ITER_IOVEC *can* end up there, but thankfully only from data-source
> (== WRITE, aka.  ITER_SOURCE) iov_iter.  So for those we don't care.
> Then there's nfs_request; AFAICS, we do need to pin the references in
> those if they are coming from nfs_direct_read_schedule_iovec(), but
> not if they come from readpage_async_filler().  How do we deal with
> coalescence, etc.?  It's been a long time since I really looked at
> that code...  Christoph, could you give any comments on that one?

I think the above should work, but I'll need to look at the NFS code
in more detail to be sure.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-22 14:36                         ` Christoph Hellwig
@ 2022-09-22 14:43                           ` David Hildenbrand
  2022-09-22 14:45                             ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread
From: David Hildenbrand @ 2022-09-22 14:43 UTC (permalink / raw)
  To: Christoph Hellwig, Al Viro
  Cc: Jan Kara, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 22.09.22 16:36, Christoph Hellwig wrote:
> On Tue, Sep 20, 2022 at 06:02:11AM +0100, Al Viro wrote:
>> nvme target: nvme read requests end up with somebody allocating and filling
>> sglist, followed by reading from file into it (using ITER_BVEC).  Then the
>> pages are sent out, presumably.
> 
> Yes.
> 
>> I would be very surprised if it turned out
>> to be anything other than anon pages allocated by the driver, but I'd like
>> to see that confirmed by nvme folks.  Probably doesn't need pinning.
> 
> They are anon pages allocated by the driver using sgl_alloc().

I assume they are not anon pages as in "PageAnon()", but simply not 
pagecache pages, correct?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-22 14:43                           ` David Hildenbrand
@ 2022-09-22 14:45                             ` Christoph Hellwig
  0 siblings, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-22 14:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Hellwig, Al Viro, Jan Kara, John Hubbard,
	Andrew Morton, Jens Axboe, Miklos Szeredi, Darrick J . Wong,
	Trond Myklebust, Anna Schumaker, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu, Sep 22, 2022 at 04:43:57PM +0200, David Hildenbrand wrote:
> I assume they are not anon pages as in "PageAnon()", but simply not
> pagecache pages, correct?

Yes, sorry.  From the page allocator and not added to the page cache.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-22 11:29                         ` Jan Kara
@ 2022-09-23  3:19                           ` Al Viro
  2022-09-23  4:05                             ` John Hubbard
  2022-09-23  4:34                           ` John Hubbard
  1 sibling, 1 reply; 55+ messages in thread
From: Al Viro @ 2022-09-23  3:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Christoph Hellwig, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu, Sep 22, 2022 at 01:29:35PM +0200, Jan Kara wrote:

> > This rule would mostly work, as long as we can relax it in some cases to
> > allow pinning of both source and destination pages, instead of just
> > destination pages. In particular, bio_release_pages() has lost all context
> > about whether it was a read or a write request, as far as I can tell. And
> > bio_release_pages() is the primary place to unpin pages for direct IO.
> 
> Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in
> bio_release_pages(). I think we can easily spare another bio flag to tell
> whether we need to unpin or not. So as long as all the pages in the created
> bio need the same treatment, the situation should be simple.

Yes.  Incidentally, the same condition is already checked by the creators
of those bio - see the assorted should_dirty logics.

While we are at it - how much of the rationale around bio_check_pages_dirty()
doing dirtying is still applicable with pinning pages before we stick them
into bio?  We do dirty them before submitting bio, then on completion
bio_check_pages_dirty() checks if something has marked them clean while
we'd been doing IO; if all of them are still dirty we just drop the pages
(well, unpin and drop), otherwise we arrange for dirty + unpin + drop
done in process context (via schedule_work()).  Can they be marked clean by
anyone while they are pinned?  After all, pinning is done to prevent
writeback getting done on them while we are modifying the suckers...
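
If the answer is "no, they can't", the deferred-redirtying machinery ought
to collapse into something like this on completion (sketch only, assuming
the pinned-pages-stay-dirty rule actually holds):

static void bio_complete_pinned_read(struct bio *bio)
{
	/* dirtied at submission and still dirty, so no re-dirtying pass */
	bio_release_pages(bio, false);
	bio_put(bio);
}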

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-23  3:19                           ` Al Viro
@ 2022-09-23  4:05                             ` John Hubbard
  2022-09-23  8:39                               ` Christoph Hellwig
  2022-09-23 12:22                               ` Jan Kara
  0 siblings, 2 replies; 55+ messages in thread
From: John Hubbard @ 2022-09-23  4:05 UTC (permalink / raw)
  To: Al Viro, Jan Kara
  Cc: Christoph Hellwig, Andrew Morton, Jens Axboe, Miklos Szeredi,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker,
	David Hildenbrand, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 9/22/22 20:19, Al Viro wrote:
> On Thu, Sep 22, 2022 at 01:29:35PM +0200, Jan Kara wrote:
> 
>>> This rule would mostly work, as long as we can relax it in some cases to
>>> allow pinning of both source and destination pages, instead of just
>>> destination pages. In particular, bio_release_pages() has lost all context
>>> about whether it was a read or a write request, as far as I can tell. And
>>> bio_release_pages() is the primary place to unpin pages for direct IO.
>>
>> Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in
>> bio_release_pages(). I think we can easily spare another bio flag to tell
>> whether we need to unpin or not. So as long as all the pages in the created
>> bio need the same treatment, the situation should be simple.
> 
> Yes.  Incidentally, the same condition is already checked by the creators
> of those bio - see the assorted should_dirty logics.

Beautiful!

> 
> While we are at it - how much of the rationale around bio_check_pages_dirty()
> doing dirtying is still applicable with pinning pages before we stick them
> into bio?  We do dirty them before submitting bio, then on completion
> bio_check_pages_dirty() checks if something has marked them clean while
> we'd been doing IO; if all of them are still dirty we just drop the pages
> (well, unpin and drop), otherwise we arrange for dirty + unpin + drop
> done in process context (via schedule_work()).  Can they be marked clean by
> anyone while they are pinned?  After all, pinning is done to prevent
> writeback getting done on them while we are modifying the suckers...

I certainly hope not. And in fact, we should really just say that that's
a rule: the whole time the page is pinned, it simply must remain dirty
and writable, at least with the way things are right now.

This reminds me that I'm not exactly sure what the rules for
FOLL_LONGTERM callers should be, with respect to dirtying. At the
moment, most, if not all of the code that does "set_page_dirty_lock();
unpin_user_page()" is wrong.

To fix those cases, IIUC, the answer is: you must make the page dirty
properly, with page_mkwrite(), not just with set_page_dirty_lock(). And
that has to be done probably a lot earlier, for reasons that I'm still
vague on. But perhaps right after pinning the page. (Assuming that we
hold off writeback while the page is pinned.)
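
For reference, the combined helper that exists today is
unpin_user_pages_dirty_lock(); the concern above is that even with it, the
dirtying step comes too late for file-backed pages (illustrative wrapper):

static void complete_pinned_read(struct page **pages, unsigned long npages)
{
	/*
	 * Dirty and unpin in one step; per the above, correct dirtying
	 * for file-backed pages really wants page_mkwrite() at pin time.
	 */
	unpin_user_pages_dirty_lock(pages, npages, true);
}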

Just wanted to see if that sounds right, while we're on the topic.

thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-22 14:38                       ` Christoph Hellwig
@ 2022-09-23  4:22                         ` Al Viro
  2022-09-23  8:44                           ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread
From: Al Viro @ 2022-09-23  4:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu, Sep 22, 2022 at 07:38:25AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 22, 2022 at 03:22:48AM +0100, Al Viro wrote:
> > What I'd like to have is the understanding of the places where we drop
> > the references acquired by iov_iter_get_pages().  How do we decide
> > whether to unpin?
> 
> Add an iov_iter_unpin_pages() that does the right thing based on the
> type.  (block will need a modified copy of it as it doesn't keep
> the pages array around, but the logic will be the same).

Huh?  You want to keep the type (+ direction) of iov_iter in any structure
a page reference coming from iov_iter_get_pages might end up in?  IDGI...

BTW, speaking of lifetime rules - am I right assuming that fd_execute_rw()
does IO on pages of the scatterlist passed to it?  Where are they getting
dropped and what guarantees that IO is complete by that point?

The reason I'm asking is that here you have an ITER_BVEC possibly fed to
__blkdev_direct_IO_async(), with its
        if (iov_iter_is_bvec(iter)) {
                /*
                 * Users don't rely on the iterator being in any particular
                 * state for async I/O returning -EIOCBQUEUED, hence we can
                 * avoid expensive iov_iter_advance(). Bypass
                 * bio_iov_iter_get_pages() and set the bvec directly.
                 */
                bio_iov_bvec_set(bio, iter);
which does *not* grab the page references.  Sure, bio_release_pages() knows
to leave those alone and doesn't drop anything.  However, what is the
mechanism preventing the pages getting freed before the IO completion
in this case?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-22 11:29                         ` Jan Kara
  2022-09-23  3:19                           ` Al Viro
@ 2022-09-23  4:34                           ` John Hubbard
  1 sibling, 0 replies; 55+ messages in thread
From: John Hubbard @ 2022-09-23  4:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Christoph Hellwig, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On 9/22/22 04:29, Jan Kara wrote:
> I don't think this will work. The pin nfs_direct_read_schedule_iovec()
> obtains needs to be released once the IO is completed, not once the IO is
> submitted. Notice how nfs_create_request()->__nfs_create_request() gets
> another page reference which is released on completion
> (nfs_direct_read_completion->nfs_release_request->nfs_page_group_destroy->
> nfs_free_request->nfs_clear_request). We need to stop releasing the
> obtained pin in nfs_direct_read_schedule_iovec() (and acquiring another
> reference in __nfs_create_request()) and instead propagate it to
> nfs_clear_request(), where it can get released.
> 
> 								Honza

OK, now I see how that is supposed to work, thanks for explaining!


thanks,

-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-23  4:05                             ` John Hubbard
@ 2022-09-23  8:39                               ` Christoph Hellwig
  2022-09-23 12:22                               ` Jan Kara
  1 sibling, 0 replies; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-23  8:39 UTC (permalink / raw)
  To: John Hubbard
  Cc: Al Viro, Jan Kara, Christoph Hellwig, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu, Sep 22, 2022 at 09:05:16PM -0700, John Hubbard wrote:
> I certainly hope not. And in fact, we should really just say that that's
> a rule: the whole time the page is pinned, it simply must remain dirty
> and writable, at least with the way things are right now.

Yes, if we can stick to that rule and make sure shared pagecache is
never dirtied through get_user_pages() anywhere, that will allow us to
fix a lot of mess.

> To fix those cases, IIUC, the answer is: you must make the page dirty
> properly, with page_mkwrite(), not just with set_page_dirty_lock(). And
> that has to be done probably a lot earlier, for reasons that I'm still
> vague on. But perhaps right after pinning the page. (Assuming that we
> hold off writeback while the page is pinned.)

I think we need to hold off the writeback for it to work properly.
The big question is whether there are callers that do expect data
to be written back on mappings that are longterm pinned.  RDMA or
vfio come to mind.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-23  4:22                         ` Al Viro
@ 2022-09-23  8:44                           ` Christoph Hellwig
  2022-09-23 16:13                             ` Al Viro
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-23  8:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Jan Kara, John Hubbard, Andrew Morton,
	Jens Axboe, Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Fri, Sep 23, 2022 at 05:22:17AM +0100, Al Viro wrote:
> > Add an iov_iter_unpin_pages() that does the right thing based on the
> > type.  (block will need a modified copy of it as it doesn't keep
> > the pages array around, but the logic will be the same).
> 
> Huh?  You want to keep the type (+ direction) of iov_iter in any structure
> a page reference coming from iov_iter_get_pages might end up in?  IDGI...

Why would I?  We generally do have, or should have, the iov_iter around.
And for the common case where we don't (bios), we can carry that
information in the bio, as it needs a special unmap helper anyway.

But if you don't want to use the iov_iter for some reason, we'll just
need to condense the information to a flags variable and then pass that.

> 
> BTW, speaking of lifetime rules - am I right assuming that fd_execute_rw()
> does IO on pages of the scatterlist passed to it?

Yes.

> Where are they getting
> dropped and what guarantees that IO is complete by that point?

The exact place depends on the exact target frontend, of which we have
a few.  But it happens from the end_io callback that is triggered
through a call to target_complete_cmd().

> The reason I'm asking is that here you have an ITER_BVEC possibly fed to
> __blkdev_direct_IO_async(), with its
>         if (iov_iter_is_bvec(iter)) {
>                 /*
>                  * Users don't rely on the iterator being in any particular
>                  * state for async I/O returning -EIOCBQUEUED, hence we can
>                  * avoid expensive iov_iter_advance(). Bypass
>                  * bio_iov_iter_get_pages() and set the bvec directly.
>                  */
>                 bio_iov_bvec_set(bio, iter);
> > which does *not* grab the page references.  Sure, bio_release_pages() knows
> to leave those alone and doesn't drop anything.  However, what is the
> mechanism preventing the pages getting freed before the IO completion
> in this case?

The contract is that callers of bvec iters need to hold their own
references, as without that doing I/O to them would be unsafe.  If they
did not hold references, the pages could go away before even calling
bio_iov_iter_get_pages() (or this open-coded bio_iov_bvec_set()).
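
As a sketch of that contract (names invented; the shape is what matters -
the submitter's own reference outlives the async I/O):

struct my_async_io {
	struct bio_vec	bvec;
	struct kiocb	iocb;
};

static void my_async_io_done(struct kiocb *iocb, long ret)
{
	struct my_async_io *io = container_of(iocb, struct my_async_io, iocb);

	/* only now may the submitter's page reference go away */
	put_page(io->bvec.bv_page);
	kfree(io);
}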

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-23  4:05                             ` John Hubbard
  2022-09-23  8:39                               ` Christoph Hellwig
@ 2022-09-23 12:22                               ` Jan Kara
  1 sibling, 0 replies; 55+ messages in thread
From: Jan Kara @ 2022-09-23 12:22 UTC (permalink / raw)
  To: John Hubbard
  Cc: Al Viro, Jan Kara, Christoph Hellwig, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu 22-09-22 21:05:16, John Hubbard wrote:
> On 9/22/22 20:19, Al Viro wrote:
> > On Thu, Sep 22, 2022 at 01:29:35PM +0200, Jan Kara wrote:
> > 
> >>> This rule would mostly work, as long as we can relax it in some cases to
> >>> allow pinning of both source and destination pages, instead of just
> >>> destination pages. In particular, bio_release_pages() has lost all context
> >>> about whether it was a read or a write request, as far as I can tell. And
> >>> bio_release_pages() is the primary place to unpin pages for direct IO.
> >>
> >> Well, we already do have BIO_NO_PAGE_REF bio flag that gets checked in
> >> bio_release_pages(). I think we can easily spare another bio flag to tell
> >> whether we need to unpin or not. So as long as all the pages in the created
> >> bio need the same treatment, the situation should be simple.
> > 
> > Yes.  Incidentally, the same condition is already checked by the creators
> > of those bio - see the assorted should_dirty logics.
> 
> Beautiful!
> 
> > 
> > While we are at it - how much of the rationale around bio_check_pages_dirty()
> > doing dirtying is still applicable with pinning pages before we stick them
> > into bio?  We do dirty them before submitting bio, then on completion
> > bio_check_pages_dirty() checks if something has marked them clean while
> > we'd been doing IO; if all of them are still dirty we just drop the pages
> > (well, unpin and drop), otherwise we arrange for dirty + unpin + drop
> > done in process context (via schedule_work()).  Can they be marked clean by
> > anyone while they are pinned?  After all, pinning is done to prevent
> > writeback getting done on them while we are modifying the suckers...
> 
> I certainly hope not. And in fact, we should really just say that that's
> a rule: the whole time the page is pinned, it simply must remain dirty
> and writable, at least with the way things are right now.

I agree the page should be staying dirty the whole time it is pinned. I
don't think it is feasible to keep it writeable in the page tables because
that would mean you would need to block e.g. munmap() until the pages gets
unpinned and that will almost certainly upset some current userspace.

But keeping page dirty should be enough so that we can get rid of all these
nasty calls to set_page_dirty() from IO completion.

> This reminds me that I'm not exactly sure what the rules for
> FOLL_LONGTERM callers should be, with respect to dirtying. At the
> moment, most, if not all of the code that does "set_page_dirty_lock();
> unpin_user_page()" is wrong.

Right.

> To fix those cases, IIUC, the answer is: you must make the page dirty
> properly, with page_mkwrite(), not just with set_page_dirty_lock(). And

Correct, and GUP (or PUP) actually does that under the hood so I don't
think we need to change anything there.

> that has to be done probably a lot earlier, for reasons that I'm still
> vague on. But perhaps right after pinning the page. (Assuming that we
> hold off writeback while the page is pinned.)

Holding off writeback is not always doable - as Christoph mentions, for
data integrity writeback we'll have to get the data to disk before the page
is unpinned (as for longterm users it can take days for the page to be
unpinned). But we can just writeback the page without clearing the dirty
bit in these cases. We may need to use bounce pages to be able to safely
writeback pinned pages but that's another part of the story...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-23  8:44                           ` Christoph Hellwig
@ 2022-09-23 16:13                             ` Al Viro
  2022-09-26 15:53                               ` Christoph Hellwig
  0 siblings, 1 reply; 55+ messages in thread
From: Al Viro @ 2022-09-23 16:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Fri, Sep 23, 2022 at 01:44:33AM -0700, Christoph Hellwig wrote:

> Why would I?  We generally do have, or should have, the iov_iter around.

Not for async IO.

> And for the common case where we don't (bios), we can carry that
> information in the bio, as it needs a special unmap helper anyway.

*Any* async read_iter is like that.

> > Where are they getting
> > dropped and what guarantees that IO is complete by that point?
> 
> The exact place depends on the exact target frontend, of which we have
> a few.  But it happens from the end_io callback that is triggered
> through a call to target_complete_cmd().

OK...

> > The reason I'm asking is that here you have an ITER_BVEC possibly fed to
> > __blkdev_direct_IO_async(), with its
> >         if (iov_iter_is_bvec(iter)) {
> >                 /*
> >                  * Users don't rely on the iterator being in any particular
> >                  * state for async I/O returning -EIOCBQUEUED, hence we can
> >                  * avoid expensive iov_iter_advance(). Bypass
> >                  * bio_iov_iter_get_pages() and set the bvec directly.
> >                  */
> >                 bio_iov_bvec_set(bio, iter);
> > which does *not* grab the page references.  Sure, bio_release_pages() knows
> > to leave those alone and doesn't drop anything.  However, what is the
> > mechanism preventing the pages getting freed before the IO completion
> > in this case?
> 
> The contract is that callers of bvec iters need to hold their own
> references, as without that doing I/O to them would be unsafe.  If they
> did not hold references, the pages could go away before even calling
> bio_iov_iter_get_pages() (or this open-coded bio_iov_bvec_set()).

You are mixing two issues here - holding references to pages while using
the iov_iter instance is obvious; holding them until async IO is complete,
even though struct iov_iter might be long gone by that point, is a
different story.

And originating iov_iter instance really can be long-gone by the time
of IO completion - requirement to keep it around would be very hard to
satisfy.  I've no objections to requiring the pages in ITER_BVEC to be
preserved at least until the IO completion by means independent of
whatever ->read_iter/->write_iter does to them, but
	* that needs to be spelled out very clearly and
	* we need to verify that it is, indeed, the case for all existing
iov_iter_bvec callers, preferably with comments next to non-obvious ones
(something that is followed only by the sync IO is obvious)

That goes not just for bio - if we make get_pages *NOT* grab references
on ITER_BVEC (and I'm all for it), we need to make sure that those
pages won't be retained after the original protection runs out.  Which
includes the reference put into struct nfs_request, for example, as well
as whatever ceph transport is doing, etc.  Another thing that needs to
be sorted out is __zerocopy_sg_from_iter() and its callers - AFAICS,
all of those are in ->sendmsg() with MSG_ZEROCOPY in flags.

It's a non-trivial amount of code audit - we have about 40 iov_iter_bvec()
callers in the tree, and while many are clearly sync-only... the ones
that are not tend to balloon into interesting analysis of call chains, etc.

Don't get me wrong - that analysis needs to be done, just don't expect
it to be trivial.  And it will require quite a bit of cooperation from the
folks familiar with e.g. drivers/target, or with ceph transport layer,
etc.

FWIW, my preference would be along the lines of

	Backing memory for any non user-backed iov_iter should be protected
	from reuse by creator of iov_iter and that protection should continue
	through not just all synchronous operations with iov_iter in question
	- it should last until all async operations involving that memory are
	finished.  That continued protection must not depend upon taking
	extra page references, etc. while we are working with iov_iter.

But getting there will take quite a bit of code audit and probably some massage
as well.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-23 16:13                             ` Al Viro
@ 2022-09-26 15:53                               ` Christoph Hellwig
  2022-09-26 19:55                                 ` Al Viro
  0 siblings, 1 reply; 55+ messages in thread
From: Christoph Hellwig @ 2022-09-26 15:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Jan Kara, John Hubbard, Andrew Morton,
	Jens Axboe, Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Fri, Sep 23, 2022 at 05:13:42PM +0100, Al Viro wrote:
> You are mixing two issues here - holding references to pages while using
> the iov_iter instance is obvious; holding them until async IO is complete,
> even though struct iov_iter might be long gone by that point, is a
> different story.

But someone needs to hold a reference until the I/O is completed, because
the I/O obviously needs the pages.  Yes, we could say the caller holds
them and can drop the references right after I/O submission, while
the method needs to grab another reference.  But that is more
complicated and more costly than just holding the damn reference.

> And originating iov_iter instance really can be long-gone by the time
> of IO completion - requirement to keep it around would be very hard to
> satisfy.  I've no objections to requiring the pages in ITER_BVEC to be
> preserved at least until the IO completion by means independent of
> whatever ->read_iter/->write_iter does to them, but
> 	* that needs to be spelled out very clearly and
> 	* we need to verify that it is, indeed, the case for all existing
> iov_iter_bvec callers, preferably with comments next to non-obvious ones
> (something that is followed only by the sync IO is obvious)

Agreed.

> That goes not just for bio - if we make get_pages *NOT* grab references
> on ITER_BVEC (and I'm all for it), we need to make sure that those
> pages won't be retained after the original protection runs out.

Yes.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines
  2022-09-26 15:53                               ` Christoph Hellwig
@ 2022-09-26 19:55                                 ` Al Viro
  0 siblings, 0 replies; 55+ messages in thread
From: Al Viro @ 2022-09-26 19:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, David Hildenbrand, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Mon, Sep 26, 2022 at 08:53:43AM -0700, Christoph Hellwig wrote:
> On Fri, Sep 23, 2022 at 05:13:42PM +0100, Al Viro wrote:
> > You are mixing two issues here - holding references to pages while using
> > the iov_iter instance is obvious; holding them until async IO is complete,
> > even though struct iov_iter might be long gone by that point, is a
> > different story.
> 
> But someone needs to hold a reference until the I/O is completed, because
> the I/O obviously needs the pages.  Yes, we could say the caller holds
> them and can drop the references right after I/O submission, while
> the method needs to grab another reference.  But that is more
> complicated and more costly than just holding the damn reference.

Take a look at __nfs_create_request().  And trace the call chains leading
to nfs_clear_request() where the corresponding put_page() happens.

What I'm afraid of is something similar in the bowels of some RDMA driver.
With upper layers shoving page references into sglist using iov_iter_get_pages(),
then passing sglist to some intermediate layer, then *that* getting passed down
into a driver which grabs references for its own use and releases them from
destructor of some private structure.  Done via kref_put().  Have that
delayed by, hell - anything, up to and including debugfs shite somewhere
in the same driver, iterating through those private structures, grabbing
a reference to do some pretty-printing into a kmalloc'ed buffer, then dropping it.
Voila - we have page refs duplicated from ITER_BVEC and occasionally staying
around after the ->ki_complete() of async ->write_iter() that got that
ITER_BVEC.

It's really not a trivial rule change.

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread

Thread overview: 55+ messages
2022-08-31  4:18 [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast() John Hubbard
2022-08-31  4:18 ` [PATCH v2 1/7] mm: change release_pages() to use unsigned long for npages John Hubbard
2022-08-31  4:18 ` [PATCH v2 2/7] mm/gup: introduce pin_user_page() John Hubbard
2022-09-06  6:37   ` Christoph Hellwig
2022-09-06  7:12     ` John Hubbard
2022-08-31  4:18 ` [PATCH v2 3/7] block: add dio_w_*() wrappers for pin, unpin user pages John Hubbard
2022-08-31  4:18 ` [PATCH v2 4/7] iov_iter: new iov_iter_pin_pages*() routines John Hubbard
2022-09-01  0:42   ` Al Viro
2022-09-01  1:48     ` John Hubbard
2022-09-06  6:47   ` Christoph Hellwig
2022-09-06  7:44     ` John Hubbard
2022-09-06  7:48       ` Christoph Hellwig
2022-09-06  7:58         ` John Hubbard
2022-09-07  8:50           ` Christoph Hellwig
2022-09-06 10:21         ` Jan Kara
2022-09-07  8:45           ` Christoph Hellwig
2022-09-14  3:51             ` Al Viro
2022-09-14 14:52               ` Jan Kara
2022-09-14 16:42                 ` Al Viro
2022-09-15  8:16                   ` Jan Kara
2022-09-16  1:55                     ` Al Viro
2022-09-20  5:02                       ` Al Viro
2022-09-22 14:36                         ` Christoph Hellwig
2022-09-22 14:43                           ` David Hildenbrand
2022-09-22 14:45                             ` Christoph Hellwig
2022-09-22  2:22                     ` Al Viro
2022-09-22  6:09                       ` John Hubbard
2022-09-22 11:29                         ` Jan Kara
2022-09-23  3:19                           ` Al Viro
2022-09-23  4:05                             ` John Hubbard
2022-09-23  8:39                               ` Christoph Hellwig
2022-09-23 12:22                               ` Jan Kara
2022-09-23  4:34                           ` John Hubbard
2022-09-22 14:38                       ` Christoph Hellwig
2022-09-23  4:22                         ` Al Viro
2022-09-23  8:44                           ` Christoph Hellwig
2022-09-23 16:13                             ` Al Viro
2022-09-26 15:53                               ` Christoph Hellwig
2022-09-26 19:55                                 ` Al Viro
2022-09-22 14:31               ` Christoph Hellwig
2022-09-22 14:36                 ` Al Viro
2022-08-31  4:18 ` [PATCH v2 5/7] block, bio, fs: convert most filesystems to pin_user_pages_fast() John Hubbard
2022-09-06  6:48   ` Christoph Hellwig
2022-09-06  7:15     ` John Hubbard
2022-08-31  4:18 ` [PATCH v2 6/7] NFS: direct-io: convert to FOLL_PIN pages John Hubbard
2022-09-06  6:49   ` Christoph Hellwig
2022-09-06  7:16     ` John Hubbard
2022-08-31  4:18 ` [PATCH v2 7/7] fuse: convert direct IO paths to use FOLL_PIN John Hubbard
2022-08-31 10:37   ` Miklos Szeredi
2022-09-01  1:33     ` John Hubbard
2022-09-06  6:36 ` [PATCH v2 0/7] convert most filesystems to pin_user_pages_fast() Christoph Hellwig
2022-09-06  7:10   ` John Hubbard
2022-09-06  7:22     ` Christoph Hellwig
2022-09-06  7:37       ` John Hubbard
2022-09-06  7:46         ` Christoph Hellwig
