* [PATCH 0/6] convert most filesystems to pin_user_pages_fast()
@ 2022-08-27  8:36 John Hubbard
  2022-08-27  8:36 ` [PATCH 1/6] mm/gup: introduce pin_user_page() John Hubbard
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: John Hubbard @ 2022-08-27  8:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML, John Hubbard

Hi,

This converts the iomap core and bio_release_pages() to
pin_user_pages_fast(), referred to below simply as FOLL_PIN.

The conversion is temporarily guarded by
CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO. In the future (not part of this
series), when we are certain that all filesystems have converted their
Direct IO paths to FOLL_PIN, then we can do the final step, which is to
get rid of CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO and search-and-replace
the dio_w_*() functions with their final names (see bvec.h changes).

I'd like to get this part committed at some point, because it seems to
work well already, and it will help get the remaining items below
converted.

Status: although many filesystems have been converted, some remain to be
investigated. These include (you can recreate this list by grepping for
iov_iter_get_pages):

	cephfs
	cifs
	9P
	RDS
	net/core: datagram.c, skmsg.c
	net/tls
	fs/splice.c
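That grep can be sketched like this. The throwaway directory below only
makes the command runnable anywhere; in a kernel tree you would simply
run `grep -rl 'iov_iter_get_pages' fs/ net/` (or `git grep -l`):

```shell
# Sketch: list files still calling the FOLL_GET-based
# iov_iter_get_pages*() API, i.e. the remaining conversion candidates.
tmp=$(mktemp -d)
printf 'iov_iter_get_pages_alloc2(iter, &pagevec, rsize, &pgbase);\n' \
	> "$tmp/direct.c"
printf 'no gup calls here\n' > "$tmp/other.c"
grep -rl 'iov_iter_get_pages' "$tmp"	# prints only $tmp/direct.c
rm -r "$tmp"
```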

Testing: this passes light LTP and xfstests runs, fio, and a few other
similar workloads on my local x86_64 test machine, both with and without
CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO set.

Conflicts: Logan, the iov_iter parts of this will conflict with your
[PATCH v9 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags(),
but I think it's easy to resolve.

John Hubbard (6):
  mm/gup: introduce pin_user_page()
  block: add dio_w_*() wrappers for pin, unpin user pages
  iov_iter: new iov_iter_pin_pages*() routines
  block, bio, fs: convert most filesystems to pin_user_pages_fast()
  NFS: direct-io: convert to FOLL_PIN pages
  fuse: convert direct IO paths to use FOLL_PIN

 block/Kconfig        | 24 ++++++++++++++
 block/bio.c          | 27 ++++++++--------
 block/blk-map.c      |  7 +++--
 fs/direct-io.c       | 40 ++++++++++++------------
 fs/fuse/dev.c        |  8 +++--
 fs/fuse/file.c       | 31 ++++++++++++-------
 fs/fuse/fuse_i.h     |  1 +
 fs/iomap/direct-io.c |  2 +-
 fs/nfs/direct.c      | 19 ++++--------
 include/linux/bvec.h | 40 ++++++++++++++++++++++++
 include/linux/mm.h   |  1 +
 include/linux/uio.h  |  4 +++
 lib/iov_iter.c       | 74 +++++++++++++++++++++++++++++++++++++++++---
 mm/gup.c             | 33 ++++++++++++++++++++
 14 files changed, 244 insertions(+), 67 deletions(-)


base-commit: e022620b5d056e822e42eb9bc0f24fcb97389d86
--
2.37.2



* [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-27  8:36 [PATCH 0/6] convert most filesystems to pin_user_pages_fast() John Hubbard
@ 2022-08-27  8:36 ` John Hubbard
  2022-08-29 12:07   ` David Hildenbrand
  2022-08-27  8:36 ` [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages John Hubbard
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-27  8:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML, John Hubbard

pin_user_page() is an externally-usable version of try_grab_page(), but
with semantics that match get_page(), so that it can act as a drop-in
replacement for get_page(). Specifically, pin_user_page() has a void
return type.

pin_user_page() elevates a page's refcount using FOLL_PIN rules. This
means that the caller must release the page via unpin_user_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 982f2607180b..85a105157334 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1869,6 +1869,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+void pin_user_page(struct page *page);
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..245ccb41ed8c 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3213,6 +3213,39 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(pin_user_pages);
 
+/**
+ * pin_user_page() - apply a FOLL_PIN reference to a page
+ *
+ * @page: the page to be pinned.
+ *
+ * This is similar to get_page(), except that the page's refcount is
+ * elevated using FOLL_PIN, instead of FOLL_GET.
+ *
+ * IMPORTANT: The caller must release the page via unpin_user_page().
+ *
+ */
+void pin_user_page(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
+
+	/*
+	 * Similar to try_grab_page(): be sure to *also*
+	 * increment the normal page refcount field at least once,
+	 * so that the page really is pinned.
+	 */
+	if (folio_test_large(folio)) {
+		folio_ref_add(folio, 1);
+		atomic_add(1, folio_pincount_ptr(folio));
+	} else {
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+
+	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+}
+EXPORT_SYMBOL(pin_user_page);
+
 /*
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
-- 
2.37.2



* [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages
  2022-08-27  8:36 [PATCH 0/6] convert most filesystems to pin_user_pages_fast() John Hubbard
  2022-08-27  8:36 ` [PATCH 1/6] mm/gup: introduce pin_user_page() John Hubbard
@ 2022-08-27  8:36 ` John Hubbard
  2022-08-27 22:27   ` Andrew Morton
  2022-08-27  8:36 ` [PATCH 3/6] iov_iter: new iov_iter_pin_pages*() routines John Hubbard
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-27  8:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML, John Hubbard

Background: The Direct IO part of the block infrastructure is being
changed to use pin_user_page*() and unpin_user_page*() calls, in place
of a mix of get_user_pages_fast(), get_page(), and put_page(). These
have to be changed over all at the same time, for block, bio, and all
filesystems. However, most filesystems can be changed via iomap and core
filesystem routines, so let's get that in place, and then continue on
with converting the remaining filesystems (9P, CIFS) and anything else
that feeds pages into bios that are ultimately released via
bio_release_pages().

Add a new config parameter, CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and
dio_w_*() wrapper functions. The dio_w_ prefix was chosen for
uniqueness, so as to ease a subsequent kernel-wide rename via
search-and-replace. Together, these allow the developer to choose
between these sets of routines, for Direct IO code paths:

a) pin_user_pages_fast()
    pin_user_page()
    unpin_user_page()

b) get_user_pages_fast()
    get_page()
    put_page()

CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO is a temporary setting, and may
be deleted once the conversion is complete. In the meantime, developers
can enable this in order to try out each filesystem.

Please remember that the /proc/vmstat items below should normally
contain the same values as each other, except while pin/unpin operations
are in flight. As such, they can be helpful when monitoring test runs:

    nr_foll_pin_acquired
    nr_foll_pin_released
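A small awk sketch can check that the two counters balance. The helper
name is hypothetical; sample input is used here so it runs anywhere, and
on a test machine you would pipe in /proc/vmstat instead:

```shell
# Hypothetical helper: report whether the FOLL_PIN acquire/release
# counters match, i.e. whether all pinned pages were released.
check_foll_pin() {
	awk '/^nr_foll_pin_acquired/ { a = $2 }
	     /^nr_foll_pin_released/ { r = $2 }
	     END { if (a == r) print "balanced"; else print "delta " a - r }'
}

printf 'nr_foll_pin_acquired 100\nnr_foll_pin_released 100\n' | check_foll_pin
# prints "balanced"
```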

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/Kconfig        | 24 ++++++++++++++++++++++++
 include/linux/bvec.h | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 444c5ab3b67e..d4fdd606d138 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -48,6 +48,30 @@ config BLK_DEV_BSG_COMMON
 config BLK_ICQ
 	bool
 
+config BLK_USE_PIN_USER_PAGES_FOR_DIO
+	bool "DEVELOPERS ONLY: Enable pin_user_pages() for Direct IO" if EXPERT
+	default n
+
+	help
+	  For Direct IO code, retain the pages via calls to
+	  pin_user_pages_fast(), instead of via get_user_pages_fast().
+	  Likewise, use pin_user_page() instead of get_page(). And then
+	  release such pages via unpin_user_page(), instead of
+	  put_page().
+
+	  This is a temporary setting, which will be deleted once the
+	  conversion is completed, reviewed, and tested. In the meantime,
+	  developers can enable this in order to try out each filesystem.
+	  For that, it's best to monitor these /proc/vmstat items:
+
+		nr_foll_pin_acquired
+		nr_foll_pin_released
+
+	  ...to ensure that they remain equal, when "at rest".
+
+	  Say yes here ONLY if you are actively developing or testing the
+	  block layer or filesystems with pin_user_pages_fast().
+
 config BLK_DEV_BSGLIB
 	bool "Block layer SG support v4 helper lib"
 	select BLK_DEV_BSG_COMMON
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 35c25dff651a..33fc119da9fc 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -241,4 +241,44 @@ static inline void *bvec_virt(struct bio_vec *bvec)
 	return page_address(bvec->bv_page) + bvec->bv_offset;
 }
 
+#ifdef CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO
+#define dio_w_pin_user_pages_fast(s, n, p, f)	pin_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p)			pin_user_page(p)
+#define dio_w_iov_iter_pin_pages(i, p, m, n, s) iov_iter_pin_pages(i, p, m, n, s)
+#define dio_w_iov_iter_pin_pages_alloc(i, p, m, s) iov_iter_pin_pages_alloc(i, p, m, s)
+#define dio_w_unpin_user_page(p)		unpin_user_page(p)
+#define dio_w_unpin_user_pages(p, n)		unpin_user_pages(p, n)
+#define dio_w_unpin_user_pages_dirty_lock(p, n, d) unpin_user_pages_dirty_lock(p, n, d)
+
+#else
+#define dio_w_pin_user_pages_fast(s, n, p, f)	get_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p)			get_page(p)
+#define dio_w_iov_iter_pin_pages(i, p, m, n, s) iov_iter_get_pages2(i, p, m, n, s)
+#define dio_w_iov_iter_pin_pages_alloc(i, p, m, s) iov_iter_get_pages_alloc2(i, p, m, s)
+#define dio_w_unpin_user_page(p)		put_page(p)
+
+static inline void dio_w_unpin_user_pages(struct page **pages,
+					  unsigned long npages)
+{
+	unsigned long i;
+
+	for (i = 0; i < npages; i++)
+		put_page(pages[i]);
+}
+
+static inline void dio_w_unpin_user_pages_dirty_lock(struct page **pages,
+						     unsigned long npages,
+						     bool make_dirty)
+{
+	unsigned long i;
+
+	for (i = 0; i < npages; i++) {
+		if (make_dirty)
+			set_page_dirty_lock(pages[i]);
+		put_page(pages[i]);
+	}
+}
+
+#endif
+
 #endif /* __LINUX_BVEC_H */
-- 
2.37.2



* [PATCH 3/6] iov_iter: new iov_iter_pin_pages*() routines
  2022-08-27  8:36 [PATCH 0/6] convert most filesystems to pin_user_pages_fast() John Hubbard
  2022-08-27  8:36 ` [PATCH 1/6] mm/gup: introduce pin_user_page() John Hubbard
  2022-08-27  8:36 ` [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages John Hubbard
@ 2022-08-27  8:36 ` John Hubbard
  2022-08-27 22:46   ` Al Viro
  2022-08-27  8:36 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() John Hubbard
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-27  8:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML, John Hubbard

Provide two new wrapper routines that are intended for user space pages
only:

    iov_iter_pin_pages()
    iov_iter_pin_pages_alloc()

Internally, these routines call pin_user_pages_fast(), instead of
get_user_pages_fast().

As always, callers must release the pages via unpin_user_pages() or a
suitable FOLL_PIN variant, if the pages were actually acquired via
pin_user_pages_fast().

This is a prerequisite to converting bio/block layers over to use
pin_user_pages_fast().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/uio.h |  4 +++
 lib/iov_iter.c      | 74 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 74 insertions(+), 4 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5896af36199c..e26908e443d1 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -251,6 +251,10 @@ ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
+ssize_t iov_iter_pin_pages(struct iov_iter *i, struct page **pages,
+			size_t maxsize, unsigned int maxpages, size_t *start);
+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i, struct page ***pages,
+			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
 void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 4b7fce72e3e5..1c08014c8498 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1425,9 +1425,23 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 	return page;
 }
 
+enum pages_alloc_internal_flags {
+	USE_FOLL_GET,
+	USE_FOLL_PIN
+};
+
+/*
+ * TODO: get rid of the how_to_pin arg, and just call pin_user_pages_fast()
+ * unconditionally for the user_back_iter(i) case in this function. That can be
+ * done once all callers are ready to deal with FOLL_PIN pages for their
+ * user-space pages. (FOLL_PIN pages must be released via unpin_user_page(),
+ * rather than put_page().)
+ */
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-		   unsigned int maxpages, size_t *start)
+		   unsigned int maxpages, size_t *start,
+		   enum pages_alloc_internal_flags how_to_pin)
+
 {
 	unsigned int n;
 
@@ -1454,7 +1468,12 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		n = want_pages_array(pages, maxsize, *start, maxpages);
 		if (!n)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n, gup_flags, *pages);
+
+		if (how_to_pin == USE_FOLL_PIN)
+			res = pin_user_pages_fast(addr, n, gup_flags, *pages);
+		else
+			res = get_user_pages_fast(addr, n, gup_flags, *pages);
+
 		if (unlikely(res <= 0))
 			return res;
 		maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - *start);
@@ -1497,10 +1516,31 @@ ssize_t iov_iter_get_pages2(struct iov_iter *i,
 		return 0;
 	BUG_ON(!pages);
 
-	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
+	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start,
+					  USE_FOLL_GET);
 }
 EXPORT_SYMBOL(iov_iter_get_pages2);
 
+/*
+ * A FOLL_PIN variant that calls pin_user_pages_fast() instead of
+ * get_user_pages_fast().
+ */
+ssize_t iov_iter_pin_pages(struct iov_iter *i,
+		   struct page **pages, size_t maxsize, unsigned int maxpages,
+		   size_t *start)
+{
+	if (!maxpages)
+		return 0;
+	if (WARN_ON_ONCE(!pages))
+		return -EINVAL;
+	if (WARN_ON_ONCE(!user_backed_iter(i)))
+		return -EINVAL;
+
+	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start,
+					  USE_FOLL_PIN);
+}
+EXPORT_SYMBOL(iov_iter_pin_pages);
+
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start)
@@ -1509,7 +1549,8 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 
 	*pages = NULL;
 
-	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
+	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
+					 USE_FOLL_GET);
 	if (len <= 0) {
 		kvfree(*pages);
 		*pages = NULL;
@@ -1518,6 +1559,31 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
 
+/*
+ * A FOLL_PIN variant that calls pin_user_pages_fast() instead of
+ * get_user_pages_fast().
+ */
+ssize_t iov_iter_pin_pages_alloc(struct iov_iter *i,
+		   struct page ***pages, size_t maxsize,
+		   size_t *start)
+{
+	ssize_t len;
+
+	*pages = NULL;
+
+	if (WARN_ON_ONCE(!user_backed_iter(i)))
+		return -EINVAL;
+
+	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
+					 USE_FOLL_PIN);
+	if (len <= 0) {
+		kvfree(*pages);
+		*pages = NULL;
+	}
+	return len;
+}
+EXPORT_SYMBOL(iov_iter_pin_pages_alloc);
+
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
 {
-- 
2.37.2



* [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast()
  2022-08-27  8:36 [PATCH 0/6] convert most filesystems to pin_user_pages_fast() John Hubbard
                   ` (2 preceding siblings ...)
  2022-08-27  8:36 ` [PATCH 3/6] iov_iter: new iov_iter_pin_pages*() routines John Hubbard
@ 2022-08-27  8:36 ` John Hubbard
  2022-08-27  8:36 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages John Hubbard
  2022-08-27  8:36 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN John Hubbard
  5 siblings, 0 replies; 32+ messages in thread
From: John Hubbard @ 2022-08-27  8:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML, John Hubbard

Use dio_w_*() wrapper calls, in place of get_user_pages_fast(),
get_page() and put_page().

This converts the Direct IO parts of most filesystems over to using
FOLL_PIN (pin_user_page*()) page pinning.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c          | 27 ++++++++++++++-------------
 block/blk-map.c      |  7 ++++---
 fs/direct-io.c       | 40 ++++++++++++++++++++--------------------
 fs/iomap/direct-io.c |  2 +-
 4 files changed, 39 insertions(+), 37 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3d3a2678fea2..6c6110f7054e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1125,7 +1125,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		dio_w_unpin_user_page(bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1162,7 +1162,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		dio_w_unpin_user_page(page);
 	return 0;
 }
 
@@ -1176,7 +1176,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		dio_w_unpin_user_page(page);
 	return 0;
 }
 
@@ -1187,10 +1187,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Pins pages from *iter and appends them to @bio's bvec array. The pages will
+ * have to be released using dio_w_unpin_user_page() when done. For multi-segment
+ * *iter, this function only adds pages from the next non-empty segment of the
+ * iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1218,8 +1218,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset);
+	size = dio_w_iov_iter_pin_pages(iter, pages,
+					UINT_MAX - bio->bi_iter.bi_size,
+					nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1252,7 +1253,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		dio_w_unpin_user_page(pages[i++]);
 
 	return ret;
 }
@@ -1444,9 +1445,9 @@ void bio_set_pages_dirty(struct bio *bio)
  * have been written out during the direct-IO read.  So we take another ref on
  * the BIO and re-dirty the pages in process context.
  *
- * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * It is expected that bio_check_pages_dirty() will wholly own the BIO from here
+ * on.  It will run one dio_w_unpin_user_page() against each page and will run
+ * one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index 7196a6b64c80..4e333ad9776d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -254,7 +254,8 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		size_t offs, added = 0;
 		int npages;
 
-		bytes = iov_iter_get_pages_alloc2(iter, &pages, LONG_MAX, &offs);
+		bytes = dio_w_iov_iter_pin_pages_alloc(iter, &pages, LONG_MAX,
+						       &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -276,7 +277,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						dio_w_unpin_user_page(page);
 					break;
 				}
 
@@ -289,7 +290,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			dio_w_unpin_user_page(pages[j++]);
 		kvfree(pages);
 		/* couldn't stuff something into bio? */
 		if (bytes) {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index f669163d5860..05c044c55374 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -169,8 +169,8 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 	const enum req_op dio_op = dio->opf & REQ_OP_MASK;
 	ssize_t ret;
 
-	ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
-				&sdio->from);
+	ret = dio_w_iov_iter_pin_pages(sdio->iter, dio->pages, LONG_MAX,
+				       DIO_PAGES, &sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
 		struct page *page = ZERO_PAGE(0);
@@ -181,7 +181,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
+		dio_w_pin_user_page(page);
 		dio->pages[0] = page;
 		sdio->head = 0;
 		sdio->tail = 1;
@@ -197,7 +197,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		sdio->to = ((ret - 1) & (PAGE_SIZE - 1)) + 1;
 		return 0;
 	}
-	return ret;	
+	return ret;
 }
 
 /*
@@ -324,7 +324,7 @@ static void dio_aio_complete_work(struct work_struct *work)
 static blk_status_t dio_bio_complete(struct dio *dio, struct bio *bio);
 
 /*
- * Asynchronous IO callback. 
+ * Asynchronous IO callback.
  */
 static void dio_bio_end_aio(struct bio *bio)
 {
@@ -449,7 +449,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
 	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+		dio_w_unpin_user_page(dio->pages[sdio->head++]);
 }
 
 /*
@@ -716,7 +716,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 		 */
 		if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
 			sdio->pages_in_io--;
-		get_page(sdio->cur_page);
+		dio_w_pin_user_page(sdio->cur_page);
 		sdio->final_block_in_bio = sdio->cur_page_block +
 			(sdio->cur_page_len >> sdio->blkbits);
 		ret = 0;
@@ -725,7 +725,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 	}
 	return ret;
 }
-		
+
 /*
  * Put cur_page under IO.  The section of cur_page which is described by
  * cur_page_offset,cur_page_len is put into a BIO.  The section of cur_page
@@ -787,7 +787,7 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
  * An autonomous function to put a chunk of a page under deferred IO.
  *
  * The caller doesn't actually know (or care) whether this piece of page is in
- * a BIO, or is under IO or whatever.  We just take care of all possible 
+ * a BIO, or is under IO or whatever.  We just take care of all possible
  * situations here.  The separation between the logic of do_direct_IO() and
  * that of submit_page_section() is important for clarity.  Please don't break.
  *
@@ -832,13 +832,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		dio_w_unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	dio_w_pin_user_page(page); /* It is in dio */
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -853,7 +853,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		dio_w_unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -890,7 +890,7 @@ static inline void dio_zero_block(struct dio *dio, struct dio_submit *sdio,
 	 * We need to zero out part of an fs block.  It is either at the
 	 * beginning or the end of the fs block.
 	 */
-	if (end) 
+	if (end)
 		this_chunk_blocks = dio_blocks_per_fs_block - this_chunk_blocks;
 
 	this_chunk_bytes = this_chunk_blocks << sdio->blkbits;
@@ -954,7 +954,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					dio_w_unpin_user_page(page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -999,7 +999,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio_op == REQ_OP_WRITE) {
-					put_page(page);
+					dio_w_unpin_user_page(page);
 					return -ENOTBLK;
 				}
 
@@ -1012,7 +1012,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					dio_w_unpin_user_page(page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1052,7 +1052,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				dio_w_unpin_user_page(page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1067,8 +1067,8 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				break;
 		}
 
-		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		/* Drop the ref which was taken in [get|pin]_user_pages() */
+		dio_w_unpin_user_page(page);
 	}
 out:
 	return ret;
@@ -1288,7 +1288,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		dio_w_unpin_user_page(sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 4eb559a16c9e..fc7763c418d1 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
+	dio_w_pin_user_page(page);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
-- 
2.37.2



* [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-27  8:36 [PATCH 0/6] convert most filesystems to pin_user_pages_fast() John Hubbard
                   ` (3 preceding siblings ...)
  2022-08-27  8:36 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() John Hubbard
@ 2022-08-27  8:36 ` John Hubbard
  2022-08-27 22:48   ` Al Viro
  2022-08-27  8:36 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN John Hubbard
  5 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-27  8:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML, John Hubbard

Convert the NFS Direct IO layer to use pin_user_pages_fast() and
unpin_user_page(), instead of get_user_pages_fast() and put_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/nfs/direct.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1707f46b1335..f6e47329e092 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -142,13 +142,6 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 	return 0;
 }
 
-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
-{
-	unsigned int i;
-	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
-}
-
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
 			      struct nfs_direct_req *dreq)
 {
@@ -332,11 +325,11 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
+		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
 						  rsize, &pgbase);
 		if (result < 0)
 			break;
-	
+
 		bytes = result;
 		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
 		for (i = 0; i < npages; i++) {
@@ -362,7 +355,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		dio_w_unpin_user_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
@@ -791,8 +784,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
-						  wsize, &pgbase);
+		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
+							wsize, &pgbase);
 		if (result < 0)
 			break;
 
@@ -829,7 +822,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		dio_w_unpin_user_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN
  2022-08-27  8:36 [PATCH 0/6] convert most filesystems to pin_user_pages_fast() John Hubbard
                   ` (4 preceding siblings ...)
  2022-08-27  8:36 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages John Hubbard
@ 2022-08-27  8:36 ` John Hubbard
  5 siblings, 0 replies; 32+ messages in thread
From: John Hubbard @ 2022-08-27  8:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML, John Hubbard

Convert the fuse filesystem to use pin_user_pages_fast() and
unpin_user_page(), instead of get_user_pages_fast() and put_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/fuse/dev.c    |  8 ++++++--
 fs/fuse/file.c   | 31 ++++++++++++++++++++-----------
 fs/fuse/fuse_i.h |  1 +
 3 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 51897427a534..eb841fc82bb9 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -675,7 +675,10 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
 			flush_dcache_page(cs->pg);
 			set_page_dirty_lock(cs->pg);
 		}
-		put_page(cs->pg);
+		if (cs->pipebufs)
+			put_page(cs->pg);
+		else
+			dio_w_unpin_user_page(cs->pg);
 	}
 	cs->pg = NULL;
 }
@@ -730,7 +733,8 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 		}
 	} else {
 		size_t off;
-		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
+		err = dio_w_iov_iter_pin_pages(cs->iter, &page, PAGE_SIZE, 1,
+					       &off);
 		if (err < 0)
 			return err;
 		BUG_ON(!err);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1a3afd469e3a..a79aa4fea937 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -625,14 +625,19 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
 }
 
 static void fuse_release_user_pages(struct fuse_args_pages *ap,
-				    bool should_dirty)
+				    bool should_dirty, bool is_kvec)
 {
 	unsigned int i;
 
-	for (i = 0; i < ap->num_pages; i++) {
-		if (should_dirty)
-			set_page_dirty_lock(ap->pages[i]);
-		put_page(ap->pages[i]);
+	if (is_kvec) {
+		for (i = 0; i < ap->num_pages; i++) {
+			if (should_dirty)
+				set_page_dirty_lock(ap->pages[i]);
+			put_page(ap->pages[i]);
+		}
+	} else {
+		dio_w_unpin_user_pages_dirty_lock(ap->pages, ap->num_pages,
+						  should_dirty);
 	}
 }
 
@@ -733,7 +738,7 @@ static void fuse_aio_complete_req(struct fuse_mount *fm, struct fuse_args *args,
 	struct fuse_io_priv *io = ia->io;
 	ssize_t pos = -1;
 
-	fuse_release_user_pages(&ia->ap, io->should_dirty);
+	fuse_release_user_pages(&ia->ap, io->should_dirty, io->is_kvec);
 
 	if (err) {
 		/* Nothing */
@@ -1414,10 +1419,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
 		unsigned npages;
 		size_t start;
-		ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
-					*nbytesp - nbytes,
-					max_pages - ap->num_pages,
-					&start);
+		ret = dio_w_iov_iter_pin_pages(ii, &ap->pages[ap->num_pages],
+					       *nbytesp - nbytes,
+					       max_pages - ap->num_pages,
+					       &start);
 		if (ret < 0)
 			break;
 
@@ -1483,6 +1488,9 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 		fl_owner_t owner = current->files;
 		size_t nbytes = min(count, nmax);
 
+		/* For use in fuse_release_user_pages(): */
+		io->is_kvec = iov_iter_is_kvec(iter);
+
 		err = fuse_get_user_pages(&ia->ap, iter, &nbytes, write,
 					  max_pages);
 		if (err && !nbytes)
@@ -1498,7 +1506,8 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, struct iov_iter *iter,
 		}
 
 		if (!io->async || nres < 0) {
-			fuse_release_user_pages(&ia->ap, io->should_dirty);
+			fuse_release_user_pages(&ia->ap, io->should_dirty,
+						io->is_kvec);
 			fuse_io_free(ia);
 		}
 		ia = NULL;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 488b460e046f..1d927e499395 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -290,6 +290,7 @@ struct fuse_io_priv {
 	struct kiocb *iocb;
 	struct completion *done;
 	bool blocking;
+	bool is_kvec;
 };
 
 #define FUSE_IO_PRIV_SYNC(i) \
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages
  2022-08-27  8:36 ` [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages John Hubbard
@ 2022-08-27 22:27   ` Andrew Morton
  2022-08-27 23:59     ` John Hubbard
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2022-08-27 22:27 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On Sat, 27 Aug 2022 01:36:03 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> Background: The Direct IO part of the block infrastructure is being
> changed to use pin_user_page*() and unpin_user_page*() calls, in place
> of a mix of get_user_pages_fast(), get_page(), and put_page(). These
> have to be changed over all at the same time, for block, bio, and all
> filesystems. However, most filesystems can be changed via iomap and core
> filesystem routines, so let's get that in place, and then continue on
> with converting the remaining filesystems (9P, CIFS) and anything else
> that feeds pages into bio that ultimately get released via
> bio_release_pages().
> 
> Add a new config parameter, CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and
> dio_w_*() wrapper functions. The dio_w_ prefix was chosen for
> uniqueness, so as to ease a subsequent kernel-wide rename via
> search-and-replace. Together, these allow the developer to choose
> between these sets of routines, for Direct IO code paths:
> 
> a) pin_user_pages_fast()
>     pin_user_page()
>     unpin_user_page()
> 
> b) get_user_pages_fast()
>     get_page()
>     put_page()
> 
> CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO is a temporary setting, and may
> be deleted once the conversion is complete. In the meantime, developers
> can enable this in order to try out each filesystem.
> 
> Please remember that these /proc/vmstat items (below) should normally
> contain the same values as each other, except during the middle of
> pin/unpin operations. As such, they can be helpful when monitoring test
> runs:
> 
>     nr_foll_pin_acquired
>     nr_foll_pin_released
> 
> ...
>
> +static inline void dio_w_unpin_user_pages(struct page **pages,
> +					  unsigned long npages)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < npages; i++)
> +		put_page(pages[i]);
> +}

release_pages()?  Might be faster if many of the pages are page_count()==1.

(release_pages() was almost as simple as the above when I added it a
million years ago.  But then progress happened).


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/6] iov_iter: new iov_iter_pin_pages*() routines
  2022-08-27  8:36 ` [PATCH 3/6] iov_iter: new iov_iter_pin_pages*() routines John Hubbard
@ 2022-08-27 22:46   ` Al Viro
  2022-08-27 22:48     ` John Hubbard
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2022-08-27 22:46 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On Sat, Aug 27, 2022 at 01:36:04AM -0700, John Hubbard wrote:
> Provide two new wrapper routines that are intended for user space pages
> only:
> 
>     iov_iter_pin_pages()
>     iov_iter_pin_pages_alloc()
> 
> Internally, these routines call pin_user_pages_fast(), instead of
> get_user_pages_fast().
> 
> As always, callers must use unpin_user_pages() or a suitable FOLL_PIN
> variant, to release the pages, if they actually were acquired via
> pin_user_pages_fast().
> 
> This is a prerequisite to converting bio/block layers over to use
> pin_user_pages_fast().

You do realize that O_DIRECT on ITER_BVEC is possible, right?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/6] iov_iter: new iov_iter_pin_pages*() routines
  2022-08-27 22:46   ` Al Viro
@ 2022-08-27 22:48     ` John Hubbard
  0 siblings, 0 replies; 32+ messages in thread
From: John Hubbard @ 2022-08-27 22:48 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/27/22 15:46, Al Viro wrote:
> On Sat, Aug 27, 2022 at 01:36:04AM -0700, John Hubbard wrote:
>> Provide two new wrapper routines that are intended for user space pages
>> only:
>>
>>     iov_iter_pin_pages()
>>     iov_iter_pin_pages_alloc()
>>
>> Internally, these routines call pin_user_pages_fast(), instead of
>> get_user_pages_fast().
>>
>> As always, callers must use unpin_user_pages() or a suitable FOLL_PIN
>> variant, to release the pages, if they actually were acquired via
>> pin_user_pages_fast().
>>
>> This is a prerequisite to converting bio/block layers over to use
>> pin_user_pages_fast().
> 
> You do realize that O_DIRECT on ITER_BVEC is possible, right?

I do now. But I didn't until you wrote this!

Any advice or ideas would be welcome here.


thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-27  8:36 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages John Hubbard
@ 2022-08-27 22:48   ` Al Viro
  2022-08-27 23:55     ` John Hubbard
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2022-08-27 22:48 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
> unpin_user_page(), instead of get_user_pages_fast() and put_page().

Again, this stuff can be hit with ITER_BVEC iterators

> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
> +		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
>  						  rsize, &pgbase);

and this will break on those.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-27 22:48   ` Al Viro
@ 2022-08-27 23:55     ` John Hubbard
  2022-08-28  0:38       ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-27 23:55 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/27/22 15:48, Al Viro wrote:
> On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
>> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
>> unpin_user_page(), instead of get_user_pages_fast() and put_page().
> 
> Again, this stuff can be hit with ITER_BVEC iterators
> 
>> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
>> +		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
>>  						  rsize, &pgbase);
> 
> and this will break on those.

If anyone has an example handy, of a user space program that leads
to this situation (O_DIRECT with ITER_BVEC), it would really help
me reach enlightenment a lot quicker in this area. :)

thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages
  2022-08-27 22:27   ` Andrew Morton
@ 2022-08-27 23:59     ` John Hubbard
  2022-08-28  0:12       ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-27 23:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/27/22 15:27, Andrew Morton wrote:
>> +static inline void dio_w_unpin_user_pages(struct page **pages,
>> +					  unsigned long npages)
>> +{
>> +	unsigned long i;
>> +
>> +	for (i = 0; i < npages; i++)
>> +		put_page(pages[i]);
>> +}
> 
> release_pages()?  Might be faster if many of the pages are page_count()==1.

Sure. I was perhaps being too cautious about changing too many things
at once when I wrote this earlier.

> 
> (release_pages() was almost as simple as the above when I added it a
> million years ago.  But then progress happened).
> 

Actually, I'm tempted to update the release_pages() API as well, because
it uses an int for npages, while other things (in gup.c, anyway) are
moving over to unsigned long.

Anyway, I'll change my patch locally for now, to this:

static inline void dio_w_unpin_user_pages(struct page **pages,
					  unsigned long npages)
{
	/* Careful, release_pages() uses a smaller integer type for npages: */
	if (WARN_ON_ONCE(npages > (unsigned long)INT_MAX))
		return;

	release_pages(pages, (int)npages);
}

...in hopes that I can somehow find a way to address Al Viro's other
comments, which have the potential to doom the whole series, heh.


thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages
  2022-08-27 23:59     ` John Hubbard
@ 2022-08-28  0:12       ` Andrew Morton
  2022-08-28  0:31         ` John Hubbard
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2022-08-28  0:12 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On Sat, 27 Aug 2022 16:59:32 -0700 John Hubbard <jhubbard@nvidia.com> wrote:

> Anyway, I'll change my patch locally for now, to this:
> 
> static inline void dio_w_unpin_user_pages(struct page **pages,
> 					  unsigned long npages)
> {
> 	/* Careful, release_pages() uses a smaller integer type for npages: */
> 	if (WARN_ON_ONCE(npages > (unsigned long)INT_MAX))
> 		return;
> 
> 	release_pages(pages, (int)npages);
> }

Well, it might be slower.  release_pages() has a ton of fluff.

As mentioned, the above might be faster if the pages tend
to have page_count()==1??

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages
  2022-08-28  0:12       ` Andrew Morton
@ 2022-08-28  0:31         ` John Hubbard
  2022-08-28  1:07           ` John Hubbard
  0 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-28  0:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/27/22 17:12, Andrew Morton wrote:
> On Sat, 27 Aug 2022 16:59:32 -0700 John Hubbard <jhubbard@nvidia.com> wrote:
> 
>> Anyway, I'll change my patch locally for now, to this:
>>
>> static inline void dio_w_unpin_user_pages(struct page **pages,
>> 					  unsigned long npages)
>> {
>> 	/* Careful, release_pages() uses a smaller integer type for npages: */
>> 	if (WARN_ON_ONCE(npages > (unsigned long)INT_MAX))
>> 		return;
>>
>> 	release_pages(pages, (int)npages);
>> }
> 
> Well, it might be slower.  release_pages() has a ton of fluff.
> 
> As mentioned, the above might be faster if the pages tend
> to have page_count()==1??
> 

I don't think we can know the answer to that. This code is called
in all kinds of situations, and it seems to me that whatever
tradeoffs are best for release_pages() are probably also reasonable
for this code.

Even with all the fluff in release_pages(), it at least batches
the pages, as opposed to the simple put_page() in a loop. Most of 
the callers do have more than one page to release in non-error cases,
so release_pages() does seem better.


thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-27 23:55     ` John Hubbard
@ 2022-08-28  0:38       ` Al Viro
  2022-08-28  0:39         ` Al Viro
  0 siblings, 1 reply; 32+ messages in thread
From: Al Viro @ 2022-08-28  0:38 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
> On 8/27/22 15:48, Al Viro wrote:
> > On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
> >> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
> >> unpin_user_page(), instead of get_user_pages_fast() and put_page().
> > 
> > Again, this stuff can be hit with ITER_BVEC iterators
> > 
> >> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
> >> +		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
> >>  						  rsize, &pgbase);
> > 
> > and this will break on those.
> 
> If anyone has an example handy, of a user space program that leads
> to this situation (O_DIRECT with ITER_BVEC), it would really help
> me reach enlightenment a lot quicker in this area. :)

Er...  splice(2) to O_DIRECT-opened file on e.g. ext4?  Or
sendfile(2) to the same, for that matter...

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-28  0:38       ` Al Viro
@ 2022-08-28  0:39         ` Al Viro
  2022-08-28  0:46           ` John Hubbard
  2022-08-29  4:59           ` John Hubbard
  0 siblings, 2 replies; 32+ messages in thread
From: Al Viro @ 2022-08-28  0:39 UTC (permalink / raw)
  To: John Hubbard
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On Sun, Aug 28, 2022 at 01:38:57AM +0100, Al Viro wrote:
> On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
> > On 8/27/22 15:48, Al Viro wrote:
> > > On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
> > >> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
> > >> unpin_user_page(), instead of get_user_pages_fast() and put_page().
> > > 
> > > Again, this stuff can be hit with ITER_BVEC iterators
> > > 
> > >> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
> > >> +		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
> > >>  						  rsize, &pgbase);
> > > 
> > > and this will break on those.
> > 
> > If anyone has an example handy, of a user space program that leads
> > to this situation (O_DIRECT with ITER_BVEC), it would really help
> > me reach enlightenment a lot quicker in this area. :)
> 
> Er...  splice(2) to O_DIRECT-opened file on e.g. ext4?  Or
> sendfile(2) to the same, for that matter...

s/ext4/nfs/ to hit this particular codepath, obviously.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-28  0:39         ` Al Viro
@ 2022-08-28  0:46           ` John Hubbard
  2022-08-29  4:59           ` John Hubbard
  1 sibling, 0 replies; 32+ messages in thread
From: John Hubbard @ 2022-08-28  0:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/27/22 17:39, Al Viro wrote:
> On Sun, Aug 28, 2022 at 01:38:57AM +0100, Al Viro wrote:
>> On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
>>> On 8/27/22 15:48, Al Viro wrote:
>>>> On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
>>>>> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
>>>>> unpin_user_page(), instead of get_user_pages_fast() and put_page().
>>>>
>>>> Again, this stuff can be hit with ITER_BVEC iterators
>>>>
>>>>> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
>>>>> +		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
>>>>>  						  rsize, &pgbase);
>>>>
>>>> and this will break on those.
>>>
>>> If anyone has an example handy, of a user space program that leads
>>> to this situation (O_DIRECT with ITER_BVEC), it would really help
>>> me reach enlightenment a lot quicker in this area. :)
>>
>> Er...  splice(2) to O_DIRECT-opened file on e.g. ext4?  Or
>> sendfile(2) to the same, for that matter...
> 
> s/ext4/nfs/ to hit this particular codepath, obviously.

aha, thanks. I do remember that you alerted me earlier to splice(2)
problems, and I thought I'd neatly sidestepped those this time, by
staying with user_backed_iter(i) cases. 

But I see that it's going to take something more, after all.



thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages
  2022-08-28  0:31         ` John Hubbard
@ 2022-08-28  1:07           ` John Hubbard
  0 siblings, 0 replies; 32+ messages in thread
From: John Hubbard @ 2022-08-28  1:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/27/22 17:31, John Hubbard wrote:
> On 8/27/22 17:12, Andrew Morton wrote:
>> On Sat, 27 Aug 2022 16:59:32 -0700 John Hubbard <jhubbard@nvidia.com> wrote:
>>
>>> Anyway, I'll change my patch locally for now, to this:
>>>
>>> static inline void dio_w_unpin_user_pages(struct page **pages,
>>> 					  unsigned long npages)
>>> {
>>> 	/* Careful, release_pages() uses a smaller integer type for npages: */
>>> 	if (WARN_ON_ONCE(npages > (unsigned long)INT_MAX))
>>> 		return;
>>>
>>> 	release_pages(pages, (int)npages);
>>> }
>>
>> Well, it might be slower.  release_pages() has a ton of fluff.
>>
>> As mentioned, the above might be faster if the pages tend
>> to have page_count()==1??
>>
> 
> I don't think we can know the answer to that. This code is called

To clarify: I mean, I don't think we can know whether or not these pages 
usually have page_count()==1.


thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-28  0:39         ` Al Viro
  2022-08-28  0:46           ` John Hubbard
@ 2022-08-29  4:59           ` John Hubbard
  2022-08-29 16:08             ` Jan Kara
  1 sibling, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-29  4:59 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Jens Axboe, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/27/22 17:39, Al Viro wrote:
> On Sun, Aug 28, 2022 at 01:38:57AM +0100, Al Viro wrote:
>> On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
>>> On 8/27/22 15:48, Al Viro wrote:
>>>> On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
>>>>> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
>>>>> unpin_user_page(), instead of get_user_pages_fast() and put_page().
>>>>
>>>> Again, this stuff can be hit with ITER_BVEC iterators
>>>>
>>>>> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
>>>>> +		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
>>>>>  						  rsize, &pgbase);
>>>>
>>>> and this will break on those.
>>>
>>> If anyone has an example handy, of a user space program that leads
>>> to this situation (O_DIRECT with ITER_BVEC), it would really help
>>> me reach enlightenment a lot quicker in this area. :)
>>
>> Er...  splice(2) to O_DIRECT-opened file on e.g. ext4?  Or
>> sendfile(2) to the same, for that matter...
> 
> s/ext4/nfs/ to hit this particular codepath, obviously.

OK, I have a solution to this that's pretty easy:

1) Get rid of the user_backed_iter(i) check in
dio_w_iov_iter_pin_pages() and dio_w_iov_iter_pin_pages_alloc(), and

2) At the call sites, match up the unpin calls appropriately.

...and apply a similar fix for the fuse conversion patch.

However, the core block/bio conversion in patch 4 still depends upon a
key assumption that I got from a 2019 email discussion with Christoph
Hellwig and others here [1], which says:

    "All pages released by bio_release_pages should come from
     get_get_user_pages...".

I really hope that still holds true. Otherwise this whole thing is in
trouble.



[1] https://lore.kernel.org/kvm/20190724053053.GA18330@infradead.org/

thanks,

-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-27  8:36 ` [PATCH 1/6] mm/gup: introduce pin_user_page() John Hubbard
@ 2022-08-29 12:07   ` David Hildenbrand
  2022-08-29 19:33     ` John Hubbard
  0 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2022-08-29 12:07 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 27.08.22 10:36, John Hubbard wrote:
> pin_user_page() is an externally-usable version of try_grab_page(), but
> with semantics that match get_page(), so that it can act as a drop-in
> replacement for get_page(). Specifically, pin_user_page() has a void
> return type.
> 
> pin_user_page() elevates a page's refcount using FOLL_PIN rules. This
> means that the caller must release the page via unpin_user_page().
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/gup.c           | 33 +++++++++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 982f2607180b..85a105157334 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1869,6 +1869,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
>  long get_user_pages(unsigned long start, unsigned long nr_pages,
>  			    unsigned int gup_flags, struct page **pages,
>  			    struct vm_area_struct **vmas);
> +void pin_user_page(struct page *page);
>  long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages,
>  		    struct vm_area_struct **vmas);
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..245ccb41ed8c 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3213,6 +3213,39 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  }
>  EXPORT_SYMBOL(pin_user_pages);
>  
> +/**
> + * pin_user_page() - apply a FOLL_PIN reference to a page
> + *
> + * @page: the page to be pinned.
> + *
> + * This is similar to get_user_pages(), except that the page's refcount is
> + * elevated using FOLL_PIN, instead of FOLL_GET.
> + *
> + * IMPORTANT: The caller must release the page via unpin_user_page().
> + *
> + */
> +void pin_user_page(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
> +

We should warn if the page is anon and !exclusive.

I assume the intent is to use pin_user_page() only to duplicate pins, right?
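
For concreteness, the sort of check I have in mind would be something
like this (a sketch only, assuming the anon-exclusive page-flag helpers):

```c
	/*
	 * Sketch: duplicating a pin via pin_user_page() is only safe for
	 * an anonymous page if it is exclusive; a shared anon page could
	 * otherwise be COW-broken away from the pin.
	 */
	WARN_ON_ONCE(PageAnon(page) && !PageAnonExclusive(page));
```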


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-29  4:59           ` John Hubbard
@ 2022-08-29 16:08             ` Jan Kara
  2022-08-29 19:59               ` John Hubbard
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2022-08-29 16:08 UTC (permalink / raw)
  To: John Hubbard
  Cc: Al Viro, Andrew Morton, Jens Axboe, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Jan Kara, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Sun 28-08-22 21:59:49, John Hubbard wrote:
> On 8/27/22 17:39, Al Viro wrote:
> > On Sun, Aug 28, 2022 at 01:38:57AM +0100, Al Viro wrote:
> >> On Sat, Aug 27, 2022 at 04:55:18PM -0700, John Hubbard wrote:
> >>> On 8/27/22 15:48, Al Viro wrote:
> >>>> On Sat, Aug 27, 2022 at 01:36:06AM -0700, John Hubbard wrote:
> >>>>> Convert the NFS Direct IO layer to use pin_user_pages_fast() and
> >>>>> unpin_user_page(), instead of get_user_pages_fast() and put_page().
> >>>>
> >>>> Again, this stuff can be hit with ITER_BVEC iterators
> >>>>
> >>>>> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
> >>>>> +		result = dio_w_iov_iter_pin_pages_alloc(iter, &pagevec,
> >>>>>  						  rsize, &pgbase);
> >>>>
> >>>> and this will break on those.
> >>>
> >>> If anyone has an example handy, of a user space program that leads
> >>> to this situation (O_DIRECT with ITER_BVEC), it would really help
> >>> me reach enlightenment a lot quicker in this area. :)
> >>
> >> Er...  splice(2) to O_DIRECT-opened file on e.g. ext4?  Or
> >> sendfile(2) to the same, for that matter...
> > 
> > s/ext4/nfs/ to hit this particular codepath, obviously.
> 
> OK, I have a solution to this that's pretty easy:
> 
> 1) Get rid of the user_backed_iter(i) check in
> dio_w_iov_iter_pin_pages() and dio_w_iov_iter_pin_pages_alloc(), and
> 
> 2) At the call sites, match up the unpin calls appropriately.
> 
> ...and apply a similar fix for the fuse conversion patch.
> 
> However, the core block/bio conversion in patch 4 still does depend upon
> a key assumption, which I got from a 2019 email discussion with
> Christoph Hellwig and others here [1], which says:
> 
>     "All pages released by bio_release_pages should come from
>      get_get_user_pages...".
> 
> I really hope that still holds true. Otherwise this whole thing is in
> trouble.
> 
> [1] https://lore.kernel.org/kvm/20190724053053.GA18330@infradead.org/

Well as far as I've checked that discussion, Christoph was aware of pipe
pages etc. (i.e., bvecs) entering direct IO code. But he had some patches
[2] which enabled GUP to work for bvecs as well (using the kernel mapping
under the hood AFAICT from a quick glance at the series). I suppose we
could also handle this in __iov_iter_get_pages_alloc() by grabbing pin
reference instead of plain get_page() for the case of bvec iter. That way
we should have only pinned pages in bio_release_pages() even for the bvec
case.

[2] http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec
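
A rough sketch of that idea (hypothetical placement and names; the real
iterator internals differ, and pin_user_page() is the helper from patch
1/6 of this series):

```c
	/*
	 * Hypothetical sketch: inside __iov_iter_get_pages_alloc(), take
	 * a FOLL_PIN-equivalent reference on ITER_BVEC pages instead of
	 * get_page(), so that bio_release_pages() can unconditionally
	 * unpin_user_page() everything it is handed.
	 */
	if (iov_iter_is_bvec(i)) {
		for (k = 0; k < n; k++)
			pin_user_page(pages[k]);	/* instead of get_page() */
	}
```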

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-29 12:07   ` David Hildenbrand
@ 2022-08-29 19:33     ` John Hubbard
  2022-08-30 12:17       ` David Hildenbrand
  0 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-29 19:33 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/29/22 05:07, David Hildenbrand wrote:
>> +/**
>> + * pin_user_page() - apply a FOLL_PIN reference to a page
>> + *
>> + * @page: the page to be pinned.
>> + *
>> + * This is similar to get_user_pages(), except that the page's refcount is
>> + * elevated using FOLL_PIN, instead of FOLL_GET.

Actually, my commit log has a more useful documentation of this routine,
and given the questions below, I think I'll change to that:

 * pin_user_page() is an externally-usable version of try_grab_page(), but with
 * semantics that match get_page(), so that it can act as a drop-in replacement
 * for get_page().
 *
 * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
 * that the caller must release the page via unpin_user_page().


>> + *
>> + * IMPORTANT: The caller must release the page via unpin_user_page().
>> + *
>> + */
>> +void pin_user_page(struct page *page)
>> +{
>> +	struct folio *folio = page_folio(page);
>> +
>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>> +
> 
> We should warn if the page is anon and !exclusive.

That would be sort of OK, because pin_user_page() is being created
specifically for file system (O_DIRECT cases) use, and so the pages
should mostly be file-backed, rather than anon. Although I'm a little
vague about whether all of these iov_iter cases are really always
file-backed pages, especially for cases such as splice(2) to an
O_DIRECT-opened file, that Al Viro mentioned [1].

Can you walk me through the reasoning for why we need to keep out
anon shared pages? 

> 
> I assume the intent is to use pin_user_page() only to duplicate pins, right?
> 

Well, yes or no, depending on your use of the term "pin":

pin_user_page() is used on a page that already has a refcount >= 1 (so
no worries about speculative pinning should apply here), but the page
does not necessarily have any FOLL_PIN's applied to it yet (so it's not
"pinned" in the FOLL_PIN sense).


[1] https://lore.kernel.org/all/Ywq5VrSrY341UVpL@ZenIV/


thanks,

-- 
John Hubbard
NVIDIA


* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-29 16:08             ` Jan Kara
@ 2022-08-29 19:59               ` John Hubbard
  2022-08-31  9:43                 ` Jan Kara
  0 siblings, 1 reply; 32+ messages in thread
From: John Hubbard @ 2022-08-29 19:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Andrew Morton, Jens Axboe, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 8/29/22 09:08, Jan Kara wrote:
>> However, the core block/bio conversion in patch 4 still does depend upon
>> a key assumption, which I got from a 2019 email discussion with
>> Christoph Hellwig and others here [1], which says:
>>
>>     "All pages released by bio_release_pages should come from
>>      get_user_pages...".
>>
>> I really hope that still holds true. Otherwise this whole thing is in
>> trouble.
>>
>> [1] https://lore.kernel.org/kvm/20190724053053.GA18330@infradead.org/
> 
> Well as far as I've checked that discussion, Christoph was aware of pipe
> pages etc. (i.e., bvecs) entering direct IO code. But he had some patches
> [2] which enabled GUP to work for bvecs as well (using the kernel mapping
> under the hood AFAICT from a quick glance at the series). I suppose we
> could also handle this in __iov_iter_get_pages_alloc() by grabbing pin
> reference instead of plain get_page() for the case of bvec iter. That way
> we should have only pinned pages in bio_release_pages() even for the bvec
> case.

OK, thanks, that looks viable. So, that approach assumes that the
remaining two cases in __iov_iter_get_pages_alloc() will never end up
being released via bio_release_pages():

    iov_iter_is_pipe(i)
    iov_iter_is_xarray(i)

I'm actually a little worried about ITER_XARRAY, which is a recent addition.
It seems to be used in ways that are similar to ITER_BVEC, and cephfs is
using it. It's probably OK for now, for this series, which doesn't yet
convert cephfs.


> 
> [2] http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec

Yes, I had looked through that again before sending this. The problem
for me was that it didn't have to deal with releasing pages
differently (and therefore, differentiating between FOLL_PIN and
FOLL_GET pages). But it did enable GUP to handle bvecs, so with that
applied, one could then make the original claim about bio_release_pages()
and GUP, yes.


thanks,

-- 
John Hubbard
NVIDIA


* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-29 19:33     ` John Hubbard
@ 2022-08-30 12:17       ` David Hildenbrand
  2022-08-30 21:42         ` John Hubbard
  2022-08-31  0:06         ` John Hubbard
  0 siblings, 2 replies; 32+ messages in thread
From: David Hildenbrand @ 2022-08-30 12:17 UTC (permalink / raw)
  To: John Hubbard, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 29.08.22 21:33, John Hubbard wrote:
> On 8/29/22 05:07, David Hildenbrand wrote:
>>> +/**
>>> + * pin_user_page() - apply a FOLL_PIN reference to a page
>>> + *
>>> + * @page: the page to be pinned.
>>> + *
>>> + * This is similar to get_user_pages(), except that the page's refcount is
>>> + * elevated using FOLL_PIN, instead of FOLL_GET.
> 
> Actually, my commit log has a more useful documentation of this routine,
> and given the questions below, I think I'll change to that:
> 
>  * pin_user_page() is an externally-usable version of try_grab_page(), but with
>  * semantics that match get_page(), so that it can act as a drop-in replacement
>  * for get_page().
>  *
>  * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
>  * that the caller must release the page via unpin_user_page().

Some thoughts:

a) Can we generalize such that pages with a dedicated pincount
(multi-page folios) are also covered? Maybe avoiding the refcount
terminology would be best.

b) Should we directly work on folios?

c) Would it be valid to pass in a tail page right now?

> 
>>> + *
>>> + * IMPORTANT: The caller must release the page via unpin_user_page().
>>> + *
>>> + */
>>> +void pin_user_page(struct page *page)
>>> +{
>>> +	struct folio *folio = page_folio(page);
>>> +
>>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>>> +
>>
>> We should warn if the page is anon and !exclusive.
> 
> That would be sort of OK, because pin_user_page() is being created
> specifically for file system (O_DIRECT cases) use, and so the pages
> should mostly be file-backed, rather than anon. Although I'm a little
> vague about whether all of these iov_iter cases are really always
> file-backed pages, especially for cases such as splice(2) to an
> O_DIRECT-opened file, that Al Viro mentioned [1].

If we can, we should document that this interface is not for anonymous
pages and WARN if pinning an anonymous page via this interface.

The only reasonable way to obtain a pin on an anonymous page is via the
page table. Here, FOLL_PIN should be used to do the right thing -- for
example, unshare first (break COW) instead of pinning a shared anonymous
page.

Nothing would speak against duplicating such a pin using this interface
(we'd have to sanity check that the page we're pinning may already be
pinned), but I assume the pages we pin here are *not* necessarily
obtained via GUP FOLL_PIN.

I would be curious under which scenarios we could end up here with an
anonymous page and how we obtained that reference (if not via GUP).

> 
> Can you walk me through the reasoning for why we need to keep out
> anon shared pages? 

We make sure to only pin anonymous pages that are exclusive and check
when unpinning -- see sanity_check_pinned_pages(), there is also a
comment in there -- that pinned anonymous pages are in fact still
exclusive, otherwise we might have a BUG lurking somewhere that can
result in memory corruptions or leaking information between processes.

For example, once we'd pinned anonymous pages that are not marked
exclusive (!PageAnonExclusive), or we'd be sharing a page that is
pinned, the next write fault would replace the page in the user page
table due to breaking COW, and the GUP pin would point at a different
page than the page table.

Disallowing pinning of anon pages that may be shared in any case
(FOLL_LONGTERM or not) simplifies GUP handling and allows for such
sanity checks.

(side note: after recent VM_BUG_ON discussions we might want to convert
the VM_BUG_ON_PAGE in sanity_check_pinned_pages())

> 
>>
>> I assume the intent is to use pin_user_page() only to duplicate pins, right?
>>
> 
> Well, yes or no, depending on your use of the term "pin":
> 
> pin_user_page() is used on a page that already has a refcount >= 1 (so
> no worries about speculative pinning should apply here), but the page
> does not necessarily have any FOLL_PIN's applied to it yet (so it's not
> "pinned" in the FOLL_PIN sense).

Okay, then we should really figure out if/how anonymous pages could end
up here. I assume they can't, really. But let's see :)


-- 
Thanks,

David / dhildenb



* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-30 12:17       ` David Hildenbrand
@ 2022-08-30 21:42         ` John Hubbard
  2022-08-31  0:06         ` John Hubbard
  1 sibling, 0 replies; 32+ messages in thread
From: John Hubbard @ 2022-08-30 21:42 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/30/22 05:17, David Hildenbrand wrote:
> On 29.08.22 21:33, John Hubbard wrote:
>> On 8/29/22 05:07, David Hildenbrand wrote:
>>>> +/**
>>>> + * pin_user_page() - apply a FOLL_PIN reference to a page
>>>> + *
>>>> + * @page: the page to be pinned.
>>>> + *
>>>> + * This is similar to get_user_pages(), except that the page's refcount is
>>>> + * elevated using FOLL_PIN, instead of FOLL_GET.
>>
>> Actually, my commit log has a more useful documentation of this routine,
>> and given the questions below, I think I'll change to that:
>>
>>  * pin_user_page() is an externally-usable version of try_grab_page(), but with
>>  * semantics that match get_page(), so that it can act as a drop-in replacement
>>  * for get_page().
>>  *
>>  * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
>>  * that the caller must release the page via unpin_user_page().
> 
> Some thoughts:
> 
> a) Can we generalize such that pages with a dedicated pincount
> (multi-page folios) are also covered? Maybe avoiding the refcount
> terminology would be best.
> 
> b) Should we directly work on folios?
> 
> c) Would it be valid to pass in a tail page right now?

I would fervently prefer to defer those kinds of questions and ideas,
because the call sites are dealing simply in pages. And this is 
really for file system call sites. The folio conversion is a larger
thing. Below...

> 
>>
>>>> + *
>>>> + * IMPORTANT: The caller must release the page via unpin_user_page().
>>>> + *
>>>> + */
>>>> +void pin_user_page(struct page *page)
>>>> +{
>>>> +	struct folio *folio = page_folio(page);
>>>> +
>>>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>>>> +
>>>
>>> We should warn if the page is anon and !exclusive.
>>
>> That would be sort of OK, because pin_user_page() is being created
>> specifically for file system (O_DIRECT cases) use, and so the pages
>> should mostly be file-backed, rather than anon. Although I'm a little
>> vague about whether all of these iov_iter cases are really always
>> file-backed pages, especially for cases such as splice(2) to an
>> O_DIRECT-opened file, that Al Viro mentioned [1].
> 
> If we can, we should document that this interface is not for anonymous
> pages and WARN if pinning an anonymous page via this interface.

Yes. OK, I've rewritten the documentation again, and changed the 
warning to just 

    WARN_ON_ONCE(PageAnon(page));

So it looks like this now, what do you think?

/**
 * pin_user_page() - apply a FOLL_PIN reference to a file-backed page that the
 * caller already owns.
 *
 * @page: the page to be pinned.
 *
 * pin_user_page() elevates a page's refcount using FOLL_PIN rules. This means
 * that the caller must release the page via unpin_user_page().
 *
 * pin_user_page() is intended as a drop-in replacement for get_page(). This
 * provides a way for callers to do a subsequent unpin_user_page() on the
 * affected page. However, it is only intended for use by callers (file systems,
 * block/bio) that have a file-backed page. Anonymous pages are neither expected
 * nor supported, and will generate a warning.
 *
 * pin_user_page() may also be thought of as an externally-usable version of
 * try_grab_page(), but with semantics that match get_page(), so that it can act
 * as a drop-in replacement for get_page().
 *
 * IMPORTANT: The caller must release the page via unpin_user_page().
 *
 */

> 
> The only reasonable way to obtain a pin on an anonymous page is via the
> page table. Here, FOLL_PIN should be used to do the right thing -- for
> example, unshare first (break COW) instead of pinning a shared anonymous
> page.
> 
> Nothing would speak against duplicating such a pin using this interface
> (we'd have to sanity check that the page we're pinning may already be
> pinned), but I assume the pages we pin here are *not* necessarily
> obtained via GUP FOLL_PIN.
> 
> I would be curious under which scenarios we could end up here with an
> anonymous page and how we obtained that reference (if not via GUP).

Let's see if the warning ever fires. I expect not, but if it does,
then I can add the !PageAnonExclusive qualifier to the warning.

> 
>>
>> Can you walk me through the reasoning for why we need to keep out
>> anon shared pages? 
> 
> We make sure to only pin anonymous pages that are exclusive and check
> when unpinning -- see sanity_check_pinned_pages(), there is also a
> comment in there -- that pinned anonymous pages are in fact still
> exclusive, otherwise we might have a BUG lurking somewhere that can
> result in memory corruptions or leaking information between processes.
> 
> For example, once we'd pinned anonymous pages that are not marked
> exclusive (!PageAnonExclusive), or we'd be sharing a page that is
> pinned, the next write fault would replace the page in the user page
> table due to breaking COW, and the GUP pin would point at a different
> page than the page table.

Right, OK it all clicks together, thanks for that.

> 
> Disallowing pinning of anon pages that may be shared in any case
> (FOLL_LONGTERM or not) simplifies GUP handling and allows for such
> sanity checks.
> 
> (side note: after recent VM_BUG_ON discussions we might want to convert
> the VM_BUG_ON_PAGE in sanity_check_pinned_pages())
> 
>>
>>>
>>> I assume the intent is to use pin_user_page() only to duplicate pins, right?
>>>
>>
>> Well, yes or no, depending on your use of the term "pin":
>>
>> pin_user_page() is used on a page that already has a refcount >= 1 (so
>> no worries about speculative pinning should apply here), but the page
>> does not necessarily have any FOLL_PIN's applied to it yet (so it's not
>> "pinned" in the FOLL_PIN sense).
> 
> Okay, then we should really figure out if/how anonymous pages could end
> up here. I assume they can't, really. But let's see :)
> 
> 

Yes, again, the warning will reveal that. Which reminds me that I also
have it on my list to also write some test cases that do things like Al
Viro suggested: ITER_BVEC + O_DIRECT.


thanks,

-- 
John Hubbard
NVIDIA


* Re: [PATCH 1/6] mm/gup: introduce pin_user_page()
  2022-08-30 12:17       ` David Hildenbrand
  2022-08-30 21:42         ` John Hubbard
@ 2022-08-31  0:06         ` John Hubbard
  1 sibling, 0 replies; 32+ messages in thread
From: John Hubbard @ 2022-08-31  0:06 UTC (permalink / raw)
  To: David Hildenbrand, Andrew Morton
  Cc: Jens Axboe, Alexander Viro, Miklos Szeredi, Christoph Hellwig,
	Darrick J . Wong, Trond Myklebust, Anna Schumaker, Jan Kara,
	Logan Gunthorpe, linux-block, linux-fsdevel, linux-xfs,
	linux-nfs, linux-mm, LKML

On 8/30/22 05:17, David Hildenbrand wrote:
> (side note: after recent VM_BUG_ON discussions we might want to convert
> the VM_BUG_ON_PAGE in sanity_check_pinned_pages())

Just realized I skipped over this point in my other response.

So, there are 28 "BUG_ON" calls in mm/gup.c, only 3 of which are 
BUILD_BUG_ON().

Maybe a single patch that covers all 25 at once is the way to go, I'm
thinking. And of course, any patch that touches any of those lines
should convert to a WARN* variant--whichever lands first.


thanks,

-- 
John Hubbard
NVIDIA


* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-29 19:59               ` John Hubbard
@ 2022-08-31  9:43                 ` Jan Kara
  2022-08-31 18:02                   ` John Hubbard
  2022-09-01  0:38                   ` Al Viro
  0 siblings, 2 replies; 32+ messages in thread
From: Jan Kara @ 2022-08-31  9:43 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, Al Viro, Andrew Morton, Jens Axboe, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On Mon 29-08-22 12:59:26, John Hubbard wrote:
> On 8/29/22 09:08, Jan Kara wrote:
> >> However, the core block/bio conversion in patch 4 still does depend upon
> >> a key assumption, which I got from a 2019 email discussion with
> >> Christoph Hellwig and others here [1], which says:
> >>
> >>     "All pages released by bio_release_pages should come from
> >>      get_user_pages...".
> >>
> >> I really hope that still holds true. Otherwise this whole thing is in
> >> trouble.
> >>
> >> [1] https://lore.kernel.org/kvm/20190724053053.GA18330@infradead.org/
> > 
> > Well as far as I've checked that discussion, Christoph was aware of pipe
> > pages etc. (i.e., bvecs) entering direct IO code. But he had some patches
> > [2] which enabled GUP to work for bvecs as well (using the kernel mapping
> > under the hood AFAICT from a quick glance at the series). I suppose we
> > could also handle this in __iov_iter_get_pages_alloc() by grabbing pin
> > reference instead of plain get_page() for the case of bvec iter. That way
> > we should have only pinned pages in bio_release_pages() even for the bvec
> > case.
> 
> OK, thanks, that looks viable. So, that approach assumes that the
> remaining two cases in __iov_iter_get_pages_alloc() will never end up
> being released via bio_release_pages():
> 
>     iov_iter_is_pipe(i)
>     iov_iter_is_xarray(i)
> 
> I'm actually a little worried about ITER_XARRAY, which is a recent addition.
> It seems to be used in ways that are similar to ITER_BVEC, and cephfs is
> using it. It's probably OK for now, for this series, which doesn't yet
> convert cephfs.

So after looking into that a bit more, I think a clean approach would be to
provide iov_iter_pin_pages2() and iov_iter_pin_pages_alloc2(), under the hood
in __iov_iter_get_pages_alloc() make sure we use pin_user_page() instead of
get_page() in all the cases (using this in pipe_get_pages() and
iter_xarray_get_pages() is easy) and then make all bio handling use the
pinning variants for iters. I think at least iov_iter_is_pipe() case needs
to be handled as well because as I wrote above, pipe pages can enter direct
IO code e.g. for splice(2).

Also I think that all iov_iter_get_pages2() (or the _alloc2 variant) users
actually do want the "pin page" semantics in the end (they are accessing
page contents) so eventually we should convert them all to
iov_iter_pin_pages2() and remove iov_iter_get_pages2() altogether. But this
will take some more conversion work with networking etc. so I'd start with
converting bios only.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-31  9:43                 ` Jan Kara
@ 2022-08-31 18:02                   ` John Hubbard
  2022-09-01  0:38                   ` Al Viro
  1 sibling, 0 replies; 32+ messages in thread
From: John Hubbard @ 2022-08-31 18:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Al Viro, Andrew Morton, Jens Axboe, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On 8/31/22 02:43, Jan Kara wrote:
>> OK, thanks, that looks viable. So, that approach assumes that the
>> remaining two cases in __iov_iter_get_pages_alloc() will never end up
>> being released via bio_release_pages():
>>
>>     iov_iter_is_pipe(i)
>>     iov_iter_is_xarray(i)
>>
>> I'm actually a little worried about ITER_XARRAY, which is a recent addition.
>> It seems to be used in ways that are similar to ITER_BVEC, and cephfs is
>> using it. It's probably OK for now, for this series, which doesn't yet
>> convert cephfs.
> 
> So after looking into that a bit more, I think a clean approach would be to
> provide iov_iter_pin_pages2() and iov_iter_pin_pages_alloc2(), under the hood
> in __iov_iter_get_pages_alloc() make sure we use pin_user_page() instead of
> get_page() in all the cases (using this in pipe_get_pages() and
> iter_xarray_get_pages() is easy) and then make all bio handling use the
> pinning variants for iters. I think at least iov_iter_is_pipe() case needs
> to be handled as well because as I wrote above, pipe pages can enter direct
> IO code e.g. for splice(2).
> 

OK, yes.

> Also I think that all iov_iter_get_pages2() (or the _alloc2 variant) users
> actually do want the "pin page" semantics in the end (they are accessing
> page contents) so eventually we should convert them all to
> iov_iter_pin_pages2() and remove iov_iter_get_pages2() altogether. But this
> will take some more conversion work with networking etc. so I'd start with
> converting bios only.
> 
> 								Honza


I'll give this approach a spin, thanks very much for the guidance!


thanks,

-- 
John Hubbard
NVIDIA


* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-08-31  9:43                 ` Jan Kara
  2022-08-31 18:02                   ` John Hubbard
@ 2022-09-01  0:38                   ` Al Viro
  2022-09-01  9:06                     ` Jan Kara
  1 sibling, 1 reply; 32+ messages in thread
From: Al Viro @ 2022-09-01  0:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: John Hubbard, Andrew Morton, Jens Axboe, Miklos Szeredi,
	Christoph Hellwig, Darrick J . Wong, Trond Myklebust,
	Anna Schumaker, Logan Gunthorpe, linux-block, linux-fsdevel,
	linux-xfs, linux-nfs, linux-mm, LKML

On Wed, Aug 31, 2022 at 11:43:49AM +0200, Jan Kara wrote:

> So after looking into that a bit more, I think a clean approach would be to
> provide iov_iter_pin_pages2() and iov_iter_pin_pages_alloc2(), under the hood
> in __iov_iter_get_pages_alloc() make sure we use pin_user_page() instead of
> get_page() in all the cases (using this in pipe_get_pages() and
> iter_xarray_get_pages() is easy) and then make all bio handling use the
> pinning variants for iters. I think at least iov_iter_is_pipe() case needs
> to be handled as well because as I wrote above, pipe pages can enter direct
> IO code e.g. for splice(2).
> 
> Also I think that all iov_iter_get_pages2() (or the _alloc2 variant) users
> actually do want the "pin page" semantics in the end (they are accessing
> page contents) so eventually we should convert them all to
> iov_iter_pin_pages2() and remove iov_iter_get_pages2() altogether. But this
> will take some more conversion work with networking etc. so I'd start with
> converting bios only.

Not sure, TBH...

FWIW, quite a few of the callers of iov_iter_get_pages2() do *NOT* need to
grab any references for BVEC/XARRAY/PIPE cases.  What's more, it would be
bloody useful to have a variant that doesn't grab references for
!iter->user_backed case - that could be usable for KVEC as well, simplifying
several callers.

Requirements:
	* recipients of those struct page * should have a way to make
dropping the page refs conditional (obviously); bio machinery can be told
to do so.
	* callers should *NOT* do something like
	"set an ITER_BVEC iter, with page references grabbed and stashed in
bio_vec array, call async read_iter() and drop the references in array - the
refs we grab in dio will serve"
Note that for sync IO that pattern is fine whether we grab/drop anything
inside read_iter(); for async we could take depopulating the bio_vec
array to the IO completion or downstream of that.
	* the code dealing with the references returned by iov_iter_..._pages
should *NOT* play silly buggers with refcounts - something like "I'll grab
a reference, start DMA and report success; page will stay around until I
get around to dropping the ref and callers don't need to wait for that" deep
in the bowels of infinibad stack (or something equally tasteful) is seriously
asking for trouble.

Future plans from the last cycle included iov_iter_find_pages{,_alloc}() that
would *not* grab references on anything other than IOVEC and UBUF (would advance
the iterator, same as iov_iter_get_pages2(), though). Then iov_iter_get_...()
would become a wrapper for that.  After that - look into switching the users
of ..._get_... to ..._find_....   Hadn't done much in that direction yet,
though - need to redo the analysis first.

That primitive might very well do FOLL_PIN instead of FOLL_GET for IOVEC and
UBUF...


* Re: [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-09-01  0:38                   ` Al Viro
@ 2022-09-01  9:06                     ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2022-09-01  9:06 UTC (permalink / raw)
  To: Al Viro
  Cc: Jan Kara, John Hubbard, Andrew Morton, Jens Axboe,
	Miklos Szeredi, Christoph Hellwig, Darrick J . Wong,
	Trond Myklebust, Anna Schumaker, Logan Gunthorpe, linux-block,
	linux-fsdevel, linux-xfs, linux-nfs, linux-mm, LKML

On Thu 01-09-22 01:38:21, Al Viro wrote:
> On Wed, Aug 31, 2022 at 11:43:49AM +0200, Jan Kara wrote:
> 
> > So after looking into that a bit more, I think a clean approach would be to
> > provide iov_iter_pin_pages2() and iov_iter_pin_pages_alloc2(), under the hood
> > in __iov_iter_get_pages_alloc() make sure we use pin_user_page() instead of
> > get_page() in all the cases (using this in pipe_get_pages() and
> > iter_xarray_get_pages() is easy) and then make all bio handling use the
> > pinning variants for iters. I think at least iov_iter_is_pipe() case needs
> > to be handled as well because as I wrote above, pipe pages can enter direct
> > IO code e.g. for splice(2).
> > 
> > Also I think that all iov_iter_get_pages2() (or the _alloc2 variant) users
> > actually do want the "pin page" semantics in the end (they are accessing
> > page contents) so eventually we should convert them all to
> > iov_iter_pin_pages2() and remove iov_iter_get_pages2() altogether. But this
> > will take some more conversion work with networking etc. so I'd start with
> > converting bios only.
> 
> Not sure, TBH...
> 
> FWIW, quite a few of the callers of iov_iter_get_pages2() do *NOT* need to
> grab any references for BVEC/XARRAY/PIPE cases.  What's more, it would be
> bloody useful to have a variant that doesn't grab references for
> !iter->user_backed case - that could be usable for KVEC as well, simplifying
> several callers.
> 
> Requirements:
> 	* recipients of those struct page * should have a way to make
> dropping the page refs conditional (obviously); bio machinery can be told
> to do so.
> 	* callers should *NOT* do something like
> 	"set an ITER_BVEC iter, with page references grabbed and stashed in
> bio_vec array, call async read_iter() and drop the references in array - the
> refs we grab in dio will serve"
> Note that for sync IO that pattern is fine whether we grab/drop anything
> inside read_iter(); for async we could take depopulating the bio_vec
> array to the IO completion or downstream of that.
> 	* the code dealing with the references returned by iov_iter_..._pages
> should *NOT* play silly buggers with refcounts - something like "I'll grab
> a reference, start DMA and report success; page will stay around until I
> get around to dropping the ref and callers don't need to wait for that" deep
> in the bowels of infinibad stack (or something equally tasteful) is seriously
> asking for trouble.

I agree we could get away without grabbing references in some cases. But it
is a performance vs. robustness tradeoff, I'd say. E.g. in the XARRAY case I
can see people feeding pages from a struct address_space, and it is unclear
what other than a page reference would protect them from being freed by
reclaim. Furthermore, if e.g. writeback can happen for such a struct
address_space (or other filesystem operations requiring stable data), we
really need a full page pin, and not just a page reference, to signal that
the page may be under DMA and may be changing under your hands. So I'm not
against having some special cases that avoid grabbing a page reference /
page pin, but I think that is justified only in performance-sensitive cases,
and only when we can make sure that a filesystem-backed page (because of the
above-mentioned data stability issues) or an anon page (because of COW
handling) cannot possibly enter this path.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages
  2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
@ 2022-02-27  9:34 ` jhubbard.send.patches
  0 siblings, 0 replies; 32+ messages in thread
From: jhubbard.send.patches @ 2022-02-27  9:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

From: John Hubbard <jhubbard@nvidia.com>

Convert nfs-direct to use pin_user_pages_fast(), and unpin_user_pages(),
in place of get_user_pages_fast() and put_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/nfs/direct.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eabfdab543c8..42111a75c0f7 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -177,13 +177,6 @@ ssize_t nfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	return nfs_file_direct_write(iocb, iter);
 }
 
-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
-{
-	unsigned int i;
-	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
-}
-
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
 			      struct nfs_direct_req *dreq)
 {
@@ -367,7 +360,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec, 
+		result = iov_iter_pin_pages_alloc(iter, &pagevec,
 						  rsize, &pgbase);
 		if (result < 0)
 			break;
@@ -398,7 +391,7 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		unpin_user_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
@@ -811,7 +804,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec, 
+		result = iov_iter_pin_pages_alloc(iter, &pagevec,
 						  wsize, &pgbase);
 		if (result < 0)
 			break;
@@ -850,7 +843,7 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		unpin_user_pages(pagevec, npages);
 		kvfree(pagevec);
 		if (result < 0)
 			break;
-- 
2.35.1



end of thread, other threads:[~2022-09-01  9:08 UTC | newest]

Thread overview: 32+ messages
2022-08-27  8:36 [PATCH 0/6] convert most filesystems to pin_user_pages_fast() John Hubbard
2022-08-27  8:36 ` [PATCH 1/6] mm/gup: introduce pin_user_page() John Hubbard
2022-08-29 12:07   ` David Hildenbrand
2022-08-29 19:33     ` John Hubbard
2022-08-30 12:17       ` David Hildenbrand
2022-08-30 21:42         ` John Hubbard
2022-08-31  0:06         ` John Hubbard
2022-08-27  8:36 ` [PATCH 2/6] block: add dio_w_*() wrappers for pin, unpin user pages John Hubbard
2022-08-27 22:27   ` Andrew Morton
2022-08-27 23:59     ` John Hubbard
2022-08-28  0:12       ` Andrew Morton
2022-08-28  0:31         ` John Hubbard
2022-08-28  1:07           ` John Hubbard
2022-08-27  8:36 ` [PATCH 3/6] iov_iter: new iov_iter_pin_pages*() routines John Hubbard
2022-08-27 22:46   ` Al Viro
2022-08-27 22:48     ` John Hubbard
2022-08-27  8:36 ` [PATCH 4/6] block, bio, fs: convert most filesystems to pin_user_pages_fast() John Hubbard
2022-08-27  8:36 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages John Hubbard
2022-08-27 22:48   ` Al Viro
2022-08-27 23:55     ` John Hubbard
2022-08-28  0:38       ` Al Viro
2022-08-28  0:39         ` Al Viro
2022-08-28  0:46           ` John Hubbard
2022-08-29  4:59           ` John Hubbard
2022-08-29 16:08             ` Jan Kara
2022-08-29 19:59               ` John Hubbard
2022-08-31  9:43                 ` Jan Kara
2022-08-31 18:02                   ` John Hubbard
2022-09-01  0:38                   ` Al Viro
2022-09-01  9:06                     ` Jan Kara
2022-08-27  8:36 ` [PATCH 6/6] fuse: convert direct IO paths to use FOLL_PIN John Hubbard
  -- strict thread matches above, loose matches on Subject: below --
2022-02-27  9:34 [PATCH 0/6] block, fs: convert most Direct IO cases to FOLL_PIN jhubbard.send.patches
2022-02-27  9:34 ` [PATCH 5/6] NFS: direct-io: convert to FOLL_PIN pages jhubbard.send.patches
