linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN
@ 2022-02-25  8:50 John Hubbard
  2022-02-25  8:50 ` [RFC PATCH 1/7] mm/gup: introduce pin_user_page() John Hubbard
                   ` (8 more replies)
  0 siblings, 9 replies; 20+ messages in thread
From: John Hubbard @ 2022-02-25  8:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

Hi,

Summary:

This puts some prerequisites in place, including a CONFIG parameter,
making it possible to start converting and testing the Direct IO part of
each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().

It will take "a few" kernel releases to get the whole thing done.

Details:

As part of fixing the "get_user_pages() + file-backed memory" problem
[1], and to support various COW-related fixes as well [2], we need to
convert the Direct IO code from get_user_pages_fast(), to
pin_user_pages_fast(). Because pin_user_pages*() calls require a
corresponding call to unpin_user_page(), the conversion is more
elaborate than just substitution.

Further complicating the conversion, the block/bio layers get their
Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
each of which has a large number of callers. All of those callers need
to be audited and changed so that they call unpin_user_page(), rather
than put_page().

After quite some time exploring and consulting with people as well, it
is clear that this cannot be done in just one patchset. That's because,
not only is this large and time-consuming (for example, Chaitanya
Kulkarni's first reaction, after looking into the details, was, "convert
the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
but it is also spread across many filesystems.

With that in mind, let's apply most of this patchset soon-ish, and then
work on the filesystem conversions, likely over the course of a few
kernel releases. Once complete, then apply the last patch, and then one
final name change to remove the dio_w_ prefixes, and get us back to the
original names.

In this patchset:

Patches 1, 2, 3: provide the prerequisites to start converting call
sites to call the new dio_w_*() wrapper functions.

Patch 4: convert the core allocation routines to
dio_w_pin_user_pages_fast().

Patches 5, 6: convert a couple of callers (NFS, fuse) to use FOLL_PIN.
This also is a placeholder to show that "filesystems need to be
converted at this point".

At this point, Ubuntu 20.04 boots up and is able to support running some
fio direct IO tests, while keeping the foll pin counts in /proc/vmstat
balanced. (Ubuntu uses fuse during startup, interestingly enough.)

Patch 7: Get rid of the CONFIG parameter, thus effectively switching the
default Direct IO mechanism over to pin_user_pages_fast().

(Not shown): Patch 8: trivial but large: rename everything to get rid of
the dio_w_ prefix, and delete the wrappers.

This is based on mmotm as of about an hour ago. I've also stashed it
here:

    https://github.com/johnhubbard/linux bio_pup_mmotm_20220224

[1] https://lwn.net/Articles/753027/ "The trouble with get_user_pages()"

[2] https://lore.kernel.org/all/20211217113049.23850-1-david@redhat.com/T/#u
    (David Hildenbrand's mm/COW fixes)

John Hubbard (7):
  mm/gup: introduce pin_user_page()
  block: add dio_w_*() wrappers for pin, unpin user pages
  block, fs: assert that key paths use iovecs, and nothing else
  block, bio, fs: initial pin_user_pages_fast() changes
  NFS: direct-io: convert to FOLL_PIN pages
  fuse: convert direct IO paths to use FOLL_PIN
  block, direct-io: flip the switch: use pin_user_pages_fast()

 block/bio.c          | 22 +++++++++++++---------
 block/blk-map.c      |  4 ++--
 fs/direct-io.c       | 26 ++++++++++++++------------
 fs/fuse/dev.c        |  5 ++++-
 fs/fuse/file.c       | 23 ++++++++---------------
 fs/iomap/direct-io.c |  2 +-
 fs/nfs/direct.c      |  2 +-
 include/linux/bvec.h |  4 ++++
 include/linux/mm.h   |  1 +
 lib/iov_iter.c       |  4 ++--
 mm/gup.c             | 34 ++++++++++++++++++++++++++++++++++
 11 files changed, 84 insertions(+), 43 deletions(-)


base-commit: 218d3ca9c0ea1c35f1bc5099325b7df54b52bbdd
prerequisite-patch-id: 7d5a742e37171a15d83a9b3ac9ba0951b573eed8
--
2.35.1


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [RFC PATCH 1/7] mm/gup: introduce pin_user_page()
  2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
@ 2022-02-25  8:50 ` John Hubbard
  2022-02-28 13:27   ` David Hildenbrand
  2022-02-25  8:50 ` [RFC PATCH 2/7] block: add dio_w_*() wrappers for pin, unpin user pages John Hubbard
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 20+ messages in thread
From: John Hubbard @ 2022-02-25  8:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

pin_user_page() is an externally-usable version of try_grab_page(), but
with semantics that match get_page(), so that it can act as a drop-in
replacement for get_page(). Specifically, pin_user_page() has a void
return type.

pin_user_page() elevates a page's refcount is using FOLL_PIN rules. This
means that the caller must release the page via unpin_user_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm.h |  1 +
 mm/gup.c           | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 929488a47181..bb51f5487aef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1914,6 +1914,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
 long get_user_pages(unsigned long start, unsigned long nr_pages,
 			    unsigned int gup_flags, struct page **pages,
 			    struct vm_area_struct **vmas);
+void pin_user_page(struct page *page);
 long pin_user_pages(unsigned long start, unsigned long nr_pages,
 		    unsigned int gup_flags, struct page **pages,
 		    struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 5c3f6ede17eb..44446241c3a9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3034,6 +3034,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
 }
 EXPORT_SYMBOL(pin_user_pages);
 
+/**
+ * pin_user_page() - apply a FOLL_PIN reference to a page ()
+ *
+ * @page: the page to be pinned.
+ *
+ * Similar to get_user_pages(), in that the page's refcount is elevated using
+ * FOLL_PIN rules.
+ *
+ * IMPORTANT: That means that the caller must release the page via
+ * unpin_user_page().
+ *
+ */
+void pin_user_page(struct page *page)
+{
+	struct folio *folio = page_folio(page);
+
+	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
+
+	/*
+	 * Similar to try_grab_page(): be sure to *also*
+	 * increment the normal page refcount field at least once,
+	 * so that the page really is pinned.
+	 */
+	if (folio_test_large(folio)) {
+		folio_ref_add(folio, 1);
+		atomic_add(1, folio_pincount_ptr(folio));
+	} else {
+		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
+	}
+
+	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
+}
+EXPORT_SYMBOL(pin_user_page);
+
 /*
  * pin_user_pages_unlocked() is the FOLL_PIN variant of
  * get_user_pages_unlocked(). Behavior is the same, except that this one sets
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 2/7] block: add dio_w_*() wrappers for pin, unpin user pages
  2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
  2022-02-25  8:50 ` [RFC PATCH 1/7] mm/gup: introduce pin_user_page() John Hubbard
@ 2022-02-25  8:50 ` John Hubbard
  2022-02-25  8:50 ` [RFC PATCH 3/7] block, fs: assert that key paths use iovecs, and nothing else John Hubbard
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: John Hubbard @ 2022-02-25  8:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

Add a new config parameter, CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, and
dio_w_*() wrapper functions. Together, these allow the developer to
choose between these sets of routines, for Direct IO code paths:

a) pin_user_pages_fast()
   pin_user_page()
   unpin_user_page()

b) get_user_pages_fast()
   get_page()
   put_page()

CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO is a temporary setting, and will
be deleted once the conversion is complete. In the meantime, developers
can enable this in order to try out each filesystem.

More information: The Direct IO part of the block infrastructure is
being changed to use pin_user_page*() and unpin_user_page*() calls, in
place of a mix of get_user_pages_fast(), get_page(), and put_page().
These have to be changed over all at the same time, for block, bio, and
all filesystems.

While that changeover is in progress (but disabled via this new CONFIG
option), kernel developers need a way to test their changes. The steps
are:

a) Enable CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO

b) Monitor these /proc/vmstat items:

nr_foll_pin_acquired
nr_foll_pin_released

...to ensure that they remain equal, when "at rest".

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/Kconfig        | 25 +++++++++++++++++++++++++
 include/linux/bvec.h | 11 +++++++++++
 2 files changed, 36 insertions(+)

diff --git a/block/Kconfig b/block/Kconfig
index 168b873eb666..f6ca5e9597e4 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -50,6 +50,31 @@ config BLK_DEV_BSG_COMMON
 config BLK_ICQ
 	bool
 
+config BLK_USE_PIN_USER_PAGES_FOR_DIO
+	bool "DEVELOPERS ONLY: Enable pin_user_pages() for Direct IO" if EXPERT
+	default n
+	help
+	  For Direct IO code, retain the pages via calls to
+	  pin_user_pages_fast(), instead of via get_user_pages_fast().
+	  Likewise, use pin_user_page() instead of get_page(). And then
+	  release such pages via unpin_user_page(), instead of
+	  put_page().
+
+	  This is a temporary setting, which will be deleted once the
+	  conversion is completed, reviewed, and tested. In the meantime,
+	  developers can enable this in order to try out each filesystem.
+	  For that, it's best to monitor these /proc/vmstat items:
+
+		nr_foll_pin_acquired
+		nr_foll_pin_released
+
+	  ...to ensure that they remain equal, when "at rest".
+
+	  Say yes here ONLY if are actively developing or testing the
+	  block layer or filesystems with pin_user_pages_fast().
+	  Otherwise, this is just a way to throw off the refcounting of
+	  pages in the system.
+
 config BLK_DEV_BSGLIB
 	bool "Block layer SG support v4 helper lib"
 	select BLK_DEV_BSG_COMMON
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 35c25dff651a..a96a68c687f6 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -241,4 +241,15 @@ static inline void *bvec_virt(struct bio_vec *bvec)
 	return page_address(bvec->bv_page) + bvec->bv_offset;
 }
 
+#ifdef CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO
+#define dio_w_pin_user_pages_fast(s, n, p, f)	pin_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p)			pin_user_page(p)
+#define dio_w_unpin_user_page(p)		unpin_user_page(p)
+
+#else
+#define dio_w_pin_user_pages_fast(s, n, p, f)	get_user_pages_fast(s, n, p, f)
+#define dio_w_pin_user_page(p)			get_page(p)
+#define dio_w_unpin_user_page(p)		put_page(p)
+#endif
+
 #endif /* __LINUX_BVEC_H */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 3/7] block, fs: assert that key paths use iovecs, and nothing else
  2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
  2022-02-25  8:50 ` [RFC PATCH 1/7] mm/gup: introduce pin_user_page() John Hubbard
  2022-02-25  8:50 ` [RFC PATCH 2/7] block: add dio_w_*() wrappers for pin, unpin user pages John Hubbard
@ 2022-02-25  8:50 ` John Hubbard
  2022-02-25  8:50 ` [RFC PATCH 4/7] block, bio, fs: initial pin_user_pages_fast() changes John Hubbard
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: John Hubbard @ 2022-02-25  8:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

Upcoming changes to Direct IO will change it from acquiring pages via
get_user_pages_fast(), to calling pin_user_pages_fast() instead.

Place a few assertions at key points, that the pages are IOVEC (user
pages), to enforce the assumptions that there are no kernel or pipe or
other odd variations being passed.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c    | 4 ++++
 fs/direct-io.c | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index b15f5466ce08..4679d6539e2d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1167,6 +1167,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
+	WARN_ON_ONCE(!iter_is_iovec(iter));
+
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
@@ -1217,6 +1219,8 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
+	WARN_ON_ONCE(!iter_is_iovec(iter));
+
 	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 38bca4980a1c..7dbbbfef300d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -169,6 +169,8 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
 	ssize_t ret;
 
+	WARN_ON_ONCE(!iter_is_iovec(sdio->iter));
+
 	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
 				&sdio->from);
 
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 4/7] block, bio, fs: initial pin_user_pages_fast() changes
  2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
                   ` (2 preceding siblings ...)
  2022-02-25  8:50 ` [RFC PATCH 3/7] block, fs: assert that key paths use iovecs, and nothing else John Hubbard
@ 2022-02-25  8:50 ` John Hubbard
  2022-02-25  8:50 ` [RFC PATCH 5/7] NFS: direct-io: convert to FOLL_PIN pages John Hubbard
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: John Hubbard @ 2022-02-25  8:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

Use dio_w_*() wrapper calls, in place of get_user_pages_fast(), get_page()
and put_page().

This sets up core parts of the block, bio, and direct-io subsystems to use
dio_*() wrapper calls.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/bio.c          | 18 +++++++++---------
 block/blk-map.c      |  4 ++--
 fs/direct-io.c       | 24 ++++++++++++------------
 fs/iomap/direct-io.c |  2 +-
 lib/iov_iter.c       |  4 ++--
 5 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4679d6539e2d..8541245c53bd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1102,7 +1102,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		dio_w_unpin_user_page(bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1133,7 +1133,7 @@ static void bio_put_pages(struct page **pages, size_t size, size_t off)
 	size_t i, nr = DIV_ROUND_UP(size + (off & ~PAGE_MASK), PAGE_SIZE);
 
 	for (i = 0; i < nr; i++)
-		put_page(pages[i]);
+		dio_w_unpin_user_page(pages[i]);
 }
 
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
@@ -1144,9 +1144,9 @@ static void bio_put_pages(struct page **pages, size_t size, size_t off)
  * @iter: iov iterator describing the region to be mapped
  *
  * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * pages will have to be released using unpin_user_page() when done. For
+ * multi-segment *iter, this function only adds pages from the next non-empty
+ * segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1180,7 +1180,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
 			if (same_page)
-				put_page(page);
+				dio_w_unpin_user_page(page);
 		} else {
 			if (WARN_ON_ONCE(bio_full(bio, len))) {
 				bio_put_pages(pages + i, left, offset);
@@ -1237,7 +1237,7 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 			break;
 		}
 		if (same_page)
-			put_page(page);
+			dio_w_unpin_user_page(page);
 		offset = 0;
 	}
 
@@ -1434,8 +1434,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one unpin_user_page() against each page and will run
+ * one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index c7f71d83eff1..5f0d04c5dd9d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -275,7 +275,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						dio_w_unpin_user_page(page);
 					break;
 				}
 
@@ -289,7 +289,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			dio_w_unpin_user_page(pages[j++]);
 		kvfree(pages);
 		/* couldn't stuff something into bio? */
 		if (bytes)
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7dbbbfef300d..0d7104aa40c3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -183,7 +183,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
+		dio_w_pin_user_page(page);
 		dio->pages[0] = page;
 		sdio->head = 0;
 		sdio->tail = 1;
@@ -452,7 +452,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
 	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+		dio_w_unpin_user_page(dio->pages[sdio->head++]);
 }
 
 /*
@@ -717,7 +717,7 @@ static inline int dio_bio_add_page(struct dio_submit *sdio)
 		 */
 		if ((sdio->cur_page_len + sdio->cur_page_offset) == PAGE_SIZE)
 			sdio->pages_in_io--;
-		get_page(sdio->cur_page);
+		dio_w_pin_user_page(sdio->cur_page);
 		sdio->final_block_in_bio = sdio->cur_page_block +
 			(sdio->cur_page_len >> sdio->blkbits);
 		ret = 0;
@@ -832,13 +832,13 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		dio_w_unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	dio_w_pin_user_page(page);		/* It is in dio */
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -853,7 +853,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		dio_w_unpin_user_page(sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -953,7 +953,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					dio_w_unpin_user_page(page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -998,7 +998,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio->op == REQ_OP_WRITE) {
-					put_page(page);
+					dio_w_unpin_user_page(page);
 					return -ENOTBLK;
 				}
 
@@ -1011,7 +1011,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					dio_w_unpin_user_page(page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1051,7 +1051,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				dio_w_unpin_user_page(page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1067,7 +1067,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 		}
 
 		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		dio_w_unpin_user_page(page);
 	}
 out:
 	return ret;
@@ -1289,7 +1289,7 @@ do_blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		dio_w_unpin_user_page(sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 67cf9c16f80c..4e818648a1aa 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -192,7 +192,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
+	dio_w_pin_user_page(page);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 6dd5330f7a99..f15a3ef5a481 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1538,7 +1538,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 
 		addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
 		n = DIV_ROUND_UP(len, PAGE_SIZE);
-		res = get_user_pages_fast(addr, n, gup_flags, pages);
+		res = dio_w_pin_user_pages_fast(addr, n, gup_flags, pages);
 		if (unlikely(res <= 0))
 			return res;
 		return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1667,7 +1667,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		p = get_pages_array(n);
 		if (!p)
 			return -ENOMEM;
-		res = get_user_pages_fast(addr, n, gup_flags, p);
+		res = dio_w_pin_user_pages_fast(addr, n, gup_flags, p);
 		if (unlikely(res <= 0)) {
 			kvfree(p);
 			*pages = NULL;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 5/7] NFS: direct-io: convert to FOLL_PIN pages
  2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
                   ` (3 preceding siblings ...)
  2022-02-25  8:50 ` [RFC PATCH 4/7] block, bio, fs: initial pin_user_pages_fast() changes John Hubbard
@ 2022-02-25  8:50 ` John Hubbard
  2022-02-25  8:50 ` [RFC PATCH 6/7] fuse: convert direct IO paths to use FOLL_PIN John Hubbard
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: John Hubbard @ 2022-02-25  8:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

Now that Direct IO's core allocators invoke pin_user_pages_fast(), those
pages must free released via unpin_user_page(), instead of put_page().

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/nfs/direct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index eabfdab543c8..2e0d399c5a5a 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -181,7 +181,7 @@ static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
 {
 	unsigned int i;
 	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
+		dio_w_unpin_user_page(pages[i]);
 }
 
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 6/7] fuse: convert direct IO paths to use FOLL_PIN
  2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
                   ` (4 preceding siblings ...)
  2022-02-25  8:50 ` [RFC PATCH 5/7] NFS: direct-io: convert to FOLL_PIN pages John Hubbard
@ 2022-02-25  8:50 ` John Hubbard
  2022-02-25  8:50 ` [RFC PATCH 7/7] block, direct-io: flip the switch: use pin_user_pages_fast() John Hubbard
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 20+ messages in thread
From: John Hubbard @ 2022-02-25  8:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

Convert the fuse filesystem to support the new iov_iter_get_pages()
behavior. That routine now invokes pin_user_pages_fast(), which means that
such pages must be released via unpin_user_page(), rather than via
put_page().

This commit also removes any possibility of kernel pages being handled, in
the fuse_get_user_pages() call. Although this may seem like a steep price
to pay, Christoph Hellwig actually recommended it a few years ago for
nearly the same situation [1].

[1] https://lore.kernel.org/kvm/20190724061750.GA19397@infradead.org/

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 fs/fuse/dev.c  |  5 ++++-
 fs/fuse/file.c | 23 ++++++++---------------
 2 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e1b4a846c90d..a93037c96b89 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -675,7 +675,10 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
 			flush_dcache_page(cs->pg);
 			set_page_dirty_lock(cs->pg);
 		}
-		put_page(cs->pg);
+		if (cs->pipebufs)
+			put_page(cs->pg);
+		else
+			unpin_user_page(cs->pg);
 	}
 	cs->pg = NULL;
 }
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 94747bac3489..395c2fb613fb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -619,7 +619,7 @@ static void fuse_release_user_pages(struct fuse_args_pages *ap,
 	for (i = 0; i < ap->num_pages; i++) {
 		if (should_dirty)
 			set_page_dirty_lock(ap->pages[i]);
-		put_page(ap->pages[i]);
+		unpin_user_page(ap->pages[i]);
 	}
 }
 
@@ -1382,20 +1382,13 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 	size_t nbytes = 0;  /* # bytes already packed in req */
 	ssize_t ret = 0;
 
-	/* Special case for kernel I/O: can copy directly into the buffer */
-	if (iov_iter_is_kvec(ii)) {
-		unsigned long user_addr = fuse_get_user_addr(ii);
-		size_t frag_size = fuse_get_frag_size(ii, *nbytesp);
-
-		if (write)
-			ap->args.in_args[1].value = (void *) user_addr;
-		else
-			ap->args.out_args[0].value = (void *) user_addr;
-
-		iov_iter_advance(ii, frag_size);
-		*nbytesp = frag_size;
-		return 0;
-	}
+	/*
+	 * TODO: this warning can eventually be removed. It is just to help
+	 * developers, during a transitional period, while direct IO code is
+	 * moving over to FOLL_PIN pages.
+	 */
+	if (WARN_ON_ONCE(!iter_is_iovec(ii)))
+		return -EOPNOTSUPP;
 
 	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
 		unsigned npages;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [RFC PATCH 7/7] block, direct-io: flip the switch: use pin_user_pages_fast()
  2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
                   ` (5 preceding siblings ...)
  2022-02-25  8:50 ` [RFC PATCH 6/7] fuse: convert direct IO paths to use FOLL_PIN John Hubbard
@ 2022-02-25  8:50 ` John Hubbard
  2022-02-25 12:05 ` [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN Jan Kara
  2022-02-25 13:12 ` David Hildenbrand
  8 siblings, 0 replies; 20+ messages in thread
From: John Hubbard @ 2022-02-25  8:50 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML, John Hubbard

Remove CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO, but leave the dio_w_*()
wrapper functions in place, with the pin_user_pages_fast() defines as
the only choice.

A subsequent patch is now possible, to rename all dio_w_*() functions so
as to remove the dio_w_ prefix.

Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 block/Kconfig        | 25 -------------------------
 include/linux/bvec.h |  7 -------
 2 files changed, 32 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index f6ca5e9597e4..168b873eb666 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -50,31 +50,6 @@ config BLK_DEV_BSG_COMMON
 config BLK_ICQ
 	bool
 
-config BLK_USE_PIN_USER_PAGES_FOR_DIO
-	bool "DEVELOPERS ONLY: Enable pin_user_pages() for Direct IO" if EXPERT
-	default n
-	help
-	  For Direct IO code, retain the pages via calls to
-	  pin_user_pages_fast(), instead of via get_user_pages_fast().
-	  Likewise, use pin_user_page() instead of get_page(). And then
-	  release such pages via unpin_user_page(), instead of
-	  put_page().
-
-	  This is a temporary setting, which will be deleted once the
-	  conversion is completed, reviewed, and tested. In the meantime,
-	  developers can enable this in order to try out each filesystem.
-	  For that, it's best to monitor these /proc/vmstat items:
-
-		nr_foll_pin_acquired
-		nr_foll_pin_released
-
-	  ...to ensure that they remain equal, when "at rest".
-
-	  Say yes here ONLY if are actively developing or testing the
-	  block layer or filesystems with pin_user_pages_fast().
-	  Otherwise, this is just a way to throw off the refcounting of
-	  pages in the system.
-
 config BLK_DEV_BSGLIB
 	bool "Block layer SG support v4 helper lib"
 	select BLK_DEV_BSG_COMMON
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index a96a68c687f6..5bc98b334efe 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -241,15 +241,8 @@ static inline void *bvec_virt(struct bio_vec *bvec)
 	return page_address(bvec->bv_page) + bvec->bv_offset;
 }
 
-#ifdef CONFIG_BLK_USE_PIN_USER_PAGES_FOR_DIO
 #define dio_w_pin_user_pages_fast(s, n, p, f)	pin_user_pages_fast(s, n, p, f)
 #define dio_w_pin_user_page(p)			pin_user_page(p)
 #define dio_w_unpin_user_page(p)		unpin_user_page(p)
 
-#else
-#define dio_w_pin_user_pages_fast(s, n, p, f)	get_user_pages_fast(s, n, p, f)
-#define dio_w_pin_user_page(p)			get_page(p)
-#define dio_w_unpin_user_page(p)		put_page(p)
-#endif
-
 #endif /* __LINUX_BVEC_H */
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN
  2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
                   ` (6 preceding siblings ...)
  2022-02-25  8:50 ` [RFC PATCH 7/7] block, direct-io: flip the switch: use pin_user_pages_fast() John Hubbard
@ 2022-02-25 12:05 ` Jan Kara
  2022-02-25 16:14   ` Chaitanya Kulkarni
  2022-02-25 19:36   ` John Hubbard
  2022-02-25 13:12 ` David Hildenbrand
  8 siblings, 2 replies; 20+ messages in thread
From: Jan Kara @ 2022-02-25 12:05 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, Andrew Morton, Chaitanya Kulkarni, linux-block,
	linux-fsdevel, linux-xfs, linux-mm, LKML

On Fri 25-02-22 00:50:18, John Hubbard wrote:
> Hi,
> 
> Summary:
> 
> This puts some prerequisites in place, including a CONFIG parameter,
> making it possible to start converting and testing the Direct IO part of
> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
> 
> It will take "a few" kernel releases to get the whole thing done.
> 
> Details:
> 
> As part of fixing the "get_user_pages() + file-backed memory" problem
> [1], and to support various COW-related fixes as well [2], we need to
> convert the Direct IO code from get_user_pages_fast(), to
> pin_user_pages_fast(). Because pin_user_pages*() calls require a
> corresponding call to unpin_user_page(), the conversion is more
> elaborate than just substitution.
> 
> Further complicating the conversion, the block/bio layers get their
> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
> each of which has a large number of callers. All of those callers need
> to be audited and changed so that they call unpin_user_page(), rather
> than put_page().
> 
> After quite some time exploring and consulting with people as well, it
> is clear that this cannot be done in just one patchset. That's because,
> not only is this large and time-consuming (for example, Chaitanya
> Kulkarni's first reaction, after looking into the details, was, "convert
> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
> but it is also spread across many filesystems.

With having modified fs/direct-io.c and fs/iomap/direct-io.c which
filesystems do you know are missing conversion? Or is it that you just want
to make sure with audit everything is fine? The only fs I could find
unconverted by your changes is ceph. Am I missing something?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN
  2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
                   ` (7 preceding siblings ...)
  2022-02-25 12:05 ` [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN Jan Kara
@ 2022-02-25 13:12 ` David Hildenbrand
  2022-02-25 21:10   ` John Hubbard
  8 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-02-25 13:12 UTC (permalink / raw)
  To: John Hubbard, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 25.02.22 09:50, John Hubbard wrote:
> Hi,
> 
> Summary:
> 
> This puts some prerequisites in place, including a CONFIG parameter,
> making it possible to start converting and testing the Direct IO part of
> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
> 
> It will take "a few" kernel releases to get the whole thing done.
> 
> Details:
> 
> As part of fixing the "get_user_pages() + file-backed memory" problem
> [1], and to support various COW-related fixes as well [2], we need to
> convert the Direct IO code from get_user_pages_fast(), to
> pin_user_pages_fast(). Because pin_user_pages*() calls require a
> corresponding call to unpin_user_page(), the conversion is more
> elaborate than just substitution.
> 
> Further complicating the conversion, the block/bio layers get their
> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
> each of which has a large number of callers. All of those callers need
> to be audited and changed so that they call unpin_user_page(), rather
> than put_page().

vmsplice is another candidate that uses iov_iter_get_pages() and should
be converted to FOLL_PIN. For that particular user, we have to also pass
FOLL_LONGTERM -- vmsplice as it stands can block memory hotunplug / CMA
/ ... for all eternity.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN
  2022-02-25 12:05 ` [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN Jan Kara
@ 2022-02-25 16:14   ` Chaitanya Kulkarni
  2022-02-25 16:40     ` Jan Kara
  2022-02-25 19:36   ` John Hubbard
  1 sibling, 1 reply; 20+ messages in thread
From: Chaitanya Kulkarni @ 2022-02-25 16:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Christoph Hellwig, Dave Chinner, Darrick J . Wong,
	Theodore Ts'o, Alexander Viro, Miklos Szeredi, John Hubbard,
	Andrew Morton, Chaitanya Kulkarni, linux-block, linux-fsdevel,
	linux-xfs, linux-mm, LKML

On 2/25/22 04:05, Jan Kara wrote:
> On Fri 25-02-22 00:50:18, John Hubbard wrote:
>> Hi,
>>
>> Summary:
>>
>> This puts some prerequisites in place, including a CONFIG parameter,
>> making it possible to start converting and testing the Direct IO part of
>> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
>>
>> It will take "a few" kernel releases to get the whole thing done.
>>
>> Details:
>>
>> As part of fixing the "get_user_pages() + file-backed memory" problem
>> [1], and to support various COW-related fixes as well [2], we need to
>> convert the Direct IO code from get_user_pages_fast(), to
>> pin_user_pages_fast(). Because pin_user_pages*() calls require a
>> corresponding call to unpin_user_page(), the conversion is more
>> elaborate than just substitution.
>>
>> Further complicating the conversion, the block/bio layers get their
>> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
>> each of which has a large number of callers. All of those callers need
>> to be audited and changed so that they call unpin_user_page(), rather
>> than put_page().
>>
>> After quite some time exploring and consulting with people as well, it
>> is clear that this cannot be done in just one patchset. That's because,
>> not only is this large and time-consuming (for example, Chaitanya
>> Kulkarni's first reaction, after looking into the details, was, "convert
>> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
>> but it is also spread across many filesystems.
> 
> With having modified fs/direct-io.c and fs/iomap/direct-io.c which
> filesystems do you know are missing conversion? Or is it that you just want
> to make sure with audit everything is fine? The only fs I could find
> unconverted by your changes is ceph. Am I missing something?

if I understand your comment correctly file systems which are listed in
the list see [1] (all the credit goes to John to have a complete list)
that are not using iomap but use XXX_XXX_direct_IO() should be fine,
since in the callchain going from :-

XXX_XXX_direct_io()
  __blkdev_direct_io()
   do_direct_io()

   ...

     submit_page_selection()
      get/put_page() <---

will take care of itself ?

> 
> 								Honza
> 

[1]

jfs_direct_IO()
nilfs_direct_IO()
ntfs_dirct_IO()
reiserfs_direct_IO()
udf_direct_IO()
ocfs2_dirct_IO()
affs_direct_IO()
exfat_direct_IO()
ext2_direct_IO()
fat_direct_IO()
hfs_direct_IO()
hfs_plus_direct_IO()

-ck



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN
  2022-02-25 16:14   ` Chaitanya Kulkarni
@ 2022-02-25 16:40     ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2022-02-25 16:40 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Jan Kara, Jens Axboe, Christoph Hellwig, Dave Chinner,
	Darrick J . Wong, Theodore Ts'o, Alexander Viro,
	Miklos Szeredi, John Hubbard, Andrew Morton, linux-block,
	linux-fsdevel, linux-xfs, linux-mm, LKML

On Fri 25-02-22 16:14:14, Chaitanya Kulkarni wrote:
> On 2/25/22 04:05, Jan Kara wrote:
> > On Fri 25-02-22 00:50:18, John Hubbard wrote:
> >> Hi,
> >>
> >> Summary:
> >>
> >> This puts some prerequisites in place, including a CONFIG parameter,
> >> making it possible to start converting and testing the Direct IO part of
> >> each filesystem, from get_user_pages_fast(), to pin_user_pages_fast().
> >>
> >> It will take "a few" kernel releases to get the whole thing done.
> >>
> >> Details:
> >>
> >> As part of fixing the "get_user_pages() + file-backed memory" problem
> >> [1], and to support various COW-related fixes as well [2], we need to
> >> convert the Direct IO code from get_user_pages_fast(), to
> >> pin_user_pages_fast(). Because pin_user_pages*() calls require a
> >> corresponding call to unpin_user_page(), the conversion is more
> >> elaborate than just substitution.
> >>
> >> Further complicating the conversion, the block/bio layers get their
> >> Direct IO pages via iov_iter_get_pages() and iov_iter_get_pages_alloc(),
> >> each of which has a large number of callers. All of those callers need
> >> to be audited and changed so that they call unpin_user_page(), rather
> >> than put_page().
> >>
> >> After quite some time exploring and consulting with people as well, it
> >> is clear that this cannot be done in just one patchset. That's because,
> >> not only is this large and time-consuming (for example, Chaitanya
> >> Kulkarni's first reaction, after looking into the details, was, "convert
> >> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
> >> but it is also spread across many filesystems.
> > 
> > With having modified fs/direct-io.c and fs/iomap/direct-io.c which
> > filesystems do you know are missing conversion? Or is it that you just want
> > to make sure with audit everything is fine? The only fs I could find
> > unconverted by your changes is ceph. Am I missing something?
> 
> if I understand your comment correctly file systems which are listed in
> the list see [1] (all the credit goes to John to have a complete list)
> that are not using iomap but use XXX_XXX_direct_IO() should be fine,
> since in the callchain going from :-
> 
> XXX_XXX_direct_io()
>   __blkdev_direct_io()
>    do_direct_io()
> 
>    ...
> 
>      submit_page_selection()
>       get/put_page() <---
> 
> will take care of itself ?

Yes, John's changes to fs/direct-io.c should take care of these
filesystems using __blkdev_direct_io().

								Honza

> [1]
> 
> jfs_direct_IO()
> nilfs_direct_IO()
> ntfs_dirct_IO()
> reiserfs_direct_IO()
> udf_direct_IO()
> ocfs2_dirct_IO()
> affs_direct_IO()
> exfat_direct_IO()
> ext2_direct_IO()
> fat_direct_IO()
> hfs_direct_IO()
> hfs_plus_direct_IO()
> 
> -ck
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN
  2022-02-25 12:05 ` [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN Jan Kara
  2022-02-25 16:14   ` Chaitanya Kulkarni
@ 2022-02-25 19:36   ` John Hubbard
  2022-02-25 22:20     ` John Hubbard
  1 sibling, 1 reply; 20+ messages in thread
From: John Hubbard @ 2022-02-25 19:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Christoph Hellwig, Dave Chinner, Darrick J . Wong,
	Theodore Ts'o, Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni, linux-block, linux-fsdevel, linux-xfs,
	linux-mm, LKML

On 2/25/22 04:05, Jan Kara wrote:
...
>> After quite some time exploring and consulting with people as well, it
>> is clear that this cannot be done in just one patchset. That's because,
>> not only is this large and time-consuming (for example, Chaitanya
>> Kulkarni's first reaction, after looking into the details, was, "convert
>> the remaining filesystems to use iomap, *then* convert to FOLL_PIN..."),
>> but it is also spread across many filesystems.
> 
> With having modified fs/direct-io.c and fs/iomap/direct-io.c which
> filesystems do you know are missing conversion? Or is it that you just want
> to make sure with audit everything is fine? The only fs I could find
> unconverted by your changes is ceph. Am I missing something?
> 
> 								Honza

There are a few more filesystems that call iov_iter_get_pages() or
iov_iter_get_pages_alloc(), plus networking things as well, plus some
others that are hard to categorize, such as vhost. So we have:

* ceph
* rds
* cifs
* p9
* net: __zerocopy_sg_from_iter(), tls_setup_from_iter(),
* crypto: af_alg_make_sg() (maybe N/A)
* vmsplice() (as David Hildenbrand mentioned)
* vhost: vhost_scsi_map_to_sgl()

In addition to that, I was also worried that maybe the
blockdev_direct_IO() or iomap filesystems might be breaking encapsulation
occasionally, by calling put_page() on the direct IO user page buffer.
Perhaps in error paths.

Are you pretty sure that that last concern is not valid? That would be a
welcome bit of news.



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN
  2022-02-25 13:12 ` David Hildenbrand
@ 2022-02-25 21:10   ` John Hubbard
  0 siblings, 0 replies; 20+ messages in thread
From: John Hubbard @ 2022-02-25 21:10 UTC (permalink / raw)
  To: David Hildenbrand, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/25/22 05:12, David Hildenbrand wrote:
> vmsplice is another candidate that uses iov_iter_get_pages() and should

Yes.

> be converted to FOLL_PIN. For that particular user, we have to also pass
> FOLL_LONGTERM -- vmsplice as it stands can block memory hotunplug / CMA
> / ... for all eternity.
> 

Oh, thanks for pointing that out. I was thinking about Direct IO, and
would have overlooked that vmsplice need FOLL_LONGTERM.

I'm still quite vague about which parts of vmsplice are in pipe buffers
(and therefore will *not* get FOLL_PIN pages) and which are in user space
buffers. Just haven't spent enough time with vmsplice yet, hopefully it
will clear up with some study of the code.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN
  2022-02-25 19:36   ` John Hubbard
@ 2022-02-25 22:20     ` John Hubbard
  0 siblings, 0 replies; 20+ messages in thread
From: John Hubbard @ 2022-02-25 22:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Christoph Hellwig, Dave Chinner, Darrick J . Wong,
	Theodore Ts'o, Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni, linux-block, linux-fsdevel, linux-xfs,
	linux-mm, LKML

On 2/25/22 11:36, John Hubbard wrote:
> On 2/25/22 04:05, Jan Kara wrote:
> ...
>> With having modified fs/direct-io.c and fs/iomap/direct-io.c which
>> filesystems do you know are missing conversion? Or is it that you just want
>> to make sure with audit everything is fine? The only fs I could find
>> unconverted by your changes is ceph. Am I missing something?
>>
>>                                 Honza
> 
> There are a few more filesystems that call iov_iter_get_pages() or
> iov_iter_get_pages_alloc(), plus networking things as well, plus some
> others that are hard to categorize, such as vhost. So we have:
> 
> * ceph
> * rds
> * cifs
> * p9
> * net: __zerocopy_sg_from_iter(), tls_setup_from_iter(),
> * crypto: af_alg_make_sg() (maybe N/A)
> * vmsplice() (as David Hildenbrand mentioned)
> * vhost: vhost_scsi_map_to_sgl()

...although...if each filesystem does *not* require custom attention,
then there is another, maybe better approach, which is: factor out an
iovec-only pair of allocators, and transition each subsystem when it's
ready. So:

     dio_iov_iter_get_pages()
     dio_iov_iter_get_pages_alloc()

That would allow doing this a bit at a time, and without the horrible
CONFIG parameter that is switched over all at once.

The bio_release_pages() still calls unpin_user_page(), though, so that
means that all Direct IO callers would still have to change over at
once.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()
  2022-02-25  8:50 ` [RFC PATCH 1/7] mm/gup: introduce pin_user_page() John Hubbard
@ 2022-02-28 13:27   ` David Hildenbrand
  2022-02-28 21:14     ` John Hubbard
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-02-28 13:27 UTC (permalink / raw)
  To: John Hubbard, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 25.02.22 09:50, John Hubbard wrote:
> pin_user_page() is an externally-usable version of try_grab_page(), but
> with semantics that match get_page(), so that it can act as a drop-in
> replacement for get_page(). Specifically, pin_user_page() has a void
> return type.
> 
> pin_user_page() elevates a page's refcount is using FOLL_PIN rules. This
> means that the caller must release the page via unpin_user_page().
> 
> Signed-off-by: John Hubbard <jhubbard@nvidia.com>
> ---
>  include/linux/mm.h |  1 +
>  mm/gup.c           | 34 ++++++++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 929488a47181..bb51f5487aef 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1914,6 +1914,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
>  long get_user_pages(unsigned long start, unsigned long nr_pages,
>  			    unsigned int gup_flags, struct page **pages,
>  			    struct vm_area_struct **vmas);
> +void pin_user_page(struct page *page);
>  long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  		    unsigned int gup_flags, struct page **pages,
>  		    struct vm_area_struct **vmas);
> diff --git a/mm/gup.c b/mm/gup.c
> index 5c3f6ede17eb..44446241c3a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -3034,6 +3034,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>  }
>  EXPORT_SYMBOL(pin_user_pages);
>  
> +/**
> + * pin_user_page() - apply a FOLL_PIN reference to a page ()
> + *
> + * @page: the page to be pinned.
> + *
> + * Similar to get_user_pages(), in that the page's refcount is elevated using
> + * FOLL_PIN rules.
> + *
> + * IMPORTANT: That means that the caller must release the page via
> + * unpin_user_page().
> + *
> + */
> +void pin_user_page(struct page *page)
> +{
> +	struct folio *folio = page_folio(page);
> +
> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
> +
> +	/*
> +	 * Similar to try_grab_page(): be sure to *also*
> +	 * increment the normal page refcount field at least once,
> +	 * so that the page really is pinned.
> +	 */
> +	if (folio_test_large(folio)) {
> +		folio_ref_add(folio, 1);
> +		atomic_add(1, folio_pincount_ptr(folio));
> +	} else {
> +		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
> +	}
> +
> +	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
> +}
> +EXPORT_SYMBOL(pin_user_page);
> +
>  /*
>   * pin_user_pages_unlocked() is the FOLL_PIN variant of
>   * get_user_pages_unlocked(). Behavior is the same, except that this one sets

I assume that function will only get called on a page that has been
obtained by a previous pin_user_pages_fast(), correct?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()
  2022-02-28 13:27   ` David Hildenbrand
@ 2022-02-28 21:14     ` John Hubbard
  2022-03-01  8:11       ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: John Hubbard @ 2022-02-28 21:14 UTC (permalink / raw)
  To: David Hildenbrand, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 2/28/22 05:27, David Hildenbrand wrote:
...
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 5c3f6ede17eb..44446241c3a9 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -3034,6 +3034,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>>   }
>>   EXPORT_SYMBOL(pin_user_pages);
>>   
>> +/**
>> + * pin_user_page() - apply a FOLL_PIN reference to a page ()
>> + *
>> + * @page: the page to be pinned.
>> + *
>> + * Similar to get_user_pages(), in that the page's refcount is elevated using
>> + * FOLL_PIN rules.
>> + *
>> + * IMPORTANT: That means that the caller must release the page via
>> + * unpin_user_page().
>> + *
>> + */
>> +void pin_user_page(struct page *page)
>> +{
>> +	struct folio *folio = page_folio(page);
>> +
>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>> +
>> +	/*
>> +	 * Similar to try_grab_page(): be sure to *also*
>> +	 * increment the normal page refcount field at least once,
>> +	 * so that the page really is pinned.
>> +	 */
>> +	if (folio_test_large(folio)) {
>> +		folio_ref_add(folio, 1);
>> +		atomic_add(1, folio_pincount_ptr(folio));
>> +	} else {
>> +		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
>> +	}
>> +
>> +	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
>> +}
>> +EXPORT_SYMBOL(pin_user_page);
>> +
>>   /*
>>    * pin_user_pages_unlocked() is the FOLL_PIN variant of
>>    * get_user_pages_unlocked(). Behavior is the same, except that this one sets
> 
> I assume that function will only get called on a page that has been
> obtained by a previous pin_user_pages_fast(), correct?
> 

Well, no. This is meant to be used in place of get_page(), for code that
knows that the pages will be released via unpin_user_page(). So there is
no special prerequisite there.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()
  2022-02-28 21:14     ` John Hubbard
@ 2022-03-01  8:11       ` David Hildenbrand
  2022-03-01  8:40         ` John Hubbard
  0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2022-03-01  8:11 UTC (permalink / raw)
  To: John Hubbard, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 28.02.22 22:14, John Hubbard wrote:
> On 2/28/22 05:27, David Hildenbrand wrote:
> ...
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 5c3f6ede17eb..44446241c3a9 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -3034,6 +3034,40 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
>>>   }
>>>   EXPORT_SYMBOL(pin_user_pages);
>>>   
>>> +/**
>>> + * pin_user_page() - apply a FOLL_PIN reference to a page ()
>>> + *
>>> + * @page: the page to be pinned.
>>> + *
>>> + * Similar to get_user_pages(), in that the page's refcount is elevated using
>>> + * FOLL_PIN rules.
>>> + *
>>> + * IMPORTANT: That means that the caller must release the page via
>>> + * unpin_user_page().
>>> + *
>>> + */
>>> +void pin_user_page(struct page *page)
>>> +{
>>> +	struct folio *folio = page_folio(page);
>>> +
>>> +	WARN_ON_ONCE(folio_ref_count(folio) <= 0);
>>> +
>>> +	/*
>>> +	 * Similar to try_grab_page(): be sure to *also*
>>> +	 * increment the normal page refcount field at least once,
>>> +	 * so that the page really is pinned.
>>> +	 */
>>> +	if (folio_test_large(folio)) {
>>> +		folio_ref_add(folio, 1);
>>> +		atomic_add(1, folio_pincount_ptr(folio));
>>> +	} else {
>>> +		folio_ref_add(folio, GUP_PIN_COUNTING_BIAS);
>>> +	}
>>> +
>>> +	node_stat_mod_folio(folio, NR_FOLL_PIN_ACQUIRED, 1);
>>> +}
>>> +EXPORT_SYMBOL(pin_user_page);
>>> +
>>>   /*
>>>    * pin_user_pages_unlocked() is the FOLL_PIN variant of
>>>    * get_user_pages_unlocked(). Behavior is the same, except that this one sets
>>
>> I assume that function will only get called on a page that has been
>> obtained by a previous pin_user_pages_fast(), correct?
>>
> 
> Well, no. This is meant to be used in place of get_page(), for code that
> knows that the pages will be released via unpin_user_page(). So there is
> no special prerequisite there.

That might be problematic and possibly the wrong approach, depending on
*what* we're actually pinning and what we're intending to do with that.

My assumption would have been that this interface is to duplicate a pin
on a page, which would be perfectly fine, because the page actually saw
a FOLL_PIN previously.

We're taking a pin on a page that we haven't obtained via FOLL_PIN if I
understand correctly. Which raises the questions, how do we end up with
the pages here, and what are we doing to do with them (use them like we
obtained them via FOLL_PIN?)?


If it's converting FOLL_GET -> FOLL_PIN manually, then we're bypassing
FOLL_PIN special handling in GUP code:

page = get_user_pages(FOLL_GET)
pin_user_page(page)
put_page(page)


For anonymous pages, we'll bail out for example once we have

https://lkml.kernel.org/r/20220224122614.94921-14-david@redhat.com

Because the conditions for pinned anonymous pages might no longer hold.

If we won't call pin_user_page() on anonymous pages, it would be fine.
But then, I still wonder how we come up the "struct page" here.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()
  2022-03-01  8:11       ` David Hildenbrand
@ 2022-03-01  8:40         ` John Hubbard
  2022-03-01  9:30           ` David Hildenbrand
  0 siblings, 1 reply; 20+ messages in thread
From: John Hubbard @ 2022-03-01  8:40 UTC (permalink / raw)
  To: David Hildenbrand, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML

On 3/1/22 00:11, David Hildenbrand wrote:
>> ...
>>>> +EXPORT_SYMBOL(pin_user_page);
>>>> +
>>>>    /*
>>>>     * pin_user_pages_unlocked() is the FOLL_PIN variant of
>>>>     * get_user_pages_unlocked(). Behavior is the same, except that this one sets
>>>
>>> I assume that function will only get called on a page that has been
>>> obtained by a previous pin_user_pages_fast(), correct?
>>>
>>
>> Well, no. This is meant to be used in place of get_page(), for code that
>> knows that the pages will be released via unpin_user_page(). So there is
>> no special prerequisite there.
> 
> That might be problematic and possibly the wrong approach, depending on
> *what* we're actually pinning and what we're intending to do with that.
> 
> My assumption would have been that this interface is to duplicate a pin

I see that I need to put more documentation here, so people don't have
to assume things... :)

> on a page, which would be perfectly fine, because the page actually saw
> a FOLL_PIN previously.
> 
> We're taking a pin on a page that we haven't obtained via FOLL_PIN if I
> understand correctly. Which raises the questions, how do we end up with
> the pages here, and what are we doing to do with them (use them like we
> obtained them via FOLL_PIN?)?
> 
> 
> If it's converting FOLL_GET -> FOLL_PIN manually, then we're bypassing
> FOLL_PIN special handling in GUP code:
> 
> page = get_user_pages(FOLL_GET)
> pin_user_page(page)
> put_page(page)

No, that's not where this is going at all. The idea, which  I now see
needs better documentation, is to handle file-backed pages. Only.

We're not converting from one type to another, nor are we doubling up.
We're just keeping the pin type consistent so that the vast block-
processing machinery can take pages in and handle them, then release
them at the end with bio_release_pages(), which will call
unpin_user_pages().

> 
> 
> For anonymous pages, we'll bail out for example once we have
> 
> https://lkml.kernel.org/r/20220224122614.94921-14-david@redhat.com
> 
> Because the conditions for pinned anonymous pages might no longer hold.
> 
> If we won't call pin_user_page() on anonymous pages, it would be fine.

We won't, and in fact, I should add WARN_ON_ONCE(PageAnon(page)) to
this function.

> But then, I still wonder how we come up the "struct page" here.
> 

 From the file system. For example, the NFS-direct and fuse conversions
in the last patches show how that works.

Thanks for this feedback, this is very helpful.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC PATCH 1/7] mm/gup: introduce pin_user_page()
  2022-03-01  8:40         ` John Hubbard
@ 2022-03-01  9:30           ` David Hildenbrand
  0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2022-03-01  9:30 UTC (permalink / raw)
  To: John Hubbard, Jens Axboe, Jan Kara, Christoph Hellwig,
	Dave Chinner, Darrick J . Wong, Theodore Ts'o,
	Alexander Viro, Miklos Szeredi, Andrew Morton,
	Chaitanya Kulkarni
  Cc: linux-block, linux-fsdevel, linux-xfs, linux-mm, LKML


>>>> That might be problematic and possibly the wrong approach, depending on
>> *what* we're actually pinning and what we're intending to do with that.
>>
>> My assumption would have been that this interface is to duplicate a pin
> 
> I see that I need to put more documentation here, so people don't have
> to assume things... :)
> 

Yes, please :)

>> on a page, which would be perfectly fine, because the page actually saw
>> a FOLL_PIN previously.
>>
>> We're taking a pin on a page that we haven't obtained via FOLL_PIN if I
>> understand correctly. Which raises the questions, how do we end up with
>> the pages here, and what are we doing to do with them (use them like we
>> obtained them via FOLL_PIN?)?
>>
>>
>> If it's converting FOLL_GET -> FOLL_PIN manually, then we're bypassing
>> FOLL_PIN special handling in GUP code:
>>
>> page = get_user_pages(FOLL_GET)
>> pin_user_page(page)
>> put_page(page)
> 
> No, that's not where this is going at all. The idea, which  I now see
> needs better documentation, is to handle file-backed pages. Only.
> 
> We're not converting from one type to another, nor are we doubling up.
> We're just keeping the pin type consistent so that the vast block-
> processing machinery can take pages in and handle them, then release
> them at the end with bio_release_pages(), which will call
> unpin_user_pages().
> 

Ah, okay, that makes sense. Glad to hear that we're intending to use
this with !anon pages only.

>>
>>
>> For anonymous pages, we'll bail out for example once we have
>>
>> https://lkml.kernel.org/r/20220224122614.94921-14-david@redhat.com
>>
>> Because the conditions for pinned anonymous pages might no longer hold.
>>
>> If we won't call pin_user_page() on anonymous pages, it would be fine.
> 
> We won't, and in fact, I should add WARN_ON_ONCE(PageAnon(page)) to
> this function.

Exactly what I would have suggested,

> 
>> But then, I still wonder how we come up the "struct page" here.
>>
> 
>  From the file system. For example, the NFS-direct and fuse conversions
> in the last patches show how that works.
> 
> Thanks for this feedback, this is very helpful.

Thanks for clarifying, John!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-03-01  9:31 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-25  8:50 [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN John Hubbard
2022-02-25  8:50 ` [RFC PATCH 1/7] mm/gup: introduce pin_user_page() John Hubbard
2022-02-28 13:27   ` David Hildenbrand
2022-02-28 21:14     ` John Hubbard
2022-03-01  8:11       ` David Hildenbrand
2022-03-01  8:40         ` John Hubbard
2022-03-01  9:30           ` David Hildenbrand
2022-02-25  8:50 ` [RFC PATCH 2/7] block: add dio_w_*() wrappers for pin, unpin user pages John Hubbard
2022-02-25  8:50 ` [RFC PATCH 3/7] block, fs: assert that key paths use iovecs, and nothing else John Hubbard
2022-02-25  8:50 ` [RFC PATCH 4/7] block, bio, fs: initial pin_user_pages_fast() changes John Hubbard
2022-02-25  8:50 ` [RFC PATCH 5/7] NFS: direct-io: convert to FOLL_PIN pages John Hubbard
2022-02-25  8:50 ` [RFC PATCH 6/7] fuse: convert direct IO paths to use FOLL_PIN John Hubbard
2022-02-25  8:50 ` [RFC PATCH 7/7] block, direct-io: flip the switch: use pin_user_pages_fast() John Hubbard
2022-02-25 12:05 ` [RFC PATCH 0/7] block, fs: convert Direct IO to FOLL_PIN Jan Kara
2022-02-25 16:14   ` Chaitanya Kulkarni
2022-02-25 16:40     ` Jan Kara
2022-02-25 19:36   ` John Hubbard
2022-02-25 22:20     ` John Hubbard
2022-02-25 13:12 ` David Hildenbrand
2022-02-25 21:10   ` John Hubbard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).