linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
@ 2023-01-20 17:55 David Howells
  2023-01-20 17:55 ` [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction David Howells
                   ` (9 more replies)
  0 siblings, 10 replies; 54+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Hi Al, Christoph,

Here are patches to provide support for extracting pages from an iov_iter
and a patch to use the primary extraction function in the block layer bio
code.

The patches make the following changes:

 (1) Add a function, iov_iter_extract_pages() to replace
     iov_iter_get_pages*() that gets refs, pins or just lists the pages as
     appropriate to the iterator type and the I/O direction.

     Add a function, iov_iter_extract_mode() that will indicate from the
     iterator type and the I/O direction how the cleanup is to be
     performed, returning FOLL_GET, FOLL_PIN or 0.

 (2) Add a function, folio_put_unpin(), and a wrapper, page_put_unpin(),
     that take a page and the return from iov_iter_extract_mode() and do
     the right thing to clean up the page.

 (3) Make the bio struct carry a pair of flags to indicate the cleanup
     mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
     FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
     added.

 (4) Add a function, bio_release_page(), to release a page appropriately to
     the cleanup mode indicated by the BIO_PAGE_* flags.

 (5) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
     pages appropriately and clean them up later.

 (6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

 (7) Renumber FOLL_GET and FOLL_PIN down so that they're at bits 0 and 1
     and coincident with BIO_PAGE_REFFED and BIO_PAGE_PINNED.  The compiler
     can then optimise on that.  Also, it's probably going to be necessary
     to embed these in the page pointer in sk_buff fragments.  This patch
     can go independently through the mm tree.

Changes:
========
ver #7)
 - For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
   as it turned out to be a lot more complicated, with places not setting
   IOCB_WRITE when they should, for example.
 - Drop all the patches that changed things other then the block layer's
   bio handling.  The netfslib and cifs changes can go into a separate
   patchset.
 - Add support for extracting pages from KVEC-type iterators.
 - When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
 - Fix write() syscall and co. not setting IOCB_WRITE.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the iterator direction checks from smbd_recv().
 - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
   gup_flags to iov_iter_get/extract_pages*().
 - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Provide a helper to clean up a page.
 - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
   the same numerical values, enforced with an assertion.
 - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
   NFS.
 - Added in the patches to make CIFS do top-to-bottom iterators and use
   various of the added extraction functions.
 - Added a pair of work-in-progess patches to make sk_buff fragments store
   FOLL_GET and FOLL_PIN.

ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

I've pushed the patches (excluding the two WIP networking patches) here
also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6

David Howells (8):
  iov_iter: Define flags to qualify page extraction.
  iov_iter: Add a function to extract a page list from an iterator
  mm: Provide a helper to drop a pin/ref on a page
  block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the
    meaning
  block: Add BIO_PAGE_PINNED
  block: Make bio structs pin pages rather than ref'ing if appropriate
  block: Fix bio_flagged() so that gcc can better optimise it
  mm: Renumber FOLL_GET and FOLL_PIN down

 block/bio.c               |  43 ++--
 block/blk-map.c           |  26 +--
 block/blk.h               |  29 +++
 fs/iomap/direct-io.c      |   1 -
 include/linux/bio.h       |   5 +-
 include/linux/blk_types.h |   3 +-
 include/linux/mm.h        |  17 +-
 include/linux/uio.h       |  35 ++-
 lib/iov_iter.c            | 438 +++++++++++++++++++++++++++++++++++++-
 mm/gup.c                  |  22 ++
 10 files changed, 571 insertions(+), 48 deletions(-)


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction.
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-21 13:01   ` Christoph Hellwig
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 54+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Define flags to qualify page extraction to pass into iov_iter_*_pages*()
rather than passing in FOLL_* flags.

For now only a flag to allow peer-to-peer DMA is allowed; but in future,
additional flags will be provided to indicate the I/O direction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@infradead.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
---

Notes:
    ver #7)
     - Don't use FOLL_* as a parameter, but rather define constants
       specifically to use with iov_iter_*_pages*().
     - Drop the I/O direction constants for now.

 block/bio.c         |  6 +++---
 block/blk-map.c     |  8 ++++----
 include/linux/uio.h |  7 +++++--
 lib/iov_iter.c      | 14 ++++++++------
 4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f96fcae3f75..6a6c73d14bfd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1249,7 +1249,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	unsigned int gup_flags = 0;
+	unsigned int extract_flags = 0;
 	ssize_t size, left;
 	unsigned len, i = 0;
 	size_t offset, trim;
@@ -1264,7 +1264,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
 	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
-		gup_flags |= FOLL_PCI_P2PDMA;
+		extract_flags |= ITER_ALLOW_P2PDMA;
 
 	/*
 	 * Each segment in the iov is required to be a block size multiple.
@@ -1275,7 +1275,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 */
 	size = iov_iter_get_pages(iter, pages,
 				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, gup_flags);
+				  nr_pages, &offset, extract_flags);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..bc111261fc82 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -267,7 +267,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 {
 	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
 	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
-	unsigned int gup_flags = 0;
+	unsigned int extract_flags = 0;
 	struct bio *bio;
 	int ret;
 	int j;
@@ -280,7 +280,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		return -ENOMEM;
 
 	if (blk_queue_pci_p2pdma(rq->q))
-		gup_flags |= FOLL_PCI_P2PDMA;
+		extract_flags |= ITER_ALLOW_P2PDMA;
 
 	while (iov_iter_count(iter)) {
 		struct page **pages, *stack_pages[UIO_FASTIOV];
@@ -291,10 +291,10 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
 			pages = stack_pages;
 			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-						   nr_vecs, &offs, gup_flags);
+						   nr_vecs, &offs, extract_flags);
 		} else {
 			bytes = iov_iter_get_pages_alloc(iter, &pages,
-						LONG_MAX, &offs, gup_flags);
+						LONG_MAX, &offs, extract_flags);
 		}
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9f158238edba..46d5080314c6 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -252,12 +252,12 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *
 		     loff_t start, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 		size_t maxsize, unsigned maxpages, size_t *start,
-		unsigned gup_flags);
+		unsigned extract_flags);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		struct page ***pages, size_t maxsize, size_t *start,
-		unsigned gup_flags);
+		unsigned extract_flags);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
@@ -360,4 +360,7 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 	};
 }
 
+/* Flags for iov_iter_get/extract_pages*() */
+#define ITER_ALLOW_P2PDMA	0x01	/* Allow P2PDMA on the extracted pages */
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9a3ff37ecd1..fb04abe7d746 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1432,9 +1432,9 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start,
-		   unsigned int gup_flags)
+		   unsigned int extract_flags)
 {
-	unsigned int n;
+	unsigned int n, gup_flags = 0;
 
 	if (maxsize > i->count)
 		maxsize = i->count;
@@ -1442,6 +1442,8 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		return 0;
 	if (maxsize > MAX_RW_COUNT)
 		maxsize = MAX_RW_COUNT;
+	if (extract_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
 
 	if (likely(user_backed_iter(i))) {
 		unsigned long addr;
@@ -1495,14 +1497,14 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
-		   size_t *start, unsigned gup_flags)
+		   size_t *start, unsigned extract_flags)
 {
 	if (!maxpages)
 		return 0;
 	BUG_ON(!pages);
 
 	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
-					  start, gup_flags);
+					  start, extract_flags);
 }
 EXPORT_SYMBOL_GPL(iov_iter_get_pages);
 
@@ -1515,14 +1517,14 @@ EXPORT_SYMBOL(iov_iter_get_pages2);
 
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-		   size_t *start, unsigned gup_flags)
+		   size_t *start, unsigned extract_flags)
 {
 	ssize_t len;
 
 	*pages = NULL;
 
 	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
-					 gup_flags);
+					 extract_flags);
 	if (len <= 0) {
 		kvfree(*pages);
 		*pages = NULL;


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
  2023-01-20 17:55 ` [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-21 13:01   ` Christoph Hellwig
                     ` (5 more replies)
  2023-01-20 17:55 ` [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page David Howells
                   ` (7 subsequent siblings)
  9 siblings, 6 replies; 54+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, John Hubbard, linux-mm

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a reference added or a pin
added or neither, depending on the type of iterator and the direction of
transfer.  The caller must pass FOLL_READ_FROM_MEM or FOLL_WRITE_TO_MEM
as part of gup_flags to indicate how the iterator contents are to be used.

Add a second function, iov_iter_extract_mode(), to determine how the
cleanup should be done.

There are three cases:

 (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins obtained on them (but not references)
     so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
     progress.

     iov_iter_extract_mode() will return FOLL_PIN for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have references obtained on them, but not pins.

     iov_iter_extract_mode() will return FOLL_GET.  The caller should use
     something like put_page() for page disposal.

 (3) Any other sort of iterator.

     No refs or pins are obtained on the page, the assumption is made that
     the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
     permitted.

     iov_iter_extract_mode() will return 0.  The pages don't need
     additional disposal.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: John Hubbard <jhubbard@nvidia.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391053207.2311931.16398133457201442907.stgit@warthog.procyon.org.uk/ # v6
---

Notes:
    ver #7)
     - Switch to passing in iter-specific flags rather than FOLL_* flags.
     - Drop the direction flags for now.
     - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
     - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
     - Add support for extraction from KVEC-type iters.
     - Use iov_iter_advance() rather than open-coding it.
     - Make BVEC- and KVEC-type skip over initial empty vectors.
    
    ver #6)
     - Add back the function to indicate the cleanup mode.
     - Drop the cleanup_mode return arg to iov_iter_extract_pages().
     - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
       data_source.
    
    ver #4)
     - Use ITER_SOURCE/DEST instead of WRITE/READ.
     - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
    
    ver #3)
     - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
       to get/pin_user_pages_fast()[1].

 include/linux/uio.h |  28 +++
 lib/iov_iter.c      | 424 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 452 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 46d5080314c6..a4233049ab7a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -363,4 +363,32 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 /* Flags for iov_iter_get/extract_pages*() */
 #define ITER_ALLOW_P2PDMA	0x01	/* Allow P2PDMA on the extracted pages */
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       unsigned int extract_flags, size_t *offset0);
+
+/**
+ * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ * @extract_flags: How the iterator is to be used
+ *
+ * Examine the iterator and @extract_flags and indicate by returning FOLL_PIN,
+ * FOLL_GET or 0 as to how, if at all, pages extracted from the iterator will
+ * be retained by the extraction function.
+ *
+ * FOLL_GET indicates that the pages will have a reference taken on them that
+ * the caller must put.  This can be done for DMA/async DIO write from a page.
+ *
+ * FOLL_PIN indicates that the pages will have a pin placed in them that the
+ * caller must unpin.  This is must be done for DMA/async DIO read to a page to
+ * avoid CoW problems in fork.
+ *
+ * 0 indicates that no measures are taken and that it's up to the caller to
+ * retain the pages.
+ */
+#define iov_iter_extract_mode(iter, extract_flags) \
+	(user_backed_iter(iter) ?				\
+	 (iter->data_source == ITER_SOURCE) ?			\
+	 FOLL_GET : FOLL_PIN : 0)
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fb04abe7d746..843abe566efb 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1916,3 +1916,427 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator.  This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	unsigned int nr, offset, chunk, j;
+	struct page **p;
+	size_t left;
+
+	if (!sanity(i))
+		return -EFAULT;
+
+	offset = pipe_npages(i, &nr);
+	if (!nr)
+		return -EFAULT;
+	*offset0 = offset;
+
+	maxpages = min_t(size_t, nr, maxpages);
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	left = maxsize;
+	for (j = 0; j < maxpages; j++) {
+		struct page *page = append_pipe(i, left, &offset);
+		if (!page)
+			break;
+		chunk = min_t(size_t, left, PAGE_SIZE - offset);
+		left -= chunk;
+		*p++ = page;
+	}
+	if (!j)
+		return -EFAULT;
+	return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     unsigned int extract_flags,
+					     size_t *offset0)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->bvec->bv_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	const void *kaddr;
+	size_t skip = i->iov_offset, offset, len;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->kvec->iov_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+	kaddr = i->kvec->iov_base;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	kaddr -= offset;
+	len = offset + maxsize;
+	for (k = 0; k < maxpages; k++) {
+		size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+		if (is_vmalloc_or_module_addr(kaddr))
+			page = vmalloc_to_page(kaddr);
+		else
+			page = virt_to_page(kaddr);
+
+		p[k] = page;
+		len -= seg;
+		kaddr += PAGE_SIZE;
+	}
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Get the first segment from an ITER_UBUF or ITER_IOVEC iterator.  The
+ * iterator must not be empty.
+ */
+static unsigned long iov_iter_extract_first_user_segment(const struct iov_iter *i,
+							 size_t *size)
+{
+	size_t skip;
+	long k;
+
+	if (iter_is_ubuf(i))
+		return (unsigned long)i->ubuf + i->iov_offset;
+
+	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
+		size_t len = i->iov[k].iov_len - skip;
+
+		if (unlikely(!len))
+			continue;
+		if (*size > len)
+			*size = len;
+		return (unsigned long)i->iov[k].iov_base + skip;
+	}
+	BUG(); // if it had been empty, we wouldn't get called
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get references
+ * on them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred out of the buffer described by
+ * the iterator (ie. this is the source).
+ *
+ * The pages are returned with incremented refcounts that the caller must undo
+ * once the transfer is complete, but no additional pins are obtained.
+ *
+ * This is only safe to be used where background IO/DMA is not going to be
+ * modifying the buffer, and so won't cause a problem with CoW on fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_get(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int extract_flags,
+						   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = FOLL_GET;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(i->data_source != ITER_SOURCE))
+		return -EFAULT;
+
+	if (extract_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = iov_iter_extract_first_user_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = get_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred into the buffer described by the
+ * iterator (ie. this is the destination).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref makes sure that CoW happens
+ * correctly in the parent during fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_pin(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int extract_flags,
+						   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = FOLL_PIN | FOLL_WRITE;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(i->data_source != ITER_DEST))
+		return -EFAULT;
+
+	if (extract_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	if (iov_iter_extract_mode(i, extract_flags) == FOLL_GET)
+		return iov_iter_extract_user_pages_and_get(i, pages, maxsize,
+							   maxpages, extract_flags,
+							   offset0);
+	else
+		return iov_iter_extract_user_pages_and_pin(i, pages, maxsize,
+							   maxpages, extract_flags,
+							   offset0);
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extract_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extract_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA be
+ * allowed on the pages extracted.
+ *
+ * The iov_iter_extract_mode() function can be used to query how cleanup should
+ * be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /OUT OF/ the buffer (@i->data_source == ITER_SOURCE), refs
+ *      will be taken on the pages, but pins will not be added.  This can be
+ *      used for DMA from a page; it cannot be used for DMA to a page, as it
+ *      may cause page-COW problems in fork.  iov_iter_extract_mode() will
+ *      return FOLL_GET.
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /INTO/ the described buffer (@i->data_source |= ITER_DEST),
+ *      pins will be added to the pages, but refs will not be taken.  This must
+ *      be used for DMA to a page.  iov_iter_extract_mode() will return
+ *      FOLL_PIN.
+ *
+ *  (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ *      data.  Additional pages may be allocated and added to the pipe (which
+ *      will hold the refs), but neither refs nor pins will be obtained for the
+ *      caller.  The caller must hold the pipe lock.  iov_iter_extract_mode()
+ *      will return 0.
+ *
+ *  (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ *      merely listed; no extra refs or pins are obtained.
+ *      iov_iter_extract_mode() will return 0.
+ *
+ * Note also:
+ *
+ *  (*) Peer-to-peer DMA (ITER_ALLOW_P2PDMA) is only permitted with user-backed
+ *      iterators.
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page..
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       unsigned int extract_flags,
+			       size_t *offset0)
+{
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (WARN_ON_ONCE(extract_flags & ITER_ALLOW_P2PDMA))
+		return -EIO;
+	if (iov_iter_is_kvec(i))
+		return iov_iter_extract_kvec_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (iov_iter_is_pipe(i))
+		return iov_iter_extract_pipe_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, extract_flags,
+						     offset0);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
  2023-01-20 17:55 ` [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction David Howells
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-20 17:55 ` [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, linux-mm

Provide a helper in the get_user_pages code to drop a pin or a ref on a
page based on being given FOLL_GET or FOLL_PIN in its flags argument or do
nothing if neither is set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 include/linux/mm.h |  3 +++
 mm/gup.c           | 22 ++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..f1cf8f4eb946 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1367,6 +1367,9 @@ static inline bool is_cow_mapping(vm_flags_t flags)
 #define SECTION_IN_PAGE_FLAGS
 #endif
 
+void folio_put_unpin(struct folio *folio, unsigned int flags);
+void page_put_unpin(struct page *page, unsigned int flags);
+
 /*
  * The identification function is mainly used by the buddy allocator for
  * determining if two pages could be buddies. We are not really identifying
diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..3ee4b4c7e0cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -191,6 +191,28 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 		folio_put_refs(folio, refs);
 }
 
+/**
+ * folio_put_unpin - Unpin/put a folio as appropriate
+ * @folio: The folio to release
+ * @flags: gup flags indicating the mode of release (FOLL_*)
+ *
+ * Release a folio according to the flags.  If FOLL_GET is set, the folio has a
+ * ref dropped; if FOLL_PIN is set, it is unpinned; otherwise it is left
+ * unaltered.
+ */
+void folio_put_unpin(struct folio *folio, unsigned int flags)
+{
+	if (flags & (FOLL_GET | FOLL_PIN))
+		gup_put_folio(folio, 1, flags);
+}
+EXPORT_SYMBOL_GPL(folio_put_unpin);
+
+void page_put_unpin(struct page *page, unsigned int flags)
+{
+	folio_put_unpin(page_folio(page), flags);
+}
+EXPORT_SYMBOL_GPL(page_put_unpin);
+
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount
  * @page:    pointer to page to be grabbed


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (2 preceding siblings ...)
  2023-01-20 17:55 ` [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-21 13:04   ` Christoph Hellwig
  2023-01-23  9:38   ` David Howells
  2023-01-20 17:55 ` [PATCH v7 5/8] block: Add BIO_PAGE_PINNED David Howells
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 54+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig

Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning.  In a
following patch I intend to add a BIO_PAGE_PINNED flag to indicate that the
page needs unpinning and this way both flags have the same logic.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344730802.2425628.14034153595667416149.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391054631.2311931.7588488803802952158.stgit@warthog.procyon.org.uk/ # v6
---

Notes:
    ver #5)
     - Split from patch that uses iov_iter_extract_pages().

 block/bio.c               | 9 ++++++++-
 fs/iomap/direct-io.c      | 1 -
 include/linux/bio.h       | 2 +-
 include/linux/blk_types.h | 2 +-
 4 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 6a6c73d14bfd..cfe11f4799d1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,10 @@ static void bio_free(struct bio *bio)
  * Users of this function have their own bio allocation. Subsequently,
  * they must remember to pair any call to bio_init() with bio_uninit()
  * when IO has completed, or when the bio is released.
+ *
+ * We set the initial assumption that pages attached to the bio will be
+ * released with put_page() by setting BIO_PAGE_REFFED; if the pages
+ * should not be put, this flag should be cleared.
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -274,6 +278,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 	bio->bi_integrity = NULL;
 #endif
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	bio->bi_vcnt = 0;
 
 	atomic_set(&bio->__bi_remaining, 1);
@@ -302,6 +307,7 @@ void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
 {
 	bio_uninit(bio);
 	memset(bio, 0, BIO_RESET_BYTES);
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	atomic_set(&bio->__bi_remaining, 1);
 	bio->bi_bdev = bdev;
 	if (bio->bi_bdev)
@@ -812,6 +818,7 @@ EXPORT_SYMBOL(bio_put);
 static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 {
 	bio_set_flag(bio, BIO_CLONED);
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
 
@@ -1198,7 +1205,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
 	bio->bi_iter.bi_bvec_done = iter->iov_offset;
 	bio->bi_iter.bi_size = size;
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
 	bio_set_flag(bio, BIO_CLONED);
 }
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9804714b1751..c0e75900e754 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 22078a28d7cb..63bfd91793f9 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -482,7 +482,7 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+	if (bio_flagged(bio, BIO_PAGE_REFFED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..86711fb0534a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,7 +318,7 @@ struct bio {
  * bio flags
  */
 enum {
-	BIO_NO_PAGE_REF,	/* don't put release vec pages */
+	BIO_PAGE_REFFED,	/* Pages need refs putting (equivalent to FOLL_GET) */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v7 5/8] block: Add BIO_PAGE_PINNED
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (3 preceding siblings ...)
  2023-01-20 17:55 ` [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-21 13:05   ` Christoph Hellwig
  2023-01-20 17:55 ` [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 54+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig

BIO_PAGE_PINNED to indicate that the pages in a bio were pinned (FOLL_PIN)
and that the pin will need removing.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---
 include/linux/blk_types.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86711fb0534a..42b40156c517 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -319,6 +319,7 @@ struct bio {
  */
 enum {
 	BIO_PAGE_REFFED,	/* Pages need refs putting (equivalent to FOLL_GET) */
+	BIO_PAGE_PINNED,	/* Pages need unpinning (equivalent to FOLL_PIN) */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (4 preceding siblings ...)
  2023-01-20 17:55 ` [PATCH v7 5/8] block: Add BIO_PAGE_PINNED David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-21 13:07   ` Christoph Hellwig
  2023-01-23 11:28   ` David Howells
  2023-01-20 17:55 ` [PATCH v7 7/8] block: Fix bio_flagged() so that gcc can better optimise it David Howells
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 54+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig

Convert the block layer's bio code to use iov_iter_extract_pages() instead
of iov_iter_get_pages().  This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the source iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

To implement this:

 (1) If the BIO_PAGE_REFFED flag is set, this causes attached pages to be
     passed to put_page() during cleanup.

 (2) A BIO_PAGE_PINNED flag is provided.  If set, this causes attached
     pages to be passed to unpin_user_page() during cleanup.

 (3) BIO_PAGE_REFFED is set by default and BIO_PAGE_PINNED is cleared by
     default when the bio is (re-)initialised.

 (4) If iov_iter_extract_pages() indicates FOLL_GET, this causes
     BIO_PAGE_REFFED to be set and if FOLL_PIN is indicated, this causes
     BIO_PAGE_PINNED to be set.  If it returns neither FOLL_* flag, then
     both BIO_PAGE_* flags will be cleared.

     Mixing sets of pages with different clean up modes is not supported.

 (5) Cloned bio structs have both flags cleared.

 (6) bio_release_pages() will do the release if either BIO_PAGE_* flag is
     set.

[!] Note that this is tested a bit with ext4, but nothing else.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Link: https://lore.kernel.org/r/167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344731521.2425628.5403113335062567245.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391056047.2311931.6772604381276147664.stgit@warthog.procyon.org.uk/ # v6
---

Notes:
    ver #7)
     - Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
    
    ver #5)
     - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
       BIO_* flags and got rid of bi_cleanup_mode.
     - Replaced BIO_NO_PAGE_REF to BIO_PAGE_REFFED in the preceding patch.

 block/bio.c         | 34 +++++++++++++++++++---------------
 block/blk-map.c     | 22 +++++++++++-----------
 block/blk.h         | 29 +++++++++++++++++++++++++++++
 include/linux/bio.h |  3 ++-
 4 files changed, 61 insertions(+), 27 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index cfe11f4799d1..2a6568b58501 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -245,8 +245,9 @@ static void bio_free(struct bio *bio)
  * when IO has completed, or when the bio is released.
  *
  * We set the initial assumption that pages attached to the bio will be
- * released with put_page() by setting BIO_PAGE_REFFED; if the pages
- * should not be put, this flag should be cleared.
+ * released with put_page() by setting BIO_PAGE_REFFED, but this should be set
+ * to BIO_PAGE_PINNED if the page should be unpinned instead; if the pages
+ * should not be put or unpinned, these flags should be cleared.
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -819,6 +820,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 {
 	bio_set_flag(bio, BIO_CLONED);
 	bio_clear_flag(bio, BIO_PAGE_REFFED);
+	bio_clear_flag(bio, BIO_PAGE_PINNED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
 
@@ -1183,7 +1185,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1220,7 +1222,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1234,7 +1236,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1245,10 +1247,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array.  The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_REFFED and
+ * BIO_PAGE_PINNED flags.  For a multi-segment *iter, this function only adds
+ * pages from the next non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1280,9 +1282,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, extract_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, extract_flags, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1315,7 +1317,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1344,6 +1346,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
 	int ret = 0;
 
+	bio_set_cleanup_mode(bio, iter, 0);
+
 	if (iov_iter_is_bvec(iter)) {
 		bio_iov_bvec_set(bio, iter);
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
@@ -1496,8 +1500,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one put_page() or unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index bc111261fc82..7d1bc75b9cf2 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,20 +282,20 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	if (blk_queue_pci_p2pdma(rq->q))
 		extract_flags |= ITER_ALLOW_P2PDMA;
 
+	bio_set_cleanup_mode(bio, iter, extract_flags);
+
 	while (iov_iter_count(iter)) {
-		struct page **pages, *stack_pages[UIO_FASTIOV];
+		struct page *stack_pages[UIO_FASTIOV];
+		struct page **pages = stack_pages;
 		ssize_t bytes;
 		size_t offs;
 		int npages;
 
-		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
-			pages = stack_pages;
-			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-						   nr_vecs, &offs, extract_flags);
-		} else {
-			bytes = iov_iter_get_pages_alloc(iter, &pages,
-						LONG_MAX, &offs, extract_flags);
-		}
+		if (nr_vecs > ARRAY_SIZE(stack_pages))
+			pages = NULL;
+
+		bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+					       nr_vecs, extract_flags, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -317,7 +317,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						bio_release_page(bio, page);
 					break;
 				}
 
@@ -329,7 +329,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			bio_release_page(bio, pages[j++]);
 		if (pages != stack_pages)
 			kvfree(pages);
 		/* couldn't stuff something into bio? */
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..16c8a7a84a16 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -425,6 +425,35 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+/*
+ * Set the cleanup mode for a bio from an iterator and the extraction flags.
+ */
+static inline void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter,
+					unsigned int extract_flags)
+{
+	unsigned int cleanup_mode;
+
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
+	cleanup_mode = iov_iter_extract_mode(iter, extract_flags);
+	if (cleanup_mode & FOLL_GET)
+		bio_set_flag(bio, BIO_PAGE_REFFED);
+	if (cleanup_mode & FOLL_PIN)
+		bio_set_flag(bio, BIO_PAGE_PINNED);
+}
+
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+	unsigned int gup_flags = 0;
+
+	gup_flags |= bio_flagged(bio, BIO_PAGE_REFFED) ? FOLL_GET : 0;
+	gup_flags |= bio_flagged(bio, BIO_PAGE_PINNED) ? FOLL_PIN : 0;
+	page_put_unpin(page, gup_flags);
+}
+
 struct request_queue *blk_alloc_queue(int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 63bfd91793f9..1c6f051f6ff2 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -482,7 +482,8 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (bio_flagged(bio, BIO_PAGE_REFFED))
+	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+	    bio_flagged(bio, BIO_PAGE_PINNED))
 		__bio_release_pages(bio, mark_dirty);
 }
 


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v7 7/8] block: Fix bio_flagged() so that gcc can better optimise it
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (5 preceding siblings ...)
  2023-01-20 17:55 ` [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 54+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig

Fix bio_flagged() so that multiple instances of it, such as:

	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
	    bio_flagged(bio, BIO_PAGE_PINNED))

can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).

The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero.  This results in an out-of-line bio_release_page()
being compiled to something like:

   <+0>:     mov    0x14(%rdi),%eax
   <+3>:     test   $0x1,%al
   <+5>:     jne    0xffffffff816dac53 <bio_release_pages+11>
   <+7>:     test   $0x2,%al
   <+9>:     je     0xffffffff816dac5c <bio_release_pages+20>
   <+11>:    movzbl %sil,%esi
   <+15>:    jmp    0xffffffff816daba1 <__bio_release_pages>
   <+20>:    jmp    0xffffffff81d0b800 <__x86_return_thunk>

However, the test is superfluous as the return type is bool.  Removing it
results in:

   <+0>:     testb  $0x3,0x14(%rdi)
   <+4>:     je     0xffffffff816e4af4 <bio_release_pages+15>
   <+6>:     movzbl %sil,%esi
   <+10>:    jmp    0xffffffff816dab7c <__bio_release_pages>
   <+15>:    jmp    0xffffffff81d0b7c0 <__x86_return_thunk>

instead.

Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.

Fixes: b7c44ed9d2fc ("block: manipulate bio->bi_flags through helpers")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
---
 include/linux/bio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1c6f051f6ff2..2e6109b0fca8 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -227,7 +227,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
 {
-	return (bio->bi_flags & (1U << bit)) != 0;
+	return bio->bi_flags & (1U << bit);
 }
 
 static inline void bio_set_flag(struct bio *bio, unsigned int bit)


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (6 preceding siblings ...)
  2023-01-20 17:55 ` [PATCH v7 7/8] block: Fix bio_flagged() so that gcc can better optimise it David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-20 18:59   ` Matthew Wilcox
  2023-01-20 19:18   ` David Howells
  2023-01-23 16:31 ` [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) Matthew Wilcox
  2023-01-23 16:38 ` David Howells
  9 siblings, 2 replies; 54+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, linux-mm

Renumber FOLL_GET and FOLL_PIN down to bit 0 and 1 respectively so that
they are coincidentally the same as BIO_PAGE_REFFED and BIO_PAGE_PINNED and
also so that they can be stored in the bottom two bits of a page pointer
(something I'm looking at for zerocopy socket fragments).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 include/linux/mm.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f1cf8f4eb946..33c9eacd9548 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3074,12 +3074,13 @@ static inline vm_fault_t vmf_error(int err)
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 			 unsigned int foll_flags);
 
-#define FOLL_WRITE	0x01	/* check pte is writable */
-#define FOLL_TOUCH	0x02	/* mark page accessed */
-#define FOLL_GET	0x04	/* do get_page on page */
-#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
-#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
+#define FOLL_GET	0x01	/* do get_page on page (equivalent to BIO_FOLL_GET) */
+#define FOLL_PIN	0x02	/* pages must be released via unpin_user_page */
+#define FOLL_WRITE	0x04	/* check pte is writable */
+#define FOLL_TOUCH	0x08	/* mark page accessed */
+#define FOLL_DUMP	0x10	/* give error on hole if it would be zero */
+#define FOLL_FORCE	0x20	/* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT	0x40	/* if a disk transfer is needed, start the IO
 				 * and return without waiting upon it */
 #define FOLL_NOFAULT	0x80	/* do not fault in pages */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
@@ -3088,7 +3089,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
-#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
 #define FOLL_PCI_P2PDMA	0x100000 /* allow returning PCI P2PDMA pages */
 #define FOLL_INTERRUPTIBLE  0x200000 /* allow interrupts from generic signals */


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down
  2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
@ 2023-01-20 18:59   ` Matthew Wilcox
  2023-01-20 19:18   ` David Howells
  1 sibling, 0 replies; 54+ messages in thread
From: Matthew Wilcox @ 2023-01-20 18:59 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, linux-mm

On Fri, Jan 20, 2023 at 05:55:56PM +0000, David Howells wrote:
> Renumber FOLL_GET and FOLL_PIN down to bit 0 and 1 respectively so that
> they are coincidentally the same as BIO_PAGE_REFFED and BIO_PAGE_PINNED and
> also so that they can be stored in the bottom two bits of a page pointer
> (something I'm looking at for zerocopy socket fragments).
> 
> Signed-off-by: David Howells <dhowells@redhat.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Although I was hoping you'd renumber some of the others since we
currently have gaps at 0x200, 0x400, 0x1000, and 0x4000 as well
as the 0x40 you're using here.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down
  2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
  2023-01-20 18:59   ` Matthew Wilcox
@ 2023-01-20 19:18   ` David Howells
  1 sibling, 0 replies; 54+ messages in thread
From: David Howells @ 2023-01-20 19:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Al Viro, Christoph Hellwig, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, linux-mm

Matthew Wilcox <willy@infradead.org> wrote:

> Although I was hoping you'd renumber some of the others since we
> currently have gaps at 0x200, 0x400, 0x1000, and 0x4000 as well
> as the 0x40 you're using here.

Since the other patches don't require this one, I can renumber them all and
send it directly to Andrew.

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction.
  2023-01-20 17:55 ` [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction David Howells
@ 2023-01-21 13:01   ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:01 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Fri, Jan 20, 2023 at 05:55:49PM +0000, David Howells wrote:
> For now only a flag to allow peer-to-peer DMA is allowed; but in future,
> additional flags will be provided to indicate the I/O direction.

Al still doesn't seem to be sold on the latter part, so maybe just
drop this sentence for now.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-01-21 13:01   ` Christoph Hellwig
  2023-01-21 13:10   ` Christoph Hellwig
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:01 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

> +#define iov_iter_extract_mode(iter, extract_flags) \
> +	(user_backed_iter(iter) ?				\
> +	 (iter->data_source == ITER_SOURCE) ?			\
> +	 FOLL_GET : FOLL_PIN : 0)

The extract_flags argument is unused here.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-20 17:55 ` [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
@ 2023-01-21 13:04   ` Christoph Hellwig
  2023-01-23  9:38   ` David Howells
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:04 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig

On Fri, Jan 20, 2023 at 05:55:52PM +0000, David Howells wrote:
> Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning.  In a
> following patch I intend to add a BIO_PAGE_PINNED flag to indicate that the
> page needs unpinning and this way both flags have the same logic.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

As pointed out last time, we really should not set the flag by default.
When a bio is allocated there are no pages to be released, that only
happens when we add dio-like pages to the bio.  Instead of explaining
why I mean again, I've put together a version of this series that
implements this and my other suggestions here:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dio-pin-pages

This also tests fine with xfs and btrfs and nvme passthrough I/O.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 5/8] block: Add BIO_PAGE_PINNED
  2023-01-20 17:55 ` [PATCH v7 5/8] block: Add BIO_PAGE_PINNED David Howells
@ 2023-01-21 13:05   ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:05 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig

On Fri, Jan 20, 2023 at 05:55:53PM +0000, David Howells wrote:
> BIO_PAGE_PINNED to indicate that the pages in a bio were pinned (FOLL_PIN)
> and that the pin will need removing.

Just adding the flag without the infrastructure and users is a bit
silly.  See the link to my branch to see what I think is better split.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-20 17:55 ` [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-21 13:07   ` Christoph Hellwig
  2023-01-23 11:28   ` David Howells
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:07 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig

This really does three things:
 
 1) add the BIO_PAGE_PINNED infrastructure
 2) convert bio_iov_iter_get_pages to use iov_iter_extract_pages
 3) convert bio_map_user_iov to use iov_iter_extract_pages

so it should be three pages, with the first one also containing
what is in your previous patch.

> @@ -1183,7 +1185,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
>  	bio_for_each_segment_all(bvec, bio, iter_all) {
>  		if (mark_dirty && !PageCompound(bvec->bv_page))
>  			set_page_dirty_lock(bvec->bv_page);
> -		put_page(bvec->bv_page);
> +		bio_release_page(bio, bvec->bv_page);

This can be micro-optimized a bit by only doing the flags conversion
once.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
  2023-01-21 13:01   ` Christoph Hellwig
@ 2023-01-21 13:10   ` Christoph Hellwig
  2023-01-21 13:30   ` David Howells
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:10 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On Fri, Jan 20, 2023 at 05:55:50PM +0000, David Howells wrote:
>  (3) Any other sort of iterator.
> 
>      No refs or pins are obtained on the page, the assumption is made that
>      the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
>      permitted.

The ITER_ALLOW_P2PDMA check here is fundamentally wrong and will
break things like io_uring to fixed buffers on NVMe.  ITER_ALLOW_P2PDMA
should strictly increase the group of acceptable iters, it does must
not restrict anything.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
  2023-01-21 13:01   ` Christoph Hellwig
  2023-01-21 13:10   ` Christoph Hellwig
@ 2023-01-21 13:30   ` David Howells
  2023-01-21 13:33     ` Christoph Hellwig
  2023-01-23 11:28   ` David Hildenbrand
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 54+ messages in thread
From: David Howells @ 2023-01-21 13:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Jan 20, 2023 at 05:55:50PM +0000, David Howells wrote:
> >  (3) Any other sort of iterator.
> > 
> >      No refs or pins are obtained on the page, the assumption is made that
> >      the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
> >      permitted.
> 
> The ITER_ALLOW_P2PDMA check here is fundamentally wrong and will
> break things like io_uring to fixed buffers on NVMe.  ITER_ALLOW_P2PDMA
> should strictly increase the group of acceptable iters, it does must
> not restrict anything.

So just drop the check?  Or do I actually need to do something to the pages to
make it work?  If I understand the code correctly, it's not actually operable
right now on any page that doesn't have an appropriately marked PTE.

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-21 13:30   ` David Howells
@ 2023-01-21 13:33     ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:33 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On Sat, Jan 21, 2023 at 01:30:47PM +0000, David Howells wrote:
> So just drop the check?  Or do I actually need to do something to the pages to
> make it work?  If I understand the code correctly, it's not actually operable
> right now on any page that doesn't have an appropriately marked PTE.

Yes, just drop the check. The flag just signals that the caller can
deal with p2p pages.  If it gets any or not is a totally different
story.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-20 17:55 ` [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
  2023-01-21 13:04   ` Christoph Hellwig
@ 2023-01-23  9:38   ` David Howells
  2023-01-23  9:56     ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: David Howells @ 2023-01-23  9:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig

Christoph Hellwig <hch@infradead.org> wrote:

> ... I've put together a version of this series that implements this and my
> other suggestions here:
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dio-pin-pages
> 
> This also tests fine with xfs and btrfs and nvme passthrough I/O.

That looks okay.  Do you need to put a Co-developed-by tag on the "block:
invert BIO_NO_PAGE_REF" patch?

Should I take that set of patches and send it on to Linus when the window
opens?  Or should it go through the block tree?

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-23  9:38   ` David Howells
@ 2023-01-23  9:56     ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-23  9:56 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig

On Mon, Jan 23, 2023 at 09:38:45AM +0000, David Howells wrote:
> That looks okay.  Do you need to put a Co-developed-by tag on the "block:
> invert BIO_NO_PAGE_REF" patch?

I can do if you want, although there's really not much left of the
original.

> Should I take that set of patches and send it on to Linus when the window
> opens?  Or should it go through the block tree?

I think the block tree would be better.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
                     ` (2 preceding siblings ...)
  2023-01-21 13:30   ` David Howells
@ 2023-01-23 11:28   ` David Hildenbrand
  2023-01-23 11:51   ` David Howells
  2023-01-23 12:00   ` David Howells
  5 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2023-01-23 11:28 UTC (permalink / raw)
  To: David Howells, Al Viro, Christoph Hellwig
  Cc: Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, John Hubbard, linux-mm

On 20.01.23 18:55, David Howells wrote:
> Add a function, iov_iter_extract_pages(), to extract a list of pages from
> an iterator.  The pages may be returned with a reference added or a pin
> added or neither, depending on the type of iterator and the direction of
> transfer.  The caller must pass FOLL_READ_FROM_MEM or FOLL_WRITE_TO_MEM
> as part of gup_flags to indicate how the iterator contents are to be used.
> 
> Add a second function, iov_iter_extract_mode(), to determine how the
> cleanup should be done.
> 
> There are three cases:
> 
>   (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.
> 
>       Extracted pages will have pins obtained on them (but not references)
>       so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
>       progress.
> 
>       iov_iter_extract_mode() will return FOLL_PIN for this case.  The
>       caller should use something like unpin_user_page() to dispose of the
>       page.
> 
>   (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.
> 
>       Extracted pages will have references obtained on them, but not pins.
> 
>       iov_iter_extract_mode() will return FOLL_GET.  The caller should use
>       something like put_page() for page disposal.
> 
>   (3) Any other sort of iterator.
> 
>       No refs or pins are obtained on the page, the assumption is made that
>       the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
>       permitted.
> 
>       iov_iter_extract_mode() will return 0.  The pages don't need
>       additional disposal.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Christoph Hellwig <hch@lst.de>
> cc: John Hubbard <jhubbard@nvidia.com>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
> Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
> Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
> Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5
> Link: https://lore.kernel.org/r/167391053207.2311931.16398133457201442907.stgit@warthog.procyon.org.uk/ # v6
> ---
> 
> Notes:
>      ver #7)
>       - Switch to passing in iter-specific flags rather than FOLL_* flags.
>       - Drop the direction flags for now.
>       - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
>       - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
>       - Add support for extraction from KVEC-type iters.
>       - Use iov_iter_advance() rather than open-coding it.
>       - Make BVEC- and KVEC-type skip over initial empty vectors.
>      
>      ver #6)
>       - Add back the function to indicate the cleanup mode.
>       - Drop the cleanup_mode return arg to iov_iter_extract_pages().
>       - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
>         data_source.
>      
>      ver #4)
>       - Use ITER_SOURCE/DEST instead of WRITE/READ.
>       - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
>      
>      ver #3)
>       - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
>         to get/pin_user_pages_fast()[1].
> 
>   include/linux/uio.h |  28 +++
>   lib/iov_iter.c      | 424 ++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 452 insertions(+)
> 
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 46d5080314c6..a4233049ab7a 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -363,4 +363,32 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
>   /* Flags for iov_iter_get/extract_pages*() */
>   #define ITER_ALLOW_P2PDMA	0x01	/* Allow P2PDMA on the extracted pages */
>   
> +ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
> +			       size_t maxsize, unsigned int maxpages,
> +			       unsigned int extract_flags, size_t *offset0);
> +
> +/**
> + * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
> + * @iter: The iterator
> + * @extract_flags: How the iterator is to be used
> + *
> + * Examine the iterator and @extract_flags and indicate by returning FOLL_PIN,
> + * FOLL_GET or 0 as to how, if at all, pages extracted from the iterator will
> + * be retained by the extraction function.
> + *
> + * FOLL_GET indicates that the pages will have a reference taken on them that
> + * the caller must put.  This can be done for DMA/async DIO write from a page.
> + *
> + * FOLL_PIN indicates that the pages will have a pin placed in them that the
> + * caller must unpin.  This is must be done for DMA/async DIO read to a page to
> + * avoid CoW problems in fork.
> + *
> + * 0 indicates that no measures are taken and that it's up to the caller to
> + * retain the pages.
> + */
> +#define iov_iter_extract_mode(iter, extract_flags) \
> +	(user_backed_iter(iter) ?				\
> +	 (iter->data_source == ITER_SOURCE) ?			\
> +	 FOLL_GET : FOLL_PIN : 0)
> +
>

How does this work align with the goal of no longer using FOLL_GET for 
O_DIRECT? We should get rid of any FOLL_GET usage for accessing page 
content.

@John, any comments?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-20 17:55 ` [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
  2023-01-21 13:07   ` Christoph Hellwig
@ 2023-01-23 11:28   ` David Howells
  2023-01-23 14:49     ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: David Howells @ 2023-01-23 11:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig

In this function:

	void __bio_release_pages(struct bio *bio, bool mark_dirty)
	{
		unsigned int gup_flags = bio_to_gup_flags(bio);
		struct bvec_iter_all iter_all;
		struct bio_vec *bvec;

		bio_for_each_segment_all(bvec, bio, iter_all) {
			if (mark_dirty && !PageCompound(bvec->bv_page))
				set_page_dirty_lock(bvec->bv_page);
	>>>>		page_put_unpin(bvec->bv_page, gup_flags);
		}
	}

that ought to be a call to bio_release_page(), but the optimiser doesn't want
to inline it:-/

I found the only way I can get the compiler to properly inline it without it
repeating the calculations is to renumber the FOLL_* constants down and then
make bio_release_page() something like:

	static inline __attribute__((always_inline))
	void bio_release_page(struct bio *bio, struct page *page)
	{
		page_put_unpin(page,
		    ((bio->bi_flags & (1 << BIO_PAGE_REFFED)) ? FOLL_GET : 0) |
		    ((bio->bi_flags & (1 << BIO_PAGE_PINNED)) ? FOLL_PIN : 0));
	}

I guess the compiler optimiser isn't perfect yet:-)

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
                     ` (3 preceding siblings ...)
  2023-01-23 11:28   ` David Hildenbrand
@ 2023-01-23 11:51   ` David Howells
  2023-01-23 13:11     ` David Hildenbrand
  2023-01-23 13:19     ` David Howells
  2023-01-23 12:00   ` David Howells
  5 siblings, 2 replies; 54+ messages in thread
From: David Howells @ 2023-01-23 11:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

David Hildenbrand <david@redhat.com> wrote:

> How does this work align with the goal of no longer using FOLL_GET for
> O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.

Would that run the risk of changes being made by the child being visible to
the a DIO write if the parent changes the buffer first?


	PARENT			CHILD
	======			=====
	start-DIO-write
	fork() = pid		fork() = 0
	alter-buffer
	CoW happens
	page copied		original page retained
				alter-buffer
		<DMA-happens>

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
                     ` (4 preceding siblings ...)
  2023-01-23 11:51   ` David Howells
@ 2023-01-23 12:00   ` David Howells
  5 siblings, 0 replies; 54+ messages in thread
From: David Howells @ 2023-01-23 12:00 UTC (permalink / raw)
  Cc: dhowells, David Hildenbrand, Al Viro, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, John Hubbard, linux-mm

David Howells <dhowells@redhat.com> wrote:

> > How does this work align with the goal of no longer using FOLL_GET for
> > O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.
> 
> Would that run the risk of changes being made by the child being visible to
> the a DIO write if the parent changes the buffer first?
> 
> 
> 	PARENT			CHILD
> 	======			=====
> 	start-DIO-write
> 	fork() = pid		fork() = 0
> 	alter-buffer
> 	CoW happens
> 	page copied		original page retained
> 				alter-buffer
> 		<DMA-happens>

Ah, I think I might have got the wrong end of the stick.  A pinned page is
*always* copied on fork() if I understand copy_present_pte() correctly.

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 11:51   ` David Howells
@ 2023-01-23 13:11     ` David Hildenbrand
  2023-01-23 13:19     ` David Howells
  1 sibling, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2023-01-23 13:11 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On 23.01.23 12:51, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> How does this work align with the goal of no longer using FOLL_GET for
>> O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.
> 
> Would that run the risk of changes being made by the child being visible to
> the a DIO write if the parent changes the buffer first?
> 
> 
> 	PARENT			CHILD
> 	======			=====
> 	start-DIO-write
> 	fork() = pid		fork() = 0
> 	alter-buffer
> 	CoW happens
> 	page copied		original page retained
> 				alter-buffer
> 		<DMA-happens>

FOLL_PIN users are fine in that regard, because we properly detect 
"maybe pinned" during fork() and copy the page. See 
tools/testing/selftests/mm/cow.c (still called 
tools/testing/selftests/vm/cow.c upstream IIRC) for some test cases for 
that handling.

FOLL_GET does not work as expected in that regard: pages can't be 
detected as pinned and we won't be copying them during fork(). We'll end 
up COW-sharing them, which can result in trouble later.

Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure 
what the status is. Interestingly, 
Documentation/core-api/pin_user_pages.rst already documents that "CASE 
1: Direct IO (DIO)" uses FOLL_PIN ... which does, unfortunately, no 
reflect reality yet.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 11:51   ` David Howells
  2023-01-23 13:11     ` David Hildenbrand
@ 2023-01-23 13:19     ` David Howells
  2023-01-23 13:24       ` David Hildenbrand
  2023-01-23 13:38       ` David Howells
  1 sibling, 2 replies; 54+ messages in thread
From: David Howells @ 2023-01-23 13:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

David Hildenbrand <david@redhat.com> wrote:

> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
> unfortunately, no reflect reality yet.

Yeah - I just came across that.

Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
get_user_pages()?  In which case my patches only need keep track of
pinned/not-pinned and never "got".

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:19     ` David Howells
@ 2023-01-23 13:24       ` David Hildenbrand
  2023-01-23 19:56         ` John Hubbard
  2023-01-26 22:15         ` Al Viro
  2023-01-23 13:38       ` David Howells
  1 sibling, 2 replies; 54+ messages in thread
From: David Hildenbrand @ 2023-01-23 13:24 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On 23.01.23 14:19, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>> unfortunately, no reflect reality yet.
> 
> Yeah - I just came across that.
> 
> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
> get_user_pages()?  In which case my patches only need keep track of
> pinned/not-pinned and never "got".

That would be the ideal case: whenever intending to access page content, 
use FOLL_PIN instead of FOLL_GET.

The issue that John was trying to sort out was that there are plenty of 
callsites that do a simple put_page() instead of calling 
unpin_user_page(). IIRC, handling that correctly in existing code -- 
what was pinned must be released via unpin_user_page() -- was the 
biggest workitem.

Not sure how that relates to your work here (that's why I was asking): 
if you could avoid FOLL_GET, that would be great :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:19     ` David Howells
  2023-01-23 13:24       ` David Hildenbrand
@ 2023-01-23 13:38       ` David Howells
  2023-01-23 14:20         ` David Hildenbrand
  2023-01-23 16:11         ` Jan Kara
  1 sibling, 2 replies; 54+ messages in thread
From: David Howells @ 2023-01-23 13:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

David Hildenbrand <david@redhat.com> wrote:

> That would be the ideal case: whenever intending to access page content, use
> FOLL_PIN instead of FOLL_GET.
> 
> The issue that John was trying to sort out was that there are plenty of
> callsites that do a simple put_page() instead of calling
> unpin_user_page(). IIRC, handling that correctly in existing code -- what was
> pinned must be released via unpin_user_page() -- was the biggest workitem.
> 
> Not sure how that relates to your work here (that's why I was asking): if you
> could avoid FOLL_GET, that would be great :)

Well, it simplifies things a bit.

I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
do no ref-getting at all.  Things can be converted over to "unpin the pages or
doing nothing" as they're converted over to using iov_iter_extract_pages()
from iov_iter_get_pages*().

The block bio code then only needs a single bit of state: pinned or not
pinned.

For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
a special cleanup?

sk_buff fragment handling could still be tricky.  I'm thinking that in that
code I'll need to store FOLL_GET/PIN in the bottom two bits of the frag page
pointer.  Sometimes it allocates a new page and attaches it (have ref);
sometimes it does zerocopy to/from a page (have pin) and sometimes it may be
pointing to a kernel buffer (don't pin or ref).

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:38       ` David Howells
@ 2023-01-23 14:20         ` David Hildenbrand
  2023-01-23 14:48           ` Christoph Hellwig
  2023-01-23 16:11         ` Jan Kara
  1 sibling, 1 reply; 54+ messages in thread
From: David Hildenbrand @ 2023-01-23 14:20 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On 23.01.23 14:38, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> That would be the ideal case: whenever intending to access page content, use
>> FOLL_PIN instead of FOLL_GET.
>>
>> The issue that John was trying to sort out was that there are plenty of
>> callsites that do a simple put_page() instead of calling
>> unpin_user_page(). IIRC, handling that correctly in existing code -- what was
>> pinned must be released via unpin_user_page() -- was the biggest workitem.
>>
>> Not sure how that relates to your work here (that's why I was asking): if you
>> could avoid FOLL_GET, that would be great :)
> 
> Well, it simplifies things a bit.
> 
> I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
> do no ref-getting at all.  Things can be converted over to "unpin the pages or
> doing nothing" as they're converted over to using iov_iter_extract_pages()
> from iov_iter_get_pages*().
> 
> The block bio code then only needs a single bit of state: pinned or not
> pinned.

Unfortunately, I'll have to let BIO experts comment on that :) I only 
know the MM side of things here.

> 
> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
> a special cleanup?

Anything that holds pins "possibly forever" should that. vmsplice() is 
another example that should use it, once properly using FOLL_PIN. 
[FOLL_GET | FOLL_LONGTERM is not really used/defined with semantics]

> 
> sk_buff fragment handling could still be tricky.  I'm thinking that in that
> code I'll need to store FOLL_GET/PIN in the bottom two bits of the frag page
> pointer.  Sometimes it allocates a new page and attaches it (have ref);
> sometimes it does zerocopy to/from a page (have pin) and sometimes it may be
> pointing to a kernel buffer (don't pin or ref).
> 
> David
> 

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 14:20         ` David Hildenbrand
@ 2023-01-23 14:48           ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-23 14:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Howells, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Christoph Hellwig,
	John Hubbard, linux-mm

On Mon, Jan 23, 2023 at 03:20:41PM +0100, David Hildenbrand wrote:
> > The block bio code then only needs a single bit of state: pinned or not
> > pinned.
> 
> Unfortunately, I'll have to let BIO experts comment on that :) I only know
> the MM side of things here.

It can work for bios.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-23 11:28   ` David Howells
@ 2023-01-23 14:49     ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-23 14:49 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig

On Mon, Jan 23, 2023 at 11:28:40AM +0000, David Howells wrote:
> 	void __bio_release_pages(struct bio *bio, bool mark_dirty)
> 	{
> 		unsigned int gup_flags = bio_to_gup_flags(bio);
> 		struct bvec_iter_all iter_all;
> 		struct bio_vec *bvec;
> 
> 		bio_for_each_segment_all(bvec, bio, iter_all) {
> 			if (mark_dirty && !PageCompound(bvec->bv_page))
> 				set_page_dirty_lock(bvec->bv_page);
> 	>>>>		page_put_unpin(bvec->bv_page, gup_flags);
> 		}
> 	}
> 
> that ought to be a call to bio_release_page(), but the optimiser doesn't want
> to inline it:-/

Why?  __bio_release_pages is the fast path, no need to force using
bio_relese_page which is otherwise only used for error cleanup.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:38       ` David Howells
  2023-01-23 14:20         ` David Hildenbrand
@ 2023-01-23 16:11         ` Jan Kara
  2023-01-23 16:17           ` Christoph Hellwig
  2023-01-23 23:07           ` John Hubbard
  1 sibling, 2 replies; 54+ messages in thread
From: Jan Kara @ 2023-01-23 16:11 UTC (permalink / raw)
  To: David Howells
  Cc: David Hildenbrand, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Christoph Hellwig,
	John Hubbard, linux-mm

On Mon 23-01-23 13:38:31, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
> > That would be the ideal case: whenever intending to access page content, use
> > FOLL_PIN instead of FOLL_GET.
> > 
> > The issue that John was trying to sort out was that there are plenty of
> > callsites that do a simple put_page() instead of calling
> > unpin_user_page(). IIRC, handling that correctly in existing code -- what was
> > pinned must be released via unpin_user_page() -- was the biggest workitem.
> > 
> > Not sure how that relates to your work here (that's why I was asking): if you
> > could avoid FOLL_GET, that would be great :)
> 
> Well, it simplifies things a bit.
> 
> I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
> do no ref-getting at all.  Things can be converted over to "unpin the pages or
> doing nothing" as they're converted over to using iov_iter_extract_pages()
> from iov_iter_get_pages*().
> 
> The block bio code then only needs a single bit of state: pinned or not
> pinned.

I'm all for using only pin/unpin in the end. But you'd still need to handle
the 'put' for the intermediate time when there are still FOLL_GET users of
the bio infrastructure, wouldn't you?

> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
> a special cleanup?

FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
whenever there isn't reasonably bound time after which the page is
unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
long the RDMA is going to be running it should be using FOLL_LONGTERM. The
thing is that pins can block e.g. truncate for DAX inodes and so longterm
pins are not supported for DAX backed pages.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 16:11         ` Jan Kara
@ 2023-01-23 16:17           ` Christoph Hellwig
  2023-01-23 23:07           ` John Hubbard
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-23 16:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Howells, David Hildenbrand, Al Viro, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Christoph Hellwig,
	John Hubbard, linux-mm

On Mon, Jan 23, 2023 at 05:11:14PM +0100, Jan Kara wrote:
> I'm all for using only pin/unpin in the end. But you'd still need to handle
> the 'put' for the intermediate time when there are still FOLL_GET users of
> the bio infrastructure, wouldn't you?

Yes, we need it for the transition.  But Dave already has a patch to
convert the legacy direct-io code as well, at which point we can
the retire the get bit.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (7 preceding siblings ...)
  2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
@ 2023-01-23 16:31 ` Matthew Wilcox
  2023-01-23 16:42   ` Jan Kara
  2023-01-23 16:38 ` David Howells
  9 siblings, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-01-23 16:31 UTC (permalink / raw)
  To: David Howells
  Cc: John Hubbard, Al Viro, Christoph Hellwig, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
>  (3) Make the bio struct carry a pair of flags to indicate the cleanup
>      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
>      FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
>      added.

I think there's a simpler solution than all of this.

As I understand the fundamental problem here, the question is
when to copy a page on fork.  We have the optimisation of COW, but
O_DIRECT/RDMA/... breaks it.  So all this page pinning is to indicate
to the fork code "You can't do COW to this page".

Why do we want to track that information on a per-page basis?  Wouldn't it
be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
that VMA will now never be COWed, regardless of their refcount/mapcount.
And the whole "did we pin or get this page" problem goes away.  Along
with folio->pincount.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (8 preceding siblings ...)
  2023-01-23 16:31 ` [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) Matthew Wilcox
@ 2023-01-23 16:38 ` David Howells
  2023-01-23 16:42   ` Matthew Wilcox
  2023-01-23 17:19   ` David Howells
  9 siblings, 2 replies; 54+ messages in thread
From: David Howells @ 2023-01-23 16:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, John Hubbard, Al Viro, Christoph Hellwig, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> Why do we want to track that information on a per-page basis?  Wouldn't it
> be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> that VMA will now never be COWed, regardless of their refcount/mapcount.
> And the whole "did we pin or get this page" problem goes away.  Along
> with folio->pincount.

Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
if they did a single DIO to/from it.

Also you only mention DIO read - but what about "start DIO write; fork(); touch
buffer" in the parent - now the write buffer belongs to the child and they can
affect the parent's write.

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-23 16:31 ` [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) Matthew Wilcox
@ 2023-01-23 16:42   ` Jan Kara
  2023-01-23 17:33     ` Matthew Wilcox
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2023-01-23 16:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Howells, John Hubbard, Al Viro, Christoph Hellwig,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> >  (3) Make the bio struct carry a pair of flags to indicate the cleanup
> >      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> >      FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
> >      added.
> 
> I think there's a simpler solution than all of this.
> 
> As I understand the fundamental problem here, the question is
> when to copy a page on fork.  We have the optimisation of COW, but
> O_DIRECT/RDMA/... breaks it.  So all this page pinning is to indicate
> to the fork code "You can't do COW to this page".
> 
> Why do we want to track that information on a per-page basis?  Wouldn't it
> be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> that VMA will now never be COWed, regardless of their refcount/mapcount.
> And the whole "did we pin or get this page" problem goes away.  Along
> with folio->pincount.

Well, but anon COW code is not the only (planned) consumer of the pincount.
Filesystems also need to know whether a (shared pagecache) page is pinned
and can thus be modified behind their backs. And for that VMA tracking
isn't really an option.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-23 16:38 ` David Howells
@ 2023-01-23 16:42   ` Matthew Wilcox
  2023-01-23 17:25     ` Jan Kara
  2023-01-23 17:19   ` David Howells
  1 sibling, 1 reply; 54+ messages in thread
From: Matthew Wilcox @ 2023-01-23 16:42 UTC (permalink / raw)
  To: David Howells
  Cc: John Hubbard, Al Viro, Christoph Hellwig, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > Why do we want to track that information on a per-page basis?  Wouldn't it
> > be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> > time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > And the whole "did we pin or get this page" problem goes away.  Along
> > with folio->pincount.
> 
> Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
> if they did a single DIO to/from it.

Yes.  Would that be an actual problem for any real application?

We could do this with a vm_pincount if it's essential to be able to
count how often it's happened and be able to fork() without COW if it's
something that happened in the past and is now not happening.

> Also you only mention DIO read - but what about "start DIO write; fork(); touch
> buffer" in the parent - now the write buffer belongs to the child and they can
> affect the parent's write.

I'm struggling to see the problem here.  If the child hasn't exec'd, the
parent and child are still in the same security domain.  The parent
could have modified the buffer before calling fork().

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-23 16:38 ` David Howells
  2023-01-23 16:42   ` Matthew Wilcox
@ 2023-01-23 17:19   ` David Howells
  2023-01-23 18:04     ` Matthew Wilcox
  1 sibling, 1 reply; 54+ messages in thread
From: David Howells @ 2023-01-23 17:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, John Hubbard, Al Viro, Christoph Hellwig, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Matthew Wilcox <willy@infradead.org> wrote:

> > Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
> > if they did a single DIO to/from it.
> 
> Yes.  Would that be an actual problem for any real application?

Without auditing all applications that do direct I/O writes, it's hard to
say - but a big database engine, Oracle for example, forking off a process,
say, could cause a massive slow down as fork suddenly has to copy a huge
amount of malloc'd data unnecessarily[*].

[*] I'm making wild assumptions about how Oracle's DB engine works.

> > Also you only mention DIO read - but what about "start DIO write; fork();
> > touch buffer" in the parent - now the write buffer belongs to the child
> > and they can affect the parent's write.
> 
> I'm struggling to see the problem here.  If the child hasn't exec'd, the
> parent and child are still in the same security domain.  The parent
> could have modified the buffer before calling fork().

It could still inadvertently change the data its parent set to write out.  The
child *shouldn't* be able to change the parent's in-progress write.  The most
obvious problem would be in something that does DIO from a stack buffer, I
think.

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-23 16:42   ` Matthew Wilcox
@ 2023-01-23 17:25     ` Jan Kara
  2023-01-24 10:24       ` David Hildenbrand
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2023-01-23 17:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: David Howells, John Hubbard, Al Viro, Christoph Hellwig,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

On Mon 23-01-23 16:42:56, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
> > Matthew Wilcox <willy@infradead.org> wrote:
> > Also you only mention DIO read - but what about "start DIO write; fork(); touch
> > buffer" in the parent - now the write buffer belongs to the child and they can
> > affect the parent's write.
> 
> I'm struggling to see the problem here.  If the child hasn't exec'd, the
> parent and child are still in the same security domain.  The parent
> could have modified the buffer before calling fork().

Sadly they are not. Android in particular starts applications by forking
one big binary (zygote) that has multiple apps linked together and relies
on the fact the child cannot influence the parent after the fork. We've
already had CVEs with GUP & COW & fork due to this. David Hildebrand has a
lot of memories regarding this I believe ;)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-23 16:42   ` Jan Kara
@ 2023-01-23 17:33     ` Matthew Wilcox
  2023-01-23 22:53       ` John Hubbard
  2023-01-24 10:29       ` Jan Kara
  0 siblings, 2 replies; 54+ messages in thread
From: Matthew Wilcox @ 2023-01-23 17:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Howells, John Hubbard, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

On Mon, Jan 23, 2023 at 05:42:18PM +0100, Jan Kara wrote:
> On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> > On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > >  (3) Make the bio struct carry a pair of flags to indicate the cleanup
> > >      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> > >      FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
> > >      added.
> > 
> > I think there's a simpler solution than all of this.
> > 
> > As I understand the fundamental problem here, the question is
> > when to copy a page on fork.  We have the optimisation of COW, but
> > O_DIRECT/RDMA/... breaks it.  So all this page pinning is to indicate
> > to the fork code "You can't do COW to this page".
> > 
> > Why do we want to track that information on a per-page basis?  Wouldn't it
> > be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> > time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > And the whole "did we pin or get this page" problem goes away.  Along
> > with folio->pincount.
> 
> Well, but anon COW code is not the only (planned) consumer of the pincount.
> Filesystems also need to know whether a (shared pagecache) page is pinned
> and can thus be modified behind their backs. And for that VMA tracking
> isn't really an option.

Bleh, I'd forgotten about that problem.  We really do need to keep
track of which pages are under I/O for this case, because we need to
tell the filesystem that they are now available for writeback.

That said, I don't know that we need to keep track of it in the
pages themselves.  Can't we have something similar to rmap which
keeps track of a range of pinned pages, and have it taken care of
at a higher level (ie unpin the pages in the dio_iodone_t rather
than in the BIO completion handler)?

I'm not even sure why pinned pagecache pages remain on the LRU.
They should probably go onto the unevictable list with the mlocked
pages, then on unpin get marked dirty and placed on the active list.
There's no point in writing back a pinned page since it can be
written to at any instant without any part of the kernel knowing.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-23 17:19   ` David Howells
@ 2023-01-23 18:04     ` Matthew Wilcox
  0 siblings, 0 replies; 54+ messages in thread
From: Matthew Wilcox @ 2023-01-23 18:04 UTC (permalink / raw)
  To: David Howells
  Cc: John Hubbard, Al Viro, Christoph Hellwig, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Mon, Jan 23, 2023 at 05:19:51PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > > Wouldn't that potentially make someone's entire malloc() heap entirely NOCOW
> > > if they did a single DIO to/from it.
> > 
> > Yes.  Would that be an actual problem for any real application?
> 
> Without auditing all applications that do direct I/O writes, it's hard to
> say - but a big database engine, Oracle for example, forking off a process,
> say, could cause a massive slow down as fork suddenly has to copy a huge
> amount of malloc'd data unnecessarily[*].
> 
> [*] I'm making wild assumptions about how Oracle's DB engine works.

Yes.  The cache is shared between all Oracle processes, so it's not COWed.
Indeed (as the mshare patches show), what Oracle wants is _more_ sharing
between the processes, not _less_.

> > > Also you only mention DIO read - but what about "start DIO write; fork();
> > > touch buffer" in the parent - now the write buffer belongs to the child
> > > and they can affect the parent's write.
> > 
> > I'm struggling to see the problem here.  If the child hasn't exec'd, the
> > parent and child are still in the same security domain.  The parent
> > could have modified the buffer before calling fork().
> 
> It could still inadvertently change the data its parent set to write out.  The
> child *shouldn't* be able to change the parent's in-progress write.  The most
> obvious problem would be in something that does DIO from a stack buffer, I
> think.

If it's a problem then O_DIRECT writes can also set the NOCOW flag.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:24       ` David Hildenbrand
@ 2023-01-23 19:56         ` John Hubbard
  2023-01-26 22:15         ` Al Viro
  1 sibling, 0 replies; 54+ messages in thread
From: John Hubbard @ 2023-01-23 19:56 UTC (permalink / raw)
  To: David Hildenbrand, David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, linux-mm

On 1/23/23 05:24, David Hildenbrand wrote:
> On 23.01.23 14:19, David Howells wrote:
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>>> unfortunately, no reflect reality yet.
>>

Yes, that part of the documentation is...aspirational. :) But this
series is taking us there, so good. Let me go review v8 of the series in
actual detail to verify but it sounds very promising.

>> Yeah - I just came across that.
>>
>> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
>> get_user_pages()?  In which case my patches only need keep track of
>> pinned/not-pinned and never "got".
> 
> That would be the ideal case: whenever intending to access page content, use FOLL_PIN instead of FOLL_GET.
> 
> The issue that John was trying to sort out was that there are plenty of callsites that do a simple put_page() instead of calling unpin_user_page(). IIRC, handling that correctly in existing code -- what was pinned must be released via unpin_user_page() -- was the biggest workitem.
> 
> Not sure how that relates to your work here (that's why I was asking): if you could avoid FOLL_GET, that would be great :)
> 

The largest part of this problem has been: __iov_iter_get_pages_alloc()
calls get_user_pages_fast() (so, FOLL_GET), as part of the Direct IO
path. And that __iov_iter_get_pages_alloc() is also called by a wide
variety of things that are not Direct IO: networking, crytpo, RDS.

So splitting out a variant that only Direct IO uses is a great move and
should allow conversion of Direct IO to FOLL_PIN. Again, let me go do an
actual review to check on that.

thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-23 17:33     ` Matthew Wilcox
@ 2023-01-23 22:53       ` John Hubbard
  2023-01-24 10:29       ` Jan Kara
  1 sibling, 0 replies; 54+ messages in thread
From: John Hubbard @ 2023-01-23 22:53 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On 1/23/23 09:33, Matthew Wilcox wrote:
...
> Bleh, I'd forgotten about that problem.  We really do need to keep
> track of which pages are under I/O for this case, because we need to
> tell the filesystem that they are now available for writeback.
> 
> That said, I don't know that we need to keep track of it in the
> pages themselves.  Can't we have something similar to rmap which
> keeps track of a range of pinned pages, and have it taken care of
> at a higher level (ie unpin the pages in the dio_iodone_t rather
> than in the BIO completion handler)?
> 
> I'm not even sure why pinned pagecache pages remain on the LRU.
> They should probably go onto the unevictable list with the mlocked

This is an intriguing idea, but...

> pages, then on unpin get marked dirty and placed on the active list.
> There's no point in writing back a pinned page since it can be
> written to at any instant without any part of the kernel knowing.
> 

There have been filesystems discussions about this: if a page goes
unwritten for "too long", it's not good. To address that, bounce buffers
were proposed for periodic writeback of pinned pages. The idea with
bounce buffers is: even though the page is never guaranteed up to date
(because DMA can instantly make it effectively dirty), it is at least
much less out of date after a periodic writeback, then it was before.

And that is important for some file system use cases.


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 16:11         ` Jan Kara
  2023-01-23 16:17           ` Christoph Hellwig
@ 2023-01-23 23:07           ` John Hubbard
  2023-01-24  5:57             ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: John Hubbard @ 2023-01-23 23:07 UTC (permalink / raw)
  To: Jan Kara, David Howells
  Cc: David Hildenbrand, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, linux-mm

On 1/23/23 08:11, Jan Kara wrote:
>> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
>> a special cleanup?
> 
> FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
> whenever there isn't reasonably bound time after which the page is
> unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
> long the RDMA is going to be running it should be using FOLL_LONGTERM. The

Yes, we have been pretty consistently deciding that RDMA generally
implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
FOLL_PIN--that one is actually enforced by the gup/pup APIs.)


thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 23:07           ` John Hubbard
@ 2023-01-24  5:57             ` Christoph Hellwig
  2023-01-24  6:55               ` John Hubbard
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-24  5:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, David Howells, David Hildenbrand, Al Viro,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, linux-mm

On Mon, Jan 23, 2023 at 03:07:48PM -0800, John Hubbard wrote:
> On 1/23/23 08:11, Jan Kara wrote:
> > > For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
> > > a special cleanup?
> > 
> > FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
> > whenever there isn't reasonably bound time after which the page is
> > unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
> > long the RDMA is going to be running it should be using FOLL_LONGTERM. The
> 
> Yes, we have been pretty consistently deciding that RDMA generally
> implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
> FOLL_PIN--that one is actually enforced by the gup/pup APIs.)

That's weird.  For storage or file systems, pages are pinnen just as
long when using RDMA as when using local DMA, in fact if you do RDMA
to really fast remote media vs slow local media (e.g. SSD vs disk) you
might pin it shorter when using RDMA.

I think FOLL_LONGTERM makes sense for non-ODP user space memory
registrations for RDMA, which will last basically forever.  It does
not really make much sense at all for in-kernel memory registration for
RDMA that are very short term.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-24  5:57             ` Christoph Hellwig
@ 2023-01-24  6:55               ` John Hubbard
  0 siblings, 0 replies; 54+ messages in thread
From: John Hubbard @ 2023-01-24  6:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, David Howells, David Hildenbrand, Al Viro,
	Matthew Wilcox, Jens Axboe, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Christoph Hellwig,
	linux-mm

On 1/23/23 21:57, Christoph Hellwig wrote:
> On Mon, Jan 23, 2023 at 03:07:48PM -0800, John Hubbard wrote:
>> On 1/23/23 08:11, Jan Kara wrote:
>>>> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
>>>> a special cleanup?
>>>
>>> FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
>>> whenever there isn't reasonably bound time after which the page is
>>> unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
>>> long the RDMA is going to be running it should be using FOLL_LONGTERM. The
>>
>> Yes, we have been pretty consistently deciding that RDMA generally
>> implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
>> FOLL_PIN--that one is actually enforced by the gup/pup APIs.)
> 
> That's weird.  For storage or file systems, pages are pinnen just as
> long when using RDMA as when using local DMA, in fact if you do RDMA
> to really fast remote media vs slow local media (e.g. SSD vs disk) you
> might pin it shorter when using RDMA.

ah OK, it is possible that I've been thinking too much about the HPC
cases that set up RDMA and leave it there, and not enough about
storage.

> 
> I think FOLL_LONGTERM makes sense for non-ODP user space memory
> registrations for RDMA, which will last basically forever.  It does
> not really make much sense at all for in-kernel memory registration for
> RDMA that are very short term.

Agreed.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-23 17:25     ` Jan Kara
@ 2023-01-24 10:24       ` David Hildenbrand
  0 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2023-01-24 10:24 UTC (permalink / raw)
  To: Jan Kara, Matthew Wilcox
  Cc: David Howells, John Hubbard, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

On 23.01.23 18:25, Jan Kara wrote:
> On Mon 23-01-23 16:42:56, Matthew Wilcox wrote:
>> On Mon, Jan 23, 2023 at 04:38:47PM +0000, David Howells wrote:
>>> Matthew Wilcox <willy@infradead.org> wrote:
>>> Also you only mention DIO read - but what about "start DIO write; fork(); touch
>>> buffer" in the parent - now the write buffer belongs to the child and they can
>>> affect the parent's write.
>>
>> I'm struggling to see the problem here.  If the child hasn't exec'd, the
>> parent and child are still in the same security domain.  The parent
>> could have modified the buffer before calling fork().
> 
> Sadly they are not. Android in particular starts applications by forking
> one big binary (zygote) that has multiple apps linked together and relies
> on the fact the child cannot influence the parent after the fork. We've
> already had CVEs with GUP & COW & fork due to this. David Hildebrand has a
> lot of memories regarding this I believe ;)

:)

Once FOLL_PIN is used most of the issues go away and we don't have to 
play any games with VM flags or similar ...

With FOLL_PIN, I consider anon a solved problem and not worth any new 
fancy ideas.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-23 17:33     ` Matthew Wilcox
  2023-01-23 22:53       ` John Hubbard
@ 2023-01-24 10:29       ` Jan Kara
  2023-01-24 13:21         ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Jan Kara @ 2023-01-24 10:29 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, David Howells, John Hubbard, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

On Mon 23-01-23 17:33:24, Matthew Wilcox wrote:
> On Mon, Jan 23, 2023 at 05:42:18PM +0100, Jan Kara wrote:
> > On Mon 23-01-23 16:31:32, Matthew Wilcox wrote:
> > > On Fri, Jan 20, 2023 at 05:55:48PM +0000, David Howells wrote:
> > > >  (3) Make the bio struct carry a pair of flags to indicate the cleanup
> > > >      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
> > > >      FOLL_GET) and BIO_PAGE_PINNED (equivalent to BIO_PAGE_PINNED) is
> > > >      added.
> > > 
> > > I think there's a simpler solution than all of this.
> > > 
> > > As I understand the fundamental problem here, the question is
> > > when to copy a page on fork.  We have the optimisation of COW, but
> > > O_DIRECT/RDMA/... breaks it.  So all this page pinning is to indicate
> > > to the fork code "You can't do COW to this page".
> > > 
> > > Why do we want to track that information on a per-page basis?  Wouldn't it
> > > be easier to have a VM_NOCOW flag in vma->vm_flags?  Set it the first
> > > time somebody does an O_DIRECT read or RDMA pin.  That's it.  Pages in
> > > that VMA will now never be COWed, regardless of their refcount/mapcount.
> > > And the whole "did we pin or get this page" problem goes away.  Along
> > > with folio->pincount.
> > 
> > Well, but anon COW code is not the only (planned) consumer of the pincount.
> > Filesystems also need to know whether a (shared pagecache) page is pinned
> > and can thus be modified behind their backs. And for that VMA tracking
> > isn't really an option.
> 
> Bleh, I'd forgotten about that problem.  We really do need to keep
> track of which pages are under I/O for this case, because we need to
> tell the filesystem that they are now available for writeback.
> 
> That said, I don't know that we need to keep track of it in the
> pages themselves.  Can't we have something similar to rmap which
> keeps track of a range of pinned pages, and have it taken care of
> at a higher level (ie unpin the pages in the dio_iodone_t rather
> than in the BIO completion handler)?

We could but bear in mind that there are more places than just (the two)
direct IO paths. There are many GUP users that can be operating on mmapped
pagecache and that can modify page data. Also note that e.g. direct IO is
rather performance sensitive so any CPU overhead that is added to that path
gets noticed. That was actually the reason why we keep pinned pages on the
LRU - John tried to remove them while pinning but the overhead was not
acceptable.

Finally, since we need to handle pinning for anon pages as well, just
(ab)using the page refcount looks like the least painful solution.

> I'm not even sure why pinned pagecache pages remain on the LRU.
> They should probably go onto the unevictable list with the mlocked
> pages, then on unpin get marked dirty and placed on the active list.
> There's no point in writing back a pinned page since it can be
> written to at any instant without any part of the kernel knowing.

True but as John said sometimes we need to writeout even pinned page - e.g.
on fsync(2). For some RDMA users which keep pages pinned for days or
months, this is actually crutial...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-24 10:29       ` Jan Kara
@ 2023-01-24 13:21         ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2023-01-24 13:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Matthew Wilcox, David Howells, John Hubbard, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

On Tue, Jan 24, 2023 at 11:29:31AM +0100, Jan Kara wrote:
> True but as John said sometimes we need to writeout even pinned page - e.g.
> on fsync(2). For some RDMA users which keep pages pinned for days or
> months, this is actually crutial...

I think we have to distinguish between short term (just FOLL_PIN)
and long term (FOLL_PIN | FOLL_LONGERM) pins.

For short term ones the proper thing to do in data integrity writeback
is to simply wait for the unpin.  For the long term pins that obviously
can't work.  The right answer for that is complicated and I don't have
a good answer yet.  The best one would probably to probibit them on
MAP_SHARED file backed mappings - this might break some existing
software, but that software is already so broken that this might be
best.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:24       ` David Hildenbrand
  2023-01-23 19:56         ` John Hubbard
@ 2023-01-26 22:15         ` Al Viro
  2023-01-26 23:41           ` David Hildenbrand
  2023-01-27  0:05           ` David Howells
  1 sibling, 2 replies; 54+ messages in thread
From: Al Viro @ 2023-01-26 22:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Howells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

On Mon, Jan 23, 2023 at 02:24:13PM +0100, David Hildenbrand wrote:
> On 23.01.23 14:19, David Howells wrote:
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> > > Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
> > > the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
> > > already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
> > > unfortunately, no reflect reality yet.
> > 
> > Yeah - I just came across that.
> > 
> > Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
> > get_user_pages()?  In which case my patches only need keep track of
> > pinned/not-pinned and never "got".
> 
> That would be the ideal case: whenever intending to access page content, use
> FOLL_PIN instead of FOLL_GET.
> 
> The issue that John was trying to sort out was that there are plenty of
> callsites that do a simple put_page() instead of calling unpin_user_page().
> IIRC, handling that correctly in existing code -- what was pinned must be
> released via unpin_user_page() -- was the biggest workitem.
> 
> Not sure how that relates to your work here (that's why I was asking): if
> you could avoid FOLL_GET, that would be great :)

Take a good look at iter_to_pipe().  It does *not* need to pin anything
(we have an ITER_SOURCE there); with this approach it will.  And it
will stuff those pinned references into a pipe, where they can sit
indefinitely.

IOW, I don't believe it's a usable approach.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-26 22:15         ` Al Viro
@ 2023-01-26 23:41           ` David Hildenbrand
  2023-01-27  0:05           ` David Howells
  1 sibling, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2023-01-26 23:41 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

On 26.01.23 23:15, Al Viro wrote:
> On Mon, Jan 23, 2023 at 02:24:13PM +0100, David Hildenbrand wrote:
>> On 23.01.23 14:19, David Howells wrote:
>>> David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>>>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>>>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>>>> unfortunately, no reflect reality yet.
>>>
>>> Yeah - I just came across that.
>>>
>>> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
>>> get_user_pages()?  In which case my patches only need keep track of
>>> pinned/not-pinned and never "got".
>>
>> That would be the ideal case: whenever intending to access page content, use
>> FOLL_PIN instead of FOLL_GET.
>>
>> The issue that John was trying to sort out was that there are plenty of
>> callsites that do a simple put_page() instead of calling unpin_user_page().
>> IIRC, handling that correctly in existing code -- what was pinned must be
>> released via unpin_user_page() -- was the biggest workitem.
>>
>> Not sure how that relates to your work here (that's why I was asking): if
>> you could avoid FOLL_GET, that would be great :)
> 
> Take a good look at iter_to_pipe().  It does *not* need to pin anything
> (we have an ITER_SOURCE there); with this approach it will.  And it
> will stuff those pinned references into a pipe, where they can sit
> indefinitely.
> 
> IOW, I don't believe it's a usable approach.
> 

Not sure what makes you believe that FOLL_GET is any better for this 
long-term pinning, I'd like to learn about that.

As raised already somewhere in the whole discussion by me, the right way 
to take such a long-term ping as vmsplice() does is to use 
FOLL_PIN|FOLL_LONGTERM. As also raised, that will fix the last remaining 
vmsplice()+hugetlb COW issue as tested by the cow.c vm selftest and make 
sure to migrate that memory off of MIGRATE_MOVABLE/CMA memory where we 
cannot tolerate to have long-term unmovable memory sitting around.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-26 22:15         ` Al Viro
  2023-01-26 23:41           ` David Hildenbrand
@ 2023-01-27  0:05           ` David Howells
  2023-01-27  0:20             ` David Hildenbrand
  1 sibling, 1 reply; 54+ messages in thread
From: David Howells @ 2023-01-27  0:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

David Hildenbrand <david@redhat.com> wrote:

> As raised already somewhere in the whole discussion by me, the right way to
> take such a long-term ping as vmsplice() does is to use
> FOLL_PIN|FOLL_LONGTERM.

So the pipe infrastructure would have to be able to pin pages instead of
carrying refs on them?  What about pages just allocated and added to the pipe
ring in normal pipe use?

David


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-27  0:05           ` David Howells
@ 2023-01-27  0:20             ` David Hildenbrand
  0 siblings, 0 replies; 54+ messages in thread
From: David Hildenbrand @ 2023-01-27  0:20 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On 27.01.23 01:05, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> As raised already somewhere in the whole discussion by me, the right way to
>> take such a long-term ping as vmsplice() does is to use
>> FOLL_PIN|FOLL_LONGTERM.
> 
> So the pipe infrastructure would have to be able to pin pages instead of
> carrying refs on them?  What about pages just allocated and added to the pipe
> ring in normal pipe use?
Ordinary kernel allocations (alloc_page() ...) always have to be freed 
via put_page() and friends. Such allocations are unmovable as default 
and don't require any special care.

Pages mapped into user space are movable as default and might be placed 
on ZONE_MOVABLE/CMA memory (well, and might be swapped out). 
FOLL_LONGTERM makes sure to migrate these pages off of such problematic 
physical memory regions, such that we can safely pin them until eternity.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2023-01-27  0:21 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-20 17:55 [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) David Howells
2023-01-20 17:55 ` [PATCH v7 1/8] iov_iter: Define flags to qualify page extraction David Howells
2023-01-21 13:01   ` Christoph Hellwig
2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
2023-01-21 13:01   ` Christoph Hellwig
2023-01-21 13:10   ` Christoph Hellwig
2023-01-21 13:30   ` David Howells
2023-01-21 13:33     ` Christoph Hellwig
2023-01-23 11:28   ` David Hildenbrand
2023-01-23 11:51   ` David Howells
2023-01-23 13:11     ` David Hildenbrand
2023-01-23 13:19     ` David Howells
2023-01-23 13:24       ` David Hildenbrand
2023-01-23 19:56         ` John Hubbard
2023-01-26 22:15         ` Al Viro
2023-01-26 23:41           ` David Hildenbrand
2023-01-27  0:05           ` David Howells
2023-01-27  0:20             ` David Hildenbrand
2023-01-23 13:38       ` David Howells
2023-01-23 14:20         ` David Hildenbrand
2023-01-23 14:48           ` Christoph Hellwig
2023-01-23 16:11         ` Jan Kara
2023-01-23 16:17           ` Christoph Hellwig
2023-01-23 23:07           ` John Hubbard
2023-01-24  5:57             ` Christoph Hellwig
2023-01-24  6:55               ` John Hubbard
2023-01-23 12:00   ` David Howells
2023-01-20 17:55 ` [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page David Howells
2023-01-20 17:55 ` [PATCH v7 4/8] block: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
2023-01-21 13:04   ` Christoph Hellwig
2023-01-23  9:38   ` David Howells
2023-01-23  9:56     ` Christoph Hellwig
2023-01-20 17:55 ` [PATCH v7 5/8] block: Add BIO_PAGE_PINNED David Howells
2023-01-21 13:05   ` Christoph Hellwig
2023-01-20 17:55 ` [PATCH v7 6/8] block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
2023-01-21 13:07   ` Christoph Hellwig
2023-01-23 11:28   ` David Howells
2023-01-23 14:49     ` Christoph Hellwig
2023-01-20 17:55 ` [PATCH v7 7/8] block: Fix bio_flagged() so that gcc can better optimise it David Howells
2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
2023-01-20 18:59   ` Matthew Wilcox
2023-01-20 19:18   ` David Howells
2023-01-23 16:31 ` [PATCH v7 0/8] iov_iter: Improve page extraction (ref, pin or just list) Matthew Wilcox
2023-01-23 16:42   ` Jan Kara
2023-01-23 17:33     ` Matthew Wilcox
2023-01-23 22:53       ` John Hubbard
2023-01-24 10:29       ` Jan Kara
2023-01-24 13:21         ` Christoph Hellwig
2023-01-23 16:38 ` David Howells
2023-01-23 16:42   ` Matthew Wilcox
2023-01-23 17:25     ` Jan Kara
2023-01-24 10:24       ` David Hildenbrand
2023-01-23 17:19   ` David Howells
2023-01-23 18:04     ` Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).