* [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list)
@ 2023-02-07 17:12 David Howells
  2023-02-07 17:12 ` [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE David Howells
From: David Howells @ 2023-02-07 17:12 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm

Hi Jens, Al, Christoph,

Here are patches to provide support for extracting pages from an iov_iter
and to use this in the extraction functions in the block layer bio code.

The patches make the following changes:

 (1) Change generic_file_splice_read() to load up an ITER_BVEC iterator
     with sufficient pages and use that rather than an ITER_PIPE iterator.
     This avoids a problem[2] that arises when __iomap_dio_rw() races with
     truncation and calls iov_iter_revert() to shorten the iterator: the
     reversion causes the pipe iterator to prematurely release the pages it
     was retaining - despite the read still being in progress - and this
     caused memory corruption.

 (2) Remove ITER_PIPE and its paraphernalia as generic_file_splice_read()
     was the only user.

 (3) Add a function, iov_iter_extract_pages(), to replace
     iov_iter_get_pages*().  It gets refs on, pins or merely lists the
     pages as appropriate to the iterator type.

     Add a function, iov_iter_extract_will_pin(), that indicates from the
     iterator type how the cleanup is to be performed, returning true if
     the pages will need unpinning and false otherwise.

 (4) Make the bio struct carry a pair of flags to indicate the cleanup
     mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (indicating
     FOLL_GET was used), and BIO_PAGE_PINNED (indicating FOLL_PIN was used)
     is added.

     BIO_PAGE_REFFED will go away, but at the moment fs/direct-io.c sets it
     and this series does not fully address that file.

 (5) Add a function, bio_release_page(), to release a page appropriately to
     the cleanup mode indicated by the BIO_PAGE_* flags (a rough sketch is
     given after this list).

 (6) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
     pages appropriately and clean them up later.

 (7) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.
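
To illustrate how (4) and (5) fit together, here's a rough sketch of a
bio_release_page()-style helper.  This is only an illustration based on the
description above, not the exact code from the series:

	/*
	 * Illustrative sketch only: release one page according to the
	 * cleanup mode recorded in the bio flags.
	 */
	static void bio_release_page(struct bio *bio, struct page *page)
	{
		if (bio_flagged(bio, BIO_PAGE_PINNED))
			unpin_user_page(page);
		else if (bio_flagged(bio, BIO_PAGE_REFFED))
			put_page(page);
	}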

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

and tagged it with

	iov-extract-20230207

David

Changes:
========
ver #12)
 - Added the missing __bitwise on the iov_iter_extraction_t typedef.
 - Rebased on -rc7.
 - Don't specify FOLL_PIN to pin_user_pages_fast().
 - Inserted patch at front to fix race between DIO read and truncation that
   caused memory corruption when iov_iter_revert() got called on an
   ITER_PIPE iterator[2].
 - Inserted a patch after that to remove the now-unused ITER_PIPE and its
   helper functions.
 - Removed the ITER_PIPE bits from iov_iter_extract_pages().

ver #11)
 - Fix iov_iter_extract_kvec_pages() to include the offset into the page in
   the returned starting offset.
 - Use __bitwise for the extraction flags

ver #10)
 - Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
 - Drop bio_set_cleanup_mode(), open coding it instead.

ver #9)
 - It's now not permitted to use FOLL_PIN outside of mm/, so:
 - Change iov_iter_extract_mode() into iov_iter_extract_will_pin() and
   return true/false instead of FOLL_PIN/0.
 - Drop folio_put_unpin() and page_put_unpin() and instead call
   unpin_user_page() (and put_page()) directly as necessary.
 - Make __bio_release_pages() call bio_release_page() instead of
   unpin_user_page() as there's no BIO_* -> FOLL_* translation to do.
 - Drop the FOLL_* renumbering patch.
 - Change extract_flags to extraction_flags.

ver #8)
 - Import Christoph Hellwig's changes.
   - Split the conversion-to-extraction patch.
   - Drop the extract_flags arg from iov_iter_extract_mode().
   - Don't default bios to BIO_PAGE_REFFED, but set explicitly.
 - Switch FOLL_PIN and FOLL_GET when renumbering so PIN is at bit 0.
 - Switch BIO_PAGE_PINNED and BIO_PAGE_REFFED so PINNED is at bit 0.
 - We should always be using FOLL_PIN (not FOLL_GET) for DIO, so adjust the
   patches for that.

ver #7)
 - For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
   as it turned out to be a lot more complicated, with places not setting
   IOCB_WRITE when they should, for example.
 - Drop all the patches that changed things other than the block layer's
   bio handling.  The netfslib and cifs changes can go into a separate
   patchset.
 - Add support for extracting pages from KVEC-type iterators.
 - When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
 - Fix write() syscall and co. not setting IOCB_WRITE.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the iterator direction checks from smbd_recv().
 - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
   gup_flags to iov_iter_get/extract_pages*().
 - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Provide a helper to clean up a page.
 - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
   the same numerical values, enforced with an assertion.
 - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
   NFS.
 - Added in the patches to make CIFS do top-to-bottom iterators and use
   various of the added extraction functions.
 - Added a pair of work-in-progress patches to make sk_buff fragments store
   FOLL_GET and FOLL_PIN.

ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6
Link: https://lore.kernel.org/r/20230120175556.3556978-1-dhowells@redhat.com/ # v7
Link: https://lore.kernel.org/r/20230123173007.325544-1-dhowells@redhat.com/ # v8
Link: https://lore.kernel.org/r/20230124170108.1070389-1-dhowells@redhat.com/ # v9
Link: https://lore.kernel.org/r/20230125210657.2335748-1-dhowells@redhat.com/ # v10
Link: https://lore.kernel.org/r/20230126141626.2809643-1-dhowells@redhat.com/ # v11

Christoph Hellwig (1):
  block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted
    logic

David Howells (9):
  vfs, iomap: Fix generic_file_splice_read() to avoid reversion of
    ITER_PIPE
  iov_iter: Kill ITER_PIPE
  iov_iter: Define flags to qualify page extraction.
  iov_iter: Add a function to extract a page list from an iterator
  iomap: Don't get a reference on ZERO_PAGE for direct I/O block
    zeroing
  block: Fix bio_flagged() so that gcc can better optimise it
  block: Add BIO_PAGE_PINNED and associated infrastructure
  block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
  block: convert bio_map_user_iov to use iov_iter_extract_pages

 block/bio.c               |  33 +-
 block/blk-map.c           |  26 +-
 block/blk.h               |  12 +
 fs/cifs/file.c            |   8 +-
 fs/direct-io.c            |   2 +
 fs/iomap/direct-io.c      |   1 -
 fs/splice.c               |  76 +++-
 include/linux/bio.h       |   5 +-
 include/linux/blk_types.h |   3 +-
 include/linux/uio.h       |  49 ++-
 lib/iov_iter.c            | 713 +++++++++++++++-----------------------
 mm/filemap.c              |   3 -
 12 files changed, 426 insertions(+), 505 deletions(-)



* [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
@ 2023-02-07 17:12 ` David Howells
  2023-02-08  5:26   ` Christoph Hellwig
  2023-02-08 16:09   ` David Howells
  2023-02-07 17:12 ` [PATCH v12 02/10] iov_iter: Kill ITER_PIPE David Howells
From: David Howells @ 2023-02-07 17:12 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

With the new iov_iter_extract_pages() function (added in a later patch),
pages extracted from a non-user-backed iterator, such as ITER_PIPE, aren't
pinned.  __iomap_dio_rw(), however, calls iov_iter_revert() to shorten the
iterator to just the data it is going to use - which causes the pipe
buffers to be freed, even though they're attached to a bio and may get
written to by DMA (thanks to Hillf Danton for spotting this[1]).

This then causes massive memory corruption that is particularly noticeable
when the syzbot test[2] is run.  The test boils down to:

	out = creat(argv[1], 0666);
	ftruncate(out, 0x800);
	lseek(out, 0x200, SEEK_SET);
	in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
	sendfile(out, in, NULL, 0x1dd00);

run repeatedly in parallel.  What I think is happening is that ftruncate()
occasionally reduces i_size and thereby shortens the DIO read that
sendfile's splice core is about to make.

Fix this by replacing the use of an ITER_PIPE iterator with an ITER_BVEC
iterator for which reversion won't free the buffers.  Bulk allocate all the
buffers we think we're going to use in advance, do the read synchronously
and only then trim the buffer down.  The pages we did use get pushed into
the pipe.

This is more efficient by virtue of doing a bulk page allocation, but
slightly less efficient by ignoring any partial page in the pipe.

Note that this removes the only user of ITER_PIPE.

Fixes: 920756a3306a ("block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages")
Reported-by: syzbot+a440341a59e3b7142895@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230207094731.1390-1-hdanton@sina.com/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
---
 fs/splice.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 68 insertions(+), 8 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 5969b7a1d353..51778437f31f 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -295,24 +295,62 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
  *    used as long as it has more or less sane ->read_iter().
  *
  */
-ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
+ssize_t generic_file_splice_read(struct file *file, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
+	LIST_HEAD(pages);
 	struct iov_iter to;
+	struct bio_vec *bv;
 	struct kiocb kiocb;
-	int ret;
+	struct page *page;
+	unsigned int head;
+	ssize_t ret;
+	size_t used, npages, chunk, remain, reclaim;
+	int i;
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+	npages = DIV_ROUND_UP(len, PAGE_SIZE);
+
+	bv = kmalloc(array_size(npages, sizeof(bv[0])), GFP_KERNEL);
+	if (!bv)
+		return -ENOMEM;
+
+	npages = alloc_pages_bulk_list(GFP_USER, npages, &pages);
+	if (!npages) {
+		kfree(bv);
+		return -ENOMEM;
+	}
 
-	iov_iter_pipe(&to, ITER_DEST, pipe, len);
-	init_sync_kiocb(&kiocb, in);
+	remain = len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	for (i = 0; i < npages; i++) {
+		chunk = min_t(size_t, PAGE_SIZE, remain);
+		page = list_first_entry(&pages, struct page, lru);
+		list_del_init(&page->lru);
+		bv[i].bv_page = page;
+		bv[i].bv_offset = 0;
+		bv[i].bv_len = chunk;
+		remain -= chunk;
+	}
+
+	/* Do the I/O */
+	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
+	init_sync_kiocb(&kiocb, file);
 	kiocb.ki_pos = *ppos;
-	ret = call_read_iter(in, &kiocb, &to);
+	ret = call_read_iter(file, &kiocb, &to);
+
+	reclaim = npages * PAGE_SIZE;
+	remain = 0;
 	if (ret > 0) {
+		reclaim -= ret;
+		remain = ret;
 		*ppos = kiocb.ki_pos;
-		file_accessed(in);
+		file_accessed(file);
 	} else if (ret < 0) {
-		/* free what was emitted */
-		pipe_discard_from(pipe, to.start_head);
 		/*
 		 * callers of ->splice_read() expect -EAGAIN on
 		 * "can't put anything in there", rather than -EFAULT.
@@ -321,6 +359,28 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 			ret = -EAGAIN;
 	}
 
+	/* Free any pages that didn't get touched at all. */
+	for (; reclaim >= PAGE_SIZE; reclaim -= PAGE_SIZE)
+		__free_page(bv[--npages].bv_page);
+
+	/* Push the remaining pages into the pipe. */
+	head = pipe->head;
+	for (i = 0; i < npages; i++) {
+		struct pipe_buffer *buf = &pipe->bufs[head & (pipe->ring_size - 1)];
+
+		chunk = min_t(size_t, remain, PAGE_SIZE);
+		*buf = (struct pipe_buffer) {
+			.ops	= &default_pipe_buf_ops,
+			.page	= bv[i].bv_page,
+			.offset	= 0,
+			.len	= chunk,
+		};
+		head++;
+		remain -= chunk;
+	}
+	pipe->head = head;
+
+	kfree(bv);
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_splice_read);



* [PATCH v12 02/10] iov_iter: Kill ITER_PIPE
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
  2023-02-07 17:12 ` [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE David Howells
@ 2023-02-07 17:12 ` David Howells
  2023-02-07 17:12 ` [PATCH v12 03/10] iov_iter: Define flags to qualify page extraction David Howells
From: David Howells @ 2023-02-07 17:12 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

The ITER_PIPE-type iterator was only used for generic_file_splice_read(),
but that has now been switched to using ITER_BVEC instead, leaving
ITER_PIPE unused - so get rid of it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/cifs/file.c      |   8 +-
 include/linux/uio.h |  14 --
 lib/iov_iter.c      | 435 +-------------------------------------------
 mm/filemap.c        |   3 -
 4 files changed, 4 insertions(+), 456 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 22dfc1f8b4f1..57ca4eea69dd 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3806,13 +3806,7 @@ cifs_readdata_to_iov(struct cifs_readdata *rdata, struct iov_iter *iter)
 		size_t copy = min_t(size_t, remaining, PAGE_SIZE);
 		size_t written;
 
-		if (unlikely(iov_iter_is_pipe(iter))) {
-			void *addr = kmap_atomic(page);
-
-			written = copy_to_iter(addr, copy, iter);
-			kunmap_atomic(addr);
-		} else
-			written = copy_page_to_iter(page, 0, copy, iter);
+		written = copy_page_to_iter(page, 0, copy, iter);
 		remaining -= written;
 		if (written < copy && iov_iter_count(iter) > 0)
 			break;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9f158238edba..dcc0ca5ef491 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -11,7 +11,6 @@
 #include <uapi/linux/uio.h>
 
 struct page;
-struct pipe_inode_info;
 
 struct kvec {
 	void *iov_base; /* and that should *never* hold a userland pointer */
@@ -23,7 +22,6 @@ enum iter_type {
 	ITER_IOVEC,
 	ITER_KVEC,
 	ITER_BVEC,
-	ITER_PIPE,
 	ITER_XARRAY,
 	ITER_DISCARD,
 	ITER_UBUF,
@@ -53,15 +51,10 @@ struct iov_iter {
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
 		struct xarray *xarray;
-		struct pipe_inode_info *pipe;
 		void __user *ubuf;
 	};
 	union {
 		unsigned long nr_segs;
-		struct {
-			unsigned int head;
-			unsigned int start_head;
-		};
 		loff_t xarray_start;
 	};
 };
@@ -99,11 +92,6 @@ static inline bool iov_iter_is_bvec(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_BVEC;
 }
 
-static inline bool iov_iter_is_pipe(const struct iov_iter *i)
-{
-	return iov_iter_type(i) == ITER_PIPE;
-}
-
 static inline bool iov_iter_is_discard(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_DISCARD;
@@ -245,8 +233,6 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec
 			unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
-			size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9a3ff37ecd1..adc5e8aa8ae8 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -14,8 +14,6 @@
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
 
-#define PIPE_PARANOIA /* for now */
-
 /* covers ubuf and kbuf alike */
 #define iterate_buf(i, n, base, len, off, __p, STEP) {		\
 	size_t __maybe_unused off = 0;				\
@@ -186,156 +184,6 @@ static int copyin(void *to, const void __user *from, size_t n)
 	return res;
 }
 
-static inline struct pipe_buffer *pipe_buf(const struct pipe_inode_info *pipe,
-					   unsigned int slot)
-{
-	return &pipe->bufs[slot & (pipe->ring_size - 1)];
-}
-
-#ifdef PIPE_PARANOIA
-static bool sanity(const struct iov_iter *i)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_head = pipe->head;
-	unsigned int p_tail = pipe->tail;
-	unsigned int p_occupancy = pipe_occupancy(p_head, p_tail);
-	unsigned int i_head = i->head;
-	unsigned int idx;
-
-	if (i->last_offset) {
-		struct pipe_buffer *p;
-		if (unlikely(p_occupancy == 0))
-			goto Bad;	// pipe must be non-empty
-		if (unlikely(i_head != p_head - 1))
-			goto Bad;	// must be at the last buffer...
-
-		p = pipe_buf(pipe, i_head);
-		if (unlikely(p->offset + p->len != abs(i->last_offset)))
-			goto Bad;	// ... at the end of segment
-	} else {
-		if (i_head != p_head)
-			goto Bad;	// must be right after the last buffer
-	}
-	return true;
-Bad:
-	printk(KERN_ERR "idx = %d, offset = %d\n", i_head, i->last_offset);
-	printk(KERN_ERR "head = %d, tail = %d, buffers = %d\n",
-			p_head, p_tail, pipe->ring_size);
-	for (idx = 0; idx < pipe->ring_size; idx++)
-		printk(KERN_ERR "[%p %p %d %d]\n",
-			pipe->bufs[idx].ops,
-			pipe->bufs[idx].page,
-			pipe->bufs[idx].offset,
-			pipe->bufs[idx].len);
-	WARN_ON(1);
-	return false;
-}
-#else
-#define sanity(i) true
-#endif
-
-static struct page *push_anon(struct pipe_inode_info *pipe, unsigned size)
-{
-	struct page *page = alloc_page(GFP_USER);
-	if (page) {
-		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
-		*buf = (struct pipe_buffer) {
-			.ops = &default_pipe_buf_ops,
-			.page = page,
-			.offset = 0,
-			.len = size
-		};
-	}
-	return page;
-}
-
-static void push_page(struct pipe_inode_info *pipe, struct page *page,
-			unsigned int offset, unsigned int size)
-{
-	struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
-	*buf = (struct pipe_buffer) {
-		.ops = &page_cache_pipe_buf_ops,
-		.page = page,
-		.offset = offset,
-		.len = size
-	};
-	get_page(page);
-}
-
-static inline int last_offset(const struct pipe_buffer *buf)
-{
-	if (buf->ops == &default_pipe_buf_ops)
-		return buf->len;	// buf->offset is 0 for those
-	else
-		return -(buf->offset + buf->len);
-}
-
-static struct page *append_pipe(struct iov_iter *i, size_t size,
-				unsigned int *off)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	int offset = i->last_offset;
-	struct pipe_buffer *buf;
-	struct page *page;
-
-	if (offset > 0 && offset < PAGE_SIZE) {
-		// some space in the last buffer; add to it
-		buf = pipe_buf(pipe, pipe->head - 1);
-		size = min_t(size_t, size, PAGE_SIZE - offset);
-		buf->len += size;
-		i->last_offset += size;
-		i->count -= size;
-		*off = offset;
-		return buf->page;
-	}
-	// OK, we need a new buffer
-	*off = 0;
-	size = min_t(size_t, size, PAGE_SIZE);
-	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-		return NULL;
-	page = push_anon(pipe, size);
-	if (!page)
-		return NULL;
-	i->head = pipe->head - 1;
-	i->last_offset = size;
-	i->count -= size;
-	return page;
-}
-
-static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
-			 struct iov_iter *i)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int head = pipe->head;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	if (offset && i->last_offset == -offset) { // could we merge it?
-		struct pipe_buffer *buf = pipe_buf(pipe, head - 1);
-		if (buf->page == page) {
-			buf->len += bytes;
-			i->last_offset -= bytes;
-			i->count -= bytes;
-			return bytes;
-		}
-	}
-	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-		return 0;
-
-	push_page(pipe, page, offset, bytes);
-	i->last_offset = -(offset + bytes);
-	i->head = head;
-	i->count -= bytes;
-	return bytes;
-}
-
 /*
  * fault_in_iov_iter_readable - fault in iov iterator for reading
  * @i: iterator
@@ -439,46 +287,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_init);
 
-// returns the offset in partial buffer (if any)
-static inline unsigned int pipe_npages(const struct iov_iter *i, int *npages)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	int used = pipe->head - pipe->tail;
-	int off = i->last_offset;
-
-	*npages = max((int)pipe->max_usage - used, 0);
-
-	if (off > 0 && off < PAGE_SIZE) { // anon and not full
-		(*npages)++;
-		return off;
-	}
-	return 0;
-}
-
-static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
-				struct iov_iter *i)
-{
-	unsigned int off, chunk;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	for (size_t n = bytes; n; n -= chunk) {
-		struct page *page = append_pipe(i, n, &off);
-		chunk = min_t(size_t, n, PAGE_SIZE - off);
-		if (!page)
-			return bytes - n;
-		memcpy_to_page(page, off, addr, chunk);
-		addr += chunk;
-	}
-	return bytes;
-}
-
 static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
 			      __wsum sum, size_t off)
 {
@@ -486,44 +294,10 @@ static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
 	return csum_block_add(sum, next, off);
 }
 
-static size_t csum_and_copy_to_pipe_iter(const void *addr, size_t bytes,
-					 struct iov_iter *i, __wsum *sump)
-{
-	__wsum sum = *sump;
-	size_t off = 0;
-	unsigned int chunk, r;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	while (bytes) {
-		struct page *page = append_pipe(i, bytes, &r);
-		char *p;
-
-		if (!page)
-			break;
-		chunk = min_t(size_t, bytes, PAGE_SIZE - r);
-		p = kmap_local_page(page);
-		sum = csum_and_memcpy(p + r, addr + off, chunk, sum, off);
-		kunmap_local(p);
-		off += chunk;
-		bytes -= chunk;
-	}
-	*sump = sum;
-	return off;
-}
-
 size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_pipe_to_iter(addr, bytes, i);
 	if (user_backed_iter(i))
 		might_fault();
 	iterate_and_advance(i, bytes, base, len, off,
@@ -545,42 +319,6 @@ static int copyout_mc(void __user *to, const void *from, size_t n)
 	return n;
 }
 
-static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
-				struct iov_iter *i)
-{
-	size_t xfer = 0;
-	unsigned int off, chunk;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	while (bytes) {
-		struct page *page = append_pipe(i, bytes, &off);
-		unsigned long rem;
-		char *p;
-
-		if (!page)
-			break;
-		chunk = min_t(size_t, bytes, PAGE_SIZE - off);
-		p = kmap_local_page(page);
-		rem = copy_mc_to_kernel(p + off, addr + xfer, chunk);
-		chunk -= rem;
-		kunmap_local(p);
-		xfer += chunk;
-		bytes -= chunk;
-		if (rem) {
-			iov_iter_revert(i, rem);
-			break;
-		}
-	}
-	return xfer;
-}
-
 /**
  * _copy_mc_to_iter - copy to iter with source memory error exception handling
  * @addr: source kernel address
@@ -600,9 +338,8 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
  *   alignment and poison alignment assumptions to avoid re-triggering
  *   hardware exceptions.
  *
- * * ITER_KVEC, ITER_PIPE, and ITER_BVEC can return short copies.
- *   Compare to copy_to_iter() where only ITER_IOVEC attempts might return
- *   a short copy.
+ * * ITER_KVEC and ITER_BVEC can return short copies.  Compare to
+ *   copy_to_iter() where only ITER_IOVEC attempts might return a short copy.
  *
  * Return: number of bytes copied (may be %0)
  */
@@ -610,8 +347,6 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_mc_pipe_to_iter(addr, bytes, i);
 	if (user_backed_iter(i))
 		might_fault();
 	__iterate_and_advance(i, bytes, base, len, off,
@@ -717,8 +452,6 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		return 0;
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_page_to_iter_pipe(page, offset, bytes, i);
 	page += offset / PAGE_SIZE; // first subpage
 	offset %= PAGE_SIZE;
 	while (1) {
@@ -767,36 +500,8 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
-static size_t pipe_zero(size_t bytes, struct iov_iter *i)
-{
-	unsigned int chunk, off;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	for (size_t n = bytes; n; n -= chunk) {
-		struct page *page = append_pipe(i, n, &off);
-		char *p;
-
-		if (!page)
-			return bytes - n;
-		chunk = min_t(size_t, n, PAGE_SIZE - off);
-		p = kmap_local_page(page);
-		memset(p + off, 0, chunk);
-		kunmap_local(p);
-	}
-	return bytes;
-}
-
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
-	if (unlikely(iov_iter_is_pipe(i)))
-		return pipe_zero(bytes, i);
 	iterate_and_advance(i, bytes, base, len, count,
 		clear_user(base, len),
 		memset(base, 0, len)
@@ -827,32 +532,6 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t byt
 }
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
-static void pipe_advance(struct iov_iter *i, size_t size)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	int off = i->last_offset;
-
-	if (!off && !size) {
-		pipe_discard_from(pipe, i->start_head); // discard everything
-		return;
-	}
-	i->count -= size;
-	while (1) {
-		struct pipe_buffer *buf = pipe_buf(pipe, i->head);
-		if (off) /* make it relative to the beginning of buffer */
-			size += abs(off) - buf->offset;
-		if (size <= buf->len) {
-			buf->len = size;
-			i->last_offset = last_offset(buf);
-			break;
-		}
-		size -= buf->len;
-		i->head++;
-		off = 0;
-	}
-	pipe_discard_from(pipe, i->head + 1); // discard everything past this one
-}
-
 static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
 {
 	const struct bio_vec *bvec, *end;
@@ -904,8 +583,6 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_iovec_advance(i, size);
 	} else if (iov_iter_is_bvec(i)) {
 		iov_iter_bvec_advance(i, size);
-	} else if (iov_iter_is_pipe(i)) {
-		pipe_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
 		i->count -= size;
 	}
@@ -919,26 +596,6 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 	if (WARN_ON(unroll > MAX_RW_COUNT))
 		return;
 	i->count += unroll;
-	if (unlikely(iov_iter_is_pipe(i))) {
-		struct pipe_inode_info *pipe = i->pipe;
-		unsigned int head = pipe->head;
-
-		while (head > i->start_head) {
-			struct pipe_buffer *b = pipe_buf(pipe, --head);
-			if (unroll < b->len) {
-				b->len -= unroll;
-				i->last_offset = last_offset(b);
-				i->head = head;
-				return;
-			}
-			unroll -= b->len;
-			pipe_buf_release(pipe, b);
-			pipe->head--;
-		}
-		i->last_offset = 0;
-		i->head = head;
-		return;
-	}
 	if (unlikely(iov_iter_is_discard(i)))
 		return;
 	if (unroll <= i->iov_offset) {
@@ -1026,24 +683,6 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
-			struct pipe_inode_info *pipe,
-			size_t count)
-{
-	BUG_ON(direction != READ);
-	WARN_ON(pipe_full(pipe->head, pipe->tail, pipe->ring_size));
-	*i = (struct iov_iter){
-		.iter_type = ITER_PIPE,
-		.data_source = false,
-		.pipe = pipe,
-		.head = pipe->head,
-		.start_head = pipe->head,
-		.last_offset = 0,
-		.count = count
-	};
-}
-EXPORT_SYMBOL(iov_iter_pipe);
-
 /**
  * iov_iter_xarray - Initialise an I/O iterator to use the pages in an xarray
  * @i: The iterator to initialise.
@@ -1168,19 +807,6 @@ bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
 	if (iov_iter_is_bvec(i))
 		return iov_iter_aligned_bvec(i, addr_mask, len_mask);
 
-	if (iov_iter_is_pipe(i)) {
-		size_t size = i->count;
-
-		if (size & len_mask)
-			return false;
-		if (size && i->last_offset > 0) {
-			if (i->last_offset & addr_mask)
-				return false;
-		}
-
-		return true;
-	}
-
 	if (iov_iter_is_xarray(i)) {
 		if (i->count & len_mask)
 			return false;
@@ -1250,14 +876,6 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (iov_iter_is_bvec(i))
 		return iov_iter_alignment_bvec(i);
 
-	if (iov_iter_is_pipe(i)) {
-		size_t size = i->count;
-
-		if (size && i->last_offset > 0)
-			return size | i->last_offset;
-		return size;
-	}
-
 	if (iov_iter_is_xarray(i))
 		return (i->xarray_start + i->iov_offset) | i->count;
 
@@ -1309,36 +927,6 @@ static int want_pages_array(struct page ***res, size_t size,
 	return count;
 }
 
-static ssize_t pipe_get_pages(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize, unsigned maxpages,
-		   size_t *start)
-{
-	unsigned int npages, count, off, chunk;
-	struct page **p;
-	size_t left;
-
-	if (!sanity(i))
-		return -EFAULT;
-
-	*start = off = pipe_npages(i, &npages);
-	if (!npages)
-		return -EFAULT;
-	count = want_pages_array(pages, maxsize, off, min(npages, maxpages));
-	if (!count)
-		return -ENOMEM;
-	p = *pages;
-	for (npages = 0, left = maxsize ; npages < count; npages++, left -= chunk) {
-		struct page *page = append_pipe(i, left, &off);
-		if (!page)
-			break;
-		chunk = min_t(size_t, left, PAGE_SIZE - off);
-		get_page(*p++ = page);
-	}
-	if (!npages)
-		return -EFAULT;
-	return maxsize - left;
-}
-
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
 					  pgoff_t index, unsigned int nr_pages)
 {
@@ -1486,8 +1074,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		}
 		return maxsize;
 	}
-	if (iov_iter_is_pipe(i))
-		return pipe_get_pages(i, pages, maxsize, maxpages, start);
 	if (iov_iter_is_xarray(i))
 		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
 	return -EFAULT;
@@ -1577,9 +1163,7 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
 	}
 
 	sum = csum_shift(csstate->csum, csstate->off);
-	if (unlikely(iov_iter_is_pipe(i)))
-		bytes = csum_and_copy_to_pipe_iter(addr, bytes, i, &sum);
-	else iterate_and_advance(i, bytes, base, len, off, ({
+	iterate_and_advance(i, bytes, base, len, off, ({
 		next = csum_and_copy_to_user(addr + off, base, len);
 		sum = csum_block_add(sum, next, off);
 		next ? 0 : len;
@@ -1664,15 +1248,6 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		return iov_npages(i, maxpages);
 	if (iov_iter_is_bvec(i))
 		return bvec_npages(i, maxpages);
-	if (iov_iter_is_pipe(i)) {
-		int npages;
-
-		if (!sanity(i))
-			return 0;
-
-		pipe_npages(i, &npages);
-		return min(npages, maxpages);
-	}
 	if (iov_iter_is_xarray(i)) {
 		unsigned offset = (i->xarray_start + i->iov_offset) % PAGE_SIZE;
 		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
@@ -1685,10 +1260,6 @@ EXPORT_SYMBOL(iov_iter_npages);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 {
 	*new = *old;
-	if (unlikely(iov_iter_is_pipe(new))) {
-		WARN_ON(1);
-		return NULL;
-	}
 	if (iov_iter_is_bvec(new))
 		return new->bvec = kmemdup(new->bvec,
 				    new->nr_segs * sizeof(struct bio_vec),
diff --git a/mm/filemap.c b/mm/filemap.c
index c4d4ace9cc70..f72e4875bfcb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2446,9 +2446,6 @@ static bool filemap_range_uptodate(struct address_space *mapping,
 
 	if (folio_test_uptodate(folio))
 		return true;
-	/* pipes can't handle partially uptodate pages */
-	if (iov_iter_is_pipe(iter))
-		return false;
 	if (!mapping->a_ops->is_partially_uptodate)
 		return false;
 	if (mapping->host->i_blkbits >= folio_shift(folio))



* [PATCH v12 03/10] iov_iter: Define flags to qualify page extraction.
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
  2023-02-07 17:12 ` [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE David Howells
  2023-02-07 17:12 ` [PATCH v12 02/10] iov_iter: Kill ITER_PIPE David Howells
@ 2023-02-07 17:12 ` David Howells
  2023-02-07 17:12 ` [PATCH v12 04/10] iov_iter: Add a function to extract a page list from an iterator David Howells
From: David Howells @ 2023-02-07 17:12 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Define a set of flags to qualify page extraction and pass those into
iov_iter_*_pages*() rather than passing in FOLL_* flags.

For now only a flag to allow peer-to-peer DMA is supported.
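
As a rough illustration, condensed from the block-layer conversions below
(the device_supports_p2pdma() check is a made-up stand-in for the real
queue test), a caller opts in to peer-to-peer DMA like this:

	iov_iter_extraction_t extraction_flags = 0;

	/* Only allow peer-to-peer DMA if the device can handle it. */
	if (device_supports_p2pdma(dev))	/* hypothetical check */
		extraction_flags |= ITER_ALLOW_P2PDMA;

	bytes = iov_iter_get_pages(iter, pages, LONG_MAX, nr_vecs,
				   &offs, extraction_flags);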

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
---

Notes:
    ver #12)
     - Use __bitwise for the extraction flags typedef.
    
    ver #11)
     - Use __bitwise for the extraction flags.
    
    ver #9)
     - Change extract_flags to extraction_flags.
    
    ver #7)
     - Don't use FOLL_* as a parameter, but rather define constants
       specifically to use with iov_iter_*_pages*().
     - Drop the I/O direction constants for now.

 block/bio.c         |  6 +++---
 block/blk-map.c     |  8 ++++----
 include/linux/uio.h | 10 ++++++++--
 lib/iov_iter.c      | 14 ++++++++------
 4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index ab59a491a883..b97f3991c904 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1245,11 +1245,11 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
+	iov_iter_extraction_t extraction_flags = 0;
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
 	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
 	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
 	struct page **pages = (struct page **)bv;
-	unsigned int gup_flags = 0;
 	ssize_t size, left;
 	unsigned len, i = 0;
 	size_t offset, trim;
@@ -1264,7 +1264,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
 	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
-		gup_flags |= FOLL_PCI_P2PDMA;
+		extraction_flags |= ITER_ALLOW_P2PDMA;
 
 	/*
 	 * Each segment in the iov is required to be a block size multiple.
@@ -1275,7 +1275,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 */
 	size = iov_iter_get_pages(iter, pages,
 				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, gup_flags);
+				  nr_pages, &offset, extraction_flags);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..080dd60485be 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -265,9 +265,9 @@ static struct bio *blk_rq_map_bio_alloc(struct request *rq,
 static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		gfp_t gfp_mask)
 {
+	iov_iter_extraction_t extraction_flags = 0;
 	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
 	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
-	unsigned int gup_flags = 0;
 	struct bio *bio;
 	int ret;
 	int j;
@@ -280,7 +280,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		return -ENOMEM;
 
 	if (blk_queue_pci_p2pdma(rq->q))
-		gup_flags |= FOLL_PCI_P2PDMA;
+		extraction_flags |= ITER_ALLOW_P2PDMA;
 
 	while (iov_iter_count(iter)) {
 		struct page **pages, *stack_pages[UIO_FASTIOV];
@@ -291,10 +291,10 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
 			pages = stack_pages;
 			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-						   nr_vecs, &offs, gup_flags);
+						   nr_vecs, &offs, extraction_flags);
 		} else {
 			bytes = iov_iter_get_pages_alloc(iter, &pages,
-						LONG_MAX, &offs, gup_flags);
+						LONG_MAX, &offs, extraction_flags);
 		}
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index dcc0ca5ef491..af70e4c9ea27 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -12,6 +12,8 @@
 
 struct page;
 
+typedef unsigned int __bitwise iov_iter_extraction_t;
+
 struct kvec {
 	void *iov_base; /* and that should *never* hold a userland pointer */
 	size_t iov_len;
@@ -238,12 +240,12 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *
 		     loff_t start, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 		size_t maxsize, unsigned maxpages, size_t *start,
-		unsigned gup_flags);
+		iov_iter_extraction_t extraction_flags);
 ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
 			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		struct page ***pages, size_t maxsize, size_t *start,
-		unsigned gup_flags);
+		iov_iter_extraction_t extraction_flags);
 ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
 			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
@@ -346,4 +348,8 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 	};
 }
 
+/* Flags for iov_iter_get/extract_pages*() */
+/* Allow P2PDMA on the extracted pages */
+#define ITER_ALLOW_P2PDMA	((__force iov_iter_extraction_t)0x01)
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index adc5e8aa8ae8..34ee3764d0fa 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1020,9 +1020,9 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start,
-		   unsigned int gup_flags)
+		   iov_iter_extraction_t extraction_flags)
 {
-	unsigned int n;
+	unsigned int n, gup_flags = 0;
 
 	if (maxsize > i->count)
 		maxsize = i->count;
@@ -1030,6 +1030,8 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		return 0;
 	if (maxsize > MAX_RW_COUNT)
 		maxsize = MAX_RW_COUNT;
+	if (extraction_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
 
 	if (likely(user_backed_iter(i))) {
 		unsigned long addr;
@@ -1081,14 +1083,14 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 
 ssize_t iov_iter_get_pages(struct iov_iter *i,
 		   struct page **pages, size_t maxsize, unsigned maxpages,
-		   size_t *start, unsigned gup_flags)
+		   size_t *start, iov_iter_extraction_t extraction_flags)
 {
 	if (!maxpages)
 		return 0;
 	BUG_ON(!pages);
 
 	return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
-					  start, gup_flags);
+					  start, extraction_flags);
 }
 EXPORT_SYMBOL_GPL(iov_iter_get_pages);
 
@@ -1101,14 +1103,14 @@ EXPORT_SYMBOL(iov_iter_get_pages2);
 
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
-		   size_t *start, unsigned gup_flags)
+		   size_t *start, iov_iter_extraction_t extraction_flags)
 {
 	ssize_t len;
 
 	*pages = NULL;
 
 	len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
-					 gup_flags);
+					 extraction_flags);
 	if (len <= 0) {
 		kvfree(*pages);
 		*pages = NULL;



* [PATCH v12 04/10] iov_iter: Add a function to extract a page list from an iterator
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
  2023-02-07 17:12 ` [PATCH v12 03/10] iov_iter: Define flags to qualify page extraction David Howells
@ 2023-02-07 17:12 ` David Howells
  2023-02-07 17:13 ` [PATCH v12 05/10] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing David Howells
From: David Howells @ 2023-02-07 17:12 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  Depending on the type of iterator, the pages may be returned
with a pin added or with no reference taken at all.

Add a second function, iov_iter_extract_will_pin(), to determine how the
cleanup should be done.

There are two cases:

 (1) ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins (FOLL_PIN) obtained on them so that a
     concurrent fork() will forcibly copy the page; DMA is then done
     to/from the parent's buffer and is unavailable to, and unaffected by,
     the child process.

     iov_iter_extract_will_pin() will return true for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Any other sort of iterator.

     No refs or pins are obtained on the pages; the assumption is made that
     the caller will manage page retention.

     iov_iter_extract_will_pin() will return false.  The pages don't need
     additional disposal.
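
As a rough usage sketch, not taken from the patch: dio_submit_pages() is a
made-up placeholder for the actual I/O, "count" is the number of bytes
wanted and error handling is mostly omitted:

	struct page **pages = NULL;	/* let the function allocate the array */
	unsigned int i, npages;
	size_t offset;
	ssize_t len;

	len = iov_iter_extract_pages(iter, &pages, count, INT_MAX, 0, &offset);
	if (len <= 0)
		return len;
	npages = DIV_ROUND_UP(offset + len, PAGE_SIZE);

	dio_submit_pages(pages, npages, offset, len);	/* hypothetical I/O */

	/* On I/O completion, clean up according to how the pages were retained. */
	if (iov_iter_extract_will_pin(iter))
		for (i = 0; i < npages; i++)
			unpin_user_page(pages[i]);
	kvfree(pages);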

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #12)
     - ITER_PIPE is gone, so drop related bits.
     - Don't specify FOLL_PIN as that's implied by pin_user_pages_fast().
    
    ver #11)
     - Fix iov_iter_extract_kvec_pages() to include the offset into the page in
       the returned starting offset.
     - Use __bitwise for the extraction flags
    
    ver #10)
     - Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
    
    ver #9)
     - Rename iov_iter_extract_mode() to iov_iter_extract_will_pin() and make
       it return true/false not FOLL_PIN/0 as FOLL_PIN is going to be made
       private to mm/.
     - Change extract_flags to extraction_flags.
    
    ver #8)
     - It seems that all DIO is supposed to be done under FOLL_PIN now, and not
       FOLL_GET, so switch to only using pin_user_pages() for user-backed
       iters.
     - Wrap an argument in brackets in the iov_iter_extract_mode() macro.
     - Drop the extract_flags argument to iov_iter_extract_mode() for now
       [hch].
    
    ver #7)
     - Switch to passing in iter-specific flags rather than FOLL_* flags.
     - Drop the direction flags for now.
     - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
     - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
     - Add support for extraction from KVEC-type iters.
     - Use iov_iter_advance() rather than open-coding it.
     - Make BVEC- and KVEC-type skip over initial empty vectors.
    
    ver #6)
     - Add back the function to indicate the cleanup mode.
     - Drop the cleanup_mode return arg to iov_iter_extract_pages().
     - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
       data_source.
    
    ver #4)
     - Use ITER_SOURCE/DEST instead of WRITE/READ.
     - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
    
    ver #3)
     - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
       to get/pin_user_pages_fast()[1].

 include/linux/uio.h |  27 ++++-
 lib/iov_iter.c      | 264 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 290 insertions(+), 1 deletion(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index af70e4c9ea27..cf6658066736 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -347,9 +347,34 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 		.count = count
 	};
 }
-
 /* Flags for iov_iter_get/extract_pages*() */
 /* Allow P2PDMA on the extracted pages */
 #define ITER_ALLOW_P2PDMA	((__force iov_iter_extraction_t)0x01)
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       iov_iter_extraction_t extraction_flags,
+			       size_t *offset0);
+
+/**
+ * iov_iter_extract_will_pin - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ *
+ * Examine the iterator and indicate by returning true or false as to how, if
+ * at all, pages extracted from the iterator will be retained by the extraction
+ * function.
+ *
+ * %true indicates that the pages will have a pin placed in them that the
+ * caller must unpin.  This must be done for DMA/async DIO in order to force
+ * fork() to copy the page for the child (the parent must retain the
+ * original page).
+ *
+ * %false indicates that no measures are taken and that it's up to the caller
+ * to retain the pages.
+ */
+static inline bool iov_iter_extract_will_pin(const struct iov_iter *iter)
+{
+	return user_backed_iter(iter);
+}
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 34ee3764d0fa..8d34b6552179 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1487,3 +1487,267 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     iov_iter_extraction_t extraction_flags,
+					     size_t *offset0)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->bvec->bv_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->bvec++;
+		skip = 0;
+	}
+
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	const void *kaddr;
+	size_t skip = i->iov_offset, offset, len;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->kvec->iov_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	kaddr = i->kvec->iov_base + skip;
+	offset = (unsigned long)kaddr & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	kaddr -= offset;
+	len = offset + maxsize;
+	for (k = 0; k < maxpages; k++) {
+		size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+		if (is_vmalloc_or_module_addr(kaddr))
+			page = vmalloc_to_page(kaddr);
+		else
+			page = virt_to_page(kaddr);
+
+		p[k] = page;
+		len -= seg;
+		kaddr += PAGE_SIZE;
+	}
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used if the iterator is user-backed
+ * (IOBUF/UBUF).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref forces fork() to give the
+ * child a copy of the page.
+ */
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages,
+					   size_t maxsize,
+					   unsigned int maxpages,
+					   iov_iter_extraction_t extraction_flags,
+					   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = 0;
+	size_t offset;
+	int res;
+
+	if (i->data_source == ITER_DEST)
+		gup_flags |= FOLL_WRITE;
+	if (extraction_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extraction_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extraction_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA
+ * be allowed on the pages extracted.
+ *
+ * The iov_iter_extract_will_pin() function can be used to query how cleanup
+ * should be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF), pins will be
+ *      added to the pages, but refs will not be taken.
+ *      iov_iter_extract_will_pin() will return true.
+ *
+ *  (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ *      merely listed; no extra refs or pins are obtained.
+ *      iov_iter_extract_will_pin() will return false.
+ *
+ * Note also:
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page.
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       iov_iter_extraction_t extraction_flags,
+			       size_t *offset0)
+{
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_kvec(i))
+		return iov_iter_extract_kvec_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, extraction_flags,
+						   offset0);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, extraction_flags,
+						     offset0);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);
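
As an illustration of the intended calling convention, here is a minimal
sketch (not part of the patch) of how a caller might use the pair of
helpers above; do_dma_to_page() is a made-up placeholder for the actual
I/O and error handling is abbreviated:

	static int demo_extract_one(struct iov_iter *iter)
	{
		struct page **pages = NULL;	/* NULL: let the helper allocate */
		size_t offset;
		ssize_t len;

		len = iov_iter_extract_pages(iter, &pages, PAGE_SIZE, 1, 0, &offset);
		if (len <= 0) {
			kvfree(pages);		/* kvfree(NULL) is a no-op */
			return len;
		}

		do_dma_to_page(pages[0], offset, len);	/* hypothetical I/O */

		if (iov_iter_extract_will_pin(iter))
			unpin_user_page(pages[0]);	/* user-backed: drop the pin */
		/* KVEC/BVEC/XARRAY sources have nothing to release */

		kvfree(pages);
		return 0;
	}

The same extract-then-release pairing, spread across bio submission and
completion, is what the BIO_PAGE_* flags in the later patches of this
series encode.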


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v12 05/10] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (3 preceding siblings ...)
  2023-02-07 17:12 ` [PATCH v12 04/10] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-02-07 17:13 ` David Howells
  2023-02-07 17:13 ` [PATCH v12 06/10] block: Fix bio_flagged() so that gcc can better optimise it David Howells
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2023-02-07 17:13 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	John Hubbard

ZERO_PAGE can't go away, no need to hold an extra reference.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: linux-fsdevel@vger.kernel.org
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9804714b1751..47db4ead1e74 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
+	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v12 06/10] block: Fix bio_flagged() so that gcc can better optimise it
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (4 preceding siblings ...)
  2023-02-07 17:13 ` [PATCH v12 05/10] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing David Howells
@ 2023-02-07 17:13 ` David Howells
  2023-02-07 17:13 ` [PATCH v12 07/10] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2023-02-07 17:13 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Fix bio_flagged() so that multiple instances of it, such as:

	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
	    bio_flagged(bio, BIO_PAGE_PINNED))

can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).

The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero.  This results in an out-of-line bio_release_pages()
being compiled to something like:

   <+0>:     mov    0x14(%rdi),%eax
   <+3>:     test   $0x1,%al
   <+5>:     jne    0xffffffff816dac53 <bio_release_pages+11>
   <+7>:     test   $0x2,%al
   <+9>:     je     0xffffffff816dac5c <bio_release_pages+20>
   <+11>:    movzbl %sil,%esi
   <+15>:    jmp    0xffffffff816daba1 <__bio_release_pages>
   <+20>:    jmp    0xffffffff81d0b800 <__x86_return_thunk>

However, the test is superfluous as the return type is bool.  Removing it
results in:

   <+0>:     testb  $0x3,0x14(%rdi)
   <+4>:     je     0xffffffff816e4af4 <bio_release_pages+15>
   <+6>:     movzbl %sil,%esi
   <+10>:    jmp    0xffffffff816dab7c <__bio_release_pages>
   <+15>:    jmp    0xffffffff81d0b7c0 <__x86_return_thunk>

instead.

Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
---
 include/linux/bio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index c1da63f6c808..10366b8bdb13 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -227,7 +227,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
 {
-	return (bio->bi_flags & (1U << bit)) != 0;
+	return bio->bi_flags & (1U << bit);
 }
 
 static inline void bio_set_flag(struct bio *bio, unsigned int bit)
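
For comparison, the call-site shape that benefits looks like the sketch
below (illustrative only: bio_needs_release() is a made-up name, the
BIO_PAGE_REFFED/BIO_PAGE_PINNED flags are added by later patches in this
series, and the merged test is the one shown in the assembly above):

	static bool bio_needs_release(struct bio *bio)
	{
		/* With bio_flagged() returning the masked value as bool, the
		 * two calls below can be folded by gcc into a single
		 * "testb $0x3" of bi_flags rather than two branches. */
		return bio_flagged(bio, BIO_PAGE_REFFED) ||
		       bio_flagged(bio, BIO_PAGE_PINNED);
	}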


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v12 07/10] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (5 preceding siblings ...)
  2023-02-07 17:13 ` [PATCH v12 06/10] block: Fix bio_flagged() so that gcc can better optimise it David Howells
@ 2023-02-07 17:13 ` David Howells
  2023-02-07 17:13 ` [PATCH v12 08/10] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2023-02-07 17:13 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

From: Christoph Hellwig <hch@lst.de>

Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning: it is only set when a page reference has been acquired that needs
to be released by bio_release_pages().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #8)
     - Split out from another patch [hch].
     - Don't default to BIO_PAGE_REFFED [hch].
    
    ver #5)
     - Split from patch that uses iov_iter_extract_pages().

 block/bio.c               | 2 +-
 block/blk-map.c           | 1 +
 fs/direct-io.c            | 2 ++
 fs/iomap/direct-io.c      | 1 -
 include/linux/bio.h       | 2 +-
 include/linux/blk_types.h | 2 +-
 6 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index b97f3991c904..bf9bf53232be 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1198,7 +1198,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
 	bio->bi_iter.bi_bvec_done = iter->iov_offset;
 	bio->bi_iter.bi_size = size;
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	bio_set_flag(bio, BIO_CLONED);
 }
 
@@ -1343,6 +1342,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return 0;
 	}
 
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	do {
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
diff --git a/block/blk-map.c b/block/blk-map.c
index 080dd60485be..f1f70b50388d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,6 +282,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	if (blk_queue_pci_p2pdma(rq->q))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
 
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	while (iov_iter_count(iter)) {
 		struct page **pages, *stack_pages[UIO_FASTIOV];
 		ssize_t bytes;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 03d381377ae1..07810465fc9d 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -403,6 +403,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 		bio->bi_end_io = dio_bio_end_aio;
 	else
 		bio->bi_end_io = dio_bio_end_io;
+	/* for now require references for all pages */
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 47db4ead1e74..c0e75900e754 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -202,7 +202,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 10366b8bdb13..805957c99147 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,7 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+	if (bio_flagged(bio, BIO_PAGE_REFFED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..7daa261f4f98 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,7 +318,7 @@ struct bio {
  * bio flags
  */
 enum {
-	BIO_NO_PAGE_REF,	/* don't put release vec pages */
+	BIO_PAGE_REFFED,	/* put pages in bio_release_pages() */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v12 08/10] block: Add BIO_PAGE_PINNED and associated infrastructure
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (6 preceding siblings ...)
  2023-02-07 17:13 ` [PATCH v12 07/10] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
@ 2023-02-07 17:13 ` David Howells
  2023-02-07 17:13 ` [PATCH v12 09/10] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2023-02-07 17:13 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
(FOLL_PIN) and that the pin will need removing.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #10)
     - Drop bio_set_cleanup_mode(), open coding it instead.
    
    ver #9)
     - Only consider pinning in bio_set_cleanup_mode().  Ref'ing pages in
       struct bio is going away.
     - page_put_unpin() is removed; call unpin_user_page() and put_page()
       directly.
     - Use bio_release_page() in __bio_release_pages().
     - BIO_PAGE_PINNED and BIO_PAGE_REFFED can't both be set, so use if-else
       when testing both of them.
    
    ver #8)
     - Move the infrastructure to clean up pinned pages to this patch [hch].
     - Put BIO_PAGE_PINNED before BIO_PAGE_REFFED as the latter should
       probably be removed at some point.  FOLL_PIN can then be renumbered
       first.

 block/bio.c               |  6 +++---
 block/blk.h               | 12 ++++++++++++
 include/linux/bio.h       |  3 ++-
 include/linux/blk_types.h |  1 +
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index bf9bf53232be..547e38883934 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1176,7 +1176,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1496,8 +1496,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will unpin each page and will run one bio_put() against the
+ * BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..f02381405311 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -425,6 +425,18 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+	if (bio_flagged(bio, BIO_PAGE_PINNED))
+		unpin_user_page(page);
+	else if (bio_flagged(bio, BIO_PAGE_REFFED))
+		put_page(page);
+}
+
 struct request_queue *blk_alloc_queue(int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 805957c99147..b2c09997d79c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,8 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (bio_flagged(bio, BIO_PAGE_REFFED))
+	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+	    bio_flagged(bio, BIO_PAGE_PINNED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 7daa261f4f98..a0e339ff3d09 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,6 +318,7 @@ struct bio {
  * bio flags
  */
 enum {
+	BIO_PAGE_PINNED,	/* Unpin pages in bio_release_pages() */
 	BIO_PAGE_REFFED,	/* put pages in bio_release_pages() */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
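
Putting the two flags together, a minimal sketch (illustrative only;
dio_submit_one() and dio_complete_one() are made-up names and the
page-extraction step is elided) of how submission and completion are
expected to pair up with this infrastructure:

	static void dio_submit_one(struct bio *bio, struct iov_iter *iter)
	{
		/* Record how the pages will have been taken so that
		 * bio_release_pages() knows how to let go of them later. */
		if (iov_iter_extract_will_pin(iter))
			bio_set_flag(bio, BIO_PAGE_PINNED);
		/* ...extract pages from iter and add them to the bio... */
		submit_bio(bio);
	}

	static void dio_complete_one(struct bio *bio)
	{
		/* Unpins, puts or leaves each page according to
		 * BIO_PAGE_PINNED / BIO_PAGE_REFFED via bio_release_page(). */
		bio_release_pages(bio, false);
		bio_put(bio);
	}

The following two patches apply exactly this pattern to
bio_iov_iter_get_pages() and bio_map_user_iov().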


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v12 09/10] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (7 preceding siblings ...)
  2023-02-07 17:13 ` [PATCH v12 08/10] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
@ 2023-02-07 17:13 ` David Howells
  2023-02-07 17:13 ` [PATCH v12 10/10] block: convert bio_map_user_iov " David Howells
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2023-02-07 17:13 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.

The pages need to be pinned for DIO rather than having refs taken on them to
prevent VM copy-on-write from malfunctioning during a concurrent fork() (the
result of the I/O could otherwise end up being affected by/visible to the
child process).

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #10)
     - Drop bio_set_cleanup_mode(), open coding it instead.
    
    ver #8)
     - Split the patch up a bit [hch].
     - We should only be using pinned/non-pinned pages and not ref'd pages,
       so adjust the comments appropriately.
    
    ver #7)
     - Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
    
    ver #5)
     - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
       BIO_* flags and got rid of bi_cleanup_mode.
     - Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.

 block/bio.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 547e38883934..fc57f0aa098e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1212,7 +1212,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1226,7 +1226,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1237,10 +1237,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array.  The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
+ * For a multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1272,9 +1272,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, extraction_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, extraction_flags, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1307,7 +1307,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1342,7 +1342,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return 0;
 	}
 
-	bio_set_flag(bio, BIO_PAGE_REFFED);
+	if (iov_iter_extract_will_pin(iter))
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 	do {
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v12 10/10] block: convert bio_map_user_iov to use iov_iter_extract_pages
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (8 preceding siblings ...)
  2023-02-07 17:13 ` [PATCH v12 09/10] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
@ 2023-02-07 17:13 ` David Howells
  2023-02-07 18:49 ` [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) Jens Axboe
  2023-02-07 18:51 ` David Howells
  11 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2023-02-07 17:13 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.

The pages need to be pinned for DIO rather than having refs taken on them
to prevent VM copy-on-write from malfunctioning during a concurrent fork()
(the result of the I/O could otherwise end up being visible to/affected by
the child process).

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #10)
     - Drop bio_set_cleanup_mode(), open coding it instead.
    
    ver #8)
     - Split the patch up a bit [hch].
     - We should only be using pinned/non-pinned pages and not ref'd pages,
       so adjust the comments appropriately.
    
    ver #7)
     - Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
    
    ver #5)
     - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
       BIO_* flags and got rid of bi_cleanup_mode.
     - Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.

 block/blk-map.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index f1f70b50388d..0f1593e144da 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -281,22 +281,21 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 
 	if (blk_queue_pci_p2pdma(rq->q))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
+	if (iov_iter_extract_will_pin(iter))
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 
-	bio_set_flag(bio, BIO_PAGE_REFFED);
 	while (iov_iter_count(iter)) {
-		struct page **pages, *stack_pages[UIO_FASTIOV];
+		struct page *stack_pages[UIO_FASTIOV];
+		struct page **pages = stack_pages;
 		ssize_t bytes;
 		size_t offs;
 		int npages;
 
-		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
-			pages = stack_pages;
-			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-						   nr_vecs, &offs, extraction_flags);
-		} else {
-			bytes = iov_iter_get_pages_alloc(iter, &pages,
-						LONG_MAX, &offs, extraction_flags);
-		}
+		if (nr_vecs > ARRAY_SIZE(stack_pages))
+			pages = NULL;
+
+		bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+					       nr_vecs, extraction_flags, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -318,7 +317,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						bio_release_page(bio, page);
 					break;
 				}
 
@@ -330,7 +329,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			bio_release_page(bio, pages[j++]);
 		if (pages != stack_pages)
 			kvfree(pages);
 		/* couldn't stuff something into bio? */


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list)
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (9 preceding siblings ...)
  2023-02-07 17:13 ` [PATCH v12 10/10] block: convert bio_map_user_iov " David Howells
@ 2023-02-07 18:49 ` Jens Axboe
  2023-02-07 18:51 ` David Howells
  11 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2023-02-07 18:49 UTC (permalink / raw)
  To: David Howells, Al Viro, Christoph Hellwig
  Cc: Matthew Wilcox, Jan Kara, Jeff Layton, David Hildenbrand,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton, linux-fsdevel,
	linux-block, linux-kernel, linux-mm

On 2/7/23 10:12 AM, David Howells wrote:
> Hi Jens, Al, Christoph,
> 
> Here are patches to provide support for extracting pages from an iov_iter
> and to use this in the extraction functions in the block layer bio code.
> 
> The patches make the following changes:
> 
>  (1) Change generic_file_splice_read() to load up an ITER_BVEC iterator
>      with sufficient pages and use that rather than using an ITER_PIPE.
>      This avoids a problem[2] when __iomap_dio_rw() calls iov_iter_revert()
>      to shorten an iterator when it races with truncation.  The reversion
>      causes the pipe iterator to prematurely release the pages it was
>      retaining - despite the read still being in progress.  This caused
>      memory corruption.
> 
>  (2) Remove ITER_PIPE and its paraphernalia as generic_file_splice_read()
>      was the only user.
> 
>  (3) Add a function, iov_iter_extract_pages() to replace
>      iov_iter_get_pages*() that gets refs, pins or just lists the pages as
>      appropriate to the iterator type.
> 
>      Add a function, iov_iter_extract_will_pin() that will indicate from
>      the iterator type how the cleanup is to be performed, returning true
>      if the pages will need unpinning, false otherwise.
> 
>  (4) Make the bio struct carry a pair of flags to indicate the cleanup
>      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (indicating
>      FOLL_GET was used) and BIO_PAGE_PINNED (indicating FOLL_PIN was used)
>      is added.
> 
>      BIO_PAGE_REFFED will go away, but at the moment fs/direct-io.c sets it
>      and this series does not fully address that file.
> 
>  (5) Add a function, bio_release_page(), to release a page appropriately to
>      the cleanup mode indicated by the BIO_PAGE_* flags.
> 
>  (6) Make the iter-to-bio code use iov_iter_extract_pages() to retain the
>      pages appropriately and clean them up later.
> 
>  (7) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

I've updated the for-6.3/iov-extract branch and the for-next branch. This
isn't done to bypass any review, just so we can get some more testing on
this (and because the old one is known broken).

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list)
  2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
                   ` (10 preceding siblings ...)
  2023-02-07 18:49 ` [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) Jens Axboe
@ 2023-02-07 18:51 ` David Howells
  11 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2023-02-07 18:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhowells, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm

Jens Axboe <axboe@kernel.dk> wrote:

> I've updated the for-6.3/iov-extract branch and the for-next branch. This
> isn't done to bypass any review, just so we can get some more testing on
> this (and because the old one is known broken).

Thanks!

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE
  2023-02-07 17:12 ` [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE David Howells
@ 2023-02-08  5:26   ` Christoph Hellwig
  2023-02-08 16:09   ` David Howells
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2023-02-08  5:26 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

Subject nitpick:  this does not touch iomap at all.

> Fix this by replacing the use of an ITER_PIPE iterator with an ITER_BVEC
> iterator for which reversion won't free the buffers.  Bulk allocate all the
> buffers we think we're going to use in advance, do the read synchronously
> and only then trim the buffer down.  The pages we did use get pushed into
> the pipe.
> 
> This is more efficient by virtue of doing a bulk page allocation, but
> slightly less efficient by ignoring any partial page in the pipe.

For the usual case of a buffered read into the iter, this completely
changes the semantics:

 - before the pagecache pages just grew a new reference and were
   added to the pipe buffer, and I/O was done without an extra
   copy
 - with this patch you always allocate an extra set of pages for
   the pipe and copy to it

So I very much doubt this makes anything more efficient, and I don't
think we can just do it.

We'll have to fix reverting of pipe buffers, just as I already pointed
out in your cifs series that tries to play the same game.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE
  2023-02-07 17:12 ` [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE David Howells
  2023-02-08  5:26   ` Christoph Hellwig
@ 2023-02-08 16:09   ` David Howells
  2023-02-08 16:19     ` Christoph Hellwig
                       ` (3 more replies)
  1 sibling, 4 replies; 19+ messages in thread
From: David Howells @ 2023-02-08 16:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

Christoph Hellwig <hch@infradead.org> wrote:

> We'll have to fix reverting of pipe buffers, just as I already pointed
> out in your cifs series that tries to play the same game.

Fixing ITER_PIPE reversion isn't very straightforward, unfortunately.  It's
possible for a partial direct I/O read to use reversion to trim the iterator
(thereby discarding anon pages from a pipe) and then fall back to reading the
rest by buffered I/O.  I suppose I could set a flag on the pipe_buffers
indicating whether iov_iter_revert() is allowed to free them (if they were
spliced in from the pagecache, then they must be freed by reverting them).

How about one of two different solutions?

 (1) Repurpose the function I proposed for generic_file_splice_read() but only
     for splicing from O_DIRECT files; reading from non-O_DIRECT files would
     use an ITER_PIPE as upstream.

 (2) Repurpose the function I proposed for generic_file_splice_read() but only
     for splicing from O_DIRECT files, as (1), but also replace the splice
     from a buffered file with something like the patch below.  This uses
     filemap_get_pages() to do the reading and to get a bunch of folios from
     the pagecache that we can then splice into the pipe directly.

David
---
splice: Do splice read from a buffered file without using ITER_PIPE

Provide a function to do splice read from a buffered file, pulling the
folios out of the pagecache directly by calling filemap_get_pages() to do
any required reading and then pasting the returned folios into the pipe.

A helper function is provided to do the actual folio pasting and will
handle multipage folios by splicing as many of the relevant subpages as
will fit into the pipe.

The ITER_BVEC-based splicing previously added is then only used for
splicing from O_DIRECT files.

The code is loosely based on filemap_read() and might belong in mm/filemap.c
alongside it, as it needs to use filemap_get_pages().

filemap_get_pages() and some of the functions it uses are changed to take
the required byte count rather than an iterator (which was only being used
for iter->count).  This change should probably be split out into a separate patch.

Signed-off-by: David Howells <dhowells@redhat.com>
---
 fs/splice.c             |  177 +++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/pagemap.h |    2 
 mm/filemap.c            |   23 +++---
 3 files changed, 176 insertions(+), 26 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index fba93f4a4f9e..3ccecfa50eda 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -22,6 +22,7 @@
 #include <linux/fs.h>
 #include <linux/file.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/splice.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
@@ -282,22 +283,13 @@ void splice_shrink_spd(struct splice_pipe_desc *spd)
 	kfree(spd->partial);
 }
 
-/**
- * generic_file_splice_read - splice data from file to a pipe
- * @in:		file to splice from
- * @ppos:	position in @in
- * @pipe:	pipe to splice to
- * @len:	number of bytes to splice
- * @flags:	splice modifier flags
- *
- * Description:
- *    Will read pages from given file and fill them into a pipe. Can be
- *    used as long as it has more or less sane ->read_iter().
- *
+/*
+ * Splice data from an O_DIRECT file into pages and then add them to the output
+ * pipe.
  */
-ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
-				 struct pipe_inode_info *pipe, size_t len,
-				 unsigned int flags)
+static ssize_t generic_file_direct_splice_read(struct file *in, loff_t *ppos,
+					       struct pipe_inode_info *pipe,
+					       size_t len, unsigned int flags)
 {
 	LIST_HEAD(pages);
 	struct iov_iter to;
@@ -383,6 +375,161 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	kfree(bv);
 	return ret;
 }
+
+/*
+ * Splice subpages from a folio into a pipe.
+ */
+static size_t splice_folio_into_pipe(struct pipe_inode_info *pipe,
+				     struct folio *folio,
+				     loff_t fpos, size_t size)
+{
+	struct page *page;
+	size_t spliced = 0, offset = offset_in_folio(folio, fpos);
+
+	page = folio_page(folio, offset / PAGE_SIZE);
+	size = min(size, folio_size(folio) - offset);
+	offset %= PAGE_SIZE;
+
+	while (spliced < size &&
+	       !pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+		struct pipe_buffer *buf = &pipe->bufs[pipe->head & (pipe->ring_size - 1)];
+		size_t part = min_t(size_t, PAGE_SIZE - offset, size - spliced);
+
+		*buf = (struct pipe_buffer) {
+			.ops	= &page_cache_pipe_buf_ops,
+			.page	= page,
+			.offset	= offset,
+			.len	= part,
+		};
+		folio_get(folio);
+		pipe->head++;
+		page++;
+		spliced += part;
+		offset = 0;
+	}
+
+	return spliced;
+}
+
+/*
+ * Splice folios from the pagecache of a buffered (ie. non-O_DIRECT) file into
+ * a pipe.
+ */
+static ssize_t generic_file_buffered_splice_read(struct file *in, loff_t *ppos,
+						 struct pipe_inode_info *pipe,
+						 size_t len,
+						 unsigned int flags)
+{
+	struct folio_batch fbatch;
+	size_t total_spliced = 0, used, npages;
+	loff_t isize, end_offset;
+	bool writably_mapped;
+	int i, error = 0;
+
+	struct kiocb iocb = {
+		.ki_filp	= in,
+		.ki_pos		= *ppos,
+	};
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	folio_batch_init(&fbatch);
+
+	do {
+		cond_resched();
+
+		if (*ppos >= i_size_read(file_inode(in)))
+			break;
+
+		iocb.ki_pos = *ppos;
+		error = filemap_get_pages(&iocb, len, &fbatch);
+		if (error < 0)
+			break;
+
+		/*
+		 * i_size must be checked after we know the pages are Uptodate.
+		 *
+		 * Checking i_size after the check allows us to calculate
+		 * the correct value for "nr", which means the zero-filled
+		 * part of the page is not copied back to userspace (unless
+		 * another truncate extends the file - this is desired though).
+		 */
+		isize = i_size_read(file_inode(in));
+		if (unlikely(*ppos >= isize))
+			break;
+		end_offset = min_t(loff_t, isize, *ppos + len);
+
+		/*
+		 * Once we start copying data, we don't want to be touching any
+		 * cachelines that might be contended:
+		 */
+		writably_mapped = mapping_writably_mapped(in->f_mapping);
+
+		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+			struct folio *folio = fbatch.folios[i];
+			size_t n;
+
+			if (folio_pos(folio) >= end_offset)
+				goto out;
+			folio_mark_accessed(folio);
+
+			/*
+			 * If users can be writing to this folio using arbitrary
+			 * virtual addresses, take care of potential aliasing
+			 * before reading the folio on the kernel side.
+			 */
+			if (writably_mapped)
+				flush_dcache_folio(folio);
+
+			n = splice_folio_into_pipe(pipe, folio, *ppos, len);
+			if (!n)
+				goto out;
+			len -= n;
+			total_spliced += n;
+			*ppos += n;
+			in->f_ra.prev_pos = *ppos;
+			if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+				goto out;
+		}
+
+		folio_batch_release(&fbatch);
+	} while (len);
+
+out:
+	folio_batch_release(&fbatch);
+	file_accessed(in);
+
+	return total_spliced ? total_spliced : error;
+}
+
+/**
+ * generic_file_splice_read - splice data from file to a pipe
+ * @in:		file to splice from
+ * @ppos:	position in @in
+ * @pipe:	pipe to splice to
+ * @len:	number of bytes to splice
+ * @flags:	splice modifier flags
+ *
+ * Description:
+ *    Will read pages from given file and fill them into a pipe. Can be
+ *    used as long as it has more or less sane ->read_iter().
+ *
+ */
+ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
+				 struct pipe_inode_info *pipe, size_t len,
+				 unsigned int flags)
+{
+	if (unlikely(*ppos >= file_inode(in)->i_sb->s_maxbytes))
+		return 0;
+	if (unlikely(!len))
+		return 0;
+	if (in->f_flags & O_DIRECT)
+		return generic_file_direct_splice_read(in, ppos, pipe, len, flags);
+	return generic_file_buffered_splice_read(in, ppos, pipe, len, flags);
+}
 EXPORT_SYMBOL(generic_file_splice_read);
 
 const struct pipe_buf_operations default_pipe_buf_ops = {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 29e1f9e76eb6..317fbc34849f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -748,6 +748,8 @@ struct page *read_cache_page(struct address_space *, pgoff_t index,
 		filler_t *filler, struct file *file);
 extern struct page * read_cache_page_gfp(struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
+int filemap_get_pages(struct kiocb *iocb, size_t count,
+		      struct folio_batch *fbatch);
 
 static inline struct page *read_mapping_page(struct address_space *mapping,
 				pgoff_t index, struct file *file)
diff --git a/mm/filemap.c b/mm/filemap.c
index f72e4875bfcb..f8b28352b038 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2440,10 +2440,8 @@ static int filemap_read_folio(struct file *file, filler_t filler,
 }
 
 static bool filemap_range_uptodate(struct address_space *mapping,
-		loff_t pos, struct iov_iter *iter, struct folio *folio)
+		loff_t pos, size_t count, struct folio *folio)
 {
-	int count;
-
 	if (folio_test_uptodate(folio))
 		return true;
 	if (!mapping->a_ops->is_partially_uptodate)
@@ -2451,7 +2449,6 @@ static bool filemap_range_uptodate(struct address_space *mapping,
 	if (mapping->host->i_blkbits >= folio_shift(folio))
 		return false;
 
-	count = iter->count;
 	if (folio_pos(folio) > pos) {
 		count -= folio_pos(folio) - pos;
 		pos = 0;
@@ -2463,7 +2460,7 @@ static bool filemap_range_uptodate(struct address_space *mapping,
 }
 
 static int filemap_update_page(struct kiocb *iocb,
-		struct address_space *mapping, struct iov_iter *iter,
+		struct address_space *mapping, size_t count,
 		struct folio *folio)
 {
 	int error;
@@ -2498,7 +2495,7 @@ static int filemap_update_page(struct kiocb *iocb,
 		goto unlock;
 
 	error = 0;
-	if (filemap_range_uptodate(mapping, iocb->ki_pos, iter, folio))
+	if (filemap_range_uptodate(mapping, iocb->ki_pos, count, folio))
 		goto unlock;
 
 	error = -EAGAIN;
@@ -2574,8 +2571,12 @@ static int filemap_readahead(struct kiocb *iocb, struct file *file,
 	return 0;
 }
 
-static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
-		struct folio_batch *fbatch)
+/*
+ * Extract some folios from the pagecache of a file, reading those pages from
+ * the backing store if necessary and waiting for them.
+ */
+int filemap_get_pages(struct kiocb *iocb, size_t count,
+		      struct folio_batch *fbatch)
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
@@ -2585,7 +2586,7 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 	struct folio *folio;
 	int err = 0;
 
-	last_index = DIV_ROUND_UP(iocb->ki_pos + iter->count, PAGE_SIZE);
+	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
 retry:
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -2618,7 +2619,7 @@ static int filemap_get_pages(struct kiocb *iocb, struct iov_iter *iter,
 		if ((iocb->ki_flags & IOCB_WAITQ) &&
 		    folio_batch_count(fbatch) > 1)
 			iocb->ki_flags |= IOCB_NOWAIT;
-		err = filemap_update_page(iocb, mapping, iter, folio);
+		err = filemap_update_page(iocb, mapping, count, folio);
 		if (err)
 			goto err;
 	}
@@ -2688,7 +2689,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
 			break;
 
-		error = filemap_get_pages(iocb, iter, &fbatch);
+		error = filemap_get_pages(iocb, iov_iter_count(iter), &fbatch);
 		if (error < 0)
 			break;
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE
  2023-02-08 16:09   ` David Howells
@ 2023-02-08 16:19     ` Christoph Hellwig
  2023-02-08 17:21     ` Matthew Wilcox
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2023-02-08 16:19 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

On Wed, Feb 08, 2023 at 04:09:51PM +0000, David Howells wrote:
> How about one of two different solutions?
> 
>  (1) Repurpose the function I proposed for generic_file_splice_read() but only
>      for splicing from O_DIRECT files; reading from non-O_DIRECT files would
>      use an ITER_PIPE as upstream.

Given the number of problems we had with O_DIRECT vs splice, and the
fact that even doing this is a bit pointless, that seems sensible to me.

>      for splicing from O_DIRECT files, as (1), but also replace the splice
>      from a buffered file with something like the patch below.  This uses
>      filemap_get_pages() to do the reading and to get a bunch of folios from
>      the pagecache that we can then splice into the pipe directly.

I definitely like the idea of killing ITER_PIPE.  Aren't the 16
folios in a folio tree often much less than what we could fit into
a single pipe buf?  Unless you have a file system that can use
huge folios for buffered I/O and actually does, this might significantly
limit performance.

With that in mind I'll try to find some time to review your actual
patch, but I'm a little busy at the moment.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE
  2023-02-08 16:09   ` David Howells
  2023-02-08 16:19     ` Christoph Hellwig
@ 2023-02-08 17:21     ` Matthew Wilcox
  2023-02-08 18:15     ` David Howells
  2023-02-09 10:50     ` David Howells
  3 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2023-02-08 17:21 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Jens Axboe, Al Viro, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

On Wed, Feb 08, 2023 at 04:09:51PM +0000, David Howells wrote:
> @@ -2688,7 +2689,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
>  		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
>  			break;
>  
> -		error = filemap_get_pages(iocb, iter, &fbatch);
> +		error = filemap_get_pages(iocb, iov_iter_count(iter), &fbatch);

What's the point in iov_iter_count() anyway?  It's more characters
than iter->count, so why not use it directly?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE
  2023-02-08 16:09   ` David Howells
  2023-02-08 16:19     ` Christoph Hellwig
  2023-02-08 17:21     ` Matthew Wilcox
@ 2023-02-08 18:15     ` David Howells
  2023-02-09 10:50     ` David Howells
  3 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2023-02-08 18:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Christoph Hellwig, Jens Axboe, Al Viro, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

Matthew Wilcox <willy@infradead.org> wrote:

> What's the point in iov_iter_count() anyway?  It's more characters
> than iter->count, so why not use it directly?

I presume it got added for a reason and should be used in preference to direct
access.

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE
  2023-02-08 16:09   ` David Howells
                       ` (2 preceding siblings ...)
  2023-02-08 18:15     ` David Howells
@ 2023-02-09 10:50     ` David Howells
  3 siblings, 0 replies; 19+ messages in thread
From: David Howells @ 2023-02-09 10:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, linux-fsdevel, linux-block, linux-kernel, linux-mm,
	syzbot+a440341a59e3b7142895, Christoph Hellwig, John Hubbard

Christoph Hellwig <hch@infradead.org> wrote:

> I definitely like the idea of killing ITER_PIPE.  Aren't the 16
> folios in a folio tree often much less than what we could fit into
> a single pipe buf?  Unless you have a file system that can use
> huge folios for buffered I/O and actually does, this might significantly
> limit performance.

There's a loop there that repeats the filemap_get_pages() until either the
pipe is full or we hit EOF, the same as in filemap_read() (upon which this is
based).

I want to use filemap_get_pages() if I can as that does all the readahead
stuff.  What might be nice, though, is if I could tell it to return rather
than waiting for a folio to come uptodate if it has already returned a folio
so that I can push the other side of the splice along whilst the read is in
progress.

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-02-09 10:52 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-07 17:12 [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) David Howells
2023-02-07 17:12 ` [PATCH v12 01/10] vfs, iomap: Fix generic_file_splice_read() to avoid reversion of ITER_PIPE David Howells
2023-02-08  5:26   ` Christoph Hellwig
2023-02-08 16:09   ` David Howells
2023-02-08 16:19     ` Christoph Hellwig
2023-02-08 17:21     ` Matthew Wilcox
2023-02-08 18:15     ` David Howells
2023-02-09 10:50     ` David Howells
2023-02-07 17:12 ` [PATCH v12 02/10] iov_iter: Kill ITER_PIPE David Howells
2023-02-07 17:12 ` [PATCH v12 03/10] iov_iter: Define flags to qualify page extraction David Howells
2023-02-07 17:12 ` [PATCH v12 04/10] iov_iter: Add a function to extract a page list from an iterator David Howells
2023-02-07 17:13 ` [PATCH v12 05/10] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing David Howells
2023-02-07 17:13 ` [PATCH v12 06/10] block: Fix bio_flagged() so that gcc can better optimise it David Howells
2023-02-07 17:13 ` [PATCH v12 07/10] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
2023-02-07 17:13 ` [PATCH v12 08/10] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
2023-02-07 17:13 ` [PATCH v12 09/10] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
2023-02-07 17:13 ` [PATCH v12 10/10] block: convert bio_map_user_iov " David Howells
2023-02-07 18:49 ` [PATCH v12 00/10] iov_iter: Improve page extraction (pin or just list) Jens Axboe
2023-02-07 18:51 ` David Howells

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.