* [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE
@ 2023-05-19  7:40 David Howells
  2023-05-19  7:40 ` [PATCH v20 01/32] splice: Fix filemap of a blockdev David Howells
                   ` (35 more replies)
  0 siblings, 36 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm

Hi Jens, Al, Christoph,

The first half of this patchset kills off ITER_PIPE to avoid a race between
truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a
bio with unpinned/unref'ed pages from an O_DIRECT splice read, a race that
causes memory corruption[2].  Instead, we use filemap_splice_read(), which
invokes the buffered file reading code and splices from the pagecache into
the pipe; use direct_splice_read(), which bulk-allocates a buffer, reads
into it and then pushes the filled pages into the pipe; or handle the
splice in filesystem-specific code.
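
For an ordinary file, the end result is roughly the dispatch below (a
minimal sketch of generic_file_splice_read() as it looks after this series,
per (8) below; not the exact patch text):

	ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
					 struct pipe_inode_info *pipe,
					 size_t len, unsigned int flags)
	{
		/* Don't splice beyond the limits of the backing store. */
		if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
			return 0;
		if (unlikely(!len))
			return 0;
		/* O_DIRECT and DAX reads bypass the pagecache entirely. */
		if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
			return direct_splice_read(in, ppos, pipe, len, flags);
		return filemap_splice_read(in, ppos, pipe, len, flags);
	}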

 (1) Simplify the calculations for the number of pages to be reclaimed in
     direct_splice_read().

 (2) Turn do_splice_to() into a helper so that it can be used by overlayfs
     and coda to perform the checks on the lower fs.

 (3) Provide shmem with its own splice_read to handle non-existent pages
     in the pagecache.  We don't want a ->read_folio() as we don't want to
     populate holes, but filemap_get_pages() requires it.

 (4) Provide overlayfs with its own splice_read to call down to a lower
     layer as overlayfs doesn't provide ->read_folio().

 (5) Provide coda with its own splice_read to call down to a lower layer as
     coda doesn't provide ->read_folio().

 (6) Direct ->splice_read to direct_splice_read() in tty, procfs, kernfs
     and random files as they just copy to the output buffer and don't
     splice pages.

 (7) Provide stubs for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2,
     orangefs, xfs and zonefs to do locking and/or revalidation.

 (8) Change generic_file_splice_read() to just switch between
     filemap_splice_read() and direct_splice_read() rather than using
     ITER_PIPE.

 (9) Make cifs use generic_file_splice_read().

(10) Remove ITER_PIPE and its paraphernalia as generic_file_splice_read()
     was the only user.

The second half of the patchset rolls page-pinning out to the bio struct
and the block layer, using iov_iter_extract_pages() to get pages and
marking a bio with BIO_PAGE_PINNED if the data pages attached to it are
pinned.  If the data pages come from a non-user-backed iterator, then the
pages are left unpinned and unref'd, relying on whoever set up the I/O to
retain them.
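
To illustrate, the submission side notes the cleanup mode along these lines
(a minimal sketch based on the iov_iter_extract_will_pin() helper mentioned
in the ver #9 notes below; not the exact patch code):

	if (iov_iter_extract_will_pin(iter))
		/* Pages came from a user-backed iterator and were pinned
		 * with FOLL_PIN; they must be unpinned on completion.
		 */
		bio_set_flag(bio, BIO_PAGE_PINNED);
	/* Otherwise the pages are neither pinned nor ref'd here. */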

(11) Don't hold a ref on ZERO_PAGE in iomap_dio_zero().

(12) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

(13) Make the bio struct carry a pair of flags to indicate the cleanup
     mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (indicating
     FOLL_GET was used) and BIO_PAGE_PINNED (indicating FOLL_PIN was used)
     is added.

     BIO_PAGE_REFFED will go away, but at the moment fs/direct-io.c sets it
     and this series does not fully address that file.

(14) Add a function, bio_release_page(), to release a page appropriately to
     the cleanup mode indicated by the BIO_PAGE_* flags (a rough sketch
     follows this list).

(15) Make bio_iov_iter_get_pages() use iov_iter_extract_pages() to retain
     the pages appropriately and clean them up later.

(16) Make bio_map_user_iov() also use iov_iter_extract_pages().
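
As a rough sketch of the cleanup helper referred to in (14) above (not the
exact patch text), bio_release_page() picks between unpin and put according
to the bio's flags:

	static inline void bio_release_page(struct bio *bio, struct page *page)
	{
		/* Pinned (FOLL_PIN) pages must be unpinned; reffed (FOLL_GET)
		 * pages need their reference dropped; anything else is left
		 * to whoever set up the I/O.
		 */
		if (bio_flagged(bio, BIO_PAGE_PINNED))
			unpin_user_page(page);
		else if (bio_flagged(bio, BIO_PAGE_REFFED))
			put_page(page);
	}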

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Changes:
========
ver #20)
 - Make direct_splice_read() limit the read to eof for regular files and
   blockdevs.
 - Check against s_maxbytes on the backing store, not a devnode inode.
 - Provide stubs for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2,
   orangefs, xfs and zonefs.
 - Always use direct_splice_read() for 9p, trace and sockets.

ver #19)
 - Remove a missed get_page() on the zeropage in shmem_splice_read().

ver #18)
 - Split out the cifs bits from the patch that switches
   generic_file_splice_read() over to using the non-ITER_PIPE splicing.
 - Don't get/put refs on the zeropage in shmem_splice_read().

ver #17)
 - Rename do_splice_to() to vfs_splice_read() and export it so that it can
   be a helper and make overlayfs and coda use it, allowing duplicate
   checks to be removed.

ver #16)
 - The filemap_get_pages() changes are now upstream.
 - filemap_splice_read() and direct_splice_read() are now upstream.
 - iov_iter_extract_pages() is now upstream.

ver #15)
 - Fixed up some errors in overlayfs_splice_read().

ver #14)
 - Some changes to generic_file_buffered_splice_read():
   - Rename to filemap_splice_read() and move to mm/filemap.c.
   - Create a helper, pipe_head_buf().
   - Use init_sync_kiocb().
 - Some changes to generic_file_direct_splice_read():
   - Use alloc_pages_bulk_array() rather than alloc_pages_bulk_list().
   - Use release_pages() instead of __free_page() in a loop.
   - Rename to direct_splice_read().
 - Rearrange the patches to implement filemap_splice_read() and
   direct_splice_read() separately from changing generic_file_splice_read().
 - Don't call generic_file_splice_read() when there isn't a ->read_folio().
 - Insert patches to fix read_folio-less cases:
   - Make tty, procfs, kernfs and (u)random use direct_splice_read().
   - Make overlayfs and coda call down to a lower layer.
   - Give shmem its own splice-read that doesn't insert missing pages.
 - Fixed a min() with mixed type args on some arches.

ver #13)
 - Only use allocation in advance and ITER_BVEC for DIO read-splice.
 - Make buffered read-splice get pages directly from the pagecache.
 - Alter filemap_get_pages() & co. so that it doesn't need an iterator.

ver #12)
 - Added the missing __bitwise on the iov_iter_extraction_t typedef.
 - Rebased on -rc7.
 - Don't specify FOLL_PIN to pin_user_pages_fast().
 - Inserted patch at front to fix race between DIO read and truncation that
   caused memory corruption when iov_iter_revert() got called on an
   ITER_PIPE iterator[2].
 - Inserted a patch after that to remove the now-unused ITER_PIPE and its
   helper functions.
 - Removed the ITER_PIPE bits from iov_iter_extract_pages().

ver #11)
 - Fix iov_iter_extract_kvec_pages() to include the offset into the page in
   the returned starting offset.
 - Use __bitwise for the extraction flags.

ver #10)
 - Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
 - Drop bio_set_cleanup_mode(), open coding it instead.

ver #9)
 - It's now not permitted to use FOLL_PIN outside of mm/, so:
 - Change iov_iter_extract_mode() into iov_iter_extract_will_pin() and
   return true/false instead of FOLL_PIN/0.
 - Drop folio_put_unpin() and page_put_unpin() and instead call
   unpin_user_page() (and put_page()) directly as necessary.
 - Make __bio_release_pages() call bio_release_page() instead of
   unpin_user_page() as there's no BIO_* -> FOLL_* translation to do.
 - Drop the FOLL_* renumbering patch.
 - Change extract_flags to extraction_flags.

ver #8)
 - Import Christoph Hellwig's changes.
   - Split the conversion-to-extraction patch.
   - Drop the extract_flags arg from iov_iter_extract_mode().
   - Don't default bios to BIO_PAGE_REFFED, but set explicitly.
 - Switch FOLL_PIN and FOLL_GET when renumbering so PIN is at bit 0.
 - Switch BIO_PAGE_PINNED and BIO_PAGE_REFFED so PINNED is at bit 0.
 - We should always be using FOLL_PIN (not FOLL_GET) for DIO, so adjust the
   patches for that.

ver #7)
 - For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
   as it turned out to be a lot more complicated, with places not setting
   IOCB_WRITE when they should, for example.
 - Drop all the patches that changed things other than the block layer's
   bio handling.  The netfslib and cifs changes can go into a separate
   patchset.
 - Add support for extracting pages from KVEC-type iterators.
 - When extracting from BVEC/KVEC, skip over empty vecs at the front.

ver #6)
 - Fix write() syscall and co. not setting IOCB_WRITE.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the iterator direction checks from smbd_recv().
 - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
   gup_flags to iov_iter_get/extract_pages*().
 - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Provide a helper to clean up a page.
 - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
   the same numerical values, enforced with an assertion.
 - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
   NFS.
 - Added in the patches to make CIFS do top-to-bottom iterators and use
   various of the added extraction functions.
 - Added a pair of work-in-progress patches to make sk_buff fragments store
   FOLL_GET and FOLL_PIN.

ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6
Link: https://lore.kernel.org/r/20230120175556.3556978-1-dhowells@redhat.com/ # v7
Link: https://lore.kernel.org/r/20230123173007.325544-1-dhowells@redhat.com/ # v8
Link: https://lore.kernel.org/r/20230124170108.1070389-1-dhowells@redhat.com/ # v9
Link: https://lore.kernel.org/r/20230125210657.2335748-1-dhowells@redhat.com/ # v10
Link: https://lore.kernel.org/r/20230126141626.2809643-1-dhowells@redhat.com/ # v11
Link: https://lore.kernel.org/r/20230207171305.3716974-1-dhowells@redhat.com/ # v12
Link: https://lore.kernel.org/r/20230209102954.528942-1-dhowells@redhat.com/ # v13
Link: https://lore.kernel.org/r/20230214171330.2722188-1-dhowells@redhat.com/ # v14
Link: https://lore.kernel.org/r/20230308143754.1976726-1-dhowells@redhat.com/ # v16
Link: https://lore.kernel.org/r/20230308165251.2078898-1-dhowells@redhat.com/ # v17
Link: https://lore.kernel.org/r/20230314220757.3827941-1-dhowells@redhat.com/ # v18
Link: https://lore.kernel.org/r/20230315163549.295454-1-dhowells@redhat.com/ # v19

Additional patches that got folded in:

Link: https://lore.kernel.org/r/20230213134619.2198965-1-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20230213153301.2338806-1-dhowells@redhat.com/ # v2
Link: https://lore.kernel.org/r/20230214083710.2547248-1-dhowells@redhat.com/ # v3

Christoph Hellwig (1):
  block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted
    logic

David Howells (31):
  splice: Fix filemap of a blockdev
  splice: Clean up direct_splice_read() a bit
  splice: Make direct_splice_read() limit to eof where appropriate
  splice: Make do_splice_to() generic and export it
  splice: Make splice from a DAX file use direct_splice_read()
  shmem: Implement splice-read
  overlayfs: Implement splice-read
  coda: Implement splice-read
  tty, proc, kernfs, random: Use direct_splice_read()
  net: Make sock_splice_read() use direct_splice_read() by default
  9p:  Add splice_read stub
  afs: Provide a splice-read stub
  ceph: Provide a splice-read stub
  ecryptfs: Provide a splice-read stub
  ext4: Provide a splice-read stub
  f2fs: Provide a splice-read stub
  nfs: Provide a splice-read stub
  ntfs3: Provide a splice-read stub
  ocfs2: Provide a splice-read stub
  orangefs: Provide a splice-read stub
  xfs: Provide a splice-read stub
  zonefs: Provide a splice-read stub
  splice: Convert trace/seq to use direct_splice_read()
  splice: Do splice read from a file without using ITER_PIPE
  cifs: Use generic_file_splice_read()
  iov_iter: Kill ITER_PIPE
  iomap: Don't get a reference on ZERO_PAGE for direct I/O block
    zeroing
  block: Fix bio_flagged() so that gcc can better optimise it
  block: Add BIO_PAGE_PINNED and associated infrastructure
  block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
  block: convert bio_map_user_iov to use iov_iter_extract_pages

 block/bio.c               |  29 +--
 block/blk-map.c           |  22 +-
 block/blk.h               |  12 ++
 drivers/char/random.c     |   4 +-
 drivers/tty/tty_io.c      |   4 +-
 fs/9p/vfs_file.c          |  26 ++-
 fs/afs/file.c             |  20 +-
 fs/ceph/file.c            |  66 +++++-
 fs/cifs/cifsfs.c          |   8 +-
 fs/cifs/cifsfs.h          |   3 -
 fs/cifs/file.c            |  16 --
 fs/coda/file.c            |  29 ++-
 fs/direct-io.c            |   2 +
 fs/ecryptfs/file.c        |  27 ++-
 fs/ext4/file.c            |  13 +-
 fs/f2fs/file.c            |  68 +++++-
 fs/iomap/direct-io.c      |   1 -
 fs/kernfs/file.c          |   2 +-
 fs/nfs/file.c             |  26 ++-
 fs/nfs/internal.h         |   2 +
 fs/nfs/nfs4file.c         |   2 +-
 fs/ntfs3/file.c           |  36 +++-
 fs/ocfs2/file.c           |  42 +++-
 fs/ocfs2/ocfs2_trace.h    |   3 +
 fs/orangefs/file.c        |  25 ++-
 fs/overlayfs/file.c       |  23 +-
 fs/proc/inode.c           |   4 +-
 fs/proc/proc_sysctl.c     |   2 +-
 fs/proc_namespace.c       |   6 +-
 fs/splice.c               |  93 ++++----
 fs/xfs/xfs_file.c         |  33 ++-
 fs/xfs/xfs_trace.h        |   2 +-
 fs/zonefs/file.c          |  43 +++-
 include/linux/bio.h       |   5 +-
 include/linux/blk_types.h |   3 +-
 include/linux/splice.h    |   3 +
 include/linux/uio.h       |  14 --
 kernel/trace/trace.c      |   2 +-
 lib/iov_iter.c            | 431 +-------------------------------------
 mm/filemap.c              |   7 +-
 mm/shmem.c                | 134 +++++++++++-
 net/socket.c              |   2 +-
 42 files changed, 717 insertions(+), 578 deletions(-)


* [PATCH v20 01/32] splice: Fix filemap of a blockdev
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  8:06   ` Christoph Hellwig
  2023-05-19  9:11   ` David Howells
  2023-05-19  7:40 ` [PATCH v20 02/32] splice: Clean up direct_splice_read() a bit David Howells
                   ` (34 subsequent siblings)
  35 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Steve French,
	Christoph Hellwig, John Hubbard

Fix filemap_splice_read() to use file->f_mapping->host, not file->f_inode,
as the source of the file size because in the case of a block device,
file->f_inode points to the block-special file (which is typically 0
length) and not the backing store.

Fixes: 07073eb01c5f ("splice: Add a func to do a splice from a buffered file without ITER_PIPE")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <stfrench@microsoft.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 mm/filemap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..a2006936a6ae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2900,7 +2900,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
 	do {
 		cond_resched();
 
-		if (*ppos >= i_size_read(file_inode(in)))
+		if (*ppos >= i_size_read(in->f_mapping->host))
 			break;
 
 		iocb.ki_pos = *ppos;
@@ -2916,7 +2916,7 @@ ssize_t filemap_splice_read(struct file *in, loff_t *ppos,
 		 * part of the page is not copied back to userspace (unless
 		 * another truncate extends the file - this is desired though).
 		 */
-		isize = i_size_read(file_inode(in));
+		isize = i_size_read(in->f_mapping->host);
 		if (unlikely(*ppos >= isize))
 			break;
 		end_offset = min_t(loff_t, isize, *ppos + len);


* [PATCH v20 02/32] splice: Clean up direct_splice_read() a bit
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
  2023-05-19  7:40 ` [PATCH v20 01/32] splice: Fix filemap of a blockdev David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 03/32] splice: Make direct_splice_read() limit to eof where appropriate David Howells
                   ` (33 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

Do a couple of cleanups to direct_splice_read():

 (1) Cast to struct page **, not void *.

 (2) Simplify the calculation of the number of pages to keep/reclaim in
     direct_splice_read().

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 fs/splice.c | 19 +++++++------------
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..4db3eee49423 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -312,7 +312,7 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	struct kiocb kiocb;
 	struct page **pages;
 	ssize_t ret;
-	size_t used, npages, chunk, remain, reclaim;
+	size_t used, npages, chunk, remain, keep = 0;
 	int i;
 
 	/* Work out how much data we can actually add into the pipe */
@@ -326,7 +326,7 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	if (!bv)
 		return -ENOMEM;
 
-	pages = (void *)(bv + npages);
+	pages = (struct page **)(bv + npages);
 	npages = alloc_pages_bulk_array(GFP_USER, npages, pages);
 	if (!npages) {
 		kfree(bv);
@@ -349,11 +349,8 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
 
-	reclaim = npages * PAGE_SIZE;
-	remain = 0;
 	if (ret > 0) {
-		reclaim -= ret;
-		remain = ret;
+		keep = DIV_ROUND_UP(ret, PAGE_SIZE);
 		*ppos = kiocb.ki_pos;
 		file_accessed(in);
 	} else if (ret < 0) {
@@ -366,14 +363,12 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	}
 
 	/* Free any pages that didn't get touched at all. */
-	reclaim /= PAGE_SIZE;
-	if (reclaim) {
-		npages -= reclaim;
-		release_pages(pages + npages, reclaim);
-	}
+	if (keep < npages)
+		release_pages(pages + keep, npages - keep);
 
 	/* Push the remaining pages into the pipe. */
-	for (i = 0; i < npages; i++) {
+	remain = ret;
+	for (i = 0; i < keep; i++) {
 		struct pipe_buffer *buf = pipe_head_buf(pipe);
 
 		chunk = min_t(size_t, remain, PAGE_SIZE);


* [PATCH v20 03/32] splice: Make direct_splice_read() limit to eof where appropriate
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
  2023-05-19  7:40 ` [PATCH v20 01/32] splice: Fix filemap of a blockdev David Howells
  2023-05-19  7:40 ` [PATCH v20 02/32] splice: Clean up direct_splice_read() a bit David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  8:09   ` Christoph Hellwig
                     ` (3 more replies)
  2023-05-19  7:40 ` [PATCH v20 04/32] splice: Make do_splice_to() generic and export it David Howells
                   ` (32 subsequent siblings)
  35 siblings, 4 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig

Make direct_splice_read() limit the read to the end of the file for regular
files and block devices, thereby reducing the amount of allocation it will
do in such a case.

This means that the blockdev code doesn't require any special handling as
filemap_splice_read() also limits to i_size.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/splice.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 4db3eee49423..89c8516554d1 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -315,6 +315,19 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	size_t used, npages, chunk, remain, keep = 0;
 	int i;
 
+	if (!len)
+		return 0;
+
+	if (S_ISREG(file_inode(in)->i_mode) ||
+	    S_ISBLK(file_inode(in)->i_mode)) {
+		loff_t i_size = i_size_read(in->f_mapping->host);
+
+		if (*ppos >= i_size)
+			return 0;
+		if (len > i_size - *ppos)
+			len = i_size - *ppos;
+	}
+
 	/* Work out how much data we can actually add into the pipe */
 	used = pipe_occupancy(pipe->head, pipe->tail);
 	npages = max_t(ssize_t, pipe->max_usage - used, 0);


* [PATCH v20 04/32] splice: Make do_splice_to() generic and export it
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (2 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 03/32] splice: Make direct_splice_read() limit to eof where appropriate David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 05/32] splice: Make splice from a DAX file use direct_splice_read() David Howells
                   ` (31 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Miklos Szeredi, John Hubbard, linux-unionfs

Rename do_splice_to() to vfs_splice_read() and export it so that it can be
used as a helper when calling down to a lower layer filesystem as it
performs all the necessary checks[1].

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-unionfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/CAJfpeguGksS3sCigmRi9hJdUec8qtM9f+_9jC1rJhsXT+dV01w@mail.gmail.com/ [1]
---
 fs/splice.c            | 27 ++++++++++++++++++++-------
 include/linux/splice.h |  3 +++
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 89c8516554d1..1e0b7c7038b5 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -881,12 +881,24 @@ static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
 	return out->f_op->splice_write(pipe, out, ppos, len, flags);
 }
 
-/*
- * Attempt to initiate a splice from a file to a pipe.
+/**
+ * vfs_splice_read - Read data from a file and splice it into a pipe
+ * @in:		File to splice from
+ * @ppos:	Input file offset
+ * @pipe:	Pipe to splice to
+ * @len:	Number of bytes to splice
+ * @flags:	Splice modifier flags (SPLICE_F_*)
+ *
+ * Splice the requested amount of data from the input file to the pipe.  This
+ * is synchronous as the caller must hold the pipe lock across the entire
+ * operation.
+ *
+ * If successful, it returns the amount of data spliced, 0 if it hit the EOF or
+ * a hole and a negative error code otherwise.
  */
-static long do_splice_to(struct file *in, loff_t *ppos,
-			 struct pipe_inode_info *pipe, size_t len,
-			 unsigned int flags)
+long vfs_splice_read(struct file *in, loff_t *ppos,
+		     struct pipe_inode_info *pipe, size_t len,
+		     unsigned int flags)
 {
 	unsigned int p_space;
 	int ret;
@@ -909,6 +921,7 @@ static long do_splice_to(struct file *in, loff_t *ppos,
 		return warn_unsupported(in, "read");
 	return in->f_op->splice_read(in, ppos, pipe, len, flags);
 }
+EXPORT_SYMBOL_GPL(vfs_splice_read);
 
 /**
  * splice_direct_to_actor - splices data directly between two non-pipes
@@ -978,7 +991,7 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
 
-		ret = do_splice_to(in, &pos, pipe, len, flags);
+		ret = vfs_splice_read(in, &pos, pipe, len, flags);
 		if (unlikely(ret <= 0))
 			goto out_release;
 
@@ -1126,7 +1139,7 @@ long splice_file_to_pipe(struct file *in,
 	pipe_lock(opipe);
 	ret = wait_for_space(opipe, flags);
 	if (!ret)
-		ret = do_splice_to(in, offset, opipe, len, flags);
+		ret = vfs_splice_read(in, offset, opipe, len, flags);
 	pipe_unlock(opipe);
 	if (ret > 0)
 		wakeup_pipe_readers(opipe);
diff --git a/include/linux/splice.h b/include/linux/splice.h
index a55179fd60fc..8f052c3dae95 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -76,6 +76,9 @@ extern ssize_t splice_to_pipe(struct pipe_inode_info *,
 			      struct splice_pipe_desc *);
 extern ssize_t add_to_pipe(struct pipe_inode_info *,
 			      struct pipe_buffer *);
+long vfs_splice_read(struct file *in, loff_t *ppos,
+		     struct pipe_inode_info *pipe, size_t len,
+		     unsigned int flags);
 extern ssize_t splice_direct_to_actor(struct file *, struct splice_desc *,
 				      splice_direct_actor *);
 extern long do_splice(struct file *in, loff_t *off_in,


* [PATCH v20 05/32] splice: Make splice from a DAX file use direct_splice_read()
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (3 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 04/32] splice: Make do_splice_to() generic and export it David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  8:10   ` Christoph Hellwig
  2023-05-19  8:48   ` David Howells
  2023-05-19  7:40 ` [PATCH v20 06/32] shmem: Implement splice-read David Howells
                   ` (30 subsequent siblings)
  35 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	linux-erofs, linux-ext4, linux-xfs

Make a read splice from a DAX file go directly to direct_splice_read() to
do the reading as filemap_splice_read() is unlikely to find any pagecache
to splice.

I think this affects only erofs, Ext2, Ext4, fuse and XFS.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-erofs@lists.ozlabs.org
cc: linux-ext4@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-xfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/splice.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 1e0b7c7038b5..7b818b5b18d4 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -421,6 +421,11 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 	struct kiocb kiocb;
 	int ret;
 
+#ifdef CONFIG_FS_DAX
+	if (IS_DAX(in->f_mapping->host))
+		return direct_splice_read(in, ppos, pipe, len, flags);
+#endif
+
 	iov_iter_pipe(&to, ITER_DEST, pipe, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;


* [PATCH v20 06/32] shmem: Implement splice-read
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (4 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 05/32] splice: Make splice from a DAX file use direct_splice_read() David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 07/32] overlayfs: " David Howells
                   ` (29 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Daniel Golle, Guenter Roeck,
	Christoph Hellwig, John Hubbard, Hugh Dickins

The new filemap_splice_read() has an implicit expectation via
filemap_get_pages() that ->read_folio() exists if ->readahead() doesn't
fully populate the pagecache of the file it is reading from[1], potentially
leading to a jump to NULL if it doesn't.  shmem, however (and by extension
tmpfs, ramfs and rootfs), doesn't have ->read_folio().

Work around this by equipping shmem with its own splice-read
implementation, based on filemap_splice_read(), but able to paste in
zero_page when there's a page missing.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Daniel Golle <daniel@makrotopia.org>
cc: Guenter Roeck <groeck7@gmail.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Hugh Dickins <hughd@google.com>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/Y+pdHFFTk1TTEBsO@makrotopia.org/ [1]
---

Notes:
    ver #19)
     - Remove a missed get_page() on the zero page.
    
    ver #18)
     - Don't take/release a ref on the zero page.

 mm/shmem.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 133 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e40a08c5c6d7..1f504ed982cf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2731,6 +2731,138 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return retval ? retval : error;
 }
 
+static bool zero_pipe_buf_get(struct pipe_inode_info *pipe,
+			      struct pipe_buffer *buf)
+{
+	return true;
+}
+
+static void zero_pipe_buf_release(struct pipe_inode_info *pipe,
+				  struct pipe_buffer *buf)
+{
+}
+
+static bool zero_pipe_buf_try_steal(struct pipe_inode_info *pipe,
+				    struct pipe_buffer *buf)
+{
+	return false;
+}
+
+static const struct pipe_buf_operations zero_pipe_buf_ops = {
+	.release	= zero_pipe_buf_release,
+	.try_steal	= zero_pipe_buf_try_steal,
+	.get		= zero_pipe_buf_get,
+};
+
+static size_t splice_zeropage_into_pipe(struct pipe_inode_info *pipe,
+					loff_t fpos, size_t size)
+{
+	size_t offset = fpos & ~PAGE_MASK;
+
+	size = min_t(size_t, size, PAGE_SIZE - offset);
+
+	if (!pipe_full(pipe->head, pipe->tail, pipe->max_usage)) {
+		struct pipe_buffer *buf = pipe_head_buf(pipe);
+
+		*buf = (struct pipe_buffer) {
+			.ops	= &zero_pipe_buf_ops,
+			.page	= ZERO_PAGE(0),
+			.offset	= offset,
+			.len	= size,
+		};
+		pipe->head++;
+	}
+
+	return size;
+}
+
+static ssize_t shmem_file_splice_read(struct file *in, loff_t *ppos,
+				      struct pipe_inode_info *pipe,
+				      size_t len, unsigned int flags)
+{
+	struct inode *inode = file_inode(in);
+	struct address_space *mapping = inode->i_mapping;
+	struct folio *folio = NULL;
+	size_t total_spliced = 0, used, npages, n, part;
+	loff_t isize;
+	int error = 0;
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	do {
+		if (*ppos >= i_size_read(inode))
+			break;
+
+		error = shmem_get_folio(inode, *ppos / PAGE_SIZE, &folio, SGP_READ);
+		if (error) {
+			if (error == -EINVAL)
+				error = 0;
+			break;
+		}
+		if (folio) {
+			folio_unlock(folio);
+
+			if (folio_test_hwpoison(folio)) {
+				error = -EIO;
+				break;
+			}
+		}
+
+		/*
+		 * i_size must be checked after we know the pages are Uptodate.
+		 *
+		 * Checking i_size after the check allows us to calculate
+		 * the correct value for "nr", which means the zero-filled
+		 * part of the page is not copied back to userspace (unless
+		 * another truncate extends the file - this is desired though).
+		 */
+		isize = i_size_read(inode);
+		if (unlikely(*ppos >= isize))
+			break;
+		part = min_t(loff_t, isize - *ppos, len);
+
+		if (folio) {
+			/*
+			 * If users can be writing to this page using arbitrary
+			 * virtual addresses, take care about potential aliasing
+			 * before reading the page on the kernel side.
+			 */
+			if (mapping_writably_mapped(mapping))
+				flush_dcache_folio(folio);
+			folio_mark_accessed(folio);
+			/*
+			 * Ok, we have the page, and it's up-to-date, so we can
+			 * now splice it into the pipe.
+			 */
+			n = splice_folio_into_pipe(pipe, folio, *ppos, part);
+			folio_put(folio);
+			folio = NULL;
+		} else {
+			n = splice_zeropage_into_pipe(pipe, *ppos, len);
+		}
+
+		if (!n)
+			break;
+		len -= n;
+		total_spliced += n;
+		*ppos += n;
+		in->f_ra.prev_pos = *ppos;
+		if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
+			break;
+
+		cond_resched();
+	} while (len);
+
+	if (folio)
+		folio_put(folio);
+
+	file_accessed(in);
+	return total_spliced ? total_spliced : error;
+}
+
 static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 {
 	struct address_space *mapping = file->f_mapping;
@@ -3971,7 +4103,7 @@ static const struct file_operations shmem_file_operations = {
 	.read_iter	= shmem_file_read_iter,
 	.write_iter	= generic_file_write_iter,
 	.fsync		= noop_fsync,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= shmem_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
 #endif


* [PATCH v20 07/32] overlayfs: Implement splice-read
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (5 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 06/32] shmem: Implement splice-read David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 08/32] coda: " David Howells
                   ` (28 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard, Miklos Szeredi, linux-unionfs

Implement splice-read for overlayfs by passing the request down a layer
rather than going through generic_file_splice_read() which is going to be
changed to assume that ->read_folio() is present on buffered files.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: linux-unionfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #17)
     - Use vfs_splice_read() helper rather than open-coding checks.
    
    ver #15)
     - Remove redundant FMODE_CAN_ODIRECT check on real file.
     - Do rw_verify_area() on the real file, not the overlay file.
     - Fix a file leak.

 fs/overlayfs/file.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 7c04f033aadd..86197882ff35 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -419,6 +419,27 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
 	return ret;
 }
 
+static ssize_t ovl_splice_read(struct file *in, loff_t *ppos,
+			       struct pipe_inode_info *pipe, size_t len,
+			       unsigned int flags)
+{
+	const struct cred *old_cred;
+	struct fd real;
+	ssize_t ret;
+
+	ret = ovl_real_fdget(in, &real);
+	if (ret)
+		return ret;
+
+	old_cred = ovl_override_creds(file_inode(in)->i_sb);
+	ret = vfs_splice_read(real.file, ppos, pipe, len, flags);
+	revert_creds(old_cred);
+	ovl_file_accessed(in);
+
+	fdput(real);
+	return ret;
+}
+
 /*
  * Calling iter_file_splice_write() directly from overlay's f_op may deadlock
  * due to lock order inversion between pipe->mutex in iter_file_splice_write()
@@ -695,7 +716,7 @@ const struct file_operations ovl_file_operations = {
 	.fallocate	= ovl_fallocate,
 	.fadvise	= ovl_fadvise,
 	.flush		= ovl_flush,
-	.splice_read    = generic_file_splice_read,
+	.splice_read    = ovl_splice_read,
 	.splice_write   = ovl_splice_write,
 
 	.copy_file_range	= ovl_copy_file_range,


* [PATCH v20 08/32] coda: Implement splice-read
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (6 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 07/32] overlayfs: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 09/32] tty, proc, kernfs, random: Use direct_splice_read() David Howells
                   ` (27 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Jan Harkes,
	Christoph Hellwig, John Hubbard, coda, codalist, linux-unionfs

Implement splice-read for coda by passing the request down a layer rather
than going through generic_file_splice_read() which is going to be changed
to assume that ->read_folio() is present on buffered files.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jan Harkes <jaharkes@cs.cmu.edu>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: coda@cs.cmu.edu
cc: codalist@coda.cs.cmu.edu
cc: linux-unionfs@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---

Notes:
    ver #17)
     - Use vfs_splice_read() helper rather than open-coding checks.

 fs/coda/file.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/fs/coda/file.c b/fs/coda/file.c
index 3f3c81e6b1ab..12b26bd13564 100644
--- a/fs/coda/file.c
+++ b/fs/coda/file.c
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/uio.h>
+#include <linux/splice.h>
 
 #include <linux/coda.h>
 #include "coda_psdev.h"
@@ -94,6 +95,32 @@ coda_file_write_iter(struct kiocb *iocb, struct iov_iter *to)
 	return ret;
 }
 
+static ssize_t
+coda_file_splice_read(struct file *coda_file, loff_t *ppos,
+		      struct pipe_inode_info *pipe,
+		      size_t len, unsigned int flags)
+{
+	struct inode *coda_inode = file_inode(coda_file);
+	struct coda_file_info *cfi = coda_ftoc(coda_file);
+	struct file *in = cfi->cfi_container;
+	loff_t ki_pos = *ppos;
+	ssize_t ret;
+
+	ret = venus_access_intent(coda_inode->i_sb, coda_i2f(coda_inode),
+				  &cfi->cfi_access_intent,
+				  len, ki_pos, CODA_ACCESS_TYPE_READ);
+	if (ret)
+		goto finish_read;
+
+	ret = vfs_splice_read(in, ppos, pipe, len, flags);
+
+finish_read:
+	venus_access_intent(coda_inode->i_sb, coda_i2f(coda_inode),
+			    &cfi->cfi_access_intent,
+			    len, ki_pos, CODA_ACCESS_TYPE_READ_FINISH);
+	return ret;
+}
+
 static void
 coda_vm_open(struct vm_area_struct *vma)
 {
@@ -302,5 +329,5 @@ const struct file_operations coda_file_operations = {
 	.open		= coda_open,
 	.release	= coda_release,
 	.fsync		= coda_fsync,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= coda_file_splice_read,
 };


* [PATCH v20 09/32] tty, proc, kernfs, random: Use direct_splice_read()
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (7 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 08/32] coda: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  8:12   ` Christoph Hellwig
  2023-05-19  7:40 ` [PATCH v20 10/32] net: Make sock_splice_read() use direct_splice_read() by default David Howells
                   ` (26 subsequent siblings)
  35 siblings, 1 reply; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Greg Kroah-Hartman,
	Christoph Hellwig, John Hubbard, Miklos Szeredi, Arnd Bergmann

Use direct_splice_read() for tty, procfs, kernfs and random files rather
than going through generic_file_splice_read() as they just copy the file
into the output buffer and don't splice pages.  This avoids the need for
them to have a ->read_folio() to satisfy filemap_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: John Hubbard <jhubbard@nvidia.com>
cc: David Hildenbrand <david@redhat.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Arnd Bergmann <arnd@arndb.de>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 drivers/char/random.c | 4 ++--
 drivers/tty/tty_io.c  | 4 ++--
 fs/kernfs/file.c      | 2 +-
 fs/proc/inode.c       | 4 ++--
 fs/proc/proc_sysctl.c | 2 +-
 fs/proc_namespace.c   | 6 +++---
 6 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 253f2ddb8913..2b47b500126c 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1546,7 +1546,7 @@ const struct file_operations random_fops = {
 	.compat_ioctl = compat_ptr_ioctl,
 	.fasync = random_fasync,
 	.llseek = noop_llseek,
-	.splice_read = generic_file_splice_read,
+	.splice_read = direct_splice_read,
 	.splice_write = iter_file_splice_write,
 };
 
@@ -1557,7 +1557,7 @@ const struct file_operations urandom_fops = {
 	.compat_ioctl = compat_ptr_ioctl,
 	.fasync = random_fasync,
 	.llseek = noop_llseek,
-	.splice_read = generic_file_splice_read,
+	.splice_read = direct_splice_read,
 	.splice_write = iter_file_splice_write,
 };
 
diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index c84be40fb8df..0b910a2af8e4 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -466,7 +466,7 @@ static const struct file_operations tty_fops = {
 	.llseek		= no_llseek,
 	.read_iter	= tty_read,
 	.write_iter	= tty_write,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.poll		= tty_poll,
 	.unlocked_ioctl	= tty_ioctl,
@@ -481,7 +481,7 @@ static const struct file_operations console_fops = {
 	.llseek		= no_llseek,
 	.read_iter	= tty_read,
 	.write_iter	= redirected_tty_write,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.poll		= tty_poll,
 	.unlocked_ioctl	= tty_ioctl,
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 40c4661f15b7..be43afa2138d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -1011,7 +1011,7 @@ const struct file_operations kernfs_file_fops = {
 	.release	= kernfs_fop_release,
 	.poll		= kernfs_fop_poll,
 	.fsync		= noop_fsync,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.splice_write	= iter_file_splice_write,
 };
 
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index f495fdb39151..711f12706469 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -591,7 +591,7 @@ static const struct file_operations proc_iter_file_ops = {
 	.llseek		= proc_reg_llseek,
 	.read_iter	= proc_reg_read_iter,
 	.write		= proc_reg_write,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.poll		= proc_reg_poll,
 	.unlocked_ioctl	= proc_reg_unlocked_ioctl,
 	.mmap		= proc_reg_mmap,
@@ -617,7 +617,7 @@ static const struct file_operations proc_reg_file_ops_compat = {
 static const struct file_operations proc_iter_file_ops_compat = {
 	.llseek		= proc_reg_llseek,
 	.read_iter	= proc_reg_read_iter,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.write		= proc_reg_write,
 	.poll		= proc_reg_poll,
 	.unlocked_ioctl	= proc_reg_unlocked_ioctl,
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 8038833ff5b0..72f1a8ac7802 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -868,7 +868,7 @@ static const struct file_operations proc_sys_file_operations = {
 	.poll		= proc_sys_poll,
 	.read_iter	= proc_sys_read,
 	.write_iter	= proc_sys_write,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.llseek		= default_llseek,
 };
diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
index 846f9455ae22..492abbbeff5e 100644
--- a/fs/proc_namespace.c
+++ b/fs/proc_namespace.c
@@ -324,7 +324,7 @@ static int mountstats_open(struct inode *inode, struct file *file)
 const struct file_operations proc_mounts_operations = {
 	.open		= mounts_open,
 	.read_iter	= seq_read_iter,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.llseek		= seq_lseek,
 	.release	= mounts_release,
 	.poll		= mounts_poll,
@@ -333,7 +333,7 @@ const struct file_operations proc_mounts_operations = {
 const struct file_operations proc_mountinfo_operations = {
 	.open		= mountinfo_open,
 	.read_iter	= seq_read_iter,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.llseek		= seq_lseek,
 	.release	= mounts_release,
 	.poll		= mounts_poll,
@@ -342,7 +342,7 @@ const struct file_operations proc_mountinfo_operations = {
 const struct file_operations proc_mountstats_operations = {
 	.open		= mountstats_open,
 	.read_iter	= seq_read_iter,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.llseek		= seq_lseek,
 	.release	= mounts_release,
 };


* [PATCH v20 10/32] net: Make sock_splice_read() use direct_splice_read() by default
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (8 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 09/32] tty, proc, kernfs, random: Use direct_splice_read() David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 11/32] 9p: Add splice_read stub David Howells
                   ` (25 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Christoph Hellwig,
	netdev

Make sock_splice_read() use direct_splice_read() by default as
filemap_splice_read() will return immediately with 0 because a socket has
no pagecache and appears to be a zero-size file.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: netdev@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 net/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/socket.c b/net/socket.c
index b7e01d0fe082..40b204a47aba 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1093,7 +1093,7 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
 	struct socket *sock = file->private_data;
 
 	if (unlikely(!sock->ops->splice_read))
-		return generic_file_splice_read(file, ppos, pipe, len, flags);
+		return direct_splice_read(file, ppos, pipe, len, flags);
 
 	return sock->ops->splice_read(sock, ppos, pipe, len, flags);
 }


* [PATCH v20 11/32] 9p:  Add splice_read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (9 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 10/32] net: Make sock_splice_read() use direct_splice_read() by default David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 12/32] afs: Provide a splice-read stub David Howells
                   ` (24 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Dominique Martinet, Eric Van Hensbergen, Latchesar Ionkov,
	Christian Schoenebeck, v9fs

Add a splice_read stub for 9p.  We should use direct_splice_read() if
P9L_DIRECT is set and filemap_splice_read() otherwise.  Note that this
doesn't seem to be particularly related to O_DIRECT.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/9p/vfs_file.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index 6c31b8c8112d..0f3cb439ab2d 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -374,6 +374,28 @@ v9fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return ret;
 }
 
+/*
+ * v9fs_file_splice_read - splice-read from a file
+ * @in: The 9p file to read from
+ * @ppos: Where to find/update the file position
+ * @pipe: The pipe to splice into
+ * @len: The maximum amount of data to splice
+ * @flags: SPLICE_F_* flags
+ */
+static ssize_t v9fs_file_splice_read(struct file *in, loff_t *ppos,
+				     struct pipe_inode_info *pipe,
+				     size_t len, unsigned int flags)
+{
+	struct p9_fid *fid = in->private_data;
+
+	p9_debug(P9_DEBUG_VFS, "fid %d count %zu offset %lld\n",
+		 fid->fid, len, *ppos);
+
+	if (fid->mode & P9L_DIRECT)
+		return direct_splice_read(in, ppos, pipe, len, flags);
+	return filemap_splice_read(in, ppos, pipe, len, flags);
+}
+
 /**
  * v9fs_file_write_iter - write to a file
  * @iocb: The operation parameters
@@ -569,7 +591,7 @@ const struct file_operations v9fs_file_operations = {
 	.release = v9fs_dir_release,
 	.lock = v9fs_file_lock,
 	.mmap = generic_file_readonly_mmap,
-	.splice_read = generic_file_splice_read,
+	.splice_read = v9fs_file_splice_read,
 	.splice_write = iter_file_splice_write,
 	.fsync = v9fs_file_fsync,
 };
@@ -583,7 +605,7 @@ const struct file_operations v9fs_file_operations_dotl = {
 	.lock = v9fs_file_lock_dotl,
 	.flock = v9fs_file_flock_dotl,
 	.mmap = v9fs_file_mmap,
-	.splice_read = generic_file_splice_read,
+	.splice_read = v9fs_file_splice_read,
 	.splice_write = iter_file_splice_write,
 	.fsync = v9fs_file_fsync_dotl,
 };


* [PATCH v20 12/32] afs: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (10 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 11/32] 9p: Add splice_read stub David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 13/32] ceph: " David Howells
                   ` (23 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Marc Dionne, linux-afs

Provide a splice_read stub for AFS to call afs_validate() before going into
generic_file_splice_read() so that i_size can be brought as up to date as
possible.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/afs/file.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index 719b31374879..d8a6b09dadf7 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -25,6 +25,9 @@ static void afs_invalidate_folio(struct folio *folio, size_t offset,
 static bool afs_release_folio(struct folio *folio, gfp_t gfp_flags);
 
 static ssize_t afs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
+static ssize_t afs_file_splice_read(struct file *in, loff_t *ppos,
+				    struct pipe_inode_info *pipe,
+				    size_t len, unsigned int flags);
 static void afs_vm_open(struct vm_area_struct *area);
 static void afs_vm_close(struct vm_area_struct *area);
 static vm_fault_t afs_vm_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff);
@@ -36,7 +39,7 @@ const struct file_operations afs_file_operations = {
 	.read_iter	= afs_file_read_iter,
 	.write_iter	= afs_file_write,
 	.mmap		= afs_file_mmap,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= afs_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fsync		= afs_fsync,
 	.lock		= afs_lock,
@@ -587,3 +590,18 @@ static ssize_t afs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 
 	return generic_file_read_iter(iocb, iter);
 }
+
+static ssize_t afs_file_splice_read(struct file *in, loff_t *ppos,
+				    struct pipe_inode_info *pipe,
+				    size_t len, unsigned int flags)
+{
+	struct afs_vnode *vnode = AFS_FS_I(file_inode(in));
+	struct afs_file *af = in->private_data;
+	int ret;
+
+	ret = afs_validate(vnode, af->key);
+	if (ret < 0)
+		return ret;
+
+	return generic_file_splice_read(in, ppos, pipe, len, flags);
+}



* [PATCH v20 13/32] ceph: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (11 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 12/32] afs: Provide a splice-read stub David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  8:40   ` Xiubo Li
  2023-05-19  9:24   ` David Howells
  2023-05-19  7:40 ` [PATCH v20 14/32] ecryptfs: " David Howells
                   ` (22 subsequent siblings)
  35 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig, Xiubo Li,
	Ilya Dryomov, ceph-devel

Provide a splice_read stub for Ceph.  This does the inode shutdown check
before proceeding and jumps to direct_splice_read() if O_DIRECT is set, the
file has inline data or is a synchronous file.

We try to get the FILE_RD cap plus the FILE_CACHE and/or FILE_LAZYIO caps and
hold them across filemap_splice_read().  If we get neither the FILE_CACHE nor
the FILE_LAZYIO cap, we use direct_splice_read() instead.
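
For example, on a plain open (no LAZYIO), want comes out as CEPH_CAP_FILE_CACHE
on top of the required CEPH_CAP_FILE_RD.  If the MDS grants only FILE_RD, the
read drops back to direct_splice_read(); a grant that also includes FILE_CACHE
is recorded in an rw context and held across filemap_splice_read() before the
cap refs are released.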

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Xiubo Li <xiubli@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: ceph-devel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/ceph/file.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index f4d8bf7dec88..382dd5901748 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1745,6 +1745,70 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return ret;
 }
 
+/*
+ * Wrap filemap_splice_read with checks for cap bits on the inode.
+ * Atomically grab references, so that those bits are not released
+ * back to the MDS mid-read.
+ */
+static ssize_t ceph_splice_read(struct file *in, loff_t *ppos,
+				struct pipe_inode_info *pipe,
+				size_t len, unsigned int flags)
+{
+	struct ceph_file_info *fi = in->private_data;
+	struct inode *inode = file_inode(in);
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	ssize_t ret;
+	int want = 0, got = 0;
+	CEPH_DEFINE_RW_CONTEXT(rw_ctx, 0);
+
+	dout("splice_read %p %llx.%llx %llu~%zu trying to get caps on %p\n",
+	     inode, ceph_vinop(inode), *ppos, len, inode);
+
+	if (ceph_inode_is_shutdown(inode))
+		return -ESTALE;
+
+	if ((in->f_flags & O_DIRECT) ||
+	    ceph_has_inline_data(ci) ||
+	    (fi->flags & CEPH_F_SYNC))
+		return direct_splice_read(in, ppos, pipe, len, flags);
+
+	ceph_start_io_read(inode);
+
+	want = CEPH_CAP_FILE_CACHE;
+	if (fi->fmode & CEPH_FILE_MODE_LAZY)
+		want |= CEPH_CAP_FILE_LAZYIO;
+
+	ret = ceph_get_caps(in, CEPH_CAP_FILE_RD, want, -1, &got);
+	if (ret < 0) {
+		ceph_end_io_read(inode);
+		return ret;
+	}
+
+	if ((got & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO)) == 0) {
+		dout("splice_read/sync %p %llx.%llx %llu~%zu got cap refs on %s\n",
+		     inode, ceph_vinop(inode), *ppos, len,
+		     ceph_cap_string(got));
+
+		ceph_end_io_read(inode);
+		return direct_splice_read(in, ppos, pipe, len, flags);
+	}
+
+	dout("splice_read %p %llx.%llx %llu~%zu got cap refs on %s\n",
+	     inode, ceph_vinop(inode), *ppos, len, ceph_cap_string(got));
+
+	rw_ctx.caps = got;
+	ceph_add_rw_context(fi, &rw_ctx);
+	ret = filemap_splice_read(in, ppos, pipe, len, flags);
+	ceph_del_rw_context(fi, &rw_ctx);
+
+	dout("splice_read %p %llx.%llx dropping cap refs on %s = %zd\n",
+	     inode, ceph_vinop(inode), ceph_cap_string(got), ret);
+	ceph_put_cap_refs(ci, got);
+
+	ceph_end_io_read(inode);
+	return ret;
+}
+
 /*
  * Take cap references to avoid releasing caps to MDS mid-write.
  *
@@ -2593,7 +2657,7 @@ const struct file_operations ceph_file_fops = {
 	.lock = ceph_lock,
 	.setlease = simple_nosetlease,
 	.flock = ceph_flock,
-	.splice_read = generic_file_splice_read,
+	.splice_read = ceph_splice_read,
 	.splice_write = iter_file_splice_write,
 	.unlocked_ioctl = ceph_ioctl,
 	.compat_ioctl = compat_ptr_ioctl,



* [PATCH v20 14/32] ecryptfs: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (12 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 13/32] ceph: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 15/32] ext4: " David Howells
                   ` (21 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Tyler Hicks, ecryptfs

Provide a splice_read stub for ecryptfs to update the access time on the
lower file after the operation.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Tyler Hicks <code@tyhicks.com>
cc: ecryptfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/ecryptfs/file.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/fs/ecryptfs/file.c b/fs/ecryptfs/file.c
index 268b74499c28..284395587be0 100644
--- a/fs/ecryptfs/file.c
+++ b/fs/ecryptfs/file.c
@@ -44,6 +44,31 @@ static ssize_t ecryptfs_read_update_atime(struct kiocb *iocb,
 	return rc;
 }
 
+/*
+ * ecryptfs_splice_read_update_atime
+ *
+ * generic_file_splice_read updates the atime of upper layer inode.  But, it
+ * doesn't give us a chance to update the atime of the lower layer inode.  This
+ * function is a wrapper around generic_file_splice_read.  It updates the
+ * atime of the lower level inode if generic_file_splice_read returns without
+ * any errors.  This is only used for splice reads of regular files;
+ * directories cannot be spliced from.
+ */
+static ssize_t ecryptfs_splice_read_update_atime(struct file *in, loff_t *ppos,
+						 struct pipe_inode_info *pipe,
+						 size_t len, unsigned int flags)
+{
+	ssize_t rc;
+	const struct path *path;
+
+	rc = generic_file_splice_read(in, ppos, pipe, len, flags);
+	if (rc >= 0) {
+		path = ecryptfs_dentry_to_lower_path(in->f_path.dentry);
+		touch_atime(path);
+	}
+	return rc;
+}
+
 struct ecryptfs_getdents_callback {
 	struct dir_context ctx;
 	struct dir_context *caller;
@@ -414,5 +439,5 @@ const struct file_operations ecryptfs_main_fops = {
 	.release = ecryptfs_release,
 	.fsync = ecryptfs_fsync,
 	.fasync = ecryptfs_fasync,
-	.splice_read = generic_file_splice_read,
+	.splice_read = ecryptfs_splice_read_update_atime,
 };



* [PATCH v20 15/32] ext4: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (13 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 14/32] ecryptfs: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 16/32] f2fs: " David Howells
                   ` (20 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Theodore Ts'o, Andreas Dilger, linux-ext4

Provide a splice_read stub for Ext4.  This does the inode shutdown check
before proceeding.  Splicing from DAX files is handled by
generic_file_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: "Theodore Ts'o" <tytso@mit.edu>
cc: Andreas Dilger <adilger.kernel@dilger.ca>
cc: linux-ext4@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/ext4/file.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d101b3b0c7da..9f8bbd9d131c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -147,6 +147,17 @@ static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return generic_file_read_iter(iocb, to);
 }
 
+static ssize_t ext4_file_splice_read(struct file *in, loff_t *ppos,
+				     struct pipe_inode_info *pipe,
+				     size_t len, unsigned int flags)
+{
+	struct inode *inode = file_inode(in);
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+		return -EIO;
+	return generic_file_splice_read(in, ppos, pipe, len, flags);
+}
+
 /*
  * Called when an inode is released. Note that this is different
  * from ext4_file_open: open gets called at every open, but release
@@ -957,7 +968,7 @@ const struct file_operations ext4_file_operations = {
 	.release	= ext4_release_file,
 	.fsync		= ext4_sync_file,
 	.get_unmapped_area = thp_get_unmapped_area,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= ext4_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ext4_fallocate,
 };



* [PATCH v20 16/32] f2fs: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (14 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 15/32] ext4: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 17/32] nfs: " David Howells
                   ` (19 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Jaegeuk Kim, Chao Yu, linux-f2fs-devel

Provide a splice_read stub for f2fs.  This does some checks and tracing
before proceeding and falls back from direct I/O to buffered I/O if that is
forced or if the request is misaligned.  It also updates the iostats after a
buffered read.
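
For instance (assuming a 4KiB filesystem block size on a device with 512-byte
logical blocks): a splice at *ppos == 512 is logical-block aligned but not
fs-block aligned and so falls back to filemap_splice_read(); at *ppos == 4096
it stays on direct_splice_read(); and at *ppos == 100 it is attempted as
direct I/O and, as the comment in the code below notes, will fail with
-EINVAL.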

[Note: I wonder if I should only do the tracing if I call
filemap_splice_read() as direct_splice_read() will call
f2fs_file_read_iter().]

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jaegeuk Kim <jaegeuk@kernel.org>
cc: Chao Yu <chao@kernel.org>
cc: linux-f2fs-devel@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/f2fs/file.c | 68 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 5ac53d2627d2..3723387f4a87 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4367,22 +4367,23 @@ static ssize_t f2fs_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return ret;
 }
 
-static void f2fs_trace_rw_file_path(struct kiocb *iocb, size_t count, int rw)
+static void f2fs_trace_rw_file_path(struct file *file, loff_t pos, size_t count,
+				    int rw)
 {
-	struct inode *inode = file_inode(iocb->ki_filp);
+	struct inode *inode = file_inode(file);
 	char *buf, *path;
 
 	buf = f2fs_getname(F2FS_I_SB(inode));
 	if (!buf)
 		return;
-	path = dentry_path_raw(file_dentry(iocb->ki_filp), buf, PATH_MAX);
+	path = dentry_path_raw(file_dentry(file), buf, PATH_MAX);
 	if (IS_ERR(path))
 		goto free_buf;
 	if (rw == WRITE)
-		trace_f2fs_datawrite_start(inode, iocb->ki_pos, count,
+		trace_f2fs_datawrite_start(inode, pos, count,
 				current->pid, path, current->comm);
 	else
-		trace_f2fs_dataread_start(inode, iocb->ki_pos, count,
+		trace_f2fs_dataread_start(inode, pos, count,
 				current->pid, path, current->comm);
 free_buf:
 	f2fs_putname(buf);
@@ -4398,7 +4399,8 @@ static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		return -EOPNOTSUPP;
 
 	if (trace_f2fs_dataread_start_enabled())
-		f2fs_trace_rw_file_path(iocb, iov_iter_count(to), READ);
+		f2fs_trace_rw_file_path(iocb->ki_filp, iocb->ki_pos,
+					iov_iter_count(to), READ);
 
 	if (f2fs_should_use_dio(inode, iocb, to)) {
 		ret = f2fs_dio_read_iter(iocb, to);
@@ -4413,6 +4415,55 @@ static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return ret;
 }
 
+static ssize_t f2fs_file_splice_read(struct file *in, loff_t *ppos,
+				     struct pipe_inode_info *pipe,
+				     size_t len, unsigned int flags)
+{
+	struct inode *inode = file_inode(in);
+	const loff_t pos = *ppos;
+	ssize_t ret;
+
+	if (!f2fs_is_compress_backend_ready(inode))
+		return -EOPNOTSUPP;
+
+	if (trace_f2fs_dataread_start_enabled())
+		f2fs_trace_rw_file_path(in, pos, len, READ);
+
+	if (in->f_flags & O_DIRECT) {
+		if (f2fs_force_buffered_io(inode, READ))
+			goto buffered;
+
+		/*
+		 * Direct I/O not aligned to the disk's logical_block_size will
+		 * be attempted, but will fail with -EINVAL.
+		 *
+		 * f2fs additionally requires that direct I/O be aligned to the
+		 * filesystem block size, which is often a stricter
+		 * requirement.  However, f2fs traditionally falls back to
+		 * buffered I/O on requests that are logical_block_size-aligned
+		 * but not fs-block aligned.
+		 *
+		 * The below logic implements this behavior.
+		 */
+		if (!IS_ALIGNED(pos, i_blocksize(inode)) &&
+		    IS_ALIGNED(pos, bdev_logical_block_size(inode->i_sb->s_bdev)))
+			goto buffered;
+		ret = direct_splice_read(in, ppos, pipe, len, flags);
+		goto done;
+	}
+
+buffered:
+	ret = filemap_splice_read(in, ppos, pipe, len, flags);
+	if (ret > 0)
+		f2fs_update_iostat(F2FS_I_SB(inode), inode,
+				   APP_BUFFERED_READ_IO, ret);
+
+done:
+	if (trace_f2fs_dataread_end_enabled())
+		trace_f2fs_dataread_end(inode, pos, ret);
+	return ret;
+}
+
 static ssize_t f2fs_write_checks(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct file *file = iocb->ki_filp;
@@ -4714,7 +4765,8 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		ret = preallocated;
 	} else {
 		if (trace_f2fs_datawrite_start_enabled())
-			f2fs_trace_rw_file_path(iocb, orig_count, WRITE);
+			f2fs_trace_rw_file_path(iocb->ki_filp, iocb->ki_pos,
+						orig_count, WRITE);
 
 		/* Do the actual write. */
 		ret = dio ?
@@ -4919,7 +4971,7 @@ const struct file_operations f2fs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= f2fs_compat_ioctl,
 #endif
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= f2fs_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fadvise	= f2fs_file_fadvise,
 };



* [PATCH v20 17/32] nfs: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (15 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 16/32] f2fs: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 18/32] ntfs3: " David Howells
                   ` (18 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Trond Myklebust, Anna Schumaker, linux-nfs

Provide a splice_read stub for NFS.  This locks the inode around
filemap_splice_read() and revalidates the mapping.  It cannot do this
around direct_splice_read() as that calls ->read_iter().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/nfs/file.c     | 26 +++++++++++++++++++++++++-
 fs/nfs/internal.h |  2 ++
 fs/nfs/nfs4file.c |  2 +-
 3 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index f0edf5a36237..91fe48b47f3c 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -178,6 +178,30 @@ nfs_file_read(struct kiocb *iocb, struct iov_iter *to)
 }
 EXPORT_SYMBOL_GPL(nfs_file_read);
 
+ssize_t
+nfs_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe,
+		     size_t len, unsigned int flags)
+{
+	struct inode *inode = file_inode(in);
+	ssize_t result;
+
+	if (in->f_flags & O_DIRECT)
+		return direct_splice_read(in, ppos, pipe, len, flags);
+
+	dprintk("NFS: splice_read(%pD2, %zu@%lu)\n", in, len, *pos);
+
+	nfs_start_io_read(inode);
+	result = nfs_revalidate_mapping(inode, in->f_mapping);
+	if (!result) {
+		result = filemap_splice_read(in, ppos, pipe, len, flags);
+		if (result > 0)
+			nfs_add_stats(inode, NFSIOS_NORMALREADBYTES, result);
+	}
+	nfs_end_io_read(inode);
+	return result;
+}
+EXPORT_SYMBOL_GPL(nfs_file_splice_read);
+
 int
 nfs_file_mmap(struct file * file, struct vm_area_struct * vma)
 {
@@ -879,7 +903,7 @@ const struct file_operations nfs_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= nfs_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= simple_nosetlease,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 3cc027d3bd58..b5f21d35d30e 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -416,6 +416,8 @@ static inline __u32 nfs_access_xattr_mask(const struct nfs_server *server)
 int nfs_file_fsync(struct file *file, loff_t start, loff_t end, int datasync);
 loff_t nfs_file_llseek(struct file *, loff_t, int);
 ssize_t nfs_file_read(struct kiocb *, struct iov_iter *);
+ssize_t nfs_file_splice_read(struct file *in, loff_t *ppos, struct pipe_inode_info *pipe,
+			     size_t len, unsigned int flags);
 int nfs_file_mmap(struct file *, struct vm_area_struct *);
 ssize_t nfs_file_write(struct kiocb *, struct iov_iter *);
 int nfs_file_release(struct inode *, struct file *);
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 2563ed8580f3..4aeadd6e1a6d 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -454,7 +454,7 @@ const struct file_operations nfs4_file_operations = {
 	.fsync		= nfs_file_fsync,
 	.lock		= nfs_lock,
 	.flock		= nfs_flock,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= nfs_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.check_flags	= nfs_check_flags,
 	.setlease	= nfs4_setlease,



* [PATCH v20 18/32] ntfs3: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (16 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 17/32] nfs: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 19/32] ocfs2: " David Howells
                   ` (17 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Konstantin Komarov, ntfs3

Provide a splice_read stub for NTFS3 to reject the cases that can't be
handled (encrypted files, compressed files opened O_DIRECT, externally
compressed files when CONFIG_NTFS3_LZX_XPRESS is not enabled, and
deduplicated files) before handing off to generic_file_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
cc: ntfs3@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/ntfs3/file.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
index 9a3d55c367d9..c27814946aa3 100644
--- a/fs/ntfs3/file.c
+++ b/fs/ntfs3/file.c
@@ -744,6 +744,40 @@ static ssize_t ntfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 	return generic_file_read_iter(iocb, iter);
 }
 
+static ssize_t ntfs_file_splice_read(struct file *in, loff_t *ppos,
+				     struct pipe_inode_info *pipe,
+				     size_t len, unsigned int flags)
+{
+	struct inode *inode = in->f_mapping->host;
+	struct ntfs_inode *ni = ntfs_i(inode);
+
+	if (is_encrypted(ni)) {
+		ntfs_inode_warn(inode, "encrypted i/o not supported");
+		return -EOPNOTSUPP;
+	}
+
+	if (is_compressed(ni) && (in->f_flags & O_DIRECT)) {
+		ntfs_inode_warn(inode, "direct i/o + compressed not supported");
+		return -EOPNOTSUPP;
+	}
+
+#ifndef CONFIG_NTFS3_LZX_XPRESS
+	if (ni->ni_flags & NI_FLAG_COMPRESSED_MASK) {
+		ntfs_inode_warn(
+			inode,
+			"activate CONFIG_NTFS3_LZX_XPRESS to read external compressed files");
+		return -EOPNOTSUPP;
+	}
+#endif
+
+	if (is_dedup(ni)) {
+		ntfs_inode_warn(inode, "read deduplicated not supported");
+		return -EOPNOTSUPP;
+	}
+
+	return generic_file_splice_read(in, ppos, pipe, len, flags);
+}
+
 /*
  * ntfs_get_frame_pages
  *
@@ -1159,7 +1193,7 @@ const struct file_operations ntfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= ntfs_compat_ioctl,
 #endif
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= ntfs_file_splice_read,
 	.mmap		= ntfs_file_mmap,
 	.open		= ntfs_file_open,
 	.fsync		= generic_file_fsync,



* [PATCH v20 19/32] ocfs2: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (17 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 18/32] ntfs3: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 20/32] orangefs: " David Howells
                   ` (16 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Mark Fasheh, Joel Becker, Joseph Qi, ocfs2-devel

Provide a splice_read stub for ocfs2.  This emits trace lines and does an
atime lock/update before calling filemap_splice_read().  It doesn't do this
around direct_splice_read() as that will call ->read_iter().

A couple of new tracepoints (ocfs2_file_splice_read and
filemap_splice_read_ret) are added for this purpose.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Mark Fasheh <mark@fasheh.com>
cc: Joel Becker <jlbec@evilplan.org>
cc: Joseph Qi <joseph.qi@linux.alibaba.com>
cc: ocfs2-devel@oss.oracle.com
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/ocfs2/file.c        | 42 +++++++++++++++++++++++++++++++++++++++++-
 fs/ocfs2/ocfs2_trace.h |  3 +++
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index efb09de4343d..27c54a71ec57 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2581,6 +2581,46 @@ static ssize_t ocfs2_file_read_iter(struct kiocb *iocb,
 	return ret;
 }
 
+static ssize_t ocfs2_file_splice_read(struct file *in, loff_t *ppos,
+				      struct pipe_inode_info *pipe,
+				      size_t len, unsigned int flags)
+{
+	struct inode *inode = file_inode(in);
+	ssize_t ret = 0;
+	int lock_level = 0;
+
+	trace_ocfs2_file_splice_read(inode, in, in->f_path.dentry,
+				     (unsigned long long)OCFS2_I(inode)->ip_blkno,
+				     in->f_path.dentry->d_name.len,
+				     in->f_path.dentry->d_name.name,
+				     0);
+
+	if (in->f_flags & O_DIRECT)
+		return direct_splice_read(in, ppos, pipe, len, flags);
+
+	/*
+	 * We're fine letting folks race truncates and extending writes with
+	 * read across the cluster, just like they can locally.  Hence no
+	 * rw_lock during read.
+	 *
+	 * Take and drop the meta data lock to update inode fields like i_size.
+	 * This allows the checks down in filemap_splice_read() a chance of
+	 * actually working.
+	 */
+	ret = ocfs2_inode_lock_atime(inode, in->f_path.mnt, &lock_level, true);
+	if (ret < 0) {
+		if (ret != -EAGAIN)
+			mlog_errno(ret);
+		goto bail;
+	}
+	ocfs2_inode_unlock(inode, lock_level);
+
+	ret = filemap_splice_read(in, ppos, pipe, len, flags);
+	trace_filemap_splice_read_ret(ret);
+bail:
+	return ret;
+}
+
 /* Refer generic_file_llseek_unlocked() */
 static loff_t ocfs2_file_llseek(struct file *file, loff_t offset, int whence)
 {
@@ -2744,7 +2784,7 @@ const struct file_operations ocfs2_fops = {
 #endif
 	.lock		= ocfs2_lock,
 	.flock		= ocfs2_flock,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= ocfs2_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= ocfs2_fallocate,
 	.remap_file_range = ocfs2_remap_file_range,
diff --git a/fs/ocfs2/ocfs2_trace.h b/fs/ocfs2/ocfs2_trace.h
index dc4bce1649c1..b8c3d1702076 100644
--- a/fs/ocfs2/ocfs2_trace.h
+++ b/fs/ocfs2/ocfs2_trace.h
@@ -1319,6 +1319,8 @@ DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_write);
 
 DEFINE_OCFS2_FILE_OPS(ocfs2_file_read_iter);
 
+DEFINE_OCFS2_FILE_OPS(ocfs2_file_splice_read);
+
 DEFINE_OCFS2_ULL_ULL_ULL_EVENT(ocfs2_truncate_file);
 
 DEFINE_OCFS2_ULL_ULL_EVENT(ocfs2_truncate_file_error);
@@ -1470,6 +1472,7 @@ TRACE_EVENT(ocfs2_prepare_inode_for_write,
 );
 
 DEFINE_OCFS2_INT_EVENT(generic_file_read_iter_ret);
+DEFINE_OCFS2_INT_EVENT(filemap_splice_read_ret);
 
 /* End of trace events for fs/ocfs2/file.c. */
 



* [PATCH v20 20/32] orangefs: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (18 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 19/32] ocfs2: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 21/32] xfs: " David Howells
                   ` (15 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Mike Marshall, Martin Brandenburg, devel

Provide a splice_read stub for orangefs.  This increments the read stats and
then locks the inode across the call to filemap_splice_read() and a
revalidation of the mapping.

The lock must not be held across the call to direct_splice_read() as this
calls ->read_iter(); this might also mean that the stat increment doesn't
need to be done in DIO mode.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Mike Marshall <hubcap@omnibond.com>
cc: Martin Brandenburg <martin@omnibond.com>
cc: devel@lists.orangefs.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/orangefs/file.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index 1a4301a38aa7..c1eae145e4a8 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -337,6 +337,29 @@ static ssize_t orangefs_file_read_iter(struct kiocb *iocb,
 	return ret;
 }
 
+static ssize_t orangefs_file_splice_read(struct file *in, loff_t *ppos,
+					 struct pipe_inode_info *pipe,
+					 size_t len, unsigned int flags)
+{
+	struct inode *inode = file_inode(in);
+	ssize_t ret;
+
+	orangefs_stats.reads++;
+
+	if (in->f_flags & O_DIRECT)
+		return direct_splice_read(in, ppos, pipe, len, flags);
+
+	down_read(&inode->i_rwsem);
+	ret = orangefs_revalidate_mapping(inode);
+	if (ret)
+		goto out;
+
+	ret = filemap_splice_read(in, ppos, pipe, len, flags);
+out:
+	up_read(&inode->i_rwsem);
+	return ret;
+}
+
 static ssize_t orangefs_file_write_iter(struct kiocb *iocb,
     struct iov_iter *iter)
 {
@@ -556,7 +579,7 @@ const struct file_operations orangefs_file_operations = {
 	.lock		= orangefs_lock,
 	.mmap		= orangefs_file_mmap,
 	.open		= generic_file_open,
-	.splice_read    = generic_file_splice_read,
+	.splice_read    = orangefs_file_splice_read,
 	.splice_write   = iter_file_splice_write,
 	.flush		= orangefs_flush,
 	.release	= orangefs_file_release,



* [PATCH v20 21/32] xfs: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (19 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 20/32] orangefs: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 22/32] zonefs: " David Howells
                   ` (14 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Darrick J . Wong, linux-xfs

Provide a splice_read stub for XFS.  This bumps the read-calls stat and does
a shutdown check before proceeding.  For buffered I/O, it emits a new trace
line and locks the inode across the call to filemap_splice_read(), adding to
the read-bytes stat afterwards.  Splicing from DAX files either copies the
data into a buffer (DIO) or splices from the pagecache.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Darrick J. Wong <djwong@kernel.org>
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/xfs/xfs_file.c  | 33 ++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_trace.h |  2 +-
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index aede746541f8..bb1ec755a709 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -306,6 +306,37 @@ xfs_file_read_iter(
 	return ret;
 }
 
+STATIC ssize_t
+xfs_file_splice_read(
+	struct file		*in,
+	loff_t			*ppos,
+	struct pipe_inode_info	*pipe,
+	size_t			len,
+	unsigned int		flags)
+{
+	struct inode		*inode = file_inode(in);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	ssize_t			ret = 0;
+
+	XFS_STATS_INC(mp, xs_read_calls);
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	if (in->f_flags & O_DIRECT)
+		return direct_splice_read(in, ppos, pipe, len, flags);
+
+	trace_xfs_file_splice_read(ip, *ppos, len);
+
+	xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	ret = filemap_splice_read(in, ppos, pipe, len, flags);
+	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+	if (ret > 0)
+		XFS_STATS_ADD(mp, xs_read_bytes, ret);
+	return ret;
+}
+
 /*
  * Common pre-write limit and setup checks.
  *
@@ -1423,7 +1454,7 @@ const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
 	.write_iter	= xfs_file_write_iter,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= xfs_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.iopoll		= iocb_bio_iopoll,
 	.unlocked_ioctl	= xfs_file_ioctl,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index cd4ca5b1fcb0..4db669203149 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1445,7 +1445,6 @@ DEFINE_RW_EVENT(xfs_file_direct_write);
 DEFINE_RW_EVENT(xfs_file_dax_write);
 DEFINE_RW_EVENT(xfs_reflink_bounce_dio_write);
 
-
 DECLARE_EVENT_CLASS(xfs_imap_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_off_t offset, ssize_t count,
 		 int whichfork, struct xfs_bmbt_irec *irec),
@@ -1535,6 +1534,7 @@ DEFINE_SIMPLE_IO_EVENT(xfs_zero_eof);
 DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write);
 DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_unwritten);
 DEFINE_SIMPLE_IO_EVENT(xfs_end_io_direct_write_append);
+DEFINE_SIMPLE_IO_EVENT(xfs_file_splice_read);
 
 DECLARE_EVENT_CLASS(xfs_itrunc_class,
 	TP_PROTO(struct xfs_inode *ip, xfs_fsize_t new_size),



* [PATCH v20 22/32] zonefs: Provide a splice-read stub
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (20 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 21/32] xfs: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read() David Howells
                   ` (13 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Darrick J . Wong, linux-xfs

Provide a splice_read stub for zonefs.  This refuses to read offline zones
and returns 0 for reads at or beyond the zone capacity before proceeding.
For buffered I/O, it locks the inode across the call to filemap_splice_read()
and clamps the length to the written data in case of a racing truncation.
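
As an illustration (the numbers are made up): with 1 MiB of data written into
the zone, a 512 KiB splice request at *ppos == 768 KiB is clamped to 256 KiB,
while a request at or beyond the zone capacity returns 0 without taking the
lock at all.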

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Darrick J. Wong <djwong@kernel.org>
cc: linux-xfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 fs/zonefs/file.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
index 132f01d3461f..a8d7bae4910d 100644
--- a/fs/zonefs/file.c
+++ b/fs/zonefs/file.c
@@ -752,6 +752,47 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return ret;
 }
 
+static ssize_t zonefs_file_splice_read(struct file *in, loff_t *ppos,
+				       struct pipe_inode_info *pipe,
+				       size_t len, unsigned int flags)
+{
+	struct inode *inode = file_inode(in);
+	struct zonefs_inode_info *zi = ZONEFS_I(inode);
+	struct zonefs_zone *z = zonefs_inode_zone(inode);
+	loff_t isize;
+	ssize_t ret = 0;
+
+	/* Offline zones cannot be read */
+	if (unlikely(IS_IMMUTABLE(inode) && !(inode->i_mode & 0777)))
+		return -EPERM;
+
+	if (*ppos >= z->z_capacity)
+		return 0;
+
+	if (in->f_flags & O_DIRECT)
+		return direct_splice_read(in, ppos, pipe, len, flags);
+
+	inode_lock_shared(inode);
+
+	/* Limit read operations to written data */
+	mutex_lock(&zi->i_truncate_mutex);
+	isize = i_size_read(inode);
+	if (*ppos >= isize)
+		len = 0;
+	else
+		len = min_t(loff_t, len, isize - *ppos);
+	mutex_unlock(&zi->i_truncate_mutex);
+
+	if (len > 0) {
+		ret = filemap_splice_read(in, ppos, pipe, len, flags);
+		if (ret == -EIO)
+			zonefs_io_error(inode, false);
+	}
+
+	inode_unlock_shared(inode);
+	return ret;
+}
+
 /*
  * Write open accounting is done only for sequential files.
  */
@@ -896,7 +937,7 @@ const struct file_operations zonefs_file_operations = {
 	.llseek		= zonefs_file_llseek,
 	.read_iter	= zonefs_file_read_iter,
 	.write_iter	= zonefs_file_write_iter,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= zonefs_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.iopoll		= iocb_bio_iopoll,
 };



* [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read()
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (21 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 22/32] zonefs: " David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-22 14:29   ` Steven Rostedt
  2023-05-22 14:50   ` David Howells
  2023-05-19  7:40 ` [PATCH v20 24/32] splice: Do splice read from a file without using ITER_PIPE David Howells
                   ` (12 subsequent siblings)
  35 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Steven Rostedt, Masami Hiramatsu, linux-trace-kernel

For the splice from the trace seq buffer, just use direct_splice_read().

In the future, something better can probably be done by gifting pages from
seq->buf into the pipe, but that would require changing seq->buf into a
vmap over an array of pages.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steven Rostedt <rostedt@goodmis.org>
cc: Masami Hiramatsu <mhiramat@kernel.org>
cc: linux-kernel@vger.kernel.org
cc: linux-trace-kernel@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 kernel/trace/trace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ebc59781456a..b664020efcb7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5171,7 +5171,7 @@ static const struct file_operations tracing_fops = {
 	.open		= tracing_open,
 	.read		= seq_read,
 	.read_iter	= seq_read_iter,
-	.splice_read	= generic_file_splice_read,
+	.splice_read	= direct_splice_read,
 	.write		= tracing_write_stub,
 	.llseek		= tracing_lseek,
 	.release	= tracing_release,



* [PATCH v20 24/32] splice: Do splice read from a file without using ITER_PIPE
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (22 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read() David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 25/32] cifs: Use generic_file_splice_read() David Howells
                   ` (11 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Steve French, John Hubbard, linux-cifs

Make generic_file_splice_read() use filemap_splice_read() and
direct_splice_read() rather than using an ITER_PIPE and call_read_iter().

With this, ITER_PIPE is no longer used.
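
For reference, the common consumer of this path is an ordinary splice() from a
regular file into a pipe.  A minimal userspace sketch that exercises it (the
file name is hypothetical and most error handling is omitted) might look like:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		/* Add O_DIRECT to the open flags to take the direct path. */
		int fd = open("/tmp/data", O_RDONLY);
		int pfd[2];
		char buf[65536];
		ssize_t n;

		if (fd < 0 || pipe(pfd) < 0)
			return 1;

		/* This ends up in the file's ->splice_read(). */
		n = splice(fd, NULL, pfd[1], NULL, sizeof(buf), SPLICE_F_MOVE);
		if (n > 0)
			n = read(pfd[0], buf, n);	/* drain the pipe */

		close(pfd[0]);
		close(pfd[1]);
		close(fd);
		return n < 0 ? 1 : 0;
	}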

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steve French <smfrench@gmail.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---

Notes:
    ver #20)
     - Use s_maxbytes from the backing store (in->f_mapping), not the front
       inode (especially for a blockdev).
    
    ver #18)
     - Split out the change to cifs to make it use generic_file_splice_read().
     - Split out the unexport of filemap_splice_read() (still needed by cifs).

 fs/splice.c | 31 +++++++------------------------
 1 file changed, 7 insertions(+), 24 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 7b818b5b18d4..56d9802729d0 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -417,34 +417,17 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
 				 struct pipe_inode_info *pipe, size_t len,
 				 unsigned int flags)
 {
-	struct iov_iter to;
-	struct kiocb kiocb;
-	int ret;
-
+	if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
+		return 0;
+	if (unlikely(!len))
+		return 0;
 #ifdef CONFIG_FS_DAX
 	if (IS_DAX(in->f_mapping->host))
 		return direct_splice_read(in, ppos, pipe, len, flags);
 #endif
-
-	iov_iter_pipe(&to, ITER_DEST, pipe, len);
-	init_sync_kiocb(&kiocb, in);
-	kiocb.ki_pos = *ppos;
-	ret = call_read_iter(in, &kiocb, &to);
-	if (ret > 0) {
-		*ppos = kiocb.ki_pos;
-		file_accessed(in);
-	} else if (ret < 0) {
-		/* free what was emitted */
-		pipe_discard_from(pipe, to.start_head);
-		/*
-		 * callers of ->splice_read() expect -EAGAIN on
-		 * "can't put anything in there", rather than -EFAULT.
-		 */
-		if (ret == -EFAULT)
-			ret = -EAGAIN;
-	}
-
-	return ret;
+	if ((in->f_flags & O_DIRECT))
+		return direct_splice_read(in, ppos, pipe, len, flags);
+	return filemap_splice_read(in, ppos, pipe, len, flags);
 }
 EXPORT_SYMBOL(generic_file_splice_read);
 



* [PATCH v20 25/32] cifs: Use generic_file_splice_read()
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (23 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 24/32] splice: Do splice read from a file without using ITER_PIPE David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 26/32] iov_iter: Kill ITER_PIPE David Howells
                   ` (10 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Paulo Alcantara, Steve French, John Hubbard, linux-cifs

Make cifs use generic_file_splice_read() rather than doing it for itself.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steve French <smfrench@gmail.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---

Notes:
    ver #20)
     - Don't remove the export of filemap_splice_read().
    
    ver #18)
     - Split out from change to generic_file_splice_read().

 fs/cifs/cifsfs.c |  8 ++++----
 fs/cifs/cifsfs.h |  3 ---
 fs/cifs/file.c   | 16 ----------------
 3 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 43a4d8603db3..257587a6cade 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1376,7 +1376,7 @@ const struct file_operations cifs_file_ops = {
 	.fsync = cifs_fsync,
 	.flush = cifs_flush,
 	.mmap  = cifs_file_mmap,
-	.splice_read = cifs_splice_read,
+	.splice_read = generic_file_splice_read,
 	.splice_write = iter_file_splice_write,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
@@ -1396,7 +1396,7 @@ const struct file_operations cifs_file_strict_ops = {
 	.fsync = cifs_strict_fsync,
 	.flush = cifs_flush,
 	.mmap = cifs_file_strict_mmap,
-	.splice_read = cifs_splice_read,
+	.splice_read = generic_file_splice_read,
 	.splice_write = iter_file_splice_write,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
@@ -1434,7 +1434,7 @@ const struct file_operations cifs_file_nobrl_ops = {
 	.fsync = cifs_fsync,
 	.flush = cifs_flush,
 	.mmap  = cifs_file_mmap,
-	.splice_read = cifs_splice_read,
+	.splice_read = generic_file_splice_read,
 	.splice_write = iter_file_splice_write,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
@@ -1452,7 +1452,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = {
 	.fsync = cifs_strict_fsync,
 	.flush = cifs_flush,
 	.mmap = cifs_file_strict_mmap,
-	.splice_read = cifs_splice_read,
+	.splice_read = generic_file_splice_read,
 	.splice_write = iter_file_splice_write,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 74cd6fafb33e..d7274eefc666 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -100,9 +100,6 @@ extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
 extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
 extern ssize_t cifs_direct_writev(struct kiocb *iocb, struct iov_iter *from);
 extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
-extern ssize_t cifs_splice_read(struct file *in, loff_t *ppos,
-				struct pipe_inode_info *pipe, size_t len,
-				unsigned int flags);
 extern int cifs_flock(struct file *pfile, int cmd, struct file_lock *plock);
 extern int cifs_lock(struct file *, int, struct file_lock *);
 extern int cifs_fsync(struct file *, loff_t, loff_t, int);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c5fcefdfd797..375a8037a3f3 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -5078,19 +5078,3 @@ const struct address_space_operations cifs_addr_ops_smallbuf = {
 	.launder_folio = cifs_launder_folio,
 	.migrate_folio = filemap_migrate_folio,
 };
-
-/*
- * Splice data from a file into a pipe.
- */
-ssize_t cifs_splice_read(struct file *in, loff_t *ppos,
-			 struct pipe_inode_info *pipe, size_t len,
-			 unsigned int flags)
-{
-	if (unlikely(*ppos >= file_inode(in)->i_sb->s_maxbytes))
-		return 0;
-	if (unlikely(!len))
-		return 0;
-	if (in->f_flags & O_DIRECT)
-		return direct_splice_read(in, ppos, pipe, len, flags);
-	return filemap_splice_read(in, ppos, pipe, len, flags);
-}



* [PATCH v20 26/32] iov_iter: Kill ITER_PIPE
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (24 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 25/32] cifs: Use generic_file_splice_read() David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 27/32] iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing David Howells
                   ` (9 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

The ITER_PIPE-type iterator was only used for generic_file_splice_read(),
but that has now been switched to either pull pages directly from the
pagecache for buffered file splice-reads or to use ITER_BVEC instead for
O_DIRECT file splice-reads.  This leaves ITER_PIPE unused - so remove it.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---

Notes:
    ver #20)
     - Rebase and remove additional pipe reference.

 include/linux/uio.h |  14 --
 lib/iov_iter.c      | 431 +-------------------------------------------
 mm/filemap.c        |   3 +-
 3 files changed, 4 insertions(+), 444 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 044c1d8c230c..60c342bb7ab8 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -11,7 +11,6 @@
 #include <uapi/linux/uio.h>
 
 struct page;
-struct pipe_inode_info;
 
 typedef unsigned int __bitwise iov_iter_extraction_t;
 
@@ -25,7 +24,6 @@ enum iter_type {
 	ITER_IOVEC,
 	ITER_KVEC,
 	ITER_BVEC,
-	ITER_PIPE,
 	ITER_XARRAY,
 	ITER_DISCARD,
 	ITER_UBUF,
@@ -74,7 +72,6 @@ struct iov_iter {
 				const struct kvec *kvec;
 				const struct bio_vec *bvec;
 				struct xarray *xarray;
-				struct pipe_inode_info *pipe;
 				void __user *ubuf;
 			};
 			size_t count;
@@ -82,10 +79,6 @@ struct iov_iter {
 	};
 	union {
 		unsigned long nr_segs;
-		struct {
-			unsigned int head;
-			unsigned int start_head;
-		};
 		loff_t xarray_start;
 	};
 };
@@ -133,11 +126,6 @@ static inline bool iov_iter_is_bvec(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_BVEC;
 }
 
-static inline bool iov_iter_is_pipe(const struct iov_iter *i)
-{
-	return iov_iter_type(i) == ITER_PIPE;
-}
-
 static inline bool iov_iter_is_discard(const struct iov_iter *i)
 {
 	return iov_iter_type(i) == ITER_DISCARD;
@@ -286,8 +274,6 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec
 			unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
-			size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 960223ed9199..f18138e0292a 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -14,8 +14,6 @@
 #include <linux/scatterlist.h>
 #include <linux/instrumented.h>
 
-#define PIPE_PARANOIA /* for now */
-
 /* covers ubuf and kbuf alike */
 #define iterate_buf(i, n, base, len, off, __p, STEP) {		\
 	size_t __maybe_unused off = 0;				\
@@ -198,150 +196,6 @@ static int copyin(void *to, const void __user *from, size_t n)
 	return res;
 }
 
-#ifdef PIPE_PARANOIA
-static bool sanity(const struct iov_iter *i)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int p_head = pipe->head;
-	unsigned int p_tail = pipe->tail;
-	unsigned int p_occupancy = pipe_occupancy(p_head, p_tail);
-	unsigned int i_head = i->head;
-	unsigned int idx;
-
-	if (i->last_offset) {
-		struct pipe_buffer *p;
-		if (unlikely(p_occupancy == 0))
-			goto Bad;	// pipe must be non-empty
-		if (unlikely(i_head != p_head - 1))
-			goto Bad;	// must be at the last buffer...
-
-		p = pipe_buf(pipe, i_head);
-		if (unlikely(p->offset + p->len != abs(i->last_offset)))
-			goto Bad;	// ... at the end of segment
-	} else {
-		if (i_head != p_head)
-			goto Bad;	// must be right after the last buffer
-	}
-	return true;
-Bad:
-	printk(KERN_ERR "idx = %d, offset = %d\n", i_head, i->last_offset);
-	printk(KERN_ERR "head = %d, tail = %d, buffers = %d\n",
-			p_head, p_tail, pipe->ring_size);
-	for (idx = 0; idx < pipe->ring_size; idx++)
-		printk(KERN_ERR "[%p %p %d %d]\n",
-			pipe->bufs[idx].ops,
-			pipe->bufs[idx].page,
-			pipe->bufs[idx].offset,
-			pipe->bufs[idx].len);
-	WARN_ON(1);
-	return false;
-}
-#else
-#define sanity(i) true
-#endif
-
-static struct page *push_anon(struct pipe_inode_info *pipe, unsigned size)
-{
-	struct page *page = alloc_page(GFP_USER);
-	if (page) {
-		struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
-		*buf = (struct pipe_buffer) {
-			.ops = &default_pipe_buf_ops,
-			.page = page,
-			.offset = 0,
-			.len = size
-		};
-	}
-	return page;
-}
-
-static void push_page(struct pipe_inode_info *pipe, struct page *page,
-			unsigned int offset, unsigned int size)
-{
-	struct pipe_buffer *buf = pipe_buf(pipe, pipe->head++);
-	*buf = (struct pipe_buffer) {
-		.ops = &page_cache_pipe_buf_ops,
-		.page = page,
-		.offset = offset,
-		.len = size
-	};
-	get_page(page);
-}
-
-static inline int last_offset(const struct pipe_buffer *buf)
-{
-	if (buf->ops == &default_pipe_buf_ops)
-		return buf->len;	// buf->offset is 0 for those
-	else
-		return -(buf->offset + buf->len);
-}
-
-static struct page *append_pipe(struct iov_iter *i, size_t size,
-				unsigned int *off)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	int offset = i->last_offset;
-	struct pipe_buffer *buf;
-	struct page *page;
-
-	if (offset > 0 && offset < PAGE_SIZE) {
-		// some space in the last buffer; add to it
-		buf = pipe_buf(pipe, pipe->head - 1);
-		size = min_t(size_t, size, PAGE_SIZE - offset);
-		buf->len += size;
-		i->last_offset += size;
-		i->count -= size;
-		*off = offset;
-		return buf->page;
-	}
-	// OK, we need a new buffer
-	*off = 0;
-	size = min_t(size_t, size, PAGE_SIZE);
-	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-		return NULL;
-	page = push_anon(pipe, size);
-	if (!page)
-		return NULL;
-	i->head = pipe->head - 1;
-	i->last_offset = size;
-	i->count -= size;
-	return page;
-}
-
-static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t bytes,
-			 struct iov_iter *i)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	unsigned int head = pipe->head;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	if (offset && i->last_offset == -offset) { // could we merge it?
-		struct pipe_buffer *buf = pipe_buf(pipe, head - 1);
-		if (buf->page == page) {
-			buf->len += bytes;
-			i->last_offset -= bytes;
-			i->count -= bytes;
-			return bytes;
-		}
-	}
-	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-		return 0;
-
-	push_page(pipe, page, offset, bytes);
-	i->last_offset = -(offset + bytes);
-	i->head = head;
-	i->count -= bytes;
-	return bytes;
-}
-
 /*
  * fault_in_iov_iter_readable - fault in iov iterator for reading
  * @i: iterator
@@ -446,46 +300,6 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_init);
 
-// returns the offset in partial buffer (if any)
-static inline unsigned int pipe_npages(const struct iov_iter *i, int *npages)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	int used = pipe->head - pipe->tail;
-	int off = i->last_offset;
-
-	*npages = max((int)pipe->max_usage - used, 0);
-
-	if (off > 0 && off < PAGE_SIZE) { // anon and not full
-		(*npages)++;
-		return off;
-	}
-	return 0;
-}
-
-static size_t copy_pipe_to_iter(const void *addr, size_t bytes,
-				struct iov_iter *i)
-{
-	unsigned int off, chunk;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	for (size_t n = bytes; n; n -= chunk) {
-		struct page *page = append_pipe(i, n, &off);
-		chunk = min_t(size_t, n, PAGE_SIZE - off);
-		if (!page)
-			return bytes - n;
-		memcpy_to_page(page, off, addr, chunk);
-		addr += chunk;
-	}
-	return bytes;
-}
-
 static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
 			      __wsum sum, size_t off)
 {
@@ -493,44 +307,10 @@ static __wsum csum_and_memcpy(void *to, const void *from, size_t len,
 	return csum_block_add(sum, next, off);
 }
 
-static size_t csum_and_copy_to_pipe_iter(const void *addr, size_t bytes,
-					 struct iov_iter *i, __wsum *sump)
-{
-	__wsum sum = *sump;
-	size_t off = 0;
-	unsigned int chunk, r;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	while (bytes) {
-		struct page *page = append_pipe(i, bytes, &r);
-		char *p;
-
-		if (!page)
-			break;
-		chunk = min_t(size_t, bytes, PAGE_SIZE - r);
-		p = kmap_local_page(page);
-		sum = csum_and_memcpy(p + r, addr + off, chunk, sum, off);
-		kunmap_local(p);
-		off += chunk;
-		bytes -= chunk;
-	}
-	*sump = sum;
-	return off;
-}
-
 size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_pipe_to_iter(addr, bytes, i);
 	if (user_backed_iter(i))
 		might_fault();
 	iterate_and_advance(i, bytes, base, len, off,
@@ -552,42 +332,6 @@ static int copyout_mc(void __user *to, const void *from, size_t n)
 	return n;
 }
 
-static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
-				struct iov_iter *i)
-{
-	size_t xfer = 0;
-	unsigned int off, chunk;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	while (bytes) {
-		struct page *page = append_pipe(i, bytes, &off);
-		unsigned long rem;
-		char *p;
-
-		if (!page)
-			break;
-		chunk = min_t(size_t, bytes, PAGE_SIZE - off);
-		p = kmap_local_page(page);
-		rem = copy_mc_to_kernel(p + off, addr + xfer, chunk);
-		chunk -= rem;
-		kunmap_local(p);
-		xfer += chunk;
-		bytes -= chunk;
-		if (rem) {
-			iov_iter_revert(i, rem);
-			break;
-		}
-	}
-	return xfer;
-}
-
 /**
  * _copy_mc_to_iter - copy to iter with source memory error exception handling
  * @addr: source kernel address
@@ -607,9 +351,8 @@ static size_t copy_mc_pipe_to_iter(const void *addr, size_t bytes,
  *   alignment and poison alignment assumptions to avoid re-triggering
  *   hardware exceptions.
  *
- * * ITER_KVEC, ITER_PIPE, and ITER_BVEC can return short copies.
- *   Compare to copy_to_iter() where only ITER_IOVEC attempts might return
- *   a short copy.
+ * * ITER_KVEC and ITER_BVEC can return short copies.  Compare to
+ *   copy_to_iter() where only ITER_IOVEC attempts might return a short copy.
  *
  * Return: number of bytes copied (may be %0)
  */
@@ -617,8 +360,6 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_mc_pipe_to_iter(addr, bytes, i);
 	if (user_backed_iter(i))
 		might_fault();
 	__iterate_and_advance(i, bytes, base, len, off,
@@ -732,8 +473,6 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		return 0;
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_page_to_iter_pipe(page, offset, bytes, i);
 	page += offset / PAGE_SIZE; // first subpage
 	offset %= PAGE_SIZE;
 	while (1) {
@@ -764,8 +503,6 @@ size_t copy_page_to_iter_nofault(struct page *page, unsigned offset, size_t byte
 		return 0;
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
-	if (unlikely(iov_iter_is_pipe(i)))
-		return copy_page_to_iter_pipe(page, offset, bytes, i);
 	page += offset / PAGE_SIZE; // first subpage
 	offset %= PAGE_SIZE;
 	while (1) {
@@ -818,36 +555,8 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 }
 EXPORT_SYMBOL(copy_page_from_iter);
 
-static size_t pipe_zero(size_t bytes, struct iov_iter *i)
-{
-	unsigned int chunk, off;
-
-	if (unlikely(bytes > i->count))
-		bytes = i->count;
-	if (unlikely(!bytes))
-		return 0;
-
-	if (!sanity(i))
-		return 0;
-
-	for (size_t n = bytes; n; n -= chunk) {
-		struct page *page = append_pipe(i, n, &off);
-		char *p;
-
-		if (!page)
-			return bytes - n;
-		chunk = min_t(size_t, n, PAGE_SIZE - off);
-		p = kmap_local_page(page);
-		memset(p + off, 0, chunk);
-		kunmap_local(p);
-	}
-	return bytes;
-}
-
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
-	if (unlikely(iov_iter_is_pipe(i)))
-		return pipe_zero(bytes, i);
 	iterate_and_advance(i, bytes, base, len, count,
 		clear_user(base, len),
 		memset(base, 0, len)
@@ -878,32 +587,6 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t byt
 }
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
-static void pipe_advance(struct iov_iter *i, size_t size)
-{
-	struct pipe_inode_info *pipe = i->pipe;
-	int off = i->last_offset;
-
-	if (!off && !size) {
-		pipe_discard_from(pipe, i->start_head); // discard everything
-		return;
-	}
-	i->count -= size;
-	while (1) {
-		struct pipe_buffer *buf = pipe_buf(pipe, i->head);
-		if (off) /* make it relative to the beginning of buffer */
-			size += abs(off) - buf->offset;
-		if (size <= buf->len) {
-			buf->len = size;
-			i->last_offset = last_offset(buf);
-			break;
-		}
-		size -= buf->len;
-		i->head++;
-		off = 0;
-	}
-	pipe_discard_from(pipe, i->head + 1); // discard everything past this one
-}
-
 static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
 {
 	const struct bio_vec *bvec, *end;
@@ -955,8 +638,6 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_iovec_advance(i, size);
 	} else if (iov_iter_is_bvec(i)) {
 		iov_iter_bvec_advance(i, size);
-	} else if (iov_iter_is_pipe(i)) {
-		pipe_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
 		i->count -= size;
 	}
@@ -970,26 +651,6 @@ void iov_iter_revert(struct iov_iter *i, size_t unroll)
 	if (WARN_ON(unroll > MAX_RW_COUNT))
 		return;
 	i->count += unroll;
-	if (unlikely(iov_iter_is_pipe(i))) {
-		struct pipe_inode_info *pipe = i->pipe;
-		unsigned int head = pipe->head;
-
-		while (head > i->start_head) {
-			struct pipe_buffer *b = pipe_buf(pipe, --head);
-			if (unroll < b->len) {
-				b->len -= unroll;
-				i->last_offset = last_offset(b);
-				i->head = head;
-				return;
-			}
-			unroll -= b->len;
-			pipe_buf_release(pipe, b);
-			pipe->head--;
-		}
-		i->last_offset = 0;
-		i->head = head;
-		return;
-	}
 	if (unlikely(iov_iter_is_discard(i)))
 		return;
 	if (unroll <= i->iov_offset) {
@@ -1079,24 +740,6 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
-			struct pipe_inode_info *pipe,
-			size_t count)
-{
-	BUG_ON(direction != READ);
-	WARN_ON(pipe_full(pipe->head, pipe->tail, pipe->ring_size));
-	*i = (struct iov_iter){
-		.iter_type = ITER_PIPE,
-		.data_source = false,
-		.pipe = pipe,
-		.head = pipe->head,
-		.start_head = pipe->head,
-		.last_offset = 0,
-		.count = count
-	};
-}
-EXPORT_SYMBOL(iov_iter_pipe);
-
 /**
  * iov_iter_xarray - Initialise an I/O iterator to use the pages in an xarray
  * @i: The iterator to initialise.
@@ -1224,19 +867,6 @@ bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
 	if (iov_iter_is_bvec(i))
 		return iov_iter_aligned_bvec(i, addr_mask, len_mask);
 
-	if (iov_iter_is_pipe(i)) {
-		size_t size = i->count;
-
-		if (size & len_mask)
-			return false;
-		if (size && i->last_offset > 0) {
-			if (i->last_offset & addr_mask)
-				return false;
-		}
-
-		return true;
-	}
-
 	if (iov_iter_is_xarray(i)) {
 		if (i->count & len_mask)
 			return false;
@@ -1307,14 +937,6 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (iov_iter_is_bvec(i))
 		return iov_iter_alignment_bvec(i);
 
-	if (iov_iter_is_pipe(i)) {
-		size_t size = i->count;
-
-		if (size && i->last_offset > 0)
-			return size | i->last_offset;
-		return size;
-	}
-
 	if (iov_iter_is_xarray(i))
 		return (i->xarray_start + i->iov_offset) | i->count;
 
@@ -1367,36 +989,6 @@ static int want_pages_array(struct page ***res, size_t size,
 	return count;
 }
 
-static ssize_t pipe_get_pages(struct iov_iter *i,
-		   struct page ***pages, size_t maxsize, unsigned maxpages,
-		   size_t *start)
-{
-	unsigned int npages, count, off, chunk;
-	struct page **p;
-	size_t left;
-
-	if (!sanity(i))
-		return -EFAULT;
-
-	*start = off = pipe_npages(i, &npages);
-	if (!npages)
-		return -EFAULT;
-	count = want_pages_array(pages, maxsize, off, min(npages, maxpages));
-	if (!count)
-		return -ENOMEM;
-	p = *pages;
-	for (npages = 0, left = maxsize ; npages < count; npages++, left -= chunk) {
-		struct page *page = append_pipe(i, left, &off);
-		if (!page)
-			break;
-		chunk = min_t(size_t, left, PAGE_SIZE - off);
-		get_page(*p++ = page);
-	}
-	if (!npages)
-		return -EFAULT;
-	return maxsize - left;
-}
-
 static ssize_t iter_xarray_populate_pages(struct page **pages, struct xarray *xa,
 					  pgoff_t index, unsigned int nr_pages)
 {
@@ -1547,8 +1139,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		}
 		return maxsize;
 	}
-	if (iov_iter_is_pipe(i))
-		return pipe_get_pages(i, pages, maxsize, maxpages, start);
 	if (iov_iter_is_xarray(i))
 		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
 	return -EFAULT;
@@ -1638,9 +1228,7 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
 	}
 
 	sum = csum_shift(csstate->csum, csstate->off);
-	if (unlikely(iov_iter_is_pipe(i)))
-		bytes = csum_and_copy_to_pipe_iter(addr, bytes, i, &sum);
-	else iterate_and_advance(i, bytes, base, len, off, ({
+	iterate_and_advance(i, bytes, base, len, off, ({
 		next = csum_and_copy_to_user(addr + off, base, len);
 		sum = csum_block_add(sum, next, off);
 		next ? 0 : len;
@@ -1725,15 +1313,6 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		return iov_npages(i, maxpages);
 	if (iov_iter_is_bvec(i))
 		return bvec_npages(i, maxpages);
-	if (iov_iter_is_pipe(i)) {
-		int npages;
-
-		if (!sanity(i))
-			return 0;
-
-		pipe_npages(i, &npages);
-		return min(npages, maxpages);
-	}
 	if (iov_iter_is_xarray(i)) {
 		unsigned offset = (i->xarray_start + i->iov_offset) % PAGE_SIZE;
 		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
@@ -1746,10 +1325,6 @@ EXPORT_SYMBOL(iov_iter_npages);
 const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags)
 {
 	*new = *old;
-	if (unlikely(iov_iter_is_pipe(new))) {
-		WARN_ON(1);
-		return NULL;
-	}
 	if (iov_iter_is_bvec(new))
 		return new->bvec = kmemdup(new->bvec,
 				    new->nr_segs * sizeof(struct bio_vec),
diff --git a/mm/filemap.c b/mm/filemap.c
index a2006936a6ae..394db0c1f197 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2687,8 +2687,7 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
 			break;
 
-		error = filemap_get_pages(iocb, iter->count, &fbatch,
-					  iov_iter_is_pipe(iter));
+		error = filemap_get_pages(iocb, iter->count, &fbatch, false);
 		if (error < 0)
 			break;
 


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v20 27/32] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (25 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 26/32] iov_iter: Kill ITER_PIPE David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 28/32] block: Fix bio_flagged() so that gcc can better optimise it David Howells
                   ` (8 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, John Hubbard, Dave Chinner,
	Christoph Hellwig

ZERO_PAGE can't go away, no need to hold an extra reference.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-fsdevel@vger.kernel.org
---
 fs/iomap/direct-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 019cc87d0fb3..66a9f10e3207 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -203,7 +203,7 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	get_page(page);
+	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v20 28/32] block: Fix bio_flagged() so that gcc can better optimise it
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (26 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 27/32] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
                   ` (7 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

Fix bio_flagged() so that multiple instances of it, such as:

	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
	    bio_flagged(bio, BIO_PAGE_PINNED))

can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).

The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero.  This results in an out-of-line bio_release_pages()
being compiled to something like:

   <+0>:     mov    0x14(%rdi),%eax
   <+3>:     test   $0x1,%al
   <+5>:     jne    0xffffffff816dac53 <bio_release_pages+11>
   <+7>:     test   $0x2,%al
   <+9>:     je     0xffffffff816dac5c <bio_release_pages+20>
   <+11>:    movzbl %sil,%esi
   <+15>:    jmp    0xffffffff816daba1 <__bio_release_pages>
   <+20>:    jmp    0xffffffff81d0b800 <__x86_return_thunk>

However, the test is superfluous as the return type is bool.  Removing it
results in:

   <+0>:     testb  $0x3,0x14(%rdi)
   <+4>:     je     0xffffffff816e4af4 <bio_release_pages+15>
   <+6>:     movzbl %sil,%esi
   <+10>:    jmp    0xffffffff816dab7c <__bio_release_pages>
   <+15>:    jmp    0xffffffff81d0b7c0 <__x86_return_thunk>

instead.

Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/ # v6
---
 include/linux/bio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index b3e7529ff55e..7f53be035cf0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -229,7 +229,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
 {
-	return (bio->bi_flags & (1U << bit)) != 0;
+	return bio->bi_flags & (1U << bit);
 }
 
 static inline void bio_set_flag(struct bio *bio, unsigned int bit)


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (27 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 28/32] block: Fix bio_flagged() so that gcc can better optimise it David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-20  1:26   ` Kent Overstreet
  2023-05-20  8:40   ` David Howells
  2023-05-19  7:40 ` [PATCH v20 30/32] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
                   ` (6 subsequent siblings)
  35 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

From: Christoph Hellwig <hch@lst.de>

Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning and is only set when a page reference has been acquired that needs
to be released by bio_release_pages().
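
As a rough sketch of the intended usage (not taken from this patch): the
side that actually acquires page references now has to mark the bio
explicitly, while a bio built over kernel-owned pages carries no flag, so
bio_release_pages() leaves its pages alone:

	/* Sketch only: a caller that took get_page() references on the
	 * pages it added flags the bio so that the completion path knows
	 * to drop those references in bio_release_pages(). */
	static void example_submit_reffed_bio(struct bio *bio)
	{
		bio_set_flag(bio, BIO_PAGE_REFFED);
		submit_bio(bio);
	}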

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #8)
     - Split out from another patch [hch].
     - Don't default to BIO_PAGE_REFFED [hch].
    
    ver #5)
     - Split from patch that uses iov_iter_extract_pages().

 block/bio.c               | 2 +-
 block/blk-map.c           | 1 +
 fs/direct-io.c            | 2 ++
 fs/iomap/direct-io.c      | 1 -
 include/linux/bio.h       | 2 +-
 include/linux/blk_types.h | 2 +-
 6 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 043944fd46eb..8516adeaea26 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1191,7 +1191,6 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
 	bio->bi_iter.bi_bvec_done = iter->iov_offset;
 	bio->bi_iter.bi_size = size;
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	bio_set_flag(bio, BIO_CLONED);
 }
 
@@ -1336,6 +1335,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return 0;
 	}
 
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	do {
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
diff --git a/block/blk-map.c b/block/blk-map.c
index 04c55f1c492e..33d9f6e89ba6 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -282,6 +282,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	if (blk_queue_pci_p2pdma(rq->q))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
 
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	while (iov_iter_count(iter)) {
 		struct page **pages, *stack_pages[UIO_FASTIOV];
 		ssize_t bytes;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 0b380bb8a81e..ad20f3428bab 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -402,6 +402,8 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 		bio->bi_end_io = dio_bio_end_aio;
 	else
 		bio->bi_end_io = dio_bio_end_io;
+	/* for now require references for all pages */
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	sdio->bio = bio;
 	sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
 }
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 66a9f10e3207..08873f0627dd 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -203,7 +203,6 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	__bio_add_page(bio, page, len, 0);
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7f53be035cf0..0922729acd26 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -488,7 +488,7 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+	if (bio_flagged(bio, BIO_PAGE_REFFED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 740afe80f297..dfd2c2cb909d 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -323,7 +323,7 @@ struct bio {
  * bio flags
  */
 enum {
-	BIO_NO_PAGE_REF,	/* don't put release vec pages */
+	BIO_PAGE_REFFED,	/* put pages in bio_release_pages() */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v20 30/32] block: Add BIO_PAGE_PINNED and associated infrastructure
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (28 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 31/32] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
                   ` (5 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
(FOLL_PIN) and that the pin will need removing.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #10)
     - Drop bio_set_cleanup_mode(), open coding it instead.
    
    ver #9)
     - Only consider pinning in bio_set_cleanup_mode().  Ref'ing pages in
       struct bio is going away.
     - page_put_unpin() is removed; call unpin_user_page() and put_page()
       directly.
     - Use bio_release_page() in __bio_release_pages().
     - BIO_PAGE_PINNED and BIO_PAGE_REFFED can't both be set, so use if-else
       when testing both of them.
    
    ver #8)
     - Move the infrastructure to clean up pinned pages to this patch [hch].
     - Put BIO_PAGE_PINNED before BIO_PAGE_REFFED as the latter should
       probably be removed at some point.  FOLL_PIN can then be renumbered
       first.

 block/bio.c               |  6 +++---
 block/blk.h               | 12 ++++++++++++
 include/linux/bio.h       |  3 ++-
 include/linux/blk_types.h |  1 +
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 8516adeaea26..17bd01ecde36 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1169,7 +1169,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1489,8 +1489,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will unpin each page and will run one bio_put() against the
+ * BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk.h b/block/blk.h
index 45547bcf1119..e1ded2ccb3ca 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -420,6 +420,18 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+	if (bio_flagged(bio, BIO_PAGE_PINNED))
+		unpin_user_page(page);
+	else if (bio_flagged(bio, BIO_PAGE_REFFED))
+		put_page(page);
+}
+
 struct request_queue *blk_alloc_queue(int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, fmode_t mode);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 0922729acd26..8588bcfbc6ef 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -488,7 +488,8 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (bio_flagged(bio, BIO_PAGE_REFFED))
+	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+	    bio_flagged(bio, BIO_PAGE_PINNED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index dfd2c2cb909d..8ef209e3aa96 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -323,6 +323,7 @@ struct bio {
  * bio flags
  */
 enum {
+	BIO_PAGE_PINNED,	/* Unpin pages in bio_release_pages() */
 	BIO_PAGE_REFFED,	/* put pages in bio_release_pages() */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v20 31/32] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (29 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 30/32] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:40 ` [PATCH v20 32/32] block: convert bio_map_user_iov " David Howells
                   ` (4 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

This will pin pages or leave them unaltered, as appropriate to the iterator,
rather than getting a ref on them.

The pages need to be pinned for DIO rather than having refs taken on them to
prevent VM copy-on-write from malfunctioning during a concurrent fork() (the
result of the I/O could otherwise end up being affected by/visible to the
child process).
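
As a rough sketch of the resulting cleanup model (hypothetical helper, not
taken from the patch): the decision is made once per bio from the iterator
type, rather than per page:

	/* Sketch only: user-backed iterators get their pages pinned by
	 * iov_iter_extract_pages(), so mark the bio for unpinning at
	 * completion; kernel-backed iterators leave the pages untouched
	 * and need no flag at all. */
	static void example_mark_cleanup(struct bio *bio, struct iov_iter *iter)
	{
		if (iov_iter_extract_will_pin(iter))
			bio_set_flag(bio, BIO_PAGE_PINNED);
	}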

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #10)
     - Drop bio_set_cleanup_mode(), open coding it instead.
    
    ver #8)
     - Split the patch up a bit [hch].
     - We should only be using pinned/non-pinned pages and not ref'd pages,
       so adjust the comments appropriately.
    
    ver #7)
     - Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
    
    ver #5)
     - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
       BIO_* flags and got rid of bi_cleanup_mode.
     - Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.

 block/bio.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 17bd01ecde36..798cc4cf3bd2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1205,7 +1205,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1219,7 +1219,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1230,10 +1230,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array.  The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_PINNED flag.
+ * For a multi-segment *iter, this function only adds pages from the next
+ * non-empty segment of the iov iterator.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1265,9 +1265,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, extraction_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, extraction_flags, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
@@ -1300,7 +1300,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1335,7 +1335,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return 0;
 	}
 
-	bio_set_flag(bio, BIO_PAGE_REFFED);
+	if (iov_iter_extract_will_pin(iter))
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 	do {
 		ret = __bio_iov_iter_get_pages(bio, iter);
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v20 32/32] block: convert bio_map_user_iov to use iov_iter_extract_pages
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (30 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 31/32] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
@ 2023-05-19  7:40 ` David Howells
  2023-05-19  7:49 ` [PATCH] iov_iter: Add automatic-alloc for ITER_BVEC and use in direct_splice_read() David Howells
                   ` (3 subsequent siblings)
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  7:40 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

This will pin pages or leave them unaltered, as appropriate to the iterator,
rather than getting a ref on them.

The pages need to be pinned for DIO rather than having refs taken on them
to prevent VM copy-on-write from malfunctioning during a concurrent fork()
(the result of the I/O could otherwise end up being visible to/affected by
the child process).

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

Notes:
    ver #10)
     - Drop bio_set_cleanup_mode(), open coding it instead.
    
    ver #8)
     - Split the patch up a bit [hch].
     - We should only be using pinned/non-pinned pages and not ref'd pages,
       so adjust the comments appropriately.
    
    ver #7)
     - Don't treat BIO_PAGE_REFFED/PINNED as being the same as FOLL_GET/PIN.
    
    ver #5)
     - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
       BIO_* flags and got rid of bi_cleanup_mode.
     - Replaced BIO_NO_PAGE_REF with BIO_PAGE_REFFED in the preceding patch.

 block/blk-map.c | 23 +++++++++++------------
 1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 33d9f6e89ba6..3551c3ff17cf 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -281,22 +281,21 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 
 	if (blk_queue_pci_p2pdma(rq->q))
 		extraction_flags |= ITER_ALLOW_P2PDMA;
+	if (iov_iter_extract_will_pin(iter))
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 
-	bio_set_flag(bio, BIO_PAGE_REFFED);
 	while (iov_iter_count(iter)) {
-		struct page **pages, *stack_pages[UIO_FASTIOV];
+		struct page *stack_pages[UIO_FASTIOV];
+		struct page **pages = stack_pages;
 		ssize_t bytes;
 		size_t offs;
 		int npages;
 
-		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
-			pages = stack_pages;
-			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-						   nr_vecs, &offs, extraction_flags);
-		} else {
-			bytes = iov_iter_get_pages_alloc(iter, &pages,
-						LONG_MAX, &offs, extraction_flags);
-		}
+		if (nr_vecs > ARRAY_SIZE(stack_pages))
+			pages = NULL;
+
+		bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+					       nr_vecs, extraction_flags, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
@@ -318,7 +317,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						bio_release_page(bio, page);
 					break;
 				}
 
@@ -330,7 +329,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			bio_release_page(bio, pages[j++]);
 		if (pages != stack_pages)
 			kvfree(pages);
 		/* couldn't stuff something into bio? */


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH] iov_iter: Add automatic-alloc for ITER_BVEC and use in direct_splice_read()
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (31 preceding siblings ...)
  2023-05-19  7:40 ` [PATCH v20 32/32] block: convert bio_map_user_iov " David Howells
@ 2023-05-19  7:49 ` David Howells
  2023-05-19  8:18   ` Christoph Hellwig
  2023-05-19  8:06 ` [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE Christoph Hellwig
                   ` (2 subsequent siblings)
  35 siblings, 1 reply; 65+ messages in thread
From: David Howells @ 2023-05-19  7:49 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Christoph Hellwig
  Cc: dhowells, Matthew Wilcox, Jan Kara, Jeff Layton,
	David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm

If it's a problem that direct_splice_read() always allocates as much memory as
is asked for and will fit into the pipe, even when less could suffice (say, an
O_DIRECT read hits a hole and does a short read, or a socket returns less than
was asked for), then something like the attached modification to ITER_BVEC
could be made.

David
---
iov_iter: Add automatic-alloc for ITER_BVEC and use in direct_splice_read()

Add a flag to the iov_iter struct that tells things that write to, or allow
writing to, a BVEC-type iterator that they should allocate pages to fill in
any slots in the bio_vec array that have null page pointers.  This allows
the bufferage in the bvec to be allocated on-demand.

Iterators of this type are initialised with iov_iter_bvec_autoalloc()
instead of iov_iter_bvec().  Only destination (ie. READ/ITER_DEST)
iterators may be used in this fashion.

An additional function, iov_iter_auto_alloc(), is provided to perform the
allocation in the case that the caller wishes to make use of the bio_vec
array directly, and the block layer is modified to use it.

direct_splice_read() is then modified to make use of this.  This is less
efficient if we know in advance that we want to allocate the full buffer as
we can't so easily use bulk allocation, but it does mean that, in cases where
we might not want the full buffer (say we hit a hole in DIO), we don't have
to allocate it.
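
A rough usage sketch (not part of the patch; error handling and page/bvec
cleanup are elided): the caller hands over a bvec array of empty slots and
pages only get allocated as the read actually fills them:

	/* Sketch only: a short read (hole, short socket read) never
	 * allocates pages for the untouched tail of the bvec. */
	static ssize_t example_autoalloc_read(struct file *in, struct kiocb *kiocb,
					      size_t len)
	{
		size_t npages = DIV_ROUND_UP(len, PAGE_SIZE);
		struct bio_vec *bv;
		struct iov_iter to;

		bv = kzalloc(array_size(npages, sizeof(*bv)), GFP_KERNEL);
		if (!bv)
			return -ENOMEM;
		iov_iter_bvec_autoalloc(&to, ITER_DEST, bv, npages, len);
		return call_read_iter(in, kiocb, &to);
	}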

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---
 block/bio.c         |    2 
 fs/splice.c         |   36 ++++++-----------
 include/linux/uio.h |   13 ++++--
 lib/iov_iter.c      |  110 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 133 insertions(+), 28 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 798cc4cf3bd2..72d5c1125df2 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1330,6 +1330,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	int ret = 0;
 
 	if (iov_iter_is_bvec(iter)) {
+		if (!iov_iter_auto_alloc(iter, iov_iter_count(iter)))
+			return -ENOMEM;
 		bio_iov_bvec_set(bio, iter);
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
 		return 0;
diff --git a/fs/splice.c b/fs/splice.c
index 56d9802729d0..30e7a31c5ada 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -310,10 +310,8 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	struct iov_iter to;
 	struct bio_vec *bv;
 	struct kiocb kiocb;
-	struct page **pages;
 	ssize_t ret;
-	size_t used, npages, chunk, remain, keep = 0;
-	int i;
+	size_t used, npages, chunk, remain, keep = 0, i;
 
 	if (!len)
 		return 0;
@@ -334,30 +332,14 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	len = min_t(size_t, len, npages * PAGE_SIZE);
 	npages = DIV_ROUND_UP(len, PAGE_SIZE);
 
-	bv = kzalloc(array_size(npages, sizeof(bv[0])) +
-		     array_size(npages, sizeof(struct page *)), GFP_KERNEL);
+	bv = kzalloc(array_size(npages, sizeof(bv[0])), GFP_KERNEL);
 	if (!bv)
 		return -ENOMEM;
 
-	pages = (struct page **)(bv + npages);
-	npages = alloc_pages_bulk_array(GFP_USER, npages, pages);
-	if (!npages) {
-		kfree(bv);
-		return -ENOMEM;
-	}
-
 	remain = len = min_t(size_t, len, npages * PAGE_SIZE);
 
-	for (i = 0; i < npages; i++) {
-		chunk = min_t(size_t, PAGE_SIZE, remain);
-		bv[i].bv_page = pages[i];
-		bv[i].bv_offset = 0;
-		bv[i].bv_len = chunk;
-		remain -= chunk;
-	}
-
 	/* Do the I/O */
-	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
+	iov_iter_bvec_autoalloc(&to, ITER_DEST, bv, npages, len);
 	init_sync_kiocb(&kiocb, in);
 	kiocb.ki_pos = *ppos;
 	ret = call_read_iter(in, &kiocb, &to);
@@ -376,8 +358,16 @@ ssize_t direct_splice_read(struct file *in, loff_t *ppos,
 	}
 
 	/* Free any pages that didn't get touched at all. */
-	if (keep < npages)
-		release_pages(pages + keep, npages - keep);
+	if (keep < npages) {
+		struct page **pages = (struct page **)&bv[keep];
+		size_t j = 0;
+
+		for (i = keep; i < npages; i++)
+			if (bv[i].bv_page)
+				pages[j++] = bv[i].bv_page;
+		if (j)
+			release_pages(pages, j);
+	}
 
 	/* Push the remaining pages into the pipe. */
 	remain = ret;
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 60c342bb7ab8..6bc2287021d9 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -40,10 +40,11 @@ struct iov_iter_state {
 
 struct iov_iter {
 	u8 iter_type;
-	bool copy_mc;
-	bool nofault;
-	bool data_source;
-	bool user_backed;
+	bool copy_mc:1;
+	bool nofault:1;
+	bool data_source:1;
+	bool user_backed:1;
+	bool auto_alloc:1;	/* Automatically alloc pages into a bvec */
 	union {
 		size_t iov_offset;
 		int last_offset;
@@ -263,6 +264,7 @@ static inline bool iov_iter_is_copy_mc(const struct iov_iter *i)
 }
 #endif
 
+bool iov_iter_auto_alloc(struct iov_iter *iter, size_t count);
 size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
 			unsigned len_mask);
@@ -274,6 +276,9 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec
 			unsigned long nr_segs, size_t count);
 void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
+void iov_iter_bvec_autoalloc(struct iov_iter *i, unsigned int direction,
+			     const struct bio_vec *bvec, unsigned long nr_segs,
+			     size_t count);
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f18138e0292a..3643f9d80ecc 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -52,7 +52,11 @@
 	while (n) {						\
 		unsigned offset = p->bv_offset + skip;		\
 		unsigned left;					\
-		void *kaddr = kmap_local_page(p->bv_page +	\
+		void *kaddr;					\
+								\
+		if (!p->bv_page)				\
+			break;					\
+		kaddr = kmap_local_page(p->bv_page +		\
 					offset / PAGE_SIZE);	\
 		base = kaddr + offset % PAGE_SIZE;		\
 		len = min(min(n, (size_t)(p->bv_len - skip)),	\
@@ -159,6 +163,49 @@ __out:								\
 #define iterate_and_advance(i, n, base, len, off, I, K) \
 	__iterate_and_advance(i, n, base, len, off, I, ((void)(K),0))
 
+/*
+ * Preallocate pages into the bvec sufficient to store count bytes.
+ */
+static bool bvec_auto_alloc(struct iov_iter *iter, size_t count)
+{
+	struct bio_vec *bvec = (struct bio_vec *)iter->bvec;
+	int j;
+
+	if (!count)
+		return true;
+
+	count += iter->iov_offset;
+	for (j = 0; j < iter->nr_segs; j++) {
+		if (!bvec[j].bv_page) {
+			bvec[j].bv_page = alloc_page(GFP_KERNEL);
+			if (!bvec[j].bv_page)
+				return false;
+		}
+		if (bvec[j].bv_len >= count)
+			break;
+		count -= bvec[j].bv_len;
+	}
+
+	return true;
+}
+
+/**
+ * iov_iter_auto_alloc - Perform auto-alloc for an iterator
+ * @iter: The iterator to do the allocation for
+ * @count: The number of bytes we need to store
+ *
+ * Perform auto-allocation on an iterator.  This only works with ITER_BVEC-type
+ * iterators.  It will make sure that pages are allocated sufficient to store
+ * the specified number of bytes.  Returns true if sufficient pages are present
+ * in the bvec and false if an allocation failure occurs.
+ */
+bool iov_iter_auto_alloc(struct iov_iter *iter, size_t count)
+{
+	return !iov_iter_is_bvec(iter) || !iter->auto_alloc ||
+		bvec_auto_alloc(iter, count);
+}
+EXPORT_SYMBOL_GPL(iov_iter_auto_alloc);
+
 static int copyout(void __user *to, const void *from, size_t n)
 {
 	if (should_fail_usercopy())
@@ -313,6 +360,8 @@ size_t _copy_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 		return 0;
 	if (user_backed_iter(i))
 		might_fault();
+	if (!iov_iter_auto_alloc(i, bytes))
+		return 0;
 	iterate_and_advance(i, bytes, base, len, off,
 		copyout(base, addr + off, len),
 		memcpy(base, addr + off, len)
@@ -362,6 +411,8 @@ size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 		return 0;
 	if (user_backed_iter(i))
 		might_fault();
+	if (!iov_iter_auto_alloc(i, bytes))
+		return 0;
 	__iterate_and_advance(i, bytes, base, len, off,
 		copyout_mc(base, addr + off, len),
 		copy_mc_to_kernel(base, addr + off, len)
@@ -503,6 +554,8 @@ size_t copy_page_to_iter_nofault(struct page *page, unsigned offset, size_t byte
 		return 0;
 	if (WARN_ON_ONCE(i->data_source))
 		return 0;
+	if (!iov_iter_auto_alloc(i, bytes))
+		return 0;
 	page += offset / PAGE_SIZE; // first subpage
 	offset %= PAGE_SIZE;
 	while (1) {
@@ -557,6 +610,8 @@ EXPORT_SYMBOL(copy_page_from_iter);
 
 size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 {
+	if (!iov_iter_auto_alloc(i, bytes))
+		return -ENOMEM;
 	iterate_and_advance(i, bytes, base, len, count,
 		clear_user(base, len),
 		memset(base, 0, len)
@@ -598,6 +653,7 @@ static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
 	size += i->iov_offset;
 
 	for (bvec = i->bvec, end = bvec + i->nr_segs; bvec < end; bvec++) {
+		BUG_ON(!bvec->bv_page);
 		if (likely(size < bvec->bv_len))
 			break;
 		size -= bvec->bv_len;
@@ -740,6 +796,51 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
+/**
+ * iov_iter_bvec_autoalloc - Initialise a BVEC-type I/O iterator with automatic alloc
+ * @i: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @bvec: The array of bio_vecs listing the buffer segments
+ * @nr_segs: The number of segments in @bvec[].
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator to insert pages into a bvec as data is written into
+ * it where NULL pointers exist in the bvec array (if a pointer isn't NULL, the
+ * page it points to will just be used).  No more than @nr_segs pages will be
+ * filled in.  Empty slots will have bv_offset set to 0 and bv_len to
+ * PAGE_SIZE.
+ *
+ * If the iterator is reverted, excess pages will be left for the
+ * caller to clean up.
+ */
+void iov_iter_bvec_autoalloc(struct iov_iter *i, unsigned int direction,
+			     const struct bio_vec *bvec, unsigned long nr_segs,
+			     size_t count)
+{
+	struct bio_vec *bv = (struct bio_vec *)bvec;
+	unsigned long j;
+
+	BUG_ON(direction != READ);
+	*i = (struct iov_iter){
+		.iter_type = ITER_BVEC,
+		.copy_mc = false,
+		.data_source = direction,
+		.auto_alloc = true,
+		.bvec = bvec,
+		.nr_segs = nr_segs,
+		.iov_offset = 0,
+		.count = count
+	};
+
+	for (j = 0; j < nr_segs; j++) {
+		if (!bv[j].bv_page) {
+			bv[j].bv_offset = 0;
+			bv[j].bv_len = PAGE_SIZE;
+		}
+	}
+}
+EXPORT_SYMBOL(iov_iter_bvec_autoalloc);
+
 /**
  * iov_iter_xarray - Initialise an I/O iterator to use the pages in an xarray
  * @i: The iterator to initialise.
@@ -1122,6 +1223,8 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		struct page **p;
 		struct page *page;
 
+		if (!iov_iter_auto_alloc(i, maxsize))
+			return -ENOMEM;
 		page = first_bvec_segment(i, &maxsize, start);
 		n = want_pages_array(pages, maxsize, *start, maxpages);
 		if (!n)
@@ -1226,6 +1329,8 @@ size_t csum_and_copy_to_iter(const void *addr, size_t bytes, void *_csstate,
 		csstate->off += bytes;
 		return bytes;
 	}
+	if (!iov_iter_auto_alloc(i, bytes))
+		return -ENOMEM;
 
 	sum = csum_shift(csstate->csum, csstate->off);
 	iterate_and_advance(i, bytes, base, len, off, ({
@@ -1664,6 +1769,9 @@ static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
 	size_t skip = i->iov_offset, offset;
 	int k;
 
+	if (!iov_iter_auto_alloc(i, maxsize))
+		return -ENOMEM;
+
 	for (;;) {
 		if (i->nr_segs == 0)
 			return 0;


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (32 preceding siblings ...)
  2023-05-19  7:49 ` [PATCH] iov_iter: Add automatic-alloc for ITER_BVEC and use in direct_splice_read() David Howells
@ 2023-05-19  8:06 ` Christoph Hellwig
  2023-06-14  0:55 ` Mike Marshall
  2023-06-16 11:38 ` David Howells
  35 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-19  8:06 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm

On Fri, May 19, 2023 at 08:40:15AM +0100, David Howells wrote:
> Hi Jens, Al, Christoph,
> 
> The first half of this patchset kills off ITER_PIPE to avoid a race between
> truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a
> bio with unpinned/unref'ed pages from an O_DIRECT splice read.  This causes
> memory corruption[2].  Instead, we use filemap_splice_read(), which invokes
> the buffered file reading code and splices from the pagecache into the
> pipe; direct_splice_read(), which bulk-allocates a buffer, reads into it
> and then pushes the filled pages into the pipe; or handle it in
> filesystem-specific code.

If there's a clearly separate first and second half of a 32 patch
series, it might really make sense to just split it instead of exceeding
every normal attention window..

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 01/32] splice: Fix filemap of a blockdev
  2023-05-19  7:40 ` [PATCH v20 01/32] splice: Fix filemap of a blockdev David Howells
@ 2023-05-19  8:06   ` Christoph Hellwig
  2023-05-19  9:11   ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-19  8:06 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Steve French,
	Christoph Hellwig, John Hubbard

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

But this really needs to be sent standalone so that it can get picked
up for 6.4 and -stable.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate
  2023-05-19  7:40 ` [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate David Howells
@ 2023-05-19  8:09   ` Christoph Hellwig
  2023-05-19  8:43   ` David Howells
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-19  8:09 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig

On Fri, May 19, 2023 at 08:40:18AM +0100, David Howells wrote:
> Make direct_read_splice() limit the read to the end of the file for regular
> files and block devices, thereby reducing the amount of allocation it will
> do in such a case.
> 
> This means that the blockdev code doesn't require any special handling as
> filemap_read_splice() also limits to i_size.

I'm really not sure if this is a good idea.  Right now
direct_read_splice (which also appears a little misnamed) really is
a helper that splices by calling ->read_iter.  If we don't do any
of this validation we can just call it directly from splice.c instead
of calling into ->splice_read for direct I/O and DAX and remove a ton
of boilerplate code.

How often do we even call into splice beyond i_size and for how much?

> +	if (S_ISREG(file_inode(in)->i_mode) ||
> +	    S_ISBLK(file_inode(in)->i_mode)) {

Btw, having these kinds of checks in supposedly generic code is always
a marker of a bit of a layering problem.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 05/32] splice: Make splice from a DAX file use direct_splice_read()
  2023-05-19  7:40 ` [PATCH v20 05/32] splice: Make splice from a DAX file use direct_splice_read() David Howells
@ 2023-05-19  8:10   ` Christoph Hellwig
  2023-05-19  8:48   ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-19  8:10 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	linux-erofs, linux-ext4, linux-xfs

On Fri, May 19, 2023 at 08:40:20AM +0100, David Howells wrote:
> +#ifdef CONFIG_FS_DAX
> +	if (IS_DAX(in->f_mapping->host))

No need for the ifdef.  IS_DAX is compile-time false if CONFIG_FS_DAX
is not set.

A comment on why we're doing this in the code would probably be useful
as well.
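
A sketch of the suggested form, with the #ifdef dropped (the surrounding
splice_read function is assumed):

	/* IS_DAX() is compile-time false when CONFIG_FS_DAX is not set,
	 * so the branch is discarded without a preprocessor guard. */
	if (IS_DAX(in->f_mapping->host))
		return direct_splice_read(in, ppos, pipe, len, flags);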

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 09/32] tty, proc, kernfs, random: Use direct_splice_read()
  2023-05-19  7:40 ` [PATCH v20 09/32] tty, proc, kernfs, random: Use direct_splice_read() David Howells
@ 2023-05-19  8:12   ` Christoph Hellwig
  2023-05-19 16:22     ` Linus Torvalds
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-19  8:12 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Greg Kroah-Hartman,
	Christoph Hellwig, John Hubbard, Miklos Szeredi, Arnd Bergmann

On Fri, May 19, 2023 at 08:40:24AM +0100, David Howells wrote:
>  	.fasync = random_fasync,
>  	.llseek = noop_llseek,
> -	.splice_read = generic_file_splice_read,
> +	.splice_read = direct_splice_read,

Pinging Al (and maybe Linus): is there any good reason to not simply
default to direct_splice_read if ->read_iter is implemented and
->splice_read is not once you remove ITER_PIPE?  As long as we
ensure direct_splice_read is simply a ->read_iter into newly
allocated pages I can't think of anything that would go wrong there.
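
In code terms the suggested default would amount to something like this in
the generic splice path (the exact call site is an assumption):

	if (in->f_op->splice_read)
		return in->f_op->splice_read(in, ppos, pipe, len, flags);
	if (in->f_op->read_iter)
		return direct_splice_read(in, ppos, pipe, len, flags);
	return -EINVAL;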


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH] iov_iter: Add automatic-alloc for ITER_BVEC and use in direct_splice_read()
  2023-05-19  7:49 ` [PATCH] iov_iter: Add automatic-alloc for ITER_BVEC and use in direct_splice_read() David Howells
@ 2023-05-19  8:18   ` Christoph Hellwig
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-19  8:18 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm

On Fri, May 19, 2023 at 08:49:18AM +0100, David Howells wrote:
> direct_splice_read() is then modified to make use of this.  This is less
> efficient if we know in advance that we want to allocate the full buffer as
> we can't so easily use bulk allocation, but it does mean in cases where we
> might not want the full buffer (say we hit a hole in DIO), we don't have to
> allocate it.

Can you explain the workloads we're trying to optimize for here?

To me, splice on O_DIRECT is more of a historic accident than an actually
good use case...

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 13/32] ceph: Provide a splice-read stub
  2023-05-19  7:40 ` [PATCH v20 13/32] ceph: " David Howells
@ 2023-05-19  8:40   ` Xiubo Li
  2023-05-19  9:24   ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Xiubo Li @ 2023-05-19  8:40 UTC (permalink / raw)
  To: David Howells, Jens Axboe, Al Viro, Christoph Hellwig
  Cc: Matthew Wilcox, Jan Kara, Jeff Layton, David Hildenbrand,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Christoph Hellwig, Ilya Dryomov,
	ceph-devel


On 5/19/23 15:40, David Howells wrote:
> Provide a splice_read stub for Ceph.  This does the inode shutdown check
> before proceeding and jumps to direct_splice_read() if O_DIRECT is set, the
> file has inline data or is a synchronous file.
>
> We try and get FILE_RD and either FILE_CACHE and/or FILE_LAZYIO caps and
> hold them across filemap_splice_read().  If we fail to get FILE_CACHE or
> FILE_LAZYIO capabilities, we use direct_splice_read() instead.
>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Christoph Hellwig <hch@lst.de>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Xiubo Li <xiubli@redhat.com>
> cc: Ilya Dryomov <idryomov@gmail.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: ceph-devel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>   fs/ceph/file.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index f4d8bf7dec88..382dd5901748 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1745,6 +1745,70 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to)
>   	return ret;
>   }
>   
> +/*
> + * Wrap filemap_splice_read with checks for cap bits on the inode.
> + * Atomically grab references, so that those bits are not released
> + * back to the MDS mid-read.
> + */
> +static ssize_t ceph_splice_read(struct file *in, loff_t *ppos,
> +				struct pipe_inode_info *pipe,
> +				size_t len, unsigned int flags)
> +{
> +	struct ceph_file_info *fi = in->private_data;
> +	struct inode *inode = file_inode(in);
> +	struct ceph_inode_info *ci = ceph_inode(inode);
> +	ssize_t ret;
> +	int want = 0, got = 0;
> +	CEPH_DEFINE_RW_CONTEXT(rw_ctx, 0);
> +
> +	dout("splice_read %p %llx.%llx %llu~%zu trying to get caps on %p\n",
> +	     inode, ceph_vinop(inode), *ppos, len, inode);
> +
> +	if (ceph_inode_is_shutdown(inode))
> +		return -ESTALE;
> +
> +	if ((in->f_flags & O_DIRECT) ||
> +	    ceph_has_inline_data(ci) ||
> +	    (fi->flags & CEPH_F_SYNC))
> +		return direct_splice_read(in, ppos, pipe, len, flags);
> +
> +	ceph_start_io_read(inode);
> +
> +	want = CEPH_CAP_FILE_CACHE;
> +	if (fi->fmode & CEPH_FILE_MODE_LAZY)
> +		want |= CEPH_CAP_FILE_LAZYIO;
> +
> +	ret = ceph_get_caps(in, CEPH_CAP_FILE_RD, want, -1, &got);
> +	if (ret < 0) {
> +		ceph_end_io_read(inode);
> +		return ret;
> +	}
> +
> +	if ((got & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO)) == 0) {
> +		dout("splice_read/sync %p %llx.%llx %llu~%zu got cap refs on %s\n",
> +		     inode, ceph_vinop(inode), *ppos, len,
> +		     ceph_cap_string(got));
> +
> +		ceph_end_io_read(inode);
> +		return direct_splice_read(in, ppos, pipe, len, flags);

Shouldn't we release cap ref before returning here ?

Thanks

- Xiubo


> +	}
> +
> +	dout("splice_read %p %llx.%llx %llu~%zu got cap refs on %s\n",
> +	     inode, ceph_vinop(inode), *ppos, len, ceph_cap_string(got));
> +
> +	rw_ctx.caps = got;
> +	ceph_add_rw_context(fi, &rw_ctx);
> +	ret = filemap_splice_read(in, ppos, pipe, len, flags);
> +	ceph_del_rw_context(fi, &rw_ctx);
> +
> +	dout("splice_read %p %llx.%llx dropping cap refs on %s = %zd\n",
> +	     inode, ceph_vinop(inode), ceph_cap_string(got), ret);
> +	ceph_put_cap_refs(ci, got);
> +
> +	ceph_end_io_read(inode);
> +	return ret;
> +}
> +
>   /*
>    * Take cap references to avoid releasing caps to MDS mid-write.
>    *
> @@ -2593,7 +2657,7 @@ const struct file_operations ceph_file_fops = {
>   	.lock = ceph_lock,
>   	.setlease = simple_nosetlease,
>   	.flock = ceph_flock,
> -	.splice_read = generic_file_splice_read,
> +	.splice_read = ceph_splice_read,
>   	.splice_write = iter_file_splice_write,
>   	.unlocked_ioctl = ceph_ioctl,
>   	.compat_ioctl = compat_ptr_ioctl,
>


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate
  2023-05-19  7:40 ` [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate David Howells
  2023-05-19  8:09   ` Christoph Hellwig
@ 2023-05-19  8:43   ` David Howells
  2023-05-19  8:47     ` Christoph Hellwig
  2023-05-19 22:27     ` David Howells
  2023-05-19 16:31   ` Linus Torvalds
  2023-05-19 16:47   ` David Howells
  3 siblings, 2 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  8:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig

Christoph Hellwig <hch@infradead.org> wrote:

> I'm really not sure if this is a good idea.  Right now
> direct_read_splice (which also appears a little misnamed) really is
> a "splice by calling ->read_iter" helper.

It can be renamed if you want a different name.  copy_splice_read() or
copy_splice_read_iter()?

> If we don't do any of this validation we can just call it directly from
> splice.c instead of calling into ->splice_read for direct I/O and DAX and
> remove a ton of boilerplate code.

There's a couple of places where we might not want to do that - at least for
non-DAX.  shmem and f2fs for example.  f2fs calls back to buffered reading
under some circumstances.  shmem ignores O_DIRECT and always splices from the
pagecache.

David


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate
  2023-05-19  8:43   ` David Howells
@ 2023-05-19  8:47     ` Christoph Hellwig
  2023-05-19 22:27     ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-19  8:47 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig

On Fri, May 19, 2023 at 09:43:34AM +0100, David Howells wrote:
> > direct_read_splice (which also appears a little misnamed) really is
> > a "splice by calling ->read_iter" helper.
> 
> It can be renamed if you want a different name.  copy_splice_read() or
> copy_splice_read_iter()?

Maybe something like that, yes.

> 
> > If we don't do any of this validation we can just call it directly from
> > splice.c instead of calling into ->splice_read for direct I/O and DAX and
> > remove a ton of boilerplate code.
> 
> There's a couple of places where we might not want to do that - at least for
> non-DAX.  shmem and f2fs for example.  f2fs calls back to buffered reading
> under some circumstances.  shmem ignores O_DIRECT and always splices from the
> pagecache.

So?  Even if ->read_iter does buffered I/O for O_DIRECT it will still
work.  This can in fact happen for many other file systems when
they fall back to buffered I/O for various reasons.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 05/32] splice: Make splice from a DAX file use direct_splice_read()
  2023-05-19  7:40 ` [PATCH v20 05/32] splice: Make splice from a DAX file use direct_splice_read() David Howells
  2023-05-19  8:10   ` Christoph Hellwig
@ 2023-05-19  8:48   ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  8:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	linux-erofs, linux-ext4, linux-xfs

Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, May 19, 2023 at 08:40:20AM +0100, David Howells wrote:
> > +#ifdef CONFIG_FS_DAX
> > +	if (IS_DAX(in->f_mapping->host))
> 
> No need for the ifdef.  IS_DAX is compile-time false if CONFIG_FS_DAX
> is not set.

Ah - it's not that IS_DAX() is conditionalised, it's that S_DAX is.  There's a
bunch of places that use CONFIG_FS_DAX blocks, but I guess that's because they
include calls to functions that are conditionalised out.

I wonder if the dax_iomap_rw() declaration in the header can have a non-DAX
fallback that returns an error and then we can get rid of some of the other
conditionalisation.
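
Something along these lines is what's being wondered about, i.e. a stub
when CONFIG_FS_DAX is off (a sketch only; the real declaration lives in
include/linux/dax.h):

	#ifdef CONFIG_FS_DAX
	ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
			     const struct iomap_ops *ops);
	#else
	static inline ssize_t dax_iomap_rw(struct kiocb *iocb,
					   struct iov_iter *iter,
					   const struct iomap_ops *ops)
	{
		/* No DAX support built in; callers should never get here. */
		return -EOPNOTSUPP;
	}
	#endif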

David


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 01/32] splice: Fix filemap of a blockdev
  2023-05-19  7:40 ` [PATCH v20 01/32] splice: Fix filemap of a blockdev David Howells
  2023-05-19  8:06   ` Christoph Hellwig
@ 2023-05-19  9:11   ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-19  9:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Steve French,
	Christoph Hellwig, John Hubbard

Christoph Hellwig <hch@infradead.org> wrote:

> 
> But this really needs to be sent standalone so that it can get picked
> up for 6.4 and -stable.

Note that only cifs is actually using that code at the moment.

David


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 13/32] ceph: Provide a splice-read stub
  2023-05-19  7:40 ` [PATCH v20 13/32] ceph: " David Howells
  2023-05-19  8:40   ` Xiubo Li
@ 2023-05-19  9:24   ` David Howells
  2023-05-22  1:53     ` Xiubo Li
  1 sibling, 1 reply; 65+ messages in thread
From: David Howells @ 2023-05-19  9:24 UTC (permalink / raw)
  To: Xiubo Li
  Cc: dhowells, Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jan Kara, Jeff Layton, David Hildenbrand, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, Ilya Dryomov, ceph-devel

Xiubo Li <xiubli@redhat.com> wrote:

> > +	ret = ceph_get_caps(in, CEPH_CAP_FILE_RD, want, -1, &got);
> > +	if (ret < 0) {
> > +		ceph_end_io_read(inode);
> > +		return ret;
> > +	}
> > +
> > +	if ((got & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO)) == 0) {
> > +		dout("splice_read/sync %p %llx.%llx %llu~%zu got cap refs on %s\n",
> > +		     inode, ceph_vinop(inode), *ppos, len,
> > +		     ceph_cap_string(got));
> > +
> > +		ceph_end_io_read(inode);
> > +		return direct_splice_read(in, ppos, pipe, len, flags);
> 
> Shouldn't we release cap ref before returning here ?

Ummm...  Even if we got no caps?

David


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 09/32] tty, proc, kernfs, random: Use direct_splice_read()
  2023-05-19  8:12   ` Christoph Hellwig
@ 2023-05-19 16:22     ` Linus Torvalds
  0 siblings, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2023-05-19 16:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Greg Kroah-Hartman, Christoph Hellwig,
	John Hubbard, Miklos Szeredi, Arnd Bergmann

On Fri, May 19, 2023 at 1:12 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Pinging Al (and maybe Linus): is there any good reason to not simply
> default to direct_splice_read if ->read_iter is implemented and
> ->splice_read is not once you remove ITER_PIPE?

For me, the reason isn't so much technical as "historical pain".

We've had *so* many problems with splice on random file descriptors
that I at some point decided "no splice by default".

Now, admittedly most of the problems were due to the whole set_fs()
ambiguity, which you fixed and no longer exists. So maybe we could go
back to "implement splice by default".

I agree that as long as the default implementation is obviously safe,
it should be ok, and maybe direct_splice_read is that obvious..

           Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate
  2023-05-19  7:40 ` [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate David Howells
  2023-05-19  8:09   ` Christoph Hellwig
  2023-05-19  8:43   ` David Howells
@ 2023-05-19 16:31   ` Linus Torvalds
  2023-05-19 16:47   ` David Howells
  3 siblings, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2023-05-19 16:31 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Christoph Hellwig

On Fri, May 19, 2023 at 12:41 AM David Howells <dhowells@redhat.com> wrote:
>
> +
> +       if (S_ISREG(file_inode(in)->i_mode) ||
> +           S_ISBLK(file_inode(in)->i_mode)) {

This really feels fundamentally wrong to me.

If block and regular files have this limit, they should have their own
splice_read() function that implements that limit.

Not make everybody else check it.

IOW, this should be a separate function ("block_splice_read()" or
whatever), not inside a generic function that other users use.

The zero size checking looks fine, although I wondered about that too.
Some special files do traditionally have special meanings for
zero-sized reads (as in "packet boundary"). But I suspect that isn't
an issue for splice, and perhaps more importantly, I think the same
rule should be in place: special files that want special rules
shouldn't be using this generic function directly then.

                 Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate
  2023-05-19  7:40 ` [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate David Howells
                     ` (2 preceding siblings ...)
  2023-05-19 16:31   ` Linus Torvalds
@ 2023-05-19 16:47   ` David Howells
  2023-05-19 17:37     ` Linus Torvalds
  3 siblings, 1 reply; 65+ messages in thread
From: David Howells @ 2023-05-19 16:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhowells, Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jan Kara, Jeff Layton, David Hildenbrand, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > +       if (S_ISREG(file_inode(in)->i_mode) ||
> > +           S_ISBLK(file_inode(in)->i_mode)) {
> 
> This really feels fundamentally wrong to me.
> 
> If block and regular files have this limit, they should have their own
> splice_read() function that implements that limit.
> 
> Not make everybody else check it.
> 
> IOW, this should be a separate function ("block_splice_read()" or
> whatever), not inside a generic function that other users use.

This is just an optimisation to cut down the amount of bufferage allocated, so
I could just drop it and leave it to userspace for now as the filesystem/block
layer will stop anyway if it hits the EOF.  Christoph would prefer that I call
direct_splice_read() from generic_file_splice_read() in all O_DIRECT cases, if
that's fine with you.

David


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate
  2023-05-19 16:47   ` David Howells
@ 2023-05-19 17:37     ` Linus Torvalds
  0 siblings, 0 replies; 65+ messages in thread
From: Linus Torvalds @ 2023-05-19 17:37 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Christoph Hellwig

On Fri, May 19, 2023 at 9:48 AM David Howells <dhowells@redhat.com> wrote:
>
> This is just an optimisation to cut down the amount of bufferage allocated

So the thing is, it's actually very very wrong for some files.

Now, admittedly, those files have other issues too, and it's a design
mistake to begin with, but look at a number of files in /proc.

In particular, look at the regular files that have a size of '0'. It's
quite common indeed. Things like

    /proc/cpuinfo
    /proc/stat
    ...

you can find a ton of them with

    find /proc -type f -size 0

Is it horribly wrong and bad? Yes. I hate it. It means that some
really basic user space tools refuse to work on them, and the tools
are 100% right - this is a kernel misfeature. Trying to do things like

    less -S /proc/cpuinfo

may or may not work depending on your version of 'less', for example,
because it's entirely reasonable to do something like

    fd = open(..);
    if (!fstat(fd, &st))
         len = st.st_size;

and limit your reads to the size of the file - exactly like your patch does.

Except it fails horribly on those broken /proc files.

I hate it, and I blame myself for the above horror, but it's pretty
much unfixable. We could make them look like named pipes or something,
but that's really ugly and probably would break other things anyway.
And we simply don't know the size ahead of time.

Now, *most* things work, because they just do the whole "read until
EOF". In fact, my current version of 'less' has no problem at all
doing the above thing, and gives the "expected" output.

Also, honestly, I really don't think that it's necessarily a good idea
to splice /proc files, but we actually do have splice wired up to
these because people asked for it:

    fe33850ff798 ("proc: wire up generic_file_splice_read for iter ops")
    4bd6a7353ee1 ("sysctl: Convert to iter interfaces")

so I suspect those things do exist.

> I could just drop it and leave it to userspace for now as the filesystem/block
> layer will stop anyway if it hits the EOF.  Christoph would prefer that I call
> direct_splice_read() from generic_file_splice_read() in all O_DIRECT cases, if
> that's fine with you.

I guess that's fine, and for O_DIRECT itself it might even make sense
to do the size test. That said, I doubt it matters: if you use
O_DIRECT on a small file, you only have yourself to blame for doing
something stupid.

And if it isn't a small file, then who cares about some small EOF-time
optimization? Nobody.

So I would suggest not doing that optimization at all, because as-is,
it's either pointless or actively broken.

That said, I would *not* hate some kind of special FMODE_SIZELIMIT
flag that allows filesystems to opt in to "limit reads to size".

We already have flags like that: FMODE_UNSIGNED_OFFSET and
'sb->s_maxbytes' are both basically variations on that same theme, and
having another flag to say "limit reads to i_size" wouldn't be wrong.

It's only wrong when it is done mindlessly with S_ISREG().
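
For illustration, such an opt-in might look roughly like this (the flag
name and value are purely hypothetical):

	/* Hypothetical opt-in bit; the value is only for illustration. */
	#define FMODE_SIZELIMIT		((__force fmode_t)0x40000000)

	/* In the generic splice-read path: */
	if ((in->f_mode & FMODE_SIZELIMIT) &&
	    *ppos >= i_size_read(file_inode(in)))
		return 0;	/* could also clamp len to i_size - *ppos */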

             Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate
  2023-05-19  8:43   ` David Howells
  2023-05-19  8:47     ` Christoph Hellwig
@ 2023-05-19 22:27     ` David Howells
  2023-05-20  3:54       ` Christoph Hellwig
  1 sibling, 1 reply; 65+ messages in thread
From: David Howells @ 2023-05-19 22:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig

Okay.  Let's go with that.  So I have to put the handling in vfs_splice_read():

	long vfs_splice_read(struct file *in, loff_t *ppos,
			     struct pipe_inode_info *pipe, size_t len,
			     unsigned int flags)
	{
	...
		if (unlikely(!in->f_op->splice_read))
			return warn_unsupported(in, "read");
		/*
		 * O_DIRECT and DAX don't deal with the pagecache, so we
		 * allocate a buffer, copy into it and splice that into the pipe.
		 */
		if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
			return copy_splice_read(in, ppos, pipe, len, flags);
		return in->f_op->splice_read(in, ppos, pipe, len, flags);
	}

which leaves very little in generic_file_splice_read:

	ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
					 struct pipe_inode_info *pipe, size_t len,
					 unsigned int flags)
	{
		if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
			return 0;
		if (unlikely(!len))
			return 0;
		return filemap_splice_read(in, ppos, pipe, len, flags);
	}

so I wonder if the tests in generic_file_splice_read() can be folded into
vfs_splice_read(), pointers to generic_file_splice_read() be replaced with
pointers to filemap_splice_read() and generic_file_splice_read() just be
removed.

I suspect we can't quite do this because of the *ppos check - but I wonder if
that's actually necessary since filemap_splice_read() checks against
i_size... or if the check can be moved there if we definitely want to do it.

Certainly, the zero-length check can be done in vfs_splice_read().
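
Folding those checks together might end up looking something like this
(a sketch; the *ppos/s_maxbytes question above is left open):

	long vfs_splice_read(struct file *in, loff_t *ppos,
			     struct pipe_inode_info *pipe, size_t len,
			     unsigned int flags)
	{
	...
		if (unlikely(!len))
			return 0;
		if (unlikely(!in->f_op->splice_read))
			return warn_unsupported(in, "read");
		if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
			return copy_splice_read(in, ppos, pipe, len, flags);
		return in->f_op->splice_read(in, ppos, pipe, len, flags);
	}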

David


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  2023-05-19  7:40 ` [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
@ 2023-05-20  1:26   ` Kent Overstreet
  2023-05-20  3:56     ` Christoph Hellwig
  2023-05-20  8:40   ` David Howells
  1 sibling, 1 reply; 65+ messages in thread
From: Kent Overstreet @ 2023-05-20  1:26 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

On Fri, May 19, 2023 at 08:40:44AM +0100, David Howells wrote:
> From: Christoph Hellwig <hch@lst.de>
> 
> Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
> meaning and is only set when a page reference has been acquired that needs to
> be released by bio_release_pages().

What was the motivation for this patch?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 03/32] splice: Make direct_read_splice() limit to eof where appropriate
  2023-05-19 22:27     ` David Howells
@ 2023-05-20  3:54       ` Christoph Hellwig
  0 siblings, 0 replies; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-20  3:54 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig

On Fri, May 19, 2023 at 11:27:51PM +0100, David Howells wrote:
> 	ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
> 					 struct pipe_inode_info *pipe, size_t len,
> 					 unsigned int flags)
> 	{
> 		if (unlikely(*ppos >= in->f_mapping->host->i_sb->s_maxbytes))
> 			return 0;
> 		if (unlikely(!len))
> 			return 0;
> 		return filemap_splice_read(in, ppos, pipe, len, flags);
> 	}
> 
> so I wonder if the tests in generic_file_splice_read() can be folded into
> vfs_splice_read(), pointers to generic_file_splice_read() be replaced with
> pointers to filemap_splice_read() and generic_file_splice_read() just be
> removed.
> 
> I suspect we can't quite do this because of the *ppos check - but I wonder if
> that's actually necessary since filemap_splice_read() checks against
> i_size... or if the check can be moved there if we definitely want to do it.
> 
> Certainly, the zero-length check can be done in vfs_splice_read().

The zero length check makes sense in vfs_splice_read.  The ppos check
I think makes sense for filemap_splice_read - after all we're dealing
with the page cache here, and the page cache needs a working maxbytes
and i_size.  Which callers of filemap_splice_read in your tree right now
don't have the checks, and how did you end up with them?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  2023-05-20  1:26   ` Kent Overstreet
@ 2023-05-20  3:56     ` Christoph Hellwig
  2023-05-20  4:13       ` Kent Overstreet
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-20  3:56 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: David Howells, Jens Axboe, Al Viro, Christoph Hellwig,
	Matthew Wilcox, Jan Kara, Jeff Layton, David Hildenbrand,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Christoph Hellwig, John Hubbard

On Fri, May 19, 2023 at 09:26:07PM -0400, Kent Overstreet wrote:
> On Fri, May 19, 2023 at 08:40:44AM +0100, David Howells wrote:
> > From: Christoph Hellwig <hch@lst.de>
> > 
> > Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
> > meaning and is only set when a page reference has been acquired that needs to
> > be released by bio_release_pages().
> 
> What was the motivation for this patch?

So that it is only set when we need to release a page, instead of telling
code not to release it when it otherwise would, where "otherwise would"
is implicit and undocumented and changes in this series.
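
In other words, the release path goes from "release unless told not to"
to "only release when a ref was actually taken" (sketch):

	/* Before: a ref is assumed unless explicitly suppressed. */
	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
		put_page(page);

	/* After: a ref is only dropped when it was explicitly taken. */
	if (bio_flagged(bio, BIO_PAGE_REFFED))
		put_page(page);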

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  2023-05-20  3:56     ` Christoph Hellwig
@ 2023-05-20  4:13       ` Kent Overstreet
  2023-05-20  4:17         ` Christoph Hellwig
  0 siblings, 1 reply; 65+ messages in thread
From: Kent Overstreet @ 2023-05-20  4:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

On Fri, May 19, 2023 at 08:56:56PM -0700, Christoph Hellwig wrote:
> On Fri, May 19, 2023 at 09:26:07PM -0400, Kent Overstreet wrote:
> > On Fri, May 19, 2023 at 08:40:44AM +0100, David Howells wrote:
> > > From: Christoph Hellwig <hch@lst.de>
> > > 
> > > Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
> > > meaning and is only set when a page reference has been acquired that needs to
> > > be released by bio_release_pages().
> > 
> > What was the motivation for this patch?
> 
> So that it is only set when we need to release a page, instead of telling
> code not to release it when it otherwise would, where "otherwise would"
> is implicit and undocumented and changes in this series.

I suppose this way setting it can be done in bio_iov_iter_get_pages() -
ok yeah, that makes sense.

But it seems like it should be set in bio_iov_iter_get_pages() though,
and I'm not seeing that?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  2023-05-20  4:13       ` Kent Overstreet
@ 2023-05-20  4:17         ` Christoph Hellwig
  2023-05-20  5:52           ` Kent Overstreet
  0 siblings, 1 reply; 65+ messages in thread
From: Christoph Hellwig @ 2023-05-20  4:17 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Christoph Hellwig, David Howells, Jens Axboe, Al Viro,
	Matthew Wilcox, Jan Kara, Jeff Layton, David Hildenbrand,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, Linus Torvalds, linux-fsdevel, linux-block,
	linux-kernel, linux-mm, Christoph Hellwig, John Hubbard

On Sat, May 20, 2023 at 12:13:49AM -0400, Kent Overstreet wrote:
> I suppose this way setting it can be done in bio_iov_iter_get_pages() -
> ok yeah, that makes sense.
> 
> But it seems like it should be set in bio_iov_iter_get_pages() though,
> and I'm not seeing that?

It is set in bio_iov_iter_get_pages in this patch.  The latter gets
replaced with the pinned flag when bio_iov_iter_get_pages is
changed to pin pages instead.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  2023-05-20  4:17         ` Christoph Hellwig
@ 2023-05-20  5:52           ` Kent Overstreet
  0 siblings, 0 replies; 65+ messages in thread
From: Kent Overstreet @ 2023-05-20  5:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	John Hubbard

On Fri, May 19, 2023 at 09:17:43PM -0700, Christoph Hellwig wrote:
> On Sat, May 20, 2023 at 12:13:49AM -0400, Kent Overstreet wrote:
> > I suppose this way setting it can be done in bio_iov_iter_get_pages() -
> > ok yeah, that makes sense.
> > 
> > But it seems like it should be set in bio_iov_iter_get_pages() though,
> > and I'm not seeing that?
> 
> It is set in bio_iov_iter_get_pages in this patch.  The later gets
> replaced with the pinned flag when we bio_iov_iter_get_pages is
> changed to pin pages instead.

Whoops, missed it.

Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
  2023-05-19  7:40 ` [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
  2023-05-20  1:26   ` Kent Overstreet
@ 2023-05-20  8:40   ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: David Howells @ 2023-05-20  8:40 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: dhowells, Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jan Kara, Jeff Layton, David Hildenbrand, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, John Hubbard

Kent Overstreet <kent.overstreet@linux.dev> wrote:

> > Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
> > meaning and is only set when a page reference has been acquired that needs to
> > be released by bio_release_pages().
> 
> What was the motivation for this patch?

We need to move to using FOLL_PIN for buffers derived from direct I/O to avoid
the fork vs async-DIO race.  Further, we shouldn't be taking a ref or a pin on
pages derived from internal kernel iterators such as KVEC or BVEC as the page
refcount might not be a valid way to control the lifetime of the data/buffers
in those pages (slab, for instance).  Rather, for internal kernel I/O, we need
to rely on the caller to hold onto the memory until we tell them we've
finished.

So we flip the polarity of the page-is-ref'd flag and then add a
page-is-pinned flag.  The intention is to ultimately drop the page-is-ref'd
flag - but we still need to keep the page-is-pinned flag.  This makes it
easier to take a stepwise approach - and having both flags working the same
way makes the logic easier to follow.

See iov_iter_extract_pages() and iov_iter_extract_will_pin().
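
Concretely, the per-page cleanup described above ends up as a small
helper of roughly this shape (cf. bio_release_page() in the series):

	static inline void bio_release_page(struct bio *bio, struct page *page)
	{
		if (bio_flagged(bio, BIO_PAGE_PINNED))
			unpin_user_page(page);
		else if (bio_flagged(bio, BIO_PAGE_REFFED))
			put_page(page);
		/* Neither flag: the caller retains the memory; do nothing. */
	}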

David


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 13/32] ceph: Provide a splice-read stub
  2023-05-19  9:24   ` David Howells
@ 2023-05-22  1:53     ` Xiubo Li
  0 siblings, 0 replies; 65+ messages in thread
From: Xiubo Li @ 2023-05-22  1:53 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Ilya Dryomov, ceph-devel


On 5/19/23 17:24, David Howells wrote:
> Xiubo Li <xiubli@redhat.com> wrote:
>
>>> +	ret = ceph_get_caps(in, CEPH_CAP_FILE_RD, want, -1, &got);
>>> +	if (ret < 0) {
>>> +		ceph_end_io_read(inode);
>>> +		return ret;
>>> +	}
>>> +
>>> +	if ((got & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO)) == 0) {
>>> +		dout("splice_read/sync %p %llx.%llx %llu~%zu got cap refs on %s\n",
>>> +		     inode, ceph_vinop(inode), *ppos, len,
>>> +		     ceph_cap_string(got));
>>> +
>>> +		ceph_end_io_read(inode);
>>> +		return direct_splice_read(in, ppos, pipe, len, flags);
>> Shouldn't we release cap ref before returning here ?
> Ummm...  Even if we got no caps?

No, at least we have got the 'need' caps (CEPH_CAP_FILE_RD) by this point.
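
Roughly, what's being suggested (a sketch against the quoted code above):

	if ((got & (CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO)) == 0) {
		dout("splice_read/sync %p %llx.%llx %llu~%zu got cap refs on %s\n",
		     inode, ceph_vinop(inode), *ppos, len,
		     ceph_cap_string(got));

		ceph_put_cap_refs(ci, got);	/* drop the FILE_RD ref that was granted */
		ceph_end_io_read(inode);
		return direct_splice_read(in, ppos, pipe, len, flags);
	}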

I saw you have updated this and will check it.

Thanks

- Xiubo

>
> David
>


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read()
  2023-05-19  7:40 ` [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read() David Howells
@ 2023-05-22 14:29   ` Steven Rostedt
  2023-05-22 14:50   ` David Howells
  1 sibling, 0 replies; 65+ messages in thread
From: Steven Rostedt @ 2023-05-22 14:29 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, David Hildenbrand, Jason Gunthorpe, Logan Gunthorpe,
	Hillf Danton, Christian Brauner, Linus Torvalds, linux-fsdevel,
	linux-block, linux-kernel, linux-mm, Christoph Hellwig,
	Masami Hiramatsu, linux-trace-kernel

On Fri, 19 May 2023 08:40:38 +0100
David Howells <dhowells@redhat.com> wrote:

> For the splice from the trace seq buffer, just use direct_splice_read().
> 
> In the future, something better can probably be done by gifting pages from
> seq->buf into the pipe, but that would require changing seq->buf into a
> vmap over an array of pages.

If you can give me a POC of what needs to be done, I could possibly
implement it.

> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Christoph Hellwig <hch@lst.de>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Steven Rostedt <rostedt@goodmis.org>
> cc: Masami Hiramatsu <mhiramat@kernel.org>
> cc: linux-kernel@vger.kernel.org
> cc: linux-trace-kernel@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-block@vger.kernel.org
> cc: linux-mm@kvack.org
> ---
>  kernel/trace/trace.c | 2 +-

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index ebc59781456a..b664020efcb7 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -5171,7 +5171,7 @@ static const struct file_operations tracing_fops = {
>  	.open		= tracing_open,
>  	.read		= seq_read,
>  	.read_iter	= seq_read_iter,
> -	.splice_read	= generic_file_splice_read,
> +	.splice_read	= direct_splice_read,
>  	.write		= tracing_write_stub,
>  	.llseek		= tracing_lseek,
>  	.release	= tracing_release,


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read()
  2023-05-19  7:40 ` [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read() David Howells
  2023-05-22 14:29   ` Steven Rostedt
@ 2023-05-22 14:50   ` David Howells
  2023-05-22 17:42     ` Linus Torvalds
  1 sibling, 1 reply; 65+ messages in thread
From: David Howells @ 2023-05-22 14:50 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: dhowells, Jens Axboe, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jan Kara, Jeff Layton, David Hildenbrand, Jason Gunthorpe,
	Logan Gunthorpe, Hillf Danton, Christian Brauner, Linus Torvalds,
	linux-fsdevel, linux-block, linux-kernel, linux-mm,
	Christoph Hellwig, Masami Hiramatsu, linux-trace-kernel

Steven Rostedt <rostedt@goodmis.org> wrote:

> > In the future, something better can probably be done by gifting pages from
> > seq->buf into the pipe, but that would require changing seq->buf into a
> > vmap over an array of pages.
> 
> If you can give me a POC of what needs to be done, I could possibly
> implement it.

I wrote my idea up here for Masami[*]:

We could implement seq_splice_read().  What we would need to do is to change
how the seq buffer is allocated: bulk allocate a bunch of arbitrary pages
which we then vmap().  When we need to splice, we read into the buffer, do a
vunmap() and then splice the pages holding the data we used into the pipe.

If we don't manage to splice all the data, we can continue splicing from the
pages we have left next time.  If a read() comes along to view partially
spliced data, we would need to copy from the individual pages.

When we use up all the data, we discard all the pages we might have spliced
from and shuffle down the other pages, call the bulk allocator to replenish
the buffer and then vmap() it again.

Any pages we've spliced from must be discarded and replaced and not rewritten.

If a read() comes without the buffer having been spliced from, it can do as it
does now.

David
---

[*] https://lore.kernel.org/linux-fsdevel/20230522-pfund-ferngeblieben-53fad9c0e527@brauner/T/#mc03959454c76cc3f29024b092c62d88c90f7c071


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read()
  2023-05-22 14:50   ` David Howells
@ 2023-05-22 17:42     ` Linus Torvalds
  2023-05-22 18:38       ` Steven Rostedt
  0 siblings, 1 reply; 65+ messages in thread
From: Linus Torvalds @ 2023-05-22 17:42 UTC (permalink / raw)
  To: David Howells
  Cc: Steven Rostedt, Jens Axboe, Al Viro, Christoph Hellwig,
	Matthew Wilcox, Jan Kara, Jeff Layton, David Hildenbrand,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, linux-fsdevel, linux-block, linux-kernel,
	linux-mm, Christoph Hellwig, Masami Hiramatsu,
	linux-trace-kernel

On Mon, May 22, 2023 at 7:50 AM David Howells <dhowells@redhat.com> wrote:
>
> We could implement seq_splice_read().  What we would need to do is to change
> how the seq buffer is allocated: bulk allocate a bunch of arbitrary pages
> which we then vmap().  When we need to splice, we read into the buffer, do a
> vunmap() and then splice the pages holding the data we used into the pipe.

Please don't use vmap as a way to do zero-copy.

The virtual mapping games are more expensive than a small copy from
some random seq file.

Yes, yes, seq_file currently uses "kvmalloc()", which does fall back
to vmalloc too. But the keyword there is "falls back". Most of the
time it's just a regular boring kmalloc, and most of the time a
seq-file is tiny.

                      Linus

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read()
  2023-05-22 17:42     ` Linus Torvalds
@ 2023-05-22 18:38       ` Steven Rostedt
  0 siblings, 0 replies; 65+ messages in thread
From: Steven Rostedt @ 2023-05-22 18:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Jens Axboe, Al Viro, Christoph Hellwig,
	Matthew Wilcox, Jan Kara, Jeff Layton, David Hildenbrand,
	Jason Gunthorpe, Logan Gunthorpe, Hillf Danton,
	Christian Brauner, linux-fsdevel, linux-block, linux-kernel,
	linux-mm, Christoph Hellwig, Masami Hiramatsu,
	linux-trace-kernel

On Mon, 22 May 2023 10:42:12 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, May 22, 2023 at 7:50 AM David Howells <dhowells@redhat.com> wrote:
> >
> > We could implement seq_splice_read().  What we would need to do is to change
> > how the seq buffer is allocated: bulk allocate a bunch of arbitrary pages
> > which we then vmap().  When we need to splice, we read into the buffer, do a
> > vunmap() and then splice the pages holding the data we used into the pipe.  
> 
> Please don't use vmap as a way to do zero-copy.
> 
> The virtual mapping games are more expensive than a small copy from
> some random seq file.
> 
> Yes, yes, seq_file currently uses "kvmalloc()", which does fall back
> to vmalloc too. But the keyword there is "falls back". Most of the
> time it's just a regular boring kmalloc, and most of the time a
> seq-file is tiny.

I was thinking this change had to do with the splice callback for
trace_pipe_raw (which is a hot path that does zero copy of the ftrace ring
buffer into files). But looking at this further, I see that it's for just
the "trace" file, which is a textual conversion of the tracing data (slow
path, although some user space uses this and parses the text, which IMHO is
wrong).

In other words, I don't really care much about this code being "efficient".

-- Steve


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (33 preceding siblings ...)
  2023-05-19  8:06 ` [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE Christoph Hellwig
@ 2023-06-14  0:55 ` Mike Marshall
  2023-06-16 11:38 ` David Howells
  35 siblings, 0 replies; 65+ messages in thread
From: Mike Marshall @ 2023-06-14  0:55 UTC (permalink / raw)
  To: David Howells, linux-fsdevel, Mike Marshall

Hello... I used
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract
to build 6.4.0-rc2-00037-g0c3c931ab6d1 and ran xfstests
through with no regressions on orangefs. You can add a
Tested-by from me if you'd like...

-Mike

On Fri, May 19, 2023 at 3:42 AM David Howells <dhowells@redhat.com> wrote:
>
> Hi Jens, Al, Christoph,
>
> The first half of this patchset kills off ITER_PIPE to avoid a race between
> truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a
> bio with unpinned/unref'ed pages from an O_DIRECT splice read.  This causes
> memory corruption[2].  Instead, we use filemap_splice_read(), which invokes
> the buffered file reading code and splices from the pagecache into the
> pipe; direct_splice_read(), which bulk-allocates a buffer, reads into it
> and then pushes the filled pages into the pipe; or handle it in
> filesystem-specific code.
>
>  (1) Simplify the calculations for the number of pages to be reclaimed in
>      direct_splice_read().
>
>  (2) Turn do_splice_to() into a helper so that it can be used by overlayfs
>      and coda to perform the checks on the lower fs.
>
>  (3) Provide shmem with its own splice_read to handle non-existent pages
>      in the pagecache.  We don't want a ->read_folio() as we don't want to
>      populate holes, but filemap_get_pages() requires it.
>
>  (4) Provide overlayfs with its own splice_read to call down to a lower
>      layer as overlayfs doesn't provide ->read_folio().
>
>  (5) Provide coda with its own splice_read to call down to a lower layer as
>      coda doesn't provide ->read_folio().
>
>  (6) Direct ->splice_read to direct_splice_read() in tty, procfs, kernfs
>      and random files as they just copy to the output buffer and don't
>      splice pages.
>
>  (7) Provide stubs for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2,
>      orangefs, xfs and zonefs to do locking and/or revalidation.
>
>  (8) Change generic_file_splice_read() to just switch between
>      filemap_splice_read() and direct_splice_read() rather than using
>      ITER_PIPE.
>
>  (9) Make cifs use generic_file_splice_read().
>
> (10) Remove ITER_PIPE and its paraphernalia as generic_file_splice_read()
>      was the only user.
>
> The second half of the patchset rolls page-pinning out to the bio struct
> and the block layer, using iov_iter_extract_pages() to get pages and noting
> with BIO_PAGE_PINNED if the data pages attached to a bio are pinned.  If
> the data pages come from a non-user-backed iterator, then the pages are
> left unpinned and unref'd, relying on whoever set up the I/O to do the
> retaining
>
> (10) Don't hold a ref on ZERO_PAGE in iomap_dio_zero().
>
> (11) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.
>
> (12) Make the bio struct carry a pair of flags to indicate the cleanup
>      mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (indicating
>      FOLL_GET was used) and BIO_PAGE_PINNED (indicating FOLL_PIN was used)
>      is added.
>
>      BIO_PAGE_REFFED will go away, but at the moment fs/direct-io.c sets it
>      and this series does not fully address that file.
>
> (13) Add a function, bio_release_page(), to release a page appropriately to
>      the cleanup mode indicated by the BIO_PAGE_* flags.
>
> (14) Make bio_iov_iter_get_pages() use iov_iter_extract_pages() to retain
>      the pages appropriately and clean them up later.
>
> (15) Make bio_map_user_iov() also use iov_iter_extract_pages().
>
> I've pushed the patches here also:
>
>         https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract
>
> David
>
> Changes:
> ========
> ver #20)
>  - Make direct_splice_read() limit the read to eof for regular files and
>    blockdevs.
>  - Check against s_maxbytes on the backing store, not a devnode inode.
>  - Provide stubs for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3, ocfs2,
>    orangefs, xfs and zonefs.
>  - Always use direct_splice_read() for 9p, trace and sockets.
>
> ver #19)
>  - Remove a missed get_page() on the zeropage in shmem_splice_read().
>
> ver #18)
>  - Split out the cifs bits from the patch the switches
>    generic_file_splice_read() over to using the non-ITER_PIPE splicing.
>  - Don't get/put refs on the zeropage in shmem_splice_read().
>
> ver #17)
>  - Rename do_splice_to() to vfs_splice_read() and export it so that it can
>    be a helper and make overlayfs and coda use it, allowing duplicate
>    checks to be removed.
>
> ver #16)
>  - The filemap_get_pages() changes are now upstream.
>  - filemap_splice_read() and direct_splice_read() are now upstream.
>  - iov_iter_extract_pages() is now upstream.
>
> ver #15)
>  - Fixed up some errors in overlayfs_splice_read().
>
> ver #14)
>  - Some changes to generic_file_buffered_splice_read():
>    - Rename to filemap_splice_read() and move to mm/filemap.c.
>    - Create a helper, pipe_head_buf().
>    - Use init_sync_kiocb().
>  - Some changes to generic_file_direct_splice_read():
>    - Use alloc_pages_bulk_array() rather than alloc_pages_bulk_list().
>    - Use release_pages() instead of __free_page() in a loop.
>    - Rename to direct_splice_read().
>  - Rearrange the patches to implement filemap_splice_read() and
>    direct_splice_read() separately to changing generic_file_splice_read().
>  - Don't call generic_file_splice_read() when there isn't a ->read_folio().
>  - Insert patches to fix read_folio-less cases:
>    - Make tty, procfs, kernfs and (u)random use direct_splice_read().
>    - Make overlayfs and coda call down to a lower layer.
>    - Give shmem its own splice-read that doesn't insert missing pages.
>  - Fixed a min() with mixed type args on some arches.
>
> ver #13)
>  - Only use allocation in advance and ITER_BVEC for DIO read-splice.
>  - Make buffered read-splice get pages directly from the pagecache.
>  - Alter filemap_get_pages() & co. so that it doesn't need an iterator.
>
> ver #12)
>  - Added the missing __bitwise on the iov_iter_extraction_t typedef.
>  - Rebased on -rc7.
>  - Don't specify FOLL_PIN to pin_user_pages_fast().
>  - Inserted patch at front to fix race between DIO read and truncation that
>    caused memory corruption when iov_iter_revert() got called on an
>    ITER_PIPE iterator[2].
>  - Inserted a patch after that to remove the now-unused ITER_PIPE and its
>    helper functions.
>  - Removed the ITER_PIPE bits from iov_iter_extract_pages().
>
> ver #11)
>  - Fix iov_iter_extract_kvec_pages() to include the offset into the page in
>    the returned starting offset.
>  - Use __bitwise for the extraction flags
>
> ver #10)
>  - Fix use of i->kvec in iov_iter_extract_bvec_pages() to be i->bvec.
>  - Drop bio_set_cleanup_mode(), open coding it instead.
>
> ver #9)
>  - It's now not permitted to use FOLL_PIN outside of mm/, so:
>  - Change iov_iter_extract_mode() into iov_iter_extract_will_pin() and
>    return true/false instead of FOLL_PIN/0.
>  - Drop of folio_put_unpin() and page_put_unpin() and instead call
>    unpin_user_page() (and put_page()) directly as necessary.
>  - Make __bio_release_pages() call bio_release_page() instead of
>    unpin_user_page() as there's no BIO_* -> FOLL_* translation to do.
>  - Drop the FOLL_* renumbering patch.
>  - Change extract_flags to extraction_flags.
>
> ver #8)
>  - Import Christoph Hellwig's changes.
>    - Split the conversion-to-extraction patch.
>    - Drop the extract_flags arg from iov_iter_extract_mode().
>    - Don't default bios to BIO_PAGE_REFFED, but set explicitly.
>  - Switch FOLL_PIN and FOLL_GET when renumbering so PIN is at bit 0.
>  - Switch BIO_PAGE_PINNED and BIO_PAGE_REFFED so PINNED is at bit 0.
>  - We should always be using FOLL_PIN (not FOLL_GET) for DIO, so adjust the
>    patches for that.
>
> ver #7)
>  - For now, drop the parts to pass the I/O direction to iov_iter_*pages*()
>    as it turned out to be a lot more complicated, with places not setting
>    IOCB_WRITE when they should, for example.
>  - Drop all the patches that changed things other then the block layer's
>    bio handling.  The netfslib and cifs changes can go into a separate
>    patchset.
>  - Add support for extracting pages from KVEC-type iterators.
>  - When extracting from BVEC/KVEC, skip over empty vecs at the front.
>
> ver #6)
>  - Fix write() syscall and co. not setting IOCB_WRITE.
>  - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
>  - Use op_is_write() in bio_copy_user_iov().
>  - Drop the iterator direction checks from smbd_recv().
>  - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
>    gup_flags to iov_iter_get/extract_pages*().
>  - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
>  - Add back the function to indicate the cleanup mode.
>  - Drop the cleanup_mode return arg to iov_iter_extract_pages().
>  - Provide a helper to clean up a page.
>  - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
>    the same numerical values, enforced with an assertion.
>  - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
>    NFS.
>  - Added in the patches to make CIFS do top-to-bottom iterators and use
>    various of the added extraction functions.
>  - Added a pair of work-in-progess patches to make sk_buff fragments store
>    FOLL_GET and FOLL_PIN.
>
> ver #5)
>  - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
>  - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
>  - Add patch to allow bio_flagged() to be combined by gcc.
>
> ver #4)
>  - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
>    no longer referenced by linux/uio.h.
>  - Add ITER_SOURCE/DEST cleanup patches.
>  - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
>  - Allow additional gup_flags to be passed into iov_iter_extract_pages().
>  - Add struct bio patch.
>
> ver #3)
>  - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
>    to get/pin_user_pages_fast()[1].
>
> ver #2)
>  - Rolled the extraction cleanup mode query function into the extraction
>    function, returning the indication through the argument list.
>  - Fixed patch 4 (extract to scatterlist) to actually use the new
>    extraction API.
>
> Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
> Link: https://lore.kernel.org/r/000000000000b0b3c005f3a09383@google.com/ [2]
> Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
> Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
> Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
> Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
> Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
> Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
> Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
> Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5
> Link: https://lore.kernel.org/r/167391047703.2311931.8115712773222260073.stgit@warthog.procyon.org.uk/ # v6
> Link: https://lore.kernel.org/r/20230120175556.3556978-1-dhowells@redhat.com/ # v7
> Link: https://lore.kernel.org/r/20230123173007.325544-1-dhowells@redhat.com/ # v8
> Link: https://lore.kernel.org/r/20230124170108.1070389-1-dhowells@redhat.com/ # v9
> Link: https://lore.kernel.org/r/20230125210657.2335748-1-dhowells@redhat.com/ # v10
> Link: https://lore.kernel.org/r/20230126141626.2809643-1-dhowells@redhat.com/ # v11
> Link: https://lore.kernel.org/r/20230207171305.3716974-1-dhowells@redhat.com/ # v12
> Link: https://lore.kernel.org/r/20230209102954.528942-1-dhowells@redhat.com/ # v13
> Link: https://lore.kernel.org/r/20230214171330.2722188-1-dhowells@redhat.com/ # v14
> Link: https://lore.kernel.org/r/20230308143754.1976726-1-dhowells@redhat.com/ # v16
> Link: https://lore.kernel.org/r/20230308165251.2078898-1-dhowells@redhat.com/ # v17
> Link: https://lore.kernel.org/r/20230314220757.3827941-1-dhowells@redhat.com/ # v18
> Link: https://lore.kernel.org/r/20230315163549.295454-1-dhowells@redhat.com/ # v19
>
> Additional patches that got folded in:
>
> Link: https://lore.kernel.org/r/20230213134619.2198965-1-dhowells@redhat.com/ # v1
> Link: https://lore.kernel.org/r/20230213153301.2338806-1-dhowells@redhat.com/ # v2
> Link: https://lore.kernel.org/r/20230214083710.2547248-1-dhowells@redhat.com/ # v3
>
> Christoph Hellwig (1):
>   block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted
>     logic
>
> David Howells (31):
>   splice: Fix filemap of a blockdev
>   splice: Clean up direct_splice_read() a bit
>   splice: Make direct_read_splice() limit to eof where appropriate
>   splice: Make do_splice_to() generic and export it
>   splice: Make splice from a DAX file use direct_splice_read()
>   shmem: Implement splice-read
>   overlayfs: Implement splice-read
>   coda: Implement splice-read
>   tty, proc, kernfs, random: Use direct_splice_read()
>   net: Make sock_splice_read() use direct_splice_read() by default
>   9p:  Add splice_read stub
>   afs: Provide a splice-read stub
>   ceph: Provide a splice-read stub
>   ecryptfs: Provide a splice-read stub
>   ext4: Provide a splice-read stub
>   f2fs: Provide a splice-read stub
>   nfs: Provide a splice-read stub
>   ntfs3: Provide a splice-read stub
>   ocfs2: Provide a splice-read stub
>   orangefs: Provide a splice-read stub
>   xfs: Provide a splice-read stub
>   zonefs: Provide a splice-read stub
>   splice: Convert trace/seq to use direct_splice_read()
>   splice: Do splice read from a file without using ITER_PIPE
>   cifs: Use generic_file_splice_read()
>   iov_iter: Kill ITER_PIPE
>   iomap: Don't get a reference on ZERO_PAGE for direct I/O block
>     zeroing
>   block: Fix bio_flagged() so that gcc can better optimise it
>   block: Add BIO_PAGE_PINNED and associated infrastructure
>   block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
>   block: convert bio_map_user_iov to use iov_iter_extract_pages
>
>  block/bio.c               |  29 +--
>  block/blk-map.c           |  22 +-
>  block/blk.h               |  12 ++
>  drivers/char/random.c     |   4 +-
>  drivers/tty/tty_io.c      |   4 +-
>  fs/9p/vfs_file.c          |  26 ++-
>  fs/afs/file.c             |  20 +-
>  fs/ceph/file.c            |  66 +++++-
>  fs/cifs/cifsfs.c          |   8 +-
>  fs/cifs/cifsfs.h          |   3 -
>  fs/cifs/file.c            |  16 --
>  fs/coda/file.c            |  29 ++-
>  fs/direct-io.c            |   2 +
>  fs/ecryptfs/file.c        |  27 ++-
>  fs/ext4/file.c            |  13 +-
>  fs/f2fs/file.c            |  68 +++++-
>  fs/iomap/direct-io.c      |   1 -
>  fs/kernfs/file.c          |   2 +-
>  fs/nfs/file.c             |  26 ++-
>  fs/nfs/internal.h         |   2 +
>  fs/nfs/nfs4file.c         |   2 +-
>  fs/ntfs3/file.c           |  36 +++-
>  fs/ocfs2/file.c           |  42 +++-
>  fs/ocfs2/ocfs2_trace.h    |   3 +
>  fs/orangefs/file.c        |  25 ++-
>  fs/overlayfs/file.c       |  23 +-
>  fs/proc/inode.c           |   4 +-
>  fs/proc/proc_sysctl.c     |   2 +-
>  fs/proc_namespace.c       |   6 +-
>  fs/splice.c               |  93 ++++----
>  fs/xfs/xfs_file.c         |  33 ++-
>  fs/xfs/xfs_trace.h        |   2 +-
>  fs/zonefs/file.c          |  43 +++-
>  include/linux/bio.h       |   5 +-
>  include/linux/blk_types.h |   3 +-
>  include/linux/splice.h    |   3 +
>  include/linux/uio.h       |  14 --
>  kernel/trace/trace.c      |   2 +-
>  lib/iov_iter.c            | 431 +-------------------------------------
>  mm/filemap.c              |   7 +-
>  mm/shmem.c                | 134 +++++++++++-
>  net/socket.c              |   2 +-
>  42 files changed, 717 insertions(+), 578 deletions(-)
>

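For readers skimming the shortlog above: the many per-filesystem "splice-read
stub" patches share one basic shape.  Each does whatever locking or
revalidation its filesystem needs and then hands the buffered case to
filemap_splice_read() (several also branch to a direct read path for O_DIRECT
or DAX files first).  The following is a minimal sketch of that shape for an
imaginary "examplefs"; the function name, the file_operations instance and the
choice of inode_lock_shared() are illustrative assumptions, while
filemap_splice_read(), file_inode() and the splice_read prototype are real
kernel interfaces.

#include <linux/fs.h>
#include <linux/splice.h>

/*
 * Illustrative only: a per-filesystem splice-read stub in the shape the
 * shortlog describes.  Take fs-specific locks, defer to the common
 * buffered helper, unlock, done.
 */
static ssize_t examplefs_splice_read(struct file *in, loff_t *ppos,
                                     struct pipe_inode_info *pipe,
                                     size_t len, unsigned int flags)
{
        struct inode *inode = file_inode(in);
        ssize_t ret;

        inode_lock_shared(inode);       /* stand-in for fs-specific locking */
        ret = filemap_splice_read(in, ppos, pipe, len, flags);
        inode_unlock_shared(inode);
        return ret;
}

/* Hooked up through the filesystem's file_operations: */
static const struct file_operations examplefs_file_ops = {
        .splice_read    = examplefs_splice_read,
        /* other methods elided */
};

The in-tree stubs differ mainly in which locks they take and in whether they
need to revalidate or bypass the pagecache before splicing from it.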

* Re: [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE
  2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
                   ` (34 preceding siblings ...)
  2023-06-14  0:55 ` Mike Marshall
@ 2023-06-16 11:38 ` David Howells
  35 siblings, 0 replies; 65+ messages in thread
From: David Howells @ 2023-06-16 11:38 UTC (permalink / raw)
  To: Mike Marshall; +Cc: dhowells, linux-fsdevel

Mike Marshall <hubcap@omnibond.com> wrote:

> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract
> to build 6.4.0-rc2-00037-g0c3c931ab6d1 and ran xfstests
> through with no regressions on orangefs. You can add a
> tested by me if you'd like...

Thanks :-)  I'm pushing this branch upstream in pieces, though: some of it is
going in via the block tree and some via the net-next tree.

David



end of thread, newest message: 2023-06-16 11:39 UTC

Thread overview: 65+ messages
2023-05-19  7:40 [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE David Howells
2023-05-19  7:40 ` [PATCH v20 01/32] splice: Fix filemap of a blockdev David Howells
2023-05-19  8:06   ` Christoph Hellwig
2023-05-19  9:11   ` David Howells
2023-05-19  7:40 ` [PATCH v20 02/32] splice: Clean up direct_splice_read() a bit David Howells
2023-05-19  7:40 ` [PATCH v20 03/32] splice: Make direct_splice_read() limit to eof where appropriate David Howells
2023-05-19  8:09   ` Christoph Hellwig
2023-05-19  8:43   ` David Howells
2023-05-19  8:47     ` Christoph Hellwig
2023-05-19 22:27     ` David Howells
2023-05-20  3:54       ` Christoph Hellwig
2023-05-19 16:31   ` Linus Torvalds
2023-05-19 16:47   ` David Howells
2023-05-19 17:37     ` Linus Torvalds
2023-05-19  7:40 ` [PATCH v20 04/32] splice: Make do_splice_to() generic and export it David Howells
2023-05-19  7:40 ` [PATCH v20 05/32] splice: Make splice from a DAX file use direct_splice_read() David Howells
2023-05-19  8:10   ` Christoph Hellwig
2023-05-19  8:48   ` David Howells
2023-05-19  7:40 ` [PATCH v20 06/32] shmem: Implement splice-read David Howells
2023-05-19  7:40 ` [PATCH v20 07/32] overlayfs: " David Howells
2023-05-19  7:40 ` [PATCH v20 08/32] coda: " David Howells
2023-05-19  7:40 ` [PATCH v20 09/32] tty, proc, kernfs, random: Use direct_splice_read() David Howells
2023-05-19  8:12   ` Christoph Hellwig
2023-05-19 16:22     ` Linus Torvalds
2023-05-19  7:40 ` [PATCH v20 10/32] net: Make sock_splice_read() use direct_splice_read() by default David Howells
2023-05-19  7:40 ` [PATCH v20 11/32] 9p: Add splice_read stub David Howells
2023-05-19  7:40 ` [PATCH v20 12/32] afs: Provide a splice-read stub David Howells
2023-05-19  7:40 ` [PATCH v20 13/32] ceph: " David Howells
2023-05-19  8:40   ` Xiubo Li
2023-05-19  9:24   ` David Howells
2023-05-22  1:53     ` Xiubo Li
2023-05-19  7:40 ` [PATCH v20 14/32] ecryptfs: " David Howells
2023-05-19  7:40 ` [PATCH v20 15/32] ext4: " David Howells
2023-05-19  7:40 ` [PATCH v20 16/32] f2fs: " David Howells
2023-05-19  7:40 ` [PATCH v20 17/32] nfs: " David Howells
2023-05-19  7:40 ` [PATCH v20 18/32] ntfs3: " David Howells
2023-05-19  7:40 ` [PATCH v20 19/32] ocfs2: " David Howells
2023-05-19  7:40 ` [PATCH v20 20/32] orangefs: " David Howells
2023-05-19  7:40 ` [PATCH v20 21/32] xfs: " David Howells
2023-05-19  7:40 ` [PATCH v20 22/32] zonefs: " David Howells
2023-05-19  7:40 ` [PATCH v20 23/32] splice: Convert trace/seq to use direct_splice_read() David Howells
2023-05-22 14:29   ` Steven Rostedt
2023-05-22 14:50   ` David Howells
2023-05-22 17:42     ` Linus Torvalds
2023-05-22 18:38       ` Steven Rostedt
2023-05-19  7:40 ` [PATCH v20 24/32] splice: Do splice read from a file without using ITER_PIPE David Howells
2023-05-19  7:40 ` [PATCH v20 25/32] cifs: Use generic_file_splice_read() David Howells
2023-05-19  7:40 ` [PATCH v20 26/32] iov_iter: Kill ITER_PIPE David Howells
2023-05-19  7:40 ` [PATCH v20 27/32] iomap: Don't get a reference on ZERO_PAGE for direct I/O block zeroing David Howells
2023-05-19  7:40 ` [PATCH v20 28/32] block: Fix bio_flagged() so that gcc can better optimise it David Howells
2023-05-19  7:40 ` [PATCH v20 29/32] block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic David Howells
2023-05-20  1:26   ` Kent Overstreet
2023-05-20  3:56     ` Christoph Hellwig
2023-05-20  4:13       ` Kent Overstreet
2023-05-20  4:17         ` Christoph Hellwig
2023-05-20  5:52           ` Kent Overstreet
2023-05-20  8:40   ` David Howells
2023-05-19  7:40 ` [PATCH v20 30/32] block: Add BIO_PAGE_PINNED and associated infrastructure David Howells
2023-05-19  7:40 ` [PATCH v20 31/32] block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages David Howells
2023-05-19  7:40 ` [PATCH v20 32/32] block: convert bio_map_user_iov " David Howells
2023-05-19  7:49 ` [PATCH] iov_iter: Add automatic-alloc for ITER_BVEC and use in direct_splice_read() David Howells
2023-05-19  8:18   ` Christoph Hellwig
2023-05-19  8:06 ` [PATCH v20 00/32] splice, block: Use page pinning and kill ITER_PIPE Christoph Hellwig
2023-06-14  0:55 ` Mike Marshall
2023-06-16 11:38 ` David Howells
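
As a pointer for the block-layer entries at the tail of the index (patches
28/32 to 32/32): once a bio records how its data pages were obtained, page
cleanup on completion reduces to a flag check.  The sketch below is
illustrative, not the series' actual helper; example_bio_release_page() is a
made-up name, while bio_flagged(), unpin_user_page() and put_page() are
existing kernel APIs and BIO_PAGE_PINNED/BIO_PAGE_REFFED are the flags those
patches introduce.

#include <linux/bio.h>
#include <linux/mm.h>

/*
 * Illustrative sketch: drop one data page of a completed bio according
 * to how it was acquired.  A pinned page (FOLL_PIN, as obtained via
 * iov_iter_extract_pages()) is unpinned; a ref'd page (FOLL_GET) has
 * its reference put; with neither flag set, the submitter of the I/O
 * is assumed to keep the page alive, so nothing is released here.
 */
static void example_bio_release_page(struct bio *bio, struct page *page)
{
        if (bio_flagged(bio, BIO_PAGE_PINNED))
                unpin_user_page(page);
        else if (bio_flagged(bio, BIO_PAGE_REFFED))
                put_page(page);
}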
