* [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list)
@ 2023-01-16 23:07 David Howells
  2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
                   ` (35 more replies)
  0 siblings, 36 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:07 UTC (permalink / raw)
  To: Al Viro
  Cc: James E.J. Bottomley, Paolo Abeni, John Hubbard,
	Christoph Hellwig, Paulo Alcantara, linux-scsi, Steve French,
	Stefan Metzmacher, Miklos Szeredi, Martin K. Petersen,
	Logan Gunthorpe, Jeff Layton, Jakub Kicinski, netdev,
	Rohith Surabattula, Eric Dumazet, Matthew Wilcox, Anna Schumaker,
	Jens Axboe, Shyam Prasad N, Tom Talpey, linux-rdma,
	Trond Myklebust, Christian Schoenebeck, linux-mm, linux-crypto,
	linux-nfs, v9fs-developer, Latchesar Ionkov, linux-fsdevel,
	Eric Van Hensbergen, Long Li, Jan Kara, linux-cachefs,
	linux-block, Dominique Martinet, Namjae Jeon, David S. Miller,
	linux-cifs, Steve French, Herbert Xu, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel


Hi Al, Christoph,

Here are patches to clean up some uses of READ/WRITE and ITER_SOURCE/DEST,
patches to provide support for extracting pages from an iov_iter, and a
patch to use the primary extraction function in the block layer bio code.
I've also added a number of other conversions and had a tentative stab at
the networking code.

The patches make the following changes (illustrative sketches of some of
these follow the list):

 (1) Switch from using the iterator's data_source to indicate the I/O
     direction to deriving this from other information, e.g. IOCB_WRITE,
     IOMAP_WRITE and the REQ_OP_* number.  This allows iov_iter_rw() to be
     removed eventually.

 (2) Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass these into
     iov_iter_get_pages*() to indicate the I/O direction with regard to
     how the buffer described by the iterator is to be used.  This is
     included in the gup_flags argument added by Logan's patches.

     Calls to iov_iter_get_pages*2() are replaced with calls to
     iov_iter_get_pages*() and the former is removed.

 (3) Add a function, iov_iter_extract_pages(), to replace
     iov_iter_get_pages*() that gets refs, pins or just lists the pages as
     appropriate to the iterator type and the I/O direction.

     Add a function, iov_iter_extract_mode(), that will indicate from the
     iterator type and the I/O direction how the cleanup is to be
     performed, returning FOLL_GET, FOLL_PIN or 0.

     Add a function, folio_put_unpin(), and a wrapper, page_put_unpin(),
     that take a page and the return from iov_iter_extract_mode() and do
     the right thing to clean up the page.

 (4) Make the bio struct carry a pair of flags to indicate the cleanup
     mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent to
     FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is added.
     These are forced to have the same values as the FOLL_* flags so they
     can be passed to the previously mentioned cleanup function.

 (5) Make the iter-to-bio code use iov_iter_extract_pages() to
     appropriately retain the pages and clean them up later.

 (6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

 (7) Add a function in netfslib, netfs_extract_user_iter(), to extract a
     UBUF- or IOVEC-type iterator to a page list in a BVEC-type iterator,
     with all the pages suitably ref'd or pinned.

 (8) Add a function in netfslib, netfs_extract_iter_to_sg(), to extract a
     UBUF-, IOVEC-, BVEC-, XARRAY- or KVEC-type iterator to a scatterlist.
     The first two types appropriately ref or pin pages; the latter three
     don't perform any retention, leaving that to the caller.

     Note that I can make use of this in the SCSI and AF_ALG code and
     possibly the networking code, so this might merit being moved to core
     code.

 (9) Make AF_ALG use iov_iter_extract_pages() and possibly go further and
     make it use netfs_extract_iter_to_sg() instead.

(10) Make SCSI vhost use netfs_extract_iter_to_sg().

(11) Make fs/direct-io.c use iov_iter_extract_pages().

(12) Make splice-to-pipe use iov_iter_extract_pages(), but limit the usage
     to a cleanup mode of FOLL_GET.

(13) Make the 9P, FUSE and NFS filesystems use iov_iter_extract_pages().

(14) Make the CIFS filesystem use iterators from the top all the way down
     to the socket on the simple path.  In the top layers, either use an
     XARRAY-type iterator directly or use netfs_extract_user_iter() to
     build a BVEC-type iterator from a UBUF- or IOVEC-type iterator, and
     attach the iterator to the operation descriptors.

     netfs_extract_iter_to_sg() is used to build scatterlists for doing
     transport crypto and a function, smb_extract_iter_to_rdma(), is
     provided to build an RDMA SGE list directly from an iterator without
     going via a page list and then a scatter list.

(15) A couple of work-in-progress patches to try and make sk_buff fragments
     record the information needed to clean them up in the lowest two bits
     of the page pointer in the fragment struct.
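
To make the above a little more concrete, here are some purely
illustrative sketches; the diffs in the patches themselves are
authoritative.

For (2), callers simply state which way the data will flow through the
buffer, for example:

	struct page *pages[16];
	size_t start;
	ssize_t got;

	/* A read-type operation fills the buffer, so it is a destination. */
	got = iov_iter_get_pages(iter, pages, LONG_MAX, ARRAY_SIZE(pages),
				 &start, FOLL_DEST_BUF);
	if (got < 0)
		return got;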
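
For (3), the intended calling pattern is roughly as follows (a sketch
only; the exact argument order of iov_iter_extract_pages() and
iov_iter_extract_mode() may differ from what is shown here):

	/* Sketch only: signatures may differ from the actual patches. */
	struct page *pages[16], **ppages = pages;
	unsigned int cleanup_mode = iov_iter_extract_mode(iter, FOLL_DEST_BUF);
	size_t offset;
	ssize_t len;
	int i;

	len = iov_iter_extract_pages(iter, &ppages, LONG_MAX, 16,
				     FOLL_DEST_BUF, &offset);
	if (len < 0)
		return len;

	/* ... do the I/O to/from the extracted pages ... */

	/* cleanup_mode is FOLL_GET, FOLL_PIN or 0 as appropriate. */
	for (i = 0; i < DIV_ROUND_UP(offset + len, PAGE_SIZE); i++)
		page_put_unpin(pages[i], cleanup_mode);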
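
For (4), because BIO_PAGE_REFFED and BIO_PAGE_PINNED share their values
with FOLL_GET and FOLL_PIN, a bio completion path can recover the cleanup
mode from the bio flags and hand it to the helper from (3), along the
lines of:

	unsigned int cleanup_mode = 0;

	if (bio_flagged(bio, BIO_PAGE_REFFED))
		cleanup_mode |= FOLL_GET;
	if (bio_flagged(bio, BIO_PAGE_PINNED))
		cleanup_mode |= FOLL_PIN;

	page_put_unpin(page, cleanup_mode);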
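
And (15) is ordinary pointer tagging: struct page pointers have at least
four-byte alignment, so the bottom two bits of a fragment's page pointer
are free to carry a cleanup code.  The names below are hypothetical and
the WIP patches may well do this differently:

	#define SKB_FRAG_CLEANUP_MASK	3UL		/* hypothetical */

	static inline struct page *skb_frag_page_of(unsigned long tagged)
	{
		return (struct page *)(tagged & ~SKB_FRAG_CLEANUP_MASK);
	}

	static inline unsigned int skb_frag_cleanup_of(unsigned long tagged)
	{
		return tagged & SKB_FRAG_CLEANUP_MASK;
	}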

This leaves:

 (*) Four calls to iov_iter_get_pages() in CEPH.  That will be helped by
     patches to pass an iterator down to the transport layer instead of
     converting to a page list high up and passing that down, but the
     transport layer could do with some massaging so that it doesn't convert
     the iterator to a page list and then the pages individually back to
     iterators to pass to the socket.

 (*) One call to iov_iter_get_pages() each in the networking core, RDS and
     TLS, all related to zero-copy.  TLS seems to do zerocopy-read (or
     maybe decrypt-offload) and should be doing FOLL_PIN, not FOLL_GET for
     user-provided buffers.


Changes:
========
ver #6)
 - Fix write() syscall and co. not setting IOCB_WRITE.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the iterator direction checks from smbd_recv().
 - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
   gup_flags to iov_iter_get/extract_pages*().
 - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Provide a helper to clean up a page.
 - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
   the same numerical values, enforced with an assertion.
 - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
   NFS.
 - Added in the patches to make CIFS do top-to-bottom iterators and use
   various of the added extraction functions.
 - Added a pair of work-in-progress patches to make sk_buff fragments store
   FOLL_GET and FOLL_PIN.

ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

I've pushed the patches (excluding the two WIP networking patches) here
also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5

Previous versions of the CIFS patch sets can be found here:
Link: https://lore.kernel.org/r/164311902471.2806745.10187041199819525677.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/164928615045.457102.10607899252434268982.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211416682.3154751.17287804906832979514.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348876794.2106726.9240233279581920208.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364823513.3334034.11209090728654641458.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126392703.708021.14465850073772688008.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc


---
David Howells (34):
      vfs: Unconditionally set IOCB_WRITE in call_write_iter()
      iov_iter: Use IOCB/IOMAP_WRITE/op_is_write rather than iterator direction
      iov_iter: Pass I/O direction into iov_iter_get_pages*()
      iov_iter: Remove iov_iter_get_pages2/pages_alloc2()
      iov_iter: Change the direction macros into an enum
      iov_iter: Use the direction in the iterator functions
      iov_iter: Add a function to extract a page list from an iterator
      mm: Provide a helper to drop a pin/ref on a page
      bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
      mm, block: Make BIO_PAGE_REFFED/PINNED the same as FOLL_GET/PIN numerically
      iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
      bio: Fix bio_flagged() so that gcc can better optimise it
      netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
      netfs: Add a function to extract an iterator into a scatterlist
      af_alg: Pin pages rather than ref'ing if appropriate
      af_alg: [RFC] Use netfs_extract_iter_to_sg() to create scatterlists
      scsi: [RFC] Use netfs_extract_iter_to_sg()
      dio: Pin pages rather than ref'ing if appropriate
      fuse:  Pin pages rather than ref'ing if appropriate
      vfs: Make splice use iov_iter_extract_pages()
      9p: Pin pages rather than ref'ing if appropriate
      nfs: Pin pages rather than ref'ing if appropriate
      cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE
      cifs: Add a function to build an RDMA SGE list from an iterator
      cifs: Add a function to Hash the contents of an iterator
      cifs: Add some helper functions
      cifs: Add a function to read into an iter from a socket
      cifs: Change the I/O paths to use an iterator rather than a page list
      cifs: Build the RDMA SGE list directly from an iterator
      cifs: Remove unused code
      cifs: Fix problem with encrypted RDMA data read
      cifs: DIO to/from KVEC-type iterators should now work
      net: [RFC][WIP] Mark each skb_frags as to how they should be cleaned up
      net: [RFC][WIP] Make __zerocopy_sg_from_iter() correctly pin or leave pages unref'd


 block/bio.c               |   48 +-
 block/blk-map.c           |   26 +-
 block/blk.h               |   25 +
 block/fops.c              |    8 +-
 crypto/af_alg.c           |   57 +-
 crypto/algif_hash.c       |   20 +-
 drivers/net/tun.c         |    2 +-
 drivers/vhost/scsi.c      |   75 +-
 fs/9p/vfs_addr.c          |    2 +-
 fs/affs/file.c            |    4 +-
 fs/ceph/addr.c            |    2 +-
 fs/ceph/file.c            |   16 +-
 fs/cifs/Kconfig           |    1 +
 fs/cifs/cifsencrypt.c     |  172 +++-
 fs/cifs/cifsfs.c          |   12 +-
 fs/cifs/cifsfs.h          |    6 +
 fs/cifs/cifsglob.h        |   66 +-
 fs/cifs/cifsproto.h       |   11 +-
 fs/cifs/cifssmb.c         |   13 +-
 fs/cifs/connect.c         |   16 +
 fs/cifs/file.c            | 1851 +++++++++++++++++--------------------
 fs/cifs/fscache.c         |   22 +-
 fs/cifs/fscache.h         |   10 +-
 fs/cifs/misc.c            |  132 +--
 fs/cifs/smb2ops.c         |  374 ++++----
 fs/cifs/smb2pdu.c         |   45 +-
 fs/cifs/smbdirect.c       |  511 ++++++----
 fs/cifs/smbdirect.h       |    4 +-
 fs/cifs/transport.c       |   57 +-
 fs/dax.c                  |    6 +-
 fs/direct-io.c            |   77 +-
 fs/exfat/inode.c          |    6 +-
 fs/ext2/inode.c           |    2 +-
 fs/f2fs/file.c            |   10 +-
 fs/fat/inode.c            |    4 +-
 fs/fuse/dax.c             |    2 +-
 fs/fuse/dev.c             |   24 +-
 fs/fuse/file.c            |   34 +-
 fs/fuse/fuse_i.h          |    1 +
 fs/hfs/inode.c            |    2 +-
 fs/hfsplus/inode.c        |    2 +-
 fs/iomap/direct-io.c      |    6 +-
 fs/jfs/inode.c            |    2 +-
 fs/netfs/Makefile         |    1 +
 fs/netfs/iterator.c       |  371 ++++++++
 fs/nfs/direct.c           |   32 +-
 fs/nilfs2/inode.c         |    2 +-
 fs/ntfs3/inode.c          |    2 +-
 fs/ocfs2/aops.c           |    2 +-
 fs/orangefs/inode.c       |    2 +-
 fs/reiserfs/inode.c       |    2 +-
 fs/splice.c               |   10 +-
 fs/udf/inode.c            |    2 +-
 include/crypto/if_alg.h   |    7 +-
 include/linux/bio.h       |   23 +-
 include/linux/blk_types.h |    3 +-
 include/linux/fs.h        |   11 +
 include/linux/mm.h        |   32 +-
 include/linux/netfs.h     |    6 +
 include/linux/skbuff.h    |  124 ++-
 include/linux/uio.h       |   83 +-
 io_uring/net.c            |    2 +-
 lib/iov_iter.c            |  428 ++++++++-
 mm/gup.c                  |   47 +
 mm/vmalloc.c              |    1 +
 net/9p/trans_common.c     |    6 +-
 net/9p/trans_common.h     |    3 +-
 net/9p/trans_virtio.c     |   91 +-
 net/bpf/test_run.c        |    2 +-
 net/core/datagram.c       |   23 +-
 net/core/gro.c            |    2 +-
 net/core/skbuff.c         |   16 +-
 net/core/skmsg.c          |    4 +-
 net/ipv4/ip_output.c      |    2 +-
 net/ipv4/tcp.c            |    4 +-
 net/ipv6/esp6.c           |    5 +-
 net/ipv6/ip6_output.c     |    2 +-
 net/packet/af_packet.c    |    2 +-
 net/rds/message.c         |    4 +-
 net/tls/tls_sw.c          |    5 +-
 net/xfrm/xfrm_ipcomp.c    |    2 +-
 81 files changed, 3006 insertions(+), 2126 deletions(-)
 create mode 100644 fs/netfs/iterator.c




* [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
@ 2023-01-16 23:08 ` David Howells
  2023-01-17  7:52   ` Christoph Hellwig
                     ` (5 more replies)
  2023-01-16 23:08 ` [PATCH v6 02/34] iov_iter: Use IOCB/IOMAP_WRITE/op_is_write rather than iterator direction David Howells
                   ` (34 subsequent siblings)
  35 siblings, 6 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-fsdevel,
	dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

IOCB_WRITE is set by aio, io_uring and cachefiles before submitting a write
operation to the VFS, but it isn't set by, say, the write() system call.

Fix this by setting IOCB_WRITE unconditionally in call_write_iter().

This will allow drivers to use IOCB_WRITE instead of the iterator data
source to determine the I/O direction.
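
A driver can then decide the direction from the kiocb rather than from the
iterator, e.g. (iocb_is_write() is added by the next patch; this is just a
sketch):

	blk_opf_t opf = iocb_is_write(iocb) ? REQ_OP_WRITE : REQ_OP_READ;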

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
---

 include/linux/fs.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 066555ad1bf8..649ff061440e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2183,6 +2183,7 @@ static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
 				      struct iov_iter *iter)
 {
+	kio->ki_flags |= IOCB_WRITE;
 	return file->f_op->write_iter(kio, iter);
 }
 




* [PATCH v6 02/34] iov_iter: Use IOCB/IOMAP_WRITE/op_is_write rather than iterator direction
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
  2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
@ 2023-01-16 23:08 ` David Howells
  2023-01-17  7:55   ` Christoph Hellwig
  2023-01-16 23:08 ` [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*() David Howells
                   ` (33 subsequent siblings)
  35 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-16 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Use information other than the iterator direction to determine the
direction of the I/O:

 (*) If a kiocb is available, use the IOCB_WRITE flag.

 (*) If an iomap_iter is available, use the IOMAP_WRITE flag.

 (*) If a request is available, use op_is_write().

Drop the check on the iterator in smbd_recv() and its warning.

This leaves __iov_iter_get_pages_alloc() the only user of iov_iter_rw(), so
move it there and uninline it.

Changes:
========
ver #6)
 - Move to the front of the patchset.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the checks from smbd_recv().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/167305163159.1521586.9460968250704377087.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344727810.2425628.4715663653893036683.stgit@warthog.procyon.org.uk/ # v5
---

 block/blk-map.c      |    2 +-
 block/fops.c         |    8 ++++----
 fs/9p/vfs_addr.c     |    2 +-
 fs/affs/file.c       |    4 ++--
 fs/ceph/file.c       |    2 +-
 fs/cifs/smbdirect.c  |    9 ---------
 fs/dax.c             |    6 +++---
 fs/direct-io.c       |   22 +++++++++++-----------
 fs/exfat/inode.c     |    6 +++---
 fs/ext2/inode.c      |    2 +-
 fs/f2fs/file.c       |   10 +++++-----
 fs/fat/inode.c       |    4 ++--
 fs/fuse/dax.c        |    2 +-
 fs/fuse/file.c       |    8 ++++----
 fs/hfs/inode.c       |    2 +-
 fs/hfsplus/inode.c   |    2 +-
 fs/iomap/direct-io.c |    6 +++---
 fs/jfs/inode.c       |    2 +-
 fs/nfs/direct.c      |    2 +-
 fs/nilfs2/inode.c    |    2 +-
 fs/ntfs3/inode.c     |    2 +-
 fs/ocfs2/aops.c      |    2 +-
 fs/orangefs/inode.c  |    2 +-
 fs/reiserfs/inode.c  |    2 +-
 fs/udf/inode.c       |    2 +-
 include/linux/fs.h   |   10 ++++++++++
 include/linux/uio.h  |    5 -----
 lib/iov_iter.c       |    5 +++++
 28 files changed, 67 insertions(+), 66 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 19940c978c73..08cbb7ff3b19 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -203,7 +203,7 @@ static int bio_copy_user_iov(struct request *rq, struct rq_map_data *map_data,
 	/*
 	 * success
 	 */
-	if ((iov_iter_rw(iter) == WRITE &&
+	if ((op_is_write(rq->cmd_flags) &&
 	     (!map_data || !map_data->null_mapped)) ||
 	    (map_data && map_data->from_user)) {
 		ret = bio_copy_from_iter(bio, iter);
diff --git a/block/fops.c b/block/fops.c
index 50d245e8c913..5d376285edde 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -73,7 +73,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 			return -ENOMEM;
 	}
 
-	if (iov_iter_rw(iter) == READ) {
+	if (iocb_is_read(iocb)) {
 		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
 		if (user_backed_iter(iter))
 			should_dirty = true;
@@ -88,7 +88,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 		goto out;
 	ret = bio.bi_iter.bi_size;
 
-	if (iov_iter_rw(iter) == WRITE)
+	if (iocb_is_write(iocb))
 		task_io_account_write(ret);
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
@@ -174,7 +174,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	struct blk_plug plug;
 	struct blkdev_dio *dio;
 	struct bio *bio;
-	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+	bool is_read = iocb_is_read(iocb), is_sync;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
@@ -296,7 +296,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 					unsigned int nr_pages)
 {
 	struct block_device *bdev = iocb->ki_filp->private_data;
-	bool is_read = iov_iter_rw(iter) == READ;
+	bool is_read = iocb_is_read(iocb);
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	struct blkdev_dio *dio;
 	struct bio *bio;
diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index 97599edbc300..080be076b7b6 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -254,7 +254,7 @@ v9fs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t n;
 	int err = 0;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb_is_write(iocb)) {
 		n = p9_client_write(file->private_data, pos, iter, &err);
 		if (n) {
 			struct inode *inode = file_inode(file);
diff --git a/fs/affs/file.c b/fs/affs/file.c
index cefa222f7881..0dc67fc5d6cb 100644
--- a/fs/affs/file.c
+++ b/fs/affs/file.c
@@ -400,7 +400,7 @@ affs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb_is_write(iocb)) {
 		loff_t size = offset + count;
 
 		if (AFFS_I(inode)->mmu_private < size)
@@ -408,7 +408,7 @@ affs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	}
 
 	ret = blockdev_direct_IO(iocb, inode, iter, affs_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
+	if (ret < 0 && iocb_is_write(iocb))
 		affs_write_failed(mapping, offset + count);
 	return ret;
 }
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 764598e1efd9..27c72a2f6af5 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1284,7 +1284,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 	struct timespec64 mtime = current_time(inode);
 	size_t count = iov_iter_count(iter);
 	loff_t pos = iocb->ki_pos;
-	bool write = iov_iter_rw(iter) == WRITE;
+	bool write = iocb_is_write(iocb);
 	bool should_dirty = !write && user_backed_iter(iter);
 
 	if (write && ceph_snap(file_inode(file)) != CEPH_NOSNAP)
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 90789aaa6567..3e693ffd0662 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1938,14 +1938,6 @@ int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
 	unsigned int to_read, page_offset;
 	int rc;
 
-	if (iov_iter_rw(&msg->msg_iter) == WRITE) {
-		/* It's a bug in upper layer to get there */
-		cifs_dbg(VFS, "Invalid msg iter dir %u\n",
-			 iov_iter_rw(&msg->msg_iter));
-		rc = -EINVAL;
-		goto out;
-	}
-
 	switch (iov_iter_type(&msg->msg_iter)) {
 	case ITER_KVEC:
 		buf = msg->msg_iter.kvec->iov_base;
@@ -1967,7 +1959,6 @@ int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
 		rc = -EINVAL;
 	}
 
-out:
 	/* SMBDirect will read it all or nothing */
 	if (rc > 0)
 		msg->msg_iter.count = 0;
diff --git a/fs/dax.c b/fs/dax.c
index c48a3a93ab29..b538a2ab7b66 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1405,7 +1405,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 	loff_t pos = iomi->pos;
 	struct dax_device *dax_dev = iomap->dax_dev;
 	loff_t end = pos + length, done = 0;
-	bool write = iov_iter_rw(iter) == WRITE;
+	bool write = iomi->flags & IOMAP_WRITE;
 	bool cow = write && iomap->flags & IOMAP_F_SHARED;
 	ssize_t ret = 0;
 	size_t xfer;
@@ -1455,7 +1455,7 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi,
 
 		map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
 				DAX_ACCESS, &kaddr, NULL);
-		if (map_len == -EIO && iov_iter_rw(iter) == WRITE) {
+		if (map_len == -EIO && write) {
 			map_len = dax_direct_access(dax_dev, pgoff,
 					PHYS_PFN(size), DAX_RECOVERY_WRITE,
 					&kaddr, NULL);
@@ -1530,7 +1530,7 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (!iomi.len)
 		return 0;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb_is_write(iocb)) {
 		lockdep_assert_held_write(&iomi.inode->i_rwsem);
 		iomi.flags |= IOMAP_WRITE;
 	} else {
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 03d381377ae1..cf196f2a211e 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -1143,7 +1143,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 
 	/* watch out for a 0 len io from a tricksy fs */
-	if (iov_iter_rw(iter) == READ && !count)
+	if (iocb_is_read(iocb) && !count)
 		return 0;
 
 	dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
@@ -1157,14 +1157,14 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	memset(dio, 0, offsetof(struct dio, pages));
 
 	dio->flags = flags;
-	if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) {
+	if (dio->flags & DIO_LOCKING && iocb_is_read(iocb)) {
 		/* will be released by direct_io_worker */
 		inode_lock(inode);
 	}
 
 	/* Once we sampled i_size check for reads beyond EOF */
 	dio->i_size = i_size_read(inode);
-	if (iov_iter_rw(iter) == READ && offset >= dio->i_size) {
+	if (iocb_is_read(iocb) && offset >= dio->i_size) {
 		retval = 0;
 		goto fail_dio;
 	}
@@ -1177,7 +1177,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 			goto fail_dio;
 	}
 
-	if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ) {
+	if (dio->flags & DIO_LOCKING && iocb_is_read(iocb)) {
 		struct address_space *mapping = iocb->ki_filp->f_mapping;
 
 		retval = filemap_write_and_wait_range(mapping, offset, end - 1);
@@ -1193,13 +1193,13 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 	if (is_sync_kiocb(iocb))
 		dio->is_async = false;
-	else if (iov_iter_rw(iter) == WRITE && end > i_size_read(inode))
+	else if (iocb_is_write(iocb) && end > i_size_read(inode))
 		dio->is_async = false;
 	else
 		dio->is_async = true;
 
 	dio->inode = inode;
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb_is_write(iocb)) {
 		dio->opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
 		if (iocb->ki_flags & IOCB_NOWAIT)
 			dio->opf |= REQ_NOWAIT;
@@ -1211,7 +1211,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * For AIO O_(D)SYNC writes we need to defer completions to a workqueue
 	 * so that we can call ->fsync.
 	 */
-	if (dio->is_async && iov_iter_rw(iter) == WRITE) {
+	if (dio->is_async && iocb_is_write(iocb)) {
 		retval = 0;
 		if (iocb_is_dsync(iocb))
 			retval = dio_set_defer_completion(dio);
@@ -1248,7 +1248,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	spin_lock_init(&dio->bio_lock);
 	dio->refcount = 1;
 
-	dio->should_dirty = user_backed_iter(iter) && iov_iter_rw(iter) == READ;
+	dio->should_dirty = user_backed_iter(iter) && iocb_is_read(iocb);
 	sdio.iter = iter;
 	sdio.final_block_in_request = end >> blkbits;
 
@@ -1305,7 +1305,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 * we can let i_mutex go now that its achieved its purpose
 	 * of protecting us from looking up uninitialized blocks.
 	 */
-	if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
+	if (iocb_is_read(iocb) && (dio->flags & DIO_LOCKING))
 		inode_unlock(dio->inode);
 
 	/*
@@ -1317,7 +1317,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	 */
 	BUG_ON(retval == -EIOCBQUEUED);
 	if (dio->is_async && retval == 0 && dio->result &&
-	    (iov_iter_rw(iter) == READ || dio->result == count))
+	    (iocb_is_read(iocb) || dio->result == count))
 		retval = -EIOCBQUEUED;
 	else
 		dio_await_completion(dio);
@@ -1330,7 +1330,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 	return retval;
 
 fail_dio:
-	if (dio->flags & DIO_LOCKING && iov_iter_rw(iter) == READ)
+	if (dio->flags & DIO_LOCKING && iocb_is_read(iocb))
 		inode_unlock(inode);
 
 	kmem_cache_free(dio_cache, dio);
diff --git a/fs/exfat/inode.c b/fs/exfat/inode.c
index 5b644cb057fa..82554aaf4fd0 100644
--- a/fs/exfat/inode.c
+++ b/fs/exfat/inode.c
@@ -412,10 +412,10 @@ static ssize_t exfat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct address_space *mapping = iocb->ki_filp->f_mapping;
 	struct inode *inode = mapping->host;
 	loff_t size = iocb->ki_pos + iov_iter_count(iter);
-	int rw = iov_iter_rw(iter);
+	bool writing = iocb_is_write(iocb);
 	ssize_t ret;
 
-	if (rw == WRITE) {
+	if (writing) {
 		/*
 		 * FIXME: blockdev_direct_IO() doesn't use ->write_begin(),
 		 * so we need to update the ->i_size_aligned to block boundary.
@@ -434,7 +434,7 @@ static ssize_t exfat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * condition of exfat_get_block() and ->truncate().
 	 */
 	ret = blockdev_direct_IO(iocb, inode, iter, exfat_get_block);
-	if (ret < 0 && (rw & WRITE))
+	if (ret < 0 && writing)
 		exfat_write_failed(mapping, size);
 	return ret;
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 69aed9e2359e..26a61f886844 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -919,7 +919,7 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t ret;
 
 	ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
+	if (ret < 0 && iocb_is_write(iocb))
 		ext2_write_failed(mapping, offset + count);
 	return ret;
 }
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index ecbc8c135b49..51a24580cfec 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -809,7 +809,7 @@ int f2fs_truncate(struct inode *inode)
 	return 0;
 }
 
-static bool f2fs_force_buffered_io(struct inode *inode, int rw)
+static bool f2fs_force_buffered_io(struct inode *inode, bool writing)
 {
 	struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
 
@@ -827,9 +827,9 @@ static bool f2fs_force_buffered_io(struct inode *inode, int rw)
 	 * for blkzoned device, fallback direct IO to buffered IO, so
 	 * all IOs can be serialized by log-structured write.
 	 */
-	if (f2fs_sb_has_blkzoned(sbi) && (rw == WRITE))
+	if (f2fs_sb_has_blkzoned(sbi) && writing)
 		return true;
-	if (f2fs_lfs_mode(sbi) && rw == WRITE && F2FS_IO_ALIGNED(sbi))
+	if (f2fs_lfs_mode(sbi) && writing && F2FS_IO_ALIGNED(sbi))
 		return true;
 	if (is_sbi_flag_set(sbi, SBI_CP_DISABLED))
 		return true;
@@ -865,7 +865,7 @@ int f2fs_getattr(struct user_namespace *mnt_userns, const struct path *path,
 		unsigned int bsize = i_blocksize(inode);
 
 		stat->result_mask |= STATX_DIOALIGN;
-		if (!f2fs_force_buffered_io(inode, WRITE)) {
+		if (!f2fs_force_buffered_io(inode, true)) {
 			stat->dio_mem_align = bsize;
 			stat->dio_offset_align = bsize;
 		}
@@ -4254,7 +4254,7 @@ static bool f2fs_should_use_dio(struct inode *inode, struct kiocb *iocb,
 	if (!(iocb->ki_flags & IOCB_DIRECT))
 		return false;
 
-	if (f2fs_force_buffered_io(inode, iov_iter_rw(iter)))
+	if (f2fs_force_buffered_io(inode, iocb_is_write(iocb)))
 		return false;
 
 	/*
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index d99b8549ec8f..237e20891df2 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -261,7 +261,7 @@ static ssize_t fat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	loff_t offset = iocb->ki_pos;
 	ssize_t ret;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb_is_write(iocb)) {
 		/*
 		 * FIXME: blockdev_direct_IO() doesn't use ->write_begin(),
 		 * so we need to update the ->mmu_private to block boundary.
@@ -281,7 +281,7 @@ static ssize_t fat_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * condition of fat_get_block() and ->truncate().
 	 */
 	ret = blockdev_direct_IO(iocb, inode, iter, fat_get_block);
-	if (ret < 0 && iov_iter_rw(iter) == WRITE)
+	if (ret < 0 && iocb_is_write(iocb))
 		fat_write_failed(mapping, offset + count);
 
 	return ret;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index e23e802a8013..4351376db4a1 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -720,7 +720,7 @@ static bool file_extending_write(struct kiocb *iocb, struct iov_iter *from)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	return (iov_iter_rw(from) == WRITE &&
+	return (iocb_is_write(iocb) &&
 		((iocb->ki_pos) >= i_size_read(inode) ||
 		  (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode))));
 }
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 875314ee6f59..d68b45f8b3ae 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -2897,7 +2897,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	inode = file->f_mapping->host;
 	i_size = i_size_read(inode);
 
-	if ((iov_iter_rw(iter) == READ) && (offset >= i_size))
+	if (iocb_is_read(iocb) && (offset >= i_size))
 		return 0;
 
 	io = kmalloc(sizeof(struct fuse_io_priv), GFP_KERNEL);
@@ -2909,7 +2909,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	io->bytes = -1;
 	io->size = 0;
 	io->offset = offset;
-	io->write = (iov_iter_rw(iter) == WRITE);
+	io->write = iocb_is_write(iocb);
 	io->err = 0;
 	/*
 	 * By default, we want to optimize all I/Os with async request
@@ -2942,7 +2942,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 		io->done = &wait;
 	}
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb_is_write(iocb)) {
 		ret = fuse_direct_io(io, iter, &pos, FUSE_DIO_WRITE);
 		fuse_invalidate_attr_mask(inode, FUSE_STATX_MODSIZE);
 	} else {
@@ -2965,7 +2965,7 @@ fuse_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 	kref_put(&io->refcnt, fuse_io_release);
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iocb_is_write(iocb)) {
 		fuse_write_update_attr(inode, pos, ret);
 		/* For extending writes we already hold exclusive lock */
 		if (ret < 0 && offset + count > i_size)
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 9c329a365e75..eec166e039d5 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -141,7 +141,7 @@ static ssize_t hfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely(iocb_is_write(iocb) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c
index 840577a0c1e7..2b4effb6ca3e 100644
--- a/fs/hfsplus/inode.c
+++ b/fs/hfsplus/inode.c
@@ -138,7 +138,7 @@ static ssize_t hfsplus_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely(iocb_is_write(iocb) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9804714b1751..b03d87f116fc 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -519,7 +519,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->submit.waiter = current;
 	dio->submit.poll_bio = NULL;
 
-	if (iov_iter_rw(iter) == READ) {
+	if (iocb_is_read(iocb)) {
 		if (iomi.pos >= dio->i_size)
 			goto out_free_dio;
 
@@ -573,7 +573,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (ret)
 		goto out_free_dio;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (iomi.flags & IOMAP_WRITE) {
 		/*
 		 * Try to invalidate cache pages for the range we are writing.
 		 * If this invalidation fails, let the caller fall back to
@@ -613,7 +613,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	 * Revert iter to a state corresponding to that as some callers (such
 	 * as the splice code) rely on it.
 	 */
-	if (iov_iter_rw(iter) == READ && iomi.pos >= dio->i_size)
+	if (!(iomi.flags & IOMAP_WRITE) && iomi.pos >= dio->i_size)
 		iov_iter_revert(iter, iomi.pos - dio->i_size);
 
 	if (ret == -EFAULT && dio->size && (dio_flags & IOMAP_DIO_PARTIAL)) {
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 8ac10e396050..0d1f94ac9488 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -334,7 +334,7 @@ static ssize_t jfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely(iocb_is_write(iocb) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 1707f46b1335..d865945f2a63 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -133,7 +133,7 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 
 	VM_BUG_ON(iov_iter_count(iter) != PAGE_SIZE);
 
-	if (iov_iter_rw(iter) == READ)
+	if (iocb_is_read(iocb))
 		ret = nfs_file_direct_read(iocb, iter, true);
 	else
 		ret = nfs_file_direct_write(iocb, iter, true);
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 232dd7b6cca1..496801507083 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -289,7 +289,7 @@ nilfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 
-	if (iov_iter_rw(iter) == WRITE)
+	if (iocb_is_write(iocb))
 		return 0;
 
 	/* Needs synchronization with the cleaner */
diff --git a/fs/ntfs3/inode.c b/fs/ntfs3/inode.c
index 20b953871574..675be8d629fc 100644
--- a/fs/ntfs3/inode.c
+++ b/fs/ntfs3/inode.c
@@ -761,7 +761,7 @@ static ssize_t ntfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	struct ntfs_inode *ni = ntfs_i(inode);
 	loff_t vbo = iocb->ki_pos;
 	loff_t end;
-	int wr = iov_iter_rw(iter) & WRITE;
+	bool wr = iocb_is_write(iocb);
 	size_t iter_count = iov_iter_count(iter);
 	loff_t valid;
 	ssize_t ret;
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 1d65f6ef00ca..b741068a0a7e 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2441,7 +2441,7 @@ static ssize_t ocfs2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	    !ocfs2_supports_append_dio(osb))
 		return 0;
 
-	if (iov_iter_rw(iter) == READ)
+	if (iocb_is_read(iocb))
 		get_block = ocfs2_lock_get_block;
 	else
 		get_block = ocfs2_dio_wr_get_block;
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 4df560894386..ece65907ff83 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -521,7 +521,7 @@ static ssize_t orangefs_direct_IO(struct kiocb *iocb,
 	 */
 	struct file *file = iocb->ki_filp;
 	loff_t pos = iocb->ki_pos;
-	enum ORANGEFS_io_type type = iov_iter_rw(iter) == WRITE ?
+	enum ORANGEFS_io_type type = iocb_is_write(iocb) ?
             ORANGEFS_IO_WRITE : ORANGEFS_IO_READ;
 	loff_t *offset = &pos;
 	struct inode *inode = file->f_mapping->host;
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index c7d1fa526dea..0ed65feda193 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -3249,7 +3249,7 @@ static ssize_t reiserfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	 * In case of error extending write may have instantiated a few
 	 * blocks outside i_size. Trim these off again.
 	 */
-	if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
+	if (unlikely(iocb_is_write(iocb) && ret < 0)) {
 		loff_t isize = i_size_read(inode);
 		loff_t end = iocb->ki_pos + count;
 
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 1d7c2a812fc1..66a1b9e85cb2 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -219,7 +219,7 @@ static ssize_t udf_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 	ssize_t ret;
 
 	ret = blockdev_direct_IO(iocb, inode, iter, udf_get_block);
-	if (unlikely(ret < 0 && iov_iter_rw(iter) == WRITE))
+	if (unlikely(ret < 0 && iocb_is_write(iocb)))
 		udf_write_failed(mapping, iocb->ki_pos + count);
 	return ret;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 649ff061440e..6a488ae69f5d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -353,6 +353,16 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
 	return kiocb->ki_complete == NULL;
 }
 
+static inline bool iocb_is_write(const struct kiocb *kiocb)
+{
+	return kiocb->ki_flags & IOCB_WRITE;
+}
+
+static inline bool iocb_is_read(const struct kiocb *kiocb)
+{
+	return !iocb_is_write(kiocb);
+}
+
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*read_folio)(struct file *, struct folio *);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 9f158238edba..6f4dfa96324d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -114,11 +114,6 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_XARRAY;
 }
 
-static inline unsigned char iov_iter_rw(const struct iov_iter *i)
-{
-	return i->data_source ? WRITE : READ;
-}
-
 static inline bool user_backed_iter(const struct iov_iter *i)
 {
 	return i->user_backed;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f9a3ff37ecd1..68497d9c1452 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1429,6 +1429,11 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 	return page;
 }
 
+static unsigned char iov_iter_rw(const struct iov_iter *i)
+{
+	return i->data_source ? WRITE : READ;
+}
+
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start,




* [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
  2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
  2023-01-16 23:08 ` [PATCH v6 02/34] iov_iter: Use IOCB/IOMAP_WRITE/op_is_write rather than iterator direction David Howells
@ 2023-01-16 23:08 ` David Howells
  2023-01-17  7:57   ` Christoph Hellwig
  2023-01-17  8:44   ` David Howells
  2023-01-16 23:08 ` [PATCH v6 04/34] iov_iter: Remove iov_iter_get_pages2/pages_alloc2() David Howells
                   ` (32 subsequent siblings)
  35 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Define FOLL_SOURCE_BUF and FOLL_DEST_BUF to indicate to get_user_pages*()
and iov_iter_get_pages*() how the buffer is intended to be used in an I/O
operation.  Don't use READ and WRITE for this, since a read I/O writes to
memory and vice versa, which causes confusion.

The direction is checked against the iterator's data_source.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 block/bio.c             |    6 ++++++
 block/blk-map.c         |    2 ++
 crypto/af_alg.c         |    9 ++++++---
 crypto/algif_hash.c     |    3 ++-
 drivers/vhost/scsi.c    |    9 ++++++---
 fs/ceph/addr.c          |    2 +-
 fs/ceph/file.c          |   14 ++++++++------
 fs/cifs/file.c          |    8 ++++----
 fs/cifs/misc.c          |    3 ++-
 fs/direct-io.c          |    6 ++++--
 fs/fuse/dev.c           |    3 ++-
 fs/fuse/file.c          |    8 ++++----
 fs/nfs/direct.c         |   10 ++++++----
 fs/splice.c             |    3 ++-
 include/crypto/if_alg.h |    3 ++-
 include/linux/bio.h     |   18 ++++++++++++++++--
 include/linux/mm.h      |   10 ++++++++++
 lib/iov_iter.c          |   14 +++++++-------
 net/9p/trans_virtio.c   |   12 ++++++++----
 net/core/datagram.c     |    5 +++--
 net/core/skmsg.c        |    4 ++--
 net/rds/message.c       |    4 ++--
 net/tls/tls_sw.c        |    5 ++---
 23 files changed, 107 insertions(+), 54 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f96fcae3f75..867cf4db87ea 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1242,6 +1242,8 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * pages will have to be released using put_page() when done.
  * For multi-segment *iter, this function only adds pages from the
  * next non-empty segment of the iov iterator.
+ *
+ * The I/O direction is determined from the bio operation type.
  */
 static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
@@ -1263,6 +1265,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
 	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
 
+	gup_flags |= bio_is_write(bio) ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
+
 	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
 		gup_flags |= FOLL_PCI_P2PDMA;
 
@@ -1332,6 +1336,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
  * fit into the bio, or are requested in @iter, whatever is smaller. If
  * MM encounters an error pinning the requested pages, it stops. Error
  * is returned only if 0 pages could be pinned.
+ *
+ * The bio operation indicates the data direction.
  */
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 {
diff --git a/block/blk-map.c b/block/blk-map.c
index 08cbb7ff3b19..c30be529fb55 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -279,6 +279,8 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	if (bio == NULL)
 		return -ENOMEM;
 
+	gup_flags |= bio_is_write(bio) ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
+
 	if (blk_queue_pci_p2pdma(rq->q))
 		gup_flags |= FOLL_PCI_P2PDMA;
 
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 0a4fa2a429e2..7a68db157fae 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -531,13 +531,15 @@ static const struct net_proto_family alg_family = {
 	.owner	=	THIS_MODULE,
 };
 
-int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len)
+int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len,
+		   unsigned int gup_flags)
 {
 	size_t off;
 	ssize_t n;
 	int npages, i;
 
-	n = iov_iter_get_pages2(iter, sgl->pages, len, ALG_MAX_PAGES, &off);
+	n = iov_iter_get_pages(iter, sgl->pages, len, ALG_MAX_PAGES, &off,
+			       gup_flags);
 	if (n < 0)
 		return n;
 
@@ -1310,7 +1312,8 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
 		list_add_tail(&rsgl->list, &areq->rsgl_list);
 
 		/* make one iovec available as scatterlist */
-		err = af_alg_make_sg(&rsgl->sgl, &msg->msg_iter, seglen);
+		err = af_alg_make_sg(&rsgl->sgl, &msg->msg_iter, seglen,
+				     FOLL_DEST_BUF);
 		if (err < 0) {
 			rsgl->sg_num_bytes = 0;
 			return err;
diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index 1d017ec5c63c..fe3d2258145f 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -91,7 +91,8 @@ static int hash_sendmsg(struct socket *sock, struct msghdr *msg,
 		if (len > limit)
 			len = limit;
 
-		len = af_alg_make_sg(&ctx->sgl, &msg->msg_iter, len);
+		len = af_alg_make_sg(&ctx->sgl, &msg->msg_iter, len,
+				     FOLL_SOURCE_BUF);
 		if (len < 0) {
 			err = copied ? 0 : len;
 			goto unlock;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index dca6346d75b3..5d10837d19ec 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -646,10 +646,13 @@ vhost_scsi_map_to_sgl(struct vhost_scsi_cmd *cmd,
 	struct scatterlist *sg = sgl;
 	ssize_t bytes;
 	size_t offset;
-	unsigned int npages = 0;
+	unsigned int npages = 0, gup_flags = 0;
 
-	bytes = iov_iter_get_pages2(iter, pages, LONG_MAX,
-				VHOST_SCSI_PREALLOC_UPAGES, &offset);
+	gup_flags |= write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
+
+	bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
+				   VHOST_SCSI_PREALLOC_UPAGES, &offset,
+				   gup_flags);
 	/* No pages were pinned */
 	if (bytes <= 0)
 		return bytes < 0 ? bytes : -EFAULT;
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8c74871e37c9..cfc3353e5604 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -328,7 +328,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 
 	dout("%s: pos=%llu orig_len=%zu len=%llu\n", __func__, subreq->start, subreq->len, len);
 	iov_iter_xarray(&iter, ITER_DEST, &rreq->mapping->i_pages, subreq->start, len);
-	err = iov_iter_get_pages_alloc2(&iter, &pages, len, &page_off);
+	err = iov_iter_get_pages_alloc(&iter, &pages, len, &page_off, FOLL_DEST_BUF);
 	if (err < 0) {
 		dout("%s: iov_ter_get_pages_alloc returned %d\n", __func__, err);
 		goto out;
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 27c72a2f6af5..ffd36eeea186 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -81,7 +81,7 @@ static __le32 ceph_flags_sys2wire(u32 flags)
 #define ITER_GET_BVECS_PAGES	64
 
 static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
-				struct bio_vec *bvecs)
+				struct bio_vec *bvecs, bool write)
 {
 	size_t size = 0;
 	int bvec_idx = 0;
@@ -95,8 +95,9 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
 		size_t start;
 		int idx = 0;
 
-		bytes = iov_iter_get_pages2(iter, pages, maxsize - size,
-					   ITER_GET_BVECS_PAGES, &start);
+		bytes = iov_iter_get_pages(iter, pages, maxsize - size,
+					   ITER_GET_BVECS_PAGES, &start,
+					   write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
 		if (bytes < 0)
 			return size ?: bytes;
 
@@ -127,7 +128,8 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
  * Return the number of bytes in the created bio_vec array, or an error.
  */
 static ssize_t iter_get_bvecs_alloc(struct iov_iter *iter, size_t maxsize,
-				    struct bio_vec **bvecs, int *num_bvecs)
+				    struct bio_vec **bvecs, int *num_bvecs,
+				    bool write)
 {
 	struct bio_vec *bv;
 	size_t orig_count = iov_iter_count(iter);
@@ -146,7 +148,7 @@ static ssize_t iter_get_bvecs_alloc(struct iov_iter *iter, size_t maxsize,
 	if (!bv)
 		return -ENOMEM;
 
-	bytes = __iter_get_bvecs(iter, maxsize, bv);
+	bytes = __iter_get_bvecs(iter, maxsize, bv, write);
 	if (bytes < 0) {
 		/*
 		 * No pages were pinned -- just free the array.
@@ -1334,7 +1336,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
 			break;
 		}
 
-		len = iter_get_bvecs_alloc(iter, size, &bvecs, &num_pages);
+		len = iter_get_bvecs_alloc(iter, size, &bvecs, &num_pages, write);
 		if (len < 0) {
 			ceph_osdc_put_request(req);
 			ret = len;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 22dfc1f8b4f1..d100b9cb8682 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3290,8 +3290,8 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 		if (ctx->direct_io) {
 			ssize_t result;
 
-			result = iov_iter_get_pages_alloc2(
-				from, &pagevec, cur_len, &start);
+			result = iov_iter_get_pages_alloc(
+				from, &pagevec, cur_len, &start, FOLL_SOURCE_BUF);
 			if (result < 0) {
 				cifs_dbg(VFS,
 					 "direct_writev couldn't get user pages (rc=%zd) iter type %d iov_offset %zd count %zd\n",
@@ -4031,9 +4031,9 @@ cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,
 		if (ctx->direct_io) {
 			ssize_t result;
 
-			result = iov_iter_get_pages_alloc2(
+			result = iov_iter_get_pages_alloc(
 					&direct_iov, &pagevec,
-					cur_len, &start);
+					cur_len, &start, FOLL_DEST_BUF);
 			if (result < 0) {
 				cifs_dbg(VFS,
 					 "Couldn't get user pages (rc=%zd) iter type %d iov_offset %zd count %zd\n",
diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
index 4d3c586785a5..9655cf359ab9 100644
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -1030,7 +1030,8 @@ setup_aio_ctx_iter(struct cifs_aio_ctx *ctx, struct iov_iter *iter, int rw)
 	saved_len = count;
 
 	while (count && npages < max_pages) {
-		rc = iov_iter_get_pages2(iter, pages, count, max_pages, &start);
+		rc = iov_iter_get_pages(iter, pages, count, max_pages, &start,
+					rw == WRITE ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
 		if (rc < 0) {
 			cifs_dbg(VFS, "Couldn't get user pages (rc=%zd)\n", rc);
 			break;
diff --git a/fs/direct-io.c b/fs/direct-io.c
index cf196f2a211e..b1e26a706e31 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -169,8 +169,10 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 	const enum req_op dio_op = dio->opf & REQ_OP_MASK;
 	ssize_t ret;
 
-	ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
-				&sdio->from);
+	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
+				 &sdio->from,
+				 op_is_write(dio_op) ?
+				 FOLL_SOURCE_BUF : FOLL_DEST_BUF);
 
 	if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
 		struct page *page = ZERO_PAGE(0);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e8b60ce72c9a..e3d8443e24a6 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -730,7 +730,8 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 		}
 	} else {
 		size_t off;
-		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
+		err = iov_iter_get_pages(cs->iter, &page, PAGE_SIZE, 1, &off,
+					 cs->write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
 		if (err < 0)
 			return err;
 		BUG_ON(!err);
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d68b45f8b3ae..68c196437306 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1414,10 +1414,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
 		unsigned npages;
 		size_t start;
-		ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
-					*nbytesp - nbytes,
-					max_pages - ap->num_pages,
-					&start);
+		ret = iov_iter_get_pages(ii, &ap->pages[ap->num_pages],
+					 *nbytesp - nbytes,
+					 max_pages - ap->num_pages,
+					 &start, write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
 		if (ret < 0)
 			break;
 
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index d865945f2a63..42af84685f20 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -332,8 +332,9 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
-						  rsize, &pgbase);
+		result = iov_iter_get_pages_alloc(iter, &pagevec,
+						  rsize, &pgbase,
+						  FOLL_DEST_BUF);
 		if (result < 0)
 			break;
 	
@@ -791,8 +792,9 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc2(iter, &pagevec,
-						  wsize, &pgbase);
+		result = iov_iter_get_pages_alloc(iter, &pagevec,
+						  wsize, &pgbase,
+						  FOLL_SOURCE_BUF);
 		if (result < 0)
 			break;
 
diff --git a/fs/splice.c b/fs/splice.c
index 5969b7a1d353..19c5b5adc548 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1165,7 +1165,8 @@ static int iter_to_pipe(struct iov_iter *from,
 		size_t start;
 		int i, n;
 
-		left = iov_iter_get_pages2(from, pages, ~0UL, 16, &start);
+		left = iov_iter_get_pages(from, pages, ~0UL, 16, &start,
+					  FOLL_SOURCE_BUF);
 		if (left <= 0) {
 			ret = left;
 			break;
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index a5db86670bdf..12058ab6cad9 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -165,7 +165,8 @@ int af_alg_release(struct socket *sock);
 void af_alg_release_parent(struct sock *sk);
 int af_alg_accept(struct sock *sk, struct socket *newsock, bool kern);
 
-int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len);
+int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len,
+		   unsigned int gup_flags);
 void af_alg_free_sg(struct af_alg_sgl *sgl);
 
 static inline struct alg_sock *alg_sk(struct sock *sk)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 22078a28d7cb..3f7ba7fe48ac 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -40,11 +40,25 @@ static inline unsigned int bio_max_segs(unsigned int nr_segs)
 #define bio_sectors(bio)	bvec_iter_sectors((bio)->bi_iter)
 #define bio_end_sector(bio)	bvec_iter_end_sector((bio)->bi_iter)
 
+/**
+ * bio_is_write - Query if the I/O direction is towards the disk
+ * @bio: The bio to query
+ *
+ * Return true if this is some sort of write operation - ie. the data is going
+ * towards the disk.
+ */
+static inline bool bio_is_write(const struct bio *bio)
+{
+	return op_is_write(bio_op(bio));
+}
+
 /*
  * Return the data direction, READ or WRITE.
  */
-#define bio_data_dir(bio) \
-	(op_is_write(bio_op(bio)) ? WRITE : READ)
+static inline int bio_data_dir(const struct bio *bio)
+{
+	return bio_is_write(bio) ? WRITE : READ;
+}
 
 /*
  * Check whether this bio carries any data or not. A NULL bio is allowed.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..3af4ca8b1fe7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3090,6 +3090,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_PCI_P2PDMA	0x100000 /* allow returning PCI P2PDMA pages */
 #define FOLL_INTERRUPTIBLE  0x200000 /* allow interrupts from generic signals */
 
+#define FOLL_SOURCE_BUF	0		/* Memory will be read from by I/O */
+#define FOLL_DEST_BUF	FOLL_WRITE	/* Memory will be written to by I/O */
+#define FOLL_BUF_MASK	FOLL_WRITE
+
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
  * other. Here is what they mean, and how to use them:
@@ -3143,6 +3147,12 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
  * releasing pages: get_user_pages*() pages must be released via put_page(),
  * while pin_user_pages*() pages must be released via unpin_user_page().
  *
+ * FOLL_SOURCE_BUF and FOLL_DEST_BUF are indicators to get_user_pages*() and
+ * iov_iter_*_pages*() as to how the pages obtained are going to be used.
+ * FOLL_SOURCE_BUF indicates that the I/O op is going to transfer from memory
+ * to the device; FOLL_DEST_BUF that the op is going to transfer from the
+ * device to memory.
+ *
  * Please see Documentation/core-api/pin_user_pages.rst for more information.
  */
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 68497d9c1452..f53583836009 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1429,11 +1429,6 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
 	return page;
 }
 
-static unsigned char iov_iter_rw(const struct iov_iter *i)
-{
-	return i->data_source ? WRITE : READ;
-}
-
 static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   unsigned int maxpages, size_t *start,
@@ -1448,12 +1443,17 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 	if (maxsize > MAX_RW_COUNT)
 		maxsize = MAX_RW_COUNT;
 
+	if (WARN_ON_ONCE((gup_flags & FOLL_BUF_MASK) == FOLL_SOURCE_BUF &&
+			 i->data_source == ITER_DEST))
+		return -EIO;
+	if (WARN_ON_ONCE((gup_flags & FOLL_BUF_MASK) == FOLL_DEST_BUF &&
+			 i->data_source == ITER_SOURCE))
+		return -EIO;
+
 	if (likely(user_backed_iter(i))) {
 		unsigned long addr;
 		int res;
 
-		if (iov_iter_rw(i) != WRITE)
-			gup_flags |= FOLL_WRITE;
 		if (i->nofault)
 			gup_flags |= FOLL_NOFAULT;
 
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 3c27ffb781e3..eb28b54fe5f6 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -310,7 +310,8 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 			       struct iov_iter *data,
 			       int count,
 			       size_t *offs,
-			       int *need_drop)
+			       int *need_drop,
+			       unsigned int gup_flags)
 {
 	int nr_pages;
 	int err;
@@ -330,7 +331,8 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 			if (err == -ERESTARTSYS)
 				return err;
 		}
-		n = iov_iter_get_pages_alloc2(data, pages, count, offs);
+		n = iov_iter_get_pages_alloc(data, pages, count, offs,
+					     gup_flags);
 		if (n < 0)
 			return n;
 		*need_drop = 1;
@@ -437,7 +439,8 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	if (uodata) {
 		__le32 sz;
 		int n = p9_get_mapped_pages(chan, &out_pages, uodata,
-					    outlen, &offs, &need_drop);
+					    outlen, &offs, &need_drop,
+					    FOLL_DEST_BUF);
 		if (n < 0) {
 			err = n;
 			goto err_out;
@@ -456,7 +459,8 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 		memcpy(&req->tc.sdata[0], &sz, sizeof(sz));
 	} else if (uidata) {
 		int n = p9_get_mapped_pages(chan, &in_pages, uidata,
-					    inlen, &offs, &need_drop);
+					    inlen, &offs, &need_drop,
+					    FOLL_SOURCE_BUF);
 		if (n < 0) {
 			err = n;
 			goto err_out;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index e4ff2db40c98..9f0914b781ad 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -632,8 +632,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages2(from, pages, length,
-					    MAX_SKB_FRAGS - frag, &start);
+		copied = iov_iter_get_pages(from, pages, length,
+					    MAX_SKB_FRAGS - frag, &start,
+					    FOLL_SOURCE_BUF);
 		if (copied < 0)
 			return -EFAULT;
 
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 53d0251788aa..f63a13690712 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -324,8 +324,8 @@ int sk_msg_zerocopy_from_iter(struct sock *sk, struct iov_iter *from,
 			goto out;
 		}
 
-		copied = iov_iter_get_pages2(from, pages, bytes, maxpages,
-					    &offset);
+		copied = iov_iter_get_pages(from, pages, bytes, maxpages,
+					    &offset, FOLL_SOURCE_BUF);
 		if (copied <= 0) {
 			ret = -EFAULT;
 			goto out;
diff --git a/net/rds/message.c b/net/rds/message.c
index b47e4f0a1639..fcfd406b97af 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -390,8 +390,8 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 		size_t start;
 		ssize_t copied;
 
-		copied = iov_iter_get_pages2(from, &pages, PAGE_SIZE,
-					    1, &start);
+		copied = iov_iter_get_pages(from, &pages, PAGE_SIZE,
+					    1, &start, FOLL_SOURCE_BUF);
 		if (copied < 0) {
 			struct mmpin *mmp;
 			int i;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 9ed978634125..59acaeb24f54 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1354,9 +1354,8 @@ static int tls_setup_from_iter(struct iov_iter *from,
 			rc = -EFAULT;
 			goto out;
 		}
-		copied = iov_iter_get_pages2(from, pages,
-					    length,
-					    maxpages, &offset);
+		copied = iov_iter_get_pages(from, pages, length,
+					    maxpages, &offset, FOLL_DEST_BUF);
 		if (copied <= 0) {
 			rc = -EFAULT;
 			goto out;



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 04/34] iov_iter: Remove iov_iter_get_pages2/pages_alloc2()
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (2 preceding siblings ...)
  2023-01-16 23:08 ` [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*() David Howells
@ 2023-01-16 23:08 ` David Howells
  2023-01-16 23:08 ` [PATCH v6 05/34] iov_iter: Change the direction macros into an enum David Howells
                   ` (31 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

There are now no users of iov_iter_get_pages2() and
iov_iter_get_pages_alloc2(), so remove them.
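
For reference, a converted call site now states how the buffer will be used;
a minimal sketch, assuming 'from' is an ITER_SOURCE iterator feeding a
device:

	/* Previously: */
	n = iov_iter_get_pages2(from, pages, maxsize, maxpages, &start);

	/* Now: the direction of use is passed as a gup flag. */
	n = iov_iter_get_pages(from, pages, maxsize, maxpages, &start,
			       FOLL_SOURCE_BUF);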

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/uio.h |    4 ----
 lib/iov_iter.c      |   14 --------------
 2 files changed, 18 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 6f4dfa96324d..365e26c405f2 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -248,13 +248,9 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 		size_t maxsize, unsigned maxpages, size_t *start,
 		unsigned gup_flags);
-ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
-			size_t maxsize, unsigned maxpages, size_t *start);
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		struct page ***pages, size_t maxsize, size_t *start,
 		unsigned gup_flags);
-ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
-			size_t maxsize, size_t *start);
 int iov_iter_npages(const struct iov_iter *i, int maxpages);
 void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state);
 
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index f53583836009..ca89ffa9d6e1 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1511,13 +1511,6 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
 }
 EXPORT_SYMBOL_GPL(iov_iter_get_pages);
 
-ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
-		size_t maxsize, unsigned maxpages, size_t *start)
-{
-	return iov_iter_get_pages(i, pages, maxsize, maxpages, start, 0);
-}
-EXPORT_SYMBOL(iov_iter_get_pages2);
-
 ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 		   struct page ***pages, size_t maxsize,
 		   size_t *start, unsigned gup_flags)
@@ -1536,13 +1529,6 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
 }
 EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc);
 
-ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
-		struct page ***pages, size_t maxsize, size_t *start)
-{
-	return iov_iter_get_pages_alloc(i, pages, maxsize, start, 0);
-}
-EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
-
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
 {



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 05/34] iov_iter: Change the direction macros into an enum
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (3 preceding siblings ...)
  2023-01-16 23:08 ` [PATCH v6 04/34] iov_iter: Remove iov_iter_get_pages2/pages_alloc2() David Howells
@ 2023-01-16 23:08 ` David Howells
  2023-01-18 23:14   ` Al Viro
  2023-01-18 23:17   ` David Howells
  2023-01-16 23:08 ` [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions David Howells
                   ` (30 subsequent siblings)
  35 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Change the ITER_SOURCE and ITER_DEST direction macros into an enum and
provide three new helper functions:

 iov_iter_dir() - returns the iterator direction
 iov_iter_is_dest() - returns true if it's an ITER_DEST iterator
 iov_iter_is_source() - returns true if it's an ITER_SOURCE iterator
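
A rough usage sketch (illustrative only; 'iter' is assumed to point at an
initialised iov_iter):

	/* Branch on the buffer's role rather than comparing data_source
	 * against READ/WRITE.
	 */
	if (iov_iter_is_source(iter))
		pr_debug("data is read out of the buffer (eg. a write op)\n");
	else if (iov_iter_is_dest(iter))
		pr_debug("data is written into the buffer (eg. a read op)\n");

	pr_debug("direction=%d\n", iov_iter_dir(iter));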

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>

Link: https://lore.kernel.org/r/167305161763.1521586.6593798818336440133.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344726413.2425628.317218805692680763.stgit@warthog.procyon.org.uk/ # v5
---

 include/linux/uio.h |   30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 365e26c405f2..8d0dabfcb2fe 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,8 +29,10 @@ enum iter_type {
 	ITER_UBUF,
 };
 
-#define ITER_SOURCE	1	// == WRITE
-#define ITER_DEST	0	// == READ
+enum iter_dir {
+	ITER_DEST	= 0,	// == READ
+	ITER_SOURCE	= 1,	// == WRITE
+} __mode(byte);
 
 struct iov_iter_state {
 	size_t iov_offset;
@@ -39,9 +41,9 @@ struct iov_iter_state {
 };
 
 struct iov_iter {
-	u8 iter_type;
+	enum iter_type iter_type __mode(byte);
 	bool nofault;
-	bool data_source;
+	enum iter_dir data_source;
 	bool user_backed;
 	union {
 		size_t iov_offset;
@@ -114,6 +116,26 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_XARRAY;
 }
 
+static inline enum iter_dir iov_iter_dir(const struct iov_iter *i)
+{
+	return i->data_source;
+}
+
+static inline bool iov_iter_is_source(const struct iov_iter *i)
+{
+	return iov_iter_dir(i) == ITER_SOURCE; /* ie. WRITE */
+}
+
+static inline bool iov_iter_is_dest(const struct iov_iter *i)
+{
+	return iov_iter_dir(i) == ITER_DEST; /* ie. READ */
+}
+
+static inline bool iov_iter_dir_valid(enum iter_dir direction)
+{
+	return direction == ITER_DEST || direction == ITER_SOURCE;
+}
+
 static inline bool user_backed_iter(const struct iov_iter *i)
 {
 	return i->user_backed;



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (4 preceding siblings ...)
  2023-01-16 23:08 ` [PATCH v6 05/34] iov_iter: Change the direction macros into an enum David Howells
@ 2023-01-16 23:08 ` David Howells
  2023-01-17  7:58   ` Christoph Hellwig
  2023-01-18 23:15   ` Al Viro
  2023-01-16 23:08 ` [PATCH v6 07/34] iov_iter: Add a function to extract a page list from an iterator David Howells
                   ` (29 subsequent siblings)
  35 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Use the direction in the iterator functions rather than READ/WRITE.

Add a check into __iov_iter_get_pages_alloc() that the supplied
FOLL_SOURCE/DEST_BUF gup_flag matches the ITER_SOURCE/DEST flag on the
iterator.
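
To illustrate the calling convention after this change (a sketch only; 'buf'
is an assumed user pointer and 'len' its length), the iterator is created
with an enum iter_dir value and the gup flag supplied at extraction time must
agree with it:

	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct iov_iter iter;
	struct page *pages[16];
	size_t start;
	ssize_t n;

	iov_iter_init(&iter, ITER_DEST, &iov, 1, len);

	/* Must be FOLL_DEST_BUF to match ITER_DEST; FOLL_SOURCE_BUF would
	 * trip the new WARN_ON_ONCE() and return -EIO.
	 */
	n = iov_iter_get_pages(&iter, pages, len, ARRAY_SIZE(pages), &start,
			       FOLL_DEST_BUF);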

Changes
=======
ver #6)
 - Add a check on FOLL_SOURCE/DEST_BUF into __iov_iter_get_pages_alloc()

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>

Link: https://lore.kernel.org/r/167305162465.1521586.18077838937455153675.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344727112.2425628.995771894170560721.stgit@warthog.procyon.org.uk/ # v5
---

 include/linux/uio.h |   22 +--
 lib/iov_iter.c      |  409 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 396 insertions(+), 35 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 8d0dabfcb2fe..18b64068cc6d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -256,16 +256,16 @@ bool iov_iter_is_aligned(const struct iov_iter *i, unsigned addr_mask,
 			unsigned len_mask);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 unsigned long iov_iter_gap_alignment(const struct iov_iter *i);
-void iov_iter_init(struct iov_iter *i, unsigned int direction, const struct iovec *iov,
+void iov_iter_init(struct iov_iter *i, enum iter_dir direction, const struct iovec *iov,
 			unsigned long nr_segs, size_t count);
-void iov_iter_kvec(struct iov_iter *i, unsigned int direction, const struct kvec *kvec,
+void iov_iter_kvec(struct iov_iter *i, enum iter_dir direction, const struct kvec *kvec,
 			unsigned long nr_segs, size_t count);
-void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_vec *bvec,
+void iov_iter_bvec(struct iov_iter *i, enum iter_dir direction, const struct bio_vec *bvec,
 			unsigned long nr_segs, size_t count);
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
+void iov_iter_pipe(struct iov_iter *i, enum iter_dir direction, struct pipe_inode_info *pipe,
 			size_t count);
-void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
-void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
+void iov_iter_discard(struct iov_iter *i, enum iter_dir direction, size_t count);
+void iov_iter_xarray(struct iov_iter *i, enum iter_dir direction, struct xarray *xarray,
 		     loff_t start, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 		size_t maxsize, unsigned maxpages, size_t *start,
@@ -351,19 +351,19 @@ size_t hash_and_copy_to_iter(const void *addr, size_t bytes, void *hashp,
 struct iovec *iovec_from_user(const struct iovec __user *uvector,
 		unsigned long nr_segs, unsigned long fast_segs,
 		struct iovec *fast_iov, bool compat);
-ssize_t import_iovec(int type, const struct iovec __user *uvec,
+ssize_t import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
 		 struct iov_iter *i);
-ssize_t __import_iovec(int type, const struct iovec __user *uvec,
+ssize_t __import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
 		 struct iov_iter *i, bool compat);
-int import_single_range(int type, void __user *buf, size_t len,
+int import_single_range(enum iter_dir direction, void __user *buf, size_t len,
 		 struct iovec *iov, struct iov_iter *i);
 
-static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
+static inline void iov_iter_ubuf(struct iov_iter *i, enum iter_dir direction,
 			void __user *buf, size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_UBUF,
 		.user_backed = true,
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index ca89ffa9d6e1..6436438bf46b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -421,11 +421,11 @@ size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
 }
 EXPORT_SYMBOL(fault_in_iov_iter_writeable);
 
-void iov_iter_init(struct iov_iter *i, unsigned int direction,
+void iov_iter_init(struct iov_iter *i, enum iter_dir direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_IOVEC,
 		.nofault = false,
@@ -994,11 +994,11 @@ size_t iov_iter_single_seg_count(const struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_single_seg_count);
 
-void iov_iter_kvec(struct iov_iter *i, unsigned int direction,
+void iov_iter_kvec(struct iov_iter *i, enum iter_dir direction,
 			const struct kvec *kvec, unsigned long nr_segs,
 			size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter){
 		.iter_type = ITER_KVEC,
 		.data_source = direction,
@@ -1010,11 +1010,11 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_kvec);
 
-void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
+void iov_iter_bvec(struct iov_iter *i, enum iter_dir direction,
 			const struct bio_vec *bvec, unsigned long nr_segs,
 			size_t count)
 {
-	WARN_ON(direction & ~(READ | WRITE));
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter){
 		.iter_type = ITER_BVEC,
 		.data_source = direction,
@@ -1026,15 +1026,15 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 }
 EXPORT_SYMBOL(iov_iter_bvec);
 
-void iov_iter_pipe(struct iov_iter *i, unsigned int direction,
+void iov_iter_pipe(struct iov_iter *i, enum iter_dir direction,
 			struct pipe_inode_info *pipe,
 			size_t count)
 {
-	BUG_ON(direction != READ);
+	BUG_ON(direction != ITER_DEST);
 	WARN_ON(pipe_full(pipe->head, pipe->tail, pipe->ring_size));
 	*i = (struct iov_iter){
 		.iter_type = ITER_PIPE,
-		.data_source = false,
+		.data_source = ITER_DEST,
 		.pipe = pipe,
 		.head = pipe->head,
 		.start_head = pipe->head,
@@ -1057,10 +1057,10 @@ EXPORT_SYMBOL(iov_iter_pipe);
  * from evaporation, either by taking a ref on them or locking them by the
  * caller.
  */
-void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
+void iov_iter_xarray(struct iov_iter *i, enum iter_dir direction,
 		     struct xarray *xarray, loff_t start, size_t count)
 {
-	BUG_ON(direction & ~1);
+	WARN_ON(!iov_iter_dir_valid(direction));
 	*i = (struct iov_iter) {
 		.iter_type = ITER_XARRAY,
 		.data_source = direction,
@@ -1079,14 +1079,14 @@ EXPORT_SYMBOL(iov_iter_xarray);
  * @count: The size of the I/O buffer in bytes.
  *
  * Set up an I/O iterator that just discards everything that's written to it.
- * It's only available as a READ iterator.
+ * It's only available as a destination iterator.
  */
-void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
+void iov_iter_discard(struct iov_iter *i, enum iter_dir direction, size_t count)
 {
-	BUG_ON(direction != READ);
+	BUG_ON(direction != ITER_DEST);
 	*i = (struct iov_iter){
 		.iter_type = ITER_DISCARD,
-		.data_source = false,
+		.data_source = ITER_DEST,
 		.count = count,
 		.iov_offset = 0
 	};
@@ -1444,10 +1444,10 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 		maxsize = MAX_RW_COUNT;
 
 	if (WARN_ON_ONCE((gup_flags & FOLL_BUF_MASK) == FOLL_SOURCE_BUF &&
-			 i->data_source == ITER_DEST))
+			 iov_iter_is_dest(i)))
 		return -EIO;
 	if (WARN_ON_ONCE((gup_flags & FOLL_BUF_MASK) == FOLL_DEST_BUF &&
-			 i->data_source == ITER_SOURCE))
+			 iov_iter_is_source(i)))
 		return -EIO;
 
 	if (likely(user_backed_iter(i))) {
@@ -1775,7 +1775,7 @@ struct iovec *iovec_from_user(const struct iovec __user *uvec,
 	return iov;
 }
 
-ssize_t __import_iovec(int type, const struct iovec __user *uvec,
+ssize_t __import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs, struct iovec **iovp,
 		 struct iov_iter *i, bool compat)
 {
@@ -1814,7 +1814,7 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
 		total_len += len;
 	}
 
-	iov_iter_init(i, type, iov, nr_segs, total_len);
+	iov_iter_init(i, direction, iov, nr_segs, total_len);
 	if (iov == *iovp)
 		*iovp = NULL;
 	else
@@ -1827,7 +1827,7 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
  *     into the kernel, check that it is valid, and initialize a new
  *     &struct iov_iter iterator to access it.
  *
- * @type: One of %READ or %WRITE.
+ * @direction: One of %ITER_SOURCE or %ITER_DEST.
  * @uvec: Pointer to the userspace array.
  * @nr_segs: Number of elements in userspace array.
  * @fast_segs: Number of elements in @iov.
@@ -1844,16 +1844,16 @@ ssize_t __import_iovec(int type, const struct iovec __user *uvec,
  *
  * Return: Negative error code on error, bytes imported on success
  */
-ssize_t import_iovec(int type, const struct iovec __user *uvec,
+ssize_t import_iovec(enum iter_dir direction, const struct iovec __user *uvec,
 		 unsigned nr_segs, unsigned fast_segs,
 		 struct iovec **iovp, struct iov_iter *i)
 {
-	return __import_iovec(type, uvec, nr_segs, fast_segs, iovp, i,
+	return __import_iovec(direction, uvec, nr_segs, fast_segs, iovp, i,
 			      in_compat_syscall());
 }
 EXPORT_SYMBOL(import_iovec);
 
-int import_single_range(int rw, void __user *buf, size_t len,
+int import_single_range(enum iter_dir direction, void __user *buf, size_t len,
 		 struct iovec *iov, struct iov_iter *i)
 {
 	if (len > MAX_RW_COUNT)
@@ -1863,7 +1863,7 @@ int import_single_range(int rw, void __user *buf, size_t len,
 
 	iov->iov_base = buf;
 	iov->iov_len = len;
-	iov_iter_init(i, rw, iov, 1, len);
+	iov_iter_init(i, direction, iov, 1, len);
 	return 0;
 }
 EXPORT_SYMBOL(import_single_range);
@@ -1905,3 +1905,364 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator.  This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int gup_flags,
+					   size_t *offset0)
+{
+	unsigned int nr, offset, chunk, j;
+	struct page **p;
+	size_t left;
+
+	if (!sanity(i))
+		return -EFAULT;
+
+	offset = pipe_npages(i, &nr);
+	if (!nr)
+		return -EFAULT;
+	*offset0 = offset;
+
+	maxpages = min_t(size_t, nr, maxpages);
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	left = maxsize;
+	for (j = 0; j < maxpages; j++) {
+		struct page *page = append_pipe(i, left, &offset);
+		if (!page)
+			break;
+		chunk = min_t(size_t, left, PAGE_SIZE - offset);
+		left -= chunk;
+		*p++ = page;
+	}
+	if (!j)
+		return -EFAULT;
+	return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     unsigned int gup_flags,
+					     size_t *offset0)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	i->iov_offset += maxsize;
+	i->count -= maxsize;
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int gup_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	maxsize = min(maxsize, i->bvec->bv_len - skip);
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	i->count -= maxsize;
+	i->iov_offset += maxsize;
+	if (i->iov_offset == i->bvec->bv_len) {
+		i->iov_offset = 0;
+		i->bvec++;
+		i->nr_segs--;
+	}
+	return maxsize;
+}
+
+/*
+ * Get the first segment from an ITER_UBUF or ITER_IOVEC iterator.  The
+ * iterator must not be empty.
+ */
+static unsigned long iov_iter_extract_first_user_segment(const struct iov_iter *i,
+							 size_t *size)
+{
+	size_t skip;
+	long k;
+
+	if (iter_is_ubuf(i))
+		return (unsigned long)i->ubuf + i->iov_offset;
+
+	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
+		size_t len = i->iov[k].iov_len - skip;
+
+		if (unlikely(!len))
+			continue;
+		if (*size > len)
+			*size = len;
+		return (unsigned long)i->iov[k].iov_base + skip;
+	}
+	BUG(); // if it had been empty, we wouldn't get called
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get references
+ * on them.  This should only be used iff the iterator is user-backed
+ * (IOVEC/UBUF) and data is being transferred out of the buffer described by
+ * the iterator (ie. this is the source).
+ *
+ * The pages are returned with incremented refcounts that the caller must undo
+ * once the transfer is complete, but no additional pins are obtained.
+ *
+ * This is only safe to be used where background IO/DMA is not going to be
+ * modifying the buffer, and so won't cause a problem with CoW on fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_get(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int gup_flags,
+						   size_t *offset0)
+{
+	unsigned long addr;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(!iov_iter_is_source(i)))
+		return -EFAULT;
+
+	gup_flags |= FOLL_GET;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = iov_iter_extract_first_user_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = get_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used iff the iterator is user-backed
+ * (IOVEC/UBUF) and data is being transferred into the buffer described by the
+ * iterator (ie. this is the destination).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref makes sure that CoW happens
+ * correctly in the parent during fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_pin(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int gup_flags,
+						   size_t *offset0)
+{
+	unsigned long addr;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(!iov_iter_is_dest(i)))
+		return -EFAULT;
+
+	gup_flags |= FOLL_PIN | FOLL_WRITE;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int gup_flags,
+					   size_t *offset0)
+{
+	if (iov_iter_extract_mode(i, gup_flags) == FOLL_GET)
+		return iov_iter_extract_user_pages_and_get(i, pages, maxsize,
+							   maxpages, gup_flags,
+							   offset0);
+	else
+		return iov_iter_extract_user_pages_and_pin(i, pages, maxsize,
+							   maxpages, gup_flags,
+							   offset0);
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @gup_flags: Direction indicator and additional flags
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @gup_flags can be set to either FOLL_SOURCE_BUF or FOLL_DEST_BUF, indicating
+ * how the buffer is to be used, and can have FOLL_PCI_P2PDMA OR'd with that.
+ *
+ * The iov_iter_extract_mode() function can be used to query how cleanup should
+ * be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /OUT OF/ the buffer (@gup_flags |= FOLL_SOURCE_BUF), refs
+ *      will be taken on the pages, but pins will not be added.  This can be
+ *      used for DMA from a page; it cannot be used for DMA to a page, as it
+ *      may cause page-COW problems in fork.  iov_iter_extract_mode() will
+ *      return FOLL_GET.
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /INTO/ the described buffer (@gup_flags |= FOLL_DEST_BUF),
+ *      pins will be added to the pages, but refs will not be taken.  This must
+ *      be used for DMA to a page.  iov_iter_extract_mode() will return
+ *      FOLL_PIN.
+ *
+ *  (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ *      data.  Additional pages may be allocated and added to the pipe (which
+ *      will hold the refs), but neither refs nor pins will be obtained for the
+ *      caller.  The caller must hold the pipe lock.  iov_iter_extract_mode()
+ *      will return 0.
+ *
+ *  (*) If the iterator is ITER_BVEC or ITER_XARRAY, the pages are merely
+ *      listed; no extra refs or pins are obtained.  iov_iter_extract_mode()
+ *      will return 0.
+ *
+ * Note also:
+ *
+ *  (*) Use with ITER_KVEC is not supported as that may refer to memory that
+ *      doesn't have associated page structs.
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *@offset0 to the offset into the first page.
+ *
+ * It may also return -ENOMEM or -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       unsigned int gup_flags,
+			       size_t *offset0)
+{
+	if (WARN_ON_ONCE((gup_flags & FOLL_BUF_MASK) == FOLL_SOURCE_BUF &&
+			 iov_iter_is_dest(i)))
+		return -EIO;
+	if (WARN_ON_ONCE((gup_flags & FOLL_BUF_MASK) == FOLL_DEST_BUF &&
+			 iov_iter_is_source(i)))
+		return -EIO;
+
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, gup_flags,
+						   offset0);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, gup_flags,
+						   offset0);
+	if (iov_iter_is_pipe(i))
+		return iov_iter_extract_pipe_pages(i, pages, maxsize,
+						   maxpages, gup_flags,
+						   offset0);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, gup_flags,
+						     offset0);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 07/34] iov_iter: Add a function to extract a page list from an iterator
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (5 preceding siblings ...)
  2023-01-16 23:08 ` [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions David Howells
@ 2023-01-16 23:08 ` David Howells
  2023-01-17  8:01   ` Christoph Hellwig
  2023-01-17  8:19   ` David Howells
  2023-01-16 23:08 ` [PATCH v6 08/34] mm: Provide a helper to drop a pin/ref on a page David Howells
                   ` (28 subsequent siblings)
  35 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, John Hubbard, Matthew Wilcox, linux-fsdevel,
	linux-mm, dhowells, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a reference added or a pin
added or neither, depending on the type of iterator and the direction of
transfer.  The caller should pass FOLL_SOURCE_BUF or FOLL_DEST_BUF as part
of gup_flags to indicate how the iterator contents are to be used.

Add a second function, iov_iter_extract_mode(), to determine how the
cleanup should be done.

There are three cases:

 (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins obtained on them (but not references)
     so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
     progress.

     iov_iter_extract_mode() will return FOLL_PIN for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have references obtained on them, but not pins.

     iov_iter_extract_mode() will return FOLL_GET.  The caller should use
     something like put_page() for page disposal.

 (3) Any other sort of iterator.

     No refs or pins are obtained on the pages; the assumption is made that
     the caller will manage page retention.

     iov_iter_extract_mode() will return 0.  The pages don't need
     additional disposal.
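
Taken together, a caller might look roughly like this (a sketch, not one of
the conversions in this series; 'iter' and 'count' are assumed, and the
buffer here is an I/O destination):

	struct page *stack_pages[8], **pages = stack_pages;
	unsigned int gup_flags = FOLL_DEST_BUF;
	unsigned int npages, i;
	size_t offset;
	ssize_t len;

	len = iov_iter_extract_pages(iter, &pages, count,
				     ARRAY_SIZE(stack_pages), gup_flags,
				     &offset);
	if (len <= 0)
		return len;
	npages = DIV_ROUND_UP(offset + len, PAGE_SIZE);

	/* ... do the DMA/DIO against the pages, starting at 'offset' ... */

	switch (iov_iter_extract_mode(iter, gup_flags)) {
	case FOLL_PIN:				/* case (1) above */
		for (i = 0; i < npages; i++)
			unpin_user_page(pages[i]);
		break;
	case FOLL_GET:				/* case (2) above */
		for (i = 0; i < npages; i++)
			put_page(pages[i]);
		break;
	default:				/* case (3): nothing to drop */
		break;
	}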

Changes:
========
ver #6)
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
   data_source.

ver #4)
 - Use ITER_SOURCE/DEST instead of WRITE/READ.
 - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: John Hubbard <jhubbard@nvidia.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166722777971.2555743.12953624861046741424.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732025748.3186319.8314014902727092626.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869689451.3723671.18242195992447653092.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5
---

 include/linux/uio.h |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 18b64068cc6d..38607c82e0cc 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -373,4 +373,32 @@ static inline void iov_iter_ubuf(struct iov_iter *i, enum iter_dir direction,
 	};
 }
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       unsigned int gup_flags, size_t *offset0);
+
+/**
+ * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ * @gup_flags: How the iterator is to be used (FOLL_SOURCE/DEST_BUF)
+ *
+ * Examine the iterator and the gup_flags and indicate by returning FOLL_PIN,
+ * FOLL_GET or 0 as to how, if at all, pages extracted from the iterator will
+ * be retained by the extraction function.
+ *
+ * FOLL_GET indicates that the pages will have a reference taken on them that
+ * the caller must put.  This can be done for DMA/async DIO write from a page.
+ *
+ * FOLL_PIN indicates that the pages will have a pin placed in them that the
+ * caller must unpin.  This must be done for DMA/async DIO read to a page to
+ * avoid CoW problems in fork.
+ *
+ * 0 indicates that no measures are taken and that it's up to the caller to
+ * retain the pages.
+ */
+#define iov_iter_extract_mode(iter, gup_flags) \
+	(user_backed_iter(iter) ?				\
+	 (gup_flags & FOLL_BUF_MASK) == FOLL_SOURCE_BUF ?	\
+	 FOLL_GET : FOLL_PIN : 0)
+
 #endif



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 08/34] mm: Provide a helper to drop a pin/ref on a page
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (6 preceding siblings ...)
  2023-01-16 23:08 ` [PATCH v6 07/34] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-01-16 23:08 ` David Howells
  2023-01-17  8:02   ` Christoph Hellwig
  2023-01-17  8:21   ` David Howells
  2023-01-16 23:09 ` [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
                   ` (27 subsequent siblings)
  35 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:08 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Provide a helper in the get_user_pages code to drop a pin or a ref on a
page based on being given FOLL_GET or FOLL_PIN in its flags argument or do
nothing if neither is set.
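
With it, the per-mode cleanup in the previous patch's sketch collapses into a
single call; roughly, reusing the names from that sketch:

	unsigned int cleanup = iov_iter_extract_mode(iter, gup_flags);

	for (i = 0; i < npages; i++)
		page_put_unpin(pages[i], cleanup);	/* FOLL_GET, FOLL_PIN or 0 */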

Signed-off-by: David Howells <dhowells@redhat.com>
---

 include/linux/mm.h |    3 +++
 mm/gup.c           |   22 ++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3af4ca8b1fe7..8e746a930945 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1367,6 +1367,9 @@ static inline bool is_cow_mapping(vm_flags_t flags)
 #define SECTION_IN_PAGE_FLAGS
 #endif
 
+void folio_put_unpin(struct folio *folio, unsigned int flags);
+void page_put_unpin(struct page *page, unsigned int flags);
+
 /*
  * The identification function is mainly used by the buddy allocator for
  * determining if two pages could be buddies. We are not really identifying
diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..3ee4b4c7e0cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -191,6 +191,28 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 		folio_put_refs(folio, refs);
 }
 
+/**
+ * folio_put_unpin - Unpin/put a folio as appropriate
+ * @folio: The folio to release
+ * @flags: gup flags indicating the mode of release (FOLL_*)
+ *
+ * Release a folio according to the flags.  If FOLL_GET is set, the folio has a
+ * ref dropped; if FOLL_PIN is set, it is unpinned; otherwise it is left
+ * unaltered.
+ */
+void folio_put_unpin(struct folio *folio, unsigned int flags)
+{
+	if (flags & (FOLL_GET | FOLL_PIN))
+		gup_put_folio(folio, 1, flags);
+}
+EXPORT_SYMBOL_GPL(folio_put_unpin);
+
+void page_put_unpin(struct page *page, unsigned int flags)
+{
+	folio_put_unpin(page_folio(page), flags);
+}
+EXPORT_SYMBOL_GPL(page_put_unpin);
+
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount
  * @page:    pointer to page to be grabbed



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (7 preceding siblings ...)
  2023-01-16 23:08 ` [PATCH v6 08/34] mm: Provide a helper to drop a pin/ref on a page David Howells
@ 2023-01-16 23:09 ` David Howells
  2023-01-17  8:02   ` Christoph Hellwig
  2023-01-16 23:09 ` [PATCH v6 10/34] mm, block: Make BIO_PAGE_REFFED/PINNED the same as FOLL_GET/PIN numerically David Howells
                   ` (26 subsequent siblings)
  35 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-16 23:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-block, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning.  In a
following patch I intend to add a BIO_PAGE_PINNED flag to indicate that the
pages need unpinning; that way both flags work in the same sense.

Changes
=======
ver #5)
 - Split from patch that uses iov_iter_extract_pages().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org

Link: https://lore.kernel.org/r/167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344730802.2425628.14034153595667416149.stgit@warthog.procyon.org.uk/ # v5
---

 block/bio.c               |    9 ++++++++-
 include/linux/bio.h       |    2 +-
 include/linux/blk_types.h |    2 +-
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 867cf4db87ea..5b6a76c3e620 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -243,6 +243,10 @@ static void bio_free(struct bio *bio)
  * Users of this function have their own bio allocation. Subsequently,
  * they must remember to pair any call to bio_init() with bio_uninit()
  * when IO has completed, or when the bio is released.
+ *
+ * We set the initial assumption that pages attached to the bio will be
+ * released with put_page() by setting BIO_PAGE_REFFED; if the pages
+ * should not be put, this flag should be cleared.
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -274,6 +278,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 	bio->bi_integrity = NULL;
 #endif
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	bio->bi_vcnt = 0;
 
 	atomic_set(&bio->__bi_remaining, 1);
@@ -302,6 +307,7 @@ void bio_reset(struct bio *bio, struct block_device *bdev, blk_opf_t opf)
 {
 	bio_uninit(bio);
 	memset(bio, 0, BIO_RESET_BYTES);
+	bio_set_flag(bio, BIO_PAGE_REFFED);
 	atomic_set(&bio->__bi_remaining, 1);
 	bio->bi_bdev = bdev;
 	if (bio->bi_bdev)
@@ -812,6 +818,7 @@ EXPORT_SYMBOL(bio_put);
 static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 {
 	bio_set_flag(bio, BIO_CLONED);
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
 
@@ -1198,7 +1205,7 @@ void bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
 	bio->bi_iter.bi_bvec_done = iter->iov_offset;
 	bio->bi_iter.bi_size = size;
-	bio_set_flag(bio, BIO_NO_PAGE_REF);
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
 	bio_set_flag(bio, BIO_CLONED);
 }
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3f7ba7fe48ac..69b32c5532f6 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -496,7 +496,7 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (!bio_flagged(bio, BIO_NO_PAGE_REF))
+	if (bio_flagged(bio, BIO_PAGE_REFFED))
 		__bio_release_pages(bio, mark_dirty);
 }
 
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..86711fb0534a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -318,7 +318,7 @@ struct bio {
  * bio flags
  */
 enum {
-	BIO_NO_PAGE_REF,	/* don't put release vec pages */
+	BIO_PAGE_REFFED,	/* Pages need refs putting (equivalent to FOLL_GET) */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 10/34] mm, block: Make BIO_PAGE_REFFED/PINNED the same as FOLL_GET/PIN numerically
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (8 preceding siblings ...)
  2023-01-16 23:09 ` [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
@ 2023-01-16 23:09 ` David Howells
  2023-01-17  8:03   ` Christoph Hellwig
  2023-01-16 23:09 ` [PATCH v6 11/34] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
                   ` (25 subsequent siblings)
  35 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-16 23:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-block, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Make BIO_PAGE_REFFED the same as FOLL_GET and BIO_PAGE_PINNED the same as
FOLL_PIN numerically so that the BIO_* flags can be passed directly to
page_put_unpin().

Provide a build-time assertion to check this.
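
The payoff is that no translation table is needed between bio flags and gup
flags; a sketch of the idea (the real release helper is added to block/blk.h
by the next patch and may differ in detail):

	/* (1 << BIO_PAGE_REFFED) == FOLL_GET and (1 << BIO_PAGE_PINNED) ==
	 * FOLL_PIN, so the relevant bits of bi_flags can be handed to
	 * page_put_unpin() unchanged.
	 */
	page_put_unpin(page, bio->bi_flags & (FOLL_GET | FOLL_PIN));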

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
---

 block/bio.c               |    3 +++
 include/linux/blk_types.h |    1 +
 include/linux/mm.h        |   17 ++++++++++-------
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5b6a76c3e620..d8c636cefcdd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1798,6 +1798,9 @@ static int __init init_bio(void)
 {
 	int i;
 
+	BUILD_BUG_ON((1 << BIO_PAGE_REFFED) != FOLL_GET);
+	BUILD_BUG_ON((1 << BIO_PAGE_PINNED) != FOLL_PIN);
+
 	bio_integrity_init();
 
 	for (i = 0; i < ARRAY_SIZE(bvec_slabs); i++) {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 86711fb0534a..42b40156c517 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -319,6 +319,7 @@ struct bio {
  */
 enum {
 	BIO_PAGE_REFFED,	/* Pages need refs putting (equivalent to FOLL_GET) */
+	BIO_PAGE_PINNED,	/* Pages need unpinning (equivalent to FOLL_PIN) */
 	BIO_CLONED,		/* doesn't own data */
 	BIO_BOUNCED,		/* bio is a bounce bio */
 	BIO_QUIET,		/* Make BIO Quiet */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8e746a930945..f14edb192394 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3074,12 +3074,13 @@ static inline vm_fault_t vmf_error(int err)
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 			 unsigned int foll_flags);
 
-#define FOLL_WRITE	0x01	/* check pte is writable */
-#define FOLL_TOUCH	0x02	/* mark page accessed */
-#define FOLL_GET	0x04	/* do get_page on page */
-#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
-#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
+#define FOLL_GET	0x01	/* do get_page on page (equivalent to BIO_PAGE_REFFED) */
+#define FOLL_PIN	0x02	/* pages must be released via unpin_user_page */
+#define FOLL_WRITE	0x04	/* check pte is writable */
+#define FOLL_TOUCH	0x08	/* mark page accessed */
+#define FOLL_DUMP	0x10	/* give error on hole if it would be zero */
+#define FOLL_FORCE	0x20	/* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT	0x40	/* if a disk transfer is needed, start the IO
 				 * and return without waiting upon it */
 #define FOLL_NOFAULT	0x80	/* do not fault in pages */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
@@ -3088,7 +3089,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
-#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
 #define FOLL_PCI_P2PDMA	0x100000 /* allow returning PCI P2PDMA pages */
 #define FOLL_INTERRUPTIBLE  0x200000 /* allow interrupts from generic signals */
@@ -3098,6 +3098,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_BUF_MASK	FOLL_WRITE
 
 /*
+ * FOLL_GET must be the same bit as BIO_PAGE_REFFED and FOLL_PIN must be the
+ * same bit as BIO_PAGE_PINNED.
+ *
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
  * other. Here is what they mean, and how to use them:
  *



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 11/34] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (9 preceding siblings ...)
  2023-01-16 23:09 ` [PATCH v6 10/34] mm, block: Make BIO_PAGE_REFFED/PINNED the same as FOLL_GET/PIN numerically David Howells
@ 2023-01-16 23:09 ` David Howells
  2023-01-17  8:07   ` Christoph Hellwig
  2023-01-17  8:26   ` David Howells
  2023-01-16 23:09 ` [PATCH v6 12/34] bio: Fix bio_flagged() so that gcc can better optimise it David Howells
                   ` (24 subsequent siblings)
  35 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-block, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Convert the block layer's bio code to use iov_iter_extract_pages() instead
of iov_iter_get_pages().  This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the source iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

To implement this:

 (1) If the BIO_PAGE_REFFED flag is set, this causes attached pages to be
     passed to put_page() during cleanup.

 (2) A BIO_PAGE_PINNED flag is provided.  If set, this causes attached
     pages to be passed to unpin_user_page() during cleanup.

 (3) BIO_PAGE_REFFED is set by default and BIO_PAGE_PINNED is cleared by
     default when the bio is (re-)initialised.

 (4) If iov_iter_extract_pages() indicates FOLL_GET, this causes
     BIO_PAGE_REFFED to be set and if FOLL_PIN is indicated, this causes
     BIO_PAGE_PINNED to be set.  If it returns neither FOLL_* flag, then
     both BIO_PAGE_* flags will be cleared.

     Mixing sets of pages with different cleanup modes is not supported.

 (5) Cloned bio structs have both flags cleared.

 (6) bio_release_pages() will do the release if either BIO_PAGE_* flag is
     set.

[!] Note that this is tested a bit with ext4, but nothing else.
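
The bio_set_cleanup_mode() call seen in the __bio_iov_iter_get_pages() hunk
below is a new helper in block/blk.h; conceptually it transcribes the FOLL_*
mode onto the bio, roughly along these lines (a sketch, not necessarily the
exact implementation):

	static void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter,
					 unsigned int gup_flags)
	{
		unsigned int cleanup_mode;

		cleanup_mode = iov_iter_extract_mode(iter, gup_flags);
		bio_clear_flag(bio, BIO_PAGE_REFFED);
		bio_clear_flag(bio, BIO_PAGE_PINNED);
		if (cleanup_mode & FOLL_GET)
			bio_set_flag(bio, BIO_PAGE_REFFED);
		if (cleanup_mode & FOLL_PIN)
			bio_set_flag(bio, BIO_PAGE_PINNED);
	}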

Changes
=======
ver #5)
 - Transcribe the FOLL_* flags returned by iov_iter_extract_pages() to
   BIO_* flags and got rid of bi_cleanup_mode.
 - Replaced BIO_NO_PAGE_REF to BIO_PAGE_REFFED in the preceding patch.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org

Link: https://lore.kernel.org/r/167305166150.1521586.10220949115402059720.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344731521.2425628.5403113335062567245.stgit@warthog.procyon.org.uk/ # v5
---

 block/bio.c         |   34 +++++++++++++++++++---------------
 block/blk-map.c     |   22 +++++++++++-----------
 block/blk.h         |   25 +++++++++++++++++++++++++
 include/linux/bio.h |    3 ++-
 4 files changed, 57 insertions(+), 27 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index d8c636cefcdd..f9ee3625d65c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -245,8 +245,9 @@ static void bio_free(struct bio *bio)
  * when IO has completed, or when the bio is released.
  *
  * We set the initial assumption that pages attached to the bio will be
- * released with put_page() by setting BIO_PAGE_REFFED; if the pages
- * should not be put, this flag should be cleared.
+ * released with put_page() by setting BIO_PAGE_REFFED, but this should be set
+ * to BIO_PAGE_PINNED if the pages should be unpinned instead; if the pages
+ * should not be put or unpinned, these flags should be cleared.
  */
 void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
@@ -819,6 +820,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 {
 	bio_set_flag(bio, BIO_CLONED);
 	bio_clear_flag(bio, BIO_PAGE_REFFED);
+	bio_clear_flag(bio, BIO_PAGE_PINNED);
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
 
@@ -1183,7 +1185,7 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		if (mark_dirty && !PageCompound(bvec->bv_page))
 			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+		bio_release_page(bio, bvec->bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1220,7 +1222,7 @@ static int bio_iov_add_page(struct bio *bio, struct page *page,
 	}
 
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1234,7 +1236,7 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
 			queue_max_zone_append_sectors(q), &same_page) != len)
 		return -EINVAL;
 	if (same_page)
-		put_page(page);
+		bio_release_page(bio, page);
 	return 0;
 }
 
@@ -1245,10 +1247,10 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
  * @bio: bio to add pages to
  * @iter: iov iterator describing the region to be mapped
  *
- * Pins pages from *iter and appends them to @bio's bvec array. The
- * pages will have to be released using put_page() when done.
- * For multi-segment *iter, this function only adds pages from the
- * next non-empty segment of the iov iterator.
+ * Extracts pages from *iter and appends them to @bio's bvec array.  The pages
+ * will have to be cleaned up in the way indicated by the BIO_PAGE_REFFED and
+ * BIO_PAGE_PINNED flags.  For a multi-segment *iter, this function only adds
+ * pages from the next non-empty segment of the iov iterator.
  *
  * The I/O direction is determined from the bio operation type.
  */
@@ -1284,12 +1286,14 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	 * result to ensure the bio's total size is correct. The remainder of
 	 * the iov data will be picked up in the next bio iteration.
 	 */
-	size = iov_iter_get_pages(iter, pages,
-				  UINT_MAX - bio->bi_iter.bi_size,
-				  nr_pages, &offset, gup_flags);
+	size = iov_iter_extract_pages(iter, &pages,
+				      UINT_MAX - bio->bi_iter.bi_size,
+				      nr_pages, gup_flags, &offset);
 	if (unlikely(size <= 0))
 		return size ? size : -EFAULT;
 
+	bio_set_cleanup_mode(bio, iter, gup_flags);
+
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
 	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
@@ -1319,7 +1323,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	iov_iter_revert(iter, left);
 out:
 	while (i < nr_pages)
-		put_page(pages[i++]);
+		bio_release_page(bio, pages[i++]);
 
 	return ret;
 }
@@ -1502,8 +1506,8 @@ void bio_set_pages_dirty(struct bio *bio)
  * the BIO and re-dirty the pages in process context.
  *
  * It is expected that bio_check_pages_dirty() will wholly own the BIO from
- * here on.  It will run one put_page() against each page and will run one
- * bio_put() against the BIO.
+ * here on.  It will run one put_page() or unpin_user_page() against each page
+ * and will run one bio_put() against the BIO.
  */
 
 static void bio_dirty_fn(struct work_struct *work);
diff --git a/block/blk-map.c b/block/blk-map.c
index c30be529fb55..be769f889eca 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -285,24 +285,24 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		gup_flags |= FOLL_PCI_P2PDMA;
 
 	while (iov_iter_count(iter)) {
-		struct page **pages, *stack_pages[UIO_FASTIOV];
+		struct page *stack_pages[UIO_FASTIOV];
+		struct page **pages = stack_pages;
 		ssize_t bytes;
 		size_t offs;
 		int npages;
 
-		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
-			pages = stack_pages;
-			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-						   nr_vecs, &offs, gup_flags);
-		} else {
-			bytes = iov_iter_get_pages_alloc(iter, &pages,
-						LONG_MAX, &offs, gup_flags);
-		}
+		if (nr_vecs > ARRAY_SIZE(stack_pages))
+			pages = NULL;
+
+		bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
+					       nr_vecs, gup_flags, &offs);
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
 		}
 
+		bio_set_cleanup_mode(bio, iter, gup_flags);
+
 		npages = DIV_ROUND_UP(offs + bytes, PAGE_SIZE);
 
 		if (unlikely(offs & queue_dma_alignment(rq->q)))
@@ -319,7 +319,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 				if (!bio_add_hw_page(rq->q, bio, page, n, offs,
 						     max_sectors, &same_page)) {
 					if (same_page)
-						put_page(page);
+						bio_release_page(bio, page);
 					break;
 				}
 
@@ -331,7 +331,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 		 * release the pages we didn't map into the bio, if any
 		 */
 		while (j < npages)
-			put_page(pages[j++]);
+			bio_release_page(bio, pages[j++]);
 		if (pages != stack_pages)
 			kvfree(pages);
 		/* couldn't stuff something into bio? */
diff --git a/block/blk.h b/block/blk.h
index 4c3b3325219a..29f12f758915 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -425,6 +425,31 @@ int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		unsigned int max_sectors, bool *same_page);
 
+/*
+ * Set the cleanup mode for a bio from an iterator and the GUP flags.
+ */
+static inline void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter,
+					unsigned int gup_flags)
+{
+	unsigned int cleanup_mode;
+
+	bio_clear_flag(bio, BIO_PAGE_REFFED);
+	cleanup_mode = iov_iter_extract_mode(iter, gup_flags);
+	if (cleanup_mode & FOLL_GET)
+		bio_set_flag(bio, BIO_PAGE_REFFED);
+	if (cleanup_mode & FOLL_PIN)
+		bio_set_flag(bio, BIO_PAGE_PINNED);
+}
+
+/*
+ * Clean up a page appropriately, where the page may be pinned, may have a
+ * ref taken on it or neither.
+ */
+static inline void bio_release_page(struct bio *bio, struct page *page)
+{
+	page_put_unpin(page, bio->bi_flags & (FOLL_GET | FOLL_PIN));
+}
+
 struct request_queue *blk_alloc_queue(int node_id);
 
 int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 69b32c5532f6..856b28e41d24 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -496,7 +496,8 @@ void zero_fill_bio(struct bio *bio);
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	if (bio_flagged(bio, BIO_PAGE_REFFED))
+	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+	    bio_flagged(bio, BIO_PAGE_PINNED))
 		__bio_release_pages(bio, mark_dirty);
 }
 



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 12/34] bio: Fix bio_flagged() so that gcc can better optimise it
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (10 preceding siblings ...)
  2023-01-16 23:09 ` [PATCH v6 11/34] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-16 23:09 ` David Howells
  2023-01-17  8:07   ` Christoph Hellwig
  2023-01-16 23:09 ` [PATCH v6 13/34] netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator David Howells
                   ` (23 subsequent siblings)
  35 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-16 23:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, linux-block, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Fix bio_flagged() so that multiple instances of it, such as:

	if (bio_flagged(bio, BIO_PAGE_REFFED) ||
	    bio_flagged(bio, BIO_PAGE_PINNED))

can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).

The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero.  This results in an out-of-line bio_release_pages()
being compiled to something like:

   <+0>:     mov    0x14(%rdi),%eax
   <+3>:     test   $0x1,%al
   <+5>:     jne    0xffffffff816dac53 <bio_release_pages+11>
   <+7>:     test   $0x2,%al
   <+9>:     je     0xffffffff816dac5c <bio_release_pages+20>
   <+11>:    movzbl %sil,%esi
   <+15>:    jmp    0xffffffff816daba1 <__bio_release_pages>
   <+20>:    jmp    0xffffffff81d0b800 <__x86_return_thunk>

However, the test is superfluous as the return type is bool.  Removing it
results in:

   <+0>:     testb  $0x3,0x14(%rdi)
   <+4>:     je     0xffffffff816e4af4 <bio_release_pages+15>
   <+6>:     movzbl %sil,%esi
   <+10>:    jmp    0xffffffff816dab7c <__bio_release_pages>
   <+15>:    jmp    0xffffffff81d0b7c0 <__x86_return_thunk>

instead.

Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.

Fixes: b7c44ed9d2fc ("block: manipulate bio->bi_flags through helpers")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370 [1]
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371 [2]
---

 include/linux/bio.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 856b28e41d24..5e34bcfcfa2c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -241,7 +241,7 @@ static inline void bio_cnt_set(struct bio *bio, unsigned int count)
 
 static inline bool bio_flagged(struct bio *bio, unsigned int bit)
 {
-	return (bio->bi_flags & (1U << bit)) != 0;
+	return bio->bi_flags & (1U << bit);
 }
 
 static inline void bio_set_flag(struct bio *bio, unsigned int bit)



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 13/34] netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (11 preceding siblings ...)
  2023-01-16 23:09 ` [PATCH v6 12/34] bio: Fix bio_flagged() so that gcc can better optimise it David Howells
@ 2023-01-16 23:09 ` David Howells
  2023-01-16 23:09 ` [PATCH v6 14/34] netfs: Add a function to extract an iterator into a scatterlist David Howells
                   ` (22 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, Steve French, Shyam Prasad N, Rohith Surabattula,
	linux-cachefs, linux-cifs, linux-fsdevel, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Add a function to extract the pages from a user-space supplied iterator
(UBUF- or IOVEC-type) into a BVEC-type iterator, retaining the pages by
getting a ref on them (FOLL_SOURCE_BUF is indicated) or pinning them
(FOLL_DEST_BUF is indicated) as we go.

This is useful in three situations:

 (1) A userspace thread may have a sibling that unmaps or remaps the
     process's VM during the operation, changing the assignment of the
     pages and potentially causing an error.  Retaining the pages keeps
     some pages around, even if this occurs; further, we find out at the
     point of extraction if EFAULT is going to be incurred.

 (2) Pages might get swapped out/discarded if not retained, so we want to
     retain them to avoid the reload causing a deadlock due to a DIO
     from/to an mmapped region on the same file.

 (3) The iterator may get passed to sendmsg() by the filesystem.  If a
     fault occurs, we may get a short write to a TCP stream that's then
     tricky to recover from.

We don't deal with other types of iterator here, leaving it to other
mechanisms to retain the pages (eg. PG_locked, PG_writeback and the pipe
lock).
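
For illustration, a send/write-side caller might use this something like the
following.  This is only a sketch: "source", "count" and the error handling
are stand-ins, and it assumes the caller kvfree()'s the bvec array that the
helper allocates once the I/O has completed:

	struct iov_iter bvec_iter;
	const struct bio_vec *bv;
	unsigned int cleanup_mode;
	ssize_t nbv;
	int i;

	/* Decant the user buffer into a bvec iterator; as this is a source
	 * buffer, the pages get refs rather than pins.
	 */
	nbv = netfs_extract_user_iter(source, count, &bvec_iter,
				      FOLL_SOURCE_BUF);
	if (nbv < 0)
		return nbv;
	bv = bvec_iter.bvec;

	/* ... do the I/O described by bvec_iter ... */

	/* Release the pages in whatever way they were retained. */
	cleanup_mode = iov_iter_extract_mode(source, FOLL_SOURCE_BUF);
	for (i = 0; i < nbv; i++)
		page_put_unpin(bv[i].bv_page, cleanup_mode);
	kvfree(bv);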

Changes:
========
ver #6)
 - Pass in a gup_flags argument to allow FOLL_SOURCE_BUF and FOLL_DEST_BUF
   and other FOLL_* flags to be passed in.
 - Don't pass back the cleanup mode - iov_iter_extract_mode() can be used
   to determine that.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697255265.61150.6289490555867717077.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732026503.3186319.12020462741051772825.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869690376.3723671.8813331570219190705.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920904810.1461876.11603559311247187100.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997422579.9475.12101700945635692496.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305164634.1521586.12199658904363317567.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344729278.2425628.3277966637577509831.stgit@warthog.procyon.org.uk/ # v5
---

 fs/netfs/Makefile     |    1 
 fs/netfs/iterator.c   |  102 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h |    2 +
 3 files changed, 105 insertions(+)
 create mode 100644 fs/netfs/iterator.c

diff --git a/fs/netfs/Makefile b/fs/netfs/Makefile
index f684c0cd1ec5..386d6fb92793 100644
--- a/fs/netfs/Makefile
+++ b/fs/netfs/Makefile
@@ -3,6 +3,7 @@
 netfs-y := \
 	buffered_read.o \
 	io.o \
+	iterator.o \
 	main.o \
 	objects.o
 
diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
new file mode 100644
index 000000000000..f7f26de1a247
--- /dev/null
+++ b/fs/netfs/iterator.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Iterator helpers.
+ *
+ * Copyright (C) 2022 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <linux/export.h>
+#include <linux/slab.h>
+#include <linux/uio.h>
+#include <linux/netfs.h>
+#include "internal.h"
+
+/**
+ * netfs_extract_user_iter - Extract the pages from a user iterator into a bvec
+ * @orig: The original iterator
+ * @orig_len: The amount of iterator to copy
+ * @new: The iterator to be set up
+ * @gup_flags: Direction indicator and additional flags
+ *
+ * Extract the page fragments from the given amount of the source iterator and
+ * build up a second iterator that refers to all of those bits.  This allows
+ * the original iterator to be disposed of.
+ *
+ * @gup_flags should indicate FOLL_SOURCE_BUF or FOLL_DEST_BUF plus any
+ * additional flags needed.
+ *
+ * On success, the number of elements in the bvec is returned and the original
+ * iterator will have been advanced by the amount extracted.
+ *
+ * The iov_iter_extract_mode() function should be used to query how cleanup
+ * should be performed.
+ */
+ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
+				struct iov_iter *new, unsigned int gup_flags)
+{
+	struct bio_vec *bv = NULL;
+	struct page **pages;
+	unsigned int cur_npages;
+	unsigned int max_pages;
+	unsigned int npages = 0;
+	unsigned int i;
+	ssize_t ret;
+	size_t count = orig_len, offset, len;
+	size_t bv_size, pg_size;
+
+	if (WARN_ON_ONCE(!iter_is_ubuf(orig) && !iter_is_iovec(orig)))
+		return -EIO;
+
+	max_pages = iov_iter_npages(orig, INT_MAX);
+	bv_size = array_size(max_pages, sizeof(*bv));
+	bv = kvmalloc(bv_size, GFP_KERNEL);
+	if (!bv)
+		return -ENOMEM;
+
+	/* Put the page list at the end of the bvec list storage.  bvec
+	 * elements are larger than page pointers, so as long as we work
+	 * 0->last, we should be fine.
+	 */
+	pg_size = array_size(max_pages, sizeof(*pages));
+	pages = (void *)bv + bv_size - pg_size;
+
+	while (count && npages < max_pages) {
+		ret = iov_iter_extract_pages(orig, &pages, count,
+					     max_pages - npages, gup_flags,
+					     &offset);
+		if (ret < 0) {
+			pr_err("Couldn't get user pages (rc=%zd)\n", ret);
+			break;
+		}
+
+		if (ret > count) {
+			pr_err("get_pages rc=%zd more than %zu\n", ret, count);
+			break;
+		}
+
+		count -= ret;
+		ret += offset;
+		cur_npages = DIV_ROUND_UP(ret, PAGE_SIZE);
+
+		if (npages + cur_npages > max_pages) {
+			pr_err("Out of bvec array capacity (%u vs %u)\n",
+			       npages + cur_npages, max_pages);
+			break;
+		}
+
+		for (i = 0; i < cur_npages; i++) {
+			len = ret > PAGE_SIZE ? PAGE_SIZE : ret;
+			bv[npages + i].bv_page	 = *pages++;
+			bv[npages + i].bv_offset = offset;
+			bv[npages + i].bv_len	 = len - offset;
+			ret -= len;
+			offset = 0;
+		}
+
+		npages += cur_npages;
+	}
+
+	iov_iter_bvec(new, orig->data_source, bv, npages, orig_len - count);
+	return npages;
+}
+EXPORT_SYMBOL_GPL(netfs_extract_user_iter);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index 4c76ddfb6a67..a45757dd382d 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -296,6 +296,8 @@ void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
 void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
 			  bool was_async, enum netfs_sreq_ref_trace what);
 void netfs_stats_show(struct seq_file *);
+ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
+				struct iov_iter *new, unsigned int gup_flags);
 
 /**
  * netfs_inode - Get the netfs inode context from the inode



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 14/34] netfs: Add a function to extract an iterator into a scatterlist
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (12 preceding siblings ...)
  2023-01-16 23:09 ` [PATCH v6 13/34] netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator David Howells
@ 2023-01-16 23:09 ` David Howells
  2023-01-16 23:09 ` [PATCH v6 15/34] af_alg: Pin pages rather than ref'ing if appropriate David Howells
                   ` (21 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Jeff Layton, Steve French, Shyam Prasad N, Rohith Surabattula,
	linux-cachefs, linux-cifs, linux-fsdevel, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Provide a function for filling in a scatterlist from the list of pages
contained in an iterator.  The function is passed FOLL_SOURCE_BUF or
FOLL_DEST_BUF to indicate how the extracted pages are to be used.

If the iterator is UBUF- or IOVEC-type, the pages have a ref (FOLL_SOURCE_BUF)
or a pin (FOLL_DEST_BUF) taken on them.

If the iterator is BVEC-, KVEC- or XARRAY-type, no ref is taken on the
pages and it is left to the caller to manage their lifetime.  It cannot be
assumed that a ref can be validly taken, particularly in the case of a KVEC
iterator.
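
For reference, wiring this up looks roughly like the af_alg and vhost-scsi
conversions later in this series.  A minimal sketch, in which "sgl",
"sg_count", "iter" and "count" stand in for whatever the caller already has:

	struct sg_table sgt = { .sgl = sgl, .nents = 0, .orig_nents = 0 };
	unsigned int cleanup_mode;
	ssize_t n;
	int i;

	n = netfs_extract_iter_to_sg(iter, count, &sgt, sg_count,
				     FOLL_DEST_BUF);
	if (n < 0)
		return n;
	if (n > 0)	/* No end mark is placed; that's left to the caller */
		sg_mark_end(&sgt.sgl[sgt.nents - 1]);

	/* ... do the I/O ... */

	/* Only user-backed iterators leave us pages to put or unpin. */
	cleanup_mode = iov_iter_extract_mode(iter, FOLL_DEST_BUF);
	if (cleanup_mode & (FOLL_GET | FOLL_PIN))
		for (i = 0; i < sgt.nents; i++)
			page_put_unpin(sg_page(&sgt.sgl[i]), cleanup_mode);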

Changes:
========
ver #6)
 - Pass in a gup_flags argument to allow FOLL_SOURCE_BUF and FOLL_DEST_BUF
   and other FOLL_* flags to be passed in.
 - Don't pass back the cleanup mode - iov_iter_extract_mode() can be used
   to determine that.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: linux-cachefs@redhat.com
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697255985.61150.16489950598033809487.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732027275.3186319.5186488812166611598.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869691313.3723671.10714823767342163891.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920905749.1461876.12079195122363691498.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997423514.9475.11145024341505464337.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305165398.1521586.12353215176136705725.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344730041.2425628.14391053364759792950.stgit@warthog.procyon.org.uk/ # v5
---

 fs/netfs/iterator.c   |  269 +++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/netfs.h |    4 +
 mm/vmalloc.c          |    1 
 3 files changed, 274 insertions(+)

diff --git a/fs/netfs/iterator.c b/fs/netfs/iterator.c
index f7f26de1a247..1d20ad2123b5 100644
--- a/fs/netfs/iterator.c
+++ b/fs/netfs/iterator.c
@@ -7,7 +7,9 @@
 
 #include <linux/export.h>
 #include <linux/slab.h>
+#include <linux/mm.h>
 #include <linux/uio.h>
+#include <linux/scatterlist.h>
 #include <linux/netfs.h>
 #include "internal.h"
 
@@ -100,3 +102,270 @@ ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 	return npages;
 }
 EXPORT_SYMBOL_GPL(netfs_extract_user_iter);
+
+/*
+ * Extract a list of up to sg_max pages from UBUF- or IOVEC-class iterators,
+ * pin or get refs on them as appropriate and add them to the scatterlist.
+ */
+static ssize_t netfs_extract_user_to_sg(struct iov_iter *iter,
+					ssize_t maxsize,
+					struct sg_table *sgtable,
+					unsigned int sg_max,
+					unsigned int gup_flags)
+{
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	struct page **pages;
+	unsigned int npages;
+	ssize_t ret = 0, res;
+	size_t len, off;
+
+	/* We decant the page list into the tail of the scatterlist */
+	pages = (void *)sgtable->sgl + array_size(sg_max, sizeof(struct scatterlist));
+	pages -= sg_max;
+
+	do {
+		res = iov_iter_extract_pages(iter, &pages, maxsize, sg_max,
+					     gup_flags, &off);
+		if (res < 0)
+			goto failed;
+
+		len = res;
+		maxsize -= len;
+		ret += len;
+		npages = DIV_ROUND_UP(off + len, PAGE_SIZE);
+		sg_max -= npages;
+
+		for (; npages > 0; npages--) {
+			struct page *page = *pages;
+			size_t seg = min_t(size_t, PAGE_SIZE - off, len);
+
+			*pages++ = NULL;
+			sg_set_page(sg, page, seg, off);
+			sgtable->nents++;
+			sg++;
+			len -= seg;
+			off = 0;
+		}
+	} while (maxsize > 0 && sg_max > 0);
+
+	return ret;
+
+failed:
+	while (sgtable->nents > sgtable->orig_nents)
+		put_page(sg_page(&sgtable->sgl[--sgtable->nents]));
+	return res;
+}
+
+/*
+ * Extract up to sg_max pages from a BVEC-type iterator and add them to the
+ * scatterlist.  The pages are not pinned.
+ */
+static ssize_t netfs_extract_bvec_to_sg(struct iov_iter *iter,
+					ssize_t maxsize,
+					struct sg_table *sgtable,
+					unsigned int sg_max,
+					unsigned int gup_flags)
+{
+	const struct bio_vec *bv = iter->bvec;
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	unsigned long start = iter->iov_offset;
+	unsigned int i;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		size_t off, len;
+
+		len = bv[i].bv_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		len = min_t(size_t, maxsize, len - start);
+		off = bv[i].bv_offset + start;
+
+		sg_set_page(sg, bv[i].bv_page, len, off);
+		sgtable->nents++;
+		sg++;
+		sg_max--;
+
+		ret += len;
+		maxsize -= len;
+		if (maxsize <= 0 || sg_max == 0)
+			break;
+		start = 0;
+	}
+
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	return ret;
+}
+
+/*
+ * Extract up to sg_max pages from a KVEC-type iterator and add them to the
+ * scatterlist.  This can deal with vmalloc'd buffers as well as kmalloc'd or
+ * static buffers.  The pages are not pinned.
+ */
+static ssize_t netfs_extract_kvec_to_sg(struct iov_iter *iter,
+					ssize_t maxsize,
+					struct sg_table *sgtable,
+					unsigned int sg_max,
+					unsigned int gup_flags)
+{
+	const struct kvec *kv = iter->kvec;
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	unsigned long start = iter->iov_offset;
+	unsigned int i;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		struct page *page;
+		unsigned long kaddr;
+		size_t off, len, seg;
+
+		len = kv[i].iov_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		kaddr = (unsigned long)kv[i].iov_base + start;
+		off = kaddr & ~PAGE_MASK;
+		len = min_t(size_t, maxsize, len - start);
+		kaddr &= PAGE_MASK;
+
+		maxsize -= len;
+		ret += len;
+		do {
+			seg = min_t(size_t, len, PAGE_SIZE - off);
+			if (is_vmalloc_or_module_addr((void *)kaddr))
+				page = vmalloc_to_page((void *)kaddr);
+			else
+				page = virt_to_page(kaddr);
+
+			sg_set_page(sg, page, len, off);
+			sgtable->nents++;
+			sg++;
+			sg_max--;
+
+			len -= seg;
+			kaddr += PAGE_SIZE;
+			off = 0;
+		} while (len > 0 && sg_max > 0);
+
+		if (maxsize <= 0 || sg_max == 0)
+			break;
+		start = 0;
+	}
+
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	return ret;
+}
+
+/*
+ * Extract up to sg_max folios from an XARRAY-type iterator and add them to
+ * the scatterlist.  The pages are not pinned.
+ */
+static ssize_t netfs_extract_xarray_to_sg(struct iov_iter *iter,
+					  ssize_t maxsize,
+					  struct sg_table *sgtable,
+					  unsigned int sg_max,
+					  unsigned int gup_flags)
+{
+	struct scatterlist *sg = sgtable->sgl + sgtable->nents;
+	struct xarray *xa = iter->xarray;
+	struct folio *folio;
+	loff_t start = iter->xarray_start + iter->iov_offset;
+	pgoff_t index = start / PAGE_SIZE;
+	ssize_t ret = 0;
+	size_t offset, len;
+	XA_STATE(xas, xa, index);
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, folio, ULONG_MAX) {
+		if (xas_retry(&xas, folio))
+			continue;
+		if (WARN_ON(xa_is_value(folio)))
+			break;
+		if (WARN_ON(folio_test_hugetlb(folio)))
+			break;
+
+		offset = offset_in_folio(folio, start);
+		len = min_t(size_t, maxsize, folio_size(folio) - offset);
+
+		sg_set_page(sg, folio_page(folio, 0), len, offset);
+		sgtable->nents++;
+		sg++;
+		sg_max--;
+
+		maxsize -= len;
+		ret += len;
+		if (maxsize <= 0 || sg_max == 0)
+			break;
+	}
+
+	rcu_read_unlock();
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	return ret;
+}
+
+/**
+ * netfs_extract_iter_to_sg - Extract pages from an iterator and add to an sglist
+ * @iter: The iterator to extract from
+ * @maxsize: The amount of iterator to copy
+ * @sgtable: The scatterlist table to fill in
+ * @sg_max: Maximum number of elements in @sgtable that may be filled
+ * @gup_flags: Direction indicator and additional flags
+ *
+ * Extract the page fragments from the given amount of the source iterator and
+ * add them to a scatterlist that refers to all of those bits, to a maximum
+ * addition of @sg_max elements.
+ *
+ * The pages referred to by UBUF- and IOVEC-type iterators are extracted and
+ * pinned; BVEC-, KVEC- and XARRAY-type are extracted but aren't pinned; PIPE-
+ * and DISCARD-type are not supported.
+ *
+ * No end mark is placed on the scatterlist; that's left to the caller.
+ *
+ * @gup_flags should indicate FOLL_SOURCE_BUF or FOLL_DEST_BUF plus any
+ * additional flags needed.
+ *
+ * If successful, @sgtable->nents is updated to include the number of elements
+ * added and the number of bytes added is returned.  @sgtable->orig_nents is
+ * left unaltered.
+ *
+ * The iov_iter_extract_mode() function should be used to query how cleanup
+ * should be performed.
+ */
+ssize_t netfs_extract_iter_to_sg(struct iov_iter *iter, size_t maxsize,
+				 struct sg_table *sgtable, unsigned int sg_max,
+				 unsigned int gup_flags)
+{
+	if (maxsize == 0)
+		return 0;
+
+	switch (iov_iter_type(iter)) {
+	case ITER_UBUF:
+	case ITER_IOVEC:
+		return netfs_extract_user_to_sg(iter, maxsize, sgtable, sg_max,
+						gup_flags);
+	case ITER_BVEC:
+		return netfs_extract_bvec_to_sg(iter, maxsize, sgtable, sg_max,
+						gup_flags);
+	case ITER_KVEC:
+		return netfs_extract_kvec_to_sg(iter, maxsize, sgtable, sg_max,
+						gup_flags);
+	case ITER_XARRAY:
+		return netfs_extract_xarray_to_sg(iter, maxsize, sgtable, sg_max,
+						  gup_flags);
+	default:
+		pr_err("netfs_extract_iter_to_sg(%u) unsupported\n",
+		       iov_iter_type(iter));
+		WARN_ON_ONCE(1);
+		return -EIO;
+	}
+}
+EXPORT_SYMBOL_GPL(netfs_extract_iter_to_sg);
diff --git a/include/linux/netfs.h b/include/linux/netfs.h
index a45757dd382d..2493df855f05 100644
--- a/include/linux/netfs.h
+++ b/include/linux/netfs.h
@@ -298,6 +298,10 @@ void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
 void netfs_stats_show(struct seq_file *);
 ssize_t netfs_extract_user_iter(struct iov_iter *orig, size_t orig_len,
 				struct iov_iter *new, unsigned int gup_flags);
+struct sg_table;
+ssize_t netfs_extract_iter_to_sg(struct iov_iter *iter, size_t len,
+				 struct sg_table *sgtable, unsigned int sg_max,
+				 unsigned int gup_flags);
 
 /**
  * netfs_inode - Get the netfs inode context from the inode
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ca71de7c9d77..61f5bec0f2b6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -656,6 +656,7 @@ int is_vmalloc_or_module_addr(const void *x)
 #endif
 	return is_vmalloc_addr(x);
 }
+EXPORT_SYMBOL_GPL(is_vmalloc_or_module_addr);
 
 /*
  * Walk a vmap address to the struct page it maps. Huge vmap mappings will



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 15/34] af_alg: Pin pages rather than ref'ing if appropriate
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (13 preceding siblings ...)
  2023-01-16 23:09 ` [PATCH v6 14/34] netfs: Add a function to extract an iterator into a scatterlist David Howells
@ 2023-01-16 23:09 ` David Howells
  2023-01-16 23:09 ` [PATCH v6 16/34] af_alg: [RFC] Use netfs_extract_iter_to_sg() to create scatterlists David Howells
                   ` (20 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Herbert Xu, linux-crypto, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Convert AF_ALG to use iov_iter_extract_pages() instead of
iov_iter_get_pages().  This will pin pages or leave them unaltered rather
than getting a ref on them as appropriate to the iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: linux-crypto@vger.kernel.org
---

 crypto/af_alg.c         |    9 ++++++---
 include/crypto/if_alg.h |    1 +
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 7a68db157fae..c99e09fce71f 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -534,15 +534,18 @@ static const struct net_proto_family alg_family = {
 int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len,
 		   unsigned int gup_flags)
 {
+	struct page **pages = sgl->pages;
 	size_t off;
 	ssize_t n;
 	int npages, i;
 
-	n = iov_iter_get_pages(iter, sgl->pages, len, ALG_MAX_PAGES, &off,
-			       gup_flags);
+	n = iov_iter_extract_pages(iter, &pages, len, ALG_MAX_PAGES,
+				   gup_flags, &off);
 	if (n < 0)
 		return n;
 
+	sgl->cleanup_mode = iov_iter_extract_mode(iter, gup_flags);
+
 	npages = DIV_ROUND_UP(off + n, PAGE_SIZE);
 	if (WARN_ON(npages == 0))
 		return -EINVAL;
@@ -576,7 +579,7 @@ void af_alg_free_sg(struct af_alg_sgl *sgl)
 	int i;
 
 	for (i = 0; i < sgl->npages; i++)
-		put_page(sgl->pages[i]);
+		page_put_unpin(sgl->pages[i], sgl->cleanup_mode);
 }
 EXPORT_SYMBOL_GPL(af_alg_free_sg);
 
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index 12058ab6cad9..95b3b7517d3f 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -61,6 +61,7 @@ struct af_alg_sgl {
 	struct scatterlist sg[ALG_MAX_PAGES + 1];
 	struct page *pages[ALG_MAX_PAGES];
 	unsigned int npages;
+	unsigned int cleanup_mode;
 };
 
 /* TX SGL entry */



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 16/34] af_alg: [RFC] Use netfs_extract_iter_to_sg() to create scatterlists
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (14 preceding siblings ...)
  2023-01-16 23:09 ` [PATCH v6 15/34] af_alg: Pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-16 23:09 ` David Howells
  2023-01-16 23:10 ` [PATCH v6 17/34] scsi: [RFC] Use netfs_extract_iter_to_sg() David Howells
                   ` (19 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:09 UTC (permalink / raw)
  To: Al Viro
  Cc: Herbert Xu, linux-crypto, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Use netfs_extract_iter_to_sg() to decant the destination iterator into a
scatterlist in af_alg_get_rsgl().  af_alg_make_sg() can then be removed.

Note that if this fits, netfs_extract_iter_to_sg() should move to core
code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: linux-crypto@vger.kernel.org
---

 crypto/af_alg.c         |   63 +++++++++++++----------------------------------
 crypto/algif_hash.c     |   21 +++++++++++-----
 include/crypto/if_alg.h |    7 +----
 3 files changed, 35 insertions(+), 56 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index c99e09fce71f..c5fbe39366ff 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -22,6 +22,7 @@
 #include <linux/sched/signal.h>
 #include <linux/security.h>
 #include <linux/string.h>
+#include <linux/netfs.h>
 #include <keys/user-type.h>
 #include <keys/trusted-type.h>
 #include <keys/encrypted-type.h>
@@ -531,55 +532,22 @@ static const struct net_proto_family alg_family = {
 	.owner	=	THIS_MODULE,
 };
 
-int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len,
-		   unsigned int gup_flags)
-{
-	struct page **pages = sgl->pages;
-	size_t off;
-	ssize_t n;
-	int npages, i;
-
-	n = iov_iter_extract_pages(iter, &pages, len, ALG_MAX_PAGES,
-				   gup_flags, &off);
-	if (n < 0)
-		return n;
-
-	sgl->cleanup_mode = iov_iter_extract_mode(iter, gup_flags);
-
-	npages = DIV_ROUND_UP(off + n, PAGE_SIZE);
-	if (WARN_ON(npages == 0))
-		return -EINVAL;
-	/* Add one extra for linking */
-	sg_init_table(sgl->sg, npages + 1);
-
-	for (i = 0, len = n; i < npages; i++) {
-		int plen = min_t(int, len, PAGE_SIZE - off);
-
-		sg_set_page(sgl->sg + i, sgl->pages[i], plen, off);
-
-		off = 0;
-		len -= plen;
-	}
-	sg_mark_end(sgl->sg + npages - 1);
-	sgl->npages = npages;
-
-	return n;
-}
-EXPORT_SYMBOL_GPL(af_alg_make_sg);
-
 static void af_alg_link_sg(struct af_alg_sgl *sgl_prev,
 			   struct af_alg_sgl *sgl_new)
 {
-	sg_unmark_end(sgl_prev->sg + sgl_prev->npages - 1);
-	sg_chain(sgl_prev->sg, sgl_prev->npages + 1, sgl_new->sg);
+	sg_unmark_end(sgl_prev->sgt.sgl + sgl_prev->sgt.nents - 1);
+	sg_chain(sgl_prev->sgt.sgl, sgl_prev->sgt.nents + 1, sgl_new->sgt.sgl);
 }
 
 void af_alg_free_sg(struct af_alg_sgl *sgl)
 {
 	int i;
 
-	for (i = 0; i < sgl->npages; i++)
-		page_put_unpin(sgl->pages[i], sgl->cleanup_mode);
+	if (!(sgl->cleanup_mode & (FOLL_PIN | FOLL_GET)))
+		return;
+
+	for (i = 0; i < sgl->sgt.nents; i++)
+		page_put_unpin(sg_page(&sgl->sgt.sgl[i]), sgl->cleanup_mode);
 }
 EXPORT_SYMBOL_GPL(af_alg_free_sg);
 
@@ -1293,8 +1261,8 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
 
 	while (maxsize > len && msg_data_left(msg)) {
 		struct af_alg_rsgl *rsgl;
+		ssize_t err;
 		size_t seglen;
-		int err;
 
 		/* limit the amount of readable buffers */
 		if (!af_alg_readable(sk))
@@ -1311,17 +1279,22 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
 				return -ENOMEM;
 		}
 
-		rsgl->sgl.npages = 0;
+		rsgl->sgl.sgt.sgl = rsgl->sgl.sgl;
+		rsgl->sgl.sgt.nents = 0;
+		rsgl->sgl.sgt.orig_nents = 0;
 		list_add_tail(&rsgl->list, &areq->rsgl_list);
 
-		/* make one iovec available as scatterlist */
-		err = af_alg_make_sg(&rsgl->sgl, &msg->msg_iter, seglen,
-				     FOLL_DEST_BUF);
+		err = netfs_extract_iter_to_sg(&msg->msg_iter, seglen,
+					       &rsgl->sgl.sgt, ALG_MAX_PAGES,
+					       FOLL_DEST_BUF);
 		if (err < 0) {
 			rsgl->sg_num_bytes = 0;
 			return err;
 		}
 
+		rsgl->sgl.cleanup_mode = iov_iter_extract_mode(&msg->msg_iter,
+							       FOLL_DEST_BUF);
+
 		/* chain the new scatterlist with previous one */
 		if (areq->last_rsgl)
 			af_alg_link_sg(&areq->last_rsgl->sgl, &rsgl->sgl);
diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index fe3d2258145f..5aef6818a9ff 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -14,6 +14,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/net.h>
+#include <linux/netfs.h>
 #include <net/sock.h>
 
 struct hash_ctx {
@@ -91,14 +92,22 @@ static int hash_sendmsg(struct socket *sock, struct msghdr *msg,
 		if (len > limit)
 			len = limit;
 
-		len = af_alg_make_sg(&ctx->sgl, &msg->msg_iter, len,
-				     FOLL_SOURCE_BUF);
+		ctx->sgl.sgt.sgl = ctx->sgl.sgl;
+		ctx->sgl.sgt.nents = 0;
+		ctx->sgl.sgt.orig_nents = 0;
+
+		len = netfs_extract_iter_to_sg(&msg->msg_iter, len,
+					       &ctx->sgl.sgt, ALG_MAX_PAGES,
+					       FOLL_SOURCE_BUF);
 		if (len < 0) {
 			err = copied ? 0 : len;
 			goto unlock;
 		}
 
-		ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, NULL, len);
+		ctx->sgl.cleanup_mode = iov_iter_extract_mode(&msg->msg_iter,
+							      FOLL_SOURCE_BUF);
+
+		ahash_request_set_crypt(&ctx->req, ctx->sgl.sgt.sgl, NULL, len);
 
 		err = crypto_wait_req(crypto_ahash_update(&ctx->req),
 				      &ctx->wait);
@@ -142,8 +151,8 @@ static ssize_t hash_sendpage(struct socket *sock, struct page *page,
 		flags |= MSG_MORE;
 
 	lock_sock(sk);
-	sg_init_table(ctx->sgl.sg, 1);
-	sg_set_page(ctx->sgl.sg, page, size, offset);
+	sg_init_table(ctx->sgl.sgl, 1);
+	sg_set_page(ctx->sgl.sgl, page, size, offset);
 
 	if (!(flags & MSG_MORE)) {
 		err = hash_alloc_result(sk, ctx);
@@ -152,7 +161,7 @@ static ssize_t hash_sendpage(struct socket *sock, struct page *page,
 	} else if (!ctx->more)
 		hash_free_result(sk, ctx);
 
-	ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, ctx->result, size);
+	ahash_request_set_crypt(&ctx->req, ctx->sgl.sgl, ctx->result, size);
 
 	if (!(flags & MSG_MORE)) {
 		if (ctx->more)
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index 95b3b7517d3f..424a2071705d 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -58,9 +58,8 @@ struct af_alg_type {
 };
 
 struct af_alg_sgl {
-	struct scatterlist sg[ALG_MAX_PAGES + 1];
-	struct page *pages[ALG_MAX_PAGES];
-	unsigned int npages;
+	struct sg_table sgt;
+	struct scatterlist sgl[ALG_MAX_PAGES + 1];
 	unsigned int cleanup_mode;
 };
 
@@ -166,8 +165,6 @@ int af_alg_release(struct socket *sock);
 void af_alg_release_parent(struct sock *sk);
 int af_alg_accept(struct sock *sk, struct socket *newsock, bool kern);
 
-int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len,
-		   unsigned int gup_flags);
 void af_alg_free_sg(struct af_alg_sgl *sgl);
 
 static inline struct alg_sock *alg_sk(struct sock *sk)



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 17/34] scsi: [RFC] Use netfs_extract_iter_to_sg()
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (15 preceding siblings ...)
  2023-01-16 23:09 ` [PATCH v6 16/34] af_alg: [RFC] Use netfs_extract_iter_to_sg() to create scatterlists David Howells
@ 2023-01-16 23:10 ` David Howells
  2023-01-16 23:10 ` [PATCH v6 18/34] dio: Pin pages rather than ref'ing if appropriate David Howells
                   ` (18 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:10 UTC (permalink / raw)
  To: Al Viro
  Cc: James E.J. Bottomley, Martin K. Petersen, Christoph Hellwig,
	linux-scsi, dhowells, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

Use netfs_extract_iter_to_sg() to build a scatterlist from an iterator.

Note that if this fits, netfs_extract_iter_to_sg() should move to core
code.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: James E.J. Bottomley <jejb@linux.ibm.com>
cc: Martin K. Petersen <martin.petersen@oracle.com>
cc: Christoph Hellwig <hch@lst.de>
cc: linux-scsi@vger.kernel.org
---

 drivers/vhost/scsi.c |   78 +++++++++++++++-----------------------------------
 1 file changed, 23 insertions(+), 55 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 5d10837d19ec..af897cc4036d 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -34,6 +34,7 @@
 #include <linux/virtio_scsi.h>
 #include <linux/llist.h>
 #include <linux/bitmap.h>
+#include <linux/netfs.h>
 
 #include "vhost.h"
 
@@ -75,6 +76,9 @@ struct vhost_scsi_cmd {
 	u32 tvc_prot_sgl_count;
 	/* Saved unpacked SCSI LUN for vhost_scsi_target_queue_cmd() */
 	u32 tvc_lun;
+	/* Cleanup modes for scatterlists */
+	unsigned int tvc_cleanup_mode;
+	unsigned int tvc_prot_cleanup_mode;
 	/* Pointer to the SGL formatted memory from virtio-scsi */
 	struct scatterlist *tvc_sgl;
 	struct scatterlist *tvc_prot_sgl;
@@ -339,11 +343,13 @@ static void vhost_scsi_release_cmd_res(struct se_cmd *se_cmd)
 
 	if (tv_cmd->tvc_sgl_count) {
 		for (i = 0; i < tv_cmd->tvc_sgl_count; i++)
-			put_page(sg_page(&tv_cmd->tvc_sgl[i]));
+			page_put_unpin(sg_page(&tv_cmd->tvc_sgl[i]),
+				       tv_cmd->tvc_cleanup_mode);
 	}
 	if (tv_cmd->tvc_prot_sgl_count) {
 		for (i = 0; i < tv_cmd->tvc_prot_sgl_count; i++)
-			put_page(sg_page(&tv_cmd->tvc_prot_sgl[i]));
+			page_put_unpin(sg_page(&tv_cmd->tvc_prot_sgl[i]),
+				       tv_cmd->tvc_prot_cleanup_mode);
 	}
 
 	sbitmap_clear_bit(&svq->scsi_tags, se_cmd->map_tag);
@@ -631,41 +637,6 @@ vhost_scsi_get_cmd(struct vhost_virtqueue *vq, struct vhost_scsi_tpg *tpg,
 	return cmd;
 }
 
-/*
- * Map a user memory range into a scatterlist
- *
- * Returns the number of scatterlist entries used or -errno on error.
- */
-static int
-vhost_scsi_map_to_sgl(struct vhost_scsi_cmd *cmd,
-		      struct iov_iter *iter,
-		      struct scatterlist *sgl,
-		      bool write)
-{
-	struct page **pages = cmd->tvc_upages;
-	struct scatterlist *sg = sgl;
-	ssize_t bytes;
-	size_t offset;
-	unsigned int npages = 0, gup_flags = 0;
-
-	gup_flags |= write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
-
-	bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
-				   VHOST_SCSI_PREALLOC_UPAGES, &offset,
-				   gup_flags);
-	/* No pages were pinned */
-	if (bytes <= 0)
-		return bytes < 0 ? bytes : -EFAULT;
-
-	while (bytes) {
-		unsigned n = min_t(unsigned, PAGE_SIZE - offset, bytes);
-		sg_set_page(sg++, pages[npages++], n, offset);
-		bytes -= n;
-		offset = 0;
-	}
-	return npages;
-}
-
 static int
 vhost_scsi_calc_sgls(struct iov_iter *iter, size_t bytes, int max_sgls)
 {
@@ -689,24 +660,19 @@ vhost_scsi_calc_sgls(struct iov_iter *iter, size_t bytes, int max_sgls)
 static int
 vhost_scsi_iov_to_sgl(struct vhost_scsi_cmd *cmd, bool write,
 		      struct iov_iter *iter,
-		      struct scatterlist *sg, int sg_count)
+		      struct scatterlist *sg, int sg_count,
+		      unsigned int *cleanup_mode)
 {
-	struct scatterlist *p = sg;
-	int ret;
+	struct sg_table sgt = { .sgl = sg };
+	unsigned int gup_flags = write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
+	ssize_t ret;
 
-	while (iov_iter_count(iter)) {
-		ret = vhost_scsi_map_to_sgl(cmd, iter, sg, write);
-		if (ret < 0) {
-			while (p < sg) {
-				struct page *page = sg_page(p++);
-				if (page)
-					put_page(page);
-			}
-			return ret;
-		}
-		sg += ret;
-	}
-	return 0;
+	ret = netfs_extract_iter_to_sg(iter, LONG_MAX, &sgt, sg_count, gup_flags);
+	if (ret > 0)
+		sg_mark_end(sg + sgt.nents - 1);
+
+	*cleanup_mode = iov_iter_extract_mode(iter, gup_flags);
+	return ret;
 }
 
 static int
@@ -730,7 +696,8 @@ vhost_scsi_mapal(struct vhost_scsi_cmd *cmd,
 
 		ret = vhost_scsi_iov_to_sgl(cmd, write, prot_iter,
 					    cmd->tvc_prot_sgl,
-					    cmd->tvc_prot_sgl_count);
+					    cmd->tvc_prot_sgl_count,
+					    &cmd->tvc_prot_cleanup_mode);
 		if (ret < 0) {
 			cmd->tvc_prot_sgl_count = 0;
 			return ret;
@@ -747,7 +714,8 @@ vhost_scsi_mapal(struct vhost_scsi_cmd *cmd,
 		  cmd->tvc_sgl, cmd->tvc_sgl_count);
 
 	ret = vhost_scsi_iov_to_sgl(cmd, write, data_iter,
-				    cmd->tvc_sgl, cmd->tvc_sgl_count);
+				    cmd->tvc_sgl, cmd->tvc_sgl_count,
+				    &cmd->tvc_cleanup_mode);
 	if (ret < 0) {
 		cmd->tvc_sgl_count = 0;
 		return ret;



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 18/34] dio: Pin pages rather than ref'ing if appropriate
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (16 preceding siblings ...)
  2023-01-16 23:10 ` [PATCH v6 17/34] scsi: [RFC] Use netfs_extract_iter_to_sg() David Howells
@ 2023-01-16 23:10 ` David Howells
  2023-01-19  5:04   ` Al Viro
  2023-01-16 23:10 ` [PATCH v6 19/34] fuse: " David Howells
                   ` (17 subsequent siblings)
  35 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-16 23:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-fsdevel, linux-block, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Convert the generic direct-I/O code to use iov_iter_extract_pages() instead
of iov_iter_get_pages().  This will pin pages or leave them unaltered
rather than getting a ref on them as appropriate to the iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
---

 fs/direct-io.c |   57 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b1e26a706e31..b4d2c9f85a5b 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -142,9 +142,11 @@ struct dio {
 
 	/*
 	 * pages[] (and any fields placed after it) are not zeroed out at
-	 * allocation time.  Don't add new fields after pages[] unless you
-	 * wish that they not be zeroed.
+	 * allocation time.  Don't add new fields after pages[] unless you wish
+	 * that they not be zeroed.  Pages may have a ref taken, a pin emplaced
+	 * or no retention measures.
 	 */
+	unsigned int cleanup_mode;	/* How pages should be cleaned up (0/FOLL_GET/PIN) */
 	union {
 		struct page *pages[DIO_PAGES];	/* page buffer */
 		struct work_struct complete_work;/* deferred AIO completion */
@@ -167,12 +169,13 @@ static inline unsigned dio_pages_present(struct dio_submit *sdio)
 static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 {
 	const enum req_op dio_op = dio->opf & REQ_OP_MASK;
+	unsigned int gup_flags =
+		op_is_write(dio_op) ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
+	struct page **pages = dio->pages;
 	ssize_t ret;
 
-	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
-				 &sdio->from,
-				 op_is_write(dio_op) ?
-				 FOLL_SOURCE_BUF : FOLL_DEST_BUF);
+	ret = iov_iter_extract_pages(sdio->iter, &pages, LONG_MAX, DIO_PAGES,
+				     gup_flags, &sdio->from);
 
 	if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
 		struct page *page = ZERO_PAGE(0);
@@ -183,7 +186,7 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		 */
 		if (dio->page_errors == 0)
 			dio->page_errors = ret;
-		get_page(page);
+		dio->cleanup_mode = 0;
 		dio->pages[0] = page;
 		sdio->head = 0;
 		sdio->tail = 1;
@@ -197,6 +200,8 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
 		sdio->head = 0;
 		sdio->tail = (ret + PAGE_SIZE - 1) / PAGE_SIZE;
 		sdio->to = ((ret - 1) & (PAGE_SIZE - 1)) + 1;
+		dio->cleanup_mode =
+			iov_iter_extract_mode(sdio->iter, gup_flags);
 		return 0;
 	}
 	return ret;	
@@ -400,6 +405,10 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 	 * we request a valid number of vectors.
 	 */
 	bio = bio_alloc(bdev, nr_vecs, dio->opf, GFP_KERNEL);
+	if (!(dio->cleanup_mode & FOLL_GET))
+		bio_clear_flag(bio, BIO_PAGE_REFFED);
+	if (dio->cleanup_mode & FOLL_PIN)
+		bio_set_flag(bio, BIO_PAGE_PINNED);
 	bio->bi_iter.bi_sector = first_sector;
 	if (dio->is_async)
 		bio->bi_end_io = dio_bio_end_aio;
@@ -443,13 +452,18 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
 	sdio->logical_offset_in_bio = 0;
 }
 
+static void dio_cleanup_page(struct dio *dio, struct page *page)
+{
+	page_put_unpin(page, dio->cleanup_mode);
+}
+
 /*
  * Release any resources in case of a failure
  */
 static inline void dio_cleanup(struct dio *dio, struct dio_submit *sdio)
 {
 	while (sdio->head < sdio->tail)
-		put_page(dio->pages[sdio->head++]);
+		dio_cleanup_page(dio, dio->pages[sdio->head++]);
 }
 
 /*
@@ -704,7 +718,7 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
  *
  * Return zero on success.  Non-zero means the caller needs to start a new BIO.
  */
-static inline int dio_bio_add_page(struct dio_submit *sdio)
+static inline int dio_bio_add_page(struct dio *dio, struct dio_submit *sdio)
 {
 	int ret;
 
@@ -771,11 +785,11 @@ static inline int dio_send_cur_page(struct dio *dio, struct dio_submit *sdio,
 			goto out;
 	}
 
-	if (dio_bio_add_page(sdio) != 0) {
+	if (dio_bio_add_page(dio, sdio) != 0) {
 		dio_bio_submit(dio, sdio);
 		ret = dio_new_bio(dio, sdio, sdio->cur_page_block, map_bh);
 		if (ret == 0) {
-			ret = dio_bio_add_page(sdio);
+			ret = dio_bio_add_page(dio, sdio);
 			BUG_ON(ret != 0);
 		}
 	}
@@ -832,13 +846,16 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 	 */
 	if (sdio->cur_page) {
 		ret = dio_send_cur_page(dio, sdio, map_bh);
-		put_page(sdio->cur_page);
+		dio_cleanup_page(dio, sdio->cur_page);
 		sdio->cur_page = NULL;
 		if (ret)
 			return ret;
 	}
 
-	get_page(page);		/* It is in dio */
+	ret = try_grab_page(page, dio->cleanup_mode);		/* It is in dio */
+	if (ret < 0)
+		return ret;
+
 	sdio->cur_page = page;
 	sdio->cur_page_offset = offset;
 	sdio->cur_page_len = len;
@@ -853,7 +870,7 @@ submit_page_section(struct dio *dio, struct dio_submit *sdio, struct page *page,
 		ret = dio_send_cur_page(dio, sdio, map_bh);
 		if (sdio->bio)
 			dio_bio_submit(dio, sdio);
-		put_page(sdio->cur_page);
+		dio_cleanup_page(dio, sdio->cur_page);
 		sdio->cur_page = NULL;
 	}
 	return ret;
@@ -954,7 +971,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				ret = get_more_blocks(dio, sdio, map_bh);
 				if (ret) {
-					put_page(page);
+					dio_cleanup_page(dio, page);
 					goto out;
 				}
 				if (!buffer_mapped(map_bh))
@@ -999,7 +1016,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 
 				/* AKPM: eargh, -ENOTBLK is a hack */
 				if (dio_op == REQ_OP_WRITE) {
-					put_page(page);
+					dio_cleanup_page(dio, page);
 					return -ENOTBLK;
 				}
 
@@ -1012,7 +1029,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 				if (sdio->block_in_file >=
 						i_size_aligned >> blkbits) {
 					/* We hit eof */
-					put_page(page);
+					dio_cleanup_page(dio, page);
 					goto out;
 				}
 				zero_user(page, from, 1 << blkbits);
@@ -1052,7 +1069,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 						  sdio->next_block_for_io,
 						  map_bh);
 			if (ret) {
-				put_page(page);
+				dio_cleanup_page(dio, page);
 				goto out;
 			}
 			sdio->next_block_for_io += this_chunk_blocks;
@@ -1068,7 +1085,7 @@ static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
 		}
 
 		/* Drop the ref which was taken in get_user_pages() */
-		put_page(page);
+		dio_cleanup_page(dio, page);
 	}
 out:
 	return ret;
@@ -1288,7 +1305,7 @@ ssize_t __blockdev_direct_IO(struct kiocb *iocb, struct inode *inode,
 		ret2 = dio_send_cur_page(dio, &sdio, &map_bh);
 		if (retval == 0)
 			retval = ret2;
-		put_page(sdio.cur_page);
+		dio_cleanup_page(dio, sdio.cur_page);
 		sdio.cur_page = NULL;
 	}
 	if (sdio.bio)



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 19/34] fuse:  Pin pages rather than ref'ing if appropriate
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (17 preceding siblings ...)
  2023-01-16 23:10 ` [PATCH v6 18/34] dio: Pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-16 23:10 ` David Howells
  2023-01-16 23:10 ` [PATCH v6 20/34] vfs: Make splice use iov_iter_extract_pages() David Howells
                   ` (16 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Miklos Szeredi, Christoph Hellwig, linux-fsdevel, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Convert the fuse code to use iov_iter_extract_pages() instead of
iov_iter_get_pages().  This will pin pages or leave them unaltered rather
than getting a ref on them as appropriate to the iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Miklos Szeredi <miklos@szeredi.hu>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: linux-fsdevel@vger.kernel.org
---

 fs/fuse/dev.c    |   25 +++++++++++++++++++------
 fs/fuse/file.c   |   26 ++++++++++++++++++--------
 fs/fuse/fuse_i.h |    1 +
 3 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e3d8443e24a6..107497e68726 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -641,6 +641,7 @@ static int unlock_request(struct fuse_req *req)
 
 struct fuse_copy_state {
 	int write;
+	unsigned int cleanup_mode;	/* Page cleanup mode (0/FOLL_GET/PIN) */
 	struct fuse_req *req;
 	struct iov_iter *iter;
 	struct pipe_buffer *pipebufs;
@@ -661,6 +662,11 @@ static void fuse_copy_init(struct fuse_copy_state *cs, int write,
 	cs->iter = iter;
 }
 
+static void fuse_release_copy_page(struct fuse_copy_state *cs, struct page *page)
+{
+	page_put_unpin(page, cs->cleanup_mode);
+}
+
 /* Unmap and put previous page of userspace buffer */
 static void fuse_copy_finish(struct fuse_copy_state *cs)
 {
@@ -675,7 +681,7 @@ static void fuse_copy_finish(struct fuse_copy_state *cs)
 			flush_dcache_page(cs->pg);
 			set_page_dirty_lock(cs->pg);
 		}
-		put_page(cs->pg);
+		fuse_release_copy_page(cs, cs->pg);
 	}
 	cs->pg = NULL;
 }
@@ -704,6 +710,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 
 			BUG_ON(!cs->nr_segs);
 			cs->currbuf = buf;
+			cs->cleanup_mode = FOLL_GET;
 			cs->pg = buf->page;
 			cs->offset = buf->offset;
 			cs->len = buf->len;
@@ -722,6 +729,7 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 			buf->len = 0;
 
 			cs->currbuf = buf;
+			cs->cleanup_mode = FOLL_GET;
 			cs->pg = page;
 			cs->offset = 0;
 			cs->len = PAGE_SIZE;
@@ -729,15 +737,18 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
 			cs->nr_segs++;
 		}
 	} else {
+		unsigned int gup_flags = cs->write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
+		struct page **pages = &cs->pg;
 		size_t off;
-		err = iov_iter_get_pages(cs->iter, &page, PAGE_SIZE, 1, &off,
-					 cs->write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
+
+		err = iov_iter_extract_pages(cs->iter, &pages, PAGE_SIZE, 1,
+					     gup_flags, &off);
 		if (err < 0)
 			return err;
 		BUG_ON(!err);
 		cs->len = err;
 		cs->offset = off;
-		cs->pg = page;
+		cs->cleanup_mode = iov_iter_extract_mode(cs->iter, gup_flags);
 	}
 
 	return lock_request(cs->req);
@@ -899,10 +910,12 @@ static int fuse_ref_page(struct fuse_copy_state *cs, struct page *page,
 	if (cs->nr_segs >= cs->pipe->max_usage)
 		return -EIO;
 
-	get_page(page);
+	err = try_grab_page(page, cs->cleanup_mode);
+	if (err < 0)
+		return err;
 	err = unlock_request(cs->req);
 	if (err) {
-		put_page(page);
+		fuse_release_copy_page(cs, page);
 		return err;
 	}
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 68c196437306..c317300e757a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -624,6 +624,11 @@ void fuse_read_args_fill(struct fuse_io_args *ia, struct file *file, loff_t pos,
 	args->out_args[0].size = count;
 }
 
+static void fuse_release_page(struct fuse_args_pages *ap, struct page *page)
+{
+	page_put_unpin(page, ap->cleanup_mode);
+}
+
 static void fuse_release_user_pages(struct fuse_args_pages *ap,
 				    bool should_dirty)
 {
@@ -632,7 +637,7 @@ static void fuse_release_user_pages(struct fuse_args_pages *ap,
 	for (i = 0; i < ap->num_pages; i++) {
 		if (should_dirty)
 			set_page_dirty_lock(ap->pages[i]);
-		put_page(ap->pages[i]);
+		fuse_release_page(ap, ap->pages[i]);
 	}
 }
 
@@ -920,7 +925,7 @@ static void fuse_readpages_end(struct fuse_mount *fm, struct fuse_args *args,
 		else
 			SetPageError(page);
 		unlock_page(page);
-		put_page(page);
+		fuse_release_page(ap, page);
 	}
 	if (ia->ff)
 		fuse_file_put(ia->ff, false, false);
@@ -1153,7 +1158,7 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
 		}
 		if (ia->write.page_locked && (i == ap->num_pages - 1))
 			unlock_page(page);
-		put_page(page);
+		fuse_release_page(ap, page);
 	}
 
 	return err;
@@ -1172,6 +1177,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
 
 	ap->args.in_pages = true;
 	ap->descs[0].offset = offset;
+	ap->cleanup_mode = FOLL_GET;
 
 	do {
 		size_t tmp;
@@ -1200,7 +1206,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
 
 		if (!tmp) {
 			unlock_page(page);
-			put_page(page);
+			fuse_release_page(ap, page);
 			goto again;
 		}
 
@@ -1393,9 +1399,12 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 			       size_t *nbytesp, int write,
 			       unsigned int max_pages)
 {
+	unsigned int gup_flags = write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
 	size_t nbytes = 0;  /* # bytes already packed in req */
 	ssize_t ret = 0;
 
+	ap->cleanup_mode = iov_iter_extract_mode(ii, gup_flags);
+
 	/* Special case for kernel I/O: can copy directly into the buffer */
 	if (iov_iter_is_kvec(ii)) {
 		unsigned long user_addr = fuse_get_user_addr(ii);
@@ -1412,12 +1421,13 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
 	}
 
 	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
+		struct page **pages = &ap->pages[ap->num_pages];
 		unsigned npages;
 		size_t start;
-		ret = iov_iter_get_pages(ii, &ap->pages[ap->num_pages],
-					 *nbytesp - nbytes,
-					 max_pages - ap->num_pages,
-					 &start, write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
+		ret = iov_iter_extract_pages(ii, &pages,
+					     *nbytesp - nbytes,
+					     max_pages - ap->num_pages,
+					     gup_flags, &start);
 		if (ret < 0)
 			break;
 
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index c673faefdcb9..7b6be1dd7593 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -271,6 +271,7 @@ struct fuse_args_pages {
 	struct page **pages;
 	struct fuse_page_desc *descs;
 	unsigned int num_pages;
+	unsigned int cleanup_mode;
 };
 
 #define FUSE_ARGS(args) struct fuse_args args = {}



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 20/34] vfs: Make splice use iov_iter_extract_pages()
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (18 preceding siblings ...)
  2023-01-16 23:10 ` [PATCH v6 19/34] fuse: " David Howells
@ 2023-01-16 23:10 ` David Howells
  2023-01-19  2:31   ` Al Viro
  2023-01-16 23:10 ` [PATCH v6 21/34] 9p: Pin pages rather than ref'ing if appropriate David Howells
                   ` (15 subsequent siblings)
  35 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-16 23:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Matthew Wilcox, linux-fsdevel, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Make splice's iter_to_pipe() use iov_iter_extract_pages().  Splice requests
will be rejected if the cleanup mode is going to be anything other than
FOLL_GET (ie. cleanup by put_page()) since we're going to be attaching pages
from the iterator to a pipe and then returning to the caller, leaving the
spliced pages to their fates at some unknown time in the future.

Note that this will cause some requests to fail that might have worked
before - such as splicing from an XARRAY-type iterator, if there's any way
to do that - as extraction doesn't take refs or pins on non-user-backed
iterators.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
---

 fs/splice.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 19c5b5adc548..c3433266ba1b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1159,14 +1159,18 @@ static int iter_to_pipe(struct iov_iter *from,
 	size_t total = 0;
 	int ret = 0;
 
+	/* For the moment, all pages attached to a pipe must have refs, not pins. */
+	if (WARN_ON(iov_iter_extract_mode(from, FOLL_SOURCE_BUF) != FOLL_GET))
+		return -EIO;
+
 	while (iov_iter_count(from)) {
-		struct page *pages[16];
+		struct page *pages[16], **ppages = pages;
 		ssize_t left;
 		size_t start;
 		int i, n;
 
-		left = iov_iter_get_pages(from, pages, ~0UL, 16, &start,
-					  FOLL_SOURCE_BUF);
+		left = iov_iter_extract_pages(from, &ppages, ~0UL, 16,
+					      FOLL_SOURCE_BUF, &start);
 		if (left <= 0) {
 			ret = left;
 			break;



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 21/34] 9p: Pin pages rather than ref'ing if appropriate
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (19 preceding siblings ...)
  2023-01-16 23:10 ` [PATCH v6 20/34] vfs: Make splice use iov_iter_extract_pages() David Howells
@ 2023-01-16 23:10 ` David Howells
  2023-01-19  2:52   ` Al Viro
  2023-01-19 16:44   ` David Howells
  2023-01-16 23:10 ` [PATCH v6 22/34] nfs: " David Howells
                   ` (14 subsequent siblings)
  35 siblings, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Dominique Martinet, Eric Van Hensbergen, Latchesar Ionkov,
	Christian Schoenebeck, v9fs-developer, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Convert the 9p filesystem to use iov_iter_extract_pages() instead of
iov_iter_get_pages().  This will pin pages or leave them unaltered rather
than getting a ref on them as appropriate to the iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: v9fs-developer@lists.sourceforge.net
---

 net/9p/trans_common.c |    6 ++-
 net/9p/trans_common.h |    3 +-
 net/9p/trans_virtio.c |   89 ++++++++++++++-----------------------------------
 3 files changed, 31 insertions(+), 67 deletions(-)

diff --git a/net/9p/trans_common.c b/net/9p/trans_common.c
index c827f694551c..31d133412677 100644
--- a/net/9p/trans_common.c
+++ b/net/9p/trans_common.c
@@ -12,13 +12,15 @@
  * p9_release_pages - Release pages after the transaction.
  * @pages: array of pages to be put
  * @nr_pages: size of array
+ * @cleanup_mode: How to clean up the pages.
  */
-void p9_release_pages(struct page **pages, int nr_pages)
+void p9_release_pages(struct page **pages, int nr_pages,
+		      unsigned int cleanup_mode)
 {
 	int i;
 
 	for (i = 0; i < nr_pages; i++)
 		if (pages[i])
-			put_page(pages[i]);
+			page_put_unpin(pages[i], cleanup_mode);
 }
 EXPORT_SYMBOL(p9_release_pages);
diff --git a/net/9p/trans_common.h b/net/9p/trans_common.h
index 32134db6abf3..9b20eb4f2359 100644
--- a/net/9p/trans_common.h
+++ b/net/9p/trans_common.h
@@ -4,4 +4,5 @@
  * Author Venkateswararao Jujjuri <jvrao@linux.vnet.ibm.com>
  */
 
-void p9_release_pages(struct page **pages, int nr_pages);
+void p9_release_pages(struct page **pages, int nr_pages,
+		      unsigned int cleanup_mode);
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index eb28b54fe5f6..561f7cbd79da 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -310,73 +310,34 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
 			       struct iov_iter *data,
 			       int count,
 			       size_t *offs,
-			       int *need_drop,
+			       int *cleanup_mode,
 			       unsigned int gup_flags)
 {
 	int nr_pages;
 	int err;
+	int n;
 
 	if (!iov_iter_count(data))
 		return 0;
 
-	if (!iov_iter_is_kvec(data)) {
-		int n;
-		/*
-		 * We allow only p9_max_pages pinned. We wait for the
-		 * Other zc request to finish here
-		 */
-		if (atomic_read(&vp_pinned) >= chan->p9_max_pages) {
-			err = wait_event_killable(vp_wq,
-			      (atomic_read(&vp_pinned) < chan->p9_max_pages));
-			if (err == -ERESTARTSYS)
-				return err;
-		}
-		n = iov_iter_get_pages_alloc(data, pages, count, offs,
-					     gup_flags);
-		if (n < 0)
-			return n;
-		*need_drop = 1;
-		nr_pages = DIV_ROUND_UP(n + *offs, PAGE_SIZE);
-		atomic_add(nr_pages, &vp_pinned);
-		return n;
-	} else {
-		/* kernel buffer, no need to pin pages */
-		int index;
-		size_t len;
-		void *p;
-
-		/* we'd already checked that it's non-empty */
-		while (1) {
-			len = iov_iter_single_seg_count(data);
-			if (likely(len)) {
-				p = data->kvec->iov_base + data->iov_offset;
-				break;
-			}
-			iov_iter_advance(data, 0);
-		}
-		if (len > count)
-			len = count;
-
-		nr_pages = DIV_ROUND_UP((unsigned long)p + len, PAGE_SIZE) -
-			   (unsigned long)p / PAGE_SIZE;
-
-		*pages = kmalloc_array(nr_pages, sizeof(struct page *),
-				       GFP_NOFS);
-		if (!*pages)
-			return -ENOMEM;
-
-		*need_drop = 0;
-		p -= (*offs = offset_in_page(p));
-		for (index = 0; index < nr_pages; index++) {
-			if (is_vmalloc_addr(p))
-				(*pages)[index] = vmalloc_to_page(p);
-			else
-				(*pages)[index] = kmap_to_page(p);
-			p += PAGE_SIZE;
-		}
-		iov_iter_advance(data, len);
-		return len;
+	/*
+	 * We allow only p9_max_pages pinned. We wait for the
+	 * Other zc request to finish here
+	 */
+	if (atomic_read(&vp_pinned) >= chan->p9_max_pages) {
+		err = wait_event_killable(vp_wq,
+					  (atomic_read(&vp_pinned) < chan->p9_max_pages));
+		if (err == -ERESTARTSYS)
+			return err;
 	}
+
+	n = iov_iter_extract_pages(data, pages, count, INT_MAX, gup_flags, offs);
+	if (n < 0)
+		return n;
+	*cleanup_mode = iov_iter_extract_mode(data, gup_flags);
+	nr_pages = DIV_ROUND_UP(n + *offs, PAGE_SIZE);
+	atomic_add(nr_pages, &vp_pinned);
+	return n;
 }
 
 static void handle_rerror(struct p9_req_t *req, int in_hdr_len,
@@ -431,7 +392,7 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	struct virtio_chan *chan = client->trans;
 	struct scatterlist *sgs[4];
 	size_t offs;
-	int need_drop = 0;
+	int cleanup_mode = 0;
 	int kicked = 0;
 
 	p9_debug(P9_DEBUG_TRANS, "virtio request\n");
@@ -439,7 +400,7 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	if (uodata) {
 		__le32 sz;
 		int n = p9_get_mapped_pages(chan, &out_pages, uodata,
-					    outlen, &offs, &need_drop,
+					    outlen, &offs, &cleanup_mode,
 					    FOLL_DEST_BUF);
 		if (n < 0) {
 			err = n;
@@ -459,7 +420,7 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 		memcpy(&req->tc.sdata[0], &sz, sizeof(sz));
 	} else if (uidata) {
 		int n = p9_get_mapped_pages(chan, &in_pages, uidata,
-					    inlen, &offs, &need_drop,
+					    inlen, &offs, &cleanup_mode,
 					    FOLL_SOURCE_BUF);
 		if (n < 0) {
 			err = n;
@@ -546,14 +507,14 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
 	 * Non kernel buffers are pinned, unpin them
 	 */
 err_out:
-	if (need_drop) {
+	if (cleanup_mode) {
 		if (in_pages) {
-			p9_release_pages(in_pages, in_nr_pages);
+			p9_release_pages(in_pages, in_nr_pages, cleanup_mode);
 			atomic_sub(in_nr_pages, &vp_pinned);
 		}
 		if (out_pages) {
-			p9_release_pages(out_pages, out_nr_pages);
+			p9_release_pages(out_pages, out_nr_pages, cleanup_mode);
 			atomic_sub(out_nr_pages, &vp_pinned);
 		}
 		/* wakeup anybody waiting for slots to pin pages */
 		wake_up(&vp_wq);



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 22/34] nfs: Pin pages rather than ref'ing if appropriate
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (20 preceding siblings ...)
  2023-01-16 23:10 ` [PATCH v6 21/34] 9p: Pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-16 23:10 ` David Howells
  2023-01-16 23:10 ` [PATCH v6 23/34] cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE David Howells
                   ` (13 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Trond Myklebust, Anna Schumaker, Jeff Layton, linux-nfs,
	dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Convert the NFS direct I/O code to use iov_iter_extract_pages() instead of
iov_iter_get_pages().  This will pin pages or leave them unaltered rather
than getting a ref on them as appropriate to the iterator.

The pages need to be pinned for DIO-read rather than having refs taken on
them to prevent VM copy-on-write from malfunctioning during a concurrent
fork() (the result of the I/O would otherwise end up only visible to the
child process and not the parent).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-nfs@vger.kernel.org
---

 fs/nfs/direct.c |   32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 42af84685f20..4a3108db2cb6 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -142,11 +142,15 @@ int nfs_swap_rw(struct kiocb *iocb, struct iov_iter *iter)
 	return 0;
 }
 
-static void nfs_direct_release_pages(struct page **pages, unsigned int npages)
+static void nfs_direct_release_pages(struct page **pages, unsigned int npages,
+				     unsigned int cleanup_mode)
 {
 	unsigned int i;
-	for (i = 0; i < npages; i++)
-		put_page(pages[i]);
+
+	if (cleanup_mode) {
+		for (i = 0; i < npages; i++)
+			page_put_unpin(pages[i], cleanup_mode);
+	}
 }
 
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
@@ -327,17 +331,16 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 	inode_dio_begin(inode);
 
 	while (iov_iter_count(iter)) {
-		struct page **pagevec;
+		struct page **pagevec = NULL;
 		size_t bytes;
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec,
-						  rsize, &pgbase,
-						  FOLL_DEST_BUF);
+		result = iov_iter_extract_pages(iter, &pagevec, rsize, INT_MAX,
+						FOLL_DEST_BUF, &pgbase);
 		if (result < 0)
 			break;
-	
+
 		bytes = result;
 		npages = (result + pgbase + PAGE_SIZE - 1) / PAGE_SIZE;
 		for (i = 0; i < npages; i++) {
@@ -363,7 +366,8 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		nfs_direct_release_pages(pagevec, npages,
+					 iov_iter_extract_mode(iter, FOLL_DEST_BUF));
 		kvfree(pagevec);
 		if (result < 0)
 			break;
@@ -787,14 +791,13 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 
 	NFS_I(inode)->write_io += iov_iter_count(iter);
 	while (iov_iter_count(iter)) {
-		struct page **pagevec;
+		struct page **pagevec = NULL;
 		size_t bytes;
 		size_t pgbase;
 		unsigned npages, i;
 
-		result = iov_iter_get_pages_alloc(iter, &pagevec,
-						  wsize, &pgbase,
-						  FOLL_SOURCE_BUF);
+		result = iov_iter_extract_pages(iter, &pagevec, wsize, INT_MAX,
+						FOLL_SOURCE_BUF, &pgbase);
 		if (result < 0)
 			break;
 
@@ -831,7 +834,8 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
 			pos += req_len;
 			dreq->bytes_left -= req_len;
 		}
-		nfs_direct_release_pages(pagevec, npages);
+		nfs_direct_release_pages(pagevec, npages,
+					 iov_iter_extract_mode(iter, FOLL_SOURCE_BUF));
 		kvfree(pagevec);
 		if (result < 0)
 			break;



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 23/34] cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (21 preceding siblings ...)
  2023-01-16 23:10 ` [PATCH v6 22/34] nfs: " David Howells
@ 2023-01-16 23:10 ` David Howells
  2023-01-16 23:10 ` [PATCH v6 24/34] cifs: Add a function to build an RDMA SGE list from an iterator David Howells
                   ` (12 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Jeff Layton,
	linux-cifs, dhowells, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

Provide cifs_splice_read() to use a bvec rather than a pipe iterator as
the latter cannot so easily be split and advanced, which is necessary to
pass an iterator down to the bottom levels.  Upstream cifs gets around this
problem by using iov_iter_get_pages() to prefill the pipe and then passing
the list of pages down.

This is done by:

 (1) Bulk-allocate a bunch of pages to carry as much of the requested
     amount of data as possible, but without overrunning the available
     slots in the pipe and add them to an ITER_BVEC.

 (2) Synchronously call ->read_iter() to read into the buffer.

 (3) Discard any unused pages.

 (4) Load the remaining pages into the pipe in order and advance the head
     pointer.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/166732028113.3186319.1793644937097301358.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/cifsfs.c |   12 ++++---
 fs/cifs/cifsfs.h |    3 ++
 fs/cifs/file.c   |   92 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/splice.c      |    1 +
 4 files changed, 102 insertions(+), 6 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 10e00c624922..3c57e8b11692 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1358,7 +1358,7 @@ const struct file_operations cifs_file_ops = {
 	.fsync = cifs_fsync,
 	.flush = cifs_flush,
 	.mmap  = cifs_file_mmap,
-	.splice_read = generic_file_splice_read,
+	.splice_read = cifs_splice_read,
 	.splice_write = iter_file_splice_write,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
@@ -1378,7 +1378,7 @@ const struct file_operations cifs_file_strict_ops = {
 	.fsync = cifs_strict_fsync,
 	.flush = cifs_flush,
 	.mmap = cifs_file_strict_mmap,
-	.splice_read = generic_file_splice_read,
+	.splice_read = cifs_splice_read,
 	.splice_write = iter_file_splice_write,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
@@ -1398,7 +1398,7 @@ const struct file_operations cifs_file_direct_ops = {
 	.fsync = cifs_fsync,
 	.flush = cifs_flush,
 	.mmap = cifs_file_mmap,
-	.splice_read = generic_file_splice_read,
+	.splice_read = cifs_splice_read,
 	.splice_write = iter_file_splice_write,
 	.unlocked_ioctl  = cifs_ioctl,
 	.copy_file_range = cifs_copy_file_range,
@@ -1416,7 +1416,7 @@ const struct file_operations cifs_file_nobrl_ops = {
 	.fsync = cifs_fsync,
 	.flush = cifs_flush,
 	.mmap  = cifs_file_mmap,
-	.splice_read = generic_file_splice_read,
+	.splice_read = cifs_splice_read,
 	.splice_write = iter_file_splice_write,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
@@ -1434,7 +1434,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = {
 	.fsync = cifs_strict_fsync,
 	.flush = cifs_flush,
 	.mmap = cifs_file_strict_mmap,
-	.splice_read = generic_file_splice_read,
+	.splice_read = cifs_splice_read,
 	.splice_write = iter_file_splice_write,
 	.llseek = cifs_llseek,
 	.unlocked_ioctl	= cifs_ioctl,
@@ -1452,7 +1452,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = {
 	.fsync = cifs_fsync,
 	.flush = cifs_flush,
 	.mmap = cifs_file_mmap,
-	.splice_read = generic_file_splice_read,
+	.splice_read = cifs_splice_read,
 	.splice_write = iter_file_splice_write,
 	.unlocked_ioctl  = cifs_ioctl,
 	.copy_file_range = cifs_copy_file_range,
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 63a0ac2b9355..25decebbc478 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -100,6 +100,9 @@ extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
 extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
 extern ssize_t cifs_direct_writev(struct kiocb *iocb, struct iov_iter *from);
 extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
+extern ssize_t cifs_splice_read(struct file *in, loff_t *ppos,
+				struct pipe_inode_info *pipe, size_t len,
+				unsigned int flags);
 extern int cifs_flock(struct file *pfile, int cmd, struct file_lock *plock);
 extern int cifs_lock(struct file *, int, struct file_lock *);
 extern int cifs_fsync(struct file *, loff_t, loff_t, int);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index d100b9cb8682..f1297386a185 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -5273,3 +5273,95 @@ const struct address_space_operations cifs_addr_ops_smallbuf = {
 	.launder_folio = cifs_launder_folio,
 	.migrate_folio = filemap_migrate_folio,
 };
+
+/*
+ * Splice data from a file into a pipe.
+ */
+ssize_t cifs_splice_read(struct file *file, loff_t *ppos,
+			 struct pipe_inode_info *pipe, size_t len,
+			 unsigned int flags)
+{
+	LIST_HEAD(pages);
+	struct iov_iter to;
+	struct bio_vec *bv;
+	struct kiocb kiocb;
+	struct page *page;
+	unsigned int head;
+	ssize_t ret;
+	size_t used, npages, chunk, remain, reclaim;
+	int i;
+
+	/* Work out how much data we can actually add into the pipe */
+	used = pipe_occupancy(pipe->head, pipe->tail);
+	npages = max_t(ssize_t, pipe->max_usage - used, 0);
+	len = min_t(size_t, len, npages * PAGE_SIZE);
+	npages = DIV_ROUND_UP(len, PAGE_SIZE);
+
+	bv = kmalloc(array_size(npages, sizeof(bv[0])), GFP_KERNEL);
+	if (!bv)
+		return -ENOMEM;
+
+	npages = alloc_pages_bulk_list(GFP_USER, npages, &pages);
+	if (!npages) {
+		kfree(bv);
+		return -ENOMEM;
+	}
+
+	remain = len = min_t(size_t, len, npages * PAGE_SIZE);
+
+	for (i = 0; i < npages; i++) {
+		chunk = min_t(size_t, PAGE_SIZE, remain);
+		page = list_first_entry(&pages, struct page, lru);
+		list_del_init(&page->lru);
+		bv[i].bv_page = page;
+		bv[i].bv_offset = 0;
+		bv[i].bv_len = chunk;
+		remain -= chunk;
+	}
+
+	/* Do the I/O */
+	iov_iter_bvec(&to, ITER_DEST, bv, npages, len);
+	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = *ppos;
+	ret = call_read_iter(file, &kiocb, &to);
+
+	reclaim = npages * PAGE_SIZE;
+	remain = 0;
+	if (ret > 0) {
+		reclaim -= ret;
+		remain = ret;
+		*ppos = kiocb.ki_pos;
+		file_accessed(file);
+	} else if (ret < 0) {
+		/*
+		 * callers of ->splice_read() expect -EAGAIN on
+		 * "can't put anything in there", rather than -EFAULT.
+		 */
+		if (ret == -EFAULT)
+			ret = -EAGAIN;
+	}
+
+	/* Free any pages that didn't get touched at all. */
+	for (; reclaim >= PAGE_SIZE; reclaim -= PAGE_SIZE)
+		__free_page(bv[--npages].bv_page);
+
+	/* Push the remaining pages into the pipe. */
+	head = pipe->head;
+	for (i = 0; i < npages; i++) {
+		struct pipe_buffer *buf = &pipe->bufs[head & (pipe->ring_size - 1)];
+
+		chunk = min_t(size_t, remain, PAGE_SIZE);
+		*buf = (struct pipe_buffer) {
+			.ops	= &default_pipe_buf_ops,
+			.page	= bv[i].bv_page,
+			.offset	= 0,
+			.len	= chunk,
+		};
+		head++;
+		remain -= chunk;
+	}
+	pipe->head = head;
+
+	kfree(bv);
+	return ret;
+}
diff --git a/fs/splice.c b/fs/splice.c
index c3433266ba1b..1245ffb64414 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -330,6 +330,7 @@ const struct pipe_buf_operations default_pipe_buf_ops = {
 	.try_steal	= generic_pipe_buf_try_steal,
 	.get		= generic_pipe_buf_get,
 };
+EXPORT_SYMBOL(default_pipe_buf_ops);
 
 /* Pipe buffer operations for a socket and similar. */
 const struct pipe_buf_operations nosteal_pipe_buf_ops = {



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 24/34] cifs: Add a function to build an RDMA SGE list from an iterator
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (22 preceding siblings ...)
  2023-01-16 23:10 ` [PATCH v6 23/34] cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE David Howells
@ 2023-01-16 23:10 ` David Howells
  2023-01-16 23:11 ` [PATCH v6 25/34] cifs: Add a function to hash the contents of " David Howells
                   ` (11 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Tom Talpey,
	Jeff Layton, linux-cifs, linux-fsdevel, linux-rdma, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Add a function to add elements onto an RDMA SGE list representing page
fragments extracted from a BVEC-, KVEC- or XARRAY-type iterator and DMA
mapped until the maximum number of elements is reached.

Nothing is done to make sure the pages remain present - that must be done
by the caller.
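
A caller is expected to fill in a struct smb_extract_to_rdma and then feed
it the iterator.  Very roughly, and purely as a sketch (the SGE array size,
the error handling and the way the smbd_connection fields are used here are
illustrative, not taken from the patch):

static int example_build_sges(struct smbd_connection *info,
			      struct iov_iter *iter, size_t len)
{
	struct ib_sge sges[16];
	struct smb_extract_to_rdma rdma = {
		.sge		= sges,
		.nr_sge		= 0,
		.max_sge	= ARRAY_SIZE(sges),
		.device		= info->id->device,
		.local_dma_lkey	= info->pd->local_dma_lkey,
		.direction	= DMA_TO_DEVICE,	/* Sending from the buffer */
	};
	ssize_t n;

	n = smb_extract_iter_to_rdma(iter, len, &rdma);
	if (n < 0)
		return n;

	/* rdma.nr_sge elements are now DMA mapped and ready to post. */
	return 0;
}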

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-rdma@vger.kernel.org

Link: https://lore.kernel.org/r/166697256704.61150.17388516338310645808.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732028840.3186319.8512284239779728860.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/smbdirect.c |  224 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 3e693ffd0662..78a76752fafd 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -44,6 +44,17 @@ static int smbd_post_send_page(struct smbd_connection *info,
 static void destroy_mr_list(struct smbd_connection *info);
 static int allocate_mr_list(struct smbd_connection *info);
 
+struct smb_extract_to_rdma {
+	struct ib_sge		*sge;
+	unsigned int		nr_sge;
+	unsigned int		max_sge;
+	struct ib_device	*device;
+	u32			local_dma_lkey;
+	enum dma_data_direction	direction;
+};
+static ssize_t smb_extract_iter_to_rdma(struct iov_iter *iter, size_t len,
+					struct smb_extract_to_rdma *rdma);
+
 /* SMBD version number */
 #define SMBD_V1	0x0100
 
@@ -2480,3 +2491,216 @@ int smbd_deregister_mr(struct smbd_mr *smbdirect_mr)
 
 	return rc;
 }
+
+static bool smb_set_sge(struct smb_extract_to_rdma *rdma,
+			struct page *lowest_page, size_t off, size_t len)
+{
+	struct ib_sge *sge = &rdma->sge[rdma->nr_sge];
+	u64 addr;
+
+	addr = ib_dma_map_page(rdma->device, lowest_page,
+			       off, len, rdma->direction);
+	if (ib_dma_mapping_error(rdma->device, addr))
+		return false;
+
+	sge->addr   = addr;
+	sge->length = len;
+	sge->lkey   = rdma->local_dma_lkey;
+	rdma->nr_sge++;
+	return true;
+}
+
+/*
+ * Extract page fragments from a BVEC-class iterator and add them to an RDMA
+ * element list.  The pages are not pinned.
+ */
+static ssize_t smb_extract_bvec_to_rdma(struct iov_iter *iter,
+					struct smb_extract_to_rdma *rdma,
+					ssize_t maxsize)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned long start = iter->iov_offset;
+	unsigned int i, sge_max = rdma->max_sge;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		size_t off, len;
+
+		len = bv[i].bv_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		len = min_t(size_t, maxsize, len - start);
+		off = bv[i].bv_offset + start;
+
+		if (!smb_set_sge(rdma, bv[i].bv_page, off, len))
+			return -EIO;
+		sge_max--;
+
+		ret += len;
+		maxsize -= len;
+		if (maxsize <= 0 || sge_max == 0)
+			break;
+		start = 0;
+	}
+
+	return ret;
+}
+
+/*
+ * Extract fragments from a KVEC-class iterator and add them to an RDMA list.
+ * This can deal with vmalloc'd buffers as well as kmalloc'd or static buffers.
+ * The pages are not pinned.
+ */
+static ssize_t smb_extract_kvec_to_rdma(struct iov_iter *iter,
+					struct smb_extract_to_rdma *rdma,
+					ssize_t maxsize)
+{
+	const struct kvec *kv = iter->kvec;
+	unsigned long start = iter->iov_offset;
+	unsigned int i, sge_max = rdma->max_sge;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		struct page *page;
+		unsigned long kaddr;
+		size_t off, len, seg;
+
+		len = kv[i].iov_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		kaddr = (unsigned long)kv[i].iov_base + start;
+		off = kaddr & ~PAGE_MASK;
+		len = min_t(size_t, maxsize, len - start);
+		kaddr &= PAGE_MASK;
+
+		maxsize -= len;
+		ret += len;
+		do {
+			seg = min_t(size_t, len, PAGE_SIZE - off);
+
+			if (is_vmalloc_or_module_addr((void *)kaddr))
+				page = vmalloc_to_page((void *)kaddr);
+			else
+				page = virt_to_page(kaddr);
+
+			if (!smb_set_sge(rdma, page, off, seg))
+				return -EIO;
+			sge_max--;
+
+			len -= seg;
+			kaddr += PAGE_SIZE;
+			off = 0;
+		} while (len > 0 && sge_max > 0);
+
+		if (maxsize <= 0 || sge_max == 0)
+			break;
+		start = 0;
+	}
+
+	return ret;
+}
+
+/*
+ * Extract folio fragments from an XARRAY-class iterator and add them to an
+ * RDMA list.  The folios are not pinned.
+ */
+static ssize_t smb_extract_xarray_to_rdma(struct iov_iter *iter,
+					  struct smb_extract_to_rdma *rdma,
+					  ssize_t maxsize)
+{
+	struct xarray *xa = iter->xarray;
+	struct folio *folio;
+	unsigned int sge_max = rdma->max_sge;
+	loff_t start = iter->xarray_start + iter->iov_offset;
+	pgoff_t index = start / PAGE_SIZE;
+	ssize_t ret = 0;
+	size_t off, len;
+	XA_STATE(xas, xa, index);
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, folio, ULONG_MAX) {
+		if (xas_retry(&xas, folio))
+			continue;
+		if (WARN_ON(xa_is_value(folio)))
+			break;
+		if (WARN_ON(folio_test_hugetlb(folio)))
+			break;
+
+		off = offset_in_folio(folio, start);
+		len = min_t(size_t, maxsize, folio_size(folio) - off);
+
+		if (!smb_set_sge(rdma, folio_page(folio, 0), off, len)) {
+			rcu_read_unlock();
+			return -EIO;
+		}
+		sge_max--;
+
+		maxsize -= len;
+		ret += len;
+		if (maxsize <= 0 || sge_max == 0)
+			break;
+	}
+
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Extract page fragments from up to the given amount of the source iterator
+ * and build up an RDMA list that refers to all of those bits.  The RDMA list
+ * is appended to, up to the maximum number of elements set in the parameter
+ * block.
+ *
+ * The extracted page fragments are not pinned or ref'd in any way; if an
+ * IOVEC/UBUF-type iterator is to be used, it should be converted to a
+ * BVEC-type iterator and the pages pinned, ref'd or otherwise held in some
+ * way.
+ */
+static ssize_t smb_extract_iter_to_rdma(struct iov_iter *iter, size_t len,
+					struct smb_extract_to_rdma *rdma)
+{
+	ssize_t ret;
+	int before = rdma->nr_sge;
+
+	if (iov_iter_is_discard(iter) ||
+	    iov_iter_is_pipe(iter) ||
+	    user_backed_iter(iter)) {
+		WARN_ON_ONCE(1);
+		return -EIO;
+	}
+
+	switch (iov_iter_type(iter)) {
+	case ITER_BVEC:
+		ret = smb_extract_bvec_to_rdma(iter, rdma, len);
+		break;
+	case ITER_KVEC:
+		ret = smb_extract_kvec_to_rdma(iter, rdma, len);
+		break;
+	case ITER_XARRAY:
+		ret = smb_extract_xarray_to_rdma(iter, rdma, len);
+		break;
+	default:
+		BUG();
+	}
+
+	if (ret > 0) {
+		iov_iter_advance(iter, ret);
+	} else if (ret < 0) {
+		while (rdma->nr_sge > before) {
+			struct ib_sge *sge = &rdma->sge[--rdma->nr_sge];
+
+			ib_dma_unmap_single(rdma->device, sge->addr, sge->length,
+					    rdma->direction);
+			sge->addr = 0;
+		}
+	}
+
+	return ret;
+}



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 25/34] cifs: Add a function to hash the contents of an iterator
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (23 preceding siblings ...)
  2023-01-16 23:10 ` [PATCH v6 24/34] cifs: Add a function to build an RDMA SGE list from an iterator David Howells
@ 2023-01-16 23:11 ` David Howells
  2023-01-16 23:11 ` [PATCH v6 26/34] cifs: Add some helper functions David Howells
                   ` (10 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Jeff Layton,
	linux-cifs, linux-fsdevel, linux-crypto, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Add a function to push the contents of a BVEC-, KVEC- or XARRAY-type
iterator into a symmetric hash algorithm.

UBUF- and IOVEC-type iterators are not supported on the assumption that
either we're doing buffered I/O, in which case we won't see them, or we're
doing direct I/O, in which case the iterator will have been extracted into
a BVEC-type iterator higher up.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-crypto@vger.kernel.org

Link: https://lore.kernel.org/r/166697257423.61150.12070648579830206483.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732029577.3186319.17162612653237909961.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/cifsencrypt.c |  144 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)

diff --git a/fs/cifs/cifsencrypt.c b/fs/cifs/cifsencrypt.c
index 5db73c0f792a..e13f26371540 100644
--- a/fs/cifs/cifsencrypt.c
+++ b/fs/cifs/cifsencrypt.c
@@ -24,6 +24,150 @@
 #include "../smbfs_common/arc4.h"
 #include <crypto/aead.h>
 
+/*
+ * Hash data from a BVEC-type iterator.
+ */
+static int cifs_shash_bvec(const struct iov_iter *iter, ssize_t maxsize,
+			   struct shash_desc *shash)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned long start = iter->iov_offset;
+	unsigned int i;
+	void *p;
+	int ret;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		size_t off, len;
+
+		len = bv[i].bv_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		len = min_t(size_t, maxsize, len - start);
+		off = bv[i].bv_offset + start;
+
+		p = kmap_local_page(bv[i].bv_page);
+		ret = crypto_shash_update(shash, p + off, len);
+		kunmap_local(p);
+		if (ret < 0)
+			return ret;
+
+		maxsize -= len;
+		if (maxsize <= 0)
+			break;
+		start = 0;
+	}
+
+	return 0;
+}
+
+/*
+ * Hash data from a KVEC-type iterator.
+ */
+static int cifs_shash_kvec(const struct iov_iter *iter, ssize_t maxsize,
+			   struct shash_desc *shash)
+{
+	const struct kvec *kv = iter->kvec;
+	unsigned long start = iter->iov_offset;
+	unsigned int i;
+	int ret;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		size_t len;
+
+		len = kv[i].iov_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		len = min_t(size_t, maxsize, len - start);
+		ret = crypto_shash_update(shash, kv[i].iov_base + start, len);
+		if (ret < 0)
+			return ret;
+		maxsize -= len;
+
+		if (maxsize <= 0)
+			break;
+		start = 0;
+	}
+
+	return 0;
+}
+
+/*
+ * Hash data from an XARRAY-type iterator.
+ */
+static ssize_t cifs_shash_xarray(const struct iov_iter *iter, ssize_t maxsize,
+				 struct shash_desc *shash)
+{
+	struct folio *folios[16], *folio;
+	unsigned int nr, i, j, npages;
+	loff_t start = iter->xarray_start + iter->iov_offset;
+	pgoff_t last, index = start / PAGE_SIZE;
+	ssize_t ret = 0;
+	size_t len, offset, foffset;
+	void *p;
+
+	if (maxsize == 0)
+		return 0;
+
+	last = (start + maxsize - 1) / PAGE_SIZE;
+	do {
+		nr = xa_extract(iter->xarray, (void **)folios, index, last,
+				ARRAY_SIZE(folios), XA_PRESENT);
+		if (nr == 0)
+			return -EIO;
+
+		for (i = 0; i < nr; i++) {
+			folio = folios[i];
+			npages = folio_nr_pages(folio);
+			foffset = start - folio_pos(folio);
+			offset = foffset % PAGE_SIZE;
+			for (j = foffset / PAGE_SIZE; j < npages; j++) {
+				len = min_t(size_t, maxsize, PAGE_SIZE - offset);
+				p = kmap_local_page(folio_page(folio, j));
+				ret = crypto_shash_update(shash, p, len);
+				kunmap_local(p);
+				if (ret < 0)
+					return ret;
+				maxsize -= len;
+				if (maxsize <= 0)
+					return 0;
+				start += len;
+				offset = 0;
+				index++;
+			}
+		}
+	} while (nr == ARRAY_SIZE(folios));
+	return 0;
+}
+
+/*
+ * Pass the data from an iterator into a hash.
+ */
+static int cifs_shash_iter(const struct iov_iter *iter, size_t maxsize,
+			   struct shash_desc *shash)
+{
+	if (maxsize == 0)
+		return 0;
+
+	switch (iov_iter_type(iter)) {
+	case ITER_BVEC:
+		return cifs_shash_bvec(iter, maxsize, shash);
+	case ITER_KVEC:
+		return cifs_shash_kvec(iter, maxsize, shash);
+	case ITER_XARRAY:
+		return cifs_shash_xarray(iter, maxsize, shash);
+	default:
+		pr_err("cifs_shash_iter(%u) unsupported\n", iov_iter_type(iter));
+		WARN_ON_ONCE(1);
+		return -EIO;
+	}
+}
+
 int __cifs_calc_signature(struct smb_rqst *rqst,
 			struct TCP_Server_Info *server, char *signature,
 			struct shash_desc *shash)



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 26/34] cifs: Add some helper functions
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (24 preceding siblings ...)
  2023-01-16 23:11 ` [PATCH v6 25/34] cifs: Add a function to hash the contents of " David Howells
@ 2023-01-16 23:11 ` David Howells
  2023-01-16 23:11 ` [PATCH v6 27/34] cifs: Add a function to read into an iter from a socket David Howells
                   ` (9 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Jeff Layton,
	linux-cifs, dhowells, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

Add some helper functions to manipulate the folio marks by iterating
through a list of folios held in an xarray rather than using a page list.
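
For example, a write completion handler would be expected to use these
roughly as follows (a sketch only; the wdata and inode names are
illustrative, following the fields of struct cifs_writedata):

	if (wdata->result < 0)
		cifs_pages_write_failed(inode, wdata->offset, wdata->bytes);
	else
		cifs_pages_written_back(inode, wdata->offset, wdata->bytes);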

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/164928616583.457102.15157033997163988344.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211418840.3154751.3090684430628501879.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348878940.2106726.204291614267188735.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364825674.3334034.3356201708659748648.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126394799.708021.10637797063862600488.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697258147.61150.9940790486999562110.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732030314.3186319.9209944805565413627.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/cifsfs.h |    3 ++
 fs/cifs/file.c   |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 25decebbc478..ea628da503c6 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -113,6 +113,9 @@ extern int cifs_file_strict_mmap(struct file *file, struct vm_area_struct *vma);
 extern const struct file_operations cifs_dir_ops;
 extern int cifs_dir_open(struct inode *inode, struct file *file);
 extern int cifs_readdir(struct file *file, struct dir_context *ctx);
+extern void cifs_pages_written_back(struct inode *inode, loff_t start, unsigned int len);
+extern void cifs_pages_write_failed(struct inode *inode, loff_t start, unsigned int len);
+extern void cifs_pages_write_redirty(struct inode *inode, loff_t start, unsigned int len);
 
 /* Functions related to dir entries */
 extern const struct dentry_operations cifs_dentry_ops;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index f1297386a185..2873f28bf388 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -36,6 +36,99 @@
 #include "cifs_ioctl.h"
 #include "cached_dir.h"
 
+/*
+ * Completion of write to server.
+ */
+void cifs_pages_written_back(struct inode *inode, loff_t start, unsigned int len)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct folio *folio;
+	pgoff_t end;
+
+	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
+
+	if (!len)
+		return;
+
+	rcu_read_lock();
+
+	end = (start + len - 1) / PAGE_SIZE;
+	xas_for_each(&xas, folio, end) {
+		if (!folio_test_writeback(folio)) {
+			WARN_ONCE(1, "bad %x @%llx page %lx %lx\n",
+				  len, start, folio_index(folio), end);
+			continue;
+		}
+
+		folio_detach_private(folio);
+		folio_end_writeback(folio);
+	}
+
+	rcu_read_unlock();
+}
+
+/*
+ * Failure of write to server.
+ */
+void cifs_pages_write_failed(struct inode *inode, loff_t start, unsigned int len)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct folio *folio;
+	pgoff_t end;
+
+	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
+
+	if (!len)
+		return;
+
+	rcu_read_lock();
+
+	end = (start + len - 1) / PAGE_SIZE;
+	xas_for_each(&xas, folio, end) {
+		if (!folio_test_writeback(folio)) {
+			WARN_ONCE(1, "bad %x @%llx page %lx %lx\n",
+				  len, start, folio_index(folio), end);
+			continue;
+		}
+
+		folio_set_error(folio);
+		folio_end_writeback(folio);
+	}
+
+	rcu_read_unlock();
+}
+
+/*
+ * Redirty pages after a temporary failure.
+ */
+void cifs_pages_write_redirty(struct inode *inode, loff_t start, unsigned int len)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct folio *folio;
+	pgoff_t end;
+
+	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
+
+	if (!len)
+		return;
+
+	rcu_read_lock();
+
+	end = (start + len - 1) / PAGE_SIZE;
+	xas_for_each(&xas, folio, end) {
+		if (!folio_test_writeback(folio)) {
+			WARN_ONCE(1, "bad %x @%llx page %lx %lx\n",
+				  len, start, folio_index(folio), end);
+			continue;
+		}
+
+		filemap_dirty_folio(folio->mapping, folio);
+		folio_end_writeback(folio);
+	}
+
+	rcu_read_unlock();
+}
+
 /*
  * Mark as invalid, all open files on tree connections since they
  * were closed when session to server was lost.



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 27/34] cifs: Add a function to read into an iter from a socket
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (25 preceding siblings ...)
  2023-01-16 23:11 ` [PATCH v6 26/34] cifs: Add some helper functions David Howells
@ 2023-01-16 23:11 ` David Howells
  2023-01-16 23:11 ` [PATCH v6 28/34] cifs: Change the I/O paths to use an iterator rather than a page list David Howells
                   ` (8 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Jeff Layton,
	linux-cifs, dhowells, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

Add a helper function to read data from a socket into the given iterator.
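
A read path would then use it along these lines (a sketch only; "remaining"
and the rdata fields are illustrative, following struct cifs_readdata):

	ssize_t got;

	/* Pull the next chunk of the response straight into the iterator. */
	got = cifs_read_iter_from_socket(server, &rdata->iter, remaining);
	if (got < 0)
		return got;
	rdata->got_bytes += got;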

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/164928617874.457102.10021662143234315566.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211419563.3154751.18431990381145195050.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348879662.2106726.16881134187242702351.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364826398.3334034.12541600783145647319.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126395495.708021.12328677373159554478.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697258876.61150.3530237818849429372.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732031039.3186319.10691316510079412635.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/cifsproto.h |    3 +++
 fs/cifs/connect.c   |   16 ++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index 1207b39686fb..cb7a3fe89278 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -244,6 +244,9 @@ extern int cifs_read_page_from_socket(struct TCP_Server_Info *server,
 					struct page *page,
 					unsigned int page_offset,
 					unsigned int to_read);
+int cifs_read_iter_from_socket(struct TCP_Server_Info *server,
+			       struct iov_iter *iter,
+			       unsigned int to_read);
 extern int cifs_setup_cifs_sb(struct cifs_sb_info *cifs_sb);
 void cifs_mount_put_conns(struct cifs_mount_ctx *mnt_ctx);
 int cifs_mount_get_session(struct cifs_mount_ctx *mnt_ctx);
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index d371259d6808..68d6d74c2f4e 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -765,6 +765,22 @@ cifs_read_page_from_socket(struct TCP_Server_Info *server, struct page *page,
 	return cifs_readv_from_socket(server, &smb_msg);
 }
 
+int
+cifs_read_iter_from_socket(struct TCP_Server_Info *server, struct iov_iter *iter,
+			   unsigned int to_read)
+{
+	struct msghdr smb_msg;
+	int ret;
+
+	smb_msg.msg_iter = *iter;
+	if (smb_msg.msg_iter.count > to_read)
+		smb_msg.msg_iter.count = to_read;
+	ret = cifs_readv_from_socket(server, &smb_msg);
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	return ret;
+}
+
 static bool
 is_smb_response(struct TCP_Server_Info *server, unsigned char type)
 {



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 28/34] cifs: Change the I/O paths to use an iterator rather than a page list
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (26 preceding siblings ...)
  2023-01-16 23:11 ` [PATCH v6 27/34] cifs: Add a function to read into an iter from a socket David Howells
@ 2023-01-16 23:11 ` David Howells
  2023-01-16 23:11 ` [PATCH v6 29/34] cifs: Build the RDMA SGE list directly from an iterator David Howells
                   ` (7 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula,
	Paulo Alcantara, Jeff Layton, linux-cifs, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Currently, the cifs I/O paths hand lists of pages from the VM interface
routines at the top all the way through the intervening layers to the
socket interface at the bottom.

This is a problem, however, for interfacing with netfslib which passes an
iterator through to the ->issue_read() method (and will pass an iterator
through to the ->issue_write() method in future).  Netfslib takes over
bounce buffering for direct I/O, async I/O and encrypted content, so cifs
doesn't need to do that.  Netfslib also converts IOVEC-type iterators into
BVEC-type iterators if necessary.

Further, cifs needs foliating - and folios may come in a variety of sizes,
so a page list pointing to an array of heterogeneous pages may cause
problems in places such as where crypto is done.

Change the cifs I/O paths to hand iov_iter iterators all the way through
instead.

Notes:

 (1) Some old routines are #if'd out, to be removed in a follow-up patch, so
     as to avoid confusing the diff and thereby make the output easier to
     follow.  I've removed functions that don't overlap with anything
     added.

 (2) struct smb_rqst loses rq_pages, rq_offset, rq_npages, rq_pagesz and
     rq_tailsz which describe the pages forming the buffer; instead there's
     an rq_iter describing the source buffer and an rq_buffer which is used
     to hold the buffer for encryption.

 (3) struct cifs_readdata and cifs_writedata are modified in a similar way
     to smb_rqst.  The ->read_into_pages() and ->copy_into_pages() ops are
     then replaced by passing the iterator directly to the socket.

     The iterators are stored in these structs so that they are persistent
     and don't get deallocated when the function returns (unlike if they
     were stack variables).

 (4) Buffered writeback is overhauled, borrowing the code from the afs
     filesystem to gather up contiguous runs of folios.  The XARRAY-type
     iterator is then used to refer directly to the pagecache and can be
     passed to the socket to transmit data directly from there.

     This includes:

	cifs_extend_writeback()
	cifs_write_back_from_locked_folio()
	cifs_writepages_region()
	cifs_writepages()

 (5) Pages are converted to folios.

 (6) Direct I/O uses netfs_extract_user_iter() to create a BVEC-type
     iterator from an IOVEC/UBUF-type source iterator.

 (7) smb2_get_aead_req() uses netfs_extract_iter_to_sg() to extract page
     fragments from the iterator into the scatterlists that the crypto
     layer prefers.

 (8) smb2_init_transform_rq() attaches pages to smb_rqst::rq_buffer, an
     xarray, to use as a bounce buffer for encryption.  An XARRAY-type
     iterator can then be used to pass the bounce buffer to lower layers
     (see the sketch after this list).
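
As a rough illustration of (8), once the bounce pages have been attached to
rq_buffer, an XARRAY-type iterator can be aimed at it and handed down
towards the socket (a sketch only; "buf_len" is an assumed name for the
amount of data in the buffer):

	struct iov_iter iter;

	/* Describe the encrypted bounce buffer without copying it again. */
	iov_iter_xarray(&iter, ITER_SOURCE, &rqst->rq_buffer, 0, buf_len);
	/* ... hand &iter down to the transport ... */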

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Paulo Alcantara <pc@cjr.nz>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/164311907995.2806745.400147335497304099.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/164928620163.457102.11602306234438271112.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211420279.3154751.15923591172438186144.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348880385.2106726.3220789453472800240.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364827111.3334034.934805882842932881.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126396180.708021.271013668175370826.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697259595.61150.5982032408321852414.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732031756.3186319.12528413619888902872.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/Kconfig       |    1 
 fs/cifs/cifsencrypt.c |   28 -
 fs/cifs/cifsglob.h    |   66 +--
 fs/cifs/cifsproto.h   |    8 
 fs/cifs/cifssmb.c     |   13 -
 fs/cifs/file.c        | 1200 +++++++++++++++++++++++++++++++------------------
 fs/cifs/fscache.c     |   22 -
 fs/cifs/fscache.h     |   10 
 fs/cifs/misc.c        |  133 +----
 fs/cifs/smb2ops.c     |  371 +++++++--------
 fs/cifs/smb2pdu.c     |   45 +-
 fs/cifs/smbdirect.c   |  263 ++++-------
 fs/cifs/smbdirect.h   |    4 
 fs/cifs/transport.c   |   57 +-
 14 files changed, 1127 insertions(+), 1094 deletions(-)

diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
index 3b7e3b9e4fd2..1824e0a36f5a 100644
--- a/fs/cifs/Kconfig
+++ b/fs/cifs/Kconfig
@@ -18,6 +18,7 @@ config CIFS
 	select DNS_RESOLVER
 	select ASN1
 	select OID_REGISTRY
+	select NETFS_SUPPORT
 	help
 	  This is the client VFS module for the SMB3 family of NAS protocols,
 	  (including support for the most recent, most secure dialect SMB3.1.1)
diff --git a/fs/cifs/cifsencrypt.c b/fs/cifs/cifsencrypt.c
index e13f26371540..05fc6ec36c28 100644
--- a/fs/cifs/cifsencrypt.c
+++ b/fs/cifs/cifsencrypt.c
@@ -169,11 +169,11 @@ static int cifs_shash_iter(const struct iov_iter *iter, size_t maxsize,
 }
 
 int __cifs_calc_signature(struct smb_rqst *rqst,
-			struct TCP_Server_Info *server, char *signature,
-			struct shash_desc *shash)
+			  struct TCP_Server_Info *server, char *signature,
+			  struct shash_desc *shash)
 {
 	int i;
-	int rc;
+	ssize_t rc;
 	struct kvec *iov = rqst->rq_iov;
 	int n_vec = rqst->rq_nvec;
 
@@ -205,25 +205,9 @@ int __cifs_calc_signature(struct smb_rqst *rqst,
 		}
 	}
 
-	/* now hash over the rq_pages array */
-	for (i = 0; i < rqst->rq_npages; i++) {
-		void *kaddr;
-		unsigned int len, offset;
-
-		rqst_page_get_length(rqst, i, &len, &offset);
-
-		kaddr = (char *) kmap(rqst->rq_pages[i]) + offset;
-
-		rc = crypto_shash_update(shash, kaddr, len);
-		if (rc) {
-			cifs_dbg(VFS, "%s: Could not update with payload\n",
-				 __func__);
-			kunmap(rqst->rq_pages[i]);
-			return rc;
-		}
-
-		kunmap(rqst->rq_pages[i]);
-	}
+	rc = cifs_shash_iter(&rqst->rq_iter, iov_iter_count(&rqst->rq_iter), shash);
+	if (rc < 0)
+		return rc;
 
 	rc = crypto_shash_final(shash, signature);
 	if (rc)
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index cfdd5bf701a1..e4f8c0f68152 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -216,11 +216,8 @@ static inline void cifs_free_open_info(struct cifs_open_info_data *data)
 struct smb_rqst {
 	struct kvec	*rq_iov;	/* array of kvecs */
 	unsigned int	rq_nvec;	/* number of kvecs in array */
-	struct page	**rq_pages;	/* pointer to array of page ptrs */
-	unsigned int	rq_offset;	/* the offset to the 1st page */
-	unsigned int	rq_npages;	/* number pages in array */
-	unsigned int	rq_pagesz;	/* page size to use */
-	unsigned int	rq_tailsz;	/* length of last page */
+	struct iov_iter	rq_iter;	/* Data iterator */
+	struct xarray	rq_buffer;	/* Page buffer for encryption */
 };
 
 struct mid_q_entry;
@@ -1426,10 +1423,11 @@ struct cifs_aio_ctx {
 	struct cifsFileInfo	*cfile;
 	struct bio_vec		*bv;
 	loff_t			pos;
-	unsigned int		npages;
+	unsigned int		nr_pinned_pages;
 	ssize_t			rc;
 	unsigned int		len;
 	unsigned int		total_len;
+	unsigned int		bv_cleanup_mode;	/* How to clean up ->bv[] */
 	bool			should_dirty;
 	/*
 	 * Indicates if this aio_ctx is for direct_io,
@@ -1447,28 +1445,18 @@ struct cifs_readdata {
 	struct address_space		*mapping;
 	struct cifs_aio_ctx		*ctx;
 	__u64				offset;
+	ssize_t				got_bytes;
 	unsigned int			bytes;
-	unsigned int			got_bytes;
 	pid_t				pid;
 	int				result;
 	struct work_struct		work;
-	int (*read_into_pages)(struct TCP_Server_Info *server,
-				struct cifs_readdata *rdata,
-				unsigned int len);
-	int (*copy_into_pages)(struct TCP_Server_Info *server,
-				struct cifs_readdata *rdata,
-				struct iov_iter *iter);
+	struct iov_iter			iter;
 	struct kvec			iov[2];
 	struct TCP_Server_Info		*server;
 #ifdef CONFIG_CIFS_SMB_DIRECT
 	struct smbd_mr			*mr;
 #endif
-	unsigned int			pagesz;
-	unsigned int			page_offset;
-	unsigned int			tailsz;
 	struct cifs_credits		credits;
-	unsigned int			nr_pages;
-	struct page			**pages;
 };
 
 /* asynchronous write support */
@@ -1480,6 +1468,8 @@ struct cifs_writedata {
 	struct work_struct		work;
 	struct cifsFileInfo		*cfile;
 	struct cifs_aio_ctx		*ctx;
+	struct iov_iter			iter;
+	struct bio_vec			*bv;
 	__u64				offset;
 	pid_t				pid;
 	unsigned int			bytes;
@@ -1488,12 +1478,7 @@ struct cifs_writedata {
 #ifdef CONFIG_CIFS_SMB_DIRECT
 	struct smbd_mr			*mr;
 #endif
-	unsigned int			pagesz;
-	unsigned int			page_offset;
-	unsigned int			tailsz;
 	struct cifs_credits		credits;
-	unsigned int			nr_pages;
-	struct page			**pages;
 };
 
 /*
@@ -2153,9 +2138,9 @@ static inline void move_cifs_info_to_smb2(struct smb2_file_all_info *dst, const
 	dst->FileNameLength = src->FileNameLength;
 }
 
-static inline unsigned int cifs_get_num_sgs(const struct smb_rqst *rqst,
-					    int num_rqst,
-					    const u8 *sig)
+static inline int cifs_get_num_sgs(const struct smb_rqst *rqst,
+				   int num_rqst,
+				   const u8 *sig)
 {
 	unsigned int len, skip;
 	unsigned int nents = 0;
@@ -2169,6 +2154,20 @@ static inline unsigned int cifs_get_num_sgs(const struct smb_rqst *rqst,
 	 * rqst[1+].rq_iov[0+] data to be encrypted/decrypted
 	 */
 	for (i = 0; i < num_rqst; i++) {
+		/* We really don't want a mixture of pinned and unpinned pages
+		 * in the sglist.  It's hard to keep track of which is what.
+		 * Instead, we convert to a BVEC-type iterator higher up.
+		 */
+		if (WARN_ON_ONCE(user_backed_iter(&rqst[i].rq_iter)))
+			return -EIO;
+
+		/* We also don't want to have any extra refs or pins
+		 * to clean up in the sglist.
+		 */
+		if (WARN_ON_ONCE(iov_iter_extract_mode(&rqst[i].rq_iter,
+						       FOLL_DEST_BUF)))
+			return -EIO;
+
 		/*
 		 * The first rqst has a transform header where the
 		 * first 20 bytes are not part of the encrypted blob.
@@ -2186,7 +2185,7 @@ static inline unsigned int cifs_get_num_sgs(const struct smb_rqst *rqst,
 				nents++;
 			}
 		}
-		nents += rqst[i].rq_npages;
+		nents += iov_iter_npages(&rqst[i].rq_iter, INT_MAX);
 	}
 	nents += DIV_ROUND_UP(offset_in_page(sig) + SMB2_SIGNATURE_SIZE, PAGE_SIZE);
 	return nents;
@@ -2195,9 +2194,9 @@ static inline unsigned int cifs_get_num_sgs(const struct smb_rqst *rqst,
 /* We can not use the normal sg_set_buf() as we will sometimes pass a
  * stack object as buf.
  */
-static inline struct scatterlist *cifs_sg_set_buf(struct scatterlist *sg,
-						  const void *buf,
-						  unsigned int buflen)
+static inline void cifs_sg_set_buf(struct sg_table *sgtable,
+				   const void *buf,
+				   unsigned int buflen)
 {
 	unsigned long addr = (unsigned long)buf;
 	unsigned int off = offset_in_page(addr);
@@ -2207,16 +2206,17 @@ static inline struct scatterlist *cifs_sg_set_buf(struct scatterlist *sg,
 		do {
 			unsigned int len = min_t(unsigned int, buflen, PAGE_SIZE - off);
 
-			sg_set_page(sg++, vmalloc_to_page((void *)addr), len, off);
+			sg_set_page(&sgtable->sgl[sgtable->nents++],
+				    vmalloc_to_page((void *)addr), len, off);
 
 			off = 0;
 			addr += PAGE_SIZE;
 			buflen -= len;
 		} while (buflen);
 	} else {
-		sg_set_page(sg++, virt_to_page(addr), buflen, off);
+		sg_set_page(&sgtable->sgl[sgtable->nents++],
+			    virt_to_page(addr), buflen, off);
 	}
-	return sg;
 }
 
 #endif	/* _CIFS_GLOB_H */
diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index cb7a3fe89278..2873f68a051c 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -584,10 +584,7 @@ int cifs_readv_receive(struct TCP_Server_Info *server, struct mid_q_entry *mid);
 int cifs_async_writev(struct cifs_writedata *wdata,
 		      void (*release)(struct kref *kref));
 void cifs_writev_complete(struct work_struct *work);
-struct cifs_writedata *cifs_writedata_alloc(unsigned int nr_pages,
-						work_func_t complete);
-struct cifs_writedata *cifs_writedata_direct_alloc(struct page **pages,
-						work_func_t complete);
+struct cifs_writedata *cifs_writedata_alloc(work_func_t complete);
 void cifs_writedata_release(struct kref *refcount);
 int cifs_query_mf_symlink(unsigned int xid, struct cifs_tcon *tcon,
 			  struct cifs_sb_info *cifs_sb,
@@ -604,13 +601,10 @@ enum securityEnum cifs_select_sectype(struct TCP_Server_Info *,
 					enum securityEnum);
 struct cifs_aio_ctx *cifs_aio_ctx_alloc(void);
 void cifs_aio_ctx_release(struct kref *refcount);
-int setup_aio_ctx_iter(struct cifs_aio_ctx *ctx, struct iov_iter *iter, int rw);
 
 int cifs_alloc_hash(const char *name, struct shash_desc **sdesc);
 void cifs_free_hash(struct shash_desc **sdesc);
 
-void rqst_page_get_length(const struct smb_rqst *rqst, unsigned int page,
-			  unsigned int *len, unsigned int *offset);
 struct cifs_chan *
 cifs_ses_find_chan(struct cifs_ses *ses, struct TCP_Server_Info *server);
 int cifs_try_adding_channels(struct cifs_sb_info *cifs_sb, struct cifs_ses *ses);
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index 23f10e0d6e7e..878064370f46 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -24,6 +24,7 @@
 #include <linux/task_io_accounting_ops.h>
 #include <linux/uaccess.h>
 #include "cifspdu.h"
+#include "cifsfs.h"
 #include "cifsglob.h"
 #include "cifsacl.h"
 #include "cifsproto.h"
@@ -1294,11 +1295,7 @@ cifs_readv_callback(struct mid_q_entry *mid)
 	struct TCP_Server_Info *server = tcon->ses->server;
 	struct smb_rqst rqst = { .rq_iov = rdata->iov,
 				 .rq_nvec = 2,
-				 .rq_pages = rdata->pages,
-				 .rq_offset = rdata->page_offset,
-				 .rq_npages = rdata->nr_pages,
-				 .rq_pagesz = rdata->pagesz,
-				 .rq_tailsz = rdata->tailsz };
+				 .rq_iter = rdata->iter };
 	struct cifs_credits credits = { .value = 1, .instance = 0 };
 
 	cifs_dbg(FYI, "%s: mid=%llu state=%d result=%d bytes=%u\n",
@@ -1737,11 +1734,7 @@ cifs_async_writev(struct cifs_writedata *wdata,
 
 	rqst.rq_iov = iov;
 	rqst.rq_nvec = 2;
-	rqst.rq_pages = wdata->pages;
-	rqst.rq_offset = wdata->page_offset;
-	rqst.rq_npages = wdata->nr_pages;
-	rqst.rq_pagesz = wdata->pagesz;
-	rqst.rq_tailsz = wdata->tailsz;
+	rqst.rq_iter = wdata->iter;
 
 	cifs_dbg(FYI, "async write at %llu %u bytes\n",
 		 wdata->offset, wdata->bytes);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 2873f28bf388..cfa8ad8a59c4 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -36,6 +36,32 @@
 #include "cifs_ioctl.h"
 #include "cached_dir.h"
 
+/*
+ * Remove the dirty flags from a span of pages.
+ */
+static void cifs_undirty_folios(struct inode *inode, loff_t start, unsigned int len)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct folio *folio;
+	pgoff_t end;
+
+	XA_STATE(xas, &mapping->i_pages, start / PAGE_SIZE);
+
+	rcu_read_lock();
+
+	end = (start + len - 1) / PAGE_SIZE;
+	xas_for_each_marked(&xas, folio, end, PAGECACHE_TAG_DIRTY) {
+		xas_pause(&xas);
+		rcu_read_unlock();
+		folio_lock(folio);
+		folio_clear_dirty_for_io(folio);
+		folio_unlock(folio);
+		rcu_read_lock();
+	}
+
+	rcu_read_unlock();
+}
+
 /*
  * Completion of write to server.
  */
@@ -2388,7 +2414,6 @@ cifs_writedata_release(struct kref *refcount)
 	if (wdata->cfile)
 		cifsFileInfo_put(wdata->cfile);
 
-	kvfree(wdata->pages);
 	kfree(wdata);
 }
 
@@ -2399,51 +2424,49 @@ cifs_writedata_release(struct kref *refcount)
 static void
 cifs_writev_requeue(struct cifs_writedata *wdata)
 {
-	int i, rc = 0;
+	int rc = 0;
 	struct inode *inode = d_inode(wdata->cfile->dentry);
 	struct TCP_Server_Info *server;
-	unsigned int rest_len;
+	unsigned int rest_len = wdata->bytes;
+	loff_t fpos = wdata->offset;
 
 	server = tlink_tcon(wdata->cfile->tlink)->ses->server;
-	i = 0;
-	rest_len = wdata->bytes;
 	do {
 		struct cifs_writedata *wdata2;
-		unsigned int j, nr_pages, wsize, tailsz, cur_len;
+		unsigned int wsize, cur_len;
 
 		wsize = server->ops->wp_retry_size(inode);
 		if (wsize < rest_len) {
-			nr_pages = wsize / PAGE_SIZE;
-			if (!nr_pages) {
-				rc = -EOPNOTSUPP;
+			if (wsize < PAGE_SIZE) {
+				rc = -EOPNOTSUPP;
 				break;
 			}
-			cur_len = nr_pages * PAGE_SIZE;
-			tailsz = PAGE_SIZE;
+			cur_len = min(round_down(wsize, PAGE_SIZE), rest_len);
 		} else {
-			nr_pages = DIV_ROUND_UP(rest_len, PAGE_SIZE);
 			cur_len = rest_len;
-			tailsz = rest_len - (nr_pages - 1) * PAGE_SIZE;
 		}
 
-		wdata2 = cifs_writedata_alloc(nr_pages, cifs_writev_complete);
+		wdata2 = cifs_writedata_alloc(cifs_writev_complete);
 		if (!wdata2) {
 			rc = -ENOMEM;
 			break;
 		}
 
-		for (j = 0; j < nr_pages; j++) {
-			wdata2->pages[j] = wdata->pages[i + j];
-			lock_page(wdata2->pages[j]);
-			clear_page_dirty_for_io(wdata2->pages[j]);
-		}
-
 		wdata2->sync_mode = wdata->sync_mode;
-		wdata2->nr_pages = nr_pages;
-		wdata2->offset = page_offset(wdata2->pages[0]);
-		wdata2->pagesz = PAGE_SIZE;
-		wdata2->tailsz = tailsz;
-		wdata2->bytes = cur_len;
+		wdata2->offset	= fpos;
+		wdata2->bytes	= cur_len;
+		wdata2->iter	= wdata->iter;
+
+		iov_iter_advance(&wdata2->iter, fpos - wdata->offset);
+		iov_iter_truncate(&wdata2->iter, wdata2->bytes);
+
+		if (iov_iter_is_xarray(&wdata2->iter))
+			/* Check for pages having been redirtied and clean
+			 * them.  We can do this by walking the xarray.  If
+			 * it's not an xarray, then it's a DIO and we shouldn't
+			 * be mucking around with the page bits.
+			 */
+			cifs_undirty_folios(inode, fpos, cur_len);
 
 		rc = cifs_get_writable_file(CIFS_I(inode), FIND_WR_ANY,
 					    &wdata2->cfile);
@@ -2458,33 +2481,22 @@ cifs_writev_requeue(struct cifs_writedata *wdata)
 						       cifs_writedata_release);
 		}
 
-		for (j = 0; j < nr_pages; j++) {
-			unlock_page(wdata2->pages[j]);
-			if (rc != 0 && !is_retryable_error(rc)) {
-				SetPageError(wdata2->pages[j]);
-				end_page_writeback(wdata2->pages[j]);
-				put_page(wdata2->pages[j]);
-			}
-		}
-
 		kref_put(&wdata2->refcount, cifs_writedata_release);
 		if (rc) {
 			if (is_retryable_error(rc))
 				continue;
-			i += nr_pages;
+			fpos += cur_len;
+			rest_len -= cur_len;
 			break;
 		}
 
+		fpos += cur_len;
 		rest_len -= cur_len;
-		i += nr_pages;
-	} while (i < wdata->nr_pages);
+	} while (rest_len > 0);
 
-	/* cleanup remaining pages from the original wdata */
-	for (; i < wdata->nr_pages; i++) {
-		SetPageError(wdata->pages[i]);
-		end_page_writeback(wdata->pages[i]);
-		put_page(wdata->pages[i]);
-	}
+	/* Clean up remaining pages from the original wdata */
+	if (iov_iter_is_xarray(&wdata->iter))
+		cifs_pages_write_failed(inode, fpos, rest_len);
 
 	if (rc != 0 && !is_retryable_error(rc))
 		mapping_set_error(inode->i_mapping, rc);
@@ -2497,7 +2509,6 @@ cifs_writev_complete(struct work_struct *work)
 	struct cifs_writedata *wdata = container_of(work,
 						struct cifs_writedata, work);
 	struct inode *inode = d_inode(wdata->cfile->dentry);
-	int i = 0;
 
 	if (wdata->result == 0) {
 		spin_lock(&inode->i_lock);
@@ -2508,45 +2519,24 @@ cifs_writev_complete(struct work_struct *work)
 	} else if (wdata->sync_mode == WB_SYNC_ALL && wdata->result == -EAGAIN)
 		return cifs_writev_requeue(wdata);
 
-	for (i = 0; i < wdata->nr_pages; i++) {
-		struct page *page = wdata->pages[i];
+	if (wdata->result == -EAGAIN)
+		cifs_pages_write_redirty(inode, wdata->offset, wdata->bytes);
+	else if (wdata->result < 0)
+		cifs_pages_write_failed(inode, wdata->offset, wdata->bytes);
+	else
+		cifs_pages_written_back(inode, wdata->offset, wdata->bytes);
 
-		if (wdata->result == -EAGAIN)
-			__set_page_dirty_nobuffers(page);
-		else if (wdata->result < 0)
-			SetPageError(page);
-		end_page_writeback(page);
-		cifs_readpage_to_fscache(inode, page);
-		put_page(page);
-	}
 	if (wdata->result != -EAGAIN)
 		mapping_set_error(inode->i_mapping, wdata->result);
 	kref_put(&wdata->refcount, cifs_writedata_release);
 }
 
-struct cifs_writedata *
-cifs_writedata_alloc(unsigned int nr_pages, work_func_t complete)
-{
-	struct cifs_writedata *writedata = NULL;
-	struct page **pages =
-		kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
-	if (pages) {
-		writedata = cifs_writedata_direct_alloc(pages, complete);
-		if (!writedata)
-			kvfree(pages);
-	}
-
-	return writedata;
-}
-
-struct cifs_writedata *
-cifs_writedata_direct_alloc(struct page **pages, work_func_t complete)
+struct cifs_writedata *cifs_writedata_alloc(work_func_t complete)
 {
 	struct cifs_writedata *wdata;
 
 	wdata = kzalloc(sizeof(*wdata), GFP_NOFS);
 	if (wdata != NULL) {
-		wdata->pages = pages;
 		kref_init(&wdata->refcount);
 		INIT_LIST_HEAD(&wdata->list);
 		init_completion(&wdata->done);
@@ -2555,7 +2545,6 @@ cifs_writedata_direct_alloc(struct page **pages, work_func_t complete)
 	return wdata;
 }
 
-
 static int cifs_partialpagewrite(struct page *page, unsigned from, unsigned to)
 {
 	struct address_space *mapping = page->mapping;
@@ -2614,6 +2603,7 @@ static int cifs_partialpagewrite(struct page *page, unsigned from, unsigned to)
 	return rc;
 }
 
+#if 0 // TODO: Remove for iov_iter support
 static struct cifs_writedata *
 wdata_alloc_and_fillpages(pgoff_t tofind, struct address_space *mapping,
 			  pgoff_t end, pgoff_t *index,
@@ -2919,6 +2909,374 @@ static int cifs_writepages(struct address_space *mapping,
 	set_bit(CIFS_INO_MODIFIED_ATTR, &CIFS_I(inode)->flags);
 	return rc;
 }
+#endif
+
+/*
+ * Extend the region to be written back to include subsequent contiguously
+ * dirty pages if possible, but don't sleep while doing so.
+ */
+static void cifs_extend_writeback(struct address_space *mapping,
+				  long *_count,
+				  loff_t start,
+				  int max_pages,
+				  size_t max_len,
+				  unsigned int *_len)
+{
+	struct folio_batch batch;
+	struct folio *folio;
+	unsigned int psize, nr_pages;
+	size_t len = *_len;
+	pgoff_t index = (start + len) / PAGE_SIZE;
+	bool stop = true;
+	unsigned int i;
+
+	XA_STATE(xas, &mapping->i_pages, index);
+	folio_batch_init(&batch);
+
+	do {
+		/* Firstly, we gather up a batch of contiguous dirty pages
+		 * under the RCU read lock - but we can't clear the dirty flags
+		 * there if any of those pages are mapped.
+		 */
+		rcu_read_lock();
+
+		xas_for_each(&xas, folio, ULONG_MAX) {
+			stop = true;
+			if (xas_retry(&xas, folio))
+				continue;
+			if (xa_is_value(folio))
+				break;
+			if (folio_index(folio) != index)
+				break;
+			if (!folio_try_get_rcu(folio)) {
+				xas_reset(&xas);
+				continue;
+			}
+			nr_pages = folio_nr_pages(folio);
+			if (nr_pages > max_pages)
+				break;
+
+			/* Has the page moved or been split? */
+			if (unlikely(folio != xas_reload(&xas))) {
+				folio_put(folio);
+				break;
+			}
+
+			if (!folio_trylock(folio)) {
+				folio_put(folio);
+				break;
+			}
+			if (!folio_test_dirty(folio) || folio_test_writeback(folio)) {
+				folio_unlock(folio);
+				folio_put(folio);
+				break;
+			}
+
+			max_pages -= nr_pages;
+			psize = folio_size(folio);
+			len += psize;
+			stop = false;
+			if (max_pages <= 0 || len >= max_len || *_count <= 0)
+				stop = true;
+
+			index += nr_pages;
+			if (!folio_batch_add(&batch, folio))
+				break;
+			if (stop)
+				break;
+		}
+
+		if (!stop)
+			xas_pause(&xas);
+		rcu_read_unlock();
+
+		/* Now, if we obtained any pages, we can shift them to being
+		 * writable and mark them for caching.
+		 */
+		if (!folio_batch_count(&batch))
+			break;
+
+		for (i = 0; i < folio_batch_count(&batch); i++) {
+			folio = batch.folios[i];
+			/* The folio should be locked, dirty and not undergoing
+			 * writeback from the loop above.
+			 */
+			if (!folio_clear_dirty_for_io(folio))
+				WARN_ON(1);
+			if (folio_start_writeback(folio))
+				WARN_ON(1);
+
+			*_count -= folio_nr_pages(folio);
+			folio_unlock(folio);
+		}
+
+		folio_batch_release(&batch);
+		cond_resched();
+	} while (!stop);
+
+	*_len = len;
+}
+
+/*
+ * Write back the locked page and any subsequent non-locked dirty pages.
+ */
+static ssize_t cifs_write_back_from_locked_folio(struct address_space *mapping,
+						 struct writeback_control *wbc,
+						 struct folio *folio,
+						 loff_t start, loff_t end)
+{
+	struct inode *inode = mapping->host;
+	struct TCP_Server_Info *server;
+	struct cifs_writedata *wdata;
+	struct cifs_sb_info *cifs_sb = CIFS_SB(inode->i_sb);
+	struct cifs_credits credits_on_stack;
+	struct cifs_credits *credits = &credits_on_stack;
+	struct cifsFileInfo *cfile = NULL;
+	unsigned int xid, wsize, len;
+	loff_t i_size = i_size_read(inode);
+	size_t max_len;
+	long count = wbc->nr_to_write;
+	int rc;
+
+	/* The folio should be locked, dirty and not undergoing writeback. */
+	if (folio_start_writeback(folio))
+		WARN_ON(1);
+
+	count -= folio_nr_pages(folio);
+	len = folio_size(folio);
+
+	xid = get_xid();
+	server = cifs_pick_channel(cifs_sb_master_tcon(cifs_sb)->ses);
+
+	rc = cifs_get_writable_file(CIFS_I(inode), FIND_WR_ANY, &cfile);
+	if (rc) {
+		cifs_dbg(VFS, "No writable handle in writepages rc=%d\n", rc);
+		goto err_xid;
+	}
+
+	rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->wsize,
+					   &wsize, credits);
+	if (rc != 0)
+		goto err_close;
+
+	wdata = cifs_writedata_alloc(cifs_writev_complete);
+	if (!wdata) {
+		rc = -ENOMEM;
+		goto err_uncredit;
+	}
+
+	wdata->sync_mode = wbc->sync_mode;
+	wdata->offset = folio_pos(folio);
+	wdata->pid = cfile->pid;
+	wdata->credits = credits_on_stack;
+	wdata->cfile = cfile;
+	wdata->server = server;
+	cfile = NULL;
+
+	/* Find all consecutive lockable dirty pages, stopping when we find a
+	 * page that is not immediately lockable, is not dirty or is missing,
+	 * or we reach the end of the range.
+	 */
+	if (start < i_size) {
+		/* Trim the write to the EOF; the extra data is ignored.  Also
+		 * put an upper limit on the size of a single write op.
+		 */
+		max_len = wsize;
+		max_len = min_t(unsigned long long, max_len, end - start + 1);
+		max_len = min_t(unsigned long long, max_len, i_size - start);
+
+		if (len < max_len) {
+			int max_pages = INT_MAX;
+
+#ifdef CONFIG_CIFS_SMB_DIRECT
+			if (server->smbd_conn)
+				max_pages = server->smbd_conn->max_frmr_depth;
+#endif
+			max_pages -= folio_nr_pages(folio);
+
+			if (max_pages > 0)
+				cifs_extend_writeback(mapping, &count, start,
+						      max_pages, max_len, &len);
+		}
+		len = min_t(loff_t, len, max_len);
+	}
+
+	wdata->bytes = len;
+
+	/* We now have a contiguous set of dirty pages, each with writeback
+	 * set; the first page is still locked at this point, but all the rest
+	 * have been unlocked.
+	 */
+	folio_unlock(folio);
+
+	if (start < i_size) {
+		iov_iter_xarray(&wdata->iter, ITER_SOURCE, &mapping->i_pages, start, len);
+
+		rc = adjust_credits(wdata->server, &wdata->credits, wdata->bytes);
+		if (rc)
+			goto err_wdata;
+
+		if (wdata->cfile->invalidHandle)
+			rc = -EAGAIN;
+		else
+			rc = wdata->server->ops->async_writev(wdata,
+							      cifs_writedata_release);
+		if (rc >= 0) {
+			kref_put(&wdata->refcount, cifs_writedata_release);
+			goto err_close;
+		}
+	} else {
+		/* The dirty region was entirely beyond the EOF. */
+		cifs_pages_written_back(inode, start, len);
+		rc = 0;
+	}
+
+err_wdata:
+	kref_put(&wdata->refcount, cifs_writedata_release);
+err_uncredit:
+	add_credits_and_wake_if(server, credits, 0);
+err_close:
+	if (cfile)
+		cifsFileInfo_put(cfile);
+err_xid:
+	free_xid(xid);
+	if (rc == 0) {
+		wbc->nr_to_write = count;
+	} else if (is_retryable_error(rc)) {
+		cifs_pages_write_redirty(inode, start, len);
+	} else {
+		cifs_pages_write_failed(inode, start, len);
+		mapping_set_error(mapping, rc);
+	}
+	/* Indication to update ctime and mtime as close is deferred */
+	set_bit(CIFS_INO_MODIFIED_ATTR, &CIFS_I(inode)->flags);
+	return rc;
+}
+
+/*
+ * write a region of pages back to the server
+ */
+static int cifs_writepages_region(struct address_space *mapping,
+				  struct writeback_control *wbc,
+				  loff_t start, loff_t end, loff_t *_next)
+{
+	struct folio *folio;
+	struct page *head_page;
+	ssize_t ret;
+	int n, skips = 0;
+
+	do {
+		pgoff_t index = start / PAGE_SIZE;
+
+		n = find_get_pages_range_tag(mapping, &index, end / PAGE_SIZE,
+					     PAGECACHE_TAG_DIRTY, 1, &head_page);
+		if (!n)
+			break;
+
+		folio = page_folio(head_page);
+		start = folio_pos(folio); /* May regress with THPs */
+
+		/* At this point we hold neither the i_pages lock nor the
+		 * page lock: the page may be truncated or invalidated
+		 * (changing page->mapping to NULL), or even swizzled
+		 * back from swapper_space to tmpfs file mapping
+		 */
+		if (wbc->sync_mode != WB_SYNC_NONE) {
+			ret = folio_lock_killable(folio);
+			if (ret < 0) {
+				folio_put(folio);
+				return ret;
+			}
+		} else {
+			if (!folio_trylock(folio)) {
+				folio_put(folio);
+				return 0;
+			}
+		}
+
+		if (folio_mapping(folio) != mapping ||
+		    !folio_test_dirty(folio)) {
+			start += folio_size(folio);
+			folio_unlock(folio);
+			folio_put(folio);
+			continue;
+		}
+
+		if (folio_test_writeback(folio) ||
+		    folio_test_fscache(folio)) {
+			folio_unlock(folio);
+			if (wbc->sync_mode != WB_SYNC_NONE) {
+				folio_wait_writeback(folio);
+#ifdef CONFIG_CIFS_FSCACHE
+				folio_wait_fscache(folio);
+#endif
+			} else {
+				start += folio_size(folio);
+			}
+			folio_put(folio);
+			if (wbc->sync_mode == WB_SYNC_NONE) {
+				if (skips >= 5 || need_resched())
+					break;
+				skips++;
+			}
+			continue;
+		}
+
+		if (!folio_clear_dirty_for_io(folio))
+			/* We hold the page lock - it should've been dirty. */
+			WARN_ON(1);
+
+		ret = cifs_write_back_from_locked_folio(mapping, wbc, folio, start, end);
+		folio_put(folio);
+		if (ret < 0)
+			return ret;
+
+		start += ret;
+		cond_resched();
+	} while (wbc->nr_to_write > 0);
+
+	*_next = start;
+	return 0;
+}
+
+/*
+ * Write some of the pending data back to the server
+ */
+static int cifs_writepages(struct address_space *mapping,
+			   struct writeback_control *wbc)
+{
+	loff_t start, next;
+	int ret;
+
+	/* We have to be careful as we can end up racing with setattr()
+	 * truncating the pagecache since the caller doesn't take a lock here
+	 * to prevent it.
+	 */
+
+	if (wbc->range_cyclic) {
+		start = mapping->writeback_index * PAGE_SIZE;
+		ret = cifs_writepages_region(mapping, wbc, start, LLONG_MAX, &next);
+		if (ret == 0) {
+			mapping->writeback_index = next / PAGE_SIZE;
+			if (start > 0 && wbc->nr_to_write > 0) {
+				ret = cifs_writepages_region(mapping, wbc, 0,
+							     start, &next);
+				if (ret == 0)
+					mapping->writeback_index =
+						next / PAGE_SIZE;
+			}
+		}
+	} else if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) {
+		ret = cifs_writepages_region(mapping, wbc, 0, LLONG_MAX, &next);
+		if (wbc->nr_to_write > 0 && ret == 0)
+			mapping->writeback_index = next / PAGE_SIZE;
+	} else {
+		ret = cifs_writepages_region(mapping, wbc,
+					     wbc->range_start, wbc->range_end, &next);
+	}
+
+	return ret;
+}
 
 static int
 cifs_writepage_locked(struct page *page, struct writeback_control *wbc)
@@ -2969,6 +3327,7 @@ static int cifs_write_end(struct file *file, struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	struct cifsFileInfo *cfile = file->private_data;
 	struct cifs_sb_info *cifs_sb = CIFS_SB(cfile->dentry->d_sb);
+	struct folio *folio = page_folio(page);
 	__u32 pid;
 
 	if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
@@ -2979,14 +3338,14 @@ static int cifs_write_end(struct file *file, struct address_space *mapping,
 	cifs_dbg(FYI, "write_end for page %p from pos %lld with %d bytes\n",
 		 page, pos, copied);
 
-	if (PageChecked(page)) {
+	if (folio_test_checked(folio)) {
 		if (copied == len)
-			SetPageUptodate(page);
-		ClearPageChecked(page);
-	} else if (!PageUptodate(page) && copied == PAGE_SIZE)
-		SetPageUptodate(page);
+			folio_mark_uptodate(folio);
+		folio_clear_checked(folio);
+	} else if (!folio_test_uptodate(folio) && copied == PAGE_SIZE)
+		folio_mark_uptodate(folio);
 
-	if (!PageUptodate(page)) {
+	if (!folio_test_uptodate(folio)) {
 		char *page_data;
 		unsigned offset = pos & (PAGE_SIZE - 1);
 		unsigned int xid;
@@ -3146,6 +3505,7 @@ int cifs_flush(struct file *file, fl_owner_t id)
 	return rc;
 }
 
+#if 0 // TODO: Remove for iov_iter support
 static int
 cifs_write_allocate_pages(struct page **pages, unsigned long num_pages)
 {
@@ -3186,17 +3546,15 @@ size_t get_numpages(const size_t wsize, const size_t len, size_t *cur_len)
 
 	return num_pages;
 }
+#endif
 
 static void
 cifs_uncached_writedata_release(struct kref *refcount)
 {
-	int i;
 	struct cifs_writedata *wdata = container_of(refcount,
 					struct cifs_writedata, refcount);
 
 	kref_put(&wdata->ctx->refcount, cifs_aio_ctx_release);
-	for (i = 0; i < wdata->nr_pages; i++)
-		put_page(wdata->pages[i]);
 	cifs_writedata_release(refcount);
 }
 
@@ -3222,6 +3580,7 @@ cifs_uncached_writev_complete(struct work_struct *work)
 	kref_put(&wdata->refcount, cifs_uncached_writedata_release);
 }
 
+#if 0 // TODO: Remove for iov_iter support
 static int
 wdata_fill_from_iovec(struct cifs_writedata *wdata, struct iov_iter *from,
 		      size_t *len, unsigned long *num_pages)
@@ -3263,6 +3622,7 @@ wdata_fill_from_iovec(struct cifs_writedata *wdata, struct iov_iter *from,
 	*num_pages = i + 1;
 	return 0;
 }
+#endif
 
 static int
 cifs_resend_wdata(struct cifs_writedata *wdata, struct list_head *wdata_list,
@@ -3334,23 +3694,57 @@ cifs_resend_wdata(struct cifs_writedata *wdata, struct list_head *wdata_list,
 	return rc;
 }
 
+/*
+ * Select span of a bvec iterator we're going to use.  Limit it by both maximum
+ * size and maximum number of segments.
+ */
+static size_t cifs_limit_bvec_subset(const struct iov_iter *iter, size_t max_size,
+				     size_t max_segs, unsigned int *_nsegs)
+{
+	const struct bio_vec *bvecs = iter->bvec;
+	unsigned int nbv = iter->nr_segs, ix = 0, nsegs = 0;
+	size_t len, span = 0, n = iter->count;
+	size_t skip = iter->iov_offset;
+
+	if (WARN_ON(!iov_iter_is_bvec(iter)) || n == 0)
+		return 0;
+
+	while (n && ix < nbv && skip) {
+		len = bvecs[ix].bv_len;
+		if (skip < len)
+			break;
+		skip -= len;
+		n -= len;
+		ix++;
+	}
+
+	while (n && ix < nbv) {
+		len = min3(n, bvecs[ix].bv_len - skip, max_size);
+		span += len;
+		nsegs++;
+		ix++;
+		if (span >= max_size || nsegs >= max_segs)
+			break;
+		skip = 0;
+		n -= len;
+	}
+
+	*_nsegs = nsegs;
+	return span;
+}
+
 static int
-cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
+cifs_write_from_iter(loff_t fpos, size_t len, struct iov_iter *from,
 		     struct cifsFileInfo *open_file,
 		     struct cifs_sb_info *cifs_sb, struct list_head *wdata_list,
 		     struct cifs_aio_ctx *ctx)
 {
 	int rc = 0;
-	size_t cur_len;
-	unsigned long nr_pages, num_pages, i;
+	size_t cur_len, max_len;
 	struct cifs_writedata *wdata;
-	struct iov_iter saved_from = *from;
-	loff_t saved_offset = offset;
 	pid_t pid;
 	struct TCP_Server_Info *server;
-	struct page **pagevec;
-	size_t start;
-	unsigned int xid;
+	unsigned int xid, max_segs = INT_MAX;
 
 	if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
 		pid = open_file->pid;
@@ -3360,10 +3754,20 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 	server = cifs_pick_channel(tlink_tcon(open_file->tlink)->ses);
 	xid = get_xid();
 
+#ifdef CONFIG_CIFS_SMB_DIRECT
+	if (server->smbd_conn)
+		max_segs = server->smbd_conn->max_frmr_depth;
+#endif
+
 	do {
-		unsigned int wsize;
 		struct cifs_credits credits_on_stack;
 		struct cifs_credits *credits = &credits_on_stack;
+		unsigned int wsize, nsegs = 0;
+
+		if (signal_pending(current)) {
+			rc = -EINTR;
+			break;
+		}
 
 		if (open_file->invalidHandle) {
 			rc = cifs_reopen_file(open_file, false);
@@ -3378,99 +3782,42 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 		if (rc)
 			break;
 
-		cur_len = min_t(const size_t, len, wsize);
-
-		if (ctx->direct_io) {
-			ssize_t result;
-
-			result = iov_iter_get_pages_alloc(
-				from, &pagevec, cur_len, &start, FOLL_SOURCE_BUF);
-			if (result < 0) {
-				cifs_dbg(VFS,
-					 "direct_writev couldn't get user pages (rc=%zd) iter type %d iov_offset %zd count %zd\n",
-					 result, iov_iter_type(from),
-					 from->iov_offset, from->count);
-				dump_stack();
-
-				rc = result;
-				add_credits_and_wake_if(server, credits, 0);
-				break;
-			}
-			cur_len = (size_t)result;
-
-			nr_pages =
-				(cur_len + start + PAGE_SIZE - 1) / PAGE_SIZE;
-
-			wdata = cifs_writedata_direct_alloc(pagevec,
-					     cifs_uncached_writev_complete);
-			if (!wdata) {
-				rc = -ENOMEM;
-				for (i = 0; i < nr_pages; i++)
-					put_page(pagevec[i]);
-				kvfree(pagevec);
-				add_credits_and_wake_if(server, credits, 0);
-				break;
-			}
-
-
-			wdata->page_offset = start;
-			wdata->tailsz =
-				nr_pages > 1 ?
-					cur_len - (PAGE_SIZE - start) -
-					(nr_pages - 2) * PAGE_SIZE :
-					cur_len;
-		} else {
-			nr_pages = get_numpages(wsize, len, &cur_len);
-			wdata = cifs_writedata_alloc(nr_pages,
-					     cifs_uncached_writev_complete);
-			if (!wdata) {
-				rc = -ENOMEM;
-				add_credits_and_wake_if(server, credits, 0);
-				break;
-			}
-
-			rc = cifs_write_allocate_pages(wdata->pages, nr_pages);
-			if (rc) {
-				kvfree(wdata->pages);
-				kfree(wdata);
-				add_credits_and_wake_if(server, credits, 0);
-				break;
-			}
-
-			num_pages = nr_pages;
-			rc = wdata_fill_from_iovec(
-				wdata, from, &cur_len, &num_pages);
-			if (rc) {
-				for (i = 0; i < nr_pages; i++)
-					put_page(wdata->pages[i]);
-				kvfree(wdata->pages);
-				kfree(wdata);
-				add_credits_and_wake_if(server, credits, 0);
-				break;
-			}
+		max_len = min_t(const size_t, len, wsize);
+		if (!max_len) {
+			rc = -EAGAIN;
+			add_credits_and_wake_if(server, credits, 0);
+			break;
+		}
 
-			/*
-			 * Bring nr_pages down to the number of pages we
-			 * actually used, and free any pages that we didn't use.
-			 */
-			for ( ; nr_pages > num_pages; nr_pages--)
-				put_page(wdata->pages[nr_pages - 1]);
+		cur_len = cifs_limit_bvec_subset(from, max_len, max_segs, &nsegs);
+		cifs_dbg(FYI, "write_from_iter len=%zx/%zx nsegs=%u/%lu/%u\n",
+			 cur_len, max_len, nsegs, from->nr_segs, max_segs);
+		if (cur_len == 0) {
+			rc = -EIO;
+			add_credits_and_wake_if(server, credits, 0);
+			break;
+		}
 
-			wdata->tailsz = cur_len - ((nr_pages - 1) * PAGE_SIZE);
+		wdata = cifs_writedata_alloc(cifs_uncached_writev_complete);
+		if (!wdata) {
+			rc = -ENOMEM;
+			add_credits_and_wake_if(server, credits, 0);
+			break;
 		}
 
 		wdata->sync_mode = WB_SYNC_ALL;
-		wdata->nr_pages = nr_pages;
-		wdata->offset = (__u64)offset;
-		wdata->cfile = cifsFileInfo_get(open_file);
-		wdata->server = server;
-		wdata->pid = pid;
-		wdata->bytes = cur_len;
-		wdata->pagesz = PAGE_SIZE;
-		wdata->credits = credits_on_stack;
-		wdata->ctx = ctx;
+		wdata->offset	= (__u64)fpos;
+		wdata->cfile	= cifsFileInfo_get(open_file);
+		wdata->server	= server;
+		wdata->pid	= pid;
+		wdata->bytes	= cur_len;
+		wdata->credits	= credits_on_stack;
+		wdata->iter	= *from;
+		wdata->ctx	= ctx;
 		kref_get(&ctx->refcount);
 
+		iov_iter_truncate(&wdata->iter, cur_len);
+
 		rc = adjust_credits(server, &wdata->credits, wdata->bytes);
 
 		if (!rc) {
@@ -3485,16 +3832,14 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
 			add_credits_and_wake_if(server, &wdata->credits, 0);
 			kref_put(&wdata->refcount,
 				 cifs_uncached_writedata_release);
-			if (rc == -EAGAIN) {
-				*from = saved_from;
-				iov_iter_advance(from, offset - saved_offset);
+			if (rc == -EAGAIN)
 				continue;
-			}
 			break;
 		}
 
 		list_add_tail(&wdata->list, wdata_list);
-		offset += cur_len;
+		iov_iter_advance(from, cur_len);
+		fpos += cur_len;
 		len -= cur_len;
 	} while (len > 0);
 
@@ -3593,8 +3938,6 @@ static ssize_t __cifs_writev(
 	struct cifs_tcon *tcon;
 	struct cifs_sb_info *cifs_sb;
 	struct cifs_aio_ctx *ctx;
-	struct iov_iter saved_from = *from;
-	size_t len = iov_iter_count(from);
 	int rc;
 
 	/*
@@ -3628,23 +3971,56 @@ static ssize_t __cifs_writev(
 		ctx->iocb = iocb;
 
 	ctx->pos = iocb->ki_pos;
+	ctx->direct_io = direct;
+	ctx->nr_pinned_pages = 0;
 
-	if (direct) {
-		ctx->direct_io = true;
-		ctx->iter = *from;
-		ctx->len = len;
-	} else {
-		rc = setup_aio_ctx_iter(ctx, from, ITER_SOURCE);
-		if (rc) {
+	if (user_backed_iter(from)) {
+		/*
+		 * Extract IOVEC/UBUF-type iterators to a BVEC-type iterator as
+		 * they contain references to the calling process's virtual
+		 * memory layout which won't be available in an async worker
+		 * thread.  This also takes a ref or a pin on every folio
+		 * involved.
+		 */
+		rc = netfs_extract_user_iter(from, iov_iter_count(from),
+					     &ctx->iter, FOLL_SOURCE_BUF);
+		if (rc < 0) {
 			kref_put(&ctx->refcount, cifs_aio_ctx_release);
 			return rc;
 		}
+
+		ctx->nr_pinned_pages = rc;
+		ctx->bv = (void *)ctx->iter.bvec;
+		ctx->bv_cleanup_mode =
+			iov_iter_extract_mode(&ctx->iter, FOLL_SOURCE_BUF);
+	} else if ((iov_iter_is_bvec(from) || iov_iter_is_kvec(from)) &&
+		   !is_sync_kiocb(iocb)) {
+		/*
+		 * If the op is asynchronous, we need to copy the list attached
+		 * to a BVEC/KVEC-type iterator, but we assume that the storage
+		 * will be pinned by the caller; in any case, we may or may not
+		 * be able to pin the pages, so we don't try.
+		 */
+		ctx->bv = (void *)dup_iter(&ctx->iter, from, GFP_KERNEL);
+		if (!ctx->bv) {
+			kref_put(&ctx->refcount, cifs_aio_ctx_release);
+			return -ENOMEM;
+		}
+	} else {
+		/*
+		 * Otherwise, we just pass the iterator down as-is and rely on
+		 * the caller to make sure the pages referred to by the
+		 * iterator don't evaporate.
+		 */
+		ctx->iter = *from;
 	}
 
+	ctx->len = iov_iter_count(&ctx->iter);
+
 	/* grab a lock here due to read response handlers can access ctx */
 	mutex_lock(&ctx->aio_mutex);
 
-	rc = cifs_write_from_iter(iocb->ki_pos, ctx->len, &saved_from,
+	rc = cifs_write_from_iter(iocb->ki_pos, ctx->len, &ctx->iter,
 				  cfile, cifs_sb, &ctx->list, ctx);
 
 	/*
@@ -3787,14 +4163,12 @@ cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from)
 	return written;
 }
 
-static struct cifs_readdata *
-cifs_readdata_direct_alloc(struct page **pages, work_func_t complete)
+static struct cifs_readdata *cifs_readdata_alloc(work_func_t complete)
 {
 	struct cifs_readdata *rdata;
 
 	rdata = kzalloc(sizeof(*rdata), GFP_KERNEL);
-	if (rdata != NULL) {
-		rdata->pages = pages;
+	if (rdata) {
 		kref_init(&rdata->refcount);
 		INIT_LIST_HEAD(&rdata->list);
 		init_completion(&rdata->done);
@@ -3804,27 +4178,14 @@ cifs_readdata_direct_alloc(struct page **pages, work_func_t complete)
 	return rdata;
 }
 
-static struct cifs_readdata *
-cifs_readdata_alloc(unsigned int nr_pages, work_func_t complete)
-{
-	struct page **pages =
-		kcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
-	struct cifs_readdata *ret = NULL;
-
-	if (pages) {
-		ret = cifs_readdata_direct_alloc(pages, complete);
-		if (!ret)
-			kfree(pages);
-	}
-
-	return ret;
-}
-
 void
 cifs_readdata_release(struct kref *refcount)
 {
 	struct cifs_readdata *rdata = container_of(refcount,
 					struct cifs_readdata, refcount);
+
+	if (rdata->ctx)
+		kref_put(&rdata->ctx->refcount, cifs_aio_ctx_release);
 #ifdef CONFIG_CIFS_SMB_DIRECT
 	if (rdata->mr) {
 		smbd_deregister_mr(rdata->mr);
@@ -3834,85 +4195,9 @@ cifs_readdata_release(struct kref *refcount)
 	if (rdata->cfile)
 		cifsFileInfo_put(rdata->cfile);
 
-	kvfree(rdata->pages);
 	kfree(rdata);
 }
 
-static int
-cifs_read_allocate_pages(struct cifs_readdata *rdata, unsigned int nr_pages)
-{
-	int rc = 0;
-	struct page *page;
-	unsigned int i;
-
-	for (i = 0; i < nr_pages; i++) {
-		page = alloc_page(GFP_KERNEL|__GFP_HIGHMEM);
-		if (!page) {
-			rc = -ENOMEM;
-			break;
-		}
-		rdata->pages[i] = page;
-	}
-
-	if (rc) {
-		unsigned int nr_page_failed = i;
-
-		for (i = 0; i < nr_page_failed; i++) {
-			put_page(rdata->pages[i]);
-			rdata->pages[i] = NULL;
-		}
-	}
-	return rc;
-}
-
-static void
-cifs_uncached_readdata_release(struct kref *refcount)
-{
-	struct cifs_readdata *rdata = container_of(refcount,
-					struct cifs_readdata, refcount);
-	unsigned int i;
-
-	kref_put(&rdata->ctx->refcount, cifs_aio_ctx_release);
-	for (i = 0; i < rdata->nr_pages; i++) {
-		put_page(rdata->pages[i]);
-	}
-	cifs_readdata_release(refcount);
-}
-
-/**
- * cifs_readdata_to_iov - copy data from pages in response to an iovec
- * @rdata:	the readdata response with list of pages holding data
- * @iter:	destination for our data
- *
- * This function copies data from a list of pages in a readdata response into
- * an array of iovecs. It will first calculate where the data should go
- * based on the info in the readdata and then copy the data into that spot.
- */
-static int
-cifs_readdata_to_iov(struct cifs_readdata *rdata, struct iov_iter *iter)
-{
-	size_t remaining = rdata->got_bytes;
-	unsigned int i;
-
-	for (i = 0; i < rdata->nr_pages; i++) {
-		struct page *page = rdata->pages[i];
-		size_t copy = min_t(size_t, remaining, PAGE_SIZE);
-		size_t written;
-
-		if (unlikely(iov_iter_is_pipe(iter))) {
-			void *addr = kmap_atomic(page);
-
-			written = copy_to_iter(addr, copy, iter);
-			kunmap_atomic(addr);
-		} else
-			written = copy_page_to_iter(page, 0, copy, iter);
-		remaining -= written;
-		if (written < copy && iov_iter_count(iter) > 0)
-			break;
-	}
-	return remaining ? -EFAULT : 0;
-}
-
 static void collect_uncached_read_data(struct cifs_aio_ctx *ctx);
 
 static void
@@ -3924,9 +4209,11 @@ cifs_uncached_readv_complete(struct work_struct *work)
 	complete(&rdata->done);
 	collect_uncached_read_data(rdata->ctx);
 	/* the below call can possibly free the last ref to aio ctx */
-	kref_put(&rdata->refcount, cifs_uncached_readdata_release);
+	kref_put(&rdata->refcount, cifs_readdata_release);
 }
 
+#if 0 // TODO: Remove for iov_iter support
+
 static int
 uncached_fill_pages(struct TCP_Server_Info *server,
 		    struct cifs_readdata *rdata, struct iov_iter *iter,
@@ -4000,6 +4287,7 @@ cifs_uncached_copy_into_pages(struct TCP_Server_Info *server,
 {
 	return uncached_fill_pages(server, rdata, iter, iter->count);
 }
+#endif
 
 static int cifs_resend_rdata(struct cifs_readdata *rdata,
 			struct list_head *rdata_list,
@@ -4069,37 +4357,36 @@ static int cifs_resend_rdata(struct cifs_readdata *rdata,
 	} while (rc == -EAGAIN);
 
 fail:
-	kref_put(&rdata->refcount, cifs_uncached_readdata_release);
+	kref_put(&rdata->refcount, cifs_readdata_release);
 	return rc;
 }
 
 static int
-cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,
+cifs_send_async_read(loff_t fpos, size_t len, struct cifsFileInfo *open_file,
 		     struct cifs_sb_info *cifs_sb, struct list_head *rdata_list,
 		     struct cifs_aio_ctx *ctx)
 {
 	struct cifs_readdata *rdata;
-	unsigned int npages, rsize;
+	unsigned int rsize, nsegs, max_segs = INT_MAX;
 	struct cifs_credits credits_on_stack;
 	struct cifs_credits *credits = &credits_on_stack;
-	size_t cur_len;
+	size_t cur_len, max_len;
 	int rc;
 	pid_t pid;
 	struct TCP_Server_Info *server;
-	struct page **pagevec;
-	size_t start;
-	struct iov_iter direct_iov = ctx->iter;
 
 	server = cifs_pick_channel(tlink_tcon(open_file->tlink)->ses);
 
+#ifdef CONFIG_CIFS_SMB_DIRECT
+	if (server->smbd_conn)
+		max_segs = server->smbd_conn->max_frmr_depth;
+#endif
+
 	if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
 		pid = open_file->pid;
 	else
 		pid = current->tgid;
 
-	if (ctx->direct_io)
-		iov_iter_advance(&direct_iov, offset - ctx->pos);
-
 	do {
 		if (open_file->invalidHandle) {
 			rc = cifs_reopen_file(open_file, true);
@@ -4119,78 +4406,37 @@ cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,
 		if (rc)
 			break;
 
-		cur_len = min_t(const size_t, len, rsize);
-
-		if (ctx->direct_io) {
-			ssize_t result;
-
-			result = iov_iter_get_pages_alloc(
-					&direct_iov, &pagevec,
-					cur_len, &start, FOLL_DEST_BUF);
-			if (result < 0) {
-				cifs_dbg(VFS,
-					 "Couldn't get user pages (rc=%zd) iter type %d iov_offset %zd count %zd\n",
-					 result, iov_iter_type(&direct_iov),
-					 direct_iov.iov_offset,
-					 direct_iov.count);
-				dump_stack();
-
-				rc = result;
-				add_credits_and_wake_if(server, credits, 0);
-				break;
-			}
-			cur_len = (size_t)result;
-
-			rdata = cifs_readdata_direct_alloc(
-					pagevec, cifs_uncached_readv_complete);
-			if (!rdata) {
-				add_credits_and_wake_if(server, credits, 0);
-				rc = -ENOMEM;
-				break;
-			}
-
-			npages = (cur_len + start + PAGE_SIZE-1) / PAGE_SIZE;
-			rdata->page_offset = start;
-			rdata->tailsz = npages > 1 ?
-				cur_len-(PAGE_SIZE-start)-(npages-2)*PAGE_SIZE :
-				cur_len;
-
-		} else {
-
-			npages = DIV_ROUND_UP(cur_len, PAGE_SIZE);
-			/* allocate a readdata struct */
-			rdata = cifs_readdata_alloc(npages,
-					    cifs_uncached_readv_complete);
-			if (!rdata) {
-				add_credits_and_wake_if(server, credits, 0);
-				rc = -ENOMEM;
-				break;
-			}
+		max_len = min_t(size_t, len, rsize);
 
-			rc = cifs_read_allocate_pages(rdata, npages);
-			if (rc) {
-				kvfree(rdata->pages);
-				kfree(rdata);
-				add_credits_and_wake_if(server, credits, 0);
-				break;
-			}
+		cur_len = cifs_limit_bvec_subset(&ctx->iter, max_len,
+						 max_segs, &nsegs);
+		cifs_dbg(FYI, "read-to-iter len=%zx/%zx nsegs=%u/%lu/%u\n",
+			 cur_len, max_len, nsegs, ctx->iter.nr_segs, max_segs);
+		if (cur_len == 0) {
+			rc = -EIO;
+			add_credits_and_wake_if(server, credits, 0);
+			break;
+		}
 
-			rdata->tailsz = PAGE_SIZE;
+		rdata = cifs_readdata_alloc(cifs_uncached_readv_complete);
+		if (!rdata) {
+			add_credits_and_wake_if(server, credits, 0);
+			rc = -ENOMEM;
+			break;
 		}
 
-		rdata->server = server;
-		rdata->cfile = cifsFileInfo_get(open_file);
-		rdata->nr_pages = npages;
-		rdata->offset = offset;
-		rdata->bytes = cur_len;
-		rdata->pid = pid;
-		rdata->pagesz = PAGE_SIZE;
-		rdata->read_into_pages = cifs_uncached_read_into_pages;
-		rdata->copy_into_pages = cifs_uncached_copy_into_pages;
-		rdata->credits = credits_on_stack;
-		rdata->ctx = ctx;
+		rdata->server	= server;
+		rdata->cfile	= cifsFileInfo_get(open_file);
+		rdata->offset	= fpos;
+		rdata->bytes	= cur_len;
+		rdata->pid	= pid;
+		rdata->credits	= credits_on_stack;
+		rdata->ctx	= ctx;
 		kref_get(&ctx->refcount);
 
+		rdata->iter	= ctx->iter;
+		iov_iter_truncate(&rdata->iter, cur_len);
+
 		rc = adjust_credits(server, &rdata->credits, rdata->bytes);
 
 		if (!rc) {
@@ -4202,17 +4448,15 @@ cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,
 
 		if (rc) {
 			add_credits_and_wake_if(server, &rdata->credits, 0);
-			kref_put(&rdata->refcount,
-				cifs_uncached_readdata_release);
-			if (rc == -EAGAIN) {
-				iov_iter_revert(&direct_iov, cur_len);
+			kref_put(&rdata->refcount, cifs_readdata_release);
+			if (rc == -EAGAIN)
 				continue;
-			}
 			break;
 		}
 
 		list_add_tail(&rdata->list, rdata_list);
-		offset += cur_len;
+		iov_iter_advance(&ctx->iter, cur_len);
+		fpos += cur_len;
 		len -= cur_len;
 	} while (len > 0);
 
@@ -4254,22 +4498,6 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
 				list_del_init(&rdata->list);
 				INIT_LIST_HEAD(&tmp_list);
 
-				/*
-				 * Got a part of data and then reconnect has
-				 * happened -- fill the buffer and continue
-				 * reading.
-				 */
-				if (got_bytes && got_bytes < rdata->bytes) {
-					rc = 0;
-					if (!ctx->direct_io)
-						rc = cifs_readdata_to_iov(rdata, to);
-					if (rc) {
-						kref_put(&rdata->refcount,
-							cifs_uncached_readdata_release);
-						continue;
-					}
-				}
-
 				if (ctx->direct_io) {
 					/*
 					 * Re-use rdata as this is a
@@ -4286,7 +4514,7 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
 						&tmp_list, ctx);
 
 					kref_put(&rdata->refcount,
-						cifs_uncached_readdata_release);
+						cifs_readdata_release);
 				}
 
 				list_splice(&tmp_list, &ctx->list);
@@ -4294,8 +4522,6 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
 				goto again;
 			} else if (rdata->result)
 				rc = rdata->result;
-			else if (!ctx->direct_io)
-				rc = cifs_readdata_to_iov(rdata, to);
 
 			/* if there was a short read -- discard anything left */
 			if (rdata->got_bytes && rdata->got_bytes < rdata->bytes)
@@ -4304,7 +4530,7 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
 			ctx->total_len += rdata->got_bytes;
 		}
 		list_del_init(&rdata->list);
-		kref_put(&rdata->refcount, cifs_uncached_readdata_release);
+		kref_put(&rdata->refcount, cifs_readdata_release);
 	}
 
 	if (!ctx->direct_io)
@@ -4364,26 +4590,55 @@ static ssize_t __cifs_readv(
 	if (!ctx)
 		return -ENOMEM;
 
-	ctx->cfile = cifsFileInfo_get(cfile);
+	ctx->pos	= offset;
+	ctx->direct_io	= direct;
+	ctx->len	= len;
+	ctx->cfile	= cifsFileInfo_get(cfile);
+	ctx->nr_pinned_pages = 0;
 
 	if (!is_sync_kiocb(iocb))
 		ctx->iocb = iocb;
 
-	if (user_backed_iter(to))
-		ctx->should_dirty = true;
-
-	if (direct) {
-		ctx->pos = offset;
-		ctx->direct_io = true;
-		ctx->iter = *to;
-		ctx->len = len;
-	} else {
-		rc = setup_aio_ctx_iter(ctx, to, ITER_DEST);
-		if (rc) {
+	if (user_backed_iter(to)) {
+		/*
+		 * Extract IOVEC/UBUF-type iterators to a BVEC-type iterator as
+		 * they contain references to the calling process's virtual
+		 * memory layout which won't be available in an async worker
+		 * thread.  This also takes a ref or a pin on every folio
+		 * involved.
+		 */
+		rc = netfs_extract_user_iter(to, iov_iter_count(to),
+					     &ctx->iter, FOLL_DEST_BUF);
+		if (rc < 0) {
 			kref_put(&ctx->refcount, cifs_aio_ctx_release);
 			return rc;
 		}
-		len = ctx->len;
+
+		ctx->nr_pinned_pages = rc;
+		ctx->bv = (void *)ctx->iter.bvec;
+		ctx->bv_cleanup_mode =
+			iov_iter_extract_mode(&ctx->iter, FOLL_DEST_BUF);
+		ctx->should_dirty = true;
+	} else if ((iov_iter_is_bvec(to) || iov_iter_is_kvec(to)) &&
+		   !is_sync_kiocb(iocb)) {
+		/*
+		 * If the op is asynchronous, we need to copy the list attached
+		 * to a BVEC/KVEC-type iterator, but we assume that the storage
+		 * will be retained by the caller; in any case, we may or may
+		 * not be able to pin the pages, so we don't try.
+		 */
+		ctx->bv = (void *)dup_iter(&ctx->iter, to, GFP_KERNEL);
+		if (!ctx->bv) {
+			kref_put(&ctx->refcount, cifs_aio_ctx_release);
+			return -ENOMEM;
+		}
+	} else {
+		/*
+		 * Otherwise, we just pass the iterator down as-is and rely on
+		 * the caller to make sure the pages referred to by the
+		 * iterator don't evaporate.
+		 */
+		ctx->iter = *to;
 	}
 
 	if (direct) {
@@ -4646,6 +4901,8 @@ int cifs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	return rc;
 }
 
+#if 0 // TODO: Remove for iov_iter support
+
 static void
 cifs_readv_complete(struct work_struct *work)
 {
@@ -4776,19 +5033,74 @@ cifs_readpages_copy_into_pages(struct TCP_Server_Info *server,
 {
 	return readpages_fill_pages(server, rdata, iter, iter->count);
 }
+#endif
+
+/*
+ * Unlock a bunch of folios in the pagecache.
+ */
+static void cifs_unlock_folios(struct address_space *mapping, pgoff_t first, pgoff_t last)
+{
+	struct folio *folio;
+	XA_STATE(xas, &mapping->i_pages, first);
+
+	rcu_read_lock();
+	xas_for_each(&xas, folio, last) {
+		folio_unlock(folio);
+	}
+	rcu_read_unlock();
+}
+
+static void cifs_readahead_complete(struct work_struct *work)
+{
+	struct cifs_readdata *rdata = container_of(work,
+						   struct cifs_readdata, work);
+	struct folio *folio;
+	pgoff_t last;
+	bool good = rdata->result == 0 || (rdata->result == -EAGAIN && rdata->got_bytes);
+
+	XA_STATE(xas, &rdata->mapping->i_pages, rdata->offset / PAGE_SIZE);
+
+	if (good)
+		cifs_readahead_to_fscache(rdata->mapping->host,
+					  rdata->offset, rdata->bytes);
+
+	if (iov_iter_count(&rdata->iter) > 0)
+		iov_iter_zero(iov_iter_count(&rdata->iter), &rdata->iter);
+
+	last = (rdata->offset + rdata->bytes - 1) / PAGE_SIZE;
+
+	rcu_read_lock();
+	xas_for_each(&xas, folio, last) {
+		if (good) {
+			flush_dcache_folio(folio);
+			folio_mark_uptodate(folio);
+		}
+		folio_unlock(folio);
+	}
+	rcu_read_unlock();
+
+	kref_put(&rdata->refcount, cifs_readdata_release);
+}
 
 static void cifs_readahead(struct readahead_control *ractl)
 {
-	int rc;
 	struct cifsFileInfo *open_file = ractl->file->private_data;
 	struct cifs_sb_info *cifs_sb = CIFS_FILE_SB(ractl->file);
 	struct TCP_Server_Info *server;
-	pid_t pid;
-	unsigned int xid, nr_pages, last_batch_size = 0, cache_nr_pages = 0;
-	pgoff_t next_cached = ULONG_MAX;
+	unsigned int xid, nr_pages, cache_nr_pages = 0;
+	unsigned int ra_pages;
+	pgoff_t next_cached = ULONG_MAX, ra_index;
 	bool caching = fscache_cookie_enabled(cifs_inode_cookie(ractl->mapping->host)) &&
 		cifs_inode_cookie(ractl->mapping->host)->cache_priv;
 	bool check_cache = caching;
+	pid_t pid;
+	int rc = 0;
+
+	/* Note that readahead_count() lags behind our dequeuing of pages from
+	 * the ractl, so we have to keep track for ourselves.
+	 */
+	ra_pages = readahead_count(ractl);
+	ra_index = readahead_index(ractl);
 
 	xid = get_xid();
 
@@ -4797,22 +5109,21 @@ static void cifs_readahead(struct readahead_control *ractl)
 	else
 		pid = current->tgid;
 
-	rc = 0;
 	server = cifs_pick_channel(tlink_tcon(open_file->tlink)->ses);
 
 	cifs_dbg(FYI, "%s: file=%p mapping=%p num_pages=%u\n",
-		 __func__, ractl->file, ractl->mapping, readahead_count(ractl));
+		 __func__, ractl->file, ractl->mapping, ra_pages);
 
 	/*
 	 * Chop the readahead request up into rsize-sized read requests.
 	 */
-	while ((nr_pages = readahead_count(ractl) - last_batch_size)) {
-		unsigned int i, got, rsize;
-		struct page *page;
+	while ((nr_pages = ra_pages)) {
+		unsigned int i, rsize;
 		struct cifs_readdata *rdata;
 		struct cifs_credits credits_on_stack;
 		struct cifs_credits *credits = &credits_on_stack;
-		pgoff_t index = readahead_index(ractl) + last_batch_size;
+		struct folio *folio;
+		pgoff_t fsize;
 
 		/*
 		 * Find out if we have anything cached in the range of
@@ -4821,21 +5132,22 @@ static void cifs_readahead(struct readahead_control *ractl)
 		if (caching) {
 			if (check_cache) {
 				rc = cifs_fscache_query_occupancy(
-					ractl->mapping->host, index, nr_pages,
+					ractl->mapping->host, ra_index, nr_pages,
 					&next_cached, &cache_nr_pages);
 				if (rc < 0)
 					caching = false;
 				check_cache = false;
 			}
 
-			if (index == next_cached) {
+			if (ra_index == next_cached) {
 				/*
 				 * TODO: Send a whole batch of pages to be read
 				 * by the cache.
 				 */
-				struct folio *folio = readahead_folio(ractl);
-
-				last_batch_size = folio_nr_pages(folio);
+				folio = readahead_folio(ractl);
+				fsize = folio_nr_pages(folio);
+				ra_pages -= fsize;
+				ra_index += fsize;
 				if (cifs_readpage_from_fscache(ractl->mapping->host,
 							       &folio->page) < 0) {
 					/*
@@ -4846,8 +5158,8 @@ static void cifs_readahead(struct readahead_control *ractl)
 					caching = false;
 				}
 				folio_unlock(folio);
-				next_cached++;
-				cache_nr_pages--;
+				next_cached += fsize;
+				cache_nr_pages -= fsize;
 				if (cache_nr_pages == 0)
 					check_cache = true;
 				continue;
@@ -4872,8 +5184,9 @@ static void cifs_readahead(struct readahead_control *ractl)
 						   &rsize, credits);
 		if (rc)
 			break;
-		nr_pages = min_t(size_t, rsize / PAGE_SIZE, readahead_count(ractl));
-		nr_pages = min_t(size_t, nr_pages, next_cached - index);
+		nr_pages = min_t(size_t, rsize / PAGE_SIZE, ra_pages);
+		if (next_cached != ULONG_MAX)
+			nr_pages = min_t(size_t, nr_pages, next_cached - ra_index);
 
 		/*
 		 * Give up immediately if rsize is too small to read an entire
@@ -4886,33 +5199,31 @@ static void cifs_readahead(struct readahead_control *ractl)
 			break;
 		}
 
-		rdata = cifs_readdata_alloc(nr_pages, cifs_readv_complete);
+		rdata = cifs_readdata_alloc(cifs_readahead_complete);
 		if (!rdata) {
 			/* best to give up if we're out of mem */
 			add_credits_and_wake_if(server, credits, 0);
 			break;
 		}
 
-		got = __readahead_batch(ractl, rdata->pages, nr_pages);
-		if (got != nr_pages) {
-			pr_warn("__readahead_batch() returned %u/%u\n",
-				got, nr_pages);
-			nr_pages = got;
-		}
-
-		rdata->nr_pages = nr_pages;
-		rdata->bytes	= readahead_batch_length(ractl);
+		rdata->offset	= ra_index * PAGE_SIZE;
+		rdata->bytes	= nr_pages * PAGE_SIZE;
 		rdata->cfile	= cifsFileInfo_get(open_file);
 		rdata->server	= server;
 		rdata->mapping	= ractl->mapping;
-		rdata->offset	= readahead_pos(ractl);
 		rdata->pid	= pid;
-		rdata->pagesz	= PAGE_SIZE;
-		rdata->tailsz	= PAGE_SIZE;
-		rdata->read_into_pages = cifs_readpages_read_into_pages;
-		rdata->copy_into_pages = cifs_readpages_copy_into_pages;
 		rdata->credits	= credits_on_stack;
 
+		for (i = 0; i < nr_pages; i++) {
+			if (!readahead_folio(ractl))
+				WARN_ON(1);
+		}
+		ra_pages -= nr_pages;
+		ra_index += nr_pages;
+
+		iov_iter_xarray(&rdata->iter, ITER_DEST, &rdata->mapping->i_pages,
+				rdata->offset, rdata->bytes);
+
 		rc = adjust_credits(server, &rdata->credits, rdata->bytes);
 		if (!rc) {
 			if (rdata->cfile->invalidHandle)
@@ -4923,18 +5234,15 @@ static void cifs_readahead(struct readahead_control *ractl)
 
 		if (rc) {
 			add_credits_and_wake_if(server, &rdata->credits, 0);
-			for (i = 0; i < rdata->nr_pages; i++) {
-				page = rdata->pages[i];
-				unlock_page(page);
-				put_page(page);
-			}
+			cifs_unlock_folios(rdata->mapping,
+					   rdata->offset / PAGE_SIZE,
+					   (rdata->offset + rdata->bytes - 1) / PAGE_SIZE);
 			/* Fallback to the readpage in error/reconnect cases */
 			kref_put(&rdata->refcount, cifs_readdata_release);
 			break;
 		}
 
 		kref_put(&rdata->refcount, cifs_readdata_release);
-		last_batch_size = nr_pages;
 	}
 
 	free_xid(xid);
@@ -4976,10 +5284,6 @@ static int cifs_readpage_worker(struct file *file, struct page *page,
 
 	flush_dcache_page(page);
 	SetPageUptodate(page);
-
-	/* send this page to the cache */
-	cifs_readpage_to_fscache(file_inode(file), page);
-
 	rc = 0;
 
 io_error:
diff --git a/fs/cifs/fscache.c b/fs/cifs/fscache.c
index f6f3a6b75601..47c9f36c11fb 100644
--- a/fs/cifs/fscache.c
+++ b/fs/cifs/fscache.c
@@ -165,22 +165,16 @@ static int fscache_fallback_read_page(struct inode *inode, struct page *page)
 /*
  * Fallback page writing interface.
  */
-static int fscache_fallback_write_page(struct inode *inode, struct page *page,
-				       bool no_space_allocated_yet)
+static int fscache_fallback_write_pages(struct inode *inode, loff_t start, size_t len,
+					bool no_space_allocated_yet)
 {
 	struct netfs_cache_resources cres;
 	struct fscache_cookie *cookie = cifs_inode_cookie(inode);
 	struct iov_iter iter;
-	struct bio_vec bvec[1];
-	loff_t start = page_offset(page);
-	size_t len = PAGE_SIZE;
 	int ret;
 
 	memset(&cres, 0, sizeof(cres));
-	bvec[0].bv_page		= page;
-	bvec[0].bv_offset	= 0;
-	bvec[0].bv_len		= PAGE_SIZE;
-	iov_iter_bvec(&iter, ITER_SOURCE, bvec, ARRAY_SIZE(bvec), PAGE_SIZE);
+	iov_iter_xarray(&iter, ITER_SOURCE, &inode->i_mapping->i_pages, start, len);
 
 	ret = fscache_begin_write_operation(&cres, cookie);
 	if (ret < 0)
@@ -189,7 +183,7 @@ static int fscache_fallback_write_page(struct inode *inode, struct page *page,
 	ret = cres.ops->prepare_write(&cres, &start, &len, i_size_read(inode),
 				      no_space_allocated_yet);
 	if (ret == 0)
-		ret = fscache_write(&cres, page_offset(page), &iter, NULL, NULL);
+		ret = fscache_write(&cres, start, &iter, NULL, NULL);
 	fscache_end_operation(&cres);
 	return ret;
 }
@@ -213,12 +207,12 @@ int __cifs_readpage_from_fscache(struct inode *inode, struct page *page)
 	return 0;
 }
 
-void __cifs_readpage_to_fscache(struct inode *inode, struct page *page)
+void __cifs_readahead_to_fscache(struct inode *inode, loff_t pos, size_t len)
 {
-	cifs_dbg(FYI, "%s: (fsc: %p, p: %p, i: %p)\n",
-		 __func__, cifs_inode_cookie(inode), page, inode);
+	cifs_dbg(FYI, "%s: (fsc: %p, p: %llx, l: %zx, i: %p)\n",
+		 __func__, cifs_inode_cookie(inode), pos, len, inode);
 
-	fscache_fallback_write_page(inode, page, true);
+	fscache_fallback_write_pages(inode, pos, len, true);
 }
 
 /*
diff --git a/fs/cifs/fscache.h b/fs/cifs/fscache.h
index 67b601041f0a..173999610997 100644
--- a/fs/cifs/fscache.h
+++ b/fs/cifs/fscache.h
@@ -90,7 +90,7 @@ static inline int cifs_fscache_query_occupancy(struct inode *inode,
 }
 
 extern int __cifs_readpage_from_fscache(struct inode *pinode, struct page *ppage);
-extern void __cifs_readpage_to_fscache(struct inode *pinode, struct page *ppage);
+extern void __cifs_readahead_to_fscache(struct inode *pinode, loff_t pos, size_t len);
 
 
 static inline int cifs_readpage_from_fscache(struct inode *inode,
@@ -101,11 +101,11 @@ static inline int cifs_readpage_from_fscache(struct inode *inode,
 	return -ENOBUFS;
 }
 
-static inline void cifs_readpage_to_fscache(struct inode *inode,
-					    struct page *page)
+static inline void cifs_readahead_to_fscache(struct inode *inode,
+					     loff_t pos, size_t len)
 {
 	if (cifs_inode_cookie(inode))
-		__cifs_readpage_to_fscache(inode, page);
+		__cifs_readahead_to_fscache(inode, pos, len);
 }
 
 #else /* CONFIG_CIFS_FSCACHE */
@@ -141,7 +141,7 @@ cifs_readpage_from_fscache(struct inode *inode, struct page *page)
 }
 
 static inline
-void cifs_readpage_to_fscache(struct inode *inode, struct page *page) {}
+void cifs_readahead_to_fscache(struct inode *inode, loff_t pos, size_t len) {}
 
 #endif /* CONFIG_CIFS_FSCACHE */
 
diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
index 9655cf359ab9..a54a59a8e196 100644
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -966,16 +966,24 @@ cifs_aio_ctx_release(struct kref *refcount)
 
 	/*
 	 * ctx->bv is only set if setup_aio_ctx_iter() was call successfuly
-	 * which means that iov_iter_get_pages() was a success and thus that
-	 * we have taken reference on pages.
+	 * which means that iov_iter_extract_pages() was a success and thus
+	 * that we may have references or pins on pages that we need to
+	 * release.
 	 */
 	if (ctx->bv) {
-		unsigned i;
-
-		for (i = 0; i < ctx->npages; i++) {
-			if (ctx->should_dirty)
-				set_page_dirty(ctx->bv[i].bv_page);
-			put_page(ctx->bv[i].bv_page);
+		if (ctx->should_dirty || ctx->bv_cleanup_mode) {
+			unsigned i;
+
+			for (i = 0; i < ctx->nr_pinned_pages; i++) {
+				struct page *page = ctx->bv[i].bv_page;
+
+				if (ctx->should_dirty)
+					set_page_dirty(page);
+				if (ctx->bv_cleanup_mode & FOLL_PIN)
+					unpin_user_page(page);
+				if (ctx->bv_cleanup_mode & FOLL_GET)
+					put_page(page);
+			}
 		}
 		kvfree(ctx->bv);
 	}
@@ -983,96 +991,6 @@ cifs_aio_ctx_release(struct kref *refcount)
 	kfree(ctx);
 }
 
-#define CIFS_AIO_KMALLOC_LIMIT (1024 * 1024)
-
-int
-setup_aio_ctx_iter(struct cifs_aio_ctx *ctx, struct iov_iter *iter, int rw)
-{
-	ssize_t rc;
-	unsigned int cur_npages;
-	unsigned int npages = 0;
-	unsigned int i;
-	size_t len;
-	size_t count = iov_iter_count(iter);
-	unsigned int saved_len;
-	size_t start;
-	unsigned int max_pages = iov_iter_npages(iter, INT_MAX);
-	struct page **pages = NULL;
-	struct bio_vec *bv = NULL;
-
-	if (iov_iter_is_kvec(iter)) {
-		memcpy(&ctx->iter, iter, sizeof(*iter));
-		ctx->len = count;
-		iov_iter_advance(iter, count);
-		return 0;
-	}
-
-	if (array_size(max_pages, sizeof(*bv)) <= CIFS_AIO_KMALLOC_LIMIT)
-		bv = kmalloc_array(max_pages, sizeof(*bv), GFP_KERNEL);
-
-	if (!bv) {
-		bv = vmalloc(array_size(max_pages, sizeof(*bv)));
-		if (!bv)
-			return -ENOMEM;
-	}
-
-	if (array_size(max_pages, sizeof(*pages)) <= CIFS_AIO_KMALLOC_LIMIT)
-		pages = kmalloc_array(max_pages, sizeof(*pages), GFP_KERNEL);
-
-	if (!pages) {
-		pages = vmalloc(array_size(max_pages, sizeof(*pages)));
-		if (!pages) {
-			kvfree(bv);
-			return -ENOMEM;
-		}
-	}
-
-	saved_len = count;
-
-	while (count && npages < max_pages) {
-		rc = iov_iter_get_pages(iter, pages, count, max_pages, &start,
-					rw == WRITE ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
-		if (rc < 0) {
-			cifs_dbg(VFS, "Couldn't get user pages (rc=%zd)\n", rc);
-			break;
-		}
-
-		if (rc > count) {
-			cifs_dbg(VFS, "get pages rc=%zd more than %zu\n", rc,
-				 count);
-			break;
-		}
-
-		count -= rc;
-		rc += start;
-		cur_npages = DIV_ROUND_UP(rc, PAGE_SIZE);
-
-		if (npages + cur_npages > max_pages) {
-			cifs_dbg(VFS, "out of vec array capacity (%u vs %u)\n",
-				 npages + cur_npages, max_pages);
-			break;
-		}
-
-		for (i = 0; i < cur_npages; i++) {
-			len = rc > PAGE_SIZE ? PAGE_SIZE : rc;
-			bv[npages + i].bv_page = pages[i];
-			bv[npages + i].bv_offset = start;
-			bv[npages + i].bv_len = len - start;
-			rc -= len;
-			start = 0;
-		}
-
-		npages += cur_npages;
-	}
-
-	kvfree(pages);
-	ctx->bv = bv;
-	ctx->len = saved_len - count;
-	ctx->npages = npages;
-	iov_iter_bvec(&ctx->iter, rw, ctx->bv, npages, ctx->len);
-	return 0;
-}
-
 /**
  * cifs_alloc_hash - allocate hash and hash context together
  * @name: The name of the crypto hash algo
@@ -1130,25 +1048,6 @@ cifs_free_hash(struct shash_desc **sdesc)
 	*sdesc = NULL;
 }
 
-/**
- * rqst_page_get_length - obtain the length and offset for a page in smb_rqst
- * @rqst: The request descriptor
- * @page: The index of the page to query
- * @len: Where to store the length for this page:
- * @offset: Where to store the offset for this page
- */
-void rqst_page_get_length(const struct smb_rqst *rqst, unsigned int page,
-			  unsigned int *len, unsigned int *offset)
-{
-	*len = rqst->rq_pagesz;
-	*offset = (page == 0) ? rqst->rq_offset : 0;
-
-	if (rqst->rq_npages == 1 || page == rqst->rq_npages-1)
-		*len = rqst->rq_tailsz;
-	else if (page == 0)
-		*len = rqst->rq_pagesz - rqst->rq_offset;
-}
-
 void extract_unc_hostname(const char *unc, const char **h, size_t *len)
 {
 	const char *end;
diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index dc160de7a6de..387effcb905d 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -4226,7 +4226,7 @@ fill_transform_hdr(struct smb2_transform_hdr *tr_hdr, unsigned int orig_len,
 
 static void *smb2_aead_req_alloc(struct crypto_aead *tfm, const struct smb_rqst *rqst,
 				 int num_rqst, const u8 *sig, u8 **iv,
-				 struct aead_request **req, struct scatterlist **sgl,
+				 struct aead_request **req, struct sg_table *sgt,
 				 unsigned int *num_sgs)
 {
 	unsigned int req_size = sizeof(**req) + crypto_aead_reqsize(tfm);
@@ -4235,70 +4235,68 @@ static void *smb2_aead_req_alloc(struct crypto_aead *tfm, const struct smb_rqst
 	u8 *p;
 
 	*num_sgs = cifs_get_num_sgs(rqst, num_rqst, sig);
+	if (IS_ERR_VALUE((long)(int)*num_sgs))
+		return ERR_PTR(*num_sgs);
 
 	len = iv_size;
 	len += crypto_aead_alignmask(tfm) & ~(crypto_tfm_ctx_alignment() - 1);
 	len = ALIGN(len, crypto_tfm_ctx_alignment());
 	len += req_size;
 	len = ALIGN(len, __alignof__(struct scatterlist));
-	len += *num_sgs * sizeof(**sgl);
+	len += array_size(*num_sgs, sizeof(struct scatterlist));
 
-	p = kmalloc(len, GFP_ATOMIC);
+	p = kvzalloc(len, GFP_NOFS);
 	if (!p)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	*iv = (u8 *)PTR_ALIGN(p, crypto_aead_alignmask(tfm) + 1);
 	*req = (struct aead_request *)PTR_ALIGN(*iv + iv_size,
 						crypto_tfm_ctx_alignment());
-	*sgl = (struct scatterlist *)PTR_ALIGN((u8 *)*req + req_size,
-					       __alignof__(struct scatterlist));
+	sgt->sgl = (struct scatterlist *)PTR_ALIGN((u8 *)*req + req_size,
+						   __alignof__(struct scatterlist));
 	return p;
 }
 
-static void *smb2_get_aead_req(struct crypto_aead *tfm, const struct smb_rqst *rqst,
+static void *smb2_get_aead_req(struct crypto_aead *tfm, struct smb_rqst *rqst,
 			       int num_rqst, const u8 *sig, u8 **iv,
 			       struct aead_request **req, struct scatterlist **sgl)
 {
-	unsigned int off, len, skip;
-	struct scatterlist *sg;
-	unsigned int num_sgs;
-	unsigned long addr;
-	int i, j;
+	struct sg_table sgtable = {};
+	unsigned int skip, num_sgs, i, j;
+	ssize_t rc;
 	void *p;
 
-	p = smb2_aead_req_alloc(tfm, rqst, num_rqst, sig, iv, req, sgl, &num_sgs);
-	if (!p)
-		return NULL;
+	p = smb2_aead_req_alloc(tfm, rqst, num_rqst, sig, iv, req, &sgtable, &num_sgs);
+	if (IS_ERR(p))
+		return ERR_CAST(p);
 
-	sg_init_table(*sgl, num_sgs);
-	sg = *sgl;
+	sg_init_marker(sgtable.sgl, num_sgs);
 
-	/* Assumes the first rqst has a transform header as the first iov.
-	 * I.e.
-	 * rqst[0].rq_iov[0]  is transform header
-	 * rqst[0].rq_iov[1+] data to be encrypted/decrypted
-	 * rqst[1+].rq_iov[0+] data to be encrypted/decrypted
-	 */
 	for (i = 0; i < num_rqst; i++) {
-		/*
-		 * The first rqst has a transform header where the
-		 * first 20 bytes are not part of the encrypted blob.
-		 */
-		for (j = 0; j < rqst[i].rq_nvec; j++) {
-			struct kvec *iov = &rqst[i].rq_iov[j];
+		struct iov_iter *iter = &rqst[i].rq_iter;
+		size_t count = iov_iter_count(iter);
 
+		for (j = 0; j < rqst[i].rq_nvec; j++) {
+			/*
+			 * The first rqst has a transform header where the
+			 * first 20 bytes are not part of the encrypted blob
+			 */
 			skip = (i == 0) && (j == 0) ? 20 : 0;
-			addr = (unsigned long)iov->iov_base + skip;
-			len = iov->iov_len - skip;
-			sg = cifs_sg_set_buf(sg, (void *)addr, len);
-		}
-		for (j = 0; j < rqst[i].rq_npages; j++) {
-			rqst_page_get_length(&rqst[i], j, &len, &off);
-			sg_set_page(sg++, rqst[i].rq_pages[j], len, off);
+			cifs_sg_set_buf(&sgtable,
+					rqst[i].rq_iov[j].iov_base + skip,
+					rqst[i].rq_iov[j].iov_len - skip);
 		}
+		sgtable.orig_nents = sgtable.nents;
+
+		rc = netfs_extract_iter_to_sg(iter, count, &sgtable,
+					      num_sgs - sgtable.nents,
+					      FOLL_DEST_BUF);
+		iov_iter_revert(iter, rc);
+		sgtable.orig_nents = sgtable.nents;
 	}
-	cifs_sg_set_buf(sg, sig, SMB2_SIGNATURE_SIZE);
 
+	cifs_sg_set_buf(&sgtable, sig, SMB2_SIGNATURE_SIZE);
+	sg_mark_end(&sgtable.sgl[sgtable.nents - 1]);
 	return p;
 }
 
@@ -4386,8 +4384,8 @@ crypt_message(struct TCP_Server_Info *server, int num_rqst,
 	}
 
 	creq = smb2_get_aead_req(tfm, rqst, num_rqst, sign, &iv, &req, &sg);
-	if (unlikely(!creq))
-		return -ENOMEM;
+	if (unlikely(IS_ERR(creq)))
+		return PTR_ERR(creq);
 
 	if (!enc) {
 		memcpy(sign, &tr_hdr->Signature, SMB2_SIGNATURE_SIZE);
@@ -4419,18 +4417,31 @@ crypt_message(struct TCP_Server_Info *server, int num_rqst,
 	return rc;
 }
 
+/*
+ * Clear a read buffer, discarding the folios which have XA_MARK_0 set.
+ */
+static void cifs_clear_xarray_buffer(struct xarray *buffer)
+{
+	struct folio *folio;
+
+	XA_STATE(xas, buffer, 0);
+
+	rcu_read_lock();
+	xas_for_each_marked(&xas, folio, ULONG_MAX, XA_MARK_0) {
+		folio_put(folio);
+	}
+	rcu_read_unlock();
+	xa_destroy(buffer);
+}
+
 void
 smb3_free_compound_rqst(int num_rqst, struct smb_rqst *rqst)
 {
-	int i, j;
+	int i;
 
-	for (i = 0; i < num_rqst; i++) {
-		if (rqst[i].rq_pages) {
-			for (j = rqst[i].rq_npages - 1; j >= 0; j--)
-				put_page(rqst[i].rq_pages[j]);
-			kfree(rqst[i].rq_pages);
-		}
-	}
+	for (i = 0; i < num_rqst; i++)
+		if (!xa_empty(&rqst[i].rq_buffer))
+			cifs_clear_xarray_buffer(&rqst[i].rq_buffer);
 }
 
 /*
@@ -4450,9 +4461,8 @@ static int
 smb3_init_transform_rq(struct TCP_Server_Info *server, int num_rqst,
 		       struct smb_rqst *new_rq, struct smb_rqst *old_rq)
 {
-	struct page **pages;
 	struct smb2_transform_hdr *tr_hdr = new_rq[0].rq_iov[0].iov_base;
-	unsigned int npages;
+	struct page *page;
 	unsigned int orig_len = 0;
 	int i, j;
 	int rc = -ENOMEM;
@@ -4460,45 +4470,42 @@ smb3_init_transform_rq(struct TCP_Server_Info *server, int num_rqst,
 	for (i = 1; i < num_rqst; i++) {
 		struct smb_rqst *old = &old_rq[i - 1];
 		struct smb_rqst *new = &new_rq[i];
+		struct xarray *buffer = &new->rq_buffer;
+		size_t size = iov_iter_count(&old->rq_iter), seg, copied = 0;
 
 		orig_len += smb_rqst_len(server, old);
 		new->rq_iov = old->rq_iov;
 		new->rq_nvec = old->rq_nvec;
 
-		npages = old->rq_npages;
-		if (!npages)
-			continue;
-
-		pages = kmalloc_array(npages, sizeof(struct page *),
-				      GFP_KERNEL);
-		if (!pages)
-			goto err_free;
-
-		new->rq_pages = pages;
-		new->rq_npages = npages;
-		new->rq_offset = old->rq_offset;
-		new->rq_pagesz = old->rq_pagesz;
-		new->rq_tailsz = old->rq_tailsz;
-
-		for (j = 0; j < npages; j++) {
-			pages[j] = alloc_page(GFP_KERNEL|__GFP_HIGHMEM);
-			if (!pages[j])
-				goto err_free;
-		}
+		xa_init(buffer);
 
-		/* copy pages form the old */
-		for (j = 0; j < npages; j++) {
-			char *dst, *src;
-			unsigned int offset, len;
+		if (size > 0) {
+			unsigned int npages = DIV_ROUND_UP(size, PAGE_SIZE);
 
-			rqst_page_get_length(new, j, &len, &offset);
+			for (j = 0; j < npages; j++) {
+				void *o;
 
-			dst = kmap_local_page(new->rq_pages[j]) + offset;
-			src = kmap_local_page(old->rq_pages[j]) + offset;
+				rc = -ENOMEM;
+				page = alloc_page(GFP_KERNEL|__GFP_HIGHMEM);
+				if (!page)
+					goto err_free;
+				page->index = j;
+				o = xa_store(buffer, j, page, GFP_KERNEL);
+				if (xa_is_err(o)) {
+					rc = xa_err(o);
+					put_page(page);
+					goto err_free;
+				}
 
-			memcpy(dst, src, len);
-			kunmap(new->rq_pages[j]);
-			kunmap(old->rq_pages[j]);
+				seg = min_t(size_t, size - copied, PAGE_SIZE);
+				if (copy_page_from_iter(page, 0, seg, &old->rq_iter) != seg) {
+					rc = -EFAULT;
+					goto err_free;
+				}
+				copied += seg;
+			}
+			iov_iter_xarray(&new->rq_iter, ITER_SOURCE,
+					buffer, 0, size);
 		}
 	}
 
@@ -4527,12 +4534,12 @@ smb3_is_transform_hdr(void *buf)
 
 static int
 decrypt_raw_data(struct TCP_Server_Info *server, char *buf,
-		 unsigned int buf_data_size, struct page **pages,
-		 unsigned int npages, unsigned int page_data_size,
+		 unsigned int buf_data_size, struct iov_iter *iter,
 		 bool is_offloaded)
 {
 	struct kvec iov[2];
 	struct smb_rqst rqst = {NULL};
+	size_t iter_size = 0;
 	int rc;
 
 	iov[0].iov_base = buf;
@@ -4542,10 +4549,10 @@ decrypt_raw_data(struct TCP_Server_Info *server, char *buf,
 
 	rqst.rq_iov = iov;
 	rqst.rq_nvec = 2;
-	rqst.rq_pages = pages;
-	rqst.rq_npages = npages;
-	rqst.rq_pagesz = PAGE_SIZE;
-	rqst.rq_tailsz = (page_data_size % PAGE_SIZE) ? : PAGE_SIZE;
+	if (iter) {
+		rqst.rq_iter = *iter;
+		iter_size = iov_iter_count(iter);
+	}
 
 	rc = crypt_message(server, 1, &rqst, 0);
 	cifs_dbg(FYI, "Decrypt message returned %d\n", rc);
@@ -4556,73 +4563,37 @@ decrypt_raw_data(struct TCP_Server_Info *server, char *buf,
 	memmove(buf, iov[1].iov_base, buf_data_size);
 
 	if (!is_offloaded)
-		server->total_read = buf_data_size + page_data_size;
+		server->total_read = buf_data_size + iter_size;
 
 	return rc;
 }
 
 static int
-read_data_into_pages(struct TCP_Server_Info *server, struct page **pages,
-		     unsigned int npages, unsigned int len)
+cifs_copy_pages_to_iter(struct xarray *pages, unsigned int data_size,
+			unsigned int skip, struct iov_iter *iter)
 {
-	int i;
-	int length;
+	struct page *page;
+	unsigned long index;
 
-	for (i = 0; i < npages; i++) {
-		struct page *page = pages[i];
-		size_t n;
+	xa_for_each(pages, index, page) {
+		size_t n, len = min_t(unsigned int, PAGE_SIZE - skip, data_size);
 
-		n = len;
-		if (len >= PAGE_SIZE) {
-			/* enough data to fill the page */
-			n = PAGE_SIZE;
-			len -= n;
-		} else {
-			zero_user(page, len, PAGE_SIZE - len);
-			len = 0;
+		n = copy_page_to_iter(page, skip, len, iter);
+		if (n != len) {
+			cifs_dbg(VFS, "%s: something went wrong\n", __func__);
+			return -EIO;
 		}
-		length = cifs_read_page_from_socket(server, page, 0, n);
-		if (length < 0)
-			return length;
-		server->total_read += length;
+		data_size -= n;
+		skip = 0;
 	}
 
 	return 0;
 }
 
-static int
-init_read_bvec(struct page **pages, unsigned int npages, unsigned int data_size,
-	       unsigned int cur_off, struct bio_vec **page_vec)
-{
-	struct bio_vec *bvec;
-	int i;
-
-	bvec = kcalloc(npages, sizeof(struct bio_vec), GFP_KERNEL);
-	if (!bvec)
-		return -ENOMEM;
-
-	for (i = 0; i < npages; i++) {
-		bvec[i].bv_page = pages[i];
-		bvec[i].bv_offset = (i == 0) ? cur_off : 0;
-		bvec[i].bv_len = min_t(unsigned int, PAGE_SIZE, data_size);
-		data_size -= bvec[i].bv_len;
-	}
-
-	if (data_size != 0) {
-		cifs_dbg(VFS, "%s: something went wrong\n", __func__);
-		kfree(bvec);
-		return -EIO;
-	}
-
-	*page_vec = bvec;
-	return 0;
-}
-
 static int
 handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
-		 char *buf, unsigned int buf_len, struct page **pages,
-		 unsigned int npages, unsigned int page_data_size,
-		 bool is_offloaded)
+		 char *buf, unsigned int buf_len, struct xarray *pages,
+		 unsigned int pages_len, bool is_offloaded)
 {
 	unsigned int data_offset;
 	unsigned int data_len;
@@ -4631,9 +4602,6 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
 	unsigned int pad_len;
 	struct cifs_readdata *rdata = mid->callback_data;
 	struct smb2_hdr *shdr = (struct smb2_hdr *)buf;
-	struct bio_vec *bvec = NULL;
-	struct iov_iter iter;
-	struct kvec iov;
 	int length;
 	bool use_rdma_mr = false;
 
@@ -4722,7 +4690,7 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
 			return 0;
 		}
 
-		if (data_len > page_data_size - pad_len) {
+		if (data_len > pages_len - pad_len) {
 			/* data_len is corrupt -- discard frame */
 			rdata->result = -EIO;
 			if (is_offloaded)
@@ -4732,8 +4700,9 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
 			return 0;
 		}
 
-		rdata->result = init_read_bvec(pages, npages, page_data_size,
-					       cur_off, &bvec);
+		/* Copy the data to the output I/O iterator. */
+		rdata->result = cifs_copy_pages_to_iter(pages, pages_len,
+							cur_off, &rdata->iter);
 		if (rdata->result != 0) {
 			if (is_offloaded)
 				mid->mid_state = MID_RESPONSE_MALFORMED;
@@ -4741,14 +4710,16 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
 				dequeue_mid(mid, rdata->result);
 			return 0;
 		}
+		rdata->got_bytes = pages_len;
 
-		iov_iter_bvec(&iter, ITER_SOURCE, bvec, npages, data_len);
 	} else if (buf_len >= data_offset + data_len) {
 		/* read response payload is in buf */
-		WARN_ONCE(npages > 0, "read data can be either in buf or in pages");
-		iov.iov_base = buf + data_offset;
-		iov.iov_len = data_len;
-		iov_iter_kvec(&iter, ITER_SOURCE, &iov, 1, data_len);
+		WARN_ONCE(pages && !xa_empty(pages),
+			  "read data can be either in buf or in pages");
+		length = copy_to_iter(buf + data_offset, data_len, &rdata->iter);
+		if (length < 0)
+			return length;
+		rdata->got_bytes = data_len;
 	} else {
 		/* read response payload cannot be in both buf and pages */
 		WARN_ONCE(1, "buf can not contain only a part of read data");
@@ -4760,26 +4731,18 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
 		return 0;
 	}
 
-	length = rdata->copy_into_pages(server, rdata, &iter);
-
-	kfree(bvec);
-
-	if (length < 0)
-		return length;
-
 	if (is_offloaded)
 		mid->mid_state = MID_RESPONSE_RECEIVED;
 	else
 		dequeue_mid(mid, false);
-	return length;
+	return 0;
 }
 
 struct smb2_decrypt_work {
 	struct work_struct decrypt;
 	struct TCP_Server_Info *server;
-	struct page **ppages;
+	struct xarray buffer;
 	char *buf;
-	unsigned int npages;
 	unsigned int len;
 };
 
@@ -4788,11 +4751,13 @@ static void smb2_decrypt_offload(struct work_struct *work)
 {
 	struct smb2_decrypt_work *dw = container_of(work,
 				struct smb2_decrypt_work, decrypt);
-	int i, rc;
+	int rc;
 	struct mid_q_entry *mid;
+	struct iov_iter iter;
 
+	iov_iter_xarray(&iter, ITER_DEST, &dw->buffer, 0, dw->len);
 	rc = decrypt_raw_data(dw->server, dw->buf, dw->server->vals->read_rsp_size,
-			      dw->ppages, dw->npages, dw->len, true);
+			      &iter, true);
 	if (rc) {
 		cifs_dbg(VFS, "error decrypting rc=%d\n", rc);
 		goto free_pages;
@@ -4806,7 +4771,7 @@ static void smb2_decrypt_offload(struct work_struct *work)
 		mid->decrypted = true;
 		rc = handle_read_data(dw->server, mid, dw->buf,
 				      dw->server->vals->read_rsp_size,
-				      dw->ppages, dw->npages, dw->len,
+				      &dw->buffer, dw->len,
 				      true);
 		if (rc >= 0) {
 #ifdef CONFIG_CIFS_STATS2
@@ -4839,10 +4804,7 @@ static void smb2_decrypt_offload(struct work_struct *work)
 	}
 
 free_pages:
-	for (i = dw->npages-1; i >= 0; i--)
-		put_page(dw->ppages[i]);
-
-	kfree(dw->ppages);
+	cifs_clear_xarray_buffer(&dw->buffer);
 	cifs_small_buf_release(dw->buf);
 	kfree(dw);
 }
@@ -4852,47 +4814,65 @@ static int
 receive_encrypted_read(struct TCP_Server_Info *server, struct mid_q_entry **mid,
 		       int *num_mids)
 {
+	struct page *page;
 	char *buf = server->smallbuf;
 	struct smb2_transform_hdr *tr_hdr = (struct smb2_transform_hdr *)buf;
-	unsigned int npages;
-	struct page **pages;
-	unsigned int len;
+	struct iov_iter iter;
+	unsigned int len, npages;
 	unsigned int buflen = server->pdu_size;
 	int rc;
 	int i = 0;
 	struct smb2_decrypt_work *dw;
 
+	dw = kzalloc(sizeof(struct smb2_decrypt_work), GFP_KERNEL);
+	if (!dw)
+		return -ENOMEM;
+	xa_init(&dw->buffer);
+	INIT_WORK(&dw->decrypt, smb2_decrypt_offload);
+	dw->server = server;
+
 	*num_mids = 1;
 	len = min_t(unsigned int, buflen, server->vals->read_rsp_size +
 		sizeof(struct smb2_transform_hdr)) - HEADER_SIZE(server) + 1;
 
 	rc = cifs_read_from_socket(server, buf + HEADER_SIZE(server) - 1, len);
 	if (rc < 0)
-		return rc;
+		goto free_dw;
 	server->total_read += rc;
 
 	len = le32_to_cpu(tr_hdr->OriginalMessageSize) -
 		server->vals->read_rsp_size;
+	dw->len = len;
 	npages = DIV_ROUND_UP(len, PAGE_SIZE);
 
-	pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
-	if (!pages) {
-		rc = -ENOMEM;
-		goto discard_data;
-	}
-
+	rc = -ENOMEM;
 	for (; i < npages; i++) {
-		pages[i] = alloc_page(GFP_KERNEL|__GFP_HIGHMEM);
-		if (!pages[i]) {
-			rc = -ENOMEM;
+		void *old;
+
+		page = alloc_page(GFP_KERNEL|__GFP_HIGHMEM);
+		if (!page)
+			goto discard_data;
+		page->index = i;
+		old = xa_store(&dw->buffer, i, page, GFP_KERNEL);
+		if (xa_is_err(old)) {
+			rc = xa_err(old);
+			put_page(page);
 			goto discard_data;
 		}
 	}
 
-	/* read read data into pages */
-	rc = read_data_into_pages(server, pages, npages, len);
-	if (rc)
-		goto free_pages;
+	iov_iter_xarray(&iter, ITER_DEST, &dw->buffer, 0, npages * PAGE_SIZE);
+
+	/* Read the data into the buffer and clear excess bufferage. */
+	rc = cifs_read_iter_from_socket(server, &iter, dw->len);
+	if (rc < 0)
+		goto discard_data;
+
+	server->total_read += rc;
+	if (rc < npages * PAGE_SIZE)
+		iov_iter_zero(npages * PAGE_SIZE - rc, &iter);
+	iov_iter_revert(&iter, npages * PAGE_SIZE);
+	iov_iter_truncate(&iter, dw->len);
 
 	rc = cifs_discard_remaining_data(server);
 	if (rc)
@@ -4905,39 +4885,28 @@ receive_encrypted_read(struct TCP_Server_Info *server, struct mid_q_entry **mid,
 
 	if ((server->min_offload) && (server->in_flight > 1) &&
 	    (server->pdu_size >= server->min_offload)) {
-		dw = kmalloc(sizeof(struct smb2_decrypt_work), GFP_KERNEL);
-		if (dw == NULL)
-			goto non_offloaded_decrypt;
-
 		dw->buf = server->smallbuf;
 		server->smallbuf = (char *)cifs_small_buf_get();
 
-		INIT_WORK(&dw->decrypt, smb2_decrypt_offload);
-
-		dw->npages = npages;
-		dw->server = server;
-		dw->ppages = pages;
-		dw->len = len;
 		queue_work(decrypt_wq, &dw->decrypt);
 		*num_mids = 0; /* worker thread takes care of finding mid */
 		return -1;
 	}
 
-non_offloaded_decrypt:
 	rc = decrypt_raw_data(server, buf, server->vals->read_rsp_size,
-			      pages, npages, len, false);
+			      &iter, false);
 	if (rc)
 		goto free_pages;
 
 	*mid = smb2_find_mid(server, buf);
-	if (*mid == NULL)
+	if (*mid == NULL) {
 		cifs_dbg(FYI, "mid not found\n");
-	else {
+	} else {
 		cifs_dbg(FYI, "mid found\n");
 		(*mid)->decrypted = true;
 		rc = handle_read_data(server, *mid, buf,
 				      server->vals->read_rsp_size,
-				      pages, npages, len, false);
+				      &dw->buffer, dw->len, false);
 		if (rc >= 0) {
 			if (server->ops->is_network_name_deleted) {
 				server->ops->is_network_name_deleted(buf,
@@ -4947,9 +4916,9 @@ receive_encrypted_read(struct TCP_Server_Info *server, struct mid_q_entry **mid,
 	}
 
 free_pages:
-	for (i = i - 1; i >= 0; i--)
-		put_page(pages[i]);
-	kfree(pages);
+	cifs_clear_xarray_buffer(&dw->buffer);
+free_dw:
+	kfree(dw);
 	return rc;
 discard_data:
 	cifs_discard_remaining_data(server);
@@ -4987,7 +4956,7 @@ receive_encrypted_standard(struct TCP_Server_Info *server,
 	server->total_read += length;
 
 	buf_size = pdu_length - sizeof(struct smb2_transform_hdr);
-	length = decrypt_raw_data(server, buf, buf_size, NULL, 0, 0, false);
+	length = decrypt_raw_data(server, buf, buf_size, NULL, false);
 	if (length)
 		return length;
 
@@ -5086,7 +5055,7 @@ smb3_handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid)
 	char *buf = server->large_buf ? server->bigbuf : server->smallbuf;
 
 	return handle_read_data(server, mid, buf, server->pdu_size,
-				NULL, 0, 0, false);
+				NULL, 0, false);
 }
 
 static int
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index a5695748a89b..66b76636660f 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -4096,10 +4096,8 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
 		struct smbd_buffer_descriptor_v1 *v1;
 		bool need_invalidate = server->dialect == SMB30_PROT_ID;
 
-		rdata->mr = smbd_register_mr(
-				server->smbd_conn, rdata->pages,
-				rdata->nr_pages, rdata->page_offset,
-				rdata->tailsz, true, need_invalidate);
+		rdata->mr = smbd_register_mr(server->smbd_conn, &rdata->iter,
+					     true, need_invalidate);
 		if (!rdata->mr)
 			return -EAGAIN;
 
@@ -4157,11 +4155,7 @@ smb2_readv_callback(struct mid_q_entry *mid)
 	struct cifs_credits credits = { .value = 0, .instance = 0 };
 	struct smb_rqst rqst = { .rq_iov = &rdata->iov[1],
 				 .rq_nvec = 1,
-				 .rq_pages = rdata->pages,
-				 .rq_offset = rdata->page_offset,
-				 .rq_npages = rdata->nr_pages,
-				 .rq_pagesz = rdata->pagesz,
-				 .rq_tailsz = rdata->tailsz };
+				 .rq_iter = rdata->iter };
 
 	WARN_ONCE(rdata->server != mid->server,
 		  "rdata server %p != mid server %p",
@@ -4179,6 +4173,7 @@ smb2_readv_callback(struct mid_q_entry *mid)
 		if (server->sign && !mid->decrypted) {
 			int rc;
 
+			iov_iter_truncate(&rqst.rq_iter, rdata->got_bytes);
 			rc = smb2_verify_signature(&rqst, server);
 			if (rc)
 				cifs_tcon_dbg(VFS, "SMB signature verification returned error = %d\n",
@@ -4504,7 +4499,7 @@ smb2_async_writev(struct cifs_writedata *wdata,
 	req->VolatileFileId = wdata->cfile->fid.volatile_fid;
 	req->WriteChannelInfoOffset = 0;
 	req->WriteChannelInfoLength = 0;
-	req->Channel = 0;
+	req->Channel = SMB2_CHANNEL_NONE;
 	req->Offset = cpu_to_le64(wdata->offset);
 	req->DataOffset = cpu_to_le16(
 				offsetof(struct smb2_write_req, Buffer));
@@ -4521,26 +4516,18 @@ smb2_async_writev(struct cifs_writedata *wdata,
 		server->smbd_conn->rdma_readwrite_threshold) {
 
 		struct smbd_buffer_descriptor_v1 *v1;
+		size_t data_size = iov_iter_count(&wdata->iter);
 		bool need_invalidate = server->dialect == SMB30_PROT_ID;
 
-		wdata->mr = smbd_register_mr(
-				server->smbd_conn, wdata->pages,
-				wdata->nr_pages, wdata->page_offset,
-				wdata->tailsz, false, need_invalidate);
+		wdata->mr = smbd_register_mr(server->smbd_conn, &wdata->iter,
+					     false, need_invalidate);
 		if (!wdata->mr) {
 			rc = -EAGAIN;
 			goto async_writev_out;
 		}
 		req->Length = 0;
 		req->DataOffset = 0;
-		if (wdata->nr_pages > 1)
-			req->RemainingBytes =
-				cpu_to_le32(
-					(wdata->nr_pages - 1) * wdata->pagesz -
-					wdata->page_offset + wdata->tailsz
-				);
-		else
-			req->RemainingBytes = cpu_to_le32(wdata->tailsz);
+		req->RemainingBytes = cpu_to_le32(data_size);
 		req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
 		if (need_invalidate)
 			req->Channel = SMB2_CHANNEL_RDMA_V1;
@@ -4559,19 +4546,13 @@ smb2_async_writev(struct cifs_writedata *wdata,
 
 	rqst.rq_iov = iov;
 	rqst.rq_nvec = 1;
-	rqst.rq_pages = wdata->pages;
-	rqst.rq_offset = wdata->page_offset;
-	rqst.rq_npages = wdata->nr_pages;
-	rqst.rq_pagesz = wdata->pagesz;
-	rqst.rq_tailsz = wdata->tailsz;
+	rqst.rq_iter = wdata->iter;
 #ifdef CONFIG_CIFS_SMB_DIRECT
-	if (wdata->mr) {
+	if (wdata->mr)
 		iov[0].iov_len += sizeof(struct smbd_buffer_descriptor_v1);
-		rqst.rq_npages = 0;
-	}
 #endif
-	cifs_dbg(FYI, "async write at %llu %u bytes\n",
-		 wdata->offset, wdata->bytes);
+	cifs_dbg(FYI, "async write at %llu %u bytes iter=%zx\n",
+		 wdata->offset, wdata->bytes, iov_iter_count(&rqst.rq_iter));
 
 #ifdef CONFIG_CIFS_SMB_DIRECT
 	/* For RDMA read, I/O size is in RemainingBytes not in Length */
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 78a76752fafd..8bd320f0156e 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -34,12 +34,6 @@ static int smbd_post_recv(
 		struct smbd_response *response);
 
 static int smbd_post_send_empty(struct smbd_connection *info);
-static int smbd_post_send_data(
-		struct smbd_connection *info,
-		struct kvec *iov, int n_vec, int remaining_data_length);
-static int smbd_post_send_page(struct smbd_connection *info,
-		struct page *page, unsigned long offset,
-		size_t size, int remaining_data_length);
 
 static void destroy_mr_list(struct smbd_connection *info);
 static int allocate_mr_list(struct smbd_connection *info);
@@ -986,24 +980,6 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 	return rc;
 }
 
-/*
- * Send a page
- * page: the page to send
- * offset: offset in the page to send
- * size: length in the page to send
- * remaining_data_length: remaining data to send in this payload
- */
-static int smbd_post_send_page(struct smbd_connection *info, struct page *page,
-		unsigned long offset, size_t size, int remaining_data_length)
-{
-	struct scatterlist sgl;
-
-	sg_init_table(&sgl, 1);
-	sg_set_page(&sgl, page, size, offset);
-
-	return smbd_post_send_sgl(info, &sgl, size, remaining_data_length);
-}
-
 /*
  * Send an empty message
  * Empty message is used to extend credits to peer to for keep live
@@ -1015,35 +991,6 @@ static int smbd_post_send_empty(struct smbd_connection *info)
 	return smbd_post_send_sgl(info, NULL, 0, 0);
 }
 
-/*
- * Send a data buffer
- * iov: the iov array describing the data buffers
- * n_vec: number of iov array
- * remaining_data_length: remaining data to send following this packet
- * in segmented SMBD packet
- */
-static int smbd_post_send_data(
-	struct smbd_connection *info, struct kvec *iov, int n_vec,
-	int remaining_data_length)
-{
-	int i;
-	u32 data_length = 0;
-	struct scatterlist sgl[SMBDIRECT_MAX_SEND_SGE - 1];
-
-	if (n_vec > SMBDIRECT_MAX_SEND_SGE - 1) {
-		cifs_dbg(VFS, "Can't fit data to SGL, n_vec=%d\n", n_vec);
-		return -EINVAL;
-	}
-
-	sg_init_table(sgl, n_vec);
-	for (i = 0; i < n_vec; i++) {
-		data_length += iov[i].iov_len;
-		sg_set_buf(&sgl[i], iov[i].iov_base, iov[i].iov_len);
-	}
-
-	return smbd_post_send_sgl(info, sgl, data_length, remaining_data_length);
-}
-
 /*
  * Post a receive request to the transport
  * The remote peer can only send data when a receive request is posted
@@ -1976,6 +1923,42 @@ int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
 	return rc;
 }
 
+/*
+ * Send the contents of an iterator
+ * @iter: The iterator to send
+ * @_remaining_data_length: remaining data to send in this payload
+ */
+static int smbd_post_send_iter(struct smbd_connection *info,
+			       struct iov_iter *iter,
+			       int *_remaining_data_length)
+{
+	struct scatterlist sgl[SMBDIRECT_MAX_SEND_SGE - 1];
+	unsigned int max_payload = info->max_send_size - sizeof(struct smbd_data_transfer);
+	unsigned int cleanup_mode;
+	ssize_t rc;
+
+	do {
+		struct sg_table sgtable = { .sgl = sgl };
+		size_t maxlen = min_t(size_t, *_remaining_data_length, max_payload);
+
+		sg_init_table(sgtable.sgl, ARRAY_SIZE(sgl));
+		rc = netfs_extract_iter_to_sg(iter, maxlen,
+					      &sgtable, ARRAY_SIZE(sgl),
+					      &cleanup_mode);
+		if (rc < 0)
+			break;
+		if (WARN_ON_ONCE(sgtable.nents == 0))
+			return -EIO;
+		WARN_ON(cleanup_mode != 0);
+
+		sg_mark_end(&sgl[sgtable.nents - 1]);
+		*_remaining_data_length -= rc;
+		rc = smbd_post_send_sgl(info, sgl, rc, *_remaining_data_length);
+	} while (rc == 0 && iov_iter_count(iter) > 0);
+
+	return rc;
+}
+
 /*
  * Send data to transport
  * Each rqst is transported as a SMBDirect payload
@@ -1986,18 +1969,10 @@ int smbd_send(struct TCP_Server_Info *server,
 	int num_rqst, struct smb_rqst *rqst_array)
 {
 	struct smbd_connection *info = server->smbd_conn;
-	struct kvec vecs[SMBDIRECT_MAX_SEND_SGE - 1];
-	int nvecs;
-	int size;
-	unsigned int buflen, remaining_data_length;
-	unsigned int offset, remaining_vec_data_length;
-	int start, i, j;
-	int max_iov_size =
-		info->max_send_size - sizeof(struct smbd_data_transfer);
-	struct kvec *iov;
-	int rc;
 	struct smb_rqst *rqst;
-	int rqst_idx;
+	struct iov_iter iter;
+	unsigned int remaining_data_length, klen;
+	int rc, i, rqst_idx;
 
 	if (info->transport_status != SMBD_CONNECTED)
 		return -EAGAIN;
@@ -2024,84 +1999,36 @@ int smbd_send(struct TCP_Server_Info *server,
 	rqst_idx = 0;
 	do {
 		rqst = &rqst_array[rqst_idx];
-		iov = rqst->rq_iov;
 
 		cifs_dbg(FYI, "Sending smb (RDMA): idx=%d smb_len=%lu\n",
-			rqst_idx, smb_rqst_len(server, rqst));
-		remaining_vec_data_length = 0;
-		for (i = 0; i < rqst->rq_nvec; i++) {
-			remaining_vec_data_length += iov[i].iov_len;
-			dump_smb(iov[i].iov_base, iov[i].iov_len);
-		}
-
-		log_write(INFO, "rqst_idx=%d nvec=%d rqst->rq_npages=%d rq_pagesz=%d rq_tailsz=%d buflen=%lu\n",
-			  rqst_idx, rqst->rq_nvec,
-			  rqst->rq_npages, rqst->rq_pagesz,
-			  rqst->rq_tailsz, smb_rqst_len(server, rqst));
-
-		start = 0;
-		offset = 0;
-		do {
-			buflen = 0;
-			i = start;
-			j = 0;
-			while (i < rqst->rq_nvec &&
-				j < SMBDIRECT_MAX_SEND_SGE - 1 &&
-				buflen < max_iov_size) {
-
-				vecs[j].iov_base = iov[i].iov_base + offset;
-				if (buflen + iov[i].iov_len > max_iov_size) {
-					vecs[j].iov_len =
-						max_iov_size - iov[i].iov_len;
-					buflen = max_iov_size;
-					offset = vecs[j].iov_len;
-				} else {
-					vecs[j].iov_len =
-						iov[i].iov_len - offset;
-					buflen += vecs[j].iov_len;
-					offset = 0;
-					++i;
-				}
-				++j;
-			}
+			 rqst_idx, smb_rqst_len(server, rqst));
+		for (i = 0; i < rqst->rq_nvec; i++)
+			dump_smb(rqst->rq_iov[i].iov_base, rqst->rq_iov[i].iov_len);
+
+		log_write(INFO, "RDMA-WR[%u] nvec=%d len=%u iter=%zu rqlen=%lu\n",
+			  rqst_idx, rqst->rq_nvec, remaining_data_length,
+			  iov_iter_count(&rqst->rq_iter), smb_rqst_len(server, rqst));
+
+		/* Send the metadata pages. */
+		klen = 0;
+		for (i = 0; i < rqst->rq_nvec; i++)
+			klen += rqst->rq_iov[i].iov_len;
+		iov_iter_kvec(&iter, ITER_SOURCE, rqst->rq_iov, rqst->rq_nvec, klen);
+
+		rc = smbd_post_send_iter(info, &iter, &remaining_data_length);
+		if (rc < 0)
+			break;
 
-			remaining_vec_data_length -= buflen;
-			remaining_data_length -= buflen;
-			log_write(INFO, "sending %s iov[%d] from start=%d nvecs=%d remaining_data_length=%d\n",
-					remaining_vec_data_length > 0 ?
-						"partial" : "complete",
-					rqst->rq_nvec, start, j,
-					remaining_data_length);
-
-			start = i;
-			rc = smbd_post_send_data(info, vecs, j, remaining_data_length);
-			if (rc)
-				goto done;
-		} while (remaining_vec_data_length > 0);
-
-		/* now sending pages if there are any */
-		for (i = 0; i < rqst->rq_npages; i++) {
-			rqst_page_get_length(rqst, i, &buflen, &offset);
-			nvecs = (buflen + max_iov_size - 1) / max_iov_size;
-			log_write(INFO, "sending pages buflen=%d nvecs=%d\n",
-				buflen, nvecs);
-			for (j = 0; j < nvecs; j++) {
-				size = min_t(unsigned int, max_iov_size, remaining_data_length);
-				remaining_data_length -= size;
-				log_write(INFO, "sending pages i=%d offset=%d size=%d remaining_data_length=%d\n",
-					  i, j * max_iov_size + offset, size,
-					  remaining_data_length);
-				rc = smbd_post_send_page(
-					info, rqst->rq_pages[i],
-					j*max_iov_size + offset,
-					size, remaining_data_length);
-				if (rc)
-					goto done;
-			}
+		if (iov_iter_count(&rqst->rq_iter) > 0) {
+			/* And then the data pages if there are any */
+			rc = smbd_post_send_iter(info, &rqst->rq_iter,
+						 &remaining_data_length);
+			if (rc < 0)
+				break;
 		}
+
 	} while (++rqst_idx < num_rqst);
 
-done:
 	/*
 	 * As an optimization, we don't wait for individual I/O to finish
 	 * before sending the next one.
@@ -2305,27 +2232,49 @@ static struct smbd_mr *get_mr(struct smbd_connection *info)
 	goto again;
 }
 
+/*
+ * Transcribe the pages from an iterator into an MR scatterlist.
+ * @iter: The iterator to transcribe
+ * @_remaining_data_length: remaining data to send in this payload
+ */
+static int smbd_iter_to_mr(struct smbd_connection *info,
+			   struct iov_iter *iter,
+			   struct scatterlist *sgl,
+			   unsigned int num_pages)
+{
+	struct sg_table sgtable = { .sgl = sgl };
+	unsigned int cleanup_mode;
+	int ret;
+
+	sg_init_table(sgl, num_pages);
+
+	ret = netfs_extract_iter_to_sg(iter, iov_iter_count(iter),
+				       &sgtable, num_pages, &cleanup_mode);
+	WARN_ON(ret < 0);
+	return ret;
+}
+
 /*
  * Register memory for RDMA read/write
- * pages[]: the list of pages to register memory with
- * num_pages: the number of pages to register
- * tailsz: if non-zero, the bytes to register in the last page
+ * iter: the buffer to register memory with
  * writing: true if this is a RDMA write (SMB read), false for RDMA read
  * need_invalidate: true if this MR needs to be locally invalidated after I/O
  * return value: the MR registered, NULL if failed.
  */
-struct smbd_mr *smbd_register_mr(
-	struct smbd_connection *info, struct page *pages[], int num_pages,
-	int offset, int tailsz, bool writing, bool need_invalidate)
+struct smbd_mr *smbd_register_mr(struct smbd_connection *info,
+				 struct iov_iter *iter,
+				 bool writing, bool need_invalidate)
 {
 	struct smbd_mr *smbdirect_mr;
-	int rc, i;
+	int rc, num_pages;
 	enum dma_data_direction dir;
 	struct ib_reg_wr *reg_wr;
 
+	num_pages = iov_iter_npages(iter, info->max_frmr_depth + 1);
 	if (num_pages > info->max_frmr_depth) {
 		log_rdma_mr(ERR, "num_pages=%d max_frmr_depth=%d\n",
 			num_pages, info->max_frmr_depth);
+		WARN_ON_ONCE(1);
 		return NULL;
 	}
 
@@ -2334,32 +2283,16 @@ struct smbd_mr *smbd_register_mr(
 		log_rdma_mr(ERR, "get_mr returning NULL\n");
 		return NULL;
 	}
+
+	dir = writing ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
+	smbdirect_mr->dir = dir;
 	smbdirect_mr->need_invalidate = need_invalidate;
 	smbdirect_mr->sgl_count = num_pages;
-	sg_init_table(smbdirect_mr->sgl, num_pages);
-
-	log_rdma_mr(INFO, "num_pages=0x%x offset=0x%x tailsz=0x%x\n",
-			num_pages, offset, tailsz);
-
-	if (num_pages == 1) {
-		sg_set_page(&smbdirect_mr->sgl[0], pages[0], tailsz, offset);
-		goto skip_multiple_pages;
-	}
 
-	/* We have at least two pages to register */
-	sg_set_page(
-		&smbdirect_mr->sgl[0], pages[0], PAGE_SIZE - offset, offset);
-	i = 1;
-	while (i < num_pages - 1) {
-		sg_set_page(&smbdirect_mr->sgl[i], pages[i], PAGE_SIZE, 0);
-		i++;
-	}
-	sg_set_page(&smbdirect_mr->sgl[i], pages[i],
-		tailsz ? tailsz : PAGE_SIZE, 0);
+	log_rdma_mr(INFO, "num_pages=0x%x count=0x%zx\n",
+		    num_pages, iov_iter_count(iter));
+	smbd_iter_to_mr(info, iter, smbdirect_mr->sgl, num_pages);
 
-skip_multiple_pages:
-	dir = writing ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
-	smbdirect_mr->dir = dir;
 	rc = ib_dma_map_sg(info->id->device, smbdirect_mr->sgl, num_pages, dir);
 	if (!rc) {
 		log_rdma_mr(ERR, "ib_dma_map_sg num_pages=%x dir=%x rc=%x\n",
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index 207ef979cd51..be2cf18b7fec 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -302,8 +302,8 @@ struct smbd_mr {
 
 /* Interfaces to register and deregister MR for RDMA read/write */
 struct smbd_mr *smbd_register_mr(
-	struct smbd_connection *info, struct page *pages[], int num_pages,
-	int offset, int tailsz, bool writing, bool need_invalidate);
+	struct smbd_connection *info, struct iov_iter *iter,
+	bool writing, bool need_invalidate);
 int smbd_deregister_mr(struct smbd_mr *mr);
 
 #else
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 3851d0aaa288..f39724093993 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -270,26 +270,7 @@ smb_rqst_len(struct TCP_Server_Info *server, struct smb_rqst *rqst)
 	for (i = 0; i < nvec; i++)
 		buflen += iov[i].iov_len;
 
-	/*
-	 * Add in the page array if there is one. The caller needs to make
-	 * sure rq_offset and rq_tailsz are set correctly. If a buffer of
-	 * multiple pages ends at page boundary, rq_tailsz needs to be set to
-	 * PAGE_SIZE.
-	 */
-	if (rqst->rq_npages) {
-		if (rqst->rq_npages == 1)
-			buflen += rqst->rq_tailsz;
-		else {
-			/*
-			 * If there is more than one page, calculate the
-			 * buffer length based on rq_offset and rq_tailsz
-			 */
-			buflen += rqst->rq_pagesz * (rqst->rq_npages - 1) -
-					rqst->rq_offset;
-			buflen += rqst->rq_tailsz;
-		}
-	}
-
+	buflen += iov_iter_count(&rqst->rq_iter);
 	return buflen;
 }
 
@@ -376,23 +357,15 @@ __smb_send_rqst(struct TCP_Server_Info *server, int num_rqst,
 
 		total_len += sent;
 
-		/* now walk the page array and send each page in it */
-		for (i = 0; i < rqst[j].rq_npages; i++) {
-			struct bio_vec bvec;
-
-			bvec.bv_page = rqst[j].rq_pages[i];
-			rqst_page_get_length(&rqst[j], i, &bvec.bv_len,
-					     &bvec.bv_offset);
-
-			iov_iter_bvec(&smb_msg.msg_iter, ITER_SOURCE,
-				      &bvec, 1, bvec.bv_len);
+		if (iov_iter_count(&rqst[j].rq_iter) > 0) {
+			smb_msg.msg_iter = rqst[j].rq_iter;
 			rc = smb_send_kvec(server, &smb_msg, &sent);
 			if (rc < 0)
 				break;
-
 			total_len += sent;
 		}
-	}
+
+	}
 
 unmask:
 	sigprocmask(SIG_SETMASK, &oldmask, NULL);
@@ -1640,11 +1613,11 @@ int
 cifs_discard_remaining_data(struct TCP_Server_Info *server)
 {
 	unsigned int rfclen = server->pdu_size;
-	int remaining = rfclen + HEADER_PREAMBLE_SIZE(server) -
+	size_t remaining = rfclen + HEADER_PREAMBLE_SIZE(server) -
 		server->total_read;
 
 	while (remaining > 0) {
-		int length;
+		ssize_t length;
 
 		length = cifs_discard_from_socket(server,
 				min_t(size_t, remaining,
@@ -1790,10 +1763,18 @@ cifs_readv_receive(struct TCP_Server_Info *server, struct mid_q_entry *mid)
 		return cifs_readv_discard(server, mid);
 	}
 
-	length = rdata->read_into_pages(server, rdata, data_len);
-	if (length < 0)
-		return length;
-
+#ifdef CONFIG_CIFS_SMB_DIRECT
+	if (rdata->mr)
+		length = data_len; /* An RDMA read is already done. */
+	else
+#endif
+	{
+		length = cifs_read_iter_from_socket(server, &rdata->iter,
+						    data_len);
+		iov_iter_revert(&rdata->iter, data_len);
+	}
+	if (length > 0)
+		rdata->got_bytes += length;
 	server->total_read += length;
 
 	cifs_dbg(FYI, "total_read=%u buflen=%u remaining=%u\n",




* [PATCH v6 29/34] cifs: Build the RDMA SGE list directly from an iterator
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (27 preceding siblings ...)
  2023-01-16 23:11 ` [PATCH v6 28/34] cifs: Change the I/O paths to use an iterator rather than a page list David Howells
@ 2023-01-16 23:11 ` David Howells
  2023-01-16 23:11 ` [PATCH v6 30/34] cifs: Remove unused code David Howells
                   ` (6 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Tom Talpey,
	Jeff Layton, linux-cifs, linux-rdma, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

In the depths of the cifs RDMA code, extract part of an iov iterator
directly into an SGE list without going through an intermediate
scatterlist.

Note that this doesn't support extraction from an IOVEC- or UBUF-type
iterator (ie. user-supplied buffer).  The assumption is that the higher
layers will extract those to a BVEC-type iterator first and do whatever is
required to stop the pages from going away.
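
As a rough illustration of the idea (a sketch only, not code from this
patch): with a BVEC-type iterator, each page segment can be DMA-mapped and
described as one SGE directly, with no scatterlist in between.  The helper
name fill_sges_from_bvec() is hypothetical, and its error path is
simplified (it does not unmap already-mapped segments).

#include <linux/bvec.h>
#include <linux/errno.h>
#include <rdma/ib_verbs.h>

/* Sketch: describe an array of page segments as SGEs for a send WR. */
static int fill_sges_from_bvec(struct ib_device *dev, u32 lkey,
			       const struct bio_vec *bv, unsigned int nr,
			       struct ib_sge *sge, unsigned int max_sge)
{
	unsigned int i;

	if (nr > max_sge)
		return -EMSGSIZE;

	for (i = 0; i < nr; i++) {
		/* Map the segment for device access and fill in one SGE */
		sge[i].addr = ib_dma_map_page(dev, bv[i].bv_page,
					      bv[i].bv_offset, bv[i].bv_len,
					      DMA_TO_DEVICE);
		if (ib_dma_mapping_error(dev, sge[i].addr))
			return -EIO;	/* sketch: earlier mappings leak */
		sge[i].length = bv[i].bv_len;
		sge[i].lkey   = lkey;
	}
	return nr;
}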

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-rdma@vger.kernel.org

Link: https://lore.kernel.org/r/166697260361.61150.5064013393408112197.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732032518.3186319.1859601819981624629.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/smbdirect.c |  111 ++++++++++++++++++---------------------------------
 1 file changed, 39 insertions(+), 72 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 8bd320f0156e..4691b5a8e1ff 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -828,16 +828,16 @@ static int smbd_post_send(struct smbd_connection *info,
 	return rc;
 }
 
-static int smbd_post_send_sgl(struct smbd_connection *info,
-	struct scatterlist *sgl, int data_length, int remaining_data_length)
+static int smbd_post_send_iter(struct smbd_connection *info,
+			       struct iov_iter *iter,
+			       int *_remaining_data_length)
 {
-	int num_sgs;
 	int i, rc;
 	int header_length;
+	int data_length;
 	struct smbd_request *request;
 	struct smbd_data_transfer *packet;
 	int new_credits;
-	struct scatterlist *sg;
 
 wait_credit:
 	/* Wait for send credits. A SMBD packet needs one credit */
@@ -881,6 +881,30 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 	}
 
 	request->info = info;
+	memset(request->sge, 0, sizeof(request->sge));
+
+	/* Fill in the data payload to find out how much data we can add */
+	if (iter) {
+		struct smb_extract_to_rdma extract = {
+			.nr_sge		= 1,
+			.max_sge	= SMBDIRECT_MAX_SEND_SGE,
+			.sge		= request->sge,
+			.device		= info->id->device,
+			.local_dma_lkey	= info->pd->local_dma_lkey,
+			.direction	= DMA_TO_DEVICE,
+		};
+
+		rc = smb_extract_iter_to_rdma(iter, *_remaining_data_length,
+					      &extract);
+		if (rc < 0)
+			goto err_dma;
+		data_length = rc;
+		request->num_sge = extract.nr_sge;
+		*_remaining_data_length -= data_length;
+	} else {
+		data_length = 0;
+		request->num_sge = 1;
+	}
 
 	/* Fill in the packet header */
 	packet = smbd_request_payload(request);
@@ -902,7 +926,7 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 	else
 		packet->data_offset = cpu_to_le32(24);
 	packet->data_length = cpu_to_le32(data_length);
-	packet->remaining_data_length = cpu_to_le32(remaining_data_length);
+	packet->remaining_data_length = cpu_to_le32(*_remaining_data_length);
 	packet->padding = 0;
 
 	log_outgoing(INFO, "credits_requested=%d credits_granted=%d data_offset=%d data_length=%d remaining_data_length=%d\n",
@@ -918,7 +942,6 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 	if (!data_length)
 		header_length = offsetof(struct smbd_data_transfer, padding);
 
-	request->num_sge = 1;
 	request->sge[0].addr = ib_dma_map_single(info->id->device,
 						 (void *)packet,
 						 header_length,
@@ -932,23 +955,6 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 	request->sge[0].length = header_length;
 	request->sge[0].lkey = info->pd->local_dma_lkey;
 
-	/* Fill in the packet data payload */
-	num_sgs = sgl ? sg_nents(sgl) : 0;
-	for_each_sg(sgl, sg, num_sgs, i) {
-		request->sge[i+1].addr =
-			ib_dma_map_page(info->id->device, sg_page(sg),
-			       sg->offset, sg->length, DMA_TO_DEVICE);
-		if (ib_dma_mapping_error(
-				info->id->device, request->sge[i+1].addr)) {
-			rc = -EIO;
-			request->sge[i+1].addr = 0;
-			goto err_dma;
-		}
-		request->sge[i+1].length = sg->length;
-		request->sge[i+1].lkey = info->pd->local_dma_lkey;
-		request->num_sge++;
-	}
-
 	rc = smbd_post_send(info, request);
 	if (!rc)
 		return 0;
@@ -987,8 +993,10 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
  */
 static int smbd_post_send_empty(struct smbd_connection *info)
 {
+	int remaining_data_length = 0;
+
 	info->count_send_empty++;
-	return smbd_post_send_sgl(info, NULL, 0, 0);
+	return smbd_post_send_iter(info, NULL, &remaining_data_length);
 }
 
 /*
@@ -1923,42 +1931,6 @@ int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
 	return rc;
 }
 
-/*
- * Send the contents of an iterator
- * @iter: The iterator to send
- * @_remaining_data_length: remaining data to send in this payload
- */
-static int smbd_post_send_iter(struct smbd_connection *info,
-			       struct iov_iter *iter,
-			       int *_remaining_data_length)
-{
-	struct scatterlist sgl[SMBDIRECT_MAX_SEND_SGE - 1];
-	unsigned int max_payload = info->max_send_size - sizeof(struct smbd_data_transfer);
-	unsigned int cleanup_mode;
-	ssize_t rc;
-
-	do {
-		struct sg_table sgtable = { .sgl = sgl };
-		size_t maxlen = min_t(size_t, *_remaining_data_length, max_payload);
-
-		sg_init_table(sgtable.sgl, ARRAY_SIZE(sgl));
-		rc = netfs_extract_iter_to_sg(iter, maxlen,
-					      &sgtable, ARRAY_SIZE(sgl),
-					      &cleanup_mode);
-		if (rc < 0)
-			break;
-		if (WARN_ON_ONCE(sgtable.nents == 0))
-			return -EIO;
-		WARN_ON(cleanup_mode != 0);
-
-		sg_mark_end(&sgl[sgtable.nents - 1]);
-		*_remaining_data_length -= rc;
-		rc = smbd_post_send_sgl(info, sgl, rc, *_remaining_data_length);
-	} while (rc == 0 && iov_iter_count(iter) > 0);
-
-	return rc;
-}
-
 /*
  * Send data to transport
  * Each rqst is transported as a SMBDirect payload
@@ -2240,16 +2212,17 @@ static struct smbd_mr *get_mr(struct smbd_connection *info)
 static int smbd_iter_to_mr(struct smbd_connection *info,
 			   struct iov_iter *iter,
 			   struct scatterlist *sgl,
-			   unsigned int num_pages)
+			   unsigned int num_pages,
+			   bool writing)
 {
 	struct sg_table sgtable = { .sgl = sgl };
-	unsigned int cleanup_mode;
 	int ret;
 
 	sg_init_table(sgl, num_pages);
 
 	ret = netfs_extract_iter_to_sg(iter, iov_iter_count(iter),
-				       &sgtable, num_pages, &cleanup_mode);
+				       &sgtable, num_pages,
+				       writing ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
 	WARN_ON(ret < 0);
 	return ret;
 }
@@ -2291,7 +2264,7 @@ struct smbd_mr *smbd_register_mr(struct smbd_connection *info,
 
 	log_rdma_mr(INFO, "num_pages=0x%x count=0x%zx\n",
 		    num_pages, iov_iter_count(iter));
-	smbd_iter_to_mr(info, iter, smbdirect_mr->sgl, num_pages);
+	smbd_iter_to_mr(info, iter, smbdirect_mr->sgl, num_pages, writing);
 
 	rc = ib_dma_map_sg(info->id->device, smbdirect_mr->sgl, num_pages, dir);
 	if (!rc) {
@@ -2602,13 +2575,6 @@ static ssize_t smb_extract_iter_to_rdma(struct iov_iter *iter, size_t len,
 	ssize_t ret;
 	int before = rdma->nr_sge;
 
-	if (iov_iter_is_discard(iter) ||
-	    iov_iter_is_pipe(iter) ||
-	    user_backed_iter(iter)) {
-		WARN_ON_ONCE(1);
-		return -EIO;
-	}
-
 	switch (iov_iter_type(iter)) {
 	case ITER_BVEC:
 		ret = smb_extract_bvec_to_rdma(iter, rdma, len);
@@ -2620,7 +2586,8 @@ static ssize_t smb_extract_iter_to_rdma(struct iov_iter *iter, size_t len,
 		ret = smb_extract_xarray_to_rdma(iter, rdma, len);
 		break;
 	default:
-		BUG();
+		WARN_ON_ONCE(1);
+		return -EIO;
 	}
 
 	if (ret > 0) {




* [PATCH v6 30/34] cifs: Remove unused code
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (28 preceding siblings ...)
  2023-01-16 23:11 ` [PATCH v6 29/34] cifs: Build the RDMA SGE list directly from an iterator David Howells
@ 2023-01-16 23:11 ` David Howells
  2023-01-16 23:11 ` [PATCH v6 31/34] cifs: Fix problem with encrypted RDMA data read David Howells
                   ` (5 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Jeff Layton,
	linux-cifs, dhowells, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel

Remove a bunch of functions that are no longer used, having been commented
out by the conversion to use iterators throughout the I/O path.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/164928621823.457102.8777804402615654773.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211421039.3154751.15199634443157779005.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348881165.2106726.2993852968344861224.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364827876.3334034.9331465096417303889.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126396915.708021.2010212654244139442.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697261080.61150.17513116912567922274.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732033255.3186319.5527423437137895940.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/file.c |  606 --------------------------------------------------------
 1 file changed, 606 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index cfa8ad8a59c4..6baf591f63a3 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2603,314 +2603,6 @@ static int cifs_partialpagewrite(struct page *page, unsigned from, unsigned to)
 	return rc;
 }
 
-#if 0 // TODO: Remove for iov_iter support
-static struct cifs_writedata *
-wdata_alloc_and_fillpages(pgoff_t tofind, struct address_space *mapping,
-			  pgoff_t end, pgoff_t *index,
-			  unsigned int *found_pages)
-{
-	struct cifs_writedata *wdata;
-
-	wdata = cifs_writedata_alloc((unsigned int)tofind,
-				     cifs_writev_complete);
-	if (!wdata)
-		return NULL;
-
-	*found_pages = find_get_pages_range_tag(mapping, index, end,
-				PAGECACHE_TAG_DIRTY, tofind, wdata->pages);
-	return wdata;
-}
-
-static unsigned int
-wdata_prepare_pages(struct cifs_writedata *wdata, unsigned int found_pages,
-		    struct address_space *mapping,
-		    struct writeback_control *wbc,
-		    pgoff_t end, pgoff_t *index, pgoff_t *next, bool *done)
-{
-	unsigned int nr_pages = 0, i;
-	struct page *page;
-
-	for (i = 0; i < found_pages; i++) {
-		page = wdata->pages[i];
-		/*
-		 * At this point we hold neither the i_pages lock nor the
-		 * page lock: the page may be truncated or invalidated
-		 * (changing page->mapping to NULL), or even swizzled
-		 * back from swapper_space to tmpfs file mapping
-		 */
-
-		if (nr_pages == 0)
-			lock_page(page);
-		else if (!trylock_page(page))
-			break;
-
-		if (unlikely(page->mapping != mapping)) {
-			unlock_page(page);
-			break;
-		}
-
-		if (!wbc->range_cyclic && page->index > end) {
-			*done = true;
-			unlock_page(page);
-			break;
-		}
-
-		if (*next && (page->index != *next)) {
-			/* Not next consecutive page */
-			unlock_page(page);
-			break;
-		}
-
-		if (wbc->sync_mode != WB_SYNC_NONE)
-			wait_on_page_writeback(page);
-
-		if (PageWriteback(page) ||
-				!clear_page_dirty_for_io(page)) {
-			unlock_page(page);
-			break;
-		}
-
-		/*
-		 * This actually clears the dirty bit in the radix tree.
-		 * See cifs_writepage() for more commentary.
-		 */
-		set_page_writeback(page);
-		if (page_offset(page) >= i_size_read(mapping->host)) {
-			*done = true;
-			unlock_page(page);
-			end_page_writeback(page);
-			break;
-		}
-
-		wdata->pages[i] = page;
-		*next = page->index + 1;
-		++nr_pages;
-	}
-
-	/* reset index to refind any pages skipped */
-	if (nr_pages == 0)
-		*index = wdata->pages[0]->index + 1;
-
-	/* put any pages we aren't going to use */
-	for (i = nr_pages; i < found_pages; i++) {
-		put_page(wdata->pages[i]);
-		wdata->pages[i] = NULL;
-	}
-
-	return nr_pages;
-}
-
-static int
-wdata_send_pages(struct cifs_writedata *wdata, unsigned int nr_pages,
-		 struct address_space *mapping, struct writeback_control *wbc)
-{
-	int rc;
-
-	wdata->sync_mode = wbc->sync_mode;
-	wdata->nr_pages = nr_pages;
-	wdata->offset = page_offset(wdata->pages[0]);
-	wdata->pagesz = PAGE_SIZE;
-	wdata->tailsz = min(i_size_read(mapping->host) -
-			page_offset(wdata->pages[nr_pages - 1]),
-			(loff_t)PAGE_SIZE);
-	wdata->bytes = ((nr_pages - 1) * PAGE_SIZE) + wdata->tailsz;
-	wdata->pid = wdata->cfile->pid;
-
-	rc = adjust_credits(wdata->server, &wdata->credits, wdata->bytes);
-	if (rc)
-		return rc;
-
-	if (wdata->cfile->invalidHandle)
-		rc = -EAGAIN;
-	else
-		rc = wdata->server->ops->async_writev(wdata,
-						      cifs_writedata_release);
-
-	return rc;
-}
-
-static int
-cifs_writepage_locked(struct page *page, struct writeback_control *wbc);
-
-static int cifs_write_one_page(struct page *page, struct writeback_control *wbc,
-		void *data)
-{
-	struct address_space *mapping = data;
-	int ret;
-
-	ret = cifs_writepage_locked(page, wbc);
-	unlock_page(page);
-	mapping_set_error(mapping, ret);
-	return ret;
-}
-
-static int cifs_writepages(struct address_space *mapping,
-			   struct writeback_control *wbc)
-{
-	struct inode *inode = mapping->host;
-	struct cifs_sb_info *cifs_sb = CIFS_SB(inode->i_sb);
-	struct TCP_Server_Info *server;
-	bool done = false, scanned = false, range_whole = false;
-	pgoff_t end, index;
-	struct cifs_writedata *wdata;
-	struct cifsFileInfo *cfile = NULL;
-	int rc = 0;
-	int saved_rc = 0;
-	unsigned int xid;
-
-	/*
-	 * If wsize is smaller than the page cache size, default to writing
-	 * one page at a time.
-	 */
-	if (cifs_sb->ctx->wsize < PAGE_SIZE)
-		return write_cache_pages(mapping, wbc, cifs_write_one_page,
-				mapping);
-
-	xid = get_xid();
-	if (wbc->range_cyclic) {
-		index = mapping->writeback_index; /* Start from prev offset */
-		end = -1;
-	} else {
-		index = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
-		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-			range_whole = true;
-		scanned = true;
-	}
-	server = cifs_pick_channel(cifs_sb_master_tcon(cifs_sb)->ses);
-
-retry:
-	while (!done && index <= end) {
-		unsigned int i, nr_pages, found_pages, wsize;
-		pgoff_t next = 0, tofind, saved_index = index;
-		struct cifs_credits credits_on_stack;
-		struct cifs_credits *credits = &credits_on_stack;
-		int get_file_rc = 0;
-
-		if (cfile)
-			cifsFileInfo_put(cfile);
-
-		rc = cifs_get_writable_file(CIFS_I(inode), FIND_WR_ANY, &cfile);
-
-		/* in case of an error store it to return later */
-		if (rc)
-			get_file_rc = rc;
-
-		rc = server->ops->wait_mtu_credits(server, cifs_sb->ctx->wsize,
-						   &wsize, credits);
-		if (rc != 0) {
-			done = true;
-			break;
-		}
-
-		tofind = min((wsize / PAGE_SIZE) - 1, end - index) + 1;
-
-		wdata = wdata_alloc_and_fillpages(tofind, mapping, end, &index,
-						  &found_pages);
-		if (!wdata) {
-			rc = -ENOMEM;
-			done = true;
-			add_credits_and_wake_if(server, credits, 0);
-			break;
-		}
-
-		if (found_pages == 0) {
-			kref_put(&wdata->refcount, cifs_writedata_release);
-			add_credits_and_wake_if(server, credits, 0);
-			break;
-		}
-
-		nr_pages = wdata_prepare_pages(wdata, found_pages, mapping, wbc,
-					       end, &index, &next, &done);
-
-		/* nothing to write? */
-		if (nr_pages == 0) {
-			kref_put(&wdata->refcount, cifs_writedata_release);
-			add_credits_and_wake_if(server, credits, 0);
-			continue;
-		}
-
-		wdata->credits = credits_on_stack;
-		wdata->cfile = cfile;
-		wdata->server = server;
-		cfile = NULL;
-
-		if (!wdata->cfile) {
-			cifs_dbg(VFS, "No writable handle in writepages rc=%d\n",
-				 get_file_rc);
-			if (is_retryable_error(get_file_rc))
-				rc = get_file_rc;
-			else
-				rc = -EBADF;
-		} else
-			rc = wdata_send_pages(wdata, nr_pages, mapping, wbc);
-
-		for (i = 0; i < nr_pages; ++i)
-			unlock_page(wdata->pages[i]);
-
-		/* send failure -- clean up the mess */
-		if (rc != 0) {
-			add_credits_and_wake_if(server, &wdata->credits, 0);
-			for (i = 0; i < nr_pages; ++i) {
-				if (is_retryable_error(rc))
-					redirty_page_for_writepage(wbc,
-							   wdata->pages[i]);
-				else
-					SetPageError(wdata->pages[i]);
-				end_page_writeback(wdata->pages[i]);
-				put_page(wdata->pages[i]);
-			}
-			if (!is_retryable_error(rc))
-				mapping_set_error(mapping, rc);
-		}
-		kref_put(&wdata->refcount, cifs_writedata_release);
-
-		if (wbc->sync_mode == WB_SYNC_ALL && rc == -EAGAIN) {
-			index = saved_index;
-			continue;
-		}
-
-		/* Return immediately if we received a signal during writing */
-		if (is_interrupt_error(rc)) {
-			done = true;
-			break;
-		}
-
-		if (rc != 0 && saved_rc == 0)
-			saved_rc = rc;
-
-		wbc->nr_to_write -= nr_pages;
-		if (wbc->nr_to_write <= 0)
-			done = true;
-
-		index = next;
-	}
-
-	if (!scanned && !done) {
-		/*
-		 * We hit the last page and there is more work to be done: wrap
-		 * back to the start of the file
-		 */
-		scanned = true;
-		index = 0;
-		goto retry;
-	}
-
-	if (saved_rc != 0)
-		rc = saved_rc;
-
-	if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
-		mapping->writeback_index = index;
-
-	if (cfile)
-		cifsFileInfo_put(cfile);
-	free_xid(xid);
-	/* Indication to update ctime and mtime as close is deferred */
-	set_bit(CIFS_INO_MODIFIED_ATTR, &CIFS_I(inode)->flags);
-	return rc;
-}
-#endif
-
 /*
  * Extend the region to be written back to include subsequent contiguously
  * dirty pages if possible, but don't sleep while doing so.
@@ -3505,49 +3197,6 @@ int cifs_flush(struct file *file, fl_owner_t id)
 	return rc;
 }
 
-#if 0 // TODO: Remove for iov_iter support
-static int
-cifs_write_allocate_pages(struct page **pages, unsigned long num_pages)
-{
-	int rc = 0;
-	unsigned long i;
-
-	for (i = 0; i < num_pages; i++) {
-		pages[i] = alloc_page(GFP_KERNEL|__GFP_HIGHMEM);
-		if (!pages[i]) {
-			/*
-			 * save number of pages we have already allocated and
-			 * return with ENOMEM error
-			 */
-			num_pages = i;
-			rc = -ENOMEM;
-			break;
-		}
-	}
-
-	if (rc) {
-		for (i = 0; i < num_pages; i++)
-			put_page(pages[i]);
-	}
-	return rc;
-}
-
-static inline
-size_t get_numpages(const size_t wsize, const size_t len, size_t *cur_len)
-{
-	size_t num_pages;
-	size_t clen;
-
-	clen = min_t(const size_t, len, wsize);
-	num_pages = DIV_ROUND_UP(clen, PAGE_SIZE);
-
-	if (cur_len)
-		*cur_len = clen;
-
-	return num_pages;
-}
-#endif
-
 static void
 cifs_uncached_writedata_release(struct kref *refcount)
 {
@@ -3580,50 +3229,6 @@ cifs_uncached_writev_complete(struct work_struct *work)
 	kref_put(&wdata->refcount, cifs_uncached_writedata_release);
 }
 
-#if 0 // TODO: Remove for iov_iter support
-static int
-wdata_fill_from_iovec(struct cifs_writedata *wdata, struct iov_iter *from,
-		      size_t *len, unsigned long *num_pages)
-{
-	size_t save_len, copied, bytes, cur_len = *len;
-	unsigned long i, nr_pages = *num_pages;
-
-	save_len = cur_len;
-	for (i = 0; i < nr_pages; i++) {
-		bytes = min_t(const size_t, cur_len, PAGE_SIZE);
-		copied = copy_page_from_iter(wdata->pages[i], 0, bytes, from);
-		cur_len -= copied;
-		/*
-		 * If we didn't copy as much as we expected, then that
-		 * may mean we trod into an unmapped area. Stop copying
-		 * at that point. On the next pass through the big
-		 * loop, we'll likely end up getting a zero-length
-		 * write and bailing out of it.
-		 */
-		if (copied < bytes)
-			break;
-	}
-	cur_len = save_len - cur_len;
-	*len = cur_len;
-
-	/*
-	 * If we have no data to send, then that probably means that
-	 * the copy above failed altogether. That's most likely because
-	 * the address in the iovec was bogus. Return -EFAULT and let
-	 * the caller free anything we allocated and bail out.
-	 */
-	if (!cur_len)
-		return -EFAULT;
-
-	/*
-	 * i + 1 now represents the number of pages we actually used in
-	 * the copy phase above.
-	 */
-	*num_pages = i + 1;
-	return 0;
-}
-#endif
-
 static int
 cifs_resend_wdata(struct cifs_writedata *wdata, struct list_head *wdata_list,
 	struct cifs_aio_ctx *ctx)
@@ -4212,83 +3817,6 @@ cifs_uncached_readv_complete(struct work_struct *work)
 	kref_put(&rdata->refcount, cifs_readdata_release);
 }
 
-#if 0 // TODO: Remove for iov_iter support
-
-static int
-uncached_fill_pages(struct TCP_Server_Info *server,
-		    struct cifs_readdata *rdata, struct iov_iter *iter,
-		    unsigned int len)
-{
-	int result = 0;
-	unsigned int i;
-	unsigned int nr_pages = rdata->nr_pages;
-	unsigned int page_offset = rdata->page_offset;
-
-	rdata->got_bytes = 0;
-	rdata->tailsz = PAGE_SIZE;
-	for (i = 0; i < nr_pages; i++) {
-		struct page *page = rdata->pages[i];
-		size_t n;
-		unsigned int segment_size = rdata->pagesz;
-
-		if (i == 0)
-			segment_size -= page_offset;
-		else
-			page_offset = 0;
-
-
-		if (len <= 0) {
-			/* no need to hold page hostage */
-			rdata->pages[i] = NULL;
-			rdata->nr_pages--;
-			put_page(page);
-			continue;
-		}
-
-		n = len;
-		if (len >= segment_size)
-			/* enough data to fill the page */
-			n = segment_size;
-		else
-			rdata->tailsz = len;
-		len -= n;
-
-		if (iter)
-			result = copy_page_from_iter(
-					page, page_offset, n, iter);
-#ifdef CONFIG_CIFS_SMB_DIRECT
-		else if (rdata->mr)
-			result = n;
-#endif
-		else
-			result = cifs_read_page_from_socket(
-					server, page, page_offset, n);
-		if (result < 0)
-			break;
-
-		rdata->got_bytes += result;
-	}
-
-	return rdata->got_bytes > 0 && result != -ECONNABORTED ?
-						rdata->got_bytes : result;
-}
-
-static int
-cifs_uncached_read_into_pages(struct TCP_Server_Info *server,
-			      struct cifs_readdata *rdata, unsigned int len)
-{
-	return uncached_fill_pages(server, rdata, NULL, len);
-}
-
-static int
-cifs_uncached_copy_into_pages(struct TCP_Server_Info *server,
-			      struct cifs_readdata *rdata,
-			      struct iov_iter *iter)
-{
-	return uncached_fill_pages(server, rdata, iter, iter->count);
-}
-#endif
-
 static int cifs_resend_rdata(struct cifs_readdata *rdata,
 			struct list_head *rdata_list,
 			struct cifs_aio_ctx *ctx)
@@ -4901,140 +4429,6 @@ int cifs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	return rc;
 }
 
-#if 0 // TODO: Remove for iov_iter support
-
-static void
-cifs_readv_complete(struct work_struct *work)
-{
-	unsigned int i, got_bytes;
-	struct cifs_readdata *rdata = container_of(work,
-						struct cifs_readdata, work);
-
-	got_bytes = rdata->got_bytes;
-	for (i = 0; i < rdata->nr_pages; i++) {
-		struct page *page = rdata->pages[i];
-
-		if (rdata->result == 0 ||
-		    (rdata->result == -EAGAIN && got_bytes)) {
-			flush_dcache_page(page);
-			SetPageUptodate(page);
-		} else
-			SetPageError(page);
-
-		if (rdata->result == 0 ||
-		    (rdata->result == -EAGAIN && got_bytes))
-			cifs_readpage_to_fscache(rdata->mapping->host, page);
-
-		unlock_page(page);
-
-		got_bytes -= min_t(unsigned int, PAGE_SIZE, got_bytes);
-
-		put_page(page);
-		rdata->pages[i] = NULL;
-	}
-	kref_put(&rdata->refcount, cifs_readdata_release);
-}
-
-static int
-readpages_fill_pages(struct TCP_Server_Info *server,
-		     struct cifs_readdata *rdata, struct iov_iter *iter,
-		     unsigned int len)
-{
-	int result = 0;
-	unsigned int i;
-	u64 eof;
-	pgoff_t eof_index;
-	unsigned int nr_pages = rdata->nr_pages;
-	unsigned int page_offset = rdata->page_offset;
-
-	/* determine the eof that the server (probably) has */
-	eof = CIFS_I(rdata->mapping->host)->server_eof;
-	eof_index = eof ? (eof - 1) >> PAGE_SHIFT : 0;
-	cifs_dbg(FYI, "eof=%llu eof_index=%lu\n", eof, eof_index);
-
-	rdata->got_bytes = 0;
-	rdata->tailsz = PAGE_SIZE;
-	for (i = 0; i < nr_pages; i++) {
-		struct page *page = rdata->pages[i];
-		unsigned int to_read = rdata->pagesz;
-		size_t n;
-
-		if (i == 0)
-			to_read -= page_offset;
-		else
-			page_offset = 0;
-
-		n = to_read;
-
-		if (len >= to_read) {
-			len -= to_read;
-		} else if (len > 0) {
-			/* enough for partial page, fill and zero the rest */
-			zero_user(page, len + page_offset, to_read - len);
-			n = rdata->tailsz = len;
-			len = 0;
-		} else if (page->index > eof_index) {
-			/*
-			 * The VFS will not try to do readahead past the
-			 * i_size, but it's possible that we have outstanding
-			 * writes with gaps in the middle and the i_size hasn't
-			 * caught up yet. Populate those with zeroed out pages
-			 * to prevent the VFS from repeatedly attempting to
-			 * fill them until the writes are flushed.
-			 */
-			zero_user(page, 0, PAGE_SIZE);
-			flush_dcache_page(page);
-			SetPageUptodate(page);
-			unlock_page(page);
-			put_page(page);
-			rdata->pages[i] = NULL;
-			rdata->nr_pages--;
-			continue;
-		} else {
-			/* no need to hold page hostage */
-			unlock_page(page);
-			put_page(page);
-			rdata->pages[i] = NULL;
-			rdata->nr_pages--;
-			continue;
-		}
-
-		if (iter)
-			result = copy_page_from_iter(
-					page, page_offset, n, iter);
-#ifdef CONFIG_CIFS_SMB_DIRECT
-		else if (rdata->mr)
-			result = n;
-#endif
-		else
-			result = cifs_read_page_from_socket(
-					server, page, page_offset, n);
-		if (result < 0)
-			break;
-
-		rdata->got_bytes += result;
-	}
-
-	return rdata->got_bytes > 0 && result != -ECONNABORTED ?
-						rdata->got_bytes : result;
-}
-
-static int
-cifs_readpages_read_into_pages(struct TCP_Server_Info *server,
-			       struct cifs_readdata *rdata, unsigned int len)
-{
-	return readpages_fill_pages(server, rdata, NULL, len);
-}
-
-static int
-cifs_readpages_copy_into_pages(struct TCP_Server_Info *server,
-			       struct cifs_readdata *rdata,
-			       struct iov_iter *iter)
-{
-	return readpages_fill_pages(server, rdata, iter, iter->count);
-}
-#endif
-
 /*
  * Unlock a bunch of folios in the pagecache.
  */



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 31/34] cifs: Fix problem with encrypted RDMA data read
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (29 preceding siblings ...)
  2023-01-16 23:11 ` [PATCH v6 30/34] cifs: Remove unused code David Howells
@ 2023-01-16 23:11 ` David Howells
  2023-01-19 16:25   ` Stefan Metzmacher
  2023-01-16 23:11 ` [PATCH v6 32/34] cifs: DIO to/from KVEC-type iterators should now work David Howells
                   ` (4 subsequent siblings)
  35 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Tom Talpey, Long Li, Namjae Jeon,
	Stefan Metzmacher, linux-cifs, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

When the cifs client is talking to the ksmbd server by RDMA and the ksmbd
server has "smb3 encryption = yes" in its config file, the normal PDU
stream is encrypted, but the directly-delivered data isn't part of that
stream (and isn't encrypted); it is instead delivered by DDP/RDMA packets
(at least with iWARP).

Currently, the direct delivery fails with:

   buf can not contain only a part of read data
   WARNING: CPU: 0 PID: 4619 at fs/cifs/smb2ops.c:4731 handle_read_data+0x393/0x405
   ...
   RIP: 0010:handle_read_data+0x393/0x405
   ...
    smb3_handle_read_data+0x30/0x37
    receive_encrypted_standard+0x141/0x224
    cifs_demultiplex_thread+0x21a/0x63b
    kthread+0xe7/0xef
    ret_from_fork+0x22/0x30

The problem apparently stems from the fact that it's trying to manage
the decryption, but the data isn't in the smallbuf, the bigbuf or the page
array.

This can be fixed simply by inserting an extra case into handle_read_data()
that checks whether use_rdma_mr is true and, if it is, just sets
rdata->got_bytes to the length of the data delivered and allows normal
continuation.

This can be seen in an iWARP packet trace.  With the upstream code, the data
is transferred in a DDP/RDMA packet, which produces the warning above; the
client then retries, retrieving the data inline, spread across several
SMBDirect messages that get glued together into a single PDU.  With the
patch applied, only the DDP/RDMA packet is seen.

Note that this doesn't happen if the server isn't told to encrypt the data,
and that it does also happen with softRoCE.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <smfrench@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Long Li <longli@microsoft.com>
cc: Namjae Jeon <linkinjeon@kernel.org>
cc: Stefan Metzmacher <metze@samba.org>
cc: linux-cifs@vger.kernel.org

Link: https://lore.kernel.org/r/166855224228.1998592.2212551359609792175.stgit@warthog.procyon.org.uk/ # v1
---

 fs/cifs/smb2ops.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 387effcb905d..fabb1e135faa 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -4720,6 +4720,9 @@ handle_read_data(struct TCP_Server_Info *server, struct mid_q_entry *mid,
 		if (length < 0)
 			return length;
 		rdata->got_bytes = data_len;
+	} else if (use_rdma_mr) {
+		/* The data was delivered directly by RDMA. */
+		rdata->got_bytes = data_len;
 	} else {
 		/* read response payload cannot be in both buf and pages */
 		WARN_ONCE(1, "buf can not contain only a part of read data");



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 32/34] cifs: DIO to/from KVEC-type iterators should now work
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (30 preceding siblings ...)
  2023-01-16 23:11 ` [PATCH v6 31/34] cifs: Fix problem with encrypted RDMA data read David Howells
@ 2023-01-16 23:11 ` David Howells
  2023-01-16 23:12 ` [PATCH v6 33/34] net: [RFC][WIP] Mark each skb_frags as to how they should be cleaned up David Howells
                   ` (3 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Tom Talpey,
	Jeff Layton, linux-cifs, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

DIO to/from KVEC-type iterators should now work: the iterator is passed
down to the socket in non-RDMA/non-crypto mode, and in RDMA or crypto mode
care is taken to handle vmap/vmalloc correctly and not to take page refs
when building a scatterlist.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
---

 fs/cifs/file.c |   20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 6baf591f63a3..7f1e01cee83d 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3545,16 +3545,6 @@ static ssize_t __cifs_writev(
 	struct cifs_aio_ctx *ctx;
 	int rc;
 
-	/*
-	 * iov_iter_get_pages_alloc doesn't work with ITER_KVEC.
-	 * In this case, fall back to non-direct write function.
-	 * this could be improved by getting pages directly in ITER_KVEC
-	 */
-	if (direct && iov_iter_is_kvec(from)) {
-		cifs_dbg(FYI, "use non-direct cifs_writev for kvec I/O\n");
-		direct = false;
-	}
-
 	rc = generic_write_checks(iocb, from);
 	if (rc <= 0)
 		return rc;
@@ -4090,16 +4080,6 @@ static ssize_t __cifs_readv(
 	loff_t offset = iocb->ki_pos;
 	struct cifs_aio_ctx *ctx;
 
-	/*
-	 * iov_iter_get_pages_alloc() doesn't work with ITER_KVEC,
-	 * fall back to data copy read path
-	 * this could be improved by getting pages directly in ITER_KVEC
-	 */
-	if (direct && iov_iter_is_kvec(to)) {
-		cifs_dbg(FYI, "use non-direct cifs_user_readv for kvec I/O\n");
-		direct = false;
-	}
-
 	len = iov_iter_count(to);
 	if (!len)
 		return 0;



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 33/34] net: [RFC][WIP] Mark each skb_frags as to how they should be cleaned up
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (31 preceding siblings ...)
  2023-01-16 23:11 ` [PATCH v6 32/34] cifs: DIO to/from KVEC-type iterators should now work David Howells
@ 2023-01-16 23:12 ` David Howells
  2023-01-16 23:12 ` [PATCH v6 34/34] net: [RFC][WIP] Make __zerocopy_sg_from_iter() correctly pin or leave pages unref'd David Howells
                   ` (2 subsequent siblings)
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:12 UTC (permalink / raw)
  To: Al Viro
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

 [!] NOTE: This patch is mostly for illustrative/discussion purposes; it
     makes an incomplete change, and the networking code may not compile
     thereafter.

There are a couple of problems with pasting foreign pages into sk_buffs
with zerocopy that are analogous to the problems with direct I/O:

 (1) Pages derived from kernel buffers, such as KVEC iterators, should not
     have refs taken on them.  Rather, the caller should do whatever it
     needs to do to retain the memory.

 (2) Pages derived from userspace buffers must not have refs taken on them
     if they're going to be written to (analogous to direct I/O read) as
     this may cause a malfunction of the VM CoW mechanism with a concurrent
     fork.  Rather, they should have pins taken on them (FOLL_PIN).  This
     will affect zerocopy-recvmsg where that exists (e.g. TLS, I think,
     though that might be decrypt-offload).

This is further complicated by the possibility of a sk_buff containing data
from mixed sources - for example a network filesystem might generate a
message consisting of some metadata from a kernel buffer (which should not
be pinned) and some data from userspace (which should have a ref taken).

To this end, each page fragment attached to a sk_buff needs labelling with
the appropriate cleanup to be applied.  Do this by:

 (1) Replace struct bio_vec as the basis of skb_frag_t with a new struct
     skb_frag.  This has an offset and a length, as before, plus a
     'page_and_mode' member that contains the cleanup mode in the bottom
     two bits and the page pointer in the remaining bits.

     (FOLL_GET and FOLL_PIN got renumbered to bits 0 and 1 in an earlier
     patch).

 (2) The cleanup mode can be one of FOLL_GET (put a ref on the page),
     FOLL_PIN (unpin the page) or 0 (do nothing).

 (3) skb_frag_page() is used to access the page pointer as before.

 (4) __skb_frag_set_page() and skb_frag_set_page() acquire an extra
     argument to indicate the cleanup mode.

 (5) The cleanup mode is set to FOLL_GET on everything for the moment.

 (6) __skb_frag_ref() will call try_grab_page(), passing the cleanup mode
     to indicate whether an extra ref, an extra pin or nothing is required.

     [!] NOTE: If the cleanup mode was 0, this skbuff will also not pin the
     page and the caller needs to be aware of that.

 (7) __skb_frag_unref() will call page_put_unpin() to do the appropriate
     cleanup, based on the mode.
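
As an editorial illustration (not part of the patch itself), here is a
minimal sketch of how the new accessors fit together, using the helpers
added below plus page_put_unpin() from earlier in the series; the wrapper
function name is made up for the example:

	static void example_fill_and_release(struct sk_buff *skb, int i,
					     struct page *page, int offset,
					     int len, unsigned int cleanup_mode)
	{
		skb_frag_t *frag;

		/* Record the page together with how it must be cleaned up
		 * (0, FOLL_GET or FOLL_PIN), as determined when the page
		 * was extracted from the iterator.
		 */
		skb_fill_page_desc(skb, i, page, offset, len, cleanup_mode);

		/* Later, on teardown, the recorded mode says what to do;
		 * this is what __skb_frag_unref() now does internally.
		 */
		frag = &skb_shinfo(skb)->frags[i];
		page_put_unpin(skb_frag_page(frag), skb_frag_cleanup(frag));
	}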

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: netdev@vger.kernel.org
---

 drivers/net/tun.c      |    2 -
 include/linux/skbuff.h |  124 ++++++++++++++++++++++++++++++------------------
 io_uring/net.c         |    2 -
 net/bpf/test_run.c     |    2 -
 net/core/datagram.c    |    3 +
 net/core/gro.c         |    2 -
 net/core/skbuff.c      |   16 +++---
 net/ipv4/ip_output.c   |    2 -
 net/ipv4/tcp.c         |    4 +-
 net/ipv6/esp6.c        |    5 +-
 net/ipv6/ip6_output.c  |    2 -
 net/packet/af_packet.c |    2 -
 net/xfrm/xfrm_ipcomp.c |    2 -
 13 files changed, 101 insertions(+), 67 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a7d17c680f4a..6c467c5163b2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1496,7 +1496,7 @@ static struct sk_buff *tun_napi_alloc_frags(struct tun_file *tfile,
 		}
 		page = virt_to_head_page(frag);
 		skb_fill_page_desc(skb, i - 1, page,
-				   frag - page_address(page), fragsz);
+				   frag - page_address(page), fragsz, FOLL_GET);
 	}
 
 	return skb;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4c8492401a10..a1a77909509b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -357,7 +357,51 @@ extern int sysctl_max_skb_frags;
  */
 #define GSO_BY_FRAGS	0xFFFF
 
-typedef struct bio_vec skb_frag_t;
+struct skb_frag {
+	unsigned long	page_and_mode;	/* page pointer | cleanup_mode (0/FOLL_GET/PIN) */
+	unsigned int	len;
+	unsigned int	offset;
+};
+typedef struct skb_frag skb_frag_t;
+
+/**
+ * skb_frag_cleanup() - Returns the cleanup mode for an skb fragment
+ * @frag: skb fragment
+ *
+ * Returns the cleanup mode associated with @frag.  It will be FOLL_GET,
+ * FOLL_PIN or 0.
+ */
+static inline unsigned int skb_frag_cleanup(const skb_frag_t *frag)
+{
+	return frag->page_and_mode & 3;
+}
+
+/**
+ * skb_frag_page() - Returns the page in an skb fragment
+ * @frag: skb fragment
+ *
+ * Returns the &struct page associated with @frag.
+ */
+static inline struct page *skb_frag_page(const skb_frag_t *frag)
+{
+	return (struct page *)(frag->page_and_mode & ~3);
+}
+
+/**
+ * __skb_frag_set_page() - Sets the page in an skb fragment
+ * @frag: skb fragment
+ * @page: The page to set
+ * @cleanup_mode: The cleanup mode to set (0, FOLL_GET, FOLL_PIN)
+ *
+ * Sets the fragment @frag to contain @page with the specified method of
+ * cleaning it up.
+ */
+static inline void __skb_frag_set_page(skb_frag_t *frag, struct page *page,
+				       unsigned int cleanup_mode)
+{
+	cleanup_mode &= FOLL_GET | FOLL_PIN;
+	frag->page_and_mode = (unsigned long)page | cleanup_mode;
+}
 
 /**
  * skb_frag_size() - Returns the size of a skb fragment
@@ -365,7 +409,7 @@ typedef struct bio_vec skb_frag_t;
  */
 static inline unsigned int skb_frag_size(const skb_frag_t *frag)
 {
-	return frag->bv_len;
+	return frag->len;
 }
 
 /**
@@ -375,7 +419,7 @@ static inline unsigned int skb_frag_size(const skb_frag_t *frag)
  */
 static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
 {
-	frag->bv_len = size;
+	frag->len = size;
 }
 
 /**
@@ -385,7 +429,7 @@ static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
  */
 static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
 {
-	frag->bv_len += delta;
+	frag->len += delta;
 }
 
 /**
@@ -395,7 +439,7 @@ static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
  */
 static inline void skb_frag_size_sub(skb_frag_t *frag, int delta)
 {
-	frag->bv_len -= delta;
+	frag->len -= delta;
 }
 
 /**
@@ -2388,7 +2432,8 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
 
 static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
 					      int i, struct page *page,
-					      int off, int size)
+					      int off, int size,
+					      unsigned int cleanup_mode)
 {
 	skb_frag_t *frag = &shinfo->frags[i];
 
@@ -2397,9 +2442,9 @@ static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
 	 * that not all callers have unique ownership of the page but rely
 	 * on page_is_pfmemalloc doing the right thing(tm).
 	 */
-	frag->bv_page		  = page;
-	frag->bv_offset		  = off;
+	__skb_frag_set_page(frag, page, cleanup_mode);
 	skb_frag_size_set(frag, size);
+	frag->offset = off;
 }
 
 /**
@@ -2421,6 +2466,7 @@ static inline void skb_len_add(struct sk_buff *skb, int delta)
  * @page: the page to use for this fragment
  * @off: the offset to the data with @page
  * @size: the length of the data
+ * @cleanup_mode: The cleanup mode to set (0, FOLL_GET, FOLL_PIN)
  *
  * Initialises the @i'th fragment of @skb to point to &size bytes at
  * offset @off within @page.
@@ -2428,9 +2474,11 @@ static inline void skb_len_add(struct sk_buff *skb, int delta)
  * Does not take any additional reference on the fragment.
  */
 static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
-					struct page *page, int off, int size)
+					struct page *page, int off, int size,
+					unsigned int cleanup_mode)
 {
-	__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
+	__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size,
+				   cleanup_mode);
 	page = compound_head(page);
 	if (page_is_pfmemalloc(page))
 		skb->pfmemalloc	= true;
@@ -2443,6 +2491,7 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
  * @page: the page to use for this fragment
  * @off: the offset to the data with @page
  * @size: the length of the data
+ * @cleanup_mode: The cleanup mode to set (0, FOLL_GET, FOLL_PIN)
  *
  * As per __skb_fill_page_desc() -- initialises the @i'th fragment of
  * @skb to point to @size bytes at offset @off within @page. In
@@ -2451,9 +2500,10 @@ static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
  * Does not take any additional reference on the fragment.
  */
 static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
-				      struct page *page, int off, int size)
+				      struct page *page, int off, int size,
+				      unsigned int cleanup_mode)
 {
-	__skb_fill_page_desc(skb, i, page, off, size);
+	__skb_fill_page_desc(skb, i, page, off, size, cleanup_mode);
 	skb_shinfo(skb)->nr_frags = i + 1;
 }
 
@@ -2464,17 +2514,18 @@ static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
  * @page: the page to use for this fragment
  * @off: the offset to the data with @page
  * @size: the length of the data
+ * @cleanup_mode: The cleanup mode to set (0, FOLL_GET, FOLL_PIN)
  *
  * Variant of skb_fill_page_desc() which does not deal with
  * pfmemalloc, if page is not owned by us.
  */
 static inline void skb_fill_page_desc_noacc(struct sk_buff *skb, int i,
 					    struct page *page, int off,
-					    int size)
+					    int size, unsigned int cleanup_mode)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 
-	__skb_fill_page_desc_noacc(shinfo, i, page, off, size);
+	__skb_fill_page_desc_noacc(shinfo, i, page, off, size, cleanup_mode);
 	shinfo->nr_frags = i + 1;
 }
 
@@ -3301,7 +3352,7 @@ static inline void skb_propagate_pfmemalloc(const struct page *page,
  */
 static inline unsigned int skb_frag_off(const skb_frag_t *frag)
 {
-	return frag->bv_offset;
+	return frag->offset;
 }
 
 /**
@@ -3311,7 +3362,7 @@ static inline unsigned int skb_frag_off(const skb_frag_t *frag)
  */
 static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
 {
-	frag->bv_offset += delta;
+	frag->offset += delta;
 }
 
 /**
@@ -3321,7 +3372,7 @@ static inline void skb_frag_off_add(skb_frag_t *frag, int delta)
  */
 static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
 {
-	frag->bv_offset = offset;
+	frag->offset = offset;
 }
 
 /**
@@ -3332,18 +3383,7 @@ static inline void skb_frag_off_set(skb_frag_t *frag, unsigned int offset)
 static inline void skb_frag_off_copy(skb_frag_t *fragto,
 				     const skb_frag_t *fragfrom)
 {
-	fragto->bv_offset = fragfrom->bv_offset;
-}
-
-/**
- * skb_frag_page - retrieve the page referred to by a paged fragment
- * @frag: the paged fragment
- *
- * Returns the &struct page associated with @frag.
- */
-static inline struct page *skb_frag_page(const skb_frag_t *frag)
-{
-	return frag->bv_page;
+	fragto->offset = fragfrom->offset;
 }
 
 /**
@@ -3354,7 +3394,9 @@ static inline struct page *skb_frag_page(const skb_frag_t *frag)
  */
 static inline void __skb_frag_ref(skb_frag_t *frag)
 {
-	get_page(skb_frag_page(frag));
+	struct page *page = skb_frag_page(frag);
+
+	try_grab_page(page, skb_frag_cleanup(frag));
 }
 
 /**
@@ -3385,7 +3427,7 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
 	if (recycle && page_pool_return_skb_page(page))
 		return;
 #endif
-	put_page(page);
+	page_put_unpin(page, skb_frag_cleanup(frag));
 }
 
 /**
@@ -3439,19 +3481,7 @@ static inline void *skb_frag_address_safe(const skb_frag_t *frag)
 static inline void skb_frag_page_copy(skb_frag_t *fragto,
 				      const skb_frag_t *fragfrom)
 {
-	fragto->bv_page = fragfrom->bv_page;
-}
-
-/**
- * __skb_frag_set_page - sets the page contained in a paged fragment
- * @frag: the paged fragment
- * @page: the page to set
- *
- * Sets the fragment @frag to contain @page.
- */
-static inline void __skb_frag_set_page(skb_frag_t *frag, struct page *page)
-{
-	frag->bv_page = page;
+	fragto->page_and_mode = fragfrom->page_and_mode;
 }
 
 /**
@@ -3459,13 +3489,15 @@ static inline void __skb_frag_set_page(skb_frag_t *frag, struct page *page)
  * @skb: the buffer
  * @f: the fragment offset
  * @page: the page to set
+ * @cleanup_mode: The cleanup mode to set (0, FOLL_GET, FOLL_PIN)
  *
  * Sets the @f'th fragment of @skb to contain @page.
  */
 static inline void skb_frag_set_page(struct sk_buff *skb, int f,
-				     struct page *page)
+				     struct page *page,
+				     unsigned int cleanup_mode)
 {
-	__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page);
+	__skb_frag_set_page(&skb_shinfo(skb)->frags[f], page, cleanup_mode);
 }
 
 bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t prio);
diff --git a/io_uring/net.c b/io_uring/net.c
index fbc34a7c2743..1d3e24404d75 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1043,7 +1043,7 @@ static int io_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 		copied += v.bv_len;
 		truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
 		__skb_fill_page_desc_noacc(shinfo, frag++, v.bv_page,
-					   v.bv_offset, v.bv_len);
+					   v.bv_offset, v.bv_len, FOLL_GET);
 		bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
 	}
 	if (bi.bi_size)
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2723623429ac..9ed2de52e1be 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1370,7 +1370,7 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 			}
 
 			frag = &sinfo->frags[sinfo->nr_frags++];
-			__skb_frag_set_page(frag, page);
+			__skb_frag_set_page(frag, page, FOLL_GET);
 
 			data_len = min_t(u32, kattr->test.data_size_in - size,
 					 PAGE_SIZE);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 9f0914b781ad..122bfb144d32 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -678,7 +678,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 				page_ref_sub(last_head, refs);
 				refs = 0;
 			}
-			skb_fill_page_desc_noacc(skb, frag++, head, start, size);
+			skb_fill_page_desc_noacc(skb, frag++, head, start, size,
+						 FOLL_GET);
 		}
 		if (refs)
 			page_ref_sub(last_head, refs);
diff --git a/net/core/gro.c b/net/core/gro.c
index fd8c6a7e8d3e..dfbf2279ce5c 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -228,7 +228,7 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 
 		pinfo->nr_frags = nr_frags + 1 + skbinfo->nr_frags;
 
-		__skb_frag_set_page(frag, page);
+		__skb_frag_set_page(frag, page, FOLL_GET);
 		skb_frag_off_set(frag, first_offset);
 		skb_frag_size_set(frag, first_size);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4a0eb5593275..a6a21a27ebb4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -765,7 +765,7 @@ EXPORT_SYMBOL(__napi_alloc_skb);
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		     int size, unsigned int truesize)
 {
-	skb_fill_page_desc(skb, i, page, off, size);
+	skb_fill_page_desc(skb, i, page, off, size, FOLL_GET);
 	skb->len += size;
 	skb->data_len += size;
 	skb->truesize += truesize;
@@ -1666,10 +1666,10 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 
 	/* skb frags point to kernel buffers */
 	for (i = 0; i < new_frags - 1; i++) {
-		__skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE);
+		__skb_fill_page_desc(skb, i, head, 0, PAGE_SIZE, FOLL_GET);
 		head = (struct page *)page_private(head);
 	}
-	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off);
+	__skb_fill_page_desc(skb, new_frags - 1, head, 0, d_off, FOLL_GET);
 	skb_shinfo(skb)->nr_frags = new_frags;
 
 release:
@@ -3389,7 +3389,7 @@ skb_zerocopy(struct sk_buff *to, struct sk_buff *from, int len, int hlen)
 		if (plen) {
 			page = virt_to_head_page(from->head);
 			offset = from->data - (unsigned char *)page_address(page);
-			__skb_fill_page_desc(to, 0, page, offset, plen);
+			__skb_fill_page_desc(to, 0, page, offset, plen, FOLL_GET);
 			get_page(page);
 			j = 1;
 			len -= plen;
@@ -4040,7 +4040,7 @@ int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
 	} else if (i < MAX_SKB_FRAGS) {
 		skb_zcopy_downgrade_managed(skb);
 		get_page(page);
-		skb_fill_page_desc_noacc(skb, i, page, offset, size);
+		skb_fill_page_desc_noacc(skb, i, page, offset, size, FOLL_GET);
 	} else {
 		return -EMSGSIZE;
 	}
@@ -4077,7 +4077,7 @@ static inline skb_frag_t skb_head_frag_to_page_desc(struct sk_buff *frag_skb)
 	struct page *page;
 
 	page = virt_to_head_page(frag_skb->head);
-	__skb_frag_set_page(&head_frag, page);
+	__skb_frag_set_page(&head_frag, page, FOLL_GET);
 	skb_frag_off_set(&head_frag, frag_skb->data -
 			 (unsigned char *)page_address(page));
 	skb_frag_size_set(&head_frag, skb_headlen(frag_skb));
@@ -5521,7 +5521,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 		offset = from->data - (unsigned char *)page_address(page);
 
 		skb_fill_page_desc(to, to_shinfo->nr_frags,
-				   page, offset, skb_headlen(from));
+				   page, offset, skb_headlen(from), FOLL_GET);
 		*fragstolen = true;
 	} else {
 		if (to_shinfo->nr_frags +
@@ -6221,7 +6221,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 fill_page:
 		chunk = min_t(unsigned long, data_len,
 			      PAGE_SIZE << order);
-		skb_fill_page_desc(skb, i, page, 0, chunk);
+		skb_fill_page_desc(skb, i, page, 0, chunk, FOLL_GET);
 		data_len -= chunk;
 		npages -= 1 << order;
 	}
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 922c87ef1ab5..43ea2e7aeeea 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1221,7 +1221,7 @@ static int __ip_append_data(struct sock *sk,
 					goto error;
 
 				__skb_fill_page_desc(skb, i, pfrag->page,
-						     pfrag->offset, 0);
+						     pfrag->offset, 0, FOLL_GET);
 				skb_shinfo(skb)->nr_frags = ++i;
 				get_page(pfrag->page);
 			}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c567d5e8053e..2cb88e67e152 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1016,7 +1016,7 @@ static struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
 		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 	} else {
 		get_page(page);
-		skb_fill_page_desc_noacc(skb, i, page, offset, copy);
+		skb_fill_page_desc_noacc(skb, i, page, offset, copy, FOLL_GET);
 	}
 
 	if (!(flags & MSG_NO_SHARED_FRAGS))
@@ -1385,7 +1385,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 			} else {
 				skb_fill_page_desc(skb, i, pfrag->page,
-						   pfrag->offset, copy);
+						   pfrag->offset, copy, FOLL_GET);
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 14ed868680c6..13e9d36e132e 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -529,7 +529,7 @@ int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
 			nfrags = skb_shinfo(skb)->nr_frags;
 
 			__skb_fill_page_desc(skb, nfrags, page, pfrag->offset,
-					     tailen);
+					     tailen, FOLL_GET);
 			skb_shinfo(skb)->nr_frags = ++nfrags;
 
 			pfrag->offset = pfrag->offset + allocsize;
@@ -635,7 +635,8 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
 		page = pfrag->page;
 		get_page(page);
 		/* replace page frags in skb with new page */
-		__skb_fill_page_desc(skb, 0, page, pfrag->offset, skb->data_len);
+		__skb_fill_page_desc(skb, 0, page, pfrag->offset, skb->data_len,
+				     FOLL_GET);
 		pfrag->offset = pfrag->offset + allocsize;
 		spin_unlock_bh(&x->lock);
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 60fd91bb5171..117fb2bdad02 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1780,7 +1780,7 @@ static int __ip6_append_data(struct sock *sk,
 					goto error;
 
 				__skb_fill_page_desc(skb, i, pfrag->page,
-						     pfrag->offset, 0);
+						     pfrag->offset, 0, FOLL_GET);
 				skb_shinfo(skb)->nr_frags = ++i;
 				get_page(pfrag->page);
 			}
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b5ab98ca2511..15c9f17ce7d8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2630,7 +2630,7 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 		data += len;
 		flush_dcache_page(page);
 		get_page(page);
-		skb_fill_page_desc(skb, nr_frags, page, offset, len);
+		skb_fill_page_desc(skb, nr_frags, page, offset, len, FOLL_GET);
 		to_write -= len;
 		offset = 0;
 		len_max = PAGE_SIZE;
diff --git a/net/xfrm/xfrm_ipcomp.c b/net/xfrm/xfrm_ipcomp.c
index 80143360bf09..8e9574e00cd0 100644
--- a/net/xfrm/xfrm_ipcomp.c
+++ b/net/xfrm/xfrm_ipcomp.c
@@ -74,7 +74,7 @@ static int ipcomp_decompress(struct xfrm_state *x, struct sk_buff *skb)
 		if (!page)
 			return -ENOMEM;
 
-		__skb_frag_set_page(frag, page);
+		__skb_frag_set_page(frag, page, FOLL_GET);
 
 		len = PAGE_SIZE;
 		if (dlen < len)



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [PATCH v6 34/34] net: [RFC][WIP] Make __zerocopy_sg_from_iter() correctly pin or leave pages unref'd
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (32 preceding siblings ...)
  2023-01-16 23:12 ` [PATCH v6 33/34] net: [RFC][WIP] Mark each skb_frags as to how they should be cleaned up David Howells
@ 2023-01-16 23:12 ` David Howells
  2023-01-17  7:46 ` [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) Christoph Hellwig
  2023-01-18 14:00 ` [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
  35 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-16 23:12 UTC (permalink / raw)
  To: Al Viro
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Make __zerocopy_sg_from_iter() call iov_iter_extract_pages() to get pages
that have been ref'd, pinned or left alone as appropriate.  As this is only
used for source buffers, pinning isn't an option, but being unref'd is.

The way __zerocopy_sg_from_iter() merges fragments is also altered, such
that fragments must also match their cleanup modes to be merged.

An extra helper and wrapper, folio_put_unpin_sub() and page_put_unpin_sub(),
are added to allow multiple refs to be put/unpinned.
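
As a condensed sketch of the intended pairing (the real change is in the
diff below; this just restates it in one place):

	/* The cleanup mode is decided up front from the iterator type and
	 * the I/O direction...
	 */
	unsigned int cleanup_mode = iov_iter_extract_mode(from, FOLL_SOURCE_BUF);

	copied = iov_iter_extract_pages(from, &ppages, length,
					MAX_SKB_FRAGS - frag,
					FOLL_SOURCE_BUF, &start);

	/* ...and the same mode is then used to release the pages, whether
	 * that means dropping refs, unpinning or doing nothing at all.
	 */
	page_put_unpin_sub(last_head, cleanup_mode, refs);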

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: netdev@vger.kernel.org
---

 include/linux/mm.h  |    2 ++
 mm/gup.c            |   25 +++++++++++++++++++++++++
 net/core/datagram.c |   23 +++++++++++++----------
 3 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f14edb192394..e3923b89c75e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1368,7 +1368,9 @@ static inline bool is_cow_mapping(vm_flags_t flags)
 #endif
 
 void folio_put_unpin(struct folio *folio, unsigned int flags);
+void folio_put_unpin_sub(struct folio *folio, unsigned int flags, unsigned int refs);
 void page_put_unpin(struct page *page, unsigned int flags);
+void page_put_unpin_sub(struct page *page, unsigned int flags, unsigned int refs);
 
 /*
  * The identification function is mainly used by the buddy allocator for
diff --git a/mm/gup.c b/mm/gup.c
index 3ee4b4c7e0cb..49dd27ba6c13 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -213,6 +213,31 @@ void page_put_unpin(struct page *page, unsigned int flags)
 }
 EXPORT_SYMBOL_GPL(page_put_unpin);
 
+/**
+ * folio_put_unpin_sub - Unpin/put a folio as appropriate
+ * @folio: The folio to release
+ * @flags: gup flags indicating the mode of release (FOLL_*)
+ * @refs: Number of refs/pins to drop
+ *
+ * Release a folio according to the flags.  If FOLL_GET is set, the folio has a
+ * ref dropped; if FOLL_PIN is set, it is unpinned; otherwise it is left
+ * unaltered.
+ */
+void folio_put_unpin_sub(struct folio *folio, unsigned int flags,
+			 unsigned int refs)
+{
+	if (flags & (FOLL_GET | FOLL_PIN))
+		gup_put_folio(folio, refs, flags);
+}
+EXPORT_SYMBOL_GPL(folio_put_unpin_sub);
+
+void page_put_unpin_sub(struct page *page, unsigned int flags,
+			unsigned int refs)
+{
+	folio_put_unpin_sub(page_folio(page), flags, refs);
+}
+EXPORT_SYMBOL_GPL(page_put_unpin_sub);
+
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount
  * @page:    pointer to page to be grabbed
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 122bfb144d32..63ea1f8817e0 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -614,6 +614,7 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 			    struct sk_buff *skb, struct iov_iter *from,
 			    size_t length)
 {
+	unsigned int cleanup_mode = iov_iter_extract_mode(from, FOLL_SOURCE_BUF);
 	int frag;
 
 	if (msg && msg->msg_ubuf && msg->sg_from_iter)
@@ -622,7 +623,7 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 	frag = skb_shinfo(skb)->nr_frags;
 
 	while (length && iov_iter_count(from)) {
-		struct page *pages[MAX_SKB_FRAGS];
+		struct page *pages[MAX_SKB_FRAGS], **ppages = pages;
 		struct page *last_head = NULL;
 		size_t start;
 		ssize_t copied;
@@ -632,9 +633,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 		if (frag == MAX_SKB_FRAGS)
 			return -EMSGSIZE;
 
-		copied = iov_iter_get_pages(from, pages, length,
-					    MAX_SKB_FRAGS - frag, &start,
-					    FOLL_SOURCE_BUF);
+		copied = iov_iter_extract_pages(from, &ppages, length,
+						MAX_SKB_FRAGS - frag,
+						FOLL_SOURCE_BUF, &start);
 		if (copied < 0)
 			return -EFAULT;
 
@@ -662,12 +663,14 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 				skb_frag_t *last = &skb_shinfo(skb)->frags[frag - 1];
 
 				if (head == skb_frag_page(last) &&
+				    cleanup_mode == skb_frag_cleanup(last) &&
 				    start == skb_frag_off(last) + skb_frag_size(last)) {
 					skb_frag_size_add(last, size);
 					/* We combined this page, we need to release
-					 * a reference. Since compound pages refcount
-					 * is shared among many pages, batch the refcount
-					 * adjustments to limit false sharing.
+					 * a reference or a pin.  Since compound pages
+					 * refcount is shared among many pages, batch
+					 * the refcount adjustments to limit false
+					 * sharing.
 					 */
 					last_head = head;
 					refs++;
@@ -675,14 +678,14 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 				}
 			}
 			if (refs) {
-				page_ref_sub(last_head, refs);
+				page_put_unpin_sub(last_head, cleanup_mode, refs);
 				refs = 0;
 			}
 			skb_fill_page_desc_noacc(skb, frag++, head, start, size,
-						 FOLL_GET);
+						 cleanup_mode);
 		}
 		if (refs)
-			page_ref_sub(last_head, refs);
+			page_put_unpin_sub(last_head, cleanup_mode, refs);
 	}
 	return 0;
 }



^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (33 preceding siblings ...)
  2023-01-16 23:12 ` [PATCH v6 34/34] net: [RFC][WIP] Make __zerocopy_sg_from_iter() correctly pin or leave pages unref'd David Howells
@ 2023-01-17  7:46 ` Christoph Hellwig
  2023-01-18 14:00 ` [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
  35 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  7:46 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, James E.J. Bottomley, Paolo Abeni, John Hubbard,
	Christoph Hellwig, Paulo Alcantara, linux-scsi, Steve French,
	Stefan Metzmacher, Miklos Szeredi, Martin K. Petersen,
	Logan Gunthorpe, Jeff Layton, Jakub Kicinski, netdev,
	Rohith Surabattula, Eric Dumazet, Matthew Wilcox, Anna Schumaker,
	Jens Axboe, Shyam Prasad N, Tom Talpey, linux-rdma,
	Trond Myklebust, Christian Schoenebeck, linux-mm, linux-crypto,
	linux-nfs, v9fs-developer, Latchesar Ionkov, linux-fsdevel,
	Eric Van Hensbergen, Long Li, Jan Kara, linux-cachefs,
	linux-block, Dominique Martinet, Namjae Jeon, David S. Miller,
	linux-cifs, Steve French, Herbert Xu, Christoph Hellwig,
	linux-kernel

First off, the high-level comment:  can we cut down things for a first
round?  Maybe just convert everything using the bio based helpers
and then chunk it up?  Reviewing 34 patches across a dozen subsystems
isn't going to be easy and it will be hard to come up with a final
positive conclusion.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
@ 2023-01-17  7:52   ` Christoph Hellwig
  2023-01-18 22:11     ` Al Viro
  2023-01-17  8:28   ` David Howells
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  7:52 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-kernel

On Mon, Jan 16, 2023 at 11:08:09PM +0000, David Howells wrote:
> IOCB_WRITE is set by aio, io_uring and cachefiles before submitting a write
> operation to the VFS, but it isn't set by, say, the write() system call.
> 
> Fix this by setting IOCB_WRITE unconditionally in call_write_iter().
> 
> This will allow drivers to use IOCB_WRITE instead of the iterator data
> source to determine the I/O direction.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Alexander Viro <viro@zeniv.linux.org.uk>
> cc: Christoph Hellwig <hch@lst.de>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: linux-block@vger.kernel.org
> cc: linux-fsdevel@vger.kernel.org
> ---
> 
>  include/linux/fs.h |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 066555ad1bf8..649ff061440e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2183,6 +2183,7 @@ static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
>  static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
>  				      struct iov_iter *iter)
>  {
> +	kio->ki_flags |= IOCB_WRITE;
>  	return file->f_op->write_iter(kio, iter);
>  }

This doesn't remove the existing setting of IOCB_WRITE, and also
feels like the wrong place.

I suspect the best is to:

 - rename init_sync_kiocb to init_kiocb
 - pass a new argument for the destination to it.  I'm not entirely
   sure if flags is a good thing, or whether an explicit READ/WRITE might be
   better because it's harder to get wrong, even if the compiler
   might generate worse code for it.
 - also use it in the async callers (io_uring, aio, overlayfs, loop,
   nvmet, target, cachefs, file backed swap)
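
A rough sketch of what such a helper might look like (an editorial
illustration of the suggestion only, not code from this thread or from the
series; it assumes the existing iocb_flags() and get_current_ioprio()
helpers):

	static inline void init_kiocb(struct kiocb *kiocb, struct file *filp,
				      unsigned int direction)
	{
		*kiocb = (struct kiocb) {
			.ki_filp	= filp,
			.ki_flags	= iocb_flags(filp) |
					  (direction == WRITE ? IOCB_WRITE : 0),
			.ki_ioprio	= get_current_ioprio(),
		};
	}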

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 02/34] iov_iter: Use IOCB/IOMAP_WRITE/op_is_write rather than iterator direction
  2023-01-16 23:08 ` [PATCH v6 02/34] iov_iter: Use IOCB/IOMAP_WRITE/op_is_write rather than iterator direction David Howells
@ 2023-01-17  7:55   ` Christoph Hellwig
  2023-01-18 22:18     ` Al Viro
  0 siblings, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  7:55 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Mon, Jan 16, 2023 at 11:08:17PM +0000, David Howells wrote:
> Use information other than the iterator direction to determine the
> direction of the I/O:
> 
>  (*) If a kiocb is available, use the IOCB_WRITE flag.
> 
>  (*) If an iomap_iter is available, use the IOMAP_WRITE flag.
> 
>  (*) If a request is available, use op_is_write().

These really should be three independent patches.  Plus another one
to drop the debug checks in cifs.

The changes themselves look good to me.

>  
> +static unsigned char iov_iter_rw(const struct iov_iter *i)
> +{
> +	return i->data_source ? WRITE : READ;
> +}

It might also make sense to just open-code this in the only
caller (yet another patch).
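
At that call site the check would then presumably become something like
(illustrative only):

	if (i->data_source)	/* ITER_SOURCE, i.e. a WRITE */
		...

rather than going through the helper.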

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-16 23:08 ` [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*() David Howells
@ 2023-01-17  7:57   ` Christoph Hellwig
  2023-01-17  8:07     ` David Hildenbrand
  2023-01-18 23:03     ` Al Viro
  2023-01-17  8:44   ` David Howells
  1 sibling, 2 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  7:57 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Peter Zijlstra, David Hildenbrand

On Mon, Jan 16, 2023 at 11:08:24PM +0000, David Howells wrote:
> Define FOLL_SOURCE_BUF and FOLL_DEST_BUF to indicate to get_user_pages*()
> and iov_iter_get_pages*() how the buffer is intended to be used in an I/O
> operation.  Don't use READ and WRITE as a read I/O writes to memory and
> vice versa - which causes confusion.
> 
> The direction is checked against the iterator's data_source.

Why can't we use the existing FOLL_WRITE?

> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  block/bio.c             |    6 ++++++
>  block/blk-map.c         |    2 ++
>  crypto/af_alg.c         |    9 ++++++---
>  crypto/algif_hash.c     |    3 ++-
>  drivers/vhost/scsi.c    |    9 ++++++---
>  fs/ceph/addr.c          |    2 +-
>  fs/ceph/file.c          |   14 ++++++++------
>  fs/cifs/file.c          |    8 ++++----
>  fs/cifs/misc.c          |    3 ++-
>  fs/direct-io.c          |    6 ++++--
>  fs/fuse/dev.c           |    3 ++-
>  fs/fuse/file.c          |    8 ++++----
>  fs/nfs/direct.c         |   10 ++++++----
>  fs/splice.c             |    3 ++-
>  include/crypto/if_alg.h |    3 ++-
>  include/linux/bio.h     |   18 ++++++++++++++++--
>  include/linux/mm.h      |   10 ++++++++++
>  lib/iov_iter.c          |   14 +++++++-------
>  net/9p/trans_virtio.c   |   12 ++++++++----
>  net/core/datagram.c     |    5 +++--
>  net/core/skmsg.c        |    4 ++--
>  net/rds/message.c       |    4 ++--
>  net/tls/tls_sw.c        |    5 ++---
>  23 files changed, 107 insertions(+), 54 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 5f96fcae3f75..867cf4db87ea 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1242,6 +1242,8 @@ static int bio_iov_add_zone_append_page(struct bio *bio, struct page *page,
>   * pages will have to be released using put_page() when done.
>   * For multi-segment *iter, this function only adds pages from the
>   * next non-empty segment of the iov iterator.
> + *
> + * The I/O direction is determined from the bio operation type.
>   */
>  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> @@ -1263,6 +1265,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
>  	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>  
> +	gup_flags |= bio_is_write(bio) ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
> +
>  	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
>  		gup_flags |= FOLL_PCI_P2PDMA;
>  
> @@ -1332,6 +1336,8 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>   * fit into the bio, or are requested in @iter, whatever is smaller. If
>   * MM encounters an error pinning the requested pages, it stops. Error
>   * is returned only if 0 pages could be pinned.
> + *
> + * The bio operation indicates the data direction.
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 08cbb7ff3b19..c30be529fb55 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -279,6 +279,8 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>  	if (bio == NULL)
>  		return -ENOMEM;
>  
> +	gup_flags |= bio_is_write(bio) ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
> +
>  	if (blk_queue_pci_p2pdma(rq->q))
>  		gup_flags |= FOLL_PCI_P2PDMA;
>  
> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
> index 0a4fa2a429e2..7a68db157fae 100644
> --- a/crypto/af_alg.c
> +++ b/crypto/af_alg.c
> @@ -531,13 +531,15 @@ static const struct net_proto_family alg_family = {
>  	.owner	=	THIS_MODULE,
>  };
>  
> -int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len)
> +int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len,
> +		   unsigned int gup_flags)
>  {
>  	size_t off;
>  	ssize_t n;
>  	int npages, i;
>  
> -	n = iov_iter_get_pages2(iter, sgl->pages, len, ALG_MAX_PAGES, &off);
> +	n = iov_iter_get_pages(iter, sgl->pages, len, ALG_MAX_PAGES, &off,
> +			       gup_flags);
>  	if (n < 0)
>  		return n;
>  
> @@ -1310,7 +1312,8 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
>  		list_add_tail(&rsgl->list, &areq->rsgl_list);
>  
>  		/* make one iovec available as scatterlist */
> -		err = af_alg_make_sg(&rsgl->sgl, &msg->msg_iter, seglen);
> +		err = af_alg_make_sg(&rsgl->sgl, &msg->msg_iter, seglen,
> +				     FOLL_DEST_BUF);
>  		if (err < 0) {
>  			rsgl->sg_num_bytes = 0;
>  			return err;
> diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
> index 1d017ec5c63c..fe3d2258145f 100644
> --- a/crypto/algif_hash.c
> +++ b/crypto/algif_hash.c
> @@ -91,7 +91,8 @@ static int hash_sendmsg(struct socket *sock, struct msghdr *msg,
>  		if (len > limit)
>  			len = limit;
>  
> -		len = af_alg_make_sg(&ctx->sgl, &msg->msg_iter, len);
> +		len = af_alg_make_sg(&ctx->sgl, &msg->msg_iter, len,
> +				     FOLL_SOURCE_BUF);
>  		if (len < 0) {
>  			err = copied ? 0 : len;
>  			goto unlock;
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index dca6346d75b3..5d10837d19ec 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -646,10 +646,13 @@ vhost_scsi_map_to_sgl(struct vhost_scsi_cmd *cmd,
>  	struct scatterlist *sg = sgl;
>  	ssize_t bytes;
>  	size_t offset;
> -	unsigned int npages = 0;
> +	unsigned int npages = 0, gup_flags = 0;
>  
> -	bytes = iov_iter_get_pages2(iter, pages, LONG_MAX,
> -				VHOST_SCSI_PREALLOC_UPAGES, &offset);
> +	gup_flags |= write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF;
> +
> +	bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
> +				   VHOST_SCSI_PREALLOC_UPAGES, &offset,
> +				   gup_flags);
>  	/* No pages were pinned */
>  	if (bytes <= 0)
>  		return bytes < 0 ? bytes : -EFAULT;
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8c74871e37c9..cfc3353e5604 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -328,7 +328,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
>  
>  	dout("%s: pos=%llu orig_len=%zu len=%llu\n", __func__, subreq->start, subreq->len, len);
>  	iov_iter_xarray(&iter, ITER_DEST, &rreq->mapping->i_pages, subreq->start, len);
> -	err = iov_iter_get_pages_alloc2(&iter, &pages, len, &page_off);
> +	err = iov_iter_get_pages_alloc(&iter, &pages, len, &page_off, FOLL_DEST_BUF);
>  	if (err < 0) {
>  		dout("%s: iov_ter_get_pages_alloc returned %d\n", __func__, err);
>  		goto out;
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 27c72a2f6af5..ffd36eeea186 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -81,7 +81,7 @@ static __le32 ceph_flags_sys2wire(u32 flags)
>  #define ITER_GET_BVECS_PAGES	64
>  
>  static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
> -				struct bio_vec *bvecs)
> +				struct bio_vec *bvecs, bool write)
>  {
>  	size_t size = 0;
>  	int bvec_idx = 0;
> @@ -95,8 +95,9 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
>  		size_t start;
>  		int idx = 0;
>  
> -		bytes = iov_iter_get_pages2(iter, pages, maxsize - size,
> -					   ITER_GET_BVECS_PAGES, &start);
> +		bytes = iov_iter_get_pages(iter, pages, maxsize - size,
> +					   ITER_GET_BVECS_PAGES, &start,
> +					   write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
>  		if (bytes < 0)
>  			return size ?: bytes;
>  
> @@ -127,7 +128,8 @@ static ssize_t __iter_get_bvecs(struct iov_iter *iter, size_t maxsize,
>   * Return the number of bytes in the created bio_vec array, or an error.
>   */
>  static ssize_t iter_get_bvecs_alloc(struct iov_iter *iter, size_t maxsize,
> -				    struct bio_vec **bvecs, int *num_bvecs)
> +				    struct bio_vec **bvecs, int *num_bvecs,
> +				    bool write)
>  {
>  	struct bio_vec *bv;
>  	size_t orig_count = iov_iter_count(iter);
> @@ -146,7 +148,7 @@ static ssize_t iter_get_bvecs_alloc(struct iov_iter *iter, size_t maxsize,
>  	if (!bv)
>  		return -ENOMEM;
>  
> -	bytes = __iter_get_bvecs(iter, maxsize, bv);
> +	bytes = __iter_get_bvecs(iter, maxsize, bv, write);
>  	if (bytes < 0) {
>  		/*
>  		 * No pages were pinned -- just free the array.
> @@ -1334,7 +1336,7 @@ ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
>  			break;
>  		}
>  
> -		len = iter_get_bvecs_alloc(iter, size, &bvecs, &num_pages);
> +		len = iter_get_bvecs_alloc(iter, size, &bvecs, &num_pages, write);
>  		if (len < 0) {
>  			ceph_osdc_put_request(req);
>  			ret = len;
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 22dfc1f8b4f1..d100b9cb8682 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -3290,8 +3290,8 @@ cifs_write_from_iter(loff_t offset, size_t len, struct iov_iter *from,
>  		if (ctx->direct_io) {
>  			ssize_t result;
>  
> -			result = iov_iter_get_pages_alloc2(
> -				from, &pagevec, cur_len, &start);
> +			result = iov_iter_get_pages_alloc(
> +				from, &pagevec, cur_len, &start, FOLL_SOURCE_BUF);
>  			if (result < 0) {
>  				cifs_dbg(VFS,
>  					 "direct_writev couldn't get user pages (rc=%zd) iter type %d iov_offset %zd count %zd\n",
> @@ -4031,9 +4031,9 @@ cifs_send_async_read(loff_t offset, size_t len, struct cifsFileInfo *open_file,
>  		if (ctx->direct_io) {
>  			ssize_t result;
>  
> -			result = iov_iter_get_pages_alloc2(
> +			result = iov_iter_get_pages_alloc(
>  					&direct_iov, &pagevec,
> -					cur_len, &start);
> +					cur_len, &start, FOLL_DEST_BUF);
>  			if (result < 0) {
>  				cifs_dbg(VFS,
>  					 "Couldn't get user pages (rc=%zd) iter type %d iov_offset %zd count %zd\n",
> diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
> index 4d3c586785a5..9655cf359ab9 100644
> --- a/fs/cifs/misc.c
> +++ b/fs/cifs/misc.c
> @@ -1030,7 +1030,8 @@ setup_aio_ctx_iter(struct cifs_aio_ctx *ctx, struct iov_iter *iter, int rw)
>  	saved_len = count;
>  
>  	while (count && npages < max_pages) {
> -		rc = iov_iter_get_pages2(iter, pages, count, max_pages, &start);
> +		rc = iov_iter_get_pages(iter, pages, count, max_pages, &start,
> +					rw == WRITE ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
>  		if (rc < 0) {
>  			cifs_dbg(VFS, "Couldn't get user pages (rc=%zd)\n", rc);
>  			break;
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index cf196f2a211e..b1e26a706e31 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -169,8 +169,10 @@ static inline int dio_refill_pages(struct dio *dio, struct dio_submit *sdio)
>  	const enum req_op dio_op = dio->opf & REQ_OP_MASK;
>  	ssize_t ret;
>  
> -	ret = iov_iter_get_pages2(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
> -				&sdio->from);
> +	ret = iov_iter_get_pages(sdio->iter, dio->pages, LONG_MAX, DIO_PAGES,
> +				 &sdio->from,
> +				 op_is_write(dio_op) ?
> +				 FOLL_SOURCE_BUF : FOLL_DEST_BUF);
>  
>  	if (ret < 0 && sdio->blocks_available && dio_op == REQ_OP_WRITE) {
>  		struct page *page = ZERO_PAGE(0);
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index e8b60ce72c9a..e3d8443e24a6 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -730,7 +730,8 @@ static int fuse_copy_fill(struct fuse_copy_state *cs)
>  		}
>  	} else {
>  		size_t off;
> -		err = iov_iter_get_pages2(cs->iter, &page, PAGE_SIZE, 1, &off);
> +		err = iov_iter_get_pages(cs->iter, &page, PAGE_SIZE, 1, &off,
> +					 cs->write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
>  		if (err < 0)
>  			return err;
>  		BUG_ON(!err);
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index d68b45f8b3ae..68c196437306 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1414,10 +1414,10 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>  	while (nbytes < *nbytesp && ap->num_pages < max_pages) {
>  		unsigned npages;
>  		size_t start;
> -		ret = iov_iter_get_pages2(ii, &ap->pages[ap->num_pages],
> -					*nbytesp - nbytes,
> -					max_pages - ap->num_pages,
> -					&start);
> +		ret = iov_iter_get_pages(ii, &ap->pages[ap->num_pages],
> +					 *nbytesp - nbytes,
> +					 max_pages - ap->num_pages,
> +					 &start, write ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
>  		if (ret < 0)
>  			break;
>  
> diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
> index d865945f2a63..42af84685f20 100644
> --- a/fs/nfs/direct.c
> +++ b/fs/nfs/direct.c
> @@ -332,8 +332,9 @@ static ssize_t nfs_direct_read_schedule_iovec(struct nfs_direct_req *dreq,
>  		size_t pgbase;
>  		unsigned npages, i;
>  
> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
> -						  rsize, &pgbase);
> +		result = iov_iter_get_pages_alloc(iter, &pagevec,
> +						  rsize, &pgbase,
> +						  FOLL_DEST_BUF);
>  		if (result < 0)
>  			break;
>  	
> @@ -791,8 +792,9 @@ static ssize_t nfs_direct_write_schedule_iovec(struct nfs_direct_req *dreq,
>  		size_t pgbase;
>  		unsigned npages, i;
>  
> -		result = iov_iter_get_pages_alloc2(iter, &pagevec,
> -						  wsize, &pgbase);
> +		result = iov_iter_get_pages_alloc(iter, &pagevec,
> +						  wsize, &pgbase,
> +						  FOLL_SOURCE_BUF);
>  		if (result < 0)
>  			break;
>  
> diff --git a/fs/splice.c b/fs/splice.c
> index 5969b7a1d353..19c5b5adc548 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1165,7 +1165,8 @@ static int iter_to_pipe(struct iov_iter *from,
>  		size_t start;
>  		int i, n;
>  
> -		left = iov_iter_get_pages2(from, pages, ~0UL, 16, &start);
> +		left = iov_iter_get_pages(from, pages, ~0UL, 16, &start,
> +					  FOLL_SOURCE_BUF);
>  		if (left <= 0) {
>  			ret = left;
>  			break;
> diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
> index a5db86670bdf..12058ab6cad9 100644
> --- a/include/crypto/if_alg.h
> +++ b/include/crypto/if_alg.h
> @@ -165,7 +165,8 @@ int af_alg_release(struct socket *sock);
>  void af_alg_release_parent(struct sock *sk);
>  int af_alg_accept(struct sock *sk, struct socket *newsock, bool kern);
>  
> -int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len);
> +int af_alg_make_sg(struct af_alg_sgl *sgl, struct iov_iter *iter, int len,
> +		   unsigned int gup_flags);
>  void af_alg_free_sg(struct af_alg_sgl *sgl);
>  
>  static inline struct alg_sock *alg_sk(struct sock *sk)
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 22078a28d7cb..3f7ba7fe48ac 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -40,11 +40,25 @@ static inline unsigned int bio_max_segs(unsigned int nr_segs)
>  #define bio_sectors(bio)	bvec_iter_sectors((bio)->bi_iter)
>  #define bio_end_sector(bio)	bvec_iter_end_sector((bio)->bi_iter)
>  
> +/**
> + * bio_is_write - Query if the I/O direction is towards the disk
> + * @bio: The bio to query
> + *
> + * Return true if this is some sort of write operation - ie. the data is going
> + * towards the disk.
> + */
> +static inline bool bio_is_write(const struct bio *bio)
> +{
> +	return op_is_write(bio_op(bio));
> +}
> +
>  /*
>   * Return the data direction, READ or WRITE.
>   */
> -#define bio_data_dir(bio) \
> -	(op_is_write(bio_op(bio)) ? WRITE : READ)
> +static inline int bio_data_dir(const struct bio *bio)
> +{
> +	return bio_is_write(bio) ? WRITE : READ;
> +}
>  
>  /*
>   * Check whether this bio carries any data or not. A NULL bio is allowed.
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3f196e4d66d..3af4ca8b1fe7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3090,6 +3090,10 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>  #define FOLL_PCI_P2PDMA	0x100000 /* allow returning PCI P2PDMA pages */
>  #define FOLL_INTERRUPTIBLE  0x200000 /* allow interrupts from generic signals */
>  
> +#define FOLL_SOURCE_BUF	0		/* Memory will be read from by I/O */
> +#define FOLL_DEST_BUF	FOLL_WRITE	/* Memory will be written to by I/O */
> +#define FOLL_BUF_MASK	FOLL_WRITE
> +
>  /*
>   * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
>   * other. Here is what they mean, and how to use them:
> @@ -3143,6 +3147,12 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
>   * releasing pages: get_user_pages*() pages must be released via put_page(),
>   * while pin_user_pages*() pages must be released via unpin_user_page().
>   *
> + * FOLL_SOURCE_BUF and FOLL_DEST_BUF are indicators to get_user_pages*() and
> + * iov_iter_*_pages*() as to how the pages obtained are going to be used.
> + * FOLL_SOURCE_BUF indicates that I/O op is going to transfer from memory to
> + * device; FOLL_DEST_BUF that the op is going to transfer from device to
> + * memory.
> + *
>   * Please see Documentation/core-api/pin_user_pages.rst for more information.
>   */
>  
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 68497d9c1452..f53583836009 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1429,11 +1429,6 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
>  	return page;
>  }
>  
> -static unsigned char iov_iter_rw(const struct iov_iter *i)
> -{
> -	return i->data_source ? WRITE : READ;
> -}
> -
>  static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  		   struct page ***pages, size_t maxsize,
>  		   unsigned int maxpages, size_t *start,
> @@ -1448,12 +1443,17 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
>  	if (maxsize > MAX_RW_COUNT)
>  		maxsize = MAX_RW_COUNT;
>  
> +	if (WARN_ON_ONCE((gup_flags & FOLL_BUF_MASK) == FOLL_SOURCE_BUF &&
> +			 i->data_source == ITER_DEST))
> +		return -EIO;
> +	if (WARN_ON_ONCE((gup_flags & FOLL_BUF_MASK) == FOLL_DEST_BUF &&
> +			 i->data_source == ITER_SOURCE))
> +		return -EIO;
> +
>  	if (likely(user_backed_iter(i))) {
>  		unsigned long addr;
>  		int res;
>  
> -		if (iov_iter_rw(i) != WRITE)
> -			gup_flags |= FOLL_WRITE;
>  		if (i->nofault)
>  			gup_flags |= FOLL_NOFAULT;
>  
> diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
> index 3c27ffb781e3..eb28b54fe5f6 100644
> --- a/net/9p/trans_virtio.c
> +++ b/net/9p/trans_virtio.c
> @@ -310,7 +310,8 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
>  			       struct iov_iter *data,
>  			       int count,
>  			       size_t *offs,
> -			       int *need_drop)
> +			       int *need_drop,
> +			       unsigned int gup_flags)
>  {
>  	int nr_pages;
>  	int err;
> @@ -330,7 +331,8 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
>  			if (err == -ERESTARTSYS)
>  				return err;
>  		}
> -		n = iov_iter_get_pages_alloc2(data, pages, count, offs);
> +		n = iov_iter_get_pages_alloc(data, pages, count, offs,
> +					     gup_flags);
>  		if (n < 0)
>  			return n;
>  		*need_drop = 1;
> @@ -437,7 +439,8 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
>  	if (uodata) {
>  		__le32 sz;
>  		int n = p9_get_mapped_pages(chan, &out_pages, uodata,
> -					    outlen, &offs, &need_drop);
> +					    outlen, &offs, &need_drop,
> +					    FOLL_DEST_BUF);
>  		if (n < 0) {
>  			err = n;
>  			goto err_out;
> @@ -456,7 +459,8 @@ p9_virtio_zc_request(struct p9_client *client, struct p9_req_t *req,
>  		memcpy(&req->tc.sdata[0], &sz, sizeof(sz));
>  	} else if (uidata) {
>  		int n = p9_get_mapped_pages(chan, &in_pages, uidata,
> -					    inlen, &offs, &need_drop);
> +					    inlen, &offs, &need_drop,
> +					    FOLL_SOURCE_BUF);
>  		if (n < 0) {
>  			err = n;
>  			goto err_out;
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index e4ff2db40c98..9f0914b781ad 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -632,8 +632,9 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>  		if (frag == MAX_SKB_FRAGS)
>  			return -EMSGSIZE;
>  
> -		copied = iov_iter_get_pages2(from, pages, length,
> -					    MAX_SKB_FRAGS - frag, &start);
> +		copied = iov_iter_get_pages(from, pages, length,
> +					    MAX_SKB_FRAGS - frag, &start,
> +					    FOLL_SOURCE_BUF);
>  		if (copied < 0)
>  			return -EFAULT;
>  
> diff --git a/net/core/skmsg.c b/net/core/skmsg.c
> index 53d0251788aa..f63a13690712 100644
> --- a/net/core/skmsg.c
> +++ b/net/core/skmsg.c
> @@ -324,8 +324,8 @@ int sk_msg_zerocopy_from_iter(struct sock *sk, struct iov_iter *from,
>  			goto out;
>  		}
>  
> -		copied = iov_iter_get_pages2(from, pages, bytes, maxpages,
> -					    &offset);
> +		copied = iov_iter_get_pages(from, pages, bytes, maxpages,
> +					    &offset, FOLL_SOURCE_BUF);
>  		if (copied <= 0) {
>  			ret = -EFAULT;
>  			goto out;
> diff --git a/net/rds/message.c b/net/rds/message.c
> index b47e4f0a1639..fcfd406b97af 100644
> --- a/net/rds/message.c
> +++ b/net/rds/message.c
> @@ -390,8 +390,8 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
>  		size_t start;
>  		ssize_t copied;
>  
> -		copied = iov_iter_get_pages2(from, &pages, PAGE_SIZE,
> -					    1, &start);
> +		copied = iov_iter_get_pages(from, &pages, PAGE_SIZE,
> +					    1, &start, FOLL_SOURCE_BUF);
>  		if (copied < 0) {
>  			struct mmpin *mmp;
>  			int i;
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 9ed978634125..59acaeb24f54 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -1354,9 +1354,8 @@ static int tls_setup_from_iter(struct iov_iter *from,
>  			rc = -EFAULT;
>  			goto out;
>  		}
> -		copied = iov_iter_get_pages2(from, pages,
> -					    length,
> -					    maxpages, &offset);
> +		copied = iov_iter_get_pages(from, pages, length,
> +					    maxpages, &offset, FOLL_DEST_BUF);
>  		if (copied <= 0) {
>  			rc = -EFAULT;
>  			goto out;
> 
> 
---end quoted text---

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions
  2023-01-16 23:08 ` [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions David Howells
@ 2023-01-17  7:58   ` Christoph Hellwig
  2023-01-18 22:28     ` Al Viro
  2023-01-18 23:15   ` Al Viro
  1 sibling, 1 reply; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  7:58 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Mon, Jan 16, 2023 at 11:08:44PM +0000, David Howells wrote:
> Use the direction in the iterator functions rather than READ/WRITE.

I don't think we need the direction at all as nothing uses it any more.
Maybe don't create churn there until that's been sorted.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 07/34] iov_iter: Add a function to extract a page list from an iterator
  2023-01-16 23:08 ` [PATCH v6 07/34] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-01-17  8:01   ` Christoph Hellwig
  2023-01-17  8:19   ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:01 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, John Hubbard, Matthew Wilcox,
	linux-fsdevel, linux-mm, Christoph Hellwig, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-block, linux-kernel

> Changes:
> ========
> ver #6)
>  - Add back the function to indicate the cleanup mode.
>  - Drop the cleanup_mode return arg to iov_iter_extract_pages().
>  - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
>    data_source.

FYI, the changelog goes after the --- so that it doesn't get added
to the git history.

> Link: https://lore.kernel.org/r/166732025748.3186319.8314014902727092626.stgit@warthog.procyon.org.uk/ # rfc
> Link: https://lore.kernel.org/r/166869689451.3723671.18242195992447653092.stgit@warthog.procyon.org.uk/ # rfc
> Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
> Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
> Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
> Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5

And all these links aren't exactly useful.  This fairly trivial commit
is going to look like a hot mess in git.

> +ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
> +			       size_t maxsize, unsigned int maxpages,
> +			       unsigned int gup_flags, size_t *offset0);

This function isn't actually added in the current patch.

> +#define iov_iter_extract_mode(iter, gup_flags) \
> +	(user_backed_iter(iter) ?				\
> +	 (gup_flags & FOLL_BUF_MASK) == FOLL_SOURCE_BUF ?	\
> +	 FOLL_GET : FOLL_PIN : 0)

An inline function would be nice here.  I guess that would require
moving the FOLL flags into mm_types.h, though.
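
Something like this, perhaps (untested sketch, assuming the FOLL_* flags
are visible wherever it ends up):

	static inline unsigned int iov_iter_extract_mode(struct iov_iter *iter,
							 unsigned int gup_flags)
	{
		/* Only user-backed iterators take a ref or a pin on the pages. */
		if (!user_backed_iter(iter))
			return 0;
		/* A source buffer is only read from, so a ref is enough;
		 * a destination buffer gets pinned. */
		return (gup_flags & FOLL_BUF_MASK) == FOLL_SOURCE_BUF ?
			FOLL_GET : FOLL_PIN;
	}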

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 08/34] mm: Provide a helper to drop a pin/ref on a page
  2023-01-16 23:08 ` [PATCH v6 08/34] mm: Provide a helper to drop a pin/ref on a page David Howells
@ 2023-01-17  8:02   ` Christoph Hellwig
  2023-01-17  8:21   ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:02 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Mon, Jan 16, 2023 at 11:08:59PM +0000, David Howells wrote:
> Provide a helper in the get_user_pages code to drop a pin or a ref on a
> page based on being given FOLL_GET or FOLL_PIN in its flags argument or do
> nothing if neither is set.

Please don't add new page based helpers.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-16 23:09 ` [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
@ 2023-01-17  8:02   ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:02 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-block, Christoph Hellwig, Jeff Layton,
	linux-fsdevel, linux-kernel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 10/34] mm, block: Make BIO_PAGE_REFFED/PINNED the same as FOLL_GET/PIN numerically
  2023-01-16 23:09 ` [PATCH v6 10/34] mm, block: Make BIO_PAGE_REFFED/PINNED the same as FOLL_GET/PIN numerically David Howells
@ 2023-01-17  8:03   ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:03 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-block, Christoph Hellwig, Jeff Layton,
	linux-fsdevel, linux-kernel

On Mon, Jan 16, 2023 at 11:09:13PM +0000, David Howells wrote:
> Make BIO_PAGE_REFFED the same as FOLL_GET and BIO_PAGE_PINNED the same as
> FOLL_PIN numerically so that the BIO_* flags can be passed directly to
> page_put_unpin().

Umm, no.  No matching entangling of flags, please.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 11/34] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-16 23:09 ` [PATCH v6 11/34] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-17  8:07   ` Christoph Hellwig
  2023-01-17  8:26   ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:07 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-block, Christoph Hellwig, Jeff Layton,
	linux-fsdevel, linux-kernel

> +	size = iov_iter_extract_pages(iter, &pages,
> +				      UINT_MAX - bio->bi_iter.bi_size,
> +				      nr_pages, gup_flags, &offset);
>  	if (unlikely(size <= 0))


> +	bio_set_cleanup_mode(bio, iter, gup_flags);

This should move out to bio_iov_iter_get_pages and only be called
once.

> +++ b/block/blk-map.c
> @@ -285,24 +285,24 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
>  		gup_flags |= FOLL_PCI_P2PDMA;
>  
>  	while (iov_iter_count(iter)) {
> -		struct page **pages, *stack_pages[UIO_FASTIOV];
> +		struct page *stack_pages[UIO_FASTIOV];
> +		struct page **pages = stack_pages;
>  		ssize_t bytes;
>  		size_t offs;
>  		int npages;
>  
> -		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
> -			pages = stack_pages;
> -			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
> -						   nr_vecs, &offs, gup_flags);
> -		} else {
> -			bytes = iov_iter_get_pages_alloc(iter, &pages,
> -						LONG_MAX, &offs, gup_flags);
> -		}
> +		if (nr_vecs > ARRAY_SIZE(stack_pages))
> +			pages = NULL;
> +
> +		bytes = iov_iter_extract_pages(iter, &pages, LONG_MAX,
> +					       nr_vecs, gup_flags, &offs);
>  		if (unlikely(bytes <= 0)) {
>  			ret = bytes ? bytes : -EFAULT;
>  			goto out_unmap;
>  		}
>  
> +		bio_set_cleanup_mode(bio, iter, gup_flags);

Same here - one call outside of the loop.

> +static inline void bio_set_cleanup_mode(struct bio *bio, struct iov_iter *iter,
> +					unsigned int gup_flags)
> +{
> +	unsigned int cleanup_mode;
> +
> +	bio_clear_flag(bio, BIO_PAGE_REFFED);

.. and this should not be needed.  Instead:

> +	cleanup_mode = iov_iter_extract_mode(iter, gup_flags);
> +	if (cleanup_mode & FOLL_GET)
> +		bio_set_flag(bio, BIO_PAGE_REFFED);
> +	if (cleanup_mode & FOLL_PIN)
> +		bio_set_flag(bio, BIO_PAGE_PINNED);

We could warn if a non-matching flag is set here if we really care.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 12/34] bio: Fix bio_flagged() so that gcc can better optimise it
  2023-01-16 23:09 ` [PATCH v6 12/34] bio: Fix bio_flagged() so that gcc can better optimise it David Howells
@ 2023-01-17  8:07   ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:07 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Jens Axboe, linux-block, Christoph Hellwig,
	Matthew Wilcox, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-kernel

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-17  7:57   ` Christoph Hellwig
@ 2023-01-17  8:07     ` David Hildenbrand
  2023-01-17  8:09       ` Christoph Hellwig
  2023-01-18 23:03     ` Al Viro
  1 sibling, 1 reply; 91+ messages in thread
From: David Hildenbrand @ 2023-01-17  8:07 UTC (permalink / raw)
  To: Christoph Hellwig, David Howells
  Cc: Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Peter Zijlstra

On 17.01.23 08:57, Christoph Hellwig wrote:
> On Mon, Jan 16, 2023 at 11:08:24PM +0000, David Howells wrote:
>> Define FOLL_SOURCE_BUF and FOLL_DEST_BUF to indicate to get_user_pages*()
>> and iov_iter_get_pages*() how the buffer is intended to be used in an I/O
>> operation.  Don't use READ and WRITE as a read I/O writes to memory and
>> vice versa - which causes confusion.
>>
>> The direction is checked against the iterator's data_source.
> 
> Why can't we use the existing FOLL_WRITE?

Agreed. From what I understand, David considers that confusing when 
considering the I/O side of things.

I recall that there is

DMA_BIDIRECTIONAL -> FOLL_WRITE
DMA_TO_DEVICE -> !FOLL_WRITE
DMA_FROM_DEVICE -> FOLL_WRITE

that used different defines for a different API. Such terminology would 
be easier to grasp ... but then again, not sure if we really need 
acronyms here.

We're pinning pages and FOLL_WRITE defines how we (the ones pinning the 
page) are going to access these pages: R/O or R/W. So the read vs. write 
is never from the POV of the device (a DMA read will write to the page).
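
As a rough, untested sketch of that mapping (the helper name here is made
up, just to illustrate):

	/* Map a DMA data direction to the access mode we need on the pages. */
	static inline unsigned int dma_dir_to_gup_flags(enum dma_data_direction dir)
	{
		/* The device writes to memory in these two cases, so the
		 * pinned pages must be writable by us. */
		if (dir == DMA_BIDIRECTIONAL || dir == DMA_FROM_DEVICE)
			return FOLL_WRITE;
		/* DMA_TO_DEVICE: the device only reads from memory. */
		return 0;
	}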

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-17  8:07     ` David Hildenbrand
@ 2023-01-17  8:09       ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Hellwig, David Howells, Al Viro, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Peter Zijlstra

On Tue, Jan 17, 2023 at 09:07:48AM +0100, David Hildenbrand wrote:
> Agreed. What I understand, David considers that confusing when considering
> the I/O side of things.
> 
> I recall that there is
> 
> DMA_BIDIRECTIONAL -> FOLL_WRITE
> DMA_TO_DEVICE -> !FOLL_WRITE
> DMA_FROM_DEVICE -> FOLL_WRITE
> 
> that used different defines for a different API. Such terminology would be
> easier to get ... but then, again, not sure if we really need acronyms here.
> 
> We're pinning pages and FOLL_WRITE defines how we (pinning the page) are
> going to access these pages: R/O or R/W. So the read vs. write is never from
> the POC of the device (DMA read will write to the page).

Yes.  Maybe the name could be a little more verbose, FOLL_MEM_WRITE or
FOLL_WRITE_TO_MEM.  But I'd really prefer any renaming to be split from
logic changes.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 07/34] iov_iter: Add a function to extract a page list from an iterator
  2023-01-16 23:08 ` [PATCH v6 07/34] iov_iter: Add a function to extract a page list from an iterator David Howells
  2023-01-17  8:01   ` Christoph Hellwig
@ 2023-01-17  8:19   ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-17  8:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Christoph Hellwig, John Hubbard,
	Matthew Wilcox, linux-fsdevel, linux-mm, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-block, linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> > +ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
> > +			       size_t maxsize, unsigned int maxpages,
> > +			       unsigned int gup_flags, size_t *offset0);
> 
> This function isn't actually added in the current patch.

Oh...  It ended up in the wrong patch.

> > +#define iov_iter_extract_mode(iter, gup_flags) \
> > +	(user_backed_iter(iter) ?				\
> > +	 (gup_flags & FOLL_BUF_MASK) == FOLL_SOURCE_BUF ?	\
> > +	 FOLL_GET : FOLL_PIN : 0)
> 
> And inline function would be nice here.  I guess that would require
> moving the FULL flags into mm_types.h, though.

Yeah, the movement of FOLL_* flags is queued in a patch in akpm's tree.

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 08/34] mm: Provide a helper to drop a pin/ref on a page
  2023-01-16 23:08 ` [PATCH v6 08/34] mm: Provide a helper to drop a pin/ref on a page David Howells
  2023-01-17  8:02   ` Christoph Hellwig
@ 2023-01-17  8:21   ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-17  8:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Jan 16, 2023 at 11:08:59PM +0000, David Howells wrote:
> > Provide a helper in the get_user_pages code to drop a pin or a ref on a
> > page based on being given FOLL_GET or FOLL_PIN in its flags argument or do
> > nothing if neither is set.
> 
> Please don't add new page based helpers.

Yes, I know, but all of the callers still use pages, not folios.

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 11/34] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-16 23:09 ` [PATCH v6 11/34] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
  2023-01-17  8:07   ` Christoph Hellwig
@ 2023-01-17  8:26   ` David Howells
  2023-01-17  8:44     ` Christoph Hellwig
  1 sibling, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-17  8:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Jens Axboe, Jan Kara, Christoph Hellwig,
	Matthew Wilcox, Logan Gunthorpe, linux-block, Jeff Layton,
	linux-fsdevel, linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> > +	bio_clear_flag(bio, BIO_PAGE_REFFED);
> 
> .. and this should not be needed.  Instead:
> 
> > +	cleanup_mode = iov_iter_extract_mode(iter, gup_flags);
> > +	if (cleanup_mode & FOLL_GET)
> > +		bio_set_flag(bio, BIO_PAGE_REFFED);
> > +	if (cleanup_mode & FOLL_PIN)
> > +		bio_set_flag(bio, BIO_PAGE_PINNED);
> 
> We could warn if a not match flag is set here if we really care.

Um... With these patches, BIO_PAGE_REFFED is set by default when the bio is
initialised; otherwise every user of struct bio that currently adds pages
directly (assuming there are any) rather than going through
bio_iov_iter_get_pages() will have to set the flag, hence the need to clear
it.

Actually, I could do:

	if (!(cleanup_mode & FOLL_GET))
		bio_clear_flag(bio, BIO_PAGE_REFFED);
	if (cleanup_mode & FOLL_PIN)
		bio_set_flag(bio, BIO_PAGE_PINNED);

which should also work.

David



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
  2023-01-17  7:52   ` Christoph Hellwig
@ 2023-01-17  8:28   ` David Howells
  2023-01-17  8:44     ` Christoph Hellwig
  2023-01-17 11:11   ` David Howells
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-17  8:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Matthew Wilcox, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> I suspect the best is to:
> 
>  - rename init_sync_kiocb to init_kiocb
>  - pass a new argument for the destination

Do you mean the direction rather than the destination?

>    to it.  I'm not entirely
>    sure if flags is a good thing, or an explicit READ/WRITE might be
>    better because it's harder to get wrong, even if the compiler
>    might generate worse code for it.
>  - also use it in the async callers (io_uring, aio, overlayfs, loop,
>    nvmet, target, cachefs, file backed swap)

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-16 23:08 ` [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*() David Howells
  2023-01-17  7:57   ` Christoph Hellwig
@ 2023-01-17  8:44   ` David Howells
  2023-01-17  8:46     ` Christoph Hellwig
  2023-01-17  8:47     ` David Hildenbrand
  1 sibling, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-17  8:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Peter Zijlstra, David Hildenbrand

Christoph Hellwig <hch@infradead.org> wrote:

> On Mon, Jan 16, 2023 at 11:08:24PM +0000, David Howells wrote:
> > Define FOLL_SOURCE_BUF and FOLL_DEST_BUF to indicate to get_user_pages*()
> > and iov_iter_get_pages*() how the buffer is intended to be used in an I/O
> > operation.  Don't use READ and WRITE as a read I/O writes to memory and
> > vice versa - which causes confusion.
> > 
> > The direction is checked against the iterator's data_source.
> 
> Why can't we use the existing FOLL_WRITE?

Because FOLL_WRITE doesn't mean the same as WRITE:

 (1) It looks like it should really be FOLL_CHECK_PTES_WRITABLE.  It's not
     defined as being anything to do with the I/O.

 (2) The reason Al added ITER_SOURCE and ITER_DEST is that the use of READ and
     WRITE with the iterators is confusing and kind of inverted - and the same
     would apply with using FOLL_WRITE:

	if (rw == READ)
		gup_flags |= FOLL_WRITE;

So my thought is to make how you are using the buffer described by the
iterator explicit: "I'm using it as a source buffer" or "I'm using it as a
destination buffer".

Also, I don't want it to be FOLL_WRITE or 0.  I want it to be written
explicitly in both cases.  If you're going to insist on using FOLL_WRITE, then
there should be a FOLL_READ to go with it, even if it's #defined to 0.
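
That is, a call site would then read something like (sketch only;
FOLL_READ doesn't exist today and would just be a zero-valued define):

	/* A write I/O only reads the buffer; a read I/O writes into it. */
	gup_flags |= op_is_write(dio_op) ? FOLL_READ : FOLL_WRITE;

rather than hiding one of the two directions behind a bare 0.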

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 11/34] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
  2023-01-17  8:26   ` David Howells
@ 2023-01-17  8:44     ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:44 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Jens Axboe, Jan Kara,
	Christoph Hellwig, Matthew Wilcox, Logan Gunthorpe, linux-block,
	Jeff Layton, linux-fsdevel, linux-kernel

On Tue, Jan 17, 2023 at 08:26:08AM +0000, David Howells wrote:
> Um... With these patches, BIO_PAGE_REFFED is set by default when the bio is
> initialised otherwise every user of struct bio that currently adds pages
> directly (assuming there are any) rather than going through
> bio_iov_iter_get_pages() will have to set the flag, hence the need to clear
> it.

I think we need to fix that (in the patch inverting the polarity) and
only set the flag where it is needed.

All such calls eventually come from the direct I/O code in the block layer,
iomap, the legacy generic code and zonefs, and they release pages that came
from some form of gup.  So we can just set BIO_PAGE_REFFED in
bio_iov_iter_get_pages and dio_refill_pages.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-17  8:28   ` David Howells
@ 2023-01-17  8:44     ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:44 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Christoph Hellwig, Jens Axboe,
	linux-block, linux-fsdevel, Matthew Wilcox, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-kernel

On Tue, Jan 17, 2023 at 08:28:24AM +0000, David Howells wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
> 
> > I suspect the best is to:
> > 
> >  - rename init_sync_kiocb to init_kiocb
> >  - pass a new argument for the destination
> 
> Do you mean the direction rather than the destination?

Yes.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-17  8:44   ` David Howells
@ 2023-01-17  8:46     ` Christoph Hellwig
  2023-01-17  8:47     ` David Hildenbrand
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17  8:46 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Peter Zijlstra, David Hildenbrand

On Tue, Jan 17, 2023 at 08:44:16AM +0000, David Howells wrote:
> Also, I don't want it to be FOLL_WRITE or 0.  I want it to be written
> explicitly in both cases.  If you're going to insist on using FOLL_WRITE, then
> there should be a FOLL_READ to go with it, even if it's #defined to 0.

Well, that's not how FOLL_* works.  And another new flag that is defined
to 0 but only used by some I/O callers is really confusing.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-17  8:44   ` David Howells
  2023-01-17  8:46     ` Christoph Hellwig
@ 2023-01-17  8:47     ` David Hildenbrand
  1 sibling, 0 replies; 91+ messages in thread
From: David Hildenbrand @ 2023-01-17  8:47 UTC (permalink / raw)
  To: David Howells, Christoph Hellwig
  Cc: Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Peter Zijlstra

On 17.01.23 09:44, David Howells wrote:
> Christoph Hellwig <hch@infradead.org> wrote:
> 
>> On Mon, Jan 16, 2023 at 11:08:24PM +0000, David Howells wrote:
>>> Define FOLL_SOURCE_BUF and FOLL_DEST_BUF to indicate to get_user_pages*()
>>> and iov_iter_get_pages*() how the buffer is intended to be used in an I/O
>>> operation.  Don't use READ and WRITE as a read I/O writes to memory and
>>> vice versa - which causes confusion.
>>>
>>> The direction is checked against the iterator's data_source.
>>
>> Why can't we use the existing FOLL_WRITE?
> 
> Because FOLL_WRITE doesn't mean the same as WRITE:
> 
>   (1) It looks like it should really be FOLL_CHECK_PTES_WRITABLE.  It's not
>       defined as being anything to do with the I/O.

Especially combined with FOLL_FORCE, this is not true.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
  2023-01-17  7:52   ` Christoph Hellwig
  2023-01-17  8:28   ` David Howells
@ 2023-01-17 11:11   ` David Howells
  2023-01-17 11:11     ` Christoph Hellwig
  2023-01-18 22:05   ` Al Viro
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-17 11:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Matthew Wilcox, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> I suspect the best is to:
> 
>  - rename init_sync_kiocb to init_kiocb
>  - pass a new argument for the direction to it.  I'm not entirely
>    sure if flags is a good thing, or an explicit READ/WRITE might be
>    better because it's harder to get wrong, even if the compiler
>    might generate worse code for it.

So something like:

	init_kiocb(kiocb, file, WRITE);
	init_kiocb(kiocb, file, READ);

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-17 11:11   ` David Howells
@ 2023-01-17 11:11     ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-17 11:11 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Christoph Hellwig, Jens Axboe,
	linux-block, linux-fsdevel, Matthew Wilcox, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-kernel

On Tue, Jan 17, 2023 at 11:11:16AM +0000, David Howells wrote:
> So something like:
> 
> 	init_kiocb(kiocb, file, WRITE);
> 	init_kiocb(kiocb, file, READ);

Yes.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
                   ` (34 preceding siblings ...)
  2023-01-17  7:46 ` [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) Christoph Hellwig
@ 2023-01-18 14:00 ` David Howells
  2023-01-18 14:09   ` Christoph Hellwig
  35 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-18 14:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Jens Axboe, Jan Kara, Matthew Wilcox,
	Logan Gunthorpe, linux-block, Christoph Hellwig, Jeff Layton,
	linux-fsdevel, linux-kernel

Actually, should I make it so that the bottom two bits of bi_flags are a
four-state variable and make it such that bio_release_page() gives a warning
if the state is 0 - i.e. unset?

The states would then be, say:

	0	WARN(), do no cleanup
	1	FOLL_GET
	2	FOLL_PIN
	3	do no cleanup

This should help debug any places, such as iomap_dio_zero() that I just found,
that add pages with refs without calling iov_iter_extract_pages().
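
Roughly like this (untested sketch; the names are made up and the bit
assignment is only illustrative):

	enum bio_page_cleanup {
		BIO_PAGE_CLEANUP_UNSET	= 0,	/* WARN(), do no cleanup */
		BIO_PAGE_CLEANUP_GET	= 1,	/* put_page() on completion */
		BIO_PAGE_CLEANUP_PIN	= 2,	/* unpin_user_page() on completion */
		BIO_PAGE_CLEANUP_NONE	= 3,	/* do no cleanup */
	};

	static void bio_release_page(struct bio *bio, struct page *page)
	{
		unsigned int state = bio->bi_flags & 3;

		if (WARN_ON_ONCE(state == BIO_PAGE_CLEANUP_UNSET))
			return;
		if (state == BIO_PAGE_CLEANUP_GET)
			put_page(page);
		else if (state == BIO_PAGE_CLEANUP_PIN)
			unpin_user_page(page);
	}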

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
  2023-01-18 14:00 ` [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
@ 2023-01-18 14:09   ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-18 14:09 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Jens Axboe, Jan Kara, Matthew Wilcox,
	Logan Gunthorpe, linux-block, Christoph Hellwig, Jeff Layton,
	linux-fsdevel, linux-kernel

On Wed, Jan 18, 2023 at 02:00:54PM +0000, David Howells wrote:
> Actually, should I make it so that the bottom two bits of bi_flags are a
> four-state variable and make it such that bio_release_page() gives a warning
> if the state is 0 - ie. unset?
> 
> The states would then be, say:
> 
> 	0	WARN(), do no cleanup
> 	1	FOLL_GET
> 	2	FOLL_PUT
> 	3	do no cleanup
> 
> This should help debug any places, such as iomap_dio_zero() that I just found,
> that add pages with refs without calling iov_iter_extract_pages().

I don't really see a point.  The fundamental use case of the bio itself
doesn't really need this at all.  So we're stealing one, or in the future
two, bits mostly to optimize some direct I/O use cases.  In fact I
wonder if instead we should just drop this micro-optimization entirely
and just add a member for the foll flags to the direct I/O container
structures (struct blkdev_dio, struct iomap_dio, struct dio, or just on
the stack for __blkdev_direct_IO_simple and zonefs_file_dio_append) and
pass that to bio_release_pages.
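
Along the lines of this rough, untested sketch (names invented;
bio_release_pages() growing an extra argument is part of the hypothetical):

	struct blkdev_dio {
		/* ... existing fields ... */
		unsigned int		foll_flags;	/* FOLL_GET, FOLL_PIN or 0 */
	};

	static void blkdev_dio_release_pages(struct blkdev_dio *dio,
					     struct bio *bio, bool mark_dirty)
	{
		bio_release_pages(bio, mark_dirty, dio->foll_flags);
	}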

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
                     ` (2 preceding siblings ...)
  2023-01-17 11:11   ` David Howells
@ 2023-01-18 22:05   ` Al Viro
  2023-01-19  5:41     ` Christoph Hellwig
  2023-01-19 10:01   ` David Howells
  2023-01-19 11:06   ` David Howells
  5 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2023-01-18 22:05 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-fsdevel,
	Christoph Hellwig, Matthew Wilcox, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-kernel

On Mon, Jan 16, 2023 at 11:08:09PM +0000, David Howells wrote:
> IOCB_WRITE is set by aio, io_uring and cachefiles before submitting a write
> operation to the VFS, but it isn't set by, say, the write() system call.
> 
> Fix this by setting IOCB_WRITE unconditionally in call_write_iter().

	Which does nothing for places that do not use call_write_iter()...
__kernel_write_iter() is one such; for a less obvious specimen see
drivers/nvme/target/io-cmd-file.c:nvmet_file_submit_bvec() - there
we have iocb coming from the caller and *not* fed to init_sync_kiocb(),
so Christoph's suggestion doesn't work either.  Sure, we could take
care of that by adding ki_flags |= IOCB_WRITE in there, but...

FWIW, call chains for ->write_iter() (as an explicit method call) are:

->write_iter() <- __kernel_write_iter() [init_sync_kiocb()]
->write_iter() <- call_write_iter() <- new_sync_write() [init_sync_kiocb()]
->write_iter() <- call_write_iter() <- do_iter_read_write() [init_sync_kiocb()]
->write_iter() <- call_write_iter() <- aio_write() [sets IOCB_WRITE]
->write_iter() <- call_write_iter() <- io_write() [sets IOCB_WRITE]

->write_iter() <- nvmet_file_submit_bvec()
->write_iter() <- call_write_iter() <- lo_rw_aio()
->write_iter() <- call_write_iter() <- fd_execute_rw_aio()
->write_iter() <- call_write_iter() <- vfs_iocb_iter_write()

The last 4 neither set IOCB_WRITE nor call init_sync_kiocb().  What's
more, there are places that call instances (or their guts - look at
btrfs_do_write_iter() callers) directly...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-17  7:52   ` Christoph Hellwig
@ 2023-01-18 22:11     ` Al Viro
  2023-01-19  5:44       ` Christoph Hellwig
  2023-01-19 11:34       ` David Howells
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2023-01-18 22:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Matthew Wilcox, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-kernel

On Mon, Jan 16, 2023 at 11:52:43PM -0800, Christoph Hellwig wrote:

> This doesn't remove the existing setting of IOCB_WRITE, and also
> feels like the wrong place.
> 
> I suspect the best is to:
> 
>  - rename init_sync_kiocb to init_kiocb
>  - pass a new argument for the destination to it.  I'm not entirely
>    sure if flags is a good thing, or an explicit READ/WRITE might be
>    better because it's harder to get wrong, even if the compiler
>    might generate worse code for it.
>  - also use it in the async callers (io_uring, aio, overlayfs, loop,
>    nvmet, target, cachefs, file backed swap)

Do you want it to mess with get_current_ioprio() for those?  Looks
wrong...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 02/34] iov_iter: Use IOCB/IOMAP_WRITE/op_is_write rather than iterator direction
  2023-01-17  7:55   ` Christoph Hellwig
@ 2023-01-18 22:18     ` Al Viro
  0 siblings, 0 replies; 91+ messages in thread
From: Al Viro @ 2023-01-18 22:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

On Mon, Jan 16, 2023 at 11:55:08PM -0800, Christoph Hellwig wrote:
> On Mon, Jan 16, 2023 at 11:08:17PM +0000, David Howells wrote:
> > Use information other than the iterator direction to determine the
> > direction of the I/O:
> > 
> >  (*) If a kiocb is available, use the IOCB_WRITE flag.
> > 
> >  (*) If an iomap_iter is available, use the IOMAP_WRITE flag.
> > 
> >  (*) If a request is available, use op_is_write().
> 
> The really should be three independent patches.  Plus another one
> to drop the debug checks in cifs.
> 
> The changes themselves look good to me.
> 
> >  
> > +static unsigned char iov_iter_rw(const struct iov_iter *i)
> > +{
> > +	return i->data_source ? WRITE : READ;
> > +}
> 
> It might as well make sense to just open code this in the only
> caller as well (yet another patch).

Especially since
		/* if it's a destination, tell g-u-p we want them writable */
                if (!i->data_source)
			gup_flags |= FOLL_WRITE;
is less confusing than the current
                if (iov_iter_rw(i) != WRITE)
			gup_flags |= FOLL_WRITE;


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions
  2023-01-17  7:58   ` Christoph Hellwig
@ 2023-01-18 22:28     ` Al Viro
  2023-01-19  5:45       ` Christoph Hellwig
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2023-01-18 22:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

On Mon, Jan 16, 2023 at 11:58:16PM -0800, Christoph Hellwig wrote:
> On Mon, Jan 16, 2023 at 11:08:44PM +0000, David Howells wrote:
> > Use the direction in the iterator functions rather than READ/WRITE.
> 
> I don't think we need the direction at all as nothing uses it any more.

... except for checks in copy_from_iter() et al.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-17  7:57   ` Christoph Hellwig
  2023-01-17  8:07     ` David Hildenbrand
@ 2023-01-18 23:03     ` Al Viro
  2023-01-19  0:15       ` Al Viro
  1 sibling, 1 reply; 91+ messages in thread
From: Al Viro @ 2023-01-18 23:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Peter Zijlstra, David Hildenbrand

On Mon, Jan 16, 2023 at 11:57:08PM -0800, Christoph Hellwig wrote:
> On Mon, Jan 16, 2023 at 11:08:24PM +0000, David Howells wrote:
> > Define FOLL_SOURCE_BUF and FOLL_DEST_BUF to indicate to get_user_pages*()
> > and iov_iter_get_pages*() how the buffer is intended to be used in an I/O
> > operation.  Don't use READ and WRITE as a read I/O writes to memory and
> > vice versa - which causes confusion.
> > 
> > The direction is checked against the iterator's data_source.
> 
> Why can't we use the existing FOLL_WRITE?

	I'm really not fond of passing FOLL_... stuff into iov_iter
primitives.  That space contains things like FOLL_PIN, which makes
no sense whatsoever for non-user-backed iterators; having the
callers pass it in makes them automatically dependent upon the
iov_iter flavour.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 05/34] iov_iter: Change the direction macros into an enum
  2023-01-16 23:08 ` [PATCH v6 05/34] iov_iter: Change the direction macros into an enum David Howells
@ 2023-01-18 23:14   ` Al Viro
  2023-01-18 23:17   ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Al Viro @ 2023-01-18 23:14 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Mon, Jan 16, 2023 at 11:08:38PM +0000, David Howells wrote:
> Change the ITER_SOURCE and ITER_DEST direction macros into an enum and
> provide three new helper functions:
> 
>  iov_iter_dir() - returns the iterator direction
>  iov_iter_is_dest() - returns true if it's an ITER_DEST iterator
>  iov_iter_is_source() - returns true if it's an ITER_SOURCE iterator

What for?  We have two valid values -
	1) it is a data source
	2) it is not a data source
Why do we need to store that as an enum?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions
  2023-01-16 23:08 ` [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions David Howells
  2023-01-17  7:58   ` Christoph Hellwig
@ 2023-01-18 23:15   ` Al Viro
  1 sibling, 0 replies; 91+ messages in thread
From: Al Viro @ 2023-01-18 23:15 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Mon, Jan 16, 2023 at 11:08:44PM +0000, David Howells wrote:
> Use the direction in the iterator functions rather than READ/WRITE.
> 
> Add a check into __iov_iter_get_pages_alloc() that the supplied
> FOLL_SOURCE/DEST_BUF gup_flag matches the ITER_SOURCE/DEST flag on the
> iterator.

Incidentally, s/iterator/initializer/ (or constructor, for that matter).
Those are not iterators...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 05/34] iov_iter: Change the direction macros into an enum
  2023-01-16 23:08 ` [PATCH v6 05/34] iov_iter: Change the direction macros into an enum David Howells
  2023-01-18 23:14   ` Al Viro
@ 2023-01-18 23:17   ` David Howells
  2023-01-18 23:19     ` Al Viro
  2023-01-18 23:24     ` David Howells
  1 sibling, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-18 23:17 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Al Viro <viro@zeniv.linux.org.uk> wrote:

> > Change the ITER_SOURCE and ITER_DEST direction macros into an enum and
> > provide three new helper functions:
> > 
> >  iov_iter_dir() - returns the iterator direction
> >  iov_iter_is_dest() - returns true if it's an ITER_DEST iterator
> >  iov_iter_is_source() - returns true if it's an ITER_SOURCE iterator
> 
> What for?  We have two valid values -
> 	1) it is a data source
> 	2) it is not a data source
> Why do we need to store that as an enum?

Compile time type checking.

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 05/34] iov_iter: Change the direction macros into an enum
  2023-01-18 23:17   ` David Howells
@ 2023-01-18 23:19     ` Al Viro
  2023-01-18 23:24     ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Al Viro @ 2023-01-18 23:19 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Wed, Jan 18, 2023 at 11:17:41PM +0000, David Howells wrote:
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > > Change the ITER_SOURCE and ITER_DEST direction macros into an enum and
> > > provide three new helper functions:
> > > 
> > >  iov_iter_dir() - returns the iterator direction
> > >  iov_iter_is_dest() - returns true if it's an ITER_DEST iterator
> > >  iov_iter_is_source() - returns true if it's an ITER_SOURCE iterator
> > 
> > What for?  We have two valid values -
> > 	1) it is a data source
> > 	2) it is not a data source
> > Why do we need to store that as an enum?
> 
> Compile time type checking.

Huh?  int-to-enum conversion is quiet; it would catch explicit
huge constants, but that's it...
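
For example - a hypothetical snippet, nothing here is from the patch:

	enum iter_dir { ITER_DEST, ITER_SOURCE };

	static void iov_iter_foo(struct iov_iter *i, enum iter_dir dir);

	static void example(struct iov_iter *i)
	{
		/* all of these compile without a diagnostic from gcc -Wall */
		iov_iter_foo(i, ITER_SOURCE);	/* intended use */
		iov_iter_foo(i, 1);		/* plain int converts silently */
		iov_iter_foo(i, READ | WRITE);	/* any in-range expression does */
	}

Only a constant too large for the enum's underlying type would get flagged.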

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 05/34] iov_iter: Change the direction macros into an enum
  2023-01-18 23:17   ` David Howells
  2023-01-18 23:19     ` Al Viro
@ 2023-01-18 23:24     ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-18 23:24 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

Al Viro <viro@zeniv.linux.org.uk> wrote:

> > Compile time type checking.
> 
> Huh?  int-to-enum conversion is quiet; it would catch explicit
> huge constants, but that's it...

*shrug*.

But can we at least get rid of the:

	iov_iter_foo(&iter, ITER_SOURCE, ...);

	WARN_ON(direction & ~(READ | WRITE));

mismatch?  Either use ITER_SOURCE/DEST or use READ/WRITE, but don't mix them.
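
E.g. something along these lines in the constructors - a sketch, not a patch:

	/* Check against the two accepted values instead of masking with the
	 * old READ/WRITE constants, so the test uses the same vocabulary as
	 * the callers do. */
	WARN_ON(direction != ITER_SOURCE && direction != ITER_DEST);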

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-18 23:03     ` Al Viro
@ 2023-01-19  0:15       ` Al Viro
  2023-01-19  2:11         ` Al Viro
  2023-01-19  5:47         ` Christoph Hellwig
  0 siblings, 2 replies; 91+ messages in thread
From: Al Viro @ 2023-01-19  0:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Peter Zijlstra, David Hildenbrand

On Wed, Jan 18, 2023 at 11:03:52PM +0000, Al Viro wrote:
> On Mon, Jan 16, 2023 at 11:57:08PM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 16, 2023 at 11:08:24PM +0000, David Howells wrote:
> > > Define FOLL_SOURCE_BUF and FOLL_DEST_BUF to indicate to get_user_pages*()
> > > and iov_iter_get_pages*() how the buffer is intended to be used in an I/O
> > > operation.  Don't use READ and WRITE as a read I/O writes to memory and
> > > vice versa - which causes confusion.
> > > 
> > > The direction is checked against the iterator's data_source.
> > 
> > Why can't we use the existing FOLL_WRITE?
> 
> 	I'm really not fond of passing FOLL_... stuff into iov_iter
> primitives.  That space contains things like FOLL_PIN, which makes
> no sense whatsoever for non-user-backed iterators; having the
> callers pass it in makes them automatically dependent upon the
> iov_iter flavour.

Actually, looking at that thing...  Currently we use it only for
FOLL_PCI_P2PDMA.  It alters behaviour of get_user_pages_fast(), but...
it is completely ignored for ITER_BVEC or ITER_PIPE.  So how the
hell is it supposed to work?

And ITER_BVEC *can* get there.  blkdev_direct_IO() can get anything
->write_iter() can get, and io_uring will feed stuff to it.  For
that matter, ->read_iter() can lead to it as well, so
generic_file_splice_read() can end up passing ITER_PIPE to that
sucker.

Could somebody give a braindump on that thing?  It looks like we
have pages that should not be DMA'd to/from unless the driver takes
some precautions, and we want to make sure they won't be fed to
drivers that don't take them.  With the checks done in a very odd
place...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-19  0:15       ` Al Viro
@ 2023-01-19  2:11         ` Al Viro
  2023-01-19  5:47           ` Christoph Hellwig
  2023-01-19  5:47         ` Christoph Hellwig
  1 sibling, 1 reply; 91+ messages in thread
From: Al Viro @ 2023-01-19  2:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Peter Zijlstra, David Hildenbrand

On Thu, Jan 19, 2023 at 12:15:44AM +0000, Al Viro wrote:
> On Wed, Jan 18, 2023 at 11:03:52PM +0000, Al Viro wrote:
> > On Mon, Jan 16, 2023 at 11:57:08PM -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 16, 2023 at 11:08:24PM +0000, David Howells wrote:
> > > > Define FOLL_SOURCE_BUF and FOLL_DEST_BUF to indicate to get_user_pages*()
> > > > and iov_iter_get_pages*() how the buffer is intended to be used in an I/O
> > > > operation.  Don't use READ and WRITE as a read I/O writes to memory and
> > > > vice versa - which causes confusion.
> > > > 
> > > > The direction is checked against the iterator's data_source.
> > > 
> > > Why can't we use the existing FOLL_WRITE?
> > 
> > 	I'm really not fond of passing FOLL_... stuff into iov_iter
> > primitives.  That space contains things like FOLL_PIN, which makes
> > no sense whatsoever for non-user-backed iterators; having the
> > callers pass it in makes them automatically dependent upon the
> > iov_iter flavour.
> 
> Actually, looking at that thing...  Currently we use it only for
> FOLL_PCI_P2PDMA.  It alters behaviour of get_user_pages_fast(), but...
> it is completely ignored for ITER_BVEC or ITER_PIPE.  So how the
> hell is it supposed to work?
> 
> And ITER_BVEC *can* get there.  blkdev_direct_IO() can get anything
> ->write_iter() can get, and io_uring will feed stuff to it.  For
> that matter, ->read_iter() can lead to it as well, so
> generic_file_splice_read() can end up passing ITER_PIPE to that
> sucker.
> 
> Could somebody give a braindump on that thing?  It looks like we
> have pages that should not be DMA'd to/from unless the driver takes
> some precautions, and we want to make sure they won't be fed to
> drivers that don't take them.  With the checks done in a very odd
> place...

PS: Documentation/driver-api/pci/p2pdma.rst seems to imply that those
pages should not be possible to mmap, so either that needs to be
updated, or... how the hell could we run into those in g-u-p,
anyway?  Really confused...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 20/34] vfs: Make splice use iov_iter_extract_pages()
  2023-01-16 23:10 ` [PATCH v6 20/34] vfs: Make splice use iov_iter_extract_pages() David Howells
@ 2023-01-19  2:31   ` Al Viro
  0 siblings, 0 replies; 91+ messages in thread
From: Al Viro @ 2023-01-19  2:31 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Matthew Wilcox, linux-fsdevel,
	Christoph Hellwig, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-block, linux-kernel

On Mon, Jan 16, 2023 at 11:10:25PM +0000, David Howells wrote:

> diff --git a/fs/splice.c b/fs/splice.c
> index 19c5b5adc548..c3433266ba1b 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1159,14 +1159,18 @@ static int iter_to_pipe(struct iov_iter *from,
>  	size_t total = 0;
>  	int ret = 0;
>  
> +	/* For the moment, all pages attached to a pipe must have refs, not pins. */
> +	if (WARN_ON(iov_iter_extract_mode(from, FOLL_SOURCE_BUF) != FOLL_GET))
> +		return -EIO;

Huh?  WTF does that have to do with pins?  Why would we be pinning the _source_
pages anyway?  We do want them referenced, for obvious reasons (they might be
stuck in the pipe), but that has nothing to do with get vs. pin.

If anything, this is one place where we want the semantics of iov_iter_get_pages...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 21/34] 9p: Pin pages rather than ref'ing if appropriate
  2023-01-16 23:10 ` [PATCH v6 21/34] 9p: Pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-19  2:52   ` Al Viro
  2023-01-19 16:44   ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Al Viro @ 2023-01-19  2:52 UTC (permalink / raw)
  To: David Howells
  Cc: Dominique Martinet, Eric Van Hensbergen, Latchesar Ionkov,
	Christian Schoenebeck, v9fs-developer, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

On Mon, Jan 16, 2023 at 11:10:32PM +0000, David Howells wrote:

> @@ -310,73 +310,34 @@ static int p9_get_mapped_pages(struct virtio_chan *chan,
>  			       struct iov_iter *data,
>  			       int count,
>  			       size_t *offs,
> -			       int *need_drop,
> +			       int *cleanup_mode,
>  			       unsigned int gup_flags)
>  {
>  	int nr_pages;
>  	int err;
> +	int n;
>  
>  	if (!iov_iter_count(data))
>  		return 0;
>  
> -	if (!iov_iter_is_kvec(data)) {
> -		int n;
> -		/*
> -		 * We allow only p9_max_pages pinned. We wait for the
> -		 * Other zc request to finish here
> -		 */
> -		if (atomic_read(&vp_pinned) >= chan->p9_max_pages) {
> -			err = wait_event_killable(vp_wq,
> -			      (atomic_read(&vp_pinned) < chan->p9_max_pages));
> -			if (err == -ERESTARTSYS)
> -				return err;
> -		}
> -		n = iov_iter_get_pages_alloc(data, pages, count, offs,
> -					     gup_flags);
> -		if (n < 0)
> -			return n;
> -		*need_drop = 1;
> -		nr_pages = DIV_ROUND_UP(n + *offs, PAGE_SIZE);
> -		atomic_add(nr_pages, &vp_pinned);
> -		return n;
> -	} else {
> -		/* kernel buffer, no need to pin pages */
> -		int index;
> -		size_t len;
> -		void *p;
> -
> -		/* we'd already checked that it's non-empty */
> -		while (1) {
> -			len = iov_iter_single_seg_count(data);
> -			if (likely(len)) {
> -				p = data->kvec->iov_base + data->iov_offset;
> -				break;
> -			}
> -			iov_iter_advance(data, 0);
> -		}
> -		if (len > count)
> -			len = count;
> -
> -		nr_pages = DIV_ROUND_UP((unsigned long)p + len, PAGE_SIZE) -
> -			   (unsigned long)p / PAGE_SIZE;
> -
> -		*pages = kmalloc_array(nr_pages, sizeof(struct page *),
> -				       GFP_NOFS);
> -		if (!*pages)
> -			return -ENOMEM;
> -
> -		*need_drop = 0;
> -		p -= (*offs = offset_in_page(p));
> -		for (index = 0; index < nr_pages; index++) {
> -			if (is_vmalloc_addr(p))
> -				(*pages)[index] = vmalloc_to_page(p);
> -			else
> -				(*pages)[index] = kmap_to_page(p);
> -			p += PAGE_SIZE;
> -		}
> -		iov_iter_advance(data, len);
> -		return len;
> +	/*
> +	 * We allow only p9_max_pages pinned. We wait for the
> +	 * Other zc request to finish here
> +	 */
> +	if (atomic_read(&vp_pinned) >= chan->p9_max_pages) {
> +		err = wait_event_killable(vp_wq,
> +					  (atomic_read(&vp_pinned) < chan->p9_max_pages));
> +		if (err == -ERESTARTSYS)
> +			return err;
>  	}
> +
> +	n = iov_iter_extract_pages(data, pages, count, offs, gup_flags);

Wait a sec; just how would that work for ITER_KVEC?  AFAICS, in your
tree that would blow with -EFAULT...

Yup; in p9_client_readdir() in your tree:

net/9p/client.c:2057:	iov_iter_kvec(&to, ITER_DEST, &kv, 1, count);

net/9p/client.c:2077:		req = p9_client_zc_rpc(clnt, P9_TREADDIR, &to, NULL, rsize, 0,
net/9p/client.c:2078:				       11, "dqd", fid->fid, offset, rsize);

where
net/9p/client.c:799:	err = c->trans_mod->zc_request(c, req, uidata, uodata,
net/9p/client.c:800:				       inlen, olen, in_hdrlen);

and in p9_virtio_zc_request(), which is a possible ->zc_request() instance
net/9p/trans_virtio.c:402:		int n = p9_get_mapped_pages(chan, &out_pages, uodata,
net/9p/trans_virtio.c:403:					    outlen, &offs, &cleanup_mode,
net/9p/trans_virtio.c:404:					    FOLL_DEST_BUF);

with p9_get_mapped_pages() hitting
net/9p/trans_virtio.c:334:	n = iov_iter_extract_pages(data, pages, count, offs, gup_flags);
net/9p/trans_virtio.c:335:	if (n < 0)
net/9p/trans_virtio.c:336:		return n;

and in iov_iter_extract_get_pages()
lib/iov_iter.c:2250:	if (likely(user_backed_iter(i)))
lib/iov_iter.c:2251:		return iov_iter_extract_user_pages(i, pages, maxsize,
lib/iov_iter.c:2252:						   maxpages, gup_flags,
lib/iov_iter.c:2253:						   offset0);
lib/iov_iter.c:2254:	if (iov_iter_is_bvec(i))
lib/iov_iter.c:2255:		return iov_iter_extract_bvec_pages(i, pages, maxsize,
lib/iov_iter.c:2256:						   maxpages, gup_flags,
lib/iov_iter.c:2257:						   offset0);
lib/iov_iter.c:2258:	if (iov_iter_is_pipe(i))
lib/iov_iter.c:2259:		return iov_iter_extract_pipe_pages(i, pages, maxsize,
lib/iov_iter.c:2260:						   maxpages, gup_flags,
lib/iov_iter.c:2261:						   offset0);
lib/iov_iter.c:2262:	if (iov_iter_is_xarray(i))
lib/iov_iter.c:2263:		return iov_iter_extract_xarray_pages(i, pages, maxsize,
lib/iov_iter.c:2264:						     maxpages, gup_flags,
lib/iov_iter.c:2265:						     offset0);
lib/iov_iter.c:2266:	return -EFAULT;

All quoted lines by your
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/tree/
How could that possibly work?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 18/34] dio: Pin pages rather than ref'ing if appropriate
  2023-01-16 23:10 ` [PATCH v6 18/34] dio: Pin pages rather than ref'ing if appropriate David Howells
@ 2023-01-19  5:04   ` Al Viro
  2023-01-19  5:51     ` Christoph Hellwig
  0 siblings, 1 reply; 91+ messages in thread
From: Al Viro @ 2023-01-19  5:04 UTC (permalink / raw)
  To: David Howells
  Cc: Jens Axboe, Jan Kara, Christoph Hellwig, Matthew Wilcox,
	Logan Gunthorpe, linux-fsdevel, linux-block, Christoph Hellwig,
	Jeff Layton, linux-kernel

On Mon, Jan 16, 2023 at 11:10:11PM +0000, David Howells wrote:
> Convert the generic direct-I/O code to use iov_iter_extract_pages() instead
> of iov_iter_get_pages().  This will pin pages or leave them unaltered
> rather than getting a ref on them as appropriate to the iterator.
> 
> The pages need to be pinned for DIO-read rather than having refs taken on
> them to prevent VM copy-on-write from malfunctioning during a concurrent
> fork() (the result of the I/O would otherwise end up only visible to the
> child process and not the parent).

Several observations:

1) fs/direct-io.c is ancient, grotty and has few remaining users.
The case of block devices got split off first; these days it's in
block/fops.c.  Then iomap-using filesystems went to fs/iomap/direct-io.c,
leaving this sucker used only by affs, ext2, fat, exfat, hfs, hfsplus, jfs,
nilfs2, ntfs3, reiserfs, udf and ocfs2.  And frankly, the sooner it dies
the better off we are.  IOW, you've picked an uninteresting part and left
the important ones untouched.

2) if you look at the "should_dirty" logic in any of those (including
fs/direct-io.c itself) you'll see a funny thing.  First of all,
dio->should_dirty (or its counterparts) gets set iff we have a user-backed
iter and operation is a read.  I.e. precisely the case when you get bio
marked with BIO_PAGE_PINNED.  And that's not an accident - look at the
places where we check that predicate: dio_bio_submit() calls
bio_set_pages_dirty() if that predicate is true before submitting the
sucker and dio_bio_complete() uses it to choose between bio_check_pages_dirty()
and bio_release_pages() + bio_put().

Look at bio_check_pages_dirty() - it checks if any of the pages we were
reading into had managed to lose the dirty bit; if none had it does
bio_release_pages(bio, false) + bio_put(bio) and returns.  If some had,
it shoves bio into bio_dirty_list and arranges for bio_release_pages(bio, true)
+ bio_put(bio) called from upper half (via schedule_work()).  The effect
of the second argument of bio_release_pages() is to (re)dirty the pages;
it can't be done from interrupt, so we have to defer it to process context.

Now, do we need to redirty anything there?  Recall that page pinning was
introduced precisely to prevent writeback while DMA into the page is in
progress.  Who is going to mark it clean before we unpin it?

Unless I misunderstand something fundamental about the whole thing,
this crap should become useless with that conversion.  And it's not just
the ->should_dirty and its equivalents - bio_check_pages_dirty() and
the stuff around it should also be gone once block/fops.c and
fs/iomap/direct-io.c are switched to your iov_iter_extract_pages.
Moreover, that way the only places legitimately passing true to
bio_release_pages() are blk_rq_unmap_user() (on completion of
bypass read request mapping a user page) and __blkdev_direct_IO_simple()
(on completion of short sync O_DIRECT read from block device).
Both could just as well call bio_set_pages_dirty(bio) +
bio_release_pages(bio, false), killing the "dirty on release" logics
and losing the 'mark_dirty' argument.
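
For concreteness, the simplified completion path that falls out of that - a
sketch using the existing bio helpers, with a made-up function name; whether
it is actually safe is exactly the question above:

	/* With the pages pinned by iov_iter_extract_pages(), writeback can't
	 * clean them under us, so no deferred redirtying should be needed on
	 * completion of a user-backed direct read. */
	static void dio_read_bio_done(struct bio *bio, bool should_dirty)
	{
		if (should_dirty)
			bio_set_pages_dirty(bio);
		bio_release_pages(bio, false);
		bio_put(bio);
	}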

BTW, where do we dirty the pages on IO_URING_OP_READ_FIXED with
O_DIRECT file?  AFAICS, bio_set_pages_dirty() won't be called
(ITER_BVEC iter) and neither will bio_release_pages() do anything
(BIO_NO_PAGE_REF set on the bio by bio_iov_bvec_set() called
due to the same ITER_BVEC iter).  Am I missing something trivial
here?  Jens?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-18 22:05   ` Al Viro
@ 2023-01-19  5:41     ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-19  5:41 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-kernel

On Wed, Jan 18, 2023 at 10:05:38PM +0000, Al Viro wrote:
> __kernel_write_iter() is one such; for less obvious specimen see
> drivers/nvme/target/io-cmd-file.c:nvmet_file_submit_bvec() - there
> we have iocb coming from the caller and *not* fed to init_sync_kiocb(),
> so Christoph's suggestion doesn't work either.  Sure, we could take
> care of that by adding ki_flags |= IOCB_WRITE in there, but...

None of the async users of iocbs currently uses init_sync_kiocb.  My
suggestion is to use it everywhere - there are fewer than a dozen of
these, all of which I listed.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-18 22:11     ` Al Viro
@ 2023-01-19  5:44       ` Christoph Hellwig
  2023-01-19 11:34       ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-19  5:44 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, David Howells, Christoph Hellwig, Jens Axboe,
	linux-block, linux-fsdevel, Matthew Wilcox, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-kernel

On Wed, Jan 18, 2023 at 10:11:45PM +0000, Al Viro wrote:
> On Mon, Jan 16, 2023 at 11:52:43PM -0800, Christoph Hellwig wrote:
> 
> > This doesn't remove the existing setting of IOCB_WRITE, and also
> > feels like the wrong place.
> > 
> > I suspect the best is to:
> > 
> >  - rename init_sync_kiocb to init_kiocb
> >  - pass a new argument for the destination to it.  I'm not entirely
> >    sure if flags is a good thing, or an explicit READ/WRITE might be
> >    better because it's harder to get wrong, even if the compiler
> >    might generate worse code for it.
> >  - also use it in the async callers (io_uring, aio, overlayfs, loop,
> >    nvmet, target, cachefs, file backed swap)
> 
> Do you want it to mess with get_current_ioprio() for those?  Looks
> wrong...

We want to be consistent for sync vs async submission.  So I think yes,
we want to do the get_current_ioprio for most of them, the exceptions
being aio and io_uring - those could use a __init_iocb or
init_iocb_ioprio variant that passes in the explicit priority if we want
to avoid the call when it would be overridden later.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions
  2023-01-18 22:28     ` Al Viro
@ 2023-01-19  5:45       ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-19  5:45 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, David Howells, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel

On Wed, Jan 18, 2023 at 10:28:28PM +0000, Al Viro wrote:
> On Mon, Jan 16, 2023 at 11:58:16PM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 16, 2023 at 11:08:44PM +0000, David Howells wrote:
> > > Use the direction in the iterator functions rather than READ/WRITE.
> > 
> > I don't think we need the direction at all as nothing uses it any more.
> 
> ... except for checks in copy_from_iter() et.al.

Debug checks, yes.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-19  0:15       ` Al Viro
  2023-01-19  2:11         ` Al Viro
@ 2023-01-19  5:47         ` Christoph Hellwig
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-19  5:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, David Howells, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Peter Zijlstra, David Hildenbrand

On Thu, Jan 19, 2023 at 12:15:44AM +0000, Al Viro wrote:
> Actually, looking at that thing...  Currently we use it only for
> FOLL_PCI_P2PDMA.  It alters behaviour of get_user_pages_fast(), but...
> it is completely ignored for ITER_BVEC or ITER_PIPE.  So how the
> hell is it supposed to work?

It broadens the acceptance criteria for UBUF/IOVEC types.  It doesn't
change behavior for memory that was already accepted, for those types or
any others.

> Could somebody give a braindump on that thing?  It looks like we
> have pages that should not be DMA'd to/from unless driver takes
> some precautions and we want to make sure they won't be fed to
> drivers that don't take such.  With checks done in a very odd
> place...

Yes, normal gup excludes P2P pages.  This flag allows it to get them.
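
Roughly - the flag and the call below are the existing gup API, the
surrounding context is made up:

	/* A driver that can cope with PCI P2PDMA memory opts in via the gup
	 * flag; without FOLL_PCI_P2PDMA, get_user_pages_fast() refuses to
	 * hand out such pages. */
	int n = get_user_pages_fast(addr, nr_pages,
				    FOLL_WRITE | FOLL_PCI_P2PDMA, pages);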

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*()
  2023-01-19  2:11         ` Al Viro
@ 2023-01-19  5:47           ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-19  5:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, David Howells, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Peter Zijlstra, David Hildenbrand

On Thu, Jan 19, 2023 at 02:11:19AM +0000, Al Viro wrote:
> PS: Documentation/driver-api/pci/p2pdma.rst seems to imply that those
> pages should not be possible to mmap, so either that needs to be
> updated, or... how the hell could we run into those in g-u-p,
> anyway?  Really confused...

Yes, that needs an update.  That limitation was from before the
mmap support was added.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 18/34] dio: Pin pages rather than ref'ing if appropriate
  2023-01-19  5:04   ` Al Viro
@ 2023-01-19  5:51     ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-19  5:51 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Jens Axboe, Jan Kara, Christoph Hellwig,
	Matthew Wilcox, Logan Gunthorpe, linux-fsdevel, linux-block,
	Christoph Hellwig, Jeff Layton, linux-kernel

On Thu, Jan 19, 2023 at 05:04:20AM +0000, Al Viro wrote:
> 1) fs/direct-io.c is ancient, grotty and has few remaining users.
> The case of block devices got split off first; these days it's in
> block/fops.c.  Then iomap-using filesystems went to fs/iomap/direct-io.c,
> leaving this sucker used only by affs, ext2, fat, exfat, hfs, hfsplus, jfs,
> nilfs2, ntfs3, reiserfs, udf and ocfs2.  And frankly, the sooner it dies
> the better off we are.  IOW, you've picked an uninteresting part and left
> the important ones untouched.

Agreed.  That being said, if we want file systems (including those not
using this legacy version) to be able to rely on correct page dirtying,
eventually everything needs to pin the pages it writes to.  So we need to
either actually fix or remove this code in the foreseeable future.  It's
far from the most interesting or highest-priority part, though.  And as
I said, this series is already too large to review anyway; I'd really
prefer to get a core set done ASAP and then iterate on the callers and
additional bits.

> Unless I misunderstand something fundamental about the whole thing,
> this crap should become useless with that conversion.

It should - mostly.  But we need to be very careful about that, so
I'd prefer a separate small series for it to be honest.

> BTW, where do we dirty the pages on IO_URING_OP_READ_FIXED with
> O_DIRECT file?  AFAICS, bio_set_pages_dirty() won't be called
> (ITER_BVEC iter) and neither will bio_release_pages() do anything
> (BIO_NO_PAGE_REF set on the bio by bio_iov_bvec_set() called
> due to the same ITER_BVEC iter).  Am I missing something trivial
> here?  Jens?

I don't think we do that at all right now.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
                     ` (3 preceding siblings ...)
  2023-01-18 22:05   ` Al Viro
@ 2023-01-19 10:01   ` David Howells
  2023-01-19 16:46     ` Christoph Hellwig
  2023-01-19 11:06   ` David Howells
  5 siblings, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-19 10:01 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-kernel

Al Viro <viro@zeniv.linux.org.uk> wrote:

> 	Which does nothing for places that do not use call_write_iter()...
> __kernel_write_iter() is one such; for less obvious specimen see
> drivers/nvme/target/io-cmd-file.c:nvmet_file_submit_bvec()

Should these be calling call_read/write_iter()?  If not, should
call_read/write_iter() be dropped?

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
                     ` (4 preceding siblings ...)
  2023-01-19 10:01   ` David Howells
@ 2023-01-19 11:06   ` David Howells
  5 siblings, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-19 11:06 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-kernel

Al Viro <viro@zeniv.linux.org.uk> wrote:

> ->write_iter() <- nvmet_file_submit_bvec()
> ->write_iter() <- call_write_iter() <- lo_rw_aio()

Could call init_kiocb() in lo_rw_aio() and then just overwrite ki_ioprio.

> ->write_iter() <- call_write_iter() <- fd_execute_rw_aio()

fd_execute_rw_aio() perhaps should call init_kiocb() since the struct is
allocated with kmalloc() and not fully initialised.

> ->write_iter() <- call_write_iter() <- vfs_iocb_iter_write()
>
> The last 4 neither set KIOCB_WRITE nor call init_sync_kiocb().

vfs_iocb_iter_write() is given an initialised kiocb.  It should not be calling
init_sync_kiocb() itself.

It's called from two places: cachefiles, which initialises the kiocb itself
and sets IOCB_WRITE, and overlayfs, which gets the kiocb from the VFS via its
->write_iter hook, whose caller should have already set IOCB_WRITE.

cachefiles should be using init_kiocb() - though since it uses kzalloc(),
init_kiocb() clearing the struct is redundant.

> What's more, there are places that call instances (or their guts - look at
> btrfs_do_write_iter() callers) directly...

At least in the case of btrfs_ioctl_encoded_write(), that can call
init_kiocb().  But as you say, there are more to be found.
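
For the sake of discussion, the hypothetical init_kiocb() would presumably
look something like this - the name comes from Christoph's suggestion earlier
in the thread, the signature and IOCB_WRITE handling are assumed:

	/* Like init_sync_kiocb(), but usable by the async callers too and
	 * taking the I/O direction so that IOCB_WRITE gets set in one place.
	 * Purely illustrative. */
	static inline void init_kiocb(struct kiocb *kiocb, struct file *filp,
				      int rw)
	{
		*kiocb = (struct kiocb) {
			.ki_filp   = filp,
			.ki_flags  = filp->f_iocb_flags |
				     (rw == WRITE ? IOCB_WRITE : 0),
			.ki_ioprio = get_current_ioprio(),
		};
	}

lo_rw_aio() could then call this and just override ki_ioprio afterwards, as
above.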

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-18 22:11     ` Al Viro
  2023-01-19  5:44       ` Christoph Hellwig
@ 2023-01-19 11:34       ` David Howells
  2023-01-19 16:48         ` Christoph Hellwig
  2023-01-19 21:14         ` David Howells
  1 sibling, 2 replies; 91+ messages in thread
From: David Howells @ 2023-01-19 11:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Matthew Wilcox, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> We want to be consistent for sync vs async submission.  So I think yes,
> we want to do the get_current_ioprio for most of them, exceptions
> beeing aio and io_uring - those could use a __init_iocb or
> init_iocb_ioprio variant that passs in the explicit priority if we want
> to avoid the call if it would be overriden later.

io_uring is a bit problematic in this regard.  io_prep_rw() starts the
initialisation of the kiocb, so io_read() and io_write() can't just
reinitialise it.  OTOH, I'm not sure io_prep_rw() has sufficient information
to hand.

I wonder if I should add a flag to struct io_op_def to indicate that this is
going to be a write operation and maybe add a REQ_F_WRITE flag that gets set
by that.

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 31/34] cifs: Fix problem with encrypted RDMA data read
  2023-01-16 23:11 ` [PATCH v6 31/34] cifs: Fix problem with encrypted RDMA data read David Howells
@ 2023-01-19 16:25   ` Stefan Metzmacher
  0 siblings, 0 replies; 91+ messages in thread
From: Stefan Metzmacher @ 2023-01-19 16:25 UTC (permalink / raw)
  To: David Howells, Al Viro
  Cc: Steve French, Tom Talpey, Long Li, Namjae Jeon, linux-cifs,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Am 17.01.23 um 00:11 schrieb David Howells:
> When the cifs client is talking to the ksmbd server by RDMA and the ksmbd
> server has "smb3 encryption = yes" in its config file, the normal PDU
> stream is encrypted, but the directly-delivered data isn't in the stream
> (and isn't encrypted); rather, it is delivered by DDP/RDMA packets (at
> least with IWarp).

In that case the client must not use DDP/RDMA offload!
This needs to be fixed in the request code for both read and write!

metze

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 21/34] 9p: Pin pages rather than ref'ing if appropriate
  2023-01-16 23:10 ` [PATCH v6 21/34] 9p: Pin pages rather than ref'ing if appropriate David Howells
  2023-01-19  2:52   ` Al Viro
@ 2023-01-19 16:44   ` David Howells
  2023-01-19 16:51     ` Christoph Hellwig
  1 sibling, 1 reply; 91+ messages in thread
From: David Howells @ 2023-01-19 16:44 UTC (permalink / raw)
  To: Al Viro, Dominique Martinet
  Cc: dhowells, Eric Van Hensbergen, Latchesar Ionkov,
	Christian Schoenebeck, v9fs-developer, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

Al Viro <viro@zeniv.linux.org.uk> wrote:

> Wait a sec; just how would that work for ITER_KVEC?  AFAICS, in your
> tree that would blow with -EFAULT...

You're right.  I wonder if I should handle ITER_KVEC in
iov_iter_extract_pages(), though I'm sure I've been told that a kvec might
point to data that doesn't have a matching page struct.  Or maybe it's that
the refcount shouldn't be changed on it.

A question for the 9p devs:

Looking more into p9_virtio_zc_request(), it might be better to use
netfs_extract_iter_to_sg(), since the page list is going to get turned into
one, instead of calling p9_get_mapped_pages() and pack_sg_list().

This would, however, require that chan->sg[] be populated outside of the
spinlock'd section - is there any reason that this can't be the case?  There's
nothing inside the locked section that makes sure the chan can be used before
it launches into loading up the scatterlist.  There is a wait afterwards, but
it has to drop the lock first, so wouldn't stop a parallel op from clobbering
chan->sg[] anyway.

Further, if virtqueue_add_sgs() fails with -ENOSPC and we go round again to
req_retry_pinned, do we actually need to reload chan->sg[]?

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-19 10:01   ` David Howells
@ 2023-01-19 16:46     ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-19 16:46 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-kernel

On Thu, Jan 19, 2023 at 10:01:26AM +0000, David Howells wrote:
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > 	Which does nothing for places that do not use call_write_iter()...
> > __kernel_write_iter() is one such; for less obvious specimen see
> > drivers/nvme/target/io-cmd-file.c:nvmet_file_submit_bvec()
> 
> Should these be calling call_read/write_iter()?  If not, should
> call_read/write_iter() be dropped?

I wish they'd just go away, they are a bit of a distraction.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-19 11:34       ` David Howells
@ 2023-01-19 16:48         ` Christoph Hellwig
  2023-01-19 21:14         ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-19 16:48 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Christoph Hellwig, Jens Axboe,
	linux-block, linux-fsdevel, Matthew Wilcox, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-kernel

On Thu, Jan 19, 2023 at 11:34:26AM +0000, David Howells wrote:
> io_uring is a bit problematic in this regard.  io_prep_rw() starts the
> initialisation of the kiocb, so io_read() and io_write() can't just
> reinitialise it.  OTOH, I'm not sure io_prep_rw() has sufficient information
> to hand.

It could probably be refactored.  That being said, I suspect we're
better off deferring the whole iov_iter direction cleanup.  It's a bit
ugly right now, but there is nothing urgent.  The gup pin work, OTOH,
really is something we need to get done sooner rather than later.

So what about deferring this whole cleanup for now?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 21/34] 9p: Pin pages rather than ref'ing if appropriate
  2023-01-19 16:44   ` David Howells
@ 2023-01-19 16:51     ` Christoph Hellwig
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Hellwig @ 2023-01-19 16:51 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Dominique Martinet, Eric Van Hensbergen,
	Latchesar Ionkov, Christian Schoenebeck, v9fs-developer,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

On Thu, Jan 19, 2023 at 04:44:14PM +0000, David Howells wrote:
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> You're right.  I wonder if I should handle ITER_KVEC in
> iov_iter_extract_pages(), though I'm sure I've been told that a kvec might
> point to data that doesn't have a matching page struct.  Or maybe it's that
> the refcount shouldn't be changed on it.

They could in theory contain non-page-backed memory, even if I don't
think we currently have that in tree.  The worst case is probably
vmalloc()ed memory.  Many instances will have no good way to deal with
something that isn't page-backed.  That's one reason why I'd really
love to see ITER_KVEC go away - for most use cases ITER_BVEC is the
right thing, and the others are probably broken for various combinations
already, but that's going to be a fair amount of work.  For now, just
failing the I/O if the instance can't deal with it is probably the
right thing.
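
For illustration, the special-casing a KVEC-capable instance would need looks
roughly like the old 9p code quoted earlier in the thread - the helper name
here is made up:

	/* Map a kernel-virtual address to its backing page: vmalloc memory
	 * needs a page-table lookup, lowmem/kmap addresses do not.  This only
	 * works at all for memory that has a struct page - which is exactly
	 * what an ITER_KVEC can't guarantee. */
	static struct page *kvec_addr_to_page(const void *p)
	{
		if (is_vmalloc_addr(p))
			return vmalloc_to_page(p);
		return kmap_to_page((void *)p);
	}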

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter()
  2023-01-19 11:34       ` David Howells
  2023-01-19 16:48         ` Christoph Hellwig
@ 2023-01-19 21:14         ` David Howells
  1 sibling, 0 replies; 91+ messages in thread
From: David Howells @ 2023-01-19 21:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Christoph Hellwig, Jens Axboe, linux-block,
	linux-fsdevel, Matthew Wilcox, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-kernel

Christoph Hellwig <hch@infradead.org> wrote:

> So what about deferring this whole cleanup for now?

So you'd rather I stick with the direction indicator in the iov_iter struct
for now?

I still want to add iov_iter_extract_pages(), netfs_extract_user_iter() and
netfs_extract_iter_to_sg(), even if it's only cifs and netfslib that use them
for the moment.

David


^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread, other threads:[~2023-01-19 21:24 UTC | newest]

Thread overview: 91+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
2023-01-16 23:08 ` [PATCH v6 01/34] vfs: Unconditionally set IOCB_WRITE in call_write_iter() David Howells
2023-01-17  7:52   ` Christoph Hellwig
2023-01-18 22:11     ` Al Viro
2023-01-19  5:44       ` Christoph Hellwig
2023-01-19 11:34       ` David Howells
2023-01-19 16:48         ` Christoph Hellwig
2023-01-19 21:14         ` David Howells
2023-01-17  8:28   ` David Howells
2023-01-17  8:44     ` Christoph Hellwig
2023-01-17 11:11   ` David Howells
2023-01-17 11:11     ` Christoph Hellwig
2023-01-18 22:05   ` Al Viro
2023-01-19  5:41     ` Christoph Hellwig
2023-01-19 10:01   ` David Howells
2023-01-19 16:46     ` Christoph Hellwig
2023-01-19 11:06   ` David Howells
2023-01-16 23:08 ` [PATCH v6 02/34] iov_iter: Use IOCB/IOMAP_WRITE/op_is_write rather than iterator direction David Howells
2023-01-17  7:55   ` Christoph Hellwig
2023-01-18 22:18     ` Al Viro
2023-01-16 23:08 ` [PATCH v6 03/34] iov_iter: Pass I/O direction into iov_iter_get_pages*() David Howells
2023-01-17  7:57   ` Christoph Hellwig
2023-01-17  8:07     ` David Hildenbrand
2023-01-17  8:09       ` Christoph Hellwig
2023-01-18 23:03     ` Al Viro
2023-01-19  0:15       ` Al Viro
2023-01-19  2:11         ` Al Viro
2023-01-19  5:47           ` Christoph Hellwig
2023-01-19  5:47         ` Christoph Hellwig
2023-01-17  8:44   ` David Howells
2023-01-17  8:46     ` Christoph Hellwig
2023-01-17  8:47     ` David Hildenbrand
2023-01-16 23:08 ` [PATCH v6 04/34] iov_iter: Remove iov_iter_get_pages2/pages_alloc2() David Howells
2023-01-16 23:08 ` [PATCH v6 05/34] iov_iter: Change the direction macros into an enum David Howells
2023-01-18 23:14   ` Al Viro
2023-01-18 23:17   ` David Howells
2023-01-18 23:19     ` Al Viro
2023-01-18 23:24     ` David Howells
2023-01-16 23:08 ` [PATCH v6 06/34] iov_iter: Use the direction in the iterator functions David Howells
2023-01-17  7:58   ` Christoph Hellwig
2023-01-18 22:28     ` Al Viro
2023-01-19  5:45       ` Christoph Hellwig
2023-01-18 23:15   ` Al Viro
2023-01-16 23:08 ` [PATCH v6 07/34] iov_iter: Add a function to extract a page list from an iterator David Howells
2023-01-17  8:01   ` Christoph Hellwig
2023-01-17  8:19   ` David Howells
2023-01-16 23:08 ` [PATCH v6 08/34] mm: Provide a helper to drop a pin/ref on a page David Howells
2023-01-17  8:02   ` Christoph Hellwig
2023-01-17  8:21   ` David Howells
2023-01-16 23:09 ` [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
2023-01-17  8:02   ` Christoph Hellwig
2023-01-16 23:09 ` [PATCH v6 10/34] mm, block: Make BIO_PAGE_REFFED/PINNED the same as FOLL_GET/PIN numerically David Howells
2023-01-17  8:03   ` Christoph Hellwig
2023-01-16 23:09 ` [PATCH v6 11/34] iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate David Howells
2023-01-17  8:07   ` Christoph Hellwig
2023-01-17  8:26   ` David Howells
2023-01-17  8:44     ` Christoph Hellwig
2023-01-16 23:09 ` [PATCH v6 12/34] bio: Fix bio_flagged() so that gcc can better optimise it David Howells
2023-01-17  8:07   ` Christoph Hellwig
2023-01-16 23:09 ` [PATCH v6 13/34] netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator David Howells
2023-01-16 23:09 ` [PATCH v6 14/34] netfs: Add a function to extract an iterator into a scatterlist David Howells
2023-01-16 23:09 ` [PATCH v6 15/34] af_alg: Pin pages rather than ref'ing if appropriate David Howells
2023-01-16 23:09 ` [PATCH v6 16/34] af_alg: [RFC] Use netfs_extract_iter_to_sg() to create scatterlists David Howells
2023-01-16 23:10 ` [PATCH v6 17/34] scsi: [RFC] Use netfs_extract_iter_to_sg() David Howells
2023-01-16 23:10 ` [PATCH v6 18/34] dio: Pin pages rather than ref'ing if appropriate David Howells
2023-01-19  5:04   ` Al Viro
2023-01-19  5:51     ` Christoph Hellwig
2023-01-16 23:10 ` [PATCH v6 19/34] fuse: " David Howells
2023-01-16 23:10 ` [PATCH v6 20/34] vfs: Make splice use iov_iter_extract_pages() David Howells
2023-01-19  2:31   ` Al Viro
2023-01-16 23:10 ` [PATCH v6 21/34] 9p: Pin pages rather than ref'ing if appropriate David Howells
2023-01-19  2:52   ` Al Viro
2023-01-19 16:44   ` David Howells
2023-01-19 16:51     ` Christoph Hellwig
2023-01-16 23:10 ` [PATCH v6 22/34] nfs: " David Howells
2023-01-16 23:10 ` [PATCH v6 23/34] cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE David Howells
2023-01-16 23:10 ` [PATCH v6 24/34] cifs: Add a function to build an RDMA SGE list from an iterator David Howells
2023-01-16 23:11 ` [PATCH v6 25/34] cifs: Add a function to Hash the contents of " David Howells
2023-01-16 23:11 ` [PATCH v6 26/34] cifs: Add some helper functions David Howells
2023-01-16 23:11 ` [PATCH v6 27/34] cifs: Add a function to read into an iter from a socket David Howells
2023-01-16 23:11 ` [PATCH v6 28/34] cifs: Change the I/O paths to use an iterator rather than a page list David Howells
2023-01-16 23:11 ` [PATCH v6 29/34] cifs: Build the RDMA SGE list directly from an iterator David Howells
2023-01-16 23:11 ` [PATCH v6 30/34] cifs: Remove unused code David Howells
2023-01-16 23:11 ` [PATCH v6 31/34] cifs: Fix problem with encrypted RDMA data read David Howells
2023-01-19 16:25   ` Stefan Metzmacher
2023-01-16 23:11 ` [PATCH v6 32/34] cifs: DIO to/from KVEC-type iterators should now work David Howells
2023-01-16 23:12 ` [PATCH v6 33/34] net: [RFC][WIP] Mark each skb_frags as to how they should be cleaned up David Howells
2023-01-16 23:12 ` [PATCH v6 34/34] net: [RFC][WIP] Make __zerocopy_sg_from_iter() correctly pin or leave pages unref'd David Howells
2023-01-17  7:46 ` [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) Christoph Hellwig
2023-01-18 14:00 ` [PATCH v6 09/34] bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning David Howells
2023-01-18 14:09   ` Christoph Hellwig
