linux-rdma.vger.kernel.org archive mirror
* [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list)
@ 2023-01-16 23:07 David Howells
  2023-01-16 23:10 ` [PATCH v6 24/34] cifs: Add a function to build an RDMA SGE list from an iterator David Howells
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: David Howells @ 2023-01-16 23:07 UTC (permalink / raw)
  To: Al Viro
  Cc: James E.J. Bottomley, Paolo Abeni, John Hubbard,
	Christoph Hellwig, Paulo Alcantara, linux-scsi, Steve French,
	Stefan Metzmacher, Miklos Szeredi, Martin K. Petersen,
	Logan Gunthorpe, Jeff Layton, Jakub Kicinski, netdev,
	Rohith Surabattula, Eric Dumazet, Matthew Wilcox, Anna Schumaker,
	Jens Axboe, Shyam Prasad N, Tom Talpey, linux-rdma,
	Trond Myklebust, Christian Schoenebeck, linux-mm, linux-crypto,
	linux-nfs, v9fs-developer, Latchesar Ionkov, linux-fsdevel,
	Eric Van Hensbergen, Long Li, Jan Kara, linux-cachefs,
	linux-block, Dominique Martinet, Namjae Jeon, David S. Miller,
	linux-cifs, Steve French, Herbert Xu, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel


Hi Al, Christoph,

Here are patches that clean up some uses of READ/WRITE and
ITER_SOURCE/DEST, provide support for extracting pages from an iov_iter,
and use the primary extraction function in the block layer bio code.
I've also added a number of other conversions and had a tentative stab at
the networking code.

The patches make the following changes:

 (1) Switch from using the iterator's data_source to indicate the I/O
     direction to deriving it from other information, e.g. IOCB_WRITE,
     IOMAP_WRITE and the REQ_OP_* number.  This allows iov_iter_rw() to
     be removed eventually.
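
     As an illustration of deriving the direction from the opcode rather
     than the iterator, here is a userspace sketch modelled on the
     kernel's op_is_write() convention; the REQ_OP_* values shown are
     assumptions of the sketch:

     ```c
     #include <assert.h>
     #include <stdbool.h>

     /* Assumed numbering, modelled on the kernel's REQ_OP_* values,
      * where write-type operations have the low bit set. */
     enum req_op { REQ_OP_READ = 0, REQ_OP_WRITE = 1, REQ_OP_FLUSH = 2 };

     /* The direction comes from the opcode itself, not the iterator. */
     static bool op_is_write(enum req_op op)
     {
             return op & 1;
     }

     int main(void)
     {
             assert(!op_is_write(REQ_OP_READ));
             assert(op_is_write(REQ_OP_WRITE));
             assert(!op_is_write(REQ_OP_FLUSH));
             return 0;
     }
     ```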

 (2) Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass these into
     iov_iter_get_pages*() to indicate the I/O direction with regard to
     how the buffer described by the iterator is to be used.  This is
     included in the gup_flags passed in with Logan's patches.

     Calls to iov_iter_get_pages*2() are replaced with calls to
     iov_iter_get_pages*() and the former is removed.

 (3) Add a function, iov_iter_extract_pages(), to replace
     iov_iter_get_pages*(); it gets refs on, pins or just lists the pages
     as appropriate to the iterator type and the I/O direction.

     Add a function, iov_iter_extract_mode() that will indicate from the
     iterator type and the I/O direction how the cleanup is to be
     performed, returning FOLL_GET, FOLL_PIN or 0.

     Add a function, folio_put_unpin(), and a wrapper, page_put_unpin(),
     that take a page and the return from iov_iter_extract_mode() and do
     the right thing to clean up the page.

 (4) Make the bio struct carry a pair of flags to indicate the cleanup
     mode.  BIO_NO_PAGE_REF is replaced with BIO_PAGE_REFFED (equivalent
     to FOLL_GET) and BIO_PAGE_PINNED (equivalent to FOLL_PIN) is added.
     These are forced to have the same value as the FOLL_* flags so they
     can be passed to the previously mentioned cleanup function.
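
     A minimal userspace sketch of why forcing the BIO_* flags to the
     FOLL_* values matters; the numerical values here are assumptions of
     the sketch, and the series pins the real ones down with a build-time
     assertion:

     ```c
     #include <assert.h>

     /* Values assumed for the sketch; the series renumbers FOLL_GET and
      * FOLL_PIN and enforces the equalities below with an assertion. */
     #define FOLL_GET        0x01
     #define FOLL_PIN        0x02
     #define BIO_PAGE_REFFED 0x01    /* forced equal to FOLL_GET */
     #define BIO_PAGE_PINNED 0x02    /* forced equal to FOLL_PIN */

     _Static_assert(BIO_PAGE_REFFED == FOLL_GET, "flag values must match");
     _Static_assert(BIO_PAGE_PINNED == FOLL_PIN, "flag values must match");

     /* Because the values match, bio flags can be handed straight to a
      * FOLL_*-based cleanup helper with no translation step. */
     static unsigned int bio_cleanup_mode(unsigned int bio_flags)
     {
             return bio_flags & (FOLL_GET | FOLL_PIN);
     }

     int main(void)
     {
             assert(bio_cleanup_mode(BIO_PAGE_REFFED) == FOLL_GET);
             assert(bio_cleanup_mode(BIO_PAGE_PINNED) == FOLL_PIN);
             assert(bio_cleanup_mode(0) == 0);
             return 0;
     }
     ```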

 (5) Make the iter-to-bio code use iov_iter_extract_pages() to
     appropriately retain the pages and clean them up later.

 (6) Fix bio_flagged() so that it doesn't prevent a gcc optimisation.

 (7) Add a function in netfslib, netfs_extract_user_iter(), to extract a
     UBUF- or IOVEC-type iterator to a page list in a BVEC-type iterator,
     with all the pages suitably ref'd or pinned.

 (8) Add a function in netfslib, netfs_extract_iter_to_sg(), to extract a
     UBUF-, IOVEC-, BVEC-, XARRAY- or KVEC-type iterator to a scatterlist.
     The first two types appropriately ref or pin pages; the latter three
     don't perform any retention, leaving that to the caller.

     Note that I can make use of this in the SCSI and AF_ALG code and
     possibly the networking code, so this might merit being moved to core
     code.

 (9) Make AF_ALG use iov_iter_extract_pages() and possibly go further and
     make it use netfs_extract_iter_to_sg() instead.

(10) Make SCSI vhost use netfs_extract_iter_to_sg().

(11) Make fs/direct-io.c use iov_iter_extract_pages().

(12) Make splice-to-pipe use iov_iter_extract_pages(), but limit the usage
     to a cleanup mode of FOLL_GET.

(13) Make the 9P, FUSE and NFS filesystems use iov_iter_extract_pages().

(14) Make the CIFS filesystem use iterators from the top all the way down
     to the socket on the simple path.  Make it use
     netfs_extract_user_iter() to use an XARRAY-type iterator or to build a
     BVEC-type iterator in the top layers from a UBUF- or IOVEC-type
     iterator and attach the iterator to the operation descriptors.

     netfs_extract_iter_to_sg() is used to build scatterlists for doing
     transport crypto and a function, smb_extract_iter_to_rdma(), is
     provided to build an RDMA SGE list directly from an iterator without
     going via a page list and then a scatter list.

(15) A couple of work-in-progress patches to try and make sk_buff fragments
     record the information needed to clean them up in the lowest two bits
     of the page pointer in the fragment struct.
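
     The low-bit trick in (15) works because page pointers are at least
     4-byte aligned, leaving the bottom two bits free for bookkeeping.  A
     userspace sketch of the encoding; the mode values are assumptions of
     the sketch:

     ```c
     #include <assert.h>
     #include <stdint.h>

     /* Assumed 2-bit encoding for the sketch: 0 = leave alone, 1 = drop
      * a ref, 2 = unpin.  Valid because the pointee is >= 4-byte aligned. */
     #define FRAG_MODE_MASK 0x3UL

     static void *frag_encode(void *page, uintptr_t mode)
     {
             assert(((uintptr_t)page & FRAG_MODE_MASK) == 0);
             return (void *)((uintptr_t)page | mode);
     }

     static void *frag_page(void *enc)       /* recover the real pointer */
     {
             return (void *)((uintptr_t)enc & ~(uintptr_t)FRAG_MODE_MASK);
     }

     static uintptr_t frag_mode(void *enc)   /* recover the cleanup mode */
     {
             return (uintptr_t)enc & FRAG_MODE_MASK;
     }

     int main(void)
     {
             static int page __attribute__((aligned(4)));
             void *enc = frag_encode(&page, 2);

             assert(frag_page(enc) == &page);
             assert(frag_mode(enc) == 2);
             return 0;
     }
     ```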

This leaves:

 (*) Four calls to iov_iter_get_pages() in CEPH.  That will be helped by
     patches to pass an iterator down to the transport layer instead of
     converting to a page list high up and passing that down, but the
     transport layer could do with some massaging so that it doesn't convert
     the iterator to a page list and then the pages individually back to
     iterators to pass to the socket.

 (*) One call to iov_iter_get_pages() each in the networking core, RDS and
     TLS, all related to zero-copy.  TLS seems to do zerocopy-read (or
     maybe decrypt-offload) and should be doing FOLL_PIN, not FOLL_GET for
     user-provided buffers.


Changes:
========
ver #6)
 - Fix write() syscall and co. not setting IOCB_WRITE.
 - Added iocb_is_read() and iocb_is_write() to check IOCB_WRITE.
 - Use op_is_write() in bio_copy_user_iov().
 - Drop the iterator direction checks from smbd_recv().
 - Define FOLL_SOURCE_BUF and FOLL_DEST_BUF and pass them in as part of
   gup_flags to iov_iter_get/extract_pages*().
 - Replace iov_iter_get_pages*2() with iov_iter_get_pages*() and remove.
 - Add back the function to indicate the cleanup mode.
 - Drop the cleanup_mode return arg to iov_iter_extract_pages().
 - Provide a helper to clean up a page.
 - Renumbered FOLL_GET and FOLL_PIN and made BIO_PAGE_REFFED/PINNED have
   the same numerical values, enforced with an assertion.
 - Converted AF_ALG, SCSI vhost, generic DIO, FUSE, splice to pipe, 9P and
   NFS.
 - Added in the patches to make CIFS do top-to-bottom iterators and use
   various of the added extraction functions.
 - Added a pair of work-in-progress patches to make sk_buff fragments store
   FOLL_GET and FOLL_PIN.

ver #5)
 - Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED and split into own patch.
 - Transcribe FOLL_GET/PIN into BIO_PAGE_REFFED/PINNED flags.
 - Add patch to allow bio_flagged() to be combined by gcc.

ver #4)
 - Drop the patch to move the FOLL_* flags to linux/mm_types.h as they're
   no longer referenced by linux/uio.h.
 - Add ITER_SOURCE/DEST cleanup patches.
 - Make iov_iter/netfslib iter extraction patches use ITER_SOURCE/DEST.
 - Allow additional gup_flags to be passed into iov_iter_extract_pages().
 - Add struct bio patch.

ver #3)
 - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
   to get/pin_user_pages_fast()[1].

ver #2)
 - Rolled the extraction cleanup mode query function into the extraction
   function, returning the indication through the argument list.
 - Fixed patch 4 (extract to scatterlist) to actually use the new
   extraction API.

I've pushed the patches (excluding the two WIP networking patches) here
also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract

David

Link: https://lore.kernel.org/r/Y3zFzdWnWlEJ8X8/@infradead.org/ [1]
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166722777223.2555743.162508599131141451.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166869687556.3723671.10061142538708346995.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166920902005.1461876.2786264600108839814.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997419665.9475.15014699817597102032.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305160937.1521586.133299343565358971.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344725490.2425628.13771289553670112965.stgit@warthog.procyon.org.uk/ # v5

Previous versions of the CIFS patch sets can be found here:
Link: https://lore.kernel.org/r/164311902471.2806745.10187041199819525677.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/164928615045.457102.10607899252434268982.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165211416682.3154751.17287804906832979514.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165348876794.2106726.9240233279581920208.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/165364823513.3334034.11209090728654641458.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/166126392703.708021.14465850073772688008.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732024173.3186319.18204305072070871546.stgit@warthog.procyon.org.uk/ # rfc


---
David Howells (34):
      vfs: Unconditionally set IOCB_WRITE in call_write_iter()
      iov_iter: Use IOCB/IOMAP_WRITE/op_is_write rather than iterator direction
      iov_iter: Pass I/O direction into iov_iter_get_pages*()
      iov_iter: Remove iov_iter_get_pages2/pages_alloc2()
      iov_iter: Change the direction macros into an enum
      iov_iter: Use the direction in the iterator functions
      iov_iter: Add a function to extract a page list from an iterator
      mm: Provide a helper to drop a pin/ref on a page
      bio: Rename BIO_NO_PAGE_REF to BIO_PAGE_REFFED and invert the meaning
      mm, block: Make BIO_PAGE_REFFED/PINNED the same as FOLL_GET/PIN numerically
      iov_iter, block: Make bio structs pin pages rather than ref'ing if appropriate
      bio: Fix bio_flagged() so that gcc can better optimise it
      netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
      netfs: Add a function to extract an iterator into a scatterlist
      af_alg: Pin pages rather than ref'ing if appropriate
      af_alg: [RFC] Use netfs_extract_iter_to_sg() to create scatterlists
      scsi: [RFC] Use netfs_extract_iter_to_sg()
      dio: Pin pages rather than ref'ing if appropriate
      fuse:  Pin pages rather than ref'ing if appropriate
      vfs: Make splice use iov_iter_extract_pages()
      9p: Pin pages rather than ref'ing if appropriate
      nfs: Pin pages rather than ref'ing if appropriate
      cifs: Implement splice_read to pass down ITER_BVEC not ITER_PIPE
      cifs: Add a function to build an RDMA SGE list from an iterator
      cifs: Add a function to Hash the contents of an iterator
      cifs: Add some helper functions
      cifs: Add a function to read into an iter from a socket
      cifs: Change the I/O paths to use an iterator rather than a page list
      cifs: Build the RDMA SGE list directly from an iterator
      cifs: Remove unused code
      cifs: Fix problem with encrypted RDMA data read
      cifs: DIO to/from KVEC-type iterators should now work
      net: [RFC][WIP] Mark each skb_frags as to how they should be cleaned up
      net: [RFC][WIP] Make __zerocopy_sg_from_iter() correctly pin or leave pages unref'd


 block/bio.c               |   48 +-
 block/blk-map.c           |   26 +-
 block/blk.h               |   25 +
 block/fops.c              |    8 +-
 crypto/af_alg.c           |   57 +-
 crypto/algif_hash.c       |   20 +-
 drivers/net/tun.c         |    2 +-
 drivers/vhost/scsi.c      |   75 +-
 fs/9p/vfs_addr.c          |    2 +-
 fs/affs/file.c            |    4 +-
 fs/ceph/addr.c            |    2 +-
 fs/ceph/file.c            |   16 +-
 fs/cifs/Kconfig           |    1 +
 fs/cifs/cifsencrypt.c     |  172 +++-
 fs/cifs/cifsfs.c          |   12 +-
 fs/cifs/cifsfs.h          |    6 +
 fs/cifs/cifsglob.h        |   66 +-
 fs/cifs/cifsproto.h       |   11 +-
 fs/cifs/cifssmb.c         |   13 +-
 fs/cifs/connect.c         |   16 +
 fs/cifs/file.c            | 1851 +++++++++++++++++--------------------
 fs/cifs/fscache.c         |   22 +-
 fs/cifs/fscache.h         |   10 +-
 fs/cifs/misc.c            |  132 +--
 fs/cifs/smb2ops.c         |  374 ++++----
 fs/cifs/smb2pdu.c         |   45 +-
 fs/cifs/smbdirect.c       |  511 ++++++----
 fs/cifs/smbdirect.h       |    4 +-
 fs/cifs/transport.c       |   57 +-
 fs/dax.c                  |    6 +-
 fs/direct-io.c            |   77 +-
 fs/exfat/inode.c          |    6 +-
 fs/ext2/inode.c           |    2 +-
 fs/f2fs/file.c            |   10 +-
 fs/fat/inode.c            |    4 +-
 fs/fuse/dax.c             |    2 +-
 fs/fuse/dev.c             |   24 +-
 fs/fuse/file.c            |   34 +-
 fs/fuse/fuse_i.h          |    1 +
 fs/hfs/inode.c            |    2 +-
 fs/hfsplus/inode.c        |    2 +-
 fs/iomap/direct-io.c      |    6 +-
 fs/jfs/inode.c            |    2 +-
 fs/netfs/Makefile         |    1 +
 fs/netfs/iterator.c       |  371 ++++++++
 fs/nfs/direct.c           |   32 +-
 fs/nilfs2/inode.c         |    2 +-
 fs/ntfs3/inode.c          |    2 +-
 fs/ocfs2/aops.c           |    2 +-
 fs/orangefs/inode.c       |    2 +-
 fs/reiserfs/inode.c       |    2 +-
 fs/splice.c               |   10 +-
 fs/udf/inode.c            |    2 +-
 include/crypto/if_alg.h   |    7 +-
 include/linux/bio.h       |   23 +-
 include/linux/blk_types.h |    3 +-
 include/linux/fs.h        |   11 +
 include/linux/mm.h        |   32 +-
 include/linux/netfs.h     |    6 +
 include/linux/skbuff.h    |  124 ++-
 include/linux/uio.h       |   83 +-
 io_uring/net.c            |    2 +-
 lib/iov_iter.c            |  428 ++++++++-
 mm/gup.c                  |   47 +
 mm/vmalloc.c              |    1 +
 net/9p/trans_common.c     |    6 +-
 net/9p/trans_common.h     |    3 +-
 net/9p/trans_virtio.c     |   91 +-
 net/bpf/test_run.c        |    2 +-
 net/core/datagram.c       |   23 +-
 net/core/gro.c            |    2 +-
 net/core/skbuff.c         |   16 +-
 net/core/skmsg.c          |    4 +-
 net/ipv4/ip_output.c      |    2 +-
 net/ipv4/tcp.c            |    4 +-
 net/ipv6/esp6.c           |    5 +-
 net/ipv6/ip6_output.c     |    2 +-
 net/packet/af_packet.c    |    2 +-
 net/rds/message.c         |    4 +-
 net/tls/tls_sw.c          |    5 +-
 net/xfrm/xfrm_ipcomp.c    |    2 +-
 81 files changed, 3006 insertions(+), 2126 deletions(-)
 create mode 100644 fs/netfs/iterator.c



^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v6 24/34] cifs: Add a function to build an RDMA SGE list from an iterator
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
@ 2023-01-16 23:10 ` David Howells
  2023-01-16 23:11 ` [PATCH v6 29/34] cifs: Build the RDMA SGE list directly " David Howells
  2023-01-17  7:46 ` [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) Christoph Hellwig
  2 siblings, 0 replies; 4+ messages in thread
From: David Howells @ 2023-01-16 23:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Tom Talpey,
	Jeff Layton, linux-cifs, linux-fsdevel, linux-rdma, dhowells,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel

Add a function that extracts page fragments from a BVEC-, KVEC- or
XARRAY-type iterator, DMA-maps them and adds them to an RDMA SGE list
until the maximum number of elements is reached.

Nothing is done to make sure the pages remain present - that must be done
by the caller.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-rdma@vger.kernel.org

Link: https://lore.kernel.org/r/166697256704.61150.17388516338310645808.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732028840.3186319.8512284239779728860.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/smbdirect.c |  224 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 224 insertions(+)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 3e693ffd0662..78a76752fafd 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -44,6 +44,17 @@ static int smbd_post_send_page(struct smbd_connection *info,
 static void destroy_mr_list(struct smbd_connection *info);
 static int allocate_mr_list(struct smbd_connection *info);
 
+struct smb_extract_to_rdma {
+	struct ib_sge		*sge;
+	unsigned int		nr_sge;
+	unsigned int		max_sge;
+	struct ib_device	*device;
+	u32			local_dma_lkey;
+	enum dma_data_direction	direction;
+};
+static ssize_t smb_extract_iter_to_rdma(struct iov_iter *iter, size_t len,
+					struct smb_extract_to_rdma *rdma);
+
 /* SMBD version number */
 #define SMBD_V1	0x0100
 
@@ -2480,3 +2491,216 @@ int smbd_deregister_mr(struct smbd_mr *smbdirect_mr)
 
 	return rc;
 }
+
+static bool smb_set_sge(struct smb_extract_to_rdma *rdma,
+			struct page *lowest_page, size_t off, size_t len)
+{
+	struct ib_sge *sge = &rdma->sge[rdma->nr_sge];
+	u64 addr;
+
+	addr = ib_dma_map_page(rdma->device, lowest_page,
+			       off, len, rdma->direction);
+	if (ib_dma_mapping_error(rdma->device, addr))
+		return false;
+
+	sge->addr   = addr;
+	sge->length = len;
+	sge->lkey   = rdma->local_dma_lkey;
+	rdma->nr_sge++;
+	return true;
+}
+
+/*
+ * Extract page fragments from a BVEC-class iterator and add them to an RDMA
+ * element list.  The pages are not pinned.
+ */
+static ssize_t smb_extract_bvec_to_rdma(struct iov_iter *iter,
+					struct smb_extract_to_rdma *rdma,
+					ssize_t maxsize)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned long start = iter->iov_offset;
+	unsigned int i, sge_max = rdma->max_sge;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		size_t off, len;
+
+		len = bv[i].bv_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		len = min_t(size_t, maxsize, len - start);
+		off = bv[i].bv_offset + start;
+
+		if (!smb_set_sge(rdma, bv[i].bv_page, off, len))
+			return -EIO;
+		sge_max--;
+
+		ret += len;
+		maxsize -= len;
+		if (maxsize <= 0 || sge_max == 0)
+			break;
+		start = 0;
+	}
+
+	return ret;
+}
+
+/*
+ * Extract fragments from a KVEC-class iterator and add them to an RDMA list.
+ * This can deal with vmalloc'd buffers as well as kmalloc'd or static buffers.
+ * The pages are not pinned.
+ */
+static ssize_t smb_extract_kvec_to_rdma(struct iov_iter *iter,
+					struct smb_extract_to_rdma *rdma,
+					ssize_t maxsize)
+{
+	const struct kvec *kv = iter->kvec;
+	unsigned long start = iter->iov_offset;
+	unsigned int i, sge_max = rdma->max_sge;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		struct page *page;
+		unsigned long kaddr;
+		size_t off, len, seg;
+
+		len = kv[i].iov_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		kaddr = (unsigned long)kv[i].iov_base + start;
+		off = kaddr & ~PAGE_MASK;
+		len = min_t(size_t, maxsize, len - start);
+		kaddr &= PAGE_MASK;
+
+		maxsize -= len;
+		ret += len;
+		do {
+			seg = min_t(size_t, len, PAGE_SIZE - off);
+
+			if (is_vmalloc_or_module_addr((void *)kaddr))
+				page = vmalloc_to_page((void *)kaddr);
+			else
+				page = virt_to_page(kaddr);
+
+			if (!smb_set_sge(rdma, page, off, seg))
+				return -EIO;
+			sge_max--;
+
+			len -= seg;
+			kaddr += PAGE_SIZE;
+			off = 0;
+		} while (len > 0 && sge_max > 0);
+
+		if (maxsize <= 0 || sge_max == 0)
+			break;
+		start = 0;
+	}
+
+	return ret;
+}
+
+/*
+ * Extract folio fragments from an XARRAY-class iterator and add them to an
+ * RDMA list.  The folios are not pinned.
+ */
+static ssize_t smb_extract_xarray_to_rdma(struct iov_iter *iter,
+					  struct smb_extract_to_rdma *rdma,
+					  ssize_t maxsize)
+{
+	struct xarray *xa = iter->xarray;
+	struct folio *folio;
+	unsigned int sge_max = rdma->max_sge;
+	loff_t start = iter->xarray_start + iter->iov_offset;
+	pgoff_t index = start / PAGE_SIZE;
+	ssize_t ret = 0;
+	size_t off, len;
+	XA_STATE(xas, xa, index);
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, folio, ULONG_MAX) {
+		if (xas_retry(&xas, folio))
+			continue;
+		if (WARN_ON(xa_is_value(folio)))
+			break;
+		if (WARN_ON(folio_test_hugetlb(folio)))
+			break;
+
+		off = offset_in_folio(folio, start);
+		len = min_t(size_t, maxsize, folio_size(folio) - off);
+
+		if (!smb_set_sge(rdma, folio_page(folio, 0), off, len)) {
+			rcu_read_unlock();
+			return -EIO;
+		}
+		sge_max--;
+
+		maxsize -= len;
+		ret += len;
+		start += len;
+		if (maxsize <= 0 || sge_max == 0)
+			break;
+	}
+
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Extract page fragments from up to the given amount of the source iterator
+ * and build up an RDMA list that refers to all of those bits.  The RDMA list
+ * is appended to, up to the maximum number of elements set in the parameter
+ * block.
+ *
+ * The extracted page fragments are not pinned or ref'd in any way; if an
+ * IOVEC/UBUF-type iterator is to be used, it should be converted to a
+ * BVEC-type iterator and the pages pinned, ref'd or otherwise held in some
+ * way.
+ */
+static ssize_t smb_extract_iter_to_rdma(struct iov_iter *iter, size_t len,
+					struct smb_extract_to_rdma *rdma)
+{
+	ssize_t ret;
+	int before = rdma->nr_sge;
+
+	if (iov_iter_is_discard(iter) ||
+	    iov_iter_is_pipe(iter) ||
+	    user_backed_iter(iter)) {
+		WARN_ON_ONCE(1);
+		return -EIO;
+	}
+
+	switch (iov_iter_type(iter)) {
+	case ITER_BVEC:
+		ret = smb_extract_bvec_to_rdma(iter, rdma, len);
+		break;
+	case ITER_KVEC:
+		ret = smb_extract_kvec_to_rdma(iter, rdma, len);
+		break;
+	case ITER_XARRAY:
+		ret = smb_extract_xarray_to_rdma(iter, rdma, len);
+		break;
+	default:
+		BUG();
+	}
+
+	if (ret > 0) {
+		iov_iter_advance(iter, ret);
+	} else if (ret < 0) {
+		while (rdma->nr_sge > before) {
+			struct ib_sge *sge = &rdma->sge[--rdma->nr_sge];
+
+			ib_dma_unmap_single(rdma->device, sge->addr, sge->length,
+					    rdma->direction);
+			sge->addr = 0;
+		}
+	}
+
+	return ret;
+}




* [PATCH v6 29/34] cifs: Build the RDMA SGE list directly from an iterator
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
  2023-01-16 23:10 ` [PATCH v6 24/34] cifs: Add a function to build an RDMA SGE list from an iterator David Howells
@ 2023-01-16 23:11 ` David Howells
  2023-01-17  7:46 ` [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) Christoph Hellwig
  2 siblings, 0 replies; 4+ messages in thread
From: David Howells @ 2023-01-16 23:11 UTC (permalink / raw)
  To: Al Viro
  Cc: Steve French, Shyam Prasad N, Rohith Surabattula, Tom Talpey,
	Jeff Layton, linux-cifs, linux-rdma, dhowells, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel

In the depths of the cifs RDMA code, extract part of an iov iterator
directly into an SGE list without going through an intermediate
scatterlist.

Note that this doesn't support extraction from an IOVEC- or UBUF-type
iterator (ie. a user-supplied buffer).  The assumption is that the higher
layers will extract those to a BVEC-type iterator first and do whatever is
required to stop the pages from going away.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steve French <sfrench@samba.org>
cc: Shyam Prasad N <nspmangalore@gmail.com>
cc: Rohith Surabattula <rohiths.msft@gmail.com>
cc: Tom Talpey <tom@talpey.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: linux-rdma@vger.kernel.org

Link: https://lore.kernel.org/r/166697260361.61150.5064013393408112197.stgit@warthog.procyon.org.uk/ # rfc
Link: https://lore.kernel.org/r/166732032518.3186319.1859601819981624629.stgit@warthog.procyon.org.uk/ # rfc
---

 fs/cifs/smbdirect.c |  111 ++++++++++++++++++---------------------------------
 1 file changed, 39 insertions(+), 72 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 8bd320f0156e..4691b5a8e1ff 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -828,16 +828,16 @@ static int smbd_post_send(struct smbd_connection *info,
 	return rc;
 }
 
-static int smbd_post_send_sgl(struct smbd_connection *info,
-	struct scatterlist *sgl, int data_length, int remaining_data_length)
+static int smbd_post_send_iter(struct smbd_connection *info,
+			       struct iov_iter *iter,
+			       int *_remaining_data_length)
 {
-	int num_sgs;
 	int i, rc;
 	int header_length;
+	int data_length;
 	struct smbd_request *request;
 	struct smbd_data_transfer *packet;
 	int new_credits;
-	struct scatterlist *sg;
 
 wait_credit:
 	/* Wait for send credits. A SMBD packet needs one credit */
@@ -881,6 +881,30 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 	}
 
 	request->info = info;
+	memset(request->sge, 0, sizeof(request->sge));
+
+	/* Fill in the data payload to find out how much data we can add */
+	if (iter) {
+		struct smb_extract_to_rdma extract = {
+			.nr_sge		= 1,
+			.max_sge	= SMBDIRECT_MAX_SEND_SGE,
+			.sge		= request->sge,
+			.device		= info->id->device,
+			.local_dma_lkey	= info->pd->local_dma_lkey,
+			.direction	= DMA_TO_DEVICE,
+		};
+
+		rc = smb_extract_iter_to_rdma(iter, *_remaining_data_length,
+					      &extract);
+		if (rc < 0)
+			goto err_dma;
+		data_length = rc;
+		request->num_sge = extract.nr_sge;
+		*_remaining_data_length -= data_length;
+	} else {
+		data_length = 0;
+		request->num_sge = 1;
+	}
 
 	/* Fill in the packet header */
 	packet = smbd_request_payload(request);
@@ -902,7 +926,7 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 	else
 		packet->data_offset = cpu_to_le32(24);
 	packet->data_length = cpu_to_le32(data_length);
-	packet->remaining_data_length = cpu_to_le32(remaining_data_length);
+	packet->remaining_data_length = cpu_to_le32(*_remaining_data_length);
 	packet->padding = 0;
 
 	log_outgoing(INFO, "credits_requested=%d credits_granted=%d data_offset=%d data_length=%d remaining_data_length=%d\n",
@@ -918,7 +942,6 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 	if (!data_length)
 		header_length = offsetof(struct smbd_data_transfer, padding);
 
-	request->num_sge = 1;
 	request->sge[0].addr = ib_dma_map_single(info->id->device,
 						 (void *)packet,
 						 header_length,
@@ -932,23 +955,6 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
 	request->sge[0].length = header_length;
 	request->sge[0].lkey = info->pd->local_dma_lkey;
 
-	/* Fill in the packet data payload */
-	num_sgs = sgl ? sg_nents(sgl) : 0;
-	for_each_sg(sgl, sg, num_sgs, i) {
-		request->sge[i+1].addr =
-			ib_dma_map_page(info->id->device, sg_page(sg),
-			       sg->offset, sg->length, DMA_TO_DEVICE);
-		if (ib_dma_mapping_error(
-				info->id->device, request->sge[i+1].addr)) {
-			rc = -EIO;
-			request->sge[i+1].addr = 0;
-			goto err_dma;
-		}
-		request->sge[i+1].length = sg->length;
-		request->sge[i+1].lkey = info->pd->local_dma_lkey;
-		request->num_sge++;
-	}
-
 	rc = smbd_post_send(info, request);
 	if (!rc)
 		return 0;
@@ -987,8 +993,10 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
  */
 static int smbd_post_send_empty(struct smbd_connection *info)
 {
+	int remaining_data_length = 0;
+
 	info->count_send_empty++;
-	return smbd_post_send_sgl(info, NULL, 0, 0);
+	return smbd_post_send_iter(info, NULL, &remaining_data_length);
 }
 
 /*
@@ -1923,42 +1931,6 @@ int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
 	return rc;
 }
 
-/*
- * Send the contents of an iterator
- * @iter: The iterator to send
- * @_remaining_data_length: remaining data to send in this payload
- */
-static int smbd_post_send_iter(struct smbd_connection *info,
-			       struct iov_iter *iter,
-			       int *_remaining_data_length)
-{
-	struct scatterlist sgl[SMBDIRECT_MAX_SEND_SGE - 1];
-	unsigned int max_payload = info->max_send_size - sizeof(struct smbd_data_transfer);
-	unsigned int cleanup_mode;
-	ssize_t rc;
-
-	do {
-		struct sg_table sgtable = { .sgl = sgl };
-		size_t maxlen = min_t(size_t, *_remaining_data_length, max_payload);
-
-		sg_init_table(sgtable.sgl, ARRAY_SIZE(sgl));
-		rc = netfs_extract_iter_to_sg(iter, maxlen,
-					      &sgtable, ARRAY_SIZE(sgl),
-					      &cleanup_mode);
-		if (rc < 0)
-			break;
-		if (WARN_ON_ONCE(sgtable.nents == 0))
-			return -EIO;
-		WARN_ON(cleanup_mode != 0);
-
-		sg_mark_end(&sgl[sgtable.nents - 1]);
-		*_remaining_data_length -= rc;
-		rc = smbd_post_send_sgl(info, sgl, rc, *_remaining_data_length);
-	} while (rc == 0 && iov_iter_count(iter) > 0);
-
-	return rc;
-}
-
 /*
  * Send data to transport
  * Each rqst is transported as a SMBDirect payload
@@ -2240,16 +2212,17 @@ static struct smbd_mr *get_mr(struct smbd_connection *info)
 static int smbd_iter_to_mr(struct smbd_connection *info,
 			   struct iov_iter *iter,
 			   struct scatterlist *sgl,
-			   unsigned int num_pages)
+			   unsigned int num_pages,
+			   bool writing)
 {
 	struct sg_table sgtable = { .sgl = sgl };
-	unsigned int cleanup_mode;
 	int ret;
 
 	sg_init_table(sgl, num_pages);
 
 	ret = netfs_extract_iter_to_sg(iter, iov_iter_count(iter),
-				       &sgtable, num_pages, &cleanup_mode);
+				       &sgtable, num_pages,
+				       writing ? FOLL_SOURCE_BUF : FOLL_DEST_BUF);
 	WARN_ON(ret < 0);
 	return ret;
 }
@@ -2291,7 +2264,7 @@ struct smbd_mr *smbd_register_mr(struct smbd_connection *info,
 
 	log_rdma_mr(INFO, "num_pages=0x%x count=0x%zx\n",
 		    num_pages, iov_iter_count(iter));
-	smbd_iter_to_mr(info, iter, smbdirect_mr->sgl, num_pages);
+	smbd_iter_to_mr(info, iter, smbdirect_mr->sgl, num_pages, writing);
 
 	rc = ib_dma_map_sg(info->id->device, smbdirect_mr->sgl, num_pages, dir);
 	if (!rc) {
@@ -2602,13 +2575,6 @@ static ssize_t smb_extract_iter_to_rdma(struct iov_iter *iter, size_t len,
 	ssize_t ret;
 	int before = rdma->nr_sge;
 
-	if (iov_iter_is_discard(iter) ||
-	    iov_iter_is_pipe(iter) ||
-	    user_backed_iter(iter)) {
-		WARN_ON_ONCE(1);
-		return -EIO;
-	}
-
 	switch (iov_iter_type(iter)) {
 	case ITER_BVEC:
 		ret = smb_extract_bvec_to_rdma(iter, rdma, len);
@@ -2620,7 +2586,8 @@ static ssize_t smb_extract_iter_to_rdma(struct iov_iter *iter, size_t len,
 		ret = smb_extract_xarray_to_rdma(iter, rdma, len);
 		break;
 	default:
-		BUG();
+		WARN_ON_ONCE(1);
+		return -EIO;
 	}
 
 	if (ret > 0) {




* Re: [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list)
  2023-01-16 23:07 [PATCH v6 00/34] iov_iter: Improve page extraction (ref, pin or just list) David Howells
  2023-01-16 23:10 ` [PATCH v6 24/34] cifs: Add a function to build an RDMA SGE list from an iterator David Howells
  2023-01-16 23:11 ` [PATCH v6 29/34] cifs: Build the RDMA SGE list directly " David Howells
@ 2023-01-17  7:46 ` Christoph Hellwig
  2 siblings, 0 replies; 4+ messages in thread
From: Christoph Hellwig @ 2023-01-17  7:46 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, James E.J. Bottomley, Paolo Abeni, John Hubbard,
	Christoph Hellwig, Paulo Alcantara, linux-scsi, Steve French,
	Stefan Metzmacher, Miklos Szeredi, Martin K. Petersen,
	Logan Gunthorpe, Jeff Layton, Jakub Kicinski, netdev,
	Rohith Surabattula, Eric Dumazet, Matthew Wilcox, Anna Schumaker,
	Jens Axboe, Shyam Prasad N, Tom Talpey, linux-rdma,
	Trond Myklebust, Christian Schoenebeck, linux-mm, linux-crypto,
	linux-nfs, v9fs-developer, Latchesar Ionkov, linux-fsdevel,
	Eric Van Hensbergen, Long Li, Jan Kara, linux-cachefs,
	linux-block, Dominique Martinet, Namjae Jeon, David S. Miller,
	linux-cifs, Steve French, Herbert Xu, Christoph Hellwig,
	linux-kernel

First off the high-level comment:  can we cut down things for a first
round?  Maybe just convert everything using the bio based helpers
and then chunk it up?  Reviewing 34 patches across a dozen subsystems
isn't going to be easy and it will be hard to come up with a final
positive conclusion.

