* [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES)
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Hi Willy, Dave, et al.,

[NOTE! This patchset is a work in progress and some modules will not
 compile with it.]

I've been looking at how to make pipes handle the splicing in of multipage
folios and also looking to see if I could implement a suggestion from Willy
that pipe_buffers could perhaps hold a list of pages (which could make
splicing simpler - an entire splice segment would go in a single
pipe_buffer).

There are a couple of issues here:

 (1) Gifting/stealing a multipage folio is really tricky.  I think that if
     a multipage folio is gifted, the gift flag should be quietly dropped.
     Userspace has no control over what splice() and vmsplice() will see in
     the pagecache.

 (2) The sendpage op expects to be given a single page and various network
     protocols just attach that to a socket buffer.

This patchset aims to deal with the second by removing the ->sendpage()
operation and replacing it with sendmsg() and a new internal flag
MSG_SPLICE_PAGES.  As sendmsg() takes an I/O iterator, this also affords
the opportunity to pass a slew of pages in one go, rather than one at a
time.

If MSG_SPLICE_PAGES is set, the current implementation requires that the
iterator be ITER_BVEC-type and that the pages can be retained by calling
get_page() on them.  Note that I'm accessing the bvec[] directly, but
should really use iov_iter_extract_pages(), which would also allow an
ITER_XARRAY-type iterator to be used.

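As a rough illustration, the conversions in this series turn a sendpage
call into something like the following, where sock, page, offset and size
stand in for the caller's existing context:

	struct bio_vec bvec;
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };
	int ret;

	/* Point a BVEC iterator at the page to be spliced; the protocol
	 * takes its own ref with get_page() when it attaches the page.
	 */
	bvec_set_page(&bvec, page, size, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);

	ret = sock_sendmsg(sock, &msg);
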
The patchset consists of the following parts:

 (1) Define the MSG_SPLICE_PAGES flag.

 (2) Provide a simple allocator that takes pages and splits pieces off them
     on request and returns them with a ref on the page.  Unlike with slab
     memory, the lifetime of the allocated memory is controlled by the page
     refcount.  This allows protocol bits to be included in the same bvec[]
     as the data.

 (3) Implement MSG_SPLICE_PAGES support in TCP.

 (4) Make do_tcp_sendpages() just wrap sendmsg() and then fold it into its
     various callers.

 (5) Implement MSG_SPLICE_PAGES support in IP and make udp_sendpage() just
     a wrapper around sendmsg().

 (6) Implement MSG_SPLICE_PAGES support in AF_UNIX.

 (7) Implement MSG_SPLICE_PAGES support in AF_ALG and make
     af_alg_sendpage() just a wrapper around sendmsg().

 (8) Rename pipe_to_sendpage() to pipe_to_sendmsg() and make it a wrapper
     around sendmsg().

 (9) Remove sendpage file operation.

(10) Convert siw, ceph, iscsi and tcp_bpf to use sendmsg() instead of
     tcp_sendpage().

(11) Make skb_send_sock() use sendmsg().

(12) Remove AF_ALG's hash_sendpage() as hash_sendmsg() seems to paste the
     page pointers in anyway.

(13) Convert ceph, rds, dlm and sunrpc to use sendmsg().

(14) Remove the sendpage socket operation.

This leaves the implementation of MSG_SPLICE_PAGES in AF_TLS, AF_KCM,
AF_SMC and Chelsio-TLS, which I'm going to need help with, and the cleanup
of kernel_sendpage use in AF_KCM, AF_SMC and NVMe over TCP still to be
done.


I'm wondering about how best to proceed further:

 - Rather than providing a special allocator, should protocols implementing
   MSG_SPLICE_PAGES recognise pages that belong to the slab allocator, copy
   the content of those into the skbuff, and only directly attach the
   source page if it's not a slab page?  (See the sketch after this list.)

 - Should MSG_SPLICE_PAGES work with ITER_XARRAY as well as ITER_BVEC?

 - Should MSG_SPLICE_PAGES just be a hint and get ignored if the conditions
   for using it are not met rather than giving an error?

 - Should pages attached to a pipe be pinned (ie. FOLL_PIN) rather than
   simply ref'd (ie. FOLL_GET) so that the DIO issue doesn't occur on
   spliced pages?

 - Similarly, should pages undergoing zerocopy be pinned when attached to
   an skbuff rather than being simply ref'd?  I have a patch to note in the
   bottom two bits of the frag page pointer if they are pinned, ref'd or
   neither.

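On the first question, a minimal sketch of what a protocol's sendmsg()
might do instead, using the existing sendpage_ok() helper (sk, skb, msg,
bv, i, off, copy and pfrag stand in for the protocol's context):

	if (sendpage_ok(bv->bv_page)) {
		/* Not a slab page and refcount > 0: safe to take a ref
		 * and attach the source page directly.
		 */
		get_page(bv->bv_page);
		skb_fill_page_desc_noacc(skb, i, bv->bv_page, off, copy);
	} else {
		/* Slab memory, where kfree() governs the lifetime: copy
		 * into the socket's own page frag, as the ordinary
		 * sendmsg() path does.
		 */
		err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
					       pfrag->page, pfrag->offset,
					       copy);
	}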

I have tested AF_UNIX splicing - which, surprisingly, seems nearly twice as
fast - TCP splicing, the siw driver (softIWarp RDMA with nfs and cifs),
sunrpc (with nfsd) and UDP (using a patched rxrpc).

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-sendpage

David

David Howells (28):
  net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
  Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES
  tcp: Support MSG_SPLICE_PAGES
  tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
  tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around
    tcp_sendmsg
  espintcp: Inline do_tcp_sendpages()
  tls: Inline do_tcp_sendpages()
  siw: Inline do_tcp_sendpages()
  tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
  ip, udp: Support MSG_SPLICE_PAGES
  udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
  af_unix: Support MSG_SPLICE_PAGES
  crypto: af_alg: Indent the loop in af_alg_sendmsg()
  crypto: af_alg: Support MSG_SPLICE_PAGES
  crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES
  splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
  Remove file->f_op->sendpage
  siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
  ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  tcp_bpf: Make tcp_bpf_sendpage() go through
    tcp_bpf_sendmsg(MSG_SPLICE_PAGES)
  net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()
  algif: Remove hash_sendpage*()
  ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
  rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)

 Documentation/networking/scaling.rst     |   4 +-
 crypto/Kconfig                           |   1 +
 crypto/af_alg.c                          | 137 +++++--------
 crypto/algif_aead.c                      |  40 ++--
 crypto/algif_hash.c                      |  66 ------
 crypto/algif_rng.c                       |   2 -
 crypto/algif_skcipher.c                  |  22 +-
 drivers/infiniband/sw/siw/siw_qp_tx.c    | 224 +++++----------------
 drivers/target/iscsi/iscsi_target_util.c |  14 +-
 fs/dlm/lowcomms.c                        |  10 +-
 fs/splice.c                              |  42 ++--
 include/linux/fs.h                       |   3 -
 include/linux/net.h                      |   8 -
 include/linux/socket.h                   |   1 +
 include/linux/splice.h                   |   2 +
 include/linux/zcopy_alloc.h              |  16 ++
 include/net/inet_common.h                |   2 -
 include/net/sock.h                       |   6 -
 include/net/tcp.h                        |   2 -
 include/net/tls.h                        |   2 +-
 mm/Makefile                              |   2 +-
 mm/zcopy_alloc.c                         | 129 ++++++++++++
 net/appletalk/ddp.c                      |   1 -
 net/atm/pvc.c                            |   1 -
 net/atm/svc.c                            |   1 -
 net/ax25/af_ax25.c                       |   1 -
 net/caif/caif_socket.c                   |   2 -
 net/can/bcm.c                            |   1 -
 net/can/isotp.c                          |   1 -
 net/can/j1939/socket.c                   |   1 -
 net/can/raw.c                            |   1 -
 net/ceph/messenger_v1.c                  |  58 ++----
 net/ceph/messenger_v2.c                  |  89 ++-------
 net/core/skbuff.c                        |  49 +++--
 net/core/sock.c                          |  35 +---
 net/dccp/ipv4.c                          |   1 -
 net/dccp/ipv6.c                          |   1 -
 net/ieee802154/socket.c                  |   2 -
 net/ipv4/af_inet.c                       |  21 --
 net/ipv4/ip_output.c                     |  89 ++++++++-
 net/ipv4/tcp.c                           | 244 +++++------------------
 net/ipv4/tcp_bpf.c                       |  72 ++-----
 net/ipv4/tcp_ipv4.c                      |   1 -
 net/ipv4/udp.c                           |  54 -----
 net/ipv4/udp_impl.h                      |   2 -
 net/ipv4/udplite.c                       |   1 -
 net/ipv6/af_inet6.c                      |   3 -
 net/ipv6/raw.c                           |   1 -
 net/ipv6/tcp_ipv6.c                      |   1 -
 net/key/af_key.c                         |   1 -
 net/l2tp/l2tp_ip.c                       |   1 -
 net/l2tp/l2tp_ip6.c                      |   1 -
 net/llc/af_llc.c                         |   1 -
 net/mctp/af_mctp.c                       |   1 -
 net/mptcp/protocol.c                     |   2 -
 net/netlink/af_netlink.c                 |   1 -
 net/netrom/af_netrom.c                   |   1 -
 net/packet/af_packet.c                   |   2 -
 net/phonet/socket.c                      |   2 -
 net/qrtr/af_qrtr.c                       |   1 -
 net/rds/af_rds.c                         |   1 -
 net/rds/tcp_send.c                       |  80 ++++----
 net/rose/af_rose.c                       |   1 -
 net/rxrpc/af_rxrpc.c                     |   1 -
 net/sctp/protocol.c                      |   1 -
 net/socket.c                             |  74 +------
 net/sunrpc/svcsock.c                     |  70 ++-----
 net/sunrpc/xdr.c                         |  24 ++-
 net/tipc/socket.c                        |   3 -
 net/tls/tls_main.c                       |  24 ++-
 net/unix/af_unix.c                       | 223 +++++++--------------
 net/vmw_vsock/af_vsock.c                 |   3 -
 net/x25/af_x25.c                         |   1 -
 net/xdp/xsk.c                            |   1 -
 net/xfrm/espintcp.c                      |  10 +-
 75 files changed, 687 insertions(+), 1313 deletions(-)
 create mode 100644 include/linux/zcopy_alloc.h
 create mode 100644 mm/zcopy_alloc.c




* [RFC PATCH 01/28] net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Declare MSG_SPLICE_PAGES, an internal sendmsg() flag, that hints to a
network protocol that it should splice pages from the source iterator
rather than copying the data if it can.

This is intended as a replacement for the ->sendpage() op, allowing a way
to splice in several multipage folios in one go.

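For instance, several multipage folios could be loaded into one bvec array
and pushed into a socket with a single call (a hypothetical caller; folio0
and folio1 stand for folios the caller can take page refs on):

	struct bio_vec bv[2];
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES };
	size_t len;
	int ret;

	bvec_set_folio(&bv[0], folio0, folio_size(folio0), 0);
	bvec_set_folio(&bv[1], folio1, folio_size(folio1), 0);
	len = bv[0].bv_len + bv[1].bv_len;
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 2, len);

	ret = sock_sendmsg(sock, &msg);
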
Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/linux/socket.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 13c3a237b9c9..a67d02da3c54 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -327,6 +327,7 @@ struct ucred {
 					  */
 
 #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
+#define MSG_SPLICE_PAGES 0x8000000	/* Splice the pages from the iterator in sendmsg() */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through




* [RFC PATCH 02/28] Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Bernard Metzler,
	Tom Talpey, linux-rdma

If a network protocol sendmsg() sees MSG_SPLICE_PAGES, it expects that the
iterator is of ITER_BVEC type and that all the pages can have refs taken on
them with get_page() and discarded with put_page().  Bits of network
filesystem protocol data, however, are typically contained in slab memory
for which the cleanup method is kfree(), not put_page(), so this doesn't
work.

Provide a simple allocator, zcopy_alloc(), that allocates a page at a time
per-cpu and sequentially breaks off pieces and hands them out with a ref as
it's asked for them.  The caller disposes of the memory it was given by
calling put_page().  When a page is all parcelled out, it is abandoned by
the allocator and another page is obtained.  The page will get cleaned up
when the last skbuff fragment is destroyed.

A helper function, zcopy_memdup(), is provided to call zcopy_alloc() and
copy the given data into the allocated memory.

[!] I'm not sure this is the best way to do things.  A better way might be
    to make the network protocol look at the page and copy it if it's a
    slab object rather than taking a ref on it.

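For example, a protocol could mix a small amount of protocol data into the
same bvec[] as the payload pages (struct my_proto_hdr and the other names
here are purely illustrative):

	struct bio_vec bv[2];
	struct my_proto_hdr hdr;
	int ret;

	/* Copy the header into page-backed memory; bv[0] then holds a ref
	 * on the page and is disposed of with put_page().
	 */
	ret = zcopy_memdup(sizeof(hdr), &hdr, &bv[0], GFP_KERNEL);
	if (ret < 0)
		return ret;

	/* The payload page is attached with its own ref as usual. */
	bvec_set_page(&bv[1], data_page, data_len, data_offset);

Both elements can then be sent with a single sendmsg(MSG_SPLICE_PAGES)
call.
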
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Tom Talpey <tom@talpey.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-rdma@vger.kernel.org
cc: netdev@vger.kernel.org
---
 include/linux/zcopy_alloc.h |  16 +++++
 mm/Makefile                 |   2 +-
 mm/zcopy_alloc.c            | 129 ++++++++++++++++++++++++++++++++++++
 3 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/zcopy_alloc.h
 create mode 100644 mm/zcopy_alloc.c

diff --git a/include/linux/zcopy_alloc.h b/include/linux/zcopy_alloc.h
new file mode 100644
index 000000000000..8eb205678073
--- /dev/null
+++ b/include/linux/zcopy_alloc.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Defs for the zerocopy filler fragment allocator.
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_ZCOPY_ALLOC_H
+#define _LINUX_ZCOPY_ALLOC_H
+
+struct bio_vec;
+
+int zcopy_alloc(size_t size, struct bio_vec *bvec, gfp_t gfp);
+int zcopy_memdup(size_t size, const void *p, struct bio_vec *bvec, gfp_t gfp);
+
+#endif /* _LINUX_ZCOPY_ALLOC_H */
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..3848f43751ee 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o percpu.o slab_common.o \
-			   compaction.o \
+			   compaction.o zcopy_alloc.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o gup.o mmap_lock.o $(mmu-y)
 
diff --git a/mm/zcopy_alloc.c b/mm/zcopy_alloc.c
new file mode 100644
index 000000000000..7b219392e829
--- /dev/null
+++ b/mm/zcopy_alloc.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Allocator for zerocopy filler fragments
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * Provide a facility whereby pieces of bufferage can be allocated for
+ * insertion into bio_vec arrays intended for zerocopying, allowing protocol
+ * stuff to be mixed in with data.
+ *
+ * Unlike objects allocated from the slab, the lifetime of these pieces of
+ * buffer are governed purely by the refcount of the page in which they reside.
+ */
+
+#include <linux/export.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/zcopy_alloc.h>
+#include <linux/bvec.h>
+
+struct zcopy_alloc_info {
+	struct folio	*folio;		/* Page currently being allocated from */
+	struct folio	*spare;		/* Spare page */
+	unsigned int	used;		/* Amount of folio used */
+	spinlock_t	lock;		/* Allocation lock (needs bh-disable) */
+};
+
+static struct zcopy_alloc_info __percpu *zcopy_alloc_info;
+
+static int __init zcopy_alloc_init(void)
+{
+	zcopy_alloc_info = alloc_percpu(struct zcopy_alloc_info);
+	if (!zcopy_alloc_info)
+		panic("Unable to set up zcopy_alloc allocator\n");
+	return 0;
+}
+subsys_initcall(zcopy_alloc_init);
+
+/**
+ * zcopy_alloc - Allocate some memory for use in zerocopy
+ * @size: The amount of memory (maximum 1/2 page).
+ * @bvec: Where to store the details of the memory
+ * @gfp: Allocation flags under which to make an allocation
+ *
+ * Allocate some memory for use with zerocopy where protocol bits have to be
+ * mixed in with spliced/zerocopied data.  Unlike memory allocated from the
+ * slab, this memory's lifetime is purely dependent on the folio's refcount.
+ *
+ * The way it works is that a folio is allocated and pieces are broken off
+ * sequentially and given to the allocators with a ref until it no longer has
+ * enough spare space, at which point the allocator's ref is dropped and a new
+ * folio is allocated.  The folio remains in existence until the last ref held
+ * by, say, a sk_buff is discarded and then the page is returned to the
+ * allocator.
+ *
+ * Returns 0 on success and -ENOMEM on allocation failure.  If successful, the
+ * details of the allocated memory are placed in *%bvec.
+ *
+ * The allocated memory should be disposed of with folio_put().
+ */
+int zcopy_alloc(size_t size, struct bio_vec *bvec, gfp_t gfp)
+{
+	struct zcopy_alloc_info *info;
+	struct folio *folio, *spare = NULL;
+	size_t full_size = round_up(size, 8);
+
+	if (WARN_ON_ONCE(full_size > PAGE_SIZE / 2))
+		return -ENOMEM; /* Allocate pages */
+
+try_again:
+	info = get_cpu_ptr(zcopy_alloc_info);
+
+	folio = info->folio;
+	if (folio && folio_size(folio) - info->used < full_size) {
+		folio_put(folio);
+		folio = info->folio = NULL;
+	}
+	if (spare && !info->spare) {
+		info->spare = spare;
+		spare = NULL;
+	}
+	if (!folio && info->spare) {
+		folio = info->folio = info->spare;
+		info->spare = NULL;
+		info->used = 0;
+	}
+	if (folio) {
+		bvec_set_folio(bvec, folio, size, info->used);
+		info->used += full_size;
+		if (info->used < folio_size(folio))
+			folio_get(folio);
+		else
+			info->folio = NULL;
+	}
+
+	put_cpu_ptr(zcopy_alloc_info);
+	if (folio) {
+		if (spare)
+			folio_put(spare);
+		return 0;
+	}
+
+	spare = folio_alloc(gfp, 0);
+	if (!spare)
+		return -ENOMEM;
+	goto try_again;
+}
+EXPORT_SYMBOL(zcopy_alloc);
+
+/**
+ * zcopy_memdup - Allocate some memory for use in zerocopy and fill it
+ * @size: The amount of memory to copy (maximum 1/2 page).
+ * @p: The source data to copy
+ * @bvec: Where to store the details of the memory
+ * @gfp: Allocation flags under which to make an allocation
+ */
+int zcopy_memdup(size_t size, const void *p, struct bio_vec *bvec, gfp_t gfp)
+{
+	void *q;
+
+	if (zcopy_alloc(size, bvec, gfp) < 0)
+		return -ENOMEM;
+
+	q = kmap_local_folio(page_folio(bvec->bv_page), bvec->bv_offset);
+	memcpy(q, p, size);
+	kunmap_local(q);
+	return 0;
+}
+EXPORT_SYMBOL(zcopy_memdup);




* [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Make TCP's sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator if possible (the iterator must be
ITER_BVEC and the pages must be spliceable).

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

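For clarity when reading the diff: zc changes from a bool to an int with,
in effect, the following meanings:

	int zc = 0;	/* 0: copy data; 1: MSG_ZEROCOPY; 2: MSG_SPLICE_PAGES */
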
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 288693981b00..77c0c69208a5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	int process_backlog = 0;
-	bool zc = false;
+	int zc = 0;
 	long timeo;
 
 	flags = msg->msg_flags;
@@ -1231,17 +1231,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
 			net_zcopy_get(uarg);
-			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
 			if (!uarg) {
 				err = -ENOBUFS;
 				goto out_err;
 			}
-			zc = sk->sk_route_caps & NETIF_F_SG;
-			if (!zc)
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
+			else
 				uarg_to_msgzc(uarg)->zerocopy = 0;
 		}
+	} else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
+		if (!iov_iter_is_bvec(&msg->msg_iter))
+			return -EINVAL;
+		if (sk->sk_route_caps & NETIF_F_SG)
+			zc = 2;
 	}
 
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1345,7 +1352,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (copy > msg_data_left(msg))
 			copy = msg_data_left(msg);
 
-		if (!zc) {
+		if (zc == 0) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1390,7 +1397,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
-		} else {
+		} else if (zc == 1)  {
 			/* First append to a fragless skb builds initial
 			 * pure zerocopy skb
 			 */
@@ -1411,6 +1418,46 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			if (err < 0)
 				goto do_error;
 			copy = err;
+		} else if (zc == 2) {
+			/* Splice in data. */
+			const struct bio_vec *bv = msg->msg_iter.bvec;
+			size_t seg = iov_iter_single_seg_count(&msg->msg_iter);
+			size_t off = bv->bv_offset + msg->msg_iter.iov_offset;
+			bool can_coalesce;
+			int i = skb_shinfo(skb)->nr_frags;
+
+			if (copy > seg)
+				copy = seg;
+
+			can_coalesce = skb_can_coalesce(skb, i, bv->bv_page, off);
+			if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
+				tcp_mark_push(tp, skb);
+				goto new_segment;
+			}
+			if (tcp_downgrade_zcopy_pure(sk, skb))
+				goto wait_for_space;
+
+			copy = tcp_wmem_schedule(sk, copy);
+			if (!copy)
+				goto wait_for_space;
+
+			if (can_coalesce) {
+				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+			} else {
+				get_page(bv->bv_page);
+				skb_fill_page_desc_noacc(skb, i, bv->bv_page, off, copy);
+			}
+			iov_iter_advance(&msg->msg_iter, copy);
+
+			if (!(flags & MSG_NO_SHARED_FRAGS))
+				skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
+
+			skb->len += copy;
+			skb->data_len += copy;
+			skb->truesize += copy;
+			sk_wmem_queued_add(sk, copy);
+			sk_mem_charge(sk, copy);
+
 		}
 
 		if (!copied)




* [RFC PATCH 04/28] tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Convert do_tcp_sendpages() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself.  do_tcp_sendpages() can then be
inlined in subsequent patches into its callers.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/tcp.c | 160 +++----------------------------------------------
 1 file changed, 9 insertions(+), 151 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 77c0c69208a5..7c3acc5673e9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -971,163 +971,21 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
 	return min(copy, sk->sk_forward_alloc);
 }
 
-static struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
-				      struct page *page, int offset, size_t *size)
-{
-	struct sk_buff *skb = tcp_write_queue_tail(sk);
-	struct tcp_sock *tp = tcp_sk(sk);
-	bool can_coalesce;
-	int copy, i;
-
-	if (!skb || (copy = size_goal - skb->len) <= 0 ||
-	    !tcp_skb_can_collapse_to(skb)) {
-new_segment:
-		if (!sk_stream_memory_free(sk))
-			return NULL;
-
-		skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
-					   tcp_rtx_and_write_queues_empty(sk));
-		if (!skb)
-			return NULL;
-
-#ifdef CONFIG_TLS_DEVICE
-		skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
-#endif
-		tcp_skb_entail(sk, skb);
-		copy = size_goal;
-	}
-
-	if (copy > *size)
-		copy = *size;
-
-	i = skb_shinfo(skb)->nr_frags;
-	can_coalesce = skb_can_coalesce(skb, i, page, offset);
-	if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
-		tcp_mark_push(tp, skb);
-		goto new_segment;
-	}
-	if (tcp_downgrade_zcopy_pure(sk, skb))
-		return NULL;
-
-	copy = tcp_wmem_schedule(sk, copy);
-	if (!copy)
-		return NULL;
-
-	if (can_coalesce) {
-		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
-	} else {
-		get_page(page);
-		skb_fill_page_desc_noacc(skb, i, page, offset, copy);
-	}
-
-	if (!(flags & MSG_NO_SHARED_FRAGS))
-		skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
-
-	skb->len += copy;
-	skb->data_len += copy;
-	skb->truesize += copy;
-	sk_wmem_queued_add(sk, copy);
-	sk_mem_charge(sk, copy);
-	WRITE_ONCE(tp->write_seq, tp->write_seq + copy);
-	TCP_SKB_CB(skb)->end_seq += copy;
-	tcp_skb_pcount_set(skb, 0);
-
-	*size = copy;
-	return skb;
-}
-
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			 size_t size, int flags)
 {
-	struct tcp_sock *tp = tcp_sk(sk);
-	int mss_now, size_goal;
-	int err;
-	ssize_t copied;
-	long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
-
-	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    WARN_ONCE(!sendpage_ok(page),
-		      "page must not be a Slab one and have page_count > 0"))
-		return -EINVAL;
-
-	/* Wait for a connection to finish. One exception is TCP Fast Open
-	 * (passive side) where data is allowed to be sent before a connection
-	 * is fully established.
-	 */
-	if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-	    !tcp_passive_fastopen(sk)) {
-		err = sk_stream_wait_connect(sk, &timeo);
-		if (err != 0)
-			goto out_err;
-	}
-
-	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
-
-	mss_now = tcp_send_mss(sk, &size_goal, flags);
-	copied = 0;
-
-	err = -EPIPE;
-	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
-		goto out_err;
-
-	while (size > 0) {
-		struct sk_buff *skb;
-		size_t copy = size;
-
-		skb = tcp_build_frag(sk, size_goal, flags, page, offset, &copy);
-		if (!skb)
-			goto wait_for_space;
-
-		if (!copied)
-			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
-
-		copied += copy;
-		offset += copy;
-		size -= copy;
-		if (!size)
-			goto out;
-
-		if (skb->len < size_goal || (flags & MSG_OOB))
-			continue;
-
-		if (forced_push(tp)) {
-			tcp_mark_push(tp, skb);
-			__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
-		} else if (skb == tcp_send_head(sk))
-			tcp_push_one(sk, mss_now);
-		continue;
-
-wait_for_space:
-		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-		tcp_push(sk, flags & ~MSG_MORE, mss_now,
-			 TCP_NAGLE_PUSH, size_goal);
-
-		err = sk_stream_wait_memory(sk, &timeo);
-		if (err != 0)
-			goto do_error;
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = flags | MSG_SPLICE_PAGES,
+	};
 
-		mss_now = tcp_send_mss(sk, &size_goal, flags);
-	}
+	bvec_set_page(&bvec, page, size, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
 
-out:
-	if (copied) {
-		tcp_tx_timestamp(sk, sk->sk_tsflags);
-		if (!(flags & MSG_SENDPAGE_NOTLAST))
-			tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
-	}
-	return copied;
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		msg.msg_flags |= MSG_MORE;
 
-do_error:
-	tcp_remove_empty_skb(sk);
-	if (copied)
-		goto out;
-out_err:
-	/* make sure we wake any epoll edge trigger waiter */
-	if (unlikely(tcp_rtx_and_write_queues_empty(sk) && err == -EAGAIN)) {
-		sk->sk_write_space(sk);
-		tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
-	}
-	return sk_stream_error(sk, flags, err);
+	return tcp_sendmsg_locked(sk, &msg, size);
 }
 EXPORT_SYMBOL_GPL(do_tcp_sendpages);
 




* [RFC PATCH 05/28] tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, John Fastabend,
	Jakub Sitnicki, bpf

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it.  This is part of replacing ->sendpage() with a call to
sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Sitnicki <jakub@cloudflare.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
cc: bpf@vger.kernel.org
---
 net/ipv4/tcp_bpf.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index cf26d65ca389..7f17134637eb 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -72,11 +72,13 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
 {
 	bool apply = apply_bytes;
 	struct scatterlist *sge;
+	struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
 	struct page *page;
 	int size, ret = 0;
 	u32 off;
 
 	while (1) {
+		struct bio_vec bvec;
 		bool has_tx_ulp;
 
 		sge = sk_msg_elem(msg, msg->sg.start);
@@ -88,16 +90,18 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
 		tcp_rate_check_app_limited(sk);
 retry:
 		has_tx_ulp = tls_sw_has_ctx_tx(sk);
-		if (has_tx_ulp) {
-			flags |= MSG_SENDPAGE_NOPOLICY;
-			ret = kernel_sendpage_locked(sk,
-						     page, off, size, flags);
-		} else {
-			ret = do_tcp_sendpages(sk, page, off, size, flags);
-		}
+		if (has_tx_ulp)
+			msghdr.msg_flags |= MSG_SENDPAGE_NOPOLICY;
 
+		if (flags & MSG_SENDPAGE_NOTLAST)
+			msghdr.msg_flags |= MSG_MORE;
+
+		bvec_set_page(&bvec, page, size, off);
+		iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+		ret = tcp_sendmsg_locked(sk, &msghdr, size);
 		if (ret <= 0)
 			return ret;
+
 		if (apply)
 			apply_bytes -= ret;
 		msg->sg.size -= ret;
@@ -398,7 +402,7 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	long timeo;
 	int flags;
 
-	/* Don't let internal do_tcp_sendpages() flags through */
+	/* Don't let internal sendpage flags through */
 	flags = (msg->msg_flags & ~MSG_SENDPAGE_DECRYPTED);
 	flags |= MSG_NO_SHARED_FRAGS;
 




* [RFC PATCH 06/28] espintcp: Inline do_tcp_sendpages()
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Steffen Klassert,
	Herbert Xu

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steffen Klassert <steffen.klassert@secunet.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/xfrm/espintcp.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index 872b80188e83..3504925babdb 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -205,14 +205,16 @@ static int espintcp_sendskb_locked(struct sock *sk, struct espintcp_msg *emsg,
 static int espintcp_sendskmsg_locked(struct sock *sk,
 				     struct espintcp_msg *emsg, int flags)
 {
+	struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
 	struct sk_msg *skmsg = &emsg->skmsg;
 	struct scatterlist *sg;
 	int done = 0;
 	int ret;
 
-	flags |= MSG_SENDPAGE_NOTLAST;
+	msghdr.msg_flags |= MSG_SENDPAGE_NOTLAST;
 	sg = &skmsg->sg.data[skmsg->sg.start];
 	do {
+		struct bio_vec bvec;
 		size_t size = sg->length - emsg->offset;
 		int offset = sg->offset + emsg->offset;
 		struct page *p;
@@ -220,11 +222,13 @@ static int espintcp_sendskmsg_locked(struct sock *sk,
 		emsg->offset = 0;
 
 		if (sg_is_last(sg))
-			flags &= ~MSG_SENDPAGE_NOTLAST;
+			msghdr.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
 
 		p = sg_page(sg);
 retry:
-		ret = do_tcp_sendpages(sk, p, offset, size, flags);
+		bvec_set_page(&bvec, p, size, offset);
+		iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+		ret = tcp_sendmsg_locked(sk, &msghdr, size);
 		if (ret < 0) {
 			emsg->offset = offset - sg->offset;
 			skmsg->sg.start += done;




* [RFC PATCH 07/28] tls: Inline do_tcp_sendpages()
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Boris Pismenny,
	John Fastabend

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/net/tls.h  |  2 +-
 net/tls/tls_main.c | 24 +++++++++++++++---------
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 154949c7b0c8..d31521c36a84 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -256,7 +256,7 @@ struct tls_context {
 	struct scatterlist *partially_sent_record;
 	u16 partially_sent_offset;
 
-	bool in_tcp_sendpages;
+	bool splicing_pages;
 	bool pending_open_record_frags;
 
 	struct mutex tx_lock; /* protects partially_sent_* fields and
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 3735cb00905d..8802b4f8b652 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -124,7 +124,10 @@ int tls_push_sg(struct sock *sk,
 		u16 first_offset,
 		int flags)
 {
-	int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = flags | MSG_SPLICE_PAGES | MSG_SENDPAGE_NOTLAST,
+	};
 	int ret = 0;
 	struct page *p;
 	size_t size;
@@ -133,16 +136,19 @@ int tls_push_sg(struct sock *sk,
 	size = sg->length - offset;
 	offset += sg->offset;
 
-	ctx->in_tcp_sendpages = true;
+	ctx->splicing_pages = true;
 	while (1) {
 		if (sg_is_last(sg))
-			sendpage_flags = flags;
+			msg.msg_flags = flags | MSG_SPLICE_PAGES;
 
 		/* is sending application-limited? */
 		tcp_rate_check_app_limited(sk);
 		p = sg_page(sg);
 retry:
-		ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
+		bvec_set_page(&bvec, p, size, offset);
+		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
+		ret = tcp_sendmsg_locked(sk, &msg, size);
 
 		if (ret != size) {
 			if (ret > 0) {
@@ -154,7 +160,7 @@ int tls_push_sg(struct sock *sk,
 			offset -= sg->offset;
 			ctx->partially_sent_offset = offset;
 			ctx->partially_sent_record = (void *)sg;
-			ctx->in_tcp_sendpages = false;
+			ctx->splicing_pages = false;
 			return ret;
 		}
 
@@ -168,7 +174,7 @@ int tls_push_sg(struct sock *sk,
 		size = sg->length;
 	}
 
-	ctx->in_tcp_sendpages = false;
+	ctx->splicing_pages = false;
 
 	return 0;
 }
@@ -246,11 +252,11 @@ static void tls_write_space(struct sock *sk)
 {
 	struct tls_context *ctx = tls_get_ctx(sk);
 
-	/* If in_tcp_sendpages call lower protocol write space handler
+	/* If splicing_pages call lower protocol write space handler
 	 * to ensure we wake up any waiting operations there. For example
-	 * if do_tcp_sendpages where to call sk_wait_event.
+	 * if splicing pages were to call sk_wait_event.
 	 */
-	if (ctx->in_tcp_sendpages) {
+	if (ctx->splicing_pages) {
 		ctx->sk_write_space(sk);
 		return;
 	}




* [RFC PATCH 08/28] siw: Inline do_tcp_sendpages()
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Bernard Metzler,
	Tom Talpey, linux-rdma

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Tom Talpey <tom@talpey.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-rdma@vger.kernel.org
cc: netdev@vger.kernel.org
---
 drivers/infiniband/sw/siw/siw_qp_tx.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 05052b49107f..8fc179321e2b 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -313,7 +313,7 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 }
 
 /*
- * 0copy TCP transmit interface: Use do_tcp_sendpages.
+ * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
  *
  * Using sendpage to push page by page appears to be less efficient
  * than using sendmsg, even if data are copied.
@@ -324,20 +324,27 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
 			     size_t size)
 {
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = (MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT |
+			      MSG_SENDPAGE_NOTLAST),
+	};
 	struct sock *sk = s->sk;
-	int i = 0, rv = 0, sent = 0,
-	    flags = MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST;
+	int i = 0, rv = 0, sent = 0;
 
 	while (size) {
 		size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
 
 		if (size + offset <= PAGE_SIZE)
-			flags = MSG_MORE | MSG_DONTWAIT;
+			msg.msg_flags = MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT;
 
 		tcp_rate_check_app_limited(sk);
+		bvec_set_page(&bvec, page[i], bytes, offset);
+		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, bytes);
+
 try_page_again:
 		lock_sock(sk);
-		rv = do_tcp_sendpages(sk, page[i], offset, bytes, flags);
+		rv = tcp_sendmsg_locked(sk, &msg, bytes);
 		release_sock(sk);
 
 		if (rv > 0) {




* [RFC PATCH 09/28] tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Fold do_tcp_sendpages() into its last remaining caller,
tcp_sendpage_locked().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/net/tcp.h |  2 --
 net/ipv4/tcp.c    | 21 +++++++--------------
 2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1e..844bc8e6a714 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -333,8 +333,6 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
 		 int flags);
 int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
 			size_t size, int flags);
-ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags);
 int tcp_send_mss(struct sock *sk, int *size_goal, int flags);
 void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle,
 	      int size_goal);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7c3acc5673e9..f1454e4497df 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -971,14 +971,19 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
 	return min(copy, sk->sk_forward_alloc);
 }
 
-ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
-			 size_t size, int flags)
+int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
+			size_t size, int flags)
 {
 	struct bio_vec bvec;
 	struct msghdr msg = {
 		.msg_flags = flags | MSG_SPLICE_PAGES,
 	};
 
+	if (!(sk->sk_route_caps & NETIF_F_SG))
+		return sock_no_sendpage_locked(sk, page, offset, size, flags);
+
+	tcp_rate_check_app_limited(sk);  /* is sending application-limited? */
+
 	bvec_set_page(&bvec, page, size, offset);
 	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
 
@@ -987,18 +992,6 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 
 	return tcp_sendmsg_locked(sk, &msg, size);
 }
-EXPORT_SYMBOL_GPL(do_tcp_sendpages);
-
-int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
-			size_t size, int flags)
-{
-	if (!(sk->sk_route_caps & NETIF_F_SG))
-		return sock_no_sendpage_locked(sk, page, offset, size, flags);
-
-	tcp_rate_check_app_limited(sk);  /* is sending application-limited? */
-
-	return do_tcp_sendpages(sk, page, offset, size, flags);
-}
 EXPORT_SYMBOL_GPL(tcp_sendpage_locked);
 
 int tcp_sendpage(struct sock *sk, struct page *page, int offset,




* [RFC PATCH 10/28] ip, udp: Support MSG_SPLICE_PAGES
From: David Howells @ 2023-03-16 15:26 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Willem de Bruijn

Make IP/UDP sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator if possible (the iterator must be
ITER_BVEC and the pages must be spliceable).

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/ip_output.c | 89 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 86 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4e4e308c3230..721d7e4343ed 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -977,7 +977,7 @@ static int __ip_append_data(struct sock *sk,
 	int err;
 	int offset = 0;
 	bool zc = false;
-	unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
+	unsigned int maxfraglen, fragheaderlen, maxnonfragsize, xlength;
 	int csummode = CHECKSUM_NONE;
 	struct rtable *rt = (struct rtable *)cork->dst;
 	unsigned int wmem_alloc_delta = 0;
@@ -1017,6 +1017,7 @@ static int __ip_append_data(struct sock *sk,
 	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
 		csummode = CHECKSUM_PARTIAL;
 
+	xlength = length;
 	if ((flags & MSG_ZEROCOPY) && length) {
 		struct msghdr *msg = from;
 
@@ -1047,6 +1048,16 @@ static int __ip_append_data(struct sock *sk,
 				skb_zcopy_set(skb, uarg, &extra_uref);
 			}
 		}
+	} else if ((flags & MSG_SPLICE_PAGES) && length) {
+		struct msghdr *msg = from;
+
+		if (!iov_iter_is_bvec(&msg->msg_iter))
+			return -EINVAL;
+		if (inet->hdrincl)
+			return -EPERM;
+		if (!(rt->dst.dev->features & NETIF_F_SG))
+			return -EOPNOTSUPP;
+		xlength = transhdrlen; /* We need an empty buffer to attach stuff to */
 	}
 
 	cork->length += length;
@@ -1074,6 +1085,50 @@ static int __ip_append_data(struct sock *sk,
 			unsigned int alloclen, alloc_extra;
 			unsigned int pagedlen;
 			struct sk_buff *skb_prev;
+
+			if (unlikely(flags & MSG_SPLICE_PAGES)) {
+				skb_prev = skb;
+				fraggap = skb_prev->len - maxfraglen;
+
+				alloclen = fragheaderlen + hh_len + fraggap + 15;
+				skb = sock_wmalloc(sk, alloclen, 1, sk->sk_allocation);
+				if (unlikely(!skb)) {
+					err = -ENOBUFS;
+					goto error;
+				}
+
+				/*
+				 *	Fill in the control structures
+				 */
+				skb->ip_summed = CHECKSUM_NONE;
+				skb->csum = 0;
+				skb_reserve(skb, hh_len);
+
+				/*
+				 *	Find where to start putting bytes.
+				 */
+				skb_put(skb, fragheaderlen + fraggap);
+				skb_reset_network_header(skb);
+				skb->transport_header = (skb->network_header +
+							 fragheaderlen);
+				if (fraggap) {
+					skb->csum = skb_copy_and_csum_bits(
+						skb_prev, maxfraglen,
+						skb_transport_header(skb),
+						fraggap);
+					skb_prev->csum = csum_sub(skb_prev->csum,
+								  skb->csum);
+					pskb_trim_unique(skb_prev, maxfraglen);
+				}
+
+				/*
+				 * Put the packet on the pending queue.
+				 */
+				__skb_queue_tail(&sk->sk_write_queue, skb);
+				continue;
+			}
+			xlength = length;
+
 alloc_new_skb:
 			skb_prev = skb;
 			if (skb_prev)
@@ -1085,7 +1140,7 @@ static int __ip_append_data(struct sock *sk,
 			 * If remaining data exceeds the mtu,
 			 * we know we need more fragment(s).
 			 */
-			datalen = length + fraggap;
+			datalen = xlength + fraggap;
 			if (datalen > mtu - fragheaderlen)
 				datalen = maxfraglen - fragheaderlen;
 			fraglen = datalen + fragheaderlen;
@@ -1099,7 +1154,7 @@ static int __ip_append_data(struct sock *sk,
 			 * because we have no idea what fragment will be
 			 * the last.
 			 */
-			if (datalen == length + fraggap)
+			if (datalen == xlength + fraggap)
 				alloc_extra += rt->dst.trailer_len;
 
 			if ((flags & MSG_MORE) &&
@@ -1206,6 +1261,34 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
+		} else if (flags & MSG_SPLICE_PAGES) {
+			struct msghdr *msg = from;
+			struct iov_iter *iter = &msg->msg_iter;
+			const struct bio_vec *bv = iter->bvec;
+
+			if (iov_iter_count(iter) <= 0) {
+				err = -EIO;
+				goto error;
+			}
+
+			copy = iov_iter_single_seg_count(&msg->msg_iter);
+
+			err = skb_append_pagefrags(skb, bv->bv_page,
+						   bv->bv_offset + iter->iov_offset,
+						   copy);
+			if (err < 0)
+				goto error;
+
+			if (skb->ip_summed == CHECKSUM_NONE) {
+				__wsum csum;
+				csum = csum_page(bv->bv_page,
+						 bv->bv_offset + iter->iov_offset, copy);
+				skb->csum = csum_block_add(skb->csum, csum, skb->len);
+			}
+
+			iov_iter_advance(iter, copy);
+			skb_len_add(skb, copy);
+			refcount_add(copy, &sk->sk_wmem_alloc);
 		} else if (!zc) {
 			int i = skb_shinfo(skb)->nr_frags;
 




* [RFC PATCH 11/28] udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
From: David Howells @ 2023-03-16 15:26 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Willem de Bruijn

Convert udp_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than
directly splicing in the pages itself.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/udp.c | 50 +++++++++-----------------------------------------
 1 file changed, 9 insertions(+), 41 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c605d171eb2d..097feb92e215 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1332,52 +1332,20 @@ EXPORT_SYMBOL(udp_sendmsg);
 int udp_sendpage(struct sock *sk, struct page *page, int offset,
 		 size_t size, int flags)
 {
-	struct inet_sock *inet = inet_sk(sk);
-	struct udp_sock *up = udp_sk(sk);
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = flags | MSG_SPLICE_PAGES
+	};
 	int ret;
 
-	if (flags & MSG_SENDPAGE_NOTLAST)
-		flags |= MSG_MORE;
+	bvec_set_page(&bvec, page, size, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
 
-	if (!up->pending) {
-		struct msghdr msg = {	.msg_flags = flags|MSG_MORE };
-
-		/* Call udp_sendmsg to specify destination address which
-		 * sendpage interface can't pass.
-		 * This will succeed only when the socket is connected.
-		 */
-		ret = udp_sendmsg(sk, &msg, 0);
-		if (ret < 0)
-			return ret;
-	}
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		msg.msg_flags |= MSG_MORE;
 
 	lock_sock(sk);
-
-	if (unlikely(!up->pending)) {
-		release_sock(sk);
-
-		net_dbg_ratelimited("cork failed\n");
-		return -EINVAL;
-	}
-
-	ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
-			     page, offset, size, flags);
-	if (ret == -EOPNOTSUPP) {
-		release_sock(sk);
-		return sock_no_sendpage(sk->sk_socket, page, offset,
-					size, flags);
-	}
-	if (ret < 0) {
-		udp_flush_pending_frames(sk);
-		goto out;
-	}
-
-	up->len += size;
-	if (!(READ_ONCE(up->corkflag) || (flags&MSG_MORE)))
-		ret = udp_push_pending_frames(sk);
-	if (!ret)
-		ret = size;
-out:
+	ret = udp_sendmsg(sk, &msg, size);
 	release_sock(sk);
 	return ret;
 }




* [RFC PATCH 12/28] af_unix: Support MSG_SPLICE_PAGES
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (10 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 11/28] udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 13/28] crypto: af_alg: Indent the loop in af_alg_sendmsg() David Howells
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Make AF_UNIX sendmsg() support MSG_SPLICE_PAGES: if the flag is set and the
source iterator is ITER_BVEC, the pages are spliced into the socket buffer;
otherwise the data is copied in as before.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/unix/af_unix.c | 84 +++++++++++++++++++++++++++++++++++++---------
 1 file changed, 68 insertions(+), 16 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 347122c3575e..6f3454db9c53 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2151,6 +2151,44 @@ static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
 }
 #endif
 
+/*
+ * Extract pages from a BVEC-type iterator and add them to the socket buffer.
+ */
+static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb,
+					struct iov_iter *iter, ssize_t maxsize)
+{
+	const struct bio_vec *bv = iter->bvec;
+	unsigned long start = iter->iov_offset;
+	unsigned int i;
+	ssize_t ret = 0;
+
+	for (i = 0; i < iter->nr_segs; i++) {
+		size_t off, len;
+
+		len = bv[i].bv_len;
+		if (start >= len) {
+			start -= len;
+			continue;
+		}
+
+		len = min_t(size_t, maxsize, len - start);
+		off = bv[i].bv_offset + start;
+
+		if (skb_append_pagefrags(skb, bv[i].bv_page, off, len) < 0)
+			break;
+
+		ret += len;
+		maxsize -= len;
+		if (maxsize <= 0)
+			break;
+		start = 0;
+	}
+
+	if (ret > 0)
+		iov_iter_advance(iter, ret);
+	return ret;
+}
+
 static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 			       size_t len)
 {
@@ -2194,19 +2232,25 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 	while (sent < len) {
 		size = len - sent;
 
-		/* Keep two messages in the pipe so it schedules better */
-		size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
+		if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+			skb = sock_alloc_send_pskb(sk, 0, 0,
+						   msg->msg_flags & MSG_DONTWAIT,
+						   &err, 0);
+		} else {
+			/* Keep two messages in the pipe so it schedules better */
+			size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
 
-		/* allow fallback to order-0 allocations */
-		size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
+			/* allow fallback to order-0 allocations */
+			size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
 
-		data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
+			data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
 
-		data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
+			data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
 
-		skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
-					   msg->msg_flags & MSG_DONTWAIT, &err,
-					   get_order(UNIX_SKB_FRAGS_SZ));
+			skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
+						   msg->msg_flags & MSG_DONTWAIT, &err,
+						   get_order(UNIX_SKB_FRAGS_SZ));
+		}
 		if (!skb)
 			goto out_err;
 
@@ -2218,13 +2262,21 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		}
 		fds_sent = true;
 
-		skb_put(skb, size - data_len);
-		skb->data_len = data_len;
-		skb->len = size;
-		err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
-		if (err) {
-			kfree_skb(skb);
-			goto out_err;
+		if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+			size = unix_extract_bvec_to_skb(skb, &msg->msg_iter, size);
+			skb->data_len += size;
+			skb->len += size;
+			skb->truesize += size;
+			refcount_add(size, &sk->sk_wmem_alloc);
+		} else {
+			skb_put(skb, size - data_len);
+			skb->data_len = data_len;
+			skb->len = size;
+			err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
+			if (err) {
+				kfree_skb(skb);
+				goto out_err;
+			}
 		}
 
 		unix_state_lock(other);




* [RFC PATCH 13/28] crypto: af_alg: Indent the loop in af_alg_sendmsg()
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (11 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 12/28] af_unix: Support MSG_SPLICE_PAGES David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 14/28] crypto: af_alg: Support MSG_SPLICE_PAGES David Howells
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Herbert Xu, linux-crypto

Put the loop in af_alg_sendmsg() inside an if-statement and indent it.
This makes the next patch, which adds another branch to that if-statement
to handle MSG_SPLICE_PAGES, easier to review.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: netdev@vger.kernel.org
---
 crypto/af_alg.c | 50 +++++++++++++++++++++++++------------------------
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 5f7252a5b7b4..feb989b32606 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -1060,35 +1060,37 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		if (sgl->cur)
 			sg_unmark_end(sg + sgl->cur - 1);
 
-		do {
-			struct page *pg;
-			unsigned int i = sgl->cur;
+		if (1 /* TODO check MSG_SPLICE_PAGES */) {
+			do {
+				struct page *pg;
+				unsigned int i = sgl->cur;
 
-			plen = min_t(size_t, len, PAGE_SIZE);
+				plen = min_t(size_t, len, PAGE_SIZE);
 
-			pg = alloc_page(GFP_KERNEL);
-			if (!pg) {
-				err = -ENOMEM;
-				goto unlock;
-			}
+				pg = alloc_page(GFP_KERNEL);
+				if (!pg) {
+					err = -ENOMEM;
+					goto unlock;
+				}
 
-			sg_assign_page(sg + i, pg);
+				sg_assign_page(sg + i, pg);
 
-			err = memcpy_from_msg(page_address(sg_page(sg + i)),
-					      msg, plen);
-			if (err) {
-				__free_page(sg_page(sg + i));
-				sg_assign_page(sg + i, NULL);
-				goto unlock;
-			}
+				err = memcpy_from_msg(page_address(sg_page(sg + i)),
+						      msg, plen);
+				if (err) {
+					__free_page(sg_page(sg + i));
+					sg_assign_page(sg + i, NULL);
+					goto unlock;
+				}
 
-			sg[i].length = plen;
-			len -= plen;
-			ctx->used += plen;
-			copied += plen;
-			size -= plen;
-			sgl->cur++;
-		} while (len && sgl->cur < MAX_SGL_ENTS);
+				sg[i].length = plen;
+				len -= plen;
+				ctx->used += plen;
+				copied += plen;
+				size -= plen;
+				sgl->cur++;
+			} while (len && sgl->cur < MAX_SGL_ENTS);
+		}
 
 		if (!size)
 			sg_mark_end(sg + sgl->cur - 1);




* [RFC PATCH 14/28] crypto: af_alg: Support MSG_SPLICE_PAGES
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (12 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 13/28] crypto: af_alg: Indent the loop in af_alg_sendmsg() David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 15/28] crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES David Howells
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Herbert Xu, linux-crypto

Make AF_ALG sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator if possible (the iterator must be
ITER_BVEC and the pages must be spliceable).

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

[!] Note that this makes use of netfs_extract_iter_to_sg() from netfslib.
    This probably needs moving to core code somewhere.
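
For reference, the shape of that helper as assumed by the code below (a
sketch of its contract inferred from the call site, not a definitive
prototype):

	/* Fill scatterlist entries in sgtable with up to len bytes taken
	 * from iter, stopping after sg_max entries; returns the number
	 * of bytes mapped or a negative error.  The pages are referenced
	 * in place rather than copied, which is why the caller below
	 * takes its own page refs afterwards.
	 */
	ssize_t netfs_extract_iter_to_sg(struct iov_iter *iter, size_t len,
					 struct sg_table *sgtable,
					 unsigned int sg_max,
					 iov_iter_extraction_t extraction_flags);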

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: netdev@vger.kernel.org
---
 crypto/Kconfig          |  1 +
 crypto/af_alg.c         | 29 +++++++++++++++++++++++++++--
 crypto/algif_aead.c     | 22 +++++++++++-----------
 crypto/algif_skcipher.c |  8 ++++----
 4 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 9c86f7045157..8c04ecbb4395 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1297,6 +1297,7 @@ menu "Userspace interface"
 
 config CRYPTO_USER_API
 	tristate
+	select NETFS_SUPPORT # for netfs_extract_iter_to_sg()
 
 config CRYPTO_USER_API_HASH
 	tristate "Hash algorithms"
diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index feb989b32606..80ab4f6e018c 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -22,6 +22,7 @@
 #include <linux/sched/signal.h>
 #include <linux/security.h>
 #include <linux/string.h>
+#include <linux/netfs.h>
 #include <keys/user-type.h>
 #include <keys/trusted-type.h>
 #include <keys/encrypted-type.h>
@@ -970,6 +971,10 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	bool init = false;
 	int err = 0;
 
+	if ((msg->msg_flags & MSG_SPLICE_PAGES) &&
+	    !iov_iter_is_bvec(&msg->msg_iter))
+		return -EINVAL;
+
 	if (msg->msg_controllen) {
 		err = af_alg_cmsg_send(msg, &con);
 		if (err)
@@ -1015,7 +1020,7 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	while (size) {
 		struct scatterlist *sg;
 		size_t len = size;
-		size_t plen;
+		ssize_t plen;
 
 		/* use the existing memory in an allocated page */
 		if (ctx->merge) {
@@ -1060,7 +1065,27 @@ int af_alg_sendmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		if (sgl->cur)
 			sg_unmark_end(sg + sgl->cur - 1);
 
-		if (1 /* TODO check MSG_SPLICE_PAGES */) {
+		if (msg->msg_flags & MSG_SPLICE_PAGES) {
+			struct sg_table sgtable = {
+				.sgl		= sg,
+				.nents		= sgl->cur,
+				.orig_nents	= sgl->cur,
+			};
+
+			plen = netfs_extract_iter_to_sg(&msg->msg_iter, len,
+							&sgtable, MAX_SGL_ENTS, 0);
+			if (plen < 0) {
+				err = plen;
+				goto unlock;
+			}
+
+			for (; sgl->cur < sgtable.nents; sgl->cur++)
+				get_page(sg_page(&sg[sgl->cur]));
+			len -= plen;
+			ctx->used += plen;
+			copied += plen;
+			size -= plen;
+		} else {
 			do {
 				struct page *pg;
 				unsigned int i = sgl->cur;
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index 42493b4d8ce4..279eb17a1dfc 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -9,8 +9,8 @@
  * The following concept of the memory management is used:
  *
  * The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
- * filled by user space with the data submitted via sendpage/sendmsg. Filling
- * up the TX SGL does not cause a crypto operation -- the data will only be
+ * filled by user space with the data submitted via sendmsg. Filling up
+ * the TX SGL does not cause a crypto operation -- the data will only be
  * tracked by the kernel. Upon receipt of one recvmsg call, the caller must
  * provide a buffer which is tracked with the RX SGL.
  *
@@ -113,19 +113,19 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
 	}
 
 	/*
-	 * Data length provided by caller via sendmsg/sendpage that has not
-	 * yet been processed.
+	 * Data length provided by caller via sendmsg that has not yet been
+	 * processed.
 	 */
 	used = ctx->used;
 
 	/*
-	 * Make sure sufficient data is present -- note, the same check is
-	 * also present in sendmsg/sendpage. The checks in sendpage/sendmsg
-	 * shall provide an information to the data sender that something is
-	 * wrong, but they are irrelevant to maintain the kernel integrity.
-	 * We need this check here too in case user space decides to not honor
-	 * the error message in sendmsg/sendpage and still call recvmsg. This
-	 * check here protects the kernel integrity.
+	 * Make sure sufficient data is present -- note, the same check is also
+	 * present in sendmsg. The checks in sendmsg shall give the data
+	 * sender an indication that something is wrong, but they are
+	 * irrelevant to maintaining kernel integrity.  We need this check
+	 * here too in case user space decides not to honor the error
+	 * message from sendmsg and still calls recvmsg. This check here
+	 * protects the kernel integrity.
 	 */
 	if (!aead_sufficient_data(sk))
 		return -EINVAL;
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index ee8890ee8f33..021f9ce7e87c 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -9,10 +9,10 @@
  * The following concept of the memory management is used:
  *
  * The kernel maintains two SGLs, the TX SGL and the RX SGL. The TX SGL is
- * filled by user space with the data submitted via sendpage/sendmsg. Filling
- * up the TX SGL does not cause a crypto operation -- the data will only be
- * tracked by the kernel. Upon receipt of one recvmsg call, the caller must
- * provide a buffer which is tracked with the RX SGL.
+ * filled by user space with the data submitted via sendmsg. Filling up the TX
+ * SGL does not cause a crypto operation -- the data will only be tracked by
+ * the kernel. Upon receipt of one recvmsg call, the caller must provide a
+ * buffer which is tracked with the RX SGL.
  *
  * During the processing of the recvmsg operation, the cipher request is
  * allocated and prepared. As part of the recvmsg operation, the processed




* [RFC PATCH 15/28] crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (13 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 14/28] crypto: af_alg: Support MSG_SPLICE_PAGES David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 16/28] splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage() David Howells
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Herbert Xu, linux-crypto

Convert af_alg_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

[!] Note that this makes use of netfs_extract_iter_to_sg() from netfslib.
    This probably needs moving to core code somewhere.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: netdev@vger.kernel.org
---
 crypto/af_alg.c | 53 +++++++++----------------------------------------
 1 file changed, 9 insertions(+), 44 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 80ab4f6e018c..0e77fce60876 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -1148,53 +1148,18 @@ EXPORT_SYMBOL_GPL(af_alg_sendmsg);
 ssize_t af_alg_sendpage(struct socket *sock, struct page *page,
 			int offset, size_t size, int flags)
 {
-	struct sock *sk = sock->sk;
-	struct alg_sock *ask = alg_sk(sk);
-	struct af_alg_ctx *ctx = ask->private;
-	struct af_alg_tsgl *sgl;
-	int err = -EINVAL;
-
-	if (flags & MSG_SENDPAGE_NOTLAST)
-		flags |= MSG_MORE;
-
-	lock_sock(sk);
-	if (!ctx->more && ctx->used)
-		goto unlock;
-
-	if (!size)
-		goto done;
-
-	if (!af_alg_writable(sk)) {
-		err = af_alg_wait_for_wmem(sk, flags);
-		if (err)
-			goto unlock;
-	}
-
-	err = af_alg_alloc_tsgl(sk);
-	if (err)
-		goto unlock;
-
-	ctx->merge = 0;
-	sgl = list_entry(ctx->tsgl_list.prev, struct af_alg_tsgl, list);
-
-	if (sgl->cur)
-		sg_unmark_end(sgl->sg + sgl->cur - 1);
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = flags | MSG_SPLICE_PAGES,
+	};
 
-	sg_mark_end(sgl->sg + sgl->cur);
+	bvec_set_page(&bvec, page, size, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
 
-	get_page(page);
-	sg_set_page(sgl->sg + sgl->cur, page, size, offset);
-	sgl->cur++;
-	ctx->used += size;
-
-done:
-	ctx->more = flags & MSG_MORE;
-
-unlock:
-	af_alg_data_wakeup(sk);
-	release_sock(sk);
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		msg.msg_flags |= MSG_MORE;
 
-	return err ?: size;
+	return sock_sendmsg(sock, &msg);
 }
 EXPORT_SYMBOL_GPL(af_alg_sendpage);
 




* [RFC PATCH 16/28] splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (14 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 15/28] crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 17/28] Remove file->f_op->sendpage David Howells
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage() to splice data from
a pipe to a socket.  This paves the way for passing multiple pages in at
once from a pipe and for handling multipage folios.
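
As a userspace-visible example of the path being changed, a plain
pipe-to-socket splice() loop like the sketch below now lands in
splice_to_socket() and thence sock_sendmsg() with MSG_SPLICE_PAGES set,
rather than in ->sendpage().  Illustrative only, error handling elided:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	/* Move up to len bytes from the read end of a pipe to a socket. */
	static ssize_t pipe_to_sock(int pipefd, int sockfd, size_t len)
	{
		ssize_t n, total = 0;

		while (len > 0) {
			n = splice(pipefd, NULL, sockfd, NULL, len,
				   SPLICE_F_MORE | SPLICE_F_MOVE);
			if (n <= 0)
				return total ? total : n;
			total += n;
			len -= n;
		}
		return total;
	}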

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 fs/splice.c            | 42 +++++++++++++++++++++++-------------------
 include/linux/fs.h     |  2 --
 include/linux/splice.h |  2 ++
 net/socket.c           | 26 ++------------------------
 4 files changed, 27 insertions(+), 45 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index f46dd1fb367b..23ead122d631 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -32,6 +32,7 @@
 #include <linux/uio.h>
 #include <linux/security.h>
 #include <linux/gfp.h>
+#include <linux/net.h>
 #include <linux/socket.h>
 #include <linux/sched/signal.h>
 
@@ -410,29 +411,32 @@ const struct pipe_buf_operations nosteal_pipe_buf_ops = {
 };
 EXPORT_SYMBOL(nosteal_pipe_buf_ops);
 
+#ifdef CONFIG_NET
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
  * using sendpage(). Return the number of bytes sent.
  */
-static int pipe_to_sendpage(struct pipe_inode_info *pipe,
-			    struct pipe_buffer *buf, struct splice_desc *sd)
+static int pipe_to_sendmsg(struct pipe_inode_info *pipe,
+			   struct pipe_buffer *buf, struct splice_desc *sd)
 {
-	struct file *file = sd->u.file;
-	loff_t pos = sd->pos;
-	int more;
-
-	if (!likely(file->f_op->sendpage))
-		return -EINVAL;
+	struct socket *sock = sock_from_file(sd->u.file);
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = MSG_SPLICE_PAGES,
+	};
 
-	more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
+	if (sd->flags & SPLICE_F_MORE)
+		msg.msg_flags |= MSG_MORE;
 
 	if (sd->len < sd->total_len &&
 	    pipe_occupancy(pipe->head, pipe->tail) > 1)
-		more |= MSG_SENDPAGE_NOTLAST;
+		msg.msg_flags |= MSG_MORE;
 
-	return file->f_op->sendpage(file, buf->page, buf->offset,
-				    sd->len, &pos, more);
+	bvec_set_page(&bvec, buf->page, sd->len, buf->offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, sd->len);
+	return sock_sendmsg(sock, &msg);
 }
+#endif
 
 static void wakeup_pipe_writers(struct pipe_inode_info *pipe)
 {
@@ -614,7 +618,7 @@ static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_des
  * Description:
  *    This function does little more than loop over the pipe and call
  *    @actor to do the actual moving of a single struct pipe_buffer to
- *    the desired destination. See pipe_to_file, pipe_to_sendpage, or
+ *    the desired destination. See pipe_to_file, pipe_to_sendmsg, or
  *    pipe_to_user.
  *
  */
@@ -795,8 +799,9 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 
 EXPORT_SYMBOL(iter_file_splice_write);
 
+#ifdef CONFIG_NET
 /**
- * generic_splice_sendpage - splice data from a pipe to a socket
+ * splice_to_socket - splice data from a pipe to a socket
  * @pipe:	pipe to splice from
  * @out:	socket to write to
  * @ppos:	position in @out
@@ -808,13 +813,12 @@ EXPORT_SYMBOL(iter_file_splice_write);
  *    is involved.
  *
  */
-ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
-				loff_t *ppos, size_t len, unsigned int flags)
+ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
+			 loff_t *ppos, size_t len, unsigned int flags)
 {
-	return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_sendpage);
+	return splice_from_pipe(pipe, out, ppos, len, flags, pipe_to_sendmsg);
 }
-
-EXPORT_SYMBOL(generic_splice_sendpage);
+#endif
 
 static int warn_unsupported(struct file *file, const char *op)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..f3ccc243851e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2740,8 +2740,6 @@ extern ssize_t generic_file_splice_read(struct file *, loff_t *,
 		struct pipe_inode_info *, size_t, unsigned int);
 extern ssize_t iter_file_splice_write(struct pipe_inode_info *,
 		struct file *, loff_t *, size_t, unsigned int);
-extern ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe,
-		struct file *out, loff_t *, size_t len, unsigned int flags);
 extern long do_splice_direct(struct file *in, loff_t *ppos, struct file *out,
 		loff_t *opos, size_t len, unsigned int flags);
 
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 8f052c3dae95..e6153feda86c 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -87,6 +87,8 @@ extern long do_splice(struct file *in, loff_t *off_in,
 
 extern long do_tee(struct file *in, struct file *out, size_t len,
 		   unsigned int flags);
+extern ssize_t splice_to_socket(struct pipe_inode_info *pipe, struct file *out,
+				loff_t *ppos, size_t len, unsigned int flags);
 
 /*
  * for dynamic pipe sizing
diff --git a/net/socket.c b/net/socket.c
index 6bae8ce7059e..1b48a976b8cc 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -57,6 +57,7 @@
 #include <linux/mm.h>
 #include <linux/socket.h>
 #include <linux/file.h>
+#include <linux/splice.h>
 #include <linux/net.h>
 #include <linux/interrupt.h>
 #include <linux/thread_info.h>
@@ -126,8 +127,6 @@ static long compat_sock_ioctl(struct file *file,
 			      unsigned int cmd, unsigned long arg);
 #endif
 static int sock_fasync(int fd, struct file *filp, int on);
-static ssize_t sock_sendpage(struct file *file, struct page *page,
-			     int offset, size_t size, loff_t *ppos, int more);
 static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
 				struct pipe_inode_info *pipe, size_t len,
 				unsigned int flags);
@@ -162,8 +161,7 @@ static const struct file_operations socket_file_ops = {
 	.mmap =		sock_mmap,
 	.release =	sock_close,
 	.fasync =	sock_fasync,
-	.sendpage =	sock_sendpage,
-	.splice_write = generic_splice_sendpage,
+	.splice_write = splice_to_socket,
 	.splice_read =	sock_splice_read,
 	.show_fdinfo =	sock_show_fdinfo,
 };
@@ -1062,26 +1060,6 @@ int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
 }
 EXPORT_SYMBOL(kernel_recvmsg);
 
-static ssize_t sock_sendpage(struct file *file, struct page *page,
-			     int offset, size_t size, loff_t *ppos, int more)
-{
-	struct socket *sock;
-	int flags;
-	int ret;
-
-	sock = file->private_data;
-
-	flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
-	/* more is a combination of MSG_MORE and MSG_SENDPAGE_NOTLAST */
-	flags |= more;
-
-	ret = kernel_sendpage(sock, page, offset, size, flags);
-
-	if (trace_sock_send_length_enabled())
-		call_trace_sock_send_length(sock->sk, ret, 0);
-	return ret;
-}
-
 static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
 				struct pipe_inode_info *pipe, size_t len,
 				unsigned int flags)




* [RFC PATCH 17/28] Remove file->f_op->sendpage
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (15 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 16/28] splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage() David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 18/28] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit David Howells
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Remove file->f_op->sendpage as splicing to a socket now calls sendmsg
rather than sendpage.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/linux/fs.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f3ccc243851e..a9f1b2543d2c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1773,7 +1773,6 @@ struct file_operations {
 	int (*fsync) (struct file *, loff_t, loff_t, int datasync);
 	int (*fasync) (int, struct file *, int);
 	int (*lock) (struct file *, int, struct file_lock *);
-	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
 	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 	int (*check_flags)(int);
 	int (*flock) (struct file *, int, struct file_lock *);




* [RFC PATCH 18/28] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (16 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 17/28] Remove file->f_op->sendpage David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 19/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage David Howells
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Bernard Metzler,
	Tom Talpey, linux-rdma

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced, rather than
performing several sendmsg and sendpage calls to transmit the header, the
data pages and the trailer.

To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator.  The header and trailer (if present) are copied into
memory acquired from zcopy_alloc(), which just breaks a page up into small
pieces that can be freed with put_page().
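
For clarity, the contract of zcopy_memdup() as relied on here, with a
minimal sketch of one possible implementation (the real allocator carves
pieces off shared pages; this sketch assumes len <= PAGE_SIZE):

	/* Copy len bytes from p into freshly allocated page memory and
	 * point *bvec at the copy.  The memory is owned by the page ref
	 * held in the bvec, so it is released with put_page() (or
	 * folio_put()), never kfree().
	 */
	int zcopy_memdup(size_t len, const void *p, struct bio_vec *bvec,
			 gfp_t gfp)
	{
		struct page *page = alloc_page(gfp);

		if (!page)
			return -ENOMEM;
		memcpy(page_address(page), p, len);
		bvec_set_page(bvec, page, len, 0);
		return 0;
	}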

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Tom Talpey <tom@talpey.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-rdma@vger.kernel.org
cc: netdev@vger.kernel.org
---
 drivers/infiniband/sw/siw/siw_qp_tx.c | 231 +++++---------------------
 1 file changed, 46 insertions(+), 185 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 8fc179321e2b..ec4f0ac324ce 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -8,6 +8,7 @@
 #include <linux/net.h>
 #include <linux/scatterlist.h>
 #include <linux/highmem.h>
+#include <linux/zcopy_alloc.h>
 #include <net/tcp.h>
 
 #include <rdma/iw_cm.h>
@@ -312,114 +313,8 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 	return rv;
 }
 
-/*
- * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
- *
- * Using sendpage to push page by page appears to be less efficient
- * than using sendmsg, even if data are copied.
- *
- * A general performance limitation might be the extra four bytes
- * trailer checksum segment to be pushed after user data.
- */
-static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
-			     size_t size)
-{
-	struct bio_vec bvec;
-	struct msghdr msg = {
-		.msg_flags = (MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT |
-			      MSG_SENDPAGE_NOTLAST),
-	};
-	struct sock *sk = s->sk;
-	int i = 0, rv = 0, sent = 0;
-
-	while (size) {
-		size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
-
-		if (size + offset <= PAGE_SIZE)
-			msg.msg_flags = MSG_SPLICE_PAGES | MSG_MORE | MSG_DONTWAIT;
-
-		tcp_rate_check_app_limited(sk);
-		bvec_set_page(&bvec, page[i], bytes, offset);
-		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
-
-try_page_again:
-		lock_sock(sk);
-		rv = tcp_sendmsg_locked(sk, &msg, size);
-		release_sock(sk);
-
-		if (rv > 0) {
-			size -= rv;
-			sent += rv;
-			if (rv != bytes) {
-				offset += rv;
-				bytes -= rv;
-				goto try_page_again;
-			}
-			offset = 0;
-		} else {
-			if (rv == -EAGAIN || rv == 0)
-				break;
-			return rv;
-		}
-		i++;
-	}
-	return sent;
-}
-
-/*
- * siw_0copy_tx()
- *
- * Pushes list of pages to TCP socket. If pages from multiple
- * SGE's, all referenced pages of each SGE are pushed in one
- * shot.
- */
-static int siw_0copy_tx(struct socket *s, struct page **page,
-			struct siw_sge *sge, unsigned int offset,
-			unsigned int size)
-{
-	int i = 0, sent = 0, rv;
-	int sge_bytes = min(sge->length - offset, size);
-
-	offset = (sge->laddr + offset) & ~PAGE_MASK;
-
-	while (sent != size) {
-		rv = siw_tcp_sendpages(s, &page[i], offset, sge_bytes);
-		if (rv >= 0) {
-			sent += rv;
-			if (size == sent || sge_bytes > rv)
-				break;
-
-			i += PAGE_ALIGN(sge_bytes + offset) >> PAGE_SHIFT;
-			sge++;
-			sge_bytes = min(sge->length, size - sent);
-			offset = sge->laddr & ~PAGE_MASK;
-		} else {
-			sent = rv;
-			break;
-		}
-	}
-	return sent;
-}
-
 #define MAX_TRAILER (MPA_CRC_SIZE + 4)
 
-static void siw_unmap_pages(struct kvec *iov, unsigned long kmap_mask, int len)
-{
-	int i;
-
-	/*
-	 * Work backwards through the array to honor the kmap_local_page()
-	 * ordering requirements.
-	 */
-	for (i = (len-1); i >= 0; i--) {
-		if (kmap_mask & BIT(i)) {
-			unsigned long addr = (unsigned long)iov[i].iov_base;
-
-			kunmap_local((void *)(addr & PAGE_MASK));
-		}
-	}
-}
-
 /*
  * siw_tx_hdt() tries to push a complete packet to TCP where all
  * packet fragments are referenced by the elements of one iovec.
@@ -439,15 +334,13 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 {
 	struct siw_wqe *wqe = &c_tx->wqe_active;
 	struct siw_sge *sge = &wqe->sqe.sge[c_tx->sge_idx];
-	struct kvec iov[MAX_ARRAY];
-	struct page *page_array[MAX_ARRAY];
+	struct bio_vec bvec[MAX_ARRAY];
 	struct msghdr msg = { .msg_flags = MSG_DONTWAIT | MSG_EOR };
 
 	int seg = 0, do_crc = c_tx->do_crc, is_kva = 0, rv;
 	unsigned int data_len = c_tx->bytes_unsent, hdr_len = 0, trl_len = 0,
 		     sge_off = c_tx->sge_off, sge_idx = c_tx->sge_idx,
 		     pbl_idx = c_tx->pbl_idx;
-	unsigned long kmap_mask = 0L;
 
 	if (c_tx->state == SIW_SEND_HDR) {
 		if (c_tx->use_sendpage) {
@@ -457,10 +350,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 
 			c_tx->state = SIW_SEND_DATA;
 		} else {
-			iov[0].iov_base =
-				(char *)&c_tx->pkt.ctrl + c_tx->ctrl_sent;
-			iov[0].iov_len = hdr_len =
-				c_tx->ctrl_len - c_tx->ctrl_sent;
+			const void *hdr = &c_tx->pkt.ctrl + c_tx->ctrl_sent;
+
+			hdr_len = c_tx->ctrl_len - c_tx->ctrl_sent;
+			rv = zcopy_memdup(hdr_len, hdr, &bvec[0], GFP_NOFS);
+			if (rv < 0)
+				goto done;
 			seg = 1;
 		}
 	}
@@ -478,28 +373,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 		} else {
 			is_kva = 1;
 		}
-		if (is_kva && !c_tx->use_sendpage) {
-			/*
-			 * tx from kernel virtual address: either inline data
-			 * or memory region with assigned kernel buffer
-			 */
-			iov[seg].iov_base =
-				(void *)(uintptr_t)(sge->laddr + sge_off);
-			iov[seg].iov_len = sge_len;
-
-			if (do_crc)
-				crypto_shash_update(c_tx->mpa_crc_hd,
-						    iov[seg].iov_base,
-						    sge_len);
-			sge_off += sge_len;
-			data_len -= sge_len;
-			seg++;
-			goto sge_done;
-		}
 
 		while (sge_len) {
 			size_t plen = min((int)PAGE_SIZE - fp_off, sge_len);
-			void *kaddr;
 
 			if (!is_kva) {
 				struct page *p;
@@ -512,33 +388,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 					p = siw_get_upage(mem->umem,
 							  sge->laddr + sge_off);
 				if (unlikely(!p)) {
-					siw_unmap_pages(iov, kmap_mask, seg);
 					wqe->processed -= c_tx->bytes_unsent;
 					rv = -EFAULT;
 					goto done_crc;
 				}
-				page_array[seg] = p;
-
-				if (!c_tx->use_sendpage) {
-					void *kaddr = kmap_local_page(p);
-
-					/* Remember for later kunmap() */
-					kmap_mask |= BIT(seg);
-					iov[seg].iov_base = kaddr + fp_off;
-					iov[seg].iov_len = plen;
-
-					if (do_crc)
-						crypto_shash_update(
-							c_tx->mpa_crc_hd,
-							iov[seg].iov_base,
-							plen);
-				} else if (do_crc) {
-					kaddr = kmap_local_page(p);
-					crypto_shash_update(c_tx->mpa_crc_hd,
-							    kaddr + fp_off,
-							    plen);
-					kunmap_local(kaddr);
-				}
+
+				bvec_set_page(&bvec[seg], p, plen, fp_off);
 			} else {
 				/*
 				 * Cast to an uintptr_t to preserve all 64 bits
@@ -552,12 +407,15 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 				 * bits on a 64 bit platform and 32 bits on a
 				 * 32 bit platform.
 				 */
-				page_array[seg] = virt_to_page((void *)(va & PAGE_MASK));
-				if (do_crc)
-					crypto_shash_update(
-						c_tx->mpa_crc_hd,
-						(void *)va,
-						plen);
+				bvec_set_virt(&bvec[seg], (void *)va, plen);
+			}
+
+			if (do_crc) {
+				void *kaddr = kmap_local_page(bvec[seg].bv_page);
+				crypto_shash_update(c_tx->mpa_crc_hd,
+						    kaddr + bvec[seg].bv_offset,
+						    bvec[seg].bv_len);
+				kunmap_local(kaddr);
 			}
 
 			sge_len -= plen;
@@ -567,13 +425,12 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 
 			if (++seg > (int)MAX_ARRAY) {
 				siw_dbg_qp(tx_qp(c_tx), "to many fragments\n");
-				siw_unmap_pages(iov, kmap_mask, seg-1);
 				wqe->processed -= c_tx->bytes_unsent;
 				rv = -EMSGSIZE;
 				goto done_crc;
 			}
 		}
-sge_done:
+
 		/* Update SGE variables at end of SGE */
 		if (sge_off == sge->length &&
 		    (data_len != 0 || wqe->processed < wqe->bytes)) {
@@ -582,15 +439,8 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 			sge_off = 0;
 		}
 	}
-	/* trailer */
-	if (likely(c_tx->state != SIW_SEND_TRAILER)) {
-		iov[seg].iov_base = &c_tx->trailer.pad[4 - c_tx->pad];
-		iov[seg].iov_len = trl_len = MAX_TRAILER - (4 - c_tx->pad);
-	} else {
-		iov[seg].iov_base = &c_tx->trailer.pad[c_tx->ctrl_sent];
-		iov[seg].iov_len = trl_len = MAX_TRAILER - c_tx->ctrl_sent;
-	}
 
+	/* Set the CRC in the trailer */
 	if (c_tx->pad) {
 		*(u32 *)c_tx->trailer.pad = 0;
 		if (do_crc)
@@ -603,23 +453,31 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 	else if (do_crc)
 		crypto_shash_final(c_tx->mpa_crc_hd, (u8 *)&c_tx->trailer.crc);
 
-	data_len = c_tx->bytes_unsent;
+	/* Copy the trailer and add it to the output list */
+	if (likely(c_tx->state != SIW_SEND_TRAILER)) {
+		void *trl = &c_tx->trailer.pad[4 - c_tx->pad];
 
-	if (c_tx->use_sendpage) {
-		rv = siw_0copy_tx(s, page_array, &wqe->sqe.sge[c_tx->sge_idx],
-				  c_tx->sge_off, data_len);
-		if (rv == data_len) {
-			rv = kernel_sendmsg(s, &msg, &iov[seg], 1, trl_len);
-			if (rv > 0)
-				rv += data_len;
-			else
-				rv = data_len;
-		}
+		trl_len = MAX_TRAILER - (4 - c_tx->pad);
+		rv = zcopy_memdup(trl_len, trl, &bvec[seg], GFP_NOFS);
+		if (rv < 0)
+			goto done_crc;
 	} else {
-		rv = kernel_sendmsg(s, &msg, iov, seg + 1,
-				    hdr_len + data_len + trl_len);
-		siw_unmap_pages(iov, kmap_mask, seg);
+		void *trl = &c_tx->trailer.pad[c_tx->ctrl_sent];
+
+		trl_len = MAX_TRAILER - c_tx->ctrl_sent;
+		rv = zcopy_memdup(trl_len, trl, &bvec[seg], GFP_NOFS);
+		if (rv < 0)
+			goto done_crc;
 	}
+
+	data_len = c_tx->bytes_unsent;
+
+	if (c_tx->use_sendpage)
+		msg.msg_flags |= MSG_SPLICE_PAGES;
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, seg + 1,
+		      hdr_len + data_len + trl_len);
+	rv = sock_sendmsg(s, &msg);
+
 	if (rv < (int)hdr_len) {
 		/* Not even complete hdr pushed or negative rv */
 		wqe->processed -= data_len;
@@ -680,6 +538,9 @@ static int siw_tx_hdt(struct siw_iwarp_tx *c_tx, struct socket *s)
 	}
 done_crc:
 	c_tx->do_crc = 0;
+	if (c_tx->state == SIW_SEND_HDR)
+		folio_put(page_folio(bvec[0].bv_page));
+	folio_put(page_folio(bvec[seg].bv_page));
 done:
 	return rv;
 }




* [RFC PATCH 19/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (17 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 18/28] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 20/28] iscsi: " David Howells
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Ilya Dryomov, Xiubo Li,
	ceph-devel

Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data.  For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/; but if the iteration code
could hand write_partial_message_data() a whole bvec[] at once, pages
could be sent in batches.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netdev@vger.kernel.org
---
 net/ceph/messenger_v1.c | 58 ++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 39 deletions(-)

diff --git a/net/ceph/messenger_v1.c b/net/ceph/messenger_v1.c
index d664cb1593a7..b2d801a49122 100644
--- a/net/ceph/messenger_v1.c
+++ b/net/ceph/messenger_v1.c
@@ -74,37 +74,6 @@ static int ceph_tcp_sendmsg(struct socket *sock, struct kvec *iov,
 	return r;
 }
 
-/*
- * @more: either or both of MSG_MORE and MSG_SENDPAGE_NOTLAST
- */
-static int ceph_tcp_sendpage(struct socket *sock, struct page *page,
-			     int offset, size_t size, int more)
-{
-	ssize_t (*sendpage)(struct socket *sock, struct page *page,
-			    int offset, size_t size, int flags);
-	int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more;
-	int ret;
-
-	/*
-	 * sendpage cannot properly handle pages with page_count == 0,
-	 * we need to fall back to sendmsg if that's the case.
-	 *
-	 * Same goes for slab pages: skb_can_coalesce() allows
-	 * coalescing neighboring slab objects into a single frag which
-	 * triggers one of hardened usercopy checks.
-	 */
-	if (sendpage_ok(page))
-		sendpage = sock->ops->sendpage;
-	else
-		sendpage = sock_no_sendpage;
-
-	ret = sendpage(sock, page, offset, size, flags);
-	if (ret == -EAGAIN)
-		ret = 0;
-
-	return ret;
-}
-
 static void con_out_kvec_reset(struct ceph_connection *con)
 {
 	BUG_ON(con->v1.out_skip);
@@ -464,7 +433,6 @@ static int write_partial_message_data(struct ceph_connection *con)
 	struct ceph_msg *msg = con->out_msg;
 	struct ceph_msg_data_cursor *cursor = &msg->cursor;
 	bool do_datacrc = !ceph_test_opt(from_msgr(con->msgr), NOCRC);
-	int more = MSG_MORE | MSG_SENDPAGE_NOTLAST;
 	u32 crc;
 
 	dout("%s %p msg %p\n", __func__, con, msg);
@@ -482,6 +450,10 @@ static int write_partial_message_data(struct ceph_connection *con)
 	 */
 	crc = do_datacrc ? le32_to_cpu(msg->footer.data_crc) : 0;
 	while (cursor->total_resid) {
+		struct bio_vec bvec;
+		struct msghdr msghdr = {
+			.msg_flags = MSG_SPLICE_PAGES | MSG_SENDPAGE_NOTLAST,
+		};
 		struct page *page;
 		size_t page_offset;
 		size_t length;
@@ -494,9 +466,12 @@ static int write_partial_message_data(struct ceph_connection *con)
 
 		page = ceph_msg_data_next(cursor, &page_offset, &length);
 		if (length == cursor->total_resid)
-			more = MSG_MORE;
-		ret = ceph_tcp_sendpage(con->sock, page, page_offset, length,
-					more);
+			msghdr.msg_flags |= MSG_MORE;
+
+		bvec_set_page(&bvec, page, length, page_offset);
+		iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, length);
+
+		ret = sock_sendmsg(con->sock, &msghdr);
 		if (ret <= 0) {
 			if (do_datacrc)
 				msg->footer.data_crc = cpu_to_le32(crc);
@@ -526,7 +501,10 @@ static int write_partial_message_data(struct ceph_connection *con)
  */
 static int write_partial_skip(struct ceph_connection *con)
 {
-	int more = MSG_MORE | MSG_SENDPAGE_NOTLAST;
+	struct bio_vec bvec;
+	struct msghdr msghdr = {
+		.msg_flags = MSG_SPLICE_PAGES | MSG_SENDPAGE_NOTLAST | MSG_MORE,
+	};
 	int ret;
 
 	dout("%s %p %d left\n", __func__, con, con->v1.out_skip);
@@ -534,9 +512,11 @@ static int write_partial_skip(struct ceph_connection *con)
 		size_t size = min(con->v1.out_skip, (int)PAGE_SIZE);
 
 		if (size == con->v1.out_skip)
-			more = MSG_MORE;
-		ret = ceph_tcp_sendpage(con->sock, ceph_zero_page, 0, size,
-					more);
+			msghdr.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
+		bvec_set_page(&bvec, ZERO_PAGE(0), size, 0);
+		iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
+		ret = sock_sendmsg(con->sock, &msghdr);
 		if (ret <= 0)
 			goto out;
 		con->v1.out_skip -= ret;




* [RFC PATCH 20/28] iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (18 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 19/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 21/28] tcp_bpf: Make tcp_bpf_sendpage() go through tcp_bpf_sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Martin K. Petersen,
	linux-scsi, target-devel

Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage.  This allows
multiple pages and multipage folios to be passed through.

TODO: iscsit_fe_sendpage_sg() should perhaps set up a bio_vec array
covering the entire set of pages it is going to transfer, plus two more
for the header and the trailer allocated with zcopy_alloc(), and then
call sendmsg once for the entire message.
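
A rough sketch of what that might look like; the names sg_count, trailer,
trl_len and MAX_PDU_FRAGS are invented for illustration and error handling
is elided:

	struct bio_vec bv[MAX_PDU_FRAGS + 2];
	struct msghdr msghdr = { .msg_flags = MSG_SPLICE_PAGES, };
	struct scatterlist *sg;
	unsigned int nbv = 0, i;
	size_t total;

	/* One bvec for the header, one per data page, one for the
	 * trailer, then a single sendmsg for the whole PDU.
	 */
	zcopy_memdup(tx_hdr_size, cmd->pdu, &bv[nbv++], GFP_KERNEL);
	total = tx_hdr_size;

	for_each_sg(cmd->first_data_sg, sg, sg_count, i) {
		bvec_set_page(&bv[nbv++], sg_page(sg), sg->length,
			      sg->offset);
		total += sg->length;
	}

	zcopy_memdup(trl_len, trailer, &bv[nbv++], GFP_KERNEL);
	total += trl_len;

	iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, bv, nbv, total);
	tx_sent = sock_sendmsg(conn->sock, &msghdr);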

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "Martin K. Petersen" <martin.petersen@oracle.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-scsi@vger.kernel.org
cc: target-devel@vger.kernel.org
cc: netdev@vger.kernel.org
---
 drivers/target/iscsi/iscsi_target_util.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 26dc8ed3045b..c7d58e41ac3b 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -1078,6 +1078,8 @@ int iscsit_fe_sendpage_sg(
 	struct iscsit_conn *conn)
 {
 	struct scatterlist *sg = cmd->first_data_sg;
+	struct bio_vec bvec;
+	struct msghdr msghdr = { .msg_flags = MSG_SPLICE_PAGES, };
 	struct kvec iov;
 	u32 tx_hdr_size, data_len;
 	u32 offset = cmd->first_data_sg_off;
@@ -1121,17 +1123,17 @@ int iscsit_fe_sendpage_sg(
 		u32 space = (sg->length - offset);
 		u32 sub_len = min_t(u32, data_len, space);
 send_pg:
-		tx_sent = conn->sock->ops->sendpage(conn->sock,
-					sg_page(sg), sg->offset + offset, sub_len, 0);
+		bvec_set_page(&bvec, sg_page(sg), sub_len, sg->offset + offset);
+		iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, sub_len);
+
+		tx_sent = conn->sock->ops->sendmsg(conn->sock, &msghdr, sub_len);
 		if (tx_sent != sub_len) {
 			if (tx_sent == -EAGAIN) {
-				pr_err("tcp_sendpage() returned"
-						" -EAGAIN\n");
+				pr_err("sendmsg/splice returned -EAGAIN\n");
 				goto send_pg;
 			}
 
-			pr_err("tcp_sendpage() failure: %d\n",
-					tx_sent);
+			pr_err("sendmsg/splice failure: %d\n", tx_sent);
 			return -1;
 		}
 




* [RFC PATCH 21/28] tcp_bpf: Make tcp_bpf_sendpage() go through tcp_bpf_sendmsg(MSG_SPLICE_PAGES)
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (19 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 20/28] iscsi: " David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 22/28] net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock() David Howells
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, John Fastabend,
	Jakub Sitnicki, bpf

Translate tcp_bpf_sendpage() calls to tcp_bpf_sendmsg(MSG_SPLICE_PAGES).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Sitnicki <jakub@cloudflare.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: bpf@vger.kernel.org
cc: netdev@vger.kernel.org
---
 net/ipv4/tcp_bpf.c | 49 +++++++++-------------------------------------
 1 file changed, 9 insertions(+), 40 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 7f17134637eb..de37a4372437 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -485,49 +485,18 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 static int tcp_bpf_sendpage(struct sock *sk, struct page *page, int offset,
 			    size_t size, int flags)
 {
-	struct sk_msg tmp, *msg = NULL;
-	int err = 0, copied = 0;
-	struct sk_psock *psock;
-	bool enospc = false;
-
-	psock = sk_psock_get(sk);
-	if (unlikely(!psock))
-		return tcp_sendpage(sk, page, offset, size, flags);
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = flags | MSG_SPLICE_PAGES,
+	};
 
-	lock_sock(sk);
-	if (psock->cork) {
-		msg = psock->cork;
-	} else {
-		msg = &tmp;
-		sk_msg_init(msg);
-	}
+	bvec_set_page(&bvec, page, size, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
 
-	/* Catch case where ring is full and sendpage is stalled. */
-	if (unlikely(sk_msg_full(msg)))
-		goto out_err;
-
-	sk_msg_page_add(msg, page, size, offset);
-	sk_mem_charge(sk, size);
-	copied = size;
-	if (sk_msg_full(msg))
-		enospc = true;
-	if (psock->cork_bytes) {
-		if (size > psock->cork_bytes)
-			psock->cork_bytes = 0;
-		else
-			psock->cork_bytes -= size;
-		if (psock->cork_bytes && !enospc)
-			goto out_err;
-		/* All cork bytes are accounted, rerun the prog. */
-		psock->eval = __SK_NONE;
-		psock->cork_bytes = 0;
-	}
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		msg.msg_flags |= MSG_MORE;
 
-	err = tcp_bpf_send_verdict(sk, psock, msg, &copied, flags);
-out_err:
-	release_sock(sk);
-	sk_psock_put(sk, psock);
-	return copied ? copied : err;
+	return tcp_bpf_sendmsg(sk, &msg, size);
 }
 
 enum {




* [RFC PATCH 22/28] net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (20 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 21/28] tcp_bpf: Make tcp_bpf_sendpage() go through tcp_bpf_sendmsg(MSG_SPLICE_PAGES) David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 23/28] algif: Remove hash_sendpage*() David Howells
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Use sendmsg() with MSG_SPLICE_PAGES rather than sendpage in
skb_send_sock().  This causes pages to be spliced from the source iterator
if possible (the iterator must be ITER_BVEC and the pages must be
spliceable).

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Note that this could perhaps be improved to fill out a bvec array with all
the frags and then make a single sendmsg call, possibly sticking the header
on the front also.
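
For instance, a sketch of that possible improvement (not part of this
patch; assumes the linear header has already been sent and that all the
fragments fit in one array):

	struct bio_vec bv[MAX_SKB_FRAGS];
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT, };
	size_t total = 0;
	int i, ret;

	/* One bvec per skb fragment, one sendmsg for the lot. */
	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

		bvec_set_page(&bv[i], skb_frag_page(frag),
			      skb_frag_size(frag), skb_frag_off(frag));
		total += skb_frag_size(frag);
	}

	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, i, total);
	ret = sock_sendmsg(sk->sk_socket, &msg);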

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/core/skbuff.c | 49 ++++++++++++++++++++++++++---------------------
 1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index eb7d33b41e71..9fa333e26b7d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2927,32 +2927,32 @@ int skb_splice_bits(struct sk_buff *skb, struct sock *sk, unsigned int offset,
 }
 EXPORT_SYMBOL_GPL(skb_splice_bits);
 
-static int sendmsg_unlocked(struct sock *sk, struct msghdr *msg,
-			    struct kvec *vec, size_t num, size_t size)
+static int sendmsg_locked(struct sock *sk, struct msghdr *msg)
 {
 	struct socket *sock = sk->sk_socket;
+	size_t size = msg_data_left(msg);
 
 	if (!sock)
 		return -EINVAL;
-	return kernel_sendmsg(sock, msg, vec, num, size);
+
+	if (!sock->ops->sendmsg_locked)
+		return sock_no_sendmsg_locked(sk, msg, size);
+
+	return sock->ops->sendmsg_locked(sk, msg, size);
 }
 
-static int sendpage_unlocked(struct sock *sk, struct page *page, int offset,
-			     size_t size, int flags)
+static int sendmsg_unlocked(struct sock *sk, struct msghdr *msg)
 {
 	struct socket *sock = sk->sk_socket;
 
 	if (!sock)
 		return -EINVAL;
-	return kernel_sendpage(sock, page, offset, size, flags);
+	return sock_sendmsg(sock, msg);
 }
 
-typedef int (*sendmsg_func)(struct sock *sk, struct msghdr *msg,
-			    struct kvec *vec, size_t num, size_t size);
-typedef int (*sendpage_func)(struct sock *sk, struct page *page, int offset,
-			     size_t size, int flags);
+typedef int (*sendmsg_func)(struct sock *sk, struct msghdr *msg);
 static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
-			   int len, sendmsg_func sendmsg, sendpage_func sendpage)
+			   int len, sendmsg_func sendmsg)
 {
 	unsigned int orig_len = len;
 	struct sk_buff *head = skb;
@@ -2972,8 +2972,9 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
 		memset(&msg, 0, sizeof(msg));
 		msg.msg_flags = MSG_DONTWAIT;
 
-		ret = INDIRECT_CALL_2(sendmsg, kernel_sendmsg_locked,
-				      sendmsg_unlocked, sk, &msg, &kv, 1, slen);
+		iov_iter_kvec(&msg.msg_iter, ITER_SOURCE, &kv, 1, slen);
+		ret = INDIRECT_CALL_2(sendmsg, sendmsg_locked,
+				      sendmsg_unlocked, sk, &msg);
 		if (ret <= 0)
 			goto error;
 
@@ -3004,11 +3005,17 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
 		slen = min_t(size_t, len, skb_frag_size(frag) - offset);
 
 		while (slen) {
-			ret = INDIRECT_CALL_2(sendpage, kernel_sendpage_locked,
-					      sendpage_unlocked, sk,
-					      skb_frag_page(frag),
-					      skb_frag_off(frag) + offset,
-					      slen, MSG_DONTWAIT);
+			struct bio_vec bvec;
+			struct msghdr msg = {
+				.msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT,
+			};
+
+			bvec_set_page(&bvec, skb_frag_page(frag), slen,
+				      skb_frag_off(frag) + offset);
+			iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, slen);
+
+			ret = INDIRECT_CALL_2(sendmsg, sendmsg_locked,
+					      sendmsg_unlocked, sk, &msg);
 			if (ret <= 0)
 				goto error;
 
@@ -3045,16 +3052,14 @@ static int __skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset,
 int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
 			 int len)
 {
-	return __skb_send_sock(sk, skb, offset, len, kernel_sendmsg_locked,
-			       kernel_sendpage_locked);
+	return __skb_send_sock(sk, skb, offset, len, sendmsg_locked);
 }
 EXPORT_SYMBOL_GPL(skb_send_sock_locked);
 
 /* Send skb data on a socket. Socket must be unlocked. */
 int skb_send_sock(struct sock *sk, struct sk_buff *skb, int offset, int len)
 {
-	return __skb_send_sock(sk, skb, offset, len, sendmsg_unlocked,
-			       sendpage_unlocked);
+	return __skb_send_sock(sk, skb, offset, len, sendmsg_unlocked);
 }
 
 /**



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 23/28] algif: Remove hash_sendpage*()
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (21 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 22/28] net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock() David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 24/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage() David Howells
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Herbert Xu, linux-crypto

Remove hash_sendpage*() and use hash_sendmsg() as the latter seems to just
use the source pages directly anyway.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-crypto@vger.kernel.org
cc: netdev@vger.kernel.org
---
 crypto/algif_hash.c | 66 ---------------------------------------------
 1 file changed, 66 deletions(-)

diff --git a/crypto/algif_hash.c b/crypto/algif_hash.c
index 1d017ec5c63c..52f5828a054a 100644
--- a/crypto/algif_hash.c
+++ b/crypto/algif_hash.c
@@ -129,58 +129,6 @@ static int hash_sendmsg(struct socket *sock, struct msghdr *msg,
 	return err ?: copied;
 }
 
-static ssize_t hash_sendpage(struct socket *sock, struct page *page,
-			     int offset, size_t size, int flags)
-{
-	struct sock *sk = sock->sk;
-	struct alg_sock *ask = alg_sk(sk);
-	struct hash_ctx *ctx = ask->private;
-	int err;
-
-	if (flags & MSG_SENDPAGE_NOTLAST)
-		flags |= MSG_MORE;
-
-	lock_sock(sk);
-	sg_init_table(ctx->sgl.sg, 1);
-	sg_set_page(ctx->sgl.sg, page, size, offset);
-
-	if (!(flags & MSG_MORE)) {
-		err = hash_alloc_result(sk, ctx);
-		if (err)
-			goto unlock;
-	} else if (!ctx->more)
-		hash_free_result(sk, ctx);
-
-	ahash_request_set_crypt(&ctx->req, ctx->sgl.sg, ctx->result, size);
-
-	if (!(flags & MSG_MORE)) {
-		if (ctx->more)
-			err = crypto_ahash_finup(&ctx->req);
-		else
-			err = crypto_ahash_digest(&ctx->req);
-	} else {
-		if (!ctx->more) {
-			err = crypto_ahash_init(&ctx->req);
-			err = crypto_wait_req(err, &ctx->wait);
-			if (err)
-				goto unlock;
-		}
-
-		err = crypto_ahash_update(&ctx->req);
-	}
-
-	err = crypto_wait_req(err, &ctx->wait);
-	if (err)
-		goto unlock;
-
-	ctx->more = flags & MSG_MORE;
-
-unlock:
-	release_sock(sk);
-
-	return err ?: size;
-}
-
 static int hash_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 			int flags)
 {
@@ -285,7 +233,6 @@ static struct proto_ops algif_hash_ops = {
 
 	.release	=	af_alg_release,
 	.sendmsg	=	hash_sendmsg,
-	.sendpage	=	hash_sendpage,
 	.recvmsg	=	hash_recvmsg,
 	.accept		=	hash_accept,
 };
@@ -337,18 +284,6 @@ static int hash_sendmsg_nokey(struct socket *sock, struct msghdr *msg,
 	return hash_sendmsg(sock, msg, size);
 }
 
-static ssize_t hash_sendpage_nokey(struct socket *sock, struct page *page,
-				   int offset, size_t size, int flags)
-{
-	int err;
-
-	err = hash_check_key(sock);
-	if (err)
-		return err;
-
-	return hash_sendpage(sock, page, offset, size, flags);
-}
-
 static int hash_recvmsg_nokey(struct socket *sock, struct msghdr *msg,
 			      size_t ignored, int flags)
 {
@@ -387,7 +322,6 @@ static struct proto_ops algif_hash_ops_nokey = {
 
 	.release	=	af_alg_release,
 	.sendmsg	=	hash_sendmsg_nokey,
-	.sendpage	=	hash_sendpage_nokey,
 	.recvmsg	=	hash_recvmsg_nokey,
 	.accept		=	hash_accept_nokey,
 };



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 24/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (22 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 23/28] algif: Remove hash_sendpage*() David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 25/28] rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage David Howells
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Ilya Dryomov, Xiubo Li,
	ceph-devel

Use sendmsg() and MSG_SPLICE_PAGES rather than sendpage in ceph when
transmitting data.  For the moment, this can only transmit one page at a
time because of the architecture of net/ceph/, but if
write_partial_message_data() could be given a whole bvec[] at a time by the
iteration code, this would allow pages to be sent in a batch.
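
For illustration, a sketch of what the batched send might then look like
(hypothetical function and parameter names; error handling elided):

	static int ceph_tcp_send_batch(struct socket *sock,
				       struct bio_vec *bvec, unsigned int nr,
				       size_t size)
	{
		struct msghdr msg = {
			.msg_flags = CEPH_MSG_FLAGS | MSG_SPLICE_PAGES,
		};

		/* One sendmsg covers all nr pages rather than one per call. */
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, nr, size);
		return sock_sendmsg(sock, &msg);
	}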

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Ilya Dryomov <idryomov@gmail.com>
cc: Xiubo Li <xiubli@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: ceph-devel@vger.kernel.org
cc: netdev@vger.kernel.org
---
 net/ceph/messenger_v2.c | 89 +++++++++--------------------------------
 1 file changed, 18 insertions(+), 71 deletions(-)

diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index 301a991dc6a6..1637a0c21126 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -117,91 +117,38 @@ static int ceph_tcp_recv(struct ceph_connection *con)
 	return ret;
 }
 
-static int do_sendmsg(struct socket *sock, struct iov_iter *it)
-{
-	struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
-	int ret;
-
-	msg.msg_iter = *it;
-	while (iov_iter_count(it)) {
-		ret = sock_sendmsg(sock, &msg);
-		if (ret <= 0) {
-			if (ret == -EAGAIN)
-				ret = 0;
-			return ret;
-		}
-
-		iov_iter_advance(it, ret);
-	}
-
-	WARN_ON(msg_data_left(&msg));
-	return 1;
-}
-
-static int do_try_sendpage(struct socket *sock, struct iov_iter *it)
-{
-	struct msghdr msg = { .msg_flags = CEPH_MSG_FLAGS };
-	struct bio_vec bv;
-	int ret;
-
-	if (WARN_ON(!iov_iter_is_bvec(it)))
-		return -EINVAL;
-
-	while (iov_iter_count(it)) {
-		/* iov_iter_iovec() for ITER_BVEC */
-		bvec_set_page(&bv, it->bvec->bv_page,
-			      min(iov_iter_count(it),
-				  it->bvec->bv_len - it->iov_offset),
-			      it->bvec->bv_offset + it->iov_offset);
-
-		/*
-		 * sendpage cannot properly handle pages with
-		 * page_count == 0, we need to fall back to sendmsg if
-		 * that's the case.
-		 *
-		 * Same goes for slab pages: skb_can_coalesce() allows
-		 * coalescing neighboring slab objects into a single frag
-		 * which triggers one of hardened usercopy checks.
-		 */
-		if (sendpage_ok(bv.bv_page)) {
-			ret = sock->ops->sendpage(sock, bv.bv_page,
-						  bv.bv_offset, bv.bv_len,
-						  CEPH_MSG_FLAGS);
-		} else {
-			iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, bv.bv_len);
-			ret = sock_sendmsg(sock, &msg);
-		}
-		if (ret <= 0) {
-			if (ret == -EAGAIN)
-				ret = 0;
-			return ret;
-		}
-
-		iov_iter_advance(it, ret);
-	}
-
-	return 1;
-}
-
 /*
  * Write as much as possible.  The socket is expected to be corked,
  * so we don't bother with MSG_MORE/MSG_SENDPAGE_NOTLAST here.
  *
  * Return:
- *   1 - done, nothing (else) to write
+ *  >0 - done, nothing (else) to write
  *   0 - socket is full, need to wait
  *  <0 - error
  */
 static int ceph_tcp_send(struct ceph_connection *con)
 {
+	struct msghdr msg = {
+		.msg_iter	= con->v2.out_iter,
+		.msg_flags	= CEPH_MSG_FLAGS,
+	};
 	int ret;
 
+	if (WARN_ON(!iov_iter_is_bvec(&con->v2.out_iter)))
+		return -EINVAL;
+
+	if (con->v2.out_iter_sendpage)
+		msg.msg_flags |= MSG_SPLICE_PAGES;
+
 	dout("%s con %p have %zu try_sendpage %d\n", __func__, con,
 	     iov_iter_count(&con->v2.out_iter), con->v2.out_iter_sendpage);
-	if (con->v2.out_iter_sendpage)
-		ret = do_try_sendpage(con->sock, &con->v2.out_iter);
-	else
-		ret = do_sendmsg(con->sock, &con->v2.out_iter);
+
+	ret = sock_sendmsg(con->sock, &msg);
+	if (ret > 0)
+		iov_iter_advance(&con->v2.out_iter, ret);
+	else if (ret == -EAGAIN)
+		ret = 0;
+
 	dout("%s con %p ret %d left %zu\n", __func__, con, ret,
 	     iov_iter_count(&con->v2.out_iter));
 	return ret;



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 25/28] rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (23 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 24/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage() David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 26/28] dlm: " David Howells
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Santosh Shilimkar,
	linux-rdma, rds-devel

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header and data
pages.

To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator.  The header is copied into memory acquired from
zcopy_alloc(), which just breaks a page up into small pieces that can be
freed with put_page().
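
A rough sketch of the intended usage and lifetime (illustrative only;
zcopy_memdup() is introduced earlier in this series and is not a mainline
interface):

	static int rds_copy_hdr_example(struct rds_header *hdr,
					struct bio_vec *bv)
	{
		int ret;

		/* Copy the header into refcounted page memory so that it
		 * can sit in the same bvec[] as the data pages and be
		 * released the same way as them.
		 */
		ret = zcopy_memdup(sizeof(*hdr), hdr, bv, GFP_KERNEL);
		if (ret < 0)
			return ret;

		/* ... transmit bv along with the data bvecs ... */

		put_page(bv->bv_page);	/* release the header fragment */
		return 0;
	}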

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-rdma@vger.kernel.org
cc: rds-devel@oss.oracle.com
cc: netdev@vger.kernel.org
---
 net/rds/tcp_send.c | 80 ++++++++++++++++++++--------------------------
 1 file changed, 35 insertions(+), 45 deletions(-)

diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 8c4d1d6e9249..0d6eb85a930d 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -32,6 +32,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/in.h>
+#include <linux/zcopy_alloc.h>
 #include <net/tcp.h>
 
 #include "rds_single_path.h"
@@ -52,29 +53,24 @@ void rds_tcp_xmit_path_complete(struct rds_conn_path *cp)
 	tcp_sock_set_cork(tc->t_sock->sk, false);
 }
 
-/* the core send_sem serializes this with other xmit and shutdown */
-static int rds_tcp_sendmsg(struct socket *sock, void *data, unsigned int len)
-{
-	struct kvec vec = {
-		.iov_base = data,
-		.iov_len = len,
-	};
-	struct msghdr msg = {
-		.msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL,
-	};
-
-	return kernel_sendmsg(sock, &msg, &vec, 1, vec.iov_len);
-}
-
 /* the core send_sem serializes this with other xmit and shutdown */
 int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 		 unsigned int hdr_off, unsigned int sg, unsigned int off)
 {
 	struct rds_conn_path *cp = rm->m_inc.i_conn_path;
 	struct rds_tcp_connection *tc = cp->cp_transport_data;
+	struct msghdr msg = {
+		.msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+	};
+	struct bio_vec *bvec;
+	unsigned int i, size = 0, ix = 0;
+	bool free_hdr = false;
 	int done = 0;
-	int ret = 0;
-	int more;
+	int ret = -ENOMEM;
+
+	bvec = kmalloc_array(1 + sg, sizeof(struct bio_vec), GFP_KERNEL);
+	if (!bvec)
+		goto out;
 
 	if (hdr_off == 0) {
 		/*
@@ -101,41 +97,30 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 		/* see rds_tcp_write_space() */
 		set_bit(SOCK_NOSPACE, &tc->t_sock->sk->sk_socket->flags);
 
-		ret = rds_tcp_sendmsg(tc->t_sock,
-				      (void *)&rm->m_inc.i_hdr + hdr_off,
-				      sizeof(rm->m_inc.i_hdr) - hdr_off);
+		ret = zcopy_memdup(sizeof(rm->m_inc.i_hdr) - hdr_off,
+				   (void *)&rm->m_inc.i_hdr + hdr_off,
+				   &bvec[ix], GFP_KERNEL);
 		if (ret < 0)
 			goto out;
-		done += ret;
-		if (hdr_off + done != sizeof(struct rds_header))
-			goto out;
+		free_hdr = true;
+		size += bvec[ix].bv_len;
+		ix++;
 	}
 
-	more = rm->data.op_nents > 1 ? (MSG_MORE | MSG_SENDPAGE_NOTLAST) : 0;
-	while (sg < rm->data.op_nents) {
-		int flags = MSG_DONTWAIT | MSG_NOSIGNAL | more;
-
-		ret = tc->t_sock->ops->sendpage(tc->t_sock,
-						sg_page(&rm->data.op_sg[sg]),
-						rm->data.op_sg[sg].offset + off,
-						rm->data.op_sg[sg].length - off,
-						flags);
-		rdsdebug("tcp sendpage %p:%u:%u ret %d\n", (void *)sg_page(&rm->data.op_sg[sg]),
-			 rm->data.op_sg[sg].offset + off, rm->data.op_sg[sg].length - off,
-			 ret);
-		if (ret <= 0)
-			break;
-
-		off += ret;
-		done += ret;
-		if (off == rm->data.op_sg[sg].length) {
-			off = 0;
-			sg++;
-		}
-		if (sg == rm->data.op_nents - 1)
-			more = 0;
+	for (i = sg; i < rm->data.op_nents; i++) {
+		bvec_set_page(&bvec[ix],
+			      sg_page(&rm->data.op_sg[i]),
+			      rm->data.op_sg[i].length - off,
+			      rm->data.op_sg[i].offset + off);
+		off = 0;
+		size += bvec[ix].bv_len;
+		ix++;
 	}
 
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bvec, ix, size);
+	ret = sock_sendmsg(tc->t_sock, &msg);
+	rdsdebug("tcp sendmsg-splice %u,%u ret %d\n", ix, size, ret);
+
 out:
 	if (ret <= 0) {
 		/* write_space will hit after EAGAIN, all else fatal */
@@ -158,6 +143,11 @@ int rds_tcp_xmit(struct rds_connection *conn, struct rds_message *rm,
 	}
 	if (done == 0)
 		done = ret;
+	if (bvec) {
+		if (free_hdr)
+			put_page(bvec[0].bv_page);
+		kfree(bvec);
+	}
 	return done;
 }
 



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 26/28] dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (24 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 25/28] rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 15:26 ` [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage David Howells
       [not found] ` <20230316152618.711970-29-dhowells@redhat.com>
  27 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Christine Caulfield,
	David Teigland, cluster-devel

When transmitting data, call down a layer using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
using sendpage.  This allows ->sendpage() to be replaced by something that can
handle multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christine Caulfield <ccaulfie@redhat.com>
cc: David Teigland <teigland@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: cluster-devel@redhat.com
cc: netdev@vger.kernel.org
---
 fs/dlm/lowcomms.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index a9b14f81d655..9c0c691b6106 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1394,8 +1394,11 @@ int dlm_lowcomms_resend_msg(struct dlm_msg *msg)
 /* Send a message */
 static int send_to_sock(struct connection *con)
 {
-	const int msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL;
 	struct writequeue_entry *e;
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = MSG_SPLICE_PAGES | MSG_DONTWAIT | MSG_NOSIGNAL,
+	};
 	int len, offset, ret;
 
 	spin_lock_bh(&con->writequeue_lock);
@@ -1411,8 +1414,9 @@ static int send_to_sock(struct connection *con)
 	WARN_ON_ONCE(len == 0 && e->users == 0);
 	spin_unlock_bh(&con->writequeue_lock);
 
-	ret = kernel_sendpage(con->sock, e->page, offset, len,
-			      msg_flags);
+	bvec_set_page(&bvec, e->page, len, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
+	ret = sock_sendmsg(con->sock, &msg);
 	trace_dlm_send(con->nodeid, ret);
 	if (ret == -EAGAIN || ret == 0) {
 		lock_sock(con->sock->sk);



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
                   ` (25 preceding siblings ...)
  2023-03-16 15:26 ` [RFC PATCH 26/28] dlm: " David Howells
@ 2023-03-16 15:26 ` David Howells
  2023-03-16 16:17   ` Trond Myklebust
  2023-03-16 16:24   ` David Howells
       [not found] ` <20230316152618.711970-29-dhowells@redhat.com>
  27 siblings, 2 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 15:26 UTC (permalink / raw)
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Trond Myklebust,
	Anna Schumaker, Chuck Lever, linux-nfs

When transmitting data, call down into TCP using a single sendmsg with
MSG_SPLICE_PAGES to indicate that content should be spliced rather than
performing several sendmsg and sendpage calls to transmit header, data
pages and trailer.

To make this work, the data is assembled in a bio_vec array and attached to
a BVEC-type iterator.  The bio_vec array has two extra slots before the
first for headers and one after the last for a trailer.  The headers and
trailer are copied into memory acquired from zcopy_alloc() which just
breaks a page up into small pieces that can be freed with put_page().
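
For clarity, the bvec[] layout this produces is sketched below (indices
relative to xdr->bvec, as set up by the xdr_alloc_bvec() change in this
patch):

	/*
	 *   xdr->bvec[-2]      record marker    (zcopy_memdup() copy)
	 *   xdr->bvec[-1]      xdr->head        (zcopy_memdup() copy)
	 *   xdr->bvec[0..n-1]  xdr page vector  (page refs, no copy)
	 *   xdr->bvec[n]       xdr->tail        (zcopy_memdup() copy)
	 *
	 * All n + 3 entries are attached to one ITER_SOURCE bvec iterator
	 * and pushed down with a single sock_sendmsg(MSG_SPLICE_PAGES).
	 */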

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Trond Myklebust <trond.myklebust@hammerspace.com>
cc: Anna Schumaker <anna@kernel.org>
cc: Chuck Lever <chuck.lever@oracle.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-nfs@vger.kernel.org
cc: netdev@vger.kernel.org
---
 net/sunrpc/svcsock.c | 70 ++++++++++++--------------------------------
 net/sunrpc/xdr.c     | 24 ++++++++++++---
 2 files changed, 38 insertions(+), 56 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 03a4f5615086..1fa41ddbc40e 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -36,6 +36,7 @@
 #include <linux/skbuff.h>
 #include <linux/file.h>
 #include <linux/freezer.h>
+#include <linux/zcopy_alloc.h>
 #include <net/sock.h>
 #include <net/checksum.h>
 #include <net/ip.h>
@@ -1060,16 +1061,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 	return 0;	/* record not complete */
 }
 
-static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
-			      int flags)
-{
-	return kernel_sendpage(sock, virt_to_page(vec->iov_base),
-			       offset_in_page(vec->iov_base),
-			       vec->iov_len, flags);
-}
-
 /*
- * kernel_sendpage() is used exclusively to reduce the number of
+ * MSG_SPLICE_PAGES is used exclusively to reduce the number of
  * copy operations in this path. Therefore the caller must ensure
  * that the pages backing @xdr are unchanging.
  *
@@ -1081,65 +1074,38 @@ static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
 {
 	const struct kvec *head = xdr->head;
 	const struct kvec *tail = xdr->tail;
-	struct kvec rm = {
-		.iov_base	= &marker,
-		.iov_len	= sizeof(marker),
-	};
 	struct msghdr msg = {
-		.msg_flags	= 0,
+		.msg_flags	= MSG_SPLICE_PAGES,
 	};
-	int ret;
+	int ret, n = xdr_buf_pagecount(xdr), size;
 
 	*sentp = 0;
 	ret = xdr_alloc_bvec(xdr, GFP_KERNEL);
 	if (ret < 0)
 		return ret;
 
-	ret = kernel_sendmsg(sock, &msg, &rm, 1, rm.iov_len);
+	ret = zcopy_memdup(sizeof(marker), &marker, &xdr->bvec[-2], GFP_KERNEL);
 	if (ret < 0)
 		return ret;
-	*sentp += ret;
-	if (ret != rm.iov_len)
-		return -EAGAIN;
 
-	ret = svc_tcp_send_kvec(sock, head, 0);
+	ret = zcopy_memdup(head->iov_len, head->iov_base, &xdr->bvec[-1], GFP_KERNEL);
 	if (ret < 0)
 		return ret;
-	*sentp += ret;
-	if (ret != head->iov_len)
-		goto out;
 
-	if (xdr->page_len) {
-		unsigned int offset, len, remaining;
-		struct bio_vec *bvec;
-
-		bvec = xdr->bvec + (xdr->page_base >> PAGE_SHIFT);
-		offset = offset_in_page(xdr->page_base);
-		remaining = xdr->page_len;
-		while (remaining > 0) {
-			len = min(remaining, bvec->bv_len - offset);
-			ret = kernel_sendpage(sock, bvec->bv_page,
-					      bvec->bv_offset + offset,
-					      len, 0);
-			if (ret < 0)
-				return ret;
-			*sentp += ret;
-			if (ret != len)
-				goto out;
-			remaining -= len;
-			offset = 0;
-			bvec++;
-		}
-	}
+	ret = zcopy_memdup(tail->iov_len, tail->iov_base, &xdr->bvec[n], GFP_KERNEL);
+	if (ret < 0)
+		return ret;
 
-	if (tail->iov_len) {
-		ret = svc_tcp_send_kvec(sock, tail, 0);
-		if (ret < 0)
-			return ret;
-		*sentp += ret;
-	}
+	size = sizeof(marker) + head->iov_len + xdr->page_len + tail->iov_len;
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, xdr->bvec - 2, n + 3, size);
 
-out:
+	ret = sock_sendmsg(sock, &msg);
+	if (ret < 0)
+		return ret;
+	if (ret > 0)
+		*sentp = ret;
+	if (ret != size)
+		return -EAGAIN;
 	return 0;
 }
 
diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
index 36835b2f5446..6dff0b4f17b8 100644
--- a/net/sunrpc/xdr.c
+++ b/net/sunrpc/xdr.c
@@ -145,14 +145,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
 {
 	size_t i, n = xdr_buf_pagecount(buf);
 
-	if (n != 0 && buf->bvec == NULL) {
-		buf->bvec = kmalloc_array(n, sizeof(buf->bvec[0]), gfp);
+	if (buf->bvec == NULL) {
+		/* Allow for two headers and a trailer to be attached */
+		buf->bvec = kmalloc_array(n + 3, sizeof(buf->bvec[0]), gfp);
 		if (!buf->bvec)
 			return -ENOMEM;
+		buf->bvec += 2;
+		buf->bvec[-2].bv_page = NULL;
+		buf->bvec[-1].bv_page = NULL;
 		for (i = 0; i < n; i++) {
 			bvec_set_page(&buf->bvec[i], buf->pages[i], PAGE_SIZE,
 				      0);
 		}
+		buf->bvec[n].bv_page = NULL;
 	}
 	return 0;
 }
@@ -160,8 +165,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
 void
 xdr_free_bvec(struct xdr_buf *buf)
 {
-	kfree(buf->bvec);
-	buf->bvec = NULL;
+	if (buf->bvec) {
+		size_t n = xdr_buf_pagecount(buf);
+
+		if (buf->bvec[-2].bv_page)
+			put_page(buf->bvec[-2].bv_page);
+		if (buf->bvec[-1].bv_page)
+			put_page(buf->bvec[-1].bv_page);
+		if (buf->bvec[n].bv_page)
+			put_page(buf->bvec[n].bv_page);
+		buf->bvec -= 2;
+		kfree(buf->bvec);
+		buf->bvec = NULL;
+	}
 }
 
 /**



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 28/28] sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)
       [not found] ` <20230316152618.711970-29-dhowells@redhat.com>
@ 2023-03-16 15:57   ` Marc Kleine-Budde
  0 siblings, 0 replies; 50+ messages in thread
From: Marc Kleine-Budde @ 2023-03-16 15:57 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Al Viro, Christoph Hellwig, Jens Axboe, Jeff Layton,
	Christian Brauner, Linus Torvalds, netdev, linux-fsdevel,
	linux-kernel, linux-mm, bpf, dccp, linux-afs, linux-arm-msm,
	linux-can, linux-crypto, linux-doc, linux-hams, linux-rdma,
	linux-sctp, linux-wpan, linux-x25, mptcp, rds-devel,
	tipc-discussion, virtualization

On 16.03.2023 15:26:18, David Howells wrote:
> [!] Note: This is a work in progress.  At the moment, some things won't
>     build if this patch is applied.  nvme, kcm, smc, tls.
> 
> Remove ->sendpage() and ->sendpage_locked().  sendmsg() with
> MSG_SPLICE_PAGES should be used instead.  This allows multiple pages and
> multipage folios to be passed through.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>

> cc: linux-can@vger.kernel.org

Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can

Marc

-- 
Pengutronix e.K.                 | Marc Kleine-Budde           |
Embedded Linux                   | https://www.pengutronix.de  |
Amtsgericht Hildesheim, HRA 2686 | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  2023-03-16 15:26 ` [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage David Howells
@ 2023-03-16 16:17   ` Trond Myklebust
  2023-03-16 17:10     ` Chuck Lever III
                       ` (2 more replies)
  2023-03-16 16:24   ` David Howells
  1 sibling, 3 replies; 50+ messages in thread
From: Trond Myklebust @ 2023-03-16 16:17 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexander Viro, Christoph Hellwig, Jens Axboe,
	Jeffrey Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Trond Myklebust,
	Anna Schumaker, Charles Edward Lever, linux-nfs



> On Mar 16, 2023, at 11:26, David Howells <dhowells@redhat.com> wrote:
> 
> When transmitting data, call down into TCP using a single sendmsg with
> MSG_SPLICE_PAGES to indicate that content should be spliced rather than
> performing several sendmsg and sendpage calls to transmit header, data
> pages and trailer.
> 
> To make this work, the data is assembled in a bio_vec array and attached to
> a BVEC-type iterator.  The bio_vec array has two extra slots before the
> first for headers and one after the last for a trailer.  The headers and
> trailer are copied into memory acquired from zcopy_alloc() which just
> breaks a page up into small pieces that can be freed with put_page().
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Trond Myklebust <trond.myklebust@hammerspace.com>
> cc: Anna Schumaker <anna@kernel.org>
> cc: Chuck Lever <chuck.lever@oracle.com>
> cc: Jeff Layton <jlayton@kernel.org>
> cc: "David S. Miller" <davem@davemloft.net>
> cc: Eric Dumazet <edumazet@google.com>
> cc: Jakub Kicinski <kuba@kernel.org>
> cc: Paolo Abeni <pabeni@redhat.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: linux-nfs@vger.kernel.org
> cc: netdev@vger.kernel.org
> ---
> net/sunrpc/svcsock.c | 70 ++++++++++++--------------------------------
> net/sunrpc/xdr.c     | 24 ++++++++++++---
> 2 files changed, 38 insertions(+), 56 deletions(-)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 03a4f5615086..1fa41ddbc40e 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -36,6 +36,7 @@
> #include <linux/skbuff.h>
> #include <linux/file.h>
> #include <linux/freezer.h>
> +#include <linux/zcopy_alloc.h>
> #include <net/sock.h>
> #include <net/checksum.h>
> #include <net/ip.h>
> @@ -1060,16 +1061,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> return 0; /* record not complete */
> }
> 
> -static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
> -      int flags)
> -{
> - return kernel_sendpage(sock, virt_to_page(vec->iov_base),
> -       offset_in_page(vec->iov_base),
> -       vec->iov_len, flags);
> -}
> -
> /*
> - * kernel_sendpage() is used exclusively to reduce the number of
> + * MSG_SPLICE_PAGES is used exclusively to reduce the number of
>  * copy operations in this path. Therefore the caller must ensure
>  * that the pages backing @xdr are unchanging.
>  *
> @@ -1081,65 +1074,38 @@ static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
> {
> const struct kvec *head = xdr->head;
> const struct kvec *tail = xdr->tail;
> - struct kvec rm = {
> - .iov_base = &marker,
> - .iov_len = sizeof(marker),
> - };
> struct msghdr msg = {
> - .msg_flags = 0,
> + .msg_flags = MSG_SPLICE_PAGES,
> };
> - int ret;
> + int ret, n = xdr_buf_pagecount(xdr), size;
> 
> *sentp = 0;
> ret = xdr_alloc_bvec(xdr, GFP_KERNEL);
> if (ret < 0)
> return ret;
> 
> - ret = kernel_sendmsg(sock, &msg, &rm, 1, rm.iov_len);
> + ret = zcopy_memdup(sizeof(marker), &marker, &xdr->bvec[-2], GFP_KERNEL);
> if (ret < 0)
> return ret;
> - *sentp += ret;
> - if (ret != rm.iov_len)
> - return -EAGAIN;
> 
> - ret = svc_tcp_send_kvec(sock, head, 0);
> + ret = zcopy_memdup(head->iov_len, head->iov_base, &xdr->bvec[-1], GFP_KERNEL);
> if (ret < 0)
> return ret;
> - *sentp += ret;
> - if (ret != head->iov_len)
> - goto out;
> 
> - if (xdr->page_len) {
> - unsigned int offset, len, remaining;
> - struct bio_vec *bvec;
> -
> - bvec = xdr->bvec + (xdr->page_base >> PAGE_SHIFT);
> - offset = offset_in_page(xdr->page_base);
> - remaining = xdr->page_len;
> - while (remaining > 0) {
> - len = min(remaining, bvec->bv_len - offset);
> - ret = kernel_sendpage(sock, bvec->bv_page,
> -      bvec->bv_offset + offset,
> -      len, 0);
> - if (ret < 0)
> - return ret;
> - *sentp += ret;
> - if (ret != len)
> - goto out;
> - remaining -= len;
> - offset = 0;
> - bvec++;
> - }
> - }
> + ret = zcopy_memdup(tail->iov_len, tail->iov_base, &xdr->bvec[n], GFP_KERNEL);
> + if (ret < 0)
> + return ret;
> 
> - if (tail->iov_len) {
> - ret = svc_tcp_send_kvec(sock, tail, 0);
> - if (ret < 0)
> - return ret;
> - *sentp += ret;
> - }
> + size = sizeof(marker) + head->iov_len + xdr->page_len + tail->iov_len;
> + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, xdr->bvec - 2, n + 3, size);
> 
> -out:
> + ret = sock_sendmsg(sock, &msg);
> + if (ret < 0)
> + return ret;
> + if (ret > 0)
> + *sentp = ret;
> + if (ret != size)
> + return -EAGAIN;
> return 0;
> }
> 
> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
> index 36835b2f5446..6dff0b4f17b8 100644
> --- a/net/sunrpc/xdr.c
> +++ b/net/sunrpc/xdr.c
> @@ -145,14 +145,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
> {
> size_t i, n = xdr_buf_pagecount(buf);
> 
> - if (n != 0 && buf->bvec == NULL) {
> - buf->bvec = kmalloc_array(n, sizeof(buf->bvec[0]), gfp);
> + if (buf->bvec == NULL) {
> + /* Allow for two headers and a trailer to be attached */
> + buf->bvec = kmalloc_array(n + 3, sizeof(buf->bvec[0]), gfp);
> if (!buf->bvec)
> return -ENOMEM;
> + buf->bvec += 2;
> + buf->bvec[-2].bv_page = NULL;
> + buf->bvec[-1].bv_page = NULL;

NACK.

> for (i = 0; i < n; i++) {
> bvec_set_page(&buf->bvec[i], buf->pages[i], PAGE_SIZE,
>      0);
> }
> + buf->bvec[n].bv_page = NULL;
> }
> return 0;
> }
> @@ -160,8 +165,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
> void
> xdr_free_bvec(struct xdr_buf *buf)
> {
> - kfree(buf->bvec);
> - buf->bvec = NULL;
> + if (buf->bvec) {
> + size_t n = xdr_buf_pagecount(buf);
> +
> + if (buf->bvec[-2].bv_page)
> + put_page(buf->bvec[-2].bv_page);
> + if (buf->bvec[-1].bv_page)
> + put_page(buf->bvec[-1].bv_page);
> + if (buf->bvec[n].bv_page)
> + put_page(buf->bvec[n].bv_page);
> + buf->bvec -= 2;
> + kfree(buf->bvec);
> + buf->bvec = NULL;
> + }
> }
> 
> /**
> 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  2023-03-16 15:26 ` [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage David Howells
  2023-03-16 16:17   ` Trond Myklebust
@ 2023-03-16 16:24   ` David Howells
  2023-03-16 17:23     ` Trond Myklebust
  2023-03-16 18:06     ` David Howells
  1 sibling, 2 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 16:24 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: dhowells, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexander Viro, Christoph Hellwig,
	Jens Axboe, Jeffrey Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Charles Edward Lever, linux-nfs

Trond Myklebust <trondmy@hammerspace.com> wrote:

> > + buf->bvec += 2;
> > + buf->bvec[-2].bv_page = NULL;
> > + buf->bvec[-1].bv_page = NULL;
> 
> NACK.

Can you elaborate?

Is it that you dislike allocating extra slots for protocol bits?  Or just that
the bvec[] is offset by 2?  Or some other reason?

David



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  2023-03-16 16:17   ` Trond Myklebust
@ 2023-03-16 17:10     ` Chuck Lever III
  2023-03-16 17:28     ` David Howells
  2023-03-16 21:21     ` David Howells
  2 siblings, 0 replies; 50+ messages in thread
From: Chuck Lever III @ 2023-03-16 17:10 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeffrey Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Linux NFS Mailing List


Note: this is the first I've seen of this series -- not sure why
I never received any of these patches.

That means I haven't seen the cover letter and do not have any
context for this proposed change.


> On Mar 16, 2023, at 12:17 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
>> On Mar 16, 2023, at 11:26, David Howells <dhowells@redhat.com> wrote:
>> 
>> When transmitting data, call down into TCP using a single sendmsg with
>> MSG_SPLICE_PAGES to indicate that content should be spliced rather than
>> performing several sendmsg and sendpage calls to transmit header, data
>> pages and trailer.

We've tried combining the sendpages calls in here before. It
results in a significant and measurable performance regression.
See:

da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends")

and its subsequent revert:

4a85a6a3320b ("SUNRPC: Handle TCP socket sends with kernel_sendpage() again")


Therefore, this kind of change needs to be accompanied by both
benchmark results and some field testing to convince me it won't
cause harm.

Also, I'd rather see struct xdr_buf changed to /replace/ the
head/pagevec/tail arrangement with bvecs before we do this
kind of overhaul.

And, we have to make certain that this doesn't break operation
with kTLS sockets... do they support MSG_SPLICE_PAGES ?


>> To make this work, the data is assembled in a bio_vec array and attached to
>> a BVEC-type iterator.  The bio_vec array has two extra slots before the
>> first for headers and one after the last for a trailer.  The headers and
>> trailer are copied into memory acquired from zcopy_alloc() which just
>> breaks a page up into small pieces that can be freed with put_page().
>> 
>> Signed-off-by: David Howells <dhowells@redhat.com>
>> cc: Trond Myklebust <trond.myklebust@hammerspace.com>
>> cc: Anna Schumaker <anna@kernel.org>
>> cc: Chuck Lever <chuck.lever@oracle.com>
>> cc: Jeff Layton <jlayton@kernel.org>
>> cc: "David S. Miller" <davem@davemloft.net>
>> cc: Eric Dumazet <edumazet@google.com>
>> cc: Jakub Kicinski <kuba@kernel.org>
>> cc: Paolo Abeni <pabeni@redhat.com>
>> cc: Jens Axboe <axboe@kernel.dk>
>> cc: Matthew Wilcox <willy@infradead.org>
>> cc: linux-nfs@vger.kernel.org
>> cc: netdev@vger.kernel.org
>> ---
>> net/sunrpc/svcsock.c | 70 ++++++++++++--------------------------------
>> net/sunrpc/xdr.c     | 24 ++++++++++++---
>> 2 files changed, 38 insertions(+), 56 deletions(-)
>> 
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index 03a4f5615086..1fa41ddbc40e 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -36,6 +36,7 @@
>> #include <linux/skbuff.h>
>> #include <linux/file.h>
>> #include <linux/freezer.h>
>> +#include <linux/zcopy_alloc.h>
>> #include <net/sock.h>
>> #include <net/checksum.h>
>> #include <net/ip.h>
>> @@ -1060,16 +1061,8 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>> return 0; /* record not complete */
>> }
>> 
>> -static int svc_tcp_send_kvec(struct socket *sock, const struct kvec *vec,
>> -      int flags)
>> -{
>> - return kernel_sendpage(sock, virt_to_page(vec->iov_base),
>> -       offset_in_page(vec->iov_base),
>> -       vec->iov_len, flags);
>> -}
>> -
>> /*
>> - * kernel_sendpage() is used exclusively to reduce the number of
>> + * MSG_SPLICE_PAGES is used exclusively to reduce the number of
>> * copy operations in this path. Therefore the caller must ensure
>> * that the pages backing @xdr are unchanging.
>> *
>> @@ -1081,65 +1074,38 @@ static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
>> {
>> const struct kvec *head = xdr->head;
>> const struct kvec *tail = xdr->tail;
>> - struct kvec rm = {
>> - .iov_base = &marker,
>> - .iov_len = sizeof(marker),
>> - };
>> struct msghdr msg = {
>> - .msg_flags = 0,
>> + .msg_flags = MSG_SPLICE_PAGES,
>> };
>> - int ret;
>> + int ret, n = xdr_buf_pagecount(xdr), size;
>> 
>> *sentp = 0;
>> ret = xdr_alloc_bvec(xdr, GFP_KERNEL);
>> if (ret < 0)
>> return ret;
>> 
>> - ret = kernel_sendmsg(sock, &msg, &rm, 1, rm.iov_len);
>> + ret = zcopy_memdup(sizeof(marker), &marker, &xdr->bvec[-2], GFP_KERNEL);
>> if (ret < 0)
>> return ret;
>> - *sentp += ret;
>> - if (ret != rm.iov_len)
>> - return -EAGAIN;
>> 
>> - ret = svc_tcp_send_kvec(sock, head, 0);
>> + ret = zcopy_memdup(head->iov_len, head->iov_base, &xdr->bvec[-1], GFP_KERNEL);
>> if (ret < 0)
>> return ret;
>> - *sentp += ret;
>> - if (ret != head->iov_len)
>> - goto out;
>> 
>> - if (xdr->page_len) {
>> - unsigned int offset, len, remaining;
>> - struct bio_vec *bvec;
>> -
>> - bvec = xdr->bvec + (xdr->page_base >> PAGE_SHIFT);
>> - offset = offset_in_page(xdr->page_base);
>> - remaining = xdr->page_len;
>> - while (remaining > 0) {
>> - len = min(remaining, bvec->bv_len - offset);
>> - ret = kernel_sendpage(sock, bvec->bv_page,
>> -      bvec->bv_offset + offset,
>> -      len, 0);
>> - if (ret < 0)
>> - return ret;
>> - *sentp += ret;
>> - if (ret != len)
>> - goto out;
>> - remaining -= len;
>> - offset = 0;
>> - bvec++;
>> - }
>> - }
>> + ret = zcopy_memdup(tail->iov_len, tail->iov_base, &xdr->bvec[n], GFP_KERNEL);
>> + if (ret < 0)
>> + return ret;
>> 
>> - if (tail->iov_len) {
>> - ret = svc_tcp_send_kvec(sock, tail, 0);
>> - if (ret < 0)
>> - return ret;
>> - *sentp += ret;
>> - }
>> + size = sizeof(marker) + head->iov_len + xdr->page_len + tail->iov_len;
>> + iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, xdr->bvec - 2, n + 3, size);
>> 
>> -out:
>> + ret = sock_sendmsg(sock, &msg);
>> + if (ret < 0)
>> + return ret;
>> + if (ret > 0)
>> + *sentp = ret;
>> + if (ret != size)
>> + return -EAGAIN;
>> return 0;
>> }
>> 
>> diff --git a/net/sunrpc/xdr.c b/net/sunrpc/xdr.c
>> index 36835b2f5446..6dff0b4f17b8 100644
>> --- a/net/sunrpc/xdr.c
>> +++ b/net/sunrpc/xdr.c
>> @@ -145,14 +145,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
>> {
>> size_t i, n = xdr_buf_pagecount(buf);
>> 
>> - if (n != 0 && buf->bvec == NULL) {
>> - buf->bvec = kmalloc_array(n, sizeof(buf->bvec[0]), gfp);
>> + if (buf->bvec == NULL) {
>> + /* Allow for two headers and a trailer to be attached */
>> + buf->bvec = kmalloc_array(n + 3, sizeof(buf->bvec[0]), gfp);
>> if (!buf->bvec)
>> return -ENOMEM;
>> + buf->bvec += 2;
>> + buf->bvec[-2].bv_page = NULL;
>> + buf->bvec[-1].bv_page = NULL;
> 
> NACK.
> 
>> for (i = 0; i < n; i++) {
>> bvec_set_page(&buf->bvec[i], buf->pages[i], PAGE_SIZE,
>>     0);
>> }
>> + buf->bvec[n].bv_page = NULL;
>> }
>> return 0;
>> }
>> @@ -160,8 +165,19 @@ xdr_alloc_bvec(struct xdr_buf *buf, gfp_t gfp)
>> void
>> xdr_free_bvec(struct xdr_buf *buf)
>> {
>> - kfree(buf->bvec);
>> - buf->bvec = NULL;
>> + if (buf->bvec) {
>> + size_t n = xdr_buf_pagecount(buf);
>> +
>> + if (buf->bvec[-2].bv_page)
>> + put_page(buf->bvec[-2].bv_page);
>> + if (buf->bvec[-1].bv_page)
>> + put_page(buf->bvec[-1].bv_page);
>> + if (buf->bvec[n].bv_page)
>> + put_page(buf->bvec[n].bv_page);
>> + buf->bvec -= 2;
>> + kfree(buf->bvec);
>> + buf->bvec = NULL;
>> + }
>> }
>> 
>> /**
>> 
> 

--
Chuck Lever




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  2023-03-16 16:24   ` David Howells
@ 2023-03-16 17:23     ` Trond Myklebust
  2023-03-16 18:06     ` David Howells
  1 sibling, 0 replies; 50+ messages in thread
From: Trond Myklebust @ 2023-03-16 17:23 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexander Viro, Christoph Hellwig, Jens Axboe,
	Jeffrey Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Charles Edward Lever, linux-nfs


On Mar 16, 2023, at 12:24, David Howells <dhowells@redhat.com> wrote:

> Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> > > + buf->bvec += 2;
> > > + buf->bvec[-2].bv_page = NULL;
> > > + buf->bvec[-1].bv_page = NULL;
> >
> > NACK.
>
> Can you elaborate?
>
> Is it that you dislike allocating extra slots for protocol bits?  Or just
> that the bvec[] is offset by 2?  Or some other reason?


1) This is code that is common to the client and the server. Why are we
   adding 3 unused bvec slots to every client RPC call?

2) It obfuscates the existence of these bvec slots.

3) knfsd may use splice_direct_to_actor() in order to avoid copying the
   page cache data into private buffers (it just takes a reference to the
   pages). Using MSG_SPLICE_PAGES will presumably require it to protect
   those pages against further writes while the socket is referencing
   them.


_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  2023-03-16 16:17   ` Trond Myklebust
  2023-03-16 17:10     ` Chuck Lever III
@ 2023-03-16 17:28     ` David Howells
  2023-03-16 17:41       ` Chuck Lever III
  2023-03-16 21:21     ` David Howells
  2 siblings, 1 reply; 50+ messages in thread
From: David Howells @ 2023-03-16 17:28 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: dhowells, Trond Myklebust, Matthew Wilcox, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeffrey Layton, Christian Brauner,
	Linus Torvalds, netdev, linux-fsdevel, linux-kernel, linux-mm,
	Anna Schumaker, Linux NFS Mailing List

Chuck Lever III <chuck.lever@oracle.com> wrote:

> That means I haven't seen the cover letter and do not have any
> context for this proposed change.

https://lore.kernel.org/linux-fsdevel/20230316152618.711970-1-dhowells@redhat.com/

> We've tried combining the sendpages calls in here before. It
> results in a significant and measurable performance regression.
> See:
> 
> da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends")

The commit replaced the use of sendpage with sendmsg, but that took away the
zerocopy aspect of sendpage.  The idea behind MSG_SPLICE_PAGES is that it
allows you to keep that.  I'll have to try reapplying this commit and
adding the MSG_SPLICE_PAGES flag.

> Therefore, this kind of change needs to be accompanied by both
> benchmark results and some field testing to convince me it won't
> cause harm.

Yep.

> And, we have to make certain that this doesn't break operation
> with kTLS sockets... do they support MSG_SPLICE_PAGES ?

I haven't yet tackled AF_TLS, AF_KCM or AF_SMC as they seem significantly more
complex than TCP and UDP.  I thought I'd get some feedback on what I have
before I tried my hand at those.

David



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 02/28] Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES
  2023-03-16 15:25 ` [RFC PATCH 02/28] Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES David Howells
@ 2023-03-16 17:28   ` Matthew Wilcox
  2023-03-16 18:00   ` David Howells
  1 sibling, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2023-03-16 17:28 UTC (permalink / raw)
  To: David Howells
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Al Viro, Christoph Hellwig, Jens Axboe, Jeff Layton,
	Christian Brauner, Linus Torvalds, netdev, linux-fsdevel,
	linux-kernel, linux-mm, Bernard Metzler, Tom Talpey, linux-rdma

On Thu, Mar 16, 2023 at 03:25:52PM +0000, David Howells wrote:
> If a network protocol sendmsg() sees MSG_SPLICE_PAGES, it expects that the
> iterator is of ITER_BVEC type and that all the pages can have refs taken on
> them with get_page() and discarded with put_page().  Bits of network
> filesystem protocol data, however, are typically contained in slab memory
> for which the cleanup method is kfree(), not put_page(), so this doesn't
> work.
> 
> Provide a simple allocator, zcopy_alloc(), that allocates a page at a time
> per-cpu and sequentially breaks off pieces and hands them out with a ref as
> it's asked for them.  The caller disposes of the memory it was given by
> calling put_page().  When a page is all parcelled out, it is abandoned by
> the allocator and another page is obtained.  The page will get cleaned up
> when the last skbuff fragment is destroyed.

This feels a _lot_ like the page_frag allocator.  Can the two be
unified?
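
For reference, a minimal sketch of how the page_frag allocator is used
(these are real interfaces, declared via linux/gfp.h at the time of
writing; the 64-byte size is arbitrary):

	struct page_frag_cache cache = {};
	void *frag;

	/* Carve 64 bytes off a refcounted page; page_frag_free() drops
	 * the page reference when the fragment is finished with.
	 */
	frag = page_frag_alloc(&cache, 64, GFP_KERNEL);
	if (frag) {
		memset(frag, 0, 64);
		page_frag_free(frag);
	}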


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  2023-03-16 17:28     ` David Howells
@ 2023-03-16 17:41       ` Chuck Lever III
  0 siblings, 0 replies; 50+ messages in thread
From: Chuck Lever III @ 2023-03-16 17:41 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeffrey Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Linux NFS Mailing List



> On Mar 16, 2023, at 1:28 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
>> That means I haven't seen the cover letter and do not have any
>> context for this proposed change.
> 
> https://lore.kernel.org/linux-fsdevel/20230316152618.711970-1-dhowells@redhat.com/
> 
>> We've tried combining the sendpages calls in here before. It
>> results in a significant and measurable performance regression.
>> See:
>> 
>> da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends")
> 
> The commit replaced the use of sendpage with sendmsg, but that took away the
> zerocopy aspect of sendpage.  The idea behind MSG_SPLICE_PAGES is that it
> allows you to do keep that.  I'll have to try reapplying this commit and
> adding the MSG_SPLICE_PAGES flag.

Note that, as Trond pointed out, NFSD can handle an NFS READ
request with either a splice actor or by copying through a
vector, depending on what the underlying filesystem can
support and whether we are using a security flavor that
requires stable pages. Grep for RQ_SPLICE_OK.

Eventually we want to make use of iomaps to ensure that
reading areas of a file that are not allocated on disk
does not trigger an extent allocation. Anna is working on
that, but I have no idea what it will look like. We can
talk more at LSF, if you'll both be around.

Also... I find I have to put back the use of MSG_MORE and
friends in here, otherwise kTLS will split each of these
kernel_sendsomething() calls into its own TLS record. This
code is likely going to look different after support for
RPC-with-TLS goes in.


>> Therefore, this kind of change needs to be accompanied by both
>> benchmark results and some field testing to convince me it won't
>> cause harm.
> 
> Yep.
> 
>> And, we have to make certain that this doesn't break operation
>> with kTLS sockets... do they support MSG_SPLICE_PAGES ?
> 
> I haven't yet tackled AF_TLS, AF_KCM or AF_SMC as they seem significantly more
> complex than TCP and UDP.  I thought I'd get some feedback on what I have
> before I tried my hand at those.

OK, I didn't mean AF_TLS, I meant the stuff under net/tls,
which is AF_INET[6] and TCP, but with a ULP in place. It's
got its own sendpage and sendmsg methods that choke when
an unrecognized MSG_ flag is present.

But OK, you're just asking for feedback, so I'll put my red
pencil down.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 02/28] Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES
  2023-03-16 15:25 ` [RFC PATCH 02/28] Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES David Howells
  2023-03-16 17:28   ` Matthew Wilcox
@ 2023-03-16 18:00   ` David Howells
  1 sibling, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 18:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Al Viro, Christoph Hellwig, Jens Axboe, Jeff Layton,
	Christian Brauner, Linus Torvalds, netdev, linux-fsdevel,
	linux-kernel, linux-mm, Bernard Metzler, Tom Talpey, linux-rdma

Matthew Wilcox <willy@infradead.org> wrote:

> This feels a _lot_ like the page_frag allocator.  Can the two be
> unified?

Looks kind of similar.  I might well be able to use that instead.

David



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  2023-03-16 16:24   ` David Howells
  2023-03-16 17:23     ` Trond Myklebust
@ 2023-03-16 18:06     ` David Howells
  2023-03-16 19:01       ` Trond Myklebust
                         ` (2 more replies)
  1 sibling, 3 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 18:06 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: dhowells, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexander Viro, Christoph Hellwig,
	Jens Axboe, Jeffrey Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Charles Edward Lever, linux-nfs

Trond Myklebust <trondmy@hammerspace.com> wrote:

> 1) This is code that is common to the client and the server. Why are we
> adding 3 unused bvec slots to every client RPC call?

Fair point, but I'm trying to avoid making four+ sendmsg calls in nfsd rather
than one.

> 2) It obfuscates the existence of these bvec slots.

True, it'd be nice to find a better way to do it.  Question is, can the client
make use of MSG_SPLICE_PAGES also?

> 3) knfsd may use splice_direct_to_actor() in order to avoid copying the page
> cache data into private buffers (it just takes a reference to the
> pages). Using MSG_SPLICE_PAGES will presumably require it to protect those
> pages against further writes while the socket is referencing them.

Upstream sunrpc is using sendpage with TCP.  It already has that issue.
MSG_SPLICE_PAGES is a way of doing sendpage through sendmsg.

David



^ permalink raw reply	[flat|nested] 50+ messages in thread

* RE: [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES
  2023-03-16 15:25 ` [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES David Howells
@ 2023-03-16 18:37   ` Willem de Bruijn
  2023-03-16 18:44   ` David Howells
  1 sibling, 0 replies; 50+ messages in thread
From: Willem de Bruijn @ 2023-03-16 18:37 UTC (permalink / raw)
  To: David Howells, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

David Howells wrote:
> Make TCP's sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
> spliced from the source iterator if possible (the iterator must be
> ITER_BVEC and the pages must be spliceable).
> 
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Eric Dumazet <edumazet@google.com>
> cc: "David S. Miller" <davem@davemloft.net>
> cc: Jakub Kicinski <kuba@kernel.org>
> cc: Paolo Abeni <pabeni@redhat.com>
> cc: Jens Axboe <axboe@kernel.dk>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: netdev@vger.kernel.org
> ---
>  net/ipv4/tcp.c | 59 +++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 53 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 288693981b00..77c0c69208a5 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1220,7 +1220,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  	int flags, err, copied = 0;
>  	int mss_now = 0, size_goal, copied_syn = 0;
>  	int process_backlog = 0;
> -	bool zc = false;
> +	int zc = 0;
>  	long timeo;
>  
>  	flags = msg->msg_flags;
> @@ -1231,17 +1231,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>  		if (msg->msg_ubuf) {
>  			uarg = msg->msg_ubuf;
>  			net_zcopy_get(uarg);
> -			zc = sk->sk_route_caps & NETIF_F_SG;
> +			if (sk->sk_route_caps & NETIF_F_SG)
> +				zc = 1;
>  		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>  			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
>  			if (!uarg) {
>  				err = -ENOBUFS;
>  				goto out_err;
>  			}
> -			zc = sk->sk_route_caps & NETIF_F_SG;
> -			if (!zc)
> +			if (sk->sk_route_caps & NETIF_F_SG)
> +				zc = 1;
> +			else
>  				uarg_to_msgzc(uarg)->zerocopy = 0;
>  		}
> +	} else if (unlikely(flags & MSG_SPLICE_PAGES) && size) {
> +		if (!iov_iter_is_bvec(&msg->msg_iter))
> +			return -EINVAL;
> +		if (sk->sk_route_caps & NETIF_F_SG)
> +			zc = 2;
>  	}

The commit message mentions MSG_SPLICE_PAGES as an internal flag.

It can be passed from userspace. The code anticipates that and checks
preconditions.

A side effect is that legacy applications that may already be setting
this bit in the flags now start failing. Most socket types are
historically permissive and simply ignore undefined flags.

With MSG_ZEROCOPY we chose to be extra cautious and added
SOCK_ZEROCOPY, only testing the MSG_ZEROCOPY bit if this socket option
is explicitly enabled. Perhaps more cautious than necessary, but FYI.
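
To make the precedent concrete, the MSG_ZEROCOPY path is gated roughly
like this (simplified from the tcp_sendmsg_locked() code quoted above):

	/* The bit is only honoured once the application has opted in
	 * via setsockopt(SOL_SOCKET, SO_ZEROCOPY, ...); a stray
	 * MSG_ZEROCOPY from a legacy application has no effect. */
	if ((msg->msg_flags & MSG_ZEROCOPY) && size &&
	    sock_flag(sk, SOCK_ZEROCOPY))
		uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));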


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES
  2023-03-16 15:25 ` [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES David Howells
  2023-03-16 18:37   ` Willem de Bruijn
@ 2023-03-16 18:44   ` David Howells
  2023-03-16 19:00     ` Willem de Bruijn
  2023-03-21  0:38     ` David Howells
  1 sibling, 2 replies; 50+ messages in thread
From: David Howells @ 2023-03-16 18:44 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: dhowells, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeff Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm

Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> 
> It can be passed from userspace. The code anticipates that and checks
> preconditions.

Should I add a separate field in the in-kernel msghdr struct for such internal
flags?  That would also avoid putting an internal flag in the same space as
the uapi flags.
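
Purely as a sketch of what that might look like (msg_kflags is a
hypothetical name; the in-kernel struct msghdr is already distinct from
the uapi struct user_msghdr, so a new field wouldn't leak to userspace):

	struct msghdr {
		void		*msg_name;
		int		msg_namelen;
		struct iov_iter	msg_iter;
		/* ... other existing fields elided ... */
		unsigned int	msg_flags;	/* uapi MSG_* flags */
		unsigned int	msg_kflags;	/* hypothetical: kernel-only
						 * flags, e.g. MSG_SPLICE_PAGES */
	};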

David



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES
  2023-03-16 18:44   ` David Howells
@ 2023-03-16 19:00     ` Willem de Bruijn
  2023-03-21  0:38     ` David Howells
  1 sibling, 0 replies; 50+ messages in thread
From: Willem de Bruijn @ 2023-03-16 19:00 UTC (permalink / raw)
  To: David Howells, Willem de Bruijn
  Cc: dhowells, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeff Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm

David Howells wrote:
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 
> > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > 
> > It can be passed from userspace. The code anticipates that and checks
> > preconditions.
> 
> Should I add a separate field in the in-kernel msghdr struct for such internal
> flags?  That would also avoid putting an internal flag in the same space as
> the uapi flags.

That would work, if no cost to common paths that don't need it.

A not very pretty alternative would be to add an extra arg to each
sendmsg handler that is used only when called from sendpage.

There are a few other internal MSG_.. flags, such as
MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
in sendmsg, I think. Which would explain why it was clearly safe to
add them.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  2023-03-16 18:06     ` David Howells
@ 2023-03-16 19:01       ` Trond Myklebust
  2023-03-22 13:10       ` David Howells
  2023-03-22 18:15       ` [RFC PATCH] iov_iter: Add an iterator-of-iterators David Howells
  2 siblings, 0 replies; 50+ messages in thread
From: Trond Myklebust @ 2023-03-16 19:01 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexander Viro, Christoph Hellwig, Jens Axboe,
	Jeffrey Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Charles Edward Lever, linux-nfs



> On Mar 16, 2023, at 14:06, David Howells <dhowells@redhat.com> wrote:
> 
> Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
>> 1) This is code that is common to the client and the server. Why are we
>> adding unused 3 bvec slots to every client RPC call?
> 
> Fair point, but I'm trying to avoid making four+ sendmsg calls in nfsd rather
> than one.

Add an enum iter_type for ITER_ITER ? :-)

Otherwise, please just split these functions into one for knfsd and a separate one for the client.

> 
>> 2) It obfuscates the existence of these bvec slots.
> 
> True, it'd be nice to find a better way to do it.  Question is, can the client
> make use of MSG_SPLICE_PAGES also?

The requirement for O_DIRECT support means we get the stable write issues with added extra spicy sauce.

> 
>> 3) knfsd may use splice_direct_to_actor() in order to avoid copying the page
>> cache data into private buffers (it just takes a reference to the
>> pages). Using MSG_SPLICE_PAGES will presumably require it to protect those
>> pages against further writes while the socket is referencing them.
> 
> Upstream sunrpc is using sendpage with TCP.  It already has that issue.
> MSG_SPLICE_PAGES is a way of doing sendpage through sendmsg.

Fair enough. I do seem to remember a schism with the knfsd developers over that issue.

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  2023-03-16 16:17   ` Trond Myklebust
  2023-03-16 17:10     ` Chuck Lever III
  2023-03-16 17:28     ` David Howells
@ 2023-03-16 21:21     ` David Howells
  2023-03-17 15:29       ` Chuck Lever III
  2 siblings, 1 reply; 50+ messages in thread
From: David Howells @ 2023-03-16 21:21 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: dhowells, Trond Myklebust, Matthew Wilcox, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeffrey Layton, Christian Brauner,
	Linus Torvalds, netdev, linux-fsdevel, linux-kernel, linux-mm,
	Anna Schumaker, Linux NFS Mailing List

Chuck Lever III <chuck.lever@oracle.com> wrote:

> Therefore, this kind of change needs to be accompanied by both
> benchmark results and some field testing to convince me it won't
> cause harm.

Btw, what do you use to benchmark NFS performance?

David



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  2023-03-16 21:21     ` David Howells
@ 2023-03-17 15:29       ` Chuck Lever III
  0 siblings, 0 replies; 50+ messages in thread
From: Chuck Lever III @ 2023-03-17 15:29 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeffrey Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Linux NFS Mailing List


> On Mar 16, 2023, at 5:21 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
>> Therefore, this kind of change needs to be accompanied by both
>> benchmark results and some field testing to convince me it won't
>> cause harm.
> 
> Btw, what do you use to benchmark NFS performance?

It depends on what I'm trying to observe. I have only a small
handful of systems in my lab, which is why I was not able to
immediately detect the effects of the zero-copy change in my
lab. Daire has a large client cohort on a fast network, so is
able to see the impact of that kind of change quite readily.

A perhaps more interesting question is what kind of tooling
I would use to measure the performance of the proposed change.

The bottom line is whether or not applications on clients can
see a change. NFS client implementations can hide server and
network latency improvements from applications, and RPC-on-TCP
adds palpable latency as well that reduces the efficacy of
server performance optimizations.

For that I might use a multi-threaded fio workload with fixed
record sizes (2KB, 8KB, 128KB, 1MB) and then look at the
throughput numbers and latency distribution for each size.

In a single thread qd=1 test, iozone can show changes in
READ latency pretty clearly, though most folks believe qd=1
tests are junk.

I generally run such tests on 100GbE with a tmpfs or NVMe
export to take filesystem latencies out of the equation,
although that might matter more for WRITE latency if you
can keep your READ workload completely in server memory.

To measure server-side behavior without the effects of the
network or client, NFSD has a built-in trace point,
nfsd:svc_stats_latency, that records the latency in
microseconds of each RPC. Run the above workloads and
record this tracepoint (perhaps with a capture filter to
record only the latency of READ operations).

Then you can post-process the raw latencies to get an average
latency and deviation, or even look at latency distribution
to see if the shape of the outlier curve has changed. I use
awk for this.
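
(If it's useful, here's a tiny C equivalent of that awk step, offered
only as a sketch: it reads one latency in microseconds per line on
stdin and prints the mean and standard deviation; build with -lm.)

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		double x, sum = 0, sumsq = 0, mean;
		unsigned long n = 0;

		while (scanf("%lf", &x) == 1) {
			sum += x;
			sumsq += x * x;
			n++;
		}
		if (!n)
			return 1;
		mean = sum / n;
		printf("n=%lu mean=%.1fus stddev=%.1fus\n",
		       n, mean, sqrt(sumsq / n - mean * mean));
		return 0;
	}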

[ Sidebar: you can use this tracepoint to track latency
outliers too, but that's another topic. ]

Second, I might try a flame graph study to measure changes in
instruction path length, and also capture an average cycles-
per-byte-read value. Looking at CPU cache misses can often be
a rathole, but flame graphs can surface changes there too.

And lastly, you might want to visit lock_stats to see if
there is any significant change in lock contention. An
unexpected increase in lock contention can often rob
positive changes made in other areas.


My guess is that for the RQ_SPLICE_OK case, the difference
would amount to the elimination of the kernel_sendpage
calls, which are indirect, but not terribly expensive.
Those calls amount to a significant cost only on large I/O.
It might not amount to much relative to the other costs
in the READ path.

So the real purpose here would have to be refactoring to
use bvecs instead of the bespoke xdr_buf structure, and I
would like to see support for bvecs in all of our transports
(looking at you, RDMA) to make this truly worthwhile. I had
started this a while back, but lack of a bvec-based RDMA API
made it less interesting to me. It isn't clear to me yet
whether bvecs or folios should be the replacement for
xdr_buf's head/pages/tail, but I'm a paid-in-full member of
the uneducated rabble.

This might sound like a lot of pushback, but actually I am
open to discussing clean-ups in this area, including the
one you proposed. Just getting a little more careful about
this kind of change as time goes on. And it sounds like you
were already aware of the most recent previous attempt at
this kind of improvement.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES
  2023-03-16 18:44   ` David Howells
  2023-03-16 19:00     ` Willem de Bruijn
@ 2023-03-21  0:38     ` David Howells
  2023-03-21 14:22       ` Willem de Bruijn
  1 sibling, 1 reply; 50+ messages in thread
From: David Howells @ 2023-03-21  0:38 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: dhowells, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeff Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm

Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:

> David Howells wrote:
> > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > 
> > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > > 
> > > It can be passed from userspace. The code anticipates that and checks
> > > preconditions.
> > 
> > Should I add a separate field in the in-kernel msghdr struct for such internal
> > flags?  That would also avoid putting an internal flag in the same space as
> > the uapi flags.
> 
> That would work, if no cost to common paths that don't need it.

Actually, it might be tricky.  __ip_append_data() doesn't take a msghdr struct
pointer per se.  The "void *from" argument *might* point to one - but it
depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
know.

Possibly this changes if sendpage goes away.

> A not very pretty alternative would be to add an extra arg to each
> sendmsg handler that is used only when called from sendpage.
> 
> There are a few other internal MSG_.. flags, such as
> MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> in sendmsg, I think. Which would explain why it was clearly safe to
> add them.

Should those be moved across to the internal flags with MSG_SPLICE_PAGES?

David



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES
  2023-03-21  0:38     ` David Howells
@ 2023-03-21 14:22       ` Willem de Bruijn
  0 siblings, 0 replies; 50+ messages in thread
From: Willem de Bruijn @ 2023-03-21 14:22 UTC (permalink / raw)
  To: David Howells, Willem de Bruijn
  Cc: dhowells, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeff Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm

David Howells wrote:
> Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 
> > David Howells wrote:
> > > Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> > > 
> > > > The commit message mentions MSG_SPLICE_PAGES as an internal flag.
> > > > 
> > > > It can be passed from userspace. The code anticipates that and checks
> > > > preconditions.
> > > 
> > > Should I add a separate field in the in-kernel msghdr struct for such internal
> > > flags?  That would also avoid putting an internal flag in the same space as
> > > the uapi flags.
> > 
> > That would work, if no cost to common paths that don't need it.
> 
> Actually, it might be tricky.  __ip_append_data() doesn't take a msghdr struct
> pointer per se.  The "void *from" argument *might* point to one - but it
> depends on seeing a MSG_SPLICE_PAGES or MSG_ZEROCOPY flag, otherwise we don't
> know.
> 
> Possibly this changes if sendpage goes away.

Is it sufficient to mask out this bit in tcp_sendmsg_locked and
udp_sendmsg if it was passed from userspace (where it should be
ignored), and pass it through flags to callees like ip_append_data?
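
A minimal sketch of that masking approach (MSG_INTERNAL is a
hypothetical name for the set of kernel-only bits; not code from the
series):

	#define MSG_INTERNAL	(MSG_SPLICE_PAGES)

	static int example_sendmsg(struct socket *sock, struct msghdr *msg,
				   size_t size)
	{
		/* Ignore, rather than reject, kernel-only bits arriving
		 * from userspace, matching historically permissive socket
		 * types; in-kernel callers set them after this point. */
		msg->msg_flags &= ~MSG_INTERNAL;
		return sock->ops->sendmsg(sock, msg, size);
	}
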
> 
> > A not very pretty alternative would be to add an extra arg to each
> > sendmsg handler that is used only when called from sendpage.
> > 
> > There are a few other internal MSG_.. flags, such as
> > MSG_SENDPAGE_NOPOLICY. Those are all limited to sendpage, and ignored
> > in sendmsg, I think. Which would explain why it was clearly safe to
> > add them.
> 
> Should those be moved across to the internal flags with MSG_SPLICE_PAGES?

I would not include that in this patch series.



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  2023-03-16 18:06     ` David Howells
  2023-03-16 19:01       ` Trond Myklebust
@ 2023-03-22 13:10       ` David Howells
  2023-03-22 18:15       ` [RFC PATCH] iov_iter: Add an iterator-of-iterators David Howells
  2 siblings, 0 replies; 50+ messages in thread
From: David Howells @ 2023-03-22 13:10 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: dhowells, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexander Viro, Christoph Hellwig,
	Jens Axboe, Jeffrey Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Charles Edward Lever, linux-nfs

Trond Myklebust <trondmy@hammerspace.com> wrote:

> Add an enum iter_type for ITER_ITER ? :-)

Actually, that might not be such a bad idea, now that I've pondered on it some
more.  Give it an array of iterators and add a flag to each iterator to say if
it can be spliced from or not.

Once ITER_PIPE is killed off, advancing and reverting over it should be pretty
straightforward - though each iterator would also need to keep track of how
big it started off as in order that it can be reverted over.

David



^ permalink raw reply	[flat|nested] 50+ messages in thread

* [RFC PATCH] iov_iter: Add an iterator-of-iterators
  2023-03-16 18:06     ` David Howells
  2023-03-16 19:01       ` Trond Myklebust
  2023-03-22 13:10       ` David Howells
@ 2023-03-22 18:15       ` David Howells
  2023-03-22 18:47         ` Trond Myklebust
  2023-03-22 18:49         ` Matthew Wilcox
  2 siblings, 2 replies; 50+ messages in thread
From: David Howells @ 2023-03-22 18:15 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: dhowells, Matthew Wilcox, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexander Viro, Christoph Hellwig,
	Jens Axboe, Jeffrey Layton, Christian Brauner, Linus Torvalds,
	netdev, linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Charles Edward Lever, linux-nfs

Trond Myklebust <trondmy@hammerspace.com> wrote:

> Add an enum iter_type for ITER_ITER ? :-)

Well, you asked for it...  It's actually fairly straightforward once
ITER_PIPE is removed.

---
iov_iter: Add an iterator-of-iterators

Provide an I/O iterator that takes an array of iterators and iterates over
them in turn.  Then make the sunrpc service code (and thus nfsd) use it.

In this particular instance, svc_tcp_sendmsg() sets up an array of
three iterators: one for the marker+header, one for the body and an
optional one for the tail, then sets msg_iter to be an
iterator-of-iterators across them.

Signed-off-by: David Howells <dhowells@redhat.com>
---    
 include/linux/uio.h  |   19 +++-
 lib/iov_iter.c       |  233 +++++++++++++++++++++++++++++++++++++++++++++++++--
 net/sunrpc/svcsock.c |   29 +++---
 3 files changed, 258 insertions(+), 23 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 74598426edb4..321381d3d616 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -27,6 +27,7 @@ enum iter_type {
 	ITER_XARRAY,
 	ITER_DISCARD,
 	ITER_UBUF,
+	ITER_ITERLIST,
 };
 
 #define ITER_SOURCE	1	// == WRITE
@@ -43,17 +44,17 @@ struct iov_iter {
 	bool nofault;
 	bool data_source;
 	bool user_backed;
-	union {
-		size_t iov_offset;
-		int last_offset;
-	};
+	bool spliceable;
+	size_t iov_offset;
 	size_t count;
+	size_t orig_count;
 	union {
 		const struct iovec *iov;
 		const struct kvec *kvec;
 		const struct bio_vec *bvec;
 		struct xarray *xarray;
 		void __user *ubuf;
+		struct iov_iter *iterlist;
 	};
 	union {
 		unsigned long nr_segs;
@@ -104,6 +105,11 @@ static inline bool iov_iter_is_xarray(const struct iov_iter *i)
 	return iov_iter_type(i) == ITER_XARRAY;
 }
 
+static inline bool iov_iter_is_iterlist(const struct iov_iter *i)
+{
+	return iov_iter_type(i) == ITER_ITERLIST;
+}
+
 static inline unsigned char iov_iter_rw(const struct iov_iter *i)
 {
 	return i->data_source ? WRITE : READ;
@@ -238,6 +244,8 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_
 void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
 void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
 		     loff_t start, size_t count);
+void iov_iter_iterlist(struct iov_iter *i, unsigned int direction, struct iov_iter *iterlist,
+		       unsigned long nr_segs, size_t count);
 ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
 		size_t maxsize, unsigned maxpages, size_t *start,
 		iov_iter_extraction_t extraction_flags);
@@ -345,7 +353,8 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 		.user_backed = true,
 		.data_source = direction,
 		.ubuf = buf,
-		.count = count
+		.count = count,
+		.orig_count = count,
 	};
 }
 /* Flags for iov_iter_get/extract_pages*() */
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fad95e4cf372..34ce3b958b6c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -282,7 +282,8 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
 		.iov = iov,
 		.nr_segs = nr_segs,
 		.iov_offset = 0,
-		.count = count
+		.count = count,
+		.orig_count = count,
 	};
 }
 EXPORT_SYMBOL(iov_iter_init);
@@ -364,6 +365,26 @@ size_t _copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
 	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
 
+	if (unlikely(iov_iter_is_iterlist(i))) {
+		size_t copied = 0;
+
+		while (bytes && i->count) {
+			size_t part = min(bytes, i->iterlist->count), n = 0;
+
+			if (part > 0)
+				n = _copy_from_iter(addr, part, i->iterlist);
+			addr += n;
+			copied += n;
+			bytes -= n;
+			i->count -= n;
+			if (n < part || !bytes)
+				break;
+			i->iterlist++;
+			i->nr_segs--;
+		}
+		return copied;
+	}
+
 	if (user_backed_iter(i))
 		might_fault();
 	iterate_and_advance(i, bytes, base, len, off,
@@ -380,6 +401,27 @@ size_t _copy_from_iter_nocache(void *addr, size_t bytes, struct iov_iter *i)
 	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
 
+	if (unlikely(iov_iter_is_iterlist(i))) {
+		size_t copied = 0;
+
+		while (bytes && i->count) {
+			size_t part = min(bytes, i->iterlist->count), n = 0;
+
+			if (part > 0)
+				n = _copy_from_iter_nocache(addr, part,
+							    i->iterlist);
+			addr += n;
+			copied += n;
+			bytes -= n;
+			i->count -= n;
+			if (n < part || !bytes)
+				break;
+			i->iterlist++;
+			i->nr_segs--;
+		}
+		return copied;
+	}
+
 	iterate_and_advance(i, bytes, base, len, off,
 		__copy_from_user_inatomic_nocache(addr + off, base, len),
 		memcpy(addr + off, base, len)
@@ -411,6 +453,27 @@ size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
 	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
 
+	if (unlikely(iov_iter_is_iterlist(i))) {
+		size_t copied = 0;
+
+		while (bytes && i->count) {
+			size_t part = min(bytes, i->iterlist->count), n = 0;
+
+			if (part > 0)
+				n = _copy_from_iter_flushcache(addr, part,
+							       i->iterlist);
+			addr += n;
+			copied += n;
+			bytes -= n;
+			i->count -= n;
+			if (n < part || !bytes)
+				break;
+			i->iterlist++;
+			i->nr_segs--;
+		}
+		return copied;
+	}
+
 	iterate_and_advance(i, bytes, base, len, off,
 		__copy_from_user_flushcache(addr + off, base, len),
 		memcpy_flushcache(addr + off, base, len)
@@ -514,7 +577,31 @@ EXPORT_SYMBOL(iov_iter_zero);
 size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
 				  struct iov_iter *i)
 {
-	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
+	char *kaddr, *p;
+
+	if (unlikely(iov_iter_is_iterlist(i))) {
+		size_t copied = 0;
+
+		while (bytes && i->count) {
+			size_t part = min(bytes, i->iterlist->count), n = 0;
+
+			if (part > 0)
+				n = copy_page_from_iter_atomic(page, offset, part,
+							       i->iterlist);
+			offset += n;
+			copied += n;
+			bytes -= n;
+			i->count -= n;
+			if (n < part || !bytes)
+				break;
+			i->iterlist++;
+			i->nr_segs--;
+		}
+		return copied;
+	}
+
+	kaddr = kmap_atomic(page);
+	p = kaddr + offset;
 	if (!page_copy_sane(page, offset, bytes)) {
 		kunmap_atomic(kaddr);
 		return 0;
@@ -585,19 +672,49 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
 		iov_iter_bvec_advance(i, size);
 	} else if (iov_iter_is_discard(i)) {
 		i->count -= size;
+	} else if (iov_iter_is_iterlist(i)) {
+		i->count -= size;
+		for (;;) {
+			size_t part = min(size, i->iterlist->count);
+
+			if (part > 0)
+				iov_iter_advance(i->iterlist, part);
+			size -= part;
+			if (!size)
+				break;
+			i->iterlist++;
+			i->nr_segs--;
+		}
 	}
 }
 EXPORT_SYMBOL(iov_iter_advance);
 
+static void iov_iter_revert_iterlist(struct iov_iter *i, size_t unroll)
+{
+	for (;;) {
+		size_t part = min(unroll, i->iterlist->orig_count - i->iterlist->count);
+
+		if (part > 0)
+			iov_iter_revert(i->iterlist, part);
+		unroll -= part;
+		if (!unroll)
+			break;
+		i->iterlist--;
+		i->nr_segs++;
+	}
+}
+
 void iov_iter_revert(struct iov_iter *i, size_t unroll)
 {
 	if (!unroll)
 		return;
-	if (WARN_ON(unroll > MAX_RW_COUNT))
+	if (WARN_ON(unroll > i->orig_count - i->count))
 		return;
 	i->count += unroll;
 	if (unlikely(iov_iter_is_discard(i)))
 		return;
+	if (unlikely(iov_iter_is_iterlist(i)))
+		return iov_iter_revert_iterlist(i, unroll);
 	if (unroll <= i->iov_offset) {
 		i->iov_offset -= unroll;
 		return;
@@ -641,6 +758,8 @@ EXPORT_SYMBOL(iov_iter_revert);
  */
 size_t iov_iter_single_seg_count(const struct iov_iter *i)
 {
+	if (iov_iter_is_iterlist(i))
+		i = i->iterlist;
 	if (i->nr_segs > 1) {
 		if (likely(iter_is_iovec(i) || iov_iter_is_kvec(i)))
 			return min(i->count, i->iov->iov_len - i->iov_offset);
@@ -662,7 +781,8 @@ void iov_iter_kvec(struct iov_iter *i, unsigned int direction,
 		.kvec = kvec,
 		.nr_segs = nr_segs,
 		.iov_offset = 0,
-		.count = count
+		.count = count,
+		.orig_count = count,
 	};
 }
 EXPORT_SYMBOL(iov_iter_kvec);
@@ -678,7 +798,8 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction,
 		.bvec = bvec,
 		.nr_segs = nr_segs,
 		.iov_offset = 0,
-		.count = count
+		.count = count,
+		.orig_count = count,
 	};
 }
 EXPORT_SYMBOL(iov_iter_bvec);
@@ -706,6 +827,7 @@ void iov_iter_xarray(struct iov_iter *i, unsigned int direction,
 		.xarray = xarray,
 		.xarray_start = start,
 		.count = count,
+		.orig_count = count,
 		.iov_offset = 0
 	};
 }
@@ -727,11 +849,47 @@ void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count)
 		.iter_type = ITER_DISCARD,
 		.data_source = false,
 		.count = count,
+		.orig_count = count,
 		.iov_offset = 0
 	};
 }
 EXPORT_SYMBOL(iov_iter_discard);
 
+/**
+ * iov_iter_iterlist - Initialise an I/O iterator that is a list of iterators
+ * @iter: The iterator to initialise.
+ * @direction: The direction of the transfer.
+ * @iterlist: The list of iterators
+ * @nr_segs: The number of elements in the list
+ * @count: The size of the I/O buffer in bytes.
+ *
+ * Set up an I/O iterator that iterates over an array of iterators in turn.
+ * It's only available as a source iterator (for WRITE); all the iterators in
+ * the list must be sources and none of them may be of ITER_ITERLIST type.
+ */
+void iov_iter_iterlist(struct iov_iter *iter, unsigned int direction,
+		       struct iov_iter *iterlist, unsigned long nr_segs,
+		       size_t count)
+{
+	unsigned long i;
+
+	BUG_ON(direction != WRITE);
+	for (i = 0; i < nr_segs; i++) {
+		BUG_ON(iterlist[i].iter_type == ITER_ITERLIST);
+		BUG_ON(!iterlist[i].data_source);
+	}
+
+	*iter = (struct iov_iter){
+		.iter_type	= ITER_ITERLIST,
+		.data_source	= true,
+		.count		= count,
+		.orig_count	= count,
+		.iterlist	= iterlist,
+		.nr_segs	= nr_segs,
+	};
+}
+EXPORT_SYMBOL(iov_iter_iterlist);
+
 static bool iov_iter_aligned_iovec(const struct iov_iter *i, unsigned addr_mask,
 				   unsigned len_mask)
 {
@@ -879,6 +1037,15 @@ unsigned long iov_iter_alignment(const struct iov_iter *i)
 	if (iov_iter_is_xarray(i))
 		return (i->xarray_start + i->iov_offset) | i->count;
 
+	if (iov_iter_is_iterlist(i)) {
+		unsigned long align = 0;
+		unsigned int j;
+
+		for (j = 0; j < i->nr_segs; j++)
+			align |= iov_iter_alignment(&i->iterlist[j]);
+		return align;
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL(iov_iter_alignment);
@@ -1078,6 +1245,18 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
 	}
 	if (iov_iter_is_xarray(i))
 		return iter_xarray_get_pages(i, pages, maxsize, maxpages, start);
+	if (iov_iter_is_iterlist(i)) {
+		ssize_t size;
+
+		while (!i->iterlist->count) {
+			i->iterlist++;
+			i->nr_segs--;
+		}
+		size = __iov_iter_get_pages_alloc(i->iterlist, pages, maxsize, maxpages,
+						  start, extraction_flags);
+		i->count -= size;
+		return size;
+	}
 	return -EFAULT;
 }
 
@@ -1126,6 +1305,31 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
 }
 EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
 
+static size_t csum_and_copy_from_iterlist(void *addr, size_t bytes, __wsum *csum,
+					  struct iov_iter *i)
+{
+	size_t copied = 0, n;
+
+	while (i->count && i->nr_segs) {
+		struct iov_iter *j = i->iterlist;
+
+		if (j->count == 0) {
+			i->iterlist++;
+			i->nr_segs--;
+			continue;
+		}
+
+		n = csum_and_copy_from_iter(addr, bytes - copied, csum, j);
+		addr += n;
+		copied += n;
+		i->count -= n;
+		if (n == 0)
+			break;
+	}
+
+	return copied;
+}
+
 size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 			       struct iov_iter *i)
 {
@@ -1133,6 +1337,8 @@ size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
 	sum = *csum;
 	if (WARN_ON_ONCE(!i->data_source))
 		return 0;
+	if (iov_iter_is_iterlist(i))
+		return csum_and_copy_from_iterlist(addr, bytes, csum, i);
 
 	iterate_and_advance(i, bytes, base, len, off, ({
 		next = csum_and_copy_from_user(base, addr + off, len);
@@ -1236,6 +1442,21 @@ static int bvec_npages(const struct iov_iter *i, int maxpages)
 	return npages;
 }
 
+static int iterlist_npages(const struct iov_iter *i, int maxpages)
+{
+	ssize_t size = i->count;
+	const struct iov_iter *p;
+	int npages = 0;
+
+	for (p = i->iterlist; size; p++) {
+		size -= p->count;
+		npages += iov_iter_npages(p, maxpages - npages);
+		if (unlikely(npages >= maxpages))
+			return maxpages;
+	}
+	return npages;
+}
+
 int iov_iter_npages(const struct iov_iter *i, int maxpages)
 {
 	if (unlikely(!i->count))
@@ -1255,6 +1476,8 @@ int iov_iter_npages(const struct iov_iter *i, int maxpages)
 		int npages = DIV_ROUND_UP(offset + i->count, PAGE_SIZE);
 		return min(npages, maxpages);
 	}
+	if (iov_iter_is_iterlist(i))
+		return iterlist_npages(i, maxpages);
 	return 0;
 }
 EXPORT_SYMBOL(iov_iter_npages);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 1d0f0f764e16..030a1fa5171b 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1073,11 +1073,13 @@ static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
 {
 	const struct kvec *head = xdr->head;
 	const struct kvec *tail = xdr->tail;
+	struct iov_iter iters[3];
+	struct bio_vec head_bv, tail_bv;
 	struct msghdr msg = {
-		.msg_flags	= MSG_SPLICE_PAGES,
+		.msg_flags	= 0, //MSG_SPLICE_PAGES,
 	};
-	void *m, *h, *t;
-	int ret, n = xdr_buf_pagecount(xdr), size;
+	void *m, *t;
+	int ret, n = 2, size;
 
 	*sentp = 0;
 	ret = xdr_alloc_bvec(xdr, GFP_KERNEL);
@@ -1089,27 +1091,28 @@ static int svc_tcp_sendmsg(struct socket *sock, struct xdr_buf *xdr,
 	if (!m)
 		return -ENOMEM;
 
-	h = m + sizeof(marker);
-	t = h + head->iov_len;
+	memcpy(m, &marker, sizeof(marker));
+	if (head->iov_len)
+		memcpy(m + sizeof(marker), head->iov_base, head->iov_len);
+	bvec_set_virt(&head_bv, m, sizeof(marker) + head->iov_len);
+	iov_iter_bvec(&iters[0], ITER_SOURCE, &head_bv, 1,
+		      sizeof(marker) + head->iov_len);
 
-	bvec_set_virt(&xdr->bvec[-1], m, sizeof(marker) + head->iov_len);
-	n++;
+	iov_iter_bvec(&iters[1], ITER_SOURCE, xdr->bvec,
+		      xdr_buf_pagecount(xdr), xdr->page_len);
 
 	if (tail->iov_len) {
 		t = page_frag_alloc(NULL, tail->iov_len, GFP_KERNEL);
 		if (!t)
 			return -ENOMEM;
-		bvec_set_virt(&xdr->bvec[n],  t, tail->iov_len);
 		memcpy(t, tail->iov_base, tail->iov_len);
+		bvec_set_virt(&tail_bv,  t, tail->iov_len);
+		iov_iter_bvec(&iters[2], ITER_SOURCE, &tail_bv, 1, tail->iov_len);
 		n++;
 	}
 
-	memcpy(m, &marker, sizeof(marker));
-	if (head->iov_len)
-		memcpy(h, head->iov_base, head->iov_len);
-
 	size = sizeof(marker) + head->iov_len + xdr->page_len + tail->iov_len;
-	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, xdr->bvec - 1, n, size);
+	iov_iter_iterlist(&msg.msg_iter, ITER_SOURCE, iters, n, size);
 
 	ret = sock_sendmsg(sock, &msg);
 	if (ret < 0)



^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH] iov_iter: Add an iterator-of-iterators
  2023-03-22 18:15       ` [RFC PATCH] iov_iter: Add an iterator-of-iterators David Howells
@ 2023-03-22 18:47         ` Trond Myklebust
  2023-03-22 18:49         ` Matthew Wilcox
  1 sibling, 0 replies; 50+ messages in thread
From: Trond Myklebust @ 2023-03-22 18:47 UTC (permalink / raw)
  To: David Howells
  Cc: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexander Viro, Christoph Hellwig, Jens Axboe,
	Jeffrey Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Charles Edward Lever, linux-nfs



> On Mar 22, 2023, at 14:15, David Howells <dhowells@redhat.com> wrote:
> 
> Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
>> Add an enum iter_type for ITER_ITER ? :-)
> 
> Well, you asked for it...  It's actually fairly straightforward once
> ITER_PIPE is removed.
> 
> ---
> iov_iter: Add an iterator-of-iterators
> 
> Provide an I/O iterator that takes an array of iterators and iterates over
> them in turn.  Then make the sunrpc service code (and thus nfsd) use it.
> 
> In this particular instance, svc_tcp_sendmsg() sets up an array of
> three iterators: one for the marker+header, one for the body and an
> optional one for the tail, then sets msg_iter to be an
> iterator-of-iterators across them.

Cool! This is something that can be used on the receive side as well, so very useful. I can imagine it might also open up a few more use cases for ITER_XARRAY.

Thanks!
  Trond

_________________________________
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [RFC PATCH] iov_iter: Add an iterator-of-iterators
  2023-03-22 18:15       ` [RFC PATCH] iov_iter: Add an iterator-of-iterators David Howells
  2023-03-22 18:47         ` Trond Myklebust
@ 2023-03-22 18:49         ` Matthew Wilcox
  1 sibling, 0 replies; 50+ messages in thread
From: Matthew Wilcox @ 2023-03-22 18:49 UTC (permalink / raw)
  To: David Howells
  Cc: Trond Myklebust, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Alexander Viro, Christoph Hellwig, Jens Axboe,
	Jeffrey Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm, Anna Schumaker,
	Charles Edward Lever, linux-nfs

On Wed, Mar 22, 2023 at 06:15:45PM +0000, David Howells wrote:
> @@ -43,17 +44,17 @@ struct iov_iter {
>  	bool nofault;
>  	bool data_source;
>  	bool user_backed;
> -	union {
> -		size_t iov_offset;
> -		int last_offset;
> -	};
> +	bool spliceable;

We're now up to five u8s in a row here (iter_type, nofault, data_source,
user_backed, and now spliceable).  Is it time to turn some/all of them into:

	bool nofault:1;
	bool data_source:1;
	bool user_backed:1;
	bool spliceable:1;

You can't take the address of them then, but I don't believe we do that
anywhere.
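
Concretely, the one construct the bitfield form would forbid:

	struct example {
		bool nofault:1;		/* fine to set, clear and test */
		bool data_source:1;
	};

	/* But this no longer compiles:
	 *
	 *	bool *p = &e.nofault;	// error: cannot take address
	 *				// of bit-field
	 */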



^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2023-03-22 18:49 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
2023-03-16 15:25 ` [RFC PATCH 01/28] net: Declare MSG_SPLICE_PAGES internal sendmsg() flag David Howells
2023-03-16 15:25 ` [RFC PATCH 02/28] Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES David Howells
2023-03-16 17:28   ` Matthew Wilcox
2023-03-16 18:00   ` David Howells
2023-03-16 15:25 ` [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES David Howells
2023-03-16 18:37   ` Willem de Bruijn
2023-03-16 18:44   ` David Howells
2023-03-16 19:00     ` Willem de Bruijn
2023-03-21  0:38     ` David Howells
2023-03-21 14:22       ` Willem de Bruijn
2023-03-16 15:25 ` [RFC PATCH 04/28] tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES David Howells
2023-03-16 15:25 ` [RFC PATCH 05/28] tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg David Howells
2023-03-16 15:25 ` [RFC PATCH 06/28] espintcp: Inline do_tcp_sendpages() David Howells
2023-03-16 15:25 ` [RFC PATCH 07/28] tls: " David Howells
2023-03-16 15:25 ` [RFC PATCH 08/28] siw: " David Howells
2023-03-16 15:25 ` [RFC PATCH 09/28] tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked() David Howells
2023-03-16 15:26 ` [RFC PATCH 10/28] ip, udp: Support MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 11/28] udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 12/28] af_unix: Support MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 13/28] crypto: af_alg: Indent the loop in af_alg_sendmsg() David Howells
2023-03-16 15:26 ` [RFC PATCH 14/28] crypto: af_alg: Support MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 15/28] crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 16/28] splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage() David Howells
2023-03-16 15:26 ` [RFC PATCH 17/28] Remove file->f_op->sendpage David Howells
2023-03-16 15:26 ` [RFC PATCH 18/28] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit David Howells
2023-03-16 15:26 ` [RFC PATCH 19/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage David Howells
2023-03-16 15:26 ` [RFC PATCH 20/28] iscsi: " David Howells
2023-03-16 15:26 ` [RFC PATCH 21/28] tcp_bpf: Make tcp_bpf_sendpage() go through tcp_bpf_sendmsg(MSG_SPLICE_PAGES) David Howells
2023-03-16 15:26 ` [RFC PATCH 22/28] net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock() David Howells
2023-03-16 15:26 ` [RFC PATCH 23/28] algif: Remove hash_sendpage*() David Howells
2023-03-16 15:26 ` [RFC PATCH 24/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage() David Howells
2023-03-16 15:26 ` [RFC PATCH 25/28] rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage David Howells
2023-03-16 15:26 ` [RFC PATCH 26/28] dlm: " David Howells
2023-03-16 15:26 ` [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage David Howells
2023-03-16 16:17   ` Trond Myklebust
2023-03-16 17:10     ` Chuck Lever III
2023-03-16 17:28     ` David Howells
2023-03-16 17:41       ` Chuck Lever III
2023-03-16 21:21     ` David Howells
2023-03-17 15:29       ` Chuck Lever III
2023-03-16 16:24   ` David Howells
2023-03-16 17:23     ` Trond Myklebust
2023-03-16 18:06     ` David Howells
2023-03-16 19:01       ` Trond Myklebust
2023-03-22 13:10       ` David Howells
2023-03-22 18:15       ` [RFC PATCH] iov_iter: Add an iterator-of-iterators David Howells
2023-03-22 18:47         ` Trond Myklebust
2023-03-22 18:49         ` Matthew Wilcox
     [not found] ` <20230316152618.711970-29-dhowells@redhat.com>
2023-03-16 15:57   ` [RFC PATCH 28/28] sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES) Marc Kleine-Budde

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).