* [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES)
@ 2023-03-16 15:25 David Howells
From: David Howells @ 2023-03-16 15:25 UTC
  To: Matthew Wilcox, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe,
	Jeff Layton, Christian Brauner, Linus Torvalds, netdev,
	linux-fsdevel, linux-kernel, linux-mm

Hi Willy, Dave, et al.,

[NOTE! This patchset is a work in progress and some modules will not
 compile with it.]

I've been looking at how to make pipes handle the splicing in of multipage
folios and also looking to see if I could implement a suggestion from Willy
that pipe_buffers could perhaps hold a list of pages (which could make
splicing simpler - an entire splice segment would go in a single
pipe_buffer).

There are a couple of issues here:

 (1) Gifting/stealing a multipage folio is really tricky.  I think that if
     a multipage folio is gifted, the gift flag should be quietly dropped.
     Userspace has no control over what splice() and vmsplice() will see in
     the pagecache.

 (2) The sendpage op expects to be given a single page and various network
     protocols just attach that to a socket buffer.

This patchset aims to deal with the second by removing the ->sendpage()
operation and replacing it with sendmsg() and a new internal flag
MSG_SPLICE_PAGES.  As sendmsg() takes an I/O iterator, this also affords
the opportunity to pass a slew of pages in one go, rather than one at a
time.
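
For orientation, the userspace-visible operation behind this path is
splice() from a pipe into a socket.  The following is a minimal userspace
sketch, not part of the patchset: the helper name splice_via_pipe() is
hypothetical, and it assumes a Linux system with an AF_UNIX stream
socketpair and a payload small enough to fit in the pipe without blocking.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): write `len` bytes into a pipe,
 * splice() them from the pipe into one end of an AF_UNIX socketpair and
 * read them back from the other end.  Returns the number of bytes read,
 * or -1 on error.  `len` is assumed to be well below the pipe capacity.
 */
ssize_t splice_via_pipe(const char *data, size_t len, char *out, size_t outlen)
{
	int pfd[2], sfd[2];
	ssize_t n = -1;

	assert(len > 0 && outlen > 0);
	if (pipe(pfd) < 0)
		return -1;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0) {
		close(pfd[0]);
		close(pfd[1]);
		return -1;
	}

	if (write(pfd[1], data, len) == (ssize_t)len) {
		/* Move the pipe contents into the socket without a user copy. */
		ssize_t moved = splice(pfd[0], NULL, sfd[0], NULL, len, 0);

		if (moved > 0) {
			size_t want = (size_t)moved < outlen ? (size_t)moved : outlen;

			n = read(sfd[1], out, want);
		}
	}

	close(pfd[0]);
	close(pfd[1]);
	close(sfd[0]);
	close(sfd[1]);
	return n;
}
```

The userspace call is the same before and after the series; only the
kernel-side hand-off from the pipe to the socket changes.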

If MSG_SPLICE_PAGES is set, the current implementation requires that the
iterator be ITER_BVEC-type and that the pages can be retained by calling
get_page() on them.  Note that I'm accessing the bvec[] directly, but
should really use iov_iter_extract_pages() which would allow an
ITER_XARRAY-type iterator to be used also.

The patchset consists of the following parts:

 (1) Define the MSG_SPLICE_PAGES flag.

 (2) Provide a simple allocator that takes pages and splits pieces off them
     on request and returns them with a ref on the page.  Unlike with slab
     memory, the lifetime of the allocated memory is controlled by the page
     refcount.  This allows protocol bits to be included in the same bvec[]
     as the data.

 (3) Implement MSG_SPLICE_PAGES support in TCP.

 (4) Make do_tcp_sendpages() just wrap sendmsg() and then fold it into its
     various callers.

 (5) Implement MSG_SPLICE_PAGES support in IP and make udp_sendpage() just
     a wrapper around sendmsg().

 (6) Implement MSG_SPLICE_PAGES support in AF_UNIX.

 (7) Implement MSG_SPLICE_PAGES support in AF_ALG and make
     af_alg_sendpage() just a wrapper around sendmsg().

 (8) Rename pipe_to_sendpage() to pipe_to_sendmsg() and make it a wrapper
     around sendmsg().

 (9) Remove sendpage file operation.

(10) Convert siw, ceph, iscsi and tcp_bpf to use sendmsg() instead of
     tcp_sendpage().

(11) Make skb_send_sock() use sendmsg().

(12) Remove AF_ALG's hash_sendpage() as hash_sendmsg() seems to paste the
     page pointers in anyway.

(13) Convert ceph, rds, dlm and sunrpc to use sendmsg().

(14) Remove the sendpage socket operation.
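
The allocator idea in item (2) can be modelled in userspace roughly as
follows.  This is only a sketch under stated assumptions: it is not the
code in mm/zcopy_alloc.c, every name in it (model_page, model_frag_pool,
model_frag_get, model_page_put) is hypothetical, and a plain integer stands
in for the page refcount.  It shows pieces being split off a page on
request, each returned with its own ref, so that the page lives until the
last piece is released.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096

/* Userspace stand-in for a page; refcount models the page refcount. */
struct model_page {
	int refcount;
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Allocator state: the page being carved up and the next free offset. */
struct model_frag_pool {
	struct model_page *page;
	size_t offset;
};

/* Drop one ref; the page is freed when the last ref goes away. */
void model_page_put(struct model_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		free(p);
}

/*
 * Split `len` bytes off the current page, starting a new page if the
 * request does not fit.  On success, *pagep is the backing page and the
 * caller owns one ref on it.  Returns NULL on failure.
 */
void *model_frag_get(struct model_frag_pool *pool, size_t len,
		     struct model_page **pagep)
{
	void *p;

	if (len == 0 || len > MODEL_PAGE_SIZE)
		return NULL;

	if (!pool->page || pool->offset + len > MODEL_PAGE_SIZE) {
		struct model_page *fresh = calloc(1, sizeof(*fresh));

		if (!fresh)
			return NULL;
		fresh->refcount = 1;		/* the pool's own ref */
		if (pool->page)
			model_page_put(pool->page); /* pool is done with old page */
		pool->page = fresh;
		pool->offset = 0;
	}

	p = pool->page->data + pool->offset;
	pool->offset += len;
	pool->page->refcount++;			/* ref handed to the caller */
	*pagep = pool->page;
	return p;
}
```

In this model a caller releases a piece with model_page_put() on the page
it was handed, rather than with a free() on the pointer, which is the
property that distinguishes it from slab-style memory.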

This leaves MSG_SPLICE_PAGES still to be implemented in AF_TLS, AF_KCM,
AF_SMC and Chelsio-TLS, which I'm going to need help with, and the use of
kernel_sendpage() in AF_KCM, AF_SMC and NVMe over TCP still to be cleaned
up.


I'm wondering about how best to proceed further:

 - Rather than providing a special allocator, should protocols implementing
   MSG_SPLICE_PAGES recognise pages that belong to the slab allocator and
   copy the content of those to the skbuff and only directly attach the
   source page if it's not a slab page?

 - Should MSG_SPLICE_PAGES work with ITER_XARRAY as well as ITER_BVEC?

 - Should MSG_SPLICE_PAGES just be a hint and get ignored if the conditions
   for using it are not met rather than giving an error?

 - Should pages attached to a pipe be pinned (ie. FOLL_PIN) rather than
   simply ref'd (ie. FOLL_GET) so that the DIO issue doesn't occur on
   spliced pages?

 - Similarly, should pages undergoing zerocopy be pinned when attached to
   an skbuff rather than being simply ref'd?  I have a patch to note in the
   bottom two bits of the frag page pointer if they are pinned, ref'd or
   neither.
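
The "bottom two bits" idea in the last point can be illustrated with a
generic pointer-tagging sketch.  This is not the patch referred to above:
the enum values, the mask and the function names are all assumptions made
for illustration, and the only requirement is that the pointer is at least
4-byte aligned so that its two low bits are free.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of how a frag's page is held. */
enum frag_hold {
	FRAG_NONE   = 0,	/* neither ref'd nor pinned */
	FRAG_REFFED = 1,	/* holds a plain ref */
	FRAG_PINNED = 2,	/* holds a pin */
};

#define FRAG_HOLD_MASK ((uintptr_t)3)

/* Pack the hold state into the two low bits of an aligned pointer. */
uintptr_t frag_encode(void *page, enum frag_hold hold)
{
	uintptr_t v = (uintptr_t)page;

	assert((v & FRAG_HOLD_MASK) == 0);	/* needs >= 4-byte alignment */
	return v | (uintptr_t)hold;
}

/* Recover the original pointer by masking off the tag bits. */
void *frag_page(uintptr_t v)
{
	return (void *)(v & ~FRAG_HOLD_MASK);
}

/* Recover the hold state from the tag bits. */
enum frag_hold frag_hold_of(uintptr_t v)
{
	return (enum frag_hold)(v & FRAG_HOLD_MASK);
}
```

With such an encoding, the release path can inspect frag_hold_of() to
decide whether to unpin, drop a ref or do nothing.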


I have tested AF_UNIX splicing - which, surprisingly, seems nearly twice as
fast - TCP splicing, the siw driver (softIWarp RDMA with nfs and cifs),
sunrpc (with nfsd) and UDP (using a patched rxrpc).

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-sendpage

David

David Howells (28):
  net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
  Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES
  tcp: Support MSG_SPLICE_PAGES
  tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
  tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around
    tcp_sendmsg
  espintcp: Inline do_tcp_sendpages()
  tls: Inline do_tcp_sendpages()
  siw: Inline do_tcp_sendpages()
  tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
  ip, udp: Support MSG_SPLICE_PAGES
  udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
  af_unix: Support MSG_SPLICE_PAGES
  crypto: af_alg: Indent the loop in af_alg_sendmsg()
  crypto: af_alg: Support MSG_SPLICE_PAGES
  crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES
  splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage()
  Remove file->f_op->sendpage
  siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit
  ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  iscsi: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  tcp_bpf: Make tcp_bpf_sendpage() go through
    tcp_bpf_sendmsg(MSG_SPLICE_PAGES)
  net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock()
  algif: Remove hash_sendpage*()
  ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage()
  rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  dlm: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage
  sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
  sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)

 Documentation/networking/scaling.rst     |   4 +-
 crypto/Kconfig                           |   1 +
 crypto/af_alg.c                          | 137 +++++--------
 crypto/algif_aead.c                      |  40 ++--
 crypto/algif_hash.c                      |  66 ------
 crypto/algif_rng.c                       |   2 -
 crypto/algif_skcipher.c                  |  22 +-
 drivers/infiniband/sw/siw/siw_qp_tx.c    | 224 +++++----------------
 drivers/target/iscsi/iscsi_target_util.c |  14 +-
 fs/dlm/lowcomms.c                        |  10 +-
 fs/splice.c                              |  42 ++--
 include/linux/fs.h                       |   3 -
 include/linux/net.h                      |   8 -
 include/linux/socket.h                   |   1 +
 include/linux/splice.h                   |   2 +
 include/linux/zcopy_alloc.h              |  16 ++
 include/net/inet_common.h                |   2 -
 include/net/sock.h                       |   6 -
 include/net/tcp.h                        |   2 -
 include/net/tls.h                        |   2 +-
 mm/Makefile                              |   2 +-
 mm/zcopy_alloc.c                         | 129 ++++++++++++
 net/appletalk/ddp.c                      |   1 -
 net/atm/pvc.c                            |   1 -
 net/atm/svc.c                            |   1 -
 net/ax25/af_ax25.c                       |   1 -
 net/caif/caif_socket.c                   |   2 -
 net/can/bcm.c                            |   1 -
 net/can/isotp.c                          |   1 -
 net/can/j1939/socket.c                   |   1 -
 net/can/raw.c                            |   1 -
 net/ceph/messenger_v1.c                  |  58 ++----
 net/ceph/messenger_v2.c                  |  89 ++-------
 net/core/skbuff.c                        |  49 +++--
 net/core/sock.c                          |  35 +---
 net/dccp/ipv4.c                          |   1 -
 net/dccp/ipv6.c                          |   1 -
 net/ieee802154/socket.c                  |   2 -
 net/ipv4/af_inet.c                       |  21 --
 net/ipv4/ip_output.c                     |  89 ++++++++-
 net/ipv4/tcp.c                           | 244 +++++------------------
 net/ipv4/tcp_bpf.c                       |  72 ++-----
 net/ipv4/tcp_ipv4.c                      |   1 -
 net/ipv4/udp.c                           |  54 -----
 net/ipv4/udp_impl.h                      |   2 -
 net/ipv4/udplite.c                       |   1 -
 net/ipv6/af_inet6.c                      |   3 -
 net/ipv6/raw.c                           |   1 -
 net/ipv6/tcp_ipv6.c                      |   1 -
 net/key/af_key.c                         |   1 -
 net/l2tp/l2tp_ip.c                       |   1 -
 net/l2tp/l2tp_ip6.c                      |   1 -
 net/llc/af_llc.c                         |   1 -
 net/mctp/af_mctp.c                       |   1 -
 net/mptcp/protocol.c                     |   2 -
 net/netlink/af_netlink.c                 |   1 -
 net/netrom/af_netrom.c                   |   1 -
 net/packet/af_packet.c                   |   2 -
 net/phonet/socket.c                      |   2 -
 net/qrtr/af_qrtr.c                       |   1 -
 net/rds/af_rds.c                         |   1 -
 net/rds/tcp_send.c                       |  80 ++++----
 net/rose/af_rose.c                       |   1 -
 net/rxrpc/af_rxrpc.c                     |   1 -
 net/sctp/protocol.c                      |   1 -
 net/socket.c                             |  74 +------
 net/sunrpc/svcsock.c                     |  70 ++-----
 net/sunrpc/xdr.c                         |  24 ++-
 net/tipc/socket.c                        |   3 -
 net/tls/tls_main.c                       |  24 ++-
 net/unix/af_unix.c                       | 223 +++++++--------------
 net/vmw_vsock/af_vsock.c                 |   3 -
 net/x25/af_x25.c                         |   1 -
 net/xdp/xsk.c                            |   1 -
 net/xfrm/espintcp.c                      |  10 +-
 75 files changed, 687 insertions(+), 1313 deletions(-)
 create mode 100644 include/linux/zcopy_alloc.h
 create mode 100644 mm/zcopy_alloc.c



Thread overview: 81+ messages
2023-03-16 15:25 [RFC PATCH 00/28] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES) David Howells
2023-03-16 15:25 ` [RFC PATCH 01/28] net: Declare MSG_SPLICE_PAGES internal sendmsg() flag David Howells
2023-03-16 15:25 ` [RFC PATCH 02/28] Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES David Howells
2023-03-16 17:28   ` Matthew Wilcox
2023-03-16 18:00   ` David Howells
2023-03-16 15:25 ` [RFC PATCH 03/28] tcp: Support MSG_SPLICE_PAGES David Howells
2023-03-16 18:37   ` Willem de Bruijn
2023-03-16 18:44   ` David Howells
2023-03-16 19:00     ` Willem de Bruijn
2023-03-21  0:38     ` David Howells
2023-03-21 14:22       ` Willem de Bruijn
2023-03-22 13:56         ` [RFC PATCH 0/3] net: Drop size arg from ->sendmsg() and pass msghdr into __ip{,6}_append_data() David Howells
2023-03-22 13:56           ` [RFC PATCH 1/3] net: Drop the size argument from ->sendmsg() David Howells
2023-03-22 13:56             ` David Howells
2023-03-22 13:56             ` David Howells
2023-03-22 14:13             ` [RFC,1/3] " bluez.test.bot
2023-03-23  1:11             ` bluez.test.bot
2023-03-22 13:56           ` [RFC PATCH 2/3] ip: Make __ip{,6}_append_data() and co. take a msghdr* David Howells
2023-03-22 17:25             ` kernel test robot
2023-03-22 22:12             ` kernel test robot
2023-03-23  1:25             ` kernel test robot
2023-03-23  1:25             ` kernel test robot
2023-03-22 13:56           ` [RFC PATCH 3/3] net: Declare MSG_SPLICE_PAGES internal sendmsg() flag David Howells
2023-03-23  1:17           ` [RFC PATCH 0/3] net: Drop size arg from ->sendmsg() and pass msghdr into __ip{,6}_append_data() Willem de Bruijn
2023-03-16 15:25 ` [RFC PATCH 04/28] tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES David Howells
2023-03-16 15:25 ` [RFC PATCH 05/28] tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg David Howells
2023-03-16 15:25 ` [RFC PATCH 06/28] espintcp: Inline do_tcp_sendpages() David Howells
2023-03-16 15:25 ` [RFC PATCH 07/28] tls: " David Howells
2023-03-16 15:25 ` [RFC PATCH 08/28] siw: " David Howells
2023-03-20 10:53   ` Bernard Metzler
2023-03-20 11:08   ` David Howells
2023-03-20 12:27     ` Bernard Metzler
2023-03-20 13:13     ` David Howells
2023-03-20 13:18       ` Bernard Metzler
2023-03-16 15:25 ` [RFC PATCH 09/28] tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked() David Howells
2023-03-16 15:26 ` [RFC PATCH 10/28] ip, udp: Support MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 11/28] udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 12/28] af_unix: Support MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 13/28] crypto: af_alg: Indent the loop in af_alg_sendmsg() David Howells
2023-03-16 15:26 ` [RFC PATCH 14/28] crypto: af_alg: Support MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 15/28] crypto: af_alg: Convert af_alg_sendpage() to use MSG_SPLICE_PAGES David Howells
2023-03-16 15:26 ` [RFC PATCH 16/28] splice, net: Use sendmsg(MSG_SPLICE_PAGES) rather than ->sendpage() David Howells
2023-03-16 15:26 ` [RFC PATCH 17/28] Remove file->f_op->sendpage David Howells
2023-03-16 15:26 ` [RFC PATCH 18/28] siw: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage to transmit David Howells
2023-03-20 13:39   ` Bernard Metzler
2023-03-16 15:26 ` [RFC PATCH 19/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage David Howells
2023-03-16 15:26 ` [RFC PATCH 20/28] iscsi: " David Howells
2023-03-16 15:26 ` [RFC PATCH 21/28] tcp_bpf: Make tcp_bpf_sendpage() go through tcp_bpf_sendmsg(MSG_SPLICE_PAGES) David Howells
2023-03-16 15:26 ` [RFC PATCH 22/28] net: Use sendmsg(MSG_SPLICE_PAGES) not sendpage in skb_send_sock() David Howells
2023-03-16 15:26 ` [RFC PATCH 23/28] algif: Remove hash_sendpage*() David Howells
2023-03-17  2:40   ` Herbert Xu
2023-03-24 16:47     ` David Howells
2023-03-25  6:00       ` Herbert Xu
2023-03-25  7:44       ` David Howells
2023-03-25  9:21         ` Herbert Xu
2023-03-16 15:26 ` [RFC PATCH 24/28] ceph: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage() David Howells
2023-03-16 15:26 ` [RFC PATCH 25/28] rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpage David Howells
2023-03-16 15:26 ` [RFC PATCH 26/28] dlm: " David Howells
2023-03-16 15:26   ` [Cluster-devel] " David Howells
2023-03-16 15:26 ` [RFC PATCH 27/28] sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage David Howells
2023-03-16 16:17   ` Trond Myklebust
2023-03-16 17:10     ` Chuck Lever III
2023-03-16 17:28     ` David Howells
2023-03-16 17:41       ` Chuck Lever III
2023-03-16 21:21     ` David Howells
2023-03-17 15:29       ` Chuck Lever III
2023-03-16 16:24   ` David Howells
2023-03-16 17:23     ` Trond Myklebust
2023-03-16 18:06     ` David Howells
2023-03-16 19:01       ` Trond Myklebust
2023-03-22 13:10       ` David Howells
2023-03-22 18:15       ` [RFC PATCH] iov_iter: Add an iterator-of-iterators David Howells
2023-03-22 18:47         ` Trond Myklebust
2023-03-22 18:49         ` Matthew Wilcox
2023-03-16 15:26 ` [RFC PATCH 28/28] sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES) David Howells
2023-03-16 15:26   ` David Howells
2023-03-16 15:26   ` David Howells
2023-03-16 15:26   ` David Howells
2023-03-16 15:57   ` Marc Kleine-Budde
2023-03-16 15:57     ` Marc Kleine-Budde
2023-03-16 15:57     ` Marc Kleine-Budde
