* [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1
From: David Howells @ 2023-04-05 16:53 UTC
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

Here's the first tranche of patches towards providing a MSG_SPLICE_PAGES
internal sendmsg flag that is intended to replace the ->sendpage() op with
calls to sendmsg().  MSG_SPLICE_PAGES is a hint that tells the protocol
that it should splice the pages supplied if it can and copy them if not.

This will allow splice to pass multiple pages in a single call and allow
certain parts of higher protocols (e.g. sunrpc, iwarp) to pass an entire
message in one go rather than having to send it piecemeal.  This should
also make it easier to handle the splicing of multipage folios.
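
To give a feel for the API, in-kernel use is expected to look something
like the following - a minimal sketch rather than code lifted from this
series, and the bio_vec setup is an assumption based on the usual
iterator helpers:

	struct bio_vec bv;
	struct msghdr msg = {
		.msg_flags = MSG_SPLICE_PAGES | MSG_MORE,
	};
	int ret;

	/* Point the iterator at the page to be spliced. */
	bvec_set_page(&bv, page, len, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, len);

	/* The protocol takes its own page ref if it splices the page
	 * and falls back to copying the data if it can't.
	 */
	ret = sock_sendmsg(sock, &msg);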

This set consists of the following parts:

 (1) Provide a set of sample programs in samples/net/ that can be used to
     drive splice() and sendfile() with TCP/TCP6, UDP/UDP6, TLS over
     TCP/TCP6, UNIX and ALG hash/skcipher sockets for testing.

 (2) Define the MSG_SPLICE_PAGES flag and prevent sys_sendmsg() from being
     able to set it.

 (3) Overhaul the page_frag_alloc_align() allocator:

     (a) Split it out from mm/page_alloc.c into its own file,
     mm/page_frag_alloc.c.

     (b) Make it use multipage folios rather than compound pages.

     (c) Give it per-cpu buckets to allocate from so no locking is
     required.

     (d) The netdev_alloc_cache and the napi fragment cache are then
     recast in terms of this and some private allocators are removed.

     I'm not sure that the existing allocator is 100% thread safe.  (A
     short usage sketch of the page_frag API follows this list.)

 (4) Implement MSG_SPLICE_PAGES support in TCP.

 (5) Make do_tcp_sendpages() just wrap sendmsg() and then fold it into its
     various callers.

 (6) Implement MSG_SPLICE_PAGES support in IP and make udp_sendpage() just
     a wrapper around sendmsg().

 (7) Implement MSG_SPLICE_PAGES support in IP6/UDP6.

 (8) Implement MSG_SPLICE_PAGES support in AF_UNIX.

 (9) Make AF_UNIX copy unspliceable pages.
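
As referenced in (3), here's a short sketch of how the page_frag API is
used - a minimal illustration, not code from this series:

	/* The cache is typically long-lived, e.g. per-cpu or per-device. */
	static struct page_frag_cache frag_cache;
	void *p;

	/* Carve a 256-byte fragment out of the cache's current page,
	 * refilling the cache from the page allocator as needed.
	 */
	p = page_frag_alloc(&frag_cache, 256, GFP_ATOMIC);
	if (!p)
		return -ENOMEM;

	/* ... use the fragment ... */

	page_frag_free(p);	/* Drop the fragment's ref on its page. */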

I've pushed the patches here also:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=sendpage-1

The follow-on patches are on branch iov-sendpage on the same tree.

David

Changes
=======
ver #4)
 - Added some sample socket-I/O programs into samples/net/.
 - Fix a missing page-get in AF_KCM.
 - Init the sgtable and mark the end in AF_ALG when calling
   netfs_extract_iter_to_sg().
 - Add a destructor function for page frag caches prior to generalising
   the allocator and making it per-cpu.

ver #3)
 - Dropped the iterator-of-iterators patch.
 - Only expunge MSG_SPLICE_PAGES in sys_send[m]msg, not sys_recv[m]msg.
 - Split MSG_SPLICE_PAGES code in __ip_append_data() out into helper
   functions.
 - Implement MSG_SPLICE_PAGES support in __ip6_append_data() using the
   above helper functions.
 - Rename 'xlength' to 'initial_length'.
 - Minimise the changes to sunrpc for the moment.
 - Don't give -EOPNOTSUPP if NETIF_F_SG not available, just copy instead.
 - Implemented MSG_SPLICE_PAGES support in the TLS, Chelsio-TLS and AF_KCM
   code.

ver #2)
 - Overhauled the page_frag_alloc() allocator: large folios and per-cpu.
   - Got rid of my own zerocopy allocator.
 - Use iov_iter_extract_pages() rather than poking in iter->bvec.
 - Made page splicing fall back to page copying on a page-by-page basis.
 - Made splice_to_socket() pass 16 pipe buffers at a time.
 - Made AF_ALG/hash use finup/digest where possible in sendmsg.
 - Added an iterator-of-iterators, ITER_ITERLIST.
 - Made sunrpc use the iterator-of-iterators.
 - Converted more drivers.

Link: https://lore.kernel.org/r/20230316152618.711970-1-dhowells@redhat.com/ # v1
Link: https://lore.kernel.org/r/20230329141354.516864-1-dhowells@redhat.com/ # v2
Link: https://lore.kernel.org/r/20230331160914.1608208-1-dhowells@redhat.com/ # v3

David Howells (20):
  net: Add samples for network I/O and splicing
  net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
  mm: Move the page fragment allocator from page_alloc.c into its own
    file
  mm: Make the page_frag_cache allocator use multipage folios
  mm: Make the page_frag_cache allocator use per-cpu
  tcp: Support MSG_SPLICE_PAGES
  tcp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
  tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around
    tcp_sendmsg
  espintcp: Inline do_tcp_sendpages()
  tls: Inline do_tcp_sendpages()
  siw: Inline do_tcp_sendpages()
  tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
  udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
  ip: Remove ip_append_page()
  ip, udp: Support MSG_SPLICE_PAGES
  ip, udp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  ip6, udp6: Support MSG_SPLICE_PAGES
  af_unix: Support MSG_SPLICE_PAGES
  af_unix: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data

 drivers/infiniband/sw/siw/siw_qp_tx.c      |  17 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |  19 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.h |   2 -
 drivers/nvme/host/tcp.c                    |  19 +-
 drivers/nvme/target/tcp.c                  |  22 +-
 include/linux/gfp.h                        |  17 +-
 include/linux/mm_types.h                   |  13 +-
 include/linux/socket.h                     |   3 +
 include/net/ip.h                           |   3 +-
 include/net/tcp.h                          |   2 -
 include/net/tls.h                          |   2 +-
 mm/Makefile                                |   2 +-
 mm/page_alloc.c                            | 126 ----------
 mm/page_frag_alloc.c                       | 201 ++++++++++++++++
 net/core/skbuff.c                          |  32 +--
 net/ipv4/ip_output.c                       | 202 ++++++----------
 net/ipv4/tcp.c                             | 260 ++++++++-------------
 net/ipv4/tcp_bpf.c                         |  20 +-
 net/ipv4/udp.c                             |  50 +---
 net/ipv6/ip6_output.c                      |  12 +
 net/socket.c                               |   2 +
 net/tls/tls_main.c                         |  24 +-
 net/unix/af_unix.c                         | 115 +++++++--
 net/xfrm/espintcp.c                        |  10 +-
 samples/Kconfig                            |   6 +
 samples/Makefile                           |   1 +
 samples/net/Makefile                       |  13 ++
 samples/net/alg-encrypt.c                  | 201 ++++++++++++++++
 samples/net/alg-hash.c                     | 143 ++++++++++++
 samples/net/splice-out.c                   | 142 +++++++++++
 samples/net/tcp-send.c                     | 154 ++++++++++++
 samples/net/tcp-sink.c                     |  76 ++++++
 samples/net/tls-send.c                     | 176 ++++++++++++++
 samples/net/tls-sink.c                     |  98 ++++++++
 samples/net/udp-send.c                     | 151 ++++++++++++
 samples/net/udp-sink.c                     |  82 +++++++
 samples/net/unix-send.c                    | 147 ++++++++++++
 samples/net/unix-sink.c                    |  51 ++++
 38 files changed, 2017 insertions(+), 599 deletions(-)
 create mode 100644 mm/page_frag_alloc.c
 create mode 100644 samples/net/Makefile
 create mode 100644 samples/net/alg-encrypt.c
 create mode 100644 samples/net/alg-hash.c
 create mode 100644 samples/net/splice-out.c
 create mode 100644 samples/net/tcp-send.c
 create mode 100644 samples/net/tcp-sink.c
 create mode 100644 samples/net/tls-send.c
 create mode 100644 samples/net/tls-sink.c
 create mode 100644 samples/net/udp-send.c
 create mode 100644 samples/net/udp-sink.c
 create mode 100644 samples/net/unix-send.c
 create mode 100644 samples/net/unix-sink.c



* [PATCH net-next v4 01/20] net: Add samples for network I/O and splicing
From: David Howells @ 2023-04-05 16:53 UTC
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm, Boris Pismenny, John Fastabend, Herbert Xu

Add some small sample programs for doing network I/O including splicing.

There are three IPv4/IPv6 servers: tcp-sink, tls-sink and udp-sink.  They
can be given a port number by passing "-p <port>" and will listen on an
IPv6 socket unless given a "-4" flag, in which case they'll listen for IPv4
only.

There are three IPv4/IPv6 clients: tcp-send, tls-send and udp-send.  They
are given a file to get data from (or "-" for stdin) and the name of a
server to talk to.  They can also be given a port number by passing "-p
<port>", "-4" or "-6" to force the use of IPv4 or IPv6, "-s" to indicate
they should use splice/sendfile to transfer the data and "-z" to specify
how much data to copy.  If "-s" is given, the input will be spliced if it's
a pipe and sent with sendfile() otherwise.

A driver program, splice-out, is provided to splice data from a file/stdin
to stdout and can be used to pipe into the aforementioned clients for
testing splice.  This takes the name of the file to splice from (or "-" for
stdin).  It can also be given "-w <size>" to indicate the maximum size of
each splice, "-k <size>" if a chunk of the input should be skipped between
splices to prevent coalescence and "-s" if sendfile should be used instead
of splice.

Additionally, there is an AF_UNIX client and server.  These are similar to
the IPv[46] programs, except both take a socket path and there is no option
to change the port number.

And then there are two AF_ALG clients (there is no server).  These are
similar to the other clients, except no destination is specified.  One
exercises skcipher encryption and the other hashing.

Examples include:

	./splice-out -w0x400 /foo/16K 4K | ./alg-encrypt -s -
	./splice-out -w0x400 /foo/1M | ./unix-send -s - /tmp/foo
	./splice-out -w0x400 /foo/16K 16K -w1 | ./tls-send -s6 -z16K - servbox
	./tcp-send /bin/ls 192.168.6.1
	./udp-send -4 -p5555 /foo/4K localhost

where, for example, /foo/16K is a 16KiB file.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: netdev@vger.kernel.org
---
 samples/Kconfig           |   6 ++
 samples/Makefile          |   1 +
 samples/net/Makefile      |  13 +++
 samples/net/alg-encrypt.c | 201 ++++++++++++++++++++++++++++++++++++++
 samples/net/alg-hash.c    | 143 +++++++++++++++++++++++++++
 samples/net/splice-out.c  | 142 +++++++++++++++++++++++++++
 samples/net/tcp-send.c    | 154 +++++++++++++++++++++++++++++
 samples/net/tcp-sink.c    |  76 ++++++++++++++
 samples/net/tls-send.c    | 176 +++++++++++++++++++++++++++++++++
 samples/net/tls-sink.c    |  98 +++++++++++++++++++
 samples/net/udp-send.c    | 151 ++++++++++++++++++++++++++++
 samples/net/udp-sink.c    |  82 ++++++++++++++++
 samples/net/unix-send.c   | 147 ++++++++++++++++++++++++++++
 samples/net/unix-sink.c   |  51 ++++++++++
 14 files changed, 1441 insertions(+)
 create mode 100644 samples/net/Makefile
 create mode 100644 samples/net/alg-encrypt.c
 create mode 100644 samples/net/alg-hash.c
 create mode 100644 samples/net/splice-out.c
 create mode 100644 samples/net/tcp-send.c
 create mode 100644 samples/net/tcp-sink.c
 create mode 100644 samples/net/tls-send.c
 create mode 100644 samples/net/tls-sink.c
 create mode 100644 samples/net/udp-send.c
 create mode 100644 samples/net/udp-sink.c
 create mode 100644 samples/net/unix-send.c
 create mode 100644 samples/net/unix-sink.c

diff --git a/samples/Kconfig b/samples/Kconfig
index 30ef8bd48ba3..14051e9f7532 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -273,6 +273,12 @@ config SAMPLE_CORESIGHT_SYSCFG
 	  This demonstrates how a user may create their own CoreSight
 	  configurations and easily load them into the system at runtime.
 
+config SAMPLE_NET
+	bool "Build example programs that drive network protocols"
+	depends on NET
+	help
+	  Build example userspace programs that drive network protocols.
+
 source "samples/rust/Kconfig"
 
 endif # SAMPLES
diff --git a/samples/Makefile b/samples/Makefile
index 7cb632ef88ee..22c1d6244eaf 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK_TEST)	+= kmemleak/
 obj-$(CONFIG_SAMPLE_CORESIGHT_SYSCFG)	+= coresight/
 obj-$(CONFIG_SAMPLE_FPROBE)		+= fprobe/
 obj-$(CONFIG_SAMPLES_RUST)		+= rust/
+obj-$(CONFIG_SAMPLE_NET)		+= net/
diff --git a/samples/net/Makefile b/samples/net/Makefile
new file mode 100644
index 000000000000..0ccd68a36edf
--- /dev/null
+++ b/samples/net/Makefile
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: GPL-2.0-only
+userprogs-always-y += \
+	alg-hash \
+	alg-encrypt \
+	splice-out \
+	tcp-send \
+	tcp-sink \
+	tls-send \
+	tls-sink \
+	udp-send \
+	udp-sink \
+	unix-send \
+	unix-sink
diff --git a/samples/net/alg-encrypt.c b/samples/net/alg-encrypt.c
new file mode 100644
index 000000000000..34a62a9c480a
--- /dev/null
+++ b/samples/net/alg-encrypt.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* AF_ALG skcipher encryption test
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <getopt.h>
+#include <limits.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/un.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/sendfile.h>
+#include <linux/if_alg.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+#define min(x, y) ((x) < (y) ? (x) : (y))
+
+static unsigned char buffer[4096 * 32] __attribute__((aligned(4096)));
+static unsigned char iv[16];
+static unsigned char key[16];
+
+static const struct sockaddr_alg sa = {
+	.salg_family	= AF_ALG,
+	.salg_type	= "skcipher",
+	.salg_name	= "cbc(aes)",
+};
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "alg-encrypt [-s] [-z<size>] <file>|-\n");
+	exit(2);
+}
+
+static void algif_add_set_op(struct msghdr *msg, unsigned int op)
+{
+	struct cmsghdr *__cmsg;
+
+	__cmsg = msg->msg_control + msg->msg_controllen;
+	__cmsg->cmsg_len	= CMSG_LEN(sizeof(unsigned int));
+	__cmsg->cmsg_level	= SOL_ALG;
+	__cmsg->cmsg_type	= ALG_SET_OP;
+	*(unsigned int *)CMSG_DATA(__cmsg) = op;
+	msg->msg_controllen += CMSG_ALIGN(__cmsg->cmsg_len);
+}
+
+static void algif_add_set_iv(struct msghdr *msg, const void *iv, size_t ivlen)
+{
+	struct af_alg_iv *ivbuf;
+	struct cmsghdr *__cmsg;
+
+	printf("%zx\n", msg->msg_controllen);
+	__cmsg = msg->msg_control + msg->msg_controllen;
+	__cmsg->cmsg_len	= CMSG_LEN(sizeof(*ivbuf) + ivlen);
+	__cmsg->cmsg_level	= SOL_ALG;
+	__cmsg->cmsg_type	= ALG_SET_IV;
+	ivbuf = (struct af_alg_iv *)CMSG_DATA(__cmsg);
+	ivbuf->ivlen = ivlen;
+	memcpy(ivbuf->iv, iv, ivlen);
+	msg->msg_controllen += CMSG_ALIGN(__cmsg->cmsg_len);
+}
+
+int main(int argc, char *argv[])
+{
+	struct msghdr msg;
+	struct stat st;
+	const char *filename;
+	unsigned char ctrl[4096];
+	ssize_t r, w, o, ret;
+	size_t size = LONG_MAX, total = 0, i, out = 160;
+	char *end;
+	bool use_sendfile = false, all = true;
+	int opt, alg, sock, fd = 0;
+
+	while ((opt = getopt(argc, argv, "sz:")) != EOF) {
+		switch (opt) {
+		case 's':
+			use_sendfile = true;
+			break;
+		case 'z':
+			size = strtoul(optarg, &end, 0);
+			switch (*end) {
+			case 'K':
+			case 'k':
+				size *= 1024;
+				break;
+			case 'M':
+			case 'm':
+				size *= 1024 * 1024;
+				break;
+			}
+			all = false;
+			break;
+		default:
+			format();
+		}
+	}
+
+	argc -= optind;
+	argv += optind;
+	if (argc != 1)
+		format();
+	filename = argv[0];
+
+	alg = socket(AF_ALG, SOCK_SEQPACKET, 0);
+	OSERROR(alg, "AF_ALG");
+	OSERROR(bind(alg, (struct sockaddr *)&sa, sizeof(sa)), "bind");
+	OSERROR(setsockopt(alg, SOL_ALG, ALG_SET_KEY, key, sizeof(key)), "ALG_SET_KEY");
+	sock = accept(alg, NULL, 0);
+	OSERROR(sock, "accept");
+
+	if (strcmp(filename, "-") != 0) {
+		fd = open(filename, O_RDONLY);
+		OSERROR(fd, filename);
+		OSERROR(fstat(fd, &st), filename);
+		size = st.st_size;
+	} else {
+		OSERROR(fstat(fd, &st), filename);
+	}
+
+	memset(&msg, 0, sizeof(msg));
+	msg.msg_control = ctrl;
+	algif_add_set_op(&msg, ALG_OP_ENCRYPT);
+	algif_add_set_iv(&msg, iv, sizeof(iv));
+
+	OSERROR(sendmsg(sock, &msg, MSG_MORE), "sock/sendmsg");
+
+	if (!use_sendfile) {
+		bool more = false;
+
+		while (size) {
+			r = read(fd, buffer, sizeof(buffer));
+			OSERROR(r, filename);
+			if (r == 0)
+				break;
+			size -= r;
+
+			o = 0;
+			do {
+				more = size > 0;
+				w = send(sock, buffer + o, r - o,
+					 more ? MSG_MORE : 0);
+				OSERROR(w, "sock/send");
+				total += w;
+				o += w;
+			} while (o < r);
+		}
+
+		if (more)
+			send(sock, NULL, 0, 0);
+	} else if (S_ISFIFO(st.st_mode)) {
+		do {
+			r = splice(fd, NULL, sock, NULL, size,
+				   size > 0 ? SPLICE_F_MORE : 0);
+			OSERROR(r, "sock/splice");
+			size -= r;
+			total += r;
+		} while (r > 0 && size > 0);
+		if (size && !all) {
+			fprintf(stderr, "Short splice\n");
+			exit(1);
+		}
+	} else {
+		r = sendfile(sock, fd, NULL, size);
+		OSERROR(r, "sock/sendfile");
+		if (r != size) {
+			fprintf(stderr, "Short sendfile\n");
+			exit(1);
+		}
+		total = r;
+	}
+
+	while (total > 0) {
+		ret = read(sock, buffer, min(sizeof(buffer), total));
+		OSERROR(ret, "sock/read");
+		if (ret == 0)
+			break;
+		total -= ret;
+
+		if (out > 0) {
+			ret = min(out, ret);
+			out -= ret;
+			for (i = 0; i < ret; i++)
+				printf("%02x", (unsigned char)buffer[i]);
+		}
+		printf("...\n");
+	}
+
+	OSERROR(close(sock), "sock/close");
+	OSERROR(close(alg), "alg/close");
+	OSERROR(close(fd), "close");
+	return 0;
+}
diff --git a/samples/net/alg-hash.c b/samples/net/alg-hash.c
new file mode 100644
index 000000000000..842a8016acb3
--- /dev/null
+++ b/samples/net/alg-hash.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* AF_ALG hash test
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <getopt.h>
+#include <limits.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/un.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/sendfile.h>
+#include <linux/if_alg.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+
+static unsigned char buffer[4096 * 32] __attribute__((aligned(4096)));
+
+static const struct sockaddr_alg sa = {
+	.salg_family	= AF_ALG,
+	.salg_type	= "hash",
+	.salg_name	= "sha1",
+};
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "alg-hash [-s] [-z<size>] <file>|-\n");
+	exit(2);
+}
+
+int main(int argc, char *argv[])
+{
+	struct stat st;
+	const char *filename;
+	ssize_t r, w, o, ret;
+	size_t size = LONG_MAX, i;
+	char *end;
+	bool use_sendfile = false;
+	int opt, alg, sock, fd = 0;
+
+	while ((opt = getopt(argc, argv, "sz:")) != EOF) {
+		switch (opt) {
+		case 's':
+			use_sendfile = true;
+			break;
+		case 'z':
+			size = strtoul(optarg, &end, 0);
+			switch (*end) {
+			case 'K':
+			case 'k':
+				size *= 1024;
+				break;
+			case 'M':
+			case 'm':
+				size *= 1024 * 1024;
+				break;
+			}
+			break;
+		default:
+			format();
+		}
+	}
+
+	argc -= optind;
+	argv += optind;
+	if (argc != 1)
+		format();
+	filename = argv[0];
+
+	alg = socket(AF_ALG, SOCK_SEQPACKET, 0);
+	OSERROR(alg, "AF_ALG");
+	OSERROR(bind(alg, (struct sockaddr *)&sa, sizeof(sa)), "bind");
+	sock = accept(alg, NULL, 0);
+	OSERROR(sock, "accept");
+
+	if (strcmp(filename, "-") != 0) {
+		fd = open(filename, O_RDONLY);
+		OSERROR(fd, filename);
+		OSERROR(fstat(fd, &st), filename);
+		size = st.st_size;
+	} else {
+		OSERROR(fstat(fd, &st), filename);
+	}
+
+	if (!use_sendfile) {
+		bool more = false;
+
+		while (size) {
+			r = read(fd, buffer, sizeof(buffer));
+			OSERROR(r, filename);
+			if (r == 0)
+				break;
+			size -= r;
+
+			o = 0;
+			do {
+				more = size > 0;
+				w = send(sock, buffer + o, r - o,
+					 more ? MSG_MORE : 0);
+				OSERROR(w, "sock/send");
+				o += w;
+			} while (o < r);
+		}
+
+		if (more)
+			send(sock, NULL, 0, 0);
+	} else if (S_ISFIFO(st.st_mode)) {
+		r = splice(fd, NULL, sock, NULL, size, 0);
+		OSERROR(r, "sock/splice");
+		if (r != size) {
+			fprintf(stderr, "Short splice\n");
+			exit(1);
+		}
+	} else {
+		r = sendfile(sock, fd, NULL, size);
+		OSERROR(r, "sock/sendfile");
+		if (r != size) {
+			fprintf(stderr, "Short sendfile\n");
+			exit(1);
+		}
+	}
+
+	ret = read(sock, buffer, sizeof(buffer));
+	OSERROR(ret, "sock/read");
+
+	for (i = 0; i < ret; i++)
+		printf("%02x", (unsigned char)buffer[i]);
+	printf("\n");
+
+	OSERROR(close(sock), "sock/close");
+	OSERROR(close(alg), "alg/close");
+	OSERROR(close(fd), "close");
+	return 0;
+}
diff --git a/samples/net/splice-out.c b/samples/net/splice-out.c
new file mode 100644
index 000000000000..07bc0d774779
--- /dev/null
+++ b/samples/net/splice-out.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* Splice or sendfile from the given file/stdin to stdout.
+ *
+ * Format: splice-out [-s] <file>|- [<size>]
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <getopt.h>
+#include <sys/stat.h>
+#include <sys/sendfile.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+#define min(x, y) ((x) < (y) ? (x) : (y))
+
+static unsigned char buffer[4096] __attribute__((aligned(4096)));
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "splice-out [-kN][-s][-wN] <file>|- [<size>]\n");
+	exit(2);
+}
+
+int main(int argc, char *argv[])
+{
+	const char *filename;
+	struct stat st;
+	ssize_t r;
+	size_t size = 1024 * 1024, skip = 0, unit = 0, part;
+	char *end;
+	bool use_sendfile = false, all = true;
+	int opt, fd = 0;
+
+	while ((opt = getopt(argc, argv, "k:sw:")),
+	       opt != -1) {
+		switch (opt) {
+		case 'k':
+			/* Skip size - prevent coalescence. */
+			skip = strtoul(optarg, &end, 0);
+			if (skip < 1 || skip >= 4096) {
+				fprintf(stderr, "-kN must be 0<N<4096\n");
+				exit(2);
+			}
+			break;
+		case 's':
+			use_sendfile = true;
+			break;
+		case 'w':
+			/* Write unit size */
+			unit = strtoul(optarg, &end, 0);
+			if (!unit) {
+				fprintf(stderr, "-wN must be >0\n");
+				exit(2);
+			}
+			switch (*end) {
+			case 'K':
+			case 'k':
+				unit *= 1024;
+				break;
+			case 'M':
+			case 'm':
+				unit *= 1024 * 1024;
+				break;
+			}
+			break;
+		default:
+			format();
+		}
+	}
+
+	argc -= optind;
+	argv += optind;
+
+	if (argc != 1 && argc != 2)
+		format();
+
+	filename = argv[0];
+	if (argc == 2) {
+		size = strtoul(argv[1], &end, 0);
+		switch (*end) {
+		case 'K':
+		case 'k':
+			size *= 1024;
+			break;
+		case 'M':
+		case 'm':
+			size *= 1024 * 1024;
+			break;
+		}
+		all = false;
+	}
+
+	OSERROR(fstat(1, &st), "stdout");
+	if (!S_ISFIFO(st.st_mode)) {
+		fprintf(stderr, "stdout must be a pipe\n");
+		exit(3);
+	}
+
+	if (strcmp(filename, "-") != 0) {
+		fd = open(filename, O_RDONLY);
+		OSERROR(fd, filename);
+		OSERROR(fstat(fd, &st), filename);
+		if (!all && size > st.st_size) {
+			fprintf(stderr, "%s: Specified size larger than file\n", filename);
+			exit(3);
+		}
+	}
+
+	do {
+		if (skip) {
+			part = skip;
+			do {
+				r = read(fd, buffer, part);
+				OSERROR(r, filename);
+				part -= r;
+			} while (part > 0 && r > 0);
+		}
+
+		part = unit ? min(size, unit) : size;
+		if (use_sendfile) {
+			r = sendfile(1, fd, NULL, part);
+			OSERROR(r, "sendfile");
+		} else {
+			r = splice(fd, NULL, 1, NULL, part, 0);
+			OSERROR(r, "splice");
+		}
+		if (!all)
+			size -= r;
+	} while (r > 0 && size > 0);
+
+	OSERROR(close(fd), "close");
+	return 0;
+}
diff --git a/samples/net/tcp-send.c b/samples/net/tcp-send.c
new file mode 100644
index 000000000000..153105f6a30a
--- /dev/null
+++ b/samples/net/tcp-send.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * TCP send client.  Pass -s to splice.
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <getopt.h>
+#include <limits.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <netdb.h>
+#include <netinet/in.h>
+#include <sys/stat.h>
+#include <sys/sendfile.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+
+static unsigned char buffer[4096] __attribute__((aligned(4096)));
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "tcp-send [-46s][-p<port>][-z<size>] <file>|- <server>\n");
+	exit(2);
+}
+
+int main(int argc, char *argv[])
+{
+	struct addrinfo *addrs = NULL, hints = {};
+	struct stat st;
+	const char *filename, *sockname, *service = "5555";
+	ssize_t r, w, o;
+	size_t size = LONG_MAX;
+	char *end;
+	bool use_sendfile = false;
+	int opt, sock, fd = 0, gai;
+
+	hints.ai_family   = AF_UNSPEC;
+	hints.ai_socktype = SOCK_STREAM;
+
+	while ((opt = getopt(argc, argv, "46p:sz:")) != EOF) {
+		switch (opt) {
+		case '4':
+			hints.ai_family = AF_INET;
+			break;
+		case '6':
+			hints.ai_family = AF_INET6;
+			break;
+		case 'p':
+			service = optarg;
+			break;
+		case 's':
+			use_sendfile = true;
+			break;
+		case 'z':
+			size = strtoul(optarg, &end, 0);
+			switch (*end) {
+			case 'K':
+			case 'k':
+				size *= 1024;
+				break;
+			case 'M':
+			case 'm':
+				size *= 1024 * 1024;
+				break;
+			}
+			break;
+		default:
+			format();
+		}
+	}
+
+	argc -= optind;
+	argv += optind;
+	if (argc != 2)
+		format();
+	filename = argv[0];
+	sockname = argv[1];
+
+	gai = getaddrinfo(sockname, service, &hints, &addrs);
+	if (gai) {
+		fprintf(stderr, "%s: %s\n", sockname, gai_strerror(gai));
+		exit(3);
+	}
+
+	if (!addrs) {
+		fprintf(stderr, "%s: No addresses\n", sockname);
+		exit(3);
+	}
+
+	sockname = addrs->ai_canonname;
+	sock = socket(addrs->ai_family, addrs->ai_socktype, addrs->ai_protocol);
+	OSERROR(sock, "socket");
+	OSERROR(connect(sock, addrs->ai_addr, addrs->ai_addrlen), "connect");
+
+	if (strcmp(filename, "-") != 0) {
+		fd = open(filename, O_RDONLY);
+		OSERROR(fd, filename);
+		OSERROR(fstat(fd, &st), filename);
+		if (size > st.st_size)
+			size = st.st_size;
+	} else {
+		OSERROR(fstat(fd, &st), filename);
+	}
+
+	if (!use_sendfile) {
+		bool more = false;
+
+		while (size) {
+			r = read(fd, buffer, sizeof(buffer));
+			OSERROR(r, filename);
+			if (r == 0)
+				break;
+			size -= r;
+
+			o = 0;
+			do {
+				more = size > 0;
+				w = send(sock, buffer + o, r - o,
+					 more ? MSG_MORE : 0);
+				OSERROR(w, "sock/send");
+				o += w;
+			} while (o < r);
+		}
+
+		if (more)
+			send(sock, NULL, 0, 0);
+	} else if (S_ISFIFO(st.st_mode)) {
+		r = splice(fd, NULL, sock, NULL, size, 0);
+		OSERROR(r, "sock/splice");
+		if (r != size) {
+			fprintf(stderr, "Short splice\n");
+			exit(1);
+		}
+	} else {
+		r = sendfile(sock, fd, NULL, size);
+		OSERROR(r, "sock/sendfile");
+		if (r != size) {
+			fprintf(stderr, "Short sendfile\n");
+			exit(1);
+		}
+	}
+
+	OSERROR(close(sock), "sock/close");
+	OSERROR(close(fd), "close");
+	return 0;
+}
diff --git a/samples/net/tcp-sink.c b/samples/net/tcp-sink.c
new file mode 100644
index 000000000000..33d949d0e9aa
--- /dev/null
+++ b/samples/net/tcp-sink.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * TCP sink server
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <getopt.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <netinet/in.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+
+static unsigned char buffer[512 * 1024] __attribute__((aligned(4096)));
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "tcp-sink [-4][-p<port>]\n");
+	exit(2);
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned int port = 5555;
+	bool ipv6 = true;
+	int opt, server_sock, sock;
+
+
+	while ((opt = getopt(argc, argv, "4p:")) != EOF) {
+		switch (opt) {
+		case '4':
+			ipv6 = false;
+			break;
+		case 'p':
+			port = atoi(optarg);
+			break;
+		default:
+			format();
+		}
+	}
+
+	if (!ipv6) {
+		struct sockaddr_in sin = {
+			.sin_family = AF_INET,
+			.sin_port   = htons(port),
+		};
+		server_sock = socket(AF_INET, SOCK_STREAM, 0);
+		OSERROR(server_sock, "socket");
+		OSERROR(bind(server_sock, (struct sockaddr *)&sin, sizeof(sin)), "bind");
+		OSERROR(listen(server_sock, 1), "listen");
+	} else {
+		struct sockaddr_in6 sin6 = {
+			.sin6_family = AF_INET6,
+			.sin6_port   = htons(port),
+		};
+		server_sock = socket(AF_INET6, SOCK_STREAM, 0);
+		OSERROR(server_sock, "socket");
+		OSERROR(bind(server_sock, (struct sockaddr *)&sin6, sizeof(sin6)), "bind");
+		OSERROR(listen(server_sock, 1), "listen");
+	}
+
+	for (;;) {
+		sock = accept(server_sock, NULL, NULL);
+		if (sock != -1) {
+			while (read(sock, buffer, sizeof(buffer)) > 0) {}
+			close(sock);
+		}
+	}
+}
diff --git a/samples/net/tls-send.c b/samples/net/tls-send.c
new file mode 100644
index 000000000000..b3b8a0a3b41f
--- /dev/null
+++ b/samples/net/tls-send.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * TLS-over-TCP send client.  Pass -s to splice.
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <getopt.h>
+#include <limits.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <netdb.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <sys/stat.h>
+#include <sys/sendfile.h>
+#include <linux/tls.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+
+static unsigned char buffer[4096] __attribute__((aligned(4096)));
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "tls-send [-46s][-p<port>][-z<size>] <file>|- <server>\n");
+	exit(2);
+}
+
+static void set_tls(int sock)
+{
+	struct tls12_crypto_info_aes_gcm_128 crypto_info;
+
+	crypto_info.info.version = TLS_1_2_VERSION;
+	crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128;
+	memset(crypto_info.iv,		0, TLS_CIPHER_AES_GCM_128_IV_SIZE);
+	memset(crypto_info.rec_seq,	0, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
+	memset(crypto_info.key,		0, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+	memset(crypto_info.salt,	0, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
+
+	OSERROR(setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")),
+		"TCP_ULP");
+	OSERROR(setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)),
+		"TLS_TX");
+	OSERROR(setsockopt(sock, SOL_TLS, TLS_RX, &crypto_info, sizeof(crypto_info)),
+		"TLS_RX");
+}
+
+int main(int argc, char *argv[])
+{
+	struct addrinfo *addrs = NULL, hints = {};
+	struct stat st;
+	const char *filename, *sockname, *service = "5556";
+	ssize_t r, w, o;
+	size_t size = LONG_MAX;
+	char *end;
+	bool use_sendfile = false;
+	int opt, sock, fd = 0, gai;
+
+	hints.ai_family   = AF_UNSPEC;
+	hints.ai_socktype = SOCK_STREAM;
+
+	while ((opt = getopt(argc, argv, "46p:sz:")) != EOF) {
+		switch (opt) {
+		case '4':
+			hints.ai_family = AF_INET;
+			break;
+		case '6':
+			hints.ai_family = AF_INET6;
+			break;
+		case 'p':
+			service = optarg;
+			break;
+		case 's':
+			use_sendfile = true;
+			break;
+		case 'z':
+			size = strtoul(optarg, &end, 0);
+			switch (*end) {
+			case 'K':
+			case 'k':
+				size *= 1024;
+				break;
+			case 'M':
+			case 'm':
+				size *= 1024 * 1024;
+				break;
+			}
+			break;
+		default:
+			format();
+		}
+	}
+
+	argc -= optind;
+	argv += optind;
+	if (argc != 2)
+		format();
+	filename = argv[0];
+	sockname = argv[1];
+
+	gai = getaddrinfo(sockname, service, &hints, &addrs);
+	if (gai) {
+		fprintf(stderr, "%s: %s\n", sockname, gai_strerror(gai));
+		exit(3);
+	}
+
+	if (!addrs) {
+		fprintf(stderr, "%s: No addresses\n", sockname);
+		exit(3);
+	}
+
+	sockname = addrs->ai_canonname;
+	sock = socket(addrs->ai_family, addrs->ai_socktype, addrs->ai_protocol);
+	OSERROR(sock, "socket");
+	OSERROR(connect(sock, addrs->ai_addr, addrs->ai_addrlen), "connect");
+	set_tls(sock);
+
+	if (strcmp(filename, "-") != 0) {
+		fd = open(filename, O_RDONLY);
+		OSERROR(fd, filename);
+		OSERROR(fstat(fd, &st), filename);
+		if (size > st.st_size)
+			size = st.st_size;
+	} else {
+		OSERROR(fstat(fd, &st), filename);
+	}
+
+	if (!use_sendfile) {
+		bool more = false;
+
+		while (size) {
+			r = read(fd, buffer, sizeof(buffer));
+			OSERROR(r, filename);
+			if (r == 0)
+				break;
+			size -= r;
+
+			o = 0;
+			do {
+				more = size > 0;
+				w = send(sock, buffer + o, r - o,
+					 more ? MSG_MORE : 0);
+				OSERROR(w, "sock/send");
+				o += w;
+			} while (o < r);
+		}
+
+		if (more)
+			send(sock, NULL, 0, 0);
+	} else if (S_ISFIFO(st.st_mode)) {
+		r = splice(fd, NULL, sock, NULL, size, 0);
+		OSERROR(r, "sock/splice");
+		if (r != size) {
+			fprintf(stderr, "Short splice\n");
+			exit(1);
+		}
+	} else {
+		r = sendfile(sock, fd, NULL, size);
+		OSERROR(r, "sock/sendfile");
+		if (r != size) {
+			fprintf(stderr, "Short sendfile\n");
+			exit(1);
+		}
+	}
+
+	OSERROR(close(sock), "sock/close");
+	OSERROR(close(fd), "close");
+	return 0;
+}
diff --git a/samples/net/tls-sink.c b/samples/net/tls-sink.c
new file mode 100644
index 000000000000..1d6d4ed07101
--- /dev/null
+++ b/samples/net/tls-sink.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * TLS-over-TCP sink server
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <getopt.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <linux/tls.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+
+static unsigned char buffer[512 * 1024] __attribute__((aligned(4096)));
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "tls-sink [-4][-p<port>]\n");
+	exit(2);
+}
+
+static void set_tls(int sock)
+{
+	struct tls12_crypto_info_aes_gcm_128 crypto_info;
+
+	crypto_info.info.version = TLS_1_2_VERSION;
+	crypto_info.info.cipher_type = TLS_CIPHER_AES_GCM_128;
+	memset(crypto_info.iv,		0, TLS_CIPHER_AES_GCM_128_IV_SIZE);
+	memset(crypto_info.rec_seq,	0, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
+	memset(crypto_info.key,		0, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
+	memset(crypto_info.salt,	0, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
+
+	OSERROR(setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")),
+		"TCP_ULP");
+	OSERROR(setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)),
+		"TLS_TX");
+	OSERROR(setsockopt(sock, SOL_TLS, TLS_RX, &crypto_info, sizeof(crypto_info)),
+		"TLS_RX");
+}
+
+int main(int argc, char *argv[])
+{
+	unsigned int port = 5556;
+	bool ipv6 = true;
+	int opt, server_sock, sock;
+
+
+	while ((opt = getopt(argc, argv, "4p:")) != EOF) {
+		switch (opt) {
+		case '4':
+			ipv6 = false;
+			break;
+		case 'p':
+			port = atoi(optarg);
+			break;
+		default:
+			format();
+		}
+	}
+
+	if (!ipv6) {
+		struct sockaddr_in sin = {
+			.sin_family = AF_INET,
+			.sin_port   = htons(port),
+		};
+		server_sock = socket(AF_INET, SOCK_STREAM, 0);
+		OSERROR(server_sock, "socket");
+		OSERROR(bind(server_sock, (struct sockaddr *)&sin, sizeof(sin)), "bind");
+		OSERROR(listen(server_sock, 1), "listen");
+	} else {
+		struct sockaddr_in6 sin6 = {
+			.sin6_family = AF_INET6,
+			.sin6_port   = htons(port),
+		};
+		server_sock = socket(AF_INET6, SOCK_STREAM, 0);
+		OSERROR(server_sock, "socket");
+		OSERROR(bind(server_sock, (struct sockaddr *)&sin6, sizeof(sin6)), "bind");
+		OSERROR(listen(server_sock, 1), "listen");
+	}
+
+	for (;;) {
+		sock = accept(server_sock, NULL, NULL);
+		if (sock != -1) {
+			set_tls(sock);
+			while (read(sock, buffer, sizeof(buffer)) > 0) {}
+			close(sock);
+		}
+	}
+}
diff --git a/samples/net/udp-send.c b/samples/net/udp-send.c
new file mode 100644
index 000000000000..31abd6b2d9fd
--- /dev/null
+++ b/samples/net/udp-send.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * UDP send client.  Pass -s to splice.
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <getopt.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <netdb.h>
+#include <netinet/in.h>
+#include <sys/stat.h>
+#include <sys/sendfile.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+#define min(x, y) ((x) < (y) ? (x) : (y))
+
+static unsigned char buffer[65536] __attribute__((aligned(4096)));
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "udp-send [-46s][-p<port>][-z<size>] <file>|- <server>\n");
+	exit(2);
+}
+
+int main(int argc, char *argv[])
+{
+	struct addrinfo *addrs = NULL, hints = {};
+	struct stat st;
+	const char *filename, *sockname, *service = "5555";
+	unsigned int len;
+	ssize_t r, o, size = 65535;
+	char *end;
+	bool use_sendfile = false;
+	int opt, sock, fd = 0, gai;
+
+	hints.ai_family   = AF_UNSPEC;
+	hints.ai_socktype = SOCK_DGRAM;
+
+	while ((opt = getopt(argc, argv, "46p:sz:")) != EOF) {
+		switch (opt) {
+		case '4':
+			hints.ai_family = AF_INET;
+			break;
+		case '6':
+			hints.ai_family = AF_INET6;
+			break;
+		case 'p':
+			service = optarg;
+			break;
+		case 's':
+			use_sendfile = true;
+			break;
+		case 'z':
+			size = strtoul(optarg, &end, 0);
+			switch (*end) {
+			case 'K':
+			case 'k':
+				size *= 1024;
+				break;
+			}
+			if (size > 65535) {
+				fprintf(stderr, "Too much data for UDP packet\n");
+				exit(2);
+			}
+			break;
+		default:
+			format();
+		}
+	}
+
+	argc -= optind;
+	argv += optind;
+	if (argc != 2)
+		format();
+	filename = argv[0];
+	sockname = argv[1];
+
+	gai = getaddrinfo(sockname, service, &hints, &addrs);
+	if (gai) {
+		fprintf(stderr, "%s: %s\n", sockname, gai_strerror(gai));
+		exit(3);
+	}
+
+	if (!addrs) {
+		fprintf(stderr, "%s: No addresses\n", sockname);
+		exit(3);
+	}
+
+	sockname = addrs->ai_canonname;
+	sock = socket(addrs->ai_family, addrs->ai_socktype, addrs->ai_protocol);
+	OSERROR(sock, "socket");
+	OSERROR(connect(sock, addrs->ai_addr, addrs->ai_addrlen), "connect");
+
+	if (strcmp(filename, "-") != 0) {
+		fd = open(filename, O_RDONLY);
+		OSERROR(fd, filename);
+		OSERROR(fstat(fd, &st), filename);
+		if (size > st.st_size)
+			size = st.st_size;
+	} else {
+		OSERROR(fstat(fd, &st), filename);
+	}
+
+	len = htonl(size);
+	OSERROR(send(sock, &len, 4, MSG_MORE), "sock/send");
+
+	if (!use_sendfile) {
+		while (size) {
+			r = read(fd, buffer, sizeof(buffer));
+			OSERROR(r, filename);
+			if (r == 0)
+				break;
+			size -= r;
+
+			o = 0;
+			do {
+				ssize_t w = send(sock, buffer + o, r - o,
+						 size > 0 ? MSG_MORE : 0);
+				OSERROR(w, "sock/send");
+				o += w;
+			} while (o < r);
+		}
+	} else if (S_ISFIFO(st.st_mode)) {
+		r = splice(fd, NULL, sock, NULL, size, 0);
+		OSERROR(r, "sock/splice");
+		if (r != size) {
+			fprintf(stderr, "Short splice\n");
+			exit(1);
+		}
+	} else {
+		r = sendfile(sock, fd, NULL, size);
+		OSERROR(r, "sock/sendfile");
+		if (r != size) {
+			fprintf(stderr, "Short sendfile\n");
+			exit(1);
+		}
+	}
+
+	OSERROR(close(sock), "sock/close");
+	OSERROR(close(fd), "close");
+	return 0;
+}
diff --git a/samples/net/udp-sink.c b/samples/net/udp-sink.c
new file mode 100644
index 000000000000..b98f45b64296
--- /dev/null
+++ b/samples/net/udp-sink.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * UDP sink server
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <getopt.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <netinet/in.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+
+static unsigned char buffer[512 * 1024] __attribute__((aligned(4096)));
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "udp-sink [-4][-p<port>]\n");
+	exit(2);
+}
+
+int main(int argc, char *argv[])
+{
+	struct iovec iov[1] = {
+		[0] = {
+			.iov_base	= buffer,
+			.iov_len	= sizeof(buffer),
+		},
+	};
+	struct msghdr msg = {
+		.msg_iov	= iov,
+		.msg_iovlen	= 1,
+	};
+	unsigned int port = 5555;
+	bool ipv6 = true;
+	int opt, sock;
+
+	while ((opt = getopt(argc, argv, "4p:")) != EOF) {
+		switch (opt) {
+		case '4':
+			ipv6 = false;
+			break;
+		case 'p':
+			port = atoi(optarg);
+			break;
+		default:
+			format();
+		}
+	}
+
+	if (!ipv6) {
+		struct sockaddr_in sin = {
+			.sin_family = AF_INET,
+			.sin_port   = htons(port),
+		};
+		sock = socket(AF_INET, SOCK_DGRAM, 0);
+		OSERROR(sock, "socket");
+		OSERROR(bind(sock, (struct sockaddr *)&sin, sizeof(sin)), "bind");
+	} else {
+		struct sockaddr_in6 sin6 = {
+			.sin6_family = AF_INET6,
+			.sin6_port   = htons(port),
+		};
+		sock = socket(AF_INET6, SOCK_DGRAM, 0);
+		OSERROR(sock, "socket");
+		OSERROR(bind(sock, (struct sockaddr *)&sin6, sizeof(sin6)), "bind");
+	}
+
+	for (;;) {
+		ssize_t r;
+
+		r = recvmsg(sock, &msg, 0);
+		printf("rx %zd\n", r);
+	}
+}
diff --git a/samples/net/unix-send.c b/samples/net/unix-send.c
new file mode 100644
index 000000000000..88fae776985c
--- /dev/null
+++ b/samples/net/unix-send.c
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * AF_UNIX stream send client.  Pass -s to splice.
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <getopt.h>
+#include <limits.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/un.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/sendfile.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+#define min(x, y) ((x) < (y) ? (x) : (y))
+
+static unsigned char buffer[4096] __attribute__((aligned(4096)));
+
+static __attribute__((noreturn))
+void format(void)
+{
+	fprintf(stderr, "unix-send [-s] [-z<size>] <file>|- <socket-file>\n");
+	exit(2);
+}
+
+int main(int argc, char *argv[])
+{
+	struct sockaddr_un sun = { .sun_family = AF_UNIX, };
+	struct stat st;
+	const char *filename, *sockname;
+	ssize_t r, w, o, size = LONG_MAX;
+	size_t plen, total = 0;
+	char *end;
+	bool use_sendfile = false, all = true;
+	int opt, sock, fd = 0;
+
+	while ((opt = getopt(argc, argv, "sz:")) != EOF) {
+		switch (opt) {
+		case 's':
+			use_sendfile = true;
+			break;
+		case 'z':
+			size = strtoul(optarg, &end, 0);
+			switch (*end) {
+			case 'K':
+			case 'k':
+				size *= 1024;
+				break;
+			case 'M':
+			case 'm':
+				size *= 1024 * 1024;
+				break;
+			}
+			all = false;
+			break;
+		default:
+			format();
+		}
+	}
+
+	argc -= optind;
+	argv += optind;
+	if (argc != 2)
+		format();
+	filename = argv[0];
+	sockname = argv[1];
+
+	plen = strlen(sockname);
+	if (plen == 0 || plen > sizeof(sun.sun_path) - 1) {
+		fprintf(stderr, "socket filename too short or too long\n");
+		exit(2);
+	}
+	memcpy(sun.sun_path, sockname, plen + 1);
+
+	sock = socket(AF_UNIX, SOCK_STREAM, 0);
+	OSERROR(sock, "socket");
+	OSERROR(connect(sock, (struct sockaddr *)&sun, sizeof(sun)), "connect");
+
+	if (strcmp(filename, "-") != 0) {
+		fd = open(filename, O_RDONLY);
+		OSERROR(fd, filename);
+		OSERROR(fstat(fd, &st), filename);
+		if (size > st.st_size)
+			size = st.st_size;
+	} else {
+		OSERROR(fstat(fd, &st), filename);
+	}
+
+	if (!use_sendfile) {
+		bool more = false;
+
+		while (size) {
+			r = read(fd, buffer, min(sizeof(buffer), size));
+			OSERROR(r, filename);
+			if (r == 0)
+				break;
+			size -= r;
+
+			o = 0;
+			do {
+				more = size > 0;
+				w = send(sock, buffer + o, r - o,
+					 more ? MSG_MORE : 0);
+				OSERROR(w, "sock/send");
+				o += w;
+				total += w;
+			} while (o < r);
+		}
+
+		if (more)
+			send(sock, NULL, 0, 0);
+	} else if (S_ISFIFO(st.st_mode)) {
+		do {
+			r = splice(fd, NULL, sock, NULL, size,
+				   size > 0 ? SPLICE_F_MORE : 0);
+			OSERROR(r, "sock/splice");
+			size -= r;
+			total += r;
+		} while (r > 0 && size > 0);
+		if (size && !all) {
+			fprintf(stderr, "Short splice\n");
+			exit(1);
+		}
+	} else {
+		r = sendfile(sock, fd, NULL, size);
+		OSERROR(r, "sock/sendfile");
+		if (r != size) {
+			fprintf(stderr, "Short sendfile\n");
+			exit(1);
+		}
+		total += r;
+	}
+
+	printf("Sent %zu bytes\n", total);
+	OSERROR(close(sock), "sock/close");
+	OSERROR(close(fd), "close");
+	return 0;
+}
diff --git a/samples/net/unix-sink.c b/samples/net/unix-sink.c
new file mode 100644
index 000000000000..3c75979dc52a
--- /dev/null
+++ b/samples/net/unix-sink.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * UNIX stream sink server
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/un.h>
+#include <sys/socket.h>
+
+#define OSERROR(X, Y) do { if ((long)(X) == -1) { perror(Y); exit(1); } } while(0)
+
+static unsigned char buffer[512 * 1024] __attribute__((aligned(4096)));
+
+int main(int argc, char *argv[])
+{
+	struct sockaddr_un sun = { .sun_family = AF_UNIX, };
+	size_t plen;
+	int server_sock, sock;
+
+	if (argc != 2) {
+		fprintf(stderr, "unix-sink <socket-file>\n");
+		exit(2);
+	}
+
+	plen = strlen(argv[1]);
+	if (plen == 0 || plen > sizeof(sun.sun_path) - 1) {
+		fprintf(stderr, "socket filename too short or too long\n");
+		exit(2);
+	}
+	memcpy(sun.sun_path, argv[1], plen + 1);
+
+	server_sock = socket(AF_UNIX, SOCK_STREAM, 0);
+	OSERROR(server_sock, "socket");
+	OSERROR(bind(server_sock, (struct sockaddr *)&sun, sizeof(sun)), "bind");
+	OSERROR(listen(server_sock, 1), "listen");
+
+	for (;;) {
+		sock = accept(server_sock, NULL, NULL);
+		if (sock != -1) {
+			while (read(sock, buffer, sizeof(buffer)) > 0) {}
+			close(sock);
+		}
+	}
+}



* [PATCH net-next v4 02/20] net: Declare MSG_SPLICE_PAGES internal sendmsg() flag
From: David Howells @ 2023-04-05 16:53 UTC
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm, Willem de Bruijn

Declare MSG_SPLICE_PAGES, an internal sendmsg() flag that hints to a
network protocol that it should splice pages from the source iterator if
it can, rather than copying the data.  This flag is added to a list that
is cleared by the sendmsg syscalls on entry.

This is intended as a replacement for the ->sendpage() op, allowing a way
to splice in several multipage folios in one go.
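
Note that userspace can't use this: the flag isn't exposed in the uapi
headers, and the value is masked off at the start of the syscall.  A
hypothetical userspace fragment, with the flag's value taken from the
patch below:

	/* 0x8000000 == MSG_SPLICE_PAGES; cleared by the kernel on entry,
	 * so this behaves exactly like sendmsg(fd, &msg, 0).
	 */
	sendmsg(fd, &msg, 0x8000000);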

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/linux/socket.h | 3 +++
 net/socket.c           | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 13c3a237b9c9..bd1cc3238851 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -327,6 +327,7 @@ struct ucred {
 					  */
 
 #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
+#define MSG_SPLICE_PAGES 0x8000000	/* Splice the pages from the iterator in sendmsg() */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
 #define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
 					   descriptor received through
@@ -337,6 +338,8 @@ struct ucred {
 #define MSG_CMSG_COMPAT	0		/* We never have 32 bit fixups */
 #endif
 
+/* Flags to be cleared on entry by sendmsg and sendmmsg syscalls */
+#define MSG_INTERNAL_SENDMSG_FLAGS (MSG_SPLICE_PAGES)
 
 /* Setsockoptions(2) level. Thanks to BSD these must match IPPROTO_xxx */
 #define SOL_IP		0
diff --git a/net/socket.c b/net/socket.c
index 73e493da4589..b3fd3f7f7e03 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2136,6 +2136,7 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
 		msg.msg_name = (struct sockaddr *)&address;
 		msg.msg_namelen = addr_len;
 	}
+	flags &= ~MSG_INTERNAL_SENDMSG_FLAGS;
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
 	msg.msg_flags = flags;
@@ -2483,6 +2484,7 @@ static int ____sys_sendmsg(struct socket *sock, struct msghdr *msg_sys,
 	}
 	msg_sys->msg_flags = flags;
 
+	flags &= ~MSG_INTERNAL_SENDMSG_FLAGS;
 	if (sock->file->f_flags & O_NONBLOCK)
 		msg_sys->msg_flags |= MSG_DONTWAIT;
 	/*



* [PATCH net-next v4 03/20] mm: Move the page fragment allocator from page_alloc.c into its own file
From: David Howells @ 2023-04-05 16:53 UTC
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm, Bernard Metzler, Tom Talpey, linux-rdma

Move the page fragment allocator from page_alloc.c into its own file
preparatory to changing it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Tom Talpey <tom@talpey.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-rdma@vger.kernel.org
cc: netdev@vger.kernel.org
---
 mm/Makefile          |   2 +-
 mm/page_alloc.c      | 126 -----------------------------------------
 mm/page_frag_alloc.c | 131 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 132 insertions(+), 127 deletions(-)
 create mode 100644 mm/page_frag_alloc.c

diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..4e6dc12b4cbd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o percpu.o slab_common.o \
-			   compaction.o \
+			   compaction.o page_frag_alloc.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o gup.o mmap_lock.o $(mmu-y)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7136c36c5d01..d751e750c14b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5695,132 +5695,6 @@ void free_pages(unsigned long addr, unsigned int order)
 
 EXPORT_SYMBOL(free_pages);
 
-/*
- * Page Fragment:
- *  An arbitrary-length arbitrary-offset area of memory which resides
- *  within a 0 or higher order page.  Multiple fragments within that page
- *  are individually refcounted, in the page's reference counter.
- *
- * The page_frag functions below provide a simple allocation framework for
- * page fragments.  This is used by the network stack and network device
- * drivers to provide a backing region of memory for use as either an
- * sk_buff->head, or to be used in the "frags" portion of skb_shared_info.
- */
-static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
-					     gfp_t gfp_mask)
-{
-	struct page *page = NULL;
-	gfp_t gfp = gfp_mask;
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
-		    __GFP_NOMEMALLOC;
-	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
-				PAGE_FRAG_CACHE_MAX_ORDER);
-	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
-#endif
-	if (unlikely(!page))
-		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
-
-	nc->va = page ? page_address(page) : NULL;
-
-	return page;
-}
-
-void __page_frag_cache_drain(struct page *page, unsigned int count)
-{
-	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
-
-	if (page_ref_sub_and_test(page, count))
-		free_the_page(page, compound_order(page));
-}
-EXPORT_SYMBOL(__page_frag_cache_drain);
-
-void *page_frag_alloc_align(struct page_frag_cache *nc,
-		      unsigned int fragsz, gfp_t gfp_mask,
-		      unsigned int align_mask)
-{
-	unsigned int size = PAGE_SIZE;
-	struct page *page;
-	int offset;
-
-	if (unlikely(!nc->va)) {
-refill:
-		page = __page_frag_cache_refill(nc, gfp_mask);
-		if (!page)
-			return NULL;
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
-		/* Even if we own the page, we do not use atomic_set().
-		 * This would break get_page_unless_zero() users.
-		 */
-		page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
-
-		/* reset page count bias and offset to start of new frag */
-		nc->pfmemalloc = page_is_pfmemalloc(page);
-		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		nc->offset = size;
-	}
-
-	offset = nc->offset - fragsz;
-	if (unlikely(offset < 0)) {
-		page = virt_to_page(nc->va);
-
-		if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
-			goto refill;
-
-		if (unlikely(nc->pfmemalloc)) {
-			free_the_page(page, compound_order(page));
-			goto refill;
-		}
-
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
-		/* OK, page count is 0, we can safely set it */
-		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
-
-		/* reset page count bias and offset to start of new frag */
-		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		offset = size - fragsz;
-		if (unlikely(offset < 0)) {
-			/*
-			 * The caller is trying to allocate a fragment
-			 * with fragsz > PAGE_SIZE but the cache isn't big
-			 * enough to satisfy the request, this may
-			 * happen in low memory conditions.
-			 * We don't release the cache page because
-			 * it could make memory pressure worse
-			 * so we simply return NULL here.
-			 */
-			return NULL;
-		}
-	}
-
-	nc->pagecnt_bias--;
-	offset &= align_mask;
-	nc->offset = offset;
-
-	return nc->va + offset;
-}
-EXPORT_SYMBOL(page_frag_alloc_align);
-
-/*
- * Frees a page fragment allocated out of either a compound or order 0 page.
- */
-void page_frag_free(void *addr)
-{
-	struct page *page = virt_to_head_page(addr);
-
-	if (unlikely(put_page_testzero(page)))
-		free_the_page(page, compound_order(page));
-}
-EXPORT_SYMBOL(page_frag_free);
-
 static void *make_alloc_exact(unsigned long addr, unsigned int order,
 		size_t size)
 {
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
new file mode 100644
index 000000000000..bee95824ef8f
--- /dev/null
+++ b/mm/page_frag_alloc.c
@@ -0,0 +1,131 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Page fragment allocator
+ *
+ * Page Fragment:
+ *  An arbitrary-length arbitrary-offset area of memory which resides within a
+ *  0 or higher order page.  Multiple fragments within that page are
+ *  individually refcounted, in the page's reference counter.
+ *
+ * The page_frag functions provide a simple allocation framework for page
+ * fragments.  This is used by the network stack and network device drivers to
+ * provide a backing region of memory for use as either an sk_buff->head, or to
+ * be used in the "frags" portion of skb_shared_info.
+ */
+
+#include <linux/export.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+
+static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
+					     gfp_t gfp_mask)
+{
+	struct page *page = NULL;
+	gfp_t gfp = gfp_mask;
+
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
+		    __GFP_NOMEMALLOC;
+	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
+				PAGE_FRAG_CACHE_MAX_ORDER);
+	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
+#endif
+	if (unlikely(!page))
+		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+
+	nc->va = page ? page_address(page) : NULL;
+
+	return page;
+}
+
+void __page_frag_cache_drain(struct page *page, unsigned int count)
+{
+	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+
+	if (page_ref_sub_and_test(page, count - 1))
+		__free_pages(page, compound_order(page));
+}
+EXPORT_SYMBOL(__page_frag_cache_drain);
+
+void *page_frag_alloc_align(struct page_frag_cache *nc,
+		      unsigned int fragsz, gfp_t gfp_mask,
+		      unsigned int align_mask)
+{
+	unsigned int size = PAGE_SIZE;
+	struct page *page;
+	int offset;
+
+	if (unlikely(!nc->va)) {
+refill:
+		page = __page_frag_cache_refill(nc, gfp_mask);
+		if (!page)
+			return NULL;
+
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+		/* if size can vary use size else just use PAGE_SIZE */
+		size = nc->size;
+#endif
+		/* Even if we own the page, we do not use atomic_set().
+		 * This would break get_page_unless_zero() users.
+		 */
+		page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
+
+		/* reset page count bias and offset to start of new frag */
+		nc->pfmemalloc = page_is_pfmemalloc(page);
+		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+		nc->offset = size;
+	}
+
+	offset = nc->offset - fragsz;
+	if (unlikely(offset < 0)) {
+		page = virt_to_page(nc->va);
+
+		if (page_ref_count(page) != nc->pagecnt_bias)
+			goto refill;
+		if (unlikely(nc->pfmemalloc)) {
+			page_ref_sub(page, nc->pagecnt_bias - 1);
+			__free_pages(page, compound_order(page));
+			goto refill;
+		}
+
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+		/* if size can vary use size else just use PAGE_SIZE */
+		size = nc->size;
+#endif
+		/* OK, page count is 0, we can safely set it */
+		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+
+		/* reset page count bias and offset to start of new frag */
+		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+		offset = size - fragsz;
+		if (unlikely(offset < 0)) {
+			/*
+			 * The caller is trying to allocate a fragment
+			 * with fragsz > PAGE_SIZE but the cache isn't big
+			 * enough to satisfy the request, this may
+			 * happen in low memory conditions.
+			 * We don't release the cache page because
+			 * it could make memory pressure worse
+			 * so we simply return NULL here.
+			 */
+			return NULL;
+		}
+	}
+
+	nc->pagecnt_bias--;
+	offset &= align_mask;
+	nc->offset = offset;
+
+	return nc->va + offset;
+}
+EXPORT_SYMBOL(page_frag_alloc_align);
+
+/*
+ * Frees a page fragment allocated out of either a compound or order 0 page.
+ */
+void page_frag_free(void *addr)
+{
+	struct page *page = virt_to_head_page(addr);
+
+	__free_pages(page, compound_order(page));
+}
+EXPORT_SYMBOL(page_frag_free);
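
For orientation, a minimal usage sketch of the API being relocated; the
cache variable and function names are illustrative, not part of the patch,
but page_frag_alloc() (the inline gfp.h wrapper around
page_frag_alloc_align()) and page_frag_free() are the real entry points:

	/* Illustrative only: carve a small fragment off a private cache
	 * and release it again.  Multiple live fragments share one page,
	 * each holding a reference on it.
	 */
	static struct page_frag_cache demo_cache;

	static void demo_page_frag(void)
	{
		void *p = page_frag_alloc(&demo_cache, 256, GFP_ATOMIC);

		if (p)
			page_frag_free(p);	/* drops the page ref */
	}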



* [PATCH net-next v4 04/20] mm: Make the page_frag_cache allocator use multipage folios
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (2 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 03/20] mm: Move the page fragment allocator from page_alloc.c into its own file David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 05/20] mm: Make the page_frag_cache allocator use per-cpu David Howells
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

Change the page_frag_cache allocator to use multipage folios rather than
groups of pages.  This reduces page_frag_free to just a folio_put() or
put_page().
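
As a hedged sketch of what the conversion buys on the free path (the
function name here is illustrative; folio_put() and virt_to_folio() are
the real calls used in the diff below):

	/* Every fragment now holds a folio reference, so freeing one
	 * collapses to a single put; the folio goes back to the page
	 * allocator when the last reference is dropped.
	 */
	static void frag_free_sketch(void *addr)
	{
		folio_put(virt_to_folio(addr));
	}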

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |  15 +--
 drivers/nvme/host/tcp.c                    |   7 +-
 drivers/nvme/target/tcp.c                  |   5 +-
 include/linux/gfp.h                        |   1 +
 include/linux/mm_types.h                   |  13 +--
 mm/page_frag_alloc.c                       | 101 +++++++++++----------
 6 files changed, 63 insertions(+), 79 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index 69fba29055e9..6ce532217777 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -286,7 +286,6 @@ mtk_wed_wo_queue_free(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 static void
 mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 {
-	struct page *page;
 	int i;
 
 	for (i = 0; i < q->n_desc; i++) {
@@ -298,12 +297,7 @@ mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		entry->buf = NULL;
 	}
 
-	if (!q->cache.va)
-		return;
-
-	page = virt_to_page(q->cache.va);
-	__page_frag_cache_drain(page, q->cache.pagecnt_bias);
-	memset(&q->cache, 0, sizeof(q->cache));
+	page_frag_cache_clear(&q->cache);
 }
 
 static void
@@ -320,12 +314,7 @@ mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		skb_free_frag(buf);
 	}
 
-	if (!q->cache.va)
-		return;
-
-	page = virt_to_page(q->cache.va);
-	__page_frag_cache_drain(page, q->cache.pagecnt_bias);
-	memset(&q->cache, 0, sizeof(q->cache));
+	page_frag_cache_clear(&q->cache);
 }
 
 static void
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 42c0598c31f2..76f12ac714b0 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1323,12 +1323,7 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	if (queue->hdr_digest || queue->data_digest)
 		nvme_tcp_free_crypto(queue);
 
-	if (queue->pf_cache.va) {
-		page = virt_to_head_page(queue->pf_cache.va);
-		__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
-		queue->pf_cache.va = NULL;
-	}
-
+	page_frag_cache_clear(&queue->pf_cache);
 	noreclaim_flag = memalloc_noreclaim_save();
 	sock_release(queue->sock);
 	memalloc_noreclaim_restore(noreclaim_flag);
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 66e8f9fd0ca7..ae871c31cf00 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1438,7 +1438,6 @@ static void nvmet_tcp_free_cmd_data_in_buffers(struct nvmet_tcp_queue *queue)
 
 static void nvmet_tcp_release_queue_work(struct work_struct *w)
 {
-	struct page *page;
 	struct nvmet_tcp_queue *queue =
 		container_of(w, struct nvmet_tcp_queue, release_work);
 
@@ -1460,9 +1459,7 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
-
-	page = virt_to_head_page(queue->pf_cache.va);
-	__page_frag_cache_drain(page, queue->pf_cache.pagecnt_bias);
+	page_frag_cache_clear(&queue->pf_cache);
 	kfree(queue);
 }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 65a78773dcca..5e15384798eb 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -313,6 +313,7 @@ static inline void *page_frag_alloc(struct page_frag_cache *nc,
 {
 	return page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
 }
+void page_frag_cache_clear(struct page_frag_cache *nc);
 
 extern void page_frag_free(void *addr);
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0722859c3647..49a70b3f44a9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -420,18 +420,13 @@ static inline void *folio_get_private(struct folio *folio)
 }
 
 struct page_frag_cache {
-	void * va;
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	__u16 offset;
-	__u16 size;
-#else
-	__u32 offset;
-#endif
+	struct folio	*folio;
+	unsigned int	offset;
 	/* we maintain a pagecount bias, so that we dont dirty cache line
 	 * containing page->_refcount every time we allocate a fragment.
 	 */
-	unsigned int		pagecnt_bias;
-	bool pfmemalloc;
+	unsigned int	pagecnt_bias;
+	bool		pfmemalloc;
 };
 
 typedef unsigned long vm_flags_t;
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index bee95824ef8f..9b138cb0e3a4 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -16,88 +16,95 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 
-static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
-					     gfp_t gfp_mask)
+/*
+ * Allocate a new folio for the frag cache.
+ */
+static struct folio *page_frag_cache_refill(struct page_frag_cache *nc,
+					    gfp_t gfp_mask)
 {
-	struct page *page = NULL;
+	struct folio *folio = NULL;
 	gfp_t gfp = gfp_mask;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
-		    __GFP_NOMEMALLOC;
-	page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
-				PAGE_FRAG_CACHE_MAX_ORDER);
-	nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
+	gfp_mask |= __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
+	folio = folio_alloc(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER);
 #endif
-	if (unlikely(!page))
-		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
-
-	nc->va = page ? page_address(page) : NULL;
+	if (unlikely(!folio))
+		folio = folio_alloc(gfp, 0);
 
-	return page;
+	if (folio)
+		nc->folio = folio;
+	return folio;
 }
 
 void __page_frag_cache_drain(struct page *page, unsigned int count)
 {
-	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
+	struct folio *folio = page_folio(page);
 
-	if (page_ref_sub_and_test(page, count - 1))
-		__free_pages(page, compound_order(page));
+	VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
+
+	folio_put_refs(folio, count);
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
+void page_frag_cache_clear(struct page_frag_cache *nc)
+{
+	struct folio *folio = nc->folio;
+
+	if (folio) {
+		VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
+		folio_put_refs(folio, nc->pagecnt_bias);
+		nc->folio = NULL;
+	}
+
+}
+EXPORT_SYMBOL(page_frag_cache_clear);
+
 void *page_frag_alloc_align(struct page_frag_cache *nc,
 		      unsigned int fragsz, gfp_t gfp_mask,
 		      unsigned int align_mask)
 {
-	unsigned int size = PAGE_SIZE;
-	struct page *page;
-	int offset;
+	struct folio *folio = nc->folio;
+	size_t offset;
 
-	if (unlikely(!nc->va)) {
+	if (unlikely(!folio)) {
 refill:
-		page = __page_frag_cache_refill(nc, gfp_mask);
-		if (!page)
+		folio = page_frag_cache_refill(nc, gfp_mask);
+		if (!folio)
 			return NULL;
 
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
 		/* Even if we own the page, we do not use atomic_set().
 		 * This would break get_page_unless_zero() users.
 		 */
-		page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
+		folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
 
 		/* reset page count bias and offset to start of new frag */
-		nc->pfmemalloc = page_is_pfmemalloc(page);
+		nc->pfmemalloc = folio_is_pfmemalloc(folio);
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		nc->offset = size;
+		nc->offset = folio_size(folio);
 	}
 
-	offset = nc->offset - fragsz;
-	if (unlikely(offset < 0)) {
-		page = virt_to_page(nc->va);
-
-		if (page_ref_count(page) != nc->pagecnt_bias)
+	offset = nc->offset;
+	if (unlikely(fragsz > offset)) {
+		/* Reuse the folio if everyone we gave it to has finished with it. */
+		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
+			nc->folio = NULL;
 			goto refill;
+		}
+
 		if (unlikely(nc->pfmemalloc)) {
-			page_ref_sub(page, nc->pagecnt_bias - 1);
-			__free_pages(page, compound_order(page));
+			__folio_put(folio);
+			nc->folio = NULL;
 			goto refill;
 		}
 
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-		/* if size can vary use size else just use PAGE_SIZE */
-		size = nc->size;
-#endif
 		/* OK, page count is 0, we can safely set it */
-		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+		folio_set_count(folio, PAGE_FRAG_CACHE_MAX_SIZE + 1);
 
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		offset = size - fragsz;
-		if (unlikely(offset < 0)) {
+		offset = folio_size(folio);
+		if (unlikely(fragsz > offset)) {
 			/*
 			 * The caller is trying to allocate a fragment
 			 * with fragsz > PAGE_SIZE but the cache isn't big
@@ -107,15 +114,17 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 			 * it could make memory pressure worse
 			 * so we simply return NULL here.
 			 */
+			nc->offset = offset;
 			return NULL;
 		}
 	}
 
 	nc->pagecnt_bias--;
+	offset -= fragsz;
 	offset &= align_mask;
 	nc->offset = offset;
 
-	return nc->va + offset;
+	return folio_address(folio) + offset;
 }
 EXPORT_SYMBOL(page_frag_alloc_align);
 
@@ -124,8 +133,6 @@ EXPORT_SYMBOL(page_frag_alloc_align);
  */
 void page_frag_free(void *addr)
 {
-	struct page *page = virt_to_head_page(addr);
-
-	__free_pages(page, compound_order(page));
+	folio_put(virt_to_folio(addr));
 }
 EXPORT_SYMBOL(page_frag_free);



* [PATCH net-next v4 05/20] mm: Make the page_frag_cache allocator use per-cpu
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (3 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 04/20] mm: Make the page_frag_cache allocator use multipage folios David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 06/20] tcp: Support MSG_SPLICE_PAGES David Howells
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm, Lorenzo Bianconi, Felix Fietkau, John Crispin,
	Sean Wang, Mark Lee, Keith Busch, Christoph Hellwig,
	Sagi Grimberg, Chaitanya Kulkarni, linux-nvme, linux-mediatek

Give the page_frag_cache allocator a separate allocation bucket for each
CPU to avoid racing.  No lock is then required to allocate from it, other
than preempt disablement; if a softirq wants to access it, softirq
disablement will need to be added as well.

Make the NVMe and mediatek drivers pass NULL to page_frag_alloc() and use
the default per-cpu allocation buckets rather than defining their own.
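
A minimal sketch of the resulting calling convention, assuming the
per-cpu default bucket added below; the cache and function names are
illustrative:

	/* The cache argument is now a per-cpu pointer - or NULL to use
	 * the default bucket - and the allocator does get_cpu_ptr()
	 * itself, so preempt disablement is the only serialisation.
	 */
	static DEFINE_PER_CPU(struct page_frag_cache, demo_frag_cache);

	static void *demo_alloc(size_t fragsz)
	{
		return page_frag_alloc(&demo_frag_cache, fragsz, GFP_ATOMIC);
	}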

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Lorenzo Bianconi <lorenzo@kernel.org>
cc: Felix Fietkau <nbd@nbd.name>
cc: John Crispin <john@phrozen.org>
cc: Sean Wang <sean.wang@mediatek.com>
cc: Mark Lee <Mark-MC.Lee@mediatek.com>
cc: Keith Busch <kbusch@kernel.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Sagi Grimberg <sagi@grimberg.me>
cc: Chaitanya Kulkarni <kch@nvidia.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
cc: linux-nvme@lists.infradead.org
cc: linux-mediatek@lists.infradead.org
---
 drivers/net/ethernet/mediatek/mtk_wed_wo.c |   8 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.h |   2 -
 drivers/nvme/host/tcp.c                    |  14 +-
 drivers/nvme/target/tcp.c                  |  19 +-
 include/linux/gfp.h                        |  18 +-
 mm/page_frag_alloc.c                       | 195 ++++++++++++++-------
 net/core/skbuff.c                          |  32 ++--
 7 files changed, 164 insertions(+), 124 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.c b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
index 6ce532217777..859f34447f2f 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.c
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.c
@@ -143,7 +143,7 @@ mtk_wed_wo_queue_refill(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q,
 		dma_addr_t addr;
 		void *buf;
 
-		buf = page_frag_alloc(&q->cache, q->buf_size, GFP_ATOMIC);
+		buf = page_frag_alloc(NULL, q->buf_size, GFP_ATOMIC);
 		if (!buf)
 			break;
 
@@ -296,15 +296,11 @@ mtk_wed_wo_queue_tx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 		skb_free_frag(entry->buf);
 		entry->buf = NULL;
 	}
-
-	page_frag_cache_clear(&q->cache);
 }
 
 static void
 mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 {
-	struct page *page;
-
 	for (;;) {
 		void *buf = mtk_wed_wo_dequeue(wo, q, NULL, true);
 
@@ -313,8 +309,6 @@ mtk_wed_wo_queue_rx_clean(struct mtk_wed_wo *wo, struct mtk_wed_wo_queue *q)
 
 		skb_free_frag(buf);
 	}
-
-	page_frag_cache_clear(&q->cache);
 }
 
 static void
diff --git a/drivers/net/ethernet/mediatek/mtk_wed_wo.h b/drivers/net/ethernet/mediatek/mtk_wed_wo.h
index dbcf42ce9173..6f940db67fb8 100644
--- a/drivers/net/ethernet/mediatek/mtk_wed_wo.h
+++ b/drivers/net/ethernet/mediatek/mtk_wed_wo.h
@@ -210,8 +210,6 @@ struct mtk_wed_wo_queue_entry {
 struct mtk_wed_wo_queue {
 	struct mtk_wed_wo_queue_regs regs;
 
-	struct page_frag_cache cache;
-
 	struct mtk_wed_wo_queue_desc *desc;
 	dma_addr_t desc_dma;
 
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 76f12ac714b0..5a92236db92a 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -147,8 +147,6 @@ struct nvme_tcp_queue {
 	__le32			exp_ddgst;
 	__le32			recv_ddgst;
 
-	struct page_frag_cache	pf_cache;
-
 	void (*state_change)(struct sock *);
 	void (*data_ready)(struct sock *);
 	void (*write_space)(struct sock *);
@@ -482,9 +480,8 @@ static int nvme_tcp_init_request(struct blk_mq_tag_set *set,
 	struct nvme_tcp_queue *queue = &ctrl->queues[queue_idx];
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 
-	req->pdu = page_frag_alloc(&queue->pf_cache,
-		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
-		GFP_KERNEL | __GFP_ZERO);
+	req->pdu = page_frag_alloc(NULL, sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+				   GFP_KERNEL | __GFP_ZERO);
 	if (!req->pdu)
 		return -ENOMEM;
 
@@ -1300,9 +1297,8 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
 	struct nvme_tcp_request *async = &ctrl->async_req;
 	u8 hdgst = nvme_tcp_hdgst_len(queue);
 
-	async->pdu = page_frag_alloc(&queue->pf_cache,
-		sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
-		GFP_KERNEL | __GFP_ZERO);
+	async->pdu = page_frag_alloc(NULL, sizeof(struct nvme_tcp_cmd_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!async->pdu)
 		return -ENOMEM;
 
@@ -1312,7 +1308,6 @@ static int nvme_tcp_alloc_async_req(struct nvme_tcp_ctrl *ctrl)
 
 static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 {
-	struct page *page;
 	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
 	struct nvme_tcp_queue *queue = &ctrl->queues[qid];
 	unsigned int noreclaim_flag;
@@ -1323,7 +1318,6 @@ static void nvme_tcp_free_queue(struct nvme_ctrl *nctrl, int qid)
 	if (queue->hdr_digest || queue->data_digest)
 		nvme_tcp_free_crypto(queue);
 
-	page_frag_cache_clear(&queue->pf_cache);
 	noreclaim_flag = memalloc_noreclaim_save();
 	sock_release(queue->sock);
 	memalloc_noreclaim_restore(noreclaim_flag);
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index ae871c31cf00..d6cc557cc539 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -143,8 +143,6 @@ struct nvmet_tcp_queue {
 
 	struct nvmet_tcp_cmd	connect;
 
-	struct page_frag_cache	pf_cache;
-
 	void (*data_ready)(struct sock *);
 	void (*state_change)(struct sock *);
 	void (*write_space)(struct sock *);
@@ -1312,25 +1310,25 @@ static int nvmet_tcp_alloc_cmd(struct nvmet_tcp_queue *queue,
 	c->queue = queue;
 	c->req.port = queue->port->nport;
 
-	c->cmd_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->cmd_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->cmd_pdu = page_frag_alloc(NULL, sizeof(*c->cmd_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!c->cmd_pdu)
 		return -ENOMEM;
 	c->req.cmd = &c->cmd_pdu->cmd;
 
-	c->rsp_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->rsp_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->rsp_pdu = page_frag_alloc(NULL, sizeof(*c->rsp_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!c->rsp_pdu)
 		goto out_free_cmd;
 	c->req.cqe = &c->rsp_pdu->cqe;
 
-	c->data_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->data_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->data_pdu = page_frag_alloc(NULL, sizeof(*c->data_pdu) + hdgst,
+				      GFP_KERNEL | __GFP_ZERO);
 	if (!c->data_pdu)
 		goto out_free_rsp;
 
-	c->r2t_pdu = page_frag_alloc(&queue->pf_cache,
-			sizeof(*c->r2t_pdu) + hdgst, GFP_KERNEL | __GFP_ZERO);
+	c->r2t_pdu = page_frag_alloc(NULL, sizeof(*c->r2t_pdu) + hdgst,
+				     GFP_KERNEL | __GFP_ZERO);
 	if (!c->r2t_pdu)
 		goto out_free_data;
 
@@ -1459,7 +1457,6 @@ static void nvmet_tcp_release_queue_work(struct work_struct *w)
 	if (queue->hdr_digest || queue->data_digest)
 		nvmet_tcp_free_crypto(queue);
 	ida_free(&nvmet_tcp_queue_ida, queue->idx);
-	page_frag_cache_clear(&queue->pf_cache);
 	kfree(queue);
 }
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5e15384798eb..b208ca315882 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -304,16 +304,18 @@ extern void free_pages(unsigned long addr, unsigned int order);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
-extern void *page_frag_alloc_align(struct page_frag_cache *nc,
-				   unsigned int fragsz, gfp_t gfp_mask,
-				   unsigned int align_mask);
-
-static inline void *page_frag_alloc(struct page_frag_cache *nc,
-			     unsigned int fragsz, gfp_t gfp_mask)
+extern void *page_frag_alloc_align(struct page_frag_cache __percpu *frag_cache,
+				   size_t fragsz, gfp_t gfp,
+				   unsigned long align_mask);
+extern void *page_frag_memdup(struct page_frag_cache __percpu *frag_cache,
+			      const void *p, size_t fragsz, gfp_t gfp,
+			      unsigned long align_mask);
+
+static inline void *page_frag_alloc(struct page_frag_cache __percpu *frag_cache,
+				    size_t fragsz, gfp_t gfp)
 {
-	return page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
+	return page_frag_alloc_align(frag_cache, fragsz, gfp, ULONG_MAX);
 }
-void page_frag_cache_clear(struct page_frag_cache *nc);
 
 extern void page_frag_free(void *addr);
 
diff --git a/mm/page_frag_alloc.c b/mm/page_frag_alloc.c
index 9b138cb0e3a4..7844398afe26 100644
--- a/mm/page_frag_alloc.c
+++ b/mm/page_frag_alloc.c
@@ -16,25 +16,23 @@
 #include <linux/init.h>
 #include <linux/mm.h>
 
+static DEFINE_PER_CPU(struct page_frag_cache, page_frag_default_allocator);
+
 /*
  * Allocate a new folio for the frag cache.
  */
-static struct folio *page_frag_cache_refill(struct page_frag_cache *nc,
-					    gfp_t gfp_mask)
+static struct folio *page_frag_cache_refill(gfp_t gfp)
 {
-	struct folio *folio = NULL;
-	gfp_t gfp = gfp_mask;
+	struct folio *folio;
 
 #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
-	gfp_mask |= __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
-	folio = folio_alloc(gfp_mask, PAGE_FRAG_CACHE_MAX_ORDER);
+	folio = folio_alloc(gfp | __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC,
+			    PAGE_FRAG_CACHE_MAX_ORDER);
+	if (folio)
+		return folio;
 #endif
-	if (unlikely(!folio))
-		folio = folio_alloc(gfp, 0);
 
-	if (folio)
-		nc->folio = folio;
-	return folio;
+	return folio_alloc(gfp, 0);
 }
 
 void __page_frag_cache_drain(struct page *page, unsigned int count)
@@ -47,54 +45,68 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
-void page_frag_cache_clear(struct page_frag_cache *nc)
-{
-	struct folio *folio = nc->folio;
-
-	if (folio) {
-		VM_BUG_ON_FOLIO(folio_ref_count(folio) == 0, folio);
-		folio_put_refs(folio, nc->pagecnt_bias);
-		nc->folio = NULL;
-	}
-
-}
-EXPORT_SYMBOL(page_frag_cache_clear);
-
-void *page_frag_alloc_align(struct page_frag_cache *nc,
-		      unsigned int fragsz, gfp_t gfp_mask,
-		      unsigned int align_mask)
+/**
+ * page_frag_alloc_align - Allocate some memory for use in zerocopy
+ * @frag_cache: The frag cache to use (or NULL for the default)
+ * @fragsz: The size of the fragment desired
+ * @gfp: Allocation flags under which to make an allocation
+ * @align_mask: The required alignment
+ *
+ * Allocate some memory for use with zerocopy where protocol bits have to be
+ * mixed in with spliced/zerocopied data.  Unlike memory allocated from the
+ * slab, this memory's lifetime is purely dependent on the folio's refcount.
+ *
+ * The way it works is that a folio is allocated and fragments are broken off
+ * sequentially and returned to the caller with a ref until the folio no longer
+ * has enough spare space - at which point the allocator's ref is dropped and a
+ * new folio is allocated.  The folio remains in existence until the last ref
+ * held by, say, an sk_buff is discarded and then the page is returned to the
+ * page allocator.
+ *
+ * Returns a pointer to the memory on success and -ENOMEM on allocation
+ * failure.
+ *
+ * The allocated memory should be disposed of with folio_put().
+ */
+void *page_frag_alloc_align(struct page_frag_cache __percpu *frag_cache,
+			    size_t fragsz, gfp_t gfp, unsigned long align_mask)
 {
-	struct folio *folio = nc->folio;
+	struct page_frag_cache *nc;
+	struct folio *folio, *spare = NULL;
 	size_t offset;
+	void *p;
 
-	if (unlikely(!folio)) {
-refill:
-		folio = page_frag_cache_refill(nc, gfp_mask);
-		if (!folio)
-			return NULL;
+	if (!frag_cache)
+		frag_cache = &page_frag_default_allocator;
+	if (WARN_ON_ONCE(fragsz == 0))
+		fragsz = 1;
+	align_mask &= ~3UL;
 
-		/* Even if we own the page, we do not use atomic_set().
-		 * This would break get_page_unless_zero() users.
-		 */
-		folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
-
-		/* reset page count bias and offset to start of new frag */
-		nc->pfmemalloc = folio_is_pfmemalloc(folio);
-		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
-		nc->offset = folio_size(folio);
+	nc = get_cpu_ptr(frag_cache);
+reload:
+	folio = nc->folio;
+	offset = nc->offset;
+try_again:
+
+	/* Make the allocation if there's sufficient space. */
+	if (fragsz <= offset) {
+		nc->pagecnt_bias--;
+		offset = (offset - fragsz) & align_mask;
+		nc->offset = offset;
+		p = folio_address(folio) + offset;
+		put_cpu_ptr(frag_cache);
+		if (spare)
+			folio_put(spare);
+		return p;
 	}
 
-	offset = nc->offset;
-	if (unlikely(fragsz > offset)) {
-		/* Reuse the folio if everyone we gave it to has finished with it. */
-		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias)) {
-			nc->folio = NULL;
+	/* Insufficient space - see if we can refurbish the current folio. */
+	if (folio) {
+		if (!folio_ref_sub_and_test(folio, nc->pagecnt_bias))
 			goto refill;
-		}
 
 		if (unlikely(nc->pfmemalloc)) {
 			__folio_put(folio);
-			nc->folio = NULL;
 			goto refill;
 		}
 
@@ -104,27 +116,56 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
 		offset = folio_size(folio);
-		if (unlikely(fragsz > offset)) {
-			/*
-			 * The caller is trying to allocate a fragment
-			 * with fragsz > PAGE_SIZE but the cache isn't big
-			 * enough to satisfy the request, this may
-			 * happen in low memory conditions.
-			 * We don't release the cache page because
-			 * it could make memory pressure worse
-			 * so we simply return NULL here.
-			 */
-			nc->offset = offset;
+		if (unlikely(fragsz > offset))
+			goto frag_too_big;
+		goto try_again;
+	}
+
+refill:
+	if (!spare) {
+		nc->folio = NULL;
+		put_cpu_ptr(frag_cache);
+
+		spare = page_frag_cache_refill(gfp);
+		if (!spare)
 			return NULL;
-		}
+
+		nc = get_cpu_ptr(frag_cache);
+		/* We may now be on a different cpu and/or someone else may
+		 * have refilled it
+		 */
+		nc->pfmemalloc = folio_is_pfmemalloc(spare);
+		if (nc->folio)
+			goto reload;
 	}
 
-	nc->pagecnt_bias--;
-	offset -= fragsz;
-	offset &= align_mask;
+	nc->folio = spare;
+	folio = spare;
+	spare = NULL;
+
+	/* Even if we own the page, we do not use atomic_set().  This would
+	 * break get_page_unless_zero() users.
+	 */
+	folio_ref_add(folio, PAGE_FRAG_CACHE_MAX_SIZE);
+
+	/* Reset page count bias and offset to start of new frag */
+	nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+	offset = folio_size(folio);
+	goto try_again;
+
+frag_too_big:
+	/*
+	 * The caller is trying to allocate a fragment with fragsz > PAGE_SIZE
+	 * but the cache isn't big enough to satisfy the request, this may
+	 * happen in low memory conditions.  We don't release the cache page
+	 * because it could make memory pressure worse so we simply return NULL
+	 * here.
+	 */
 	nc->offset = offset;
-
-	return folio_address(folio) + offset;
+	put_cpu_ptr(frag_cache);
+	if (spare)
+		folio_put(spare);
+	return NULL;
 }
 EXPORT_SYMBOL(page_frag_alloc_align);
 
@@ -136,3 +177,25 @@ void page_frag_free(void *addr)
 	folio_put(virt_to_folio(addr));
 }
 EXPORT_SYMBOL(page_frag_free);
+
+/**
+ * page_frag_memdup - Allocate a page fragment and duplicate some data into it
+ * @frag_cache: The frag cache to use (or NULL for the default)
+ * @fragsz: The amount of memory to copy (maximum 1/2 page).
+ * @p: The source data to copy
+ * @gfp: Allocation flags under which to make an allocation
+ * @align_mask: The required alignment
+ */
+void *page_frag_memdup(struct page_frag_cache __percpu *frag_cache,
+		       const void *p, size_t fragsz, gfp_t gfp,
+		       unsigned long align_mask)
+{
+	void *q;
+
+	q = page_frag_alloc_align(frag_cache, fragsz, gfp, align_mask);
+	if (!q)
+		return q;
+
+	return memcpy(q, p, fragsz);
+}
+EXPORT_SYMBOL(page_frag_memdup);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 050a875d09c5..3d05ed64b606 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -222,13 +222,13 @@ static void *page_frag_alloc_1k(struct page_frag_1k *nc, gfp_t gfp_mask)
 #endif
 
 struct napi_alloc_cache {
-	struct page_frag_cache page;
 	struct page_frag_1k page_small;
 	unsigned int skb_count;
 	void *skb_cache[NAPI_SKB_CACHE_SIZE];
 };
 
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
+static DEFINE_PER_CPU(struct page_frag_cache, napi_frag_cache);
 static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
 
 /* Double check that napi_get_frags() allocates skbs with
@@ -250,11 +250,9 @@ void napi_get_frags_check(struct napi_struct *napi)
 
 void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 {
-	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
-
 	fragsz = SKB_DATA_ALIGN(fragsz);
 
-	return page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+	return page_frag_alloc_align(&napi_frag_cache, fragsz, GFP_ATOMIC, align_mask);
 }
 EXPORT_SYMBOL(__napi_alloc_frag_align);
 
@@ -264,15 +262,12 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
 
 	fragsz = SKB_DATA_ALIGN(fragsz);
 	if (in_hardirq() || irqs_disabled()) {
-		struct page_frag_cache *nc = this_cpu_ptr(&netdev_alloc_cache);
-
-		data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
+		data = page_frag_alloc_align(&netdev_alloc_cache,
+					     fragsz, GFP_ATOMIC, align_mask);
 	} else {
-		struct napi_alloc_cache *nc;
-
 		local_bh_disable();
-		nc = this_cpu_ptr(&napi_alloc_cache);
-		data = page_frag_alloc_align(&nc->page, fragsz, GFP_ATOMIC, align_mask);
+		data = page_frag_alloc_align(&napi_frag_cache,
+					     fragsz, GFP_ATOMIC, align_mask);
 		local_bh_enable();
 	}
 	return data;
@@ -652,7 +647,6 @@ EXPORT_SYMBOL(__alloc_skb);
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 				   gfp_t gfp_mask)
 {
-	struct page_frag_cache *nc;
 	struct sk_buff *skb;
 	bool pfmemalloc;
 	void *data;
@@ -677,14 +671,12 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 		gfp_mask |= __GFP_MEMALLOC;
 
 	if (in_hardirq() || irqs_disabled()) {
-		nc = this_cpu_ptr(&netdev_alloc_cache);
-		data = page_frag_alloc(nc, len, gfp_mask);
-		pfmemalloc = nc->pfmemalloc;
+		data = page_frag_alloc(&netdev_alloc_cache, len, gfp_mask);
+		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 	} else {
 		local_bh_disable();
-		nc = this_cpu_ptr(&napi_alloc_cache.page);
-		data = page_frag_alloc(nc, len, gfp_mask);
-		pfmemalloc = nc->pfmemalloc;
+		data = page_frag_alloc(&napi_frag_cache, len, gfp_mask);
+		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 		local_bh_enable();
 	}
 
@@ -772,8 +764,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	} else {
 		len = SKB_HEAD_ALIGN(len);
 
-		data = page_frag_alloc(&nc->page, len, gfp_mask);
-		pfmemalloc = nc->page.pfmemalloc;
+		data = page_frag_alloc(&napi_frag_cache, len, gfp_mask);
+		pfmemalloc = folio_is_pfmemalloc(virt_to_folio(data));
 	}
 
 	if (unlikely(!data))



* [PATCH net-next v4 06/20] tcp: Support MSG_SPLICE_PAGES
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (4 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 05/20] mm: Make the page_frag_cache allocator use per-cpu David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 07/20] tcp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data David Howells
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

Make TCP's sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
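
As a hedged illustration of the calling convention this enables (the
helper name is made up; the bvec and iterator calls mirror the conversion
of do_tcp_sendpages() later in this series):

	/* Splice a single page into a TCP socket via sendmsg() rather
	 * than ->sendpage().  The socket lock is assumed to be held, as
	 * tcp_sendmsg_locked() requires.
	 */
	static ssize_t splice_one_page(struct sock *sk, struct page *page,
				       int offset, size_t size)
	{
		struct bio_vec bvec;
		struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES, };

		bvec_set_page(&bvec, page, size, offset);
		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
		return tcp_sendmsg_locked(sk, &msg, size);
	}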

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/tcp.c | 67 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 60 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index fd68d49490f2..510bacc7ce7b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1221,7 +1221,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	int flags, err, copied = 0;
 	int mss_now = 0, size_goal, copied_syn = 0;
 	int process_backlog = 0;
-	bool zc = false;
+	int zc = 0;
 	long timeo;
 
 	flags = msg->msg_flags;
@@ -1232,17 +1232,22 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
 			net_zcopy_get(uarg);
-			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
 			if (!uarg) {
 				err = -ENOBUFS;
 				goto out_err;
 			}
-			zc = sk->sk_route_caps & NETIF_F_SG;
-			if (!zc)
+			if (sk->sk_route_caps & NETIF_F_SG)
+				zc = 1;
+			else
 				uarg_to_msgzc(uarg)->zerocopy = 0;
 		}
+	} else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
+		if (sk->sk_route_caps & NETIF_F_SG)
+			zc = 2;
 	}
 
 	if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1305,7 +1310,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		goto do_error;
 
 	while (msg_data_left(msg)) {
-		int copy = 0;
+		ssize_t copy = 0;
 
 		skb = tcp_write_queue_tail(sk);
 		if (skb)
@@ -1346,7 +1351,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (copy > msg_data_left(msg))
 			copy = msg_data_left(msg);
 
-		if (!zc) {
+		if (zc == 0) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
 			struct page_frag *pfrag = sk_page_frag(sk);
@@ -1391,7 +1396,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				page_ref_inc(pfrag->page);
 			}
 			pfrag->offset += copy;
-		} else {
+		} else if (zc == 1)  {
 			/* First append to a fragless skb builds initial
 			 * pure zerocopy skb
 			 */
@@ -1412,6 +1417,54 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			if (err < 0)
 				goto do_error;
 			copy = err;
+		} else if (zc == 2) {
+			/* Splice in data. */
+			struct page *page = NULL, **pages = &page;
+			size_t off = 0, part;
+			bool can_coalesce;
+			int i = skb_shinfo(skb)->nr_frags;
+
+			copy = iov_iter_extract_pages(&msg->msg_iter, &pages,
+						      copy, 1, 0, &off);
+			if (copy <= 0) {
+				err = copy ?: -EIO;
+				goto do_error;
+			}
+
+			can_coalesce = skb_can_coalesce(skb, i, page, off);
+			if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
+				tcp_mark_push(tp, skb);
+				iov_iter_revert(&msg->msg_iter, copy);
+				goto new_segment;
+			}
+			if (tcp_downgrade_zcopy_pure(sk, skb)) {
+				iov_iter_revert(&msg->msg_iter, copy);
+				goto wait_for_space;
+			}
+
+			part = tcp_wmem_schedule(sk, copy);
+			iov_iter_revert(&msg->msg_iter, copy - part);
+			if (!part)
+				goto wait_for_space;
+			copy = part;
+
+			if (can_coalesce) {
+				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+			} else {
+				get_page(page);
+				skb_fill_page_desc_noacc(skb, i, page, off, copy);
+			}
+			page = NULL;
+
+			if (!(flags & MSG_NO_SHARED_FRAGS))
+				skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
+
+			skb->len += copy;
+			skb->data_len += copy;
+			skb->truesize += copy;
+			sk_wmem_queued_add(sk, copy);
+			sk_mem_charge(sk, copy);
+
 		}
 
 		if (!copied)



* [PATCH net-next v4 07/20] tcp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (5 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 06/20] tcp: Support MSG_SPLICE_PAGES David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 08/20] tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES David Howells
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

If sendmsg() with MSG_SPLICE_PAGES encounters a page that shouldn't be
spliced - a slab page, for instance, or one with a zero count - make
tcp_sendmsg() copy it.
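
A hedged sketch of the fallback, pulled together from the diff below;
the helper name is illustrative:

	/* If the page fails sendpage_ok() - a slab page, say - duplicate
	 * the bytes into a page-frag fragment, whose lifetime is governed
	 * purely by its folio refcount, and splice that instead.
	 */
	static struct page *copy_unspliceable_page(struct page *page,
						   size_t *off, size_t len,
						   gfp_t gfp)
	{
		const void *p = kmap_local_page(page);
		void *q = page_frag_memdup(NULL, p + *off, len, gfp,
					   ULONG_MAX);

		kunmap_local(p);
		if (!q)
			return NULL;
		*off = offset_in_page(q);
		return virt_to_page(q);
	}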

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/tcp.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 510bacc7ce7b..238a8ad6527c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1418,10 +1418,10 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				goto do_error;
 			copy = err;
 		} else if (zc == 2) {
-			/* Splice in data. */
+			/* Splice in data if we can; copy if we can't. */
 			struct page *page = NULL, **pages = &page;
 			size_t off = 0, part;
-			bool can_coalesce;
+			bool can_coalesce, put = false;
 			int i = skb_shinfo(skb)->nr_frags;
 
 			copy = iov_iter_extract_pages(&msg->msg_iter, &pages,
@@ -1448,12 +1448,34 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				goto wait_for_space;
 			copy = part;
 
+			if (!sendpage_ok(page)) {
+				const void *p = kmap_local_page(page);
+				void *q;
+
+				q = page_frag_memdup(NULL, p + off, copy,
+						     sk->sk_allocation, ULONG_MAX);
+				kunmap_local(p);
+				if (!q) {
+					iov_iter_revert(&msg->msg_iter, copy);
+					err = copy ?: -ENOMEM;
+					goto do_error;
+				}
+				page = virt_to_page(q);
+				off = offset_in_page(q);
+				put = true;
+				can_coalesce = false;
+			}
+
 			if (can_coalesce) {
 				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 			} else {
-				get_page(page);
+				if (!put)
+					get_page(page);
+				put = false;
 				skb_fill_page_desc_noacc(skb, i, page, off, copy);
 			}
+			if (put)
+				put_page(page);
 			page = NULL;
 
 			if (!(flags & MSG_NO_SHARED_FRAGS))



* [PATCH net-next v4 08/20] tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (6 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 07/20] tcp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 09/20] tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg David Howells
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

Convert do_tcp_sendpages() to use sendmsg() with MSG_SPLICE_PAGES rather
than directly splicing in the pages itself.  do_tcp_sendpages() can then
be inlined into its callers in subsequent patches.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/tcp.c | 158 +++----------------------------------------------
 1 file changed, 7 insertions(+), 151 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 238a8ad6527c..a8a4ace8b3da 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -972,163 +972,19 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
 	return min(copy, sk->sk_forward_alloc);
 }
 
-static struct sk_buff *tcp_build_frag(struct sock *sk, int size_goal, int flags,
-				      struct page *page, int offset, size_t *size)
-{
-	struct sk_buff *skb = tcp_write_queue_tail(sk);
-	struct tcp_sock *tp = tcp_sk(sk);
-	bool can_coalesce;
-	int copy, i;
-
-	if (!skb || (copy = size_goal - skb->len) <= 0 ||
-	    !tcp_skb_can_collapse_to(skb)) {
-new_segment:
-		if (!sk_stream_memory_free(sk))
-			return NULL;
-
-		skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
-					   tcp_rtx_and_write_queues_empty(sk));
-		if (!skb)
-			return NULL;
-
-#ifdef CONFIG_TLS_DEVICE
-		skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
-#endif
-		tcp_skb_entail(sk, skb);
-		copy = size_goal;
-	}
-
-	if (copy > *size)
-		copy = *size;
-
-	i = skb_shinfo(skb)->nr_frags;
-	can_coalesce = skb_can_coalesce(skb, i, page, offset);
-	if (!can_coalesce && i >= READ_ONCE(sysctl_max_skb_frags)) {
-		tcp_mark_push(tp, skb);
-		goto new_segment;
-	}
-	if (tcp_downgrade_zcopy_pure(sk, skb))
-		return NULL;
-
-	copy = tcp_wmem_schedule(sk, copy);
-	if (!copy)
-		return NULL;
-
-	if (can_coalesce) {
-		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
-	} else {
-		get_page(page);
-		skb_fill_page_desc_noacc(skb, i, page, offset, copy);
-	}
-
-	if (!(flags & MSG_NO_SHARED_FRAGS))
-		skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
-
-	skb->len += copy;
-	skb->data_len += copy;
-	skb->truesize += copy;
-	sk_wmem_queued_add(sk, copy);
-	sk_mem_charge(sk, copy);
-	WRITE_ONCE(tp->write_seq, tp->write_seq + copy);
-	TCP_SKB_CB(skb)->end_seq += copy;
-	tcp_skb_pcount_set(skb, 0);
-
-	*size = copy;
-	return skb;
-}
-
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			 size_t size, int flags)
 {
-	struct tcp_sock *tp = tcp_sk(sk);
-	int mss_now, size_goal;
-	int err;
-	ssize_t copied;
-	long timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
-
-	if (IS_ENABLED(CONFIG_DEBUG_VM) &&
-	    WARN_ONCE(!sendpage_ok(page),
-		      "page must not be a Slab one and have page_count > 0"))
-		return -EINVAL;
-
-	/* Wait for a connection to finish. One exception is TCP Fast Open
-	 * (passive side) where data is allowed to be sent before a connection
-	 * is fully established.
-	 */
-	if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
-	    !tcp_passive_fastopen(sk)) {
-		err = sk_stream_wait_connect(sk, &timeo);
-		if (err != 0)
-			goto out_err;
-	}
+	struct bio_vec bvec;
+	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
 
-	sk_clear_bit(SOCKWQ_ASYNC_NOSPACE, sk);
+	bvec_set_page(&bvec, page, size, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
 
-	mss_now = tcp_send_mss(sk, &size_goal, flags);
-	copied = 0;
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		msg.msg_flags |= MSG_MORE;
 
-	err = -EPIPE;
-	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
-		goto out_err;
-
-	while (size > 0) {
-		struct sk_buff *skb;
-		size_t copy = size;
-
-		skb = tcp_build_frag(sk, size_goal, flags, page, offset, &copy);
-		if (!skb)
-			goto wait_for_space;
-
-		if (!copied)
-			TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;
-
-		copied += copy;
-		offset += copy;
-		size -= copy;
-		if (!size)
-			goto out;
-
-		if (skb->len < size_goal || (flags & MSG_OOB))
-			continue;
-
-		if (forced_push(tp)) {
-			tcp_mark_push(tp, skb);
-			__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
-		} else if (skb == tcp_send_head(sk))
-			tcp_push_one(sk, mss_now);
-		continue;
-
-wait_for_space:
-		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-		tcp_push(sk, flags & ~MSG_MORE, mss_now,
-			 TCP_NAGLE_PUSH, size_goal);
-
-		err = sk_stream_wait_memory(sk, &timeo);
-		if (err != 0)
-			goto do_error;
-
-		mss_now = tcp_send_mss(sk, &size_goal, flags);
-	}
-
-out:
-	if (copied) {
-		tcp_tx_timestamp(sk, sk->sk_tsflags);
-		if (!(flags & MSG_SENDPAGE_NOTLAST))
-			tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
-	}
-	return copied;
-
-do_error:
-	tcp_remove_empty_skb(sk);
-	if (copied)
-		goto out;
-out_err:
-	/* make sure we wake any epoll edge trigger waiter */
-	if (unlikely(tcp_rtx_and_write_queues_empty(sk) && err == -EAGAIN)) {
-		sk->sk_write_space(sk);
-		tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED);
-	}
-	return sk_stream_error(sk, flags, err);
+	return tcp_sendmsg_locked(sk, &msg, size);
 }
 EXPORT_SYMBOL_GPL(do_tcp_sendpages);
 



* [PATCH net-next v4 09/20] tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (7 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 08/20] tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 10/20] espintcp: Inline do_tcp_sendpages() David Howells
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm, John Fastabend, Jakub Sitnicki, bpf

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it.  This is part of replacing ->sendpage() with a call to
sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Sitnicki <jakub@cloudflare.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
cc: bpf@vger.kernel.org
---
 net/ipv4/tcp_bpf.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index ebf917511937..24bfb885777e 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -72,11 +72,13 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
 {
 	bool apply = apply_bytes;
 	struct scatterlist *sge;
+	struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
 	struct page *page;
 	int size, ret = 0;
 	u32 off;
 
 	while (1) {
+		struct bio_vec bvec;
 		bool has_tx_ulp;
 
 		sge = sk_msg_elem(msg, msg->sg.start);
@@ -88,16 +90,18 @@ static int tcp_bpf_push(struct sock *sk, struct sk_msg *msg, u32 apply_bytes,
 		tcp_rate_check_app_limited(sk);
 retry:
 		has_tx_ulp = tls_sw_has_ctx_tx(sk);
-		if (has_tx_ulp) {
-			flags |= MSG_SENDPAGE_NOPOLICY;
-			ret = kernel_sendpage_locked(sk,
-						     page, off, size, flags);
-		} else {
-			ret = do_tcp_sendpages(sk, page, off, size, flags);
-		}
+		if (has_tx_ulp)
+			msghdr.msg_flags |= MSG_SENDPAGE_NOPOLICY;
 
+		if (flags & MSG_SENDPAGE_NOTLAST)
+			msghdr.msg_flags |= MSG_MORE;
+
+		bvec_set_page(&bvec, page, size, off);
+		iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+		ret = tcp_sendmsg_locked(sk, &msghdr, size);
 		if (ret <= 0)
 			return ret;
+
 		if (apply)
 			apply_bytes -= ret;
 		msg->sg.size -= ret;
@@ -404,7 +408,7 @@ static int tcp_bpf_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 	long timeo;
 	int flags;
 
-	/* Don't let internal do_tcp_sendpages() flags through */
+	/* Don't let internal sendpage flags through */
 	flags = (msg->msg_flags & ~MSG_SENDPAGE_DECRYPTED);
 	flags |= MSG_NO_SHARED_FRAGS;
 



* [PATCH net-next v4 10/20] espintcp: Inline do_tcp_sendpages()
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (8 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 09/20] tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 11/20] tls: " David Howells
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm, Steffen Klassert, Herbert Xu

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Steffen Klassert <steffen.klassert@secunet.com>
cc: Herbert Xu <herbert@gondor.apana.org.au>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/xfrm/espintcp.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/espintcp.c b/net/xfrm/espintcp.c
index 872b80188e83..3504925babdb 100644
--- a/net/xfrm/espintcp.c
+++ b/net/xfrm/espintcp.c
@@ -205,14 +205,16 @@ static int espintcp_sendskb_locked(struct sock *sk, struct espintcp_msg *emsg,
 static int espintcp_sendskmsg_locked(struct sock *sk,
 				     struct espintcp_msg *emsg, int flags)
 {
+	struct msghdr msghdr = { .msg_flags = flags | MSG_SPLICE_PAGES, };
 	struct sk_msg *skmsg = &emsg->skmsg;
 	struct scatterlist *sg;
 	int done = 0;
 	int ret;
 
-	flags |= MSG_SENDPAGE_NOTLAST;
+	msghdr.msg_flags |= MSG_SENDPAGE_NOTLAST;
 	sg = &skmsg->sg.data[skmsg->sg.start];
 	do {
+		struct bio_vec bvec;
 		size_t size = sg->length - emsg->offset;
 		int offset = sg->offset + emsg->offset;
 		struct page *p;
@@ -220,11 +222,13 @@ static int espintcp_sendskmsg_locked(struct sock *sk,
 		emsg->offset = 0;
 
 		if (sg_is_last(sg))
-			flags &= ~MSG_SENDPAGE_NOTLAST;
+			msghdr.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
 
 		p = sg_page(sg);
 retry:
-		ret = do_tcp_sendpages(sk, p, offset, size, flags);
+		bvec_set_page(&bvec, p, size, offset);
+		iov_iter_bvec(&msghdr.msg_iter, ITER_SOURCE, &bvec, 1, size);
+		ret = tcp_sendmsg_locked(sk, &msghdr, size);
 		if (ret < 0) {
 			emsg->offset = offset - sg->offset;
 			skmsg->sg.start += done;



* [PATCH net-next v4 11/20] tls: Inline do_tcp_sendpages()
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (9 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 10/20] espintcp: Inline do_tcp_sendpages() David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 12/20] siw: " David Howells
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm, Boris Pismenny, John Fastabend

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Boris Pismenny <borisp@nvidia.com>
cc: John Fastabend <john.fastabend@gmail.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/net/tls.h  |  2 +-
 net/tls/tls_main.c | 24 +++++++++++++++---------
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 154949c7b0c8..d31521c36a84 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -256,7 +256,7 @@ struct tls_context {
 	struct scatterlist *partially_sent_record;
 	u16 partially_sent_offset;
 
-	bool in_tcp_sendpages;
+	bool splicing_pages;
 	bool pending_open_record_frags;
 
 	struct mutex tx_lock; /* protects partially_sent_* fields and
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index b32c112984dd..1d0e318d7977 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -124,7 +124,10 @@ int tls_push_sg(struct sock *sk,
 		u16 first_offset,
 		int flags)
 {
-	int sendpage_flags = flags | MSG_SENDPAGE_NOTLAST;
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = MSG_SENDPAGE_NOTLAST | MSG_SPLICE_PAGES | flags,
+	};
 	int ret = 0;
 	struct page *p;
 	size_t size;
@@ -133,16 +136,19 @@ int tls_push_sg(struct sock *sk,
 	size = sg->length - offset;
 	offset += sg->offset;
 
-	ctx->in_tcp_sendpages = true;
+	ctx->splicing_pages = true;
 	while (1) {
 		if (sg_is_last(sg))
-			sendpage_flags = flags;
+			msg.msg_flags &= ~MSG_SENDPAGE_NOTLAST;
 
 		/* is sending application-limited? */
 		tcp_rate_check_app_limited(sk);
 		p = sg_page(sg);
 retry:
-		ret = do_tcp_sendpages(sk, p, offset, size, sendpage_flags);
+		bvec_set_page(&bvec, p, size, offset);
+		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
+
+		ret = tcp_sendmsg_locked(sk, &msg, size);
 
 		if (ret != size) {
 			if (ret > 0) {
@@ -154,7 +160,7 @@ int tls_push_sg(struct sock *sk,
 			offset -= sg->offset;
 			ctx->partially_sent_offset = offset;
 			ctx->partially_sent_record = (void *)sg;
-			ctx->in_tcp_sendpages = false;
+			ctx->splicing_pages = false;
 			return ret;
 		}
 
@@ -168,7 +174,7 @@ int tls_push_sg(struct sock *sk,
 		size = sg->length;
 	}
 
-	ctx->in_tcp_sendpages = false;
+	ctx->splicing_pages = false;
 
 	return 0;
 }
@@ -246,11 +252,11 @@ static void tls_write_space(struct sock *sk)
 {
 	struct tls_context *ctx = tls_get_ctx(sk);
 
-	/* If in_tcp_sendpages call lower protocol write space handler
+	/* If splicing_pages is set, call lower protocol write space handler
 	 * to ensure we wake up any waiting operations there. For example
-	 * if do_tcp_sendpages where to call sk_wait_event.
+	 * if splicing pages were to call sk_wait_event.
 	 */
-	if (ctx->in_tcp_sendpages) {
+	if (ctx->splicing_pages) {
 		ctx->sk_write_space(sk);
 		return;
 	}



* [PATCH net-next v4 12/20] siw: Inline do_tcp_sendpages()
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (10 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 11/20] tls: " David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 13/20] tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked() David Howells
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm, Bernard Metzler, Tom Talpey, linux-rdma

do_tcp_sendpages() is now just a small wrapper around tcp_sendmsg_locked(),
so inline it, allowing do_tcp_sendpages() to be removed.  This is part of
replacing ->sendpage() with a call to sendmsg() with MSG_SPLICE_PAGES set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Tom Talpey <tom@talpey.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-rdma@vger.kernel.org
cc: netdev@vger.kernel.org
---
 drivers/infiniband/sw/siw/siw_qp_tx.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw_qp_tx.c b/drivers/infiniband/sw/siw/siw_qp_tx.c
index 05052b49107f..fa5de40d85d5 100644
--- a/drivers/infiniband/sw/siw/siw_qp_tx.c
+++ b/drivers/infiniband/sw/siw/siw_qp_tx.c
@@ -313,7 +313,7 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 }
 
 /*
- * 0copy TCP transmit interface: Use do_tcp_sendpages.
+ * 0copy TCP transmit interface: Use MSG_SPLICE_PAGES.
  *
  * Using sendpage to push page by page appears to be less efficient
  * than using sendmsg, even if data are copied.
@@ -324,20 +324,27 @@ static int siw_tx_ctrl(struct siw_iwarp_tx *c_tx, struct socket *s,
 static int siw_tcp_sendpages(struct socket *s, struct page **page, int offset,
 			     size_t size)
 {
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = (MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST |
+			      MSG_SPLICE_PAGES),
+	};
 	struct sock *sk = s->sk;
-	int i = 0, rv = 0, sent = 0,
-	    flags = MSG_MORE | MSG_DONTWAIT | MSG_SENDPAGE_NOTLAST;
+	int i = 0, rv = 0, sent = 0;
 
 	while (size) {
 		size_t bytes = min_t(size_t, PAGE_SIZE - offset, size);
 
 		if (size + offset <= PAGE_SIZE)
-			flags = MSG_MORE | MSG_DONTWAIT;
+			msg.msg_flags = MSG_MORE | MSG_DONTWAIT;
 
 		tcp_rate_check_app_limited(sk);
+		bvec_set_page(&bvec, page[i], bytes, offset);
+		iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, bytes);
+
 try_page_again:
 		lock_sock(sk);
-		rv = do_tcp_sendpages(sk, page[i], offset, bytes, flags);
+		rv = tcp_sendmsg_locked(sk, &msg, bytes);
 		release_sock(sk);
 
 		if (rv > 0) {



* [PATCH net-next v4 13/20] tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked()
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (11 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 12/20] siw: " David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 14/20] udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES David Howells
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

Fold do_tcp_sendpages() into its last remaining caller,
tcp_sendpage_locked().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Dumazet <edumazet@google.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/net/tcp.h |  2 --
 net/ipv4/tcp.c    | 21 +++++++--------------
 2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index a0a91a988272..11c62d37f3d5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -333,8 +333,6 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size,
 		 int flags);
 int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
 			size_t size, int flags);
-ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
-		 size_t size, int flags);
 int tcp_send_mss(struct sock *sk, int *size_goal, int flags);
 void tcp_push(struct sock *sk, int flags, int mss_now, int nonagle,
 	      int size_goal);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a8a4ace8b3da..c7240beedfed 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -972,12 +972,17 @@ static int tcp_wmem_schedule(struct sock *sk, int copy)
 	return min(copy, sk->sk_forward_alloc);
 }
 
-ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
-			 size_t size, int flags)
+int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
+			size_t size, int flags)
 {
 	struct bio_vec bvec;
 	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES, };
 
+	if (!(sk->sk_route_caps & NETIF_F_SG))
+		return sock_no_sendpage_locked(sk, page, offset, size, flags);
+
+	tcp_rate_check_app_limited(sk);  /* is sending application-limited? */
+
 	bvec_set_page(&bvec, page, size, offset);
 	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
 
@@ -986,18 +991,6 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 
 	return tcp_sendmsg_locked(sk, &msg, size);
 }
-EXPORT_SYMBOL_GPL(do_tcp_sendpages);
-
-int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
-			size_t size, int flags)
-{
-	if (!(sk->sk_route_caps & NETIF_F_SG))
-		return sock_no_sendpage_locked(sk, page, offset, size, flags);
-
-	tcp_rate_check_app_limited(sk);  /* is sending application-limited? */
-
-	return do_tcp_sendpages(sk, page, offset, size, flags);
-}
 EXPORT_SYMBOL_GPL(tcp_sendpage_locked);
 
 int tcp_sendpage(struct sock *sk, struct page *page, int offset,



* [PATCH net-next v4 14/20] udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (12 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 13/20] tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked() David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 15/20] ip: Remove ip_append_page() David Howells
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

Convert udp_sendpage() to use sendmsg() with MSG_SPLICE_PAGES rather than
directly splicing in the pages itself.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
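
For context, this is the path taken when userspace pushes file data to a
connected UDP socket with sendfile(), which at this point in the series
still enters the kernel through the socket's ->sendpage() op.  A minimal
userspace sketch (error handling elided; fd and sock are assumed to be
an open file and a connect()ed UDP socket):

	#include <sys/sendfile.h>

	off_t off = 0;
	/* Each chunk reaches udp_sendpage(), which now wraps sendmsg(). */
	ssize_t n = sendfile(sock, fd, &off, 4096);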

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/udp.c | 50 +++++++++-----------------------------------------
 1 file changed, 9 insertions(+), 41 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aa32afd871ee..0d3e78a65f51 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1332,52 +1332,20 @@ EXPORT_SYMBOL(udp_sendmsg);
 int udp_sendpage(struct sock *sk, struct page *page, int offset,
 		 size_t size, int flags)
 {
-	struct inet_sock *inet = inet_sk(sk);
-	struct udp_sock *up = udp_sk(sk);
+	struct bio_vec bvec;
+	struct msghdr msg = {
+		.msg_flags = flags | MSG_SPLICE_PAGES,
+	};
 	int ret;
 
-	if (flags & MSG_SENDPAGE_NOTLAST)
-		flags |= MSG_MORE;
+	bvec_set_page(&bvec, page, size, offset);
+	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);
 
-	if (!up->pending) {
-		struct msghdr msg = {	.msg_flags = flags|MSG_MORE };
-
-		/* Call udp_sendmsg to specify destination address which
-		 * sendpage interface can't pass.
-		 * This will succeed only when the socket is connected.
-		 */
-		ret = udp_sendmsg(sk, &msg, 0);
-		if (ret < 0)
-			return ret;
-	}
+	if (flags & MSG_SENDPAGE_NOTLAST)
+		msg.msg_flags |= MSG_MORE;
 
 	lock_sock(sk);
-
-	if (unlikely(!up->pending)) {
-		release_sock(sk);
-
-		net_dbg_ratelimited("cork failed\n");
-		return -EINVAL;
-	}
-
-	ret = ip_append_page(sk, &inet->cork.fl.u.ip4,
-			     page, offset, size, flags);
-	if (ret == -EOPNOTSUPP) {
-		release_sock(sk);
-		return sock_no_sendpage(sk->sk_socket, page, offset,
-					size, flags);
-	}
-	if (ret < 0) {
-		udp_flush_pending_frames(sk);
-		goto out;
-	}
-
-	up->len += size;
-	if (!(READ_ONCE(up->corkflag) || (flags&MSG_MORE)))
-		ret = udp_push_pending_frames(sk);
-	if (!ret)
-		ret = size;
-out:
+	ret = udp_sendmsg(sk, &msg, size);
 	release_sock(sk);
 	return ret;
 }



* [PATCH net-next v4 15/20] ip: Remove ip_append_page()
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (13 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 14/20] udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 16/20] ip, udp: Support MSG_SPLICE_PAGES David Howells
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

ip_append_page() is no longer used with the removal of udp_sendpage(), so
remove it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/net/ip.h     |   2 -
 net/ipv4/ip_output.c | 136 ++-----------------------------------------
 2 files changed, 4 insertions(+), 134 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index c3fffaa92d6e..7627a4df893b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -220,8 +220,6 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
 		   unsigned int flags);
 int ip_generic_getfrag(void *from, char *to, int offset, int len, int odd,
 		       struct sk_buff *skb);
-ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
-		       int offset, size_t size, int flags);
 struct sk_buff *__ip_make_skb(struct sock *sk, struct flowi4 *fl4,
 			      struct sk_buff_head *queue,
 			      struct inet_cork *cork);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 22a90a9392eb..2dacee1a1ed4 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1310,10 +1310,10 @@ static int ip_setup_cork(struct sock *sk, struct inet_cork *cork,
 }
 
 /*
- *	ip_append_data() and ip_append_page() can make one large IP datagram
- *	from many pieces of data. Each pieces will be holded on the socket
- *	until ip_push_pending_frames() is called. Each piece can be a page
- *	or non-page data.
+ *	ip_append_data() can make one large IP datagram from many pieces of
+ *	data.  Each piece will be held on the socket until
+ *	ip_push_pending_frames() is called. Each piece can be a page or
+ *	non-page data.
  *
  *	Not only UDP, other transport protocols - e.g. raw sockets - can use
  *	this interface potentially.
@@ -1346,134 +1346,6 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
 				from, length, transhdrlen, flags);
 }
 
-ssize_t	ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
-		       int offset, size_t size, int flags)
-{
-	struct inet_sock *inet = inet_sk(sk);
-	struct sk_buff *skb;
-	struct rtable *rt;
-	struct ip_options *opt = NULL;
-	struct inet_cork *cork;
-	int hh_len;
-	int mtu;
-	int len;
-	int err;
-	unsigned int maxfraglen, fragheaderlen, fraggap, maxnonfragsize;
-
-	if (inet->hdrincl)
-		return -EPERM;
-
-	if (flags&MSG_PROBE)
-		return 0;
-
-	if (skb_queue_empty(&sk->sk_write_queue))
-		return -EINVAL;
-
-	cork = &inet->cork.base;
-	rt = (struct rtable *)cork->dst;
-	if (cork->flags & IPCORK_OPT)
-		opt = cork->opt;
-
-	if (!(rt->dst.dev->features & NETIF_F_SG))
-		return -EOPNOTSUPP;
-
-	hh_len = LL_RESERVED_SPACE(rt->dst.dev);
-	mtu = cork->gso_size ? IP_MAX_MTU : cork->fragsize;
-
-	fragheaderlen = sizeof(struct iphdr) + (opt ? opt->optlen : 0);
-	maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen;
-	maxnonfragsize = ip_sk_ignore_df(sk) ? 0xFFFF : mtu;
-
-	if (cork->length + size > maxnonfragsize - fragheaderlen) {
-		ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport,
-			       mtu - (opt ? opt->optlen : 0));
-		return -EMSGSIZE;
-	}
-
-	skb = skb_peek_tail(&sk->sk_write_queue);
-	if (!skb)
-		return -EINVAL;
-
-	cork->length += size;
-
-	while (size > 0) {
-		/* Check if the remaining data fits into current packet. */
-		len = mtu - skb->len;
-		if (len < size)
-			len = maxfraglen - skb->len;
-
-		if (len <= 0) {
-			struct sk_buff *skb_prev;
-			int alloclen;
-
-			skb_prev = skb;
-			fraggap = skb_prev->len - maxfraglen;
-
-			alloclen = fragheaderlen + hh_len + fraggap + 15;
-			skb = sock_wmalloc(sk, alloclen, 1, sk->sk_allocation);
-			if (unlikely(!skb)) {
-				err = -ENOBUFS;
-				goto error;
-			}
-
-			/*
-			 *	Fill in the control structures
-			 */
-			skb->ip_summed = CHECKSUM_NONE;
-			skb->csum = 0;
-			skb_reserve(skb, hh_len);
-
-			/*
-			 *	Find where to start putting bytes.
-			 */
-			skb_put(skb, fragheaderlen + fraggap);
-			skb_reset_network_header(skb);
-			skb->transport_header = (skb->network_header +
-						 fragheaderlen);
-			if (fraggap) {
-				skb->csum = skb_copy_and_csum_bits(skb_prev,
-								   maxfraglen,
-						    skb_transport_header(skb),
-								   fraggap);
-				skb_prev->csum = csum_sub(skb_prev->csum,
-							  skb->csum);
-				pskb_trim_unique(skb_prev, maxfraglen);
-			}
-
-			/*
-			 * Put the packet on the pending queue.
-			 */
-			__skb_queue_tail(&sk->sk_write_queue, skb);
-			continue;
-		}
-
-		if (len > size)
-			len = size;
-
-		if (skb_append_pagefrags(skb, page, offset, len)) {
-			err = -EMSGSIZE;
-			goto error;
-		}
-
-		if (skb->ip_summed == CHECKSUM_NONE) {
-			__wsum csum;
-			csum = csum_page(page, offset, len);
-			skb->csum = csum_block_add(skb->csum, csum, skb->len);
-		}
-
-		skb_len_add(skb, len);
-		refcount_add(len, &sk->sk_wmem_alloc);
-		offset += len;
-		size -= len;
-	}
-	return 0;
-
-error:
-	cork->length -= size;
-	IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
-	return err;
-}
-
 static void ip_cork_release(struct inet_cork *cork)
 {
 	cork->flags &= ~IPCORK_OPT;



* [PATCH net-next v4 16/20] ip, udp: Support MSG_SPLICE_PAGES
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (14 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 15/20] ip: Remove ip_append_page() David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 17/20] ip, udp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data David Howells
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

Make IP/UDP sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
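
The gain over ->sendpage() is that an in-kernel caller can now hand over
a whole run of pages in one call.  A sketch, assuming the caller has an
array of three pages to transmit (pages[], tail_len and sock here are
hypothetical):

	struct bio_vec bv[3];
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES, };
	int ret;

	bvec_set_page(&bv[0], pages[0], PAGE_SIZE, 0);
	bvec_set_page(&bv[1], pages[1], PAGE_SIZE, 0);
	bvec_set_page(&bv[2], pages[2], tail_len, 0);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, bv, 3,
		      2 * PAGE_SIZE + tail_len);

	/* One sendmsg() splices all three pages into the datagram. */
	ret = sock_sendmsg(sock, &msg);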

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/ip_output.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 2dacee1a1ed4..13d19867ffd3 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -957,6 +957,41 @@ csum_page(struct page *page, int offset, int copy)
 	return csum;
 }
 
+/*
+ * Add (or copy) data pages for MSG_SPLICE_PAGES.
+ */
+static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
+			     void *from, int *pcopy)
+{
+	struct msghdr *msg = from;
+	struct page *page = NULL, **pages = &page;
+	ssize_t copy = *pcopy;
+	size_t off;
+	int err;
+
+	copy = iov_iter_extract_pages(&msg->msg_iter, &pages, copy, 1, 0, &off);
+	if (copy <= 0)
+		return copy ?: -EIO;
+
+	err = skb_append_pagefrags(skb, page, off, copy);
+	if (err < 0) {
+		iov_iter_revert(&msg->msg_iter, copy);
+		return err;
+	}
+
+	if (skb->ip_summed == CHECKSUM_NONE) {
+		__wsum csum;
+
+		csum = csum_page(page, off, copy);
+		skb->csum = csum_block_add(skb->csum, csum, skb->len);
+	}
+
+	skb_len_add(skb, copy);
+	refcount_add(copy, &sk->sk_wmem_alloc);
+	*pcopy = copy;
+	return 0;
+}
+
 static int __ip_append_data(struct sock *sk,
 			    struct flowi4 *fl4,
 			    struct sk_buff_head *queue,
@@ -1048,6 +1083,14 @@ static int __ip_append_data(struct sock *sk,
 				skb_zcopy_set(skb, uarg, &extra_uref);
 			}
 		}
+	} else if ((flags & MSG_SPLICE_PAGES) && length) {
+		if (inet->hdrincl)
+			return -EPERM;
+		if (rt->dst.dev->features & NETIF_F_SG)
+			/* We need an empty buffer to attach stuff to */
+			paged = true;
+		else
+			flags &= ~MSG_SPLICE_PAGES;
 	}
 
 	cork->length += length;
@@ -1207,6 +1250,10 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
+		} else if (flags & MSG_SPLICE_PAGES) {
+			err = __ip_splice_pages(sk, skb, from, &copy);
+			if (err < 0)
+				goto error;
 		} else if (!zc) {
 			int i = skb_shinfo(skb)->nr_frags;
 



* [PATCH net-next v4 17/20] ip, udp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (15 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 16/20] ip, udp: Support MSG_SPLICE_PAGES David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 18/20] ip6, udp6: Support MSG_SPLICE_PAGES David Howells
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

If sendmsg() with MSG_SPLICE_PAGES encounters a page that shouldn't be
spliced - a slab page, for instance, or one with a zero count - make
__ip_append_data() copy it.
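
The "shouldn't be spliced" test is sendpage_ok() from include/linux/net.h,
which at the time of this series amounts to:

	static inline bool sendpage_ok(struct page *page)
	{
		/* Only safe to hold by reference if it is not slab memory
		 * and carries at least one reference.
		 */
		return !PageSlab(page) && page_count(page) >= 1;
	}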

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/ipv4/ip_output.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 13d19867ffd3..e34c86b1b59a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -967,13 +967,32 @@ static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
 	struct page *page = NULL, **pages = &page;
 	ssize_t copy = *pcopy;
 	size_t off;
+	bool put = false;
 	int err;
 
 	copy = iov_iter_extract_pages(&msg->msg_iter, &pages, copy, 1, 0, &off);
 	if (copy <= 0)
 		return copy ?: -EIO;
 
+	if (!sendpage_ok(page)) {
+		const void *p = kmap_local_page(page);
+		void *q;
+
+		q = page_frag_memdup(NULL, p + off, copy,
+				     sk->sk_allocation, ULONG_MAX);
+		kunmap_local(p);
+		if (!q) {
+			iov_iter_revert(&msg->msg_iter, copy);
+			return -ENOMEM;
+		}
+		page = virt_to_page(q);
+		off = offset_in_page(q);
+		put = true;
+	}
+
 	err = skb_append_pagefrags(skb, page, off, copy);
+	if (put)
+		put_page(page);
 	if (err < 0) {
 		iov_iter_revert(&msg->msg_iter, copy);
 		return err;



* [PATCH net-next v4 18/20] ip6, udp6: Support MSG_SPLICE_PAGES
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (16 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 17/20] ip, udp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 19/20] af_unix: " David Howells
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

Make IP6/UDP6 sendmsg() support MSG_SPLICE_PAGES.  This causes pages to be
spliced from the source iterator if possible, copying the data if not.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 include/net/ip.h      |  1 +
 net/ipv4/ip_output.c  |  4 ++--
 net/ipv6/ip6_output.c | 12 ++++++++++++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 7627a4df893b..8a50341007bf 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -211,6 +211,7 @@ int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb);
 int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl,
 		    __u8 tos);
 void ip_init(void);
+int __ip_splice_pages(struct sock *sk, struct sk_buff *skb, void *from, int *pcopy);
 int ip_append_data(struct sock *sk, struct flowi4 *fl4,
 		   int getfrag(void *from, char *to, int offset, int len,
 			       int odd, struct sk_buff *skb),
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e34c86b1b59a..241a78d82766 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -960,8 +960,7 @@ csum_page(struct page *page, int offset, int copy)
 /*
  * Add (or copy) data pages for MSG_SPLICE_PAGES.
  */
-static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
-			     void *from, int *pcopy)
+int __ip_splice_pages(struct sock *sk, struct sk_buff *skb, void *from, int *pcopy)
 {
 	struct msghdr *msg = from;
 	struct page *page = NULL, **pages = &page;
@@ -1010,6 +1009,7 @@ static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
 	*pcopy = copy;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(__ip_splice_pages);
 
 static int __ip_append_data(struct sock *sk,
 			    struct flowi4 *fl4,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 0b6140f0179d..82846d18cf22 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1589,6 +1589,14 @@ static int __ip6_append_data(struct sock *sk,
 				skb_zcopy_set(skb, uarg, &extra_uref);
 			}
 		}
+	} else if ((flags & MSG_SPLICE_PAGES) && length) {
+		if (inet_sk(sk)->hdrincl)
+			return -EPERM;
+		if (rt->dst.dev->features & NETIF_F_SG)
+			/* We need an empty buffer to attach stuff to */
+			paged = true;
+		else
+			flags &= ~MSG_SPLICE_PAGES;
 	}
 
 	/*
@@ -1778,6 +1786,10 @@ static int __ip6_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
+		} else if (flags & MSG_SPLICE_PAGES) {
+			err = __ip_splice_pages(sk, skb, from, &copy);
+			if (err < 0)
+				goto error;
 		} else if (!zc) {
 			int i = skb_shinfo(skb)->nr_frags;
 



* [PATCH net-next v4 19/20] af_unix: Support MSG_SPLICE_PAGES
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (17 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 18/20] ip6, udp6: Support MSG_SPLICE_PAGES David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-05 16:53 ` [PATCH net-next v4 20/20] af_unix: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data David Howells
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

Make AF_UNIX sendmsg() support MSG_SPLICE_PAGES, splicing in pages from the
source iterator if possible and copying the data in otherwise.

This allows ->sendpage() to be replaced by something that can handle
multiple multipage folios in a single transaction.
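
Once splice-to-socket is rerouted to sendmsg() (the goal of this series),
this is the path exercised by splicing from a pipe into a UNIX-domain
stream socket.  A minimal userspace sketch (error handling elided; pfd[0]
is the read end of a pipe holding data and us a connected SOCK_STREAM
AF_UNIX socket):

	#define _GNU_SOURCE
	#include <fcntl.h>

	/* Moves up to 64KiB from the pipe's pages into the socket. */
	ssize_t n = splice(pfd[0], NULL, us, NULL, 65536, SPLICE_F_MORE);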

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/unix/af_unix.c | 93 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 77 insertions(+), 16 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fb31e8a4409e..fee431a089d3 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2157,6 +2157,53 @@ static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
 }
 #endif
 
+/*
+ * Extract pages from an iterator and add them to the socket buffer.
+ */
+static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb,
+					struct iov_iter *iter, ssize_t maxsize)
+{
+	struct page *pages[8], **ppages = pages;
+	unsigned int i, nr;
+	ssize_t ret = 0;
+
+	while (iter->count > 0) {
+		size_t off; ssize_t len;
+
+		nr = min_t(size_t, MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags,
+			   ARRAY_SIZE(pages));
+		if (nr == 0)
+			break;
+
+		len = iov_iter_extract_pages(iter, &ppages, maxsize, nr, 0, &off);
+		if (len <= 0) {
+			if (!ret)
+				ret = len ?: -EIO;
+			break;
+		}
+
+		i = 0;
+		do {
+			size_t part = min_t(size_t, PAGE_SIZE - off, len);
+
+			if (skb_append_pagefrags(skb, pages[i++], off, part) < 0) {
+				if (!ret)
+					ret = -EMSGSIZE;
+				goto out;
+			}
+			off = 0;
+			ret += part;
+			maxsize -= part;
+			len -= part;
+		} while (len > 0);
+		if (maxsize <= 0)
+			break;
+	}
+
+out:
+	return ret;
+}
+
 static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 			       size_t len)
 {
@@ -2200,19 +2247,25 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 	while (sent < len) {
 		size = len - sent;
 
-		/* Keep two messages in the pipe so it schedules better */
-		size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
+		if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+			skb = sock_alloc_send_pskb(sk, 0, 0,
+						   msg->msg_flags & MSG_DONTWAIT,
+						   &err, 0);
+		} else {
+			/* Keep two messages in the pipe so it schedules better */
+			size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
 
-		/* allow fallback to order-0 allocations */
-		size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
+			/* allow fallback to order-0 allocations */
+			size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
 
-		data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
+			data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
 
-		data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
+			data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
 
-		skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
-					   msg->msg_flags & MSG_DONTWAIT, &err,
-					   get_order(UNIX_SKB_FRAGS_SZ));
+			skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
+						   msg->msg_flags & MSG_DONTWAIT, &err,
+						   get_order(UNIX_SKB_FRAGS_SZ));
+		}
 		if (!skb)
 			goto out_err;
 
@@ -2224,13 +2277,21 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		}
 		fds_sent = true;
 
-		skb_put(skb, size - data_len);
-		skb->data_len = data_len;
-		skb->len = size;
-		err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
-		if (err) {
-			kfree_skb(skb);
-			goto out_err;
+		if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
+			size = unix_extract_bvec_to_skb(skb, &msg->msg_iter, size);
+			skb->data_len += size;
+			skb->len += size;
+			skb->truesize += size;
+			refcount_add(size, &sk->sk_wmem_alloc);
+		} else {
+			skb_put(skb, size - data_len);
+			skb->data_len = data_len;
+			skb->len = size;
+			err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
+			if (err) {
+				kfree_skb(skb);
+				goto out_err;
+			}
 		}
 
 		unix_state_lock(other);



* [PATCH net-next v4 20/20] af_unix: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (18 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 19/20] af_unix: " David Howells
@ 2023-04-05 16:53 ` David Howells
  2023-04-06  2:19 ` [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 Jakub Kicinski
  2023-04-06  9:12 ` David Howells
  21 siblings, 0 replies; 24+ messages in thread
From: David Howells @ 2023-04-05 16:53 UTC (permalink / raw)
  To: netdev
  Cc: David Howells, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Willem de Bruijn, Matthew Wilcox, Al Viro,
	Christoph Hellwig, Jens Axboe, Jeff Layton, Christian Brauner,
	Chuck Lever III, Linus Torvalds, linux-fsdevel, linux-kernel,
	linux-mm

If sendmsg() with MSG_SPLICE_PAGES encounters a page that shouldn't be
spliced - a slab page, for instance, or one with a zero count - make
unix_extract_bvec_to_skb() copy it.
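
To illustrate why the copy is needed, a hypothetical in-kernel caller
might splice a kmalloc()'d buffer, whose backing page is slab memory
(buf and the sizes here are illustrative):

	void *buf = kmalloc(256, GFP_KERNEL);
	struct bio_vec bv;
	struct msghdr msg = { .msg_flags = MSG_SPLICE_PAGES, };

	bvec_set_page(&bv, virt_to_page(buf), 256, offset_in_page(buf));
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bv, 1, 256);

	/* The caller may kfree(buf) as soon as sendmsg() returns, so
	 * unix_extract_bvec_to_skb() sees !sendpage_ok() and copies the
	 * data into a page fragment rather than holding the slab page.
	 */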

Signed-off-by: David Howells <dhowells@redhat.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: netdev@vger.kernel.org
---
 net/unix/af_unix.c | 44 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index fee431a089d3..6941be8dae7e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2160,12 +2160,12 @@ static int queue_oob(struct socket *sock, struct msghdr *msg, struct sock *other
 /*
  * Extract pages from an iterator and add them to the socket buffer.
  */
-static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb,
-					struct iov_iter *iter, ssize_t maxsize)
+static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb, struct iov_iter *iter,
+					ssize_t maxsize, gfp_t gfp)
 {
 	struct page *pages[8], **ppages = pages;
 	unsigned int i, nr;
-	ssize_t ret = 0;
+	ssize_t spliced = 0, ret = 0;
 
 	while (iter->count > 0) {
 		size_t off; ssize_t len;
@@ -2177,31 +2177,52 @@ static ssize_t unix_extract_bvec_to_skb(struct sk_buff *skb,
 
 		len = iov_iter_extract_pages(iter, &ppages, maxsize, nr, 0, &off);
 		if (len <= 0) {
-			if (!ret)
-				ret = len ?: -EIO;
+			ret = len ?: -EIO;
 			break;
 		}
 
 		i = 0;
 		do {
+			struct page *page = pages[i++];
 			size_t part = min_t(size_t, PAGE_SIZE - off, len);
+			bool put = false;
+
+			if (!sendpage_ok(page)) {
+				const void *p = kmap_local_page(page);
+				void *q;
+
+				q = page_frag_memdup(NULL, p + off, part, gfp,
+						     ULONG_MAX);
+				kunmap_local(p);
+				if (!q) {
+					iov_iter_revert(iter, len);
+					ret = -ENOMEM;
+					goto out;
+				}
+				page = virt_to_page(q);
+				off = offset_in_page(q);
+				put = true;
+			}
 
-			if (skb_append_pagefrags(skb, pages[i++], off, part) < 0) {
-				if (!ret)
-					ret = -EMSGSIZE;
+			ret = skb_append_pagefrags(skb, page, off, part);
+			if (put)
+				put_page(page);
+			if (ret < 0) {
+				iov_iter_revert(iter, len);
 				goto out;
 			}
 			off = 0;
-			ret += part;
+			spliced += part;
 			maxsize -= part;
 			len -= part;
 		} while (len > 0);
+
 		if (maxsize <= 0)
 			break;
 	}
 
 out:
-	return ret;
+	return spliced ?: ret;
 }
 
 static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
@@ -2278,7 +2299,8 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		fds_sent = true;
 
 		if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
-			size = unix_extract_bvec_to_skb(skb, &msg->msg_iter, size);
+			size = unix_extract_bvec_to_skb(skb, &msg->msg_iter, size,
+							sk->sk_allocation);
 			skb->data_len += size;
 			skb->len += size;
 			skb->truesize += size;



* Re: [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (19 preceding siblings ...)
  2023-04-05 16:53 ` [PATCH net-next v4 20/20] af_unix: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data David Howells
@ 2023-04-06  2:19 ` Jakub Kicinski
  2023-04-06  9:12 ` David Howells
  21 siblings, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2023-04-06  2:19 UTC (permalink / raw)
  To: David Howells
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, Matthew Wilcox, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeff Layton, Christian Brauner, Chuck Lever III,
	Linus Torvalds, linux-fsdevel, linux-kernel, linux-mm

On Wed,  5 Apr 2023 17:53:19 +0100 David Howells wrote:
> Here's the first tranche of patches towards providing a MSG_SPLICE_PAGES
> internal sendmsg flag that is intended to replace the ->sendpage() op with
> calls to sendmsg().  MSG_SPLICE is a hint that tells the protocol that it
> should splice the pages supplied if it can and copy them if not.

Thanks for splitting off a smaller series!
I'm out of hours for the day, so just a trivial comment in case the
kbuild bot hasn't pinged you: this appears to break the build on the
relatively recently added page_frag_cache use in Google's vNIC driver
(gve).


* Re: [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1
  2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
                   ` (20 preceding siblings ...)
  2023-04-06  2:19 ` [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 Jakub Kicinski
@ 2023-04-06  9:12 ` David Howells
  2023-04-06 15:03   ` Jakub Kicinski
  21 siblings, 1 reply; 24+ messages in thread
From: David Howells @ 2023-04-06  9:12 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: dhowells, netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, Matthew Wilcox, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeff Layton, Christian Brauner, Chuck Lever III,
	Linus Torvalds, linux-fsdevel, linux-kernel, linux-mm

Jakub Kicinski <kuba@kernel.org> wrote:

> Thanks for splitting off a smaller series!
> I'm out of hours for the day, so just a trivial comment in case the
> kbuild bot hasn't pinged you: this appears to break the build on the
> relatively recently added page_frag_cache use in Google's vNIC driver
> (gve).

Yep.  I've just been fixing that up.

I'll also break off the samples patch and that can go by itself.  Is there a
problem with the 32-bit userspace build environment that patchwork is using?
The sample programs that patch adds are all userspace helpers.  It seems that
<features.h> is referencing a file that doesn't exist:

In file included from /usr/include/features.h:514,
                 from /usr/include/bits/libc-header-start.h:33,
                 from /usr/include/stdio.h:27,
                 from ../samples/net/alg-hash.c:9:
/usr/include/gnu/stubs.h:7:11: fatal error: gnu/stubs-32.h: No such file or directory
    7 | # include <gnu/stubs-32.h>
      |           ^~~~~~~~~~~~~~~~
compilation terminated.

Excerpt from:

https://patchwork.hopto.org/static/nipa/737278/13202278/build_32bit/

https://patchwork.kernel.org/project/netdevbpf/patch/20230405165339.3468808-2-dhowells@redhat.com/

David



* Re: [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1
  2023-04-06  9:12 ` David Howells
@ 2023-04-06 15:03   ` Jakub Kicinski
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Kicinski @ 2023-04-06 15:03 UTC (permalink / raw)
  To: David Howells
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Willem de Bruijn, Matthew Wilcox, Al Viro, Christoph Hellwig,
	Jens Axboe, Jeff Layton, Christian Brauner, Chuck Lever III,
	Linus Torvalds, linux-fsdevel, linux-kernel, linux-mm

On Thu, 06 Apr 2023 10:12:19 +0100 David Howells wrote:
> I'll also break off the samples patch and that can go by itself.  Is there a
> problem with the 32-bit userspace build environment that patchwork is using?

Yes, I think it was missing the 32-bit glibc; I installed it last night.
Hopefully it should work now.


Thread overview: 24+ messages
2023-04-05 16:53 [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 David Howells
2023-04-05 16:53 ` [PATCH net-next v4 01/20] net: Add samples for network I/O and splicing David Howells
2023-04-05 16:53 ` [PATCH net-next v4 02/20] net: Declare MSG_SPLICE_PAGES internal sendmsg() flag David Howells
2023-04-05 16:53 ` [PATCH net-next v4 03/20] mm: Move the page fragment allocator from page_alloc.c into its own file David Howells
2023-04-05 16:53 ` [PATCH net-next v4 04/20] mm: Make the page_frag_cache allocator use multipage folios David Howells
2023-04-05 16:53 ` [PATCH net-next v4 05/20] mm: Make the page_frag_cache allocator use per-cpu David Howells
2023-04-05 16:53 ` [PATCH net-next v4 06/20] tcp: Support MSG_SPLICE_PAGES David Howells
2023-04-05 16:53 ` [PATCH net-next v4 07/20] tcp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data David Howells
2023-04-05 16:53 ` [PATCH net-next v4 08/20] tcp: Convert do_tcp_sendpages() to use MSG_SPLICE_PAGES David Howells
2023-04-05 16:53 ` [PATCH net-next v4 09/20] tcp_bpf: Inline do_tcp_sendpages as it's now a wrapper around tcp_sendmsg David Howells
2023-04-05 16:53 ` [PATCH net-next v4 10/20] espintcp: Inline do_tcp_sendpages() David Howells
2023-04-05 16:53 ` [PATCH net-next v4 11/20] tls: " David Howells
2023-04-05 16:53 ` [PATCH net-next v4 12/20] siw: " David Howells
2023-04-05 16:53 ` [PATCH net-next v4 13/20] tcp: Fold do_tcp_sendpages() into tcp_sendpage_locked() David Howells
2023-04-05 16:53 ` [PATCH net-next v4 14/20] udp: Convert udp_sendpage() to use MSG_SPLICE_PAGES David Howells
2023-04-05 16:53 ` [PATCH net-next v4 15/20] ip: Remove ip_append_page() David Howells
2023-04-05 16:53 ` [PATCH net-next v4 16/20] ip, udp: Support MSG_SPLICE_PAGES David Howells
2023-04-05 16:53 ` [PATCH net-next v4 17/20] ip, udp: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data David Howells
2023-04-05 16:53 ` [PATCH net-next v4 18/20] ip6, udp6: Support MSG_SPLICE_PAGES David Howells
2023-04-05 16:53 ` [PATCH net-next v4 19/20] af_unix: " David Howells
2023-04-05 16:53 ` [PATCH net-next v4 20/20] af_unix: Make sendmsg(MSG_SPLICE_PAGES) copy unspliceable data David Howells
2023-04-06  2:19 ` [PATCH net-next v4 00/20] splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1 Jakub Kicinski
2023-04-06  9:12 ` David Howells
2023-04-06 15:03   ` Jakub Kicinski
