From: David Howells <dhowells@redhat.com>
To: Matthew Wilcox <willy@infradead.org>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: David Howells <dhowells@redhat.com>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Christoph Hellwig <hch@infradead.org>,
	Jens Axboe <axboe@kernel.dk>, Jeff Layton <jlayton@kernel.org>,
	Christian Brauner <brauner@kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Bernard Metzler <bmt@zurich.ibm.com>, Tom Talpey <tom@talpey.com>,
	linux-rdma@vger.kernel.org
Subject: [RFC PATCH 02/28] Add a special allocator for staging netfs protocol to MSG_SPLICE_PAGES
Date: Thu, 16 Mar 2023 15:25:52 +0000
Message-ID: <20230316152618.711970-3-dhowells@redhat.com>
In-Reply-To: <20230316152618.711970-1-dhowells@redhat.com>

If a network protocol sendmsg() sees MSG_SPLICE_PAGES, it expects that the
iterator is of ITER_BVEC type and that all the pages can have refs taken on
them with get_page() and discarded with put_page().  Bits of network
filesystem protocol data, however, are typically contained in slab memory
for which the cleanup method is kfree(), not put_page(), so this doesn't
work.

Provide a simple allocator, zcopy_alloc(), that allocates a page at a time
per-cpu and sequentially breaks off pieces and hands them out with a ref as
it's asked for them.  The caller disposes of the memory it was given by
calling put_page().  When a page is all parcelled out, it is abandoned by
the allocator and another page is obtained.  The page will get cleaned up
when the last skbuff fragment is destroyed.
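
The scheme described above can be modelled in userspace as a rough sketch.
This is not the kernel code: model_page, model_alloc, model_frag,
model_put() and model_alloc_frag() are hypothetical names invented for
illustration, a plain counter stands in for the page refcount, and the
per-cpu state, spare page and gfp flags are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for a refcounted page. */
struct model_page {
	unsigned int refs;
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Hypothetical stand-in for the allocator state (one per CPU in the patch). */
struct model_alloc {
	struct model_page *page;	/* Page currently being carved up */
	size_t used;			/* Bytes already handed out */
};

/* Hypothetical stand-in for a bio_vec describing one fragment. */
struct model_frag {
	struct model_page *page;
	size_t offset;
	size_t len;
};

/* Drop one ref; the page is freed when the last ref goes away. */
static void model_put(struct model_page *page)
{
	if (--page->refs == 0)
		free(page);
}

/* Hand out 'size' bytes with a ref.  Returns 0 on success, -1 on failure. */
static int model_alloc_frag(struct model_alloc *a, size_t size,
			    struct model_frag *frag)
{
	size_t full = (size + 7) & ~(size_t)7;	/* Round up to 8 bytes */

	if (full > MODEL_PAGE_SIZE / 2)
		return -1;

	/* Abandon the current page if the request does not fit. */
	if (a->page && MODEL_PAGE_SIZE - a->used < full) {
		model_put(a->page);
		a->page = NULL;
	}
	if (!a->page) {
		a->page = calloc(1, sizeof(*a->page));
		if (!a->page)
			return -1;
		a->page->refs = 1;	/* The allocator's own ref */
		a->used = 0;
	}

	frag->page = a->page;
	frag->offset = a->used;
	frag->len = size;
	a->used += full;

	if (a->used < MODEL_PAGE_SIZE)
		a->page->refs++;	/* Caller gets a ref of its own */
	else
		a->page = NULL;		/* Caller inherits the allocator's ref */
	return 0;
}
```

In this model each fragment holder calls model_put() on frag.page when it
is finished, and the page is freed once the allocator and every holder have
dropped their refs, which mirrors the put_page() rule above.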

A helper function, zcopy_memdup(), is provided to call zcopy_alloc() and
copy the data it is given into it.
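
As a rough userspace illustration of what such a helper buys the caller,
the sketch below copies a heap-allocated header into refcounted memory so
that the original buffer can be freed straight away.  All names here
(demo_page, demo_frag, demo_alloc(), demo_put(), demo_memdup()) are
hypothetical, and demo_alloc() is a deliberately trivial stub that hands
out one fresh page per request rather than sharing pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical stand-in for a refcounted page. */
struct demo_page {
	unsigned int refs;
	unsigned char data[DEMO_PAGE_SIZE];
};

/* Hypothetical stand-in for a bio_vec describing one fragment. */
struct demo_frag {
	struct demo_page *page;
	size_t offset;
	size_t len;
};

/* Stub allocator: one new page per request; the caller owns the only ref. */
static int demo_alloc(size_t size, struct demo_frag *frag)
{
	struct demo_page *page;

	if (size > DEMO_PAGE_SIZE / 2)
		return -1;
	page = calloc(1, sizeof(*page));
	if (!page)
		return -1;
	page->refs = 1;
	frag->page = page;
	frag->offset = 0;
	frag->len = size;
	return 0;
}

/* Drop one ref; the page is freed when the last ref goes away. */
static void demo_put(struct demo_page *page)
{
	if (--page->refs == 0)
		free(page);
}

/* Allocate a fragment and copy 'size' bytes from 'src' into it. */
static int demo_memdup(size_t size, const void *src, struct demo_frag *frag)
{
	if (demo_alloc(size, frag) < 0)
		return -1;
	memcpy(frag->page->data + frag->offset, src, size);
	return 0;
}
```

A caller would build its protocol header in ordinary heap memory, call
demo_memdup(), free the heap copy, and later release the fragment with
demo_put(frag.page).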

[!] I'm not sure this is the best way to do things.  A better way might be
    to make the network protocol look at the page and copy it if it's a
    slab object rather than taking a ref on it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Bernard Metzler <bmt@zurich.ibm.com>
cc: Tom Talpey <tom@talpey.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-rdma@vger.kernel.org
cc: netdev@vger.kernel.org
---
 include/linux/zcopy_alloc.h |  16 +++++
 mm/Makefile                 |   2 +-
 mm/zcopy_alloc.c            | 129 ++++++++++++++++++++++++++++++++++++
 3 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/zcopy_alloc.h
 create mode 100644 mm/zcopy_alloc.c

diff --git a/include/linux/zcopy_alloc.h b/include/linux/zcopy_alloc.h
new file mode 100644
index 000000000000..8eb205678073
--- /dev/null
+++ b/include/linux/zcopy_alloc.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Defs for zerocopy filler fragment allocator.
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ */
+
+#ifndef _LINUX_ZCOPY_ALLOC_H
+#define _LINUX_ZCOPY_ALLOC_H
+
+struct bio_vec;
+
+int zcopy_alloc(size_t size, struct bio_vec *bvec, gfp_t gfp);
+int zcopy_memdup(size_t size, const void *p, struct bio_vec *bvec, gfp_t gfp);
+
+#endif /* _LINUX_ZCOPY_ALLOC_H */
diff --git a/mm/Makefile b/mm/Makefile
index 8e105e5b3e29..3848f43751ee 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o percpu.o slab_common.o \
-			   compaction.o \
+			   compaction.o zcopy_alloc.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o gup.o mmap_lock.o $(mmu-y)
 
diff --git a/mm/zcopy_alloc.c b/mm/zcopy_alloc.c
new file mode 100644
index 000000000000..7b219392e829
--- /dev/null
+++ b/mm/zcopy_alloc.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Allocator for zerocopy filler fragments
+ *
+ * Copyright (C) 2023 Red Hat, Inc. All Rights Reserved.
+ * Written by David Howells (dhowells@redhat.com)
+ *
+ * Provide a facility whereby pieces of bufferage can be allocated for
+ * insertion into bio_vec arrays intended for zerocopying, allowing protocol
+ * stuff to be mixed in with data.
+ *
+ * Unlike objects allocated from the slab, the lifetime of these pieces of
+ * buffer is governed purely by the refcount of the page in which they reside.
+ */
+
+#include <linux/export.h>
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/zcopy_alloc.h>
+#include <linux/bvec.h>
+
+struct zcopy_alloc_info {
+	struct folio	*folio;		/* Page currently being allocated from */
+	struct folio	*spare;		/* Spare page */
+	unsigned int	used;		/* Amount of folio used */
+	spinlock_t	lock;		/* Allocation lock (needs bh-disable) */
+};
+
+static struct zcopy_alloc_info __percpu *zcopy_alloc_info;
+
+static int __init zcopy_alloc_init(void)
+{
+	zcopy_alloc_info = alloc_percpu(struct zcopy_alloc_info);
+	if (!zcopy_alloc_info)
+		panic("Unable to set up zcopy_alloc allocator\n");
+	return 0;
+}
+subsys_initcall(zcopy_alloc_init);
+
+/**
+ * zcopy_alloc - Allocate some memory for use in zerocopy
+ * @size: The amount of memory (maximum 1/2 page).
+ * @bvec: Where to store the details of the memory
+ * @gfp: Allocation flags under which to make an allocation
+ *
+ * Allocate some memory for use with zerocopy where protocol bits have to be
+ * mixed in with spliced/zerocopied data.  Unlike memory allocated from the
+ * slab, this memory's lifetime is purely dependent on the folio's refcount.
+ *
+ * The way it works is that a folio is allocated and pieces are broken off
+ * sequentially and given to the callers with a ref until it no longer has
+ * enough spare space, at which point the allocator's ref is dropped and a new
+ * folio is allocated.  The folio remains in existence until the last ref held
+ * by, say, a sk_buff is discarded and then the page is returned to the
+ * page allocator.
+ *
+ * Returns 0 on success and -ENOMEM on allocation failure.  If successful, the
+ * details of the allocated memory are placed in *@bvec.
+ *
+ * The allocated memory should be disposed of with folio_put().
+ */
+int zcopy_alloc(size_t size, struct bio_vec *bvec, gfp_t gfp)
+{
+	struct zcopy_alloc_info *info;
+	struct folio *folio, *spare = NULL;
+	size_t full_size = round_up(size, 8);
+
+	if (WARN_ON_ONCE(full_size > PAGE_SIZE / 2))
+		return -ENOMEM; /* Too big: caller should allocate pages */
+
+try_again:
+	info = get_cpu_ptr(zcopy_alloc_info);
+
+	folio = info->folio;
+	if (folio && folio_size(folio) - info->used < full_size) {
+		folio_put(folio);
+		folio = info->folio = NULL;
+	}
+	if (spare && !info->spare) {
+		info->spare = spare;
+		spare = NULL;
+	}
+	if (!folio && info->spare) {
+		folio = info->folio = info->spare;
+		info->spare = NULL;
+		info->used = 0;
+	}
+	if (folio) {
+		bvec_set_folio(bvec, folio, size, info->used);
+		info->used += full_size;
+		if (info->used < folio_size(folio))
+			folio_get(folio);
+		else
+			info->folio = NULL;
+	}
+
+	put_cpu_ptr(zcopy_alloc_info);
+	if (folio) {
+		if (spare)
+			folio_put(spare);
+		return 0;
+	}
+
+	spare = folio_alloc(gfp, 0);
+	if (!spare)
+		return -ENOMEM;
+	goto try_again;
+}
+EXPORT_SYMBOL(zcopy_alloc);
+
+/**
+ * zcopy_memdup - Allocate some memory for use in zerocopy and fill it
+ * @size: The amount of memory to copy (maximum 1/2 page).
+ * @p: The source data to copy
+ * @bvec: Where to store the details of the memory
+ * @gfp: Allocation flags under which to make an allocation
+ */
+int zcopy_memdup(size_t size, const void *p, struct bio_vec *bvec, gfp_t gfp)
+{
+	void *q;
+
+	if (zcopy_alloc(size, bvec, gfp) < 0)
+		return -ENOMEM;
+
+	q = kmap_local_folio(page_folio(bvec->bv_page), bvec->bv_offset);
+	memcpy(q, p, size);
+	kunmap_local(q);
+	return 0;
+}
+EXPORT_SYMBOL(zcopy_memdup);

