* [RFC v2 00/19] io_uring zerocopy tx
@ 2021-12-21 15:35 Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 01/19] skbuff: add SKBFL_DONT_ORPHAN flag Pavel Begunkov
                   ` (19 more replies)
  0 siblings, 20 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Update on io_uring zerocopy tx, still RFC. For v1 and design notes see

https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/

Absolute numbers (against dummy) got higher since v1, roughly +10-12% requests/s
for the peak performance case. 5/19 brought a couple of percent, but most of it
came with 8/19 and 9/19 (+8-11% in numbers, 5-7% in profiles), which will also
be needed in the future for p2p. Any reason not to do the same for paged non-zc?
Or for small (under 100-150B) packets?

Most of the checks are removed from non-zc paths. The implementation in
__ip_append_data() is a bit trickier, but considering the already existing
assumptions around the "from" argument it should be fine.

Benchmarks for dummy netdev, UDP/IPv4, payload size=4096:
 -n<N> is how many requests we submit per syscall. From an io_uring perspective
       -n1 is wasteful and far from optimal, but it's included for comparison.
 -z0   disables zerocopy, i.e. just normal io_uring send requests
 -f    flushes the "buffer free" notification for every request

                        | K reqs/s | speedup
msg_zerocopy (non-zc)   | 1120     | 1.12
msg_zerocopy (zc)       | 997      | 1
io_uring -n1 -z0        | 1469     | 1.47
io_uring -n8 -z0        | 1780     | 1.78
io_uring -n1 -f         | 1688     | 1.69
io_uring -n1            | 1774     | 1.77
io_uring -n8 -f         | 2075     | 2.08
io_uring -n8            | 2265     | 2.27

note: comparing zc vs non-zc might not be too interesting, as the relative
performance difference can be shifted in favour of zerocopy by cutting constant
per-request overhead, and there are easy ways of doing that, e.g. by compiling
out unused features. This is even more true for the table below, as there was
additional noise taking a good quarter of CPU cycles.

Some data for UDP/IPv6 between a pair of NICs. 9/19 wasn't there at the time of
testing. All tests are CPU bound, so, as expected, reqs/s for zerocopy doesn't
vary much between different payload sizes. The io_uring to msg_zerocopy ratio is
not too representative for reasons similar to those described above.

payload | test                   | K reqs/s
___________________________________________ 
 8192   | io_uring -n8 (dummy)   | 599
        | io_uring -n1 -z0       | 264
        | io_uring -n8 -z0       | 302
        | msg_zerocopy           | 248
        | msg_zerocopy -z        | 183
        | io_uring -n1 -f        | 306
        | io_uring -n1           | 318
        | io_uring -n8 -f        | 373
        | io_uring -n8           | 401

 4096   | io_uring -n8 (dummy)   | 601
        | io_uring -n1 -z0       | 303
        | io_uring -n8 -z0       | 366
        | msg_zerocopy           | 278
        | msg_zerocopy -z        | 187
        | io_uring -n1 -f        | 317
        | io_uring -n1           | 325
        | io_uring -n8 -f        | 387
        | io_uring -n8           | 405

 1024   | io_uring -n8 (dummy)   | 601
        | io_uring -n1 -z0       | 329
        | io_uring -n8 -z0       | 407
        | msg_zerocopy           | 301
        | msg_zerocopy -z        | 186
        | io_uring -n1 -f        | 317
        | io_uring -n1           | 327
        | io_uring -n8 -f        | 390
        | io_uring -n8           | 403

 512    | io_uring -n8 (dummy)   | 601
        | io_uring -n1 -z0       | 340
        | io_uring -n8 -z0       | 417
        | msg_zerocopy           | 310
        | msg_zerocopy -z        | 186
        | io_uring -n1 -f        | 317
        | io_uring -n1           | 328
        | io_uring -n8 -f        | 392
        | io_uring -n8           | 406

 128    | io_uring -n8 (dummy)   | 602
        | io_uring -n1 -z0       | 341
        | io_uring -n8 -z0       | 428
        | msg_zerocopy           | 317
        | msg_zerocopy -z        | 188
        | io_uring -n1 -f        | 318
        | io_uring -n1           | 331
        | io_uring -n8 -f        | 391
        | io_uring -n8           | 408

https://github.com/isilence/linux/tree/zc_v2
https://github.com/isilence/liburing/tree/zc_v2

The benchmark is <liburing>/test/send-zc:

send-zc [-f] [-n<N>] [-z0] -s<payload size> -D<dst ip> (-6|-4) [-t<sec>] udp

As a server you can use msg_zerocopy from the kernel's selftests, or the copy
of it at <liburing>/test/msg_zerocopy. No server is needed for dummy testing.

dummy setup:
sudo ip li add dummy0 type dummy && sudo ip li set dummy0 up mtu 65536
# make traffic for the specified IP to go through dummy0
sudo ip route add <ip_address> dev dummy0
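
For example, a hypothetical dummy-mode run matching the "io_uring -n8" rows of
the 4096B table (assembled from the usage line above; <ip_address> is whatever
address was routed to dummy0):

send-zc -4 -D<ip_address> -s4096 -n8 udp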

v2: remove additional overhead for non-zc from skb_release_data() (Jonathan)
    avoid msg propagation, hide extra bits of non-zc overhead
    task_work based "buffer free" notifications
    improve io_uring's notification refcounting
    added 5/19 (no pfmemalloc tracking)
    added 8/19 and 9/19 preventing small copies with zc
    misc small changes

Pavel Begunkov (19):
  skbuff: add SKBFL_DONT_ORPHAN flag
  skbuff: pass a struct ubuf_info in msghdr
  net: add zerocopy_sg_from_iter for bvec
  net: optimise page get/free for bvec zc
  net: don't track pfmemalloc for zc registered mem
  ipv4/udp: add support msgdr::msg_ubuf
  ipv6/udp: add support msgdr::msg_ubuf
  ipv4: avoid partial copy for zc
  ipv6: avoid partial copy for zc
  io_uring: add send notifiers registration
  io_uring: infrastructure for send zc notifications
  io_uring: wire send zc request type
  io_uring: add an option to flush zc notifications
  io_uring: opcode independent fixed buf import
  io_uring: sendzc with fixed buffers
  io_uring: cache struct ubuf_info
  io_uring: unclog ctx refs waiting with zc notifiers
  io_uring: task_work for notification delivery
  io_uring: optimise task referencing by notifiers

 fs/io_uring.c                 | 440 +++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h        |  46 ++--
 include/linux/socket.h        |   1 +
 include/uapi/linux/io_uring.h |  14 ++
 net/compat.c                  |   1 +
 net/core/datagram.c           |  58 +++++
 net/core/skbuff.c             |  16 +-
 net/ipv4/ip_output.c          |  55 +++--
 net/ipv6/ip6_output.c         |  54 ++++-
 net/socket.c                  |   3 +
 10 files changed, 633 insertions(+), 55 deletions(-)

-- 
2.34.1



* [RFC v2 01/19] skbuff: add SKBFL_DONT_ORPHAN flag
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 02/19] skbuff: pass a struct ubuf_info in msghdr Pavel Begunkov
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

We don't want to list every single ubuf_info callback in
skb_orphan_frags(), so add a flag controlling the behaviour.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/skbuff.h | 8 +++++---
 net/core/skbuff.c      | 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c8cb7e697d47..b80944a9ce8f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -459,10 +459,13 @@ enum {
 	 * charged to the kernel memory.
 	 */
 	SKBFL_PURE_ZEROCOPY = BIT(2),
+
+	SKBFL_DONT_ORPHAN = BIT(3),
 };
 
 #define SKBFL_ZEROCOPY_FRAG	(SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
-#define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY)
+#define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
+				 SKBFL_DONT_ORPHAN)
 
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
@@ -2839,8 +2842,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
 	if (likely(!skb_zcopy(skb)))
 		return 0;
-	if (!skb_zcopy_is_nouarg(skb) &&
-	    skb_uarg(skb)->callback == msg_zerocopy_callback)
+	if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
 		return 0;
 	return skb_copy_ubufs(skb, gfp_mask);
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ba2f38246f07..b23db60ea6f9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1191,7 +1191,7 @@ struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 	uarg->len = 1;
 	uarg->bytelen = size;
 	uarg->zerocopy = 1;
-	uarg->flags = SKBFL_ZEROCOPY_FRAG;
+	uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
 	refcount_set(&uarg->refcnt, 1);
 	sock_hold(sk);
 
-- 
2.34.1



* [RFC v2 02/19] skbuff: pass a struct ubuf_info in msghdr
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 01/19] skbuff: add SKBFL_DONT_ORPHAN flag Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2022-01-11 13:51   ` Hao Xu
  2021-12-21 15:35 ` [RFC v2 03/19] net: add zerocopy_sg_from_iter for bvec Pavel Begunkov
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Instead of the net stack managing ubuf_info, allow passing it in from
the outside in a struct msghdr (an in-kernel structure), so io_uring can
make use of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c          | 2 ++
 include/linux/socket.h | 1 +
 net/compat.c           | 1 +
 net/socket.c           | 3 +++
 4 files changed, 7 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 72da3a75521a..59380e3454ad 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4911,6 +4911,7 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 
 	flags = req->sr_msg.msg_flags;
 	if (issue_flags & IO_URING_F_NONBLOCK)
@@ -5157,6 +5158,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_namelen = 0;
 	msg.msg_iocb = NULL;
 	msg.msg_flags = 0;
+	msg.msg_ubuf = NULL;
 
 	flags = req->sr_msg.msg_flags;
 	if (force_nonblock)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8ef26d89ef49..6bd2c6b0c6f2 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -65,6 +65,7 @@ struct msghdr {
 	__kernel_size_t	msg_controllen;	/* ancillary data buffer length */
 	unsigned int	msg_flags;	/* flags on received message */
 	struct kiocb	*msg_iocb;	/* ptr to iocb for async requests */
+	struct ubuf_info *msg_ubuf;
 };
 
 struct user_msghdr {
diff --git a/net/compat.c b/net/compat.c
index 210fc3b4d0d8..6cd2e7683dd0 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -80,6 +80,7 @@ int __get_compat_msghdr(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*ptr = msg.msg_iov;
 	*len = msg.msg_iovlen;
 	return 0;
diff --git a/net/socket.c b/net/socket.c
index 7f64a6eccf63..0a29b616a38c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2023,6 +2023,7 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
 	msg.msg_control = NULL;
 	msg.msg_controllen = 0;
 	msg.msg_namelen = 0;
+	msg.msg_ubuf = NULL;
 	if (addr) {
 		err = move_addr_to_kernel(addr, addr_len, &address);
 		if (err < 0)
@@ -2088,6 +2089,7 @@ int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
 	msg.msg_namelen = 0;
 	msg.msg_iocb = NULL;
 	msg.msg_flags = 0;
+	msg.msg_ubuf = NULL;
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
 	err = sock_recvmsg(sock, &msg, flags);
@@ -2326,6 +2328,7 @@ int __copy_msghdr_from_user(struct msghdr *kmsg,
 		return -EMSGSIZE;
 
 	kmsg->msg_iocb = NULL;
+	kmsg->msg_ubuf = NULL;
 	*uiov = msg.msg_iov;
 	*nsegs = msg.msg_iovlen;
 	return 0;
-- 
2.34.1



* [RFC v2 03/19] net: add zerocopy_sg_from_iter for bvec
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 01/19] skbuff: add SKBFL_DONT_ORPHAN flag Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 02/19] skbuff: pass a struct ubuf_info in msghdr Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 04/19] net: optimise page get/free for bvec zc Pavel Begunkov
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Add a separate path for bvec iterators in __zerocopy_sg_from_iter(). First,
it's quite a bit faster, but it will also be needed to optimise out
get/put_page().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/core/datagram.c | 50 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index ee290776c661..cb1e34fbcd44 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -616,11 +616,61 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
 }
 EXPORT_SYMBOL(skb_copy_datagram_from_iter);
 
+static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
+				   struct iov_iter *from, size_t length)
+{
+	int ret, frag = skb_shinfo(skb)->nr_frags;
+	struct bvec_iter bi;
+	struct bio_vec v;
+	ssize_t copied = 0;
+	unsigned long truesize = 0;
+
+	bi.bi_size = min(from->count, length);
+	bi.bi_bvec_done = from->iov_offset;
+	bi.bi_idx = 0;
+
+	while (bi.bi_size) {
+		if (frag == MAX_SKB_FRAGS) {
+			ret = -EMSGSIZE;
+			goto out;
+		}
+
+		v = mp_bvec_iter_bvec(from->bvec, bi);
+		copied += v.bv_len;
+		truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
+		get_page(v.bv_page);
+		skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
+		bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
+	}
+	ret = 0;
+out:
+	skb->data_len += copied;
+	skb->len += copied;
+	skb->truesize += truesize;
+
+	if (sk && sk->sk_type == SOCK_STREAM) {
+		sk_wmem_queued_add(sk, truesize);
+		if (!skb_zcopy_pure(skb))
+			sk_mem_charge(sk, truesize);
+	} else {
+		refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+	}
+
+	from->bvec += bi.bi_idx;
+	from->nr_segs -= bi.bi_idx;
+	from->count = bi.bi_size;
+	from->iov_offset = bi.bi_bvec_done;
+	return ret;
+}
+
 int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 			    struct iov_iter *from, size_t length)
 {
 	int frag = skb_shinfo(skb)->nr_frags;
 
+	if (iov_iter_is_bvec(from))
+		return __zerocopy_sg_from_bvec(sk, skb, from, length);
+
 	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
 		struct page *last_head = NULL;
-- 
2.34.1



* [RFC v2 04/19] net: optimise page get/free for bvec zc
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (2 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 03/19] net: add zerocopy_sg_from_iter for bvec Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 05/19] net: don't track pfmemalloc for zc registered mem Pavel Begunkov
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

get_page() in __zerocopy_sg_from_bvec() and matching put_page()s are
expensive. However, we can avoid them if the caller can guarantee that the
pages stay alive until the corresponding ubuf_info is released.
In particular, this targets io_uring with fixed buffers, which follows the
described contract.

Assuming that nobody yet uses bvec together with zerocopy, make all
calls with bvec iterators follow this model.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/skbuff.h | 12 ++++++++++--
 net/core/datagram.c    |  9 +++++++--
 net/core/skbuff.c      | 14 +++++++++++++-
 3 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b80944a9ce8f..f6a6fd67e1ea 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -461,11 +461,16 @@ enum {
 	SKBFL_PURE_ZEROCOPY = BIT(2),
 
 	SKBFL_DONT_ORPHAN = BIT(3),
+
+	/* page references are managed by the ubuf_info, so it's safe to
+	 * use frags only up until ubuf_info is released
+	 */
+	SKBFL_MANAGED_FRAGS = BIT(4),
 };
 
 #define SKBFL_ZEROCOPY_FRAG	(SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
 #define SKBFL_ALL_ZEROCOPY	(SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
-				 SKBFL_DONT_ORPHAN)
+				 SKBFL_DONT_ORPHAN | SKBFL_MANAGED_FRAGS)
 
 /*
  * The callback notifies userspace to release buffers when skb DMA is done in
@@ -3155,7 +3160,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
  */
 static inline void skb_frag_unref(struct sk_buff *skb, int f)
 {
-	__skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+	if (!(shinfo->flags & SKBFL_MANAGED_FRAGS))
+		__skb_frag_unref(&shinfo->frags[f], skb->pp_recycle);
 }
 
 /**
diff --git a/net/core/datagram.c b/net/core/datagram.c
index cb1e34fbcd44..46526af40552 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -638,7 +638,6 @@ static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
 		v = mp_bvec_iter_bvec(from->bvec, bi);
 		copied += v.bv_len;
 		truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
-		get_page(v.bv_page);
 		skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
 		bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
 	}
@@ -667,9 +666,15 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
 			    struct iov_iter *from, size_t length)
 {
 	int frag = skb_shinfo(skb)->nr_frags;
+	bool managed = skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAGS;
 
-	if (iov_iter_is_bvec(from))
+	if (iov_iter_is_bvec(from) && (managed || frag == 0)) {
+		skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAGS;
 		return __zerocopy_sg_from_bvec(sk, skb, from, length);
+	}
+
+	if (managed)
+		return -EFAULT;
 
 	while (length && iov_iter_count(from)) {
 		struct page *pages[MAX_SKB_FRAGS];
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b23db60ea6f9..10cdcb99d34b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -666,11 +666,18 @@ static void skb_release_data(struct sk_buff *skb)
 			      &shinfo->dataref))
 		goto exit;
 
-	skb_zcopy_clear(skb, true);
+	if (skb_zcopy(skb)) {
+		bool skip_unref = shinfo->flags & SKBFL_MANAGED_FRAGS;
+
+		skb_zcopy_clear(skb, true);
+		if (skip_unref)
+			goto free_head;
+	}
 
 	for (i = 0; i < shinfo->nr_frags; i++)
 		__skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
 
+free_head:
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
 
@@ -1597,6 +1604,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 	BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
 	skb_copy_header(n, skb);
+	skb_shinfo(n)->flags &= ~SKBFL_MANAGED_FRAGS;
 	return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1653,6 +1661,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 			skb_frag_ref(skb, i);
 		}
 		skb_shinfo(n)->nr_frags = i;
+		skb_shinfo(n)->flags &= ~SKBFL_MANAGED_FRAGS;
 	}
 
 	if (skb_has_frag_list(skb)) {
@@ -1725,6 +1734,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
 			refcount_inc(&skb_uarg(skb)->refcnt);
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
 			skb_frag_ref(skb, i);
+		skb_shinfo(skb)->flags &= ~SKBFL_MANAGED_FRAGS;
 
 		if (skb_has_frag_list(skb))
 			skb_clone_fraglist(skb);
@@ -3788,6 +3798,8 @@ int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
 	if (skb_can_coalesce(skb, i, page, offset)) {
 		skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
 	} else if (i < MAX_SKB_FRAGS) {
+		if (skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAGS)
+			return -EMSGSIZE;
 		get_page(page);
 		skb_fill_page_desc(skb, i, page, offset, size);
 	} else {
-- 
2.34.1



* [RFC v2 05/19] net: don't track pfmemalloc for zc registered mem
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (3 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 04/19] net: optimise page get/free for bvec zc Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 06/19] ipv4/udp: add support msgdr::msg_ubuf Pavel Begunkov
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

In the case of zerocopy, frags are filled with userspace-allocated memory,
so we shouldn't care about setting skb->pfmemalloc for them, especially
when the buffers were somehow pre-registered (i.e. a bvec coming from
io_uring). Remove the tracking from __zerocopy_sg_from_bvec().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/linux/skbuff.h | 28 +++++++++++++++++-----------
 net/core/datagram.c    |  7 +++++--
 2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f6a6fd67e1ea..eef064fbf715 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2203,6 +2203,22 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
 	return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
+static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
+					      int i, struct page *page,
+					      int off, int size)
+{
+	skb_frag_t *frag = &shinfo->frags[i];
+
+	/*
+	 * Propagate page pfmemalloc to the skb if we can. The problem is
+	 * that not all callers have unique ownership of the page but rely
+	 * on page_is_pfmemalloc doing the right thing(tm).
+	 */
+	frag->bv_page		  = page;
+	frag->bv_offset		  = off;
+	skb_frag_size_set(frag, size);
+}
+
 /**
  * __skb_fill_page_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
@@ -2219,17 +2235,7 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
 static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
 					struct page *page, int off, int size)
 {
-	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
-	/*
-	 * Propagate page pfmemalloc to the skb if we can. The problem is
-	 * that not all callers have unique ownership of the page but rely
-	 * on page_is_pfmemalloc doing the right thing(tm).
-	 */
-	frag->bv_page		  = page;
-	frag->bv_offset		  = off;
-	skb_frag_size_set(frag, size);
-
+	__skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
 	page = compound_head(page);
 	if (page_is_pfmemalloc(page))
 		skb->pfmemalloc	= true;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 46526af40552..f8f147e14d1c 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -619,7 +619,8 @@ EXPORT_SYMBOL(skb_copy_datagram_from_iter);
 static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
 				   struct iov_iter *from, size_t length)
 {
-	int ret, frag = skb_shinfo(skb)->nr_frags;
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	int ret, frag = shinfo->nr_frags;
 	struct bvec_iter bi;
 	struct bio_vec v;
 	ssize_t copied = 0;
@@ -638,11 +639,13 @@ static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
 		v = mp_bvec_iter_bvec(from->bvec, bi);
 		copied += v.bv_len;
 		truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
-		skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
+		__skb_fill_page_desc_noacc(shinfo, frag++, v.bv_page,
+					   v.bv_offset, v.bv_len);
 		bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
 	}
 	ret = 0;
 out:
+	shinfo->nr_frags = frag;
 	skb->data_len += copied;
 	skb->len += copied;
 	skb->truesize += truesize;
-- 
2.34.1



* [RFC v2 06/19] ipv4/udp: add support msgdr::msg_ubuf
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (4 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 05/19] net: don't track pfmemalloc for zc registered mem Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 07/19] ipv6/udp: " Pavel Begunkov
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Make ipv4/udp use the ubuf_info passed in struct msghdr if one was
specified.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/ipv4/ip_output.c | 50 ++++++++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9bca57ef8b83..f820288092ab 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -953,7 +953,6 @@ static int __ip_append_data(struct sock *sk,
 	struct inet_sock *inet = inet_sk(sk);
 	struct ubuf_info *uarg = NULL;
 	struct sk_buff *skb;
-
 	struct ip_options *opt = cork->opt;
 	int hh_len;
 	int exthdrlen;
@@ -967,6 +966,7 @@ static int __ip_append_data(struct sock *sk,
 	unsigned int wmem_alloc_delta = 0;
 	bool paged, extra_uref = false;
 	u32 tskey = 0;
+	bool zc = false;
 
 	skb = skb_peek_tail(queue);
 
@@ -1001,17 +1001,37 @@ static int __ip_append_data(struct sock *sk,
 	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
 		csummode = CHECKSUM_PARTIAL;
 
-	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
-		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
-		if (!uarg)
-			return -ENOBUFS;
-		extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
-		if (rt->dst.dev->features & NETIF_F_SG &&
-		    csummode == CHECKSUM_PARTIAL) {
-			paged = true;
-		} else {
-			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg, &extra_uref);
+	if ((flags & MSG_ZEROCOPY) && length) {
+		struct msghdr *msg = from;
+
+		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+			uarg = msg->msg_ubuf;
+			if (skb_zcopy(skb) && uarg != skb_zcopy(skb))
+				return -EINVAL;
+
+			if (rt->dst.dev->features & NETIF_F_SG &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				/* Drop uarg if can't zerocopy, callers should
+				 * be able to handle it.
+				 */
+				uarg = NULL;
+			}
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			if (!uarg)
+				return -ENOBUFS;
+			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
+			if (rt->dst.dev->features & NETIF_F_SG &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				uarg->zerocopy = 0;
+				skb_zcopy_set(skb, uarg, &extra_uref);
+			}
 		}
 	}
 
@@ -1172,9 +1192,13 @@ static int __ip_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else if (!uarg || !uarg->zerocopy) {
+		} else if (!zc) {
 			int i = skb_shinfo(skb)->nr_frags;
 
+			if (skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAGS) {
+				err = -EFAULT;
+				goto error;
+			}
 			err = -ENOMEM;
 			if (!sk_page_frag_refill(sk, pfrag))
 				goto error;
-- 
2.34.1



* [RFC v2 07/19] ipv6/udp: add support msgdr::msg_ubuf
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (5 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 06/19] ipv4/udp: add support msgdr::msg_ubuf Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 08/19] ipv4: avoid partial copy for zc Pavel Begunkov
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Make ipv6/udp use the ubuf_info passed in struct msghdr if one was
specified.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/ipv6/ip6_output.c | 49 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 2f044a49afa8..822e3894dd3b 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1452,6 +1452,7 @@ static int __ip6_append_data(struct sock *sk,
 	unsigned int maxnonfragsize, headersize;
 	unsigned int wmem_alloc_delta = 0;
 	bool paged, extra_uref = false;
+	bool zc = false;
 
 	skb = skb_peek_tail(queue);
 	if (!skb) {
@@ -1516,17 +1517,37 @@ static int __ip6_append_data(struct sock *sk,
 	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
 		csummode = CHECKSUM_PARTIAL;
 
-	if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
-		uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
-		if (!uarg)
-			return -ENOBUFS;
-		extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
-		if (rt->dst.dev->features & NETIF_F_SG &&
-		    csummode == CHECKSUM_PARTIAL) {
-			paged = true;
-		} else {
-			uarg->zerocopy = 0;
-			skb_zcopy_set(skb, uarg, &extra_uref);
+	if ((flags & MSG_ZEROCOPY) && length) {
+		struct msghdr *msg = from;
+
+		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+			uarg = msg->msg_ubuf;
+			if (skb_zcopy(skb) && uarg != skb_zcopy(skb))
+				return -EINVAL;
+
+			if (rt->dst.dev->features & NETIF_F_SG &&
+				csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				/* Drop uarg if can't zerocopy, callers should
+				 * be able to handle it.
+				 */
+				uarg = NULL;
+			}
+		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			if (!uarg)
+				return -ENOBUFS;
+			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
+			if (rt->dst.dev->features & NETIF_F_SG &&
+			    csummode == CHECKSUM_PARTIAL) {
+				paged = true;
+				zc = true;
+			} else {
+				uarg->zerocopy = 0;
+				skb_zcopy_set(skb, uarg, &extra_uref);
+			}
 		}
 	}
 
@@ -1717,9 +1738,13 @@ static int __ip6_append_data(struct sock *sk,
 				err = -EFAULT;
 				goto error;
 			}
-		} else if (!uarg || !uarg->zerocopy) {
+		} else if (!zc) {
 			int i = skb_shinfo(skb)->nr_frags;
 
+			if (skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAGS) {
+				err = -EFAULT;
+				goto error;
+			}
 			err = -ENOMEM;
 			if (!sk_page_frag_refill(sk, pfrag))
 				goto error;
-- 
2.34.1



* [RFC v2 08/19] ipv4: avoid partial copy for zc
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (6 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 07/19] ipv6/udp: " Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 09/19] ipv6: " Pavel Begunkov
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Even when zerocopy transmission is requested and possible,
__ip_append_data() will still copy a small chunk of data just because it
allocated some extra linear space (e.g. 148 bytes). It wastes CPU cycles on
the copy and iter manipulations and also misaligns potentially aligned
data. Avoid such copies. And as a bonus we can allocate a smaller skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/ipv4/ip_output.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index f820288092ab..5ec9e540a660 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1095,9 +1095,12 @@ static int __ip_append_data(struct sock *sk,
 				 (fraglen + alloc_extra < SKB_MAX_ALLOC ||
 				  !(rt->dst.dev->features & NETIF_F_SG)))
 				alloclen = fraglen;
-			else {
+			else if (!zc) {
 				alloclen = min_t(int, fraglen, MAX_HEADER);
 				pagedlen = fraglen - alloclen;
+			} else {
+				alloclen = fragheaderlen + transhdrlen;
+				pagedlen = datalen - transhdrlen;
 			}
 
 			alloclen += alloc_extra;
-- 
2.34.1



* [RFC v2 09/19] ipv6: avoid partial copy for zc
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (7 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 08/19] ipv4: avoid partial copy for zc Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 10/19] io_uring: add send notifiers registration Pavel Begunkov
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Even when zerocopy transmission is requested and possible,
__ip6_append_data() will still copy a small chunk of data just because it
allocated some extra linear space (e.g. 128 bytes). It wastes CPU cycles on
the copy and iter manipulations and also misaligns potentially aligned
data. Avoid such copies. And as a bonus we can allocate a smaller skb.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 net/ipv6/ip6_output.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 822e3894dd3b..3ca07d2ea9ca 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1626,9 +1626,12 @@ static int __ip6_append_data(struct sock *sk,
 				 (fraglen + alloc_extra < SKB_MAX_ALLOC ||
 				  !(rt->dst.dev->features & NETIF_F_SG)))
 				alloclen = fraglen;
-			else {
+			else if (!zc) {
 				alloclen = min_t(int, fraglen, MAX_HEADER);
 				pagedlen = fraglen - alloclen;
+			} else {
+				alloclen = fragheaderlen + transhdrlen;
+				pagedlen = datalen - transhdrlen;
 			}
 			alloclen += alloc_extra;
 
-- 
2.34.1



* [RFC v2 10/19] io_uring: add send notifiers registration
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (8 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 09/19] ipv6: " Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 11/19] io_uring: infrastructure for send zc notifications Pavel Begunkov
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Add IORING_REGISTER_TX_CTX and IORING_UNREGISTER_TX_CTX. A transmission
(i.e. send) context will be used to notify the userspace when fixed
buffers used for zerocopy sends are released by the kernel.

Notifications of a single tx context live in generations, where each
generation posts one CQE with ->user_data equal to the specified tag and
->res set to a generation number starting from 0. All requests issued
against a ctx will get attached to the current generation of
notifications. Then, the userspace will be able to request a flush of the
notification, allowing it to post a CQE when all buffers of all requests
attached to it are released by the kernel. Flushing also switches the
generation to a new one with a sequence number incremented by one.
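
As a rough userspace sketch (hypothetical, not part of this patch, and no
liburing helper is assumed), registering a couple of tx contexts could use the
raw io_uring_register syscall with the struct io_uring_tx_ctx_register added
below; the tag values here are arbitrary:

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>	/* from a tree with this series applied */

static int register_tx_ctxs(int ring_fd)
{
	/* one tag per tx context; notification CQEs report it in ->user_data */
	struct io_uring_tx_ctx_register regs[2] = {
		{ .tag = 0xc0ffee00 },
		{ .tag = 0xc0ffee01 },
	};

	return syscall(__NR_io_uring_register, ring_fd,
		       IORING_REGISTER_TX_CTX, regs, 2);
}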

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c                 | 72 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/io_uring.h |  7 ++++
 2 files changed, 79 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 59380e3454ad..a01f91e70fa5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -94,6 +94,8 @@
 #define IORING_MAX_CQ_ENTRIES	(2 * IORING_MAX_ENTRIES)
 #define IORING_SQPOLL_CAP_ENTRIES_VALUE 8
 
+#define IORING_MAX_TX_NOTIFIERS	(1U << 10)
+
 /* only define max */
 #define IORING_MAX_FIXED_FILES	(1U << 15)
 #define IORING_MAX_RESTRICTIONS	(IORING_RESTRICTION_LAST + \
@@ -326,6 +328,15 @@ struct io_submit_state {
 	struct blk_plug		plug;
 };
 
+struct io_tx_notifier {
+};
+
+struct io_tx_ctx {
+	struct io_tx_notifier	*notifier;
+	u64			tag;
+	u32			seq;
+};
+
 struct io_ring_ctx {
 	/* const or read-mostly hot data */
 	struct {
@@ -373,6 +384,8 @@ struct io_ring_ctx {
 		unsigned		nr_user_files;
 		unsigned		nr_user_bufs;
 		struct io_mapped_ubuf	**user_bufs;
+		struct io_tx_ctx	*tx_ctxs;
+		unsigned		nr_tx_ctxs;
 
 		struct io_submit_state	submit_state;
 		struct list_head	timeout_list;
@@ -9199,6 +9212,55 @@ static int io_buffer_validate(struct iovec *iov)
 	return 0;
 }
 
+static int io_sqe_tx_ctx_unregister(struct io_ring_ctx *ctx)
+{
+	if (!ctx->nr_tx_ctxs)
+		return -ENXIO;
+
+	kvfree(ctx->tx_ctxs);
+	ctx->tx_ctxs = NULL;
+	ctx->nr_tx_ctxs = 0;
+	return 0;
+}
+
+static int io_sqe_tx_ctx_register(struct io_ring_ctx *ctx,
+				  void __user *arg, unsigned int nr_args)
+{
+	struct io_uring_tx_ctx_register __user *tx_args = arg;
+	struct io_uring_tx_ctx_register tx_arg;
+	unsigned i;
+	int ret;
+
+	if (ctx->nr_tx_ctxs)
+		return -EBUSY;
+	if (!nr_args)
+		return -EINVAL;
+	if (nr_args > IORING_MAX_TX_NOTIFIERS)
+		return -EMFILE;
+
+	ctx->tx_ctxs = kvcalloc(nr_args, sizeof(ctx->tx_ctxs[0]),
+				GFP_KERNEL_ACCOUNT);
+	if (!ctx->tx_ctxs)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_args; i++, ctx->nr_tx_ctxs++) {
+		struct io_tx_ctx *tx_ctx = &ctx->tx_ctxs[i];
+
+		if (copy_from_user(&tx_arg, &tx_args[i], sizeof(tx_arg))) {
+			ret = -EFAULT;
+			goto out_fput;
+		}
+		tx_ctx->tag = tx_arg.tag;
+	}
+	return 0;
+
+out_fput:
+	kvfree(ctx->tx_ctxs);
+	ctx->tx_ctxs = NULL;
+	ctx->nr_tx_ctxs = 0;
+	return ret;
+}
+
 static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
 				   unsigned int nr_args, u64 __user *tags)
 {
@@ -9429,6 +9491,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 #endif
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
 
+	io_sqe_tx_ctx_unregister(ctx);
 	io_mem_free(ctx->rings);
 	io_mem_free(ctx->sq_sqes);
 
@@ -11104,6 +11167,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_iowq_max_workers(ctx, arg);
 		break;
+	case IORING_REGISTER_TX_CTX:
+		ret = io_sqe_tx_ctx_register(ctx, arg, nr_args);
+		break;
+	case IORING_UNREGISTER_TX_CTX:
+		ret = -EINVAL;
+		if (arg || nr_args)
+			break;
+		ret = io_sqe_tx_ctx_unregister(ctx);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 787f491f0d2a..f2e8d18e40e0 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -325,6 +325,9 @@ enum {
 	/* set/get max number of io-wq workers */
 	IORING_REGISTER_IOWQ_MAX_WORKERS	= 19,
 
+	IORING_REGISTER_TX_CTX			= 20,
+	IORING_UNREGISTER_TX_CTX		= 21,
+
 	/* this goes last */
 	IORING_REGISTER_LAST
 };
@@ -365,6 +368,10 @@ struct io_uring_rsrc_update2 {
 	__u32 resv2;
 };
 
+struct io_uring_tx_ctx_register {
+	__u64 tag;
+};
+
 /* Skip updating fd indexes set to this value in the fd table */
 #define IORING_REGISTER_FILES_SKIP	(-2)
 
-- 
2.34.1



* [RFC v2 11/19] io_uring: infrastructure for send zc notifications
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (9 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 10/19] io_uring: add send notifiers registration Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 12/19] io_uring: wire send zc request type Pavel Begunkov
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Add a new ubuf_info callback io_uring_tx_zerocopy_callback(), which
should post a CQE when it completes. Also, implement some
infrastructure for allocating and managing struct ubuf_info.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 108 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a01f91e70fa5..92190679f3f6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -329,6 +329,11 @@ struct io_submit_state {
 };
 
 struct io_tx_notifier {
+	struct ubuf_info	uarg;
+	struct work_struct	commit_work;
+	struct percpu_ref	*fixed_rsrc_refs;
+	u64			tag;
+	u32			seq;
 };
 
 struct io_tx_ctx {
@@ -1275,15 +1280,20 @@ static void io_rsrc_refs_refill(struct io_ring_ctx *ctx)
 	percpu_ref_get_many(&ctx->rsrc_node->refs, IO_RSRC_REF_BATCH);
 }
 
+static inline void io_set_rsrc_node(struct percpu_ref **rsrc_refs,
+				    struct io_ring_ctx *ctx)
+{
+	*rsrc_refs = &ctx->rsrc_node->refs;
+	ctx->rsrc_cached_refs--;
+	if (unlikely(ctx->rsrc_cached_refs < 0))
+		io_rsrc_refs_refill(ctx);
+}
+
 static inline void io_req_set_rsrc_node(struct io_kiocb *req,
 					struct io_ring_ctx *ctx)
 {
-	if (!req->fixed_rsrc_refs) {
-		req->fixed_rsrc_refs = &ctx->rsrc_node->refs;
-		ctx->rsrc_cached_refs--;
-		if (unlikely(ctx->rsrc_cached_refs < 0))
-			io_rsrc_refs_refill(ctx);
-	}
+	if (!req->fixed_rsrc_refs)
+		io_set_rsrc_node(&req->fixed_rsrc_refs, ctx);
 }
 
 static void io_refs_resurrect(struct percpu_ref *ref, struct completion *compl)
@@ -1930,6 +1940,76 @@ static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data,
 	return __io_fill_cqe(ctx, user_data, res, cflags);
 }
 
+static void io_zc_tx_work_callback(struct work_struct *work)
+{
+	struct io_tx_notifier *notifier = container_of(work, struct io_tx_notifier,
+						       commit_work);
+	struct io_ring_ctx *ctx = notifier->uarg.ctx;
+
+	spin_lock(&ctx->completion_lock);
+	io_fill_cqe_aux(ctx, notifier->tag, notifier->seq, 0);
+	io_commit_cqring(ctx);
+	spin_unlock(&ctx->completion_lock);
+	io_cqring_ev_posted(ctx);
+
+	percpu_ref_put(notifier->fixed_rsrc_refs);
+	percpu_ref_put(&ctx->refs);
+	kfree(notifier);
+}
+
+static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
+					  struct ubuf_info *uarg,
+					  bool success)
+{
+	struct io_tx_notifier *notifier = container_of(uarg,
+						struct io_tx_notifier, uarg);
+
+	if (!refcount_dec_and_test(&uarg->refcnt))
+		return;
+
+	if (in_interrupt()) {
+		INIT_WORK(&notifier->commit_work, io_zc_tx_work_callback);
+		queue_work(system_unbound_wq, &notifier->commit_work);
+	} else {
+		io_zc_tx_work_callback(&notifier->commit_work);
+	}
+}
+
+static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
+						   struct io_tx_ctx *tx_ctx)
+{
+	struct io_tx_notifier *notifier;
+	struct ubuf_info *uarg;
+
+	notifier = kmalloc(sizeof(*notifier), GFP_ATOMIC);
+	if (!notifier)
+		return NULL;
+
+	WARN_ON_ONCE(!current->io_uring);
+	notifier->seq = tx_ctx->seq++;
+	notifier->tag = tx_ctx->tag;
+	io_set_rsrc_node(&notifier->fixed_rsrc_refs, ctx);
+
+	uarg = &notifier->uarg;
+	uarg->ctx = ctx;
+	uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+	uarg->callback = io_uring_tx_zerocopy_callback;
+	refcount_set(&uarg->refcnt, 1);
+	percpu_ref_get(&ctx->refs);
+	return notifier;
+}
+
+__attribute__((unused))
+static inline struct io_tx_notifier *io_get_tx_notifier(struct io_ring_ctx *ctx,
+							struct io_tx_ctx *tx_ctx)
+{
+	if (tx_ctx->notifier)
+		return tx_ctx->notifier;
+
+	tx_ctx->notifier = io_alloc_tx_notifier(ctx, tx_ctx);
+	return tx_ctx->notifier;
+}
+
 static void io_req_complete_post(struct io_kiocb *req, s32 res,
 				 u32 cflags)
 {
@@ -9212,11 +9292,27 @@ static int io_buffer_validate(struct iovec *iov)
 	return 0;
 }
 
+static void io_sqe_tx_ctx_kill_ubufs(struct io_ring_ctx *ctx)
+{
+	struct io_tx_ctx *tx_ctx;
+	int i;
+
+	for (i = 0; i < ctx->nr_tx_ctxs; i++) {
+		tx_ctx = &ctx->tx_ctxs[i];
+		if (!tx_ctx->notifier)
+			continue;
+		io_uring_tx_zerocopy_callback(NULL, &tx_ctx->notifier->uarg,
+					      true);
+		tx_ctx->notifier = NULL;
+	}
+}
+
 static int io_sqe_tx_ctx_unregister(struct io_ring_ctx *ctx)
 {
 	if (!ctx->nr_tx_ctxs)
 		return -ENXIO;
 
+	io_sqe_tx_ctx_kill_ubufs(ctx);
 	kvfree(ctx->tx_ctxs);
 	ctx->tx_ctxs = NULL;
 	ctx->nr_tx_ctxs = 0;
@@ -9608,6 +9704,12 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 			io_sq_thread_unpark(sqd);
 		}
 
+		if (READ_ONCE(ctx->nr_tx_ctxs)) {
+			mutex_lock(&ctx->uring_lock);
+			io_sqe_tx_ctx_kill_ubufs(ctx);
+			mutex_unlock(&ctx->uring_lock);
+		}
+
 		io_req_caches_free(ctx);
 
 		if (WARN_ON_ONCE(time_after(jiffies, timeout))) {
-- 
2.34.1



* [RFC v2 12/19] io_uring: wire send zc request type
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (10 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 11/19] io_uring: infrastructure for send zc notifications Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 13/19] io_uring: add an option to flush zc notifications Pavel Begunkov
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Add a new io_uring opcode IORING_OP_SENDZC. The main distinction from
other send requests is that the user should specify a tx context index,
which will be used to notify the userspace when the kernel doesn't need
the buffers anymore and it's safe to reuse them. So, overwriting the data
buffers is racy before getting a separate notification, even when the
request has already completed.
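
For illustration only, a hypothetical SQE setup for this opcode, based on the
fields io_sendzc_prep() reads in this patch (the struct io_uring_sqe members
come from this series' uapi changes, not from a released liburing):

#include <string.h>
#include <sys/socket.h>
#include <linux/io_uring.h>	/* from a tree with this series applied */

static void prep_sendzc(struct io_uring_sqe *sqe, int sockfd,
			const void *buf, unsigned int len,
			unsigned int tx_ctx_idx,
			const struct sockaddr *dst, socklen_t dst_len)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_SENDZC;
	sqe->fd = sockfd;
	sqe->addr = (unsigned long)buf;		/* buffer to transmit */
	sqe->len = len;
	sqe->msg_flags = 0;			/* MSG_* flags for the send */
	sqe->tx_ctx_idx = tx_ctx_idx;		/* registered via IORING_REGISTER_TX_CTX */
	sqe->addr2 = (unsigned long)dst;	/* optional destination address */
	sqe->__pad2[0] = dst_len;		/* addr_len, as read by io_sendzc_prep() */
}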

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c                 | 120 +++++++++++++++++++++++++++++++++-
 include/uapi/linux/io_uring.h |   2 +
 2 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 92190679f3f6..9452b4ec32b6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -600,6 +600,16 @@ struct io_sr_msg {
 	size_t				len;
 };
 
+struct io_sendzc {
+	struct file			*file;
+	void __user			*buf;
+	size_t				len;
+	struct io_tx_ctx 		*tx_ctx;
+	int				msg_flags;
+	int				addr_len;
+	void __user			*addr;
+};
+
 struct io_open {
 	struct file			*file;
 	int				dfd;
@@ -874,6 +884,7 @@ struct io_kiocb {
 		struct io_mkdir		mkdir;
 		struct io_symlink	symlink;
 		struct io_hardlink	hardlink;
+		struct io_sendzc	msgzc;
 	};
 
 	u8				opcode;
@@ -1123,6 +1134,12 @@ static const struct io_op_def io_op_defs[] = {
 	[IORING_OP_MKDIRAT] = {},
 	[IORING_OP_SYMLINKAT] = {},
 	[IORING_OP_LINKAT] = {},
+	[IORING_OP_SENDZC] = {
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
+		.audit_skip		= 1,
+	},
 };
 
 /* requests with any of those set should undergo io_disarm_next() */
@@ -1999,7 +2016,6 @@ static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
 	return notifier;
 }
 
-__attribute__((unused))
 static inline struct io_tx_notifier *io_get_tx_notifier(struct io_ring_ctx *ctx,
 							struct io_tx_ctx *tx_ctx)
 {
@@ -5025,6 +5041,102 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_sendzc *sr = &req->msgzc;
+	unsigned int idx;
+
+	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+		return -EINVAL;
+	if (READ_ONCE(sqe->ioprio))
+		return -EINVAL;
+
+	sr->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	sr->len = READ_ONCE(sqe->len);
+	sr->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL;
+	if (sr->msg_flags & MSG_DONTWAIT)
+		req->flags |= REQ_F_NOWAIT;
+
+	idx = READ_ONCE(sqe->tx_ctx_idx);
+	if (idx > ctx->nr_tx_ctxs)
+		return -EINVAL;
+	idx = array_index_nospec(idx, ctx->nr_tx_ctxs);
+	req->msgzc.tx_ctx = &ctx->tx_ctxs[idx];
+
+	sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+	sr->addr_len = READ_ONCE(sqe->__pad2[0]);
+
+#ifdef CONFIG_COMPAT
+	if (req->ctx->compat)
+		sr->msg_flags |= MSG_CMSG_COMPAT;
+#endif
+	return 0;
+}
+
+static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct sockaddr_storage address;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_tx_notifier *notifier;
+	struct io_sendzc *sr = &req->msgzc;
+	struct msghdr msg;
+	struct iovec iov;
+	struct socket *sock;
+	unsigned flags;
+	int ret, min_ret = 0;
+
+	sock = sock_from_file(req->file);
+	if (unlikely(!sock))
+		return -ENOTSOCK;
+	ret = import_single_range(WRITE, sr->buf, sr->len, &iov, &msg.msg_iter);
+	if (unlikely(ret))
+		return ret;
+
+	msg.msg_name = NULL;
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	msg.msg_namelen = 0;
+	if (sr->addr) {
+		ret = move_addr_to_kernel(sr->addr, sr->addr_len, &address);
+		if (ret < 0)
+			return ret;
+		msg.msg_name = (struct sockaddr *)&address;
+		msg.msg_namelen = sr->addr_len;
+	}
+
+	io_ring_submit_lock(ctx, issue_flags & IO_URING_F_UNLOCKED);
+	notifier = io_get_tx_notifier(ctx, req->msgzc.tx_ctx);
+	if (!notifier) {
+		req_set_fail(req);
+		ret = -ENOMEM;
+		goto out;
+	}
+	msg.msg_ubuf = &notifier->uarg;
+
+	flags = sr->msg_flags | MSG_ZEROCOPY;
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		flags |= MSG_DONTWAIT;
+	if (flags & MSG_WAITALL)
+		min_ret = iov_iter_count(&msg.msg_iter);
+	msg.msg_flags = flags;
+	ret = sock_sendmsg(sock, &msg);
+
+	if (ret < min_ret) {
+		if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
+			goto out;
+		if (ret == -ERESTARTSYS)
+			ret = -EINTR;
+		req_set_fail(req);
+	}
+	io_ring_submit_unlock(ctx, issue_flags & IO_URING_F_UNLOCKED);
+	__io_req_complete(req, issue_flags, ret, 0);
+	return 0;
+out:
+	io_ring_submit_unlock(ctx, issue_flags & IO_URING_F_UNLOCKED);
+	return ret;
+}
+
 static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
 				 struct io_async_msghdr *iomsg)
 {
@@ -5428,6 +5540,7 @@ IO_NETOP_PREP_ASYNC(sendmsg);
 IO_NETOP_PREP_ASYNC(recvmsg);
 IO_NETOP_PREP_ASYNC(connect);
 IO_NETOP_PREP(accept);
+IO_NETOP_PREP(sendzc);
 IO_NETOP_FN(send);
 IO_NETOP_FN(recv);
 #endif /* CONFIG_NET */
@@ -6575,6 +6688,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	case IORING_OP_SENDMSG:
 	case IORING_OP_SEND:
 		return io_sendmsg_prep(req, sqe);
+	case IORING_OP_SENDZC:
+		return io_sendzc_prep(req, sqe);
 	case IORING_OP_RECVMSG:
 	case IORING_OP_RECV:
 		return io_recvmsg_prep(req, sqe);
@@ -6832,6 +6947,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	case IORING_OP_SEND:
 		ret = io_send(req, issue_flags);
 		break;
+	case IORING_OP_SENDZC:
+		ret = io_sendzc(req, issue_flags);
+		break;
 	case IORING_OP_RECVMSG:
 		ret = io_recvmsg(req, issue_flags);
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f2e8d18e40e0..bbc78fe8ca77 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -59,6 +59,7 @@ struct io_uring_sqe {
 	union {
 		__s32	splice_fd_in;
 		__u32	file_index;
+		__u32	tx_ctx_idx;
 	};
 	__u64	__pad2[2];
 };
@@ -143,6 +144,7 @@ enum {
 	IORING_OP_MKDIRAT,
 	IORING_OP_SYMLINKAT,
 	IORING_OP_LINKAT,
+	IORING_OP_SENDZC,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
-- 
2.34.1



* [RFC v2 13/19] io_uring: add an option to flush zc notifications
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (11 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 12/19] io_uring: wire send zc request type Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 14/19] io_uring: opcode independent fixed buf import Pavel Begunkov
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Add an IORING_SENDZC_FLUSH flag. If specified, a send zc operation will,
on success, also flush the corresponding ubuf_info.
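
Continuing the hypothetical userspace sketch from the previous patch, a request
could set sqe->ioprio = IORING_SENDZC_FLUSH (the patch carries zc flags in
->ioprio), and the notification can then be reaped with stock liburing CQE
helpers, matching ->user_data against the registered tag and reading the
generation number from ->res:

#include <liburing.h>

/* assumes 'ring' is a set up io_uring instance and 'tag' is the value
 * registered via IORING_REGISTER_TX_CTX for the tx context in use */
static int wait_tx_notification(struct io_uring *ring, __u64 tag)
{
	struct io_uring_cqe *cqe;
	int ret;

	while ((ret = io_uring_wait_cqe(ring, &cqe)) == 0) {
		__u64 data = cqe->user_data;
		int res = cqe->res;

		io_uring_cqe_seen(ring, cqe);
		if (data == tag)
			return res;	/* flushed generation; its buffers are reusable */
	}
	return ret;
}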

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c                 | 26 +++++++++++++++++++-------
 include/uapi/linux/io_uring.h |  4 ++++
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9452b4ec32b6..ec1f6c60a14c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -608,6 +608,7 @@ struct io_sendzc {
 	int				msg_flags;
 	int				addr_len;
 	void __user			*addr;
+	unsigned int			zc_flags;
 };
 
 struct io_open {
@@ -1992,6 +1993,12 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
 	}
 }
 
+static void io_tx_kill_notification(struct io_tx_ctx *tx_ctx)
+{
+	io_uring_tx_zerocopy_callback(NULL, &tx_ctx->notifier->uarg, true);
+	tx_ctx->notifier = NULL;
+}
+
 static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
 						   struct io_tx_ctx *tx_ctx)
 {
@@ -5041,6 +5048,8 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
+#define IO_SENDZC_VALID_FLAGS IORING_SENDZC_FLUSH
+
 static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_ring_ctx *ctx = req->ctx;
@@ -5049,8 +5058,6 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (READ_ONCE(sqe->ioprio))
-		return -EINVAL;
 
 	sr->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	sr->len = READ_ONCE(sqe->len);
@@ -5067,6 +5074,10 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
 	sr->addr_len = READ_ONCE(sqe->__pad2[0]);
 
+	req->msgzc.zc_flags = READ_ONCE(sqe->ioprio);
+	if (req->msgzc.zc_flags & ~IO_SENDZC_VALID_FLAGS)
+		return -EINVAL;
+
 #ifdef CONFIG_COMPAT
 	if (req->ctx->compat)
 		sr->msg_flags |= MSG_CMSG_COMPAT;
@@ -5089,6 +5100,7 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 	sock = sock_from_file(req->file);
 	if (unlikely(!sock))
 		return -ENOTSOCK;
+
 	ret = import_single_range(WRITE, sr->buf, sr->len, &iov, &msg.msg_iter);
 	if (unlikely(ret))
 		return ret;
@@ -5128,6 +5140,8 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 		req_set_fail(req);
+	} else if (req->msgzc.zc_flags & IORING_SENDZC_FLUSH) {
+		io_tx_kill_notification(req->msgzc.tx_ctx);
 	}
 	io_ring_submit_unlock(ctx, issue_flags & IO_URING_F_UNLOCKED);
 	__io_req_complete(req, issue_flags, ret, 0);
@@ -9417,11 +9431,9 @@ static void io_sqe_tx_ctx_kill_ubufs(struct io_ring_ctx *ctx)
 
 	for (i = 0; i < ctx->nr_tx_ctxs; i++) {
 		tx_ctx = &ctx->tx_ctxs[i];
-		if (!tx_ctx->notifier)
-			continue;
-		io_uring_tx_zerocopy_callback(NULL, &tx_ctx->notifier->uarg,
-					      true);
-		tx_ctx->notifier = NULL;
+
+		if (tx_ctx->notifier)
+			io_tx_kill_notification(tx_ctx);
 	}
 }
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index bbc78fe8ca77..ac18e8e6f86f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -187,6 +187,10 @@ enum {
 #define IORING_POLL_UPDATE_EVENTS	(1U << 1)
 #define IORING_POLL_UPDATE_USER_DATA	(1U << 2)
 
+enum {
+	IORING_SENDZC_FLUSH		= (1U << 0),
+};
+
 /*
  * IO completion data structure (Completion Queue Entry)
  */
-- 
2.34.1



* [RFC v2 14/19] io_uring: opcode independent fixed buf import
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (12 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 13/19] io_uring: add an option to flush zc notifications Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 15/19] io_uring: sendzc with fixed buffers Pavel Begunkov
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Extract an opcode independent helper from io_import_fixed() for
initialising an iov_iter with a fixed buffer.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ec1f6c60a14c..40a8d7799be3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3152,11 +3152,11 @@ static void kiocb_done(struct io_kiocb *req, ssize_t ret,
 	}
 }
 
-static int __io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter,
-			     struct io_mapped_ubuf *imu)
+static int __io_import_fixed(int rw, struct iov_iter *iter,
+			     struct io_mapped_ubuf *imu,
+			     u64 buf_addr, size_t len)
 {
-	size_t len = req->rw.len;
-	u64 buf_end, buf_addr = req->rw.addr;
+	u64 buf_end;
 	size_t offset;
 
 	if (unlikely(check_add_overflow(buf_addr, (u64)len, &buf_end)))
@@ -3225,7 +3225,7 @@ static int io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter)
 		imu = READ_ONCE(ctx->user_bufs[index]);
 		req->imu = imu;
 	}
-	return __io_import_fixed(req, rw, iter, imu);
+	return __io_import_fixed(rw, iter, imu, req->rw.addr, req->rw.len);
 }
 
 static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock)
-- 
2.34.1



* [RFC v2 15/19] io_uring: sendzc with fixed buffers
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (13 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 14/19] io_uring: opcode independent fixed buf import Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 16/19] io_uring: cache struct ubuf_info Pavel Begunkov
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Allow zerocopy sends to use fixed buffers.
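
For illustration, a rough userspace sketch combining this with a registered
buffer (a sketch, not part of the patch; names follow the send-zc test program
attached later in the thread, and ring/sockfd/buf/len are placeholders):

	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct io_uring_sqe *sqe;

	/* make buf the fixed buffer at index 0 */
	io_uring_register_buffers(&ring, &iov, 1);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_send(sqe, sockfd, buf, len, 0);
	sqe->opcode = __IORING_OP_SENDZC;
	sqe->ioprio = __IORING_SENDZC_FIXED_BUF;	/* import from the registered buffer */
	sqe->buf_index = 0;				/* index into the registered buffer table */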

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c                 | 19 +++++++++++++++++--
 include/uapi/linux/io_uring.h |  1 +
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 40a8d7799be3..654023ba0b91 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5048,7 +5048,7 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
 	return 0;
 }
 
-#define IO_SENDZC_VALID_FLAGS IORING_SENDZC_FLUSH
+#define IO_SENDZC_VALID_FLAGS (IORING_SENDZC_FLUSH | IORING_SENDZC_FIXED_BUF)
 
 static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
@@ -5078,6 +5078,15 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	if (req->msgzc.zc_flags & ~IO_SENDZC_VALID_FLAGS)
 		return -EINVAL;
 
+	if (req->msgzc.zc_flags & IORING_SENDZC_FIXED_BUF) {
+		idx = READ_ONCE(sqe->buf_index);
+		if (unlikely(idx >= ctx->nr_user_bufs))
+			return -EFAULT;
+		idx = array_index_nospec(idx, ctx->nr_user_bufs);
+		req->imu = READ_ONCE(ctx->user_bufs[idx]);
+		io_req_set_rsrc_node(req, ctx);
+	}
+
 #ifdef CONFIG_COMPAT
 	if (req->ctx->compat)
 		sr->msg_flags |= MSG_CMSG_COMPAT;
@@ -5101,7 +5110,13 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(!sock))
 		return -ENOTSOCK;
 
-	ret = import_single_range(WRITE, sr->buf, sr->len, &iov, &msg.msg_iter);
+	if (req->msgzc.zc_flags & IORING_SENDZC_FIXED_BUF) {
+		ret = __io_import_fixed(WRITE, &msg.msg_iter, req->imu,
+					(u64)sr->buf, sr->len);
+	} else {
+		ret = import_single_range(WRITE, sr->buf, sr->len, &iov,
+					  &msg.msg_iter);
+	}
 	if (unlikely(ret))
 		return ret;
 
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ac18e8e6f86f..740af1d0409f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -189,6 +189,7 @@ enum {
 
 enum {
 	IORING_SENDZC_FLUSH		= (1U << 0),
+	IORING_SENDZC_FIXED_BUF		= (1U << 1),
 };
 
 /*
-- 
2.34.1



* [RFC v2 16/19] io_uring: cache struct ubuf_info
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (14 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 15/19] io_uring: sendzc with fixed buffers Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 17/19] io_uring: unclog ctx refs waiting with zc notifiers Pavel Begunkov
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Allocation/deallocation of ubuf_info takes some time, so add an
optimisation caching them. The implementation is similar to how we cache
requests in io_req_complete_post().

->ubuf_list is protected by ->uring_lock and requests try to grab notifiers
directly from it; there is also the ->ubuf_list_locked list protected by
->completion_lock, which is eventually batch-spliced into ->ubuf_list.
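
Schematically, the allocation and recycling paths boil down to the following
(a simplified sketch of the scheme above, not the patch code itself;
"threshold" stands for the IO_REQ_ALLOC_BATCH cutoff the patch reuses):

	/* allocation side, under ->uring_lock */
	if (list_empty(&ctx->ubuf_list) && ctx->ubuf_locked_nr > threshold) {
		spin_lock(&ctx->completion_lock);
		list_splice_init(&ctx->ubuf_list_locked, &ctx->ubuf_list);
		ctx->ubuf_locked_nr = 0;
		spin_unlock(&ctx->completion_lock);
	}
	notifier = list_first_entry_or_null(&ctx->ubuf_list,
					    struct io_tx_notifier, cache_node);
	if (notifier)
		list_del(&notifier->cache_node);

	/* recycling side, under ->completion_lock */
	list_add(&notifier->cache_node, &ctx->ubuf_list_locked);
	ctx->ubuf_locked_nr++;

with a plain kmalloc() fallback when both lists are empty.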

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 64 insertions(+), 10 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 654023ba0b91..5f79178a3f38 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -334,6 +334,7 @@ struct io_tx_notifier {
 	struct percpu_ref	*fixed_rsrc_refs;
 	u64			tag;
 	u32			seq;
+	struct list_head	cache_node;
 };
 
 struct io_tx_ctx {
@@ -393,6 +394,9 @@ struct io_ring_ctx {
 		unsigned		nr_tx_ctxs;
 
 		struct io_submit_state	submit_state;
+		struct list_head	ubuf_list;
+		struct list_head	ubuf_list_locked;
+		int			ubuf_locked_nr;
 		struct list_head	timeout_list;
 		struct list_head	ltimeout_list;
 		struct list_head	cq_overflow_list;
@@ -1491,6 +1495,8 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_WQ_LIST(&ctx->locked_free_list);
 	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+	INIT_LIST_HEAD(&ctx->ubuf_list);
+	INIT_LIST_HEAD(&ctx->ubuf_list_locked);
 	return ctx;
 err:
 	kfree(ctx->dummy_ubuf);
@@ -1963,16 +1969,20 @@ static void io_zc_tx_work_callback(struct work_struct *work)
 	struct io_tx_notifier *notifier = container_of(work, struct io_tx_notifier,
 						       commit_work);
 	struct io_ring_ctx *ctx = notifier->uarg.ctx;
+	struct percpu_ref *rsrc_refs = notifier->fixed_rsrc_refs;
 
 	spin_lock(&ctx->completion_lock);
 	io_fill_cqe_aux(ctx, notifier->tag, notifier->seq, 0);
+
+	list_add(&notifier->cache_node, &ctx->ubuf_list_locked);
+	ctx->ubuf_locked_nr++;
+
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
 	io_cqring_ev_posted(ctx);
 
-	percpu_ref_put(notifier->fixed_rsrc_refs);
+	percpu_ref_put(rsrc_refs);
 	percpu_ref_put(&ctx->refs);
-	kfree(notifier);
 }
 
 static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
@@ -1999,26 +2009,69 @@ static void io_tx_kill_notification(struct io_tx_ctx *tx_ctx)
 	tx_ctx->notifier = NULL;
 }
 
+static void io_notifier_splice(struct io_ring_ctx *ctx)
+{
+	spin_lock(&ctx->completion_lock);
+	list_splice_init(&ctx->ubuf_list_locked, &ctx->ubuf_list);
+	ctx->ubuf_locked_nr = 0;
+	spin_unlock(&ctx->completion_lock);
+}
+
+static void io_notifier_free_cached(struct io_ring_ctx *ctx)
+{
+	struct io_tx_notifier *notifier;
+
+	io_notifier_splice(ctx);
+
+	while (!list_empty(&ctx->ubuf_list)) {
+		notifier = list_first_entry(&ctx->ubuf_list,
+					    struct io_tx_notifier, cache_node);
+		list_del(&notifier->cache_node);
+		kfree(notifier);
+	}
+}
+
+static inline bool io_notifier_has_cached(struct io_ring_ctx *ctx)
+{
+	if (likely(!list_empty(&ctx->ubuf_list)))
+		return true;
+	if (READ_ONCE(ctx->ubuf_locked_nr) <= IO_REQ_ALLOC_BATCH)
+		return false;
+	io_notifier_splice(ctx);
+	return !list_empty(&ctx->ubuf_list);
+}
+
 static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
 						   struct io_tx_ctx *tx_ctx)
 {
 	struct io_tx_notifier *notifier;
 	struct ubuf_info *uarg;
 
-	notifier = kmalloc(sizeof(*notifier), GFP_ATOMIC);
-	if (!notifier)
-		return NULL;
+	if (likely(io_notifier_has_cached(ctx))) {
+		if (WARN_ON_ONCE(list_empty(&ctx->ubuf_list)))
+			return NULL;
+
+		notifier = list_first_entry(&ctx->ubuf_list,
+					    struct io_tx_notifier, cache_node);
+		list_del(&notifier->cache_node);
+	} else {
+		gfp_t gfp_flags = GFP_ATOMIC|GFP_KERNEL_ACCOUNT;
+
+		notifier = kmalloc(sizeof(*notifier), gfp_flags);
+		if (!notifier)
+			return NULL;
+		uarg = &notifier->uarg;
+		uarg->ctx = ctx;
+		uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+		uarg->callback = io_uring_tx_zerocopy_callback;
+	}
 
 	WARN_ON_ONCE(!current->io_uring);
 	notifier->seq = tx_ctx->seq++;
 	notifier->tag = tx_ctx->tag;
 	io_set_rsrc_node(&notifier->fixed_rsrc_refs, ctx);
 
-	uarg = &notifier->uarg;
-	uarg->ctx = ctx;
-	uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
-	uarg->callback = io_uring_tx_zerocopy_callback;
-	refcount_set(&uarg->refcnt, 1);
+	refcount_set(&notifier->uarg.refcnt, 1);
 	percpu_ref_get(&ctx->refs);
 	return notifier;
 }
@@ -9732,6 +9785,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 #endif
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
 
+	io_notifier_free_cached(ctx);
 	io_sqe_tx_ctx_unregister(ctx);
 	io_mem_free(ctx->rings);
 	io_mem_free(ctx->sq_sqes);
-- 
2.34.1



* [RFC v2 17/19] io_uring: unclog ctx refs waiting with zc notifiers
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (15 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 16/19] io_uring: cache struct ubuf_info Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 18/19] io_uring: task_work for notification delivery Pavel Begunkov
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Currently every instance of struct io_tx_notifier holds a ctx reference,
including ones sitting in caches. So, when we try to quiesce the ring
(e.g. for register) we'd be waiting for refs that nobody can release.
That's currently worked around for cancellation.

Don't take ctx references, but instead wait for all notifiers to return
into the caches when needed. An even better solution would be to wait for
all rsrc refs. It's also nice to remove an extra pair of percpu_ref_get/put().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5f79178a3f38..8cfa8ea161e4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -453,6 +453,7 @@ struct io_ring_ctx {
 		struct io_mapped_ubuf		*dummy_ubuf;
 		struct io_rsrc_data		*file_data;
 		struct io_rsrc_data		*buf_data;
+		int				nr_tx_ctx;
 
 		struct delayed_work		rsrc_put_work;
 		struct llist_head		rsrc_put_llist;
@@ -1982,7 +1983,6 @@ static void io_zc_tx_work_callback(struct work_struct *work)
 	io_cqring_ev_posted(ctx);
 
 	percpu_ref_put(rsrc_refs);
-	percpu_ref_put(&ctx->refs);
 }
 
 static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
@@ -2028,6 +2028,7 @@ static void io_notifier_free_cached(struct io_ring_ctx *ctx)
 					    struct io_tx_notifier, cache_node);
 		list_del(&notifier->cache_node);
 		kfree(notifier);
+		ctx->nr_tx_ctx--;
 	}
 }
 
@@ -2060,6 +2061,7 @@ static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
 		notifier = kmalloc(sizeof(*notifier), gfp_flags);
 		if (!notifier)
 			return NULL;
+		ctx->nr_tx_ctx++;
 		uarg = &notifier->uarg;
 		uarg->ctx = ctx;
 		uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
@@ -2072,7 +2074,6 @@ static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
 	io_set_rsrc_node(&notifier->fixed_rsrc_refs, ctx);
 
 	refcount_set(&notifier->uarg.refcnt, 1);
-	percpu_ref_get(&ctx->refs);
 	return notifier;
 }
 
@@ -9785,7 +9786,6 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 #endif
 	WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
 
-	io_notifier_free_cached(ctx);
 	io_sqe_tx_ctx_unregister(ctx);
 	io_mem_free(ctx->rings);
 	io_mem_free(ctx->sq_sqes);
@@ -9946,6 +9946,19 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 	spin_lock(&ctx->completion_lock);
 	spin_unlock(&ctx->completion_lock);
 
+	while (1) {
+		int nr;
+
+		mutex_lock(&ctx->uring_lock);
+		io_notifier_free_cached(ctx);
+		nr = ctx->nr_tx_ctx;
+		mutex_unlock(&ctx->uring_lock);
+
+		if (!nr)
+			break;
+		schedule_timeout(interval);
+	}
+
 	io_ring_ctx_free(ctx);
 }
 
-- 
2.34.1



* [RFC v2 18/19] io_uring: task_work for notification delivery
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (16 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 17/19] io_uring: unclog ctx refs waiting with zc notifiers Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:35 ` [RFC v2 19/19] io_uring: optimise task referencing by notifiers Pavel Begunkov
  2021-12-21 15:43 ` [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Workqueues are way too heavy for tx notification delivery. We still
need some non-irq context because ->completion_lock is not irq-safe, so
use task_work instead. As expected, for test cases with real hardware
that juggle lots of notifications, performance is drastically better,
e.g. the share of the relevant parts in profiles drops from 30% to less
than 3%.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 57 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 43 insertions(+), 14 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8cfa8ea161e4..ee496b463462 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -330,11 +330,16 @@ struct io_submit_state {
 
 struct io_tx_notifier {
 	struct ubuf_info	uarg;
-	struct work_struct	commit_work;
 	struct percpu_ref	*fixed_rsrc_refs;
 	u64			tag;
 	u32			seq;
 	struct list_head	cache_node;
+	struct task_struct	*task;
+
+	union {
+		struct callback_head	task_work;
+		struct work_struct	commit_work;
+	};
 };
 
 struct io_tx_ctx {
@@ -1965,19 +1970,17 @@ static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data,
 	return __io_fill_cqe(ctx, user_data, res, cflags);
 }
 
-static void io_zc_tx_work_callback(struct work_struct *work)
+static void io_zc_tx_notifier_finish(struct callback_head *cb)
 {
-	struct io_tx_notifier *notifier = container_of(work, struct io_tx_notifier,
-						       commit_work);
+	struct io_tx_notifier *notifier = container_of(cb, struct io_tx_notifier,
+						       task_work);
 	struct io_ring_ctx *ctx = notifier->uarg.ctx;
 	struct percpu_ref *rsrc_refs = notifier->fixed_rsrc_refs;
 
 	spin_lock(&ctx->completion_lock);
 	io_fill_cqe_aux(ctx, notifier->tag, notifier->seq, 0);
-
 	list_add(&notifier->cache_node, &ctx->ubuf_list_locked);
 	ctx->ubuf_locked_nr++;
-
 	io_commit_cqring(ctx);
 	spin_unlock(&ctx->completion_lock);
 	io_cqring_ev_posted(ctx);
@@ -1985,6 +1988,14 @@ static void io_zc_tx_work_callback(struct work_struct *work)
 	percpu_ref_put(rsrc_refs);
 }
 
+static void io_zc_tx_work_callback(struct work_struct *work)
+{
+	struct io_tx_notifier *notifier = container_of(work, struct io_tx_notifier,
+						       commit_work);
+
+	io_zc_tx_notifier_finish(&notifier->task_work);
+}
+
 static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
 					  struct ubuf_info *uarg,
 					  bool success)
@@ -1994,21 +2005,39 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
 
 	if (!refcount_dec_and_test(&uarg->refcnt))
 		return;
+	if (unlikely(!notifier->task))
+		goto fallback;
 
-	if (in_interrupt()) {
-		INIT_WORK(&notifier->commit_work, io_zc_tx_work_callback);
-		queue_work(system_unbound_wq, &notifier->commit_work);
-	} else {
-		io_zc_tx_work_callback(&notifier->commit_work);
+	put_task_struct(notifier->task);
+	notifier->task = NULL;
+
+	if (!in_interrupt()) {
+		io_zc_tx_notifier_finish(&notifier->task_work);
+		return;
 	}
+
+	init_task_work(&notifier->task_work, io_zc_tx_notifier_finish);
+	if (likely(!task_work_add(notifier->task, &notifier->task_work,
+				  TWA_SIGNAL)))
+		return;
+
+fallback:
+	INIT_WORK(&notifier->commit_work, io_zc_tx_work_callback);
+	queue_work(system_unbound_wq, &notifier->commit_work);
 }
 
-static void io_tx_kill_notification(struct io_tx_ctx *tx_ctx)
+static inline void __io_tx_kill_notification(struct io_tx_ctx *tx_ctx)
 {
 	io_uring_tx_zerocopy_callback(NULL, &tx_ctx->notifier->uarg, true);
 	tx_ctx->notifier = NULL;
 }
 
+static inline void io_tx_kill_notification(struct io_tx_ctx *tx_ctx)
+{
+	tx_ctx->notifier->task = get_task_struct(current);
+	__io_tx_kill_notification(tx_ctx);
+}
+
 static void io_notifier_splice(struct io_ring_ctx *ctx)
 {
 	spin_lock(&ctx->completion_lock);
@@ -2058,7 +2087,7 @@ static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
 	} else {
 		gfp_t gfp_flags = GFP_ATOMIC|GFP_KERNEL_ACCOUNT;
 
-		notifier = kmalloc(sizeof(*notifier), gfp_flags);
+		notifier = kzalloc(sizeof(*notifier), gfp_flags);
 		if (!notifier)
 			return NULL;
 		ctx->nr_tx_ctx++;
@@ -9502,7 +9531,7 @@ static void io_sqe_tx_ctx_kill_ubufs(struct io_ring_ctx *ctx)
 		tx_ctx = &ctx->tx_ctxs[i];
 
 		if (tx_ctx->notifier)
-			io_tx_kill_notification(tx_ctx);
+			__io_tx_kill_notification(tx_ctx);
 	}
 }
 
-- 
2.34.1



* [RFC v2 19/19] io_uring: optimise task referencing by notifiers
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (17 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 18/19] io_uring: task_work for notification delivery Pavel Begunkov
@ 2021-12-21 15:35 ` Pavel Begunkov
  2021-12-21 15:43 ` [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:35 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe,
	Pavel Begunkov

Use the io_put_task()/io_get_task_refs() infrastructure for holding task
references for notifiers, so we can get rid of atomics there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 fs/io_uring.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ee496b463462..0eadf4ee5402 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2008,7 +2008,7 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
 	if (unlikely(!notifier->task))
 		goto fallback;
 
-	put_task_struct(notifier->task);
+	io_put_task(notifier->task, 1);
 	notifier->task = NULL;
 
 	if (!in_interrupt()) {
@@ -2034,7 +2034,8 @@ static inline void __io_tx_kill_notification(struct io_tx_ctx *tx_ctx)
 
 static inline void io_tx_kill_notification(struct io_tx_ctx *tx_ctx)
 {
-	tx_ctx->notifier->task = get_task_struct(current);
+	tx_ctx->notifier->task = current;
+	io_get_task_refs(1);
 	__io_tx_kill_notification(tx_ctx);
 }
 
-- 
2.34.1



* Re: [RFC v2 00/19] io_uring zerocopy tx
  2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
                   ` (18 preceding siblings ...)
  2021-12-21 15:35 ` [RFC v2 19/19] io_uring: optimise task referencing by notifiers Pavel Begunkov
@ 2021-12-21 15:43 ` Pavel Begunkov
  19 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2021-12-21 15:43 UTC (permalink / raw)
  To: io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe

[-- Attachment #1: Type: text/plain, Size: 814 bytes --]

On 12/21/21 15:35, Pavel Begunkov wrote:
> Update on io_uring zerocopy tx, still RFC. For v1 and design notes see
[...]
> 
> Benchmarks for dummy netdev, UDP/IPv4, payload size=4096:
>   -n<N> is how many requests we submit per syscall. From io_uring perspective -n1
>         is wasteful and far from optimal, but included for comparison.
>   -z0   disables zerocopy, just normal io_uring send requests
>   -f    makes to flush "buffer free" notifications for every request
[...]
> https://github.com/isilence/linux/tree/zc_v2
> https://github.com/isilence/liburing/tree/zc_v2
> 
> The Benchmark is <liburing>/test/send-zc,
> 
> send-zc [-f] [-n<N>] [-z0] -s<payload size> -D<dst ip> (-6|-4) [-t<sec>] udp

Attaching a standalone send-zc for convenience.

gcc -luring -O2 -o send-zc ./send-zc.c
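
For instance, a hypothetical 4-second UDP/IPv4 zerocopy run with 4 KB payloads
and 8 requests per syscall (destination address is a placeholder):

./send-zc -4 -D 192.168.0.2 -s4096 -n8 -t4 udp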

-- 
Pavel Begunkov

[-- Attachment #2: send-zc.c --]
[-- Type: text/x-csrc, Size: 8350 bytes --]

/* SPDX-License-Identifier: MIT */
/* based on linux-kernel/tools/testing/selftests/net/msg_zerocopy.c */
/* gcc -luring -O2 -o send-zc ./send-zc.c */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <assert.h>
#include <errno.h>
#include <error.h>
#include <limits.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdbool.h>
#include <string.h>

#include <arpa/inet.h>
#include <linux/errqueue.h>
#include <linux/if_packet.h>
#include <linux/ipv6.h>
#include <linux/socket.h>
#include <linux/sockios.h>
#include <net/ethernet.h>
#include <net/if.h>
#include <netinet/ip.h>
#include <netinet/in.h>
#include <netinet/ip6.h>
#include <netinet/tcp.h>
#include <netinet/udp.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/un.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/syscall.h>

#include <liburing.h>

enum {
	__IORING_OP_SENDZC 		= 40,

	__IORING_SENDZC_FLUSH		= (1U << 0),
	__IORING_SENDZC_FIXED_BUF	= (1U << 1),

	__IORING_REGISTER_TX_CTX	= 20,
	__IORING_UNREGISTER_TX_CTX	= 21,
};

struct __io_uring_tx_ctx_register {
	__u64 tag;
};

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY	60
#endif
#define ZC_TAG 0xfffffffULL

static bool fixed_files;
static bool zc;
static bool flush;
static int nr_reqs;
static bool fixed_buf;

static int  cfg_family		= PF_UNSPEC;
static int  cfg_payload_len;
static int  cfg_port		= 8000;
static int  cfg_runtime_ms	= 4200;

static socklen_t cfg_alen;
static struct sockaddr_storage cfg_dst_addr;

static char payload[IP_MAXPACKET] __attribute__((aligned(4096)));


static inline int ____sys_io_uring_register(int fd, unsigned opcode,
					    const void *arg, unsigned nr_args)
{
	int ret;

	ret = syscall(__NR_io_uring_register, fd, opcode, arg, nr_args);
	return (ret < 0) ? -errno : ret;
}

static unsigned long gettimeofday_ms(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
}

static void do_setsockopt(int fd, int level, int optname, int val)
{
	if (setsockopt(fd, level, optname, &val, sizeof(val)))
		error(1, errno, "setsockopt %d.%d: %d", level, optname, val);
}

static void setup_sockaddr(int domain, const char *str_addr,
			   struct sockaddr_storage *sockaddr)
{
	struct sockaddr_in6 *addr6 = (void *) sockaddr;
	struct sockaddr_in *addr4 = (void *) sockaddr;

	switch (domain) {
	case PF_INET:
		memset(addr4, 0, sizeof(*addr4));
		addr4->sin_family = AF_INET;
		addr4->sin_port = htons(cfg_port);
		if (str_addr &&
		    inet_pton(AF_INET, str_addr, &(addr4->sin_addr)) != 1)
			error(1, 0, "ipv4 parse error: %s", str_addr);
		break;
	case PF_INET6:
		memset(addr6, 0, sizeof(*addr6));
		addr6->sin6_family = AF_INET6;
		addr6->sin6_port = htons(cfg_port);
		if (str_addr &&
		    inet_pton(AF_INET6, str_addr, &(addr6->sin6_addr)) != 1)
			error(1, 0, "ipv6 parse error: %s", str_addr);
		break;
	default:
		error(1, 0, "illegal domain");
	}
}

static int do_setup_tx(int domain, int type, int protocol)
{
	int fd;

	fd = socket(domain, type, protocol);
	if (fd == -1)
		error(1, errno, "socket t");

	do_setsockopt(fd, SOL_SOCKET, SO_SNDBUF, 1 << 21);
	do_setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, 1);

	if (connect(fd, (void *) &cfg_dst_addr, cfg_alen))
		error(1, errno, "connect");
	return fd;
}

static inline struct io_uring_cqe *wait_cqe_fast(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head;
	int ret;

	io_uring_for_each_cqe(ring, head, cqe)
		return cqe;

	ret = io_uring_wait_cqe(ring, &cqe);
	if (ret)
		error(1, ret, "wait cqe");
	return cqe;
}

static void do_tx(int domain, int type, int protocol)
{
	unsigned long packets = 0, bytes = 0;
	struct io_uring ring;
	struct iovec iov;
	uint64_t tstop;
	int i, fd, ret;
	int compl_cqes = 0;

	fd = do_setup_tx(domain, type, protocol);

	ret = io_uring_queue_init(512, &ring, 0);
	if (ret)
		error(1, ret, "io_uring: queue init");

	struct __io_uring_tx_ctx_register r = { .tag = ZC_TAG, };
	ret = ____sys_io_uring_register(ring.ring_fd, __IORING_REGISTER_TX_CTX, (void *)&r, 1);
	if (ret)
		error(1, ret, "io_uring: tx ctx registration");

	ret = io_uring_register_files(&ring, &fd, 1);
	if (ret < 0)
		error(1, ret, "io_uring: files registration");

	iov.iov_base = payload;
	iov.iov_len = cfg_payload_len;
	ret = io_uring_register_buffers(&ring, &iov, 1);
	if (ret < 0)
		error(1, ret, "io_uring: buffer registration");

	tstop = gettimeofday_ms() + cfg_runtime_ms;
	do {
		struct io_uring_sqe *sqe;
		struct io_uring_cqe *cqe;

		compl_cqes += flush ? nr_reqs : 0;

		for (i = 0; i < nr_reqs; i++) {
			sqe = io_uring_get_sqe(&ring);
			io_uring_prep_send(sqe, fd, payload, cfg_payload_len, 0);
			sqe->user_data = 1;
			if (fixed_files) {
				sqe->fd = 0;
				sqe->flags = IOSQE_FIXED_FILE;
			}

			if (zc) {
				sqe->opcode = __IORING_OP_SENDZC;
				sqe->file_index = 0; // sqe->tx_ctx_idx = 0;
				sqe->ioprio = 0;
				sqe->off = 0;
				sqe->__pad2[0] = 0;

				if (flush)
					sqe->ioprio |= __IORING_SENDZC_FLUSH;
				if (fixed_buf) {
					sqe->ioprio |= __IORING_SENDZC_FIXED_BUF;
					sqe->buf_index = 0;
				}
			}
		}

		ret = io_uring_submit(&ring);
		if (ret != nr_reqs)
			error(1, ret, "submit");

		for (i = 0; i < nr_reqs; i++) {
			cqe = wait_cqe_fast(&ring);

			if (cqe->user_data == ZC_TAG) {
				compl_cqes--;
				i--;
			} else if (cqe->user_data == 1) {
				if (cqe->res <= 0)
					error(1, cqe->res, "send failed");

				packets++;
				bytes += cqe->res;
			} else {
				error(1, cqe->user_data, "invalid user_data");
			}

			io_uring_cqe_seen(&ring, cqe);
		}
	} while (gettimeofday_ms() < tstop);

	if (close(fd))
		error(1, errno, "close");

	fprintf(stderr, "tx=%lu (MB=%lu), tx/s=%lu (MB/s=%lu)\n",
			packets, bytes >> 20,
			packets / (cfg_runtime_ms / 1000),
			(bytes >> 20) / (cfg_runtime_ms / 1000));

	while (compl_cqes) {
		struct io_uring_cqe *cqe = wait_cqe_fast(&ring);

		io_uring_cqe_seen(&ring, cqe);
		compl_cqes--;
	}

	ret = ____sys_io_uring_register(ring.ring_fd, __IORING_UNREGISTER_TX_CTX,
					NULL, 0);
	if (ret)
		error(1, ret, "io_uring: tx ctx unregistration");

	io_uring_queue_exit(&ring);
}

static void do_test(int domain, int type, int protocol)
{
	int i;

	for (i = 0; i < IP_MAXPACKET; i++)
		payload[i] = 'a' + (i % 26);

	do_tx(domain, type, protocol);
}

static void usage(const char *filepath)
{
	error(1, 0, "Usage: %s [-f] [-n<N>] [-z0] [-s<payload size>] "
		    "(-4|-6) [-t<time s>] -D<dst_ip> udp", filepath);
}

static void parse_opts(int argc, char **argv)
{
	const int max_payload_len = sizeof(payload) -
				    sizeof(struct ipv6hdr) -
				    sizeof(struct tcphdr) -
				    40 /* max tcp options */;
	int c;
	char *daddr = NULL;

	if (argc <= 1)
		usage(argv[0]);

	cfg_payload_len = max_payload_len;
	fixed_files = 1;
	zc = 1;
	flush = 0 && zc;
	nr_reqs = 8;
	fixed_buf = 1 && zc;

	while ((c = getopt(argc, argv, "46D:i:p:s:t:n:r:fz:h")) != -1) {
		switch (c) {
		case '4':
			if (cfg_family != PF_UNSPEC)
				error(1, 0, "Pass one of -4 or -6");
			cfg_family = PF_INET;
			cfg_alen = sizeof(struct sockaddr_in);
			break;
		case '6':
			if (cfg_family != PF_UNSPEC)
				error(1, 0, "Pass one of -4 or -6");
			cfg_family = PF_INET6;
			cfg_alen = sizeof(struct sockaddr_in6);
			break;
		case 'D':
			daddr = optarg;
			break;
		case 'p':
			cfg_port = strtoul(optarg, NULL, 0);
			break;
		case 's':
			cfg_payload_len = strtoul(optarg, NULL, 0);
			break;
		case 't':
			cfg_runtime_ms = 200 + strtoul(optarg, NULL, 10) * 1000;
			break;
		case 'n':
		case 'r':
			nr_reqs = strtoul(optarg, NULL, 0);
			break;
		case 'f':
			flush = 1;
			break;
		case 'z':
			zc = strtoul(optarg, NULL, 0);
			break;
		}
	}

	if (flush && !zc)
		error(1, 0, "Flush should be used with zc only");

	setup_sockaddr(cfg_family, daddr, &cfg_dst_addr);

	if (cfg_payload_len > max_payload_len)
		error(1, 0, "-s: payload exceeds max (%d)", max_payload_len);

	if (optind != argc - 1)
		usage(argv[0]);
}

int main(int argc, char **argv)
{
	const char *cfg_test;

	parse_opts(argc, argv);

	cfg_test = argv[argc - 1];

	if (!strcmp(cfg_test, "tcp"))
		do_test(cfg_family, SOCK_STREAM, 0);
	else if (!strcmp(cfg_test, "udp"))
		do_test(cfg_family, SOCK_DGRAM, 0);
	else
		error(1, 0, "unknown cfg_test %s", cfg_test);

	return 0;
}


* Re: [RFC v2 02/19] skbuff: pass a struct ubuf_info in msghdr
  2021-12-21 15:35 ` [RFC v2 02/19] skbuff: pass a struct ubuf_info in msghdr Pavel Begunkov
@ 2022-01-11 13:51   ` Hao Xu
  2022-01-11 15:50     ` Pavel Begunkov
  0 siblings, 1 reply; 25+ messages in thread
From: Hao Xu @ 2022-01-11 13:51 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe

On 2021/12/21 at 11:35 PM, Pavel Begunkov wrote:
> Instead of the net stack managing ubuf_info, allow to pass it in from
> outside in a struct msghdr (in-kernel structure), so io_uring can make
> use of it.
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
Hi Pavel,
I have some confusion here since I lack network
knowledge.
The first one is: why do we make ubuf_info visible
to io_uring? Why not just follow the old MSG_ZEROCOPY
logic?

The second one: my understanding of the buffer
lifecycle is that the kernel side informs
userspace, via a cqe generated by the ubuf_info
callback, that all the buffers attached to the
same notifier are free to reuse once all the data
is sent. Then why is the flush in 13/19 needed, as
it happens at submission time?

Regards,
Hao


* Re: [RFC v2 02/19] skbuff: pass a struct ubuf_info in msghdr
  2022-01-11 13:51   ` Hao Xu
@ 2022-01-11 15:50     ` Pavel Begunkov
  2022-01-12  3:39       ` Hao Xu
  0 siblings, 1 reply; 25+ messages in thread
From: Pavel Begunkov @ 2022-01-11 15:50 UTC (permalink / raw)
  To: Hao Xu, io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe

On 1/11/22 13:51, Hao Xu wrote:
> On 2021/12/21 at 11:35 PM, Pavel Begunkov wrote:
>> Instead of the net stack managing ubuf_info, allow to pass it in from
>> outside in a struct msghdr (in-kernel structure), so io_uring can make
>> use of it.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
> Hi Pavel,
> I've some confusions here since I have a lack of
> network knowledge.
> The first one is why do we make ubuf_info visible
> for io_uring. Why not just follow the old MSG_ZEROCOPY
> logic?

I assume you mean leaving allocation and so on up to the socket, while
the patchset lets io_uring manage and control ubufs. In short,
performance and convenience.

TL;DR;
First, we want a nice and uniform API with io_uring, i.e. posting
a CQE instead of polling an err queue/etc., and for that the network
will need to know about io_uring ctx in some way. As an alternative it
may theoretically be registered in socket, but it'll quickly turn into
a huge mess, consider that it's a many to many relation b/w io_uring and
sockets. The fact that io_uring holds refs to files will only complicate
it.

It will also limit the API. For instance, we won't be able to use a single
ubuf with several different sockets.

Another problem is performance: registration or some other tricks
would need some additional sync. It'd also need sync on use, say it's
just one rcu_read, but the problem is that it only adds complexity
and prevents some other optimisations. E.g. we amortise to ~0 atomics
getting refs on skb setups based on guarantees io_uring provides, and
not only. SKBFL_MANAGED_FRAGS can only work with pages being controlled
by the issuer, and so it needs some context as currently provided by
ubuf. io_uring also caches ubufs, which relies on io_uring locking, so
it removes kmalloc/free for almost zero overhead.


> The second one, my understanding about the buffer
> lifecycle is that the kernel side informs
> the userspace by a cqe generated by the ubuf_info
> callback that all the buffers attaching to the
> same notifier is now free to use when all the data
> is sent, then why is the flush in 13/19 needed as
> it is at the submission period?

Probably I wasn't clear enough. A user has to flush a notifier; only
then is it expected to post a CQE after all buffers attached to it
are freed. io_uring holds one ubuf ref, which will be released on flush.
I also need to add a way to flush without send.
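
For reference, the send-zc program attached earlier in the thread tells the two
completions apart by user_data: ordinary send CQEs carry the request's
user_data, while notification CQEs carry the tag registered with the tx ctx.
A trimmed-down sketch of its reaping loop:

	struct io_uring_cqe *cqe;

	io_uring_wait_cqe(&ring, &cqe);
	if (cqe->user_data == ZC_TAG) {
		/* notification: buffers attached to a flushed notifier can be reused */
		compl_cqes--;
	} else {
		/* regular send completion: data queued, buffers possibly still in flight */
		if (cqe->res < 0)
			error(1, -cqe->res, "send failed");
	}
	io_uring_cqe_seen(&ring, cqe);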

Will spend some time documenting for next iteration.

-- 
Pavel Begunkov


* Re: [RFC v2 02/19] skbuff: pass a struct ubuf_info in msghdr
  2022-01-11 15:50     ` Pavel Begunkov
@ 2022-01-12  3:39       ` Hao Xu
  2022-01-12 16:53         ` Pavel Begunkov
  0 siblings, 1 reply; 25+ messages in thread
From: Hao Xu @ 2022-01-12  3:39 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe

On 2022/1/11 at 11:50 PM, Pavel Begunkov wrote:
> On 1/11/22 13:51, Hao Xu wrote:
>> On 2021/12/21 at 11:35 PM, Pavel Begunkov wrote:
>>> Instead of the net stack managing ubuf_info, allow to pass it in from
>>> outside in a struct msghdr (in-kernel structure), so io_uring can make
>>> use of it.
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>> Hi Pavel,
>> I've some confusions here since I have a lack of
>> network knowledge.
>> The first one is why do we make ubuf_info visible
>> for io_uring. Why not just follow the old MSG_ZEROCOPY
>> logic?
> 
> I assume you mean leaving allocation up and so in socket awhile the
> patchset let's io_uring to manage and control ubufs. In short,
> performance and out convenience
> 
> TL;DR;
> First, we want a nice and uniform API with io_uring, i.e. posting
> an CQE instead of polling an err queue/etc., and for that the network
> will need to know about io_uring ctx in some way. As an alternative it
> may theoretically be registered in socket, but it'll quickly turn into
> a huge mess, consider that it's a many to many relation b/w io_uring and
> sockets. The fact that io_uring holds refs to files will only complicate
> it.
Make sense to me, thanks.
> 
> It will also limit API. For instance, we won't be able to use a single
> ubuf with several different sockets.
Are there any use cases for multiple sockets with a single
notification?
> 
> Another problem is performance, registration or some other tricks
> would some additional sync. It'd also need sync on use, say it's
> just one rcu_read, but the problem that it only adds up to complexity
> and prevents some other optimisations. E.g. we amortise to ~0 atomics
> getting refs on skb setups based on guarantees io_uring provides, and
> not only. SKBFL_MANAGED_FRAGS can only work with pages being controlled
> by the issuer, and so it needs some context as currently provided by
> ubuf. io_uring also caches ubufs, which relies on io_uring locking, so
> it removes kmalloc/free for almost zero overhead.
> 
> 
>> The second one, my understanding about the buffer
>> lifecycle is that the kernel side informs
>> the userspace by a cqe generated by the ubuf_info
>> callback that all the buffers attaching to the
>> same notifier is now free to use when all the data
>> is sent, then why is the flush in 13/19 needed as
>> it is at the submission period?
> 
> Probably I wasn't clear enough. A user has to flush a notifier, only
> then it's expected to post an CQE after all buffers attached to it
> are freed. io_uring holds one ubuf ref, which will be release on flush.
I see. I saw another ref inc in skb_zcopy_set() which I previously
misunderstood, and thus thought there was only one refcount. Thanks!
> I also need to add a way to flush without send.
> 
> Will spend some time documenting for next iteration.
> 



* Re: [RFC v2 02/19] skbuff: pass a struct ubuf_info in msghdr
  2022-01-12  3:39       ` Hao Xu
@ 2022-01-12 16:53         ` Pavel Begunkov
  0 siblings, 0 replies; 25+ messages in thread
From: Pavel Begunkov @ 2022-01-12 16:53 UTC (permalink / raw)
  To: Hao Xu, io-uring, netdev, linux-kernel
  Cc: Jakub Kicinski, Jonathan Lemon, David S . Miller,
	Willem de Bruijn, Eric Dumazet, David Ahern, Jens Axboe

On 1/12/22 03:39, Hao Xu wrote:
> On 2022/1/11 at 11:50 PM, Pavel Begunkov wrote:
>> On 1/11/22 13:51, Hao Xu wrote:
>>> On 2021/12/21 at 11:35 PM, Pavel Begunkov wrote:
>>>> Instead of the net stack managing ubuf_info, allow to pass it in from
>>>> outside in a struct msghdr (in-kernel structure), so io_uring can make
>>>> use of it.
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>> Hi Pavel,
>>> I've some confusions here since I have a lack of
>>> network knowledge.
>>> The first one is why do we make ubuf_info visible
>>> for io_uring. Why not just follow the old MSG_ZEROCOPY
>>> logic?
>>
>> I assume you mean leaving allocation up and so in socket awhile the
>> patchset let's io_uring to manage and control ubufs. In short,
>> performance and out convenience
>>
>> TL;DR;
>> First, we want a nice and uniform API with io_uring, i.e. posting
>> an CQE instead of polling an err queue/etc., and for that the network
>> will need to know about io_uring ctx in some way. As an alternative it
>> may theoretically be registered in socket, but it'll quickly turn into
>> a huge mess, consider that it's a many to many relation b/w io_uring and
>> sockets. The fact that io_uring holds refs to files will only complicate
>> it.
> Make sense to me, thanks.
>>
>> It will also limit API. For instance, we won't be able to use a single
>> ubuf with several different sockets.
> Is there any use cases for this multiple sockets with single
> notification?

Don't know, scatter send maybe? It's just one of those moments when
a design that feels right (to me) yields more flexibility than was
initially planned, which is definitely a good thing.


>> Another problem is performance, registration or some other tricks
>> would some additional sync. It'd also need sync on use, say it's
>> just one rcu_read, but the problem that it only adds up to complexity
>> and prevents some other optimisations. E.g. we amortise to ~0 atomics
>> getting refs on skb setups based on guarantees io_uring provides, and
>> not only. SKBFL_MANAGED_FRAGS can only work with pages being controlled
>> by the issuer, and so it needs some context as currently provided by
>> ubuf. io_uring also caches ubufs, which relies on io_uring locking, so
>> it removes kmalloc/free for almost zero overhead.
>>
>>
>>> The second one, my understanding about the buffer
>>> lifecycle is that the kernel side informs
>>> the userspace by a cqe generated by the ubuf_info
>>> callback that all the buffers attaching to the
>>> same notifier is now free to use when all the data
>>> is sent, then why is the flush in 13/19 needed as
>>> it is at the submission period?
>>
>> Probably I wasn't clear enough. A user has to flush a notifier, only
>> then it's expected to post an CQE after all buffers attached to it
>> are freed. io_uring holds one ubuf ref, which will be release on flush.
> I see, I saw another ref inc in skb_zcopy_set() which I previously
> misunderstood and thus thought there was only one refcount. Thanks!
>> I also need to add a way to flush without send.
>>
>> Will spend some time documenting for next iteration.

-- 
Pavel Begunkov



Thread overview: 25+ messages
2021-12-21 15:35 [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 01/19] skbuff: add SKBFL_DONT_ORPHAN flag Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 02/19] skbuff: pass a struct ubuf_info in msghdr Pavel Begunkov
2022-01-11 13:51   ` Hao Xu
2022-01-11 15:50     ` Pavel Begunkov
2022-01-12  3:39       ` Hao Xu
2022-01-12 16:53         ` Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 03/19] net: add zerocopy_sg_from_iter for bvec Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 04/19] net: optimise page get/free for bvec zc Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 05/19] net: don't track pfmemalloc for zc registered mem Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 06/19] ipv4/udp: add support msgdr::msg_ubuf Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 07/19] ipv6/udp: " Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 08/19] ipv4: avoid partial copy for zc Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 09/19] ipv6: " Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 10/19] io_uring: add send notifiers registration Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 11/19] io_uring: infrastructure for send zc notifications Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 12/19] io_uring: wire send zc request type Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 13/19] io_uring: add an option to flush zc notifications Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 14/19] io_uring: opcode independent fixed buf import Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 15/19] io_uring: sendzc with fixed buffers Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 16/19] io_uring: cache struct ubuf_info Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 17/19] io_uring: unclog ctx refs waiting with zc notifiers Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 18/19] io_uring: task_work for notification delivery Pavel Begunkov
2021-12-21 15:35 ` [RFC v2 19/19] io_uring: optimise task referencing by notifiers Pavel Begunkov
2021-12-21 15:43 ` [RFC v2 00/19] io_uring zerocopy tx Pavel Begunkov
