* [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements
@ 2022-09-16 21:36 Stefan Metzmacher
  2022-09-16 21:36 ` [PATCH 1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC Stefan Metzmacher
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-16 21:36 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence; +Cc: Stefan Metzmacher

Hi Pavel, hi Jens,

I did some initial testing with IORING_OP_SEND_ZC.
While reading the code I think I found a race that
can lead to IORING_CQE_F_MORE being missing even if
the net layer got references.

While there I added some code to allow userspace to
know how effective the IORING_OP_SEND_ZC attempt was,
in order to avoid it when zero copy is not actually used
(e.g. on a long-living TCP connection).

This change requires a change to the existing test, see:
https://github.com/metze-samba/liburing/tree/test-send-zerocopy

Stefan Metzmacher (5):
  io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
  io_uring/core: move io_cqe->fd over from io_cqe->flags to io_cqe->res
  io_uring/core: keep req->cqe.flags on generic errors
  io_uring/net: let io_sendzc set IORING_CQE_F_MORE before
    sock_sendmsg()
  io_uring/notif: let userspace know how effective the zero copy usage
    was

 include/linux/io_uring_types.h |  6 +++---
 io_uring/io_uring.c            | 18 +++++++++++++-----
 io_uring/net.c                 | 19 +++++++++++++------
 io_uring/notif.c               | 18 ++++++++++++++++++
 io_uring/opdef.c               |  2 +-
 net/ipv4/ip_output.c           |  3 ++-
 net/ipv4/tcp.c                 |  2 ++
 net/ipv6/ip6_output.c          |  3 ++-
 8 files changed, 54 insertions(+), 17 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
  2022-09-16 21:36 [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Stefan Metzmacher
@ 2022-09-16 21:36 ` Stefan Metzmacher
  2022-09-17  9:17   ` Pavel Begunkov
  2022-09-16 21:36 ` [PATCH 2/5] io_uring/core: move io_cqe->fd over from io_cqe->flags to io_cqe->res Stefan Metzmacher
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-16 21:36 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence; +Cc: Stefan Metzmacher

It's confusing to see the string SENDZC_NOTIF in ftrace output
when using IORING_OP_SEND_ZC.

Fixes: b48c312be05e8 ("io_uring/net: simplify zerocopy send user API")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
---
 io_uring/opdef.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index c61494e0a602..c4dddd0fd709 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -471,7 +471,7 @@ const struct io_op_def io_op_defs[] = {
 		.prep_async		= io_uring_cmd_prep_async,
 	},
 	[IORING_OP_SEND_ZC] = {
-		.name			= "SENDZC_NOTIF",
+		.name			= "SEND_ZC",
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollout		= 1,
-- 
2.34.1



* [PATCH 2/5] io_uring/core: move io_cqe->fd over from io_cqe->flags to io_cqe->res
  2022-09-16 21:36 [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Stefan Metzmacher
  2022-09-16 21:36 ` [PATCH 1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC Stefan Metzmacher
@ 2022-09-16 21:36 ` Stefan Metzmacher
  2022-09-16 21:36 ` [PATCH 3/5] io_uring/core: keep req->cqe.flags on generic errors Stefan Metzmacher
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-16 21:36 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence; +Cc: Stefan Metzmacher

Both flags and res have the same lifetime currently.
io_req_set_res() sets both of them.

Callers of io_req_set_res() like req_fail_link_node(),
io_req_complete_failed() and io_req_task_queue_fail()
set io_cqe->res to their caller's value and force flags to 0.

The motivation for this change is the next commit;
it will let us keep io_cqe->flags even on error.
For IORING_OP_SEND_ZC it is needed to keep IORING_CQE_F_MORE
even on a generic failure: userspace needs to know that
an IORING_CQE_F_NOTIF will follow, otherwise the buffers
might be reused too early.

Fixes: b48c312be05e8 ("io_uring/net: simplify zerocopy send user API")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
---
 include/linux/io_uring_types.h |  6 +++---
 io_uring/io_uring.c            | 10 ++++++++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 677a25d44d7f..37925db42ae9 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -473,12 +473,12 @@ struct io_task_work {
 
 struct io_cqe {
 	__u64	user_data;
-	__s32	res;
-	/* fd initially, then cflags for completion */
+	/* fd initially, then res for completion */
 	union {
-		__u32	flags;
 		int	fd;
+		__s32	res;
 	};
+	__u32	flags;
 };
 
 /*
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index b9640ad5069f..ae69cff94664 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -837,8 +837,6 @@ static void io_preinit_req(struct io_kiocb *req, struct io_ring_ctx *ctx)
 	req->ctx = ctx;
 	req->link = NULL;
 	req->async_data = NULL;
-	/* not necessary, but safer to zero */
-	req->cqe.res = 0;
 }
 
 static void io_flush_cached_locked_reqs(struct io_ring_ctx *ctx,
@@ -1574,6 +1572,12 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
 	if (!def->audit_skip)
 		audit_uring_entry(req->opcode);
 
+	/*
+	 * req->cqe.fd was resolved by io_assign_file;
+	 * now make sure its alias req->cqe.res is reset,
+	 * so we don't use that value by accident.
+	 */
+	req->cqe.res = -1;
 	ret = def->issue(req, issue_flags);
 
 	if (!def->audit_skip)
@@ -1902,6 +1906,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			state->need_plug = false;
 			blk_start_plug_nr_ios(&state->plug, state->submit_nr);
 		}
+	} else {
+		req->cqe.fd = -1;
 	}
 
 	personality = READ_ONCE(sqe->personality);
-- 
2.34.1



* [PATCH 3/5] io_uring/core: keep req->cqe.flags on generic errors
  2022-09-16 21:36 [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Stefan Metzmacher
  2022-09-16 21:36 ` [PATCH 1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC Stefan Metzmacher
  2022-09-16 21:36 ` [PATCH 2/5] io_uring/core: move io_cqe->fd over from io_cqe->flags to io_cqe->res Stefan Metzmacher
@ 2022-09-16 21:36 ` Stefan Metzmacher
  2022-09-16 21:36 ` [PATCH 4/5] io_uring/net: let io_sendzc set IORING_CQE_F_MORE before sock_sendmsg() Stefan Metzmacher
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-16 21:36 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence; +Cc: Stefan Metzmacher

Soon we'll have the case where IORING_OP_SEND_ZC adds
IORING_CQE_F_MORE to req->cqe.flags before calling
into sock_sendmsg(). Once we did that, we have to
keep that flag even if we hit some generic error
later, e.g. in the partial io retry case.

Hopefully passing req->cqe.flags to the inline io_req_set_res()
allows the compiler to optimize out the effective
req->cqe.flags = req->cqe.flags assignment.

Fixes: b48c312be05e8 ("io_uring/net: simplify zerocopy send user API")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
---
 io_uring/io_uring.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ae69cff94664..062edbc04168 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -212,7 +212,7 @@ bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
 static inline void req_fail_link_node(struct io_kiocb *req, int res)
 {
 	req_set_fail(req);
-	io_req_set_res(req, res, 0);
+	io_req_set_res(req, res, req->cqe.flags);
 }
 
 static inline void io_req_add_to_cache(struct io_kiocb *req, struct io_ring_ctx *ctx)
@@ -824,7 +824,8 @@ inline void __io_req_complete(struct io_kiocb *req, unsigned issue_flags)
 void io_req_complete_failed(struct io_kiocb *req, s32 res)
 {
 	req_set_fail(req);
-	io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
+	req->cqe.flags |= io_put_kbuf(req, IO_URING_F_UNLOCKED);
+	io_req_set_res(req, res, req->cqe.flags);
 	io_req_complete_post(req);
 }
 
@@ -1106,7 +1107,7 @@ void io_req_task_submit(struct io_kiocb *req, bool *locked)
 
 void io_req_task_queue_fail(struct io_kiocb *req, int ret)
 {
-	io_req_set_res(req, ret, 0);
+	io_req_set_res(req, ret, req->cqe.flags);
 	req->io_task_work.func = io_req_task_cancel;
 	io_req_task_work_add(req);
 }
@@ -1847,6 +1848,7 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	/* same numerical values with corresponding REQ_F_*, safe to copy */
 	req->flags = sqe_flags = READ_ONCE(sqe->flags);
 	req->cqe.user_data = READ_ONCE(sqe->user_data);
+	req->cqe.flags = 0;
 	req->file = NULL;
 	req->rsrc_node = NULL;
 	req->task = current;
-- 
2.34.1



* [PATCH 4/5] io_uring/net: let io_sendzc set IORING_CQE_F_MORE before sock_sendmsg()
  2022-09-16 21:36 [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Stefan Metzmacher
                   ` (2 preceding siblings ...)
  2022-09-16 21:36 ` [PATCH 3/5] io_uring/core: keep req->cqe.flags on generic errors Stefan Metzmacher
@ 2022-09-16 21:36 ` Stefan Metzmacher
  2022-09-16 21:36 ` [PATCH 5/5] io_uring/notif: let userspace know how effective the zero copy usage was Stefan Metzmacher
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-16 21:36 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence; +Cc: Stefan Metzmacher

sock_sendmsg() can take references to the passed buffers even on
failure!

So we need to make sure we'll set IORING_CQE_F_MORE before
calling sock_sendmsg().

As REQ_F_CQE_SKIP for the notif and IORING_CQE_F_MORE for the main
request go hand in hand, let's simplify the REQ_F_CQE_SKIP logic too.

We just start with REQ_F_CQE_SKIP set and reset it when we
set IORING_CQE_F_MORE on the main request in order to have
the transition in one isolated place.

In the future we might be able to revert IORING_CQE_F_MORE and
!REQ_F_CQE_SKIP again if we find out that no reference was
taken by the network layer. But that's a change for another day.
The important thing would just be that the documentation for
IORING_OP_SEND_ZC indicates that the kernel may decide
to return just a single cqe without IORING_CQE_F_MORE, even
in the success case, so that userspace would not break when
we add such an optimization at a later point.

Fixes: b48c312be05e8 ("io_uring/net: simplify zerocopy send user API")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
---
 io_uring/net.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index e9efed40cf3d..61e6194b01b7 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -883,7 +883,6 @@ void io_sendzc_cleanup(struct io_kiocb *req)
 {
 	struct io_sendzc *zc = io_kiocb_to_cmd(req, struct io_sendzc);
 
-	zc->notif->flags |= REQ_F_CQE_SKIP;
 	io_notif_flush(zc->notif);
 	zc->notif = NULL;
 }
@@ -920,6 +919,8 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	notif->cqe.user_data = req->cqe.user_data;
 	notif->cqe.res = 0;
 	notif->cqe.flags = IORING_CQE_F_NOTIF;
+	/* skip the notif cqe until we call sock_sendmsg() */
+	notif->flags |= REQ_F_CQE_SKIP;
 	req->flags |= REQ_F_NEED_CLEANUP;
 
 	zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
@@ -1000,7 +1001,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 	struct msghdr msg;
 	struct iovec iov;
 	struct socket *sock;
-	unsigned msg_flags, cflags;
+	unsigned msg_flags;
 	int ret, min_ret = 0;
 
 	sock = sock_from_file(req->file);
@@ -1055,6 +1056,15 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 	msg.msg_flags = msg_flags;
 	msg.msg_ubuf = &io_notif_to_data(zc->notif)->uarg;
 	msg.sg_from_iter = io_sg_from_iter;
+
+	/*
+	 * Now that we call sock_sendmsg,
+	 * we need to assume that the data is referenced
+	 * even on failure!
+	 * So we need to force a NOTIF cqe
+	 */
+	zc->notif->flags &= ~REQ_F_CQE_SKIP;
+	req->cqe.flags |= IORING_CQE_F_MORE;
 	ret = sock_sendmsg(sock, &msg);
 
 	if (unlikely(ret < min_ret)) {
@@ -1068,8 +1078,6 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 			req->flags |= REQ_F_PARTIAL_IO;
 			return io_setup_async_addr(req, addr, issue_flags);
 		}
-		if (ret < 0 && !zc->done_io)
-			zc->notif->flags |= REQ_F_CQE_SKIP;
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 		req_set_fail(req);
@@ -1082,8 +1090,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 
 	io_notif_flush(zc->notif);
 	req->flags &= ~REQ_F_NEED_CLEANUP;
-	cflags = ret >= 0 ? IORING_CQE_F_MORE : 0;
-	io_req_set_res(req, ret, cflags);
+	io_req_set_res(req, ret, req->cqe.flags);
 	return IOU_OK;
 }
 
-- 
2.34.1



* [PATCH 5/5] io_uring/notif: let userspace know how effective the zero copy usage was
  2022-09-16 21:36 [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Stefan Metzmacher
                   ` (3 preceding siblings ...)
  2022-09-16 21:36 ` [PATCH 4/5] io_uring/net: let io_sendzc set IORING_CQE_F_MORE before sock_sendmsg() Stefan Metzmacher
@ 2022-09-16 21:36 ` Stefan Metzmacher
  2022-09-17  9:22   ` Pavel Begunkov
  2022-09-17  9:16 ` [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Pavel Begunkov
  2022-09-18 22:49 ` (subset) " Jens Axboe
  6 siblings, 1 reply; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-16 21:36 UTC (permalink / raw)
  To: io-uring, axboe, asml.silence; +Cc: Stefan Metzmacher, Jakub Kicinski, netdev

The 2nd cqe for IORING_OP_SEND_ZC has IORING_CQE_F_NOTIF set in cqe->flags
and it will now have the number of successfully completed
io_uring_tx_zerocopy_callback() callbacks in the lower 31 bits
of cqe->res; the high bit (0x80000000) is set when
io_uring_tx_zerocopy_callback() was called with success=false.

If cqe->res is still 0, zero copy wasn't used at all.

These values give userspace a chance to adjust its strategy
when choosing IORING_OP_SEND_ZC or IORING_OP_SEND. And it's a bit
richer than just a simple SO_EE_CODE_ZEROCOPY_COPIED indication.

Fixes: b48c312be05e8 ("io_uring/net: simplify zerocopy send user API")
Fixes: eb315a7d1396b ("tcp: support externally provided ubufs")
Fixes: 1fd3ae8c906c0 ("ipv6/udp: support externally provided ubufs")
Fixes: c445f31b3cfaa ("ipv4/udp: support externally provided ubufs")
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: io-uring@vger.kernel.org
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
---
 io_uring/notif.c      | 18 ++++++++++++++++++
 net/ipv4/ip_output.c  |  3 ++-
 net/ipv4/tcp.c        |  2 ++
 net/ipv6/ip6_output.c |  3 ++-
 4 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..b07d2a049931 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -28,7 +28,24 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
 	struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
 	struct io_kiocb *notif = cmd_to_io_kiocb(nd);
 
+	uarg->zerocopy = uarg->zerocopy & success;
+
+	if (success && notif->cqe.res < S32_MAX)
+		notif->cqe.res++;
+
 	if (refcount_dec_and_test(&uarg->refcnt)) {
+		/*
+		 * If we hit at least one case that
+		 * was not able to use zero copy,
+		 * we set the high bit 0x80000000
+		 * so that notif->cqe.res < 0 means it was
+		 * copied at least once.
+		 *
+		 * The other 31 bits are the success count.
+		 */
+		if (!uarg->zerocopy)
+			notif->cqe.res |= S32_MIN;
+
 		notif->io_task_work.func = __io_notif_complete_tw;
 		io_req_task_work_add(notif);
 	}
@@ -53,6 +70,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
 
 	nd = io_notif_to_data(notif);
 	nd->account_pages = 0;
+	nd->uarg.zerocopy = 1;
 	nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
 	nd->uarg.callback = io_uring_tx_zerocopy_callback;
 	refcount_set(&nd->uarg.refcnt, 1);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d7bd1daf022b..4bdea7a4b2f7 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1032,7 +1032,8 @@ static int __ip_append_data(struct sock *sk,
 				paged = true;
 				zc = true;
 				uarg = msg->msg_ubuf;
-			}
+			} else
+				msg->msg_ubuf->zerocopy = 0;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
 			if (!uarg)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 970e9a2cca4a..27a22d470741 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1231,6 +1231,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			uarg = msg->msg_ubuf;
 			net_zcopy_get(uarg);
 			zc = sk->sk_route_caps & NETIF_F_SG;
+			if (!zc)
+				uarg->zerocopy = 0;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
 			if (!uarg) {
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index f152e51242cb..d85036e91cf7 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1556,7 +1556,8 @@ static int __ip6_append_data(struct sock *sk,
 				paged = true;
 				zc = true;
 				uarg = msg->msg_ubuf;
-			}
+			} else
+				msg->msg_ubuf->zerocopy = 0;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
 			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
 			if (!uarg)
-- 
2.34.1



* Re: [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements
  2022-09-16 21:36 [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Stefan Metzmacher
                   ` (4 preceding siblings ...)
  2022-09-16 21:36 ` [PATCH 5/5] io_uring/notif: let userspace know how effective the zero copy usage was Stefan Metzmacher
@ 2022-09-17  9:16 ` Pavel Begunkov
  2022-09-17 10:44   ` Stefan Metzmacher
  2022-09-18 22:49 ` (subset) " Jens Axboe
  6 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2022-09-17  9:16 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring, axboe

On 9/16/22 22:36, Stefan Metzmacher wrote:
> Hi Pavel, hi Jens,
> 
> I did some initial testing with IORING_OP_SEND_ZC.
> While reading the code I think I found a race that
> can lead to IORING_CQE_F_MORE being missing even if
> the net layer got references.

Hey Stefan,

Did you see some kind of buggy behaviour in userspace?
If the network sends anything it should return how many bytes
it queued for sending, otherwise there would be duplicated
packets / data on the other endpoint in userspace, and I
don't think any driver / lower layer would keep memory
after returning an error.

In any case, I was looking on a bit different problem, but
it should look much cleaner using the same approach, see
branch [1], and patch [3] for sendzc in particular.

[1] https://github.com/isilence/linux.git partial-fail
[2] https://github.com/isilence/linux/tree/io_uring/partial-fail
[3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894


> While there I added some code to allow userspace to
> know how effective the IORING_OP_SEND_ZC attempt was,
> in order to avoid it when zero copy is not actually used
> (e.g. on a long-living TCP connection).
> 
> This change requires a change to the existing test, see:
> https://github.com/metze-samba/liburing/tree/test-send-zerocopy
> 
> Stefan Metzmacher (5):
>    io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
>    io_uring/core: move io_cqe->fd over from io_cqe->flags to io_cqe->res
>    io_uring/core: keep req->cqe.flags on generic errors
>    io_uring/net: let io_sendzc set IORING_CQE_F_MORE before
>      sock_sendmsg()
>    io_uring/notif: let userspace know how effective the zero copy usage
>      was
> 
>   include/linux/io_uring_types.h |  6 +++---
>   io_uring/io_uring.c            | 18 +++++++++++++-----
>   io_uring/net.c                 | 19 +++++++++++++------
>   io_uring/notif.c               | 18 ++++++++++++++++++
>   io_uring/opdef.c               |  2 +-
>   net/ipv4/ip_output.c           |  3 ++-
>   net/ipv4/tcp.c                 |  2 ++
>   net/ipv6/ip6_output.c          |  3 ++-
>   8 files changed, 54 insertions(+), 17 deletions(-)
> 

-- 
Pavel Begunkov


* Re: [PATCH 1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
  2022-09-16 21:36 ` [PATCH 1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC Stefan Metzmacher
@ 2022-09-17  9:17   ` Pavel Begunkov
  0 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2022-09-17  9:17 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring, axboe

On 9/16/22 22:36, Stefan Metzmacher wrote:
> It's confusing to see the string SENDZC_NOTIF in ftrace output
> when using IORING_OP_SEND_ZC.
> 
> Fixes: b48c312be05e8 ("io_uring/net: simplify zerocopy send user API")

It doesn't really fix anything, but it's quite simple and
we may want to take it for 6.0.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>


> Signed-off-by: Stefan Metzmacher <metze@samba.org>
> Cc: Pavel Begunkov <asml.silence@gmail.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: io-uring@vger.kernel.org
> ---
>   io_uring/opdef.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
> index c61494e0a602..c4dddd0fd709 100644
> --- a/io_uring/opdef.c
> +++ b/io_uring/opdef.c
> @@ -471,7 +471,7 @@ const struct io_op_def io_op_defs[] = {
>   		.prep_async		= io_uring_cmd_prep_async,
>   	},
>   	[IORING_OP_SEND_ZC] = {
> -		.name			= "SENDZC_NOTIF",
> +		.name			= "SEND_ZC",
>   		.needs_file		= 1,
>   		.unbound_nonreg_file	= 1,
>   		.pollout		= 1,

-- 
Pavel Begunkov


* Re: [PATCH 5/5] io_uring/notif: let userspace know how effective the zero copy usage was
  2022-09-16 21:36 ` [PATCH 5/5] io_uring/notif: let userspace know how effective the zero copy usage was Stefan Metzmacher
@ 2022-09-17  9:22   ` Pavel Begunkov
  2022-09-17 10:24     ` Stefan Metzmacher
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2022-09-17  9:22 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring, axboe; +Cc: Jakub Kicinski, netdev

On 9/16/22 22:36, Stefan Metzmacher wrote:
> The 2nd cqe for IORING_OP_SEND_ZC has IORING_CQE_F_NOTIF set in cqe->flags
> and it will now have the number of successful completed
> io_uring_tx_zerocopy_callback() callbacks in the lower 31-bits
> of cqe->res, the high bit (0x80000000) is set when
> io_uring_tx_zerocopy_callback() was called with success=false.

It has a couple of problems, and because that "simplify uapi"
patch is transitional it doesn't go well with what I'm queuing
for 6.1, let's hold it for a while.


> If cqe->res is still 0, zero copy wasn't used at all.
> 
> These values give userspace a chance to adjust its strategy
> when choosing IORING_OP_SEND_ZC or IORING_OP_SEND. And it's a bit
> richer than just a simple SO_EE_CODE_ZEROCOPY_COPIED indication.
> 
> Fixes: b48c312be05e8 ("io_uring/net: simplify zerocopy send user API")
> Fixes: eb315a7d1396b ("tcp: support externally provided ubufs")
> Fixes: 1fd3ae8c906c0 ("ipv6/udp: support externally provided ubufs")
> Fixes: c445f31b3cfaa ("ipv4/udp: support externally provided ubufs")
> Signed-off-by: Stefan Metzmacher <metze@samba.org>
> Cc: Pavel Begunkov <asml.silence@gmail.com>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: io-uring@vger.kernel.org
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: netdev@vger.kernel.org
> ---
>   io_uring/notif.c      | 18 ++++++++++++++++++
>   net/ipv4/ip_output.c  |  3 ++-
>   net/ipv4/tcp.c        |  2 ++
>   net/ipv6/ip6_output.c |  3 ++-
>   4 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/io_uring/notif.c b/io_uring/notif.c
> index e37c6569d82e..b07d2a049931 100644
> --- a/io_uring/notif.c
> +++ b/io_uring/notif.c
> @@ -28,7 +28,24 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
>   	struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
>   	struct io_kiocb *notif = cmd_to_io_kiocb(nd);
>   
> +	uarg->zerocopy = uarg->zerocopy & success;
> +
> +	if (success && notif->cqe.res < S32_MAX)
> +		notif->cqe.res++;
> +
>   	if (refcount_dec_and_test(&uarg->refcnt)) {
> +		/*
> +		 * If we hit at least one case that
> +		 * was not able to use zero copy,
> +		 * we set the high bit 0x80000000
> +		 * so that notif->cqe.res < 0 means it was
> +		 * copied at least once.
> +		 *
> +		 * The other 31 bits are the success count.
> +		 */
> +		if (!uarg->zerocopy)
> +			notif->cqe.res |= S32_MIN;
> +
>   		notif->io_task_work.func = __io_notif_complete_tw;
>   		io_req_task_work_add(notif);
>   	}
> @@ -53,6 +70,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
>   
>   	nd = io_notif_to_data(notif);
>   	nd->account_pages = 0;
> +	nd->uarg.zerocopy = 1;
>   	nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
>   	nd->uarg.callback = io_uring_tx_zerocopy_callback;
>   	refcount_set(&nd->uarg.refcnt, 1);
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index d7bd1daf022b..4bdea7a4b2f7 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -1032,7 +1032,8 @@ static int __ip_append_data(struct sock *sk,
>   				paged = true;
>   				zc = true;
>   				uarg = msg->msg_ubuf;
> -			}
> +			} else
> +				msg->msg_ubuf->zerocopy = 0;
>   		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>   			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
>   			if (!uarg)
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 970e9a2cca4a..27a22d470741 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1231,6 +1231,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>   			uarg = msg->msg_ubuf;
>   			net_zcopy_get(uarg);
>   			zc = sk->sk_route_caps & NETIF_F_SG;
> +			if (!zc)
> +				uarg->zerocopy = 0;
>   		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>   			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
>   			if (!uarg) {
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index f152e51242cb..d85036e91cf7 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1556,7 +1556,8 @@ static int __ip6_append_data(struct sock *sk,
>   				paged = true;
>   				zc = true;
>   				uarg = msg->msg_ubuf;
> -			}
> +			} else
> +				msg->msg_ubuf->zerocopy = 0;
>   		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
>   			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
>   			if (!uarg)

-- 
Pavel Begunkov


* Re: [PATCH 5/5] io_uring/notif: let userspace know how effective the zero copy usage was
  2022-09-17  9:22   ` Pavel Begunkov
@ 2022-09-17 10:24     ` Stefan Metzmacher
  2022-09-21 12:04       ` Pavel Begunkov
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-17 10:24 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, axboe; +Cc: Jakub Kicinski, netdev


Am 17.09.22 um 11:22 schrieb Pavel Begunkov:
> On 9/16/22 22:36, Stefan Metzmacher wrote:
>> The 2nd cqe for IORING_OP_SEND_ZC has IORING_CQE_F_NOTIF set in cqe->flags
>> and it will now have the number of successfully completed
>> io_uring_tx_zerocopy_callback() callbacks in the lower 31 bits
>> of cqe->res, the high bit (0x80000000) is set when
>> io_uring_tx_zerocopy_callback() was called with success=false.
> 
> It has a couple of problems, and because that "simplify uapi"
> patch is transitional it doesn't go well with what I'm queuing
> for 6.1, let's hold it for a while.

Once the current behavior is released as stable, we're no
longer able to change the meaning of cqe.res.

As cqe.res == 0 would mean zero copy wasn't used at all,
which would be the indication for userspace to avoid using SEND_ZC.

But if 6.0 would always return cqe.res = 0, there's no chance for
userspace to have a detection strategy.

And I don't think it will cause a lot of trouble for your 6.1 stuff
(assuming you mean your SENDMSG_ZC code); I was already carrying that
on top of my test branches, the current one is:
https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-6.0.0-rc5-metze.08

I plan to test SENDMSG_ZC with Samba next week.

metze


* Re: [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements
  2022-09-17  9:16 ` [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Pavel Begunkov
@ 2022-09-17 10:44   ` Stefan Metzmacher
  2022-09-21 11:39     ` Pavel Begunkov
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-17 10:44 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, axboe


Am 17.09.22 um 11:16 schrieb Pavel Begunkov:
> On 9/16/22 22:36, Stefan Metzmacher wrote:
>> Hi Pavel, hi Jens,
>>
>> I did some initial testing with IORING_OP_SEND_ZC.
>> While reading the code I think I found a race that
>> can lead to IORING_CQE_F_MORE being missing even if
>> the net layer got references.
> 
> Hey Stefan,
> 
> Did you see some kind of buggy behaviour in userspace?

No, I was just reading the code and found it a bit confusing,
and couldn't prove that we don't have a problem with losing
a notif cqe.

>> If the network sends anything it should return how many bytes
> it queued for sending, otherwise there would be duplicated
> packets / data on the other endpoint in userspace, and I
> don't think any driver / lower layer would keep memory
> after returning an error.

As I'm also working on a socket driver for smbdirect,
I already thought about how I could hook into
IORING_OP_SEND[MSG]_ZC. For sendmsg I'd have
a loop sending individual fragments, which take a reference,
but if I find a connection drop after the first one, I'd
return ECONNRESET or EPIPE in order to get faster recovery
instead of announcing a short write to the caller.

If we would take my 5/5 we could also have a different
strategy to decide if MORE/NOTIF is needed:
if notif->cqe.res is still 0 and io_notif_flush drops
the last reference, we could go without MORE/NOTIF at all.
In all other cases we'd either set MORE/NOTIF at the end
of io_sendzc or in the fail hook.

> In any case, I was looking on a bit different problem, but
> it should look much cleaner using the same approach, see
> branch [1], and patch [3] for sendzc in particular.
> 
> [1] https://github.com/isilence/linux.git partial-fail
> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894

	const struct io_op_def *def = &io_op_defs[req->opcode];

	req_set_fail(req);
	io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
	if (def->fail)
		def->fail(req);
	io_req_complete_post(req);

Will lose req->cqe.flags, but the fail hook in general looks like a good idea.

And don't we care about the other failure cases where req->cqe.flags gets overwritten?

metze


* Re: (subset) [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements
  2022-09-16 21:36 [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Stefan Metzmacher
                   ` (5 preceding siblings ...)
  2022-09-17  9:16 ` [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Pavel Begunkov
@ 2022-09-18 22:49 ` Jens Axboe
  6 siblings, 0 replies; 17+ messages in thread
From: Jens Axboe @ 2022-09-18 22:49 UTC (permalink / raw)
  To: Stefan Metzmacher, asml.silence, io-uring

On Fri, 16 Sep 2022 23:36:24 +0200, Stefan Metzmacher wrote:
> I did some initial testing with IORING_OP_SEND_ZC.
> While reading the code I think I found a race that
> can lead to IORING_CQE_F_MORE being missing even if
> the net layer got references.
> 
> While there I added some code to allow userspace to
> know how effective the IORING_OP_SEND_ZC attempt was,
> in order to avoid it if it's not used (e.g. on a long-living tcp
> connection).
> 
> [...]

Applied, thanks!

[1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC
      commit: 9bd3f728223ebcfef8e9d087bdd142f0e388215d

Best regards,
-- 
Jens Axboe

* Re: [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements
  2022-09-17 10:44   ` Stefan Metzmacher
@ 2022-09-21 11:39     ` Pavel Begunkov
  2022-09-21 12:18       ` Stefan Metzmacher
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2022-09-21 11:39 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring, axboe

On 9/17/22 11:44, Stefan Metzmacher wrote:
> Am 17.09.22 um 11:16 schrieb Pavel Begunkov:
>> On 9/16/22 22:36, Stefan Metzmacher wrote:
>>> Hi Pavel, hi Jens,
>>>
>>> I did some initial testing with IORING_OP_SEND_ZC.
>>> While reading the code I think I found a race that
>>> can lead to IORING_CQE_F_MORE being missing even if
>>> the net layer got references.
>>
>> Hey Stefan,
>>
>> Did you see some kind of buggy behaviour in userspace?

Apologies for the delay,

> No I was just reading the code and found it a bit confusing,
> and couldn't prove that we don't have a problem with losing
> a notif cqe.
> 
>> If network sends anything it should return how many bytes
>> it queued for sending, otherwise there would be duplicated
>> packets / data on the other endpoint in userspace, and I
>> don't think any driver / lower layer would keep memory
>> after returning an error.
> 
> As I'm also working on a socket driver for smbdirect,
> I already thought about how I could hook into
> IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
> a loop sending individual fragments, which have a reference,
> but if I find a connection drop after the first one, I'd
> return ECONNRESET or EPIPE in order to get faster recovery
> instead of announcing a short write to the caller.

It doesn't sound right to me, but I don't know samba well
enough to really have an opinion. In any case, I see how it may be
more robust if we always try to push a notification cqe.
Will you send a patch?

> If we would take my 5/5 we could also have a different
> strategy to decide if MORE/NOTIF is needed.
> If notif->cqe.res is still 0 and io_notif_flush drops
> the last reference we could go without MORE/NOTIF at all.
> In all other cases we'd either set MORE/NOTIF at the end
> of io_sendzc or in the fail hook.

I had a similar optimisation, i.e. when io_notif_flush() in
the submission path is dropping the last ref, but killed it
as it was completely useless, I haven't hit this path even
once even with UDP, not to mention TCP.

>> In any case, I was looking on a bit different problem, but
>> it should look much cleaner using the same approach, see
>> branch [1], and patch [3] for sendzc in particular.
>>
>> [1] https://github.com/isilence/linux.git partial-fail
>> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
>> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894
> 
>      const struct io_op_def *def = &io_op_defs[req->opcode];
> 
>      req_set_fail(req);
>      io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
>      if (def->fail)
>          def->fail(req);
>      io_req_complete_post(req);
> 
> Will lose req->cqe.flags, but the fail hook in general looks like a good idea.

I just don't like those sporadic changes all across core io_uring
code also adding some overhead.

> And don't we care about the other failure cases where req->cqe.flags gets overwritten?

We don't usually carry them around ->issue handler boundaries,
e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);

IORING_CQE_F_BUFFER is a bit trickier, but there is
special handling for this one and it wouldn't fit "set cflags
in advance" logic anyway.

iow, ->fail callback sounds good enough for now, we'll change
it later if needed.

-- 
Pavel Begunkov

* Re: [PATCH 5/5] io_uring/notif: let userspace know how effective the zero copy usage was
  2022-09-17 10:24     ` Stefan Metzmacher
@ 2022-09-21 12:04       ` Pavel Begunkov
  2022-09-21 12:33         ` Stefan Metzmacher
  0 siblings, 1 reply; 17+ messages in thread
From: Pavel Begunkov @ 2022-09-21 12:04 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring, axboe; +Cc: Jakub Kicinski, netdev

On 9/17/22 11:24, Stefan Metzmacher wrote:
> Am 17.09.22 um 11:22 schrieb Pavel Begunkov:
>> On 9/16/22 22:36, Stefan Metzmacher wrote:
>>> The 2nd cqe for IORING_OP_SEND_ZC has IORING_CQE_F_NOTIF set in cqe->flags
>>> and it will now have the number of successfully completed
>>> io_uring_tx_zerocopy_callback() callbacks in the lower 31-bits
>>> of cqe->res, the high bit (0x80000000) is set when
>>> io_uring_tx_zerocopy_callback() was called with success=false.
>>
>> It has a couple of problems, and because that "simplify uapi"
>> patch is transitional it doesn't go well with what I'm queuing
>> for 6.1, let's hold it for a while.
> 
> Once the current behavior gets released stable, we're no
> longer able to change the meaning of cqe.res.
> 
> As cqe.res == 0 would mean zero copy wasn't used at all,
> which would be the indication for userspace to avoid using SEND_ZC.
> 
> But if 6.0 would always return cqe.res = 0, there's no chance for
> userspace to have a detection strategy.
> 
> And I don't think it will cause a lot of trouble for your 6.1 stuff (assuming
> you mean your SENDMSG_ZC code), I was already having that on top
> of my test branches, the current one is:
> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-6.0.0-rc5-metze.08

Not that one though: 1) I want to shrink ubuf_info as we're a bit out
of space on the io_uring side and it prevents some embedding, so there
won't be room for the additional fields you used. And 2) we want to have a feature
merging completion + notif CQEs into one (useful for UDP and some TCP
cases), but that would mean we'll be using cqe->res for the normal
return value.

We can disable the success counting in this case, but it's not nice,
and we can't actually count in io_uring_tx_zerocopy_callback() as in
the patch (racy). Also, the callback will be called multiple times for
a number of different reasons like io_uring flush or net splitting skbs.
The number won't be very useful and unnecessarily exposes internal details,
so I think F_COPIED in cqe->flags is a better option.

It's a good question though whether we need more versatile reporting and
how to do it right. Probably should be optional and not a part of IO path,
e.g. send(MSG_PROBE) / ioctl / proc stats / etc.

> I plan to test SENDMSG_ZC with Samba next week.

Awesome

-- 
Pavel Begunkov

* Re: [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements
  2022-09-21 11:39     ` Pavel Begunkov
@ 2022-09-21 12:18       ` Stefan Metzmacher
  2022-09-21 12:58         ` Pavel Begunkov
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-21 12:18 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, axboe


Hi Pavel,

>>> If network sends anything it should return how many bytes
>>> it queued for sending, otherwise there would be duplicated
>>> packets / data on the other endpoint in userspace, and I
>>> don't think any driver / lower layer would keep memory
>>> after returning an error.
>>
>> As I'm also working on a socket driver for smbdirect,
>> I already thought about how I could hook into
>> IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
>> a loop sending individual fragments, which have a reference,
>> but if I find a connection drop after the first one, I'd
>> return ECONNRESET or EPIPE in order to get faster recovery
>> instead of announcing a short write to the caller.
> 
> It doesn't sound right to me, but I don't know samba well
> enough to really have an opinion. In any case, I see how it may be
> more robust if we always try to push a notification cqe.
> Will you send a patch?

You mean IORING_OP_SEND_ZC should always
issue a NOTIF cqe, once it passed the io_sendzc_prep stage?

>> If we would take my 5/5 we could also have a different
>> strategy to decide if MORE/NOTIF is needed.
>> If notif->cqe.res is still 0 and io_notif_flush drops
>> the last reference we could go without MORE/NOTIF at all.
>> In all other cases we'd either set MORE/NOTIF at the end
>> of io_sendzc or in the fail hook.
> 
> I had a similar optimisation, i.e. when io_notif_flush() in
> the submission path is dropping the last ref, but killed it
> as it was completely useless, I haven't hit this path even
> once even with UDP, not to mention TCP.

If I remember correctly I hit it all the time on loopback,
but I'd have to recheck.

>>> In any case, I was looking on a bit different problem, but
>>> it should look much cleaner using the same approach, see
>>> branch [1], and patch [3] for sendzc in particular.
>>>
>>> [1] https://github.com/isilence/linux.git partial-fail
>>> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
>>> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894
>>
>>      const struct io_op_def *def = &io_op_defs[req->opcode];
>>
>>      req_set_fail(req);
>>      io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
>>      if (def->fail)
>>          def->fail(req);
>>      io_req_complete_post(req);
>>
>> Will lose req->cqe.flags, but the fail hook in general looks like a good idea.
> 
> I just don't like those sporadic changes all across core io_uring
> code also adding some overhead.
> 
>> And don't we care about the other failure cases where req->cqe.flags gets overwritten?
> 
> We don't usually carry them around ->issue handler boundaries,
> e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);
> 
>> IORING_CQE_F_BUFFER is a bit trickier, but there is
> special handling for this one and it wouldn't fit "set cflags
> in advance" logic anyway.
> 
> iow, ->fail callback sounds good enough for now, we'll change
> it later if needed.

The fail hook should re-add the MORE flag?

So I'll try to do the following changes:

1. take your ->fail() patches

2. once io_sendzc_prep() is over always trigger MORE/NOTIF cqes
    (But the documentation should still make it clear that
     userspace has to cope with just a single cqe (without MORE)
     for both success and failure, so that we can improve things later)

3. Can I change the cqe.res of the NOTIF cqe to be 0xffffffff ?
    That would indicate to userspace that we don't give any information
    on whether zero copy was actually used or not. This would prevent
    someone from relying on cqe.res == 0, and we can improve it by
    providing more useful values in future.

Are you ok with that plan for 6.0?

metze

* Re: [PATCH 5/5] io_uring/notif: let userspace know how effective the zero copy usage was
  2022-09-21 12:04       ` Pavel Begunkov
@ 2022-09-21 12:33         ` Stefan Metzmacher
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Metzmacher @ 2022-09-21 12:33 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring, axboe; +Cc: Jakub Kicinski, netdev

Am 21.09.22 um 14:04 schrieb Pavel Begunkov:
> On 9/17/22 11:24, Stefan Metzmacher wrote:
>> Am 17.09.22 um 11:22 schrieb Pavel Begunkov:
>>> On 9/16/22 22:36, Stefan Metzmacher wrote:
>>>> The 2nd cqe for IORING_OP_SEND_ZC has IORING_CQE_F_NOTIF set in cqe->flags
>>>> and it will now have the number of successfully completed
>>>> io_uring_tx_zerocopy_callback() callbacks in the lower 31-bits
>>>> of cqe->res, the high bit (0x80000000) is set when
>>>> io_uring_tx_zerocopy_callback() was called with success=false.
>>>
>>> It has a couple of problems, and because that "simplify uapi"
>>> patch is transitional it doesn't go well with what I'm queuing
>>> for 6.1, let's hold it for a while.
>>
>> Once the current behavior gets released stable, we're no
>> longer able to change the meaning of cqe.res.
>>
>> As cqe.res == 0 would mean zero copy wasn't used at all,
>> which would be the indication for userspace to avoid using SEND_ZC.
>>
>> But if 6.0 would always return cqe.res = 0, there's no chance for
>> userspace to have a detection strategy.
>>
>> And I don't think it will cause a lot of trouble for your 6.1 stuff (assuming
>> you mean your SENDMSG_ZC code), I was already having that on top
>> of my test branches, the current one is:
>> https://git.samba.org/?p=metze/linux/wip.git;a=shortlog;h=refs/heads/io_uring-6.0.0-rc5-metze.08
> 
> Not that one though, 1) I want to shrink ubuf_info as we're a bit out
> of space on the io_uring side and it prevents some embedding, so there
> won't be additional fields you used. And 2) we want to have a feature
> merging completion + notif CQEs into one (useful for UDP and some TCP
> cases), but that would mean we'll be using cqe->res for the normal
> return value.

But wouldn't that then just not have the MORE flag set, and be just
like IO_SEND?

> We can disable the success counting in this case, but it's not nice,
> and we can't actually count in io_uring_tx_zerocopy_callback() as in
> the patch (racy). Also, the callback will be called multiple times for
> a number of different reasons like io_uring flush or net splitting skbs.
> The number won't be very useful and unnecessarily exposes internal details,
> so I think F_COPIED in cqe->flags is a better option.

Ok, that's better than nothing.

> It's a good question though whether we need more versatile reporting and
> how to do it right. Probably should be optional and not a part of IO path,
> e.g. send(MSG_PROBE) / ioctl / proc stats / etc.
> 
>> I plan to test SENDMSG_ZC with Samba next week.
> 
> Awesome

Currently I'm fighting with converting samba's libtevent to
use POLL_ADD and io_uring_submit_and_wait_timeout(),
which turns out to be much more complex than it should be.
And IORING_POLL_ADD_LEVEL doesn't seem to work at all...
But I'll write a separate mail about my findings in that area...

metze


* Re: [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements
  2022-09-21 12:18       ` Stefan Metzmacher
@ 2022-09-21 12:58         ` Pavel Begunkov
  0 siblings, 0 replies; 17+ messages in thread
From: Pavel Begunkov @ 2022-09-21 12:58 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring, axboe

On 9/21/22 13:18, Stefan Metzmacher wrote:
> 
> Hi Pavel,
> 
>>>> If network sends anything it should return how many bytes
>>>> it queued for sending, otherwise there would be duplicated
>>>> packets / data on the other endpoint in userspace, and I
>>>> don't think any driver / lower layer would keep memory
>>>> after returning an error.
>>>
>>> As I'm also working on a socket driver for smbdirect,
>>> I already thought about how I could hook into
>>> IORING_OP_SEND[MSG]_ZC, and for sendmsg I'd have
>>> a loop sending individual fragments, which have a reference,
>>> but if I find a connection drop after the first one, I'd
>>> return ECONNRESET or EPIPE in order to get faster recovery
>>> instead of announcing a short write to the caller.
>>
>> It doesn't sound right to me, but I don't know samba well
>> enough to really have an opinion. In any case, I see how it may be
>> more robust if we always try to push a notification cqe.
>> Will you send a patch?
> 
> You mean the IORING_OP_SEND_ZC should always
> issue a NOTIF cqe, once it passed the io_sendzc_prep stage?

Currently we add F_MORE iff cqe->res >= 0, but I want to
decouple them so users don't try to infer one from another.

In theory, it's already a subset of this, but it sounds
like a good idea to always emit notifications (when we can)
to stop users from making assumptions about it. It should
also be cleaner.

>>> If we would take my 5/5 we could also have a different
>>> strategy to decide if MORE/NOTIF is needed.
>>> If notif->cqe.res is still 0 and io_notif_flush drops
>>> the last reference we could go without MORE/NOTIF at all.
>>> In all other cases we'd either set MORE/NOTIF at the end
>>> of io_sendzc or in the fail hook.
>>
>> I had a similar optimisation, i.e. when io_notif_flush() in
>> the submission path is dropping the last ref, but killed it
>> as it was completely useless, I haven't hit this path even
>> once even with UDP, not to mention TCP.
> 
> If I remember correctly I hit it all the time on loopback,
> but I'd have to recheck.

Yeah, I meant real network in particular. I've seen it with
virtual interfaces, but for instance loopback is not interesting
as it doesn't support zerocopy in the first place. You were
probably testing with a modified skb_orphan_frags_rx().

>>>> In any case, I was looking on a bit different problem, but
>>>> it should look much cleaner using the same approach, see
>>>> branch [1], and patch [3] for sendzc in particular.
>>>>
>>>> [1] https://github.com/isilence/linux.git partial-fail
>>>> [2] https://github.com/isilence/linux/tree/io_uring/partial-fail
>>>> [3] https://github.com/isilence/linux/commit/acb4f9bf869e1c2542849e11d992a63d95f2b894
>>>
>>>      const struct io_op_def *def = &io_op_defs[req->opcode];
>>>
>>>      req_set_fail(req);
>>>      io_req_set_res(req, res, io_put_kbuf(req, IO_URING_F_UNLOCKED));
>>>      if (def->fail)
>>>          def->fail(req);
>>>      io_req_complete_post(req);
>>>
>>> Will lose req->cqe.flags, but the fail hook in general looks like a good idea.
>>
>> I just don't like those sporadic changes all across core io_uring
>> code also adding some overhead.
>>
>>> And don't we care about the other failure cases where req->cqe.flags gets overwritten?
>>
>> We don't usually carry them around ->issue handler boundaries,
>> e.g. directly do io_post_aux_cqe(res, IORING_CQE_F_MORE);
>>
>> IORING_CQE_F_BUFFER is a bit trickier, but there is
>> special handling for this one and it wouldn't fit "set cflags
>> in advance" logic anyway.
>>
>> iow, ->fail callback sounds good enough for now, we'll change
>> it later if needed.
> 
> The fail hook should re-add the MORE flag?

Never set CQE_SKIP for notifications and add MORE flag in the
fail hook, sounds good.


> So I'll try to do the following changes:
> 
> 1. take your ->fail() patches
> 
> 2. once io_sendzc_prep() is over always trigger MORE/NOTIF cqes
>     (But the documentation should still make it clear that
>      userspace has to cope with just a single cqe (without MORE)
>      for both success and failure, so that we can improve things later)

I've sent a liburing man patch, should be good enough.

> 3. Can I change the cqe.res of the NOTIF cqe to be 0xffffffff ?
>     That would indicate to userspace that we don't give any information
>     if zero copy was actually used or not. This would prevent someone
>     from relying on cqe.res == 0 and we can improve it by providing
>     more useful values in future.

I don't get the difference; someone may just as easily rely on
0xf..ff. What does it give us? 0 usually suits better as a default.

> Are you ok with that plan for 6.0?

It's late, so 1-2 are better off going into 6.1 + stable. That doesn't
break uapi, so it's fine. It's not even strictly necessary to backport
it (but still a good idea to do it).

-- 
Pavel Begunkov

end of thread, other threads:[~2022-09-21 13:01 UTC | newest]

Thread overview: 17+ messages
2022-09-16 21:36 [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Stefan Metzmacher
2022-09-16 21:36 ` [PATCH 1/5] io_uring/opdef: rename SENDZC_NOTIF to SEND_ZC Stefan Metzmacher
2022-09-17  9:17   ` Pavel Begunkov
2022-09-16 21:36 ` [PATCH 2/5] io_uring/core: move io_cqe->fd over from io_cqe->flags to io_cqe->res Stefan Metzmacher
2022-09-16 21:36 ` [PATCH 3/5] io_uring/core: keep req->cqe.flags on generic errors Stefan Metzmacher
2022-09-16 21:36 ` [PATCH 4/5] io_uring/net: let io_sendzc set IORING_CQE_F_MORE before sock_sendmsg() Stefan Metzmacher
2022-09-16 21:36 ` [PATCH 5/5] io_uring/notif: let userspace know how effective the zero copy usage was Stefan Metzmacher
2022-09-17  9:22   ` Pavel Begunkov
2022-09-17 10:24     ` Stefan Metzmacher
2022-09-21 12:04       ` Pavel Begunkov
2022-09-21 12:33         ` Stefan Metzmacher
2022-09-17  9:16 ` [PATCH for-6.0 0/5] IORING_OP_SEND_ZC improvements Pavel Begunkov
2022-09-17 10:44   ` Stefan Metzmacher
2022-09-21 11:39     ` Pavel Begunkov
2022-09-21 12:18       ` Stefan Metzmacher
2022-09-21 12:58         ` Pavel Begunkov
2022-09-18 22:49 ` (subset) " Jens Axboe
