All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/2] io_uring zc notification tag override
@ 2022-08-16  7:41 Pavel Begunkov
  2022-08-16  7:42 ` [RFC 1/2] io_uring/notif: change notif CQE uapi format Pavel Begunkov
  2022-08-16  7:42 ` [RFC 2/2] io_uring/net: allow to override notification tag Pavel Begunkov
  0 siblings, 2 replies; 14+ messages in thread
From: Pavel Begunkov @ 2022-08-16  7:41 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Dylan Yudaken, Stefan Metzmacher

Following up user feedback, we'd better to have a way to set a notification
tag on the fly comparing to statically at the registration time at it's
currently implemented. Add a feature to allow to copy send_sqe->user_data
into the notification it's flushing.

With 1/2 we can also add a cqe flag for when it fallback from zerocopy
to normal copy execution.

A git branch for convenience:
https://github.com/isilence/linux.git zc-override-tag

https://github.com/isilence/linux/tree/net/zc-override-tag

Pavel Begunkov (2):
  io_uring/notif: change notif CQE uapi format
  io_uring/net: allow to override notification tag

 include/uapi/linux/io_uring.h | 10 ++++++++++
 io_uring/net.c                |  6 +++++-
 io_uring/notif.c              |  4 ++--
 3 files changed, 17 insertions(+), 3 deletions(-)

-- 
2.37.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC 1/2] io_uring/notif: change notif CQE uapi format
  2022-08-16  7:41 [RFC 0/2] io_uring zc notification tag override Pavel Begunkov
@ 2022-08-16  7:42 ` Pavel Begunkov
  2022-08-16  8:14   ` Stefan Metzmacher
  2022-08-16  7:42 ` [RFC 2/2] io_uring/net: allow to override notification tag Pavel Begunkov
  1 sibling, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2022-08-16  7:42 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Dylan Yudaken, Stefan Metzmacher

Change the notification CQE layout while we can, put the seq number into
cqe->res so we can cqe->flags to mark notification CQEs with
IORING_CQE_F_NOTIF and add other flags in the future if needed. This
will be needed to distinguish notifications from send completions when
they use the same user_data.

Also, limit the sequence number to u16 and reserve upper 16 bits for the
future. We also want it to mask out the sign bit for userspace
convenience as it's easier to test for (cqe->res < 0).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h | 6 ++++++
 io_uring/notif.c              | 4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1463cfecb56b..20368394870e 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -286,6 +286,9 @@ enum io_uring_op {
 #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
 #define IORING_RECVSEND_NOTIF_FLUSH	(1U << 3)
 
+/* cqe->res mask for extracting the notification sequence number */
+#define IORING_NOTIF_SEQ_MASK		0xFFFFU
+
 /*
  * accept flags stored in sqe->ioprio
  */
@@ -337,10 +340,13 @@ struct io_uring_cqe {
  * IORING_CQE_F_BUFFER	If set, the upper 16 bits are the buffer ID
  * IORING_CQE_F_MORE	If set, parent SQE will generate more CQE entries
  * IORING_CQE_F_SOCK_NONEMPTY	If set, more data to read after socket recv
+ * IORING_CQE_F_NOTIF	Set for notification CQEs. Can be used to distinct
+ *			them from sends.
  */
 #define IORING_CQE_F_BUFFER		(1U << 0)
 #define IORING_CQE_F_MORE		(1U << 1)
 #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
+#define IORING_CQE_F_NOTIF		(1U << 3)
 
 enum {
 	IORING_CQE_BUFFER_SHIFT		= 16,
diff --git a/io_uring/notif.c b/io_uring/notif.c
index 714715678817..6e17d1ae5a0d 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -60,8 +60,8 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
 	notif->rsrc_node = NULL;
 	io_req_set_rsrc_node(notif, ctx, 0);
 	notif->cqe.user_data = slot->tag;
-	notif->cqe.flags = slot->seq++;
-	notif->cqe.res = 0;
+	notif->cqe.flags = IORING_CQE_F_NOTIF;
+	notif->cqe.res = slot->seq++ & IORING_NOTIF_SEQ_MASK;
 
 	nd = io_notif_to_data(notif);
 	nd->account_pages = 0;
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-16  7:41 [RFC 0/2] io_uring zc notification tag override Pavel Begunkov
  2022-08-16  7:42 ` [RFC 1/2] io_uring/notif: change notif CQE uapi format Pavel Begunkov
@ 2022-08-16  7:42 ` Pavel Begunkov
  2022-08-16  8:23   ` Stefan Metzmacher
  2022-08-16  8:37   ` Dylan Yudaken
  1 sibling, 2 replies; 14+ messages in thread
From: Pavel Begunkov @ 2022-08-16  7:42 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe, asml.silence, Dylan Yudaken, Stefan Metzmacher

Considering limited amount of slots some users struggle with
registration time notification tag assignment as it's hard to manage
notifications using sequence numbers. Add a simple feature that copies
sqe->user_data of a send(+flush) request into the notification CQE it
flushes (and only when it's flushes).

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h | 4 ++++
 io_uring/net.c                | 6 +++++-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 20368394870e..91e7944c9c78 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -280,11 +280,15 @@ enum io_uring_op {
  *
  * IORING_RECVSEND_NOTIF_FLUSH	Flush a notification after a successful
  *				successful. Only for zerocopy sends.
+ *
+ * IORING_RECVSEND_NOTIF_COPY_TAG Copy request's user_data into the notification
+ *				  completion even if it's flushed.
  */
 #define IORING_RECVSEND_POLL_FIRST	(1U << 0)
 #define IORING_RECV_MULTISHOT		(1U << 1)
 #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
 #define IORING_RECVSEND_NOTIF_FLUSH	(1U << 3)
+#define IORING_RECVSEND_NOTIF_COPY_TAG	(1U << 4)
 
 /* cqe->res mask for extracting the notification sequence number */
 #define IORING_NOTIF_SEQ_MASK		0xFFFFU
diff --git a/io_uring/net.c b/io_uring/net.c
index bd3fad9536ef..4d271a269979 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -858,7 +858,9 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 
 	zc->flags = READ_ONCE(sqe->ioprio);
 	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
-			  IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
+			  IORING_RECVSEND_FIXED_BUF |
+			  IORING_RECVSEND_NOTIF_FLUSH |
+			  IORING_RECVSEND_NOTIF_COPY_TAG))
 		return -EINVAL;
 	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
 		unsigned idx = READ_ONCE(sqe->buf_index);
@@ -1024,6 +1026,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 	} else if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH) {
+		if (zc->flags & IORING_RECVSEND_NOTIF_COPY_TAG)
+			notif->cqe.user_data = req->cqe.user_data;
 		io_notif_slot_flush_submit(notif_slot, 0);
 	}
 
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC 1/2] io_uring/notif: change notif CQE uapi format
  2022-08-16  7:42 ` [RFC 1/2] io_uring/notif: change notif CQE uapi format Pavel Begunkov
@ 2022-08-16  8:14   ` Stefan Metzmacher
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Metzmacher @ 2022-08-16  8:14 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jens Axboe, Dylan Yudaken

Am 16.08.22 um 09:42 schrieb Pavel Begunkov:
> Change the notification CQE layout while we can, put the seq number into
> cqe->res so we can cqe->flags to mark notification CQEs with
> IORING_CQE_F_NOTIF and add other flags in the future if needed. This
> will be needed to distinguish notifications from send completions when
> they use the same user_data.
> 
> Also, limit the sequence number to u16 and reserve upper 16 bits for the
> future. We also want it to mask out the sign bit for userspace
> convenience as it's easier to test for (cqe->res < 0).
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>   include/uapi/linux/io_uring.h | 6 ++++++
>   io_uring/notif.c              | 4 ++--
>   2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 1463cfecb56b..20368394870e 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -286,6 +286,9 @@ enum io_uring_op {
>   #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
>   #define IORING_RECVSEND_NOTIF_FLUSH	(1U << 3)
>   
> +/* cqe->res mask for extracting the notification sequence number */
> +#define IORING_NOTIF_SEQ_MASK		0xFFFFU
> +
>   /*
>    * accept flags stored in sqe->ioprio
>    */
> @@ -337,10 +340,13 @@ struct io_uring_cqe {
>    * IORING_CQE_F_BUFFER	If set, the upper 16 bits are the buffer ID
>    * IORING_CQE_F_MORE	If set, parent SQE will generate more CQE entries
>    * IORING_CQE_F_SOCK_NONEMPTY	If set, more data to read after socket recv
> + * IORING_CQE_F_NOTIF	Set for notification CQEs. Can be used to distinct
> + *			them from sends.
>    */
>   #define IORING_CQE_F_BUFFER		(1U << 0)
>   #define IORING_CQE_F_MORE		(1U << 1)
>   #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
> +#define IORING_CQE_F_NOTIF		(1U << 3)
>   
>   enum {
>   	IORING_CQE_BUFFER_SHIFT		= 16,
> diff --git a/io_uring/notif.c b/io_uring/notif.c
> index 714715678817..6e17d1ae5a0d 100644
> --- a/io_uring/notif.c
> +++ b/io_uring/notif.c
> @@ -60,8 +60,8 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>   	notif->rsrc_node = NULL;
>   	io_req_set_rsrc_node(notif, ctx, 0);
>   	notif->cqe.user_data = slot->tag;
> -	notif->cqe.flags = slot->seq++;
> -	notif->cqe.res = 0;
> +	notif->cqe.flags = IORING_CQE_F_NOTIF;
> +	notif->cqe.res = slot->seq++ & IORING_NOTIF_SEQ_MASK;
>   
>   	nd = io_notif_to_data(notif);
>   	nd->account_pages = 0;

This looks good.

metze

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-16  7:42 ` [RFC 2/2] io_uring/net: allow to override notification tag Pavel Begunkov
@ 2022-08-16  8:23   ` Stefan Metzmacher
  2022-08-17 12:42     ` Pavel Begunkov
  2022-08-16  8:37   ` Dylan Yudaken
  1 sibling, 1 reply; 14+ messages in thread
From: Stefan Metzmacher @ 2022-08-16  8:23 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jens Axboe, Dylan Yudaken

Am 16.08.22 um 09:42 schrieb Pavel Begunkov:
> Considering limited amount of slots some users struggle with
> registration time notification tag assignment as it's hard to manage
> notifications using sequence numbers. Add a simple feature that copies
> sqe->user_data of a send(+flush) request into the notification CQE it
> flushes (and only when it's flushes).
> 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>   include/uapi/linux/io_uring.h | 4 ++++
>   io_uring/net.c                | 6 +++++-
>   2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 20368394870e..91e7944c9c78 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -280,11 +280,15 @@ enum io_uring_op {
>    *
>    * IORING_RECVSEND_NOTIF_FLUSH	Flush a notification after a successful
>    *				successful. Only for zerocopy sends.
> + *
> + * IORING_RECVSEND_NOTIF_COPY_TAG Copy request's user_data into the notification
> + *				  completion even if it's flushed.
>    */
>   #define IORING_RECVSEND_POLL_FIRST	(1U << 0)
>   #define IORING_RECV_MULTISHOT		(1U << 1)
>   #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
>   #define IORING_RECVSEND_NOTIF_FLUSH	(1U << 3)
> +#define IORING_RECVSEND_NOTIF_COPY_TAG	(1U << 4)
>   
>   /* cqe->res mask for extracting the notification sequence number */
>   #define IORING_NOTIF_SEQ_MASK		0xFFFFU
> diff --git a/io_uring/net.c b/io_uring/net.c
> index bd3fad9536ef..4d271a269979 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -858,7 +858,9 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>   
>   	zc->flags = READ_ONCE(sqe->ioprio);
>   	if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
> -			  IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
> +			  IORING_RECVSEND_FIXED_BUF |
> +			  IORING_RECVSEND_NOTIF_FLUSH |
> +			  IORING_RECVSEND_NOTIF_COPY_TAG))
>   		return -EINVAL;
>   	if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
>   		unsigned idx = READ_ONCE(sqe->buf_index);
> @@ -1024,6 +1026,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
>   		if (ret == -ERESTARTSYS)
>   			ret = -EINTR;
>   	} else if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH) {
> +		if (zc->flags & IORING_RECVSEND_NOTIF_COPY_TAG)
> +			notif->cqe.user_data = req->cqe.user_data;
>   		io_notif_slot_flush_submit(notif_slot, 0);
>   	}

This would work but it seems to be confusing.

Can't we have a slot-less mode, with slot_idx==U16_MAX,
where we always allocate a new notif for each request,
this would then get the same user_data and would be referenced on the
request in order to reuse the same notif on an async retry after a short send.
And this notif will always be flushed at the end of the request.

This:

struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
                                 struct io_notif_slot *slot)

would change to:

struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
                                 __u64 cqe_user_data,
				__s32 cqe_res)


and:

void io_notif_slot_flush(struct io_notif_slot *slot) __must_hold(&ctx->uring_lock)

(__must_hold looks wrong there...)

could just be:

void io_notif_flush(struct io_notif_*notif)

What do you think? It would remove the whole notif slot complexity
from caller using IORING_RECVSEND_NOTIF_FLUSH for every request anyway.

metze

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-16  7:42 ` [RFC 2/2] io_uring/net: allow to override notification tag Pavel Begunkov
  2022-08-16  8:23   ` Stefan Metzmacher
@ 2022-08-16  8:37   ` Dylan Yudaken
  2022-08-17 10:48     ` Pavel Begunkov
  1 sibling, 1 reply; 14+ messages in thread
From: Dylan Yudaken @ 2022-08-16  8:37 UTC (permalink / raw)
  To: asml.silence, io-uring; +Cc: axboe, metze

On Tue, 2022-08-16 at 08:42 +0100, Pavel Begunkov wrote:
> Considering limited amount of slots some users struggle with
> registration time notification tag assignment as it's hard to manage
> notifications using sequence numbers. Add a simple feature that
> copies
> sqe->user_data of a send(+flush) request into the notification CQE it
> flushes (and only when it's flushes).

I think for this to be useful I think it would also be needed to have
flags on the generated CQE.

If there are more CQEs coming for the same request it should have
IORING_CQE_F_MORE set. Otherwise user space would not be able to know
if it is able to reuse local data.

Additionally it would need to provide a way of disambiguating the send
CQE with the flush CQE.

Dylan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-16  8:37   ` Dylan Yudaken
@ 2022-08-17 10:48     ` Pavel Begunkov
  2022-08-17 12:04       ` Dylan Yudaken
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2022-08-17 10:48 UTC (permalink / raw)
  To: Dylan Yudaken, io-uring; +Cc: axboe, metze

On 8/16/22 09:37, Dylan Yudaken wrote:
> On Tue, 2022-08-16 at 08:42 +0100, Pavel Begunkov wrote:
>> Considering limited amount of slots some users struggle with
>> registration time notification tag assignment as it's hard to manage
>> notifications using sequence numbers. Add a simple feature that
>> copies
>> sqe->user_data of a send(+flush) request into the notification CQE it
>> flushes (and only when it's flushes).
> 
> I think for this to be useful I think it would also be needed to have
> flags on the generated CQE.
> 
> If there are more CQEs coming for the same request it should have
> IORING_CQE_F_MORE set. Otherwise user space would not be able to know
> if it is able to reuse local data.

If you want to have:

expect_more = cqe->flags & IORING_CQE_F_MORE;

Then in the current form you can perfectly do that with

// MSG_WAITALL
expect_more = (cqe->res == io_len);
// !MSG_WAITALL,
expect_more = (cqe->res >= 0);

But might be more convenient to have IORING_CQE_F_MORE set,
one problem is a slight change of (implicit) semantics, i.e.
we don't execute linked requests when filling a IORING_CQE_F_MORE
CQE + CQE ordering implied from that.

It's maybe worth to not rely on the link failing concept for
deciding whether to flush or not.


> Additionally it would need to provide a way of disambiguating the send
> CQE with the flush CQE.

Do you mean like IORING_CQE_F_NOTIF from 1/2?

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-17 10:48     ` Pavel Begunkov
@ 2022-08-17 12:04       ` Dylan Yudaken
  2022-08-17 12:44         ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Dylan Yudaken @ 2022-08-17 12:04 UTC (permalink / raw)
  To: asml.silence, io-uring; +Cc: axboe, metze

On Wed, 2022-08-17 at 11:48 +0100, Pavel Begunkov wrote:
> On 8/16/22 09:37, Dylan Yudaken wrote:
> > On Tue, 2022-08-16 at 08:42 +0100, Pavel Begunkov wrote:
> > > Considering limited amount of slots some users struggle with
> > > registration time notification tag assignment as it's hard to
> > > manage
> > > notifications using sequence numbers. Add a simple feature that
> > > copies
> > > sqe->user_data of a send(+flush) request into the notification
> > > CQE it
> > > flushes (and only when it's flushes).
> > 
> > I think for this to be useful I think it would also be needed to
> > have
> > flags on the generated CQE.
> > 
> > If there are more CQEs coming for the same request it should have
> > IORING_CQE_F_MORE set. Otherwise user space would not be able to
> > know
> > if it is able to reuse local data.
> 
> If you want to have:
> 
> expect_more = cqe->flags & IORING_CQE_F_MORE;
> 
> Then in the current form you can perfectly do that with
> 
> // MSG_WAITALL
> expect_more = (cqe->res == io_len);
> // !MSG_WAITALL,
> expect_more = (cqe->res >= 0);
> 
> But might be more convenient to have IORING_CQE_F_MORE set,
> one problem is a slight change of (implicit) semantics, i.e.
> we don't execute linked requests when filling a IORING_CQE_F_MORE
> CQE + CQE ordering implied from that.
> 
> It's maybe worth to not rely on the link failing concept for
> deciding whether to flush or not.

Is the ordering guaranteed then to be <send cqe>, <notif cqe>?
If so I would put the IORING_CQE_F_MORE more as a nice to have for
consistency with other ops

> 
> 
> > Additionally it would need to provide a way of disambiguating the
> > send
> > CQE with the flush CQE.
> 
> Do you mean like IORING_CQE_F_NOTIF from 1/2?
> 

Apologies - I missed that 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-16  8:23   ` Stefan Metzmacher
@ 2022-08-17 12:42     ` Pavel Begunkov
  2022-08-18 18:13       ` Stefan Metzmacher
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2022-08-17 12:42 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring; +Cc: Jens Axboe, Dylan Yudaken

On 8/16/22 09:23, Stefan Metzmacher wrote:
> Am 16.08.22 um 09:42 schrieb Pavel Begunkov:
>> Considering limited amount of slots some users struggle with
>> registration time notification tag assignment as it's hard to manage
>> notifications using sequence numbers. Add a simple feature that copies
>> sqe->user_data of a send(+flush) request into the notification CQE it
>> flushes (and only when it's flushes).
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>   include/uapi/linux/io_uring.h | 4 ++++
>>   io_uring/net.c                | 6 +++++-
>>   2 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 20368394870e..91e7944c9c78 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -280,11 +280,15 @@ enum io_uring_op {
>>    *
>>    * IORING_RECVSEND_NOTIF_FLUSH    Flush a notification after a successful
>>    *                successful. Only for zerocopy sends.
>> + *
>> + * IORING_RECVSEND_NOTIF_COPY_TAG Copy request's user_data into the notification
>> + *                  completion even if it's flushed.
>>    */
>>   #define IORING_RECVSEND_POLL_FIRST    (1U << 0)
>>   #define IORING_RECV_MULTISHOT        (1U << 1)
>>   #define IORING_RECVSEND_FIXED_BUF    (1U << 2)
>>   #define IORING_RECVSEND_NOTIF_FLUSH    (1U << 3)
>> +#define IORING_RECVSEND_NOTIF_COPY_TAG    (1U << 4)
>>   /* cqe->res mask for extracting the notification sequence number */
>>   #define IORING_NOTIF_SEQ_MASK        0xFFFFU
>> diff --git a/io_uring/net.c b/io_uring/net.c
>> index bd3fad9536ef..4d271a269979 100644
>> --- a/io_uring/net.c
>> +++ b/io_uring/net.c
>> @@ -858,7 +858,9 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>       zc->flags = READ_ONCE(sqe->ioprio);
>>       if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
>> -              IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
>> +              IORING_RECVSEND_FIXED_BUF |
>> +              IORING_RECVSEND_NOTIF_FLUSH |
>> +              IORING_RECVSEND_NOTIF_COPY_TAG))
>>           return -EINVAL;
>>       if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
>>           unsigned idx = READ_ONCE(sqe->buf_index);
>> @@ -1024,6 +1026,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
>>           if (ret == -ERESTARTSYS)
>>               ret = -EINTR;
>>       } else if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH) {
>> +        if (zc->flags & IORING_RECVSEND_NOTIF_COPY_TAG)
>> +            notif->cqe.user_data = req->cqe.user_data;
>>           io_notif_slot_flush_submit(notif_slot, 0);
>>       }
> 
> This would work but it seems to be confusing.
> 
> Can't we have a slot-less mode, with slot_idx==U16_MAX,
> where we always allocate a new notif for each request,
> this would then get the same user_data and would be referenced on the
> request in order to reuse the same notif on an async retry after a short send.

Ok, retries may make slots managing much harder, let me think

> And this notif will always be flushed at the end of the request.
> 
> This:
> 
> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>                                  struct io_notif_slot *slot)
> 
> would change to:
> 
> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>                                  __u64 cqe_user_data,
>                  __s32 cqe_res)
> 
> 
> and:
> 
> void io_notif_slot_flush(struct io_notif_slot *slot) __must_hold(&ctx->uring_lock)
> 
> (__must_hold looks wrong there...)

Nope, it should be there

> could just be:
> 
> void io_notif_flush(struct io_notif_*notif)
> 
> What do you think? It would remove the whole notif slot complexity
> from caller using IORING_RECVSEND_NOTIF_FLUSH for every request anyway.

The downside is that requests then should be pretty large or it'll
lose in performance. Surely not a problem for 8MB per request but
even 4KB won't suffice. And users may want to put in smaller chunks
on the wire instead of waiting for mode data to let tcp handle
pacing and potentially improve latencies by sending earlier.

On the other hand that one notification per request idea mentioned
before can extended to 1-2 CQEs per request, which is interestingly
the approach zc send discussions started with.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-17 12:04       ` Dylan Yudaken
@ 2022-08-17 12:44         ` Pavel Begunkov
  0 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2022-08-17 12:44 UTC (permalink / raw)
  To: Dylan Yudaken, io-uring; +Cc: axboe, metze

On 8/17/22 13:04, Dylan Yudaken wrote:
> On Wed, 2022-08-17 at 11:48 +0100, Pavel Begunkov wrote:
>> On 8/16/22 09:37, Dylan Yudaken wrote:
>>> On Tue, 2022-08-16 at 08:42 +0100, Pavel Begunkov wrote:
>>>> Considering limited amount of slots some users struggle with
>>>> registration time notification tag assignment as it's hard to
>>>> manage
>>>> notifications using sequence numbers. Add a simple feature that
>>>> copies
>>>> sqe->user_data of a send(+flush) request into the notification
>>>> CQE it
>>>> flushes (and only when it's flushes).
>>>
>>> I think for this to be useful I think it would also be needed to
>>> have
>>> flags on the generated CQE.
>>>
>>> If there are more CQEs coming for the same request it should have
>>> IORING_CQE_F_MORE set. Otherwise user space would not be able to
>>> know
>>> if it is able to reuse local data.
>>
>> If you want to have:
>>
>> expect_more = cqe->flags & IORING_CQE_F_MORE;
>>
>> Then in the current form you can perfectly do that with
>>
>> // MSG_WAITALL
>> expect_more = (cqe->res == io_len);
>> // !MSG_WAITALL,
>> expect_more = (cqe->res >= 0);
>>
>> But might be more convenient to have IORING_CQE_F_MORE set,
>> one problem is a slight change of (implicit) semantics, i.e.
>> we don't execute linked requests when filling a IORING_CQE_F_MORE
>> CQE + CQE ordering implied from that.
>>
>> It's maybe worth to not rely on the link failing concept for
>> deciding whether to flush or not.
> 
> Is the ordering guaranteed then to be <send cqe>, <notif cqe>?

Not yet, need to send this patch

https://github.com/isilence/linux/commit/9a1464905be3fc0cee4f68b01e43c5ad14a05b06

> If so I would put the IORING_CQE_F_MORE more as a nice to have for
> consistency with other ops
> 
>>
>>
>>> Additionally it would need to provide a way of disambiguating the
>>> send
>>> CQE with the flush CQE.
>>
>> Do you mean like IORING_CQE_F_NOTIF from 1/2?
>>
> 
> Apologies - I missed that
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-17 12:42     ` Pavel Begunkov
@ 2022-08-18 18:13       ` Stefan Metzmacher
  2022-08-19 11:42         ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Metzmacher @ 2022-08-18 18:13 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jens Axboe, Dylan Yudaken


Am 17.08.22 um 14:42 schrieb Pavel Begunkov:
> On 8/16/22 09:23, Stefan Metzmacher wrote:
>> Am 16.08.22 um 09:42 schrieb Pavel Begunkov:
>>> Considering limited amount of slots some users struggle with
>>> registration time notification tag assignment as it's hard to manage
>>> notifications using sequence numbers. Add a simple feature that copies
>>> sqe->user_data of a send(+flush) request into the notification CQE it
>>> flushes (and only when it's flushes).
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>>>   include/uapi/linux/io_uring.h | 4 ++++
>>>   io_uring/net.c                | 6 +++++-
>>>   2 files changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index 20368394870e..91e7944c9c78 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -280,11 +280,15 @@ enum io_uring_op {
>>>    *
>>>    * IORING_RECVSEND_NOTIF_FLUSH    Flush a notification after a successful
>>>    *                successful. Only for zerocopy sends.
>>> + *
>>> + * IORING_RECVSEND_NOTIF_COPY_TAG Copy request's user_data into the notification
>>> + *                  completion even if it's flushed.
>>>    */
>>>   #define IORING_RECVSEND_POLL_FIRST    (1U << 0)
>>>   #define IORING_RECV_MULTISHOT        (1U << 1)
>>>   #define IORING_RECVSEND_FIXED_BUF    (1U << 2)
>>>   #define IORING_RECVSEND_NOTIF_FLUSH    (1U << 3)
>>> +#define IORING_RECVSEND_NOTIF_COPY_TAG    (1U << 4)
>>>   /* cqe->res mask for extracting the notification sequence number */
>>>   #define IORING_NOTIF_SEQ_MASK        0xFFFFU
>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>> index bd3fad9536ef..4d271a269979 100644
>>> --- a/io_uring/net.c
>>> +++ b/io_uring/net.c
>>> @@ -858,7 +858,9 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>       zc->flags = READ_ONCE(sqe->ioprio);
>>>       if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
>>> -              IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
>>> +              IORING_RECVSEND_FIXED_BUF |
>>> +              IORING_RECVSEND_NOTIF_FLUSH |
>>> +              IORING_RECVSEND_NOTIF_COPY_TAG))
>>>           return -EINVAL;
>>>       if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
>>>           unsigned idx = READ_ONCE(sqe->buf_index);
>>> @@ -1024,6 +1026,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
>>>           if (ret == -ERESTARTSYS)
>>>               ret = -EINTR;
>>>       } else if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH) {
>>> +        if (zc->flags & IORING_RECVSEND_NOTIF_COPY_TAG)
>>> +            notif->cqe.user_data = req->cqe.user_data;
>>>           io_notif_slot_flush_submit(notif_slot, 0);
>>>       }
>>
>> This would work but it seems to be confusing.
>>
>> Can't we have a slot-less mode, with slot_idx==U16_MAX,
>> where we always allocate a new notif for each request,
>> this would then get the same user_data and would be referenced on the
>> request in order to reuse the same notif on an async retry after a short send.
> 
> Ok, retries may make slots managing much harder, let me think

With retries it would be much saner to use the same
notif for the whole request. So keeping it referenced
as zc->notif might be a way, maybe doing that in the _prep
function in order to do it just once, then we don't need
zc->slot_idx anymore.

>> And this notif will always be flushed at the end of the request.
>>
>> This:
>>
>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>                                  struct io_notif_slot *slot)
>>
>> would change to:
>>
>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>                                  __u64 cqe_user_data,
>>                  __s32 cqe_res)
>>
>>
>> and:
>>
>> void io_notif_slot_flush(struct io_notif_slot *slot) __must_hold(&ctx->uring_lock)
>>
>> (__must_hold looks wrong there...)
> 
> Nope, it should be there

Shouldn't it be something like
__must_hold(&slot->notif->ctx->uring_lock)

There is not 'ctx' argument.

>> could just be:
>>
>> void io_notif_flush(struct io_notif_*notif)
>>
>> What do you think? It would remove the whole notif slot complexity
>> from caller using IORING_RECVSEND_NOTIF_FLUSH for every request anyway.
> 
> The downside is that requests then should be pretty large or it'll
> lose in performance. Surely not a problem for 8MB per request but
> even 4KB won't suffice. And users may want to put in smaller chunks
> on the wire instead of waiting for mode data to let tcp handle
> pacing and potentially improve latencies by sending earlier.

If this is optional applications can decide what fits better.

> On the other hand that one notification per request idea mentioned
> before can extended to 1-2 CQEs per request, which is interestingly
> the approach zc send discussions started with.

In order to make use of any of this I need any way
to get 2 CQEs with user_data being the same or related.

The only benefit for with slots is being able to avoid or
batch additional CQEs, correct? Or is there more to it?

metze

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-18 18:13       ` Stefan Metzmacher
@ 2022-08-19 11:42         ` Pavel Begunkov
  2022-08-19 12:36           ` Stefan Metzmacher
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2022-08-19 11:42 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring; +Cc: Jens Axboe, Dylan Yudaken

On 8/18/22 19:13, Stefan Metzmacher wrote:
> Am 17.08.22 um 14:42 schrieb Pavel Begunkov:
>> On 8/16/22 09:23, Stefan Metzmacher wrote:
>>> Am 16.08.22 um 09:42 schrieb Pavel Begunkov:
>>>> Considering limited amount of slots some users struggle with
>>>> registration time notification tag assignment as it's hard to manage
>>>> notifications using sequence numbers. Add a simple feature that copies
>>>> sqe->user_data of a send(+flush) request into the notification CQE it
>>>> flushes (and only when it's flushes).
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>>>   include/uapi/linux/io_uring.h | 4 ++++
>>>>   io_uring/net.c                | 6 +++++-
>>>>   2 files changed, 9 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>> index 20368394870e..91e7944c9c78 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -280,11 +280,15 @@ enum io_uring_op {
>>>>    *
>>>>    * IORING_RECVSEND_NOTIF_FLUSH    Flush a notification after a successful
>>>>    *                successful. Only for zerocopy sends.
>>>> + *
>>>> + * IORING_RECVSEND_NOTIF_COPY_TAG Copy request's user_data into the notification
>>>> + *                  completion even if it's flushed.
>>>>    */
>>>>   #define IORING_RECVSEND_POLL_FIRST    (1U << 0)
>>>>   #define IORING_RECV_MULTISHOT        (1U << 1)
>>>>   #define IORING_RECVSEND_FIXED_BUF    (1U << 2)
>>>>   #define IORING_RECVSEND_NOTIF_FLUSH    (1U << 3)
>>>> +#define IORING_RECVSEND_NOTIF_COPY_TAG    (1U << 4)
>>>>   /* cqe->res mask for extracting the notification sequence number */
>>>>   #define IORING_NOTIF_SEQ_MASK        0xFFFFU
>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>> index bd3fad9536ef..4d271a269979 100644
>>>> --- a/io_uring/net.c
>>>> +++ b/io_uring/net.c
>>>> @@ -858,7 +858,9 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>       zc->flags = READ_ONCE(sqe->ioprio);
>>>>       if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
>>>> -              IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
>>>> +              IORING_RECVSEND_FIXED_BUF |
>>>> +              IORING_RECVSEND_NOTIF_FLUSH |
>>>> +              IORING_RECVSEND_NOTIF_COPY_TAG))
>>>>           return -EINVAL;
>>>>       if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
>>>>           unsigned idx = READ_ONCE(sqe->buf_index);
>>>> @@ -1024,6 +1026,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
>>>>           if (ret == -ERESTARTSYS)
>>>>               ret = -EINTR;
>>>>       } else if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH) {
>>>> +        if (zc->flags & IORING_RECVSEND_NOTIF_COPY_TAG)
>>>> +            notif->cqe.user_data = req->cqe.user_data;
>>>>           io_notif_slot_flush_submit(notif_slot, 0);
>>>>       }
>>>
>>> This would work but it seems to be confusing.
>>>
>>> Can't we have a slot-less mode, with slot_idx==U16_MAX,
>>> where we always allocate a new notif for each request,
>>> this would then get the same user_data and would be referenced on the
>>> request in order to reuse the same notif on an async retry after a short send.
>>
>> Ok, retries may make slots managing much harder, let me think
> 
> With retries it would be much saner to use the same
> notif for the whole request. So keeping it referenced
> as zc->notif might be a way, maybe doing that in the _prep
> function in order to do it just once, then we don't need
> zc->slot_idx anymore.

Even though it's possible atm with some userspace consideration,
it's definitely should be patched up.

>>> And this notif will always be flushed at the end of the request.
>>>
>>> This:
>>>
>>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>>                                  struct io_notif_slot *slot)
>>>
>>> would change to:
>>>
>>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>>                                  __u64 cqe_user_data,
>>>                  __s32 cqe_res)
>>>
>>>
>>> and:
>>>
>>> void io_notif_slot_flush(struct io_notif_slot *slot) __must_hold(&ctx->uring_lock)
>>>
>>> (__must_hold looks wrong there...)
>>
>> Nope, it should be there
> 
> Shouldn't it be something like
> __must_hold(&slot->notif->ctx->uring_lock)
> 
> There is not 'ctx' argument.

Ah, in this sense, agree

>>> What do you think? It would remove the whole notif slot complexity
>>> from caller using IORING_RECVSEND_NOTIF_FLUSH for every request anyway.
>>
>> The downside is that requests then should be pretty large or it'll
>> lose in performance. Surely not a problem for 8MB per request but
>> even 4KB won't suffice. And users may want to put in smaller chunks
>> on the wire instead of waiting for mode data to let tcp handle
>> pacing and potentially improve latencies by sending earlier.
> 
> If this is optional applications can decide what fits better.
> 
>> On the other hand that one notification per request idea mentioned
>> before can extended to 1-2 CQEs per request, which is interestingly
>> the approach zc send discussions started with.
> 
> In order to make use of any of this I need any way
> to get 2 CQEs with user_data being the same or related.

The idea described above will post 2 CQEs (mostly) per request
as you want with an optional way to have only 1 CQE. My current
sentiment is to kill all the slot business, leave this 1-2 CQE
per request and see if there are users for whom it won't be
enough. It's anyway just a slight deviation from what I wanted
to push as a complimentary interface.

> The only benefit for with slots is being able to avoid or
> batch additional CQEs, correct? Or is there more to it?

CQE batching is a lesser problem, I'm more concerned of how
it sticks with the network. In short, it'll hugely underperform
with TCP if requests are not large enough.

A simple bench with some hacks, localhost, TCP, run by

./msg_zerocopy -6 -r tcp -s <size> &
./io_uring_zerocopy_tx -6 -D "::1" -s <size> -m <0,2> tcp


non-zerocopy:
4000B:  tx=8711880 (MB=33233), tx/s=1742376 (MB/s=6646)
16000B: tx=3196528 (MB=48775), tx/s=639305 (MB/s=9755)
60000B: tx=1036536 (MB=59311), tx/s=207307 (MB/s=11862)

zerocopy:
4000B:  tx=3003488 (MB=11457), tx/s=600697 (MB/s=2291)
16000B: tx=2940296 (MB=44865), tx/s=588059 (MB/s=8973)
60000B: tx=2621792 (MB=150020), tx/s=524358 (MB/s=30004)

Reusing notifications with slots will change the picture.
And it this has nothing to do with io_uring overhead like
CQE posting and so on.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-19 11:42         ` Pavel Begunkov
@ 2022-08-19 12:36           ` Stefan Metzmacher
  2022-08-22 11:49             ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Metzmacher @ 2022-08-19 12:36 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring; +Cc: Jens Axboe, Dylan Yudaken


Am 19.08.22 um 13:42 schrieb Pavel Begunkov:
> On 8/18/22 19:13, Stefan Metzmacher wrote:
>> Am 17.08.22 um 14:42 schrieb Pavel Begunkov:
>>> On 8/16/22 09:23, Stefan Metzmacher wrote:
>>>> Am 16.08.22 um 09:42 schrieb Pavel Begunkov:
>>>>> Considering limited amount of slots some users struggle with
>>>>> registration time notification tag assignment as it's hard to manage
>>>>> notifications using sequence numbers. Add a simple feature that copies
>>>>> sqe->user_data of a send(+flush) request into the notification CQE it
>>>>> flushes (and only when it's flushes).
>>>>>
>>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>>> ---
>>>>>   include/uapi/linux/io_uring.h | 4 ++++
>>>>>   io_uring/net.c                | 6 +++++-
>>>>>   2 files changed, 9 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>>> index 20368394870e..91e7944c9c78 100644
>>>>> --- a/include/uapi/linux/io_uring.h
>>>>> +++ b/include/uapi/linux/io_uring.h
>>>>> @@ -280,11 +280,15 @@ enum io_uring_op {
>>>>>    *
>>>>>    * IORING_RECVSEND_NOTIF_FLUSH    Flush a notification after a successful
>>>>>    *                successful. Only for zerocopy sends.
>>>>> + *
>>>>> + * IORING_RECVSEND_NOTIF_COPY_TAG Copy request's user_data into the notification
>>>>> + *                  completion even if it's flushed.
>>>>>    */
>>>>>   #define IORING_RECVSEND_POLL_FIRST    (1U << 0)
>>>>>   #define IORING_RECV_MULTISHOT        (1U << 1)
>>>>>   #define IORING_RECVSEND_FIXED_BUF    (1U << 2)
>>>>>   #define IORING_RECVSEND_NOTIF_FLUSH    (1U << 3)
>>>>> +#define IORING_RECVSEND_NOTIF_COPY_TAG    (1U << 4)
>>>>>   /* cqe->res mask for extracting the notification sequence number */
>>>>>   #define IORING_NOTIF_SEQ_MASK        0xFFFFU
>>>>> diff --git a/io_uring/net.c b/io_uring/net.c
>>>>> index bd3fad9536ef..4d271a269979 100644
>>>>> --- a/io_uring/net.c
>>>>> +++ b/io_uring/net.c
>>>>> @@ -858,7 +858,9 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>>>       zc->flags = READ_ONCE(sqe->ioprio);
>>>>>       if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
>>>>> -              IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
>>>>> +              IORING_RECVSEND_FIXED_BUF |
>>>>> +              IORING_RECVSEND_NOTIF_FLUSH |
>>>>> +              IORING_RECVSEND_NOTIF_COPY_TAG))
>>>>>           return -EINVAL;
>>>>>       if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
>>>>>           unsigned idx = READ_ONCE(sqe->buf_index);
>>>>> @@ -1024,6 +1026,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
>>>>>           if (ret == -ERESTARTSYS)
>>>>>               ret = -EINTR;
>>>>>       } else if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH) {
>>>>> +        if (zc->flags & IORING_RECVSEND_NOTIF_COPY_TAG)
>>>>> +            notif->cqe.user_data = req->cqe.user_data;
>>>>>           io_notif_slot_flush_submit(notif_slot, 0);
>>>>>       }
>>>>
>>>> This would work but it seems to be confusing.
>>>>
>>>> Can't we have a slot-less mode, with slot_idx==U16_MAX,
>>>> where we always allocate a new notif for each request,
>>>> this would then get the same user_data and would be referenced on the
>>>> request in order to reuse the same notif on an async retry after a short send.
>>>
>>> Ok, retries may make slots managing much harder, let me think
>>
>> With retries it would be much saner to use the same
>> notif for the whole request. So keeping it referenced
>> as zc->notif might be a way, maybe doing that in the _prep
>> function in order to do it just once, then we don't need
>> zc->slot_idx anymore.
> 
> Even though it's possible atm with some userspace consideration,
> it's definitely should be patched up.
> 
>>>> And this notif will always be flushed at the end of the request.
>>>>
>>>> This:
>>>>
>>>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>>>                                  struct io_notif_slot *slot)
>>>>
>>>> would change to:
>>>>
>>>> struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx,
>>>>                                  __u64 cqe_user_data,
>>>>                  __s32 cqe_res)
>>>>
>>>>
>>>> and:
>>>>
>>>> void io_notif_slot_flush(struct io_notif_slot *slot) __must_hold(&ctx->uring_lock)
>>>>
>>>> (__must_hold looks wrong there...)
>>>
>>> Nope, it should be there
>>
>> Shouldn't it be something like
>> __must_hold(&slot->notif->ctx->uring_lock)
>>
>> There is not 'ctx' argument.
> 
> Ah, in this sense, agree
> 
>>>> What do you think? It would remove the whole notif slot complexity
>>>> from caller using IORING_RECVSEND_NOTIF_FLUSH for every request anyway.
>>>
>>> The downside is that requests then should be pretty large or it'll
>>> lose in performance. Surely not a problem for 8MB per request but
>>> even 4KB won't suffice. And users may want to put in smaller chunks
>>> on the wire instead of waiting for mode data to let tcp handle
>>> pacing and potentially improve latencies by sending earlier.
>>
>> If this is optional applications can decide what fits better.
>>
>>> On the other hand that one notification per request idea mentioned
>>> before can extended to 1-2 CQEs per request, which is interestingly
>>> the approach zc send discussions started with.
>>
>> In order to make use of any of this I need any way
>> to get 2 CQEs with user_data being the same or related.
> 
> The idea described above will post 2 CQEs (mostly) per request
> as you want with an optional way to have only 1 CQE. My current
> sentiment is to kill all the slot business, leave this 1-2 CQE
> per request and see if there are users for whom it won't be
> enough. It's anyway just a slight deviation from what I wanted
> to push as a complimentary interface.

Ah, ok, removing the slot stuff again would be fine for me...

>> The only benefit for with slots is being able to avoid or
>> batch additional CQEs, correct? Or is there more to it?
> 
> CQE batching is a lesser problem, I'm more concerned of how
> it sticks with the network. In short, it'll hugely underperform
> with TCP if requests are not large enough.
> 
> A simple bench with some hacks, localhost, TCP, run by
> 
> ./msg_zerocopy -6 -r tcp -s <size> &
> ./io_uring_zerocopy_tx -6 -D "::1" -s <size> -m <0,2> tcp
> 
> 
> non-zerocopy:
> 4000B:  tx=8711880 (MB=33233), tx/s=1742376 (MB/s=6646)
> 16000B: tx=3196528 (MB=48775), tx/s=639305 (MB/s=9755)
> 60000B: tx=1036536 (MB=59311), tx/s=207307 (MB/s=11862)
> 
> zerocopy:
> 4000B:  tx=3003488 (MB=11457), tx/s=600697 (MB/s=2291)
> 16000B: tx=2940296 (MB=44865), tx/s=588059 (MB/s=8973)
> 60000B: tx=2621792 (MB=150020), tx/s=524358 (MB/s=30004)

So with something between 16k and 60k we reach the point where
ZC starts to be faster, correct?

Did you remove the loopback restriction as described in
Documentation/networking/msg_zerocopy.rst ?

Are the results similar when using ./msg_zerocopy -6 tcp -s <size>
as client?

And the reason is some page pinning overhead from iov_iter_get_pages2()
in __zerocopy_sg_from_iter()?

> Reusing notifications with slots will change the picture.
> And it this has nothing to do with io_uring overhead like
> CQE posting and so on.

Hmm I don't understand how the number of notif structures
would have any impact? Is it related to io_sg_from_iter()?

metze

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC 2/2] io_uring/net: allow to override notification tag
  2022-08-19 12:36           ` Stefan Metzmacher
@ 2022-08-22 11:49             ` Pavel Begunkov
  0 siblings, 0 replies; 14+ messages in thread
From: Pavel Begunkov @ 2022-08-22 11:49 UTC (permalink / raw)
  To: Stefan Metzmacher, io-uring; +Cc: Jens Axboe, Dylan Yudaken

On 8/19/22 13:36, Stefan Metzmacher wrote:
[...]
>>>>> What do you think? It would remove the whole notif slot complexity
>>>>> from caller using IORING_RECVSEND_NOTIF_FLUSH for every request anyway.
>>>>
>>>> The downside is that requests then should be pretty large or it'll
>>>> lose in performance. Surely not a problem for 8MB per request but
>>>> even 4KB won't suffice. And users may want to put in smaller chunks
>>>> on the wire instead of waiting for mode data to let tcp handle
>>>> pacing and potentially improve latencies by sending earlier.
>>>
>>> If this is optional applications can decide what fits better.
>>>
>>>> On the other hand that one notification per request idea mentioned
>>>> before can extended to 1-2 CQEs per request, which is interestingly
>>>> the approach zc send discussions started with.
>>>
>>> In order to make use of any of this I need any way
>>> to get 2 CQEs with user_data being the same or related.
>>
>> The idea described above will post 2 CQEs (mostly) per request
>> as you want with an optional way to have only 1 CQE. My current
>> sentiment is to kill all the slot business, leave this 1-2 CQE
>> per request and see if there are users for whom it won't be
>> enough. It's anyway just a slight deviation from what I wanted
>> to push as a complimentary interface.
> 
> Ah, ok, removing the slot stuff again would be fine for me...
> 
>>> The only benefit for with slots is being able to avoid or
>>> batch additional CQEs, correct? Or is there more to it?
>>
>> CQE batching is a lesser problem, I'm more concerned of how
>> it sticks with the network. In short, it'll hugely underperform
>> with TCP if requests are not large enough.
>>
>> A simple bench with some hacks, localhost, TCP, run by
>>
>> ./msg_zerocopy -6 -r tcp -s <size> &
>> ./io_uring_zerocopy_tx -6 -D "::1" -s <size> -m <0,2> tcp
>>
>>
>> non-zerocopy:
>> 4000B:  tx=8711880 (MB=33233), tx/s=1742376 (MB/s=6646)
>> 16000B: tx=3196528 (MB=48775), tx/s=639305 (MB/s=9755)
>> 60000B: tx=1036536 (MB=59311), tx/s=207307 (MB/s=11862)
>>
>> zerocopy:
>> 4000B:  tx=3003488 (MB=11457), tx/s=600697 (MB/s=2291)
>> 16000B: tx=2940296 (MB=44865), tx/s=588059 (MB/s=8973)
>> 60000B: tx=2621792 (MB=150020), tx/s=524358 (MB/s=30004)
> 
> So with something between 16k and 60k we reach the point where
> ZC starts to be faster, correct?

For this setup -- yes, should be somewhat around 16-20K,
don't remember numbers for real hw, but I saw similar
tendencies.

> Did you remove the loopback restriction as described in
> Documentation/networking/msg_zerocopy.rst ?

right, it wouldn't outperform even with large payload otherwise

> Are the results similar when using ./msg_zerocopy -6 tcp -s <size>
> as client?

Shouldn't be, it also batches multiple requests to a single
(internal) notification and also exposes it to the userspace
differently.

> And the reason is some page pinning overhead from iov_iter_get_pages2()
> in __zerocopy_sg_from_iter()?

No, I was using registered buffers here, so instead of
iov_iter_get_pages2() business zerocopy was doing
io_uring/net.c:io_sg_from_iter(). And in any case overhead on pinning
wouldn't drastically change it.

>> Reusing notifications with slots will change the picture.
>> And it this has nothing to do with io_uring overhead like
>> CQE posting and so on.
> 
> Hmm I don't understand how the number of notif structures
> would have any impact? Is it related to io_sg_from_iter()?

It comes from TCP stack force changing an skbuff every time
it meets a new ubuf_info (i.e. a notification handle for
simplicity), there is a slight bump on skb allocation overhead
but the main problem is seemingly comes from tcp_push and so,
feeding it down the stack. I don't think there is any fundamental
reason for why it should be working so much slower but might
be problematic from engineering perspective. I'll ask a bit
around or maybe look myself if find time for that.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-08-22 11:54 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-16  7:41 [RFC 0/2] io_uring zc notification tag override Pavel Begunkov
2022-08-16  7:42 ` [RFC 1/2] io_uring/notif: change notif CQE uapi format Pavel Begunkov
2022-08-16  8:14   ` Stefan Metzmacher
2022-08-16  7:42 ` [RFC 2/2] io_uring/net: allow to override notification tag Pavel Begunkov
2022-08-16  8:23   ` Stefan Metzmacher
2022-08-17 12:42     ` Pavel Begunkov
2022-08-18 18:13       ` Stefan Metzmacher
2022-08-19 11:42         ` Pavel Begunkov
2022-08-19 12:36           ` Stefan Metzmacher
2022-08-22 11:49             ` Pavel Begunkov
2022-08-16  8:37   ` Dylan Yudaken
2022-08-17 10:48     ` Pavel Begunkov
2022-08-17 12:04       ` Dylan Yudaken
2022-08-17 12:44         ` Pavel Begunkov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.