All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stefan Metzmacher <metze@samba.org>
To: Pavel Begunkov <asml.silence@gmail.com>,
	io-uring <io-uring@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>
Cc: Jakub Kicinski <kuba@kernel.org>, netdev <netdev@vger.kernel.org>,
	Dylan Yudaken <dylany@fb.com>
Subject: Re: IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED)
Date: Fri, 21 Oct 2022 11:36:38 +0200	[thread overview]
Message-ID: <3e56c92b-567c-7bb4-2644-dc1ad1d8c3ae@samba.org> (raw)
In-Reply-To: <c7505b91-16c3-8f83-9782-a520e8b0f484@gmail.com>

Hi Pavel,

>>>> So far I came up with a IORING_SEND_NOTIF_REPORT_USAGE opt-in flag
>>>> and the reporting is done in cqe.res with IORING_NOTIF_USAGE_ZC_USED (0x00000001)
>>>> and/or IORING_NOTIF_USAGE_ZC_COPIED (0x8000000). So the caller is also
>>>> able to notice that some parts were able to use zero copy, while other
>>>> fragments were copied.
>>>
>>> Are we really interested in multihoming and probably some very edge cases?
>>> I'd argue we're not and it should be a single bool hint indicating whether
>>> zc is viable or not. It can do more complex calculations _if_ needed, e.g.
>>> looking inside skb's and figure out how many bytes were copied but as for me
>>> it should better be turned into a single bool in the end. Could also be the
>>> number of bytes copied, but I don't think we can't have the accuracy for
>>> that (e.g. what we're going to return if some protocol duplicates an skb
>>> and sends to 2 different devices or is processing it in a pipeline?)
>>>
>>> So the question is what is the use case for having 2 flags?
>>
>> It's mostly for debugging.
> 
> Ok, than it sounds like we don't need it.

Maybe I could add some trace points to the callback?

>>> btw, now we've got another example why the report flag is a good idea,
>>
>> I don't understand that line...
> 
> I'm just telling that IORING_SEND_NOTIF_* instead of unconditional reporting
> is more flexible and extendible from the uapi perspective.

ok

>>> we can't use cqe.res unconditionally because we want to have a "one CQE
>>> per request" mode, but it's fine if we make it and the report flag
>>> mutually exclusive.
>>
>> You mean we can add an optimized case where SEND[MSG]_ZC would not
>> generate F_MORE and skips F_NOTIF, because we copied or the transmission
>> path was really fast?
> 
> It is rather about optionally omitting the first (aka completion) cqe and
> posting only the notification cqe, which makes a lot of sense for UDP and
> some TCP use cases.

OK.

>> Then I'd move to IORING_CQE_F_COPIED again...
> [...]
>>>> -struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
>>>> +static void __io_notif_complete_tw_report_usage(struct io_kiocb *notif, bool *locked)
>>>
>>> Just shove all that into __io_notif_complete_tw().
>>
>> Ok, and then optimze later?
> 
> Right, I'm just tired of back porting patches by hand :)

ok, I just assumed it would be 6.1 only.

>> Otherwise we could have IORING_CQE_F_COPIED by default without opt-in
>> flag...

Do you still want an opt-in flag to get IORING_CQE_F_COPIED?
If so what name do you want it to be?

>>>> +static void io_uring_tx_zerocopy_callback_report_usage(struct sk_buff *skb,
>>>> +                            struct ubuf_info *uarg,
>>>> +                            bool success)
>>>> +{
>>>> +    struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
>>>> +
>>>> +    if (success && !nd->zc_used && skb)
>>>> +        nd->zc_used = true;
>>>> +    else if (unlikely(!success && !nd->zc_copied))
>>>> +        nd->zc_copied = true;
>>>
>>> It's fine but racy, so let's WRITE_ONCE() to indicate it.
>>
>> I don't see how this could be a problem, but I can add it.
> 
> It's not a problem, but better to be a little be more explicit
> about parallel writes.

ok.

>>>> diff --git a/io_uring/notif.h b/io_uring/notif.h
>>>> index 5b4d710c8ca5..5ac7a2745e52 100644
>>>> --- a/io_uring/notif.h
>>>> +++ b/io_uring/notif.h
>>>> @@ -13,10 +13,12 @@ struct io_notif_data {
>>>>       struct file        *file;
>>>>       struct ubuf_info    uarg;
>>>>       unsigned long        account_pages;
>>>> +    bool            zc_used;
>>>> +    bool            zc_copied;
>>>
>>> IIRC io_notif_data is fully packed in 6.1, so placing zc_{used,copied}
>>> there might complicate backporting (if any). We can place them in io_kiocb
>>> directly and move in 6.2. Alternatively account_pages doesn't have to be
>>> long.
>>
>> As far as I can see kernel-dk-block/io_uring-6.1 alread has your
>> shrink patches included...
> 
> Sorry, I mean 6.0

So you want to backport to 6.0?

Find the current version below, sizeof(struct io_kiocb) will grow from
3*64 + 24 to 3*64 + 32 (on x64_64) to it stays within 4 cache lines.

I tried this first:

union {
   u8 iopoll_completed;
   struct {
     u8 zc_used:1;
     u8 zc_copied:1;
   };
};

But then WRITE_ONCE() complains about a bitfield write.

So let me now about the opt-in flag and I'll prepare real commits
including a patch that moves from struct io_kiocb to struct io_notif_data
on top.

metze

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index f5b687a787a3..189152ad78d6 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -515,6 +515,9 @@ struct io_kiocb {
  	u8				opcode;
  	/* polled IO has completed */
  	u8				iopoll_completed;
+	/* these will be moved to struct io_notif_data in 6.1 */
+	bool				zc_used;
+	bool				zc_copied;
  	/*
  	 * Can be either a fixed buffer index, or used with provided buffers.
  	 * For the latter, before issue it points to the buffer group ID,
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ab7458033ee3..738d6234d1d9 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -350,6 +350,7 @@ struct io_uring_cqe {
  #define IORING_CQE_F_MORE		(1U << 1)
  #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
  #define IORING_CQE_F_NOTIF		(1U << 3)
+#define IORING_CQE_F_COPIED		(1U << 4)

  enum {
  	IORING_CQE_BUFFER_SHIFT		= 16,
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e37c6569d82e..033aca064b10 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -18,6 +18,10 @@ static void __io_notif_complete_tw(struct io_kiocb *notif, bool *locked)
  		__io_unaccount_mem(ctx->user, nd->account_pages);
  		nd->account_pages = 0;
  	}
+
+	if (notif->zc_copied || !notif->zc_used)
+		notif->cqe.flags |= IORING_CQE_F_COPIED;
+
  	io_req_task_complete(notif, locked);
  }

@@ -28,6 +32,11 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
  	struct io_notif_data *nd = container_of(uarg, struct io_notif_data, uarg);
  	struct io_kiocb *notif = cmd_to_io_kiocb(nd);

+	if (success && !notif->zc_used && skb)
+		WRITE_ONCE(notif->zc_used, true);
+	else if (!success && !notif->zc_copied)
+		WRITE_ONCE(notif->zc_copied, true);
+
  	if (refcount_dec_and_test(&uarg->refcnt)) {
  		notif->io_task_work.func = __io_notif_complete_tw;
  		io_req_task_work_add(notif);
@@ -55,6 +64,7 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
  	nd->account_pages = 0;
  	nd->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
  	nd->uarg.callback = io_uring_tx_zerocopy_callback;
+	notif->zc_used = notif->zc_copied = false;
  	refcount_set(&nd->uarg.refcnt, 1);
  	return notif;
  }


  reply	other threads:[~2022-10-21  9:36 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-14 11:06 IORING_CQE_F_COPIED Stefan Metzmacher
2022-10-17 16:46 ` IORING_CQE_F_COPIED Pavel Begunkov
2022-10-18  8:43   ` IORING_CQE_F_COPIED Stefan Metzmacher
2022-10-19 15:06     ` IORING_CQE_F_COPIED Pavel Begunkov
2022-10-19 16:12       ` IORING_CQE_F_COPIED Stefan Metzmacher
2022-10-20  2:24         ` IORING_CQE_F_COPIED Pavel Begunkov
2022-10-20 10:04           ` IORING_SEND_NOTIF_REPORT_USAGE (was Re: IORING_CQE_F_COPIED) Stefan Metzmacher
2022-10-20 13:46             ` Pavel Begunkov
2022-10-20 14:51               ` Stefan Metzmacher
2022-10-20 15:31                 ` Pavel Begunkov
2022-10-21  9:36                   ` Stefan Metzmacher [this message]
2022-10-21 11:09                     ` Pavel Begunkov
2022-10-21 14:03                       ` Stefan Metzmacher
2022-10-27  8:47                         ` Stefan Metzmacher
2022-10-27 10:51                         ` Pavel Begunkov
2022-10-20 10:10           ` IORING_SEND_NOTIF_USER_DATA " Stefan Metzmacher
2022-10-20 15:37             ` Pavel Begunkov
2022-10-21  8:32               ` Stefan Metzmacher
2022-10-21  9:27                 ` Pavel Begunkov
2022-10-21  9:45                   ` Stefan Metzmacher
2022-10-21 11:20                     ` Pavel Begunkov
2022-10-21 12:10                       ` Stefan Metzmacher
2022-10-21 10:15                   ` Stefan Metzmacher
2022-10-21 11:26                     ` Pavel Begunkov
2022-10-21 12:38                       ` Stefan Metzmacher

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3e56c92b-567c-7bb4-2644-dc1ad1d8c3ae@samba.org \
    --to=metze@samba.org \
    --cc=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=dylany@fb.com \
    --cc=io-uring@vger.kernel.org \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.