IO-Uring Archive on lore.kernel.org
From: "Carter Li 李通洲" <carter.li@eoitek.com>
To: Pavel Begunkov <asml.silence@gmail.com>
Cc: io-uring <io-uring@vger.kernel.org>
Subject: Re: [FEATURE REQUEST] Specify a sqe won't generate a cqe
Date: Fri, 14 Feb 2020 13:27:27 +0000
Message-ID: <57BDF3A6-7279-4250-B200-76FDCDB04765@eoitek.com> (raw)
In-Reply-To: <19236051-0949-ed5c-d1d5-458c07681f36@gmail.com>



> On Feb 14, 2020, at 8:52 PM, Pavel Begunkov <asml.silence@gmail.com> wrote:
> 
> On 2/14/2020 2:27 PM, Carter Li 李通洲 wrote:
>> 
>>> On Feb 14, 2020, at 6:34 PM, Pavel Begunkov <asml.silence@gmail.com> wrote:
>>> 
>>> On 2/14/2020 11:29 AM, Carter Li 李通洲 wrote:
>>>> To implement io_uring_wait_cqe_timeout, we introduce a magic number
>>>> called `LIBURING_UDATA_TIMEOUT`. The problem is that not only must
>>>> we make sure that users never set sqe->user_data to
>>>> LIBURING_UDATA_TIMEOUT, but it also introduces extra complexity to
>>>> filter out TIMEOUT cqes.
>>>> 
>>>> Former discussion: https://github.com/axboe/liburing/issues/53
>>>> 
>>>> I’m suggesting introducing a new SQE flag called IOSQE_IGNORE_CQE
>>>> to solve this problem.
>>>> 
>>>> For a sqe tagged with the IOSQE_IGNORE_CQE flag, no cqe is generated
>>>> on completion, so IORING_OP_TIMEOUT cqes can be filtered on the
>>>> kernel side.
>>>> 
>>>> In addition, `IOSQE_IGNORE_CQE` can be used to save cq ring space.
>>>> 
>>>> For example, in a `POLL_ADD(POLLIN)->READ/RECV` link chain, people
>>>> usually don't care what the result of `POLL_ADD` is (since it will
>>>> almost always be POLLIN), so `IOSQE_IGNORE_CQE` can be set on
>>>> `POLL_ADD` to save a lot of cq ring space.
>>>> 
>>>> Besides POLL_ADD, people usually don't care about the results of
>>>> POLL_REMOVE/TIMEOUT_REMOVE/ASYNC_CANCEL/CLOSE. These operations can
>>>> also be tagged with IOSQE_IGNORE_CQE.
>>>> 
>>>> Thoughts?
>>>> 
>>> 
>>> I like the idea! And that's one of my TODOs for the eBPF plans.
>>> Let me list my use cases, so we can think how to extend it a bit.
>>> 
>>> 1. In case of link failure, we need to reap all the -ECANCELED
>>> completions, analyze them, and resubmit the rest. It's quite
>>> inconvenient. We may want to have CQEs only for requests that were
>>> not cancelled.
>>> 
>>> 2. When a chain succeeds, you in most cases already know the results
>>> of all intermediate CQEs, but you still need to reap and match them.
>>> I'd prefer to have only one CQE per link: either for the first
>>> failed request or for the last request in the chain.
>>> 
>>> These two could shed a lot of processing overhead from userspace.
>> 
>> I couldn't agree more!
>> 
>> Another problem is that io_uring_enter will be woken for the completion
>> of every operation in a link, which results in unnecessary context
>> switches. When woken, users have nothing to do but issue another
>> io_uring_enter syscall to wait for completion of the entire link chain.
> 
> Good point. Sounds like I have one more thing to do :)
> Would the behaviour as in the (2) cover all your needs?

(2) should cover most cases for me. For any cases it doesn't cover,
I can still use normal sqes.

> 
> There is a nuisance with linked timeouts, but I think it's reasonable,
> for REQ->LINKED_TIMEOUT where the timeout hasn't fired, to notify only
> for REQ.
> 
>>> 
>>> 3. If we generate requests by eBPF, even the notion of a per-request
>>> event may break.
>>> - eBPF creating new requests would also need to specify user_data, and
>>> this may be problematic from the user's perspective.
>>> - we may want to not generate CQEs automatically, but let eBPF do it.
>>> 
>>> -- 
>>> Pavel Begunkov
>> 
> 
> -- 
> Pavel Begunkov



Thread overview: 6+ messages
2020-02-14  8:29 Carter Li 李通洲
2020-02-14 10:34 ` Pavel Begunkov
2020-02-14 11:27   ` Carter Li 李通洲
2020-02-14 12:52     ` Pavel Begunkov
2020-02-14 13:27       ` Carter Li 李通洲 [this message]
2020-02-14 14:16         ` Pavel Begunkov
