Re: [RFC] single cqe per link

From: Pavel Begunkov <asml.silence@gmail.com>
To: "Jens Axboe" <axboe@kernel.dk>, "Carter Li 李通洲" <carter.li@eoitek.com>
Cc: io-uring <io-uring@vger.kernel.org>
Subject: Re: [RFC] single cqe per link
Date: Tue, 25 Feb 2020 13:12:19 +0300
Message-ID: <1e733dd7-acd4-dde6-b3c5-c0ee0fbeda2a@gmail.com>
In-Reply-To: <56a18348-2949-e9da-b036-600b5bb4dad2@kernel.dk>

On 2/25/2020 6:13 AM, Jens Axboe wrote:
>>> I still think flags tagged on sqes could be a better choice, which
>>> gives users the ability to decide if they want to ignore the cqes, not
>>> only for links, but also for normal sqes.
>>>
>>> In addition, boxed cqes couldn’t resolve the issue of
>>> IORING_IO_TIMEOUT.
>>
>> I would tend to agree, and it'd be trivial to just set the flag on
>> whatever SQEs in the chain you don't care about. Or even an individual
>> SQE, though that's probably a bit more of a reach in terms of use case.
>> Maybe nop with drain + ignore?

Flexible, but not performant. The mere existence of drain already makes
io_uring do a lot of extra work, and it gets even worse when drain is
actually used.

>>
>> In any case it's definitely more flexible.

That's a different thing. Knowing how requests behave (e.g. if
nbytes != res, then the link fails), one would want to get a cqe for the
last executed sqe, whether it ended in an error or a success.

That lets a link be handled as a single entity. I don't see a way to
emulate similar behaviour with unconditional masking. We will probably
need both.
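
To make the difference concrete, here is a toy userspace model, not kernel
code. It assumes a hypothetical per-sqe no_cqe flag that suppresses the cqe
unconditionally, that a request fails its link when res != nbytes, and that
the remaining requests of a failed link complete with -ECANCELED. The names
Sqe, run_link, cqes_masked, cqes_single and make_link are made up for this
sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

ECANCELED = 125  # Linux errno value


@dataclass
class Sqe:
    user_data: int
    nbytes: int           # bytes requested
    res: int              # result the request produces if it gets executed
    no_cqe: bool = False  # hypothetical "suppress the cqe" flag


def run_link(link: List[Sqe]) -> List[Tuple[Sqe, int, bool]]:
    """Run a link in order; return (sqe, result, was_executed) per request."""
    out = []
    failed = False
    for sqe in link:
        if failed:
            # Requests after a failed one are cancelled, never executed.
            out.append((sqe, -ECANCELED, False))
            continue
        out.append((sqe, sqe.res, True))
        if sqe.res != sqe.nbytes:
            failed = True  # short or failed request breaks the link
    return out


def cqes_masked(link: List[Sqe]) -> List[Tuple[int, int]]:
    """Unconditional masking: flagged sqes never post a cqe."""
    return [(s.user_data, res) for s, res, _ in run_link(link) if not s.no_cqe]


def cqes_single(link: List[Sqe]) -> List[Tuple[int, int]]:
    """Single cqe per link: post only the cqe of the last executed sqe."""
    executed = [(s.user_data, res) for s, res, ran in run_link(link) if ran]
    return executed[-1:]


def make_link(write_res: int) -> List[Sqe]:
    """read -> write -> fsync, with cqes masked on all but the last sqe."""
    return [
        Sqe(user_data=1, nbytes=4096, res=4096, no_cqe=True),
        Sqe(user_data=2, nbytes=4096, res=write_res, no_cqe=True),
        Sqe(user_data=3, nbytes=0, res=0),
    ]


ok_link = make_link(write_res=4096)    # everything succeeds
short_link = make_link(write_res=512)  # short write in the middle

print(cqes_masked(ok_link), cqes_single(ok_link))
print(cqes_masked(short_link), cqes_single(short_link))
```

When the whole link succeeds both schemes report the same cqe. With the
short write, masking only leaves the -ECANCELED of the last sqe, so
userspace cannot tell which request failed or with what result, while the
single-cqe scheme reports the short write itself.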

> In the interest of taking this to the extreme, I tried a nop benchmark
> on my laptop (qemu/kvm). Granted, this setup is particularly sensitive
> to spinlocks, they are a lot more expensive there than on a real host.
> 
> Anyway, regular nops run at about 9.5M/sec with a single thread.
> Flagging all SQEs with IOSQE_NO_CQE nets me about 14M/sec. So a handy
> improvement. Looking at the top of profiles:
> 
> cqe-per-sqe:
> 
> +   28.45%  io_uring  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
> +   14.38%  io_uring  [kernel.kallsyms]  [k] io_submit_sqes
> +    9.38%  io_uring  [kernel.kallsyms]  [k] io_put_req
> +    7.25%  io_uring  libc-2.31.so       [.] syscall
> +    6.12%  io_uring  [kernel.kallsyms]  [k] kmem_cache_free
> 
> no-cqes:
> 
> +   19.72%  io_uring  [kernel.kallsyms]  [k] io_put_req
> +   11.93%  io_uring  [kernel.kallsyms]  [k] io_submit_sqes
> +   10.14%  io_uring  [kernel.kallsyms]  [k] kmem_cache_free
> +    9.55%  io_uring  libc-2.31.so       [.] syscall
> +    7.48%  io_uring  [kernel.kallsyms]  [k] __io_queue_sqe
> 
> I'll try the real disk IO tomorrow, using polled IO.

Great, would love to see the results.

-- 
Pavel Begunkov