Re: [RFC] io_uring: stop only support read/write for ctx with IORING_SETUP_IOPOLL

From: yangerkun <yangerkun@huawei.com>
To: Jens Axboe <axboe@kernel.dk>, Bob Liu <bob.liu@oracle.com>,
	<linux-block@vger.kernel.org>
Cc: <houtao1@huawei.com>, <yi.zhang@huawei.com>
Subject: Re: [RFC] io_uring: stop only support read/write for ctx with IORING_SETUP_IOPOLL
Date: Thu, 7 Nov 2019 09:27:51 +0800	[thread overview]
Message-ID: <0f3419a4-e49f-4056-48cf-0e50431b8062@huawei.com> (raw)
In-Reply-To: <8194c20d-4523-8885-2b80-b9849d37e890@kernel.dk>

On 2019/11/6 22:14, Jens Axboe wrote:
> On 11/6/19 5:00 AM, yangerkun wrote:
>>
>>
>> On 2019/11/4 22:01, Jens Axboe wrote:
>>> On 11/4/19 4:46 AM, yangerkun wrote:
>>>>
>>>>
>>>> On 2019/11/4 18:09, Bob Liu wrote:
>>>>> On 11/4/19 4:56 PM, yangerkun wrote:
>>>>>> There is no problem to support other type request for the ctx with
>>>>>> IORING_SETUP_IOPOLL.
>>>>>
>>>>> Could you describe the benefit of doing this?
>>>>
>>>> Hi,
>>>>
>>>> I am trying to replace libaio with io_uring in InnoDB/MariaDB(which
>>>> build on xfs/nvme). And in order to simulate the timeout mechanism
>>>> like io_getevents, firstly, to use the poll function of io_uring's fd
>>>> has been selected and it really did work. But while trying to enable
>>>> IORING_SETUP_IOPOLL since xfs has iopoll function interface, the
>>>> mechanism will fail since io_uring_poll does check the cq.head between
>>>> cached_cq_tail, which will not update until we call io_uring_enter and
>>>> do the poll. So, instead, I decide to use timeout requests in
>>>> io_uring but will return -EINVAL since we enable IORING_SETUP_IOPOLL
>>>> at the same time. I think this combination is a normal scene so as
>>>> the other combination descibed in this patch. I am not sure does it a
>>>> good solution for this problem, and maybe there exists some better way.
>>>
>>> I think we can support timeouts pretty easily with SETUP_IOPOLL, but we
>>> can't mix and match everything. Pretty sure I've written at length about
>>> that before, but the tldr is that for purely polled commands, we won't
>>> have an IRQ event. That's the case for nvme if it's misconfigured, but
>>> for an optimal setup where nvme is loaded with poll queues, there will
>>> never be an interrupt for the command. This means that we can never wait
>>> in io_cqring_wait(), we must always call the iopoll poller, because if
>>> we wait we might very well be waiting for events that will never happen
>>> unless we actively poll for them.
>>>
>>> This could be supported if we accounted requests, but I don't want to
>>> add that kind of overhead. Same with the lock+irqdisable you had to add
>>> for this, it's not acceptable overhead.
>>>
>>> Sounds like you just need timeout support for polling? If so, then that
>>
>> Hi,
>>
>> Yeah, the thing I need add is the timeout support for polling.
>>
>>> is supportable as we know that these events will trigger an async event
>>> when they happen. Either that, or it triggers when we poll for
>>> completions. So it's safe to support, and we can definitely do that.
>>
>> I am a little confuse. What you describe is once enable SETUP_IOPOLL
>> and there is a async call of io_timeout_fn, we should not call
>> io_cqring_fill_event directly, but leave io_iopoll_check to do this?
>> Or, the parallel running for io_iopoll_check may trigger some problem
>> since they can going to io_cqring_fill_event.
> 
> Maybe that wasn't quite clear, what I'm trying to say is that there are
> two outcomes with IORING_OP_TIMEOUT:
> 
> 1) The number of events specified in the timeout is met. This happens
>     through the normal poll command checks when we complete commands.
> 2) The timer fires. When this happens, we just increment ->cq_timeouts.
>     You'd have to make a note of that in the poll loop just like we do in
>     cqring_wait() to know to abort if that value is different from when
>     we started the loop.
> 
> All that's needed to support timeouts with IORING_SETUP_IOPOLL is to
> have that ->cq_timeouts check in place. With that, the restriction could
> be lifted.
> 

Thank you for your detailed explanation! I will try the solution you
suggest instead of this patch!