linux-block.vger.kernel.org archive mirror
From: Jens Axboe <axboe@kernel.dk>
To: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Cc: linux-block@vger.kernel.org
Subject: Re: [RFC 0/2] io_uring: examine request result only after completion
Date: Fri, 25 Oct 2019 08:21:56 -0600	[thread overview]
Message-ID: <6adb9d2d-93f1-f915-7f20-5faa34b06398@kernel.dk> (raw)
In-Reply-To: <533409a8-6907-44d8-1b90-a10ec3483c2c@kernel.dk>

On 10/25/19 8:18 AM, Jens Axboe wrote:
> On 10/25/19 8:07 AM, Jens Axboe wrote:
>> On 10/25/19 7:46 AM, Bijan Mottahedeh wrote:
>>>
>>> On 10/24/19 3:31 PM, Jens Axboe wrote:
>>>> On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
>>>>> On 10/24/19 10:09 AM, Jens Axboe wrote:
>>>>>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
>>>>>>> Running an fio test consistently crashes the kernel with the trace included
>>>>>>> below.  The root cause seems to be the code in __io_submit_sqe() that
>>>>>>> checks the result of a request for -EAGAIN in polled mode, without
>>>>>>> ensuring first that the request has completed:
>>>>>>>
>>>>>>> 	if (ctx->flags & IORING_SETUP_IOPOLL) {
>>>>>>> 		if (req->result == -EAGAIN)
>>>>>>> 			return -EAGAIN;
>>>>>> I'm a little confused, because we should be holding the submission
>>>>>> reference to the request still at this point. So how is it going away?
>>>>>> I must be missing something...
>>>>> I don't think the submission reference is going away...
>>>>>
>>>>> I *think* the problem has to do with the fact that
>>>>> io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being
>>>>> called from interrupt context in my configuration and so there is a
>>>>> potential race between updating the request there and checking it in
>>>>> __io_submit_sqe().
>>>>>
>>>>> My first workaround was to simply poll for REQ_F_IOPOLL_COMPLETED in the
>>>>> code snippet above:
>>>>>
>>>>>          if (req->result == -EAGAIN) {
>>>>>
>>>>>              poll for REQ_F_IOPOLL_COMPLETED
>>>>>
>>>>>              return -EAGAIN;
>>>>>
>>>>>          }
>>>>>
>>>>> and that got rid of the problem.
>>>> But that will not work at all for a proper poll setup, where you don't
>>>> trigger any IRQs... It only happens to work for this case because you're
>>>> still triggering interrupts. But even in that case, it's not a real
>>>> solution, but I don't think that's the argument here ;-)
>>>
>>> Sure.
>>>
>>> I'm just curious though as to how it would break the poll case because
>>> io_complete_rw_iopoll() would still be called though through polling,
>>> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete()
>>> should be able to reliably check req->result.
>>
>> It'd break the poll case because the task doing the submission is
>> generally also the one that finds and reaps completion. Hence if you
>> block that task just polling on that completion bit, you are preventing
>> that very task from going and reaping completions. The condition would
>> never become true, and you are now looping forever.
>>
>>> The same poll test seemed to run ok with nvme interrupts not being
>>> triggered. Anyway, no argument that it's not needed!
>>
>> A few reasons why it would make progress:
>>
>> - You eventually trigger a timeout on the nvme side, as blk-mq finds the
>>     request hasn't been completed by an IRQ. But that's a 30 second ordeal
>>     before that event occurs.
>>
>> - There were still interrupts enabled.
>>
>> - You have two threads, one doing submission and one doing completions.
>>     Maybe using SQPOLL? If that's the case, then yes, it'd still work as
>>     you have separate threads for submission and completion.
>>
>> For the "generic" case of just using one thread and IRQs disabled, it'd
>> deadlock.
>>
>>>> I see what the race is now, it's specific to IRQ driven polling. We
>>>> really should just disallow that, to be honest, it doesn't make any
>>>> sense. But let me think about if we can do a reasonable solution to this
>>>> that doesn't involve adding overhead for a proper setup.
>>>
>>> It's a nonsensical config in a way and so disallowing it would make
>>> the most sense.
>>
>> Definitely. The nvme driver should not set .poll() if it doesn't have
>> non-irq poll queues. Something like this:
> 
> Actually, we already disable polling if we don't have specific poll
> queues:
> 
>          if (set->nr_maps > HCTX_TYPE_POLL &&
>              set->map[HCTX_TYPE_POLL].nr_queues)
>                  blk_queue_flag_set(QUEUE_FLAG_POLL, q);
> 
> Did you see any timeouts in your tests? I wonder if the use-after-free
> triggered when the timeout found the request while you had the busy-spin
> logic we discussed previously.

Ah, but we still have fops->iopoll() set for that case. So we just won't
poll for it, it'll get completed by IRQ. So I do think we need to handle
this case in io_uring. I'll get back to you.

-- 
Jens Axboe


Thread overview: 25+ messages
2019-10-24  9:18 [RFC 0/2] io_uring: examine request result only after completion Bijan Mottahedeh
2019-10-24  9:18 ` [RFC 1/2] io_uring: create io_queue_async() function Bijan Mottahedeh
2019-10-24  9:18 ` [RFC 2/2] io_uring: examine request result only after completion Bijan Mottahedeh
2019-10-24 17:09 ` [RFC 0/2] " Jens Axboe
2019-10-24 19:18   ` Bijan Mottahedeh
2019-10-24 22:31     ` Jens Axboe
     [not found]       ` <fa82e9fc-caf7-a94a-ebff-536413e9ecce@oracle.com>
2019-10-25 14:07         ` Jens Axboe
2019-10-25 14:18           ` Jens Axboe
2019-10-25 14:21             ` Jens Axboe [this message]
2019-10-29 19:17               ` Bijan Mottahedeh
2019-10-29 19:23                 ` Bijan Mottahedeh
2019-10-29 19:27                   ` Jens Axboe
2019-10-29 19:31                     ` Bijan Mottahedeh
2019-10-29 19:33                       ` Jens Axboe
2019-10-29 19:40                         ` Bijan Mottahedeh
2019-10-29 19:46                           ` Jens Axboe
2019-10-29 19:51                             ` Bijan Mottahedeh
2019-10-29 19:52                               ` Jens Axboe
2019-10-30  1:02                                 ` Jens Axboe
2019-10-30 14:02                                   ` Bijan Mottahedeh
2019-10-30 14:18                                     ` Jens Axboe
2019-10-30 17:32                                       ` Jens Axboe
2019-10-30 19:21                                         ` Bijan Mottahedeh
2019-10-30 19:26                                           ` Jens Axboe
2019-10-25 14:42             ` Bijan Mottahedeh
