Re: [RFC 0/2] io_uring: examine request result only after completion

From: Jens Axboe <axboe@kernel.dk>
To: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Cc: linux-block@vger.kernel.org
Subject: Re: [RFC 0/2] io_uring: examine request result only after completion
Date: Tue, 29 Oct 2019 13:27:19 -0600	[thread overview]
Message-ID: <47b38d9d-04a3-99f6-c586-e82611d21655@kernel.dk> (raw)
In-Reply-To: <34f483d9-2a97-30c3-9937-d3596649356c@oracle.com>

On 10/29/19 1:23 PM, Bijan Mottahedeh wrote:
> 
> On 10/29/19 12:17 PM, Bijan Mottahedeh wrote:
>>
>> On 10/25/19 7:21 AM, Jens Axboe wrote:
>>> On 10/25/19 8:18 AM, Jens Axboe wrote:
>>>> On 10/25/19 8:07 AM, Jens Axboe wrote:
>>>>> On 10/25/19 7:46 AM, Bijan Mottahedeh wrote:
>>>>>> On 10/24/19 3:31 PM, Jens Axboe wrote:
>>>>>>> On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
>>>>>>>> On 10/24/19 10:09 AM, Jens Axboe wrote:
>>>>>>>>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
>>>>>>>>>> Running an fio test consistenly crashes the kernel with the
>>>>>>>>>> trace included
>>>>>>>>>> below.  The root cause seems to be the code in
>>>>>>>>>> __io_submit_sqe() that
>>>>>>>>>> checks the result of a request for -EAGAIN in polled mode,
>>>>>>>>>> without
>>>>>>>>>> ensuring first that the request has completed:
>>>>>>>>>>
>>>>>>>>>>      if (ctx->flags & IORING_SETUP_IOPOLL) {
>>>>>>>>>>          if (req->result == -EAGAIN)
>>>>>>>>>>              return -EAGAIN;
>>>>>>>>> I'm a little confused, because we should be holding the submission
>>>>>>>>> reference to the request still at this point. So how is it
>>>>>>>>> going away?
>>>>>>>>> I must be missing something...
>>>>>>>> I don't think the submission reference is going away...
>>>>>>>>
>>>>>>>> I *think* the problem has to do with the fact that
>>>>>>>> io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being
>>>>>>>> called from interrupt context in my configuration and so there is a
>>>>>>>> potential race between updating the request there and checking
>>>>>>>> it in
>>>>>>>> __io_submit_sqe().
>>>>>>>>
>>>>>>>> My first workaround was to simply poll for
>>>>>>>> REQ_F_IOPOLL_COMPLETED in the
>>>>>>>> code snippet above:
>>>>>>>>
>>>>>>>>            if (req->result == --EAGAIN) {
>>>>>>>>
>>>>>>>>                poll for REQ_F_IOPOLL_COMPLETED
>>>>>>>>
>>>>>>>>                return -EAGAIN;
>>>>>>>>
>>>>>>>> }
>>>>>>>>
>>>>>>>> and that got rid of the problem.
>>>>>>> But that will not work at all for a proper poll setup, where you
>>>>>>> don't
>>>>>>> trigger any IRQs... It only happens to work for this case because
>>>>>>> you're
>>>>>>> still triggering interrupts. But even in that case, it's not a real
>>>>>>> solution, but I don't think that's the argument here ;-)
>>>>>> Sure.
>>>>>>
>>>>>> I'm just curious though as how it would break the poll case because
>>>>>> io_complete_rw_iopoll() would still be called though through polling,
>>>>>> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete()
>>>>>> should be able to reliably check req->result.
>>>>> It'd break the poll case because the task doing the submission is
>>>>> generally also the one that finds and reaps completion. Hence if you
>>>>> block that task just polling on that completion bit, you are
>>>>> preventing
>>>>> that very task from going and reaping completions. The condition would
>>>>> never become true, and you are now looping forever.
>>>>>
>>>>>> The same poll test seemed to run ok with nvme interrupts not being
>>>>>> triggered. Anyway, no argument that it's not needed!
>>>>> A few reasons why it would make progress:
>>>>>
>>>>> - You eventually trigger a timeout on the nvme side, as blk-mq
>>>>> finds the
>>>>>       request hasn't been completed by an IRQ. But that's a 30
>>>>> second ordeal
>>>>>       before that event occurs.
>>>>>
>>>>> - There was still interrupts enabled.
>>>>>
>>>>> - You have two threads, one doing submission and one doing
>>>>> completions.
>>>>>       Maybe using SQPOLL? If that's the case, then yes, it'd still
>>>>> work as
>>>>>       you have separate threads for submission and completion.
>>>>>
>>>>> For the "generic" case of just using one thread and IRQs disabled,
>>>>> it'd
>>>>> deadlock.
>>>>>
>>>>>>> I see what the race is now, it's specific to IRQ driven polling. We
>>>>>>> really should just disallow that, to be honest, it doesn't make any
>>>>>>> sense. But let me think about if we can do a reasonable solution
>>>>>>> to this
>>>>>>> that doesn't involve adding overhead for a proper setup.
>>>>>> It's a nonsensical config in a way and so disallowing it would make
>>>>>> the most sense.
>>>>> Definitely. The nvme driver should not set .poll() if it doesn't have
>>>>> non-irq poll queues. Something like this:
>>>> Actually, we already disable polling if we don't have specific poll
>>>> queues:
>>>>
>>>>            if (set->nr_maps > HCTX_TYPE_POLL &&
>>>>                set->map[HCTX_TYPE_POLL].nr_queues)
>>>>                    blk_queue_flag_set(QUEUE_FLAG_POLL, q);
>>>>
>>>> Did you see any timeouts in your tests? I wonder if the use-after-free
>>>> triggered when the timeout found the request while you had the
>>>> busy-spin
>>>> logic we discussed previously.
>>> Ah, but we still have fops->iopoll() set for that case. So we just won't
>>> poll for it, it'll get completed by IRQ. So I do think we need to handle
>>> this case in io_uring. I'll get back to you.
>>>
>>
>> I ran the same test on linux-next-20191029 in polled mode and got the
>> same free-after-user panic:
>>
>> - I booted with nvme.poll_queues set and verified that all queues
>> except default where of type poll
>>
>> - I added three assertions to verify the following:
>>
>>      - nvme_timeout() is not called
>>
>>      - io_complete_rw_iopoll() is not called from interrupt context
>>
>>      - io_sq_offload_start() is not called with IORING_SETUP_SQPOLL set
>>
>> Is it possible that the race is there also in polled mode since a
>> request submitted by one thread could conceivably be polled for and
>> completed by a different thread, e.g. in
>> io_uring_enter()->io_iopoll_check()?
>>
>> --bijan
>>
>>
> I also tested my RFC again with 1 thread and with queue depths of 1 to
> 1024 in multiples of 8 and didn't see any hangs.
> 
> Just to be clear, the busy-spin logic discussed before was only a
> workaround an not in the RFC.

What is your exact test case?

-- 
Jens Axboe