linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Cc: linux-block@vger.kernel.org
Subject: Re: [RFC 0/2] io_uring: examine request result only after completion
Date: Tue, 29 Oct 2019 13:33:21 -0600	[thread overview]
Message-ID: <057bb6f9-29ec-1160-a1b1-00c57b610282@kernel.dk> (raw)
In-Reply-To: <c7b599e4-cf3d-5390-f6f4-360d4435ea43@oracle.com>

On 10/29/19 1:31 PM, Bijan Mottahedeh wrote:
> 
> On 10/29/19 12:27 PM, Jens Axboe wrote:
>> On 10/29/19 1:23 PM, Bijan Mottahedeh wrote:
>>> On 10/29/19 12:17 PM, Bijan Mottahedeh wrote:
>>>> On 10/25/19 7:21 AM, Jens Axboe wrote:
>>>>> On 10/25/19 8:18 AM, Jens Axboe wrote:
>>>>>> On 10/25/19 8:07 AM, Jens Axboe wrote:
>>>>>>> On 10/25/19 7:46 AM, Bijan Mottahedeh wrote:
>>>>>>>> On 10/24/19 3:31 PM, Jens Axboe wrote:
>>>>>>>>> On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
>>>>>>>>>> On 10/24/19 10:09 AM, Jens Axboe wrote:
>>>>>>>>>>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
>>>>>>>>>>>> Running an fio test consistenly crashes the kernel with the
>>>>>>>>>>>> trace included
>>>>>>>>>>>> below.  The root cause seems to be the code in
>>>>>>>>>>>> __io_submit_sqe() that
>>>>>>>>>>>> checks the result of a request for -EAGAIN in polled mode,
>>>>>>>>>>>> without
>>>>>>>>>>>> ensuring first that the request has completed:
>>>>>>>>>>>>
>>>>>>>>>>>>        if (ctx->flags & IORING_SETUP_IOPOLL) {
>>>>>>>>>>>>            if (req->result == -EAGAIN)
>>>>>>>>>>>>                return -EAGAIN;
>>>>>>>>>>> I'm a little confused, because we should be holding the submission
>>>>>>>>>>> reference to the request still at this point. So how is it
>>>>>>>>>>> going away?
>>>>>>>>>>> I must be missing something...
>>>>>>>>>> I don't think the submission reference is going away...
>>>>>>>>>>
>>>>>>>>>> I *think* the problem has to do with the fact that
>>>>>>>>>> io_complete_rw_iopoll() which sets REQ_F_IOPOLL_COMPLETED is being
>>>>>>>>>> called from interrupt context in my configuration and so there is a
>>>>>>>>>> potential race between updating the request there and checking
>>>>>>>>>> it in
>>>>>>>>>> __io_submit_sqe().
>>>>>>>>>>
>>>>>>>>>> My first workaround was to simply poll for
>>>>>>>>>> REQ_F_IOPOLL_COMPLETED in the
>>>>>>>>>> code snippet above:
>>>>>>>>>>
>>>>>>>>>>              if (req->result == --EAGAIN) {
>>>>>>>>>>
>>>>>>>>>>                  poll for REQ_F_IOPOLL_COMPLETED
>>>>>>>>>>
>>>>>>>>>>                  return -EAGAIN;
>>>>>>>>>>
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> and that got rid of the problem.
>>>>>>>>> But that will not work at all for a proper poll setup, where you
>>>>>>>>> don't
>>>>>>>>> trigger any IRQs... It only happens to work for this case because
>>>>>>>>> you're
>>>>>>>>> still triggering interrupts. But even in that case, it's not a real
>>>>>>>>> solution, but I don't think that's the argument here ;-)
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>> I'm just curious though as how it would break the poll case because
>>>>>>>> io_complete_rw_iopoll() would still be called though through polling,
>>>>>>>> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete()
>>>>>>>> should be able to reliably check req->result.
>>>>>>> It'd break the poll case because the task doing the submission is
>>>>>>> generally also the one that finds and reaps completion. Hence if you
>>>>>>> block that task just polling on that completion bit, you are
>>>>>>> preventing
>>>>>>> that very task from going and reaping completions. The condition would
>>>>>>> never become true, and you are now looping forever.
>>>>>>>
>>>>>>>> The same poll test seemed to run ok with nvme interrupts not being
>>>>>>>> triggered. Anyway, no argument that it's not needed!
>>>>>>> A few reasons why it would make progress:
>>>>>>>
>>>>>>> - You eventually trigger a timeout on the nvme side, as blk-mq
>>>>>>> finds the
>>>>>>>         request hasn't been completed by an IRQ. But that's a 30
>>>>>>> second ordeal
>>>>>>>         before that event occurs.
>>>>>>>
>>>>>>> - There was still interrupts enabled.
>>>>>>>
>>>>>>> - You have two threads, one doing submission and one doing
>>>>>>> completions.
>>>>>>>         Maybe using SQPOLL? If that's the case, then yes, it'd still
>>>>>>> work as
>>>>>>>         you have separate threads for submission and completion.
>>>>>>>
>>>>>>> For the "generic" case of just using one thread and IRQs disabled,
>>>>>>> it'd
>>>>>>> deadlock.
>>>>>>>
>>>>>>>>> I see what the race is now, it's specific to IRQ driven polling. We
>>>>>>>>> really should just disallow that, to be honest, it doesn't make any
>>>>>>>>> sense. But let me think about if we can do a reasonable solution
>>>>>>>>> to this
>>>>>>>>> that doesn't involve adding overhead for a proper setup.
>>>>>>>> It's a nonsensical config in a way and so disallowing it would make
>>>>>>>> the most sense.
>>>>>>> Definitely. The nvme driver should not set .poll() if it doesn't have
>>>>>>> non-irq poll queues. Something like this:
>>>>>> Actually, we already disable polling if we don't have specific poll
>>>>>> queues:
>>>>>>
>>>>>>              if (set->nr_maps > HCTX_TYPE_POLL &&
>>>>>>                  set->map[HCTX_TYPE_POLL].nr_queues)
>>>>>>                      blk_queue_flag_set(QUEUE_FLAG_POLL, q);
>>>>>>
>>>>>> Did you see any timeouts in your tests? I wonder if the use-after-free
>>>>>> triggered when the timeout found the request while you had the
>>>>>> busy-spin
>>>>>> logic we discussed previously.
>>>>> Ah, but we still have fops->iopoll() set for that case. So we just won't
>>>>> poll for it, it'll get completed by IRQ. So I do think we need to handle
>>>>> this case in io_uring. I'll get back to you.
>>>>>
>>>> I ran the same test on linux-next-20191029 in polled mode and got the
>>>> same free-after-user panic:
>>>>
>>>> - I booted with nvme.poll_queues set and verified that all queues
>>>> except default where of type poll
>>>>
>>>> - I added three assertions to verify the following:
>>>>
>>>>        - nvme_timeout() is not called
>>>>
>>>>        - io_complete_rw_iopoll() is not called from interrupt context
>>>>
>>>>        - io_sq_offload_start() is not called with IORING_SETUP_SQPOLL set
>>>>
>>>> Is it possible that the race is there also in polled mode since a
>>>> request submitted by one thread could conceivably be polled for and
>>>> completed by a different thread, e.g. in
>>>> io_uring_enter()->io_iopoll_check()?
>>>>
>>>> --bijan
>>>>
>>>>
>>> I also tested my RFC again with 1 thread and with queue depths of 1 to
>>> 1024 in multiples of 8 and didn't see any hangs.
>>>
>>> Just to be clear, the busy-spin logic discussed before was only a
>>> workaround an not in the RFC.
>> What is your exact test case?
>>
> See original cover letter.  I can reproduce the failure with numjobs
> between 8 and 32.

And how many poll queues are you using?

-- 
Jens Axboe


  reply	other threads:[~2019-10-29 19:33 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-24  9:18 [RFC 0/2] io_uring: examine request result only after completion Bijan Mottahedeh
2019-10-24  9:18 ` [RFC 1/2] io_uring: create io_queue_async() function Bijan Mottahedeh
2019-10-24  9:18 ` [RFC 2/2] io_uring: examine request result only after completion Bijan Mottahedeh
2019-10-24 17:09 ` [RFC 0/2] " Jens Axboe
2019-10-24 19:18   ` Bijan Mottahedeh
2019-10-24 22:31     ` Jens Axboe
     [not found]       ` <fa82e9fc-caf7-a94a-ebff-536413e9ecce@oracle.com>
2019-10-25 14:07         ` Jens Axboe
2019-10-25 14:18           ` Jens Axboe
2019-10-25 14:21             ` Jens Axboe
2019-10-29 19:17               ` Bijan Mottahedeh
2019-10-29 19:23                 ` Bijan Mottahedeh
2019-10-29 19:27                   ` Jens Axboe
2019-10-29 19:31                     ` Bijan Mottahedeh
2019-10-29 19:33                       ` Jens Axboe [this message]
2019-10-29 19:40                         ` Bijan Mottahedeh
2019-10-29 19:46                           ` Jens Axboe
2019-10-29 19:51                             ` Bijan Mottahedeh
2019-10-29 19:52                               ` Jens Axboe
2019-10-30  1:02                                 ` Jens Axboe
2019-10-30 14:02                                   ` Bijan Mottahedeh
2019-10-30 14:18                                     ` Jens Axboe
2019-10-30 17:32                                       ` Jens Axboe
2019-10-30 19:21                                         ` Bijan Mottahedeh
2019-10-30 19:26                                           ` Jens Axboe
2019-10-25 14:42             ` Bijan Mottahedeh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=057bb6f9-29ec-1160-a1b1-00c57b610282@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=bijan.mottahedeh@oracle.com \
    --cc=linux-block@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).