From: Jens Axboe <axboe@kernel.dk>
To: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Cc: linux-block@vger.kernel.org
Subject: Re: [RFC 0/2] io_uring: examine request result only after completion
Date: Fri, 25 Oct 2019 08:07:20 -0600
Message-ID: <b7abb363-d665-b46a-9fb5-d01e7a6ce4d6@kernel.dk>
In-Reply-To: <fa82e9fc-caf7-a94a-ebff-536413e9ecce@oracle.com>

On 10/25/19 7:46 AM, Bijan Mottahedeh wrote:
> 
> On 10/24/19 3:31 PM, Jens Axboe wrote:
>> On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
>>> On 10/24/19 10:09 AM, Jens Axboe wrote:
>>>> On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
>>>>> Running an fio test consistently crashes the kernel with the trace included
>>>>> below.  The root cause seems to be the code in __io_submit_sqe() that
>>>>> checks the result of a request for -EAGAIN in polled mode, without
>>>>> ensuring first that the request has completed:
>>>>>
>>>>> 	if (ctx->flags & IORING_SETUP_IOPOLL) {
>>>>> 		if (req->result == -EAGAIN)
>>>>> 			return -EAGAIN;
>>>> I'm a little confused, because we should be holding the submission
>>>> reference to the request still at this point. So how is it going away?
>>>> I must be missing something...
>>> I don't think the submission reference is going away...
>>>
>>> I *think* the problem has to do with the fact that
>>> io_complete_rw_iopoll(), which sets REQ_F_IOPOLL_COMPLETED, is being
>>> called from interrupt context in my configuration, and so there is a
>>> potential race between updating the request there and checking it in
>>> __io_submit_sqe().
>>>
>>> My first workaround was to simply poll for REQ_F_IOPOLL_COMPLETED in the
>>> code snippet above:
>>>
>>>        if (req->result == -EAGAIN) {
>>>            /* spin until the IRQ-side completion marks the request done */
>>>            while (!(req->flags & REQ_F_IOPOLL_COMPLETED))
>>>                cpu_relax();
>>>            return -EAGAIN;
>>>        }
>>>
>>> and that got rid of the problem.
>> But that will not work at all for a proper poll setup, where you don't
>> trigger any IRQs... It only happens to work for this case because you're
>> still triggering interrupts. And even in that case, it's not a real
>> solution, though I don't think that's the argument here ;-)
> 
> Sure.
> 
> I'm just curious though as to how it would break the poll case, because
> io_complete_rw_iopoll() would still be called, just through polling;
> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete()
> should be able to reliably check req->result.

It'd break the poll case because the task doing the submission is
generally also the one that finds and reaps completions. Hence if you
block that task just polling on that completion bit, you are preventing
that very task from going and reaping completions. The condition would
never become true, and you are now looping forever.
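
Sketched out as hypothetical pseudo-code (submissions_pending() is made
up, and the io_uring calls are real names but with simplified arguments):

	/*
	 * The same task drives both halves.  If the submission side spins
	 * on REQ_F_IOPOLL_COMPLETED, it never gets back around to the
	 * reaping side, which is the only thing that would set that flag.
	 */
	while (submissions_pending(ctx)) {
		io_submit_sqes(ctx);		/* would spin in here forever */
		io_iopoll_getevents(ctx);	/* never reached */
	}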

> The same poll test seemed to run ok with nvme interrupts not being
> triggered. Anyway, no argument that it's not needed!

A few reasons why it would make progress:

- You eventually trigger a timeout on the nvme side, as blk-mq finds the
  request hasn't been completed by an IRQ. But that's a 30 second ordeal
  before that event occurs.

- There were still interrupts enabled.

- You have two threads, one doing submission and one doing completions.
  Maybe using SQPOLL? If that's the case, then yes, it'd still work as
  you have separate threads for submission and completion.

For the "generic" case of just using one thread and IRQs disabled, it'd
deadlock.
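
With SQPOLL, submission runs in a kernel thread while the application
reaps completions, so the two halves can't starve each other. A minimal
setup sketch (values picked arbitrarily; raw syscall used to keep it
self-contained):

	#include <linux/io_uring.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		struct io_uring_params p = {
			.flags		= IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL,
			.sq_thread_idle	= 2000,	/* ms idle before the sq thread sleeps */
		};
		/* 64 sq entries; needs a poll-capable device and privileges */
		int ring_fd = syscall(__NR_io_uring_setup, 64, &p);

		return ring_fd < 0;
	}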

>> I see what the race is now; it's specific to IRQ-driven polling. We
>> really should just disallow that, to be honest, it doesn't make any
>> sense. But let me think about if we can do a reasonable solution to this
>> that doesn't involve adding overhead for a proper setup.
> 
> It's a nonsensical config in a way and so disallowing it would make
> the most sense.

Definitely. The nvme driver should not set .poll() if it doesn't have
non-IRQ poll queues. Something like this:


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 869f462e6b6e..ac1b0a9387a4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1594,6 +1594,16 @@ static const struct blk_mq_ops nvme_mq_ops = {
 	.poll		= nvme_poll,
 };
 
+static const struct blk_mq_ops nvme_mq_nopoll_ops = {
+	.queue_rq	= nvme_queue_rq,
+	.complete	= nvme_pci_complete_rq,
+	.commit_rqs	= nvme_commit_rqs,
+	.init_hctx	= nvme_init_hctx,
+	.init_request	= nvme_init_request,
+	.map_queues	= nvme_pci_map_queues,
+	.timeout	= nvme_timeout,
+};
+
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
 {
 	if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q)) {
@@ -2269,11 +2279,14 @@ static void nvme_dev_add(struct nvme_dev *dev)
 	int ret;
 
 	if (!dev->ctrl.tagset) {
-		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
 		dev->tagset.nr_maps = 2; /* default + read */
-		if (dev->io_queues[HCTX_TYPE_POLL])
+		if (dev->io_queues[HCTX_TYPE_POLL]) {
+			dev->tagset.ops = &nvme_mq_ops;
 			dev->tagset.nr_maps++;
+		} else {
+			dev->tagset.ops = &nvme_mq_nopoll_ops;
+		}
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
 		dev->tagset.numa_node = dev_to_node(dev->dev);
 		dev->tagset.queue_depth =
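
The nopoll table mirrors nvme_mq_ops except that ->poll is left unset, so
blk-mq should never mark such a queue poll-capable (QUEUE_FLAG_POLL) to
begin with.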

-- 
Jens Axboe


