From: "jianchao.wang" <jianchao.w.wang@oracle.com>
To: Ming Lei <ming.lei@redhat.com>
Cc: Sagi Grimberg <sagi@grimberg.me>, Xiao Liang <xiliang@redhat.com>,
	linux-nvme@lists.infradead.org,
	Keith Busch <keith.busch@intel.com>,
	stable@vger.kernel.org, Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED
Date: Sun, 28 Jan 2018 17:01:18 +0800	[thread overview]
Message-ID: <ead2bb8f-3628-ab15-f2f1-d84ffb1a3ed4@oracle.com> (raw)
In-Reply-To: <20180127154356.GA20712@ming.t460p>

Hi Ming,

Thanks for taking the time to look at this; it's really appreciated.

On 01/27/2018 11:44 PM, Ming Lei wrote:
>>>> The best way to fix this is to ensure the timeout path has completed before cancelling the
>>>> previously outstanding requests. (Just ignore the case where nvme_timeout invokes
>>>> nvme_dev_disable; that has to be fixed another way.)
>>> Your approach may look a bit cleaner and simplify the implementation, but it seems
>>> unnecessary.
>>>
>>> So could you explain a bit what exact issue you are worried about? A deadlock,
>>> or something else?
>> There is indeed a potential issue, but it is in a very narrow window.
>> Please refer to https://lkml.org/lkml/2018/1/19/68
> OK, here is the description from the above link:
> 
>> Yes, once controller disabling completes, any started requests will be
>> handled and cannot expire. But before that boundary, an nvme_timeout
>> context could run in parallel with nvme_dev_disable. If a timeout
>> path grabs a request, then nvme_dev_disable cannot get it and cancel it.
>>
>> So even after nvme_dev_disable completes, there could still be a request held in an nvme_timeout context.
>>
>> The worst case is:
>> nvme_timeout                              nvme_reset_work
>> if (ctrl->state == RESETTING)               nvme_dev_disable
>>    nvme_dev_disable                         initializing procedure
>>
>> i.e., nvme_dev_disable runs in parallel with the re-init procedure in nvme_reset_work.
> OK, that is the issue: the nvme_dev_disable() in nvme_timeout() may
> disable the queues again and cause a hang in nvme_reset_work().

Yes. :)
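
For reference, the nvme_timeout() branch in question looks roughly like
the following. This is only a simplified sketch of the 4.15-era
drivers/nvme/host/pci.c logic, not the exact code:

static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct nvme_dev *dev = iod->nvmeq->dev;

	/*
	 * If the controller is already RESETTING, the timeout handler
	 * disables the device itself.  Nothing here prevents this
	 * nvme_dev_disable() from running in parallel with the re-init
	 * part of nvme_reset_work().
	 */
	if (dev->ctrl.state == NVME_CTRL_RESETTING) {
		nvme_dev_disable(dev, false);
		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
		return BLK_EH_HANDLED;
	}

	/* abort/shutdown handling of the real function omitted */
	return BLK_EH_RESET_TIMER;
}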

> 
> Looks like Keith's suggestion of introducing nvme_sync_queues() should be
> enough to kill the race, but it seems nvme_sync_queues() should be
> called unconditionally at the entry of nvme_reset_work().

Thanks for your suggestion.
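
If it helps, this is roughly what I understand the suggestion to be. A
minimal sketch only, assuming nvme_sync_queues() simply runs
blk_sync_queue() on every namespace queue (which cancels the timer and
flushes q->timeout_work) so that any in-flight nvme_timeout() has
finished before the reset proceeds; names and locking are from my
reading of the current code, not a tested patch:

static void nvme_sync_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	mutex_lock(&ctrl->namespaces_mutex);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_sync_queue(ns->queue);	/* waits for timeout_work */
	mutex_unlock(&ctrl->namespaces_mutex);
}

static void nvme_reset_work(struct work_struct *work)
{
	struct nvme_dev *dev =
		container_of(work, struct nvme_dev, ctrl.reset_work);

	/*
	 * Called unconditionally at entry, per your suggestion: after
	 * this, no nvme_timeout() instance can still hold a request
	 * while we tear down and re-initialize the controller.
	 */
	nvme_sync_queues(&dev->ctrl);

	/* existing disable + re-init sequence follows unchanged */
}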

Jianchao

Thread overview: 16+ messages
2018-01-25  8:10 [PATCH] nvme: don't retry request marked as NVME_REQ_CANCELLED Ming Lei
2018-01-25  8:52 ` jianchao.wang
2018-01-25 10:15   ` Ming Lei
2018-01-27 12:33     ` jianchao.wang
2018-01-27 13:31       ` Ming Lei
2018-01-27 14:29         ` jianchao.wang
2018-01-27 15:44           ` Ming Lei
2018-01-28  9:01             ` jianchao.wang [this message]
