From: "jianchao.wang" <jianchao.w.wang@oracle.com>
To: Keith Busch <keith.busch@intel.com>
Cc: axboe@fb.com, hch@lst.de, sagi@grimberg.me, maxg@mellanox.com,
	james.smart@broadcom.com, linux-nvme@lists.infradead.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing
Date: Fri, 19 Jan 2018 17:02:06 +0800	[thread overview]
Message-ID: <84b4e3bc-fe23-607e-9d5e-bb5644eedb54@oracle.com> (raw)
In-Reply-To: <20180119084218.GF12043@localhost.localdomain>

Hi Keith

Thank you for your kind and detailed response, and for the patch.

On 01/19/2018 04:42 PM, Keith Busch wrote:
> On Fri, Jan 19, 2018 at 04:14:02PM +0800, jianchao.wang wrote:
>> On 01/19/2018 04:01 PM, Keith Busch wrote:
>>> The nvme_dev_disable routine makes forward progress without depending on
>>> timeout handling to complete expired commands. Once controller disabling
>>> completes, there can't possibly be any started requests that can expire.
>>> So we don't need nvme_timeout to do anything for requests above the
>>> boundary.
>>>
>> Yes, once controller disabling completes, any started requests will be handled and cannot expire.
>> But before that _boundary_, a nvme_timeout context could still be running in parallel with nvme_dev_disable.
>> If the timeout path grabs a request first, nvme_dev_disable cannot reach it to cancel it.
>> So even after nvme_dev_disable completes, there could still be a request held in the nvme_timeout context.
>>
>> The worst case is :
>> nvme_timeout                              nvme_reset_work
>> if (ctrl->state == RESETTING )              nvme_dev_disable
>>     nvme_dev_disable                        initializing procedure
>>
>> Here nvme_dev_disable runs in parallel with the re-init procedure in nvme_reset_work.
> 
> Okay, I see what you're saying. That's a pretty obscure case, as normally
> we enter nvme_reset_work with the queues already disabled, but there
> are a few cases where we need nvme_reset_work to handle that.
> 
> But if we are in that case, I think we really just want to sync the
> queues. What do you think of this?
> 
> ---
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fde6fd2e7eef..516383193416 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3520,6 +3520,17 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
>  }
>  EXPORT_SYMBOL_GPL(nvme_stop_queues);
>  
> +void nvme_sync_queues(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_ns *ns;
> +
> +	mutex_lock(&ctrl->namespaces_mutex);
> +	list_for_each_entry(ns, &ctrl->namespaces, list)
> +		blk_sync_queue(ns->queue);
> +	mutex_unlock(&ctrl->namespaces_mutex);
> +}
> +EXPORT_SYMBOL_GPL(nvme_sync_queues);
> +
>  void nvme_start_queues(struct nvme_ctrl *ctrl)
>  {
>  	struct nvme_ns *ns;
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 8e7fc1b041b7..45b1b8ceddb6 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -374,6 +374,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
>  
>  void nvme_stop_queues(struct nvme_ctrl *ctrl);
>  void nvme_start_queues(struct nvme_ctrl *ctrl);
> +void nvme_sync_queues(struct nvme_ctrl *ctrl);
>  void nvme_kill_queues(struct nvme_ctrl *ctrl);
>  void nvme_unfreeze(struct nvme_ctrl *ctrl);
>  void nvme_wait_freeze(struct nvme_ctrl *ctrl);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a2ffb557b616..1fe00be22ad1 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2289,8 +2289,10 @@ static void nvme_reset_work(struct work_struct *work)
>  	 * If we're called to reset a live controller first shut it down before
>  	 * moving on.
>  	 */
> -	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> +	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE) {
>  		nvme_dev_disable(dev, false);
> +		nvme_sync_queues(&dev->ctrl);
> +	}
>  
>  	result = nvme_pci_enable(dev);
>  	if (result)
> --
> 

We should not use blk_sync_queue here, because it would also cancel the requeue_work and run_work.
Just flush_work(&q->timeout_work) should be enough.
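Something along these lines is what I have in mind -- just a rough, untested sketch that reuses the namespaces loop from your patch:

void nvme_sync_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	mutex_lock(&ctrl->namespaces_mutex);
	/*
	 * Wait for any nvme_timeout() already running on these queues
	 * to finish; unlike blk_sync_queue(), this does not cancel
	 * requeue_work or run_work.
	 */
	list_for_each_entry(ns, &ctrl->namespaces, list)
		flush_work(&ns->queue->timeout_work);
	mutex_unlock(&ctrl->namespaces_mutex);
}
EXPORT_SYMBOL_GPL(nvme_sync_queues);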

In addition, we could check NVME_CC_ENABLE inside nvme_dev_disable itself to avoid redundant invocations. :)
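
For that second point, I am imagining something like the following -- purely illustrative, not a tested patch, and we would still need to confirm that an early return is safe for the other callers of nvme_dev_disable:

static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
{
	/*
	 * Skip the teardown entirely if the controller was never
	 * enabled, so callers such as nvme_reset_work need not repeat
	 * the NVME_CC_ENABLE test themselves.
	 */
	if (!(dev->ctrl.ctrl_config & NVME_CC_ENABLE))
		return;

	/* ... existing shutdown and request-cancelling logic ... */
}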

Thanks
Jianchao

Thread overview: 40+ messages
2018-01-18 10:10 [PATCH V5 0/2] nvme-pci: fix the timeout case when reset is ongoing Jianchao Wang
2018-01-18 10:10 ` [PATCH V5 1/2] nvme-pci: introduce RECONNECTING state to mark initializing procedure Jianchao Wang
2018-01-18 10:17   ` Max Gurtovoy
2018-01-19  9:49     ` jianchao.wang
2018-01-18 15:23   ` James Smart
2018-01-18 10:10 ` [PATCH V5 2/2] nvme-pci: fixup the timeout case when reset is ongoing Jianchao Wang
2018-01-19  4:59   ` Keith Busch
2018-01-19  5:55     ` jianchao.wang
2018-01-19  6:05       ` Keith Busch
2018-01-19  6:53         ` jianchao.wang
2018-01-18 15:34 ` [PATCH V5 0/2] nvme-pci: fix " James Smart
2018-01-19  8:01 ` Keith Busch
2018-01-19  8:14   ` jianchao.wang
2018-01-19  8:42     ` Keith Busch
2018-01-19  9:02       ` jianchao.wang [this message]
2018-01-19 11:52         ` Keith Busch
2018-01-19 13:56           ` jianchao.wang
2018-01-20  2:11             ` Keith Busch
2018-01-20 14:07               ` jianchao.wang
2018-01-20 14:14                 ` jianchao.wang
