* [PATCH RFC] nvme: fix race condition between remove and scan_work
@ 2019-04-11 13:32 Yufen Yu
  2019-04-19 12:28 ` yuyufen
  2019-04-24 16:23 ` Sagi Grimberg
  0 siblings, 2 replies; 7+ messages in thread
From: Yufen Yu @ 2019-04-11 13:32 UTC (permalink / raw)


There is a race condition between nvme_remove and nvme_scan_work, as
follows:

sysfs_kf_write
nvme_remove              nvme_queue_scan             timeout_worker
                          queue scan_work

                          nvme_scan_work

set NVME_CTRL_DELETING
flush_work(&ctrl->scan_work)
 wait for scan_work to complete

                          nvme_scan_ns_list
                           nvme_validate_ns
                            nvme_submit_sync_cmd(admin_q)
                             IO timeout
                                                      nvme_timeout
                                                       nvme_dev_disable
                                                        nvme_suspend_queue(admin_q)
                                                         blk_mq_quiesce_queue

                                                       nvme_reset_ctrl
                                                        return -EBUSY

                          nvme_scan_ns_sequential
                           nvme_submit_sync_cmd(admin_q)
                            IO cannot be issued, as admin_q is suspended
                           wait_for_completion_io_timeout

nvme_remove() waits for nvme_scan_work() to finish, and scan_work waits
for its IO to complete. When an IO previously issued by scan_work times out,
nvme_timeout() will suspend the admin queue and leave it in the quiesced state.

Since the controller state is 'NVME_CTRL_DELETING', nvme_reset_ctrl()
cannot start the reset work to recover the admin queue. The next IO request
issued to admin_q is then stalled in the blk-mq 'ctx' queue forever.
As a result, both the remove and scan_work threads are stuck.

Fix this problem by manually recovering the admin queue after
nvme_reset_ctrl() returns -EBUSY. Since nvme_suspend_queue() has already
cleared the 'NVMEQ_ENABLED' flag on the admin queue, all requests issued
after the unquiesce will fail with an IO error instead of being dispatched.
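
For reference, here is a simplified, paraphrased sketch (not the exact
upstream code) of the two driver behaviours this fix relies on:
nvme_reset_ctrl() fails with -EBUSY when the state machine refuses the
transition to RESETTING, and nvme_queue_rq() fails requests on a queue
whose NVMEQ_ENABLED bit was cleared by nvme_dev_disable():

	/* Paraphrased sketch; not the exact upstream code. */
	int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
	{
		/*
		 * The transition to RESETTING is refused while the
		 * controller is DELETING, so the reset work is never
		 * queued and the caller gets -EBUSY.
		 */
		if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
			return -EBUSY;
		if (!queue_work(nvme_reset_wq, &ctrl->reset_work))
			return -EBUSY;
		return 0;
	}

	static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
					  const struct blk_mq_queue_data *bd)
	{
		struct nvme_queue *nvmeq = hctx->driver_data;

		/*
		 * nvme_dev_disable() -> nvme_suspend_queue() has cleared
		 * NVMEQ_ENABLED, so once the queue is unquiesced, pending
		 * requests complete with an error instead of being
		 * dispatched to the disabled device.
		 */
		if (unlikely(!test_bit(NVMEQ_ENABLED, &nvmeq->flags)))
			return BLK_STS_IOERR;

		/* ... normal command setup and submission elided ... */
		return BLK_STS_OK;
	}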

Cc: Keith Busch <kbusch at kernel.org>
Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: James Smart <james.smart at broadcom.com>
Cc: Bart Van Assche <bvanassche at acm.org>
Signed-off-by: Yufen Yu <yuyufen at huawei.com>
---
 drivers/nvme/host/pci.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a90cf5d63aac..77807515d0e3 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1272,6 +1272,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	struct request *abort_req;
 	struct nvme_command cmd;
 	u32 csts = readl(dev->bar + NVME_REG_CSTS);
+	int error = 0;
 
 	/* If PCI error recovery process is happening, we cannot reset or
 	 * the recovery mechanism will surely fail.
@@ -1329,7 +1330,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
+		/*
+		 * If the controller reset fails, drain all requests in the
+		 * ctx and elevator queues to avoid IO getting stuck forever.
+		 */
+		error = nvme_reset_ctrl(&dev->ctrl);
+		if (error)
+			blk_mq_unquiesce_queue(dev->ctrl.admin_q);
 
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_DONE;
-- 
2.16.2.dirty


* [PATCH RFC] nvme: fix race condition between remove and scan_work
  2019-04-11 13:32 [PATCH RFC] nvme: fix race condition between remove and scan_work Yufen Yu
@ 2019-04-19 12:28 ` yuyufen
  2019-04-24 16:23 ` Sagi Grimberg
  1 sibling, 0 replies; 7+ messages in thread
From: yuyufen @ 2019-04-19 12:28 UTC (permalink / raw)


ping ?


On 2019/4/11 21:32, Yufen Yu wrote:
> There is a race condition between nvme_remove and nvme_scan_work, as
> follows:
>
> sysfs_kf_write
> nvme_remove            nvme_queue_scan            timeout_worker
>                         queue scan_work
>
>                         nvme_scan_work
>
> set NVME_CTRL_DELETING
> flush_work(&ctrl->scan_work)
> wait_for_completion scan_work
>
>                         nvme_scan_ns_list
>                          nvme_validate_ns
>                           nvme_submit_sync_cmd(admin_q)
>                            IO timeout
>                                                    nvme_timeout
>                                                     nvme_dev_disable
>                                                      nvme_suspend_queue(admin_q)
>                                                       blk_mq_quiesce_queue
>
>                                                     nvme_reset_ctrl
>                                                      return -EBUSY
>                         nvme_scan_ns_sequential
>                           nvme_submit_sync_cmd(admin_q)
>                            IO cannot be issued, as admin_q is suspended
>                           wait_for_completion_io_timeout
>
> nvme_remove() waits for nvme_scan_work() to finish, and scan_work waits
> for its IO to complete. When an IO previously issued by scan_work times out,
> nvme_timeout() will suspend the admin queue and leave it in the quiesced state.
>
> Since the controller state is 'NVME_CTRL_DELETING', nvme_reset_ctrl()
> cannot start the reset work to recover the admin queue. The next IO request
> issued to admin_q is then stalled in the blk-mq 'ctx' queue forever.
> As a result, both the remove and scan_work threads are stuck.
>
> Fix this problem by manually recovering the admin queue after
> nvme_reset_ctrl() returns -EBUSY. Since nvme_suspend_queue() has already
> cleared the 'NVMEQ_ENABLED' flag on the admin queue, all requests issued
> after the unquiesce will fail with an IO error instead of being dispatched.
>
> Cc: Keith Busch <kbusch at kernel.org>
> Cc: Sagi Grimberg <sagi at grimberg.me>
> Cc: James Smart <james.smart at broadcom.com>
> Cc: Bart Van Assche <bvanassche at acm.org>
> Signed-off-by: Yufen Yu <yuyufen at huawei.com>
> ---
>   drivers/nvme/host/pci.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a90cf5d63aac..77807515d0e3 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1272,6 +1272,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>   	struct request *abort_req;
>   	struct nvme_command cmd;
>   	u32 csts = readl(dev->bar + NVME_REG_CSTS);
> +	int error = 0;
>   
>   	/* If PCI error recovery process is happening, we cannot reset or
>   	 * the recovery mechanism will surely fail.
> @@ -1329,7 +1330,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>   			 "I/O %d QID %d timeout, reset controller\n",
>   			 req->tag, nvmeq->qid);
>   		nvme_dev_disable(dev, false);
> -		nvme_reset_ctrl(&dev->ctrl);
> +		/*
> +		 * If the controller reset fails, drain all requests in the
> +		 * ctx and elevator queues to avoid IO getting stuck forever.
> +		 */
> +		error = nvme_reset_ctrl(&dev->ctrl);
> +		if (error)
> +			blk_mq_unquiesce_queue(dev->ctrl.admin_q);
>   
>   		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>   		return BLK_EH_DONE;


* [PATCH RFC] nvme: fix race condition between remove and scan_work
  2019-04-11 13:32 [PATCH RFC] nvme: fix race condition between remove and scan_work Yufen Yu
  2019-04-19 12:28 ` yuyufen
@ 2019-04-24 16:23 ` Sagi Grimberg
  2019-04-24 16:26   ` Keith Busch
  1 sibling, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2019-04-24 16:23 UTC (permalink / raw)




On 4/11/19 6:32 AM, Yufen Yu wrote:
> There is a race condition between nvme_remove and nvme_scan_work, as
> follows:
> 
> sysfs_kf_write
> nvme_remove            nvme_queue_scan            timeout_worker
>                         queue scan_work
> 
> 		       nvme_scan_work
> 
> set NVME_CTRL_DELETING
> flush_work(&ctrl->scan_work)
> wait_for_completion scan_work
> 
>                         nvme_scan_ns_list
> 		        nvme_validate_ns
> 			 nvme_submit_sync_cmd(admin_q)
>                            IO timeout
>                                                    nvme_timeout
> 						   nvme_dev_disable
> 						    nvme_suspend_queue(admin_q)
> 							blk_mq_quiesce_queue
> 
>                                                     nvme_reset_ctrl
> 						    return -EBUSY
> 
>                          nvme_scan_ns_sequential
> 			 nvme_submit_sync_cmd(admin_q)
>                            IO cannot be issued, as admin_q is suspended
>                           wait_for_completion_io_timeout
> 

Should we still attempt nvme_scan_ns_sequential() if nvme_scan_ns_list()
failed?
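
(For reference, the fallback in question — a paraphrased sketch of the
nvme_scan_work() structure of that era, not the exact upstream code: the
namespace-list scan is tried first, and the sequential scan runs as a
fallback when it fails, e.g. after a timed-out Identify command.)

	/* Paraphrased sketch; not the exact upstream code. */
	static void nvme_scan_work(struct work_struct *work)
	{
		struct nvme_ctrl *ctrl =
			container_of(work, struct nvme_ctrl, scan_work);
		struct nvme_id_ctrl *id;
		unsigned nn;

		if (nvme_identify_ctrl(ctrl, &id))
			return;
		nn = le32_to_cpu(id->nn);

		if (ctrl->vs >= NVME_VS(1, 1, 0) &&
		    !(ctrl->quirks & NVME_QUIRK_IDENTIFY_CNS)) {
			if (!nvme_scan_ns_list(ctrl, nn))
				goto out_free_id;
		}
		/* List scan failed: fall back to probing every NSID in turn. */
		nvme_scan_ns_sequential(ctrl, nn);
	out_free_id:
		kfree(id);
	}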

I still think that I/Os cannot hang forever when issued concurrently
with controller removal...

> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a90cf5d63aac..77807515d0e3 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1272,6 +1272,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>   	struct request *abort_req;
>   	struct nvme_command cmd;
>   	u32 csts = readl(dev->bar + NVME_REG_CSTS);
> +	int error = 0;

This can move to the usage scope, right?

>   
>   	/* If PCI error recovery process is happening, we cannot reset or
>   	 * the recovery mechanism will surely fail.
> @@ -1329,7 +1330,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>   			 "I/O %d QID %d timeout, reset controller\n",
>   			 req->tag, nvmeq->qid);
>   		nvme_dev_disable(dev, false);
> -		nvme_reset_ctrl(&dev->ctrl);
> +		/*
> +		 * If the controller reset fails, drain all requests in the
> +		 * ctx and elevator queues to avoid IO getting stuck forever.
> +		 */
> +		error = nvme_reset_ctrl(&dev->ctrl);
> +		if (error)
> +			blk_mq_unquiesce_queue(dev->ctrl.admin_q);

Is it just DELETING state that is acceptable here? or can we meet other
states that fail transition to RESETTING (CONNECTING/DEAD)?
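
(For reference, a paraphrased sketch of the state-machine gate in
nvme_change_ctrl_state() for that era, not the exact upstream code: only
NEW, LIVE and ADMIN_ONLY may transition to RESETTING, so DELETING,
CONNECTING and DEAD would all make nvme_reset_ctrl() return -EBUSY here.)

	case NVME_CTRL_RESETTING:
		switch (old_state) {
		case NVME_CTRL_NEW:
		case NVME_CTRL_LIVE:
		case NVME_CTRL_ADMIN_ONLY:
			changed = true;
			break;
		default:
			/* DELETING, CONNECTING, DEAD, ...: transition refused */
			break;
		}
		break;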


* [PATCH RFC] nvme: fix race condition between remove and scan_work
  2019-04-24 16:23 ` Sagi Grimberg
@ 2019-04-24 16:26   ` Keith Busch
  2019-04-24 16:42     ` Sagi Grimberg
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2019-04-24 16:26 UTC (permalink / raw)


On Wed, Apr 24, 2019 at 09:23:10AM -0700, Sagi Grimberg wrote:
> >   	/* If PCI error recovery process is happening, we cannot reset or
> >   	 * the recovery mechanism will surely fail.
> > @@ -1329,7 +1330,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >   			 "I/O %d QID %d timeout, reset controller\n",
> >   			 req->tag, nvmeq->qid);
> >   		nvme_dev_disable(dev, false);
> > -		nvme_reset_ctrl(&dev->ctrl);
> > +		/*
> > +		 * If the controller reset fails, drain all requests in the
> > +		 * ctx and elevator queues to avoid IO getting stuck forever.
> > +		 */
> > +		error = nvme_reset_ctrl(&dev->ctrl);
> > +		if (error)
> > +			blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> 
> Is it just DELETING state that is acceptable here? or can we meet other
> states that fail transition to RESETTING (CONNECTING/DEAD)?

It could be connecting or already scheduled resetting, in which case we
wouldn't want to unquiesce.

When we do want to unquiesce, though, we also want to do that to the
IO queues, not just the admin queue. Untested below, but this might be
in the right direction:

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a90cf5d63aac..acfb34c945b2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1315,6 +1315,10 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		nvme_dev_disable(dev, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_DONE;
+	case NVME_CTRL_DELETING:
+		nvme_dev_disable(dev, true);
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_DONE;
 	default:
 		break;
 	}
@@ -2438,8 +2442,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	 * must flush all entered requests to their failed completion to avoid
 	 * deadlocking blk-mq hot-cpu notifier.
 	 */
-	if (shutdown)
+	if (shutdown) {
 		nvme_start_queues(&dev->ctrl);
+		if (dev->ctrl.admin_q)
+			blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+	}
 	mutex_unlock(&dev->shutdown_lock);
 }
 
--
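
(For context on unquiescing "the IO queues, not just the admin queue":
nvme_start_queues() only walks the per-namespace request queues, which is
why the admin queue needs the explicit blk_mq_unquiesce_queue() call added
above. A paraphrased sketch of its behaviour in that era, not the exact
upstream code:)

	void nvme_start_queues(struct nvme_ctrl *ctrl)
	{
		struct nvme_ns *ns;

		down_read(&ctrl->namespaces_rwsem);
		list_for_each_entry(ns, &ctrl->namespaces, list)
			blk_mq_unquiesce_queue(ns->queue);	/* IO queues only */
		up_read(&ctrl->namespaces_rwsem);
	}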


* [PATCH RFC] nvme: fix race condition between remove and scan_work
  2019-04-24 16:26   ` Keith Busch
@ 2019-04-24 16:42     ` Sagi Grimberg
  2019-04-30 13:14       ` yuyufen
  0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2019-04-24 16:42 UTC (permalink / raw)



>> Is it just DELETING state that is acceptable here? or can we meet other
>> states that fail transition to RESETTING (CONNECTING/DEAD)?
> 
> It could be connecting or already scheduled resetting, in which case we
> wouldn't want to unquiesce.
> 
> When we do want to unquiesce, though, we also want to do that to the
> IO queues, not just the admin queue. Untested below, but this might be
> in the right direction:

This looks good, Keith. Yufen, does that address your race?

> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a90cf5d63aac..acfb34c945b2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1315,6 +1315,10 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>   		nvme_dev_disable(dev, false);
>   		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>   		return BLK_EH_DONE;
> +	case NVME_CTRL_DELETING:
> +		nvme_dev_disable(dev, true);
> +		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> +		return BLK_EH_DONE;
>   	default:
>   		break;
>   	}
> @@ -2438,8 +2442,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>   	 * must flush all entered requests to their failed completion to avoid
>   	 * deadlocking blk-mq hot-cpu notifier.
>   	 */
> -	if (shutdown)
> +	if (shutdown) {
>   		nvme_start_queues(&dev->ctrl);
> +		if (dev->ctrl.admin_q)
> +			blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> +	}
>   	mutex_unlock(&dev->shutdown_lock);
>   }
>   
> --
> 


* [PATCH RFC] nvme: fix race condition between remove and scan_work
  2019-04-24 16:42     ` Sagi Grimberg
@ 2019-04-30 13:14       ` yuyufen
  2019-04-30 14:58         ` Sagi Grimberg
  0 siblings, 1 reply; 7+ messages in thread
From: yuyufen @ 2019-04-30 13:14 UTC (permalink / raw)



On 2019/4/25 0:42, Sagi Grimberg wrote:
>
>>> Is it just DELETING state that is acceptable here? or can we meet other
>>> states that fail transition to RESETTING (CONNECTING/DEAD)?
>>
>> It could be connecting or already scheduled resetting, in which case we
>> wouldn't want to unquiesce.
>>
>> When we do want to unquiesce, though, we also want to do that to the
>> IO queues, not just the admin queue. Untested below, but this might be
>> in the right direction:
>
> This looks good, Keith. Yufen, does that address your race?

Thanks a lot for your response and suggestion.
The following patch addresses our problem. :-)

Yufen
thanks

>
>>
>> ---
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index a90cf5d63aac..acfb34c945b2 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -1315,6 +1315,10 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>>  		nvme_dev_disable(dev, false);
>>  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>>  		return BLK_EH_DONE;
>> +	case NVME_CTRL_DELETING:
>> +		nvme_dev_disable(dev, true);
>> +		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>> +		return BLK_EH_DONE;
>>  	default:
>>  		break;
>>  	}
>> @@ -2438,8 +2442,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>  	 * must flush all entered requests to their failed completion to avoid
>>  	 * deadlocking blk-mq hot-cpu notifier.
>>  	 */
>> -	if (shutdown)
>> +	if (shutdown) {
>>  		nvme_start_queues(&dev->ctrl);
>> +		if (dev->ctrl.admin_q)
>> +			blk_mq_unquiesce_queue(dev->ctrl.admin_q);
>> +	}
>>  	mutex_unlock(&dev->shutdown_lock);
>>  }
>> --
>>
>
> .
>


* [PATCH RFC] nvme: fix race condition between remove and scan_work
  2019-04-30 13:14       ` yuyufen
@ 2019-04-30 14:58         ` Sagi Grimberg
  0 siblings, 0 replies; 7+ messages in thread
From: Sagi Grimberg @ 2019-04-30 14:58 UTC (permalink / raw)



>> This looks good, Keith. Yufen, does that address your race?
> 
> Thanks a lot for your response and suggestion.
> The following patch addresses our problem. :-)

Keith, care to send a patch?


Thread overview: 7 messages
2019-04-11 13:32 [PATCH RFC] nvme: fix race condition between remove and scan_work Yufen Yu
2019-04-19 12:28 ` yuyufen
2019-04-24 16:23 ` Sagi Grimberg
2019-04-24 16:26   ` Keith Busch
2019-04-24 16:42     ` Sagi Grimberg
2019-04-30 13:14       ` yuyufen
2019-04-30 14:58         ` Sagi Grimberg
