linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] nvme: fix hang in path of removing disk
@ 2017-05-17  1:27 Ming Lei
  2017-05-17  1:27 ` [PATCH 1/2] nvme: fix race between removing and reseting failure Ming Lei
  2017-05-17  1:27 ` [PATCH 2/2] nvme: avoid to hang in remove disk Ming Lei
  0 siblings, 2 replies; 18+ messages in thread
From: Ming Lei @ 2017-05-17  1:27 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg
  Cc: linux-nvme, Zhang Yi, linux-block, Ming Lei

The 1st patch fixes one race between resettting failure and remove.

The 2nd patch avoids hanging in del_gendisk() if del_gendisk() waits
for these submitted writeback requests.

Keith, about the 1st patch, as we talked, there are two directions for
fixing the 1st issue, I know you may be working towards treating this
reset failure as controller dead, and not completed yet. But if
nvme_remove_dead_ctrl() is needed in any one of reset failure path,
this patch is still required. Seems both two directions aren't
contradictorily.

Ming Lei (2):
  nvme: fix race between removing and reseting failure
  nvme: avoid to hang in remove disk

 drivers/nvme/host/core.c | 8 ++++++++
 drivers/nvme/host/pci.c  | 9 ++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-17  1:27 [PATCH 0/2] nvme: fix hang in path of removing disk Ming Lei
@ 2017-05-17  1:27 ` Ming Lei
  2017-05-17  6:38   ` Johannes Thumshirn
                     ` (3 more replies)
  2017-05-17  1:27 ` [PATCH 2/2] nvme: avoid to hang in remove disk Ming Lei
  1 sibling, 4 replies; 18+ messages in thread
From: Ming Lei @ 2017-05-17  1:27 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg
  Cc: linux-nvme, Zhang Yi, linux-block, Ming Lei, stable

When one NVMe PCI device is being resetted and found reset failue,
nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
are put into stopped first, then schedule .remove_work to release the driver.

Unfortunately if the driver is being released via sysfs store
just before the .remove_work is run, del_gendisk() from
nvme_remove() may hang forever because hw queues are stopped and
the submitted writeback IOs from fsync_bdev() can't be completed at all.

This patch fixes the following issue[1][2] by moving nvme_kill_queues()
into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
flushs .reset_work, and this way is reasonable and safe because
nvme_dev_disable() has started to suspend queues and canceled requests
already.

[1] test script
	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60

	sleep 35
	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
	echo 1 > $SYSFS_NVME_PCI_PATH/reset
	echo 1 > $SYSFS_NVME_PCI_PATH/remove
	echo 1 > /sys/bus/pci/rescan

[2] kernel hang log
[  492.232593] INFO: task nvme-test:5939 blocked for more than 120 seconds.
[  492.240081]       Not tainted 4.11.0.nvme_v4.11_debug_hang+ #3
[  492.246600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  492.255346] nvme-test       D    0  5939   5938 0x00000080
[  492.261475] Call Trace:
[  492.264215]  __schedule+0x289/0x8f0
[  492.268105]  ? write_cache_pages+0x14c/0x510
[  492.272873]  schedule+0x36/0x80
[  492.276381]  io_schedule+0x16/0x40
[  492.280181]  wait_on_page_bit_common+0x137/0x220
[  492.285336]  ? page_cache_tree_insert+0x120/0x120
[  492.290589]  __filemap_fdatawait_range+0x128/0x1a0
[  492.295941]  filemap_fdatawait_range+0x14/0x30
[  492.300902]  filemap_fdatawait+0x23/0x30
[  492.305282]  filemap_write_and_wait+0x4c/0x80
[  492.310151]  __sync_blockdev+0x1f/0x40
[  492.314336]  fsync_bdev+0x44/0x50
[  492.318039]  invalidate_partition+0x24/0x50
[  492.322710]  del_gendisk+0xcd/0x2e0
[  492.326608]  nvme_ns_remove+0x105/0x130 [nvme_core]
[  492.332054]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
[  492.338082]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
[  492.343519]  nvme_remove+0x5d/0x170 [nvme]
[  492.348096]  pci_device_remove+0x39/0xc0
[  492.352477]  device_release_driver_internal+0x141/0x1f0
[  492.358311]  device_release_driver+0x12/0x20
[  492.363072]  pci_stop_bus_device+0x8c/0xa0
[  492.367646]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
[  492.373965]  remove_store+0x7c/0x90
[  492.377852]  dev_attr_store+0x18/0x30
[  492.381941]  sysfs_kf_write+0x3a/0x50
[  492.386028]  kernfs_fop_write+0xff/0x180
[  492.390409]  __vfs_write+0x37/0x160
[  492.394304]  ? selinux_file_permission+0xe5/0x120
[  492.399556]  ? security_file_permission+0x3b/0xc0
[  492.404807]  vfs_write+0xb2/0x1b0
[  492.408508]  ? syscall_trace_enter+0x1d0/0x2b0
[  492.413462]  SyS_write+0x55/0xc0
[  492.417064]  do_syscall_64+0x67/0x180
[  492.421155]  entry_SYSCALL64_slow_path+0x25/0x25

Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fed803232edc..5e39abe57c56 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1887,6 +1887,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 
 	kref_get(&dev->ctrl.kref);
 	nvme_dev_disable(dev, false);
+
+	/*
+	 * nvme_dev_disable() has suspended queues, then no new I/O
+	 * can be submitted to hardware successfully any more, so
+	 * kill queues now for avoiding race between reset failure
+	 * and remove.
+	 */
+	nvme_kill_queues(&dev->ctrl);
 	if (!schedule_work(&dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
@@ -1993,7 +2001,6 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, remove_work);
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	nvme_kill_queues(&dev->ctrl);
 	if (pci_get_drvdata(pdev))
 		device_release_driver(&pdev->dev);
 	nvme_put_ctrl(&dev->ctrl);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/2] nvme: avoid to hang in remove disk
  2017-05-17  1:27 [PATCH 0/2] nvme: fix hang in path of removing disk Ming Lei
  2017-05-17  1:27 ` [PATCH 1/2] nvme: fix race between removing and reseting failure Ming Lei
@ 2017-05-17  1:27 ` Ming Lei
  2017-05-18 13:49   ` Christoph Hellwig
  1 sibling, 1 reply; 18+ messages in thread
From: Ming Lei @ 2017-05-17  1:27 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg
  Cc: linux-nvme, Zhang Yi, linux-block, Ming Lei, stable

If some writeback requests are submitted just before queue is killed,
and these requests may not be canceled in nvme_dev_disable() because
they are not started yet, it is still possible for blk-mq to hold
these requests in .requeue list.

So we have to abort these requests first before del_gendisk(), because
del_gendisk() may wait for completion of these requests.

Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d5e0906262ea..8eaeea86509a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2097,6 +2097,14 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 					&nvme_ns_attr_group);
 		if (ns->ndev)
 			nvme_nvm_unregister_sysfs(ns);
+		/*
+		 * If queue is dead, we have to abort requests in
+		 * requeue list because fsync_bdev() in removing disk
+		 * path may wait for these IOs, which can't
+		 * be submitted to hardware too.
+		 */
+		if (blk_queue_dying(ns->queue))
+			blk_mq_abort_requeue_list(ns->queue);
 		del_gendisk(ns->disk);
 		blk_mq_abort_requeue_list(ns->queue);
 		blk_cleanup_queue(ns->queue);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-17  1:27 ` [PATCH 1/2] nvme: fix race between removing and reseting failure Ming Lei
@ 2017-05-17  6:38   ` Johannes Thumshirn
  2017-05-17  7:01     ` Ming Lei
  2017-05-18 13:47   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Johannes Thumshirn @ 2017-05-17  6:38 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg
  Cc: linux-nvme, Zhang Yi, linux-block, stable

On 05/17/2017 03:27 AM, Ming Lei wrote:
> When one NVMe PCI device is being resetted and found reset failue,
> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> are put into stopped first, then schedule .remove_work to release the driver.
> 
> Unfortunately if the driver is being released via sysfs store
> just before the .remove_work is run, del_gendisk() from
> nvme_remove() may hang forever because hw queues are stopped and
> the submitted writeback IOs from fsync_bdev() can't be completed at all.
> 
> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> flushs .reset_work, and this way is reasonable and safe because
> nvme_dev_disable() has started to suspend queues and canceled requests
> already.
> 
> [1] test script
> 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60

Nit: the actual size after the -size parameter is missing.

Anyways:
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-17  6:38   ` Johannes Thumshirn
@ 2017-05-17  7:01     ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2017-05-17  7:01 UTC (permalink / raw)
  To: Johannes Thumshirn
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, Zhang Yi, linux-block, stable

On Wed, May 17, 2017 at 08:38:01AM +0200, Johannes Thumshirn wrote:
> On 05/17/2017 03:27 AM, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> > 
> > [1] test script
> > 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> > 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> > 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> 
> Nit: the actual size after the -size parameter is missing.

Forget to mention, $NVME_DISK has to be one partition of nvme disk, and
the type need to be 'filesystem' type for reproduction, then actual size
isn't needed.

> 
> Anyways:
> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>

Thanks for review!


Thanks,
Ming

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-17  1:27 ` [PATCH 1/2] nvme: fix race between removing and reseting failure Ming Lei
  2017-05-17  6:38   ` Johannes Thumshirn
@ 2017-05-18 13:47   ` Christoph Hellwig
  2017-05-18 15:04     ` Ming Lei
  2017-05-18 14:13   ` Keith Busch
  2017-05-19 14:41   ` Jens Axboe
  3 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-05-18 13:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, Zhang Yi, linux-block, stable

On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> When one NVMe PCI device is being resetted and found reset failue,
> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> are put into stopped first, then schedule .remove_work to release the driver.
> 
> Unfortunately if the driver is being released via sysfs store
> just before the .remove_work is run, del_gendisk() from
> nvme_remove() may hang forever because hw queues are stopped and
> the submitted writeback IOs from fsync_bdev() can't be completed at all.
> 
> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> flushs .reset_work, and this way is reasonable and safe because
> nvme_dev_disable() has started to suspend queues and canceled requests
> already.
> 
> [1] test script
> 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> 
> 	sleep 35
> 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> 	echo 1 > /sys/bus/pci/rescan
> 
> [2] kernel hang log
> [  492.232593] INFO: task nvme-test:5939 blocked for more than 120 seconds.
> [  492.240081]       Not tainted 4.11.0.nvme_v4.11_debug_hang+ #3
> [  492.246600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  492.255346] nvme-test       D    0  5939   5938 0x00000080
> [  492.261475] Call Trace:
> [  492.264215]  __schedule+0x289/0x8f0
> [  492.268105]  ? write_cache_pages+0x14c/0x510
> [  492.272873]  schedule+0x36/0x80
> [  492.276381]  io_schedule+0x16/0x40
> [  492.280181]  wait_on_page_bit_common+0x137/0x220
> [  492.285336]  ? page_cache_tree_insert+0x120/0x120
> [  492.290589]  __filemap_fdatawait_range+0x128/0x1a0
> [  492.295941]  filemap_fdatawait_range+0x14/0x30
> [  492.300902]  filemap_fdatawait+0x23/0x30
> [  492.305282]  filemap_write_and_wait+0x4c/0x80
> [  492.310151]  __sync_blockdev+0x1f/0x40
> [  492.314336]  fsync_bdev+0x44/0x50
> [  492.318039]  invalidate_partition+0x24/0x50
> [  492.322710]  del_gendisk+0xcd/0x2e0
> [  492.326608]  nvme_ns_remove+0x105/0x130 [nvme_core]
> [  492.332054]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
> [  492.338082]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
> [  492.343519]  nvme_remove+0x5d/0x170 [nvme]
> [  492.348096]  pci_device_remove+0x39/0xc0
> [  492.352477]  device_release_driver_internal+0x141/0x1f0
> [  492.358311]  device_release_driver+0x12/0x20
> [  492.363072]  pci_stop_bus_device+0x8c/0xa0
> [  492.367646]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
> [  492.373965]  remove_store+0x7c/0x90
> [  492.377852]  dev_attr_store+0x18/0x30
> [  492.381941]  sysfs_kf_write+0x3a/0x50
> [  492.386028]  kernfs_fop_write+0xff/0x180
> [  492.390409]  __vfs_write+0x37/0x160
> [  492.394304]  ? selinux_file_permission+0xe5/0x120
> [  492.399556]  ? security_file_permission+0x3b/0xc0
> [  492.404807]  vfs_write+0xb2/0x1b0
> [  492.408508]  ? syscall_trace_enter+0x1d0/0x2b0
> [  492.413462]  SyS_write+0x55/0xc0
> [  492.417064]  do_syscall_64+0x67/0x180
> [  492.421155]  entry_SYSCALL64_slow_path+0x25/0x25
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/pci.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fed803232edc..5e39abe57c56 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1887,6 +1887,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
>  
>  	kref_get(&dev->ctrl.kref);
>  	nvme_dev_disable(dev, false);

> +	/*
> +	 * nvme_dev_disable() has suspended queues, then no new I/O
> +	 * can be submitted to hardware successfully any more, so
> +	 * kill queues now for avoiding race between reset failure
> +	 * and remove.

How about this instead:

	/*
	 * nvme_dev_disable() has suspended the I/O queues and no new I/O can
	 * be submitted now.  Kill the queues now to avoid races between a
	 * possible reset failure and the controller removal work.
	 */

Otherwise this looks fine to me:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] nvme: avoid to hang in remove disk
  2017-05-17  1:27 ` [PATCH 2/2] nvme: avoid to hang in remove disk Ming Lei
@ 2017-05-18 13:49   ` Christoph Hellwig
  2017-05-18 15:35     ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2017-05-18 13:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Christoph Hellwig, Sagi Grimberg,
	linux-nvme, Zhang Yi, linux-block, stable

On Wed, May 17, 2017 at 09:27:29AM +0800, Ming Lei wrote:
> If some writeback requests are submitted just before queue is killed,
> and these requests may not be canceled in nvme_dev_disable() because
> they are not started yet, it is still possible for blk-mq to hold
> these requests in .requeue list.
> 
> So we have to abort these requests first before del_gendisk(), because
> del_gendisk() may wait for completion of these requests.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/nvme/host/core.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index d5e0906262ea..8eaeea86509a 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2097,6 +2097,14 @@ static void nvme_ns_remove(struct nvme_ns *ns)
>  					&nvme_ns_attr_group);
>  		if (ns->ndev)
>  			nvme_nvm_unregister_sysfs(ns);
> +		/*
> +		 * If queue is dead, we have to abort requests in
> +		 * requeue list because fsync_bdev() in removing disk
> +		 * path may wait for these IOs, which can't
> +		 * be submitted to hardware too.
> +		 */
> +		if (blk_queue_dying(ns->queue))
> +			blk_mq_abort_requeue_list(ns->queue);
>  		del_gendisk(ns->disk);
>  		blk_mq_abort_requeue_list(ns->queue);

Why can't we just move the blk_mq_abort_requeue_list call before
del_gendisk in general?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-17  1:27 ` [PATCH 1/2] nvme: fix race between removing and reseting failure Ming Lei
  2017-05-17  6:38   ` Johannes Thumshirn
  2017-05-18 13:47   ` Christoph Hellwig
@ 2017-05-18 14:13   ` Keith Busch
  2017-05-19 12:52     ` Ming Lei
  2017-05-19 14:41   ` Jens Axboe
  3 siblings, 1 reply; 18+ messages in thread
From: Keith Busch @ 2017-05-18 14:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Zhang Yi, linux-block, stable

On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> When one NVMe PCI device is being resetted and found reset failue,
> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> are put into stopped first, then schedule .remove_work to release the driver.
> 
> Unfortunately if the driver is being released via sysfs store
> just before the .remove_work is run, del_gendisk() from
> nvme_remove() may hang forever because hw queues are stopped and
> the submitted writeback IOs from fsync_bdev() can't be completed at all.
> 
> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> flushs .reset_work, and this way is reasonable and safe because
> nvme_dev_disable() has started to suspend queues and canceled requests
> already.

I'm still not sure moving where we kill the queues is the correct way
to fix this problem. The nvme_kill_queues restarts all the hardware
queues to force all IO to failure already, so why is this really stuck?
We should be able to make forward progress even if we kill the queues
while calling into del_gendisk, right? That could happen with a different
sequence of events, so that also needs to work.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-18 13:47   ` Christoph Hellwig
@ 2017-05-18 15:04     ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2017-05-18 15:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, Zhang Yi,
	linux-block, stable

On Thu, May 18, 2017 at 03:47:34PM +0200, Christoph Hellwig wrote:
> On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> > 
> > [1] test script
> > 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> > 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> > 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> > 
> > 	sleep 35
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> > 	echo 1 > /sys/bus/pci/rescan
> > 
> > [2] kernel hang log
> > [  492.232593] INFO: task nvme-test:5939 blocked for more than 120 seconds.
> > [  492.240081]       Not tainted 4.11.0.nvme_v4.11_debug_hang+ #3
> > [  492.246600] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  492.255346] nvme-test       D    0  5939   5938 0x00000080
> > [  492.261475] Call Trace:
> > [  492.264215]  __schedule+0x289/0x8f0
> > [  492.268105]  ? write_cache_pages+0x14c/0x510
> > [  492.272873]  schedule+0x36/0x80
> > [  492.276381]  io_schedule+0x16/0x40
> > [  492.280181]  wait_on_page_bit_common+0x137/0x220
> > [  492.285336]  ? page_cache_tree_insert+0x120/0x120
> > [  492.290589]  __filemap_fdatawait_range+0x128/0x1a0
> > [  492.295941]  filemap_fdatawait_range+0x14/0x30
> > [  492.300902]  filemap_fdatawait+0x23/0x30
> > [  492.305282]  filemap_write_and_wait+0x4c/0x80
> > [  492.310151]  __sync_blockdev+0x1f/0x40
> > [  492.314336]  fsync_bdev+0x44/0x50
> > [  492.318039]  invalidate_partition+0x24/0x50
> > [  492.322710]  del_gendisk+0xcd/0x2e0
> > [  492.326608]  nvme_ns_remove+0x105/0x130 [nvme_core]
> > [  492.332054]  nvme_remove_namespaces+0x32/0x50 [nvme_core]
> > [  492.338082]  nvme_uninit_ctrl+0x2d/0xa0 [nvme_core]
> > [  492.343519]  nvme_remove+0x5d/0x170 [nvme]
> > [  492.348096]  pci_device_remove+0x39/0xc0
> > [  492.352477]  device_release_driver_internal+0x141/0x1f0
> > [  492.358311]  device_release_driver+0x12/0x20
> > [  492.363072]  pci_stop_bus_device+0x8c/0xa0
> > [  492.367646]  pci_stop_and_remove_bus_device_locked+0x1a/0x30
> > [  492.373965]  remove_store+0x7c/0x90
> > [  492.377852]  dev_attr_store+0x18/0x30
> > [  492.381941]  sysfs_kf_write+0x3a/0x50
> > [  492.386028]  kernfs_fop_write+0xff/0x180
> > [  492.390409]  __vfs_write+0x37/0x160
> > [  492.394304]  ? selinux_file_permission+0xe5/0x120
> > [  492.399556]  ? security_file_permission+0x3b/0xc0
> > [  492.404807]  vfs_write+0xb2/0x1b0
> > [  492.408508]  ? syscall_trace_enter+0x1d0/0x2b0
> > [  492.413462]  SyS_write+0x55/0xc0
> > [  492.417064]  do_syscall_64+0x67/0x180
> > [  492.421155]  entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/nvme/host/pci.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index fed803232edc..5e39abe57c56 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1887,6 +1887,14 @@ static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
> >  
> >  	kref_get(&dev->ctrl.kref);
> >  	nvme_dev_disable(dev, false);
> 
> > +	/*
> > +	 * nvme_dev_disable() has suspended queues, then no new I/O
> > +	 * can be submitted to hardware successfully any more, so
> > +	 * kill queues now for avoiding race between reset failure
> > +	 * and remove.
> 
> How about this instead:
> 
> 	/*
> 	 * nvme_dev_disable() has suspended the I/O queues and no new I/O can
> 	 * be submitted now.  Kill the queues now to avoid races between a
> 	 * possible reset failure and the controller removal work.
> 	 */

That is fine, will change to this.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] nvme: avoid to hang in remove disk
  2017-05-18 13:49   ` Christoph Hellwig
@ 2017-05-18 15:35     ` Ming Lei
  2017-05-18 16:06       ` Keith Busch
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2017-05-18 15:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Sagi Grimberg, linux-nvme, Zhang Yi,
	linux-block, stable

On Thu, May 18, 2017 at 03:49:31PM +0200, Christoph Hellwig wrote:
> On Wed, May 17, 2017 at 09:27:29AM +0800, Ming Lei wrote:
> > If some writeback requests are submitted just before queue is killed,
> > and these requests may not be canceled in nvme_dev_disable() because
> > they are not started yet, it is still possible for blk-mq to hold
> > these requests in .requeue list.
> > 
> > So we have to abort these requests first before del_gendisk(), because
> > del_gendisk() may wait for completion of these requests.
> > 
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/nvme/host/core.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index d5e0906262ea..8eaeea86509a 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -2097,6 +2097,14 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> >  					&nvme_ns_attr_group);
> >  		if (ns->ndev)
> >  			nvme_nvm_unregister_sysfs(ns);
> > +		/*
> > +		 * If queue is dead, we have to abort requests in
> > +		 * requeue list because fsync_bdev() in removing disk
> > +		 * path may wait for these IOs, which can't
> > +		 * be submitted to hardware too.
> > +		 */
> > +		if (blk_queue_dying(ns->queue))
> > +			blk_mq_abort_requeue_list(ns->queue);
> >  		del_gendisk(ns->disk);
> >  		blk_mq_abort_requeue_list(ns->queue);
> 
> Why can't we just move the blk_mq_abort_requeue_list call before
> del_gendisk in general?

That may cause data loss if queue isn't killed. Normally queue is only killed
when the controller is dead(such as in reset failure) or !pci_device_is_present()
(in nvme_remove()).

Thanks,
Ming

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] nvme: avoid to hang in remove disk
  2017-05-18 15:35     ` Ming Lei
@ 2017-05-18 16:06       ` Keith Busch
  2017-05-19 13:19         ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: Keith Busch @ 2017-05-18 16:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme,
	Zhang Yi, linux-block, stable

On Thu, May 18, 2017 at 11:35:43PM +0800, Ming Lei wrote:
> On Thu, May 18, 2017 at 03:49:31PM +0200, Christoph Hellwig wrote:
> > On Wed, May 17, 2017 at 09:27:29AM +0800, Ming Lei wrote:
> > > If some writeback requests are submitted just before queue is killed,
> > > and these requests may not be canceled in nvme_dev_disable() because
> > > they are not started yet, it is still possible for blk-mq to hold
> > > these requests in .requeue list.
> > > 
> > > So we have to abort these requests first before del_gendisk(), because
> > > del_gendisk() may wait for completion of these requests.
> > > 
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/nvme/host/core.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index d5e0906262ea..8eaeea86509a 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -2097,6 +2097,14 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> > >  					&nvme_ns_attr_group);
> > >  		if (ns->ndev)
> > >  			nvme_nvm_unregister_sysfs(ns);
> > > +		/*
> > > +		 * If queue is dead, we have to abort requests in
> > > +		 * requeue list because fsync_bdev() in removing disk
> > > +		 * path may wait for these IOs, which can't
> > > +		 * be submitted to hardware too.
> > > +		 */
> > > +		if (blk_queue_dying(ns->queue))
> > > +			blk_mq_abort_requeue_list(ns->queue);
> > >  		del_gendisk(ns->disk);
> > >  		blk_mq_abort_requeue_list(ns->queue);
> > 
> > Why can't we just move the blk_mq_abort_requeue_list call before
> > del_gendisk in general?
> 
> That may cause data loss if queue isn't killed. Normally queue is only killed
> when the controller is dead(such as in reset failure) or !pci_device_is_present()
> (in nvme_remove()).

But in your test, your controller isn't even dead. Why are we killing
it when it's still functional? I think we need to first not consider
this perfectly functional controller to be dead under these conditions,
and second, understand why killing the queues after del_gendisk is called
does not allow forward progress.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-18 14:13   ` Keith Busch
@ 2017-05-19 12:52     ` Ming Lei
  2017-05-19 15:15       ` Keith Busch
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2017-05-19 12:52 UTC (permalink / raw)
  To: Keith Busch
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Zhang Yi, linux-block, stable

On Thu, May 18, 2017 at 10:13:07AM -0400, Keith Busch wrote:
> On Wed, May 17, 2017 at 09:27:28AM +0800, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> 
> I'm still not sure moving where we kill the queues is the correct way
> to fix this problem. The nvme_kill_queues restarts all the hardware
> queues to force all IO to failure already, so why is this really stuck?

That is a good question.

Today I investigate further, and figured out that it is because that
blk_mq_start_stopped_hw_queues() in nvme_kill_queues() does not
run hw queues actually becasue the queues are started in
nvme_reset_work() already. Will figure out a patch later to fix the
issue.

And the reason why this patch 'fixes' the issue is that just timing,
I guess.

> We should be able to make forward progress even if we kill the queues
> while calling into del_gendisk, right? That could happen with a different
> sequence of events, so that also needs to work.

Now I am not a big fun of this patch for fixing the issue.

But I still think it may be better to move nvme_kill_queues() into
nvme_remove_dead_ctrl() as an improvement because during this small
window page cache can be used up by write application, and no writeback
can move on meantime.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] nvme: avoid to hang in remove disk
  2017-05-18 16:06       ` Keith Busch
@ 2017-05-19 13:19         ` Ming Lei
  0 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2017-05-19 13:19 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Jens Axboe, Sagi Grimberg, linux-nvme,
	Zhang Yi, linux-block, stable

On Thu, May 18, 2017 at 12:06:24PM -0400, Keith Busch wrote:
> On Thu, May 18, 2017 at 11:35:43PM +0800, Ming Lei wrote:
> > On Thu, May 18, 2017 at 03:49:31PM +0200, Christoph Hellwig wrote:
> > > On Wed, May 17, 2017 at 09:27:29AM +0800, Ming Lei wrote:
> > > > If some writeback requests are submitted just before queue is killed,
> > > > and these requests may not be canceled in nvme_dev_disable() because
> > > > they are not started yet, it is still possible for blk-mq to hold
> > > > these requests in .requeue list.
> > > > 
> > > > So we have to abort these requests first before del_gendisk(), because
> > > > del_gendisk() may wait for completion of these requests.
> > > > 
> > > > Cc: stable@vger.kernel.org
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > >  drivers/nvme/host/core.c | 8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > > 
> > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > > index d5e0906262ea..8eaeea86509a 100644
> > > > --- a/drivers/nvme/host/core.c
> > > > +++ b/drivers/nvme/host/core.c
> > > > @@ -2097,6 +2097,14 @@ static void nvme_ns_remove(struct nvme_ns *ns)
> > > >  					&nvme_ns_attr_group);
> > > >  		if (ns->ndev)
> > > >  			nvme_nvm_unregister_sysfs(ns);
> > > > +		/*
> > > > +		 * If queue is dead, we have to abort requests in
> > > > +		 * requeue list because fsync_bdev() in removing disk
> > > > +		 * path may wait for these IOs, which can't
> > > > +		 * be submitted to hardware too.
> > > > +		 */
> > > > +		if (blk_queue_dying(ns->queue))
> > > > +			blk_mq_abort_requeue_list(ns->queue);
> > > >  		del_gendisk(ns->disk);
> > > >  		blk_mq_abort_requeue_list(ns->queue);
> > > 
> > > Why can't we just move the blk_mq_abort_requeue_list call before
> > > del_gendisk in general?
> > 
> > That may cause data loss if queue isn't killed. Normally queue is only killed
> > when the controller is dead(such as in reset failure) or !pci_device_is_present()
> > (in nvme_remove()).
> 
> But in your test, your controller isn't even dead. Why are we killing
> it when it's still functional? I think we need to first not consider

Last time, you posted one patch which looks fixes the hang, but the
NVMe device becomes not functional after rebinding later, see
the link:

	http://marc.info/?l=linux-block&m=149430047402103&w=2

Also not sure it is worthy of doing that, because in my case,
nvme_dev_disable() has been called in nvme_reset_notify() before the reset,
that means requests have been canceled already. As I mentioned, the two
directions aren't contradictorily. But thinking controller as live under
this situation need big change, such as, not canceling requests in whole
reset path.

Finally this patch is not only for the case of reset failure vs. remove,
but also for hot-unplug, queues are still killed in path of hot remove.

> this perfectly functional controller to be dead under these conditions,
> and second, understand why killing the queues after del_gendisk is called
> does not allow forward progress.

I have explained this cause in last email, :-)

Thanks,
Ming

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-17  1:27 ` [PATCH 1/2] nvme: fix race between removing and reseting failure Ming Lei
                     ` (2 preceding siblings ...)
  2017-05-18 14:13   ` Keith Busch
@ 2017-05-19 14:41   ` Jens Axboe
  2017-05-19 15:10     ` Ming Lei
  2017-05-19 16:40     ` Ming Lei
  3 siblings, 2 replies; 18+ messages in thread
From: Jens Axboe @ 2017-05-19 14:41 UTC (permalink / raw)
  To: Ming Lei, Keith Busch, Christoph Hellwig, Sagi Grimberg
  Cc: linux-nvme, Zhang Yi, linux-block, stable

On 05/16/2017 07:27 PM, Ming Lei wrote:
> When one NVMe PCI device is being resetted and found reset failue,
> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> are put into stopped first, then schedule .remove_work to release the driver.
> 
> Unfortunately if the driver is being released via sysfs store
> just before the .remove_work is run, del_gendisk() from
> nvme_remove() may hang forever because hw queues are stopped and
> the submitted writeback IOs from fsync_bdev() can't be completed at all.
> 
> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> flushs .reset_work, and this way is reasonable and safe because
> nvme_dev_disable() has started to suspend queues and canceled requests
> already.
> 
> [1] test script
> 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> 
> 	sleep 35
> 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> 	echo 1 > /sys/bus/pci/rescan

The patch looks good to me. But since you have a nice reproducer, how about
turning that into a blktests [1] test case?

[1] https://github.com/osandov/blktests

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-19 14:41   ` Jens Axboe
@ 2017-05-19 15:10     ` Ming Lei
  2017-05-19 16:40     ` Ming Lei
  1 sibling, 0 replies; 18+ messages in thread
From: Ming Lei @ 2017-05-19 15:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, linux-block,
	stable, linux-nvme, Zhang Yi

On Fri, May 19, 2017 at 08:41:13AM -0600, Jens Axboe wrote:
> On 05/16/2017 07:27 PM, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> > 
> > [1] test script
> > 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> > 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> > 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> > 
> > 	sleep 35
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> > 	echo 1 > /sys/bus/pci/rescan
> 
> The patch looks good to me. But since you have a nice reproducer, how about
> turning that into a blktests [1] test case?

This test has triggered several problems recently, and looks a good idea to
integrate into blktests.

I am a little busy recently, and may take a while to start that. If
anyone would like to integrate the case, please go ahead, and I am happy
to provide any details you need.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-19 12:52     ` Ming Lei
@ 2017-05-19 15:15       ` Keith Busch
  0 siblings, 0 replies; 18+ messages in thread
From: Keith Busch @ 2017-05-19 15:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, linux-nvme,
	Zhang Yi, linux-block, stable

On Fri, May 19, 2017 at 08:52:45PM +0800, Ming Lei wrote:
> But I still think it may be better to move nvme_kill_queues() into
> nvme_remove_dead_ctrl() as an improvement because during this small
> window page cache can be used up by write application, and no writeback
> can move on meantime.

Yes, I agree that's a better placement for it. I was just concerned
about the reasoning since it would also mean we're still stuck if an
IO timeout occurs while calling del_gendisk. So I'm okay with the patch
as-is, but I'll also look into my other concern.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-19 14:41   ` Jens Axboe
  2017-05-19 15:10     ` Ming Lei
@ 2017-05-19 16:40     ` Ming Lei
  2017-05-19 16:55       ` yizhan
  1 sibling, 1 reply; 18+ messages in thread
From: Ming Lei @ 2017-05-19 16:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, Sagi Grimberg, linux-block,
	stable, linux-nvme, Zhang Yi

On Fri, May 19, 2017 at 08:41:13AM -0600, Jens Axboe wrote:
> On 05/16/2017 07:27 PM, Ming Lei wrote:
> > When one NVMe PCI device is being resetted and found reset failue,
> > nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
> > are put into stopped first, then schedule .remove_work to release the driver.
> > 
> > Unfortunately if the driver is being released via sysfs store
> > just before the .remove_work is run, del_gendisk() from
> > nvme_remove() may hang forever because hw queues are stopped and
> > the submitted writeback IOs from fsync_bdev() can't be completed at all.
> > 
> > This patch fixes the following issue[1][2] by moving nvme_kill_queues()
> > into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
> > flushs .reset_work, and this way is reasonable and safe because
> > nvme_dev_disable() has started to suspend queues and canceled requests
> > already.
> > 
> > [1] test script
> > 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
> > 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
> > 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
> > 
> > 	sleep 35
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
> > 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
> > 	echo 1 > /sys/bus/pci/rescan
> 
> The patch looks good to me. But since you have a nice reproducer, how about
> turning that into a blktests [1] test case?
> 
> [1] https://github.com/osandov/blktests

Forget to mention, this test case is written by Zhang Yi.

Zhang Yi, maybe you can try to integrate your NVMe test case into
blktests if you are interested, :-)

Thanks,
Ming

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] nvme: fix race between removing and reseting failure
  2017-05-19 16:40     ` Ming Lei
@ 2017-05-19 16:55       ` yizhan
  0 siblings, 0 replies; 18+ messages in thread
From: yizhan @ 2017-05-19 16:55 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Sagi Grimberg, linux-nvme, Keith Busch, stable,
	Christoph Hellwig



On 05/20/2017 12:40 AM, Ming Lei wrote:
> On Fri, May 19, 2017 at 08:41:13AM -0600, Jens Axboe wrote:
>> On 05/16/2017 07:27 PM, Ming Lei wrote:
>>> When one NVMe PCI device is being resetted and found reset failue,
>>> nvme_remove_dead_ctrl() is called to handle the failure: blk-mq hw queues
>>> are put into stopped first, then schedule .remove_work to release the driver.
>>>
>>> Unfortunately if the driver is being released via sysfs store
>>> just before the .remove_work is run, del_gendisk() from
>>> nvme_remove() may hang forever because hw queues are stopped and
>>> the submitted writeback IOs from fsync_bdev() can't be completed at all.
>>>
>>> This patch fixes the following issue[1][2] by moving nvme_kill_queues()
>>> into nvme_remove_dead_ctrl() to avoid the issue because nvme_remove()
>>> flushs .reset_work, and this way is reasonable and safe because
>>> nvme_dev_disable() has started to suspend queues and canceled requests
>>> already.
>>>
>>> [1] test script
>>> 	fio -filename=$NVME_DISK -iodepth=1 -thread -rw=randwrite -ioengine=psync \
>>> 	    -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 \
>>> 	    -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest -numjobs=60
>>>
>>> 	sleep 35
>>> 	echo 1 > $SYSFS_NVME_PCI_PATH/rescan
>>> 	echo 1 > $SYSFS_NVME_PCI_PATH/reset
>>> 	echo 1 > $SYSFS_NVME_PCI_PATH/remove
>>> 	echo 1 > /sys/bus/pci/rescan
>> The patch looks good to me. But since you have a nice reproducer, how about
>> turning that into a blktests [1] test case?
>>
>> [1] https://github.com/osandov/blktests
> Forget to mention, this test case is written by Zhang Yi.
>
> Zhang Yi, maybe you can try to integrate your NVMe test case into
> blktests if you are interested, :-)
Sure, I will have a try. :)

Thanks
Yi
> Thanks,
> Ming
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-05-19 16:55 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-17  1:27 [PATCH 0/2] nvme: fix hang in path of removing disk Ming Lei
2017-05-17  1:27 ` [PATCH 1/2] nvme: fix race between removing and reseting failure Ming Lei
2017-05-17  6:38   ` Johannes Thumshirn
2017-05-17  7:01     ` Ming Lei
2017-05-18 13:47   ` Christoph Hellwig
2017-05-18 15:04     ` Ming Lei
2017-05-18 14:13   ` Keith Busch
2017-05-19 12:52     ` Ming Lei
2017-05-19 15:15       ` Keith Busch
2017-05-19 14:41   ` Jens Axboe
2017-05-19 15:10     ` Ming Lei
2017-05-19 16:40     ` Ming Lei
2017-05-19 16:55       ` yizhan
2017-05-17  1:27 ` [PATCH 2/2] nvme: avoid to hang in remove disk Ming Lei
2017-05-18 13:49   ` Christoph Hellwig
2017-05-18 15:35     ` Ming Lei
2017-05-18 16:06       ` Keith Busch
2017-05-19 13:19         ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).