linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] nvme-core: Fix extra device_put() call on error path
@ 2019-07-18 22:51 Logan Gunthorpe
  2019-07-18 22:51 ` [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning Logan Gunthorpe
  2019-07-18 23:52 ` [PATCH 1/2] nvme-core: Fix extra device_put() call on error path Sagi Grimberg
From: Logan Gunthorpe @ 2019-07-18 22:51 UTC
  To: linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Logan Gunthorpe

In the error path for nvme_init_subsystem(), nvme_put_subsystem()
drops the device reference with put_device(), but put_device() is
then called a second time after the mutex_unlock().

put_device() only needs to be called if device_add() fails.
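
The lifetime rule being restored is, roughly (a simplified sketch of
the driver-core convention, not the actual nvme code):

  /*
   * After device_initialize(), the embedded kobject holds the single
   * reference to the object, and the final put_device() frees it.
   * A failed device_add() must therefore be balanced by exactly one
   * put_device(); a second put is a use-after-free.
   */
  static int example_register(struct example_subsys *subsys)
  {
          int ret;

          device_initialize(&subsys->dev);        /* refcount = 1 */
          ret = device_add(&subsys->dev);
          if (ret) {
                  put_device(&subsys->dev);       /* drop the only reference */
                  return ret;                     /* and never put again */
          }
          return 0;
  }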

This bug caused a KASAN use-after-free error when adding and
removing subsystems in a loop:

  BUG: KASAN: use-after-free in device_del+0x8d9/0x9a0
  Read of size 8 at addr ffff8883cdaf7120 by task multipathd/329

  CPU: 0 PID: 329 Comm: multipathd Not tainted 5.2.0-rc6-vmlocalyes-00019-g70a2b39005fd-dirty #314
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
  Call Trace:
   dump_stack+0x7b/0xb5
   print_address_description+0x6f/0x280
   ? device_del+0x8d9/0x9a0
   __kasan_report+0x148/0x199
   ? device_del+0x8d9/0x9a0
   ? class_release+0x100/0x130
   ? device_del+0x8d9/0x9a0
   kasan_report+0x12/0x20
   __asan_report_load8_noabort+0x14/0x20
   device_del+0x8d9/0x9a0
   ? device_platform_notify+0x70/0x70
   nvme_destroy_subsystem+0xf9/0x150
   nvme_free_ctrl+0x280/0x3a0
   device_release+0x72/0x1d0
   kobject_put+0x144/0x410
   put_device+0x13/0x20
   nvme_free_ns+0xc4/0x100
   nvme_release+0xb3/0xe0
   __blkdev_put+0x549/0x6e0
   ? kasan_check_write+0x14/0x20
   ? bd_set_size+0xb0/0xb0
   ? kasan_check_write+0x14/0x20
   ? mutex_lock+0x8f/0xe0
   ? __mutex_lock_slowpath+0x20/0x20
   ? locks_remove_file+0x239/0x370
   blkdev_put+0x72/0x2c0
   blkdev_close+0x8d/0xd0
   __fput+0x256/0x770
   ? _raw_read_lock_irq+0x40/0x40
   ____fput+0xe/0x10
   task_work_run+0x10c/0x180
   ? filp_close+0xf7/0x140
   exit_to_usermode_loop+0x151/0x170
   do_syscall_64+0x240/0x2e0
   ? prepare_exit_to_usermode+0xd5/0x190
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f5a79af05d7
  Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f c3 66 0f 1f 44 00 00 53 89 fb 48 83 ec 10 e8 c4 fb ff ff 89 df 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2b 89 d7 89 44 24 0c e8 06 fc ff ff 8b 44 24
  RSP: 002b:00007f5a7799c810 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
  RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007f5a79af05d7
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000008
  RBP: 00007f5a58000f98 R08: 0000000000000002 R09: 00007f5a7935ee80
  R10: 0000000000000000 R11: 0000000000000293 R12: 000055e432447240
  R13: 0000000000000000 R14: 0000000000000001 R15: 000055e4324a9cf0

  Allocated by task 1236:
   save_stack+0x21/0x80
   __kasan_kmalloc.constprop.6+0xab/0xe0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x102/0x210
   nvme_init_identify+0x13c3/0x3820
   nvme_loop_configure_admin_queue+0x4fa/0x5e0
   nvme_loop_create_ctrl+0x469/0xf40
   nvmf_dev_write+0x19a3/0x21ab
   __vfs_write+0x66/0x120
   vfs_write+0x154/0x490
   ksys_write+0x104/0x240
   __x64_sys_write+0x73/0xb0
   do_syscall_64+0xa5/0x2e0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 329:
   save_stack+0x21/0x80
   __kasan_slab_free+0x129/0x190
   kasan_slab_free+0xe/0x10
   kfree+0xa7/0x200
   nvme_release_subsystem+0x49/0x60
   device_release+0x72/0x1d0
   kobject_put+0x144/0x410
   put_device+0x13/0x20
   klist_class_dev_put+0x31/0x40
   klist_put+0x8f/0xf0
   klist_del+0xe/0x10
   device_del+0x3a7/0x9a0
   nvme_destroy_subsystem+0xf9/0x150
   nvme_free_ctrl+0x280/0x3a0
   device_release+0x72/0x1d0
   kobject_put+0x144/0x410
   put_device+0x13/0x20
   nvme_free_ns+0xc4/0x100
   nvme_release+0xb3/0xe0
   __blkdev_put+0x549/0x6e0
   blkdev_put+0x72/0x2c0
   blkdev_close+0x8d/0xd0
   __fput+0x256/0x770
   ____fput+0xe/0x10
   task_work_run+0x10c/0x180
   exit_to_usermode_loop+0x151/0x170
   do_syscall_64+0x240/0x2e0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 32fd90c40768 ("nvme: change locking for the per-subsystem controller list")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cc09b81fc7f4..3ca33a2714e5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2489,6 +2489,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		if (ret) {
 			dev_err(ctrl->device,
 				"failed to register subsystem device.\n");
+			put_device(&subsys->dev);
 			goto out_unlock;
 		}
 		ida_init(&subsys->ns_ida);
@@ -2511,7 +2512,6 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	nvme_put_subsystem(subsys);
 out_unlock:
 	mutex_unlock(&nvme_subsystems_lock);
-	put_device(&subsys->dev);
 	return ret;
 }
 
-- 
2.20.1



* [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
  2019-07-18 22:51 [PATCH 1/2] nvme-core: Fix extra device_put() call on error path Logan Gunthorpe
@ 2019-07-18 22:51 ` Logan Gunthorpe
  2019-07-19  0:25   ` Sagi Grimberg
  2019-07-18 23:52 ` [PATCH 1/2] nvme-core: Fix extra device_put() call on error path Sagi Grimberg
From: Logan Gunthorpe @ 2019-07-18 22:51 UTC
  To: linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Logan Gunthorpe

With multipath enabled, nvme_scan_work() can read from the
device (through nvme_mpath_add_disk()). However, with fabrics,
once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
(see nvmf_check_ready()).

After setting the state to deleting, nvme_remove_namespaces() will
block waiting for scan_work to flush, and both tasks will hang.

To fix this, ensure we take scan_lock before changing the ctrl state.
Also, ensure the state is checked while the lock is held
in nvme_scan_work().
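
Schematically, the intended serialization is (a simplified sketch of
the change below):

  /* delete path (nvme_change_ctrl_state()) */
  mutex_lock(&ctrl->scan_lock);
  /* ctrl->state transitions happen here, under ctrl->lock */
  mutex_unlock(&ctrl->scan_lock);

  /* scan path (nvme_scan_work()) */
  mutex_lock(&ctrl->scan_lock);
  if (ctrl->state != NVME_CTRL_LIVE)
          return;
  /* ... identify and validate namespaces ... */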

  INFO: task kworker/u4:3:166 blocked for more than 120 seconds.
        Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u4:3    D    0   166      2 0x80004000
  Workqueue: nvme-wq nvme_scan_work
  Call Trace:
   __schedule+0x851/0x1400
   schedule+0x99/0x210
   io_schedule+0x21/0x70
   do_read_cache_page+0xa57/0x1330
   read_cache_page+0x4a/0x70
   read_dev_sector+0xbf/0x380
   amiga_partition+0xc4/0x1230
   check_partition+0x30f/0x630
   rescan_partitions+0x19a/0x980
   __blkdev_get+0x85a/0x12f0
   blkdev_get+0x2a5/0x790
   __device_add_disk+0xe25/0x1250
   device_add_disk+0x13/0x20
   nvme_mpath_set_live+0x172/0x2b0
   nvme_update_ns_ana_state+0x130/0x180
   nvme_set_ns_ana_state+0x9a/0xb0
   nvme_parse_ana_log+0x1c3/0x4a0
   nvme_mpath_add_disk+0x157/0x290
   nvme_validate_ns+0x1017/0x1bd0
   nvme_scan_work+0x44d/0x6a0
   process_one_work+0x7d7/0x1240
   worker_thread+0x8e/0xff0
   kthread+0x2c3/0x3b0
   ret_from_fork+0x35/0x40

  INFO: task kworker/u4:1:1034 blocked for more than 120 seconds.
        Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u4:1    D    0  1034      2 0x80004000
  Workqueue: nvme-delete-wq nvme_delete_ctrl_work
  Call Trace:
   __schedule+0x851/0x1400
   schedule+0x99/0x210
   schedule_timeout+0x390/0x830
   wait_for_completion+0x1a7/0x310
   __flush_work+0x241/0x5d0
   flush_work+0x10/0x20
   nvme_remove_namespaces+0x85/0x3d0
   nvme_do_delete_ctrl+0xb4/0x1e0
   nvme_delete_ctrl_work+0x15/0x20
   process_one_work+0x7d7/0x1240
   worker_thread+0x8e/0xff0
   kthread+0x2c3/0x3b0
   ret_from_fork+0x35/0x40

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/nvme/host/core.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3ca33a2714e5..0a7b46066fe3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -301,6 +301,9 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	unsigned long flags;
 	bool changed = false;
 
+	/* Ensure a scan isn't in progress when we change the state */
+	mutex_lock(&ctrl->scan_lock);
+
 	spin_lock_irqsave(&ctrl->lock, flags);
 
 	old_state = ctrl->state;
@@ -375,6 +378,8 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		ctrl->state = new_state;
 
 	spin_unlock_irqrestore(&ctrl->lock, flags);
+	mutex_unlock(&ctrl->scan_lock);
+
 	if (changed && ctrl->state == NVME_CTRL_LIVE)
 		nvme_kick_requeue_lists(ctrl);
 	return changed;
@@ -3534,6 +3539,8 @@ static void nvme_scan_work(struct work_struct *work)
 	struct nvme_id_ctrl *id;
 	unsigned nn;
 
+	mutex_lock(&ctrl->scan_lock);
+
 	if (ctrl->state != NVME_CTRL_LIVE)
 		return;
 
@@ -3547,7 +3554,6 @@ static void nvme_scan_work(struct work_struct *work)
 	if (nvme_identify_ctrl(ctrl, &id))
 		return;
 
-	mutex_lock(&ctrl->scan_lock);
 	nn = le32_to_cpu(id->nn);
 	if (ctrl->vs >= NVME_VS(1, 1, 0) &&
 	    !(ctrl->quirks & NVME_QUIRK_IDENTIFY_CNS)) {
-- 
2.20.1



* Re: [PATCH 1/2] nvme-core: Fix extra device_put() call on error path
  2019-07-18 22:51 [PATCH 1/2] nvme-core: Fix extra device_put() call on error path Logan Gunthorpe
  2019-07-18 22:51 ` [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning Logan Gunthorpe
@ 2019-07-18 23:52 ` Sagi Grimberg
From: Sagi Grimberg @ 2019-07-18 23:52 UTC
  To: Logan Gunthorpe, linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig

Looks good, Logan,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
  2019-07-18 22:51 ` [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning Logan Gunthorpe
@ 2019-07-19  0:25   ` Sagi Grimberg
  2019-07-19  0:31     ` Sagi Grimberg
  2019-07-19  0:39     ` Logan Gunthorpe
From: Sagi Grimberg @ 2019-07-19  0:25 UTC
  To: Logan Gunthorpe, linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig


> With multipath enabled, nvme_scan_work() can read from the
> device (through nvme_mpath_add_disk()). However, with fabrics,
> once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
> (see nvmf_check_ready()).
> 
> After setting the state to deleting, nvme_remove_namespaces() will
> block waiting for scan_work to flush, and both tasks will hang.
> 
> To fix this, ensure we take scan_lock before changing the ctrl state.
> Also, ensure the state is checked while the lock is held
> in nvme_scan_work().

That's a big hammer...

But this is I/O that we cannot have queued until we have a path..

I would rather have nvme_remove_namespaces() requeue all I/Os for
namespaces that serve as the current_path and have the make_request
routine fail I/O if all controllers are deleting as well.

Would something like [1] (untested) make sense instead?


> +	mutex_lock(&ctrl->scan_lock);
> +
>   	if (ctrl->state != NVME_CTRL_LIVE)
>   		return;

unlock

>   
> @@ -3547,7 +3554,6 @@ static void nvme_scan_work(struct work_struct *work)
>   	if (nvme_identify_ctrl(ctrl, &id))
>   		return;

unlock


[1]:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 76cd3dd8736a..627f5871858d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3576,6 +3576,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
         struct nvme_ns *ns, *next;
         LIST_HEAD(ns_list);

+       mutex_lock(&ctrl->scan_lock);
+       list_for_each_entry(ns, &ctrl->namespaces, list)
+               nvme_mpath_clear_current_path(ns);
+       mutex_lock(&ctrl->scan_lock);
+
         /* prevent racing with ns scanning */
         flush_work(&ctrl->scan_work);

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a9a927677970..da1731266788 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -231,6 +231,24 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
         return ns;
  }

+static bool nvme_available_path(struct nvme_ns_head *head)
+{
+       struct nvme_ns *ns;
+
+       list_for_each_entry_rcu(ns, &head->list, siblings) {
+               switch (ns->ctrl->state) {
+               case NVME_CTRL_LIVE:
+               case NVME_CTRL_RESETTING:
+               case NVME_CTRL_CONNECTING:
+                       /* fallthru */
+                       return true;
+               default:
+                       break;
+               }
+       }
+       return false;
+}
+
  static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
                 struct bio *bio)
  {
@@ -257,14 +275,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
                                       disk_devt(ns->head->disk),
                                       bio->bi_iter.bi_sector);
                 ret = direct_make_request(bio);
-       } else if (!list_empty_careful(&head->list)) {
-               dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
+       } else if (nvme_available_path(head)) {
+               dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");

                 spin_lock_irq(&head->requeue_lock);
                 bio_list_add(&head->requeue_list, bio);
                 spin_unlock_irq(&head->requeue_lock);
         } else {
-               dev_warn_ratelimited(dev, "no path - failing I/O\n");
+               dev_warn_ratelimited(dev, "no available path - failing I/O\n");

                 bio->bi_status = BLK_STS_IOERR;
                 bio_endio(bio);
--


* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
  2019-07-19  0:25   ` Sagi Grimberg
@ 2019-07-19  0:31     ` Sagi Grimberg
  2019-07-19 20:15       ` Sagi Grimberg
  2019-07-19  0:39     ` Logan Gunthorpe
From: Sagi Grimberg @ 2019-07-19  0:31 UTC
  To: Logan Gunthorpe, linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig

> [1]:
Or actually:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 76cd3dd8736a..a0e2072fe73e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3576,6 +3576,12 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
         struct nvme_ns *ns, *next;
         LIST_HEAD(ns_list);

+       mutex_lock(&ctrl->scan_lock);
+       list_for_each_entry(ns, &ctrl->namespaces, list)
+               if (nvme_mpath_clear_current_path(ns))
+                       kblockd_schedule_work(&ns->head->requeue_work);
+       mutex_lock(&ctrl->scan_lock);
+
         /* prevent racing with ns scanning */
         flush_work(&ctrl->scan_work);

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a9a927677970..0e7e41381388 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -109,18 +109,22 @@ static const char *nvme_ana_state_names[] = {
         [NVME_ANA_CHANGE]               = "change",
  };

-void nvme_mpath_clear_current_path(struct nvme_ns *ns)
+bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
  {
         struct nvme_ns_head *head = ns->head;
+       bool changed = false;
         int node;

         if (!head)
-               return;
+               return changed;

         for_each_node(node) {
-               if (ns == rcu_access_pointer(head->current_path[node]))
+               if (ns == rcu_access_pointer(head->current_path[node])) {
                         rcu_assign_pointer(head->current_path[node], NULL);
+                       changed = true;
+               }
         }
+       return changed;
  }

  static bool nvme_path_is_disabled(struct nvme_ns *ns)
@@ -231,6 +235,24 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
         return ns;
  }

+static bool nvme_available_path(struct nvme_ns_head *head)
+{
+       struct nvme_ns *ns;
+
+       list_for_each_entry_rcu(ns, &head->list, siblings) {
+               switch (ns->ctrl->state) {
+               case NVME_CTRL_LIVE:
+               case NVME_CTRL_RESETTING:
+               case NVME_CTRL_CONNECTING:
+                       /* fallthru */
+                       return true;
+               default:
+                       break;
+               }
+       }
+       return false;
+}
+
  static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
                 struct bio *bio)
  {
@@ -257,14 +279,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
                                       disk_devt(ns->head->disk),
                                       bio->bi_iter.bi_sector);
                 ret = direct_make_request(bio);
-       } else if (!list_empty_careful(&head->list)) {
-               dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
+       } else if (nvme_available_path(head)) {
+               dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");

                 spin_lock_irq(&head->requeue_lock);
                 bio_list_add(&head->requeue_list, bio);
                 spin_unlock_irq(&head->requeue_lock);
         } else {
-               dev_warn_ratelimited(dev, "no path - failing I/O\n");
+               dev_warn_ratelimited(dev, "no available path - failing I/O\n");

                 bio->bi_status = BLK_STS_IOERR;
                 bio_endio(bio);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 716a876119c8..915179368d30 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -496,7 +496,7 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head);
  int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
  void nvme_mpath_uninit(struct nvme_ctrl *ctrl);
  void nvme_mpath_stop(struct nvme_ctrl *ctrl);
-void nvme_mpath_clear_current_path(struct nvme_ns *ns);
+bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
  struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);

  static inline void nvme_mpath_check_last_path(struct nvme_ns *ns)
--


* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
  2019-07-19  0:25   ` Sagi Grimberg
  2019-07-19  0:31     ` Sagi Grimberg
@ 2019-07-19  0:39     ` Logan Gunthorpe
  2019-07-19  0:50       ` Sagi Grimberg
From: Logan Gunthorpe @ 2019-07-19  0:39 UTC
  To: Sagi Grimberg, linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig



On 2019-07-18 6:25 p.m., Sagi Grimberg wrote:
> 
>> With multipath enabled, nvme_scan_work() can read from the
>> device (through nvme_mpath_add_disk()). However, with fabrics,
>> once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
>> (see nvmf_check_ready()).
>>
>> After setting the state to deleting, nvme_remove_namespaces() will
>> block waiting for scan_work to flush, and both tasks will hang.
>>
>> To fix this, ensure we take scan_lock before changing the ctrl state.
>> Also, ensure the state is checked while the lock is held
>> in nvme_scan_work().
> 
> That's a big hammer...

I didn't think the scan_lock was that contended or that
nvme_change_ctrl_state() was really called that often...

> But this is I/O that we cannot have queued until we have a path..
> 
> I would rather have nvme_remove_namespaces() requeue all I/Os for
> namespaces that serve as the current_path and have the make_request
> routine fail I/O if all controllers are deleting as well.
> 
> Would something like [1] (untested) make sense instead?

I'll have to give this a try next week and I'll let you know then. It
kind of makes sense to me but a number of things I tried to fix this
that I thought made sense did not work.

> 
>> +    mutex_lock(&ctrl->scan_lock);
>> +
>>       if (ctrl->state != NVME_CTRL_LIVE)
>>           return;
> 
> unlock

If we unlock here and relock below, we'd have to recheck the ctrl->state
to avoid any races. If you don't want to call nvme_identify_ctrl with
the lock held, then it would probably be better to move the state check
below it.
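
For example, with an early unlock the window would look roughly like
this (a hypothetical sketch, not actual code):

  mutex_lock(&ctrl->scan_lock);
  live = (ctrl->state == NVME_CTRL_LIVE);
  mutex_unlock(&ctrl->scan_lock);

  if (!live)
          return;
  /* ctrl->state may flip to NVME_CTRL_DELETING right here... */
  nvme_identify_ctrl(ctrl, &id);  /* ...and this read can still hang */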

>>   @@ -3547,7 +3554,6 @@ static void nvme_scan_work(struct work_struct *work)
>>       if (nvme_identify_ctrl(ctrl, &id))
>>           return;
> 
> unlock
> 
> 
> [1]:
> -- 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 76cd3dd8736a..627f5871858d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3576,6 +3576,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>         struct nvme_ns *ns, *next;
>         LIST_HEAD(ns_list);
> 
> +       mutex_lock(&ctrl->scan_lock);
> +       list_for_each_entry(ns, &ctrl->namespaces, list)
> +               nvme_mpath_clear_current_path(ns);
> +       mutex_lock(&ctrl->scan_lock);
> +
>         /* prevent racing with ns scanning */
>         flush_work(&ctrl->scan_work);
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a9a927677970..da1731266788 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -231,6 +231,24 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>         return ns;
>  }
> 
> +static bool nvme_available_path(struct nvme_ns_head *head)
> +{
> +       struct nvme_ns *ns;
> +
> +       list_for_each_entry_rcu(ns, &head->list, siblings) {
> +               switch (ns->ctrl->state) {
> +               case NVME_CTRL_LIVE:
> +               case NVME_CTRL_RESETTING:
> +               case NVME_CTRL_CONNECTING:
> +                       /* fallthru */
> +                       return true;
> +               default:
> +                       break;
> +               }
> +       }
> +       return false;
> +}
> +
>  static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
>                 struct bio *bio)
>  {
> @@ -257,14 +275,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
>                                       disk_devt(ns->head->disk),
>                                       bio->bi_iter.bi_sector);
>                 ret = direct_make_request(bio);
> -       } else if (!list_empty_careful(&head->list)) {
> -               dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
> +       } else if (nvme_available_path(head)) {
> +               dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
> 
>                 spin_lock_irq(&head->requeue_lock);
>                 bio_list_add(&head->requeue_list, bio);
>                 spin_unlock_irq(&head->requeue_lock);
>         } else {
> -               dev_warn_ratelimited(dev, "no path - failing I/O\n");
> +               dev_warn_ratelimited(dev, "no available path - failing I/O\n");
> 
>                 bio->bi_status = BLK_STS_IOERR;
>                 bio_endio(bio);
> -- 


* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
  2019-07-19  0:39     ` Logan Gunthorpe
@ 2019-07-19  0:50       ` Sagi Grimberg
  2019-07-24 19:12         ` Logan Gunthorpe
From: Sagi Grimberg @ 2019-07-19  0:50 UTC
  To: Logan Gunthorpe, linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig


>>> With multipath enabled, nvme_scan_work() can read from the
>>> device (through nvme_mpath_add_disk()). However, with fabrics,
>>> once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
>>> (see nvmf_check_ready()).
>>>
>>> After setting the state to deleting, nvme_remove_namespaces() will
>>> block waiting for scan_work to flush, and both tasks will hang.
>>>
>>> To fix this, ensure we take scan_lock before changing the ctrl state.
>>> Also, ensure the state is checked while the lock is held
>>> in nvme_scan_work().
>>
>> That's a big hammer...
> 
> I didn't think the scan_lock was that contested or that
> nvme_change_ctrl_state() was really called that often...

it shouldn't be, but I think it makes the flow more convoluted
as we serialize by flushing the scan_work right after...

The design principle is met, as we do get the I/O failing,
but it's just that with mpath we simply queue the I/O again
because the head->list happens to not be empty.
Perhaps taking care of that check is cleaner.

>> But this is I/O that we cannot have queued until we have a path..
>>
>> I would rather have nvme_remove_namespaces() requeue all I/Os for
>> namespaces that serve as the current_path and have the make_request
>> routine fail I/O if all controllers are deleting as well.
>>
>> Would something like [1] (untested) make sense instead?
> 
> I'll have to give this a try next week and I'll let you know then. It
> kind of makes sense to me but a number of things I tried to fix this
> that I thought made sense did not work.

Thanks. Do you have a firm reproducer for it?

>>> +    mutex_lock(&ctrl->scan_lock);
>>> +
>>>        if (ctrl->state != NVME_CTRL_LIVE)
>>>            return;
>>
>> unlock
> 
> If we unlock here and relock below, we'd have to recheck the ctrl->state
> to avoid any races. If you don't want to call nvme_identify_ctrl with
> the lock held, then it would probably be better to move the state check
> below it.

Meant before the return statement.

> 
>>>    @@ -3547,7 +3554,6 @@ static void nvme_scan_work(struct work_struct *work)
>>>        if (nvme_identify_ctrl(ctrl, &id))
>>>            return;
>>
>> unlock

Same here.


* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
  2019-07-19  0:31     ` Sagi Grimberg
@ 2019-07-19 20:15       ` Sagi Grimberg
From: Sagi Grimberg @ 2019-07-19 20:15 UTC
  To: Logan Gunthorpe, linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig


>> [1]:
> Or actually:
> -- 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 76cd3dd8736a..a0e2072fe73e 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3576,6 +3576,12 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>          struct nvme_ns *ns, *next;
>          LIST_HEAD(ns_list);
> 
> +       mutex_lock(&ctrl->scan_lock);
> +       list_for_each_entry(ns, &ctrl->namespaces, list)
> +               if (nvme_mpath_clear_current_path(ns))
> +                       kblockd_schedule_work(&ns->head->requeue_work);
> +       mutex_lock(&ctrl->scan_lock);

This should be mutex_unlock of course...


* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
  2019-07-19  0:50       ` Sagi Grimberg
@ 2019-07-24 19:12         ` Logan Gunthorpe
  2019-07-25 18:16           ` Sagi Grimberg
From: Logan Gunthorpe @ 2019-07-24 19:12 UTC
  To: Sagi Grimberg, linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig

Hey,

Sorry for the delay.

I tested your patch and it does work. Do you want me to send your change
as a full patch? Can I add your signed-off-by?

On 2019-07-18 6:50 p.m., Sagi Grimberg wrote:
>> I didn't think the scan_lock was that contended or that
>> nvme_change_ctrl_state() was really called that often...
> 
> it shouldn't be, but I think it makes the flow more convoluted
> as we serialize by flushing the scan_work right after...

I would argue that the check for state in nvme_scan_work() without a
lock is racy and confusing. There's nothing to prevent the state from
changing immediately after the check.

> The design principle is met, as we do get the I/O failing,
> but it's just that with mpath we simply queue the I/O again
> because the head->list happens to not be empty.
> Perhaps taking care of that check is cleaner.

Yes, I feel your patch is a good solution on its own merits.

> Thanks. Do you have a firm reproducer for it?

Yes. If you connect to and then immediately disconnect from a target (at
least with nvme-loop) you will reliably trigger this bug -- or one of
the others I've sent patches for.

>>>> +    mutex_lock(&ctrl->scan_lock);
>>>> +
>>>>        if (ctrl->state != NVME_CTRL_LIVE)
>>>>            return;
>>>
>>> unlock
>>
>> If we unlock here and relock below, we'd have to recheck the ctrl->state
>> to avoid any races. If you don't want to call nvme_identify_ctrl with
>> the lock held, then it would probably be better to move the state check
>> below it.
> 
> Meant before the return statement.

Ah, right, my mistake.

Logan


* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
  2019-07-24 19:12         ` Logan Gunthorpe
@ 2019-07-25 18:16           ` Sagi Grimberg
  2019-07-25 18:20             ` Logan Gunthorpe
From: Sagi Grimberg @ 2019-07-25 18:16 UTC
  To: Logan Gunthorpe, linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig


> Hey,
> 
> Sorry for the delay.
> 
> I tested your patch and it does work. Do you want me to send your change
> as a full patch? Can I add your signed-off-by?

I have a patch ready with a proper comment in place. I can send it
out... Can I get your tested-by and reported-by?


* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
  2019-07-25 18:16           ` Sagi Grimberg
@ 2019-07-25 18:20             ` Logan Gunthorpe
From: Logan Gunthorpe @ 2019-07-25 18:20 UTC
  To: Sagi Grimberg, linux-kernel, linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig



On 2019-07-25 12:16 p.m., Sagi Grimberg wrote:
> 
>> Hey,
>>
>> Sorry for the delay.
>>
>> I tested your patch and it does work. Do you want me to send your change
>> as a full patch? Can I add your signed-off-by?
> 
> I have a patch ready with a proper comment in place. I can send it
> out... Can I get your tested-by and reported-by?

Yup. I can retest once you send the patch and I'll send a tested-by then.
For now you can add my reported-by:

Reported-by: Logan Gunthorpe <logang@deltatee.com>

Note: I had to modify your original patch a fair bit so it compiled with
multipath disabled.
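
Roughly, the !CONFIG_NVME_MULTIPATH stub needs the matching signature
change (a sketch, assuming the stub simply mirrors the new return type):

  static inline bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
  {
          return false;
  }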

Logan



