* [PATCH 1/2] nvme-core: Fix extra device_put() call on error path
@ 2019-07-18 22:51 Logan Gunthorpe
2019-07-18 22:51 ` [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning Logan Gunthorpe
2019-07-18 23:52 ` [PATCH 1/2] nvme-core: Fix extra device_put() call on error path Sagi Grimberg
0 siblings, 2 replies; 11+ messages in thread
From: Logan Gunthorpe @ 2019-07-18 22:51 UTC (permalink / raw)
To: linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Logan Gunthorpe
In the error path for nvme_init_subsystem(), nvme_put_subsystem()
will call device_put(), but device_put() will be called again after the
mutex_unlock().
The device_put() only needs to be called if device_add() fails.
This bug caused a KASAN use-after-free error when adding and
removing subsystems in a loop:
BUG: KASAN: use-after-free in device_del+0x8d9/0x9a0
Read of size 8 at addr ffff8883cdaf7120 by task multipathd/329
CPU: 0 PID: 329 Comm: multipathd Not tainted 5.2.0-rc6-vmlocalyes-00019-g70a2b39005fd-dirty #314
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
dump_stack+0x7b/0xb5
print_address_description+0x6f/0x280
? device_del+0x8d9/0x9a0
__kasan_report+0x148/0x199
? device_del+0x8d9/0x9a0
? class_release+0x100/0x130
? device_del+0x8d9/0x9a0
kasan_report+0x12/0x20
__asan_report_load8_noabort+0x14/0x20
device_del+0x8d9/0x9a0
? device_platform_notify+0x70/0x70
nvme_destroy_subsystem+0xf9/0x150
nvme_free_ctrl+0x280/0x3a0
device_release+0x72/0x1d0
kobject_put+0x144/0x410
put_device+0x13/0x20
nvme_free_ns+0xc4/0x100
nvme_release+0xb3/0xe0
__blkdev_put+0x549/0x6e0
? kasan_check_write+0x14/0x20
? bd_set_size+0xb0/0xb0
? kasan_check_write+0x14/0x20
? mutex_lock+0x8f/0xe0
? __mutex_lock_slowpath+0x20/0x20
? locks_remove_file+0x239/0x370
blkdev_put+0x72/0x2c0
blkdev_close+0x8d/0xd0
__fput+0x256/0x770
? _raw_read_lock_irq+0x40/0x40
____fput+0xe/0x10
task_work_run+0x10c/0x180
? filp_close+0xf7/0x140
exit_to_usermode_loop+0x151/0x170
do_syscall_64+0x240/0x2e0
? prepare_exit_to_usermode+0xd5/0x190
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f5a79af05d7
Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f c3 66 0f 1f 44 00 00 53 89 fb 48 83 ec 10 e8 c4 fb ff ff 89 df 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2b 89 d7 89 44 24 0c e8 06 fc ff ff 8b 44 24
RSP: 002b:00007f5a7799c810 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007f5a79af05d7
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000008
RBP: 00007f5a58000f98 R08: 0000000000000002 R09: 00007f5a7935ee80
R10: 0000000000000000 R11: 0000000000000293 R12: 000055e432447240
R13: 0000000000000000 R14: 0000000000000001 R15: 000055e4324a9cf0
Allocated by task 1236:
save_stack+0x21/0x80
__kasan_kmalloc.constprop.6+0xab/0xe0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x102/0x210
nvme_init_identify+0x13c3/0x3820
nvme_loop_configure_admin_queue+0x4fa/0x5e0
nvme_loop_create_ctrl+0x469/0xf40
nvmf_dev_write+0x19a3/0x21ab
__vfs_write+0x66/0x120
vfs_write+0x154/0x490
ksys_write+0x104/0x240
__x64_sys_write+0x73/0xb0
do_syscall_64+0xa5/0x2e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 329:
save_stack+0x21/0x80
__kasan_slab_free+0x129/0x190
kasan_slab_free+0xe/0x10
kfree+0xa7/0x200
nvme_release_subsystem+0x49/0x60
device_release+0x72/0x1d0
kobject_put+0x144/0x410
put_device+0x13/0x20
klist_class_dev_put+0x31/0x40
klist_put+0x8f/0xf0
klist_del+0xe/0x10
device_del+0x3a7/0x9a0
nvme_destroy_subsystem+0xf9/0x150
nvme_free_ctrl+0x280/0x3a0
device_release+0x72/0x1d0
kobject_put+0x144/0x410
put_device+0x13/0x20
nvme_free_ns+0xc4/0x100
nvme_release+0xb3/0xe0
__blkdev_put+0x549/0x6e0
blkdev_put+0x72/0x2c0
blkdev_close+0x8d/0xd0
__fput+0x256/0x770
____fput+0xe/0x10
task_work_run+0x10c/0x180
exit_to_usermode_loop+0x151/0x170
do_syscall_64+0x240/0x2e0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 32fd90c40768 ("nvme: change locking for the per-subsystem controller list")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
drivers/nvme/host/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cc09b81fc7f4..3ca33a2714e5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2489,6 +2489,7 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
if (ret) {
dev_err(ctrl->device,
"failed to register subsystem device.\n");
+ put_device(&subsys->dev);
goto out_unlock;
}
ida_init(&subsys->ns_ida);
@@ -2511,7 +2512,6 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
nvme_put_subsystem(subsys);
out_unlock:
mutex_unlock(&nvme_subsystems_lock);
- put_device(&subsys->dev);
return ret;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
2019-07-18 22:51 [PATCH 1/2] nvme-core: Fix extra device_put() call on error path Logan Gunthorpe
@ 2019-07-18 22:51 ` Logan Gunthorpe
2019-07-19 0:25 ` Sagi Grimberg
2019-07-18 23:52 ` [PATCH 1/2] nvme-core: Fix extra device_put() call on error path Sagi Grimberg
1 sibling, 1 reply; 11+ messages in thread
From: Logan Gunthorpe @ 2019-07-18 22:51 UTC (permalink / raw)
To: linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Logan Gunthorpe
With multipath enabled, nvme_scan_work() can read from the
device (through nvme_mpath_add_disk()). However, with fabrics,
once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
(see nvmf_check_ready()).
After setting the state to deleting, nvme_remove_namespaces() will
hang waiting for scan_work to flush, and both tasks will deadlock.
To fix this, ensure we take scan_lock before changing the ctrl state.
Also, ensure the state is checked while the lock is held
in nvme_scan_work().
INFO: task kworker/u4:3:166 blocked for more than 120 seconds.
Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u4:3 D 0 166 2 0x80004000
Workqueue: nvme-wq nvme_scan_work
Call Trace:
__schedule+0x851/0x1400
schedule+0x99/0x210
io_schedule+0x21/0x70
do_read_cache_page+0xa57/0x1330
read_cache_page+0x4a/0x70
read_dev_sector+0xbf/0x380
amiga_partition+0xc4/0x1230
check_partition+0x30f/0x630
rescan_partitions+0x19a/0x980
__blkdev_get+0x85a/0x12f0
blkdev_get+0x2a5/0x790
__device_add_disk+0xe25/0x1250
device_add_disk+0x13/0x20
nvme_mpath_set_live+0x172/0x2b0
nvme_update_ns_ana_state+0x130/0x180
nvme_set_ns_ana_state+0x9a/0xb0
nvme_parse_ana_log+0x1c3/0x4a0
nvme_mpath_add_disk+0x157/0x290
nvme_validate_ns+0x1017/0x1bd0
nvme_scan_work+0x44d/0x6a0
process_one_work+0x7d7/0x1240
worker_thread+0x8e/0xff0
kthread+0x2c3/0x3b0
ret_from_fork+0x35/0x40
INFO: task kworker/u4:1:1034 blocked for more than 120 seconds.
Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u4:1 D 0 1034 2 0x80004000
Workqueue: nvme-delete-wq nvme_delete_ctrl_work
Call Trace:
__schedule+0x851/0x1400
schedule+0x99/0x210
schedule_timeout+0x390/0x830
wait_for_completion+0x1a7/0x310
__flush_work+0x241/0x5d0
flush_work+0x10/0x20
nvme_remove_namespaces+0x85/0x3d0
nvme_do_delete_ctrl+0xb4/0x1e0
nvme_delete_ctrl_work+0x15/0x20
process_one_work+0x7d7/0x1240
worker_thread+0x8e/0xff0
kthread+0x2c3/0x3b0
ret_from_fork+0x35/0x40
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
drivers/nvme/host/core.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3ca33a2714e5..0a7b46066fe3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -301,6 +301,9 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
unsigned long flags;
bool changed = false;
+ /* Ensure a scan isn't in progress when we change the state */
+ mutex_lock(&ctrl->scan_lock);
+
spin_lock_irqsave(&ctrl->lock, flags);
old_state = ctrl->state;
@@ -375,6 +378,8 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
ctrl->state = new_state;
spin_unlock_irqrestore(&ctrl->lock, flags);
+ mutex_unlock(&ctrl->scan_lock);
+
if (changed && ctrl->state == NVME_CTRL_LIVE)
nvme_kick_requeue_lists(ctrl);
return changed;
@@ -3534,6 +3539,8 @@ static void nvme_scan_work(struct work_struct *work)
struct nvme_id_ctrl *id;
unsigned nn;
+ mutex_lock(&ctrl->scan_lock);
+
if (ctrl->state != NVME_CTRL_LIVE)
return;
@@ -3547,7 +3554,6 @@ static void nvme_scan_work(struct work_struct *work)
if (nvme_identify_ctrl(ctrl, &id))
return;
- mutex_lock(&ctrl->scan_lock);
nn = le32_to_cpu(id->nn);
if (ctrl->vs >= NVME_VS(1, 1, 0) &&
!(ctrl->quirks & NVME_QUIRK_IDENTIFY_CNS)) {
--
2.20.1
* Re: [PATCH 1/2] nvme-core: Fix extra device_put() call on error path
2019-07-18 22:51 [PATCH 1/2] nvme-core: Fix extra device_put() call on error path Logan Gunthorpe
2019-07-18 22:51 ` [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning Logan Gunthorpe
@ 2019-07-18 23:52 ` Sagi Grimberg
1 sibling, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2019-07-18 23:52 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig
Looks good Logan,
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
2019-07-18 22:51 ` [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning Logan Gunthorpe
@ 2019-07-19 0:25 ` Sagi Grimberg
2019-07-19 0:31 ` Sagi Grimberg
2019-07-19 0:39 ` Logan Gunthorpe
0 siblings, 2 replies; 11+ messages in thread
From: Sagi Grimberg @ 2019-07-19 0:25 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig
> With multipath enabled, nvme_scan_work() can read from the
> device (through nvme_mpath_add_disk()). However, with fabrics,
> once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
> (see nvmf_check_ready()).
>
> After setting the state to deleting, nvme_remove_namespaces() will
> hang waiting for scan_work to flush and these tasks will hang.
>
> To fix this, ensure we take scan_lock before changing the ctrl state.
> Also, ensure the state is checked while the lock is held
> in nvme_scan_work().
That's a big hammer...
But this is I/O that we cannot have queued until we have a path..
I would rather have nvme_remove_namespaces() requeue all I/Os for
namespaces that serve as the current_path and have the make_request
routine to fail I/O if all controllers are deleting as well.
Would something like [1] (untested) make sense instead?
> + mutex_lock(&ctrl->scan_lock);
> +
> if (ctrl->state != NVME_CTRL_LIVE)
> return;
unlock
>
> @@ -3547,7 +3554,6 @@ static void nvme_scan_work(struct work_struct *work)
> if (nvme_identify_ctrl(ctrl, &id))
> return;
unlock
[1]:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 76cd3dd8736a..627f5871858d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3576,6 +3576,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
struct nvme_ns *ns, *next;
LIST_HEAD(ns_list);
+ mutex_lock(&ctrl->scan_lock);
+ list_for_each_entry(ns, &ctrl->namespaces, list)
+ nvme_mpath_clear_current_path(ns);
+ mutex_lock(&ctrl->scan_lock);
+
/* prevent racing with ns scanning */
flush_work(&ctrl->scan_work);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a9a927677970..da1731266788 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -231,6 +231,24 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
return ns;
}
+static bool nvme_available_path(struct nvme_ns_head *head)
+{
+ struct nvme_ns *ns;
+
+ list_for_each_entry_rcu(ns, &head->list, siblings) {
+ switch (ns->ctrl->state) {
+ case NVME_CTRL_LIVE:
+ case NVME_CTRL_RESETTING:
+ case NVME_CTRL_CONNECTING:
+ /* fallthru */
+ return true;
+ default:
+ break;
+ }
+ }
+ return false;
+}
+
static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
struct bio *bio)
{
@@ -257,14 +275,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
disk_devt(ns->head->disk),
bio->bi_iter.bi_sector);
ret = direct_make_request(bio);
- } else if (!list_empty_careful(&head->list)) {
- dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
+ } else if (nvme_available_path(head)) {
+ dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
spin_lock_irq(&head->requeue_lock);
bio_list_add(&head->requeue_list, bio);
spin_unlock_irq(&head->requeue_lock);
} else {
- dev_warn_ratelimited(dev, "no path - failing I/O\n");
+ dev_warn_ratelimited(dev, "no available path - failing I/O\n");
bio->bi_status = BLK_STS_IOERR;
bio_endio(bio);
--
* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
2019-07-19 0:25 ` Sagi Grimberg
@ 2019-07-19 0:31 ` Sagi Grimberg
2019-07-19 20:15 ` Sagi Grimberg
2019-07-19 0:39 ` Logan Gunthorpe
1 sibling, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2019-07-19 0:31 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig
> [1]:
Or actually:
--
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 76cd3dd8736a..a0e2072fe73e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3576,6 +3576,12 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
struct nvme_ns *ns, *next;
LIST_HEAD(ns_list);
+ mutex_lock(&ctrl->scan_lock);
+ list_for_each_entry(ns, &ctrl->namespaces, list)
+ if (nvme_mpath_clear_current_path(ns))
+ kblockd_schedule_work(&ns->head->requeue_work);
+ mutex_lock(&ctrl->scan_lock);
+
/* prevent racing with ns scanning */
flush_work(&ctrl->scan_work);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a9a927677970..0e7e41381388 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -109,18 +109,22 @@ static const char *nvme_ana_state_names[] = {
[NVME_ANA_CHANGE] = "change",
};
-void nvme_mpath_clear_current_path(struct nvme_ns *ns)
+bool nvme_mpath_clear_current_path(struct nvme_ns *ns)
{
struct nvme_ns_head *head = ns->head;
+ bool changed = false;
int node;
if (!head)
- return;
+ return changed;
for_each_node(node) {
- if (ns == rcu_access_pointer(head->current_path[node]))
+ if (ns == rcu_access_pointer(head->current_path[node])) {
rcu_assign_pointer(head->current_path[node], NULL);
+ changed = true;
+ }
}
+ return changed;
}
static bool nvme_path_is_disabled(struct nvme_ns *ns)
@@ -231,6 +235,24 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
return ns;
}
+static bool nvme_available_path(struct nvme_ns_head *head)
+{
+ struct nvme_ns *ns;
+
+ list_for_each_entry_rcu(ns, &head->list, siblings) {
+ switch (ns->ctrl->state) {
+ case NVME_CTRL_LIVE:
+ case NVME_CTRL_RESETTING:
+ case NVME_CTRL_CONNECTING:
+ /* fallthru */
+ return true;
+ default:
+ break;
+ }
+ }
+ return false;
+}
+
static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
struct bio *bio)
{
@@ -257,14 +279,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
disk_devt(ns->head->disk),
bio->bi_iter.bi_sector);
ret = direct_make_request(bio);
- } else if (!list_empty_careful(&head->list)) {
- dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
+ } else if (nvme_available_path(head)) {
+ dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
spin_lock_irq(&head->requeue_lock);
bio_list_add(&head->requeue_list, bio);
spin_unlock_irq(&head->requeue_lock);
} else {
- dev_warn_ratelimited(dev, "no path - failing I/O\n");
+ dev_warn_ratelimited(dev, "no available path - failing I/O\n");
bio->bi_status = BLK_STS_IOERR;
bio_endio(bio);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 716a876119c8..915179368d30 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -496,7 +496,7 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head);
int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
void nvme_mpath_uninit(struct nvme_ctrl *ctrl);
void nvme_mpath_stop(struct nvme_ctrl *ctrl);
-void nvme_mpath_clear_current_path(struct nvme_ns *ns);
+bool nvme_mpath_clear_current_path(struct nvme_ns *ns);
struct nvme_ns *nvme_find_path(struct nvme_ns_head *head);
static inline void nvme_mpath_check_last_path(struct nvme_ns *ns)
--
* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
2019-07-19 0:25 ` Sagi Grimberg
2019-07-19 0:31 ` Sagi Grimberg
@ 2019-07-19 0:39 ` Logan Gunthorpe
2019-07-19 0:50 ` Sagi Grimberg
1 sibling, 1 reply; 11+ messages in thread
From: Logan Gunthorpe @ 2019-07-19 0:39 UTC (permalink / raw)
To: Sagi Grimberg, linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig
On 2019-07-18 6:25 p.m., Sagi Grimberg wrote:
>
>> With multipath enabled, nvme_scan_work() can read from the
>> device (through nvme_mpath_add_disk()). However, with fabrics,
>> once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
>> (see nvmf_check_ready()).
>>
>> After setting the state to deleting, nvme_remove_namespaces() will
>> hang waiting for scan_work to flush and these tasks will hang.
>>
>> To fix this, ensure we take scan_lock before changing the ctrl state.
>> Also, ensure the state is checked while the lock is held
>> in nvme_scan_work().
>
> That's a big hammer...
I didn't think the scan_lock was that contested or that
nvme_change_ctrl_state() was really called that often...
> But this is I/O that we cannot have queued until we have a path..
>
> I would rather have nvme_remove_namespaces() requeue all I/Os for
> namespaces that serve as the current_path and have the make_request
> routine to fail I/O if all controllers are deleting as well.
>
> Would something like [1] (untested) make sense instead?
I'll have to give this a try next week and I'll let you know then. It
kind of makes sense to me but a number of things I tried to fix this
that I thought made sense did not work.
>
>> + mutex_lock(&ctrl->scan_lock);
>> +
>> if (ctrl->state != NVME_CTRL_LIVE)
>> return;
>
> unlock
If we unlock here and relock below, we'd have to recheck the ctrl->state
to avoid any races. If you don't want to call nvme_identify_ctrl with
the lock held, then it would probably be better to move the state check
below it.
>> @@ -3547,7 +3554,6 @@ static void nvme_scan_work(struct work_struct *work)
>> if (nvme_identify_ctrl(ctrl, &id))
>> return;
>
> unlock
>
>
> [1]:
> --
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 76cd3dd8736a..627f5871858d 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3576,6 +3576,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> struct nvme_ns *ns, *next;
> LIST_HEAD(ns_list);
>
> + mutex_lock(&ctrl->scan_lock);
> + list_for_each_entry(ns, &ctrl->namespaces, list)
> + nvme_mpath_clear_current_path(ns);
> + mutex_lock(&ctrl->scan_lock);
> +
> /* prevent racing with ns scanning */
> flush_work(&ctrl->scan_work);
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a9a927677970..da1731266788 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -231,6 +231,24 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
> return ns;
> }
>
> +static bool nvme_available_path(struct nvme_ns_head *head)
> +{
> + struct nvme_ns *ns;
> +
> + list_for_each_entry_rcu(ns, &head->list, siblings) {
> + switch (ns->ctrl->state) {
> + case NVME_CTRL_LIVE:
> + case NVME_CTRL_RESETTING:
> + case NVME_CTRL_CONNECTING:
> + /* fallthru */
> + return true;
> + default:
> + break;
> + }
> + }
> + return false;
> +}
> +
> static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
> struct bio *bio)
> {
> @@ -257,14 +275,14 @@ static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
> disk_devt(ns->head->disk),
> bio->bi_iter.bi_sector);
> ret = direct_make_request(bio);
> - } else if (!list_empty_careful(&head->list)) {
> - dev_warn_ratelimited(dev, "no path available - requeuing I/O\n");
> + } else if (nvme_available_path(head)) {
> + dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
>
> spin_lock_irq(&head->requeue_lock);
> bio_list_add(&head->requeue_list, bio);
> spin_unlock_irq(&head->requeue_lock);
> } else {
> - dev_warn_ratelimited(dev, "no path - failing I/O\n");
> + dev_warn_ratelimited(dev, "no available path - failing I/O\n");
>
> bio->bi_status = BLK_STS_IOERR;
> bio_endio(bio);
> --
* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
2019-07-19 0:39 ` Logan Gunthorpe
@ 2019-07-19 0:50 ` Sagi Grimberg
2019-07-24 19:12 ` Logan Gunthorpe
0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2019-07-19 0:50 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig
>>> With multipath enabled, nvme_scan_work() can read from the
>>> device (through nvme_mpath_add_disk()). However, with fabrics,
>>> once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
>>> (see nvmf_check_ready()).
>>>
>>> After setting the state to deleting, nvme_remove_namespaces() will
>>> hang waiting for scan_work to flush and these tasks will hang.
>>>
>>> To fix this, ensure we take scan_lock before changing the ctrl state.
>>> Also, ensure the state is checked while the lock is held
>>> in nvme_scan_work().
>>
>> That's a big hammer...
>
> I didn't think the scan_lock was that contested or that
> nvme_change_ctrl_state() was really called that often...
it shouldn't be, but I think it makes the flow more convoluted
as we serialize by flushing the scan_work right after...
The design principle is met as we do get the I/O failing,
but its just that with mpath we simply queue the I/O again
because the head->list happens to not be empty.
Perhaps taking care of that check is cleaner.
>> But this is I/O that we cannot have queued until we have a path..
>>
>> I would rather have nvme_remove_namespaces() requeue all I/Os for
>> namespaces that serve as the current_path and have the make_request
>> routine to fail I/O if all controllers are deleting as well.
>>
>> Would something like [1] (untested) make sense instead?
>
> I'll have to give this a try next week and I'll let you know then. It
> kind of makes sense to me but a number of things I tried to fix this
> that I thought made sense did not work.
Thanks. Do you have a firm reproducer for it?
>>> + mutex_lock(&ctrl->scan_lock);
>>> +
>>> if (ctrl->state != NVME_CTRL_LIVE)
>>> return;
>>
>> unlock
>
> If we unlock here and relock below, we'd have to recheck the ctrl->state
> to avoid any races. If you don't want to call nvme_identify_ctrl with
> the lock held, then it would probably be better to move the state check
> below it.
Meant before the return statement.
>
>>> @@ -3547,7 +3554,6 @@ static void nvme_scan_work(struct work_struct *work)
>>> if (nvme_identify_ctrl(ctrl, &id))
>>> return;
>>
>> unlock
Same here.
* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
2019-07-19 0:31 ` Sagi Grimberg
@ 2019-07-19 20:15 ` Sagi Grimberg
0 siblings, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2019-07-19 20:15 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig
>> [1]:
> Or actually:
> --
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 76cd3dd8736a..a0e2072fe73e 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3576,6 +3576,12 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> struct nvme_ns *ns, *next;
> LIST_HEAD(ns_list);
>
> + mutex_lock(&ctrl->scan_lock);
> + list_for_each_entry(ns, &ctrl->namespaces, list)
> + if (nvme_mpath_clear_current_path(ns))
> + kblockd_schedule_work(&ns->head->requeue_work);
> + mutex_lock(&ctrl->scan_lock);
This should be mutex_unlock of course...
* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
2019-07-19 0:50 ` Sagi Grimberg
@ 2019-07-24 19:12 ` Logan Gunthorpe
2019-07-25 18:16 ` Sagi Grimberg
0 siblings, 1 reply; 11+ messages in thread
From: Logan Gunthorpe @ 2019-07-24 19:12 UTC (permalink / raw)
To: Sagi Grimberg, linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig
Hey,
Sorry for the delay.
I tested your patch and it does work. Do you want me to send your change
as a full patch? Can I add your signed-off-by?
On 2019-07-18 6:50 p.m., Sagi Grimberg wrote:
>> I didn't think the scan_lock was that contested or that
>> nvme_change_ctrl_state() was really called that often...
>
> it shouldn't be, but I think it makes the flow more convoluted
> as we serialize by flushing the scan_work right after...
I would argue that the check for state in nvme_scan_work() without a
lock is racy and confusing. There's nothing to prevent the state from
changing immediately after the check.
> The design principal is met as we do get the I/O failing,
> but its just that with mpath we simply queue the I/O again
> because the head->list happens to not be empty.
> Perhaps taking care of that check is cleaner.
Yes, I feel your patch is a good solution on its own merits.
> Thanks. Do you have a firm reproducer for it?
Yes. If you connect to and then immediately disconnect from a target (at
least with nvme-loop) you will reliably trigger this bug -- or one of
the others I've sent patches for.
>>>> + mutex_lock(&ctrl->scan_lock);
>>>> +
>>>> if (ctrl->state != NVME_CTRL_LIVE)
>>>> return;
>>>
>>> unlock
>>
>> If we unlock here and relock below, we'd have to recheck the ctrl->state
>> to avoid any races. If you don't want to call nvme_identify_ctrl with
>> the lock held, then it would probably be better to move the state check
>> below it.
>
> Meant before the return statement.
Ah, right, my mistake.
Logan
* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
2019-07-24 19:12 ` Logan Gunthorpe
@ 2019-07-25 18:16 ` Sagi Grimberg
2019-07-25 18:20 ` Logan Gunthorpe
0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2019-07-25 18:16 UTC (permalink / raw)
To: Logan Gunthorpe, linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig
> Hey,
>
> Sorry for the delay.
>
> I tested your patch and it does work. Do you want me to send your change
> as a full patch? Can I add your signed-off-by?
I have a patch ready with a proper comment in place. I can send it
out... Can I get your tested-by and reported-by?
* Re: [PATCH 2/2] nvme-core: Fix deadlock when deleting the ctrl while scanning
2019-07-25 18:16 ` Sagi Grimberg
@ 2019-07-25 18:20 ` Logan Gunthorpe
0 siblings, 0 replies; 11+ messages in thread
From: Logan Gunthorpe @ 2019-07-25 18:20 UTC (permalink / raw)
To: Sagi Grimberg, linux-kernel, linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig
On 2019-07-25 12:16 p.m., Sagi Grimberg wrote:
>
>> Hey,
>>
>> Sorry for the delay.
>>
>> I tested your patch and it does work. Do you want me to send your change
>> as a full patch? Can I add your signed-off-by?
>
> I have a patch ready with a proper comment in place. I can send it
> out... Can I get your tested-by and reported-by?
Yup. I can retest once you send the patch and I'll send a tested-by then.
For now you can add my reported-by:
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Note: I had to modify your original patch a fair bit so it compiled with
multipath disabled.
Logan