linux-nvme.lists.infradead.org archive mirror
* [PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates
@ 2020-06-24  0:18 Sagi Grimberg
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked Sagi Grimberg
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  0:18 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Anton Eidelman

Changes from v1:
- Fixed compilation error in patch #4
- Added patch #5 to resolve a use-after-free condition

Hey All,

The following patches address several deadlocks observed while stress
testing a connect/disconnect storm combined with rapid concurrent ANA
path switches (paths may transition between live<->inaccessible) on a
large number of namespaces (100+).

The test triggers three main flows:
1. ongoing ns scanning, in the presence of concurrent ANA path state changes
   and controller removals (disconnect).
2. ongoing ns scanning (or ana processing) in the presence of concurrent
   controller removal (disconnect).
3. ongoing ANA processing in the presence of concurrent controller removal
   (disconnect).

What we observed is that when we disconnect while scan_work and/or
ana_work are running, we can easily deadlock. The main reason is that
scan_work and ana_work may both register the gendisk, triggering I/O
(partition scans). Given that a controller removal (disconnect) may be
running at the same time, this I/O may block. Because the head->disk
I/O blocks under the locks taken by both ana_work and scan_work, no
other path can update the path states and, by doing so, unblock the
blocked I/O. A sketch of the pattern follows.
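
The pattern, abridged (a sketch, not the literal code;
nvme_mpath_set_live is reached from both scan_work and ana_work):

	static void nvme_mpath_set_live(struct nvme_ns *ns)
	{
		struct nvme_ns_head *head = ns->head;

		mutex_lock(&head->lock);
		/* registering the gendisk kicks off a partition scan */
		device_add_disk(&head->subsys->dev, head->disk,
				nvme_ns_id_attr_groups);
		/*
		 * If no path is currently accessible the scan I/O is
		 * requeued and blocks here, while the work item that
		 * could mark another path live needs this same lock:
		 * deadlock.
		 */
		mutex_unlock(&head->lock);
	}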

With this patchset applied, the test is able to pass successfully without any
deadlocks.

The last patch is posted as an RFC: while it solves a real problem, it
essentially adds state to the controller without going via the normal
controller state machine. The reason is that the controller state also
affects ongoing mpath I/O, which is the original cause of the deadlock.
We are open to better alternative suggestions if such exist.

Anton Eidelman (3):
  nvme-multipath: fix deadlock between ana_work and scan_work
  nvme-multipath: fix deadlock due to head->lock
  nvme-core: fix deadlock in disconnect during scan_work and/or ana_work

Sagi Grimberg (3):
  nvme: fix possible deadlock when I/O is blocked
  nvme: don't protect ns mutation with ns->head->lock
  nvme-multipath: fix bogus request queue reference put

 drivers/nvme/host/core.c      | 11 +++++++-
 drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++----------
 drivers/nvme/host/nvme.h      |  3 +++
 3 files changed, 47 insertions(+), 15 deletions(-)

-- 
2.25.1



* [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked
  2020-06-24  0:18 [PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates Sagi Grimberg
@ 2020-06-24  0:18 ` Sagi Grimberg
  2020-06-24  6:29   ` Christoph Hellwig
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 2/6] nvme-multipath: fix deadlock between ana_work and scan_work Sagi Grimberg
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  0:18 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Anton Eidelman

Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
in nvme_validate_ns")

When adding a new namespace to the head disk (via nvme_mpath_set_live)
a partition scan is triggered, which issues I/O on the mpath device
node. This usually runs from scan_work, which holds the scan_lock. If
that I/O blocks (e.g. an ANA change has left us with available paths,
none of which are accessible) we can deadlock on the head disk
bd_mutex: the partition scan I/O takes it, and head disk revalidation
(also triggered from scan_work on a different path) takes it to check
for a resize. See trace [1].

This is no longer needed since commit cb224c3af4df ("nvme: Convert to
use set_capacity_revalidate_and_notify") which already updates resize
info without unnecessarily revalidating the disk.
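
For reference, the bd_mutex dependency comes from revalidate_disk()
itself; an abridged sketch from the v5.3-era block layer
(fs/block_dev.c):

	int revalidate_disk(struct gendisk *disk)
	{
		struct block_device *bdev;
		int ret = 0;

		if (disk->fops->revalidate_disk)
			ret = disk->fops->revalidate_disk(disk);

		bdev = bdget_disk(disk, 0);
		if (!bdev)
			return ret;

		/* the same mutex the partition scan I/O holds */
		mutex_lock(&bdev->bd_mutex);
		check_disk_size_change(disk, bdev, ret == 0);
		bdev->bd_invalidated = 0;
		mutex_unlock(&bdev->bd_mutex);
		bdput(bdev);
		return ret;
	}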

[1]:
--
kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u65:9   D    0   494      2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  schedule_preempt_disabled+0xe/0x10
kernel:  __mutex_lock.isra.0+0x182/0x4f0
kernel:  __mutex_lock_slowpath+0x13/0x20
kernel:  mutex_lock+0x2e/0x40
kernel:  revalidate_disk+0x63/0xa0
kernel:  __nvme_revalidate_disk+0xfe/0x110 [nvme_core]
kernel:  nvme_revalidate_disk+0xa4/0x160 [nvme_core]
kernel:  ? evict+0x14c/0x1b0
kernel:  revalidate_disk+0x2b/0xa0
kernel:  nvme_validate_ns+0x49/0x940 [nvme_core]
kernel:  ? blk_mq_free_request+0xd2/0x100
kernel:  ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140
kernel:  ? process_one_work+0x380/0x380
kernel:  ? kthread_park+0x80/0x80
kernel:  ret_from_fork+0x1f/0x40
...
kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u65:1   D    0  2630      2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  io_schedule+0x16/0x40
kernel:  do_read_cache_page+0x438/0x830
kernel:  ? __switch_to_asm+0x34/0x70
kernel:  ? file_fdatawait_range+0x30/0x30
kernel:  read_cache_page+0x12/0x20
kernel:  read_dev_sector+0x27/0xc0
kernel:  read_lba+0xc1/0x220
kernel:  ? kmem_cache_alloc_trace+0x19c/0x230
kernel:  efi_partition+0x1e6/0x708
kernel:  ? vsnprintf+0x39e/0x4e0
kernel:  ? snprintf+0x49/0x60
kernel:  check_partition+0x154/0x244
kernel:  rescan_partitions+0xae/0x280
kernel:  __blkdev_get+0x40f/0x560
kernel:  blkdev_get+0x3d/0x140
kernel:  __device_add_disk+0x388/0x480
kernel:  device_add_disk+0x13/0x20
kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel:  ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core]
kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
kernel:  ? blk_mq_free_request+0xd2/0x100
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140
kernel:  ? process_one_work+0x380/0x380
kernel:  ? kthread_park+0x80/0x80
kernel:  ret_from_fork+0x1f/0x40
--

Fixes: fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 915fa2e609eb..28f4388c1337 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1974,7 +1974,6 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	if (ns->head->disk) {
 		nvme_update_disk_info(ns->head->disk, ns, id);
 		blk_queue_stack_limits(ns->head->disk->queue, ns->queue);
-		revalidate_disk(ns->head->disk);
 	}
 #endif
 	return 0;
-- 
2.25.1



* [PATCH v2 for-5.8-rc 2/6] nvme-multipath: fix deadlock between ana_work and scan_work
  2020-06-24  0:18 [PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates Sagi Grimberg
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked Sagi Grimberg
@ 2020-06-24  0:18 ` Sagi Grimberg
  2020-06-24  6:34   ` Christoph Hellwig
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 3/6] nvme: don't protect ns mutation with ns->head->lock Sagi Grimberg
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  0:18 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Anton Eidelman

From: Anton Eidelman <anton@lightbitslabs.com>

When scan_work calls nvme_mpath_add_disk() it holds ana_lock while
invoking nvme_parse_ana_log(), which may issue I/O in device_add_disk()
and hang waiting for an accessible path. While nvme_mpath_set_live()
is only called when nvme_state_is_live() holds, a transition may cause
NVME_SC_ANA_TRANSITION and requeue the I/O.

In order to recover and complete the I/O, ana_work on the same ctrl
should be able to update the path state and clear NVME_NS_ANA_PENDING.

The deadlock occurs because scan_work keeps holding ana_lock, so
ana_work hangs [1].
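
The pre-patch flow in nvme_mpath_add_disk(), abridged to make the lock
span explicit:

	mutex_lock(&ns->ctrl->ana_lock);
	ns->ana_grpid = le32_to_cpu(id->anagrpid);
	/*
	 * nvme_set_ns_ana_state -> nvme_update_ns_ana_state ->
	 * nvme_mpath_set_live -> device_add_disk: the partition scan
	 * I/O may block here with ana_lock still held, while ana_work
	 * needs ana_lock in order to unblock it.
	 */
	nvme_parse_ana_log(ns->ctrl, ns, nvme_set_ns_ana_state);
	mutex_unlock(&ns->ctrl->ana_lock);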

Fix:
nvme_mpath_add_disk() now uses nvme_parse_ana_log() only to obtain a
copy of the ANA group desc, and then calls nvme_update_ns_ana_state()
without holding ana_lock.

[1]:
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  io_schedule+0x16/0x40
kernel:  do_read_cache_page+0x438/0x830
kernel:  read_cache_page+0x12/0x20
kernel:  read_dev_sector+0x27/0xc0
kernel:  read_lba+0xc1/0x220
kernel:  efi_partition+0x1e6/0x708
kernel:  check_partition+0x154/0x244
kernel:  rescan_partitions+0xae/0x280
kernel:  __blkdev_get+0x40f/0x560
kernel:  blkdev_get+0x3d/0x140
kernel:  __device_add_disk+0x388/0x480
kernel:  device_add_disk+0x13/0x20
kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140

kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  schedule_preempt_disabled+0xe/0x10
kernel:  __mutex_lock.isra.0+0x182/0x4f0
kernel:  ? __switch_to_asm+0x34/0x70
kernel:  ? select_task_rq_fair+0x1aa/0x5c0
kernel:  ? kvm_sched_clock_read+0x11/0x20
kernel:  ? sched_clock+0x9/0x10
kernel:  __mutex_lock_slowpath+0x13/0x20
kernel:  mutex_lock+0x2e/0x40
kernel:  nvme_read_ana_log+0x3a/0x100 [nvme_core]
kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x4d/0x400
kernel:  kthread+0x104/0x140
kernel:  ? process_one_work+0x380/0x380
kernel:  ? kthread_park+0x80/0x80
kernel:  ret_from_fork+0x35/0x40

Fixes: 0d0b660f214d ("nvme: add ANA support")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/multipath.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index da78e499947a..a646f9eb2e73 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -640,13 +640,13 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
 }
 DEVICE_ATTR_RO(ana_state);
 
-static int nvme_set_ns_ana_state(struct nvme_ctrl *ctrl,
+static int nvme_get_ns_ana_state(struct nvme_ctrl *ctrl,
 		struct nvme_ana_group_desc *desc, void *data)
 {
-	struct nvme_ns *ns = data;
+	struct nvme_ana_group_desc *dst = data;
 
-	if (ns->ana_grpid == le32_to_cpu(desc->grpid)) {
-		nvme_update_ns_ana_state(desc, ns);
+	if (desc->grpid == dst->grpid) {
+		*dst = *desc;
 		return -ENXIO; /* just break out of the loop */
 	}
 
@@ -656,10 +656,19 @@ static int nvme_set_ns_ana_state(struct nvme_ctrl *ctrl,
 void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id)
 {
 	if (nvme_ctrl_use_ana(ns->ctrl)) {
+		struct nvme_ana_group_desc desc = {
+			.grpid = id->anagrpid,
+			.state = 0,
+		};
+
 		mutex_lock(&ns->ctrl->ana_lock);
 		ns->ana_grpid = le32_to_cpu(id->anagrpid);
-		nvme_parse_ana_log(ns->ctrl, ns, nvme_set_ns_ana_state);
+		nvme_parse_ana_log(ns->ctrl, &desc, nvme_get_ns_ana_state);
 		mutex_unlock(&ns->ctrl->ana_lock);
+		if (desc.state) {
+			/* found the group desc: update */
+			nvme_update_ns_ana_state(&desc, ns);
+		}
 	} else {
 		mutex_lock(&ns->head->lock);
 		ns->ana_state = NVME_ANA_OPTIMIZED; 
-- 
2.25.1



* [PATCH v2 for-5.8-rc 3/6] nvme: don't protect ns mutation with ns->head->lock
  2020-06-24  0:18 [PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates Sagi Grimberg
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked Sagi Grimberg
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 2/6] nvme-multipath: fix deadlock between ana_work and scan_work Sagi Grimberg
@ 2020-06-24  0:18 ` Sagi Grimberg
  2020-06-24  6:37   ` Christoph Hellwig
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 4/6] nvme-multipath: fix deadlock due to head->lock Sagi Grimberg
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  0:18 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Anton Eidelman

Right now ns->head->lock protects namespace mutation, which is wrong
and unneeded. Move it to only protect against head mutations. While
we're at it, remove the unnecessary ns->head dereferences, as we
already have a head pointer.
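
For context, the mutated fields are per-ns rather than shared head
state; abridged (assumption: field layout as in that era's
drivers/nvme/host/nvme.h):

	struct nvme_ns {
		struct nvme_ctrl *ctrl;
	#ifdef CONFIG_NVME_MULTIPATH
		enum nvme_ana_state ana_state;
		u32 ana_grpid;
	#endif
		unsigned long flags;
	#define NVME_NS_REMOVING	0
	#define NVME_NS_DEAD		1
	#define NVME_NS_ANA_PENDING	2
		/* ... */
	};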

Fixes: 0d0b660f214d ("nvme: add ANA support")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/multipath.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a646f9eb2e73..c8fc42e37138 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -409,11 +409,10 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
 {
 	struct nvme_ns_head *head = ns->head;
 
-	lockdep_assert_held(&ns->head->lock);
-
 	if (!head->disk)
 		return;
 
+	mutex_lock(&head->lock);
 	if (!(head->disk->flags & GENHD_FL_UP))
 		device_add_disk(&head->subsys->dev, head->disk,
 				nvme_ns_id_attr_groups);
@@ -427,8 +426,9 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
 		srcu_read_unlock(&head->srcu, srcu_idx);
 	}
 
-	synchronize_srcu(&ns->head->srcu);
-	kblockd_schedule_work(&ns->head->requeue_work);
+	synchronize_srcu(&head->srcu);
+	kblockd_schedule_work(&head->requeue_work);
+	mutex_unlock(&head->lock);
 }
 
 static int nvme_parse_ana_log(struct nvme_ctrl *ctrl, void *data,
@@ -483,14 +483,12 @@ static inline bool nvme_state_is_live(enum nvme_ana_state state)
 static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
 		struct nvme_ns *ns)
 {
-	mutex_lock(&ns->head->lock);
 	ns->ana_grpid = le32_to_cpu(desc->grpid);
 	ns->ana_state = desc->state;
 	clear_bit(NVME_NS_ANA_PENDING, &ns->flags);
 
 	if (nvme_state_is_live(ns->ana_state))
 		nvme_mpath_set_live(ns);
-	mutex_unlock(&ns->head->lock);
 }
 
 static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
@@ -670,10 +668,8 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id)
 			nvme_update_ns_ana_state(&desc, ns);
 		}
 	} else {
-		mutex_lock(&ns->head->lock);
 		ns->ana_state = NVME_ANA_OPTIMIZED; 
 		nvme_mpath_set_live(ns);
-		mutex_unlock(&ns->head->lock);
 	}
 
 	if (bdi_cap_stable_pages_required(ns->queue->backing_dev_info)) {
-- 
2.25.1



* [PATCH v2 for-5.8-rc 4/6] nvme-multipath: fix deadlock due to head->lock
  2020-06-24  0:18 [PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates Sagi Grimberg
                   ` (2 preceding siblings ...)
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 3/6] nvme: don't protect ns mutation with ns->head->lock Sagi Grimberg
@ 2020-06-24  0:18 ` Sagi Grimberg
  2020-06-24  6:39   ` Christoph Hellwig
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 5/6] nvme-multipath: fix bogus request queue reference put Sagi Grimberg
  2020-06-24  0:18 ` [PATCH v2 RFC 6/6] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work Sagi Grimberg
  5 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  0:18 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Anton Eidelman

From: Anton Eidelman <anton@lightbitslabs.com>

In the following scenario scan_work and ana_work will deadlock:

When scan_work calls nvme_mpath_add_disk() it holds ana_lock while
invoking nvme_parse_ana_log(), which may issue I/O in device_add_disk()
and hang waiting for an accessible path.

While nvme_mpath_set_live() is only called when nvme_state_is_live()
holds, a transition may cause NVME_SC_ANA_TRANSITION and requeue the
I/O.

Since nvme_mpath_set_live() holds ns->head->lock, ana_work on ANY ctrl
will not be able to complete nvme_mpath_set_live() on the same
ns->head, which is required in order to update the new accessible path
and clear NVME_NS_ANA_PENDING. Therefore the I/O never completes:
deadlock [1].

Fix:
Move device_add_disk out of the head->lock and protect it with an
atomic test_and_set for a new NVME_NS_HEAD_HAS_DISK bit.
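
For reference, the replacement (as in the diff below):
test_and_set_bit() is an atomic read-modify-write that returns the
previous bit value, so exactly one caller observes 0 and registers the
gendisk, with no need for head->lock around the GENHD_FL_UP check:

	if (!test_and_set_bit(NVME_NS_HEAD_HAS_DISK, &head->flags))
		device_add_disk(&head->subsys->dev, head->disk,
				nvme_ns_id_attr_groups);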

[1]:
kernel: INFO: task kworker/u8:2:160 blocked for more than 120 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u8:2    D    0   160      2 0x80004000
kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  schedule_preempt_disabled+0xe/0x10
kernel:  __mutex_lock.isra.0+0x182/0x4f0
kernel:  __mutex_lock_slowpath+0x13/0x20
kernel:  mutex_lock+0x2e/0x40
kernel:  nvme_update_ns_ana_state+0x22/0x60 [nvme_core]
kernel:  nvme_update_ana_state+0xca/0xe0 [nvme_core]
kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel:  nvme_read_ana_log+0x76/0x100 [nvme_core]
kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x4d/0x400
kernel:  kthread+0x104/0x140
kernel:  ret_from_fork+0x35/0x40
kernel: INFO: task kworker/u8:4:439 blocked for more than 120 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u8:4    D    0   439      2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  io_schedule+0x16/0x40
kernel:  do_read_cache_page+0x438/0x830
kernel:  read_cache_page+0x12/0x20
kernel:  read_dev_sector+0x27/0xc0
kernel:  read_lba+0xc1/0x220
kernel:  efi_partition+0x1e6/0x708
kernel:  check_partition+0x154/0x244
kernel:  rescan_partitions+0xae/0x280
kernel:  __blkdev_get+0x40f/0x560
kernel:  blkdev_get+0x3d/0x140
kernel:  __device_add_disk+0x388/0x480
kernel:  device_add_disk+0x13/0x20
kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel:  nvme_mpath_add_disk+0xbe/0x100 [nvme_core]
kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
kernel:  nvme_scan_work+0x256/0x390 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x4d/0x400
kernel:  kthread+0x104/0x140
kernel:  ret_from_fork+0x35/0x40

Fixes: 0d0b660f214d ("nvme: add ANA support")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/multipath.c | 4 ++--
 drivers/nvme/host/nvme.h      | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index c8fc42e37138..505f5f43cd69 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -412,11 +412,11 @@ static void nvme_mpath_set_live(struct nvme_ns *ns)
 	if (!head->disk)
 		return;
 
-	mutex_lock(&head->lock);
-	if (!(head->disk->flags & GENHD_FL_UP))
+	if (!test_and_set_bit(NVME_NS_HEAD_HAS_DISK, &head->flags))
 		device_add_disk(&head->subsys->dev, head->disk,
 				nvme_ns_id_attr_groups);
 
+	mutex_lock(&head->lock);
 	if (nvme_path_is_optimized(ns)) {
 		int node, srcu_idx;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c0f4226d3299..ee2553caaff6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -364,6 +364,8 @@ struct nvme_ns_head {
 	spinlock_t		requeue_lock;
 	struct work_struct	requeue_work;
 	struct mutex		lock;
+	unsigned long		flags;
+#define NVME_NS_HEAD_HAS_DISK	0
 	struct nvme_ns __rcu	*current_path[];
 #endif
 };
-- 
2.25.1



* [PATCH v2 for-5.8-rc 5/6] nvme-multipath: fix bogus request queue reference put
  2020-06-24  0:18 [PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates Sagi Grimberg
                   ` (3 preceding siblings ...)
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 4/6] nvme-multipath: fix deadlock due to head->lock Sagi Grimberg
@ 2020-06-24  0:18 ` Sagi Grimberg
  2020-06-24  6:40   ` Christoph Hellwig
  2020-06-24  0:18 ` [PATCH v2 RFC 6/6] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work Sagi Grimberg
  5 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  0:18 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Anton Eidelman

The mpath disk node takes a reference on the mpath request queue when
a live path is added to the mpath gendisk. However, if we only ever
connected to an inaccessible path, device_add_disk is not called, so
if we disconnect and remove the mpath gendisk we end up putting a
reference on the request queue that was never taken [1].

Fix that by checking if we ever added a live path (using the
NVME_NS_HEAD_HAS_DISK flag) and, if not, clearing the disk->queue
reference.
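
The unbalanced pairing, abridged from block/genhd.c of that era:

	static void __device_add_disk(struct device *parent,
			struct gendisk *disk, ...)
	{
		/* ... */
		/*
		 * Take an extra ref on queue which will be put on
		 * disk_release() so that it sticks around as long
		 * as @disk is there.
		 */
		WARN_ON_ONCE(!blk_get_queue(disk->queue));
		/* ... */
	}

	static void disk_release(struct device *dev)
	{
		struct gendisk *disk = dev_to_disk(dev);

		/* ... */
		/* this put runs even if the get above never did */
		if (disk->queue)
			blk_put_queue(disk->queue);
		kfree(disk);
	}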

[1]:
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 1 PID: 1372 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
CPU: 1 PID: 1372 Comm: nvme Tainted: G           O      5.7.0-rc2+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
RIP: 0010:refcount_warn_saturate+0xa6/0xf0
RSP: 0018:ffffb29e8053bdc0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8b7a2f4fc060 RCX: 0000000000000007
RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8b7a3ec99980
RBP: ffff8b7a2f4fc000 R08: 00000000000002e1 R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: fffffffffffffff2 R14: ffffb29e8053bf08 R15: ffff8b7a320e2da0
FS:  00007f135d4ca800(0000) GS:ffff8b7a3ec80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005651178c0c30 CR3: 000000003b650005 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 disk_release+0xa2/0xc0
 device_release+0x28/0x80
 kobject_put+0xa5/0x1b0
 nvme_put_ns_head+0x26/0x70 [nvme_core]
 nvme_put_ns+0x30/0x60 [nvme_core]
 nvme_remove_namespaces+0x9b/0xe0 [nvme_core]
 nvme_do_delete_ctrl+0x43/0x5c [nvme_core]
 nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
 kernfs_fop_write+0xc1/0x1a0
 vfs_write+0xb6/0x1a0
 ksys_write+0x5f/0xe0
 do_syscall_64+0x52/0x1a0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: Anton Eidelman <anton@lightbitslabs.com>
Tested-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/multipath.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 505f5f43cd69..de280d156bbb 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -691,6 +691,14 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 	kblockd_schedule_work(&head->requeue_work);
 	flush_work(&head->requeue_work);
 	blk_cleanup_queue(head->disk->queue);
+	if (!test_bit(NVME_NS_HEAD_HAS_DISK, &head->flags)) {
+		/*
+		 * if device_add_disk wasn't called, prevent
+		 * disk release to put a bogus reference on the
+		 * request queue
+		 */
+		head->disk->queue = NULL;
+	}
 	put_disk(head->disk);
 }
 
-- 
2.25.1



* [PATCH v2 RFC 6/6] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work
  2020-06-24  0:18 [PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates Sagi Grimberg
                   ` (4 preceding siblings ...)
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 5/6] nvme-multipath: fix bogus request queue reference put Sagi Grimberg
@ 2020-06-24  0:18 ` Sagi Grimberg
  2020-06-24  6:43   ` Christoph Hellwig
  5 siblings, 1 reply; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  0:18 UTC (permalink / raw)
  To: linux-nvme, Christoph Hellwig, Keith Busch; +Cc: Anton Eidelman

From: Anton Eidelman <anton@lightbitslabs.com>

A deadlock happens in the following scenario with multipath:
1) scan_work(nvme0) detects a new nsid while nvme0
    is an optimized path to it; path nvme1 happens to be
    inaccessible.

2) Before scan_work is complete an nvme0 disconnect is initiated:
    nvme_delete_ctrl_sync() sets the nvme0 state to NVME_CTRL_DELETING.

3) scan_work(nvme0) attempts to submit I/O,
    but nvme_path_is_optimized() observes nvme0 is not LIVE.
    Since nvme1 is a possible path, the I/O is requeued and scan_work
    hangs.
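
For reference, the check in question, abridged from that era's
drivers/nvme/host/multipath.c:

	static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
	{
		return ns->ctrl->state == NVME_CTRL_LIVE &&
			ns->ana_state == NVME_ANA_OPTIMIZED;
	}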

--
Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  io_schedule+0x16/0x40
kernel:  do_read_cache_page+0x438/0x830
kernel:  read_cache_page+0x12/0x20
kernel:  read_dev_sector+0x27/0xc0
kernel:  read_lba+0xc1/0x220
kernel:  efi_partition+0x1e6/0x708
kernel:  check_partition+0x154/0x244
kernel:  rescan_partitions+0xae/0x280
kernel:  __blkdev_get+0x40f/0x560
kernel:  blkdev_get+0x3d/0x140
kernel:  __device_add_disk+0x388/0x480
kernel:  device_add_disk+0x13/0x20
kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140
--

4) Delete also hangs in flush_work(ctrl->scan_work)
    from nvme_remove_namespaces().

Similarly, a deadlock with ana_work may happen: if ana_work has
started and calls nvme_mpath_set_live and device_add_disk, it will
trigger I/O. When we then initiate a disconnect, that I/O will block
because our accessible (optimized) path is disconnecting while the
alternate path is inaccessible. The disconnect then tries to flush
ana_work and hangs.

[  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
[  605.552087] Call Trace:
[  605.552683]  __schedule+0x2b9/0x6c0
[  605.553507]  schedule+0x42/0xb0
[  605.554201]  io_schedule+0x16/0x40
[  605.555012]  do_read_cache_page+0x438/0x830
[  605.556925]  read_cache_page+0x12/0x20
[  605.557757]  read_dev_sector+0x27/0xc0
[  605.558587]  amiga_partition+0x4d/0x4c5
[  605.561278]  check_partition+0x154/0x244
[  605.562138]  rescan_partitions+0xae/0x280
[  605.563076]  __blkdev_get+0x40f/0x560
[  605.563830]  blkdev_get+0x3d/0x140
[  605.564500]  __device_add_disk+0x388/0x480
[  605.565316]  device_add_disk+0x13/0x20
[  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
[  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
[  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
[  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
[  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
[  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
[  605.573330]  process_one_work+0x1db/0x380
[  605.574144]  worker_thread+0x4d/0x400
[  605.574896]  kthread+0x104/0x140
[  605.577205]  ret_from_fork+0x35/0x40
[  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
[  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
[  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.582320] nvme            D    0 14044  14043 0x00000000
[  605.583424] Call Trace:
[  605.583935]  __schedule+0x2b9/0x6c0
[  605.584625]  schedule+0x42/0xb0
[  605.585290]  schedule_timeout+0x203/0x2f0
[  605.588493]  wait_for_completion+0xb1/0x120
[  605.590066]  __flush_work+0x123/0x1d0
[  605.591758]  __cancel_work_timer+0x10e/0x190
[  605.593542]  cancel_work_sync+0x10/0x20
[  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
[  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
[  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
[  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
[  605.598320]  dev_attr_store+0x17/0x30

Fix: disconnect sets ctrl->abort_scan_and_ana, which prevents any
further scan_work or ana_work activation, and flushes both before
setting the ctrl state to NVME_CTRL_DELETING.

Fixes: 0d0b660f214d ("nvme: add ANA support")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
---
 drivers/nvme/host/core.c      | 10 ++++++++++
 drivers/nvme/host/multipath.c |  7 +++++++
 drivers/nvme/host/nvme.h      |  1 +
 3 files changed, 18 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 28f4388c1337..85e7c7887ae8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -198,6 +198,9 @@ static void nvme_delete_ctrl_sync(struct nvme_ctrl *ctrl)
 	 * since ->delete_ctrl can free the controller.
 	 */
 	nvme_get_ctrl(ctrl);
+	ctrl->abort_scan_and_ana = true;
+	flush_work(&ctrl->scan_work);
+	nvme_mpath_stop(ctrl);
 	if (nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING))
 		nvme_do_delete_ctrl(ctrl);
 	nvme_put_ctrl(ctrl);
@@ -3865,6 +3868,13 @@ static void nvme_scan_work(struct work_struct *work)
 	struct nvme_ctrl *ctrl =
 		container_of(work, struct nvme_ctrl, scan_work);
 
+	/*
+	 * Controller deletion started, we may issue I/O, block and prevent
+	 * the controller deletion process from completing
+	 */
+	if (ctrl->abort_scan_and_ana)
+		return;
+
 	/* No tagset on a live ctrl means IO queues could not created */
 	if (ctrl->state != NVME_CTRL_LIVE || !ctrl->tagset)
 		return;
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index de280d156bbb..eedb6eb23c61 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -565,6 +565,13 @@ static void nvme_ana_work(struct work_struct *work)
 {
 	struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl, ana_work);
 
+	/*
+	 * Controller deletion started, we may issue I/O, block and prevent
+	 * the controller deletion process from completing
+	 */
+	if (ctrl->abort_scan_and_ana)
+		return;
+
 	nvme_read_ana_log(ctrl);
 }
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ee2553caaff6..2b1d04727f59 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -264,6 +264,7 @@ struct nvme_ctrl {
 	struct work_struct fw_act_work;
 	unsigned long events;
 	bool created;
+	bool abort_scan_and_ana;
 
 #ifdef CONFIG_NVME_MULTIPATH
 	/* asymmetric namespace access: */
-- 
2.25.1



* Re: [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked Sagi Grimberg
@ 2020-06-24  6:29   ` Christoph Hellwig
  2020-06-24  6:54     ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-06-24  6:29 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Anton Eidelman, Christoph Hellwig, linux-nvme

On Tue, Jun 23, 2020 at 05:18:48PM -0700, Sagi Grimberg wrote:
> Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
> in nvme_validate_ns")
> 
> When adding a new namespace to the head disk (via nvme_mpath_set_live)
> a partition scan is triggered, which issues I/O on the mpath device
> node. This usually runs from scan_work, which holds the scan_lock. If
> that I/O blocks (e.g. an ANA change has left us with available paths,
> none of which are accessible) we can deadlock on the head disk
> bd_mutex: the partition scan I/O takes it, and head disk revalidation
> (also triggered from scan_work on a different path) takes it to check
> for a resize. See trace [1].
> 
> This is no longer needed since commit cb224c3af4df ("nvme: Convert to
> use set_capacity_revalidate_and_notify") which already updates resize
> info without unnecessarily revalidating the disk.

Did you look if any other revalidation gets skipped this way? E.g.
LBA size change?

From purely the size POV this looks good, though.


* Re: [PATCH v2 for-5.8-rc 2/6] nvme-multipath: fix deadlock between ana_work and scan_work
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 2/6] nvme-multipath: fix deadlock between ana_work and scan_work Sagi Grimberg
@ 2020-06-24  6:34   ` Christoph Hellwig
  2020-06-24  6:56     ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-06-24  6:34 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Anton Eidelman, Christoph Hellwig, linux-nvme

On Tue, Jun 23, 2020 at 05:18:49PM -0700, Sagi Grimberg wrote:
> From: Anton Eidelman <anton@lightbitslabs.com>
> 
> When scan_work calls nvme_mpath_add_disk() it holds ana_lock while
> invoking nvme_parse_ana_log(), which may issue I/O in device_add_disk()
> and hang waiting for an accessible path. While nvme_mpath_set_live()
> is only called when nvme_state_is_live() holds, a transition may cause
> NVME_SC_ANA_TRANSITION and requeue the I/O.
> 
> In order to recover and complete the I/O, ana_work on the same ctrl
> should be able to update the path state and clear NVME_NS_ANA_PENDING.
> 
> The deadlock occurs because scan_work keeps holding ana_lock, so
> ana_work hangs [1].
> 
> Fix:
> nvme_mpath_add_disk() now uses nvme_parse_ana_log() only to obtain a
> copy of the ANA group desc, and then calls nvme_update_ns_ana_state()
> without holding ana_lock.
> 
> [1]:
> kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
> kernel: Call Trace:
> kernel:  __schedule+0x2b9/0x6c0
> kernel:  schedule+0x42/0xb0
> kernel:  io_schedule+0x16/0x40
> kernel:  do_read_cache_page+0x438/0x830
> kernel:  read_cache_page+0x12/0x20
> kernel:  read_dev_sector+0x27/0xc0
> kernel:  read_lba+0xc1/0x220
> kernel:  efi_partition+0x1e6/0x708
> kernel:  check_partition+0x154/0x244
> kernel:  rescan_partitions+0xae/0x280
> kernel:  __blkdev_get+0x40f/0x560
> kernel:  blkdev_get+0x3d/0x140
> kernel:  __device_add_disk+0x388/0x480
> kernel:  device_add_disk+0x13/0x20
> kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
> kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
> kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
> kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
> kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
> kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
> kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
> kernel:  process_one_work+0x1db/0x380
> kernel:  worker_thread+0x249/0x400
> kernel:  kthread+0x104/0x140
> 
> kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
> kernel: Call Trace:
> kernel:  __schedule+0x2b9/0x6c0
> kernel:  schedule+0x42/0xb0
> kernel:  schedule_preempt_disabled+0xe/0x10
> kernel:  __mutex_lock.isra.0+0x182/0x4f0
> kernel:  ? __switch_to_asm+0x34/0x70
> kernel:  ? select_task_rq_fair+0x1aa/0x5c0
> kernel:  ? kvm_sched_clock_read+0x11/0x20
> kernel:  ? sched_clock+0x9/0x10
> kernel:  __mutex_lock_slowpath+0x13/0x20
> kernel:  mutex_lock+0x2e/0x40
> kernel:  nvme_read_ana_log+0x3a/0x100 [nvme_core]
> kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
> kernel:  process_one_work+0x1db/0x380
> kernel:  worker_thread+0x4d/0x400
> kernel:  kthread+0x104/0x140
> kernel:  ? process_one_work+0x380/0x380
> kernel:  ? kthread_park+0x80/0x80
> kernel:  ret_from_fork+0x35/0x40
> 
> Fixes: 0d0b660f214d ("nvme: add ANA support")
> Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
> ---
>  drivers/nvme/host/multipath.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index da78e499947a..a646f9eb2e73 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -640,13 +640,13 @@ static ssize_t ana_state_show(struct device *dev, struct device_attribute *attr,
>  }
>  DEVICE_ATTR_RO(ana_state);
>  
> -static int nvme_set_ns_ana_state(struct nvme_ctrl *ctrl,
> +static int nvme_get_ns_ana_state(struct nvme_ctrl *ctrl,
>  		struct nvme_ana_group_desc *desc, void *data)

Maybe this should be called nvme_lookup_ana_group_desc or so given that
it doesn't actually do anything about the state?

>  {
> -	struct nvme_ns *ns = data;
> +	struct nvme_ana_group_desc *dst = data;
>  
> -	if (ns->ana_grpid == le32_to_cpu(desc->grpid)) {
> -		nvme_update_ns_ana_state(desc, ns);
> +	if (desc->grpid == dst->grpid) {
> +		*dst = *desc;
>  		return -ENXIO; /* just break out of the loop */
>  	}

And maybe doing the early return here would also clarify things:

	if (desc->grpid != dst->grpid)
		return 0;
	*dst = *desc;
	return -ENXIO; /* just break out of the loop */

Otherwise this looks good.


* Re: [PATCH v2 for-5.8-rc 3/6] nvme: don't protect ns mutation with ns->head->lock
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 3/6] nvme: don't protect ns mutation with ns->head->lock Sagi Grimberg
@ 2020-06-24  6:37   ` Christoph Hellwig
  2020-06-24  6:58     ` Sagi Grimberg
  2020-06-24  8:24     ` Sagi Grimberg
  0 siblings, 2 replies; 25+ messages in thread
From: Christoph Hellwig @ 2020-06-24  6:37 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Anton Eidelman, Christoph Hellwig, linux-nvme

On Tue, Jun 23, 2020 at 05:18:50PM -0700, Sagi Grimberg wrote:
> Right now ns->head->lock protects namespace mutation, which is wrong
> and unneeded. Move it to only protect against head mutations. While
> we're at it, remove the unnecessary ns->head dereferences, as we
> already have a head pointer.

The changes look reasonable (nitpick below), but this changelog doesn't
say what the problem is and how it is fixed.

> -	synchronize_srcu(&ns->head->srcu);
> -	kblockd_schedule_work(&ns->head->requeue_work);
> +	synchronize_srcu(&head->srcu);
> +	kblockd_schedule_work(&head->requeue_work);
> +	mutex_unlock(&head->lock);

Can't we move the synchronize_srcu and kblockd_schedule_work out of
the lock as well?


* Re: [PATCH v2 for-5.8-rc 4/6] nvme-multipath: fix deadlock due to head->lock
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 4/6] nvme-multipath: fix deadlock due to head->lock Sagi Grimberg
@ 2020-06-24  6:39   ` Christoph Hellwig
  2020-06-24  7:00     ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-06-24  6:39 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Anton Eidelman, Christoph Hellwig, linux-nvme

On Tue, Jun 23, 2020 at 05:18:51PM -0700, Sagi Grimberg wrote:
> -	mutex_lock(&head->lock);
> -	if (!(head->disk->flags & GENHD_FL_UP))
> +	if (!test_and_set_bit(NVME_NS_HEAD_HAS_DISK, &head->flags))
>  		device_add_disk(&head->subsys->dev, head->disk,
>  				nvme_ns_id_attr_groups);

The HAS_DISK name is a little strange.  The disk obviously is set
before, otherwise this call would dereference a NULL pointer :)

Maybe NVME_NSHEAD_LIVE?  or NVME_NSHEAD_DISK_LIVE?


* Re: [PATCH v2 for-5.8-rc 5/6] nvme-multipath: fix bogus request queue reference put
  2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 5/6] nvme-multipath: fix bogus request queue reference put Sagi Grimberg
@ 2020-06-24  6:40   ` Christoph Hellwig
  0 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2020-06-24  6:40 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Anton Eidelman, Christoph Hellwig, linux-nvme

Looks sensible.


* Re: [PATCH v2 RFC 6/6] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work
  2020-06-24  0:18 ` [PATCH v2 RFC 6/6] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work Sagi Grimberg
@ 2020-06-24  6:43   ` Christoph Hellwig
  2020-06-24  7:13     ` Sagi Grimberg
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-06-24  6:43 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Anton Eidelman, Christoph Hellwig, linux-nvme

On Tue, Jun 23, 2020 at 05:18:53PM -0700, Sagi Grimberg wrote:
> From: Anton Eidelman <anton@lightbitslabs.com>
> 
> A deadlock happens in the following scenario with multipath:
> 1) scan_work(nvme0) detects a new nsid while nvme0
>     is an optimized path to it; path nvme1 happens to be
>     inaccessible.
> 
> 2) Before scan_work is complete an nvme0 disconnect is initiated:
>     nvme_delete_ctrl_sync() sets the nvme0 state to NVME_CTRL_DELETING.
> 
> 3) scan_work(nvme0) attempts to submit I/O,
>     but nvme_path_is_optimized() observes nvme0 is not LIVE.
>     Since nvme1 is a possible path, the I/O is requeued and scan_work
>     hangs.

I'm really worried about another flag outside the state machine.  If
we really need a multi-step deletion we should have
NVME_CTRL_DELETE_START, NVME_CTRL_DELETE_CONT or so states and run
this via the state machine.
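
A rough sketch of that shape (the DELETE_* names are hypothetical,
not existing states):

	enum nvme_ctrl_state {
		NVME_CTRL_NEW,
		NVME_CTRL_LIVE,
		NVME_CTRL_RESETTING,
		NVME_CTRL_CONNECTING,
		NVME_CTRL_DELETE_START,	/* flush scan/ana work first */
		NVME_CTRL_DELETE_CONT,	/* then proceed with teardown */
		NVME_CTRL_DEAD,
	};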


* Re: [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked
  2020-06-24  6:29   ` Christoph Hellwig
@ 2020-06-24  6:54     ` Sagi Grimberg
  2020-06-24  6:57       ` Christoph Hellwig
  2020-07-07 10:57       ` Anthony Iliopoulos
  0 siblings, 2 replies; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  6:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Anton Eidelman, linux-nvme


>> Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
>> in nvme_validate_ns")
>>
>> When adding a new namespace to the head disk (via nvme_mpath_set_live)
>> a partition scan is triggered, which issues I/O on the mpath device
>> node. This usually runs from scan_work, which holds the scan_lock. If
>> that I/O blocks (e.g. an ANA change has left us with available paths,
>> none of which are accessible) we can deadlock on the head disk
>> bd_mutex: the partition scan I/O takes it, and head disk revalidation
>> (also triggered from scan_work on a different path) takes it to check
>> for a resize. See trace [1].
>>
>> This is no longer needed since commit cb224c3af4df ("nvme: Convert to
>> use set_capacity_revalidate_and_notify") which already updates resize
>> info without unnecessarily revalidating the disk.
> 
> Did you look if any other revalidation gets skipped this way? E.g.
> LBA size change?
> 
>  From purely the size POV this looks good, though.

The only reason the revalidate was added was to update on a size change,
which is redundant.

And if you look at revalidate, it only calls fops->revalidate_disk
(which is a noop in the mpath device) and checks the size change.

If you find this explanation sufficient, I'd love your Reviewed-by
tag ;)


* Re: [PATCH v2 for-5.8-rc 2/6] nvme-multipath: fix deadlock between ana_work and scan_work
  2020-06-24  6:34   ` Christoph Hellwig
@ 2020-06-24  6:56     ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  6:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Anton Eidelman, linux-nvme


>> -static int nvme_set_ns_ana_state(struct nvme_ctrl *ctrl,
>> +static int nvme_get_ns_ana_state(struct nvme_ctrl *ctrl,
>>   		struct nvme_ana_group_desc *desc, void *data)
> 
> Maybe this should be called nvme_lookup_ana_group_desc or so given that
> it doesn't actually do anything about the state?

Agree, reads much better.

> 
>>   {
>> -	struct nvme_ns *ns = data;
>> +	struct nvme_ana_group_desc *dst = data;
>>   
>> -	if (ns->ana_grpid == le32_to_cpu(desc->grpid)) {
>> -		nvme_update_ns_ana_state(desc, ns);
>> +	if (desc->grpid == dst->grpid) {
>> +		*dst = *desc;
>>   		return -ENXIO; /* just break out of the loop */
>>   	}
> 
> And maybe doing the early return here would also clarify things:
> 
> 	if (desc->grpid != dst->grpid)
> 		return 0;
> 	*dst = *desc;
> 	return -ENXIO; /* just break out of the loop */
> 
> Otherwise this looks good.

Will fix.


* Re: [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked
  2020-06-24  6:54     ` Sagi Grimberg
@ 2020-06-24  6:57       ` Christoph Hellwig
  2020-06-24  7:09         ` Sagi Grimberg
  2020-07-07 10:57       ` Anthony Iliopoulos
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-06-24  6:57 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Anton Eidelman, Christoph Hellwig, linux-nvme

On Tue, Jun 23, 2020 at 11:54:31PM -0700, Sagi Grimberg wrote:
>
>>> Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
>>> in nvme_validate_ns")
>>>
>>> When adding a new namespace to the head disk (via nvme_mpath_set_live)
>>> a partition scan is triggered, which issues I/O on the mpath device
>>> node. This usually runs from scan_work, which holds the scan_lock. If
>>> that I/O blocks (e.g. an ANA change has left us with available paths,
>>> none of which are accessible) we can deadlock on the head disk
>>> bd_mutex: the partition scan I/O takes it, and head disk revalidation
>>> (also triggered from scan_work on a different path) takes it to check
>>> for a resize. See trace [1].
>>>
>>> This is no longer needed since commit cb224c3af4df ("nvme: Convert to
>>> use set_capacity_revalidate_and_notify") which already updates resize
>>> info without unnecessarily revalidating the disk.
>>
>> Did you look if any other revalidation gets skipped this way? E.g.
>> LBA size change?
>>
>>  From purely the size POV this looks good, though.
>
> The only reason the revalidate was added was to update on a size change,
> which is redundant.
>
> And if you look at revalidate, it only calls fops->revalidate_disk
> (which is a noop in the mpath device) and checks the size change.

Please add this blurb to the changelog for the next version.


* Re: [PATCH v2 for-5.8-rc 3/6] nvme: don't protect ns mutation with ns->head->lock
  2020-06-24  6:37   ` Christoph Hellwig
@ 2020-06-24  6:58     ` Sagi Grimberg
  2020-06-24  8:24     ` Sagi Grimberg
  1 sibling, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  6:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Anton Eidelman, linux-nvme


>> Right now ns->head->lock protects namespace mutation, which is wrong
>> and unneeded. Move it to only protect against head mutations. While
>> we're at it, remove the unnecessary ns->head dereferences, as we
>> already have a head pointer.
> 
> The changes look reasonable (nitpick below), but this changelog doesn't
> say what the problem is and how it is fixed.
> 
>> -	synchronize_srcu(&ns->head->srcu);
>> -	kblockd_schedule_work(&ns->head->requeue_work);
>> +	synchronize_srcu(&head->srcu);
>> +	kblockd_schedule_work(&head->requeue_work);
>> +	mutex_unlock(&head->lock);
> 
> Can't we move the synchronize_srcu and kblockd_schedule_work out of
> the lock as well?

The requeue_work kick for sure, and I think the synchronize_srcu as
well, given that the actual critical section is protected...

Will fix.


* Re: [PATCH v2 for-5.8-rc 4/6] nvme-multipath: fix deadlock due to head->lock
  2020-06-24  6:39   ` Christoph Hellwig
@ 2020-06-24  7:00     ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  7:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Anton Eidelman, linux-nvme


>> -	mutex_lock(&head->lock);
>> -	if (!(head->disk->flags & GENHD_FL_UP))
>> +	if (!test_and_set_bit(NVME_NS_HEAD_HAS_DISK, &head->flags))
>>   		device_add_disk(&head->subsys->dev, head->disk,
>>   				nvme_ns_id_attr_groups);
> 
> The HAS_DISK name is a little strange.  The disk obviously is set
> before, otherwise this call would dereference a NULL pointer :)
> 
> Maybe NVME_NSHEAD_LIVE?  or NVME_NSHEAD_DISK_LIVE?

I think I like NVME_NSHEAD_DISK_LIVE, which better indicates that it
relates to the gendisk and that it is live (i.e. we called
device_add_disk).


* Re: [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked
  2020-06-24  6:57       ` Christoph Hellwig
@ 2020-06-24  7:09         ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  7:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Anton Eidelman, linux-nvme


>>>> Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
>>>> in nvme_validate_ns")
>>>>
>>>> When adding a new namespace to the head disk (via nvme_mpath_set_live)
>>>> a partition scan is triggered, which issues I/O on the mpath device
>>>> node. This usually runs from scan_work, which holds the scan_lock. If
>>>> that I/O blocks (e.g. an ANA change has left us with available paths,
>>>> none of which are accessible) we can deadlock on the head disk
>>>> bd_mutex: the partition scan I/O takes it, and head disk revalidation
>>>> (also triggered from scan_work on a different path) takes it to check
>>>> for a resize. See trace [1].
>>>>
>>>> This is no longer needed since commit cb224c3af4df ("nvme: Convert to
>>>> use set_capacity_revalidate_and_notify") which already updates resize
>>>> info without unnecessarily revalidating the disk.
>>>
>>> Did you look if any other revalidation gets skipped this way? E.g.
>>> LBA size change?
>>>
>>>   From purely the size POV this looks good, though.
>>
>> The only reason the revalidate was added was to update on a size change,
>> which is redundant.
>>
>> And if you look at revalidate, it only calls fops->revalidate_disk
>> (which is a noop in the mpath device) and checks the size change.
> 
> Please add this blurb to the changelog for the next version.

Will do. thanks


* Re: [PATCH v2 RFC 6/6] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work
  2020-06-24  6:43   ` Christoph Hellwig
@ 2020-06-24  7:13     ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  7:13 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Anton Eidelman, linux-nvme


>> From: Anton Eidelman <anton@lightbitslabs.com>
>>
>> A deadlock happens in the following scenario with multipath:
>> 1) scan_work(nvme0) detects a new nsid while nvme0
>>     is an optimized path to it; path nvme1 happens to be
>>     inaccessible.
>>
>> 2) Before scan_work is complete an nvme0 disconnect is initiated:
>>     nvme_delete_ctrl_sync() sets the nvme0 state to NVME_CTRL_DELETING.
>>
>> 3) scan_work(nvme0) attempts to submit I/O,
>>     but nvme_path_is_optimized() observes nvme0 is not LIVE.
>>     Since nvme1 is a possible path, the I/O is requeued and scan_work
>>     hangs.
> 
> I'm really worried about another flag outside the state machine.  If
> we really need a multi-step deletion we should have
> NVME_CTRL_DELETE_START, NVME_CTRL_DELETE_CONT or so states and run
> this via the state machine.

Let me look into this. I'll send out v3 of the other 5 patches, and
will follow up with this as a follow-on series. Going via the state
machine is slightly more delicate, but I agree that it's a better
approach.


* Re: [PATCH v2 for-5.8-rc 3/6] nvme: don't protect ns mutation with ns->head->lock
  2020-06-24  6:37   ` Christoph Hellwig
  2020-06-24  6:58     ` Sagi Grimberg
@ 2020-06-24  8:24     ` Sagi Grimberg
  1 sibling, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2020-06-24  8:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Anton Eidelman, linux-nvme


>> Right now ns->head->lock protects namespace mutation, which is wrong
>> and unneeded. Move it to only protect against head mutations. While
>> we're at it, remove the unnecessary ns->head dereferences, as we
>> already have a head pointer.
> 
> The changes look reasonable (nitpick below), but this changelog doesn't
> say what the problem is and how it is fixed.

I'll add to v3:

The problem with this is that head->lock spans
mpath disk node I/O that may block under some conditions (if,
for example, the controller is disconnecting or the path
became inaccessible). The locking scheme does not allow any
other path to enable itself, preventing the blocked I/O from
completing and making forward progress from there.

This is a preparation patch for the fix in a subsequent patch,
where the disk I/O will also be done outside the head->lock,
roughly as sketched below.
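
A rough sketch of that direction (reconstructed for illustration; the
NVME_NSHEAD_DISK_LIVE flag name is an assumption here and details may
differ from the actual patches):

static void nvme_mpath_set_live(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;

	if (!head->disk)
		return;

	/*
	 * Registering the gendisk triggers a partition scan, i.e. I/O
	 * on the mpath node, so it must happen outside head->lock;
	 * the flag bit keeps registration one-shot instead.
	 */
	if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
		device_add_disk(&head->subsys->dev, head->disk,
				nvme_ns_id_attr_groups);

	mutex_lock(&head->lock);
	/* only head state, e.g. the cached current path, stays locked */
	if (nvme_path_is_optimized(ns))
		__nvme_find_path(head, numa_node_id());
	mutex_unlock(&head->lock);

	synchronize_srcu(&head->srcu);
	kblockd_schedule_work(&head->requeue_work);
}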


* Re: [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked
  2020-06-24  6:54     ` Sagi Grimberg
  2020-06-24  6:57       ` Christoph Hellwig
@ 2020-07-07 10:57       ` Anthony Iliopoulos
  2020-07-08 14:42         ` Christoph Hellwig
  2020-07-14 11:12         ` Christoph Hellwig
  1 sibling, 2 replies; 25+ messages in thread
From: Anthony Iliopoulos @ 2020-07-07 10:57 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Keith Busch, Anton Eidelman, Christoph Hellwig, linux-nvme

On Tue, Jun 23, 2020 at 11:54:31PM -0700, Sagi Grimberg wrote:
> 
> > > Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk
> > > in nvme_validate_ns")
> > > 
> > > When adding a new namespace to the head disk (via nvme_mpath_set_live)
> > > we will see a partition scan which triggers I/O on the mpath device node.
> > > This process will usually be triggered from scan_work, which holds
> > > the scan_lock. If the I/O blocks (e.g. if an ANA change left us with
> > > available paths, none of which is currently accessible), this can
> > > deadlock on the head disk bd_mutex: the partition scan I/O takes it,
> > > and head disk revalidation (also triggered from scan_work on a
> > > different path) takes it to check for a resize. See trace [1].
> > > 
> > > This is no longer needed since commit cb224c3af4df ("nvme: Convert to
> > > use set_capacity_revalidate_and_notify") which already updates resize
> > > info without unnecessarily revalidating the disk.
> > 
> > Did you check whether any other revalidation gets skipped this way?
> > E.g. an LBA size change?
> > 
> > From purely the size POV this looks good, though.
> 
> The only reason the revalidate call was added was to update on a size
> change, which is now redundant.

The revalidation is still required there in order to update the blockdev
size when multipath is enabled. With this revert, the behavior regressed
back to before fab7772bfbcf: the size isn't updated under multipath
while the device is in use (e.g. mounted); it will only be effectively
updated the next time it is mounted, as a side effect of blkdev_open by
the first opener. This prevents online resizing of filesystems.

> And if you look at revalidate, it only calls fops->revalidate_disk
> (which is a noop in the mpath device) and checks the size change.

Indeed, the mpath gendisk doesn't have a revalidate_disk fop, but the
call to check_disk_size_change was needed to update its size. Otherwise,
the size change isn't reflected externally to userspace before the next
mount. cb224c3af4df doesn't fix this, as it skips revalidate and thus
also skips updating the size.

The issue is that under CONFIG_NVME_MULTIPATH=y the ns->disk becomes a
hidden gendisk (see nvme_set_disk_name) and thus doesn't have a backing
bdev, while ns->head->disk becomes the "visible" one to userspace (with
a backing bdev), and therefore needs its bdev->bd_inode->i_size updated
when resized.
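
For reference, the hiding decision is made when the disk name is
assigned; roughly, from memory of the 5.7-era multipath.c (a sketch,
not verbatim source):

void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
		struct nvme_ctrl *ctrl, int *flags)
{
	if (!multipath) {
		sprintf(disk_name, "nvme%dn%d", ctrl->instance,
			ns->head->instance);
	} else if (ns->head->disk) {
		/* per-path node: hidden, no bdev of its own */
		sprintf(disk_name, "nvme%dc%dn%d", ctrl->subsys->instance,
			ctrl->instance, ns->head->instance);
		*flags = GENHD_FL_HIDDEN;
	} else {
		/* no head disk for this namespace: plain visible node */
		sprintf(disk_name, "nvme%dn%d", ctrl->subsys->instance,
			ns->head->instance);
	}
}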

The following patch seems to be working, although I'm not sufficiently
familiar with the code to say with confidence that it won't cause any
other deadlocks:

From 607221431d99503fa7ff0091bf6277d946ef9930 Mon Sep 17 00:00:00 2001
From: Anthony Iliopoulos <ailiop@suse.com>
Date: Tue, 7 Jul 2020 12:42:54 +0200
Subject: nvme: explicitly update mpath disk capacity on revalidation

Commit 3b4b19721ec652 ("nvme: fix possible deadlock when I/O is
blocked") reverted multipath head disk revalidation due to deadlocks
caused by holding the bd_mutex during revalidate.

Updating the multipath disk blockdev size is still required, though,
for userspace to be able to observe any resizing while the device is
mounted. Directly update the bdev inode size to avoid unnecessarily
holding bdev->bd_mutex.

Fixes: 3b4b19721ec652 ("nvme: fix possible deadlock when I/O is blocked")
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
---
 drivers/nvme/host/core.c |  1 +
 drivers/nvme/host/nvme.h | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8410d03b940d74..add040168e67e2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1980,6 +1980,7 @@ static int __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	if (ns->head->disk) {
 		nvme_update_disk_info(ns->head->disk, ns, id);
 		blk_queue_stack_limits(ns->head->disk->queue, ns->queue);
+		nvme_mpath_update_disk_size(ns->head->disk);
 	}
 #endif
 	return 0;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2ef8d501e2a87c..1de3f9b827aa56 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -604,6 +604,16 @@ static inline void nvme_trace_bio_complete(struct request *req,
 		trace_block_bio_complete(ns->head->disk->queue, req->bio);
 }

+static inline void nvme_mpath_update_disk_size(struct gendisk *disk)
+{
+	struct block_device *bdev = bdget_disk(disk, 0);
+
+	if (bdev) {
+		bd_set_size(bdev, get_capacity(disk) << SECTOR_SHIFT);
+		bdput(bdev);
+	}
+}
+
 extern struct device_attribute dev_attr_ana_grpid;
 extern struct device_attribute dev_attr_ana_state;
 extern struct device_attribute subsys_attr_iopolicy;
@@ -679,6 +689,9 @@ static inline void nvme_mpath_wait_freeze(struct nvme_subsystem *subsys)
 static inline void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
 {
 }
+static inline void nvme_mpath_update_disk_size(struct gendisk *disk)
+{
+}
 #endif /* CONFIG_NVME_MULTIPATH */

 #ifdef CONFIG_NVM
--
2.27.0
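
For context on why this sidesteps bd_mutex entirely: bd_set_size()
updates the bdev inode size under the inode lock instead. Roughly (a
sketch of the 5.8-era fs/block_dev.c helper):

void bd_set_size(struct block_device *bdev, loff_t size)
{
	inode_lock(bdev->bd_inode);
	i_size_write(bdev->bd_inode, size);
	inode_unlock(bdev->bd_inode);
}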


* Re: [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked
  2020-07-07 10:57       ` Anthony Iliopoulos
@ 2020-07-08 14:42         ` Christoph Hellwig
  2020-07-10  4:47           ` Sagi Grimberg
  2020-07-14 11:12         ` Christoph Hellwig
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-07-08 14:42 UTC (permalink / raw)
  To: Anthony Iliopoulos
  Cc: Keith Busch, Anton Eidelman, Sagi Grimberg, linux-nvme,
	Christoph Hellwig

On Tue, Jul 07, 2020 at 12:57:35PM +0200, Anthony Iliopoulos wrote:
> The revalidation is still required there in order to update the blockdev
> size when multipath is enabled. With this revert, the behavior regressed
> back to before fab7772bfbcf: the size isn't updated under multipath
> while the device is in use (e.g. mounted); it will only be effectively
> updated the next time it is mounted, as a side effect of blkdev_open by
> the first opener. This prevents online resizing of filesystems.
> 
> > And if you look at revalidate, it only calls fops->revalidate_disk
> > (which is a noop in the mpath device) and checks the size change.
> 
> Indeed, the mpath gendisk doesn't have a revalidate_disk fop, but the
> call to check_disk_size_change was needed to update its size. Otherwise,
> the size change isn't reflected externally to userspace before the next
> mount. cb224c3af4df doesn't fix this, as it skips revalidate and thus
> also skips updating the size.
> 
> The issue is that under CONFIG_NVME_MULTIPATH=y the ns->disk becomes a
> hidden gendisk (see nvme_set_disk_name) and thus doesn't have a backing
> bdev, while ns->head->disk becomes the "visible" one to userspace (with
> a backing bdev), and therefore needs its bdev->bd_inode->i_size updated
> when resized.
> 
> The following patch seems to be working, although I'm not sufficiently
> familiar with the code to say with confidence that it won't cause any
> other deadlocks:

This looks sensible to me.  Sagi?


* Re: [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked
  2020-07-08 14:42         ` Christoph Hellwig
@ 2020-07-10  4:47           ` Sagi Grimberg
  0 siblings, 0 replies; 25+ messages in thread
From: Sagi Grimberg @ 2020-07-10  4:47 UTC (permalink / raw)
  To: Christoph Hellwig, Anthony Iliopoulos
  Cc: Keith Busch, Anton Eidelman, linux-nvme


>> The revalidation is still required there in order to update the blockdev
>> size when multipath is enabled. With this revert, the behavior regressed
>> back to before fab7772bfbcf: the size isn't updated under multipath
>> while the device is in use (e.g. mounted); it will only be effectively
>> updated the next time it is mounted, as a side effect of blkdev_open by
>> the first opener. This prevents online resizing of filesystems.
>>
>>> And if you look at revalidate, it only calls fops->revalidate_disk
>>> (which is a noop in the mpath device) and checks the size change.
>>
>> Indeed, the mpath gendisk doesn't have a revalidate_disk fop, but the
>> call to check_disk_size_change was needed to update its size. Otherwise,
>> the size change isn't reflected externally to userspace before the next
>> mount. cb224c3af4df doesn't fix this, as it skips revalidate and thus
>> also skips updating the size.
>>
>> The issue is that under CONFIG_NVME_MULTIPATH=y the ns->disk becomes a
>> hidden gendisk (see nvme_set_disk_name) and thus doesn't have a backing
>> bdev, while ns->head->disk becomes the "visible" one to userspace (with
>> a backing bdev), and therefore needs its bdev->bd_inode->i_size updated
>> when resized.
>>
>> The following patch seems to be working, although I'm not sufficiently
>> familiar with the code to say with confidence that it won't cause any
>> other deadlocks:
> 
> This looks sensible to me.  Sagi?

Looks sensible to me too, but please let's add a comment clarifying
why this exists; perhaps something along the lines of the sketch below.
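
As a purely illustrative suggestion, the comment could read roughly:

/*
 * Under CONFIG_NVME_MULTIPATH the per-path ns->disk is hidden and has
 * no bdev; ns->head->disk is what userspace opens, and nothing ever
 * revalidates it. Propagate capacity changes to its bd_inode here, or
 * an online resize stays invisible until the next open.
 */
static inline void nvme_mpath_update_disk_size(struct gendisk *disk)
{
	/* body as in the patch above */
}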


* Re: [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked
  2020-07-07 10:57       ` Anthony Iliopoulos
  2020-07-08 14:42         ` Christoph Hellwig
@ 2020-07-14 11:12         ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2020-07-14 11:12 UTC (permalink / raw)
  To: Anthony Iliopoulos
  Cc: Keith Busch, Anton Eidelman, Sagi Grimberg, linux-nvme,
	Christoph Hellwig

Thanks,

applied to nvme-5.8.


end of thread

Thread overview: 25+ messages
2020-06-24  0:18 [PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates Sagi Grimberg
2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 1/6] nvme: fix possible deadlock when I/O is blocked Sagi Grimberg
2020-06-24  6:29   ` Christoph Hellwig
2020-06-24  6:54     ` Sagi Grimberg
2020-06-24  6:57       ` Christoph Hellwig
2020-06-24  7:09         ` Sagi Grimberg
2020-07-07 10:57       ` Anthony Iliopoulos
2020-07-08 14:42         ` Christoph Hellwig
2020-07-10  4:47           ` Sagi Grimberg
2020-07-14 11:12         ` Christoph Hellwig
2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 2/6] nvme-multipath: fix deadlock between ana_work and scan_work Sagi Grimberg
2020-06-24  6:34   ` Christoph Hellwig
2020-06-24  6:56     ` Sagi Grimberg
2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 3/6] nvme: don't protect ns mutation with ns->head->lock Sagi Grimberg
2020-06-24  6:37   ` Christoph Hellwig
2020-06-24  6:58     ` Sagi Grimberg
2020-06-24  8:24     ` Sagi Grimberg
2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 4/6] nvme-multipath: fix deadlock due to head->lock Sagi Grimberg
2020-06-24  6:39   ` Christoph Hellwig
2020-06-24  7:00     ` Sagi Grimberg
2020-06-24  0:18 ` [PATCH v2 for-5.8-rc 5/6] nvme-multipath: fix bogus request queue reference put Sagi Grimberg
2020-06-24  6:40   ` Christoph Hellwig
2020-06-24  0:18 ` [PATCH v2 RFC 6/6] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work Sagi Grimberg
2020-06-24  6:43   ` Christoph Hellwig
2020-06-24  7:13     ` Sagi Grimberg
