* [PATCH 0/1] nvme: Ensure forward progress during Admin passthru
@ 2018-06-22 19:59 Scott Bauer
  2018-06-22 19:59 ` [PATCH 1/1] " Scott Bauer
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Scott Bauer @ 2018-06-22 19:59 UTC


This small patch is an attempt to fix a rare boundary condition.
The only downside of the patch is that scan work cannot happen concurrently,
but I don't think that's possible outside of someone issuing a sysctl.

If the controller goes down during the admin command, the command will
time out; if the controller is in a poor enough state, the reset commands
will time out as well. After the original admin command timeout, we unblock
userland, which can start to revalidate the namespaces. In order to do that
we acquire the controller namespace semaphore for write. We then issue an
admin command which times out and triggers a reset, where we attempt to stop
queues in dev_disable(). Stopping queues attempts to down_read on the
semaphore, which is currently held via down_write, waiting for this I/O to
time out. This I/O cannot time out until the semaphore is released from
down_write, and now we're deadlocked.

The patch does what nvme_remove_namespaces() does: it takes over the
controller's namespace list under the down_write, then modifies the private
list, potentially removing namespaces, and resplices it back once
revalidation has occurred. The whole goal is to *not* hold a down_write
while issuing admin commands.


* [PATCH 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-22 19:59 [PATCH 0/1] nvme: Ensure forward progress during Admin passthru Scott Bauer
@ 2018-06-22 19:59 ` Scott Bauer
  2018-06-27 19:12   ` Keith Busch
  2018-06-24 17:38 ` [PATCH 0/1] " Sagi Grimberg
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Scott Bauer @ 2018-06-22 19:59 UTC


If the controller supports effects and goes down during
the passthru admin command we will deadlock during
namespace revalidation.

[  363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds.
[  363.488290]       Not tainted 4.17.0+ #2
[  363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.488303] kworker/u16:5   D    0   231      2 0x80000000
[  363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  363.488338] Call Trace:
[  363.488385]  schedule+0x75/0x190
[  363.488396]  rwsem_down_read_failed+0x1c3/0x2f0
[  363.488481]  call_rwsem_down_read_failed+0x14/0x30
[  363.488504]  down_read+0x1d/0x80
[  363.488523]  nvme_stop_queues+0x1e/0xa0 [nvme_core]
[  363.488536]  nvme_dev_disable+0xae4/0x1620 [nvme]
[  363.488614]  nvme_reset_work+0xd1e/0x49d9 [nvme]
[  363.488911]  process_one_work+0x81a/0x1400
[  363.488934]  worker_thread+0x87/0xe80
[  363.488955]  kthread+0x2db/0x390
[  363.488977]  ret_from_fork+0x35/0x40

Fixes: 84fef62d135b6 ("nvme: check admin passthru command effects")

Signed-off-by: Scott Bauer <scott.bauer at intel.com>
---
 drivers/nvme/host/core.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 46df030b2c3f..19a729ccc30e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1152,18 +1152,20 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 static void nvme_update_formats(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns, *next;
-	LIST_HEAD(rm_list);
+	LIST_HEAD(ns_list);
 
 	down_write(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->disk && nvme_revalidate_disk(ns->disk)) {
-			list_move_tail(&ns->list, &rm_list);
-		}
-	}
+	list_splice_init(&ctrl->namespaces, &ns_list);
 	up_write(&ctrl->namespaces_rwsem);
 
-	list_for_each_entry_safe(ns, next, &rm_list, list)
-		nvme_ns_remove(ns);
+	list_for_each_entry_safe(ns, next, &ns_list, list) {
+		if (ns->disk && nvme_revalidate_disk(ns->disk))
+			nvme_ns_remove(ns);
+	}
+
+	down_write(&ctrl->namespaces_rwsem);
+	list_splice_init(&ns_list, &ctrl->namespaces);
+	up_write(&ctrl->namespaces_rwsem);
 }
 
 static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
-- 
2.17.1


* [PATCH 0/1] nvme: Ensure forward progress during Admin passthru
  2018-06-22 19:59 [PATCH 0/1] nvme: Ensure forward progress during Admin passthru Scott Bauer
  2018-06-22 19:59 ` [PATCH 1/1] " Scott Bauer
@ 2018-06-24 17:38 ` Sagi Grimberg
  2018-06-27 19:08   ` Keith Busch
  2018-06-28 17:10 ` [PATCH v2 1/1] " Scott Bauer
  2018-06-29 19:03 ` [PATCH v3 " Scott Bauer
  3 siblings, 1 reply; 17+ messages in thread
From: Sagi Grimberg @ 2018-06-24 17:38 UTC



> This small patch is an attempt to fix a rare boundary condition.
> The only downside of the patch is that scan work cannot happen concurrently,
> but I don't think that's possible outside of someone issuing a sysctl.
> 
> If the controller goes down during the admin command, the command will
> time out; if the controller is in a poor enough state, the reset commands
> will time out as well. After the original admin command timeout, we unblock
> userland, which can start to revalidate the namespaces. In order to do that
> we acquire the controller namespace semaphore for write. We then issue an
> admin command which times out and triggers a reset, where we attempt to stop
> queues in dev_disable(). Stopping queues attempts to down_read on the
> semaphore, which is currently held via down_write, waiting for this I/O to
> time out. This I/O cannot time out until the semaphore is released from
> down_write, and now we're deadlocked.
> 
> The patch does what nvme_remove_namespaces() does: it takes over the
> controller's namespace list under the down_write, then modifies the private
> list, potentially removing namespaces, and resplices it back once
> revalidation has occurred. The whole goal is to *not* hold a down_write
> while issuing admin commands.

Keith, can you explain why we need to disable the device from the
timeout handler instead of simply scheduling a reset and returning
BLK_EH_RESET_TIMER?

The only place where I *think* we can't schedule a reset and need to
return BLK_EH_DONE is when the controller is already resetting, in which
case we'd simply need to cancel the timed-out command.

What am I missing?
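
The alternative being asked about would look roughly like this (a hedged
sketch only, not the actual nvme_timeout(); the nvme_req(req)->ctrl
accessor and the abort status are assumptions based on the discussion):

	static enum blk_eh_timer_return nvme_timeout_sketch(struct request *req,
							    bool reserved)
	{
		struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;	/* assumed accessor */

		if (ctrl->state == NVME_CTRL_RESETTING) {
			/* Already resetting: just cancel the timed-out command. */
			nvme_req(req)->status = NVME_SC_ABORT_REQ;
			blk_mq_complete_request(req);
			return BLK_EH_DONE;
		}

		/*
		 * Otherwise schedule a reset and give the command more time;
		 * the reset path will reap it.
		 */
		nvme_reset_ctrl(ctrl);
		return BLK_EH_RESET_TIMER;
	}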


* [PATCH 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-27 19:12   ` Keith Busch
@ 2018-06-27 19:01     ` Scott Bauer
  2018-06-27 20:27       ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Scott Bauer @ 2018-06-27 19:01 UTC


On Wed, Jun 27, 2018 at 01:12:42PM -0600, Keith Busch wrote:
> How about changing the reaction based on success?

I don't think this will work. The other scenario is that the original admin
command was fine: we enter nvme_update_formats(), grab the semaphore for
write, issue the NVMe admin identify, it fails, we enter the timeout handler,
and attempt to down_read in dev_disable -> stop queues and get stuck there.

We need a way to avoid holding a down_write on the semaphore while issuing
I/O.


* [PATCH 0/1] nvme: Ensure forward progress during Admin passthru
  2018-06-24 17:38 ` [PATCH 0/1] " Sagi Grimberg
@ 2018-06-27 19:08   ` Keith Busch
  0 siblings, 0 replies; 17+ messages in thread
From: Keith Busch @ 2018-06-27 19:08 UTC


On Sun, Jun 24, 2018 at 08:38:12PM +0300, Sagi Grimberg wrote:
> 
> > This small patch is an attempt to fix a rare boundary condition.
> > The only downside of the patch is that scan work cannot happen concurrently,
> > but I don't think that's possible outside of someone issuing a sysctl.
> > 
> > If the controller goes down during the admin command, the command will
> > time out; if the controller is in a poor enough state, the reset commands
> > will time out as well. After the original admin command timeout, we unblock
> > userland, which can start to revalidate the namespaces. In order to do that
> > we acquire the controller namespace semaphore for write. We then issue an
> > admin command which times out and triggers a reset, where we attempt to stop
> > queues in dev_disable(). Stopping queues attempts to down_read on the
> > semaphore, which is currently held via down_write, waiting for this I/O to
> > time out. This I/O cannot time out until the semaphore is released from
> > down_write, and now we're deadlocked.
> > 
> > The patch does what nvme_remove_namespaces() does: it takes over the
> > controller's namespace list under the down_write, then modifies the private
> > list, potentially removing namespaces, and resplices it back once
> > revalidation has occurred. The whole goal is to *not* hold a down_write
> > while issuing admin commands.
> 
> Keith, can you explain why we need to disable the device from the
> timeout handler instead of simply scheduling a reset and returning
> BLK_EH_RESET_TIMER?
> 
> The only place where I *think* we can't schedule a reset and need to
> return BLK_EH_DONE is when the controller is already resetting, in which
> case we'd simply need to cancel the timed-out command.
> 
> What am I missing?

The reset, removing, and dead states.

It was done this way for a few reasons, some of which have changed. We
didn't actually have the nvme state machine back then, so we didn't know
what state we were in, which forced the remove case to have lots of odd
work-queue syncs.

When IOs time out because a controller is stuck, they often time out in
batches. By disabling the controller inline with the timeout handler, we
handle every single timed-out command in a single call to the nvme timeout
handler instead of thousands of calls, and since we were already in a
work queue, we were conveniently in a context that could handle the
timeout inline with the notification.

The other reason was how the blk-mq timeout handler worked. It was
creating restrictions on when a driver could complete a request, and
returning BLK_EH_HANDLED was the only way we could deterministically
get the request actually completed.

We might be able to revisit the nvme-pci behavior now that the blk-mq
timeout handling works very differently, but I'll have to wait to see if
the current behavior is settled or not.


* [PATCH 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-22 19:59 ` [PATCH 1/1] " Scott Bauer
@ 2018-06-27 19:12   ` Keith Busch
  2018-06-27 19:01     ` Scott Bauer
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-06-27 19:12 UTC


How about changing the reaction based on success?


---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 46df030b2c3f..63fe4e130bec 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1173,13 +1173,14 @@ static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
 	 * prevent memory corruption if a logical block size was changed by
 	 * this command.
 	 */
-	if (effects & NVME_CMD_EFFECTS_LBCC)
+	if (status == NVME_SC_SUCCESS && effects & NVME_CMD_EFFECTS_LBCC)
 		nvme_update_formats(ctrl);
 	if (effects & (NVME_CMD_EFFECTS_LBCC | NVME_CMD_EFFECTS_CSE_MASK))
 		nvme_unfreeze(ctrl);
-	if (effects & NVME_CMD_EFFECTS_CCC)
+	if (status == NVME_SC_SUCCESS && effects & NVME_CMD_EFFECTS_CCC)
 		nvme_init_identify(ctrl);
-	if (effects & (NVME_CMD_EFFECTS_NIC | NVME_CMD_EFFECTS_NCC))
+	if (status == NVME_SC_SUCCESS &&
+	    effects & (NVME_CMD_EFFECTS_NIC | NVME_CMD_EFFECTS_NCC))
 		nvme_queue_scan(ctrl);
 }
 
@@ -1220,7 +1221,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 			(void __user *)(uintptr_t)cmd.addr, cmd.data_len,
 			(void __user *)(uintptr_t)cmd.metadata, cmd.metadata_len,
 			0, &cmd.result, timeout);
-	nvme_passthru_end(ctrl, effects);
+	nvme_passthru_end(ctrl, effects, status);
 
 	if (status >= 0) {
 		if (put_user(cmd.result, &ucmd->result))
--


* [PATCH 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-27 19:01     ` Scott Bauer
@ 2018-06-27 20:27       ` Keith Busch
  2018-06-27 20:49         ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-06-27 20:27 UTC


On Wed, Jun 27, 2018 at 01:01:38PM -0600, Scott Bauer wrote:
> On Wed, Jun 27, 2018 at 01:12:42PM -0600, Keith Busch wrote:
> > How about changing the reaction based on success?
> 
> I don't think this will work. The other scenario is that the original admin
> command was fine: we enter nvme_update_formats(), grab the semaphore for
> write, issue the NVMe admin identify, it fails, we enter the timeout handler,
> and attempt to down_read in dev_disable -> stop queues and get stuck there.
> 
> We need a way to avoid holding a down_write on the semaphore while issuing
> I/O.

Oh right ... Maybe the down_write is a bad idea and we should instead
do something like how nvme_kill_queues works? What do you think of the
following?

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 46df030..596ea60 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1152,18 +1152,16 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 static void nvme_update_formats(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns, *next;
-	LIST_HEAD(rm_list);
 
-	down_write(&ctrl->namespaces_rwsem);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->disk && nvme_revalidate_disk(ns->disk)) {
-			list_move_tail(&ns->list, &rm_list);
-		}
+		if (ns->disk && nvme_revalidate_disk(ns->disk))
+			if (test_bit(NVME_NS_DEAD, &ns->flags))
+				set_capacity(ns->disk, 0);
 	}
-	up_write(&ctrl->namespaces_rwsem);
+	up_read(&ctrl->namespaces_rwsem);
 
-	list_for_each_entry_safe(ns, next, &rm_list, list)
-		nvme_ns_remove(ns);
+	nvme_remove_invalid_namespaces(ctrl, NVME_NSID_ALL);
 }
 
 static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
@@ -3138,7 +3136,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 
 	down_write(&ctrl->namespaces_rwsem);
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->head->ns_id > nsid)
+		if (ns->head->ns_id > nsid || test_bit(NVME_NS_DEAD, &ns->flags))
 			list_move_tail(&ns->list, &rm_list);
 	}
 	up_write(&ctrl->namespaces_rwsem);
-- 


* [PATCH 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-27 20:27       ` Keith Busch
@ 2018-06-27 20:49         ` Keith Busch
  0 siblings, 0 replies; 17+ messages in thread
From: Keith Busch @ 2018-06-27 20:49 UTC


On Wed, Jun 27, 2018 at 02:27:41PM -0600, Keith Busch wrote:
>  static void nvme_update_formats(struct nvme_ctrl *ctrl)
>  {
>  	struct nvme_ns *ns, *next;
> -	LIST_HEAD(rm_list);
>  
> -	down_write(&ctrl->namespaces_rwsem);
> +	down_read(&ctrl->namespaces_rwsem);
>  	list_for_each_entry(ns, &ctrl->namespaces, list) {
> -		if (ns->disk && nvme_revalidate_disk(ns->disk)) {
> -			list_move_tail(&ns->list, &rm_list);
> -		}
> +		if (ns->disk && nvme_revalidate_disk(ns->disk))
> +			if (test_bit(NVME_NS_DEAD, &ns->flags))
> +				set_capacity(ns->disk, 0);

Meant to be "!test_and_set_bit(...)" ... and needs to do a few other
things on top of setting capacity to 0, like blk_set_queue_dying. Maybe
just turn the inner loop of nvme_kill_queues() into a function call so
this path can reuse it.


* [PATCH v2 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-22 19:59 [PATCH 0/1] nvme: Ensure forward progress during Admin passthru Scott Bauer
  2018-06-22 19:59 ` [PATCH 1/1] " Scott Bauer
  2018-06-24 17:38 ` [PATCH 0/1] " Sagi Grimberg
@ 2018-06-28 17:10 ` Scott Bauer
  2018-06-28 19:16   ` Keith Busch
  2018-06-29 19:03 ` [PATCH v3 " Scott Bauer
  3 siblings, 1 reply; 17+ messages in thread
From: Scott Bauer @ 2018-06-28 17:10 UTC


If the controller supports effects and goes down during
the passthru admin command we will deadlock during
namespace revalidation.

[  363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds.
[  363.488290]       Not tainted 4.17.0+ #2
[  363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.488303] kworker/u16:5   D    0   231      2 0x80000000
[  363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  363.488338] Call Trace:
[  363.488385]  schedule+0x75/0x190
[  363.488396]  rwsem_down_read_failed+0x1c3/0x2f0
[  363.488481]  call_rwsem_down_read_failed+0x14/0x30
[  363.488504]  down_read+0x1d/0x80
[  363.488523]  nvme_stop_queues+0x1e/0xa0 [nvme_core]
[  363.488536]  nvme_dev_disable+0xae4/0x1620 [nvme]
[  363.488614]  nvme_reset_work+0xd1e/0x49d9 [nvme]
[  363.488911]  process_one_work+0x81a/0x1400
[  363.488934]  worker_thread+0x87/0xe80
[  363.488955]  kthread+0x2db/0x390
[  363.488977]  ret_from_fork+0x35/0x40

Fixes: 84fef62d135b6 ("nvme: check admin passthru command effects")

Signed-off-by: Scott Bauer <scott.bauer at intel.com>
---
 drivers/nvme/host/core.c | 32 ++++++++++++++++++--------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 46df030b2c3f..1ad19f0782db 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -100,6 +100,15 @@ static struct class *nvme_subsys_class;
 static void nvme_ns_remove(struct nvme_ns *ns);
 static int nvme_revalidate_disk(struct gendisk *disk);
 static void nvme_put_subsystem(struct nvme_subsystem *subsys);
+static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
+					   unsigned nsid);
+
+static void nvme_set_queue_dying(struct nvme_ns *ns)
+{
+	blk_set_queue_dying(ns->queue);
+	/* Forcibly unquiesce queues to avoid blocking dispatch */
+	blk_mq_unquiesce_queue(ns->queue);
+}
 
 static void nvme_queue_scan(struct nvme_ctrl *ctrl)
 {
@@ -1151,19 +1160,17 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 
 static void nvme_update_formats(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns, *next;
-	LIST_HEAD(rm_list);
+	struct nvme_ns *ns;
 
-	down_write(&ctrl->namespaces_rwsem);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->disk && nvme_revalidate_disk(ns->disk)) {
-			list_move_tail(&ns->list, &rm_list);
-		}
+		if (ns->disk && nvme_revalidate_disk(ns->disk))
+			if (!test_and_set_bit(NVME_NS_DEAD, &ns->flags))
+				nvme_set_queue_dying(ns);
 	}
-	up_write(&ctrl->namespaces_rwsem);
+	up_read(&ctrl->namespaces_rwsem);
 
-	list_for_each_entry_safe(ns, next, &rm_list, list)
-		nvme_ns_remove(ns);
+	nvme_remove_invalid_namespaces(ctrl, NVME_NSID_ALL);
 }
 
 static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
@@ -3138,7 +3145,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 
 	down_write(&ctrl->namespaces_rwsem);
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->head->ns_id > nsid)
+		if (ns->head->ns_id > nsid || test_bit(NVME_NS_DEAD, &ns->flags))
 			list_move_tail(&ns->list, &rm_list);
 	}
 	up_write(&ctrl->namespaces_rwsem);
@@ -3550,10 +3557,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 		if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
 			continue;
 		revalidate_disk(ns->disk);
-		blk_set_queue_dying(ns->queue);
-
-		/* Forcibly unquiesce queues to avoid blocking dispatch */
-		blk_mq_unquiesce_queue(ns->queue);
+		nvme_set_queue_dying(ns);
 	}
 	up_read(&ctrl->namespaces_rwsem);
 }
-- 
2.17.1


* [PATCH v2 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-28 17:10 ` [PATCH v2 1/1] " Scott Bauer
@ 2018-06-28 19:16   ` Keith Busch
  2018-06-28 19:19     ` Scott Bauer
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-06-28 19:16 UTC


On Thu, Jun 28, 2018 at 11:10:07AM -0600, Scott Bauer wrote:
> If the controller supports effects and goes down during
> the passthru admin command we will deadlock during
> namespace revalidation.
> 
> [  363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds.
> [  363.488290]       Not tainted 4.17.0+ #2
> [  363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.488303] kworker/u16:5   D    0   231      2 0x80000000
> [  363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
> [  363.488338] Call Trace:
> [  363.488385]  schedule+0x75/0x190
> [  363.488396]  rwsem_down_read_failed+0x1c3/0x2f0
> [  363.488481]  call_rwsem_down_read_failed+0x14/0x30
> [  363.488504]  down_read+0x1d/0x80
> [  363.488523]  nvme_stop_queues+0x1e/0xa0 [nvme_core]
> [  363.488536]  nvme_dev_disable+0xae4/0x1620 [nvme]
> [  363.488614]  nvme_reset_work+0xd1e/0x49d9 [nvme]
> [  363.488911]  process_one_work+0x81a/0x1400
> [  363.488934]  worker_thread+0x87/0xe80
> [  363.488955]  kthread+0x2db/0x390
> [  363.488977]  ret_from_fork+0x35/0x40
> 
> Fixes: 84fef62d135b6 ("nvme: check admin passthru command effects")
> 
> Signed-off-by: Scott Bauer <scott.bauer at intel.com>
> ---
>  drivers/nvme/host/core.c | 32 ++++++++++++++++++--------------
>  1 file changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 46df030b2c3f..1ad19f0782db 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -100,6 +100,15 @@ static struct class *nvme_subsys_class;
>  static void nvme_ns_remove(struct nvme_ns *ns);
>  static int nvme_revalidate_disk(struct gendisk *disk);
>  static void nvme_put_subsystem(struct nvme_subsystem *subsys);
> +static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
> +					   unsigned nsid);
> +
> +static void nvme_set_queue_dying(struct nvme_ns *ns)
> +{
> +	blk_set_queue_dying(ns->queue);
> +	/* Forcibly unquiesce queues to avoid blocking dispatch */
> +	blk_mq_unquiesce_queue(ns->queue);
> +}

I think we should actually do everything that the dead namespace does,
including revalidating the capacity to 0, just in case a buffered writer
is preventing the removal from completing. Here's an update on top of
your patch.

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 87027c122d3d..9fc15221faeb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -107,6 +107,13 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 
 static void nvme_set_queue_dying(struct nvme_ns *ns)
 {
+	/*
+	 * Revalidating a dead namespace sets capacity to 0. This will end
+	 * buffered writers dirtying pages that can't be synced.
+	 */
+	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
+		return;
+	revalidate_disk(ns->disk);
 	blk_set_queue_dying(ns->queue);
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	blk_mq_unquiesce_queue(ns->queue);
@@ -1162,11 +1169,9 @@ static void nvme_update_formats(struct nvme_ctrl *ctrl)
 	struct nvme_ns *ns;
 
 	down_read(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
+	list_for_each_entry(ns, &ctrl->namespaces, list)
 		if (ns->disk && nvme_revalidate_disk(ns->disk))
-			if (!test_and_set_bit(NVME_NS_DEAD, &ns->flags))
-				nvme_set_queue_dying(ns);
-	}
+			nvme_set_queue_dying(ns);
 	up_read(&ctrl->namespaces_rwsem);
 
 	nvme_remove_invalid_namespaces(ctrl, NVME_NSID_ALL);
@@ -3548,16 +3553,8 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 	if (ctrl->admin_q)
 		blk_mq_unquiesce_queue(ctrl->admin_q);
 
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		/*
-		 * Revalidating a dead namespace sets capacity to 0. This will
-		 * end buffered writers dirtying pages that can't be synced.
-		 */
-		if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
-			continue;
-		revalidate_disk(ns->disk);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
 		nvme_set_queue_dying(ns);
-	}
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_kill_queues);
--


* [PATCH v2 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-28 19:16   ` Keith Busch
@ 2018-06-28 19:19     ` Scott Bauer
  2018-06-28 19:54       ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Scott Bauer @ 2018-06-28 19:19 UTC


On Thu, Jun 28, 2018 at 01:16:32PM -0600, Keith Busch wrote:
> On Thu, Jun 28, 2018 at 11:10:07AM -0600, Scott Bauer wrote:
> >  static void nvme_put_subsystem(struct nvme_subsystem *subsys);
> > +static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
> > +					   unsigned nsid);
> > +
> > +static void nvme_set_queue_dying(struct nvme_ns *ns)
> > +{
> > +	blk_set_queue_dying(ns->queue);
> > +	/* Forcibly unquiesce queues to avoid blocking dispatch */
> > +	blk_mq_unquiesce_queue(ns->queue);
> > +}
> 
> I think we should actually do everything that the dead namespace does,
> including revalidating the capacity to 0, just in case a buffered writer
> is preventing the removal from completing. Here's an update on top of
> your patch.
> 

>  {
> +	/*
> +	 * Revalidating a dead namespace sets capacity to 0. This will end
> +	 * buffered writers dirtying pages that can't be synced.
> +	 */
> +	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
> +		continue;
> +	revalidate_disk(ns->disk);

I was under the impression we could not do this because the queues are still
frozen; that's why we didn't just issue a scan_work?

revalidate_disk can potentially try to issue more I/O while the queues are
frozen from passthru start.


* [PATCH v2 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-28 19:19     ` Scott Bauer
@ 2018-06-28 19:54       ` Keith Busch
  0 siblings, 0 replies; 17+ messages in thread
From: Keith Busch @ 2018-06-28 19:54 UTC


On Thu, Jun 28, 2018 at 01:19:48PM -0600, Scott Bauer wrote:
> On Thu, Jun 28, 2018 at 01:16:32PM -0600, Keith Busch wrote:
> > On Thu, Jun 28, 2018 at 11:10:07AM -0600, Scott Bauer wrote:
> > >  static void nvme_put_subsystem(struct nvme_subsystem *subsys);
> > > +static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
> > > +					   unsigned nsid);
> > > +
> > > +static void nvme_set_queue_dying(struct nvme_ns *ns)
> > > +{
> > > +	blk_set_queue_dying(ns->queue);
> > > +	/* Forcibly unquiesce queues to avoid blocking dispatch */
> > > +	blk_mq_unquiesce_queue(ns->queue);
> > > +}
> > 
> > I think we should actually do everything that the dead namespace does,
> > including revalidating the capacity to 0, just in case a buffered writer
> > is preventing the removal from completing. Here's an update on top of
> > your patch.
> > 
> 
> >  {
> > +	/*
> > +	 * Revalidating a dead namespace sets capacity to 0. This will end
> > +	 * buffered writers dirtying pages that can't be synced.
> > +	 */
> > +	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
> > +		return;
> > +	revalidate_disk(ns->disk);
> 
> I was under the impression we could not do this because the queues are still
> frozen; that's why we didn't just issue a scan_work?
> 
> revalidate_disk can potentially try to issue more I/O while the queues are
> frozen from passthru start.

Not when the namespace is dead. It will just set the capacity to 0, which
kills any buffered writers; no new IO will be attempted, and anything
gating on entering the frozen queue will fail once we set the queue to
dying.
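
For context, the dead-namespace behavior being described can be sketched
like this (an assumption about the shape of the check in
nvme_revalidate_disk(), not a quote of the actual code):

	static int nvme_revalidate_disk_sketch(struct gendisk *disk)
	{
		struct nvme_ns *ns = disk->private_data;

		if (test_bit(NVME_NS_DEAD, &ns->flags)) {
			/*
			 * Dead namespace: no identify command is issued, so
			 * nothing enters the frozen queue; the capacity drops
			 * to 0, which ends buffered writers.
			 */
			set_capacity(disk, 0);
			return -ENODEV;
		}
		/* ... otherwise issue identify and update the disk ... */
		return 0;
	}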


* [PATCH v3 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-22 19:59 [PATCH 0/1] nvme: Ensure forward progress during Admin passthru Scott Bauer
                   ` (2 preceding siblings ...)
  2018-06-28 17:10 ` [PATCH v2 1/1] " Scott Bauer
@ 2018-06-29 19:03 ` Scott Bauer
  2018-06-29 20:23   ` Keith Busch
  3 siblings, 1 reply; 17+ messages in thread
From: Scott Bauer @ 2018-06-29 19:03 UTC


If the controller supports effects and goes down during
the passthru admin command we will deadlock during
namespace revalidation.

[  363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds.
[  363.488290]       Not tainted 4.17.0+ #2
[  363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.488303] kworker/u16:5   D    0   231      2 0x80000000
[  363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  363.488338] Call Trace:
[  363.488385]  schedule+0x75/0x190
[  363.488396]  rwsem_down_read_failed+0x1c3/0x2f0
[  363.488481]  call_rwsem_down_read_failed+0x14/0x30
[  363.488504]  down_read+0x1d/0x80
[  363.488523]  nvme_stop_queues+0x1e/0xa0 [nvme_core]
[  363.488536]  nvme_dev_disable+0xae4/0x1620 [nvme]
[  363.488614]  nvme_reset_work+0xd1e/0x49d9 [nvme]
[  363.488911]  process_one_work+0x81a/0x1400
[  363.488934]  worker_thread+0x87/0xe80
[  363.488955]  kthread+0x2db/0x390
[  363.488977]  ret_from_fork+0x35/0x40

Fixes: 84fef62d135b6 ("nvme: check admin passthru command effects")

Signed-off-by: Scott Bauer <scott.bauer at intel.com>
---
 drivers/nvme/host/core.c | 50 +++++++++++++++++++++-------------------
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 46df030b2c3f..e7668c4bb4dd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -100,6 +100,22 @@ static struct class *nvme_subsys_class;
 static void nvme_ns_remove(struct nvme_ns *ns);
 static int nvme_revalidate_disk(struct gendisk *disk);
 static void nvme_put_subsystem(struct nvme_subsystem *subsys);
+static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
+					   unsigned nsid);
+
+static void nvme_set_queue_dying(struct nvme_ns *ns)
+{
+	/*
+	 * Revalidating a dead namespace sets capacity to 0. This will end
+	 * buffered writers dirtying pages that can't be synced.
+	 */
+	if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
+		return;
+	revalidate_disk(ns->disk);
+	blk_set_queue_dying(ns->queue);
+	/* Forcibly unquiesce queues to avoid blocking dispatch */
+	blk_mq_unquiesce_queue(ns->queue);
+}
 
 static void nvme_queue_scan(struct nvme_ctrl *ctrl)
 {
@@ -1151,19 +1167,15 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 
 static void nvme_update_formats(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns, *next;
-	LIST_HEAD(rm_list);
+	struct nvme_ns *ns;
 
-	down_write(&ctrl->namespaces_rwsem);
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->disk && nvme_revalidate_disk(ns->disk)) {
-			list_move_tail(&ns->list, &rm_list);
-		}
-	}
-	up_write(&ctrl->namespaces_rwsem);
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		if (ns->disk && nvme_revalidate_disk(ns->disk))
+			nvme_set_queue_dying(ns);
+	up_read(&ctrl->namespaces_rwsem);
 
-	list_for_each_entry_safe(ns, next, &rm_list, list)
-		nvme_ns_remove(ns);
+	nvme_remove_invalid_namespaces(ctrl, NVME_NSID_ALL);
 }
 
 static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
@@ -3138,7 +3150,7 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 
 	down_write(&ctrl->namespaces_rwsem);
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->head->ns_id > nsid)
+		if (ns->head->ns_id > nsid || test_bit(NVME_NS_DEAD, &ns->flags))
 			list_move_tail(&ns->list, &rm_list);
 	}
 	up_write(&ctrl->namespaces_rwsem);
@@ -3542,19 +3554,9 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 	if (ctrl->admin_q)
 		blk_mq_unquiesce_queue(ctrl->admin_q);
 
-	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		/*
-		 * Revalidating a dead namespace sets capacity to 0. This will
-		 * end buffered writers dirtying pages that can't be synced.
-		 */
-		if (!ns->disk || test_and_set_bit(NVME_NS_DEAD, &ns->flags))
-			continue;
-		revalidate_disk(ns->disk);
-		blk_set_queue_dying(ns->queue);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		nvme_set_queue_dying(ns);
 
-		/* Forcibly unquiesce queues to avoid blocking dispatch */
-		blk_mq_unquiesce_queue(ns->queue);
-	}
 	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_kill_queues);
-- 
2.17.1


* [PATCH v3 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-29 19:03 ` [PATCH v3 " Scott Bauer
@ 2018-06-29 20:23   ` Keith Busch
  2018-07-16 22:09     ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-06-29 20:23 UTC


On Fri, Jun 29, 2018 at 01:03:28PM -0600, Scott Bauer wrote:
> If the controller supports effects and goes down during
> the passthru admin command we will deadlock during
> namespace revalidation.
> 
> [  363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds.
> [  363.488290]       Not tainted 4.17.0+ #2
> [  363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.488303] kworker/u16:5   D    0   231      2 0x80000000
> [  363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
> [  363.488338] Call Trace:
> [  363.488385]  schedule+0x75/0x190
> [  363.488396]  rwsem_down_read_failed+0x1c3/0x2f0
> [  363.488481]  call_rwsem_down_read_failed+0x14/0x30
> [  363.488504]  down_read+0x1d/0x80
> [  363.488523]  nvme_stop_queues+0x1e/0xa0 [nvme_core]
> [  363.488536]  nvme_dev_disable+0xae4/0x1620 [nvme]
> [  363.488614]  nvme_reset_work+0xd1e/0x49d9 [nvme]
> [  363.488911]  process_one_work+0x81a/0x1400
> [  363.488934]  worker_thread+0x87/0xe80
> [  363.488955]  kthread+0x2db/0x390
> [  363.488977]  ret_from_fork+0x35/0x40
> 
> Fixes: 84fef62d135b6 ("nvme: check admin passthru command effects")
> 
> Signed-off-by: Scott Bauer <scott.bauer at intel.com>

Looks good to me.

Reviewed-by: Keith Busch <keith.busch at linux.intel.com>


* [PATCH v3 1/1] nvme: Ensure forward progress during Admin passthru
  2018-06-29 20:23   ` Keith Busch
@ 2018-07-16 22:09     ` Keith Busch
  2018-07-17 12:42       ` Christoph Hellwig
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-16 22:09 UTC


On Fri, Jun 29, 2018 at 02:23:50PM -0600, Keith Busch wrote:
> On Fri, Jun 29, 2018 at 01:03:28PM -0600, Scott Bauer wrote:
> > If the controller supports effects and goes down during
> > the passthru admin command we will deadlock during
> > namespace revalidation.
> > 
> > [  363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds.
> > [  363.488290]       Not tainted 4.17.0+ #2
> > [  363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  363.488303] kworker/u16:5   D    0   231      2 0x80000000
> > [  363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
> > [  363.488338] Call Trace:
> > [  363.488385]  schedule+0x75/0x190
> > [  363.488396]  rwsem_down_read_failed+0x1c3/0x2f0
> > [  363.488481]  call_rwsem_down_read_failed+0x14/0x30
> > [  363.488504]  down_read+0x1d/0x80
> > [  363.488523]  nvme_stop_queues+0x1e/0xa0 [nvme_core]
> > [  363.488536]  nvme_dev_disable+0xae4/0x1620 [nvme]
> > [  363.488614]  nvme_reset_work+0xd1e/0x49d9 [nvme]
> > [  363.488911]  process_one_work+0x81a/0x1400
> > [  363.488934]  worker_thread+0x87/0xe80
> > [  363.488955]  kthread+0x2db/0x390
> > [  363.488977]  ret_from_fork+0x35/0x40
> > 
> > Fixes: 84fef62d135b6 ("nvme: check admin passthru command effects")
> > 
> > Signed-off-by: Scott Bauer <scott.bauer at intel.com>
> 
> Looks good to me.
> 
> Reviewed-by: Keith Busch <keith.busch at linux.intel.com>

Christoph,
Could we get this one queued up for next?


* [PATCH v3 1/1] nvme: Ensure forward progress during Admin passthru
  2018-07-16 22:09     ` Keith Busch
@ 2018-07-17 12:42       ` Christoph Hellwig
  2018-07-18 11:26         ` Sagi Grimberg
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2018-07-17 12:42 UTC


On Mon, Jul 16, 2018 at 04:09:40PM -0600, Keith Busch wrote:
> On Fri, Jun 29, 2018 at 02:23:50PM -0600, Keith Busch wrote:
> > On Fri, Jun 29, 2018 at 01:03:28PM -0600, Scott Bauer wrote:
> > > If the controller supports effects and goes down during
> > > the passthru admin command we will deadlock during
> > > namespace revalidation.
> > > 
> > > [  363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds.
> > > [  363.488290]       Not tainted 4.17.0+ #2
> > > [  363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [  363.488303] kworker/u16:5   D    0   231      2 0x80000000
> > > [  363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
> > > [  363.488338] Call Trace:
> > > [  363.488385]  schedule+0x75/0x190
> > > [  363.488396]  rwsem_down_read_failed+0x1c3/0x2f0
> > > [  363.488481]  call_rwsem_down_read_failed+0x14/0x30
> > > [  363.488504]  down_read+0x1d/0x80
> > > [  363.488523]  nvme_stop_queues+0x1e/0xa0 [nvme_core]
> > > [  363.488536]  nvme_dev_disable+0xae4/0x1620 [nvme]
> > > [  363.488614]  nvme_reset_work+0xd1e/0x49d9 [nvme]
> > > [  363.488911]  process_one_work+0x81a/0x1400
> > > [  363.488934]  worker_thread+0x87/0xe80
> > > [  363.488955]  kthread+0x2db/0x390
> > > [  363.488977]  ret_from_fork+0x35/0x40
> > > 
> > > Fixes: 84fef62d135b6 ("nvme: check admin passthru command effects")
> > > 
> > > Signed-off-by: Scott Bauer <scott.bauer at intel.com>
> > 
> > Looks good to me.
> > 
> > Reviewed-by: Keith Busch <keith.busch at linux.intel.com>
> 
> Christoph,
> Could we get this one queued up for next?

Will look at it ASAP.


* [PATCH v3 1/1] nvme: Ensure forward progress during Admin passthru
  2018-07-17 12:42       ` Christoph Hellwig
@ 2018-07-18 11:26         ` Sagi Grimberg
  0 siblings, 0 replies; 17+ messages in thread
From: Sagi Grimberg @ 2018-07-18 11:26 UTC



>> Christoph,
>> Could we get this one queued up for next?
> 
> Will look at it ASAP.

It would be good if the change log explained _how_ it
ensures forward progress, i.e. that it moves the namespaces
list mutation to nvme_remove_invalid_namespaces(), which allows
namespaces_rwsem to be acquired as a read lock, allowing dev_disable
to stop the controller queues (acquiring a read lock as well) in case
a timeout has occurred on a passthru command.

