* respun multiple removal fix
@ 2017-06-01 14:06 Christoph Hellwig
  2017-06-01 14:06 ` [PATCH] nvme: fix multiple ctrl removal scheduling Christoph Hellwig
From: Christoph Hellwig @ 2017-06-01 14:06 UTC


This includes the fabrics drivers.  FC looked a bit odd here, so
extra review from James would be great.


* [PATCH] nvme: fix multiple ctrl removal scheduling
  2017-06-01 14:06 respun multiple removal fix Christoph Hellwig
@ 2017-06-01 14:06 ` Christoph Hellwig
  2017-06-01 14:37   ` Sagi Grimberg
  2017-06-01 20:00   ` Keith Busch
From: Christoph Hellwig @ 2017-06-01 14:06 UTC


From: Rakesh Pandit <rakesh@tuxera.com>

Commit c5f6ce97c1210 tries to address multiple resets, but fails because
work_busy doesn't provide any synchronization, so the check can race.
This is easily reproducible, as shown by the WARNING below, which is
triggered by the line:

WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

Allowing multiple resets can also result in multiple controller removals
if different conditions inside nvme_reset_work fail, which might deadlock
on device_release_driver.
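
As an illustration only (not part of this patch), here is a minimal
userspace sketch of the check-then-queue race: a plain "busy" check,
standing in for work_busy(), can let both threads decide to schedule,
while an atomic state transition, standing in for the locked
nvme_change_ctrl_state(), admits exactly one of them:

--
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { CTRL_LIVE, CTRL_SCHED_RESET };

static atomic_int state = CTRL_LIVE;	/* stand-in for ctrl->state */
static atomic_int racy_schedules;	/* schedules allowed by the plain check */
static atomic_int safe_schedules;	/* schedules allowed by the transition */

static void *try_reset(void *unused)
{
	int expected = CTRL_LIVE;

	/* Racy: both threads can observe "not busy" before either queues. */
	if (atomic_load(&state) == CTRL_LIVE)
		atomic_fetch_add(&racy_schedules, 1);

	/* Safe: only one compare-and-swap LIVE -> SCHED_RESET can succeed. */
	if (atomic_compare_exchange_strong(&state, &expected, CTRL_SCHED_RESET))
		atomic_fetch_add(&safe_schedules, 1);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, try_reset, NULL);
	pthread_create(&b, NULL, try_reset, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* racy may print 2; safe is always 1 */
	printf("racy: %d, safe: %d\n",
	       atomic_load(&racy_schedules), atomic_load(&safe_schedules));
	return 0;
}
--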

This patch introduces a new state called NVME_CTRL_SCHED_RESET and uses
it to synchronize scheduling of reset_work, since state changes are
serialized by the controller spinlock.

[  480.327007] WARNING: CPU: 3 PID: 150 at drivers/nvme/host/pci.c:1900 nvme_reset_work+0x36c/0xec0
[  480.327008] Modules linked in: rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast...
[  480.327044]  btusb videobuf2_core ghash_clmulni_intel snd_hwdep cfg80211 acer_wmi hci_uart..
[  480.327065] CPU: 3 PID: 150 Comm: kworker/u16:2 Not tainted 4.12.0-rc1+ #13
[  480.327065] Hardware name: Acer Predator G9-591/Mustang_SLS, BIOS V1.10 03/03/2016
[  480.327066] Workqueue: nvme nvme_reset_work
[  480.327067] task: ffff880498ad8000 task.stack: ffffc90002218000
[  480.327068] RIP: 0010:nvme_reset_work+0x36c/0xec0
[  480.327069] RSP: 0018:ffffc9000221bdb8 EFLAGS: 00010246
[  480.327070] RAX: 0000000000460000 RBX: ffff880498a98128 RCX: dead000000000200
[  480.327070] RDX: 0000000000000001 RSI: ffff8804b1028020 RDI: ffff880498a98128
[  480.327071] RBP: ffffc9000221be50 R08: 0000000000000000 R09: 0000000000000000
[  480.327071] R10: ffffc90001963ce8 R11: 000000000000020d R12: ffff880498a98000
[  480.327072] R13: ffff880498a53500 R14: ffff880498a98130 R15: ffff880498a98128
[  480.327072] FS:  0000000000000000(0000) GS:ffff8804c1cc0000(0000) knlGS:0000000000000000
[  480.327073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  480.327074] CR2: 00007ffcf3c37f78 CR3: 0000000001e09000 CR4: 00000000003406e0
[  480.327074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  480.327075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  480.327075] Call Trace:
[  480.327079]  ? __switch_to+0x227/0x400
[  480.327081]  process_one_work+0x18c/0x3a0
[  480.327082]  worker_thread+0x4e/0x3b0
[  480.327084]  kthread+0x109/0x140
[  480.327085]  ? process_one_work+0x3a0/0x3a0
[  480.327087]  ? kthread_park+0x60/0x60
[  480.327102]  ret_from_fork+0x2c/0x40
[  480.327103] Code: e8 5a dc ff ff 85 c0 41 89 c1 0f.....

Fixes: c5f6ce97c1210 ("nvme: don't schedule multiple resets")
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
[hch: also update the fabrics drivers]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c   | 12 +++++++++++-
 drivers/nvme/host/fc.c     | 12 ++++++++----
 drivers/nvme/host/nvme.h   |  1 +
 drivers/nvme/host/pci.c    | 12 +++++++++---
 drivers/nvme/host/rdma.c   |  5 ++++-
 drivers/nvme/target/loop.c |  5 ++++-
 6 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a60926410438..0935dc08f46e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -161,7 +161,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 			break;
 		}
 		break;
-	case NVME_CTRL_RESETTING:
+	case NVME_CTRL_SCHED_RESET:
 		switch (old_state) {
 		case NVME_CTRL_NEW:
 		case NVME_CTRL_LIVE:
@@ -172,6 +172,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 			break;
 		}
 		break;
+	case NVME_CTRL_RESETTING:
+		switch (old_state) {
+		case NVME_CTRL_SCHED_RESET:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
 	case NVME_CTRL_RECONNECTING:
 		switch (old_state) {
 		case NVME_CTRL_LIVE:
@@ -1907,6 +1916,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
 	static const char *const state_name[] = {
 		[NVME_CTRL_NEW]		= "new",
 		[NVME_CTRL_LIVE]	= "live",
+		[NVME_CTRL_SCHED_RESET] = "sched_reset",
 		[NVME_CTRL_RESETTING]	= "resetting",
 		[NVME_CTRL_RECONNECTING]= "reconnecting",
 		[NVME_CTRL_DELETING]	= "deleting",
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 5b14cbefb724..61d7746bbbc8 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -1754,10 +1754,10 @@ nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
 	if (ctrl->queue_count > 1)
 		nvme_stop_queues(&ctrl->ctrl);
 
-	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RECONNECTING)) {
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_SCHED_RESET)) {
 		dev_err(ctrl->ctrl.device,
 			"NVME-FC{%d}: error_recovery: Couldn't change state "
-			"to RECONNECTING\n", ctrl->cnum);
+			"to SCHED_RESET\n", ctrl->cnum);
 		return;
 	}
 
@@ -2578,7 +2578,7 @@ static void
 nvme_fc_reconnect_or_delete(struct nvme_fc_ctrl *ctrl, int status)
 {
 	/* If we are resetting/deleting then do nothing */
-	if (ctrl->ctrl.state != NVME_CTRL_RECONNECTING) {
+	if (ctrl->ctrl.state != NVME_CTRL_SCHED_RESET) {
 		WARN_ON_ONCE(ctrl->ctrl.state == NVME_CTRL_NEW ||
 			ctrl->ctrl.state == NVME_CTRL_LIVE);
 		return;
@@ -2613,6 +2613,10 @@ nvme_fc_reset_ctrl_work(struct work_struct *work)
 	/* will block will waiting for io to terminate */
 	nvme_fc_delete_association(ctrl);
 
+	if (WARN_ON_ONCE(!nvme_change_ctrl_state(&ctrl->ctrl,
+			NVME_CTRL_RESETTING)))
+		return;
+
 	ret = nvme_fc_create_association(ctrl);
 	if (ret)
 		nvme_fc_reconnect_or_delete(ctrl, ret);
@@ -2633,7 +2637,7 @@ nvme_fc_reset_nvme_ctrl(struct nvme_ctrl *nctrl)
 	dev_info(ctrl->ctrl.device,
 		"NVME-FC{%d}: admin requested controller reset\n", ctrl->cnum);
 
-	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_SCHED_RESET))
 		return -EBUSY;
 
 	if (!queue_work(nvme_fc_wq, &ctrl->reset_work))
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9d6a070d4391..4ca84b12df69 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -109,6 +109,7 @@ static inline struct nvme_request *nvme_req(struct request *req)
 enum nvme_ctrl_state {
 	NVME_CTRL_NEW,
 	NVME_CTRL_LIVE,
+	NVME_CTRL_SCHED_RESET,
 	NVME_CTRL_RESETTING,
 	NVME_CTRL_RECONNECTING,
 	NVME_CTRL_DELETING,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d52701df7245..5b460ce745f6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1366,7 +1366,12 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
 	 */
 	bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
 
-	/* If there is a reset ongoing, we shouldn't reset again. */
+	/*
+	 * If there is a reset ongoing, we shouldn't reset again.  Even though
+	 * work_busy provides an advisory hint, nvme_reset is always called
+	 * after this and takes care of synchronization using controller state
+	 * change.
+	 */
 	if (work_busy(&dev->reset_work))
 		return false;
 
@@ -2009,8 +2014,8 @@ static int nvme_reset(struct nvme_dev *dev)
 {
 	if (!dev->ctrl.admin_q || blk_queue_dying(dev->ctrl.admin_q))
 		return -ENODEV;
-	if (work_busy(&dev->reset_work))
-		return -ENODEV;
+	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_SCHED_RESET))
+		return -EBUSY;
 	if (!queue_work(nvme_workq, &dev->reset_work))
 		return -EBUSY;
 	return 0;
@@ -2138,6 +2143,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	dev_info(dev->ctrl.device, "pci function %s\n", dev_name(&pdev->dev));
 
+	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_SCHED_RESET);
 	queue_work(nvme_workq, &dev->reset_work);
 	return 0;
 
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 28bd255c144d..aa3174afe73b 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1736,6 +1736,9 @@ static void nvme_rdma_reset_ctrl_work(struct work_struct *work)
 
 	nvme_rdma_shutdown_ctrl(ctrl);
 
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
+		goto del_dead_ctrl;
+
 	ret = nvme_rdma_configure_admin_queue(ctrl);
 	if (ret) {
 		/* ctrl is already shutdown, just remove the ctrl */
@@ -1778,7 +1781,7 @@ static int nvme_rdma_reset_ctrl(struct nvme_ctrl *nctrl)
 {
 	struct nvme_rdma_ctrl *ctrl = to_rdma_ctrl(nctrl);
 
-	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_SCHED_RESET))
 		return -EBUSY;
 
 	if (!queue_work(nvme_rdma_wq, &ctrl->reset_work))
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index e503cfff0337..de9c1d367d70 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -508,6 +508,9 @@ static void nvme_loop_reset_ctrl_work(struct work_struct *work)
 
 	nvme_loop_shutdown_ctrl(ctrl);
 
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
+		goto out_disable;
+
 	ret = nvme_loop_configure_admin_queue(ctrl);
 	if (ret)
 		goto out_disable;
@@ -544,7 +547,7 @@ static int nvme_loop_reset_ctrl(struct nvme_ctrl *nctrl)
 {
 	struct nvme_loop_ctrl *ctrl = to_loop_ctrl(nctrl);
 
-	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_RESETTING))
+	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_SCHED_RESET))
 		return -EBUSY;
 
 	if (!schedule_work(&ctrl->reset_work))
-- 
2.11.0


* [PATCH] nvme: fix multiple ctrl removal scheduling
  2017-06-01 14:06 ` [PATCH] nvme: fix multiple ctrl removal scheduling Christoph Hellwig
@ 2017-06-01 14:37   ` Sagi Grimberg
  2017-06-01 15:41     ` Christoph Hellwig
  2017-06-01 20:00   ` Keith Busch
From: Sagi Grimberg @ 2017-06-01 14:37 UTC



> From: Rakesh Pandit <rakesh at tuxera.com>
>
> Commit c5f6ce97c1210 tries to address multiple resets but fails as
> work_busy doesn't involve any synchronization and can fail.  This is
> reproducible easily as can be seen by WARNING below which is triggered
> with line:
>
> WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)
>
> Allowing multiple resets can result in multiple controller removal as
> well if different conditions inside nvme_reset_work fail and which
> might deadlock on device_release_driver.
>
> This patch introduces new state called NVME_CTRL_SCHED_RESET and uses
> it for synchronizing work queue addition (reset_work) as state change
> uses controller spinlock.

Christoph,

Did you catch my proposal to Keith and Rakesh?

I proposed to synchronize on the ctrl state NVME_CTRL_RESETTING
before scheduling the reset_work, and in the removal flow (which is
why the warning exists) simply do a sync cancel of reset_work.

IMO, this is better than adding a new state to the ctrl state machine.


* [PATCH] nvme: fix multiple ctrl removal scheduling
  2017-06-01 14:37   ` Sagi Grimberg
@ 2017-06-01 15:41     ` Christoph Hellwig
  2017-06-01 22:02       ` Sagi Grimberg
From: Christoph Hellwig @ 2017-06-01 15:41 UTC


On Thu, Jun 01, 2017 at 05:37:41PM +0300, Sagi Grimberg wrote:
> Did you catch my proposal to Keith and Rakesh?

I saw it.

>
> I proposed to synchronize with ctrl state NVME_CTRL_RESETTING
> before scheduling the reset_work and on removal flow (which is
> why the warning exist) simply do a sync cancel of reset_work.
>
> IMO, this is better than adding a new state to the ctrl state machine.

Why?  Knowing from the state machine that we're going to reset
is very useful and will save us from doing pointless work elsewhere.
There is a reason we have done it that way in Fabrics from the beginning.

E.g. the work_busy check in nvme_should_reset can now be replaced
with a check for NVME_CTRL_SCHED_RESET and NVME_CTRL_RESETTING, which
is much more reliable.
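
For illustration only (not a hunk posted in this thread), that check in
nvme_should_reset could then look roughly like:

-	/* If there is a reset ongoing, we shouldn't reset again. */
-	if (work_busy(&dev->reset_work))
+	/* If a reset is already scheduled or running, don't reset again. */
+	if (dev->ctrl.state == NVME_CTRL_SCHED_RESET ||
+	    dev->ctrl.state == NVME_CTRL_RESETTING)
 		return false;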


* [PATCH] nvme: fix multiple ctrl removal scheduling
  2017-06-01 14:06 ` [PATCH] nvme: fix multiple ctrl removal scheduling Christoph Hellwig
  2017-06-01 14:37   ` Sagi Grimberg
@ 2017-06-01 20:00   ` Keith Busch
  2017-06-02 17:10     ` Rakesh Pandit
From: Keith Busch @ 2017-06-01 20:00 UTC


On Thu, Jun 01, 2017 at 04:06:29PM +0200, Christoph Hellwig wrote:
> index a60926410438..0935dc08f46e 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -161,7 +161,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>  			break;
>  		}
>  		break;
> -	case NVME_CTRL_RESETTING:
> +	case NVME_CTRL_SCHED_RESET:
>  		switch (old_state) {
>  		case NVME_CTRL_NEW:
>  		case NVME_CTRL_LIVE:
> @@ -172,6 +172,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>  			break;
>  		}
>  		break;
> +	case NVME_CTRL_RESETTING:
> +		switch (old_state) {
> +		case NVME_CTRL_SCHED_RESET:
> +			changed = true;
> +			/* FALLTHRU */
> +		default:
> +			break;
> +		}
> +		break;
>  	case NVME_CTRL_RECONNECTING:
>  		switch (old_state) {
>  		case NVME_CTRL_LIVE:

We also need to add the new NVME_CTRL_SCHED_RESET state as a valid
old_state for the transition to NVME_CTRL_DELETING.
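
For illustration only (not a hunk posted in this thread, and assuming the
NVME_CTRL_DELETING case currently accepts LIVE, RESETTING and RECONNECTING
as old states), the suggested addition would look roughly like:

 	case NVME_CTRL_DELETING:
 		switch (old_state) {
 		case NVME_CTRL_LIVE:
+		case NVME_CTRL_SCHED_RESET:
 		case NVME_CTRL_RESETTING:
 		case NVME_CTRL_RECONNECTING:
 			changed = true;
 			/* FALLTHRU */
 		default:
 			break;
 		}
 		break;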

> @@ -2009,8 +2014,8 @@ static int nvme_reset(struct nvme_dev *dev)
>  {
>  	if (!dev->ctrl.admin_q || blk_queue_dying(dev->ctrl.admin_q))
>  		return -ENODEV;
> -	if (work_busy(&dev->reset_work))
> -		return -ENODEV;
> +	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_SCHED_RESET))
> +		return -EBUSY;
>  	if (!queue_work(nvme_workq, &dev->reset_work))
>  		return -EBUSY;
>  	return 0;

Just going through a hypothetical scenario: let's say a user requests
an orderly removal of a device through sysfs. The controller happens
to fail after del_gendisk is called, and commands time out, triggering a
reset.

Today, the reset work will consider the controller dead and kill the
nvme queues to force all new IO to fail immediately.

With this change, the reset work won't get to run, so nothing will get
to kill the queues. The driver will be stuck at del_gendisk.

Something like the following will fix that for pci:

---
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -993,7 +993,10 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, false);
-		nvme_reset(dev);
+		if (nvme_reset(dev) != 0) {
+			if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD))
+				nvme_kill_queues(&dev->ctrl);
+		}
 
 		/*
 		 * Mark the request as handled, since the inline shutdown
--


* [PATCH] nvme: fix multiple ctrl removal scheduling
  2017-06-01 15:41     ` Christoph Hellwig
@ 2017-06-01 22:02       ` Sagi Grimberg
  2017-06-02 17:28         ` Rakesh Pandit
From: Sagi Grimberg @ 2017-06-01 22:02 UTC



>> Did you catch my proposal to Keith and Rakesh?
>
> I saw it.
>
>>
>> I proposed to synchronize with ctrl state NVME_CTRL_RESETTING
>> before scheduling the reset_work and on removal flow (which is
>> why the warning exist) simply do a sync cancel of reset_work.
>>
>> IMO, this is better than adding a new state to the ctrl state machine.
>
> Why?  Knowing from the state machine that we're going to reset
> is very useful and will safe us from doing pointless work elsewhere.
> There is a reason we have done it that way in Fabrics from the beginning.

I'm not following: we have not done it that way for fabrics; in
fabrics we are doing exactly what I'm proposing.

> E.g. as the work_busy check in nvme_should_reset can now be replaced
> with a check for NVME_CTRL_SCHED_RESET and NVME_CTRL_RESETTING and
> we're much more reliable.

I agree on the state check.  I think it should check NVME_CTRL_RESETTING,
use that state when we schedule reset_work (like we do in fabrics),
and drop the now-redundant WARN_ON in reset_work that protects against
the special case of concurrent controller deletion (which we remove
altogether by canceling the reset_work there).

I think we're in sync about the mechanics, I just don't understand why
we need another state here.

Can something like the (untested) following work?
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2769298e16dc..ca92fa24c6fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1384,7 +1384,7 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
         bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);

         /* If there is a reset ongoing, we shouldn't reset again. */
-       if (work_busy(&dev->reset_work))
+       if (dev->ctrl.state == NVME_CTRL_RESETTING)
                 return false;

         /* We shouldn't reset unless the controller is on fatal error state
@@ -2026,8 +2026,8 @@ static int nvme_reset(struct nvme_dev *dev)
  {
         if (!dev->ctrl.admin_q || blk_queue_dying(dev->ctrl.admin_q))
                 return -ENODEV;
-       if (work_busy(&dev->reset_work))
-               return -ENODEV;
+       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
+               return -EBUSY;
         if (!queue_work(nvme_wq, &dev->reset_work))
                 return -EBUSY;
         return 0;
@@ -2194,6 +2194,7 @@ static void nvme_remove(struct pci_dev *pdev)
  {
         struct nvme_dev *dev = pci_get_drvdata(pdev);

+       cancel_work_sync(&dev->reset_work);
         nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);

         pci_set_drvdata(pdev, NULL);
--


* [PATCH] nvme: fix multiple ctrl removal scheduling
  2017-06-01 20:00   ` Keith Busch
@ 2017-06-02 17:10     ` Rakesh Pandit
From: Rakesh Pandit @ 2017-06-02 17:10 UTC


On Thu, Jun 01, 2017 at 04:00:26PM -0400, Keith Busch wrote:
> On Thu, Jun 01, 2017@04:06:29PM +0200, Christoph Hellwig wrote:
> > index a60926410438..0935dc08f46e 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
....
> >  		break;
> > +	case NVME_CTRL_RESETTING:
> > +		switch (old_state) {
> > +		case NVME_CTRL_SCHED_RESET:
> > +			changed = true;
> > +			/* FALLTHRU */
> > +		default:
> > +			break;
> > +		}
> > +		break;
> >  	case NVME_CTRL_RECONNECTING:
> >  		switch (old_state) {
> >  		case NVME_CTRL_LIVE:
> 
> We also need to add the new NVME_SCHED_RESET state as a valid old_state
> transition to NVME_CTRL_DELETING.
>

Makes sense.

> > @@ -2009,8 +2014,8 @@ static int nvme_reset(struct nvme_dev *dev)
> >  {
> >  	if (!dev->ctrl.admin_q || blk_queue_dying(dev->ctrl.admin_q))
> >  		return -ENODEV;
> > -	if (work_busy(&dev->reset_work))
> > -		return -ENODEV;
> > +	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_SCHED_RESET))
> > +		return -EBUSY;
> >  	if (!queue_work(nvme_workq, &dev->reset_work))
> >  		return -EBUSY;
> >  	return 0;
> 
> Just going through a hypothetical scenario: let's say a user requests
> an orderly removal of a device through sysfs. The the controller happens
> to fail after del_gendisk is called, and commands timeout, triggering a
> reset.
> 
> Today, the reset work will consider the controller dead and kill the
> nvme queues to force all new IO to fail immediately.
> 
> With this change, the reset work won't get to run, so nothing will get
> to kill the queues. The driver will be stuck at del_gendisk.
> 
> Something like the following will fix that for pci:
> 
> ---
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -993,7 +993,10 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  			 "I/O %d QID %d timeout, reset controller\n",
>  			 req->tag, nvmeq->qid);
>  		nvme_dev_disable(dev, false);
> -		nvme_reset(dev);
> +		if (nvme_reset(dev) != 0) {
> +			if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD))
> +				nvme_kill_queues(&dev->ctrl);
> +		}
>  
>  		/*
>  		 * Mark the request as handled, since the inline shutdown
> --

Thanks for pointing that out.  Seems correct.

Christoph: could you fold the above diff and the state machine change
above into a new version of the patch?  Let me know if you want me to
post a new version.


* [PATCH] nvme: fix multiple ctrl removal scheduling
  2017-06-01 22:02       ` Sagi Grimberg
@ 2017-06-02 17:28         ` Rakesh Pandit
  2017-06-03  7:31           ` Sagi Grimberg
From: Rakesh Pandit @ 2017-06-02 17:28 UTC


On Fri, Jun 02, 2017 at 01:02:09AM +0300, Sagi Grimberg wrote:
> 
> > > Did you catch my proposal to Keith and Rakesh?
> > 
> > I saw it.
> > 
> > > 
> > > I proposed to synchronize with ctrl state NVME_CTRL_RESETTING
> > > before scheduling the reset_work and on removal flow (which is
> > > why the warning exist) simply do a sync cancel of reset_work.
> > > 
> > > IMO, this is better than adding a new state to the ctrl state machine.
> > 
> > Why?  Knowing from the state machine that we're going to reset
> > is very useful and will safe us from doing pointless work elsewhere.
> > There is a reason we have done it that way in Fabrics from the beginning.
> 
> I'm not following, we have not done it for fabrics, we are doing in
> fabrics exactly what I'm proposing.
> 
> > E.g. as the work_busy check in nvme_should_reset can now be replaced
> > with a check for NVME_CTRL_SCHED_RESET and NVME_CTRL_RESETTING and
> > we're much more reliable.
> 
> I agree on the state check, I think it should check NVME_CTRL_RESETTING
> and use that state when we schedule reset_work (like we do in fabrics)
> and drop the redundant WARN_ON in reset_work to protect against the
> special case of controller concurrent deletion (which we remove
> altogether by canceling the reset_work there).
> 
> I think we're in sync about the mechanics, I just understand why we need
> another state here.
> 
> Can something like the (untested) following work?

Logically it should work.

> --
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 2769298e16dc..ca92fa24c6fc 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1384,7 +1384,7 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
>         bool nssro = dev->subsystem && (csts & NVME_CSTS_NSSRO);
> 
>         /* If there is a reset ongoing, we shouldn't reset again. */
> -       if (work_busy(&dev->reset_work))
> +       if (dev->ctrl.state == NVME_CTRL_RESETTING)
>                 return false;
> 
>         /* We shouldn't reset unless the controller is on fatal error state
> @@ -2026,8 +2026,8 @@ static int nvme_reset(struct nvme_dev *dev)
>  {
>         if (!dev->ctrl.admin_q || blk_queue_dying(dev->ctrl.admin_q))
>                 return -ENODEV;
> -       if (work_busy(&dev->reset_work))
> -               return -ENODEV;
> +       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
> +               return -EBUSY;
>         if (!queue_work(nvme_wq, &dev->reset_work))
>                 return -EBUSY;
>         return 0;
> @@ -2194,6 +2194,7 @@ static void nvme_remove(struct pci_dev *pdev)
>  {
>         struct nvme_dev *dev = pci_get_drvdata(pdev);
> 
> +       cancel_work_sync(&dev->reset_work);
>         nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> 
>         pci_set_drvdata(pdev, NULL);
> --

cancel_work_sync seems to be a good idea, but I would still consider
the new state a better alternative, as it seems to make the overall
design simpler.

Before switching to the new state, earlier versions of my patch did
consider using the RESETTING state for synchronization, but we gave up
on that idea.

Anyway, once we decide which way to go, I would be happy to give
this a test spin.  You just forgot to remove the now-redundant WARN_ON
in the untested diff, but of course it was only for demonstration.

Thanks.


* [PATCH] nvme: fix multiple ctrl removal scheduling
  2017-06-02 17:28         ` Rakesh Pandit
@ 2017-06-03  7:31           ` Sagi Grimberg
From: Sagi Grimberg @ 2017-06-03  7:31 UTC



> cancel_work_sync seems to be a good idea

In fact, I think we should have it independently of
the state discussion; it's healthy to make sure nothing runs
concurrently with remove.

> but I would still consider
> new state to be a better alternative as it seems to make overal design
> simpler.

I'm still not convinced that adding a new state makes things
simpler.

> 
> Anyway, once we can decide which way to go, I would be happy to give
> this a test spin.

Keith, Christoph, can you please explain again why adding a new
state is better?

> You just forgot to remove the now reduntant WARN_ON
> in untested one but of course this was just untested and for
> demonstration only.

Obviously.
