* [PATCH V3 0/9] nvme-pci: fixes on nvme_timeout and nvme_dev_disable
From: Jianchao Wang @ 2018-02-11  9:38 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Hi Christoph, Keith and Sagi,

Please consider and comment on the following patchset;
your feedback is much appreciated.

There is a complicated relationship between nvme_timeout and
nvme_dev_disable (sketched below):
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller from doing DMA accesses before freeing the request.
 - nvme_dev_disable has to depend on nvme_timeout to complete
   adminq requests that set the HMB or delete sq/cq when the
   controller has no response.
 - nvme_dev_disable will race with nvme_timeout when it cancels the
   outstanding requests.
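
A rough sketch of this cycle (our illustration of the description
above; the function names are those in drivers/nvme/host/pci.c, but
the chain itself is not verbatim code):

    nvme_timeout(req)                    /* a request has timed out */
      -> nvme_dev_disable(dev)           /* stop controller DMA first */
           -> nvme_set_host_mem(dev, 0) or
              adapter_delete_sq/cq(...)  /* adminq commands that get no
                                            response from a dead controller */
                -> nvme_timeout(...)     /* ... and so time out as well */

Meanwhile, nvme_dev_disable also cancels the outstanding requests
(via blk_mq_tagset_busy_iter with nvme_cancel_request), racing with
nvme_timeout on the same requests.
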
We have found some issues introduced by this relationship; please refer
to the following links:

http://lists.infradead.org/pipermail/linux-nvme/2018-January/015053.html
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015276.html
http://lists.infradead.org/pipermail/linux-nvme/2018-January/015328.html

Even so, we cannot be sure there are no other issues.

The best way to fix them is to break up the relationship between the two.
With this patchset, nvme_dev_disable is no longer invoked by nvme_timeout,
and the race between nvme_timeout and nvme_dev_disable on outstanding
requests is eliminated.

Changes V2->V3:
 - Keep the queue frozen for the reset case. If IO requests time out
   while in the RECONNECTING state, fail them and kill the controller.
   We really appreciate Keith's direction and advice.
 - Add 3 patches that fix bugs around namespaces_mutex and change it
   to an rwsem. This also removes the deadlock risk that
   namespaces_mutex would otherwise introduce below.
 - Other misc changes.

Changes V1->V2:
 - Free and disable PCI resources in nvme_pci_disable_ctrl_directly.
 - Change a comment and add a Reviewed-by tag in the 1st patch.
 - Reorder the patches.
 - Other misc changes.

There are 9 patches:

Patches 1-3 change namespaces_mutex to an rw_semaphore.
Patches 4, 6 and 7 are preparations for the 8th patch.
Patch 5 fixes a bug found during testing.
Patch 8 avoids having nvme_dev_disable invoked by nvme_timeout and
implements the synchronization between them. For more details, please
refer to the comment of that patch.
Patch 9 fixes a bug introduced by the 8th patch. It ensures
nvme_delete_io_queues can only be woken up by the completion path.

This patchset has been tested with a debug patch for several days,
and some bugfixes have been made.
The patches are available in the following git branch:
https://github.com/jianchwa/linux-blcok.git nvme_fixes_V3_plus_rwsem

Jianchao Wang (9):
0001-nvme-fix-the-dangerous-reference-of-namespaces-list.patch
0002-nvme-fix-the-deadlock-in-nvme_update_formats.patch
0003-nvme-change-namespaces_mutext-to-namespaces_rwsem.patch
0004-nvme-pci-quiesce-IO-queues-prior-to-disabling-device.patch
0005-nvme-pci-suspend-queues-based-on-online_queues.patch
0006-nvme-pci-drain-the-entered-requests-after-ctrl-is-sh.patch
0007-blk-mq-make-blk_mq_rq_update_aborted_gstate-a-extern.patch
0008-nvme-pci-break-up-nvme_timeout-and-nvme_dev_disable.patch
0009-nvme-pci-discard-wait-timeout-when-delete-cq-sq.patch

diffstat:
block/blk-mq.c                |   3 +-
drivers/nvme/host/core.c      |  88 +++++++++------
drivers/nvme/host/multipath.c |   4 +-
drivers/nvme/host/nvme.h      |   2 +-
drivers/nvme/host/pci.c       | 255 +++++++++++++++++++++++++++++++-----------
include/linux/blk-mq.h        |   1 +
6 files changed, 252 insertions(+), 101 deletions(-)

Thanks
Jianchao

* [PATCH 1/9] nvme: fix the dangerous reference of namespaces list
From: Jianchao Wang @ 2018-02-11  9:38 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

nvme_remove_namespaces and nvme_remove_invalid_namespaces reference
the ctrl->namespaces list without holding namespaces_mutex. It is OK
to invoke nvme_ns_remove there, but other code could be manipulating
the list at the same time.

To be safe, reference the ctrl->namespaces list under
namespaces_mutex.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/core.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index e810487..bc05bc4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3092,11 +3092,20 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 					unsigned nsid)
 {
 	struct nvme_ns *ns, *next;
+	LIST_HEAD(rm_list);
 
+	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
-		if (ns->head->ns_id > nsid)
-			nvme_ns_remove(ns);
+		if (ns->head->ns_id > nsid) {
+			list_del_init(&ns->list);
+			list_add_tail(&ns->list, &rm_list);
+		}
 	}
+	mutex_unlock(&ctrl->namespaces_mutex);
+
+	list_for_each_entry_safe(ns, next, &rm_list, list)
+		nvme_ns_remove(ns);
+
 }
 
 static int nvme_scan_ns_list(struct nvme_ctrl *ctrl, unsigned nn)
@@ -3196,6 +3205,7 @@ EXPORT_SYMBOL_GPL(nvme_queue_scan);
 void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns, *next;
+	LIST_HEAD(ns_list);
 
 	/*
 	 * The dead states indicates the controller was not gracefully
@@ -3206,7 +3216,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	if (ctrl->state == NVME_CTRL_DEAD)
 		nvme_kill_queues(ctrl);
 
-	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list)
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_splice_init(&ctrl->namespaces, &ns_list);
+	mutex_unlock(&ctrl->namespaces_mutex);
+
+	list_for_each_entry_safe(ns, next, &ns_list, list)
 		nvme_ns_remove(ns);
 }
 EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
-- 
2.7.4

* [PATCH 2/9] nvme: fix the deadlock in nvme_update_formats
From: Jianchao Wang @ 2018-02-11  9:38 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

nvme_update_formats will invoke nvme_ns_remove under namespaces_mutex.
This will cause a deadlock because nvme_ns_remove also acquires
namespaces_mutex. Fix it by collecting the ns entries that should be
removed while holding namespaces_mutex, and invoking nvme_ns_remove
outside of it.
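
In outline, the deadlock looks like this (an illustrative
simplification of the description above, not verbatim code):

	/* nvme_update_formats() */
	mutex_lock(&ctrl->namespaces_mutex);
	nvme_ns_remove(ns);
		/* nvme_ns_remove() acquires the same lock: */
		mutex_lock(&ctrl->namespaces_mutex);	/* deadlock */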

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/core.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bc05bc4..7425124 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1117,14 +1117,20 @@ static u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 
 static void nvme_update_formats(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ns *ns;
+	struct nvme_ns *ns, *next;
+	LIST_HEAD(rm_list);
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
-		if (ns->disk && nvme_revalidate_disk(ns->disk))
-			nvme_ns_remove(ns);
+		if (ns->disk && nvme_revalidate_disk(ns->disk)) {
+			list_del_init(&ns->list);
+			list_add_tail(&ns->list, &rm_list);
+		}
 	}
 	mutex_unlock(&ctrl->namespaces_mutex);
+
+	list_for_each_entry_safe(ns, next, &rm_list, list)
+		nvme_ns_remove(ns);
 }
 
 static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
-- 
2.7.4

* [PATCH 3/9] nvme: change namespaces_mutext to namespaces_rwsem
From: Jianchao Wang @ 2018-02-11  9:38 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

namespaces_mutex is used to synchronize operations on the ctrl
namespaces list. Most of these operations are reads, so it is
better to change it from a mutex to an rwsem.

Moreover, the namespaces mutex could easily introduce a circular
dependency.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/core.c      | 64 +++++++++++++++++++++----------------------
 drivers/nvme/host/multipath.c |  4 +--
 drivers/nvme/host/nvme.h      |  2 +-
 3 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7425124..176ba9e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1120,14 +1120,14 @@ static void nvme_update_formats(struct nvme_ctrl *ctrl)
 	struct nvme_ns *ns, *next;
 	LIST_HEAD(rm_list);
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_write(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		if (ns->disk && nvme_revalidate_disk(ns->disk)) {
 			list_del_init(&ns->list);
 			list_add_tail(&ns->list, &rm_list);
 		}
 	}
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_write(&ctrl->namespaces_rwsem);
 
 	list_for_each_entry_safe(ns, next, &rm_list, list)
 		nvme_ns_remove(ns);
@@ -2437,7 +2437,7 @@ static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp)
 	struct nvme_ns *ns;
 	int ret;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 	if (list_empty(&ctrl->namespaces)) {
 		ret = -ENOTTY;
 		goto out_unlock;
@@ -2454,14 +2454,14 @@ static int nvme_dev_user_cmd(struct nvme_ctrl *ctrl, void __user *argp)
 	dev_warn(ctrl->device,
 		"using deprecated NVME_IOCTL_IO_CMD ioctl on the char device!\n");
 	kref_get(&ns->kref);
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 
 	ret = nvme_user_cmd(ctrl, ns, argp);
 	nvme_put_ns(ns);
 	return ret;
 
 out_unlock:
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 	return ret;
 }
 
@@ -2894,7 +2894,7 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 {
 	struct nvme_ns *ns, *ret = NULL;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		if (ns->head->ns_id == nsid) {
 			if (!kref_get_unless_zero(&ns->kref))
@@ -2905,7 +2905,7 @@ static struct nvme_ns *nvme_find_get_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		if (ns->head->ns_id > nsid)
 			break;
 	}
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 	return ret;
 }
 
@@ -3017,9 +3017,9 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	__nvme_revalidate_disk(disk, id);
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_write(&ctrl->namespaces_rwsem);
 	list_add_tail(&ns->list, &ctrl->namespaces);
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_write(&ctrl->namespaces_rwsem);
 
 	nvme_get_ctrl(ctrl);
 
@@ -3072,9 +3072,9 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	list_del_rcu(&ns->siblings);
 	mutex_unlock(&ns->ctrl->subsys->lock);
 
-	mutex_lock(&ns->ctrl->namespaces_mutex);
+	down_write(&ns->ctrl->namespaces_rwsem);
 	list_del_init(&ns->list);
-	mutex_unlock(&ns->ctrl->namespaces_mutex);
+	up_write(&ns->ctrl->namespaces_rwsem);
 
 	synchronize_srcu(&ns->head->srcu);
 	nvme_mpath_check_last_path(ns);
@@ -3100,14 +3100,14 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
 	struct nvme_ns *ns, *next;
 	LIST_HEAD(rm_list);
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_write(&ctrl->namespaces_rwsem);
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
 		if (ns->head->ns_id > nsid) {
 			list_del_init(&ns->list);
 			list_add_tail(&ns->list, &rm_list);
 		}
 	}
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_write(&ctrl->namespaces_rwsem);
 
 	list_for_each_entry_safe(ns, next, &rm_list, list)
 		nvme_ns_remove(ns);
@@ -3187,9 +3187,9 @@ static void nvme_scan_work(struct work_struct *work)
 	}
 	nvme_scan_ns_sequential(ctrl, nn);
  done:
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_write(&ctrl->namespaces_rwsem);
 	list_sort(NULL, &ctrl->namespaces, ns_cmp);
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_write(&ctrl->namespaces_rwsem);
 	kfree(id);
 }
 
@@ -3222,9 +3222,9 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	if (ctrl->state == NVME_CTRL_DEAD)
 		nvme_kill_queues(ctrl);
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_write(&ctrl->namespaces_rwsem);
 	list_splice_init(&ctrl->namespaces, &ns_list);
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_write(&ctrl->namespaces_rwsem);
 
 	list_for_each_entry_safe(ns, next, &ns_list, list)
 		nvme_ns_remove(ns);
@@ -3413,7 +3413,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	ctrl->state = NVME_CTRL_NEW;
 	spin_lock_init(&ctrl->lock);
 	INIT_LIST_HEAD(&ctrl->namespaces);
-	mutex_init(&ctrl->namespaces_mutex);
+	init_rwsem(&ctrl->namespaces_rwsem);
 	ctrl->dev = dev;
 	ctrl->ops = ops;
 	ctrl->quirks = quirks;
@@ -3474,7 +3474,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 
 	/* Forcibly unquiesce queues to avoid blocking dispatch */
 	if (ctrl->admin_q)
@@ -3493,7 +3493,7 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
 		/* Forcibly unquiesce queues to avoid blocking dispatch */
 		blk_mq_unquiesce_queue(ns->queue);
 	}
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_kill_queues);
 
@@ -3501,10 +3501,10 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
 		blk_mq_unfreeze_queue(ns->queue);
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
 
@@ -3512,13 +3512,13 @@ void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout)
 {
 	struct nvme_ns *ns;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		timeout = blk_mq_freeze_queue_wait_timeout(ns->queue, timeout);
 		if (timeout <= 0)
 			break;
 	}
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze_timeout);
 
@@ -3526,10 +3526,10 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
 		blk_mq_freeze_queue_wait(ns->queue);
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze);
 
@@ -3537,10 +3537,10 @@ void nvme_start_freeze(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
 		blk_freeze_queue_start(ns->queue);
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
@@ -3548,10 +3548,10 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
 		blk_mq_quiesce_queue(ns->queue);
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
@@ -3559,10 +3559,10 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
 		blk_mq_unquiesce_queue(ns->queue);
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 3b211d9..559b601 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -44,12 +44,12 @@ void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl)
 {
 	struct nvme_ns *ns;
 
-	mutex_lock(&ctrl->namespaces_mutex);
+	down_read(&ctrl->namespaces_rwsem);
 	list_for_each_entry(ns, &ctrl->namespaces, list) {
 		if (ns->head->disk)
 			kblockd_schedule_work(&ns->head->requeue_work);
 	}
-	mutex_unlock(&ctrl->namespaces_mutex);
+	up_read(&ctrl->namespaces_rwsem);
 }
 
 static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 8e4550f..4191650 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -140,7 +140,7 @@ struct nvme_ctrl {
 	struct blk_mq_tag_set *tagset;
 	struct blk_mq_tag_set *admin_tagset;
 	struct list_head namespaces;
-	struct mutex namespaces_mutex;
+	struct rw_semaphore namespaces_rwsem;
 	struct device ctrl_device;
 	struct device *device;	/* char device */
 	struct cdev cdev;
-- 
2.7.4

* [PATCH 4/9] nvme-pci: quiesce IO queues prior to disabling device HMB accesses
From: Jianchao Wang @ 2018-02-11  9:38 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Quiesce IO queues prior to disabling device HMB accesses. A controller
using an HMB may rely on it to efficiently complete IO commands.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6fe7af0..00cffed 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2186,7 +2186,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	if (!dead) {
 		if (shutdown)
 			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
+	}
+	nvme_stop_queues(&dev->ctrl);
 
+	if (!dead) {
 		/*
 		 * If the controller is still alive tell it to stop using the
 		 * host memory buffer.  In theory the shutdown / reset should
@@ -2195,11 +2198,6 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		 */
 		if (dev->host_mem_descs)
 			nvme_set_host_mem(dev, 0);
-
-	}
-	nvme_stop_queues(&dev->ctrl);
-
-	if (!dead) {
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-- 
2.7.4

* [PATCH 5/9] nvme-pci: suspend queues based on online_queues
From: Jianchao Wang @ 2018-02-11  9:38 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

The nvme cq irq is freed based on queue_count. When sq/cq creation
fails, the irq will not have been set up, and free_irq will warn
'Trying to free already-free IRQ'.

To fix this, only increase online_queues when the adminq/sq/cq has
been created and its associated irq has been set up, and then suspend
queues based on online_queues.
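
In outline, the warning path looks like this (an illustrative trace of
the description above, not a verbatim log; adapter_alloc_cq/sq are the
cq/sq creation helpers in pci.c):

	nvme_create_queue(nvmeq, qid)
	    adapter_alloc_cq()          /* succeeds */
	    adapter_alloc_sq()          /* fails: queue_request_irq() is
	                                   never reached for this queue */
	...
	nvme_dev_disable()
	    /* iterated based on queue_count, not online_queues */
	    nvme_suspend_queue(&dev->queues[i])
	        pci_free_irq()          /* WARN: Trying to free already-free IRQ */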

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 00cffed..c5c1365 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1315,9 +1315,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
 
-	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
-		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
-
 	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
 
 	return 0;
@@ -1461,13 +1458,14 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
-		goto release_sq;
+		goto offline;
 
 	return result;
 
- release_sq:
+offline:
+	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
 	return result;
 }
@@ -1607,6 +1605,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 	result = queue_request_irq(nvmeq);
 	if (result) {
 		nvmeq->cq_vector = -1;
+		dev->online_queues--;
 		return result;
 	}
 
@@ -1954,6 +1953,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	result = queue_request_irq(adminq);
 	if (result) {
 		adminq->cq_vector = -1;
+		dev->online_queues--;
 		return result;
 	}
 	return nvme_create_io_queues(dev);
@@ -2167,6 +2167,7 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	int onlines;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
@@ -2175,8 +2176,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		if (dev->ctrl.state == NVME_CTRL_LIVE ||
 		    dev->ctrl.state == NVME_CTRL_RESETTING)
 			nvme_start_freeze(&dev->ctrl);
-		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
-			pdev->error_state  != pci_channel_io_normal);
+
+		dead = !!((csts & NVME_CSTS_CFS) ||
+				!(csts & NVME_CSTS_RDY) ||
+				(pdev->error_state  != pci_channel_io_normal) ||
+				(dev->online_queues == 0));
 	}
 
 	/*
@@ -2201,9 +2205,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_io_queues(dev);
 		nvme_disable_admin_queue(dev, shutdown);
 	}
-	for (i = dev->ctrl.queue_count - 1; i >= 0; i--)
+
+	onlines = dev->online_queues;
+	for (i = onlines - 1; i >= 0; i--)
 		nvme_suspend_queue(&dev->queues[i]);
 
+	if (dev->ctrl.admin_q)
+		blk_mq_quiesce_queue(dev->ctrl.admin_q);
+
 	nvme_pci_disable(dev);
 
 	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
@@ -2339,16 +2348,18 @@ static void nvme_reset_work(struct work_struct *work)
 	if (result)
 		goto out;
 
-	/*
-	 * Keep the controller around but remove all namespaces if we don't have
-	 * any working I/O queue.
-	 */
-	if (dev->online_queues < 2) {
+
+	/* If online_queues is zero, we have already gone to the out label */
+	if (dev->online_queues == 1) {
+		/*
+		 * Keep the controller around but remove all namespaces if we
+		 * don't have any working I/O queue.
+		 */
 		dev_warn(dev->ctrl.device, "IO queues not created\n");
 		nvme_kill_queues(&dev->ctrl);
 		nvme_remove_namespaces(&dev->ctrl);
 		new_state = NVME_CTRL_ADMIN_ONLY;
-	} else {
+	} else if (dev->online_queues > 1) {
 		nvme_start_queues(&dev->ctrl);
 		nvme_wait_freeze(&dev->ctrl);
 		/* hit this only when allocate tagset fails */
-- 
2.7.4

* [PATCH 6/9] nvme-pci: drain the entered requests after ctrl is shutdown
From: Jianchao Wang @ 2018-02-11  9:38 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, we unquiesce the queues after the controller is shut down
to avoid residual requests getting stuck. In fact, we can do this more
cleanly: just wait for the freeze and drain them before nvme_dev_disable
returns.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c5c1365..f3e0eae 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2187,10 +2187,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	 * Give the controller a chance to complete all entered requests if
 	 * doing a safe shutdown.
 	 */
-	if (!dead) {
-		if (shutdown)
-			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
-	}
+	if (!dead && shutdown)
+		nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
+
 	nvme_stop_queues(&dev->ctrl);
 
 	if (!dead) {
@@ -2219,12 +2218,15 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
 
 	/*
-	 * The driver will not be starting up queues again if shutting down so
-	 * must flush all entered requests to their failed completion to avoid
-	 * deadlocking blk-mq hot-cpu notifier.
+	 * For the shutdown case, the controller will not be set up again
+	 * soon. If any residual requests remain here, the controller must
+	 * have gone wrong. Drain and fail all the residual entered IO requests.
 	 */
-	if (shutdown)
+	if (shutdown) {
 		nvme_start_queues(&dev->ctrl);
+		nvme_wait_freeze(&dev->ctrl);
+		nvme_stop_queues(&dev->ctrl);
+	}
 	mutex_unlock(&dev->shutdown_lock);
 }
 
-- 
2.7.4

* [PATCH 7/9] blk-mq: make blk_mq_rq_update_aborted_gstate an external interface
From: Jianchao Wang @ 2018-02-11  9:38 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

No functional change; just make blk_mq_rq_update_aborted_gstate an
external interface.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 block/blk-mq.c         | 3 ++-
 include/linux/blk-mq.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 01f271d..a027ca2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -581,7 +581,7 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
+void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 {
 	unsigned long flags;
 
@@ -597,6 +597,7 @@ static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
 	u64_stats_update_end(&rq->aborted_gstate_sync);
 	local_irq_restore(flags);
 }
+EXPORT_SYMBOL(blk_mq_rq_update_aborted_gstate);
 
 static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49..ad54024 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -257,6 +257,7 @@ void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 void blk_mq_kick_requeue_list(struct request_queue *q);
 void blk_mq_delay_kick_requeue_list(struct request_queue *q, unsigned long msecs);
 void blk_mq_complete_request(struct request *rq);
+void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate);
 
 bool blk_mq_queue_stopped(struct request_queue *q);
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx);
-- 
2.7.4

* [PATCH 8/9] nvme-pci: break up nvme_timeout and nvme_dev_disable
From: Jianchao Wang @ 2018-02-11  9:38 UTC
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, the complicated relationship between nvme_dev_disable
and nvme_timeout has become a source of trouble: it introduces
circular patterns that may trigger deadlocks or IO hangs. Let's
enumerate the tangles between them:
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller from doing DMA accesses before freeing the request.
 - nvme_dev_disable has to depend on nvme_timeout to complete
   adminq requests that set the HMB or delete sq/cq when the
   controller has no response.
 - nvme_dev_disable will race with nvme_timeout when it cancels the
   previous outstanding requests.
The 2nd case is necessary; this patch breaks up the 1st and 3rd cases.

To achieve this:
Use blk_abort_request to force all the previous outstanding requests
to expire in nvme_dev_disable. Set NVME_REQ_CANCELLED and return
BLK_EH_NOT_HANDLED in nvme_timeout. Then the request will be neither
completed nor freed, and we need not invoke nvme_dev_disable any more.

We use NVME_REQ_CANCELLED to identify these requests. After the
controller is totally disabled/shutdown, we invoke
blk_mq_rq_update_aborted_gstate to clear the requests and invoke
blk_mq_complete_request to complete them. In addition, to distinguish
the previous adminq/IOq requests from the adminq requests issued by
nvme_dev_disable itself, we introduce NVME_PCI_OUTSTANDING_GRABBING
and NVME_PCI_OUTSTANDING_GRABBED, so that nvme_timeout can tell them
apart (see the sketch below).
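
As a rough sketch (our illustration of the sequence described above,
not the verbatim patch; nvme_pci_abort_req and nvme_pci_grab_outstanding
are hypothetical helper names):

	static void nvme_pci_abort_req(struct request *req, void *data,
			bool reserved)
	{
		if (blk_mq_request_started(req))
			blk_abort_request(req);	/* force it to expire now */
	}

	static void nvme_pci_grab_outstanding(struct nvme_dev *dev)
	{
		WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBING);
		blk_mq_tagset_busy_iter(&dev->tagset,
					nvme_pci_abort_req, &dev->ctrl);
		blk_mq_tagset_busy_iter(&dev->admin_tagset,
					nvme_pci_abort_req, &dev->ctrl);
		/*
		 * nvme_timeout now marks these requests NVME_REQ_CANCELLED
		 * and returns BLK_EH_NOT_HANDLED, leaving them to us.
		 */
		WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBED);
	}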

For the adminq requests issued by nvme_dev_disable/nvme_reset_work:
invoke nvme_disable_ctrl directly, then set NVME_REQ_CANCELLED and
return BLK_EH_HANDLED. nvme_dev_disable/nvme_reset_work will
see the error.

With this patch, we avoid having nvme_dev_disable invoked by
nvme_timeout and implement the synchronization between them.

Suggested-by: Keith Busch <keith.busch@intel.com>
(If IO requests time out in the RECONNECTING state, fail them and kill
 the controller)
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 192 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 157 insertions(+), 35 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f3e0eae..a0ff18e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -70,6 +70,9 @@ struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+static void nvme_pci_free_and_disable(struct nvme_dev *dev);
+#define NVME_PCI_OUTSTANDING_GRABBING 1
+#define NVME_PCI_OUTSTANDING_GRABBED 2
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -80,6 +83,7 @@ struct nvme_dev {
 	struct blk_mq_tag_set admin_tagset;
 	u32 __iomem *dbs;
 	struct device *dev;
+	int grab_flag;
 	struct dma_pool *prp_page_pool;
 	struct dma_pool *prp_small_pool;
 	unsigned online_queues;
@@ -1130,6 +1134,24 @@ static void abort_endio(struct request *req, blk_status_t error)
 	blk_mq_free_request(req);
 }
 
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	u32 csts;
+	bool dead;
+
+	if (!pci_is_enabled(pdev))
+		return;
+
+	csts = readl(dev->bar + NVME_REG_CSTS);
+	dead = !!((csts & NVME_CSTS_CFS) ||
+			!(csts & NVME_CSTS_RDY) ||
+			pdev->error_state  != pci_channel_io_normal);
+	if (!dead)
+		nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
+	nvme_pci_free_and_disable(dev);
+}
+
 static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
 {
 
@@ -1191,12 +1213,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 	/*
 	 * Reset immediately if the controller is failed
+	 * nvme_dev_disable will take over the expired requests.
 	 */
 	if (nvme_should_reset(dev, csts)) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	/*
@@ -1210,38 +1233,59 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	}
 
 	/*
-	 * Shutdown immediately if controller times out while starting. The
-	 * reset work will see the pci device disabled when it gets the forced
-	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_HANDLED.
+	 * The previous outstanding requests on adminq and ioq have been
+	 * grabbed by nvme_dev_disable, so this must be an admin request
+	 * from nvme_dev_disable itself.
 	 */
-	if (dev->ctrl.state == NVME_CTRL_RESETTING) {
-		dev_warn(dev->ctrl.device,
-			 "I/O %d QID %d timeout, disable controller\n",
-			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
+	if ((READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBED)) {
+		nvme_pci_disable_ctrl_directly(dev);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_HANDLED;
 	}
 
 	/*
- 	 * Shutdown the controller immediately and schedule a reset if the
- 	 * command was already aborted once before and still hasn't been
- 	 * returned to the driver, or if this is the admin queue.
+	 * nvme_dev_disable is grabbing all the outstanding requests.
+	 * Just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED;
+	 * nvme_dev_disable will take care of everything.
+	 */
+	if (READ_ONCE(dev->grab_flag) == NVME_PCI_OUTSTANDING_GRABBING) {
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_NOT_HANDLED;
+	}
+
+	/*
+	 * IO requests could time out during the RECONNECTING state due to
+	 * the wait for freeze in nvme_reset_work, and adminq requests could
+	 * time out as well. Don't try to requeue them; change the state to
+	 * DELETING to prevent the reset work from moving forward.
+	 */
+	if (dev->ctrl.state == NVME_CTRL_RECONNECTING) {
+		nvme_pci_disable_ctrl_directly(dev);
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		nvme_req(req)->status |= NVME_SC_DNR;
+		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
+		return BLK_EH_HANDLED;
+	}
+	/*
+	 * If the controller is DELETING, no recovery is needed.
+	 */
+	if (dev->ctrl.state == NVME_CTRL_DELETING) {
+		nvme_pci_disable_ctrl_directly(dev);
+		nvme_req(req)->status = NVME_SC_ABORT_REQ;
+		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+		return BLK_EH_HANDLED;
+	}
+	/*
+	 * Just set NVME_REQ_CANCELLED and return BLK_EH_NOT_HANDLED;
+	 * nvme_dev_disable will take care of everything.
 	 */
 	if (!nvmeq->qid || iod->aborted) {
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, reset controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-
-		/*
-		 * Mark the request as handled, since the inline shutdown
-		 * forces all outstanding requests to complete.
-		 */
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_HANDLED;
+		return BLK_EH_NOT_HANDLED;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -2149,25 +2193,97 @@ static void nvme_dev_unmap(struct nvme_dev *dev)
 	pci_release_mem_regions(to_pci_dev(dev->dev));
 }
 
-static void nvme_pci_disable(struct nvme_dev *dev)
+static void nvme_pci_abort_rq(struct request *req, void *data, bool reserved)
 {
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Abort I/O %d", req->tag);
+
+	nvme_req(req)->status = NVME_SC_ABORT_REQ;
+	blk_abort_request(req);
+}
+
+/*
+ * nvme_pci_grab_outstanding and shutdown_lock ensure that only one
+ * nvme_pci_free_and_disable can run at a time.
+ */
+static void nvme_pci_free_and_disable(struct nvme_dev *dev)
+{
+	int onlines, i;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
+	if (!pci_is_enabled(pdev))
+		return;
+
+	onlines = dev->online_queues;
+	for (i = onlines - 1; i >= 0; i--)
+		nvme_suspend_queue(&dev->queues[i]);
+
 	nvme_release_cmb(dev);
 	pci_free_irq_vectors(pdev);
 
-	if (pci_is_enabled(pdev)) {
-		pci_disable_pcie_error_reporting(pdev);
-		pci_disable_device(pdev);
-	}
+	pci_disable_pcie_error_reporting(pdev);
+	pci_disable_device(pdev);
+}
+
+/*
+ * Now that the controller has been disabled, no interrupt can race
+ * with us, so it is safe to complete the requests:
+ *  - IO requests will be requeued.
+ *  - adminq requests will be failed, the tags will also be freed.
+ */
+static void nvme_pci_cancel_rq(struct request *req, void *data, bool reserved)
+{
+	if (!blk_mq_request_started(req))
+		return;
+
+	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
+				"Cancel I/O %d", req->tag);
+
+	/* The controller has been disabled/shut down, so it is safe here
+	 * to clear the aborted_gstate and complete the request.
+	 */
+	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
+		blk_mq_rq_update_aborted_gstate(req, 0);
+	blk_mq_complete_request(req);
+}
+
+
+static void nvme_pci_sync_timeout_work(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	/* ensure the timeout_work is queued */
+	kblockd_schedule_work(&ctrl->admin_q->timeout_work);
+	flush_work(&ctrl->admin_q->timeout_work);
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		kblockd_schedule_work(&ns->queue->timeout_work);
+
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		flush_work(&ns->queue->timeout_work);
+	up_read(&ctrl->namespaces_rwsem);
+}
+
+static void nvme_pci_grab_outstanding(struct nvme_dev *dev)
+{
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_abort_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_abort_rq, &dev->ctrl);
+	nvme_pci_sync_timeout_work(&dev->ctrl);
+	/* All the outstanding requests have been grabbed and forced to
+	 * expire; neither the irq completion path nor the timeout path
+	 * can take them away.
+	 */
 }
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
-	int i;
 	bool dead = true;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int onlines;
 
 	mutex_lock(&dev->shutdown_lock);
 	if (pci_is_enabled(pdev)) {
@@ -2192,6 +2308,14 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 
 	nvme_stop_queues(&dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBING);
+	nvme_pci_grab_outstanding(dev);
+	WRITE_ONCE(dev->grab_flag, NVME_PCI_OUTSTANDING_GRABBED);
+	/*
+	 * All the previous outstanding requests have been grabbed and
+	 * nvme_timeout path is guaranteed to be drained.
+	 */
+
 	if (!dead) {
 		/*
 		 * If the controller is still alive tell it to stop using the
@@ -2205,18 +2329,16 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		nvme_disable_admin_queue(dev, shutdown);
 	}
 
-	onlines = dev->online_queues;
-	for (i = onlines - 1; i >= 0; i--)
-		nvme_suspend_queue(&dev->queues[i]);
-
 	if (dev->ctrl.admin_q)
 		blk_mq_quiesce_queue(dev->ctrl.admin_q);
 
-	nvme_pci_disable(dev);
+	nvme_pci_free_and_disable(dev);
 
-	blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
-	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->tagset, nvme_pci_cancel_rq, &dev->ctrl);
+	blk_mq_tagset_busy_iter(&dev->admin_tagset,
+			nvme_pci_cancel_rq, &dev->ctrl);
 
+	WRITE_ONCE(dev->grab_flag, 0);
 	/*
 	 * For shutdown case, controller will not be setup again soon. If any
 	 * residual requests here, the controller must have gone wrong. Drain and
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 9/9] nvme-pci: discard wait timeout when delete cq/sq
  2018-02-11  9:38 ` Jianchao Wang
@ 2018-02-11  9:38   ` Jianchao Wang
  -1 siblings, 0 replies; 44+ messages in thread
From: Jianchao Wang @ 2018-02-11  9:38 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi; +Cc: linux-nvme, linux-kernel

Currently, nvme_disable_io_queues can be woken up by both the request
completion path and the wait timeout path. This is unnecessary and
could introduce a race between nvme_dev_disable and the request
timeout path: when a delete cq/sq command expires,
nvme_disable_io_queues will also be woken up and return to
nvme_dev_disable, which then handles the outstanding requests and
races with the request timeout path.

To fix it, use wait_for_completion instead of the timed-out variant.
The request timeout path will wake it up.
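
For reference, the wakeup comes from the delete-queue end_io callback,
roughly like this (a sketch of the driver code at this point in the
series, shown for context):

	static void nvme_del_queue_end(struct request *req, blk_status_t error)
	{
		struct nvme_queue *nvmeq = req->end_io_data;

		blk_mq_free_request(req);
		/* wakes the wait_for_completion() in nvme_disable_io_queues */
		complete(&nvmeq->dev->ioq_wait);
	}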

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
---
 drivers/nvme/host/pci.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a0ff18e..28f077b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2058,7 +2058,6 @@ static int nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)
 static void nvme_disable_io_queues(struct nvme_dev *dev)
 {
 	int pass, queues = dev->online_queues - 1;
-	unsigned long timeout;
 	u8 opcode = nvme_admin_delete_sq;
 
 	for (pass = 0; pass < 2; pass++) {
@@ -2066,15 +2065,12 @@ static void nvme_disable_io_queues(struct nvme_dev *dev)
 
 		reinit_completion(&dev->ioq_wait);
  retry:
-		timeout = ADMIN_TIMEOUT;
 		for (; i > 0; i--, sent++)
 			if (nvme_delete_queue(&dev->queues[i], opcode))
 				break;
 
 		while (sent--) {
-			timeout = wait_for_completion_io_timeout(&dev->ioq_wait, timeout);
-			if (timeout == 0)
-				return;
+			wait_for_completion(&dev->ioq_wait);
 			if (i)
 				goto retry;
 		}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/9] nvme: fix the dangerous reference of namespaces list
  2018-02-11  9:38   ` Jianchao Wang
@ 2018-02-11 11:14     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2018-02-11 11:14 UTC (permalink / raw)
  To: Jianchao Wang, keith.busch, axboe, hch; +Cc: linux-nvme, linux-kernel

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/9] nvme: fix the dangerous reference of namespaces list
  2018-02-11  9:38   ` Jianchao Wang
@ 2018-02-11 11:15     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2018-02-11 11:15 UTC (permalink / raw)
  To: Jianchao Wang, keith.busch, axboe, hch; +Cc: linux-nvme, linux-kernel

> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index e810487..bc05bc4 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3092,11 +3092,20 @@ static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
>   					unsigned nsid)
>   {
>   	struct nvme_ns *ns, *next;
> +	LIST_HEAD(rm_list);
>   
> +	mutex_lock(&ctrl->namespaces_mutex);
>   	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
> -		if (ns->head->ns_id > nsid)
> -			nvme_ns_remove(ns);
> +		if (ns->head->ns_id > nsid) {
> +			list_del_init(&ns->list);
> +			list_add_tail(&ns->list, &rm_list);

list_move_tail?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/9] nvme: fix the deadlock in nvme_update_formats
  2018-02-11  9:38   ` Jianchao Wang
@ 2018-02-11 11:16     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2018-02-11 11:16 UTC (permalink / raw)
  To: Jianchao Wang, keith.busch, axboe, hch; +Cc: linux-nvme, linux-kernel

>   	mutex_lock(&ctrl->namespaces_mutex);
>   	list_for_each_entry(ns, &ctrl->namespaces, list) {
> -		if (ns->disk && nvme_revalidate_disk(ns->disk))
> -			nvme_ns_remove(ns);
> +		if (ns->disk && nvme_revalidate_disk(ns->disk)) {
> +			list_del_init(&ns->list);
> +			list_add_tail(&ns->list, &rm_list);

list_move_tail?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/9] nvme: change namespaces_mutext to namespaces_rwsem
  2018-02-11  9:38   ` Jianchao Wang
@ 2018-02-11 11:17     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2018-02-11 11:17 UTC (permalink / raw)
  To: Jianchao Wang, keith.busch, axboe, hch; +Cc: linux-nvme, linux-kernel

> namespaces_mutext is used to synchronize the operations on ctrl
> namespaces list. Most of the time, it is a read operation. It is
> better to change it from mutex to rwsem.
> 
> On the other hand, the namespaces mutex could introduce circular
> dependency easily.

On the other hand of what?

Also, can you give an example of such?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/9] nvme-pci: quiesce IO queues prior to disabling device HMB accesses
  2018-02-11  9:38   ` Jianchao Wang
@ 2018-02-11 11:19     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2018-02-11 11:19 UTC (permalink / raw)
  To: Jianchao Wang, keith.busch, axboe, hch; +Cc: linux-nvme, linux-kernel


> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6fe7af0..00cffed 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2186,7 +2186,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>   	if (!dead) {
>   		if (shutdown)
>   			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
> +	}
> +	nvme_stop_queues(&dev->ctrl);

Nit, extra newline before nvme_stop_queues would be nice.

Other than that, looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 9/9] nvme-pci: discard wait timeout when delete cq/sq
  2018-02-11  9:38   ` Jianchao Wang
@ 2018-02-11 11:24     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2018-02-11 11:24 UTC (permalink / raw)
  To: Jianchao Wang, keith.busch, axboe, hch; +Cc: linux-nvme, linux-kernel

This looks ok to me, unless we have some quirks that need it?

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 8/9] nvme-pci: break up nvme_timeout and nvme_dev_disable
  2018-02-11  9:38   ` Jianchao Wang
@ 2018-02-11 11:36     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2018-02-11 11:36 UTC (permalink / raw)
  To: Jianchao Wang, keith.busch, axboe, hch; +Cc: linux-nvme, linux-kernel

Jianchao,

> Currently, the complicated relationship between nvme_dev_disable
> and nvme_timeout introduces circular patterns that may trigger
> deadlocks or IO hangs. Let's enumerate the tangles between them:
>   - nvme_timeout has to invoke nvme_dev_disable to stop the
>     controller from doing DMA accesses before freeing the request.
>   - nvme_dev_disable has to depend on nvme_timeout to complete
>     adminq requests that set the HMB or delete sq/cq when the
>     controller has no response.
>   - nvme_dev_disable races with nvme_timeout when it cancels the
>     previous outstanding requests.
> The 2nd case is necessary. This patch breaks up the 1st and 3rd cases.
> 
> To achieve this:
> Use blk_abort_request to force all the previous outstanding requests
> to expire in nvme_dev_disable. Set NVME_REQ_CANCELLED and return
> BLK_EH_NOT_HANDLED in nvme_timeout so that the request will not be
> completed and freed there. We no longer need to invoke
> nvme_dev_disable from the timeout path.
> 
> We use NVME_REQ_CANCELLED to identify them. After the controller is
> totally disabled/shut down, we invoke blk_mq_rq_update_aborted_gstate
> to clear the aborted state of the requests and invoke
> blk_mq_complete_request to complete them. In addition, to distinguish
> the previously outstanding adminq/IOq requests from the adminq
> requests issued by nvme_dev_disable itself, we introduce
> NVME_PCI_OUTSTANDING_GRABBING and NVME_PCI_OUTSTANDING_GRABBED so
> that nvme_timeout can tell them apart.
> 
> For the adminq requests issued by nvme_dev_disable/nvme_reset_work:
> invoke nvme_disable_ctrl directly, then set NVME_REQ_CANCELLED and
> return BLK_EH_HANDLED, so nvme_dev_disable/nvme_reset_work will
> see the error.

I think this is going in the wrong direction. Every state that is needed
to handle serialization should be done in core ctrl state. Moreover,
please try to avoid handling this locally in nvme-pci, place common
helpers in nvme-core and use them. I won't be surprised if fc would
need some of these.

Also, can we try and not disable the controller from nvme_timeout? I'm
not sure I understand why this is needed at all. What's wrong with
scheduling a controller reset/delete? Why is something like
nvme_pci_disable_ctrl_directly needed?

I'd like to see an effort to consolidate error handling paths rather
than enhancing the current skew...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/9] nvme: fix the deadlock in nvme_update_formats
  2018-02-11 11:16     ` Sagi Grimberg
@ 2018-02-12  1:40       ` jianchao.wang
  -1 siblings, 0 replies; 44+ messages in thread
From: jianchao.wang @ 2018-02-12  1:40 UTC (permalink / raw)
  To: Sagi Grimberg, keith.busch, axboe, hch; +Cc: linux-kernel, linux-nvme

Hi Sagi

Thanks for your kind response and guidance.
That's really appreciated.

On 02/11/2018 07:16 PM, Sagi Grimberg wrote:
>>       mutex_lock(&ctrl->namespaces_mutex);
>>       list_for_each_entry(ns, &ctrl->namespaces, list) {
>> -        if (ns->disk && nvme_revalidate_disk(ns->disk))
>> -            nvme_ns_remove(ns);
>> +        if (ns->disk && nvme_revalidate_disk(ns->disk)) {
>> +            list_del_init(&ns->list);
>> +            list_add_tail(&ns->list, &rm_list);
> 
> list_move_tail?

Yes, I will use this interface in the next version, as well as in the 1st patch.

Sincerely
Jianchao


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 8/9] nvme-pci: break up nvme_timeout and nvme_dev_disable
  2018-02-11 11:36     ` Sagi Grimberg
@ 2018-02-12  2:16       ` jianchao.wang
  -1 siblings, 0 replies; 44+ messages in thread
From: jianchao.wang @ 2018-02-12  2:16 UTC (permalink / raw)
  To: Sagi Grimberg, keith.busch, axboe, hch; +Cc: linux-kernel, linux-nvme

Hi Sagi

Thanks for your kind reminder and guidance.
That's really appreciated.

On 02/11/2018 07:36 PM, Sagi Grimberg wrote:
> Jianchao,
> 
>> Currently, the complicated relationship between nvme_dev_disable
>> and nvme_timeout introduces circular patterns that may trigger
>> deadlocks or IO hangs. Let's enumerate the tangles between them:
>>   - nvme_timeout has to invoke nvme_dev_disable to stop the
>>     controller from doing DMA accesses before freeing the request.
>>   - nvme_dev_disable has to depend on nvme_timeout to complete
>>     adminq requests that set the HMB or delete sq/cq when the
>>     controller has no response.
>>   - nvme_dev_disable races with nvme_timeout when it cancels the
>>     previous outstanding requests.
>> The 2nd case is necessary. This patch breaks up the 1st and 3rd cases.
>>
>> To achieve this:
>> Use blk_abort_request to force all the previous outstanding requests
>> to expire in nvme_dev_disable. Set NVME_REQ_CANCELLED and return
>> BLK_EH_NOT_HANDLED in nvme_timeout so that the request will not be
>> completed and freed there. We no longer need to invoke
>> nvme_dev_disable from the timeout path.
>>
>> We use NVME_REQ_CANCELLED to identify them. After the controller is
>> totally disabled/shut down, we invoke blk_mq_rq_update_aborted_gstate
>> to clear the aborted state of the requests and invoke
>> blk_mq_complete_request to complete them. In addition, to distinguish
>> the previously outstanding adminq/IOq requests from the adminq
>> requests issued by nvme_dev_disable itself, we introduce
>> NVME_PCI_OUTSTANDING_GRABBING and NVME_PCI_OUTSTANDING_GRABBED so
>> that nvme_timeout can tell them apart.
>>
>> For the adminq requests issued by nvme_dev_disable/nvme_reset_work:
>> invoke nvme_disable_ctrl directly, then set NVME_REQ_CANCELLED and
>> return BLK_EH_HANDLED, so nvme_dev_disable/nvme_reset_work will
>> see the error.
> 
> I think this is going in the wrong direction. Every state that is needed
> to handle serialization should be done in core ctrl state. Moreover,
> please try to avoid handling this locally in nvme-pci, place common
> helpers in nvme-core and use them. I won't be surprised if fc would
> need some of these.
> 
> Also, can we try and not disable the controller from nvme_timeout?

In fact, that is what this patch wants to do. For the previous outstanding
requests, it returns BLK_EH_NOT_HANDLED and defers the work to nvme_dev_disable.

> I'm
> not sure I understand why this is needed at all. What's wrong with
> scheduling a controller reset/delete? Why is something like
> nvme_pci_disable_ctrl_directly needed?

Keith once pointed out to me that we cannot complete and free a request
before we close the controller and the pci master bus; otherwise DMA
accesses could go wrong, because when we complete a request,
the associated DMA mapping is freed.

For the previous outstanding requests, this patch grabs them with blk_abort_request
in nvme_dev_disable, and completes them after we disable/shutdown the controller.

But for the adminq requests issued in nvme_dev_disable and nvme_reset_work, we cannot do this.
We cannot schedule another reset_work->nvme_dev_disable to do that, because we are already in it.
So I use nvme_pci_disable_ctrl_directly, which admittedly looks ugly, to disable the
controller so that we can complete the request with a failure status and move progress forward.
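
The constraint, sketched as pseudo-code (hypothetical, for illustration
only; not the literal driver flow):

	/* WRONG: completing the request frees its DMA mapping while
	 * the controller may still be accessing it */
	blk_mq_complete_request(req);
	nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);

	/* RIGHT: stop the controller first, then complete */
	nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
	blk_mq_complete_request(req);	/* DMA mapping freed safely */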

> I'd like to see an effort to consolidate error handling paths rather
> than enhancing the current skew...

Yes, absolutely. That is also what I expect. :)

This patch has two aspects:
1. Grab all the previous outstanding requests with blk_abort_request.
   This is safe when racing with the irq completion path. Then complete them
   after we disable/shutdown the controller completely.
   I think this part could be placed in the nvme ctrl core.

2. Avoid having nvme_timeout invoke nvme_dev_disable. This is the trickiest part.
   As I shared above, we have to _disable_ the controller _before_ we complete the adminq requests
   from nvme_dev_disable and nvme_reset_work. Consequently, we cannot do as
   nvme_rdma_timeout does, i.e. schedule a recovery work and then return.

Thanks for your guidance here again.

Sincerely
Jianchao


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/9] nvme-pci: quiesce IO queues prior to disabling device HMB accesses
  2018-02-11 11:19     ` Sagi Grimberg
@ 2018-02-12  2:17       ` jianchao.wang
  -1 siblings, 0 replies; 44+ messages in thread
From: jianchao.wang @ 2018-02-12  2:17 UTC (permalink / raw)
  To: Sagi Grimberg, keith.busch, axboe, hch; +Cc: linux-kernel, linux-nvme

Hi Sagi

Thanks for your kind reminder.

On 02/11/2018 07:19 PM, Sagi Grimberg wrote:
> 
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index 6fe7af0..00cffed 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -2186,7 +2186,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>>       if (!dead) {
>>           if (shutdown)
>>               nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
>> +    }
>> +    nvme_stop_queues(&dev->ctrl);
> 
> Nit, extra newline before nvme_stop_queues would be nice.
> 

Yes, I will change this in the next version.

> Other than that, looks good,
> 
> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/9] nvme: change namespaces_mutext to namespaces_rwsem
  2018-02-11 11:17     ` Sagi Grimberg
@ 2018-02-12  2:33       ` jianchao.wang
  -1 siblings, 0 replies; 44+ messages in thread
From: jianchao.wang @ 2018-02-12  2:33 UTC (permalink / raw)
  To: Sagi Grimberg, keith.busch, axboe, hch; +Cc: linux-kernel, linux-nvme

Hi Sagi

Thanks for your kind response.
And sorry for my poor description.

On 02/11/2018 07:17 PM, Sagi Grimberg wrote:
>> namespaces_mutext is used to synchronize the operations on ctrl
>> namespaces list. Most of the time, it is a read operation. It is
>> better to change it from mutex to rwsem.
>>
>> On the other hand, the namespaces mutex could introduce circular
>> dependency easily.
> 
> On the other hand of what?
> 
> Also, can you give an example of such?
> 

We have a lot of interfaces in the nvme core that need to hold this
namespace_mutex, and it looks like we will have more.
The mutex can trouble us in the following scenario.

For example:

context A                        context B
nvme_xxx_xxx                     nvme_xxx_xxx
  -> hold namespace_mutex          -> try to require namespace_mutex
    -> sync context B

Keith's patch will trigger this:
http://lists.infradead.org/pipermail/linux-nvme/2018-February/015605.html

Because I add an interface that needs to hold namespace_mutex, I include these patches in this patchset; see the sketch below.
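
In C terms, the deadlock above looks roughly like this (a hypothetical
sketch; some_work is a made-up name for illustration):

	/* context A */
	mutex_lock(&ctrl->namespaces_mutex);
	flush_work(&some_work);			/* some_work runs in context B */
	mutex_unlock(&ctrl->namespaces_mutex);

	/* context B (the work item) */
	mutex_lock(&ctrl->namespaces_mutex);	/* blocks behind context A */

	/* A waits for B to finish, B waits for A to unlock: deadlock.
	 * With a rwsem, two readers can proceed concurrently. */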

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 8/9] nvme-pci: break up nvme_timeout and nvme_dev_disable
  2018-02-12  2:16       ` jianchao.wang
@ 2018-02-12  7:51         ` jianchao.wang
  -1 siblings, 0 replies; 44+ messages in thread
From: jianchao.wang @ 2018-02-12  7:51 UTC (permalink / raw)
  To: Sagi Grimberg, keith.busch, axboe, hch; +Cc: linux-kernel, linux-nvme

Hi Sagi

Just to add some supplementary notes here.

On 02/12/2018 10:16 AM, jianchao.wang wrote:
>> I think this is going in the wrong direction. Every state that is needed
>> to handle serialization should be done in core ctrl state. Moreover,
>> please try to avoid handling this locally in nvme-pci, place common
>> helpers in nvme-core and use them. I won't be surprised if fc would
>> need some of these.
>>
>> Also, can we try and not disable the controller from nvme_timeout?
> In fact, that is what this patch wants to do. For the previous outstanding
> requests, it returns BLK_EH_NOT_HANDLED and defers the work to nvme_dev_disable.
> 
>  I'm
>> not sure I understand why this is needed at all. What's wrong with
>> scheduling a controller reset/delete? Why is something like
>> nvme_pci_disable_ctrl_directly needed?
> Keith once pointed out to me that we cannot complete and free a request
> before we close the controller and the pci master bus; otherwise DMA
> accesses could go wrong, because when we complete a request,
> the associated DMA mapping is freed.
> 
> For the previously outstanding requests, this patch grabs them with blk_abort_request
> in nvme_dev_disable, and completes them after we disable/shut down the controller.
> 
> But for the adminq requests issued by nvme_dev_disable and nvme_reset_work, we cannot do this.
> We cannot schedule another reset_work->nvme_dev_disable to handle them, because we are already in it.
> So I use nvme_pci_disable_ctrl_directly, which admittedly looks ugly, to disable the
> controller; then we can complete the request with failure and move progress forward.
> 
>> I'd like to see an effort to consolidate error handling paths rather
>> than enhancing the current skew...
> Yes, absolutely. That is also what I expect. :)
> 
> This patch has two aspects:
> 1. Grab all the previously outstanding requests with blk_abort_request.
>    This is safe even when racing with the irq completion path. Then complete them
>    after we disable/shut down the controller completely.
>    I think this part could be placed in the nvme ctrl core.

The 'grab' here is to prevent the timeout path from being triggered during nvme_dev_disable.
This is important, because nvme_timeout may issue IOs on the adminq or invoke
nvme_pci_disable_ctrl_directly, either of which could race with nvme_dev_disable.
A sketch of this step follows.
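As a minimal sketch of that 'grab' step (a simplification under assumed
names, not the patch itself; the callback name is made up, and the tagset
fields are the usual ones in struct nvme_dev):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Iterator callback: force the timeout of every started request to fire now. */
static void nvme_grab_rq(struct request *req, void *data, bool reserved)
{
	if (blk_mq_request_started(req))
		blk_abort_request(req);
}

/* Called from nvme_dev_disable() before the controller is shut down. */
static void nvme_grab_outstanding(struct nvme_dev *dev)
{
	blk_mq_tagset_busy_iter(&dev->tagset, nvme_grab_rq, NULL);
	blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_grab_rq, NULL);
}

With every outstanding request already aborted this way, the timeout path
cannot fire behind nvme_dev_disable's back while it tears the controller down.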

> 2. Avoid having nvme_timeout invoke nvme_dev_disable. This is the trickiest part.

Moreover, letting nvme_timeout invoke nvme_dev_disable introduces some dangerous scenarios; I have reported some of them before.

>    As I shared above, we have to _disable_ the controller _before_ we complete the adminq requests
>    issued by nvme_dev_disable and nvme_reset_work. Consequently, we cannot do what
>    nvme_rdma_timeout does, i.e., schedule a recovery work and then return.

Actually, nvme_timeout has a pattern similar to nvme_rdma_timeout.
When an adminq/IOq request times out, we can schedule reset_work for it. But if a request issued from
the reset_work procedure itself times out, we cannot schedule reset_work any more. At that point, we
have to close the controller directly and fail the requests, as sketched below.
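A rough sketch of that pattern (a hypothetical simplification, not the patch
itself; the function name is made up, and nvme_pci_disable_ctrl_directly is
the helper this patch proposes):

static enum blk_eh_timer_return nvme_timeout_sketch(struct request *req,
		bool reserved)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct nvme_dev *dev = iod->nvmeq->dev;

	if (dev->ctrl.state == NVME_CTRL_RESETTING) {
		/*
		 * The timed-out request was issued by the reset procedure
		 * itself, so another reset_work cannot be scheduled: close
		 * the controller directly and fail the request.
		 */
		nvme_pci_disable_ctrl_directly(dev);
		nvme_req(req)->status = NVME_SC_ABORT_REQ;
		return BLK_EH_HANDLED;
	}

	/* Normal case: schedule recovery and defer completion to it. */
	nvme_reset_ctrl(&dev->ctrl);
	return BLK_EH_NOT_HANDLED;
}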

Looking forward to your advice on this.
It is really appreciated.

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 44+ messages in thread

Thread overview: 44+ messages
2018-02-11  9:38 [PATCH V3 0/6]nvme-pci: fixes on nvme_timeout and nvme_dev_disable Jianchao Wang
2018-02-11  9:38 ` [PATCH 1/9] nvme: fix the dangerous reference of namespaces list Jianchao Wang
2018-02-11 11:14   ` Sagi Grimberg
2018-02-11 11:15   ` Sagi Grimberg
2018-02-11  9:38 ` [PATCH 2/9] nvme: fix the deadlock in nvme_update_formats Jianchao Wang
2018-02-11 11:16   ` Sagi Grimberg
2018-02-12  1:40     ` jianchao.wang
2018-02-11  9:38 ` [PATCH 3/9] nvme: change namespaces_mutext to namespaces_rwsem Jianchao Wang
2018-02-11 11:17   ` Sagi Grimberg
2018-02-12  2:33     ` jianchao.wang
2018-02-11  9:38 ` [PATCH 4/9] nvme-pci: quiesce IO queues prior to disabling device HMB accesses Jianchao Wang
2018-02-11 11:19   ` Sagi Grimberg
2018-02-12  2:17     ` jianchao.wang
2018-02-11  9:38 ` [PATCH 5/9] nvme-pci: suspend queues based on online_queues Jianchao Wang
2018-02-11  9:38 ` [PATCH 6/9] nvme-pci: drain the entered requests after ctrl is shutdown Jianchao Wang
2018-02-11  9:38 ` [PATCH 7/9] blk-mq: make blk_mq_rq_update_aborted_gstate a external interface Jianchao Wang
2018-02-11  9:38 ` [PATCH 8/9] nvme-pci: break up nvme_timeout and nvme_dev_disable Jianchao Wang
2018-02-11 11:36   ` Sagi Grimberg
2018-02-12  2:16     ` jianchao.wang
2018-02-12  7:51       ` jianchao.wang
2018-02-11  9:38 ` [PATCH 9/9] nvme-pci: discard wait timeout when delete cq/sq Jianchao Wang
2018-02-11 11:24   ` Sagi Grimberg
