All of lore.kernel.org
* [PATCH for-4.5 00/13] NVMe fixups for 4.5
@ 2016-02-10 18:17 Keith Busch
  2016-02-10 18:17 ` [PATCH for-4.5 01/13] blk-mq: End unstarted requests on dying queue Keith Busch
                   ` (12 more replies)
  0 siblings, 13 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


Sorry for the patch bomb in the middle of rc, though this is fixing
legit issues.

Patches 1 and 2 are already reviewed/acked, and I have added those tags
from the reviewers (thanks!).

Patches 3 and 4 are performance related and pretty straightforward.

The remaining patches are specifically for surprise removal and other
controller failure scenarios. I originally had all of these in a single
patch, but there was too much going on, so I split it into what appear
to be reasonable, self-contained changes.

To test what was previously broken: run buffered writes to an NVMe
drive, remove the drive via sysfs, then pull the drive out.

Keith Busch (13):
  blk-mq: End unstarted requests on dying queue
  NVMe: Fix io incapable return values
  NVMe: Allow request merges
  NVMe: Set queue limits max_dev_sectors
  NVMe: Fix namespace removal deadlock
  NVMe: Remove WQ_MEM_RECLAIM from nvme work queue
  NVMe: Requeue requests on suspended queues
  NVMe: Poll device while still active during remove
  NVMe: Simplify device reset failure
  NVMe: Move error handling to failed reset handler
  NVMe: Dead namespace handling
  NVMe: Mark queues as dead on degraded controller
  NVMe: Rate limit nvme IO warnings

 block/blk-mq.c           |  6 ++--
 drivers/nvme/host/core.c | 74 ++++++++++++++++++++++-----------------
 drivers/nvme/host/nvme.h | 12 +++++--
 drivers/nvme/host/pci.c  | 91 ++++++++++++++++++++++++++++++------------------
 4 files changed, 113 insertions(+), 70 deletions(-)

-- 
2.6.2.307.g37023ba

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 01/13] blk-mq: End unstarted requests on dying queue
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-11 12:13   ` Johannes Thumshirn
  2016-02-11 12:30   ` Sagi Grimberg
  2016-02-10 18:17 ` [PATCH for-4.5 02/13] NVMe: Fix io incapable return values Keith Busch
                   ` (11 subsequent siblings)
  12 siblings, 2 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


Go directly to ending a request if it wasn't started. Previously,
completing an unstarted request could invoke a driver callback for a
request the driver never initialized.
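
For illustration, the decision this patch changes can be sketched in
plain userspace C; the struct and names below are stand-ins, not the
kernel API:

```c
/* Userspace sketch of the changed decision; types and names are
 * stand-ins, not kernel API. A request that was never started must
 * not go through the completion path, since that may invoke a driver
 * callback for a request the driver never initialized. */
#include <stdbool.h>

enum req_state { REQ_IDLE, REQ_STARTED };

struct request {
	enum req_state state;
	int errors;
	bool queue_dying;
};

/* Returns the error the request ends with, or 0 if left alone. */
static int check_expired_sketch(struct request *rq)
{
	if (rq->state != REQ_STARTED) {
		if (rq->queue_dying) {
			rq->errors = -5;	/* -EIO */
			return rq->errors;	/* end directly */
		}
		return 0;	/* nothing to do */
	}
	return 0;	/* started: normal timeout handling applies */
}
```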

Signed-off-by: Keith Busch <keith.busch at intel.com>
Acked-by: Christoph Hellwig <hch at lst.de>
---
 block/blk-mq.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c0622f..56c0a72 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -599,8 +599,10 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		 * If a request wasn't started before the queue was
 		 * marked dying, kill it here or it'll go unnoticed.
 		 */
-		if (unlikely(blk_queue_dying(rq->q)))
-			blk_mq_complete_request(rq, -EIO);
+		if (unlikely(blk_queue_dying(rq->q))) {
+			rq->errors = -EIO;
+			blk_mq_end_request(rq, rq->errors);
+		}
 		return;
 	}
 
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 02/13] NVMe: Fix io incapable return values
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
  2016-02-10 18:17 ` [PATCH for-4.5 01/13] blk-mq: End unstarted requests on dying queue Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-11 12:14   ` Johannes Thumshirn
  2016-02-10 18:17 ` [PATCH for-4.5 03/13] NVMe: Allow request merges Keith Busch
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


The function is supposed to return true when the controller can't
handle IO; fix the error paths that returned false instead.
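
As a truth-table illustration (stand-in names, not the driver's actual
types), the corrected logic reads:

```c
/* Sketch of the corrected logic: any failure to query the controller
 * state means IO is not possible, so the error paths must report
 * true ("incapable"), not false. Names are illustrative stand-ins;
 * CSTS_CFS models the controller fatal status bit. */
#include <stdbool.h>

struct ctrl_state {
	bool transport_dead;	/* ops->io_incapable() said so */
	bool reg_read_fails;	/* CSTS register unreadable */
	unsigned csts;
};

#define CSTS_CFS 0x2

static bool io_incapable_sketch(const struct ctrl_state *c)
{
	if (c->transport_dead)
		return true;	/* was wrongly "false" before the fix */
	if (c->reg_read_fails)
		return true;	/* was wrongly "false" before the fix */
	return c->csts & CSTS_CFS;
}
```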

Signed-off-by: Keith Busch <keith.busch at intel.com>
Reviewed-by: Sagi Grimberg <sagig at mellanox.com>
Acked-by: Christoph Hellwig <hch at lst.de>
---
 drivers/nvme/host/nvme.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 4fb5bb7..9664d07 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -139,9 +139,9 @@ static inline bool nvme_io_incapable(struct nvme_ctrl *ctrl)
 	u32 val = 0;
 
 	if (ctrl->ops->io_incapable(ctrl))
-		return false;
+		return true;
 	if (ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &val))
-		return false;
+		return true;
 	return val & NVME_CSTS_CFS;
 }
 
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 03/13] NVMe: Allow request merges
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
  2016-02-10 18:17 ` [PATCH for-4.5 01/13] blk-mq: End unstarted requests on dying queue Keith Busch
  2016-02-10 18:17 ` [PATCH for-4.5 02/13] NVMe: Fix io incapable return values Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-10 18:36   ` Christoph Hellwig
                     ` (2 more replies)
  2016-02-10 18:17 ` [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors Keith Busch
                   ` (9 subsequent siblings)
  12 siblings, 3 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


It is generally more efficient to submit larger IO.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c5bf001..3cd921e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1121,7 +1121,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	ns->queue = blk_mq_init_queue(ctrl->tagset);
 	if (IS_ERR(ns->queue))
 		goto out_free_ns;
-	queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, ns->queue);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, ns->queue);
 	ns->queue->queuedata = ns;
 	ns->ctrl = ctrl;
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (2 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 03/13] NVMe: Allow request merges Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-10 18:20   ` Christoph Hellwig
  2016-02-10 18:17 ` [PATCH for-4.5 05/13] NVMe: Fix namespace removal deadlock Keith Busch
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3cd921e..852b789 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1136,6 +1136,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
 	if (ctrl->max_hw_sectors) {
+		ns->queue->limits.max_dev_sectors = ctrl->max_hw_sectors;
 		blk_queue_max_hw_sectors(ns->queue, ctrl->max_hw_sectors);
 		blk_queue_max_segments(ns->queue,
 			(ctrl->max_hw_sectors / (ctrl->page_size >> 9)) + 1);
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 05/13] NVMe: Fix namespace removal deadlock
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (3 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-11 12:19   ` Johannes Thumshirn
  2016-02-11 16:38   ` Wenbo Wang
  2016-02-10 18:17 ` [PATCH for-4.5 06/13] NVMe: Remove WQ_MEM_RECLAIM from nvme work queue Keith Busch
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


Namespace removal holds the namespace list mutex. If the controller
fails during namespace removal, the reset work will deadlock trying to
acquire the mutex, and no progress will be made.

This patch completes the potentially blocking parts of namespace
removal in a controller work queue so that the reset work can clean up
a failed controller.
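
The queue-once guard the patch relies on can be sketched in userspace;
the types are stand-ins, with a C11 atomic_flag playing the role of
test_and_set_bit():

```c
/* Userspace sketch of the queue-once guard: a test-and-set flag
 * ensures the removal work is queued at most once per namespace,
 * no matter how many times removal is requested. Stand-in types;
 * atomic_flag models test_and_set_bit(NVME_NS_REMOVING, ...). */
#include <stdatomic.h>

struct ns_sketch {
	atomic_flag removing;	/* NVME_NS_REMOVING */
	int queued;		/* how often work was queued */
};

static void ns_remove_sketch(struct ns_sketch *ns)
{
	/* atomic_flag_test_and_set() returns the previous value,
	 * so only the first caller queues the work. */
	if (!atomic_flag_test_and_set(&ns->removing))
		ns->queued++;	/* queue_work() would go here */
}
```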

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 81 ++++++++++++++++++++++++++++++------------------
 drivers/nvme/host/nvme.h |  6 ++++
 2 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 852b789..be27f9f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -48,6 +48,10 @@ static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
 
+	mutex_lock(&ns->ctrl->namespaces_mutex);
+	list_del_init(&ns->list);
+	mutex_unlock(&ns->ctrl->namespaces_mutex);
+
 	if (ns->type == NVME_NS_LIGHTNVM)
 		nvme_nvm_unregister(ns->queue, ns->disk->disk_name);
 
@@ -1106,6 +1110,37 @@ static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	return NULL;
 }
 
+static void nvme_ns_remove_work(struct work_struct *work)
+{
+	struct nvme_ns *ns = container_of(work, struct nvme_ns, remove_work);
+	bool kill = nvme_io_incapable(ns->ctrl) &&
+			!blk_queue_dying(ns->queue);
+
+	if (kill) {
+		blk_set_queue_dying(ns->queue);
+
+		/*
+		 * The controller was shutdown first if we got here through
+		 * device removal. The shutdown may requeue outstanding
+		 * requests. These need to be aborted immediately so
+		 * del_gendisk doesn't block indefinitely for their completion.
+		 */
+		blk_mq_abort_requeue_list(ns->queue);
+	}
+	if (ns->disk->flags & GENHD_FL_UP) {
+		if (blk_get_integrity(ns->disk))
+			blk_integrity_unregister(ns->disk);
+		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
+					&nvme_ns_attr_group);
+		del_gendisk(ns->disk);
+	}
+	if (kill || !blk_queue_dying(ns->queue)) {
+		blk_mq_abort_requeue_list(ns->queue);
+		blk_cleanup_queue(ns->queue);
+	}
+	nvme_put_ns(ns);
+}
+
 static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 {
 	struct nvme_ns *ns;
@@ -1161,6 +1196,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	list_add_tail(&ns->list, &ctrl->namespaces);
 	kref_get(&ctrl->kref);
+	INIT_WORK(&ns->remove_work, nvme_ns_remove_work);
 	if (ns->type == NVME_NS_LIGHTNVM)
 		return;
 
@@ -1180,35 +1216,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
-	bool kill = nvme_io_incapable(ns->ctrl) &&
-			!blk_queue_dying(ns->queue);
-
-	lockdep_assert_held(&ns->ctrl->namespaces_mutex);
-
-	if (kill) {
-		blk_set_queue_dying(ns->queue);
-
-		/*
-		 * The controller was shutdown first if we got here through
-		 * device removal. The shutdown may requeue outstanding
-		 * requests. These need to be aborted immediately so
-		 * del_gendisk doesn't block indefinitely for their completion.
-		 */
-		blk_mq_abort_requeue_list(ns->queue);
-	}
-	if (ns->disk->flags & GENHD_FL_UP) {
-		if (blk_get_integrity(ns->disk))
-			blk_integrity_unregister(ns->disk);
-		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_attr_group);
-		del_gendisk(ns->disk);
-	}
-	if (kill || !blk_queue_dying(ns->queue)) {
-		blk_mq_abort_requeue_list(ns->queue);
-		blk_cleanup_queue(ns->queue);
-	}
-	list_del_init(&ns->list);
-	nvme_put_ns(ns);
+	if (!test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
+		queue_work(ns->ctrl->ns_workq, &ns->remove_work);
 }
 
 static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
@@ -1305,6 +1314,8 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list)
 		nvme_ns_remove(ns);
 	mutex_unlock(&ctrl->namespaces_mutex);
+
+	flush_workqueue(ctrl->ns_workq);
 }
 
 static DEFINE_IDA(nvme_instance_ida);
@@ -1337,12 +1348,14 @@ static void nvme_release_instance(struct nvme_ctrl *ctrl)
 }
 
 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
- {
+{
 	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
 
 	spin_lock(&dev_list_lock);
 	list_del(&ctrl->node);
 	spin_unlock(&dev_list_lock);
+
+	destroy_workqueue(ctrl->ns_workq);
 }
 
 static void nvme_free_ctrl(struct kref *kref)
@@ -1392,11 +1405,19 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	get_device(ctrl->device);
 	dev_set_drvdata(ctrl->device, ctrl);
 
+	ctrl->ns_workq = alloc_workqueue("nvme-ns", WQ_UNBOUND, 0);
+	if (!ctrl->ns_workq) {
+		ret = -ENOMEM;
+		goto out_destroy_device;
+	}
+
 	spin_lock(&dev_list_lock);
 	list_add_tail(&ctrl->node, &nvme_ctrl_list);
 	spin_unlock(&dev_list_lock);
 
 	return 0;
+out_destroy_device:
+	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
 out_release_instance:
 	nvme_release_instance(ctrl);
 out:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9664d07..d330512 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -72,6 +72,7 @@ struct nvme_ctrl {
 	struct mutex namespaces_mutex;
 	struct device *device;	/* char device */
 	struct list_head node;
+	struct workqueue_struct *ns_workq;
 
 	char name[12];
 	char serial[20];
@@ -112,6 +113,11 @@ struct nvme_ns {
 	bool ext;
 	u8 pi_type;
 	int type;
+	struct work_struct remove_work;
+	unsigned long flags;
+
+#define NVME_NS_REMOVING 0
+
 	u64 mode_select_num_blocks;
 	u32 mode_select_block_len;
 };
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 06/13] NVMe: Remove WQ_MEM_RECLAIM from nvme work queue
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (4 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 05/13] NVMe: Fix namespace removal deadlock Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-10 18:46   ` Christoph Hellwig
  2016-02-10 18:17 ` [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues Keith Busch
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


This work queue isn't used in the memory reclaim path, and we may need
to sync with work queues that also are not flagged for memory reclaim.
This fixes a kernel warning if we ever do sync with such a work queue.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 72ef832..5bea054 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2227,7 +2227,7 @@ static int __init nvme_init(void)
 
 	init_waitqueue_head(&nvme_kthread_wait);
 
-	nvme_workq = alloc_workqueue("nvme", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
+	nvme_workq = alloc_workqueue("nvme", WQ_UNBOUND, 0);
 	if (!nvme_workq)
 		return -ENOMEM;
 
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (5 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 06/13] NVMe: Remove WQ_MEM_RECLAIM from nvme work queue Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-10 18:47   ` Christoph Hellwig
                     ` (3 more replies)
  2016-02-10 18:17 ` [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove Keith Busch
                   ` (5 subsequent siblings)
  12 siblings, 4 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


It's possible for a request to reach the driver after the nvme queue
was disabled. Requeue the request if that happens.

Note the request is still "started" by the driver, but requeuing will
clear the started state for timeout handling.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5bea054..f8bb781 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -678,6 +678,11 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	blk_mq_start_request(req);
 
 	spin_lock_irq(&nvmeq->q_lock);
+	if (unlikely(nvmeq->cq_vector < 0)) {
+		ret = BLK_MQ_RQ_QUEUE_BUSY;
+		spin_unlock_irq(&nvmeq->q_lock);
+		goto out;
+	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
 	nvme_process_cq(nvmeq);
 	spin_unlock_irq(&nvmeq->q_lock);
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (6 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-10 18:48   ` Christoph Hellwig
                     ` (2 more replies)
  2016-02-10 18:17 ` [PATCH for-4.5 09/13] NVMe: Simplify device reset failure Keith Busch
                   ` (4 subsequent siblings)
  12 siblings, 3 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


A device failure or link down wouldn't have been detected during
namespace removal. This patch keeps the device on the polling list so
the polling thread may see such a failure and initiate a reset. The
device is removed from the list after disable, so we can safely flush
the reset work, as it can't be requeued once disable completes.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f8bb781..0303936 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2116,16 +2116,12 @@ static void nvme_remove(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
 
-	spin_lock(&dev_list_lock);
-	list_del_init(&dev->node);
-	spin_unlock(&dev_list_lock);
-
 	pci_set_drvdata(pdev, NULL);
-	flush_work(&dev->reset_work);
 	flush_work(&dev->scan_work);
 	nvme_remove_namespaces(&dev->ctrl);
 	nvme_uninit_ctrl(&dev->ctrl);
 	nvme_dev_disable(dev, true);
+	flush_work(&dev->reset_work);
 	nvme_dev_remove_admin(dev);
 	nvme_free_queues(dev, 0);
 	nvme_release_cmb(dev);
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 09/13] NVMe: Simplify device reset failure
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (7 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-10 18:53   ` Christoph Hellwig
  2016-02-11 12:28   ` Johannes Thumshirn
  2016-02-10 18:17 ` [PATCH for-4.5 10/13] NVMe: Move error handling to failed reset handler Keith Busch
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


A reset failure schedules the device to unbind from the driver through
the pci driver's remove callback. That path cleans up all
initialization, so there is no need to duplicate the potentially racy
cleanup here.

To help understand why a reset failed, the status is now logged with
the existing warning message.
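
The shape of the change can be sketched abstractly; the step names
below are made up, standing in for the admin queue, tag, identify, and
IO queue setup steps:

```c
/* Sketch of the consolidated error handling: every failure jumps to
 * a single "out" label that hands the status to the dead-controller
 * path, instead of a ladder of partial-unwind labels duplicating
 * cleanup that the removal path performs anyway. Illustrative only. */
static int step(int fail_at, int n)
{
	return n == fail_at ? -19 /* -ENODEV */ : 0;
}

static int reset_sketch(int fail_at)
{
	int result;

	result = step(fail_at, 1);	/* configure admin queue */
	if (result)
		goto out;
	result = step(fail_at, 2);	/* alloc tags, identify, ... */
	if (result)
		goto out;
	return 0;
out:
	/* nvme_remove_dead_ctrl(dev, result) would log the status */
	return result;
}
```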

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 43 ++++++++++++++++---------------------------
 1 file changed, 16 insertions(+), 27 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 0303936..3381bac 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -86,7 +86,6 @@ struct nvme_queue;
 
 static int nvme_reset(struct nvme_dev *dev);
 static void nvme_process_cq(struct nvme_queue *nvmeq);
-static void nvme_remove_dead_ctrl(struct nvme_dev *dev);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 
 /*
@@ -1899,10 +1898,19 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	kfree(dev);
 }
 
+static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
+{
+	dev_warn(dev->dev, "Removing after probe failure status:%d\n", status);
+
+	kref_get(&dev->ctrl.kref);
+	if (!schedule_work(&dev->remove_work))
+		nvme_put_ctrl(&dev->ctrl);
+}
+
 static void nvme_reset_work(struct work_struct *work)
 {
 	struct nvme_dev *dev = container_of(work, struct nvme_dev, reset_work);
-	int result;
+	int result = ENODEV;
 
 	if (WARN_ON(test_bit(NVME_CTRL_RESETTING, &dev->flags)))
 		goto out;
@@ -1922,26 +1930,26 @@ static void nvme_reset_work(struct work_struct *work)
 
 	result = nvme_configure_admin_queue(dev);
 	if (result)
-		goto unmap;
+		goto out;
 
 	nvme_init_queue(dev->queues[0], 0);
 	result = nvme_alloc_admin_tags(dev);
 	if (result)
-		goto disable;
+		goto out;
 
 	result = nvme_init_identify(&dev->ctrl);
 	if (result)
-		goto free_tags;
+		goto out;
 
 	result = nvme_setup_io_queues(dev);
 	if (result)
-		goto free_tags;
+		goto out;
 
 	dev->ctrl.event_limit = NVME_NR_AEN_COMMANDS;
 
 	result = nvme_dev_list_add(dev);
 	if (result)
-		goto remove;
+		goto out;
 
 	/*
 	 * Keep the controller around but remove all namespaces if we don't have
@@ -1958,19 +1966,8 @@ static void nvme_reset_work(struct work_struct *work)
 	clear_bit(NVME_CTRL_RESETTING, &dev->flags);
 	return;
 
- remove:
-	nvme_dev_list_remove(dev);
- free_tags:
-	nvme_dev_remove_admin(dev);
-	blk_put_queue(dev->ctrl.admin_q);
-	dev->ctrl.admin_q = NULL;
-	dev->queues[0]->tags = NULL;
- disable:
-	nvme_disable_admin_queue(dev, false);
- unmap:
-	nvme_dev_unmap(dev);
  out:
-	nvme_remove_dead_ctrl(dev);
+	nvme_remove_dead_ctrl(dev, result);
 }
 
 static void nvme_remove_dead_ctrl_work(struct work_struct *work)
@@ -1983,14 +1980,6 @@ static void nvme_remove_dead_ctrl_work(struct work_struct *work)
 	nvme_put_ctrl(&dev->ctrl);
 }
 
-static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
-{
-	dev_warn(dev->dev, "Removing after probe failure\n");
-	kref_get(&dev->ctrl.kref);
-	if (!schedule_work(&dev->remove_work))
-		nvme_put_ctrl(&dev->ctrl);
-}
-
 static int nvme_reset(struct nvme_dev *dev)
 {
 	if (!dev->ctrl.admin_q || blk_queue_dying(dev->ctrl.admin_q))
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 10/13] NVMe: Move error handling to failed reset handler
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (8 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 09/13] NVMe: Simplify device reset failure Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-11 12:34   ` Johannes Thumshirn
  2016-02-11 12:50   ` Sagi Grimberg
  2016-02-10 18:17 ` [PATCH for-4.5 11/13] NVMe: Dead namespace handling Keith Busch
                   ` (2 subsequent siblings)
  12 siblings, 2 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


This moves the dead queue handling out of the namespace removal path
and into the reset failure path. It fixes a deadlock if the controller
fails or the link goes down during del_gendisk.
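
The namespace list walk here relies on kref_get_unless_zero() to skip
namespaces already on their way to being freed; its semantics can be
sketched with C11 atomics (a stand-in, not the kernel implementation):

```c
/* Sketch of kref_get_unless_zero(): take a reference only if the
 * count hasn't already dropped to zero, i.e. the object isn't being
 * freed. C11 userspace stand-in for the kernel primitive. */
#include <stdatomic.h>
#include <stdbool.h>

struct kref_sketch {
	atomic_int refcount;
};

static bool kref_get_unless_zero_sketch(struct kref_sketch *k)
{
	int v = atomic_load(&k->refcount);

	while (v != 0) {
		/* on failure the CAS reloads the current value into v */
		if (atomic_compare_exchange_weak(&k->refcount, &v, v + 1))
			return true;
	}
	return false;	/* already zero: skip this namespace */
}
```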

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 19 ++-----------------
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 22 ++++++++++++++++++++++
 3 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index be27f9f..41b595c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -64,7 +64,7 @@ static void nvme_free_ns(struct kref *kref)
 	kfree(ns);
 }
 
-static void nvme_put_ns(struct nvme_ns *ns)
+void nvme_put_ns(struct nvme_ns *ns)
 {
 	kref_put(&ns->kref, nvme_free_ns);
 }
@@ -1113,28 +1113,13 @@ static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 static void nvme_ns_remove_work(struct work_struct *work)
 {
 	struct nvme_ns *ns = container_of(work, struct nvme_ns, remove_work);
-	bool kill = nvme_io_incapable(ns->ctrl) &&
-			!blk_queue_dying(ns->queue);
-
-	if (kill) {
-		blk_set_queue_dying(ns->queue);
-
-		/*
-		 * The controller was shutdown first if we got here through
-		 * device removal. The shutdown may requeue outstanding
-		 * requests. These need to be aborted immediately so
-		 * del_gendisk doesn't block indefinitely for their completion.
-		 */
-		blk_mq_abort_requeue_list(ns->queue);
-	}
+
 	if (ns->disk->flags & GENHD_FL_UP) {
 		if (blk_get_integrity(ns->disk))
 			blk_integrity_unregister(ns->disk);
 		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
 					&nvme_ns_attr_group);
 		del_gendisk(ns->disk);
-	}
-	if (kill || !blk_queue_dying(ns->queue)) {
 		blk_mq_abort_requeue_list(ns->queue);
 		blk_cleanup_queue(ns->queue);
 	}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index d330512..19a64b2 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -270,6 +270,7 @@ int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
 			dma_addr_t dma_addr, u32 *result);
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
+void nvme_put_ns(struct nvme_ns *ns);
 
 extern spinlock_t dev_list_lock;
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3381bac..a18e4ab 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1898,11 +1898,33 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	kfree(dev);
 }
 
+static void nvme_kill_ns_queues(struct nvme_dev *dev)
+{
+	struct nvme_ns *ns;
+	struct nvme_ctrl *ctrl = &dev->ctrl;
+
+	nvme_dev_disable(dev, false);
+
+	mutex_lock(&ctrl->namespaces_mutex);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		if (!kref_get_unless_zero(&ns->kref))
+			continue;
+
+		blk_set_queue_dying(ns->queue);
+		blk_mq_abort_requeue_list(ns->queue);
+		blk_mq_start_stopped_hw_queues(ns->queue, true);
+
+		nvme_put_ns(ns);
+	}
+	mutex_unlock(&ctrl->namespaces_mutex);
+}
+
 static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
 {
 	dev_warn(dev->dev, "Removing after probe failure status:%d\n", status);
 
 	kref_get(&dev->ctrl.kref);
+	nvme_kill_ns_queues(dev);
 	if (!schedule_work(&dev->remove_work))
 		nvme_put_ctrl(&dev->ctrl);
 }
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 11/13] NVMe: Dead namespace handling
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (9 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 10/13] NVMe: Move error handling to failed reset handler Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-11 12:43   ` Johannes Thumshirn
  2016-02-11 12:59   ` Sagi Grimberg
  2016-02-10 18:17 ` [PATCH for-4.5 12/13] NVMe: Mark queues as dead on degraded controller Keith Busch
  2016-02-10 18:17 ` [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings Keith Busch
  12 siblings, 2 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


This adds a "dead" state to a namespace and revalidates such a
namespace to 0 capacity. This forces buffered writers to stop dirtying
pages that can't be synced, and fails any request submitted to such a
namespace.
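
The resulting dispatch decision can be summarized in a sketch; the
enum names are stand-ins for the BLK_MQ_RQ_QUEUE_* return codes:

```c
/* Sketch of the dispatch decision after this patch: a suspended
 * queue (no cq_vector) asks the block layer to requeue (BUSY) so IO
 * survives a reset, but a dead namespace fails IO outright (ERROR)
 * so nothing blocks forever. Illustrative stand-in names. */
enum rq_status { RQ_OK, RQ_BUSY, RQ_ERROR };

static enum rq_status queue_rq_sketch(int cq_vector, int ns_dead)
{
	if (cq_vector < 0)
		return ns_dead ? RQ_ERROR : RQ_BUSY;
	return RQ_OK;	/* submit the command normally */
}
```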

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c |  4 ++++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 12 +++++++++++-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 41b595c..84e9f41 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -560,6 +560,10 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	u16 old_ms;
 	unsigned short bs;
 
+	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
+		set_capacity(disk, 0);
+		return -ENODEV;
+	}
 	if (nvme_identify_ns(ns->ctrl, ns->ns_id, &id)) {
 		dev_warn(ns->ctrl->dev, "%s: Identify failure nvme%dn%d\n",
 				__func__, ns->ctrl->instance, ns->ns_id);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 19a64b2..e4b4110 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -117,6 +117,7 @@ struct nvme_ns {
 	unsigned long flags;
 
 #define NVME_NS_REMOVING 0
+#define NVME_NS_DEAD     1
 
 	u64 mode_select_num_blocks;
 	u32 mode_select_block_len;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a18e4ab..7fd8a54 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -678,7 +678,9 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 	spin_lock_irq(&nvmeq->q_lock);
 	if (unlikely(nvmeq->cq_vector < 0)) {
-		ret = BLK_MQ_RQ_QUEUE_BUSY;
+		ret = test_bit(NVME_NS_DEAD, &ns->flags) ?
+					BLK_MQ_RQ_QUEUE_ERROR :
+					BLK_MQ_RQ_QUEUE_BUSY;
 		spin_unlock_irq(&nvmeq->q_lock);
 		goto out;
 	}
@@ -1910,6 +1912,14 @@ static void nvme_kill_ns_queues(struct nvme_dev *dev)
 		if (!kref_get_unless_zero(&ns->kref))
 			continue;
 
+		/*
+		 * Revalidating a dead namespace sets capacity to 0. This will
+		 * end buffered writers dirtying more pages that can't be
+		 * synced.
+		 */
+		if (!test_and_set_bit(NVME_NS_DEAD, &ns->flags))
+			revalidate_disk(ns->disk);
+
 		blk_set_queue_dying(ns->queue);
 		blk_mq_abort_requeue_list(ns->queue);
 		blk_mq_start_stopped_hw_queues(ns->queue, true);
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 12/13] NVMe: Mark queues as dead on degraded controller
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (10 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 11/13] NVMe: Dead namespace handling Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-11 12:44   ` Johannes Thumshirn
  2016-02-11 13:00   ` Sagi Grimberg
  2016-02-10 18:17 ` [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings Keith Busch
  12 siblings, 2 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


If a previously optimal controller goes degraded, the IO queues will
stop working, but the controller is still manageable. Prior to removing
the namespaces, we need to fail the queues so that queued and new IO
will fail, allowing namespace removal to progress.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7fd8a54..18c14f6 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1989,6 +1989,7 @@ static void nvme_reset_work(struct work_struct *work)
 	 */
 	if (dev->online_queues < 2) {
 		dev_warn(dev->dev, "IO queues not created\n");
+		nvme_kill_ns_queues(dev);
 		nvme_remove_namespaces(&dev->ctrl);
 	} else {
 		nvme_start_queues(&dev->ctrl);
-- 
2.6.2.307.g37023ba


* [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings
  2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
                   ` (11 preceding siblings ...)
  2016-02-10 18:17 ` [PATCH for-4.5 12/13] NVMe: Mark queues as dead on degraded controller Keith Busch
@ 2016-02-10 18:17 ` Keith Busch
  2016-02-10 18:54   ` Christoph Hellwig
  12 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:17 UTC (permalink / raw)


We don't need to spam the kernel logs with thousands of messages; we
can infer all IOs are being cancelled from far fewer.
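
The behavior of a rate-limited warning can be sketched with a toy
interval/burst limiter; this is an illustration only, not the kernel's
___ratelimit() implementation:

```c
/* Toy interval/burst limiter showing why dev_warn_ratelimited() is
 * sufficient here: within one interval only the first few identical
 * warnings get through, yet at least one message per interval still
 * shows the cancellations are happening. */
struct rl_sketch {
	long interval_start;
	int burst;
	int count;
};

/* Returns nonzero if a message may be printed at time "now". */
static int rl_ok(struct rl_sketch *rl, long now, long interval)
{
	if (now - rl->interval_start >= interval) {
		rl->interval_start = now;	/* new interval */
		rl->count = 0;
	}
	return rl->count++ < rl->burst;
}
```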

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 18c14f6..4d3b3d0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1005,7 +1005,7 @@ static void nvme_cancel_queue_ios(struct request *req, void *data, bool reserved
 	if (!blk_mq_request_started(req))
 		return;
 
-	dev_warn(nvmeq->q_dmadev,
+	dev_warn_ratelimited(nvmeq->q_dmadev,
 		 "Cancelling I/O %d QID %d\n", req->tag, nvmeq->qid);
 
 	status = NVME_SC_ABORT_REQ;
-- 
2.6.2.307.g37023ba

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors
  2016-02-10 18:17 ` [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors Keith Busch
@ 2016-02-10 18:20   ` Christoph Hellwig
  2016-02-10 18:24     ` Keith Busch
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2016-02-10 18:20 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:21AM -0700, Keith Busch wrote:
> Signed-off-by: Keith Busch <keith.busch at intel.com>

why?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors
  2016-02-10 18:20   ` Christoph Hellwig
@ 2016-02-10 18:24     ` Keith Busch
  2016-02-10 18:40       ` Christoph Hellwig
  0 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-10 18:24 UTC (permalink / raw)


On Wed, Feb 10, 2016@10:20:51AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 10, 2016@11:17:21AM -0700, Keith Busch wrote:
> > Signed-off-by: Keith Busch <keith.busch at intel.com>
> 
> why?

Ah, missing change log here.

It is so the queue limits will be correct. The max_hw_sectors will be
wrong without this.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 03/13] NVMe: Allow request merges
  2016-02-10 18:17 ` [PATCH for-4.5 03/13] NVMe: Allow request merges Keith Busch
@ 2016-02-10 18:36   ` Christoph Hellwig
  2016-02-10 18:37     ` Jens Axboe
  2016-02-11 12:15   ` Johannes Thumshirn
  2016-02-11 12:33   ` Sagi Grimberg
  2 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2016-02-10 18:36 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:20AM -0700, Keith Busch wrote:
> It is generally more efficient to submit larger IO.

In general it is, but it might be good to try to understand
why this flag was set in the first place, and why the argument
for that isn't true anymore, or never was.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 03/13] NVMe: Allow request merges
  2016-02-10 18:36   ` Christoph Hellwig
@ 2016-02-10 18:37     ` Jens Axboe
  2016-02-10 21:01       ` Keith Busch
  0 siblings, 1 reply; 57+ messages in thread
From: Jens Axboe @ 2016-02-10 18:37 UTC (permalink / raw)


On 02/10/2016 11:36 AM, Christoph Hellwig wrote:
> On Wed, Feb 10, 2016@11:17:20AM -0700, Keith Busch wrote:
>> It is generally more efficient to submit larger IO.
>
> In general it is, but it might be good to try to understand
> why this flag was set in the first place, and why the argument
> for that isn't true anymore, or never was.

IIRC, it was set when the conversion to blk-mq happened, since the 
original driver, being bio based, could not do merges.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors
  2016-02-10 18:24     ` Keith Busch
@ 2016-02-10 18:40       ` Christoph Hellwig
  2016-02-10 19:49         ` Keith Busch
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2016-02-10 18:40 UTC (permalink / raw)


On Wed, Feb 10, 2016@06:24:17PM +0000, Keith Busch wrote:
> On Wed, Feb 10, 2016@10:20:51AM -0800, Christoph Hellwig wrote:
> > On Wed, Feb 10, 2016@11:17:21AM -0700, Keith Busch wrote:
> > > Signed-off-by: Keith Busch <keith.busch at intel.com>
> > 
> > why?
> 
> Ah, missing change log here.
> 
> It is so the queue limits will be correct. The max_hw_sectors will be
> wrong without this.

How so?  We only ever look at max_dev_sectors in blk_queue_max_hw_sectors
and blk_stack_limits, and both do a min_not_zero.  A zero max_dev_sectors
should thus be perfectly fine, and in fact no one but the SCSI disk
driver even bothers to set one.

Ohh, I see now - blk_set_default_limits initializes max_dev_sectors to
BLK_SAFE_MAX_SECTORS instead of leaving it as zero.  I think that's the
issue which needs to be fixed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 06/13] NVMe: Remove WQ_MEM_RECLAIM from nvme work queue
  2016-02-10 18:17 ` [PATCH for-4.5 06/13] NVMe: Remove WQ_MEM_RECLAIM from nvme work queue Keith Busch
@ 2016-02-10 18:46   ` Christoph Hellwig
  2016-02-10 23:37     ` Keith Busch
  2016-02-11 14:52     ` Keith Busch
  0 siblings, 2 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-02-10 18:46 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:23AM -0700, Keith Busch wrote:
> This isn't used for work in the memory reclaim path, and we may need
> to sync with work queues that also are not flagged memory reclaim. This
> fixes a kernel warning if we ever do sync with such a work queue.

We do need it during memory reclaim: memory reclaim in general
does I/O, which can be on NVMe.  We then need the workqueue to
abort a command or reset an overloaded controller to make progress.
Not having WQ_MEM_RECLAIM risks deadlocks in heavily loaded systems.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues
  2016-02-10 18:17 ` [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues Keith Busch
@ 2016-02-10 18:47   ` Christoph Hellwig
  2016-02-11 12:22   ` Johannes Thumshirn
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-02-10 18:47 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:24AM -0700, Keith Busch wrote:
> It's possible a request may get to the driver after the nvme queue was
> disabled. This makes the driver requeue the request if that happens.
> 
> Note the request is still "started" by the driver, but requeuing will
> clear the start state for timeout handling.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch at lst.de>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove
  2016-02-10 18:17 ` [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove Keith Busch
@ 2016-02-10 18:48   ` Christoph Hellwig
  2016-02-11 12:26   ` Johannes Thumshirn
  2016-02-11 12:42   ` Sagi Grimberg
  2 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-02-10 18:48 UTC (permalink / raw)


Looks fine,

Reviewed-by: Christoph Hellwig <hch at lst.de>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 09/13] NVMe: Simplify device reset failure
  2016-02-10 18:17 ` [PATCH for-4.5 09/13] NVMe: Simplify device reset failure Keith Busch
@ 2016-02-10 18:53   ` Christoph Hellwig
  2016-02-11 12:28   ` Johannes Thumshirn
  1 sibling, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-02-10 18:53 UTC (permalink / raw)


> +	dev_warn(dev->dev, "Removing after probe failure status:%d\n", status);

Why no space after the colon?

>  static void nvme_reset_work(struct work_struct *work)
>  {
>  	struct nvme_dev *dev = container_of(work, struct nvme_dev, reset_work);
> -	int result;
> +	int result = ENODEV;

negative errors, please.

> - remove:
> -	nvme_dev_list_remove(dev);
> - free_tags:
> -	nvme_dev_remove_admin(dev);
> -	blk_put_queue(dev->ctrl.admin_q);
> -	dev->ctrl.admin_q = NULL;
> -	dev->queues[0]->tags = NULL;
> - disable:
> -	nvme_disable_admin_queue(dev, false);
> - unmap:
> -	nvme_dev_unmap(dev);
>   out:
> -	nvme_remove_dead_ctrl(dev);
> +	nvme_remove_dead_ctrl(dev, result);

So this relies on the cleanup path to handle all states
of partial initialization.  I suspect you've audited this in
the current version, but the remove path could use a comment
stating that any edit will have to obey this restriction.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings
  2016-02-10 18:17 ` [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings Keith Busch
@ 2016-02-10 18:54   ` Christoph Hellwig
  2016-02-10 21:10     ` Keith Busch
  0 siblings, 1 reply; 57+ messages in thread
From: Christoph Hellwig @ 2016-02-10 18:54 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:30AM -0700, Keith Busch wrote:
> We don't need to spam the kernel logs with thousands of messages. We
> can infer all IO's are being cancelled with fewer messages.

Do we need this message at all?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors
  2016-02-10 18:40       ` Christoph Hellwig
@ 2016-02-10 19:49         ` Keith Busch
  2016-02-10 22:53           ` Martin K. Petersen
  0 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-10 19:49 UTC (permalink / raw)


On Wed, Feb 10, 2016@10:40:31AM -0800, Christoph Hellwig wrote:
> Ohh, I see now - blk_set_default_limits initializes max_dev_sectors to
> BLK_SAFE_MAX_SECTORS instead of leaving it as zero.  I think that's the
> issue which needs to be fixed.

Heh, that's pretty funny. I mentioned something along those lines after
I broke the block layer in rc1, but thought it was done that way on
purpose. :)

So instead of this nvme specific patch, we should do this?

---
diff --git a/block/blk-settings.c b/block/blk-settings.c
index dd49735..c7bb666 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -91,8 +91,8 @@ void blk_set_default_limits(struct queue_limits *lim)
        lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
        lim->virt_boundary_mask = 0;
        lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
-       lim->max_sectors = lim->max_dev_sectors = lim->max_hw_sectors =
-               BLK_SAFE_MAX_SECTORS;
+       lim->max_sectors = lim->max_hw_sectors = BLK_SAFE_MAX_SECTORS;
+       lim->max_dev_sectors = 0;
        lim->chunk_sectors = 0;
        lim->max_write_same_sectors = 0;
        lim->max_discard_sectors = 0;
--

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 03/13] NVMe: Allow request merges
  2016-02-10 18:37     ` Jens Axboe
@ 2016-02-10 21:01       ` Keith Busch
  2016-02-10 21:19         ` Jens Axboe
  0 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-10 21:01 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:37:05AM -0700, Jens Axboe wrote:
> On 02/10/2016 11:36 AM, Christoph Hellwig wrote:
> >In general it is, but it might be good to try to understand
> >why this flag was set in the first place, and why the argument
> >for that isn't true anymore, or never was.
> 
> IIRC, it was set when the conversion to blk-mq happened, since the
> original driver could not do merges being bio based.

This actually carried over from when it was still bio-based. The flag
was set in the driver's initial commit, but should not have been.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings
  2016-02-10 18:54   ` Christoph Hellwig
@ 2016-02-10 21:10     ` Keith Busch
  2016-02-11 12:29       ` Sagi Grimberg
  0 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-10 21:10 UTC (permalink / raw)


On Wed, Feb 10, 2016@10:54:04AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 10, 2016@11:17:30AM -0700, Keith Busch wrote:
> > We don't need to spam the kernel logs with thousands of messages. We
> > can infer all IO's are being cancelled with fewer messages.
> 
> Do we need this message at all?

I don't find it useful for anything other than confirming a test hit
that path. Delete it?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 03/13] NVMe: Allow request merges
  2016-02-10 21:01       ` Keith Busch
@ 2016-02-10 21:19         ` Jens Axboe
  0 siblings, 0 replies; 57+ messages in thread
From: Jens Axboe @ 2016-02-10 21:19 UTC (permalink / raw)


On 02/10/2016 02:01 PM, Keith Busch wrote:
> On Wed, Feb 10, 2016@11:37:05AM -0700, Jens Axboe wrote:
>> On 02/10/2016 11:36 AM, Christoph Hellwig wrote:
>>> In general it is, but it might be good to try to understand
>>> why this flag was set in the first place, and why the argument
>>> for that isn't true anymore, or never was.
>>
>> IIRC, it was set when the conversion to blk-mq happened, since the
>> original driver could not do merges being bio based.
>
> This actually carried over from when it was still bio-based. The flag
> was set in the driver's initial commit, but should not have been.

Yeah, but the flag was a no-op back then, since no merging is done for 
bio based drivers. So flag or no flag back then, no merging would have 
been done.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors
  2016-02-10 19:49         ` Keith Busch
@ 2016-02-10 22:53           ` Martin K. Petersen
  0 siblings, 0 replies; 57+ messages in thread
From: Martin K. Petersen @ 2016-02-10 22:53 UTC (permalink / raw)


>>>>> "Keith" == Keith Busch <keith.busch at intel.com> writes:

Keith> Heh, that's pretty funny. I mentioned something along those lines
Keith> after I broke the block layer in rc1, but thought it was done
Keith> that way on purpose. :)

Keith> So instead of this nvme specific patch, we should do this?

Yes. Thinko on my part, sorry about that.

Acked-by: Martin K. Petersen <martin.petersen at oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 06/13] NVMe: Remove WQ_MEM_RECLAIM from nvme work queue
  2016-02-10 18:46   ` Christoph Hellwig
@ 2016-02-10 23:37     ` Keith Busch
  2016-02-11 14:52     ` Keith Busch
  1 sibling, 0 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-10 23:37 UTC (permalink / raw)


On Wed, Feb 10, 2016@10:46:41AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 10, 2016@11:17:23AM -0700, Keith Busch wrote:
> > This isn't used for work in the memory reclaim path, and we may need
> > to sync with work queues that also are not flagged memory reclaim. This
> > fixes a kernel warning if we ever do sync with such a work queue.
> 
> We do need it during memory reclaim: memory reclaim in general
> does I/O, which can be on NVMe.  We then need the workqueue to
> abort a command or reset an overloaded controller to make progress.
> Not having WQ_MEM_RECLAIM risks deadlocks in heavily loaded systems.

Darn. Invalidating a disk drains lru, which syncs with work scheduled
on the system_wq. Syncing with that from a memory reclaim work queue
hits a kernel warning.

That lru drain work is reclaiming memory, though. Does this need
to be using a WQ_MEM_RECLAIM queue, then?

This is the alternate patch I didn't plan to submit:

---
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 0e32bc7..f7cc91e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -356,6 +359,7 @@ extern struct workqueue_struct *system_unbound_wq;
 extern struct workqueue_struct *system_freezable_wq;
 extern struct workqueue_struct *system_power_efficient_wq;
 extern struct workqueue_struct *system_freezable_power_efficient_wq;
+extern struct workqueue_struct *system_mem_wq;

 extern struct workqueue_struct *
 __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 61a0264..57a50d2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5483,10 +5483,13 @@ static int __init init_workqueues(void)
        system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_power_efficient",
                                              WQ_FREEZABLE | WQ_POWER_EFFICIENT,
                                              0);
+       system_mem_wq = alloc_workqueue("events_mem_unbound", WQ_UNBOUND | WQ_MEM_RECLAIM,
+                                           WQ_UNBOUND_MAX_ACTIVE);
        BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
               !system_unbound_wq || !system_freezable_wq ||
               !system_power_efficient_wq ||
-              !system_freezable_power_efficient_wq);
+              !system_freezable_power_efficient_wq ||
+              !system_mem_wq);

        wq_watchdog_init();

diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e9..eecf98a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -685,7 +685,7 @@ void lru_add_drain_all(void)
                    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
                    need_activate_page_drain(cpu)) {
                        INIT_WORK(work, lru_add_drain_per_cpu);
-                       schedule_work_on(cpu, work);
+                       queue_work_on(cpu, system_mem_wq, work);
                        cpumask_set_cpu(cpu, &has_work);
                }
        }
--

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 01/13] blk-mq: End unstarted requests on dying queue
  2016-02-10 18:17 ` [PATCH for-4.5 01/13] blk-mq: End unstarted requests on dying queue Keith Busch
@ 2016-02-11 12:13   ` Johannes Thumshirn
  2016-02-11 12:30   ` Sagi Grimberg
  1 sibling, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:13 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:18AM -0700, Keith Busch wrote:
> Go directly to ending a request if it wasn't started. Previously, completing a
> request may invoke a driver callback for a request it didn't initialize.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> Acked-by: Christoph Hellwig <hch at lst.de>

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 02/13] NVMe: Fix io incapable return values
  2016-02-10 18:17 ` [PATCH for-4.5 02/13] NVMe: Fix io incapable return values Keith Busch
@ 2016-02-11 12:14   ` Johannes Thumshirn
  0 siblings, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:14 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:19AM -0700, Keith Busch wrote:
> The function returns true when the controller can't handle IO.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> Reviewed-by: Sagi Grimberg <sagig at mellanox.com>
> Acked-by: Christoph Hellwig <hch at lst.de>

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 03/13] NVMe: Allow request merges
  2016-02-10 18:17 ` [PATCH for-4.5 03/13] NVMe: Allow request merges Keith Busch
  2016-02-10 18:36   ` Christoph Hellwig
@ 2016-02-11 12:15   ` Johannes Thumshirn
  2016-02-11 12:33   ` Sagi Grimberg
  2 siblings, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:15 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:20AM -0700, Keith Busch wrote:
> It is generally more efficient to submit larger IO.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 05/13] NVMe: Fix namespace removal deadlock
  2016-02-10 18:17 ` [PATCH for-4.5 05/13] NVMe: Fix namespace removal deadlock Keith Busch
@ 2016-02-11 12:19   ` Johannes Thumshirn
  2016-02-11 16:38   ` Wenbo Wang
  1 sibling, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:19 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:22AM -0700, Keith Busch wrote:
> Namespace removal holds the namespace list mutex. If the controller
> fails during namespace removal, the reset work will deadlock trying to
> acquire the mutex, and no progress will be made.
> 
> This patch completes the potentially blocking parts of namespace
> removal in a controller work queue so that reset work can cleanup a
> failed controller.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues
  2016-02-10 18:17 ` [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues Keith Busch
  2016-02-10 18:47   ` Christoph Hellwig
@ 2016-02-11 12:22   ` Johannes Thumshirn
  2016-02-11 12:41   ` Sagi Grimberg
  2016-02-11 16:47   ` Wenbo Wang
  3 siblings, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:22 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:24AM -0700, Keith Busch wrote:
> It's possible a request may get to the driver after the nvme queue was
> disabled. This makes the driver requeue the request if that happens.
> 
> Note the request is still "started" by the driver, but requeuing will
> clear the start state for timeout handling.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove
  2016-02-10 18:17 ` [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove Keith Busch
  2016-02-10 18:48   ` Christoph Hellwig
@ 2016-02-11 12:26   ` Johannes Thumshirn
  2016-02-11 12:42   ` Sagi Grimberg
  2 siblings, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:26 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:25AM -0700, Keith Busch wrote:
> A device failure or link down wouldn't have been detected during namespace
> removal. This patch keeps the device in the list for polling so that the
> thread may see such failure and initiate a reset. The device is removed
> from the list after disable, so we can safely flush the reset work as
> it can't be requeued when disable completes.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 09/13] NVMe: Simplify device reset failure
  2016-02-10 18:17 ` [PATCH for-4.5 09/13] NVMe: Simplify device reset failure Keith Busch
  2016-02-10 18:53   ` Christoph Hellwig
@ 2016-02-11 12:28   ` Johannes Thumshirn
  1 sibling, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:28 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:26AM -0700, Keith Busch wrote:
> A reset failure schedules the device to unbind from the driver through
> the pci driver's remove. This cleans up all initialization, so there is
> no need to duplicate the potentially racy cleanup.
> 
> To help understand why a reset failed, the status is logged with the
> existing warning message.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings
  2016-02-10 21:10     ` Keith Busch
@ 2016-02-11 12:29       ` Sagi Grimberg
  2016-02-11 15:12         ` Keith Busch
  0 siblings, 1 reply; 57+ messages in thread
From: Sagi Grimberg @ 2016-02-11 12:29 UTC (permalink / raw)



>> Do we need this message at all?
>
> I don't find it useful for anything other than confirming a test hit
> that path. Delete it?

Perhaps we can have it dev_dbg.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 01/13] blk-mq: End unstarted requests on dying queue
  2016-02-10 18:17 ` [PATCH for-4.5 01/13] blk-mq: End unstarted requests on dying queue Keith Busch
  2016-02-11 12:13   ` Johannes Thumshirn
@ 2016-02-11 12:30   ` Sagi Grimberg
  1 sibling, 0 replies; 57+ messages in thread
From: Sagi Grimberg @ 2016-02-11 12:30 UTC (permalink / raw)


Again,

Reviewed-by: Sagi Grimberg <sagig at mellanox.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 03/13] NVMe: Allow request merges
  2016-02-10 18:17 ` [PATCH for-4.5 03/13] NVMe: Allow request merges Keith Busch
  2016-02-10 18:36   ` Christoph Hellwig
  2016-02-11 12:15   ` Johannes Thumshirn
@ 2016-02-11 12:33   ` Sagi Grimberg
  2 siblings, 0 replies; 57+ messages in thread
From: Sagi Grimberg @ 2016-02-11 12:33 UTC (permalink / raw)



> It is generally more efficient to submit larger IO.
>
> Signed-off-by: Keith Busch <keith.busch at intel.com>

Oh yes, I've wanted that for some time now, ever since Jens
and I addressed some issues with front merges and gaps that were
detected by iser (I had this patch in my pipe but didn't have time to
test...)

Reviewed-by: Sagi Grimberg <sagig at mellanox.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 10/13] NVMe: Move error handling to failed reset handler
  2016-02-10 18:17 ` [PATCH for-4.5 10/13] NVMe: Move error handling to failed reset handler Keith Busch
@ 2016-02-11 12:34   ` Johannes Thumshirn
  2016-02-11 12:50   ` Sagi Grimberg
  1 sibling, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:34 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:27AM -0700, Keith Busch wrote:
> This moves the dead queue handling out of the namespace removal path
> and into the reset failure path. It fixes a deadlock condition if the
> controller fails or link down during del_gendisk.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues
  2016-02-10 18:17 ` [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues Keith Busch
  2016-02-10 18:47   ` Christoph Hellwig
  2016-02-11 12:22   ` Johannes Thumshirn
@ 2016-02-11 12:41   ` Sagi Grimberg
  2016-02-11 16:47   ` Wenbo Wang
  3 siblings, 0 replies; 57+ messages in thread
From: Sagi Grimberg @ 2016-02-11 12:41 UTC (permalink / raw)


Looks ok,

Reviewed-by: Sagi Grimberg <sagig at mellanox.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove
  2016-02-10 18:17 ` [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove Keith Busch
  2016-02-10 18:48   ` Christoph Hellwig
  2016-02-11 12:26   ` Johannes Thumshirn
@ 2016-02-11 12:42   ` Sagi Grimberg
  2 siblings, 0 replies; 57+ messages in thread
From: Sagi Grimberg @ 2016-02-11 12:42 UTC (permalink / raw)


Looks good,

Reviewed-by: Sagi Grimberg <sagig at mellanox.com>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 11/13] NVMe: Dead namespace handling
  2016-02-10 18:17 ` [PATCH for-4.5 11/13] NVMe: Dead namespace handling Keith Busch
@ 2016-02-11 12:43   ` Johannes Thumshirn
  2016-02-11 12:59   ` Sagi Grimberg
  1 sibling, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:43 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:28AM -0700, Keith Busch wrote:
> This adds a "dead" state to a namespace and revalidates such a namespace
> to 0 capacity. This will force buffered writers to stop writing pages
> that can't be synced, and ends requests in failure if any is submitted
> to such a namespace.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 12/13] NVMe: Mark queues as dead on degraded controller
  2016-02-10 18:17 ` [PATCH for-4.5 12/13] NVMe: Mark queues as dead on degraded controller Keith Busch
@ 2016-02-11 12:44   ` Johannes Thumshirn
  2016-02-11 13:00   ` Sagi Grimberg
  1 sibling, 0 replies; 57+ messages in thread
From: Johannes Thumshirn @ 2016-02-11 12:44 UTC (permalink / raw)


On Wed, Feb 10, 2016@11:17:29AM -0700, Keith Busch wrote:
> If a previously optimal controller goes degraded, the IO queues will
> stop working, but the controller is still manageable. Prior to removing
> namespaces, we need to fail the queues so queued and new IO will fail
> and allow namespace removal to progress.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>

Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 10/13] NVMe: Move error handling to failed reset handler
  2016-02-10 18:17 ` [PATCH for-4.5 10/13] NVMe: Move error handling to failed reset handler Keith Busch
  2016-02-11 12:34   ` Johannes Thumshirn
@ 2016-02-11 12:50   ` Sagi Grimberg
  2016-02-11 15:11     ` Keith Busch
  1 sibling, 1 reply; 57+ messages in thread
From: Sagi Grimberg @ 2016-02-11 12:50 UTC (permalink / raw)




On 10/02/2016 20:17, Keith Busch wrote:
> This moves the dead queue handling out of the namespace removal path
> and into the reset failure path. It fixes a deadlock condition if the
> controller fails or the link goes down during del_gendisk.

How does it fix the deadlock?

>
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>   drivers/nvme/host/core.c | 19 ++-----------------
>   drivers/nvme/host/nvme.h |  1 +
>   drivers/nvme/host/pci.c  | 22 ++++++++++++++++++++++
>   3 files changed, 25 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index be27f9f..41b595c 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -64,7 +64,7 @@ static void nvme_free_ns(struct kref *kref)
>   	kfree(ns);
>   }
>
> -static void nvme_put_ns(struct nvme_ns *ns)
> +void nvme_put_ns(struct nvme_ns *ns)
>   {
>   	kref_put(&ns->kref, nvme_free_ns);
>   }
> @@ -1113,28 +1113,13 @@ static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
>   static void nvme_ns_remove_work(struct work_struct *work)
>   {
>   	struct nvme_ns *ns = container_of(work, struct nvme_ns, remove_work);
> -	bool kill = nvme_io_incapable(ns->ctrl) &&
> -			!blk_queue_dying(ns->queue);
> -
> -	if (kill) {
> -		blk_set_queue_dying(ns->queue);
> -
> -		/*
> -		 * The controller was shutdown first if we got here through
> -		 * device removal. The shutdown may requeue outstanding
> -		 * requests. These need to be aborted immediately so
> -		 * del_gendisk doesn't block indefinitely for their completion.
> -		 */
> -		blk_mq_abort_requeue_list(ns->queue);
> -	}
> +
>   	if (ns->disk->flags & GENHD_FL_UP) {
>   		if (blk_get_integrity(ns->disk))
>   			blk_integrity_unregister(ns->disk);
>   		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
>   					&nvme_ns_attr_group);
>   		del_gendisk(ns->disk);
> -	}
> -	if (kill || !blk_queue_dying(ns->queue)) {
>   		blk_mq_abort_requeue_list(ns->queue);
>   		blk_cleanup_queue(ns->queue);
>   	}
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index d330512..19a64b2 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -270,6 +270,7 @@ int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
>   int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
>   			dma_addr_t dma_addr, u32 *result);
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
> +void nvme_put_ns(struct nvme_ns *ns);
>
>   extern spinlock_t dev_list_lock;
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3381bac..a18e4ab 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1898,11 +1898,33 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
>   	kfree(dev);
>   }
>
> +static void nvme_kill_ns_queues(struct nvme_dev *dev)
> +{
> +	struct nvme_ns *ns;
> +	struct nvme_ctrl *ctrl = &dev->ctrl;
> +
> +	nvme_dev_disable(dev, false);
> +
> +	mutex_lock(&ctrl->namespaces_mutex);
> +	list_for_each_entry(ns, &ctrl->namespaces, list) {
> +		if (!kref_get_unless_zero(&ns->kref))
> +			continue;
> +
> +		blk_set_queue_dying(ns->queue);
> +		blk_mq_abort_requeue_list(ns->queue);
> +		blk_mq_start_stopped_hw_queues(ns->queue, true);
> +
> +		nvme_put_ns(ns);
> +	}
> +	mutex_unlock(&ctrl->namespaces_mutex);
> +}
> +

Why on earth is this pci specific? This should be in the
core. Aside from that, I'd really prefer if the core can handle this
without having the pci (or other) triggering it explicitly, but if
this must move out of the ns remove then we need documentation on
what are the rules of when the driver needs to call it.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 11/13] NVMe: Dead namespace handling
  2016-02-10 18:17 ` [PATCH for-4.5 11/13] NVMe: Dead namespace handling Keith Busch
  2016-02-11 12:43   ` Johannes Thumshirn
@ 2016-02-11 12:59   ` Sagi Grimberg
  2016-02-11 15:07     ` Keith Busch
  1 sibling, 1 reply; 57+ messages in thread
From: Sagi Grimberg @ 2016-02-11 12:59 UTC (permalink / raw)



> This adds a "dead" state to a namespace and revalidates such a namespace
> to 0 capacity. This will force buffered writers to stop writing pages
> that can't be synced, and ends requests in failure if any is submitted
> to such a namespace.

This is sorta going towards a namespace state machine (like scsi). Maybe
we need to centralize it correctly instead of adding states that are
relevant in sporadic areas?

>
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>   drivers/nvme/host/core.c |  4 ++++
>   drivers/nvme/host/nvme.h |  1 +
>   drivers/nvme/host/pci.c  | 12 +++++++++++-
>   3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 41b595c..84e9f41 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -560,6 +560,10 @@ static int nvme_revalidate_disk(struct gendisk *disk)
>   	u16 old_ms;
>   	unsigned short bs;
>
> +	if (test_bit(NVME_NS_DEAD, &ns->flags)) {
> +		set_capacity(disk, 0);
> +		return -ENODEV;
> +	}
>   	if (nvme_identify_ns(ns->ctrl, ns->ns_id, &id)) {
>   		dev_warn(ns->ctrl->dev, "%s: Identify failure nvme%dn%d\n",
>   				__func__, ns->ctrl->instance, ns->ns_id);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 19a64b2..e4b4110 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -117,6 +117,7 @@ struct nvme_ns {
>   	unsigned long flags;
>
>   #define NVME_NS_REMOVING 0
> +#define NVME_NS_DEAD     1
>
>   	u64 mode_select_num_blocks;
>   	u32 mode_select_block_len;
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a18e4ab..7fd8a54 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -678,7 +678,9 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>
>   	spin_lock_irq(&nvmeq->q_lock);
>   	if (unlikely(nvmeq->cq_vector < 0)) {
> -		ret = BLK_MQ_RQ_QUEUE_BUSY;
> +		ret = test_bit(NVME_NS_DEAD, &ns->flags) ?
> +					BLK_MQ_RQ_QUEUE_ERROR :
> +					BLK_MQ_RQ_QUEUE_BUSY;
>   		spin_unlock_irq(&nvmeq->q_lock);
>   		goto out;
>   	}

I can't say I'm a fan of doing all this in queue_rq...

besides why is this state check under the cq_vector < 0 condition?
This is really confusing...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 12/13] NVMe: Mark queues as dead on degraded controller
  2016-02-10 18:17 ` [PATCH for-4.5 12/13] NVMe: Mark queues as dead on degraded controller Keith Busch
  2016-02-11 12:44   ` Johannes Thumshirn
@ 2016-02-11 13:00   ` Sagi Grimberg
  1 sibling, 0 replies; 57+ messages in thread
From: Sagi Grimberg @ 2016-02-11 13:00 UTC (permalink / raw)



> If a previously optimal controller goes degraded, the IO queues will
> stop working, but the controller is still managable. Prior to removing
> namespace, we need to fail the queues so queued and new IO will fail
> and allow namespace removal to progress.
>
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>   drivers/nvme/host/pci.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 7fd8a54..18c14f6 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1989,6 +1989,7 @@ static void nvme_reset_work(struct work_struct *work)
>   	 */
>   	if (dev->online_queues < 2) {
>   		dev_warn(dev->dev, "IO queues not created\n");
> +		nvme_kill_ns_queues(dev);

Another example why we need to document when should a driver
call this....

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 06/13] NVMe: Remove WQ_MEM_RECLAIM from nvme work queue
  2016-02-10 18:46   ` Christoph Hellwig
  2016-02-10 23:37     ` Keith Busch
@ 2016-02-11 14:52     ` Keith Busch
  1 sibling, 0 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-11 14:52 UTC (permalink / raw)


On Wed, Feb 10, 2016@10:46:41AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 10, 2016@11:17:23AM -0700, Keith Busch wrote:
> > This isn't used for work in the memory reclaim path, and we may need
> > to sync with work queues that are not flagged for memory reclaim. This
> > fixes a kernel warning if we ever do sync with such a work queue.
> 
> We do need it during memory reclaim: memory reclaim in general
> does I/O, which can be on NVMe.  We then need the workqueue to
> abort a command or reset an overloaded controller to make progress.
> Not having WQ_MEM_RECLAIM risks deadlocks in heavily loaded systems.

There was a more obvious way to do this that took me another day to
see. We schedule remove_work on the system work queue, and all this
cleanup can happen there instead of in the reset_work on the nvme
work queue.

I will drop the unneeded patches from this series, and resubmit the rest
with the reviews, fixes, and recommended code changes and documentation.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 11/13] NVMe: Dead namespace handling
  2016-02-11 12:59   ` Sagi Grimberg
@ 2016-02-11 15:07     ` Keith Busch
  0 siblings, 0 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-11 15:07 UTC (permalink / raw)


On Thu, Feb 11, 2016@02:59:08PM +0200, Sagi Grimberg wrote:
> >This adds a "dead" state to a namespace and revalidates such a namespace
> >to 0 capacity. This will force buffered writers to stop writing pages
> >that can't be synced, and ends requests in failure if any is submitted
> >to such a namespace.
> 
> This is sorta going towards a namespace state machine (like scsi). Maybe
> we need to centralize it correctly instead of adding states that are
> relevant in sporadic areas?

Yeah, we need to borrow the better ideas from scsi. Slowly moving to a
state machine.
 
> >  	if (unlikely(nvmeq->cq_vector < 0)) {
> >-		ret = BLK_MQ_RQ_QUEUE_BUSY;
> >+		ret = test_bit(NVME_NS_DEAD, &ns->flags) ?
> >+					BLK_MQ_RQ_QUEUE_ERROR :
> >+					BLK_MQ_RQ_QUEUE_BUSY;
> >  		spin_unlock_irq(&nvmeq->q_lock);
> >  		goto out;
> >  	}
> 
> I can't say I'm a fan of doing all this in queue_rq...
> 
> besides why is this state check under the cq_vector < 0 condition?
> This is really confusing...

Agreed, that's not very clear. We will potentially have an option to
disable interrupt vectors entirely in the future, so this won't be an
appropriate criterion if/when that happens.

I've a follow up series that uses queue states instead of irq and
namespace states, but planned to target 4.6 unless an alternate proposal
surfaces.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 10/13] NVMe: Move error handling to failed reset handler
  2016-02-11 12:50   ` Sagi Grimberg
@ 2016-02-11 15:11     ` Keith Busch
  0 siblings, 0 replies; 57+ messages in thread
From: Keith Busch @ 2016-02-11 15:11 UTC (permalink / raw)


On Thu, Feb 11, 2016@02:50:54PM +0200, Sagi Grimberg wrote:
> On 10/02/2016 20:17, Keith Busch wrote:
> >This moves the dead queue handling out of the namespace removal path
> >and into the reset failure path. It fixes a deadlock condition if the
> >controller fails or the link goes down during del_gendisk.
> 
> How does it fix the deadlock?

Previously the queues were set up for failure prior to calling del_gendisk
only if the controller was broken. If the controller happened to be
optimal, this process would have been skipped. If the controller then
failed, the queues wouldn't be killed.
 
> >+	nvme_dev_disable(dev, false);
> >+
> >+	mutex_lock(&ctrl->namespaces_mutex);
> >+	list_for_each_entry(ns, &ctrl->namespaces, list) {
> >+		if (!kref_get_unless_zero(&ns->kref))
> >+			continue;
> >+
> >+		blk_set_queue_dying(ns->queue);
> >+		blk_mq_abort_requeue_list(ns->queue);
> >+		blk_mq_start_stopped_hw_queues(ns->queue, true);
> >+
> >+		nvme_put_ns(ns);
> >+	}
> >+	mutex_unlock(&ctrl->namespaces_mutex);
> >+}
> >+
> 
> Why on earth is this pci specific? This should be in the
> core. Aside from that, I'd really prefer if the core can handle this
> without having the pci (or other) triggering it explicitly, but if
> this must move out of the ns remove then we need documentation on
> what are the rules of when the driver needs to call it.

It's PCI specific only because of the potential need to disable the
controller first (nvme_dev_disable), which is currently PCI specific.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings
  2016-02-11 12:29       ` Sagi Grimberg
@ 2016-02-11 15:12         ` Keith Busch
  2016-02-11 15:18           ` Christoph Hellwig
  0 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-11 15:12 UTC (permalink / raw)


On Thu, Feb 11, 2016@02:29:59PM +0200, Sagi Grimberg wrote:
> 
> >>Do we need this message at all?
> >
> >I don't find it useful for anything other than confirming a test hit
> >that path. Delete it?
> 
> Perhaps we can have it dev_dbg.

Thanks, sounds like the right compromise to me.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings
  2016-02-11 15:12         ` Keith Busch
@ 2016-02-11 15:18           ` Christoph Hellwig
  0 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-02-11 15:18 UTC (permalink / raw)


On Thu, Feb 11, 2016@03:12:45PM +0000, Keith Busch wrote:
> On Thu, Feb 11, 2016@02:29:59PM +0200, Sagi Grimberg wrote:
> > 
> > >>Do we need this message at all?
> > >
> > >I don't find it useful for anything other than confirming a test hit
> > >that path. Delete it?
> > 
> > Perhaps we can have it dev_dbg.
> 
> Thanks, sounds like the right compromise to me.

Fine with me.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 05/13] NVMe: Fix namespace removal deadlock
  2016-02-10 18:17 ` [PATCH for-4.5 05/13] NVMe: Fix namespace removal deadlock Keith Busch
  2016-02-11 12:19   ` Johannes Thumshirn
@ 2016-02-11 16:38   ` Wenbo Wang
  1 sibling, 0 replies; 57+ messages in thread
From: Wenbo Wang @ 2016-02-11 16:38 UTC (permalink / raw)


Not sure if async namespace removal has any potential risk.

-----Original Message-----
From: Keith Busch [mailto:keith.busch@intel.com] 
Sent: Thursday, February 11, 2016 2:17 AM
To: linux-nvme at lists.infradead.org; Jens Axboe
Cc: Christoph Hellwig; Sagi Grimberg; Wenbo Wang; Keith Busch
Subject: [PATCH for-4.5 05/13] NVMe: Fix namespace removal deadlock

Namespace removal holds the namespace list mutex. If the controller fails during namespace removal, the reset work will deadlock trying to acquire the mutex, and no progress will be made.

This patch completes the potentially blocking parts of namespace removal in a controller work queue so that reset work can cleanup a failed controller.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 81 ++++++++++++++++++++++++++++++------------------
 drivers/nvme/host/nvme.h |  6 ++++
 2 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 852b789..be27f9f 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -48,6 +48,10 @@ static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
 
+	mutex_lock(&ns->ctrl->namespaces_mutex);
+	list_del_init(&ns->list);
+	mutex_unlock(&ns->ctrl->namespaces_mutex);
+
 	if (ns->type == NVME_NS_LIGHTNVM)
 		nvme_nvm_unregister(ns->queue, ns->disk->disk_name);
 
@@ -1106,6 +1110,37 @@ static struct nvme_ns *nvme_find_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	return NULL;
 }
 
+static void nvme_ns_remove_work(struct work_struct *work)
+{
+	struct nvme_ns *ns = container_of(work, struct nvme_ns, remove_work);
+	bool kill = nvme_io_incapable(ns->ctrl) &&
+			!blk_queue_dying(ns->queue);
+
+	if (kill) {
+		blk_set_queue_dying(ns->queue);
+
+		/*
+		 * The controller was shutdown first if we got here through
+		 * device removal. The shutdown may requeue outstanding
+		 * requests. These need to be aborted immediately so
+		 * del_gendisk doesn't block indefinitely for their completion.
+		 */
+		blk_mq_abort_requeue_list(ns->queue);
+	}
+	if (ns->disk->flags & GENHD_FL_UP) {
+		if (blk_get_integrity(ns->disk))
+			blk_integrity_unregister(ns->disk);
+		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
+					&nvme_ns_attr_group);
+		del_gendisk(ns->disk);
+	}
+	if (kill || !blk_queue_dying(ns->queue)) {
+		blk_mq_abort_requeue_list(ns->queue);
+		blk_cleanup_queue(ns->queue);
+	}
+	nvme_put_ns(ns);
+}
+
 static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 {
 	struct nvme_ns *ns;
@@ -1161,6 +1196,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	list_add_tail(&ns->list, &ctrl->namespaces);
 	kref_get(&ctrl->kref);
+	INIT_WORK(&ns->remove_work, nvme_ns_remove_work);
 	if (ns->type == NVME_NS_LIGHTNVM)
 		return;
 
@@ -1180,35 +1216,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 static void nvme_ns_remove(struct nvme_ns *ns)
 {
-	bool kill = nvme_io_incapable(ns->ctrl) &&
-			!blk_queue_dying(ns->queue);
-
-	lockdep_assert_held(&ns->ctrl->namespaces_mutex);
-
-	if (kill) {
-		blk_set_queue_dying(ns->queue);
-
-		/*
-		 * The controller was shutdown first if we got here through
-		 * device removal. The shutdown may requeue outstanding
-		 * requests. These need to be aborted immediately so
-		 * del_gendisk doesn't block indefinitely for their completion.
-		 */
-		blk_mq_abort_requeue_list(ns->queue);
-	}
-	if (ns->disk->flags & GENHD_FL_UP) {
-		if (blk_get_integrity(ns->disk))
-			blk_integrity_unregister(ns->disk);
-		sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
-					&nvme_ns_attr_group);
-		del_gendisk(ns->disk);
-	}
-	if (kill || !blk_queue_dying(ns->queue)) {
-		blk_mq_abort_requeue_list(ns->queue);
-		blk_cleanup_queue(ns->queue);
-	}
-	list_del_init(&ns->list);
-	nvme_put_ns(ns);
+	if (!test_and_set_bit(NVME_NS_REMOVING, &ns->flags))
+		queue_work(ns->ctrl->ns_workq, &ns->remove_work);
 }
 
 static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
@@ -1305,6 +1314,8 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list)
 		nvme_ns_remove(ns);
 	mutex_unlock(&ctrl->namespaces_mutex);
+
+	flush_workqueue(ctrl->ns_workq);
 }
 
 static DEFINE_IDA(nvme_instance_ida);
@@ -1337,12 +1348,14 @@ static void nvme_release_instance(struct nvme_ctrl *ctrl)
 }
 
 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
- {
+{
 	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
 
 	spin_lock(&dev_list_lock);
 	list_del(&ctrl->node);
 	spin_unlock(&dev_list_lock);
+
+	destroy_workqueue(ctrl->ns_workq);
 }
 
 static void nvme_free_ctrl(struct kref *kref)
@@ -1392,11 +1405,19 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	get_device(ctrl->device);
 	dev_set_drvdata(ctrl->device, ctrl);
 
+	ctrl->ns_workq = alloc_workqueue("nvme-ns", WQ_UNBOUND, 0);
+	if (!ctrl->ns_workq) {
+		ret = -ENOMEM;
+		goto out_destroy_device;
+	}
+
 	spin_lock(&dev_list_lock);
 	list_add_tail(&ctrl->node, &nvme_ctrl_list);
 	spin_unlock(&dev_list_lock);
 
 	return 0;
+out_destroy_device:
+	device_destroy(nvme_class, MKDEV(nvme_char_major, ctrl->instance));
 out_release_instance:
 	nvme_release_instance(ctrl);
 out:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9664d07..d330512 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -72,6 +72,7 @@ struct nvme_ctrl {
 	struct mutex namespaces_mutex;
 	struct device *device;	/* char device */
 	struct list_head node;
+	struct workqueue_struct *ns_workq;
 
 	char name[12];
 	char serial[20];
@@ -112,6 +113,11 @@ struct nvme_ns {
 	bool ext;
 	u8 pi_type;
 	int type;
+	struct work_struct remove_work;
+	unsigned long flags;
+
+#define NVME_NS_REMOVING 0
+
 	u64 mode_select_num_blocks;
 	u32 mode_select_block_len;
 };
--
2.6.2.307.g37023ba

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues
  2016-02-10 18:17 ` [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues Keith Busch
                     ` (2 preceding siblings ...)
  2016-02-11 12:41   ` Sagi Grimberg
@ 2016-02-11 16:47   ` Wenbo Wang
  2016-02-11 17:00     ` Keith Busch
  3 siblings, 1 reply; 57+ messages in thread
From: Wenbo Wang @ 2016-02-11 16:47 UTC (permalink / raw)


If we add synchronize_rcu and blk_sync_queue, I guess this is not needed.

-----Original Message-----
From: Keith Busch [mailto:keith.busch@intel.com] 
Sent: Thursday, February 11, 2016 2:17 AM
To: linux-nvme at lists.infradead.org; Jens Axboe
Cc: Christoph Hellwig; Sagi Grimberg; Wenbo Wang; Keith Busch
Subject: [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues

It's possible a request may get to the driver after the nvme queue was disabled. This requeues the request if that happens.

Note the request is still "started" by the driver, but requeuing will clear the start state for timeout handling.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5bea054..f8bb781 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -678,6 +678,11 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	blk_mq_start_request(req);
 
 	spin_lock_irq(&nvmeq->q_lock);
+	if (unlikely(nvmeq->cq_vector < 0)) {
+		ret = BLK_MQ_RQ_QUEUE_BUSY;
+		spin_unlock_irq(&nvmeq->q_lock);
+		goto out;
+	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
 	nvme_process_cq(nvmeq);
 	spin_unlock_irq(&nvmeq->q_lock);
--
2.6.2.307.g37023ba

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues
  2016-02-11 16:47   ` Wenbo Wang
@ 2016-02-11 17:00     ` Keith Busch
  2016-02-11 17:21       ` Wenbo Wang
  0 siblings, 1 reply; 57+ messages in thread
From: Keith Busch @ 2016-02-11 17:00 UTC (permalink / raw)


On Thu, Feb 11, 2016@04:47:31PM +0000, Wenbo Wang wrote:
> If adding synchronize_rcu and blk_sync_queue, I guess this is not needed.

That was to sync IO tasks to make sure they saw the h/w ctx STOPPED
flag. We still need to restart those queues at some point to flush queued
requests to completion, and this patch is just prepping for that.

I used "blk_queue_dying" in a previous patch series. I went back
and forth on using that criterion. SCSI has device states, so I'm trying
to apply those ideas to namespaces.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues
  2016-02-11 17:00     ` Keith Busch
@ 2016-02-11 17:21       ` Wenbo Wang
  0 siblings, 0 replies; 57+ messages in thread
From: Wenbo Wang @ 2016-02-11 17:21 UTC (permalink / raw)


If the device is removed, I think nvme_clear_queue cleans up all those pending requests.
If the device is restarted, shouldn't cq_vector be set up before enabling the h/w ctx?

-----Original Message-----
From: Keith Busch [mailto:keith.busch@intel.com] 
Sent: Friday, February 12, 2016 1:00 AM
To: Wenbo Wang
Cc: linux-nvme at lists.infradead.org; Jens Axboe; Christoph Hellwig; Sagi Grimberg
Subject: Re: [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues

On Thu, Feb 11, 2016@04:47:31PM +0000, Wenbo Wang wrote:
> If adding synchronize_rcu and blk_sync_queue, I guess this is not needed.

That was to sync IO tasks to make sure they saw the h/w ctx STOPPED flag. We still need to restart those queues at some point to flush queued requests to completion, and this patch is just prepping for that.

I used "blk_queue_dying" in a previous patch series. I went back and forth on using that criterion. SCSI has device states, so I'm trying to apply those ideas to namespaces.

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2016-02-11 17:21 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-10 18:17 [PATCH for-4.5 00/13] NVMe fixups for 4.5 Keith Busch
2016-02-10 18:17 ` [PATCH for-4.5 01/13] blk-mq: End unstarted requests on dying queue Keith Busch
2016-02-11 12:13   ` Johannes Thumshirn
2016-02-11 12:30   ` Sagi Grimberg
2016-02-10 18:17 ` [PATCH for-4.5 02/13] NVMe: Fix io incapable return values Keith Busch
2016-02-11 12:14   ` Johannes Thumshirn
2016-02-10 18:17 ` [PATCH for-4.5 03/13] NVMe: Allow request merges Keith Busch
2016-02-10 18:36   ` Christoph Hellwig
2016-02-10 18:37     ` Jens Axboe
2016-02-10 21:01       ` Keith Busch
2016-02-10 21:19         ` Jens Axboe
2016-02-11 12:15   ` Johannes Thumshirn
2016-02-11 12:33   ` Sagi Grimberg
2016-02-10 18:17 ` [PATCH for-4.5 04/13] NVMe: Set queue limits max_dev_sectors Keith Busch
2016-02-10 18:20   ` Christoph Hellwig
2016-02-10 18:24     ` Keith Busch
2016-02-10 18:40       ` Christoph Hellwig
2016-02-10 19:49         ` Keith Busch
2016-02-10 22:53           ` Martin K. Petersen
2016-02-10 18:17 ` [PATCH for-4.5 05/13] NVMe: Fix namespace removal deadlock Keith Busch
2016-02-11 12:19   ` Johannes Thumshirn
2016-02-11 16:38   ` Wenbo Wang
2016-02-10 18:17 ` [PATCH for-4.5 06/13] NVMe: Remove WQ_MEM_RECLAIM from nvme work queue Keith Busch
2016-02-10 18:46   ` Christoph Hellwig
2016-02-10 23:37     ` Keith Busch
2016-02-11 14:52     ` Keith Busch
2016-02-10 18:17 ` [PATCH for-4.5 07/13] NVMe: Requeue requests on suspended queues Keith Busch
2016-02-10 18:47   ` Christoph Hellwig
2016-02-11 12:22   ` Johannes Thumshirn
2016-02-11 12:41   ` Sagi Grimberg
2016-02-11 16:47   ` Wenbo Wang
2016-02-11 17:00     ` Keith Busch
2016-02-11 17:21       ` Wenbo Wang
2016-02-10 18:17 ` [PATCH for-4.5 08/13] NVMe: Poll device while still active during remove Keith Busch
2016-02-10 18:48   ` Christoph Hellwig
2016-02-11 12:26   ` Johannes Thumshirn
2016-02-11 12:42   ` Sagi Grimberg
2016-02-10 18:17 ` [PATCH for-4.5 09/13] NVMe: Simplify device reset failure Keith Busch
2016-02-10 18:53   ` Christoph Hellwig
2016-02-11 12:28   ` Johannes Thumshirn
2016-02-10 18:17 ` [PATCH for-4.5 10/13] NVMe: Move error handling to failed reset handler Keith Busch
2016-02-11 12:34   ` Johannes Thumshirn
2016-02-11 12:50   ` Sagi Grimberg
2016-02-11 15:11     ` Keith Busch
2016-02-10 18:17 ` [PATCH for-4.5 11/13] NVMe: Dead namespace handling Keith Busch
2016-02-11 12:43   ` Johannes Thumshirn
2016-02-11 12:59   ` Sagi Grimberg
2016-02-11 15:07     ` Keith Busch
2016-02-10 18:17 ` [PATCH for-4.5 12/13] NVMe: Mark queues as dead on degraded controller Keith Busch
2016-02-11 12:44   ` Johannes Thumshirn
2016-02-11 13:00   ` Sagi Grimberg
2016-02-10 18:17 ` [PATCH for-4.5 13/13] NVMe: Rate limit nvme IO warnings Keith Busch
2016-02-10 18:54   ` Christoph Hellwig
2016-02-10 21:10     ` Keith Busch
2016-02-11 12:29       ` Sagi Grimberg
2016-02-11 15:12         ` Keith Busch
2016-02-11 15:18           ` Christoph Hellwig
