* [PATCH 0/6] NVMe related fixes
@ 2017-01-04 22:41 ` Keith Busch
  0 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2017-01-04 22:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin, Keith Busch

I've been looking into an old regression originally reported here:

  http://lists.infradead.org/pipermail/linux-nvme/2016-August/005699.html

The root cause is that blk-mq's hot cpu notifier gets stuck indefinitely
during suspend, waiting on requests that entered a stopped hardware
context; that hardware context will not be restarted until suspend
completes.
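
To illustrate, the hang looks roughly like this (a simplified sketch,
not exact code):

  suspend / hotplug path                stopped hardware context
  ----------------------                ------------------------
  blk_mq_freeze_queue_start(q);
  blk_mq_freeze_queue_wait(q);
    /* waits for q->q_usage_counter     a request already allocated on
     * to drain */                      the stopped hctx still holds a
                                        usage reference, but the hctx
                                        won't run again until suspend
                                        completes

The freeze can never finish, so suspend never completes.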

I originally set out to unwind the requests and block on reentry,
but blk-mq doesn't support doing that: once a request enters a hardware
context, it needs to complete on that context. Since the context won't be
starting again, we need to do _something_ with those entered requests,
and unfortunately ending them in error is the simplest way to resolve
the deadlock.

Alternatively, it might have been nice if we didn't need to freeze at
all if we could leverage the new blk_mq_quiesce_queue, but that wouldn't
work when the queue map needs to be redone...

Any feedback appreciated. Thanks!

Keith Busch (6):
  irq/affinity: Assign all online CPUs to vectors
  irq/affinity: Assign offline CPUs a vector
  nvme/pci: Start queues after tagset is updated
  blk-mq: Update queue map when changing queue count
  blk-mq: Fix freeze deadlock
  blk-mq: Remove unused variable

 block/blk-mq.c          | 86 +++++++++++++++++++++++++++++++++++++++++--------
 drivers/nvme/host/pci.c |  2 +-
 kernel/irq/affinity.c   | 17 ++++++++--
 3 files changed, 87 insertions(+), 18 deletions(-)

-- 
2.5.5

* [PATCH 1/6] irq/affinity: Assign all online CPUs to vectors
  2017-01-04 22:41 ` Keith Busch
@ 2017-01-04 22:41   ` Keith Busch
  -1 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2017-01-04 22:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin, Keith Busch

This patch makes sure all online CPUs are assigned to vectors in
cases where the nodes don't have the same number of online CPUs.
The calculation for how many vectors need to be assigned should account
for the number of CPUs on each node in every round of assignment, so
that all online CPUs are assigned a vector when they don't divide
evenly, and calculate the extras accordingly.

Since we attempt to divide evenly among the nodes, this may still result
in unused vectors if some nodes have fewer CPUs than nodes previously
set up, but at least every online CPU will be assigned to something.
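
As a self-contained illustration of the new arithmetic, here is a
user-space sketch with a made-up topology (it mirrors the loop in the
diff below, but is not the kernel code itself):

  #include <stdio.h>

  int main(void)
  {
  	int ncpus_per_node[3] = { 8, 2, 2 };	/* assumed topology */
  	int nodes = 3, affv = 8, curvec = 0;
  	int vecs_per_node = affv / nodes;	/* 8 / 3 = 2 */

  	for (int n = 0; n < 3; n++) {
  		int ncpus = ncpus_per_node[n];
  		/* never give a node more vectors than it has CPUs */
  		int vecs_to_assign = vecs_per_node < ncpus ?
  					vecs_per_node : ncpus;

  		printf("node %d: %d vectors for %d cpus\n",
  		       n, vecs_to_assign, ncpus);
  		curvec += vecs_to_assign;
  		/* re-spread the leftovers over the remaining nodes */
  		if (--nodes)
  			vecs_per_node = (affv - curvec) / nodes;
  	}
  	printf("assigned %d of %d vectors\n", curvec, affv); /* 6 of 8 */
  	return 0;
  }

With these numbers every online CPU ends up on some vector, and the two
vectors the small nodes can't use stay unassigned, which is exactly the
caveat above.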

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 kernel/irq/affinity.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4544b11..b25dce0 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -96,17 +96,19 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 
 	/* Spread the vectors per node */
 	vecs_per_node = affv / nodes;
-	/* Account for rounding errors */
-	extra_vecs = affv - (nodes * vecs_per_node);
 
 	for_each_node_mask(n, nodemsk) {
-		int ncpus, v, vecs_to_assign = vecs_per_node;
+		int ncpus, v, vecs_to_assign;
 
 		/* Get the cpus on this node which are in the mask */
 		cpumask_and(nmsk, cpu_online_mask, cpumask_of_node(n));
 
 		/* Calculate the number of cpus per vector */
 		ncpus = cpumask_weight(nmsk);
+		vecs_to_assign = min(vecs_per_node, ncpus);
+
+		/* Account for rounding errors */
+		extra_vecs = ncpus - vecs_to_assign;
 
 		for (v = 0; curvec < last_affv && v < vecs_to_assign;
 		     curvec++, v++) {
@@ -123,6 +125,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 
 		if (curvec >= last_affv)
 			break;
+		vecs_per_node = (affv - curvec) / --nodes;
 	}
 
 done:
-- 
2.5.5

* [PATCH 2/6] irq/affinity: Assign offline CPUs a vector
  2017-01-04 22:41 ` Keith Busch
@ 2017-01-04 22:41   ` Keith Busch
  -1 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2017-01-04 22:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin, Keith Busch

The offline CPUs need to be assigned to something in case they come online
later, otherwise anyone using the mapping for things other than affinity
will have blank entries for that CPU once it comes online.
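
As a toy user-space version of the mask arithmetic in the hunk below
(plain bitmasks standing in for the kernel cpumask API):

  #include <stdio.h>

  int main(void)
  {
  	unsigned int possible = 0xff;	/* CPUs 0-7 exist */
  	unsigned int online   = 0x0f;	/* CPUs 0-3 are online */
  	unsigned int offline  = possible & ~online; /* cpumask_andnot() */

  	for (int cpu = 0; cpu < 8; cpu++)
  		if (offline & (1u << cpu))
  			printf("cpu %d -> vector 0 (offline fallback)\n",
  			       cpu);
  	return 0;
  }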

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 kernel/irq/affinity.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index b25dce0..2367531 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -129,6 +129,14 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	}
 
 done:
+	/*
+	 * Assign offline CPUs to first mask in case they come online later. A
+	 * driver can rerun this from a cpu notifier if they want a more optimal
+	 * spread.
+	 */
+	cpumask_andnot(nmsk, cpu_possible_mask, cpu_online_mask);
+	irq_spread_init_one(masks, nmsk, cpumask_weight(nmsk));
+
 	put_online_cpus();
 
 	/* Fill out vectors at the end that don't need affinity */
-- 
2.5.5

* [PATCH 3/6] nvme/pci: Start queues after tagset is updated
  2017-01-04 22:41 ` Keith Busch
@ 2017-01-04 22:41   ` Keith Busch
  -1 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2017-01-04 22:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin, Keith Busch

We need to leave the block queues stopped if we're changing the tagset's
number of queues.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/nvme/host/pci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 19beeb7..a94fbcb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1813,8 +1813,8 @@ static void nvme_reset_work(struct work_struct *work)
 		nvme_kill_queues(&dev->ctrl);
 		nvme_remove_namespaces(&dev->ctrl);
 	} else {
-		nvme_start_queues(&dev->ctrl);
 		nvme_dev_add(dev);
+		nvme_start_queues(&dev->ctrl);
 	}
 
 	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE)) {
-- 
2.5.5

* [PATCH 4/6] blk-mq: Update queue map when changing queue count
  2017-01-04 22:41 ` Keith Busch
@ 2017-01-04 22:41   ` Keith Busch
  -1 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2017-01-04 22:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin, Keith Busch

The queue map needs to match how many hardware queues exist, otherwise
it may indicate a less-than-optimal mapping, contain a hctx_id that
was freed, or attempt to initialize a request map for an invalid context.
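
For context, the default spread computed by blk_mq_map_queues() boils
down to something like the sketch below (a deliberately simplified
model; the real function is topology-aware):

  /*
   * Simplified model of the default cpu -> hctx map.  Entries computed
   * against an old queue count can point at a hctx that no longer
   * exists, which is why the map must be rebuilt whenever nr_hw_queues
   * changes.
   */
  void toy_map_queues(unsigned int *map, unsigned int nr_cpus,
  		    unsigned int nr_hw_queues)
  {
  	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
  		map[cpu] = cpu % nr_hw_queues;
  }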

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a8e67a1..9b7ed03 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2223,7 +2223,6 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 	 * we should change hctx numa_node according to new topology (this
 	 * involves free and re-allocate memory, worthy doing?)
 	 */
-
 	blk_mq_map_swqueue(q, online_mask);
 
 	blk_mq_sysfs_register(q);
@@ -2481,6 +2480,11 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 		blk_mq_freeze_queue(q);
 
 	set->nr_hw_queues = nr_hw_queues;
+	if (set->ops->map_queues)
+		set->ops->map_queues(set);
+	else
+		blk_mq_map_queues(set);
+
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_realloc_hw_ctxs(set, q);
 
-- 
2.5.5

* [PATCH 5/6] blk-mq: Fix queue freeze deadlock
  2017-01-04 22:41 ` Keith Busch
@ 2017-01-04 22:41   ` Keith Busch
  -1 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2017-01-04 22:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin, Keith Busch

If hardware queues are stopped for some event, like the device being
suspended by power management, requests allocated on that hardware queue
are stuck indefinitely, causing a queue freeze to wait forever.

This patch abandons requests on stopped queues after syncing with all
queue_rq events when we need to rebalance the queues. While we would
prefer not to end the requests in error if it were possible to submit
them on a different context, there's no good way to unwind a request for
submission on a valid context once it has entered a stopped context that
is slated for removal. Ending the IO with -EAGAIN is a better alternative
than deadlocking.
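
The resulting sequence at the blk_mq_update_nr_hw_queues() call site
(an annotated restatement of the hunk at the end of this patch):

  list_for_each_entry(q, &set->tag_list, tag_set_list)
  	blk_mq_freeze_queue_start(q);	    /* block new requests */
  list_for_each_entry(q, &set->tag_list, tag_set_list)
  	blk_mq_abandon_stopped_requests(q); /* fail requests stuck on
  					     * stopped hctxs */
  list_for_each_entry(q, &set->tag_list, tag_set_list)
  	blk_mq_freeze_queue_wait(q);	    /* now guaranteed to finish */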

Reported-by: Marc Merlin <marc@merlins.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 67 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9b7ed03..0c9a2a3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -117,22 +117,12 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
-/**
- * blk_mq_quiesce_queue() - wait until all ongoing queue_rq calls have finished
- * @q: request queue.
- *
- * Note: this function does not prevent that the struct request end_io()
- * callback function is invoked. Additionally, it is not prevented that
- * new queue_rq() calls occur unless the queue has been stopped first.
- */
-void blk_mq_quiesce_queue(struct request_queue *q)
+static void blk_mq_sync_queue(struct request_queue *q)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 	bool rcu = false;
 
-	blk_mq_stop_hw_queues(q);
-
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (hctx->flags & BLK_MQ_F_BLOCKING)
 			synchronize_srcu(&hctx->queue_rq_srcu);
@@ -142,6 +132,20 @@ void blk_mq_quiesce_queue(struct request_queue *q)
 	if (rcu)
 		synchronize_rcu();
 }
+
+/**
+ * blk_mq_quiesce_queue() - wait until all ongoing queue_rq calls have finished
+ * @q: request queue.
+ *
+ * Note: this function does not prevent that the struct request end_io()
+ * callback function is invoked. Additionally, it is not prevented that
+ * new queue_rq() calls occur unless the queue has been stopped first.
+ */
+void blk_mq_quiesce_queue(struct request_queue *q)
+{
+	blk_mq_stop_hw_queues(q);
+	blk_mq_sync_queue(q);
+}
 EXPORT_SYMBOL_GPL(blk_mq_quiesce_queue);
 
 void blk_mq_wake_waiters(struct request_queue *q)
@@ -2228,6 +2232,51 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 	blk_mq_sysfs_register(q);
 }
 
+static void blk_mq_abandon_stopped_requests(struct request_queue *q)
+{
+	int i;
+	struct request *rq, *next;
+	struct blk_mq_hw_ctx *hctx;
+	LIST_HEAD(rq_list);
+
+	blk_mq_sync_queue(q);
+
+	spin_lock(&q->requeue_lock);
+	list_for_each_entry_safe(rq, next, &q->requeue_list, queuelist) {
+		struct blk_mq_ctx *ctx;
+
+		ctx = rq->mq_ctx;
+		hctx = blk_mq_map_queue(q, ctx->cpu);
+		if (blk_mq_hctx_stopped(hctx)) {
+			list_del_init(&rq->queuelist);
+
+			spin_lock(&hctx->lock);
+			list_add_tail(&rq->queuelist, &rq_list);
+			spin_unlock(&hctx->lock);
+		}
+	}
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (!blk_mq_hctx_stopped(hctx))
+			continue;
+
+		flush_busy_ctxs(hctx, &rq_list);
+
+		spin_lock(&hctx->lock);
+		if (!list_empty(&hctx->dispatch))
+			list_splice_init(&hctx->dispatch, &rq_list);
+		spin_unlock(&hctx->lock);
+	}
+	spin_unlock(&q->requeue_lock);
+
+	while (!list_empty(&rq_list)) {
+		rq = list_first_entry(&rq_list, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		rq->errors = -EAGAIN;
+		blk_mq_end_request(rq, rq->errors);
+	}
+}
+
 /*
  * New online cpumask which is going to be set in this hotplug event.
  * Declare this cpumasks as global as cpu-hotplug operation is invoked
@@ -2250,6 +2299,8 @@ static void blk_mq_queue_reinit_work(void)
 	list_for_each_entry(q, &all_q_list, all_q_node)
 		blk_mq_freeze_queue_start(q);
 	list_for_each_entry(q, &all_q_list, all_q_node)
+		blk_mq_abandon_stopped_requests(q);
+	list_for_each_entry(q, &all_q_list, all_q_node)
 		blk_mq_freeze_queue_wait(q);
 
 	list_for_each_entry(q, &all_q_list, all_q_node)
@@ -2477,7 +2528,11 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 		return;
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_freeze_queue(q);
+		blk_mq_freeze_queue_start(q);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_abandon_stopped_requests(q);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_freeze_queue_wait(q);
 
 	set->nr_hw_queues = nr_hw_queues;
 	if (set->ops->map_queues)
-- 
2.5.5

* [PATCH 6/6] blk-mq: Remove unused variable
  2017-01-04 22:41 ` Keith Busch
@ 2017-01-04 22:41   ` Keith Busch
  -1 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2017-01-04 22:41 UTC (permalink / raw)
  To: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin, Keith Busch

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/blk-mq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0c9a2a3..fae9651 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -916,7 +916,6 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 static void blk_mq_process_rq_list(struct blk_mq_hw_ctx *hctx)
 {
 	LIST_HEAD(rq_list);
-	LIST_HEAD(driver_list);
 
 	if (unlikely(blk_mq_hctx_stopped(hctx)))
 		return;
-- 
2.5.5

* Re: [PATCH 5/6] blk-mq: Fix queue freeze deadlock
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-05  7:33     ` Bart Van Assche
  -1 siblings, 0 replies; 44+ messages in thread
From: Bart Van Assche @ 2017-01-05  7:33 UTC (permalink / raw)
  To: hch, tglx, keith.busch, linux-nvme, linux-block, axboe, axboe; +Cc: marc

On Wed, 2017-01-04 at 17:41 -0500, Keith Busch wrote:
> +static void blk_mq_abandon_stopped_requests(struct request_queue *q)
> +{
> +	int i;
> +	struct request *rq, *next;
> +	struct blk_mq_hw_ctx *hctx;
> +	LIST_HEAD(rq_list);
> +
> +	blk_mq_sync_queue(q);
> +
> +	spin_lock(&q->requeue_lock);
> +	list_for_each_entry_safe(rq, next, &q->requeue_list, queuelist) {
> +		struct blk_mq_ctx *ctx;
> +
> +		ctx = rq->mq_ctx;
> +		hctx = blk_mq_map_queue(q, ctx->cpu);
> +		if (blk_mq_hctx_stopped(hctx)) {
> +			list_del_init(&rq->queuelist);
> +
> +			spin_lock(&hctx->lock);
> +			list_add_tail(&rq->queuelist, &rq_list);
> +			spin_unlock(&hctx->lock);
> +		}
> +	}
> +
> +	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (!blk_mq_hctx_stopped(hctx))
> +			continue;
> +
> +		flush_busy_ctxs(hctx, &rq_list);
> +
> +		spin_lock(&hctx->lock);
> +		if (!list_empty(&hctx->dispatch))
> +			list_splice_init(&hctx->dispatch, &rq_list);
> +		spin_unlock(&hctx->lock);
> +	}
> +	spin_unlock(&q->requeue_lock);
> +
> +	while (!list_empty(&rq_list)) {
> +		rq = list_first_entry(&rq_list, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		rq->errors = -EAGAIN;
> +		blk_mq_end_request(rq, rq->errors);
> +	}
> +}

Hello Keith,

This patch adds a second code path to the blk-mq core for running queues and
hence will make the blk-mq core harder to maintain. Have you considered
implementing this functionality by introducing a new "fail all requests" flag
for hctx queues such that blk_mq_abandon_stopped_requests() can reuse the
existing mechanism for running a queue?
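
A rough sketch of that idea (hypothetical; BLK_MQ_S_FAIL_REQUESTS is a
made-up flag, shown only to illustrate reusing the queue-run machinery):

  /* caller side: mark the hctx, then run it as usual */
  set_bit(BLK_MQ_S_FAIL_REQUESTS, &hctx->state);
  blk_mq_run_hw_queue(hctx, false);

  /* dispatch side: fail requests instead of issuing them */
  if (test_bit(BLK_MQ_S_FAIL_REQUESTS, &hctx->state)) {
  	rq->errors = -EAGAIN;
  	blk_mq_end_request(rq, rq->errors);
  	continue;
  }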

Bart.

* Re: [PATCH 2/6] irq/affinity: Assign offline CPUs a vector
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-08 10:01     ` Christoph Hellwig
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2017-01-08 10:01 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner, Marc Merlin

On Wed, Jan 04, 2017 at 05:41:07PM -0500, Keith Busch wrote:
> The offline CPUs need to be assigned to something in case they come online
> later, otherwise anyone using the mapping for things other than affinity
> will have blank entries for that CPU once it comes online.

I don't really like the idea behind it.  Back when we came up with
this code I had some discussion with Thomas about whether we should do the
assignment only for online CPUs, or maybe for all possible CPUs.

Except for some big iron with physical node hotplug the difference
usually is just that some nodes have been temporarily offlined.  So
maybe we need to bite the bullet and move the irq and block code
to consider all possible cpus.

* Re: [PATCH 6/6] blk-mq: Remove unused variable
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-08 10:02     ` Christoph Hellwig
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2017-01-08 10:02 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner, Marc Merlin

Looks good, and should probably go straight to Jens for 4.10.

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 1/6] irq/affinity: Assign all online CPUs to vectors
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-13 20:21     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2017-01-13 20:21 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

* Re: [PATCH 2/6] irq/affinity: Assign offline CPUs a vector
  2017-01-08 10:01     ` Christoph Hellwig
@ 2017-01-13 20:26       ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2017-01-13 20:26 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch
  Cc: Jens Axboe, Jens Axboe, Marc Merlin, linux-nvme, linux-block,
	Thomas Gleixner

>> The offline CPUs need to be assigned to something in case they come online
>> later, otherwise anyone using the mapping for things other than affinity
>> will have blank entries for that CPU once it comes online.
>
> I don't really like the idea behind it.  Back when we came up with
> this code I had some discussion with Thomas if we should do the
> assignment only for online CPUs, or maybe for all possible CPUs.
>
> Except for some big iron with physical node hotplug the difference
> usually is just that some nodes have been temporarily offlined.  So
> maybe we need to bite the bullet and move the irq and block code
> to consider all possible cpus.

I tend to agree.

It would be great if we could have a cpuhp_register() for external users
that would get ops + state mask structure to act on (this case would
need CPUHP_OFFLINE, CPUHP_ONLINE). We do have something similar
for mmu (mmu_notifier).
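
Something with roughly this shape, perhaps (purely hypothetical, no such
interface exists today):

  struct cpuhp_ops {
  	int (*cpu_online)(unsigned int cpu, void *data);
  	int (*cpu_offline)(unsigned int cpu, void *data);
  };

  /* state_mask selects which transitions to act on, e.g. the
   * CPUHP_OFFLINE and CPUHP_ONLINE states mentioned above */
  int cpuhp_register(const struct cpuhp_ops *ops,
  		   unsigned long state_mask, void *data);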

* Re: [PATCH 3/6] nvme/pci: Start queues after tagset is updated
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-13 20:38     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2017-01-13 20:38 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin


> We need to leave the block queues stopped if we're changing the tagset's
> number of queues.

Umm, don't we need to fail these requests? What am I missing?
Won't these requests block until the timeout expires and then
trigger error recovery again?

* Re: [PATCH 4/6] blk-mq: Update queue map when changing queue count
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-13 20:39     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2017-01-13 20:39 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a8e67a1..9b7ed03 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2223,7 +2223,6 @@ static void blk_mq_queue_reinit(struct request_queue *q,
>  	 * we should change hctx numa_node according to new topology (this
>  	 * involves free and re-allocate memory, worthy doing?)
>  	 */
> -

This looks mis-placed.

>  	blk_mq_map_swqueue(q, online_mask);
>
>  	blk_mq_sysfs_register(q);
> @@ -2481,6 +2480,11 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
>  		blk_mq_freeze_queue(q);
>
>  	set->nr_hw_queues = nr_hw_queues;
> +	if (set->ops->map_queues)
> +		set->ops->map_queues(set);
> +	else
> +		blk_mq_map_queues(set);
> +

Makes sense,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

* Re: [PATCH 5/6] blk-mq: Fix queue freeze deadlock
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-13 21:05     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2017-01-13 21:05 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin


> If hardware queues are stopped for some event, like the device being
> suspended by power management, requests allocated on that hardware queue
> are stuck indefinitely, causing a queue freeze to wait forever.

I have a problem with this patch. IMO, this is a general issue, so
why do we tie a fix to calling blk_mq_update_nr_hw_queues()? We might
not need to update nr_hw_queues at all. I'm fine with
blk_mq_abandon_stopped_requests but not with its call-site.

Usually a driver knows when it wants to abandon all busy requests via
blk_mq_tagset_busy_iter(); maybe the right approach is to add
a hook for all allocated tags? Or have blk_mq_quiesce_queue get a
fail-all-requests parameter from the callers?

* Re: [PATCH 6/6] blk-mq: Remove unused variable
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-13 21:05     ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2017-01-13 21:05 UTC (permalink / raw)
  To: Keith Busch, linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner
  Cc: Marc Merlin

Looks good,

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

* Re: [PATCH 5/6] blk-mq: Fix queue freeze deadlock
  2017-01-05  7:33     ` Bart Van Assche
@ 2017-01-17 17:53       ` Keith Busch
  -1 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2017-01-17 17:53 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, tglx, linux-nvme, linux-block, axboe, axboe, marc

On Thu, Jan 05, 2017 at 07:33:22AM +0000, Bart Van Assche wrote:
> This patch adds a second code path to the blk-mq core for running queues and
> hence will make the blk-mq core harder to maintain. Have you considered to
> implement this functionality by introducing a new "fail all requests" flag
> for hctx queues such that blk_mq_abandon_stopped_requests() can reuse the
> existing mechanism for running a queue?

Thanks for the suggestion. I'll look into that. I wanted to avoid more
flags to test for in the fast-path, but I see that a special queue run
method is problematic for maintenance.

* Re: [PATCH 5/6] blk-mq: Fix queue freeze deadlock
  2017-01-13 21:05     ` Sagi Grimberg
@ 2017-01-17 18:00       ` Keith Busch
  -1 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2017-01-17 18:00 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner, Marc Merlin

On Fri, Jan 13, 2017 at 11:05:19PM +0200, Sagi Grimberg wrote:
> > If hardware queues are stopped for some event, like the device being
> > suspended by power management, requests allocated on that hardware queue
> > are stuck indefinitely, causing a queue freeze to wait forever.
> 
> I have a problem with this patch. IMO, this is a general issue, so
> why do we tie a fix to calling blk_mq_update_nr_hw_queues()? We might
> not need to update nr_hw_queues at all. I'm fine with
> blk_mq_abandon_stopped_requests but not with its call-site.
> 
> Usually a driver knows when it wants to abandon all busy requests via
> blk_mq_tagset_busy_iter(); maybe the right approach is to add
> a hook for all allocated tags? Or have blk_mq_quiesce_queue get a
> fail-all-requests parameter from the callers?

This patch is overly aggressive on failing allocated requests. There
are scenarios where we wouldn't want to abandon them, like if the hw
context is about to be brought back online, but this patch assumes all
need to be abandoned. I'll see if there are some other tricks we can have
a driver do. Thanks for the suggestions.

* Re: [PATCH 5/6] blk-mq: Fix queue freeze deadlock
  2017-01-17 18:00       ` Keith Busch
@ 2017-01-19  7:54         ` Sagi Grimberg
  -1 siblings, 0 replies; 44+ messages in thread
From: Sagi Grimberg @ 2017-01-19  7:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner, Marc Merlin


>>> If hardware queues are stopped for some event, like the device being
>>> suspended by power management, requests allocated on that hardware queue
>>> are stuck indefinitely, causing a queue freeze to wait forever.
>>
>> I have a problem with this patch. IMO, this is a general issue, so
>> why do we tie a fix to calling blk_mq_update_nr_hw_queues()? We might
>> not need to update nr_hw_queues at all. I'm fine with
>> blk_mq_abandon_stopped_requests but not with its call-site.
>>
>> Usually a driver knows when it wants to abandon all busy requests via
>> blk_mq_tagset_busy_iter(); maybe the right approach is to add
>> a hook for all allocated tags? Or have blk_mq_quiesce_queue get a
>> fail-all-requests parameter from the callers?
>
> This patch is overly aggressive on failing allocated requests. There
> are scenarios where we wouldn't want to abandon them, like if the hw
> context is about to be brought back online, but this patch assumes all
> need to be abandoned. I'll see if there are some other tricks we can have
> a driver do. Thanks for the suggestions.

I agree,

I do think though that this should be driven from the driver, because
for fabrics, we might have some fabric error that triggers a periodic
reconnect. So the "hw context is about to be brought back" is unknown
from the driver pov, and when we delete the controller (because we give
up) this is exactly where we need to abandon the allocated requests.

* Re: [PATCH 1/6] irq/affinity: Assign all online CPUs to vectors
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-23 18:30     ` Christoph Hellwig
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2017-01-23 18:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner, Marc Merlin

On Wed, Jan 04, 2017 at 05:41:06PM -0500, Keith Busch wrote:
> This patch makes sure all online CPUs are assigned to vectors in
> cases where the nodes don't have the same number of online CPUs.
> The calculation for how many vectors need to be assigned should account
> for the number of CPUs on each node in every round of assignment, so
> that all online CPUs are assigned a vector when they don't divide
> evenly, and calculate the extras accordingly.
> 
> Since we attempt to divide evenly among the nodes, this may still result
> in unused vectors if some nodes have fewer CPUs than nodes previously
> set up, but at least every online CPU will be assigned to something.

This looks fine:

I still think we should switch to something like all present or possible
cpus for MSI-X vector and blk-mq queue assignment, which would reduce
the need for this:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 3/6] nvme/pci: Start queues after tagset is updated
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-23 18:31     ` Christoph Hellwig
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2017-01-23 18:31 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner, Marc Merlin

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 4/6] blk-mq: Update queue map when changing queue count
  2017-01-04 22:41   ` Keith Busch
@ 2017-01-23 18:32     ` Christoph Hellwig
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2017-01-23 18:32 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, linux-block, Jens Axboe, Jens Axboe,
	Christoph Hellwig, Thomas Gleixner, Marc Merlin

Looks fine except for the nitpick pointed out by Sagi:

Reviewed-by: Christoph Hellwig <hch@lst.de>
