Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
@ 2019-10-08  4:18 Ming Lei
  2019-10-08  4:18 ` [PATCH V3 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Ming Lei @ 2019-10-08  4:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

But neither drivers nor blk-mq do that before one hctx becomes dead (all
CPUs for one hctx are offline), and even worse, blk-mq still tries
to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to become dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IOs will be mapped to other live hctxs for dispatch

Please comment & review, thanks!

John, I didn't add your tested-by tag since V3 has some changes,
and I'd appreciate it if you could run your test on V3.

V3:
	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
	- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
	- patch 4 & patch 5 in V1 have been merged to the block tree, so remove
	  them
	- address comments from John Garry and Minwoo


Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes dead
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
    is dead

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 135 +++++++++++++++++++++++++++++++++----
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 138 insertions(+), 16 deletions(-)

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
-- 
2.20.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH V3 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  2019-10-08  4:18 [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
@ 2019-10-08  4:18 ` Ming Lei
  2019-10-08  4:18 ` [PATCH V3 2/5] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-10-08  4:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch, John Garry

Add a new hw queue state of BLK_MQ_S_INTERNAL_STOPPED, which prepares
for stopping hw queue before all CPUs of this hctx become offline.

We can't reuse BLK_MQ_S_STOPPED because that state can be cleared during IO
completion.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 1 +
 block/blk-mq.h         | 3 ++-
 include/linux/blk-mq.h | 3 +++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..af40a02c46ee 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INTERNAL_STOPPED),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 32c62c64e6c2..63717573bc16 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -176,7 +176,8 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
-	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	return test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
+		test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
 }
 
 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0bf056de5cc3..079c282e4471 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -235,6 +235,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is internally stopped, must not be used by driver */
+	BLK_MQ_S_INTERNAL_STOPPED	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH V3 2/5] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2019-10-08  4:18 [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  2019-10-08  4:18 ` [PATCH V3 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
@ 2019-10-08  4:18 ` Ming Lei
  2019-10-08  4:18 ` [PATCH V3 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-10-08  4:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Most blk-mq drivers depend on managed IRQs' automatic affinity to set
up queue mapping. Thomas mentioned the following point[1]:

"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD
is a cpuhp state handled after the CPU is down, so there isn't any chance
to quiesce the hctx for blk-mq wrt. CPU hotplug.

Add new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE for blk-mq to stop queues
and wait for completion of in-flight requests.

The following patch will stop the hw queue and wait for completion of
in-flight requests when one hctx is becoming dead. That may cause a
deadlock for some stacking blk-mq drivers, such as dm-rq and loop.

Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and set it for dm-rq and
loop, so we needn't wait for completion of in-flight requests from
dm-rq & loop, and the potential deadlock can be avoided.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c     |  1 +
 block/blk-mq.c             | 13 +++++++++++++
 drivers/block/loop.c       |  2 +-
 drivers/md/dm-rq.c         |  2 +-
 include/linux/blk-mq.h     |  2 ++
 include/linux/cpuhotplug.h |  1 +
 6 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index af40a02c46ee..24fff8c90942 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -240,6 +240,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ec791156e9cc..a664f196782a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2225,6 +2225,11 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2261,6 +2266,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2320,6 +2328,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3547,6 +3558,8 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				NULL, blk_mq_hctx_notify_online);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f6f77eaa7217..751a28a1d4b0 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1999,7 +1999,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13b..5f1ff70ac029 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 079c282e4471..a345f2cf920d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -58,6 +58,7 @@ struct blk_mq_hw_ctx {
 
 	atomic_t		nr_active;
 
+	struct hlist_node	cpuhp_online;
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
 
@@ -226,6 +227,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 068793a619ca..bb80f52040cb 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -147,6 +147,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH V3 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead
  2019-10-08  4:18 [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  2019-10-08  4:18 ` [PATCH V3 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
  2019-10-08  4:18 ` [PATCH V3 2/5] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2019-10-08  4:18 ` Ming Lei
  2019-10-08 17:03   ` John Garry
  2019-10-08  4:18 ` [PATCH V3 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2019-10-08  4:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Before one CPU becomes offline, check if it is the last online CPU of
the hctx. If yes, mark this hctx as BLK_MQ_S_INTERNAL_STOPPED, and
meanwhile wait for completion of all in-flight IOs originated from
this hctx.

This guarantees that there isn't any in-flight IO before shutting down
the managed IRQ line.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c |  2 +-
 block/blk-mq-tag.h |  2 ++
 block/blk-mq.c     | 40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 008388e82b5c..31828b82552b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -325,7 +325,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..321fd6f440e6 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a664f196782a..3384242202eb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2225,8 +2225,46 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
+{
+	unsigned *count = data;
+
+	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
+		(*count)++;
+
+	return true;
+}
+
+static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
+{
+	unsigned count = 0;
+
+	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
+
+	return count;
+}
+
+static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	while (1) {
+		if (!blk_mq_tags_inflight_rqs(hctx->tags))
+			break;
+		msleep(5);
+	}
+}
+
 static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 {
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu) &&
+	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) >=
+	     nr_cpu_ids)) {
+		set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
+		blk_mq_hctx_drain_inflight_rqs(hctx);
+        }
 	return 0;
 }
 
@@ -2246,6 +2284,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
+	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
+
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_lists[type])) {
 		list_splice_init(&ctx->rq_lists[type], &tmp);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH V3 4/5] blk-mq: re-submit IO in case that hctx is dead
  2019-10-08  4:18 [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (2 preceding siblings ...)
  2019-10-08  4:18 ` [PATCH V3 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead Ming Lei
@ 2019-10-08  4:18 ` Ming Lei
  2019-10-08  4:18 ` [PATCH V3 5/5] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
  2019-10-08  9:06 ` [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug John Garry
  5 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-10-08  4:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

When all CPUs in one hctx are offline, we shouldn't run this hw queue
to complete requests any more.

So steal bios from the request, resubmit them, and finally free the
request in blk_mq_hctx_notify_dead().

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 52 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3384242202eb..4153c1c4e2aa 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2268,10 +2268,34 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+static void blk_mq_resubmit_io(struct request *rq)
+{
+	struct bio_list list;
+	struct bio *bio;
+
+	bio_list_init(&list);
+	blk_steal_bios(&list, rq);
+
+	/*
+	 * Free the old empty request before submitting bio for avoiding
+	 * potential deadlock
+	 */
+	blk_mq_cleanup_rq(rq);
+	blk_mq_end_request(rq, 0);
+
+	while (true) {
+		bio = bio_list_pop(&list);
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * 'cpu' has gone away. If this hctx is dead, we can't dispatch request
+ * to the hctx any more, so steal bios from requests of this hctx, and
+ * re-submit them to the request queue, and free these requests finally.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
@@ -2279,6 +2303,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
+	bool hctx_dead;
+	struct request *rq;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
@@ -2286,6 +2312,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
 
+	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
+		nr_cpu_ids;
+
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_lists[type])) {
 		list_splice_init(&ctx->rq_lists[type], &tmp);
@@ -2296,11 +2325,20 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	if (list_empty(&tmp))
 		return 0;
 
-	spin_lock(&hctx->lock);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+	if (!hctx_dead) {
+		spin_lock(&hctx->lock);
+		list_splice_tail_init(&tmp, &hctx->dispatch);
+		spin_unlock(&hctx->lock);
+		blk_mq_run_hw_queue(hctx, true);
+		return 0;
+	}
+
+	while (!list_empty(&tmp)) {
+		rq = list_entry(tmp.next, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_resubmit_io(rq);
+	}
 
-	blk_mq_run_hw_queue(hctx, true);
 	return 0;
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH V3 5/5] blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead
  2019-10-08  4:18 [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (3 preceding siblings ...)
  2019-10-08  4:18 ` [PATCH V3 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
@ 2019-10-08  4:18 ` " Ming Lei
  2019-10-08  9:06 ` [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug John Garry
  5 siblings, 0 replies; 18+ messages in thread
From: Ming Lei @ 2019-10-08  4:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

If a hctx becomes dead, all in-queue IO requests aimed at this hctx
have to be re-submitted, so also cover requests queued in the scheduler
queue.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4153c1c4e2aa..4625013a4927 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2305,6 +2305,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 	bool hctx_dead;
 	struct request *rq;
+	struct elevator_queue *e;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
@@ -2315,12 +2316,31 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
 		nr_cpu_ids;
 
-	spin_lock(&ctx->lock);
-	if (!list_empty(&ctx->rq_lists[type])) {
-		list_splice_init(&ctx->rq_lists[type], &tmp);
-		blk_mq_hctx_clear_pending(hctx, ctx);
+	e = hctx->queue->elevator;
+	if (!e) {
+		spin_lock(&ctx->lock);
+		if (!list_empty(&ctx->rq_lists[type])) {
+			list_splice_init(&ctx->rq_lists[type], &tmp);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+		}
+		spin_unlock(&ctx->lock);
+	} else if (hctx_dead) {
+		LIST_HEAD(sched_tmp);
+
+		while ((rq = e->type->ops.dispatch_request(hctx))) {
+			if (rq->mq_hctx != hctx)
+				list_add(&rq->queuelist, &sched_tmp);
+			else
+				list_add(&rq->queuelist, &tmp);
+		}
+
+		while (!list_empty(&sched_tmp)) {
+			rq = list_entry(sched_tmp.next, struct request,
+					queuelist);
+			list_del_init(&rq->queuelist);
+			blk_mq_sched_insert_request(rq, true, true, true);
+		}
 	}
-	spin_unlock(&ctx->lock);
 
 	if (list_empty(&tmp))
 		return 0;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-08  4:18 [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (4 preceding siblings ...)
  2019-10-08  4:18 ` [PATCH V3 5/5] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
@ 2019-10-08  9:06 ` John Garry
  2019-10-08 17:15   ` John Garry
  5 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2019-10-08  9:06 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch

On 08/10/2019 05:18, Ming Lei wrote:
> Hi,
>
> Thomas mentioned:
>     "
>      That was the constraint of managed interrupts from the very beginning:
>
>       The driver/subsystem has to quiesce the interrupt line and the associated
>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>       until it's restarted by the core when the CPU is plugged in again.
>     "
>
> But neither drivers nor blk-mq do that before one hctx becomes dead (all
> CPUs for one hctx are offline), and even worse, blk-mq still tries
> to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().
>
> This patchset tries to address the issue by two stages:
>
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
>
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
>
> Please comment & review, thanks!
>
> John, I didn't add your tested-by tag since V3 has some changes,
> and I'd appreciate it if you could run your test on V3.
>

Will do, Thanks

> V3:
> 	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
> 	- fix patch 4 for avoiding potential deadlock, as found by Hannes
>
> V2:
> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> 	  them
> 	- address comments from John Garry and Minwoo



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead
  2019-10-08  4:18 ` [PATCH V3 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead Ming Lei
@ 2019-10-08 17:03   ` John Garry
  0 siblings, 0 replies; 18+ messages in thread
From: John Garry @ 2019-10-08 17:03 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch

On 08/10/2019 05:18, Ming Lei wrote:
> Before one CPU becomes offline, check if it is the last online CPU
> of hctx. If yes, mark this hctx as BLK_MQ_S_INTERNAL_STOPPED, meantime
> wait for completion of all in-flight IOs originated from this hctx.
>
> This way guarantees that there isn't any inflight IO before shutdowning
> the managed IRQ line.
>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>

Apart from nits, below, FWIW:
Reviewed-by: John Garry <john.garry@huawei.com>

> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-tag.c |  2 +-
>  block/blk-mq-tag.h |  2 ++
>  block/blk-mq.c     | 40 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 008388e82b5c..31828b82552b 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -325,7 +325,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
>   *		true to continue iterating tags, false to stop.
>   * @priv:	Will be passed as second argument to @fn.
>   */
> -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>  		busy_tag_iter_fn *fn, void *priv)
>  {
>  	if (tags->nr_reserved_tags)
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 61deab0b5a5a..321fd6f440e6 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>  extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
>  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>  		void *priv);
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +		busy_tag_iter_fn *fn, void *priv);
>
>  static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
>  						 struct blk_mq_hw_ctx *hctx)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a664f196782a..3384242202eb 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2225,8 +2225,46 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>
> +static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
> +				     bool reserved)
> +{
> +	unsigned *count = data;
> +
> +	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))

nit: extra parentheses

> +		(*count)++;
> +
> +	return true;
> +}
> +
> +static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
> +{
> +	unsigned count = 0;
> +
> +	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
> +
> +	return count;
> +}
> +
> +static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> +	while (1) {
> +		if (!blk_mq_tags_inflight_rqs(hctx->tags))
> +			break;
> +		msleep(5);
> +	}
> +}
> +
>  static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
>  {
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +

nit: we could make this a little neater by using an intermediate 
variable for hctx->cpumask:

	struct cpumask *cpumask = hctx->cpumask;
	
	if ((cpumask_next_and(-1, cpumask, cpu_online_mask) == cpu) &&

	...

> +	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu) &&
> +	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) >=
> +	     nr_cpu_ids)) {
> +		set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
> +		blk_mq_hctx_drain_inflight_rqs(hctx);
> +        }
>  	return 0;
>  }
>
> @@ -2246,6 +2284,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>  	type = hctx->type;
>
> +	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
> +
>  	spin_lock(&ctx->lock);
>  	if (!list_empty(&ctx->rq_lists[type])) {
>  		list_splice_init(&ctx->rq_lists[type], &tmp);
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-08  9:06 ` [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug John Garry
@ 2019-10-08 17:15   ` John Garry
  2019-10-09  8:39     ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2019-10-08 17:15 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch

On 08/10/2019 10:06, John Garry wrote:
> On 08/10/2019 05:18, Ming Lei wrote:
>> Hi,
>>
>> Thomas mentioned:
>>     "
>>      That was the constraint of managed interrupts from the very
>> beginning:
>>
>>       The driver/subsystem has to quiesce the interrupt line and the
>> associated
>>       queue _before_ it gets shutdown in CPU unplug and not fiddle
>> with it
>>       until it's restarted by the core when the CPU is plugged in again.
>>     "
>>
>> But neither drivers nor blk-mq do that before one hctx becomes dead (all
>> CPUs for one hctx are offline), and even worse, blk-mq still tries
>> to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().
>>
>> This patchset tries to address the issue by two stages:
>>
>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>
>> - mark the hctx as internal stopped, and drain all in-flight requests
>> if the hctx is going to be dead.
>>
>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
>> becomes dead
>>
>> - steal bios from the request, and resubmit them via
>> generic_make_request(),
>> then these IO will be mapped to other live hctx for dispatch
>>
>> Please comment & review, thanks!
>>
>> John, I don't add your tested-by tag since V3 have some changes,
>> and I appreciate if you may run your test on V3.
>>
>
> Will do, Thanks

Hi Ming,

I got this warning once:

[  162.558185] CPU10: shutdown
[  162.560994] psci: CPU10 killed.
[  162.593939] CPU9: shutdown
[  162.596645] psci: CPU9 killed.
[  162.625838] CPU8: shutdown
[  162.628550] psci: CPU8 killed.
[  162.685790] CPU7: shutdown
[  162.688496] psci: CPU7 killed.
[  162.725771] CPU6: shutdown
[  162.728486] psci: CPU6 killed.
[  162.753884] CPU5: shutdown
[  162.756591] psci: CPU5 killed.
[  162.785584] irq_shutdown
[  162.788277] IRQ 800: no longer affine to CPU4
[  162.793267] CPU4: shutdown
[  162.795975] psci: CPU4 killed.
[  162.849680] run queue from wrong CPU 13, hctx active
[  162.849692] CPU3: shutdown
[  162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  162.857362] psci: CPU3 killed.
[  162.866039] Workqueue: kblockd blk_mq_run_work_fn
[  162.882281] Call trace:
[  162.884716]  dump_backtrace+0x0/0x150
[  162.888365]  show_stack+0x14/0x20
[  162.891668]  dump_stack+0xb0/0xf8
[  162.894970]  __blk_mq_run_hw_queue+0x11c/0x128
[  162.899400]  blk_mq_run_work_fn+0x1c/0x28
[  162.903397]  process_one_work+0x1e0/0x358
[  162.907393]  worker_thread+0x40/0x488
[  162.911042]  kthread+0x118/0x120
[  162.914257]  ret_from_fork+0x10/0x18
[  162.917834] run queue from wrong CPU 13, hctx active
[  162.922789] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  162.931472] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  162.939983] Workqueue: kblockd blk_mq_run_work_fn
[  162.944674] Call trace:
[  162.947107]  dump_backtrace+0x0/0x150
[  162.950755]  show_stack+0x14/0x20
[  162.954057]  dump_stack+0xb0/0xf8
[  162.957359]  __blk_mq_run_hw_queue+0x11c/0x128
[  162.961788]  blk_mq_run_work_fn+0x1c/0x28
[  162.965784]  process_one_work+0x1e0/0x358
[  162.969780]  worker_thread+0x40/0x488
[  162.973429]  kthread+0x118/0x120
[  162.976644]  ret_from_fork+0x10/0x18
[  162.980214] run queue from wrong CPU 13, hctx active
[  162.985171] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  162.993853] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.002366] Workqueue: kblockd blk_mq_run_work_fn
[  163.007057] Call trace:
[  163.009490]  dump_backtrace+0x0/0x150
[  163.013138]  show_stack+0x14/0x20
[  163.016440]  dump_stack+0xb0/0xf8
[  163.019742]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.024172]  blk_mq_run_work_fn+0x1c/0x28
[  163.028167]  process_one_work+0x1e0/0x358
[  163.032163]  worker_thread+0x238/0x488
[  163.035899]  kthread+0x118/0x120
[  163.039113]  ret_from_fork+0x10/0x18
[  163.042736] run queue from wrong CPU 13, hctx active
[  163.047692] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.056374] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.064885] Workqueue: kblockd blk_mq_run_work_fn
[  163.069575] Call trace:
[  163.072008]  dump_backtrace+0x0/0x150
[  163.075656]  show_stack+0x14/0x20
[  163.078958]  dump_stack+0xb0/0xf8
[  163.082260]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.086690]  blk_mq_run_work_fn+0x1c/0x28
[  163.090686]  process_one_work+0x1e0/0x358
[  163.094681]  worker_thread+0x238/0x488
[  163.098417]  kthread+0x118/0x120
[  163.101631]  ret_from_fork+0x10/0x18
[  163.105200] run queue from wrong CPU 13, hctx active
[  163.110534] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.111852] CPU2: shutdown
[  163.119218] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.119222] Workqueue: kblockd blk_mq_run_work_fn
[  163.119223] Call trace:
[  163.119224]  dump_backtrace+0x0/0x150
[  163.119226]  show_stack+0x14/0x20
[  163.119228]  dump_stack+0xb0/0xf8
[  163.119230]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.119234]  blk_mq_run_work_fn+0x1c/0x28
[  163.121943] psci: CPU2 killed.
[  163.130439]  process_one_work+0x1e0/0x358
[  163.130441]  worker_thread+0x238/0x488
[  163.130443]  kthread+0x118/0x120
[  163.130447]  ret_from_fork+0x10/0x18
[  163.173789] run queue from wrong CPU 13, hctx active
[  163.178743] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.187425] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.195937] Workqueue: kblockd blk_mq_run_work_fn
[  163.200627] Call trace:
[  163.203061]  dump_backtrace+0x0/0x150
[  163.206709]  show_stack+0x14/0x20
[  163.210011]  dump_stack+0xb0/0xf8
[  163.213312]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.217742]  blk_mq_run_work_fn+0x1c/0x28
[  163.221738]  process_one_work+0x1e0/0x358
[  163.225733]  worker_thread+0x238/0x488
[  163.229469]  kthread+0x118/0x120
[  163.232684]  ret_from_fork+0x10/0x18
[  163.236253] run queue from wrong CPU 13, hctx active
[  163.241597] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.241691] CPU1: shutdown
[  163.250281] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.250285] Workqueue: kblockd blk_mq_run_work_fn
[  163.250287] Call trace:
[  163.250291]  dump_backtrace+0x0/0x150
[  163.252998] psci: CPU1 killed.
[  163.261496]  show_stack+0x14/0x20
[  163.261499]  dump_stack+0xb0/0xf8
[  163.261503]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.279008] process 870 (fio) no longer affine to cpu0
[  163.291463]  blk_mq_run_work_fn+0x1c/0x28
[  163.295458]  process_one_work+0x1e0/0x358
[  163.299454]  worker_thread+0x238/0x488
[  163.303189]  kthread+0x118/0x120
[  163.306404]  ret_from_fork+0x10/0x18
[  163.309975] run queue from wrong CPU 13, hctx active
[  163.314929] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.323611] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.332122] Workqueue: kblockd blk_mq_run_work_fn
[  163.336812] Call trace:
[  163.339245]  dump_backtrace+0x0/0x150
[  163.342894]  show_stack+0x14/0x20
[  163.346195]  dump_stack+0xb0/0xf8
[  163.349497]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.353927]  blk_mq_run_work_fn+0x1c/0x28
[  163.357923]  process_one_work+0x1e0/0x358
[  163.361918]  worker_thread+0x238/0x488
[  163.365654]  kthread+0x118/0x120
[  163.368868]  ret_from_fork+0x10/0x18
[  163.372437] run queue from wrong CPU 13, hctx active
[  163.377391] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.386073] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.394583] Workqueue: kblockd blk_mq_run_work_fn
[  163.399273] Call trace:
[  163.401706]  dump_backtrace+0x0/0x150
[  163.405354]  show_stack+0x14/0x20
[  163.408656]  dump_stack+0xb0/0xf8
[  163.411958]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.416388]  blk_mq_run_work_fn+0x1c/0x28
[  163.420384]  process_one_work+0x1e0/0x358
[  163.424379]  worker_thread+0x238/0x488
[  163.428115]  kthread+0x118/0x120
[  163.431329]  ret_from_fork+0x10/0x18
[  163.434934] run queue from wrong CPU 13, hctx active
[  163.439887] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.448570] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.457080] Workqueue: kblockd blk_mq_run_work_fn
[  163.461770] Call trace:
[  163.464203]  dump_backtrace+0x0/0x150
[  163.467851]  show_stack+0x14/0x20
[  163.471153]  dump_stack+0xb0/0xf8
[  163.474455]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.478885]  blk_mq_run_work_fn+0x1c/0x28
[  163.482881]  process_one_work+0x1e0/0x358
[  163.486877]  worker_thread+0x238/0x488
[  163.490613]  kthread+0x118/0x120
[  163.493828]  ret_from_fork+0x10/0x18
[  163.497424] run queue from wrong CPU 13, hctx active
[  163.502378] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.511061] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.519572] Workqueue: kblockd blk_mq_run_work_fn
[  163.524262] Call trace:
[  163.526696]  dump_backtrace+0x0/0x150
[  163.530344]  show_stack+0x14/0x20
[  163.533646]  dump_stack+0xb0/0xf8
[  163.536948]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.541378]  blk_mq_run_work_fn+0x1c/0x28
[  163.545375]  process_one_work+0x1e0/0x358
[  163.549370]  worker_thread+0x238/0x488
[  163.553107]  kthread+0x118/0x120
[  163.556321]  ret_from_fork+0x10/0x18
[  163.559908] run queue from wrong CPU 24, hctx active
[  163.564871] CPU: 24 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.573554] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.582072] Workqueue: kblockd blk_mq_run_work_fn
[  163.586764] Call trace:
[  163.589199]  dump_backtrace+0x0/0x150
[  163.592848]  show_stack+0x14/0x20
[  163.596153]  dump_stack+0xb0/0xf8
[  163.599455]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.603885]  blk_mq_run_work_fn+0x1c/0x28
[  163.607882]  process_one_work+0x1e0/0x358
[  163.611877]  worker_thread+0x238/0x488
[  163.615613]  kthread+0x118/0x120
[  163.618828]  ret_from_fork+0x10/0x18
[  163.622404] run queue from wrong CPU 24, hctx active
[  163.627358] CPU: 24 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.636041] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.644552] Workqueue: kblockd blk_mq_run_work_fn
[  163.649242] Call trace:
[  163.651674]  dump_backtrace+0x0/0x150
[  163.655322]  show_stack+0x14/0x20
[  163.658623]  dump_stack+0xb0/0xf8
[  163.661924]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.666354]  blk_mq_run_work_fn+0x1c/0x28
[  163.670349]  process_one_work+0x1e0/0x358
[  163.674345]  worker_thread+0x238/0x488
[  163.678081]  kthread+0x118/0x120
[  163.681295]  ret_from_fork+0x10/0x18
[  163.684864] run queue from wrong CPU 24, hctx active
[  163.689819] CPU: 24 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.698501] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.707011] Workqueue: kblockd blk_mq_run_work_fn
[  163.711701] Call trace:
[  163.714133]  dump_backtrace+0x0/0x150
[  163.717781]  show_stack+0x14/0x20
[  163.721082]  dump_stack+0xb0/0xf8
[  163.724383]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.728813]  blk_mq_run_work_fn+0x1c/0x28
[  163.732808]  process_one_work+0x1e0/0x358
[  163.736803]  worker_thread+0x238/0x488
[  163.740539]  kthread+0x118/0x120
[  163.743753]  ret_from_fork+0x10/0x18
[  163.747342] run queue from wrong CPU 48, hctx active
[  163.752311] CPU: 48 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.760995] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.769516] Workqueue: kblockd blk_mq_run_work_fn
[  163.774208] Call trace:
[  163.776644]  dump_backtrace+0x0/0x150
[  163.780294]  show_stack+0x14/0x20
[  163.783600]  dump_stack+0xb0/0xf8
[  163.786902]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.791332]  blk_mq_run_work_fn+0x1c/0x28
[  163.795330]  process_one_work+0x1e0/0x358
[  163.799327]  worker_thread+0x238/0x488
[  163.803064]  kthread+0x118/0x120
[  163.806279]  ret_from_fork+0x10/0x18
[  163.809855] run queue from wrong CPU 48, hctx active
[  163.814811] CPU: 48 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.823496] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.832008] Workqueue: kblockd blk_mq_run_work_fn
[  163.836698] Call trace:
[  163.839132]  dump_backtrace+0x0/0x150
[  163.842782]  show_stack+0x14/0x20
[  163.846084]  dump_stack+0xb0/0xf8
[  163.849386]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.853817]  blk_mq_run_work_fn+0x1c/0x28
[  163.857813]  process_one_work+0x1e0/0x358
[  163.861810]  worker_thread+0x238/0x488
[  163.865546]  kthread+0x118/0x120
[  163.868762]  ret_from_fork+0x10/0x18
[  163.872454] run queue from wrong CPU 48, hctx active
[  163.877411] CPU: 48 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.886095] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.894606] Workqueue: kblockd blk_mq_run_work_fn
[  163.899297] Call trace:
[  163.901731]  dump_backtrace+0x0/0x150
[  163.905379]  show_stack+0x14/0x20
[  163.908681]  dump_stack+0xb0/0xf8
[  163.911983]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.916414]  blk_mq_run_work_fn+0x1c/0x28
[  163.920411]  process_one_work+0x1e0/0x358
[  163.924407]  worker_thread+0x238/0x488
[  163.928143]  kthread+0x118/0x120
[  163.931359]  ret_from_fork+0x10/0x18
[  163.934932] run queue from wrong CPU 48, hctx active
[  163.939888] CPU: 48 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  163.948571] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  163.957082] Workqueue: kblockd blk_mq_run_work_fn
[  163.961773] Call trace:
[  163.964207]  dump_backtrace+0x0/0x150
[  163.967855]  show_stack+0x14/0x20
[  163.971157]  dump_stack+0xb0/0xf8
[  163.974459]  __blk_mq_run_hw_queue+0x11c/0x128
[  163.978890]  blk_mq_run_work_fn+0x1c/0x28
[  163.982886]  process_one_work+0x1e0/0x358
[  163.986882]  worker_thread+0x238/0x488
[  163.990618]  kthread+0x118/0x120
[  163.993833]  ret_from_fork+0x10/0x18
[  163.997406] run queue from wrong CPU 48, hctx active
[  164.002361] CPU: 48 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  164.011045] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  164.019556] Workqueue: kblockd blk_mq_run_work_fn
[  164.024247] Call trace:
[  164.026681]  dump_backtrace+0x0/0x150
[  164.030329]  show_stack+0x14/0x20
[  164.033631]  dump_stack+0xb0/0xf8
[  164.036934]  __blk_mq_run_hw_queue+0x11c/0x128
[  164.041364]  blk_mq_run_work_fn+0x1c/0x28
[  164.045360]  process_one_work+0x1e0/0x358
[  164.049357]  worker_thread+0x238/0x488
[  164.053093]  kthread+0x118/0x120
[  164.056308]  ret_from_fork+0x10/0x18
[  164.059885] run queue from wrong CPU 48, hctx active
[  164.064841] CPU: 48 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  164.073525] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  164.082036] Workqueue: kblockd blk_mq_run_work_fn
[  164.086726] Call trace:
[  164.089160]  dump_backtrace+0x0/0x150
[  164.092809]  show_stack+0x14/0x20
[  164.096111]  dump_stack+0xb0/0xf8
[  164.099413]  __blk_mq_run_hw_queue+0x11c/0x128
[  164.103844]  blk_mq_run_work_fn+0x1c/0x28
[  164.107840]  process_one_work+0x1e0/0x358
[  164.111836]  worker_thread+0x238/0x488
[  164.115573]  kthread+0x118/0x120
[  164.118788]  ret_from_fork+0x10/0x18
[  164.122361] run queue from wrong CPU 48, hctx active
[  164.127317] CPU: 48 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  164.136000] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  164.144512] Workqueue: kblockd blk_mq_run_work_fn
[  164.149203] Call trace:
[  164.151636]  dump_backtrace+0x0/0x150
[  164.155285]  show_stack+0x14/0x20
[  164.158587]  dump_stack+0xb0/0xf8
[  164.161889]  __blk_mq_run_hw_queue+0x11c/0x128
[  164.166319]  blk_mq_run_work_fn+0x1c/0x28
[  164.170315]  process_one_work+0x1e0/0x358
[  164.174312]  worker_thread+0x238/0x488
[  164.178048]  kthread+0x118/0x120
[  164.181263]  ret_from_fork+0x10/0x18
[  164.184839] run queue from wrong CPU 48, hctx active
[  164.189794] CPU: 48 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  164.198478] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  164.206989] Workqueue: kblockd blk_mq_run_work_fn
[  164.211680] Call trace:
[  164.214114]  dump_backtrace+0x0/0x150
[  164.217762]  show_stack+0x14/0x20
[  164.221064]  dump_stack+0xb0/0xf8
[  164.224367]  __blk_mq_run_hw_queue+0x11c/0x128
[  164.228797]  blk_mq_run_work_fn+0x1c/0x28
[  164.232793]  process_one_work+0x1e0/0x358
[  164.236789]  worker_thread+0x238/0x488
[  164.240525]  kthread+0x118/0x120
[  164.243740]  ret_from_fork+0x10/0x18
[  164.247348] run queue from wrong CPU 11, hctx active
[  164.252324] CPU: 11 PID: 874 Comm: kworker/3:2H Not tainted 
5.4.0-rc1-00012-gad025dd3d001 #1098
[  164.261008] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[  164.269524] Workqueue: kblockd blk_mq_run_work_fn
[  164.274215] Call trace:
[  164.276649]  dump_backtrace+0x0/0x150
[  164.280299]  show_stack+0x14/0x20
[  164.283603]  dump_stack+0xb0/0xf8
[  164.286904]  __blk_mq_run_hw_queue+0x11c/0x128
[  164.291335]  blk_mq_run_work_fn+0x1c/0x28
[  164.295332]  process_one_work+0x1e0/0x358
[  164.299328]  worker_thread+0x238/0x488
[  164.303065]  kthread+0x118/0x120
[  164.306279]  ret_from_fork+0x10/0x18
[  164.857365] irq_shutdown
[  164.859957] irq_shutdown
[  164.862676] IRQ 799: no longer affine to CPU0
[  164.867332] CPU0: shutdown
[  164.870048] psci: CPU0 killed.
root@(none)$

[I manually added the irq_shutdown print]

From looking at commit 7df938fbc4ee641, apparently it's harmless...

I'll continue to test.

John

>
>> V3:
>>     - re-organize patch 2 & 3 a bit for addressing Hannes's comment
>>     - fix patch 4 for avoiding potential deadlock, as found by Hannes
>>
>> V2:
>>     - patch4 & patch 5 in V1 have been merged to block tree, so remove
>>       them
>>     - addres
>
>
>
> .
>




* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-08 17:15   ` John Garry
@ 2019-10-09  8:39     ` Ming Lei
  2019-10-09  8:49       ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2019-10-09  8:39 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Tue, Oct 08, 2019 at 06:15:52PM +0100, John Garry wrote:
> On 08/10/2019 10:06, John Garry wrote:
> > On 08/10/2019 05:18, Ming Lei wrote:
> > > Hi,
> > > 
> > > Thomas mentioned:
> > >     "
> > >      That was the constraint of managed interrupts from the very
> > > beginning:
> > > 
> > >       The driver/subsystem has to quiesce the interrupt line and the
> > > associated
> > >       queue _before_ it gets shutdown in CPU unplug and not fiddle
> > > with it
> > >       until it's restarted by the core when the CPU is plugged in again.
> > >     "
> > > 
> > > But no drivers or blk-mq do that before one hctx becomes dead(all
> > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > > 
> > > This patchset tries to address the issue by two stages:
> > > 
> > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > > 
> > > - mark the hctx as internal stopped, and drain all in-flight requests
> > > if the hctx is going to be dead.
> > > 
> > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
> > > becomes dead
> > > 
> > > - steal bios from the request, and resubmit them via
> > > generic_make_request(),
> > > then these IO will be mapped to other live hctx for dispatch
> > > 
> > > Please comment & review, thanks!
> > > 
> > > John, I don't add your tested-by tag since V3 have some changes,
> > > and I appreciate if you may run your test on V3.
> > > 
> > 
> > Will do, Thanks
> 
> Hi Ming,
> 
> I got this warning once:
> 
> [  162.558185] CPU10: shutdown
> [  162.560994] psci: CPU10 killed.
> [  162.593939] CPU9: shutdown
> [  162.596645] psci: CPU9 killed.
> [  162.625838] CPU8: shutdown
> [  162.628550] psci: CPU8 killed.
> [  162.685790] CPU7: shutdown
> [  162.688496] psci: CPU7 killed.
> [  162.725771] CPU6: shutdown
> [  162.728486] psci: CPU6 killed.
> [  162.753884] CPU5: shutdown
> [  162.756591] psci: CPU5 killed.
> [  162.785584] irq_shutdown
> [  162.788277] IRQ 800: no longer affine to CPU4
> [  162.793267] CPU4: shutdown
> [  162.795975] psci: CPU4 killed.
> [  162.849680] run queue from wrong CPU 13, hctx active
> [  162.849692] CPU3: shutdown
> [  162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted
> 5.4.0-rc1-00012-gad025dd3d001 #1098
> [  162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
> V1.16.01 03/15/2019
> [  162.857362] psci: CPU3 killed.
> [  162.866039] Workqueue: kblockd blk_mq_run_work_fn
> [  162.882281] Call trace:
> [  162.884716]  dump_backtrace+0x0/0x150
> [  162.888365]  show_stack+0x14/0x20
> [  162.891668]  dump_stack+0xb0/0xf8
> [  162.894970]  __blk_mq_run_hw_queue+0x11c/0x128
> [  162.899400]  blk_mq_run_work_fn+0x1c/0x28
> [  162.903397]  process_one_work+0x1e0/0x358
> [  162.907393]  worker_thread+0x40/0x488
> [  162.911042]  kthread+0x118/0x120
> [  162.914257]  ret_from_fork+0x10/0x18

What is the HBA? If it is Hisilicon SAS, this isn't strange, because
this patchset can't yet handle a single hw queue with multiple private
reply queues; that can be a follow-up job to this patchset.

Thanks,
Ming


* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-09  8:39     ` Ming Lei
@ 2019-10-09  8:49       ` John Garry
  2019-10-10 10:30         ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2019-10-09  8:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

>>>> - steal bios from the request, and resubmit them via
>>>> generic_make_request(),
>>>> then these IO will be mapped to other live hctx for dispatch
>>>>
>>>> Please comment & review, thanks!
>>>>
>>>> John, I don't add your tested-by tag since V3 have some changes,
>>>> and I appreciate if you may run your test on V3.
>>>>
>>>
>>> Will do, Thanks
>>
>> Hi Ming,
>>
>> I got this warning once:
>>
>> [  162.558185] CPU10: shutdown
>> [  162.560994] psci: CPU10 killed.
>> [  162.593939] CPU9: shutdown
>> [  162.596645] psci: CPU9 killed.
>> [  162.625838] CPU8: shutdown
>> [  162.628550] psci: CPU8 killed.
>> [  162.685790] CPU7: shutdown
>> [  162.688496] psci: CPU7 killed.
>> [  162.725771] CPU6: shutdown
>> [  162.728486] psci: CPU6 killed.
>> [  162.753884] CPU5: shutdown
>> [  162.756591] psci: CPU5 killed.
>> [  162.785584] irq_shutdown
>> [  162.788277] IRQ 800: no longer affine to CPU4
>> [  162.793267] CPU4: shutdown
>> [  162.795975] psci: CPU4 killed.
>> [  162.849680] run queue from wrong CPU 13, hctx active
>> [  162.849692] CPU3: shutdown
>> [  162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted
>> 5.4.0-rc1-00012-gad025dd3d001 #1098
>> [  162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
>> V1.16.01 03/15/2019
>> [  162.857362] psci: CPU3 killed.
>> [  162.866039] Workqueue: kblockd blk_mq_run_work_fn
>> [  162.882281] Call trace:
>> [  162.884716]  dump_backtrace+0x0/0x150
>> [  162.888365]  show_stack+0x14/0x20
>> [  162.891668]  dump_stack+0xb0/0xf8
>> [  162.894970]  __blk_mq_run_hw_queue+0x11c/0x128
>> [  162.899400]  blk_mq_run_work_fn+0x1c/0x28
>> [  162.903397]  process_one_work+0x1e0/0x358
>> [  162.907393]  worker_thread+0x40/0x488
>> [  162.911042]  kthread+0x118/0x120
>> [  162.914257]  ret_from_fork+0x10/0x18
>
> What is the HBA? If it is Hisilicon SAS, it isn't strange, because
> this patch can't fix single hw queue with multiple private reply queue
> yet, that can be one follow-up job of this patchset.
>

Yes, hisi_sas. So, right, it is single queue today on mainline, but I 
manually made it multiqueue on my dev branch just to test this series. 
Otherwise I could not test it for that driver.

My dev branch is here, if interested:
https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq

I am also going to retest NVMe.

Thanks,
John

> Thanks,
> Ming
>
> .
>




* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-09  8:49       ` John Garry
@ 2019-10-10 10:30         ` Ming Lei
  2019-10-10 11:21           ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2019-10-10 10:30 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Wed, Oct 09, 2019 at 09:49:35AM +0100, John Garry wrote:
> > > > > - steal bios from the request, and resubmit them via
> > > > > generic_make_request(),
> > > > > then these IO will be mapped to other live hctx for dispatch
> > > > > 
> > > > > Please comment & review, thanks!
> > > > > 
> > > > > John, I don't add your tested-by tag since V3 have some changes,
> > > > > and I appreciate if you may run your test on V3.
> > > > > 
> > > > 
> > > > Will do, Thanks
> > > 
> > > Hi Ming,
> > > 
> > > I got this warning once:
> > > 
> > > [  162.558185] CPU10: shutdown
> > > [  162.560994] psci: CPU10 killed.
> > > [  162.593939] CPU9: shutdown
> > > [  162.596645] psci: CPU9 killed.
> > > [  162.625838] CPU8: shutdown
> > > [  162.628550] psci: CPU8 killed.
> > > [  162.685790] CPU7: shutdown
> > > [  162.688496] psci: CPU7 killed.
> > > [  162.725771] CPU6: shutdown
> > > [  162.728486] psci: CPU6 killed.
> > > [  162.753884] CPU5: shutdown
> > > [  162.756591] psci: CPU5 killed.
> > > [  162.785584] irq_shutdown
> > > [  162.788277] IRQ 800: no longer affine to CPU4
> > > [  162.793267] CPU4: shutdown
> > > [  162.795975] psci: CPU4 killed.
> > > [  162.849680] run queue from wrong CPU 13, hctx active
> > > [  162.849692] CPU3: shutdown
> > > [  162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted
> > > 5.4.0-rc1-00012-gad025dd3d001 #1098
> > > [  162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
> > > V1.16.01 03/15/2019
> > > [  162.857362] psci: CPU3 killed.
> > > [  162.866039] Workqueue: kblockd blk_mq_run_work_fn
> > > [  162.882281] Call trace:
> > > [  162.884716]  dump_backtrace+0x0/0x150
> > > [  162.888365]  show_stack+0x14/0x20
> > > [  162.891668]  dump_stack+0xb0/0xf8
> > > [  162.894970]  __blk_mq_run_hw_queue+0x11c/0x128
> > > [  162.899400]  blk_mq_run_work_fn+0x1c/0x28
> > > [  162.903397]  process_one_work+0x1e0/0x358
> > > [  162.907393]  worker_thread+0x40/0x488
> > > [  162.911042]  kthread+0x118/0x120
> > > [  162.914257]  ret_from_fork+0x10/0x18
> > 
> > What is the HBA? If it is Hisilicon SAS, it isn't strange, because
> > this patch can't fix single hw queue with multiple private reply queue
> > yet, that can be one follow-up job of this patchset.
> > 
> 
> Yes, hisi_sas. So, right, it is single queue today on mainline, but I
> manually made it multiqueue on my dev branch just to test this series.
> Otherwise I could not test it for that driver.
> 
> My dev branch is here, if interested:
> https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq

Your conversion shouldn't work, given that you do not change .can_queue
in the 'hisi_sas_v3: multiqueue support' patch.

As discussed before, the tags of Hisilicon V3 are HBA-wide. If you switch
to real hw queues, each hw queue has to own its independent tags.
However, that isn't supported by the V3 hardware.

See previous discussion:

https://marc.info/?t=155928863000001&r=1&w=2


Thanks,
Ming


* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-10 10:30         ` Ming Lei
@ 2019-10-10 11:21           ` John Garry
  2019-10-11  8:51             ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2019-10-10 11:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 10/10/2019 11:30, Ming Lei wrote:
>> Yes, hisi_sas. So, right, it is single queue today on mainline, but I
>> > manually made it multiqueue on my dev branch just to test this series.
>> > Otherwise I could not test it for that driver.
>> >
>> > My dev branch is here, if interested:
>> > https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq
> Your conversion shouldn't work given you do not change .can_queue in the
> patch of 'hisi_sas_v3: multiqueue support'.

Ah, I missed that, but I don't think it will really make a difference, 
since I'm only using a single disk, so can_queue doesn't really come 
into play. But....

>
> As discussed before, tags of hisilicon V3 is HBA wide. If you switch
> to real hw queue, each hw queue has to own its independent tags.
> However, that isn't supported by V3 hardware.

I am generating the tags internally in the driver now, so the hostwide 
tags issue should not arise.

And, to be clear, I am not paying too much attention to performance, but 
rather just hotplugging while running IO.

An update on testing:
I did some scripted overnight testing. The script essentially loops like 
this:
- online all CPUS
- run fio bound to a limited set of CPUs covering one hctx's cpumask for 
1 minute
- offline those CPUs
- wait 1 minute (> SCSI or NVMe timeout)
- and repeat
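The loop above can be sketched roughly as below; the CPU list, device path
and fio options are placeholders for illustration, not the exact setup, and
DRY_RUN only prints the hotplug writes so the sketch can be run safely:

```shell
#!/bin/sh
# Sketch of one iteration of the hotplug test loop; CPU list, device
# and fio options are assumptions, not the actual test configuration.
CPUS="4 5 6 7"                 # CPUs covering one hctx's cpumask
DEV=/dev/nvme0n1               # device under test (placeholder)
: "${DRY_RUN:=1}"              # set DRY_RUN=0 to really toggle CPUs

set_cpu() {                    # $1 = cpu id, $2 = 1 (online) / 0 (offline)
	if [ "$DRY_RUN" -eq 1 ]; then
		echo "cpu$1 -> $2"
	else
		echo "$2" > "/sys/devices/system/cpu/cpu$1/online"
	fi
}

run_iteration() {
	for cpu in $CPUS; do set_cpu "$cpu" 1; done    # online all CPUs
	if [ "$DRY_RUN" -eq 0 ]; then
		# run IO bound to those CPUs for 1 minute
		taskset -c "$(echo "$CPUS" | tr ' ' ',')" \
			fio --name=hotplug --filename="$DEV" --rw=randread \
			    --direct=1 --time_based --runtime=60
	fi
	for cpu in $CPUS; do set_cpu "$cpu" 0; done    # offline them
	if [ "$DRY_RUN" -eq 0 ]; then
		sleep 90                               # > SCSI/NVMe timeout
	fi
}

run_iteration
```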

SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding 
some fio processes that never die, with IOPS @ 0. I don't see any NVMe 
timeout reported. Did you do any NVMe testing of this sort?

Thanks,
John

>
> See previous discussion:
>
> https://marc.info/?t=155928863000001&r=1&w=2
>
>
> Thanks,
> Ming
>
> .
>




* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-10 11:21           ` John Garry
@ 2019-10-11  8:51             ` John Garry
  2019-10-11 11:55               ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2019-10-11  8:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 10/10/2019 12:21, John Garry wrote:
>
>>
>> As discussed before, tags of hisilicon V3 is HBA wide. If you switch
>> to real hw queue, each hw queue has to own its independent tags.
>> However, that isn't supported by V3 hardware.
>
> I am generating the tag internally in the driver now, so that hostwide
> tags issue should not be an issue.
>
> And, to be clear, I am not paying too much attention to performance, but
> rather just hotplugging while running IO.
>
> An update on testing:
> I did some scripted overnight testing. The script essentially loops like
> this:
> - online all CPUS
> - run fio binded on a limited bunch of CPUs to cover a hctx mask for 1
> minute
> - offline those CPUs
> - wait 1 minute (> SCSI or NVMe timeout)
> - and repeat
>
> SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
> some fio processes never dying with IOPS @ 0. I don't see any NVMe
> timeout reported. Did you do any NVMe testing of this sort?
>

Yeah, so for NVMe, I see some sort of regression, like this:
Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 
1158037877d:17h:18m:22s]

I have tested against vanilla 5.4-rc1 without any problem.

If you can advise some debug to add, then I'd appreciate it. If not, 
I'll try to add some debug to the new paths introduced in this series to 
see if anything I hit coincides with the error state, so I will at least 
have a hint...

Thanks,
John


>
>>
>> See previous discussion:
>>
>> https://marc.info/?t=155928863000001&r=1&w=2
>>
>>
>> Thanks,
>> Ming




* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-11  8:51             ` John Garry
@ 2019-10-11 11:55               ` Ming Lei
  2019-10-11 14:10                 ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2019-10-11 11:55 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

On Fri, Oct 11, 2019 at 4:54 PM John Garry <john.garry@huawei.com> wrote:
>
> On 10/10/2019 12:21, John Garry wrote:
> >
> >>
> >> As discussed before, the tags of hisilicon V3 are HBA-wide. If you switch
> >> to real hw queues, each hw queue has to own its independent tags.
> >> However, that isn't supported by the V3 hardware.
> >
> > I am generating the tag internally in the driver now, so the hostwide
> > tags issue should no longer be a problem.
> >
> > And, to be clear, I am not paying too much attention to performance, but
> > rather just hotplugging while running IO.
> >
> > An update on testing:
> > I did some scripted overnight testing. The script essentially loops like
> > this:
> > - online all CPUs
> > - run fio bound to a limited set of CPUs, covering one hctx's CPU mask, for 1
> > minute
> > - offline those CPUs
> > - wait 1 minute (> SCSI or NVMe timeout)
> > - and repeat
> >
> > SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
> > some fio processes that never die, stuck at 0 IOPS. I don't see any NVMe
> > timeout reported. Did you do any NVMe testing of this sort?
> >
>
> Yeah, so for NVMe, I see some sort of regression, like this:
> Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> 1158037877d:17h:18m:22s]

I can reproduce this issue, and it looks like there are requests stuck in ->dispatch.
I am a bit busy this week, so please feel free to investigate it; debugfs
can help you a lot here. I should have time next week to look into this issue.
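
As a starting point, a minimal sketch for scanning those debugfs entries
might look like this (the device name is a placeholder, and debugfs is
assumed mounted at the usual /sys/kernel/debug, with the standard
block/<dev>/hctxN/ layout):

```shell
#!/bin/sh
# Walk a device's blk-mq debugfs directory and report any hctx whose
# ->dispatch list is non-empty, together with its state flags.
# Device name and paths are assumptions, not part of the original mail.
DEV=${1:-nvme0n1}

scan_hctxs() {
	base=$1
	for h in "$base"/hctx*; do
		[ -d "$h" ] || continue
		if [ -s "$h/dispatch" ]; then    # non-empty: stuck request(s)
			echo "${h##*/}: dispatch list not empty"
			cat "$h/dispatch"
			cat "$h/state" 2>/dev/null   # e.g. shows stopped flags
		fi
	done
}

scan_hctxs "/sys/kernel/debug/block/$DEV"
```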

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-11 11:55               ` Ming Lei
@ 2019-10-11 14:10                 ` John Garry
  2019-10-14  1:25                   ` Ming Lei
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2019-10-11 14:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

On 11/10/2019 12:55, Ming Lei wrote:
> On Fri, Oct 11, 2019 at 4:54 PM John Garry <john.garry@huawei.com> wrote:
>>
>> On 10/10/2019 12:21, John Garry wrote:
>>>
>>>>
>>>> As discussed before, the tags of hisilicon V3 are HBA-wide. If you switch
>>>> to real hw queues, each hw queue has to own its independent tags.
>>>> However, that isn't supported by the V3 hardware.
>>>
>>> I am generating the tag internally in the driver now, so the hostwide
>>> tags issue should no longer be a problem.
>>>
>>> And, to be clear, I am not paying too much attention to performance, but
>>> rather just hotplugging while running IO.
>>>
>>> An update on testing:
>>> I did some scripted overnight testing. The script essentially loops like
>>> this:
>>> - online all CPUs
>>> - run fio bound to a limited set of CPUs, covering one hctx's CPU mask, for 1
>>> minute
>>> - offline those CPUs
>>> - wait 1 minute (> SCSI or NVMe timeout)
>>> - and repeat
>>>
>>> SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
>>> some fio processes that never die, stuck at 0 IOPS. I don't see any NVMe
>>> timeout reported. Did you do any NVMe testing of this sort?
>>>
>>
>> Yeah, so for NVMe, I see some sort of regression, like this:
>> Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>> 1158037877d:17h:18m:22s]
>
> I can reproduce this issue, and it looks like there are requests stuck in ->dispatch.

OK, that may match what I see:
- the problem occurring coincides with this call path, with
BLK_MQ_S_INTERNAL_STOPPED set:

blk_mq_request_bypass_insert
(__)blk_mq_try_issue_list_directly
blk_mq_sched_insert_requests
blk_mq_flush_plug_list
blk_flush_plug_list
blk_finish_plug
blkdev_direct_IO
generic_file_read_iter
blkdev_read_iter
aio_read
io_submit_one

blk_mq_request_bypass_insert() adds the request to the dispatch list, and 
looking at debugfs, could this be that request sitting there:
root@(none)$ more /sys/kernel/debug/block/nvme0n1/hctx18/dispatch
00000000ac28511d {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, 
.tag=56, .internal_tag=-1}

So could there be some race here?

> I am a bit busy this week, so please feel free to investigate it; debugfs
> can help you a lot here. I should have time next week to look into this issue.
>

OK, appreciated

John

> Thanks,
> Ming Lei
>
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-11 14:10                 ` John Garry
@ 2019-10-14  1:25                   ` Ming Lei
  2019-10-14  8:29                     ` John Garry
  0 siblings, 1 reply; 18+ messages in thread
From: Ming Lei @ 2019-10-14  1:25 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

On Fri, Oct 11, 2019 at 10:10 PM John Garry <john.garry@huawei.com> wrote:
>
> On 11/10/2019 12:55, Ming Lei wrote:
> > On Fri, Oct 11, 2019 at 4:54 PM John Garry <john.garry@huawei.com> wrote:
> >>
> >> On 10/10/2019 12:21, John Garry wrote:
> >>>
> >>>>
> >>>> As discussed before, the tags of hisilicon V3 are HBA-wide. If you switch
> >>>> to real hw queues, each hw queue has to own its independent tags.
> >>>> However, that isn't supported by the V3 hardware.
> >>>
> >>> I am generating the tag internally in the driver now, so the hostwide
> >>> tags issue should no longer be a problem.
> >>>
> >>> And, to be clear, I am not paying too much attention to performance, but
> >>> rather just hotplugging while running IO.
> >>>
> >>> An update on testing:
> >>> I did some scripted overnight testing. The script essentially loops like
> >>> this:
> >>> - online all CPUs
> >>> - run fio bound to a limited set of CPUs, covering one hctx's CPU mask, for 1
> >>> minute
> >>> - offline those CPUs
> >>> - wait 1 minute (> SCSI or NVMe timeout)
> >>> - and repeat
> >>>
> >>> SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
> >>> some fio processes that never die, stuck at 0 IOPS. I don't see any NVMe
> >>> timeout reported. Did you do any NVMe testing of this sort?
> >>>
> >>
> >> Yeah, so for NVMe, I see some sort of regression, like this:
> >> Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
> >> 1158037877d:17h:18m:22s]
> >
> > I can reproduce this issue, and it looks like there are requests stuck in ->dispatch.
>
> OK, that may match what I see:
> - the problem occurring coincides with this call path, with
> BLK_MQ_S_INTERNAL_STOPPED set:

Good catch, these requests should have been re-submitted in
blk_mq_hctx_notify_dead() too.

Will do it in V4.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-14  1:25                   ` Ming Lei
@ 2019-10-14  8:29                     ` John Garry
  0 siblings, 0 replies; 18+ messages in thread
From: John Garry @ 2019-10-14  8:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

On 14/10/2019 02:25, Ming Lei wrote:
> On Fri, Oct 11, 2019 at 10:10 PM John Garry <john.garry@huawei.com> wrote:
>>
>> On 11/10/2019 12:55, Ming Lei wrote:
>>> On Fri, Oct 11, 2019 at 4:54 PM John Garry <john.garry@huawei.com> wrote:
>>>>
>>>> On 10/10/2019 12:21, John Garry wrote:
>>>>>
>>>>>>
>>>>>> As discussed before, the tags of hisilicon V3 are HBA-wide. If you switch
>>>>>> to real hw queues, each hw queue has to own its independent tags.
>>>>>> However, that isn't supported by the V3 hardware.
>>>>>
>>>>> I am generating the tag internally in the driver now, so the hostwide
>>>>> tags issue should no longer be a problem.
>>>>>
>>>>> And, to be clear, I am not paying too much attention to performance, but
>>>>> rather just hotplugging while running IO.
>>>>>
>>>>> An update on testing:
>>>>> I did some scripted overnight testing. The script essentially loops like
>>>>> this:
>>>>> - online all CPUs
>>>>> - run fio bound to a limited set of CPUs, covering one hctx's CPU mask, for 1
>>>>> minute
>>>>> - offline those CPUs
>>>>> - wait 1 minute (> SCSI or NVMe timeout)
>>>>> - and repeat
>>>>>
>>>>> SCSI is actually quite stable, but NVMe isn't. For NVMe I am finding
>>>>> some fio processes that never die, stuck at 0 IOPS. I don't see any NVMe
>>>>> timeout reported. Did you do any NVMe testing of this sort?
>>>>>
>>>>
>>>> Yeah, so for NVMe, I see some sort of regression, like this:
>>>> Jobs: 1 (f=1): [_R] [0.0% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta
>>>> 1158037877d:17h:18m:22s]
>>>
>>> I can reproduce this issue, and it looks like there are requests stuck in ->dispatch.
>>
>> OK, that may match what I see:
>> - the problem occurring coincides with this call path, with
>> BLK_MQ_S_INTERNAL_STOPPED set:
>
> Good catch, these requests should have been re-submitted in
> blk_mq_hctx_notify_dead() too.
>
> Will do it in V4.

OK, I'll have a look at V4 and retest; it may take a while, as testing 
this is slow...

All the best,
John

>
> Thanks,
> Ming Lei
>
> .
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, back to index

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-08  4:18 [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
2019-10-08  4:18 ` [PATCH V3 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
2019-10-08  4:18 ` [PATCH V3 2/5] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
2019-10-08  4:18 ` [PATCH V3 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead Ming Lei
2019-10-08 17:03   ` John Garry
2019-10-08  4:18 ` [PATCH V3 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
2019-10-08  4:18 ` [PATCH V3 5/5] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
2019-10-08  9:06 ` [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug John Garry
2019-10-08 17:15   ` John Garry
2019-10-09  8:39     ` Ming Lei
2019-10-09  8:49       ` John Garry
2019-10-10 10:30         ` Ming Lei
2019-10-10 11:21           ` John Garry
2019-10-11  8:51             ` John Garry
2019-10-11 11:55               ` Ming Lei
2019-10-11 14:10                 ` John Garry
2019-10-14  1:25                   ` Ming Lei
2019-10-14  8:29                     ` John Garry
