Linux-Block Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
@ 2019-08-12 13:43 Ming Lei
  2019-08-12 13:43 ` [PATCH V2 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Ming Lei @ 2019-08-12 13:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Minwoo Im, Ming Lei, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

But no drivers or blk-mq do that before one hctx becomes dead(all
CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue by two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internal stopped, and drain all in-flight requests
if the hctx is going to be dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal bios from the request, and resubmit them via generic_make_request(),
then these IO will be mapped to other live hctx for dispatch

Please comment & review, thanks!

V2:
	- patch4 & patch 5 in V1 have been merged to block tree, so remove
	  them
	- address comments from John Garry and Minwoo


Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
    is dead

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 146 insertions(+), 16 deletions(-)

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
-- 
2.20.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH V2 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  2019-08-12 13:43 [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
@ 2019-08-12 13:43 ` Ming Lei
  2019-08-12 13:43 ` [PATCH V2 2/5] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2019-08-12 13:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Minwoo Im, Ming Lei, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Add a new hw queue state of BLK_MQ_S_INTERNAL_STOPPED, which prepares
for stopping hw queue before all CPUs of this hctx become offline.

We can't reuse BLK_MQ_S_STOPPED because that state can be cleared during IO
completion.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 1 +
 block/blk-mq.h         | 3 ++-
 include/linux/blk-mq.h | 3 +++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..af40a02c46ee 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INTERNAL_STOPPED),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 32c62c64e6c2..63717573bc16 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -176,7 +176,8 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
-	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	return test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
+		test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
 }
 
 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 21cebe901ac0..5b2d263e0646 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -235,6 +235,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is internal stopped, driver do not use it */
+	BLK_MQ_S_INTERNAL_STOPPED	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH V2 2/5] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  2019-08-12 13:43 [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  2019-08-12 13:43 ` [PATCH V2 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
@ 2019-08-12 13:43 ` Ming Lei
  2019-08-12 13:43 ` [PATCH V2 3/5] blk-mq: stop to handle IO before hctx's all CPUs become offline Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2019-08-12 13:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Minwoo Im, Ming Lei, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

We will stop hw queue and wait for completion of in-flight requests
when one hctx is becoming dead in the following patch. This way may
cause dead-lock for some stacking blk-mq drivers, such as dm-rq and
loop.

Add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ and mark it for dm-rq and
loop, so we needn't to wait for completion of in-flight requests of
dm-rq & loop, then the potential dead-lock can be avoided.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 1 +
 drivers/block/loop.c   | 2 +-
 drivers/md/dm-rq.c     | 2 +-
 include/linux/blk-mq.h | 1 +
 4 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index af40a02c46ee..24fff8c90942 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -240,6 +240,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a7461f482467..50328b572853 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1989,7 +1989,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 21d5c1784d0c..684f92988d40 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5b2d263e0646..838a22888413 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -226,6 +226,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH V2 3/5] blk-mq: stop to handle IO before hctx's all CPUs become offline
  2019-08-12 13:43 [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  2019-08-12 13:43 ` [PATCH V2 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
  2019-08-12 13:43 ` [PATCH V2 2/5] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ Ming Lei
@ 2019-08-12 13:43 ` Ming Lei
  2019-08-12 14:24   ` Hannes Reinecke
  2019-08-12 13:43 ` [PATCH V2 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2019-08-12 13:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Minwoo Im, Ming Lei, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
up queue mapping. Thomas mentioned the following point[1]:

"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

However, current blk-mq implementation doesn't quiesce hw queue before
the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is
one cpuhp state handled after the CPU is down, so there isn't any chance
to quiesce hctx for blk-mq wrt. CPU hotplug.

Add new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE for blk-mq to stop queues
and wait for completion of in-flight requests.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c         |  2 +-
 block/blk-mq-tag.h         |  2 ++
 block/blk-mq.c             | 65 ++++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h     |  1 +
 include/linux/cpuhotplug.h |  1 +
 5 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 008388e82b5c..31828b82552b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -325,7 +325,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..321fd6f440e6 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6968de9d7402..6931b2ba2776 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2206,6 +2206,61 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
+{
+	unsigned *count = data;
+
+	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
+		(*count)++;
+
+	return true;
+}
+
+static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
+{
+	unsigned count = 0;
+
+	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
+
+	return count;
+}
+
+static void blk_mq_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	while (1) {
+		if (!blk_mq_tags_inflight_rqs(hctx->tags))
+			break;
+		msleep(5);
+	}
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+	unsigned prev_cpu = -1;
+
+	while (true) {
+		unsigned next_cpu = cpumask_next_and(prev_cpu, hctx->cpumask,
+				cpu_online_mask);
+
+		if (next_cpu >= nr_cpu_ids)
+			break;
+
+		/* return if there is other online CPU on this hctx */
+		if (next_cpu != cpu)
+			return 0;
+
+		prev_cpu = next_cpu;
+	}
+
+	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
+	blk_mq_drain_inflight_rqs(hctx);
+
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2222,6 +2277,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
+	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
+
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_lists[type])) {
 		list_splice_init(&ctx->rq_lists[type], &tmp);
@@ -2242,6 +2299,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2301,6 +2361,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3537,6 +3600,8 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				NULL, blk_mq_hctx_notify_online);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 838a22888413..49413dcdb6aa 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -58,6 +58,7 @@ struct blk_mq_hw_ctx {
 
 	atomic_t		nr_active;
 
+	struct hlist_node	cpuhp_online;
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
 
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 068793a619ca..bb80f52040cb 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -147,6 +147,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH V2 4/5] blk-mq: re-submit IO in case that hctx is dead
  2019-08-12 13:43 [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (2 preceding siblings ...)
  2019-08-12 13:43 ` [PATCH V2 3/5] blk-mq: stop to handle IO before hctx's all CPUs become offline Ming Lei
@ 2019-08-12 13:43 ` Ming Lei
  2019-08-12 14:26   ` Hannes Reinecke
  2019-08-12 13:43 ` [PATCH V2 5/5] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
  2019-08-12 13:46 ` [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  5 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2019-08-12 13:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Minwoo Im, Ming Lei, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

When all CPUs in one hctx are offline, we shouldn't run this hw queue
for completing request any more.

So steal bios from the request, and resubmit them, and finally free
the request in blk_mq_hctx_notify_dead().

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6931b2ba2776..ed334fd867c4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2261,10 +2261,30 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+static void blk_mq_resubmit_io(struct request *rq)
+{
+	struct bio_list list;
+	struct bio *bio;
+
+	bio_list_init(&list);
+	blk_steal_bios(&list, rq);
+
+	while (true) {
+		bio = bio_list_pop(&list);
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+
+	blk_mq_cleanup_rq(rq);
+	blk_mq_end_request(rq, 0);
+}
+
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * 'cpu' has gone away. If this hctx is dead, we can't dispatch request
+ * to the hctx any more, so steal bios from requests of this hctx, and
+ * re-submit them to the request queue, and free these requests finally.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
@@ -2272,6 +2292,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
+	bool hctx_dead;
+	struct request *rq;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
@@ -2279,6 +2301,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
 
+	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
+		nr_cpu_ids;
+
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_lists[type])) {
 		list_splice_init(&ctx->rq_lists[type], &tmp);
@@ -2289,11 +2314,20 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	if (list_empty(&tmp))
 		return 0;
 
-	spin_lock(&hctx->lock);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+	if (!hctx_dead) {
+		spin_lock(&hctx->lock);
+		list_splice_tail_init(&tmp, &hctx->dispatch);
+		spin_unlock(&hctx->lock);
+		blk_mq_run_hw_queue(hctx, true);
+		return 0;
+	}
+
+	while (!list_empty(&tmp)) {
+		rq = list_entry(tmp.next, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_resubmit_io(rq);
+	}
 
-	blk_mq_run_hw_queue(hctx, true);
 	return 0;
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH V2 5/5] blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead
  2019-08-12 13:43 [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (3 preceding siblings ...)
  2019-08-12 13:43 ` [PATCH V2 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
@ 2019-08-12 13:43 ` " Ming Lei
  2019-08-12 13:46 ` [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  5 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2019-08-12 13:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Minwoo Im, Ming Lei, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

If hctx becomes dead, all in-queue IO requests aimed at this hctx have to
be re-submitted, so cover requests queued in scheduler queue.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ed334fd867c4..a722ce53fb39 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2294,6 +2294,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 	bool hctx_dead;
 	struct request *rq;
+	struct elevator_queue *e;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
@@ -2304,12 +2305,31 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
 		nr_cpu_ids;
 
-	spin_lock(&ctx->lock);
-	if (!list_empty(&ctx->rq_lists[type])) {
-		list_splice_init(&ctx->rq_lists[type], &tmp);
-		blk_mq_hctx_clear_pending(hctx, ctx);
+	e = hctx->queue->elevator;
+	if (!e) {
+		spin_lock(&ctx->lock);
+		if (!list_empty(&ctx->rq_lists[type])) {
+			list_splice_init(&ctx->rq_lists[type], &tmp);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+		}
+		spin_unlock(&ctx->lock);
+	} else if (hctx_dead) {
+		LIST_HEAD(sched_tmp);
+
+		while ((rq = e->type->ops.dispatch_request(hctx))) {
+			if (rq->mq_hctx != hctx)
+				list_add(&rq->queuelist, &sched_tmp);
+			else
+				list_add(&rq->queuelist, &tmp);
+		}
+
+		while (!list_empty(&sched_tmp)) {
+			rq = list_entry(sched_tmp.next, struct request,
+					queuelist);
+			list_del_init(&rq->queuelist);
+			blk_mq_sched_insert_request(rq, true, true, true);
+		}
 	}
-	spin_unlock(&ctx->lock);
 
 	if (list_empty(&tmp))
 		return 0;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-08-12 13:43 [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (4 preceding siblings ...)
  2019-08-12 13:43 ` [PATCH V2 5/5] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
@ 2019-08-12 13:46 ` Ming Lei
  2019-08-12 16:21   ` John Garry
  5 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2019-08-12 13:46 UTC (permalink / raw)
  To: Jens Axboe, John Garry
  Cc: linux-block, Minwoo Im, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

Hi John,

On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:
> Hi,
> 
> Thomas mentioned:
>     "
>      That was the constraint of managed interrupts from the very beginning:
>     
>       The driver/subsystem has to quiesce the interrupt line and the associated
>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>       until it's restarted by the core when the CPU is plugged in again.
>     "
> 
> But no drivers or blk-mq do that before one hctx becomes dead(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> 
> This patchset tries to address the issue by two stages:
> 
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> 
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
> 
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> 
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
> 
> Please comment & review, thanks!
> 
> V2:
> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> 	  them
> 	- address comments from John Garry and Minwoo
> 
> 
> Ming Lei (5):
>   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>   blk-mq: stop to handle IO before hctx's all CPUs become offline
>   blk-mq: re-submit IO in case that hctx is dead
>   blk-mq: handle requests dispatched from IO scheduler in case that hctx
>     is dead
> 
>  block/blk-mq-debugfs.c     |   2 +
>  block/blk-mq-tag.c         |   2 +-
>  block/blk-mq-tag.h         |   2 +
>  block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
>  block/blk-mq.h             |   3 +-
>  drivers/block/loop.c       |   2 +-
>  drivers/md/dm-rq.c         |   2 +-
>  include/linux/blk-mq.h     |   5 ++
>  include/linux/cpuhotplug.h |   1 +
>  9 files changed, 146 insertions(+), 16 deletions(-)
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
> -- 
> 2.20.1
> 

Sorry for forgetting to Cc you.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 3/5] blk-mq: stop to handle IO before hctx's all CPUs become offline
  2019-08-12 13:43 ` [PATCH V2 3/5] blk-mq: stop to handle IO before hctx's all CPUs become offline Ming Lei
@ 2019-08-12 14:24   ` Hannes Reinecke
  2019-08-12 22:24     ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2019-08-12 14:24 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Minwoo Im, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 8/12/19 3:43 PM, Ming Lei wrote:
> Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
> up queue mapping. Thomas mentioned the following point[1]:
> 
> "
>  That was the constraint of managed interrupts from the very beginning:
> 
>   The driver/subsystem has to quiesce the interrupt line and the associated
>   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>   until it's restarted by the core when the CPU is plugged in again.
> "
> 
> However, current blk-mq implementation doesn't quiesce hw queue before
> the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is
> one cpuhp state handled after the CPU is down, so there isn't any chance
> to quiesce hctx for blk-mq wrt. CPU hotplug.
> 
> Add new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE for blk-mq to stop queues
> and wait for completion of in-flight requests.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-tag.c         |  2 +-
>  block/blk-mq-tag.h         |  2 ++
>  block/blk-mq.c             | 65 ++++++++++++++++++++++++++++++++++++++
>  include/linux/blk-mq.h     |  1 +
>  include/linux/cpuhotplug.h |  1 +
>  5 files changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 008388e82b5c..31828b82552b 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -325,7 +325,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
>   *		true to continue iterating tags, false to stop.
>   * @priv:	Will be passed as second argument to @fn.
>   */
> -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>  		busy_tag_iter_fn *fn, void *priv)
>  {
>  	if (tags->nr_reserved_tags)
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 61deab0b5a5a..321fd6f440e6 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>  extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
>  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>  		void *priv);
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +		busy_tag_iter_fn *fn, void *priv);
>  
>  static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
>  						 struct blk_mq_hw_ctx *hctx)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6968de9d7402..6931b2ba2776 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2206,6 +2206,61 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>  
> +static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
> +				     bool reserved)
> +{
> +	unsigned *count = data;
> +
> +	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
> +		(*count)++;
> +
> +	return true;
> +}
> +
> +static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
> +{
> +	unsigned count = 0;
> +
> +	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
> +
> +	return count;
> +}
> +
> +static void blk_mq_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> +	while (1) {
> +		if (!blk_mq_tags_inflight_rqs(hctx->tags))
> +			break;
> +		msleep(5);
> +	}
> +}
> +
> +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +	unsigned prev_cpu = -1;
> +
> +	while (true) {
> +		unsigned next_cpu = cpumask_next_and(prev_cpu, hctx->cpumask,
> +				cpu_online_mask);
> +
> +		if (next_cpu >= nr_cpu_ids)
> +			break;
> +
> +		/* return if there is other online CPU on this hctx */
> +		if (next_cpu != cpu)
> +			return 0;
> +
> +		prev_cpu = next_cpu;
> +	}
> +
> +	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
> +	blk_mq_drain_inflight_rqs(hctx);
> +
> +	return 0;
> +}
> +
>  /*
>   * 'cpu' is going away. splice any existing rq_list entries from this
>   * software queue to the hw queue dispatch list, and ensure that it

Isn't that inverted?
From the function I would assume it'll be called once the CPU is being
set toe 'online', yet from the description I would have assumed the
INTERNAL_STOPPED bit is set when the cpu goes offline.
Care to elaborate?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 4/5] blk-mq: re-submit IO in case that hctx is dead
  2019-08-12 13:43 ` [PATCH V2 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
@ 2019-08-12 14:26   ` Hannes Reinecke
  2019-08-12 22:30     ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2019-08-12 14:26 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Minwoo Im, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 8/12/19 3:43 PM, Ming Lei wrote:
> When all CPUs in one hctx are offline, we shouldn't run this hw queue
> for completing request any more.
> 
> So steal bios from the request, and resubmit them, and finally free
> the request in blk_mq_hctx_notify_dead().
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 41 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6931b2ba2776..ed334fd867c4 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2261,10 +2261,30 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
>  	return 0;
>  }
>  
> +static void blk_mq_resubmit_io(struct request *rq)
> +{
> +	struct bio_list list;
> +	struct bio *bio;
> +
> +	bio_list_init(&list);
> +	blk_steal_bios(&list, rq);
> +
> +	while (true) {
> +		bio = bio_list_pop(&list);
> +		if (!bio)
> +			break;
> +
> +		generic_make_request(bio);
> +	}
> +
> +	blk_mq_cleanup_rq(rq);
> +	blk_mq_end_request(rq, 0);
> +}
> +
>  /*
> - * 'cpu' is going away. splice any existing rq_list entries from this
> - * software queue to the hw queue dispatch list, and ensure that it
> - * gets run.
> + * 'cpu' has gone away. If this hctx is dead, we can't dispatch request
> + * to the hctx any more, so steal bios from requests of this hctx, and
> + * re-submit them to the request queue, and free these requests finally.
>   */
>  static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  {
> @@ -2272,6 +2292,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	struct blk_mq_ctx *ctx;
>  	LIST_HEAD(tmp);
>  	enum hctx_type type;
> +	bool hctx_dead;
> +	struct request *rq;
>  
>  	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
>  	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
> @@ -2279,6 +2301,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  
>  	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
>  
> +	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
> +		nr_cpu_ids;
> +
>  	spin_lock(&ctx->lock);
>  	if (!list_empty(&ctx->rq_lists[type])) {
>  		list_splice_init(&ctx->rq_lists[type], &tmp);
> @@ -2289,11 +2314,20 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	if (list_empty(&tmp))
>  		return 0;
>  
> -	spin_lock(&hctx->lock);
> -	list_splice_tail_init(&tmp, &hctx->dispatch);
> -	spin_unlock(&hctx->lock);
> +	if (!hctx_dead) {
> +		spin_lock(&hctx->lock);
> +		list_splice_tail_init(&tmp, &hctx->dispatch);
> +		spin_unlock(&hctx->lock);
> +		blk_mq_run_hw_queue(hctx, true);
> +		return 0;
> +	}
> +
> +	while (!list_empty(&tmp)) {
> +		rq = list_entry(tmp.next, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		blk_mq_resubmit_io(rq);
> +	}
>  
> -	blk_mq_run_hw_queue(hctx, true);
>  	return 0;
>  }
>  
> 
So what happens when all CPUs assigned to a hardware queue go offline?
Wouldn't blk_steal_bios() etc resend the I/O to the same hw queue,
causing an infinite loop?

Don't we have to rearrange the hardware queues here?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-08-12 13:46 ` [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
@ 2019-08-12 16:21   ` John Garry
  2019-08-12 22:45     ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: John Garry @ 2019-08-12 16:21 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Minwoo Im, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch, chenxiang,
	linux-scsi

On 12/08/2019 14:46, Ming Lei wrote:
> Hi John,
>
> On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:
>> Hi,
>>
>> Thomas mentioned:
>>     "
>>      That was the constraint of managed interrupts from the very beginning:
>>
>>       The driver/subsystem has to quiesce the interrupt line and the associated
>>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>>       until it's restarted by the core when the CPU is plugged in again.
>>     "
>>
>> But no drivers or blk-mq do that before one hctx becomes dead(all
>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>>
>> This patchset tries to address the issue by two stages:
>>
>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>
>> - mark the hctx as internal stopped, and drain all in-flight requests
>> if the hctx is going to be dead.
>>
>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>>
>> - steal bios from the request, and resubmit them via generic_make_request(),
>> then these IO will be mapped to other live hctx for dispatch
>>
>> Please comment & review, thanks!
>>
>> V2:
>> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
>> 	  them
>> 	- address comments from John Garry and Minwoo
>>
>>
>> Ming Lei (5):
>>   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>>   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>>   blk-mq: stop to handle IO before hctx's all CPUs become offline
>>   blk-mq: re-submit IO in case that hctx is dead
>>   blk-mq: handle requests dispatched from IO scheduler in case that hctx
>>     is dead
>>
>>  block/blk-mq-debugfs.c     |   2 +
>>  block/blk-mq-tag.c         |   2 +-
>>  block/blk-mq-tag.h         |   2 +
>>  block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
>>  block/blk-mq.h             |   3 +-
>>  drivers/block/loop.c       |   2 +-
>>  drivers/md/dm-rq.c         |   2 +-
>>  include/linux/blk-mq.h     |   5 ++
>>  include/linux/cpuhotplug.h |   1 +
>>  9 files changed, 146 insertions(+), 16 deletions(-)
>>
>> Cc: Bart Van Assche <bvanassche@acm.org>
>> Cc: Hannes Reinecke <hare@suse.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Keith Busch <keith.busch@intel.com>
>> --
>> 2.20.1
>>
>
> Sorry for forgetting to Cc you.

Already subscribed :)

I don't mean to hijack this thread, but JFYI we're getting around to 
test https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 - 
unfortunately we're still seeing a performance regression. I can't see 
where it's coming from. We're double-checking the test though.

Thanks,
John

>
>
> Thanks,
> Ming
>
> .
>



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 3/5] blk-mq: stop to handle IO before hctx's all CPUs become offline
  2019-08-12 14:24   ` Hannes Reinecke
@ 2019-08-12 22:24     ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2019-08-12 22:24 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Minwoo Im, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

On Mon, Aug 12, 2019 at 04:24:01PM +0200, Hannes Reinecke wrote:
> On 8/12/19 3:43 PM, Ming Lei wrote:
> > Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
> > up queue mapping. Thomas mentioned the following point[1]:
> > 
> > "
> >  That was the constraint of managed interrupts from the very beginning:
> > 
> >   The driver/subsystem has to quiesce the interrupt line and the associated
> >   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >   until it's restarted by the core when the CPU is plugged in again.
> > "
> > 
> > However, current blk-mq implementation doesn't quiesce hw queue before
> > the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is
> > one cpuhp state handled after the CPU is down, so there isn't any chance
> > to quiesce hctx for blk-mq wrt. CPU hotplug.
> > 
> > Add new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE for blk-mq to stop queues
> > and wait for completion of in-flight requests.
> > 
> > [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Keith Busch <keith.busch@intel.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq-tag.c         |  2 +-
> >  block/blk-mq-tag.h         |  2 ++
> >  block/blk-mq.c             | 65 ++++++++++++++++++++++++++++++++++++++
> >  include/linux/blk-mq.h     |  1 +
> >  include/linux/cpuhotplug.h |  1 +
> >  5 files changed, 70 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 008388e82b5c..31828b82552b 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -325,7 +325,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
> >   *		true to continue iterating tags, false to stop.
> >   * @priv:	Will be passed as second argument to @fn.
> >   */
> > -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> > +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> >  		busy_tag_iter_fn *fn, void *priv)
> >  {
> >  	if (tags->nr_reserved_tags)
> > diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> > index 61deab0b5a5a..321fd6f440e6 100644
> > --- a/block/blk-mq-tag.h
> > +++ b/block/blk-mq-tag.h
> > @@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
> >  extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
> >  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
> >  		void *priv);
> > +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> > +		busy_tag_iter_fn *fn, void *priv);
> >  
> >  static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
> >  						 struct blk_mq_hw_ctx *hctx)
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 6968de9d7402..6931b2ba2776 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2206,6 +2206,61 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >  	return -ENOMEM;
> >  }
> >  
> > +static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
> > +				     bool reserved)
> > +{
> > +	unsigned *count = data;
> > +
> > +	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
> > +		(*count)++;
> > +
> > +	return true;
> > +}
> > +
> > +static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
> > +{
> > +	unsigned count = 0;
> > +
> > +	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
> > +
> > +	return count;
> > +}
> > +
> > +static void blk_mq_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	while (1) {
> > +		if (!blk_mq_tags_inflight_rqs(hctx->tags))
> > +			break;
> > +		msleep(5);
> > +	}
> > +}
> > +
> > +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> > +{
> > +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> > +			struct blk_mq_hw_ctx, cpuhp_online);
> > +	unsigned prev_cpu = -1;
> > +
> > +	while (true) {
> > +		unsigned next_cpu = cpumask_next_and(prev_cpu, hctx->cpumask,
> > +				cpu_online_mask);
> > +
> > +		if (next_cpu >= nr_cpu_ids)
> > +			break;
> > +
> > +		/* return if there is other online CPU on this hctx */
> > +		if (next_cpu != cpu)
> > +			return 0;
> > +
> > +		prev_cpu = next_cpu;
> > +	}
> > +
> > +	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
> > +	blk_mq_drain_inflight_rqs(hctx);
> > +
> > +	return 0;
> > +}
> > +
> >  /*
> >   * 'cpu' is going away. splice any existing rq_list entries from this
> >   * software queue to the hw queue dispatch list, and ensure that it
> 
> Isn't that inverted?

The above comment is wrong, and you will see it is fixed in patch 4.

blk_mq_hctx_notify_dead() is called when the specified CPU is dead, and
blk_mq_hctx_notify_online() is called before the CPU goes away during
cpuhp path.

> From the function I would assume it'll be called once the CPU is being
> set toe 'online', yet from the description I would have assumed the

No, blk_mq_init() only registers teardown callback for
CPUHP_AP_BLK_MQ_ONLINE, that means blk_mq_hctx_notify_online() is only
called when CPU is going away, still online, so we can stop queue for
quiescing IO on this queue.

> INTERNAL_STOPPED bit is set when the cpu goes offline.
> Care to elaborate?

The idea is to quiesce queue in two stages:

1) in blk_mq_hctx_notify_online(), this CPU isn't dead yet, but is going to
become dead, so we can wait for completion of in-flight IOs. Meantime
stop the hw queue.

2) in blk_mq_hctx_notify_dead(), all CPUs of this hw queue have been
dead, what we can do is to end the request and re-submit the IO, and new
request will be mapped to another active hw queue because
blk_mq_hctx_notify_dead() is always called from one online CPU.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 4/5] blk-mq: re-submit IO in case that hctx is dead
  2019-08-12 14:26   ` Hannes Reinecke
@ 2019-08-12 22:30     ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2019-08-12 22:30 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, Minwoo Im, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

On Mon, Aug 12, 2019 at 04:26:42PM +0200, Hannes Reinecke wrote:
> On 8/12/19 3:43 PM, Ming Lei wrote:
> > When all CPUs in one hctx are offline, we shouldn't run this hw queue
> > for completing request any more.
> > 
> > So steal bios from the request, and resubmit them, and finally free
> > the request in blk_mq_hctx_notify_dead().
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Keith Busch <keith.busch@intel.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 41 insertions(+), 7 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 6931b2ba2776..ed334fd867c4 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2261,10 +2261,30 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> >  	return 0;
> >  }
> >  
> > +static void blk_mq_resubmit_io(struct request *rq)
> > +{
> > +	struct bio_list list;
> > +	struct bio *bio;
> > +
> > +	bio_list_init(&list);
> > +	blk_steal_bios(&list, rq);
> > +
> > +	while (true) {
> > +		bio = bio_list_pop(&list);
> > +		if (!bio)
> > +			break;
> > +
> > +		generic_make_request(bio);
> > +	}
> > +
> > +	blk_mq_cleanup_rq(rq);
> > +	blk_mq_end_request(rq, 0);
> > +}
> > +
> >  /*
> > - * 'cpu' is going away. splice any existing rq_list entries from this
> > - * software queue to the hw queue dispatch list, and ensure that it
> > - * gets run.
> > + * 'cpu' has gone away. If this hctx is dead, we can't dispatch request
> > + * to the hctx any more, so steal bios from requests of this hctx, and
> > + * re-submit them to the request queue, and free these requests finally.
> >   */
> >  static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >  {
> > @@ -2272,6 +2292,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >  	struct blk_mq_ctx *ctx;
> >  	LIST_HEAD(tmp);
> >  	enum hctx_type type;
> > +	bool hctx_dead;
> > +	struct request *rq;
> >  
> >  	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> >  	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
> > @@ -2279,6 +2301,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >  
> >  	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
> >  
> > +	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
> > +		nr_cpu_ids;
> > +
> >  	spin_lock(&ctx->lock);
> >  	if (!list_empty(&ctx->rq_lists[type])) {
> >  		list_splice_init(&ctx->rq_lists[type], &tmp);
> > @@ -2289,11 +2314,20 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >  	if (list_empty(&tmp))
> >  		return 0;
> >  
> > -	spin_lock(&hctx->lock);
> > -	list_splice_tail_init(&tmp, &hctx->dispatch);
> > -	spin_unlock(&hctx->lock);
> > +	if (!hctx_dead) {
> > +		spin_lock(&hctx->lock);
> > +		list_splice_tail_init(&tmp, &hctx->dispatch);
> > +		spin_unlock(&hctx->lock);
> > +		blk_mq_run_hw_queue(hctx, true);
> > +		return 0;
> > +	}
> > +
> > +	while (!list_empty(&tmp)) {
> > +		rq = list_entry(tmp.next, struct request, queuelist);
> > +		list_del_init(&rq->queuelist);
> > +		blk_mq_resubmit_io(rq);
> > +	}
> >  
> > -	blk_mq_run_hw_queue(hctx, true);
> >  	return 0;
> >  }
> >  
> > 
> So what happens when all CPUs assigned to a hardware queue go offline?
> Wouldn't blk_steal_bios() etc resend the I/O to the same hw queue,
> causing an infinite loop?

No, blk_mq_hctx_notify_dead() is always called on one online CPU, so the I/O
won't be remapped to same hw queue.

> 
> Don't we have to rearrange the hardware queues here?

No, we use static queue mapping for managed IRQ, all possible CPUs have
been spread on all hw queues.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-08-12 16:21   ` John Garry
@ 2019-08-12 22:45     ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2019-08-12 22:45 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Minwoo Im, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch,
	chenxiang, linux-scsi

On Mon, Aug 12, 2019 at 05:21:44PM +0100, John Garry wrote:
> On 12/08/2019 14:46, Ming Lei wrote:
> > Hi John,
> > 
> > On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:
> > > Hi,
> > > 
> > > Thomas mentioned:
> > >     "
> > >      That was the constraint of managed interrupts from the very beginning:
> > > 
> > >       The driver/subsystem has to quiesce the interrupt line and the associated
> > >       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> > >       until it's restarted by the core when the CPU is plugged in again.
> > >     "
> > > 
> > > But no drivers or blk-mq do that before one hctx becomes dead(all
> > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > > 
> > > This patchset tries to address the issue by two stages:
> > > 
> > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > > 
> > > - mark the hctx as internal stopped, and drain all in-flight requests
> > > if the hctx is going to be dead.
> > > 
> > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> > > 
> > > - steal bios from the request, and resubmit them via generic_make_request(),
> > > then these IO will be mapped to other live hctx for dispatch
> > > 
> > > Please comment & review, thanks!
> > > 
> > > V2:
> > > 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> > > 	  them
> > > 	- address comments from John Garry and Minwoo
> > > 
> > > 
> > > Ming Lei (5):
> > >   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
> > >   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
> > >   blk-mq: stop to handle IO before hctx's all CPUs become offline
> > >   blk-mq: re-submit IO in case that hctx is dead
> > >   blk-mq: handle requests dispatched from IO scheduler in case that hctx
> > >     is dead
> > > 
> > >  block/blk-mq-debugfs.c     |   2 +
> > >  block/blk-mq-tag.c         |   2 +-
> > >  block/blk-mq-tag.h         |   2 +
> > >  block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
> > >  block/blk-mq.h             |   3 +-
> > >  drivers/block/loop.c       |   2 +-
> > >  drivers/md/dm-rq.c         |   2 +-
> > >  include/linux/blk-mq.h     |   5 ++
> > >  include/linux/cpuhotplug.h |   1 +
> > >  9 files changed, 146 insertions(+), 16 deletions(-)
> > > 
> > > Cc: Bart Van Assche <bvanassche@acm.org>
> > > Cc: Hannes Reinecke <hare@suse.com>
> > > Cc: Christoph Hellwig <hch@lst.de>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > Cc: Keith Busch <keith.busch@intel.com>
> > > --
> > > 2.20.1
> > > 
> > 
> > Sorry for forgetting to Cc you.
> 
> Already subscribed :)
> 
> I don't mean to hijack this thread, but JFYI we're getting around to test
> https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 - unfortunately
> we're still seeing a performance regression. I can't see where it's coming
> from. We're double-checking the test though.

host-tag patchset is only for several particular drivers which use
private reply queue as completion queue.

This patchset is for handling generic blk-mq CPU hotplug issue, and
the several particular scsi drivers(hisi_sas_v3, hpsa, megaraid_sas and
mp3sas) won't be covered so far.

I'd suggest to move on for generic blk-mq devices first given now blk-mq
is the only request IO path now.

There are at least two choices for us to handle drivers/devices with
private completion queue:

1) host-tags
- performance issue shouldn't be hard to solve, given it is same with
with single tags in theory, and just corner cases is there.

What I am not glad with this approach is that blk-mq-tag code becomes mess.

2) private callback
- we could define private callback to drain each completion queue in
  driver simply.
- problem is that the four drivers have to duplicate the same job


Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, back to index

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-12 13:43 [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
2019-08-12 13:43 ` [PATCH V2 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
2019-08-12 13:43 ` [PATCH V2 2/5] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ Ming Lei
2019-08-12 13:43 ` [PATCH V2 3/5] blk-mq: stop to handle IO before hctx's all CPUs become offline Ming Lei
2019-08-12 14:24   ` Hannes Reinecke
2019-08-12 22:24     ` Ming Lei
2019-08-12 13:43 ` [PATCH V2 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
2019-08-12 14:26   ` Hannes Reinecke
2019-08-12 22:30     ` Ming Lei
2019-08-12 13:43 ` [PATCH V2 5/5] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
2019-08-12 13:46 ` [PATCH V2 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
2019-08-12 16:21   ` John Garry
2019-08-12 22:45     ` Ming Lei

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org linux-block@archiver.kernel.org
	public-inbox-index linux-block


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/ public-inbox