* [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug
@ 2019-07-12  2:47 Ming Lei
  2019-07-12  2:47 ` [RFC PATCH 1/7] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Ming Lei @ 2019-07-12  2:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

However, neither the drivers nor blk-mq do that before one hctx becomes
dead (all CPUs for one hctx are offline). Even worse, blk-mq still tries
to run the hw queue after the hctx is dead; see blk_mq_hctx_notify_dead().

This patchset addresses the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internally stopped, and drain all in-flight requests
if the hctx is going to become dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal the bios from each request and resubmit them via
generic_make_request(); these IOs will then be mapped to other live
hctxs for dispatch (sketched below)
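
For reference, the two new hooks can be sketched as follows (simplified,
untested pseudo-code summarizing patches 3 and 6; hctx_last_online_cpu()
is only an illustrative helper, the real check is open-coded in patch 3):

static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	/* only act when 'cpu' is the last online CPU mapped to this hctx */
	if (!hctx_last_online_cpu(hctx, cpu))
		return 0;

	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
	blk_mq_drain_inflight_rqs(hctx);	/* wait for in-flight requests */
	return 0;
}

static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
{
	...
	/* hctx is dead: steal the bios and resubmit them to a live hctx */
	while (!list_empty(&tmp)) {
		rq = list_entry(tmp.next, struct request, queuelist);
		list_del_init(&rq->queuelist);
		blk_mq_resubmit_io(rq); /* blk_steal_bios() + generic_make_request() */
	}
	return 0;
}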

Please comment & review, thanks!


Ming Lei (7):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: add callback of .free_request
  SCSI: implement .free_request callback
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
    is dead

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 147 ++++++++++++++++++++++++++++++++++---
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 drivers/scsi/scsi_lib.c    |  13 ++++
 include/linux/blk-mq.h     |  12 +++
 include/linux/cpuhotplug.h |   1 +
 10 files changed, 170 insertions(+), 16 deletions(-)

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>


-- 
2.20.1



* [RFC PATCH 1/7] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  2019-07-12  2:47 [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
@ 2019-07-12  2:47 ` Ming Lei
  2019-07-12  2:47 ` [RFC PATCH 2/7] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ Ming Lei
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-07-12  2:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

Add a new hw queue state, BLK_MQ_S_INTERNAL_STOPPED, which prepares for
stopping the hw queue before all CPUs of this hctx become offline.

We can't reuse BLK_MQ_S_STOPPED because that state can be cleared during IO
completion.
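
For context, BLK_MQ_S_STOPPED is cleared from the restart path, which
may be called from the IO completion path; roughly the current mainline
code:

void blk_mq_start_stopped_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
{
	if (!blk_mq_hctx_stopped(hctx))
		return;

	clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
	blk_mq_run_hw_queue(hctx, async);
}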

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 1 +
 block/blk-mq.h         | 3 ++-
 include/linux/blk-mq.h | 3 +++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..af40a02c46ee 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INTERNAL_STOPPED),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f4bf5161333e..9d821adf5765 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -176,7 +176,8 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
-	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	return test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
+		test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
 }
 
 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3fa1fa59f9b2..3a731c3c0762 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -228,6 +228,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is internally stopped; the driver must not use it */
+	BLK_MQ_S_INTERNAL_STOPPED	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
-- 
2.20.1



* [RFC PATCH 2/7] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  2019-07-12  2:47 [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  2019-07-12  2:47 ` [RFC PATCH 1/7] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
@ 2019-07-12  2:47 ` Ming Lei
  2019-07-16  2:53   ` John Garry
  2019-07-12  2:47 ` [RFC PATCH 3/7] blk-mq: stop to handle IO before hctx's all CPUs become offline Ming Lei
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-07-12  2:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

We will stop the hw queue and wait for completion of in-flight requests
when one hctx becomes dead in the following patch. This may cause a
deadlock for some stacking blk-mq drivers, such as dm-rq and loop.

Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and set it for dm-rq and
loop, so we don't need to wait for completion of the in-flight requests
of dm-rq and loop; then the potential deadlock can be avoided.
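
The flag is consulted by the hotplug handler added in the next patch,
roughly:

	/* in blk_mq_hctx_notify_online() */
	if (hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ)
		return 0;	/* don't stop/drain such queues */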

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 1 +
 drivers/block/loop.c   | 2 +-
 drivers/md/dm-rq.c     | 2 +-
 include/linux/blk-mq.h | 1 +
 4 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index af40a02c46ee..24fff8c90942 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -240,6 +240,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 44c9985f352a..199d76e8bf46 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1986,7 +1986,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 5f7063f05ae0..f7fbef2d3cd7 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -546,7 +546,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3a731c3c0762..911cdc6479dc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -219,6 +219,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.20.1



* [RFC PATCH 3/7] blk-mq: stop to handle IO before hctx's all CPUs become offline
  2019-07-12  2:47 [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  2019-07-12  2:47 ` [RFC PATCH 1/7] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
  2019-07-12  2:47 ` [RFC PATCH 2/7] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ Ming Lei
@ 2019-07-12  2:47 ` Ming Lei
  2019-07-12  9:03   ` Minwoo Im
  2019-07-16  3:03   ` John Garry
  2019-07-12  2:47 ` [RFC PATCH 4/7] blk-mq: add callback of .free_request Ming Lei
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 16+ messages in thread
From: Ming Lei @ 2019-07-12  2:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

Most blk-mq drivers depend on managed IRQs' auto-affinity to set up
the queue mapping. Thomas mentioned the following point[1]:

"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

However, the current blk-mq implementation doesn't quiesce the hw
queue before the last CPU in the hctx is shut down. Even worse,
CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so
there isn't any chance to quiesce the hctx for blk-mq wrt. CPU hotplug.

Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop
queues and wait for completion of in-flight requests.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c         |  2 +-
 block/blk-mq-tag.h         |  2 ++
 block/blk-mq.c             | 67 ++++++++++++++++++++++++++++++++++++++
 include/linux/blk-mq.h     |  1 +
 include/linux/cpuhotplug.h |  1 +
 5 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index da19f0bc8876..bcefb213ad69 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -324,7 +324,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..321fd6f440e6 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e5ef40c603ca..028c5d78e409 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2205,6 +2205,64 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
+{
+	unsigned *count = data;
+
+	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
+		(*count)++;
+
+	return true;
+}
+
+unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
+{
+	unsigned count = 0;
+
+	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
+
+	return count;
+}
+
+static void blk_mq_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	while (1) {
+		if (!blk_mq_tags_inflight_rqs(hctx->tags))
+			break;
+		msleep(5);
+	}
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+	unsigned prev_cpu = -1;
+
+	if (hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ)
+		return 0;
+
+	while (true) {
+		unsigned other_cpu = cpumask_next_and(prev_cpu, hctx->cpumask,
+				cpu_online_mask);
+
+		if (other_cpu >= nr_cpu_ids)
+			break;
+
+		/* return if there is another online CPU on this hctx */
+		if (other_cpu != cpu)
+			return 0;
+
+		prev_cpu = other_cpu;
+	}
+
+	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
+	blk_mq_drain_inflight_rqs(hctx);
+
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2221,6 +2279,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
+	if (test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state))
+		clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
+
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_lists[type])) {
 		list_splice_init(&ctx->rq_lists[type], &tmp);
@@ -2243,6 +2304,8 @@ static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
+	cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+					    &hctx->cpuhp_online);
 }
 
 /* hctx->ctxs will be freed in queue's release handler */
@@ -2301,6 +2364,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue_num = hctx_idx;
 
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
+	cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+			&hctx->cpuhp_online);
 
 	hctx->tags = set->tags[hctx_idx];
 
@@ -3536,6 +3601,8 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				NULL, blk_mq_hctx_notify_online);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 911cdc6479dc..dc86bdac08f4 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -59,6 +59,7 @@ struct blk_mq_hw_ctx {
 	atomic_t		nr_active;
 
 	struct hlist_node	cpuhp_dead;
+	struct hlist_node	cpuhp_online;
 	struct kobject		kobj;
 
 	unsigned long		poll_considered;
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 87c211adf49e..5177f7bbcb88 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -147,6 +147,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.20.1



* [RFC PATCH 4/7] blk-mq: add callback of .free_request
  2019-07-12  2:47 [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (2 preceding siblings ...)
  2019-07-12  2:47 ` [RFC PATCH 3/7] blk-mq: stop to handle IO before hctx's all CPUs become offline Ming Lei
@ 2019-07-12  2:47 ` Ming Lei
  2019-07-12  2:47 ` [RFC PATCH 5/7] SCSI: implement .free_request callback Ming Lei
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-07-12  2:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

blk_steal_bios() is usually used before freeing the request. However, we
have to tell the driver that the uncompleted request will be freed, so
that the driver can free any private part of the request.

Then we can apply blk_steal_bios() in other cases, such as freeing the
request and re-submitting the stolen bios after the hctx goes down
during CPU hotplug.
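
The intended call pattern, as wired up by the re-submission patch later
in this series, is roughly:

	blk_steal_bios(&list, rq);
	/* let the driver release its private part of the request */
	if (q->mq_ops->free_request)
		q->mq_ops->free_request(rq);
	blk_mq_end_request(rq, 0);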

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/blk-mq.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index dc86bdac08f4..353606023a0f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -141,6 +141,7 @@ typedef int (poll_fn)(struct blk_mq_hw_ctx *);
 typedef int (map_queues_fn)(struct blk_mq_tag_set *set);
 typedef bool (busy_fn)(struct request_queue *);
 typedef void (complete_fn)(struct request *);
+typedef void (free_request_fn)(struct request *);
 
 
 struct blk_mq_ops {
@@ -201,6 +202,12 @@ struct blk_mq_ops {
 	/* Called from inside blk_get_request() */
 	void (*initialize_rq_fn)(struct request *rq);
 
+	/*
+	 * Called before freeing a request which isn't completed yet,
+	 * usually to free the driver-private part
+	 */
+	free_request_fn		*free_request;
+
 	/*
 	 * If set, returns whether or not this queue currently is busy
 	 */
-- 
2.20.1



* [RFC PATCH 5/7] SCSI: implement .free_request callback
  2019-07-12  2:47 [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (3 preceding siblings ...)
  2019-07-12  2:47 ` [RFC PATCH 4/7] blk-mq: add callback of .free_request Ming Lei
@ 2019-07-12  2:47 ` Ming Lei
  2019-07-12  2:47 ` [RFC PATCH 6/7] blk-mq: re-submit IO in case that hctx is dead Ming Lei
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-07-12  2:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

Implement the .free_request() callback for freeing the driver-private
part of the request. Then we can avoid leaking this part if the request
isn't completed by SCSI and is finally freed by blk-mq.

One use case is handling CPU hotplug when the current hctx is down but
some requests allocated from this hctx are still queued.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 65d0a10c76ad..e07a376a8c38 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1073,6 +1073,18 @@ static void scsi_initialize_rq(struct request *rq)
 	cmd->retries = 0;
 }
 
+/*
+ * Only called when the request isn't completed by SCSI, and freed by
+ * blk-mq
+ */
+static void scsi_free_rq(struct request *rq)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	if (rq->rq_flags & RQF_DONTPREP)
+		scsi_mq_uninit_cmd(cmd);
+}
+
 /* Add a command to the list used by the aacraid and dpt_i2o drivers */
 void scsi_add_cmd_to_list(struct scsi_cmnd *cmd)
 {
@@ -1800,6 +1812,7 @@ static const struct blk_mq_ops scsi_mq_ops = {
 	.init_request	= scsi_mq_init_request,
 	.exit_request	= scsi_mq_exit_request,
 	.initialize_rq_fn = scsi_initialize_rq,
+	.free_request	= scsi_free_rq,
 	.busy		= scsi_mq_lld_busy,
 	.map_queues	= scsi_map_queues,
 };
-- 
2.20.1



* [RFC PATCH 6/7] blk-mq: re-submit IO in case that hctx is dead
  2019-07-12  2:47 [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (4 preceding siblings ...)
  2019-07-12  2:47 ` [RFC PATCH 5/7] SCSI: implement .free_request callback Ming Lei
@ 2019-07-12  2:47 ` Ming Lei
  2019-07-12  2:47 ` [RFC PATCH 7/7] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
  2019-07-16  6:54 ` [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug John Garry
  7 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-07-12  2:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

When all CPUs in one hctx are offline, we shouldn't run this hw queue
to complete requests any more.

So steal the bios from the request, resubmit them, and finally free the
request in blk_mq_hctx_notify_dead().

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 43 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 028c5d78e409..e4588d30840c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2263,10 +2263,32 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+static void blk_mq_resubmit_io(struct request *rq)
+{
+	struct bio_list list;
+	struct request_queue *q = rq->q;
+	struct bio *bio;
+
+	bio_list_init(&list);
+	blk_steal_bios(&list, rq);
+
+	while (true) {
+		bio = bio_list_pop(&list);
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+
+	if (q->mq_ops->free_request)
+		q->mq_ops->free_request(rq);
+	blk_mq_end_request(rq, 0);
+}
+
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * 'cpu' has gone away. If this hctx is dead, we can't dispatch requests
+ * to the hctx any more, so steal the bios from this hctx's requests,
+ * re-submit them to the request queue, and finally free the requests.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
@@ -2274,6 +2296,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
+	bool hctx_dead;
+	struct request *rq;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
@@ -2282,6 +2306,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	if (test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state))
 		clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
 
+	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
+		nr_cpu_ids;
+
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_lists[type])) {
 		list_splice_init(&ctx->rq_lists[type], &tmp);
@@ -2292,11 +2319,20 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	if (list_empty(&tmp))
 		return 0;
 
-	spin_lock(&hctx->lock);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+	if (!hctx_dead) {
+		spin_lock(&hctx->lock);
+		list_splice_tail_init(&tmp, &hctx->dispatch);
+		spin_unlock(&hctx->lock);
+		blk_mq_run_hw_queue(hctx, true);
+		return 0;
+	}
+
+	while (!list_empty(&tmp)) {
+		rq = list_entry(tmp.next, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_resubmit_io(rq);
+	}
 
-	blk_mq_run_hw_queue(hctx, true);
 	return 0;
 }
 
-- 
2.20.1



* [RFC PATCH 7/7] blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead
  2019-07-12  2:47 [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (5 preceding siblings ...)
  2019-07-12  2:47 ` [RFC PATCH 6/7] blk-mq: re-submit IO in case that hctx is dead Ming Lei
@ 2019-07-12  2:47 ` Ming Lei
  2019-07-16  6:54 ` [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug John Garry
  7 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-07-12  2:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

If a hctx becomes dead, all in-queue IO requests aimed at this hctx have
to be re-submitted, so also cover requests queued in the IO scheduler.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e4588d30840c..3ad1944a8e70 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2298,6 +2298,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 	bool hctx_dead;
 	struct request *rq;
+	struct elevator_queue *e;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
@@ -2309,12 +2310,31 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
 		nr_cpu_ids;
 
-	spin_lock(&ctx->lock);
-	if (!list_empty(&ctx->rq_lists[type])) {
-		list_splice_init(&ctx->rq_lists[type], &tmp);
-		blk_mq_hctx_clear_pending(hctx, ctx);
+	e = hctx->queue->elevator;
+	if (!e) {
+		spin_lock(&ctx->lock);
+		if (!list_empty(&ctx->rq_lists[type])) {
+			list_splice_init(&ctx->rq_lists[type], &tmp);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+		}
+		spin_unlock(&ctx->lock);
+	} else if (hctx_dead) {
+		LIST_HEAD(sched_tmp);
+
+		while ((rq = e->type->ops.dispatch_request(hctx))) {
+			if (rq->mq_hctx != hctx)
+				list_add(&rq->queuelist, &sched_tmp);
+			else
+				list_add(&rq->queuelist, &tmp);
+		}
+
+		while (!list_empty(&sched_tmp)) {
+			rq = list_entry(sched_tmp.next, struct request,
+					queuelist);
+			list_del_init(&rq->queuelist);
+			blk_mq_sched_insert_request(rq, true, true, true);
+		}
 	}
-	spin_unlock(&ctx->lock);
 
 	if (list_empty(&tmp))
 		return 0;
-- 
2.20.1



* Re: [RFC PATCH 3/7] blk-mq: stop to handle IO before hctx's all CPUs become offline
  2019-07-12  2:47 ` [RFC PATCH 3/7] blk-mq: stop to handle IO before hctx's all CPUs become offline Ming Lei
@ 2019-07-12  9:03   ` Minwoo Im
  2019-07-16  3:03   ` John Garry
  1 sibling, 0 replies; 16+ messages in thread
From: Minwoo Im @ 2019-07-12  9:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch,
	Minwoo Im

On 19-07-12 10:47:22, Ming Lei wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e5ef40c603ca..028c5d78e409 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2205,6 +2205,64 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>  
> +static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
> +				     bool reserved)
> +{
> +	unsigned *count = data;
> +
> +	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
> +		(*count)++;
> +
> +	return true;
> +}
> +
> +unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
> +{
> +	unsigned count = 0;
> +
> +	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
> +
> +	return count;
> +}

Ming,

Maybe it can be static?

> +
> +static void blk_mq_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> +	while (1) {
> +		if (!blk_mq_tags_inflight_rqs(hctx->tags))
> +			break;
> +		msleep(5);
> +	}
> +}


* Re: [RFC PATCH 2/7] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  2019-07-12  2:47 ` [RFC PATCH 2/7] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ Ming Lei
@ 2019-07-16  2:53   ` John Garry
  2019-07-16  6:08     ` Ming Lei
  0 siblings, 1 reply; 16+ messages in thread
From: John Garry @ 2019-07-16  2:53 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch

On 12/07/2019 10:47, Ming Lei wrote:
> We will stop the hw queue and wait for completion of in-flight requests
> when one hctx becomes dead in the following patch. This may cause a
> deadlock for some stacking blk-mq drivers, such as dm-rq and loop.
> 
> Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and set it for dm-rq and
> loop, so we don't need to wait for completion of the in-flight requests
> of dm-rq and loop; then the potential deadlock can be avoided.

Wouldn't it make more sense to have the flag name be like 
BLK_MQ_F_DONT_DRAIN_STOPPED_HCTX?

I did not think that blk-mq is specifically concerned with managed
interrupts, but only with their indirect effect.

> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-debugfs.c | 1 +
>   drivers/block/loop.c   | 2 +-
>   drivers/md/dm-rq.c     | 2 +-
>   include/linux/blk-mq.h | 1 +
>   4 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index af40a02c46ee..24fff8c90942 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -240,6 +240,7 @@ static const char *const hctx_flag_name[] = {
>   	HCTX_FLAG_NAME(TAG_SHARED),
>   	HCTX_FLAG_NAME(BLOCKING),
>   	HCTX_FLAG_NAME(NO_SCHED),
> +	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
>   };
>   #undef HCTX_FLAG_NAME
>   
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 44c9985f352a..199d76e8bf46 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1986,7 +1986,7 @@ static int loop_add(struct loop_device **l, int i)
>   	lo->tag_set.queue_depth = 128;
>   	lo->tag_set.numa_node = NUMA_NO_NODE;
>   	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
> -	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> +	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;

nit: at this point in the series you're setting a flag which is never 
checked.

>   	lo->tag_set.driver_data = lo;
>   
>   	err = blk_mq_alloc_tag_set(&lo->tag_set);
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 5f7063f05ae0..f7fbef2d3cd7 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -546,7 +546,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>   	md->tag_set->ops = &dm_mq_ops;
>   	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
>   	md->tag_set->numa_node = md->numa_node_id;
> -	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
> +	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
>   	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
>   	md->tag_set->driver_data = md;
>   
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 3a731c3c0762..911cdc6479dc 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -219,6 +219,7 @@ struct blk_mq_ops {
>   enum {
>   	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
>   	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> +	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
>   	BLK_MQ_F_BLOCKING	= 1 << 5,
>   	BLK_MQ_F_NO_SCHED	= 1 << 6,
>   	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> 




* Re: [RFC PATCH 3/7] blk-mq: stop to handle IO before hctx's all CPUs become offline
  2019-07-12  2:47 ` [RFC PATCH 3/7] blk-mq: stop to handle IO before hctx's all CPUs become offline Ming Lei
  2019-07-12  9:03   ` Minwoo Im
@ 2019-07-16  3:03   ` John Garry
  2019-07-16  6:12     ` Ming Lei
  1 sibling, 1 reply; 16+ messages in thread
From: John Garry @ 2019-07-16  3:03 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch

On 12/07/2019 10:47, Ming Lei wrote:
> Most blk-mq drivers depend on managed IRQs' auto-affinity to set up
> the queue mapping. Thomas mentioned the following point[1]:
> 
> "
>   That was the constraint of managed interrupts from the very beginning:
> 
>    The driver/subsystem has to quiesce the interrupt line and the associated
>    queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>    until it's restarted by the core when the CPU is plugged in again.
> "
> 
> However, the current blk-mq implementation doesn't quiesce the hw
> queue before the last CPU in the hctx is shut down. Even worse,
> CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so
> there isn't any chance to quiesce the hctx for blk-mq wrt. CPU hotplug.
> 
> Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop
> queues and wait for completion of in-flight requests.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-tag.c         |  2 +-
>   block/blk-mq-tag.h         |  2 ++
>   block/blk-mq.c             | 67 ++++++++++++++++++++++++++++++++++++++
>   include/linux/blk-mq.h     |  1 +
>   include/linux/cpuhotplug.h |  1 +
>   5 files changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index da19f0bc8876..bcefb213ad69 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -324,7 +324,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
>    *		true to continue iterating tags, false to stop.
>    * @priv:	Will be passed as second argument to @fn.
>    */
> -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>   		busy_tag_iter_fn *fn, void *priv)
>   {
>   	if (tags->nr_reserved_tags)
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 61deab0b5a5a..321fd6f440e6 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>   extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
>   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>   		void *priv);
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +		busy_tag_iter_fn *fn, void *priv);
>   
>   static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
>   						 struct blk_mq_hw_ctx *hctx)
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e5ef40c603ca..028c5d78e409 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2205,6 +2205,64 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>   	return -ENOMEM;
>   }
>   
> +static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
> +				     bool reserved)
> +{
> +	unsigned *count = data;
> +
> +	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
> +		(*count)++;
> +
> +	return true;
> +}
> +
> +unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
> +{
> +	unsigned count = 0;
> +
> +	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
> +
> +	return count;
> +}
> +
> +static void blk_mq_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> +	while (1) {
> +		if (!blk_mq_tags_inflight_rqs(hctx->tags))
> +			break;
> +		msleep(5);
> +	}
> +}
> +
> +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +	unsigned prev_cpu = -1;
> +
> +	if (hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ)
> +		return 0;

is it possible to make this check at registration, such that we don't
even register when (hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ) is set?

> +
> +	while (true) {
> +		unsigned other_cpu = cpumask_next_and(prev_cpu, hctx->cpumask,

nit: to me the name next_cpu seems better

> +				cpu_online_mask);
> +
> +		if (other_cpu >= nr_cpu_ids)
> +			break;
> +
> > +		/* return if there is another online CPU on this hctx */
> +		if (other_cpu != cpu)
> +			return 0;
> +
> +		prev_cpu = other_cpu;
> +	}
> +
> +	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
> +	blk_mq_drain_inflight_rqs(hctx);
> +
> +	return 0;
> +}
> +
>   /*
>    * 'cpu' is going away. splice any existing rq_list entries from this
>    * software queue to the hw queue dispatch list, and ensure that it
> @@ -2221,6 +2279,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>   	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>   	type = hctx->type;
>   
> +	if (test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state))
> +		clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);

can you just always clear the bit without the test?

> +
>   	spin_lock(&ctx->lock);
>   	if (!list_empty(&ctx->rq_lists[type])) {
>   		list_splice_init(&ctx->rq_lists[type], &tmp);
> @@ -2243,6 +2304,8 @@ static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
>   {
>   	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
>   					    &hctx->cpuhp_dead);
> +	cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> +					    &hctx->cpuhp_online);
>   }
>   
>   /* hctx->ctxs will be freed in queue's release handler */
> @@ -2301,6 +2364,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
>   	hctx->queue_num = hctx_idx;
>   
>   	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
> +	cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> +			&hctx->cpuhp_online);
>   
>   	hctx->tags = set->tags[hctx_idx];
>   
> @@ -3536,6 +3601,8 @@ static int __init blk_mq_init(void)
>   {
>   	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
>   				blk_mq_hctx_notify_dead);
> +	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
> +				NULL, blk_mq_hctx_notify_online);
>   	return 0;
>   }
>   subsys_initcall(blk_mq_init);
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 911cdc6479dc..dc86bdac08f4 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -59,6 +59,7 @@ struct blk_mq_hw_ctx {
>   	atomic_t		nr_active;
>   
>   	struct hlist_node	cpuhp_dead;
> +	struct hlist_node	cpuhp_online;

I'd reorder with cpuhp_dead

>   	struct kobject		kobj;
>   
>   	unsigned long		poll_considered;
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 87c211adf49e..5177f7bbcb88 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -147,6 +147,7 @@ enum cpuhp_state {
>   	CPUHP_AP_SMPBOOT_THREADS,
>   	CPUHP_AP_X86_VDSO_VMA_ONLINE,
>   	CPUHP_AP_IRQ_AFFINITY_ONLINE,
> +	CPUHP_AP_BLK_MQ_ONLINE,
>   	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
>   	CPUHP_AP_X86_INTEL_EPB_ONLINE,
>   	CPUHP_AP_PERF_ONLINE,
> 




* Re: [RFC PATCH 2/7] blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  2019-07-16  2:53   ` John Garry
@ 2019-07-16  6:08     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-07-16  6:08 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

On Tue, Jul 16, 2019 at 10:53:00AM +0800, John Garry wrote:
> On 12/07/2019 10:47, Ming Lei wrote:
> > We will stop the hw queue and wait for completion of in-flight requests
> > when one hctx becomes dead in the following patch. This may cause a
> > deadlock for some stacking blk-mq drivers, such as dm-rq and loop.
> > 
> > Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and set it for dm-rq and
> > loop, so we don't need to wait for completion of the in-flight requests
> > of dm-rq and loop; then the potential deadlock can be avoided.
> 
> Wouldn't it make more sense to have the flag name be like
> BLK_MQ_F_DONT_DRAIN_STOPPED_HCTX?
> 
> I did not think that blk-mq is specifically concerned with managed
> interrupts, but only with their indirect effect.

I am fine with either one; however, it is easier for drivers to
recognize whether this flag should be set, given that
BLK_MQ_F_NO_MANAGED_IRQ is self-explanatory.

Also, this patchset serves as a generic blk-mq fix for the managed IRQ
issue, so it is reasonable for all drivers which don't use managed IRQs
to set the flag.

> 
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Keith Busch <keith.busch@intel.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq-debugfs.c | 1 +
> >   drivers/block/loop.c   | 2 +-
> >   drivers/md/dm-rq.c     | 2 +-
> >   include/linux/blk-mq.h | 1 +
> >   4 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index af40a02c46ee..24fff8c90942 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -240,6 +240,7 @@ static const char *const hctx_flag_name[] = {
> >   	HCTX_FLAG_NAME(TAG_SHARED),
> >   	HCTX_FLAG_NAME(BLOCKING),
> >   	HCTX_FLAG_NAME(NO_SCHED),
> > +	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
> >   };
> >   #undef HCTX_FLAG_NAME
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 44c9985f352a..199d76e8bf46 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -1986,7 +1986,7 @@ static int loop_add(struct loop_device **l, int i)
> >   	lo->tag_set.queue_depth = 128;
> >   	lo->tag_set.numa_node = NUMA_NO_NODE;
> >   	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
> > -	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> > +	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
> 
> nit: at this point in the series you're setting a flag which is never
> checked.

Yeah, I see, and this way is a bit easier for review purposes.

thanks,
Ming


* Re: [RFC PATCH 3/7] blk-mq: stop to handle IO before hctx's all CPUs become offline
  2019-07-16  3:03   ` John Garry
@ 2019-07-16  6:12     ` Ming Lei
  0 siblings, 0 replies; 16+ messages in thread
From: Ming Lei @ 2019-07-16  6:12 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

On Tue, Jul 16, 2019 at 11:03:14AM +0800, John Garry wrote:
> On 12/07/2019 10:47, Ming Lei wrote:
> > Most blk-mq drivers depend on managed IRQs' auto-affinity to set up
> > the queue mapping. Thomas mentioned the following point[1]:
> > 
> > "
> >   That was the constraint of managed interrupts from the very beginning:
> > 
> >    The driver/subsystem has to quiesce the interrupt line and the associated
> >    queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >    until it's restarted by the core when the CPU is plugged in again.
> > "
> > 
> > However, the current blk-mq implementation doesn't quiesce the hw
> > queue before the last CPU in the hctx is shut down. Even worse,
> > CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so
> > there isn't any chance to quiesce the hctx for blk-mq wrt. CPU hotplug.
> > 
> > Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop
> > queues and wait for completion of in-flight requests.
> > 
> > [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> > 
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Keith Busch <keith.busch@intel.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq-tag.c         |  2 +-
> >   block/blk-mq-tag.h         |  2 ++
> >   block/blk-mq.c             | 67 ++++++++++++++++++++++++++++++++++++++
> >   include/linux/blk-mq.h     |  1 +
> >   include/linux/cpuhotplug.h |  1 +
> >   5 files changed, 72 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index da19f0bc8876..bcefb213ad69 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -324,7 +324,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
> >    *		true to continue iterating tags, false to stop.
> >    * @priv:	Will be passed as second argument to @fn.
> >    */
> > -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> > +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> >   		busy_tag_iter_fn *fn, void *priv)
> >   {
> >   	if (tags->nr_reserved_tags)
> > diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> > index 61deab0b5a5a..321fd6f440e6 100644
> > --- a/block/blk-mq-tag.h
> > +++ b/block/blk-mq-tag.h
> > @@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
> >   extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
> >   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
> >   		void *priv);
> > +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> > +		busy_tag_iter_fn *fn, void *priv);
> >   static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
> >   						 struct blk_mq_hw_ctx *hctx)
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index e5ef40c603ca..028c5d78e409 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2205,6 +2205,64 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >   	return -ENOMEM;
> >   }
> > +static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
> > +				     bool reserved)
> > +{
> > +	unsigned *count = data;
> > +
> > +	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
> > +		(*count)++;
> > +
> > +	return true;
> > +}
> > +
> > +unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
> > +{
> > +	unsigned count = 0;
> > +
> > +	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
> > +
> > +	return count;
> > +}
> > +
> > +static void blk_mq_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	while (1) {
> > +		if (!blk_mq_tags_inflight_rqs(hctx->tags))
> > +			break;
> > +		msleep(5);
> > +	}
> > +}
> > +
> > +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> > +{
> > +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> > +			struct blk_mq_hw_ctx, cpuhp_online);
> > +	unsigned prev_cpu = -1;
> > +
> > +	if (hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ)
> > +		return 0;
> 
> is it possible to make this check at registration, such that we don't even
> register when (hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ) is set?

Yeah, good idea, it can be done in that way.
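
Something like the following (untested) sketch, together with a matching
conditional in blk_mq_remove_cpuhp():

	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
				&hctx->cpuhp_online);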

> 
> > +
> > +	while (true) {
> > +		unsigned other_cpu = cpumask_next_and(prev_cpu, hctx->cpumask,
> 
> nit: to me the name next_cpu seems better

OK.

> 
> > +				cpu_online_mask);
> > +
> > +		if (other_cpu >= nr_cpu_ids)
> > +			break;
> > +
> > +		/* return if there is another online CPU on this hctx */
> > +		if (other_cpu != cpu)
> > +			return 0;
> > +
> > +		prev_cpu = other_cpu;
> > +	}
> > +
> > +	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
> > +	blk_mq_drain_inflight_rqs(hctx);
> > +
> > +	return 0;
> > +}
> > +
> >   /*
> >    * 'cpu' is going away. splice any existing rq_list entries from this
> >    * software queue to the hw queue dispatch list, and ensure that it
> > @@ -2221,6 +2279,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >   	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
> >   	type = hctx->type;
> > +	if (test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state))
> > +		clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
> 
> can you just always clear the bit without the test?

Yeah.
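
I.e. simply:

	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);

clear_bit() is a no-op when the bit is already clear.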

> 
> > +
> >   	spin_lock(&ctx->lock);
> >   	if (!list_empty(&ctx->rq_lists[type])) {
> >   		list_splice_init(&ctx->rq_lists[type], &tmp);
> > @@ -2243,6 +2304,8 @@ static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
> >   {
> >   	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
> >   					    &hctx->cpuhp_dead);
> > +	cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> > +					    &hctx->cpuhp_online);
> >   }
> >   /* hctx->ctxs will be freed in queue's release handler */
> > @@ -2301,6 +2364,8 @@ static int blk_mq_init_hctx(struct request_queue *q,
> >   	hctx->queue_num = hctx_idx;
> >   	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
> > +	cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> > +			&hctx->cpuhp_online);
> >   	hctx->tags = set->tags[hctx_idx];
> > @@ -3536,6 +3601,8 @@ static int __init blk_mq_init(void)
> >   {
> >   	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
> >   				blk_mq_hctx_notify_dead);
> > +	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
> > +				NULL, blk_mq_hctx_notify_online);
> >   	return 0;
> >   }
> >   subsys_initcall(blk_mq_init);
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 911cdc6479dc..dc86bdac08f4 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -59,6 +59,7 @@ struct blk_mq_hw_ctx {
> >   	atomic_t		nr_active;
> >   	struct hlist_node	cpuhp_dead;
> > +	struct hlist_node	cpuhp_online;
> 
> I'd reorder with cpuhp_dead

Fine.


thanks,
Ming


* Re: [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug
  2019-07-12  2:47 [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (6 preceding siblings ...)
  2019-07-12  2:47 ` [RFC PATCH 7/7] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
@ 2019-07-16  6:54 ` John Garry
  2019-07-16  7:18   ` Ming Lei
  7 siblings, 1 reply; 16+ messages in thread
From: John Garry @ 2019-07-16  6:54 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch

On 12/07/2019 10:47, Ming Lei wrote:
> Hi,
> 
> Thomas mentioned:
>      "
>       That was the constraint of managed interrupts from the very beginning:
>      
>        The driver/subsystem has to quiesce the interrupt line and the associated
>        queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>        until it's restarted by the core when the CPU is plugged in again.
>      "
> 
> However, neither the drivers nor blk-mq do that before one hctx becomes
> dead (all CPUs for one hctx are offline). Even worse, blk-mq still tries
> to run the hw queue after the hctx is dead; see blk_mq_hctx_notify_dead().
> 
> This patchset addresses the issue in two stages:
> 
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> 
> - mark the hctx as internally stopped, and drain all in-flight requests
> if the hctx is going to become dead.
> 
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> 
> - steal the bios from each request and resubmit them via
> generic_make_request(); these IOs will then be mapped to other live
> hctxs for dispatch
> 
> Please comment & review, thanks!
> 

Hi Ming,

FWIW, to me this series looks reasonable.

So you have plans to post an updated "[PATCH 0/9] blk-mq/scsi: convert 
private reply queue into blk_mq hw queue" then?

All the best,
John

> 
> Ming Lei (7):
>    blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>    blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>    blk-mq: stop to handle IO before hctx's all CPUs become offline
>    blk-mq: add callback of .free_request
>    SCSI: implement .free_request callback
>    blk-mq: re-submit IO in case that hctx is dead
>    blk-mq: handle requests dispatched from IO scheduler in case that hctx
>      is dead
> 
>   block/blk-mq-debugfs.c     |   2 +
>   block/blk-mq-tag.c         |   2 +-
>   block/blk-mq-tag.h         |   2 +
>   block/blk-mq.c             | 147 ++++++++++++++++++++++++++++++++++---
>   block/blk-mq.h             |   3 +-
>   drivers/block/loop.c       |   2 +-
>   drivers/md/dm-rq.c         |   2 +-
>   drivers/scsi/scsi_lib.c    |  13 ++++
>   include/linux/blk-mq.h     |  12 +++
>   include/linux/cpuhotplug.h |   1 +
>   10 files changed, 170 insertions(+), 16 deletions(-)
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
> 
> 




* Re: [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug
  2019-07-16  6:54 ` [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug John Garry
@ 2019-07-16  7:18   ` Ming Lei
  2019-07-16  8:13     ` John Garry
  0 siblings, 1 reply; 16+ messages in thread
From: Ming Lei @ 2019-07-16  7:18 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, Linux SCSI List, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Hi John,

On Tue, Jul 16, 2019 at 2:55 PM John Garry <john.garry@huawei.com> wrote:
>
> On 12/07/2019 10:47, Ming Lei wrote:
> > Hi,
> >
> > Thomas mentioned:
> >      "
> >       That was the constraint of managed interrupts from the very beginning:
> >
> >        The driver/subsystem has to quiesce the interrupt line and the associated
> >        queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >        until it's restarted by the core when the CPU is plugged in again.
> >      "
> >
> > However, neither the drivers nor blk-mq do that before one hctx becomes
> > dead (all CPUs for one hctx are offline). Even worse, blk-mq still tries
> > to run the hw queue after the hctx is dead; see blk_mq_hctx_notify_dead().
> >
> > This patchset addresses the issue in two stages:
> >
> > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> >
> > - mark the hctx as internally stopped, and drain all in-flight requests
> > if the hctx is going to become dead.
> >
> > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> >
> > - steal the bios from each request and resubmit them via
> > generic_make_request(); these IOs will then be mapped to other live
> > hctxs for dispatch
> >
> > Please comment & review, thanks!
> >
>
> Hi Ming,
>
> FWIW, to me this series looks reasonable.

Thanks!

>
> So you have plans to post an updated "[PATCH 0/9] blk-mq/scsi: convert
> private reply queue into blk_mq hw queue" then?

V2 has been in the following tree for a while:

https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2

It works; however, the implementation is a bit ugly even though the
idea is simple.

So I think we need to think about it further, for a better
implementation or approach.

Thanks,
Ming Lei


* Re: [RFC PATCH 0/7] blk-mq: improvement on handling IO during CPU hotplug
  2019-07-16  7:18   ` Ming Lei
@ 2019-07-16  8:13     ` John Garry
  0 siblings, 0 replies; 16+ messages in thread
From: John Garry @ 2019-07-16  8:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, Linux SCSI List, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch,
	chenxiang

On 16/07/2019 15:18, Ming Lei wrote:
>> So you have plans to post an updated "[PATCH 0/9] blk-mq/scsi: convert
>> private reply queue into blk_mq hw queue" then?

Hi Ming,

> V2 has been in the following tree for a while:
> 
> https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2
> 
> It works; however, the implementation is a bit ugly even though the
> idea is simple.

Yeah, sorry to say that I agree - the BLK_MQ_F_TAG_HCTX_SHARED checks 
look bolted on.

> 
> So I think we need to think about it further, for a better
> implementation or approach.

Understood.

But at least we can test it to ensure there is no performance regression.

Thanks,
John

> 
> Thanks,
> Ming Lei



