linux-block.vger.kernel.org archive mirror
* [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
@ 2019-10-14  1:50 Ming Lei
  2019-10-14  1:50 ` [PATCH V4 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Ming Lei @ 2019-10-14  1:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

However, neither drivers nor blk-mq do that before one hctx becomes dead
(all CPUs mapped to the hctx are offline). Even worse, blk-mq still tries
to run the hw queue after the hctx is dead; see blk_mq_hctx_notify_dead().

This patchset tries to address the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internally stopped, and drain all in-flight requests
if the hctx is going to become dead.

2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead

- steal the bios from the request and resubmit them via generic_make_request();
these IOs will then be mapped to other live hctxs for dispatch

Please comment & review, thanks!

John, I didn't add your Tested-by tag since V4 has some changes compared
with V3, and I'd appreciate it if you could run your test on V4.

V4:
	- resubmit IOs in dispatch list in case that this hctx is dead 

V3:
	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
	- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
	- patch4 & patch 5 in V1 have been merged to block tree, so remove
	  them
	- address comments from John Garry and Minwoo



Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes dead
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
    is dead

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 137 ++++++++++++++++++++++++++++++++++---
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 141 insertions(+), 15 deletions(-)

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>

-- 
2.20.1



* [PATCH V4 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  2019-10-14  1:50 [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
@ 2019-10-14  1:50 ` Ming Lei
  2019-10-14  1:50 ` [PATCH V4 2/5] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2019-10-14  1:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch, John Garry

Add a new hw queue state, BLK_MQ_S_INTERNAL_STOPPED, which prepares
for stopping the hw queue before all CPUs of this hctx become offline.

We can't reuse BLK_MQ_S_STOPPED because that state can be cleared during IO
completion.
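The distinction can be modeled in plain userspace C. Below is a hedged
sketch (not kernel code; the names are simplified stand-ins for the real
state bits) of how blk_mq_hctx_stopped() treats either bit as "stopped",
while only the S_STOPPED bit is touched by the completion path:

```c
#include <assert.h>

/*
 * Userspace model of the two hctx "stopped" bits.  S_STOPPED may be
 * cleared by the IO completion path, while S_INTERNAL_STOPPED is owned
 * by the CPU-hotplug code alone.
 */
enum {
	S_STOPPED          = 1u << 0,
	S_INTERNAL_STOPPED = 1u << 1,
};

/* Mirrors blk_mq_hctx_stopped(): either bit means "don't dispatch". */
static int hctx_stopped(unsigned int state)
{
	return !!(state & (S_STOPPED | S_INTERNAL_STOPPED));
}
```

Even if the completion path clears S_STOPPED, a hctx that hotplug marked
internally stopped still reports stopped, which is exactly why
BLK_MQ_S_STOPPED can't be reused here.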

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 1 +
 block/blk-mq.h         | 3 ++-
 include/linux/blk-mq.h | 3 +++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..af40a02c46ee 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INTERNAL_STOPPED),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 32c62c64e6c2..63717573bc16 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -176,7 +176,8 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
-	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	return test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
+		test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
 }
 
 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0bf056de5cc3..079c282e4471 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -235,6 +235,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is internally stopped, driver must not use it */
+	BLK_MQ_S_INTERNAL_STOPPED	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
-- 
2.20.1



* [PATCH V4 2/5] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2019-10-14  1:50 [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  2019-10-14  1:50 ` [PATCH V4 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
@ 2019-10-14  1:50 ` Ming Lei
  2019-10-14  1:50 ` [PATCH V4 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead Ming Lei
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2019-10-14  1:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Most blk-mq drivers depend on managed IRQs' automatic affinity to set
up queue mapping. Thomas mentioned the following point[1]:

"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD
is a cpuhp state handled after the CPU is down, so there isn't any chance
for blk-mq to quiesce the hctx wrt. CPU hotplug.

Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop queues
and wait for completion of in-flight requests.

The following patch will stop the hw queue and wait for completion of
in-flight requests when one hctx is becoming dead. That may cause a
deadlock for some stacking blk-mq drivers, such as dm-rq and loop.

Add a blk-mq flag, BLK_MQ_F_NO_MANAGED_IRQ, and set it for dm-rq and
loop, so we needn't wait for completion of in-flight requests from
dm-rq & loop, and the potential deadlock can be avoided.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
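As a rough userspace model (not kernel code; the counter stands in for
the cpuhp multi-instance list, and the flag value mirrors the one added
by this patch), the flag simply gates whether a hctx registers an
instance for the new online state:

```c
#include <assert.h>

#define F_NO_MANAGED_IRQ (1u << 2) /* models BLK_MQ_F_NO_MANAGED_IRQ */

static int online_instances; /* stands in for the cpuhp instance list */

/* Models the gate in blk_mq_init_hctx(). */
static void hctx_add_cpuhp(unsigned int flags)
{
	if (!(flags & F_NO_MANAGED_IRQ))
		online_instances++; /* cpuhp_state_add_instance_nocalls() */
}

/* Models the matching gate in blk_mq_remove_cpuhp(). */
static void hctx_remove_cpuhp(unsigned int flags)
{
	if (!(flags & F_NO_MANAGED_IRQ))
		online_instances--; /* cpuhp_state_remove_instance_nocalls() */
}
```

Drivers that set the flag (dm-rq, loop) never register, so their queues
are never drained by the online teardown callback.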

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c     |  1 +
 block/blk-mq.c             | 13 +++++++++++++
 drivers/block/loop.c       |  2 +-
 drivers/md/dm-rq.c         |  2 +-
 include/linux/blk-mq.h     |  2 ++
 include/linux/cpuhotplug.h |  1 +
 6 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index af40a02c46ee..24fff8c90942 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -240,6 +240,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ec791156e9cc..a664f196782a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2225,6 +2225,11 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2261,6 +2266,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2320,6 +2328,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3547,6 +3558,8 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				NULL, blk_mq_hctx_notify_online);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f6f77eaa7217..751a28a1d4b0 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1999,7 +1999,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13b..5f1ff70ac029 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 079c282e4471..a345f2cf920d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -58,6 +58,7 @@ struct blk_mq_hw_ctx {
 
 	atomic_t		nr_active;
 
+	struct hlist_node	cpuhp_online;
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
 
@@ -226,6 +227,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 068793a619ca..bb80f52040cb 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -147,6 +147,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.20.1



* [PATCH V4 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead
  2019-10-14  1:50 [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
  2019-10-14  1:50 ` [PATCH V4 1/5] blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED Ming Lei
  2019-10-14  1:50 ` [PATCH V4 2/5] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2019-10-14  1:50 ` Ming Lei
  2019-11-28  9:29   ` John Garry
  2019-10-14  1:50 ` [PATCH V4 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-10-14  1:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Before one CPU becomes offline, check if it is the last online CPU of
the hctx. If so, mark this hctx as BLK_MQ_S_INTERNAL_STOPPED, and
meanwhile wait for completion of all in-flight IOs originating from
this hctx.

This guarantees that there isn't any in-flight IO before the managed
IRQ line is shut down.
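The "last online CPU of the hctx" test can be sketched in userspace C.
This is a model, not kernel code: masks are plain unsigned longs, and
next_and() is a simplified stand-in for the kernel's cpumask_next_and():

```c
#include <assert.h>
#include <limits.h>

#define NR_CPUS (sizeof(unsigned long) * CHAR_BIT)

/* Return the next set bit in (a & b) after 'prev', or NR_CPUS if none. */
static unsigned int next_and(int prev, unsigned long a, unsigned long b)
{
	for (unsigned int cpu = prev + 1; cpu < NR_CPUS; cpu++)
		if ((a & b) & (1UL << cpu))
			return cpu;
	return NR_CPUS;
}

/*
 * Mirrors the condition in blk_mq_hctx_notify_online(): 'cpu' is the
 * last online CPU of the hctx iff it is both the first and the last
 * bit set in (hctx_mask & online_mask).
 */
static int last_online_cpu_of_hctx(unsigned int cpu, unsigned long hctx_mask,
				   unsigned long online_mask)
{
	return next_and(-1, hctx_mask, online_mask) == cpu &&
	       next_and(cpu, hctx_mask, online_mask) >= NR_CPUS;
}
```

Note that the teardown callback runs while the outgoing CPU is still in
the online mask, so the condition fires exactly once, on the last CPU.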

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c |  2 +-
 block/blk-mq-tag.h |  2 ++
 block/blk-mq.c     | 40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 008388e82b5c..31828b82552b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -325,7 +325,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..321fd6f440e6 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a664f196782a..3384242202eb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2225,8 +2225,46 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
+{
+	unsigned *count = data;
+
+	if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
+		(*count)++;
+
+	return true;
+}
+
+static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
+{
+	unsigned count = 0;
+
+	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
+
+	return count;
+}
+
+static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	while (1) {
+		if (!blk_mq_tags_inflight_rqs(hctx->tags))
+			break;
+		msleep(5);
+	}
+}
+
 static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 {
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu) &&
+	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) >=
+	     nr_cpu_ids)) {
+		set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
+		blk_mq_hctx_drain_inflight_rqs(hctx);
+        }
 	return 0;
 }
 
@@ -2246,6 +2284,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
+	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
+
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_lists[type])) {
 		list_splice_init(&ctx->rq_lists[type], &tmp);
-- 
2.20.1



* [PATCH V4 4/5] blk-mq: re-submit IO in case that hctx is dead
  2019-10-14  1:50 [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (2 preceding siblings ...)
  2019-10-14  1:50 ` [PATCH V4 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead Ming Lei
@ 2019-10-14  1:50 ` Ming Lei
  2019-10-14  1:50 ` [PATCH V4 5/5] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2019-10-14  1:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

When all CPUs in one hctx are offline, we shouldn't run this hw queue
for completing requests any more.

So steal the bios from the request, resubmit them, and finally free
the request in blk_mq_hctx_notify_dead().
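The steal-and-resubmit flow can be sketched in userspace C. This is a
model, not kernel code: struct bio_list is reimplemented here, and
steal_bios() stands in for blk_steal_bios(). The point it demonstrates
is that the stolen bios keep their submission order:

```c
#include <assert.h>
#include <stddef.h>

struct bio { struct bio *next; int id; };
struct bio_list { struct bio *head, *tail; };
struct request { struct bio_list bios; };

static void bio_list_init(struct bio_list *l) { l->head = l->tail = NULL; }

/* Append a bio at the tail, as bio_list_add() does. */
static void bio_list_add(struct bio_list *l, struct bio *b)
{
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

/* Pop the head bio, as bio_list_pop() does; NULL when empty. */
static struct bio *bio_list_pop(struct bio_list *l)
{
	struct bio *b = l->head;

	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
		b->next = NULL;
	}
	return b;
}

/* Move all bios from rq to list, leaving rq empty (models blk_steal_bios). */
static void steal_bios(struct bio_list *list, struct request *rq)
{
	*list = rq->bios;
	bio_list_init(&rq->bios);
}
```

After stealing, the (now empty) request can be freed before the bios are
re-submitted one by one from the head, preserving FIFO order.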

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 54 ++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3384242202eb..17f0a9ef32a8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2268,10 +2268,34 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+static void blk_mq_resubmit_io(struct request *rq)
+{
+	struct bio_list list;
+	struct bio *bio;
+
+	bio_list_init(&list);
+	blk_steal_bios(&list, rq);
+
+	/*
+	 * Free the old empty request before submitting bio for avoiding
+	 * potential deadlock
+	 */
+	blk_mq_cleanup_rq(rq);
+	blk_mq_end_request(rq, 0);
+
+	while (true) {
+		bio = bio_list_pop(&list);
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * 'cpu' has gone away. If this hctx is dead, we can't dispatch request
+ * to the hctx any more, so steal bios from requests of this hctx, and
+ * re-submit them to the request queue, and finally free these requests.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
@@ -2279,6 +2303,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
+	bool hctx_dead;
+	struct request *rq;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
@@ -2286,6 +2312,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 	clear_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
 
+	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
+		nr_cpu_ids;
+
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_lists[type])) {
 		list_splice_init(&ctx->rq_lists[type], &tmp);
@@ -2293,14 +2322,27 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	}
 	spin_unlock(&ctx->lock);
 
-	if (list_empty(&tmp))
+	if (!hctx_dead) {
+		if (list_empty(&tmp))
+			return 0;
+		spin_lock(&hctx->lock);
+		list_splice_tail_init(&tmp, &hctx->dispatch);
+		spin_unlock(&hctx->lock);
+		blk_mq_run_hw_queue(hctx, true);
 		return 0;
+	}
 
+	/* requests in dispatch list have to be re-submitted too */
 	spin_lock(&hctx->lock);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
+	list_splice_tail_init(&hctx->dispatch, &tmp);
 	spin_unlock(&hctx->lock);
 
-	blk_mq_run_hw_queue(hctx, true);
+	while (!list_empty(&tmp)) {
+		rq = list_entry(tmp.next, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_resubmit_io(rq);
+	}
+
 	return 0;
 }
 
-- 
2.20.1



* [PATCH V4 5/5] blk-mq: handle requests dispatched from IO scheduler in case that hctx is dead
  2019-10-14  1:50 [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (3 preceding siblings ...)
  2019-10-14  1:50 ` [PATCH V4 4/5] blk-mq: re-submit IO in case that hctx is dead Ming Lei
@ 2019-10-14  1:50 ` Ming Lei
  2019-10-16  8:58 ` [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug John Garry
  2019-11-28  1:09 ` chenxiang (M)
  6 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2019-10-14  1:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

If a hctx becomes dead, all in-queue IO requests aimed at this hctx have
to be re-submitted, so also cover requests queued in the IO scheduler's
queue.
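A userspace sketch (not kernel code; singly linked lists stand in for
the kernel list primitives, and hctx_id for rq->mq_hctx) of how the
scheduler queue is drained and partitioned when a hctx dies:

```c
#include <assert.h>
#include <stddef.h>

struct request { struct request *next; int hctx_id; };

/* Pop the head of a singly linked list; NULL when empty. */
static struct request *pop(struct request **head)
{
	struct request *rq = *head;

	if (rq) {
		*head = rq->next;
		rq->next = NULL;
	}
	return rq;
}

static void push(struct request **head, struct request *rq)
{
	rq->next = *head;
	*head = rq;
}

/*
 * Drain the scheduler queue of a dead hctx: requests that belong to the
 * dead hctx go to 'tmp' (to have their bios stolen and re-submitted);
 * requests mapped elsewhere go back to the scheduler ('sched'), as
 * blk_mq_sched_insert_request() does in the patch.
 */
static void drain_sched(struct request **queue, int dead_hctx,
			struct request **tmp, struct request **sched)
{
	struct request *rq;

	while ((rq = pop(queue))) {
		if (rq->hctx_id == dead_hctx)
			push(tmp, rq);
		else
			push(sched, rq);
	}
}
```

The partitioning mirrors the dispatch loop in the patch: `e->type->ops.dispatch_request(hctx)` may hand back requests of other hctxs, which must be re-queued rather than resubmitted as bios.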

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 17f0a9ef32a8..06081966549f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2305,6 +2305,7 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 	bool hctx_dead;
 	struct request *rq;
+	struct elevator_queue *e;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
@@ -2315,12 +2316,31 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	hctx_dead = cpumask_first_and(hctx->cpumask, cpu_online_mask) >=
 		nr_cpu_ids;
 
-	spin_lock(&ctx->lock);
-	if (!list_empty(&ctx->rq_lists[type])) {
-		list_splice_init(&ctx->rq_lists[type], &tmp);
-		blk_mq_hctx_clear_pending(hctx, ctx);
+	e = hctx->queue->elevator;
+	if (!e) {
+		spin_lock(&ctx->lock);
+		if (!list_empty(&ctx->rq_lists[type])) {
+			list_splice_init(&ctx->rq_lists[type], &tmp);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+		}
+		spin_unlock(&ctx->lock);
+	} else if (hctx_dead) {
+		LIST_HEAD(sched_tmp);
+
+		while ((rq = e->type->ops.dispatch_request(hctx))) {
+			if (rq->mq_hctx != hctx)
+				list_add(&rq->queuelist, &sched_tmp);
+			else
+				list_add(&rq->queuelist, &tmp);
+		}
+
+		while (!list_empty(&sched_tmp)) {
+			rq = list_entry(sched_tmp.next, struct request,
+					queuelist);
+			list_del_init(&rq->queuelist);
+			blk_mq_sched_insert_request(rq, true, true, true);
+		}
 	}
-	spin_unlock(&ctx->lock);
 
 	if (!hctx_dead) {
 		if (list_empty(&tmp))
-- 
2.20.1



* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-14  1:50 [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (4 preceding siblings ...)
  2019-10-14  1:50 ` [PATCH V4 5/5] blk-mq: handle requests dispatched from IO scheduler " Ming Lei
@ 2019-10-16  8:58 ` John Garry
  2019-10-16 12:07   ` Ming Lei
  2019-11-28  1:09 ` chenxiang (M)
  6 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2019-10-16  8:58 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch

On 14/10/2019 02:50, Ming Lei wrote:
> Hi,
>
> Thomas mentioned:
>     "
>      That was the constraint of managed interrupts from the very beginning:
>
>       The driver/subsystem has to quiesce the interrupt line and the associated
>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>       until it's restarted by the core when the CPU is plugged in again.
>     "
>
> But no drivers or blk-mq do that before one hctx becomes dead(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>
> This patchset tries to address the issue by two stages:
>
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
>
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
>
> Please comment & review, thanks!
>
> John, I don't add your tested-by tag since V3 have some changes,
> and I appreciate if you may run your test on V3.

Hi Ming,

So I got around to doing some testing. The good news is that the issue 
which we were experiencing in the v3 series seems to have gone away - 
a lot more stable.

However, unfortunately, I did notice some SCSI timeouts:

[15508.615074] CPU2: shutdown
[15508.617778] psci: CPU2 killed.
[15508.651220] CPU1: shutdown
[15508.653924] psci: CPU1 killed.
[15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63
Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task: 
aborting task 0x00000000a7159744
KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas: 
sas_eh_handle_sas_errors: task 0x00000000a7159744 is done
[15518.431266] sas: sas_scsi_find_task: aborting task 0x00000000d39731eb
[15518.442539] sas: sas_eh_handle_sas_errors: task 0x00000000d39731eb is 
done
[15518.449407] sas: sas_scsi_find_task: aborting task 0x000000009f77c9bd
[15518.455899] sas: sas_eh_handle_sas_errors: task 0x000000009f77c9bd is 
done

A couple of things to note:
- I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when 
in-flight rqs != 0, and I don't see them for this timeout
- 0 datarate reported from fio

I'll have a look...

Thanks,
John

>
> V4:
> 	- resubmit IOs in dispatch list in case that this hctx is dead
>
> V3:
> 	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
> 	- fix patch 4 for avoiding potential deadlock, as found by Hannes
>
> V2:
> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> 	  them
> 	- address comments from John Garry and Minwoo
>
>
>
> Ming Lei (5):
>   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>   blk-mq: prepare for draining IO when hctx's all CPUs are offline
>   blk-mq: stop to handle IO and drain IO before hctx becomes dead
>   blk-mq: re-submit IO in case that hctx is dead
>   blk-mq: handle requests dispatched from IO scheduler in case that hctx
>     is dead
>
>  block/blk-mq-debugfs.c     |   2 +
>  block/blk-mq-tag.c         |   2 +-
>  block/blk-mq-tag.h         |   2 +
>  block/blk-mq.c             | 137 ++++++++++++++++++++++++++++++++++---
>  block/blk-mq.h             |   3 +-
>  drivers/block/loop.c       |   2 +-
>  drivers/md/dm-rq.c         |   2 +-
>  include/linux/blk-mq.h     |   5 ++
>  include/linux/cpuhotplug.h |   1 +
>  9 files changed, 141 insertions(+), 15 deletions(-)
>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
>




* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-16  8:58 ` [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug John Garry
@ 2019-10-16 12:07   ` Ming Lei
  2019-10-16 16:19     ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-10-16 12:07 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote:
> On 14/10/2019 02:50, Ming Lei wrote:
> > Hi,
> > 
> > Thomas mentioned:
> >     "
> >      That was the constraint of managed interrupts from the very beginning:
> > 
> >       The driver/subsystem has to quiesce the interrupt line and the associated
> >       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >       until it's restarted by the core when the CPU is plugged in again.
> >     "
> > 
> > But no drivers or blk-mq do that before one hctx becomes dead(all
> > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > 
> > This patchset tries to address the issue by two stages:
> > 
> > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > 
> > - mark the hctx as internal stopped, and drain all in-flight requests
> > if the hctx is going to be dead.
> > 
> > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> > 
> > - steal bios from the request, and resubmit them via generic_make_request(),
> > then these IO will be mapped to other live hctx for dispatch
> > 
> > Please comment & review, thanks!
> > 
> > John, I don't add your tested-by tag since V3 have some changes,
> > and I appreciate if you may run your test on V3.
> 
> Hi Ming,
> 
> So I got around to doing some testing. The good news is that issue which we
> were experiencing in v3 series seems to have has gone away - alot more
> stable.
> 
> However, unfortunately, I did notice some SCSI timeouts:
> 
> 15508.615074] CPU2: shutdown
> [15508.617778] psci: CPU2 killed.
> [15508.651220] CPU1: shutdown
> [15508.653924] psci: CPU1 killed.
> [15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63
> Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task:
> aborting task 0x00000000a7159744
> KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas:
> sas_eh_handle_sas_errors: task 0x00000000a7159744 is done
> [15518.431266] sas: sas_scsi_find_task: aborting task 0x00000000d39731eb
> [15518.442539] sas: sas_eh_handle_sas_errors: task 0x00000000d39731eb is
> done
> [15518.449407] sas: sas_scsi_find_task: aborting task 0x000000009f77c9bd
> [15518.455899] sas: sas_eh_handle_sas_errors: task 0x000000009f77c9bd is
> done
> 
> A couple of things to note:
> - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
> inflights rqs !=0, and I don't see them for this timeout
> - 0 datarate reported from fio
> 
> I'll have a look...

What is the output of the following command?

(cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH . {} \;)

Thanks,
Ming


* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-16 12:07   ` Ming Lei
@ 2019-10-16 16:19     ` John Garry
       [not found]       ` <55a84ea3-647d-0a76-596c-c6c6b2fc1b75@huawei.com>
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2019-10-16 16:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 16/10/2019 13:07, Ming Lei wrote:
> On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote:
>> On 14/10/2019 02:50, Ming Lei wrote:
>>> Hi,
>>>
>>> Thomas mentioned:
>>>     "
>>>      That was the constraint of managed interrupts from the very beginning:
>>>
>>>       The driver/subsystem has to quiesce the interrupt line and the associated
>>>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>>>       until it's restarted by the core when the CPU is plugged in again.
>>>     "
>>>
>>> But no driver, nor blk-mq itself, does that before one hctx becomes dead (all
>>> CPUs for one hctx are offline); even worse, blk-mq still tries
>>> to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().
>>>
>>> This patchset tries to address the issue by two stages:
>>>
>>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>>
>>> - mark the hctx as internal stopped, and drain all in-flight requests
>>> if the hctx is going to be dead.
>>>
>>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>>>
>>> - steal bios from the request, and resubmit them via generic_make_request(),
>>> then these IO will be mapped to other live hctx for dispatch
>>>
>>> Please comment & review, thanks!
>>>
>>> John, I didn't add your Tested-by tag since V4 has some changes over V3,
>>> and I'd appreciate it if you could run your test on V4.
>>
>> Hi Ming,
>>
>> So I got around to doing some testing. The good news is that the issue we
>> were experiencing with the v3 series seems to have gone away - it's a lot
>> more stable.
>>
>> However, unfortunately, I did notice some SCSI timeouts:
>>
>> 15508.615074] CPU2: shutdown
>> [15508.617778] psci: CPU2 killed.
>> [15508.651220] CPU1: shutdown
>> [15508.653924] psci: CPU1 killed.
>> [15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63
>> Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task:
>> aborting task 0x00000000a7159744
>> KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas:
>> sas_eh_handle_sas_errors: task 0x00000000a7159744 is done
>> [15518.431266] sas: sas_scsi_find_task: aborting task 0x00000000d39731eb
>> [15518.442539] sas: sas_eh_handle_sas_errors: task 0x00000000d39731eb is
>> done
>> [15518.449407] sas: sas_scsi_find_task: aborting task 0x000000009f77c9bd
>> [15518.455899] sas: sas_eh_handle_sas_errors: task 0x000000009f77c9bd is
>> done
>>
>> A couple of things to note:
>> - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
>> inflights rqs !=0, and I don't see them for this timeout
>> - 0 datarate reported from fio
>>
>> I'll have a look...
>
> What is the output of the following command?
>
> (cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH . {} \;)
I assume that you want this run at about the time the SCSI EH kicks in for 
the timeout, right? If so, I need to find some simple sysfs entry to 
tell me when this occurs, to trigger the capture. Or add something. My 
script is pretty dumb.
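For what it's worth, a trigger like that doesn't strictly need a sysfs entry: watching the kernel log for the timeout print is usually enough. Below is a rough userspace sketch of the idea; the "scsi_timeout" marker is an assumption matching the debug print added for the timeout, not anything mainline emits.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical trigger helper: scan a kernel log stream (e.g. a dmesg
 * capture, or /dev/kmsg) line by line and return 1 as soon as a marker
 * such as "scsi_timeout" shows up, so a wrapper script can then grab
 * the debugfs dump. Returns 0 if the stream ends without the marker. */
static int wait_for_marker(FILE *log, const char *marker)
{
	char line[1024];

	while (fgets(line, sizeof(line), log)) {
		if (strstr(line, marker))
			return 1;	/* marker seen: time to capture debugfs */
	}
	return 0;
}
```

A script could run this against `dmesg --follow` output and, on success, capture the debugfs state with the find/grep one-liner Ming suggested.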

BTW, I did notice that we hit the dump_stack() in __blk_mq_run_hw_queue() 
shortly before the problem happens - maybe a clue, or maybe a coincidence.

Thanks,
John

>
> Thanks,
> Ming
>
> .
>




* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
       [not found]       ` <55a84ea3-647d-0a76-596c-c6c6b2fc1b75@huawei.com>
@ 2019-10-20 10:14         ` Ming Lei
  2019-10-21  9:19           ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-10-20 10:14 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Thu, Oct 17, 2019 at 04:40:12PM +0100, John Garry wrote:
> On 16/10/2019 17:19, John Garry wrote:
> > On 16/10/2019 13:07, Ming Lei wrote:
> > > On Wed, Oct 16, 2019 at 09:58:27AM +0100, John Garry wrote:
> > > > On 14/10/2019 02:50, Ming Lei wrote:
> > > > > [...]
> > > > 
> > > > Hi Ming,
> > > > 
> > > > So I got around to doing some testing. The good news is that the issue
> > > > we were experiencing with the v3 series seems to have gone away - it's
> > > > a lot more stable.
> > > > 
> > > > However, unfortunately, I did notice some SCSI timeouts:
> > > > 
> > > > 15508.615074] CPU2: shutdown
> > > > [15508.617778] psci: CPU2 killed.
> > > > [15508.651220] CPU1: shutdown
> > > > [15508.653924] psci: CPU1 killed.
> > > > [15518.406229] sas: Enter sas_scsi_recover_host busy: 63 failed: 63
> > > > Jobs: 1 (f=1): [R] [0.0% done] [0[15518.412239] sas: sas_scsi_find_task:
> > > > aborting task 0x00000000a7159744
> > > > KB/0KB/0KB /s] [0/0/0 iops] [eta [15518.421708] sas:
> > > > sas_eh_handle_sas_errors: task 0x00000000a7159744 is done
> > > > [15518.431266] sas: sas_scsi_find_task: aborting task 0x00000000d39731eb
> > > > [15518.442539] sas: sas_eh_handle_sas_errors: task 0x00000000d39731eb is
> > > > done
> > > > [15518.449407] sas: sas_scsi_find_task: aborting task 0x000000009f77c9bd
> > > > [15518.455899] sas: sas_eh_handle_sas_errors: task 0x000000009f77c9bd is
> > > > done
> > > > 
> > > > A couple of things to note:
> > > > - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
> > > > inflights rqs !=0, and I don't see them for this timeout
> > > > - 0 datarate reported from fio
> > > > 
> > > > I'll have a look...
> > > 
> > > What is the output of the following command?
> > > 
> > > (cd /sys/kernel/debug/block/$SAS_DISK && find . -type f -exec grep -aH
> > > . {} \;)
> > I assume that you want this run at about the time the SCSI EH kicks in for
> > the timeout, right? If so, I need to find some simple sysfs entry to
> > tell me when this occurs, to trigger the capture. Or add something. My
> > script is pretty dumb.
> > 
> > BTW, I did notice that we hit the dump_stack() in __blk_mq_run_hw_queue()
> > shortly before the problem happens - maybe a clue, or maybe a coincidence.
> > 
> 
> I finally got to capture that debugfs dump at the point the SCSI IOs
> timeout, as attached. Let me know if any problem receiving it.
> 
> Here's a kernel log snippet at that point (I added some print for the
> timeout):
> 
> 609] psci: CPU6 killed.
> [  547.722217] CPU5: shutdown
> [  547.724926] psci: CPU5 killed.
> [  547.749951] irq_shutdown
> [  547.752701] IRQ 800: no longer affine to CPU4
> [  547.757265] CPU4: shutdown
> [  547.759971] psci: CPU4 killed.
> [  547.790348] CPU3: shutdown
> [  547.793052] psci: CPU3 killed.
> [  547.818330] CPU2: shutdown
> [  547.821033] psci: CPU2 killed.
> [  547.854285] CPU1: shutdown
> [  547.856989] psci: CPU1 killed.
> [  575.925307] scsi_timeout req=0xffff0023b0dd9c00 reserved=0
> [  575.930794] scsi_timeout req=0xffff0023b0df2700 reserved=0

From the debugfs log, 66 requests are dumped; 63 of them have been
submitted to the device, and the other 3 are in the ->dispatch list
via requeue after the timeout is handled.

You mentioned that:

" - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
 inflights rqs !=0, and I don't see them for this timeout"

There might be two reasons:

1) You are still testing a multiple reply-queue device? As I
mentioned last time, it is hard to map a reply-queue onto a blk-mq
hctx correctly.

2) Concurrent dispatch to the device, which can be observed with the
following patch.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 06081966549f..3590f6f947eb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
 {
        struct request_queue *q = rq->q;

+       WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED, &rq->mq_hctx->state));
+
        trace_block_rq_issue(q, rq);

        if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {

However, I think reason 2) is unlikely, since the current CPU is the last
CPU in hctx->cpu_mask.


Thanks,
Ming



* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-20 10:14         ` Ming Lei
@ 2019-10-21  9:19           ` John Garry
  2019-10-21  9:34             ` Ming Lei
  2019-10-25 16:33             ` John Garry
  0 siblings, 2 replies; 32+ messages in thread
From: John Garry @ 2019-10-21  9:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 20/10/2019 11:14, Ming Lei wrote:
>>> [...]
> From the debugfs log, 66 requests are dumped; 63 of them have been
> submitted to the device, and the other 3 are in the ->dispatch list
> via requeue after the timeout is handled.
>

Hi Ming,

> You mentioned that:
>
> " - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
>  inflights rqs !=0, and I don't see them for this timeout"
>
> There might be two reasons:
>
> 1) You are still testing a multiple reply-queue device?

As before, I am testing by exposing multiple queues to the SCSI 
midlayer. I had to make this change locally, as on mainline we still 
only expose a single queue and use the internal reply queue when 
enabling managed interrupts.

As I
> mentioned last times, it is hard to map reply-queue into blk-mq
> hctx correctly.

Here's my branch, if you want to check:

https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4

It's a bit messy (sorry), but you can see that the reply-queue in the 
LLDD is removed in commit 087b95af374.

I am now thinking of actually making this change to the LLDD in mainline 
to avoid any doubt in future.

>
> 2) concurrent dispatch to device, which can be observed by the
> following patch.
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 06081966549f..3590f6f947eb 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
>  {
>         struct request_queue *q = rq->q;
>
> +       WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED, &rq->mq_hctx->state));
> +
>         trace_block_rq_issue(q, rq);
>
>         if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
>
> However, I think reason 2) is unlikely, since the current CPU is the last
> CPU in hctx->cpu_mask.
>

I'll try it.

Thanks as always,
John

>
> Thanks,
> Ming
>
>
> .
>




* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-21  9:19           ` John Garry
@ 2019-10-21  9:34             ` Ming Lei
  2019-10-21  9:47               ` John Garry
  2019-10-25 16:33             ` John Garry
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-10-21  9:34 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Mon, Oct 21, 2019 at 10:19:18AM +0100, John Garry wrote:
> On 20/10/2019 11:14, Ming Lei wrote:
> > [...]
> > 
> 
> Hi Ming,
> 
> > You mentioned that:
> > 
> > " - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
> >  inflights rqs !=0, and I don't see them for this timeout"
> > 
> > There might be two reasons:
> > 
> > 1) You are still testing a multiple reply-queue device?
> 
> As before, I am testing by exposing multiple queues to the SCSI midlayer. I
> had to make this change locally, as on mainline we still only expose a
> single queue and use the internal reply queue when enabling managed
> interrupts.
> 
> As I
> > mentioned last times, it is hard to map reply-queue into blk-mq
> > hctx correctly.
> 
> Here's my branch, if you want to check:
> 
> https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4
> 
> It's a bit messy (sorry), but you can see that the reply-queue in the LLDD
> is removed in commit 087b95af374.
> 
> I am now thinking of actually making this change to the LLDD in mainline to
> avoid any doubt in future.

As I mentioned last time, you share tags among all MQ queues on your hardware,
given that your hardware is actually a single-queue HBA, so commit 087b95af374
is definitely wrong, isn't it?

It can be very hard to partition the single tag space among multiple hctxs.


Thanks,
Ming



* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-21  9:34             ` Ming Lei
@ 2019-10-21  9:47               ` John Garry
  2019-10-21 10:24                 ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2019-10-21  9:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 21/10/2019 10:34, Ming Lei wrote:
> On Mon, Oct 21, 2019 at 10:19:18AM +0100, John Garry wrote:
>> On 20/10/2019 11:14, Ming Lei wrote:
>>> [...]
>>
>> Hi Ming,
>>
>>> You mentioned that:
>>>
>>> " - I added some debug prints in blk_mq_hctx_drain_inflight_rqs() for when
>>>  inflights rqs !=0, and I don't see them for this timeout"
>>>
>>> There might be two reasons:
>>>
>>> 1) You are still testing a multiple reply-queue device?
>>
>> As before, I am testing by exposing multiple queues to the SCSI midlayer. I
>> had to make this change locally, as on mainline we still only expose a
>> single queue and use the internal reply queue when enabling managed
>> interrupts.
>>
>> As I
>>> mentioned last times, it is hard to map reply-queue into blk-mq
>>> hctx correctly.
>>
>> Here's my branch, if you want to check:
>>
>> https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4
>>
>> It's a bit messy (sorry), but you can see that the reply-queue in the LLDD
>> is removed in commit 087b95af374.
>>
>> I am now thinking of actually making this change to the LLDD in mainline to
>> avoid any doubt in future.
>
> As I mentioned last time, you do share tags among all MQ queues on your hardware
> given your hardware is actually SQ HBA, so commit 087b95af374 is definitely
> wrong, isn't it?
>

Yes, we share tags among all queues, but we generate the tag - known as 
IPTT - in the LLDD now, as we can no longer use the request tag (as it 
is not unique across all queues):

https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188

As I said, the branch is messy and I did have to fix 087b95af374.
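For readers following along, the scheme John describes - a driver-private tag pool decoupled from the per-queue blk-mq tags - boils down to something like the userspace model below. This is a sketch only: the names and the linear scan are illustrative, and a real driver would use a lock plus an efficient bitmap (e.g. sbitmap) rather than this simplistic, non-thread-safe version.

```c
#include <assert.h>
#include <string.h>

/* Userspace model of a driver-private tag (IPTT) allocator shared by
 * all hw queues. IPTT_DEPTH follows the 4096-command figure from the
 * thread. */
#define IPTT_DEPTH 4096

struct iptt_pool {
	unsigned char used[IPTT_DEPTH];
};

/* Returns an IPTT in [0, IPTT_DEPTH), or -1 when the pool is
 * exhausted and the caller has to push the request back for retry. */
static int iptt_alloc(struct iptt_pool *p)
{
	for (int i = 0; i < IPTT_DEPTH; i++) {
		if (!p->used[i]) {
			p->used[i] = 1;
			return i;	/* handle given to the hardware */
		}
	}
	return -1;
}

static void iptt_free(struct iptt_pool *p, int iptt)
{
	p->used[iptt] = 0;	/* completion releases the IPTT */
}
```

The point of the scheme is that the blk-mq request tag is no longer used as the hardware handle, which is why it stops mattering that request tags are only unique within one hctx.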

> It can be very hard to partition the single tags among multiple hctx.
>

I really do think now that I'll make this change on mainline to avoid 
doubt...

Thanks,
John

>
> Thanks,
> Ming
>
>
> .
>




* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-21  9:47               ` John Garry
@ 2019-10-21 10:24                 ` Ming Lei
  2019-10-21 11:49                   ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-10-21 10:24 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Mon, Oct 21, 2019 at 10:47:05AM +0100, John Garry wrote:
> On 21/10/2019 10:34, Ming Lei wrote:
> > On Mon, Oct 21, 2019 at 10:19:18AM +0100, John Garry wrote:
> > > On 20/10/2019 11:14, Ming Lei wrote:
> > > [...]
> > 
> > As I mentioned last time, you do share tags among all MQ queues on your hardware
> > given your hardware is actually SQ HBA, so commit 087b95af374 is definitely
> > wrong, isn't it?
> > 
> 
> Yes, we share tags among all queues, but we generate the tag - known as IPTT
> - in the LLDD now, as we can no longer use the request tag (as it is not
> unique per all queues):
> 
> https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188
> 
> As I said, the branch is messy and I did have to fix 087b95af374.

Firstly, this way may waste lots of memory, especially when the queue depth
is big; for example, hisilicon V3's queue depth is 4096.

Secondly, you have to deal with queue busy conditions efficiently and
correctly: your real hw tags (IPTT) can be used up easily, and how
will you handle the requests that have already been dispatched?

Finally, you have to evaluate the performance effect; this is highly
related to how you deal with running out of IPTTs.

I'd suggest you fix this up and post the patches for review.
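The push-back problem Ming raises can be modelled in a few lines. This is purely illustrative: in a real SCSI LLDD the BUSY case would map to returning SCSI_MLQUEUE_HOST_BUSY from .queuecommand so the midlayer requeues and retries, and the names and numbers here are assumptions.

```c
#include <assert.h>

/* Toy model of the dispatch-side decision: if the driver-private tag
 * pool (IPTT) is exhausted, the driver must push back so the request
 * is requeued and retried, rather than being lost. */
enum dispatch_status { DISPATCH_OK, DISPATCH_BUSY };

struct tag_pool {
	int depth;	/* total IPTTs, e.g. 4096 on hisi_sas v3 */
	int in_use;
};

static enum dispatch_status queue_one(struct tag_pool *pool)
{
	if (pool->in_use == pool->depth)
		return DISPATCH_BUSY;	/* out of IPTT: caller requeues */
	pool->in_use++;			/* IPTT allocated, command sent */
	return DISPATCH_OK;
}

static void complete_one(struct tag_pool *pool)
{
	pool->in_use--;			/* completion frees the IPTT */
}
```

How often the BUSY path is hit, and how cheaply the retry is handled, is exactly the performance question raised above.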

Thanks,
Ming



* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-21 10:24                 ` Ming Lei
@ 2019-10-21 11:49                   ` John Garry
  2019-10-21 12:53                     ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2019-10-21 11:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

>>>
>>
>> Yes, we share tags among all queues, but we generate the tag - known as IPTT
>> - in the LLDD now, as we can no longer use the request tag (as it is not
>> unique per all queues):
>>
>> https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188
>>
>> As I said, the branch is messy and I did have to fix 087b95af374.
>
> Firstly this way may waste lots of memory, especially the queue depth is
> big, such as, hisilicon V3's queue depth is 4096.
>
> Secondly, you have to deal with queue busy efficiently and correctly,
> for example, your real hw tags(IPTT) can be used up easily, and how
> will you handle these dispatched request?

I have not seen a scenario of exhausted IPTT. And the IPTT count is the 
same as SCSI host.can_queue, so the SCSI midlayer should ensure that this 
does not occur.

>
> Finally, you have to evaluate the performance effect, this is highly
> related with how to deal with out-of-IPTT.

Some figures from our previous testing:

Managed interrupt without exposing multiple queues: 3M IOPs
Managed interrupt with exposing multiple queues: 2.6M IOPs
No managed interrupts: 500K IOPs.

Now my problem is that I hear some people are relying on the 3M 
performance, even though it is experimental and unsafe. I need to follow 
up on this. I really don't want to keep that code.

>
> I'd suggest you to fix the stuff and post them out for review.

OK, I'll also check on adding that WARN you provided.

Thanks again,
John

>
> Thanks,
> Ming
>
>
> .
>




* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-21 11:49                   ` John Garry
@ 2019-10-21 12:53                     ` Ming Lei
  2019-10-21 14:02                       ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-10-21 12:53 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote:
> > > [...]
> > Secondly, you have to deal with queue busy efficiently and correctly,
> > for example, your real hw tags(IPTT) can be used up easily, and how
> > will you handle these dispatched request?
> 
> I have not seen a scenario of exhausted IPTT. And the IPTT count is the same
> as SCSI host.can_queue, so the SCSI midlayer should ensure that this does
> not occur.

That check isn't correct: each hw queue is allowed
.can_queue in-flight requests.

> 
> > 
> > Finally, you have to evaluate the performance effect, this is highly
> > related with how to deal with out-of-IPTT.
> 
> Some figures from our previous testing:
> 
> Managed interrupt without exposing multiple queues: 3M IOPs
> Managed interrupt with exposing multiple queues: 2.6M IOPs

Then you see the performance regression.


Thanks,
Ming



* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-21 12:53                     ` Ming Lei
@ 2019-10-21 14:02                       ` John Garry
  2019-10-22  0:16                         ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2019-10-21 14:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 21/10/2019 13:53, Ming Lei wrote:
> On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote:
>>>> [...]
>>> Secondly, you have to deal with queue busy efficiently and correctly,
>>> for example, your real hw tags(IPTT) can be used up easily, and how
>>> will you handle these dispatched request?
>>
>> I have not seen a scenario of exhausted IPTT. And the IPTT count is the same
>> as SCSI host.can_queue, so the SCSI midlayer should ensure that this does
>> not occur.
>

Hi Ming,

> That check isn't correct: each hw queue is allowed
> .can_queue in-flight requests.

There always seems to be some confusion or disagreement on this topic.

I work according to the comment in scsi_host.h:

"Note: it is assumed that each hardware queue has a queue depth of
  can_queue. In other words, the total queue depth per host
  is nr_hw_queues * can_queue."

So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096)
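Read literally, that comment makes the block layer's view of the host much deeper than the hardware. A quick sketch of the arithmetic, where the 16-queue figure is an assumption for illustration and the struct is a stand-in, not the real Scsi_Host:

```c
#include <assert.h>

/* Illustrative numbers: 4096 IPTTs (HISI_SAS_MAX_COMMANDS) and an
 * assumed 16 hw queues. Per the scsi_host.h comment John cites, the
 * total queue depth per host is nr_hw_queues * can_queue. */
struct host_model {
	int nr_hw_queues;
	int can_queue;
};

static int total_host_depth(const struct host_model *h)
{
	return h->nr_hw_queues * h->can_queue;
}
```

With can_queue = 4096 and 16 hw queues, blk-mq could have 65536 requests in flight against only 4096 hardware IPTTs, which is the over-commit Ming objects to in his reply.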

>
>>
>>>
>>> Finally, you have to evaluate the performance effect, this is highly
>>> related with how to deal with out-of-IPTT.
>>
>> Some figures from our previous testing:
>>
>> Managed interrupt without exposing multiple queues: 3M IOPs
>> Managed interrupt with exposing multiple queues: 2.6M IOPs
>
> Then you see the performance regression.

Let's discuss this when I send the patches, so we don't get sidetracked 
on this blk-mq improvement topic.

Thanks,
John

>
>
> Thanks,
> Ming
>
>
>




* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-21 14:02                       ` John Garry
@ 2019-10-22  0:16                         ` Ming Lei
  2019-10-22 11:19                           ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-10-22  0:16 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Mon, Oct 21, 2019 at 03:02:56PM +0100, John Garry wrote:
> On 21/10/2019 13:53, Ming Lei wrote:
> > On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote:
> > > > [...]
> > > > Secondly, you have to deal with queue busy efficiently and correctly,
> > > > for example, your real hw tags(IPTT) can be used up easily, and how
> > > > will you handle these dispatched request?
> > > 
> > > I have not seen scenario of exhausted IPTT. And IPTT count is same as SCSI
> > > host.can_queue, so SCSI midlayer should ensure that this does not occur.
> > 
> 
> Hi Ming,
> 
> > That check isn't correct, and each hw queue should have allowed
> > .can_queue in-flight requests.
> 
> There always seems to be some confusion or disagreement on this topic.
> 
> I work according to the comment in scsi_host.h:
> 
> "Note: it is assumed that each hardware queue has a queue depth of
>  can_queue. In other words, the total queue depth per host
>  is nr_hw_queues * can_queue."
> 
> So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096)

I believe all current drivers set .can_queue to a single hw queue's depth.
If you set .can_queue to HISI_SAS_MAX_COMMANDS, which is the HBA's queue
depth, the hisilicon sas driver will allow HISI_SAS_MAX_COMMANDS * nr_hw_queues
in-flight requests.

> 
> > 
> > > 
> > > > 
> > > > Finally, you have to evaluate the performance effect, this is highly
> > > > related with how to deal with out-of-IPTT.
> > > 
> > > Some figures from our previous testing:
> > > 
> > > Managed interrupt without exposing multiple queues: 3M IOPs
> > > Managed interrupt with exposing multiple queues: 2.6M IOPs
> > 
> > Then you see the performance regression.
> 
> Let's discuss this when I send the patches, so we don't get sidetracked on
> this blk-mq improvement topic.

OK, what I meant is to use the correct driver to test the patches, otherwise
it might be hard to investigate.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-22  0:16                         ` Ming Lei
@ 2019-10-22 11:19                           ` John Garry
  2019-10-22 13:45                             ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2019-10-22 11:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 22/10/2019 01:16, Ming Lei wrote:
> On Mon, Oct 21, 2019 at 03:02:56PM +0100, John Garry wrote:
>> On 21/10/2019 13:53, Ming Lei wrote:
>>> On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote:
>>>>>>>
>>>>>>
>>>>>> Yes, we share tags among all queues, but we generate the tag - known as IPTT
>>>>>> - in the LLDD now, as we can no longer use the request tag (as it is not
>>>>>> unique per all queues):
>>>>>>
>>>>>> https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188
>>>>>>
>>>>>> As I said, the branch is messy and I did have to fix 087b95af374.
>>>>>
>>>>> Firstly this way may waste lots of memory, especially the queue depth is
>>>>> big, such as, hisilicon V3's queue depth is 4096.
>>>>>
>>>>> Secondly, you have to deal with queue busy efficiently and correctly,
>>>>> for example, your real hw tags(IPTT) can be used up easily, and how
>>>>> will you handle these dispatched request?
>>>>
>>>> I have not seen scenario of exhausted IPTT. And IPTT count is same as SCSI
>>>> host.can_queue, so SCSI midlayer should ensure that this does not occur.
>>>
>>
>> Hi Ming,

Hi Ming,

>>
>>> That check isn't correct, and each hw queue should have allowed
>>> .can_queue in-flight requests.
>>
>> There always seems to be some confusion or disagreement on this topic.
>>
>> I work according to the comment in scsi_host.h:
>>
>> "Note: it is assumed that each hardware queue has a queue depth of
>>   can_queue. In other words, the total queue depth per host
>>   is nr_hw_queues * can_queue."
>>
>> So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096)
> 
> I believe all current drivers set .can_queue as single hw queue's depth.
> If you set .can_queue as HISI_SAS_MAX_COMMANDS which is HBA's queue
> depth, the hisilicon sas driver will HISI_SAS_MAX_COMMANDS * nr_hw_queues
> in-flight requests.

Yeah, but the SCSI host should still limit max IOs over all queues to 
.can_queue:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scsi/scsi_mid_low_api.txt#n1083



> 
>>
>>>
>>>>
>>>>>
>>>>> Finally, you have to evaluate the performance effect, this is highly
>>>>> related with how to deal with out-of-IPTT.
>>>>
>>>> Some figures from our previous testing:
>>>>
>>>> Managed interrupt without exposing multiple queues: 3M IOPs
>>>> Managed interrupt with exposing multiple queues: 2.6M IOPs
>>>
>>> Then you see the performance regression.
>>
>> Let's discuss this when I send the patches, so we don't get sidetracked on
>> this blk-mq improvement topic.
> 
> OK, what I meant is to use correct driver to test the patches, otherwise
> it might be hard to investigate.

Of course. I'm working on this now, and it looks like it will turn out 
complicated... you'll see.

BTW, I reran the test and never saw the new WARN trigger (while SCSI 
timeouts did occur).

Thanks again,
John

> 
> 
> Thanks,
> Ming
> 
> .
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-22 11:19                           ` John Garry
@ 2019-10-22 13:45                             ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2019-10-22 13:45 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Tue, Oct 22, 2019 at 12:19:17PM +0100, John Garry wrote:
> On 22/10/2019 01:16, Ming Lei wrote:
> > On Mon, Oct 21, 2019 at 03:02:56PM +0100, John Garry wrote:
> > > On 21/10/2019 13:53, Ming Lei wrote:
> > > > On Mon, Oct 21, 2019 at 12:49:53PM +0100, John Garry wrote:
> > > > > > > > 
> > > > > > > 
> > > > > > > Yes, we share tags among all queues, but we generate the tag - known as IPTT
> > > > > > > - in the LLDD now, as we can no longer use the request tag (as it is not
> > > > > > > unique per all queues):
> > > > > > > 
> > > > > > > https://github.com/hisilicon/kernel-dev/commit/087b95af374be6965583c1673032fb33bc8127e8#diff-f5d8fff19bc539a7387af5230d4e5771R188
> > > > > > > 
> > > > > > > As I said, the branch is messy and I did have to fix 087b95af374.
> > > > > > 
> > > > > > Firstly this way may waste lots of memory, especially the queue depth is
> > > > > > big, such as, hisilicon V3's queue depth is 4096.
> > > > > > 
> > > > > > Secondly, you have to deal with queue busy efficiently and correctly,
> > > > > > for example, your real hw tags(IPTT) can be used up easily, and how
> > > > > > will you handle these dispatched request?
> > > > > 
> > > > > I have not seen scenario of exhausted IPTT. And IPTT count is same as SCSI
> > > > > host.can_queue, so SCSI midlayer should ensure that this does not occur.
> > > > 
> > > 
> > > Hi Ming,
> 
> Hi Ming,
> 
> > > 
> > > > That check isn't correct, and each hw queue should have allowed
> > > > .can_queue in-flight requests.
> > > 
> > > There always seems to be some confusion or disagreement on this topic.
> > > 
> > > I work according to the comment in scsi_host.h:
> > > 
> > > "Note: it is assumed that each hardware queue has a queue depth of
> > >   can_queue. In other words, the total queue depth per host
> > >   is nr_hw_queues * can_queue."
> > > 
> > > So I set Scsi_host.can_queue = HISI_SAS_MAX_COMMANDS (=4096)
> > 
> > I believe all current drivers set .can_queue as single hw queue's depth.
> > If you set .can_queue as HISI_SAS_MAX_COMMANDS which is HBA's queue
> > depth, the hisilicon sas driver will HISI_SAS_MAX_COMMANDS * nr_hw_queues
> > in-flight requests.
> 
> Yeah, but the SCSI host should still limit max IOs over all queues to
> .can_queue:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scsi/scsi_mid_low_api.txt#n1083
> 

That limit is actually from the legacy single-queue era; you can see that
I am removing it:

https://lore.kernel.org/linux-scsi/20191009093241.21481-2-ming.lei@redhat.com/

With this change, IOPS can be improved considerably on some fast SCSI storage.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-21  9:19           ` John Garry
  2019-10-21  9:34             ` Ming Lei
@ 2019-10-25 16:33             ` John Garry
  2019-10-28 10:42               ` Ming Lei
  1 sibling, 1 reply; 32+ messages in thread
From: John Garry @ 2019-10-25 16:33 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

>> There might be two reasons:
>>
>> 1) You are still testing a multiple reply-queue device?
> 
> As before, I am testing by exposing multiple queues to the SCSI 
> midlayer. I had to make this change locally, as on mainline we still 
> only expose a single queue and use the internal reply queue when 
> enabling managed interrupts.
> 
> As I
>> mentioned last times, it is hard to map reply-queue into blk-mq
>> hctx correctly.
> 
> Here's my branch, if you want to check:
> 
> https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4
> 
> It's a bit messy (sorry), but you can see that the reply-queue in the 
> LLDD is removed in commit 087b95af374.
> 
> I am now thinking of actually making this change to the LLDD in mainline 
> to avoid any doubt in future.
> 
>>
>> 2) concurrent dispatch to device, which can be observed by the
>> following patch.
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 06081966549f..3590f6f947eb 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
>>  {
>>         struct request_queue *q = rq->q;
>>
>> +       WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED, 
>> &rq->mq_hctx->state));
>> +
>>         trace_block_rq_issue(q, rq);
>>
>>         if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
>>
>> However, I think it is hard to be 2#, since the current CPU is the last
>> CPU in hctx->cpu_mask.
>>
> 
> I'll try it.
> 

Hi Ming,

I am looking at this issue again.

I am using 
https://lore.kernel.org/linux-scsi/1571926881-75524-1-git-send-email-john.garry@huawei.com/T/#t 
with expose_mq_experimental set. I guess you're going to say that this 
series is wrong, but I think it's ok for this purpose.

Forgetting that for a moment, maybe I have found an issue.

For the SCSI commands which timeout, I notice that 
scsi_set_blocked(reason=SCSI_MLQUEUE_EH_RETRY) was called 30 seconds 
earlier.

  scsi_set_blocked+0x20/0xb8
  __scsi_queue_insert+0x40/0x90
  scsi_softirq_done+0x164/0x1c8
  __blk_mq_complete_request_remote+0x18/0x20
  flush_smp_call_function_queue+0xa8/0x150
  generic_smp_call_function_single_interrupt+0x10/0x18
  handle_IPI+0xec/0x1a8
  arch_cpu_idle+0x10/0x18
  do_idle+0x1d0/0x2b0
  cpu_startup_entry+0x24/0x40
  secondary_start_kernel+0x1b4/0x208

I also notice that the __scsi_queue_insert() call, above, seems to retry 
by requeueing the request on a dead request queue, in calling 
__scsi_queue_insert()->blk_mq_requeue_request()->__blk_mq_requeue_request(), 
***:

[ 1185.235243] psci: CPU1 killed.
[ 1185.238610] blk_mq_hctx_notify_dead cpu1 dead 
request_queue=0xffff0023ace24f60 (id=19)
[ 1185.246530] blk_mq_hctx_notify_dead cpu1 dead 
request_queue=0xffff0023ace23f80 (id=17)
[ 1185.254443] blk_mq_hctx_notify_dead cpu1 dead 
request_queue=0xffff0023ace22fa0 (id=15)
[ 1185.262356] blk_mq_hctx_notify_dead cpu1 dead 
request_queue=0xffff0023ace21fc0 (id=13)***
[ 1185.270271] blk_mq_hctx_notify_dead cpu1 dead 
request_queue=0xffff0023ace20fe0 (id=11)
[ 1185.939451] scsi_softirq_done NEEDS_RETRY rq=0xffff0023b7416000
[ 1185.945359] scsi_set_blocked reason=0x1057
[ 1185.949444] __blk_mq_requeue_request request_queue=0xffff0023ace21fc0 
id=13 rq=0xffff0023b7416000***

[...]

[ 1214.903455] scsi_timeout req=0xffff0023add29000 reserved=0
[ 1214.908946] scsi_timeout req=0xffff0023add29300 reserved=0
[ 1214.914424] scsi_timeout req=0xffff0023add29600 reserved=0
[ 1214.919909] scsi_timeout req=0xffff0023add29900 reserved=0

I guess that we're retrying as the SCSI failed in the LLDD for some reason.

So could this be the problem - we're attempting to requeue on a dead 
request queue?

Thanks,
John

> Thanks as always,
> John
> 
>>
>> Thanks,
>> Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-25 16:33             ` John Garry
@ 2019-10-28 10:42               ` Ming Lei
  2019-10-28 11:55                 ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-10-28 10:42 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Fri, Oct 25, 2019 at 05:33:35PM +0100, John Garry wrote:
> > > There might be two reasons:
> > > 
> > > 1) You are still testing a multiple reply-queue device?
> > 
> > As before, I am testing by exposing multiple queues to the SCSI
> > midlayer. I had to make this change locally, as on mainline we still
> > only expose a single queue and use the internal reply queue when
> > enabling managed interrupts.
> > 
> > As I
> > > mentioned last times, it is hard to map reply-queue into blk-mq
> > > hctx correctly.
> > 
> > Here's my branch, if you want to check:
> > 
> > https://github.com/hisilicon/kernel-dev/commits/private-topic-sas-5.4-mq-v4
> > 
> > It's a bit messy (sorry), but you can see that the reply-queue in the
> > LLDD is removed in commit 087b95af374.
> > 
> > I am now thinking of actually making this change to the LLDD in mainline
> > to avoid any doubt in future.
> > 
> > > 
> > > 2) concurrent dispatch to device, which can be observed by the
> > > following patch.
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 06081966549f..3590f6f947eb 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
> > >  {
> > >         struct request_queue *q = rq->q;
> > > 
> > > +       WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED,
> > > &rq->mq_hctx->state));
> > > +
> > >         trace_block_rq_issue(q, rq);
> > > 
> > >         if (test_bit(QUEUE_FLAG_STATS, &q->queue_flags)) {
> > > 
> > > However, I think it is hard to be 2#, since the current CPU is the last
> > > CPU in hctx->cpu_mask.
> > > 
> > 
> > I'll try it.
> > 
> 
> Hi Ming,
> 
> I am looking at this issue again.
> 
> I am using https://lore.kernel.org/linux-scsi/1571926881-75524-1-git-send-email-john.garry@huawei.com/T/#t
> with expose_mq_experimental set. I guess you're going to say that this
> series is wrong, but I think it's ok for this purpose.
> 
> Forgetting that for a moment, maybe I have found an issue.
> 
> For the SCSI commands which timeout, I notice that
> scsi_set_blocked(reason=SCSI_MLQUEUE_EH_RETRY) was called 30 seconds
> earlier.
> 
>  scsi_set_blocked+0x20/0xb8
>  __scsi_queue_insert+0x40/0x90
>  scsi_softirq_done+0x164/0x1c8
>  __blk_mq_complete_request_remote+0x18/0x20
>  flush_smp_call_function_queue+0xa8/0x150
>  generic_smp_call_function_single_interrupt+0x10/0x18
>  handle_IPI+0xec/0x1a8
>  arch_cpu_idle+0x10/0x18
>  do_idle+0x1d0/0x2b0
>  cpu_startup_entry+0x24/0x40
>  secondary_start_kernel+0x1b4/0x208

Could you investigate a bit the reason why the timeout is triggered?

In particular, we are supposed to drain all in-flight requests before the
last CPU of this hctx becomes offline, so it shouldn't be caused by
the hctx becoming dead. We still need you to confirm that all
in-flight requests are really drained in your test. Or is it still
possible to dispatch to the LLDD after BLK_MQ_S_INTERNAL_STOPPED is set?

In theory, it shouldn't be possible, given we drain in-flight requests
on the last CPU of this hctx.

Or blk_mq_hctx_next_cpu() may still run WORK_CPU_UNBOUND schedule after
all CPUs are offline, could you add debug message in that branch?

> 
> I also notice that the __scsi_queue_insert() call, above, seems to retry to
> requeue the request on a dead rq in calling
> __scsi_queue_insert()->blk_mq_requeue_request()->__blk_mq_requeue_request(),
> ***:
> 
> [ 1185.235243] psci: CPU1 killed.
> [ 1185.238610] blk_mq_hctx_notify_dead cpu1 dead
> request_queue=0xffff0023ace24f60 (id=19)
> [ 1185.246530] blk_mq_hctx_notify_dead cpu1 dead
> request_queue=0xffff0023ace23f80 (id=17)
> [ 1185.254443] blk_mq_hctx_notify_dead cpu1 dead
> request_queue=0xffff0023ace22fa0 (id=15)
> [ 1185.262356] blk_mq_hctx_notify_dead cpu1 dead
> request_queue=0xffff0023ace21fc0 (id=13)***
> [ 1185.270271] blk_mq_hctx_notify_dead cpu1 dead
> request_queue=0xffff0023ace20fe0 (id=11)
> [ 1185.939451] scsi_softirq_done NEEDS_RETRY rq=0xffff0023b7416000
> [ 1185.945359] scsi_set_blocked reason=0x1057
> [ 1185.949444] __blk_mq_requeue_request request_queue=0xffff0023ace21fc0
> id=13 rq=0xffff0023b7416000***
> 
> [...]
> 
> [ 1214.903455] scsi_timeout req=0xffff0023add29000 reserved=0
> [ 1214.908946] scsi_timeout req=0xffff0023add29300 reserved=0
> [ 1214.914424] scsi_timeout req=0xffff0023add29600 reserved=0
> [ 1214.919909] scsi_timeout req=0xffff0023add29900 reserved=0
> 
> I guess that we're retrying as the SCSI failed in the LLDD for some reason.
> 
> So could this be the problem - we're attempting to requeue on a dead request
> queue?

If there are any in-flight requests originating from an hctx which is going
to become dead, they should have been drained before the CPU becomes offline.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-28 10:42               ` Ming Lei
@ 2019-10-28 11:55                 ` John Garry
  2019-10-29  1:50                   ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2019-10-28 11:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

>>
>> For the SCSI commands which timeout, I notice that
>> scsi_set_blocked(reason=SCSI_MLQUEUE_EH_RETRY) was called 30 seconds
>> earlier.
>>
>>   scsi_set_blocked+0x20/0xb8
>>   __scsi_queue_insert+0x40/0x90
>>   scsi_softirq_done+0x164/0x1c8
>>   __blk_mq_complete_request_remote+0x18/0x20
>>   flush_smp_call_function_queue+0xa8/0x150
>>   generic_smp_call_function_single_interrupt+0x10/0x18
>>   handle_IPI+0xec/0x1a8
>>   arch_cpu_idle+0x10/0x18
>>   do_idle+0x1d0/0x2b0
>>   cpu_startup_entry+0x24/0x40
>>   secondary_start_kernel+0x1b4/0x208
> 
> Could you investigate a bit the reason why timeout is triggered?

Yeah, it does seem a strange coincidence that the SCSI command even 
failed and we have to retry, since these should be uncommon events. I'll 
check on this LLDD error.

> 
> Especially we suppose to drain all in-flight requests before the
> last CPU of this hctx becomes offline, and it shouldn't be caused by
> the hctx becoming dead, so still need you to confirm that all
> in-flight requests are really drained in your test. 

ok

Or is it still
> possible to dispatch to LDD after BLK_MQ_S_INTERNAL_STOPPED is set?

It shouldn't be. However it would seem that this IO had been dispatched 
to the LLDD, the hctx dies, and then we attempt to requeue on that hctx.

> 
> In theory, it shouldn't be possible, given we drain in-flight request
> on the last CPU of this hctx.
> 
> Or blk_mq_hctx_next_cpu() may still run WORK_CPU_UNBOUND schedule after
> all CPUs are offline, could you add debug message in that branch?

ok

> 
>>
>> I also notice that the __scsi_queue_insert() call, above, seems to retry to
>> requeue the request on a dead rq in calling
>> __scsi_queue_insert()->blk_mq_requeue_request()->__blk_mq_requeue_request(),
>> ***:
>>
>> [ 1185.235243] psci: CPU1 killed.
>> [ 1185.238610] blk_mq_hctx_notify_dead cpu1 dead
>> request_queue=0xffff0023ace24f60 (id=19)
>> [ 1185.246530] blk_mq_hctx_notify_dead cpu1 dead
>> request_queue=0xffff0023ace23f80 (id=17)
>> [ 1185.254443] blk_mq_hctx_notify_dead cpu1 dead
>> request_queue=0xffff0023ace22fa0 (id=15)
>> [ 1185.262356] blk_mq_hctx_notify_dead cpu1 dead
>> request_queue=0xffff0023ace21fc0 (id=13)***
>> [ 1185.270271] blk_mq_hctx_notify_dead cpu1 dead
>> request_queue=0xffff0023ace20fe0 (id=11)
>> [ 1185.939451] scsi_softirq_done NEEDS_RETRY rq=0xffff0023b7416000
>> [ 1185.945359] scsi_set_blocked reason=0x1057
>> [ 1185.949444] __blk_mq_requeue_request request_queue=0xffff0023ace21fc0
>> id=13 rq=0xffff0023b7416000***
>>
>> [...]
>>
>> [ 1214.903455] scsi_timeout req=0xffff0023add29000 reserved=0
>> [ 1214.908946] scsi_timeout req=0xffff0023add29300 reserved=0
>> [ 1214.914424] scsi_timeout req=0xffff0023add29600 reserved=0
>> [ 1214.919909] scsi_timeout req=0xffff0023add29900 reserved=0
>>
>> I guess that we're retrying as the SCSI failed in the LLDD for some reason.
>>
>> So could this be the problem - we're attempting to requeue on a dead request
>> queue?
> 
> If there are any in-flight requests originated from hctx which is going
> to become dead, they should have been drained before CPU becomes offline.

Sure, but we seem to hit a corner case here...

Thanks,
John

> 
> Thanks,
> Ming
> 
> .
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-28 11:55                 ` John Garry
@ 2019-10-29  1:50                   ` Ming Lei
  2019-10-29  9:22                     ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-10-29  1:50 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Mon, Oct 28, 2019 at 11:55:42AM +0000, John Garry wrote:
> > > 
> > > For the SCSI commands which timeout, I notice that
> > > scsi_set_blocked(reason=SCSI_MLQUEUE_EH_RETRY) was called 30 seconds
> > > earlier.
> > > 
> > >   scsi_set_blocked+0x20/0xb8
> > >   __scsi_queue_insert+0x40/0x90
> > >   scsi_softirq_done+0x164/0x1c8
> > >   __blk_mq_complete_request_remote+0x18/0x20
> > >   flush_smp_call_function_queue+0xa8/0x150
> > >   generic_smp_call_function_single_interrupt+0x10/0x18
> > >   handle_IPI+0xec/0x1a8
> > >   arch_cpu_idle+0x10/0x18
> > >   do_idle+0x1d0/0x2b0
> > >   cpu_startup_entry+0x24/0x40
> > >   secondary_start_kernel+0x1b4/0x208
> > 
> > Could you investigate a bit the reason why timeout is triggered?
> 
> Yeah, it does seem a strange coincidence that the SCSI command even failed
> and we have to retry, since these should be uncommon events. I'll check on
> this LLDD error.
> 
> > 
> > Especially we suppose to drain all in-flight requests before the
> > last CPU of this hctx becomes offline, and it shouldn't be caused by
> > the hctx becoming dead, so still need you to confirm that all
> > in-flight requests are really drained in your test.
> 
> ok
> 
> Or is it still
> > possible to dispatch to LDD after BLK_MQ_S_INTERNAL_STOPPED is set?
> 
> It shouldn't be. However it would seem that this IO had been dispatched to
> the LLDD, the hctx dies, and then we attempt to requeue on that hctx.

But this patchset does wait for completion of in-flight requests before
shutting down the last CPU of this hctx.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-29  1:50                   ` Ming Lei
@ 2019-10-29  9:22                     ` John Garry
  2019-10-29 10:05                       ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2019-10-29  9:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 29/10/2019 01:50, Ming Lei wrote:
> On Mon, Oct 28, 2019 at 11:55:42AM +0000, John Garry wrote:
>>>>
>>>> For the SCSI commands which timeout, I notice that
>>>> scsi_set_blocked(reason=SCSI_MLQUEUE_EH_RETRY) was called 30 seconds
>>>> earlier.
>>>>
>>>>    scsi_set_blocked+0x20/0xb8
>>>>    __scsi_queue_insert+0x40/0x90
>>>>    scsi_softirq_done+0x164/0x1c8
>>>>    __blk_mq_complete_request_remote+0x18/0x20
>>>>    flush_smp_call_function_queue+0xa8/0x150
>>>>    generic_smp_call_function_single_interrupt+0x10/0x18
>>>>    handle_IPI+0xec/0x1a8
>>>>    arch_cpu_idle+0x10/0x18
>>>>    do_idle+0x1d0/0x2b0
>>>>    cpu_startup_entry+0x24/0x40
>>>>    secondary_start_kernel+0x1b4/0x208
>>>
>>> Could you investigate a bit the reason why timeout is triggered?
>>
>> Yeah, it does seem a strange coincidence that the SCSI command even failed
>> and we have to retry, since these should be uncommon events. I'll check on
>> this LLDD error.
>>
>>>
>>> Especially we suppose to drain all in-flight requests before the
>>> last CPU of this hctx becomes offline, and it shouldn't be caused by
>>> the hctx becoming dead, so still need you to confirm that all
>>> in-flight requests are really drained in your test.
>>
>> ok
>>
>> Or is it still
>>> possible to dispatch to LDD after BLK_MQ_S_INTERNAL_STOPPED is set?
>>
>> It shouldn't be. However it would seem that this IO had been dispatched to
>> the LLDD, the hctx dies, and then we attempt to requeue on that hctx.
> 
> But this patch does wait for completion of in-flight request before
> shutdown the last CPU of this hctx.
> 

Hi Ming,

It may actually be a request from an hctx which is not shut down that 
errors and causes the timeout. I'm still checking.

BTW, can you let me know exactly where you want the debug for "Or 
blk_mq_hctx_next_cpu() may still run WORK_CPU_UNBOUND schedule after
all CPUs are offline, could you add debug message in that branch?"

Thanks,
John

> 
> Thanks,
> Ming
> 
> .
> 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-29  9:22                     ` John Garry
@ 2019-10-29 10:05                       ` Ming Lei
  2019-10-29 17:54                         ` John Garry
  2019-10-31 16:28                         ` John Garry
  0 siblings, 2 replies; 32+ messages in thread
From: Ming Lei @ 2019-10-29 10:05 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On Tue, Oct 29, 2019 at 09:22:26AM +0000, John Garry wrote:
> On 29/10/2019 01:50, Ming Lei wrote:
> > On Mon, Oct 28, 2019 at 11:55:42AM +0000, John Garry wrote:
> > > > > 
> > > > > For the SCSI commands which timeout, I notice that
> > > > > scsi_set_blocked(reason=SCSI_MLQUEUE_EH_RETRY) was called 30 seconds
> > > > > earlier.
> > > > > 
> > > > >    scsi_set_blocked+0x20/0xb8
> > > > >    __scsi_queue_insert+0x40/0x90
> > > > >    scsi_softirq_done+0x164/0x1c8
> > > > >    __blk_mq_complete_request_remote+0x18/0x20
> > > > >    flush_smp_call_function_queue+0xa8/0x150
> > > > >    generic_smp_call_function_single_interrupt+0x10/0x18
> > > > >    handle_IPI+0xec/0x1a8
> > > > >    arch_cpu_idle+0x10/0x18
> > > > >    do_idle+0x1d0/0x2b0
> > > > >    cpu_startup_entry+0x24/0x40
> > > > >    secondary_start_kernel+0x1b4/0x208
> > > > 
> > > > Could you investigate a bit the reason why timeout is triggered?
> > > 
> > > Yeah, it does seem a strange coincidence that the SCSI command even failed
> > > and we have to retry, since these should be uncommon events. I'll check on
> > > this LLDD error.
> > > 
> > > > 
> > > > Especially we suppose to drain all in-flight requests before the
> > > > last CPU of this hctx becomes offline, and it shouldn't be caused by
> > > > the hctx becoming dead, so still need you to confirm that all
> > > > in-flight requests are really drained in your test.
> > > 
> > > ok
> > > 
> > > Or is it still
> > > > possible to dispatch to LDD after BLK_MQ_S_INTERNAL_STOPPED is set?
> > > 
> > > It shouldn't be. However it would seem that this IO had been dispatched to
> > > the LLDD, the hctx dies, and then we attempt to requeue on that hctx.
> > 
> > But this patch does wait for completion of in-flight request before
> > shutdown the last CPU of this hctx.
> > 
> 
> Hi Ming,
> 
> It may actually be a request from a hctx which is not shut down which errors
> and causes the time out. I'm still checking.

If that is the case, blk_mq_hctx_drain_inflight_rqs() will wait for
completion of this request.

The only chance it is missed is if the last CPU of this hctx becomes
offline just while this request stays in the request list after it is
retried from EH.

> 
> BTW, Can you let me know exactly where you want the debug for "Or
> blk_mq_hctx_next_cpu() may still run WORK_CPU_UNBOUND schedule after
> all CPUs are offline, could you add debug message in that branch?"

You can add the following debug message, then reproduce the issue and
see if the debug log is dumped.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 06081966549f..5a98a7b79c0d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1452,6 +1452,10 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 		 */
 		hctx->next_cpu = next_cpu;
 		hctx->next_cpu_batch = 1;
+
+		printk(KERN_WARNING "CPU %d, schedule from (dead) hctx %s\n",
+			raw_smp_processor_id(),
+			cpumask_empty(hctx->cpumask) ? "inactive": "active");
 		return WORK_CPU_UNBOUND;
 	}
 

Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-29 10:05                       ` Ming Lei
@ 2019-10-29 17:54                         ` John Garry
  2019-10-31 16:28                         ` John Garry
  1 sibling, 0 replies; 32+ messages in thread
From: John Garry @ 2019-10-29 17:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 29/10/2019 10:05, Ming Lei wrote:
>>> But this patch does wait for completion of in-flight request before
>>> shutdown the last CPU of this hctx.
>>>
>> Hi Ming,
>>
>> It may actually be a request from a hctx which is not shut down which errors
>> and causes the time out. I'm still checking.
> If that is the case, blk_mq_hctx_drain_inflight_rqs() will wait for
> completion of this request.
> 
> The only chance it is missed is that the last CPU of this hctx becomes
> offline just when this request stays in request list after it is
> retried from EH.
> 
>> BTW, Can you let me know exactly where you want the debug for "Or
>> blk_mq_hctx_next_cpu() may still run WORK_CPU_UNBOUND schedule after
>> all CPUs are offline, could you add debug message in that branch?"
> You can add the following debug message, then reproduce the issue and
> see if the debug log is dumped.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 06081966549f..5a98a7b79c0d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1452,6 +1452,10 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
>   		 */
>   		hctx->next_cpu = next_cpu;
>   		hctx->next_cpu_batch = 1;
> +
> +		printk(KERN_WARNING "CPU %d, schedule from (dead) hctx %s\n",
> +			raw_smp_processor_id(),
> +			cpumask_empty(hctx->cpumask) ? "inactive" : "active");
>   		return WORK_CPU_UNBOUND;

We don't seem to be hitting this.

So the error generally happens after the CPUs have been hot unplugged 
for some time. I see a SCSI IO error in the LLDD for a request that had 
been run on an up hctx - blk-mq requeues it on the same hctx, but it 
only dispatches eventually after some other SCSI IO times out - I assume 
that IO times out due to the earlier scsi_set_blocked() call.

Continuing to look....

Thanks,
John

>   	}
>   
> 
> Thanks,
> Ming
> 
> .
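
The drain step Ming refers to above can be pictured as "mark the hctx
internally stopped, then poll until in-flight IO reaches zero". A
user-space model of that idea (all names here are invented for
illustration; this is not the kernel implementation, which polls with
msleep(5) while the driver completes requests asynchronously):

```c
#include <assert.h>

/* Hypothetical model of draining a hw queue: the hctx is first marked
 * internally stopped so no new requests are dispatched, then we poll
 * until the in-flight counter drops to zero. */
struct model_hctx {
	int stopped;	/* models BLK_MQ_S_INTERNAL_STOPPED */
	int inflight;	/* models the in-flight request count */
};

/* One completion event: the driver finishes a request. */
static void model_complete_one(struct model_hctx *h)
{
	if (h->inflight > 0)
		h->inflight--;
}

/* Drain: stop the queue, then wait for in-flight IO to finish.  The
 * kernel sleeps between re-checks; here completions are driven
 * synchronously so the sketch terminates on its own. */
static int model_drain(struct model_hctx *h)
{
	h->stopped = 1;
	while (h->inflight > 0)
		model_complete_one(h); /* kernel: msleep(5) and re-check */
	return h->inflight;
}
```

In the real code the completions come from the driver's IRQ path; the
synchronous model_complete_one() call just keeps the sketch
self-contained.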



* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-29 10:05                       ` Ming Lei
  2019-10-29 17:54                         ` John Garry
@ 2019-10-31 16:28                         ` John Garry
  1 sibling, 0 replies; 32+ messages in thread
From: John Garry @ 2019-10-31 16:28 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 29/10/2019 10:05, Ming Lei wrote:
>>>> ok
>>>>
>>>> Or is it still
>>>>> possible to dispatch to LDD after BLK_MQ_S_INTERNAL_STOPPED is set?
>>>> It shouldn't be. However it would seem that this IO had been dispatched to
>>>> the LLDD, the hctx dies, and then we attempt to requeue on that hctx.
>>> But this patch does wait for completion of in-flight requests before
>>> shutting down the last CPU of this hctx.
>>>
>> Hi Ming,
>>
>> It may actually be a request from a hctx which is not shut down which errors
>> and causes the time out. I'm still checking.
> If that is the case, blk_mq_hctx_drain_inflight_rqs() will wait for
> completion of this request.
> 
> The only chance it is missed is that the last CPU of this hctx becomes
> offline just when this request stays in request list after it is
> retried from EH.
> 
>> BTW, can you let me know exactly where you want the debug for "Or
>> blk_mq_hctx_next_cpu() may still run WORK_CPU_UNBOUND schedule after
>> all CPUs are offline, could you add debug message in that branch?"
> You can add the following debug message, then reproduce the issue and
> see if the debug log is dumped.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 06081966549f..5a98a7b79c0d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1452,6 +1452,10 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
>   		 */
>   		hctx->next_cpu = next_cpu;
>   		hctx->next_cpu_batch = 1;
> +
> +		printk(KERN_WARNING "CPU %d, schedule from (dead) hctx %s\n",
> +			raw_smp_processor_id(),
> +			cpumask_empty(hctx->cpumask) ? "inactive" : "active");
>   		return WORK_CPU_UNBOUND;
>   	}
>   


On 28/10/2019 10:42, Ming Lei wrote:
 > +++ b/block/blk-mq.c
 > @@ -679,6 +679,8 @@ void blk_mq_start_request(struct request *rq)
 >   {
 >          struct request_queue *q = rq->q;
 >
 > +       WARN_ON_ONCE(test_bit(BLK_MQ_S_INTERNAL_STOPPED,
 > &rq->mq_hctx->state));
 > +
 >          trace_block_rq_issue(q, rq);
 >
 >          if (test_bit(QUEUE_FLAG_STATS, &q->q


I have now hit this once:

[ 2960.943684] CPU7: shutdown
[ 2960.946424] psci: CPU7 killed.
[ 2961.030546] ------------[ cut here ]------------
[ 2961.035169] WARNING: CPU: 6 PID: 6127 at block/blk-mq.c:682 
blk_mq_start_request+0x1a8/0x1b0
[ 2961.043594] Modules linked in:
[ 2961.046642] CPU: 6 PID: 6127 Comm: fio Not tainted 
5.4.0-rc1-00129-gc61cbc373a43-dirty #457
[ 2961.054981] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI 
RC0 - V1.16.01 03/15/2019
[ 2961.063493] pstate: 00400009 (nzcv daif +PAN -UAO)
[ 2961.068274] pc : blk_mq_start_request+0x1a8/0x1b0
[ 2961.072967] lr : blk_mq_start_request+0x34/0x1b0
[ 2961.077572] sp : ffff0022ff5e71d0
[ 2961.080875] x29: ffff0022ff5e71d0 x28: ffff00232b8c6028
[ 2961.086176] x27: ffff00232b1ad6f0 x26: ffff00232b1addd0
[ 2961.091478] x25: ffff0022ff5e7300 x24: ffff00232b1ad500
[ 2961.096778] x23: ffff00232b8c6118 x22: ffff00232b379200
[ 2961.102079] x21: ffff00232b8c6010 x20: ffff00232b1d0998
[ 2961.107380] x19: ffff00232b8c6000 x18: 0000000000000000
[ 2961.112681] x17: 0000000000000000 x16: 0000000000000000
[ 2961.117981] x15: 0000000000000000 x14: 0000000000000000
[ 2961.123281] x13: 0000000000000006 x12: 1fffe00465718c56
[ 2961.128582] x11: ffff800465718c56 x10: dfffa00000000000
[ 2961.133883] x9 : 0000000000000000 x8 : ffff00232b8c62b8
[ 2961.139183] x7 : 0000000000000000 x6 : 000000000000003f
[ 2961.144484] x5 : 1fffe00465718c40 x4 : ffff00232b379228
[ 2961.149786] x3 : 0000000000000000 x2 : dfffa00000000000
[ 2961.155087] x1 : 0000000000000007 x0 : 000000000000000a
[ 2961.160387] Call trace:
[ 2961.162825]  blk_mq_start_request+0x1a8/0x1b0
[ 2961.167173]  scsi_queue_rq+0x808/0xf38
[ 2961.170911]  __blk_mq_try_issue_directly+0x238/0x310
[ 2961.175865]  blk_mq_request_issue_directly+0xb4/0xf8
[ 2961.180819]  blk_mq_try_issue_list_directly+0xa8/0x160
[ 2961.185947]  blk_mq_sched_insert_requests+0x2e4/0x390
[ 2961.190987]  blk_mq_flush_plug_list+0x254/0x2e0
[ 2961.195508]  blk_flush_plug_list+0x1d0/0x218
[ 2961.199767]  blk_finish_plug+0x48/0x240
[ 2961.203593]  blkdev_direct_IO+0x71c/0x7b0
[ 2961.207593]  generic_file_read_iter+0x16c/0xf28
[ 2961.212111]  blkdev_read_iter+0x80/0xb0
[ 2961.227500]  el0_svc_common.constprop.1+0xa4/0x1d8
[ 2961.232281]  el0_svc_handler+0x34/0xb0
[ 2961.236019]  el0_svc+0x8/0xc
[ 2961.238888] ---[ end trace aa13b886f74a98be ]---
[ 2961.465018] blk_mq_hctx_notify_dead cpu6 dead 
request_queue=0xffff00232b1d5658 (id=19) hctx=0xffff00232abf9200 
hctx->cpumask=6-11 cpu_online_mask=0-5,13-95
[ 2961.478967] blk_mq_hctx_notify_dead cpu6 dead 
request_queue=0xffff00232b1d4328 (id=17) hctx=0xffff00232aa11200 
hctx->cpumask=6-11 cpu_online_mask=0-5,13-95

But this did not seem to trigger my timeout.

John

> 
> Thanks,
> Ming
> 
> .



* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-10-14  1:50 [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug Ming Lei
                   ` (5 preceding siblings ...)
  2019-10-16  8:58 ` [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug John Garry
@ 2019-11-28  1:09 ` chenxiang (M)
  2019-11-28  2:02   ` Ming Lei
  6 siblings, 1 reply; 32+ messages in thread
From: chenxiang (M) @ 2019-11-28  1:09 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

Hi,

On 2019/10/14 9:50, Ming Lei wrote:
> Hi,
>
> Thomas mentioned:
>      "
>       That was the constraint of managed interrupts from the very beginning:
>      
>        The driver/subsystem has to quiesce the interrupt line and the associated
>        queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>        until it's restarted by the core when the CPU is plugged in again.
>      "
>
> But no driver or blk-mq does that before one hctx becomes dead (all
> CPUs for one hctx are offline), and even worse, blk-mq still tries
> to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().
>
> This patchset tries to address the issue by two stages:
>
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
>
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
>
> Please comment & review, thanks!
>
> John, I don't add your tested-by tag since there have been some changes
> since V3, and I'd appreciate it if you could rerun your test.

I tested this patchset with John's testcase; apart from a dump_stack() 
in __blk_mq_run_hw_queue() that sometimes occurs but doesn't affect 
functionality, it solves the CPU hotplug issue, so add tested-by 
for this patchset:

Tested-by: Xiang Chen <chenxiang66@hisilicon.com>

>
> V4:
> 	- resubmit IOs in dispatch list in case that this hctx is dead
>
> V3:
> 	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
> 	- fix patch 4 for avoiding potential deadlock, as found by Hannes
>
> V2:
> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> 	  them
> 	- address comments from John Garry and Minwoo
>
>
>
> Ming Lei (5):
>    blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>    blk-mq: prepare for draining IO when hctx's all CPUs are offline
>    blk-mq: stop to handle IO and drain IO before hctx becomes dead
>    blk-mq: re-submit IO in case that hctx is dead
>    blk-mq: handle requests dispatched from IO scheduler in case that hctx
>      is dead
>
>   block/blk-mq-debugfs.c     |   2 +
>   block/blk-mq-tag.c         |   2 +-
>   block/blk-mq-tag.h         |   2 +
>   block/blk-mq.c             | 137 ++++++++++++++++++++++++++++++++++---
>   block/blk-mq.h             |   3 +-
>   drivers/block/loop.c       |   2 +-
>   drivers/md/dm-rq.c         |   2 +-
>   include/linux/blk-mq.h     |   5 ++
>   include/linux/cpuhotplug.h |   1 +
>   9 files changed, 141 insertions(+), 15 deletions(-)
>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
>




* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-11-28  1:09 ` chenxiang (M)
@ 2019-11-28  2:02   ` Ming Lei
  2019-11-28 10:45     ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2019-11-28  2:02 UTC (permalink / raw)
  To: chenxiang (M)
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

On Thu, Nov 28, 2019 at 09:09:13AM +0800, chenxiang (M) wrote:
> Hi,
> 
> On 2019/10/14 9:50, Ming Lei wrote:
> > Hi,
> > 
> > Thomas mentioned:
> >      "
> >       That was the constraint of managed interrupts from the very beginning:
> >        The driver/subsystem has to quiesce the interrupt line and the associated
> >        queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >        until it's restarted by the core when the CPU is plugged in again.
> >      "
> > 
> > But no driver or blk-mq does that before one hctx becomes dead (all
> > CPUs for one hctx are offline), and even worse, blk-mq still tries
> > to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().
> > 
> > This patchset tries to address the issue by two stages:
> > 
> > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > 
> > - mark the hctx as internal stopped, and drain all in-flight requests
> > if the hctx is going to be dead.
> > 
> > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> > 
> > - steal bios from the request, and resubmit them via generic_make_request(),
> > then these IO will be mapped to other live hctx for dispatch
> > 
> > Please comment & review, thanks!
> > 
> > John, I don't add your tested-by tag since there have been some changes
> > since V3, and I'd appreciate it if you could rerun your test.
> 
> I tested this patchset with John's testcase; apart from a dump_stack()
> in __blk_mq_run_hw_queue() that sometimes occurs but doesn't affect
> functionality, it solves the CPU hotplug issue, so add tested-by for
> this patchset:
> 
> Tested-by: Xiang Chen <chenxiang66@hisilicon.com>

Thanks for your test.

I plan to post a new version for 5.6 cycle, and there is still some
small race window related with requeue to be covered.

Thanks,
Ming



* Re: [PATCH V4 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead
  2019-10-14  1:50 ` [PATCH V4 3/5] blk-mq: stop to handle IO and drain IO before hctx becomes dead Ming Lei
@ 2019-11-28  9:29   ` John Garry
  0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2019-11-28  9:29 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch, chenxiang

On 14/10/2019 02:50, Ming Lei wrote:
> 		break;
> +		msleep(5);
> +	}
> +}
> +
>   static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
>   {
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +
> +	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu) &&
> +	    (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) >=
> +	     nr_cpu_ids)) {
> +		set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
> +		blk_mq_hctx_drain_inflight_rqs(hctx);

Hi Ming,

> +        }

This goes a bit beyond a nit: The indentation needs to be fixed here.

Thanks,
John

>   	return 0;
>   }
>   
> @@ -2246,6 +2284,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>   	ctx



* Re: [PATCH V4 0/5] blk-mq: improvement on handling IO during CPU hotplug
  2019-11-28  2:02   ` Ming Lei
@ 2019-11-28 10:45     ` John Garry
  0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2019-11-28 10:45 UTC (permalink / raw)
  To: Ming Lei, chenxiang (M)
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch

On 28/11/2019 02:02, Ming Lei wrote:
> On Thu, Nov 28, 2019 at 09:09:13AM +0800, chenxiang (M) wrote:
>> Hi,
>>
>> On 2019/10/14 9:50, Ming Lei wrote:
>>> Hi,
>>>
>>> Thomas mentioned:
>>>       "
>>>        That was the constraint of managed interrupts from the very beginning:
>>>         The driver/subsystem has to quiesce the interrupt line and the associated
>>>         queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>>>         until it's restarted by the core when the CPU is plugged in again.
>>>       "
>>>
>>> But no driver or blk-mq does that before one hctx becomes dead (all
>>> CPUs for one hctx are offline), and even worse, blk-mq still tries
>>> to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().
>>>
>>> This patchset tries to address the issue by two stages:
>>>
>>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>>
>>> - mark the hctx as internal stopped, and drain all in-flight requests
>>> if the hctx is going to be dead.
>>>
>>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>>>
>>> - steal bios from the request, and resubmit them via generic_make_request(),
>>> then these IO will be mapped to other live hctx for dispatch
>>>
>>> Please comment & review, thanks!
>>>
>>> John, I don't add your tested-by tag since there have been some changes
>>> since V3, and I'd appreciate it if you could rerun your test.
>>
>> I tested this patchset with John's testcase; apart from a dump_stack()
>> in __blk_mq_run_hw_queue() that sometimes occurs but doesn't affect
>> functionality, it solves the CPU hotplug issue, so add tested-by for
>> this patchset:
>>
>> Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
> 
> Thanks for your test.

So I had to give up testing as my board experienced some SCSI timeouts 
even without hotplugging or this patchset applied.

FWIW, I did test NVMe successfully though.

> 
> I plan to post a new version for 5.6 cycle, and there is still some
> small race window related with requeue to be covered.
> 

thanks!

> Thanks,
> Ming
> 
> .
> 


