* [PATCH V5 0/6] blk-mq: improvement CPU hotplug
@ 2020-01-15 11:44 Ming Lei
  2020-01-15 11:44 ` [PATCH 1/6] blk-mq: add new state of BLK_MQ_S_INACTIVE Ming Lei
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: Ming Lei @ 2020-01-15 11:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

However, neither the drivers nor blk-mq does that before one hctx becomes
inactive (all CPUs of the hctx are offline). Even worse, blk-mq still tries
to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset addresses the issue in two stages:

1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as inactive (BLK_MQ_S_INACTIVE), and drain all in-flight
requests if the hctx is going to be dead.

2) re-submit IO from the CPUHP_BLK_MQ_DEAD handler after the hctx becomes dead

- steal bios from the requests and resubmit them via generic_make_request();
these IOs will then be mapped to another live hctx for dispatch (see the
simplified sketch below)
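
For illustration, a minimal sketch of the hot path of both stages, condensed
from patches 3 and 4 below (declarations, error handling and locking are
omitted):

    /* stage 1, CPUHP_AP_BLK_MQ_ONLINE teardown: @cpu is the hctx's last online CPU */
    set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
    while (blk_mq_tags_inflight_rqs(hctx->tags))
            msleep(5);                      /* drain all in-flight requests */

    /* stage 2, CPUHP_BLK_MQ_DEAD: the hctx is dead, re-submit what is still queued */
    bio_list_init(&list);
    blk_steal_bios(&list, rq);
    blk_mq_cleanup_rq(rq);
    blk_mq_end_request(rq, 0);              /* free the now-empty request */
    while ((bio = bio_list_pop(&list)))
            generic_make_request(bio);      /* gets mapped to a live hctx */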

Please comment & review, thanks!

V5:
	- rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
	- re-factor code for re-submit requests in cpu dead hotplug handler
	- address requeue corner case

V4:
	- resubmit IOs in dispatch list in case that this hctx is dead 

V3:
	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
	- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
	- patch4 & patch 5 in V1 have been merged to block tree, so remove
	  them
	- address comments from John Garry and Minwoo



Ming Lei (6):
  blk-mq: add new state of BLK_MQ_S_INACTIVE
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  blk-mq: re-submit IO in case that hctx is inactive
  blk-mq: handle requests dispatched from IO scheduler in case of
    inactive hctx
  block: deactivate hctx when all its CPUs are offline when running
    queue

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 172 +++++++++++++++++++++++++++++++------
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   6 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 163 insertions(+), 29 deletions(-)

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
-- 
2.20.1



* [PATCH 1/6] blk-mq: add new state of BLK_MQ_S_INACTIVE
  2020-01-15 11:44 [PATCH V5 0/6] blk-mq: improvement CPU hotplug Ming Lei
@ 2020-01-15 11:44 ` Ming Lei
  2020-01-15 11:44 ` [PATCH 2/6] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2020-01-15 11:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Keith Busch, John Garry

Add a new hw queue state, BLK_MQ_S_INACTIVE, which is set when all
CPUs of this hctx are offline.

This prepares for stopping the hw queue when all CPUs of this hctx
become offline.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 1 +
 block/blk-mq.h         | 3 ++-
 include/linux/blk-mq.h | 3 +++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..d6de4f7f38cb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index eaaca8fc1c28..fb5ea01f1366 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -167,7 +167,8 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
-	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	return test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
+		test_bit(BLK_MQ_S_INACTIVE, &hctx->state);
 }
 
 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 11cfd6470b1a..b669e776d4cb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -397,6 +397,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
-- 
2.20.1



* [PATCH 2/6] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-01-15 11:44 [PATCH V5 0/6] blk-mq: improvement CPU hotplug Ming Lei
  2020-01-15 11:44 ` [PATCH 1/6] blk-mq: add new state of BLK_MQ_S_INACTIVE Ming Lei
@ 2020-01-15 11:44 ` Ming Lei
  2020-01-15 11:44 ` [PATCH 3/6] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2020-01-15 11:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Most blk-mq drivers depend on the auto-affinity of managed IRQs to set
up their queue mapping. Thomas mentioned the following point[1]:

"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD
is a cpuhp state handled after the CPU is down, so there isn't any chance
to quiesce the hctx for blk-mq wrt. CPU hotplug.

Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop queues
and wait for completion of in-flight requests.

The following patch will stop the hw queue and wait for completion of
in-flight requests when one hctx is becoming dead. This may cause a
deadlock for some stacking blk-mq drivers, such as dm-rq and loop.

Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and set it for dm-rq and
loop, so we needn't wait for completion of in-flight requests from dm-rq
& loop, and the potential deadlock can be avoided.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
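
For a stacking driver, opting out simply means setting the new flag when it
allocates its tag set. A hypothetical driver 'foo' (mirroring the loop and
dm-rq hunks below) would do:

    foo->tag_set.ops = &foo_mq_ops;
    foo->tag_set.queue_depth = 128;
    foo->tag_set.numa_node = NUMA_NO_NODE;
    /* queues are not backed by managed IRQs, skip the online drain */
    foo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;

    err = blk_mq_alloc_tag_set(&foo->tag_set);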

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c     |  1 +
 block/blk-mq.c             | 13 +++++++++++++
 drivers/block/loop.c       |  2 +-
 drivers/md/dm-rq.c         |  2 +-
 include/linux/blk-mq.h     |  3 +++
 include/linux/cpuhotplug.h |  1 +
 6 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index d6de4f7f38cb..b62390918ca5 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -240,6 +240,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a12b1763508d..0f55bb92b16f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2244,6 +2244,11 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2280,6 +2285,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2339,6 +2347,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3585,6 +3596,8 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				NULL, blk_mq_hctx_notify_online);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 739b372a5112..651dadd9be12 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2012,7 +2012,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13b..5f1ff70ac029 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b669e776d4cb..ca2201435a48 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
 	 */
 	atomic_t		nr_active;
 
+	/** @cpuhp_online: List to store request if CPU is going to die */
+	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
 	struct hlist_node	cpuhp_dead;
 	/** @kobj: Kernel object for sysfs. */
@@ -388,6 +390,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index e51ee772b9f5..48b7cb04224a 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -149,6 +149,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.20.1



* [PATCH 3/6] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-01-15 11:44 [PATCH V5 0/6] blk-mq: improvement CPU hotplug Ming Lei
  2020-01-15 11:44 ` [PATCH 1/6] blk-mq: add new state of BLK_MQ_S_INACTIVE Ming Lei
  2020-01-15 11:44 ` [PATCH 2/6] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2020-01-15 11:44 ` Ming Lei
  2020-01-15 11:44 ` [PATCH 4/6] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2020-01-15 11:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

Before one CPU becomes offline, check if it is the last online CPU of the
hctx. If so, mark this hctx as inactive, and meanwhile wait for completion
of all in-flight IOs originated from this hctx. The added check calls
cpumask_next_and() twice to verify that the CPU going offline is both the
first and the last online CPU in hctx->cpumask.

This guarantees that there is no in-flight IO before shutting down the
managed IRQ line.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c |  2 +-
 block/blk-mq-tag.h |  2 ++
 block/blk-mq.c     | 40 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index fbacde454718..67eaf6c88a78 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -317,7 +317,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 15bc74acb57e..48c9d7e3a655 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -34,6 +34,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0f55bb92b16f..6f9d2f5e0b53 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2244,8 +2244,46 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
+{
+	unsigned *count = data;
+
+	if ((blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
+		(*count)++;
+
+	return true;
+}
+
+static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_tags *tags)
+{
+	unsigned count = 0;
+
+	blk_mq_all_tag_busy_iter(tags, blk_mq_count_inflight_rq, &count);
+
+	return count;
+}
+
+static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	while (1) {
+		if (!blk_mq_tags_inflight_rqs(hctx->tags))
+			break;
+		msleep(5);
+	}
+}
+
 static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 {
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu) &&
+			(cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask)
+			 >= nr_cpu_ids)) {
+		set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+		blk_mq_hctx_drain_inflight_rqs(hctx);
+        }
 	return 0;
 }
 
@@ -2272,6 +2310,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	}
 	spin_unlock(&ctx->lock);
 
+	clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+
 	if (list_empty(&tmp))
 		return 0;
 
-- 
2.20.1



* [PATCH 4/6] blk-mq: re-submit IO in case that hctx is inactive
  2020-01-15 11:44 [PATCH V5 0/6] blk-mq: improvement CPU hotplug Ming Lei
                   ` (2 preceding siblings ...)
  2020-01-15 11:44 ` [PATCH 3/6] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
@ 2020-01-15 11:44 ` Ming Lei
  2020-01-15 11:44 ` [PATCH 5/6] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx Ming Lei
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2020-01-15 11:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

When all CPUs of one hctx are offline and this hctx becomes inactive, we
shouldn't run this hw queue for completing requests any more.

So steal the bios from the request, resubmit them, and finally free the
request in blk_mq_hctx_notify_dead().

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 58 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 47 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6f9d2f5e0b53..3e52ba74661e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2287,10 +2287,34 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+static void blk_mq_resubmit_io(struct request *rq)
+{
+	struct bio_list list;
+	struct bio *bio;
+
+	bio_list_init(&list);
+	blk_steal_bios(&list, rq);
+
+	/*
+	 * Free the old empty request before submitting bio for avoiding
+	 * potential deadlock
+	 */
+	blk_mq_cleanup_rq(rq);
+	blk_mq_end_request(rq, 0);
+
+	while (true) {
+		bio = bio_list_pop(&list);
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * 'cpu' has gone away. If this hctx is inactive, we can't dispatch requests
+ * to the hctx any more, so steal the bios from this hctx's requests,
+ * re-submit them to the request queue, and finally free these requests.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
@@ -2310,16 +2334,28 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	}
 	spin_unlock(&ctx->lock);
 
-	clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
-
-	if (list_empty(&tmp))
-		return 0;
+	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
+		if (!list_empty(&tmp)) {
+			spin_lock(&hctx->lock);
+			list_splice_tail_init(&tmp, &hctx->dispatch);
+			spin_unlock(&hctx->lock);
+			blk_mq_run_hw_queue(hctx, true);
+		}
+	} else {
+		/* requests in the dispatch list have to be re-submitted too */
+		spin_lock(&hctx->lock);
+		list_splice_tail_init(&hctx->dispatch, &tmp);
+		spin_unlock(&hctx->lock);
 
-	spin_lock(&hctx->lock);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+		while (!list_empty(&tmp)) {
+			struct request *rq = list_entry(tmp.next,
+					struct request, queuelist);
+			list_del_init(&rq->queuelist);
+			blk_mq_resubmit_io(rq);
+		}
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	}
 
-	blk_mq_run_hw_queue(hctx, true);
 	return 0;
 }
 
-- 
2.20.1



* [PATCH 5/6] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx
  2020-01-15 11:44 [PATCH V5 0/6] blk-mq: improvement CPU hotplug Ming Lei
                   ` (3 preceding siblings ...)
  2020-01-15 11:44 ` [PATCH 4/6] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
@ 2020-01-15 11:44 ` Ming Lei
  2020-01-15 11:44 ` [PATCH 6/6] block: deactivate hctx when all its CPUs are offline when running queue Ming Lei
  2020-01-15 17:00 ` [PATCH V5 0/6] blk-mq: improvement CPU hotplug John Garry
  6 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2020-01-15 11:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

If one hctx becomes inactive when all its CPUs are offline, all in-queue
requests aimed at this hctx have to be re-submitted.

Re-submit requests from both the sw queue and the scheduler queue when the
hctx is found to be inactive.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 80 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 52 insertions(+), 28 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3e52ba74661e..93c835312d42 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2311,6 +2311,42 @@ static void blk_mq_resubmit_io(struct request *rq)
 	}
 }
 
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(sched_tmp);
+	LIST_HEAD(re_submit);
+	struct request *rq;
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (!e) {
+		blk_mq_flush_busy_ctxs(hctx, &re_submit);
+	} else {
+		while ((rq = e->type->ops.dispatch_request(hctx))) {
+			if (rq->mq_hctx != hctx)
+				list_add(&rq->queuelist, &sched_tmp);
+			else
+				list_add(&rq->queuelist, &re_submit);
+		}
+	}
+	while (!list_empty(&sched_tmp)) {
+		rq = list_entry(sched_tmp.next, struct request,
+				queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_sched_insert_request(rq, true, true, true);
+	}
+
+	/* requests in the dispatch list have to be re-submitted too */
+	spin_lock(&hctx->lock);
+	list_splice_tail_init(&hctx->dispatch, &re_submit);
+	spin_unlock(&hctx->lock);
+
+	while (!list_empty(&re_submit)) {
+		rq = list_entry(re_submit.next, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_resubmit_io(rq);
+	}
+}
+
 /*
  * 'cpu' has gone away. If this hctx is inactive, we can't dispatch requests
  * to the hctx any more, so steal the bios from this hctx's requests,
@@ -2318,42 +2354,30 @@ static void blk_mq_resubmit_io(struct request *rq)
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
-	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx;
-	LIST_HEAD(tmp);
-	enum hctx_type type;
-
-	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
-	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
-	type = hctx->type;
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_dead);
 
-	spin_lock(&ctx->lock);
-	if (!list_empty(&ctx->rq_lists[type])) {
-		list_splice_init(&ctx->rq_lists[type], &tmp);
-		blk_mq_hctx_clear_pending(hctx, ctx);
-	}
-	spin_unlock(&ctx->lock);
+	if (test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
+		blk_mq_hctx_deactivate(hctx);
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	} else if (!hctx->queue->elevator) {
+		struct blk_mq_ctx *ctx = __blk_mq_get_ctx(hctx->queue, cpu);
+		enum hctx_type type = hctx->type;
+		LIST_HEAD(tmp);
+
+		spin_lock(&ctx->lock);
+		if (!list_empty(&ctx->rq_lists[type])) {
+			list_splice_init(&ctx->rq_lists[type], &tmp);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+		}
+		spin_unlock(&ctx->lock);
 
-	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
 		if (!list_empty(&tmp)) {
 			spin_lock(&hctx->lock);
 			list_splice_tail_init(&tmp, &hctx->dispatch);
 			spin_unlock(&hctx->lock);
 			blk_mq_run_hw_queue(hctx, true);
 		}
-	} else {
-	/* requests in the dispatch list have to be re-submitted too */
-		spin_lock(&hctx->lock);
-		list_splice_tail_init(&hctx->dispatch, &tmp);
-		spin_unlock(&hctx->lock);
-
-		while (!list_empty(&tmp)) {
-			struct request *rq = list_entry(tmp.next,
-					struct request, queuelist);
-			list_del_init(&rq->queuelist);
-			blk_mq_resubmit_io(rq);
-		}
-		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
 	}
 
 	return 0;
-- 
2.20.1



* [PATCH 6/6] block: deactivate hctx when all its CPUs are offline when running queue
  2020-01-15 11:44 [PATCH V5 0/6] blk-mq: improvement CPU hotplug Ming Lei
                   ` (4 preceding siblings ...)
  2020-01-15 11:44 ` [PATCH 5/6] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx Ming Lei
@ 2020-01-15 11:44 ` Ming Lei
  2020-01-15 17:00 ` [PATCH V5 0/6] blk-mq: improvement CPU hotplug John Garry
  6 siblings, 0 replies; 26+ messages in thread
From: Ming Lei @ 2020-01-15 11:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch

When we select the next CPU for running a hw queue, all of this hctx's
CPUs may have become offline, so deactivate this hctx at that time.

This fixes corner cases which can't be covered by deactivating the hctx
in the CPU hotplug handler, such as a request being requeued during CPU
hotplug, where the handler can't find the requeued requests and
re-submit them.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 93c835312d42..fada556880ca 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -43,6 +43,8 @@
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx);
+
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
 	int ddir, sectors, bucket;
@@ -1431,7 +1433,7 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 		 */
 		hctx->next_cpu = next_cpu;
 		hctx->next_cpu_batch = 1;
-		return WORK_CPU_UNBOUND;
+		return -1;
 	}
 
 	hctx->next_cpu = next_cpu;
@@ -1450,6 +1452,8 @@ static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
 					unsigned long msecs)
 {
+	int cpu;
+
 	if (unlikely(blk_mq_hctx_stopped(hctx)))
 		return;
 
@@ -1464,8 +1468,13 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
 		put_cpu();
 	}
 
-	kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
-				    msecs_to_jiffies(msecs));
+	cpu = blk_mq_hctx_next_cpu(hctx);
+
+	if (likely(cpu >= 0))
+		kblockd_mod_delayed_work_on(cpu, &hctx->run_work,
+				msecs_to_jiffies(msecs));
+	else
+		blk_mq_hctx_deactivate(hctx);
 }
 
 /**
-- 
2.20.1



* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-01-15 11:44 [PATCH V5 0/6] blk-mq: improvement CPU hotplug Ming Lei
                   ` (5 preceding siblings ...)
  2020-01-15 11:44 ` [PATCH 6/6] block: deactivate hctx when all its CPUs are offline when running queue Ming Lei
@ 2020-01-15 17:00 ` John Garry
  2020-01-20 13:23   ` John Garry
  6 siblings, 1 reply; 26+ messages in thread
From: John Garry @ 2020-01-15 17:00 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch, chenxiang

On 15/01/2020 11:44, Ming Lei wrote:
> Hi,
> 
> Thomas mentioned:
>      "
>       That was the constraint of managed interrupts from the very beginning:
>      
>        The driver/subsystem has to quiesce the interrupt line and the associated
>        queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>        until it's restarted by the core when the CPU is plugged in again.
>      "
> 
> But no drivers or blk-mq do that before one hctx becomes inactive(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> 
> This patchset tries to address the issue by two stages:
> 
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> 
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
> 
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> 
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
> 
> Please comment & review, thanks!
> 
> V5:
> 	- rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
> 	- re-factor code for re-submit requests in cpu dead hotplug handler
> 	- address requeue corner case
> 
> V4:
> 	- resubmit IOs in dispatch list in case that this hctx is dead
> 
> V3:
> 	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
> 	- fix patch 4 for avoiding potential deadlock, as found by Hannes
> 
> V2:
> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> 	  them
> 	- address comments from John Garry and Minwoo
> 
> 

thanks Ming, we'll have a look at this new series and test it again.

John

> 
> Ming Lei (6):
>    blk-mq: add new state of BLK_MQ_S_INACTIVE
>    blk-mq: prepare for draining IO when hctx's all CPUs are offline
>    blk-mq: stop to handle IO and drain IO before hctx becomes inactive
>    blk-mq: re-submit IO in case that hctx is inactive
>    blk-mq: handle requests dispatched from IO scheduler in case of
>      inactive hctx
>    block: deactivate hctx when all its CPUs are offline when running
>      queue
> 
>   block/blk-mq-debugfs.c     |   2 +
>   block/blk-mq-tag.c         |   2 +-
>   block/blk-mq-tag.h         |   2 +
>   block/blk-mq.c             | 172 +++++++++++++++++++++++++++++++------
>   block/blk-mq.h             |   3 +-
>   drivers/block/loop.c       |   2 +-
>   drivers/md/dm-rq.c         |   2 +-
>   include/linux/blk-mq.h     |   6 ++
>   include/linux/cpuhotplug.h |   1 +
>   9 files changed, 163 insertions(+), 29 deletions(-)
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
> 



* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-01-15 17:00 ` [PATCH V5 0/6] blk-mq: improvement CPU hotplug John Garry
@ 2020-01-20 13:23   ` John Garry
  2020-01-31 10:04     ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: John Garry @ 2020-01-20 13:23 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Keith Busch, chenxiang

On 15/01/2020 17:00, John Garry wrote:
> On 15/01/2020 11:44, Ming Lei wrote:
>> Hi,
>>
>> Thomas mentioned:
>>      "
>>       That was the constraint of managed interrupts from the very 
>> beginning:
>>        The driver/subsystem has to quiesce the interrupt line and the 
>> associated
>>        queue _before_ it gets shutdown in CPU unplug and not fiddle 
>> with it
>>        until it's restarted by the core when the CPU is plugged in again.
>>      "
>>
>> But no drivers or blk-mq do that before one hctx becomes inactive(all
>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>>
>> This patchset tries to address the issue by two stages:
>>
>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>
>> - mark the hctx as internal stopped, and drain all in-flight requests
>> if the hctx is going to be dead.
>>
>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx 
>> becomes dead
>>
>> - steal bios from the request, and resubmit them via 
>> generic_make_request(),
>> then these IO will be mapped to other live hctx for dispatch
>>
>> Please comment & review, thanks!
>>
>> V5:
>>     - rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
>>     - re-factor code for re-submit requests in cpu dead hotplug handler
>>     - address requeue corner case
>>
>> V4:
>>     - resubmit IOs in dispatch list in case that this hctx is dead
>>
>> V3:
>>     - re-organize patch 2 & 3 a bit for addressing Hannes's comment
>>     - fix patch 4 for avoiding potential deadlock, as found by Hannes
>>
>> V2:
>>     - patch4 & patch 5 in V1 have been merged to block tree, so remove
>>       them
>>     - address comments from John Garry and Minwoo
>>
>>

Hi Ming,

Colleague Xiang Chen found this:

Starting 40 processes
Jobs: 40 (f=40)
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [8.3% done] 
[2899MB/0KB/0KB /s] [742K/0/0 iops] [eta 00m:33s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [11.1% done] 
[2892MB/0KB/0KB /s] [740K/0/0 iops] [eta 00m:32s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [13.9% done] 
[2887MB/0KB/0KB /s] [739K/0/0 iops] [eta 00m:31s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [16.7% done] 
[2850MB/0KB/0KB /s] [730K/0/0 iops] [eta 00m:30s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [19.4% done] 
[2834MB/0KB/0KB /s] [726K/0/0 iops] [eta 00m:29s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [22.2% done] 
[2827MB/0KB/0KB /s] [724K/0/0 iops] [eta 00m:28s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [25.0% done] 
[2804MB/0KB/0KB /s] [718K/0/0 iops] [eta 00m:27s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [27.8% done] 
[2783MB/0KB/0KB /s] [713K/0/0 iops] [eta 00m:26s]
[  112.990292] CPU16: shutdown
[  112.993113] psci: CPU16 killed (polled 0 ms)
[  113.029429] CPU15: shutdown
[  113.032245] psci: CPU15 killed (polled 0 ms)
[  113.073649] CPU14: shutdown
[  113.076461] psci: CPU14 killed (polled 0 ms)
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR[  113.121258] CPU13: 
shutdown
RRRRRRRRRR] [30.[  113.124247] psci: CPU13 killed (polled 0 ms)
6% done] [2772MB/0KB/0KB /s] [710K/0/0 iops] [eta 00m:25s]
[  113.177108] IRQ 600: no longer affine to CPU12
[  113.183185] CPU12: shutdown
[  113.186001] psci: CPU12 killed (polled 0 ms)
[  113.242555] CPU11: shutdown
[  113.245368] psci: CPU11 killed (polled 0 ms)
[  113.302598] CPU10: shutdown
[  113.305420] psci: CPU10 killed (polled 0 ms)
[  113.365757] CPU9: shutdown
[  113.368489] psci: CPU9 killed (polled 0 ms)
[  113.611904] IRQ 599: no longer affine to CPU8
[  113.620611] CPU8: shutdown
[  113.623340] psci: CPU8 killed (polled 0 ms)
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [33.3% done] 
[2765MB/0KB/0KB /s] [708K/0/0 iops] [eta 00m:24s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [36.1% done] 
[2767MB/0KB/0KB /s] [708K/0/0 iops] [eta 00m:23s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [38.9% done] 
[2771MB/0KB/0KB /s] [709K/0/0 iops] [eta 00m:22s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [41.7% done] 
[2780MB/0KB/0KB /s] [712K/0/0 iops] [eta 00m:21s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [44.4% done] 
[2731MB/0KB/0KB /s] [699K/0/0 iops] [eta 00m:20s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [47.2% done] 
[2757MB/0KB/0KB /s] [706K/0/0 iops] [eta 00m:19s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [50.0% done] 
[2764MB/0KB/0KB /s] [708K/0/0 iops] [eta 00m:18s]
[  120.762141] ------------[ cut here ]------------
[  120.766760] WARNING: CPU: 0 PID: 10 at block/blk.h:299 
generic_make_request_checks+0x480/0x4e8
[  120.775348] Modules linked in:
[  120.778397] CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.5.0-

Then this:

1158048319d:18h:40m:34s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158049605d:23h:03m:30s]
[  141.915198] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  141.921280] rcu: 	96-...0: (4 GPs behind) idle=af6/0/0x1 
softirq=90/90 fqs=2626
[  141.928658] rcu: 	99-...0: (2 GPs behind) 
idle=b82/1/0x4000000000000000 softirq=82/82 fqs=2627
[  141.937339] 	(detected by 17, t=5254 jiffies, g=3925, q=545)
[  141.942985] Task dump for CPU 96:
[  141.946292] swapper/96      R  running task        0     0      1 
0x0000002a
[  141.953321] Call trace:
[  141.955763]  __switch_to+0xbc/0x218
[  141.959244]  0x0
[  141.961079] Task dump for CPU 99:
[  141.964385] kworker/99:1H   R  running task        0  3473      2 
0x0000022a
[  141.971417] Workqueue: kblockd blk_mq_run_work_fn
[  141.976109] Call trace:
[  141.978550]  __switch_to+0xbc/0x218
[  141.982029]  blk_mq_run_work_fn+0x1c/0x28
[  141.986027]  process_one_work+0x1e0/0x358
[  141.990025]  worker_thread+0x40/0x488
[  141.993678]  kthread+0x118/0x120
[  141.996897]  ret_from_fork+0x10/0x18


Fuller log at bottom.

We're basing the patchset on v5.5-rc6:
https://github.com/hisilicon/kernel-dev/tree/private-topic-sas-5.5-mq-debug-improve-cpu-hotplug-v5 
(we have not tested my latest patch there, but I doubt that this is the 
problem).

Not sure if something is missing which is in -next.

Cheers,
John



  uptime:	0 days, 0 hours
Usage on /:	8%		Swap usage:	0.0%
IP Addresses:	192.168.1.164

Euler:~ # export TMOUT=0
Euler:~ # fdisk -l | grep sd | wc -l
7
Euler:~ # ps -aux | grep irqbalance
root       3307  0.0  0.0 106076  1796 ttyS0    S+   05:58   0:00 grep 
--color=auto irqbalance
Euler:~ # echo 15 > /proc/sys/kernel/printk
Euler:~ #
Euler:~ #
Euler:~ # ./hotplug_sas_d06.sh
Looping ... number 0
Creat 4k_read_depth2048_fiotest file sucessfully
cp: cannot stat ‘liba*’: No such file or directory
chmod: cannot access ‘fio’: No such file or directory
job1: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=2048
...
job1: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=2048
...
job1: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=2048
...
job1: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=2048
...
fio-2.1.10
Starting 40 processes
Jobs: 40 (f=40)
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [8.3% done] 
[2899MB/0KB/0KB /s] [742K/0/0 iops] [eta 00m:33s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [11.1% done] 
[2892MB/0KB/0KB /s] [740K/0/0 iops] [eta 00m:32s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [13.9% done] 
[2887MB/0KB/0KB /s] [739K/0/0 iops] [eta 00m:31s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [16.7% done] 
[2850MB/0KB/0KB /s] [730K/0/0 iops] [eta 00m:30s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [19.4% done] 
[2834MB/0KB/0KB /s] [726K/0/0 iops] [eta 00m:29s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [22.2% done] 
[2827MB/0KB/0KB /s] [724K/0/0 iops] [eta 00m:28s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [25.0% done] 
[2804MB/0KB/0KB /s] [718K/0/0 iops] [eta 00m:27s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [27.8% done] 
[2783MB/0KB/0KB /s] [713K/0/0 iops] [eta 00m:26s]
[  112.990292] CPU16: shutdown
[  112.993113] psci: CPU16 killed (polled 0 ms)
[  113.029429] CPU15: shutdown
[  113.032245] psci: CPU15 killed (polled 0 ms)
[  113.073649] CPU14: shutdown
[  113.076461] psci: CPU14 killed (polled 0 ms)
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR[  113.121258] CPU13: 
shutdown
RRRRRRRRRR] [30.[  113.124247] psci: CPU13 killed (polled 0 ms)
6% done] [2772MB/0KB/0KB /s] [710K/0/0 iops] [eta 00m:25s]
[  113.177108] IRQ 600: no longer affine to CPU12
[  113.183185] CPU12: shutdown
[  113.186001] psci: CPU12 killed (polled 0 ms)
[  113.242555] CPU11: shutdown
[  113.245368] psci: CPU11 killed (polled 0 ms)
[  113.302598] CPU10: shutdown
[  113.305420] psci: CPU10 killed (polled 0 ms)
[  113.365757] CPU9: shutdown
[  113.368489] psci: CPU9 killed (polled 0 ms)
[  113.611904] IRQ 599: no longer affine to CPU8
[  113.620611] CPU8: shutdown
[  113.623340] psci: CPU8 killed (polled 0 ms)
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [33.3% done] 
[2765MB/0KB/0KB /s] [708K/0/0 iops] [eta 00m:24s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [36.1% done] 
[2767MB/0KB/0KB /s] [708K/0/0 iops] [eta 00m:23s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [38.9% done] 
[2771MB/0KB/0KB /s] [709K/0/0 iops] [eta 00m:22s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [41.7% done] 
[2780MB/0KB/0KB /s] [712K/0/0 iops] [eta 00m:21s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [44.4% done] 
[2731MB/0KB/0KB /s] [699K/0/0 iops] [eta 00m:20s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [47.2% done] 
[2757MB/0KB/0KB /s] [706K/0/0 iops] [eta 00m:19s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [50.0% done] 
[2764MB/0KB/0KB /s] [708K/0/0 iops] [eta 00m:18s]
[  120.762141] ------------[ cut here ]------------
[  120.766760] WARNING: CPU: 0 PID: 10 at block/blk.h:299 
generic_make_request_checks+0x480/0x4e8
[  120.775348] Modules linked in:
[  120.778397] CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 
5.5.0-rc6-31756-gdddc331 #333
[  120.786282] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 
2280-V2 CS V3.B160.02 01/15/2020
[  120.795129] pstate: 80c00089 (Nzcv daIf +PAN +UAO)
[  120.799907] pc : generic_make_request_checks+0x480/0x4e8
[  120.805205] lr : generic_make_request_checks+0x48/0x4e8
[  120.810415] sp : ffff80001052b700
[  120.813720] x29: ffff80001052b700 x28: 0000000000000dd0
[  120.819017] x27: ffff00277b6c0d00 x26: ffff00277863ea18
[  120.824313] x25: 0000000000000000 x24: ffff00277b6c0d6c
[  120.829612] x23: 0000000000000008 x22: ffff0027d66c7800
[  120.834909] x21: ffff2027c7133790 x20: ffffc8fcee7c9000
[  120.840205] x19: ffff00277863ea18 x18: 000000000000000e
[  120.845502] x17: 0000000000000007 x16: 0000000000000001
[  120.850801] x15: 0000000000000019 x14: 0000000000000033
[  120.856098] x13: 0000000000000001 x12: 00000000015cd30a
[  120.861394] x11: 0000000000000001 x10: 0000000000000040
[  120.866692] x9 : ffffc8fcee7df550 x8 : ffffc8fcee3fb018
[  120.871990] x7 : ffff0027d2010010 x6 : 00000006711b402b
[  120.877287] x5 : 00000000000000c0 x4 : 0000000000000006
[  120.882585] x3 : ffffffffffffffff x2 : 00000000ffffffff
[  120.887882] x1 : 0000000000000080 x0 : 0000000000000080
[  120.893181] Call trace:
[  120.895621]  generic_make_request_checks+0x480/0x4e8
[  120.900571]  generic_make_request+0x2c/0x320
[  120.904832]  blk_mq_hctx_deactivate+0x1a0/0x2e8
[  120.909352]  __blk_mq_delay_run_hw_queue+0x220/0x230
[  120.914303]  blk_mq_run_hw_queue+0xa0/0x110
[  120.918475]  blk_mq_run_hw_queues+0x48/0x70
[  120.922649]  scsi_end_request+0x1d8/0x230
[  120.926648]  scsi_io_completion+0x70/0x520
[  120.930735]  scsi_finish_command+0xe0/0xf0
[  120.934822]  scsi_softirq_done+0x80/0x190
[  120.938821]  blk_mq_complete_request+0xc4/0x160
[  120.943339]  scsi_mq_done+0x70/0xc8
[  120.946821]  ata_scsi_qc_complete+0xa0/0x3f0
[  120.951081]  __ata_qc_complete+0xa4/0x150
[  120.955079]  ata_qc_complete+0xe8/0x200
[  120.958906]  sas_ata_task_done+0x194/0x2d0
[  120.962992]  cq_tasklet_v3_hw+0x12c/0x5e8
[  120.966991]  tasklet_action_common.isra.15+0x11c/0x190
[  120.972115]  tasklet_action+0x24/0x30
[  120.975767]  efi_header_end+0x114/0x234
[  120.979592]  run_ksoftirqd+0x38/0x48
[  120.983159]  smpboot_thread_fn+0x16c/0x270
[  120.987245]  kthread+0x118/0x120
[  120.990465]  ret_from_fork+0x10/0x18
[  120.994031] ---[ end trace 0125724e2f3f2476 ]---
[  120.998643] ------------[ cut here ]------------
[  121.003250] WARNING: CPU: 0 PID: 10 at block/blk-mq.c:1365 
__blk_mq_run_hw_queue+0xdc/0x128
[  121.011576] Modules linked in:
[  121.014623] CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G        W 
    5.5.0-rc6-31756-gdddc331 #333
[  121.023902] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 
2280-V2 CS V3.B160.02 01/15/2020
[  121.032749] pstate: 00c00089 (nzcv daIf +PAN +UAO)
[  121.037526] pc : __blk_mq_run_hw_queue+0xdc/0x128
[  121.042217] lr : __blk_mq_delay_run_hw_queue+0x1ac/0x230
[  121.047515] sp : ffff80001052b540
[  121.050820] x29: ffff80001052b540 x28: 0000000000000dd0
[  121.056118] x27: ffff00277b6c0d00 x26: ffff00277863ea18
[  121.061416] x25: 000000000000000c x24: 000000000000000a
[  121.066714] x23: 0000000000000001 x22: 0000000000000000
[  121.072012] x21: ffffc8fcee7c9000 x20: 0000000000000000
[  121.077310] x19: ffff0027cb7f8000 x18: 000000000000000e
[  121.082608] x17: 0000000000000007 x16: 0000000000000001
[  121.087905] x15: 0000000000000019 x14: 0000000000000033
[  121.093203] x13: 0000000000000001 x12: 00000000015cd30a
[  121.098501] x11: 0000000000000001 x10: 0000000000000040
[  121.103799] x9 : 00000000ffff4840 x8 : ffffc8fcee3fb018
[  121.109097] x7 : ffffc8fcee7c98e8 x6 : 0000000000000000
[  121.114395] x5 : ffffc8fcee7c98c8 x4 : 0000000000000000
[  121.119692] x3 : 0000000000000000 x2 : ffffc8fcee7c98c8
[  121.124990] x1 : 00000000000000ff x0 : 0000000000000103
[  121.130288] Call trace:
[  121.132729]  __blk_mq_run_hw_queue+0xdc/0x128
[  121.137073]  __blk_mq_delay_run_hw_queue+0x1ac/0x230
[  121.142025]  blk_mq_run_hw_queue+0xa0/0x110
[  121.146196]  blk_mq_request_bypass_insert+0x54/0x68
[  121.151060]  __blk_mq_try_issue_directly+0x60/0x208
[  121.155926]  blk_mq_try_issue_directly+0x50/0xb8
[  121.160529]  blk_mq_make_request+0x3d0/0x420
[  121.164787]  generic_make_request+0xf4/0x320
[  121.169046]  blk_mq_hctx_deactivate+0x1a0/0x2e8
[  121.173566]  __blk_mq_delay_run_hw_queue+0x220/0x230
[  121.178517]  blk_mq_run_hw_queue+0xa0/0x110
[  121.182688]  blk_mq_run_hw_queues+0x48/0x70
[  121.186860]  scsi_end_request+0x1d8/0x230
[  121.190858]  scsi_io_completion+0x70/0x520
[  121.194944]  scsi_finish_command+0xe0/0xf0
[  121.199029]  scsi_softirq_done+0x80/0x190
[  121.203027]  blk_mq_complete_request+0xc4/0x160
[  121.207545]  scsi_mq_done+0x70/0xc8
[  121.211024]  ata_scsi_qc_complete+0xa0/0x3f0
[  121.215283]  __ata_qc_complete+0xa4/0x150
[  121.219281]  ata_qc_complete+0xe8/0x200
[  121.223106]  sas_ata_task_done+0x194/0x2d0
[  121.227191]  cq_tasklet_v3_hw+0x12c/0x5e8
[  121.231190]  tasklet_action_common.isra.15+0x11c/0x190
[  121.236312]  tasklet_action+0x24/0x30
[  121.239965]  efi_header_end+0x114/0x234
[  121.243790]  run_ksoftirqd+0x38/0x48
[  121.247356]  smpboot_thread_fn+0x16c/0x270
[  121.251441]  kthread+0x118/0x120
[  121.254661]  ret_from_fork+0x10/0x18
[  121.258226] ---[ end trace 0125724e2f3f2477 ]---
[  121.267999] BUG: scheduling while atomic: ksoftirqd/0/10/0x00000103
[  121.274248] Modules linked in:
[  121.277295] CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G        W 
    5.5.0-rc6-31756-gdddc331 #333
[  121.286574] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 
2280-V2 CS V3.B160.02 01/15/2020
[  121.295420] Call trace:
[  121.297861]  dump_backtrace+0x0/0x1a0
[  121.301513]  show_stack+0x14/0x20
[  121.304822]  dump_stack+0xbc/0x104
[  121.308216]  __schedule_bug+0x50/0x70
[  121.311869]  __schedule+0x534/0x598
[  121.315349]  schedule+0x74/0x100
[  121.318569]  io_schedule+0x18/0x98
[  121.321963]  blk_mq_get_tag+0x150/0x300
[  121.325789]  blk_mq_get_request+0x120/0x3a8
[  121.329961]  blk_mq_make_request+0xf8/0x420
[  121.334134]  generic_make_request+0xf4/0x320
[  121.338392]  blk_mq_hctx_deactivate+0x1a0/0x2e8
[  121.342910]  __blk_mq_delay_run_hw_queue+0x220/0x230
[  121.347861]  blk_mq_run_hw_queue+0xa0/0x110
[  121.352034]  blk_mq_run_hw_queues+0x48/0x70
[  121.356205]  scsi_end_request+0x1d8/0x230
[  121.360204]  scsi_io_completion+0x70/0x520
[  121.364289]  scsi_finish_command+0xe0/0xf0
[  121.368373]  scsi_softirq_done+0x80/0x190
[  121.372372]  blk_mq_complete_request+0xc4/0x160
[  121.376891]  scsi_mq_done+0x70/0xc8
[  121.380371]  ata_scsi_qc_complete+0xa0/0x3f0
[  121.384629]  __ata_qc_complete+0xa4/0x150
[  121.388629]  ata_qc_complete+0xe8/0x200
[  121.392453]  sas_ata_task_done+0x194/0x2d0
[  121.396540]  cq_tasklet_v3_hw+0x12c/0x5e8
[  121.400540]  tasklet_action_common.isra.15+0x11c/0x190
[  121.405664]  tasklet_action+0x24/0x30
[  121.409316]  efi_header_end+0x114/0x234
[  121.413141]  run_ksoftirqd+0x38/0x48
[  121.416706]  smpboot_thread_fn+0x16c/0x270
[  121.420792]  kthread+0x118/0x120
[  121.424013]  ret_from_fork+0x10/0x18
Jobs: 40 (f=40):[  121.427627] NOHZ: local_softirq_pending 2c2
  [RRRRRRRRRRRRRR[  121.433311] NOHZ: local_softirq_pending 40
RRRRRRRRRRRRRRRR[  121.438658] NOHZ: local_softirq_pending 40
RRRRRRRRRR] [52.[  121.444103] NOHZ: local_softirq_pending 40
8% done] [1676MB[  121.449564] NOHZ: local_softirq_pending 40
/0KB/0KB /s] [42[  121.455028] NOHZ: local_softirq_pending 40
9K/0/0 iops] [et[  121.460491] NOHZ: local_softirq_pending 40
a 00m:17s]
[  121.465948] NOHZ: local_softirq_pending 40
[  121.470963] NOHZ: local_softirq_pending 40
[  121.475055] NOHZ: local_softirq_pending 40
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [55.6% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:16s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [58.3% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:15s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [61.1% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:14s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [63.9% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:13s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [66.7% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:12s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [69.4% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:11s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [72.2% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:10s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [75.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:09s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [77.8% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:08s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [80.6% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:07s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [83.3% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:06s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [86.1% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:05s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [88.9% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:04s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [91.7% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:03s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [94.4% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:02s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [97.2% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:01s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [100.0% 
done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 00m:00s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158047033d:14h:17m:38s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158048319d:18h:40m:34s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158049605d:23h:03m:30s]
[  141.915198] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  141.921280] rcu: 	96-...0: (4 GPs behind) idle=af6/0/0x1 
softirq=90/90 fqs=2626
[  141.928658] rcu: 	99-...0: (2 GPs behind) 
idle=b82/1/0x4000000000000000 softirq=82/82 fqs=2627
[  141.937339] 	(detected by 17, t=5254 jiffies, g=3925, q=545)
[  141.942985] Task dump for CPU 96:
[  141.946292] swapper/96      R  running task        0     0      1 
0x0000002a
[  141.953321] Call trace:
[  141.955763]  __switch_to+0xbc/0x218
[  141.959244]  0x0
[  141.961079] Task dump for CPU 99:
[  141.964385] kworker/99:1H   R  running task        0  3473      2 
0x0000022a
[  141.971417] Workqueue: kblockd blk_mq_run_work_fn
[  141.976109] Call trace:
[  141.978550]  __switch_to+0xbc/0x218
[  141.982029]  blk_mq_run_work_fn+0x1c/0x28
[  141.986027]  process_one_work+0x1e0/0x358
[  141.990025]  worker_thread+0x40/0x488
[  141.993678]  kthread+0x118/0x120
[  141.996897]  ret_from_fork+0x10/0x18
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158042478d:14h:26m:49s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158043549d:01h:15m:55s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158044619d:12h:05m:01s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158045689d:22h:54m:06s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158046760d:09h:43m:12s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158047830d:20h:32m:18s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158048901d:07h:21m:24s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158049971d:18h:10m:30s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158048794d:00h:27m:52s]
[  151.263502] softirq: huh, entered softirq 6 TASKLET 000000002aed33dd 
with preempt_count 00000100, exited with fffffffe?
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158049816d:15h:16m:18s]
[  151.312427] sas: Enter sas_scsi_recover_host busy: 192 failed: 192
[  151.318613] sas: trying to find task 0x00000000294f58ee
[  151.323830] sas: sas_scsi_find_task: aborting task 0x00000000294f58ee
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158045974d:14h:49m:07s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158046897d:22h:56m:49s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158047821d:07h:04m:31s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158048744d:15h:12m:13s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158046537d:10h:31m:29s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050141d:20h:12m:10s]
[  157.407215] hisi_sas_v3_hw 0000:74:02.0: internal task abort: timeout 
and not done.
[  157.414855] hisi_sas_v3_hw 0000:74:02.0: abort task: internal abort 
failed
[  157.421720] hisi_sas_v3_hw 0000:74:02.0: abort task: rc=-5
[  157.427202] sas: sas_scsi_find_task: querying task 0x00000000294f58ee
[  157.433634] sas: sas_scsi_find_task: task 0x00000000294f58ee failed 
to abort
[  157.440670] sas: task 0x00000000294f58ee is not at LU: I_T recover
[  157.446841] sas: I_T nexus reset for dev 500e004aaaaaaa02
[  157.452239] hisi_sas_v3_hw 0000:74:02.0: internal task abort: 
executing internal task failed: -22
[  157.461095] hisi_sas_v3_hw 0000:74:02.0: I_T nexus reset: internal 
abort (-22)
[  157.468310] hisi_sas_v3_hw 0000:74:02.0: internal task abort: 
executing internal task failed: -22
[  157.477170] hisi_sas_v3_hw 0000:74:02.0: lu_reset: internal abort failed
[  157.483866] hisi_sas_v3_hw 0000:74:02.0: lu_reset: for device[3]:rc= -22
[  157.490558] hisi_sas_v3_hw 0000:74:02.0: internal task abort: 
executing internal task failed: -22
[  157.499417] hisi_sas_v3_hw 0000:74:02.0: I_T nexus reset: internal 
abort (-22)
[  157.506627] sas: clear nexus ha
[  157.516706] hisi_sas_v3_hw 0000:74:02.0: controller resetting...
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158047356d:20h:42m:24s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158048189d:11h:12m:35s]
[  159.318030] hisi_sas_v3_hw 0000:74:02.0: phyup: phy0 link_rate=11
[  159.324114] hisi_sas_v3_hw 0000:74:02.0: phyup: phy1 link_rate=11
[  159.330195] hisi_sas_v3_hw 0000:74:02.0: phyup: phy2 link_rate=11
[  159.336274] hisi_sas_v3_hw 0000:74:02.0: phyup: phy3 link_rate=11
[  159.342354] hisi_sas_v3_hw 0000:74:02.0: phyup: phy4 link_rate=11
[  159.348433] hisi_sas_v3_hw 0000:74:02.0: phyup: phy5 link_rate=11
[  159.354513] hisi_sas_v3_hw 0000:74:02.0: phyup: phy6 link_rate=11
[  159.360593] hisi_sas_v3_hw 0000:74:02.0: phyup: phy7 link_rate=11
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158049022d:01h:42m:46s]
[  160.349201] hisi_sas_v3_hw 0000:74:02.0: controller reset complete
[  160.349208] sas: broadcast received: 0
[  160.355378] hisi_sas_v3_hw 0000:74:02.0: dump count exceeded!
[  160.359138] sas: REVALIDATING DOMAIN on port 0, pid:937
[  160.370508] sas: ex 500e004aaaaaaa1f phy01 change count has changed
[  160.376993] sas: ex 500e004aaaaaaa1f phy01 originated BROADCAST(CHANGE)
[  160.383603] sas: ex 500e004aaaaaaa1f rediscovering phy01
[  160.389117] sas: ex 500e004aaaaaaa1f phy01:U:5 attached: 
0000000000000000 (no device)
[  160.398068] sas: ex 500e004aaaaaaa1f phy16 originated BROADCAST(CHANGE)
[  160.404676] sas: ex 500e004aaaaaaa1f rediscovering phy16, part of a 
wide port with phy17
[  160.412978] sas: ex 500e004aaaaaaa1f phy16 broadcast flutter
[  160.418751] sas: ex 500e004aaaaaaa1f phy17 originated BROADCAST(CHANGE)
[  160.425358] sas: ex 500e004aaaaaaa1f rediscovering phy17, part of a 
wide port with phy16
[  160.433638] sas: ex 500e004aaaaaaa1f phy17 broadcast flutter
[  160.439421] sas: ex 500e004aaaaaaa1f phy18 originated BROADCAST(CHANGE)
[  160.446030] sas: ex 500e004aaaaaaa1f rediscovering phy18, part of a 
wide port with phy16
[  160.454316] sas: ex 500e004aaaaaaa1f phy18 broadcast flutter
[  160.460105] sas: ex 500e004aaaaaaa1f phy19 originated BROADCAST(CHANGE)
[  160.466712] sas: ex 500e004aaaaaaa1f rediscovering phy19, part of a 
wide port with phy16
[  160.475022] sas: ex 500e004aaaaaaa1f phy19 broadcast flutter
[  160.480801] sas: ex 500e004aaaaaaa1f phy20 originated BROADCAST(CHANGE)
[  160.487410] sas: ex 500e004aaaaaaa1f rediscovering phy20, part of a 
wide port with phy16
[  160.495692] sas: ex 500e004aaaaaaa1f phy20 broadcast flutter
[  160.501421] sas: ex 500e004aaaaaaa1f phy21 originated BROADCAST(CHANGE)
[  160.508029] sas: ex 500e004aaaaaaa1f rediscovering phy21, part of a 
wide port with phy16
[  160.516309] sas: ex 500e004aaaaaaa1f phy21 broadcast flutter
[  160.522048] sas: ex 500e004aaaaaaa1f phy22 originated BROADCAST(CHANGE)
[  160.528653] sas: ex 500e004aaaaaaa1f rediscovering phy22, part of a 
wide port with phy16
[  160.536975] sas: ex 500e004aaaaaaa1f phy22 broadcast flutter
[  160.542699] sas: ex 500e004aaaaaaa1f phy23 originated BROADCAST(CHANGE)
[  160.549307] sas: ex 500e004aaaaaaa1f rediscovering phy23, part of a 
wide port with phy16
[  160.557591] sas: ex 500e004aaaaaaa1f phy23 broadcast flutter
[  160.563318] sas: done REVALIDATING DOMAIN on port 0, pid:937, res 0x0
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158049854d:16h:12m:57s]
[  176.479216] sas: clear nexus ha succeeded
[  176.483220] sas: --- Exit sas_eh_handle_sas_errors -- clear_q
[  176.488980] sas: ata2: end_device-0:0:4: cmd error handler
[  176.494464] sas: ata1: end_device-0:0:2: cmd error handler
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050031d:07h:03m:49s]
[  182.751214] hisi_sas_v3_hw 0000:74:02.0: internal task abort: timeout 
and not done.
[  182.751217] hisi_sas_v3_hw 0000:74:02.0: dump count exceeded!
[  182.758857] hisi_sas_v3_hw 0000:74:02.0: lu_reset: internal abort failed
[  182.758858] hisi_sas_v3_hw 0000:74:02.0: lu_reset: for device[4]:rc= -5
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158047857d:06h:57m:44s]
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-01-20 13:23   ` John Garry
@ 2020-01-31 10:04     ` Ming Lei
  2020-01-31 10:24       ` John Garry
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2020-01-31 10:04 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch,
	chenxiang

On Mon, Jan 20, 2020 at 9:24 PM John Garry <john.garry@huawei.com> wrote:
>
> On 15/01/2020 17:00, John Garry wrote:
> > On 15/01/2020 11:44, Ming Lei wrote:
> >> Hi,
> >>
> >> Thomas mentioned:
> >>      "
> >>       That was the constraint of managed interrupts from the very
> >> beginning:
> >>        The driver/subsystem has to quiesce the interrupt line and the
> >> associated
> >>        queue _before_ it gets shutdown in CPU unplug and not fiddle
> >> with it
> >>        until it's restarted by the core when the CPU is plugged in again.
> >>      "
> >>
> >> But no drivers or blk-mq do that before one hctx becomes inactive(all
> >> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> >> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> >>
> >> This patchset tries to address the issue by two stages:
> >>
> >> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> >>
> >> - mark the hctx as internal stopped, and drain all in-flight requests
> >> if the hctx is going to be dead.
> >>
> >> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
> >> becomes dead
> >>
> >> - steal bios from the request, and resubmit them via
> >> generic_make_request(),
> >> then these IO will be mapped to other live hctx for dispatch
> >>
> >> Please comment & review, thanks!
> >>
> >> V5:
> >>     - rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
> >>     - re-factor code for re-submit requests in cpu dead hotplug handler
> >>     - address requeue corner case
> >>
> >> V4:
> >>     - resubmit IOs in dispatch list in case that this hctx is dead
> >>
> >> V3:
> >>     - re-organize patch 2 & 3 a bit for addressing Hannes's comment
> >>     - fix patch 4 for avoiding potential deadlock, as found by Hannes
> >>
> >> V2:
> >>     - patch4 & patch 5 in V1 have been merged to block tree, so remove
> >>       them
> >>     - address comments from John Garry and Minwoo
> >>
> >>
>
> Hi Ming,
>
> Colleague Xiang Chen found this:
>
> Starting 40 processes
> Jobs: 40 (f=40)
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [8.3% done]
> [2899MB/0KB/0KB /s] [742K/0/0 iops] [eta 00m:33s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [11.1% done]
> [2892MB/0KB/0KB /s] [740K/0/0 iops] [eta 00m:32s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [13.9% done]
> [2887MB/0KB/0KB /s] [739K/0/0 iops] [eta 00m:31s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [16.7% done]
> [2850MB/0KB/0KB /s] [730K/0/0 iops] [eta 00m:30s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [19.4% done]
> [2834MB/0KB/0KB /s] [726K/0/0 iops] [eta 00m:29s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [22.2% done]
> [2827MB/0KB/0KB /s] [724K/0/0 iops] [eta 00m:28s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [25.0% done]
> [2804MB/0KB/0KB /s] [718K/0/0 iops] [eta 00m:27s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [27.8% done]
> [2783MB/0KB/0KB /s] [713K/0/0 iops] [eta 00m:26s]
> [  112.990292] CPU16: shutdown
> [  112.993113] psci: CPU16 killed (polled 0 ms)
> [  113.029429] CPU15: shutdown
> [  113.032245] psci: CPU15 killed (polled 0 ms)
> [  113.073649] CPU14: shutdown
> [  113.076461] psci: CPU14 killed (polled 0 ms)
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR[  113.121258] CPU13:
> shutdown
> RRRRRRRRRR] [30.[  113.124247] psci: CPU13 killed (polled 0 ms)
> 6% done] [2772MB/0KB/0KB /s] [710K/0/0 iops] [eta 00m:25s]
> [  113.177108] IRQ 600: no longer affine to CPU12
> [  113.183185] CPU12: shutdown
> [  113.186001] psci: CPU12 killed (polled 0 ms)
> [  113.242555] CPU11: shutdown
> [  113.245368] psci: CPU11 killed (polled 0 ms)
> [  113.302598] CPU10: shutdown
> [  113.305420] psci: CPU10 killed (polled 0 ms)
> [  113.365757] CPU9: shutdown
> [  113.368489] psci: CPU9 killed (polled 0 ms)
> [  113.611904] IRQ 599: no longer affine to CPU8
> [  113.620611] CPU8: shutdown
> [  113.623340] psci: CPU8 killed (polled 0 ms)
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [33.3% done]
> [2765MB/0KB/0KB /s] [708K/0/0 iops] [eta 00m:24s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [36.1% done]
> [2767MB/0KB/0KB /s] [708K/0/0 iops] [eta 00m:23s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [38.9% done]
> [2771MB/0KB/0KB /s] [709K/0/0 iops] [eta 00m:22s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [41.7% done]
> [2780MB/0KB/0KB /s] [712K/0/0 iops] [eta 00m:21s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [44.4% done]
> [2731MB/0KB/0KB /s] [699K/0/0 iops] [eta 00m:20s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [47.2% done]
> [2757MB/0KB/0KB /s] [706K/0/0 iops] [eta 00m:19s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [50.0% done]
> [2764MB/0KB/0KB /s] [708K/0/0 iops] [eta 00m:18s]
> [  120.762141] ------------[ cut here ]------------
> [  120.766760] WARNING: CPU: 0 PID: 10 at block/blk.h:299
> generic_make_request_checks+0x480/0x4e8
> [  120.775348] Modules linked in:
> [  120.778397] CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.5.0-
>
> Then this:
>
> 1158048319d:18h:40m:34s]
> Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [0.0% done]
> [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158049605d:23h:03m:30s]
> [  141.915198] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  141.921280] rcu:     96-...0: (4 GPs behind) idle=af6/0/0x1
> softirq=90/90 fqs=2626
> [  141.928658] rcu:     99-...0: (2 GPs behind)
> idle=b82/1/0x4000000000000000 softirq=82/82 fqs=2627
> [  141.937339]  (detected by 17, t=5254 jiffies, g=3925, q=545)
> [  141.942985] Task dump for CPU 96:
> [  141.946292] swapper/96      R  running task        0     0      1
> 0x0000002a
> [  141.953321] Call trace:
> [  141.955763]  __switch_to+0xbc/0x218
> [  141.959244]  0x0
> [  141.961079] Task dump for CPU 99:
> [  141.964385] kworker/99:1H   R  running task        0  3473      2
> 0x0000022a
> [  141.971417] Workqueue: kblockd blk_mq_run_work_fn
> [  141.976109] Call trace:
> [  141.978550]  __switch_to+0xbc/0x218
> [  141.982029]  blk_mq_run_work_fn+0x1c/0x28
> [  141.986027]  process_one_work+0x1e0/0x358
> [  141.990025]  worker_thread+0x40/0x488
> [  141.993678]  kthread+0x118/0x120
> [  141.996897]  ret_from_fork+0x10/0x18

Hi John,

Thanks for your test!

Could you test the following patchset? Only the last patch is changed:

https://github.com/ming1/linux/commits/my_for_5.6_block

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-01-31 10:04     ` Ming Lei
@ 2020-01-31 10:24       ` John Garry
  2020-01-31 10:58         ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: John Garry @ 2020-01-31 10:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch,
	chenxiang (M)

>> [  141.976109] Call trace:
>> [  141.978550]  __switch_to+0xbc/0x218
>> [  141.982029]  blk_mq_run_work_fn+0x1c/0x28
>> [  141.986027]  process_one_work+0x1e0/0x358
>> [  141.990025]  worker_thread+0x40/0x488
>> [  141.993678]  kthread+0x118/0x120
>> [  141.996897]  ret_from_fork+0x10/0x18
> 
> Hi John,
> 
> Thanks for your test!
> 

Hi Ming,

> Could you test the following patchset and only the last one is changed?
> 
> https://github.com/ming1/linux/commits/my_for_5.6_block

For SCSI testing, I will ask my colleague Xiang Chen to test when he 
returns to work. I did not see this issue in my own SCSI testing of your 
original v5, but I was only using 1 SAS disk as opposed to maybe 20.

BTW, did you test NVMe? For some reason I could not trigger a scenario 
where we're draining the outstanding requests for a queue which is being 
deactivated - I mean, the queues were always already quiesced.

Thanks,
John

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-01-31 10:24       ` John Garry
@ 2020-01-31 10:58         ` Ming Lei
  2020-01-31 17:51           ` John Garry
  2020-02-07 10:56           ` John Garry
  0 siblings, 2 replies; 26+ messages in thread
From: Ming Lei @ 2020-01-31 10:58 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch,
	chenxiang (M)

On Fri, Jan 31, 2020 at 6:24 PM John Garry <john.garry@huawei.com> wrote:
>
> >> [  141.976109] Call trace:
> >> [  141.978550]  __switch_to+0xbc/0x218
> >> [  141.982029]  blk_mq_run_work_fn+0x1c/0x28
> >> [  141.986027]  process_one_work+0x1e0/0x358
> >> [  141.990025]  worker_thread+0x40/0x488
> >> [  141.993678]  kthread+0x118/0x120
> >> [  141.996897]  ret_from_fork+0x10/0x18
> >
> > Hi John,
> >
> > Thanks for your test!
> >
>
> Hi Ming,
>
> > Could you test the following patchset and only the last one is changed?
> >
> > https://github.com/ming1/linux/commits/my_for_5.6_block
>
> For SCSI testing, I will ask my colleague Xiang Chen to test when he
> returns to work. So I did not see this issue for my SCSI testing for
> your original v5, but I was only using 1x as opposed to maybe 20x SAS disks.
>
> BTW, did you test NVMe? For some reason I could not trigger a scenario
> where we're draining the outstanding requests for a queue which is being
> deactivated - I mean, the queues were always already quiesced.

I ran the CPU hotplug test on both NVMe and SCSI in KVM, and fio just runs
as expected.

NVMe often has a 1:1 CPU-to-hctx mapping, so it might be a bit difficult to
trigger draining of in-flight IOs.
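
To make it concrete when the drain matters at all: the hctx only has to be
drained when the CPU going offline is the last online CPU in hctx->cpumask.
Very roughly, something like the below (only a sketch of the idea in this
series, not the actual patch; the function names and the
blk_mq_tags_in_flight() helper are made up for illustration, and the
cpuhp_online field name follows this series):

#include <linux/blk-mq.h>
#include <linux/cpumask.h>
#include <linux/delay.h>

static bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
				    struct blk_mq_hw_ctx *hctx)
{
	unsigned int other;

	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return false;

	/* any other online CPU mapped to this hctx can still service it */
	for_each_cpu_and(other, hctx->cpumask, cpu_online_mask)
		if (other != cpu)
			return false;

	return true;
}

static int blk_mq_hctx_offline_sketch(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx =
		hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);

	if (!blk_mq_last_cpu_in_hctx(cpu, hctx))
		return 0;

	/* stop new tag allocation, then wait for in-flight requests */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	while (blk_mq_tags_in_flight(hctx->tags))	/* assumed helper */
		msleep(5);

	return 0;
}

With a 1:1 mapping each hctx hits that condition as soon as its single CPU
goes down, but whether anything is still in flight at that instant is a
matter of timing, which is probably why the drain path is hard to exercise
there.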

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-01-31 10:58         ` Ming Lei
@ 2020-01-31 17:51           ` John Garry
  2020-01-31 18:02             ` John Garry
  2020-02-07 10:56           ` John Garry
  1 sibling, 1 reply; 26+ messages in thread
From: John Garry @ 2020-01-31 17:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	chenxiang (M),
	kbusch

+- Keith

On 31/01/2020 10:58, Ming Lei wrote:
> On Fri, Jan 31, 2020 at 6:24 PM John Garry <john.garry@huawei.com> wrote:
>>
>>>> [  141.976109] Call trace:
>>>> [  141.978550]  __switch_to+0xbc/0x218
>>>> [  141.982029]  blk_mq_run_work_fn+0x1c/0x28
>>>> [  141.986027]  process_one_work+0x1e0/0x358
>>>> [  141.990025]  worker_thread+0x40/0x488
>>>> [  141.993678]  kthread+0x118/0x120
>>>> [  141.996897]  ret_from_fork+0x10/0x18
>>>
>>> Hi John,
>>>
>>> Thanks for your test!
>>>
>>
>> Hi Ming,
>>
>>> Could you test the following patchset and only the last one is changed?
>>>
>>> https://github.com/ming1/linux/commits/my_for_5.6_block
>>
>> For SCSI testing, I will ask my colleague Xiang Chen to test when he
>> returns to work. So I did not see this issue for my SCSI testing for
>> your original v5, but I was only using 1x as opposed to maybe 20x SAS disks.
>>
>> BTW, did you test NVMe? For some reason I could not trigger a scenario
>> where we're draining the outstanding requests for a queue which is being
>> deactivated - I mean, the queues were always already quiesced.
> 
> I run cpu hotplug test on both NVMe and SCSI in KVM, and fio just runs
> as expected.
> 
> NVMe is often 1:1 mapping, so it might be a bit difficult to trigger
> draining in-flight IOs.

I generally have a 4:1 CPU-to-queue mapping for my NVMe cards, since this 
system has so many CPUs.

But more alarming is that I see this on every run:

[  246.378744] psci: CPU6 killed (polled 0 ms)
[  246.427955] irq_migrate_all_off_this_cpu: 3 callbacks suppressed
[  246.427956] IRQ 822: no longer affine to CPU7
[  246.438586] CPU7: shutdown
[  246.441300] psci: CPU7 killed (polled 0 ms)
[  247.015206] pcieport 0000:00:08.0: can't change power state from 
D3cold to D0 (config space inaccessible)
Jobs: 6[  267.503159] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks:
[  267.509244] rcu: 25-...0: (1 GPs behind) 
idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=2626
[  267.518283] (detected by 0, t=5255 jiffies, g=14001, q=1787)
[  267.524016] Task dump for CPU 25:
[  267.527318] irqbalance      R  running task        0  1275      1 
0x00000002
[  267.534354] Call trace:
[  267.536790]  __switch_to+0xbc/0x218
[  267.540266]  0xffff800014c53d68
Jobs: 6 (f=6): [R(6)][96.0%][r=0KiB/s[  278.195193] nvme nvme1: 
controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
,w=0KiB/s][r=0,w=0 IOPS][eta 00m:[  278.205197] nvme nvme0: controller 
is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
Jobs: 6 (f=6): [R(6)][0.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IO[  330.523157] 
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  330.529242] rcu: 25-...0: (1 GPs behind) 
idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=10503
[  330.538366] (detected by 0, t=21010 jiffies, g=14001, q=2318)
[  330.544186] Task dump for CPU 25:
[  330.547488] irqbalance      R  running task        0  1275      1 
0x00000002
[  330.554523] Call trace:
[  330.556958]  __switch_to+0xbc/0x218
[  330.560434]  0xffff800014c53d68
[  393.543155] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  393.549242] rcu: 25-...0: (1 GPs behind) 
idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=18379
[  393.558366] (detected by 8, t=36765 jiffies, g=14001, q=2385)
[  393.564187] Task dump for CPU 25:
[  393.567488] irqbalance      R  running task        0  1275      1 
0x00000002
[  393.574523] Call trace:
[  393.576960]  __switch_to+0xbc/0x218
[  393.580435]  0xffff800014c53d68
[  456.567156] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  456.573246] rcu: 25-...0: (1 GPs behind) 
idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=26255
[  456.582389] (detected by 90, t=52520 jiffies, g=14001, q=2397)
[  456.588296] Task dump for CPU 25:
[  456.591601] irqbalance      R  running task        0  1275      1 
0x00000002
[  456.598638] Call trace:
[  456.601078]  __switch_to+0xbc/0x218
[  456.604556]  0xffff800014c53d68
[  519.587152] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  519.593235] rcu: 25-...0: (1 GPs behind) 
idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=34131
[  519.602357] (detected by 90, t=68275 jiffies, g=14001, q=2405)
[  519.608264] Task dump for CPU 25:
[  519.611566] irqbalance      R  running task        0  1275      1 
0x00000002
[  519.618601] Call trace:
[  519.621036]  __switch_to+0xbc/0x218
[  519.624511]  0xffff800014c53d68
[  582.603153] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  582.609236] rcu: 25-...0: (1 GPs behind) 
idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=42006
[  582.618358] (detected by 90, t=84029 jiffies, g=14001, q=2409)
[  582.624265] Task dump for CPU 25:
[  582.627567] irqbalance      R  running task        0  1275      1 
0x00000002
[  582.634601] Call trace:
[  582.637036]  __switch_to+0xbc/0x218
[  582.640511]  0xffff800014c53d68


I tested v5.5 and I cannot see this, but your branch is based on 5.5-rc2 
with lots of other stuff on top. The only thing I changed was 
nvme.use_threaded_interrupts = 1.

Note that I do see this with v5.5 for my hotplug test, as expected:
[  280.221917] nvme nvme1: I/O 114 QID 19 timeout, completion polledta 
01m:50s]

...so I'm not going insane.

Thanks,
John

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-01-31 17:51           ` John Garry
@ 2020-01-31 18:02             ` John Garry
  2020-02-01  1:31               ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: John Garry @ 2020-01-31 18:02 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	chenxiang (M),
	kbusch, Marc Zyngier

On 31/01/2020 17:51, John Garry wrote:
> +- Keith

+ Marc, on the rare chance that this is related to the GIC

> 
> On 31/01/2020 10:58, Ming Lei wrote:
>> On Fri, Jan 31, 2020 at 6:24 PM John Garry <john.garry@huawei.com> wrote:
>>>
>>>>> [  141.976109] Call trace:
>>>>> [  141.978550]  __switch_to+0xbc/0x218
>>>>> [  141.982029]  blk_mq_run_work_fn+0x1c/0x28
>>>>> [  141.986027]  process_one_work+0x1e0/0x358
>>>>> [  141.990025]  worker_thread+0x40/0x488
>>>>> [  141.993678]  kthread+0x118/0x120
>>>>> [  141.996897]  ret_from_fork+0x10/0x18
>>>>
>>>> Hi John,
>>>>
>>>> Thanks for your test!
>>>>
>>>
>>> Hi Ming,
>>>
>>>> Could you test the following patchset and only the last one is changed?
>>>>
>>>> https://github.com/ming1/linux/commits/my_for_5.6_block
>>>
>>> For SCSI testing, I will ask my colleague Xiang Chen to test when he
>>> returns to work. So I did not see this issue for my SCSI testing for
>>> your original v5, but I was only using 1x as opposed to maybe 20x SAS 
>>> disks.
>>>
>>> BTW, did you test NVMe? For some reason I could not trigger a scenario
>>> where we're draining the outstanding requests for a queue which is being
>>> deactivated - I mean, the queues were always already quiesced.
>>
>> I run cpu hotplug test on both NVMe and SCSI in KVM, and fio just runs
>> as expected.
>>
>> NVMe is often 1:1 mapping, so it might be a bit difficult to trigger
>> draining in-flight IOs.
> 
> I generally have a 4:1 mapping for my NVMe cards - so many CPUs.
> 
> But more alarming is that I see this on every run:
> 
> [  246.378744] psci: CPU6 killed (polled 0 ms)
> [  246.427955] irq_migrate_all_off_this_cpu: 3 callbacks suppressed
> [  246.427956] IRQ 822: no longer affine to CPU7
> [  246.438586] CPU7: shutdown
> [  246.441300] psci: CPU7 killed (polled 0 ms)
> [  247.015206] pcieport 0000:00:08.0: can't change power state from 
> D3cold to D0 (config space inaccessible)
> Jobs: 6[  267.503159] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  267.509244] rcu: 25-...0: (1 GPs behind) 
> idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=2626
> [  267.518283] (detected by 0, t=5255 jiffies, g=14001, q=1787)
> [  267.524016] Task dump for CPU 25:
> [  267.527318] irqbalance      R  running task        0  1275      1 
> 0x00000002
> [  267.534354] Call trace:
> [  267.536790]  __switch_to+0xbc/0x218
> [  267.540266]  0xffff800014c53d68
> Jobs: 6 (f=6): [R(6)][96.0%][r=0KiB/s[  278.195193] nvme nvme1: 
> controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> ,w=0KiB/s][r=0,w=0 IOPS][eta 00m:[  278.205197] nvme nvme0: controller 
> is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> Jobs: 6 (f=6): [R(6)][0.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IO[  330.523157] 
> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  330.529242] rcu: 25-...0: (1 GPs behind) 
> idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=10503
> [  330.538366] (detected by 0, t=21010 jiffies, g=14001, q=2318)
> [  330.544186] Task dump for CPU 25:
> [  330.547488] irqbalance      R  running task        0  1275      1 
> 0x00000002
> [  330.554523] Call trace:
> [  330.556958]  __switch_to+0xbc/0x218
> [  330.560434]  0xffff800014c53d68
> [  393.543155] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  393.549242] rcu: 25-...0: (1 GPs behind) 
> idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=18379
> [  393.558366] (detected by 8, t=36765 jiffies, g=14001, q=2385)
> [  393.564187] Task dump for CPU 25:
> [  393.567488] irqbalance      R  running task        0  1275      1 
> 0x00000002
> [  393.574523] Call trace:
> [  393.576960]  __switch_to+0xbc/0x218
> [  393.580435]  0xffff800014c53d68
> [  456.567156] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  456.573246] rcu: 25-...0: (1 GPs behind) 
> idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=26255
> [  456.582389] (detected by 90, t=52520 jiffies, g=14001, q=2397)
> [  456.588296] Task dump for CPU 25:
> [  456.591601] irqbalance      R  running task        0  1275      1 
> 0x00000002
> [  456.598638] Call trace:
> [  456.601078]  __switch_to+0xbc/0x218
> [  456.604556]  0xffff800014c53d68
> [  519.587152] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  519.593235] rcu: 25-...0: (1 GPs behind) 
> idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=34131
> [  519.602357] (detected by 90, t=68275 jiffies, g=14001, q=2405)
> [  519.608264] Task dump for CPU 25:
> [  519.611566] irqbalance      R  running task        0  1275      1 
> 0x00000002
> [  519.618601] Call trace:
> [  519.621036]  __switch_to+0xbc/0x218
> [  519.624511]  0xffff800014c53d68
> [  582.603153] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  582.609236] rcu: 25-...0: (1 GPs behind) 
> idle=11e/1/0x4000000000000000 softirq=1175/1177 fqs=42006
> [  582.618358] (detected by 90, t=84029 jiffies, g=14001, q=2409)
> [  582.624265] Task dump for CPU 25:
> [  582.627567] irqbalance      R  running task        0  1275      1 
> 0x00000002
> [  582.634601] Call trace:
> [  582.637036]  __switch_to+0xbc/0x218
> [  582.640511]  0xffff800014c53d68
> 
> 
> I tested v5.5 and I cannot see this, but your branch is based in 5.5-rc2 
> with lots of other stuff on top. The only thing I changed was 
> nvme.use_threaded_interrupts = 1.

Uhh, now I do with v5.5:

[  705.354601] CPU22: shutdown
[  705.357395] psci: CPU22 killed (polled 0 ms)
[  705.402837] CPU23: shutdown
[  705.405630] psci: CPU23 killed (polled 0 ms)
[  705.666716] CPU24: shutdown
[  705.669512] psci: CPU24 killed (polled 0 ms)
[  705.966753] CPU25: shutdown
[  705.969548] psci: CPU25 killed (polled 0 ms)
[  706.250771] CPU26: shutdown=2347MiB/s,w=0KiB/s][r=601k,w=0 IOPS][eta 
00m:13s]
[  706.253565] psci: CPU26 killed (polled 0 ms)
[  706.514728] CPU27: shutdown
[  706.517523] psci: CPU27 killed (polled 0 ms)
[  706.826708] CPU28: shutdown
[  706.829502] psci: CPU28 killed (polled 0 ms)
[  707.130916] CPU29: shutdown=2134MiB/s,w=0KiB/s][r=546k,w=0 IOPS][eta 
00m:12s]
[  707.133714] psci: CPU29 killed (polled 0 ms)
[  707.439066] CPU30: shutdown
[  707.441870] psci: CPU30 killed (polled 0 ms)
[  707.706727] CPU31: shutdown
[  707.709526] psci: CPU31 killed (polled 0 ms)
[  708.521853] pcieport 0000:00:08.0: can't change power state from 
D3cold to D0 (config space inaccessible)
[  728.741808] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks:80d:00h:35m:42s]
[  728.747895] rcu: 48-...0: (0 ticks this GP) 
idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
[  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
[  728.763018] Task dump for CPU 48:
[  728.766321] irqbalance      R  running task        0  1272      1 
0x00000002
[  728.773358] Call trace:
[  728.775801]  __switch_to+0xbc/0x218
[  728.779283]  gic_set_affinity+0x16c/0x1d8
[  728.783282]  irq_do_set_affinity+0x30/0xd0
[  728.787365]  irq_set_affinity_locked+0xc8/0xf0
[  728.791796]  __irq_set_affinity+0x4c/0x80
[  728.795794]  write_irq_affinity.isra.7+0x104/0x120
[  728.800572]  irq_affinity_proc_write+0x1c/0x28
[  728.805008]  proc_reg_write+0x78/0xb8
[  728.808660]  __vfs_write+0x18/0x38
[  728.812050]  vfs_write+0xb4/0x1e0
[  728.815352]  ksys_write+0x68/0xf8
[  728.818655]  __arm64_sys_write+0x18/0x20
[  728.822567]  el0_svc_common.constprop.2+0x64/0x160
[  728.827345]  el0_svc_handler+0x20/0x80
[  728.831082]  el0_sync_handler+0xe4/0x188
[  728.834991]  el0_sync+0x140/0x180
[  738.993844] nvme nvme1: controller is down; will reset: 
CSTS=0xffffffff, PCI_STATUS=0xffff
[  791.761810] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks:63d:14h:24m:13s]
[  791.767897] rcu: 48-...0: (0 ticks this GP) 
idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=10495
[  791.777281] (detected by 54, t=21010 jiffies, g=40989, q=2396)
[  791.783189] Task dump for CPU 48:
[  791.786491] irqbalance      R  running task        0  1272      1 
0x00000002
[  791.793528] Call trace:
[  791.795964]  __switch_to+0xbc/0x218
[  791.799441]  gic_set_affinity+0x16c/0x1d8
[  791.803439]  irq_do_set_affinity+0x30/0xd0
[  791.807522]  irq_set_affinity_locked+0xc8/0xf0
[  791.811953]  __irq_set_affinity+0x4c/0x80
[  791.815949]  write_irq_affinity.isra.7+0x104/0x120
[  791.820727]  irq_affinity_proc_write+0x1c/0x28
[  791.825158]  proc_reg_write+0x78/0xb8
[  791.828808]  __vfs_write+0x18/0x38
[  791.832197]  vfs_write+0xb4/0x1e0
[  791.835500]  ksys_write+0x68/0xf8
[  791.838802]  __arm64_sys_write+0x18/0x20
[  791.842712]  el0_svc_common.constprop.2+0x64/0x160
[  791.847490]  el0_svc_handler+0x20/0x80
[  791.851226]  el0_sync_handler+0xe4/0x188
[  791.855135]  el0_sync+0x140/0x180
Jobs: 6 (f=6): [R(6)][0.0%][r=0KiB/s


> 
> Note that I do see this with v5.5 for my hotplug test, as expected:
> [  280.221917] nvme nvme1: I/O 114 QID 19 timeout, completion polledta 
> 01m:50s]
> 
> ...so I'm not going insane.
> 

Thanks,
John


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-01-31 18:02             ` John Garry
@ 2020-02-01  1:31               ` Ming Lei
  2020-02-01 11:05                 ` Marc Zyngier
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2020-02-01  1:31 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	chenxiang (M),
	Keith Busch, Marc Zyngier

On Sat, Feb 1, 2020 at 2:02 AM John Garry <john.garry@huawei.com> wrote:
>
> On 31/01/2020 17:51, John Garry wrote:
> > +- Keith
>
> + Marc, on rare chance that this is related to the GIC
>
> >
> > On 31/01/2020 10:58, Ming Lei wrote:
> >> On Fri, Jan 31, 2020 at 6:24 PM John Garry <john.garry@huawei.com> wrote:
> >>>
> >>>>> [  141.976109] Call trace:
> >>>>> [  141.978550]  __switch_to+0xbc/0x218
> >>>>> [  141.982029]  blk_mq_run_work_fn+0x1c/0x28
> >>>>> [  141.986027]  process_one_work+0x1e0/0x358
> >>>>> [  141.990025]  worker_thread+0x40/0x488
> >>>>> [  141.993678]  kthread+0x118/0x120
> >>>>> [  141.996897]  ret_from_fork+0x10/0x18
> >>>>
> >>>> Hi John,
> >>>>
> >>>> Thanks for your test!
> >>>>
> >>>
> >>> Hi Ming,
> >>>
> >>>> Could you test the following patchset and only the last one is changed?
> >>>>
> >>>> https://github.com/ming1/linux/commits/my_for_5.6_block
> >>>
> >>> For SCSI testing, I will ask my colleague Xiang Chen to test when he
> >>> returns to work. So I did not see this issue for my SCSI testing for
> >>> your original v5, but I was only using 1x as opposed to maybe 20x SAS
> >>> disks.
> >>>
> >>> BTW, did you test NVMe? For some reason I could not trigger a scenario
> >>> where we're draining the outstanding requests for a queue which is being
> >>> deactivated - I mean, the queues were always already quiesced.
> >>
> >> I run cpu hotplug test on both NVMe and SCSI in KVM, and fio just runs
> >> as expected.
> >>
> >> NVMe is often 1:1 mapping, so it might be a bit difficult to trigger
> >> draining in-flight IOs.
> >
> > I generally have a 4:1 mapping for my NVMe cards - so many CPUs.
> >
> > But more alarming is that I see this on every run:
> >
> > ...
> >
> >
> > I tested v5.5 and I cannot see this, but your branch is based in 5.5-rc2
> > with lots of other stuff on top. The only thing I changed was
> > nvme.use_threaded_interrupts = 1.
>
> Uhh, now I do with v5.5:
>
> [  705.354601] CPU22: shutdown
> [  705.357395] psci: CPU22 killed (polled 0 ms)
> [  705.402837] CPU23: shutdown
> [  705.405630] psci: CPU23 killed (polled 0 ms)
> [  705.666716] CPU24: shutdown
> [  705.669512] psci: CPU24 killed (polled 0 ms)
> [  705.966753] CPU25: shutdown
> [  705.969548] psci: CPU25 killed (polled 0 ms)
> [  706.250771] CPU26: shutdown=2347MiB/s,w=0KiB/s][r=601k,w=0 IOPS][eta
> 00m:13s]
> [  706.253565] psci: CPU26 killed (polled 0 ms)
> [  706.514728] CPU27: shutdown
> [  706.517523] psci: CPU27 killed (polled 0 ms)
> [  706.826708] CPU28: shutdown
> [  706.829502] psci: CPU28 killed (polled 0 ms)
> [  707.130916] CPU29: shutdown=2134MiB/s,w=0KiB/s][r=546k,w=0 IOPS][eta
> 00m:12s]
> [  707.133714] psci: CPU29 killed (polled 0 ms)
> [  707.439066] CPU30: shutdown
> [  707.441870] psci: CPU30 killed (polled 0 ms)
> [  707.706727] CPU31: shutdown
> [  707.709526] psci: CPU31 killed (polled 0 ms)
> [  708.521853] pcieport 0000:00:08.0: can't change power state from
> D3cold to D0 (config space inaccessible)
> [  728.741808] rcu: INFO: rcu_preempt detected stalls on
> CPUs/tasks:80d:00h:35m:42s]
> [  728.747895] rcu: 48-...0: (0 ticks this GP)
> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
> [  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
> [  728.763018] Task dump for CPU 48:
> [  728.766321] irqbalance      R  running task        0  1272      1
> 0x00000002
> [  728.773358] Call trace:
> [  728.775801]  __switch_to+0xbc/0x218
> [  728.779283]  gic_set_affinity+0x16c/0x1d8
...

gic_set_affinity() shouldn't have switched out, so it looks like a GIC issue.

BTW, you may try running the test with irqbalance disabled.
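
For reference, the only thing irqbalance is doing when it gets caught in
that trace is writing a CPU mask to /proc/irq/<N>/smp_affinity. A minimal
stand-alone equivalent of that single step (the IRQ number and mask below
are just examples; IRQ 822 happens to show up in your log) is:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/proc/irq/822/smp_affinity";	/* example IRQ */
	const char *mask = "3\n";				/* CPUs 0-1 */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* this write is what ends up in irq_affinity_proc_write() */
	if (write(fd, mask, strlen(mask)) < 0)
		perror("write");

	close(fd);
	return 0;
}

So with irqbalance disabled nothing should be writing smp_affinity during
the test, which makes it a useful way to tell whether the proc write path
is part of the problem or just a victim.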


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-01  1:31               ` Ming Lei
@ 2020-02-01 11:05                 ` Marc Zyngier
  2020-02-01 11:31                   ` Thomas Gleixner
  0 siblings, 1 reply; 26+ messages in thread
From: Marc Zyngier @ 2020-02-01 11:05 UTC (permalink / raw)
  To: Ming Lei, John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	chenxiang (M),
	Keith Busch

On Sat, 1 Feb 2020 09:31:17 +0800
Ming Lei <tom.leiming@gmail.com> wrote:

> On Sat, Feb 1, 2020 at 2:02 AM John Garry <john.garry@huawei.com> wrote:
> >
> > On 31/01/2020 17:51, John Garry wrote:  
> > > +- Keith  
> >
> > + Marc, on rare chance that this is related to the GIC
> >  
> > >
> > > On 31/01/2020 10:58, Ming Lei wrote:  
> > >> On Fri, Jan 31, 2020 at 6:24 PM John Garry <john.garry@huawei.com> wrote:  
> > >>>  
> > >>>>> [  141.976109] Call trace:
> > >>>>> [  141.978550]  __switch_to+0xbc/0x218
> > >>>>> [  141.982029]  blk_mq_run_work_fn+0x1c/0x28
> > >>>>> [  141.986027]  process_one_work+0x1e0/0x358
> > >>>>> [  141.990025]  worker_thread+0x40/0x488
> > >>>>> [  141.993678]  kthread+0x118/0x120
> > >>>>> [  141.996897]  ret_from_fork+0x10/0x18  
> > >>>>
> > >>>> Hi John,
> > >>>>
> > >>>> Thanks for your test!
> > >>>>  
> > >>>
> > >>> Hi Ming,
> > >>>  
> > >>>> Could you test the following patchset and only the last one is changed?
> > >>>>
> > >>>> https://github.com/ming1/linux/commits/my_for_5.6_block  
> > >>>
> > >>> For SCSI testing, I will ask my colleague Xiang Chen to test when he
> > >>> returns to work. So I did not see this issue for my SCSI testing for
> > >>> your original v5, but I was only using 1x as opposed to maybe 20x SAS
> > >>> disks.
> > >>>
> > >>> BTW, did you test NVMe? For some reason I could not trigger a scenario
> > >>> where we're draining the outstanding requests for a queue which is being
> > >>> deactivated - I mean, the queues were always already quiesced.  
> > >>
> > >> I run cpu hotplug test on both NVMe and SCSI in KVM, and fio just runs
> > >> as expected.
> > >>
> > >> NVMe is often 1:1 mapping, so it might be a bit difficult to trigger
> > >> draining in-flight IOs.  
> > >
> > > I generally have a 4:1 mapping for my NVMe cards - so many CPUs.
> > >
> > > But more alarming is that I see this on every run:
> > >
> > > ...
> > >
> > >
> > > I tested v5.5 and I cannot see this, but your branch is based in 5.5-rc2
> > > with lots of other stuff on top. The only thing I changed was
> > > nvme.use_threaded_interrupts = 1.  
> >
> > Uhh, now I do with v5.5:
> >
> > [  705.354601] CPU22: shutdown
> > [  705.357395] psci: CPU22 killed (polled 0 ms)
> > [  705.402837] CPU23: shutdown
> > [  705.405630] psci: CPU23 killed (polled 0 ms)
> > [  705.666716] CPU24: shutdown
> > [  705.669512] psci: CPU24 killed (polled 0 ms)
> > [  705.966753] CPU25: shutdown
> > [  705.969548] psci: CPU25 killed (polled 0 ms)
> > [  706.250771] CPU26: shutdown=2347MiB/s,w=0KiB/s][r=601k,w=0 IOPS][eta
> > 00m:13s]
> > [  706.253565] psci: CPU26 killed (polled 0 ms)
> > [  706.514728] CPU27: shutdown
> > [  706.517523] psci: CPU27 killed (polled 0 ms)
> > [  706.826708] CPU28: shutdown
> > [  706.829502] psci: CPU28 killed (polled 0 ms)
> > [  707.130916] CPU29: shutdown=2134MiB/s,w=0KiB/s][r=546k,w=0 IOPS][eta
> > 00m:12s]
> > [  707.133714] psci: CPU29 killed (polled 0 ms)
> > [  707.439066] CPU30: shutdown
> > [  707.441870] psci: CPU30 killed (polled 0 ms)
> > [  707.706727] CPU31: shutdown
> > [  707.709526] psci: CPU31 killed (polled 0 ms)
> > [  708.521853] pcieport 0000:00:08.0: can't change power state from
> > D3cold to D0 (config space inaccessible)
> > [  728.741808] rcu: INFO: rcu_preempt detected stalls on
> > CPUs/tasks:80d:00h:35m:42s]
> > [  728.747895] rcu: 48-...0: (0 ticks this GP)
> > idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
> > [  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
> > [  728.763018] Task dump for CPU 48:
> > [  728.766321] irqbalance      R  running task        0  1272      1
> > 0x00000002
> > [  728.773358] Call trace:
> > [  728.775801]  __switch_to+0xbc/0x218
> > [  728.779283]  gic_set_affinity+0x16c/0x1d8  
> ...
> 
> gic_set_affinity shouldn't have switched out, so looks like one gic
> issue.

Given that gic_set_affinity doesn't sleep, this looks pretty unlikely.
And __irq_set_affinity() holds a spinlock with irq disabled, so I can't
really explain how you'd get there. I've just booted a lockdep enabled
v5.5 on my D05, moved SPIs around (because that's the only way to reach
this code), and nothing caught fire.

Either the stack trace isn't reliable (when I read things like
"80d:00h:35m:42s" in the trace, I'm a bit suspicious), or CPU hotplug is
doing something really funky here.
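
To spell that out, the path from the proc write down to the GIC driver
looks roughly like this in v5.5 (paraphrased from kernel/irq/manage.c, not
a verbatim copy):

int __irq_set_affinity(unsigned int irq, const struct cpumask *mask,
		       bool force)
{
	struct irq_desc *desc = irq_to_desc(irq);
	unsigned long flags;
	int ret;

	if (!desc)
		return -EINVAL;

	raw_spin_lock_irqsave(&desc->lock, flags);
	/*
	 * irq_set_affinity_locked() calls into chip->irq_set_affinity(),
	 * i.e. gic_set_affinity(), still under desc->lock with interrupts
	 * off, so nothing on this path may sleep or schedule.
	 */
	ret = irq_set_affinity_locked(irq_desc_get_irq_data(desc), mask, force);
	raw_spin_unlock_irqrestore(&desc->lock, flags);

	return ret;
}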

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-01 11:05                 ` Marc Zyngier
@ 2020-02-01 11:31                   ` Thomas Gleixner
  2020-02-03 10:30                     ` John Garry
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2020-02-01 11:31 UTC (permalink / raw)
  To: Marc Zyngier, Ming Lei, John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, chenxiang (M),
	Keith Busch

Marc Zyngier <maz@kernel.org> writes:
> On Sat, 1 Feb 2020 09:31:17 +0800
> Ming Lei <tom.leiming@gmail.com> wrote:
>> On Sat, Feb 1, 2020 at 2:02 AM John Garry <john.garry@huawei.com> wrote:
>> 
>> gic_set_affinity shouldn't have switched out, so looks like one gic
>> issue.
>
> Given that gic_set_affinity doesn't sleep, this looks pretty unlikely.
> And __irq_set_affinity() holds a spinlock with irq disabled, so I can't
> really explain how you'd get there. I've just booted a lockdep enabled
> v5.5 on my D05, moved SPIs around (because that's the only way to reach
> this code), and nothing caught fire.
>
> Either the stack trace isn't reliable (when I read things like
> "80d:00h:35m:42s" in the trace, I'm a bit suspicious), or CPU hotplug is
> doing something really funky here.

The hotplug code cannot end up in schedule() either, and it holds the desc
lock just as normal affinity setting does. The other backtrace is more complete:

[  728.741808] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:80d:00h:35m:42s]
[  728.747895] rcu: 48-...0: (0 ticks this GP) idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
[  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
[  728.763018] Task dump for CPU 48:
[  728.766321] irqbalance      R  running task        0  1272      1 0x00000002
[  728.773358] Call trace:
[  728.775801]  __switch_to+0xbc/0x218
[  728.779283]  gic_set_affinity+0x16c/0x1d8
[  728.783282]  irq_do_set_affinity+0x30/0xd0
[  728.787365]  irq_set_affinity_locked+0xc8/0xf0
[  728.791796]  __irq_set_affinity+0x4c/0x80
[  728.795794]  write_irq_affinity.isra.7+0x104/0x120
[  728.800572]  irq_affinity_proc_write+0x1c/0x28
[  728.805008]  proc_reg_write+0x78/0xb8
[  728.808660]  __vfs_write+0x18/0x38
[  728.812050]  vfs_write+0xb4/0x1e0
[  728.815352]  ksys_write+0x68/0xf8
[  728.818655]  __arm64_sys_write+0x18/0x20
[  728.822567]  el0_svc_common.constprop.2+0x64/0x160
[  728.827345]  el0_svc_handler+0x20/0x80
[  728.831082]  el0_sync_handler+0xe4/0x188
[  728.834991]  el0_sync+0x140/0x180

But the __switch_to() there definitely does not make any sense at all.
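
For completeness, the hotplug side John's logs show ("IRQ 822: no longer
affine to CPU7") is kernel/irq/cpuhotplug.c, which looks roughly like this
(again paraphrased, not a verbatim copy) and also runs under the descriptor
lock with interrupts disabled on the outgoing CPU:

void irq_migrate_all_off_this_cpu(void)
{
	struct irq_desc *desc;
	unsigned int irq;

	for_each_active_irq(irq) {
		bool affinity_broken;

		desc = irq_to_desc(irq);
		raw_spin_lock(&desc->lock);
		affinity_broken = migrate_one_irq(desc);
		raw_spin_unlock(&desc->lock);

		if (affinity_broken)
			pr_warn_ratelimited("IRQ %u: no longer affine to CPU%u\n",
					    irq, smp_processor_id());
	}
}

So neither the proc write path nor the hotplug migration path has any
business showing a context switch in it.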

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-01 11:31                   ` Thomas Gleixner
@ 2020-02-03 10:30                     ` John Garry
  2020-02-03 10:49                       ` John Garry
  0 siblings, 1 reply; 26+ messages in thread
From: John Garry @ 2020-02-03 10:30 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, chenxiang (M),
	Keith Busch

On 01/02/2020 11:31, Thomas Gleixner wrote:
> Marc Zyngier <maz@kernel.org> writes:
>> On Sat, 1 Feb 2020 09:31:17 +0800
>> Ming Lei <tom.leiming@gmail.com> wrote:
>>> On Sat, Feb 1, 2020 at 2:02 AM John Garry <john.garry@huawei.com> wrote:
>>>
>>> gic_set_affinity shouldn't have switched out, so looks like one gic
>>> issue.
>>
>> Given that gic_set_affinity doesn't sleep, this looks pretty unlikely.
>> And __irq_set_affinity() holds a spinlock with irq disabled, so I can't
>> really explain how you'd get there. I've just booted a lockdep enabled
>> v5.5 on my D05, moved SPIs around (because that's the only way to reach
>> this code), and nothing caught fire.
>>
>> Either the stack trace isn't reliable (when I read things like
>> "80d:00h:35m:42s" in the trace, I'm a bit suspicious), or CPU hotplug is
>> doing something really funky here.
> 
> The hotplug code cannot end up in schedule either and it holds desc lock
> as normal affinity setting. The other backtrace is more complete,
> 
> [  728.741808] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:80d:00h:35m:42s]
> [  728.747895] rcu: 48-...0: (0 ticks this GP) idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
> [  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
> [  728.763018] Task dump for CPU 48:
> [  728.766321] irqbalance      R  running task        0  1272      1 0x00000002
> [  728.773358] Call trace:
> [  728.775801]  __switch_to+0xbc/0x218
> [  728.779283]  gic_set_affinity+0x16c/0x1d8
> [  728.783282]  irq_do_set_affinity+0x30/0xd0
> [  728.787365]  irq_set_affinity_locked+0xc8/0xf0
> [  728.791796]  __irq_set_affinity+0x4c/0x80
> [  728.795794]  write_irq_affinity.isra.7+0x104/0x120
> [  728.800572]  irq_affinity_proc_write+0x1c/0x28
> [  728.805008]  proc_reg_write+0x78/0xb8
> [  728.808660]  __vfs_write+0x18/0x38
> [  728.812050]  vfs_write+0xb4/0x1e0
> [  728.815352]  ksys_write+0x68/0xf8
> [  728.818655]  __arm64_sys_write+0x18/0x20
> [  728.822567]  el0_svc_common.constprop.2+0x64/0x160
> [  728.827345]  el0_svc_handler+0x20/0x80
> [  728.831082]  el0_sync_handler+0xe4/0x188
> [  728.834991]  el0_sync+0x140/0x180
> 
> But the __switch_to() there definitely does not make any sense at all.

I have put a full kernel log here for another run (hang snippet at end), 
including scripts:

https://pastebin.com/N4Gi5it6

This does not look good:

Jobs: 6 ([  519.858094] nvme nvme1: controller is down; will reset: 
CSTS=0xffffffff, PCI_STATUS=0xffff

An NVMe error also coincides with the hang. Another run has this:

  [  247.015206] pcieport 0000:00:08.0: can't change power state from
  D3cold to D0 (config space inaccessible)

I did test v5.4 previously and did not see this, but that would have 
included the v4 patchset in the $subject. I'll test v5.4 without that 
patchset now.

Thanks,
John



[  487.442928] CPU32: shutdown=2149MiB/s,w=0KiB/s][r=550k,w=0 IOPS][eta 
00m:11s]
[  487.445722] psci: CPU32 killed (polled 0 ms)
[  487.762957] CPU33: shutdown
[  487.765752] psci: CPU33 killed (polled 0 ms)
[  488.054863] CPU34: shutdown
[  488.057659] psci: CPU34 killed (polled 0 ms)
[  488.342910] CPU35: shutdown=2137MiB/s,w=0KiB/s][r=547k,w=0 IOPS][eta 
00m:10s]
[  488.345704] psci: CPU35 killed (polled 0 ms)
Jobs: 6[  509.614051] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks:
[  509.620140] rcu:     0-...0: (1 GPs behind) 
idle=896/1/0x4000000000000000 softirq=7411/7412 fqs=2626
[  509.629087] rcu:     36-...!: (0 ticks this GP) 
idle=ad2/1/0x4000000000000000 softirq=2631/2631 fqs=2627
[  509.638398]  (detected by 74, t=5257 jiffies, g=24921, q=1785)
[  509.644218] Task dump for CPU 0:
[  509.647435] irqbalance      R  running task        0  1308      1 
0x00000002
[  509.654472] Call trace:
[  509.656913]  __switch_to+0xbc/0x218
[  509.660392]  0xffff800017043d68
[  509.663521] Task dump for CPU 36:
[  509.666824] cpuhp/36        R  running task        0   192      2 
0x0000002a
[  509.673860] Call trace:
[  509.676295]  __switch_to+0xbc/0x218
[  509.679776]  page_wait_table+0x1500/0x1800
Jobs: 6 ([  519.858094] nvme nvme1: controller is down; will reset: 
CSTS=0xffffffff, PCI_STATUS=0xffff
Jobs: 6 (f=6[  572.634049] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks:6s]
[  572.640136] rcu:     0-...0: (1 GPs behind) 
idle=896/1/0x4000000000000000 softirq=7411/7412 fqs=10499
[  572.649171] rcu:     36-...!: (0 ticks this GP) 
idle=ad2/1/0x4000000000000000 softirq=2631/2631 fqs=10500
[  572.658561]  (detected by 51, t=21012 jiffies, g=24921, q=2275)
[  572.664469] Task dump for CPU 0:
[  572.667685] irqbalance      R  running task        0  1308      1 
0x00000002
[  572.674723] Call trace:
[  572.677164]  __switch_to+0xbc/0x218
[  572.680641]  0xffff800017043d68
[  572.683772] Task dump for CPU 36:
[  572.687076] cpuhp/36        R  running task        0   192      2 
0x0000002a
[  572.694112] Call trace:
[  572.696547]  __switch_to+0xbc/0x218
[  572.700027]  page_wait_table+0x1500/0x1800
[  635.654053] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  635.660141] rcu:     0-...0: (1 GPs behind) 
idle=896/1/0x4000000000000000 softirq=7411/7412 fqs=18372
[  635.669175] rcu:     36-...!: (0 ticks this GP) 
idle=ad2/1/0x4000000000000000 softirq=2631/2631 fqs=18373
[  635.678576]  (detected by 92, t=36768 jiffies, g=24921, q=2279)
[  635.684482] Task dump for CPU 0:
[  635.687699] irqbalance      R  running task        0  1308      1 
0x00000002
[  635.694735] Call trace:
[  635.697172]  __switch_to+0xbc/0x218
[  635.700648]  0xffff800017043d68
[  635.703778] Task dump for CPU 36:
[  635.707081] cpuhp/36        R  running task        0   192      2 
0x0000002a
[  635.714117] Call trace:
[  635.716552]  __switch_to+0xbc/0x218
[  635.720030]  page_wait_table+0x1500/0x1800
[  698.674054] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  698.680141] rcu:     0-...0: (1 GPs behind) 
idle=896/1/0x4000000000000000 softirq=7411/7412 fqs=26242
[  698.689174] rcu:     36-...!: (0 ticks this GP) 
idle=ad2/1/0x4000000000000000 softirq=2631/2631 fqs=26242
[  698.698563]  (detected by 81, t=52522 jiffies, g=24921, q=2281)
[  698.704469] Task dump for CPU 0:
[  698.707684] irqbalance      R  running task        0  1308      1 
0x00000002
[  698.714721] Call trace:
[  698.717158]  __switch_to+0xbc/0x218
[  698.720634]  0xffff800017043d68
[  698.723764] Task dump for CPU 36:
[  698.727066] cpuhp/36        R  running task        0   192      2 
0x0000002a
[  698.734101] Call trace:
[  698.736536]  __switch_to+0xbc/0x218
[  698.740013]  page_wait_table+0x1500/0x1800
[  761.694053] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  761.700136] rcu:     0-...0: (1 GPs behind) 
idle=896/1/0x4000000000000000 softirq=7411/7412 fqs=34111
[  761.709169] rcu:     36-...!: (0 ticks this GP) 
idle=ad2/1/0x4000000000000000 44]  __switch_to+0xbc/0x218
[  761.740620]  0xffff800017043d68
[  761.743749] Task dump for CPU 36:
[  761.747051] cpuhp/36        R  running task        0   192      2 
0x0000002a
[  761.754087] Call trace:
[  761.756521]  __switch_to+0xbc/0x218
[  761.759998]  page_wait_table+0x1500/0x1800


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-03 10:30                     ` John Garry
@ 2020-02-03 10:49                       ` John Garry
  2020-02-03 10:59                         ` Ming Lei
  0 siblings, 1 reply; 26+ messages in thread
From: John Garry @ 2020-02-03 10:49 UTC (permalink / raw)
  To: Thomas Gleixner, Marc Zyngier, Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, chenxiang (M),
	Keith Busch, liudongdong (C),
	wanghuiqiang, Zhou Wang

On 03/02/2020 10:30, John Garry wrote:
> On 01/02/2020 11:31, Thomas Gleixner wrote:
>> Marc Zyngier <maz@kernel.org> writes:
>>> On Sat, 1 Feb 2020 09:31:17 +0800
>>> Ming Lei <tom.leiming@gmail.com> wrote:
>>>> On Sat, Feb 1, 2020 at 2:02 AM John Garry <john.garry@huawei.com> 
>>>> wrote:
>>>>
>>>> gic_set_affinity shouldn't have switched out, so looks like one gic
>>>> issue.
>>>
>>> Given that gic_set_affinity doesn't sleep, this looks pretty unlikely.
>>> And __irq_set_affinity() holds a spinlock with irq disabled, so I can't
>>> really explain how you'd get there. I've just booted a lockdep enabled
>>> v5.5 on my D05, moved SPIs around (because that's the only way to reach
>>> this code), and nothing caught fire.
>>>
>>> Either the stack trace isn't reliable (when I read things like
>>> "80d:00h:35m:42s" in the trace, I'm a bit suspicious), or CPU hotplug is
>>> doing something really funky here.
>>
>> The hotplug code cannot end up in schedule either and it holds desc lock
>> as normal affinity setting. The other backtrace is more complete,
>>
>> [  728.741808] rcu: INFO: rcu_preempt detected stalls on 
>> CPUs/tasks:80d:00h:35m:42s]
>> [  728.747895] rcu: 48-...0: (0 ticks this GP) 
>> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
>> [  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
>> [  728.763018] Task dump for CPU 48:
>> [  728.766321] irqbalance      R  running task        0  1272      1 
>> 0x00000002
>> [  728.773358] Call trace:
>> [  728.775801]  __switch_to+0xbc/0x218
>> [  728.779283]  gic_set_affinity+0x16c/0x1d8
>> [  728.783282]  irq_do_set_affinity+0x30/0xd0
>> [  728.787365]  irq_set_affinity_locked+0xc8/0xf0
>> [  728.791796]  __irq_set_affinity+0x4c/0x80
>> [  728.795794]  write_irq_affinity.isra.7+0x104/0x120
>> [  728.800572]  irq_affinity_proc_write+0x1c/0x28
>> [  728.805008]  proc_reg_write+0x78/0xb8
>> [  728.808660]  __vfs_write+0x18/0x38
>> [  728.812050]  vfs_write+0xb4/0x1e0
>> [  728.815352]  ksys_write+0x68/0xf8
>> [  728.818655]  __arm64_sys_write+0x18/0x20
>> [  728.822567]  el0_svc_common.constprop.2+0x64/0x160
>> [  728.827345]  el0_svc_handler+0x20/0x80
>> [  728.831082]  el0_sync_handler+0xe4/0x188
>> [  728.834991]  el0_sync+0x140/0x180
>>
>> But the __switch_to() there definitely does not make any sense at all.
> 
> I have put a full kernel log here for another run (hang snippet at end), 
> including scripts:
> 
> https://pastebin.com/N4Gi5it6
> 
> This does not look good:
> 
> Jobs: 6 ([  519.858094] nvme nvme1: controller is down; will reset: 
> CSTS=0xffffffff, PCI_STATUS=0xffff
> 
> And some NVMe error also coincides with the hang. Another run has this:
> 
>   [  247.015206] pcieport 0000:00:08.0: can't change power state from
>   D3cold to D0 (config space inaccessible)
> 
> I did test v5.4 previously and did not see this, but that would have 
> included the v4 patchset in the $subject. I'll test v5.4 without that 
> patchset now.

So v5.4 does have this issue also:

[  705.669512] psci: CPU24 killed (polled 0 ms)
[  705.966753] CPU25: shutdown
[  705.969548] psci: CPU25 killed (polled 0 ms)
[  706.250771] CPU26: shutdown=2347MiB/s,w=0KiB/s][r=601k,w=0 IOPS][eta 
00m:13s]
[  706.253565] psci: CPU26 killed (polled 0 ms)
[  706.514728] CPU27: shutdown
[  706.517523] psci: CPU27 killed (polled 0 ms)
[  706.826708] CPU28: shutdown
[  706.829502] psci: CPU28 killed (polled 0 ms)
[  707.130916] CPU29: shutdown=2134MiB/s,w=0KiB/s][r=546k,w=0 IOPS][eta 
00m:12s]
[  707.133714] psci: CPU29 killed (polled 0 ms)
[  707.439066] CPU30: shutdown
[  707.441870] psci: CPU30 killed (polled 0 ms)
[  707.706727] CPU31: shutdown
[  707.709526] psci: CPU31 killed (polled 0 ms)
[  708.521853] pcieport 0000:00:08.0: can't change power state from 
D3cold to D0 (config space inaccessible)
[  728.741808] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks:80d:00h:35m:42s]
[  728.747895] rcu: 48-...0: (0 ticks this GP) 
idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
[  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
[  728.763018] Task dump for CPU 48:
[  728.766321] irqbalance      R  running task        0  1272      1 
0x00000002
[  728.773358] Call trace:
[  728.775801]  __switch_to+0xbc/0x218
[  728.779283]  gic_set_affinity+0x16c/0x1d8
[  728.783282]  irq_do_set_affinity+0x30/0xd0
[  728.787365]  irq_set_affinity_locked+0xc8/0xf0
[  728.791796]  __irq_set_affinity+0x4c/0x80
[  728.795794]  write_irq_affinity.isra.7+0x104/0x120
[  728.800572]  irq_affinity_proc_write+0x1c/0x28
[  728.805008]  proc_reg_write+0x78/0xb8
[  728.808660]  __vfs_write+0x18/0x38
[  728.812050]  vfs_write+0xb4/0x1e0
[  728.815352]  ksys_write+0x68/0xf8
[  728.818655]  __arm64_sys_write+0x18/0x20
[  728.822567]  el0_svc_common.constprop.2+0x64/0x160
[  728.827345]  el0_svc_handler+0x20/0x80
[  728.831082]  el0_sync_handler+0xe4/0x188
[  728.834991]  el0_sync+0x140/0x180
[  738.993844] nvme nvme1: controller is down; will reset: 
CSTS=0xffffffff, PCI_STATUS=0xffff
[  791.761810] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks:63d:14h:24m:13s]
[  791.767897] rcu: 48-...0: (0 ticks this GP) 
idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=10495
[  791.777281] (detected by 54, t=21010 jiffies, g=40989, q=2396)
[  791.783189] Task dump for CPU 48:
[  791.786491] irqbalance      R  running task        0  1272      1 
0x00000002
[  791.793528] Call trace:
[  791.795964]  __switch_to+0xbc/0x218
[  791.799441]  gic_set_affinity+0x16c/0x1d8
[  791.803439]  irq_do_set_affinity+0x30/0xd0
[  791.807522]  irq_set_affinity_locked+0xc8/0xf0
[  791.811953]  __irq_set_affinity+0x4c/0x80
[  791.815949]  write_irq_affinity.isra.7+0x104/0x120
[  791.820727]  irq_affinity_proc_write+0x1c/0x28
[  791.825158]  proc_reg_write+0x78/0xb8
[  791.828808]  __vfs_write+0x18/0x38
[  791.832197]  vfs_write+0xb4/0x1e0
[  791.835500]  ksys_write+0x68/0xf8
[  791.838802]  __arm64_sys_write+0x18/0x20
[  791.842712]  el0_svc_common.constprop.2+0x64/0x160
[  791.847490]  el0_svc_handler+0x20/0x80
[  791.851226]  el0_sync_handler+0xe4/0x188
[  791.855135]  el0_sync+0x140/0x180
Jobs: 6 (f=6): [R(6)][0.0%][r=0KiB/s

Thanks,
John


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-03 10:49                       ` John Garry
@ 2020-02-03 10:59                         ` Ming Lei
  2020-02-03 12:56                           ` John Garry
  0 siblings, 1 reply; 26+ messages in thread
From: Ming Lei @ 2020-02-03 10:59 UTC (permalink / raw)
  To: John Garry
  Cc: Thomas Gleixner, Marc Zyngier, Ming Lei, Jens Axboe, linux-block,
	Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	chenxiang (M), Keith Busch, liudongdong (C),
	wanghuiqiang, Zhou Wang

On Mon, Feb 3, 2020 at 6:49 PM John Garry <john.garry@huawei.com> wrote:
>
> On 03/02/2020 10:30, John Garry wrote:
> > On 01/02/2020 11:31, Thomas Gleixner wrote:
> >> Marc Zyngier <maz@kernel.org> writes:
> >>> On Sat, 1 Feb 2020 09:31:17 +0800
> >>> Ming Lei <tom.leiming@gmail.com> wrote:
> >>>> On Sat, Feb 1, 2020 at 2:02 AM John Garry <john.garry@huawei.com>
> >>>> wrote:
> >>>>
> >>>> gic_set_affinity shouldn't have switched out, so looks like one gic
> >>>> issue.
> >>>
> >>> Given that gic_set_affinity doesn't sleep, this looks pretty unlikely.
> >>> And __irq_set_affinity() holds a spinlock with irq disabled, so I can't
> >>> really explain how you'd get there. I've just booted a lockdep enabled
> >>> v5.5 on my D05, moved SPIs around (because that's the only way to reach
> >>> this code), and nothing caught fire.
> >>>
> >>> Either the stack trace isn't reliable (when I read things like
> >>> "80d:00h:35m:42s" in the trace, I'm a bit suspicious), or CPU hotplug is
> >>> doing something really funky here.
> >>
> >> The hotplug code cannot end up in schedule either and it holds desc lock
> >> as normal affinity setting. The other backtrace is more complete,
> >>
> >> [  728.741808] rcu: INFO: rcu_preempt detected stalls on
> >> CPUs/tasks:80d:00h:35m:42s]
> >> [  728.747895] rcu: 48-...0: (0 ticks this GP)
> >> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
> >> [  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
> >> [  728.763018] Task dump for CPU 48:
> >> [  728.766321] irqbalance      R  running task        0  1272      1
> >> 0x00000002
> >> [  728.773358] Call trace:
> >> [  728.775801]  __switch_to+0xbc/0x218
> >> [  728.779283]  gic_set_affinity+0x16c/0x1d8
> >> [  728.783282]  irq_do_set_affinity+0x30/0xd0
> >> [  728.787365]  irq_set_affinity_locked+0xc8/0xf0
> >> [  728.791796]  __irq_set_affinity+0x4c/0x80
> >> [  728.795794]  write_irq_affinity.isra.7+0x104/0x120
> >> [  728.800572]  irq_affinity_proc_write+0x1c/0x28
> >> [  728.805008]  proc_reg_write+0x78/0xb8
> >> [  728.808660]  __vfs_write+0x18/0x38
> >> [  728.812050]  vfs_write+0xb4/0x1e0
> >> [  728.815352]  ksys_write+0x68/0xf8
> >> [  728.818655]  __arm64_sys_write+0x18/0x20
> >> [  728.822567]  el0_svc_common.constprop.2+0x64/0x160
> >> [  728.827345]  el0_svc_handler+0x20/0x80
> >> [  728.831082]  el0_sync_handler+0xe4/0x188
> >> [  728.834991]  el0_sync+0x140/0x180
> >>
> >> But the __switch_to() there definitely does not make any sense at all.
> >
> > I have put a full kernel log here for another run (hang snippet at end),
> > including scripts:
> >
> > https://pastebin.com/N4Gi5it6
> >
> > This does not look good:
> >
> > Jobs: 6 ([  519.858094] nvme nvme1: controller is down; will reset:
> > CSTS=0xffffffff, PCI_STATUS=0xffff
> >
> > And some NVMe error also coincides with the hang. Another run has this:
> >
> >   [  247.015206] pcieport 0000:00:08.0: can't change power state from
> >   D3cold to D0 (config space inaccessible)
> >
> > I did test v5.4 previously and did not see this, but that would have
> > included the v4 patchset in the $subject. I'll test v5.4 without that
> > patchset now.
>
> So v5.4 does have this issue also:

v5.5?

>
> [  705.669512] psci: CPU24 killed (polled 0 ms)
> [  705.966753] CPU25: shutdown
> [  705.969548] psci: CPU25 killed (polled 0 ms)
> [  706.250771] CPU26: shutdown=2347MiB/s,w=0KiB/s][r=601k,w=0 IOPS][eta
> 00m:13s]
> [  706.253565] psci: CPU26 killed (polled 0 ms)
> [  706.514728] CPU27: shutdown
> [  706.517523] psci: CPU27 killed (polled 0 ms)
> [  706.826708] CPU28: shutdown
> [  706.829502] psci: CPU28 killed (polled 0 ms)
> [  707.130916] CPU29: shutdown=2134MiB/s,w=0KiB/s][r=546k,w=0 IOPS][eta
> 00m:12s]
> [  707.133714] psci: CPU29 killed (polled 0 ms)
> [  707.439066] CPU30: shutdown
> [  707.441870] psci: CPU30 killed (polled 0 ms)
> [  707.706727] CPU31: shutdown
> [  707.709526] psci: CPU31 killed (polled 0 ms)
> [  708.521853] pcieport 0000:00:08.0: can't change power state from
> D3cold to D0 (config space inaccessible)
> [  728.741808] rcu: INFO: rcu_preempt detected stalls on
> CPUs/tasks:80d:00h:35m:42s]
> [  728.747895] rcu: 48-...0: (0 ticks this GP)
> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
> [  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
> [  728.763018] Task dump for CPU 48:
> [  728.766321] irqbalance      R  running task        0  1272      1
> 0x00000002
> [  728.773358] Call trace:
> [  728.775801]  __switch_to+0xbc/0x218
> [  728.779283]  gic_set_affinity+0x16c/0x1d8
> [  728.783282]  irq_do_set_affinity+0x30/0xd0
> [  728.787365]  irq_set_affinity_locked+0xc8/0xf0
> [  728.791796]  __irq_set_affinity+0x4c/0x80
> [  728.795794]  write_irq_affinity.isra.7+0x104/0x120
> [  728.800572]  irq_affinity_proc_write+0x1c/0x28
> [  728.805008]  proc_reg_write+0x78/0xb8
> [  728.808660]  __vfs_write+0x18/0x38
> [  728.812050]  vfs_write+0xb4/0x1e0
> [  728.815352]  ksys_write+0x68/0xf8
> [  728.818655]  __arm64_sys_write+0x18/0x20
> [  728.822567]  el0_svc_common.constprop.2+0x64/0x160
> [  728.827345]  el0_svc_handler+0x20/0x80
> [  728.831082]  el0_sync_handler+0xe4/0x188
> [  728.834991]  el0_sync+0x140/0x180
> [  738.993844] nvme nvme1: controller is down; will reset:
> CSTS=0xffffffff, PCI_STATUS=0xffff
> [  791.761810] rcu: INFO: rcu_preempt detected stalls on
> CPUs/tasks:63d:14h:24m:13s]
> [  791.767897] rcu: 48-...0: (0 ticks this GP)
> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=10495
> [  791.777281] (detected by 54, t=21010 jiffies, g=40989, q=2396)
> [  791.783189] Task dump for CPU 48:
> [  791.786491] irqbalance      R  running task        0  1272      1
> 0x00000002
> [  791.793528] Call trace:
> [  791.795964]  __switch_to+0xbc/0x218
> [  791.799441]  gic_set_affinity+0x16c/0x1d8
> [  791.803439]  irq_do_set_affinity+0x30/0xd0
> [  791.807522]  irq_set_affinity_locked+0xc8/0xf0
> [  791.811953]  __irq_set_affinity+0x4c/0x80
> [  791.815949]  write_irq_affinity.isra.7+0x104/0x120
> [  791.820727]  irq_affinity_proc_write+0x1c/0x28
> [  791.825158]  proc_reg_write+0x78/0xb8
> [  791.828808]  __vfs_write+0x18/0x38
> [  791.832197]  vfs_write+0xb4/0x1e0
> [  791.835500]  ksys_write+0x68/0xf8
> [  791.838802]  __arm64_sys_write+0x18/0x20
> [  791.842712]  el0_svc_common.constprop.2+0x64/0x160
> [  791.847490]  el0_svc_handler+0x20/0x80
> [  791.851226]  el0_sync_handler+0xe4/0x188
> [  791.855135]  el0_sync+0x140/0x180
> Jobs: 6 (f=6): [R(6)][0.0%][r=0KiB/s

Can you trigger it after disabling irqbalance?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-03 10:59                         ` Ming Lei
@ 2020-02-03 12:56                           ` John Garry
  2020-02-03 15:43                             ` Marc Zyngier
  0 siblings, 1 reply; 26+ messages in thread
From: John Garry @ 2020-02-03 12:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Thomas Gleixner, Marc Zyngier, Ming Lei, Jens Axboe, linux-block,
	Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	chenxiang (M), Keith Busch, liudongdong (C),
	wanghuiqiang, Wangzhou (B)

>>>
>>> Jobs: 6 ([  519.858094] nvme nvme1: controller is down; will reset:
>>> CSTS=0xffffffff, PCI_STATUS=0xffff
>>>
>>> And some NVMe error also coincides with the hang. Another run has this:
>>>
>>>    [  247.015206] pcieport 0000:00:08.0: can't change power state from
>>>    D3cold to D0 (config space inaccessible)
>>>
>>> I did test v5.4 previously and did not see this, but that would have
>>> included the v4 patchset in the $subject. I'll test v5.4 without that
>>> patchset now.
>>
>> So v5.4 does have this issue also:
> 
> v5.5?

I am saying that both v5.4 and v5.5 have the issue. Below is the kernel 
hang snippet for v5.4.

> 
>>
>> [  705.669512] psci: CPU24 killed (polled 0 ms)
>> [  705.966753] CPU25: shutdown
>> [  705.969548] psci: CPU25 killed (polled 0 ms)
>> [  706.250771] CPU26: shutdown=2347MiB/s,w=0KiB/s][r=601k,w=0 IOPS][eta
>> 00m:13s]
>> [  706.253565] psci: CPU26 killed (polled 0 ms)
>> [  706.514728] CPU27: shutdown
>> [  706.517523] psci: CPU27 killed (polled 0 ms)
>> [  706.826708] CPU28: shutdown
>> [  706.829502] psci: CPU28 killed (polled 0 ms)
>> [  707.130916] CPU29: shutdown=2134MiB/s,w=0KiB/s][r=546k,w=0 IOPS][eta
>> 00m:12s]
>> [  707.133714] psci: CPU29 killed (polled 0 ms)
>> [  707.439066] CPU30: shutdown
>> [  707.441870] psci: CPU30 killed (polled 0 ms)
>> [  707.706727] CPU31: shutdown
>> [  707.709526] psci: CPU31 killed (polled 0 ms)
>> [  708.521853] pcieport 0000:00:08.0: can't change power state from
>> D3cold to D0 (config space inaccessible)
>> [  728.741808] rcu: INFO: rcu_preempt detected stalls on
>> CPUs/tasks:80d:00h:35m:42s]
>> [  728.747895] rcu: 48-...0: (0 ticks this GP)
>> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=2626
>> [  728.757197] (detected by 63, t=5255 jiffies, g=40989, q=1890)
>> [  728.763018] Task dump for CPU 48:
>> [  728.766321] irqbalance      R  running task        0  1272      1
>> 0x00000002
>> [  728.773358] Call trace:
>> [  728.775801]  __switch_to+0xbc/0x218
>> [  728.779283]  gic_set_affinity+0x16c/0x1d8
>> [  728.783282]  irq_do_set_affinity+0x30/0xd0
>> [  728.787365]  irq_set_affinity_locked+0xc8/0xf0
>> [  728.791796]  __irq_set_affinity+0x4c/0x80
>> [  728.795794]  write_irq_affinity.isra.7+0x104/0x120
>> [  728.800572]  irq_affinity_proc_write+0x1c/0x28
>> [  728.805008]  proc_reg_write+0x78/0xb8
>> [  728.808660]  __vfs_write+0x18/0x38
>> [  728.812050]  vfs_write+0xb4/0x1e0
>> [  728.815352]  ksys_write+0x68/0xf8
>> [  728.818655]  __arm64_sys_write+0x18/0x20
>> [  728.822567]  el0_svc_common.constprop.2+0x64/0x160
>> [  728.827345]  el0_svc_handler+0x20/0x80
>> [  728.831082]  el0_sync_handler+0xe4/0x188
>> [  728.834991]  el0_sync+0x140/0x180
>> [  738.993844] nvme nvme1: controller is down; will reset:
>> CSTS=0xffffffff, PCI_STATUS=0xffff
>> [  791.761810] rcu: INFO: rcu_preempt detected stalls on
>> CPUs/tasks:63d:14h:24m:13s]
>> [  791.767897] rcu: 48-...0: (0 ticks this GP)
>> idle=b3e/1/0x4000000000000000 softirq=5548/5548 fqs=10495
>> [  791.777281] (detected by 54, t=21010 jiffies, g=40989, q=2396)
>> [  791.783189] Task dump for CPU 48:
>> [  791.786491] irqbalance      R  running task        0  1272      1
>> 0x00000002
>> [  791.793528] Call trace:
>> [  791.795964]  __switch_to+0xbc/0x218
>> [  791.799441]  gic_set_affinity+0x16c/0x1d8
>> [  791.803439]  irq_do_set_affinity+0x30/0xd0
>> [  791.807522]  irq_set_affinity_locked+0xc8/0xf0
>> [  791.811953]  __irq_set_affinity+0x4c/0x80
>> [  791.815949]  write_irq_affinity.isra.7+0x104/0x120
>> [  791.820727]  irq_affinity_proc_write+0x1c/0x28
>> [  791.825158]  proc_reg_write+0x78/0xb8
>> [  791.828808]  __vfs_write+0x18/0x38
>> [  791.832197]  vfs_write+0xb4/0x1e0
>> [  791.835500]  ksys_write+0x68/0xf8
>> [  791.838802]  __arm64_sys_write+0x18/0x20
>> [  791.842712]  el0_svc_common.constprop.2+0x64/0x160
>> [  791.847490]  el0_svc_handler+0x20/0x80
>> [  791.851226]  el0_sync_handler+0xe4/0x188
>> [  791.855135]  el0_sync+0x140/0x180
>> Jobs: 6 (f=6): [R(6)][0.0%][r=0KiB/s
> 
> Can you trigger it after disabling irqbalance?

No - I tested by killing the irqbalance process, and it ran for 25 
minutes without issue.

Thanks,
John

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-03 12:56                           ` John Garry
@ 2020-02-03 15:43                             ` Marc Zyngier
  2020-02-03 18:16                               ` John Garry
  0 siblings, 1 reply; 26+ messages in thread
From: Marc Zyngier @ 2020-02-03 15:43 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Thomas Gleixner, Ming Lei, Jens Axboe, linux-block,
	Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	chenxiang (M), Keith Busch, liudongdong (C),
	wanghuiqiang, Wangzhou (B)

On 2020-02-03 12:56, John Garry wrote:

[...]

>> Can you trigger it after disabling irqbalance?
> 
> No, so tested by killing the irqbalance process and it ran for 25
> minutes without issue.

OK, that's interesting.

Can you find out whether irqbalance tries to move an interrupt to an 
offlined CPU?
Just putting a trace into gic_set_affinity() should be enough.
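
Something along these lines would do (an illustrative sketch only: the 
helper name and message format are made up here, not anything already in 
the tree):

	/*
	 * Illustrative only: shout if the requested affinity mask no longer
	 * intersects cpu_online_mask, which is what a racing irqbalance write
	 * against a CPU that has just gone offline would look like.
	 */
	static void gic_trace_affinity_request(struct irq_data *d,
					       const struct cpumask *mask_val)
	{
		if (cpumask_any_and(mask_val, cpu_online_mask) >= nr_cpu_ids)
			pr_warn("irq %d: affinity %*pbl has no online CPU (online: %*pbl)\n",
				d->irq, cpumask_pr_args(mask_val),
				cpumask_pr_args(cpu_online_mask));
	}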

Thanks,

           M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-03 15:43                             ` Marc Zyngier
@ 2020-02-03 18:16                               ` John Garry
  2020-02-05 14:08                                 ` John Garry
  0 siblings, 1 reply; 26+ messages in thread
From: John Garry @ 2020-02-03 18:16 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Ming Lei, Thomas Gleixner, Ming Lei, Jens Axboe, linux-block,
	Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	chenxiang (M), Keith Busch, liudongdong (C),
	wanghuiqiang, Wangzhou (B)

On 03/02/2020 15:43, Marc Zyngier wrote:
> On 2020-02-03 12:56, John Garry wrote:
> 
> [...]
> 
>>> Can you trigger it after disabling irqbalance?
>>
>> No, so tested by killing the irqbalance process and it ran for 25
>> minutes without issue.
> 
> OK, that's interesting.
> 
> Can you find out whether irqbalance tries to move an interrupt to an 
> offlined CPU?
> Just putting a trace into gic_set_affinity() should be enough.
> 

Hi Marc,

I should have mentioned this already, but this board is the same D06 
which I reported had the CPU0 hotplug issue from broken FW, if you remember:

https://lore.kernel.org/linux-arm-kernel/fd70f499-83b4-2fdd-d043-ea9ab8f2c636@huawei.com/

That's why I'm not including CPU0 in my hotplug testing. Other CPUs were 
fine. And it doesn't look like that issue.

Apart from that, I tried as you suggested, with this change:

+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1143,6 +1143,9 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,
 	int enabled;
 	u64 val;

+	pr_err("%s irq%d mask_val=%*pbl cpu_online_mask=%*pbl force=%d cpumask_any_and=%d\n",
+	       __func__, d->irq, cpumask_pr_args(mask_val), cpumask_pr_args(cpu_online_mask), force, cpumask_any_and(cpu_online_mask, mask_val));
+
 	if (force)
 		cpu = cpumask_first(mask_val);
 	else
@@ -1176,6 +1179,9 @@ static int gic_set_affinity(struct irq_data *d, const struct cpumask *mask_val,

 	irq_data_update_effective_affinity(d, cpumask_of(cpu));

+	pr_err("%s1 irq%d mask_val=%*pbl cpu_online_mask=%*pbl force=%d cpumask_any_and=%d cpu=%d\n",
+	       __func__, d->irq, cpumask_pr_args(mask_val), cpumask_pr_args(cpu_online_mask), force, cpumask_any_and(cpu_online_mask, mask_val), cpu);
+
 	return IRQ_SET_MASK_OK_DONE;
 }

And see this:

polled 0 ms)
[  947.551340] GICv3: gic_set_affinity irq5 mask_val=24-47 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
[  947.560986] GICv3: gic_set_affinity1 irq5 mask_val=24-47 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44 cpu=44
[  947.571321] GICv3: gic_set_affinity irq8 mask_val=24-47 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
[  947.580963] GICv3: gic_set_affinity1 irq8 mask_val=24-47 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44 cpu=44
[  947.591581] IRQ 819: no longer affine to CPU43
[  947.596149] CPU43: shutdown
[  947.598945] psci: CPU43 killed (polled 0 ms)
[  947.607029] GICv3: gic_set_affinity irq5 mask_val=29-47 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
Jobs: 6 (f=6): [R(6)][0.0%][r=0KiB/[  968.614971] rcu: INFO: rcu_preempt 
detected stalls on CPUs/tasks:
[  968.621062] rcu: 66-...0: (0 ticks this GP) 
idle=d9a/1/0x4000000000000000 softirq=3654/3654 fqs=2625
[  968.630365] (detected by 69, t=5256 jiffies, g=51577, q=1884)
[  968.636187] Task dump for CPU 66:
[  968.639490] irqbalance      R  running task        0  1577      1 
0x00000002
[  968.646527] Call trace:
[  968.648970]  __switch_to+0xbc/0x218
[  968.652450]  irq_do_set_affinity+0x30/0xd0
[  968.656534]  irq_set_affinity_locked+0xc8/0xf0
[  968.660965]  __irq_set_affinity+0x4c/0x80
[  968.664963]  write_irq_affinity.isra.7+0x104/0x120
[  968.669741]  irq_affinity_proc_write+0x1c/0x28
[  968.674175]  proc_reg_write+0x78/0xb8
[  968.677827]  __vfs_write+0x18/0x38
[  968.681217]  vfs_write+0xb4/0x1e0
[  968.684519]  ksys_write+0x68/0xf8
[  968.687822]  __arm64_sys_write+0x18/0x20
[  968.691733]  el0_svc_common.constprop.2+0x64/0x160
[  968.696511]  el0_svc_handler+0x20/0x80
[  968.700247]  el0_sync_handler+0xe4/0x188
[  968.704157]  el0_sync+0x140/0x180

and this on a 2nd run:

[  215.468476] CPU42: shutdown
[  215.471272] psci: CPU42 killed (polled 0 ms)
[  215.723714] GICv3: gic_set_affinity irq5 mask_val=24-47 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
[  215.733360] GICv3: gic_set_affinity1 irq5 mask_val=24-47 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44 cpu=44
[  215.743696] GICv3: gic_set_affinity irq8 mask_val=24-47 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44
[  215.753338] GICv3: gic_set_affinity1 irq8 mask_val=24-47 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=44 cpu=44
[  215.763835] IRQ 426: no longer affine to CPU43
[  215.768412] IRQ 819: no longer affine to CPU43
[  215.773023] CPU43: shutdown
Jobs: 6 (f=6): [R(6)][76.9%][r=13[  215.775834] psci: CPU43 killed 
(polled 0 ms)
[  216.604779] GICv3: gic_set_affinity irq10 mask_val=53 
cpu_online_mask=0,44-95 force=0 cpumask_any_and=53
[  217.223461] pcieport 0000:00:08.0: can't change power state from 
D3cold to D0 (config space inaccessible)
[  237.615383] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks:2d:17h:39m:00s]
[  237.621469] rcu: 58-...0: (1 GPs behind) 
idle=b5e/1/0x4000000000000000 softirq=1908/1908 fqs=2626
[  237.630525] (detected by 44, t=5254 jiffies, g=12137, q=191)
[  237.636260] Task dump for CPU 58:
[  237.639563] irqbalance      R  running task        0  1567      1 
0x00000002
[  237.646599] Call trace:
[  237.649037]  __switch_to+0xbc/0x218
[  237.652513]  0xffff80001529bd68
[  239.283412] nvme nvme1: controller is down; will reset: 
CSTS=0xffffffff, PCI_STATUS=0xffff
[  300.635382] rcu: INFO: rcu_preempt detected stalls on 
CPUs/tasks:03d:01h:02m:56s]
[  300.641466] rcu: 58-...0: (1 GPs behind) 
idle=b5e/1/0x4000000000000000 softirq=1908/1908 fqs=10503
[  300.650589] (detected by 44, t=21010 jiffies, g=12137, q=698)
[  300.656410] Task dump for CPU 58:
[  300.659712] irqbalance      R  running task        0  1567      1 
0x00000002
[  300.666747] Call trace:

Info about irq5 and irq10 after booting:

john@ubuntu:~$ ls /proc/irq/5
affinity_hint       effective_affinity_list  smp_affinity       spurious
effective_affinity  node                     smp_affinity_list  uart-pl011
john@ubuntu:~$ more /proc/irq/5/smp_affinity_list
24-47
john@ubuntu:~$ ls /proc/irq/10
affinity_hint            ehci_hcd:usb1  ohci_hcd:usb3  smp_affinity_list
effective_affinity       ehci_hcd:usb2  ohci_hcd:usb4  spurious
effective_affinity_list  node           smp_affinity
john@ubuntu:~$ more /proc/irq/10/smp_affinity_list
71

Thanks,
John

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-03 18:16                               ` John Garry
@ 2020-02-05 14:08                                 ` John Garry
  2020-02-05 14:23                                   ` Marc Zyngier
  0 siblings, 1 reply; 26+ messages in thread
From: John Garry @ 2020-02-05 14:08 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Ming Lei, Thomas Gleixner, Ming Lei, Jens Axboe, linux-block,
	Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	chenxiang (M), Keith Busch, liudongdong (C),
	wanghuiqiang, Wangzhou (B)

On 03/02/2020 18:16, John Garry wrote:
> On 03/02/2020 15:43, Marc Zyngier wrote:
>> On 2020-02-03 12:56, John Garry wrote:
>>
>> [...]
>>
>>>> Can you trigger it after disabling irqbalance?
>>>
>>> No, so tested by killing the irqbalance process and it ran for 25
>>> minutes without issue.
>>
>> OK, that's interesting.
>>
>> Can you find out whether irqbalance tries to move an interrupt to an 
>> offlined CPU?
>> Just putting a trace into gic_set_affinity() should be enough.
>>
> 

Just an update here: I have tried this same test on a new model dev 
board and I don't experience the same issue. It's quite stable.

I'd like to get to the bottom of the issue reported, but I feel that the 
root cause may be a BIOS issue and I will get next to no BIOS support 
for that particular board. Hmmm.

Thanks,
John


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-02-05 14:08                                 ` John Garry
@ 2020-02-05 14:23                                   ` Marc Zyngier
  0 siblings, 0 replies; 26+ messages in thread
From: Marc Zyngier @ 2020-02-05 14:23 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Thomas Gleixner, Ming Lei, Jens Axboe, linux-block,
	Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	chenxiang (M), Keith Busch, liudongdong (C),
	wanghuiqiang, Wangzhou (B)

On 2020-02-05 14:08, John Garry wrote:
> On 03/02/2020 18:16, John Garry wrote:
>> On 03/02/2020 15:43, Marc Zyngier wrote:
>>> On 2020-02-03 12:56, John Garry wrote:
>>> 
>>> [...]
>>> 
>>>>> Can you trigger it after disabling irqbalance?
>>>> 
>>>> No, so tested by killing the irqbalance process and it ran for 25
>>>> minutes without issue.
>>> 
>>> OK, that's interesting.
>>> 
>>> Can you find out whether irqbalance tries to move an interrupt to an 
>>> offlined CPU?
>>> Just putting a trace into gic_set_affinity() should be enough.
>>> 
>> 
> 
> Just an update here: I have tried this same test on a new model dev
> board and I don't experience the same issue. It's quite stable.

Is it the exact same SoC? Or a revised version?

> I'd like to get to the bottom of the issue reported, but I feel that
> the root cause may be a BIOS issue and I will get next to no BIOS
> support for that particular board. Hmmm.

I'd very much like to understand it too. Your latest log is even more 
puzzling, as the backtrace shows a switch_to() even earlier... The fact 
that this only happens on hotplug off tends to tell me that the firmware 
gets confused with PSCI OFF. You'd see something like that if a CPU was 
taken into the firmware (or powered-off) without the rest of the kernel 
knowing...

Could it be that PSCI powers off more than a single CPU at once?

         M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH V5 0/6] blk-mq: improvement CPU hotplug
  2020-01-31 10:58         ` Ming Lei
  2020-01-31 17:51           ` John Garry
@ 2020-02-07 10:56           ` John Garry
  1 sibling, 0 replies; 26+ messages in thread
From: John Garry @ 2020-02-07 10:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner, Keith Busch,
	chenxiang (M)

On 31/01/2020 10:58, Ming Lei wrote:
> On Fri, Jan 31, 2020 at 6:24 PM John Garry<john.garry@huawei.com>  wrote:
>>>> [  141.976109] Call trace:
>>>> [  141.978550]  __switch_to+0xbc/0x218
>>>> [  141.982029]  blk_mq_run_work_fn+0x1c/0x28
>>>> [  141.986027]  process_one_work+0x1e0/0x358
>>>> [  141.990025]  worker_thread+0x40/0x488
>>>> [  141.993678]  kthread+0x118/0x120
>>>> [  141.996897]  ret_from_fork+0x10/0x18
>>> Hi John,
>>>
>>> Thanks for your test!
>>>
>> Hi Ming,
>>
>>> Could you test the following patchset and only the last one is changed?
>>>
>>> https://github.com/ming1/linux/commits/my_for_5.6_block
>> For SCSI testing, I will ask my colleague Xiang Chen to test when he
>> returns to work. So I did not see this issue for my SCSI testing for
>> your original v5, but I was only using 1x as opposed to maybe 20x SAS disks.
>>
>> BTW, did you test NVMe? For some reason I could not trigger a scenario
>> where we're draining the outstanding requests for a queue which is being
>> deactivated - I mean, the queues were always already quiesced.
> I run cpu hotplug test on both NVMe and SCSI in KVM, and fio just runs
> as expected.
> 
> NVMe is often 1:1 mapping, so it might be a bit difficult to trigger
> draining in-flight IOs.
> 

Hi Ming,

We got around to testing your my_for_5.6_block branch (Xiang Chen 
actually took the v5 series and applied only the following on top:
  block: deactivate hctx when running queue in wrong CPU core
  Revert "block: deactivate hctx when all its CPUs are offline when run…)

and we get this:

] IRQ 598: no longer affine to CPU4
[ 1077.396063] CPU4: shutdown
[ 1077.398769] psci: CPU4 killed (polled 0 ms)
[ 1077.457777] CPU3: shutdown
[ 1077.460495] psci: CPU3 killed (polled 0 ms)
[ 1077.499650] CPU2: shutdown
[ 1077.502357] psci: CPU2 killed (polled 0 ms)
[ 1077.546976] CPU1: shutdown
[ 1077.549690] psci: CPU1 killed (polled 0 ms)
it's running b  0
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [38.9% done] 
[1201MB/0KB/0KB /s] [307K/0/0 iops] [eta 00m:22s]
it's running b  1
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [41.7% done] 
[625.2MB/0KB/0KB /s] [160K/0/0 iops] [eta 00m:21s]
it's running b  2
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [44.4% done] 
[637.1MB/0KB/0KB /s] [163K/0/0 iops] [eta 00m:20s]
it's running b  3
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [47.2% done] 
[648.6MB/0KB/0KB /s] [166K/0/0 iops] [eta 00m:19s]
it's running b  4
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [50.0% done] 
[672.8MB/0KB/0KB /s] [172K/0/0 iops] [eta 00m:18s]
it's running b  5
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [52.8% done] 
[680.2MB/0KB/0KB /s] [174K/0/0 iops] [eta 00m:17s]
it's running b  6
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [55.6% done] 
[674.7MB/0KB/0KB /s] [173K/0/0 iops] [eta 00m:16s]
it's running b  7
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [58.3% done] 
[666.2MB/0KB/0KB /s] [171K/0/0 iops] [eta 00m:15s]
it's running b  8
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [61.1% done] 
[668.7MB/0KB/0KB /s] [171K/0/0 iops] [eta 00m:14s]
it's running b  9
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [63.9% done] 
[657.9MB/0KB/0KB /s] [168K/0/0 iops] [eta 00m:13s]
it's running b  10
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [66.7% done] 
[659.6MB/0KB/0KB /s] [169K/0/0 iops] [eta 00m:12s]
it's running b  11
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [69.4% done] 
[662.8MB/0KB/0KB /s] [170K/0/0 iops] [eta 00m:11s]
it's running b  12
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [72.2% done] 
[669.8MB/0KB/0KB /s] [171K/0/0 iops] [eta 00m:10s]
it's running b  13
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [75.0% done] 
[673.2MB/0KB/0KB /s] [172K/0/0 iops] [eta 00m:09s]
it's running b  14
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [77.8% done] 
[650.5MB/0KB/0KB /s] [167K/0/0 iops] [eta 00m:08s]
it's running b  15
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [80.6% done] 
[658.9MB/0KB/0KB /s] [169K/0/0 iops] [eta 00m:07s]
it's running b  16
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [83.3% done] 
[670.3MB/0KB/0KB /s] [172K/0/0 iops] [eta 00m:06s]
it's running b  17
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [86.1% done] 
[663.7MB/0KB/0KB /s] [170K/0/0 iops] [eta 00m:05s]
it's running b  18
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [88.9% done] 
[657.9MB/0KB/0KB /s] [168K/0/0 iops] [eta 00m:04s]
it's running b  19
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [91.7% done] 
[650.9MB/0KB/0KB /s] [167K/0/0 iops] [eta 00m:03s]
it's running b  20
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [94.4% done] 
[646.1MB/0KB/0KB /s] [166K/0/0 iops] [eta 00m:02s]
it's running b  21
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [97.2% done] 
[658.4MB/0KB/0KB /s] [169K/0/0 iops] [eta 00m:01s]
it's running b  22
Jobs: 40 (f=40): [RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR] [100.0% 
done] [649.4MB/0KB/0KB /s] [166K/0/0 iops] [eta 00m:00s]
it's running b  23
Jobs: 1 (f=1): [______________________R_________________] [2.6% done] 
[402.5MB/0KB/0KB /s] [103K/0/0 iops] [eta 22m:44s]
it's running b  24
Jobs: 1 (f=1): [______________________R_________________] [2.7% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 22m:43s]
it's running b  25
Jobs: 1 (f=1): [______________________R_________________] [2.8% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 22m:42s]
it's running b  26
Jobs: 1 (f=1): [______________________R_________________] [2.9% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 22m:41s]
it's running b  27
Jobs: 1 (f=1): [______________________R_________________] [2.9% done] 
[0KB/0KB/0KB /s] [0/0/0 iops] [eta 22m:40s]
[ 1105.419335] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[ 1105.425185] sas: trying to find task 0x00000000f1b865f3
[ 1105.430409] sas: sas_scsi_find_task: aborting task 0x00000000f1b865f3
not running b  28
#

Looks like the queues are not properly drained as we're getting a single 
IO timeout. I'll have a look when I get a chance.
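
For reference, the sort of drain this series is aiming for would look 
roughly like the below (a sketch only, using the BLK_MQ_S_INACTIVE state 
from patch 1/6; hctx_has_inflight() is a hypothetical stand-in for 
whatever tag iteration the patches actually use):

	/*
	 * Sketch only: once the last CPU mapped to this hctx is going down,
	 * mark the hctx inactive so the allocation path stops handing out its
	 * tags, then wait for every request already allocated from it to
	 * complete.
	 */
	static void blk_mq_drain_inactive_hctx(struct blk_mq_hw_ctx *hctx)
	{
		set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
		/* make the flag visible before checking for in-flight tags */
		smp_mb__after_atomic();

		while (hctx_has_inflight(hctx))
			msleep(5);
	}

Anything that survives the drain should then be picked up and resubmitted 
to a live hctx by the re-submit patch, so a single request timing out like 
the above suggests one slipped through.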

Cheers,
John

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-02-07 10:56 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-15 11:44 [PATCH V5 0/6] blk-mq: improvement CPU hotplug Ming Lei
2020-01-15 11:44 ` [PATCH 1/6] blk-mq: add new state of BLK_MQ_S_INACTIVE Ming Lei
2020-01-15 11:44 ` [PATCH 2/6] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
2020-01-15 11:44 ` [PATCH 3/6] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
2020-01-15 11:44 ` [PATCH 4/6] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
2020-01-15 11:44 ` [PATCH 5/6] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx Ming Lei
2020-01-15 11:44 ` [PATCH 6/6] block: deactivate hctx when all its CPUs are offline when running queue Ming Lei
2020-01-15 17:00 ` [PATCH V5 0/6] blk-mq: improvement CPU hotplug John Garry
2020-01-20 13:23   ` John Garry
2020-01-31 10:04     ` Ming Lei
2020-01-31 10:24       ` John Garry
2020-01-31 10:58         ` Ming Lei
2020-01-31 17:51           ` John Garry
2020-01-31 18:02             ` John Garry
2020-02-01  1:31               ` Ming Lei
2020-02-01 11:05                 ` Marc Zyngier
2020-02-01 11:31                   ` Thomas Gleixner
2020-02-03 10:30                     ` John Garry
2020-02-03 10:49                       ` John Garry
2020-02-03 10:59                         ` Ming Lei
2020-02-03 12:56                           ` John Garry
2020-02-03 15:43                             ` Marc Zyngier
2020-02-03 18:16                               ` John Garry
2020-02-05 14:08                                 ` John Garry
2020-02-05 14:23                                   ` Marc Zyngier
2020-02-07 10:56           ` John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).