* [PATCH V6 0/8] blk-mq: improvement CPU hotplug
@ 2020-04-07  9:28 Ming Lei
  2020-04-07  9:28 ` [PATCH V6 1/8] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
                   ` (8 more replies)
  0 siblings, 9 replies; 19+ messages in thread
From: Ming Lei @ 2020-04-07  9:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

But neither drivers nor blk-mq do that before one hctx becomes inactive
(all CPUs of the hctx are offline), and even worse, blk-mq still tries
to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset tries to address the issue in two stages:

1) add a new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as inactive (BLK_MQ_S_INACTIVE), and drain all in-flight
requests if the hctx is going to be dead.

2) re-submit IO in the CPUHP_BLK_MQ_DEAD state after the hctx becomes dead

- steal bios from the request and resubmit them via generic_make_request(),
so that the IO is mapped to another live hctx for dispatch (see the sketch
below)
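
Roughly, the two stages map onto two cpuhp callbacks. Here is a condensed,
illustrative sketch only; the real code is in patches 3, 4, 6 and 7, and
last_online_cpu_of_hctx() merely stands in for the cpumask check added in
patch 4:

    /* CPUHP_AP_BLK_MQ_ONLINE teardown: runs before the CPU goes offline */
    static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
    {
            struct blk_mq_hw_ctx *hctx =
                    hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);

            if (last_online_cpu_of_hctx(hctx, cpu)) {
                    set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
                    /* wait until no request of this hctx holds a driver tag */
                    while (blk_mq_tags_inflight_rqs(hctx))
                            msleep(5);
            }
            return 0;
    }

    /* CPUHP_BLK_MQ_DEAD: runs after the CPU is offline */
    static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
    {
            struct blk_mq_hw_ctx *hctx =
                    hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);

            if (test_bit(BLK_MQ_S_INACTIVE, &hctx->state))
                    /* steal bios and re-submit them via generic_make_request() */
                    blk_mq_hctx_deactivate(hctx);
            return 0;
    }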

Please comment & review, thanks!

https://github.com/ming1/linux/commits/v5.6-blk-mq-improve-cpu-hotplug

V6:
	- simplify getting driver tag, so that we can drain in-flight
	  requests correctly without using synchronize_rcu()
	- handle re-submission of flush & passthrough requests correctly

V5:
	- rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
	- refactor code for re-submitting requests in the cpu dead hotplug handler
	- address requeue corner case

V4:
	- resubmit IOs in the dispatch list in case this hctx is dead

V3:
	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
	- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
	- patch 4 & patch 5 in V1 have been merged into the block tree, so
	  remove them
	- address comments from John Garry and Minwoo



Ming Lei (8):
  blk-mq: assign rq->tag in blk_mq_get_driver_tag
  blk-mq: add new state of BLK_MQ_S_INACTIVE
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  block: add blk_end_flush_machinery
  blk-mq: re-submit IO in case that hctx is inactive
  blk-mq: handle requests dispatched from IO scheduler in case of
    inactive hctx
  block: deactivate hctx when the hctx is actually inactive

 block/blk-flush.c          | 143 +++++++++++++---
 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 331 ++++++++++++++++++++++++++++++-------
 block/blk-mq.h             |  24 +--
 block/blk.h                |   9 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   6 +
 include/linux/cpuhotplug.h |   1 +
 11 files changed, 420 insertions(+), 104 deletions(-)

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
-- 
2.25.2



* [PATCH V6 1/8] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
@ 2020-04-07  9:28 ` Ming Lei
  2020-04-07 17:14   ` Christoph Hellwig
  2020-04-07  9:28 ` [PATCH V6 2/8] blk-mq: add new state of BLK_MQ_S_INACTIVE Ming Lei
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2020-04-07  9:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

Especially with the none elevator, rq->tag is assigned when the request is
allocated, so there isn't any way to figure out if one request is being
dispatched. Also the code path wrt. the driver tag differs a bit between
the none case and the IO scheduler case.

When one hctx becomes inactive, we have to prevent any request from being
dispatched to the LLD, and getting the driver tag provides a perfect chance
to do that. Meantime we can drain any such requests by checking if rq->tag
is assigned.

So don't assign rq->tag until blk_mq_get_driver_tag() is called.

This also simplifies the code dealing with driver tags a lot.
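
For illustration, once the driver tag is only assigned at dispatch time,
draining boils down to walking the tags and treating any request that holds
a driver tag as in flight. A minimal sketch; the helper name and the bare
counter are illustrative, and the real iterator added later in this series
also checks rq->mq_hctx:

    /* sketch: a request counts as in-flight once it owns a driver tag */
    static bool rq_holds_driver_tag(struct request *rq, void *data, bool reserved)
    {
            unsigned int *count = data;

            if (rq->tag >= 0)
                    (*count)++;
            return true;    /* keep iterating */
    }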

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-flush.c | 18 +++---------------
 block/blk-mq.c    | 41 +++++++++++++++++++++--------------------
 block/blk-mq.h    | 21 ++++++++++-----------
 block/blk.h       |  5 -----
 4 files changed, 34 insertions(+), 51 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 5cc775bdb06a..7b247c0470c0 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -239,13 +239,8 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 		error = fq->rq_status;
 
 	hctx = flush_rq->mq_hctx;
-	if (!q->elevator) {
-		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
-		flush_rq->tag = -1;
-	} else {
-		blk_mq_put_driver_tag(flush_rq);
-		flush_rq->internal_tag = -1;
-	}
+	flush_rq->internal_tag = -1;
+	blk_mq_put_driver_tag(flush_rq);
 
 	running = &fq->flush_queue[fq->flush_running_idx];
 	BUG_ON(fq->flush_pending_idx == fq->flush_running_idx);
@@ -320,14 +315,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	flush_rq->mq_ctx = first_rq->mq_ctx;
 	flush_rq->mq_hctx = first_rq->mq_hctx;
 
-	if (!q->elevator) {
-		fq->orig_rq = first_rq;
-		flush_rq->tag = first_rq->tag;
-		blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
-	} else {
-		flush_rq->internal_tag = first_rq->internal_tag;
-	}
-
+	flush_rq->internal_tag = first_rq->internal_tag;
 	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
 	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
 	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..f6f1ba3ff783 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -276,18 +276,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	struct request *rq = tags->static_rqs[tag];
 	req_flags_t rq_flags = 0;
 
-	if (data->flags & BLK_MQ_REQ_INTERNAL) {
-		rq->tag = -1;
-		rq->internal_tag = tag;
-	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
-			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
-		}
-		rq->tag = tag;
-		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
-	}
+	rq->internal_tag = tag;
+	rq->tag = -1;
 
 	/* csd/requeue_work/fifo_time is initialized before use */
 	rq->q = data->q;
@@ -472,14 +462,18 @@ static void __blk_mq_free_request(struct request *rq)
 	struct request_queue *q = rq->q;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
-	const int sched_tag = rq->internal_tag;
+	const int tag = rq->internal_tag;
+	bool has_sched = !!hctx->sched_tags;
 
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != -1)
+	if (!has_sched)
+		blk_mq_put_tag(hctx->tags, ctx, tag);
+	else if (rq->tag >= 0)
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
+
+	if (has_sched)
+		blk_mq_put_tag(hctx->sched_tags, ctx, tag);
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
 }
@@ -527,7 +521,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 		blk_stat_add(rq, now);
 	}
 
-	if (rq->internal_tag != -1)
+	if (rq->q->elevator && rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq, now);
 
 	blk_account_io_done(rq, now);
@@ -1037,14 +1031,21 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	};
 	bool shared;
 
-	if (rq->tag != -1)
-		return true;
+	if (rq->tag >= 0)
+		goto allocated;
+
+	if (!rq->q->elevator) {
+		rq->tag = rq->internal_tag;
+		goto allocated;
+	}
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
 
-	shared = blk_mq_tag_busy(data.hctx);
 	rq->tag = blk_mq_get_tag(&data);
+
+allocated:
+	shared = blk_mq_tag_busy(data.hctx);
 	if (rq->tag >= 0) {
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 10bfdfb494fa..d25429a4932c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -197,26 +197,25 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
-					   struct request *rq)
+static inline void blk_mq_put_driver_tag(struct request *rq)
 {
-	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+	int tag = rq->tag;
+
+	if (tag < 0)
+		return;
+
 	rq->tag = -1;
 
+	if (rq->q->elevator)
+		blk_mq_put_tag(hctx->tags, rq->mq_ctx, tag);
+
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
 		atomic_dec(&hctx->nr_active);
 	}
 }
 
-static inline void blk_mq_put_driver_tag(struct request *rq)
-{
-	if (rq->tag == -1 || rq->internal_tag == -1)
-		return;
-
-	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
-}
-
 static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 {
 	int cpu;
diff --git a/block/blk.h b/block/blk.h
index 0b8884353f6b..c824d66f24e2 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -25,11 +25,6 @@ struct blk_flush_queue {
 	struct list_head	flush_data_in_flight;
 	struct request		*flush_rq;
 
-	/*
-	 * flush_rq shares tag with this rq, both can't be active
-	 * at the same time
-	 */
-	struct request		*orig_rq;
 	struct lock_class_key	key;
 	spinlock_t		mq_flush_lock;
 };
-- 
2.25.2



* [PATCH V6 2/8] blk-mq: add new state of BLK_MQ_S_INACTIVE
  2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
  2020-04-07  9:28 ` [PATCH V6 1/8] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
@ 2020-04-07  9:28 ` Ming Lei
  2020-04-07 17:14   ` Christoph Hellwig
  2020-04-07  9:28 ` [PATCH V6 3/8] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 19+ messages in thread
From: Ming Lei @ 2020-04-07  9:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

Add a new hw queue state of BLK_MQ_S_INACTIVE, which is set when all
CPUs of this hctx are offline.

This prepares for stopping the hw queue when all CPUs of this hctx become offline.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 1 +
 block/blk-mq.h         | 3 ++-
 include/linux/blk-mq.h | 3 +++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..d6de4f7f38cb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d25429a4932c..1f4a794ddeb7 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -168,7 +168,8 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
-	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	return test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
+		test_bit(BLK_MQ_S_INACTIVE, &hctx->state);
 }
 
 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 11cfd6470b1a..b669e776d4cb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -397,6 +397,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
-- 
2.25.2



* [PATCH V6 3/8] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
  2020-04-07  9:28 ` [PATCH V6 1/8] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
  2020-04-07  9:28 ` [PATCH V6 2/8] blk-mq: add new state of BLK_MQ_S_INACTIVE Ming Lei
@ 2020-04-07  9:28 ` Ming Lei
  2020-04-07  9:28 ` [PATCH V6 4/8] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-04-07  9:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Most blk-mq drivers depend on managed IRQs' automatic affinity to set up
the queue mapping. Thomas mentioned the following point[1]:

"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU of the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD
is a cpuhp state handled after the CPU is down, so there isn't any chance
to quiesce the hctx for blk-mq wrt. CPU hotplug.

Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop queues
and wait for completion of in-flight requests.

A following patch will stop the hw queue and wait for completion of
in-flight requests when one hctx is becoming dead. That may cause a
deadlock for stacking blk-mq drivers such as dm-rq and loop.

Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and set it for dm-rq and
loop, so we don't need to wait for completion of in-flight requests from
dm-rq & loop, and the potential deadlock is avoided.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
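
For reference, a stacking driver opts out simply by adding the new flag to
its tag set, exactly as the loop and dm-rq hunks below do (tag_set and err
are generic placeholders here):

    /* sketch: a blk-mq driver whose queues aren't backed by managed IRQs */
    tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
    err = blk_mq_alloc_tag_set(tag_set);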

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c     |  1 +
 block/blk-mq.c             | 13 +++++++++++++
 drivers/block/loop.c       |  2 +-
 drivers/md/dm-rq.c         |  2 +-
 include/linux/blk-mq.h     |  3 +++
 include/linux/cpuhotplug.h |  1 +
 6 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index d6de4f7f38cb..b62390918ca5 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -240,6 +240,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f6f1ba3ff783..4ee8695142c0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2249,6 +2249,11 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2285,6 +2290,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2344,6 +2352,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3588,6 +3599,8 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				NULL, blk_mq_hctx_notify_online);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 739b372a5112..651dadd9be12 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2012,7 +2012,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13b..5f1ff70ac029 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index b669e776d4cb..ca2201435a48 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
 	 */
 	atomic_t		nr_active;
 
+	/** @cpuhp_online: List to store request if CPU is going to die */
+	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
 	struct hlist_node	cpuhp_dead;
 	/** @kobj: Kernel object for sysfs. */
@@ -388,6 +390,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index d37c17e68268..8bd2fea6cd59 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -151,6 +151,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.25.2



* [PATCH V6 4/8] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
                   ` (2 preceding siblings ...)
  2020-04-07  9:28 ` [PATCH V6 3/8] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2020-04-07  9:28 ` Ming Lei
  2020-04-07  9:28 ` [PATCH V6 5/8] block: add blk_end_flush_machinery Ming Lei
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-04-07  9:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Before one CPU goes offline, check if it is the last online CPU of the
hctx. If so, mark this hctx as inactive and meanwhile wait for completion
of all in-flight IOs originating from this hctx.

This guarantees that there isn't any in-flight IO before the managed IRQ
line is shut down.
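
The "last online CPU of this hctx" condition can be expressed with the
cpumask helpers as below; blk_mq_last_online_cpu() is only an illustrative
wrapper, the hunk that follows open-codes the two cpumask_next_and() calls:

    /* true iff 'cpu' is the only CPU of hctx->cpumask that is still online */
    static bool blk_mq_last_online_cpu(unsigned int cpu, struct blk_mq_hw_ctx *hctx)
    {
            return cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu &&
                   cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) >=
                   nr_cpu_ids;
    }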

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c |  2 +-
 block/blk-mq-tag.h |  2 ++
 block/blk-mq.c     | 67 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904a..82a58b2cebe7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -317,7 +317,7 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 2b8321efb682..346d570d52a9 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -34,6 +34,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4ee8695142c0..aac86cd99f02 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1054,6 +1054,11 @@ bool blk_mq_get_driver_tag(struct request *rq)
 		data.hctx->tags->rqs[rq->tag] = rq;
 	}
 
+	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state))) {
+		blk_mq_put_driver_tag(rq);
+		return false;
+	}
+
 	return rq->tag != -1;
 }
 
@@ -2249,8 +2254,68 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+struct count_inflight_data {
+	unsigned count;
+	struct blk_mq_hw_ctx *hctx;
+};
+
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
+{
+	struct count_inflight_data *count_data = data;
+
+	/*
+	 * Can't check rq's state because it is updated to MQ_RQ_IN_FLIGHT
+	 * in blk_mq_start_request(), at that time we can't prevent this rq
+	 * from being issued.
+	 *
+	 * So check if driver tag is assigned, if yes, count this rq as
+	 * inflight.
+	 */
+	if (rq->tag >= 0 && rq->mq_hctx == count_data->hctx)
+		count_data->count++;
+
+	return true;
+}
+
+static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	struct count_inflight_data count_data = {
+		.count	= 0,
+		.hctx	= hctx,
+	};
+
+	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
+			&count_data);
+
+	return count_data.count;
+}
+
+static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	while (1) {
+		if (!blk_mq_tags_inflight_rqs(hctx))
+			break;
+		msleep(5);
+	}
+}
+
 static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 {
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu) &&
+			(cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask)
+			 >= nr_cpu_ids)) {
+		/*
+		 * The current CPU is the last one in this hctx; S_INACTIVE
+		 * can be observed in the dispatch path without any barrier
+		 * needed, because both run on the same CPU.
+		 */
+		set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+		blk_mq_hctx_drain_inflight_rqs(hctx);
+        }
 	return 0;
 }
 
@@ -2277,6 +2342,8 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	}
 	spin_unlock(&ctx->lock);
 
+	clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+
 	if (list_empty(&tmp))
 		return 0;
 
-- 
2.25.2



* [PATCH V6 5/8] block: add blk_end_flush_machinery
  2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
                   ` (3 preceding siblings ...)
  2020-04-07  9:28 ` [PATCH V6 4/8] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
@ 2020-04-07  9:28 ` Ming Lei
  2020-04-07  9:28 ` [PATCH V6 6/8] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-04-07  9:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Flush requests aren't the same as normal FS requests:

1) one dedicated per-hctx flush rq is pre-allocated for sending flush requests

2) flush requests are issued to hardware via one machinery so that flush
merging can be applied

We can't simply re-submit flush rqs via blk_steal_bios(), so add
blk_end_flush_machinery() to collect flush requests which need to
be resubmitted:

- if one flush command without data is enough, send one flush and complete
this kind of request

- otherwise, add the request to a list and let the caller re-submit it
(a usage sketch follows).
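
The intended calling convention, roughly (a sketch of how the CPU-dead
handler in a later patch uses it; resubmit_list is an assumed caller-side
list):

    LIST_HEAD(flush_in);    /* flush requests taken off this dead hctx */
    LIST_HEAD(flush_out);   /* flush requests the caller must re-submit */

    blk_end_flush_machinery(hctx, &flush_in, &flush_out);

    /* anything left in flush_out still carries data and is re-submitted */
    list_splice_tail(&flush_out, &resubmit_list);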

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-flush.c | 125 +++++++++++++++++++++++++++++++++++++++++++---
 block/blk.h       |   4 ++
 2 files changed, 122 insertions(+), 7 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 7b247c0470c0..75aa9b4d2880 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -173,10 +173,11 @@ static void blk_flush_complete_seq(struct request *rq,
 	unsigned int cmd_flags;
 
 	BUG_ON(rq->flush.seq & seq);
-	rq->flush.seq |= seq;
+	if (!error)
+		rq->flush.seq |= seq;
 	cmd_flags = rq->cmd_flags;
 
-	if (likely(!error))
+	if (likely(!error && !fq->flush_queue_terminating))
 		seq = blk_flush_cur_seq(rq);
 	else
 		seq = REQ_FSEQ_DONE;
@@ -203,9 +204,15 @@ static void blk_flush_complete_seq(struct request *rq,
 		 * normal completion and end it.
 		 */
 		BUG_ON(!list_empty(&rq->queuelist));
-		list_del_init(&rq->flush.list);
-		blk_flush_restore_request(rq);
-		blk_mq_end_request(rq, error);
+
+		/* Terminating code will end the request from flush queue */
+		if (likely(!fq->flush_queue_terminating)) {
+			list_del_init(&rq->flush.list);
+			blk_flush_restore_request(rq);
+			blk_mq_end_request(rq, error);
+		} else {
+			list_move_tail(&rq->flush.list, pending);
+		}
 		break;
 
 	default:
@@ -282,7 +289,8 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	struct request *flush_rq = fq->flush_rq;
 
 	/* C1 described at the top of this file */
-	if (fq->flush_pending_idx != fq->flush_running_idx || list_empty(pending))
+	if (fq->flush_pending_idx != fq->flush_running_idx ||
+			list_empty(pending) || fq->flush_queue_terminating)
 		return;
 
 	/* C2 and C3
@@ -334,7 +342,7 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
 	if (q->elevator) {
-		WARN_ON(rq->tag < 0);
+		WARN_ON(rq->tag < 0 && !fq->flush_queue_terminating);
 		blk_mq_put_driver_tag(rq);
 	}
 
@@ -515,3 +523,106 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
 	kfree(fq->flush_rq);
 	kfree(fq);
 }
+
+static void __blk_end_queued_flush(struct blk_flush_queue *fq,
+		unsigned int queue_idx, struct list_head *resubmit_list,
+		struct list_head *flush_list)
+{
+	struct list_head *queue = &fq->flush_queue[queue_idx];
+	struct request *rq, *nxt;
+
+	list_for_each_entry_safe(rq, nxt, queue, flush.list) {
+		unsigned int seq = blk_flush_cur_seq(rq);
+
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
+		if (!blk_rq_sectors(rq) || seq == REQ_FSEQ_POSTFLUSH )
+			list_add_tail(&rq->queuelist, flush_list);
+		else
+			list_add_tail(&rq->queuelist, resubmit_list);
+	}
+}
+
+static void blk_end_queued_flush(struct blk_flush_queue *fq,
+		struct list_head *resubmit_list, struct list_head *flush_list)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	__blk_end_queued_flush(fq, 0, resubmit_list, flush_list);
+	__blk_end_queued_flush(fq, 1, resubmit_list, flush_list);
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+}
+
+/* complete requests which just requires one flush command */
+static void blk_complete_flush_requests(struct blk_flush_queue *fq,
+		struct list_head *flush_list)
+{
+	struct block_device *bdev;
+	struct request *rq;
+	int error = -ENXIO;
+
+	if (list_empty(flush_list))
+		return;
+
+	rq = list_first_entry(flush_list, struct request, queuelist);
+
+	/* Send flush via one active hctx so we can move on */
+	bdev = bdget_disk(rq->rq_disk, 0);
+	if (bdev) {
+		error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
+		bdput(bdev);
+	}
+
+	while (!list_empty(flush_list)) {
+		rq = list_first_entry(flush_list, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_end_request(rq, error);
+	}
+}
+
+/*
+ * Called when this hctx is inactive and all CPUs of this hctx are dead.
+ *
+ * Terminate this hw queue's flush machinery, and try to complete flush
+ * IO requests if possible, such as any flush IO without data, or flush
+ * data IO in the POSTFLUSH stage. Otherwise, add the flush IOs into @out
+ * and let the caller re-submit them.
+ *
+ * No need to hold the flush queue lock given no flush activity can
+ * reach this queue now.
+ */
+void blk_end_flush_machinery(struct blk_mq_hw_ctx *hctx,
+		struct list_head *in, struct list_head *out)
+{
+	LIST_HEAD(resubmit_list);
+	LIST_HEAD(flush_list);
+	struct blk_flush_queue *fq = hctx->fq;
+	struct request *rq, *nxt;
+	unsigned long flags;
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	fq->flush_queue_terminating = 1;
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+
+	/* End inflight flush requests */
+	list_for_each_entry_safe(rq, nxt, in, queuelist) {
+		WARN_ON(!(rq->rq_flags & RQF_FLUSH_SEQ));
+		list_del_init(&rq->queuelist);
+		rq->end_io(rq, BLK_STS_AGAIN);
+	}
+
+	/* End queued requests */
+	blk_end_queued_flush(fq, &resubmit_list, &flush_list);
+
+	/* Send flush and complete requests which just need one flush req */
+	blk_complete_flush_requests(fq, &flush_list);
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	/* reset flush queue so that it is ready to work next time */
+	fq->flush_pending_idx = fq->flush_running_idx = 0;
+	fq->flush_queue_terminating = 0;
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+
+	list_splice_init(&resubmit_list, out);
+}
diff --git a/block/blk.h b/block/blk.h
index c824d66f24e2..594e4236d55c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -19,6 +19,7 @@ struct blk_flush_queue {
 	unsigned int		flush_queue_delayed:1;
 	unsigned int		flush_pending_idx:1;
 	unsigned int		flush_running_idx:1;
+	unsigned int		flush_queue_terminating:1;
 	blk_status_t 		rq_status;
 	unsigned long		flush_pending_since;
 	struct list_head	flush_queue[2];
@@ -349,4 +350,7 @@ void blk_queue_free_zone_bitmaps(struct request_queue *q);
 static inline void blk_queue_free_zone_bitmaps(struct request_queue *q) {}
 #endif
 
+void blk_end_flush_machinery(struct blk_mq_hw_ctx *hctx,
+		struct list_head *in, struct list_head *out);
+
 #endif /* BLK_INTERNAL_H */
-- 
2.25.2



* [PATCH V6 6/8] blk-mq: re-submit IO in case that hctx is inactive
  2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
                   ` (4 preceding siblings ...)
  2020-04-07  9:28 ` [PATCH V6 5/8] block: add blk_end_flush_machinery Ming Lei
@ 2020-04-07  9:28 ` Ming Lei
  2020-04-07  9:29 ` [PATCH V6 7/8] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx Ming Lei
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-04-07  9:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

When all CPUs of one hctx are offline and this hctx has become inactive,
we shouldn't run this hw queue to complete requests any more.

So steal the bios from the request, resubmit them, and finally free
the request in blk_mq_hctx_notify_dead().
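
For an FS request this is essentially the following (condensed from
blk_mq_resubmit_fs_io() in the hunk below):

    struct bio_list list;
    struct bio *bio;

    bio_list_init(&list);
    blk_steal_bios(&list, rq);          /* detach bios from the dead-hctx rq */

    while ((bio = bio_list_pop(&list)))
            generic_make_request(bio);  /* gets mapped to a live hctx */

    blk_mq_cleanup_rq(rq);
    blk_mq_end_request(rq, 0);          /* the original request is done */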

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 131 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 121 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index aac86cd99f02..6749f39fdd11 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2319,10 +2319,98 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+static void blk_mq_resubmit_end_io(struct request *rq, blk_status_t error)
+{
+	struct request *orig_rq = rq->end_io_data;
+
+	blk_mq_cleanup_rq(orig_rq);
+	blk_mq_end_request(orig_rq, error);
+
+	blk_put_request(rq);
+}
+
+static void blk_mq_resubmit_passthrough_io(struct request *rq)
+{
+	struct request *nrq;
+	unsigned int flags = 0, cmd_flags = 0;
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+	struct blk_mq_tags *tags = rq->q->elevator ? hctx->sched_tags :
+		hctx->tags;
+	bool reserved = blk_mq_tag_is_reserved(tags, rq->internal_tag);
+
+	if (rq->rq_flags & RQF_PREEMPT)
+		flags |= BLK_MQ_REQ_PREEMPT;
+	if (reserved)
+		flags |= BLK_MQ_REQ_RESERVED;
+
+	/* avoid allocation failure & IO merge */
+	cmd_flags = (rq->cmd_flags & ~REQ_NOWAIT) | REQ_NOMERGE;
+
+	nrq = blk_get_request(rq->q, cmd_flags, flags);
+	if (!nrq)
+		return;
+
+	nrq->__sector = blk_rq_pos(rq);
+	nrq->__data_len = blk_rq_bytes(rq);
+	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
+		nrq->rq_flags |= RQF_SPECIAL_PAYLOAD;
+		nrq->special_vec = rq->special_vec;
+	}
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	nrq->nr_integrity_segments = rq->nr_integrity_segments;
+#endif
+	nrq->nr_phys_segments = rq->nr_phys_segments;
+	nrq->ioprio = rq->ioprio;
+	nrq->extra_len = rq->extra_len;
+	nrq->rq_disk = rq->rq_disk;
+	nrq->part = rq->part;
+	nrq->write_hint = rq->write_hint;
+	nrq->timeout = rq->timeout;
+
+	memcpy(blk_mq_rq_to_pdu(nrq), blk_mq_rq_to_pdu(rq),
+			rq->q->tag_set->cmd_size);
+
+	nrq->end_io = blk_mq_resubmit_end_io;
+	nrq->end_io_data = rq;
+	nrq->bio = rq->bio;
+	nrq->biotail = rq->biotail;
+
+	blk_account_io_start(nrq, true);
+	blk_mq_sched_insert_request(nrq, true, true, true);
+}
+
+static void blk_mq_resubmit_fs_io(struct request *rq)
+{
+	struct bio_list list;
+	struct bio *bio;
+
+	bio_list_init(&list);
+	blk_steal_bios(&list, rq);
+
+	while (true) {
+		bio = bio_list_pop(&list);
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+
+	blk_mq_cleanup_rq(rq);
+	blk_mq_end_request(rq, 0);
+}
+
+static void blk_mq_resubmit_io(struct request *rq)
+{
+	if (rq->end_io || blk_rq_is_passthrough(rq))
+		blk_mq_resubmit_passthrough_io(rq);
+	else
+		blk_mq_resubmit_fs_io(rq);
+}
+
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * 'cpu' has gone away. If this hctx is inactive, we can't dispatch request
+ * to the hctx any more, so steal bios from requests of this hctx, and
+ * re-submit them to the request queue, and free these requests finally.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
@@ -2342,16 +2430,39 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	}
 	spin_unlock(&ctx->lock);
 
-	clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
+		if (!list_empty(&tmp)) {
+			spin_lock(&hctx->lock);
+			list_splice_tail_init(&tmp, &hctx->dispatch);
+			spin_unlock(&hctx->lock);
+			blk_mq_run_hw_queue(hctx, true);
+		}
+	} else {
+		LIST_HEAD(flush_in);
+		LIST_HEAD(flush_out);
+		struct request *rq, *nxt;
 
-	if (list_empty(&tmp))
-		return 0;
+		/* requests in dispatch list have to be re-submitted too */
+		spin_lock(&hctx->lock);
+		list_splice_tail_init(&hctx->dispatch, &tmp);
+		spin_unlock(&hctx->lock);
 
-	spin_lock(&hctx->lock);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+		/* blk_end_flush_machinery will cover flush request */
+		list_for_each_entry_safe(rq, nxt, &tmp, queuelist) {
+			if (rq->rq_flags & RQF_FLUSH_SEQ)
+				list_move(&rq->queuelist, &flush_in);
+		}
+		blk_end_flush_machinery(hctx, &flush_in, &flush_out);
+		list_splice_tail(&flush_out, &tmp);
+
+		while (!list_empty(&tmp)) {
+			rq = list_first_entry(&tmp, struct request, queuelist);
+			list_del_init(&rq->queuelist);
+			blk_mq_resubmit_io(rq);
+		}
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	}
 
-	blk_mq_run_hw_queue(hctx, true);
 	return 0;
 }
 
-- 
2.25.2



* [PATCH V6 7/8] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx
  2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
                   ` (5 preceding siblings ...)
  2020-04-07  9:28 ` [PATCH V6 6/8] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
@ 2020-04-07  9:29 ` Ming Lei
  2020-04-07  9:29 ` [PATCH V6 8/8] block: deactivate hctx when the hctx is actually inactive Ming Lei
  2020-04-08 12:40 ` [PATCH V6 0/8] blk-mq: improvement CPU hotplug Daniel Wagner
  8 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-04-07  9:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

If one hctx becomes inactive when all its CPUs are offline, all in-queue
requests targeting this hctx have to be re-submitted.

Re-submit requests from both the sw queue and the scheduler queue when
the hctx is found to be inactive.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 103 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 64 insertions(+), 39 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6749f39fdd11..a8dbb1bc0a36 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2407,6 +2407,52 @@ static void blk_mq_resubmit_io(struct request *rq)
 		blk_mq_resubmit_fs_io(rq);
 }
 
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(sched_tmp);
+	LIST_HEAD(re_submit);
+	LIST_HEAD(flush_in);
+	LIST_HEAD(flush_out);
+	struct request *rq, *nxt;
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (!e) {
+		blk_mq_flush_busy_ctxs(hctx, &re_submit);
+	} else {
+		while ((rq = e->type->ops.dispatch_request(hctx))) {
+			if (rq->mq_hctx != hctx)
+				list_add(&rq->queuelist, &sched_tmp);
+			else
+				list_add(&rq->queuelist, &re_submit);
+		}
+	}
+	while (!list_empty(&sched_tmp)) {
+		rq = list_entry(sched_tmp.next, struct request,
+				queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_sched_insert_request(rq, true, true, true);
+	}
+
+	/* requests in dispatch list have to be re-submitted too */
+	spin_lock(&hctx->lock);
+	list_splice_tail_init(&hctx->dispatch, &re_submit);
+	spin_unlock(&hctx->lock);
+
+	/* blk_end_flush_machinery will cover flush request */
+	list_for_each_entry_safe(rq, nxt, &re_submit, queuelist) {
+		if (rq->rq_flags & RQF_FLUSH_SEQ)
+			list_move(&rq->queuelist, &flush_in);
+	}
+	blk_end_flush_machinery(hctx, &flush_in, &flush_out);
+	list_splice_tail(&flush_out, &re_submit);
+
+	while (!list_empty(&re_submit)) {
+		rq = list_first_entry(&re_submit, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_resubmit_io(rq);
+	}
+}
+
 /*
  * 'cpu' has gone away. If this hctx is inactive, we can't dispatch request
  * to the hctx any more, so steal bios from requests of this hctx, and
@@ -2414,53 +2460,32 @@ static void blk_mq_resubmit_io(struct request *rq)
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
-	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx;
-	LIST_HEAD(tmp);
-	enum hctx_type type;
-
-	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
-	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
-	type = hctx->type;
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_dead);
 
-	spin_lock(&ctx->lock);
-	if (!list_empty(&ctx->rq_lists[type])) {
-		list_splice_init(&ctx->rq_lists[type], &tmp);
-		blk_mq_hctx_clear_pending(hctx, ctx);
-	}
-	spin_unlock(&ctx->lock);
+	if (test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
+		WARN_ON_ONCE(blk_mq_tags_inflight_rqs(hctx) > 0);
+		blk_mq_hctx_deactivate(hctx);
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	} else if (!hctx->queue->elevator) {
+		struct blk_mq_ctx *ctx = __blk_mq_get_ctx(hctx->queue, cpu);
+		enum hctx_type type = hctx->type;
+		LIST_HEAD(tmp);
+
+		spin_lock(&ctx->lock);
+		if (!list_empty(&ctx->rq_lists[type])) {
+			list_splice_init(&ctx->rq_lists[type], &tmp);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+		}
+		spin_unlock(&ctx->lock);
 
-	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
 		if (!list_empty(&tmp)) {
 			spin_lock(&hctx->lock);
 			list_splice_tail_init(&tmp, &hctx->dispatch);
 			spin_unlock(&hctx->lock);
-			blk_mq_run_hw_queue(hctx, true);
-		}
-	} else {
-		LIST_HEAD(flush_in);
-		LIST_HEAD(flush_out);
-		struct request *rq, *nxt;
-
-		/* requests in dispatch list have to be re-submitted too */
-		spin_lock(&hctx->lock);
-		list_splice_tail_init(&hctx->dispatch, &tmp);
-		spin_unlock(&hctx->lock);
-
-		/* blk_end_flush_machinery will cover flush request */
-		list_for_each_entry_safe(rq, nxt, &tmp, queuelist) {
-			if (rq->rq_flags & RQF_FLUSH_SEQ)
-				list_move(&rq->queuelist, &flush_in);
-		}
-		blk_end_flush_machinery(hctx, &flush_in, &flush_out);
-		list_splice_tail(&flush_out, &tmp);
 
-		while (!list_empty(&tmp)) {
-			rq = list_first_entry(&tmp, struct request, queuelist);
-			list_del_init(&rq->queuelist);
-			blk_mq_resubmit_io(rq);
+			blk_mq_run_hw_queue(hctx, true);
 		}
-		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
 	}
 
 	return 0;
-- 
2.25.2



* [PATCH V6 8/8] block: deactivate hctx when the hctx is actually inactive
  2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
                   ` (6 preceding siblings ...)
  2020-04-07  9:29 ` [PATCH V6 7/8] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx Ming Lei
@ 2020-04-07  9:29 ` Ming Lei
  2020-04-08 12:40 ` [PATCH V6 0/8] blk-mq: improvement CPU hotplug Daniel Wagner
  8 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-04-07  9:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Running the queue on a dead CPU may still be triggered in some corner
cases, such as when one request is requeued after CPU hotplug has been
handled.

So handle this corner case during queue run.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 30 ++++++++++--------------------
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a8dbb1bc0a36..4235abde8d45 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -43,6 +43,8 @@
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx);
+
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
 	int ddir, sectors, bucket;
@@ -1352,28 +1354,16 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	int srcu_idx;
 
 	/*
-	 * We should be running this queue from one of the CPUs that
-	 * are mapped to it.
-	 *
-	 * There are at least two related races now between setting
-	 * hctx->next_cpu from blk_mq_hctx_next_cpu() and running
-	 * __blk_mq_run_hw_queue():
-	 *
-	 * - hctx->next_cpu is found offline in blk_mq_hctx_next_cpu(),
-	 *   but later it becomes online, then this warning is harmless
-	 *   at all
-	 *
-	 * - hctx->next_cpu is found online in blk_mq_hctx_next_cpu(),
-	 *   but later it becomes offline, then the warning can't be
-	 *   triggered, and we depend on blk-mq timeout handler to
-	 *   handle dispatched requests to this hctx
+	 * BLK_MQ_S_INACTIVE may not deal with some requeue corner cases:
+	 * one request may be requeued after cpu unplug is handled, so check
+	 * if the hctx is actually inactive. If yes, deactivate it and
+	 * re-submit all requests in the queue.
 	 */
 	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
-		cpu_online(hctx->next_cpu)) {
-		printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
-			raw_smp_processor_id(),
-			cpumask_empty(hctx->cpumask) ? "inactive": "active");
-		dump_stack();
+	    cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) >=
+	    nr_cpu_ids) {
+		blk_mq_hctx_deactivate(hctx);
+		return;
 	}
 
 	/*
-- 
2.25.2



* Re: [PATCH V6 1/8] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-07  9:28 ` [PATCH V6 1/8] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
@ 2020-04-07 17:14   ` Christoph Hellwig
  2020-04-08  1:38     ` Ming Lei
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2020-04-07 17:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

On Tue, Apr 07, 2020 at 05:28:54PM +0800, Ming Lei wrote:
> @@ -472,14 +462,18 @@ static void __blk_mq_free_request(struct request *rq)
>  	struct request_queue *q = rq->q;
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
>  	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> -	const int sched_tag = rq->internal_tag;
> +	const int tag = rq->internal_tag;
> +	bool has_sched = !!hctx->sched_tags;
>  
>  	blk_pm_mark_last_busy(rq);
>  	rq->mq_hctx = NULL;
> +	if (!has_sched)
> +		blk_mq_put_tag(hctx->tags, ctx, tag);
> +	else if (rq->tag >= 0)
>  		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> +
> +	if (has_sched)
> +		blk_mq_put_tag(hctx->sched_tags, ctx, tag);

This looks weird to me.  Why not simply:

	if (hctx->sched_tags) {
		if (rq->tag >= 0)
			blk_mq_put_tag(hctx->tags, ctx, rq->tag);
		blk_mq_put_tag(hctx->sched_tags, ctx, rq->internal_tag);
	} else {
		blk_mq_put_tag(hctx->tags, ctx, rq->internal_tag);
	}


> @@ -1037,14 +1031,21 @@ bool blk_mq_get_driver_tag(struct request *rq)

FYI, it seems like blk_mq_get_driver_tag can be marked static.

Otherwise this looks pretty sensible to me.


* Re: [PATCH V6 2/8] blk-mq: add new state of BLK_MQ_S_INACTIVE
  2020-04-07  9:28 ` [PATCH V6 2/8] blk-mq: add new state of BLK_MQ_S_INACTIVE Ming Lei
@ 2020-04-07 17:14   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2020-04-07 17:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

On Tue, Apr 07, 2020 at 05:28:55PM +0800, Ming Lei wrote:
> Add a new hw queue state of BLK_MQ_S_INACTIVE, which is set when all
> CPUs of this hctx are offline.
> 
> Prepares for stopping hw queue when all CPUs of this hctx become offline.

It isn't set yet, is it?  So maybe ".. will be set..".  Or just merge
it into the patch that sets it, which might be easier to follow.


* Re: [PATCH V6 1/8] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-07 17:14   ` Christoph Hellwig
@ 2020-04-08  1:38     ` Ming Lei
  0 siblings, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-04-08  1:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, John Garry

On Tue, Apr 07, 2020 at 07:14:05PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 07, 2020 at 05:28:54PM +0800, Ming Lei wrote:
> > @@ -472,14 +462,18 @@ static void __blk_mq_free_request(struct request *rq)
> >  	struct request_queue *q = rq->q;
> >  	struct blk_mq_ctx *ctx = rq->mq_ctx;
> >  	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > -	const int sched_tag = rq->internal_tag;
> > +	const int tag = rq->internal_tag;
> > +	bool has_sched = !!hctx->sched_tags;
> >  
> >  	blk_pm_mark_last_busy(rq);
> >  	rq->mq_hctx = NULL;
> > +	if (!has_sched)
> > +		blk_mq_put_tag(hctx->tags, ctx, tag);
> > +	else if (rq->tag >= 0)
> >  		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> > +
> > +	if (has_sched)
> > +		blk_mq_put_tag(hctx->sched_tags, ctx, tag);
> 
> This looks weird to me.  Why not simply:
> 
> 	if (hctx->sched_tags) {
> 		if (rq->tag >= 0)
> 			blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> 		blk_mq_put_tag(hctx->sched_tags, ctx, rq->internal_tag);
> 	} else {
> 		blk_mq_put_tag(hctx->tags, ctx, rq->internal_tag);
> 	}

Nice!

> 
> 
> > @@ -1037,14 +1031,21 @@ bool blk_mq_get_driver_tag(struct request *rq)
> 
> FYI, it seems like blk_mq_get_driver_tag can be marked static.
> 
> Otherwise this looks pretty sensible to me.

Indeed, just forgot to do that.


Thanks,
Ming



* Re: [PATCH V6 0/8] blk-mq: improvement CPU hotplug
  2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
                   ` (7 preceding siblings ...)
  2020-04-07  9:29 ` [PATCH V6 8/8] block: deactivate hctx when the hctx is actually inactive Ming Lei
@ 2020-04-08 12:40 ` Daniel Wagner
  2020-04-08 13:01   ` John Garry
  2020-04-08 13:25   ` Ming Lei
  8 siblings, 2 replies; 19+ messages in thread
From: Daniel Wagner @ 2020-04-08 12:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Hi Ming,

On Tue, Apr 07, 2020 at 05:28:53PM +0800, Ming Lei wrote:
> Hi,
> 
> Thomas mentioned:
>     "
>      That was the constraint of managed interrupts from the very beginning:
>     
>       The driver/subsystem has to quiesce the interrupt line and the associated
>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>       until it's restarted by the core when the CPU is plugged in again.
>     "
> 
> But no drivers or blk-mq do that before one hctx becomes inactive(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> 
> This patchset tries to address the issue by two stages:
> 
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> 
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
> 
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> 
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
> 
> Please comment & review, thanks!

FWIW, I've stress tested this series by running the stress-cpu-hotplug
script with a fio workload in the background. Nothing exploded, it all
just worked fine.

Thanks,
Daniel


* Re: [PATCH V6 0/8] blk-mq: improvement CPU hotplug
  2020-04-08 12:40 ` [PATCH V6 0/8] blk-mq: improvement CPU hotplug Daniel Wagner
@ 2020-04-08 13:01   ` John Garry
  2020-04-08 13:10     ` Daniel Wagner
  2020-04-08 13:25   ` Ming Lei
  1 sibling, 1 reply; 19+ messages in thread
From: John Garry @ 2020-04-08 13:01 UTC (permalink / raw)
  To: Daniel Wagner, Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 08/04/2020 13:40, Daniel Wagner wrote:
> Hi Ming,
> 
> On Tue, Apr 07, 2020 at 05:28:53PM +0800, Ming Lei wrote:
>> Hi,
>>
>> Thomas mentioned:
>>      "
>>       That was the constraint of managed interrupts from the very beginning:
>>      
>>        The driver/subsystem has to quiesce the interrupt line and the associated
>>        queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>>        until it's restarted by the core when the CPU is plugged in again.
>>      "
>>
>> But no drivers or blk-mq do that before one hctx becomes inactive(all
>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>>
>> This patchset tries to address the issue by two stages:
>>
>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>
>> - mark the hctx as internal stopped, and drain all in-flight requests
>> if the hctx is going to be dead.
>>
>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>>
>> - steal bios from the request, and resubmit them via generic_make_request(),
>> then these IO will be mapped to other live hctx for dispatch
>>
>> Please comment & review, thanks!
> 
> FWIW, I've stress test this series by running the stress-cpu-hotplug
> with a fio workload in the background. Nothing exploded, all just
> worked fine.

Hi Daniel,

Is stress-cpu-hotplug an LTP test, or is it from Steven Rostedt? I saw
some threads where he mentioned a script.

Will the fio processes migrate back onto cpus which have been onlined again?

What is the block driver, NVMe?

Thanks,
john


* Re: [PATCH V6 0/8] blk-mq: improvement CPU hotplug
  2020-04-08 13:01   ` John Garry
@ 2020-04-08 13:10     ` Daniel Wagner
  2020-04-08 13:29       ` John Garry
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Wagner @ 2020-04-08 13:10 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Hi John,

On Wed, Apr 08, 2020 at 02:01:11PM +0100, John Garry wrote:
> Is stress-cpu-hotplug an ltp test? or from Steven Rostedt -  I saw some
> threads where he mentioned some script?

My bad. It's the script from Steven, which toggles the CPUs on/off following a binary counter:

[...]
2432 disabling cpu16 disabling cpu17 disabling cpu2
2433 disabling cpu1 enabling cpu16 enabling cpu17 enabling cpu2
2434 disabling cpu10 disabling cpu16 disabling cpu17 disabling cpu2
2435 enabling cpu1 enabling cpu10 enabling cpu16 enabling cpu17 enabling cpu2
2436 disabling cpu11 disabling cpu16 disabling cpu17 disabling cpu2
2437 disabling cpu1 enabling cpu11 enabling cpu16 enabling cpu17 enabling cpu2
2438 disabling cpu10 disabling cpu11 disabling cpu16 disabling cpu17 disabling cpu2
[..]

> Will the fio processes migrate back onto cpus which have been onlined again?

Hmm, good question. I've tried to assign them to a specific CPU via
--cpus_allowed_policy=split and --cpus_allowed.

  fio --rw=randwrite --name=test --size=50M --iodepth=32 --direct=1 \
      --bs=4k --numjobs=40 --time_based --runtime=1h --ioengine=libaio \
      --group_reporting --cpus_allowed_policy=split --cpus_allowed=0-40

Though I haven't verified what happens when the CPU gets back online.

> What is the block driver NVMe?

I've used a qla2xxx device. Hannes asked me to retry it with a megasas
device.

Thanks,
Daniel


* Re: [PATCH V6 0/8] blk-mq: improvement CPU hotplug
  2020-04-08 12:40 ` [PATCH V6 0/8] blk-mq: improvement CPU hotplug Daniel Wagner
  2020-04-08 13:01   ` John Garry
@ 2020-04-08 13:25   ` Ming Lei
  1 sibling, 0 replies; 19+ messages in thread
From: Ming Lei @ 2020-04-08 13:25 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Hi Daniel,

On Wed, Apr 08, 2020 at 02:40:17PM +0200, Daniel Wagner wrote:
> Hi Ming,
> 
> On Tue, Apr 07, 2020 at 05:28:53PM +0800, Ming Lei wrote:
> > Hi,
> > 
> > Thomas mentioned:
> >     "
> >      That was the constraint of managed interrupts from the very beginning:
> >     
> >       The driver/subsystem has to quiesce the interrupt line and the associated
> >       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >       until it's restarted by the core when the CPU is plugged in again.
> >     "
> > 
> > But no drivers or blk-mq do that before one hctx becomes inactive(all
> > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > 
> > This patchset tries to address the issue by two stages:
> > 
> > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > 
> > - mark the hctx as internal stopped, and drain all in-flight requests
> > if the hctx is going to be dead.
> > 
> > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> > 
> > - steal bios from the request, and resubmit them via generic_make_request(),
> > then these IO will be mapped to other live hctx for dispatch
> > 
> > Please comment & review, thanks!
> 
> FWIW, I've stress tested this series by running the stress-cpu-hotplug
> script with a fio workload in the background. Nothing exploded, it all
> just worked fine.

Thanks for your test!

This series especially changes flush & passthrough IO handling during
CPU hotplug, so if possible, please include those two kinds of background
IO when running the CPU hotplug test.

BTW, I verified the patches by running 'dbench -s 64' & concurrent NVMe
user IO during CPU hotplug, and it looks like it works fine.
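
If it helps, both kinds of background IO can be approximated from
userspace; a rough sketch (device name and runtimes are only placeholders):

  # flush-heavy IO: fsync after every write
  fio --name=flushio --rw=randwrite --bs=4k --size=128M --fsync=1 \
      --direct=1 --time_based --runtime=10m --filename=/dev/nvme0n1 &

  # passthrough IO: raw NVMe reads submitted through nvme-cli's ioctl path
  while true; do
      nvme read /dev/nvme0n1 --start-block=0 --block-count=7 \
          --data-size=4096 --data=/dev/null
  done &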

Also, there is one known performance drop issue reported by John, which
has been addressed by the commit at the following link:

https://github.com/ming1/linux/commit/1cfbe1b2f7fd7085bc86e09c6443a20e89142975


Thanks,
Ming


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V6 0/8] blk-mq: improvement CPU hotplug
  2020-04-08 13:10     ` Daniel Wagner
@ 2020-04-08 13:29       ` John Garry
  2020-04-08 15:14         ` Daniel Wagner
  0 siblings, 1 reply; 19+ messages in thread
From: John Garry @ 2020-04-08 13:29 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On 08/04/2020 14:10, Daniel Wagner wrote:
> Hi John,
> 
> On Wed, Apr 08, 2020 at 02:01:11PM +0100, John Garry wrote:
>> Is stress-cpu-hotplug an LTP test, or is it the one from Steven Rostedt? I
>> saw some threads where he mentioned such a script.
> 
> My bad. It's the script from Steven, which toggles the CPUs on/off in a binary pattern:
> 
> [...]
> 2432 disabling cpu16 disabling cpu17 disabling cpu2
> 2433 disabling cpu1 enabling cpu16 enabling cpu17 enabling cpu2
> 2434 disabling cpu10 disabling cpu16 disabling cpu17 disabling cpu2
> 2435 enabling cpu1 enabling cpu10 enabling cpu16 enabling cpu17 enabling cpu2
> 2436 disabling cpu11 disabling cpu16 disabling cpu17 disabling cpu2
> 2437 disabling cpu1 enabling cpu11 enabling cpu16 enabling cpu17 enabling cpu2
> 2438 disabling cpu10 disabling cpu11 disabling cpu16 disabling cpu17 disabling cpu2
> [..]
> 

OK, but to really test this you need to ensure that all the CPUs in a 
managed interrupt's affinity mask are offlined together for a period 
longer than the IO timeout. Otherwise the hw queue's managed interrupt 
is not shut down, and you're not verifying that the queues are fully 
drained.
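
Something along these lines should exercise that path (sysfs paths, device
name and the sleep are assumptions to adjust for the device under test):

  # offline every CPU mapped to hw queue 0 and keep them down for longer
  # than the request timeout, so in-flight IOs really have to be drained
  cpus=$(tr ',' ' ' < /sys/block/sdX/mq/0/cpu_list)
  for c in $cpus; do echo 0 > /sys/devices/system/cpu/cpu$c/online; done
  sleep 60    # longer than e.g. /sys/block/sdX/device/timeout
  for c in $cpus; do echo 1 > /sys/devices/system/cpu/cpu$c/online; done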

>> Will the fio processes migrate back onto cpus which have been onlined again?
> 
> Hmm, good question. I've tried to assign them to a specific CPU via
> --cpus_allowed_policy=split and --cpus_allowed.
> 
>    fio --rw=randwrite --name=test --size=50M --iodepth=32 --direct=1 \
>        --bs=4k --numjobs=40 --time_based --runtime=1h --ioengine=libaio \
>        --group_reporting --cpus_allowed_policy=split --cpus_allowed=0-40
> 
> Though I haven't verified what happens when the CPU gets back online.

Maybe this will work since you're offlining patterns of cpus and the fio 
processes have to migrate somewhere. But see above.

> 
>> What is the block driver - NVMe?
> 
> I've used a qla2xxx device. Hannes asked me to retry it with a megasas
> device.
> 

Thanks,
John


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V6 0/8] blk-mq: improvement CPU hotplug
  2020-04-08 13:29       ` John Garry
@ 2020-04-08 15:14         ` Daniel Wagner
  2020-04-08 16:56           ` John Garry
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Wagner @ 2020-04-08 15:14 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Wed, Apr 08, 2020 at 02:29:51PM +0100, John Garry wrote:
> On 08/04/2020 14:10, Daniel Wagner wrote:
> OK, but to really test this you need to ensure that all the CPUs in a
> managed interrupt's affinity mask are offlined together for a period longer
> than the IO timeout. Otherwise the hw queue's managed interrupt is not shut
> down, and you're not verifying that the queues are fully drained.

Not sure if I understand you correctly: Are you saying that the IRQ
related resources are not freed/moved from the offlining CPU?

> > > Will the fio processes migrate back onto cpus which have been onlined again?
> > 
> > Hmm, good question. I've tried to assign them to a specific CPU via
> > --cpus_allowed_policy=split and --cpus_allowed.
> > 
> >    fio --rw=randwrite --name=test --size=50M --iodepth=32 --direct=1 \
> >        --bs=4k --numjobs=40 --time_based --runtime=1h --ioengine=libaio \
> >        --group_reporting --cpus_allowed_policy=split --cpus_allowed=0-40
> > 
> > Though I haven't verified what happens when the CPU gets back online.
> 
> Maybe this will work since you're offlining patterns of cpus and the fio
> processes have to migrate somewhere. But see above.

At least after the initial setup a fio thread will be migrated away
from the offlining CPU.

A quick test shows that the affinity mask for a fio thread will be
cleared when the CPU goes offline. There seems to be a discussion
going on about CPU hotplug and the affinity mask:

https://lore.kernel.org/lkml/1251528473.590671.1579196495905.JavaMail.zimbra@efficios.com

TL;DR: the thread can be scheduled back if its affinity was tweaked via
e.g. taskset, but it won't be if the affinity was set via cpusets.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH V6 0/8] blk-mq: improvement CPU hotplug
  2020-04-08 15:14         ` Daniel Wagner
@ 2020-04-08 16:56           ` John Garry
  0 siblings, 0 replies; 19+ messages in thread
From: John Garry @ 2020-04-08 16:56 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On 08/04/2020 16:14, Daniel Wagner wrote:
> On Wed, Apr 08, 2020 at 02:29:51PM +0100, John Garry wrote:
>> On 08/04/2020 14:10, Daniel Wagner wrote:
>> OK, but to really test this you need to ensure that all the CPUs in a
>> managed interrupt's affinity mask are offlined together for a period longer
>> than the IO timeout. Otherwise the hw queue's managed interrupt is not shut
>> down, and you're not verifying that the queues are fully drained.
> 

Hi Daniel,

> Not sure if I understand you correctly: Are you saying that the IRQ
> related resources are not freed/moved from the offlining CPU?

This series tries to drain the hw queue when all CPUs in the queue's (IRQ) 
affinity mask are being offlined. This is because when all of those CPUs 
are offlined, the managed IRQ for that hw queue is shut down - so there are 
no CPUs remaining online to service the completion interrupt for 
in-flight IOs. The cover letter may explain this better.
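
The queue-to-CPU mapping is visible from userspace, which makes it easy to
see which CPUs have to go down together; roughly (device and IRQ names are
placeholders):

  cat /sys/block/sdX/mq/0/cpu_list            # CPUs feeding hw queue 0
  grep <queue-irq-name> /proc/interrupts      # which IRQ services that queue
  cat /proc/irq/<N>/effective_affinity_list   # where it is currently routed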

> 
>>>> Will the fio processes migrate back onto cpus which have been onlined again?
>>>
>>> Hmm, good question. I've tried to assign them to a specific CPU via
>>> --cpus_allowed_policy=split and --cpus_allowed.
>>>
>>>     fio --rw=randwrite --name=test --size=50M --iodepth=32 --direct=1 \
>>>         --bs=4k --numjobs=40 --time_based --runtime=1h --ioengine=libaio \
>>>         --group_reporting --cpus_allowed_policy=split --cpus_allowed=0-40
>>>
>>> Though I haven't verified what happens when the CPU gets back online.
>>
>> Maybe this will work since you're offlining patterns of cpus and the fio
>> processes have to migrate somewhere. But see above.
> 
> At least after the initial setup a fio thread will be migrated away
> from the offlining CPU.
> 
> A quick test shows that the affinity mask for a fio thread will be
> cleared when the CPU goes offline. There seems to be a discussion
> going on about CPU hotplug and the affinity mask:
> 
> https://lore.kernel.org/lkml/1251528473.590671.1579196495905.JavaMail.zimbra@efficios.com
> 
> TL;DR: the thread can be scheduled back if its affinity was tweaked via
> e.g. taskset, but it won't be if the affinity was set via cpusets.

I just avoid any of this in my test by looping through a sequence of 
onlining all CPUs, starting fio for a short period, and then offlining CPUs.
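
Roughly this, repeated (just a sketch; the CPU numbers and fio parameters
are arbitrary):

  while true; do
      for f in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 1 > $f; done
      fio --name=burst --rw=randread --bs=4k --direct=1 --numjobs=8 \
          --time_based --runtime=30 --filename=/dev/sdX --group_reporting &
      sleep 5
      # offline some CPUs of interest while IO is still in flight
      for c in 4 5 6 7; do echo 0 > /sys/devices/system/cpu/cpu$c/online; done
      wait
  done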

BTW, you mentioned earlier that you would test megaraid_sas. As things 
stand, I don't think this series will help there, as that driver still 
exposes just a single HW queue to blk-mq. I think the qla2xxx driver does 
expose more than one queue, i.e. it sets Scsi_Host.nr_hw_queues, so it 
may be a better option.
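
The number of hw queues a host actually exposes to blk-mq is easy to check
from sysfs, e.g.:

  ls /sys/block/sdX/mq/
  # one directory per hardware queue; a lone "0" means a single hw queue,
  # several entries mean the driver set Scsi_Host.nr_hw_queues > 1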

Cheers,
John


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-04-08 16:57 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-07  9:28 [PATCH V6 0/8] blk-mq: improvement CPU hotplug Ming Lei
2020-04-07  9:28 ` [PATCH V6 1/8] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
2020-04-07 17:14   ` Christoph Hellwig
2020-04-08  1:38     ` Ming Lei
2020-04-07  9:28 ` [PATCH V6 2/8] blk-mq: add new state of BLK_MQ_S_INACTIVE Ming Lei
2020-04-07 17:14   ` Christoph Hellwig
2020-04-07  9:28 ` [PATCH V6 3/8] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
2020-04-07  9:28 ` [PATCH V6 4/8] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
2020-04-07  9:28 ` [PATCH V6 5/8] block: add blk_end_flush_machinery Ming Lei
2020-04-07  9:28 ` [PATCH V6 6/8] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
2020-04-07  9:29 ` [PATCH V6 7/8] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx Ming Lei
2020-04-07  9:29 ` [PATCH V6 8/8] block: deactivate hctx when the hctx is actually inactive Ming Lei
2020-04-08 12:40 ` [PATCH V6 0/8] blk-mq: improvement CPU hotplug Daniel Wagner
2020-04-08 13:01   ` John Garry
2020-04-08 13:10     ` Daniel Wagner
2020-04-08 13:29       ` John Garry
2020-04-08 15:14         ` Daniel Wagner
2020-04-08 16:56           ` John Garry
2020-04-08 13:25   ` Ming Lei
