* [PATCH V7 0/9] blk-mq: improvement CPU hotplug
@ 2020-04-18  3:09 Ming Lei
  2020-04-18  3:09 ` [PATCH V7 1/9] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

However, neither drivers nor blk-mq do that before one hctx becomes
inactive (i.e. all CPUs mapped to the hctx are offline). Even worse, blk-mq
still tries to run the hw queue after the hctx is dead, see
blk_mq_hctx_notify_dead().

This patchset addresses the issue in two stages:

1) add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as inactive and drain all in-flight requests
if the hctx is going to become dead (see the sketch after this list)

2) re-submit IO from the CPUHP_BLK_MQ_DEAD handler after the hctx becomes dead

- steal bios from the request and resubmit them via generic_make_request(),
so that the IO is mapped to another live hctx for dispatch
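
For reference, stage 1 boils down to the following offline callback
(condensed sketch from patch 5 of this series; details may differ slightly):

static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	if (!cpumask_test_cpu(cpu, hctx->cpumask))
		return 0;

	/* only act if 'cpu' is the last online CPU mapped to this hctx */
	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu ||
	    cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
		return 0;

	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb();	/* pairs with the barrier in blk_mq_get_driver_tag() */
	blk_mq_hctx_drain_inflight_rqs(hctx);	/* wait until no rq of this hctx has tag >= 0 */
	return 0;
}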

Thanks to John Garry for running lots of tests on arm64 with this patchset
and for co-investigating all kinds of issues.

Please comment & review, thanks!

https://github.com/ming1/linux/commits/v5.7-rc-blk-mq-improve-cpu-hotplug

V7:
	- fix updating .nr_active in get_driver_tag
	- add hctx->cpumask check in cpuhp handler
	- only drain requests whose tag is >= 0
	- pass more aggressive CPU hotplug & IO tests

V6:
	- simplify getting driver tag, so that we can drain in-flight
	  requests correctly without using synchronize_rcu()
	- handle re-submission of flush & passthrough request correctly

V5:
	- rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
	- re-factor code for re-submit requests in cpu dead hotplug handler
	- address requeue corner case

V4:
	- resubmit IOs in the dispatch list in case this hctx is dead

V3:
	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
	- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
	- patch 4 & patch 5 in V1 have been merged into the block tree, so remove
	  them
	- address comments from John Garry and Minwoo


Ming Lei (9):
  blk-mq: mark blk_mq_get_driver_tag as static
  blk-mq: assign rq->tag in blk_mq_get_driver_tag
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: support rq filter callback when iterating rqs
  blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  block: add blk_end_flush_machinery
  blk-mq: re-submit IO in case that hctx is inactive
  blk-mq: handle requests dispatched from IO scheduler in case of
    inactive hctx
  block: deactivate hctx when the hctx is actually inactive

 block/blk-flush.c          | 143 +++++++++++---
 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |  39 ++--
 block/blk-mq-tag.h         |   4 +
 block/blk-mq.c             | 384 ++++++++++++++++++++++++++++++-------
 block/blk-mq.h             |  25 ++-
 block/blk.h                |   9 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   6 +
 include/linux/cpuhotplug.h |   1 +
 11 files changed, 495 insertions(+), 122 deletions(-)

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
-- 
2.25.2



* [PATCH V7 1/9] blk-mq: mark blk_mq_get_driver_tag as static
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
@ 2020-04-18  3:09 ` Ming Lei
  2020-04-23  7:14   ` Christoph Hellwig
  2020-04-18  3:09 ` [PATCH V7 2/9] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

Now all callers of blk_mq_get_driver_tag are in blk-mq.c, so mark
it as static.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 2 +-
 block/blk-mq.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8e56884fd2e9..ecc9c7f405b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1027,7 +1027,7 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-bool blk_mq_get_driver_tag(struct request *rq)
+static bool blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 10bfdfb494fa..e7d1da4b1f73 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -44,7 +44,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
 void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 				bool kick_requeue_list);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
-bool blk_mq_get_driver_tag(struct request *rq);
 struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_ctx *start);
 
-- 
2.25.2



* [PATCH V7 2/9] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
  2020-04-18  3:09 ` [PATCH V7 1/9] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
@ 2020-04-18  3:09 ` Ming Lei
  2020-04-23  7:30   ` Christoph Hellwig
  2020-04-18  3:09 ` [PATCH V7 3/9] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

With the 'none' elevator in particular, rq->tag is assigned as soon as the
request is allocated, so there is no way to tell whether a request is
being dispatched. The code paths dealing with the driver tag also differ
slightly between the 'none' case and the io-scheduler case.

When a hctx becomes inactive, we have to prevent any request from being
dispatched to the LLD, and acquiring the driver tag is the perfect place
to do that. Meanwhile we can drain such requests by checking whether
rq->tag has been assigned.

So do not assign rq->tag until blk_mq_get_driver_tag() is called.

This also simplifies the driver tag handling code a lot.
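
After this change rq->tag stays -1 from allocation until dispatch time. A
condensed sketch of the resulting helper (the reserved-tag and shared-tag
accounting of the real diff below is omitted here):

static bool blk_mq_get_driver_tag(struct request *rq)
{
	struct blk_mq_alloc_data data = {
		.q = rq->q,
		.hctx = rq->mq_hctx,
		.flags = BLK_MQ_REQ_NOWAIT,
		.cmd_flags = rq->cmd_flags,
	};

	if (rq->tag >= 0)
		return true;			/* already owns a driver tag */

	if (!data.hctx->sched_tags)
		rq->tag = rq->internal_tag;	/* none: reuse the internal tag */
	else
		rq->tag = blk_mq_get_tag(&data); /* io sched: allocate a real tag */

	if (rq->tag >= 0)
		data.hctx->tags->rqs[rq->tag] = rq;

	return rq->tag != -1;
}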

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-flush.c | 18 +++---------------
 block/blk-mq.c    | 45 +++++++++++++++++++++++----------------------
 block/blk-mq.h    | 21 ++++++++++-----------
 block/blk.h       |  5 -----
 4 files changed, 36 insertions(+), 53 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index c7f396e3d5e2..977edf95d711 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -236,13 +236,8 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 		error = fq->rq_status;
 
 	hctx = flush_rq->mq_hctx;
-	if (!q->elevator) {
-		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
-		flush_rq->tag = -1;
-	} else {
-		blk_mq_put_driver_tag(flush_rq);
-		flush_rq->internal_tag = -1;
-	}
+	flush_rq->internal_tag = -1;
+	blk_mq_put_driver_tag(flush_rq);
 
 	running = &fq->flush_queue[fq->flush_running_idx];
 	BUG_ON(fq->flush_pending_idx == fq->flush_running_idx);
@@ -317,14 +312,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	flush_rq->mq_ctx = first_rq->mq_ctx;
 	flush_rq->mq_hctx = first_rq->mq_hctx;
 
-	if (!q->elevator) {
-		fq->orig_rq = first_rq;
-		flush_rq->tag = first_rq->tag;
-		blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
-	} else {
-		flush_rq->internal_tag = first_rq->internal_tag;
-	}
-
+	flush_rq->internal_tag = first_rq->internal_tag;
 	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
 	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
 	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ecc9c7f405b5..29ffa40db793 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -276,18 +276,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	struct request *rq = tags->static_rqs[tag];
 	req_flags_t rq_flags = 0;
 
-	if (data->flags & BLK_MQ_REQ_INTERNAL) {
-		rq->tag = -1;
-		rq->internal_tag = tag;
-	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
-			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
-		}
-		rq->tag = tag;
-		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
-	}
+	rq->internal_tag = tag;
+	rq->tag = -1;
 
 	/* csd/requeue_work/fifo_time is initialized before use */
 	rq->q = data->q;
@@ -472,14 +462,18 @@ static void __blk_mq_free_request(struct request *rq)
 	struct request_queue *q = rq->q;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
-	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != -1)
-		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
+
+	if (hctx->sched_tags) {
+		if (rq->tag >= 0)
+			blk_mq_put_tag(hctx->tags, ctx, rq->tag);
+		blk_mq_put_tag(hctx->sched_tags, ctx, rq->internal_tag);
+	} else {
+		blk_mq_put_tag(hctx->tags, ctx, rq->internal_tag);
+        }
+
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
 }
@@ -527,7 +521,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 		blk_stat_add(rq, now);
 	}
 
-	if (rq->internal_tag != -1)
+	if (rq->q->elevator && rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq, now);
 
 	blk_account_io_done(rq, now);
@@ -1037,14 +1031,21 @@ static bool blk_mq_get_driver_tag(struct request *rq)
 	};
 	bool shared;
 
-	if (rq->tag != -1)
-		return true;
+	if (rq->tag >= 0)
+		goto allocated;
+
+	if (!data.hctx->sched_tags) {
+		rq->tag = rq->internal_tag;
+		goto set_rq;
+	}
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
 		data.flags |= BLK_MQ_REQ_RESERVED;
 
-	shared = blk_mq_tag_busy(data.hctx);
 	rq->tag = blk_mq_get_tag(&data);
+
+set_rq:
+	shared = blk_mq_tag_busy(data.hctx);
 	if (rq->tag >= 0) {
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
@@ -1052,7 +1053,7 @@ static bool blk_mq_get_driver_tag(struct request *rq)
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
 	}
-
+allocated:
 	return rq->tag != -1;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e7d1da4b1f73..d0c72d7d07c8 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -196,26 +196,25 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
-					   struct request *rq)
+static inline void blk_mq_put_driver_tag(struct request *rq)
 {
-	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+	int tag = rq->tag;
+
+	if (tag < 0)
+		return;
+
 	rq->tag = -1;
 
+	if (hctx->sched_tags)
+		blk_mq_put_tag(hctx->tags, rq->mq_ctx, tag);
+
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
 		atomic_dec(&hctx->nr_active);
 	}
 }
 
-static inline void blk_mq_put_driver_tag(struct request *rq)
-{
-	if (rq->tag == -1 || rq->internal_tag == -1)
-		return;
-
-	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
-}
-
 static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 {
 	int cpu;
diff --git a/block/blk.h b/block/blk.h
index 0a94ec68af32..88f0359faada 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -26,11 +26,6 @@ struct blk_flush_queue {
 	struct list_head	flush_data_in_flight;
 	struct request		*flush_rq;
 
-	/*
-	 * flush_rq shares tag with this rq, both can't be active
-	 * at the same time
-	 */
-	struct request		*orig_rq;
 	struct lock_class_key	key;
 	spinlock_t		mq_flush_lock;
 };
-- 
2.25.2



* [PATCH V7 3/9] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
  2020-04-18  3:09 ` [PATCH V7 1/9] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
  2020-04-18  3:09 ` [PATCH V7 2/9] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
@ 2020-04-18  3:09 ` Ming Lei
  2020-04-23  7:31   ` Christoph Hellwig
  2020-04-18  3:09 ` [PATCH V7 4/9] blk-mq: support rq filter callback when iterating rqs Ming Lei
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Most blk-mq drivers depend on managed IRQs' automatic affinity to set up
their queue mapping. Thomas mentioned the following point[1]:

"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD
is a cpuhp state handled after the CPU is down, so there is no chance to
quiesce the hctx for blk-mq wrt. CPU hotplug.

Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop queues
and wait for completion of in-flight requests.

A following patch will stop the hw queue and wait for completion of
in-flight requests when one hctx is becoming dead. This may cause a
deadlock for some stacking blk-mq drivers, such as dm-rq and loop.

Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and set it for dm-rq and loop,
so we don't need to wait for completion of in-flight requests from dm-rq &
loop, and the potential deadlock is avoided.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
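
Condensed, the wiring added by this patch looks as follows (excerpts from
the hunks below):

	/* register the new cpuhp state once at init time */
	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
				blk_mq_hctx_notify_online,
				blk_mq_hctx_notify_offline);

	/* per hctx: only hook into the state when managed IRQs are used */
	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
						 &hctx->cpuhp_online);

	/* stacking drivers opt out, e.g. loop: */
	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;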

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c     |  1 +
 block/blk-mq.c             | 19 +++++++++++++++++++
 drivers/block/loop.c       |  2 +-
 drivers/md/dm-rq.c         |  2 +-
 include/linux/blk-mq.h     |  3 +++
 include/linux/cpuhotplug.h |  1 +
 6 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..8e745826eb86 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -239,6 +239,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 29ffa40db793..a28daefd7dd6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2259,6 +2259,16 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
+static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2295,6 +2305,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2354,6 +2367,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3608,6 +3624,9 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				blk_mq_hctx_notify_online,
+				blk_mq_hctx_notify_offline);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..784f2e038b55 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13b..5f1ff70ac029 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f389d7c724bd..786614753d73 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
 	 */
 	atomic_t		nr_active;
 
+	/** @cpuhp_online: List to store request if CPU is going to die */
+	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
 	struct hlist_node	cpuhp_dead;
 	/** @kobj: Kernel object for sysfs. */
@@ -391,6 +393,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 77d70b633531..24b3a77810b6 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -152,6 +152,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.25.2



* [PATCH V7 4/9] blk-mq: support rq filter callback when iterating rqs
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
                   ` (2 preceding siblings ...)
  2020-04-18  3:09 ` [PATCH V7 3/9] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2020-04-18  3:09 ` Ming Lei
  2020-04-20 10:34   ` John Garry
  2020-04-23  7:32   ` Christoph Hellwig
  2020-04-18  3:09 ` [PATCH V7 5/9] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Currently a request is considered in-flight only when its state is updated
to MQ_RQ_IN_FLIGHT, which is done by the driver via blk_mq_start_request().

Actually, from blk-mq's point of view, a rq can be considered in-flight
once its tag is >= 0.

Pass a rq filter callback so that requests can be iterated more flexibly.

Meanwhile, make blk_mq_all_tag_busy_iter() public so that it can be
called from blk-mq internally.
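
Callers then look like the following (these exact filter/counter names are
introduced by the next patch, which is the first internal user):

	/* filter: treat a rq as busy once it owns a driver tag */
	static bool blk_mq_inflight_rq(struct request *rq, void *data,
				       bool reserved)
	{
		return rq->tag >= 0;
	}

	/* count the in-flight requests of one hctx */
	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
				 blk_mq_inflight_rq, &count_data);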

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c | 39 +++++++++++++++++++++++++++------------
 block/blk-mq-tag.h |  4 ++++
 2 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904a..2e43b827c96d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -255,6 +255,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 struct bt_tags_iter_data {
 	struct blk_mq_tags *tags;
 	busy_tag_iter_fn *fn;
+	busy_rq_iter_fn *busy_rq_fn;
 	void *data;
 	bool reserved;
 };
@@ -274,7 +275,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * test and set the bit before assining ->rqs[].
 	 */
 	rq = tags->rqs[bitnr];
-	if (rq && blk_mq_request_started(rq))
+	if (rq && iter_data->busy_rq_fn(rq, iter_data->data, reserved))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
 	return true;
@@ -294,11 +295,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		bitmap_tags member of struct blk_mq_tags.
  */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
-			     busy_tag_iter_fn *fn, void *data, bool reserved)
+			     busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
+			     void *data, bool reserved)
 {
 	struct bt_tags_iter_data iter_data = {
 		.tags = tags,
 		.fn = fn,
+		.busy_rq_fn = busy_rq_fn,
 		.data = data,
 		.reserved = reserved,
 	};
@@ -310,19 +313,30 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
 /**
  * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
  * @tags:	Tag map to iterate over.
- * @fn:		Pointer to the function that will be called for each started
- *		request. @fn will be called as follows: @fn(rq, @priv,
- *		reserved) where rq is a pointer to a request. 'reserved'
- *		indicates whether or not @rq is a reserved request. Return
- *		true to continue iterating tags, false to stop.
+ * @fn:		Pointer to the function that will be called for each request
+ * 		when .busy_rq_fn(rq) returns true. @fn will be called as
+ * 		follows: @fn(rq, @priv, reserved) where rq is a pointer to a
+ * 		request. 'reserved' indicates whether or not @rq is a reserved
+ * 		request. Return true to continue iterating tags, false to stop.
+ * @busy_rq_fn: Pointer to the function that will be called for each request,
+ * 		@busy_rq_fn's type is same with @fn. Only when @busy_rq_fn(rq,
+ * 		@priv, reserved) returns true, @fn will be called on this rq.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
-		busy_tag_iter_fn *fn, void *priv)
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
+		void *priv)
 {
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, busy_rq_fn,
+				priv, true);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, busy_rq_fn, priv, false);
+}
+
+static bool blk_mq_default_busy_rq(struct request *rq, void *data,
+		bool reserved)
+{
+	return blk_mq_request_started(rq);
 }
 
 /**
@@ -342,7 +356,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
-			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
+			blk_mq_all_tag_busy_iter(tagset->tags[i], fn,
+					blk_mq_default_busy_rq, priv);
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 2b8321efb682..fdf095d513e5 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -21,6 +21,7 @@ struct blk_mq_tags {
 	struct list_head page_list;
 };
 
+typedef bool (busy_rq_iter_fn)(struct request *, void *, bool);
 
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
@@ -34,6 +35,9 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
+		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
-- 
2.25.2



* [PATCH V7 5/9] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
                   ` (3 preceding siblings ...)
  2020-04-18  3:09 ` [PATCH V7 4/9] blk-mq: support rq filter callback when iterating rqs Ming Lei
@ 2020-04-18  3:09 ` Ming Lei
  2020-04-23  7:38   ` Christoph Hellwig
  2020-04-18  3:09 ` [PATCH V7 6/9] block: add blk_end_flush_machinery Ming Lei
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Before a CPU goes offline, check whether it is the last online CPU of the
hctx. If so, mark this hctx as inactive and wait for completion of all
in-flight IOs originating from this hctx.

This guarantees that there is no in-flight IO before the managed IRQ line
is shut down.
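
The wait itself is a simple poll for requests of this hctx that still own a
driver tag (sketch, see the hunk below); the set_bit(BLK_MQ_S_INACTIVE) plus
smp_mb() done before it pairs with the barrier added to
blk_mq_get_driver_tag():

static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
{
	/* blk_mq_tags_inflight_rqs() counts rqs of this hctx with tag >= 0 */
	while (blk_mq_tags_inflight_rqs(hctx))
		msleep(5);
}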

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |   1 +
 block/blk-mq.c         | 118 ++++++++++++++++++++++++++++++++++++++---
 block/blk-mq.h         |   3 +-
 include/linux/blk-mq.h |   3 ++
 4 files changed, 116 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 8e745826eb86..b62390918ca5 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a28daefd7dd6..0b56d7d78269 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1021,7 +1021,7 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-static bool blk_mq_get_driver_tag(struct request *rq)
+static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
@@ -1054,6 +1054,23 @@ static bool blk_mq_get_driver_tag(struct request *rq)
 		data.hctx->tags->rqs[rq->tag] = rq;
 	}
 allocated:
+	/*
+	 * Add one memory barrier in case that direct issue IO process
+	 * is migrated to other CPU which may not belong to this hctx,
+	 * so we can order driver tag assignment and checking
+	 * BLK_MQ_S_INACTIVE. Otherwise, barrier() is enough given the
+	 * two code paths are run on single CPU in case that
+	 * BLK_MQ_S_INACTIVE is set.
+	 */
+	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
+		smp_mb();
+	else
+		barrier();
+
+	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state))) {
+		blk_mq_put_driver_tag(rq);
+		return false;
+	}
 	return rq->tag != -1;
 }
 
@@ -1103,7 +1120,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 		 * Don't clear RESTART here, someone else could have set it.
 		 * At most this will cost an extra queue run.
 		 */
-		return blk_mq_get_driver_tag(rq);
+		return blk_mq_get_driver_tag(rq, false);
 	}
 
 	wait = &hctx->dispatch_wait;
@@ -1129,7 +1146,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	 * allocation failure and adding the hardware queue to the wait
 	 * queue.
 	 */
-	ret = blk_mq_get_driver_tag(rq);
+	ret = blk_mq_get_driver_tag(rq, false);
 	if (!ret) {
 		spin_unlock(&hctx->dispatch_wait_lock);
 		spin_unlock_irq(&wq->lock);
@@ -1226,7 +1243,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
 			break;
 
-		if (!blk_mq_get_driver_tag(rq)) {
+		if (!blk_mq_get_driver_tag(rq, false)) {
 			/*
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed. The
@@ -1258,7 +1275,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			bd.last = true;
 		else {
 			nxt = list_first_entry(list, struct request, queuelist);
-			bd.last = !blk_mq_get_driver_tag(nxt);
+			bd.last = !blk_mq_get_driver_tag(nxt, false);
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
@@ -1851,7 +1868,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_dispatch_budget(hctx))
 		goto insert;
 
-	if (!blk_mq_get_driver_tag(rq)) {
+	if (!blk_mq_get_driver_tag(rq, true)) {
 		blk_mq_put_dispatch_budget(hctx);
 		goto insert;
 	}
@@ -2259,13 +2276,95 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
-static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+struct count_inflight_data {
+	unsigned count;
+	struct blk_mq_hw_ctx *hctx;
+};
+
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
 {
-	return 0;
+	struct count_inflight_data *count_data = data;
+
+	/*
+	 * Can't check rq's state because it is updated to MQ_RQ_IN_FLIGHT
+	 * in blk_mq_start_request(), at that time we can't prevent this rq
+	 * from being issued.
+	 *
+	 * So check if driver tag is assigned, if yes, count this rq as
+	 * inflight.
+	 */
+	if (rq->tag >= 0 && rq->mq_hctx == count_data->hctx)
+		count_data->count++;
+
+	return true;
+}
+
+static bool blk_mq_inflight_rq(struct request *rq, void *data,
+			       bool reserved)
+{
+	return rq->tag >= 0;
+}
+
+static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	struct count_inflight_data count_data = {
+		.count	= 0,
+		.hctx	= hctx,
+	};
+
+	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
+			blk_mq_inflight_rq, &count_data);
+
+	return count_data.count;
+}
+
+static void blk_mq_hctx_drain_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	while (1) {
+		if (!blk_mq_tags_inflight_rqs(hctx))
+			break;
+		msleep(5);
+	}
 }
 
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 {
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
+	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) ||
+			(cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask)
+			 < nr_cpu_ids))
+		return 0;
+
+	/*
+	 * The current CPU is the last one in this hctx, S_INACTIVE
+	 * can be observed in dispatch path without any barrier needed,
+	 * cause both are run on one same CPU.
+	 */
+	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	/*
+	 * Order setting BLK_MQ_S_INACTIVE and checking rq->tag & rqs[tag],
+	 * and its pair is the smp_mb() in blk_mq_get_driver_tag
+	 */
+	smp_mb();
+	blk_mq_hctx_drain_inflight_rqs(hctx);
+	return 0;
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
+	clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
 	return 0;
 }
 
@@ -2281,6 +2380,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	LIST_HEAD(tmp);
 	enum hctx_type type;
 
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d0c72d7d07c8..a1fbd0b8657b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -167,7 +167,8 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
 
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
-	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
+	return test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
+		test_bit(BLK_MQ_S_INACTIVE, &hctx->state);
 }
 
 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 786614753d73..650b208b42ea 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -403,6 +403,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
-- 
2.25.2



* [PATCH V7 6/9] block: add blk_end_flush_machinery
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
                   ` (4 preceding siblings ...)
  2020-04-18  3:09 ` [PATCH V7 5/9] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
@ 2020-04-18  3:09 ` Ming Lei
  2020-04-23  7:40   ` Christoph Hellwig
  2020-04-18  3:09 ` [PATCH V7 7/9] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Flush requests aren't the same as normal FS requests:

1) one dedicated per-hctx flush rq is pre-allocated for sending flush requests

2) flush requests are issued to hardware via one machinery so that flush
merging can be applied

We can't simply re-submit flush rqs via blk_steal_bios(), so add
blk_end_flush_machinery() to collect the flush requests which need to
be resubmitted:

- if one flush command without DATA is enough, send one flush and complete
this kind of request

- otherwise, add the request to a list and let the caller re-submit it.
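
The calling convention is: the caller hands in the in-flight RQF_FLUSH_SEQ
requests of the dying hctx and gets back whatever it still has to re-submit
itself; roughly (list names here are illustrative, the later re-submission
patch uses it this way):

	LIST_HEAD(flush_in);	/* RQF_FLUSH_SEQ rqs collected from the hctx */
	LIST_HEAD(flush_out);	/* filled with rqs the caller must re-submit */

	blk_end_flush_machinery(hctx, &flush_in, &flush_out);

	/* requests on flush_out go through the normal re-submission path */
	list_splice_tail(&flush_out, &resubmit_list);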

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-flush.c | 125 +++++++++++++++++++++++++++++++++++++++++++---
 block/blk.h       |   4 ++
 2 files changed, 122 insertions(+), 7 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 977edf95d711..948f84159efd 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -170,10 +170,11 @@ static void blk_flush_complete_seq(struct request *rq,
 	unsigned int cmd_flags;
 
 	BUG_ON(rq->flush.seq & seq);
-	rq->flush.seq |= seq;
+	if (!error)
+		rq->flush.seq |= seq;
 	cmd_flags = rq->cmd_flags;
 
-	if (likely(!error))
+	if (likely(!error && !fq->flush_queue_terminating))
 		seq = blk_flush_cur_seq(rq);
 	else
 		seq = REQ_FSEQ_DONE;
@@ -200,9 +201,15 @@ static void blk_flush_complete_seq(struct request *rq,
 		 * normal completion and end it.
 		 */
 		BUG_ON(!list_empty(&rq->queuelist));
-		list_del_init(&rq->flush.list);
-		blk_flush_restore_request(rq);
-		blk_mq_end_request(rq, error);
+
+		/* Terminating code will end the request from flush queue */
+		if (likely(!fq->flush_queue_terminating)) {
+			list_del_init(&rq->flush.list);
+			blk_flush_restore_request(rq);
+			blk_mq_end_request(rq, error);
+		} else {
+			list_move_tail(&rq->flush.list, pending);
+		}
 		break;
 
 	default:
@@ -279,7 +286,8 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	struct request *flush_rq = fq->flush_rq;
 
 	/* C1 described at the top of this file */
-	if (fq->flush_pending_idx != fq->flush_running_idx || list_empty(pending))
+	if (fq->flush_pending_idx != fq->flush_running_idx ||
+			list_empty(pending) || fq->flush_queue_terminating)
 		return;
 
 	/* C2 and C3
@@ -331,7 +339,7 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
 	if (q->elevator) {
-		WARN_ON(rq->tag < 0);
+		WARN_ON(rq->tag < 0 && !fq->flush_queue_terminating);
 		blk_mq_put_driver_tag(rq);
 	}
 
@@ -503,3 +511,106 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
 	kfree(fq->flush_rq);
 	kfree(fq);
 }
+
+static void __blk_end_queued_flush(struct blk_flush_queue *fq,
+		unsigned int queue_idx, struct list_head *resubmit_list,
+		struct list_head *flush_list)
+{
+	struct list_head *queue = &fq->flush_queue[queue_idx];
+	struct request *rq, *nxt;
+
+	list_for_each_entry_safe(rq, nxt, queue, flush.list) {
+		unsigned int seq = blk_flush_cur_seq(rq);
+
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
+		if (!blk_rq_sectors(rq) || seq == REQ_FSEQ_POSTFLUSH )
+			list_add_tail(&rq->queuelist, flush_list);
+		else
+			list_add_tail(&rq->queuelist, resubmit_list);
+	}
+}
+
+static void blk_end_queued_flush(struct blk_flush_queue *fq,
+		struct list_head *resubmit_list, struct list_head *flush_list)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	__blk_end_queued_flush(fq, 0, resubmit_list, flush_list);
+	__blk_end_queued_flush(fq, 1, resubmit_list, flush_list);
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+}
+
+/* complete requests which just requires one flush command */
+static void blk_complete_flush_requests(struct blk_flush_queue *fq,
+		struct list_head *flush_list)
+{
+	struct block_device *bdev;
+	struct request *rq;
+	int error = -ENXIO;
+
+	if (list_empty(flush_list))
+		return;
+
+	rq = list_first_entry(flush_list, struct request, queuelist);
+
+	/* Send flush via one active hctx so we can move on */
+	bdev = bdget_disk(rq->rq_disk, 0);
+	if (bdev) {
+		error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
+		bdput(bdev);
+	}
+
+	while (!list_empty(flush_list)) {
+		rq = list_first_entry(flush_list, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_end_request(rq, error);
+	}
+}
+
+/*
+ * Called when this hctx is inactive and all CPUs of this hctx is dead.
+ *
+ * Terminate this hw queue's flush machinery, and try to complete flush
+ * IO requests if possible, such as any flush IO without data, or flush
+ * data IO in POSTFLUSH stage. Otherwise, add the flush IOs into @list
+ * and let caller to re-submit them.
+ *
+ * No need to hold the flush queue lock given no any flush activity can
+ * reach this queue now.
+ */
+void blk_end_flush_machinery(struct blk_mq_hw_ctx *hctx,
+		struct list_head *in, struct list_head *out)
+{
+	LIST_HEAD(resubmit_list);
+	LIST_HEAD(flush_list);
+	struct blk_flush_queue *fq = hctx->fq;
+	struct request *rq, *nxt;
+	unsigned long flags;
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	fq->flush_queue_terminating = 1;
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+
+	/* End inflight flush requests */
+	list_for_each_entry_safe(rq, nxt, in, queuelist) {
+		WARN_ON(!(rq->rq_flags & RQF_FLUSH_SEQ));
+		list_del_init(&rq->queuelist);
+		rq->end_io(rq, BLK_STS_AGAIN);
+	}
+
+	/* End queued requests */
+	blk_end_queued_flush(fq, &resubmit_list, &flush_list);
+
+	/* Send flush and complete requests which just need one flush req */
+	blk_complete_flush_requests(fq, &flush_list);
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	/* reset flush queue so that it is ready to work next time */
+	fq->flush_pending_idx = fq->flush_running_idx = 0;
+	fq->flush_queue_terminating = 0;
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+
+	list_splice_init(&resubmit_list, out);
+}
diff --git a/block/blk.h b/block/blk.h
index 88f0359faada..a203db6b68eb 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -20,6 +20,7 @@ struct blk_flush_queue {
 	unsigned int		flush_queue_delayed:1;
 	unsigned int		flush_pending_idx:1;
 	unsigned int		flush_running_idx:1;
+	unsigned int		flush_queue_terminating:1;
 	blk_status_t 		rq_status;
 	unsigned long		flush_pending_since;
 	struct list_head	flush_queue[2];
@@ -483,4 +484,7 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		bool *same_page);
 
+void blk_end_flush_machinery(struct blk_mq_hw_ctx *hctx,
+		struct list_head *in, struct list_head *out);
+
 #endif /* BLK_INTERNAL_H */
-- 
2.25.2



* [PATCH V7 7/9] blk-mq: re-submit IO in case that hctx is inactive
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
                   ` (5 preceding siblings ...)
  2020-04-18  3:09 ` [PATCH V7 6/9] block: add blk_end_flush_machinery Ming Lei
@ 2020-04-18  3:09 ` Ming Lei
  2020-04-23  7:50   ` Christoph Hellwig
  2020-04-18  3:09 ` [PATCH V7 8/9] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx Ming Lei
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

When all CPUs of one hctx are offline and this hctx becomes inactive, we
shouldn't run this hw queue to complete requests any more.

So steal bios from the request, resubmit them, and finally free the
request in blk_mq_hctx_notify_dead().
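
For a normal FS request the re-submission boils down to the following
(this is essentially blk_mq_resubmit_fs_io() from the diff below):

static void blk_mq_resubmit_fs_io(struct request *rq)
{
	struct bio_list list;
	struct bio *bio;

	bio_list_init(&list);
	blk_steal_bios(&list, rq);		/* detach the bios from the dead rq */

	while ((bio = bio_list_pop(&list)))
		generic_make_request(bio);	/* gets mapped to a live hctx */

	blk_mq_cleanup_rq(rq);
	blk_mq_end_request(rq, 0);		/* finally free the original rq */
}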

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 130 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 121 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0b56d7d78269..ae1e57c64ca1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2368,10 +2368,98 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+static void blk_mq_resubmit_end_io(struct request *rq, blk_status_t error)
+{
+	struct request *orig_rq = rq->end_io_data;
+
+	blk_mq_cleanup_rq(orig_rq);
+	blk_mq_end_request(orig_rq, error);
+
+	blk_put_request(rq);
+}
+
+static void blk_mq_resubmit_passthrough_io(struct request *rq)
+{
+	struct request *nrq;
+	unsigned int flags = 0, cmd_flags = 0;
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+	struct blk_mq_tags *tags = rq->q->elevator ? hctx->sched_tags :
+		hctx->tags;
+	bool reserved = blk_mq_tag_is_reserved(tags, rq->internal_tag);
+
+	if (rq->rq_flags & RQF_PREEMPT)
+		flags |= BLK_MQ_REQ_PREEMPT;
+	if (reserved)
+		flags |= BLK_MQ_REQ_RESERVED;
+
+	/* avoid allocation failure & IO merge */
+	cmd_flags = (rq->cmd_flags & ~REQ_NOWAIT) | REQ_NOMERGE;
+
+	nrq = blk_get_request(rq->q, cmd_flags, flags);
+	if (!nrq)
+		return;
+
+	nrq->__sector = blk_rq_pos(rq);
+	nrq->__data_len = blk_rq_bytes(rq);
+	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
+		nrq->rq_flags |= RQF_SPECIAL_PAYLOAD;
+		nrq->special_vec = rq->special_vec;
+	}
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	nrq->nr_integrity_segments = rq->nr_integrity_segments;
+#endif
+	nrq->nr_phys_segments = rq->nr_phys_segments;
+	nrq->ioprio = rq->ioprio;
+	nrq->extra_len = rq->extra_len;
+	nrq->rq_disk = rq->rq_disk;
+	nrq->part = rq->part;
+	nrq->write_hint = rq->write_hint;
+	nrq->timeout = rq->timeout;
+
+	memcpy(blk_mq_rq_to_pdu(nrq), blk_mq_rq_to_pdu(rq),
+			rq->q->tag_set->cmd_size);
+
+	nrq->end_io = blk_mq_resubmit_end_io;
+	nrq->end_io_data = rq;
+	nrq->bio = rq->bio;
+	nrq->biotail = rq->biotail;
+
+	blk_account_io_start(nrq, true);
+	blk_mq_sched_insert_request(nrq, true, true, true);
+}
+
+static void blk_mq_resubmit_fs_io(struct request *rq)
+{
+	struct bio_list list;
+	struct bio *bio;
+
+	bio_list_init(&list);
+	blk_steal_bios(&list, rq);
+
+	while (true) {
+		bio = bio_list_pop(&list);
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+
+	blk_mq_cleanup_rq(rq);
+	blk_mq_end_request(rq, 0);
+}
+
+static void blk_mq_resubmit_io(struct request *rq)
+{
+	if (rq->end_io || blk_rq_is_passthrough(rq))
+		blk_mq_resubmit_passthrough_io(rq);
+	else
+		blk_mq_resubmit_fs_io(rq);
+}
+
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * 'cpu' has gone away. If this hctx is inactive, we can't dispatch request
+ * to the hctx any more, so steal bios from requests of this hctx, and
+ * re-submit them to the request queue, and free these requests finally.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
@@ -2394,14 +2482,38 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	}
 	spin_unlock(&ctx->lock);
 
-	if (list_empty(&tmp))
-		return 0;
+	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
+		if (!list_empty(&tmp)) {
+			spin_lock(&hctx->lock);
+			list_splice_tail_init(&tmp, &hctx->dispatch);
+			spin_unlock(&hctx->lock);
+			blk_mq_run_hw_queue(hctx, true);
+		}
+	} else {
+		LIST_HEAD(flush_in);
+		LIST_HEAD(flush_out);
+		struct request *rq, *nxt;
 
-	spin_lock(&hctx->lock);
-	list_splice_tail_init(&tmp, &hctx->dispatch);
-	spin_unlock(&hctx->lock);
+		/* requests in dispatch list have to be re-submitted too */
+		spin_lock(&hctx->lock);
+		list_splice_tail_init(&hctx->dispatch, &tmp);
+		spin_unlock(&hctx->lock);
+
+		/* blk_end_flush_machinery will cover flush request */
+		list_for_each_entry_safe(rq, nxt, &tmp, queuelist) {
+			if (rq->rq_flags & RQF_FLUSH_SEQ)
+				list_move(&rq->queuelist, &flush_in);
+		}
+		blk_end_flush_machinery(hctx, &flush_in, &flush_out);
+		list_splice_tail(&flush_out, &tmp);
+
+		while (!list_empty(&tmp)) {
+			rq = list_first_entry(&tmp, struct request, queuelist);
+			list_del_init(&rq->queuelist);
+			blk_mq_resubmit_io(rq);
+		}
+	}
 
-	blk_mq_run_hw_queue(hctx, true);
 	return 0;
 }
 
-- 
2.25.2



* [PATCH V7 8/9] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
                   ` (6 preceding siblings ...)
  2020-04-18  3:09 ` [PATCH V7 7/9] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
@ 2020-04-18  3:09 ` Ming Lei
  2020-04-23  7:51   ` Christoph Hellwig
  2020-04-18  3:09 ` [PATCH V7 9/9] block: deactivate hctx when the hctx is actually inactive Ming Lei
  2020-04-20 10:29 ` [PATCH V7 0/9] blk-mq: improvement CPU hotplug John Garry
  9 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

If one hctx becomes inactive when its CPUs are all offline, all in-queue
requests targeting this hctx have to be re-submitted.

Re-submit requests from both the sw queue and the scheduler queue when the
hctx is found to be inactive.
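
Requests still sitting in the elevator are pulled out through the
scheduler's dispatch hook and re-routed (sketch of the core of
blk_mq_hctx_deactivate() added below):

	if (!e) {
		/* no elevator: collect everything from the sw queues */
		blk_mq_flush_busy_ctxs(hctx, &re_submit);
	} else {
		while ((rq = e->type->ops.dispatch_request(hctx))) {
			if (rq->mq_hctx != hctx)
				/* belongs to another hctx: re-insert later */
				list_add(&rq->queuelist, &sched_tmp);
			else
				list_add(&rq->queuelist, &re_submit);
		}
	}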

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 100 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 62 insertions(+), 38 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ae1e57c64ca1..54ba8a9c3c93 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2456,6 +2456,52 @@ static void blk_mq_resubmit_io(struct request *rq)
 		blk_mq_resubmit_fs_io(rq);
 }
 
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(sched_tmp);
+	LIST_HEAD(re_submit);
+	LIST_HEAD(flush_in);
+	LIST_HEAD(flush_out);
+	struct request *rq, *nxt;
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (!e) {
+		blk_mq_flush_busy_ctxs(hctx, &re_submit);
+	} else {
+		while ((rq = e->type->ops.dispatch_request(hctx))) {
+			if (rq->mq_hctx != hctx)
+				list_add(&rq->queuelist, &sched_tmp);
+			else
+				list_add(&rq->queuelist, &re_submit);
+		}
+	}
+	while (!list_empty(&sched_tmp)) {
+		rq = list_entry(sched_tmp.next, struct request,
+				queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_sched_insert_request(rq, true, true, true);
+	}
+
+	/* requests in dispatch list have to be re-submitted too */
+	spin_lock(&hctx->lock);
+	list_splice_tail_init(&hctx->dispatch, &re_submit);
+	spin_unlock(&hctx->lock);
+
+	/* blk_end_flush_machinery will cover flush request */
+	list_for_each_entry_safe(rq, nxt, &re_submit, queuelist) {
+		if (rq->rq_flags & RQF_FLUSH_SEQ)
+			list_move(&rq->queuelist, &flush_in);
+	}
+	blk_end_flush_machinery(hctx, &flush_in, &flush_out);
+	list_splice_tail(&flush_out, &re_submit);
+
+	while (!list_empty(&re_submit)) {
+		rq = list_first_entry(&re_submit, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_resubmit_io(rq);
+	}
+}
+
 /*
  * 'cpu' has gone away. If this hctx is inactive, we can't dispatch request
  * to the hctx any more, so steal bios from requests of this hctx, and
@@ -2463,54 +2509,32 @@ static void blk_mq_resubmit_io(struct request *rq)
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
-	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx;
-	LIST_HEAD(tmp);
-	enum hctx_type type;
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_dead);
 
 	if (!cpumask_test_cpu(cpu, hctx->cpumask))
 		return 0;
 
-	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
-	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
-	type = hctx->type;
-
-	spin_lock(&ctx->lock);
-	if (!list_empty(&ctx->rq_lists[type])) {
-		list_splice_init(&ctx->rq_lists[type], &tmp);
-		blk_mq_hctx_clear_pending(hctx, ctx);
-	}
-	spin_unlock(&ctx->lock);
+	if (test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
+		blk_mq_hctx_deactivate(hctx);
+	} else if (!hctx->queue->elevator) {
+		struct blk_mq_ctx *ctx = __blk_mq_get_ctx(hctx->queue, cpu);
+		enum hctx_type type = hctx->type;
+		LIST_HEAD(tmp);
+
+		spin_lock(&ctx->lock);
+		if (!list_empty(&ctx->rq_lists[type])) {
+			list_splice_init(&ctx->rq_lists[type], &tmp);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+		}
+		spin_unlock(&ctx->lock);
 
-	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
 		if (!list_empty(&tmp)) {
 			spin_lock(&hctx->lock);
 			list_splice_tail_init(&tmp, &hctx->dispatch);
 			spin_unlock(&hctx->lock);
-			blk_mq_run_hw_queue(hctx, true);
-		}
-	} else {
-		LIST_HEAD(flush_in);
-		LIST_HEAD(flush_out);
-		struct request *rq, *nxt;
 
-		/* requests in dispatch list have to be re-submitted too */
-		spin_lock(&hctx->lock);
-		list_splice_tail_init(&hctx->dispatch, &tmp);
-		spin_unlock(&hctx->lock);
-
-		/* blk_end_flush_machinery will cover flush request */
-		list_for_each_entry_safe(rq, nxt, &tmp, queuelist) {
-			if (rq->rq_flags & RQF_FLUSH_SEQ)
-				list_move(&rq->queuelist, &flush_in);
-		}
-		blk_end_flush_machinery(hctx, &flush_in, &flush_out);
-		list_splice_tail(&flush_out, &tmp);
-
-		while (!list_empty(&tmp)) {
-			rq = list_first_entry(&tmp, struct request, queuelist);
-			list_del_init(&rq->queuelist);
-			blk_mq_resubmit_io(rq);
+			blk_mq_run_hw_queue(hctx, true);
 		}
 	}
 
-- 
2.25.2



* [PATCH V7 9/9] block: deactivate hctx when the hctx is actually inactive
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
                   ` (7 preceding siblings ...)
  2020-04-18  3:09 ` [PATCH V7 8/9] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx Ming Lei
@ 2020-04-18  3:09 ` Ming Lei
  2020-04-20 10:29 ` [PATCH V7 0/9] blk-mq: improvement CPU hotplug John Garry
  9 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2020-04-18  3:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Running the queue on a dead CPU may still be triggered in some corner
cases, such as when a request is requeued after CPU hotplug has been
handled.

So handle this corner case when running the queue.
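
The check added to __blk_mq_run_hw_queue() (see the hunk below) covers it:

	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
	    cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) >= nr_cpu_ids) {
		/* every CPU of this hctx is offline: deactivate and re-submit */
		blk_mq_hctx_deactivate(hctx);
		return;
	}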

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 30 ++++++++++--------------------
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 54ba8a9c3c93..2babe96b2883 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -43,6 +43,8 @@
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx);
+
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
 	int ddir, sectors, bucket;
@@ -1371,28 +1373,16 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	int srcu_idx;
 
 	/*
-	 * We should be running this queue from one of the CPUs that
-	 * are mapped to it.
-	 *
-	 * There are at least two related races now between setting
-	 * hctx->next_cpu from blk_mq_hctx_next_cpu() and running
-	 * __blk_mq_run_hw_queue():
-	 *
-	 * - hctx->next_cpu is found offline in blk_mq_hctx_next_cpu(),
-	 *   but later it becomes online, then this warning is harmless
-	 *   at all
-	 *
-	 * - hctx->next_cpu is found online in blk_mq_hctx_next_cpu(),
-	 *   but later it becomes offline, then the warning can't be
-	 *   triggered, and we depend on blk-mq timeout handler to
-	 *   handle dispatched requests to this hctx
+	 * BLK_MQ_S_INACTIVE may not deal with some requeue corner case:
+	 * one request is requeued after cpu unplug is handled, so check
+	 * if the hctx is actually inactive. If yes, deactive it and
+	 * re-submit all requests in the queue.
 	 */
 	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
-		cpu_online(hctx->next_cpu)) {
-		printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
-			raw_smp_processor_id(),
-			cpumask_empty(hctx->cpumask) ? "inactive": "active");
-		dump_stack();
+	    cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) >=
+	    nr_cpu_ids) {
+		blk_mq_hctx_deactivate(hctx);
+		return;
 	}
 
 	/*
-- 
2.25.2



* Re: [PATCH V7 0/9] blk-mq: improvement CPU hotplug
  2020-04-18  3:09 [PATCH V7 0/9] blk-mq: improvement CPU hotplug Ming Lei
                   ` (8 preceding siblings ...)
  2020-04-18  3:09 ` [PATCH V7 9/9] block: deactivate hctx when the hctx is actually inactive Ming Lei
@ 2020-04-20 10:29 ` John Garry
  9 siblings, 0 replies; 22+ messages in thread
From: John Garry @ 2020-04-20 10:29 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner

On 18/04/2020 04:09, Ming Lei wrote:
> Hi,
> 
> Thomas mentioned:
>      "
>       That was the constraint of managed interrupts from the very beginning:
>      
>        The driver/subsystem has to quiesce the interrupt line and the associated
>        queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>        until it's restarted by the core when the CPU is plugged in again.
>      "
> 
> But no drivers or blk-mq do that before one hctx becomes inactive(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> 
> This patchset tries to address the issue by two stages:
> 
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> 
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
> 
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> 
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
> 
> Thanks John Garry for running lots of tests on arm64 with this patchset
> and co-working on investigating all kinds of issues.
> 
> Please comment & review, thanks!
> 
> https://github.com/ming1/linux/commits/v5.7-rc-blk-mq-improve-cpu-hotplug

This is pretty solid now. I get no SCSI timeouts with my test script (at 
the bottom). Thanks to Ming for this.

Tested-by: John Garry <john.garry@huawei.com>

> 
> V7:
> 	- fix updating .nr_active in get_driver_tag
> 	- add hctx->cpumask check in cpuhp handler
> 	- only drain requests which tag is >= 0
> 	- pass more aggressive cpuhotplug&io test
> 
> V6:
> 	- simplify getting driver tag, so that we can drain in-flight
> 	  requests correctly without using synchronize_rcu()
> 	- handle re-submission of flush & passthrough request correctly
> 
> V5:
> 	- rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
> 	- re-factor code for re-submit requests in cpu dead hotplug handler
> 	- address requeue corner case
> 
> V4:
> 	- resubmit IOs in dispatch list in case that this hctx is dead
> 
> V3:
> 	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
> 	- fix patch 4 for avoiding potential deadlock, as found by Hannes
> 
> V2:
> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> 	  them
> 	- address comments from John Garry and Minwoo
> 
> 
> Ming Lei (9):
>    blk-mq: mark blk_mq_get_driver_tag as static
>    blk-mq: assign rq->tag in blk_mq_get_driver_tag
>    blk-mq: prepare for draining IO when hctx's all CPUs are offline
>    blk-mq: support rq filter callback when iterating rqs
>    blk-mq: stop to handle IO and drain IO before hctx becomes inactive
>    block: add blk_end_flush_machinery
>    blk-mq: re-submit IO in case that hctx is inactive
>    blk-mq: handle requests dispatched from IO scheduler in case of
>      inactive hctx
>    block: deactivate hctx when the hctx is actually inactive
> 
>   block/blk-flush.c          | 143 +++++++++++---
>   block/blk-mq-debugfs.c     |   2 +
>   block/blk-mq-tag.c         |  39 ++--
>   block/blk-mq-tag.h         |   4 +
>   block/blk-mq.c             | 384 ++++++++++++++++++++++++++++++-------
>   block/blk-mq.h             |  25 ++-
>   block/blk.h                |   9 +-
>   drivers/block/loop.c       |   2 +-
>   drivers/md/dm-rq.c         |   2 +-
>   include/linux/blk-mq.h     |   6 +
>   include/linux/cpuhotplug.h |   1 +
>   11 files changed, 495 insertions(+), 122 deletions(-)
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> 

./enable_all.sh #enable cpu0-63

for((i = 0; i < 100 ; i++))
do
   echo "Looping ... number $i"
   ./create_fio_task_cpu.sh 4k read 2048 1  & # run fio over all cpus for 5x SAS disks
   echo "short sleep, then disable"
   sleep 5
   ./disable_all.sh #disable cpu1-60
   echo "long sleep $i"
   sleep 50
   echo "long sleep over number $i"
   ./enable_all.sh
   sleep 3
done


* Re: [PATCH V7 4/9] blk-mq: support rq filter callback when iterating rqs
  2020-04-18  3:09 ` [PATCH V7 4/9] blk-mq: support rq filter callback when iterating rqs Ming Lei
@ 2020-04-20 10:34   ` John Garry
  2020-04-23  7:31     ` Christoph Hellwig
  2020-04-23  7:32   ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: John Garry @ 2020-04-20 10:34 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner

>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 2b8321efb682..fdf095d513e5 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -21,6 +21,7 @@ struct blk_mq_tags {
>   	struct list_head page_list;
>   };
>   
> +typedef bool (busy_rq_iter_fn)(struct request *, void *, bool);
>   
>   extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
>   extern void blk_mq_free_tags(struct blk_mq_tags *tags);
> @@ -34,6 +35,9 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>   extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
>   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>   		void *priv);
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
> +		void *priv);

Could you please pay attention to alignment of the arguments and the 
opening brace? This patchset fails that in many places.

Thanks,
John

>   
>   static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
>   						 struct blk_mq_hw_ctx *hctx)
> 



* Re: [PATCH V7 1/9] blk-mq: mark blk_mq_get_driver_tag as static
  2020-04-18  3:09 ` [PATCH V7 1/9] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
@ 2020-04-23  7:14   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-23  7:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

On Sat, Apr 18, 2020 at 11:09:17AM +0800, Ming Lei wrote:
> Now all callers of blk_mq_get_driver_tag are in blk-mq.c, so mark
> it as static.
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: John Garry <john.garry@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH V7 2/9] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-04-18  3:09 ` [PATCH V7 2/9] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
@ 2020-04-23  7:30   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-23  7:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry

On Sat, Apr 18, 2020 at 11:09:18AM +0800, Ming Lei wrote:
> Especially for the none elevator, rq->tag is assigned after the request
> is allocated, so there isn't any way to figure out if one request is
> being dispatched. Also the code path wrt. the driver tag becomes a bit
> different between none and an io scheduler.
> 
> When one hctx becomes inactive, we have to prevent any request from
> being dispatched to the LLD, and getting the driver tag provides a
> perfect chance to do that. Meantime we can drain any such requests by
> checking if rq->tag is assigned.
> 
> So only assign rq->tag when blk_mq_get_driver_tag() is called.
> 
> This way also simplifies the code dealing with the driver tag a lot.

I really like this, but I find blk_mq_get_driver_tag really convoluted
after this patch, mostly due to the excessive use of gotos, but also
because it mixes tag >= 0 and tag == -1 checks, which should mean the
same thing since -1 is the only negative value assigned (which btw really
should grow a BLK_MQ_NO_TAG name in another patch).  I think something
like this folded into your patch would be a nice improvement and also fit
in with the changes later in the series:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 05fe2eb615a9..fb7b35941e6c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1021,40 +1021,41 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-static bool blk_mq_get_driver_tag(struct request *rq)
+static bool __blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_alloc_data data = {
-		.q = rq->q,
-		.hctx = rq->mq_hctx,
-		.flags = BLK_MQ_REQ_NOWAIT,
-		.cmd_flags = rq->cmd_flags,
+		.q		= rq->q,
+		.hctx		= rq->mq_hctx,
+		.flags		= BLK_MQ_REQ_NOWAIT,
+		.cmd_flags	= rq->cmd_flags,
 	};
-	bool shared;
-
-	if (rq->tag >= 0)
-		goto allocated;
 
-	if (!data.hctx->sched_tags) {
+	if (data.hctx->sched_tags) {
+		if (blk_mq_tag_is_reserved(data.hctx->sched_tags,
+				rq->internal_tag))
+			data.flags |= BLK_MQ_REQ_RESERVED;
+		rq->tag = blk_mq_get_tag(&data);
+	} else {
 		rq->tag = rq->internal_tag;
-		goto set_rq;
 	}
 
-	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
-		data.flags |= BLK_MQ_REQ_RESERVED;
-
-	rq->tag = blk_mq_get_tag(&data);
+	if (rq->tag == -1)
+		return false;
 
-set_rq:
-	shared = blk_mq_tag_busy(data.hctx);
-	if (rq->tag >= 0) {
-		if (shared) {
-			rq->rq_flags |= RQF_MQ_INFLIGHT;
-			atomic_inc(&data.hctx->nr_active);
-		}
-		data.hctx->tags->rqs[rq->tag] = rq;
+	if (blk_mq_tag_busy(data.hctx)) {
+		rq->rq_flags |= RQF_MQ_INFLIGHT;
+		atomic_inc(&data.hctx->nr_active);
 	}
-allocated:
-	return rq->tag != -1;
+	data.hctx->tags->rqs[rq->tag] = rq;
+	return true;
+}
+
+
+static bool blk_mq_get_driver_tag(struct request *rq)
+{
+	if (rq->tag == -1)
+		return __blk_mq_get_driver_tag(rq);
+	return true;
 }
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
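
As a side note, the BLK_MQ_NO_TAG name mentioned above would boil down to
something like the sketch below (an illustration of the naming only, not part
of this series):

	/* sentinel for "no driver tag assigned", instead of open-coding -1 */
	#define BLK_MQ_NO_TAG	-1

	if (rq->tag == BLK_MQ_NO_TAG)
		return false;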


* Re: [PATCH V7 3/9] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-04-18  3:09 ` [PATCH V7 3/9] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2020-04-23  7:31   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-23  7:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Sat, Apr 18, 2020 at 11:09:19AM +0800, Ming Lei wrote:
> Most blk-mq drivers depend on managed IRQs' auto-affinity to set up
> queue mappings. Thomas mentioned the following point[1]:
> 
> "
>  That was the constraint of managed interrupts from the very beginning:
> 
>   The driver/subsystem has to quiesce the interrupt line and the associated
>   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>   until it's restarted by the core when the CPU is plugged in again.
> "
> 
> However, the current blk-mq implementation doesn't quiesce the hw queue
> before the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD
> is a cpuhp state handled after the CPU is down, so there isn't any chance
> to quiesce the hctx for blk-mq wrt. CPU hotplug.
> 
> Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop queues
> and wait for completion of in-flight requests.
> 
> We will stop the hw queue and wait for completion of in-flight requests
> when one hctx becomes dead in the following patch. This may cause
> deadlocks for some stacking blk-mq drivers, such as dm-rq and loop.
> 
> Add a blk-mq flag, BLK_MQ_F_NO_MANAGED_IRQ, and mark it for dm-rq and
> loop, so we needn't wait for completion of in-flight requests from
> dm-rq & loop, and the potential deadlock can be avoided.

The code here looks fine, but the split from the patches that actually
use it, instead of just adding stubs first, seems odd.


* Re: [PATCH V7 4/9] blk-mq: support rq filter callback when iterating rqs
  2020-04-20 10:34   ` John Garry
@ 2020-04-23  7:31     ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-23  7:31 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, linux-block, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Mon, Apr 20, 2020 at 11:34:11AM +0100, John Garry wrote:
>>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
>> index 2b8321efb682..fdf095d513e5 100644
>> --- a/block/blk-mq-tag.h
>> +++ b/block/blk-mq-tag.h
>> @@ -21,6 +21,7 @@ struct blk_mq_tags {
>>   	struct list_head page_list;
>>   };
>>   +typedef bool (busy_rq_iter_fn)(struct request *, void *, bool);
>>     extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, 
>> unsigned int reserved_tags, int node, int alloc_policy);
>>   extern void blk_mq_free_tags(struct blk_mq_tags *tags);
>> @@ -34,6 +35,9 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>>   extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
>>   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>>   		void *priv);
>> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>> +		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
>> +		void *priv);
>
> Could you please pay attention to alignment of the arguments and the 
> opening brace? This patchset fails that in many places.

Ming's patch uses the perfectly fine two-tab alignment, which in many
ways is much easier to deal with than the weird brace alignment.
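
For reference, the two styles under discussion look like this (prototypes
shown only to illustrate the indentation, taken from the hunk quoted above):

	/* two-tab continuation indent */
	void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
			busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
			void *priv);

	/* continuation aligned with the opening parenthesis */
	void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
				      busy_tag_iter_fn *fn,
				      busy_rq_iter_fn *busy_rq_fn, void *priv);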


* Re: [PATCH V7 4/9] blk-mq: support rq filter callback when iterating rqs
  2020-04-18  3:09 ` [PATCH V7 4/9] blk-mq: support rq filter callback when iterating rqs Ming Lei
  2020-04-20 10:34   ` John Garry
@ 2020-04-23  7:32   ` Christoph Hellwig
  1 sibling, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-23  7:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH V7 5/9] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-04-18  3:09 ` [PATCH V7 5/9] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
@ 2020-04-23  7:38   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-23  7:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Sat, Apr 18, 2020 at 11:09:21AM +0800, Ming Lei wrote:
> -static bool blk_mq_get_driver_tag(struct request *rq)
> +static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
>  {
>  	struct blk_mq_alloc_data data = {
>  		.q = rq->q,
> @@ -1054,6 +1054,23 @@ static bool blk_mq_get_driver_tag(struct request *rq)
>  		data.hctx->tags->rqs[rq->tag] = rq;
>  	}
>  allocated:
> +	/*
> +	 * Add one memory barrier in case that direct issue IO process
> +	 * is migrated to other CPU which may not belong to this hctx,
> +	 * so we can order driver tag assignment and checking
> +	 * BLK_MQ_S_INACTIVE. Otherwise, barrier() is enough given the
> +	 * two code paths are run on single CPU in case that
> +	 * BLK_MQ_S_INACTIVE is set.

Please use up all 80 characters for the comments (also elsewhere). That
being said I fail to see what the barrier() as a pure compiler barrier
even buys us here.

> +	 */
> +	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
> +		smp_mb();
> +	else
> +		barrier();
> +
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data.hctx->state))) {
> +		blk_mq_put_driver_tag(rq);
> +		return false;
> +	}
>  	return rq->tag != -1;

Also if you take my cleanup to patch 2, we could just open code the
direct_issue case in the only caller instead of having the magic in the
common routine.

> +	if ((cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu) ||
> +			(cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask)
> +			 < nr_cpu_ids))

No need for the inner braces.  Also in this case I think something like:

	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu ||
	    cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)

might be a tad more readable, but then again this might even be worth
a little inline helper once we start bike shedding.
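
Such an inline helper might look like the sketch below; the name is made up
here purely for illustration:

	/* true if @cpu is the only online CPU mapped to @hctx */
	static inline bool blk_mq_last_online_cpu(struct blk_mq_hw_ctx *hctx,
						  unsigned int cpu)
	{
		return cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) == cpu &&
		       cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) >=
		       nr_cpu_ids;
	}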

> +	/*
> +	 * The current CPU is the last one in this hctx, S_INACTIVE
> +	 * can be observed in dispatch path without any barrier needed,
> +	 * cause both are run on one same CPU.
> +	 */
> +	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> +	/*
> +	 * Order setting BLK_MQ_S_INACTIVE and checking rq->tag & rqs[tag],
> +	 * and its pair is the smp_mb() in blk_mq_get_driver_tag
> +	 */
> +	smp_mb();
> +	blk_mq_hctx_drain_inflight_rqs(hctx);
> +	return 0;

FYI, Documentation/core-api/atomic_ops.rst asks for using
smp_mb__before_atomic / smp_mb__after_atomic around the bitops.
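
Following that documentation hint, the quoted hunk might instead read as the
sketch below (illustrative only, not the author's final code):

	/* The current CPU is the last online one mapped to this hctx. */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	/* pairs with the smp_mb() in blk_mq_get_driver_tag() */
	smp_mb__after_atomic();
	blk_mq_hctx_drain_inflight_rqs(hctx);
	return 0;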

> +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
> +	clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
>  	return 0;
>  }

Why not simply:

	if (cpumask_test_cpu(cpu, hctx->cpumask))
		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	return 0;

Conceptually the changes look fine.


* Re: [PATCH V7 6/9] block: add blk_end_flush_machinery
  2020-04-18  3:09 ` [PATCH V7 6/9] block: add blk_end_flush_machinery Ming Lei
@ 2020-04-23  7:40   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-23  7:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Sat, Apr 18, 2020 at 11:09:22AM +0800, Ming Lei wrote:
> Flush requests aren't the same as normal FS requests:
> 
> 1) one dedicated per-hctx flush rq is pre-allocated for sending flush requests
> 
> 2) flush requests are issued to hardware via one machinery so that flush merge
> can be applied
> 
> We can't simply re-submit flush rqs via blk_steal_bios(), so add
> blk_end_flush_machinery to collect flush requests which need to
> be resubmitted:
> 
> - if one flush command without DATA is enough, send one flush and complete this
> kind of request
> 
> - otherwise, add the request into a list and let the caller re-submit it.

The code looks sensible:

Reviewed-by: Christoph Hellwig <hch@lst.de>
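
For readers following along, the interface described in the quoted log can be
restated as the prototype below (argument names are illustrative): requests on
the input list that only need an empty flush are completed internally, while
the remainder are handed back for re-submission.

	void blk_end_flush_machinery(struct blk_mq_hw_ctx *hctx,
				     struct list_head *in,
				     struct list_head *out);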


* Re: [PATCH V7 7/9] blk-mq: re-submit IO in case that hctx is inactive
  2020-04-18  3:09 ` [PATCH V7 7/9] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
@ 2020-04-23  7:50   ` Christoph Hellwig
  2020-04-23  8:46     ` Ming Lei
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-23  7:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

> +static void blk_mq_resubmit_passthrough_io(struct request *rq)
> +{
> +	struct request *nrq;
> +	unsigned int flags = 0, cmd_flags = 0;
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> +	struct blk_mq_tags *tags = rq->q->elevator ? hctx->sched_tags :
> +		hctx->tags;
> +	bool reserved = blk_mq_tag_is_reserved(tags, rq->internal_tag);
> +
> +	if (rq->rq_flags & RQF_PREEMPT)
> +		flags |= BLK_MQ_REQ_PREEMPT;
> +	if (reserved)
> +		flags |= BLK_MQ_REQ_RESERVED;
> +
> +	/* avoid allocation failure & IO merge */
> +	cmd_flags = (rq->cmd_flags & ~REQ_NOWAIT) | REQ_NOMERGE;
> +
> +	nrq = blk_get_request(rq->q, cmd_flags, flags);
> +	if (!nrq)
> +		return;
> +
> +	nrq->__sector = blk_rq_pos(rq);
> +	nrq->__data_len = blk_rq_bytes(rq);
> +	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
> +		nrq->rq_flags |= RQF_SPECIAL_PAYLOAD;
> +		nrq->special_vec = rq->special_vec;
> +	}
> +#if defined(CONFIG_BLK_DEV_INTEGRITY)
> +	nrq->nr_integrity_segments = rq->nr_integrity_segments;
> +#endif
> +	nrq->nr_phys_segments = rq->nr_phys_segments;
> +	nrq->ioprio = rq->ioprio;
> +	nrq->extra_len = rq->extra_len;
> +	nrq->rq_disk = rq->rq_disk;
> +	nrq->part = rq->part;
> +	nrq->write_hint = rq->write_hint;
> +	nrq->timeout = rq->timeout;

This should share code with blk_rq_prep_clone() using a helper.
Note that blk_rq_prep_clone seems to miss things like the
write_hint and timeout, which we should fix as well.
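
A rough sketch of such a shared helper is shown below; the helper name and the
exact field list are illustrative, not the final code:

	/* copy attributes that survive re-submission from @src to @dst */
	static void blk_rq_copy_attrs(struct request *dst,
				      const struct request *src)
	{
		dst->__sector = blk_rq_pos(src);
		dst->__data_len = blk_rq_bytes(src);
		dst->nr_phys_segments = src->nr_phys_segments;
		dst->ioprio = src->ioprio;
		dst->rq_disk = src->rq_disk;
		dst->part = src->part;
		dst->write_hint = src->write_hint;
		dst->timeout = src->timeout;
		if (src->rq_flags & RQF_SPECIAL_PAYLOAD) {
			dst->rq_flags |= RQF_SPECIAL_PAYLOAD;
			dst->special_vec = src->special_vec;
		}
	}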

> +static void blk_mq_resubmit_fs_io(struct request *rq)
> +{
> +	struct bio_list list;
> +	struct bio *bio;
> +
> +	bio_list_init(&list);
> +	blk_steal_bios(&list, rq);
> +
> +	while (true) {
> +		bio = bio_list_pop(&list);
> +		if (!bio)
> +			break;
> +
> +		generic_make_request(bio);
> +	}

This could be simplified to:

	while ((bio = bio_list_pop(&list)))
		generic_make_request(bio);

but then again the generic_make_request call seems weird.  Do we actually
need any of the checks in generic_make_request?  Shouldn't we call into
blk_mq_make_request directly?

Then again I wonder why the passthrough case doesn't work for
FS requests?

>  static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  {
> @@ -2394,14 +2482,38 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  	}
>  	spin_unlock(&ctx->lock);
>  
> +	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
> +		if (!list_empty(&tmp)) {
> +			spin_lock(&hctx->lock);
> +			list_splice_tail_init(&tmp, &hctx->dispatch);
> +			spin_unlock(&hctx->lock);
> +			blk_mq_run_hw_queue(hctx, true);
> +		}
> +	} else {

What about an early return or two here to save a level of indentation
later?

	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
		if (list_empty(&tmp))
			return 0;

		spin_lock(&hctx->lock);
		list_splice_tail_init(&tmp, &hctx->dispatch);
		spin_unlock(&hctx->lock);
		blk_mq_run_hw_queue(hctx, true);
		return 0;
	}

	...


* Re: [PATCH V7 8/9] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx
  2020-04-18  3:09 ` [PATCH V7 8/9] blk-mq: handle requests dispatched from IO scheduler in case of inactive hctx Ming Lei
@ 2020-04-23  7:51   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2020-04-23  7:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

On Sat, Apr 18, 2020 at 11:09:24AM +0800, Ming Lei wrote:
> If one hctx becomes inactive when its CPUs are all offline, all in-queue
> requests aimed at this hctx have to be re-submitted.
> 
> Re-submit requests from both the sw queue and the scheduler queue when
> the hctx is found to be inactive.
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c | 100 ++++++++++++++++++++++++++++++-------------------
>  1 file changed, 62 insertions(+), 38 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index ae1e57c64ca1..54ba8a9c3c93 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2456,6 +2456,52 @@ static void blk_mq_resubmit_io(struct request *rq)
>  		blk_mq_resubmit_fs_io(rq);
>  }
>  
> +static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx)
> +{
> +	LIST_HEAD(sched_tmp);
> +	LIST_HEAD(re_submit);
> +	LIST_HEAD(flush_in);
> +	LIST_HEAD(flush_out);
> +	struct request *rq, *nxt;
> +	struct elevator_queue *e = hctx->queue->elevator;
> +
> +	if (!e) {
> +		blk_mq_flush_busy_ctxs(hctx, &re_submit);
> +	} else {
> +		while ((rq = e->type->ops.dispatch_request(hctx))) {
> +			if (rq->mq_hctx != hctx)
> +				list_add(&rq->queuelist, &sched_tmp);
> +			else
> +				list_add(&rq->queuelist, &re_submit);
> +		}
> +	}
> +	while (!list_empty(&sched_tmp)) {
> +		rq = list_entry(sched_tmp.next, struct request,
> +				queuelist);
> +		list_del_init(&rq->queuelist);
> +		blk_mq_sched_insert_request(rq, true, true, true);
> +	}
> +
> +	/* requests in dispatch list have to be re-submitted too */
> +	spin_lock(&hctx->lock);
> +	list_splice_tail_init(&hctx->dispatch, &re_submit);
> +	spin_unlock(&hctx->lock);
> +
> +	/* blk_end_flush_machinery will cover flush request */
> +	list_for_each_entry_safe(rq, nxt, &re_submit, queuelist) {
> +		if (rq->rq_flags & RQF_FLUSH_SEQ)
> +			list_move(&rq->queuelist, &flush_in);
> +	}
> +	blk_end_flush_machinery(hctx, &flush_in, &flush_out);
> +	list_splice_tail(&flush_out, &re_submit);
> +
> +	while (!list_empty(&re_submit)) {
> +		rq = list_first_entry(&re_submit, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		blk_mq_resubmit_io(rq);
> +	}
> +}
> +
>  /*
>   * 'cpu' has gone away. If this hctx is inactive, we can't dispatch request
>   * to the hctx any more, so steal bios from requests of this hctx, and
> @@ -2463,54 +2509,32 @@ static void blk_mq_resubmit_io(struct request *rq)
>   */
>  static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>  {
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_dead);
>  
>  	if (!cpumask_test_cpu(cpu, hctx->cpumask))
>  		return 0;
>  
> +	if (test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
> +		blk_mq_hctx_deactivate(hctx);

Actually it probably also makes sense to introduce the
blk_mq_hctx_deactivate helper in the previous patch to avoid some
churn here.

> +	} else if (!hctx->queue->elevator) {
> +		struct blk_mq_ctx *ctx = __blk_mq_get_ctx(hctx->queue, cpu);
> +		enum hctx_type type = hctx->type;
> +		LIST_HEAD(tmp);
> +
> +		spin_lock(&ctx->lock);
> +		if (!list_empty(&ctx->rq_lists[type])) {
> +			list_splice_init(&ctx->rq_lists[type], &tmp);
> +			blk_mq_hctx_clear_pending(hctx, ctx);
> +		}
> +		spin_unlock(&ctx->lock);
>  
>  		if (!list_empty(&tmp)) {
>  			spin_lock(&hctx->lock);
>  			list_splice_tail_init(&tmp, &hctx->dispatch);
>  			spin_unlock(&hctx->lock);
>  
> +			blk_mq_run_hw_queue(hctx, true);
>  		}

And another helper for the !inactive case.
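
The helper for the !inactive case might take a shape like the sketch below
(name and exact structure are illustrative):

	static void blk_mq_hctx_requeue_ctx_rqs(struct blk_mq_hw_ctx *hctx,
						unsigned int cpu)
	{
		struct blk_mq_ctx *ctx = __blk_mq_get_ctx(hctx->queue, cpu);
		enum hctx_type type = hctx->type;
		LIST_HEAD(tmp);

		/* move requests of the dead CPU's sw queue to a local list */
		spin_lock(&ctx->lock);
		if (!list_empty(&ctx->rq_lists[type])) {
			list_splice_init(&ctx->rq_lists[type], &tmp);
			blk_mq_hctx_clear_pending(hctx, ctx);
		}
		spin_unlock(&ctx->lock);

		if (list_empty(&tmp))
			return;

		/* let the still-active hctx dispatch them */
		spin_lock(&hctx->lock);
		list_splice_tail_init(&tmp, &hctx->dispatch);
		spin_unlock(&hctx->lock);
		blk_mq_run_hw_queue(hctx, true);
	}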


* Re: [PATCH V7 7/9] blk-mq: re-submit IO in case that hctx is inactive
  2020-04-23  7:50   ` Christoph Hellwig
@ 2020-04-23  8:46     ` Ming Lei
  0 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2020-04-23  8:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner

On Thu, Apr 23, 2020 at 09:50:11AM +0200, Christoph Hellwig wrote:
> > +static void blk_mq_resubmit_passthrough_io(struct request *rq)
> > +{
> > +	struct request *nrq;
> > +	unsigned int flags = 0, cmd_flags = 0;
> > +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > +	struct blk_mq_tags *tags = rq->q->elevator ? hctx->sched_tags :
> > +		hctx->tags;
> > +	bool reserved = blk_mq_tag_is_reserved(tags, rq->internal_tag);
> > +
> > +	if (rq->rq_flags & RQF_PREEMPT)
> > +		flags |= BLK_MQ_REQ_PREEMPT;
> > +	if (reserved)
> > +		flags |= BLK_MQ_REQ_RESERVED;
> > +
> > +	/* avoid allocation failure & IO merge */
> > +	cmd_flags = (rq->cmd_flags & ~REQ_NOWAIT) | REQ_NOMERGE;
> > +
> > +	nrq = blk_get_request(rq->q, cmd_flags, flags);
> > +	if (!nrq)
> > +		return;
> > +
> > +	nrq->__sector = blk_rq_pos(rq);
> > +	nrq->__data_len = blk_rq_bytes(rq);
> > +	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD) {
> > +		nrq->rq_flags |= RQF_SPECIAL_PAYLOAD;
> > +		nrq->special_vec = rq->special_vec;
> > +	}
> > +#if defined(CONFIG_BLK_DEV_INTEGRITY)
> > +	nrq->nr_integrity_segments = rq->nr_integrity_segments;
> > +#endif
> > +	nrq->nr_phys_segments = rq->nr_phys_segments;
> > +	nrq->ioprio = rq->ioprio;
> > +	nrq->extra_len = rq->extra_len;
> > +	nrq->rq_disk = rq->rq_disk;
> > +	nrq->part = rq->part;
> > +	nrq->write_hint = rq->write_hint;
> > +	nrq->timeout = rq->timeout;
> 
> This should share code with blk_rq_prep_clone() using a helper.
> Note that blk_rq_prep_clone seems to miss things like the
> write_hint and timeout, which we should fix as well.

It looks like requests in both cases are inserted directly, so it is
reasonable to share a similar clone helper.

Will do that in the next version.

> 
> > +static void blk_mq_resubmit_fs_io(struct request *rq)
> > +{
> > +	struct bio_list list;
> > +	struct bio *bio;
> > +
> > +	bio_list_init(&list);
> > +	blk_steal_bios(&list, rq);
> > +
> > +	while (true) {
> > +		bio = bio_list_pop(&list);
> > +		if (!bio)
> > +			break;
> > +
> > +		generic_make_request(bio);
> > +	}
> 
> This could be simplified to:
> 
> 	while ((bio = bio_list_pop(&list)))
> 		generic_make_request(bio);
> 
> but then again the generic_make_request call seems weird.  Do we actually
> need any of the checks in generic_make_request?  Shouldn't we call into
> blk_mq_make_request directly?

Good catch, I think we should call into blk_mq_make_request() directly
to avoid the double checks in generic_make_request().
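
A minimal sketch of that direct call, assuming blk_mq_make_request(q, bio)
remains callable from inside blk-mq.c:

	while ((bio = bio_list_pop(&list)))
		blk_mq_make_request(rq->q, bio);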

> 
> Then again I wonder why the passthrough case doesn't work for
> FS requests?

Good question, I just didn't think of it this way because re-submitting
the passthrough request is done a bit late. I believe we can do it this
way, which is very similar to dm-rq's usage.

> 
> >  static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >  {
> > @@ -2394,14 +2482,38 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >  	}
> >  	spin_unlock(&ctx->lock);
> >  
> > +	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
> > +		if (!list_empty(&tmp)) {
> > +			spin_lock(&hctx->lock);
> > +			list_splice_tail_init(&tmp, &hctx->dispatch);
> > +			spin_unlock(&hctx->lock);
> > +			blk_mq_run_hw_queue(hctx, true);
> > +		}
> > +	} else {
> 
> What about an early return or two here to save a level of indentation
> later?
> 
> 	if (!test_bit(BLK_MQ_S_INACTIVE, &hctx->state)) {
> 		if (list_empty(&tmp))
> 			return 0;
> 
> 		spin_lock(&hctx->lock);
> 		list_splice_tail_init(&tmp, &hctx->dispatch);
> 		spin_unlock(&hctx->lock);
> 		blk_mq_run_hw_queue(hctx, true);
> 		return 0;
> 	}

OK.


Thanks,
Ming


