* blk-mq: improvement CPU hotplug (simplified version) v4
@ 2020-05-29 13:53 Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
                   ` (8 more replies)
  0 siblings, 9 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

Hi all,

this series ensures I/O is quiesced before a CPU, and thus its managed
interrupt handler, is shut down.

This patchset addresses the issue with the following approach:

 - before the last CPU in hctx->cpumask goes offline, mark this
   hctx as inactive

 - disable preemption while allocating a tag for a request; after the
   tag is allocated, check whether this hctx is inactive.  If it is,
   give up the allocation and try a remote allocation from an online CPU

 - before the hctx becomes inactive, drain all requests allocated on
   this hctx (see the sketch below)
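
As a quick orientation, the heart of the mechanism (a condensed sketch
of patch 8 in this series; error handling and the barrier documentation
are trimmed) looks like this:

/* Sketch only: condensed from patch 8, not the literal implementation. */

/* cpuhp "offline" callback, run before the CPU and its managed IRQ are
 * actually shut down. */
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
		return 0;

	/* 1) stop new tag allocations on this hctx */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb__after_atomic();

	/* 2) wait until all already-allocated requests have finished */
	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
		while (blk_mq_hctx_has_requests(hctx))
			msleep(5);
		percpu_ref_put(&hctx->queue->q_usage_counter);
	}
	return 0;
}

/* On the allocation side, at the end of blk_mq_get_tag(): a tag won on a
 * now-inactive hctx is given back and the caller retries, by which time
 * CPU hotplug has migrated it to an online CPU and thus to a different
 * hctx. */
	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
		return BLK_MQ_NO_TAG;
	}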

The guts of the changes are from Ming Lei; I just did a bunch of prep
cleanups so that they fit in more nicely.  The series also depends
on my "avoid a few q_usage_counter roundtrips v3" series.

Thanks to John Garry for running lots of tests on arm64 with previous
versions of these patches and for collaborating on investigating all
kinds of issues.

A git tree is available here:

    git://git.infradead.org/users/hch/block.git blk-mq-hotplug.3

Gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-mq-hotplug.3

Changes since v4:
  - move the blk_mq_all_tag_iter kerneldoc comment to make it clear what is
    documented
  - add a WARN_ON_ONCE
  - use BLK_MQ_NO_TAG in one more place instead of -1
  - remove the PF_KTHREAD special casing

Changes since v3:
  - add two new patches to clean up the magic -1 tag values
  - improve a few commit messages and comments
  - cleanup the blk_mq_all_tag_iter implementation
  - add a msleep to the cpu hot unplug case in __blk_mq_alloc_request

Changes since v2:
  - don't disable preemption and use smp calls


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request
  2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
@ 2020-05-29 13:53 ` Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Hannes Reinecke, Johannes Thumshirn,
	Daniel Wagner

None of the I/O schedulers actually needs it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/bfq-iosched.c      | 2 +-
 block/blk-mq.c           | 2 +-
 block/kyber-iosched.c    | 2 +-
 block/mq-deadline.c      | 2 +-
 include/linux/elevator.h | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 3d411716d7ee4..50c8f034c01c5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6073,7 +6073,7 @@ static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
  * comments on bfq_init_rq for the reason behind this delayed
  * preparation.
  */
-static void bfq_prepare_request(struct request *rq, struct bio *bio)
+static void bfq_prepare_request(struct request *rq)
 {
 	/*
 	 * Regardless of whether we have an icq attached, we have to
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c606c74463ccd..c697017588c12 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -387,7 +387,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 			if (e->type->icq_cache)
 				blk_mq_sched_assign_ioc(rq);
 
-			e->type->ops.prepare_request(rq, bio);
+			e->type->ops.prepare_request(rq);
 			rq->rq_flags |= RQF_ELVPRIV;
 		}
 	}
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 34dcea0ef6377..a38c5ab103d12 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -579,7 +579,7 @@ static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
 	return merged;
 }
 
-static void kyber_prepare_request(struct request *rq, struct bio *bio)
+static void kyber_prepare_request(struct request *rq)
 {
 	rq_set_domain_token(rq, -1);
 }
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b490f47fd553c..b57470e154c8f 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -541,7 +541,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
  * Nothing to do here. This is defined only to ensure that .finish_request
  * method is called upon request completion.
  */
-static void dd_prepare_request(struct request *rq, struct bio *bio)
+static void dd_prepare_request(struct request *rq)
 {
 }
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 901bda352dcb7..bacc40a0bdf39 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -39,7 +39,7 @@ struct elevator_mq_ops {
 	void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);
 	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
 	void (*limit_depth)(unsigned int, struct blk_mq_alloc_data *);
-	void (*prepare_request)(struct request *, struct bio *bio);
+	void (*prepare_request)(struct request *);
 	void (*finish_request)(struct request *);
 	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
 	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention
  2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
@ 2020-05-29 13:53 ` Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Hannes Reinecke, Johannes Thumshirn,
	Daniel Wagner

The bio argument is entirely unused, and the request_queue can be passed
through the alloc_data, given that it needs to be filled out for the
low-level tag allocation anyway.  Also rename the function to
__blk_mq_alloc_request as the switch between get and alloc in the call
chains is rather confusing.
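
Condensed from the hunks below, the new calling convention boils down to
the following (sketch only, not a complete function):

	struct blk_mq_alloc_data data = {
		.q		= q,
		.flags		= flags,
		.cmd_flags	= op,
	};
	struct request *rq;

	/* was: rq = blk_mq_get_request(q, NULL, &alloc_data); */
	rq = __blk_mq_alloc_request(&data);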

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c697017588c12..c1e66c014be1d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -332,10 +332,9 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	return rq;
 }
 
-static struct request *blk_mq_get_request(struct request_queue *q,
-					  struct bio *bio,
-					  struct blk_mq_alloc_data *data)
+static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 {
+	struct request_queue *q = data->q;
 	struct elevator_queue *e = q->elevator;
 	struct request *rq;
 	unsigned int tag;
@@ -346,7 +345,6 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	if (blk_queue_rq_alloc_time(q))
 		alloc_time_ns = ktime_get_ns();
 
-	data->q = q;
 	if (likely(!data->ctx)) {
 		data->ctx = blk_mq_get_ctx(q);
 		clear_ctx_on_error = true;
@@ -398,7 +396,11 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 		blk_mq_req_flags_t flags)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
+	struct blk_mq_alloc_data data = {
+		.q		= q,
+		.flags		= flags,
+		.cmd_flags	= op,
+	};
 	struct request *rq;
 	int ret;
 
@@ -406,7 +408,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 	if (ret)
 		return ERR_PTR(ret);
 
-	rq = blk_mq_get_request(q, NULL, &alloc_data);
+	rq = __blk_mq_alloc_request(&data);
 	if (!rq)
 		goto out_queue_exit;
 	rq->__data_len = 0;
@@ -422,7 +424,11 @@ EXPORT_SYMBOL(blk_mq_alloc_request);
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
+	struct blk_mq_alloc_data data = {
+		.q		= q,
+		.flags		= flags,
+		.cmd_flags	= op,
+	};
 	struct request *rq;
 	unsigned int cpu;
 	int ret;
@@ -448,14 +454,14 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	 * If not tell the caller that it should skip this queue.
 	 */
 	ret = -EXDEV;
-	alloc_data.hctx = q->queue_hw_ctx[hctx_idx];
-	if (!blk_mq_hw_queue_mapped(alloc_data.hctx))
+	data.hctx = q->queue_hw_ctx[hctx_idx];
+	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
-	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
-	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
+	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	ret = -EWOULDBLOCK;
-	rq = blk_mq_get_request(q, NULL, &alloc_data);
+	rq = __blk_mq_alloc_request(&data);
 	if (!rq)
 		goto out_queue_exit;
 	return rq;
@@ -2016,7 +2022,9 @@ blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
 	const int is_flush_fua = op_is_flush(bio->bi_opf);
-	struct blk_mq_alloc_data data = { .flags = 0};
+	struct blk_mq_alloc_data data = {
+		.q		= q,
+	};
 	struct request *rq;
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
@@ -2040,7 +2048,7 @@ blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	rq_qos_throttle(q, bio);
 
 	data.cmd_flags = bio->bi_opf;
-	rq = blk_mq_get_request(q, bio, &data);
+	rq = __blk_mq_alloc_request(&data);
 	if (unlikely(!rq)) {
 		rq_qos_cleanup(q, bio);
 		if (bio->bi_opf & REQ_NOWAIT)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init
  2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
@ 2020-05-29 13:53 ` Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG Christoph Hellwig
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Hannes Reinecke, Johannes Thumshirn,
	Daniel Wagner

Don't split request initialization between __blk_mq_alloc_request and
blk_mq_rq_ctx_init.  Also remove the op argument as it can be derived
from the blk_mq_alloc_data structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c1e66c014be1d..89c6223c1b603 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -271,7 +271,7 @@ static inline bool blk_mq_need_time_stamp(struct request *rq)
 }
 
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
-		unsigned int tag, unsigned int op, u64 alloc_time_ns)
+		unsigned int tag, u64 alloc_time_ns)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 	struct request *rq = tags->static_rqs[tag];
@@ -295,7 +295,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->mq_ctx = data->ctx;
 	rq->mq_hctx = data->hctx;
 	rq->rq_flags = rq_flags;
-	rq->cmd_flags = op;
+	rq->cmd_flags = data->cmd_flags;
 	if (data->flags & BLK_MQ_REQ_PREEMPT)
 		rq->rq_flags |= RQF_PREEMPT;
 	if (blk_queue_io_stat(data->q))
@@ -327,8 +327,23 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
 
-	data->ctx->rq_dispatched[op_is_sync(op)]++;
+	data->ctx->rq_dispatched[op_is_sync(data->cmd_flags)]++;
 	refcount_set(&rq->ref, 1);
+
+	if (!op_is_flush(data->cmd_flags)) {
+		struct elevator_queue *e = data->q->elevator;
+
+		rq->elv.icq = NULL;
+		if (e && e->type->ops.prepare_request) {
+			if (e->type->icq_cache)
+				blk_mq_sched_assign_ioc(rq);
+
+			e->type->ops.prepare_request(rq);
+			rq->rq_flags |= RQF_ELVPRIV;
+		}
+	}
+
+	data->hctx->queued++;
 	return rq;
 }
 
@@ -336,7 +351,6 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 {
 	struct request_queue *q = data->q;
 	struct elevator_queue *e = q->elevator;
-	struct request *rq;
 	unsigned int tag;
 	bool clear_ctx_on_error = false;
 	u64 alloc_time_ns = 0;
@@ -378,19 +392,7 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 		return NULL;
 	}
 
-	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
-	if (!op_is_flush(data->cmd_flags)) {
-		rq->elv.icq = NULL;
-		if (e && e->type->ops.prepare_request) {
-			if (e->type->icq_cache)
-				blk_mq_sched_assign_ioc(rq);
-
-			e->type->ops.prepare_request(rq);
-			rq->rq_flags |= RQF_ELVPRIV;
-		}
-	}
-	data->hctx->queued++;
-	return rq;
+	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2020-05-29 13:53 ` [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
@ 2020-05-29 13:53 ` Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places Christoph Hellwig
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Hannes Reinecke, Johannes Thumshirn,
	Daniel Wagner

To prepare for wider use of this constant give it a more applicable name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq-tag.c | 4 ++--
 block/blk-mq-tag.h | 4 ++--
 block/blk-mq.c     | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904ab..c76ba4f90fa09 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -111,7 +111,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	if (data->flags & BLK_MQ_REQ_RESERVED) {
 		if (unlikely(!tags->nr_reserved_tags)) {
 			WARN_ON_ONCE(1);
-			return BLK_MQ_TAG_FAIL;
+			return BLK_MQ_NO_TAG;
 		}
 		bt = &tags->breserved_tags;
 		tag_offset = 0;
@@ -125,7 +125,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		goto found_tag;
 
 	if (data->flags & BLK_MQ_REQ_NOWAIT)
-		return BLK_MQ_TAG_FAIL;
+		return BLK_MQ_NO_TAG;
 
 	ws = bt_wait_ptr(bt, data->hctx);
 	do {
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 2b8321efb6820..8a741752af8b9 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -44,9 +44,9 @@ static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 }
 
 enum {
-	BLK_MQ_TAG_FAIL		= -1U,
+	BLK_MQ_NO_TAG		= -1U,
 	BLK_MQ_TAG_MIN		= 1,
-	BLK_MQ_TAG_MAX		= BLK_MQ_TAG_FAIL - 1,
+	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
 };
 
 extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 89c6223c1b603..3fbc08d879452 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -386,7 +386,7 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 	}
 
 	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_TAG_FAIL) {
+	if (tag == BLK_MQ_NO_TAG) {
 		if (clear_ctx_on_error)
 			data->ctx = NULL;
 		return NULL;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places
  2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2020-05-29 13:53 ` [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG Christoph Hellwig
@ 2020-05-29 13:53 ` Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 6/8] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Hannes Reinecke, Johannes Thumshirn,
	Daniel Wagner

Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq-tag.c |  8 ++++----
 block/blk-mq.c     | 14 +++++++-------
 block/blk-mq.h     |  4 ++--
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index c76ba4f90fa09..597ff9c27cf63 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -92,7 +92,7 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 {
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
 	    !hctx_may_queue(data->hctx, bt))
-		return -1;
+		return BLK_MQ_NO_TAG;
 	if (data->shallow_depth)
 		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
 	else
@@ -121,7 +121,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	}
 
 	tag = __blk_mq_get_tag(data, bt);
-	if (tag != -1)
+	if (tag != BLK_MQ_NO_TAG)
 		goto found_tag;
 
 	if (data->flags & BLK_MQ_REQ_NOWAIT)
@@ -143,13 +143,13 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		 * as running the queue may also have found completions.
 		 */
 		tag = __blk_mq_get_tag(data, bt);
-		if (tag != -1)
+		if (tag != BLK_MQ_NO_TAG)
 			break;
 
 		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
 
 		tag = __blk_mq_get_tag(data, bt);
-		if (tag != -1)
+		if (tag != BLK_MQ_NO_TAG)
 			break;
 
 		bt_prev = bt;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3fbc08d879452..696202e6e304f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -278,7 +278,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	req_flags_t rq_flags = 0;
 
 	if (data->flags & BLK_MQ_REQ_INTERNAL) {
-		rq->tag = -1;
+		rq->tag = BLK_MQ_NO_TAG;
 		rq->internal_tag = tag;
 	} else {
 		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
@@ -286,7 +286,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 			atomic_inc(&data->hctx->nr_active);
 		}
 		rq->tag = tag;
-		rq->internal_tag = -1;
+		rq->internal_tag = BLK_MQ_NO_TAG;
 		data->hctx->tags->rqs[rq->tag] = rq;
 	}
 
@@ -483,9 +483,9 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_crypto_free_request(rq);
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != -1)
+	if (rq->tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
+	if (sched_tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
@@ -534,7 +534,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 		blk_stat_add(rq, now);
 	}
 
-	if (rq->internal_tag != -1)
+	if (rq->internal_tag != BLK_MQ_NO_TAG)
 		blk_mq_sched_completed_request(rq, now);
 
 	blk_account_io_done(rq, now);
@@ -1033,7 +1033,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	};
 	bool shared;
 
-	if (rq->tag != -1)
+	if (rq->tag != BLK_MQ_NO_TAG)
 		return true;
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
@@ -1049,7 +1049,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
 		data.hctx->tags->rqs[rq->tag] = rq;
 	}
 
-	return rq->tag != -1;
+	return rq->tag != BLK_MQ_NO_TAG;
 }
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 10bfdfb494faf..a139b06318174 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -201,7 +201,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
-	rq->tag = -1;
+	rq->tag = BLK_MQ_NO_TAG;
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
@@ -211,7 +211,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 
 static inline void blk_mq_put_driver_tag(struct request *rq)
 {
-	if (rq->tag == -1 || rq->internal_tag == -1)
+	if (rq->tag == BLK_MQ_NO_TAG || rq->internal_tag == BLK_MQ_NO_TAG)
 		return;
 
 	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 6/8] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2020-05-29 13:53 ` [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places Christoph Hellwig
@ 2020-05-29 13:53 ` Christoph Hellwig
  2020-05-29 13:53 ` [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Daniel Wagner

blk_mq_alloc_request_hctx is only used for NVMeoF connect commands, so
tailor it to the specific requirements, and don't bother the general
fast path code with its special twinkles.
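
For context, a hedged sketch of the kind of caller this path serves (the
function name and opcode below are illustrative, not taken from the NVMe
code):

/* Sketch: allocate a reserved request on a specific hardware queue, the
 * way a fabrics-style connect path would.  The allocation must not
 * sleep, hence BLK_MQ_REQ_NOWAIT. */
static struct request *example_alloc_connect_rq(struct request_queue *q,
						unsigned int hctx_idx)
{
	return blk_mq_alloc_request_hctx(q, REQ_OP_DRV_OUT,
			BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED, hctx_idx);
}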

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq.c | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 696202e6e304f..d8c17ab0c7c22 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -351,21 +351,13 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 {
 	struct request_queue *q = data->q;
 	struct elevator_queue *e = q->elevator;
-	unsigned int tag;
-	bool clear_ctx_on_error = false;
 	u64 alloc_time_ns = 0;
+	unsigned int tag;
 
 	/* alloc_time includes depth and tag waits */
 	if (blk_queue_rq_alloc_time(q))
 		alloc_time_ns = ktime_get_ns();
 
-	if (likely(!data->ctx)) {
-		data->ctx = blk_mq_get_ctx(q);
-		clear_ctx_on_error = true;
-	}
-	if (likely(!data->hctx))
-		data->hctx = blk_mq_map_queue(q, data->cmd_flags,
-						data->ctx);
 	if (data->cmd_flags & REQ_NOWAIT)
 		data->flags |= BLK_MQ_REQ_NOWAIT;
 
@@ -381,17 +373,16 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 		    e->type->ops.limit_depth &&
 		    !(data->flags & BLK_MQ_REQ_RESERVED))
 			e->type->ops.limit_depth(data->cmd_flags, data);
-	} else {
-		blk_mq_tag_busy(data->hctx);
 	}
 
+	data->ctx = blk_mq_get_ctx(q);
+	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
+	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
+		blk_mq_tag_busy(data->hctx);
+
 	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_NO_TAG) {
-		if (clear_ctx_on_error)
-			data->ctx = NULL;
+	if (tag == BLK_MQ_NO_TAG)
 		return NULL;
-	}
-
 	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
@@ -431,17 +422,22 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		.flags		= flags,
 		.cmd_flags	= op,
 	};
-	struct request *rq;
+	u64 alloc_time_ns = 0;
 	unsigned int cpu;
+	unsigned int tag;
 	int ret;
 
+	/* alloc_time includes depth and tag waits */
+	if (blk_queue_rq_alloc_time(q))
+		alloc_time_ns = ktime_get_ns();
+
 	/*
 	 * If the tag allocator sleeps we could get an allocation for a
 	 * different hardware context.  No need to complicate the low level
 	 * allocator for this for the rare use case of a command tied to
 	 * a specific queue.
 	 */
-	if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT)))
+	if (WARN_ON_ONCE(!(flags & (BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED))))
 		return ERR_PTR(-EINVAL);
 
 	if (hctx_idx >= q->nr_hw_queues)
@@ -462,11 +458,17 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
+	if (q->elevator)
+		data.flags |= BLK_MQ_REQ_INTERNAL;
+	else
+		blk_mq_tag_busy(data.hctx);
+
 	ret = -EWOULDBLOCK;
-	rq = __blk_mq_alloc_request(&data);
-	if (!rq)
+	tag = blk_mq_get_tag(&data);
+	if (tag == BLK_MQ_NO_TAG)
 		goto out_queue_exit;
-	return rq;
+	return blk_mq_rq_ctx_init(&data, tag, alloc_time_ns);
+
 out_queue_exit:
 	blk_queue_exit(q);
 	return ERR_PTR(ret);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter
  2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2020-05-29 13:53 ` [PATCH 6/8] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
@ 2020-05-29 13:53 ` Christoph Hellwig
  2020-05-29 16:14   ` Bart Van Assche
  2020-05-29 13:53 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
  2020-05-29 16:23 ` blk-mq: improvement CPU hotplug (simplified version) v4 Jens Axboe
  8 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Hannes Reinecke, Daniel Wagner

From: Ming Lei <ming.lei@redhat.com>

Add a new blk_mq_all_tag_iter function to iterate over all allocated
scheduler tags and driver tags.  This is more flexible than the existing
blk_mq_all_tag_busy_iter function as it allows the callers to do whatever
they want on allocated requests instead of being limited to started
requests.

It will be used to implement draining allocated requests on a specified
hctx later in this patchset.
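
As a usage illustration (the example_* names below are hypothetical and
not part of this series), a caller supplies a busy_tag_iter_fn callback
and a private cookie:

/* Sketch: count every currently allocated request in a tag map. */
static bool example_count_rq(struct request *rq, void *data, bool reserved)
{
	(*(unsigned int *)data)++;
	return true;		/* keep iterating */
}

static unsigned int example_nr_allocated(struct blk_mq_tags *tags)
{
	unsigned int count = 0;

	blk_mq_all_tag_iter(tags, example_count_rq, &count);
	return count;
}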

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: switch from the two booleans to a more readable flags field and
 consolidate the tags iter functions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq-tag.c | 50 +++++++++++++++++++++++++++++-----------------
 block/blk-mq-tag.h |  2 ++
 2 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 597ff9c27cf63..762198b62088c 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -256,14 +256,17 @@ struct bt_tags_iter_data {
 	struct blk_mq_tags *tags;
 	busy_tag_iter_fn *fn;
 	void *data;
-	bool reserved;
+	unsigned int flags;
 };
 
+#define BT_TAG_ITER_RESERVED		(1 << 0)
+#define BT_TAG_ITER_STARTED		(1 << 1)
+
 static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 {
 	struct bt_tags_iter_data *iter_data = data;
 	struct blk_mq_tags *tags = iter_data->tags;
-	bool reserved = iter_data->reserved;
+	bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
 	struct request *rq;
 
 	if (!reserved)
@@ -274,10 +277,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * test and set the bit before assining ->rqs[].
 	 */
 	rq = tags->rqs[bitnr];
-	if (rq && blk_mq_request_started(rq))
-		return iter_data->fn(rq, iter_data->data, reserved);
-
-	return true;
+	if (!rq)
+		return true;
+	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
+	    !blk_mq_request_started(rq))
+		return true;
+	return iter_data->fn(rq, iter_data->data, reserved);
 }
 
 /**
@@ -290,39 +295,47 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		@reserved) where rq is a pointer to a request. Return true
  *		to continue iterating tags, false to stop.
  * @data:	Will be passed as second argument to @fn.
- * @reserved:	Indicates whether @bt is the breserved_tags member or the
- *		bitmap_tags member of struct blk_mq_tags.
+ * @flags:	BT_TAG_ITER_*
  */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
-			     busy_tag_iter_fn *fn, void *data, bool reserved)
+			     busy_tag_iter_fn *fn, void *data, unsigned int flags)
 {
 	struct bt_tags_iter_data iter_data = {
 		.tags = tags,
 		.fn = fn,
 		.data = data,
-		.reserved = reserved,
+		.flags = flags,
 	};
 
 	if (tags->rqs)
 		sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
 }
 
+static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, void *priv, unsigned int flags)
+{
+	WARN_ON_ONCE(flags & BT_TAG_ITER_RESERVED);
+
+	if (tags->nr_reserved_tags)
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv,
+				 flags | BT_TAG_ITER_RESERVED);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, flags);
+}
+
 /**
- * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
+ * blk_mq_all_tag_iter - iterate over all requests in a tag map
  * @tags:	Tag map to iterate over.
- * @fn:		Pointer to the function that will be called for each started
+ * @fn:		Pointer to the function that will be called for each
  *		request. @fn will be called as follows: @fn(rq, @priv,
  *		reserved) where rq is a pointer to a request. 'reserved'
  *		indicates whether or not @rq is a reserved request. Return
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
-		busy_tag_iter_fn *fn, void *priv)
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+		void *priv)
 {
-	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+	return __blk_mq_all_tag_iter(tags, fn, priv, 0);
 }
 
 /**
@@ -342,7 +355,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
-			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
+			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
+					      BT_TAG_ITER_STARTED);
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 8a741752af8b9..d38e48f2a0a4a 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -34,6 +34,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2020-05-29 13:53 ` [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
@ 2020-05-29 13:53 ` Christoph Hellwig
  2020-05-29 14:34   ` Hannes Reinecke
  2020-05-29 14:41   ` Daniel Wagner
  2020-05-29 16:23 ` blk-mq: improvement CPU hotplug (simplified version) v4 Jens Axboe
  8 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

From: Ming Lei <ming.lei@redhat.com>

Most blk-mq drivers depend on managed IRQs' auto-affinity to set up
their queue mapping.  Thomas mentioned the following point[1]:

"That was the constraint of managed interrupts from the very beginning:

 The driver/subsystem has to quiesce the interrupt line and the associated
 queue _before_ it gets shutdown in CPU unplug and not fiddle with it
 until it's restarted by the core when the CPU is plugged in again."

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down.  Even worse,
CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so
there is no chance to quiesce the hctx before shutting down the CPU.

Add a new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq
hctxs where the last CPU goes away, and to wait for completion of
in-flight requests.  This guarantees that there is no in-flight I/O
before shutting down the managed IRQ.

Add a BLK_MQ_F_STACKING flag and set it for dm-rq and loop, so that we
do not wait for completion of in-flight requests from these drivers,
which could otherwise lead to a deadlock.  It is safe to do this for
stacking drivers as they do not use interrupts at all and their I/O
completions are triggered by the underlying devices' I/O completions.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: different retry mechanism, merged two patches, minor cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   8 +++
 block/blk-mq.c             | 112 ++++++++++++++++++++++++++++++++++++-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |  10 ++++
 include/linux/cpuhotplug.h |   1 +
 7 files changed, 133 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 96b7a35c898a7..15df3a36e9fa4 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
@@ -239,6 +240,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(STACKING),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 762198b62088c..96a39d0724a29 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	sbitmap_finish_wait(bt, ws, &wait);
 
 found_tag:
+	/*
+	 * Give up this allocation if the hctx is inactive.  The caller will
+	 * retry on an active hctx.
+	 */
+	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
+		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
+		return BLK_MQ_NO_TAG;
+	}
 	return tag + tag_offset;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d8c17ab0c7c22..b005323f777a7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -375,14 +375,30 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 			e->type->ops.limit_depth(data->cmd_flags, data);
 	}
 
+retry:
 	data->ctx = blk_mq_get_ctx(q);
 	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
 		blk_mq_tag_busy(data->hctx);
 
+	/*
+	 * Waiting allocations only fail because of an inactive hctx.  In that
+	 * case just retry the hctx assignment and tag allocation as CPU hotplug
+	 * should have migrated us to an online CPU by now.
+	 */
 	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_NO_TAG)
-		return NULL;
+	if (tag == BLK_MQ_NO_TAG) {
+		if (data->flags & BLK_MQ_REQ_NOWAIT)
+			return NULL;
+
+		/*
+		 * Give up the CPU and sleep for a random short time to ensure
+		 * that threads using a realtime scheduling class are migrated
+		 * off the CPU, and thus off the hctx that is going away.
+		 */
+		msleep(3);
+		goto retry;
+	}
 	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
@@ -2324,6 +2340,86 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+struct rq_iter_data {
+	struct blk_mq_hw_ctx *hctx;
+	bool has_rq;
+};
+
+static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
+{
+	struct rq_iter_data *iter_data = data;
+
+	if (rq->mq_hctx != iter_data->hctx)
+		return true;
+	iter_data->has_rq = true;
+	return false;
+}
+
+static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_tags *tags = hctx->sched_tags ?
+			hctx->sched_tags : hctx->tags;
+	struct rq_iter_data data = {
+		.hctx	= hctx,
+	};
+
+	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
+	return data.has_rq;
+}
+
+static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
+		struct blk_mq_hw_ctx *hctx)
+{
+	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
+		return false;
+	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
+		return false;
+	return true;
+}
+
+static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
+	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
+		return 0;
+
+	/*
+	 * Prevent new requests from being allocated on the current hctx.
+	 *
+	 * The smp_mb__after_atomic() pairs with the implied barrier in
+	 * test_and_set_bit_lock() in sbitmap_get().  It ensures the inactive
+	 * flag is seen once we return from the tag allocator.
+	 */
+	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	smp_mb__after_atomic();
+
+	/*
+	 * Try to grab a reference to the queue and wait for any outstanding
+	 * requests.  If we could not grab a reference the queue has been
+	 * frozen and there are no requests.
+	 */
+	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
+		while (blk_mq_hctx_has_requests(hctx))
+			msleep(5);
+		percpu_ref_put(&hctx->queue->q_usage_counter);
+	}
+
+	return 0;
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (cpumask_test_cpu(cpu, hctx->cpumask))
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2337,6 +2433,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
@@ -2360,6 +2459,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_STACKING))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2419,6 +2521,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_STACKING))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3673,6 +3778,9 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				blk_mq_hctx_notify_online,
+				blk_mq_hctx_notify_offline);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e5..d7904b4d8d126 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13be..f60c025121215 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d7307795439a4..a20f8c241d665 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
 	 */
 	atomic_t		nr_active;
 
+	/** @cpuhp_online: hlist node for the CPU online/offline cpuhp callbacks */
+	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
 	struct hlist_node	cpuhp_dead;
 	/** @kobj: Kernel object for sysfs. */
@@ -391,6 +393,11 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	/*
+	 * Set when this device requires underlying blk-mq device for
+	 * completing IO:
+	 */
+	BLK_MQ_F_STACKING	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
@@ -400,6 +407,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 77d70b6335318..24b3a77810b6d 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -152,6 +152,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-29 13:53 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
@ 2020-05-29 14:34   ` Hannes Reinecke
  2020-05-29 14:41   ` Daniel Wagner
  1 sibling, 0 replies; 18+ messages in thread
From: Hannes Reinecke @ 2020-05-29 14:34 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On 5/29/20 3:53 PM, Christoph Hellwig wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
> up queue mapping. Thomas mentioned the following point[1]:
> 
> "That was the constraint of managed interrupts from the very beginning:
> 
>   The driver/subsystem has to quiesce the interrupt line and the associated
>   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>   until it's restarted by the core when the CPU is plugged in again."
> 
> However, current blk-mq implementation doesn't quiesce hw queue before
> the last CPU in the hctx is shutdown.  Even worse, CPUHP_BLK_MQ_DEAD is a
> cpuhp state handled after the CPU is down, so there isn't any chance to
> quiesce the hctx before shutting down the CPU.
> 
> Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
> where the last CPU goes away, and wait for completion of in-flight
> requests.  This guarantees that there is no inflight I/O before shutting
> down the managed IRQ.
> 
> Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
> to wait for completion of in-flight requests from these drivers to avoid
> a potential dead-lock. It is safe to do this for stacking drivers as those
> do not use interrupts at all and their I/O completions are triggered by
> underlying devices I/O completion.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> [hch: different retry mechanism, merged two patches, minor cleanups]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-debugfs.c     |   2 +
>   block/blk-mq-tag.c         |   8 +++
>   block/blk-mq.c             | 112 ++++++++++++++++++++++++++++++++++++-
>   drivers/block/loop.c       |   2 +-
>   drivers/md/dm-rq.c         |   2 +-
>   include/linux/blk-mq.h     |  10 ++++
>   include/linux/cpuhotplug.h |   1 +
>   7 files changed, 133 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-29 13:53 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
  2020-05-29 14:34   ` Hannes Reinecke
@ 2020-05-29 14:41   ` Daniel Wagner
  1 sibling, 0 replies; 18+ messages in thread
From: Daniel Wagner @ 2020-05-29 14:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Ming Lei, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner

On Fri, May 29, 2020 at 03:53:15PM +0200, Christoph Hellwig wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
> up queue mapping. Thomas mentioned the following point[1]:
> 
> "That was the constraint of managed interrupts from the very beginning:
> 
>  The driver/subsystem has to quiesce the interrupt line and the associated
>  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>  until it's restarted by the core when the CPU is plugged in again."
> 
> However, current blk-mq implementation doesn't quiesce hw queue before
> the last CPU in the hctx is shutdown.  Even worse, CPUHP_BLK_MQ_DEAD is a
> cpuhp state handled after the CPU is down, so there isn't any chance to
> quiesce the hctx before shutting down the CPU.
> 
> Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
> where the last CPU goes away, and wait for completion of in-flight
> requests.  This guarantees that there is no inflight I/O before shutting
> down the managed IRQ.
> 
> Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
> to wait for completion of in-flight requests from these drivers to avoid
> a potential dead-lock. It is safe to do this for stacking drivers as those
> do not use interrupts at all and their I/O completions are triggered by
> underlying devices I/O completion.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> [hch: different retry mechanism, merged two patches, minor cleanups]
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Daniel Wagner <dwagner@suse.de>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter
  2020-05-29 13:53 ` [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
@ 2020-05-29 16:14   ` Bart Van Assche
  0 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2020-05-29 16:14 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner,
	Hannes Reinecke, Daniel Wagner

On 2020-05-29 06:53, Christoph Hellwig wrote:
> Add a new blk_mq_all_tag_iter function to iterate over all allocated
> scheduler tags and driver tags.  This is more flexible than the existing
> blk_mq_all_tag_busy_iter function as it allows the callers to do whatever
> they want on allocated request instead of being limited to started
> requests.
> 
> It will be used to implement draining allocated requests on specified
> hctx in this patchset.

Reviewed-by: Bart van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2020-05-29 13:53 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
@ 2020-05-29 16:23 ` Jens Axboe
  8 siblings, 0 replies; 18+ messages in thread
From: Jens Axboe @ 2020-05-29 16:23 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On 5/29/20 7:53 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series ensures I/O is quiesced before a cpu and thus the managed
> interrupt handler is shut down.
> 
> This patchset tries to address the issue by the following approach:
> 
>  - before the last cpu in hctx->cpumask is going to offline, mark this
>    hctx as inactive
> 
>  - disable preempt during allocating tag for request, and after tag is
>    allocated, check if this hctx is inactive. If yes, give up the
>    allocation and try remote allocation from online CPUs
> 
>  - before hctx becomes inactive, drain all allocated requests on this
>    hctx
> 
> The guts of the changes are from Ming Lei, I just did a bunch of prep
> cleanups so that they can fit in more nicely.  The series also depends
> on my "avoid a few q_usage_counter roundtrips v3" series.
> 
> Thanks John Garry for running lots of tests on arm64 with this previous
> version patches and co-working on investigating all kinds of issues.

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-27 20:31   ` John Garry
@ 2020-05-29 13:26     ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:26 UTC (permalink / raw)
  To: John Garry
  Cc: Bart Van Assche, Christoph Hellwig, linux-block, Hannes Reinecke,
	Thomas Gleixner

On Wed, May 27, 2020 at 09:31:30PM +0100, John Garry wrote:
>> Thanks for having prepared and posted this new patch series. After v3
>> was posted and before v4 was posted I had a closer look at the IRQ core.
>> My conclusions (which may be incorrect) are as follows:
>> * The only function that sets the 'is_managed' member of struct
>>    irq_affinity_desc to 1 is irq_create_affinity_masks().
>> * There are two ways to cause that function to be called: setting the
>>    PCI_IRQ_AFFINITY flag when calling pci_alloc_irq_vectors_affinity() or
>>    passing the 'affd' argument. pci_alloc_irq_vectors() calls
>>    pci_alloc_irq_vectors_affinity().
>> * The following drivers pass an affinity domain argument when allocating
>>    interrupts: virtio_blk, nvme, be2iscsi, csiostor, hisi_sas, megaraid,
>>    mpt3sas, qla2xxx, virtio_scsi.
>> * The following drivers set the PCI_IRQ_AFFINITY flag but do not pass an
>>    affinity domain: aacraid, hpsa, lpfc, smartqpi, virtio_pci_common.
>>
>> What is not clear to me is why managed interrupts are shut down if the
>> last CPU in their affinity mask is shut down? Has it been considered to
>> modify the IRQ core such that managed PCIe interrupts are assigned to
>> another CPU if the last CPU in their affinity mask is shut down? 
>
> I think Thomas answered that here already:
> https://lore.kernel.org/lkml/alpine.DEB.2.21.1901291717370.1513@nanos.tec.linutronix.de/
>
> (vector space exhaustion)

Exactly.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-27 18:06 Christoph Hellwig
  2020-05-27 20:07 ` Bart Van Assche
@ 2020-05-28  8:29 ` John Garry
  1 sibling, 0 replies; 18+ messages in thread
From: John Garry @ 2020-05-28  8:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Thomas Gleixner

On 27/05/2020 19:06, Christoph Hellwig wrote:
> Hi all,
> 
> this series ensures I/O is quiesced before a cpu and thus the managed
> interrupt handler is shut down.
> 
> This patchset tries to address the issue by the following approach:
> 
>   - before the last cpu in hctx->cpumask is going to offline, mark this
>     hctx as inactive
> 
>   - disable preempt during allocating tag for request, and after tag is
>     allocated, check if this hctx is inactive. If yes, give up the
>     allocation and try remote allocation from online CPUs
> 
>   - before hctx becomes inactive, drain all allocated requests on this
>     hctx
> 
> The guts of the changes are from Ming Lei, I just did a bunch of prep
> cleanups so that they can fit in more nicely.  The series also depends
> on my "avoid a few q_usage_counter roundtrips v3" series.
> 
> Thanks John Garry for running lots of tests on arm64 with this previous
> version patches and co-working on investigating all kinds of issues.
> 
> A git tree is available here:
> 
>      git://git.infradead.org/users/hch/block.git blk-mq-hotplug.3
> 
> Gitweb:
> 
>      http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-mq-hotplug.3
> 
> Changes since v3:
>    - add two new patches to clean up the magic -1 tag values
>    - improve a few commit messages and comments
>    - cleanup the blk_mq_all_tag_iter implementation
>    - add a msleep to the cpu hot unplug case in __blk_mq_alloc_request
> 
> Changes since v2:
>    - don't disable preemption and use smp calls
> 

I tested this again, so:
Tested-by: John Garry <john.garry@huawei.com>

As an aside, I'm not familiar with blktests, but it may be possible to 
add something to test this. I'll look.

Cheers

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-27 20:07 ` Bart Van Assche
@ 2020-05-27 20:31   ` John Garry
  2020-05-29 13:26     ` Christoph Hellwig
  0 siblings, 1 reply; 18+ messages in thread
From: John Garry @ 2020-05-27 20:31 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: linux-block, Hannes Reinecke, Thomas Gleixner

On 27/05/2020 21:07, Bart Van Assche wrote:
> On 2020-05-27 11:06, Christoph Hellwig wrote:
>> this series ensures I/O is quiesced before a cpu and thus the managed
>> interrupt handler is shut down.
>>
>> This patchset tries to address the issue by the following approach:
>>
>>   - before the last CPU in hctx->cpumask goes offline, mark this
>>     hctx as inactive
>>
>>   - disable preemption while allocating a tag for a request, and after
>>     the tag is allocated, check if this hctx is inactive. If so, give up
>>     the allocation and try remote allocation from online CPUs
>>
>>   - before hctx becomes inactive, drain all allocated requests on this
>>     hctx
>>
>> The guts of the changes are from Ming Lei; I just did a bunch of prep
>> cleanups so that they fit in more nicely.  The series also depends
>> on my "avoid a few q_usage_counter roundtrips v3" series.
>>
>> Thanks to John Garry for running lots of tests on arm64 with previous
>> versions of these patches and for co-investigating all kinds of issues.
> 
> Hi Christoph,
> 
> Thanks for having prepared and posted this new patch series. After v3
> was posted and before v4 was posted I had a closer look at the IRQ core.
> My conclusions (which may be incorrect) are as follows:
> * The only function that sets the 'is_managed' member of struct
>    irq_affinity_desc to 1 is irq_create_affinity_masks().
> * There are two ways to cause that function to be called: setting the
>    PCI_IRQ_AFFINITY flag when calling pci_alloc_irq_vectors_affinity() or
>    passing the 'affd' argument. pci_alloc_irq_vectors() calls
>    pci_alloc_irq_vectors_affinity().
> * The following drivers pass an affinity domain argument when allocating
>    interrupts: virtio_blk, nvme, be2iscsi, csiostor, hisi_sas, megaraid,
>    mpt3sas, qla2xxx, virtio_scsi.
> * The following drivers set the PCI_IRQ_AFFINITY flag but do not pass an
>    affinity domain: aacraid, hpsa, lpfc, smartpqi, virtio_pci_common.
> 
> What is not clear to me is why managed interrupts are shut down when the
> last CPU in their affinity mask goes offline. Has it been considered to
> modify the IRQ core so that managed PCIe interrupts are assigned to
> another CPU when the last CPU in their affinity mask is shut down?

I think Thomas answered that here already:
https://lore.kernel.org/lkml/alpine.DEB.2.21.1901291717370.1513@nanos.tec.linutronix.de/

(vector space exhaustion)

> Would
> that make it unnecessary to drain hardware queues during CPU
> hotplugging? Or is there perhaps something in the PCI or PCIe
> specifications or in one of the architectures supported by Linux that
> prevents doing this?
> 
> Is this the commit that introduced shutdown of managed interrupts:
> c5cb83bb337c ("genirq/cpuhotplug: Handle managed IRQs on CPU hotplug")?
> 
> Some of my knowledge about non-managed and managed interrupts comes from
> https://lore.kernel.org/lkml/alpine.DEB.2.20.1710162106400.2037@nanos/
> 
> Thanks,
> 
> Bart.
> .
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-27 18:06 Christoph Hellwig
@ 2020-05-27 20:07 ` Bart Van Assche
  2020-05-27 20:31   ` John Garry
  2020-05-28  8:29 ` John Garry
  1 sibling, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2020-05-27 20:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner

On 2020-05-27 11:06, Christoph Hellwig wrote:
> this series ensures I/O is quiesced before a CPU, and thus its managed
> interrupt handler, is shut down.
> 
> This patchset tries to address the issue by the following approach:
> 
>  - before the last CPU in hctx->cpumask goes offline, mark this
>    hctx as inactive
> 
>  - disable preemption while allocating a tag for a request, and after
>    the tag is allocated, check if this hctx is inactive. If so, give up
>    the allocation and try remote allocation from online CPUs
> 
>  - before hctx becomes inactive, drain all allocated requests on this
>    hctx
> 
> The guts of the changes are from Ming Lei; I just did a bunch of prep
> cleanups so that they fit in more nicely.  The series also depends
> on my "avoid a few q_usage_counter roundtrips v3" series.
> 
> Thanks to John Garry for running lots of tests on arm64 with previous
> versions of these patches and for co-investigating all kinds of issues.

Hi Christoph,

Thanks for having prepared and posted this new patch series. After v3
was posted and before v4 was posted I had a closer look at the IRQ core.
My conclusions (which may be incorrect) are as follows:
* The only function that sets the 'is_managed' member of struct
  irq_affinity_desc to 1 is irq_create_affinity_masks().
* There are two ways to cause that function to be called: setting the
  PCI_IRQ_AFFINITY flag when calling pci_alloc_irq_vectors_affinity() or
  passing the 'affd' argument. pci_alloc_irq_vectors() calls
  pci_alloc_irq_vectors_affinity().
* The following drivers pass an affinity domain argument when allocating
  interrupts: virtio_blk, nvme, be2iscsi, csiostor, hisi_sas, megaraid,
  mpt3sas, qla2xxx, virtio_scsi.
* The following drivers set the PCI_IRQ_AFFINITY flag but do not pass an
  affinity domain: aacraid, hpsa, lpfc, smartpqi, virtio_pci_common.

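For reference, here is a minimal sketch of the two paths above; the vector
counts and the pre_vectors value are made up for illustration and are not
taken from any particular driver:

	/* pdev: the driver's struct pci_dev, max_vecs: how many queues it wants */
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. one non-spread vector for an admin queue */
	};
	int nr_vecs;

	/* explicit affinity descriptor: managed interrupts, spread across CPUs */
	nr_vecs = pci_alloc_irq_vectors_affinity(pdev, 1, max_vecs,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);

	/*
	 * PCI_IRQ_AFFINITY without an 'affd' argument also ends up in
	 * irq_create_affinity_masks(), via a default struct irq_affinity.
	 */
	nr_vecs = pci_alloc_irq_vectors(pdev, 1, max_vecs,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
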
What is not clear to me is why managed interrupts are shut down when the
last CPU in their affinity mask goes offline. Has it been considered to
modify the IRQ core so that managed PCIe interrupts are assigned to
another CPU when the last CPU in their affinity mask is shut down? Would
that make it unnecessary to drain hardware queues during CPU
hotplugging? Or is there perhaps something in the PCI or PCIe
specifications or in one of the architectures supported by Linux that
prevents doing this?

Is this the commit that introduced shutdown of managed interrupts:
c5cb83bb337c ("genirq/cpuhotplug: Handle managed IRQs on CPU hotplug")?

Some of my knowledge about non-managed and managed interrupts comes from
https://lore.kernel.org/lkml/alpine.DEB.2.20.1710162106400.2037@nanos/

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* blk-mq: improvement CPU hotplug (simplified version) v4
@ 2020-05-27 18:06 Christoph Hellwig
  2020-05-27 20:07 ` Bart Van Assche
  2020-05-28  8:29 ` John Garry
  0 siblings, 2 replies; 18+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

Hi all,

this series ensures I/O is quiesced before a CPU, and thus its managed
interrupt handler, is shut down.

This patchset tries to address the issue by the following approach:

 - before the last CPU in hctx->cpumask goes offline, mark this
   hctx as inactive

 - disable preemption while allocating a tag for a request, and after
   the tag is allocated, check if this hctx is inactive. If so, give up
   the allocation and try remote allocation from online CPUs

 - before hctx becomes inactive, drain all allocated requests on this
   hctx

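The allocation side check in the second bullet looks roughly like the
sketch below; the real __blk_mq_alloc_request() in patch 8/8 differs in
detail (NOWAIT handling, where exactly the inactive test sits, and
alloc_time_ns is a local in the real function):

	/* inside __blk_mq_alloc_request(struct blk_mq_alloc_data *data) */
	unsigned int tag;

	for (;;) {
		data->ctx = blk_mq_get_ctx(data->q);
		data->hctx = blk_mq_map_queue(data->q, data->cmd_flags, data->ctx);
		tag = blk_mq_get_tag(data);
		if (tag == BLK_MQ_NO_TAG)
			return NULL;
		if (!test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))
			break;
		/*
		 * The hctx went inactive after the tag was allocated: give
		 * the tag back, sleep a bit and retry; by then the scheduler
		 * has moved us to an online CPU and thus to a different hctx.
		 */
		blk_mq_put_tag(data->hctx->tags, data->ctx, tag);
		msleep(3);
	}
	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
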
The guts of the changes are from Ming Lei; I just did a bunch of prep
cleanups so that they fit in more nicely.  The series also depends
on my "avoid a few q_usage_counter roundtrips v3" series.

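The hotplug side (the first and last bullets) is a cpuhp callback that
runs before the CPU and its managed interrupt actually go away.  This is
a simplified sketch of what patch 8/8 does; blk_mq_hctx_has_requests() is
a helper added there on top of the new blk_mq_all_tag_iter():

	static int blk_mq_hctx_notify_offline(unsigned int cpu,
					      struct hlist_node *node)
	{
		struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
				struct blk_mq_hw_ctx, cpuhp_online);

		/* nothing to do unless @cpu is the last online CPU in hctx->cpumask */
		if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
		    cpumask_first_and(hctx->cpumask, cpu_online_mask) != cpu ||
		    cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
			return 0;

		/* stop new allocations on this hctx ... */
		set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
		smp_mb__after_atomic();

		/* ... and wait for everything already allocated to finish */
		if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
			while (blk_mq_hctx_has_requests(hctx))
				msleep(5);
			percpu_ref_put(&hctx->queue->q_usage_counter);
		}
		return 0;
	}
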
Thanks to John Garry for running lots of tests on arm64 with previous
versions of these patches and for co-investigating all kinds of issues.

A git tree is available here:

    git://git.infradead.org/users/hch/block.git blk-mq-hotplug.3

Gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-mq-hotplug.3

Changes since v3:
  - add two new patches to clean up the magic -1 tag values
  - improve a few commit messages and comments
  - cleanup the blk_mq_all_tag_iter implementation
  - add a msleep to the cpu hot unplug case in __blk_mq_alloc_request

Changes since v2:
  - don't disable preemption and use smp calls


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-05-29 16:23 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-29 13:53 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
2020-05-29 13:53 ` [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
2020-05-29 13:53 ` [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
2020-05-29 13:53 ` [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
2020-05-29 13:53 ` [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG Christoph Hellwig
2020-05-29 13:53 ` [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places Christoph Hellwig
2020-05-29 13:53 ` [PATCH 6/8] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
2020-05-29 13:53 ` [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
2020-05-29 16:14   ` Bart Van Assche
2020-05-29 13:53 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
2020-05-29 14:34   ` Hannes Reinecke
2020-05-29 14:41   ` Daniel Wagner
2020-05-29 16:23 ` blk-mq: improvement CPU hotplug (simplified version) v4 Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2020-05-27 18:06 Christoph Hellwig
2020-05-27 20:07 ` Bart Van Assche
2020-05-27 20:31   ` John Garry
2020-05-29 13:26     ` Christoph Hellwig
2020-05-28  8:29 ` John Garry
