linux-block.vger.kernel.org archive mirror
* blk-mq: improvement CPU hotplug (simplified version) v3
@ 2020-05-20 17:06 Christoph Hellwig
  2020-05-20 17:06 ` [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
                   ` (6 more replies)
  0 siblings, 7 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-05-20 17:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

Hi all,

this series ensures I/O is quiesced before a CPU, and thus the managed
interrupt handler, is shut down.

This patchset addresses the issue with the following approach:

 - before the last CPU in hctx->cpumask goes offline, mark this
   hctx as inactive

 - disable preemption while allocating a tag for a request, and after
   the tag is allocated, check if this hctx is inactive.  If yes, give
   up the allocation and try remote allocation from online CPUs

 - before the hctx becomes inactive, drain all requests allocated on
   this hctx (a rough sketch of this offline-time flow follows below)
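
In rough outline the offline side looks like the sketch below; this is
just a condensed form of the handler added in patch 6/6 later in this
series, see that patch for the real code:

static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	/* only act when the last online CPU mapped to this hctx goes away */
	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
		return 0;

	/* stop new tag allocations on this hctx ... */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb__after_atomic();

	/* ... and wait for already allocated requests to finish */
	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
		while (blk_mq_hctx_has_requests(hctx))
			msleep(5);
		percpu_ref_put(&hctx->queue->q_usage_counter);
	}
	return 0;
}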

The guts of the changes are from Ming Lei; I just did a bunch of prep
cleanups so that they fit in more nicely.  The series also depends on
my "avoid a few q_usage_counter roundtrips v3" series.

Thanks to John Garry for running lots of tests on arm64 with previous
versions of these patches and for helping to investigate all kinds of
issues.

A git tree is available here:

    git://git.infradead.org/users/hch/block.git blk-mq-hotplug.2

Gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-mq-hotplug.2

Changes for v3:
  - don't disable preemption and use smp calls


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request
  2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
@ 2020-05-20 17:06 ` Christoph Hellwig
  2020-05-20 18:16   ` Bart Van Assche
  2020-05-22  9:11   ` Hannes Reinecke
  2020-05-20 17:06 ` [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-05-20 17:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

None of the I/O schedulers actually needs it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/bfq-iosched.c      | 2 +-
 block/blk-mq.c           | 2 +-
 block/kyber-iosched.c    | 2 +-
 block/mq-deadline.c      | 2 +-
 include/linux/elevator.h | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 3d411716d7ee4..50c8f034c01c5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6073,7 +6073,7 @@ static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
  * comments on bfq_init_rq for the reason behind this delayed
  * preparation.
  */
-static void bfq_prepare_request(struct request *rq, struct bio *bio)
+static void bfq_prepare_request(struct request *rq)
 {
 	/*
 	 * Regardless of whether we have an icq attached, we have to
diff --git a/block/blk-mq.c b/block/blk-mq.c
index cac11945f6023..850e3642efc40 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -387,7 +387,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 			if (e->type->icq_cache)
 				blk_mq_sched_assign_ioc(rq);
 
-			e->type->ops.prepare_request(rq, bio);
+			e->type->ops.prepare_request(rq);
 			rq->rq_flags |= RQF_ELVPRIV;
 		}
 	}
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 34dcea0ef6377..a38c5ab103d12 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -579,7 +579,7 @@ static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
 	return merged;
 }
 
-static void kyber_prepare_request(struct request *rq, struct bio *bio)
+static void kyber_prepare_request(struct request *rq)
 {
 	rq_set_domain_token(rq, -1);
 }
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b490f47fd553c..b57470e154c8f 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -541,7 +541,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
  * Nothing to do here. This is defined only to ensure that .finish_request
  * method is called upon request completion.
  */
-static void dd_prepare_request(struct request *rq, struct bio *bio)
+static void dd_prepare_request(struct request *rq)
 {
 }
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 901bda352dcb7..bacc40a0bdf39 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -39,7 +39,7 @@ struct elevator_mq_ops {
 	void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);
 	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
 	void (*limit_depth)(unsigned int, struct blk_mq_alloc_data *);
-	void (*prepare_request)(struct request *, struct bio *bio);
+	void (*prepare_request)(struct request *);
 	void (*finish_request)(struct request *);
 	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
 	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention
  2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
  2020-05-20 17:06 ` [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
@ 2020-05-20 17:06 ` Christoph Hellwig
  2020-05-20 18:22   ` Bart Van Assche
  2020-05-22  9:13   ` Hannes Reinecke
  2020-05-20 17:06 ` [PATCH 3/6] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-05-20 17:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

The bio argument is entirely unused, and the request_queue can be passed
through the alloc_data, given that it needs to be filled out for the
low-level tag allocation anyway.  Also rename the function to
__blk_mq_alloc_request as the switch between get and alloc in the call
chains is rather confusing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 850e3642efc40..2250e6397559b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -332,10 +332,9 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	return rq;
 }
 
-static struct request *blk_mq_get_request(struct request_queue *q,
-					  struct bio *bio,
-					  struct blk_mq_alloc_data *data)
+static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 {
+	struct request_queue *q = data->q;
 	struct elevator_queue *e = q->elevator;
 	struct request *rq;
 	unsigned int tag;
@@ -346,7 +345,6 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	if (blk_queue_rq_alloc_time(q))
 		alloc_time_ns = ktime_get_ns();
 
-	data->q = q;
 	if (likely(!data->ctx)) {
 		data->ctx = blk_mq_get_ctx(q);
 		clear_ctx_on_error = true;
@@ -398,7 +396,11 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 		blk_mq_req_flags_t flags)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
+	struct blk_mq_alloc_data data = {
+		.q		= q,
+		.flags		= flags,
+		.cmd_flags	= op,
+	};
 	struct request *rq;
 	int ret;
 
@@ -406,7 +408,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 	if (ret)
 		return ERR_PTR(ret);
 
-	rq = blk_mq_get_request(q, NULL, &alloc_data);
+	rq = __blk_mq_alloc_request(&data);
 	if (!rq)
 		goto out_queue_exit;
 	rq->__data_len = 0;
@@ -422,7 +424,11 @@ EXPORT_SYMBOL(blk_mq_alloc_request);
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
+	struct blk_mq_alloc_data data = {
+		.q		= q,
+		.flags		= flags,
+		.cmd_flags	= op,
+	};
 	struct request *rq;
 	unsigned int cpu;
 	int ret;
@@ -448,14 +454,14 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	 * If not tell the caller that it should skip this queue.
 	 */
 	ret = -EXDEV;
-	alloc_data.hctx = q->queue_hw_ctx[hctx_idx];
-	if (!blk_mq_hw_queue_mapped(alloc_data.hctx))
+	data.hctx = q->queue_hw_ctx[hctx_idx];
+	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
-	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
-	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
+	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	ret = -EWOULDBLOCK;
-	rq = blk_mq_get_request(q, NULL, &alloc_data);
+	rq = __blk_mq_alloc_request(&data);
 	if (!rq)
 		goto out_queue_exit;
 	return rq;
@@ -2016,7 +2022,9 @@ blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
 	const int is_flush_fua = op_is_flush(bio->bi_opf);
-	struct blk_mq_alloc_data data = { .flags = 0};
+	struct blk_mq_alloc_data data = {
+		.q		= q,
+	};
 	struct request *rq;
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
@@ -2040,7 +2048,7 @@ blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	rq_qos_throttle(q, bio);
 
 	data.cmd_flags = bio->bi_opf;
-	rq = blk_mq_get_request(q, bio, &data);
+	rq = __blk_mq_alloc_request(&data);
 	if (unlikely(!rq)) {
 		rq_qos_cleanup(q, bio);
 		if (bio->bi_opf & REQ_NOWAIT)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 3/6] blk-mq: move more request initialization to blk_mq_rq_ctx_init
  2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
  2020-05-20 17:06 ` [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
  2020-05-20 17:06 ` [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
@ 2020-05-20 17:06 ` Christoph Hellwig
  2020-05-20 20:10   ` Bart Van Assche
  2020-05-20 17:06 ` [PATCH 4/6] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2020-05-20 17:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

Don't split request initialization between __blk_mq_alloc_request and
blk_mq_rq_ctx_init.  Also remove a pointless argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2250e6397559b..1ffbc5d9e7cfe 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -271,7 +271,7 @@ static inline bool blk_mq_need_time_stamp(struct request *rq)
 }
 
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
-		unsigned int tag, unsigned int op, u64 alloc_time_ns)
+		unsigned int tag, u64 alloc_time_ns)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 	struct request *rq = tags->static_rqs[tag];
@@ -295,7 +295,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->mq_ctx = data->ctx;
 	rq->mq_hctx = data->hctx;
 	rq->rq_flags = rq_flags;
-	rq->cmd_flags = op;
+	rq->cmd_flags = data->cmd_flags;
 	if (data->flags & BLK_MQ_REQ_PREEMPT)
 		rq->rq_flags |= RQF_PREEMPT;
 	if (blk_queue_io_stat(data->q))
@@ -327,8 +327,23 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
 
-	data->ctx->rq_dispatched[op_is_sync(op)]++;
+	data->ctx->rq_dispatched[op_is_sync(data->cmd_flags)]++;
 	refcount_set(&rq->ref, 1);
+
+	if (!op_is_flush(data->cmd_flags)) {
+		struct elevator_queue *e = data->q->elevator;
+
+		rq->elv.icq = NULL;
+		if (e && e->type->ops.prepare_request) {
+			if (e->type->icq_cache)
+				blk_mq_sched_assign_ioc(rq);
+
+			e->type->ops.prepare_request(rq);
+			rq->rq_flags |= RQF_ELVPRIV;
+		}
+	}
+
+	data->hctx->queued++;
 	return rq;
 }
 
@@ -336,7 +351,6 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 {
 	struct request_queue *q = data->q;
 	struct elevator_queue *e = q->elevator;
-	struct request *rq;
 	unsigned int tag;
 	bool clear_ctx_on_error = false;
 	u64 alloc_time_ns = 0;
@@ -378,19 +392,7 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 		return NULL;
 	}
 
-	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
-	if (!op_is_flush(data->cmd_flags)) {
-		rq->elv.icq = NULL;
-		if (e && e->type->ops.prepare_request) {
-			if (e->type->icq_cache)
-				blk_mq_sched_assign_ioc(rq);
-
-			e->type->ops.prepare_request(rq);
-			rq->rq_flags |= RQF_ELVPRIV;
-		}
-	}
-	data->hctx->queued++;
-	return rq;
+	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 4/6] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2020-05-20 17:06 ` [PATCH 3/6] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
@ 2020-05-20 17:06 ` Christoph Hellwig
  2020-05-22  9:17   ` Hannes Reinecke
  2020-05-20 17:06 ` [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2020-05-20 17:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

blk_mq_alloc_request_hctx is only used for NVMeoF connect commands, so
tailor it to the specific requirements, and don't both the general
fast path code with its special twinkles.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1ffbc5d9e7cfe..42aee2978464b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -351,21 +351,13 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 {
 	struct request_queue *q = data->q;
 	struct elevator_queue *e = q->elevator;
-	unsigned int tag;
-	bool clear_ctx_on_error = false;
 	u64 alloc_time_ns = 0;
+	unsigned int tag;
 
 	/* alloc_time includes depth and tag waits */
 	if (blk_queue_rq_alloc_time(q))
 		alloc_time_ns = ktime_get_ns();
 
-	if (likely(!data->ctx)) {
-		data->ctx = blk_mq_get_ctx(q);
-		clear_ctx_on_error = true;
-	}
-	if (likely(!data->hctx))
-		data->hctx = blk_mq_map_queue(q, data->cmd_flags,
-						data->ctx);
 	if (data->cmd_flags & REQ_NOWAIT)
 		data->flags |= BLK_MQ_REQ_NOWAIT;
 
@@ -381,17 +373,16 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 		    e->type->ops.limit_depth &&
 		    !(data->flags & BLK_MQ_REQ_RESERVED))
 			e->type->ops.limit_depth(data->cmd_flags, data);
-	} else {
-		blk_mq_tag_busy(data->hctx);
 	}
 
+	data->ctx = blk_mq_get_ctx(q);
+	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
+	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
+		blk_mq_tag_busy(data->hctx);
+
 	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_TAG_FAIL) {
-		if (clear_ctx_on_error)
-			data->ctx = NULL;
+	if (tag == BLK_MQ_TAG_FAIL)
 		return NULL;
-	}
-
 	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
@@ -431,17 +422,22 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		.flags		= flags,
 		.cmd_flags	= op,
 	};
-	struct request *rq;
+	u64 alloc_time_ns = 0;
 	unsigned int cpu;
+	unsigned int tag;
 	int ret;
 
+	/* alloc_time includes depth and tag waits */
+	if (blk_queue_rq_alloc_time(q))
+		alloc_time_ns = ktime_get_ns();
+
 	/*
 	 * If the tag allocator sleeps we could get an allocation for a
 	 * different hardware context.  No need to complicate the low level
 	 * allocator for this for the rare use case of a command tied to
 	 * a specific queue.
 	 */
-	if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT)))
+	if (WARN_ON_ONCE(!(flags & (BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED))))
 		return ERR_PTR(-EINVAL);
 
 	if (hctx_idx >= q->nr_hw_queues)
@@ -462,11 +458,17 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
+	if (q->elevator)
+		data.flags |= BLK_MQ_REQ_INTERNAL;
+	else
+		blk_mq_tag_busy(data.hctx);
+
 	ret = -EWOULDBLOCK;
-	rq = __blk_mq_alloc_request(&data);
-	if (!rq)
+	tag = blk_mq_get_tag(&data);
+	if (tag == BLK_MQ_TAG_FAIL)
 		goto out_queue_exit;
-	return rq;
+	return blk_mq_rq_ctx_init(&data, tag, alloc_time_ns);
+
 out_queue_exit:
 	blk_queue_exit(q);
 	return ERR_PTR(ret);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter
  2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2020-05-20 17:06 ` [PATCH 4/6] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
@ 2020-05-20 17:06 ` Christoph Hellwig
  2020-05-20 20:24   ` Bart Van Assche
  2020-05-22  9:18   ` Hannes Reinecke
  2020-05-20 17:06 ` [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
  2020-05-20 21:46 ` blk-mq: improvement CPU hotplug (simplified version) v3 Bart Van Assche
  6 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-05-20 17:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

From: Ming Lei <ming.lei@redhat.com>

Add a new function, blk_mq_all_tag_iter, so that we can iterate over
every allocated request in either the I/O scheduler tags or the driver
tags.  This is more flexible since it allows the callers to do whatever
they want with each allocated request.

It will be used later in this patchset to drain the allocated requests
on a specific hctx.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-tag.c | 33 +++++++++++++++++++++++++++++----
 block/blk-mq-tag.h |  2 ++
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904ab..c27c6dfc7d36e 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -257,6 +257,7 @@ struct bt_tags_iter_data {
 	busy_tag_iter_fn *fn;
 	void *data;
 	bool reserved;
+	bool iterate_all;
 };
 
 static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
@@ -274,7 +275,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * test and set the bit before assining ->rqs[].
 	 */
 	rq = tags->rqs[bitnr];
-	if (rq && blk_mq_request_started(rq))
+	if (rq && (iter_data->iterate_all || blk_mq_request_started(rq)))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
 	return true;
@@ -294,13 +295,15 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		bitmap_tags member of struct blk_mq_tags.
  */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
-			     busy_tag_iter_fn *fn, void *data, bool reserved)
+			     busy_tag_iter_fn *fn, void *data, bool reserved,
+			     bool iterate_all)
 {
 	struct bt_tags_iter_data iter_data = {
 		.tags = tags,
 		.fn = fn,
 		.data = data,
 		.reserved = reserved,
+		.iterate_all = iterate_all,
 	};
 
 	if (tags->rqs)
@@ -321,8 +324,30 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
+				 false);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, false);
+}
+
+/**
+ * blk_mq_all_tag_iter - iterate over all requests in a tag map
+ * @tags:	Tag map to iterate over.
+ * @fn:		Pointer to the function that will be called for each
+ *		request. @fn will be called as follows: @fn(rq, @priv,
+ *		reserved) where rq is a pointer to a request. 'reserved'
+ *		indicates whether or not @rq is a reserved request. Return
+ *		true to continue iterating tags, false to stop.
+ * @priv:	Will be passed as second argument to @fn.
+ *
+ * It is the caller's responsibility to check rq's state in @fn.
+ */
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+		void *priv)
+{
+	if (tags->nr_reserved_tags)
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
+				 true);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, true);
 }
 
 /**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 2b8321efb6820..d19546e8246b7 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -34,6 +34,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2020-05-20 17:06 ` [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
@ 2020-05-20 17:06 ` Christoph Hellwig
  2020-05-22  9:25   ` Hannes Reinecke
  2020-05-20 21:46 ` blk-mq: improvement CPU hotplug (simplified version) v3 Bart Van Assche
  6 siblings, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2020-05-20 17:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

From: Ming Lei <ming.lei@redhat.com>

Most blk-mq drivers depend on the auto-affinity of managed IRQs to set
up their queue mapping. Thomas mentioned the following point[1]:

"That was the constraint of managed interrupts from the very beginning:

 The driver/subsystem has to quiesce the interrupt line and the associated
 queue _before_ it gets shutdown in CPU unplug and not fiddle with it
 until it's restarted by the core when the CPU is plugged in again."

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down.  Even worse,
CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so
there is no chance to quiesce the hctx before shutting down the CPU.

Add a new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq
hctxs whose last CPU goes away, and to wait for completion of in-flight
requests.  This guarantees that there is no in-flight I/O before
shutting down the managed IRQ.

Add a BLK_MQ_F_STACKING flag and set it for dm-rq and loop, so that we
don't wait for completion of in-flight requests from these drivers,
which would otherwise risk a deadlock.  This is safe for stacking
drivers as they do not use interrupts at all and their I/O completions
are triggered by the underlying devices' I/O completions.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: different retry mechanism, merged two patches, minor cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   8 +++
 block/blk-mq.c             | 114 ++++++++++++++++++++++++++++++++++++-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |  10 ++++
 include/linux/cpuhotplug.h |   1 +
 7 files changed, 135 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 96b7a35c898a7..15df3a36e9fa4 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
@@ -239,6 +240,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(STACKING),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index c27c6dfc7d36e..3925d1e55bc8f 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	sbitmap_finish_wait(bt, ws, &wait);
 
 found_tag:
+	/*
+	 * Give up this allocation if the hctx is inactive.  The caller will
+	 * retry on an active hctx.
+	 */
+	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
+		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
+		return -1;
+	}
 	return tag + tag_offset;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 42aee2978464b..672c7e3f61243 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -375,14 +375,32 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 			e->type->ops.limit_depth(data->cmd_flags, data);
 	}
 
+retry:
 	data->ctx = blk_mq_get_ctx(q);
 	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
 		blk_mq_tag_busy(data->hctx);
 
+	/*
+	 * Waiting allocations only fail because of an inactive hctx.  In that
+	 * case just retry the hctx assignment and tag allocation as CPU hotplug
+	 * should have migrated us to an online CPU by now.
+	 */
 	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_TAG_FAIL)
-		return NULL;
+	if (tag == BLK_MQ_TAG_FAIL) {
+		if (data->flags & BLK_MQ_REQ_NOWAIT)
+			return NULL;
+
+		/*
+		 * All kthreads that can perform I/O should have been moved off
+		 * this CPU by the time the the CPU hotplug statemachine has
+		 * shut down a hctx.  But better be sure with an extra sanity
+		 * check.
+		 */
+		if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
+			return NULL;
+		goto retry;
+	}
 	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
@@ -2324,6 +2342,86 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+struct rq_iter_data {
+	struct blk_mq_hw_ctx *hctx;
+	bool has_rq;
+};
+
+static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
+{
+	struct rq_iter_data *iter_data = data;
+
+	if (rq->mq_hctx != iter_data->hctx)
+		return true;
+	iter_data->has_rq = true;
+	return false;
+}
+
+static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_tags *tags = hctx->sched_tags ?
+			hctx->sched_tags : hctx->tags;
+	struct rq_iter_data data = {
+		.hctx	= hctx,
+	};
+
+	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
+	return data.has_rq;
+}
+
+static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
+		struct blk_mq_hw_ctx *hctx)
+{
+	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
+		return false;
+	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
+		return false;
+	return true;
+}
+
+static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
+	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
+		return 0;
+
+	/*
+	 * Prevent new request from being allocated on the current hctx.
+	 *
+	 * The smp_mb__after_atomic() Pairs with the implied barrier in
+	 * test_and_set_bit_lock in sbitmap_get().  Ensures the inactive flag is
+	 * seen once we return from the tag allocator.
+	 */
+	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	smp_mb__after_atomic();
+
+	/*
+	 * Try to grab a reference to the queue and wait for any outstanding
+	 * requests.  If we could not grab a reference the queue has been
+	 * frozen and there are no requests.
+	 */
+	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
+		while (blk_mq_hctx_has_requests(hctx))
+			msleep(5);
+		percpu_ref_put(&hctx->queue->q_usage_counter);
+	}
+
+	return 0;
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (cpumask_test_cpu(cpu, hctx->cpumask))
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2337,6 +2435,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
@@ -2360,6 +2461,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_STACKING))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2419,6 +2523,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_STACKING))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3673,6 +3780,9 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				blk_mq_hctx_notify_online,
+				blk_mq_hctx_notify_offline);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e5..d7904b4d8d126 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13be..f60c025121215 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d7307795439a4..a20f8c241d665 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
 	 */
 	atomic_t		nr_active;
 
+	/** @cpuhp_online: List to store request if CPU is going to die */
+	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
 	struct hlist_node	cpuhp_dead;
 	/** @kobj: Kernel object for sysfs. */
@@ -391,6 +393,11 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	/*
+	 * Set when this device requires underlying blk-mq device for
+	 * completing IO:
+	 */
+	BLK_MQ_F_STACKING	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
@@ -400,6 +407,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 77d70b6335318..24b3a77810b6d 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -152,6 +152,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request
  2020-05-20 17:06 ` [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
@ 2020-05-20 18:16   ` Bart Van Assche
  2020-05-22  9:11   ` Hannes Reinecke
  1 sibling, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2020-05-20 18:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner

On 2020-05-20 10:06, Christoph Hellwig wrote:
> None of the I/O schedulers actually needs it.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention
  2020-05-20 17:06 ` [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
@ 2020-05-20 18:22   ` Bart Van Assche
  2020-05-22  9:13   ` Hannes Reinecke
  1 sibling, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2020-05-20 18:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner

On 2020-05-20 10:06, Christoph Hellwig wrote:
> The bio argument is entirely unused, and the request_queue can be passed
> through the alloc_data, given that it needs to be filled out for the
> low-level tag allocation anyway.  Also rename the function to
> __blk_mq_alloc_request as the switch between get and alloc in the call
> chains is rather confusing.

I think there are three changes in this patch (remove 'bio' argument,
pass request queue pointer through struct blk_mq_alloc_data instead of
as an argument and rename 'alloc_data' into 'data') which makes this
patch harder to read than necessary. Additionally, I like the old name
"alloc_data" better than the new name "data". Anyway:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 3/6] blk-mq: move more request initialization to blk_mq_rq_ctx_init
  2020-05-20 17:06 ` [PATCH 3/6] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
@ 2020-05-20 20:10   ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2020-05-20 20:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner

On 2020-05-20 10:06, Christoph Hellwig wrote:
> Don't split request initialization between __blk_mq_alloc_request and
> blk_mq_rq_ctx_init.  Also remove a pointless argument.

Although I'm not sure whether the 'op' argument should be called
pointless since this patch passes that argument in another way to
blk_mq_rq_ctx_init():

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter
  2020-05-20 17:06 ` [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
@ 2020-05-20 20:24   ` Bart Van Assche
  2020-05-27  6:05     ` Christoph Hellwig
  2020-05-22  9:18   ` Hannes Reinecke
  1 sibling, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2020-05-20 20:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner, Ming Lei

On 2020-05-20 10:06, Christoph Hellwig wrote:
> @@ -321,8 +324,30 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>  		busy_tag_iter_fn *fn, void *priv)
>  {
>  	if (tags->nr_reserved_tags)
> -		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
> -	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
> +		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
> +				 false);
> +	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, false);
> +}

Adding /*reserved=*/ and /*iterate_all=*/ comments in front of the
boolean arguments would make this code easier to read.
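
E.g. (just spelling out the suggested style for these two calls, nothing
else changed):

	if (tags->nr_reserved_tags)
		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv,
				 /*reserved=*/true, /*iterate_all=*/false);
	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv,
			 /*reserved=*/false, /*iterate_all=*/false);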

> +/**
> + * blk_mq_all_tag_iter - iterate over all requests in a tag map
> + * @tags:	Tag map to iterate over.
> + * @fn:		Pointer to the function that will be called for each
> + *		request. @fn will be called as follows: @fn(rq, @priv,
> + *		reserved) where rq is a pointer to a request. 'reserved'
> + *		indicates whether or not @rq is a reserved request. Return
> + *		true to continue iterating tags, false to stop.
> + * @priv:	Will be passed as second argument to @fn.
> + *
> + * It is the caller's responsibility to check rq's state in @fn.
                ^^^^^^
                @fn?

Not sure how the blk_mq_all_tag_iter() caller can check the request
state ...
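
I guess it is @fn itself that has to filter on the request state, e.g.
something like this (illustrative callback, not taken from this patch):

	static bool example_iter_fn(struct request *rq, void *priv, bool reserved)
	{
		/* with blk_mq_all_tag_iter() rq may be in any state */
		if (!blk_mq_request_started(rq))
			return true;	/* skip it, keep iterating */
		/* ... act on started requests only ... */
		return true;
	}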

> + */
> +void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
> +		void *priv)
> +{
> +	if (tags->nr_reserved_tags)
> +		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
> +				 true);
> +	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, true);
>  }

Same comment here: I think that adding /*reserved=*/ and
/*iterate_all=*/ comments in front of the boolean arguments would make
this code easier to read.

Otherwise this patch looks good to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2020-05-20 17:06 ` [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
@ 2020-05-20 21:46 ` Bart Van Assche
  2020-05-21  2:57   ` Ming Lei
  6 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2020-05-20 21:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner

On 2020-05-20 10:06, Christoph Hellwig wrote:
> this series ensures I/O is quiesced before a cpu and thus the managed
> interrupt handler is shut down.
> 
> This patchset tries to address the issue by the following approach:
> 
>  - before the last cpu in hctx->cpumask is going to offline, mark this
>    hctx as inactive
> 
>  - disable preempt during allocating tag for request, and after tag is
>    allocated, check if this hctx is inactive. If yes, give up the
>    allocation and try remote allocation from online CPUs
> 
>  - before hctx becomes inactive, drain all allocated requests on this
>    hctx

What is not clear to me is which assumptions about the relationship
between interrupts and hardware queues this patch series is based on.
Does this patch series perhaps only support a 1:1 mapping between
interrupts and hardware queues? What if there are more hardware queues
than interrupts? An example of a block driver that allocates multiple
hardware queues is the NVMeOF initiator driver. From the NVMeOF
initiator driver function nvme_rdma_alloc_tagset() and for the code that
refers to I/O queues:

	set->nr_hw_queues = nctrl->queue_count - 1;

From nvme_rdma_alloc_io_queues():

	nr_read_queues = min_t(unsigned int, ibdev->num_comp_vectors,
				min(opts->nr_io_queues,
				    num_online_cpus()));
	nr_default_queues =  min_t(unsigned int,
	 			ibdev->num_comp_vectors,
				min(opts->nr_write_queues,
					 num_online_cpus()));
	nr_poll_queues = min(opts->nr_poll_queues, num_online_cpus());
	nr_io_queues = nr_read_queues + nr_default_queues +
			 nr_poll_queues;
	[ ... ]
	ctrl->ctrl.queue_count = nr_io_queues + 1;

From nvmf_parse_options():

	/* Set defaults */
	opts->nr_io_queues = num_online_cpus();

Can this e.g. result in 16 hardware queues being allocated for I/O even
if the underlying RDMA adapter only supports four interrupt vectors?
Does that mean that four hardware queues will be associated with each
interrupt vector? If the CPU to which one of these interrupt vectors has
been assigned is hotplugged, does that mean that four hardware queues
have to be quiesced instead of only one as is done in patch 6/6?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-20 21:46 ` blk-mq: improvement CPU hotplug (simplified version) v3 Bart Van Assche
@ 2020-05-21  2:57   ` Ming Lei
  2020-05-21  3:50     ` Bart Van Assche
  0 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2020-05-21  2:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On Wed, May 20, 2020 at 02:46:52PM -0700, Bart Van Assche wrote:
> On 2020-05-20 10:06, Christoph Hellwig wrote:
> > this series ensures I/O is quiesced before a cpu and thus the managed
> > interrupt handler is shut down.
> > 
> > This patchset tries to address the issue by the following approach:
> > 
> >  - before the last cpu in hctx->cpumask is going to offline, mark this
> >    hctx as inactive
> > 
> >  - disable preempt during allocating tag for request, and after tag is
> >    allocated, check if this hctx is inactive. If yes, give up the
> >    allocation and try remote allocation from online CPUs
> > 
> >  - before hctx becomes inactive, drain all allocated requests on this
> >    hctx
> 
> What is not clear to me is which assumptions about the relationship
> between interrupts and hardware queues this patch series is based on.
> Does this patch series perhaps only support a 1:1 mapping between
> interrupts and hardware queues?

No, it supports any mapping, but the issue won't be triggered on 1:N
mapping, since this kind of hctx never becomes inactive.

> What if there are more hardware queues
> than interrupts? An example of a block driver that allocates multiple

It doesn't matter, see the comment below.

> hardware queues is the NVMeOF initiator driver. From the NVMeOF
> initiator driver function nvme_rdma_alloc_tagset() and for the code that
> refers to I/O queues:
> 
> 	set->nr_hw_queues = nctrl->queue_count - 1;
> 
> From nvme_rdma_alloc_io_queues():
> 
> 	nr_read_queues = min_t(unsigned int, ibdev->num_comp_vectors,
> 				min(opts->nr_io_queues,
> 				    num_online_cpus()));
> 	nr_default_queues =  min_t(unsigned int,
> 	 			ibdev->num_comp_vectors,
> 				min(opts->nr_write_queues,
> 					 num_online_cpus()));
> 	nr_poll_queues = min(opts->nr_poll_queues, num_online_cpus());
> 	nr_io_queues = nr_read_queues + nr_default_queues +
> 			 nr_poll_queues;
> 	[ ... ]
> 	ctrl->ctrl.queue_count = nr_io_queues + 1;
> 
> From nvmf_parse_options():
> 
> 	/* Set defaults */
> 	opts->nr_io_queues = num_online_cpus();
> 
> Can this e.g. result in 16 hardware queues being allocated for I/O even
> if the underlying RDMA adapter only supports four interrupt vectors?
> Does that mean that four hardware queues will be associated with each
> interrupt vector?

The patchset actually doesn't bind to interrupt vectors, that is, we
don't care about the actual interrupt allocation.

> If the CPU to which one of these interrupt vectors has
> been assigned is hotplugged, does that mean that four hardware queues
> have to be quiesced instead of only one as is done in patch 6/6?

No, one hctx only becomes inactive after each CPU in hctx->cpumask is offline.
No matter how interrupt vector is assigned to hctx, requests shouldn't
be dispatched to that hctx any more.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-21  2:57   ` Ming Lei
@ 2020-05-21  3:50     ` Bart Van Assche
  2020-05-21  4:33       ` Ming Lei
  0 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2020-05-21  3:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On 2020-05-20 19:57, Ming Lei wrote:
> On Wed, May 20, 2020 at 02:46:52PM -0700, Bart Van Assche wrote:
>> If the CPU to which one of these interrupt vectors has
>> been assigned is hotplugged, does that mean that four hardware queues
>> have to be quiesced instead of only one as is done in patch 6/6?
> 
> No, one hctx only becomes inactive after each CPU in hctx->cpumask is offline.
> No matter how interrupt vector is assigned to hctx, requests shouldn't
> be dispatched to that hctx any more.

Since I haven't found an answer to my question in your reply I will
rephrase my question. Suppose that there are 16 CPU cores, 16 hardware
queues and that hctx->cpumask of each hardware queue i only contains CPU
i. Suppose that four interrupt vectors (0, 1, 2 and 3) are used to
report the completions for these hardware queues. Suppose that interrupt
vector 3 is associated with hardware queues 12, 13, 14 and 15, and also
that interrupt vector 3 is mapped to CPU core 14. My interpretation of
patch 6/6 is that it will only quiesce hardware queue 14 but none of the
other hardware queues associated with the same interrupt vector
(hardware queues 12, 13 and 15). Isn't that a bug?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-21  3:50     ` Bart Van Assche
@ 2020-05-21  4:33       ` Ming Lei
  2020-05-21 19:15         ` Bart Van Assche
  0 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2020-05-21  4:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On Wed, May 20, 2020 at 08:50:56PM -0700, Bart Van Assche wrote:
> On 2020-05-20 19:57, Ming Lei wrote:
> > On Wed, May 20, 2020 at 02:46:52PM -0700, Bart Van Assche wrote:
> >> If the CPU to which one of these interrupt vectors has
> >> been assigned is hotplugged, does that mean that four hardware queues
> >> have to be quiesced instead of only one as is done in patch 6/6?
> > 
> > No, one hctx only becomes inactive after each CPU in hctx->cpumask is offline.
> > No matter how interrupt vector is assigned to hctx, requests shouldn't
> > be dispatched to that hctx any more.
> 
> Since I haven't found an answer to my question in your reply I will
> rephrase my question. Suppose that there are 16 CPU cores, 16 hardware
> queues and that hctx->cpumask of each hardware queue i only contains CPU
> i. Suppose that four interrupt vectors (0, 1, 2 and 3) are used to
> report the completions for these hardware queues. Suppose that interrupt
> vector 3 is associated with hardware queues 12, 13, 14 and 15, and also
> that interrupt vector 3 is mapped to CPU core 14. My interpretation of
> patch 6/6 is that it will only quiesce hardware queue 14 but none of the
> other hardware queues associated with the same interrupt vector
> (hardware queues 12, 13 and 15). Isn't that a bug?

No.

If vector 3 is for covering hw queue 12 ~ 15, the vector shouldn't be
shutdown when cpu 14 is offline.

Also I am pretty sure that we don't do this way with managed IRQ. And
non-managed IRQ will be migrated to other online cpus during cpu offline,
so not an issue at all. See migrate_one_irq().


Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-21  4:33       ` Ming Lei
@ 2020-05-21 19:15         ` Bart Van Assche
  2020-05-22  2:39           ` Ming Lei
  0 siblings, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2020-05-21 19:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On 2020-05-20 21:33, Ming Lei wrote:
> No.
> 
> If vector 3 is for covering hw queue 12 ~ 15, the vector shouldn't be
> shutdown when cpu 14 is offline.
>> Also I am pretty sure that we don't do this way with managed IRQ. And
> non-managed IRQ will be migrated to other online cpus during cpu offline,
> so not an issue at all. See migrate_one_irq().

Thanks for the pointer to migrate_one_irq().

However, I'm not convinced the above statement is correct. My
understanding is that the block driver knows which interrupt vector has
been associated with which hardware queue but that the blk-mq core does
not. It seems to me that patch 6/6 of this series is based on the
following assumptions:
(a) That the interrupt that is associated with a hardware queue is
    processed by one of the CPU's in hctx->cpumask.
(b) That hardware queues do not share interrupt vectors.

I don't think that either assumption is correct.

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-21 19:15         ` Bart Van Assche
@ 2020-05-22  2:39           ` Ming Lei
  2020-05-22 14:47             ` Keith Busch
  2020-05-23 15:19             ` Bart Van Assche
  0 siblings, 2 replies; 31+ messages in thread
From: Ming Lei @ 2020-05-22  2:39 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On Thu, May 21, 2020 at 12:15:52PM -0700, Bart Van Assche wrote:
> On 2020-05-20 21:33, Ming Lei wrote:
> > No.
> > 
> > If vector 3 is for covering hw queue 12 ~ 15, the vector shouldn't be
> > shutdown when cpu 14 is offline.
> >> Also I am pretty sure that we don't do this way with managed IRQ. And
> > non-managed IRQ will be migrated to other online cpus during cpu offline,
> > so not an issue at all. See migrate_one_irq().
> 
> Thanks for the pointer to migrate_one_irq().
> 
> However, I'm not convinced the above statement is correct. My
> understanding is that the block driver knows which interrupt vector has
> been associated with which hardware queue but the blk-mq core not. It
> seems to me that patch 6/6 of this series is based on the following
> assumptions:
> (a) That the interrupt that is associated with a hardware queue is
>     processed by one of the CPU's in hctx->cpumask.
> (b) That hardware queues do not share interrupt vectors.
> 
> I don't think that either assumption is correct.

What the patch tries to do is just:

- when the last CPU of hctx->cpumask is about to go offline, mark this
hctx as inactive, then drain any in-flight IO requests originating from
this hctx

The correctness argument is that once we stop producing requests, we can
drain any in-flight requests before shutting down the last CPU of the
hctx. Then finally this hctx becomes completely quiesced. Do you think
this way is wrong? If yes, please prove it.

So the correctness of patch 6/6 does not depend on the two assumptions,
does it?

This approach solves the issue of requests timing out or never
completing in case the managed interrupt affinity is the same as the hw
queue's cpumask. I believe this is the normal usage, and most storage
drivers use managed interrupts in this way. The motivation of this patch
is to fix this kind of normal usage.

You may argue that two hw queues may share a single managed interrupt;
that is possible if a driver plays that trick. But if a driver plays
that trick, it is the driver's responsibility to guarantee that the
managed irq won't be shut down while either of the two hctxs is still
active, for example by making sure that the union of the two hctx
cpumasks is a subset of this managed interrupt's affinity. It is
definitely a strange enough case, and this patch isn't supposed to cover
it, but it won't break it either. Also, just being curious, do you have
such an in-tree case? And are you sure the driver uses managed
interrupts?
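
Just to illustrate what I mean by that cpumask condition, something
along these lines would be needed (purely illustrative names, not part
of this series):

	/*
	 * Illustrative only: both hctxs sharing 'irq' must be covered
	 * by that vector's affinity mask.
	 */
	static bool hctxs_covered_by_irq(struct blk_mq_hw_ctx *a,
			struct blk_mq_hw_ctx *b, unsigned int irq)
	{
		cpumask_var_t tmp;
		bool covered;

		if (!alloc_cpumask_var(&tmp, GFP_KERNEL))
			return false;
		cpumask_or(tmp, a->cpumask, b->cpumask);
		covered = cpumask_subset(tmp, irq_get_affinity_mask(irq));
		free_cpumask_var(tmp);
		return covered;
	}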

Again, there is no such problem in case of non-managed interrupts,
because they will be migrated to other online CPUs. And this patchset is
harmless for non-managed interrupts: it is still correct to quiesce a
hctx after all of its CPUs become offline from the blk-mq queue mapping
point of view, because no requests are produced any more.



Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request
  2020-05-20 17:06 ` [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
  2020-05-20 18:16   ` Bart Van Assche
@ 2020-05-22  9:11   ` Hannes Reinecke
  1 sibling, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2020-05-22  9:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On 5/20/20 7:06 PM, Christoph Hellwig wrote:
> None of the I/O schedulers actually needs it.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/bfq-iosched.c      | 2 +-
>   block/blk-mq.c           | 2 +-
>   block/kyber-iosched.c    | 2 +-
>   block/mq-deadline.c      | 2 +-
>   include/linux/elevator.h | 2 +-
>   5 files changed, 5 insertions(+), 5 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention
  2020-05-20 17:06 ` [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
  2020-05-20 18:22   ` Bart Van Assche
@ 2020-05-22  9:13   ` Hannes Reinecke
  1 sibling, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2020-05-22  9:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On 5/20/20 7:06 PM, Christoph Hellwig wrote:
> The bio argument is entirely unused, and the request_queue can be passed
> through the alloc_data, given that it needs to be filled out for the
> low-level tag allocation anyway.  Also rename the function to
> __blk_mq_alloc_request as the switch between get and alloc in the call
> chains is rather confusing.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq.c | 36 ++++++++++++++++++++++--------------
>   1 file changed, 22 insertions(+), 14 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 4/6] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  2020-05-20 17:06 ` [PATCH 4/6] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
@ 2020-05-22  9:17   ` Hannes Reinecke
  0 siblings, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2020-05-22  9:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On 5/20/20 7:06 PM, Christoph Hellwig wrote:
> blk_mq_alloc_request_hctx is only used for NVMeoF connect commands, so
> tailor it to the specific requirements, and don't both the general

bother?

> fast path code with its special twinkles.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq.c | 44 +++++++++++++++++++++++---------------------
>   1 file changed, 23 insertions(+), 21 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1ffbc5d9e7cfe..42aee2978464b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -351,21 +351,13 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
>   {
>   	struct request_queue *q = data->q;
>   	struct elevator_queue *e = q->elevator;
> -	unsigned int tag;
> -	bool clear_ctx_on_error = false;
>   	u64 alloc_time_ns = 0;
> +	unsigned int tag;
>   
>   	/* alloc_time includes depth and tag waits */
>   	if (blk_queue_rq_alloc_time(q))
>   		alloc_time_ns = ktime_get_ns();
>   
> -	if (likely(!data->ctx)) {
> -		data->ctx = blk_mq_get_ctx(q);
> -		clear_ctx_on_error = true;
> -	}
> -	if (likely(!data->hctx))
> -		data->hctx = blk_mq_map_queue(q, data->cmd_flags,
> -						data->ctx);
>   	if (data->cmd_flags & REQ_NOWAIT)
>   		data->flags |= BLK_MQ_REQ_NOWAIT;
>   
> @@ -381,17 +373,16 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
>   		    e->type->ops.limit_depth &&
>   		    !(data->flags & BLK_MQ_REQ_RESERVED))
>   			e->type->ops.limit_depth(data->cmd_flags, data);
> -	} else {
> -		blk_mq_tag_busy(data->hctx);
>   	}
>   
> +	data->ctx = blk_mq_get_ctx(q);
> +	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
> +	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
> +		blk_mq_tag_busy(data->hctx);
> +
>   	tag = blk_mq_get_tag(data);
> -	if (tag == BLK_MQ_TAG_FAIL) {
> -		if (clear_ctx_on_error)
> -			data->ctx = NULL;
> +	if (tag == BLK_MQ_TAG_FAIL)
>   		return NULL;
> -	}
> -
>   	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
>   }
>   
> @@ -431,17 +422,22 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>   		.flags		= flags,
>   		.cmd_flags	= op,
>   	};
> -	struct request *rq;
> +	u64 alloc_time_ns = 0;
>   	unsigned int cpu;
> +	unsigned int tag;
>   	int ret;
>   
> +	/* alloc_time includes depth and tag waits */
> +	if (blk_queue_rq_alloc_time(q))
> +		alloc_time_ns = ktime_get_ns();
> +
>   	/*
>   	 * If the tag allocator sleeps we could get an allocation for a
>   	 * different hardware context.  No need to complicate the low level
>   	 * allocator for this for the rare use case of a command tied to
>   	 * a specific queue.
>   	 */
> -	if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT)))
> +	if (WARN_ON_ONCE(!(flags & (BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED))))
>   		return ERR_PTR(-EINVAL);
>   
>   	if (hctx_idx >= q->nr_hw_queues)
> @@ -462,11 +458,17 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>   	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
>   	data.ctx = __blk_mq_get_ctx(q, cpu);
>   
> +	if (q->elevator)
> +		data.flags |= BLK_MQ_REQ_INTERNAL;
> +	else
> +		blk_mq_tag_busy(data.hctx);
> +
>   	ret = -EWOULDBLOCK;
> -	rq = __blk_mq_alloc_request(&data);
> -	if (!rq)
> +	tag = blk_mq_get_tag(&data);
> +	if (tag == BLK_MQ_TAG_FAIL)
>   		goto out_queue_exit;
> -	return rq;
> +	return blk_mq_rq_ctx_init(&data, tag, alloc_time_ns);
> +
>   out_queue_exit:
>   	blk_queue_exit(q);
>   	return ERR_PTR(ret);
> 
Other than that:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter
  2020-05-20 17:06 ` [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
  2020-05-20 20:24   ` Bart Van Assche
@ 2020-05-22  9:18   ` Hannes Reinecke
  1 sibling, 0 replies; 31+ messages in thread
From: Hannes Reinecke @ 2020-05-22  9:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

On 5/20/20 7:06 PM, Christoph Hellwig wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> Add one new function of blk_mq_all_tag_iter so that we can iterate every
> allocated request from either io scheduler tags or driver tags, and this
> way is more flexible since it allows the callers to do whatever they want
> on allocated request.
> 
> It will be used to implement draining allocated requests on specified
> hctx in this patchset.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-tag.c | 33 +++++++++++++++++++++++++++++----
>   block/blk-mq-tag.h |  2 ++
>   2 files changed, 31 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-20 17:06 ` [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
@ 2020-05-22  9:25   ` Hannes Reinecke
  2020-05-25  9:20     ` Ming Lei
  0 siblings, 1 reply; 31+ messages in thread
From: Hannes Reinecke @ 2020-05-22  9:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

On 5/20/20 7:06 PM, Christoph Hellwig wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
> up queue mapping. Thomas mentioned the following point[1]:
> 
> "That was the constraint of managed interrupts from the very beginning:
> 
>   The driver/subsystem has to quiesce the interrupt line and the associated
>   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>   until it's restarted by the core when the CPU is plugged in again."
> 
> However, current blk-mq implementation doesn't quiesce hw queue before
> the last CPU in the hctx is shutdown.  Even worse, CPUHP_BLK_MQ_DEAD is a
> cpuhp state handled after the CPU is down, so there isn't any chance to
> quiesce the hctx before shutting down the CPU.
> 
> Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
> where the last CPU goes away, and wait for completion of in-flight
> requests.  This guarantees that there is no inflight I/O before shutting
> down the managed IRQ.
> 
> Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
> to wait for completion of in-flight requests from these drivers to avoid
> a potential dead-lock. It is safe to do this for stacking drivers as those
> do not use interrupts at all and their I/O completions are triggered by
> underlying devices I/O completion.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> [hch: different retry mechanism, merged two patches, minor cleanups]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-debugfs.c     |   2 +
>   block/blk-mq-tag.c         |   8 +++
>   block/blk-mq.c             | 114 ++++++++++++++++++++++++++++++++++++-
>   drivers/block/loop.c       |   2 +-
>   drivers/md/dm-rq.c         |   2 +-
>   include/linux/blk-mq.h     |  10 ++++
>   include/linux/cpuhotplug.h |   1 +
>   7 files changed, 135 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 96b7a35c898a7..15df3a36e9fa4 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
>   	HCTX_STATE_NAME(STOPPED),
>   	HCTX_STATE_NAME(TAG_ACTIVE),
>   	HCTX_STATE_NAME(SCHED_RESTART),
> +	HCTX_STATE_NAME(INACTIVE),
>   };
>   #undef HCTX_STATE_NAME
>   
> @@ -239,6 +240,7 @@ static const char *const hctx_flag_name[] = {
>   	HCTX_FLAG_NAME(TAG_SHARED),
>   	HCTX_FLAG_NAME(BLOCKING),
>   	HCTX_FLAG_NAME(NO_SCHED),
> +	HCTX_FLAG_NAME(STACKING),
>   };
>   #undef HCTX_FLAG_NAME
>   
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index c27c6dfc7d36e..3925d1e55bc8f 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   	sbitmap_finish_wait(bt, ws, &wait);
>   
>   found_tag:
> +	/*
> +	 * Give up this allocation if the hctx is inactive.  The caller will
> +	 * retry on an active hctx.
> +	 */
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> +		return -1;
> +	}
>   	return tag + tag_offset;
>   }
>   
I really wish we could have a dedicated NO_TAG value; returning
-1 for an unsigned int always feels dodgy.
And we could kill all the various internal definitions in drivers/scsi ...

Hmm?
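
For illustration only, a minimal sketch of such a sentinel; the
BLK_MQ_NO_TAG name is made up here and is not something this series
defines:

#define BLK_MQ_NO_TAG	(~0U)	/* hypothetical name; same bits as (unsigned int)-1 */

	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
		return BLK_MQ_NO_TAG;	/* named failure instead of a bare -1 */
	}
	return tag + tag_offset;

Callers would then test against BLK_MQ_NO_TAG instead of BLK_MQ_TAG_FAIL,
and the duplicated "no tag" definitions elsewhere could eventually go away.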

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 42aee2978464b..672c7e3f61243 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -375,14 +375,32 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
>   			e->type->ops.limit_depth(data->cmd_flags, data);
>   	}
>   
> +retry:
>   	data->ctx = blk_mq_get_ctx(q);
>   	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
>   	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
>   		blk_mq_tag_busy(data->hctx);
>   
> +	/*
> +	 * Waiting allocations only fail because of an inactive hctx.  In that
> +	 * case just retry the hctx assignment and tag allocation as CPU hotplug
> +	 * should have migrated us to an online CPU by now.
> +	 */
>   	tag = blk_mq_get_tag(data);
> -	if (tag == BLK_MQ_TAG_FAIL)
> -		return NULL;
> +	if (tag == BLK_MQ_TAG_FAIL) {
> +		if (data->flags & BLK_MQ_REQ_NOWAIT)
> +			return NULL;
> +
> +		/*
> +		 * All kthreads that can perform I/O should have been moved off
> +		 * this CPU by the time the the CPU hotplug statemachine has
> +		 * shut down a hctx.  But better be sure with an extra sanity
> +		 * check.
> +		 */
> +		if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
> +			return NULL;
> +		goto retry;
> +	}
>   	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
>   }
>   
> @@ -2324,6 +2342,86 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>   	return -ENOMEM;
>   }
>   
> +struct rq_iter_data {
> +	struct blk_mq_hw_ctx *hctx;
> +	bool has_rq;
> +};
> +
> +static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
> +{
> +	struct rq_iter_data *iter_data = data;
> +
> +	if (rq->mq_hctx != iter_data->hctx)
> +		return true;
> +	iter_data->has_rq = true;
> +	return false;
> +}
> +
> +static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct blk_mq_tags *tags = hctx->sched_tags ?
> +			hctx->sched_tags : hctx->tags;
> +	struct rq_iter_data data = {
> +		.hctx	= hctx,
> +	};
> +
> +	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> +	return data.has_rq;
> +}
> +
To my reading this would only work reliably if we run the iteration on
one of the CPUs in the corresponding cpumask for the hctx.
Yet I fail to see any provision for this, either here or in the
previous patch.
How do you guarantee that?
Or is there some implicit mechanism which I've missed?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-22  2:39           ` Ming Lei
@ 2020-05-22 14:47             ` Keith Busch
  2020-05-23  3:05               ` Ming Lei
  2020-05-23 15:19             ` Bart Van Assche
  1 sibling, 1 reply; 31+ messages in thread
From: Keith Busch @ 2020-05-22 14:47 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Christoph Hellwig, linux-block, John Garry,
	Hannes Reinecke, Thomas Gleixner

On Fri, May 22, 2020 at 10:39:23AM +0800, Ming Lei wrote:
> On Thu, May 21, 2020 at 12:15:52PM -0700, Bart Van Assche wrote:
> > On 2020-05-20 21:33, Ming Lei wrote:
> > > No.
> > > 
> > > If vector 3 is for covering hw queue 12 ~ 15, the vector shouldn't be
> > > shutdown when cpu 14 is offline.
> > > Also I am pretty sure that we don't do this way with managed IRQ. And
> > > non-managed IRQ will be migrated to other online cpus during cpu offline,
> > > so not an issue at all. See migrate_one_irq().
> > 
> > Thanks for the pointer to migrate_one_irq().
> > 
> > However, I'm not convinced the above statement is correct. My
> > understanding is that the block driver knows which interrupt vector has
> > been associated with which hardware queue but the blk-mq core not. It
> > seems to me that patch 6/6 of this series is based on the following
> > assumptions:
> > (a) That the interrupt that is associated with a hardware queue is
> >     processed by one of the CPU's in hctx->cpumask.
> > (b) That hardware queues do not share interrupt vectors.
> > 
> > I don't think that either assumption is correct.
> 
> What the patch tries to do is just:
> 
> - when the last cpu in hctx->cpumask is about to go offline, mark
> this hctx as inactive, then drain any in-flight IO requests originating
> from this hctx
> 
> The correctness argument is that once we stop producing requests, we can
> drain any in-flight requests before shutting down the last cpu of the
> hctx, and the hctx finally becomes completely quiesced. Do you think this
> way is wrong?
> If yes, please prove it.

I don't think this applies to what Bart is saying, but there is a
pathological case where things break down: if a driver uses managed
IRQs but doesn't use the same affinity for its hctxs, an offline cpu
may have been the only one providing irq handling for an online hctx.

I feel like that should be a driver bug if it were to set itself up that
way, but I don't find anything that enforces that.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-22 14:47             ` Keith Busch
@ 2020-05-23  3:05               ` Ming Lei
  0 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2020-05-23  3:05 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, Christoph Hellwig, linux-block, John Garry,
	Hannes Reinecke, Thomas Gleixner

On Fri, May 22, 2020 at 07:47:20AM -0700, Keith Busch wrote:
> On Fri, May 22, 2020 at 10:39:23AM +0800, Ming Lei wrote:
> > On Thu, May 21, 2020 at 12:15:52PM -0700, Bart Van Assche wrote:
> > > On 2020-05-20 21:33, Ming Lei wrote:
> > > > No.
> > > > 
> > > > If vector 3 is for covering hw queue 12 ~ 15, the vector shouldn't be
> > > > shutdown when cpu 14 is offline.
> > > > Also I am pretty sure that we don't do this way with managed IRQ. And
> > > > non-managed IRQ will be migrated to other online cpus during cpu offline,
> > > > so not an issue at all. See migrate_one_irq().
> > > 
> > > Thanks for the pointer to migrate_one_irq().
> > > 
> > > However, I'm not convinced the above statement is correct. My
> > > understanding is that the block driver knows which interrupt vector has
> > > been associated with which hardware queue but the blk-mq core not. It
> > > seems to me that patch 6/6 of this series is based on the following
> > > assumptions:
> > > (a) That the interrupt that is associated with a hardware queue is
> > >     processed by one of the CPU's in hctx->cpumask.
> > > (b) That hardware queues do not share interrupt vectors.
> > > 
> > > I don't think that either assumption is correct.
> > 
> > What the patch tries to do is just:
> > 
> > - when the last cpu in hctx->cpumask is about to go offline, mark
> > this hctx as inactive, then drain any in-flight IO requests originating
> > from this hctx
> > 
> > The correctness argument is that once we stop producing requests, we can
> > drain any in-flight requests before shutting down the last cpu of the
> > hctx, and the hctx finally becomes completely quiesced. Do you think this
> > way is wrong?
> > If yes, please prove it.
> 
> I don't think this applies to what Bart is saying, but there is a
> pathological case where things break down: if a driver uses managed
> irq's, but doesn't use the same affinity for the hctx's, an offline cpu
> may have been the only one providing irq handling for an online hctx.

The driver needs to keep the managed interrupt alive while this hctx is
active, and blk-mq has no knowledge of managed interrupts or their
affinity.

Such non-standard managed-irq usage won't be fixed by this patchset, and
it isn't blk-mq's responsibility to cover it.

Not to mention, Bart didn't share any such example.

> 
> I feel like that should be a driver bug if it were to set itself up that
> way, but I don't find anything enforces that.

Right, that is a driver issue. Only the driver has knowledge of the
interrupt and its affinity, so such enforcement shouldn't be done in
blk-mq.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-22  2:39           ` Ming Lei
  2020-05-22 14:47             ` Keith Busch
@ 2020-05-23 15:19             ` Bart Van Assche
  2020-05-25  4:09               ` Ming Lei
  1 sibling, 1 reply; 31+ messages in thread
From: Bart Van Assche @ 2020-05-23 15:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On 2020-05-21 19:39, Ming Lei wrote:
> You may argue that two hw queue may share single managed interrupt, that
> is possible if driver plays the trick. But if driver plays the trick in
> this way, it is driver's responsibility to guarantee that the managed
> irq won't be shutdown if either of the two hctxs are active, such as,
> making sure that hctx->cpumask + hctx->cpumask <= this managed interrupt's affinity.
> It is definitely one strange enough case, and this patch doesn't
> suppose to cover this strange case. But, this patch won't break this
> case. Also just be curious, do you have such in-tree case? and are you
> sure the driver uses managed interrupt?

I'm concerned about the block drivers that use RDMA (NVMeOF, SRP, iSER,
...). The functions that accept an interrupt vector argument
(comp_vector), namely ib_alloc_cq() and ib_create_cq(), can be used in
such a way that completion interrupts are handled on another CPU than
those in hctx->cpumask.
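
For illustration, a rough sketch of how an RDMA ULP might pick a completion
vector per queue (the helper name and the round-robin policy are made up;
ib_alloc_cq(), IB_POLL_SOFTIRQ and num_comp_vectors are RDMA core symbols
of this era, and error handling is omitted):

/*
 * Nothing here ties comp_vector to any blk-mq hctx->cpumask, which is
 * exactly the concern raised above.
 */
static struct ib_cq *ulp_alloc_queue_cq(struct ib_device *dev, int queue_idx,
					int nr_cqe)
{
	int comp_vector = queue_idx % dev->num_comp_vectors;

	return ib_alloc_cq(dev, NULL, nr_cqe, comp_vector, IB_POLL_SOFTIRQ);
}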

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-23 15:19             ` Bart Van Assche
@ 2020-05-25  4:09               ` Ming Lei
  2020-05-25 15:32                 ` Bart Van Assche
  0 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2020-05-25  4:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On Sat, May 23, 2020 at 08:19:58AM -0700, Bart Van Assche wrote:
> On 2020-05-21 19:39, Ming Lei wrote:
> > You may argue that two hw queue may share single managed interrupt, that
> > is possible if driver plays the trick. But if driver plays the trick in
> > this way, it is driver's responsibility to guarantee that the managed
> > irq won't be shutdown if either of the two hctxs are active, such as,
> > making sure that hctx->cpumask + hctx->cpumask <= this managed interrupt's affinity.
> > It is definitely one strange enough case, and this patch doesn't
> > suppose to cover this strange case. But, this patch won't break this
> > case. Also just be curious, do you have such in-tree case? and are you
> > sure the driver uses managed interrupt?
> 
> I'm concerned about the block drivers that use RDMA (NVMeOF, SRP, iSER,
> ...). The functions that accept an interrupt vector argument
> (comp_vector), namely ib_alloc_cq() and ib_create_cq(), can be used in

PCI_IRQ_AFFINITY isn't used in RDMA driver, so RDMA NIC uses non-managed
irq.

> such a way that completion interrupts are handled on another CPU than
> those in hctx->cpumask.

As I explained, this patchset doesn't rely on interrupts being fired on
a CPU in hctx->cpumask; it only changes the I/O submission path, which
is generic blk-mq code and doesn't depend on any driver-specific
behavior.

Please let us know if your concerns are addressed.



Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-22  9:25   ` Hannes Reinecke
@ 2020-05-25  9:20     ` Ming Lei
  0 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2020-05-25  9:20 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, linux-block, John Garry, Bart Van Assche,
	Hannes Reinecke, Thomas Gleixner

On Fri, May 22, 2020 at 11:25:42AM +0200, Hannes Reinecke wrote:
> On 5/20/20 7:06 PM, Christoph Hellwig wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> > 
> > Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
> > up queue mapping. Thomas mentioned the following point[1]:
> > 
> > "That was the constraint of managed interrupts from the very beginning:
> > 
> >   The driver/subsystem has to quiesce the interrupt line and the associated
> >   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >   until it's restarted by the core when the CPU is plugged in again."
> > 
> > However, current blk-mq implementation doesn't quiesce hw queue before
> > the last CPU in the hctx is shutdown.  Even worse, CPUHP_BLK_MQ_DEAD is a
> > cpuhp state handled after the CPU is down, so there isn't any chance to
> > quiesce the hctx before shutting down the CPU.
> > 
> > Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
> > where the last CPU goes away, and wait for completion of in-flight
> > requests.  This guarantees that there is no inflight I/O before shutting
> > down the managed IRQ.
> > 
> > Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
> > to wait for completion of in-flight requests from these drivers to avoid
> > a potential dead-lock. It is safe to do this for stacking drivers as those
> > do not use interrupts at all and their I/O completions are triggered by
> > underlying devices I/O completion.
> > 
> > [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > [hch: different retry mechanism, merged two patches, minor cleanups]
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >   block/blk-mq-debugfs.c     |   2 +
> >   block/blk-mq-tag.c         |   8 +++
> >   block/blk-mq.c             | 114 ++++++++++++++++++++++++++++++++++++-
> >   drivers/block/loop.c       |   2 +-
> >   drivers/md/dm-rq.c         |   2 +-
> >   include/linux/blk-mq.h     |  10 ++++
> >   include/linux/cpuhotplug.h |   1 +
> >   7 files changed, 135 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 96b7a35c898a7..15df3a36e9fa4 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
> >   	HCTX_STATE_NAME(STOPPED),
> >   	HCTX_STATE_NAME(TAG_ACTIVE),
> >   	HCTX_STATE_NAME(SCHED_RESTART),
> > +	HCTX_STATE_NAME(INACTIVE),
> >   };
> >   #undef HCTX_STATE_NAME
> > @@ -239,6 +240,7 @@ static const char *const hctx_flag_name[] = {
> >   	HCTX_FLAG_NAME(TAG_SHARED),
> >   	HCTX_FLAG_NAME(BLOCKING),
> >   	HCTX_FLAG_NAME(NO_SCHED),
> > +	HCTX_FLAG_NAME(STACKING),
> >   };
> >   #undef HCTX_FLAG_NAME
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index c27c6dfc7d36e..3925d1e55bc8f 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
> >   	sbitmap_finish_wait(bt, ws, &wait);
> >   found_tag:
> > +	/*
> > +	 * Give up this allocation if the hctx is inactive.  The caller will
> > +	 * retry on an active hctx.
> > +	 */
> > +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> > +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> > +		return -1;
> > +	}
> >   	return tag + tag_offset;
> >   }
> I really wish we could have a dedicated NO_TAG value; returning
> -1 for an unsigned int always feels dodgy.
> And we could kill all the various internal definitions in drivers/scsi ...
> 
> Hmm?

Good catch.

> 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 42aee2978464b..672c7e3f61243 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -375,14 +375,32 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
> >   			e->type->ops.limit_depth(data->cmd_flags, data);
> >   	}
> > +retry:
> >   	data->ctx = blk_mq_get_ctx(q);
> >   	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
> >   	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
> >   		blk_mq_tag_busy(data->hctx);
> > +	/*
> > +	 * Waiting allocations only fail because of an inactive hctx.  In that
> > +	 * case just retry the hctx assignment and tag allocation as CPU hotplug
> > +	 * should have migrated us to an online CPU by now.
> > +	 */
> >   	tag = blk_mq_get_tag(data);
> > -	if (tag == BLK_MQ_TAG_FAIL)
> > -		return NULL;
> > +	if (tag == BLK_MQ_TAG_FAIL) {
> > +		if (data->flags & BLK_MQ_REQ_NOWAIT)
> > +			return NULL;
> > +
> > +		/*
> > +		 * All kthreads that can perform I/O should have been moved off
> > +		 * this CPU by the time the the CPU hotplug statemachine has
> > +		 * shut down a hctx.  But better be sure with an extra sanity
> > +		 * check.
> > +		 */
> > +		if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
> > +			return NULL;
> > +		goto retry;
> > +	}
> >   	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
> >   }
> > @@ -2324,6 +2342,86 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >   	return -ENOMEM;
> >   }
> > +struct rq_iter_data {
> > +	struct blk_mq_hw_ctx *hctx;
> > +	bool has_rq;
> > +};
> > +
> > +static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
> > +{
> > +	struct rq_iter_data *iter_data = data;
> > +
> > +	if (rq->mq_hctx != iter_data->hctx)
> > +		return true;
> > +	iter_data->has_rq = true;
> > +	return false;
> > +}
> > +
> > +static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	struct blk_mq_tags *tags = hctx->sched_tags ?
> > +			hctx->sched_tags : hctx->tags;
> > +	struct rq_iter_data data = {
> > +		.hctx	= hctx,
> > +	};
> > +
> > +	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> > +	return data.has_rq;
> > +}
> > +
> To my reading this would only work reliably if we run the iteration on one
> of the CPUs in the corresponding cpumask for the hctx.
> Yet I fail to see any provision for this, either here or in the previous
> patch.
> How do you guarantee that?
> Or is there some implicit mechanism which I've missed?

Yeah, it is guaranteed by memory barriers.

[1] tag allocation path

sbitmap_get()	/* smp_mb() is implied */
test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)

[2] cpu going offline

set_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)
smp_mb__after_atomic();
blk_mq_hctx_has_requests()
	blk_mq_all_tag_iter
		bt_tags_for_each
			sbitmap_for_each_set

So setting the tag bit and reading BLK_MQ_S_INACTIVE are ordered, and
setting BLK_MQ_S_INACTIVE and reading the tag bit are ordered too.

Then either the request allocation is retried on another online CPU, or
the request is allocated & drained before the cpu goes offline.
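
To see the ordering argument in isolation, here is a small userspace sketch
of the same store/barrier/load pattern; C11 atomics stand in for the kernel
primitives, and this is of course not the blk-mq code itself:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int tag_allocated;	/* stands in for the tag bit */
static atomic_int hctx_inactive;	/* stands in for BLK_MQ_S_INACTIVE */

static void *alloc_path(void *arg)	/* [1] tag allocation path */
{
	(void)arg;
	atomic_store(&tag_allocated, 1);		/* sbitmap_get() */
	atomic_thread_fence(memory_order_seq_cst);	/* implied smp_mb() */
	if (atomic_load(&hctx_inactive))
		printf("alloc path sees INACTIVE: retry on an online CPU\n");
	return NULL;
}

static void *offline_path(void *arg)	/* [2] cpu going offline */
{
	(void)arg;
	atomic_store(&hctx_inactive, 1);		/* set_bit() */
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb__after_atomic() */
	if (atomic_load(&tag_allocated))
		printf("offline path sees the tag: drain the request\n");
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, alloc_path, NULL);
	pthread_create(&b, NULL, offline_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* The barriers guarantee that at least one message prints. */
	return 0;
}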


Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-25  4:09               ` Ming Lei
@ 2020-05-25 15:32                 ` Bart Van Assche
  2020-05-25 16:38                   ` Keith Busch
  2020-05-26  0:37                   ` Ming Lei
  0 siblings, 2 replies; 31+ messages in thread
From: Bart Van Assche @ 2020-05-25 15:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On 2020-05-24 21:09, Ming Lei wrote:
> On Sat, May 23, 2020 at 08:19:58AM -0700, Bart Van Assche wrote:
>> On 2020-05-21 19:39, Ming Lei wrote:
>>> You may argue that two hw queue may share single managed interrupt, that
>>> is possible if driver plays the trick. But if driver plays the trick in
>>> this way, it is driver's responsibility to guarantee that the managed
>>> irq won't be shutdown if either of the two hctxs are active, such as,
>>> making sure that hctx->cpumask + hctx->cpumask <= this managed interrupt's affinity.
>>> It is definitely one strange enough case, and this patch doesn't
>>> suppose to cover this strange case. But, this patch won't break this
>>> case. Also just be curious, do you have such in-tree case? and are you
>>> sure the driver uses managed interrupt?
>>
>> I'm concerned about the block drivers that use RDMA (NVMeOF, SRP, iSER,
>> ...). The functions that accept an interrupt vector argument
>> (comp_vector), namely ib_alloc_cq() and ib_create_cq(), can be used in
> 
> PCI_IRQ_AFFINITY isn't used in RDMA driver, so RDMA NIC uses non-managed
> irq.

Something doesn't add up ...

On a system with eight CPU cores and a ConnectX-3 RDMA adapter (mlx4
driver; uses request_irq()) I ran the following test:
* Query the affinity of all mlx4 edge interrupts (mlx4-1@0000:01:00.0 ..
mlx4-16@0000:01:00.0).
* Offline CPUs 6 and 7.
* Query CPU affinity again.

As one can see below the affinity of the mlx4 interrupts was modified.
Does this mean that the interrupt core manages more than only interrupts
registered with PCI_IRQ_AFFINITY?

All CPU's online:

55:04
56:80
57:40
58:40
59:20
60:10
61:08
62:02
63:02
64:01
65:20
66:20
67:10
68:10
69:40
70:08

CPUs 6 and 7 offline:

55:04
56:02
57:08
58:02
59:20
60:10
61:08
62:02
63:02
64:01
65:20
66:20
67:10
68:10
69:04
70:08

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-25 15:32                 ` Bart Van Assche
@ 2020-05-25 16:38                   ` Keith Busch
  2020-05-26  0:37                   ` Ming Lei
  1 sibling, 0 replies; 31+ messages in thread
From: Keith Busch @ 2020-05-25 16:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Christoph Hellwig, linux-block, John Garry,
	Hannes Reinecke, Thomas Gleixner

On Mon, May 25, 2020 at 08:32:44AM -0700, Bart Van Assche wrote:
> On 2020-05-24 21:09, Ming Lei wrote:
> > On Sat, May 23, 2020 at 08:19:58AM -0700, Bart Van Assche wrote:
> >> On 2020-05-21 19:39, Ming Lei wrote:
> >>> You may argue that two hw queue may share single managed interrupt, that
> >>> is possible if driver plays the trick. But if driver plays the trick in
> >>> this way, it is driver's responsibility to guarantee that the managed
> >>> irq won't be shutdown if either of the two hctxs are active, such as,
> >>> making sure that hctx->cpumask + hctx->cpumask <= this managed interrupt's affinity.
> >>> It is definitely one strange enough case, and this patch doesn't
> >>> suppose to cover this strange case. But, this patch won't break this
> >>> case. Also just be curious, do you have such in-tree case? and are you
> >>> sure the driver uses managed interrupt?
> >>
> >> I'm concerned about the block drivers that use RDMA (NVMeOF, SRP, iSER,
> >> ...). The functions that accept an interrupt vector argument
> >> (comp_vector), namely ib_alloc_cq() and ib_create_cq(), can be used in
> > 
> > PCI_IRQ_AFFINITY isn't used in RDMA driver, so RDMA NIC uses non-managed
> > irq.
> 
> Something doesn't add up ...
> 
> On a system with eight CPU cores and a ConnectX-3 RDMA adapter (mlx4
> driver; uses request_irq()) I ran the following test:
> * Query the affinity of all mlx4 edge interrupts (mlx4-1@0000:01:00.0 ..
> mlx4-16@0000:01:00.0).
> * Offline CPUs 6 and 7.
> * Query CPU affinity again.
> 
> As one can see below the affinity of the mlx4 interrupts was modified.
> Does this mean that the interrupt core manages more than only interrupts
> registered with PCI_IRQ_AFFINITY?

Interrupts registered with PCI_IRQ_AFFINITY are assigned their cpu affinity mask
at creation time, and that mask never changes for the lifetime of the interrupt.

Interrupts not registered with that option can have their affinity modified in
various paths.
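
For reference, a hedged sketch of what that difference looks like on the
driver side (the function name and vector counts are illustrative and error
handling is omitted; pci_alloc_irq_vectors{,_affinity}() and struct
irq_affinity are the real APIs):

/* Illustrative only: managed vs. non-managed vector allocation. */
static int example_setup_irqs(struct pci_dev *pdev, bool managed)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. one admin/config vector */
	};

	if (managed)
		/* Affinity is spread and pinned at allocation time. */
		return pci_alloc_irq_vectors_affinity(pdev, 2,
				num_possible_cpus() + 1,
				PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);

	/* Non-managed: affinity may be changed later from user space or
	 * by irq migration during cpu hotplug. */
	return pci_alloc_irq_vectors(pdev, 2, num_possible_cpus() + 1,
				     PCI_IRQ_MSIX);
}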

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v3
  2020-05-25 15:32                 ` Bart Van Assche
  2020-05-25 16:38                   ` Keith Busch
@ 2020-05-26  0:37                   ` Ming Lei
  1 sibling, 0 replies; 31+ messages in thread
From: Ming Lei @ 2020-05-26  0:37 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On Mon, May 25, 2020 at 08:32:44AM -0700, Bart Van Assche wrote:
> On 2020-05-24 21:09, Ming Lei wrote:
> > On Sat, May 23, 2020 at 08:19:58AM -0700, Bart Van Assche wrote:
> >> On 2020-05-21 19:39, Ming Lei wrote:
> >>> You may argue that two hw queue may share single managed interrupt, that
> >>> is possible if driver plays the trick. But if driver plays the trick in
> >>> this way, it is driver's responsibility to guarantee that the managed
> >>> irq won't be shutdown if either of the two hctxs are active, such as,
> >>> making sure that hctx->cpumask + hctx->cpumask <= this managed interrupt's affinity.
> >>> It is definitely one strange enough case, and this patch doesn't
> >>> suppose to cover this strange case. But, this patch won't break this
> >>> case. Also just be curious, do you have such in-tree case? and are you
> >>> sure the driver uses managed interrupt?
> >>
> >> I'm concerned about the block drivers that use RDMA (NVMeOF, SRP, iSER,
> >> ...). The functions that accept an interrupt vector argument
> >> (comp_vector), namely ib_alloc_cq() and ib_create_cq(), can be used in
> > 
> > PCI_IRQ_AFFINITY isn't used in RDMA driver, so RDMA NIC uses non-managed
> > irq.
> 
> Something doesn't add up ...
> 
> On a system with eight CPU cores and a ConnectX-3 RDMA adapter (mlx4
> driver; uses request_irq()) I ran the following test:
> * Query the affinity of all mlx4 edge interrupts (mlx4-1@0000:01:00.0 ..
> mlx4-16@0000:01:00.0).
> * Offline CPUs 6 and 7.
> * Query CPU affinity again.
> 
> As one can see below the affinity of the mlx4 interrupts was modified.
> Does this mean that the interrupt core manages more than only interrupts
> registered with PCI_IRQ_AFFINITY?
> 
> All CPU's online:
> 
> 55:04
> 56:80
> 57:40
> 58:40
> 59:20
> 60:10
> 61:08
> 62:02
> 63:02
> 64:01
> 65:20
> 66:20
> 67:10
> 68:10
> 69:40
> 70:08
> 
> CPUs 6 and 7 offline:
> 
> 55:04
> 56:02
> 57:08
> 58:02
> 59:20
> 60:10
> 61:08
> 62:02
> 63:02
> 64:01
> 65:20
> 66:20
> 67:10
> 68:10
> 69:04
> 70:08

These are non-managed interrupts, and their affinity is changed during
cpu online/offline by the irq migration code; I believe I have shared
that function with you before.

The issue to be addressed is for managed interrupts only, which are shut
down during cpu offline; that is why we have to make sure that there is
no in-flight io request. As Keith mentioned, their affinity is assigned
at creation time and never changes afterwards.

The suggested approach fixes the issue for managed interrupts, and it is
harmless for non-managed interrupts.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter
  2020-05-20 20:24   ` Bart Van Assche
@ 2020-05-27  6:05     ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2020-05-27  6:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

On Wed, May 20, 2020 at 01:24:02PM -0700, Bart Van Assche wrote:
> Adding /*reserved=*/ and /*iterate_all=*/ comments in front of the
> boolean arguments would make this code easier to read.

Actually, flag values would be much more understandable and would allow
consolidating more code.  I've switched the implementation over to that.
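
As a rough sketch of the flags approach (names below are illustrative and
not necessarily what ended up being merged):

/* Illustrative iteration flags replacing the two bool arguments. */
#define BT_TAG_ITER_RESERVED	(1 << 0)	/* walk the reserved tags */
#define BT_TAG_ITER_STARTED	(1 << 1)	/* only started requests */

struct bt_tags_iter_data {
	struct blk_mq_tags *tags;
	busy_tag_iter_fn *fn;
	void *data;
	unsigned int flags;	/* replaces the reserved/iterate_all bools */
};

/*
 * Call sites then document themselves, e.g.
 *	__blk_mq_all_tag_iter(tags, fn, priv, BT_TAG_ITER_STARTED);
 * and a new behaviour only needs a new bit, not another bool parameter.
 */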

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2020-05-27  6:05 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
2020-05-20 17:06 ` [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
2020-05-20 18:16   ` Bart Van Assche
2020-05-22  9:11   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
2020-05-20 18:22   ` Bart Van Assche
2020-05-22  9:13   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 3/6] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
2020-05-20 20:10   ` Bart Van Assche
2020-05-20 17:06 ` [PATCH 4/6] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
2020-05-22  9:17   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
2020-05-20 20:24   ` Bart Van Assche
2020-05-27  6:05     ` Christoph Hellwig
2020-05-22  9:18   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
2020-05-22  9:25   ` Hannes Reinecke
2020-05-25  9:20     ` Ming Lei
2020-05-20 21:46 ` blk-mq: improvement CPU hotplug (simplified version) v3 Bart Van Assche
2020-05-21  2:57   ` Ming Lei
2020-05-21  3:50     ` Bart Van Assche
2020-05-21  4:33       ` Ming Lei
2020-05-21 19:15         ` Bart Van Assche
2020-05-22  2:39           ` Ming Lei
2020-05-22 14:47             ` Keith Busch
2020-05-23  3:05               ` Ming Lei
2020-05-23 15:19             ` Bart Van Assche
2020-05-25  4:09               ` Ming Lei
2020-05-25 15:32                 ` Bart Van Assche
2020-05-25 16:38                   ` Keith Busch
2020-05-26  0:37                   ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).