linux-block.vger.kernel.org archive mirror
* blk-mq: improvement CPU hotplug (simplified version) v4
@ 2020-05-27 18:06 Christoph Hellwig
  2020-05-27 18:06 ` [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
                   ` (9 more replies)
  0 siblings, 10 replies; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

Hi all,

this series ensures that I/O is quiesced before a CPU, and thus its
managed interrupt handler, is shut down.

This patchset addresses the issue with the following approach:

 - before the last CPU in hctx->cpumask goes offline, mark this hctx as
   inactive

 - disable preemption while allocating a tag for a request; after the tag is
   allocated, check whether the hctx has become inactive.  If so, give up
   the allocation and retry it on a CPU that is still online

 - before the hctx becomes inactive, drain all requests already allocated
   on it (a condensed sketch of this teardown path follows this list)
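
A condensed sketch of that teardown path, as implemented by the last patch
in the series (simplified: the check that the CPU going away really is the
last online CPU in hctx->cpumask is elided, see patch 8 for the real code):

static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	/* stop new tag allocations on this hctx ... */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb__after_atomic();

	/* ... then wait until every already allocated request has completed */
	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
		while (blk_mq_hctx_has_requests(hctx))
			msleep(5);
		percpu_ref_put(&hctx->queue->q_usage_counter);
	}
	return 0;
}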

The guts of the changes are from Ming Lei; I just did a bunch of preparatory
cleanups so that they fit in more nicely.  The series also depends on my
"avoid a few q_usage_counter roundtrips v3" series.

Thanks to John Garry for running lots of tests on arm64 with previous
versions of these patches and for helping to investigate all kinds of issues.

A git tree is available here:

    git://git.infradead.org/users/hch/block.git blk-mq-hotplug.3

Gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-mq-hotplug.3

Changes since v3:
  - add two new patches to clean up the magic -1 tag values
  - improve a few commit messages and comments
  - clean up the blk_mq_all_tag_iter implementation
  - add an msleep to the CPU hot-unplug case in __blk_mq_alloc_request

Changes since v2:
  - don't disable preemption and use smp calls


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
@ 2020-05-27 18:06 ` Christoph Hellwig
  2020-05-27 18:16   ` Johannes Thumshirn
  2020-05-27 18:06 ` [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei, Hannes Reinecke, Daniel Wagner

None of the I/O schedulers actually needs it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/bfq-iosched.c      | 2 +-
 block/blk-mq.c           | 2 +-
 block/kyber-iosched.c    | 2 +-
 block/mq-deadline.c      | 2 +-
 include/linux/elevator.h | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 3d411716d7ee4..50c8f034c01c5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6073,7 +6073,7 @@ static struct bfq_queue *bfq_get_bfqq_handle_split(struct bfq_data *bfqd,
  * comments on bfq_init_rq for the reason behind this delayed
  * preparation.
  */
-static void bfq_prepare_request(struct request *rq, struct bio *bio)
+static void bfq_prepare_request(struct request *rq)
 {
 	/*
 	 * Regardless of whether we have an icq attached, we have to
diff --git a/block/blk-mq.c b/block/blk-mq.c
index cac11945f6023..850e3642efc40 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -387,7 +387,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 			if (e->type->icq_cache)
 				blk_mq_sched_assign_ioc(rq);
 
-			e->type->ops.prepare_request(rq, bio);
+			e->type->ops.prepare_request(rq);
 			rq->rq_flags |= RQF_ELVPRIV;
 		}
 	}
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 34dcea0ef6377..a38c5ab103d12 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -579,7 +579,7 @@ static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
 	return merged;
 }
 
-static void kyber_prepare_request(struct request *rq, struct bio *bio)
+static void kyber_prepare_request(struct request *rq)
 {
 	rq_set_domain_token(rq, -1);
 }
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b490f47fd553c..b57470e154c8f 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -541,7 +541,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
  * Nothing to do here. This is defined only to ensure that .finish_request
  * method is called upon request completion.
  */
-static void dd_prepare_request(struct request *rq, struct bio *bio)
+static void dd_prepare_request(struct request *rq)
 {
 }
 
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 901bda352dcb7..bacc40a0bdf39 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -39,7 +39,7 @@ struct elevator_mq_ops {
 	void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);
 	void (*requests_merged)(struct request_queue *, struct request *, struct request *);
 	void (*limit_depth)(unsigned int, struct blk_mq_alloc_data *);
-	void (*prepare_request)(struct request *, struct bio *bio);
+	void (*prepare_request)(struct request *);
 	void (*finish_request)(struct request *);
 	void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);
 	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
  2020-05-27 18:06 ` [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
@ 2020-05-27 18:06 ` Christoph Hellwig
  2020-05-27 18:17   ` Johannes Thumshirn
  2020-05-27 18:06 ` [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei, Hannes Reinecke, Daniel Wagner

The bio argument is entirely unused, and the request_queue can be passed
through the alloc_data, given that it needs to be filled out for the
low-level tag allocation anyway.  Also rename the function to
__blk_mq_alloc_request as the switch between get and alloc in the call
chains is rather confusing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq.c | 36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 850e3642efc40..2250e6397559b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -332,10 +332,9 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	return rq;
 }
 
-static struct request *blk_mq_get_request(struct request_queue *q,
-					  struct bio *bio,
-					  struct blk_mq_alloc_data *data)
+static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 {
+	struct request_queue *q = data->q;
 	struct elevator_queue *e = q->elevator;
 	struct request *rq;
 	unsigned int tag;
@@ -346,7 +345,6 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	if (blk_queue_rq_alloc_time(q))
 		alloc_time_ns = ktime_get_ns();
 
-	data->q = q;
 	if (likely(!data->ctx)) {
 		data->ctx = blk_mq_get_ctx(q);
 		clear_ctx_on_error = true;
@@ -398,7 +396,11 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 		blk_mq_req_flags_t flags)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
+	struct blk_mq_alloc_data data = {
+		.q		= q,
+		.flags		= flags,
+		.cmd_flags	= op,
+	};
 	struct request *rq;
 	int ret;
 
@@ -406,7 +408,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 	if (ret)
 		return ERR_PTR(ret);
 
-	rq = blk_mq_get_request(q, NULL, &alloc_data);
+	rq = __blk_mq_alloc_request(&data);
 	if (!rq)
 		goto out_queue_exit;
 	rq->__data_len = 0;
@@ -422,7 +424,11 @@ EXPORT_SYMBOL(blk_mq_alloc_request);
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
-	struct blk_mq_alloc_data alloc_data = { .flags = flags, .cmd_flags = op };
+	struct blk_mq_alloc_data data = {
+		.q		= q,
+		.flags		= flags,
+		.cmd_flags	= op,
+	};
 	struct request *rq;
 	unsigned int cpu;
 	int ret;
@@ -448,14 +454,14 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	 * If not tell the caller that it should skip this queue.
 	 */
 	ret = -EXDEV;
-	alloc_data.hctx = q->queue_hw_ctx[hctx_idx];
-	if (!blk_mq_hw_queue_mapped(alloc_data.hctx))
+	data.hctx = q->queue_hw_ctx[hctx_idx];
+	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
-	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
-	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
+	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	ret = -EWOULDBLOCK;
-	rq = blk_mq_get_request(q, NULL, &alloc_data);
+	rq = __blk_mq_alloc_request(&data);
 	if (!rq)
 		goto out_queue_exit;
 	return rq;
@@ -2016,7 +2022,9 @@ blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = op_is_sync(bio->bi_opf);
 	const int is_flush_fua = op_is_flush(bio->bi_opf);
-	struct blk_mq_alloc_data data = { .flags = 0};
+	struct blk_mq_alloc_data data = {
+		.q		= q,
+	};
 	struct request *rq;
 	struct blk_plug *plug;
 	struct request *same_queue_rq = NULL;
@@ -2040,7 +2048,7 @@ blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	rq_qos_throttle(q, bio);
 
 	data.cmd_flags = bio->bi_opf;
-	rq = blk_mq_get_request(q, bio, &data);
+	rq = __blk_mq_alloc_request(&data);
 	if (unlikely(!rq)) {
 		rq_qos_cleanup(q, bio);
 		if (bio->bi_opf & REQ_NOWAIT)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
  2020-05-27 18:06 ` [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
  2020-05-27 18:06 ` [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
@ 2020-05-27 18:06 ` Christoph Hellwig
  2020-05-27 18:16   ` Hannes Reinecke
  2020-05-28  9:50   ` Johannes Thumshirn
  2020-05-27 18:06 ` [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG Christoph Hellwig
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei, Daniel Wagner

Don't split request initialization between __blk_mq_alloc_request and
blk_mq_rq_ctx_init.  Also remove the op argument as it can be derived
from the blk_mq_alloc_data structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
---
 block/blk-mq.c | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2250e6397559b..1ffbc5d9e7cfe 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -271,7 +271,7 @@ static inline bool blk_mq_need_time_stamp(struct request *rq)
 }
 
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
-		unsigned int tag, unsigned int op, u64 alloc_time_ns)
+		unsigned int tag, u64 alloc_time_ns)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 	struct request *rq = tags->static_rqs[tag];
@@ -295,7 +295,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->mq_ctx = data->ctx;
 	rq->mq_hctx = data->hctx;
 	rq->rq_flags = rq_flags;
-	rq->cmd_flags = op;
+	rq->cmd_flags = data->cmd_flags;
 	if (data->flags & BLK_MQ_REQ_PREEMPT)
 		rq->rq_flags |= RQF_PREEMPT;
 	if (blk_queue_io_stat(data->q))
@@ -327,8 +327,23 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
 
-	data->ctx->rq_dispatched[op_is_sync(op)]++;
+	data->ctx->rq_dispatched[op_is_sync(data->cmd_flags)]++;
 	refcount_set(&rq->ref, 1);
+
+	if (!op_is_flush(data->cmd_flags)) {
+		struct elevator_queue *e = data->q->elevator;
+
+		rq->elv.icq = NULL;
+		if (e && e->type->ops.prepare_request) {
+			if (e->type->icq_cache)
+				blk_mq_sched_assign_ioc(rq);
+
+			e->type->ops.prepare_request(rq);
+			rq->rq_flags |= RQF_ELVPRIV;
+		}
+	}
+
+	data->hctx->queued++;
 	return rq;
 }
 
@@ -336,7 +351,6 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 {
 	struct request_queue *q = data->q;
 	struct elevator_queue *e = q->elevator;
-	struct request *rq;
 	unsigned int tag;
 	bool clear_ctx_on_error = false;
 	u64 alloc_time_ns = 0;
@@ -378,19 +392,7 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 		return NULL;
 	}
 
-	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
-	if (!op_is_flush(data->cmd_flags)) {
-		rq->elv.icq = NULL;
-		if (e && e->type->ops.prepare_request) {
-			if (e->type->icq_cache)
-				blk_mq_sched_assign_ioc(rq);
-
-			e->type->ops.prepare_request(rq);
-			rq->rq_flags |= RQF_ELVPRIV;
-		}
-	}
-	data->hctx->queued++;
-	return rq;
+	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2020-05-27 18:06 ` [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
@ 2020-05-27 18:06 ` Christoph Hellwig
  2020-05-27 18:14   ` Johannes Thumshirn
                     ` (2 more replies)
  2020-05-27 18:06 ` [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places Christoph Hellwig
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

To prepare for wider use of this constant, give it a more applicable name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-tag.c | 4 ++--
 block/blk-mq-tag.h | 4 ++--
 block/blk-mq.c     | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904ab..c76ba4f90fa09 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -111,7 +111,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	if (data->flags & BLK_MQ_REQ_RESERVED) {
 		if (unlikely(!tags->nr_reserved_tags)) {
 			WARN_ON_ONCE(1);
-			return BLK_MQ_TAG_FAIL;
+			return BLK_MQ_NO_TAG;
 		}
 		bt = &tags->breserved_tags;
 		tag_offset = 0;
@@ -125,7 +125,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		goto found_tag;
 
 	if (data->flags & BLK_MQ_REQ_NOWAIT)
-		return BLK_MQ_TAG_FAIL;
+		return BLK_MQ_NO_TAG;
 
 	ws = bt_wait_ptr(bt, data->hctx);
 	do {
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 2b8321efb6820..8a741752af8b9 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -44,9 +44,9 @@ static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 }
 
 enum {
-	BLK_MQ_TAG_FAIL		= -1U,
+	BLK_MQ_NO_TAG		= -1U,
 	BLK_MQ_TAG_MIN		= 1,
-	BLK_MQ_TAG_MAX		= BLK_MQ_TAG_FAIL - 1,
+	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
 };
 
 extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1ffbc5d9e7cfe..826ff8f97489c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -386,7 +386,7 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 	}
 
 	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_TAG_FAIL) {
+	if (tag == BLK_MQ_NO_TAG) {
 		if (clear_ctx_on_error)
 			data->ctx = NULL;
 		return NULL;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2020-05-27 18:06 ` [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG Christoph Hellwig
@ 2020-05-27 18:06 ` Christoph Hellwig
  2020-05-27 18:15   ` Johannes Thumshirn
                     ` (2 more replies)
  2020-05-27 18:06 ` [PATCH 6/8] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
                   ` (4 subsequent siblings)
  9 siblings, 3 replies; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-tag.c |  8 ++++----
 block/blk-mq.c     | 14 +++++++-------
 block/blk-mq.h     |  4 ++--
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index c76ba4f90fa09..597ff9c27cf63 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -92,7 +92,7 @@ static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 {
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
 	    !hctx_may_queue(data->hctx, bt))
-		return -1;
+		return BLK_MQ_NO_TAG;
 	if (data->shallow_depth)
 		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
 	else
@@ -121,7 +121,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	}
 
 	tag = __blk_mq_get_tag(data, bt);
-	if (tag != -1)
+	if (tag != BLK_MQ_NO_TAG)
 		goto found_tag;
 
 	if (data->flags & BLK_MQ_REQ_NOWAIT)
@@ -143,13 +143,13 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 		 * as running the queue may also have found completions.
 		 */
 		tag = __blk_mq_get_tag(data, bt);
-		if (tag != -1)
+		if (tag != BLK_MQ_NO_TAG)
 			break;
 
 		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
 
 		tag = __blk_mq_get_tag(data, bt);
-		if (tag != -1)
+		if (tag != BLK_MQ_NO_TAG)
 			break;
 
 		bt_prev = bt;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 826ff8f97489c..cde2aec558ac5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -278,7 +278,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	req_flags_t rq_flags = 0;
 
 	if (data->flags & BLK_MQ_REQ_INTERNAL) {
-		rq->tag = -1;
+		rq->tag = BLK_MQ_NO_TAG;
 		rq->internal_tag = tag;
 	} else {
 		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
@@ -286,7 +286,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 			atomic_inc(&data->hctx->nr_active);
 		}
 		rq->tag = tag;
-		rq->internal_tag = -1;
+		rq->internal_tag = BLK_MQ_NO_TAG;
 		data->hctx->tags->rqs[rq->tag] = rq;
 	}
 
@@ -483,9 +483,9 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_crypto_free_request(rq);
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != -1)
+	if (rq->tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
+	if (sched_tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
@@ -534,7 +534,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 		blk_stat_add(rq, now);
 	}
 
-	if (rq->internal_tag != -1)
+	if (rq->internal_tag != BLK_MQ_NO_TAG)
 		blk_mq_sched_completed_request(rq, now);
 
 	blk_account_io_done(rq, now);
@@ -1033,7 +1033,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	};
 	bool shared;
 
-	if (rq->tag != -1)
+	if (rq->tag != BLK_MQ_NO_TAG)
 		return true;
 
 	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
@@ -1049,7 +1049,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
 		data.hctx->tags->rqs[rq->tag] = rq;
 	}
 
-	return rq->tag != -1;
+	return rq->tag != BLK_MQ_NO_TAG;
 }
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 10bfdfb494faf..a139b06318174 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -201,7 +201,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
-	rq->tag = -1;
+	rq->tag = BLK_MQ_NO_TAG;
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
@@ -211,7 +211,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 
 static inline void blk_mq_put_driver_tag(struct request *rq)
 {
-	if (rq->tag == -1 || rq->internal_tag == -1)
+	if (rq->tag == BLK_MQ_NO_TAG || rq->internal_tag == BLK_MQ_NO_TAG)
 		return;
 
 	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 6/8] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2020-05-27 18:06 ` [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places Christoph Hellwig
@ 2020-05-27 18:06 ` Christoph Hellwig
  2020-05-27 18:06 ` [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

blk_mq_alloc_request_hctx is only used for NVMeoF connect commands, so
tailor it to the specific requirements, and don't bother the general
fast path code with its special twinkles.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
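For illustration only, not part of the patch: after this change the NVMeoF
connect path is expected to call the helper roughly as sketched below; the
surrounding function, queue pointer, opcode and hctx_idx are placeholders:

	struct request *rq;

	/* never sleep for a tag; fabrics connect commands use reserved tags */
	rq = blk_mq_alloc_request_hctx(q, REQ_OP_DRV_OUT,
				       BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED,
				       hctx_idx);
	if (IS_ERR(rq))
		return PTR_ERR(rq);
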
 block/blk-mq.c | 44 +++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index cde2aec558ac5..898400452b1cf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -351,21 +351,13 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 {
 	struct request_queue *q = data->q;
 	struct elevator_queue *e = q->elevator;
-	unsigned int tag;
-	bool clear_ctx_on_error = false;
 	u64 alloc_time_ns = 0;
+	unsigned int tag;
 
 	/* alloc_time includes depth and tag waits */
 	if (blk_queue_rq_alloc_time(q))
 		alloc_time_ns = ktime_get_ns();
 
-	if (likely(!data->ctx)) {
-		data->ctx = blk_mq_get_ctx(q);
-		clear_ctx_on_error = true;
-	}
-	if (likely(!data->hctx))
-		data->hctx = blk_mq_map_queue(q, data->cmd_flags,
-						data->ctx);
 	if (data->cmd_flags & REQ_NOWAIT)
 		data->flags |= BLK_MQ_REQ_NOWAIT;
 
@@ -381,17 +373,16 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 		    e->type->ops.limit_depth &&
 		    !(data->flags & BLK_MQ_REQ_RESERVED))
 			e->type->ops.limit_depth(data->cmd_flags, data);
-	} else {
-		blk_mq_tag_busy(data->hctx);
 	}
 
+	data->ctx = blk_mq_get_ctx(q);
+	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
+	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
+		blk_mq_tag_busy(data->hctx);
+
 	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_NO_TAG) {
-		if (clear_ctx_on_error)
-			data->ctx = NULL;
+	if (tag == BLK_MQ_NO_TAG)
 		return NULL;
-	}
-
 	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
@@ -431,17 +422,22 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 		.flags		= flags,
 		.cmd_flags	= op,
 	};
-	struct request *rq;
+	u64 alloc_time_ns = 0;
 	unsigned int cpu;
+	unsigned int tag;
 	int ret;
 
+	/* alloc_time includes depth and tag waits */
+	if (blk_queue_rq_alloc_time(q))
+		alloc_time_ns = ktime_get_ns();
+
 	/*
 	 * If the tag allocator sleeps we could get an allocation for a
 	 * different hardware context.  No need to complicate the low level
 	 * allocator for this for the rare use case of a command tied to
 	 * a specific queue.
 	 */
-	if (WARN_ON_ONCE(!(flags & BLK_MQ_REQ_NOWAIT)))
+	if (WARN_ON_ONCE(!(flags & (BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED))))
 		return ERR_PTR(-EINVAL);
 
 	if (hctx_idx >= q->nr_hw_queues)
@@ -462,11 +458,17 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
+	if (q->elevator)
+		data.flags |= BLK_MQ_REQ_INTERNAL;
+	else
+		blk_mq_tag_busy(data.hctx);
+
 	ret = -EWOULDBLOCK;
-	rq = __blk_mq_alloc_request(&data);
-	if (!rq)
+	tag = blk_mq_get_tag(&data);
+	if (tag == BLK_MQ_NO_TAG)
 		goto out_queue_exit;
-	return rq;
+	return blk_mq_rq_ctx_init(&data, tag, alloc_time_ns);
+
 out_queue_exit:
 	blk_queue_exit(q);
 	return ERR_PTR(ret);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2020-05-27 18:06 ` [PATCH 6/8] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
@ 2020-05-27 18:06 ` Christoph Hellwig
  2020-05-27 18:21   ` Hannes Reinecke
  2020-05-27 22:52   ` Bart Van Assche
  2020-05-27 18:06 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
                   ` (2 subsequent siblings)
  9 siblings, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

From: Ming Lei <ming.lei@redhat.com>

Add a new blk_mq_all_tag_iter function to iterate over all allocated
scheduler tags and driver tags.  This is more flexible than the existing
blk_mq_all_tag_busy_iter function, as it allows the callers to do whatever
they want with allocated requests instead of being limited to started
requests.

It will be used later in this patchset to implement draining of the
requests allocated on a specific hctx.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: switch from the two booleans to a more readable flags field and
 consolidate the tags iter functions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
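Not part of the patch, but the consumer added in patch 8 shows how the new
iterator is meant to be used; a condensed sketch:

struct rq_iter_data {
	struct blk_mq_hw_ctx *hctx;
	bool has_rq;
};

/* returning true keeps iterating, returning false stops at the first hit */
static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
{
	struct rq_iter_data *iter_data = data;

	if (rq->mq_hctx != iter_data->hctx)
		return true;
	iter_data->has_rq = true;
	return false;
}

static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
{
	struct blk_mq_tags *tags = hctx->sched_tags ?
			hctx->sched_tags : hctx->tags;
	struct rq_iter_data data = {
		.hctx	= hctx,
	};

	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
	return data.has_rq;
}
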
 block/blk-mq-tag.c | 46 +++++++++++++++++++++++++++++-----------------
 block/blk-mq-tag.h |  2 ++
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 597ff9c27cf63..9f74064768423 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -256,14 +256,17 @@ struct bt_tags_iter_data {
 	struct blk_mq_tags *tags;
 	busy_tag_iter_fn *fn;
 	void *data;
-	bool reserved;
+	unsigned int flags;
 };
 
+#define BT_TAG_ITER_RESERVED		(1 << 0)
+#define BT_TAG_ITER_STARTED		(1 << 1)
+
 static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 {
 	struct bt_tags_iter_data *iter_data = data;
 	struct blk_mq_tags *tags = iter_data->tags;
-	bool reserved = iter_data->reserved;
+	bool reserved = iter_data->flags & BT_TAG_ITER_RESERVED;
 	struct request *rq;
 
 	if (!reserved)
@@ -274,10 +277,12 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * test and set the bit before assining ->rqs[].
 	 */
 	rq = tags->rqs[bitnr];
-	if (rq && blk_mq_request_started(rq))
-		return iter_data->fn(rq, iter_data->data, reserved);
-
-	return true;
+	if (!rq)
+		return true;
+	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
+	    !blk_mq_request_started(rq))
+		return true;
+	return iter_data->fn(rq, iter_data->data, reserved);
 }
 
 /**
@@ -290,17 +295,16 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		@reserved) where rq is a pointer to a request. Return true
  *		to continue iterating tags, false to stop.
  * @data:	Will be passed as second argument to @fn.
- * @reserved:	Indicates whether @bt is the breserved_tags member or the
- *		bitmap_tags member of struct blk_mq_tags.
+ * @flags:	BT_TAG_ITER_*
  */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
-			     busy_tag_iter_fn *fn, void *data, bool reserved)
+			     busy_tag_iter_fn *fn, void *data, unsigned int flags)
 {
 	struct bt_tags_iter_data iter_data = {
 		.tags = tags,
 		.fn = fn,
 		.data = data,
-		.reserved = reserved,
+		.flags = flags,
 	};
 
 	if (tags->rqs)
@@ -308,21 +312,28 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
 }
 
 /**
- * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
+ * blk_mq_all_tag_iter - iterate over all requests in a tag map
  * @tags:	Tag map to iterate over.
- * @fn:		Pointer to the function that will be called for each started
+ * @fn:		Pointer to the function that will be called for each
  *		request. @fn will be called as follows: @fn(rq, @priv,
  *		reserved) where rq is a pointer to a request. 'reserved'
  *		indicates whether or not @rq is a reserved request. Return
  *		true to continue iterating tags, false to stop.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
-		busy_tag_iter_fn *fn, void *priv)
+static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, void *priv, unsigned int flags)
 {
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv,
+				 flags | BT_TAG_ITER_RESERVED);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, flags);
+}
+
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+		void *priv)
+{
+	return __blk_mq_all_tag_iter(tags, fn, priv, 0);
 }
 
 /**
@@ -342,7 +353,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
-			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
+			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
+					      BT_TAG_ITER_STARTED);
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 8a741752af8b9..d38e48f2a0a4a 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -34,6 +34,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2020-05-27 18:06 ` [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
@ 2020-05-27 18:06 ` Christoph Hellwig
  2020-05-27 18:26   ` Hannes Reinecke
  2020-05-27 23:09   ` Bart Van Assche
  2020-05-27 20:07 ` blk-mq: improvement CPU hotplug (simplified version) v4 Bart Van Assche
  2020-05-28  8:29 ` John Garry
  9 siblings, 2 replies; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-27 18:06 UTC (permalink / raw)
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

From: Ming Lei <ming.lei@redhat.com>

Most blk-mq drivers depend on the auto-affinity of managed IRQs to set up
their queue mapping. Thomas mentioned the following point[1]:

"That was the constraint of managed interrupts from the very beginning:

 The driver/subsystem has to quiesce the interrupt line and the associated
 queue _before_ it gets shutdown in CPU unplug and not fiddle with it
 until it's restarted by the core when the CPU is plugged in again."

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down.  Even worse, CPUHP_BLK_MQ_DEAD
is a cpuhp state handled after the CPU is down, so there is no chance to
quiesce the hctx before shutting down the CPU.

Add a new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
whose last CPU is going away, and wait for completion of in-flight
requests.  This guarantees that there is no in-flight I/O before the
managed IRQ is shut down.

Add a BLK_MQ_F_STACKING flag and set it for dm-rq and loop, so that we
do not wait for completion of in-flight requests from these drivers;
waiting there could deadlock.  It is safe to skip the wait for stacking
drivers, as they do not use interrupts at all and their I/O completions
are triggered by the I/O completions of the underlying devices.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: different retry mechanism, merged two patches, minor cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
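The allocation-side half of the scheme, condensed from the hunks below (the
sentinel is spelled -1 in blk_mq_get_tag() itself; see the review further
down in the thread):

	/* blk_mq_get_tag(): hand the tag back if the hctx went inactive */
	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
		return BLK_MQ_NO_TAG;
	}

	/* __blk_mq_alloc_request(): remap to a now-online ctx/hctx and retry */
retry:
	data->ctx = blk_mq_get_ctx(q);
	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
	tag = blk_mq_get_tag(data);
	if (tag == BLK_MQ_NO_TAG) {
		if (data->flags & BLK_MQ_REQ_NOWAIT)
			return NULL;
		msleep(3);
		goto retry;
	}
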
 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   8 +++
 block/blk-mq.c             | 121 ++++++++++++++++++++++++++++++++++++-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |  10 +++
 include/linux/cpuhotplug.h |   1 +
 7 files changed, 142 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 96b7a35c898a7..15df3a36e9fa4 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
@@ -239,6 +240,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(STACKING),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9f74064768423..1c548d9f67ee7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	sbitmap_finish_wait(bt, ws, &wait);
 
 found_tag:
+	/*
+	 * Give up this allocation if the hctx is inactive.  The caller will
+	 * retry on an active hctx.
+	 */
+	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
+		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
+		return -1;
+	}
 	return tag + tag_offset;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 898400452b1cf..e4580cd6c6f49 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -375,14 +375,39 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 			e->type->ops.limit_depth(data->cmd_flags, data);
 	}
 
+retry:
 	data->ctx = blk_mq_get_ctx(q);
 	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
 		blk_mq_tag_busy(data->hctx);
 
+	/*
+	 * Waiting allocations only fail because of an inactive hctx.  In that
+	 * case just retry the hctx assignment and tag allocation as CPU hotplug
+	 * should have migrated us to an online CPU by now.
+	 */
 	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_NO_TAG)
-		return NULL;
+	if (tag == BLK_MQ_NO_TAG) {
+		if (data->flags & BLK_MQ_REQ_NOWAIT)
+			return NULL;
+
+		/*
+		 * All kthreads that can perform I/O should have been moved off
+		 * this CPU by the time the CPU hotplug state machine has
+		 * shut down a hctx.  But better be sure with an extra sanity
+		 * check.
+		 */
+		if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
+			return NULL;
+
+		/*
+		 * Give up the CPU and sleep for a random short time to ensure
+		 * that threads using a realtime scheduling class are migrated
+		 * off the CPU.
+		 */
+		msleep(3);
+		goto retry;
+	}
 	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
@@ -2324,6 +2349,86 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+struct rq_iter_data {
+	struct blk_mq_hw_ctx *hctx;
+	bool has_rq;
+};
+
+static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
+{
+	struct rq_iter_data *iter_data = data;
+
+	if (rq->mq_hctx != iter_data->hctx)
+		return true;
+	iter_data->has_rq = true;
+	return false;
+}
+
+static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_tags *tags = hctx->sched_tags ?
+			hctx->sched_tags : hctx->tags;
+	struct rq_iter_data data = {
+		.hctx	= hctx,
+	};
+
+	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
+	return data.has_rq;
+}
+
+static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
+		struct blk_mq_hw_ctx *hctx)
+{
+	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
+		return false;
+	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
+		return false;
+	return true;
+}
+
+static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
+	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
+		return 0;
+
+	/*
+	 * Prevent new requests from being allocated on the current hctx.
+	 *
+	 * The smp_mb__after_atomic() pairs with the implied barrier in
+	 * test_and_set_bit_lock in sbitmap_get().  Ensures the inactive flag is
+	 * seen once we return from the tag allocator.
+	 */
+	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	smp_mb__after_atomic();
+
+	/*
+	 * Try to grab a reference to the queue and wait for any outstanding
+	 * requests.  If we could not grab a reference the queue has been
+	 * frozen and there are no requests.
+	 */
+	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
+		while (blk_mq_hctx_has_requests(hctx))
+			msleep(5);
+		percpu_ref_put(&hctx->queue->q_usage_counter);
+	}
+
+	return 0;
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (cpumask_test_cpu(cpu, hctx->cpumask))
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2337,6 +2442,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
@@ -2360,6 +2468,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_STACKING))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2419,6 +2530,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_STACKING))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3673,6 +3787,9 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				blk_mq_hctx_notify_online,
+				blk_mq_hctx_notify_offline);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e5..d7904b4d8d126 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13be..f60c025121215 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d7307795439a4..a20f8c241d665 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
 	 */
 	atomic_t		nr_active;
 
+	/** @cpuhp_online: List to store request if CPU is going to die */
+	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
 	struct hlist_node	cpuhp_dead;
 	/** @kobj: Kernel object for sysfs. */
@@ -391,6 +393,11 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	/*
+	 * Set when this device requires underlying blk-mq device for
+	 * completing IO:
+	 */
+	BLK_MQ_F_STACKING	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
@@ -400,6 +407,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 77d70b6335318..24b3a77810b6d 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -152,6 +152,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  2020-05-27 18:06 ` [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG Christoph Hellwig
@ 2020-05-27 18:14   ` Johannes Thumshirn
  2020-05-27 18:17   ` Hannes Reinecke
  2020-05-27 22:38   ` Bart Van Assche
  2 siblings, 0 replies; 41+ messages in thread
From: Johannes Thumshirn @ 2020-05-27 18:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places
  2020-05-27 18:06 ` [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places Christoph Hellwig
@ 2020-05-27 18:15   ` Johannes Thumshirn
  2020-05-27 18:18   ` Hannes Reinecke
  2020-05-27 22:38   ` Bart Van Assche
  2 siblings, 0 replies; 41+ messages in thread
From: Johannes Thumshirn @ 2020-05-27 18:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request
  2020-05-27 18:06 ` [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
@ 2020-05-27 18:16   ` Johannes Thumshirn
  0 siblings, 0 replies; 41+ messages in thread
From: Johannes Thumshirn @ 2020-05-27 18:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei, Hannes Reinecke, Daniel Wagner

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init
  2020-05-27 18:06 ` [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
@ 2020-05-27 18:16   ` Hannes Reinecke
  2020-05-28  9:50   ` Johannes Thumshirn
  1 sibling, 0 replies; 41+ messages in thread
From: Hannes Reinecke @ 2020-05-27 18:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei, Daniel Wagner

On 5/27/20 8:06 PM, Christoph Hellwig wrote:
> Don't split request initialization between __blk_mq_alloc_request and
> blk_mq_rq_ctx_init.  Also remove the op argument as it can be derived
> from the blk_mq_alloc_data structure.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Daniel Wagner <dwagner@suse.de>
> ---
>   block/blk-mq.c | 36 +++++++++++++++++++-----------------
>   1 file changed, 19 insertions(+), 17 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  2020-05-27 18:06 ` [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG Christoph Hellwig
  2020-05-27 18:14   ` Johannes Thumshirn
@ 2020-05-27 18:17   ` Hannes Reinecke
  2020-05-27 22:38   ` Bart Van Assche
  2 siblings, 0 replies; 41+ messages in thread
From: Hannes Reinecke @ 2020-05-27 18:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On 5/27/20 8:06 PM, Christoph Hellwig wrote:
> To prepare for wider use of this constant give it a more applicable name.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-tag.c | 4 ++--
>   block/blk-mq-tag.h | 4 ++--
>   block/blk-mq.c     | 2 +-
>   3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6e904ab..c76ba4f90fa09 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -111,7 +111,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   	if (data->flags & BLK_MQ_REQ_RESERVED) {
>   		if (unlikely(!tags->nr_reserved_tags)) {
>   			WARN_ON_ONCE(1);
> -			return BLK_MQ_TAG_FAIL;
> +			return BLK_MQ_NO_TAG;
>   		}
>   		bt = &tags->breserved_tags;
>   		tag_offset = 0;
> @@ -125,7 +125,7 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   		goto found_tag;
>   
>   	if (data->flags & BLK_MQ_REQ_NOWAIT)
> -		return BLK_MQ_TAG_FAIL;
> +		return BLK_MQ_NO_TAG;
>   
>   	ws = bt_wait_ptr(bt, data->hctx);
>   	do {
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 2b8321efb6820..8a741752af8b9 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -44,9 +44,9 @@ static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
>   }
>   
>   enum {
> -	BLK_MQ_TAG_FAIL		= -1U,
> +	BLK_MQ_NO_TAG		= -1U,
>   	BLK_MQ_TAG_MIN		= 1,
> -	BLK_MQ_TAG_MAX		= BLK_MQ_TAG_FAIL - 1,
> +	BLK_MQ_TAG_MAX		= BLK_MQ_NO_TAG - 1,
>   };
>   
>   extern bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1ffbc5d9e7cfe..826ff8f97489c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -386,7 +386,7 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
>   	}
>   
>   	tag = blk_mq_get_tag(data);
> -	if (tag == BLK_MQ_TAG_FAIL) {
> +	if (tag == BLK_MQ_NO_TAG) {
>   		if (clear_ctx_on_error)
>   			data->ctx = NULL;
>   		return NULL;
> 
Hehe.
Thanks.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention
  2020-05-27 18:06 ` [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
@ 2020-05-27 18:17   ` Johannes Thumshirn
  0 siblings, 0 replies; 41+ messages in thread
From: Johannes Thumshirn @ 2020-05-27 18:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei, Hannes Reinecke, Daniel Wagner

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places
  2020-05-27 18:06 ` [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places Christoph Hellwig
  2020-05-27 18:15   ` Johannes Thumshirn
@ 2020-05-27 18:18   ` Hannes Reinecke
  2020-05-27 22:38   ` Bart Van Assche
  2 siblings, 0 replies; 41+ messages in thread
From: Hannes Reinecke @ 2020-05-27 18:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On 5/27/20 8:06 PM, Christoph Hellwig wrote:
> Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-tag.c |  8 ++++----
>   block/blk-mq.c     | 14 +++++++-------
>   block/blk-mq.h     |  4 ++--
>   3 files changed, 13 insertions(+), 13 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter
  2020-05-27 18:06 ` [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
@ 2020-05-27 18:21   ` Hannes Reinecke
  2020-05-27 22:52   ` Bart Van Assche
  1 sibling, 0 replies; 41+ messages in thread
From: Hannes Reinecke @ 2020-05-27 18:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

On 5/27/20 8:06 PM, Christoph Hellwig wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> Add a new blk_mq_all_tag_iter function to iterate over all allocated
> scheduler tags and driver tags.  This is more flexible than the existing
> blk_mq_all_tag_busy_iter function as it allows the callers to do whatever
> they want on allocated request instead of being limited to started
> requests.
> 
> It will be used to implement draining allocated requests on specified
> hctx in this patchset.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> [hch: switch from the two booleans to a more readable flags field and
>   consolidate the tags iter functions]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-tag.c | 46 +++++++++++++++++++++++++++++-----------------
>   block/blk-mq-tag.h |  2 ++
>   2 files changed, 31 insertions(+), 17 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-27 18:06 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
@ 2020-05-27 18:26   ` Hannes Reinecke
  2020-05-27 23:09   ` Bart Van Assche
  1 sibling, 0 replies; 41+ messages in thread
From: Hannes Reinecke @ 2020-05-27 18:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei

On 5/27/20 8:06 PM, Christoph Hellwig wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
> up queue mapping. Thomas mentioned the following point[1]:
> 
> "That was the constraint of managed interrupts from the very beginning:
> 
>   The driver/subsystem has to quiesce the interrupt line and the associated
>   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>   until it's restarted by the core when the CPU is plugged in again."
> 
> However, current blk-mq implementation doesn't quiesce hw queue before
> the last CPU in the hctx is shutdown.  Even worse, CPUHP_BLK_MQ_DEAD is a
> cpuhp state handled after the CPU is down, so there isn't any chance to
> quiesce the hctx before shutting down the CPU.
> 
> Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
> where the last CPU goes away, and wait for completion of in-flight
> requests.  This guarantees that there is no inflight I/O before shutting
> down the managed IRQ.
> 
> Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
> to wait for completion of in-flight requests from these drivers to avoid
> a potential dead-lock. It is safe to do this for stacking drivers as those
> do not use interrupts at all and their I/O completions are triggered by
> underlying devices I/O completion.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> [hch: different retry mechanism, merged two patches, minor cleanups]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-debugfs.c     |   2 +
>   block/blk-mq-tag.c         |   8 +++
>   block/blk-mq.c             | 121 ++++++++++++++++++++++++++++++++++++-
>   drivers/block/loop.c       |   2 +-
>   drivers/md/dm-rq.c         |   2 +-
>   include/linux/blk-mq.h     |  10 +++
>   include/linux/cpuhotplug.h |   1 +
>   7 files changed, 142 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 96b7a35c898a7..15df3a36e9fa4 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
>   	HCTX_STATE_NAME(STOPPED),
>   	HCTX_STATE_NAME(TAG_ACTIVE),
>   	HCTX_STATE_NAME(SCHED_RESTART),
> +	HCTX_STATE_NAME(INACTIVE),
>   };
>   #undef HCTX_STATE_NAME
>   
> @@ -239,6 +240,7 @@ static const char *const hctx_flag_name[] = {
>   	HCTX_FLAG_NAME(TAG_SHARED),
>   	HCTX_FLAG_NAME(BLOCKING),
>   	HCTX_FLAG_NAME(NO_SCHED),
> +	HCTX_FLAG_NAME(STACKING),
>   };
>   #undef HCTX_FLAG_NAME
>   
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9f74064768423..1c548d9f67ee7 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   	sbitmap_finish_wait(bt, ws, &wait);
>   
>   found_tag:
> +	/*
> +	 * Give up this allocation if the hctx is inactive.  The caller will
> +	 * retry on an active hctx.
> +	 */
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> +		return -1;

BLK_MQ_NO_TAG, please, to be consistent with the caller checks later on.

> +	}
>   	return tag + tag_offset;
>   }
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 898400452b1cf..e4580cd6c6f49 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -375,14 +375,39 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
>   			e->type->ops.limit_depth(data->cmd_flags, data);
>   	}
>   
> +retry:
>   	data->ctx = blk_mq_get_ctx(q);
>   	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
>   	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
>   		blk_mq_tag_busy(data->hctx);
>   
> +	/*
> +	 * Waiting allocations only fail because of an inactive hctx.  In that
> +	 * case just retry the hctx assignment and tag allocation as CPU hotplug
> +	 * should have migrated us to an online CPU by now.
> +	 */
>   	tag = blk_mq_get_tag(data);
> -	if (tag == BLK_MQ_NO_TAG)
> -		return NULL;
> +	if (tag == BLK_MQ_NO_TAG) {
> +		if (data->flags & BLK_MQ_REQ_NOWAIT)
> +			return NULL;
> +
> +		/*
> +		 * All kthreads that can perform I/O should have been moved off
> +		 * this CPU by the time the CPU hotplug state machine has
> +		 * shut down a hctx.  But better be sure with an extra sanity
> +		 * check.
> +		 */
> +		if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
> +			return NULL;
> +
> +		/*
> +		 * Give up the CPU and sleep for a random short time to ensure
> +		 * that threads using a realtime scheduling class are migrated
> +		 * off the CPU.
> +		 */
> +		msleep(3);
> +		goto retry;
> +	}
>   	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
>   }
>   
> @@ -2324,6 +2349,86 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>   	return -ENOMEM;
>   }
>   
> +struct rq_iter_data {
> +	struct blk_mq_hw_ctx *hctx;
> +	bool has_rq;
> +};
> +
> +static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
> +{
> +	struct rq_iter_data *iter_data = data;
> +
> +	if (rq->mq_hctx != iter_data->hctx)
> +		return true;
> +	iter_data->has_rq = true;
> +	return false;
> +}
> +
> +static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct blk_mq_tags *tags = hctx->sched_tags ?
> +			hctx->sched_tags : hctx->tags;
> +	struct rq_iter_data data = {
> +		.hctx	= hctx,
> +	};
> +
> +	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> +	return data.has_rq;
> +}
> +
> +static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
> +		struct blk_mq_hw_ctx *hctx)
> +{
> +	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
> +		return false;
> +	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
> +		return false;
> +	return true;
> +}
> +
> +static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
> +	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
> +		return 0;
> +
> +	/*
> +	 * Prevent new request from being allocated on the current hctx.
> +	 *
> +	 * The smp_mb__after_atomic() Pairs with the implied barrier in
> +	 * test_and_set_bit_lock in sbitmap_get().  Ensures the inactive flag is
> +	 * seen once we return from the tag allocator.
> +	 */
> +	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> +	smp_mb__after_atomic();
> +
> +	/*
> +	 * Try to grab a reference to the queue and wait for any outstanding
> +	 * requests.  If we could not grab a reference the queue has been
> +	 * frozen and there are no requests.
> +	 */
> +	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
> +		while (blk_mq_hctx_has_requests(hctx))
> +			msleep(5);
> +		percpu_ref_put(&hctx->queue->q_usage_counter);
> +	}
> +
> +	return 0;
> +}
> +
> +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +
> +	if (cpumask_test_cpu(cpu, hctx->cpumask))
> +		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> +	return 0;
> +}
> +
>   /*
>    * 'cpu' is going away. splice any existing rq_list entries from this
>    * software queue to the hw queue dispatch list, and ensure that it
> @@ -2337,6 +2442,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>   	enum hctx_type type;
>   
>   	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
>   	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>   	type = hctx->type;
>   
> @@ -2360,6 +2468,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>   
>   static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
>   {
> +	if (!(hctx->flags & BLK_MQ_F_STACKING))
> +		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> +						    &hctx->cpuhp_online);
>   	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
>   					    &hctx->cpuhp_dead);
>   }
> @@ -2419,6 +2530,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
>   {
>   	hctx->queue_num = hctx_idx;
>   
> +	if (!(hctx->flags & BLK_MQ_F_STACKING))
> +		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> +				&hctx->cpuhp_online);
>   	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
>   
>   	hctx->tags = set->tags[hctx_idx];
> @@ -3673,6 +3787,9 @@ static int __init blk_mq_init(void)
>   {
>   	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
>   				blk_mq_hctx_notify_dead);
> +	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
> +				blk_mq_hctx_notify_online,
> +				blk_mq_hctx_notify_offline);
>   	return 0;
>   }
>   subsys_initcall(blk_mq_init);
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index da693e6a834e5..d7904b4d8d126 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
>   	lo->tag_set.queue_depth = 128;
>   	lo->tag_set.numa_node = NUMA_NO_NODE;
>   	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
> -	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> +	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
>   	lo->tag_set.driver_data = lo;
>   
>   	err = blk_mq_alloc_tag_set(&lo->tag_set);
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 3f8577e2c13be..f60c025121215 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>   	md->tag_set->ops = &dm_mq_ops;
>   	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
>   	md->tag_set->numa_node = md->numa_node_id;
> -	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
> +	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
>   	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
>   	md->tag_set->driver_data = md;
>   
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index d7307795439a4..a20f8c241d665 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
>   	 */
>   	atomic_t		nr_active;
>   
> +	/** @cpuhp_online: List to store request if CPU is going to die */
> +	struct hlist_node	cpuhp_online;
>   	/** @cpuhp_dead: List to store request if some CPU die. */
>   	struct hlist_node	cpuhp_dead;
>   	/** @kobj: Kernel object for sysfs. */
> @@ -391,6 +393,11 @@ struct blk_mq_ops {
>   enum {
>   	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
>   	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> +	/*
> +	 * Set when this device requires underlying blk-mq device for
> +	 * completing IO:
> +	 */
> +	BLK_MQ_F_STACKING	= 1 << 2,
>   	BLK_MQ_F_BLOCKING	= 1 << 5,
>   	BLK_MQ_F_NO_SCHED	= 1 << 6,
>   	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> @@ -400,6 +407,9 @@ enum {
>   	BLK_MQ_S_TAG_ACTIVE	= 1,
>   	BLK_MQ_S_SCHED_RESTART	= 2,
>   
> +	/* hw queue is inactive after all its CPUs become offline */
> +	BLK_MQ_S_INACTIVE	= 3,
> +
>   	BLK_MQ_MAX_DEPTH	= 10240,
>   
>   	BLK_MQ_CPU_WORK_BATCH	= 8,
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 77d70b6335318..24b3a77810b6d 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -152,6 +152,7 @@ enum cpuhp_state {
>   	CPUHP_AP_SMPBOOT_THREADS,
>   	CPUHP_AP_X86_VDSO_VMA_ONLINE,
>   	CPUHP_AP_IRQ_AFFINITY_ONLINE,
> +	CPUHP_AP_BLK_MQ_ONLINE,
>   	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
>   	CPUHP_AP_X86_INTEL_EPB_ONLINE,
>   	CPUHP_AP_PERF_ONLINE,
> 
Otherwise looks okay.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2020-05-27 18:06 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
@ 2020-05-27 20:07 ` Bart Van Assche
  2020-05-27 20:31   ` John Garry
  2020-05-28  8:29 ` John Garry
  9 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-27 20:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner

On 2020-05-27 11:06, Christoph Hellwig wrote:
> this series ensures I/O is quiesced before a cpu and thus the managed
> interrupt handler is shut down.
> 
> This patchset tries to address the issue by the following approach:
> 
>  - before the last cpu in hctx->cpumask is going to offline, mark this
>    hctx as inactive
> 
>  - disable preempt during allocating tag for request, and after tag is
>    allocated, check if this hctx is inactive. If yes, give up the
>    allocation and try remote allocation from online CPUs
> 
>  - before hctx becomes inactive, drain all allocated requests on this
>    hctx
> 
> The guts of the changes are from Ming Lei, I just did a bunch of prep
> cleanups so that they can fit in more nicely.  The series also depends
> on my "avoid a few q_usage_counter roundtrips v3" series.
> 
> Thanks John Garry for running lots of tests on arm64 with this previous
> version patches and co-working on investigating all kinds of issues.

Hi Christoph,

Thanks for having prepared and posted this new patch series. After v3
was posted and before v4 was posted I had a closer look at the IRQ core.
My conclusions (which may be incorrect) are as follows:
* The only function that sets the 'is_managed' member of struct
  irq_affinity_desc to 1 is irq_create_affinity_masks().
* There are two ways to cause that function to be called: setting the
  PCI_IRQ_AFFINITY flag when calling pci_alloc_irq_vectors_affinity() or
  passing the 'affd' argument. pci_alloc_irq_vectors() calls
  pci_alloc_irq_vectors_affinity().
* The following drivers pass an affinity domain argument when allocating
  interrupts: virtio_blk, nvme, be2iscsi, csiostor, hisi_sas, megaraid,
  mpt3sas, qla2xxx, virtio_scsi.
* The following drivers set the PCI_IRQ_AFFINITY flag but do not pass an
  affinity domain: aacraid, hpsa, lpfc, smartqpi, virtio_pci_common.
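
For reference, a minimal sketch (purely illustrative; the function name and
parameters below are made up and not taken from any of the drivers above) of
how such an allocation typically looks when both PCI_IRQ_AFFINITY and an
'affd' argument are passed, which routes the request through
irq_create_affinity_masks():

#include <linux/pci.h>

static int example_setup_queue_irqs(struct pci_dev *pdev,
				    unsigned int nr_queues)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. one admin/config vector */
	};
	int nvecs;

	/*
	 * PCI_IRQ_AFFINITY plus &affd makes the core spread the queue
	 * vectors over the CPUs and mark them as managed (is_managed = 1).
	 */
	nvecs = pci_alloc_irq_vectors_affinity(pdev, 1, nr_queues + 1,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
	if (nvecs < 0)
		return nvecs;

	return nvecs - affd.pre_vectors;	/* usable queue vectors */
}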

What is not clear to me is why managed interrupts are shut down if the
last CPU in their affinity mask is shut down? Has it been considered to
modify the IRQ core such that managed PCIe interrupts are assigned to
another CPU if the last CPU in their affinity mask is shut down? Would
that make it unnecessary to drain hardware queues during CPU
hotplugging? Or is there perhaps something in the PCI or PCIe
specifications or in one of the architectures supported by Linux that
prevents doing this?

Is this the commit that introduced shutdown of managed interrupts:
c5cb83bb337c ("genirq/cpuhotplug: Handle managed IRQs on CPU hotplug")?

Some of my knowledge about non-managed and managed interrupts comes from
https://lore.kernel.org/lkml/alpine.DEB.2.20.1710162106400.2037@nanos/

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-27 20:07 ` blk-mq: improvement CPU hotplug (simplified version) v4 Bart Van Assche
@ 2020-05-27 20:31   ` John Garry
  2020-05-29 13:26     ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: John Garry @ 2020-05-27 20:31 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: linux-block, Hannes Reinecke, Thomas Gleixner

On 27/05/2020 21:07, Bart Van Assche wrote:
> On 2020-05-27 11:06, Christoph Hellwig wrote:
>> this series ensures I/O is quiesced before a cpu and thus the managed
>> interrupt handler is shut down.
>>
>> This patchset tries to address the issue by the following approach:
>>
>>   - before the last cpu in hctx->cpumask is going to offline, mark this
>>     hctx as inactive
>>
>>   - disable preempt during allocating tag for request, and after tag is
>>     allocated, check if this hctx is inactive. If yes, give up the
>>     allocation and try remote allocation from online CPUs
>>
>>   - before hctx becomes inactive, drain all allocated requests on this
>>     hctx
>>
>> The guts of the changes are from Ming Lei, I just did a bunch of prep
>> cleanups so that they can fit in more nicely.  The series also depends
>> on my "avoid a few q_usage_counter roundtrips v3" series.
>>
>> Thanks John Garry for running lots of tests on arm64 with this previous
>> version patches and co-working on investigating all kinds of issues.
> 
> Hi Christoph,
> 
> Thanks for having prepared and posted this new patch series. After v3
> was posted and before v4 was posted I had a closer look at the IRQ core.
> My conclusions (which may be incorrect) are as follows:
> * The only function that sets the 'is_managed' member of struct
>    irq_affinity_desc to 1 is irq_create_affinity_masks().
> * There are two ways to cause that function to be called: setting the
>    PCI_IRQ_AFFINITY flag when calling pci_alloc_irq_vectors_affinity() or
>    passing the 'affd' argument. pci_alloc_irq_vectors() calls
>    pci_alloc_irq_vectors_affinity().
> * The following drivers pass an affinity domain argument when allocating
>    interrupts: virtio_blk, nvme, be2iscsi, csiostor, hisi_sas, megaraid,
>    mpt3sas, qla2xxx, virtio_scsi.
> * The following drivers set the PCI_IRQ_AFFINITY flag but do not pass an
>    affinity domain: aacraid, hpsa, lpfc, smartqpi, virtio_pci_common.
> 
> What is not clear to me is why managed interrupts are shut down if the
> last CPU in their affinity mask is shut down? Has it been considered to
> modify the IRQ core such that managed PCIe interrupts are assigned to
> another CPU if the last CPU in their affinity mask is shut down? 

I think Thomas answered that here already:
https://lore.kernel.org/lkml/alpine.DEB.2.21.1901291717370.1513@nanos.tec.linutronix.de/

(vector space exhaustion)

> Would
> that make it unnecessary to drain hardware queues during CPU
> hotplugging? Or is there perhaps something in the PCI or PCIe
> specifications or in one of the architectures supported by Linux that
> prevents doing this?
> 
> Is this the commit that introduced shutdown of managed interrupts:
> c5cb83bb337c ("genirq/cpuhotplug: Handle managed IRQs on CPU hotplug")?
> 
> Some of my knowledge about non-managed and managed interrupts comes from
> https://lore.kernel.org/lkml/alpine.DEB.2.20.1710162106400.2037@nanos/
> 
> Thanks,
> 
> Bart.
> .
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  2020-05-27 18:06 ` [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG Christoph Hellwig
  2020-05-27 18:14   ` Johannes Thumshirn
  2020-05-27 18:17   ` Hannes Reinecke
@ 2020-05-27 22:38   ` Bart Van Assche
  2 siblings, 0 replies; 41+ messages in thread
From: Bart Van Assche @ 2020-05-27 22:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner

On 2020-05-27 11:06, Christoph Hellwig wrote:
> To prepare for wider use of this constant give it a more applicable name.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places
  2020-05-27 18:06 ` [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places Christoph Hellwig
  2020-05-27 18:15   ` Johannes Thumshirn
  2020-05-27 18:18   ` Hannes Reinecke
@ 2020-05-27 22:38   ` Bart Van Assche
  2 siblings, 0 replies; 41+ messages in thread
From: Bart Van Assche @ 2020-05-27 22:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner

On 2020-05-27 11:06, Christoph Hellwig wrote:
> Replace various magic -1 constants for tags with BLK_MQ_NO_TAG.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter
  2020-05-27 18:06 ` [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
  2020-05-27 18:21   ` Hannes Reinecke
@ 2020-05-27 22:52   ` Bart Van Assche
  1 sibling, 0 replies; 41+ messages in thread
From: Bart Van Assche @ 2020-05-27 22:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner, Ming Lei

On 2020-05-27 11:06, Christoph Hellwig wrote:
> @@ -308,21 +312,28 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
>  }
>  
>  /**
> - * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
> + * blk_mq_all_tag_iter - iterate over all requests in a tag map
>   * @tags:	Tag map to iterate over.
> - * @fn:		Pointer to the function that will be called for each started
> + * @fn:		Pointer to the function that will be called for each
>   *		request. @fn will be called as follows: @fn(rq, @priv,
>   *		reserved) where rq is a pointer to a request. 'reserved'
>   *		indicates whether or not @rq is a reserved request. Return
>   *		true to continue iterating tags, false to stop.
>   * @priv:	Will be passed as second argument to @fn.
>   */

Please document the new 'flags' argument in the kernel-doc header.

> -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> -		busy_tag_iter_fn *fn, void *priv)
> +static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
> +		busy_tag_iter_fn *fn, void *priv, unsigned int flags)
>  {
>  	if (tags->nr_reserved_tags)
> -		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
> -	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
> +		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv,
> +				 flags | BT_TAG_ITER_RESERVED);
> +	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, flags);
> +}

The meaning of the 'reserved' argument passed to bt_for_each() and also
to busy_iter_fn is 'the request argument is a reserved request'. If
BT_TAG_ITER_RESERVED were set in the 'flags' argument then the wrong
value would be passed for the 'reserved' argument in the second
bt_tags_for_each() call. So I think we need something like the following
at the start of __blk_mq_all_tag_iter():

	WARN_ON_ONCE(flags & BT_TAG_ITER_RESERVED)

Otherwise this patch looks good to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-27 18:06 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
  2020-05-27 18:26   ` Hannes Reinecke
@ 2020-05-27 23:09   ` Bart Van Assche
  2020-05-28  1:46     ` Ming Lei
  1 sibling, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-27 23:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Hannes Reinecke, Thomas Gleixner, Ming Lei

On 2020-05-27 11:06, Christoph Hellwig wrote:
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  	sbitmap_finish_wait(bt, ws, &wait);
>  
>  found_tag:
> +	/*
> +	 * Give up this allocation if the hctx is inactive.  The caller will
> +	 * retry on an active hctx.
> +	 */
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> +		return -1;
> +	}
>  	return tag + tag_offset;
>  }

The code that has been added in blk_mq_hctx_notify_offline() will only
work correctly if blk_mq_get_tag() tests BLK_MQ_S_INACTIVE after the
store instructions involved in the tag allocation happened. Does this
mean that a memory barrier should be added in the above function before
the test_bit() call?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-27 23:09   ` Bart Van Assche
@ 2020-05-28  1:46     ` Ming Lei
  2020-05-28  3:33       ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-28  1:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On Wed, May 27, 2020 at 04:09:19PM -0700, Bart Van Assche wrote:
> On 2020-05-27 11:06, Christoph Hellwig wrote:
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
> >  	sbitmap_finish_wait(bt, ws, &wait);
> >  
> >  found_tag:
> > +	/*
> > +	 * Give up this allocation if the hctx is inactive.  The caller will
> > +	 * retry on an active hctx.
> > +	 */
> > +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> > +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> > +		return -1;
> > +	}
> >  	return tag + tag_offset;
> >  }
> 
> The code that has been added in blk_mq_hctx_notify_offline() will only
> work correctly if blk_mq_get_tag() tests BLK_MQ_S_INACTIVE after the
> store instructions involved in the tag allocation happened. Does this
> mean that a memory barrier should be added in the above function before
> the test_bit() call?

Please see comment in blk_mq_hctx_notify_offline():

+       /*
+        * Prevent new request from being allocated on the current hctx.
+        *
+        * The smp_mb__after_atomic() Pairs with the implied barrier in
+        * test_and_set_bit_lock in sbitmap_get().  Ensures the inactive flag is
+        * seen once we return from the tag allocator.
+        */
+       set_bit(BLK_MQ_S_INACTIVE, &hctx->state);


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-28  1:46     ` Ming Lei
@ 2020-05-28  3:33       ` Bart Van Assche
  2020-05-28  5:19         ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-28  3:33 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner

On 2020-05-27 18:46, Ming Lei wrote:
> On Wed, May 27, 2020 at 04:09:19PM -0700, Bart Van Assche wrote:
>> On 2020-05-27 11:06, Christoph Hellwig wrote:
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>>>  	sbitmap_finish_wait(bt, ws, &wait);
>>>  
>>>  found_tag:
>>> +	/*
>>> +	 * Give up this allocation if the hctx is inactive.  The caller will
>>> +	 * retry on an active hctx.
>>> +	 */
>>> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
>>> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
>>> +		return -1;
>>> +	}
>>>  	return tag + tag_offset;
>>>  }
>>
>> The code that has been added in blk_mq_hctx_notify_offline() will only
>> work correctly if blk_mq_get_tag() tests BLK_MQ_S_INACTIVE after the
>> store instructions involved in the tag allocation happened. Does this
>> mean that a memory barrier should be added in the above function before
>> the test_bit() call?
> 
> Please see comment in blk_mq_hctx_notify_offline():
> 
> +       /*
> +        * Prevent new request from being allocated on the current hctx.
> +        *
> +        * The smp_mb__after_atomic() Pairs with the implied barrier in
> +        * test_and_set_bit_lock in sbitmap_get().  Ensures the inactive flag is
> +        * seen once we return from the tag allocator.
> +        */
> +       set_bit(BLK_MQ_S_INACTIVE, &hctx->state);

From Documentation/atomic_bitops.txt: "Except for a successful
test_and_set_bit_lock() which has ACQUIRE semantics and
clear_bit_unlock() which has RELEASE semantics."

My understanding is that operations that have acquire semantics pair
with operations that have release semantics. I haven't been able to find
any documentation that shows that smp_mb__after_atomic() has release
semantics. So I looked up its definition. This is what I found:

$ git grep -nH 'define __smp_mb__after_atomic'
arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
barrier()
arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
smp_llsc_mb()
arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
barrier()
arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
barrier()
arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
} while (0)
arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
barrier()
include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
__smp_mb()

My interpretation of the above is that not all smp_mb__after_atomic()
implementations have release semantics. Do you agree with this conclusion?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-28  3:33       ` Bart Van Assche
@ 2020-05-28  5:19         ` Ming Lei
  2020-05-28 13:37           ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-28  5:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner, Paul E. McKenney, linux-kernel

On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> On 2020-05-27 18:46, Ming Lei wrote:
> > On Wed, May 27, 2020 at 04:09:19PM -0700, Bart Van Assche wrote:
> >> On 2020-05-27 11:06, Christoph Hellwig wrote:
> >>> --- a/block/blk-mq-tag.c
> >>> +++ b/block/blk-mq-tag.c
> >>> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
> >>>  	sbitmap_finish_wait(bt, ws, &wait);
> >>>  
> >>>  found_tag:
> >>> +	/*
> >>> +	 * Give up this allocation if the hctx is inactive.  The caller will
> >>> +	 * retry on an active hctx.
> >>> +	 */
> >>> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> >>> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> >>> +		return -1;
> >>> +	}
> >>>  	return tag + tag_offset;
> >>>  }
> >>
> >> The code that has been added in blk_mq_hctx_notify_offline() will only
> >> work correctly if blk_mq_get_tag() tests BLK_MQ_S_INACTIVE after the
> >> store instructions involved in the tag allocation happened. Does this
> >> mean that a memory barrier should be added in the above function before
> >> the test_bit() call?
> > 
> > Please see comment in blk_mq_hctx_notify_offline():
> > 
> > +       /*
> > +        * Prevent new request from being allocated on the current hctx.
> > +        *
> > +        * The smp_mb__after_atomic() Pairs with the implied barrier in
> > +        * test_and_set_bit_lock in sbitmap_get().  Ensures the inactive flag is
> > +        * seen once we return from the tag allocator.
> > +        */
> > +       set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> 
> From Documentation/atomic_bitops.txt: "Except for a successful
> test_and_set_bit_lock() which has ACQUIRE semantics and
> clear_bit_unlock() which has RELEASE semantics."

test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state) is called only after
one tag has been allocated, which means test_and_set_bit_lock() was
successful before the test_bit(). The ACQUIRE semantics guarantee that
test_bit(BLK_MQ_S_INACTIVE) is always done after the successful
test_and_set_bit_lock(), so the tag bit is always set before
BLK_MQ_S_INACTIVE is tested.
 
See Documentation/memory-barriers.txt:
 (5) ACQUIRE operations.

     This acts as a one-way permeable barrier.  It guarantees that all memory
     operations after the ACQUIRE operation will appear to happen after the
     ACQUIRE operation with respect to the other components of the system.
     ACQUIRE operations include LOCK operations and both smp_load_acquire()
     and smp_cond_load_acquire() operations.

> 
> My understanding is that operations that have acquire semantics pair
> with operations that have release semantics. I haven't been able to find
> any documentation that shows that smp_mb__after_atomic() has release
> semantics. So I looked up its definition. This is what I found:
> 
> $ git grep -nH 'define __smp_mb__after_atomic'
> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> barrier()
> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> smp_llsc_mb()
> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> barrier()
> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> barrier()
> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> } while (0)
> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> barrier()
> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> __smp_mb()
> 
> My interpretation of the above is that not all smp_mb__after_atomic()
> implementations have release semantics. Do you agree with this conclusion?

I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
and reading the tag bit which is done in blk_mq_all_tag_iter().

So the two pair of OPs are ordered:

1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
so the request will be drained.

OR

2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
the request(tag bit) will be released and retried on another CPU
finally, see __blk_mq_alloc_request().
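
To make that concrete, here is a minimal sketch of the two sides
(illustrative names only, assuming <linux/bitops.h>; this is not the
actual blk-mq/sbitmap code):

#define EX_TAG_BIT	0	/* stands in for one tag bit in the sbitmap */
#define EX_INACTIVE_BIT	0	/* stands in for BLK_MQ_S_INACTIVE */

/* allocation side, roughly what sbitmap_get() + blk_mq_get_tag() do */
static bool ex_alloc_side(unsigned long *tag_word, unsigned long *state)
{
	if (test_and_set_bit_lock(EX_TAG_BIT, tag_word))
		return false;			/* tag already taken */
	/* the ACQUIRE above orders this test_bit() after the tag bit is set */
	if (test_bit(EX_INACTIVE_BIT, state)) {
		/* case 2): give the tag back, caller retries on another CPU */
		clear_bit_unlock(EX_TAG_BIT, tag_word);
		return false;
	}
	/* otherwise the offline side will see the tag bit and drain it (case 1) */
	return true;
}

/* hotplug side, roughly what blk_mq_hctx_notify_offline() does */
static bool ex_offline_side(unsigned long *tag_word, unsigned long *state)
{
	set_bit(EX_INACTIVE_BIT, state);
	smp_mb__after_atomic();		/* order against reading the tag bits */
	return test_bit(EX_TAG_BIT, tag_word);	/* true -> drain the request */
}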

Cc Paul and linux-kernel list.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2020-05-27 20:07 ` blk-mq: improvement CPU hotplug (simplified version) v4 Bart Van Assche
@ 2020-05-28  8:29 ` John Garry
  9 siblings, 0 replies; 41+ messages in thread
From: John Garry @ 2020-05-28  8:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, Bart Van Assche, Hannes Reinecke, Thomas Gleixner

On 27/05/2020 19:06, Christoph Hellwig wrote:
> Hi all,
> 
> this series ensures I/O is quiesced before a cpu and thus the managed
> interrupt handler is shut down.
> 
> This patchset tries to address the issue by the following approach:
> 
>   - before the last cpu in hctx->cpumask is going to offline, mark this
>     hctx as inactive
> 
>   - disable preempt during allocating tag for request, and after tag is
>     allocated, check if this hctx is inactive. If yes, give up the
>     allocation and try remote allocation from online CPUs
> 
>   - before hctx becomes inactive, drain all allocated requests on this
>     hctx
> 
> The guts of the changes are from Ming Lei, I just did a bunch of prep
> cleanups so that they can fit in more nicely.  The series also depends
> on my "avoid a few q_usage_counter roundtrips v3" series.
> 
> Thanks John Garry for running lots of tests on arm64 with this previous
> version patches and co-working on investigating all kinds of issues.
> 
> A git tree is available here:
> 
>      git://git.infradead.org/users/hch/block.git blk-mq-hotplug.3
> 
> Gitweb:
> 
>      http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-mq-hotplug.3
> 
> Changes since v3:
>    - add two new patches to clean up the magic -1 tag values
>    - improve a few commit messages and comments
>    - cleanup the blk_mq_all_tag_iter implementation
>    - add a msleep to the cpu hot unplug case in __blk_mq_alloc_request
> 
> Changes since v2:
>    - don't disable preemption and use smp calls
> 

I tested this again, so:
Tested-by: John Garry <john.garry@huawei.com>

As an aside, I'm not familiar with blktests, but it may be possible to 
add something to test this. I'll look.

Cheers

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init
  2020-05-27 18:06 ` [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
  2020-05-27 18:16   ` Hannes Reinecke
@ 2020-05-28  9:50   ` Johannes Thumshirn
  1 sibling, 0 replies; 41+ messages in thread
From: Johannes Thumshirn @ 2020-05-28  9:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner, Ming Lei, Daniel Wagner

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-28  5:19         ` Ming Lei
@ 2020-05-28 13:37           ` Bart Van Assche
  2020-05-28 17:21             ` Paul E. McKenney
  2020-05-29  1:13             ` Ming Lei
  0 siblings, 2 replies; 41+ messages in thread
From: Bart Van Assche @ 2020-05-28 13:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner, Paul E. McKenney, linux-kernel

On 2020-05-27 22:19, Ming Lei wrote:
> On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
>> My understanding is that operations that have acquire semantics pair
>> with operations that have release semantics. I haven't been able to find
>> any documentation that shows that smp_mb__after_atomic() has release
>> semantics. So I looked up its definition. This is what I found:
>>
>> $ git grep -nH 'define __smp_mb__after_atomic'
>> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
>> barrier()
>> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
>> smp_llsc_mb()
>> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
>> barrier()
>> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
>> barrier()
>> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
>> } while (0)
>> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
>> barrier()
>> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
>> __smp_mb()
>>
>> My interpretation of the above is that not all smp_mb__after_atomic()
>> implementations have release semantics. Do you agree with this conclusion?
> 
> I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> and reading the tag bit which is done in blk_mq_all_tag_iter().
> 
> So the two pair of OPs are ordered:
> 
> 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> so the request will be drained.
> 
> OR
> 
> 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> the request(tag bit) will be released and retried on another CPU
> finally, see __blk_mq_alloc_request().
> 
> Cc Paul and linux-kernel list.

I do not agree with the above conclusion. My understanding of
acquire/release labels is that if the following holds:
(1) A store operation that stores the value V into memory location M has
a release label.
(2) A load operation that reads memory location M has an acquire label.
(3) The load operation (2) retrieves the value V that was stored by (1).

that the following ordering property holds: all load and store
instructions that happened before the store instruction (1) in program
order are guaranteed to happen before the load and store instructions
that follow (2) in program order.

In the ARM manual these semantics have been described as follows: "A
Store-Release instruction is multicopy atomic when observed with a
Load-Acquire instruction".

In this case the load-acquire operation is the
"test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
code is executed indirectly by blk_mq_get_tag(). Since there is no
matching store-release instruction in __blk_mq_alloc_request() for
'word', ordering of the &data->hctx->state and 'tag' memory locations is
not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
word)" statement from the sbitmap code.
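
(For reference, a minimal sketch of the pairing described in (1)-(3),
with purely illustrative variables; the point being that there is no
equivalent of the store-release below on the tag word in
__blk_mq_alloc_request():)

static int ex_payload;		/* some store before (1) in program order */
static int ex_m;		/* memory location M */

static void ex_writer(void)
{
	ex_payload = 42;
	smp_store_release(&ex_m, 1);		/* (1): store V into M, RELEASE */
}

static void ex_reader(void)
{
	if (smp_load_acquire(&ex_m) == 1)	/* (2)+(3): ACQUIRE load sees V */
		WARN_ON(ex_payload != 42);	/* guaranteed by the pairing */
}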

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-28 13:37           ` Bart Van Assche
@ 2020-05-28 17:21             ` Paul E. McKenney
  2020-05-29  1:53               ` Ming Lei
  2020-05-29  1:13             ` Ming Lei
  1 sibling, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2020-05-28 17:21 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Christoph Hellwig, linux-block, John Garry,
	Hannes Reinecke, Thomas Gleixner, linux-kernel

On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> On 2020-05-27 22:19, Ming Lei wrote:
> > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> >> My understanding is that operations that have acquire semantics pair
> >> with operations that have release semantics. I haven't been able to find
> >> any documentation that shows that smp_mb__after_atomic() has release
> >> semantics. So I looked up its definition. This is what I found:
> >>
> >> $ git grep -nH 'define __smp_mb__after_atomic'
> >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> >> smp_llsc_mb()
> >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> >> } while (0)
> >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> >> barrier()
> >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> >> __smp_mb()
> >>
> >> My interpretation of the above is that not all smp_mb__after_atomic()
> >> implementations have release semantics. Do you agree with this conclusion?
> > 
> > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > 
> > So the two pair of OPs are ordered:
> > 
> > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > so the request will be drained.
> > 
> > OR
> > 
> > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > the request(tag bit) will be released and retried on another CPU
> > finally, see __blk_mq_alloc_request().
> > 
> > Cc Paul and linux-kernel list.
> 
> I do not agree with the above conclusion. My understanding of
> acquire/release labels is that if the following holds:
> (1) A store operation that stores the value V into memory location M has
> a release label.
> (2) A load operation that reads memory location M has an acquire label.
> (3) The load operation (2) retrieves the value V that was stored by (1).
> 
> that the following ordering property holds: all load and store
> instructions that happened before the store instruction (1) in program
> order are guaranteed to happen before the load and store instructions
> that follow (2) in program order.
> 
> In the ARM manual these semantics have been described as follows: "A
> Store-Release instruction is multicopy atomic when observed with a
> Load-Acquire instruction".
> 
> In this case the load-acquire operation is the
> "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> code is executed indirectly by blk_mq_get_tag(). Since there is no
> matching store-release instruction in __blk_mq_alloc_request() for
> 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> word)" statement from the sbitmap code.

I feel like I just parachuted into the middle of the conversation,
so let me start by giving a (silly) example illustrating the limits of
smp_mb__{before,after}_atomic() that might be tangling things up.

But please please please avoid doing this in real code unless you have
an extremely good reason included in a comment.

void t1(void)
{
	WRITE_ONCE(a, 1);
	smp_mb__before_atomic();
	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
	atomic_inc(&c);
	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
	smp_mb__after_atomic();
	WRITE_ONCE(e, 1);
}

void t2(void)
{
	r1 = READ_ONCE(e);
	smp_mb();
	r2 = READ_ONCE(d);
	smp_mb();
	r3 = READ_ONCE(c);
	smp_mb();
	r4 = READ_ONCE(b);
	smp_mb();
	r5 = READ_ONCE(a);
}

Each platform must provide strong ordering for either atomic_inc()
on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
on the other (as powerpc does).  Note that both ia64 and powerpc are
weakly ordered.

So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
1 && r5 == 0).  So clearly smp_mb__{before,after}_atomic() need not have
any ordering properties whatsoever.

Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
So clearly atomic_inc() need not have any ordering properties whatsoever.

But the combination of smp_mb__before_atomic() and the later atomic_inc()
does provide full ordering, so that no architecture can see (r3 == 1 &&
r5 == 0), and either of r1 or r2 can be substituted for r3.

Similarly, atomic_inc() and the later smp_mb__after_atomic() also
provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
0), and either r4 or r5 can be substituted for r3.


So a call to set_bit() followed by a call to smp_mb__after_atomic() will
provide a full memory barrier (implying release semantics) for any write
access after the smp_mb__after_atomic() with respect to the set_bit() or
any access preceding it.  But the set_bit() by itself won't have release
semantics, nor will the smp_mb__after_atomic(), only their combination
further combined with some write following the smp_mb__after_atomic().

More generally, there will be the equivalent of smp_mb() somewhere between
the set_bit() and every access following the smp_mb__after_atomic().

Does that help, or am I missing the point?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-28 13:37           ` Bart Van Assche
  2020-05-28 17:21             ` Paul E. McKenney
@ 2020-05-29  1:13             ` Ming Lei
  1 sibling, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-29  1:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner, Paul E. McKenney, linux-kernel

On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> On 2020-05-27 22:19, Ming Lei wrote:
> > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> >> My understanding is that operations that have acquire semantics pair
> >> with operations that have release semantics. I haven't been able to find
> >> any documentation that shows that smp_mb__after_atomic() has release
> >> semantics. So I looked up its definition. This is what I found:
> >>
> >> $ git grep -nH 'define __smp_mb__after_atomic'
> >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> >> smp_llsc_mb()
> >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> >> } while (0)
> >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> >> barrier()
> >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> >> __smp_mb()
> >>
> >> My interpretation of the above is that not all smp_mb__after_atomic()
> >> implementations have release semantics. Do you agree with this conclusion?
> > 
> > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > 
> > So the two pair of OPs are ordered:
> > 
> > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > so the request will be drained.
> > 
> > OR
> > 
> > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > the request(tag bit) will be released and retried on another CPU
> > finally, see __blk_mq_alloc_request().
> > 
> > Cc Paul and linux-kernel list.
> 
> I do not agree with the above conclusion. My understanding of
> acquire/release labels is that if the following holds:
> (1) A store operation that stores the value V into memory location M has
> a release label.
> (2) A load operation that reads memory location M has an acquire label.
> (3) The load operation (2) retrieves the value V that was stored by (1).
> 
> that the following ordering property holds: all load and store
> instructions that happened before the store instruction (1) in program
> order are guaranteed to happen before the load and store instructions
> that follow (2) in program order.
> 
> In the ARM manual these semantics have been described as follows: "A
> Store-Release instruction is multicopy atomic when observed with a
> Load-Acquire instruction".
> 
> In this case the load-acquire operation is the
> "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> code is executed indirectly by blk_mq_get_tag(). Since there is no
> matching store-release instruction in __blk_mq_alloc_request() for
> 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> word)" statement from the sbitmap code.

If the order isn't guaranteed, either of the following two documents has to be wrong:

Documentation/memory-barriers.txt:
	...
	In all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
	for each construct.  These operations all imply certain barriers:
	
	 (1) ACQUIRE operation implication:
	
	     Memory operations issued after the ACQUIRE will be completed after the
	     ACQUIRE operation has completed.

Documentation/atomic_bitops.txt:
	...
	Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and
	clear_bit_unlock() which has RELEASE semantics.

Setting the tag bit is part of a successful test_and_set_bit_lock(), which has
ACQUIRE semantics, and any memory operations (such as test_bit(INACTIVE)) after
the ACQUIRE will be completed after the ACQUIRE has completed, according to the
two documents above.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-28 17:21             ` Paul E. McKenney
@ 2020-05-29  1:53               ` Ming Lei
  2020-05-29  3:07                 ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-29  1:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Bart Van Assche, Christoph Hellwig, linux-block, John Garry,
	Hannes Reinecke, Thomas Gleixner, linux-kernel

Hi Paul,

Thanks for your response!

On Thu, May 28, 2020 at 10:21:21AM -0700, Paul E. McKenney wrote:
> On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> > On 2020-05-27 22:19, Ming Lei wrote:
> > > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> > >> My understanding is that operations that have acquire semantics pair
> > >> with operations that have release semantics. I haven't been able to find
> > >> any documentation that shows that smp_mb__after_atomic() has release
> > >> semantics. So I looked up its definition. This is what I found:
> > >>
> > >> $ git grep -nH 'define __smp_mb__after_atomic'
> > >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> > >> barrier()
> > >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> > >> smp_llsc_mb()
> > >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> > >> barrier()
> > >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> > >> barrier()
> > >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> > >> } while (0)
> > >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> > >> barrier()
> > >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> > >> __smp_mb()
> > >>
> > >> My interpretation of the above is that not all smp_mb__after_atomic()
> > >> implementations have release semantics. Do you agree with this conclusion?
> > > 
> > > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > > 
> > > So the two pair of OPs are ordered:
> > > 
> > > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > > so the request will be drained.
> > > 
> > > OR
> > > 
> > > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > > the request(tag bit) will be released and retried on another CPU
> > > finally, see __blk_mq_alloc_request().
> > > 
> > > Cc Paul and linux-kernel list.
> > 
> > I do not agree with the above conclusion. My understanding of
> > acquire/release labels is that if the following holds:
> > (1) A store operation that stores the value V into memory location M has
> > a release label.
> > (2) A load operation that reads memory location M has an acquire label.
> > (3) The load operation (2) retrieves the value V that was stored by (1).
> > 
> > that the following ordering property holds: all load and store
> > instructions that happened before the store instruction (1) in program
> > order are guaranteed to happen before the load and store instructions
> > that follow (2) in program order.
> > 
> > In the ARM manual these semantics have been described as follows: "A
> > Store-Release instruction is multicopy atomic when observed with a
> > Load-Acquire instruction".
> > 
> > In this case the load-acquire operation is the
> > "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> > code is executed indirectly by blk_mq_get_tag(). Since there is no
> > matching store-release instruction in __blk_mq_alloc_request() for
> > 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> > not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> > word)" statement from the sbitmap code.
> 
> I feel like I just parachuted into the middle of the conversation,
> so let me start by giving a (silly) example illustrating the limits of
> smp_mb__{before,after}_atomic() that might be tangling things up.
> 
> But please please please avoid doing this in real code unless you have
> an extremely good reason included in a comment.
> 
> void t1(void)
> {
> 	WRITE_ONCE(a, 1);
> 	smp_mb__before_atomic();
> 	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
> 	atomic_inc(&c);
> 	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
> 	smp_mb__after_atomic();
> 	WRITE_ONCE(e, 1);
> }
> 
> void t2(void)
> {
> 	r1 = READ_ONCE(e);
> 	smp_mb();
> 	r2 = READ_ONCE(d);
> 	smp_mb();
> 	r3 = READ_ONCE(c);
> 	smp_mb();
> 	r4 = READ_ONCE(b);
> 	smp_mb();
> 	r5 = READ_ONCE(a);
> }
> 
> Each platform must provide strong ordering for either atomic_inc()
> on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
> on the other (as powerpc does).  Note that both ia64 and powerpc are
> weakly ordered.
> 
> So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
> 1 && r5 == 0).  So clearly smp_mb__{before,after}_atomic() need not have
> any ordering properties whatsoever.
> 
> Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
> as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
> So clearly atomic_inc() need not have any ordering properties whatsoever.
> 
> But the combination of smp_mb__before_atomic() and the later atomic_inc()
> does provide full ordering, so that no architecture can see (r3 == 1 &&
> r5 == 0), and either of r1 or r2 can be substituted for r3.
> 
> Similarly, atomic_inc() and the later smp_mb__after_atomic() also
> provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
> 0), and either r4 or r5 can be substituted for r3.
> 
> 
> So a call to set_bit() followed by a call to smp_mb__after_atomic() will
> provide a full memory barrier (implying release semantics) for any write
> access after the smp_mb__after_atomic() with respect to the set_bit() or
> any access preceding it.  But the set_bit() by itself won't have release
> semantics, nor will the smp_mb__after_atomic(), only their combination
> further combined with some write following the smp_mb__after_atomic().
> 
> More generally, there will be the equivalent of smp_mb() somewhere between
> the set_bit() and every access following the smp_mb__after_atomic().
> 
> Does that help, or am I missing the point?

Yeah, it does help.

BTW, can we replace the smp_mb__after_atomic() with smp_mb() for
ordering set_bit() and the memory OP following the smp_mb()?
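
I.e. (purely illustrative) something like:

	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb();	/* instead of smp_mb__after_atomic() */

in blk_mq_hctx_notify_offline()?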


Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-29  1:53               ` Ming Lei
@ 2020-05-29  3:07                 ` Paul E. McKenney
  2020-05-29  3:53                   ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2020-05-29  3:07 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Christoph Hellwig, linux-block, John Garry,
	Hannes Reinecke, Thomas Gleixner, linux-kernel

On Fri, May 29, 2020 at 09:53:04AM +0800, Ming Lei wrote:
> Hi Paul,
> 
> Thanks for your response!
> 
> On Thu, May 28, 2020 at 10:21:21AM -0700, Paul E. McKenney wrote:
> > On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> > > On 2020-05-27 22:19, Ming Lei wrote:
> > > > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> > > >> My understanding is that operations that have acquire semantics pair
> > > >> with operations that have release semantics. I haven't been able to find
> > > >> any documentation that shows that smp_mb__after_atomic() has release
> > > >> semantics. So I looked up its definition. This is what I found:
> > > >>
> > > >> $ git grep -nH 'define __smp_mb__after_atomic'
> > > >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> > > >> barrier()
> > > >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> > > >> smp_llsc_mb()
> > > >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> > > >> barrier()
> > > >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> > > >> barrier()
> > > >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> > > >> } while (0)
> > > >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> > > >> barrier()
> > > >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> > > >> __smp_mb()
> > > >>
> > > >> My interpretation of the above is that not all smp_mb__after_atomic()
> > > >> implementations have release semantics. Do you agree with this conclusion?
> > > > 
> > > > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > > > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > > > 
> > > > So the two pairs of OPs are ordered:
> > > > 
> > > > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > > > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > > > so the request will be drained.
> > > > 
> > > > OR
> > > > 
> > > > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > > > the request(tag bit) will be released and retried on another CPU
> > > > finally, see __blk_mq_alloc_request().
> > > > 
> > > > Cc Paul and linux-kernel list.
> > > 
> > > I do not agree with the above conclusion. My understanding of
> > > acquire/release labels is that if the following holds:
> > > (1) A store operation that stores the value V into memory location M has
> > > a release label.
> > > (2) A load operation that reads memory location M has an acquire label.
> > > (3) The load operation (2) retrieves the value V that was stored by (1).
> > > 
> > > then the following ordering property holds: all load and store
> > > instructions that happened before the store instruction (1) in program
> > > order are guaranteed to happen before the load and store instructions
> > > that follow (2) in program order.
> > > 
> > > In the ARM manual these semantics have been described as follows: "A
> > > Store-Release instruction is multicopy atomic when observed with a
> > > Load-Acquire instruction".
> > > 
> > > In this case the load-acquire operation is the
> > > "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> > > code is executed indirectly by blk_mq_get_tag(). Since there is no
> > > matching store-release instruction in __blk_mq_alloc_request() for
> > > 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> > > not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> > > word)" statement from the sbitmap code.
> > 
> > I feel like I just parachuted into the middle of the conversation,
> > so let me start by giving a (silly) example illustrating the limits of
> > smp_mb__{before,after}_atomic() that might be tangling things up.
> > 
> > But please please please avoid doing this in real code unless you have
> > an extremely good reason included in a comment.
> > 
> > void t1(void)
> > {
> > 	WRITE_ONCE(a, 1);
> > 	smp_mb__before_atomic();
> > 	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
> > 	atomic_inc(&c);
> > 	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
> > 	smp_mb__after_atomic();
> > 	WRITE_ONCE(e, 1);
> > }
> > 
> > void t2(void)
> > {
> > 	r1 = READ_ONCE(e);
> > 	smp_mb();
> > 	r2 = READ_ONCE(d);
> > 	smp_mb();
> > 	r3 = READ_ONCE(c);
> > 	smp_mb();
> > 	r4 = READ_ONCE(b);
> > 	smp_mb();
> > 	r5 = READ_ONCE(a);
> > }
> > 
> > Each platform must provide strong ordering for either atomic_inc()
> > on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
> > on the other (as powerpc does).  Note that both ia64 and powerpc are
> > weakly ordered.
> > 
> > So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
> > 1 && r5 == 0).  So clearly smp_mb__{before,after}_atomic() need not have
> > any ordering properties whatsoever.
> > 
> > Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
> > as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
> > So clearly atomic_inc() need not have any ordering properties whatsoever.
> > 
> > But the combination of smp_mb__before_atomic() and the later atomic_inc()
> > does provide full ordering, so that no architecture can see (r3 == 1 &&
> > r5 == 0), and either of r1 or r2 can be substituted for r3.
> > 
> > Similarly, atomic_inc() and the later smp_mb__after_atomic() also
> > provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
> > 0), and either r4 or r5 can be substituted for r3.
> > 
> > 
> > So a call to set_bit() followed by a call to smp_mb__after_atomic() will
> > provide a full memory barrier (implying release semantics) for any write
> > access after the smp_mb__after_atomic() with respect to the set_bit() or
> > any access preceding it.  But the set_bit() by itself won't have release
> > semantics, nor will the smp_mb__after_atomic(), only their combination
> > further combined with some write following the smp_mb__after_atomic().
> > 
> > More generally, there will be the equivalent of smp_mb() somewhere between
> > the set_bit() and every access following the smp_mb__after_atomic().
> > 
> > Does that help, or am I missing the point?
> 
> Yeah, it does help.
> 
> BTW, can we replace the smp_mb__after_atomic() with smp_mb() to order
> set_bit() against the memory OP that follows the smp_mb()?

Placing an smp_mb() between set_bit() and a later access will indeed
order set_bit() with that later access.
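
That is (a generic sketch; nr, word and x are just stand-ins, not the
blk-mq code):

	set_bit(nr, &word);
	smp_mb();
	r1 = READ_ONCE(x);	/* ordered after the set_bit() store */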

That said, I don't know this code well enough to say whether or not
that ordering is sufficient.

						Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-29  3:07                 ` Paul E. McKenney
@ 2020-05-29  3:53                   ` Ming Lei
  2020-05-29 18:13                     ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-29  3:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Bart Van Assche, Christoph Hellwig, linux-block, John Garry,
	Hannes Reinecke, Thomas Gleixner, linux-kernel

Hi Paul,

On Thu, May 28, 2020 at 08:07:28PM -0700, Paul E. McKenney wrote:
> On Fri, May 29, 2020 at 09:53:04AM +0800, Ming Lei wrote:
> > Hi Paul,
> > 
> > Thanks for your response!
> > 
> > On Thu, May 28, 2020 at 10:21:21AM -0700, Paul E. McKenney wrote:
> > > On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> > > > On 2020-05-27 22:19, Ming Lei wrote:
> > > > > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> > > > >> My understanding is that operations that have acquire semantics pair
> > > > >> with operations that have release semantics. I haven't been able to find
> > > > >> any documentation that shows that smp_mb__after_atomic() has release
> > > > >> semantics. So I looked up its definition. This is what I found:
> > > > >>
> > > > >> $ git grep -nH 'define __smp_mb__after_atomic'
> > > > >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> > > > >> barrier()
> > > > >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> > > > >> smp_llsc_mb()
> > > > >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> > > > >> barrier()
> > > > >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> > > > >> barrier()
> > > > >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> > > > >> } while (0)
> > > > >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> > > > >> barrier()
> > > > >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> > > > >> __smp_mb()
> > > > >>
> > > > >> My interpretation of the above is that not all smp_mb__after_atomic()
> > > > >> implementations have release semantics. Do you agree with this conclusion?
> > > > > 
> > > > > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > > > > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > > > > 
> > > > > So the two pairs of OPs are ordered:
> > > > > 
> > > > > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > > > > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > > > > so the request will be drained.
> > > > > 
> > > > > OR
> > > > > 
> > > > > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > > > > the request(tag bit) will be released and retried on another CPU
> > > > > finally, see __blk_mq_alloc_request().
> > > > > 
> > > > > Cc Paul and linux-kernel list.
> > > > 
> > > > I do not agree with the above conclusion. My understanding of
> > > > acquire/release labels is that if the following holds:
> > > > (1) A store operation that stores the value V into memory location M has
> > > > a release label.
> > > > (2) A load operation that reads memory location M has an acquire label.
> > > > (3) The load operation (2) retrieves the value V that was stored by (1).
> > > > 
> > > > then the following ordering property holds: all load and store
> > > > instructions that happened before the store instruction (1) in program
> > > > order are guaranteed to happen before the load and store instructions
> > > > that follow (2) in program order.
> > > > 
> > > > In the ARM manual these semantics have been described as follows: "A
> > > > Store-Release instruction is multicopy atomic when observed with a
> > > > Load-Acquire instruction".
> > > > 
> > > > In this case the load-acquire operation is the
> > > > "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> > > > code is executed indirectly by blk_mq_get_tag(). Since there is no
> > > > matching store-release instruction in __blk_mq_alloc_request() for
> > > > 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> > > > not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> > > > word)" statement from the sbitmap code.
> > > 
> > > I feel like I just parachuted into the middle of the conversation,
> > > so let me start by giving a (silly) example illustrating the limits of
> > > smp_mb__{before,after}_atomic() that might be tangling things up.
> > > 
> > > But please please please avoid doing this in real code unless you have
> > > an extremely good reason included in a comment.
> > > 
> > > void t1(void)
> > > {
> > > 	WRITE_ONCE(a, 1);
> > > 	smp_mb__before_atomic();
> > > 	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
> > > 	atomic_inc(&c);
> > > 	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
> > > 	smp_mb__after_atomic();
> > > 	WRITE_ONCE(e, 1);
> > > }
> > > 
> > > void t2(void)
> > > {
> > > 	r1 = READ_ONCE(e);
> > > 	smp_mb();
> > > 	r2 = READ_ONCE(d);
> > > 	smp_mb();
> > > 	r3 = READ_ONCE(c);
> > > 	smp_mb();
> > > 	r4 = READ_ONCE(b);
> > > 	smp_mb();
> > > 	r5 = READ_ONCE(a);
> > > }
> > > 
> > > Each platform must provide strong ordering for either atomic_inc()
> > > on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
> > > on the other (as powerpc does).  Note that both ia64 and powerpc are
> > > weakly ordered.
> > > 
> > > So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
> > > 1 && r5 == 0).  So clearly smp_mb__{before,after}_atomic() need not have
> > > any ordering properties whatsoever.
> > > 
> > > Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
> > > as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
> > > So clearly atomic_inc() need not have any ordering properties whatsoever.
> > > 
> > > But the combination of smp_mb__before_atomic() and the later atomic_inc()
> > > does provide full ordering, so that no architecture can see (r3 == 1 &&
> > > r5 == 0), and either of r1 or r2 can be substituted for r3.
> > > 
> > > Similarly, atomic_inc() and the later smp_mb__after_atomic() also
> > > provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
> > > 0), and either r4 or r5 can be substituted for r3.
> > > 
> > > 
> > > So a call to set_bit() followed by a call to smp_mb__after_atomic() will
> > > provide a full memory barrier (implying release semantics) for any write
> > > access after the smp_mb__after_atomic() with respect to the set_bit() or
> > > any access preceding it.  But the set_bit() by itself won't have release
> > > semantics, nor will the smp_mb__after_atomic(), only their combination
> > > further combined with some write following the smp_mb__after_atomic().
> > > 
> > > More generally, there will be the equivalent of smp_mb() somewhere between
> > > the set_bit() and every access following the smp_mb__after_atomic().
> > > 
> > > Does that help, or am I missing the point?
> > 
> > Yeah, it does help.
> > 
> > BTW, can we replace the smp_mb__after_atomic() with smp_mb() to order
> > set_bit() against the memory OP that follows the smp_mb()?
> 
> Placing an smp_mb() between set_bit() and a later access will indeed
> order set_bit() with that later access.
> 
> That said, I don't know this code well enough to say whether or not
> that ordering is sufficient.

Another pair is in blk_mq_get_tag(), and we expect the following two
memory OPs to be ordered:

1) set bit in successful test_and_set_bit_lock(), which is called
from sbitmap_get()

2) test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)

Do you think that the above two OPs are ordered?
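
In litmus form, the case I am worried about is roughly the following
(a sketch of my reading only, with 'word' standing for the sbitmap tag
word and 'state' for hctx->state; this is not the literal blk-mq code):

void alloc_side(void)			/* blk_mq_get_tag() */
{
	r1 = test_and_set_bit_lock(nr, &word);		/* 1) take the tag */
	r2 = test_bit(BLK_MQ_S_INACTIVE, &state);	/* 2) */
}

void offline_side(void)			/* blk_mq_hctx_notify_offline() */
{
	set_bit(BLK_MQ_S_INACTIVE, &state);
	smp_mb__after_atomic();
	r3 = test_bit(nr, &word);	/* drain loop looks for the tag */
}

i.e. whether the outcome (r1 == 0 && r2 == 0 && r3 == 0) -- the tag is
taken, INACTIVE is not observed on the allocation side, and the drain
loop does not see the tag -- is forbidden.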

Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-27 20:31   ` John Garry
@ 2020-05-29 13:26     ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:26 UTC (permalink / raw)
  To: John Garry
  Cc: Bart Van Assche, Christoph Hellwig, linux-block, Hannes Reinecke,
	Thomas Gleixner

On Wed, May 27, 2020 at 09:31:30PM +0100, John Garry wrote:
>> Thanks for having prepared and posted this new patch series. After v3
>> was posted and before v4 was posted I had a closer look at the IRQ core.
>> My conclusions (which may be incorrect) are as follows:
>> * The only function that sets the 'is_managed' member of struct
>>    irq_affinity_desc to 1 is irq_create_affinity_masks().
>> * There are two ways to cause that function to be called: setting the
>>    PCI_IRQ_AFFINITY flag when calling pci_alloc_irq_vectors_affinity() or
>>    passing the 'affd' argument. pci_alloc_irq_vectors() calls
>>    pci_alloc_irq_vectors_affinity().
>> * The following drivers pass an affinity domain argument when allocating
>>    interrupts: virtio_blk, nvme, be2iscsi, csiostor, hisi_sas, megaraid,
>>    mpt3sas, qla2xxx, virtio_scsi.
>> * The following drivers set the PCI_IRQ_AFFINITY flag but do not pass an
>>    affinity domain: aacraid, hpsa, lpfc, smartqpi, virtio_pci_common.
>>
>> What is not clear to me is why managed interrupts are shut down if the
>> last CPU in their affinity mask is shut down? Has it been considered to
>> modify the IRQ core such that managed PCIe interrupts are assigned to
>> another CPU if the last CPU in their affinity mask is shut down? 
>
> I think Thomas answered that here already:
> https://lore.kernel.org/lkml/alpine.DEB.2.21.1901291717370.1513@nanos.tec.linutronix.de/
>
> (vector space exhaustion)

Exactly.
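
For reference, the two variants described in Bart's summary above look
roughly like this in a driver (an illustrative sketch only; pdev,
nr_queues and nvecs are placeholders):

	/* with an explicit affinity descriptor */
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. one non-managed admin vector */
	};
	int nvecs;

	nvecs = pci_alloc_irq_vectors_affinity(pdev, 1, nr_queues + 1,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);

	/* or just the flag, without an affinity descriptor */
	nvecs = pci_alloc_irq_vectors(pdev, 1, nr_queues,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);

As I understand the summary, either way the IRQ core ends up in
irq_create_affinity_masks(), which spreads the vectors and marks the
spread ones as managed.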

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-29  3:53                   ` Ming Lei
@ 2020-05-29 18:13                     ` Paul E. McKenney
  2020-05-29 19:55                       ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2020-05-29 18:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Christoph Hellwig, linux-block, John Garry,
	Hannes Reinecke, Thomas Gleixner, linux-kernel

On Fri, May 29, 2020 at 11:53:15AM +0800, Ming Lei wrote:
> Hi Paul,
> 
> On Thu, May 28, 2020 at 08:07:28PM -0700, Paul E. McKenney wrote:
> > On Fri, May 29, 2020 at 09:53:04AM +0800, Ming Lei wrote:
> > > Hi Paul,
> > > 
> > > Thanks for your response!
> > > 
> > > On Thu, May 28, 2020 at 10:21:21AM -0700, Paul E. McKenney wrote:
> > > > On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> > > > > On 2020-05-27 22:19, Ming Lei wrote:
> > > > > > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> > > > > >> My understanding is that operations that have acquire semantics pair
> > > > > >> with operations that have release semantics. I haven't been able to find
> > > > > >> any documentation that shows that smp_mb__after_atomic() has release
> > > > > >> semantics. So I looked up its definition. This is what I found:
> > > > > >>
> > > > > >> $ git grep -nH 'define __smp_mb__after_atomic'
> > > > > >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> > > > > >> barrier()
> > > > > >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> > > > > >> smp_llsc_mb()
> > > > > >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> > > > > >> barrier()
> > > > > >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> > > > > >> barrier()
> > > > > >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> > > > > >> } while (0)
> > > > > >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> > > > > >> barrier()
> > > > > >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> > > > > >> __smp_mb()
> > > > > >>
> > > > > >> My interpretation of the above is that not all smp_mb__after_atomic()
> > > > > >> implementations have release semantics. Do you agree with this conclusion?
> > > > > > 
> > > > > > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > > > > > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > > > > > 
> > > > > > So the two pairs of OPs are ordered:
> > > > > > 
> > > > > > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > > > > > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > > > > > so the request will be drained.
> > > > > > 
> > > > > > OR
> > > > > > 
> > > > > > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > > > > > the request(tag bit) will be released and retried on another CPU
> > > > > > finally, see __blk_mq_alloc_request().
> > > > > > 
> > > > > > Cc Paul and linux-kernel list.
> > > > > 
> > > > > I do not agree with the above conclusion. My understanding of
> > > > > acquire/release labels is that if the following holds:
> > > > > (1) A store operation that stores the value V into memory location M has
> > > > > a release label.
> > > > > (2) A load operation that reads memory location M has an acquire label.
> > > > > (3) The load operation (2) retrieves the value V that was stored by (1).
> > > > > 
> > > > > then the following ordering property holds: all load and store
> > > > > instructions that happened before the store instruction (1) in program
> > > > > order are guaranteed to happen before the load and store instructions
> > > > > that follow (2) in program order.
> > > > > 
> > > > > In the ARM manual these semantics have been described as follows: "A
> > > > > Store-Release instruction is multicopy atomic when observed with a
> > > > > Load-Acquire instruction".
> > > > > 
> > > > > In this case the load-acquire operation is the
> > > > > "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> > > > > code is executed indirectly by blk_mq_get_tag(). Since there is no
> > > > > matching store-release instruction in __blk_mq_alloc_request() for
> > > > > 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> > > > > not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> > > > > word)" statement from the sbitmap code.
> > > > 
> > > > I feel like I just parachuted into the middle of the conversation,
> > > > so let me start by giving a (silly) example illustrating the limits of
> > > > smp_mb__{before,after}_atomic() that might be tangling things up.
> > > > 
> > > > But please please please avoid doing this in real code unless you have
> > > > an extremely good reason included in a comment.
> > > > 
> > > > void t1(void)
> > > > {
> > > > 	WRITE_ONCE(a, 1);
> > > > 	smp_mb__before_atomic();
> > > > 	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
> > > > 	atomic_inc(&c);
> > > > 	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
> > > > 	smp_mb__after_atomic();
> > > > 	WRITE_ONCE(e, 1);
> > > > }
> > > > 
> > > > void t2(void)
> > > > {
> > > > 	r1 = READ_ONCE(e);
> > > > 	smp_mb();
> > > > 	r2 = READ_ONCE(d);
> > > > 	smp_mb();
> > > > 	r3 = READ_ONCE(c);
> > > > 	smp_mb();
> > > > 	r4 = READ_ONCE(b);
> > > > 	smp_mb();
> > > > 	r5 = READ_ONCE(a);
> > > > }
> > > > 
> > > > Each platform must provide strong ordering for either atomic_inc()
> > > > on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
> > > > on the other (as powerpc does).  Note that both ia64 and powerpc are
> > > > weakly ordered.
> > > > 
> > > > So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
> > > > 1 && r5 == 0).  So clearly smp_mb__{before,after}_atomic() need not have
> > > > any ordering properties whatsoever.
> > > > 
> > > > Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
> > > > as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
> > > > So clearly atomic_inc() need not have any ordering properties whatsoever.
> > > > 
> > > > But the combination of smp_mb__before_atomic() and the later atomic_inc()
> > > > does provide full ordering, so that no architecture can see (r3 == 1 &&
> > > > r5 == 0), and either of r1 or r2 can be substituted for r3.
> > > > 
> > > > Similarly, atomic_inc() and the later smp_mb__after_atomic() also
> > > > provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
> > > > 0), and either r4 or r5 can be substituted for r3.
> > > > 
> > > > 
> > > > So a call to set_bit() followed by a call to smp_mb__after_atomic() will
> > > > provide a full memory barrier (implying release semantics) for any write
> > > > access after the smp_mb__after_atomic() with respect to the set_bit() or
> > > > any access preceding it.  But the set_bit() by itself won't have release
> > > > semantics, nor will the smp_mb__after_atomic(), only their combination
> > > > further combined with some write following the smp_mb__after_atomic().
> > > > 
> > > > More generally, there will be the equivalent of smp_mb() somewhere between
> > > > the set_bit() and every access following the smp_mb__after_atomic().
> > > > 
> > > > Does that help, or am I missing the point?
> > > 
> > > Yeah, it does help.
> > > 
> > > BTW, can we replace the smp_mb__after_atomic() with smp_mb() to order
> > > set_bit() against the memory OP that follows the smp_mb()?
> > 
> > Placing an smp_mb() between set_bit() and a later access will indeed
> > order set_bit() with that later access.
> > 
> > That said, I don't know this code well enough to say whether or not
> > that ordering is sufficient.
> 
> Another pair is in blk_mq_get_tag(), and we expect the following two
> memory OPs to be ordered:
> 
> 1) set bit in successful test_and_set_bit_lock(), which is called
> from sbitmap_get()
> 
> 2) test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)
> 
> Do you think that the above two OPs are ordered?

Given that he has been through the code, I would like to hear Bart's
thoughts, actually.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-29 18:13                     ` Paul E. McKenney
@ 2020-05-29 19:55                       ` Bart Van Assche
  2020-05-29 21:12                         ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-29 19:55 UTC (permalink / raw)
  To: paulmck, Ming Lei
  Cc: Christoph Hellwig, linux-block, John Garry, Hannes Reinecke,
	Thomas Gleixner, linux-kernel

On 2020-05-29 11:13, Paul E. McKenney wrote:
> On Fri, May 29, 2020 at 11:53:15AM +0800, Ming Lei wrote:
>> Another pair is in blk_mq_get_tag(), and we expect the following two
>> memory OPs to be ordered:
>>
>> 1) set bit in successful test_and_set_bit_lock(), which is called
>> from sbitmap_get()
>>
>> 2) test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)
>>
>> Do you think that the above two OPs are ordered?
> 
> Given that he has been through the code, I would like to hear Bart's
> thoughts, actually.

Hi Paul,

My understanding of the involved instructions is as follows (see also
https://lore.kernel.org/linux-block/b98f055f-6f38-a47c-965d-b6bcf4f5563f@huawei.com/T/#t
for the entire e-mail thread):
* blk_mq_hctx_notify_offline() sets the BLK_MQ_S_INACTIVE bit in
hctx->state, calls smp_mb__after_atomic() and waits in a loop until all
tags have been freed. Each tag is an integer number that has a 1:1
correspondence with a block layer request structure. The code that
iterates over block layer request tags relies on
__sbitmap_for_each_set(). That function examines both the 'word' and
'cleared' members of struct sbitmap_word.
* What blk_mq_hctx_notify_offline() waits for is freeing of tags by
blk_mq_put_tag(). blk_mq_put_tag() frees a tag by setting a bit in
sbitmap_word.cleared (see also sbitmap_deferred_clear_bit()).
* Tag allocation by blk_mq_get_tag() relies on test_and_set_bit_lock().
The actual allocation happens by sbitmap_get() that sets a bit in
sbitmap_word.word. blk_mq_get_tag() tests the BLK_MQ_S_INACTIVE bit
after tag allocation succeeded.

What confuses me is that blk_mq_hctx_notify_offline() uses
smp_mb__after_atomic() to enforce the order of memory accesses while
blk_mq_get_tag() relies on the acquire semantics of
test_and_set_bit_lock(). Usually ordering is enforced by combining two
smp_mb() calls or by combining a store-release with a load-acquire.
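
For comparison, the release/acquire pairing I would normally expect
looks like this (a generic message-passing sketch with stand-in
variables, not this code):

void producer(void)
{
	WRITE_ONCE(data, 1);
	smp_store_release(&flag, 1);
}

void consumer(void)
{
	r1 = smp_load_acquire(&flag);
	r2 = READ_ONCE(data);	/* r1 == 1 guarantees r2 == 1 */
}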

Does the Linux memory model provide the expected ordering guarantees
when combining load-acquire with smp_mb__after_atomic() as used in patch
8/8 of this series?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline
  2020-05-29 19:55                       ` Bart Van Assche
@ 2020-05-29 21:12                         ` Paul E. McKenney
  0 siblings, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2020-05-29 21:12 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ming Lei, Christoph Hellwig, linux-block, John Garry,
	Hannes Reinecke, Thomas Gleixner, linux-kernel

On Fri, May 29, 2020 at 12:55:43PM -0700, Bart Van Assche wrote:
> On 2020-05-29 11:13, Paul E. McKenney wrote:
> > On Fri, May 29, 2020 at 11:53:15AM +0800, Ming Lei wrote:
> >> Another pair is in blk_mq_get_tag(), and we expect the following two
> >> memory OPs to be ordered:
> >>
> >> 1) set bit in successful test_and_set_bit_lock(), which is called
> >> from sbitmap_get()
> >>
> >> 2) test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)
> >>
> >> Do you think that the above two OPs are ordered?
> > 
> > Given that he has been through the code, I would like to hear Bart's
> > thoughts, actually.
> 
> Hi Paul,
> 
> My understanding of the involved instructions is as follows (see also
> https://lore.kernel.org/linux-block/b98f055f-6f38-a47c-965d-b6bcf4f5563f@huawei.com/T/#t
> for the entire e-mail thread):
> * blk_mq_hctx_notify_offline() sets the BLK_MQ_S_INACTIVE bit in
> hctx->state, calls smp_mb__after_atomic() and waits in a loop until all
> tags have been freed. Each tag is an integer number that has a 1:1
> correspondence with a block layer request structure. The code that
> iterates over block layer request tags relies on
> __sbitmap_for_each_set(). That function examines both the 'word' and
> 'cleared' members of struct sbitmap_word.
> * What blk_mq_hctx_notify_offline() waits for is freeing of tags by
> blk_mq_put_tag(). blk_mq_put_tag() frees a tag by setting a bit in
> sbitmap_word.cleared (see also sbitmap_deferred_clear_bit()).
> * Tag allocation by blk_mq_get_tag() relies on test_and_set_bit_lock().
> The actual allocation happens by sbitmap_get() that sets a bit in
> sbitmap_word.word. blk_mq_get_tag() tests the BLK_MQ_S_INACTIVE bit
> after tag allocation succeeded.
> 
> What confuses me is that blk_mq_hctx_notify_offline() uses
> smp_mb__after_atomic() to enforce the order of memory accesses while
> blk_mq_get_tag() relies on the acquire semantics of
> test_and_set_bit_lock(). Usually ordering is enforced by combining two
> smp_mb() calls or by combining a store-release with a load-acquire.
> 
> Does the Linux memory model provide the expected ordering guarantees
> when combining load-acquire with smp_mb__after_atomic() as used in patch
> 8/8 of this series?

Strictly speaking, smp_mb__after_atomic() works only in combination
with a non-value-returning atomic operation. Let's look at a (silly)
example where smp_mb__after_atomic() would not help in conjunction
with smp_store_release():

void thread1(void)
{
	smp_store_release(&x, 1);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);
}

void thread2(void)
{
	smp_store_release(&y, 1);
	smp_mb__after_atomic();
	r2 = READ_ONCE(x);
}

Even on x86 (or perhaps especially on x86) it is quite possible that
execution could end with r1 == r2 == 0 because on x86 there is no
ordering whatsoever from smp_mb__after_atomic().  In this case,
the CPU is well within its rights to reorder each thread's store
with its later load.  Yes, even x86.

On the other hand, suppose that the stores are non-value-returning
atomics:

void thread1(void)
{
	atomic_inc(&x);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);
}

void thread2(void)
{
	atomic_inc(&y);
	smp_mb__after_atomic();
	r2 = READ_ONCE(x);
}

In this case, for all architectures, there would be the equivalent
of an smp_mb() full barrier associated with either the atomic_inc()
or the smp_mb__after_atomic(), which would rule out the case where
execution ends with r1 == r2 == 0.
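
The same holds if the non-value-returning atomic is a bitop, which is
closer to the case at hand (again just a sketch with stand-in
variables):

void thread1(void)
{
	set_bit(0, &x);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);
}

void thread2(void)
{
	set_bit(0, &y);
	smp_mb__after_atomic();
	r2 = READ_ONCE(x);
}

Here, too, the outcome r1 == 0 && r2 == 0 is ruled out, because
set_bit() is a non-value-returning atomic read-modify-write.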

Does that help?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: blk-mq: improvement CPU hotplug (simplified version) v4
  2020-05-29 13:53 Christoph Hellwig
@ 2020-05-29 16:23 ` Jens Axboe
  0 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2020-05-29 16:23 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

On 5/29/20 7:53 AM, Christoph Hellwig wrote:
> Hi all,
> 
> this series ensures I/O is quiesced before a cpu and thus the managed
> interrupt handler is shut down.
> 
> This patchset tries to address the issue by the following approach:
> 
>  - before the last cpu in hctx->cpumask is going to offline, mark this
>    hctx as inactive
> 
>  - disable preempt during allocating tag for request, and after tag is
>    allocated, check if this hctx is inactive. If yes, give up the
>    allocation and try remote allocation from online CPUs
> 
>  - before hctx becomes inactive, drain all allocated requests on this
>    hctx
> 
> The guts of the changes are from Ming Lei, I just did a bunch of prep
> cleanups so that they can fit in more nicely.  The series also depends
> on my "avoid a few q_usage_counter roundtrips v3" series.
> 
> Thanks John Garry for running lots of tests on arm64 with this previous
> version patches and co-working on investigating all kinds of issues.

Applied, thanks.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 41+ messages in thread

* blk-mq: improvement CPU hotplug (simplified version) v4
@ 2020-05-29 13:53 Christoph Hellwig
  2020-05-29 16:23 ` Jens Axboe
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2020-05-29 13:53 UTC (permalink / raw)
  To: Jens Axboe, Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Thomas Gleixner

Hi all,

this series ensures I/O is quiesced before a cpu and thus the managed
interrupt handler is shut down.

This patchset tries to address the issue by the following approach:

 - before the last cpu in hctx->cpumask is going to offline, mark this
   hctx as inactive

 - disable preempt during allocating tag for request, and after tag is
   allocated, check if this hctx is inactive. If yes, give up the
   allocation and try remote allocation from online CPUs

 - before hctx becomes inactive, drain all allocated requests on this
   hctx

The guts of the changes are from Ming Lei, I just did a bunch of prep
cleanups so that they can fit in more nicely.  The series also depends
on my "avoid a few q_usage_counter roundtrips v3" series.

Thanks John Garry for running lots of tests on arm64 with this previous
version patches and co-working on investigating all kinds of issues.

A git tree is available here:

    git://git.infradead.org/users/hch/block.git blk-mq-hotplug.3

Gitweb:

    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-mq-hotplug.3

Changes since v4:
  - move the blk_mq_all_tag_iter kerneldoc comment to make it clear what is
    documented
  - add a WARN_ON_ONCE
  - use BLK_MQ_NO_TAG in one more place instead of -1
  - remove the PF_KTHREAD special casing

Changes since v3:
  - add two new patches to clean up the magic -1 tag values
  - improve a few commit messages and comments
  - cleanup the blk_mq_all_tag_iter implementation
  - add a msleep to the cpu hot unplug case in __blk_mq_alloc_request

Changes since v2:
  - don't disable preemption and use smp calls


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2020-05-29 21:12 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
2020-05-27 18:06 blk-mq: improvement CPU hotplug (simplified version) v4 Christoph Hellwig
2020-05-27 18:06 ` [PATCH 1/8] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
2020-05-27 18:16   ` Johannes Thumshirn
2020-05-27 18:06 ` [PATCH 2/8] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
2020-05-27 18:17   ` Johannes Thumshirn
2020-05-27 18:06 ` [PATCH 3/8] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
2020-05-27 18:16   ` Hannes Reinecke
2020-05-28  9:50   ` Johannes Thumshirn
2020-05-27 18:06 ` [PATCH 4/8] blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG Christoph Hellwig
2020-05-27 18:14   ` Johannes Thumshirn
2020-05-27 18:17   ` Hannes Reinecke
2020-05-27 22:38   ` Bart Van Assche
2020-05-27 18:06 ` [PATCH 5/8] blk-mq: use BLK_MQ_NO_TAG in more places Christoph Hellwig
2020-05-27 18:15   ` Johannes Thumshirn
2020-05-27 18:18   ` Hannes Reinecke
2020-05-27 22:38   ` Bart Van Assche
2020-05-27 18:06 ` [PATCH 6/8] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
2020-05-27 18:06 ` [PATCH 7/8] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
2020-05-27 18:21   ` Hannes Reinecke
2020-05-27 22:52   ` Bart Van Assche
2020-05-27 18:06 ` [PATCH 8/8] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
2020-05-27 18:26   ` Hannes Reinecke
2020-05-27 23:09   ` Bart Van Assche
2020-05-28  1:46     ` Ming Lei
2020-05-28  3:33       ` Bart Van Assche
2020-05-28  5:19         ` Ming Lei
2020-05-28 13:37           ` Bart Van Assche
2020-05-28 17:21             ` Paul E. McKenney
2020-05-29  1:53               ` Ming Lei
2020-05-29  3:07                 ` Paul E. McKenney
2020-05-29  3:53                   ` Ming Lei
2020-05-29 18:13                     ` Paul E. McKenney
2020-05-29 19:55                       ` Bart Van Assche
2020-05-29 21:12                         ` Paul E. McKenney
2020-05-29  1:13             ` Ming Lei
2020-05-27 20:07 ` blk-mq: improvement CPU hotplug (simplified version) v4 Bart Van Assche
2020-05-27 20:31   ` John Garry
2020-05-29 13:26     ` Christoph Hellwig
2020-05-28  8:29 ` John Garry
2020-05-29 13:53 Christoph Hellwig
2020-05-29 16:23 ` Jens Axboe
