* [PATCH V10 00/11] blk-mq: improvement CPU hotplug
@ 2020-05-05  2:09 Ming Lei
  2020-05-05  2:09 ` [PATCH V10 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
                   ` (11 more replies)
  0 siblings, 12 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

However, neither drivers nor blk-mq do that before one hctx becomes inactive
(i.e. all CPUs of one hctx are offline), and even worse, blk-mq still tries
to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset addresses the issue in two stages (a condensed sketch of the
resulting hotplug handling follows this list):

1) add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as inactive, and drain all in-flight requests
if the hctx is going to become dead

2) re-submit IO from the CPUHP_BLK_MQ_DEAD handler after the hctx becomes dead

- steal bios from the request and resubmit them via generic_make_request();
these IOs will then be mapped to other live hctxs for dispatch
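
A condensed sketch of what the stage 1) offline handler ends up doing (names
taken from patches 06/11 and 07/11, details trimmed, so treat this as an
illustration rather than the exact code):

static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	/* only act when @cpu is the last online CPU mapped to this hctx */
	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
		return 0;

	/* stop taking new driver tags, then wait for in-flight requests */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb__after_atomic();
	while (blk_mq_tags_inflight_rqs(hctx))
		msleep(5);
	return 0;
}

Stage 2) then runs from the CPUHP_BLK_MQ_DEAD handler, which re-submits any
remaining requests of the dead hctx (see patch 10/11).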

Thanks to John Garry for running lots of tests on arm64 with this patchset
and for co-investigating all kinds of issues.

Thanks to Christoph for his reviews of V7 & V8.

Please comment & review, thanks!

https://github.com/ming1/linux/commits/v5.7-rc-blk-mq-improve-cpu-hotplug

V10:
	- fix double bio completion in request resubmission (10/11)
	- add tested-by tag

V9:
	- add Reviewed-by tag
	- document more on memory barrier usage between getting the driver tag
	and handling CPU offline (7/11)
	- small code cleanup as suggested by Christoph (7/11)
	- rebase against for-5.8/block(1/11, 2/11)
V8:
	- add patches to share code with blk_rq_prep_clone
	- code re-organization as suggested by Christoph, most of it in
	04/11 and 10/11
	- add reviewed-by tag

V7:
	- fix updating .nr_active in get_driver_tag
	- add hctx->cpumask check in cpuhp handler
	- only drain requests whose tag is >= 0
	- pass more aggressive CPU hotplug & IO tests

V6:
	- simplify getting driver tag, so that we can drain in-flight
	  requests correctly without using synchronize_rcu()
	- handle re-submission of flush & passthrough request correctly

V5:
	- rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
	- re-factor code for re-submit requests in cpu dead hotplug handler
	- address requeue corner case

V4:
	- resubmit IOs in the dispatch list in case this hctx is dead

V3:
	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
	- fix patch 4 for avoiding potential deadlock, as found by Hannes

V2:
	- patch 4 & patch 5 in V1 have been merged into the block tree, so
	  remove them
	- address comments from John Garry and Minwoo

Ming Lei (11):
  block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone
  block: add helper for copying request
  blk-mq: mark blk_mq_get_driver_tag as static
  blk-mq: assign rq->tag in blk_mq_get_driver_tag
  blk-mq: support rq filter callback when iterating rqs
  blk-mq: prepare for draining IO when hctx's all CPUs are offline
  blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  block: add blk_end_flush_machinery
  blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead
  blk-mq: re-submit IO in case that hctx is inactive
  block: deactivate hctx when the hctx is actually inactive

 block/blk-core.c           |  27 ++-
 block/blk-flush.c          | 141 ++++++++++++---
 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |  39 ++--
 block/blk-mq-tag.h         |   4 +
 block/blk-mq.c             | 356 +++++++++++++++++++++++++++++--------
 block/blk-mq.h             |  22 ++-
 block/blk.h                |  11 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   6 +
 include/linux/cpuhotplug.h |   1 +
 12 files changed, 481 insertions(+), 132 deletions(-)

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
-- 
2.25.2



* [PATCH V10 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-05  2:09 ` [PATCH V10 02/11] block: add helper for copying request Ming Lei
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	Mike Snitzer, dm-devel, Hannes Reinecke, Martin K . Petersen

So far blk_rq_prep_clone() is only used to set up one underlying cloned
request from a dm-rq request. Block integrity can be enabled for both dm-rq
and the underlying queues, so it is reasonable to clone the rq's
nr_integrity_segments. Also, write_hint comes from the bio, so it should be
cloned too.

So clone nr_integrity_segments and write_hint in blk_rq_prep_clone.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7f11560bfddb..1fe73051fec3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1630,8 +1630,12 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 		rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
 		rq->special_vec = rq_src->special_vec;
 	}
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
+#endif
 	rq->nr_phys_segments = rq_src->nr_phys_segments;
 	rq->ioprio = rq_src->ioprio;
+	rq->write_hint = rq_src->write_hint;
 
 	return 0;
 
-- 
2.25.2



* [PATCH V10 02/11] block: add helper for copying request
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
  2020-05-05  2:09 ` [PATCH V10 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-05  2:09 ` [PATCH V10 03/11] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	Mike Snitzer, dm-devel, Hannes Reinecke, Martin K . Petersen

Add a new helper, blk_rq_copy_request(), for copying a request. The helper
will be used later in this patchset for re-submitting requests, so make it
a block layer internal helper.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 31 ++++++++++++++++++-------------
 block/blk.h      |  2 ++
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 1fe73051fec3..ec50d7e6be21 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1581,6 +1581,23 @@ void blk_rq_unprep_clone(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_rq_unprep_clone);
 
+void blk_rq_copy_request(struct request *rq, struct request *rq_src)
+{
+	/* Copy attributes of the original request to the clone request. */
+	rq->__sector = blk_rq_pos(rq_src);
+	rq->__data_len = blk_rq_bytes(rq_src);
+	if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
+		rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
+		rq->special_vec = rq_src->special_vec;
+	}
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
+#endif
+	rq->nr_phys_segments = rq_src->nr_phys_segments;
+	rq->ioprio = rq_src->ioprio;
+	rq->write_hint = rq_src->write_hint;
+}
+
 /**
  * blk_rq_prep_clone - Helper function to setup clone request
  * @rq: the request to be setup
@@ -1623,19 +1640,7 @@ int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 			rq->bio = rq->biotail = bio;
 	}
 
-	/* Copy attributes of the original request to the clone request. */
-	rq->__sector = blk_rq_pos(rq_src);
-	rq->__data_len = blk_rq_bytes(rq_src);
-	if (rq_src->rq_flags & RQF_SPECIAL_PAYLOAD) {
-		rq->rq_flags |= RQF_SPECIAL_PAYLOAD;
-		rq->special_vec = rq_src->special_vec;
-	}
-#ifdef CONFIG_BLK_DEV_INTEGRITY
-	rq->nr_integrity_segments = rq_src->nr_integrity_segments;
-#endif
-	rq->nr_phys_segments = rq_src->nr_phys_segments;
-	rq->ioprio = rq_src->ioprio;
-	rq->write_hint = rq_src->write_hint;
+	blk_rq_copy_request(rq, rq_src);
 
 	return 0;
 
diff --git a/block/blk.h b/block/blk.h
index 73bd3b1c6938..73dd86dc8f47 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -120,6 +120,8 @@ static inline void blk_rq_bio_prep(struct request *rq, struct bio *bio,
 		rq->rq_disk = bio->bi_disk;
 }
 
+void blk_rq_copy_request(struct request *rq, struct request *rq_src);
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 bool __bio_integrity_endio(struct bio *);
-- 
2.25.2



* [PATCH V10 03/11] blk-mq: mark blk_mq_get_driver_tag as static
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
  2020-05-05  2:09 ` [PATCH V10 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
  2020-05-05  2:09 ` [PATCH V10 02/11] block: add helper for copying request Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-05  2:09 ` [PATCH V10 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry, Hannes Reinecke,
	Martin K . Petersen

Now all callers of blk_mq_get_driver_tag are in blk-mq.c, so mark
it as static.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 2 +-
 block/blk-mq.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bcc3a2397d4a..de4e6654258d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1015,7 +1015,7 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-bool blk_mq_get_driver_tag(struct request *rq)
+static bool blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_alloc_data data = {
 		.q = rq->q,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 10bfdfb494fa..e7d1da4b1f73 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -44,7 +44,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool);
 void blk_mq_add_to_requeue_list(struct request *rq, bool at_head,
 				bool kick_requeue_list);
 void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list);
-bool blk_mq_get_driver_tag(struct request *rq);
 struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_ctx *start);
 
-- 
2.25.2



* [PATCH V10 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (2 preceding siblings ...)
  2020-05-05  2:09 ` [PATCH V10 03/11] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-05  2:09 ` [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs Ming Lei
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, John Garry, Hannes Reinecke

Especially with the none elevator, rq->tag is assigned at request allocation
time, so there isn't any way to figure out whether a request is being
dispatched. Also the code path wrt. the driver tag differs a bit between the
none and io-scheduler cases.

When one hctx becomes inactive, we have to prevent any request from being
dispatched to the LLD, and getting the driver tag provides a perfect chance
to do that. Meanwhile we can drain any such requests by checking whether
rq->tag is assigned.

So only assign rq->tag when blk_mq_get_driver_tag() is called.

This also simplifies the code dealing with driver tags a lot.
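
In other words, the dispatch path becomes the single place where a driver
tag is assigned; a condensed view of the hunks below (illustration only):

/* at allocation time (blk_mq_rq_ctx_init): no driver tag yet */
	rq->internal_tag = tag;
	rq->tag = -1;

/* at dispatch time: the driver tag is grabbed only here */
static bool blk_mq_get_driver_tag(struct request *rq)
{
	if (rq->tag != -1)
		return true;
	return __blk_mq_get_driver_tag(rq);
}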

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Garry <john.garry@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-flush.c | 18 ++----------
 block/blk-mq.c    | 75 ++++++++++++++++++++++++-----------------------
 block/blk-mq.h    | 21 +++++++------
 block/blk.h       |  5 ----
 4 files changed, 51 insertions(+), 68 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index c7f396e3d5e2..977edf95d711 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -236,13 +236,8 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 		error = fq->rq_status;
 
 	hctx = flush_rq->mq_hctx;
-	if (!q->elevator) {
-		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
-		flush_rq->tag = -1;
-	} else {
-		blk_mq_put_driver_tag(flush_rq);
-		flush_rq->internal_tag = -1;
-	}
+	flush_rq->internal_tag = -1;
+	blk_mq_put_driver_tag(flush_rq);
 
 	running = &fq->flush_queue[fq->flush_running_idx];
 	BUG_ON(fq->flush_pending_idx == fq->flush_running_idx);
@@ -317,14 +312,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	flush_rq->mq_ctx = first_rq->mq_ctx;
 	flush_rq->mq_hctx = first_rq->mq_hctx;
 
-	if (!q->elevator) {
-		fq->orig_rq = first_rq;
-		flush_rq->tag = first_rq->tag;
-		blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
-	} else {
-		flush_rq->internal_tag = first_rq->internal_tag;
-	}
-
+	flush_rq->internal_tag = first_rq->internal_tag;
 	flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH;
 	flush_rq->cmd_flags |= (flags & REQ_DRV) | (flags & REQ_FAILFAST_MASK);
 	flush_rq->rq_flags |= RQF_FLUSH_SEQ;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index de4e6654258d..6569d802e0d8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -276,18 +276,8 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	struct request *rq = tags->static_rqs[tag];
 	req_flags_t rq_flags = 0;
 
-	if (data->flags & BLK_MQ_REQ_INTERNAL) {
-		rq->tag = -1;
-		rq->internal_tag = tag;
-	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
-			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
-		}
-		rq->tag = tag;
-		rq->internal_tag = -1;
-		data->hctx->tags->rqs[rq->tag] = rq;
-	}
+	rq->internal_tag = tag;
+	rq->tag = -1;
 
 	/* csd/requeue_work/fifo_time is initialized before use */
 	rq->q = data->q;
@@ -471,14 +461,18 @@ static void __blk_mq_free_request(struct request *rq)
 	struct request_queue *q = rq->q;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
-	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != -1)
-		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
+
+	if (hctx->sched_tags) {
+		if (rq->tag >= 0)
+			blk_mq_put_tag(hctx->tags, ctx, rq->tag);
+		blk_mq_put_tag(hctx->sched_tags, ctx, rq->internal_tag);
+	} else {
+		blk_mq_put_tag(hctx->tags, ctx, rq->internal_tag);
+	}
+
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
 }
@@ -526,7 +520,7 @@ inline void __blk_mq_end_request(struct request *rq, blk_status_t error)
 		blk_stat_add(rq, now);
 	}
 
-	if (rq->internal_tag != -1)
+	if (rq->q->elevator && rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq, now);
 
 	blk_account_io_done(rq, now);
@@ -1015,33 +1009,40 @@ static inline unsigned int queued_to_index(unsigned int queued)
 	return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
 }
 
-static bool blk_mq_get_driver_tag(struct request *rq)
+static bool __blk_mq_get_driver_tag(struct request *rq)
 {
 	struct blk_mq_alloc_data data = {
-		.q = rq->q,
-		.hctx = rq->mq_hctx,
-		.flags = BLK_MQ_REQ_NOWAIT,
-		.cmd_flags = rq->cmd_flags,
+		.q		= rq->q,
+		.hctx		= rq->mq_hctx,
+		.flags		= BLK_MQ_REQ_NOWAIT,
+		.cmd_flags	= rq->cmd_flags,
 	};
-	bool shared;
 
-	if (rq->tag != -1)
-		return true;
+	if (data.hctx->sched_tags) {
+		if (blk_mq_tag_is_reserved(data.hctx->sched_tags,
+				rq->internal_tag))
+			data.flags |= BLK_MQ_REQ_RESERVED;
+		rq->tag = blk_mq_get_tag(&data);
+	} else {
+		rq->tag = rq->internal_tag;
+	}
 
-	if (blk_mq_tag_is_reserved(data.hctx->sched_tags, rq->internal_tag))
-		data.flags |= BLK_MQ_REQ_RESERVED;
+	if (rq->tag == -1)
+		return false;
 
-	shared = blk_mq_tag_busy(data.hctx);
-	rq->tag = blk_mq_get_tag(&data);
-	if (rq->tag >= 0) {
-		if (shared) {
-			rq->rq_flags |= RQF_MQ_INFLIGHT;
-			atomic_inc(&data.hctx->nr_active);
-		}
-		data.hctx->tags->rqs[rq->tag] = rq;
+	if (blk_mq_tag_busy(data.hctx)) {
+		rq->rq_flags |= RQF_MQ_INFLIGHT;
+		atomic_inc(&data.hctx->nr_active);
 	}
+	data.hctx->tags->rqs[rq->tag] = rq;
+	return true;
+}
 
-	return rq->tag != -1;
+static bool blk_mq_get_driver_tag(struct request *rq)
+{
+	if (rq->tag != -1)
+		return true;
+	return __blk_mq_get_driver_tag(rq);
 }
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e7d1da4b1f73..d0c72d7d07c8 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -196,26 +196,25 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
-					   struct request *rq)
+static inline void blk_mq_put_driver_tag(struct request *rq)
 {
-	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+	int tag = rq->tag;
+
+	if (tag < 0)
+		return;
+
 	rq->tag = -1;
 
+	if (hctx->sched_tags)
+		blk_mq_put_tag(hctx->tags, rq->mq_ctx, tag);
+
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
 		atomic_dec(&hctx->nr_active);
 	}
 }
 
-static inline void blk_mq_put_driver_tag(struct request *rq)
-{
-	if (rq->tag == -1 || rq->internal_tag == -1)
-		return;
-
-	__blk_mq_put_driver_tag(rq->mq_hctx, rq);
-}
-
 static inline void blk_mq_clear_mq_map(struct blk_mq_queue_map *qmap)
 {
 	int cpu;
diff --git a/block/blk.h b/block/blk.h
index 73dd86dc8f47..591cc07e40f9 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -26,11 +26,6 @@ struct blk_flush_queue {
 	struct list_head	flush_data_in_flight;
 	struct request		*flush_rq;
 
-	/*
-	 * flush_rq shares tag with this rq, both can't be active
-	 * at the same time
-	 */
-	struct request		*orig_rq;
 	struct lock_class_key	key;
 	spinlock_t		mq_flush_lock;
 };
-- 
2.25.2



* [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (3 preceding siblings ...)
  2020-05-05  2:09 ` [PATCH V10 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-08 23:32   ` Bart Van Assche
  2020-05-05  2:09 ` [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Currently a request is considered in-flight only after its state is updated
to MQ_RQ_IN_FLIGHT, which is done by the driver via blk_mq_start_request().

Actually, from blk-mq's point of view, a rq can be considered in-flight
once its tag is >= 0.

Pass a rq filter callback so that requests can be iterated over much more
flexibly.

Implement blk_mq_all_tag_busy_iter(), which accepts a 'busy_rq_fn' callback
for filtering which requests to iterate over, and make the existing
blk_mq_tagset_busy_iter() pass a default filter that selects started
requests.
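
As an example of the new interface, counting the requests that currently
hold a driver tag looks roughly like this (hypothetical standalone snippet;
patch 07/11 uses the same pattern in blk_mq_tags_inflight_rqs()):

static bool rq_holds_driver_tag(struct request *rq, void *data, bool reserved)
{
	return rq->tag >= 0;
}

static bool count_rq(struct request *rq, void *data, bool reserved)
{
	unsigned int *count = data;

	(*count)++;
	return true;	/* keep iterating */
}

static unsigned int tags_inflight_rqs(struct blk_mq_tags *tags)
{
	unsigned int count = 0;

	blk_mq_all_tag_busy_iter(tags, count_rq, rq_holds_driver_tag, &count);
	return count;
}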

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c | 39 +++++++++++++++++++++++++++------------
 block/blk-mq-tag.h |  4 ++++
 2 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904a..2e43b827c96d 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -255,6 +255,7 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
 struct bt_tags_iter_data {
 	struct blk_mq_tags *tags;
 	busy_tag_iter_fn *fn;
+	busy_rq_iter_fn *busy_rq_fn;
 	void *data;
 	bool reserved;
 };
@@ -274,7 +275,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * test and set the bit before assining ->rqs[].
 	 */
 	rq = tags->rqs[bitnr];
-	if (rq && blk_mq_request_started(rq))
+	if (rq && iter_data->busy_rq_fn(rq, iter_data->data, reserved))
 		return iter_data->fn(rq, iter_data->data, reserved);
 
 	return true;
@@ -294,11 +295,13 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		bitmap_tags member of struct blk_mq_tags.
  */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
-			     busy_tag_iter_fn *fn, void *data, bool reserved)
+			     busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
+			     void *data, bool reserved)
 {
 	struct bt_tags_iter_data iter_data = {
 		.tags = tags,
 		.fn = fn,
+		.busy_rq_fn = busy_rq_fn,
 		.data = data,
 		.reserved = reserved,
 	};
@@ -310,19 +313,30 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
 /**
  * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
  * @tags:	Tag map to iterate over.
- * @fn:		Pointer to the function that will be called for each started
- *		request. @fn will be called as follows: @fn(rq, @priv,
- *		reserved) where rq is a pointer to a request. 'reserved'
- *		indicates whether or not @rq is a reserved request. Return
- *		true to continue iterating tags, false to stop.
+ * @fn:		Pointer to the function that will be called for each request
+ * 		when .busy_rq_fn(rq) returns true. @fn will be called as
+ * 		follows: @fn(rq, @priv, reserved) where rq is a pointer to a
+ * 		request. 'reserved' indicates whether or not @rq is a reserved
+ * 		request. Return true to continue iterating tags, false to stop.
+ * @busy_rq_fn: Pointer to the function that will be called for each request,
+ * 		@busy_rq_fn's type is same with @fn. Only when @busy_rq_fn(rq,
+ * 		@priv, reserved) returns true, @fn will be called on this rq.
  * @priv:	Will be passed as second argument to @fn.
  */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
-		busy_tag_iter_fn *fn, void *priv)
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
+		void *priv)
 {
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, busy_rq_fn,
+				priv, true);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, busy_rq_fn, priv, false);
+}
+
+static bool blk_mq_default_busy_rq(struct request *rq, void *data,
+		bool reserved)
+{
+	return blk_mq_request_started(rq);
 }
 
 /**
@@ -342,7 +356,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
-			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
+			blk_mq_all_tag_busy_iter(tagset->tags[i], fn,
+					blk_mq_default_busy_rq, priv);
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 2b8321efb682..fdf095d513e5 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -21,6 +21,7 @@ struct blk_mq_tags {
 	struct list_head page_list;
 };
 
+typedef bool (busy_rq_iter_fn)(struct request *, void *, bool);
 
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
@@ -34,6 +35,9 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
+		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
+		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
-- 
2.25.2



* [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (4 preceding siblings ...)
  2020-05-05  2:09 ` [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-05  6:14   ` Hannes Reinecke
  2020-05-08 23:26   ` Bart Van Assche
  2020-05-05  2:09 ` [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Most blk-mq drivers depend on managed IRQs' automatic affinity to set up
their queue mapping. Thomas mentioned the following point[1]:

"
 That was the constraint of managed interrupts from the very beginning:

  The driver/subsystem has to quiesce the interrupt line and the associated
  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
  until it's restarted by the core when the CPU is plugged in again.
"

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD
is a cpuhp state handled after the CPU is down, so there isn't any chance
to quiesce the hctx for blk-mq wrt. CPU hotplug.

Add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for blk-mq to stop queues
and wait for completion of in-flight requests.

The following patch will stop the hw queue and wait for completion of
in-flight requests when one hctx is becoming dead. This may cause a
deadlock for some stacking blk-mq drivers, such as dm-rq and loop.

Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and set it for dm-rq and loop,
so we needn't wait for completion of in-flight requests from dm-rq & loop,
and the potential deadlock can be avoided.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
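
The flag simply keeps such a hctx off the new cpuhp state, so its queue is
never drained on CPU offline; condensed from the hunks below (illustration
only):

/* driver side: opt out when setting up the tag set */
md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;

/* blk-mq side: the online/offline callbacks are registered only otherwise */
if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
	cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
					 &hctx->cpuhp_online);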

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c     |  1 +
 block/blk-mq.c             | 19 +++++++++++++++++++
 drivers/block/loop.c       |  2 +-
 drivers/md/dm-rq.c         |  2 +-
 include/linux/blk-mq.h     |  3 +++
 include/linux/cpuhotplug.h |  1 +
 6 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 96b7a35c898a..ddec58743e88 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -239,6 +239,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6569d802e0d8..54c107be7a47 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2273,6 +2273,16 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
+static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
+{
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2309,6 +2319,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2368,6 +2381,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3621,6 +3637,9 @@ static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				blk_mq_hctx_notify_online,
+				blk_mq_hctx_notify_offline);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..784f2e038b55 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13b..5f1ff70ac029 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d7307795439a..3763207d88eb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
 	 */
 	atomic_t		nr_active;
 
+	/** @cpuhp_online: Node for the CPUHP_AP_BLK_MQ_ONLINE cpuhp state */
+	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
 	struct hlist_node	cpuhp_dead;
 	/** @kobj: Kernel object for sysfs. */
@@ -391,6 +393,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 77d70b633531..24b3a77810b6 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -152,6 +152,7 @@ enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
-- 
2.25.2



* [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (5 preceding siblings ...)
  2020-05-05  2:09 ` [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-08 23:39   ` Bart Van Assche
  2020-05-05  2:09 ` [PATCH V10 08/11] block: add blk_end_flush_machinery Ming Lei
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Before a CPU goes offline, check whether it is the last online CPU of the
hctx. If so, mark this hctx as inactive and wait for completion of all
in-flight IOs originating from this hctx. Meanwhile, check in
blk_mq_get_driver_tag() whether this hctx has become inactive; if so,
release the allocated tag.

This guarantees that there isn't any in-flight IO before shutting down the
managed IRQ line when all CPUs of this IRQ line are offline.
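
The resulting ordering between the two paths can be pictured as follows (a
rough sketch distilled from the comments added in this patch, not literal
code):

/*
 * CPU hotplug path (notify_offline)     dispatch path (get_driver_tag)
 * ---------------------------------     ------------------------------
 * set_bit(BLK_MQ_S_INACTIVE, ...)       rq->tag = blk_mq_get_tag(...)
 * smp_mb__after_atomic()                smp_mb() or barrier()
 * while (blk_mq_tags_inflight_rqs())    if (test_bit(BLK_MQ_S_INACTIVE, ...))
 *         msleep(5);                            blk_mq_put_driver_tag(rq);
 *
 * Either the dispatch path observes BLK_MQ_S_INACTIVE and releases its tag,
 * or the hotplug path observes the assigned tag and waits for it to be freed.
 */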

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c |   1 +
 block/blk-mq.c         | 124 +++++++++++++++++++++++++++++++++++++----
 include/linux/blk-mq.h |   3 +
 3 files changed, 117 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index ddec58743e88..dc66cb689d2f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 54c107be7a47..4a2250ac4fbb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1038,11 +1038,36 @@ static bool __blk_mq_get_driver_tag(struct request *rq)
 	return true;
 }
 
-static bool blk_mq_get_driver_tag(struct request *rq)
+static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
 {
 	if (rq->tag != -1)
 		return true;
-	return __blk_mq_get_driver_tag(rq);
+
+	if (!__blk_mq_get_driver_tag(rq))
+		return false;
+	/*
+	 * In case that direct issue IO process is migrated to other CPU
+	 * which may not belong to this hctx, add one memory barrier so we
+	 * can order driver tag assignment and checking BLK_MQ_S_INACTIVE.
+	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
+	 * and driver tag assignment are run on the same CPU because
+	 * BLK_MQ_S_INACTIVE is only set after the last CPU of this hctx is
+	 * becoming offline.
+	 *
+	 * Process migration might happen after the check on current processor
+	 * id, smp_mb() is implied by processor migration, so no need to worry
+	 * about it.
+	 */
+	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
+		smp_mb();
+	else
+		barrier();
+
+	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
+		blk_mq_put_driver_tag(rq);
+		return false;
+	}
+	return true;
 }
 
 static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
@@ -1091,7 +1116,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 		 * Don't clear RESTART here, someone else could have set it.
 		 * At most this will cost an extra queue run.
 		 */
-		return blk_mq_get_driver_tag(rq);
+		return blk_mq_get_driver_tag(rq, false);
 	}
 
 	wait = &hctx->dispatch_wait;
@@ -1117,7 +1142,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	 * allocation failure and adding the hardware queue to the wait
 	 * queue.
 	 */
-	ret = blk_mq_get_driver_tag(rq);
+	ret = blk_mq_get_driver_tag(rq, false);
 	if (!ret) {
 		spin_unlock(&hctx->dispatch_wait_lock);
 		spin_unlock_irq(&wq->lock);
@@ -1218,7 +1243,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			break;
 		}
 
-		if (!blk_mq_get_driver_tag(rq)) {
+		if (!blk_mq_get_driver_tag(rq, false)) {
 			/*
 			 * The initial allocation attempt failed, so we need to
 			 * rerun the hardware queue when a tag is freed. The
@@ -1250,7 +1275,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			bd.last = true;
 		else {
 			nxt = list_first_entry(list, struct request, queuelist);
-			bd.last = !blk_mq_get_driver_tag(nxt);
+			bd.last = !blk_mq_get_driver_tag(nxt, false);
 		}
 
 		ret = q->mq_ops->queue_rq(hctx, &bd);
@@ -1864,7 +1889,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_dispatch_budget(hctx))
 		goto insert;
 
-	if (!blk_mq_get_driver_tag(rq)) {
+	if (!blk_mq_get_driver_tag(rq, true)) {
 		blk_mq_put_dispatch_budget(hctx);
 		goto insert;
 	}
@@ -2273,13 +2298,87 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
-static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+struct count_inflight_data {
+	unsigned count;
+	struct blk_mq_hw_ctx *hctx;
+};
+
+static bool blk_mq_count_inflight_rq(struct request *rq, void *data,
+				     bool reserved)
 {
-	return 0;
+	struct count_inflight_data *count_data = data;
+
+	/*
+	 * Can't check rq's state because it is updated to MQ_RQ_IN_FLIGHT
+	 * in blk_mq_start_request(), at that time we can't prevent this rq
+	 * from being issued.
+	 *
+	 * So check if driver tag is assigned, if yes, count this rq as
+	 * inflight.
+	 */
+	if (rq->tag >= 0 && rq->mq_hctx == count_data->hctx)
+		count_data->count++;
+
+	return true;
+}
+
+static bool blk_mq_inflight_rq(struct request *rq, void *data,
+			       bool reserved)
+{
+	return rq->tag >= 0;
+}
+
+static unsigned blk_mq_tags_inflight_rqs(struct blk_mq_hw_ctx *hctx)
+{
+	struct count_inflight_data count_data = {
+		.hctx	= hctx,
+	};
+
+	blk_mq_all_tag_busy_iter(hctx->tags, blk_mq_count_inflight_rq,
+			blk_mq_inflight_rq, &count_data);
+	return count_data.count;
+}
+
+static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
+		struct blk_mq_hw_ctx *hctx)
+{
+	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
+		return false;
+	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
+		return false;
+	return true;
 }
 
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 {
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
+	if (!blk_mq_last_cpu_in_hctx(cpu, hctx))
+		return 0;
+
+	/*
+	 * Order setting BLK_MQ_S_INACTIVE versus checking rq->tag and rqs[tag],
+	 * in blk_mq_tags_inflight_rqs.  It pairs with the smp_mb() in
+	 * blk_mq_get_driver_tag.
+	 */
+	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	smp_mb__after_atomic();
+	while (blk_mq_tags_inflight_rqs(hctx))
+		msleep(5);
+	return 0;
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (cpumask_test_cpu(cpu, hctx->cpumask))
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
 	return 0;
 }
 
@@ -2290,12 +2389,15 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_dead);
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
 
-	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 3763207d88eb..77bf861d72ec 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -403,6 +403,9 @@ enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
-- 
2.25.2



* [PATCH V10 08/11] block: add blk_end_flush_machinery
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (6 preceding siblings ...)
  2020-05-05  2:09 ` [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-05  2:09 ` [PATCH V10 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead Ming Lei
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Flush requests aren't the same as normal FS requests:

1) one dedicated per-hctx flush rq is pre-allocated for sending flush requests

2) the flush request is issued to hardware via one machinery so that flush
merging can be applied

We can't simply re-submit flush rqs via blk_steal_bios(), so add
blk_end_flush_machinery() to collect flush requests which need to
be resubmitted:

- if one flush command without DATA is enough, send one flush and complete
this kind of request

- otherwise, add the request to a list and let the caller re-submit it.
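
Patch 10/11 then uses the helper roughly like this when deactivating a
hctx (condensed fragment, see that patch for the full context):

	LIST_HEAD(flush_in);
	LIST_HEAD(flush_out);

	/* requests marked RQF_FLUSH_SEQ have been collected into flush_in */
	blk_end_flush_machinery(hctx, &flush_in, &flush_out);

	/* anything that couldn't be finished with a single flush is re-submitted */
	list_splice_tail(&flush_out, &re_submit);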

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-flush.c | 123 +++++++++++++++++++++++++++++++++++++++++++---
 block/blk.h       |   4 ++
 2 files changed, 120 insertions(+), 7 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index 977edf95d711..745d878697ed 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -170,10 +170,11 @@ static void blk_flush_complete_seq(struct request *rq,
 	unsigned int cmd_flags;
 
 	BUG_ON(rq->flush.seq & seq);
-	rq->flush.seq |= seq;
+	if (!error)
+		rq->flush.seq |= seq;
 	cmd_flags = rq->cmd_flags;
 
-	if (likely(!error))
+	if (likely(!error && !fq->flush_queue_terminating))
 		seq = blk_flush_cur_seq(rq);
 	else
 		seq = REQ_FSEQ_DONE;
@@ -200,9 +201,15 @@ static void blk_flush_complete_seq(struct request *rq,
 		 * normal completion and end it.
 		 */
 		BUG_ON(!list_empty(&rq->queuelist));
-		list_del_init(&rq->flush.list);
-		blk_flush_restore_request(rq);
-		blk_mq_end_request(rq, error);
+
+		/* Terminating code will end the request from flush queue */
+		if (likely(!fq->flush_queue_terminating)) {
+			list_del_init(&rq->flush.list);
+			blk_flush_restore_request(rq);
+			blk_mq_end_request(rq, error);
+		} else {
+			list_move_tail(&rq->flush.list, pending);
+		}
 		break;
 
 	default:
@@ -279,7 +286,8 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	struct request *flush_rq = fq->flush_rq;
 
 	/* C1 described at the top of this file */
-	if (fq->flush_pending_idx != fq->flush_running_idx || list_empty(pending))
+	if (fq->flush_pending_idx != fq->flush_running_idx ||
+			list_empty(pending) || fq->flush_queue_terminating)
 		return;
 
 	/* C2 and C3
@@ -331,7 +339,7 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
 	if (q->elevator) {
-		WARN_ON(rq->tag < 0);
+		WARN_ON(rq->tag < 0 && !fq->flush_queue_terminating);
 		blk_mq_put_driver_tag(rq);
 	}
 
@@ -503,3 +511,104 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
 	kfree(fq->flush_rq);
 	kfree(fq);
 }
+
+static void __blk_end_queued_flush(struct blk_flush_queue *fq,
+		unsigned int queue_idx, struct list_head *resubmit_list,
+		struct list_head *flush_list)
+{
+	struct list_head *queue = &fq->flush_queue[queue_idx];
+	struct request *rq, *nxt;
+
+	list_for_each_entry_safe(rq, nxt, queue, flush.list) {
+		unsigned int seq = blk_flush_cur_seq(rq);
+
+		list_del_init(&rq->flush.list);
+		blk_flush_restore_request(rq);
+		if (!blk_rq_sectors(rq) || seq == REQ_FSEQ_POSTFLUSH)
+			list_add_tail(&rq->queuelist, flush_list);
+		else
+			list_add_tail(&rq->queuelist, resubmit_list);
+	}
+}
+
+static void blk_end_queued_flush(struct blk_flush_queue *fq,
+		struct list_head *resubmit_list, struct list_head *flush_list)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	__blk_end_queued_flush(fq, 0, resubmit_list, flush_list);
+	__blk_end_queued_flush(fq, 1, resubmit_list, flush_list);
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+}
+
+/* complete requests which just requires one flush command */
+static void blk_complete_flush_requests(struct blk_flush_queue *fq,
+		struct list_head *flush_list)
+{
+	struct block_device *bdev;
+	struct request *rq;
+	int error = -ENXIO;
+
+	if (list_empty(flush_list))
+		return;
+
+	rq = list_first_entry(flush_list, struct request, queuelist);
+
+	/* Send flush via one active hctx so we can move on */
+	bdev = bdget_disk(rq->rq_disk, 0);
+	if (bdev) {
+		error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
+		bdput(bdev);
+	}
+
+	while (!list_empty(flush_list)) {
+		rq = list_first_entry(flush_list, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_end_request(rq, error);
+	}
+}
+
+/*
+ * Called when this hctx is inactive and all CPUs of this hctx are dead,
+ * otherwise don't reuse this function.
+ *
+ * Terminate this hw queue's flush machinery, and try to complete flush
+ * IO requests if possible, such as any flush IO without data, or flush
+ * data IO in POSTFLUSH stage. Otherwise, add the flush IOs into @list
+ * and let the caller re-submit them.
+ */
+void blk_end_flush_machinery(struct blk_mq_hw_ctx *hctx,
+		struct list_head *in, struct list_head *out)
+{
+	LIST_HEAD(resubmit_list);
+	LIST_HEAD(flush_list);
+	struct blk_flush_queue *fq = hctx->fq;
+	struct request *rq, *nxt;
+	unsigned long flags;
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	fq->flush_queue_terminating = 1;
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+
+	/* End inflight flush requests */
+	list_for_each_entry_safe(rq, nxt, in, queuelist) {
+		WARN_ON(!(rq->rq_flags & RQF_FLUSH_SEQ));
+		list_del_init(&rq->queuelist);
+		rq->end_io(rq, BLK_STS_AGAIN);
+	}
+
+	/* End queued requests */
+	blk_end_queued_flush(fq, &resubmit_list, &flush_list);
+
+	/* Send flush and complete requests which just need one flush req */
+	blk_complete_flush_requests(fq, &flush_list);
+
+	spin_lock_irqsave(&fq->mq_flush_lock, flags);
+	/* reset flush queue so that it is ready to work next time */
+	fq->flush_pending_idx = fq->flush_running_idx = 0;
+	fq->flush_queue_terminating = 0;
+	spin_unlock_irqrestore(&fq->mq_flush_lock, flags);
+
+	list_splice_init(&resubmit_list, out);
+}
diff --git a/block/blk.h b/block/blk.h
index 591cc07e40f9..133fb0b99759 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -20,6 +20,7 @@ struct blk_flush_queue {
 	unsigned int		flush_queue_delayed:1;
 	unsigned int		flush_pending_idx:1;
 	unsigned int		flush_running_idx:1;
+	unsigned int		flush_queue_terminating:1;
 	blk_status_t 		rq_status;
 	unsigned long		flush_pending_since;
 	struct list_head	flush_queue[2];
@@ -454,4 +455,7 @@ int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
 		bool *same_page);
 
+void blk_end_flush_machinery(struct blk_mq_hw_ctx *hctx,
+		struct list_head *in, struct list_head *out);
+
 #endif /* BLK_INTERNAL_H */
-- 
2.25.2



* [PATCH V10 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (7 preceding siblings ...)
  2020-05-05  2:09 ` [PATCH V10 08/11] block: add blk_end_flush_machinery Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-05  2:09 ` [PATCH V10 10/11] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

Add a helper, blk_mq_hctx_handle_dead_cpu(), for handling a dead CPU, and
prepare for handling an inactive hctx.

No functional change.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4a2250ac4fbb..73e1a1d4c1c5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2382,22 +2382,13 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
-/*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
- */
-static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
+static void blk_mq_hctx_handle_dead_cpu(struct blk_mq_hw_ctx *hctx,
+		unsigned int cpu)
 {
-	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
-			struct blk_mq_hw_ctx, cpuhp_dead);
 	struct blk_mq_ctx *ctx;
 	LIST_HEAD(tmp);
 	enum hctx_type type;
 
-	if (!cpumask_test_cpu(cpu, hctx->cpumask))
-		return 0;
-
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
@@ -2409,13 +2400,27 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	spin_unlock(&ctx->lock);
 
 	if (list_empty(&tmp))
-		return 0;
+		return;
 
 	spin_lock(&hctx->lock);
 	list_splice_tail_init(&tmp, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, true);
+}
+
+/*
+ * 'cpu' is going away. splice any existing rq_list entries from this
+ * software queue to the hw queue dispatch list, and ensure that it
+ * gets run.
+ */
+static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_dead);
+
+	if (cpumask_test_cpu(cpu, hctx->cpumask))
+		blk_mq_hctx_handle_dead_cpu(hctx, cpu);
 	return 0;
 }
 
-- 
2.25.2



* [PATCH V10 10/11] blk-mq: re-submit IO in case that hctx is inactive
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (8 preceding siblings ...)
  2020-05-05  2:09 ` [PATCH V10 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-05  2:09 ` [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
  2020-05-08 21:49 ` [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
  11 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner

When all CPUs in one hctx are offline and this hctx becomes inactive, we
shouldn't run this hw queue for completing requests any more.

So allocate a request from a live hctx, and clone & resubmit the original
request, whether it comes from the sw queue or the scheduler queue.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 104 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 73e1a1d4c1c5..d639bcd89811 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2382,6 +2382,102 @@ static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
+static void blk_mq_resubmit_end_rq(struct request *rq, blk_status_t error)
+{
+	struct request *orig_rq = rq->end_io_data;
+
+	blk_mq_cleanup_rq(orig_rq);
+	blk_mq_end_request(orig_rq, error);
+
+	blk_put_request(rq);
+}
+
+static void blk_mq_resubmit_rq(struct request *rq)
+{
+	struct request *nrq;
+	unsigned int flags = 0;
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
+	struct blk_mq_tags *tags = rq->q->elevator ? hctx->sched_tags :
+		hctx->tags;
+	bool reserved = blk_mq_tag_is_reserved(tags, rq->internal_tag);
+
+	if (rq->rq_flags & RQF_PREEMPT)
+		flags |= BLK_MQ_REQ_PREEMPT;
+	if (reserved)
+		flags |= BLK_MQ_REQ_RESERVED;
+
+	/* avoid allocation failure by clearing NOWAIT */
+	nrq = blk_get_request(rq->q, rq->cmd_flags & ~REQ_NOWAIT, flags);
+	if (!nrq)
+		return;
+
+	blk_rq_copy_request(nrq, rq);
+
+	nrq->timeout = rq->timeout;
+	nrq->rq_disk = rq->rq_disk;
+	nrq->part = rq->part;
+
+	memcpy(blk_mq_rq_to_pdu(nrq), blk_mq_rq_to_pdu(rq),
+			rq->q->tag_set->cmd_size);
+
+	nrq->end_io = blk_mq_resubmit_end_rq;
+	nrq->end_io_data = rq;
+	nrq->bio = rq->bio;
+	nrq->biotail = rq->biotail;
+
+	/* bios ownership has been transferred to the new request */
+	rq->bio = rq->biotail = NULL;
+	rq->__data_len = 0;
+
+	if (blk_insert_cloned_request(nrq->q, nrq) != BLK_STS_OK)
+		blk_mq_request_bypass_insert(nrq, false, true);
+}
+
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx)
+{
+	LIST_HEAD(sched);
+	LIST_HEAD(re_submit);
+	LIST_HEAD(flush_in);
+	LIST_HEAD(flush_out);
+	struct request *rq, *nxt;
+	struct elevator_queue *e = hctx->queue->elevator;
+
+	if (!e) {
+		blk_mq_flush_busy_ctxs(hctx, &re_submit);
+	} else {
+		while ((rq = e->type->ops.dispatch_request(hctx))) {
+			if (rq->mq_hctx != hctx)
+				list_add(&rq->queuelist, &sched);
+			else
+				list_add(&rq->queuelist, &re_submit);
+		}
+	}
+	while (!list_empty(&sched)) {
+		rq = list_first_entry(&sched, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_sched_insert_request(rq, true, true, true);
+	}
+
+	/* requests in dispatch list have to be re-submitted too */
+	spin_lock(&hctx->lock);
+	list_splice_tail_init(&hctx->dispatch, &re_submit);
+	spin_unlock(&hctx->lock);
+
+	/* blk_end_flush_machinery will cover flush request */
+	list_for_each_entry_safe(rq, nxt, &re_submit, queuelist) {
+		if (rq->rq_flags & RQF_FLUSH_SEQ)
+			list_move(&rq->queuelist, &flush_in);
+	}
+	blk_end_flush_machinery(hctx, &flush_in, &flush_out);
+	list_splice_tail(&flush_out, &re_submit);
+
+	while (!list_empty(&re_submit)) {
+		rq = list_first_entry(&re_submit, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_resubmit_rq(rq);
+	}
+}
+
 static void blk_mq_hctx_handle_dead_cpu(struct blk_mq_hw_ctx *hctx,
 		unsigned int cpu)
 {
@@ -2410,17 +2506,20 @@ static void blk_mq_hctx_handle_dead_cpu(struct blk_mq_hw_ctx *hctx,
 }
 
 /*
- * 'cpu' is going away. splice any existing rq_list entries from this
- * software queue to the hw queue dispatch list, and ensure that it
- * gets run.
+ * @cpu has gone away. If this hctx is inactive, we can't dispatch requests
+ * to the hctx any more, so clone and re-submit the requests from this hctx.
  */
 static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 {
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_dead);
 
-	if (cpumask_test_cpu(cpu, hctx->cpumask))
-		blk_mq_hctx_handle_dead_cpu(hctx, cpu);
+	if (cpumask_test_cpu(cpu, hctx->cpumask)) {
+		if (test_bit(BLK_MQ_S_INACTIVE, &hctx->state))
+			blk_mq_hctx_deactivate(hctx);
+		else
+			blk_mq_hctx_handle_dead_cpu(hctx, cpu);
+	}
 	return 0;
 }
 
-- 
2.25.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (9 preceding siblings ...)
  2020-05-05  2:09 ` [PATCH V10 10/11] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
@ 2020-05-05  2:09 ` Ming Lei
  2020-05-09 14:07   ` Bart Van Assche
  2020-05-08 21:49 ` [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
  11 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-05  2:09 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, John Garry, Bart Van Assche,
	Hannes Reinecke, Christoph Hellwig, Thomas Gleixner,
	Hannes Reinecke

Running a hw queue on a dead CPU may still be triggered in some corner cases,
such as when a request is requeued after CPU hotplug has been handled.

So handle this corner case during run queue.

Cc: John Garry <john.garry@huawei.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 30 ++++++++++--------------------
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d639bcd89811..e2c6014717a8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -43,6 +43,8 @@
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
+static void blk_mq_hctx_deactivate(struct blk_mq_hw_ctx *hctx);
+
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
 	int ddir, sectors, bucket;
@@ -1373,28 +1375,16 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	int srcu_idx;
 
 	/*
-	 * We should be running this queue from one of the CPUs that
-	 * are mapped to it.
-	 *
-	 * There are at least two related races now between setting
-	 * hctx->next_cpu from blk_mq_hctx_next_cpu() and running
-	 * __blk_mq_run_hw_queue():
-	 *
-	 * - hctx->next_cpu is found offline in blk_mq_hctx_next_cpu(),
-	 *   but later it becomes online, then this warning is harmless
-	 *   at all
-	 *
-	 * - hctx->next_cpu is found online in blk_mq_hctx_next_cpu(),
-	 *   but later it becomes offline, then the warning can't be
-	 *   triggered, and we depend on blk-mq timeout handler to
-	 *   handle dispatched requests to this hctx
+	 * BLK_MQ_S_INACTIVE may miss some requeue corner cases: a request
+	 * may be requeued after cpu unplug has been handled, so check if
+	 * the hctx is actually inactive. If yes, deactivate it and
+	 * re-submit all requests in the queue.
 	 */
 	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
-		cpu_online(hctx->next_cpu)) {
-		printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
-			raw_smp_processor_id(),
-			cpumask_empty(hctx->cpumask) ? "inactive": "active");
-		dump_stack();
+	    cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) >=
+	    nr_cpu_ids) {
+		blk_mq_hctx_deactivate(hctx);
+		return;
 	}
 
 	/*
-- 
2.25.2


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-05-05  2:09 ` [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
@ 2020-05-05  6:14   ` Hannes Reinecke
  2020-05-08 23:26   ` Bart Van Assche
  1 sibling, 0 replies; 41+ messages in thread
From: Hannes Reinecke @ 2020-05-05  6:14 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 5/5/20 4:09 AM, Ming Lei wrote:
> Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
> up queue mapping. Thomas mentioned the following point[1]:
> 
> "
>   That was the constraint of managed interrupts from the very beginning:
> 
>    The driver/subsystem has to quiesce the interrupt line and the associated
>    queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>    until it's restarted by the core when the CPU is plugged in again.
> "
> 
> However, current blk-mq implementation doesn't quiesce hw queue before
> the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is
> one cpuhp state handled after the CPU is down, so there isn't any chance
> to quiesce hctx for blk-mq wrt. CPU hotplug.
> 
> Add new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE for blk-mq to stop queues
> and wait for completion of in-flight requests.
> 
> We will stop hw queue and wait for completion of in-flight requests
> when one hctx is becoming dead in the following patch. This way may
> cause dead-lock for some stacking blk-mq drivers, such as dm-rq and
> loop.
> 
> Add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ and mark it for dm-rq and
> loop, so we needn't to wait for completion of in-flight requests from
> dm-rq & loop, then the potential dead-lock can be avoided.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Tested-by: John Garry <john.garry@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq-debugfs.c     |  1 +
>   block/blk-mq.c             | 19 +++++++++++++++++++
>   drivers/block/loop.c       |  2 +-
>   drivers/md/dm-rq.c         |  2 +-
>   include/linux/blk-mq.h     |  3 +++
>   include/linux/cpuhotplug.h |  1 +
>   6 files changed, 26 insertions(+), 2 deletions(-)
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 00/11] blk-mq: improvement CPU hotplug
  2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
                   ` (10 preceding siblings ...)
  2020-05-05  2:09 ` [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
@ 2020-05-08 21:49 ` Ming Lei
  2020-05-09  3:17   ` Jens Axboe
  11 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-08 21:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Tue, May 05, 2020 at 10:09:19AM +0800, Ming Lei wrote:
> Hi,
> 
> Thomas mentioned:
>     "
>      That was the constraint of managed interrupts from the very beginning:
>     
>       The driver/subsystem has to quiesce the interrupt line and the associated
>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>       until it's restarted by the core when the CPU is plugged in again.
>     "
> 
> But no drivers or blk-mq do that before one hctx becomes inactive(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> 
> This patchset tries to address the issue by two stages:
> 
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> 
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
> 
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> 
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
> 
> Thanks John Garry for running lots of tests on arm64 with this patchset
> and co-working on investigating all kinds of issues.
> 
> Thanks Christoph's review on V7 & V8.
> 
> Please comment & review, thanks!
> 
> https://github.com/ming1/linux/commits/v5.7-rc-blk-mq-improve-cpu-hotplug
> 
> V10:
> 	- fix double bio complete in request resubmission(10/11)
> 	- add tested-by tag
> 
> V9:
> 	- add Reviewed-by tag
> 	- document more on memory barrier usage between getting driver tag
> 	and handling cpu offline(7/11)
> 	- small code cleanup as suggested by Chritoph(7/11)
> 	- rebase against for-5.8/block(1/11, 2/11)
> V8:
> 	- add patches to share code with blk_rq_prep_clone
> 	- code re-organization as suggested by Christoph, most of them are
> 	in 04/11, 10/11
> 	- add reviewed-by tag
> 
> V7:
> 	- fix updating .nr_active in get_driver_tag
> 	- add hctx->cpumask check in cpuhp handler
> 	- only drain requests which tag is >= 0
> 	- pass more aggressive cpuhotplug&io test
> 
> V6:
> 	- simplify getting driver tag, so that we can drain in-flight
> 	  requests correctly without using synchronize_rcu()
> 	- handle re-submission of flush & passthrough request correctly
> 
> V5:
> 	- rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
> 	- re-factor code for re-submit requests in cpu dead hotplug handler
> 	- address requeue corner case
> 
> V4:
> 	- resubmit IOs in dispatch list in case that this hctx is dead 
> 
> V3:
> 	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
> 	- fix patch 4 for avoiding potential deadlock, as found by Hannes
> 
> V2:
> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> 	  them
> 	- address comments from John Garry and Minwoo
> 
> Ming Lei (11):
>   block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone
>   block: add helper for copying request
>   blk-mq: mark blk_mq_get_driver_tag as static
>   blk-mq: assign rq->tag in blk_mq_get_driver_tag
>   blk-mq: support rq filter callback when iterating rqs
>   blk-mq: prepare for draining IO when hctx's all CPUs are offline
>   blk-mq: stop to handle IO and drain IO before hctx becomes inactive
>   block: add blk_end_flush_machinery
>   blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead
>   blk-mq: re-submit IO in case that hctx is inactive
>   block: deactivate hctx when the hctx is actually inactive
> 
>  block/blk-core.c           |  27 ++-
>  block/blk-flush.c          | 141 ++++++++++++---
>  block/blk-mq-debugfs.c     |   2 +
>  block/blk-mq-tag.c         |  39 ++--
>  block/blk-mq-tag.h         |   4 +
>  block/blk-mq.c             | 356 +++++++++++++++++++++++++++++--------
>  block/blk-mq.h             |  22 ++-
>  block/blk.h                |  11 +-
>  drivers/block/loop.c       |   2 +-
>  drivers/md/dm-rq.c         |   2 +-
>  include/linux/blk-mq.h     |   6 +
>  include/linux/cpuhotplug.h |   1 +
>  12 files changed, 481 insertions(+), 132 deletions(-)
> 
> Cc: John Garry <john.garry@huawei.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>

Hi Jens,

These patches have been worked on and discussed for a while, so is there any
chance to get them into 5.8 if there are no further comments?

Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-05-05  2:09 ` [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
  2020-05-05  6:14   ` Hannes Reinecke
@ 2020-05-08 23:26   ` Bart Van Assche
  2020-05-09  2:09     ` Ming Lei
  1 sibling, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-08 23:26 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner

On 2020-05-04 19:09, Ming Lei wrote:
> @@ -391,6 +393,7 @@ struct blk_mq_ops {
>  enum {
>  	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
>  	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> +	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
>  	BLK_MQ_F_BLOCKING	= 1 << 5,
>  	BLK_MQ_F_NO_SCHED	= 1 << 6,
>  	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 77d70b633531..24b3a77810b6 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -152,6 +152,7 @@ enum cpuhp_state {
>  	CPUHP_AP_SMPBOOT_THREADS,
>  	CPUHP_AP_X86_VDSO_VMA_ONLINE,
>  	CPUHP_AP_IRQ_AFFINITY_ONLINE,
> +	CPUHP_AP_BLK_MQ_ONLINE,
>  	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
>  	CPUHP_AP_X86_INTEL_EPB_ONLINE,
>  	CPUHP_AP_PERF_ONLINE,

Wouldn't BLK_MQ_F_NO_IRQ be a better name than BLK_MQ_F_NO_MANAGED_IRQ?

Please add comments that explain what BLK_MQ_F_NO_MANAGED_IRQ and
CPUHP_AP_BLK_MQ_ONLINE mean.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs
  2020-05-05  2:09 ` [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs Ming Lei
@ 2020-05-08 23:32   ` Bart Van Assche
  2020-05-09  0:18     ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-08 23:32 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner

On 2020-05-04 19:09, Ming Lei wrote:
> @@ -310,19 +313,30 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
>  /**
>   * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
>   * @tags:	Tag map to iterate over.
> - * @fn:		Pointer to the function that will be called for each started
> - *		request. @fn will be called as follows: @fn(rq, @priv,
> - *		reserved) where rq is a pointer to a request. 'reserved'
> - *		indicates whether or not @rq is a reserved request. Return
> - *		true to continue iterating tags, false to stop.
> + * @fn:		Pointer to the function that will be called for each request
> + * 		when .busy_rq_fn(rq) returns true. @fn will be called as
> + * 		follows: @fn(rq, @priv, reserved) where rq is a pointer to a
> + * 		request. 'reserved' indicates whether or not @rq is a reserved
> + * 		request. Return true to continue iterating tags, false to stop.
> + * @busy_rq_fn: Pointer to the function that will be called for each request,
> + * 		@busy_rq_fn's type is same with @fn. Only when @busy_rq_fn(rq,
> + * 		@priv, reserved) returns true, @fn will be called on this rq.
>   * @priv:	Will be passed as second argument to @fn.
>   */
> -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> -		busy_tag_iter_fn *fn, void *priv)
> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> +		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
> +		void *priv)
>  {

The name 'busy_rq_fn' is not ideal because it is named after one
specific use case, namely checking whether or not a request is busy (has
already been started). How about using the name 'pred_fn' ('pred' from
predicate because it controls whether the other function is called)?
Since only the context that passes 'fn' can know what data structure
'priv' points to and since 'busy_rq_fn' is passed from another context,
can 'busy_rq_fn' even know what data 'priv' points at? Has it been
considered not to pass the 'priv' argument to 'busy_rq_fn'?
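
For illustration, the kind of prototype I have in mind (a sketch only; the
names are made up here, not an existing API):

	typedef bool (pred_fn)(struct request *rq, bool reserved);

	void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
			busy_tag_iter_fn *fn, pred_fn *pred, void *priv);

i.e. the predicate only sees the request and 'reserved', while 'priv' keeps
being passed to @fn only.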

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-05  2:09 ` [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
@ 2020-05-08 23:39   ` Bart Van Assche
  2020-05-09  2:20     ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-08 23:39 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner

On 2020-05-04 19:09, Ming Lei wrote:
> -static bool blk_mq_get_driver_tag(struct request *rq)
> +static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
>  {
>  	if (rq->tag != -1)
>  		return true;
> -	return __blk_mq_get_driver_tag(rq);
> +
> +	if (!__blk_mq_get_driver_tag(rq))
> +		return false;
> +	/*
> +	 * In case that direct issue IO process is migrated to other CPU
> +	 * which may not belong to this hctx, add one memory barrier so we
> +	 * can order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> +	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> +	 * and driver tag assignment are run on the same CPU because
> +	 * BLK_MQ_S_INACTIVE is only set after the last CPU of this hctx is
> +	 * becoming offline.
> +	 *
> +	 * Process migration might happen after the check on current processor
> +	 * id, smp_mb() is implied by processor migration, so no need to worry
> +	 * about it.
> +	 */
> +	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
> +		smp_mb();
> +	else
> +		barrier();
> +
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
> +		blk_mq_put_driver_tag(rq);
> +		return false;
> +	}
> +	return true;
>  }

How much does this patch slow down the hot path?

Can CPU migration be fixed without affecting the hot path, e.g. by using
the request queue freezing mechanism?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs
  2020-05-08 23:32   ` Bart Van Assche
@ 2020-05-09  0:18     ` Bart Van Assche
  2020-05-09  2:05       ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-09  0:18 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner

On 2020-05-08 16:32, Bart Van Assche wrote:
> On 2020-05-04 19:09, Ming Lei wrote:
>> @@ -310,19 +313,30 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
>>  /**
>>   * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
>>   * @tags:	Tag map to iterate over.
>> - * @fn:		Pointer to the function that will be called for each started
>> - *		request. @fn will be called as follows: @fn(rq, @priv,
>> - *		reserved) where rq is a pointer to a request. 'reserved'
>> - *		indicates whether or not @rq is a reserved request. Return
>> - *		true to continue iterating tags, false to stop.
>> + * @fn:		Pointer to the function that will be called for each request
>> + * 		when .busy_rq_fn(rq) returns true. @fn will be called as
>> + * 		follows: @fn(rq, @priv, reserved) where rq is a pointer to a
>> + * 		request. 'reserved' indicates whether or not @rq is a reserved
>> + * 		request. Return true to continue iterating tags, false to stop.
>> + * @busy_rq_fn: Pointer to the function that will be called for each request,
>> + * 		@busy_rq_fn's type is same with @fn. Only when @busy_rq_fn(rq,
>> + * 		@priv, reserved) returns true, @fn will be called on this rq.
>>   * @priv:	Will be passed as second argument to @fn.
>>   */
>> -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>> -		busy_tag_iter_fn *fn, void *priv)
>> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>> +		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
>> +		void *priv)
>>  {
> 
> The name 'busy_rq_fn' is not ideal because it is named after one
> specific use case, namely checking whether or not a request is busy (has
> already been started). How about using the name 'pred_fn' ('pred' from
> predicate because it controls whether the other function is called)?
> Since only the context that passes 'fn' can know what data structure
> 'priv' points to and since 'busy_rq_fn' is passed from another context,
> can 'busy_rq_fn' even know what data 'priv' points at? Has it been
> considered not to pass the 'priv' argument to 'busy_rq_fn'?

Thinking further about this, another possible approach is not to modify
blk_mq_all_tag_busy_iter() at all and to introduce a new function that
iterates over all requests instead of only over busy requests. I think
that approach will result in easier to read code than patch 5/11 because
each of these request iteration functions will only accept a single
callback function pointer. Additionally, that approach will make the
following function superfluous (from patch 7/11):

+static bool blk_mq_inflight_rq(struct request *rq, void *data,
+			       bool reserved)
+{
+	return rq->tag >= 0;
+}

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs
  2020-05-09  0:18     ` Bart Van Assche
@ 2020-05-09  2:05       ` Ming Lei
  2020-05-09  3:08         ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-09  2:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Fri, May 08, 2020 at 05:18:56PM -0700, Bart Van Assche wrote:
> On 2020-05-08 16:32, Bart Van Assche wrote:
> > On 2020-05-04 19:09, Ming Lei wrote:
> >> @@ -310,19 +313,30 @@ static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
> >>  /**
> >>   * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
> >>   * @tags:	Tag map to iterate over.
> >> - * @fn:		Pointer to the function that will be called for each started
> >> - *		request. @fn will be called as follows: @fn(rq, @priv,
> >> - *		reserved) where rq is a pointer to a request. 'reserved'
> >> - *		indicates whether or not @rq is a reserved request. Return
> >> - *		true to continue iterating tags, false to stop.
> >> + * @fn:		Pointer to the function that will be called for each request
> >> + * 		when .busy_rq_fn(rq) returns true. @fn will be called as
> >> + * 		follows: @fn(rq, @priv, reserved) where rq is a pointer to a
> >> + * 		request. 'reserved' indicates whether or not @rq is a reserved
> >> + * 		request. Return true to continue iterating tags, false to stop.
> >> + * @busy_rq_fn: Pointer to the function that will be called for each request,
> >> + * 		@busy_rq_fn's type is same with @fn. Only when @busy_rq_fn(rq,
> >> + * 		@priv, reserved) returns true, @fn will be called on this rq.
> >>   * @priv:	Will be passed as second argument to @fn.
> >>   */
> >> -static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> >> -		busy_tag_iter_fn *fn, void *priv)
> >> +void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> >> +		busy_tag_iter_fn *fn, busy_rq_iter_fn *busy_rq_fn,
> >> +		void *priv)
> >>  {
> > 
> > The name 'busy_rq_fn' is not ideal because it is named after one
> > specific use case, namely checking whether or not a request is busy (has
> > already been started). How about using the name 'pred_fn' ('pred' from
> > predicate because it controls whether the other function is called)?
> > Since only the context that passes 'fn' can know what data structure
> > 'priv' points to and since 'busy_rq_fn' is passed from another context,
> > can 'busy_rq_fn' even know what data 'priv' points at? Has it been
> > considered not to pass the 'priv' argument to 'busy_rq_fn'?
> 
> Thinking further about this, another possible approach is not to modify
> blk_mq_all_tag_busy_iter() at all and to introduce a new function that
> iterates over all requests instead of only over busy requests. I think
> that approach will result in easier to read code than patch 5/11 because
> each of these request iteration functions will only accept a single
> callback function pointer. Additionally, that approach will make the
> following function superfluous (from patch 7/11):
> 
> +static bool blk_mq_inflight_rq(struct request *rq, void *data,
> +			       bool reserved)
> +{
> +	return rq->tag >= 0;
> +}

Fine, then we can save one callback. How about the following approach?

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904a..5e9c743d887b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -257,6 +257,7 @@ struct bt_tags_iter_data {
 	busy_tag_iter_fn *fn;
 	void *data;
 	bool reserved;
+	bool iterate_all;
 };
 
 static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
@@ -274,8 +275,10 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * test and set the bit before assining ->rqs[].
 	 */
 	rq = tags->rqs[bitnr];
-	if (rq && blk_mq_request_started(rq))
-		return iter_data->fn(rq, iter_data->data, reserved);
+	if (rq) {
+		if (iter_data->iterate_all || blk_mq_request_started(rq))
+			return iter_data->fn(rq, iter_data->data, reserved);
+	}
 
 	return true;
 }
@@ -294,13 +297,15 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
  *		bitmap_tags member of struct blk_mq_tags.
  */
 static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
-			     busy_tag_iter_fn *fn, void *data, bool reserved)
+			     busy_tag_iter_fn *fn, void *data, bool reserved,
+			     bool iterate_all)
 {
 	struct bt_tags_iter_data iter_data = {
 		.tags = tags,
 		.fn = fn,
 		.data = data,
 		.reserved = reserved,
+		.iterate_all = iterate_all,
 	};
 
 	if (tags->rqs)
@@ -321,8 +326,30 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
+				 false);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, false);
+}
+
+/**
+ * blk_mq_all_tag_iter - iterate over all requests in a tag map
+ * @tags:	Tag map to iterate over.
+ * @fn:		Pointer to the function that will be called for each
+ *		request. @fn will be called as follows: @fn(rq, @priv,
+ *		reserved) where rq is a pointer to a request. 'reserved'
+ *		indicates whether or not @rq is a reserved request. Return
+ *		true to continue iterating tags, false to stop.
+ * @priv:	Will be passed as second argument to @fn.
+ *
+ * It is the caller's responsility to check rq's state in @fn.
+ */
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+		void *priv)
+{
+	if (tags->nr_reserved_tags)
+		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
+				 true);
+	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, true);
 }
 
 /**
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 2b8321efb682..d19546e8246b 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -34,6 +34,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
+		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)


thanks,
Ming


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-05-08 23:26   ` Bart Van Assche
@ 2020-05-09  2:09     ` Ming Lei
  2020-05-09  3:11       ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-09  2:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Fri, May 08, 2020 at 04:26:17PM -0700, Bart Van Assche wrote:
> On 2020-05-04 19:09, Ming Lei wrote:
> > @@ -391,6 +393,7 @@ struct blk_mq_ops {
> >  enum {
> >  	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
> >  	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> > +	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
> >  	BLK_MQ_F_BLOCKING	= 1 << 5,
> >  	BLK_MQ_F_NO_SCHED	= 1 << 6,
> >  	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> > diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> > index 77d70b633531..24b3a77810b6 100644
> > --- a/include/linux/cpuhotplug.h
> > +++ b/include/linux/cpuhotplug.h
> > @@ -152,6 +152,7 @@ enum cpuhp_state {
> >  	CPUHP_AP_SMPBOOT_THREADS,
> >  	CPUHP_AP_X86_VDSO_VMA_ONLINE,
> >  	CPUHP_AP_IRQ_AFFINITY_ONLINE,
> > +	CPUHP_AP_BLK_MQ_ONLINE,
> >  	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
> >  	CPUHP_AP_X86_INTEL_EPB_ONLINE,
> >  	CPUHP_AP_PERF_ONLINE,
> 
> Wouldn't BLK_MQ_F_NO_IRQ be a better name than BLK_MQ_F_NO_MANAGED_IRQ?

No, what this patchset tries to do is to address request timeout or hang
issues in case managed IRQs are used by the blk-mq driver.

> 
> Please add comments that explain what BLK_MQ_F_NO_MANAGED_IRQ and
> CPUHP_AP_BLK_MQ_ONLINE mean.

I believe both are self-documenting:

- BLK_MQ_F_NO_MANAGED_IRQ
No managed irq is used by the driver (see the sketch below)

- CPUHP_AP_BLK_MQ_ONLINE
It follows the standard naming of cpuhp states, which means the startup and
teardown callbacks are called while the passed CPU is still online.
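
For example, a driver which doesn't use managed IRQs only needs to set the
flag when building its tag_set, roughly like below (just a sketch of the
idea, not the exact hunk from patch 06/11):

	/* in the driver's tag_set setup, e.g. loop or dm-rq */
	set->flags |= BLK_MQ_F_NO_MANAGED_IRQ;

With the flag set, blk-mq won't wait for completion of in-flight requests
from such a hctx in the new CPUHP_AP_BLK_MQ_ONLINE handler.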


Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-08 23:39   ` Bart Van Assche
@ 2020-05-09  2:20     ` Ming Lei
  2020-05-09  3:24       ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-09  2:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Fri, May 08, 2020 at 04:39:46PM -0700, Bart Van Assche wrote:
> On 2020-05-04 19:09, Ming Lei wrote:
> > -static bool blk_mq_get_driver_tag(struct request *rq)
> > +static bool blk_mq_get_driver_tag(struct request *rq, bool direct_issue)
> >  {
> >  	if (rq->tag != -1)
> >  		return true;
> > -	return __blk_mq_get_driver_tag(rq);
> > +
> > +	if (!__blk_mq_get_driver_tag(rq))
> > +		return false;
> > +	/*
> > +	 * In case that direct issue IO process is migrated to other CPU
> > +	 * which may not belong to this hctx, add one memory barrier so we
> > +	 * can order driver tag assignment and checking BLK_MQ_S_INACTIVE.
> > +	 * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
> > +	 * and driver tag assignment are run on the same CPU because
> > +	 * BLK_MQ_S_INACTIVE is only set after the last CPU of this hctx is
> > +	 * becoming offline.
> > +	 *
> > +	 * Process migration might happen after the check on current processor
> > +	 * id, smp_mb() is implied by processor migration, so no need to worry
> > +	 * about it.
> > +	 */
> > +	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))
> > +		smp_mb();
> > +	else
> > +		barrier();
> > +
> > +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {
> > +		blk_mq_put_driver_tag(rq);
> > +		return false;
> > +	}
> > +	return true;
> >  }
> 
> How much does this patch slow down the hot path?

Basically zero cost is added to the hot path. Specifically:

> +	if (unlikely(direct_issue && rq->mq_ctx->cpu != raw_smp_processor_id()))

In case of direct issue, the chance of the IO process migrating is very
small, since direct issue closely follows request allocation and the
window is quite short, so smp_mb() won't be run most of the time.

> +		smp_mb();
> +	else
> +		barrier();

So barrier() is taken most of the time, and its effect can be ignored
since it is just a compiler barrier.

> +
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state))) {

hctx->state is checked in the hot path anyway, so this adds basically zero cost.

> +		blk_mq_put_driver_tag(rq);
> +		return false;
> +	}

> 
> Can CPU migration be fixed without affecting the hot path, e.g. by using
> the request queue freezing mechanism?

Why do we want to fix CPU migration of the direct-issue IO process? It may not
be necessary, and it would be quite difficult:

1) preempt disable after the request is allocated was removed previously
in a cleanup patch

2) we have drivers which may set BLOCKING, so .queue_rq() may sleep

Not sure why you mention queue freezing.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs
  2020-05-09  2:05       ` Ming Lei
@ 2020-05-09  3:08         ` Bart Van Assche
  2020-05-09  3:52           ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-09  3:08 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 2020-05-08 19:05, Ming Lei wrote:
> Fine, then we can save one callback, how about the following way?
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6e904a..5e9c743d887b 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -257,6 +257,7 @@ struct bt_tags_iter_data {
>  	busy_tag_iter_fn *fn;
>  	void *data;
>  	bool reserved;
> +	bool iterate_all;
>  };
>  
>  static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> @@ -274,8 +275,10 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	 * test and set the bit before assining ->rqs[].
>  	 */
>  	rq = tags->rqs[bitnr];
> -	if (rq && blk_mq_request_started(rq))
> -		return iter_data->fn(rq, iter_data->data, reserved);
> +	if (rq) {
> +		if (iter_data->iterate_all || blk_mq_request_started(rq))
> +			return iter_data->fn(rq, iter_data->data, reserved);
> +	}

How about combining the two if-statements above in the following single
if-statement:

if (rq && (iter_data->iterate_all || blk_mq_request_started(rq)))
	...

> @@ -321,8 +326,30 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
>  		busy_tag_iter_fn *fn, void *priv)
>  {
>  	if (tags->nr_reserved_tags)
> -		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
> -	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
> +		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
> +				 false);
> +	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, false);
> +}

How about inserting comments like /*reserved=*/ and /*iterate_all=*/ in
the bt_tags_for_each() call in front of "false" to make these calls
easier to read?
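
I.e., the two calls above would then read something like:

	bt_tags_for_each(tags, &tags->breserved_tags, fn, priv,
			 /*reserved=*/true, /*iterate_all=*/false);
	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv,
			 /*reserved=*/false, /*iterate_all=*/false);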

> +/**
> + * blk_mq_all_tag_iter - iterate over all requests in a tag map
> + * @tags:	Tag map to iterate over.
> + * @fn:		Pointer to the function that will be called for each
> + *		request. @fn will be called as follows: @fn(rq, @priv,
> + *		reserved) where rq is a pointer to a request. 'reserved'
> + *		indicates whether or not @rq is a reserved request. Return
> + *		true to continue iterating tags, false to stop.
> + * @priv:	Will be passed as second argument to @fn.
> + *
> + * It is the caller's responsility to check rq's state in @fn.
                         ^^^^^^^^^^^^
                         responsibility?
> + */
> +void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
> +		void *priv)
> +{
> +	if (tags->nr_reserved_tags)
> +		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
> +				 true);
> +	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, true);
>  }

Same comment here: how about inserting /*reserved=*/ and
/*iterate_all=*/ comments in the bt_tags_for_each() call?

Otherwise this proposal looks good to me.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-05-09  2:09     ` Ming Lei
@ 2020-05-09  3:11       ` Bart Van Assche
  2020-05-09  3:56         ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-09  3:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 2020-05-08 19:09, Ming Lei wrote:
> On Fri, May 08, 2020 at 04:26:17PM -0700, Bart Van Assche wrote:
>> On 2020-05-04 19:09, Ming Lei wrote:
>>> @@ -391,6 +393,7 @@ struct blk_mq_ops {
>>>  enum {
>>>  	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
>>>  	BLK_MQ_F_TAG_SHARED	= 1 << 1,
>>> +	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
>>>  	BLK_MQ_F_BLOCKING	= 1 << 5,
>>>  	BLK_MQ_F_NO_SCHED	= 1 << 6,
>>>  	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
>>> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
>>> index 77d70b633531..24b3a77810b6 100644
>>> --- a/include/linux/cpuhotplug.h
>>> +++ b/include/linux/cpuhotplug.h
>>> @@ -152,6 +152,7 @@ enum cpuhp_state {
>>>  	CPUHP_AP_SMPBOOT_THREADS,
>>>  	CPUHP_AP_X86_VDSO_VMA_ONLINE,
>>>  	CPUHP_AP_IRQ_AFFINITY_ONLINE,
>>> +	CPUHP_AP_BLK_MQ_ONLINE,
>>>  	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
>>>  	CPUHP_AP_X86_INTEL_EPB_ONLINE,
>>>  	CPUHP_AP_PERF_ONLINE,
>>
>> Wouldn't BLK_MQ_F_NO_IRQ be a better name than BLK_MQ_F_NO_MANAGED_IRQ?
> 
> No, what this patchset tries to do is to address request timeout or hang
> issues in case managed IRQs are used by the blk-mq driver.

What is a managed IRQ? The following query did not produce a useful answer:

$ git grep -nHi managed.irq

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 00/11] blk-mq: improvement CPU hotplug
  2020-05-08 21:49 ` [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
@ 2020-05-09  3:17   ` Jens Axboe
  0 siblings, 0 replies; 41+ messages in thread
From: Jens Axboe @ 2020-05-09  3:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: linux-block, John Garry, Bart Van Assche, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 5/8/20 3:49 PM, Ming Lei wrote:
> On Tue, May 05, 2020 at 10:09:19AM +0800, Ming Lei wrote:
>> Hi,
>>
>> Thomas mentioned:
>>     "
>>      That was the constraint of managed interrupts from the very beginning:
>>     
>>       The driver/subsystem has to quiesce the interrupt line and the associated
>>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>>       until it's restarted by the core when the CPU is plugged in again.
>>     "
>>
>> But no drivers or blk-mq do that before one hctx becomes inactive(all
>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>>
>> This patchset tries to address the issue by two stages:
>>
>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>
>> - mark the hctx as internal stopped, and drain all in-flight requests
>> if the hctx is going to be dead.
>>
>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>>
>> - steal bios from the request, and resubmit them via generic_make_request(),
>> then these IO will be mapped to other live hctx for dispatch
>>
>> Thanks John Garry for running lots of tests on arm64 with this patchset
>> and co-working on investigating all kinds of issues.
>>
>> Thanks Christoph's review on V7 & V8.
>>
>> Please comment & review, thanks!
>>
>> https://github.com/ming1/linux/commits/v5.7-rc-blk-mq-improve-cpu-hotplug
>>
>> V10:
>> 	- fix double bio complete in request resubmission(10/11)
>> 	- add tested-by tag
>>
>> V9:
>> 	- add Reviewed-by tag
>> 	- document more on memory barrier usage between getting driver tag
>> 	and handling cpu offline(7/11)
>> 	- small code cleanup as suggested by Chritoph(7/11)
>> 	- rebase against for-5.8/block(1/11, 2/11)
>> V8:
>> 	- add patches to share code with blk_rq_prep_clone
>> 	- code re-organization as suggested by Christoph, most of them are
>> 	in 04/11, 10/11
>> 	- add reviewed-by tag
>>
>> V7:
>> 	- fix updating .nr_active in get_driver_tag
>> 	- add hctx->cpumask check in cpuhp handler
>> 	- only drain requests which tag is >= 0
>> 	- pass more aggressive cpuhotplug&io test
>>
>> V6:
>> 	- simplify getting driver tag, so that we can drain in-flight
>> 	  requests correctly without using synchronize_rcu()
>> 	- handle re-submission of flush & passthrough request correctly
>>
>> V5:
>> 	- rename BLK_MQ_S_INTERNAL_STOPPED as BLK_MQ_S_INACTIVE
>> 	- re-factor code for re-submit requests in cpu dead hotplug handler
>> 	- address requeue corner case
>>
>> V4:
>> 	- resubmit IOs in dispatch list in case that this hctx is dead 
>>
>> V3:
>> 	- re-organize patch 2 & 3 a bit for addressing Hannes's comment
>> 	- fix patch 4 for avoiding potential deadlock, as found by Hannes
>>
>> V2:
>> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
>> 	  them
>> 	- address comments from John Garry and Minwoo
>>
>> Ming Lei (11):
>>   block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone
>>   block: add helper for copying request
>>   blk-mq: mark blk_mq_get_driver_tag as static
>>   blk-mq: assign rq->tag in blk_mq_get_driver_tag
>>   blk-mq: support rq filter callback when iterating rqs
>>   blk-mq: prepare for draining IO when hctx's all CPUs are offline
>>   blk-mq: stop to handle IO and drain IO before hctx becomes inactive
>>   block: add blk_end_flush_machinery
>>   blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead
>>   blk-mq: re-submit IO in case that hctx is inactive
>>   block: deactivate hctx when the hctx is actually inactive
>>
>>  block/blk-core.c           |  27 ++-
>>  block/blk-flush.c          | 141 ++++++++++++---
>>  block/blk-mq-debugfs.c     |   2 +
>>  block/blk-mq-tag.c         |  39 ++--
>>  block/blk-mq-tag.h         |   4 +
>>  block/blk-mq.c             | 356 +++++++++++++++++++++++++++++--------
>>  block/blk-mq.h             |  22 ++-
>>  block/blk.h                |  11 +-
>>  drivers/block/loop.c       |   2 +-
>>  drivers/md/dm-rq.c         |   2 +-
>>  include/linux/blk-mq.h     |   6 +
>>  include/linux/cpuhotplug.h |   1 +
>>  12 files changed, 481 insertions(+), 132 deletions(-)
>>
>> Cc: John Garry <john.garry@huawei.com>
>> Cc: Bart Van Assche <bvanassche@acm.org>
>> Cc: Hannes Reinecke <hare@suse.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
> 
> Hi Jens,
> 
> These patches have been worked on and discussed for a while, so is there any
> chance to get them into 5.8 if there are no further comments?

Looks like we're pretty close, I'll take a closer look at v11 when
posted.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-09  2:20     ` Ming Lei
@ 2020-05-09  3:24       ` Bart Van Assche
  2020-05-09  4:10         ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-09  3:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 2020-05-08 19:20, Ming Lei wrote:
> Not sure why you mention queue freezing.

This patch series introduces a fundamental race between modifying the
hardware queue state (BLK_MQ_S_INACTIVE) and tag allocation. The only
mechanism I know of for enforcing the order in which another thread
observes writes to different memory locations without inserting a memory
barrier in the hot path is RCU (see also The RCU-barrier menagerie;
https://lwn.net/Articles/573497/). The only existing such mechanism in
the blk-mq core I know of is queue freezing. Hence my comment about
queue freezing.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs
  2020-05-09  3:08         ` Bart Van Assche
@ 2020-05-09  3:52           ` Ming Lei
  0 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-09  3:52 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Fri, May 08, 2020 at 08:08:23PM -0700, Bart Van Assche wrote:
> On 2020-05-08 19:05, Ming Lei wrote:
> > Fine, then we can save one callback, how about the following way?
> > 
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 586c9d6e904a..5e9c743d887b 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -257,6 +257,7 @@ struct bt_tags_iter_data {
> >  	busy_tag_iter_fn *fn;
> >  	void *data;
> >  	bool reserved;
> > +	bool iterate_all;
> >  };
> >  
> >  static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> > @@ -274,8 +275,10 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >  	 * test and set the bit before assining ->rqs[].
> >  	 */
> >  	rq = tags->rqs[bitnr];
> > -	if (rq && blk_mq_request_started(rq))
> > -		return iter_data->fn(rq, iter_data->data, reserved);
> > +	if (rq) {
> > +		if (iter_data->iterate_all || blk_mq_request_started(rq))
> > +			return iter_data->fn(rq, iter_data->data, reserved);
> > +	}
> 
> How about combining the two if-statements above in the following single
> if-statement:
> 
> if (rq && (iter_data->iterate_all || blk_mq_request_started(rq)))
> 	...

OK.

> 
> > @@ -321,8 +326,30 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
> >  		busy_tag_iter_fn *fn, void *priv)
> >  {
> >  	if (tags->nr_reserved_tags)
> > -		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
> > -	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
> > +		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true,
> > +				 false);
> > +	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false, false);
> > +}
> 
> How about inserting comments like /*reserved=*/ and /*iterate_all=*/ in
> the bt_tags_for_each() call in front of "false" to make these calls
> easier to read?

I think it isn't necessary, given both are self-documenting from
the names of bt_tags_for_each()'s parameters.

> 
> > +/**
> > + * blk_mq_all_tag_iter - iterate over all requests in a tag map
> > + * @tags:	Tag map to iterate over.
> > + * @fn:		Pointer to the function that will be called for each
> > + *		request. @fn will be called as follows: @fn(rq, @priv,
> > + *		reserved) where rq is a pointer to a request. 'reserved'
> > + *		indicates whether or not @rq is a reserved request. Return
> > + *		true to continue iterating tags, false to stop.
> > + * @priv:	Will be passed as second argument to @fn.
> > + *
> > + * It is the caller's responsility to check rq's state in @fn.
>                          ^^^^^^^^^^^^
>                          responsibility?

OK.

 
Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
  2020-05-09  3:11       ` Bart Van Assche
@ 2020-05-09  3:56         ` Ming Lei
  0 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-09  3:56 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Fri, May 08, 2020 at 08:11:55PM -0700, Bart Van Assche wrote:
> On 2020-05-08 19:09, Ming Lei wrote:
> > On Fri, May 08, 2020 at 04:26:17PM -0700, Bart Van Assche wrote:
> >> On 2020-05-04 19:09, Ming Lei wrote:
> >>> @@ -391,6 +393,7 @@ struct blk_mq_ops {
> >>>  enum {
> >>>  	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
> >>>  	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> >>> +	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
> >>>  	BLK_MQ_F_BLOCKING	= 1 << 5,
> >>>  	BLK_MQ_F_NO_SCHED	= 1 << 6,
> >>>  	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> >>> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> >>> index 77d70b633531..24b3a77810b6 100644
> >>> --- a/include/linux/cpuhotplug.h
> >>> +++ b/include/linux/cpuhotplug.h
> >>> @@ -152,6 +152,7 @@ enum cpuhp_state {
> >>>  	CPUHP_AP_SMPBOOT_THREADS,
> >>>  	CPUHP_AP_X86_VDSO_VMA_ONLINE,
> >>>  	CPUHP_AP_IRQ_AFFINITY_ONLINE,
> >>> +	CPUHP_AP_BLK_MQ_ONLINE,
> >>>  	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
> >>>  	CPUHP_AP_X86_INTEL_EPB_ONLINE,
> >>>  	CPUHP_AP_PERF_ONLINE,
> >>
> >> Wouldn't BLK_MQ_F_NO_IRQ be a better name than BLK_MQ_F_NO_MANAGED_IRQ?
> > 
> > No, what this patchset tries to do is to address request timeout or hang
> > issues in case managed IRQs are used by the blk-mq driver.
> 
> What is a managed IRQ? The following query did not produce a useful answer:
> 
> $ git grep -nHi managed.irq

[ming@T590 linux]$ git grep -n MANAGED ./kernel/irq
kernel/irq/chip.c:188:  IRQ_STARTUP_MANAGED,
kernel/irq/chip.c:226:  return IRQ_STARTUP_MANAGED;
kernel/irq/chip.c:271:          case IRQ_STARTUP_MANAGED:
kernel/irq/cpuhotplug.c:179:    if (!housekeeping_enabled(HK_FLAG_MANAGED_IRQ))
kernel/irq/cpuhotplug.c:182:    hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);
kernel/irq/debugfs.c:114:       BIT_MASK_DESCR(IRQD_AFFINITY_MANAGED),
kernel/irq/debugfs.c:115:       BIT_MASK_DESCR(IRQD_MANAGED_SHUTDOWN),
kernel/irq/internals.h:215:     __irqd_to_state(d) |= IRQD_MANAGED_SHUTDOWN;
kernel/irq/internals.h:220:     __irqd_to_state(d) &= ~IRQD_MANAGED_SHUTDOWN;
kernel/irq/irqdesc.c:487:                               flags = IRQD_AFFINITY_MANAGED |
kernel/irq/irqdesc.c:488:                                       IRQD_MANAGED_SHUTDOWN;
kernel/irq/manage.c:170: * AFFINITY_MANAGED flag.
kernel/irq/manage.c:241:            housekeeping_enabled(HK_FLAG_MANAGED_IRQ)) {
kernel/irq/manage.c:247:                hk_mask = housekeeping_cpumask(HK_FLAG_MANAGED_IRQ);

The exact name should be AFFINITY_MANAGED, however the name "managed
irq" has been used in commit logs and the cpu isolation code for quite a while.

OK, I will add a comment on BLK_MQ_F_NO_MANAGED_IRQ in the next version.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-09  3:24       ` Bart Van Assche
@ 2020-05-09  4:10         ` Ming Lei
  2020-05-09 14:18           ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-09  4:10 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Fri, May 08, 2020 at 08:24:44PM -0700, Bart Van Assche wrote:
> On 2020-05-08 19:20, Ming Lei wrote:
> > Not sure why you mention queue freezing.
> 
> This patch series introduces a fundamental race between modifying the
> hardware queue state (BLK_MQ_S_INACTIVE) and tag allocation. The only

Basically there are two cases:

1) setting BLK_MQ_S_INACTIVE and driver tag allocation run on the same
CPU; then we just need a compiler barrier, and this happens most of the time

2) setting BLK_MQ_S_INACTIVE and driver tag allocation run on
different CPUs; then a pair of smp_mb() is applied to avoid
reordering, and this only happens when the direct-issue process migrates.

Please take a look at the comment in this patch:

+       /*
+        * In case that direct issue IO process is migrated to other CPU
+        * which may not belong to this hctx, add one memory barrier so we
+        * can order driver tag assignment and checking BLK_MQ_S_INACTIVE.
+        * Otherwise, barrier() is enough given both setting BLK_MQ_S_INACTIVE
+        * and driver tag assignment are run on the same CPU because
+        * BLK_MQ_S_INACTIVE is only set after the last CPU of this hctx is
+        * becoming offline.
+        *
+        * Process migration might happen after the check on current processor
+        * id, smp_mb() is implied by processor migration, so no need to worry
+        * about it.
+        */
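
Schematically the pairing looks like this (a rough sketch only; the hotplug
side below is simplified, the tag side is what the patch does):

	/* cpu hotplug path: the last CPU of this hctx is going offline */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb__after_atomic();
	/* ... then iterate the tags and wait for in-flight requests to drain ... */

	/* issue path, possibly migrated to another CPU */
	if (!__blk_mq_get_driver_tag(rq))
		return false;
	smp_mb();	/* order the tag write vs. the BLK_MQ_S_INACTIVE read */
	if (test_bit(BLK_MQ_S_INACTIVE, &rq->mq_hctx->state)) {
		blk_mq_put_driver_tag(rq);
		return false;	/* will be drained/re-submitted by the cpuhp handler */
	}
	return true;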

And you may find more discussion about this topic in the following thread:

https://lore.kernel.org/linux-block/20200429134327.GC700644@T590/

> mechanism I know of for enforcing the order in which another thread
> observes writes to different memory locations without inserting a memory
> barrier in the hot path is RCU (see also The RCU-barrier menagerie;
> https://lwn.net/Articles/573497/). The only existing such mechanism in
> the blk-mq core I know of is queue freezing. Hence my comment about
> queue freezing.

You didn't explain how queue freezing is used for this issue.

We are talking about CPU hotplug vs. IO. In short, when one hctx becomes
inactive (all cpus in hctx->cpumask become offline), in-flight IO from
this hctx needs to be drained to avoid IO timeouts. Also, all requests
in this hctx's scheduler/sw queues need to be handled correctly to
avoid IO hangs.

Queue freezing can only be applied at the request queue level, and not
the hctx level. When requests can't be completed, waiting for the freeze just
hangs forever.



Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-05-05  2:09 ` [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
@ 2020-05-09 14:07   ` Bart Van Assche
  2020-05-11  2:11     ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-09 14:07 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, John Garry, Hannes Reinecke, Christoph Hellwig,
	Thomas Gleixner, Hannes Reinecke

On 2020-05-04 19:09, Ming Lei wrote:
> @@ -1373,28 +1375,16 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	int srcu_idx;
>  
[ ... ]
>  	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> -		cpu_online(hctx->next_cpu)) {
> -		printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
> -			raw_smp_processor_id(),
> -			cpumask_empty(hctx->cpumask) ? "inactive": "active");
> -		dump_stack();
> +	    cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) >=
> +	    nr_cpu_ids) {
> +		blk_mq_hctx_deactivate(hctx);
> +		return;
>  	}

The blk_mq_hctx_deactivate() function calls blk_mq_resubmit_rq()
indirectly. From blk_mq_resubmit_rq():

+  /* avoid allocation failure by clearing NOWAIT */
+  nrq = blk_get_request(rq->q, rq->cmd_flags & ~REQ_NOWAIT, flags);

blk_get_request() calls blk_mq_alloc_request(). blk_mq_alloc_request()
calls blk_queue_enter(). blk_queue_enter() waits until a queue is
unfrozen if freezing of a queue has started. As one can see freezing a
queue triggers a queue run:

void blk_freeze_queue_start(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	if (++q->mq_freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);
		mutex_unlock(&q->mq_freeze_lock);
		if (queue_is_mq(q))
			blk_mq_run_hw_queues(q, false);
	} else {
		mutex_unlock(&q->mq_freeze_lock);
	}
}

Does this mean that if queue freezing happens after hot unplugging
started that a deadlock will occur because the blk_mq_run_hw_queues()
call in blk_freeze_queue_start() will wait forever?

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-09  4:10         ` Ming Lei
@ 2020-05-09 14:18           ` Bart Van Assche
  2020-05-11  1:45             ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-09 14:18 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 2020-05-08 21:10, Ming Lei wrote:
> Queue freezing can only be applied at the request queue level, not at
> the hctx level. When requests can't be completed, waiting for the freeze
> just hangs forever.

That's indeed what I meant: freeze the entire queue instead of
introducing a new mechanism that freezes only one hardware queue at a time.

Please clarify what "when requests can't be completed" means. Are you
referring to requests that take longer than expected due to e.g. a
controller lockup or to requests that take a long time intentionally?
The former case is handled by the block layer timeout handler. I propose
to handle the latter case by introducing a new callback function pointer
in struct blk_mq_ops that aborts all outstanding requests. Request queue
freezing is such an important block layer mechanism that I think we
should require that all block drivers support freezing a request queue
in a short time.

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-09 14:18           ` Bart Van Assche
@ 2020-05-11  1:45             ` Ming Lei
  2020-05-11  3:20               ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-11  1:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Sat, May 09, 2020 at 07:18:46AM -0700, Bart Van Assche wrote:
> On 2020-05-08 21:10, Ming Lei wrote:
> > Queue freezing can only be applied at the request queue level, not at
> > the hctx level. When requests can't be completed, waiting for the freeze
> > just hangs forever.
> 
> That's indeed what I meant: freeze the entire queue instead of
> introducing a new mechanism that freezes only one hardware queue at a time.

No, the issue is exactly that one single hctx becomes inactive while the
other hctxs are still active and usable.

If the entire queue is frozen because some of its CPUs are offline, how
can userspace submit IO to this disk? Your suggestion just makes the
disk unusable, which won't be accepted.

> 
> Please clarify what "when requests can't be completed" means. Are you
> referring to requests that take longer than expected due to e.g. a
> controller lockup or to requests that take a long time intentionally?

If all CPUs in one hctx->cpumask are offline, the managed irq of this hw
queue will be shut down by the genirq code, so any in-flight IO won't be
completed or timed out after the managed irq is shut down because of CPU
offline.

Some drivers may implement a timeout handler, so these in-flight requests
will eventually be timed out, but that is still not friendly behaviour
given that the default timeout is so long.

Some drivers don't implement a timeout handler at all, so these IOs won't
be completed.
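
To make the distinction concrete, the hook being discussed is
blk_mq_ops->timeout; a minimal sketch (not taken from any real driver,
names are made up):

#include <linux/blk-mq.h>

static enum blk_eh_timer_return example_timeout(struct request *rq, bool reserved)
{
	/*
	 * A driver with proper error handling would abort the command in
	 * hardware here and then complete the request.
	 */
	blk_mq_complete_request(rq);
	return BLK_EH_DONE;
}

static const struct blk_mq_ops example_mq_ops = {
	.timeout	= example_timeout,
};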

> The former case is handled by the block layer timeout handler. I propose
> to handle the latter case by introducing a new callback function pointer
> in struct blk_mq_ops that aborts all outstanding requests.

As I mentioned, a timeout isn't friendly behavior. Also, not every
driver implements a timeout handler, or implements it well enough.

> Request queue
> freezing is such an important block layer mechanism that I think we
> should require that all block drivers support freezing a request queue
> in a short time.

Firstly, we just need to drain in-flight requests and re-submit queued
requests from one single hctx; queue-wide freezing blocks all userspace
IO unnecessarily.

Secondly, some requests may not be completed at all, so freezing can't
work because freeze_wait may hang forever.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-05-09 14:07   ` Bart Van Assche
@ 2020-05-11  2:11     ` Ming Lei
  2020-05-11  3:30       ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-11  2:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Hannes Reinecke

On Sat, May 09, 2020 at 07:07:55AM -0700, Bart Van Assche wrote:
> On 2020-05-04 19:09, Ming Lei wrote:
> > @@ -1373,28 +1375,16 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
> >  	int srcu_idx;
> >  
> [ ... ]
> >  	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> > -		cpu_online(hctx->next_cpu)) {
> > -		printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
> > -			raw_smp_processor_id(),
> > -			cpumask_empty(hctx->cpumask) ? "inactive": "active");
> > -		dump_stack();
> > +	    cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) >=
> > +	    nr_cpu_ids) {
> > +		blk_mq_hctx_deactivate(hctx);
> > +		return;
> >  	}
> 
> The blk_mq_hctx_deactivate() function calls blk_mq_resubmit_rq()
> indirectly. From blk_mq_resubmit_rq():
> 
> +  /* avoid allocation failure by clearing NOWAIT */
> +  nrq = blk_get_request(rq->q, rq->cmd_flags & ~REQ_NOWAIT, flags);
> 
> blk_get_request() calls blk_mq_alloc_request(). blk_mq_alloc_request()
> calls blk_queue_enter(). blk_queue_enter() waits until a queue is
> unfrozen if freezing of a queue has started. As one can see freezing a
> queue triggers a queue run:
> 
> void blk_freeze_queue_start(struct request_queue *q)
> {
> 	mutex_lock(&q->mq_freeze_lock);
> 	if (++q->mq_freeze_depth == 1) {
> 		percpu_ref_kill(&q->q_usage_counter);
> 		mutex_unlock(&q->mq_freeze_lock);
> 		if (queue_is_mq(q))
> 			blk_mq_run_hw_queues(q, false);
> 	} else {
> 		mutex_unlock(&q->mq_freeze_lock);
> 	}
> }
> 
> Does this mean that if queue freezing happens after hot unplugging has
> started, a deadlock will occur because the blk_mq_run_hw_queues() call
> in blk_freeze_queue_start() will wait forever?

Yes, it looks like there is such an issue.

However, the forever wait isn't new with this patch, because any queued
request (in the scheduler queue or sw queue) may not be completed after
this hctx becomes inactive.

One simple solution is to pass BLK_MQ_REQ_PREEMPT to blk_get_request()
called in blk_mq_resubmit_rq() because at that time freezing wait won't
return and it is safe to allocate a new request for completing old
requests originated from inactive hctx.

I will do it this way in V11.
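
Concretely, the allocation in blk_mq_resubmit_rq() would look something
like this (just a sketch; the wrapper name is made up):

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/* sketch only: allocate the replacement request with preempt semantics */
static struct request *example_resubmit_alloc(struct request *old)
{
	/* clear NOWAIT so the allocation may sleep instead of failing */
	return blk_get_request(old->q, old->cmd_flags & ~REQ_NOWAIT,
			       BLK_MQ_REQ_PREEMPT);
}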

Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-11  1:45             ` Ming Lei
@ 2020-05-11  3:20               ` Bart Van Assche
  2020-05-11  3:48                 ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-11  3:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 2020-05-10 18:45, Ming Lei wrote:
> On Sat, May 09, 2020 at 07:18:46AM -0700, Bart Van Assche wrote:
>> On 2020-05-08 21:10, Ming Lei wrote:
>>> Queue freezing can only be applied at the request queue level, not at
>>> the hctx level. When requests can't be completed, waiting for the freeze
>>> just hangs forever.
>>
>> That's indeed what I meant: freeze the entire queue instead of
>> introducing a new mechanism that freezes only one hardware queue at a time.
> 
> No, the issue is exactly that one single hctx becomes inactive while the
> other hctxs are still active and usable.
> 
> If the entire queue is frozen because some of its CPUs are offline, how
> can userspace submit IO to this disk? Your suggestion just makes the
> disk unusable, which won't be accepted.

What I meant is to freeze a request queue temporarily (until hot
unplugging of a CPU has finished). I would never suggest to freeze a
request queue forever and I think that you already knew that.

>> Please clarify what "when requests can't be completed" means. Are you
>> referring to requests that take longer than expected due to e.g. a
>> controller lockup or to requests that take a long time intentionally?
> 
> If all CPUs in one hctx->cpumask are offline, the managed irq of this hw
> queue will be shut down by the genirq code, so any in-flight IO won't be
> completed or timed out after the managed irq is shut down because of CPU
> offline.
> 
> Some drivers may implement a timeout handler, so these in-flight requests
> will eventually be timed out, but that is still not friendly behaviour
> given that the default timeout is so long.
> 
> Some drivers don't implement a timeout handler at all, so these IOs won't
> be completed.

I think that the block layer needs to be notified after the decision has
been taken to offline a CPU and before the interrupts associated with
that CPU are disabled. That would allow the block layer to freeze a
request queue without triggering any timeouts (ignoring block driver and
hardware bugs). I'm not familiar with CPU hotplugging so I don't know
whether or not such a mechanism already exists.

>> The former case is handled by the block layer timeout handler. I propose
>> to handle the latter case by introducing a new callback function pointer
>> in struct blk_mq_ops that aborts all outstanding requests.
> 
> As I mentioned, a timeout isn't friendly behavior. Also, not every
> driver implements a timeout handler, or implements it well enough.

What I propose is to fix those block drivers instead of complicating the
block layer core further and instead of introducing potential deadlocks
in the block layer core.

>> Request queue
>> freezing is such an important block layer mechanism that I think we
>> should require that all block drivers support freezing a request queue
>> in a short time.
> 
> > Firstly, we just need to drain in-flight requests and re-submit queued
> > requests from one single hctx; queue-wide freezing blocks all userspace
> > IO unnecessarily.

Freezing a request queue for a short time is acceptable. As you know we
already do that when the queue depth is modified, when the write-back
throttling latency is modified and also when the I/O scheduler is changed.

> Secondly, some requests may not be completed at all, so freezing can't
> work because freeze_wait may hang forever.

If a request neither can be aborted nor completes then that's a severe
bug in the block driver that submitted the request to the block device.

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-05-11  2:11     ` Ming Lei
@ 2020-05-11  3:30       ` Bart Van Assche
  2020-05-11  4:08         ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-11  3:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Hannes Reinecke

On 2020-05-10 19:11, Ming Lei wrote:
> One simple solution is to pass BLK_MQ_REQ_PREEMPT to blk_get_request()
> called in blk_mq_resubmit_rq() because at that time freezing wait won't
> return and it is safe to allocate a new request for completing old
> requests originated from inactive hctx.

I don't think that will help. Freezing a request queue starts with a
call of this function:

void blk_freeze_queue_start(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	if (++q->mq_freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);
		mutex_unlock(&q->mq_freeze_lock);
		if (queue_is_mq(q))
			blk_mq_run_hw_queues(q, false);
	} else {
		mutex_unlock(&q->mq_freeze_lock);
	}
}

From blk_queue_enter():

	const bool pm = flags & BLK_MQ_REQ_PREEMPT;
	[ ... ]
	if (percpu_ref_tryget_live(&q->q_usage_counter)) {
		/*
		 * The code that increments the pm_only counter is
		 * responsible for ensuring that that counter is
		 * globally visible before the queue is unfrozen.
		 */
		if (pm || !blk_queue_pm_only(q)) {
			success = true;
		} else {
			percpu_ref_put(&q->q_usage_counter);
		}
	}

In other words, setting the BLK_MQ_REQ_PREEMPT flag only makes a
difference if blk_queue_pm_only(q) == true. Freezing a request queue
involves calling percpu_ref_kill(&q->q_usage_counter). That causes all
future percpu_ref_tryget_live() calls to return false until the queue
has been unfrozen.

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-11  3:20               ` Bart Van Assche
@ 2020-05-11  3:48                 ` Ming Lei
  2020-05-11 20:56                   ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-11  3:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Sun, May 10, 2020 at 08:20:24PM -0700, Bart Van Assche wrote:
> On 2020-05-10 18:45, Ming Lei wrote:
> > On Sat, May 09, 2020 at 07:18:46AM -0700, Bart Van Assche wrote:
> >> On 2020-05-08 21:10, Ming Lei wrote:
> >>> Queue freezing can only be applied at the request queue level, not at
> >>> the hctx level. When requests can't be completed, waiting for the freeze
> >>> just hangs forever.
> >>
> >> That's indeed what I meant: freeze the entire queue instead of
> >> introducing a new mechanism that freezes only one hardware queue at a time.
> > 
> > No, the issue is exactly that one single hctx becomes inactive while the
> > other hctxs are still active and usable.
> > 
> > If the entire queue is frozen because some of its CPUs are offline, how
> > can userspace submit IO to this disk? Your suggestion just makes the
> > disk unusable, which won't be accepted.
> 
> What I meant is to freeze a request queue temporarily (until hot
> unplugging of a CPU has finished). I would never suggest to freeze a
> request queue forever and I think that you already knew that.

But what is your motivation to freeze queue temporarily?

I don't see how freezing the queue helps with this issue. Also, even
though it is temporary, the impact on IO is still observable from the
other online CPUs.

If you want to block new allocations from the inactive hctx, that isn't
necessary, because basically no new allocation is possible once all
CPUs of this hctx are offline.

If you want to wait for completion of in-flight requests, that isn't
doable, because requests may not be completed at all once the hctx
becomes inactive and its managed interrupt is shut down.
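
For background, "managed" here means vectors allocated with
PCI_IRQ_AFFINITY; roughly how a driver ends up with such interrupts
(the parameters below are made up):

#include <linux/pci.h>
#include <linux/interrupt.h>

static int example_setup_queue_irqs(struct pci_dev *pdev, unsigned int nr_queues)
{
	struct irq_affinity affd = { .pre_vectors = 1 };	/* e.g. one admin vector */

	/* genirq shuts a managed vector down once all CPUs in its mask go offline */
	return pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
}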

> 
> >> Please clarify what "when requests can't be completed" means. Are you
> >> referring to requests that take longer than expected due to e.g. a
> >> controller lockup or to requests that take a long time intentionally?
> > 
> > If all CPUs in one hctx->cpumask are offline, the managed irq of this hw
> > queue will be shut down by the genirq code, so any in-flight IO won't be
> > completed or timed out after the managed irq is shut down because of CPU
> > offline.
> > 
> > Some drivers may implement a timeout handler, so these in-flight requests
> > will eventually be timed out, but that is still not friendly behaviour
> > given that the default timeout is so long.
> > 
> > Some drivers don't implement a timeout handler at all, so these IOs won't
> > be completed.
> 
> I think that the block layer needs to be notified after the decision has

I have added a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE, for getting this
notification, and blk_mq_hctx_notify_online() will be called before the
CPU is put offline.

> been taken to offline a CPU and before the interrupts associated with
> that CPU are disabled. That would allow the block layer to freeze a
> request queue without triggering any timeouts (ignoring block driver and
> hardware bugs). I'm not familiar with CPU hotplugging so I don't know
> whether or not such a mechanism already exists.

How can freezing the queue avoid triggering timeouts?

Freezing a queue basically blocks new request allocation and then waits
for completion of all in-flight requests. As I explained, either there
is no new allocation on this inactive hctx anyway, or the in-flight
requests won't be completed without this patch's solution.
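
For reference, blk_mq_freeze_queue() boils down to the two steps below;
this is just a condensed view of the existing code:

void blk_mq_freeze_queue(struct request_queue *q)
{
	blk_freeze_queue_start(q);	/* kill q_usage_counter: no new requests */
	blk_mq_freeze_queue_wait(q);	/* wait until ALL in-flight requests finish */
}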

> 
> >> The former case is handled by the block layer timeout handler. I propose
> >> to handle the latter case by introducing a new callback function pointer
> >> in struct blk_mq_ops that aborts all outstanding requests.
> > 
> > As I mentioned, a timeout isn't friendly behavior. Also, not every
> > driver implements a timeout handler, or implements it well enough.
> 
> What I propose is to fix those block drivers instead of complicating the
> block layer core further and instead of introducing potential deadlocks
> in the block layer core.

The deadlock you mentioned can be fixed with the help of BLK_MQ_REQ_PREEMPT.

> 
> >> Request queue
> >> freezing is such an important block layer mechanism that I think we
> >> should require that all block drivers support freezing a request queue
> >> in a short time.
> > 
> > Firstly, we just need to drain in-flight requests and re-submit queued
> > requests from one single hctx; queue-wide freezing blocks all userspace
> > IO unnecessarily.
> 
> Freezing a request queue for a short time is acceptable. As you know we
> already do that when the queue depth is modified, when the write-back
> throttling latency is modified and also when the I/O scheduler is changed.

Again, how can freezing the queue help with the issue addressed by this patchset?

> 
> > Secondly, some requests may not be completed at all, so freezing can't
> > work because freeze_wait may hang forever.
> 
> If a request neither can be aborted nor completes then that's a severe
> bug in the block driver that submitted the request to the block device.

It is hard to implement a timeout handler for every driver, or to remove
every BLK_EH_RESET_TIMER return from drivers.

Even for drivers which implement a timeout handler elegantly, it isn't
friendly to wait several dozen seconds, or more than one hundred seconds,
for IO completion during CPU hotplug. Who said that IO timeouts have to
be triggered during CPU hotplug? At least there is no such issue with
non-managed interrupts.



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-05-11  3:30       ` Bart Van Assche
@ 2020-05-11  4:08         ` Ming Lei
  2020-05-11 20:52           ` Bart Van Assche
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2020-05-11  4:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Hannes Reinecke

On Sun, May 10, 2020 at 08:30:29PM -0700, Bart Van Assche wrote:
> On 2020-05-10 19:11, Ming Lei wrote:
> > One simple solution is to pass BLK_MQ_REQ_PREEMPT to blk_get_request()
> > called in blk_mq_resubmit_rq() because at that time freezing wait won't
> > return and it is safe to allocate a new request for completing old
> > requests originated from inactive hctx.
> 
> I don't think that will help. Freezing a request queue starts with a
> call of this function:
> 
> void blk_freeze_queue_start(struct request_queue *q)
> {
> 	mutex_lock(&q->mq_freeze_lock);
> 	if (++q->mq_freeze_depth == 1) {
> 		percpu_ref_kill(&q->q_usage_counter);
> 		mutex_unlock(&q->mq_freeze_lock);
> 		if (queue_is_mq(q))
> 			blk_mq_run_hw_queues(q, false);
> 	} else {
> 		mutex_unlock(&q->mq_freeze_lock);
> 	}
> }
> 
> From blk_queue_enter():
> 
> 	const bool pm = flags & BLK_MQ_REQ_PREEMPT;
> 	[ ... ]
> 	if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> 		/*
> 		 * The code that increments the pm_only counter is
> 		 * responsible for ensuring that that counter is
> 		 * globally visible before the queue is unfrozen.
> 		 */
> 		if (pm || !blk_queue_pm_only(q)) {
> 			success = true;
> 		} else {
> 			percpu_ref_put(&q->q_usage_counter);
> 		}
> 	}
> 
> In other words, setting the BLK_MQ_REQ_PREEMPT flag only makes a
> difference if blk_queue_pm_only(q) == true. Freezing a request queue
> involves calling percpu_ref_kill(&q->q_usage_counter). That causes all
> future percpu_ref_tryget_live() calls to return false until the queue
> has been unfrozen.

OK, just forgot the whole story, but the issue can be fixed quite easily
by adding a new request allocation flag in slow path, see the following
patch:

diff --git a/block/blk-core.c b/block/blk-core.c
index ec50d7e6be21..d743be1b45a2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -418,6 +418,11 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		if (success)
 			return 0;
 
+		if (flags & BLK_MQ_REQ_FORCE) {
+			percpu_ref_get(ref);
+			return 0;
+		}
+
 		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EBUSY;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c2ea0a6e5b56..2816886d0bea 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -448,6 +448,13 @@ enum {
 	BLK_MQ_REQ_INTERNAL	= (__force blk_mq_req_flags_t)(1 << 2),
 	/* set RQF_PREEMPT */
 	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
+
+	/*
+	 * force to allocate request and caller has to make sure queue
+	 * won't be frozen completely during allocation, and this flag
+	 * is only applied after queue freeze is started
+	 */
+	BLK_MQ_REQ_FORCE	= (__force blk_mq_req_flags_t)(1 << 4),
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
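
With the above applied, the resubmit path would do something like this
(illustrative only, not part of the diff):

/*
 * Illustrative helper for blk_mq_resubmit_rq(): the old request is still
 * pending, so the queue cannot finish freezing yet and BLK_MQ_REQ_FORCE
 * is safe to use here.
 */
static struct request *example_resubmit_alloc_force(struct request *old)
{
	return blk_get_request(old->q, old->cmd_flags & ~REQ_NOWAIT,
			       BLK_MQ_REQ_FORCE);
}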



thanks, 
Ming


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-05-11  4:08         ` Ming Lei
@ 2020-05-11 20:52           ` Bart Van Assche
  2020-05-12  1:43             ` Ming Lei
  2020-05-12  2:08             ` Ming Lei
  0 siblings, 2 replies; 41+ messages in thread
From: Bart Van Assche @ 2020-05-11 20:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Hannes Reinecke

On 2020-05-10 21:08, Ming Lei wrote:
> OK, just forgot the whole story, but the issue can be fixed quite easily
> by adding a new request allocation flag in slow path, see the following
> patch:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ec50d7e6be21..d743be1b45a2 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -418,6 +418,11 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  		if (success)
>  			return 0;
>  
> +		if (flags & BLK_MQ_REQ_FORCE) {
> +			percpu_ref_get(ref);
> +			return 0;
> +		}
> +
>  		if (flags & BLK_MQ_REQ_NOWAIT)
>  			return -EBUSY;
>  
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index c2ea0a6e5b56..2816886d0bea 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -448,6 +448,13 @@ enum {
>  	BLK_MQ_REQ_INTERNAL	= (__force blk_mq_req_flags_t)(1 << 2),
>  	/* set RQF_PREEMPT */
>  	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
> +
> +	/*
> +	 * force to allocate request and caller has to make sure queue
> +	 * won't be frozen completely during allocation, and this flag
> +	 * is only applied after queue freeze is started
> +	 */
> +	BLK_MQ_REQ_FORCE	= (__force blk_mq_req_flags_t)(1 << 4),
>  };
>  
>  struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,

I'm not sure that introducing such a flag is a good idea. After
blk_mq_freeze_queue() has made it clear that a request queue must be
frozen and before the request queue is really frozen, an RCU grace
period must expire. Otherwise it cannot be guaranteed that the intention
to freeze a request queue (by calling percpu_ref_kill()) has been
observed by all potential blk_queue_enter() callers (blk_queue_enter()
calls percpu_ref_tryget_live()). Not introducing any new race conditions
would either require to introduce an smp_mb() call in blk_queue_enter()
or to let another RCU grace period expire after the last allocation of a
request with BLK_MQ_REQ_FORCE and before the request queue is really frozen.

Serializing hardware queue quiescing and request queue freezing is
probably a much simpler solution. I'm not sure of this but maybe holding
the mq_freeze_lock mutex around hardware queue quiescing is sufficient.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-11  3:48                 ` Ming Lei
@ 2020-05-11 20:56                   ` Bart Van Assche
  2020-05-12  1:25                     ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: Bart Van Assche @ 2020-05-11 20:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On 2020-05-10 20:48, Ming Lei wrote:
> On Sun, May 10, 2020 at 08:20:24PM -0700, Bart Van Assche wrote:
>> What I meant is to freeze a request queue temporarily.
> 
> But what is your motivation to freeze queue temporarily?

To achieve a solution for CPU hotplugging that is much simpler than this
patch series, requires less code and hence is easier to test, debug and
maintain.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive
  2020-05-11 20:56                   ` Bart Van Assche
@ 2020-05-12  1:25                     ` Ming Lei
  0 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-12  1:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner

On Mon, May 11, 2020 at 01:56:49PM -0700, Bart Van Assche wrote:
> On 2020-05-10 20:48, Ming Lei wrote:
> > On Sun, May 10, 2020 at 08:20:24PM -0700, Bart Van Assche wrote:
> >> What I meant is to freeze a request queue temporarily.
> > 
> > But what is your motivation to freeze queue temporarily?
> 
> To achieve a solution for CPU hotplugging that is much simpler than this
> patch series, requires less code and hence is easier to test, debug and
> maintain.

Yeah, it can be done by queue freezing in the following way:

1) before one CPU goes offline

- if one hctx becomes inactive, freeze the whole queue and wait for the
freeze to complete

2) after the CPU is offline

- unfreeze the request queue if any hctx is inactive

The whole disk becomes unusable during that period, which can be quite
long, because freezing the queue and waiting for the freeze to complete
takes at least one RCU grace period even when there isn't any in-flight
IO.

And the above steps need to be run for every request queue in a
serialized way, so the total time IO is suspended can be very long.
That isn't reasonable.
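
In code, that alternative amounts to the following around each
CPU-offline transition (sketch only):

/* sketch of the queue-wide alternative; it stalls ALL I/O to the disk */
static void example_freeze_around_cpu_offline(struct request_queue *q)
{
	/* 1) before the CPU goes offline: waits at least one RCU grace
	 *    period even when nothing is in flight */
	blk_mq_freeze_queue(q);

	/* 2) after the CPU is offline */
	blk_mq_unfreeze_queue(q);
}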

Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-05-11 20:52           ` Bart Van Assche
@ 2020-05-12  1:43             ` Ming Lei
  2020-05-12  2:08             ` Ming Lei
  1 sibling, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-12  1:43 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Hannes Reinecke

On Mon, May 11, 2020 at 01:52:14PM -0700, Bart Van Assche wrote:
> On 2020-05-10 21:08, Ming Lei wrote:
> > OK, just forgot the whole story, but the issue can be fixed quite easily
> > by adding a new request allocation flag in slow path, see the following
> > patch:
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index ec50d7e6be21..d743be1b45a2 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -418,6 +418,11 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> >  		if (success)
> >  			return 0;
> >  
> > +		if (flags & BLK_MQ_REQ_FORCE) {
> > +			percpu_ref_get(ref);
> > +			return 0;
> > +		}
> > +
> >  		if (flags & BLK_MQ_REQ_NOWAIT)
> >  			return -EBUSY;
> >  
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index c2ea0a6e5b56..2816886d0bea 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -448,6 +448,13 @@ enum {
> >  	BLK_MQ_REQ_INTERNAL	= (__force blk_mq_req_flags_t)(1 << 2),
> >  	/* set RQF_PREEMPT */
> >  	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
> > +
> > +	/*
> > +	 * force to allocate request and caller has to make sure queue
> > +	 * won't be frozen completely during allocation, and this flag
> > +	 * is only applied after queue freeze is started
> > +	 */
> > +	BLK_MQ_REQ_FORCE	= (__force blk_mq_req_flags_t)(1 << 4),
> >  };
> >  
> >  struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
> 
> I'm not sure that introducing such a flag is a good idea. After
> blk_mq_freeze_queue() has made it clear that a request queue must be

BLK_MQ_REQ_FORCE is only allowed to be used when the caller guarantees
that the queue's freezing hasn't completed, here because there is a
pending request which can't be completed. So blk_mq_freeze_queue() won't
complete while there is a request allocated with BLK_MQ_REQ_FORCE. And
once blk_mq_freeze_queue() has completed, there can't be any such
allocation any more.

> frozen and before the request queue is really frozen, an RCU grace
> period must expire. Otherwise it cannot be guaranteed that the intention
> to freeze a request queue (by calling percpu_ref_kill()) has been
> observed by all potential blk_queue_enter() callers (blk_queue_enter()
> calls percpu_ref_tryget_live()). Not introducing any new race conditions
> would either require to introduce an smp_mb() call in blk_queue_enter()

percpu_ref_tryget_live() returns false when the percpu refcount is dead,
or when atomic_read() returns zero in atomic mode.

BLK_MQ_REQ_FORCE is only allowed to be used when the caller guarantees
that the queue's freezing hasn't completed. So if the refcount is still
in percpu mode, true is always returned; otherwise false is always
returned. So there isn't any race. And no smp_mb() is required either,
because no new global memory read/write is introduced by this patch.

> or to let another RCU grace period expire after the last allocation of a
> request with BLK_MQ_REQ_FORCE and before the request queue is really frozen.

Not at all. As I explained, allocation with BLK_MQ_REQ_FORCE only
happens when the actual refcount is > 1, because we know there is a
pending request which can't be completed and which relies on the new
allocation with BLK_MQ_REQ_FORCE. So after the new request is allocated,
blk_mq_freeze_queue_wait() will still wait until the actual refcount
drops to 0.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive
  2020-05-11 20:52           ` Bart Van Assche
  2020-05-12  1:43             ` Ming Lei
@ 2020-05-12  2:08             ` Ming Lei
  1 sibling, 0 replies; 41+ messages in thread
From: Ming Lei @ 2020-05-12  2:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, John Garry, Hannes Reinecke,
	Christoph Hellwig, Thomas Gleixner, Hannes Reinecke

On Mon, May 11, 2020 at 01:52:14PM -0700, Bart Van Assche wrote:
> On 2020-05-10 21:08, Ming Lei wrote:
> > OK, just forgot the whole story, but the issue can be fixed quite easily
> > by adding a new request allocation flag in slow path, see the following
> > patch:
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index ec50d7e6be21..d743be1b45a2 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -418,6 +418,11 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> >  		if (success)
> >  			return 0;
> >  
> > +		if (flags & BLK_MQ_REQ_FORCE) {
> > +			percpu_ref_get(ref);
> > +			return 0;
> > +		}
> > +
> >  		if (flags & BLK_MQ_REQ_NOWAIT)
> >  			return -EBUSY;
> >  
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index c2ea0a6e5b56..2816886d0bea 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -448,6 +448,13 @@ enum {
> >  	BLK_MQ_REQ_INTERNAL	= (__force blk_mq_req_flags_t)(1 << 2),
> >  	/* set RQF_PREEMPT */
> >  	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
> > +
> > +	/*
> > +	 * force to allocate request and caller has to make sure queue
> > +	 * won't be frozen completely during allocation, and this flag
> > +	 * is only applied after queue freeze is started
> > +	 */
> > +	BLK_MQ_REQ_FORCE	= (__force blk_mq_req_flags_t)(1 << 4),
> >  };
> >  
> >  struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
> 
> I'm not sure that introducing such a flag is a good idea. After
> blk_mq_freeze_queue() has made it clear that a request queue must be
> frozen and before the request queue is really frozen, an RCU grace
> period must expire. Otherwise it cannot be guaranteed that the intention
> to freeze a request queue (by calling percpu_ref_kill()) has been
> observed by all potential blk_queue_enter() callers (blk_queue_enter()
> calls percpu_ref_tryget_live()). Not introducing any new race conditions
> would either require to introduce an smp_mb() call in blk_queue_enter()
> or to let another RCU grace period expire after the last allocation of a
> request with BLK_MQ_REQ_FORCE and before the request queue is really frozen.

Actually neither smp_mb() nor an extra grace period is needed, which can
be explained simply as follows:

What BLK_MQ_REQ_FORCE introduces is a percpu_ref_get() ->
percpu_ref_get_many() call.

When percpu_ref_get() is called:

- if the ref is still in percpu mode, the get is covered by the RCU grace
period in percpu_ref_kill_and_confirm().

- otherwise, the refcount is grabbed in atomic mode, and no extra smp_mb()
or RCU grace period is required because we guarantee that the atomic count
is > 1 when calling percpu_ref_get(). And blk_mq_freeze_queue_wait() will
observe the correct value of this atomic refcount.

percpu_ref_get() is documented as :

 * This function is safe to call as long as @ref is between init and exit.
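
For reference, percpu_ref_get() is roughly the following (simplified from
include/linux/percpu-refcount.h):

static inline void percpu_ref_get(struct percpu_ref *ref)
{
	unsigned long __percpu *percpu_count;

	rcu_read_lock();
	if (__ref_is_percpu(ref, &percpu_count))
		this_cpu_inc(*percpu_count);	/* percpu mode */
	else
		atomic_long_inc(&ref->count);	/* atomic mode */
	rcu_read_unlock();
}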


Thanks,
Ming


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2020-05-12  2:08 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-05  2:09 [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
2020-05-05  2:09 ` [PATCH V10 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
2020-05-05  2:09 ` [PATCH V10 02/11] block: add helper for copying request Ming Lei
2020-05-05  2:09 ` [PATCH V10 03/11] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
2020-05-05  2:09 ` [PATCH V10 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
2020-05-05  2:09 ` [PATCH V10 05/11] blk-mq: support rq filter callback when iterating rqs Ming Lei
2020-05-08 23:32   ` Bart Van Assche
2020-05-09  0:18     ` Bart Van Assche
2020-05-09  2:05       ` Ming Lei
2020-05-09  3:08         ` Bart Van Assche
2020-05-09  3:52           ` Ming Lei
2020-05-05  2:09 ` [PATCH V10 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
2020-05-05  6:14   ` Hannes Reinecke
2020-05-08 23:26   ` Bart Van Assche
2020-05-09  2:09     ` Ming Lei
2020-05-09  3:11       ` Bart Van Assche
2020-05-09  3:56         ` Ming Lei
2020-05-05  2:09 ` [PATCH V10 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
2020-05-08 23:39   ` Bart Van Assche
2020-05-09  2:20     ` Ming Lei
2020-05-09  3:24       ` Bart Van Assche
2020-05-09  4:10         ` Ming Lei
2020-05-09 14:18           ` Bart Van Assche
2020-05-11  1:45             ` Ming Lei
2020-05-11  3:20               ` Bart Van Assche
2020-05-11  3:48                 ` Ming Lei
2020-05-11 20:56                   ` Bart Van Assche
2020-05-12  1:25                     ` Ming Lei
2020-05-05  2:09 ` [PATCH V10 08/11] block: add blk_end_flush_machinery Ming Lei
2020-05-05  2:09 ` [PATCH V10 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead Ming Lei
2020-05-05  2:09 ` [PATCH V10 10/11] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
2020-05-05  2:09 ` [PATCH V10 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
2020-05-09 14:07   ` Bart Van Assche
2020-05-11  2:11     ` Ming Lei
2020-05-11  3:30       ` Bart Van Assche
2020-05-11  4:08         ` Ming Lei
2020-05-11 20:52           ` Bart Van Assche
2020-05-12  1:43             ` Ming Lei
2020-05-12  2:08             ` Ming Lei
2020-05-08 21:49 ` [PATCH V10 00/11] blk-mq: improvement CPU hotplug Ming Lei
2020-05-09  3:17   ` Jens Axboe
